At Bastian Solutions, we continually strive to provide superb customer service at the highest standard, offering not only reactive but proactive support. Transparency is a key component in achieving this objective, and with that in mind we are providing a full root cause analysis of the disruptions in service on May 17-18, 2021, described below.
Summary:
8:00 AM EDT AdoreMe called in and referred to existing ticket 165233, requesting that we assist them with re-importing a set of orders. Sebastian Lynn, Support Specialist I, received this call as well as an Excel spreadsheet containing the list of orders to re-import.
8:30 AM EDT Sebastian sent an email from the ticket to confirm whether the AdoreMe team had attempted to rename the orders so that the correct versions could be imported.
9:15 AM EDT Sebastian proceeded to remove the orders manually from the database. After running the SQL statement to delete from the order_sub_line table, Sebastian noticed that an abnormally large number of records had been removed and began requesting assistance from senior support resources. At this time, operators at the Autostore stopped receiving new work.
9:45 AM EDT Owen Graham, Support Analyst I, was formally engaged as the escalation resource for the issue. Owen verified that the deleted records were present in the delta schema tables and began working on the process to restore these records into production.
10:30 AM EDT Owen made multiple attempts to restore the data but was unsuccessful. During this period, it was also observed that the original delete statement should not have executed at all, as the columns it referenced did not exist. Owen engaged a Bastian DBA for additional assistance.
12:00 PM EDT Justin Compton, Database Analyst, identified that the restoration of these records was being blocked by conflicts with inventory records that had already been reassigned as the system attempted to recreate the deleted sublines. Justin created a script to reinsert the records and reconcile these conflicts at the same time and began running it shortly before noon. By 12:15 PM this process was around 80% complete and the Autostore was receiving work again.
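While the actual restore script is not reproduced here, a restore-and-reconcile step of this kind might look like the following sketch; the delta schema name, column list, and inventory table are assumptions used only for illustration.

    -- Illustrative sketch only; the delta schema, column list, and inventory
    -- table are assumptions, not the actual script that was run.
    -- Reinsert deleted sublines from the delta (audit) copy, skipping rows that
    -- were already recreated or whose inventory has been reassigned elsewhere.
    INSERT INTO dbo.order_sub_line (order_sub_line_id, order_id, sku, qty, container_id)
    SELECT d.order_sub_line_id, d.order_id, d.sku, d.qty, d.container_id
    FROM delta.order_sub_line AS d
    WHERE NOT EXISTS (SELECT 1
                      FROM dbo.order_sub_line AS p
                      WHERE p.order_sub_line_id = d.order_sub_line_id)   -- not already recreated
      AND NOT EXISTS (SELECT 1
                      FROM dbo.inventory AS i
                      WHERE i.order_sub_line_id = d.order_sub_line_id
                        AND i.order_id <> d.order_id);                   -- inventory not reassigned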
1:00 PM EDT AdoreMe requested a post-mortem analysis and noted that some issues were still lingering. Travis Coleman, Software Support Manager, Owen, and Justin analyzed the original delete query and identified the behavior at fault: SQL Server will process a DELETE statement whose WHERE clause references, inside an IN subquery, a column not found in the subquery's temp table. Rather than raising an error, the column name resolves against the outer table, so the filter matches every row, effectively returning all records in the table.
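For illustration, the pattern that produces this outcome might look like the sketch below; the temp table and column names are assumptions, not the actual query that was executed.

    -- Illustrative sketch only; #orders_to_remove and its column are assumptions.
    -- The temp table holds the order list, but does NOT contain order_sub_line_id.
    CREATE TABLE #orders_to_remove (order_id INT);
    INSERT INTO #orders_to_remove VALUES (1001), (1002);

    -- Intended: delete only the sublines belonging to the listed orders.
    -- Actual: order_sub_line_id does not exist in #orders_to_remove, so SQL Server
    -- resolves it against the outer order_sub_line table. The IN test is then true
    -- for every row, and the statement deletes the entire table.
    DELETE FROM order_sub_line
    WHERE order_sub_line_id IN (SELECT order_sub_line_id FROM #orders_to_remove);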
2:00 PM EDT Sebastian continued to work on the remaining issues with waves not completing but was unsuccessful. At 1:50 PM Sebastian re-escalated the issue and engaged Barry Stone, Support Analyst I, as the new escalation resource.
4:00 PM EDT Barry and Sebastian first worked on resolving the issues with waves not completing. One of the waves had been force-closed; the remaining wave had an invalid order line on the demand order, assigned to a container not found in Autostore. This line was canceled and the wave completed. Tier 1 then began to investigate the issue of lines receiving batch number -99, which had begun to occur at this time.
5:30 PM EDT Sebastian coordinated a restart of the Autostore system.
6:00 PM EDT The issue was transitioned to Andrew Lynch, Support Specialist II. Andrew re-engaged the on-call Support Analyst, Owen Graham, for assistance. Andrew established that after the Autostore restart the system was still not producing any work. At this time AdoreMe also reported errors in Putwall/SureSort. After requesting guidance from the site on what to prioritize, the support team decided to focus on the lack of work at Autostore.
6:45 PM EDT Andrew and Owen determined that the reason Autostore was not getting any work was that all work in the system had been assigned an invalid batch number (Autostore task group number) of -99. Owen attempted to rebatch these manually but found that the Exacta PreProcessor service reassigned batch -99 each time. Owen reached out to engage Development at this time.
9:00 PM EDT Owen and Andrew worked with developers to get correct batch numbers assigned to the work, but these numbers were rejected by Autostore and new numbers were reassigned. Shortly after 8:00 PM, Development identified that the Autostore was rejecting the tasks because location data was missing on one or more lines associated with the transport container, rendering these lines invalid. Support then began the task of manually identifying and canceling these lines in SQL. At 8:50 PM the first new batch was successfully processed by Autostore, and by 9:05 PM all the work had been rebatched and the Autostore was running again.
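The clean-up query itself is not preserved in the ticket, but identifying the invalid lines likely took a shape similar to the sketch below; the column names and status value are assumptions.

    -- Illustrative sketch only; column names and the status value are assumptions.
    -- Find order lines tied to a transport container but missing location data,
    -- which Autostore rejects as invalid.
    SELECT osl.order_sub_line_id, osl.container_id
    FROM order_sub_line AS osl
    WHERE osl.container_id IS NOT NULL
      AND osl.location_id IS NULL;

    -- After review, the affected lines were canceled (not deleted), e.g.:
    -- UPDATE order_sub_line SET status = 'CANCELED'
    -- WHERE order_sub_line_id IN (<reviewed list of line IDs>);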
9:45 PM EDT Bastian Support began investigating the errors reported at SureSort and Putwall. An AdoreMe employee cleared locations on the SureSort and reported no further errors. Bastian restarted the Putwall service and the problematic light behaviors reported at Putwall stopped occurring. At 9:45 PM the site confirmed they were up and running.
MAY 18th, 10:30 AM EDT Tyler Ryan, Support Specialist I, received a call from AdoreMe reporting scanning errors affecting multiple products at Putwall.
11:40 AM EDT Tyler escalated to Andy Downey, Support Analyst I, for assistance with resolving the issue. Andy determined that the errors were caused by discrepancies between customer order status and the status of the demand order.
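A discrepancy of that kind can typically be surfaced with a comparison query along these lines; the customer_order and demand_order table and column names are assumptions used only to illustrate the check.

    -- Illustrative sketch only; table and column names are assumptions.
    -- List orders whose customer-facing status disagrees with the demand order status.
    SELECT co.order_id, co.status AS customer_order_status, dem.status AS demand_order_status
    FROM customer_order AS co
    JOIN demand_order AS dem
      ON dem.order_id = co.order_id
    WHERE co.status <> dem.status;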
1:30 PM EDT Andy consulted with the other resources that had been involved and determined that a small number of subline records had not been restored during the main restoration performed by Justin the previous day.
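Locating those residual rows is essentially a comparison between the delta copy and production; a minimal sketch, again using assumed schema names, is shown below.

    -- Illustrative sketch only; schema and column names are assumptions.
    -- Sublines present in the delta (audit) copy but absent from production
    -- are the rows missed by the earlier restoration.
    SELECT d.order_sub_line_id
    FROM delta.order_sub_line AS d
    LEFT JOIN dbo.order_sub_line AS p
           ON p.order_sub_line_id = d.order_sub_line_id
    WHERE p.order_sub_line_id IS NULL;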
2:50-5:50 PM EDT Andrew engaged DBA resources to further troubleshoot and locate additional records to recover affecting Putwall 4.
Root Cause:
An excessive number of records was removed from the order_sub_line table while deleting order information at AdoreMe's request. The query's IN subquery referenced a column that did not exist in the temp table it selected from; rather than raising an error, SQL Server resolved the column against the outer table, so the statement matched, and deleted, all rows in the table.
Resolution:
The issues caused by the delete were resolved through manual recovery of the deleted records from the delta schema tables by Bastian's DBA.
Next Steps and Preventive Actions:
Implemented a policy to prevent Bastian team members from using a subquery as part of an update.
Provided additional training to Bastian employees regarding SQL updates and deletes.
Implemented a policy to ensure all queries that join multiple tables or are intended to update multiple records are peer-reviewed by a T2 or higher-level resource (a safer delete pattern is sketched below).
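As an example of the safer pattern that this training and peer review are meant to enforce, a bulk delete might be written as sketched below: every column in the subquery is alias-qualified, so a mistyped or missing column raises a binding error instead of silently matching all rows, and the row count is verified inside a transaction before committing. The table names reuse the hypothetical examples above and are not the actual schema.

    -- Illustrative sketch only; table names and the expected row count are assumptions.
    BEGIN TRANSACTION;

    DELETE osl
    FROM order_sub_line AS osl
    WHERE osl.order_id IN (
        SELECT t.order_id               -- alias-qualified: a wrong column name here
        FROM #orders_to_remove AS t     -- fails with a binding error instead of matching all rows
    );

    -- Verify the number of affected rows before making the delete permanent.
    IF @@ROWCOUNT > 500   -- expected maximum for the request being handled
        ROLLBACK TRANSACTION;
    ELSE
        COMMIT TRANSACTION;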
Our commitment to you:
Bastian Solutions understands the impact of the disruption that occurred and affected operations for your organization. Providing our clients with superb customer service is our primary objective, and we assure you that we are taking the required preventive measures to prevent a recurrence.