At Bastian Solutions, we continually strive to provide superb customer service at the highest standard, with support that is proactive as well as reactive. Transparency is a key component in achieving this objective, and with that in mind, we are providing a full root cause analysis for the disruptions in service on August 22nd – August 23rd, 2019, described below.
Summary:
Puma frequently calls about various issues such as “slowness,” “sushi loop induction,” and “order failing to import,” which Chase Copley and Nick VanWallaghen discuss with them regularly. This document does not cover those issues; it focuses on the issues that crippled production during the above-mentioned dates. The issue reported on Saturday (8/24/2019) did not cripple production, and Puma applied a workaround before Andrew Lynch could investigate. They did have one problem that required them to cycle count the inventory container, which triggered the order to be picked. The root cause has not yet been determined.
In short, both incidents described below could have been avoided had proper escalation occurred. The proposed solution, until the system stabilizes, is to coach De’Andra Guthrie, as both problems seemed to start with her receiving the call. We will also create a Support SWAT Team on rotation consisting of Chase Copley, a developer (on deck), an escalation resource, a database analyst (on deck), and Travis Coleman. Chase or Travis will be present for all support calls, in addition to the escalation resource. Travis will work with Mark Curtis to identify two developers to rotate between, and with Jason Petrie to identify two database analysts to rotate between if needed.
8/22/19 2:09 AM EST Support Specialist I, De'Andra Guthrie, received notification from Puma. Bastian received three calls reporting the outage, from Adam, William, and Danny. It was reported that the system wasn't processing picks and/or puts. Upon connecting, the Specialist I identified several SQL jobs running, with the jobs scheduled to execute between 2 AM EST and 3 AM EST.
8/22/19 6:02 AM EST De'Andra received an additional call from Puma, this time from Jon Erickson, reporting that the previously reported issue persisted. Our Specialist I began diagnosing the issue.
8/22/19 6:30 AM EST De'Andra began escalation procedures to our Analyst I, Jason Whittle. Management escalation occurred at 6:45 AM EST. De'Andra transitioned the call to Specialist II, Owen Graham, at approximately 7:15 AM EST. Upon reviewing the journals, we identified the following exception occurring when writing journal entries to table ORDER_JOURNAL:
ERROR - Bastian.Exacta.Business.Persistance.UnitOfWork - NHibernate.Exceptions.GenericADOException: could not execute batch command.[SQL: SQL not available] ---> System.Data.SqlClient.SqlException: Arithmetic overflow error converting IDENTITY to data type numeric. Arithmetic overflow occurred.
8/22/19 7:50 AM EST Bastian engaged our Development and Database teams upon identifying the error. It was identified that the SEQ_NUM column in ORDER_JOURNAL had consumed all available values and was unable to reseed (that is, reset the sequence back to 1). This prevented the system from inserting new records. Our Database Analyst, Justin Compton, deleted records from order_journal, reseeded the column, and rebuilt the indexes for the entire table, which ultimately resolved the issue.
8/22/19 10:15 AM EST We received confirmation that the issue was resolved and the system was operational. The total reported production loss was 4 hours.
Root Cause:
Table ORDER_JOURNAL contains a column "seq_num" that automatically increments upon insert. All available integers were consumed and SQL was unable to reseed the sequence. This prevented additional records from being inserted into the table.
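The failure mode above can be illustrated with a small simulation. This is only a sketch, not the Exacta or SQL Server implementation: the class and function names are hypothetical, and the 9-digit, zero-scale numeric definition is an assumption inferred from the column size given under Next Steps.

```python
def max_identity_value(precision: int) -> int:
    """Largest whole number a numeric column with `precision` digits
    (and no fractional digits) can hold. Assumed column shape."""
    return 10 ** precision - 1


class IdentityOverflow(Exception):
    """Raised when the next identity value no longer fits the column."""


class IdentityColumn:
    """Hypothetical stand-in for an auto-incrementing column such as seq_num."""

    def __init__(self, precision: int, seed: int = 1) -> None:
        self.limit = max_identity_value(precision)
        self.next_value = seed

    def insert(self) -> int:
        # Once the counter passes the limit, every insert fails until the
        # column is reseeded or widened; it does not wrap around on its own.
        if self.next_value > self.limit:
            raise IdentityOverflow(
                "Arithmetic overflow error converting IDENTITY to data type numeric."
            )
        value = self.next_value
        self.next_value += 1
        return value

    def reseed(self, seed: int = 1) -> None:
        # Manual reset, analogous to the reseed performed during recovery.
        self.next_value = seed


if __name__ == "__main__":
    print(max_identity_value(9))  # ceiling for the assumed 9-digit column
    demo = IdentityColumn(precision=1)  # tiny precision so the demo overflows fast
    for _ in range(9):
        demo.insert()
    try:
        demo.insert()
    except IdentityOverflow as exc:
        print("insert failed:", exc)
```

Under these assumptions, a 9-digit column accepts at most 999,999,999 values, after which every insert is rejected until the column is reseeded or widened.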
Next Steps and Preventive Actions:
Increase the size of the seq_num column and reseed, significantly increasing the number of available sequence numbers. The column was increased from 9 to 18 digits.
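To give a sense of how much headroom the wider column provides, the sketch below compares the two sizes. The insert rate of one million journal rows per day is a purely hypothetical figure chosen for illustration, not a measurement from the Puma system, and the function name is likewise an assumption.

```python
def years_of_headroom(precision: int, inserts_per_day: int, current_value: int = 0) -> float:
    """Approximate years until a `precision`-digit sequence is exhausted,
    assuming a constant (hypothetical) insert rate."""
    remaining = (10 ** precision - 1) - current_value
    return remaining / inserts_per_day / 365


if __name__ == "__main__":
    rate = 1_000_000  # hypothetical rows per day, for illustration only
    for digits in (9, 18):
        print(f"{digits} digits: about {years_of_headroom(digits, rate):,.1f} years")
```

At that assumed rate, a 9-digit sequence would last under three years from a fresh seed, while an 18-digit sequence would last billions of years, so exhaustion of the widened column is not a practical concern.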
Our commitment to you:
Bastian Solutions understands the impact this disruption had on your organization's operations. Our primary objective is to provide our clients with superb customer service, and we assure you we are taking the measures required to prevent a recurrence.