At Bastian Solutions, we continually strive to provide superb customer service at the highest standard, with support that is proactive as well as reactive. Transparency is a key component in achieving this objective. With that in mind, we are providing a full root cause analysis for the disruptions in service on March 14th and 16th, 2020, described below.
Summary:
April 12
2:13 PM EST
The site called Support, reporting that waves were taking an unusually long time to load
2:40 PM EST
Tier 1 (T1) analyst connected to the system and started investigating via logs and SQL queries
3:13 PM EST
Escalated issue to Tier 2 (T2) resource Barry Stone
3:13AM - 12:15 PM EST
T2 reviewed the ticket with T1 and looked at logs. Reviewed live examples provided by the site and tried to determine whether the issue was affecting all waves. Checked whether cancelled orders were impacting the import and assigning process. The site cancelled one order, and imports did begin to process, but with extreme slowness.
4:37 PM EST
Development Resource Engaged
4:45 PM EST
Application server was restarted to see whether performance would improve. Additional troubleshooting continued.
5:45 PM EST
Escalation was transferred to an additional T2 resource, Jason Whittle
6:30 PM - 7:00 PM EST
Development reviewed additional logs and database performance and found that the CNTNR_HEADER, CNTNR_HEADER_DETAIL and DOCUMENT_LINE tables contained millions of unneeded records.
7:15 PM - 7:30 PM EST
An index was run on the tables in question, and the import process began to post data at the expected rate. It was determined that a DBA resource would be engaged the next day to review the Archive & Delete (A&D) configuration to see why records were being omitted from the process.
April 13
T2 resource Andrew Downey engaged a DBA to review the A&D logic. It was determined that the CNTNR_HEADER script needed to be updated to remove the CNTNR_HEADER.CNTNR_NAME IS NULL condition so that the orphaned records would be cleared.
2:00 PM - 5:00 PM EST
Andrew contacted Kenny and Ron to review the archive process. It was determined that the nightly scheduled maintenance job should archive 50,000 records per run until the cleanup was complete.
April 14
Bastian reviewed the A&D task from the previous night and found that the process ran for approximately 27 minutes and that all orphaned records were moved from the live ExactaDB to the ExactaArchiveDB. Further review of the logic found that the 50K record count was nested in a loop, which allowed the task to run until complete rather than being limited to a specific record count.
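The difference between a batch size nested in a loop and a hard per-run cap can be illustrated with a minimal sketch. It uses Python's built-in sqlite3 module purely as a stand-in; the table names (live_cntnr, archive_cntnr), the columns, and the function names are hypothetical and do not reflect the actual ExactaDB schema or the actual A&D script.

```python
import sqlite3


def make_demo_db(n_rows):
    """Build an in-memory stand-in for a live table and its archive table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE live_cntnr (id INTEGER PRIMARY KEY, payload TEXT)")
    conn.execute("CREATE TABLE archive_cntnr (id INTEGER PRIMARY KEY, payload TEXT)")
    conn.executemany(
        "INSERT INTO live_cntnr (id, payload) VALUES (?, ?)",
        [(i, "orphan") for i in range(1, n_rows + 1)],
    )
    conn.commit()
    return conn


def archive_in_batches(conn, batch_size, max_batches=None):
    """Move rows from the live table to the archive table, batch by batch.

    max_batches=None keeps looping until the live table is empty, which is
    the "batch size nested in a loop" behaviour described above.
    max_batches=1 would instead cap a run at a single batch.
    Returns the number of rows moved.
    """
    moved = 0
    batches = 0
    while max_batches is None or batches < max_batches:
        # Pick the next batch: the lowest ids still in the live table.
        ids = [row[0] for row in conn.execute(
            "SELECT id FROM live_cntnr ORDER BY id LIMIT ?", (batch_size,))]
        if not ids:
            break  # nothing left to archive
        last_id = ids[-1]
        # Copy and delete in one transaction so a failure cannot lose rows.
        with conn:
            conn.execute(
                "INSERT INTO archive_cntnr SELECT * FROM live_cntnr WHERE id <= ?",
                (last_id,))
            conn.execute("DELETE FROM live_cntnr WHERE id <= ?", (last_id,))
        moved += len(ids)
        batches += 1
    return moved


def live_count(conn):
    """Number of rows still in the live table."""
    return conn.execute("SELECT COUNT(*) FROM live_cntnr").fetchone()[0]
```

In this sketch, calling archive_in_batches(conn, 50, max_batches=1) moves at most 50 rows and leaves the rest for the next run, while archive_in_batches(conn, 50) keeps working in batches of 50 until the live table is empty. The second form corresponds to what the review found: the batch size limited each pass, not the whole run.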
Root Cause:
The root cause was the initial Archive and Delete configuration, which included the “Cntnr_name is NULL” condition in the archive process. This caused the A&D process to omit containers whose names were not set to NULL during operations. Since go-live, the container record count (and associated record counts) continued to build in the database, eventually causing slowness during the import process.
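As a rough illustration of how a NULL-name condition can leave records behind, the sketch below compares which rows a selection query picks up with and without that condition. It uses Python's sqlite3 module as a stand-in database. The lowercase table, the "finished" column, the sample container names, and the eligible_ids function are all assumptions made for the example; the real A&D criteria are not shown in this report.

```python
import sqlite3


def make_header_table():
    """Create a tiny, hypothetical stand-in for the container header table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE cntnr_header "
        "(id INTEGER PRIMARY KEY, cntnr_name TEXT, finished INTEGER)"
    )
    conn.executemany(
        "INSERT INTO cntnr_header (id, cntnr_name, finished) VALUES (?, ?, ?)",
        [
            (1, None, 1),         # finished, name was set to NULL
            (2, "TOTE-0002", 1),  # finished, name never set to NULL
            (3, "TOTE-0003", 1),  # finished, name never set to NULL
            (4, "TOTE-0004", 0),  # still in use, must not be archived
        ],
    )
    conn.commit()
    return conn


def eligible_ids(conn, require_null_name):
    """Return ids the archive step would select.

    require_null_name=True adds the extra "cntnr_name IS NULL" condition,
    which skips finished containers that still carry a name.
    """
    sql = "SELECT id FROM cntnr_header WHERE finished = 1"
    if require_null_name:
        sql += " AND cntnr_name IS NULL"
    sql += " ORDER BY id"
    return [row[0] for row in conn.execute(sql)]


conn = make_header_table()
with_condition = eligible_ids(conn, require_null_name=True)      # [1]
without_condition = eligible_ids(conn, require_null_name=False)  # [1, 2, 3]
```

With the condition in place, only row 1 is selected and rows 2 and 3 stay in the live table indefinitely. Removing the condition selects all three finished rows, while the row still in use is left alone.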
Our commitment to you:
Bastian Solutions understands the impact this disruption had on your organization's operations. Providing our clients with superb customer service is our primary objective, and we assure you that we are taking the required measures to prevent recurrence.