At Bastian Solutions, we continually strive to provide superb customer service to the highest standard, with support that is not only reactive but proactive. Transparency is a key component in achieving this objective, and with that in mind, we are providing a full root cause analysis of the disruptions in service on November 21st – November 22nd, 2019, described below.
Summary:
November 21st, 2019 4:10 PM EST Foster Akado, Support Specialist I, received a call from Karl Kern with Kuehne + Nagel. Karl reported an issue with orders disappearing from workstations while they were in “continuous pick”. He also stated that the AutoStore had been stopped in the middle of picking, and noted that bin presentation was slower than normal and had been a problem throughout the day.
November 21st, 2019 4:45 PM EST Foster called Karl back to explain what he was seeing in the logs and that he was going to escalate the issue to the development team.
November 21st, 2019 5:15 PM EST Foster called Karl to discuss the next steps. Karl reiterated that bins were presenting slowly. Foster reached out to our AutoStore Team for assistance, and Joe Doyle, Controls Support Engineer II, worked with Foster on restarting the AutoStore. At this time, Karl said the shift had ended and the next shift would reach out if issues persisted.
November 21st, 6:10 PM EST Foster received a call from Saheed stating that they continued to experience issues with slow bin presentation and totes disappearing from Exacta Touch. It was noted that when this occurred, the system would jump to the next order, causing an incomplete pick on the previous order. Escalation to our Tier 2 Team began.
November 21st, 7:15 PM EST Escalation to our Tier 3 Team began with the following resources:
1. Travis Coleman, Software Support Manager
2. Chase Copley, Regional Support Manager
3. David Strawser, Unified Support Manager
4. Evan Sturgis, Support Analyst III
5. Giovani Guarnero, Project Engineer
6. Justin Compton, Database Analyst
Our Database Analyst connected and began diagnosing the issue. Using SQL Activity Monitor, our team saw high CPU utilization hovering near 60%. We also checked for missing indexes and for queries that were taxing the system. Upon the conclusion of his review, we decided to move forward with a database failover to the secondary node.
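For reference, queries of the kind below against SQL Server's dynamic management views surface the same CPU and missing-index information that Activity Monitor presents. They are illustrative of the checks performed, not a transcript of the exact commands our team ran.

```sql
-- Illustrative triage queries; names and thresholds are examples only.

-- Top statements by total CPU time
SELECT TOP (10)
       qs.total_worker_time / 1000 AS total_cpu_ms,
       qs.execution_count,
       SUBSTRING(st.text, (qs.statement_start_offset / 2) + 1,
                 ((CASE qs.statement_end_offset
                       WHEN -1 THEN DATALENGTH(st.text)
                       ELSE qs.statement_end_offset
                   END - qs.statement_start_offset) / 2) + 1) AS statement_text
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER BY qs.total_worker_time DESC;

-- Missing index suggestions reported by the optimizer
SELECT mid.statement AS table_name,
       migs.avg_user_impact,
       mid.equality_columns,
       mid.inequality_columns,
       mid.included_columns
FROM sys.dm_db_missing_index_details AS mid
JOIN sys.dm_db_missing_index_groups AS mig
  ON mig.index_handle = mid.index_handle
JOIN sys.dm_db_missing_index_group_stats AS migs
  ON migs.group_handle = mig.index_group_handle
ORDER BY migs.avg_user_impact DESC;
```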
November 21st, 11:30 PM EST The failover was completed, and indexes were rebuilt. Post failover, SQL CPU utilization was below 10%. Foster requested the Kuehne + Nagel team attempt to pick. They attempted port six but encountered issues with the bin being delivered; the robot was hovering at the top. We requested they attempt to pick on another port, port five, and everything appeared to work.
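The entry above notes that indexes were rebuilt but does not record the exact commands. A typical approach is sketched below; the order_sub_line table named in the root cause is used as an illustrative target, and the actual rebuild scope may have been broader.

```sql
-- Illustrative only: table name taken from the root-cause notes.
ALTER INDEX ALL ON dbo.order_sub_line REBUILD;

-- A fragmentation check of this kind is commonly used to decide what to rebuild
SELECT OBJECT_NAME(ips.object_id) AS table_name,
       i.name AS index_name,
       ips.avg_fragmentation_in_percent
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') AS ips
JOIN sys.indexes AS i
  ON i.object_id = ips.object_id
 AND i.index_id = ips.index_id
WHERE ips.avg_fragmentation_in_percent > 30
ORDER BY ips.avg_fragmentation_in_percent DESC;
```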
November 22nd, 12:01 AM EST We waited on the Kuehne + Nagel team to confirm that everything was operational. During this period, we discussed the next steps in determining the root cause. It was noted that the SQL archive job had not run since 10/29. Upon review, our DBA did not believe this was the culprit. He also noted that RCSI (Read Committed Snapshot Isolation) had not been implemented.
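A gap such as the archive job not running since 10/29 can be confirmed from the SQL Agent history tables in msdb. The query below is an illustrative sketch using the “Archive and Delete” job name referenced in the root cause, not a record of the exact check performed.

```sql
-- Illustrative: when did the job last complete successfully?
SELECT j.name,
       MAX(msdb.dbo.agent_datetime(h.run_date, h.run_time)) AS last_successful_run
FROM msdb.dbo.sysjobs AS j
JOIN msdb.dbo.sysjobhistory AS h
  ON h.job_id = j.job_id
WHERE j.name = N'Archive and Delete'
  AND h.step_id = 0      -- step 0 rows record the overall job outcome
  AND h.run_status = 1   -- 1 = succeeded
GROUP BY j.name;
```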
November 22nd, 1:20 AM EST We received confirmation from the Kuehne + Nagel team that everything was operational. We provided them with the escalation plan in the event they continued experiencing issues: upon their calling support, Tier 3 and Support Management would be notified.
November 22nd, 6:48 AM EST De’Andra Guthrie, Support Specialist I, began the escalation procedure for the issues reported. David, Evan, and Travis were all notified. Owen Graham, Support Specialist II, acted as a liaison between Kuehne + Nagel and Bastian.
November 22nd, 7:30 AM EST Escalation to the Database Analyst began. At this time, production had not come to a halt, but the overall system was sluggish. While waiting for the database team, we checked indexes and SQL CPU utilization; everything looked normal.
November 22nd, 8:20 AM EST Justin Compton, Database Analyst, and David Wagner, SQL Developer, worked on connecting with Giovanni, who experienced issues connecting.
November 22nd, 9:10 AM EST Escalation continued to Development Team Leads and Management:
1. Mark Curtis, Software Development Manager
2. Jason Petrie, General Manager
3. Cole Werner, Software Developer Team Lead
4. Mike Kovachik, Software Developer Team Lead
Logs were provided to our Senior Team Leads to perform analysis.
November 22nd, 9:30 AM EST Upon reviewing the logs, we saw additional deadlocks recorded. We recommended proceeding with manual execution of the delete and archive scripts.
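For context, deadlocks of this kind can also be confirmed from SQL Server's built-in system_health Extended Events session. The query below is one common way to retrieve recent deadlock graphs and is provided as an illustration, not a record of the tooling used that morning.

```sql
-- Pull recent deadlock reports captured by the system_health session
SELECT xed.value('@timestamp', 'datetime2') AS deadlock_time,
       xed.query('.')                       AS deadlock_graph
FROM (
    SELECT CAST(st.target_data AS XML) AS target_data
    FROM sys.dm_xe_sessions AS s
    JOIN sys.dm_xe_session_targets AS st
      ON st.event_session_address = s.address
    WHERE s.name = N'system_health'
      AND st.target_name = N'ring_buffer'
) AS src
CROSS APPLY src.target_data.nodes('//RingBufferTarget/event[@name="xml_deadlock_report"]') AS x(xed);
```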
November 22nd, 10:30 AM EST Our Database Analyst, Justin Compton, initiated the delete and archive procedure. The estimated completion time was 1.5 hours.
November 22nd, 11:53 AM EST The archive job completed. We began starting applications and services on all workstations and servers.
November 22nd, 12:30 PM EST The Kuehne + Nagel team had pickers ready to pick. Overall, they said picking speed was normal. Troy and DJ worked with the Kuehne + Nagel team to resolve the port six issues.
November 22nd, 3:00 PM EST The system was running as expected.
Resolution: Manual execution of the Archive and Delete job.
Root Cause: Exceptions were received in the Automation logs indicating deadlocks or SQL timeouts.
A SQL view was missing NOLOCK hints, which caused SQL blocking on the order_sub_line table.
ERROR Bastian.Exacta.AutomationService.AutomationService - Error confirming pick with result code FAILURE and message NHibernate.Exceptions.GenericADOException: could not execute batch command.[SQL: SQL not available] ---> System.Data.SqlClient.SqlException: Execution Timeout Expired. The timeout period elapsed prior to completion of the operation or the server is not responding.
The SQL job “Archive and Delete” had been disabled in late September for cycle counts and later re-enabled.
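To illustrate the blocking mechanism and the planned fix: under the default READ COMMITTED isolation level, a view that reads order_sub_line without locking hints can block behind writers. The sketch below shows a NOLOCK hint inside a view and the RCSI change listed under next steps; the view name, column list, and database name are placeholders, not the actual Exacta objects.

```sql
-- Illustrative only: placeholder view/column/database names.
-- A NOLOCK hint on the underlying table lets the view read without taking shared locks.
-- (CREATE OR ALTER requires SQL Server 2016 SP1 or later.)
CREATE OR ALTER VIEW dbo.vw_order_sub_line_status
AS
SELECT osl.order_id,
       osl.line_number,
       osl.status
FROM dbo.order_sub_line AS osl WITH (NOLOCK);
GO

-- The longer-term fix listed under next steps is RCSI, which lets readers see the
-- last committed version of a row instead of waiting on writers.
ALTER DATABASE Exacta
SET READ_COMMITTED_SNAPSHOT ON WITH ROLLBACK IMMEDIATE;
```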
Next Steps and Preventive Actions:
- RCSI Implementation
  - Architecture change that will allow the system to run more efficiently and not be impacted by deadlocks.
  - Status: Promote received, awaiting confirmation from K+N to install.
  - Owner: K+N
- SMTP Mail
  - Kuehne + Nagel IT to work with Ryan Bliss on applying a SQL patch that will allow us to send email notifications via SQL.
  - Status: Ryan provided K+N IT with information on what update needs to be applied and why. Ryan to coordinate with the K+N DB team on the install.
  - Owner: K+N
- Migration away from SQL Server Agent and implementation of Exacta Job Runner.
  - Status: Promote requested from development. We will need to set up and configure the current SQL jobs to be executed under the Job Runner.
  - Owner: Bastian
- Proactive Evaluation
  - IT Infrastructure Assessment
    - Status: Open - a formal review will be provided upon completion.
    - Owner: Bastian
  - Database and Development Assessment
    - Status: Open - a formal review will be provided upon completion.
    - Owner: Bastian
- Weekly check verifying SQL Jobs are running successfully (an example verification query is sketched below this list).
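As referenced in the last item above, the weekly verification could be scripted along the following lines. This query against the SQL Agent history tables is an example sketch, not a finalized procedure.

```sql
-- List SQL Agent jobs that did not succeed in the past seven days
SELECT j.name,
       h.run_date,
       h.run_time,
       h.run_status,   -- 0 = failed, 1 = succeeded, 2 = retry, 3 = canceled
       h.message
FROM msdb.dbo.sysjobs AS j
JOIN msdb.dbo.sysjobhistory AS h
  ON h.job_id = j.job_id
WHERE h.step_id = 0            -- overall job outcome rows
  AND h.run_status <> 1
  AND h.run_date >= CONVERT(int, CONVERT(char(8), DATEADD(DAY, -7, GETDATE()), 112))
ORDER BY h.run_date DESC, h.run_time DESC;
```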
Our commitment to you:
Bastian Solutions understands the impact of the disruption that occurred and how it affected operations for your organization. Providing our clients with superb customer service is our primary objective, and we assure you that we are taking the required preventive measures to keep this issue from recurring.