Published November 28th, 2022 at approximately 5 p.m. EST.
At Bastian Solutions, we continually strive to provide superb customer service at the highest standard, offering not only reactive but proactive support. Transparency is a key component of achieving this objective, and with that in mind we are providing a full root cause analysis for the disruptions in service from November 25th through 27th, 2022, described below.

Summary:

11:10 AM EST (11/25/2022) Janice Violante called Bastian Support from Puma Whitestown and reported waves stuck in assigned status and taking multiple hours to move to cubed status after release. Support Specialist Mark Wallis took the call.
11:50 AM EST (11/25/2022) Mark escalated the issue to Tier 2 within Bastian Support.
12:20 PM EST (11/25/2022) Mark informed Janice that detail-type 1005 was missing from the order information imported from the host and requested that the orders be resent with the missing data added.
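For illustration only, a pre-release check of this kind can be scripted on the host side. The sketch below is hypothetical Python; the field names and order structure are assumptions, not Exacta's actual import format.

    # Hypothetical sketch: assumes imported orders are parsed into dicts with a
    # "details" list whose entries carry a numeric "type" code. These names are
    # illustrative assumptions, not the actual Exacta import schema.
    def orders_missing_detail_1005(orders):
        """Return the order numbers that lack a detail-type 1005 record."""
        missing = []
        for order in orders:
            types = {d.get("type") for d in order.get("details", [])}
            if 1005 not in types:
                missing.append(order.get("order_number"))
        return missing

    # Example with two fabricated orders; only "A2" is missing detail 1005.
    sample = [
        {"order_number": "A1", "details": [{"type": 1005}, {"type": 1010}]},
        {"order_number": "A2", "details": [{"type": 1010}]},
    ]
    print(orders_missing_detail_1005(sample))  # ['A2']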
1:37 PM EST (11/25/2022) Rene Campos called Bastian Support and reported that waves were still experiencing delays in cubing, though the site was still operational and operators had work available to process. Support Specialist Curtis Briscoe took this call. Curtis took the initial step of restarting the cubing and preprocessor services, which provided some improvement but did not resolve the underlying issue.
2:33 PM EST (11/25/2022) Curtis re-engaged escalation to Tier 2 and began working with Support Analyst Jason Whittle.
3:45 PM EST (11/25/2022) Curtis and Jason worked through the issue and determined that the missing detail 1005 data was not the reason for the delay. A restart of services caused processing to pick up speed; however, this was obscured in part by the site dropping two more waves without Rene's authorization. As of 3:45 PM the first two waves had completed and the new waves were processing.
8:40 PM EST (11/25/2022) Lisa Morgan called Bastian Support to report that a wave dropped at 7 PM was still cubing. This call was received by Support Specialist Blake Yates. Blake requested escalation to Tier 2 within 15 minutes, and Support Analyst Travis Wise joined to assist. At this point the site reported that operators were unable to work due to delays in getting work into the system. Travis and Blake engaged development, and during the course of their investigation they determined that there were potential database performance concerns, specifically poor indexing and long-running queries.
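For context, long-running queries of the kind identified here can be surfaced from SQL Server's dynamic management views. The sketch below is illustrative only; the connection string and the 60-second threshold are assumptions, not the exact diagnostics run that night.

    # Illustrative sketch: list active requests that have been running longer
    # than 60 seconds, using standard SQL Server DMVs. The server name and
    # driver in the connection string are placeholders.
    import pyodbc

    LONG_RUNNING = """
    SELECT r.session_id, r.status, r.total_elapsed_time, t.text
    FROM sys.dm_exec_requests AS r
    CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
    WHERE r.session_id > 50                -- skip system sessions
      AND r.total_elapsed_time > 60000     -- elapsed time is in milliseconds
    ORDER BY r.total_elapsed_time DESC;
    """

    conn = pyodbc.connect("DRIVER={ODBC Driver 17 for SQL Server};"
                          "SERVER=exacta-sql;DATABASE=master;Trusted_Connection=yes;")
    for session_id, status, elapsed_ms, text in conn.cursor().execute(LONG_RUNNING):
        print(f"session {session_id} ({status}): {elapsed_ms} ms\n{text}\n")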
9:50 PM EST (11/25/2022) Bastian Support reached out for DBA assistance. With Puma's approval Bastian Support restarted SQL services at 10 PM. The availability group did not come back up, and Bastian's DBA joined at 10:23 PM. At this time the entire system was down due to the database service interruption.
10:30 PM EST (11/25/2022) Bastian's DBA, Alex Stogsdill, identified that the database was in a partially failed-over state and noted that the cluster was showing DNS errors. At this time Rene Campos requested that Puma IT join the call, but they were not immediately available.
12:56 AM EST (11/26/2022) Shortly after midnight, Bastian Support Analyst Andrew Lynch took over the bridge call from Blake. At this time the team was waiting on a Puma DBA to check the status of the cluster. In order to get the system up, Alex moved both databases back to the 3P node, but had to remove exactaDB from the Availability Group in order to do so. Puma IT was able to join shortly after 1 AM.
3:19 AM EST (11/26/2022) Andrew noted that the team was finally able to complete the failover from the 4P server to the 3P server, and at that time service was restored and everything was reported to be working normally.
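For reference, the state of an Always On availability group can be inspected with standard system views, and the remove and failover steps described above correspond to standard T-SQL statements. The sketch below is illustrative; the availability group name, server name, and connection string are placeholders, not the exact commands Alex ran.

    # Illustrative sketch: check replica roles and synchronization health for
    # each availability group. Names in the connection string are placeholders.
    import pyodbc

    AG_STATE = """
    SELECT ag.name, ar.replica_server_name, ars.role_desc, ars.synchronization_health_desc
    FROM sys.dm_hadr_availability_replica_states AS ars
    JOIN sys.availability_groups AS ag ON ag.group_id = ars.group_id
    JOIN sys.availability_replicas AS ar ON ar.replica_id = ars.replica_id;
    """

    conn = pyodbc.connect("DRIVER={ODBC Driver 17 for SQL Server};"
                          "SERVER=exacta-3p;DATABASE=master;Trusted_Connection=yes;",
                          autocommit=True)
    for name, replica, role, health in conn.cursor().execute(AG_STATE):
        print(f"{name}: {replica} is {role}, synchronization health {health}")

    # Removing a database from the group (run on the primary replica) and
    # issuing a manual failover (run on the intended new primary) use
    # statements of this form, with [ExactaAG] as a placeholder group name:
    #   ALTER AVAILABILITY GROUP [ExactaAG] REMOVE DATABASE [exactaDB];
    #   ALTER AVAILABILITY GROUP [ExactaAG] FAILOVER;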
12:10 PM EST (11/26/2022) Janice called Bastian Support again, reporting long delays in wave cubing. Support Specialist Mark Wallis received this call.
12:53 PM EST (11/26/2022) Mark escalated this issue to Tier 2, Support Analyst Travis Wise. Support Analyst Jason Whittle also assisted in this investigation.
3:16 PM EST (11/26/2022) Travis Wise and Jason Whittle reported extremely high levels of fragmentation in the database and recommended re-indexing. However, Puma was concerned that re-indexing could introduce additional performance problems. Bastian DBA Thomas Constantino reviewed with support and concluded that re-indexing could take several hours. The teams jointly decided to schedule re-indexing for 1:30 AM on 11/27. No further calls were received from Puma on Saturday.
4:00 AM EST (11/27/2022) Andrew Lynch performed the re-indexing and noted that it took 18 minutes to run.
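For illustration, fragmentation of this kind is typically measured with sys.dm_db_index_physical_stats and addressed by rebuilding the affected indexes. The sketch below is a generic example under an assumed threshold and placeholder connection details, not the exact maintenance script used at 4:00 AM.

    # Illustrative sketch: find indexes more than 30% fragmented and rebuild
    # them. The threshold, server, and database names are assumptions.
    import pyodbc

    FRAGMENTATION = """
    SELECT OBJECT_NAME(ips.object_id) AS table_name,
           i.name AS index_name,
           ips.avg_fragmentation_in_percent
    FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') AS ips
    JOIN sys.indexes AS i
      ON i.object_id = ips.object_id AND i.index_id = ips.index_id
    WHERE ips.avg_fragmentation_in_percent > 30 AND i.name IS NOT NULL
    ORDER BY ips.avg_fragmentation_in_percent DESC;
    """

    conn = pyodbc.connect("DRIVER={ODBC Driver 17 for SQL Server};"
                          "SERVER=exacta-3p;DATABASE=exactaDB;Trusted_Connection=yes;",
                          autocommit=True)
    cur = conn.cursor()
    for table, index, pct in cur.execute(FRAGMENTATION).fetchall():
        print(f"Rebuilding {index} on {table} ({pct:.1f}% fragmented)")
        cur.execute(f"ALTER INDEX [{index}] ON [{table}] REBUILD;")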
7:53 AM EST (11/27/2022) Jon Coltharp replied to the open support ticket by email and reported that cubing was slow, with a wave released at 6:30 AM only 25% cubed. The site was still working, but there were concerns that it could run out of work, as happened on Friday night.
9:30 AM EST (11/27/2022) Support Team Lead Owen Graham was alerted to the concerns at Puma and opened a group chat with all working senior support agents to address this issue as a P1. Bastian development and DBA resources were engaged within 1 hour.
11:30 AM EST (11/27/2022) Support identified a configurable setting that limits Exacta to cubing 20 orders per minute. After discussion with Rene on site, this limit was raised to 50, and we observed an immediate improvement in cubing performance. No further incidents of delayed cubing were reported after this change.
12:30 PM EST (11/27/2022) Support identified a particular query that could be sped up by adding an index to the order_sub_line table. This index was added, and logging showed a marginal improvement in speed.
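The specific columns covered by the new index are not listed in this report; purely as an illustration, an index on order_sub_line could be created with a statement of the following form, where the key and included columns are hypothetical.

    # Hypothetical sketch: the column list below is an assumption for
    # illustration, not the index that was actually created.
    import pyodbc

    conn = pyodbc.connect("DRIVER={ODBC Driver 17 for SQL Server};"
                          "SERVER=exacta-3p;DATABASE=exactaDB;Trusted_Connection=yes;",
                          autocommit=True)
    conn.cursor().execute("""
        CREATE NONCLUSTERED INDEX IX_order_sub_line_status
            ON dbo.order_sub_line (status)       -- hypothetical key column
            INCLUDE (order_number, quantity);    -- hypothetical covering columns
    """)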
6:08 PM EST (11/27/2022) Jon Coltharp followed up with a review of the day as a whole and confirmed that there had been no additional outages or issues with cubing.
Root Cause: The initial slowness in cubing appeared on Friday and was resolved Sunday morning by raising the MAX_ORDERS_PER_PASS value in the config from 20 to 50. Further investigation revealed that this timeframe corresponded with a substantial increase in SLSP orders, to over 5,000 per day. The PreProcessor config value intended to manage SLSP cubing is disabled; it is possible that enabling it would improve performance further. Much of the outage that occurred from late Friday into early Saturday morning is attributable to the failure of the SQL database cluster to recover fully from the restart performed that night during troubleshooting. The specific cause of this remains unknown, but we are continuing to investigate with the data available.
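As a rough check of why the limit mattered, the reported SLSP volume can be compared against the two throughput settings; this assumes cubing throughput is the only constraint and that orders arrive evenly.

    # Back-of-the-envelope arithmetic using the figures in this report:
    # roughly 5,000 SLSP orders per day against the old and new per-minute limits.
    slsp_orders_per_day = 5000
    for limit_per_minute in (20, 50):
        minutes = slsp_orders_per_day / limit_per_minute
        print(f"{limit_per_minute} orders/min -> {minutes:.0f} min (~{minutes / 60:.1f} h) of cubing")
    # 20 orders/min -> 250 min (~4.2 h); 50 orders/min -> 100 min (~1.7 h)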
Resolution: As noted above, the immediate resolution for the slowness in cubing was to raise the limit on orders cubed per minute. Given the high volume of SLSP orders, the overall amount of product per order is lower than normal, so more orders can be cubed at one time. We will validate whether the other values in this config are correctly set before closing the ticket.
As of 2 PM on 11/30/2022, Bastian DBA Alex Stogsdill has succeeded in restoring ExactaDB to the availability group and bringing the database cluster back to its intended state.
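For reference, re-adding a database to an Always On availability group uses standard T-SQL of the form shown below; the group and server names are placeholders, and the actual restore steps performed are not detailed in this report.

    # Illustrative only: typical statements for rejoining a database to an
    # Always On availability group. [ExactaAG] and the server names are
    # placeholder assumptions.
    import pyodbc

    def run(server, sql):
        """Run a single T-SQL statement against the given server."""
        conn = pyodbc.connect("DRIVER={ODBC Driver 17 for SQL Server};"
                              f"SERVER={server};DATABASE=master;Trusted_Connection=yes;",
                              autocommit=True)
        conn.cursor().execute(sql)

    # On the primary replica:
    run("exacta-3p", "ALTER AVAILABILITY GROUP [ExactaAG] ADD DATABASE [exactaDB];")
    # On the secondary replica (after the database has been restored WITH NORECOVERY):
    run("exacta-4p", "ALTER DATABASE [exactaDB] SET HADR AVAILABILITY GROUP = [ExactaAG];")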
Our commitment to you: Bastian Solutions understands the impact of the disruption that occurred and affected your organization's operations. It is our primary objective to provide our clients with superb customer service, and we assure you that we are taking the required preventative measures to prevent recurrence.