Knowledgebase Article Number
Published May 27, 2020, at approximately 10:30 a.m. EST.
At Bastian Solutions, we continually strive to provide superb customer service to the highest standard, with support that is proactive as well as reactive. Transparency is a key component in achieving this objective. With that in mind, we are providing a full root cause analysis for the disruptions in service on May 26, 2020, described below.
Tuesday, May 26
8:54 AM EST
9:00 AM EST
9:53 AM EST
9:53 AM - 10:25 AM EST
10:32 AM EST
10:35 AM - 11:45 AM EST
Found that DB updates were taking a long time (8-10 seconds per order).
Found something wrong with 2L areas, investigating.
For 2L areas, found that subline was shorted, and did not have all fields filled out properly, causing the wave process to "break" for those areas. Updated subline to 'cancelled' status (filling in necessary fields) so that preprocessor would not pick up the subline.
Soft allocation process taking a long time to process, possibly due to DB performance, and in part to influx of imports
11:45 AM EST
12:00 PM EST
12:30 PM - 4:00 PM EST
Observed wave building process for a time, found that one line orders were waving fine, but 2L orders were not.
Eventually started seeing 2L orders waving (~3:00 PM), once the one line orders finished. The system processes in the order that's listed in the preprocessor config, and all of the 2L areas are at the end.
4:00 PM - 5:00 PM EST
Corrected the subline, updated that wave to a closed status, and 'rewaved' the orders.
No waves were being created for 2L work areas because of bad data on a subline. We found occurrences where a shorted subline did not have all of its fields filled out properly, which would 'break' the wave build process for those areas. This appears to be related to changes that were put in last Thursday (5/21); these were edge scenarios that were not caught in testing.
For the wave build process taking a long time, we found that the NHibernate logging for Preprocessor (waving service) was creating 10 MB of logs every 2 seconds. This caused the order assignment during wave build to take 8-12 seconds per order. A contributing factor (no blame intended) was that DSG waved a significant number of orders at 10 AM, so it took a while to soft allocate all of those orders while the system was trying to 'catch up'. We believe one reason we had not seen this issue before is that there had always been enough time to create and build waves, which hid how long a wave build actually took. This event exposed it.
Lastly, the waves that were created but did not cube had a cause similar to the first issue. These orders had a field filled in that allowed them to wave, but once waved, they failed cubing (cubing fails for the entire wave because of the one order). These were specifically orders that had a line with a quantity > 1 where part of the line was shorted.
The issue of no 2L waves being created was resolved by changing the subline to a 'cancelled' status and filling in the necessary fields so that Preprocessor would not pick up the subline.
For the long wave build process time, NHibernate logging for Preprocessor was turned off (it is rarely used and not as important as regular logging). We observed the order assignment process drop to about 1 second per order, which significantly reduced wave build cycle time.
For the waves that didn't cube properly, we changed the subline to fill in the necessary fields, updated wave to a closed status, and 'rewaved' the orders.
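To put the logging and order assignment figures reported above in perspective, the following is a rough back-of-the-envelope sketch. The log volume (10 MB every 2 seconds) and the per-order times (8-12 seconds before, about 1 second after) come from this analysis; the wave size of 300 orders is a hypothetical number chosen only for illustration and is not taken from the incident data.

```python
def log_rate_mb_per_hour(mb_per_burst: float, burst_seconds: float) -> float:
    """Convert 'X MB every Y seconds' into MB per hour."""
    return mb_per_burst / burst_seconds * 3600


def assignment_minutes(order_count: int, seconds_per_order: float) -> float:
    """Minutes needed to assign orders one at a time."""
    return order_count * seconds_per_order / 60


# HYPOTHETICAL wave size, for illustration only (not from the incident data).
EXAMPLE_ORDERS = 300

if __name__ == "__main__":
    # Reported: 10 MB of NHibernate logs every 2 seconds.
    print("Log volume (MB/hour):", log_rate_mb_per_hour(10, 2))

    # Reported: 8-12 seconds per order with logging on, about 1 second with it off.
    for label, seconds in (("logging on, low", 8), ("logging on, high", 12), ("logging off", 1)):
        minutes = assignment_minutes(EXAMPLE_ORDERS, seconds)
        print(f"{label}: {minutes} minutes for {EXAMPLE_ORDERS} orders")
```

Under these assumptions, the logging works out to roughly 18 GB per hour, and the same hypothetical wave that would take 40 to 60 minutes to assign with logging on would take about 5 minutes with it off.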
Next Steps and Preventive Actions:
For 2L wave issue, Bastian Dev resources have been engaged to correct this with a code change to CICM (order assignment service) to properly short the order, and ensure all fields are filled out properly. This has already been promoted, and will need a window to move it into production (once promote has been fully tested).
For the long wave build process, we feel there is no further action needed at this time, as logging has already been disabled.
For the waves that didn't cube properly, Bastian Dev resources have been engaged to correct this with a code change to Preprocessor to ensure that it validates the correct fields on an open line, and ensures that it does not check the field on a shorted or cancelled subline. This has already been promoted, and will need a window to move it into production (once promote has been fully tested).
At this time, QA has done initial testing but is currently running through a few more scenarios. We would need input from DSG on the earliest window that we could install the promotes, once testing is complete.
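As an illustration only, the Preprocessor validation rule described above (validate the required fields on open lines, and do not check them on shorted or cancelled sublines) could be sketched as follows. This is not the actual Preprocessor code. The `Subline` structure, the `work_area` field, the status names, and the choice to hold back a whole order when one of its open sublines is incomplete are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional, Set, Tuple

# Assumed status names; shorted and cancelled sublines are never validated or waved.
SKIPPED_STATUSES = {"shorted", "cancelled"}


@dataclass
class Subline:
    order_id: str
    status: str                      # assumed values: "open", "shorted", "cancelled"
    quantity: int
    work_area: Optional[str] = None  # hypothetical required field


def select_waveable(sublines: List[Subline]) -> Tuple[List[Subline], Set[str]]:
    """Return (sublines eligible to wave, order ids held back for bad data)."""
    held_orders: Set[str] = set()
    candidates: List[Subline] = []
    for subline in sublines:
        if subline.status in SKIPPED_STATUSES:
            continue  # skip entirely: missing fields here must not break the wave
        if subline.work_area is None or subline.quantity <= 0:
            held_orders.add(subline.order_id)  # incomplete open line: hold the order
        else:
            candidates.append(subline)
    # Assumption: an order with any incomplete open subline is held back whole,
    # so a single bad order cannot fail cubing for the entire wave.
    eligible = [s for s in candidates if s.order_id not in held_orders]
    return eligible, held_orders


if __name__ == "__main__":
    sublines = [
        # Order A: a line with quantity 3 where 1 unit was shorted.
        Subline("A", "open", 2, "2L-01"),
        Subline("A", "shorted", 1, None),
        # Order B: an open subline with a missing required field.
        Subline("B", "open", 1, None),
    ]
    eligible, held = select_waveable(sublines)
    print("Eligible orders:", [s.order_id for s in eligible])
    print("Held orders:", sorted(held))
```

In this sketch, the partially shorted order A still waves on its open subline, while order B is held for correction instead of entering a wave and failing cubing.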
Our commitment to you:
Bastian Solutions understands the impact of the disruption that occurred and how it affected operations for your organization. Our primary objective is to provide our clients with superb customer service, and we assure you that we are taking the required preventive measures to prevent recurrence.