Knowledgebase Article Number
At Bastian Solutions, we continually strive to provide outstanding customer service through proactive as well as reactive support. Transparency is a key component in achieving this objective, and with that in mind we are providing a full root cause analysis for the disruptions in service on July 4th and 5th, 2020.
8:07 AM EDT
Support Specialist I Dylan Krawietz received a phone call from the Pottsville distribution center stating that PTL was not working in the high velocity area and that the linkbox was displaying error code 53.
8:07 AM - 8:40 AM EDT
Dylan worked with Dan at HBC on restarting the PTL services along with the AOR program and linkboxes. Dan had to end the call early due to his cell phone losing charge.
Dylan followed up with Dan and was informed that PTL was down and that the site was also experiencing issues with the document inserters.
10:19 - 11:12 AM EDT
Dylan restarted additional services and captured an example container to investigate in the AOR logs. He reached out to Support Analyst I Owen Graham for escalation during this timeframe. Additional issues throughout the system were reported during this time.
11:12 AM EDT
Given the numerous issues being reported, Owen recommended a full system restart and performed it using the ExactaStartupService. The ExactaApplicationService, a core element of the Exacta software, crashed on this restart with an error stating that it could not communicate with other services. Owen then recommended a reboot of the application server, but after the reboot the services still failed to run properly, crashing after the Automation service failed to start. The full system outage began at around 11:15 AM. At this time Owen began reaching out to Bastian's development team for additional help.
12:15 PM EDT
At the recommendation of senior developers, Owen performed a slow restart of the system services, bypassing the ExactaStartupService. This operation was successful. As of 12:15 PM everything was working except the high velocity area.
12:15 PM - 1:55 PM EDT
Owen continued working with Software Development Team Lead Tom Ryan on the high velocity issue, adjusting the logging for Light Controller in an attempt to gather more information. At 1:55 PM Support received word that Perfect Pick was down and showing an error that stated 'Error Connection to application services FAILURE' on all workstations. Owen attempted to connect but the workstations were all offline.
1:55 PM - 3:30 PM EDT
HBC team members determined that the workstations had been turned off after the error appeared. Once they were brought back online, Owen and Tom rolled back the changes that had been made to Light Controller logging and worked with the HBC team to restart each Perfect Pick workstation, which resolved the issue. By 3:30 PM all of Perfect Pick was back up.
3:30 PM - 5:30 PM EDT
At about 3:45 PM Owen and Tom located this error in the light controller logs: <HTP><GTW>10.66.252.110</GTW><SID>8001</SID><DT>E</DT><VAL>53</VAL><LOC>AA-01-AA-99</LOC></HTP>. This message indicates that the linkbox is having trouble communicating with the device at position 1 in the high velocity area. At this time we recommended that maintenance perform a thorough check of the wiring and connections in this area.
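For agents searching the logs for similar entries, the message above can be broken into its fields with a few lines of Python. This is only a sketch: the function names are hypothetical, and the reading of DT = "E" as an error record with VAL carrying the error code is an assumption based on this single logged message, not on documentation of the message format.

```python
import xml.etree.ElementTree as ET

# Message copied verbatim from the light controller log entry quoted above.
SAMPLE = (
    "<HTP><GTW>10.66.252.110</GTW><SID>8001</SID><DT>E</DT>"
    "<VAL>53</VAL><LOC>AA-01-AA-99</LOC></HTP>"
)


def parse_htp(message: str) -> dict:
    """Return the child elements of an <HTP> message as a tag -> text dict."""
    root = ET.fromstring(message)
    if root.tag != "HTP":
        raise ValueError(f"unexpected root element: {root.tag}")
    return {child.tag: (child.text or "") for child in root}


def is_code_53(fields: dict) -> bool:
    # Assumption: DT == "E" marks an error record and VAL holds the code.
    return fields.get("DT") == "E" and fields.get("VAL") == "53"


fields = parse_htp(SAMPLE)
```

Filtering a log for lines where is_code_53 is true, grouped by the GTW and LOC values, would show which linkbox and location report the error most often.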
On-site maintenance reported they were able to get the light bars lit up. Owen continued troubleshooting but still could not get cartons to move past position 1. At 5:30 PM HBC had to stop for the day and agreed to resume the investigation in the morning.
8:30 AM EDT
Owen reached out to HBC via email asking for a system status update as no calls had been received yet.
9:30 AM EDT
Dan called Bastian support from HBC and resumed troubleshooting the high velocity area.
9:30 AM - 12:30 PM EDT
Owen continued working with Dan on the troubleshooting, while also consulting with Tom. At this time we determined that the issue was caused by the Automation service thread for position 1 stopping and no longer looking for work. We were able to get this thread working again by clearing the pre_scan_sequence table and restarting only the Automation service, not LightController. Tom determined that the thread stops in this way if it does not receive the message that an area has been cleared. At 12:30 PM the system was fully functional again.
There are two issues to address here: the major system outages that occurred on Saturday, and the outage specific to the high velocity area that lasted from Saturday morning until Sunday morning.
In the former case, the system went down because 1) the Automation service could not launch successfully due to issues communicating with Light Controller and the linkboxes, and 2) the ExactaStartupService, which Support was using to manage the system restart, is configured to stop the other services if one of the core services fails to run. This is by design, to prevent the system from running in an unstable state. The second outage, which began at Perfect Pick around 1:55 PM, appears to have been triggered by changes to the LightController service that were made during troubleshooting. Those changes were rolled back as soon as the workstations were back online.
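The fail-stop behavior described above can be illustrated with a small model. This is a sketch under stated assumptions, not the ExactaStartupService implementation: the function name, the start order, and the choice of which service counts as core are all hypothetical, and service starts are simulated with a callback.

```python
from typing import Callable, Dict, List, Set


def start_all(
    order: List[str],
    start: Callable[[str], bool],
    stop: Callable[[str], None],
    core: Set[str],
) -> Dict[str, str]:
    """Start services in order; if a core service fails, stop all started ones."""
    started: List[str] = []
    status: Dict[str, str] = {}
    for name in order:
        if start(name):
            started.append(name)
            status[name] = "running"
            continue
        status[name] = "failed"
        if name in core:
            # Fail-stop: unwind in reverse order so nothing is left running
            # against a missing core service.
            for running in reversed(started):
                stop(running)
                status[running] = "stopped"
            break
    return status


# Simulated run in which Automation cannot start (hypothetical start order).
stopped: List[str] = []
status = start_all(
    order=["LightController", "Automation", "ExactaApplicationService"],
    start=lambda name: name != "Automation",
    stop=stopped.append,
    core={"Automation"},
)
```

In this run the failure of one core service takes the already-running service down with it, which is the same pattern that turned a startup problem in one service into a full outage on July 4th.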
The general outage in the High Velocity area is ultimately attributable to hardware issues within the light system. The Automation service regularly queries the database for work at each position, but will stop querying for a particular position if messages from the light system do not indicate that it is ready for a new task at that spot. The intermittent code 53 errors received by Light Controller service are indicative of interrupted communication between the linkbox and individual light devices. Once Automation stops querying a location, it will not do so again until it has been restarted, at which time the pre_scan_sequence table and the physical conveyor must both be cleared for a clean reset.
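The stall described above can be modeled as a small state machine for one position. This is a toy illustration only: the class, its method names, and the max_missed_polls threshold are hypothetical, since the actual condition under which the Automation service gives up on a position is not documented here.

```python
class PositionWorker:
    """Toy model of the per-position polling described above."""

    def __init__(self, position: str, max_missed_polls: int = 3):
        self.position = position
        self.max_missed_polls = max_missed_polls  # hypothetical threshold
        self.state = "polling"
        self.missed = 0

    def poll(self, queue: list):
        """Return the next task, or None if the position cannot take work."""
        if self.state == "stopped":
            return None
        if self.state == "waiting":
            # Still no "area cleared" message from the light system.
            self.missed += 1
            if self.missed >= self.max_missed_polls:
                self.state = "stopped"
            return None
        if not queue:
            return None
        self.state = "waiting"
        self.missed = 0
        return queue.pop(0)

    def on_area_cleared(self):
        """Light system reports the position is ready for a new task."""
        if self.state == "waiting":
            self.state = "polling"

    def restart(self):
        """Models restarting the Automation service."""
        self.state = "polling"
        self.missed = 0


worker = PositionWorker("position-1", max_missed_polls=2)
queue = ["carton-1", "carton-2"]
first = worker.poll(queue)       # task handed out, worker now waits
worker.poll(queue)               # cleared message lost: first missed poll
worker.poll(queue)               # second missed poll: worker stops
worker.on_area_cleared()         # a late message no longer helps
stalled = worker.poll(queue)
worker.restart()
resumed = worker.poll(queue)
```

The point of the model is the last three steps: once the worker has stopped, a late message does not revive it, and only a restart puts the position back into service.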
On Sunday July 5th we determined that the safest and most reliable way of resetting the high velocity system in the event of an error was to clear the line, remove all stored records from the pre_scan_sequence table, and restart the Automation service only (not the LightController service). This technique was used several times over the next week successfully to reset the system.
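The reset sequence above could be captured in a small script so that the steps always run in the same order. This is a minimal sketch under stated assumptions: an in-memory SQLite database stands in for the production database, the table columns are invented for the demonstration, and restart_service is a hypothetical callback rather than a real service-control call.

```python
import sqlite3
from typing import Callable


def reset_high_velocity(
    conn: sqlite3.Connection,
    line_is_clear: bool,
    restart_service: Callable[[str], None],
) -> int:
    """Clear pre_scan_sequence, then restart Automation only. Returns rows removed."""
    if not line_is_clear:
        # Step 1 is physical: the conveyor line must be cleared first.
        raise RuntimeError("clear the conveyor line before resetting")
    with conn:  # commits the delete on success, rolls back on error
        deleted = conn.execute("DELETE FROM pre_scan_sequence").rowcount
    # LightController is deliberately not restarted.
    restart_service("Automation")
    return deleted


# Demonstration against a stand-in database (columns are hypothetical).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pre_scan_sequence (id INTEGER, carton TEXT)")
conn.executemany(
    "INSERT INTO pre_scan_sequence VALUES (?, ?)",
    [(1, "carton-1"), (2, "carton-2")],
)
restarted = []
deleted = reset_high_velocity(conn, True, restarted.append)
remaining = conn.execute("SELECT COUNT(*) FROM pre_scan_sequence").fetchone()[0]
```

Requiring an explicit line_is_clear confirmation keeps the physical step from being skipped, since the table and the conveyor must both be empty for a clean reset.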
The ExactaStartupService has been re-enabled and no further issues have been reported with Automation failing to start. It appears that the issues starting the service on July 4th were likely caused by intermittent communication with LightController, which was taking longer than usual to stabilize itself on restart due to the communication issues at the linkbox.
Next Steps and Preventive Actions:
Owen has taken extensive notes on this case which will be incorporated into a guide article and made available to all Bastian Support agents.
The report from Bastian's on-site team will detail the wiring changes that were necessary to improve performance.
Our commitment to you:
Bastian Solutions understands the impact of the downtime that occurred and how it affected your organization's business processes. Providing our clients with outstanding customer service is our primary objective, and we assure you that we are taking the measures required to prevent this issue from recurring.