Knowledgebase Article Number
Published February 27, 2020, at approximately 11:45 a.m. EST.
At Bastian Solutions, we continually strive to provide customer service of the highest standard, with support that is proactive as well as reactive. Transparency is a key part of achieving this objective. With that in mind, we are providing a full root cause analysis for the disruptions in service on February 26th, 2020, described below.
3:39 AM EST
5:29 AM EST
5:29 AM EST - 7:30 AM EST
7:30 AM EST
7:45 AM EST - 8:30 AM EST
8:30 AM EST - 9:45 AM EST
T2 and T3 continued to investigate. Bastian had not installed any recent software changes on-site, and ExactaLightController is the same version that has been running since August 2015. Further investigation showed that multiple (~10) Windows updates were installed at some point on 2/26. Since this was the only software change we could find on the server at approximately the time the issue started, it was decided to roll back (uninstall) those Windows updates. All of the updates were rolled back except for 2 that the system would not allow to be removed. The server was then rebooted, but the service still did not start correctly and logged the same type of socket errors.
9:45 AM EST - 1:15 PM EST
A bridge call between the site and Bastian was requested, set up, and joined by both parties at approximately 10 AM. The troubleshooting steps attempted so far were explained. T3 recompiled the ExactaLightController files using all up-to-date patches, and the service was deleted and reinstalled again, but it continued to experience errors. We confirmed the test server was offline, and all link boxes were power cycled again. A Developer resource was engaged to participate, and the logs and Event Viewer errors were reviewed. The developer prepared and sent over updated service files that increase the logging detail to give a better idea of the failure point. The new service files were installed at approximately 12 PM. With the new files, the service started and stayed running, but the logs continued to show socket connection/timeout errors, indicating that as soon as the socket was open and listening for data, the connection had already been closed on the remote side. Several theories were suggested, as something on the network/hardware side appeared to be causing the issue. The host server was rebooted, with no change. The site then reset the Digi gateways, and after that reset the socket errors for the service cleared. All services started and stayed active. Initial testing showed that the system was operating normally.
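The symptom described above, a connection that is already closed on the remote side as soon as it is opened, can be checked from the server with a small probe. The following is a minimal sketch only and is not part of ExactaLightController. It assumes each gateway accepts TCP connections on a known host and port; the function name, return labels, and any addresses passed to it are hypothetical.

```python
import socket


def classify_connection(host, port, timeout=2.0):
    """Connect to host:port and report how the remote side behaves.

    Returns one of: "refused", "unreachable", "closed_by_peer", "idle", "data".
    (These labels are illustrative and are not taken from any Bastian or Digi tool.)
    """
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except ConnectionRefusedError:
        return "refused"       # nothing is listening on that port
    except OSError:
        return "unreachable"   # connect timed out or network error
    with sock:
        sock.settimeout(timeout)
        try:
            chunk = sock.recv(1)
        except socket.timeout:
            return "idle"      # connection held open, no data yet
        except (ConnectionResetError, ConnectionAbortedError):
            return "closed_by_peer"
        # An empty read means the remote side closed the connection.
        return "data" if chunk else "closed_by_peer"
```

Running a probe like this against each gateway in turn could help narrow socket errors down to a specific device before resorting to a full reset.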
The issue was found to be hardware-related, stemming from a connection issue with one or more of the Digi gateway devices that service the link boxes. It is believed that the connection between the affected device(s) and the server was never fully reset or disconnected, causing the device to continually attempt to re-establish its connection. This loop was enough to disrupt the ExactaLightController service on startup, causing the crash.
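One general way to keep a single misbehaving device from disrupting service startup is to attempt each connection independently, with bounded retries and logging. The sketch below illustrates that pattern only. It does not describe how ExactaLightController is implemented, and the function name, gateway names, and injected `connect` callable are all hypothetical.

```python
import logging
import time

log = logging.getLogger("gateway_startup")


def connect_all(gateways, connect, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Try each gateway independently; return (connected, failed) name lists.

    `connect` is a caller-supplied function that raises OSError on failure.
    """
    connected, failed = [], []
    for name in gateways:
        for attempt in range(1, max_attempts + 1):
            try:
                connect(name)
                connected.append(name)
                break
            except OSError as exc:
                log.warning("gateway %s attempt %d/%d failed: %s",
                            name, attempt, max_attempts, exc)
                if attempt < max_attempts:
                    # Exponential backoff between attempts.
                    sleep(base_delay * 2 ** (attempt - 1))
        else:
            # Retries exhausted: record the failure and keep starting the others.
            failed.append(name)
    return connected, failed
```

With this approach, a gateway stuck in a reconnect loop would show up in the `failed` list and in the log, while the remaining gateways would still be brought online.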
The site power cycled all Digi gateway devices (approximately 48), which broke the connection loop and allowed the service to run normally.
Next Steps and Preventive Actions:
Bastian Dev resources will be engaged to upgrade the ExactaLightController service files to include the latest patches, fixes, and logging.
Our commitment to you:
Bastian Solutions understands the impact this disruption had on your organization's operations. Our primary objective is to provide our clients with superb customer service, and we assure you that we are taking the required measures to prevent a recurrence.