At Bastian Solutions, we continually strive to provide customer service at the highest standard, offering not only reactive but also proactive support. Transparency is a key component in achieving this objective, and with that in mind, we are providing a full root cause analysis for the service disruption on Sunday, January 24th, 2021, described below. This analysis also covers a similar event on 2/7/2021. Configuration changes made as a result of the 1/24 and 2/7 events prevented a larger outage on 2/9.
Summary:
Sunday, January 24th
6:25 am
Caterpillar called into support stating that they were seeing issues on the Autostore overview. I called the controls team, who got ahold of one of our Autostore specialists.
7:00 am
The Autostore specialist reached out to me stating that they did not believe this was an issue on their side and requested that I, a software support member, get connected. The software team did not have the multi-factor authentication enabled that is required to access the site's Citrix environment. From 7:00 am to 8:00 am I worked with the Autostore team member to connect to the site using his credentials.
8:00 am
I was able to successfully access the Exacta app server as well as the Autostore App server. I attempted to run a communication check from the Autostore App server to the Autostore controller, which runs the ASdriver software. While attempting to connect to the Autostore controller, I restarted the ASbackup service, which was generating the errors, and had the site restart the ASdriver, AShandler, and ASplanner programs. This did not resolve the issue.
9:04 am
I located the IP address that we use to communicate with the Autostore. The IP address that the site uses is 192.168.0.230, but the IP that we use to connect and communicate from the Exacta system is 10.230.28.33.
I attempted to connect directly to the Autostore controller to see the errors and take screenshots, but the usernames we have do not work on this PC. I continued troubleshooting on the Autostore App server. Since I now had the Autostore controller IP address, I was able to successfully run the Autostore communication check, which verifies that it can communicate with the ASdriver, AShandler, and ASplanner programs running on the controller. This check finished without error.
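For reference, the communication check mentioned above is the system's own check; a basic TCP reachability test such as the sketch below can serve as a rough stand-in for its first step when that tool is unavailable. This is illustrative only: the controller IP is the one noted above, and the port numbers are placeholders, since the actual ports used by the ASdriver, AShandler, and ASplanner programs are not documented here.

import socket

# Controller IP taken from the timeline above; the ports are placeholders,
# not the actual ports used by the ASdriver/AShandler/ASplanner programs.
CONTROLLER_IP = "10.230.28.33"
PLACEHOLDER_PORTS = {"ASdriver": 5000, "AShandler": 5001, "ASplanner": 5002}

def tcp_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

for name, port in PLACEHOLDER_PORTS.items():
    status = "OK" if tcp_reachable(CONTROLLER_IP, port) else "UNREACHABLE"
    print(f"{name:10s} {CONTROLLER_IP}:{port} -> {status}")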
9:35 am
Our escalation team was able to put me in contact with one of our DBAs to determine whether the issue was database-related.
10:00 am
After speaking with the DBA, we found that the SQL Server had failed over from the primary node 1 to the secondary node 2. At that point we failed the databases back over, thinking the failover was causing the communication issue with ASbackup. As soon as the failover back to the primary node 1 completed, Autostore began to start up.
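For reference, the current primary/secondary role of each node in a SQL Server Always On availability group can be confirmed by querying the standard availability group DMVs. The sketch below is a minimal illustration using pyodbc; the server name and credentials are placeholders rather than the site's actual values.

import pyodbc

# Placeholder connection details; the real server name and credentials are
# site-specific and not reproduced here.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=<sql-server-or-listener>;DATABASE=master;"
    "UID=<user>;PWD=<password>"
)

# Standard Always On DMVs: shows which replica currently holds the
# PRIMARY role and which is SECONDARY.
query = """
SELECT ar.replica_server_name, ars.role_desc, ars.operational_state_desc
FROM sys.dm_hadr_availability_replica_states AS ars
JOIN sys.availability_replicas AS ar
    ON ars.replica_id = ar.replica_id;
"""

for server_name, role, state in conn.cursor().execute(query):
    print(f"{server_name}: {role} ({state})")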
10:43 am
Due to this sudden change in Autostore's behavior, I began looking at the configuration for the ASbackup service. The service was not set up to communicate through the SQL AG listener, which can reach both node 1 and node 2; instead, it was communicating only with node 1. I updated this config from
<add key="connectionstring" value="Data Source=d9mwexasqlp01.mw.na.cat.com;Initial Catalog=Asbackup;User ID=ASSoftware;Password=Asap1234"/>
to
<add key="connectionstring" value="Data Source=10.230.28.142;Initial Catalog=Asbackup;User ID=ASSoftware;Password=Asap1234"/>
and restarted the ASbackup service. From there I had the site restart the ASdriver, AShandler, and ASplanner programs.
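For context on what connecting through the AG listener accomplishes (the change later completed on 2/9, see the next steps below): pointing the connection at the listener, usually together with the MultiSubnetFailover option, lets the client follow the databases to whichever node is currently primary after a failover. The Python sketch below only illustrates that connection pattern; the ASbackup service itself is configured through the app.config entry shown above, and the listener name and credentials here are placeholders.

import pyodbc

# Illustrative only: connect through the AG listener (placeholder name) rather
# than a specific node, so a failover between node 1 and node 2 is transparent
# to the client. MultiSubnetFailover speeds up reconnection after a failover.
LISTENER = "<ag-listener-name>"  # placeholder; not the site's actual listener

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    f"SERVER={LISTENER};DATABASE=Asbackup;"
    "UID=<user>;PWD=<password>;"
    "MultiSubnetFailover=Yes"
)
# Print the name of the node actually serving the connection.
print(conn.execute("SELECT @@SERVERNAME;").fetchval())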
After checking with our Autostore specialist, I confirmed that the site needed to verify the location of each robot before setting the robots back to planner mode and starting the system.
11:00 am
The site called back stating that they were unable to start the Autostore system and that this had only become an issue after we made the configuration change. I began transitioning the issue to our first-shift team. From 11:00 am to 11:15 am I worked on getting the first-shift team member connected to the site's system, then remoted to their PC to show what changes had been made and walked them through reverting the config. Once the config was reverted and the services were restarted, Autostore still would not start up.
11:43 am
Without any further changes from the software side, the Autostore console became accessible again and we were able to log in and view Autostore once more from the Autostore App server.
12:00 pm
The site was still unable to get the Autostore system to function as intended. When robots were moved from recovery mode to planner mode, they would switch back. The Autostore specialist continued working with the site and additional resources.
1:43 pm
The Autostore specialists and additional resources determined that the site needed to physically move all of the robots and manually set each robot's position before starting the system back up.
Root Cause:
Event from 1/24/2021
The database failed over to node 2 due to errors encountered while Windows attempted to install Windows updates.
Event from 2/7/2021
The database failed over to node 2 because a SQL Server update completed and triggered a restart of SQL Server.
Event from 2/9/2021
ASbackup lost its database connection and had to establish a new connection.
Resolution:
Event from 1/24/2021 - Downtime: 5 hours.
Performed a failover of the databases from node 2 back to node 1. Bastian worked with technicians to recover the robots.
Event from 2/7/2021 - Downtime: 3 hours.
Performed a failover of the databases from node 2 back to node 1.
Event from 2/9/2021 - Downtime: 3 minutes.
Due to the configuration changes made early in the morning on 2/9/2021, the system was able to regain communications without significant downtime.
Next Steps and Preventive Actions:
1. We will need to discuss with the site a good time for testing, when it would be least impactful to their operations, so that we can update the service configuration to use the SQL AG listener. This will allow the server to fail over with a seamless transition. - COMPLETED 2/9/2021
2. CAT IT needs to prevent Windows and SQL Server updates from installing automatically. Updates should be applied manually by the CAT IT group to ensure that the database does not fail over unexpectedly and that services are working properly afterward. Below are the servers where automatic updates should be turned off; a sketch of the equivalent local registry setting follows the server list.
d9mwexactap01 | Prod App Server
d9mwexasqlp01 | Prod Primary DB AlwaysOn Cluster #1 SQL 2016
d9mwexasqlp02 | Prod Secondary DB AlwaysOn Cluster #1 SQL 2016
d9mwexadbaop | ExactaDB AG Listener
d9mwexadwaop | ExactaDW AG Listener
d9mwautosp01 | Prod AS App
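As a reference point for item 2 above, automatic Windows Update installation is normally disabled in a managed environment through Group Policy ("Configure Automatic Updates"). The sketch below shows the equivalent local registry policy value only as an illustration of the setting involved; it is an assumption about approach, not a record of what CAT IT will configure, would need administrator rights on each listed server, and any SQL Server patching delivered outside Windows/Microsoft Update would still need to be managed separately.

import winreg

# Equivalent local registry policy to "Configure Automatic Updates: Disabled".
# In a managed domain this would normally be set through Group Policy instead.
KEY_PATH = r"SOFTWARE\Policies\Microsoft\Windows\WindowsUpdate\AU"

with winreg.CreateKeyEx(winreg.HKEY_LOCAL_MACHINE, KEY_PATH, 0,
                        winreg.KEY_WRITE) as key:
    # NoAutoUpdate = 1 turns off automatic update installation; updates must
    # then be applied manually during a planned maintenance window.
    winreg.SetValueEx(key, "NoAutoUpdate", 0, winreg.REG_DWORD, 1)

print("Automatic Windows Updates disabled via local policy.")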
Our commitment to you:
Bastian Solutions understands the impact this disruption had on your organization's operations. Providing our clients with superb customer service is our primary objective, and we assure you that we are taking the required preventive measures to prevent a recurrence.