By now everyone is aware of the power failure at Delta Air Lines' Atlanta Technology Center. According to COO Gil West, a critical power control module failed, which caused a surge to a transformer and a resulting loss of power. When this happened, critical networks and systems did not switch over to backups. According to Georgia Power, the problem was failed “switchgear”, something like a fuse box that trips and then reroutes connections to backup power supplies. Whether this could have been tested ahead of time remains unclear, but the incident underscores the importance of testing and redundancy.
With apologies to Scott Adams, if testing your disaster recovery plan consists of someone yelling “Help…Help” and “Someday we hope to have a budget”, you need a better DR test plan. Delta does have redundant data centers and complex DR plans, but when and how do you test a switchover when everything in the airline industry runs 24x7 and is interconnected? You can’t switch over at 2:00 a.m. on a Sunday morning in Atlanta when that is 2:00 p.m., prime time, for Delta in China or Singapore.
The problem at Delta was with the passenger information system. Boarding passes could not be printed, which delayed departing flights, which in turn disrupted the hubs. The delays then worsened as pilots and flight staff exceeded their scheduled service hours and new crews had to be sourced. The net impact to Delta is forecast to be over $120 million USD, including lost revenue and $200 travel voucher payouts for over 1,000 cancelled flights. How much would it have cost Delta to run a test of their power systems and components?
In the high availability business, we see, and help, many customers install redundant servers, replication software, and even redundant data centers and networks. We even have customers who have completely outsourced their IT to Mid-Range and have contracted out DR using a second Mid-Range data center or another third-party data center. In all of these cases, we enforce periodic testing of recovery plans. For many other shops, however, the DR plan or the purchase of high availability software is just a “tick box” for the auditors, and it is never actually tested. If you don’t test, you don’t have DR!
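To make the idea of a failover test concrete, here is a minimal Python sketch of a drill. It is a toy model only: the `PowerSource` and `TransferSwitch` classes, the `switch_works` flag, and the `run_failover_drill` function are all hypothetical names invented for illustration, not part of any real product or of Delta's systems. The drill fails the primary on purpose and checks whether the backup actually takes over.

```python
from dataclasses import dataclass


@dataclass
class PowerSource:
    """Hypothetical stand-in for a power feed or server; not a real API."""
    name: str
    healthy: bool = True


class TransferSwitch:
    """Toy model of switchgear: routes load to the first usable source."""

    def __init__(self, primary, backup, switch_works=True):
        self.primary = primary
        self.backup = backup
        self.switch_works = switch_works  # False models a failed switch

    def active_source(self):
        if self.primary.healthy:
            return self.primary
        if self.switch_works and self.backup.healthy:
            return self.backup
        return None  # nothing took over: the load is dark


def run_failover_drill(switch):
    """Fail the primary on purpose and report whether the backup took over."""
    switch.primary.healthy = False
    try:
        return switch.active_source() is switch.backup
    finally:
        switch.primary.healthy = True  # always restore the primary afterwards


good = TransferSwitch(PowerSource("utility"), PowerSource("generator"))
bad = TransferSwitch(PowerSource("utility"), PowerSource("generator"),
                     switch_works=False)
print(run_failover_drill(good))
print(run_failover_drill(bad))
```

In this sketch, both setups look identical on paper: each has a primary and a healthy backup. Only the drill reveals that the second one cannot actually switch, which is the whole argument for testing rather than assuming.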
As Delta has illustrated, the failure of a single component can have drastic consequences, and regular testing can reduce or eliminate them. When was the last time you tested your DR solution? Mid-Range can help. firstname.lastname@example.org