Aging Network Infrastructure and Delta’s Virtual Crash
Rita Mailheau
Back on January 29, 2017, Delta Air Lines suffered a well-publicized computer crash that grounded 280 planes. An earlier disruption in August 2016 caused a startling 1,000 flight delays on day one and another 779 the next.
Things happen. We get it.
But when you have two such notable instances where flights get delayed or canceled due to a network glitch, it’s probably time for a conversation about IT infrastructure and similar problems other network admins might face in the near future.
There can be no doubt that an organization like Delta Air Lines can well afford to stay competitive in the travel industry. But the outages definitely hurt.
In a public statement, Delta CEO Ed Bastian assured passengers that the company had invested “hundreds of millions of dollars” over the prior three years “in technology infrastructure upgrades and systems–including backup systems to prevent what happened yesterday from occurring”.
It’s a tough pill to swallow, and Delta is far from alone. A host of outages at other carriers, and even at airports, occurred in both 2016 and 2017.
What is at the root of outages like the one that caused Delta’s computer network to fail? Let’s take a closer look.
Why Did Delta’s Network Crash?
Delta’s investigations uncovered an electrical component failure at their hub in Atlanta. Keep in mind that power failures are not uncommon and the airline does have redundancies and backup power systems in place.
Unfortunately, like we said earlier, things happen. In this case, the backup power system did not reboot. If you’re curious about how these redundant systems are set up, here’s a Tom’s article worth reading.
Critical systems and network equipment didn’t switch over to Delta’s backup systems. Furthermore, all the downstream network devices that had lost power could not be reset or brought back online.
An oddly similar outage occurred earlier the same month at Southwest Airlines.
Delta’s enormous computer network is built on a highly complex combination of aging technology.
Newer technologies are far more resilient and responsive, but staying up to date with all of the latest advances is easier said than done. It can also be very difficult to stomach the budget required.
Cycle Time is Nothing New to Aviation
The irony here is how stringent the standards are for the aircraft themselves. Every component on an airplane has a part number and serial number.
Each part is meticulously tracked through a computer database by aircraft maintenance teams. Furthermore, each part has a standard lifecycle, after which that part must be replaced.
Cycle times for aircraft components are based on the following data points:
- Every time the part is turned on and off
- How many times the plane flies
- Flight duration
And when I say each part, I mean each part. Even things like food-service appliances in the galley are clocked.
If the airlines are so particular about their actual aviation equipment, why are they not nearly as strict when it comes to their network equipment? After all, the network is a critical spoke on the proverbial wheel that enables safe aviation.
Last but not least, the same systems are also responsible for essential business functions like scheduling, sales, and customer service portals.
The truth is, network admins have to face this issue whether they’re in the aviation industry or elsewhere.
The average life expectancy for 24/7 servers and storage equipment is getting shorter. Dimension Data showed just that in their most recent Network Barometer Report, a study involving 97,000 network devices in 28 countries.
For the first time in 5 years, companies are starting to refresh equipment earlier in their product lifecycles.
Here are some key findings in that report:
- 73 percent of service incidents fall outside of standard break-fix support contracts
- 37 percent fail due to configuration or other human error and could be avoided with proper monitoring, configuration management, and automation
- Adoption of IPv6-ready equipment has risen 20 percent since last year
Dimension Data goes on to define product lifecycles as:
- Obsolete Equipment is past end-of-support
- Aging Equipment is past end-of-sales, but not end-of-life
- Current Equipment is currently sold and supported
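To make those three categories concrete, here is a minimal Python sketch that sorts a device into one of them based on its vendor dates. The `Device` class, the `lifecycle_stage` function, and the sample dates are all hypothetical and chosen for illustration. The sketch also assumes that "end-of-life" and "end-of-support" refer to the same date, which the report summary above does not spell out.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Device:
    """Hypothetical inventory record; field names are illustrative."""
    name: str
    end_of_sale: Optional[date]     # None means the vendor still sells it
    end_of_support: Optional[date]  # None means no end of support announced


def lifecycle_stage(device: Device, today: date) -> str:
    """Classify a device as 'obsolete', 'aging', or 'current'.

    Assumption: end-of-life is treated as the end-of-support date.
    """
    if device.end_of_support is not None and today > device.end_of_support:
        return "obsolete"  # past end-of-support
    if device.end_of_sale is not None and today > device.end_of_sale:
        return "aging"     # no longer sold, but still supported
    return "current"       # still sold and supported


# Made-up devices and dates, purely for illustration.
inventory = [
    Device("core-switch-01", date(2012, 6, 30), date(2016, 6, 30)),
    Device("edge-router-07", date(2016, 1, 31), date(2021, 1, 31)),
    Device("access-switch-22", None, None),
]

for device in inventory:
    print(device.name, lifecycle_stage(device, date(2017, 2, 1)))
```

Running something like this against a real inventory export is one way to see how much of a network already falls into the "aging" or "obsolete" buckets before a failure forces the question.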
“Old isn’t necessarily bad,” they go on to say. “You just need to understand the implications. Older devices simply need a different support construct.”
One might argue that it’s all well and good to run old equipment if you’re responsible for a back-office network in a low-traffic arm of a business. But when you’re responsible for running close to 1,000 flights per day out of a hub at the company headquarters, it’s probably warranted to consider cycling out equipment a bit faster.
All of this said, behemoths like Delta may not be able to quickly replace vast amounts of networking equipment. The issue isn’t only the cost, but also the timing.
One expert at Datos IO suggested that the company could develop an organizational process to ensure fast recovery.
Why that didn’t happen in Delta’s case earlier this year is still under investigation. What’s important to keep in mind is that this could happen to any aging network.
So with that in mind, here are a few pointers to help you avoid your own Delta-style crash and shorten disaster recovery times.
How to Prevent Outages and Long Recovery Times
If you want to understand how to implement recovery and redundancy strategies, take a look at Tier 4 data center models. These are the most robust and least prone to failure.
Tier 4 is designed to host mission-critical servers and computer systems. It’s packed with fully redundant subsystems (cooling, power, network links, storage, etc.) and compartmentalized security zones controlled by biometric access methods.
It also includes:
- Redundant capacity components
- Dual-powered equipment and multiple uplinks
- Fault tolerant uplinks, storage, chillers
At this point, Tier 4 is tried and true. Having been thoroughly tested and widely utilized, it is rated for 99.995% availability.
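To put the 99.995% availability figure in perspective, it helps to convert it into allowed downtime. The short Python sketch below does that arithmetic. The function name `allowed_downtime_minutes` is ours, and the calculation assumes a 365-day year of round-the-clock operation.

```python
def allowed_downtime_minutes(availability_percent: float,
                             period_hours: float = 365 * 24) -> float:
    """Return the downtime, in minutes, permitted by an availability target.

    Assumption: the period defaults to a 365-day year of 24/7 operation.
    """
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * period_hours * 60


minutes = allowed_downtime_minutes(99.995)
print(f"{minutes:.1f} minutes of downtime per year")  # about 26.3 minutes
```

In other words, the target leaves room for roughly 26 minutes of downtime across an entire year, which shows how far a multi-day outage falls outside that budget.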
This type of operation provides an ideal model that any network admin can aspire to mimic. While you may not be able to get there in one fell swoop, you can incrementally work towards that goal.
It’s a good goal to have for pretty much any business that relies heavily on network systems for day-to-day operations.
Overall, it seems likely that the severity of Delta’s outage would have been mitigated if they’d stayed more up to date with current systems, equipment and best practices.
But in fairness, the combination of aging software and the hardware it runs on can make upgrading to modern systems quite difficult.
So the real question is, what can we take from this experience?
For us, it’s a wake-up call. There is an urgent need for more businesses and network admins to design a lifecycle plan for their IT networks.
With IoT on the rise, capacity demands will continue to climb sharply. So it’d be wise to plan ahead and keep network infrastructures at peak performance and maximum stability.
As always, if we can help you plan for your next network purchase, give us a call. We’d be happy to advise on how to do just that.