99.9% is Good Enough? Sure. Until It Isn’t.


They call it “Five 9’s.” Guess what? No one cares about 99.999% uptime anymore. Unless they want to fly safely, make money, or read reliable news.

FIVE MINUTES AND 15 SECONDS of downtime a year.

That’s what 99.999% reliability gives you. Just five minutes with no internet, five minutes without email, five minutes where the plain old telephone service isn’t working. Over the course of a year. It was the gold standard of uptime and something everyone in technology strove for.
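If you want to check that number yourself, the arithmetic is simple. Here is a rough sketch in Python, assuming a 365-day year. The function name is made up for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, assuming a non-leap year


def downtime_minutes_per_year(uptime_percent: float) -> float:
    """Minutes of downtime a year allowed by a given uptime percentage."""
    return MINUTES_PER_YEAR * (100.0 - uptime_percent) / 100.0


# Compare three, four, and five nines.
for uptime in (99.9, 99.99, 99.999):
    minutes = downtime_minutes_per_year(uptime)
    print(f"{uptime}% uptime allows {minutes:.2f} minutes of downtime a year")
```

Five nines works out to about 5.26 minutes, which is where the five minutes and 15 seconds comes from. Three nines allows roughly 526 minutes, or almost nine hours.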

Maybe it was a lot easier when your network was literally physical lines connecting two points via a few switchboards. We used that technology to put a man on the moon. So you’d think that with the cheaper and yet infinitely more capable networks of today we’d have reached six or seven nines. Redundancy is cheap. Virtually 100% uptime is possible.

Instead, businesses are heading in the opposite direction.

In early July 2015, three simultaneous network outages – the New York Stock Exchange, United Airlines, and the Wall Street Journal website – made the conspiracy theorists go wild and brought out the clever on Twitter:

“9 questions about the Illuminati you were too afraid to ask: Did you cause the NYSE to crash?”

“Looks like I picked the wrong day to fly United to the NYSE for a story about WSJ.”

“Nobody panic, all the Bitcoin exchanges are fine.”

“Hey NYSE, did you try blowing into the cartridge?” “Did you unplug it and plug it back in again?” (My favorite).

The NYSE’s four-hour (240-minute) network outage put it at 99.954% uptime for the year, assuming no other outages. Not bad. Unless you’re a finance company losing 5 to 10 grand every minute the stock exchange is down. Losing 0.046% of the $169 billion traded every day doesn’t seem like much. Until you do the math and it’s tens of millions of dollars lost. In a matter of hours. Someone’s retirement is going to suffer.
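Here is that math as a quick Python sketch. It assumes a 365-day year and, like the paragraph above, applies the yearly downtime share to one day's trading volume. The variable names are mine, and the result is back-of-the-envelope, not an accounting of actual losses.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a non-leap year

outage_minutes = 240            # the four-hour outage
daily_volume_dollars = 169e9    # the $169 billion traded every day

# Share of the year the network was down, and the uptime that leaves.
downtime_fraction = outage_minutes / MINUTES_PER_YEAR
uptime_percent = 100.0 * (1.0 - downtime_fraction)

# Apply the same share to one day's trading volume.
dollars_at_stake = daily_volume_dollars * downtime_fraction

print(f"uptime for the year: {uptime_percent:.3f}%")
print(f"share of daily volume: ${dollars_at_stake / 1e6:.0f} million")
```

One outage, and the "tiny" fraction of a percent is already worth tens of millions.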

Anyone want to fly with an airline where they guarantee their I.T. is 99.9% good? With 87,000 flights a day in the U.S., that 0.1% is only 87 flights EVERY DAY lost, misplaced, or delayed. Ready to take your chances?

Forget tech for a second. How about a hospital maternity ward that is 99.9% good at not dropping its 27,000 delivered babies each year? That’s only 1 baby dropped every other week.
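The same 0.1% applies to both examples. This is a small sketch, assuming every flight or delivery has the same independent 99.9% chance of going right. The helper name is hypothetical.

```python
def expected_failures(volume: float, reliability_percent: float) -> float:
    """Expected number of failures among `volume` events at a given success rate."""
    return volume * (100.0 - reliability_percent) / 100.0


flights_affected_per_day = expected_failures(87_000, 99.9)
babies_dropped_per_year = expected_failures(27_000, 99.9)
weeks_between_drops = 52 / babies_dropped_per_year

print(f"flights affected per day: {flights_affected_per_day:.0f}")
print(f"babies dropped per year: {babies_dropped_per_year:.0f}")
print(f"weeks between drops: {weeks_between_drops:.1f}")
```

That is 87 flights a day and 27 babies a year, or about one every two weeks.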

“Close enough” is good for jazz, but doesn’t work for a massive business like United Airlines or the NYSE. And before you tell me that you’re not crying in your beer for the likes of those ultrarich jerks, what if it happened to you?

Could your business afford to lose 1%, 5%, or 10% of your revenue due to a technical glitch?

You and I know that the smaller the business, the bigger the deal it is. Imagine if you were down for a day, two days, or the average six weeks it takes to recover from a crashed hard drive. You can’t afford to lose even a small percent.

We’re all working on razor-thin margins. Maybe you’d go out of business. Or maybe your employees don’t get a Christmas bonus. Or maybe nothing bad happens. Are you willing to take that risk?

Surely big corporations don’t do that, right? With their massive I.T. budgets and nearly endless resources, you don’t expect them to neglect their infrastructure; to let things go until a router blows out and gums up the whole works. Right? Wrong.

What Happened to Network Reliability?

A Wall Street Journal article called “Network Reliability, a Relic of Business?” suggested multiple, converging factors in declining network stability:

  • Increasing complexity of networks.
  • Sheer volume of data the networks are expected to handle.
  • The pace of technological development.
  • Lingering legacy software written with little regard to usability or user experience.
  • Interdependency of systems that don’t seem to be related on the surface.
  • The rise of digital services in all industries.

The authors also mention “insufficient organizational and cultural practices.”

I call that the “Everything-is-Fine-Until-It-Isn’t Philosophy” of I.T. system management.

Everything is fine. The system is mostly reliable. There are a few warning signs, but your in-house computer guy or your team finds a workaround. Or multiple workarounds. Until one day, something fails.

At United Airlines it was a network router. When that one router went out, it “degraded network connectivity for various applications,” said Jennifer Dohm, a United spokesperson. Nothing important except for the reservation system.

This wasn’t the first sign that their system had problems. Their computers briefly shut down in February of 2014 and delayed hundreds of flights in August and November of 2012. They’re not just saving money on in-flight snacks; they’ve been heavily criticized for their network infrastructure problems.

Small businesses sometimes have to “make do” because they’re on a tight budget, they’re understaffed, or they don’t have the expertise to run a network. You’d think multimillion-dollar companies and billion-dollar industries would be operating on a different level. But experts blame issues like the United Airlines failure on reduced I.T. infrastructure funding, and the NYSE outage on recent cuts to its technical team, which may have seen the most experienced (i.e., expensive) engineers go first.

Caused by “Organizational Issues and Cultural Practices”

It’s why people stay with I.T. support companies that are “just okay” or even “not good.” It feels like a huge, complicated task to find a new managed service provider.

In very large organizations, the culture doesn’t support an understanding of the different mindset that it takes to be a systems engineer or network administrator. Upper management doesn’t want to deal with “those geeks in I.T.”

They’re always asking for money for equipment and when they explain what it’s for, it sounds like something out of a random technical jargon generator.

What self-respecting V.P. of Finance sitting in their mahogany office overlooking Central Park wants to listen to that nonsense? A V.P. who just watched their firm lose millions of dollars because of the high-tech equivalent of a blown fuse, that’s who.

“Until there’s a day like today, everybody invests in other things,” said Thomas Bayer, CIO at the S&P Ratings Service.

Because everything was fine. Until it wasn’t.

“Post-Apocalyptic city” by Ty’Onah Gallman is licensed under CC BY 2.0