NATS Disaster: What Can Businesses Learn From This?

Any IT failure can cause chaos, but it’s difficult to imagine the reaction of those on the support desk when the NATS (National Air Traffic Services) system went into ‘meltdown’ last Friday.

The system was back up and running within 45 minutes – which for some businesses would be a completely acceptable recovery time objective (or RTO – here’s some more information about that) – but that was long enough to disrupt more than 300 flights and leave thousands of frustrated (or should that be fuming?) passengers stranded.

According to reports, the problem was caused by a rogue line of code that was triggered by a random sequence of events that would have been almost impossible to predict – and I think it’s safe to assume that the testing and QA processes are pretty rigorous.

There’s also been a lot of criticism of antiquated systems and suggestions that they have been patched together over the years.

Although that’s likely to be a familiar situation for many businesses, the challenge, both for NATS and anyone looking to update or upgrade, is managing the process of change required to adopt new technologies, and making the transition seamless.

It’s not an easy task, bearing in mind nothing in your IT environment exists in isolation, and I suspect the old saying ‘if it ain’t broke, don’t fix it’ will spring to the minds of many.

(Actually, in the case of NATS that doesn’t strictly apply as they have an enviable (or not?) budget of £575 million to replace systems that have evolved over the last 30 years or more.)

The word ‘resilience’ has been used many times in news reports over the last couple of days, but what does that mean in practical terms when businesses rely so heavily on technology to keep everything running smoothly? 

In any case, resilience should certainly be top of mind for all businesses – for us it drives the way we work across all areas, from infrastructure to applications.

When it comes to making your business technology resilient, there are more and more streamlined ways of going about it. So here are a few things you might like to consider:

  1. High availability is becoming increasingly affordable. We’ve implemented a number of solutions with older servers being redeployed for backup / failover across multi-site operations, and connectivity upgraded to provide close to real-time recovery. And, of course, moving at least some services to the cloud – assuming you choose the right cloud – is certainly an increasingly realistic option. Just remember that connectivity and an understanding of SLAs are critical factors.
  2. Don’t ignore the obvious. When a product comes to end-of-life, it’s a real risk to take no action. Thankfully, most of our customers are already taking steps to move away from Windows Server 2003. It’s been incredibly reliable for more than a decade but from July 2015, any repairs or recovery work won’t be covered by support contracts.
  3. With subscription-based applications such as Microsoft Dynamics CRM online or Sage 200 online, customers prepared to accept the ‘near fit’ principle benefit from regular enhancements rather than infrequent but radical shifts from version to version. We’ve taken the same approach in the development of our new products, Tribe and Elementary, moving away from the traditional reseller approach where every installation is highly customised. That brings significant advantages when it comes to upgrades and compatibility.
  4. An increasing number of our customers are looking to Microsoft SharePoint to deliver everything from asset management to compliance, and benefiting from the fact that solutions are built with no compiled code. And with the inclusion of SharePoint as part of Office 365, it’s now within reach of smaller businesses. You can read more about SharePoint in my colleague Tony Hughes’ blog.
  5. Improvements in connectivity, with non-contended bandwidth and structured SLAs, are making Internet Protocol (IP) telephony a realistic option, with SIP trunks replacing or acting as a backup for ISDN lines to remove the potential for a single point of failure.
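
To put a rough number on what removing a single point of failure can mean, here is a minimal sketch of the standard availability arithmetic. The figures are hypothetical (they are not taken from any real SLA), the function names are my own, and the calculation assumes the two lines fail independently of each other, which won’t hold if they share a cable duct or a power supply.

```python
HOURS_PER_YEAR = 24 * 365


def series_availability(*parts):
    """Availability of a chain where every part must work (values multiply)."""
    result = 1.0
    for availability in parts:
        result *= availability
    return result


def parallel_availability(*paths):
    """Availability when any one path is enough (assumes independent failures)."""
    all_paths_down = 1.0
    for availability in paths:
        all_paths_down *= 1.0 - availability
    return 1.0 - all_paths_down


def downtime_hours_per_year(availability):
    """Expected hours of downtime per year for a given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


# Hypothetical figures for illustration only, not real SLA values.
isdn_line = 0.995
sip_trunk = 0.995

single_path = isdn_line
with_backup = parallel_availability(isdn_line, sip_trunk)

print(f"Single line: {downtime_hours_per_year(single_path):.1f} hours down per year")
print(f"With backup: {downtime_hours_per_year(with_backup):.2f} hours down per year")
```

With these made-up inputs, a single line works out at roughly 43.8 hours of expected downtime a year, while the pair comes in at around 0.22 hours. The `series_availability` helper shows the other side of the coin: two components at 99% that both have to work give you only about 98% between them, which is why every link in the chain matters.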

If you’re going to reduce the chances of a system failure, it’s important to cover all bases – hardware and software, and all of the possible inter-relationships.

The likelihood is that it’s the one thing that you haven’t thought about that will bring you down, although hopefully that won’t result in front-page headlines.

It’s also important to consider that it’s not just about risk management.

New technologies bring new possibilities, so at the same time as increasing resilience, the chances are you’ll find ways to work better or more effectively. And maybe make the headlines for different reasons.