Should have tested more – Five times bugs and errors had massive consequences

When oversights lead to catastrophic failures


It won’t come as a great surprise to hear that we believe regular, comprehensive testing matters enormously to you, your business, and your customers. It’s the only way to ensure everything works as it should and that customers aren’t running into problems.

If you don’t test thoroughly enough, there’s a higher chance something will go wrong. In most cases, that means reputational damage or lost business.

That’s bad. But it could be worse. So much worse.

Here are five times organizations really should have been a bit more thorough in their testing.

Dead code nearly bankrupts a $400m business in 45 minutes

When you have $400m of assets and regularly handle billions of dollars of stock exchange trades a day, any change to your system needs to be thoroughly checked. Unfortunately for Knight Capital Group in 2012, that didn’t happen. The firm introduced a new high-speed algorithmic router, hoping it would handle client trades more quickly and easily than before. In tests, it worked as intended.

Unfortunately, they didn’t test it on the servers they’d be deploying to. One of those servers still carried an older stock-purchasing system, unused for years but never deleted. When the new software went live, a repurposed flag reactivated that old code, which had no way of knowing when an order had been fulfilled.

The result? An unstoppable 45-minute buying frenzy, with the old code endlessly buying stocks even as orders were filled many times over. There was no kill switch and no easy way to stop it, and a loss of roughly $440m left Knight facing bankruptcy. All because they hadn’t removed some old code from their systems and hadn’t tested the new code on the servers it would run on.
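Knight’s actual code has never been published, but the failure mode is easy to sketch: a repurposed flag wakes up dead code that never checks whether orders were filled. Everything below is a hypothetical Python illustration with invented names, not anything from Knight’s real system:

```python
# Hypothetical sketch of the Knight-style failure: a repurposed flag
# routes orders to dead code that never checks whether they were filled.
# None of these names come from Knight's actual system.

filled = set()        # order IDs the exchange has confirmed as filled
orders_sent = 0

def send_child_order(order_id):
    """Stand-in for sending an order to the exchange."""
    global orders_sent
    orders_sent += 1
    filled.add(order_id)          # the exchange fills it almost at once

def new_router_buy(order_id, ticks):
    # New logic: stop as soon as the order is confirmed filled.
    for _ in range(ticks):
        if order_id in filled:
            return
        send_child_order(order_id)

def legacy_buy(order_id, ticks):
    # Dead code path left on the server: it never consults `filled`,
    # so it keeps buying the same stock, tick after tick.
    for _ in range(ticks):
        send_child_order(order_id)

# Most servers ran the new router; the one that missed the deployment
# interpreted the same flag as "use the old code" instead.
legacy_buy("XYZ", ticks=45 * 60)   # one order per second for 45 minutes
print(orders_sent)                 # 2700 duplicate buys instead of 1
```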

Wrong unit of measurement = one destroyed space probe

If there’s any environment where ‘test, test, and test again’ is a critically important mantra, it’s space. That’s why NASA found itself in an embarrassing position in 1999, when a $125m probe burnt up in the atmosphere of Mars.

It turned out that the Mars Climate Orbiter wasn’t lost to a technical malfunction or unavoidable circumstances. It was lost because of a unit conversion issue.

The probe’s designers at Lockheed Martin had developed thruster control software that reported its figures in imperial units, while NASA’s own navigation team worked in metric. Can you guess what went wrong?

The result was a trajectory that took the probe far closer to the planet than intended, where it burnt up in the atmosphere. All because everyone assumed the figures matched, with no testing carried out to check.
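The arithmetic of the mistake fits in a few lines. The impulse figure below is invented, but the conversion factor is real: one pound-force second is about 4.448 newton-seconds, so every thruster firing’s effect was underestimated by a factor of roughly 4.45.

```python
# Minimal sketch of the unit mismatch. The impulse value is made up;
# the conversion factor is real: 1 pound-force second = 4.448222 N*s.

LBF_S_TO_N_S = 4.448222

def report_thruster_impulse() -> float:
    # Ground software reports the figure in pound-force seconds (imperial).
    return 100.0

def update_trajectory(impulse_n_s: float) -> float:
    # Navigation software assumes the figure is in newton-seconds (metric).
    return impulse_n_s

raw = report_thruster_impulse()
used = update_trajectory(raw)                     # treated as N*s: wrong
correct = update_trajectory(raw * LBF_S_TO_N_S)   # converted first: right
print(f"thruster effect underestimated {correct / used:.3f}x")  # ~4.448x
```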

A programming oversight in World of Warcraft leads to a pandemic

The Corrupted Blood incident is one of the most famous events in video game history. It’s been studied by epidemiologists, anthropologists, and programmers alike.

If you’re not familiar: in 2005, an update was released for World of Warcraft, at the time the most popular multiplayer game in the world. The update included a new boss with an ability called ‘Corrupted Blood’, which could deal enough damage to kill a player character in 30 seconds. It was only supposed to last ten seconds, and only to work in a certain zone.

What the programmers hadn’t planned for, or tested, was the effect of Corrupted Blood on a player’s pet. A pet could be infected during the fight, and dismissing it paused the debuff instead of clearing it. When the fight ended and the player returned to a populated area, resummoning the pet brought the virus along with it.
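Blizzard’s code isn’t public, but the oversight boils down to state that survives a zone transition: the debuff was cleaned up on players, not on pets. Here’s a toy Python model with invented names, not anything from the actual game:

```python
# Toy model of the oversight: the debuff is cleaned up on the player
# when leaving the raid, but the pet's saved state keeps it. All class
# and function names here are invented; this is not Blizzard's code.

class Entity:
    def __init__(self, name: str):
        self.name = name
        self.infected = False

def leave_raid(player: Entity, pet: Entity) -> None:
    player.infected = False   # intended cleanup on zone exit...
    # ...but no equivalent line for the pet: dismissing and resummoning
    # it restored its state exactly as saved, infection included.

def resummon_in_city(pet: Entity, crowd: list) -> None:
    if pet.infected:
        for bystander in crowd:   # the debuff spreads to anyone nearby
            bystander.infected = True

hunter, pet = Entity("hunter"), Entity("pet")
hunter.infected = pet.infected = True      # both hit by Corrupted Blood
leave_raid(hunter, pet)
crowd = [Entity(f"citizen{i}") for i in range(3)]
resummon_in_city(pet, crowd)
print(sum(c.infected for c in crowd))      # 3 -- the outbreak begins
```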

The result was as you’d expect: quarantines, mass deaths, social distancing, and players scattering to sparsely populated areas. Sounds familiar, doesn’t it?

A minor CrowdStrike update brings Windows systems down

This one is very recent and still painful to many of the people affected by it. But it’s also potentially the largest software outage in history, so how could it not make the list?

On 19 July 2024, Windows machines across the globe shut down, grounding planes, stopping business transactions, crippling emergency response systems, and generally affecting every manner of tech in between.

The cause? A logic flaw in an undertested update to CrowdStrike Falcon. Because this security platform has kernel access to Windows machines, the faulty update caused PCs globally to crash on startup.
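CrowdStrike’s published root-cause analysis described the flaw as an out-of-bounds read: the sensor’s content interpreter expected 21 input fields, but the new rule data supplied only 20. A rough user-space analogy (invented names, Python standing in for kernel code) looks like this:

```python
# Rough user-space analogy of the Falcon flaw, based on CrowdStrike's
# public root-cause analysis: code reads a 21st input field from rule
# data that only carries 20.

rule_inputs = [f"field{i}" for i in range(20)]   # update supplied 20 fields

def evaluate_rule(inputs: list) -> str:
    return inputs[20]   # written against 21 fields: index out of range

try:
    evaluate_rule(rule_inputs)
except IndexError:
    # In user-space Python this is a recoverable exception. In a Windows
    # kernel driver, the equivalent out-of-bounds memory read is a fatal
    # fault: the machine blue-screens before it finishes booting.
    print("out-of-bounds read")
```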

While less than 1% of Windows devices were affected, that’s still around 8.5m machines. And since only devices running CrowdStrike were hit, the pain fell on businesses and public organizations, while personal users were largely spared.

The cost of the outage to the global economy is unknown, but it will be in the billions. Awkward.

Integer overflow sends a $300m rocket the wrong way

Every number a computer stores has a limit to how big it can get, set by how many bits are used to hold it. On modern 64-bit systems that limit is so large it’s essentially irrelevant, but it hasn’t always been like that.

The Ariane 5, an uncrewed rocket operated by the European Space Agency, flew its maiden flight in 1996 with guidance software that stored a key figure in a 16-bit signed integer, capable of holding values only up to 32,767.

However, this is a space rocket; big numbers are the order of the day. So when the newly launched Ariane 5, flying a faster trajectory than its predecessor, started producing velocity figures far larger than that 16-bit value could hold, the guidance software failed, the rocket veered drastically off target, and it was forced to self-destruct.
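The original failure was a conversion of a 64-bit floating-point value into a 16-bit signed integer, and the same trap is easy to reproduce in Python with the struct module. The numbers below are illustrative, not actual flight telemetry:

```python
import struct

# Sketch of the Ariane 5 overflow: packing a velocity-like figure into
# a signed 16-bit slot ('>h' in struct notation, range -32768..32767).
# The values here are illustrative, not real flight data.

def to_int16(value: float) -> bytes:
    return struct.pack(">h", int(value))

print(to_int16(20_000.0))   # fits: a reading the old range could handle
to_int16(40_000.0)          # struct.error -- like the unhandled exception
                            # that shut down Ariane 5's guidance system
```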

A thorough investigation returned several key conclusions for future launches. One of the primary recommendations? More comprehensive testing.

What does this teach us?

While most of us will never face such severe consequences when something goes wrong, it’s worth remembering: if something can go wrong, assume it will.

The best way to prevent errors and issues is to thoroughly and continuously test every element of a project, business plan, or communication. Find the tools and systems that work for you, and get them running.

If we can help you with your testing challenges, we’d be happy to. Start a free trial or speak to us here.