CloudBleed

What programming languages have to do with safety, and why it matters

by Malcolm Sparks

Published 2017-02-28

A major security event happened on the internet last week: a bug at the firm Cloudflare was found to have inadvertently leaked vast amounts of private data into the public domain. The leaked data included internet users' passwords, cookies, credentials, secret keys, chat messages from dating sites, photos, private health data and much, much more.

Like the Heartbleed incident of 2014, the impact of CloudBleed is not limited to a single website: thousands of sites were affected, from as far back as September 2016 until the bug was fixed last week (February 18th). Debate continues as to how long the clean-up effort will take.

To understand why so many sites are affected, let's take a moment to explain what service the firm Cloudflare offers its customers.

When you visit a site belonging to one of Cloudflare's customers, say uber.com for example, your request for the page is sent to a server in Cloudflare's large global network, which forwards it on to Uber's website. Sitting in the middle like this, Cloudflare can detect and defend Uber (and its other customers) against distributed denial-of-service (DDoS) attacks, while also reducing the distance between you and the website's content, making the website load faster.

Cloudflare is a successful service, as it claims on its home page:

Cloudflare makes more than 5,500,000 Internet properties faster and safer (sic).

Many of these websites are among the most visited sites on the web.

While Cloudflare themselves admit the bug "affected 3,438 domains, and 150 Cloudflare customers", that figure obscures the severity of the issue. If you used any website that happened to share a Cloudflare instance with one of those 3,438 domains, your private data could have been leaked.

What was the cause?

Cloudflare's service is built on the Nginx web server, which is written in C. Part of the reason for Nginx's success was its ability to exploit improvements in operating systems designed to increase the maximum number of simultaneous network connections a single machine could support.

Back in 2002, when the Nginx project was started, these improvements were only accessible from C, and so naturally Nginx (and its extensions) was, and still is, written in that language.

(Nowadays, these improvements are enjoyed by languages such as Clojure, as our yada library demonstrates.)

One of the features that Cloudflare provides its customers is the ability to modify web pages in transit, as they travel between the origin website and the user's browser. One reason for doing this is to add code that tracks users and provides analytics for website owners; others are security features such as obfuscating email addresses and rewriting 'http' links to more secure 'https' ones.
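To make the idea of in-transit rewriting concrete, here is a minimal sketch in C, upgrading 'http://' links to 'https://'. It is a hypothetical illustration, not Cloudflare's actual code; a real rewriter must track parser state rather than blindly matching bytes:

    /* Hypothetical sketch of an in-transit rewrite: scan a buffer of HTML
     * and upgrade "http://" links to "https://". */
    #include <stdio.h>
    #include <string.h>

    static void rewrite_links(const char *html, size_t len)
    {
        size_t i = 0;
        while (i < len) {
            /* The `len - i >= 7` guard prevents reading past the buffer. */
            if (len - i >= 7 && memcmp(html + i, "http://", 7) == 0) {
                fputs("https://", stdout);
                i += 7;
            } else {
                putchar(html[i]);
                i++;
            }
        }
    }

    int main(void)
    {
        const char page[] = "<a href=\"http://example.com\">link</a>";
        rewrite_links(page, sizeof page - 1);
        return 0;
    }

Even in this toy version, notice how much of the code is devoted to manually guarding against reading beyond the end of the buffer.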

To achieve this, Cloudflare have written extensions to Nginx, in C, which must first parse a web page's HTML in order to modify it. It was this HTML parser that contained the bug.

The original investigation report posted by Cloudflare discusses in detail the subtlety of the bug and the confluence of factors that caused it, but the bottom line is that it resulted in a buffer overflow.

Buffer overflows are a notorious class of bug in which a program reads from, or writes to, memory locations beyond those it is meant to access.

It's a bit like a car smashing through the central barriers of a highway and continuing to drive on the other side into oncoming traffic.

Or like getting lost in the departures area of an airport and finding yourself in the arrivals hall. The designers of airport security work hard to make sure this cannot happen, so if it does, it's due to a mistake somewhere.
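In Cloudflare's case, the report explains that the parser's end-of-buffer check tested for equality with the end pointer, and a separate flaw allowed the pointer to jump past the end, so the check never fired. The following simplified sketch, hypothetical rather than the actual generated code, shows how that pattern overruns a buffer:

    /* Simplified, hypothetical sketch of the failure pattern: an
     * end-of-buffer check using == rather than >=. An input ending in
     * '<' advances p from pe - 1 straight to pe + 1, skipping the
     * equality check, so the loop reads past the end of the buffer. */
    #include <stdio.h>

    static void scan(const char *p, const char *pe)
    {
        while (p != pe) {          /* BUG: should be p < pe */
            putchar(*p);
            if (*p == '<')
                p += 2;            /* consume '<' and the byte after it */
            else
                p += 1;
        }
    }

    int main(void)
    {
        const char buf[] = "abc<";
        scan(buf, buf + sizeof buf - 1);  /* overruns buf: undefined behaviour */
        return 0;
    }

Once the pointer has skipped past the end, the loop keeps reading whatever happens to sit in adjacent memory, which is exactly how private data belonging to one request can leak into the response for another.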

Who's to blame?

It's tempting to lay the blame on the individual developer who wrote the bug, but that would be a mistake.

Let's compare our attitude to programming bugs with the attitude to human error in the aircraft safety industry.

I recently read a book by Chesley Sullenberger about the events of January 2009, when he and his crew remarkably ditched an Airbus A320 in the Hudson River without a single loss of life.

The airline industry has enjoyed a remarkable safety record in recent years. As an undoubtedly skilful and experienced pilot himself, Chesley Sullenberger is the first to attribute those improvements to the disciplined exercise of learning from accidents, and to the building of a culture of safety inside the industry designed to avoid them. I was surprised that much of the book was about the history of airline safety improvement rather than the heroics of the day, but by the end I realised the author was right to highlight it.

We have known about buffer overflow errors for many years. One of the principal contributions of the Java programming language was the widespread adoption of the 'virtual machine', designed to eliminate this class of bug once and for all.

Java is an old language, but it has built-in defences against buffer-overflow bugs: every array access is bounds-checked at runtime, so an out-of-bounds read raises an exception rather than silently returning adjacent memory. Other JVM languages such as Clojure bolster these defences to protect against further classes of bug, such as those caused by mutable state, side-effects and race conditions.

Of course, with knowledge and discipline it's possible for a skilled engineer to write safe code in C. The point is that it's also possible to write unsafe code, and when the consequences are as severe as CloudBleed, we have to move beyond knowledge and discipline.
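To see what that discipline demands, here is a defensive rewrite of the earlier hypothetical sketch: the loop condition tolerates a pointer that has overshot, and multi-byte consumption is bounds-checked before it happens:

    /* Defensive rewrite of the earlier sketch: the loop exits even if
     * the pointer overshoots, and every multi-byte advance is checked
     * against the space remaining in the buffer. */
    static void scan_safely(const char *p, const char *pe)
    {
        while (p < pe) {                    /* tolerates overshoot */
            putchar(*p);
            if (*p == '<' && pe - p >= 2)
                p += 2;                     /* safe: two bytes remain */
            else
                p += 1;
        }
    }

Every such check must be remembered, at every pointer advance, on every code path; forget just one, and the compiler will not complain.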


Building a culture of safety in programming

Today, safety in programming languages is no longer an academic argument between software engineers but a matter of critical importance. Gone are the days when software bugs didn't matter: today the world's financial, transportation and medical systems rely on safe software. We can no longer be complacent.

There is no doubt in my mind that, with few exceptions, C is an entirely inappropriate language for building systems today. The performance of a language like C may be better than that of Java or Clojure, but the difference is hardly substantial, and certainly not enough to justify the safety trade-offs involved.

As for embedded devices, as Jarppe Länsiö reminded us at ClojuTRE 2016, small mobile devices were running Java decades ago, and there are numerous options for embedding Java. JDK 9, slated for release in the coming months, will help to reduce the JVM's footprint further. And if you're working on an 'Internet of Things' device, you have even less excuse for writing unsafe code.

If we want to improve safety, trusting in the perfection of developers is a foolish hope. Developers are humans, and humans make mistakes. We should instead adopt a strategy of continuous improvement, analysing the cause of accidents, developing strategies to immunise against future occurrences and baking those strategies into our tools.

Blaming individual programmers for bugs is like blaming the operators for the nuclear disaster at Chernobyl:

A 2004 report by the National Academy of Sciences identified two important differences between the conditions that led up to the Chernobyl disaster and the U.S. nuclear energy program. The first key difference is in how the plants are designed and built. All U.S. power reactors have extensive safety features to prevent large-scale accidents and radioactive releases. The Chernobyl reactor had no such features and was unstable at low power levels.

Of course, there are times when humans make mistakes. The key is to build safety into the system to ensure that the impact of those mistakes is limited. In software, improvements to our programming languages and tools are our best defence.

How many more incidents like CloudBleed must we endure before we finally learn that lesson?
