Beating bugs with brute force

Improve product quality and find bugs faster by generating tests.

For years I've been writing the tests for the applications I write. However, it turns out that computers can do a better job. Property based testing is the doorway to a more advanced world of testing that can dramatically improve quality and find the kinds of bugs that would often go undetected outside the live system.

Generative Testing

When you write tests, you will often have to write test data (aka fixtures).

For example, let's say we have a micro-service for dealing with customer details. This is likely a CRUD service, so we might write a test that POSTs a new customer and then tries to GET the customer. We will define some sample data to build a new customer request:

(def a-customer
  {:name "David Smith"
   :age 33
   :gender "male"})

We can then re-use this in various different tests. We may even turn this into a function to try to vary the details for certain tests:

(defn a-customer
  ;; overrides is a map of fields to vary for a particular test
  [overrides]
  (merge {:name "David Smith"
          :age 33
          :gender "male"}
         overrides))

An alternative would be to use generators to create our data:

(def a-customer
  (gen/hash-map :name gen/string
                :age gen/int
                :gender gen/string))

(deftest my-test
  (let [new-customer (gen/generate a-customer)]))

The gen namespace is part of a Clojure library called test.check, which provides functions for generating random data. For example, gen/string will generate a random string, gen/int will generate a random integer, and so on.

So why would you use the generated version? First, it means you don't have to waste your time coming up with witty values. More importantly, you are describing more accurately what your function has to accept. In production, the service will not always receive a customer whose name is "David Smith"; it will receive a name whose value is a string. With a generator we state this explicitly. On top of that, generators tend to generate loads of rubbish that can screw with your functions surprisingly quickly; I've found quite a few bugs the first time I hit the service with generated data.
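As a rough sketch of how such a generator can be tightened up, the snippet below constrains each field with combinators that ship with test.check. The specific constraints (a non-empty alphanumeric name, an age between 0 and 120, a fixed set of genders) and the name customer-gen are assumptions made up for illustration; they are not rules taken from the real service.

```clojure
(require '[clojure.test.check.generators :as gen])

;; Hypothetical constraints, chosen only for illustration.
(def customer-gen
  (gen/hash-map :name   (gen/not-empty gen/string-alphanumeric)
                :age    (gen/choose 0 120)
                :gender (gen/elements ["male" "female" "other"])))

;; gen/sample returns a small collection of generated values, which is
;; handy for eyeballing what a generator actually produces.
(gen/sample customer-gen 3)
```

Looking at a few samples like this is usually enough to judge whether a generator is producing the kind of data you expect.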

Generators can be a bit daunting at first. I thought that they might become so complicated that you would need to test your generators! It turns out that this is not the case: property testing libraries like test.check have the tools to generate just about anything fairly easily. You can also come up with your own patterns and helpers to make things easier. One of the best examples of this is Plumatic Schema's experimental generators.

Given any schema, the library will provide you with a generator to generate values that conform to this schema. If you are already validating your new customer in the microservice using schema then there is really no work involved:

(defschema Customer
  {:name s/Str
   :age s/Int
   :gender s/Str})

(sg/generate Customer)

It really is as simple as that. We've eliminated the tedious work of writing sample data and at the same time we've increased the scope for finding bugs.

Property Based Testing

Property based testing is a method of testing functions pioneered by the Haskell community. From Hackage:

QuickCheck is a library for random testing of program properties.

The programmer provides a specification of the program, in the form of properties which functions should satisfy, and QuickCheck then tests that the properties hold in a large number of randomly generated cases.

Property based testing libraries such as test.check have two distinct parts. The first part is a framework for random value generation, as we saw above. The second part is a clever test runner that, when a test fails, will try to find the simplest failing case.
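To illustrate the second part, here is a minimal sketch using a deliberately false property. The property and the names short-vectors-prop and outcome are invented for this example; the calls themselves are standard test.check.

```clojure
(require '[clojure.test.check :as tc]
         '[clojure.test.check.generators :as gen]
         '[clojure.test.check.properties :as prop])

;; A deliberately false claim: "every vector has fewer than 5 elements".
(def short-vectors-prop
  (prop/for-all [v (gen/vector gen/int)]
    (< (count v) 5)))

(def outcome (tc/quick-check 100 short-vectors-prop))

;; The input that first failed: whatever the random run happened to hit.
(:fail outcome)

;; The shrunk input: the runner removes and simplifies elements while the
;; property still fails, so this should be a five-element vector.
(get-in outcome [:shrunk :smallest])
```

The first failure is typically a long vector of arbitrary integers, whereas the shrunk case is small enough to read at a glance, which is what makes the failure easy to debug.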

As a simple example taken directly from the test.check documentation, let's say you have a function called sort which will sort a vector of integers. You provide a generator which will generate vectors of random sizes containing random integers, and you then use these as inputs to your function. Finally, you provide a set of properties that should hold true; in this example we can say that sorting a vector twice should give the same result as sorting it once. A library such as QuickCheck or Clojure's test.check will then try to find an example that will cause the test to fail by generating hundreds or thousands of test cases.

(def sort-idempotent-prop
  (prop/for-all [v (gen/vector gen/int)]
    (= (sort v) (sort (sort v)))))

(tc/quick-check 100 sort-idempotent-prop)
;; => {:result true, :num-tests 100, :seed 1382488326530}
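Idempotence alone does not pin down sorting (a function that returns its input unchanged is also idempotent), so in practice you would state a few more properties. The two below are a sketch of what else could be checked; their names are made up for this example.

```clojure
(require '[clojure.test.check :as tc]
         '[clojure.test.check.generators :as gen]
         '[clojure.test.check.properties :as prop])

;; Every adjacent pair in the output is in non-decreasing order.
(def sort-ordered-prop
  (prop/for-all [v (gen/vector gen/int)]
    (every? (fn [[a b]] (<= a b))
            (partition 2 1 (sort v)))))

;; Sorting neither adds nor drops elements.
(def sort-permutation-prop
  (prop/for-all [v (gen/vector gen/int)]
    (= (frequencies v) (frequencies (sort v)))))

(tc/quick-check 100 sort-ordered-prop)
(tc/quick-check 100 sort-permutation-prop)
```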

This all sounds great. However, all the online examples test small, pure functions that are only a small part of the software we write. Impressive as it is, I was struggling to see how often I would use this type of testing in my everyday development of systems such as HTTP microservices, which often have limited functionality and not much complex logic. That all changed once I started to have a go!

You wanna play rough?

In a recent project we had built a microservice that would take a request through a RESTful interface, provide a small amount of validation and then place the result on RabbitMQ. We decided to use our new yada library to take care of all the HTTP/REST infrastructure for us.

The service wouldn't be used in a particularly intensive way. However, the team felt that it would be a good idea to write some load tests to see at what point it falls down and what happens when it does.

Say hello to my little friend!

We decided to use clj-gatling for our load testing. This is a Clojure testing tool designed primarily for hitting servers with thousands of requests in parallel and producing nice reports about what happened. Since we had already written integration tests to check the functionality of the service (using test.check), it was simply a matter of reusing these tests in a slightly modified manner. We would hit the service on a few of the endpoints and check that the appropriate messages were present on the Rabbit queue. I knew that both RabbitMQ and the aleph server that yada is built on were designed for high performance, so I imagined that we would have to really push things to see any problems. After all, we had already verified that the service worked reliably with the integration tests.

(deftest load-test-all-endpoints
  (let [{:keys [api-root test-config]} (test-common/load-config)]
    (g/run-simulation [{:name     "sequentially try each endpoint"
                        :requests [{:name "Put user on queue"
                                    :fn   (partial post-user api-root)}
                                   {:name "Put articles on queue"
                                    :fn   (partial post-articles-csv api-root)}
                                   {:name "hit health check endpoint"
                                    :fn   (partial health-check api-root)}]}]
                      (:users test-config)
                      {:requests (:requests test-config)})
    (let [total-tests (+ @post-user-count @post-articles-csv-count)]
      (is (= 0 (count @errors)) (format "some requests failed e.g. %s" (first @errors)))
      (eventually-is (= total-tests (count (keys (deref test-common/msgs)))) (:message-timeout test-config)
                     (format "all messages should be received within %sms" (:message-timeout test-config))))))

Who put this thing together?

In the first run I decided to hit the service with 1000 requests from 10 'users' in parallel. One of the endpoints was a CSV file upload and I was surprised to find that some of the messages from this endpoint had not appeared on the queue. My initial reaction was that perhaps there was a small overhead in getting messages on to Rabbit and that, although throughput would be high, I might need to allow a bit of time after the test had fired its requests to see all the results. However, I discovered that the messages were simply not getting put on the Rabbit queue; they were just disappearing.

With some old-school 'print line' debugging, it was possible to see that the request was getting in to the server but the body was not appearing in my yada handler. This would happen for about 0.5% - 1% of requests, which of course we would never have found with our integration tests. Perhaps occasionally we would have had a failed Jenkins build, but on running it again everything would pass. It would, in all probability, have been put down to something weird on the Jenkins slave and ignored. We would have lost data in production at some point.
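A back-of-the-envelope calculation shows why volume matters here. The sketch below assumes that each request fails independently with the same probability, which is a simplification and not something we measured; the function name is made up for this example.

```clojure
(defn chance-of-seeing-failure
  "Probability that at least one of n requests fails, assuming each request
  fails independently with probability p."
  [p n]
  (- 1.0 (Math/pow (- 1.0 p) n)))

(chance-of-seeing-failure 0.005 1)    ; one upload per test run: 0.5%
(chance-of-seeing-failure 0.005 10)   ; ten uploads: still under 5%
(chance-of-seeing-failure 0.005 1000) ; a thousand uploads: above 99%
```

Under that assumption, an integration suite that uploads a file once or twice per build would almost always pass, while a load test firing hundreds of uploads is close to certain to trip over the bug.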

Lesson number one; Lesson number two

Firstly, this made us realise that we should give a 400 response if the body was empty, something we had failed to think about.
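A minimal sketch of that kind of guard is shown below. It is written as a plain Ring-style wrapper rather than with yada's own API, and it assumes the body has already been read into a string under :body; the function name and the response text are hypothetical.

```clojure
(require '[clojure.string :as str])

(defn reject-empty-body
  "Hypothetical guard: wraps a handler and answers 400 when the request body
  is missing or blank, instead of passing an empty payload downstream."
  [handler]
  (fn [request]
    (if (str/blank? (:body request))
      {:status 400
       :body   "Request body must not be empty"}
      (handler request))))

;; Usage: wrap any handler function that takes a request map.
(def guarded (reject-empty-body (fn [_request] {:status 202})))
(guarded {:body ""})         ; rejected with a 400
(guarded {:body "a,b\n1,2"}) ; passed through to the wrapped handler
```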

Next, careful investigation revealed that the library yada was using for finding multipart boundaries was broken. As a side note, this library was a prime candidate for property based testing and it would have revealed this bug. We were making use of the new multipart streaming-upload feature of yada which allows a web API to process large request bodies asynchronously, useful for uploading large images and videos. When handling multipart request bodies, yada needs to efficiently detect boundaries (known sequences of characters). The library it was using (clj-index) had a bug in it that meant that in certain circumstances boundaries would go undetected.

Malcolm, the primary author of yada, developed a new asynchronous implementation of the Boyer-Moore-Horspool algorithm and released a new version.
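To show why this kind of code suits property based testing, here is a small synchronous sketch of Horspool-style searching over strings, checked against Java's String.indexOf as a trusted reference. This is not yada's implementation (which is asynchronous and works on streamed chunks); every name below is made up for this example, and the window comparison uses subs for brevity instead of the usual right-to-left character scan.

```clojure
(require '[clojure.test.check :as tc]
         '[clojure.test.check.generators :as gen]
         '[clojure.test.check.properties :as prop])

(defn shift-table
  "Maps each character of the needle (except the last) to the distance from
  its rightmost occurrence to the end of the needle."
  [^String needle]
  (let [m (count needle)]
    (into {}
          (map (fn [i] [(.charAt needle i) (- m 1 i)]))
          (range (dec m)))))

(defn index-of
  "Index of the first occurrence of needle in haystack, or -1 if absent."
  [^String haystack ^String needle]
  (let [n (count haystack)
        m (count needle)]
    (if (zero? m)
      0
      (let [shifts (shift-table needle)]
        (loop [pos 0]
          (cond
            (> (+ pos m) n) -1
            (= needle (subs haystack pos (+ pos m))) pos
            ;; Skip ahead based on the last character of the current window;
            ;; characters not in the table allow a full needle-length jump.
            :else (recur (+ pos (get shifts
                                     (.charAt haystack (+ pos m -1))
                                     m)))))))))

;; A tiny alphabet makes matches and near-misses far more likely.
(def abc-char (gen/elements [\a \b \c]))
(def haystack-gen (gen/fmap #(apply str %) (gen/vector abc-char)))
(def needle-gen (gen/fmap #(apply str %) (gen/vector abc-char 0 4)))

(def matches-reference-prop
  (prop/for-all [h haystack-gen
                 n needle-gen]
    (= (index-of h n) (.indexOf ^String h ^String n))))

(tc/quick-check 500 matches-reference-prop)
```

Comparing against a trusted reference implementation is one of the simplest properties to write, and a boundary that is occasionally skipped is exactly the sort of failure it tends to surface, complete with a small shrunk example.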

We ran the tests again but still found some failures! Working with Malcolm we found that under certain circumstances, the logic of piecing together the chunks of an uploaded file was incorrect.

The issue was fixed and finally the tests passed. We were able to push the service hard and it continued working flawlessly (well, until we finally ran out of file descriptors!).

Now you're talking to me baby!

So what did I learn from this experience?

  • Load tests are important; they can test more than just performance.
  • Generative tests are vital and can find bugs that would have resulted in loss of revenue.
  • Even wearing a QA hat, we can miss simple failure scenarios that should be planned for and dealt with appropriately (the 400 response in this case).
  • It's vital to use libraries that are either battle-tested or that are actively maintained so that bugs can be fixed promptly.
  • Property based testing should be applied where possible, especially when it comes to testing implementations of algorithms such as Boyer-Moore-Horspool string finding.