Data at the Gates

Coerce to avoid data inconsistencies polluting your codebase

Jon Pither
CEO & Co-founder

A saving architectural pattern is to marshal your data into some canonical format at the gates of your application.

The Problem

A classic anti-example is pulling data out of your relational DB, e.g. PostgreSQL or SQLite, and leaving the data untreated.

Databases aren’t usually happy when you want column names with hyphens (kebabs) in them, and so we have to use last_updated instead of last-updated. Since our JDBC libraries will typically give us a Clojure map back for each row in the table when we run queries, it’s easy to end up with last_updated kicking around further up the application stack if we don’t take steps to convert it. This could disenfranchise other developers who expect to see the idiomatic last-updated instead. Soon the code-base could become littered with transformations munging the data back and forth between representations.

Tensions can escalate when someone starts pulling down data from a 3rd party JSON endpoint containing the lethal lastUpdated, allowing it to proceed unfettered. It’s a matter of time before a triple store gets added and we see product/last-updated. Would someone please think of the children?

Aside from our data keys, the values can be mish-mashed too. A frequent example is confusion over whether a UUID is a String or an actual UUID. We also have the what-should-a-date-be conundrum: Joda, java.util, String, or epoch-based.

The Approach

I’ve learnt my lesson: always get the data sorted out at the boundaries of the application. In the PostgreSQL/SQLite example, whenever rows are returned as Clojure maps, I typically run them through the excellent little camel-snake-kebab library.

For example, I could provide my own wrapper replacement for the clojure.java.jdbc/query function:

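;; Assumes (:require [clojure.java.jdbc :as jdbc] [camel-snake-kebab.core :as kebab] [camel-snake-kebab.extras :as kebab-extras])
(defn- query [db q]
  ;; :row-fn is called with each row map, so convert every key in the row (options map as in recent clojure.java.jdbc)
  (jdbc/query db q {:row-fn #(kebab-extras/transform-keys kebab/->kebab-case-keyword %)}))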

The magic happens in ->kebab-case-keyword, which coerces every last_updated to last-updated as rows come out of the DB (it can also convert from camel case, which helps with the third-party JSON example).
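To make the conversion concrete, here is a dependency-free sketch of the kind of key normalisation involved. The helper names ->kebab-keyword and kebab-keys are made up for illustration, and this is not how camel-snake-kebab is implemented; the real library handles many more edge cases. Dropping the namespace of a namespaced key is an assumption of this sketch, not a recommendation.

```clojure
(require '[clojure.string :as str])

(defn ->kebab-keyword
  "Hypothetical helper: turns \"last_updated\", \"lastUpdated\" or
  :product/last-updated into :last-updated."
  [k]
  (-> (name k)                                   ; `name` drops any namespace
      (str/replace #"([a-z0-9])([A-Z])" "$1-$2") ; lastUpdated -> last-Updated
      (str/replace #"_" "-")                     ; last_updated -> last-updated
      str/lower-case
      keyword))

(defn kebab-keys
  "Applies ->kebab-keyword to the top-level keys of a single row map."
  [row]
  (into {} (map (fn [[k v]] [(->kebab-keyword k) v])) row))

(kebab-keys {:last_updated "2016-03-01" "productId" 42})
;; => {:last-updated "2016-03-01", :product-id 42}
```

Whichever implementation you use, the point is that it runs once, at the gate, so nothing above the data layer ever sees the other spellings.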

Taking this approach further, when we receive data via our exposed REST endpoints we could, and should, coerce the data right then and there. Libraries such as compojure-api, tripod, pedestal-swagger and Yada aim to help with this, allowing you to specify the data schema alongside your routes. Common coercions are then applied, such as sorting out the pesky UUID and date conversions.
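As a rough illustration of the kind of value coercion these libraries apply for you, the sketch below normalises UUIDs and timestamps by hand. It assumes java.util.UUID and java.time.Instant as the canonical types; coerce-payload and the :id and :last-updated keys are hypothetical. The libraries above derive this behaviour from a schema rather than from hand-written functions.

```clojure
(import '(java.time Instant)
        '(java.util Date UUID))

(defn ->uuid
  "Accepts a UUID or its string form; always returns a java.util.UUID."
  [x]
  (cond
    (uuid? x)   x
    (string? x) (UUID/fromString x)
    :else       (throw (ex-info "Cannot coerce to UUID" {:value x}))))

(defn ->instant
  "Accepts an Instant, a java.util.Date, an ISO-8601 string or epoch
  milliseconds; always returns a java.time.Instant."
  [x]
  (cond
    (instance? Instant x) x
    (instance? Date x)    (.toInstant ^Date x)
    (string? x)           (Instant/parse x)
    (integer? x)          (Instant/ofEpochMilli (long x))
    :else                 (throw (ex-info "Cannot coerce to Instant" {:value x}))))

(defn coerce-payload
  "Hypothetical gate function for an incoming request body."
  [payload]
  (-> payload
      (update :id ->uuid)
      (update :last-updated ->instant)))
```

Everything past coerce-payload can then rely on :id being a UUID and :last-updated being an Instant, whatever shape they arrived in.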

Prismatic Schema is often the hero here. I once worked on a data processing pipeline where we erected schema barriers at the start of the pipeline (where data was picked up from a file drop), and at the end (where we published data into an ElasticSearch instance). Along the way we also persisted the data at various stages, and so we had intermediate schemas.

Not only did this give us contractual guarantees about the data at certain points in the processing pipeline, it also helped our overall reasoning about the code. Once you have a Prismatic schema definition, it can become the basis for writing applicable coercions using the Schema API.
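A minimal sketch of what such a schema-driven gate might look like with Schema's coercion namespace. The Product schema, its keys, and the product-at-the-gate function are invented for illustration; the date handling via inst-matcher is my own assumption about how you might plug in a custom coercion, not something taken from the pipeline described above.

```clojure
(require '[schema.core :as s]
         '[schema.coerce :as coerce]
         '[schema.utils :as su])
(import '(java.time Instant)
        '(java.util Date))

;; Hypothetical schema for a record arriving at the gate.
(s/defschema Product
  {:id           s/Uuid
   :name         s/Str
   :last-updated s/Inst})

(defn- string->inst
  "Parses ISO-8601 strings; leaves anything else for validation to reject."
  [x]
  (if (string? x)
    (try (Date/from (Instant/parse x))
         (catch Exception _ x))
    x))

(defn- inst-matcher
  "Custom matcher: supplies a coercion only for s/Inst."
  [schema]
  (when (= schema s/Inst) string->inst))

(def coerce-product
  ;; Try the custom matcher first, then fall back to Schema's JSON
  ;; coercions (which cover string -> UUID among others).
  (coerce/coercer Product
                  (coerce/first-matcher [inst-matcher
                                         coerce/json-coercion-matcher])))

(defn product-at-the-gate
  "Returns the coerced product, or throws if it does not satisfy the schema."
  [raw]
  (let [result (coerce-product raw)]
    (if (su/error? result)
      (throw (ex-info "Invalid product" {:errors (su/error-val result)}))
      result)))
```

The same Product definition then serves both as the contract and as the driver for coercion, which is what makes the schema barrier cheap to maintain.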
