The Crux 'Document Store'

Making it possible to store your documents in S3 + more

The Crux 'Document Store'

May 12, 2020
James Henderson

We’ve always aimed for Crux to be 'unbundled' - giving users the choice of which underlying storage engines to use for their specific deployments. Whether it be Apache Kafka, Facebook’s RocksDB or the many supported JDBC backends, Crux users benefit from decades of development effort put into the stability and performance of these data stores.

We have been busy updating the core components of Crux to better reflect what Crux needs from these storage engines, and to make the most of what these engines can provide.

Crux now supports the concept of a pluggable document store, which is the central location for master document storage in Crux. You are still able to use an event log such as Kafka to hold all your data, including documents, but you can also use JDBC-based document stores or more cost-effective object storage services like S3. The choice is yours!

Since 20.05-1.8.3-alpha Crux now has both 'transaction log' and 'document store' protocols:

(defprotocol TxLog
  (submit-tx [this tx-events])
  (open-tx-log [this after-tx-id]))

(defprotocol DocumentStore
  (submit-docs [this docs])
  (fetch-docs [this ids]))

It is now possible, for example, to use Kafka as the TxLog, and a JDBC database as the DocumentStore, taking advantage of the performance and scaling characteristics of both.

Alongside this change we also released an S3 document store implementation for users deploying Crux within AWS. This module demonstrates how to implement the protocol and we are excited to see a variety of alternative DocumentStore implementations emerge in the community for other popular storage systems over the coming months!

Before & After Crux Architecture

For more background on why we made this change, let’s start with how Crux works when processing transactions…

Separating transactions from documents

In Crux, we make a distinction between a transaction’s operations and its documents - for example, given a transaction of [[:crux.tx/put {:crux.db/id :luke, :name "Luke"}]]:

  • we hash the {:crux.db/id :luke, :name "Luke"} document, giving us "ee9d9c46…​"

  • we store the document, keyed by its hash

  • we put the transaction operation on the transaction log, but replace the document with its hash: [[:crux.tx/put "ee9d9c46…​"]]

We do this for eviction - this distinction between documents and transactions means that, if Luke were to request that we irretrievably remove all his data (e.g. due to a GDPR/CCPA request), we only need to forget the content of the document and the transaction log can stay immutable.

When the indexer receives one of these 'evict' transactions, it not only has to remove the data from its indexes, it also has to request that the pluggable transaction log component do the same.

Consuming the transaction log

In Kafka (one implementation of this transaction log), we achieved this using two topics - one immutable 'tx-topic' to queue the operations, one mutable 'doc-topic' with the documents.

This is optimal for users who want to keep all their data on Kafka as the primary golden store, but other groups of users may want to take advantage of other stores and to reduce their Kafka foot-print. Ultimately there is no one-size-fits-all and what users want is more configuration flexibility.

Now we’ve separated out these two golden stores: the 'transaction log' and 'document store'. This simplifies the overall architecture of Crux and gives users more flexibility in defining their own Crux topologies.


For an overview and demo of the S3 module, see this excerpt from our recent Development Showcase:

Get in touch!

Please let us know if you’ve got any thoughts or questions about this change - we can be contacted through all of our usual channels.

Cheers and have fun!

Crux Team