Datalog for trees in Clojure

How to use Datalog in Clojure to match and extract information from trees

I recently became curious about Datalog and its applications on program analysis. There seems to be quite a lot of literature around the subject, and given that all code can be represented as a tree, it would imply that Datalog is especially well-suited for querying trees. I was keen to find out more on a practical level, but I thought that learning about program analysis and tree queries with Datalog at the same time would be a bit challenging, so I thought to experiment on a kind of tree that I know a bit better: the DOM.

This blog post will show you how to encode a DOM-like structure to DataScript data, and how to use DataScript queries (along with a few rules) to perform queries against the structure in order to extract data. You can use this as a way to bootstrap your understanding of how Datalog can be used to query any tree structure.

Weapons of choice

The experiment was done in Clojure of course, and I've uploaded it as a project on github, so if you'd like to try stuff as you read, do:

git clone git@github.com:stathissideris/datascript-dom.git

And start your REPL on the project with Leiningen.

There are two flavours of Datalog in the Clojure ecosystem, Datomic and DataScript. Datomic is of course a commercial database with a free version, and DataScript is an open source in-memory-only Datalog database that used to be ClojureScript-only, but now also runs on the JVM. Because this is only a small experiment (and because it's more forgiving/flexible with schemas etc), I chose to go with DataScript in this case, but the same queries would apply to Datomic with very few changes (the Datomic schema would need to be much more extensive).

For HTML parsing, I used Daniel Janus' clj-tagsoup which provides a robust way to parse messy HTML into a hiccup-like data structure (you may end up with extra <body> and <head> tags when using it – not sure why, but this is just an experiment).

Encoding the tree

The first task is to take the output of clj-tagsoup and turn it into transactions that can then be passed to DataScript in order to establish a number of facts about our tree. Let's have a look at the output of clj-tagsoup:

(require '[pl.danieljanus.tagsoup :as html])
(html/parse-string "<html><body class=\"test\"><p>A<b>B</b>C<i>D</i>E</p><p>hohoho</p></body></html>")

Produces:

[:html {} [:body {:class "test"} [:p {} "A" [:b {} "B"] "C" [:i {} "D"] "E"] [:p {} "hohoho"]]]

You can see that it looks like the hiccup syntax, but a bit more consistent because you always get a map in the second position, even if there are no attributes.

There is a nice correspondence between DOM and DataScript: the DOM elements have attributes, entities in Datalog also have attributes, and that's how we are going to encode them. To simplify things a bit, we are not going to prefix the HTML attribute keywords with a namespace, we'll just use them as they are returned from clj-tagsoup.

The tag name is going to be encoded as a :dom/tag attribute in DataScript to prevent any clashes with the bare keywords of the actual attributes. DataScript does not expect you to put anything in the schema unless there is something special about it (cardinality or a special type), so we don't need to declare any of the attributes in advance, we will simply store them as we encounter them. Both DataScript and Datomic support transactions that look like nested maps, but we would need to transform the original tree to something that resembles the enlive syntax:

[{:dom/tag :html
  :child
  [{:dom/tag :body
    :class "test"
    :child
    [{:dom/tag :p
      :child
      [{:dom/tag :text-node :text "A"}
       {:dom/tag :b :child [{:dom/tag :text-node :text "B"}]}
       {:dom/tag :text-node :text "C"}
       {:dom/tag :i :child [{:dom/tag :text-node :text "D"}]}
       {:dom/tag :text-node :text "E"}]}
     {:dom/tag :p
      :child [{:dom/tag :text-node :text "hohoho"}]}]}]}]

Notice that free text elements are encoded as artificial :text-node tags. This transaction captures the basic information about tags and the structure of the tree. The naming of the :child key may seem odd in this context because it's singular, but it reads better like that in queries ([?a :child ?b] – "a has a child called b").

What's missing from this transaction is the order of siblings on the same level: if we stored it like that, there wouldn't be a way to figure out the order of the children of the first paragraph, or even which paragraph comes first because Datalog's relationships do not retain order. Because of this, we need to add an index to each entity to explicitly capture its order. So our task is to transform this:

[:html {} [:body {:class "test"} [:p {} "A" [:b {} "B"] "C" [:i {} "D"] "E"] [:p {} "hohoho"]]]

Into this:

[{:dom/tag :html
  :child
  [{:dom/tag :body
    :class "test"
    :child
    [{:dom/tag :p
      :child
      [{:dom/tag :text-node :text "A" :dom/index 0}
       {:dom/tag :b
        :child [{:dom/tag :text-node :text "B" :dom/index 0}]
        :dom/index 1}
       {:dom/tag :text-node :text "C" :dom/index 2}
       {:dom/tag :i
        :child [{:dom/tag :text-node :text "D" :dom/index 0}]
        :dom/index 3}
       {:dom/tag :text-node :text "E" :dom/index 4}]
      :dom/index 0}
     {:dom/tag :p
      :child [{:dom/tag :text-node :text "hohoho" :dom/index 0}]
      :dom/index 1}]
    :dom/index 0}]
  :dom/index 0}]

In contrast to Datomic, DataScript is pretty free-form when it comes to schemas, so the schema you are going to need is pretty minimal:

(def schema
  {:child {:db/valueType   :db.type/ref
           :db/cardinality :db.cardinality/many}})

Because all relationships are bi-directional in Datalog, you can always query for the parent when you know the child, so it doesn't matter which direction the reference is pointing to.

You may have noticed that in order to perform this conversion we need to have some context for each node of the tree being visited, namely its previous sibling in order to figure out what index to assign to it. Because we are doing a context-aware tree walk, the functions of the clojure.walk namespace won't be sufficient. We need to bring out the tree superweapon: zippers.

As we walk the tree, we are going to convert each node from its clj-tagsoup vector-based representation into the enlive-like representation suitable for the transaction, using a simple function:

(defn- to-entity [node]
  (if-not (string? node)
    (merge
     {:dom/tag (html/tag node)}

     ;;include attributes, but not :id
     (dissoc (html/attributes node) :id)

     ;;id will ne given a special name
     (when-let [id (:id node)] {:dom/id id})

     ;;assoc children -- if present
     (when-let [children (html/children node)]
       {:child children}))

    ;;if it's a string, encode as text-node
    {:dom/tag :text-node
    :text    node}))

If you look at the documentation of clojure.zip/zipper, you'll see that in order to use a zipper over your tree structure, you need 3 functions: branch?, children and make-node. Based on the output of to-entity, creating the zipper will look like this:

(defn dom-zipper [dom]
  (zip/zipper
   (fn [node] (not-empty (:child node))) ;;branch?
   :child ;;children
   (fn [node children] (assoc node :child children)) ;;make-node
   dom))

We can use this zipper to walk and transform the original tree:

(defn dom->transaction [dom]
  (loop [zipper (dom-zipper dom)]
    (if (zip/end? zipper)
      [(zip/root zipper)]
      (recur (zip/next (replace-node zipper))))))

zip/next visits each node in the tree, depth-first. When the end is reached, zip/root returns the tranformed tree. The actual node tranformation happens in replace-node:

(defn- replace-node [zipper]
  (let [dom-index (or (some-> zipper zip/left zip/node :dom/index inc) 0)]
    (zip/replace
     zipper
     (-> zipper
         zip/node
         to-entity
         (assoc :dom/index dom-index)))))

dom-index is calculated by looking to the left of the current position of the zipper, which if it contains a node, it will have already been transformed (because zip/next traverses the tree left to right). If there are no siblings to the left of the current position, zip/left will return nil, and some-> will bail out and return nil. Otherwise the chain of calls continues to zip/node and all the way to inc which calculates the index as the index of the node to our left +1. If at any point this expression evaluates to nil, we assume we are at the first node, so the index becomes zero. The rest of the function replaces the old tree node with the result of to-entity and assocs the dom-index.

At this stage, we are ready to load up our data into DataScript and start querying!

(def small-dom
  (html/parse-string
    "<html><body class=\"test\"><p>A<b>B</b>C<i>D</i>E</p><p>hohoho</p></body></html>"))
(def small-conn (d/create-conn schema))
(d/transact small-conn (dom->transaction small-dom))

First contact with queries

First let's see if we can query on the class attribute to find the <body> tag:

(d/q '[:find [(pull ?node [*]) ...]
       :where
       [?node :class "test"]]
     @small-conn)

Produces:

[{:db/id 2 :child [{:db/id 3} {:db/id 11}] :class "test" :dom/index 0 :dom/tag :body}]

For the query syntax see the Datomic documentation. The weird syntax in the find specification is using the pull API to retrieve the whole of the entity (if we just said ?node we would simply get the entity's :db/id).

To find the root:

(d/q '[:find (pull ?node [:dom/tag]) .
       :where
       [?node _]
       [(missing? $ ?node :_child)]]
     @small-conn)

Produces:

{:dom/tag :html}

The dot at the end of the find spec instructs DataScript to return a single result (we do expect to have a single root after all!)

As we experiment with queries, certain patterns emerge that can be defined as re-usable rules in DataScript. The previous query can be written as:

(d/q '[:find (pull ?node [:dom/tag]) .
       :in $ %
       :where (root ?node)]
     @small-conn
     '[[(root ?node)
        [?node _]
        [(missing? $ ?node :_child)]]])

See the actual code for the rules that I defined while experimenting (stored at the top, see (def rules ...)). There are a few rules that warrant some explanation:

[(siblings ?a ?b)
 [?p :child ?a]
 [?p :child ?b]
 [(!= ?a ?b)]]

This simply says that ?a and ?b are siblings when they have the same parent ?p, but they can't be identical.

Here is a slightly more involved rule, which compares the :dom/index attributes of nodes to determine whether they are siblings that are adjacent to each other:

[(next-sibling ?this ?sib)
 (siblings ?sib ?this)
 [?this :dom/index ?i1]
 [?sib :dom/index ?i2]
 [(- ?i2 ?i1) ?diff]
 [(= ?diff 1)]]

Here's a recursive rule that tests whether a node is an ancestor of another node:

[(anc ?par ?child)
 (?par :child ?child)]
[(anc ?anc ?child)
 (?par :child ?child)
 (anc ?anc ?par)]

Note that rules can have multiple definitions (like the anc rule), but each definition has to have the same arity (number of variables). The rules can be used in multiple ways. For example, you can use the anc rule if you know the child but not the ancestor, or you may only know the ancestor and you are querying for the child. Logic programming gives you this kind of flexibility.

Here's the anc rule in action, used to retrieve all the ancestors of the <b> tag recursively:

(d/q '[:find [(pull ?anc [:dom/tag]) ...]
       :in $ %
       :where
       [?node :dom/tag :b]
       (anc ?anc ?node)]
     @small-conn rules)

Produces:

[{:dom/tag :p} {:dom/tag :body} {:dom/tag :html}]

More complex querying with real-life HTML

At this stage we're ready to try the same approach on a more complex DOM. I've saved the IMDB page of the film "TRON" in the resources folder, and we will attempt to extract some basic facts about the film from the HTML:

(def dom (html/parse (io/file "resources/tron.html")))
(def conn (d/create-conn schema))
(def _ (d/transact conn (dom->transaction dom)))

The (def _ ...) is to prevent Clojure from printing the whole result of transact, because it's too large.

Here's how to extract the title and year:

(let [[title year]
      (d/q '[:find [?title ?year]
             :in $ %
             :where
             [?h1 :dom/tag :h1]
             [?h1 :child ?name-node]
             [?name-node :dom/tag :span]
             [?name-node :itemprop "name"]
             (text ?name-node ?title)

             ;;year-node is the next sibling of name-node
             (next-sibling ?name-node ?year-node)
             (?year-node :child ?a-node) ;;...and it contains an <a> tag
             (?a-node :dom/tag :a)
             (text ?a-node ?year)] ;;and the <a> tag contains the year
           @conn rules)]
  {:title title
   :year  year})

Produces:

{:title "TRON", :year "1982"}

The text rule in the previous example either extracts the text of the all text nodes within a container node (caution, their order is not guaranteed!), or returns the text attribute of the node itself if it happens to be a text-node:

[(text ?container ?text)
 [?container :child ?text-node]
 [?text-node :dom/tag :text-node]
 [?text-node :text ?text]]
[(text ?container ?text)
 [?container :dom/tag :text-node]
 [?container :text ?text]]

Here's a function that demonstrates how to use the dom/index attribute to return results in the correct order:

(defn tech-spec [conn label]
  (->>
   (d/q '[:find ?index ?value-text ;;return both to get order
          :in $ % ?label-text
          :where
          [?label :dom/tag :h4]
          [?label :class "inline"]
          (text ?label ?label-text)

          (siblings ?label ?value)
          (text ?value ?value-text)
          [(!= "|" ?value-text)] ;;exclude separators
          (?value :dom/index ?index)]
        @conn rules label)
   (sort-by first)
   (map (comp (fn [x] (.trim x)) second))))

Example output:

(tech-spec conn "Runtime:") => ("96 min")
(tech-spec conn "Aspect Ratio:") => ("2.35 : 1")
(tech-spec conn "Sound Mix:") => ("70 mm 6-Track" "(70 mm prints)" "Dolby" "(35 mm prints)")

It's not worth copying and pasting the rest of the code here (it's either self-explanatory or heavily commented). You can find queries that extract the title, the ratings, the cast (which also demonstrates passing and using custom functions), genres, duration and tech specs.

Conclusion

I think using DataScript for querying the DOM worked out pretty well, and it looks promising for other types of trees. DataScript's documentation is still a bit sparse – I found myself either reading its source or referring to Datomic's documentation and hoping that it applies to DataScript (in most cases it does, with the exception of custom functions). DataScript assumes some Datomic knowledge, and, to be fair, the readme is very clear about differences to Datomic and about features that are not implemented yet (like not clauses).

This approach is definitely way more verbose than something like XPATH, or a number of other libraries implementing CSS selectors, but the fact that it's extensible with rules and custom functions makes it more versatile, and certainly a good fit for cases where your tree data is not DOM-like.

See also

datomizer - An experimental data structure marshaling library for Datomic

Discuss below or join the discussion on Hacker News.

submit to reddit
Training Information