Processing documents with transducers

Using transducers to extract information from XML, JSON and other documents

by Malcolm Sparks

Published 2015-06-02

In this article we will develop a set of transducers which can be composed to extract data from an XML document (although the principles here can be applied to any tree-structured format, including JSON).

The built-in Clojure library already has support for zippers to navigate an XML document. But there are limitations. Zippers are complicated animals to work with, take quite a lot of setting up and require that the document fits in memory. As an experiment, let's see how far we can get with transducers instead.

Getting started

Let's start with an example XML file, which we'll save to example.xml.

  <chapter name="Introduction">
    <para>Here is the intro</para>
    <para>Another paragraph</para>
  <chapter name="Conclusion">
    <para>All done now</para>

Clojure already contains an XML parser. We'll use it to parse our document into a Clojure map.

(require '[clojure.xml :as xml])

(def doc (xml/parse "example.xml"))

xml/parse parses the file into a tree structure like this

{:tag :book :attrs {} :content [...]}

Descending into the content

The first transducer we shall write will be one that descends into the node's child elements.

Transducers work on sequences (or streams) of data. Since we want to get the children of multiple nodes into a single combined sequence, we'll use mapcat rather than map.

Here's our first transducer

(def children (mapcat :content))

Give it a sequence of nodes and it will return a sequence of children.

(sequence children [doc])

which results in

({:tag :frontmatter, :attrs nil, :content nil}
 {:tag :chapter,
  :attrs {:name "Introduction"},
 {:tag :chapter,
  :attrs {:name "Conclusion"},
  :content […]})

Select children based on their tag

Usually we want to select children by some criteria. It's very common to select child elements by their tag names.

So let's compose a transducer that will filter children by their tag name, using an arbitary predicate.

(defn tagp [pred]
  (comp children (filter (comp pred :tag))))

We've just created a function, named tagp, that given a predicate, returns a transducer which selects some nodes' children and filters them by their tag names.

(Transducers are infinitely composeable in this way, which gives great flexibility.)

It's very common to want to use an equality predicate, so we can select children by tag name. So let's create a convenience function for this.

(defn tag= [tag]
  (tagp (partial = tag)))

Now we can select all the chapter nodes of our example doc like this.

(sequence (tag= :chapter) [doc])

which results in

({:tag :chapter,
  :attrs {:name "Introduction"},
  [{:tag :para, :attrs nil, :content ["Here is the intro"]}
   {:tag :para, :attrs nil, :content ["Another paragraph"]}]}
 {:tag :chapter,
  :attrs {:name "Conclusion"},
  :content [{:tag :para, :attrs nil, :content ["All done now"]}]})

Note that we have only replaced the children transducer with our new 'get elements by tag name' one.

Let's use this transducer to count the number of chapters in our example

(->> [doc]
     (sequence (tag= :chapter))

which results in 2.

Select children based on their attributes

Similarly, we can build attrp and attr= to select children based on their attributes. Since the attributes are contained in the :attrs entry, we need to compose an accessor to access them.

(defn attr-accessor [a]
  (comp a :attrs))

Now we can create our 2 attribute transducers.

(defn attrp [a pred]
  (filter (comp pred (attr-accessor a))))

(defn attr= [a v]
  (attrp a (partial = v)))

Let's use our attr= transducer to count how many chapters have an attribute name of 'Conclusion'.

(->> [doc]
       (sequence (comp (tag= :chapter)
                       (attr= :name "Conclusion")))

which results in 1.

Towards a path-based query language

Let's re-write that code above to expose a path syntax. We simply apply comp over the path to yield the same composite transducer as before.

(let [path [(tag= :chapter) (attr= :name "Conclusion")]]
  (->> [doc]
         (sequence (apply comp path))

Note how we have now built a primitive, yet arguably more powerful, version of XPath.

Let's implement XPath's text() function as a transducer

(def text (comp (mapcat :content) (filter string?)))

We could implement other XPath node selectors in the same straight-forward way.

Putting it all together

We can now compose all these transducers to extract paragraphs in our document. Let's extract just the paragraphs of the chapter named 'Introduction'.

(let [path
       [(tag= :chapter) (attr= :name "Introduction") (tag= :para) text]]
    (sequence (apply comp path) [doc]))

which results in this list

("Here is the intro" "Another paragraph")

One reason why this approach is more powerful than XPath is that we are not limited to using XPath's built-in functions, but can inject custom transducers or any other functions into our 'path' to do anything we like, even something that's would be impossible in XPath, such as creating a map between the chapter names and the text of their paragraphs.

(->> [doc]
     (sequence (comp
                 (tag= :chapter)
                 (map (juxt (attr-accessor :name)
                            #(sequence (comp (tag= :para) text) [%])))))
     (into {}))

which results in

{"Introduction" ("Here is the intro" "Another paragraph"),
 "Conclusion" ("All done now")}

Nor are we limited to XML. With some easy modifications, we can build a path query for JSON or other formats.


As an exercise left for the reader, can you implement XPath's descendants as a transducer?

Discuss below or join the discussion on Hacker News.

submit to reddit