sax-plucker

A Clojure library to deal with SAX-streams in a chunked manner, plucking cohesive pieces and streaming them. Uses data.xml - without absolutely any validations. Assumes well-formed XML.

Motivation

This library is primarily meant to deal with very large XMLs (hence, SAX), but where there are many repeating, self-contained entities (sub-trees) that are the primary targets of iteration and processing. You don't care about the actual semantics of the content, as much as the ability to pluck such entities out, one at a time, and hand them off to another (stream-) processing entity.

The purpose is to present a lazy stream of such mini-DOM-mable collections of XML entities. Because, well, DOM presents a much better semantic view of the data that SAX does not. Re-building such trees over SAX streams is a repetitive, boring job. Irrespective of the contents.

Usage

(stream-plucks (-> "/path/to/xml/file.gz" 
                   (create-sax-streamer :gzipped? true) 
                   (skip "step/down/path")))

For example, you have an XML which looks like the following

<?xml version="1.0" encoding="UTF-8"?>
<Root>
    <noisyTag>
        <ignorableItems>
            <ignorableItem>
                I am an ignorable piece of text
            </ignorableItem>
        </ignorableItems>
        <items>
            <item>
                <tagOne>
                    I am some text in tagOne
                </tagOne>
                <tagTwo>
                    I am some text in tagTwo
                </tagTwo>
            </item>
            <item>
                <tagOne>
                    I am some text in tagOne TWO
                </tagOne>
                <tagTwo>
                    I am some text in tagTwo TWO
                </tagTwo>
            </item>
        </items>
    </noisyTag>
    <!-- This is a comment for your pleasure. -->
</Root>

The following code will give you two groups of XML elements, each rooted at item

(:require [org.msync.sax-plucker :refer [stream-plucks]]
          [xml-in.core :refer [find-all]]) ;; https://github.com/tolitius/xml-in for testing.
;...
; Pseudo-clojure.test code
(let [doms (stream-plucks "sample.xml" :descend-path "Root/noisyTag/items" :as-dom? true)
           ;; Returns a stream of DOM trees rooted at the "item" nodes.
      first-dom (first doms)
      last-dom (last doms)]
  (is (= ["I am some text in tagOne"] (find-all first-dom [:item :tagOne])))
  (is (= ["I am some text in tagTwo"] (find-all first-dom [:item :tagTwo])))
  (is (= ["I am some text in tagOne TWO"] (find-all last-dom [:item :tagOne])))
  (is (= ["I am some text in tagTwo TWO"] (find-all last-dom [:item :tagTwo]))))

More usage examples to follow.

Quick note for large files - you need to pass some options to the JVM at startup. Example

-DtotalEntitySizeLimit=2147480000 -Djdk.xml.totalEntitySizeLimit=2147480000

The default limit is 50000000 entities otherwise, after which the underlying XML library aborts processing. See one example at dbpedia/extraction-framework#487

Notes:

This library was quick-tested on a > 1.7G compressed XML file and more than 250M lines, and the aggregates matched perfectly with a grep-check.

License

Distributed under the Eclipse Public License either version 1.0 or (at your option) any later version.

Name		Name	Last commit message	Last commit date
Latest commit History 19 Commits
src/org/msync		src/org/msync
test		test
.gitignore		.gitignore
.travis.yml		.travis.yml
CHANGELOG.md		CHANGELOG.md
Jenkinsfile		Jenkinsfile
LICENSE		LICENSE
README.md		README.md
project.clj		project.clj

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

sax-plucker

Motivation

Usage

License

About

Releases

Packages

Languages

License

jaju/sax-plucker-clj

Folders and files

Latest commit

History

Repository files navigation

sax-plucker

Motivation

Usage

License

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages