GitHub - framed-data/clj-btable: A binary serialization format for sparse, labeled 2D numeric datasets

BTable

A fast, compact binary serialization format for sparse, labeled 2D numeric datasets ('binary tables').

Motivations

Prior to BTables, we were storing large sparse 2D datasets in dense CSVs, which is highly space- and performance-inefficient. We looked into HDF5 although found it to be overly complex for our use case, and early investigation did not yield compelling gains in performance or space. Thus BTables was designed to be a simple, fast, and compact format to represent sparse numeric datasets.

A BTable is basically a binary representation of a sparse matrix on disk, and the format is inspired by the Compressed Row Storage (CRS) format, saving space by only storing the indices/values of nonzero cells. It is designed in a strictly row-oriented format for efficient iteration, and is not a library for matrix computation or linear algebra.

Note that BTables are not a drop-in replacement for all datasets stored as CSV: the increases in efficiency is proportional to the sparsity of the dataset. For a pathological fully-nonzero dataset, the space occupied can be much larger than a CSV!

Examples

(require '[clj-btable.core :as btable])

(def labels ["login", "view_item", "purchase"])
(def rows [[5.0 3.0 1.0] [2.0 0.0 0.0] [0.0 0.0 0.0]])
(btable/write "out.btable" labels rows)
; #<File out.btable>

(btable/labels "out.btable")
; => ["login", "view_item", "purchase"]

(doseq [row (btable/rows "out.btable")]
  ; Process a single row in a lazy sequence of rows
  )

Also see the documentation.

Disk format

See the wiki for a detailed description of the representation on disk.

Performance

An optimized Java backend using NIO can write a table of 50,000 rows, each with 500 columns (25 million cells) in just under 6 seconds. The same table can be read/traversed in ~7s.

Name		Name	Last commit message	Last commit date
Latest commit History 21 Commits
doc		doc
src		src
test/clj_btable		test/clj_btable
.gitignore		.gitignore
README.md		README.md
project.clj		project.clj

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

BTable

Motivations

Examples

Disk format

Performance

About

Releases

Packages

Languages

framed-data/clj-btable

Folders and files

Latest commit

History

Repository files navigation

BTable

Motivations

Examples

Disk format

Performance

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages