A binary serialization format for sparse, labeled 2D numeric datasets

framed-data/clj-btable

BTable

A fast, compact binary serialization format for sparse, labeled 2D numeric datasets ('binary tables').

Motivations

Prior to BTables, we stored large sparse 2D datasets as dense CSVs, which is highly inefficient in both space and performance. We looked into HDF5 but found it overly complex for our use case, and early investigation did not yield compelling gains in performance or space. Thus BTable was designed to be a simple, fast, and compact format for representing sparse numeric datasets.

A BTable is basically a binary representation of a sparse matrix on disk, and the format is inspired by the Compressed Row Storage (CRS) format, saving space by only storing the indices/values of nonzero cells. It is designed in a strictly row-oriented format for efficient iteration, and is not a library for matrix computation or linear algebra.
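To illustrate the idea behind Compressed Row Storage, here is a minimal in-memory sketch in plain Clojure. It is not the BTable on-disk layout (see the wiki for that), and the names `dense->crs` and `crs-row` are hypothetical helpers invented for this example. Only nonzero values are kept, along with their column indices and a row-pointer vector marking where each row starts.

```clojure
;; Illustrative only: a generic CRS encoding, NOT the actual BTable byte layout.
;; `dense->crs` and `crs-row` are hypothetical names, not part of clj-btable.

(defn dense->crs
  "Convert a seq of dense row vectors into a CRS-style map with
  :values (nonzero cells), :col-indices (their columns), and
  :row-ptr (offset into :values where each row begins)."
  [rows]
  (reduce
    (fn [acc row]
      (let [nonzero (keep-indexed (fn [j v] (when-not (zero? v) [j v])) row)
            acc2    (-> acc
                        (update :values into (map second nonzero))
                        (update :col-indices into (map first nonzero)))]
        ;; The next row starts where this row's nonzero values end.
        (update acc2 :row-ptr conj (count (:values acc2)))))
    {:values [] :col-indices [] :row-ptr [0]}
    rows))

(defn crs-row
  "Rebuild dense row i (with n-cols columns) from a CRS-style map."
  [{:keys [values col-indices row-ptr]} n-cols i]
  (let [start (nth row-ptr i)
        end   (nth row-ptr (inc i))]
    (reduce (fn [row k] (assoc row (nth col-indices k) (nth values k)))
            (vec (repeat n-cols 0.0))
            (range start end))))

(def crs (dense->crs [[5.0 3.0 1.0] [2.0 0.0 0.0] [0.0 0.0 0.0]]))
;; crs => {:values [5.0 3.0 1.0 2.0], :col-indices [0 1 2 0], :row-ptr [0 3 4 4]}
```

Because each row's cells sit in one contiguous slice of `:values`, iterating row by row is cheap, which matches the row-oriented design described above. Column-wise access would require scanning every row.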

Note that BTables are not a drop-in replacement for all datasets stored as CSV: the gain in efficiency is proportional to the sparsity of the dataset. For a pathological dataset in which every cell is nonzero, a BTable can occupy much more space than the equivalent CSV!
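Since the benefit depends on sparsity, it can help to measure it before converting a dataset. The following is a rough sketch in plain Clojure. `sparsity` is a hypothetical helper, not part of clj-btable, and it assumes the rows are already in memory as sequences of numbers.

```clojure
;; Hypothetical helper, not part of clj-btable.
(defn sparsity
  "Fraction of cells equal to zero, between 0.0 (fully dense) and 1.0 (all zeros)."
  [rows]
  (let [cells (mapcat identity rows)
        total (count cells)]
    (if (zero? total)
      0.0
      (double (/ (count (filter zero? cells)) total)))))

;; 5 of the 9 cells below are zero.
(sparsity [[5.0 3.0 1.0] [2.0 0.0 0.0] [0.0 0.0 0.0]])
```

A value close to 1.0 suggests the dataset is a good candidate for a sparse format. A value close to 0.0 corresponds to the pathological case described above.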

Examples

(require '[clj-btable.core :as btable])

(def labels ["login", "view_item", "purchase"])
(def rows [[5.0 3.0 1.0] [2.0 0.0 0.0] [0.0 0.0 0.0]])
(btable/write "out.btable" labels rows)
; => #<File out.btable>

(btable/labels "out.btable")
; => ["login", "view_item", "purchase"]

(doseq [row (btable/rows "out.btable")]
  ; Process a single row in a lazy sequence of rows
  )
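As a sketch of what the body of such a loop might do, the function below sums each column across all rows. `column-totals` is a hypothetical helper written for this example. It assumes each row is a sequence of numbers with one entry per label, in label order; check the documentation for the actual shape of the values returned by `btable/rows`.

```clojure
;; Hypothetical helper, not part of clj-btable.
;; Assumes every row has exactly one numeric entry per label.
(defn column-totals
  "Return a map from label to the sum of that column over all rows."
  [labels rows]
  (zipmap labels
          (reduce (fn [totals row] (mapv + totals row))
                  (vec (repeat (count labels) 0.0))
                  rows)))

(column-totals ["login" "view_item" "purchase"]
               [[5.0 3.0 1.0] [2.0 0.0 0.0] [0.0 0.0 0.0]])
;; => {"login" 7.0, "view_item" 3.0, "purchase" 1.0}

;; With a file on disk, under the assumption stated above:
;; (column-totals (btable/labels "out.btable") (btable/rows "out.btable"))
```

Because `reduce` consumes the rows one at a time, this works with a lazy sequence without holding the whole table in memory.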

Also see the documentation.

Disk format

See the wiki for a detailed description of the representation on disk.

Performance

An optimized Java backend using NIO can write a table of 50,000 rows, each with 500 columns (25 million cells), in just under 6 seconds. The same table can be read and traversed in about 7 seconds.
