Bill Katz edited this page Feb 21, 2020 · 23 revisions

Planned and Existing Features for DVID:

Distributed operation: Once a DVID repo is created and loaded with data, it can be pushed to remote sites, optionally restricted by an ROI, or pulled from them. Each DVID server chooses how much of the data set is held locally.

Status: Repo push with optional data instance specification was added in September 2014 and refactored in April 2016. See published one-column repo. A push can send all repo data or just the data for one version (a "flattened" push), select particular data instances, and delimit the transmitted data with datatype-specific filters, e.g., "roi" for datatypes that understand voxel space (uint8blk, labelblk, labelvol, imagetile) and "tile" (xy/xz/yz) for imagetile. Performance of push/clone operations was not impressive as of Jan 2019, so we plan to create a DAGStore that exploits immutability to improve performance and transfer efficiency.

Versioning: Each version of a DVID repo corresponds to a node in a version DAG (Directed Acyclic Graph). Each version is identified by a UUID that can be generated locally yet is globally unique. Versioning and distribution follow patterns similar to distributed version control systems like git and mercurial. Provenance is kept in the DAG.
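To make the DAG idea concrete, here is a minimal sketch in Go. It is not DVID's actual implementation: `VersionID`, `DAG`, `AddVersion`, and `Ancestors` are hypothetical names, and the identifier is simply 128 random bits rendered as hex, assumed here as a stand-in for a UUID.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// VersionID is a hypothetical stand-in for a DVID version UUID.
type VersionID string

// NewVersionID generates a random 128-bit identifier locally. No central
// authority is consulted; uniqueness rests on the size of the random space.
func NewVersionID() (VersionID, error) {
	b := make([]byte, 16)
	if _, err := rand.Read(b); err != nil {
		return "", err
	}
	return VersionID(hex.EncodeToString(b)), nil
}

// DAG records the parents of every version. A version can only be added
// after its parents exist, so the graph is acyclic by construction.
type DAG struct {
	parents map[VersionID][]VersionID
}

func NewDAG() *DAG {
	return &DAG{parents: make(map[VersionID][]VersionID)}
}

// AddVersion adds a node with zero parents (a root), one parent (a child
// version), or several parents (a merge).
func (d *DAG) AddVersion(id VersionID, parents ...VersionID) error {
	if _, ok := d.parents[id]; ok {
		return fmt.Errorf("version %s already exists", id)
	}
	for _, p := range parents {
		if _, ok := d.parents[p]; !ok {
			return fmt.Errorf("unknown parent %s", p)
		}
	}
	d.parents[id] = append([]VersionID(nil), parents...)
	return nil
}

// Ancestors returns every version reachable through parent links, which is
// the provenance of the given version.
func (d *DAG) Ancestors(id VersionID) map[VersionID]bool {
	seen := make(map[VersionID]bool)
	stack := append([]VersionID(nil), d.parents[id]...)
	for len(stack) > 0 {
		v := stack[len(stack)-1]
		stack = stack[:len(stack)-1]
		if seen[v] {
			continue
		}
		seen[v] = true
		stack = append(stack, d.parents[v]...)
	}
	return seen
}

func main() {
	dag := NewDAG()
	root, err := NewVersionID()
	if err != nil {
		panic(err)
	}
	child, err := NewVersionID()
	if err != nil {
		panic(err)
	}
	if err := dag.AddVersion(root); err != nil {
		panic(err)
	}
	if err := dag.AddVersion(child, root); err != nil {
		panic(err)
	}
	fmt.Println("child descends from root:", dag.Ancestors(child)[root])
}
```

Requiring parents to exist before a child is added is one simple way to keep the graph acyclic without a separate cycle check.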

Status: Versioning is currently used for FlyEM production tasks. Conflict-free merging (where the versions being merged are disjoint at the key-value pair level) has been implemented but has not been thoroughly tested or used as a routine part of the Janelia production workflow. It remains an open question whether sophisticated merging tools should be an intrinsic part of DVID or an external client that reads data from nodes and writes merged data into a child node.
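The notion of a conflict-free merge at the key-value pair level can be sketched as follows. This is an illustration under stated assumptions, not DVID's merge code: `KV`, `changedKeys`, and `MergeDisjoint` are hypothetical names, versions are modeled as plain maps, and any key touched by both children counts as a conflict, even if both wrote the same value.

```go
package main

import (
	"fmt"
	"sort"
)

// KV is a hypothetical snapshot of one version's key-value pairs.
type KV map[string]string

// changedKeys returns the keys added, modified, or deleted in child
// relative to base.
func changedKeys(base, child KV) map[string]bool {
	changed := make(map[string]bool)
	for k, v := range child {
		if bv, ok := base[k]; !ok || bv != v {
			changed[k] = true
		}
	}
	for k := range base {
		if _, ok := child[k]; !ok {
			changed[k] = true
		}
	}
	return changed
}

// MergeDisjoint merges two children of base. It succeeds only when the two
// children changed disjoint sets of keys.
func MergeDisjoint(base, a, b KV) (KV, error) {
	ca, cb := changedKeys(base, a), changedKeys(base, b)
	var conflicts []string
	for k := range ca {
		if cb[k] {
			conflicts = append(conflicts, k)
		}
	}
	if len(conflicts) > 0 {
		sort.Strings(conflicts)
		return nil, fmt.Errorf("conflicting keys: %v", conflicts)
	}
	merged := make(KV, len(base))
	for k, v := range base {
		merged[k] = v
	}
	// apply copies one child's changes (including deletions) into merged.
	apply := func(changed map[string]bool, child KV) {
		for k := range changed {
			if v, ok := child[k]; ok {
				merged[k] = v
			} else {
				delete(merged, k)
			}
		}
	}
	apply(ca, a)
	apply(cb, b)
	return merged, nil
}

func main() {
	base := KV{"block/0": "gray", "label/7": "body 7"}
	a := KV{"block/0": "gray", "label/7": "body 7 merged"}
	b := KV{"block/0": "gray", "label/7": "body 7", "label/9": "body 9"}
	merged, err := MergeDisjoint(base, a, b)
	fmt.Println(merged, err)
}
```

An external merge client of the kind mentioned above could follow the same shape: read both parent nodes, compute the changed key sets, and write the result into a child node.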

Flexible Data Types: DVID provides a well-defined interface to datatype code that can be easily added by users. A DVID server provides HTTP and RPC APIs, versioning, provenance, storage engines, and (in the future) authentication and authorization. It delegates datatype-specific commands and processing to datatype code. As long as a DVID type can return data for its implemented commands, we don't care how it's implemented. Finer-grained provenance can also be collected at the datatype level. For example, the labelmap implementation stores a mutation log for splits, merges, and cleaves, which would allow complete provenance in addition to the checkpoint-style versioning intrinsic to DVID.
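The delegation described above can be sketched as a small registry that routes commands to datatype code. The `Datatype` interface, `Registry`, and the toy `keyValue` type below are hypothetical and far simpler than DVID's real datatype interface; they only illustrate that the server side needs to know nothing about how a type implements its commands.

```go
package main

import (
	"errors"
	"fmt"
)

// Datatype is a hypothetical, minimal contract between a server and
// datatype code: a name plus a single command entry point.
type Datatype interface {
	Name() string
	Do(command string, args []string) (string, error)
}

// Registry maps datatype names to their implementations.
type Registry struct {
	types map[string]Datatype
}

func NewRegistry() *Registry {
	return &Registry{types: make(map[string]Datatype)}
}

// Register adds a datatype and rejects duplicate names.
func (r *Registry) Register(d Datatype) error {
	if _, ok := r.types[d.Name()]; ok {
		return fmt.Errorf("datatype %q already registered", d.Name())
	}
	r.types[d.Name()] = d
	return nil
}

// Dispatch forwards a command to the named datatype without inspecting it.
func (r *Registry) Dispatch(typeName, command string, args ...string) (string, error) {
	d, ok := r.types[typeName]
	if !ok {
		return "", fmt.Errorf("unknown datatype %q", typeName)
	}
	return d.Do(command, args)
}

// keyValue is a toy datatype that understands "put" and "get".
type keyValue struct {
	data map[string]string
}

func (k *keyValue) Name() string { return "keyvalue" }

func (k *keyValue) Do(command string, args []string) (string, error) {
	switch command {
	case "put":
		if len(args) != 2 {
			return "", errors.New("put needs a key and a value")
		}
		k.data[args[0]] = args[1]
		return "ok", nil
	case "get":
		if len(args) != 1 {
			return "", errors.New("get needs a key")
		}
		v, ok := k.data[args[0]]
		if !ok {
			return "", fmt.Errorf("key %q not found", args[0])
		}
		return v, nil
	}
	return "", fmt.Errorf("unsupported command %q", command)
}

func main() {
	reg := NewRegistry()
	if err := reg.Register(&keyValue{data: make(map[string]string)}); err != nil {
		panic(err)
	}
	if _, err := reg.Dispatch("keyvalue", "put", "skeleton/42", "swc data"); err != nil {
		panic(err)
	}
	v, err := reg.Dispatch("keyvalue", "get", "skeleton/42")
	fmt.Println(v, err)
}
```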

Status: A variety of voxel types, tiles, labels, label graph, label-aware annotations, key-value, and ROI datatypes have been implemented. As an example of a simple proxy datatype, googlevoxels proxies requests between DVID and the Google BrainMaps API, taking care of OAuth2 authentication within the datatype implementation. A key-value type that uses IPFS is planned so distributed DVIDs could share data, although at a higher latency due to peer-to-peer and lookup costs. A FUSE interface for the key-value type was working but has not been used for the last year. Lightweight authentication and authorization support is planned, using something like password-less tokens.

Scalable Storage Engine: Although DVID may support polyglot persistence (i.e., allow use of relational, graph, or NoSQL databases), we are initially focused on key-value stores. DVID has an abstract key-value interface to its swappable storage engine. We chose a key-value interface because (1) there are a large number of high-performance, open-source implementations that run from embedded to clustered systems, (2) the surface area of the API is very small, even after adding important cases like bulk loads or sequential key read/write, and (3) novel technology tends to match key-value interfaces. As storage becomes more log structured, the key-value API becomes a more natural fit.
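To illustrate how small such an API surface can be, here is a sketch of an ordered key-value interface with an in-memory engine behind it. The `OrderedKV` interface and `memStore` type are assumptions made for this example and do not mirror DVID's actual storage package; a leveldb-backed engine would satisfy the same interface.

```go
package main

import (
	"fmt"
	"sort"
)

// OrderedKV is a hypothetical minimal storage interface: point reads and
// writes plus a sequential read over a key range.
type OrderedKV interface {
	Get(key string) (value []byte, found bool)
	Put(key string, value []byte)
	Delete(key string)
	// Range calls fn for each key in [start, end) in ascending key order
	// and stops early if fn returns false.
	Range(start, end string, fn func(key string, value []byte) bool)
}

// memStore is a toy engine that keeps everything in a map.
type memStore struct {
	data map[string][]byte
}

func newMemStore() *memStore {
	return &memStore{data: make(map[string][]byte)}
}

func (m *memStore) Get(key string) ([]byte, bool) {
	v, ok := m.data[key]
	return v, ok
}

func (m *memStore) Put(key string, value []byte) {
	// Copy the value so later changes by the caller do not leak in.
	m.data[key] = append([]byte(nil), value...)
}

func (m *memStore) Delete(key string) { delete(m.data, key) }

func (m *memStore) Range(start, end string, fn func(string, []byte) bool) {
	keys := make([]string, 0, len(m.data))
	for k := range m.data {
		if k >= start && k < end {
			keys = append(keys, k)
		}
	}
	sort.Strings(keys)
	for _, k := range keys {
		if !fn(k, m.data[k]) {
			return
		}
	}
}

// countRange depends only on the interface, so any engine can be swapped in.
func countRange(s OrderedKV, start, end string) int {
	n := 0
	s.Range(start, end, func(string, []byte) bool {
		n++
		return true
	})
	return n
}

func main() {
	var store OrderedKV = newMemStore()
	store.Put("block/0002", []byte("c"))
	store.Put("block/0000", []byte("a"))
	store.Put("block/0001", []byte("b"))
	store.Range("block/0000", "block/0002", func(k string, v []byte) bool {
		fmt.Printf("%s = %s\n", k, v)
		return true
	})
	fmt.Println("keys in range:", countRange(store, "block/0000", "block/0002"))
}
```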

A key part of the DVID vision is the flexibility to choose storage engines and trade off speed, storage capacity, and cost. By focusing on key-value stores, we have a variety of solutions.

Spectrum of key-value stores

Status: Currently built with Basho-tuned leveldb. Other leveldb variants have been tested successfully in the past: Google's open source version and HyperLevelDB.

Google Cloud BigTable support was added by Ignacio Tartavull but not thoroughly tested. A Google Cloud Storage (similar to Amazon S3) backend was added by Steve Plaza and is currently used for DVID Spark Services. Use of a petabyte-capable immutable store (MongoDB for ordered indexing + Scality for object store) was tested at Janelia but not used due to IOPS concerns.

RocksDB and Badger support are planned for Q1 2019. In the past, Lightning MDB and, experimentally, Bolt were tested, although neither was tuned to work as well as the leveldb variants.

DVID allows assignment of different storage engines to each data type, data instance, and the global metadata (e.g., the DAG). This allows you to use smaller, fast but expensive storage like SSDs for frequently mutating labels while using read-optimized, larger storage for immutable image data. Within Janelia, we typically delegate storage of our mostly immutable grayscale (25+ TB) to Google Cloud Storage with local caching, and use high-speed NVMe SSDs to hold the rest of our data locally.
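One way to picture this assignment is a lookup that resolves a store for each request. The sketch below is not DVID's configuration format: `StoreConfig`, its fields, and the store names are hypothetical, and the precedence (data instance first, then data type, then a default) is an assumption made for the example.

```go
package main

import "fmt"

// StoreConfig is a hypothetical mapping from data to named storage engines.
type StoreConfig struct {
	Default    string            // store used when nothing more specific matches
	Metadata   string            // store for global metadata such as the DAG
	ByDataType map[string]string // data type name -> store name
	ByInstance map[string]string // data instance name -> store name
}

// StoreFor resolves the store for one data instance. The precedence
// (instance, then data type, then default) is an assumption of this sketch.
func (c StoreConfig) StoreFor(instance, dataType string) string {
	if s, ok := c.ByInstance[instance]; ok {
		return s
	}
	if s, ok := c.ByDataType[dataType]; ok {
		return s
	}
	return c.Default
}

// MetadataStore returns the store for global metadata, falling back to the
// default store when none is set.
func (c StoreConfig) MetadataStore() string {
	if c.Metadata != "" {
		return c.Metadata
	}
	return c.Default
}

func main() {
	cfg := StoreConfig{
		Default:    "nvme-leveldb", // hypothetical store names
		ByDataType: map[string]string{"uint8blk": "gcs-cached"},
		ByInstance: map[string]string{"scratch-gray": "nvme-leveldb"},
	}
	fmt.Println(cfg.StoreFor("grayscale", "uint8blk"))    // matched by data type
	fmt.Println(cfg.StoreFor("scratch-gray", "uint8blk")) // matched by instance
	fmt.Println(cfg.StoreFor("segmentation", "labelmap")) // falls to default
	fmt.Println(cfg.MetadataStore())
}
```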

Denormalized Views: DVID contains a pub/sub system where instances of some data types can subscribe to mutations in instances of other data types. For example, a labelsz instance can subscribe to an annotation (synapses) instance, a labelmap (segmentation) instance, or both, so merges/splits/cleaves can trigger updates in all related data. Some denormalizations are built into the data type, e.g., multiscale support in labelarray/labelmap where changes at level 0 are automatically percolated to lower-resolution levels.
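The subscription pattern can be sketched with a tiny synchronous hub. Everything here is hypothetical (`Mutation`, `Hub`, `SizeIndex`) and much simpler than DVID's sync framework; the `SizeIndex` view is only loosely inspired by the idea behind labelsz, keeping a per-label count that follows merges in a segmentation instance.

```go
package main

import "fmt"

// Mutation is a hypothetical event published by a data instance.
type Mutation struct {
	Source string   // name of the instance that changed
	Kind   string   // e.g. "merge", "split", "cleave"
	Target uint64   // label that survives a merge
	Merged []uint64 // labels absorbed into Target
}

// Hub delivers mutations to the handlers subscribed to each source instance.
type Hub struct {
	subs map[string][]func(Mutation)
}

func NewHub() *Hub {
	return &Hub{subs: make(map[string][]func(Mutation))}
}

// Subscribe registers fn to receive every mutation published by source.
func (h *Hub) Subscribe(source string, fn func(Mutation)) {
	h.subs[source] = append(h.subs[source], fn)
}

// Publish calls each subscriber of m.Source in registration order.
func (h *Hub) Publish(m Mutation) {
	for _, fn := range h.subs[m.Source] {
		fn(m)
	}
}

// SizeIndex is a toy denormalized view: a count (say, of synapses) per label.
type SizeIndex struct {
	counts map[uint64]int
}

// OnMutation folds the counts of merged labels into the surviving label.
// Other mutation kinds are ignored in this sketch.
func (s *SizeIndex) OnMutation(m Mutation) {
	if m.Kind != "merge" {
		return
	}
	for _, l := range m.Merged {
		s.counts[m.Target] += s.counts[l]
		delete(s.counts, l)
	}
}

func main() {
	hub := NewHub()
	idx := &SizeIndex{counts: map[uint64]int{1: 10, 2: 5}}
	hub.Subscribe("segmentation", idx.OnMutation)

	// Merging label 2 into label 1 updates the subscribed view as a side effect.
	hub.Publish(Mutation{Source: "segmentation", Kind: "merge", Target: 1, Merged: []uint64{2}})
	fmt.Println(idx.counts)
}
```

Delivery here is synchronous for brevity; a production system would more likely queue events so a slow subscriber does not block the mutating request.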

Status: The pub/sub framework for syncing is currently used for a number of datatypes that need to be coordinated (labelblk, labelvol, labelsz, annotations). Multi-scale 2d images in XY, XZ, YZ, and sparse volumes have been implemented. We now use the tarsupervoxels data type to store meshes for each supervoxel. Skeleton representations are stored using a keyvalue instance.

Deprecated: Multi-scale 3d was first introduced Oct 2016 via synced instances for each level of resolution but now is handled via all-in-one datatypes like labelmap/labelarray. Rapid label surfaces (labelsurf) was implemented but deprecated until a production client plans to use it.

Mutation and Activity Logging: DVID makes extensive use of Kafka for both activity and mutation logging. Activity logs from Kafka are processed and eventually visualized in Kibana, where we can track DVID server performance.

Example of Kibana to track DVID performance via kafka activity log

Analysis-focused connectomics systems like [NeuPrint](https://github.com/connectome-neuprint/neuPrint) can be updated by subscribing to the mutations log.

Status: We are still experimenting with the best way to log mutations and allow easy access to the data stream by other tools.
