Datasketches sandbox project

Apache DataSketches is a software library of stochastic streaming algorithms.

This repository provides a simple HTTP interface to evaluate datasketches on your own data.

For large datasets, the following problems are typically difficult to solve exactly with limited resources:

  • distinct count
  • quantiles and histograms
  • frequent items
  • reservoir sampling

DataSketches uses sketches with mathematically proven error bounds to provide robust solutions to these problems. Moreover, sketches are insensitive to the order of the input data and need to see each data item only once ("one touch"), which makes them well suited to streaming and big data use cases.

Usage notes

The service keeps an in-memory sketch (or an exact copy) for each dataset, which can be queried at any time for approximate results. Because this state is retained between requests, set operations can be performed between sketches.

To use the exact equivalent of a sketch, append the ?exact flag to the endpoint.
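
As a rough illustration of how the flag changes a request, the following sketch builds the URL for a key with or without ?exact. The count_url helper and the BASE_URL variable are hypothetical and not part of this repository; the host, port and path are taken from the distinct count examples in this README.

```shell
# Hypothetical helper: build a distinct-count URL, optionally with ?exact.
# BASE_URL is an assumption based on the examples in this README.
BASE_URL="http://127.0.0.1:8099/v1/distinct/count"

count_url() {
  key=$1
  mode=${2:-sketch}
  if [ "$mode" = "exact" ]; then
    printf '%s/%s?exact\n' "$BASE_URL" "$key"
  else
    printf '%s/%s\n' "$BASE_URL" "$key"
  fi
}

count_url country-jp
count_url country-jp exact
```

When passing such a URL to curl, it is safer to quote it, for example curl -X GET "$(count_url country-jp exact)", because ? is a glob character in most shells.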

Each sketch needs to be assigned a key for reference, which typically adheres to the following format:

dataset-dimension1-dimension2-dimensionN

For example:

# country dataset, country code
country-jp
country-us

# occupation dataset, job name, state
occupation-technician-ca
occupation-surgeon-co
occupation-surgeon-tx
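
Keys in this format can be assembled from their parts in a script. The make_key function below is a hypothetical helper, not something shipped in the scripts directory; it simply joins a dataset name and its dimensions with hyphens.

```shell
# Hypothetical helper: join a dataset name and its dimensions with hyphens.
make_key() {
  # "$*" joins the arguments using the first character of IFS;
  # the subshell keeps the IFS change local to this function.
  (IFS=-; printf '%s\n' "$*")
}

make_key country jp
make_key occupation surgeon tx
```

Note that this sketch does no validation, so a dimension value that itself contains a hyphen would make the key ambiguous.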

Finally, see the useful helper scripts in the scripts directory.

Running in Docker

# Starts the published container from the GitHub Container Registry
docker run -d -p 8099:8080/tcp ghcr.io/davecromberge/datasketches-sandbox/ds-sandbox-server:latest
→ container-id

# Tests the container
curl -X GET http://127.0.0.1:8099/ping
→ pong

# Stops the container
docker stop container-id

Distinct count

Problem: Gather a distinct count of identities, independent of the order of the input.

curl -X PUT http://127.0.0.1:8099/v1/distinct/count/country-jp/user-id1
→ Accepted

curl -X PUT http://127.0.0.1:8099/v1/distinct/count/country-jp/user-id2
→ Accepted

curl -X PUT http://127.0.0.1:8099/v1/distinct/count/country-us/user-id2
→ Accepted

curl -X GET http://127.0.0.1:8099/v1/distinct/count/country-jp
→ {"value":2.0,"lowerBound":2.0,"upperBound":2.0}

curl -X GET http://127.0.0.1:8099/v1/distinct/count/country-us/union/country-jp
→ {"value":2.0,"lowerBound":2.0,"upperBound":2.0}

curl -X GET http://127.0.0.1:8099/v1/distinct/count/country-us/intersect/country-jp
→ {"value":1.0,"lowerBound":1.0,"upperBound":1.0}

curl -X GET http://127.0.0.1:8099/v1/distinct/count/country-jp/anotb/country-us
→ {"value":1.0,"lowerBound":1.0,"upperBound":1.0}

curl -F [email protected]  http://127.0.0.1:8099/v1/distinct/count/country-us
→ Accepted

curl -F [email protected] "http://127.0.0.1:8099/v1/distinct/count/country-jp?exact"
→ Accepted

curl -X DELETE http://127.0.0.1:8099/v1/distinct/count/country-jp
→ Ok

curl -X DELETE http://127.0.0.1:8099/v1/distinct/count/country-us
→ Ok

For comparison purposes, the ?exact flag can be added to any of the above URLs to perform an exact distinct count instead. Uploading large input streams to the exact endpoints can be orders of magnitude slower, whereas the sketches grow sub-linearly with the size of the input data.
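
The set-operation results in the example above can be cross-checked locally with standard tools, without a running server. This is only an exact-set sketch of the same three inserts (user-id1 and user-id2 into country-jp, user-id2 into country-us); the temporary files and variable names are illustrative.

```shell
# Local cross-check of the example above using exact sets (no server needed).
tmp=$(mktemp -d)
printf '%s\n' user-id1 user-id2 | sort -u > "$tmp/jp"
printf '%s\n' user-id2 | sort -u > "$tmp/us"

# union: distinct ids in either set
union=$(sort -u "$tmp/jp" "$tmp/us" | wc -l | tr -d ' ')
# intersect: ids present in both sets
intersect=$(comm -12 "$tmp/jp" "$tmp/us" | wc -l | tr -d ' ')
# anotb: ids in country-jp that are not in country-us
a_not_b=$(comm -23 "$tmp/jp" "$tmp/us" | wc -l | tr -d ' ')

echo "union=$union intersect=$intersect anotb=$a_not_b"
rm -rf "$tmp"
```

The counts match the values returned by the union, intersect and anotb requests above. With inputs this small a sketch is expected to be exact, which is why the lower and upper bounds in those responses equal the value.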

Environment variables

By default, the sketch's nominal entries setting is 2^16. This setting affects the accuracy of the final estimate.

To alter the defaults, run the docker image with the relevant environment variables set:

docker run -d --env SKETCH_ACCURACY=12 -p 8099:8080/tcp datasketches-sandbox/ds-sandbox-server
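
To get a feel for how this setting trades accuracy for memory, the sketch below prints an approximate relative standard error for a given value. It rests on two assumptions that this README does not confirm: that SKETCH_ACCURACY is the base-2 logarithm of the nominal entries k (consistent with the 2^16 default), and that the error behaves roughly like 1/sqrt(k), as is commonly quoted for Theta-style sketches. The rse_for_lgk function is a hypothetical helper.

```shell
# Assumption: SKETCH_ACCURACY is log2(k), where k is the nominal entries.
# Assumption: relative standard error is roughly 1/sqrt(k) (Theta-style sketch).
rse_for_lgk() {
  awk -v lgk="$1" 'BEGIN { k = 2 ^ lgk; printf "k=%d rse=%.4f\n", k, 1 / sqrt(k) }'
}

rse_for_lgk 16   # the default
rse_for_lgk 12   # the value used in the docker run example
```

Under these assumptions, lowering the setting from 16 to 12 shrinks the sketch by a factor of 16 and increases the expected relative error by a factor of 4.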

Building a Linux executable

  1. Build the Docker image in the docker directory

docker build -f docker/GraalDockerfile -t datasketches-sandbox/graalvm-native-image .

  2. Run the nativeImage task from sbt. The result will be a Linux executable.

  3. Build the lightweight Docker image locally

docker build -f docker/SandboxDockerfile -t datasketches-sandbox/ds-sandbox-server .

Acknowledgements

  • The Apache Datasketches team and community for the incredibly useful library.
  • This blog post by Noel Welsh, which describes how to build a GraalVM service using sbt and Docker.

Todos

  • Support more sketch types
  • Create a Java equivalent for the Apache organisation
  • Add better documentation
  • Add GitHub Actions for automatically publishing the package to ghcr
