Skip to content

Latest commit

 

History

History
167 lines (133 loc) · 6.64 KB

README.md

File metadata and controls

167 lines (133 loc) · 6.64 KB

Babbage Analytical Engine

Gitter Build Status Coverage Status

babbage is a lightweight implementation of an OLAP-style database query tool for PostgreSQL. Given a database schema and a logical model of the data, it can be used to perform analytical queries against that data - programmatically or via a web API.

It is heavily inspired by Cubes but has less ambitious goals, i.e. no pre-computation of aggregates, or multiple storage backends.

babbage is not specific to government finances, and could easily be used e.g. for ReGENESIS, a project that makes German national statistics available via an API. The API functions by interpreting modelling metadata generated by the user (measures and dimensions).

Installation and test

babbage will normally included as a PyPI dependency, or installed via pip:

$ pip install babbage

People interested in contributing to the package should instead check out the source repository and then use the provided Makefile to install the library (this requires virtualenv to be installed):

$ git clone https://github.com/openspending/babbage.git
$ cd babbage
$ make install
$ pip install tox
$ export BABBAGE_TEST_DB=postgresql://postgres@localhost:5432/postgres
$ make test

Usage

babbage is used to query a set of existing database tables, using an abstract, logical model to query them. A sample of a logical model can be found in tests/fixtures/models/cra.json, and a JSON schema specifying the model is available in babbage/schema/model.json.

The central unit of babbage is a Cube, i.e. a OLAP cube that uses the provided model metadata to construct queries against a database table. Additionally, the application supports managing multiple cubes at the same time via a CubeManager, which can be subclassed to enable application-specific ways of defining cubes and where their metadata is stored.

Futher, babbage includes a Flask Blueprint that can be used to expose a standard API via HTTP. This API is consumed by the JavaScript babbage.ui package and it is very closely modelled on the Cubes and OpenSpending HTTP APIs.

Programmatic usage

Let's assume you have an existing database table of procurement data and want to query it using babbage in a Python shell. A session might look like this:

import json
from sqlalchemy import create_engine
from babbage.cube import Cube
from babbage.model import Measure

engine = create_engine('postgresql://localhost/procurement')
model = json.load(open('procurement_model.json', 'r'))

cube = Cube(engine, 'procurement', model)
facts = cube.facts(page_size=5)

# There are 17201 rows in the table:
assert facts['total_fact_count'] == 17201

# There's a field called 'total_value':
assert 'total_value' in facts['fields']

# We can get metadata about it:
concept = cube.model['total_value']
assert isinstance(concept, Measure)
assert concept.label == 'Total Value'

# And there's some actual data:
assert len(facts['data']) == 5
fact_0 = facts['data'][0]
assert 'total_value' in fact_0

# For dimensions, we can get all the distinct values:
members = cube.members('supplier', cut='year:2015', page_size=500)
assert len(members['data']) <= 500
assert members['total_member_count']

# And, finally, we can aggregate by specific dimensions:
aggregate = cube.aggregate(aggregates='total_value.sum',
                           drilldowns='supplier|authority'
                           cut='year:2015|authority.country:GB',
                           page_size=500)
# This translates to: 
#   Aggregate the procurement data by summing up the 'total_value'
#   for each unique pair of values in the 'supplier' and 'authority'
#   dimensions, and filter for only those entries where the 'year'
#   dimensions key attribute is '2015' and the 'authority' dimensions
#   'country' attribute is 'GB'. Return the first 500 results.
assert aggregate['total_cell_count']
assert len(aggregate['cells']) <= 500
aggregate_0 = aggregate['cells'][0]
assert 'total_value.sum' in aggregate_0

# Note that these attribute names are made up for this example, they
# should be reflected from the model:
assert 'supplier.code' in aggregate_0
assert 'supplier.label' in aggregate_0
assert 'authority.code' in aggregate_0
assert 'authority.label' in aggregate_0

Using the HTTP API

The HTTP API for babbage is a simple Flask Blueprint used to expose a small set of calls that correspond to the cube functions listed above. To include it into an existing Flask application, you would need to create a CubeManager and then configure the API like this:

from flask import Flask
from sqlalchemy import create_engine
from babbage.manager import JSONCubeManager
from babbage.api import configure_api

app = Flask('demo')
engine = 
models_directory = 'models/'
manager = JSONCubeManager(engine, models_directory)
blueprint = configure_api(app, manager)
app.register_blueprint(blueprint, url_prefix='/api/babbage')

app.run()

Of course, you can define your own CubeManager, for example if you wish to retrieve model metadata from a database.

When enabled, the API will expose a number of JSON(P) endpoints relative to the given url_prefix:

  • /, returns the system status and version.
  • /cubes, returns a list of the available cubes (name only).
  • /cubes/<name>/model, returns full metadata for a given cube (i.e. measures, dimensions, aggregates etc.)
  • /cubes/<name>/facts is used to return individual entries from the cube in a non-aggregated form. Supports filters (cut), a set of fields to return and a sort (field_name:direction), as well as page and page_size.
  • /cubes/<name>/members is used to return the distinct set of values for a given dimension, e.g. all the suppliers mentioned in a procurement dataset. Supports filters (cut), a and a sort (field_name:direction), as well as page and page_size.
  • /cubes/<name>/aggregate is the main endpoint for generating aggregate views of the data. Supports specifying the aggregates to include, the drilldowns to aggregate by, a set of filters (cut), a and a sort (field_name:direction), as well as page and page_size.