Merge pull request #102 from pescadores/1.1-release-docs
1.1 release docs
bmcfee authored Aug 25, 2017
2 parents df01c72 + d345400 commit ff7fb84
Showing 12 changed files with 339 additions and 102 deletions.
29 changes: 27 additions & 2 deletions README.md
@@ -7,9 +7,34 @@ pescador
[![Documentation Status](https://readthedocs.org/projects/pescador/badge/?version=latest)](https://readthedocs.org/projects/pescador/?badge=latest)
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.400700.svg)](https://doi.org/10.5281/zenodo.400700)

Pescador is a library for streaming (numerical) data, primarily for use in machine learning applications.

Pescador addresses the following use cases:

- **Hierarchical sampling**
- **Out-of-core learning**
- **Parallel streaming**

These use cases arise in the following common scenarios:

- Say you have three data sources `(A, B, C)` that you want to sample.
For example, each data source could contain all the examples of a particular category.

Pescador can dynamically interleave these sources to provide a randomized stream `D <- (A, B, C)`.
The distribution over `(A, B, C)` need not be uniform: you can specify any distribution you like!

- Now, say you have 3000 data sources, each of which may contain a large number of samples. Maybe that's too much data to fit in RAM at once.

Pescador makes it easy to interleave these sources while maintaining a small `working set`.
Not all sources are simultaneously active, but Pescador manages the working set so you don't have to.

- If loading data incurs substantial latency (e.g., due to storage access or pre-processing), this can be a problem.

Pescador can seamlessly move data generation into a background process, so that your main thread can continue working (see the sketch below).
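
A minimal sketch of these patterns. The `Streamer`, `Mux`, and `ZMQStreamer` names appear in this release's docs, but the exact `Mux` signature (in particular the second positional argument, the number of simultaneously active streams) is an assumption based on the 1.x API:

```python
import numpy as np
import pescador

def noise_source(scale):
    # An infinite generator: each sample is a dict of np.ndarrays.
    while True:
        yield dict(X=scale * np.random.randn(1, 2))

# One Streamer per data source (stand-ins for A, B, C).
streamers = [pescador.Streamer(noise_source, s) for s in (0.1, 1.0, 10.0)]

# Interleave the sources into one randomized stream, keeping a small
# working set: at most 2 sources active, ~Poisson(64) samples each.
stream = pescador.Mux(streamers, 2, rate=64)

# Optionally move sample generation into a background process.
stream = pescador.ZMQStreamer(stream)

for sample in stream.generate(max_batches=10):
    pass  # train on sample['X'] here
```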


Want to learn more? [Read the docs!](http://pescador.readthedocs.org)


Installation
20 changes: 20 additions & 0 deletions docs/changes.rst
@@ -1,6 +1,26 @@
Changes
=======

v1.1.0
------
This is primarily a maintenance release, and will be the last in the 1.x series.

- `#97`_ Fixed an infinite loop in `Mux`
- `#91`_ Changed the default timeout for `ZMQStreamer` to 5 seconds.
- `#90`_ Fixed conda-forge package distribution
- `#89`_ Refactored internals of the `Mux` class toward the 2.x series
- `#88`_, `#100`_ Improved unit tests
- `#73`_, `#95`_ Updated documentation

.. _#73: https://github.com/pescadores/pescador/pull/73
.. _#88: https://github.com/pescadores/pescador/pull/88
.. _#89: https://github.com/pescadores/pescador/pull/89
.. _#90: https://github.com/pescadores/pescador/pull/90
.. _#91: https://github.com/pescadores/pescador/pull/91
.. _#95: https://github.com/pescadores/pescador/pull/95
.. _#97: https://github.com/pescadores/pescador/pull/97
.. _#100: https://github.com/pescadores/pescador/pull/100

v1.0.0
------
This release constitutes a major revision over the 0.x series, and the new interface
1 change: 1 addition & 0 deletions docs/conf.py
@@ -32,6 +32,7 @@
'sphinx.ext.autodoc',
'sphinx.ext.autosummary',
'sphinx.ext.intersphinx',
'sphinx.ext.mathjax',
# 'sphinx.ext.coverage',
# 'sphinx.ext.viewcode',
# 'sphinx.ext.doctest',
31 changes: 18 additions & 13 deletions docs/example1.rst
@@ -1,18 +1,20 @@
.. _example1:

Streaming data
==============

This example will walk through the basics of using pescador to stream samples from a generator.

Our running example will be learning from an infinite stream of stochastically perturbed samples from the Iris dataset.


Sample generators
-----------------
Streamers are intended to transparently pass data without modifying it.
However, Pescador assumes that Streamers produce output in a particular format.
Specifically, each sample is expected to be a Python dictionary where each value contains an `np.ndarray`.
For unsupervised learning (e.g., SKLearn's `MiniBatchKMeans`), the data might contain only one key: `X`.
For supervised learning (e.g., `SGDClassifier`), valid data would contain both `X` and `Y` keys, both of equal length.
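
For instance, a minimal valid sample for the supervised case might look like this (illustrative values only):

.. code-block:: python

    import numpy as np

    # One sample: a dict mapping string keys to np.ndarrays.
    sample = dict(X=np.random.randn(4), Y=np.array(0))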

Here's a simple example generator that draws random samples of data from the Iris dataset, and adds Gaussian noise to the features.

@@ -43,7 +45,6 @@
sample['Y'] is a scalar `np.ndarray` of shape `()`
'''
n, d = X.shape
while True:
@@ -53,16 +54,20 @@
yield dict(X=X[i] + noise, Y=Y[i])
In the code above, `noisy_samples` is a generator that can be sampled indefinitely because `noisy_samples` contains an infinite loop.
Each iterate of `noisy_samples` will be a dictionary containing the sample's features and labels.


Streamers
---------
Generators in Python have a couple of limitations for common stream learning pipelines.
First, once instantiated, a generator cannot be "restarted".
Second, an instantiated generator cannot be serialized directly, which makes generators difficult to use in distributed computation environments.

Pescador provides the `Streamer` class to circumvent these issues.
`Streamer` simply provides an object container for an uninstantiated generator (and its parameters), and an access method `generate()`.
Calling `generate()` multiple times on a `Streamer` object is equivalent to restarting the generator, and therefore provides a simple way to implement multiple-pass streams.
Similarly, because `Streamer` can be serialized, it is simple to pass a streamer object to a separate process for parallel computation.

Here's a simple example, using the generator from the previous section.
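
A minimal sketch, assuming the `Streamer(generator, *args)` constructor and the `generate()` method described above (`noisy_samples`, `X`, and `Y` come from the previous section):

.. code-block:: python

    import pescador

    # Wrap the uninstantiated generator and its arguments.
    streamer = pescador.Streamer(noisy_samples, X, Y)

    # Each call to generate() restarts the underlying generator.
    for sample in streamer.generate(max_batches=10):
        print(sample['X'].shape, sample['Y'])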

8 changes: 5 additions & 3 deletions docs/example2.rst
@@ -1,13 +1,14 @@
.. _example2:

This example demonstrates how to re-use and multiplex streamers.

We will assume a working understanding of the simple example in the previous section.

Stream re-use and multiplexing
==============================

The `Mux` streamer provides a powerful interface for randomly interleaving samples from multiple input streams.
`Mux` can also dynamically activate and deactivate individual `Streamers`, which allows it to operate on a bounded subset of streams at any given time.

As a concrete example, we can simulate a mixture of noisy streams with differing variances.

@@ -66,7 +67,8 @@
print('Test accuracy: {:.3f}'.format(accuracy_score(Y[test], Ypred)))
In the above example, each `Streamer` in `streams` can make infinitely many samples.
The `rate=64` argument to `Mux` says that each stream should produce some `n` samples, where `n` is sampled from a Poisson distribution with rate `rate`.
When a stream exceeds its bound, it is deactivated, and a new streamer is activated to fill its place.

Setting `rate=None` disables the random stream bounding, and `mux()` simply runs each active stream until exhaustion.
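
A minimal sketch of this kind of mixture, assuming the `Mux(streamers, k, rate=...)` calling convention (the second positional argument, the number of simultaneously active streams, is an assumption from the 1.x API):

.. code-block:: python

    import numpy as np
    import pescador

    def noise(scale):
        # Infinite stream of Gaussian noise with a given scale.
        while True:
            yield dict(X=scale * np.random.randn(1, 2))

    # One streamer per variance level.
    streams = [pescador.Streamer(noise, s) for s in (0.5, 1.0, 2.0)]

    # At most 2 streams active; each activation yields ~Poisson(64) samples.
    mux = pescador.Mux(streams, 2, rate=64)

    for sample in mux.generate(max_batches=5):
        print(sample['X'])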

7 changes: 5 additions & 2 deletions docs/example3.rst
@@ -3,7 +3,11 @@
Sampling from disk
==================

A common use case for `pescador` is to sample data from a large collection of existing archives.
As a concrete example, consider the problem of fitting a statistical model to a large corpus of musical recordings.
When the corpus is sufficiently large, it is impossible to fit the entire set in memory while estimating the model parameters.
Instead, one can pre-process each song to store pre-computed features (and, optionally, target labels) in a *numpy zip* `NPZ` archive.
The problem then becomes sampling data from a collection of `NPZ` archives.

Here, we will assume that the pre-processing has already been done so that each `NPZ` file contains a numpy array of features `X` and labels `Y`.
We will define infinite samplers that pull `n` examples per iterate.
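
A minimal sketch of such a sampler (the function name and sampling scheme are illustrative, not from this release's docs):

.. code-block:: python

    import numpy as np

    def npz_sampler(npz_file, n):
        # Infinitely sample n-example minibatches from one NPZ archive.
        data = np.load(npz_file)
        X, Y = data['X'], data['Y']
        while True:
            idx = np.random.randint(0, len(X) - n + 1)
            yield dict(X=X[idx:idx + n], Y=Y[idx:idx + n])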
@@ -86,7 +90,6 @@

Alternatively, *memory-mapping* can be used to only load data as needed, but requires that the data be stored in NPY format.
yield dict(X=X[idx:idx + n],
Y=Y[idx:idx + n])
# Using this streamer is similar to the first example, but now you need a separate
# NPY file for each X and Y
npy_x_files = []  # LIST OF PRE-COMPUTED NPY FILES (X)
13 changes: 13 additions & 0 deletions docs/examples.rst
@@ -0,0 +1,13 @@
.. _examples:

**************
Basic examples
**************

.. toctree::
:maxdepth: 2

example1
example2
example3
bufferedstreaming
