Merge pull request #123 from pescadores/2.0.0rc0
docs and version updates for 2.0 pre-release [ci skip]
bmcfee authored Jan 29, 2018
2 parents 8c5f998 + a6c7bac commit c634d8e
Showing 6 changed files with 66 additions and 36 deletions.
49 changes: 37 additions & 12 deletions docs/changes.rst
@@ -1,8 +1,33 @@
Changes
=======

v1.1.0
------
v2.0.0 (2018-01-29)
-------------------
This release is the second major revision of the pescador architecture, and
includes many substantial changes to the API.

- `#103`_ Deprecation and refactor of the `Mux` class. Its functionality has
been superseded by new classes `StochasticMux`, `ShuffledMux`, `ChainMux`,
and `RoundRobinMux`.
- `#109`_ Removed deprecated features from the 1.x series:
- `BufferedStreamer` class
- `Streamer.tuples` method
- `#111`_ Removed the internally-facing `StreamActivator` class
- `#113`_ Bugfix: multiply-activated streamers (and muxes) no longer share state
- `#116`_ `Streamer.cycle` now respects the `max_iter` parameter
- `#121`_ Added minimum dependency version requirements
- `#106`_ Added more advanced examples in the documentation

.. _#103: https://github.com/pescadores/pescador/pull/103
.. _#109: https://github.com/pescadores/pescador/pull/109
.. _#111: https://github.com/pescadores/pescador/pull/111
.. _#113: https://github.com/pescadores/pescador/pull/113
.. _#116: https://github.com/pescadores/pescador/pull/116
.. _#121: https://github.com/pescadores/pescador/pull/121
.. _#106: https://github.com/pescadores/pescador/pull/106

v1.1.0 (2017-08-25)
-------------------
This is primarily a maintenance release, and will be the last in the 1.x series.

- `#97`_ Fixed an infinite loop in `Mux`
@@ -21,8 +46,8 @@ This is primarily a maintenance release, and will be the last in the 1.x series.
.. _#97: https://github.com/pescadores/pescador/pull/97
.. _#100: https://github.com/pescadores/pescador/pull/100

v1.0.0
------
v1.0.0 (2017-03-18)
-------------------
This release constitutes a major revision over the 0.x series, and the new interface
is not backward-compatible.

@@ -55,24 +80,24 @@ is not backward-compatible.
.. _#34: https://github.com/pescadores/pescador/pull/34
.. _#23: https://github.com/pescadores/pescador/pull/23

v0.1.3
------
v0.1.3 (2016-07-28)
-------------------
- Added support for ``joblib>=0.10``

v0.1.2
------
v0.1.2 (2016-02-22)
-------------------

- Added ``pescador.mux`` parameter `revive`. Calling with `with_replacement=False, revive=True`
will use each seed at most once at any given time.
- Added ``pescador.zmq_stream`` parameter `timeout`. Setting this to a positive number will terminate
dangling worker threads after `timeout` is exceeded on join. See also: ``multiprocessing.Process.join``.

v0.1.1
------
v0.1.1 (2016-01-07)
-------------------

- ``pescador.mux`` now throws a ``RuntimeError`` exception if the seed pool is empty


v0.1.0
------
v0.1.0 (2015-10-20)
-------------------
Initial public release
6 changes: 3 additions & 3 deletions docs/example3.rst
@@ -44,7 +44,7 @@ Applying the `sample_npz` function above to a list of `npz_files`, we can make a
# Keep 32 streams alive at once
# Draw on average 16 patches from each stream before deactivating
mux_stream = pescador.Mux(streams, k=32, rate=16)
mux_stream = pescador.StochasticMux(streams, n_active=32, rate=16)
for batch in mux_stream(max_iter=1000):
# DO LEARNING HERE
@@ -97,7 +97,7 @@ Alternatively, *memory-mapping* can be used to only load data as needed, but req
streams = [pescador.Streamer(sample_npz, npy_x, npy_y, n)
for (npy_x, npy_y) in zip(npy_x_files, npy_y_files)]
# Then construct the `Mux` from the streams, as above
mux_streame = pescador.Mux(streams, k=32, rate=16)
# Then construct the `StochasticMux` from the streams, as above
mux_streame = pescador.StochasticMux(streams, n_active=32, rate=16)
...
9 changes: 5 additions & 4 deletions docs/intro.rst
@@ -55,12 +55,13 @@ Streaming Data

Multiplexing Data Streams
-------------------------
1. Pescador defines an object called a `Mux` for the purposes of multiplexing streams of data.
1. Pescador defines a family of multiplexer or `Mux` classes for the purposes of multiplexing streams of data.
For stochastic sampling applications, `ShuffledMux` and `StochasticMux` are the most useful classes.

2. `Mux` inherits from `Streamer`, which makes it both iterable and recomposable. Muxes allow you to
construct arbitrary trees of data streams. This is useful for hierarchical sampling.
2. `BaseMux` inherits from `Streamer`, which makes all muxes both iterable and recomposable.
Muxes allow you to construct arbitrary trees of data streams. This is useful for hierarchical sampling.

3. A `Mux` is initialized with a container of one or more iterables, and parameters to control the stochastic behavior of the object.
3. Muxes are initialized with a container of one or more streamers, and parameters to control the mux's sampling behavior.

4. As a subclass of `Streamer`, a `Mux` also transparently yields the stream flowing through it, i.e. :ref:`streaming-data`.
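
The "iterable and recomposable" property in the list above can be illustrated with a minimal pure-Python sketch. It does not use pescador: `RestartableStream`, `round_robin`, and `count_from` are hypothetical stand-ins written for this note, not pescador's `Streamer` or `RoundRobinMux`, and the real classes may behave differently in detail.

```python
class RestartableStream:
    """Hypothetical stand-in for a streamer: wraps a generator function
    and its arguments so that every iteration starts a fresh generator."""

    def __init__(self, func, *args, **kwargs):
        self.func = func
        self.args = args
        self.kwargs = kwargs

    def __iter__(self):
        # A new generator is built on each call, so the stream is reusable.
        return self.func(*self.args, **self.kwargs)


def round_robin(*streams):
    """Yield one item from each stream in turn until all are exhausted."""
    iterators = [iter(s) for s in streams]
    while iterators:
        still_alive = []
        for it in iterators:
            try:
                yield next(it)
            except StopIteration:
                # Drop exhausted streams from the rotation.
                continue
            still_alive.append(it)
        iterators = still_alive


def count_from(start, n):
    """Toy data source yielding dictionaries, as the docs' examples do."""
    for i in range(start, start + n):
        yield {'X': i}


a = RestartableStream(count_from, 0, 2)
b = RestartableStream(count_from, 10, 2)
c = RestartableStream(count_from, 100, 2)

# A mux over streams is itself a stream, so muxes can be nested into a tree.
inner = RestartableStream(round_robin, a, b)
outer = RestartableStream(round_robin, inner, c)

first_pass = [item['X'] for item in outer]
second_pass = [item['X'] for item in outer]  # iterating again restarts the tree
```

Because each wrapper stores the generator function rather than a live generator, the nested tree can be iterated more than once and yields the same sequence each time.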

24 changes: 14 additions & 10 deletions docs/why.rst
@@ -24,8 +24,8 @@ It can also be useful when dealing with data that has natural grouping substruct
For example, when modeling a large collection of audio files, each file may generate multiple observations, which will all be mutually correlated.
Hierarchical sampling can be useful in neutralizing this bias during the training process.

Pescador implements hierarchical sampling via the :ref:`Mux` abstraction.
In its simplest form, `Mux` takes as input a set of :ref:`Streamer` objects from which samples are drawn randomly.
Pescador implements hierarchical sampling through a family of :ref:`Mux` classes.
In its simplest form, the `ShuffledMux` takes as input a set of :ref:`Streamer` objects from which samples are drawn randomly.
This effectively generates data by a process similar to the following pseudo-code:

.. code-block:: python
@@ -35,18 +35,18 @@ This effectively generates data by a process similar to the following pseudo-cod
stream_id = random_choice(streamers)
yield next(streamers[stream_id])
The `Mux` object also lets you specify an arbitrary distribution over the set of streamers, giving you fine-grained control over the resulting distribution of samples.
The `ShuffledMux` object also lets you specify an arbitrary distribution over the set of streamers, giving you fine-grained control over the resulting distribution of samples.


The `Mux` object is also a `Streamer`, so sampling hierarchies can be nested arbitrarily deep.
Muxes are also `Streamers`, so sampling hierarchies can be nested arbitrarily deep.
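
As a runnable counterpart to the pseudo-code above, the following sketch draws from a set of generators according to a user-supplied distribution. It is written in plain Python for illustration and is not pescador's implementation: `weighted_mux` and `labeled` are hypothetical names, and dropping exhausted streams is an assumption made here to keep the example finite.

```python
import random


def weighted_mux(streams, weights=None, seed=None):
    """Draw items from `streams` at random, proportionally to `weights`.

    Illustrative only. Weights must be positive. Exhausted streams are
    dropped, and sampling stops once every stream is exhausted.
    """
    rng = random.Random(seed)
    active = [iter(s) for s in streams]
    if weights is None:
        weights = [1.0] * len(active)
    weights = list(weights)

    while active:
        # Pick a stream index according to the current weights.
        idx = rng.choices(range(len(active)), weights=weights, k=1)[0]
        try:
            yield next(active[idx])
        except StopIteration:
            # Remove the exhausted stream together with its weight.
            del active[idx]
            del weights[idx]


def labeled(name, n):
    """Toy stream yielding `n` dictionaries tagged with their source."""
    for i in range(n):
        yield {'source': name, 'index': i}


streams = [labeled('a', 3), labeled('b', 5)]
samples = list(weighted_mux(streams, weights=[0.25, 0.75], seed=0))
```

Since `weighted_mux` is itself a generator, its output can be passed as one of the `streams` of another `weighted_mux` call, which is the nesting idea described above.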

Out-of-core sampling
--------------------

Another common problem occurs when the size of the dataset is too large for the machine to fit in RAM simultaneously.
Going back to the audio example above, consider a problem where there are 30,000 source files, each of which generates 1GB of observation data, and the machine can only fit 100 source files in memory at any given time.

To facilitate this use case, the `Mux` object allows you to specify a maximum number of simultaneously active streams (i.e., the *working set*).
To facilitate this use case, the `StochasticMux` object allows you to specify a maximum number of simultaneously active streams (i.e., the *working set*).
In this case, you would most likely implement a `generator` for each file as follows:

.. code-block:: python
@@ -62,16 +62,20 @@ In this case, you would most likely implement a `generator` for each file as fol
streamers = [pescador.Streamer(sample_file, fname) for fname in ALL_30K_FILES]
for item in pescador.Mux(streamers, 100):
# Keep 100 streamers active at a time
# Replace a streamer after it has generated (on average) 8 samples
for item in pescador.StochasticMux(streamers, n_active=100, rate=8):
model.partial_fit(item['X'])
Note that data is not loaded until the generator is instantiated.
If you specify a working set of size `k=100`, then `Mux` will select 100 streamers at random to form the working set, and only sample data from within that set.
`Mux` will then randomly evict streamers from the working set and replace them with new streamers, according to its `rate` parameter.
If you specify a working set of size `n_active=100`, then `StochasticMux` will select 100 streamers at random to form the working set, and only sample data from within that set.
`StochasticMux` will then randomly evict streamers from the working set and replace them with new streamers, according to its `rate` parameter.
This results in a simple interface to draw data from all input sources but using limited memory.

`Mux` provides a great deal of flexibility over how streamers are replaced, what to do when streamers are exhausted, etc.
`StochasticMux` provides a great deal of flexibility over how streamers are replaced, what to do when streamers are exhausted, etc.

In addition to `ShuffledMux` and `StochasticMux`, there are also deterministic multiplexers `ChainMux` and
`RoundRobinMux`, which are useful when random sampling is undesirable.
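
The working-set idea described above can be sketched in plain Python. This is an illustration under stated assumptions, not pescador's algorithm: `working_set_mux` and `make_file_stream` are hypothetical names, eviction is modeled as a coin flip with probability `1 / rate` after each sample (so a stream yields about `rate` samples on average before being replaced), and every stream is assumed to yield at least one item.

```python
import random


def working_set_mux(factories, n_active, rate, max_iter, seed=None):
    """Sample from at most `n_active` open streams at any time.

    factories : list of zero-argument callables, each returning a fresh
        iterator, so that data is only loaded when a stream is opened.
    rate : expected number of samples drawn from a stream before eviction.
        ASSUMPTION: eviction is a coin flip with probability 1 / rate;
        pescador's own replacement rule may differ.
    """
    rng = random.Random(seed)
    # Open the initial working set.
    active = [factory() for factory in rng.sample(factories, n_active)]

    n_yielded = 0
    while n_yielded < max_iter:
        slot = rng.randrange(n_active)
        try:
            item = next(active[slot])
        except StopIteration:
            # Exhausted stream: open a replacement and draw again.
            active[slot] = rng.choice(factories)()
            continue
        yield item
        n_yielded += 1
        if rng.random() < 1.0 / rate:
            # Evict this stream and open a randomly chosen replacement.
            active[slot] = rng.choice(factories)()


opened = []  # records which "files" were opened, and in what order


def make_file_stream(file_id):
    def open_stream():
        opened.append(file_id)  # stands in for loading the file from disk
        return ({'X': (file_id, i)} for i in range(50))
    return open_stream


factories = [make_file_stream(i) for i in range(30)]
items = list(working_set_mux(factories, n_active=4, rate=8,
                             max_iter=200, seed=0))
```

Only the streams currently in `active` hold data, so memory use is bounded by `n_active` regardless of how many factories are supplied.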

Parallel processing
-------------------
@@ -86,7 +90,7 @@ Continuing the above example:
.. code-block:: python
:linenos:
mux_stream = pescador.Mux(streamers, 100)
mux_stream = pescador.StochasticMux(streamers, n_active=100, rate=8)
for item in pescador.ZMQStreamer(mux_stream):
model.partial_fit(item['X'])
10 changes: 5 additions & 5 deletions examples/frameworks/keras_example.py
@@ -191,11 +191,11 @@ def additive_noise(stream, key='X', scale=1e-1):
noisy_stream = pescador.Streamer(additive_noise, stream, 'X')

# Multiplex the two streamers together.
mux = pescador.Mux([stream, noisy_stream],
# Two streams, always active.
k=2,
# We want to sample from each stream infinitely.
rate=None)
mux = pescador.StochasticMux([stream, noisy_stream],
# Two streams, always active.
n_active=2,
# We want to sample from each stream infinitely.
rate=None)

# Buffer the stream into minibatches.
batches = pescador.buffer_stream(mux, batch_size)
4 changes: 2 additions & 2 deletions pescador/version.py
@@ -2,5 +2,5 @@
# -*- coding: utf-8 -*-
"""Version info"""

short_version = '1.1'
version = '1.1.0'
short_version = '2.0'
version = '2.0.0rc0'
