Starting to split off the state of the data stream into its own class #2242

Draft · wants to merge 1 commit into base: develop
docs/data_ingestion.rst: 71 additions, 0 deletions
Three of the new format data readers are the ``python``, ``SMILES``, and
:ref:`HDF5<sec:hdf5_data_reader>` readers. Several of these readers
(SMILES and :ref:`HDF5<sec:hdf5_data_reader>`) support the use of
:ref:`sample lists<sec:sample-lists>`.

"Really New" Data Subsystem
---------------------------

During execution, LBANN ingests one or more streams of data, with a
unique stream for each execution mode:

- training
- validation
- tournament
- testing
- inference

Note that execution modes should become more flexible, eventually
allowing modes to be arbitrarily named.
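
A minimal sketch of how per-mode streams might be organized, assuming
string-keyed modes (the names and types here are illustrative, not the
final API):

.. code-block:: c++

   #include <map>
   #include <string>

   // Placeholder for the per-stream state described below.
   struct data_stream_state {};

   // One stream per execution mode, keyed by name so that new modes
   // can be added without extending a fixed enum.
   using stream_map = std::map<std::string, data_stream_state>;

   int main() {
     stream_map streams;
     streams["training"]   = data_stream_state{};
     streams["validation"] = data_stream_state{};
     streams["inference"]  = data_stream_state{};
     return 0;
   }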

The data stream object is responsible for tracking the "count" /
state of its data stream for a given execution context. For bounded /
batched data streams, this is the current position within the stream
and the total number of passes over it (i.e. the index and the
epoch).

For infinite streams, the object maintains only the index / position
within the stream.

In both cases the object must track the "step" size (i.e. the
mini-batch size). Additionally, because the data stream is accessed
in parallel, it must also track each rank's position within the
stream as an offset.
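
A minimal sketch of this state, using hypothetical names (these are
not the actual LBANN classes):

.. code-block:: c++

   #include <cstdint>
   #include <vector>

   class data_stream_state {
   public:
     data_stream_state(std::uint64_t mini_batch_size, int num_ranks)
       : m_mini_batch_size(mini_batch_size), m_rank_offsets(num_ranks) {
       // Ranks read disjoint slices of each mini-batch.
       for (int r = 0; r < num_ranks; ++r) {
         m_rank_offsets[r] = r * (mini_batch_size / num_ranks);
       }
     }

     // Advance by one step (mini-batch). For a bounded stream of
     // `stream_size` samples, wrap the index and count a new epoch;
     // pass 0 for an infinite stream, which only tracks the index.
     void step(std::uint64_t stream_size) {
       m_index += m_mini_batch_size;
       if (stream_size > 0 && m_index >= stream_size) {
         m_index %= stream_size;
         ++m_epoch;
       }
     }

     // Position of a given rank within the current mini-batch.
     std::uint64_t position_of(int rank) const {
       return m_index + m_rank_offsets.at(rank);
     }

   private:
     std::uint64_t m_index = 0;                 // current position in the stream
     std::uint64_t m_epoch = 0;                 // completed passes (bounded streams)
     std::uint64_t m_mini_batch_size;           // the "step" size
     std::vector<std::uint64_t> m_rank_offsets; // per-rank offset in a mini-batch
   };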

..
   Data source class: The data source class tracks the stateful
   aspects of one logical stream of data. Data sources are either
   bounded or infinite. The class is responsible for keeping track
   of state with respect to that stream.


Sample list:
   Tracks how to retrieve a data set from the outside world. This is
   typically a set of file locations, one per sample, plus a count of
   how many samples are in the set.
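
A minimal sketch, assuming one file location per sample (the
``sample_list`` name here is illustrative):

.. code-block:: c++

   #include <cstddef>
   #include <string>
   #include <vector>

   // Where each sample lives in the outside world, plus the count.
   struct sample_list {
     std::vector<std::string> sample_locations; // e.g. one path per sample
     std::size_t size() const { return sample_locations.size(); }
   };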

Data coordinator:
   Responsible for managing one or more data streams for each
   execution context.
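
A minimal sketch of such a coordinator, reusing the hypothetical
``data_stream_state`` from above:

.. code-block:: c++

   #include <map>
   #include <stdexcept>
   #include <string>

   struct data_stream_state {}; // placeholder for the state sketched above

   // Owns one stream per execution context and hands out the right
   // one on request.
   class data_coordinator {
   public:
     void add_stream(const std::string& mode) {
       m_streams[mode] = data_stream_state{};
     }
     data_stream_state& stream(const std::string& mode) {
       auto it = m_streams.find(mode);
       if (it == m_streams.end()) {
         throw std::runtime_error("unknown execution mode: " + mode);
       }
       return it->second;
     }
   private:
     std::map<std::string, data_stream_state> m_streams;
   };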


Data reader / loader:
   Function to ingest bits from the outside world and place them into
   an in-memory object that is managed by the data coordinator.
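
A minimal sketch, assuming samples arrive as raw bytes (the function
name is illustrative):

.. code-block:: c++

   #include <fstream>
   #include <iterator>
   #include <string>
   #include <vector>

   // Ingest the raw bytes for one sample from the outside world into
   // an in-memory buffer that the data coordinator can manage.
   std::vector<char> read_sample(const std::string& location) {
     std::ifstream in(location, std::ios::binary);
     return std::vector<char>(std::istreambuf_iterator<char>(in),
                              std::istreambuf_iterator<char>());
   }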

Data store:
   In-memory repository for holding samples that have been read in.

io_data_buffer:
   Holds the sample being fetched, or a future for it.
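
A minimal sketch using ``std::future`` for the "future of it"
(illustrative, not the actual buffer class):

.. code-block:: c++

   #include <future>
   #include <vector>

   // A buffer slot either already holds the fetched sample, or holds
   // a future that will yield it.
   struct io_data_buffer {
     std::future<std::vector<char>> pending; // in-flight fetch, if any
     std::vector<char> sample;               // the fetched bytes

     // Block until the sample is available.
     const std::vector<char>& get() {
       if (pending.valid()) { sample = pending.get(); }
       return sample;
     }
   };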

Data packer:
   Copies data fields from Conduit nodes and maps them into Hydrogen
   matrices. Specific to a given data set.
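
A minimal sketch of the packing step, with a ``std::map`` standing in
for a Conduit node and a plain column-major buffer standing in for a
Hydrogen matrix:

.. code-block:: c++

   #include <cstddef>
   #include <map>
   #include <string>
   #include <vector>

   // Stand-in for a Conduit node: named data fields for one sample.
   using sample_fields = std::map<std::string, std::vector<float>>;

   // Copy one named field from each sample in a mini-batch into one
   // column of a column-major matrix with `height` rows.
   void pack_field(const std::string& field,
                   const std::vector<sample_fields>& mini_batch,
                   std::size_t height, std::vector<float>& matrix) {
     matrix.assign(height * mini_batch.size(), 0.0f);
     for (std::size_t col = 0; col < mini_batch.size(); ++col) {
       const auto& values = mini_batch[col].at(field);
       for (std::size_t row = 0; row < height && row < values.size(); ++row) {
         matrix[col * height + row] = values[row];
       }
     }
   }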

Data set:
   Composed of (sketched below):

   - data reader
   - data stream
   - sample list
   - data packer
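
A minimal sketch of that composition (all component types are the
hypothetical placeholders from the sketches above):

.. code-block:: c++

   // Placeholders for the pieces sketched earlier in this section.
   struct data_reader {};
   struct data_stream {};
   struct sample_list {};
   struct data_packer {};

   // A data set bundles the reader that ingests bytes, the stream
   // state, the list of sample locations, and the packer that maps
   // fields into matrices.
   struct data_set {
     data_reader reader;
     data_stream stream;
     sample_list samples;
     data_packer packer;
   };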