Commit
Starting to split off the state of the data stream into its own class
hierarchy.
bvanessen committed May 20, 2023
1 parent c9d643b commit 9b7bcca
Showing 8 changed files with 986 additions and 686 deletions.
71 changes: 71 additions & 0 deletions docs/data_ingestion.rst
@@ -37,3 +37,74 @@ Three of the new format data readers are the ``python``, ``SMILES``, and
Several of these readers (SMILES and
:ref:`HDF5<sec:hdf5_data_reader>`) support the use of :ref:`sample
lists<sec:sample-lists>`.

"Really New" Data Subsystem
---------------------------

During execution, LBANN will ingest one or more streams of data, with
a unique stream of data for each execution mode:

- training
- validation
- tournament
- testing
- inference

Note that execution modes should become more flexible, allowing them
to be arbitrarily named.
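
A minimal sketch of how arbitrarily named modes might each map to
their own stream, assuming a hypothetical ``data_stream`` state class
(names here are illustrative, not the existing LBANN API):

.. code-block:: c++

   #include <map>
   #include <memory>
   #include <string>

   // Placeholder for the stream-state class sketched below.
   class data_stream {
   public:
     virtual ~data_stream() = default;
   };

   // Keying streams by a mode *name* rather than a fixed enum lets new
   // execution modes be added without touching this class.
   class data_stream_registry {
   public:
     void add_stream(const std::string& mode,
                     std::unique_ptr<data_stream> stream) {
       m_streams[mode] = std::move(stream);
     }
     data_stream* get_stream(const std::string& mode) {
       auto it = m_streams.find(mode);
       return it == m_streams.end() ? nullptr : it->second.get();
     }
   private:
     std::map<std::string, std::unique_ptr<data_stream>> m_streams;
   };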

The data stream object is responsible for tracking the state of its
data stream for a given execution context. For bounded / batched data
streams, this is the current position within the stream and the total
number of passes over the stream (i.e., the index and the epoch).

For infinite streams, the object maintains only the index / position
within the stream.

In both cases, the object must track the "step" size (i.e., the
mini-batch size). Additionally, because the data stream will be
accessed in parallel, it must track each rank's position within the
stream as an offset.
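
A minimal sketch of what that state class hierarchy could look like;
the class and member names are assumptions for illustration, since the
actual split is still in progress:

.. code-block:: c++

   #include <cstddef>
   #include <vector>

   // State common to every stream: the "step" (mini-batch) size and a
   // per-rank offset, since the stream is accessed in parallel.
   class data_stream_state {
   public:
     data_stream_state(std::size_t mini_batch_size, std::size_t num_ranks)
       : m_mini_batch_size(mini_batch_size), m_rank_offsets(num_ranks, 0) {}
     virtual ~data_stream_state() = default;

     // Advance one rank's position by one mini-batch step.
     virtual void step(std::size_t rank) {
       m_rank_offsets.at(rank) += m_mini_batch_size;
     }
     std::size_t position(std::size_t rank) const {
       return m_rank_offsets.at(rank);
     }

   protected:
     std::size_t m_mini_batch_size;
     std::vector<std::size_t> m_rank_offsets;
   };

   // Bounded / batched stream: also counts passes over the stream.
   class bounded_stream_state : public data_stream_state {
   public:
     bounded_stream_state(std::size_t mini_batch_size,
                          std::size_t num_ranks,
                          std::size_t num_samples)
       : data_stream_state(mini_batch_size, num_ranks),
         m_num_samples(num_samples) {}

     void step(std::size_t rank) override {
       data_stream_state::step(rank);
       if (m_rank_offsets.at(rank) >= m_num_samples) { // simplified wrap
         m_rank_offsets.at(rank) %= m_num_samples;
         ++m_epoch;
       }
     }
     std::size_t epoch() const { return m_epoch; }

   private:
     std::size_t m_num_samples; // total samples in the bounded stream
     std::size_t m_epoch = 0;   // completed passes over the stream
   };

   // Infinite stream: only the index / position matters, so the base
   // class already provides everything needed.
   class infinite_stream_state : public data_stream_state {
     using data_stream_state::data_stream_state;
   };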

..
   Data source class file: The data source class tracks the stateful
   aspects of one logical stream of data. Data sources are either
   bounded or infinite. The class is responsible for keeping track of
   state with respect to its execution context.

Sample list:

Tracks how to retrieve a data set from the outside world. This is
typically a set of file locations, one for each sample, plus a count
of how many samples are in the set.

Data coordinator:

Responsible for managing one or more data streams for each execution
context.


Data reader / loader:

Function to ingest bits from the outside world and place them into an
in-memory object that is managed by the data coordinator.

Data store:

In-memory data repository for holding samples that have been read in.

io_data_buffer:

Holds the sample being fetched, or a future that will yield it.
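
A rough sketch of that idea, assuming Conduit nodes as the in-memory
sample type (the class shape is an assumption):

.. code-block:: c++

   #include <future>
   #include <utility>
   #include <conduit/conduit.hpp>

   // Holds either a sample that has already been fetched, or the
   // future of an asynchronous fetch that is still in flight.
   class io_data_buffer {
   public:
     void set_future(std::future<conduit::Node> f) {
       m_future = std::move(f);
     }

     // Resolves the future on first access, blocking if the fetch is
     // still in flight.
     conduit::Node& get_sample() {
       if (m_future.valid()) {
         m_sample = m_future.get();
       }
       return m_sample;
     }

   private:
     std::future<conduit::Node> m_future;
     conduit::Node m_sample;
   };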

Data packer:

Copies data fields from Conduit nodes and maps them into Hydrogen
matrices. Specific to a data set.
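
For example, a packer for one hypothetical data set might copy a flat
``responses`` float field out of each sample's Conduit node into one
column of a Hydrogen matrix (the field name and layout are assumptions
specific to this sketch):

.. code-block:: c++

   #include <cstddef>
   #include <vector>
   #include <conduit/conduit.hpp>
   #include <El.hpp> // Hydrogen

   // Pack each sample's "responses" field into one column of a
   // (height x num_samples) matrix, one sample per column.
   void pack_samples(std::vector<conduit::Node>& samples,
                     El::Matrix<float>& mat) {
     for (std::size_t col = 0; col < samples.size(); ++col) {
       const float* field = samples[col]["responses"].as_float_ptr();
       for (El::Int row = 0; row < mat.Height(); ++row) {
         mat.Set(row, static_cast<El::Int>(col), field[row]);
       }
     }
   }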

Data Set:

Composed of:

- data reader
- data stream
- sample list
- data packer
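
In other words, a data set could be little more than a container that
wires these pieces together (type names assumed for illustration):

.. code-block:: c++

   #include <memory>

   // Placeholder component types standing in for the pieces above.
   class data_reader {};
   class data_stream {};
   class sample_list {};
   class data_packer {};

   // A data set composes how to read bits (reader), the stateful
   // stream position (stream), where the samples live (sample list),
   // and how fields map into matrices (packer).
   struct data_set {
     std::unique_ptr<data_reader> reader;
     std::unique_ptr<data_stream> stream;
     std::unique_ptr<sample_list> samples;
     std::unique_ptr<data_packer> packer;
   };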
