Commit
Starting to split off the state of the data stream into its own class
hierarchy.
bvanessen committed May 20, 2023
1 parent c9d643b commit 9b7bcca
Showing 8 changed files with 986 additions and 686 deletions.
71 changes: 71 additions & 0 deletions docs/data_ingestion.rst
@@ -37,3 +37,74 @@ Three of the new format data readers are the ``python``, ``SMILES``, and
Several of these readers (SMILES and
:ref:`HDF5<sec:hdf5_data_reader>`) support the use of :ref:`sample
lists<sec:sample-lists>`.

"Really New" Data Subsystem
---------------------------

During execution, LBANN will ingest one or more streams of data, with
a unique stream of data for each execution mode:

- training
- validation
- tournament
- testing
- inference

Note that execution modes should become more flexible, allowing them
to be arbitrarily named.
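
A minimal sketch of how arbitrarily named modes might each map to
their own stream, assuming a hypothetical ``data_stream`` state class
(names here are illustrative, not the existing LBANN API):

.. code-block:: c++

   #include <map>
   #include <memory>
   #include <string>

   // Placeholder for the stream-state class sketched below.
   class data_stream {
   public:
     virtual ~data_stream() = default;
   };

   // Keying streams by a mode *name* rather than a fixed enum lets new
   // execution modes be added without touching this class.
   class data_stream_registry {
   public:
     void add_stream(const std::string& mode,
                     std::unique_ptr<data_stream> stream) {
       m_streams[mode] = std::move(stream);
     }
     data_stream* get_stream(const std::string& mode) {
       auto it = m_streams.find(mode);
       return it == m_streams.end() ? nullptr : it->second.get();
     }
   private:
     std::map<std::string, std::unique_ptr<data_stream>> m_streams;
   };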

The data stream object is responsible for tracking the state of its
data stream for a given execution context. For bounded / batched data
streams, this is the current position within the stream and the total
number of passes over the stream (i.e., the index and the epoch).

For infinite streams, the object maintains only the index / position
within the stream.

In both cases, the object must track the "step" size (i.e., the
mini-batch size). Additionally, because the data stream will be
accessed in parallel, it must track each rank's position within the
stream as an offset.
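
A minimal sketch of what that state class hierarchy could look like;
the class and member names are assumptions for illustration, since the
actual split is still in progress:

.. code-block:: c++

   #include <cstddef>
   #include <vector>

   // State common to every stream: the "step" (mini-batch) size and a
   // per-rank offset, since the stream is accessed in parallel.
   class data_stream_state {
   public:
     data_stream_state(std::size_t mini_batch_size, std::size_t num_ranks)
       : m_mini_batch_size(mini_batch_size), m_rank_offsets(num_ranks, 0) {}
     virtual ~data_stream_state() = default;

     // Advance one rank's position by one mini-batch step.
     virtual void step(std::size_t rank) {
       m_rank_offsets.at(rank) += m_mini_batch_size;
     }
     std::size_t position(std::size_t rank) const {
       return m_rank_offsets.at(rank);
     }

   protected:
     std::size_t m_mini_batch_size;
     std::vector<std::size_t> m_rank_offsets;
   };

   // Bounded / batched stream: also counts passes over the stream.
   class bounded_stream_state : public data_stream_state {
   public:
     bounded_stream_state(std::size_t mini_batch_size,
                          std::size_t num_ranks,
                          std::size_t num_samples)
       : data_stream_state(mini_batch_size, num_ranks),
         m_num_samples(num_samples) {}

     void step(std::size_t rank) override {
       data_stream_state::step(rank);
       if (m_rank_offsets.at(rank) >= m_num_samples) { // simplified wrap
         m_rank_offsets.at(rank) %= m_num_samples;
         ++m_epoch;
       }
     }
     std::size_t epoch() const { return m_epoch; }

   private:
     std::size_t m_num_samples; // total samples in the bounded stream
     std::size_t m_epoch = 0;   // completed passes over the stream
   };

   // Infinite stream: only the index / position matters, so the base
   // class already provides everything needed.
   class infinite_stream_state : public data_stream_state {
     using data_stream_state::data_stream_state;
   };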

..
   Data source class file: The data source class tracks the stateful
   aspects of one logical stream of data. Data sources are either
   bounded or infinite. The class is responsible for keeping track of
   state with respect to its execution context.

Sample list:

Tracks how to retrieve a data set from the outside world. This is
typically a set of file locations, one for each sample, plus a count
of how many samples are in the set.

Data coordinator:

Responsible for managing one or more data streams for each execution
context.


Data reader / loader:

Function to ingest bits from the outside world and place them into an
in-memory object that is managed by the data coordinator.

Data store:

In-memory data repository for holding samples that have been read in.

io_data_buffer:

Holds the sample being fetched, or a future that will yield it.
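
A rough sketch of that idea, assuming Conduit nodes as the in-memory
sample type (the class shape is an assumption):

.. code-block:: c++

   #include <future>
   #include <utility>
   #include <conduit/conduit.hpp>

   // Holds either a sample that has already been fetched, or the
   // future of an asynchronous fetch that is still in flight.
   class io_data_buffer {
   public:
     void set_future(std::future<conduit::Node> f) {
       m_future = std::move(f);
     }

     // Resolves the future on first access, blocking if the fetch is
     // still in flight.
     conduit::Node& get_sample() {
       if (m_future.valid()) {
         m_sample = m_future.get();
       }
       return m_sample;
     }

   private:
     std::future<conduit::Node> m_future;
     conduit::Node m_sample;
   };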

Data packer:

Copies data fields from Conduit nodes and maps them into Hydrogen
matrices. Specific to a data set.
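
For example, a packer for one hypothetical data set might copy a flat
``responses`` float field out of each sample's Conduit node into one
column of a Hydrogen matrix (the field name and layout are assumptions
specific to this sketch):

.. code-block:: c++

   #include <cstddef>
   #include <vector>
   #include <conduit/conduit.hpp>
   #include <El.hpp> // Hydrogen

   // Pack each sample's "responses" field into one column of a
   // (height x num_samples) matrix, one sample per column.
   void pack_samples(std::vector<conduit::Node>& samples,
                     El::Matrix<float>& mat) {
     for (std::size_t col = 0; col < samples.size(); ++col) {
       const float* field = samples[col]["responses"].as_float_ptr();
       for (El::Int row = 0; row < mat.Height(); ++row) {
         mat.Set(row, static_cast<El::Int>(col), field[row]);
       }
     }
   }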

Data Set:

Composed of:

- data reader
- data stream
- sample list
- data packer
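
In other words, a data set could be little more than a container that
wires these pieces together (type names assumed for illustration):

.. code-block:: c++

   #include <memory>

   // Placeholder component types standing in for the pieces above.
   class data_reader {};
   class data_stream {};
   class sample_list {};
   class data_packer {};

   // A data set composes how to read bits (reader), the stateful
   // stream position (stream), where the samples live (sample list),
   // and how fields map into matrices (packer).
   struct data_set {
     std::unique_ptr<data_reader> reader;
     std::unique_ptr<data_stream> stream;
     std::unique_ptr<sample_list> samples;
     std::unique_ptr<data_packer> packer;
   };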
