Start developer_docs directory with some content (ornladios#4385)
eisenhauer authored Oct 30, 2024
1 parent 2707576 commit c3cc3a2
Showing 3 changed files with 622 additions and 264 deletions.
310 changes: 310 additions & 0 deletions developer_docs/bp5format.md
@@ -0,0 +1,310 @@
# BP5 Metadata Marshaling, writer-side focus

BP5 metadata marshaling is based upon FFS, which provides the ability
to serialize a C-style pointer-based data structure (starting with a
base struct) and to deserialize it in-place on the receiving side.
This is what we'll do to encode BP5 metadata: create a custom C-style
struct on the writer side and then use FFS to make that same struct
available to the reader.

Normally, in order to use FFS, an application must fully describe
the base structure using an FMFieldList, where each element
describes a field in the structure, including the field's name,
basic type (integer, float, etc.), size and offset from the start
of the structure. In "normal" scenarios, like in SST, this is
straightforward because we're describing a structure that exists
at compile-time and all of those things are compile-time static.
However, ADIOS metadata represents information about variables
that we don't know about until run-time, so if we're going to use
FFS here, things have to be a bit more dynamic. In particular,
we'll represent ADIOS metadata with a "virtual" structure, one
whose description we'll construct on the fly and which will only
ever exist virtually, making up offsets as we go. We just have to
be careful about keeping things aligned appropriately because we
want this to land on the receiver and be appropriately aligned
there. (Normally the compiler takes care of this, but this
virtual structure is never seen by a compiler, so we're doing it.)
The field name that we specify to FFS is also important because we
use it to communicate a lot of information between writer and
reader. While it always contains the variable name, it also
encodes the variable type (local or global, atomic or array,
compressed, derived, etc.). Because the variable name only
appears in the metametadata (ffs format), this is a great place to
put more static information about the variable, specifically
anything that is fixed after definition and doesn't change on a
per-timestep basis. More on names later.
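
The offset bookkeeping described above can be sketched in a few lines. This is a minimal illustration, not the BP5 implementation: `VirtualField`, `VirtualStruct`, `AlignUp` and `AddField` are hypothetical names, and the sketch assumes each field's alignment equals its size, which is a simplification that holds for power-of-two atomic sizes.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for one FMField entry: name, type, size, offset.
struct VirtualField
{
    std::string Name;
    std::string Type;
    std::size_t Size;
    std::size_t Offset;
};

// Hypothetical "virtual" struct: its description plus the bytes that back it.
struct VirtualStruct
{
    std::vector<VirtualField> Fields;
    std::vector<unsigned char> Buffer;
};

// Round Offset up to the next multiple of Align (Align must be a power of two).
inline std::size_t AlignUp(std::size_t Offset, std::size_t Align)
{
    assert(Align != 0 && (Align & (Align - 1)) == 0);
    return (Offset + Align - 1) & ~(Align - 1);
}

// Append a field at the next suitably aligned offset and return that offset.
// ASSUMPTION: alignment == size, which is only valid for power-of-two sizes.
inline std::size_t AddField(VirtualStruct &S, const std::string &Name, const std::string &Type,
                            std::size_t Size)
{
    const std::size_t Offset = AlignUp(S.Buffer.size(), Size);
    S.Buffer.resize(Offset + Size, 0); // padding and the new field are zero-filled
    S.Fields.push_back({Name, Type, Size, Offset});
    return Offset;
}
```

Adding a 1-byte field followed by an 8-byte field places the second at offset 8, leaving 7 bytes of padding, which is the same layout a compiler would choose for a real struct under these assumptions.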

To accomplish managing the structure on the writer side, we
principally track two things, the FMFieldList that represents the
description of the virtual struct, and a malloc'd region where we
build the virtual struct itself. While the description is
interpreted by FFS, the most important thing for BP5 to remember
is this field's offset because that's where the (meta)data will
go. When we Marshal a simple atomic value (local or global), we
calculate an appropriately aligned new offset in the buffer, add
to the FMFieldList (maintained in Info.MetaFields on the writer)
and copy the data into the virtual field at that offset in the
buffer. On future timesteps, the field already exists, so we just
use the offset and copy the data into the buffer. Arrays are a
bit more complex, but let's start with the simple case. FFS
supports substructures, i.e. fields which are themselves
structures, and we use that feature for all array representations.
There are several things that may change on a per-timestep basis
for arrays, including Shape, Count and Offset values (which are
themselves arrays), and we also need to track the location of the
related data block (offset in this rank's data segment). Except
for Shape (which we assume is set for at least this timestep), all
of these things are per-block.

Back to FFS capabilities for a moment. FFS's pointer-based
structures include dynamically-sized arrays, and the size of those
arrays must be specified by an integer-typed field in that
structure. There are three different array lengths required here.
Shape is of length Dims (how many dimensions the array has),
DataBlockLocation is of length BlockCount (how many blocks were
written on this rank), and for Count and Offsets we must have
those per-block, so the length is Dims*BlockCount. To satisfy
FFS's constraints, that means we must have integer fields
representing all three lengths in the array metadata struct, and
we need pointers to the dynamic arrays representing Shape, Count,
Offsets, and DataBlockLocation. These are the BASE_FIELDS below
and the FFS FMField entries are BASE_FIELD_ENTRIES in BP5Base.cpp.
```
#define BASE_FIELDS \
size_t Dims; /* How many dimensions does this array have */ \
size_t BlockCount; /* How many blocks are written */ \
size_t DBCount; /* Dims * BlockCount */ \
size_t *Shape; /* Global dimensionality [Dims] NULL for local */ \
size_t *Count; /* Per-block Counts [DBCount] */ \
size_t *Offsets; /* Per-block Offsets [DBCount] NULL for local */ \
size_t *DataBlockLocation; /* Per-block Offset in PG [BlockCount] */
```
```
#define BASE_FIELD_ENTRIES \
{"Dims", "integer", sizeof(size_t), FMOffset(BP5Base::MetaArrayRec *, Dims)}, \
{"BlockCount", "integer", sizeof(size_t), FMOffset(BP5Base::MetaArrayRec *, BlockCount)}, \
{"DBCount", "integer", sizeof(size_t), FMOffset(BP5Base::MetaArrayRec *, DBCount)}, \
{"Shape", "integer[Dims]", sizeof(size_t), FMOffset(BP5Base::MetaArrayRec *, Shape)}, \
{"Count", "integer[DBCount]", sizeof(size_t), FMOffset(BP5Base::MetaArrayRec *, Count)}, \
{"Offset", "integer[DBCount]", sizeof(size_t), \
FMOffset(BP5Base::MetaArrayRec *, Offsets)}, \
{"DataBlockLocation", "integer[BlockCount]", sizeof(size_t), \
FMOffset(BP5Base::MetaArrayRec *, DataBlockLocation)},
```
While more complex array metadata entries are necessary, these
must be the first fields in those structures. While there can't
be a static struct declaration for all of the metadata, there is a
static declaration for the array metadata substructure,
`MetaArrayRec` below.
```
typedef struct _MetaArrayRec
{
BASE_FIELDS
} MetaArrayRec;
```
Mostly you'll see this used like this:
```
MetaArrayRec *MetaEntry = (MetaArrayRec *)((char *)(MetadataBuf) + Rec->MetaOffset);
```
This gives us a nice way of accessing the key fields in an array's
metadata entry.

So, what about more complex arrays? All of our compression
operators require the length of the compressed field as input to
the uncompress operator. Generally we don't include data block
length as part of metadata because it's easily calculated from the
Count values and the length of the data type, but in order to
support compression we have to communicate it from the writer to
the reader so we can uncompress. Therefore every field with an
operator has as its next field (after BASE_FIELDS) DataBlockSize.
Like DataBlockLocation, this is per block (and so its FFS
description also uses BlockCount). This arrangement is
represented by the `struct MetaArrayRecOperator` below. Note that
BP5 does not itself use the DataBlockSize in the metadata. The
size of the compressed data is returned from the compression
operator, and is used by BP5 to copy that data into the data
block, but after that it is only passed to the Uncompress operator
on the receiving side, so operators like MGard may choose to use
this differently.
```
typedef struct _MetaArrayRecOperator
{
BASE_FIELDS
size_t *DataBlockSize; // Per-block Lengths [BlockCount]
} MetaArrayRecOperator;
```
The last case is arrays that also have Min/Max stats associated
with them. Since this can be combined with operators, that gives
us two more possible structs for array metadata, a plain array
with Min/Max or an array with an operator and Min/Max, these are
represented by the structs `MetaArrayRecMM` and
`MetaArrayRecOperatorMM` below. Note that MinMax in that struct is
a `char*`, but obviously the data type of Min/Max depends upon the
element type of the array. How does that work? The actual size
in bytes of the MinMax array is `BlockCount * sizeof(array element) * 2`,
but in order to avoid introducing yet another integer-typed
size value into the structure we've gone to some effort in order
to leverage the existing BlockCount value. In particular, there
are a number of FMField lists for the MM and OperatorMM arrays,
each giving FFS a different element size for the MinMax array.
ADIOS types of size 1 use `MetaArrayRecMM1List`, those of size 2 use
`MetaArrayRecMM2List`, etc., up to `MetaArrayRecMM16List`, which would
be used by long double. Note that BP5 doesn't define or support
MinMax for string, complex, or structure types.
```
typedef struct _MetaArrayRecMM
{
BASE_FIELDS
char *MinMax; // char[TYPESIZE][BlockCount] varies by type
} MetaArrayRecMM;
typedef struct _MetaArrayRecOperatorMM
{
BASE_FIELDS
size_t *DataBlockSize; // Per-block Lengths [BlockCount]
char *MinMax; // char[TYPESIZE][BlockCount] varies by type
} MetaArrayRecOperatorMM;
```
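As an illustration of how a typed Min/Max pair might be pulled back out of the untyped `char *MinMax` buffer, here is a small sketch. The helper names are hypothetical, and the ordering within each block's slot (min first, then max) is an assumption made for the example; the text above only fixes the total size.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Total bytes in the MinMax array: one (min, max) pair per block.
inline std::size_t MinMaxBytes(std::size_t BlockCount, std::size_t ElemSize)
{
    return BlockCount * ElemSize * 2;
}

// Read the min and max for one block out of the untyped MinMax buffer.
// ASSUMPTION: each block's slot holds the min first, then the max.
template <class T>
void GetBlockMinMax(const char *MinMax, std::size_t Block, T &Min, T &Max)
{
    assert(MinMax != nullptr);
    const char *Slot = MinMax + Block * 2 * sizeof(T);
    // memcpy sidesteps alignment and aliasing concerns on the char buffer.
    std::memcpy(&Min, Slot, sizeof(T));
    std::memcpy(&Max, Slot + sizeof(T), sizeof(T));
}
```

For an 8-byte element type each block occupies 16 bytes, which matches the `char[16][BlockCount]` FFS type used by the MM8 field lists.
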
For each of the array variations above, when we add the field
associated with that array to the metadata field list, we specify
the appropriate FieldList in the FFS "field_type" value, and
allocate space for the relevant structure in the virtual metadata
struct we're building. (Example MetaArrayRecOperatorMM8List below.)
```
static FMField MetaArrayRecOperatorMM8List[] = {
BASE_FIELD_ENTRIES
{"DataBlockSize", "integer[BlockCount]", sizeof(size_t),
FMOffset(BP5Base::MetaArrayRecOperator *, DataBlockSize)},
{"MinMax", "char[16][BlockCount]", 1, FMOffset(BP5Base::MetaArrayRecOperatorMM *, MinMax)},
{NULL, NULL, 0, 0}};
```
We mentioned field names above. We actually encode a lot of
information into the FFS field names, including the variable name,
shape, element_size, ADIOS type, any operator that might be
applied, the name of the substructure (if the array is a struct
type), and even the expression that is to be used for derived
variables. These are all encoded in different ways; for example,
the basic shape of the variable is encoded in the three-letter
prefix of the FFS field name: GlobalValue = "BPg", GlobalArray =
"BPG", JoinedArray = "BPJ", LocalValue = "BPl", LocalArray = "BPL".
The details of the encoding are buried in the logic, but the important
bit is knowing that there's a lot of information there and some of
it (like the expression) is base64 encoded to avoid having special
characters in the FFS field name. From the BP5 point of view,
anything that can be encoded in the field name is a good thing
because it travels in the metametadata, not the metadata, so it
only gets moved around if the field set changes.
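
The prefix part of that encoding is simple enough to sketch. The following only covers the five three-letter prefixes listed above; `ShapeKind` and `ShapeFromFieldName` are hypothetical names for illustration, and nothing after the prefix is interpreted because the rest of the encoding is not specified here.

```cpp
#include <cassert>
#include <cstring>

enum class ShapeKind
{
    GlobalValue,
    GlobalArray,
    JoinedArray,
    LocalValue,
    LocalArray,
    Unknown
};

// Map the three-letter prefix of an FFS field name to the variable's shape.
inline ShapeKind ShapeFromFieldName(const char *FieldName)
{
    struct Entry
    {
        const char *Prefix;
        ShapeKind Kind;
    };
    static const Entry Table[] = {{"BPg", ShapeKind::GlobalValue},
                                  {"BPG", ShapeKind::GlobalArray},
                                  {"BPJ", ShapeKind::JoinedArray},
                                  {"BPl", ShapeKind::LocalValue},
                                  {"BPL", ShapeKind::LocalArray}};
    if (FieldName == nullptr || std::strlen(FieldName) < 3)
        return ShapeKind::Unknown;
    for (const Entry &E : Table)
    {
        if (std::strncmp(FieldName, E.Prefix, 3) == 0) // prefixes are case-sensitive
            return E.Kind;
    }
    return ShapeKind::Unknown;
}
```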

Speaking of changes, there are some details that are omitted above
to get the main points across, but let's talk about other details.
First, when you put a first block of an array, we fill out the
Dims field, init BlockCount to 1, DBCount (the `Dims*BlockCount`
value) to Dims and then we malloc memory to hold a copy of the
Shape, Count and Offset values. (We need to copy these anyway as
part of serialization as they must be captured at the time of Put,
so we can't, say, just reference the values in the VariableBase
class.) For LocalArrays, the Shape value stays at a NULL pointer,
as does the Offsets value (the ADIOS Start). If after the first
there's another Put() on that variable, we add 1 to BlockCount,
increment DBCount by Dims, and realloc() the Count and Offset arrays
so that we can add the new Count and Offset values after the ones
that are already there. This means that, numbering blocks from 0,
the Count values for block 1 start at `Count[Dims]`, for block 2
they start at `Count[2*Dims]`, etc. At the
end of the timestep after using FFSencode() to serialize the
metadata, `FMfree_var_rec_elements()` is used to free() all these
subarrays that we've malloc'd. It understands the structure of
our entire Metadata structure, walks the field list and
deallocates appropriately. Once this has been done, we can
memset() the whole metadata structure back to zeros and we're
ready to start again. (All pointers NULL and counts are zero.)
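
The per-block growth of the Count and Offsets arrays might look roughly like the sketch below. It mirrors the `BASE_FIELDS` layout but is not the BP5 code: `AppendValues` and `AppendBlock` are hypothetical helpers, the Shape copy is omitted, and a failed allocation midway is not rolled back.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Same member layout as BASE_FIELDS in the text above.
struct MetaArrayRec
{
    std::size_t Dims;
    std::size_t BlockCount;
    std::size_t DBCount;
    std::size_t *Shape;
    std::size_t *Count;
    std::size_t *Offsets;
    std::size_t *DataBlockLocation;
};

// Hypothetical helper: grow Array from OldLen to OldLen + AddLen elements and
// copy Src into the new tail. Returns false if realloc fails.
inline bool AppendValues(std::size_t *&Array, std::size_t OldLen, const std::size_t *Src,
                         std::size_t AddLen)
{
    void *Grown = std::realloc(Array, (OldLen + AddLen) * sizeof(std::size_t));
    if (Grown == nullptr)
        return false;
    Array = static_cast<std::size_t *>(Grown);
    std::memcpy(Array + OldLen, Src, AddLen * sizeof(std::size_t));
    return true;
}

// Record one more block for an array variable. Pass Offsets == nullptr for a
// local array so that Rec.Offsets stays NULL.
inline bool AppendBlock(MetaArrayRec &Rec, std::size_t Dims, const std::size_t *Count,
                        const std::size_t *Offsets, std::size_t DataLocation)
{
    assert(Dims > 0 && (Rec.BlockCount == 0 || Rec.Dims == Dims));
    if (!AppendValues(Rec.Count, Rec.DBCount, Count, Dims))
        return false;
    if (Offsets != nullptr && !AppendValues(Rec.Offsets, Rec.DBCount, Offsets, Dims))
        return false;
    if (!AppendValues(Rec.DataBlockLocation, Rec.BlockCount, &DataLocation, 1))
        return false;
    Rec.Dims = Dims;
    Rec.BlockCount += 1;
    Rec.DBCount += Dims; // DBCount is always Dims * BlockCount
    return true;
}

// Pointer to the Dims Count values of a given block (blocks numbered from 0).
inline const std::size_t *BlockCounts(const MetaArrayRec &Rec, std::size_t Block)
{
    assert(Block < Rec.BlockCount);
    return Rec.Count + Block * Rec.Dims;
}
```

Starting from a zeroed record (all pointers NULL, all counts zero), the first call behaves like malloc because `realloc` on a null pointer allocates fresh memory.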

When we do start again with the next timestep, we don't start from
scratch with a new Fieldlist and virtual structure, but instead
try to reuse the old one. The anticipation is that step-based HPC
applications are highly regular and the set of variables that are
output on step N+1 are likely the same as what they output for
step N. So when we get a Put() for a variable, we look up its
entry in internal bookkeeping and if it has an entry in the
structure we reuse it, putting the appropriate data in the virtual
structure as described above. This is fine if we write the exact
same set of variables in subsequent steps, but what if we don't?
Well, if we write a new variable, then the procedure above
happens, but we also take steps to make sure that we generate new
MetaMetaData (i.e. re-register the format with FFS). We do this
by setting the Info.MetaFormat value to NULL.
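
A rough sketch of that reuse-or-extend logic follows. Everything here is hypothetical scaffolding: `WriterState` stands in for the writer's internal bookkeeping, the `FormatValid` flag plays the role of `Info.MetaFormat` being non-NULL, and `Registrations` merely counts how often the format would have been re-registered with FFS.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>

// Hypothetical writer-side bookkeeping.
struct WriterState
{
    std::unordered_map<std::string, std::size_t> OffsetByName;
    std::size_t NextOffset = 0;
    bool FormatValid = false;      // stands in for Info.MetaFormat != NULL
    std::size_t Registrations = 0; // how many times metametadata was generated
};

// Return the offset of the variable's field, adding the field if it is new.
// ASSUMPTION: alignment == size, as for simple atomic values.
inline std::size_t LookupOrAddField(WriterState &W, const std::string &Name, std::size_t Size)
{
    assert(Size > 0);
    auto It = W.OffsetByName.find(Name);
    if (It != W.OffsetByName.end())
        return It->second; // seen on an earlier step: reuse the existing field
    const std::size_t Offset = (W.NextOffset + Size - 1) / Size * Size;
    W.OffsetByName.emplace(Name, Offset);
    W.NextOffset = Offset + Size;
    W.FormatValid = false; // field set changed, so the format must be regenerated
    return Offset;
}

// Called before encoding: regenerate the format only if the field set changed.
inline void EnsureFormat(WriterState &W)
{
    if (!W.FormatValid)
    {
        ++W.Registrations;
        W.FormatValid = true;
    }
}
```

On a run where every step writes the same variables, `EnsureFormat` does real work only once, which is the regularity the design is counting on.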

Handling a non-written variable is done differently. We don't
really want to bear the cost of new MetaMetaData frequently
(because MetaMetaData can be big), so instead we're willing to
bear the costs of not using some of the data in the virtual
structure. So if the app Puts an atomic variable on timestep N,
but skips it on N+1, we essentially leave that fraction of the
metadata buffer unused in N+1. It's transmitted or stored, but it
doesn't contain anything useful. But the reader still needs to
know that it wasn't written, so BP5 metadata carries with it a
bitmap showing if a variable that is part of the metadata has
actually been written and is valid. This bitmap, contained in the
BitField[BitFieldCount] fields in the MetadataFieldList is the
ultimate authority as to what has been written. Variables are
assigned an index in order when they are first entered into
metadata and if the bit at that index isn't set, that variable
wasn't written on that timestep.
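
The bitmap test itself is ordinary bit manipulation. The sketch below assumes 64-bit words and numbers bits from the least significant end of each word; both are assumptions for illustration, and `MarkWritten` and `WasWritten` are hypothetical helper names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// ASSUMPTION: the bitmap is stored as 64-bit words, least significant bit first.
constexpr std::size_t BitsPerWord = 64;

// Writer side: record that the variable with this index was written this step.
inline void MarkWritten(std::vector<std::uint64_t> &BitField, std::size_t Index)
{
    const std::size_t Word = Index / BitsPerWord;
    if (Word >= BitField.size())
        BitField.resize(Word + 1, 0); // grow the bitmap as new variables appear
    BitField[Word] |= std::uint64_t(1) << (Index % BitsPerWord);
}

// Reader side: a variable is valid on this step only if its bit is set.
inline bool WasWritten(const std::vector<std::uint64_t> &BitField, std::size_t Index)
{
    const std::size_t Word = Index / BitsPerWord;
    if (Word >= BitField.size())
        return false;
    return ((BitField[Word] >> (Index % BitsPerWord)) & 1u) != 0;
}
```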

Now, this does bring up a vulnerability with BP5. If an application
were to write a lot of variables on one step and then never use them
again, we might end up with a big metadata block that mostly carried
unused (junk) bytes. We have not yet run into this in a real
application, so it isn't specifically handled. In an ideal world, one
would look at the "occupancy rate" of metadata in EndStep() and make
a decision that for either this timestep or the next, we'd start from
scratch with an empty field list. There's a tradeoff here. Do this
too often and we've got big MetaMetadata costs, do it too little and
our metadata has a lot of useless bytes. Future work. Note that this
is mostly a writer-side thing to fix/optimize. The reader will
appropriately handle new metadata, including new metametadata.
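
Since this is unimplemented future work, the following is only one way such a check could be framed. It measures occupancy as the fraction of known variables whose written bit is set (not the fraction of bytes, which would be more accurate), assumes a bitmap of 64-bit words, and uses an arbitrary threshold of 0.25. The function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Fraction of known variables that were actually written on this step.
// ASSUMPTION: the written-bitmap is a vector of 64-bit words, LSB first.
inline double OccupancyRate(const std::vector<std::uint64_t> &BitField,
                            std::size_t VariableCount)
{
    if (VariableCount == 0)
        return 1.0; // nothing defined, so nothing is wasted
    std::size_t Written = 0;
    for (std::size_t i = 0; i < VariableCount; ++i)
    {
        const std::size_t Word = i / 64;
        if (Word < BitField.size() && ((BitField[Word] >> (i % 64)) & 1u) != 0)
            ++Written;
    }
    return static_cast<double>(Written) / static_cast<double>(VariableCount);
}

// Decide whether to start over with an empty field list.
// The default threshold is an arbitrary illustration, not a tuned value.
inline bool ShouldResetFieldList(double Rate, double Threshold = 0.25)
{
    return Rate < Threshold;
}
```

Choosing the threshold is the tradeoff described above: a high value regenerates metametadata often, a low value tolerates more junk bytes per step.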

The stuff above applies to ADIOS variables, but attributes are always
handled separately. In the initial FFS-marshalling implementation,
Attributes, while separate, were handled very similarly to variables.
That is, there was a field list and virtual structure maintained where
we entered attributes much like Global and local values are described
above. There was a metametadata generated for it and it was moved
around like other metametadata blocks. This old way of doing things
is still present in the code and gets used if `MarshalAttribute()` is
called by the engine. Engines that use this marshal all attributes
in `EndStep()`, calling MarshalAttribute for all attributes and only
doing this when some attribute has changed. The resulting Attribute
data always contains **all** the current attribute values, a situation
that works out well for engines like SST where readers might join
after timestep 0. The SST writer can save the most recent Attribute
data block and provide it to a newly-joined reader so that it has all
available attributes.

However, this encoding mechanism has some significant disadvantages
in almost all situations. This separation of metametadata and
metadata was designed for Variables, where the set of variables was
likely to be reused without changes repeatedly. However, attributes
aren't like that, particularly in the original situation where
attributes once set can never change. In that case we only marshal
when an attribute is added, so every change generates new
MetaMetadata, and MetaMetadata + Metadata size is always going to
be bigger than some simpler encoding mechanism. So, the BP5 file engine
now does things differently. It calls OnetimeMarshalAttribute() which
uses a simpler FFS representation for attributes with the attribute
"name" being part of the data, not part of the metametadata as it is
with variables. This means that the metametadata never changes, so we
don't have the same issues as with the prior approach. That
metametadata struct (BP5AttrStruct) describes a relatively simple
structure with two lists, one for attributes of any non-string type,
and the other a list of string and array-of-string attributes.
Generally we only want attributes to appear here when they change, so
the BP5Writer calls OnetimeMarshalAttribute whenever it gets the
NotifyEngineAttribute call (whenever an attribute changes). However
it also gets called in BeginStep if that step is the first ever
called, because some attributes may have been defined before the
engine was ever created. In BP5 file, attribute blocks then only
ever contain an attribute once, unless the attribute changes in which
case it will appear again. This is not such a good situation for SST
because of the late-coming-reader issue, so that still uses the old
marshaling mechanism.

