Add a way to automatically take train ID offsets into account #488

philsmt · 2024-02-22T14:57:40Z

At times, devices incur an offset between the train ID saved and the actual train ID the data belongs to. This can happen in particular for devices that synchronize the train ID in software rather than hardware, e.g. to integrate vendor software. While there is no easy automatic way to figure out this offset, once it has been determined empirically it would be useful if EXtra-data provided a mechanism to take it into account automatically.
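To make the problem concrete, here is a toy illustration with made-up train IDs. It is not EXtra-data code, and the sign convention (actual train ID = saved train ID + offset) is only an assumption for this sketch:

```python
# Toy illustration with made-up train IDs; not EXtra-data code.
# Assumed sign convention: actual train ID = saved train ID + offset.


def apply_train_offset(saved_train_ids, offset):
    """Return the train IDs the data actually belongs to."""
    return [tid + offset for tid in saved_train_ids]


# A hardware-synchronised reference source, and a software-synchronised
# source that stamps every record two trains too late.
reference_ids = [1000, 1001, 1002, 1003, 1004]
saved_ids = [1002, 1003, 1004, 1005, 1006]

corrected_ids = apply_train_offset(saved_ids, offset=-2)

# Trains present in both sources, before and after the correction.
common_before = sorted(set(reference_ids) & set(saved_ids))
common_after = sorted(set(reference_ids) & set(corrected_ids))
```

Without the correction only three trains appear to match (and they pair up the wrong data); with it, all five line up.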

An easy way would simply be as part of the SourceData interface, e.g. something like:

class SourceData:
    def with_train_offset(self, offset: int) -> "SourceData":
        # Proposed: return a SourceData carrying the given train ID offset.
        ...

On its own, though, this makes it hard to match such data against other sources with a different or no offset, because that matching happens within DataCollection. For some time now, DataCollection.select() has accepted KeyData objects (which would also carry the offset). This should be extended to SourceData objects anyway, and the selection could then pick up the offset from them.
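A minimal sketch of that idea, using toy stand-ins rather than the real classes. `ToySource`, `toy_select` and the source names are hypothetical and exist only for illustration; the real `select()` does considerably more:

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ToySource:
    """Hypothetical stand-in for SourceData, holding only what the sketch needs."""

    name: str
    train_offset: int = 0

    def with_train_offset(self, offset: int) -> "ToySource":
        # Return a modified copy so the original object stays untouched.
        return replace(self, train_offset=offset)


def toy_select(available, selection):
    """Hypothetical selection step accepting names or ToySource objects.

    Returns a mapping of source name to the offset to use for alignment.
    Plain names carry no offset information and default to 0.
    """
    offsets = {}
    for item in selection:
        if isinstance(item, ToySource):
            name, offset = item.name, item.train_offset
        else:
            name, offset = item, 0
        if name not in available:
            raise KeyError(name)
        offsets[name] = offset
    return offsets


# Made-up source names.
camera = ToySource("CAM/1").with_train_offset(-2)
selected = toy_select({"CAM/1", "XGM/1"}, [camera, "XGM/1"])
```

Here the offset travels with the object passed into the selection, so sources selected by plain name keep an offset of zero.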


philsmt · 2024-06-21T12:26:20Z

I spent a few moments looking into this, as it keeps coming up sporadically and makes train alignment quite annoying to do in EXtra-data. I have a rough idea of how to do it, but it will be quite intrusive in a couple of places, so I'd like to have some discussion before spending the time:

- SourceData and KeyData gain a new property train_offset describing the offset to apply.
- The existing property train_ids is renamed to the private _train_ids and continues to ignore the offset, with train_ids becoming a computed property that adds the offset for public consumption.
- Internal operations switch to using _train_ids, applying the offset themselves as needed. Examples:
  - KeyData.data_counts(), for instance, will use _train_ids to build the result and add train_offset at the very end.
  - The main data retrieval methods like KeyData.ndarray() are automatically fixed through KeyData.train_id_coordinates().
- DataCollection.train_ids remains untouched, but may need to correct its trains when it contains a source with an offset.
- DataCollection.select_trains builds the trains for alignment by source and by file, so the offset can easily be added there.
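Roughly, the split between raw and public train IDs could look like the following sketch. `ToyKeyData` and `collection_train_ids` are hypothetical stand-ins, not the existing classes, and taking the union of corrected IDs is just one assumed way a collection could correct its trains:

```python
class ToyKeyData:
    """Hypothetical stand-in for KeyData with per-train entry counts."""

    def __init__(self, raw_train_ids, counts, train_offset=0):
        self._train_ids = list(raw_train_ids)  # as saved in the file
        self._counts = list(counts)            # entries per saved train
        self.train_offset = train_offset

    @property
    def train_ids(self):
        # Public view: raw IDs shifted by the offset.
        return [tid + self.train_offset for tid in self._train_ids]

    def data_counts(self):
        # Work on the raw IDs internally and apply the offset
        # only at the very end, when labelling the result.
        raw = dict(zip(self._train_ids, self._counts))
        return {tid + self.train_offset: n for tid, n in raw.items()}


def collection_train_ids(keys):
    """Hypothetical collection-level train IDs: union of corrected IDs."""
    return sorted(set().union(*(k.train_ids for k in keys)))


# Made-up data: the camera stamps its records two trains too late.
xgm = ToyKeyData([1000, 1001, 1002, 1003], counts=[1, 1, 1, 1])
cam = ToyKeyData([1002, 1003, 1004], counts=[1, 2, 1], train_offset=-2)

cam_counts = cam.data_counts()
all_trains = collection_train_ids([xgm, cam])
```

The internal machinery keeps using `_train_ids` unchanged; only labelled results and alignment see the shifted values.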

Generally the idea is to not change the internal machinery in most places, except when it comes to alignment or when building a labelled result. I have one open design question:

What should the public API to manipulate the offset of internal SourceData objects in a DataCollection look like?
- Force usage of SourceData with offset for selection, and then copy this offset.
- DataCollection.with_train_offset(sources_or_glob: Union[list, str], offset: int) -> DataCollection?
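For discussion, the second option might behave like this sketch. `ToyCollection` is a hypothetical stand-in for DataCollection, the source names are made up, and matching with `fnmatch` is an assumption about how the glob could be interpreted:

```python
from fnmatch import fnmatchcase


class ToyCollection:
    """Hypothetical stand-in for DataCollection tracking per-source offsets."""

    def __init__(self, source_names, offsets=None):
        self.source_names = frozenset(source_names)
        self.offsets = dict(offsets or {})  # source name -> train offset

    def with_train_offset(self, sources_or_glob, offset):
        # Accept either a single glob pattern or a list of patterns/names.
        if isinstance(sources_or_glob, str):
            patterns = [sources_or_glob]
        else:
            patterns = list(sources_or_glob)
        matched = {
            name
            for name in self.source_names
            if any(fnmatchcase(name, pat) for pat in patterns)
        }
        if not matched:
            raise ValueError(f"no sources match {sources_or_glob!r}")
        new_offsets = {**self.offsets, **{name: offset for name in matched}}
        # Return a new collection rather than mutating this one.
        return ToyCollection(self.source_names, new_offsets)


run = ToyCollection(["CAM/1:output", "CAM/2:output", "XGM/1"])
shifted = run.with_train_offset("CAM/*", -2)
```

This keeps the offset bookkeeping in one place on the collection, at the cost of a second way to set it next to SourceData.with_train_offset().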

@takluyver @tmichela
