Add a way to automatically take train ID offsets into account #488

Open
philsmt opened this issue Feb 22, 2024 · 1 comment

philsmt commented Feb 22, 2024

At times, devices incur an offset between the saved train ID and the train ID the data actually belongs to. This happens in particular for devices that synchronize the train ID in software rather than hardware, e.g. to integrate vendor software. While there is no easy automatic way to determine this offset, once it has been determined empirically it would be useful if EXtra-data provided a mechanism to take it into account automatically.

An easy way would be to make this part of the SourceData interface, e.g. something like:

class SourceData:
    def with_train_offset(self, offset: int) -> "SourceData":
        # Proposed method; body intentionally left open.
        ...

Still, this makes it hard to match such data to other sources with a different or no offset, as the matching happens within DataCollection. For a while now, DataCollection.select() has accepted KeyData objects (which would also carry the offset). This should be extended to SourceData objects anyway, and select() could then pick up the offset from the objects passed in.
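To make the idea concrete, here is a minimal sketch using a toy class rather than the real SourceData. `ToySource`, `raw_train_ids` and `common_trains` are hypothetical names, and the sign convention (corrected ID = saved ID + offset) is an assumption, not something decided in this issue.

```python
from dataclasses import dataclass, replace
from typing import List, Tuple


@dataclass(frozen=True)
class ToySource:
    """Hypothetical stand-in for SourceData; not the EXtra-data class."""
    name: str
    raw_train_ids: Tuple[int, ...]  # train IDs exactly as saved in the file
    train_offset: int = 0

    def with_train_offset(self, offset: int) -> "ToySource":
        # Return a modified copy so the original object stays usable.
        return replace(self, train_offset=offset)

    @property
    def train_ids(self) -> List[int]:
        # Assumed convention: corrected ID = saved ID + offset.
        return [tid + self.train_offset for tid in self.raw_train_ids]


def common_trains(*sources: ToySource) -> List[int]:
    """Train IDs present in every source after applying each offset."""
    ids = set(sources[0].train_ids)
    for src in sources[1:]:
        ids &= set(src.train_ids)
    return sorted(ids)


camera = ToySource("camera", (100, 101, 102, 103))
motor = ToySource("motor", (100, 101, 102, 103))

# Suppose the camera was found empirically to save IDs one train too late.
shifted = camera.with_train_offset(-1)
aligned = common_trains(shifted, motor)
```

The point of the sketch is that alignment has to look at the corrected IDs of every source involved, which is why the offset needs to reach the code doing the matching and not just the individual source object.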

philsmt commented Jun 21, 2024

I spent a few moments looking into this, as it keeps coming up sporadically and makes train alignment quite annoying to do in EXtra-data. I have a rough idea of how to do it, but it will be quite intrusive in a couple of places, so I'd like to have some discussion before spending the time:

  • SourceData and KeyData gain a new property train_offset describing the offset to apply
  • The existing property train_ids is renamed to the private _train_ids and continues to ignore the offset, with train_ids becoming a computed property that adds the offset (if any) for public consumption
  • Internal operations switch to using _train_ids, applying the offset themselves as needed. Examples:
    • KeyData.data_counts(), for example, will use _train_ids to build the result and add train_offset at the very end
    • The main data retrieval methods like KeyData.ndarray() are automatically fixed through KeyData.train_id_coordinates()
  • DataCollection.train_ids remains untouched, but may need to correct its trains when it contains a source with an offset.
  • DataCollection.select_trains builds the trains for alignment per source and per file, so the offset can easily be applied there.
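Roughly, the split between the raw and the public train IDs could look like the sketch below. `ToyKeyData` is a hypothetical stand-in, a plain dict replaces the labelled result the real data_counts() returns, and adding the offset to the saved ID is an assumed sign convention.

```python
class ToyKeyData:
    """Hypothetical stand-in for KeyData; names mirror the proposal only."""

    def __init__(self, train_ids, counts, train_offset=0):
        self._train_ids = list(train_ids)  # as saved, never offset-corrected
        self._counts = list(counts)        # number of entries per saved train
        self.train_offset = train_offset

    @property
    def train_ids(self):
        # Public view: saved IDs shifted by the offset (assumed convention).
        return [tid + self.train_offset for tid in self._train_ids]

    def data_counts(self):
        # Internal work is done on the raw IDs ...
        raw = dict(zip(self._train_ids, self._counts))
        # ... and the offset is applied only when labelling the result.
        return {tid + self.train_offset: n for tid, n in raw.items()}


kd = ToyKeyData([200, 201, 202], [1, 0, 2], train_offset=2)
public_ids = kd.train_ids
counts = kd.data_counts()
```

Keeping the raw IDs as the internal source of truth means file lookups stay unchanged, and only the places that produce labels or align sources need to know about the offset.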

Generally the idea is to not change the internal machinery in most places, except when it comes to alignment or when building a labelled result. I have one open design question:

  • What should the public API for manipulating the offset of internal SourceData objects in a DataCollection look like?
    • Require SourceData objects carrying an offset to be used for selection, and then copy that offset.
    • DataCollection.with_train_offset(sources_or_glob: Union[list, str], offset: int) -> DataCollection?
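For the second option, a minimal sketch of how a list-or-glob argument might be resolved, using the standard library's fnmatch. `ToyCollection`, its `offsets` mapping and the source names are all hypothetical, and whether unmatched names should raise is an open choice rather than part of the proposal.

```python
from fnmatch import fnmatchcase
from typing import Dict, List, Union


class ToyCollection:
    """Hypothetical stand-in for DataCollection: source name -> offset."""

    def __init__(self, offsets: Dict[str, int]):
        self.offsets = dict(offsets)

    def with_train_offset(
        self, sources_or_glob: Union[List[str], str], offset: int
    ) -> "ToyCollection":
        if isinstance(sources_or_glob, str):
            # A single string is treated as a glob over the source names.
            matched = [s for s in self.offsets
                       if fnmatchcase(s, sources_or_glob)]
        else:
            matched = list(sources_or_glob)

        unknown = [s for s in matched if s not in self.offsets]
        if not matched or unknown:
            # Assumed behaviour: fail loudly instead of silently doing nothing.
            raise ValueError(f"no matching sources for {sources_or_glob!r}")

        # Return a new collection; the original keeps its offsets.
        new_offsets = dict(self.offsets)
        for name in matched:
            new_offsets[name] = offset
        return ToyCollection(new_offsets)


run = ToyCollection({"LAB/CAM/ONE": 0, "LAB/CAM/TWO": 0, "LAB/MOTOR/X": 0})
shifted = run.with_train_offset("LAB/CAM/*", -1)
```

Compared with the first option, this keeps the offset a property of the collection, so it can be set without first pulling out and re-selecting individual source objects.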

@takluyver @tmichela
