Currently the dataset class built into vak that loads splits from CMACBench uses the filename of the splits path to determine metadata about the split. We use this metadata directly when we need the duration of a frame, and indirectly when we need to determine which labelmap to load, based on the biosound group, unit, and ID (all three things we currently consider metadata). I am already in the process of refactoring so that we can specify a different labelmap, e.g. through a class method: vocalpy/vak#776
I am realizing that a better way to handle this might be to replace the dataset parameter "splits_path" with "metadata_path".
Here's my logic: we determine the metadata by using an (undeclared) naming scheme. This is fragile; if the naming scheme changes, the function breaks. This makes it harder to run any experiment that is slightly different from what is captured by the naming scheme. E.g., if we train models on multi-species datasets, then we are no longer thinking about IDs within one species / biosound group, and so it's not meaningful to put the ID in the filename for the splits. So in theory we have the flexibility to specify different splits through splits_path, but in practice as soon as we do anything besides the exact experiments prescribed by CMACBench, we break the naming scheme, and to get around this we have to use a hack where we put some placeholder in the field of the naming scheme that is not relevant (e.g. a fake ID like "all-species"). The same change in experiments (species instead of ID) also breaks the logic in the vak dataset class that relies on group name + unit name + ID to determine which labelmap to use. But we don't actually use any of the rest of the metadata (group, unit, ID, etc.) to train the model.
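As a rough illustration of the fragility described above, here is a minimal sketch of filename-based metadata parsing. The naming scheme, the regex, and the function name `metadata_from_filename` are all hypothetical, made up for this example; they are not the actual scheme used by CMACBench or vak.

```python
import re
from pathlib import Path

# Hypothetical naming scheme, for illustration only:
#   <group>.id-<ID>.unit-<unit>.frame-dur-<ms>-ms.splits.json
PATTERN = re.compile(
    r"^(?P<group>[^.]+)\.id-(?P<id>[^.]+)\.unit-(?P<unit>[^.]+)"
    r"\.frame-dur-(?P<frame_dur_ms>\d+(?:\.\d+)?)-ms\.splits\.json$"
)


def metadata_from_filename(splits_path):
    """Recover metadata from the filename; raise if the name does not fit the scheme."""
    name = Path(splits_path).name
    match = PATTERN.match(name)
    if match is None:
        raise ValueError(f"filename does not follow the naming scheme: {name}")
    metadata = match.groupdict()
    metadata["frame_dur_ms"] = float(metadata["frame_dur_ms"])
    return metadata


# A split that fits the scheme parses fine.
ok = metadata_from_filename(
    "Bird-Song.id-bird1.unit-syllable.frame-dur-1.0-ms.splits.json"
)

# A multi-species split has no meaningful ID, so it only parses
# if we put a placeholder in the ID field.
hack = metadata_from_filename(
    "Bird-Song.id-all-species.unit-syllable.frame-dur-1.0-ms.splits.json"
)
```

Any filename that drops or reorders a field raises the `ValueError`, which is why experiments outside the prescribed ones end up needing the placeholder hack.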
So instead of relying on a naming scheme, or having a dataclass that represents all the metadata as we do in this repo, I think we should just put the metadata directly in the json file when we prep the dataset here, and then the only thing vak needs to know is that it can get the exact metadata that it needs out of that file.
We already save metadata for each split when we make the splits, but it's in one big separate json file. Also, the built-in vak datapipe classes already do something similar: they use metadata that is saved in a json file during the prep step.
So the change here is just to save a "metadata.json" file for each split. We can use a naming scheme for these, but just to keep the files from overwriting each other. And then a user can provide their own metadata as long as it provides a splits path, frame duration, labelmap json path, and the bookkeeping vector paths (I guess I just declared a schema).
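A minimal sketch of what reading and writing such a per-split file could look like, using a plain dict rather than a dataclass. The key names (`splits_path`, `frame_dur`, `labelmap_json_path`, `bookkeeping_vector_paths`) and the helpers `save_metadata` / `load_metadata` are assumptions chosen for this example, not an existing vak or CMACBench API.

```python
import json
from pathlib import Path

# Hypothetical key names for the four things the metadata must provide.
REQUIRED_KEYS = {
    "splits_path",
    "frame_dur",
    "labelmap_json_path",
    "bookkeeping_vector_paths",
}


def _check_keys(metadata):
    """Raise if any required key is missing from the metadata dict."""
    missing = REQUIRED_KEYS - metadata.keys()
    if missing:
        raise ValueError(f"metadata is missing required keys: {sorted(missing)}")


def save_metadata(metadata, metadata_path):
    """Validate the metadata dict and write it to a json file."""
    _check_keys(metadata)
    Path(metadata_path).write_text(json.dumps(metadata, indent=2))


def load_metadata(metadata_path):
    """Read a metadata json file and validate that it has the required keys."""
    metadata = json.loads(Path(metadata_path).read_text())
    _check_keys(metadata)
    return metadata
```

With something like this, a dataset class would only need a `metadata_path` argument, and a user-written file works as long as it has the required keys; nothing is inferred from the filename.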
We can fix the split functions to do this later -- for now I will use the existing files to fix splits.