harmonize Loader API #44

johentsch · 2023-06-27T13:13:25Z

It would be great, and make for a simple API, to stick to a single source argument, which can be a directory (or, later, a URL) or some iterable of filepaths (or URLs). The current approach suggested in #47 is to simply store the source argument and let the paths property deal with it when needed, e.g. by instantiating a PathFactory when source is a directory. What is currently unclear is how serialization should take place. Currently, paths is not included in the Schemas, with the rationale that, for reproducibility, it is sufficient to know the directory (and, at a later point, the commit SHA). But what if the argument to source is a list of filepaths? The easiest solution I see is to add an envelope to the marshmallow field so that it can (de)serialize either a single string or a collection of strings. Do you see any problems with this approach?
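A minimal, library-independent sketch of the envelope idea, under the assumption that the serialized form is a bare string for a single directory and a list of strings for a collection of paths. The names `dump_source` and `load_source` are hypothetical and not part of the code base; in a marshmallow schema, the same logic could sit in the `_serialize` and `_deserialize` methods of a custom field.

```python
from pathlib import Path
from typing import Iterable, List, Union

# Hypothetical type alias for whatever the `source` argument accepts.
Source = Union[str, Path, Iterable[Union[str, Path]]]


def dump_source(source: Source) -> Union[str, List[str]]:
    """Serialize a single path as a string, a collection as a list of strings."""
    if isinstance(source, (str, Path)):
        return str(source)
    return [str(path) for path in source]


def load_source(data: Union[str, List[str]]) -> Union[Path, List[Path]]:
    """Invert dump_source: a string becomes one Path, a list a list of Paths."""
    if isinstance(data, str):
        return Path(data)
    if isinstance(data, (list, tuple)) and all(isinstance(item, str) for item in data):
        return [Path(item) for item in data]
    raise ValueError(f"Expected a string or a list of strings, got {data!r}")
```

One consequence of this shape is that a one-element list and a single string stay distinguishable after a round trip, so a loader can still tell "one directory" from "one file".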

The second topic on which I would love some input concerns the creation of IDs, which ultimately need to have the shape ('<corpus>', '<piece>'). The cases I'm imagining so far:

1. The MuseScoreLoader uses the ms3 parser, which already provides these IDs based on its built-in rules for recognizing a (DCML) corpus, i.e., by looking for metadata.tsv files, among other things. Initializing MuseScoreLoader(package_name='my_corpus', as_corpus=True, ) overrides this mechanism and forces all IDs to be ('my_corpus', '<piece>'). It might make sense to include this latter option for all loaders.
2. The paths encountered have the shape .../corpus_folder/piece.ext, .../corpus_folder/subdirectory/piece.ext, or .../corpus_folder/piece/file.ext and should yield ('<corpus_folder>', '') IDs. What would be the best way to let users specify these options?
3. In a more diverse corpus such as https://github.com/MarkGotham/When-in-Rome/ these strategies may need to be mixed. Again, what would be an easy API for specifying, e.g., a mapping from subfolder to an ID-generating callable (see previous point)?
4. In another case, the user may want to specify custom IDs, e.g. by passing a {corpus -> [piece]} dict or similar. Would that be a good solution? This may lead us back to the first paragraph, if the solution is to engineer a user-friendly yet versatile source argument that allows not only for lists of paths but also for mappings from IDs to paths.
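To make the second and third cases above more concrete, here is a rough sketch of ID-generating callables and of a mapping from subfolder to callable. Everything in it is hypothetical (`id_from_file_stem`, `id_from_parent_dir`, `make_id`, the folder names), and it assumes that paths are given relative to the directory containing the corpus folders, so that the first path component names the corpus.

```python
from pathlib import PurePosixPath
from typing import Callable, Dict, Tuple, Union

PieceID = Tuple[str, str]
IDMaker = Callable[[PurePosixPath], PieceID]


def id_from_file_stem(rel_path: PurePosixPath) -> PieceID:
    """corpus_folder/piece.ext or corpus_folder/subdirectory/piece.ext"""
    return rel_path.parts[0], rel_path.stem


def id_from_parent_dir(rel_path: PurePosixPath) -> PieceID:
    """corpus_folder/piece/file.ext: the enclosing folder names the piece."""
    return rel_path.parts[0], rel_path.parent.name


def make_id(
    rel_path: Union[str, PurePosixPath],
    strategies: Dict[str, IDMaker],
    default: IDMaker = id_from_file_stem,
) -> PieceID:
    """Pick the ID-generating callable registered for the top-level folder."""
    rel_path = PurePosixPath(rel_path)
    id_maker = strategies.get(rel_path.parts[0], default)
    return id_maker(rel_path)


# Mixed strategies: one folder stores each piece in its own directory.
strategies = {"folder_per_piece": id_from_parent_dir}
print(make_id("flat_corpus/sonata01.mscx", strategies))
print(make_id("folder_per_piece/sonata01/score.mxl", strategies))
```

Both calls yield a `'sonata01'` piece name, once from the file stem and once from the enclosing folder. A user-supplied callable fits the same slot, which would also cover fully custom IDs.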

All feedback welcome!


huguesdevimeux · 2023-06-29T17:40:24Z

The first thing that comes to mind is that if source can take multiple forms (str, list, etc.) with different interpretations (URI, dir), then it should not be a parameter.
I would rather see some sort of factory methods, like from_dir(source_dir), from_url(source_url), etc. That would also standardize a bit how paths is built, since it would be constructed in these factory methods. Then you can use an envelope.
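A small sketch of what such factory methods could look like. The `Loader` class below is purely illustrative and is not the library's actual class; `from_filepaths` and the `extensions` parameter are assumptions added for the example, and `from_url` is left out because URL support does not exist yet.

```python
from pathlib import Path
from typing import Iterable, List, Sequence, Union


class Loader:
    """Hypothetical loader illustrating factory methods; not the real API."""

    def __init__(self, paths: List[Path], source: Union[str, List[str]]):
        self.paths = paths    # always a concrete list of files
        self.source = source  # what would be serialized (string or list)

    @classmethod
    def from_dir(cls, source_dir: Union[str, Path],
                 extensions: Sequence[str] = (".mscx",)) -> "Loader":
        """Collect all files below source_dir that have a matching extension."""
        root = Path(source_dir)
        paths = sorted(
            p for p in root.rglob("*") if p.is_file() and p.suffix in extensions
        )
        return cls(paths, source=str(root))

    @classmethod
    def from_filepaths(cls, filepaths: Iterable[Union[str, Path]]) -> "Loader":
        """Use exactly the given files, in the given order."""
        paths = [Path(p) for p in filepaths]
        return cls(paths, source=[str(p) for p in paths])
```

With this layout, paths is built in one place per kind of source, and the constructor never has to guess what a string means.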

apmcleod · 2023-06-30T12:51:00Z

I feel I'm missing a huge amount of background needed to understand the purpose behind these decisions, but I can offer my initial thoughts; just take them with a grain of salt:

I wouldn't worry about optimizing anything with the thought of URLs yet, since they are not implemented, and may never be.
I would've thought any default way of serializing/deserializing a list should be fine, whatever marshmallow is... (surely serializing lists has been done before)
Why package_name and not corpus_name? In any case, I don't see a problem with adding that parameter to all loaders...
Point 2: Do you mean whether to search in subdirectories by various names, and how to get the piece name? You might offer an option piece_name with valid values file_base and subdir. The values could also be callables that take a string/Path as input and output a string (ID). Then the user could customize as in Point 4?
I don't think mixing would be easy. If a user specifies the (same) corpus_name in all cases, different loading methods could be unified after the fact?
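The piece_name suggestion could be sketched roughly as follows. The option names file_base and subdir come from the comment above; the helper names `resolve_piece_namer` and `make_piece_id`, and the choice to require corpus_name explicitly, are assumptions made for the example.

```python
from pathlib import Path
from typing import Callable, Dict, Tuple, Union

PieceNamer = Callable[[Path], str]

# Built-in options; a user may pass a callable instead of one of these keys.
PIECE_NAME_OPTIONS: Dict[str, PieceNamer] = {
    "file_base": lambda path: path.stem,        # .../piece.ext -> "piece"
    "subdir": lambda path: path.parent.name,    # .../piece/file.ext -> "piece"
}


def resolve_piece_namer(piece_name: Union[str, PieceNamer]) -> PieceNamer:
    """Turn an option name or a callable into a callable."""
    if callable(piece_name):
        return piece_name
    try:
        return PIECE_NAME_OPTIONS[piece_name]
    except KeyError:
        valid = ", ".join(sorted(PIECE_NAME_OPTIONS))
        raise ValueError(
            f"piece_name must be a callable or one of: {valid}"
        ) from None


def make_piece_id(
    path: Union[str, Path],
    corpus_name: str,
    piece_name: Union[str, PieceNamer] = "file_base",
) -> Tuple[str, str]:
    """Build a (corpus, piece) ID with a user-specified corpus name."""
    namer = resolve_piece_namer(piece_name)
    return corpus_name, namer(Path(path))
```

Because corpus_name is always given by the user here, IDs produced by different loading methods share the same first element and could be merged after the fact.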

johentsch mentioned this issue Jun 27, 2023

Implement Loaders #26

Open

9 tasks

johentsch added this to the Loaders milestone Jun 27, 2023

johentsch mentioned this issue Jun 27, 2023

New loaders #47

Merged

10 tasks
