Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

harmonize Loader API #44

Open
Tracked by #26 ...
johentsch opened this issue Jun 27, 2023 · 2 comments
Open
Tracked by #26 ...

harmonize Loader API #44

johentsch opened this issue Jun 27, 2023 · 2 comments
Milestone

Comments

@johentsch
Copy link
Member

johentsch commented Jun 27, 2023

It would be great and make for a simple API to stick to a single source argument which can be a directory (or URL later), or some iterable of filepaths (URLs). The current approach suggested in #47 is to simply store the source argument and to let the property paths deal with it when needed, e.g. by instantiating a PathFactory when source is a directory. What is currently unclear is how the serialization should take place. Currently, paths is not included in the Schemas with the rationale that, for reproducibility, it is sufficient to know the directory (and the commit SHA at a later point. But what if the argument to source is a list of filepaths? The easiest solution I see is to add an envelope to the marshmallow field to allow it to (de)serialize either a single string or a collection of strings. Do you see any problems with this approach?

The second topic where I would love to have some input concerns the creation of IDs, which, finally, need to have the shape ('<corpus>', '<piece>'). The cases I'm imagining so far:

  • the MuseScoreLoader uses the ms3 parser already provides these IDs based on its built-in rules of how to recognize a (DCML) corpus, i.e., by looking for metadata.tsv files, among other things. Initializing MuseScoreLoader(package_name='my_corpus', as_corpus=True, ) overrides this mechanism and forces all IDs to be ('my_corpus', '<piece>'). It might make sense to include this latter option for all loaders.
  • The paths encountered have the shape .../corpus_folder/piece.ext or .../corpus_folder/subdirectory/piece.ext or .../corpus_folder/piece/file.ext and should yield `('<corpus_folder>', '') IDs. What would be the best way to let specify the users these options?
  • In a more diverse corpus such as https://github.com/MarkGotham/When-in-Rome/ these strategies may need to be mixed. Again, what would be an easy API allowing to specify, e.g., a mapping from subfolder to an ID-generating callable (see previous point).
  • In another case, the user may want to specify custom IDs, e.g. by passing a {corpus -> [piece]} dict or similar. Would that be a good solution? This may lead us back to the first paragraph, if the solution is to engineer a user-friendly yet versatile source argument that allows not only for lists of paths but also for mappings from IDs to paths.

All feedback welcome!

@johentsch johentsch mentioned this issue Jun 27, 2023
9 tasks
@johentsch johentsch added this to the Loaders milestone Jun 27, 2023
@johentsch johentsch mentioned this issue Jun 27, 2023
10 tasks
@huguesdevimeux
Copy link
Collaborator

huguesdevimeux commented Jun 29, 2023

The first thing that comes into my mind is that if source can be of multiple forms (str, list, etc.) with different interpretations (URI, dir), then .. it should not be a parameter.
I would see more some sort of factory methods, like from_dir(source_dir), from_url(source_url), etc. It would also standardize a bit how paths are made. (it would be constructed in these factory methods). Then, you can use an envelope.

@apmcleod
Copy link

I feel I'm missing a huge amount of background needed to understand the purpose behind these decisions, but I can offer my initial thoughts, just take them with a grain of salt:

  • I wouldn't worry about optimizing anything with the thought of URLs yet, since they are not implemented, and may never be.
  • I would've thought any default way of serializing/deserializing a list should be fine, whatever marshmallow is... (surely serializing lists has been done before)
  • Why package_name and not corpus_name? In any case, I don't see a problem with adding that parameter to all loaders...
  • Point 2: Do you mean, whether to search in subdirectories by various names, and how to get the piece name? You might give an option piece_name with valid options file_base, subdir ? The options could also be callables that take as input a string/Path, and output a string (ID). Then the user could customize as in Point 4?
  • I don't think mixing would be easy. If a user specifies the (same) corpus_name in all cases, different loading methods could be unified after the fact?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants