You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
It would be great and make for a simple API to stick to a single source argument which can be a directory (or URL later), or some iterable of filepaths (URLs). The current approach suggested in #47 is to simply store the source argument and to let the property paths deal with it when needed, e.g. by instantiating a PathFactory when source is a directory. What is currently unclear is how the serialization should take place. Currently, paths is not included in the Schemas with the rationale that, for reproducibility, it is sufficient to know the directory (and the commit SHA at a later point. But what if the argument to source is a list of filepaths? The easiest solution I see is to add an envelope to the marshmallow field to allow it to (de)serialize either a single string or a collection of strings. Do you see any problems with this approach?
The second topic where I would love to have some input concerns the creation of IDs, which, finally, need to have the shape ('<corpus>', '<piece>'). The cases I'm imagining so far:
the MuseScoreLoader uses the ms3 parser already provides these IDs based on its built-in rules of how to recognize a (DCML) corpus, i.e., by looking for metadata.tsv files, among other things. Initializing MuseScoreLoader(package_name='my_corpus', as_corpus=True, ) overrides this mechanism and forces all IDs to be ('my_corpus', '<piece>'). It might make sense to include this latter option for all loaders.
The paths encountered have the shape .../corpus_folder/piece.ext or .../corpus_folder/subdirectory/piece.ext or .../corpus_folder/piece/file.ext and should yield `('<corpus_folder>', '') IDs. What would be the best way to let specify the users these options?
In a more diverse corpus such as https://github.com/MarkGotham/When-in-Rome/ these strategies may need to be mixed. Again, what would be an easy API allowing to specify, e.g., a mapping from subfolder to an ID-generating callable (see previous point).
In another case, the user may want to specify custom IDs, e.g. by passing a {corpus -> [piece]} dict or similar. Would that be a good solution? This may lead us back to the first paragraph, if the solution is to engineer a user-friendly yet versatile source argument that allows not only for lists of paths but also for mappings from IDs to paths.
All feedback welcome!
The text was updated successfully, but these errors were encountered:
The first thing that comes into my mind is that if source can be of multiple forms (str, list, etc.) with different interpretations (URI, dir), then .. it should not be a parameter.
I would see more some sort of factory methods, like from_dir(source_dir), from_url(source_url), etc. It would also standardize a bit how paths are made. (it would be constructed in these factory methods). Then, you can use an envelope.
I feel I'm missing a huge amount of background needed to understand the purpose behind these decisions, but I can offer my initial thoughts, just take them with a grain of salt:
I wouldn't worry about optimizing anything with the thought of URLs yet, since they are not implemented, and may never be.
I would've thought any default way of serializing/deserializing a list should be fine, whatever marshmallow is... (surely serializing lists has been done before)
Why package_name and not corpus_name? In any case, I don't see a problem with adding that parameter to all loaders...
Point 2: Do you mean, whether to search in subdirectories by various names, and how to get the piece name? You might give an option piece_name with valid options file_base, subdir ? The options could also be callables that take as input a string/Path, and output a string (ID). Then the user could customize as in Point 4?
I don't think mixing would be easy. If a user specifies the (same) corpus_name in all cases, different loading methods could be unified after the fact?
It would be great and make for a simple API to stick to a single
source
argument which can be a directory (or URL later), or some iterable of filepaths (URLs). The current approach suggested in #47 is to simply store thesource
argument and to let the propertypaths
deal with it when needed, e.g. by instantiating aPathFactory
whensource
is a directory. What is currently unclear is how the serialization should take place. Currently,paths
is not included in the Schemas with the rationale that, for reproducibility, it is sufficient to know the directory (and the commit SHA at a later point. But what if the argument to source is a list of filepaths? The easiest solution I see is to add an envelope to the marshmallow field to allow it to (de)serialize either a single string or a collection of strings. Do you see any problems with this approach?The second topic where I would love to have some input concerns the creation of IDs, which, finally, need to have the shape
('<corpus>', '<piece>')
. The cases I'm imagining so far:MuseScoreLoader
uses thems3
parser already provides these IDs based on its built-in rules of how to recognize a (DCML) corpus, i.e., by looking formetadata.tsv
files, among other things. InitializingMuseScoreLoader(package_name='my_corpus', as_corpus=True, )
overrides this mechanism and forces all IDs to be('my_corpus', '<piece>')
. It might make sense to include this latter option for all loaders..../corpus_folder/piece.ext
or.../corpus_folder/subdirectory/piece.ext
or.../corpus_folder/piece/file.ext
and should yield `('<corpus_folder>', '') IDs. What would be the best way to let specify the users these options?{corpus -> [piece]}
dict or similar. Would that be a good solution? This may lead us back to the first paragraph, if the solution is to engineer a user-friendly yet versatilesource
argument that allows not only for lists of paths but also for mappings from IDs to paths.All feedback welcome!
The text was updated successfully, but these errors were encountered: