Allow JsonDataSource and CsvDataSource to read all files from a folder (not just a single file) #123

Open
seinecle opened this issue Mar 31, 2021 · 2 comments
Labels
enhancement New feature or request

Comments

@seinecle

Is your feature request related to a problem? Please describe.
My data source is a list of JSON files. I have to concatenate them into one file before using the JsonDataSource constructor.

Describe the solution you'd like
Allow the JsonDataSource constructor to accept the path to a folder

Describe alternatives you've considered
Doing the aggregation of all files into one big file myself, before feeding it to JsonDataSource

Additional context
It is not uncommon to receive data sources as a collection of files, instead of one big file. A convenience method to handle this common case would be appreciated.

@seinecle seinecle added the enhancement New feature or request label Mar 31, 2021
@Craigacp
Member

Craigacp commented Mar 31, 2021

In Tribuo 4.0.X you can use AggregateDataSource to aggregate across files, provided you're happy with it round-robining the iterators and with the provenance not being able to produce a configuration.
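
Before per-file sources can be aggregated, the files in the folder have to be enumerated. A minimal sketch of that step using only the JDK is below; the class and method names (JsonFolderLister, listJsonFiles) are hypothetical and not part of Tribuo, and matching on a ".json" suffix is an assumption about how the files are named.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not a Tribuo class.
public final class JsonFolderLister {
    private JsonFolderLister() {}

    /**
     * Returns the regular files directly inside {@code folder} whose names
     * end in ".json", sorted by path so the order is deterministic.
     * Subdirectories are not searched.
     */
    public static List<Path> listJsonFiles(Path folder) throws IOException {
        // Files.list holds a directory handle open, so close it with try-with-resources.
        try (Stream<Path> entries = Files.list(folder)) {
            return entries
                    .filter(Files::isRegularFile)
                    .filter(p -> p.getFileName().toString().endsWith(".json"))
                    .sorted()
                    .collect(Collectors.toList());
        }
    }
}
```

Each returned path could then be given its own JsonDataSource, with the resulting list handed to AggregateDataSource. Those constructor calls are left out here because their signatures are not shown in this thread.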

In the short term (i.e. potentially before 4.1) the plan was to add an enum to AggregateDataSource which lets the iterators work sequentially (so it will preserve the example ordering according to the order you specify the files in). Edit: it looks like it's already sequential, so we'll update the docs to make that clear. We'll also add a version which is configurable and operates on ConfigurableDataSource, which will allow the provenance to convert into a configuration for re-running the experiment. Both of these changes are straightforward and easy to make compatible with existing config files & provenance objects.
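
To make the difference between the two iteration orders concrete, here is a small stand-alone sketch. It is not Tribuo's implementation; the class and method names are hypothetical, and plain Iterables stand in for data sources.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical illustration of the two orders, not Tribuo code.
public final class IterationOrderSketch {
    private IterationOrderSketch() {}

    /** Drains each source fully before moving on to the next one. */
    public static <T> List<T> sequential(List<? extends Iterable<T>> sources) {
        List<T> out = new ArrayList<>();
        for (Iterable<T> source : sources) {
            for (T item : source) {
                out.add(item);
            }
        }
        return out;
    }

    /** Takes one item from each source in turn, skipping exhausted sources. */
    public static <T> List<T> roundRobin(List<? extends Iterable<T>> sources) {
        List<Iterator<T>> iterators = new ArrayList<>();
        for (Iterable<T> source : sources) {
            iterators.add(source.iterator());
        }
        List<T> out = new ArrayList<>();
        boolean progressed = true;
        // Keep cycling until a full pass yields nothing.
        while (progressed) {
            progressed = false;
            for (Iterator<T> it : iterators) {
                if (it.hasNext()) {
                    out.add(it.next());
                    progressed = true;
                }
            }
        }
        return out;
    }
}
```

With sources [a1, a2, a3] and [b1], sequential gives a1, a2, a3, b1 (file order is preserved), while roundRobin gives a1, b1, a2, a3 (examples from different files are interleaved).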

Longer term (i.e. after 4.1) I agree that extending those data sources to operate on folders would be a good change. It's a longer term thing because we'll need to think through the implications for existing provenance & configuration files and try to evolve those DataSources in a compatible way. Issue #70 involves extending the loading in a few ways, some of which could be integrated into such a change (e.g. supporting compressed files), so we'd prefer to do any more substantial refactor once and cover more of the different extensions at the same time.

@Craigacp
Member

Craigacp commented Apr 2, 2021

The PR is out for the short-term work, which extends AggregateDataSource and adds AggregateConfigurableDataSource. Once it's merged you'll be able to put a collection of JsonDataSources into a config file, and have a single aggregate source which collects all of them up. The individual JsonDataSources can share the RowProcessor, provided the JSON files all have the same fields.

As I mentioned above, in a future release we'll look into a bigger refactoring of the data sources to allow them to iterate over multiple files and to operate on compressed files.
