
Loader configurations -> Usable datasets for downstream applications #5

Open · 3 tasks

JimCircadian (Contributor) opened this issue Jul 24, 2024 · 0 comments

The nature of the library is that we:

  1. Preprocess datasets from download-toolbox
  2. Generate a loader configuration, applying additional metadata (arbitrary channels and masks) to provide initial access to the collected data
  3. Use this data loader to produce usable datasets for downstream applications (tested with IceNet and another internal application)
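The flow from loader configuration to data access in steps (2) and (3) can be sketched roughly as follows. This is an illustrative sketch only: the `DataLoader` class, the configuration keys (`sources`, `channels`, `masks`), and the iteration logic are assumptions for illustration, not the actual library API.

```python
# Hypothetical sketch of steps (2) and (3): a loader configuration
# carrying additional metadata (channels, masks) over preprocessed
# source datasets, and a loader giving initial access to that data.
# All names here are illustrative, not the real icenet/preprocess API.

class DataLoader:
    """Provides access to preprocessed data via a loader configuration."""

    def __init__(self, config: dict):
        self.sources = config["sources"]              # preprocessed source datasets
        self.channels = config.get("channels", [])    # arbitrary extra channels
        self.masks = config.get("masks", [])          # arbitrary masks

    def __iter__(self):
        # Placeholder iteration: yield one record per configured source
        for source in self.sources:
            yield {"source": source,
                   "channels": self.channels,
                   "masks": self.masks}

# Step 2: a loader configuration applying metadata to collected data
config = {
    "sources": ["era5", "osisaf"],
    "channels": ["siconca_abs"],
    "masks": ["land"],
}

# Step 3: the loader is the access point for downstream dataset creation
loader = DataLoader(config)
samples = list(loader)
```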

This issue will capture everything to do with developing step (3) of this process. To do this, consider the original IceNet command for creating a so-called "network dataset" from a data loader configuration containing multiple source datasets and generated "other" data:

icenet_dataset_create -v -p -ob 4 -w 16 -fl loader.test.json test_net_ds

To approach this, we'll first get that working against the original library with the new data loader configuration structure. Then we can bring the processing framework (most of which is agnostic to the underlying application implementation) over where it makes sense. We'd then end up with a similar command, such as

preprocess_dataset_create -v -p -ob 4 -w 16 -fl loader.test.json icenet.data.loader:generate_and_write test_net_ds
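The `icenet.data.loader:generate_and_write` argument above is a `module:function` reference, which a generic `preprocess_dataset_create` command could resolve to a callable at runtime. A minimal sketch of that resolution, using the standard `importlib` machinery (the surrounding CLI handling is an assumption):

```python
# Hedged sketch: resolving a "module:function" implementation argument
# (e.g. "icenet.data.loader:generate_and_write") to a callable, as a
# generic preprocess_dataset_create command might do. importlib usage
# is standard library; the CLI convention itself is an assumption.
import importlib


def resolve_implementation(spec: str):
    """Split a "module:function" spec and import the named callable."""
    module_name, func_name = spec.split(":")
    module = importlib.import_module(module_name)
    return getattr(module, func_name)


# Example against the standard library (a stand-in for the real spec)
dumps = resolve_implementation("json:dumps")
```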

This performs the exact same in/out processing, but delegates the structure of the dataset to generate_and_write (which is responsible, as it currently is, for constructing the (x, y, sample_weight) tuple for every element iterable out of the data loader). We need to consider output types, insofar as we don't want ML-specific logic in here (e.g. tf.data, PyTorch Lightning or other framework-specific logic), so an assessment may leave that out; but at minimum the first command should be working to close this issue, with an assessment done on working towards the second.
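One way to keep ML-specific logic out of the core, as described above, is for the application-supplied callable to emit plain arrays that a separate tf.data or Lightning adapter can wrap later. A minimal sketch under that assumption; the function name matches the issue but its signature, the `.npz` output format, and the fake loader are all illustrative:

```python
# Illustrative sketch of an application-supplied generate_and_write
# callable kept free of ML-framework types (no tf.data / PyTorch).
# Signature and output format are assumptions for illustration only.
import tempfile

import numpy as np


def generate_and_write(path, loader):
    """Construct the (x, y, sample_weight) tuple for every element
    iterable out of the data loader and persist it framework-neutrally."""
    count = 0
    for x, y, sample_weight in loader:
        # Write plain arrays; a tf.data or Lightning adapter would wrap
        # these files in a separate, framework-specific layer.
        np.savez(f"{path}/sample_{count:05d}.npz",
                 x=x, y=y, sample_weight=sample_weight)
        count += 1
    return count


# Minimal fake loader yielding two (x, y, sample_weight) tuples
fake_loader = [
    (np.zeros((4, 4, 2)), np.ones((4, 4, 1)), np.ones((4, 4, 1))),
    (np.zeros((4, 4, 2)), np.ones((4, 4, 1)), np.ones((4, 4, 1))),
]

with tempfile.TemporaryDirectory() as tmp:
    written = generate_and_write(tmp, fake_loader)
```

Keeping the persisted output framework-neutral means the same dataset files can back either a tf.data pipeline or a PyTorch `Dataset` without the core library importing either framework.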

  • Refactor IceNet (under this issue) to accept the new loader configuration methodology
    • Ensure extraneous parameter specification is rationalised (remove time and space configuration parameters within network dataset generation, as these are predetermined earlier in the processing chain, unless there is a good reason to override them)
  • Assess the creation of the preprocess_dataset_create type commands
@JimCircadian JimCircadian self-assigned this Jul 24, 2024
@JimCircadian JimCircadian added enhancement New feature or request and removed enhancement New feature or request labels Jul 25, 2024
JimCircadian added a commit that referenced this issue Jul 25, 2024
…, but able to produce a network trainable dataset
@JimCircadian JimCircadian added this to the 0.1.0 milestone Aug 28, 2024