Producing "latest" training data potentially includes invalid ground truth dates #262

Open
JimCircadian opened this issue May 15, 2024 · 1 comment
Labels
bug Something isn't working

Comments

@JimCircadian
Member

Description

I was running a large icenet_dataset_create job to cache the tfrecords. The available data runs up to 25/12/2023, so the end date is configured to match. Having run the process scripts with that as the end date, dataset creation then fails on an invalid SIC (sea ice concentration) selection:

Traceback (most recent call last):
  File "/rds/user/USER/hpc-work/icenet/icenet/icenet/data/loaders/dask.py", line 408, in generate_sample
    sample_output = var_ds.siconca_abs.sel(time=forecast_dts)
  File "/home/USER/.conda/envs/icenet/lib/python3.9/site-packages/xarray/core/dataarray.py", line 1536, in sel
    ds = self._to_temp_dataset().sel(
  File "/home/USER/.conda/envs/icenet/lib/python3.9/site-packages/xarray/core/dataset.py", line 2573, in sel
    query_results = map_index_queries(
  File "/home/USER/.conda/envs/icenet/lib/python3.9/site-packages/xarray/core/indexing.py", line 188, in map_index_queries
    results.append(index.sel(labels, **options))
  File "/home/USER/.conda/envs/icenet/lib/python3.9/site-packages/xarray/core/indexes.py", line 489, in sel
    raise KeyError(f"not all values found in index {coord_name!r}")
KeyError: "not all values found in index 'time'"

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/USER/.conda/envs/icenet/bin/icenet_dataset_create", line 33, in <module>
    sys.exit(load_entry_point('icenet', 'console_scripts', 'icenet_dataset_create')())
  File "/rds/user/USER/hpc-work/icenet/icenet/icenet/data/loader.py", line 126, in create
    dl.generate()
  File "/rds/user/USER/hpc-work/icenet/icenet/icenet/data/loaders/dask.py", line 78, in generate
    self.client_generate(client,
  File "/rds/user/USER/hpc-work/icenet/icenet/icenet/data/loaders/dask.py", line 218, in client_generate
    in client.gather(futures):
  File "/home/USER/.conda/envs/icenet/lib/python3.9/site-packages/distributed/client.py", line 2372, in gather
    return self.sync(
  File "/rds/user/USER/hpc-work/icenet/icenet/icenet/data/loaders/dask.py", line 340, in generate_and_write
    x, y, sample_weights = generate_sample(date, var_ds, var_files,
  File "/rds/user/USER/hpc-work/icenet/icenet/icenet/data/loaders/dask.py", line 414, in generate_sample
    raise RuntimeError(sic_ex)
RuntimeError: "not all values found in index 'time'"

The failure appears to come from the ground truth selection, which suggests generate_sample is selecting dates beyond the range of the available training data. The icenet_process commands do not limit the training date range by the number of forecast days, so we need to make sure the forecast window is correctly accounted for when creating samples.

$ cat loader.full_train_south.json | jq '.sources.osisaf.dates.train' | grep '2023_12_25'
  "2023_12_25"

This is probably only showing up now because this training configuration includes dates at the END of the full data window; the test and validation sets are pre-2023.
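
A minimal sketch of the kind of guard this might need, not the icenet implementation. It assumes the loader dates use the `%Y_%m_%d` format shown in the jq output, a 93-day forecast window, and that the ground truth for a sample date covers that date plus the following days up to the window length. The helper name `trim_dates_to_forecast_window` is hypothetical.

```python
import datetime as dt

DATE_FMT = "%Y_%m_%d"  # format seen in loader.full_train_south.json


def trim_dates_to_forecast_window(dates, last_available, n_forecast_days):
    """Keep only dates whose whole forecast window has ground truth.

    Hypothetical helper. Assumes the window for a sample date d is
    d, d+1, ..., d+(n_forecast_days-1); adjust if the window starts at d+1.
    """
    # Latest sample date whose final forecast day is still available
    last_start = last_available - dt.timedelta(days=n_forecast_days - 1)
    kept = []
    for text in dates:
        d = dt.datetime.strptime(text, DATE_FMT).date()
        if d <= last_start:
            kept.append(text)
    return kept


train_dates = ["2023_09_01", "2023_09_24", "2023_09_25", "2023_12_25"]
usable = trim_dates_to_forecast_window(
    train_dates, dt.date(2023, 12, 25), n_forecast_days=93)
print(usable)
```

If the forecast window actually starts the day after the sample date, the cut-off moves one day earlier, so the window convention should be checked against generate_sample before relying on a filter like this.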

@JimCircadian JimCircadian added the bug Something isn't working label May 15, 2024
@JimCircadian
Member Author

JimCircadian commented May 15, 2024

Counting back the 93 day forecast window from that end date gives 23/09/2023, so I'm skimming 23/09/23 onwards from the loader dates for the moment.
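
The date arithmetic behind that cut-off can be checked with a short sketch using only the values quoted in this issue (the variable names are illustrative):

```python
import datetime as dt

end_date = dt.date(2023, 12, 25)  # last available ground truth date
n_forecast_days = 93              # forecast window length quoted above

# Step back one full forecast window from the end of the data
cutoff = end_date - dt.timedelta(days=n_forecast_days)
print(cutoff.strftime("%d/%m/%Y"))
```

Dropping everything from this date onwards is a conservative choice: depending on whether the forecast window includes the sample date itself, the last one or two dropped dates may still have a complete window.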
