Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NWP loaded as xarray.Dataset cannot be iterated over when combining into single dataset #322

Open
AUdaltsova opened this issue May 31, 2024 · 2 comments
Labels
bug Something isn't working

Comments

@AUdaltsova
Copy link
Contributor

AUdaltsova commented May 31, 2024

Describe the bug

If NWP provider returns xarray.Dataset the following error occurs:

AttributeError: 'str' object has no attribute 'assign_attrs'
This exception is thrown by __iter__ of MapperIterDataPipe(datapipe=DictDatasetIterDataPipe, fn=combine_to_single_dataset, input_col=None, output_col=None)

This is caused by this bit in ocf_datapipes/utils/utils.py

To Reproduce

Steps to reproduce the behavior:
Generally will be thrown by any xr.Dataset NWP, so either loading ICON or changing another provider to return xr.Dataset should reproduce. Steps I used to get this error from ECMWF:

  1. Go to ocf_datapipes/load/nwp/providers/ecmwf.py
  2. Comment out lines 25-26
  3. Substitute with ifs = nwp
  4. run PVNet/save_batches.py

Expected behaviour

Should be returning xr.DataArray when iterating over it, but returns str because iteration over xr.Dataset iterates over variable names instead of their contents.

There might be a different way around this, especially if some data sources cannot be manipulated to be output as a xr.DataArray, but there may be merit to committing to a more standardised loader output.

@AUdaltsova AUdaltsova added the bug Something isn't working label May 31, 2024
@AUdaltsova
Copy link
Contributor Author

AUdaltsova commented Jul 26, 2024

On reflection, the thing that's crashing generally seems a bit odd; my best guess it was supposed to support multiple pv sources? (In this case, should we hide it under an if statement by key or something? Because it's doing a lot of unnecessary stuff when it is used for anything else as far as I can tell)

Overall by the time you get to that, iteration for key, datasets in dataset_dict.items(): will iterate through sources as in key="wind", datasets=wind_data and for nwps it looks like key="nwp-{source}", datasets={source_data}. At this point source_data is normally a xr.DataArray with variables stored as a channel dimension, but technically there is nothing preventing it from being a xr.Dataset with variables still being weather variables. However, the next loop for dataset in datasets: would either loop over variable names in a xr.Dataset, which is the source of the issue above, or just return the dataArray itself for xr.DataArray -- anyway, in short, unless I'm reading this very wrong, this entire bit does nothing but assign attributes to the data in most cases (all nwps, wind, pvsite data), and looks kind of convoluted for what it is.

The next for loop loops over all the same things, and also does nothing to the actual structure other than prepend source to all names (useful, but again, very convoluted for what it is)

I guess my question is, what is the use case for this if it exists anymore at all? Because someone clearly went to a lot of trouble to do it this way and make absolutely sure that it combines xr.Datasets, but it just seems to do a lot of unnecessary operations most of the time now.

@AUdaltsova
Copy link
Contributor Author

Basically I think it operates under the assumption that every source supplies a dictionary of datasets/dataArrays, which is currently false most of the time (probably only true for gsp data? I might be wrong here as have experience only with pv_site and windnet)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

1 participant