-
Notifications
You must be signed in to change notification settings - Fork 385
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add GlacierMappingAlps dataset #2508
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot for the contribution :) I hope this first round of review is able to resolve the import xarray
error that is preventing the unit tests from running. Once they are working, will take another look for anything else!
@microsoft-github-policy-service agree |
I hope it will be somehow useful. Happy to contribute and learn new things on the way :) |
I'm completely fine with adding xarray and netcdf4 as optional dependencies. We've been talking about adding xarray support for a long time. We're still not sure if it will be directly through xarray or something like rioxarray, but that doesn't matter for this PR. |
Can you resolve the merge conflicts so the tests run? |
Can you resolve the merge conflicts? These requirements files unfortunately get updated a lot, causing merge issues. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code looks fantastic, just a few minor formatting comments. Thanks for taking the time to contribute your dataset to TorchGeo, let's hope it makes it easier for people to actually use and do cool science with!
@@ -7,3 +7,5 @@ pycocotools==2.0.8 | |||
pyvista==0.44.2 | |||
scikit-image==0.25.0 | |||
scipy==1.15.0 | |||
xarray==2024.11.0 | |||
netcdf4==1.7.2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's keep these in alphabetical order
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These also need to be added to pyproject.toml. We'll need to find minimum versions that allow the tests to pass. I can help with this if you want.
) | ||
|
||
|
||
class GlacierMappingAlps(NonGeoDataset): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we rename the dataset to DL4GAM or DL4GAMAlps? Or is DL4GAM the project/model and GlacierMappingAlps is the name of the dataset? It's up to you, it's your dataset.
class GlacierMappingAlps(NonGeoDataset): | ||
r"""A Multi-modal Dataset for Glacier Mapping (Segmentation) in the European Alps. | ||
|
||
The dataset consists of Sentinel-2 images from 2015 (mainly), 2016 and 2017, and binary segmentation masks for |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be nice if we could limit all lines to 88 chars, but if ruff doesn't complain it isn't technically required
* `xarray <https://docs.xarray.dev/en/stable/getting-started-guide/installing.html>`_ | ||
* `netcdf4 <https://unidata.github.io/netcdf4-python/>`_ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I usually just link to the PyPI homepage
checksum: if True, check the MD5 of the downloaded files (may be slow) | ||
|
||
Raises: | ||
AssertionError: if the ``split``, ``cv_iter``, ``version``, ``bands`` or ``extra_features`` are invalid |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I usually use *foo*
instead of backticks because that's what the Python docs uses. Could also be less specific and just say "if any parameters are invalid".
return len(self.fp_patches) | ||
|
||
def __getitem__(self, index: int) -> dict[str, Tensor]: | ||
"""It loads the netcdf file for the given index and returns the sample as a dict. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"""It loads the netcdf file for the given index and returns the sample as a dict. | |
"""Load the NetCDF file for the given index and return the sample as a dict. |
* the cloud and shadow mask | ||
* the additional features (DEM, derived features, etc.) if required | ||
""" | ||
xr = lazy_import('xarray') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For other datasets, we added a lazy_import
to the __init__
just so that the dataset would fail more quickly.
version: str = 'small', | ||
bands: Sequence[str] = rgb_nir_swir_bands, | ||
extra_features: Sequence[str] | None = None, | ||
transforms: Callable[[dict[str, Tensor]], dict[str, Tensor]] | None = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These aren't currently being used in __getitem__
@@ -21,6 +21,7 @@ Dataset,Task,Source,License,# Samples,# Classes,Size (px),Resolution (m),Bands | |||
`Forest Damage`_,OD,Drone imagery,"CDLA-Permissive-1.0","1,543",4,"1,500x1,500",,RGB | |||
`GeoNRW`_,S,Aerial,"CC-BY-4.0","7,783",11,"1,000x1,000",1,"RGB, DEM" | |||
`GID-15`_,S,Gaofen-2,-,150,15,"6,800x7,200",3,RGB | |||
`Glacier Mapping Alps`_,S,"Sentinel-2","CC-BY-4.0","2,251 or 11,440","2","256x256","10","MSI" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Confirmed the license
This PR adds a a Multi-modal Dataset for Glacier Mapping (Segmentation) in the European Alps.
The dataset consists of Sentinel-2 images from 2015 (mainly), 2016 and 2017, and binary segmentation masks for glaciers, based on an inventory built by glaciology experts (Paul et al. 2020).
Given that glacier ice is not always visible in the images, due to seasonal snow, shadow/cloud cover and, most importantly, debris cover, the dataset also includes additional features that can help in the segmentation task.
Here's a sample extracted with the plot function implemented in the dataset:
A preprint is available here that describes in detail how the dataset was constructed.
For a shorter description check also: https://huggingface.co/datasets/dcodrut/dl4gam_alps.