Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding statistics to Raster #638

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

vschaffn
Copy link

@vschaffn vschaffn commented Dec 3, 2024

Resolves #602 GlacioHack/xdem#660

Description:

This pull request introduces enhancements to the statistics calculation methods in the project. The goal is to provide users with more flexible and comprehensive statistics functionality for raster data. The get_stats method has been created to:

  • Return a list of statistics in a dict containing mean, median, max, mean, sum, sum of squares, 90th percentile, nmad, rmse, std.
  • Support single statistic: User can request one statistic.
  • Support multiple statistics: Users can request multiple statistics in one call.
  • Support custom callables: Users can pass any callable function to calculate custom statistics.

A list of aliases has been drawn up in order to be as flexible as possible when users request statistics.

Changes:

  • NMAD in geoutils.stats
    note : nmad can't be called directly from a raster because of circular imports, you have to go through an array like.
  • _statistics method: Calculates common statistics (mean, median, max, min, sum, RMSE, etc.) for a specified band of raster data.
  • get_stats method: Retrieves requested statistics by their name (either from a predefined list or user-defined), with support for passing a callable function.
  • _get_single_stat method: Handles fetching a single statistic by matching it with its alias or function name. Prints a warning and return NaN if the statistic required is not in the list
  • Support for callable statistics: Users can now provide any custom function that reduces the raster data (such as custom percentiles).

Usage Examples:

  1. Get all statistics:

    stats = raster.get_stats()
  2. Get a single statistic (e.g., 'mean'):

    mean_value = raster.get_stats("mean")
  3. Get multiple statistics:

    selected_stats = raster.get_stats(["mean", "max", "rmse"])
  4. Using a custom callable statistic:

    def custom_stat(data):
        return np.nansum(data > 100)  # Count the number of pixels above 100
    custom_value = raster.get_stats(custom_stat)

Tests

The following tests ensure that the new functionality in the get_stats method works as expected:

  1. Full Statistics:

    • The test retrieves all available statistics (mean, median, max, min, sum, etc.) for a raster and ensures each statistic is present and not None.
  2. Single Statistic:

    • A test is included to fetch a single statistic, "Average" (which is an alias for "Mean"), and verifies that the returned value is a float.
  3. Selected Statistics:

    • The test checks the retrieval of a selected set of statistics: mean, maximum, and std. Additionally, it includes the use of a custom statistic function (percentile_95), which computes the 95th percentile of the data.
    • This validates that both built-in and user-defined statistics can be fetched successfully.
  4. Unrecognized Statistics:

    • Ensure that get_stats return NaN and a warning is raised if the statistic requested by the users is not in the list of aliases
  5. NMAD tests in test_stats:

    • Ensure the good behavior of the scale factor
    • Check if the returned nmad is the same than scipy.stats.median_abs_deviation

Documentation

A section about how to use get_stats has been added to the raster class documentation.

@vschaffn vschaffn force-pushed the 660-add_missing_stats branch from 31c857d to 2311ee6 Compare December 4, 2024 10:35
geoutils/raster/raster.py Outdated Show resolved Hide resolved
geoutils/raster/raster.py Outdated Show resolved Hide resolved
actual_name = stats_aliases[normalized_name]
return stats_dict[actual_name]
else:
logging.warning("Statistic name '%s' is not recognized", stat_name)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or raising an error here? Or is the reasoning for warning instead to accept potential typos in the statistic list?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not raised an error here in order to obtain results if stats_name is a list and only one stat is not recognized. But it could be passed as a raise if needed

geoutils/raster/raster.py Outdated Show resolved Hide resolved
@rhugonnet
Copy link
Member

Amazing! 😄
I only have small comments on structure and documenting the alias names above.
Also had to approve the test run (as this is your first PR on this repo, I think), all passing!

Copy link

@adebardo adebardo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Can you add a note to the documentation?
  • Can you add a test with an inconsistent stats name and check the log returned?
  • Did you manage to get it running on xdem?

"sum": "Sum",
"sumofsquares": "Sum of squares",
"sum2": "Sum of squares",
"percentile": "90th percentile",

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's kind to consider the user, but it does make the documentation a bit more complex. We'll see if we leave that much flexibility.

with caplog.at_level(logging.WARNING):
stat = raster.get_stats(stats_name="80 percentile")
assert isnan(stat)
assert "Statistic name '80 percentile' is not recognized" in caplog.text
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@adebardo There is already a test with an inconsistent stats name and log return checked here 😃

@vschaffn
Copy link
Author

vschaffn commented Dec 13, 2024

@adebardo I added a section about get_stats in the raster class documentation.
I did not manage to get it running on xdem yet, I am not sure I can without geoutils be released with the update because of circular import conflicts.

@rhugonnet
Copy link
Member

@vschaffn To coordinate the changes between GeoUtils and xDEM, I typically do a pip install -e . in geoutils/ after installing xdem-dev with mamba. Then, if you are on the right branches for both packages locally, you should be able to run everything (tests, doc) with both versions where you want them to be.

If you want to run the tests/docs on xDEM's CI too, you can uncomment this line here in your xDEM PR: https://github.com/GlacioHack/xdem/blob/29a8cf4c8979d51723fe8b5a5a86e1fd5ba7cb96/environment.yml#L23. And potentially specify the GeoUtils branch at the end of the github link following https://pip.pypa.io/en/stable/topics/vcs-support/ 😉
This is especially useful if you know something might be behaving differently in CI than locally or is exclusive to CI, for example testing a dev change of the CI itself (otherwise can usually wait for GeoUtils release)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[POC] add metrics from demcompare
3 participants