Easier support for raster data: custom bins, user-defined maxlag, return count, etc... #111
Comments
Thanks a lot for the feedback!
I am looking forward to your PRs, thanks a lot! BTW: we also recently added a ProbabilisticMetricPair class, which only samples a subset of large input data in a probabilistic way. Maybe that is of interest for your use case as well. And by "we", I mean @redhog; he is maybe a little more into your use case. Looking forward to your contributions! Mirko
Ok, so I might not have understood 100% of what you are trying to achieve, but here are my 5 cents: you should be able to just use ProbabilisticMetricPair for the random subsampling, building the coordinates with

```python
coordinates = np.column_stack(
    [a.flatten() for a in np.meshgrid(np.arange(image.shape[0]), np.arange(image.shape[1]))]
)
```

However, after some more consideration, I think it would be advantageous to write a new class, because your input set of coordinates is dense; generating/storing the meshgrid output is quite wasteful. Such a class would be very similar to ProbabilisticMetricPair, but would only store the image width and height internally.
Btw, your use case is a bit similar to ours: our input is a dense triangulated surface of measurement points (e.g. surface elevation), typically over an area orders of magnitude larger than the lag. Don't hesitate to ping me if you have problems building your optimizations or want feedback!
Thanks @redhog for the help! I went through the ProbabilisticMetricPair code; if it indeed subsamples coordinates randomly, then that is what we do most of the time before computing the variogram.

However, I recently realized this is not always optimal, because sampling a grid of a certain size completely at random will always lead to a certain density/clustering of points (like in the first figure here: https://scipython.com/blog/poisson-disc-sampling-in-python/), which can make it difficult to constrain the empirical variogram for some categories of lags: typically for very small lags (only a few completely adjacent grid points will be sampled) and for large lags (few grid points are sampled with pairwise distances on the order of the size of the grid).

My idea to remediate this was to subsample not completely at random, but according to a certain spatial distribution. For instance, in 2D, subsampling iteratively in rings of increasing radius (e.g., start by sampling many small disks to resolve the variance at small lags, then many medium-size rings for medium lags, and finally big rings for large lags). The center of the ring would also be sampled randomly several times across the grid. It's a bit of a "hacky" solution, but for big grid data it can lead to better sampling at all lags, and in much shorter times (maybe a hundredth, or even a thousandth, of the computing time of random selection, depending a lot on the grid size).
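For illustration, a rough numpy sketch of what such ring/disk based subsampling could look like (all radii, counts and the helper name below are hypothetical; this is not existing `skgstat` functionality):

```python
import numpy as np

def sample_ring(shape, center, r_in, r_out, n, rng):
    """Randomly pick up to n grid cells whose distance to `center`
    falls inside the ring [r_in, r_out)."""
    rows, cols = np.indices(shape)
    dist = np.hypot(rows - center[0], cols - center[1])
    candidates = np.argwhere((dist >= r_in) & (dist < r_out))
    if len(candidates) == 0:
        return np.empty((0, 2), dtype=int)
    idx = rng.choice(len(candidates), size=min(n, len(candidates)), replace=False)
    return candidates[idx]

rng = np.random.default_rng(42)
shape = (1000, 1000)

# many small disks for the short lags, fewer large rings for the long lags
ring_specs = [(0, 10, 50, 20), (100, 150, 20, 20), (500, 600, 5, 20)]
samples = []
for r_in, r_out, n_centers, n_per_ring in ring_specs:
    for _ in range(n_centers):
        center = (rng.integers(0, shape[0]), rng.integers(0, shape[1]))
        samples.append(sample_ring(shape, center, r_in, r_out, n_per_ring, rng))
coords = np.concatenate(samples)  # subsampled pixel coordinates for the variogram
```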
Yes, it subsamples coordinates randomly. This means that it does not subsample coordinate pairs randomly, however. What it does is sample all coordinates twice, into two sets L and R, and then produce all distance pairs between these two sets. When working with a raster, you could probably do better than this. The only property that your class has to implement is `dists`.
This is supposed to return a scipy.sparse.spmatrix (optimally in CSR or CSC format), where the column and row indexes both correspond to indexes into your values array for the variogram. Since your "sampling" is just generating coordinate pairs where x and y are ints in the ranges (0, image.shape[0]) and (0, image.shape[1]) respectively, what you describe should be pretty straightforward to implement in numpy & scipy. You probably have to tell your class what the original image dimensions were, since values is not available in a MetricSpace, and would need to be flattened anyway...
I would suggest making a RasterSamplingMetricSpace class that can be parameterized with different ways to sample the space as you describe. |
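A minimal sketch of what such a class could look like. The class name, constructor arguments and the simple left/right subsampling below are hypothetical placeholders; only the `dists` property returning a `scipy.sparse` matrix follows the requirement described above:

```python
import numpy as np
import scipy.sparse
from scipy.spatial.distance import cdist

class RasterSamplingMetricSpace:
    """Hypothetical metric space over a dense raster: stores only the image
    shape and builds a sparse pairwise-distance matrix for a pixel subsample."""

    def __init__(self, shape, samples=500, max_dist=None, rng=None):
        self.shape = shape          # (height, width) of the original image
        self.samples = samples      # number of pixels in each of the two sets
        self.max_dist = max_dist    # optional distance cutoff
        self.rng = rng or np.random.default_rng()
        self._dists = None

    @property
    def dists(self):
        """Sparse distance matrix; row/column indices point into the flattened image."""
        if self._dists is None:
            n_pix = self.shape[0] * self.shape[1]
            left = self.rng.choice(n_pix, self.samples, replace=False)
            right = self.rng.choice(n_pix, self.samples, replace=False)
            lxy = np.column_stack(np.unravel_index(left, self.shape))
            rxy = np.column_stack(np.unravel_index(right, self.shape))
            d = cdist(lxy, rxy)
            mask = np.ones(d.shape, dtype=bool) if self.max_dist is None else d <= self.max_dist
            rows = np.repeat(left, self.samples)[mask.ravel()]
            cols = np.tile(right, self.samples)[mask.ravel()]
            coo = scipy.sparse.coo_matrix((d[mask], (rows, cols)), shape=(n_pix, n_pix))
            self._dists = coo.tocsr()
        return self._dists
```

The ring/disk sampling idea from above could then be plugged in by swapping the random left/right selection for a smarter strategy, while keeping the same `dists` contract.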
I did some experimentation, and realized something: you can't get an even sample of distances and an even sample of locations at the same time. The really long distances can only have their endpoints close to the edges of the grid. So you can have an even sample of points, but then short distances are more common.
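A quick numerical illustration of that trade-off (plain numpy/scipy, independent of `skgstat`): for uniformly random points in a square, short pairwise distances vastly outnumber distances close to the diagonal.

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(1)
pts = rng.uniform(0, 1000, size=(2000, 2))   # even sample of locations in a 1000 x 1000 area
d = pdist(pts)                                # all pairwise distances

# Fraction of pairs per distance class: the bins near the maximum possible
# distance (the diagonal, ~1414) contain almost no pairs.
counts, edges = np.histogram(d, bins=10, range=(0, 1500))
print(np.round(counts / counts.sum(), 3))
```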
Hi there!
First, thanks for all the work on scikit-gstat 👍.
It's great and I'm very happy to see geostatistics emerging in Python!
I used `skgstat` quite a bit recently (moving away from R) and cited it in our last paper! 😉 I work a lot with satellite data, and quantifying spatial correlations is crucial there too, especially for uncertainty estimation.
However, due to the large number of samples, we can't calculate pairwise differences for ~ten million points in an image. We have to subsample intelligently, and sometimes combine variograms from different grids.
For this, there are still a few shackles in `skgstat` which make things a bit difficult. I intended to submit PRs with some proposed changes, but I think it's better if I first present those issues and we discuss whether and which changes would be best!

`maxlag`
If I want to subsample different grids, or subsample a grid that has too many samples several times, it is currently not convenient to aggregate the results over the same bin distribution, because the `maxlag` (and therefore all the bins) will be defined by the maximum distance forced by `skgstat` (note: we need to pass a `maxlag` larger than the grid size to ensure we use all samples). Example below:
[1386.8327224290606, 1392.293431716174, 1376.7370845589944, 1383.8731878318908, 1388.1156291894417, 1385.964646013743, 1371.1458711603227, 1372.5906163164602, 1393.026202194345, 1385.9552662333658]
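For context, a sketch of the kind of loop that can produce a list like the one above (the grid, sample size and `maxlag` value here are hypothetical placeholders, not the original script):

```python
import numpy as np
import skgstat as skg

rng = np.random.default_rng(0)
image = rng.normal(size=(1000, 1000))          # hypothetical 1000 x 1000 grid of values

last_edges = []
for _ in range(10):
    # random subsample of 500 pixels
    rows = rng.integers(0, image.shape[0], 500)
    cols = rng.integers(0, image.shape[1], 500)
    coords = np.column_stack((rows, cols))

    # maxlag deliberately larger than the grid diagonal (~1414) to keep all pairs
    V = skg.Variogram(coords, image[rows, cols], n_lags=10, maxlag=2000)

    # as described above, the last bin edge follows the maximum sampled
    # distance rather than the requested maxlag, so it changes between runs
    last_edges.append(V.bins[-1])

print(last_edges)
```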
This is really an easy fix: changing `binning.py` so that the bins honour a user-provided `maxlag` instead of the maximum observed distance.
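As a rough sketch of the kind of change meant here (a hypothetical stand-in for the function in `binning.py`, assuming it currently ties the upper bin edge to the largest observed distance):

```python
import numpy as np

def even_width_lags(distances, n, maxlag):
    """Sketch of an even-width binning function, not the actual skgstat code."""
    if maxlag is not None and maxlag < np.nanmax(distances):
        distances = distances[distances <= maxlag]

    # current behaviour (sketch): the last edge is tied to the largest observed distance
    # return np.linspace(0, np.nanmax(distances), n + 1)[1:]

    # proposed behaviour (sketch): honour a user-defined maxlag even when it is
    # larger than the largest observed distance, so bins stay comparable across grids
    upper = maxlag if maxlag is not None else np.nanmax(distances)
    return np.linspace(0, upper, n + 1)[1:]
```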
But there are also other issues that are closely related, so I think it's worth looking at the bigger picture first.
With satellite data, we tend to have instrument noise that plagues the structure of our data.
Due to this, we sometimes see variograms that show correlations at different scales. For example, a 5 m gridded satellite image will show: a first (large) sill at a range of 15 meters linked to the native resolution of the data, a second sill at 500 meters due to some short-range instrument noise, and a third at 5 km due to long-range instrument noise.
To better work with those correlations at different ranges, it would be good to be able to pass user-defined, custom `bins`. Maybe through `bin_func`? Or an additional argument (as `bin_func` is currently a string that defines the bin distribution depending on the `distances`, although it could be made into a `Union[str, Iterable]` so that an `np.ndarray` of custom bin edges could be passed)!

Based on what was said above, for our applications we often need to fit a sum of variogram models to get a good representation of the spatial correlation at different scales (e.g., https://www.frontiersin.org/files/Articles/566802/feart-08-566802-HTML-r1/image_m/feart-08-566802-g009.jpg, https://www.nature.com/articles/s41586-021-03436-z/figures/9).
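To illustrate what a sum of variogram models looks like, here is a plain-numpy sketch of a double spherical model (hypothetical parameters; this is not existing `skgstat` functionality):

```python
import numpy as np

def spherical(h, sill, a):
    """Single spherical model with zero nugget (textbook form), range a."""
    h = np.asarray(h, dtype=float)
    gamma = sill * (1.5 * h / a - 0.5 * (h / a) ** 3)
    return np.where(h < a, gamma, sill)

def double_spherical(h, sill1, range1, sill2, range2):
    """Sum of two spherical components, e.g. a short-range and a long-range one."""
    return spherical(h, sill1, range1) + spherical(h, sill2, range2)

lags = np.linspace(0, 6000, 200)
# hypothetical parameters: partial sill of 1.0 at 500 m plus 0.5 at 5 km
gamma = double_spherical(lags, 1.0, 500.0, 0.5, 5000.0)
```

Fitting such a model then amounts to estimating the partial sills and ranges of both components jointly (e.g. with scipy.optimize.curve_fit).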
This is something we are currently working on implementing in our package `xdem`: https://github.com/GlacioHack/xdem/blob/main/xdem/spstats.py#L531 (early stages). But if this kind of feature feels useful for `skgstat`, I think it would make sense to have it available directly here as well!

Other minor issues (based on my experience of the past year; I hope those are still relevant):
If I'm not mistaken, the number of point pairs per lag bin can currently only be re-derived by hand from the distances and the bins.
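For illustration, one possible way this re-derivation could look (a sketch only; it assumes `V.bins` holds the upper bin edges and recomputes the pairwise distances with scipy, which may differ slightly from skgstat's internal grouping):

```python
import numpy as np
import skgstat as skg
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
coords = rng.uniform(0, 100, size=(200, 2))
values = rng.normal(size=200)
V = skg.Variogram(coords, values, n_lags=10)

# pair counts per lag bin, re-derived from the upper bin edges in V.bins
d = pdist(coords)                                     # every pairwise distance
groups = np.digitize(d, V.bins, right=True)           # bin index for every pair
counts = np.bincount(groups, minlength=len(V.bins) + 1)[: len(V.bins)]
print(counts)
```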
It would be practical to have a `count` attribute storing this directly.

To wrap things up: if those issues are of interest, I could quickly push proposed changes as PRs for points 1, 2 and 4 (the `maxlag` behaviour, custom `bins`, and the `count` attribute)! I'll let you tell me what's best.
And thanks again for the great work!