Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add RasterMetricSpace for improved pairwise sampling of large 2D grids #114

Merged
merged 18 commits into from
Jul 23, 2021

Conversation

rhugonnet
Copy link
Contributor

@rhugonnet rhugonnet commented Jul 14, 2021

Discussion started in #111 with @mmaelicke and @redhog

Added RasterEquidistantMetricSpace

Performs subsampling adapted to Raster/Grid data to improve the empirical Variogram estimation:

What the subclass does:

  • Selects a disk sample centered on a random point of the grid, with radius based on the resolution of the grid and the number of samples specified in subsample,
  • Selects successive equidistant samples from the center disk sample, until the max_dist is reached, in 2D here it corresponds to several subsampling within rings based on the center point,
  • Calculates the pairwises distance between the two samples.

This is performed iteratively for several runs, corresponding to different center points on the grid.

Example

I used GSTools to generate random correlated fields, I couldn't find an easy way with skgstat!

import gstools as gs
import numpy as np
import skgstat as skg
import matplotlib.pyplot as plt

# Grid/Raster of 1000 x 1000 pixels
shape = (1000, 1000)
x = np.arange(0, 1000)
y = np.arange(0, 1000)

# Using GSTools, let's generate a correlated signal at two different length: 5 and 100 (spherical)
model = gs.Spherical(dim=2, var=0.5, len_scale=5)
srf = gs.SRF(model, seed=42)

model2 = gs.Spherical(dim=2, var=0.5, len_scale=100)
srf2 = gs.SRF(model2, seed=42)

# We combine the two random correlated fields (e.g, short-range could represent resolution, and long-range the noise)
field = srf.structured([x, y]) + srf2.structured([x, y])

# Let's see how the correlated field looks
plt.imshow(field, vmin=-3, vmax=3)
plt.colorbar()

Figure_1

# Get grid coordinates, and fatten everything because we don't care about the 2D
xx, yy = np.meshgrid(x, y)
coords = np.dstack((xx.flatten(), yy.flatten())).squeeze()

# We generate a RasterEquidistantMetricSpace instance, providing the shape and coordinate extent of the grid.
# We run 100 samplings of pairwise distances between a random center sample of 25 points and several samples of
# equidistant points
rems = skg.RasterEquidistantMetricSpace(coords, shape=shape, extent=(x[0], x[-1], y[0], y[-1]), samples=25,
                                        runs=100)

# We compute the variogram with custom bins to look at all ranges more precisely
custom_bins = np.concatenate([np.arange(0, 20)] + [np.arange(20, 200, 20)] + [np.arange(200, 1200, 200)])

V1 = skg.Variogram(rems, field.flatten(), bin_func=custom_bins)

# How many pairwise distances did we compute:
print(np.sum(V1.bin_count))

Number of pairwises distances computed:
411183

# Wrapper plotting function to set a log x axis
def plot_vgm(V):
    fig = plt.figure()
    grid = plt.GridSpec(10, 10, wspace=0.5, hspace=0.5)
    ax1 = fig.add_subplot(grid[:3, :])
    ax2 = fig.add_subplot(grid[3:, :])
    ax1.set_xscale('log')
    ax2.set_xscale('log')
    V.plot(axes=[ax2, ax1], grid=grid)

# Plot the empirical variogram (ignore the single-range model fit)
plot_vgm(V1)

Figure_2

# Let's compare with a completely random subsampling
pms = skg.ProbabalisticMetricSpace(coords, samples=1000)
V2 = skg.Variogram(pms, field.flatten(), bin_func=custom_bins)

# Check we computed a similar amount of pairwise distances
print(np.sum(V2.bin_count))

Number of pairwises distances computed:
489951

# Plot the empirical variogram (ignore the single-range model fit)
plot_vgm(V2)

Figure_3

We can see that the sampling is largely improved at shorter distances, and is quite similar at larger distances. The method is much more efficient than a fully random sampling in relation to the pixel size and extent of a Raster/Grid,.
The results could be further improved by increasing the amount of samples drawn, but I kept it small here to limit the computing time of the example.

Update

  • Added tests
  • Optimized computing times
  • Added parameters + descriptions for the spatial sampling
  • Furnished class documentation

Resolves #111

@redhog
Copy link
Collaborator

redhog commented Jul 14, 2021

cool!, reading up on this now.
Btw, the easiest way to generate a randomly correlated field in scikit-gstat is to randomly generate a few points and krige them using a variogram you design yourself (i.e. chose a sill and range).

@redhog
Copy link
Collaborator

redhog commented Jul 14, 2021

There is an example in the tests: https://github.com/mmaelicke/scikit-gstat/blob/master/skgstat/tests/test_variogram.py#L24

The advantage of this approach for testing, is that the range/sill you should get from creating a variogram from the generated correlated data,is known (approximately).

@rhugonnet
Copy link
Contributor Author

There is an example in the tests: https://github.com/mmaelicke/scikit-gstat/blob/master/skgstat/tests/test_variogram.py#L24

The advantage of this approach for testing, is that the range/sill you should get from creating a variogram from the generated correlated data,is known (approximately).

I tried to use this example with 100 random points, setting the Variogram parameters and then kriging with OK on the same (1000, 1000) grid as above, but the calculation kept running for several minutes so I gave up! 😅

@redhog
Copy link
Collaborator

redhog commented Jul 14, 2021

Ah... this is possibly a bug/limitation in OK: It needs to allocate distance pairs between all points to krige from and all points to krige to, so if the destination array is big, it can eat a lot of ram and start swapping... Batching could help this I guess?

@redhog
Copy link
Collaborator

redhog commented Jul 14, 2021

As for your code, I have two main feedbacks:

You loop over runs in python. Would it be possible to vectorize this in numpy (haven't looked too deeply at this)?
You represent the grid coordinates as a normal coords array with all the coords, when it could be represented using just width and height (self.coords would have to be a property that uses meshgrid).

@rhugonnet
Copy link
Contributor Author

Thanks for the feedback!
I looked a bit at the performance.

Findings

  1. The removal of duplicates was what took most computing time (95%+) for small to medium pairwise sample sizes (10,000 - 1,000,000), typically between 5 and 100 seconds. And it was still ~80% of the computing time for sample sizes above this.

The implementation was as follows, based on set(), the most performant method without preserving order (https://stackoverflow.com/questions/480214/how-do-you-remove-duplicates-from-a-list-whilst-preserving-order) but requires to zip a Tuple of coordinates + data. Performs the same as dict.fromkeys(list).

 # remove possible duplicates (that would be summed by default)
c, eq, d = zip(*set(zip(c, eq, d)))
dists = sparse.csr_matrix((d, (c, eq)), shape=(len(self.coords), len(self.coords)))

Using the recently updated solution in https://stackoverflow.com/questions/28677162/ignoring-duplicate-entries-in-sparse-matrix reduces that computing time by a factor of 5 (1 and 20 seconds respective to the previous 5 and 100 seconds), as it is based solely on the coordinates:

# Solution 5+ times faster than the preceding, but relies on _update() which might change in scipy (which
# only has an implemented method for summing duplicates, and not ignoring them yet)
dok = sparse.dok_matrix((len(self.coords), len(self.coords)))
dok._update(zip(zip(c, eq), d))
dists = dok.tocsr()
  1. Deriving the random indexes through cidx and eqidx (the for loop you mention @redhog) to sample the pairwise distances was what took the most computing time for medium to large sample sizes (1,000,000 - 100,000,000++).

Because it is random sampling without replacement, it is not straightforward to vectorize the indices operations of numpy because we have no idea in advance how many points we need to subsample (coordinates can be on the edges of the grid, NaN data, ...).
Therefore, I vectorized all the calculations except for the random sampling, that is all the steps that derive the distances from the center points.
In cidx, from:

dist_center = np.sqrt((self.coords[:, 0] - self._center[0]) ** 2 + (
        self.coords[:, 1] - self._center[1]) ** 2)
idx1 = dist_center < self._center_radius

To:

dist_center = np.sqrt((self.coords[None, :, 0] - self._centers[:, 0, None]) ** 2 + (
        self.coords[None, :, 1] - self._centers[:, 0, None]) ** 2)
idx1 = dist_center < self._center_radius

And same for eqidx, with also another dimension for the different radius:

# Get distances from all centers
dist_center = np.sqrt((self.coords[None, :, 0] - self._centers[:, 0, None]) ** 2 + (
        self.coords[None, :, 1] - self._centers[:, 1, None]) ** 2)

# Select samples in a ring of inside radius and outside radius for all runs
idx = np.logical_and(dist_center[None, :] >= np.array(list_in_radius)[:, None, None],
                     dist_center[None, :] < np.array(list_out_radius)[:, None, None])

The downside of this method is that it needs everything in memory to resolve all distances simultaneously. So for 1000 runs, a grid of 1000 x 1000 coordinates and a dozen radius for sampling rings, one rapidly ends up with matrices of size 10,000,000,000 which results in memory errors. While, without vectorizing these steps, one doesn't run as rapidly into errors (with the for loop).
And it is only slightly faster than the for loop (10% faster with 1,000,000 samples, 50% with 100,000,000).

Summary

  • The new removal of duplicates saves a lot of time, but it is still what takes the most time. I'm not sure if there is a way to remove duplicate a priori using some other approach... so many pairwise combinations possibilites, I don't see a way to sample only independent samples!
  • I did not add the vectorized approach because it limits memory usage. An option would be to add a multiprocessing wrapper to allow running the subsamplings in parallel? Using several cores would increase memory usage, but this can be tuned a lot more easily by the user, and would help speed up calculations, including the random samplings. What do you think?

Overall, I think the speed is satisfactory for the amount of pairwise distances that are sampled.

@redhog What would be the benefit of using width/height for the coords? Isn't it still width * height coordinates combinations?
I hesitated at first, but decided to provide directly coordinates to avoid dealing with NaNs in the grid data.

@rhugonnet
Copy link
Contributor Author

I have now added optional input parameters to allow users to tweak with the subsampling as they like, with default values that are described.
Also added some basic tests, and a full description of the functionality and parameters directly in the subclass doc 😄

@mmaelicke
Copy link
Owner

Hey Romain,
Thanks for this nice PR. I will happily review it, once you consider it finished. Just request my review and I will try to free some time...

@rhugonnet
Copy link
Contributor Author

rhugonnet commented Jul 19, 2021

Hi Mirko,

PR finished and ready to be reviewed!
(couldn't press the GitHub review button for some reason, maybe this is only available to a specific group)

I have moved the calculations that were into self.eqidx and self.cidx into functions out of the class. This feels cleaner: those variables didn't fit as @property as they were updated within the loop. Now the output given by those is the fixed, final list of pairs of ensembles. Same for the ctree and eqtree.

It was the occasion to add a parallel option for the subsampling. In the end, not much improvement brought by parallel processing, because 90% of the computing time comes from the duplicate removal.
I'm thinking of leaving it in, though, as it does not change anything for the default processing on a single core ncores=1. And it might still be useful at a later stage if a better solution is found for duplicates (e.g., using the duplicate pairwise samples several times instead of removing them? this is currently not possible because CSR is the format expected by the Variogram class, and CSR sparse matrices automatically sum duplicates thus providing wrong pairwise distances).

@mmaelicke mmaelicke self-requested a review July 20, 2021 06:26
Copy link
Owner

@mmaelicke mmaelicke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like your code, it's clean and very well documented. I have just a few comments, after answering (and maybe changing), we can merge this PR.

BTW, I always had the plan to implement some sampling strategies (mainly for virtual landscapes) into the package. Do you think it is easier to inherit/extent this class, by making the rings just one of many options, or shall I go for one class per strategy? What do you think?

skgstat/MetricSpace.py Outdated Show resolved Hide resolved
skgstat/MetricSpace.py Outdated Show resolved Hide resolved
skgstat/MetricSpace.py Show resolved Hide resolved
skgstat/MetricSpace.py Outdated Show resolved Hide resolved
skgstat/MetricSpace.py Outdated Show resolved Hide resolved
ratio_subsamp=0.2,
runs=None,
ncores=1,
exp_increase_fac=np.sqrt(2),
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why sqrt(2)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because the surface area of successive rings is going to be preserved when increasing the radius by a factor of exactly sqrt(2). The equation for the area of the ring (disk minus disk) looks like this (once the pis that even out are removed): (sqrt(2)R)**2 - R**2 = R**2

It's briefly described in the class description + the __init__ function at the line it uses the factor

@rhugonnet
Copy link
Contributor Author

I really like your code, it's clean and very well documented. I have just a few comments, after answering (and maybe changing), we can merge this PR.

BTW, I always had the plan to implement some sampling strategies (mainly for virtual landscapes) into the package. Do you think it is easier to inherit/extent this class, by making the rings just one of many options, or shall I go for one class per strategy? What do you think?

Thanks for the nice feedback (and quick review)! 😊

For other sampling strategies: it really depends what they are. I think this class with equal distance to a center sample could be easily extended in a higher number of dimensions (with spheres in 3D, etc...).
For other shapes, it would be doable to integrate those into the same class, but the problem would be to manage the many possible input parameters the user could pass... My guess is that you'd reach a high number of those and need to pass them all as a dictionary or a kwargs and select only the ones of interest. Not so difficult to code, but not so easy to use! So maybe creating different classes for different shapes would be preferable here, not sure.

Copy link
Owner

@mmaelicke mmaelicke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your quick changes!

@mmaelicke mmaelicke merged commit 258b687 into mmaelicke:master Jul 23, 2021
@mmaelicke
Copy link
Owner

Just released as 0.6.7.

I quickly changed the from __future__ import annotations to from typing import Tuple, List, as the annotations module needs Python >= 3.7. SciKit-GStat still supports 3.6. However, if you specifically need annotations module, we can open an issue and discuss. 3.6 might be dropped soon anyway (and eol is in dec I think).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Easier support for raster data: custom bins, user-defined maxlag, return count, etc...
3 participants