-
Notifications
You must be signed in to change notification settings - Fork 55
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add RasterMetricSpace for improved pairwise sampling of large 2D grids #114
Conversation
cool!, reading up on this now. |
There is an example in the tests: https://github.com/mmaelicke/scikit-gstat/blob/master/skgstat/tests/test_variogram.py#L24 The advantage of this approach for testing, is that the range/sill you should get from creating a variogram from the generated correlated data,is known (approximately). |
I tried to use this example with 100 random points, setting the Variogram parameters and then kriging with OK on the same (1000, 1000) grid as above, but the calculation kept running for several minutes so I gave up! 😅 |
Ah... this is possibly a bug/limitation in OK: It needs to allocate distance pairs between all points to krige from and all points to krige to, so if the destination array is big, it can eat a lot of ram and start swapping... Batching could help this I guess? |
As for your code, I have two main feedbacks: You loop over runs in python. Would it be possible to vectorize this in numpy (haven't looked too deeply at this)? |
Thanks for the feedback! Findings
The implementation was as follows, based on # remove possible duplicates (that would be summed by default)
c, eq, d = zip(*set(zip(c, eq, d)))
dists = sparse.csr_matrix((d, (c, eq)), shape=(len(self.coords), len(self.coords))) Using the recently updated solution in https://stackoverflow.com/questions/28677162/ignoring-duplicate-entries-in-sparse-matrix reduces that computing time by a factor of 5 (1 and 20 seconds respective to the previous 5 and 100 seconds), as it is based solely on the coordinates: # Solution 5+ times faster than the preceding, but relies on _update() which might change in scipy (which
# only has an implemented method for summing duplicates, and not ignoring them yet)
dok = sparse.dok_matrix((len(self.coords), len(self.coords)))
dok._update(zip(zip(c, eq), d))
dists = dok.tocsr()
Because it is random sampling without replacement, it is not straightforward to vectorize the indices operations of dist_center = np.sqrt((self.coords[:, 0] - self._center[0]) ** 2 + (
self.coords[:, 1] - self._center[1]) ** 2)
idx1 = dist_center < self._center_radius To: dist_center = np.sqrt((self.coords[None, :, 0] - self._centers[:, 0, None]) ** 2 + (
self.coords[None, :, 1] - self._centers[:, 0, None]) ** 2)
idx1 = dist_center < self._center_radius And same for # Get distances from all centers
dist_center = np.sqrt((self.coords[None, :, 0] - self._centers[:, 0, None]) ** 2 + (
self.coords[None, :, 1] - self._centers[:, 1, None]) ** 2)
# Select samples in a ring of inside radius and outside radius for all runs
idx = np.logical_and(dist_center[None, :] >= np.array(list_in_radius)[:, None, None],
dist_center[None, :] < np.array(list_out_radius)[:, None, None]) The downside of this method is that it needs everything in memory to resolve all distances simultaneously. So for 1000 Summary
Overall, I think the speed is satisfactory for the amount of pairwise distances that are sampled. @redhog What would be the benefit of using width/height for the coords? Isn't it still width * height coordinates combinations? |
I have now added optional input parameters to allow users to tweak with the subsampling as they like, with default values that are described. |
Hey Romain, |
Hi Mirko, PR finished and ready to be reviewed! I have moved the calculations that were into It was the occasion to add a parallel option for the subsampling. In the end, not much improvement brought by parallel processing, because 90% of the computing time comes from the duplicate removal. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really like your code, it's clean and very well documented. I have just a few comments, after answering (and maybe changing), we can merge this PR.
BTW, I always had the plan to implement some sampling strategies (mainly for virtual landscapes) into the package. Do you think it is easier to inherit/extent this class, by making the rings just one of many options, or shall I go for one class per strategy? What do you think?
ratio_subsamp=0.2, | ||
runs=None, | ||
ncores=1, | ||
exp_increase_fac=np.sqrt(2), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why sqrt(2)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because the surface area of successive rings is going to be preserved when increasing the radius by a factor of exactly sqrt(2)
. The equation for the area of the ring (disk minus disk) looks like this (once the pis that even out are removed): (sqrt(2)R)**2 - R**2 = R**2
It's briefly described in the class description + the __init__
function at the line it uses the factor
Thanks for the nice feedback (and quick review)! 😊 For other sampling strategies: it really depends what they are. I think this class with equal distance to a center sample could be easily extended in a higher number of dimensions (with spheres in 3D, etc...). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your quick changes!
Just released as I quickly changed the |
Discussion started in #111 with @mmaelicke and @redhog
Added
RasterEquidistantMetricSpace
Performs subsampling adapted to Raster/Grid data to improve the empirical
Variogram
estimation:What the subclass does:
subsample
,max_dist
is reached, in 2D here it corresponds to several subsampling within rings based on the center point,This is performed iteratively for several
runs
, corresponding to different center points on the grid.Example
I used GSTools to generate random correlated fields, I couldn't find an easy way with
skgstat
!Number of pairwises distances computed:
411183
Number of pairwises distances computed:
489951
We can see that the sampling is largely improved at shorter distances, and is quite similar at larger distances. The method is much more efficient than a fully random sampling in relation to the pixel size and extent of a Raster/Grid,.
The results could be further improved by increasing the amount of samples drawn, but I kept it small here to limit the computing time of the example.
Update
Resolves #111