Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace cudf _from_columns with a public API #1326

Merged
merged 3 commits into from
Jan 23, 2024
Merged
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion python/cuspatial/cuspatial/core/spatial/distance.py
Original file line number Diff line number Diff line change
Expand Up @@ -91,7 +91,7 @@ def directed_hausdorff_distance(multipoints: GeoSeries):
as_column(multipoints.multipoints.geometry_offset[:-1]),
)

return DataFrame._from_columns(result, range(num_spaces))
return DataFrame(dict(enumerate(result)))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this as performant as before? It's definitely less obvious what is happening compared to before, so perhaps a 1-liner doc? As a non-Python-native, is this considered Pythonic?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a little more overhead due to validation checks in DataFrame as opposed to DataFrame._from_columns, but this is the most equivalent public API to the previous line. Here's the performance of this patch with 500k points

In [1]: from shapely.geometry import MultiPoint
   ...: 

In [2]: import cuspatial
   ...: 

In [3]: data = [[(0, 0)] * 500_000]

In [4]: s = cuspatial.GeoSeries([MultiPoint(coords) for coords in data])

# PR
In [5]: %timeit cuspatial.directed_hausdorff_distance(s)
555 ms ± 1.41 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

As a non-Python-native, is this considered Pythonic?

I would say so yes. I will add a 1 line comment about what this line is doing.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The dict(enumerate(...)) is a Pythonic way to construct a dictionary of the form {i: values[i]} from a sequence values. This constructor is less performant than the old approach. The more performant choice here would be to use DataFrame._from_data(dict(enumerate(...))). I assume Matt picked this alternative to avoid using an internal method. The reality though is that cuspatial hooks heavily into cudf internals at the Python layer. Until we get around to a more thorough refactoring of cudf to better expose the functionality cuspatial relies on, it's probably OK to continue using cudf internals for better perf. I'm OK with either though and will leave it up to Matt to decide what to use.

Copy link
Contributor Author

@mroeschke mroeschke Jan 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since there's an understanding that cuspatial uses internal cudf APIs already and performant, public APIs don't exist for cuspatial yet, I'll switch this to the more performant _from_data.

Copy link
Member

@harrism harrism Jan 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly I think we should remove internal cuDF calls from cuSpatial except in cases where there is a significant performance benefit. In other cases the maintenance cost is not worthwhile.

I don't know what the performance benefit is here, I think you only showed the performance of the dict() approach above.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK actually it appears validation through the public cudf.DataFrame doesn't scale well with wide frames. Not sure what is a representative scale for directed_hausdorff_distance is, but for a 1000 x 1000 point comparison, this routine's microbenchmark is equivalent to

In [1]: import cudf

In [2]: columns = [cudf.core.column.as_column(range(1000), dtype="float64")] * 1000

In [3]: %timeit cudf.DataFrame._from_data(dict(enumerate(columns)))
371 µs ± 6.96 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [4]: %timeit cudf.DataFrame(dict(enumerate(columns)))
11.3 ms ± 50 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

So let's keep _from_data here for now.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly I think we should remove internal cuDF calls from cuSpatial except in cases where there is a significant performance benefit. In other cases the maintenance cost is not worthwhile.

I agree, but IIRC there are also cases where cuSpatial uses cuDF internals because there is no public-facing API for doing what cuSpatial needs. For example, cuDF is not really designed for extensible sets of dtypes, so cuSpatial uses internal methods to set up usage of the GeoColumn because there is no public access to column-level APIs otherwise. We hope to make that properly extensible at some point, but that's a long ways away.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another main reason is that GeoColumn/GeoDataFrame extends Column/DataFrame so it depends on a lot of their internals. Removing the inheritance and using just the public APIs from cuDF may give us what we need minus the performance cost. (Use a Series everywhere a column op might come into play).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be possible but would have a significant performance cost. We can evaluate when we get around to the appropriate stage of cudf refactoring.



def haversine_distance(p1: GeoSeries, p2: GeoSeries):
Expand Down
Loading