Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this as performant as before? It's definitely less obvious what is happening compared to before, so perhaps a 1-liner doc? As a non-Python-native, is this considered Pythonic?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a little more overhead due to validation checks in
DataFrame
as opposed toDataFrame._from_columns
, but this is the most equivalent public API to the previous line. Here's the performance of this patch with 500k pointsI would say so yes. I will add a 1 line comment about what this line is doing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The
dict(enumerate(...))
is a Pythonic way to construct a dictionary of the form{i: values[i]}
from a sequencevalues
. This constructor is less performant than the old approach. The more performant choice here would be to useDataFrame._from_data(dict(enumerate(...)))
. I assume Matt picked this alternative to avoid using an internal method. The reality though is that cuspatial hooks heavily into cudf internals at the Python layer. Until we get around to a more thorough refactoring of cudf to better expose the functionality cuspatial relies on, it's probably OK to continue using cudf internals for better perf. I'm OK with either though and will leave it up to Matt to decide what to use.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since there's an understanding that cuspatial uses internal cudf APIs already and performant, public APIs don't exist for cuspatial yet, I'll switch this to the more performant
_from_data
.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Honestly I think we should remove internal cuDF calls from cuSpatial except in cases where there is a significant performance benefit. In other cases the maintenance cost is not worthwhile.
I don't know what the performance benefit is here, I think you only showed the performance of the dict() approach above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK actually it appears validation through the public
cudf.DataFrame
doesn't scale well with wide frames. Not sure what is a representative scale fordirected_hausdorff_distance
is, but for a 1000 x 1000 point comparison, this routine's microbenchmark is equivalent toSo let's keep
_from_data
here for now.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, but IIRC there are also cases where cuSpatial uses cuDF internals because there is no public-facing API for doing what cuSpatial needs. For example, cuDF is not really designed for extensible sets of dtypes, so cuSpatial uses internal methods to set up usage of the GeoColumn because there is no public access to column-level APIs otherwise. We hope to make that properly extensible at some point, but that's a long ways away.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another main reason is that GeoColumn/GeoDataFrame extends Column/DataFrame so it depends on a lot of their internals. Removing the inheritance and using just the public APIs from cuDF may give us what we need minus the performance cost. (Use a Series everywhere a column op might come into play).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be possible but would have a significant performance cost. We can evaluate when we get around to the appropriate stage of cudf refactoring.