[BUG] GpuExplode single row split to fit cuDF limits #10088

Closed
abellina opened this issue Dec 20, 2023 · 3 comments · Fixed by #10131 or #10193
Labels
bug Something isn't working

Comments

@abellina
Collaborator

We have seen a case where a single row containing a large string (about 1MB) and other columns also has a list to explode with many elements (10K elements, for example). When we handle such an explode, we currently do not split the input because it is a single row. However, the explode repeats the other columns once per list element, so the output can exceed cuDF column size limits.

The proposal is to handle this at least in the withRetry case: splitInHalfByRows for the explode case could have a special clause so that, for an input of 1 row, it can still split the list we are exploding and replicate the rest of the row accordingly.

For example:

| col1 | col2 |
| --- | --- |
| "very long string..." | [list of 10,000 items] |

Would become:

| col1 | col2 |
| --- | --- |
| "very long string..." | [list of 5,000 items] |
| "very long string..." | [list of 5,000 items] |

or more rows, with fewer items in each list. We should be able to calculate the size of a row that would fit within cuDF limits, then split the list accordingly and replicate the rest of the columns to go along with each piece.
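The split described above can be sketched in plain Python, with the row modeled as a dict and no cuDF involved. The name `split_single_row`, its parameters, and the dict representation are assumptions made for illustration; they are not the plugin's API.

```python
from typing import Any, Dict, List


def split_single_row(row: Dict[str, Any], list_col: str,
                     num_splits: int) -> List[Dict[str, Any]]:
    """Split one row into at most `num_splits` rows by slicing `list_col`.

    Every other column is replicated unchanged into each output row.
    Hypothetical helper: it models the idea only, not the spark-rapids code.
    """
    items = row[list_col]
    if not items:
        # Nothing to split; return a copy of the row as-is.
        return [dict(row)]
    # Never ask for more pieces than there are items, and always at least one.
    num_splits = max(1, min(num_splits, len(items)))
    # Ceiling division so the chunks cover the whole list.
    chunk = -(-len(items) // num_splits)
    out = []
    for start in range(0, len(items), chunk):
        new_row = dict(row)  # replicate the carry-along columns
        new_row[list_col] = items[start:start + chunk]
        out.append(new_row)
    return out


row = {"col1": "very long string...", "col2": list(range(10_000))}
halves = split_single_row(row, "col2", 2)
```

Here `halves` holds two rows, each with the same `col1` value and 5,000 of the original list items, matching the example above.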

abellina added the "bug" (Something isn't working) and "? - Needs Triage" (Need team to review and classify) labels on Dec 20, 2023
abellina changed the title from "[BUG] GpuExplode single row split" to "[BUG] GpuExplode single row split to fit cuDF limits" on Dec 20, 2023
@abellina
Collaborator Author

PoC abellina@8bced26

Note that the PoC picks an arbitrary 100 splits for the list we are exploding; this is not what we ultimately want. We should compute the number of splits so that the result fits within the 2B-entry limit of a cuDF column.
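As a rough sketch of how the split count could be computed instead of hard-coded, the following Python assumes two limits of 2**31 - 1: one on rows per column and one on total bytes in a string column. Both constants, the function name `num_splits_to_fit`, and the idea of sizing by the largest carry-along string are assumptions for illustration and are not taken from the PoC.

```python
# Assumed limits for this sketch: rows are indexed with a signed 32-bit
# integer, and string data is modeled with the same cap on total bytes.
CUDF_MAX_ROWS = 2**31 - 1
CUDF_MAX_STRING_BYTES = 2**31 - 1


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division, avoiding floating point."""
    return -(-a // b)


def num_splits_to_fit(list_len: int, max_carry_along_bytes: int) -> int:
    """Smallest number of evenly sized list chunks that stay under both limits.

    `max_carry_along_bytes` is the byte size of the largest carry-along string
    in the single input row; exploding repeats it once per list element.
    """
    if list_len == 0:
        return 1
    splits_for_rows = ceil_div(list_len, CUDF_MAX_ROWS)
    # Most list elements one chunk can hold before the repeated string
    # bytes exceed the assumed byte limit.
    max_elems_for_bytes = max(1, CUDF_MAX_STRING_BYTES // max(1, max_carry_along_bytes))
    splits_for_bytes = ceil_div(list_len, max_elems_for_bytes)
    return max(splits_for_rows, splits_for_bytes)


# The case from this issue: a 1 MiB string and a 10,000-element list.
splits = num_splits_to_fit(10_000, 1024 * 1024)
print(splits)  # prints 5
```

Under these assumptions the row count is nowhere near the limit; it is the repeated 1 MiB string that forces the split, since each chunk can hold at most 2,047 repetitions.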

abellina self-assigned this on Dec 27, 2023
@abellina
Collaborator Author

The PoC is fairly close to what we want. I'll polish it up and put up a PR.

@abellina
Collaborator Author

abellina commented Jan 12, 2024

I am seeing a problem where the output isn't quite right when there are more columns in the project, and none of the tests catch it. I am investigating why this is happening.

Specifically, the first column is being used for every column except the exploding column. So if we have 2 carry-along columns, the second column isn't correct.
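The symptom can be modeled in a few lines of plain Python. One possible reason the tests miss it, which is an assumption here and not something the issue confirms, is that a test with a single carry-along column cannot tell the columns apart. The functions `explode_correct` and `explode_buggy` are hypothetical stand-ins, not plugin code.

```python
from typing import List, Sequence


def explode_correct(carry: Sequence, items: Sequence) -> List[tuple]:
    """Reference explode: repeat every carry-along value once per list item."""
    return [tuple(carry) + (item,) for item in items]


def explode_buggy(carry: Sequence, items: Sequence) -> List[tuple]:
    """Models the reported symptom: the first carry-along column is reused
    in place of every carry-along column."""
    return [tuple(carry[0] for _ in carry) + (item,) for item in items]


items = [1, 2, 3]
# With a single carry-along column the two versions agree, so the bug hides.
one_col_hidden = explode_buggy(["a"], items) == explode_correct(["a"], items)
# With two distinct carry-along columns the outputs differ and a test fails.
two_cols_caught = explode_buggy(["a", "b"], items) != explode_correct(["a", "b"], items)
print(one_col_hidden, two_cols_caught)  # prints True True
```

A regression test along these lines would use at least two carry-along columns with different values, so that a column mix-up changes the output.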
