-
Notifications
You must be signed in to change notification settings - Fork 237
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BUG] Error "table_view.cpp:36: Column size mismatch" when using approx_percentile on a string column #11367
Comments
I was able to reproduce this with a much smaller data set It happens in the final aggregation. c_customer_id is a STRING that does not look like a number at all.
and when it is converted to a double, for the approximate_percentile, it ends up as
This should be super simple to put into a C++ repro so we can debug it. |
This PR fixes an edge case bug in the tdigest merge. When there are multiple distinct keys but all values are empty clusters, the value column is currently merged into a single empty cluster after merge, which leads to an error while creating a result table because of the mismatching number of rows in the key and value columns. This bug can be reproduced only when all values are empty clusters. If some values are empty but some are not, the current implementation returns a valid result. This bug was originally reported in NVIDIA/spark-rapids#11367. The bug exists in `merge_tdigests()` as it assumes that there is no empty cluster in the merge stage even when there are (`has_nulls` are fixed to `false`). It is rather safe to assume that always there could be empty clusters. This PR fixes the flag by fixing it to true. Also, `has_nulls` has been renamed to a more descriptive name, `may_have_empty_clusters`. The tdigest reduce does not have the same issue as it does not call `merge_tdigests()`. Authors: - Jihoon Son (https://github.com/jihoonson) - Muhammad Haseeb (https://github.com/mhaseeb123) Approvers: - Muhammad Haseeb (https://github.com/mhaseeb123) - https://github.com/nvdbaranec URL: #16675
@jihoonson I believe that this is fixed now. Is there something else we are waiting on to close this? |
@revans2 A new test failure was found in the nightly run yesterday (#11463), which looks related to the change in rapidsai/cudf#16675. I'm looking if that change has caused the failure. If so, we may want to revert rapidsai/cudf#16675 for now, and then make another PR to fix it. I would like to keep this open until the triage is done. |
Decided to revert rapidsai/cudf#16675. Will keep this issue open until we merge a proper fix. |
This PR fixes an edge case bug in the tdigest merge. When there are multiple distinct keys but all values are empty clusters, the value column is currently merged into a single empty cluster after merge, which leads to an error while creating a result table because of the mismatching number of rows in the key and value columns. This bug can be reproduced only when all values are empty clusters. If some values are empty but some are not, the current implementation returns a valid result. This bug was originally reported in NVIDIA/spark-rapids#11367. The bug exists in `merge_tdigests()` as it assumes that there is no empty cluster in the merge stage even when there are (`has_nulls` are fixed to `false`). It is rather safe to assume that always there could be empty clusters. This PR fixes the flag by fixing it to true. Also, `has_nulls` has been renamed to a more descriptive name, `may_have_empty_clusters`. The tdigest reduce does not have the same issue as it does not call `merge_tdigests()`. Authors: - Jihoon Son (https://github.com/jihoonson) - Muhammad Haseeb (https://github.com/mhaseeb123) Approvers: - Muhammad Haseeb (https://github.com/mhaseeb123) - https://github.com/nvdbaranec URL: rapidsai#16675
Let me add the impact: it is NOT impacting any users. This is just my own finding when reproducing another user issue. |
We should be able to close this now as rapidsai/cudf#16897 has been merged. Though I would like to wait one day to let all nightly tests run just in case. |
Closing this issue as the nightly tests have passed. |
Describe the bug
Error "table_view.cpp:36: Column size mismatch" when using approx_percentile on a string column
Steps/Code to reproduce bug
Error:
Expected behavior
As CPU run, it should complete successfully.
Environment details (please complete the following information)
Dataproc 2.1
Spark RAPIDS 24.08 snapshot
The text was updated successfully, but these errors were encountered: