
[BUG] Error "table_view.cpp:36: Column size mismatch" when using approx_percentile on a string column #11367

Closed
viadea opened this issue Aug 20, 2024 · 7 comments

viadea commented Aug 20, 2024

Describe the bug
Error "table_view.cpp:36: Column size mismatch" when using approx_percentile on a string column

Steps/Code to reproduce bug

```scala
spark.read.parquet("gs://nds-uscentral1/parquet_sf3k_decimal/customer").createOrReplaceTempView("customer")
val output = spark.sql("""select c_email_address, approx_percentile(c_customer_id, 0.95) col2 from customer group by c_email_address;""")
output.write.mode("overwrite").parquet("gs://xxx/yyy")
```

Error:

```
Caused by: ai.rapids.cudf.CudfException: CUDF failure at: /home/jenkins/agent/workspace/jenkins-spark-rapids-jni_nightly-pre_release-323-cuda11/thirdparty/cudf/cpp/src/table/table_view.cpp:36: Column size mismatch.
	at ai.rapids.cudf.Table.createCudfTableView(Native Method)
	at ai.rapids.cudf.Table.<init>(Table.java:92)
	at ai.rapids.cudf.Table$GroupByOperation.aggregate(Table.java:4099)
	at com.nvidia.spark.rapids.AggHelper.$anonfun$performGroupByAggregation$2(GpuAggregateExec.scala:394)
	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:30)
	at com.nvidia.spark.rapids.AggHelper.$anonfun$performGroupByAggregation$1(GpuAggregateExec.scala:381)
	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:30)
	at com.nvidia.spark.rapids.AggHelper.performGroupByAggregation(GpuAggregateExec.scala:380)
	at com.nvidia.spark.rapids.AggHelper.aggregate(GpuAggregateExec.scala:295)
	at com.nvidia.spark.rapids.AggHelper.$anonfun$aggregateWithoutCombine$4(GpuAggregateExec.scala:314)
	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:30)
	at com.nvidia.spark.rapids.AggHelper.$anonfun$aggregateWithoutCombine$3(GpuAggregateExec.scala:312)
	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:30)
	at com.nvidia.spark.rapids.AggHelper.$anonfun$aggregateWithoutCombine$2(GpuAggregateExec.scala:311)
	at com.nvidia.spark.rapids.RmmRapidsRetryIterator$AutoCloseableAttemptSpliterator.next(RmmRapidsRetryIterator.scala:477)
	at com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryIterator.next(RmmRapidsRetryIterator.scala:613)
	at com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryAutoCloseableIterator.next(RmmRapidsRetryIterator.scala:517)
```

Expected behavior
It should complete successfully, as it does when run on the CPU.

Environment details (please complete the following information)
Dataproc 2.1
Spark RAPIDS 24.08 snapshot

@viadea viadea added bug Something isn't working ? - Needs Triage Need team to review and classify labels Aug 20, 2024

revans2 commented Aug 20, 2024

I was able to reproduce this with a much smaller data set. It happens in the final aggregation.

c_customer_id is a STRING that does not look like a number at all.

```
+----------------------------+----------------+
|c_email_address             |c_customer_id   |
+----------------------------+----------------+
|[email protected]   |AAAAAAAAMELPMBAA|
|[email protected]|AAAAAAAALLINNBAA|
+----------------------------+----------------+
```

When it is converted to a double for approx_percentile, it ends up as null. We have tested this case before, but apparently something goes wrong in a group by with a small number of rows (more than one) in a MERGE_TDIGEST.

```scala
spark.conf.set("spark.sql.shuffle.partitions", "1")
Seq(("a", "a"), ("b", "b")).toDF("k", "v").createOrReplaceTempView("tmp")
spark.sql("""select k, approx_percentile(v, 0.95) col2 from tmp group by k;""").show()
```

This should be super simple to put into a C++ repro so we can debug it.
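
A minimal sketch of what such a C++ repro could look like, assuming the libcudf groupby API and the cudf test column wrappers (the delta of 1000 and the exact column contents are illustrative assumptions, not a verified reproduction):

```cpp
// Hypothetical minimal repro sketch for the MERGE_TDIGEST size-mismatch bug.
// It mimics the Spark plan: a partial TDIGEST aggregation over all-null
// doubles, followed by a final MERGE_TDIGEST across multiple keys.
#include <cudf/aggregation.hpp>
#include <cudf/column/column.hpp>
#include <cudf/groupby.hpp>
#include <cudf/table/table.hpp>
#include <cudf/table/table_view.hpp>
#include <cudf_test/column_wrapper.hpp>

#include <vector>

void repro_merge_tdigest_empty_clusters()
{
  // Two distinct keys; values are all null, like a STRING column that became
  // all-null after being cast to DOUBLE for approx_percentile.
  cudf::test::fixed_width_column_wrapper<int32_t> keys{{0, 1}};
  cudf::test::fixed_width_column_wrapper<double> vals({0.0, 0.0}, {false, false});

  // Partial aggregation: one (empty) tdigest per key.
  cudf::groupby::groupby gb1(cudf::table_view({keys}));
  std::vector<cudf::groupby::aggregation_request> reqs1(1);
  reqs1[0].values = vals;
  reqs1[0].aggregations.push_back(
      cudf::make_tdigest_aggregation<cudf::groupby_aggregation>(1000));
  auto [out_keys, partials] = gb1.aggregate(reqs1);

  // Final aggregation: merge the per-key tdigests. With every input tdigest
  // empty, the buggy merge collapses the value column to a single row while
  // the key column keeps two rows, so assembling the result table throws
  // "Column size mismatch".
  cudf::groupby::groupby gb2(out_keys->view());
  std::vector<cudf::groupby::aggregation_request> reqs2(1);
  reqs2[0].values = partials[0].results[0]->view();
  reqs2[0].aggregations.push_back(
      cudf::make_merge_tdigest_aggregation<cudf::groupby_aggregation>(1000));
  auto merged = gb2.aggregate(reqs2);  // expected to throw before the fix
}
```

The key point is that both aggregation stages run; only the final merge, where every input tdigest is empty, trips the size check.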

@mattahrens mattahrens removed the ? - Needs Triage Need team to review and classify label Aug 20, 2024
rapids-bot bot pushed a commit to rapidsai/cudf that referenced this issue Sep 10, 2024
This PR fixes an edge case bug in the tdigest merge. When there are multiple distinct keys but all values are empty clusters, the value column is currently merged into a single empty cluster, which leads to an error while creating the result table because of the mismatched number of rows in the key and value columns. This bug can be reproduced only when all values are empty clusters; if some values are empty but some are not, the current implementation returns a valid result. This bug was originally reported in NVIDIA/spark-rapids#11367.

The bug exists in `merge_tdigests()`, which assumes that there are no empty clusters in the merge stage even when there are (`has_nulls` is hard-coded to `false`). It is safer to assume that empty clusters can always occur, so this PR sets the flag to `true`. Also, `has_nulls` has been renamed to the more descriptive `may_have_empty_clusters`.

The tdigest reduce does not have the same issue as it does not call `merge_tdigests()`.

Authors:
  - Jihoon Son (https://github.com/jihoonson)
  - Muhammad Haseeb (https://github.com/mhaseeb123)

Approvers:
  - Muhammad Haseeb (https://github.com/mhaseeb123)
  - https://github.com/nvdbaranec

URL: #16675
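
For reference, the "Column size mismatch" in the stack trace is libcudf's table-level invariant that every column in a `table_view` has the same row count, which is exactly what the collapsed value column violates (two key rows vs. one merged value row). A tiny standalone toy that trips the same check, again a hedged sketch using the cudf test column wrappers:

```cpp
// Standalone illustration of the invariant itself: every column in a
// cudf::table_view must have the same row count. Hypothetical toy code,
// not taken from the actual fix.
#include <cudf/table/table_view.hpp>
#include <cudf/utilities/error.hpp>
#include <cudf_test/column_wrapper.hpp>

#include <iostream>

int main()
{
  cudf::test::fixed_width_column_wrapper<int32_t> keys{{0, 1}};  // 2 rows
  cudf::test::fixed_width_column_wrapper<double> vals{{0.0}};    // 1 row

  try {
    cudf::table_view t({keys, vals});  // column sizes differ
  } catch (cudf::logic_error const& e) {
    std::cout << e.what() << std::endl;  // "... Column size mismatch."
  }
  return 0;
}
```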

revans2 commented Sep 11, 2024

@jihoonson I believe that this is fixed now. Is there something else we are waiting on to close this?


jihoonson commented Sep 11, 2024

@revans2 A new test failure was found in the nightly run yesterday (#11463), which looks related to the change in rapidsai/cudf#16675. I'm looking into whether that change caused the failure. If so, we may want to revert rapidsai/cudf#16675 for now and then open another PR to fix it. I would like to keep this issue open until the triage is done.

@jihoonson

Decided to revert rapidsai/cudf#16675. Will keep this issue open until we merge a proper fix.

jihoonson added a commit to jihoonson/cudf that referenced this issue Sep 24, 2024

viadea commented Sep 25, 2024

Let me add the impact: it is NOT affecting any users. This was just my own finding while reproducing another user issue.

@jihoonson

We should be able to close this now, as rapidsai/cudf#16897 has been merged, though I would like to wait one day to let all the nightly tests run, just in case.

@jihoonson

Closing this issue as the nightly tests have passed.
