
[BUG] Error "table_view.cpp:36: Column size mismatch" when using approx_percentile on a string column #11367

Closed
viadea opened this issue Aug 20, 2024 · 7 comments

viadea commented Aug 20, 2024

Describe the bug
Error "table_view.cpp:36: Column size mismatch" when using approx_percentile on a string column

Steps/Code to reproduce bug

```scala
spark.read.parquet("gs://nds-uscentral1/parquet_sf3k_decimal/customer").createOrReplaceTempView("customer")
val output = spark.sql("""select c_email_address, approx_percentile(c_customer_id, 0.95) col2 from customer group by c_email_address;""")
output.write.mode("overwrite").parquet("gs://xxx/yyy")
```

Error:

```
Caused by: ai.rapids.cudf.CudfException: CUDF failure at: /home/jenkins/agent/workspace/jenkins-spark-rapids-jni_nightly-pre_release-323-cuda11/thirdparty/cudf/cpp/src/table/table_view.cpp:36: Column size mismatch.
	at ai.rapids.cudf.Table.createCudfTableView(Native Method)
	at ai.rapids.cudf.Table.<init>(Table.java:92)
	at ai.rapids.cudf.Table$GroupByOperation.aggregate(Table.java:4099)
	at com.nvidia.spark.rapids.AggHelper.$anonfun$performGroupByAggregation$2(GpuAggregateExec.scala:394)
	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:30)
	at com.nvidia.spark.rapids.AggHelper.$anonfun$performGroupByAggregation$1(GpuAggregateExec.scala:381)
	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:30)
	at com.nvidia.spark.rapids.AggHelper.performGroupByAggregation(GpuAggregateExec.scala:380)
	at com.nvidia.spark.rapids.AggHelper.aggregate(GpuAggregateExec.scala:295)
	at com.nvidia.spark.rapids.AggHelper.$anonfun$aggregateWithoutCombine$4(GpuAggregateExec.scala:314)
	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:30)
	at com.nvidia.spark.rapids.AggHelper.$anonfun$aggregateWithoutCombine$3(GpuAggregateExec.scala:312)
	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:30)
	at com.nvidia.spark.rapids.AggHelper.$anonfun$aggregateWithoutCombine$2(GpuAggregateExec.scala:311)
	at com.nvidia.spark.rapids.RmmRapidsRetryIterator$AutoCloseableAttemptSpliterator.next(RmmRapidsRetryIterator.scala:477)
	at com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryIterator.next(RmmRapidsRetryIterator.scala:613)
	at com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryAutoCloseableIterator.next(RmmRapidsRetryIterator.scala:517)
```

Expected behavior
It should complete successfully, as it does when run on the CPU.

Environment details (please complete the following information)
Dataproc 2.1
Spark RAPIDS 24.08 snapshot

@viadea viadea added bug Something isn't working ? - Needs Triage Need team to review and classify labels Aug 20, 2024

revans2 commented Aug 20, 2024

I was able to reproduce this with a much smaller data set. It happens in the final aggregation.

c_customer_id is a STRING that does not look like a number at all.

```
+----------------------------+----------------+
|c_email_address             |c_customer_id   |
+----------------------------+----------------+
|[email protected]   |AAAAAAAAMELPMBAA|
|[email protected]|AAAAAAAALLINNBAA|
+----------------------------+----------------+
```

When it is converted to a double for approx_percentile, it ends up as null. We have tested this case before, but apparently something goes wrong in a group by with a small number of rows (more than one) in a MERGE_TDIGEST.

```scala
spark.conf.set("spark.sql.shuffle.partitions", "1")
Seq(("a", "a"), ("b", "b")).toDF("k", "v").createOrReplaceTempView("tmp")
spark.sql("""select k, approx_percentile(v, 0.95) col2 from tmp group by k;""").show()
```

This should be super simple to put into a C++ repro so we can debug it.
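
A minimal sketch of what such a C++ repro could look like, assuming the libcudf groupby API and the cudf test column wrappers (the delta of 1000 and the exact column contents are illustrative assumptions, not a verified reproduction):

```cpp
// Hypothetical minimal repro sketch for the MERGE_TDIGEST size-mismatch bug.
// It mimics the Spark plan: a partial TDIGEST aggregation over all-null
// doubles, followed by a final MERGE_TDIGEST across multiple keys.
#include <cudf/aggregation.hpp>
#include <cudf/column/column.hpp>
#include <cudf/groupby.hpp>
#include <cudf/table/table.hpp>
#include <cudf/table/table_view.hpp>
#include <cudf_test/column_wrapper.hpp>

#include <vector>

void repro_merge_tdigest_empty_clusters()
{
  // Two distinct keys; values are all null, like a STRING column that became
  // all-null after being cast to DOUBLE for approx_percentile.
  cudf::test::fixed_width_column_wrapper<int32_t> keys{{0, 1}};
  cudf::test::fixed_width_column_wrapper<double> vals({0.0, 0.0}, {false, false});

  // Partial aggregation: one (empty) tdigest per key.
  cudf::groupby::groupby gb1(cudf::table_view({keys}));
  std::vector<cudf::groupby::aggregation_request> reqs1(1);
  reqs1[0].values = vals;
  reqs1[0].aggregations.push_back(
      cudf::make_tdigest_aggregation<cudf::groupby_aggregation>(1000));
  auto [out_keys, partials] = gb1.aggregate(reqs1);

  // Final aggregation: merge the per-key tdigests. With every input tdigest
  // empty, the buggy merge collapses the value column to a single row while
  // the key column keeps two rows, so assembling the result table throws
  // "Column size mismatch".
  cudf::groupby::groupby gb2(out_keys->view());
  std::vector<cudf::groupby::aggregation_request> reqs2(1);
  reqs2[0].values = partials[0].results[0]->view();
  reqs2[0].aggregations.push_back(
      cudf::make_merge_tdigest_aggregation<cudf::groupby_aggregation>(1000));
  auto merged = gb2.aggregate(reqs2);  // expected to throw before the fix
}
```

The key point is that both aggregation stages run; only the final merge, where every input tdigest is empty, trips the size check.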

@mattahrens mattahrens removed the ? - Needs Triage Need team to review and classify label Aug 20, 2024
rapids-bot bot pushed a commit to rapidsai/cudf that referenced this issue Sep 10, 2024
This PR fixes an edge case bug in the tdigest merge. When there are multiple distinct keys but all values are empty clusters, the value column is currently merged into a single empty cluster, which leads to an error while creating the result table because of the mismatched number of rows in the key and value columns. This bug can be reproduced only when all values are empty clusters; if some values are empty but some are not, the current implementation returns a valid result. This bug was originally reported in NVIDIA/spark-rapids#11367.

The bug exists in `merge_tdigests()`, which assumes that there are no empty clusters in the merge stage even when there are (`has_nulls` is hard-coded to `false`). It is safer to assume that empty clusters can always occur, so this PR sets the flag to `true`. Also, `has_nulls` has been renamed to the more descriptive `may_have_empty_clusters`.

The tdigest reduce does not have the same issue as it does not call `merge_tdigests()`.

Authors:
  - Jihoon Son (https://github.com/jihoonson)
  - Muhammad Haseeb (https://github.com/mhaseeb123)

Approvers:
  - Muhammad Haseeb (https://github.com/mhaseeb123)
  - https://github.com/nvdbaranec

URL: #16675
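
For reference, the "Column size mismatch" in the stack trace is libcudf's table-level invariant that every column in a `table_view` has the same row count, which is exactly what the collapsed value column violates (two key rows vs. one merged value row). A tiny standalone toy that trips the same check, again a hedged sketch using the cudf test column wrappers:

```cpp
// Standalone illustration of the invariant itself: every column in a
// cudf::table_view must have the same row count. Hypothetical toy code,
// not taken from the actual fix.
#include <cudf/table/table_view.hpp>
#include <cudf/utilities/error.hpp>
#include <cudf_test/column_wrapper.hpp>

#include <iostream>

int main()
{
  cudf::test::fixed_width_column_wrapper<int32_t> keys{{0, 1}};  // 2 rows
  cudf::test::fixed_width_column_wrapper<double> vals{{0.0}};    // 1 row

  try {
    cudf::table_view t({keys, vals});  // column sizes differ
  } catch (cudf::logic_error const& e) {
    std::cout << e.what() << std::endl;  // "... Column size mismatch."
  }
  return 0;
}
```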

revans2 commented Sep 11, 2024

@jihoonson I believe that this is fixed now. Is there something else we are waiting on to close this?


jihoonson commented Sep 11, 2024

@revans2 A new test failure was found in the nightly run yesterday (#11463), which looks related to the change in rapidsai/cudf#16675. I'm looking into whether that change caused the failure. If so, we may want to revert rapidsai/cudf#16675 for now and then open another PR to fix it. I would like to keep this issue open until the triage is done.

@jihoonson

Decided to revert rapidsai/cudf#16675. Will keep this issue open until we merge a proper fix.

jihoonson added a commit to jihoonson/cudf that referenced this issue Sep 24, 2024

viadea commented Sep 25, 2024

Let me add the impact: it is NOT affecting any users. This was just my own finding while reproducing another user issue.

@jihoonson

We should be able to close this now, as rapidsai/cudf#16897 has been merged, though I would like to wait one day to let all the nightly tests run, just in case.

@jihoonson

Closing this issue as the nightly tests have passed.
