[BUG] test_hash_agg_with_nan_keys failed with a DATAGEN_SEED=1702335559 #10026
When I ran this test in a loop 100 times, I got it to fail just once:
Need to capture the contents of the b column.
I captured the input data so that I could use it to experiment.
I read in the data and re-ran the aggregation from the test.
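Roughly along these lines, in a pyspark shell with the RAPIDS plugin loaded (the parquet path and the simplified sum-only aggregation are assumptions; toggling spark.rapids.sql.enabled is used to switch between GPU and CPU):

```python
# Sketch of the repro. The path and the sum-only aggregation are assumptions;
# the real test does more than just sum(b).
from pyspark.sql import functions as f

df = spark.read.parquet("/tmp/issue10026_input.parquet")  # hypothetical path

def sum_b_for_key(gpu_enabled):
    # Toggle the plugin so the identical query runs on GPU or falls back to CPU.
    spark.conf.set("spark.rapids.sql.enabled", str(gpu_enabled).lower())
    return (df.groupBy("a")
              .agg(f.sum("b").alias("sum_b"))
              .where(f.col("a") == -3.1261173254359074e-295)  # the problematic key
              .collect())

print("CPU:", sum_b_for_key(False))
print("GPU:", sum_b_for_key(True))
```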
I ran this on CPU and GPU and compared the problematic row (with a=-3.1261173254359074e-295). The GPU results have some variability, while the CPU results are consistent.
This produces consistent results.
Consistent, but different. Let's try sorting by b only:
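A sketch of that experiment; coalescing to one partition and sorting within it is an assumption about how the deterministic input order was forced (df, f, and spark as in the snippet above):

```python
# Force a single partition with a sorted row order before aggregating,
# so the sum consumes the b values in a deterministic order.
sorted_by_b = df.coalesce(1).sortWithinPartitions("b")

for gpu_enabled in (False, True):
    spark.conf.set("spark.rapids.sql.enabled", str(gpu_enabled).lower())
    rows = (sorted_by_b.groupBy("a")
            .agg(f.sum("b").alias("sum_b"))
            .where(f.col("a") == -3.1261173254359074e-295)
            .collect())
    print("GPU" if gpu_enabled else "CPU", rows)
```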
Sorting by column b gives us consistent results either way.
So my theory is that the non-deterministic ordering from the hash aggregate is what causes the results to vary on the GPU, and every once in a while it is enough of a difference to be outside the float approximate comparison.
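A toy illustration of why order matters (made-up values, not the test data): double-precision addition is not associative, so the same set of values summed in a different order can give a different result.

```python
# Plain Python: the same values added in a different order give different
# double-precision sums.
vals = [1.0e25, -1.0e25, 1.0, 1.0e9, -1.0e9, 3.14]

forward = 0.0
for v in vals:
    forward += v          # left-to-right

backward = 0.0
for v in reversed(vals):
    backward += v         # right-to-left

print(forward, backward)  # roughly 4.14 vs 0.0 -- same inputs, different sums
```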
@revans2 any thoughts on this? Should we adjust the
@jbrennan333 that is my theory too. What I wanted to do was to take the input data for this particular key and see if we can figure out what the order would need to be to make this problem happen. I don't know how many values there are, but hopefully it is not so many that the number of possible orderings makes it impossible to check. If we can prove that is what happened, then we really only have a few choices.
Option 1 feels horrible in general and we should not do it. 3 is great, but I don't know how to do that without slowing down the GPU SUM operation massively, and even then it would get the same answer as Spark if and only if the inputs are in the same order, which is never guaranteed by Spark. So then we are left with 2, which means we should try to understand how big of a difference is possible for different situations, document them, and ideally update the tests so that we reduce the impact of making
One thing I noticed is that this particular test is run with two datagens.
The first uses a datagen that is also used in one other test. There is definitely an issue with differences when summing floating point columns, but I think that is effectively the same issue as reported in #9822. I will add some additional findings in a separate comment.
I have been running some tests on the set of floating point values that I captured from the test input when run with the given seed value. There are only 21 values.
I uploaded a parquet file with these values. I made a sorted and a reverse-sorted version of it, and then ran sums on them on CPU and GPU. These are the results:
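The sums were run along these lines (a sketch; the file name, the single-partition sort, and the plugin toggle are assumptions):

```python
# Sum the 21 captured b values unsorted, ascending, and descending,
# on CPU and on GPU.
from pyspark.sql import functions as f

vals_df = spark.read.parquet("/tmp/issue10026_b_values.parquet")  # hypothetical path

def total(d, gpu_enabled):
    spark.conf.set("spark.rapids.sql.enabled", str(gpu_enabled).lower())
    return d.agg(f.sum("b").alias("total")).collect()[0]["total"]

asc  = vals_df.coalesce(1).sortWithinPartitions("b")
desc = vals_df.coalesce(1).sortWithinPartitions(f.col("b").desc())

for name, d in [("unsorted", vals_df), ("ascending", asc), ("descending", desc)]:
    print(name, "cpu:", total(d, False), "gpu:", total(d, True))
```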
I then did this with accumulations (running sums):
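A sketch of one way to produce those accumulations; the explicit index column and the window spec are assumptions about how the running sums were computed:

```python
# Running (cumulative) sums over the 21 values, so the CPU and GPU partial
# sums can be compared row by row.
from pyspark.sql import functions as f
from pyspark.sql.window import Window

indexed = (spark.read.parquet("/tmp/issue10026_b_values.parquet")
           .coalesce(1)
           .withColumn("idx", f.monotonically_increasing_id()))

w = Window.orderBy("idx").rowsBetween(Window.unboundedPreceding, Window.currentRow)

def accumulations(gpu_enabled):
    spark.conf.set("spark.rapids.sql.enabled", str(gpu_enabled).lower())
    return (indexed.withColumn("acc_b", f.sum("b").over(w))
            .orderBy("idx").select("b", "acc_b").collect())

for cpu_row, gpu_row in zip(accumulations(False), accumulations(True)):
    print(cpu_row["b"], cpu_row["acc_b"], gpu_row["acc_b"])
```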
This sequence from the unsorted accumulations is interesting:
I don't understand why the GPU gets a different value here. I tried just summing the column
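Not something from the comment above, but a useful order-independent cross-check for those 21 values is math.fsum, which gives the correctly rounded double sum regardless of input order:

```python
# Order-independent reference sum for the captured b values. math.fsum tracks
# exact partial sums, unlike the naive running addition of a plain sum().
import math

b_vals = [row["b"] for row in
          spark.read.parquet("/tmp/issue10026_b_values.parquet").collect()
          if row["b"] is not None]

print("fsum reference:", math.fsum(b_vals))
print("naive sum     :", sum(b_vals))
print("reversed sum  :", sum(reversed(b_vals)))
```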
Describe the bug
It was running against Spark 3.2.4.
I tried to reproduce this locally and I could not. The only explanation I have is that the GPU did the sum in a non-deterministic order and we somehow ended up with a difference that was larger than the approximate float comparison could handle. The key (a) on line 17 was -3.1261173254359074e-295 and the values were 1.7718319043726909e+25 vs 1.7756097975589866e+25. Not sure what more to do except dig in a little deeper to see what might have caused the result by looking at the values used by the SUM.
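For scale, the relative difference between those two sums is (1.7756097975589866e+25 - 1.7718319043726909e+25) / 1.7718319043726909e+25 ≈ 2.1e-3, or about 0.2%. Assuming the approximate float comparison uses a relative tolerance on the order of pytest.approx's default of 1e-6 (an assumption about the test harness), that gap is several orders of magnitude too large.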