
[BUG] hash_aggregate_test.py::test_hash_multiple_mode_query_avg_distincts failed with DATAGEN_SEED=1705756525 #10234

Open
sameerz opened this issue Jan 21, 2024 · 8 comments
Labels
bug Something isn't working

Comments

@sameerz
Collaborator

sameerz commented Jan 21, 2024

Describe the bug

[2024-01-20T14:37:05.949Z] FAILED ../../src/main/python/hash_aggregate_test.py::test_hash_multiple_mode_query_avg_distincts[{'spark.rapids.sql.variableFloatAgg.enabled': 'true', 'spark.rapids.sql.castStringToFloat.enabled': 'true'}-[('a', RepeatSeq(Float)), ('b', Float), ('c', Long)]][DATAGEN_SEED=1705756525, INJECT_OOM, IGNORE_ORDER, INCOMPAT, APPROXIMATE_FLOAT, ALLOW_NON_GPU(HashAggregateExec,AggregateExpression,UnscaledValue,MakeDecimal,AttributeReference,Alias,Sum,Count,Max,Min,Average,Cast,StddevPop,StddevSamp,VariancePop,VarianceSamp,NormalizeNaNAndZero,GreaterThan,Literal,If,EqualTo,First,SortAggregateExec,Coalesce,IsNull,EqualNullSafe,PivotFirst,GetArrayItem,ShuffleExchangeExec,HashPartitioning)] - AssertionError: GPU and CPU float values are different [0, 'avg(DISTINCT a)']
[2024-01-20T14:37:05.949Z] FAILED ../../src/main/python/hash_aggregate_test.py::test_hash_multiple_mode_query_avg_distincts[{'spark.rapids.sql.variableFloatAgg.enabled': 'true', 'spark.rapids.sql.castStringToFloat.enabled': 'true', 'spark.rapids.sql.hashAgg.replaceMode': 'final'}-[('a', RepeatSeq(Float)), ('b', Float), ('c', Long)]][DATAGEN_SEED=1705756525, IGNORE_ORDER, INCOMPAT, APPROXIMATE_FLOAT, ALLOW_NON_GPU(HashAggregateExec,AggregateExpression,UnscaledValue,MakeDecimal,AttributeReference,Alias,Sum,Count,Max,Min,Average,Cast,StddevPop,StddevSamp,VariancePop,VarianceSamp,NormalizeNaNAndZero,GreaterThan,Literal,If,EqualTo,First,SortAggregateExec,Coalesce,IsNull,EqualNullSafe,PivotFirst,GetArrayItem,ShuffleExchangeExec,HashPartitioning)] - AssertionError: GPU and CPU float values are different [0, 'avg(DISTINCT a)']
[2024-01-20T14:37:05.949Z] FAILED ../../src/main/python/hash_aggregate_test.py::test_hash_multiple_mode_query_avg_distincts[{'spark.rapids.sql.variableFloatAgg.enabled': 'true', 'spark.rapids.sql.castStringToFloat.enabled': 'true', 'spark.rapids.sql.hashAgg.replaceMode': 'partial'}-[('a', RepeatSeq(Float)), ('b', Float), ('c', Long)]][DATAGEN_SEED=1705756525, INJECT_OOM, IGNORE_ORDER, INCOMPAT, APPROXIMATE_FLOAT, ALLOW_NON_GPU(HashAggregateExec,AggregateExpression,UnscaledValue,MakeDecimal,AttributeReference,Alias,Sum,Count,Max,Min,Average,Cast,StddevPop,StddevSamp,VariancePop,VarianceSamp,NormalizeNaNAndZero,GreaterThan,Literal,If,EqualTo,First,SortAggregateExec,Coalesce,IsNull,EqualNullSafe,PivotFirst,GetArrayItem,ShuffleExchangeExec,HashPartitioning)] - AssertionError: GPU and CPU float values are different [0, 'avg(DISTINCT a)']
Detailed output
[2024-01-20T14:37:05.944Z] _ test_hash_multiple_mode_query_avg_distincts[{'spark.rapids.sql.variableFloatAgg.enabled': 'true', 'spark.rapids.sql.castStringToFloat.enabled': 'true'}-[('a', RepeatSeq(Float)), ('b', Float), ('c', Long)]] _
[2024-01-20T14:37:05.944Z] [gw3] linux -- Python 3.9.18 /opt/conda/bin/python
[2024-01-20T14:37:05.944Z] 
[2024-01-20T14:37:05.944Z] data_gen = [('a', RepeatSeq(Float)), ('b', Float), ('c', Long)]
[2024-01-20T14:37:05.944Z] conf = {'spark.rapids.sql.castStringToFloat.enabled': 'true', 'spark.rapids.sql.variableFloatAgg.enabled': 'true'}
[2024-01-20T14:37:05.944Z] 
[2024-01-20T14:37:05.944Z]     @approximate_float
[2024-01-20T14:37:05.944Z]     @ignore_order
[2024-01-20T14:37:05.944Z]     @incompat
[2024-01-20T14:37:05.944Z]     @pytest.mark.parametrize('data_gen', _init_list, ids=idfn)
[2024-01-20T14:37:05.944Z]     @pytest.mark.parametrize('conf', get_params(_confs, params_markers_for_confs),
[2024-01-20T14:37:05.944Z]         ids=idfn)
[2024-01-20T14:37:05.944Z]     def test_hash_multiple_mode_query_avg_distincts(data_gen, conf):
[2024-01-20T14:37:05.944Z] >       assert_gpu_and_cpu_are_equal_collect(
[2024-01-20T14:37:05.944Z]             lambda spark: gen_df(spark, data_gen, length=100)
[2024-01-20T14:37:05.944Z]                 .selectExpr('avg(distinct a)', 'avg(distinct b)','avg(distinct c)'),
[2024-01-20T14:37:05.944Z]             conf=conf)
[2024-01-20T14:37:05.944Z] 
[2024-01-20T14:37:05.944Z] ../../src/main/python/hash_aggregate_test.py:1087: 
[2024-01-20T14:37:05.944Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
[2024-01-20T14:37:05.944Z] ../../src/main/python/asserts.py:595: in assert_gpu_and_cpu_are_equal_collect
[2024-01-20T14:37:05.944Z]     _assert_gpu_and_cpu_are_equal(func, 'COLLECT', conf=conf, is_cpu_first=is_cpu_first, result_canonicalize_func_before_compare=result_canonicalize_func_before_compare)
[2024-01-20T14:37:05.944Z] ../../src/main/python/asserts.py:517: in _assert_gpu_and_cpu_are_equal
[2024-01-20T14:37:05.944Z]     assert_equal(from_cpu, from_gpu)
[2024-01-20T14:37:05.945Z] ../../src/main/python/asserts.py:107: in assert_equal
[2024-01-20T14:37:05.945Z]     _assert_equal(cpu, gpu, float_check=get_float_check(), path=[])
[2024-01-20T14:37:05.945Z] ../../src/main/python/asserts.py:43: in _assert_equal
[2024-01-20T14:37:05.945Z]     _assert_equal(cpu[index], gpu[index], float_check, path + [index])
[2024-01-20T14:37:05.945Z] ../../src/main/python/asserts.py:36: in _assert_equal
[2024-01-20T14:37:05.945Z]     _assert_equal(cpu[field], gpu[field], float_check, path + [field])
[2024-01-20T14:37:05.945Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
[2024-01-20T14:37:05.945Z] 
[2024-01-20T14:37:05.945Z] cpu = -9.961353300130207e+25, gpu = -9.961254917499822e+25
[2024-01-20T14:37:05.945Z] float_check = <function get_float_check.<locals>.<lambda> at 0x7f7f804d4430>
[2024-01-20T14:37:05.945Z] path = [0, 'avg(DISTINCT a)']
[2024-01-20T14:37:05.945Z] 
[2024-01-20T14:37:05.945Z]     def _assert_equal(cpu, gpu, float_check, path):
[2024-01-20T14:37:05.945Z]         t = type(cpu)
[2024-01-20T14:37:05.945Z]         if (t is Row):
[2024-01-20T14:37:05.945Z]             assert len(cpu) == len(gpu), "CPU and GPU row have different lengths at {} CPU: {} GPU: {}".format(path, len(cpu), len(gpu))
[2024-01-20T14:37:05.945Z]             if hasattr(cpu, "__fields__") and hasattr(gpu, "__fields__"):
[2024-01-20T14:37:05.945Z]                 assert cpu.__fields__ == gpu.__fields__, "CPU and GPU row have different fields at {} CPU: {} GPU: {}".format(path, cpu.__fields__, gpu.__fields__)
[2024-01-20T14:37:05.945Z]                 for field in cpu.__fields__:
[2024-01-20T14:37:05.945Z]                     _assert_equal(cpu[field], gpu[field], float_check, path + [field])
[2024-01-20T14:37:05.945Z]             else:
[2024-01-20T14:37:05.945Z]                 for index in range(len(cpu)):
[2024-01-20T14:37:05.945Z]                     _assert_equal(cpu[index], gpu[index], float_check, path + [index])
[2024-01-20T14:37:05.945Z]         elif (t is list):
[2024-01-20T14:37:05.945Z]             assert len(cpu) == len(gpu), "CPU and GPU list have different lengths at {} CPU: {} GPU: {}".format(path, len(cpu), len(gpu))
[2024-01-20T14:37:05.945Z]             for index in range(len(cpu)):
[2024-01-20T14:37:05.945Z]                 _assert_equal(cpu[index], gpu[index], float_check, path + [index])
[2024-01-20T14:37:05.945Z]         elif (t is tuple):
[2024-01-20T14:37:05.945Z]             assert len(cpu) == len(gpu), "CPU and GPU list have different lengths at {} CPU: {} GPU: {}".format(path, len(cpu), len(gpu))
[2024-01-20T14:37:05.945Z]             for index in range(len(cpu)):
[2024-01-20T14:37:05.945Z]                 _assert_equal(cpu[index], gpu[index], float_check, path + [index])
[2024-01-20T14:37:05.945Z]         elif (t is pytypes.GeneratorType):
[2024-01-20T14:37:05.945Z]             index = 0
[2024-01-20T14:37:05.945Z]             # generator has no zip :( so we have to do this the hard way
[2024-01-20T14:37:05.945Z]             done = False
[2024-01-20T14:37:05.945Z]             while not done:
[2024-01-20T14:37:05.945Z]                 sub_cpu = None
[2024-01-20T14:37:05.945Z]                 sub_gpu = None
[2024-01-20T14:37:05.945Z]                 try:
[2024-01-20T14:37:05.945Z]                     sub_cpu = next(cpu)
[2024-01-20T14:37:05.945Z]                 except StopIteration:
[2024-01-20T14:37:05.945Z]                     done = True
[2024-01-20T14:37:05.945Z]     
[2024-01-20T14:37:05.945Z]                 try:
[2024-01-20T14:37:05.945Z]                     sub_gpu = next(gpu)
[2024-01-20T14:37:05.945Z]                 except StopIteration:
[2024-01-20T14:37:05.945Z]                     done = True
[2024-01-20T14:37:05.945Z]     
[2024-01-20T14:37:05.945Z]                 if done:
[2024-01-20T14:37:05.945Z]                     assert sub_cpu == sub_gpu and sub_cpu == None, "CPU and GPU generators have different lengths at {}".format(path)
[2024-01-20T14:37:05.945Z]                 else:
[2024-01-20T14:37:05.945Z]                     _assert_equal(sub_cpu, sub_gpu, float_check, path + [index])
[2024-01-20T14:37:05.945Z]     
[2024-01-20T14:37:05.945Z]                 index = index + 1
[2024-01-20T14:37:05.945Z]         elif (t is dict):
[2024-01-20T14:37:05.945Z]             # The order of key/values is not guaranteed in python dicts, nor are they guaranteed by Spark
[2024-01-20T14:37:05.945Z]             # so sort the items to do our best with ignoring the order of dicts
[2024-01-20T14:37:05.945Z]             cpu_items = list(cpu.items()).sort(key=_RowCmp)
[2024-01-20T14:37:05.945Z]             gpu_items = list(gpu.items()).sort(key=_RowCmp)
[2024-01-20T14:37:05.945Z]             _assert_equal(cpu_items, gpu_items, float_check, path + ["map"])
[2024-01-20T14:37:05.945Z]         elif (t is int):
[2024-01-20T14:37:05.945Z]             assert cpu == gpu, "GPU and CPU int values are different at {}".format(path)
[2024-01-20T14:37:05.945Z]         elif (t is float):
[2024-01-20T14:37:05.945Z]             if (math.isnan(cpu)):
[2024-01-20T14:37:05.945Z]                 assert math.isnan(gpu), "GPU and CPU float values are different at {}".format(path)
[2024-01-20T14:37:05.945Z]             else:
[2024-01-20T14:37:05.945Z] >               assert float_check(cpu, gpu), "GPU and CPU float values are different {}".format(path)
[2024-01-20T14:37:05.946Z] E               AssertionError: GPU and CPU float values are different [0, 'avg(DISTINCT a)']
[2024-01-20T14:37:05.946Z] 
[2024-01-20T14:37:05.946Z] ../../src/main/python/asserts.py:83: AssertionError
[2024-01-20T14:37:05.946Z] ----------------------------- Captured stdout call -----------------------------
[2024-01-20T14:37:05.946Z] ### CPU RUN ###
[2024-01-20T14:37:05.946Z] ### GPU RUN ###
[2024-01-20T14:37:05.946Z] ### COLLECT: GPU TOOK 0.17374825477600098 CPU TOOK 0.13654327392578125 ###
[2024-01-20T14:37:05.946Z] --- CPU OUTPUT
[2024-01-20T14:37:05.946Z] +++ GPU OUTPUT
[2024-01-20T14:37:05.946Z] @@ -1 +1 @@
[2024-01-20T14:37:05.946Z] -Row(avg(DISTINCT a)=-9.961353300130207e+25, avg(DISTINCT b)=nan, avg(DISTINCT c)=-6.749297777543448e+17)
[2024-01-20T14:37:05.946Z] +Row(avg(DISTINCT a)=-9.961254917499822e+25, avg(DISTINCT b)=nan, avg(DISTINCT c)=-6.74929777754345e+17)
[2024-01-20T14:37:05.946Z] _ test_hash_multiple_mode_query_avg_distincts[{'spark.rapids.sql.variableFloatAgg.enabled': 'true', 'spark.rapids.sql.castStringToFloat.enabled': 'true', 'spark.rapids.sql.hashAgg.replaceMode': 'final'}-[('a', RepeatSeq(Float)), ('b', Float), ('c', Long)]] _
[2024-01-20T14:37:05.946Z] [gw3] linux -- Python 3.9.18 /opt/conda/bin/python
[2024-01-20T14:37:05.946Z] 
[2024-01-20T14:37:05.946Z] data_gen = [('a', RepeatSeq(Float)), ('b', Float), ('c', Long)]
[2024-01-20T14:37:05.946Z] conf = {'spark.rapids.sql.castStringToFloat.enabled': 'true', 'spark.rapids.sql.hashAgg.replaceMode': 'final', 'spark.rapids.sql.variableFloatAgg.enabled': 'true'}
[2024-01-20T14:37:05.946Z] 
[2024-01-20T14:37:05.946Z]     @approximate_float
[2024-01-20T14:37:05.946Z]     @ignore_order
[2024-01-20T14:37:05.946Z]     @incompat
[2024-01-20T14:37:05.946Z]     @pytest.mark.parametrize('data_gen', _init_list, ids=idfn)
[2024-01-20T14:37:05.946Z]     @pytest.mark.parametrize('conf', get_params(_confs, params_markers_for_confs),
[2024-01-20T14:37:05.946Z]         ids=idfn)
[2024-01-20T14:37:05.946Z]     def test_hash_multiple_mode_query_avg_distincts(data_gen, conf):
[2024-01-20T14:37:05.946Z] >       assert_gpu_and_cpu_are_equal_collect(
[2024-01-20T14:37:05.946Z]             lambda spark: gen_df(spark, data_gen, length=100)
[2024-01-20T14:37:05.946Z]                 .selectExpr('avg(distinct a)', 'avg(distinct b)','avg(distinct c)'),
[2024-01-20T14:37:05.946Z]             conf=conf)
[2024-01-20T14:37:05.946Z] 
[2024-01-20T14:37:05.946Z] ../../src/main/python/hash_aggregate_test.py:1087: 
[2024-01-20T14:37:05.946Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
[2024-01-20T14:37:05.946Z] ../../src/main/python/asserts.py:595: in assert_gpu_and_cpu_are_equal_collect
[2024-01-20T14:37:05.946Z]     _assert_gpu_and_cpu_are_equal(func, 'COLLECT', conf=conf, is_cpu_first=is_cpu_first, result_canonicalize_func_before_compare=result_canonicalize_func_before_compare)
[2024-01-20T14:37:05.946Z] ../../src/main/python/asserts.py:517: in _assert_gpu_and_cpu_are_equal
[2024-01-20T14:37:05.946Z]     assert_equal(from_cpu, from_gpu)
[2024-01-20T14:37:05.946Z] ../../src/main/python/asserts.py:107: in assert_equal
[2024-01-20T14:37:05.946Z]     _assert_equal(cpu, gpu, float_check=get_float_check(), path=[])
[2024-01-20T14:37:05.946Z] ../../src/main/python/asserts.py:43: in _assert_equal
[2024-01-20T14:37:05.946Z]     _assert_equal(cpu[index], gpu[index], float_check, path + [index])
[2024-01-20T14:37:05.946Z] ../../src/main/python/asserts.py:36: in _assert_equal
[2024-01-20T14:37:05.946Z]     _assert_equal(cpu[field], gpu[field], float_check, path + [field])
[2024-01-20T14:37:05.946Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
[2024-01-20T14:37:05.946Z] 
[2024-01-20T14:37:05.946Z] cpu = -9.961254917487832e+25, gpu = -9.961353300130207e+25
[2024-01-20T14:37:05.946Z] float_check = <function get_float_check.<locals>.<lambda> at 0x7f808f0e1d30>
[2024-01-20T14:37:05.946Z] path = [0, 'avg(DISTINCT a)']
[2024-01-20T14:37:05.946Z] 
[2024-01-20T14:37:05.946Z]     def _assert_equal(cpu, gpu, float_check, path):
[2024-01-20T14:37:05.946Z]         t = type(cpu)
[2024-01-20T14:37:05.946Z]         if (t is Row):
[2024-01-20T14:37:05.946Z]             assert len(cpu) == len(gpu), "CPU and GPU row have different lengths at {} CPU: {} GPU: {}".format(path, len(cpu), len(gpu))
[2024-01-20T14:37:05.946Z]             if hasattr(cpu, "__fields__") and hasattr(gpu, "__fields__"):
[2024-01-20T14:37:05.946Z]                 assert cpu.__fields__ == gpu.__fields__, "CPU and GPU row have different fields at {} CPU: {} GPU: {}".format(path, cpu.__fields__, gpu.__fields__)
[2024-01-20T14:37:05.946Z]                 for field in cpu.__fields__:
[2024-01-20T14:37:05.946Z]                     _assert_equal(cpu[field], gpu[field], float_check, path + [field])
[2024-01-20T14:37:05.946Z]             else:
[2024-01-20T14:37:05.946Z]                 for index in range(len(cpu)):
[2024-01-20T14:37:05.946Z]                     _assert_equal(cpu[index], gpu[index], float_check, path + [index])
[2024-01-20T14:37:05.946Z]         elif (t is list):
[2024-01-20T14:37:05.946Z]             assert len(cpu) == len(gpu), "CPU and GPU list have different lengths at {} CPU: {} GPU: {}".format(path, len(cpu), len(gpu))
[2024-01-20T14:37:05.946Z]             for index in range(len(cpu)):
[2024-01-20T14:37:05.946Z]                 _assert_equal(cpu[index], gpu[index], float_check, path + [index])
[2024-01-20T14:37:05.946Z]         elif (t is tuple):
[2024-01-20T14:37:05.946Z]             assert len(cpu) == len(gpu), "CPU and GPU list have different lengths at {} CPU: {} GPU: {}".format(path, len(cpu), len(gpu))
[2024-01-20T14:37:05.946Z]             for index in range(len(cpu)):
[2024-01-20T14:37:05.946Z]                 _assert_equal(cpu[index], gpu[index], float_check, path + [index])
[2024-01-20T14:37:05.946Z]         elif (t is pytypes.GeneratorType):
[2024-01-20T14:37:05.946Z]             index = 0
[2024-01-20T14:37:05.946Z]             # generator has no zip :( so we have to do this the hard way
[2024-01-20T14:37:05.946Z]             done = False
[2024-01-20T14:37:05.946Z]             while not done:
[2024-01-20T14:37:05.946Z]                 sub_cpu = None
[2024-01-20T14:37:05.947Z]                 sub_gpu = None
[2024-01-20T14:37:05.947Z]                 try:
[2024-01-20T14:37:05.947Z]                     sub_cpu = next(cpu)
[2024-01-20T14:37:05.947Z]                 except StopIteration:
[2024-01-20T14:37:05.947Z]                     done = True
[2024-01-20T14:37:05.947Z]     
[2024-01-20T14:37:05.947Z]                 try:
[2024-01-20T14:37:05.947Z]                     sub_gpu = next(gpu)
[2024-01-20T14:37:05.947Z]                 except StopIteration:
[2024-01-20T14:37:05.947Z]                     done = True
[2024-01-20T14:37:05.947Z]     
[2024-01-20T14:37:05.947Z]                 if done:
[2024-01-20T14:37:05.947Z]                     assert sub_cpu == sub_gpu and sub_cpu == None, "CPU and GPU generators have different lengths at {}".format(path)
[2024-01-20T14:37:05.947Z]                 else:
[2024-01-20T14:37:05.947Z]                     _assert_equal(sub_cpu, sub_gpu, float_check, path + [index])
[2024-01-20T14:37:05.947Z]     
[2024-01-20T14:37:05.947Z]                 index = index + 1
[2024-01-20T14:37:05.947Z]         elif (t is dict):
[2024-01-20T14:37:05.947Z]             # The order of key/values is not guaranteed in python dicts, nor are they guaranteed by Spark
[2024-01-20T14:37:05.947Z]             # so sort the items to do our best with ignoring the order of dicts
[2024-01-20T14:37:05.947Z]             cpu_items = list(cpu.items()).sort(key=_RowCmp)
[2024-01-20T14:37:05.947Z]             gpu_items = list(gpu.items()).sort(key=_RowCmp)
[2024-01-20T14:37:05.947Z]             _assert_equal(cpu_items, gpu_items, float_check, path + ["map"])
[2024-01-20T14:37:05.947Z]         elif (t is int):
[2024-01-20T14:37:05.947Z]             assert cpu == gpu, "GPU and CPU int values are different at {}".format(path)
[2024-01-20T14:37:05.947Z]         elif (t is float):
[2024-01-20T14:37:05.947Z]             if (math.isnan(cpu)):
[2024-01-20T14:37:05.947Z]                 assert math.isnan(gpu), "GPU and CPU float values are different at {}".format(path)
[2024-01-20T14:37:05.947Z]             else:
[2024-01-20T14:37:05.947Z] >               assert float_check(cpu, gpu), "GPU and CPU float values are different {}".format(path)
[2024-01-20T14:37:05.947Z] E               AssertionError: GPU and CPU float values are different [0, 'avg(DISTINCT a)']
[2024-01-20T14:37:05.947Z] 
[2024-01-20T14:37:05.947Z] ../../src/main/python/asserts.py:83: AssertionError
[2024-01-20T14:37:05.947Z] ----------------------------- Captured stdout call -----------------------------
[2024-01-20T14:37:05.947Z] ### CPU RUN ###
[2024-01-20T14:37:05.947Z] ### GPU RUN ###
[2024-01-20T14:37:05.947Z] ### COLLECT: GPU TOOK 0.14001178741455078 CPU TOOK 0.11022210121154785 ###
[2024-01-20T14:37:05.947Z] --- CPU OUTPUT
[2024-01-20T14:37:05.947Z] +++ GPU OUTPUT
[2024-01-20T14:37:05.947Z] @@ -1 +1 @@
[2024-01-20T14:37:05.947Z] -Row(avg(DISTINCT a)=-9.961254917487832e+25, avg(DISTINCT b)=nan, avg(DISTINCT c)=-6.749297777543451e+17)
[2024-01-20T14:37:05.947Z] +Row(avg(DISTINCT a)=-9.961353300130207e+25, avg(DISTINCT b)=nan, avg(DISTINCT c)=-6.749297777543451e+17)
[2024-01-20T14:37:05.947Z] _ test_hash_multiple_mode_query_avg_distincts[{'spark.rapids.sql.variableFloatAgg.enabled': 'true', 'spark.rapids.sql.castStringToFloat.enabled': 'true', 'spark.rapids.sql.hashAgg.replaceMode': 'partial'}-[('a', RepeatSeq(Float)), ('b', Float), ('c', Long)]] _
[2024-01-20T14:37:05.947Z] [gw3] linux -- Python 3.9.18 /opt/conda/bin/python
[2024-01-20T14:37:05.947Z] 
[2024-01-20T14:37:05.947Z] data_gen = [('a', RepeatSeq(Float)), ('b', Float), ('c', Long)]
[2024-01-20T14:37:05.947Z] conf = {'spark.rapids.sql.castStringToFloat.enabled': 'true', 'spark.rapids.sql.hashAgg.replaceMode': 'partial', 'spark.rapids.sql.variableFloatAgg.enabled': 'true'}
[2024-01-20T14:37:05.947Z] 
[2024-01-20T14:37:05.947Z]     @approximate_float
[2024-01-20T14:37:05.947Z]     @ignore_order
[2024-01-20T14:37:05.947Z]     @incompat
[2024-01-20T14:37:05.947Z]     @pytest.mark.parametrize('data_gen', _init_list, ids=idfn)
[2024-01-20T14:37:05.947Z]     @pytest.mark.parametrize('conf', get_params(_confs, params_markers_for_confs),
[2024-01-20T14:37:05.947Z]         ids=idfn)
[2024-01-20T14:37:05.947Z]     def test_hash_multiple_mode_query_avg_distincts(data_gen, conf):
[2024-01-20T14:37:05.947Z] >       assert_gpu_and_cpu_are_equal_collect(
[2024-01-20T14:37:05.947Z]             lambda spark: gen_df(spark, data_gen, length=100)
[2024-01-20T14:37:05.947Z]                 .selectExpr('avg(distinct a)', 'avg(distinct b)','avg(distinct c)'),
[2024-01-20T14:37:05.947Z]             conf=conf)
[2024-01-20T14:37:05.947Z] 
[2024-01-20T14:37:05.947Z] ../../src/main/python/hash_aggregate_test.py:1087: 
[2024-01-20T14:37:05.947Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
[2024-01-20T14:37:05.947Z] ../../src/main/python/asserts.py:595: in assert_gpu_and_cpu_are_equal_collect
[2024-01-20T14:37:05.947Z]     _assert_gpu_and_cpu_are_equal(func, 'COLLECT', conf=conf, is_cpu_first=is_cpu_first, result_canonicalize_func_before_compare=result_canonicalize_func_before_compare)
[2024-01-20T14:37:05.947Z] ../../src/main/python/asserts.py:517: in _assert_gpu_and_cpu_are_equal
[2024-01-20T14:37:05.947Z]     assert_equal(from_cpu, from_gpu)
[2024-01-20T14:37:05.947Z] ../../src/main/python/asserts.py:107: in assert_equal
[2024-01-20T14:37:05.947Z]     _assert_equal(cpu, gpu, float_check=get_float_check(), path=[])
[2024-01-20T14:37:05.947Z] ../../src/main/python/asserts.py:43: in _assert_equal
[2024-01-20T14:37:05.947Z]     _assert_equal(cpu[index], gpu[index], float_check, path + [index])
[2024-01-20T14:37:05.947Z] ../../src/main/python/asserts.py:36: in _assert_equal
[2024-01-20T14:37:05.947Z]     _assert_equal(cpu[field], gpu[field], float_check, path + [field])
[2024-01-20T14:37:05.947Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
[2024-01-20T14:37:05.947Z] 
[2024-01-20T14:37:05.947Z] cpu = -9.961254917487832e+25, gpu = -9.961353300130207e+25
[2024-01-20T14:37:05.947Z] float_check = <function get_float_check.<locals>.<lambda> at 0x7f7f7bd170d0>
[2024-01-20T14:37:05.947Z] path = [0, 'avg(DISTINCT a)']
[2024-01-20T14:37:05.947Z] 
[2024-01-20T14:37:05.947Z]     def _assert_equal(cpu, gpu, float_check, path):
[2024-01-20T14:37:05.947Z]         t = type(cpu)
[2024-01-20T14:37:05.947Z]         if (t is Row):
[2024-01-20T14:37:05.947Z]             assert len(cpu) == len(gpu), "CPU and GPU row have different lengths at {} CPU: {} GPU: {}".format(path, len(cpu), len(gpu))
[2024-01-20T14:37:05.948Z]             if hasattr(cpu, "__fields__") and hasattr(gpu, "__fields__"):
[2024-01-20T14:37:05.948Z]                 assert cpu.__fields__ == gpu.__fields__, "CPU and GPU row have different fields at {} CPU: {} GPU: {}".format(path, cpu.__fields__, gpu.__fields__)
[2024-01-20T14:37:05.948Z]                 for field in cpu.__fields__:
[2024-01-20T14:37:05.948Z]                     _assert_equal(cpu[field], gpu[field], float_check, path + [field])
[2024-01-20T14:37:05.948Z]             else:
[2024-01-20T14:37:05.948Z]                 for index in range(len(cpu)):
[2024-01-20T14:37:05.948Z]                     _assert_equal(cpu[index], gpu[index], float_check, path + [index])
[2024-01-20T14:37:05.948Z]         elif (t is list):
[2024-01-20T14:37:05.948Z]             assert len(cpu) == len(gpu), "CPU and GPU list have different lengths at {} CPU: {} GPU: {}".format(path, len(cpu), len(gpu))
[2024-01-20T14:37:05.948Z]             for index in range(len(cpu)):
[2024-01-20T14:37:05.948Z]                 _assert_equal(cpu[index], gpu[index], float_check, path + [index])
[2024-01-20T14:37:05.948Z]         elif (t is tuple):
[2024-01-20T14:37:05.948Z]             assert len(cpu) == len(gpu), "CPU and GPU list have different lengths at {} CPU: {} GPU: {}".format(path, len(cpu), len(gpu))
[2024-01-20T14:37:05.948Z]             for index in range(len(cpu)):
[2024-01-20T14:37:05.948Z]                 _assert_equal(cpu[index], gpu[index], float_check, path + [index])
[2024-01-20T14:37:05.948Z]         elif (t is pytypes.GeneratorType):
[2024-01-20T14:37:05.948Z]             index = 0
[2024-01-20T14:37:05.948Z]             # generator has no zip :( so we have to do this the hard way
[2024-01-20T14:37:05.948Z]             done = False
[2024-01-20T14:37:05.948Z]             while not done:
[2024-01-20T14:37:05.948Z]                 sub_cpu = None
[2024-01-20T14:37:05.948Z]                 sub_gpu = None
[2024-01-20T14:37:05.948Z]                 try:
[2024-01-20T14:37:05.948Z]                     sub_cpu = next(cpu)
[2024-01-20T14:37:05.948Z]                 except StopIteration:
[2024-01-20T14:37:05.948Z]                     done = True
[2024-01-20T14:37:05.948Z]     
[2024-01-20T14:37:05.948Z]                 try:
[2024-01-20T14:37:05.948Z]                     sub_gpu = next(gpu)
[2024-01-20T14:37:05.948Z]                 except StopIteration:
[2024-01-20T14:37:05.948Z]                     done = True
[2024-01-20T14:37:05.948Z]     
[2024-01-20T14:37:05.948Z]                 if done:
[2024-01-20T14:37:05.948Z]                     assert sub_cpu == sub_gpu and sub_cpu == None, "CPU and GPU generators have different lengths at {}".format(path)
[2024-01-20T14:37:05.948Z]                 else:
[2024-01-20T14:37:05.948Z]                     _assert_equal(sub_cpu, sub_gpu, float_check, path + [index])
[2024-01-20T14:37:05.948Z]     
[2024-01-20T14:37:05.948Z]                 index = index + 1
[2024-01-20T14:37:05.948Z]         elif (t is dict):
[2024-01-20T14:37:05.948Z]             # The order of key/values is not guaranteed in python dicts, nor are they guaranteed by Spark
[2024-01-20T14:37:05.948Z]             # so sort the items to do our best with ignoring the order of dicts
[2024-01-20T14:37:05.948Z]             cpu_items = list(cpu.items()).sort(key=_RowCmp)
[2024-01-20T14:37:05.948Z]             gpu_items = list(gpu.items()).sort(key=_RowCmp)
[2024-01-20T14:37:05.948Z]             _assert_equal(cpu_items, gpu_items, float_check, path + ["map"])
[2024-01-20T14:37:05.948Z]         elif (t is int):
[2024-01-20T14:37:05.948Z]             assert cpu == gpu, "GPU and CPU int values are different at {}".format(path)
[2024-01-20T14:37:05.948Z]         elif (t is float):
[2024-01-20T14:37:05.948Z]             if (math.isnan(cpu)):
[2024-01-20T14:37:05.948Z]                 assert math.isnan(gpu), "GPU and CPU float values are different at {}".format(path)
[2024-01-20T14:37:05.948Z]             else:
[2024-01-20T14:37:05.948Z] >               assert float_check(cpu, gpu), "GPU and CPU float values are different {}".format(path)
[2024-01-20T14:37:05.948Z] E               AssertionError: GPU and CPU float values are different [0, 'avg(DISTINCT a)']
[2024-01-20T14:37:05.948Z] 
[2024-01-20T14:37:05.948Z] ../../src/main/python/asserts.py:83: AssertionError
[2024-01-20T14:37:05.948Z] ----------------------------- Captured stdout call -----------------------------
[2024-01-20T14:37:05.948Z] ### CPU RUN ###
[2024-01-20T14:37:05.948Z] ### GPU RUN ###
[2024-01-20T14:37:05.948Z] ### COLLECT: GPU TOOK 0.1446061134338379 CPU TOOK 0.08382821083068848 ###
[2024-01-20T14:37:05.948Z] --- CPU OUTPUT
[2024-01-20T14:37:05.948Z] +++ GPU OUTPUT
[2024-01-20T14:37:05.948Z] @@ -1 +1 @@
[2024-01-20T14:37:05.948Z] -Row(avg(DISTINCT a)=-9.961254917487832e+25, avg(DISTINCT b)=nan, avg(DISTINCT c)=-6.749297777543451e+17)
[2024-01-20T14:37:05.948Z] +Row(avg(DISTINCT a)=-9.961353300130207e+25, avg(DISTINCT b)=nan, avg(DISTINCT c)=-6.749297777543448e+17)

Steps/Code to reproduce bug

Expected behavior

Environment details (please complete the following information)

  • Environment location: Regular integration test environment
  • Spark configuration settings related to the issue

Additional context
Scala 2.13 test
DATAGEN_SEED=1705756525

@sameerz sameerz added bug Something isn't working ? - Needs Triage Need team to review and classify labels Jan 21, 2024
@revans2
Collaborator

revans2 commented Jan 22, 2024

I ran this locally and was not able to reproduce it.

I think it is the same problem as #9822 and #10026

because average really is SUM(x)/COUNT(x), and if SUM(x) can fail, then dividing it can also fail, though it is likely to fail by less. We should look at the best way to mitigate these issues around floats/doubles.
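To illustrate the point: double addition is not associative, so the order in which partial aggregates are combined can change SUM(x), and that difference carries through the division into the average. A minimal Python sketch, not taken from the plugin code:

```python
# Floating-point addition is not associative: the order in which
# partial sums are combined changes the result.
vals = [1e100, 1.0, -1e100]

left_to_right = (vals[0] + vals[1]) + vals[2]  # the 1.0 is absorbed into 1e100
reordered     = (vals[0] + vals[2]) + vals[1]  # the large terms cancel first

print(left_to_right)  # 0.0
print(reordered)      # 1.0

# The same divergence carries through to an average (SUM/COUNT):
print(left_to_right / len(vals), reordered / len(vals))
```

CPU and GPU aggregation plans combine partial sums in different orders, which is exactly why SUM and AVG over floats can disagree between the two.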

@sameerz sameerz removed the ? - Needs Triage Need team to review and classify label Jan 23, 2024
@jbrennan333
Contributor

jbrennan333 commented Mar 1, 2024

I'm not certain this is the same issue. I was unable to reproduce this with spark-3.3.3-scala-2.12, but it reproduces consistently for me with spark-3.3.3-scala-2.13 using DATAGEN_SEED=1705756525. I'm not sure how the Scala version could affect this. @revans2 did you test with 2.12 or 2.13?

@revans2
Collaborator

revans2 commented Mar 1, 2024

I think I only tried 2.12

@jbrennan333
Contributor

With debug logging, I have verified that the arithmetic results I am getting from the plugin with scala-2.12 and 2.13 are the same. I believe the problem here is that, for some reason, when I run with scala-2.13 the pytest.ini file is not being loaded, so our marks are not being honored. There are a lot of warnings about marks when I run with scala-2.13, for example:

../../../../integration_tests/src/main/python/marks.py:20
  /home/jimb/git/spark-rapids/integration_tests/src/main/python/marks.py:20: PytestUnknownMarkWarning: Unknown pytest.mark.approximate_float - is this a typo?  You can register custom marks to avoid this warning - for details, see https://docs.pytest.org/en/stable/mark.html
    approximate_float = pytest.mark.approximate_float

I suspect that is why this particular test is failing in 2.13 and not in 2.12. From the logs, I can see a difference here:

scala 2.12:
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-6.2.5, py-1.11.0, pluggy-1.0.0 -- /usr/bin/python3
cachedir: .pytest_cache
rootdir: /home/jimb/git/spark-rapids/integration_tests, configfile: pytest.ini
plugins: xdist-2.4.0, forked-1.3.0
collecting ... collected 26674 items / 26673 deselected / 1 selected
scala 2.13:
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-6.2.5, py-1.11.0, pluggy-1.0.0 -- /usr/bin/python3
cachedir: .pytest_cache
rootdir: /home/jimb/git/spark-rapids/integration_tests
plugins: xdist-2.4.0, forked-1.3.0
collecting ... collected 26674 items / 26673 deselected / 1 selected

I haven't quite nailed down why this is happening yet. The root-dirs are the same, so I'm not sure why it's not finding the pytest.ini file.
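For context, registering the marks in pytest.ini is what silences these warnings and lets pytest honor them; a minimal sketch of what such a file looks like (the actual pytest.ini in integration_tests/ has the project's full marker list, so the entries below are illustrative only):

```ini
; integration_tests/pytest.ini (sketch -- actual contents may differ)
[pytest]
markers =
    approximate_float: compare CPU and GPU floating point results with a tolerance
    ignore_order: ignore row ordering when comparing CPU and GPU results
    incompat: test exercises behavior marked incompatible with Spark
```

If pytest reports no "configfile: pytest.ini" in its session header, as in the scala-2.13 run above, these registrations are never seen.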

@jbrennan333
Contributor

Actually, I didn't notice this before, because I wasn't printing the test output data in the successful run. But it looks like in this case it is the CPU that is producing different results between 2.12 and 2.13:

2.12:

--- CPU OUTPUT
+++ GPU OUTPUT
-Row(avg(DISTINCT a)=-9.961353300130207e+25, avg(DISTINCT b)=nan, avg(DISTINCT c)=-6.749297777543448e+17)
+Row(avg(DISTINCT a)=-9.961353300130207e+25, avg(DISTINCT b)=nan, avg(DISTINCT c)=-6.74929777754345e+17)

2.13
--- CPU OUTPUT
+++ GPU OUTPUT
-Row(avg(DISTINCT a)=-9.961254917487832e+25, avg(DISTINCT b)=nan, avg(DISTINCT c)=-6.749297777543448e+17)
+Row(avg(DISTINCT a)=-9.961353300130207e+25, avg(DISTINCT b)=nan, avg(DISTINCT c)=-6.74929777754345e+17)

@jbrennan333
Contributor

hash-debug-input.json

Uploading the input data I captured from the integration test.
So far I haven't quite been able to reproduce this in spark shell.

@jbrennan333
Contributor

This is what I got running in spark shell. I did it with just avg(distinct a) and also with all of them.

spark.conf.set("spark.rapids.sql.enabled", false)
val df=spark.read.json("/opt/data/jimb/hash-debug-input.json")
val adf=df.select("a")
adf.createOrReplaceTempView("input")
val sql = "SELECT avg(distinct a) from input"
val avg = spark.sql(sql).collect
spark shell scala 2.12:
CPU: avg: Array[org.apache.spark.sql.Row] = Array([-9.961254917495146E25])           
GPU: avg: Array[org.apache.spark.sql.Row] = Array([-9.961254917498747E25]) 

CPU: avg: Array[org.apache.spark.sql.Row] = Array([-9.961353333333334E25,NaN,-6.7492977775434509E17])
GPU: avg: Array[org.apache.spark.sql.Row] = Array([-9.961353333333334E25,NaN,-6.7492977775434496E17])
          

spark shell 2.13
CPU: val avg: Array[org.apache.spark.sql.Row] = Array([-9.961254917495146E25])
GPU: val avg: Array[org.apache.spark.sql.Row] = Array([-9.961254917498747E25])

CPU: val avg: Array[org.apache.spark.sql.Row] = Array([-9.961254917487832E25,NaN,-6.749297777543447E17])
GPU: val avg: Array[org.apache.spark.sql.Row] = Array([-9.961254917487832E25,NaN,-6.7492977775434496E17])
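For what it's worth, the gap in the failing avg(DISTINCT a) pair is larger than a typical approximate-float tolerance. A quick check using the two values from the scala-2.13 log (the 1e-6 relative tolerance below is an assumption for illustration, not necessarily what the test harness uses):

```python
# avg(DISTINCT a) values from the failing scala-2.13 run above
cpu = -9.961254917487832e+25
gpu = -9.961353300130207e+25

# Relative difference between the two results
rel_diff = abs(cpu - gpu) / abs(cpu)
print(rel_diff)  # on the order of 1e-5

# An assumed relative tolerance of 1e-6 would reject this pair
assert rel_diff > 1e-6
```

So even a tolerant comparison fails here; the discrepancy is well above float noise for a 100-row input.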

@tgravescs
Collaborator

Saw this again in last night's integration tests: DATAGEN_SEED=1710866199

@sameerz sameerz added the ? - Needs Triage Need team to review and classify label Mar 20, 2024
@mattahrens mattahrens removed the ? - Needs Triage Need team to review and classify label Mar 26, 2024