
[BUG] test_part_write_round_trip failed #9837

Closed
jlowe opened this issue Nov 22, 2023 · 4 comments

Labels
bug Something isn't working

Comments

@jlowe
Member

jlowe commented Nov 22, 2023

test_part_write_round_trip failed with a FileAlreadyExistsException while writing partitioned ORC output.

[2023-11-22T17:18:37.499Z] FAILED ../../src/main/python/orc_write_test.py::test_part_write_round_trip[Double][DATAGEN_SEED=1700670692, INJECT_OOM, IGNORE_ORDER] - py4j.protocol.Py4JJavaError: An error occurred while calling o809298.orc.

Detailed test failure output:

=================================== FAILURES ===================================
______________________ test_part_write_round_trip[Double] ______________________

spark_tmp_path = '/tmp/pyspark_tests//rapids-it-dataproc-20-ubuntu18-343-m-master-21814-280027119/'
orc_gen = Double

    @ignore_order
    @pytest.mark.parametrize('orc_gen', orc_part_write_gens, ids=idfn)
    def test_part_write_round_trip(spark_tmp_path, orc_gen):
        gen_list = [('a', RepeatSeqGen(orc_gen, 10)),
                ('b', orc_gen)]
        data_path = spark_tmp_path + '/ORC_DATA'
>       assert_gpu_and_cpu_writes_are_equal_collect(
                lambda spark, path: gen_df(spark, gen_list).coalesce(1).write.partitionBy('a').orc(path),
                lambda spark, path: spark.read.orc(path),
                data_path,
                conf = {'spark.rapids.sql.format.orc.write.enabled': True})

../../src/main/python/orc_write_test.py:121:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../../src/main/python/asserts.py:272: in assert_gpu_and_cpu_writes_are_equal_collect
    _assert_gpu_and_cpu_writes_are_equal(write_func, read_func, base_path, 'COLLECT', conf=conf)
../../src/main/python/asserts.py:242: in _assert_gpu_and_cpu_writes_are_equal
    with_cpu_session(lambda spark : write_func(spark, cpu_path), conf=conf)
../../src/main/python/spark_session.py:106: in with_cpu_session
    return with_spark_session(func, conf=copy)
../../src/main/python/spark_session.py:90: in with_spark_session
    ret = func(_spark)
../../src/main/python/asserts.py:242: in <lambda>
    with_cpu_session(lambda spark : write_func(spark, cpu_path), conf=conf)
../../src/main/python/orc_write_test.py:122: in <lambda>
    lambda spark, path: gen_df(spark, gen_list).coalesce(1).write.partitionBy('a').orc(path),
/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py:1409: in orc
    self._jwrite.orc(path)
/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py:1304: in __call__
    return_value = get_return_value(
/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py:111: in deco
    return f(*a, **kw)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

answer = 'xro809299'
gateway_client = <py4j.java_gateway.GatewayClient object at 0x7f1e3cd04640>
target_id = 'o809298', name = 'orc'

    def get_return_value(answer, gateway_client, target_id=None, name=None):
        """Converts an answer received from the Java gateway into a Python object.

        For example, string representation of integers are converted to Python
        integer, string representation of objects are converted to JavaObject
        instances, etc.

        :param answer: the string returned by the Java gateway
        :param gateway_client: the gateway client used to communicate with the Java
            Gateway. Only necessary if the answer is a reference (e.g., object,
            list, map)
        :param target_id: the name of the object from which the answer comes from
            (e.g., *object1* in `object1.hello()`). Optional.
        :param name: the name of the member from which the answer comes from
            (e.g., *hello* in `object1.hello()`). Optional.
        """
        if is_error(answer)[0]:
            if len(answer) > 1:
                type = answer[1]
                value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
                if answer[1] == REFERENCE_TYPE:
>                   raise Py4JJavaError(
                        "An error occurred while calling {0}{1}{2}.\n".
                        format(target_id, ".", name), value)
E                   py4j.protocol.Py4JJavaError: An error occurred while calling o809298.orc.
E                   : org.apache.spark.SparkException: Job aborted.
E                   	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:252)
E                   	at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:188)
E                   	at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:108)
E                   	at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:106)
E                   	at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:131)
E                   	at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:180)
E                   	at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:218)
E                   	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
E                   	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:215)
E                   	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:176)
E                   	at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:133)
E                   	at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:132)
E                   	at org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:989)
E                   	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
E                   	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
E                   	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
E                   	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
E                   	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
E                   	at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:989)
E                   	at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:438)
E                   	at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:415)
E                   	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:293)
E                   	at org.apache.spark.sql.DataFrameWriter.orc(DataFrameWriter.scala:896)
E                   	at sun.reflect.GeneratedMethodAccessor114.invoke(Unknown Source)
E                   	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
E                   	at java.lang.reflect.Method.invoke(Method.java:498)
E                   	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
E                   	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
E                   	at py4j.Gateway.invoke(Gateway.java:282)
E                   	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
E                   	at py4j.commands.CallCommand.execute(CallCommand.java:79)
E                   	at py4j.GatewayConnection.run(GatewayConnection.java:238)
E                   	at java.lang.Thread.run(Thread.java:750)
E                   Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 7685.0 failed 1 times, most recent failure: Lost task 0.0 in stage 7685.0 (TID 28359) (rapids-it-dataproc-20-ubuntu18-343-w-0.c.rapids-spark.internal executor 1): org.apache.spark.SparkException: Task failed while writing rows.
E                   	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:317)
E                   	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$15(FileFormatWriter.scala:212)
E                   	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
E                   	at org.apache.spark.scheduler.Task.run(Task.scala:131)
E                   	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:505)
E                   	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
E                   	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:508)
E                   	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
E                   	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
E                   	at java.lang.Thread.run(Thread.java:750)
E                   Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: /tmp/pyspark_tests/rapids-it-dataproc-20-ubuntu18-343-m-master-21814-280027119/ORC_DATA/CPU/_temporary/0/_temporary/attempt_202311221716152168032235479688932_7685_m_000000_28359/a=NaN/part-00000-ec193bc4-60ba-48e6-89a3-f77adcd1654a.c000.snappy.orc for client 10.128.0.188 already exists
E                   	at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.startFile(FSDirWriteFileOp.java:388)
E                   	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2560)
E                   	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2457)
E                   	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:807)
E                   	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:478)
E                   	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
E                   	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:549)
E                   	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:518)
E                   	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1086)
E                   	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1029)
E                   	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:957)
E                   	at java.security.AccessController.doPrivileged(Native Method)
E                   	at javax.security.auth.Subject.doAs(Subject.java:422)
E                   	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
E                   	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2957)
E
E                   	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
E                   	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
E                   	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
E                   	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
E                   	at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:121)
E                   	at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:88)
E                   	at org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:281)
E                   	at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1230)
E                   	at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1209)
E                   	at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1147)
E                   	at org.apache.hadoop.hdfs.DistributedFileSystem$8.doCall(DistributedFileSystem.java:533)
E                   	at org.apache.hadoop.hdfs.DistributedFileSystem$8.doCall(DistributedFileSystem.java:530)
E                   	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
E                   	at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:544)
E                   	at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:471)
E                   	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1126)
E                   	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1106)
E                   	at org.apache.orc.impl.PhysicalFsWriter.<init>(PhysicalFsWriter.java:95)
E                   	at org.apache.orc.impl.WriterImpl.<init>(WriterImpl.java:187)
E                   	at org.apache.orc.OrcFile.createWriter(OrcFile.java:893)
E                   	at org.apache.spark.sql.execution.datasources.orc.OrcOutputWriter.<init>(OrcOutputWriter.scala:47)
E                   	at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anon$1.newInstance(OrcFileFormat.scala:120)
E                   	at org.apache.spark.sql.execution.datasources.DynamicPartitionDataWriter.newOutputWriter(FileFormatDataWriter.scala:241)
E                   	at org.apache.spark.sql.execution.datasources.DynamicPartitionDataWriter.write(FileFormatDataWriter.scala:262)
E                   	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$executeTask$1(FileFormatWriter.scala:299)
E                   	at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1473)
E                   	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:307)
E                   	... 9 more
E                   Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.fs.FileAlreadyExistsException): /tmp/pyspark_tests/rapids-it-dataproc-20-ubuntu18-343-m-master-21814-280027119/ORC_DATA/CPU/_temporary/0/_temporary/attempt_202311221716152168032235479688932_7685_m_000000_28359/a=NaN/part-00000-ec193bc4-60ba-48e6-89a3-f77adcd1654a.c000.snappy.orc for client 10.128.0.188 already exists
E                   	at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.startFile(FSDirWriteFileOp.java:388)
E                   	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2560)
E                   	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2457)
E                   	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:807)
E                   	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:478)
E                   	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
E                   	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:549)
E                   	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:518)
E                   	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1086)
E                   	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1029)
E                   	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:957)
E                   	at java.security.AccessController.doPrivileged(Native Method)
E                   	at javax.security.auth.Subject.doAs(Subject.java:422)
E                   	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
E                   	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2957)
E
E                   	at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1589)
E                   	at org.apache.hadoop.ipc.Client.call(Client.java:1535)
E                   	at org.apache.hadoop.ipc.Client.call(Client.java:1432)
E                   	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:231)
E                   	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
E                   	at com.sun.proxy.$Proxy18.create(Unknown Source)
E                   	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.create(ClientNamenodeProtocolTranslatorPB.java:372)
E                   	at sun.reflect.GeneratedMethodAccessor97.invoke(Unknown Source)
E                   	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
E                   	at java.lang.reflect.Method.invoke(Method.java:498)
E                   	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
E                   	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
E                   	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
E                   	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
E                   	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
E                   	at com.sun.proxy.$Proxy19.create(Unknown Source)
E                   	at org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:276)
E                   	... 29 more
E
E                   Driver stacktrace:
E                   	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2304)
E                   	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2253)
E                   	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2252)
E                   	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
E                   	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
E                   	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
E                   	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2252)
E                   	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1124)
E                   	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1124)
E                   	at scala.Option.foreach(Option.scala:407)
E                   	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1124)
E                   	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2491)
E                   	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2433)
E                   	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2422)
E                   	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
E                   	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:902)
E                   	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2204)
E                   	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:202)
E                   	... 32 more
E                   Caused by: org.apache.spark.SparkException: Task failed while writing rows.
E                   	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:317)
E                   	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$15(FileFormatWriter.scala:212)
E                   	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
E                   	at org.apache.spark.scheduler.Task.run(Task.scala:131)
E                   	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:505)
E                   	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
E                   	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:508)
E                   	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
E                   	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
E                   	... 1 more
E                   Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: /tmp/pyspark_tests/rapids-it-dataproc-20-ubuntu18-343-m-master-21814-280027119/ORC_DATA/CPU/_temporary/0/_temporary/attempt_202311221716152168032235479688932_7685_m_000000_28359/a=NaN/part-00000-ec193bc4-60ba-48e6-89a3-f77adcd1654a.c000.snappy.orc for client 10.128.0.188 already exists
E                   	at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.startFile(FSDirWriteFileOp.java:388)
E                   	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2560)
E                   	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2457)
E                   	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:807)
E                   	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:478)
E                   	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
E                   	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:549)
E                   	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:518)
E                   	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1086)
E                   	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1029)
E                   	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:957)
E                   	at java.security.AccessController.doPrivileged(Native Method)
E                   	at javax.security.auth.Subject.doAs(Subject.java:422)
E                   	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
E                   	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2957)
E
E                   	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
E                   	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
E                   	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
E                   	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
E                   	at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:121)
E                   	at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:88)
E                   	at org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:281)
E                   	at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1230)
E                   	at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1209)
E                   	at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1147)
E                   	at org.apache.hadoop.hdfs.DistributedFileSystem$8.doCall(DistributedFileSystem.java:533)
E                   	at org.apache.hadoop.hdfs.DistributedFileSystem$8.doCall(DistributedFileSystem.java:530)
E                   	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
E                   	at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:544)
E                   	at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:471)
E                   	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1126)
E                   	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1106)
E                   	at org.apache.orc.impl.PhysicalFsWriter.<init>(PhysicalFsWriter.java:95)
E                   	at org.apache.orc.impl.WriterImpl.<init>(WriterImpl.java:187)
E                   	at org.apache.orc.OrcFile.createWriter(OrcFile.java:893)
E                   	at org.apache.spark.sql.execution.datasources.orc.OrcOutputWriter.<init>(OrcOutputWriter.scala:47)
E                   	at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anon$1.newInstance(OrcFileFormat.scala:120)
E                   	at org.apache.spark.sql.execution.datasources.DynamicPartitionDataWriter.newOutputWriter(FileFormatDataWriter.scala:241)
E                   	at org.apache.spark.sql.execution.datasources.DynamicPartitionDataWriter.write(FileFormatDataWriter.scala:262)
E                   	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$executeTask$1(FileFormatWriter.scala:299)
E                   	at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1473)
E                   	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:307)
E                   	... 9 more
E                   Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.fs.FileAlreadyExistsException): /tmp/pyspark_tests/rapids-it-dataproc-20-ubuntu18-343-m-master-21814-280027119/ORC_DATA/CPU/_temporary/0/_temporary/attempt_202311221716152168032235479688932_7685_m_000000_28359/a=NaN/part-00000-ec193bc4-60ba-48e6-89a3-f77adcd1654a.c000.snappy.orc for client 10.128.0.188 already exists
E                   	at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.startFile(FSDirWriteFileOp.java:388)
E                   	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2560)
E                   	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2457)
E                   	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:807)
E                   	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:478)
E                   	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
E                   	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:549)
E                   	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:518)
E                   	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1086)
E                   	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1029)
E                   	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:957)
E                   	at java.security.AccessController.doPrivileged(Native Method)
E                   	at javax.security.auth.Subject.doAs(Subject.java:422)
E                   	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
E                   	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2957)
E
E                   	at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1589)
E                   	at org.apache.hadoop.ipc.Client.call(Client.java:1535)
E                   	at org.apache.hadoop.ipc.Client.call(Client.java:1432)
E                   	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:231)
E                   	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
E                   	at com.sun.proxy.$Proxy18.create(Unknown Source)
E                   	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.create(ClientNamenodeProtocolTranslatorPB.java:372)
E                   	at sun.reflect.GeneratedMethodAccessor97.invoke(Unknown Source)
E                   	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
E                   	at java.lang.reflect.Method.invoke(Method.java:498)
E                   	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
E                   	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
E                   	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
E                   	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
E                   	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
E                   	at com.sun.proxy.$Proxy19.create(Unknown Source)
E                   	at org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:276)
E                   	... 29 more

/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py:326: Py4JJavaError
----------------------------- Captured stdout call -----------------------------
### CPU RUN ###
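
For reference, here is a minimal standalone sketch of the write pattern the test exercises, assuming a local PySpark session in place of the integration-test harness, a hand-built DataFrame instead of the RepeatSeqGen-generated data, and a hypothetical output path. With NaN present in the double partition column, this is the shape of write that hit the collision above:

```python
# Minimal sketch of the failing write pattern (assumptions: local session,
# hand-built data, hypothetical /tmp output path).
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").getOrCreate()

# Partition column 'a' contains NaN, matching the a=NaN directory that
# appears in the FileAlreadyExistsException path above.
df = spark.createDataFrame(
    [(float("nan"), 1.0), (float("nan"), 2.0), (0.5, 3.0)],
    ["a", "b"],
)

# Same call chain as the test:
# gen_df(...).coalesce(1).write.partitionBy('a').orc(path)
df.coalesce(1).write.partitionBy("a").orc("/tmp/orc_nan_repro")
```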

jlowe added the bug and "? - Needs Triage" labels on Nov 22, 2023
mattahrens removed the "? - Needs Triage" label on Nov 28, 2023
@mattahrens
Collaborator

mattahrens commented Nov 28, 2023

Unlikely to be able to reproduce; we can reopen if we see it again.

mattahrens closed this as not planned on Nov 28, 2023
@pxLi
Collaborator

pxLi commented Dec 4, 2023

Failed in the AQE pipeline again (rapids_it-AQE-dev-github, ID 446):

[src.main.python.parquet_write_test.test_part_write_round_trip[Float][DATAGEN_SEED=1701464234, INJECT_OOM, IGNORE_ORDER]](https://prod.blsm.nvidia.com/sw-gpu-spark-jenkins/job/rapids_it-AQE-dev-github/446/testReport/junit/src.main.python/parquet_write_test/test_part_write_round_trip_Float__DATAGEN_SEED_1701464234__INJECT_OOM__IGNORE_ORDER_/)
[src.main.python.parquet_write_test.test_part_write_round_trip[Double][DATAGEN_SEED=1701464234, INJECT_OOM, IGNORE_ORDER]](https://prod.blsm.nvidia.com/sw-gpu-spark-jenkins/job/rapids_it-AQE-dev-github/446/testReport/junit/src.main.python/parquet_write_test/test_part_write_round_trip_Double__DATAGEN_SEED_1701464234__INJECT_OOM__IGNORE_ORDER_/)

Seems to be an intermittent one.

pxLi reopened this on Dec 4, 2023
@jlowe
Member Author

jlowe commented Dec 4, 2023

This is failing in the CPU session. It looks like another instance of NaN partition values being mishandled: because NaN != NaN, Spark thinks the partition value is changing from row to row and tries to create a new file, yet every NaN row maps to the same a=NaN directory as before.
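
To illustrate that failure mode, here is a simplified, hypothetical stand-in for an equality-based partition-change check (not Spark's actual DynamicPartitionDataWriter code): since NaN != NaN under IEEE 754 semantics, each NaN row looks like a "new" partition, triggering another file-create attempt in the same directory:

```python
# Hypothetical simplified partition-change check, NOT Spark's actual code;
# it just demonstrates the NaN != NaN failure mode in miniature.
nan = float("nan")
rows = [nan, nan, nan]  # three rows that all belong to partition a=NaN

current_key = object()  # sentinel meaning "no partition open yet"
for key in rows:
    if key != current_key:  # NaN != NaN is always True
        print(f"partition 'changed' -> create file in dir a={key}")
        current_key = key
# Prints three times: three create attempts against the single a=NaN
# directory, which is where a FileAlreadyExistsException can arise.
```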

@pxLi
Collaborator

pxLi commented Dec 5, 2023

Closed by #9950.
