[BUG] Possible identical column names with results from PPL queries #1023

Open · normanj-bitquill (Contributor) opened this issue Jan 27, 2025 · 0 comments
Labels: bug (Something isn't working), untriaged
What is the bug?
Certain PPL queries can produce results with multiple columns sharing the same name. When this happens, the results are harder to work with in Spark: `.show()` works properly, but attempting to save the results to a file fails.
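
For context, the same failure mode can be reproduced in plain Spark, without OpenSearch or Flint: duplicate output names are legal in a query result but are rejected when writing to a file. A minimal spark-shell sketch (the column name `field1` is illustrative):

```scala
// Two columns deliberately aliased to the same name; Spark allows this in a result.
val dup = spark.sql("SELECT 1 AS field1, 2 AS field1")

dup.show()                                        // works: both field1 columns print
dup.write.format("csv").save("/tmp/dup_demo.csv") // fails with COLUMN_ALREADY_EXISTS
```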

How can one reproduce the bug?
Steps to reproduce the behavior:

  1. Start up OpenSearch
  2. Configure Spark to use the Flint integration to access OpenSearch indices
  3. Create an index named `nested` in OpenSearch using this data and this mapping
  4. Run spark-shell
  5. Run the following query (replace `dev` with the name of your OpenSearchCatalog in Spark):

```scala
val x = spark.sql("source = dev.default.nested | fields int_col, struct_col.field1, struct_col2.field1 | head 10")
x.write.format("csv").save("/tmp/results.csv")
```

Running the above should produce this stack trace:

org.apache.spark.sql.AnalysisException: [COLUMN_ALREADY_EXISTS] The column `field1` already exists. Consider to choose another name or rename the existing column.
  at org.apache.spark.sql.errors.QueryCompilationErrors$.columnAlreadyExistsError(QueryCompilationErrors.scala:2450)
  at org.apache.spark.sql.util.SchemaUtils$.checkColumnNameDuplication(SchemaUtils.scala:114)
  at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:85)
  at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:113)
  at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:111)
  at org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:125)
  at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$executeCollect$1(AdaptiveSparkPlanExec.scala:390)
  at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.withFinalPlanUpdate(AdaptiveSparkPlanExec.scala:418)
  at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.executeCollect(AdaptiveSparkPlanExec.scala:390)
  at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:107)
  at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:125)
  at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:201)
  at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:108)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:900)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:66)
  at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:107)
  at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:98)
  at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:461)
  at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(origin.scala:76)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:461)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:32)
  at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
  at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:32)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:32)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:437)
  at org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:98)
  at org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:85)
  at org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.scala:83)
  at org.apache.spark.sql.execution.QueryExecution.assertCommandExecuted(QueryExecution.scala:142)
  at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:869)
  at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:391)
  at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:364)
  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:243)
  ... 47 elided
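
Until the Flint PPL translation disambiguates these names, one possible workaround on the Spark side is to uniquify the column names before writing. A rough sketch, assuming the DataFrame `x` from step 5 above (the `_<n>` suffix scheme is arbitrary, and this check is case-sensitive, unlike Spark's default duplicate check):

```scala
import scala.collection.mutable

// Suffix repeated column names with their occurrence count so the writer
// sees a unique header for every column.
val seen = mutable.Map.empty[String, Int]
val uniqueNames = x.columns.map { name =>
  val n = seen.getOrElse(name, 0)
  seen(name) = n + 1
  if (n == 0) name else s"${name}_$n"
}

x.toDF(uniqueNames: _*).write.format("csv").save("/tmp/results.csv")
```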

What is the expected behavior?
Result-set column names should be unique whenever possible. In this case, using field names like `struct_col.field1` and `struct_col2.field1` would solve the problem.
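
As an illustration of the expected naming, renaming the columns by position to their parent-qualified paths is enough for the write to succeed, since dots are legal inside Spark column names (they only need backtick quoting when referenced). A hypothetical sketch, assuming the three-column result from step 5:

```scala
// Rename positionally to parent-qualified names; all three are now unique.
val qualified = x.toDF("int_col", "struct_col.field1", "struct_col2.field1")
qualified.write.format("csv").save("/tmp/results.csv") // succeeds
```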

What is your host/environment?

  • Spark: 3.5.3
  • Flint integration from main branch

Do you have any screenshots?
N/A

Do you have any additional context?
No
