
Commit

Add column order error docs
Signed-off-by: Robert (Bobby) Evans <[email protected]>
revans2 committed Dec 27, 2023
1 parent 6abe148 commit 143488d
Showing 1 changed file with 29 additions and 0 deletions.
docs/dev/README.md
@@ -20,6 +20,7 @@ following topics:
* [The GPU Semaphore](#the-gpu-semaphore)
* [Debugging Tips](#debugging-tips)
* [Profiling Tips](#profiling-tips)
* [Hard To Debug Errors](#hard-to-debug-errors)

## Spark SQL and Query Plans

@@ -272,3 +273,31 @@ via environment variable `JACOCO_SPARK_VER` before generating the coverage report, e.g.
mvn clean verify
./build/coverage-report
```

## Hard To Debug Errors

Spark is not designed to do what we are doing. The following are issues that devs have
run into in the past that were very hard to debug and could not easily be fixed by a
framework change.

### Never Change the Order of Output Columns

Spark follows a fairly traditional SQL optimization path. It starts with a logical plan,
performs optimizations on that logical plan, then translates it to a physical plan and
performs more optimizations there. But when Spark does a `collect` operation it looks at
the output of the logical plan at that point to know how to convert the data to the
format that is collected. If we ever change the order or type of a column so that it
differs from the logical plan's output, we can get data corruption if assertions are
disabled, or `AssertionError`s like the following if they are enabled.

```
java.lang.AssertionError: sizeInBytes (1) should be a multiple of 8
at org.apache.spark.sql.catalyst.expressions.UnsafeRow.pointTo(UnsafeRow.java:179)
at org.apache.spark.sql.catalyst.expressions.UnsafeRow.getStruct(UnsafeRow.java:453)
at org.apache.spark.sql.catalyst.expressions.UnsafeRow.getStruct(UnsafeRow.java:61)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection.createExternalRow_0_1$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection.apply(Unknown Source)
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$Deserializer.apply(ExpressionEncoder.scala:175)
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$Deserializer.apply(ExpressionEncoder.scala:166)
```

Changing the order can be tempting because late binding will make everything appear to
work, except when the changed operator is the last stage in a plan that will be collected.
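
To make the failure mode concrete, the sketch below is purely illustrative (the class name
`ReversedOutputExec` is hypothetical, not real plugin code, and `withNewChildInternal` is
only needed when building against Spark 3.2+). It shows a physical operator that reverses
its child's column order: late binding by `exprId` keeps every downstream operator happy,
but a final `collect` still deserializes rows using the logical plan's column order, so the
values land in the wrong fields.

```
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.{Attribute, UnsafeProjection}
import org.apache.spark.sql.execution.{SparkPlan, UnaryExecNode}

// Hypothetical operator that emits its child's columns in reverse order.
// Downstream operators bind attributes by exprId, so they keep working,
// but the driver deserializes collected rows based on the logical plan's
// output, which no longer matches the data.
case class ReversedOutputExec(child: SparkPlan) extends UnaryExecNode {
  override def output: Seq[Attribute] = child.output.reverse

  override protected def doExecute(): RDD[InternalRow] = {
    child.execute().mapPartitions { iter =>
      // Reorder the data so it matches the (reversed) declared output.
      val proj = UnsafeProjection.create(output, child.output)
      iter.map(proj)
    }
  }

  // Required when building against Spark 3.2+; drop for older shims.
  override protected def withNewChildInternal(newChild: SparkPlan): SparkPlan =
    copy(child = newChild)
}
```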
