Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DOC] Add column order error docs [skip ci] #10101

Merged
merged 1 commit into from
Dec 27, 2023
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
29 changes: 29 additions & 0 deletions docs/dev/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -20,6 +20,7 @@ following topics:
* [The GPU Semaphore](#the-gpu-semaphore)
* [Debugging Tips](#debugging-tips)
* [Profiling Tips](#profiling-tips)
* [Hard To Debug Errors](#hard-to-debug-errors)

## Spark SQL and Query Plans

Expand Down Expand Up @@ -272,3 +273,31 @@ via environment variable `JACOCO_SPARK_VER` before generate coverage report, e.g
mvn clean verify
./build/coverage-report
```

## Hard To Debug Errors
Spark is not designed to do what we are doing. The following are issues that devs have
run into in the past that were really hard to debug, and could not easily be fixed by a
framework change.

### Never Change the Order of Output Columns
Spark follows a fairly traditional SQL optimization path. It starts with a logical plan.
Does optimizations on that logical plan. Then translates it to a physical plan and
does more optimizations there. But when Spark does a `collect` operation it looks at the
output of the logical plan at that point to know how to convert the data to the format
that is collected. If we ever change the order or type of a column so it is different
from the logical plan's output we can get data corruption if assertions are disabled or
AssertionErrors like the following if they are enabled.

```
java.lang.AssertionError: sizeInBytes (1) should be a multiple of 8
at org.apache.spark.sql.catalyst.expressions.UnsafeRow.pointTo(UnsafeRow.java:179)
at org.apache.spark.sql.catalyst.expressions.UnsafeRow.getStruct(UnsafeRow.java:453)
at org.apache.spark.sql.catalyst.expressions.UnsafeRow.getStruct(UnsafeRow.java:61)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection.createExternalRow_0_1$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection.apply(Unknown Source)
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$Deserializer.apply(ExpressionEncoder.scala:175)
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$Deserializer.apply(ExpressionEncoder.scala:166)
```

Changing the order can be tempting because late binding will make everything work,
except if it is the last stage in a plan that will be collected.