
Commit

Add column order error docs
Signed-off-by: Robert (Bobby) Evans <[email protected]>
revans2 committed Dec 27, 2023
1 parent 6abe148 commit 143488d
Showing 1 changed file with 29 additions and 0 deletions.
docs/dev/README.md
@@ -20,6 +20,7 @@ following topics:
* [The GPU Semaphore](#the-gpu-semaphore)
* [Debugging Tips](#debugging-tips)
* [Profiling Tips](#profiling-tips)
* [Hard To Debug Errors](#hard-to-debug-errors)

## Spark SQL and Query Plans

@@ -272,3 +273,31 @@ via environment variable `JACOCO_SPARK_VER` before generating the coverage report, e.g.
mvn clean verify
./build/coverage-report
```

## Hard To Debug Errors

Spark is not designed to do what we are doing. The following are issues that devs have
run into in the past that were very hard to debug and could not easily be fixed by a
framework change.

### Never Change the Order of Output Columns

Spark follows a fairly traditional SQL optimization path. It starts with a logical plan,
performs optimizations on that logical plan, then translates it to a physical plan and
performs more optimizations there. But when Spark does a `collect` operation it looks at
the output of the logical plan at that point to know how to convert the data to the
format that is collected. If we ever change the order or type of a column so that it
differs from the logical plan's output, we can get data corruption if assertions are
disabled, or `AssertionError`s like the following if they are enabled.

```
java.lang.AssertionError: sizeInBytes (1) should be a multiple of 8
at org.apache.spark.sql.catalyst.expressions.UnsafeRow.pointTo(UnsafeRow.java:179)
at org.apache.spark.sql.catalyst.expressions.UnsafeRow.getStruct(UnsafeRow.java:453)
at org.apache.spark.sql.catalyst.expressions.UnsafeRow.getStruct(UnsafeRow.java:61)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection.createExternalRow_0_1$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection.apply(Unknown Source)
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$Deserializer.apply(ExpressionEncoder.scala:175)
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$Deserializer.apply(ExpressionEncoder.scala:166)
```

Changing the order can be tempting because late binding will make everything appear to
work, except when the changed operator is the last stage in a plan that will be collected.
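
To make the failure mode concrete, the sketch below is purely illustrative (the class name
`ReversedOutputExec` is hypothetical, not real plugin code, and `withNewChildInternal` is
only needed when building against Spark 3.2+). It shows a physical operator that reverses
its child's column order: late binding by `exprId` keeps every downstream operator happy,
but a final `collect` still deserializes rows using the logical plan's column order, so the
values land in the wrong fields.

```
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.{Attribute, UnsafeProjection}
import org.apache.spark.sql.execution.{SparkPlan, UnaryExecNode}

// Hypothetical operator that emits its child's columns in reverse order.
// Downstream operators bind attributes by exprId, so they keep working,
// but the driver deserializes collected rows based on the logical plan's
// output, which no longer matches the data.
case class ReversedOutputExec(child: SparkPlan) extends UnaryExecNode {
  override def output: Seq[Attribute] = child.output.reverse

  override protected def doExecute(): RDD[InternalRow] = {
    child.execute().mapPartitions { iter =>
      // Reorder the data so it matches the (reversed) declared output.
      val proj = UnsafeProjection.create(output, child.output)
      iter.map(proj)
    }
  }

  // Required when building against Spark 3.2+; drop for older shims.
  override protected def withNewChildInternal(newChild: SparkPlan): SparkPlan =
    copy(child = newChild)
}
```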
