From 8633ae7bc0cbbfbd16c91eb8685c7b47f975320f Mon Sep 17 00:00:00 2001
From: Jason Lowe
Date: Mon, 11 Dec 2023 13:38:52 -0600
Subject: [PATCH] Add documentation for how to run tests with a fixed datagen
 seed

Signed-off-by: Jason Lowe
---
 integration_tests/README.md | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/integration_tests/README.md b/integration_tests/README.md
index af203f44ad9..f45564c6130 100644
--- a/integration_tests/README.md
+++ b/integration_tests/README.md
@@ -330,6 +330,19 @@ Basically, you need first to upload the test resources onto the cloud path `reso
 `root-dir` of each executor(e.g. via `spark-submit --files root-dir ...`). After that you must set
 both `LOCAL_ROOTDIR=root-dir` and `INPUT_PATH=resource-path` to run the shell-script, e.g.
 `LOCAL_ROOTDIR=root-dir INPUT_PATH=resource-path bash [run_pyspark_from_build.sh](run_pyspark_from_build.sh)`.
+
+### Running with a fixed data generation seed
+
+By default the tests are run with a different random data generator seed to increase the chance of
+uncovering bugs due to specific inputs. The seed used for a test is printed as part of the test
+name; see the `DATAGEN_SEED=` part of the test name printed as tests are run. If a problem is found
+with a specific data generation seed, the seed can be set explicitly when running the tests by
+exporting the `DATAGEN_SEED` environment variable to the desired seed before running the
+integration tests. For example:
+
+```shell
+$ DATAGEN_SEED=1702166057 SPARK_HOME=~/spark-3.4.0-bin-hadoop3 integration_tests/run_pyspark_from_build.sh
+```
 
 ### Reviewing integration tests in Spark History Server
 If the integration tests are run using [run_pyspark_from_build.sh](run_pyspark_from_build.sh) we have
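
As a supplementary sketch (not part of the patch itself): the added prose describes exporting `DATAGEN_SEED` before running the integration tests, which can also be written as separate `export` statements rather than inline assignments. The seed value and `SPARK_HOME` path below simply reuse the example from the patch.

```shell
# Sketch of the export form described in the added section; the seed and
# SPARK_HOME path reuse the example values shown in the patch above.
export DATAGEN_SEED=1702166057
export SPARK_HOME=~/spark-3.4.0-bin-hadoop3
integration_tests/run_pyspark_from_build.sh
```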