From 8633ae7bc0cbbfbd16c91eb8685c7b47f975320f Mon Sep 17 00:00:00 2001
From: Jason Lowe
Date: Mon, 11 Dec 2023 13:38:52 -0600
Subject: [PATCH] Add documentation for how to run tests with a fixed datagen
 seed

Signed-off-by: Jason Lowe
---
 integration_tests/README.md | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/integration_tests/README.md b/integration_tests/README.md
index af203f44ad9..f45564c6130 100644
--- a/integration_tests/README.md
+++ b/integration_tests/README.md
@@ -330,6 +330,19 @@ Basically, you need first to upload the test resources onto the cloud path `reso
 `root-dir` of each executor(e.g. via `spark-submit --files root-dir ...`). After that you must set
 both `LOCAL_ROOTDIR=root-dir` and `INPUT_PATH=resource-path` to run the shell-script, e.g.
 `LOCAL_ROOTDIR=root-dir INPUT_PATH=resource-path bash [run_pyspark_from_build.sh](run_pyspark_from_build.sh)`.
+
+### Running with a fixed data generation seed
+
+By default the tests are run with a different random data generator seed to increase the chance of
+uncovering bugs due to specific inputs. The seed used for a test is printed as part of the test
+name; see the `DATAGEN_SEED=` part of the test name printed as tests are run. If a problem is found
+with a specific data generation seed, the seed can be set explicitly when running the tests by
+exporting the `DATAGEN_SEED` environment variable to the desired seed before running the
+integration tests. For example:
+
+```shell
+$ DATAGEN_SEED=1702166057 SPARK_HOME=~/spark-3.4.0-bin-hadoop3 integration_tests/run_pyspark_from_build.sh
+```
 
 ### Reviewing integration tests in Spark History Server
 If the integration tests are run using [run_pyspark_from_build.sh](run_pyspark_from_build.sh) we have
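
As a supplementary sketch (not part of the patch itself): the added prose describes exporting `DATAGEN_SEED` before running the integration tests, which can also be written as separate `export` statements rather than inline assignments. The seed value and `SPARK_HOME` path below simply reuse the example from the patch.

```shell
# Sketch of the export form described in the added section; the seed and
# SPARK_HOME path reuse the example values shown in the patch above.
export DATAGEN_SEED=1702166057
export SPARK_HOME=~/spark-3.4.0-bin-hadoop3
integration_tests/run_pyspark_from_build.sh
```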