Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-51030][CORE][TESTS] Add a check before Utils.deleteRecursively(tempDir) to ensure tempDir won't be cleaned up by the ShutdownHook in afterEach of LocalRootDirsTest #49723

Open
wants to merge 10 commits into
base: master
Choose a base branch
from

Conversation

LuciferYang
Copy link
Contributor

@LuciferYang LuciferYang commented Jan 29, 2025

What changes were proposed in this pull request?

This pr add a check before Utils.deleteRecursively(tempDir) to ensure tempDir won't be cleaned up by the ShutdownHook in afterEach of LocalRootDirsTest to minimize the risk of test failures caused by race conditions in multi-threaded deletion of the tempdir.

Why are the changes needed?

In the daily tests on macOS, you may see test failures similar to the following:

SslExternalShuffleServiceSuite:
...
- SPARK-38640: memory only blocks can unpersist using shuffle service cache fetching
SUITE ABORTED - SslExternalShuffleServiceSuite: Failed to list files for dir: /Users/runner/work/spark/spark/core/target/tmp/local-6e9944e3-27c0-44bb-a535-8a7f09c1580b/55fe65f6-2d1c-4ced-af5f-abb64b135345/spark-3985444c-54a0-462c-b1d3-6df45ad55143/executor-8c55fc55-b74f-4793-bf3f-3d11a889fd8c/blockmgr-1a3455e7-bd31-4ef8-a9c6-910fde151bb3/11
  java.io.IOException: Failed to list files for dir: /Users/runner/work/spark/spark/core/target/tmp/local-6e9944e3-27c0-44bb-a535-8a7f09c1580b/55fe65f6-2d1c-4ced-af5f-abb64b135345/spark-3985444c-54a0-462c-b1d3-6df45ad55143/executor-8c55fc55-b74f-4793-bf3f-3d11a889fd8c/blockmgr-1a3455e7-bd31-4ef8-a9c6-910fde151bb3/11
  at org.apache.spark.network.util.JavaUtils.listFilesSafely(JavaUtils.java:201)
  at org.apache.spark.network.util.JavaUtils.deleteRecursivelyUsingJavaIO(JavaUtils.java:135)
  at org.apache.spark.network.util.JavaUtils.deleteRecursively(JavaUtils.java:123)
  at org.apache.spark.network.util.JavaUtils.deleteRecursivelyUsingJavaIO(JavaUtils.java:137)
  at org.apache.spark.network.util.JavaUtils.deleteRecursively(JavaUtils.java:123)
  at org.apache.spark.network.util.JavaUtils.deleteRecursivelyUsingJavaIO(JavaUtils.java:137)
  at org.apache.spark.network.util.JavaUtils.deleteRecursively(JavaUtils.java:123)
  at org.apache.spark.network.util.JavaUtils.deleteRecursivelyUsingJavaIO(JavaUtils.java:137)
  at org.apache.spark.network.util.JavaUtils.deleteRecursively(JavaUtils.java:123)
  at org.apache.spark.network.util.JavaUtils.deleteRecursivelyUsingJavaIO(JavaUtils.java:137)
  ...

This appears to be an issue caused by multi-threaded deletion of the tempdir.

tempDir = Utils.createTempDir(namePrefix = "local")

Refer to the code above, in LocalRootDirsTest#beforeEach method, tempDir = Utils.createTempDir(namePrefix = "local") invokes ShutdownHookManager.registerShutdownDeleteDir to register tempDir for deletion via a shutdown hook. Therefore, if tempDir remains unchanged during the test case, manual cleanup is unnecessary in LocalRootDirsTest#afterEach method.

However, given that tempDir is defined as protected var tempDir, it is possible for it to be altered within the test case. Hence, additional checks are implemented before manually cleaning up tempDir:

  1. tempDir has not been registered for cleanup via a shutdown hook.
  2. tempDir is not a subpath of any path that has been registered for cleanup via a shutdown hook.

This approach helps to minimize the risk of a race condition where multiple threads attempt to clean up the tempDir concurrently.

Does this PR introduce any user-facing change?

No

How was this patch tested?

  • Pass GitHub Actions

Was this patch authored or co-authored using generative AI tooling?

No

@LuciferYang LuciferYang marked this pull request as draft January 29, 2025 06:40
@@ -1291,3 +1291,16 @@ jobs:
cd ui-test
npm install --save-dev
node --experimental-vm-modules node_modules/.bin/jest
maven-test:
name: "Run Maven Test on macos"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

@LuciferYang LuciferYang changed the title Add a hasRootAsShutdownDeleteDir check for tempDir in the afterEach of LocalRootDirsTest. Add a hasShutdownDeleteDir check for tempDir in the afterEach of LocalRootDirsTest. Jan 29, 2025
@LuciferYang LuciferYang changed the title Add a hasShutdownDeleteDir check for tempDir in the afterEach of LocalRootDirsTest. Add a check for unregistered ShutdownHook for tempDir before manual cleanup in the afterEach method of LocalRootDirsTest Jan 29, 2025
@@ -46,7 +46,10 @@ trait LocalRootDirsTest extends SparkFunSuite with LocalSparkContext {

override def afterEach(): Unit = {
try {
Utils.deleteRecursively(tempDir)
if (!ShutdownHookManager.hasShutdownDeleteDir(tempDir) &&
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tempDir = Utils.createTempDir(namePrefix = "local")

In the beforeEach method, tempDir = Utils.createTempDir(namePrefix = "local") invokes ShutdownHookManager.registerShutdownDeleteDir to register tempDir for deletion via a shutdown hook. Therefore, if tempDir remains unchanged during the test case, manual cleanup is unnecessary.

However, given that tempDir is defined as protected var tempDir, it is possible for it to be altered within the test case. Hence, additional checks are implemented before manually cleaning up tempDir:

  1. tempDir has not been registered for cleanup via a shutdown hook.
  2. tempDir is not a subpath of any path that has been registered for cleanup via a shutdown hook.

This approach helps to minimize the risk of a race condition where multiple threads attempt to clean up the tempDir concurrently. It seems that this issue is more prevalent on macOS, so let's conduct several more rounds of testing to ensure its effectiveness.

@LuciferYang LuciferYang changed the title Add a check for unregistered ShutdownHook for tempDir before manual cleanup in the afterEach method of LocalRootDirsTest Add a check before manually cleaning up tempDir to ensure that it won't be cleaned up by the ShutdownHook in the afterEach of LocalRootDirsTest Jan 29, 2025
@LuciferYang LuciferYang changed the title Add a check before manually cleaning up tempDir to ensure that it won't be cleaned up by the ShutdownHook in the afterEach of LocalRootDirsTest Add a check before Utils.deleteRecursively(tempDir) to ensure tempDir won't be cleaned up by the ShutdownHook in the afterEach of LocalRootDirsTest Jan 29, 2025
@LuciferYang LuciferYang changed the title Add a check before Utils.deleteRecursively(tempDir) to ensure tempDir won't be cleaned up by the ShutdownHook in the afterEach of LocalRootDirsTest Add a check before Utils.deleteRecursively(tempDir) to ensure tempDir won't be cleaned up by the ShutdownHook in afterEach of LocalRootDirsTest Jan 29, 2025
@LuciferYang LuciferYang changed the title Add a check before Utils.deleteRecursively(tempDir) to ensure tempDir won't be cleaned up by the ShutdownHook in afterEach of LocalRootDirsTest [SPARK-51030][CORE][TESTS] Add a check before Utils.deleteRecursively(tempDir) to ensure tempDir won't be cleaned up by the ShutdownHook in afterEach of LocalRootDirsTest Jan 29, 2025
@github-actions github-actions bot removed the INFRA label Jan 30, 2025
@LuciferYang LuciferYang marked this pull request as ready for review January 30, 2025 06:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants