capture nsys report #14

Merged: 2 commits, Dec 14, 2023

@wjxiz1992 wjxiz1992 commented Dec 13, 2023

This lets customers try capturing an nsys report. The report is generated before the executor exits, so users should be able to upload it to their persistent storage.

This also requires a change to the entry point:

...
case "$1" in
  driver)
    shift 1
    CMD=(
      "$SPARK_HOME/bin/spark-submit"
      --conf "spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS"
      --deploy-mode client
      "$@"
    )
    ;;
  executor)
    shift 1
    CMD=(
      nsys launch --cuda-memory-usage=true
      ${JAVA_HOME}/bin/java
      "${SPARK_EXECUTOR_JAVA_OPTS[@]}"
      -Xms$SPARK_EXECUTOR_MEMORY
      -Xmx$SPARK_EXECUTOR_MEMORY
      -cp "$SPARK_CLASSPATH:$SPARK_DIST_CLASSPATH"
      org.apache.spark.scheduler.cluster.k8s.KubernetesExecutorBackend
      --driver-url $SPARK_DRIVER_URL
      --executor-id $SPARK_EXECUTOR_ID
      --cores $SPARK_EXECUTOR_CORES
      --app-id $SPARK_APPLICATION_ID
      --hostname $SPARK_EXECUTOR_POD_IP
      --resourceProfileId $SPARK_RESOURCE_PROFILE_ID
      --podName $SPARK_EXECUTOR_POD_NAME
    )
    ;;
...

NOTE: --cuda-memory-usage=true causes a GPU performance drop, so if you don't need memory usage analysis, you can remove it from the nsys command.
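
As a rough illustration of the note above, the flag could be made optional instead of being edited in and out of the entry point by hand. This is only a sketch: `NSYS_CUDA_MEMORY_USAGE` and `nsys_prefix` are hypothetical names, not part of Spark, nsys, or this PR.

```shell
# Sketch only: NSYS_CUDA_MEMORY_USAGE and nsys_prefix are hypothetical names.
# Prints the nsys prefix for the executor command, adding memory tracking
# only when it is explicitly requested, since it slows the GPU down.
nsys_prefix() {
  if [ "${NSYS_CUDA_MEMORY_USAGE:-false}" = "true" ]; then
    printf '%s\n' "nsys launch --cuda-memory-usage=true"
  else
    printf '%s\n' "nsys launch"
  fi
}

# In the executor branch of the entry point, the prefix would go first, e.g.:
#   CMD=( $(nsys_prefix) "${JAVA_HOME}/bin/java" ... )
# The unquoted expansion is intentional so the prefix splits into words.
```

With this shape, memory tracking stays off unless the pod sets the variable to `true`, so the default run avoids the perf drop.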

Signed-off-by: Allen Xu <[email protected]>
val nsysStopCommand = "nsys stop"
val result: String = nsysStopCommand.!!
println(s"Nsys Stop Command output: $result")
Thread.sleep(120* 1000)
Owner

Will this make GPU performance worse?

Author

Not really. Only enabling CUDA memory tracking (--cuda-memory-usage=true) makes performance worse, and I didn't enable that.
Should we enable it? It is used to see peak memory usage.

Owner

@firestarman firestarman Dec 13, 2023

I mean the "sleep" here; it takes 2 minutes.
The customer will also use this branch for benchmarks, and I am not sure debug-only code should go in.
@winningsix what do you think?

Author

Oh, I see your point. I removed the sleep: the command itself is blocking, so there is no need to wait.
As we discussed offline, the report generation time is counted, so use the new branch for analysis and the original one, which doesn't run nsys, for performance benchmarks.
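
For reference, the stop-then-upload flow described in this thread could look roughly like the following. It is a minimal sketch under stated assumptions, not the code in this PR: `stop_and_save_reports`, `NSYS_STOP_CMD`, and both directory arguments are hypothetical, and it assumes the reports use the `.nsys-rep` extension. It relies on the stop command blocking until the report is written, as noted above, so there is no sleep.

```shell
# Sketch only: stop_and_save_reports and NSYS_STOP_CMD are hypothetical names.
# Stops the nsys collection, then copies generated reports to persistent storage.
stop_and_save_reports() {
  report_dir=$1   # directory where nsys wrote its reports (assumed)
  dest_dir=$2     # persistent storage mount (assumed)

  # The stop command is expected to block until the report exists,
  # so nothing needs to wait after it returns.
  ${NSYS_STOP_CMD:-nsys stop} || return 1

  mkdir -p "$dest_dir" || return 1
  for report in "$report_dir"/*.nsys-rep; do
    # Skip the unexpanded glob when no report was produced.
    [ -e "$report" ] || continue
    cp "$report" "$dest_dir"/ || return 1
  done
}
```

`NSYS_STOP_CMD` exists only so the function can be exercised on a machine without nsys; left unset, it runs `nsys stop`.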

@firestarman firestarman changed the base branch from branch-23.12 to nsys-probe December 13, 2023 07:28
Signed-off-by: Allen Xu <[email protected]>
@firestarman firestarman merged commit f3019ad into firestarman:nsys-probe Dec 14, 2023
1 check failed