Simplified handling of GPU core dumps #9238

Merged
merged 9 commits into from
Oct 4, 2023
Changes from 8 commits
89 changes: 89 additions & 0 deletions docs/dev/gpu-core-dumps.md
@@ -0,0 +1,89 @@
---
layout: page
title: GPU Core Dumps
nav_order: 9
parent: Developer Overview
---
# GPU Core Dumps

## Overview

When the GPU segfaults and generates an illegal access exception, it can be difficult to know
what the GPU was doing at the time of the exception. GPU operations execute asynchronously, so what
the CPU was doing at the time the GPU exception was noticed often has little to do with what
triggered the exception. GPU core dumps can provide useful clues when debugging these errors, as
they contain the state of the GPU at the time the exception occurred on the GPU.

The GPU driver can be configured to write a GPU core dump when the GPU segfaults via environment
variable settings for the process. The challenges for the RAPIDS Accelerator use case are getting
the environment variables set on the executor processes and then copying the GPU core dump file
to a distributed filesystem after it is generated on the local filesystem by the driver.

## Environment Variables

The following environment variables are useful for controlling GPU core dumps. See the
[GPU core dump support section of the CUDA-GDB documentation](https://docs.nvidia.com/cuda/cuda-gdb/index.html#gpu-core-dump-support)
for more details.

### `CUDA_ENABLE_COREDUMP_ON_EXCEPTION`

Set to `1` to trigger a GPU core dump on a GPU exception.

### `CUDA_COREDUMP_FILE`

The filename to use for the GPU core dump file. Paths relative to the process's current working
directory are supported. The pattern `%h` in the filename will be expanded to the hostname, and
the pattern `%p` will be expanded to the process ID. If the filename corresponds to a named pipe,
the GPU driver will write the GPU core dump data to that named pipe.
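
As a rough illustration of the pattern expansion described above, the following Scala sketch substitutes `%h` and `%p` in a filename. `CoreDumpPattern`, `expand`, and `currentPid` are hypothetical names introduced only for this example. The GPU driver performs the real expansion itself, so this only mirrors the documented behavior.

```scala
import java.lang.management.ManagementFactory
import java.net.InetAddress

// Hypothetical helper, not part of CUDA or the RAPIDS Accelerator. It only
// illustrates how the documented %h and %p patterns expand.
object CoreDumpPattern {
  /** Replaces %h with the given hostname and %p with the given process ID. */
  def expand(pattern: String, hostname: String, pid: Long): String =
    pattern.replace("%h", hostname).replace("%p", pid.toString)

  /** Best-effort lookup of the current JVM's process ID. */
  def currentPid: Long = {
    // Assumption: the runtime MXBean name has the conventional "<pid>@<host>" form.
    val name = ManagementFactory.getRuntimeMXBean.getName
    name.substring(0, name.indexOf("@")).toLong
  }

  def main(args: Array[String]): Unit = {
    val host = InetAddress.getLocalHost.getHostName
    println(expand("executor-%h-%p.nvcudmp", host, currentPid))
  }
}
```

For example, `CoreDumpPattern.expand("executor-%h-%p.nvcudmp", "node1", 1234L)` yields `executor-node1-1234.nvcudmp`.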

### `CUDA_ENABLE_LIGHTWEIGHT_COREDUMP`

Set to `1` to generate a lightweight core dump that omits the local, shared, and global memory
dumps. Disabled by default. Lightweight core dumps still show the code location that triggered
the exception and therefore can be a good option when one only needs to know what kernel(s) were
running at the time of the exception and which one triggered the exception.

### `CUDA_ENABLE_CPU_COREDUMP_ON_EXCEPTION`

Set to `0` to prevent the GPU driver from causing a CPU core dump of the process after the GPU
core dump is written. Enabled by default.

### `CUDA_COREDUMP_SHOW_PROGRESS`

Set to `1` to print progress messages to the process's stderr as the GPU core dump is generated.
This is only supported on newer GPU drivers (e.g., those that are CUDA 12 compatible).

## YARN Log Aggregation

The log aggregation feature of YARN can be leveraged to copy GPU core dumps to the same place that
YARN collects container logs. When enabled, YARN will collect all files in a container's log
directory to a distributed filesystem location. YARN will automatically expand the pattern
`<LOG_DIR>` in a container's environment variables to the container's log directory, which is useful
when configuring `CUDA_COREDUMP_FILE` to place the GPU core dump in the appropriate place for
log aggregation. Note that YARN log aggregation may be configured with relatively low file size
limits, which may interfere with successful collection of large GPU core dump files.

The following Spark configuration settings will enable GPU lightweight core dumps and have the
core dump files placed in the container log directory:

```text
spark.executorEnv.CUDA_ENABLE_COREDUMP_ON_EXCEPTION=1
spark.executorEnv.CUDA_ENABLE_LIGHTWEIGHT_COREDUMP=1
spark.executorEnv.CUDA_COREDUMP_FILE="<LOG_DIR>/executor-%h-%p.nvcudmp"
```
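
If the application is launched programmatically, the settings above can be turned into `spark-submit` arguments. The sketch below is one possible way to do that. `CoreDumpSubmitArgs` and `toSubmitArgs` are hypothetical names, and it assumes each argument is passed to the launcher as a separate element (for example via `ProcessBuilder`), so the value needs no shell quoting.

```scala
// Sketch: renders the settings listed above as spark-submit --conf arguments.
// CoreDumpSubmitArgs is a hypothetical helper, not part of Spark or the plugin.
object CoreDumpSubmitArgs {
  // Settings copied from the listing above; <LOG_DIR> is left for YARN to expand.
  val settings: Seq[(String, String)] = Seq(
    "spark.executorEnv.CUDA_ENABLE_COREDUMP_ON_EXCEPTION" -> "1",
    "spark.executorEnv.CUDA_ENABLE_LIGHTWEIGHT_COREDUMP" -> "1",
    "spark.executorEnv.CUDA_COREDUMP_FILE" -> "<LOG_DIR>/executor-%h-%p.nvcudmp"
  )

  /** Emits one "--conf" flag followed by one "key=value" argument per setting. */
  def toSubmitArgs(conf: Seq[(String, String)]): Seq[String] =
    conf.flatMap { case (k, v) => Seq("--conf", s"$k=$v") }

  def main(args: Array[String]): Unit =
    toSubmitArgs(settings).foreach(println)
}
```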

## Simplified Core Dump Handling

There is rudimentary support for simplified setup of GPU core dumps in the RAPIDS Accelerator.
This currently works only on Spark standalone clusters, since there is no way for a driver
plugin to programmatically override executor environment variable settings for Spark-on-YARN or
Spark-on-Kubernetes. In the future, with a driver that is compatible with CUDA 12.1 or later,
the RAPIDS Accelerator could leverage GPU driver APIs to programmatically configure GPU core dump
support on executor startup.

To enable the simplified core dump handling, set `spark.rapids.gpu.coreDump.dir` to a directory to
use for GPU core dumps. Distributed filesystem URIs are supported. This leverages named pipes and
background threads to copy the GPU core dump data to the distributed filesystem. Note that anything
that causes early, abrupt termination of the process such as throwing from a C++ destructor will
often terminate the process before the dump write can be completed. These abrupt terminations should
be fixed when discovered.
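
The named-pipe-plus-background-thread approach described above can be sketched in a few lines of Scala. This is a simplified illustration, not the plugin's implementation: `PipeCopier`, `createPipe`, and `startCopier` are hypothetical names, the destination is a local file rather than a distributed filesystem, and it assumes a POSIX system with `mkfifo` on the `PATH`.

```scala
import java.io.{File, FileInputStream}
import java.nio.file.{Files, Path, StandardCopyOption}
import java.util.concurrent.TimeUnit

// Illustrative only: PipeCopier and its methods are hypothetical names, and the
// destination is a local file rather than a distributed filesystem.
object PipeCopier {
  /** Creates a named pipe accessible only by its owner (requires `mkfifo`). */
  def createPipe(pipe: File): Unit = {
    val proc = new ProcessBuilder("mkfifo", "-m", "600", pipe.getPath).start()
    require(proc.waitFor(10, TimeUnit.SECONDS), "mkfifo timed out")
    require(proc.exitValue() == 0, s"mkfifo failed with exit code ${proc.exitValue()}")
  }

  /** Starts a daemon thread that waits for a writer on the pipe, then copies its data. */
  def startCopier(pipe: File, dest: Path): Thread = {
    val t = new Thread(new Runnable {
      override def run(): Unit = {
        // Opening a FIFO for reading blocks until a writer opens the other end.
        val in = new FileInputStream(pipe)
        try {
          // Copy until the writer closes the pipe (end of stream).
          Files.copy(in, dest, StandardCopyOption.REPLACE_EXISTING)
        } finally {
          in.close()
        }
      }
    }, "pipe-copier")
    t.setDaemon(true)
    t.start()
    t
  }
}
```

Because the copier is a daemon thread, it will not keep the process alive on its own. This is why a process that exits abruptly can lose a dump that is still being copied.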
@@ -0,0 +1,24 @@
/*
* Copyright (c) 2023, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package com.nvidia.spark.rapids

trait GpuCoreDumpMsg

/** Serialized message sent from executor to driver when a GPU core dump starts */
case class GpuCoreDumpMsgStart(executorId: String, dumpPath: String) extends GpuCoreDumpMsg

/** Serialized message sent from executor to driver when a GPU core dump completes */
case class GpuCoreDumpMsgCompleted(executorId: String, dumpPath: String) extends GpuCoreDumpMsg
@@ -0,0 +1,187 @@
/*
* Copyright (c) 2023, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package com.nvidia.spark.rapids

import java.io.File
import java.lang.management.ManagementFactory
import java.nio.file.Files
import java.util.concurrent.{Executors, ExecutorService, TimeUnit}

import com.nvidia.spark.rapids.Arm.{closeOnExcept, withResource}
import com.nvidia.spark.rapids.shims.NullOutputStreamShim
import org.apache.commons.io.IOUtils
import org.apache.hadoop.fs.Path
import org.apache.hadoop.fs.permission.{FsAction, FsPermission}

import org.apache.spark.SparkContext
import org.apache.spark.api.plugin.PluginContext
import org.apache.spark.internal.Logging
import org.apache.spark.io.CompressionCodec
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.rapids.execution.TrampolineUtil
import org.apache.spark.util.SerializableConfiguration

object GpuCoreDumpHandler extends Logging {
private var executor: Option[ExecutorService] = None
private var dumpedPath: Option[String] = None
private var namedPipeFile: File = _
// Written by the copy thread and polled by waitForDump on another thread
@volatile private var isDumping: Boolean = false

/**
* Configures the executor launch environment for GPU core dumps, if applicable.
* Should only be called from the driver on driver startup.
*/
def driverInit(sc: SparkContext, conf: RapidsConf): Unit = {
// This only works in practice on Spark standalone clusters. It's too late to influence the
// executor environment for Spark-on-YARN or Spark-on-k8s.
// TODO: Leverage CUDA 12.1 core dump APIs in the executor to programmatically set this up
// on executor startup. https://github.com/NVIDIA/spark-rapids/issues/9370
conf.gpuCoreDumpDir.foreach { _ =>
TrampolineUtil.setExecutorEnv(sc, "CUDA_ENABLE_COREDUMP_ON_EXCEPTION", "1")
TrampolineUtil.setExecutorEnv(sc, "CUDA_ENABLE_CPU_COREDUMP_ON_EXCEPTION", "0")
TrampolineUtil.setExecutorEnv(sc, "CUDA_ENABLE_LIGHTWEIGHT_COREDUMP",
if (conf.isGpuCoreDumpFull) "0" else "1")
TrampolineUtil.setExecutorEnv(sc, "CUDA_COREDUMP_FILE", conf.gpuCoreDumpPipePattern)
TrampolineUtil.setExecutorEnv(sc, "CUDA_COREDUMP_SHOW_PROGRESS", "1")
}
}

/**
* Sets up the GPU core dump background copy thread, if applicable.
* Should only be called from the executor on executor startup.
*/
def executorInit(rapidsConf: RapidsConf, pluginCtx: PluginContext): Unit = {
rapidsConf.gpuCoreDumpDir.foreach { dumpDir =>
namedPipeFile = createNamedPipe(rapidsConf)
executor = Some(Executors.newSingleThreadExecutor(new ThreadFactoryBuilder()
.setNameFormat("gpu-core-copier")
.setDaemon(true)
.build()))
executor.foreach { exec =>
val codec = if (rapidsConf.isGpuCoreDumpCompressed) {
Some(TrampolineUtil.createCodec(pluginCtx.conf(),
rapidsConf.gpuCoreDumpCompressionCodec))
} else {
None
}
val suffix = codec.map { c =>
"." + TrampolineUtil.getCodecShortName(c.getClass.getName)
}.getOrElse("")
exec.submit(new Runnable {
override def run(): Unit = {
try {
copyLoop(pluginCtx, namedPipeFile, new Path(dumpDir), codec, suffix)
} catch {
case _: InterruptedException => logInfo("Stopping GPU core dump copy thread")
case t: Throwable => logWarning("Error in GPU core dump copy thread", t)
}
}
})
}
}
}

/**
* Wait for a GPU dump in progress, if any, to complete
* @param timeoutSecs maximum amount of time to wait before returning
* @return true if the wait finished before the timeout elapsed, false if it timed out
*/
def waitForDump(timeoutSecs: Int): Boolean = {
val endTime = System.nanoTime + TimeUnit.SECONDS.toNanos(timeoutSecs)
while (isDumping && System.nanoTime < endTime) {
Thread.sleep(10)
}
System.nanoTime < endTime
}

def shutdown(): Unit = {
executor.foreach { exec =>
exec.shutdownNow()
executor = None
namedPipeFile.delete()
namedPipeFile = null
}
}

def handleMsg(msg: GpuCoreDumpMsg): AnyRef = msg match {
case GpuCoreDumpMsgStart(executorId, dumpPath) =>
logError(s"Executor $executorId starting a GPU core dump to $dumpPath")
val spark = SparkSession.active
new SerializableConfiguration(spark.sparkContext.hadoopConfiguration)
case GpuCoreDumpMsgCompleted(executorId, dumpPath) =>
logError(s"Executor $executorId wrote a GPU core dump to $dumpPath")
null
case m =>
throw new IllegalStateException(s"Unexpected GPU core dump msg: $m")
}

// visible for testing
def getNamedPipeFile: File = namedPipeFile

private def createNamedPipe(conf: RapidsConf): File = {
val processName = ManagementFactory.getRuntimeMXBean.getName
val pidstr = processName.substring(0, processName.indexOf("@"))
val pipePath = conf.gpuCoreDumpPipePattern.replace("%p", pidstr)
val pipeFile = new File(pipePath)
val mkFifoProcess = Runtime.getRuntime.exec(Array("mkfifo", "-m", "600", pipeFile.toString))
require(mkFifoProcess.waitFor(10, TimeUnit.SECONDS), "mkfifo timed out")
pipeFile.deleteOnExit()
pipeFile
}

private def copyLoop(
pluginCtx: PluginContext,
namedPipe: File,
dumpDirPath: Path,
codec: Option[CompressionCodec],
suffix: String): Unit = {
try {
logInfo(s"Monitoring ${namedPipe.getAbsolutePath} for GPU core dumps")
withResource(new java.io.FileInputStream(namedPipe)) { in =>
isDumping = true
val appId = pluginCtx.conf.get("spark.app.id")
val executorId = pluginCtx.executorID()
val dumpPath = new Path(dumpDirPath,
s"gpucore-$appId-$executorId.nvcudmp$suffix")
logError(s"Generating GPU core dump at $dumpPath")
val hadoopConf = pluginCtx.ask(GpuCoreDumpMsgStart(executorId, dumpPath.toString))
.asInstanceOf[SerializableConfiguration].value
val dumpFs = dumpPath.getFileSystem(hadoopConf)
val bufferSize = hadoopConf.getInt("io.file.buffer.size", 4096)
val perms = new FsPermission(FsAction.READ_WRITE, FsAction.NONE, FsAction.NONE)
val fsOut = dumpFs.create(dumpPath, perms, false, bufferSize,
dumpFs.getDefaultReplication(dumpPath), dumpFs.getDefaultBlockSize(dumpPath), null)
val out = closeOnExcept(fsOut) { _ =>
codec.map(_.compressedOutputStream(fsOut)).getOrElse(fsOut)
}
withResource(out) { _ =>
IOUtils.copy(in, out)
}
dumpedPath = Some(dumpPath.toString)
pluginCtx.send(GpuCoreDumpMsgCompleted(executorId, dumpedPath.get))
}
} catch {
case e: Exception =>
logError("Error copying GPU dump", e)
} finally {
isDumping = false
}
// Always drain the pipe to avoid blocking the thread that triggers the coredump
while (namedPipe.exists()) {
Files.copy(namedPipe.toPath, NullOutputStreamShim.INSTANCE)
}
}
}
@@ -269,6 +269,7 @@ class RapidsDriverPlugin extends DriverPlugin with Logging {
s"Rpc message $msg received, but shuffle heartbeat manager not configured.")
}
rapidsShuffleHeartbeatManager.executorHeartbeat(id)
case m: GpuCoreDumpMsg => GpuCoreDumpHandler.handleMsg(m)
case m => throw new IllegalStateException(s"Unknown message $m")
}
}
@@ -279,6 +280,7 @@ class RapidsDriverPlugin extends DriverPlugin with Logging {
RapidsPluginUtils.fixupConfigsOnDriver(sparkConf)
val conf = new RapidsConf(sparkConf)
RapidsPluginUtils.logPluginMode(conf)
GpuCoreDumpHandler.driverInit(sc, conf)

if (GpuShuffleEnv.isRapidsShuffleAvailable(conf)) {
GpuShuffleEnv.initShuffleManager()
@@ -351,6 +353,8 @@ class RapidsExecutorPlugin extends ExecutorPlugin with Logging {
}
}

GpuCoreDumpHandler.executorInit(conf, pluginContext)

// we rely on the Rapids Plugin being run with 1 GPU per executor so we can initialize
// on executor startup.
if (!GpuDeviceManager.rmmTaskInitEnabled) {
@@ -475,6 +479,7 @@ class RapidsExecutorPlugin extends ExecutorPlugin with Logging {
Option(rapidsShuffleHeartbeatEndpoint).foreach(_.close())
extraExecutorPlugins.foreach(_.shutdown())
FileCache.shutdown()
GpuCoreDumpHandler.shutdown()
}

override def onTaskFailed(failureReason: TaskFailedReason): Unit = {
@@ -487,6 +492,7 @@ class RapidsExecutorPlugin extends ExecutorPlugin with Logging {
case Some(e) if containsCudaFatalException(e) =>
logError("Stopping the Executor based on exception being a fatal CUDA error: " +
s"${ef.toErrorString}")
GpuCoreDumpHandler.waitForDump(timeoutSecs = 60)
logGpuDebugInfoAndExit(systemExitCode = 20)
case Some(_: CudaException) =>
logDebug(s"Executor onTaskFailed because of a non-fatal CUDA error: " +