
[BUG] Interleaving cuDF operations with Tensorflow results in CUDA_ERROR_INVALID_VALUE #14117

Open
isVoid opened this issue Sep 18, 2023 · 7 comments
Labels: 0 - Backlog (In queue waiting for assignment), bug (Something isn't working), Python (Affects Python cuDF API)

Comments


isVoid commented Sep 18, 2023

Describe the bug
When cuDF code is interleaved with certain TensorFlow code paths, a subsequent device-to-host copy fails with a CUDA_ERROR_INVALID_VALUE error.

Full stack trace

tests/test_tensorflow.py:188: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
/home/nfs/wangm/mambaforge/envs/xdf-integration/lib/python3.10/site-packages/nvtx/nvtx.py:101: in inner
    result = func(*args, **kwargs)
/home/nfs/wangm/mambaforge/envs/xdf-integration/lib/python3.10/site-packages/cudf/core/dataframe.py:5055: in to_pandas
    out_data[i] = self._data[col_key].to_pandas(
/home/nfs/wangm/mambaforge/envs/xdf-integration/lib/python3.10/site-packages/cudf/core/column/numerical.py:685: in to_pandas
    pd_series = pd.Series(self.values_host, copy=False)
/home/nfs/wangm/mambaforge/envs/xdf-integration/lib/python3.10/site-packages/cudf/core/column/column.py:234: in values_host
    return self.data_array_view(mode="read").copy_to_host()
/home/nfs/wangm/mambaforge/envs/xdf-integration/lib/python3.10/site-packages/cudf/core/column/column.py:156: in data_array_view
    return cuda.as_cuda_array(obj).view(self.dtype)
/home/nfs/wangm/mambaforge/envs/xdf-integration/lib/python3.10/site-packages/numba/cuda/api.py:76: in as_cuda_array
    return from_cuda_array_interface(obj.__cuda_array_interface__,
/home/nfs/wangm/mambaforge/envs/xdf-integration/lib/python3.10/site-packages/numba/cuda/cudadrv/devices.py:232: in _require_cuda_context
    return fn(*args, **kws)
/home/nfs/wangm/mambaforge/envs/xdf-integration/lib/python3.10/site-packages/numba/cuda/api.py:47: in from_cuda_array_interface
    devptr = driver.get_devptr_for_active_ctx(desc['data'][0])
/home/nfs/wangm/mambaforge/envs/xdf-integration/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py:2964: in get_devptr_for_active_ctx
    driver.cuPointerGetAttribute(byref(devptr), attr, ptr)
/home/nfs/wangm/mambaforge/envs/xdf-integration/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py:327: in safe_cuda_api_call
    self._check_ctypes_error(fname, retcode)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <numba.cuda.cudadrv.driver.Driver object at 0x7ff5f744e7a0>, fname = 'cuPointerGetAttribute', retcode = 1

    def _check_ctypes_error(self, fname, retcode):
        if retcode != enums.CUDA_SUCCESS:
            errname = ERROR_MAP.get(retcode, "UNKNOWN_CUDA_ERROR")
            msg = "Call to %s results in %s" % (fname, errname)
            _logger.error(msg)
            if retcode == enums.CUDA_ERROR_NOT_INITIALIZED:
                self._detect_fork()
>           raise CudaAPIError(retcode, msg)
E           numba.cuda.cudadrv.driver.CudaAPIError: [1] Call to cuPointerGetAttribute results in CUDA_ERROR_INVALID_VALUE

/home/nfs/wangm/mambaforge/envs/xdf-integration/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py:395: CudaAPIError

Steps/Code to reproduce bug
Min repro:

import cudf
import tensorflow as tf

df = cudf.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

with tf.device("/GPU:1"):
    # Perform some operations on the GPU with tf
    inp = tf.keras.Input(shape=(), name="inp", dtype=tf.int64)
    b = inp[:, tf.newaxis]

# Copy data to host
pdf = df.to_pandas()

Expected behavior
Since the Keras layer only constructs the computation graph on the TensorFlow side and shouldn't interact with cuDF, the cuDF operation shouldn't fail.

Environment overview (please complete the following information)

  • Environment location: [Bare-metal]
  • Method of cuDF install: [conda]

Environment details
N/A

Additional context
N/A

isVoid added the bug (Something isn't working) and Needs Triage (Need team to review and classify) labels on Sep 18, 2023

wence- commented Sep 18, 2023

with tf.device("/GPU:1"):

Does this leave device-1 as the currently active context?
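
One way to check (a sketch, assuming the cuda-python bindings are available) is to query the active runtime device before and after the tf.device block:

import tensorflow as tf
from cuda import cudart  # cuda-python runtime bindings

print(cudart.cudaGetDevice())  # returns (cudaError_t, device ordinal)
with tf.device("/GPU:1"):
    inp = tf.keras.Input(shape=(), name="inp", dtype=tf.int64)
    b = inp[:, tf.newaxis]
    print(cudart.cudaGetDevice())  # ordinal inside the tf.device scope
print(cudart.cudaGetDevice())  # is device 1 still current after leaving the scope?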


isVoid commented Sep 18, 2023

It does: https://www.tensorflow.org/guide/gpu#manual_device_placement

you can use with tf.device to create a device context, and all the operations within that context will run on the same designated device.

It doesn't matter which device I choose; the error is reproducible either way.
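
For illustration, a variation of the repro that tries each device placement in turn would look like this (a sketch only; the device strings and loop are my own, not from the original test):

import cudf
import tensorflow as tf

df = cudf.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

for i, dev in enumerate(("/GPU:0", "/GPU:1")):  # whichever GPUs are visible
    with tf.device(dev):
        inp = tf.keras.Input(shape=(), name=f"inp{i}", dtype=tf.int64)
        b = inp[:, tf.newaxis]
    # On the affected bare-metal setup this copy reportedly fails with
    # CUDA_ERROR_INVALID_VALUE regardless of the device selected above.
    pdf = df.to_pandas()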


wence- commented Sep 18, 2023

FWIW, this works for me in a rapids-compose container (I did mamba install tensorflow to get version 2.12.1):

In [1]: import cudf
   ...: import tensorflow as tf
   ...: from cuda import cudart
   ...: df = cudf.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
   ...: 
   ...: print(cudart.cudaGetDevice())
   ...: with tf.device("/GPU:1"):
   ...:     # Perform some operations on the GPU with tf
   ...:     inp = tf.keras.Input(shape=(), name="inp", dtype=tf.int64)
   ...:     b = inp[:, tf.newaxis]
   ...:     print(cudart.cudaGetDevice())
   ...: 
   ...: # Copy data to host
   ...: print(cudart.cudaGetDevice())
   ...: pdf = df.to_pandas()

2023-09-18 12:22:32.889821: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-09-18 12:22:32.928206: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE4.1 SSE4.2 AVX AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
(<cudaError_t.cudaSuccess: 0>, 0)
2023-09-18 12:22:36.558988: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1635] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 46304 MB memory:  -> device: 0, name: NVIDIA RTX A6000, pci bus id: 0000:17:00.0, compute capability: 8.6
2023-09-18 12:22:36.559648: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1635] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 44191 MB memory:  -> device: 1, name: NVIDIA RTX A6000, pci bus id: 0000:b3:00.0, compute capability: 8.6
(<cudaError_t.cudaSuccess: 0>, 1)
(<cudaError_t.cudaSuccess: 0>, 1)
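
If TF leaving device 1 current does turn out to matter on the failing machine, one thing that might be worth trying (an untested sketch, assuming the DataFrame's memory lives on device 0) is to make device 0 current again before the copy:

from cuda import cudart

(err,) = cudart.cudaSetDevice(0)  # untested mitigation idea: restore device 0
assert err == cudart.cudaError_t.cudaSuccess
pdf = df.to_pandas()  # the device-to-host copy that previously failed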


isVoid commented Sep 18, 2023

I think I installed tensorflow the same way, and it's the same version (2.12.1), but I'm running on bare metal.


isVoid commented Sep 18, 2023

Which CUDA version are you using?


wence- commented Sep 18, 2023

I have NVIDIA-SMI 525.125.06, Driver Version 525.125.06, CUDA Version 12.0, and:

$ mamba list | grep cuda
# packages in environment at /home/wence/Documents/src/rapids/compose/etc/conda/cuda_11.8/envs/rapids:
cuda-nvtx                 11.8.86                       0    nvidia
cuda-python               11.8.2          py310h01a121a_0    conda-forge
cuda-sanitizer-api        11.8.86                       0    nvidia
cuda-version              11.8                 h70ddcb2_2    conda-forge
cudatoolkit               11.8.0              h4ba93d1_12    conda-forge
dask-cuda                 23.10.00a       py310_230911_g63ba2cc_12    rapidsai-nightly
libkvikio                 23.10.00a       cuda11_230911_gc85abd5_17    rapidsai-nightly
tensorflow                2.12.1          cuda112py310h457873b_0    conda-forge
tensorflow-base           2.12.1          cuda112py310h622e808_0    conda-forge
tensorflow-estimator      2.12.1          cuda112py310ha5e6de5_0    conda-forge


isVoid commented Sep 19, 2023

xref: tensorflow/tensorflow#61911

GregoryKimball added the 0 - Backlog (In queue waiting for assignment) and Python (Affects Python cuDF API) labels and removed the Needs Triage (Need team to review and classify) label on Nov 9, 2023
vyasr added this to the cuDF Python project on Nov 5, 2024