
[BUG] Interleaving cuDF operations with Tensorflow results in CUDA_ERROR_INVALID_VALUE #14117

Open
isVoid opened this issue Sep 18, 2023 · 7 comments
Labels: 0 - Backlog (In queue waiting for assignment), bug (Something isn't working), Python (Affects Python cuDF API)

Comments


isVoid commented Sep 18, 2023

Describe the bug
When cuDF code is interleaved with certain TensorFlow code paths, a subsequent device-to-host copy fails with a CUDA_ERROR_INVALID_VALUE error.

Full stack trace

tests/test_tensorflow.py:188: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
/home/nfs/wangm/mambaforge/envs/xdf-integration/lib/python3.10/site-packages/nvtx/nvtx.py:101: in inner
    result = func(*args, **kwargs)
/home/nfs/wangm/mambaforge/envs/xdf-integration/lib/python3.10/site-packages/cudf/core/dataframe.py:5055: in to_pandas
    out_data[i] = self._data[col_key].to_pandas(
/home/nfs/wangm/mambaforge/envs/xdf-integration/lib/python3.10/site-packages/cudf/core/column/numerical.py:685: in to_pandas
    pd_series = pd.Series(self.values_host, copy=False)
/home/nfs/wangm/mambaforge/envs/xdf-integration/lib/python3.10/site-packages/cudf/core/column/column.py:234: in values_host
    return self.data_array_view(mode="read").copy_to_host()
/home/nfs/wangm/mambaforge/envs/xdf-integration/lib/python3.10/site-packages/cudf/core/column/column.py:156: in data_array_view
    return cuda.as_cuda_array(obj).view(self.dtype)
/home/nfs/wangm/mambaforge/envs/xdf-integration/lib/python3.10/site-packages/numba/cuda/api.py:76: in as_cuda_array
    return from_cuda_array_interface(obj.__cuda_array_interface__,
/home/nfs/wangm/mambaforge/envs/xdf-integration/lib/python3.10/site-packages/numba/cuda/cudadrv/devices.py:232: in _require_cuda_context
    return fn(*args, **kws)
/home/nfs/wangm/mambaforge/envs/xdf-integration/lib/python3.10/site-packages/numba/cuda/api.py:47: in from_cuda_array_interface
    devptr = driver.get_devptr_for_active_ctx(desc['data'][0])
/home/nfs/wangm/mambaforge/envs/xdf-integration/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py:2964: in get_devptr_for_active_ctx
    driver.cuPointerGetAttribute(byref(devptr), attr, ptr)
/home/nfs/wangm/mambaforge/envs/xdf-integration/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py:327: in safe_cuda_api_call
    self._check_ctypes_error(fname, retcode)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <numba.cuda.cudadrv.driver.Driver object at 0x7ff5f744e7a0>, fname = 'cuPointerGetAttribute', retcode = 1

    def _check_ctypes_error(self, fname, retcode):
        if retcode != enums.CUDA_SUCCESS:
            errname = ERROR_MAP.get(retcode, "UNKNOWN_CUDA_ERROR")
            msg = "Call to %s results in %s" % (fname, errname)
            _logger.error(msg)
            if retcode == enums.CUDA_ERROR_NOT_INITIALIZED:
                self._detect_fork()
>           raise CudaAPIError(retcode, msg)
E           numba.cuda.cudadrv.driver.CudaAPIError: [1] Call to cuPointerGetAttribute results in CUDA_ERROR_INVALID_VALUE

/home/nfs/wangm/mambaforge/envs/xdf-integration/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py:395: CudaAPIError

Steps/Code to reproduce bug
Min repro:

import cudf
import tensorflow as tf

df = cudf.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

with tf.device("/GPU:1"):
    # Perform some operations on the GPU with tf
    inp = tf.keras.Input(shape=(), name="inp", dtype=tf.int64)
    b = inp[:, tf.newaxis]

# Copy data to host
pdf = df.to_pandas()

Expected behavior
Since the Keras layer only constructs the computation graph on the TensorFlow side and shouldn't interact with cuDF, the cuDF operation shouldn't fail.

Environment overview (please complete the following information)

  • Environment location: [Bare-metal]
  • Method of cuDF install: [conda]

Environment details
N/A

Additional context
N/A

isVoid added the bug (Something isn't working) and Needs Triage (Need team to review and classify) labels on Sep 18, 2023

wence- commented Sep 18, 2023

with tf.device("/GPU:1"):

Does this leave device-1 as the currently active context?
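
One way to check (a sketch, assuming the cuda-python bindings are available) is to query the active runtime device before and after the tf.device block:

import tensorflow as tf
from cuda import cudart  # cuda-python runtime bindings

print(cudart.cudaGetDevice())  # returns (cudaError_t, device ordinal)
with tf.device("/GPU:1"):
    inp = tf.keras.Input(shape=(), name="inp", dtype=tf.int64)
    b = inp[:, tf.newaxis]
    print(cudart.cudaGetDevice())  # ordinal inside the tf.device scope
print(cudart.cudaGetDevice())  # is device 1 still current after leaving the scope?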


isVoid commented Sep 18, 2023

It does: https://www.tensorflow.org/guide/gpu#manual_device_placement

you can use with tf.device to create a device context, and all the operations within that context will run on the same designated device.

It doesn't matter which device I choose; the error is reproducible either way.
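
For illustration, a variation of the repro that tries each device placement in turn would look like this (a sketch only; the device strings and loop are my own, not from the original test):

import cudf
import tensorflow as tf

df = cudf.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

for i, dev in enumerate(("/GPU:0", "/GPU:1")):  # whichever GPUs are visible
    with tf.device(dev):
        inp = tf.keras.Input(shape=(), name=f"inp{i}", dtype=tf.int64)
        b = inp[:, tf.newaxis]
    # On the affected bare-metal setup this copy reportedly fails with
    # CUDA_ERROR_INVALID_VALUE regardless of the device selected above.
    pdf = df.to_pandas()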


wence- commented Sep 18, 2023

FWIW, this works for me in a rapids-compose container (I did mamba install tensorflow to get version 2.12.1):

In [1]: import cudf
   ...: import tensorflow as tf
   ...: from cuda import cudart
   ...: df = cudf.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
   ...: 
   ...: print(cudart.cudaGetDevice())
   ...: with tf.device("/GPU:1"):
   ...:     # Perform some operations on the GPU with tf
   ...:     inp = tf.keras.Input(shape=(), name="inp", dtype=tf.int64)
   ...:     b = inp[:, tf.newaxis]
   ...:     print(cudart.cudaGetDevice())
   ...: 
   ...: # Copy data to host
   ...: print(cudart.cudaGetDevice())
   ...: pdf = df.to_pandas()

2023-09-18 12:22:32.889821: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-09-18 12:22:32.928206: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE4.1 SSE4.2 AVX AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
(<cudaError_t.cudaSuccess: 0>, 0)
2023-09-18 12:22:36.558988: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1635] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 46304 MB memory:  -> device: 0, name: NVIDIA RTX A6000, pci bus id: 0000:17:00.0, compute capability: 8.6
2023-09-18 12:22:36.559648: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1635] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 44191 MB memory:  -> device: 1, name: NVIDIA RTX A6000, pci bus id: 0000:b3:00.0, compute capability: 8.6
(<cudaError_t.cudaSuccess: 0>, 1)
(<cudaError_t.cudaSuccess: 0>, 1)
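
If TF leaving device 1 current does turn out to matter on the failing machine, one thing that might be worth trying (an untested sketch, assuming the DataFrame's memory lives on device 0) is to make device 0 current again before the copy:

from cuda import cudart

(err,) = cudart.cudaSetDevice(0)  # untested mitigation idea: restore device 0
assert err == cudart.cudaError_t.cudaSuccess
pdf = df.to_pandas()  # the device-to-host copy that previously failed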


isVoid commented Sep 18, 2023

I think I installed tensorflow the same way, and it's the same version (2.12.1), but I'm running on bare metal.


isVoid commented Sep 18, 2023

Which CUDA version are you using?


wence- commented Sep 18, 2023

I have NVIDIA-SMI 525.125.06, Driver Version 525.125.06, CUDA Version 12.0, and:

$ mamba list | grep cuda
# packages in environment at /home/wence/Documents/src/rapids/compose/etc/conda/cuda_11.8/envs/rapids:
cuda-nvtx                 11.8.86                       0    nvidia
cuda-python               11.8.2          py310h01a121a_0    conda-forge
cuda-sanitizer-api        11.8.86                       0    nvidia
cuda-version              11.8                 h70ddcb2_2    conda-forge
cudatoolkit               11.8.0              h4ba93d1_12    conda-forge
dask-cuda                 23.10.00a       py310_230911_g63ba2cc_12    rapidsai-nightly
libkvikio                 23.10.00a       cuda11_230911_gc85abd5_17    rapidsai-nightly
tensorflow                2.12.1          cuda112py310h457873b_0    conda-forge
tensorflow-base           2.12.1          cuda112py310h622e808_0    conda-forge
tensorflow-estimator      2.12.1          cuda112py310ha5e6de5_0    conda-forge


isVoid commented Sep 19, 2023

xref: tensorflow/tensorflow#61911

GregoryKimball added the 0 - Backlog (In queue waiting for assignment) and Python (Affects Python cuDF API) labels and removed the Needs Triage (Need team to review and classify) label on Nov 9, 2023
vyasr added this to the cuDF Python project on Nov 5, 2024