CUDA Error CUDA_ERROR_NOT_SUPPORTED: operation not supported while calling malloc_async_impl (cuMemAllocAsync) #450

Open
blooop opened this issue Jan 2, 2025 · 1 comment

blooop commented Jan 2, 2025

Ubuntu 22.04, on the main branch at commit 3aaf87b, installed via Docker.

I was trying out the advanced_worm.py example and it was working fine. At some point I stopped the script partway through. When I tried to run it again, I got this error:



Traceback (most recent call last):
  File "/workspaces/genesis/examples/tutorials/advanced_worm.py", line 9, in <module>
    scene = gs.Scene(
            ^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/genesis/utils/misc.py", line 27, in new_init
    original_init(self, *args, **kwargs)
  File "/opt/conda/lib/python3.11/site-packages/genesis/engine/scene.py", line 133, in __init__
    self._sim = Simulator(
                ^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/genesis/engine/simulator.py", line 94, in __init__
    self.rigid_solver = RigidSolver(self.scene, self, self.rigid_options)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/genesis/engine/solvers/rigid/rigid_solver_decomp.py", line 24, in __init__
    super().__init__(scene, sim, options)
  File "/opt/conda/lib/python3.11/site-packages/genesis/engine/solvers/base_solver.py", line 18, in __init__
    self._gravity.from_numpy(np.array(options.gravity, dtype=gs.np_float))
  File "/opt/conda/lib/python3.11/site-packages/taichi/lang/util.py", line 351, in wrapped
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/taichi/lang/matrix.py", line 1353, in from_numpy
    self._from_external_arr(arr)
  File "/opt/conda/lib/python3.11/site-packages/taichi/lang/util.py", line 351, in wrapped
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/taichi/lang/matrix.py", line 1337, in _from_external_arr
    ext_arr_to_matrix(arr, self, as_vector)
  File "/opt/conda/lib/python3.11/site-packages/taichi/lang/kernel_impl.py", line 1113, in wrapped
    return primal(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/taichi/lang/kernel_impl.py", line 1043, in __call__
    key = self.ensure_compiled(*args)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/taichi/lang/kernel_impl.py", line 1011, in ensure_compiled
    self.materialize(key=key, args=args, arg_features=arg_features)
  File "/opt/conda/lib/python3.11/site-packages/taichi/lang/kernel_impl.py", line 637, in materialize
    self.runtime.materialize()
  File "/opt/conda/lib/python3.11/site-packages/taichi/lang/impl.py", line 471, in materialize
    self.materialize_root_fb(not self.materialized)
  File "/opt/conda/lib/python3.11/site-packages/taichi/lang/impl.py", line 406, in materialize_root_fb
    root.finalize(raise_warning=not is_first_call)
  File "/opt/conda/lib/python3.11/site-packages/taichi/_snode/fields_builder.py", line 170, in finalize
    return self._finalize(raise_warning, compile_only=False)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/taichi/_snode/fields_builder.py", line 182, in _finalize
    return SNodeTree(_ti_core.finalize_snode_tree(_snode_registry, self.ptr, impl.get_runtime().prog, compile_only))
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: [cuda_driver.h:operator()@92] CUDA Error CUDA_ERROR_NOT_SUPPORTED: operation not supported while calling malloc_async_impl (cuMemAllocAsync)
[Genesis] [21:28:17] [INFO] 💤 Exiting Genesis and caching compiled kernels...

I figured I had probably run out of memory or some GPU resource was busy, so I restarted the machine and was able to run the example again. I then tried running the differentiable_push.py example and got the same error. This time, restarting the machine did not fix it.
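One way to test the "GPU resource is busy" guess is to list the compute processes the driver still knows about. The sketch below is only an illustration, not part of Genesis or Taichi: the helper names `parse_compute_apps` and `list_gpu_compute_apps` are made up here, and it assumes `nvidia-smi` is on the PATH. Inside a Docker container the list may be empty or incomplete, because host processes are not always visible from the container.

```python
import shutil
import subprocess

# Standard nvidia-smi query for compute processes, printed as headerless CSV.
QUERY = [
    "nvidia-smi",
    "--query-compute-apps=pid,process_name,used_memory",
    "--format=csv,noheader",
]


def parse_compute_apps(csv_text):
    """Parse headerless CSV lines of the form 'pid, process_name, used_memory'."""
    apps = []
    for line in csv_text.splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) < 3 or not parts[0].isdigit():
            continue  # skip blank lines and anything that is not a process row
        apps.append(
            {
                "pid": int(parts[0]),
                # Rejoin the middle fields in case the process name has a comma.
                "name": ",".join(parts[1:-1]),
                "used_memory": parts[-1],
            }
        )
    return apps


def list_gpu_compute_apps():
    """Return the compute processes reported by nvidia-smi, or [] if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(QUERY, capture_output=True, text=True, check=False)
    if result.returncode != 0:
        return []
    return parse_compute_apps(result.stdout)


if __name__ == "__main__":
    for app in list_gpu_compute_apps():
        print(app)
```

If a killed Genesis run still shows up in this list, that would point to a leftover process holding GPU memory rather than a dirty kernel cache. An empty list does not rule out other driver-side state.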

I've tried deleting the Docker image, restarting, etc., but that didn't seem to help. After a long time I was able to get the examples running again, and found that if I got the error and waited a couple of minutes, it would go away. At the moment I am getting the error and it no longer seems to go away.

This seems related:
taichi-dev/taichi#8395

My current theory is that I stopped execution in the middle of kernel compilation, and now some cache is dirty?

Do you know what this error is about?

Thanks.

blooop commented Jan 2, 2025

After waiting a bit more, the examples work again. It seems like some async process takes a while to release resources, even if I force-kill the Genesis script. I tried rerunning the script every 3 minutes or so, and after about 10 minutes the example stopped throwing the above error and ran.
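The "rerun every few minutes until it works" workaround can be sketched as a small wrapper. This is a minimal illustration under stated assumptions, not a fix for the underlying error: `run_until_success` is a hypothetical helper, the 180-second wait simply mirrors the roughly 3-minute interval mentioned above, and it retries on any non-zero exit code rather than checking specifically for `CUDA_ERROR_NOT_SUPPORTED`. Each attempt runs in a fresh process, matching how the script was restarted by hand.

```python
import subprocess
import sys
import time


def run_until_success(cmd, attempts=5, wait_seconds=180.0,
                      run=subprocess.run, sleep=time.sleep):
    """Run `cmd` in a fresh process, sleeping between failed attempts.

    Returns the 1-based number of the attempt that exited with code 0,
    or None if every attempt failed. `run` and `sleep` are injectable
    so the loop can be exercised without a GPU.
    """
    for attempt in range(1, attempts + 1):
        result = run(cmd)
        if result.returncode == 0:
            return attempt
        if attempt < attempts:
            sleep(wait_seconds)  # give the driver time to release resources
    return None


if __name__ == "__main__" and len(sys.argv) > 1:
    # Example: python retry.py examples/tutorials/advanced_worm.py
    outcome = run_until_success([sys.executable, *sys.argv[1:]])
    print("succeeded on attempt", outcome if outcome else "none")
```

Logging how many attempts were needed on each occasion would give a rough measure of how long the resources stay held after a forced kill.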
