[bugfix] CuTensorNetHandle failure on multiple GPUs after cuTensorNet 2.3.0 #92

Merged
PabloAndresCQ merged 1 commit into develop from bugfix/libhandle_creation
Apr 4, 2024

Conversation

Collaborator

@PabloAndresCQ PabloAndresCQ commented Apr 3, 2024

Description

With cuTensorNet>=2.3.0 installed, any simulation method that uses CuTensorNetHandle fails with a very obscure error if you try to use a GPU device other than the default one (device=0).

The problem appears to be that newer versions of cuTensorNet use cupy internally, and cupy needs its device to be specified when it is not the default one. We were already doing this, but the order of the commands was wrong: cutn.create(), which creates the cuTensorNet library handle, was called before updating the cupy device, causing a mismatch in the device being used. This PR fixes the order so that the cupy device is updated first.
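
To illustrate the ordering, here is a minimal sketch (not the actual diff in this PR). The helper `create_handle_on_device` and its `set_device` / `create` parameters are hypothetical, added only so the order of calls can be checked with stubs on a machine without GPUs. The defaults assume `cupy.cuda.Device(...).use()` to select the device and `cuquantum.cutensornet.create()` as the `cutn.create()` mentioned above.

```python
from typing import Any, Callable, Optional


def create_handle_on_device(
    device_id: int,
    set_device: Optional[Callable[[int], None]] = None,
    create: Optional[Callable[[], Any]] = None,
) -> Any:
    """Select the GPU first, then create the cuTensorNet library handle.

    Hypothetical helper for illustration; not part of the library's API.
    """
    if set_device is None or create is None:
        # Imported lazily so the sketch can be exercised with stubs
        # on a machine that has neither cupy nor cuQuantum installed.
        import cupy as cp
        import cuquantum.cutensornet as cutn

        def _use_cupy_device(dev: int) -> None:
            # Make `dev` the current cupy device for this thread.
            cp.cuda.Device(dev).use()

        if set_device is None:
            set_device = _use_cupy_device
        if create is None:
            create = cutn.create

    # 1. Update the cupy device BEFORE touching cuTensorNet.
    set_device(device_id)
    # 2. Only now create the library handle, so both agree on the device.
    return create()
```

On a multi-GPU machine, `create_handle_on_device(1)` would select device 1 before creating the handle. Calling the two steps in the opposite order is the mismatch described above.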

Checklist

  • I have run the tests on a machine with GPUs.
  • I have performed a self-review of my code.
  • I have commented hard-to-understand parts of my code.
  • I have made corresponding changes to the public API documentation.
  • I have added tests that prove my fix is effective or that my feature works.
  • I have updated the changelog with any user-facing changes.

@PabloAndresCQ PabloAndresCQ requested a review from cqc-melf April 3, 2024 15:11
Collaborator

@cqc-melf cqc-melf left a comment

Do you have tests that confirm this solves the issue? Do you usually do this?

@PabloAndresCQ
Collaborator Author

Automatic testing for this would be hard, since it would require a machine with multiple devices. I have tested it in the sense that I needed this fix in order to run some parallel tasks on Perlmutter (GPU cluster). I'd rather not add a test for this.

@PabloAndresCQ PabloAndresCQ requested a review from cqc-melf April 3, 2024 16:33
@PabloAndresCQ PabloAndresCQ merged commit 0b71367 into develop Apr 4, 2024
6 checks passed
@PabloAndresCQ PabloAndresCQ deleted the bugfix/libhandle_creation branch April 4, 2024 08:55

2 participants