[bugfix] CuTensorNetHandle failure on multiple GPUs after cuTensorNet 2.3.0 #92

PabloAndresCQ · 2024-04-03T14:52:31Z

Description

When running any simulation method that uses CuTensorNetHandle with cuTensorNet>=2.3.0 installed, trying to use a GPU device other than the default one (device=0) produced a very obscure error.

The problem appears to be that newer versions of cuTensorNet use cupy internally, which needs its device to be specified when not using the default one. We were already doing this, but the order of the commands was wrong: cutn.create(), which creates the cuTensorNet library handle, was called before updating the cupy device, causing a mismatch in the device being used.
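
As a rough illustration of the ordering issue described above, here is a minimal sketch that runs without a GPU. `FakeRuntime` and `create_handle_on_device` are hypothetical stand-ins written for this example and are not part of the library; the sketch only assumes that a handle captures whichever device is current at the moment it is created.

```python
from typing import Any, Callable, Dict


class FakeRuntime:
    """Hypothetical stand-in for a GPU runtime with a 'current device'."""

    def __init__(self) -> None:
        self.current_device = 0  # default device, as with device=0

    def select_device(self, device_id: int) -> None:
        self.current_device = device_id

    def create_handle(self) -> Dict[str, int]:
        # Assumption: the handle is bound to the device that is
        # current at creation time and does not follow later changes.
        return {"device": self.current_device}


def create_handle_on_device(
    device_id: int,
    select_device: Callable[[int], None],
    create_handle: Callable[[], Any],
) -> Any:
    """Select the device first, then create the handle (the fixed order)."""
    select_device(device_id)
    return create_handle()


# Wrong order: the handle is created before the device is switched.
stale = FakeRuntime()
wrong_handle = stale.create_handle()
stale.select_device(1)

# Fixed order: the device is switched before the handle is created.
runtime = FakeRuntime()
right_handle = create_handle_on_device(
    1, runtime.select_device, runtime.create_handle
)

print(wrong_handle["device"], stale.current_device)    # 0 1 (mismatch)
print(right_handle["device"], runtime.current_device)  # 1 1 (consistent)
```

With the real libraries, the two callables would presumably be something like `lambda i: cupy.cuda.Device(i).use()` and `cuquantum.cutensornet.create`; that mapping is an assumption based on the description, not a copy of the actual diff.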

Checklist

I have run the tests on a machine with GPUs.
I have performed a self-review of my code.
I have commented hard-to-understand parts of my code.
I have made corresponding changes to the public API documentation.
I have added tests that prove my fix is effective or that my feature works.
I have updated the changelog with any user-facing changes.


cqc-melf

Do you have tests that confirm this solved the issue? Do you usually do this?

PabloAndresCQ · 2024-04-03T15:28:20Z

Automatic testing for this would be hard, since it'd require a machine with multiple devices. I have tested this in the sense that I required this fix in order to be able to run some parallel tasks on Perlmutter (GPU cluster). I'd rather not add a test for this.

Changing the order of commands so that GPU device is assigned before libhandle creation. (a47d1db)

PabloAndresCQ requested a review from cqc-melf April 3, 2024 15:11

cqc-melf reviewed Apr 3, 2024


PabloAndresCQ requested a review from cqc-melf April 3, 2024 16:33

cqc-melf approved these changes Apr 4, 2024


PabloAndresCQ merged commit 0b71367 into develop Apr 4, 2024
6 checks passed

PabloAndresCQ deleted the bugfix/libhandle_creation branch April 4, 2024 08:55
