Splitting model on multiple GPUs is broken (ROCm) #173

Closed
opcod3 opened this issue Jul 20, 2023 · 40 comments

@opcod3

opcod3 commented Jul 20, 2023

Splitting a model between two AMD GPUs (Rx 7900XTX and Radeon VII) results in garbage output (gibberish).
Tested with Llama-2-13B-chat-GPTQ and Llama-2-70B-chat-GPTQ.
Running a model on just one of the two cards, the output seems reasonable, although I can't vouch for the correctness of the 70B model as it cannot fit on a single card.

No flags seem to impact the results, although if I split the model and use --fused_mlp_thd 0, the following error occurs:

Exception
Traceback (most recent call last):
  File "/home/luigi/Documents/temp/LLAMAv2/exllama/venv/lib/python3.11/site-packages/waitress/channel.py", line 428, in service
    task.service()
  File "/home/luigi/Documents/temp/LLAMAv2/exllama/venv/lib/python3.11/site-packages/waitress/task.py", line 168, in service
    self.execute()
  File "/home/luigi/Documents/temp/LLAMAv2/exllama/venv/lib/python3.11/site-packages/waitress/task.py", line 456, in execute
    for chunk in app_iter:
  File "/usr/lib/python3.11/site-packages/werkzeug/wsgi.py", line 289, in __next__
    return self._next()
           ^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/werkzeug/wrappers/response.py", line 31, in _iter_encoded
    for item in iterable:
  File "/usr/lib/python3.11/site-packages/flask/helpers.py", line 149, in generator
    yield from gen
  File "/home/luigi/Documents/temp/LLAMAv2/exllama/webui/session.py", line 694, in respond_multi
    yield from self.respond(self.participants[1], stop_conditions, total_tokens, res_line, num_res_tokens)
  File "/home/luigi/Documents/temp/LLAMAv2/exllama/webui/session.py", line 532, in respond
    gen_token = generator.beam_search()
                ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/luigi/Documents/temp/LLAMAv2/exllama/generator.py", line 487, in beam_search
    if self.settings.beams == 1 and self.settings.beam_length == 1: return self.gen_single_token()
                                                                           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/luigi/Documents/temp/LLAMAv2/exllama/generator.py", line 341, in gen_single_token
    token, _ = self.batched_sample(logits,
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/luigi/Documents/temp/LLAMAv2/exllama/generator.py", line 64, in batched_sample
    if logits.shape[0] == 1: return self.sample(logits, temperature, top_k, top_p, min_p, typical, num)
                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/luigi/Documents/temp/LLAMAv2/exllama/generator.py", line 147, in sample
    sampled_ind = torch.multinomial(top_probs, top_probs.shape[-1] if num == -1 else min(num, top_probs.shape[-1]))
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0

Compiling with #146 applied does not seem to affect the outcome either.

The system is running Arch Linux with python-pytorch-opt-rocm 2.0.1-7.

Output of rocminfo:
ROCk module is loaded
=====================    
HSA System Attributes    
=====================    
Runtime Version:         1.1
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE                              
System Endianness:       LITTLE                             

==========               
HSA Agents               
==========               
*******                  
Agent 1                  
*******                  
  Name:                    AMD Ryzen 9 3950X 16-Core Processor
  Uuid:                    CPU-XX                             
  Marketing Name:          AMD Ryzen 9 3950X 16-Core Processor
  Vendor Name:             CPU                                
  Feature:                 None specified                     
  Profile:                 FULL_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        0(0x0)                             
  Queue Min Size:          0(0x0)                             
  Queue Max Size:          0(0x0)                             
  Queue Type:              MULTI                              
  Node:                    0                                  
  Device Type:             CPU                                
  Cache Info:              
    L1:                      32768(0x8000) KB                   
  Chip ID:                 0(0x0)                             
  ASIC Revision:           0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   3500                               
  BDFID:                   0                                  
  Internal Node ID:        0                                  
  Compute Unit:            32                                 
  SIMDs per CU:            0                                  
  Shader Engines:          0                                  
  Shader Arrs. per Eng.:   0                                  
  WatchPts on Addr. Ranges:1                                  
  Features:                None
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: FINE GRAINED        
      Size:                    131809036(0x7db3f0c) KB            
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    131809036(0x7db3f0c) KB            
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 3                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    131809036(0x7db3f0c) KB            
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
  ISA Info:                
*******                  
Agent 2                  
*******                  
  Name:                    gfx1100                            
  Uuid:                    GPU-94c2e25f00000000               
  Marketing Name:          AMD Radeon RX 7900 XTX             
  Vendor Name:             AMD                                
  Feature:                 KERNEL_DISPATCH                    
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128(0x80)                          
  Queue Min Size:          64(0x40)                           
  Queue Max Size:          131072(0x20000)                    
  Queue Type:              MULTI                              
  Node:                    1                                  
  Device Type:             GPU                                
  Cache Info:              
    L1:                      32(0x20) KB                        
    L2:                      6144(0x1800) KB                    
    L3:                      98304(0x18000) KB                  
  Chip ID:                 29772(0x744c)                      
  ASIC Revision:           0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   2304                               
  BDFID:                   3584                               
  Internal Node ID:        1                                  
  Compute Unit:            96                                 
  SIMDs per CU:            2                                  
  Shader Engines:          6                                  
  Shader Arrs. per Eng.:   2                                  
  WatchPts on Addr. Ranges:4                                  
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      TRUE                               
  Wavefront Size:          32(0x20)                           
  Workgroup Max Size:      1024(0x400)                        
  Workgroup Max Size per Dimension:
    x                        1024(0x400)                        
    y                        1024(0x400)                        
    z                        1024(0x400)                        
  Max Waves Per CU:        32(0x20)                           
  Max Work-item Per CU:    1024(0x400)                        
  Grid Max Size:           4294967295(0xffffffff)             
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)             
    y                        4294967295(0xffffffff)             
    z                        4294967295(0xffffffff)             
  Max fbarriers/Workgrp:   32                                 
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    25149440(0x17fc000) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 2                   
      Segment:                 GROUP                              
      Size:                    64(0x40) KB                        
      Allocatable:             FALSE                              
      Alloc Granule:           0KB                                
      Alloc Alignment:         0KB                                
      Accessible by all:       FALSE                              
  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx1100         
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Size:      1024(0x400)                        
      Workgroup Max Size per Dimension:
        x                        1024(0x400)                        
        y                        1024(0x400)                        
        z                        1024(0x400)                        
      Grid Max Size:           4294967295(0xffffffff)             
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)             
        y                        4294967295(0xffffffff)             
        z                        4294967295(0xffffffff)             
      FBarrier Max Size:       32                                 
*******                  
Agent 3                  
*******                  
  Name:                    gfx906                             
  Uuid:                    GPU-ed7030e172da5eba               
  Marketing Name:          AMD Radeon VII                     
  Vendor Name:             AMD                                
  Feature:                 KERNEL_DISPATCH                    
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128(0x80)                          
  Queue Min Size:          64(0x40)                           
  Queue Max Size:          131072(0x20000)                    
  Queue Type:              MULTI                              
  Node:                    2                                  
  Device Type:             GPU                                
  Cache Info:              
    L1:                      16(0x10) KB                        
    L2:                      8192(0x2000) KB                    
  Chip ID:                 26287(0x66af)                      
  ASIC Revision:           1(0x1)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   1801                               
  BDFID:                   4352                               
  Internal Node ID:        2                                  
  Compute Unit:            60                                 
  SIMDs per CU:            4                                  
  Shader Engines:          4                                  
  Shader Arrs. per Eng.:   1                                  
  WatchPts on Addr. Ranges:4                                  
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      TRUE                               
  Wavefront Size:          64(0x40)                           
  Workgroup Max Size:      1024(0x400)                        
  Workgroup Max Size per Dimension:
    x                        1024(0x400)                        
    y                        1024(0x400)                        
    z                        1024(0x400)                        
  Max Waves Per CU:        40(0x28)                           
  Max Work-item Per CU:    2560(0xa00)                        
  Grid Max Size:           4294967295(0xffffffff)             
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)             
    y                        4294967295(0xffffffff)             
    z                        4294967295(0xffffffff)             
  Max fbarriers/Workgrp:   32                                 
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    16760832(0xffc000) KB              
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 2                   
      Segment:                 GROUP                              
      Size:                    64(0x40) KB                        
      Allocatable:             FALSE                              
      Alloc Granule:           0KB                                
      Alloc Alignment:         0KB                                
      Accessible by all:       FALSE                              
  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx906:sramecc+:xnack-
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Size:      1024(0x400)                        
      Workgroup Max Size per Dimension:
        x                        1024(0x400)                        
        y                        1024(0x400)                        
        z                        1024(0x400)                        
      Grid Max Size:           4294967295(0xffffffff)             
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)             
        y                        4294967295(0xffffffff)             
        z                        4294967295(0xffffffff)             
      FBarrier Max Size:       32                                 
*** Done *** 

I am available to do any testing that may help isolate the issue; I can also try a third card (RX 6800XT).

@ardfork
Contributor

ardfork commented Jul 20, 2023

I only have a single GPU, so I can't test. But @jmoney7823956789378 successfully ran exllama with two MI60s, so unless a regression has happened, it should work.

How exactly are you running it? HSA_OVERRIDE_GFX_VERSION should be unset, since you have multiple GPUs that are all supported, and you should hide your APU with HIP_VISIBLE_DEVICES and ROCR_VISIBLE_DEVICES.
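
A minimal sketch of such a launch, assuming the two discrete GPUs enumerate as devices 0 and 1 (check rocminfo for the actual ordering):

import os

# Hypothetical launcher: pin ROCm to the discrete GPUs before torch initializes HIP.
os.environ.pop("HSA_OVERRIDE_GFX_VERSION", None)  # leave unset when all GPUs are supported
os.environ["HIP_VISIBLE_DEVICES"] = "0,1"         # assumed device indices
os.environ["ROCR_VISIBLE_DEVICES"] = "0,1"

import torch  # import only after setting the variables so the HIP runtime sees them
print(torch.cuda.device_count())  # should report 2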

@jmoney7823956789378

Yep, I've had this issue with certain models, and I don't think I ever figured out a 100% solution.
My theory is that something about how the model was quantized, specifically group size or act-order, may have something to do with it. (I'm talking out of my ass right now since I don't have my MI60s online to test at the moment.)

One weird thing I noticed with MI60s specifically was that plain GPTQ-for-LLaMA ran faster than exllama...

btw, I am selling them if any of you smart rocm/exllama devs are interested :)

@Ph0rk0z
Contributor

Ph0rk0z commented Jul 20, 2023

How do the MI60s perform compared to 3090s?

@jmoney7823956789378

How do the MI60s perform compared to 3090s?

The 3090s are much faster (currently). I think the HBM2 in the MI60s has a lot of potential, but I have no idea how to take advantage of it.

@opcod3
Author

opcod3 commented Jul 20, 2023

I am not using HSA_OVERRIDE_GFX_VERSION or any other ROCm flags except for ROCM_HOME. I also don't have an APU.

@jmoney7823956789378 You say your issue was model-specific; did the models that fail on two cards run on a single one?

If they did, then what I am about to say is certainly false. Otherwise, I have a feeling that some buffers are architecture-specific and stored differently between the two cards, and for some reason they are not getting converted when copied from one card to the other. That feels like the only issue that could occur, since the models work individually. But I may be completely off base, as I don't have much experience with GPU compute.

@Ph0rk0z
Contributor

Ph0rk0z commented Jul 20, 2023

I am very tempted by the MI60 or MI25 (x4), but I'd have to re-install my 2nd CPU and another power supply. I don't even want to know what it would idle at. Right now it's 200-230 W at idle and 1000 W during inference or training. The GPUs would be cheap, but the power bill will come get me.

@opcod3
Author

opcod3 commented Jul 20, 2023

Aren't the MI25 cards unsupported by new ROCm releases? And AFAIK the MI50 (same arch as the MI60) will be EOL next year, so no more new features for them. No idea how big of a deal that is.

@jmoney7823956789378

You say your issue was model-specific; did the models that fail on two cards run on a single one?

I'm not able to recall specifically. I also recommend doing a fresh install of the ROCm drivers, rebooting, then attempting inference if possible. As the ancient IT proverb goes, "turn it off and back on again" has worked more often than I'd like to admit.

You could be right about the change in arch, but I can't confirm. :(

@opcod3
Author

opcod3 commented Jul 20, 2023

I've already rebooted a few times. Retrying once more won't hurt, though.

I'll also try the docker images provided by AMD; maybe that'll help.

@jmoney7823956789378

I am very tempted by the MI60 or MI25 (x4), but I'd have to re-install my 2nd CPU and another power supply. I don't even want to know what it would idle at. Right now it's 200-230 W at idle and 1000 W during inference or training. The GPUs would be cheap, but the power bill will come get me.

I know what you mean. Back when I was still messing with the MI60s, I ended up with a second cheap EPYC system, since they have all that PCIe on a relatively low power budget.

@jmoney7823956789378

I've already rebooted a few times. Retrying once more won't hurt, though.

I'll also try the docker images provided by AMD; maybe that'll help.

I just tested out TheBloke/Llama-2-13B-chat-GPTQ and TheBloke/Llama-2-70B-chat-GPTQ on the two MI60s:

Single MI60, 13B:
Output generated in 12.60 seconds (15.88 tokens/s, 200 tokens, context 37, seed 1316458672)

Two MI60, 13B:
Output generated in 12.96 seconds (15.44 tokens/s, 200 tokens, context 37, seed 1404134278)

Two MI60, 70B:
Output generated in 59.02 seconds (3.39 tokens/s, 200 tokens, context 37, seed 742985185)

@jmoney7823956789378

Same models on GPTQ-for-LLaMa, two MI60s:

13B:
Output generated in 18.90 seconds (10.53 tokens/s, 199 tokens, context 38, seed 17468020)

70B:
Output generated in 44.63 seconds (4.46 tokens/s, 199 tokens, context 38, seed 979069712)

@turboderp
Owner

Two MI60, 70B:
Output generated in 59.02 seconds (3.39 tokens/s, 200 tokens, context 37, seed 742985185)

There's an update coming soon that should bump this up a little bit. Not sure how much, but the GQA implementation is a little stupid at the moment with a relatively expensive reshaping of the K/V cache to make the number of heads align with the queries.

Flash Attention 2.0 supports grouping directly, so it's going to be faster on 70B, if only for avoiding that reshaping step. The only holdup at the moment is that it currently does causal masking in kind of a broken way. But they're working on that over there, and I'm trying to find a workaround in the meantime.

@jmoney7823956789378

Flash Attention doesn't build on ROCm, and supposedly never will (according to their devs).

@ardfork
Contributor

ardfork commented Jul 20, 2023

Flash Attention doesn't build on ROCm, and supposedly never will (according to their devs).

https://github.com/ROCmSoftwarePlatform/flash-attention

@jmoney7823956789378

Flash Attention doesn't build on ROCm, and supposedly never will (according to their devs).

https://github.com/ROCmSoftwarePlatform/flash-attention

hooooooooooooly shit. rocm actually developing rocm? preposterous.

@ardfork
Contributor

ardfork commented Jul 20, 2023

Why do you think PyTorch, TensorFlow, and Triton work on ROCm? They are adding that support themselves. They just rarely touch end-user software, but they add support to all the big libraries. Intel does the same.

@jmoney7823956789378

Why do you think PyTorch, TensorFlow, and Triton work on ROCm? They are adding that support themselves. They just rarely touch end-user software, but they add support to all the big libraries. Intel does the same.

Just joking around. Here's some output from the ROCm flash-attention test docker container. An absolute TON of lines were removed, all stating "...does not support this problem"

Deterministic: False
Performance Mode: True
Using QLoop: True
FlashAttention - Forward pass
DeviceGroupedMultiheadAttentionForward_Xdl_CShuffle_V2<256, 128, 128, 32, 8, 8, 128, 64, 32, 2, MNKOPadding, ASpecDefault, B0SpecDefault, B1SpecDefault, CSpecDefault, MaskDisabled> does not support this problem

---truncated for brevity---

<torch.utils.benchmark.utils.common.Measurement object at 0x7f98b2f95460>
fn_amp(*inputs, **kwinputs)
  4.19 ms
  1 measurement, 30 runs , 8 threads
FlashAttention - Backward pass
DeviceGroupedMultiheadAttentionForward_Xdl_CShuffle_V2<256, 128, 128, 32, 8, 8, 128, 64, 32, 2, MNKOPadding, ASpecDefault, B0SpecDefault, B1SpecDefault, CSpecDefault, MaskDisabled> does not support this problem
DeviceGroupedMultiheadAttentionBackward_Qloop_Xdl_CShuffle_V1<256, 64, 128, 64, 8, 8, 64, 64, 32, 2, MNKOPadding, ASpecDefault, B0SpecDefault, B1SpecDefault, CSpecDefault, MaskDisabled> does not support this problem

---truncated for brevity---

<torch.utils.benchmark.utils.common.Measurement object at 0x7f98b2f95d60>
y.backward(grad, retain_graph=True)
  13.96 ms
  1 measurement, 30 runs , 8 threads
FlashAttention - Forward + Backward pass
DeviceGroupedMultiheadAttentionForward_Xdl_CShuffle_V2<256, 128, 128, 32, 8, 8, 128, 64, 32, 2, MNKOPadding, ASpecDefault, B0SpecDefault, B1SpecDefault, CSpecDefault, MaskDisabled> does not support this problem
DeviceGroupedMultiheadAttentionBackward_Qloop_Xdl_CShuffle_V1<256, 64, 128, 64, 8, 8, 64, 64, 32, 2, MNKOPadding, ASpecDefault, B0SpecDefault, B1SpecDefault, CSpecDefault, MaskDisabled> does not support this problem

---truncated for brevity---

<torch.utils.benchmark.utils.common.Measurement object at 0x7f98b2f9f430>
f(grad, *inputs, **kwinputs)
  19.14 ms
  1 measurement, 30 runs , 8 threads
PyTorch Standard Attention - Forward pass
<torch.utils.benchmark.utils.common.Measurement object at 0x7f98b2f95340>
fn_amp(*inputs, **kwinputs)
  61.16 ms
  1 measurement, 30 runs , 8 threads
PyTorch Standard Attention - Backward pass
<torch.utils.benchmark.utils.common.Measurement object at 0x7f98b2f9f220>
y.backward(grad, retain_graph=True)
  138.70 ms
  1 measurement, 30 runs , 8 threads
PyTorch Standard Attention - Forward + Backward pass
<torch.utils.benchmark.utils.common.Measurement object at 0x7f98b2f9ad30>
f(grad, *inputs, **kwinputs)
  200.90 ms
  1 measurement, 30 runs , 8 threads

Besides that, the numbers look promising!

@Ph0rk0z
Contributor

Ph0rk0z commented Jul 21, 2023

Two MI60, 70B:
Output generated in 59.02 seconds (3.39 tokens/s, 200 tokens, context 37, seed 742985185)

That isn't horrible, at least. It would be good for 4x MI25, since together they cost what one MI60 does.

pytorch

PyTorch ROCm doesn't work? It worked on my RX580.

@jmoney7823956789378

That isn't horrible, at least. It would be good for 4x MI25, since together they cost what one MI60 does.

True, though the MI60s are a slightly newer generation of chip. I only got to test MI25s for a very short time, before exllama was out. If I remember correctly, they did about 5 t/s on 13B (stock BIOS).

@Ph0rk0z
Contributor

Ph0rk0z commented Jul 21, 2023

That's not that great. People were using them for SD and kept saying they had good FP16 perf. I think they were flashing them though.

@jmoney7823956789378

That's not that great. People were using them for SD and kept saying they had good FP16 perf. I think they were flashing them though.

Yep, I had flashed it after the fact but I don't think I have the perf stats saved.

@opcod3
Author

opcod3 commented Jul 21, 2023

Guys, back to the issue at hand: does anyone have any tips on how to figure out where the computation is breaking?

Is it possible to access the raw tensors with PyTorch, as in the actual bytes that I could look through with a hex editor, to see if they are what I expect?
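
Something along these lines is what I have in mind (a minimal sketch, reading the bytes from a CPU copy of the tensor):

import torch

# Pull a tensor back to the host and dump its raw bytes so they can be
# diffed between devices or inspected in a hex editor.
t = torch.randn(4, 4, dtype=torch.float16, device="cuda:0")
raw = t.cpu().contiguous().numpy().tobytes()

with open("tensor.bin", "wb") as f:
    f.write(raw)
print(raw[:16].hex())  # quick peek at the first bytes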

@jmoney7823956789378

Another shot in the dark, but are you able to roll back to ROCm 5.5? That's what I have on my MI60s, under bare-metal Ubuntu 22.04.

@turboderp
Owner

turboderp commented Jul 21, 2023

Guys, back to the issue at hand: does anyone have any tips on how to figure out where the computation is breaking?

I would say if the model works on either GPU but not when split across both, you'll want to start by focusing on where the hidden state is moved from one GPU to the next. I assume you've already tried --gpu-peer-fix, but otherwise the _move_tensor() function in model.py should wrap every such copy. If you don't have a debugger (I recommend trying PyCharm, which is free and pretty competent), you could add some debug output:

def _move_tensor(tensor, new_device, name, config):
    device = str(tensor.device)
    if device == new_device: return tensor

    # Print the tensor before the copy...
    print("------------------------------------------")
    print(f"Moving tensor {name} from {device} to {new_device}")
    print(f"Tensor on {device}:")
    print(tensor)

    if config.gpu_peer_fix:
        # ...and, with --gpu-peer-fix, after the intermediate hop through system RAM...
        if device.startswith("cuda:") and str(new_device).startswith("cuda:"):
            tensor = tensor.to("cpu")
            print("Tensor on CPU:")
            print(tensor)

    tensor = tensor.to(new_device)

    # ...and again once it has landed on the target device.
    print(f"Tensor on {new_device}:")
    print(tensor)

    return tensor

@opcod3
Author

opcod3 commented Jul 21, 2023

Another shot in the dark, but are you able to roll back to ROCm 5.5? that's what I have on my MI60s, under baremetal ubuntu 22.04.

Tried with ROCm 5.5.1 and the 6800XT instead of the 7900XTX; no difference. Still garbage.

I assume you've already tried --gpu-peer-fix, but otherwise the _move_tensor() function in model.py should wrap every such copy

Yeah, I tried --gpu-peer-fix. No change...

I had already tried instrumenting _move_tensor, but not quite as extensively as you suggested. I'll try again. I had no idea you could print a tensor while it's on a non-CPU device. Thanks for the tips!

@opcod3
Author

opcod3 commented Jul 21, 2023

Well damn, it seems that when splitting the work over two GPUs, the hidden_states tensor suddenly becomes NaN:

Moving tensor hidden_states from cpu to cuda:0
Tensor on cpu:
tensor([[[-0.0197753906, -0.0042114258,  0.0005874634,  ...,
           0.0012664795,  0.0031738281,  0.0076293945],
         [ 0.0087890625, -0.0012588501, -0.0439453125,  ...,
          -0.0089721680,  0.0034179688,  0.0088500977],
         [-0.0013961792, -0.0218505859, -0.0128784180,  ...,
           0.0363769531, -0.0120239258,  0.0422363281],
         ...,
         [ 0.0061035156, -0.0081176758,  0.0139770508,  ...,
          -0.0244140625, -0.0161132812,  0.0064086914],
         [ 0.0319824219, -0.0166015625,  0.0222167969,  ...,
           0.0317382812,  0.0131835938, -0.0017623901],
         [ 0.0034484863, -0.0013504028, -0.0005683899,  ...,
           0.0007972717,  0.0017318726, -0.0008697510]]], dtype=torch.float16)
Tensor on cuda:0:
tensor([[[-0.0197753906, -0.0042114258,  0.0005874634,  ...,
           0.0012664795,  0.0031738281,  0.0076293945],
         [ 0.0087890625, -0.0012588501, -0.0439453125,  ...,
          -0.0089721680,  0.0034179688,  0.0088500977],
         [-0.0013961792, -0.0218505859, -0.0128784180,  ...,
           0.0363769531, -0.0120239258,  0.0422363281],
         ...,
         [ 0.0061035156, -0.0081176758,  0.0139770508,  ...,
          -0.0244140625, -0.0161132812,  0.0064086914],
         [ 0.0319824219, -0.0166015625,  0.0222167969,  ...,
           0.0317382812,  0.0131835938, -0.0017623901],
         [ 0.0034484863, -0.0013504028, -0.0005683899,  ...,
           0.0007972717,  0.0017318726, -0.0008697510]]], device='cuda:0',
       dtype=torch.float16)
------------------------------------------
Moving tensor hidden_states from cuda:0 to cuda:1
Tensor on cuda:0:
tensor([[[nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan],
         ...,
         [nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan]]], device='cuda:0',
       dtype=torch.float16)
Tensor on CPU:
tensor([[[nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan],
         ...,
         [nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan]]], dtype=torch.float16)
Tensor on cuda:1:
tensor([[[nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan],
         ...,
         [nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan]]], device='cuda:1',
       dtype=torch.float16)
------------------------------------------
Moving tensor logits from cuda:1 to cpu
Tensor on cuda:1:
tensor([[[nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan],
         ...,
         [nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan]]], device='cuda:1')
Tensor on cpu:
tensor([[[nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan],
         ...,
         [nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan]]])
------------------------------------------
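
A per-layer check could narrow down where the NaNs first appear (a sketch; layers and the forward call here are stand-ins for exllama's actual loop in model.py):

import torch

def find_first_nan(hidden_states, layers):
    # Walk the decoder layers and report the first one whose output contains NaN.
    for i, layer in enumerate(layers):
        hidden_states = layer(hidden_states)
        if torch.isnan(hidden_states).any():
            return i
    return -1  # no NaNs produced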

@IMbackK

IMbackK commented Jul 21, 2023

Pretty clearly a bug in ROCm or PyTorch; definitely report it upstream.

@ardfork
Contributor

ardfork commented Jul 21, 2023

I agree that this should be reported upstream. You can also set AMD_LOG_LEVEL to 2 or more to maybe get some extra information about where the problem is.

@turboderp
Owner

turboderp commented Jul 21, 2023

I don't think that's it. If it were a problem with moving tensors between devices, you should see it start out looking correct on cuda:0 and then go bad as it's moved to a different device. According to this, you get the NaN tensor on cuda:0 already.

@jmoney7823956789378

What kind of PCIe slots are the GPUs in?
My cards don't like to be in anything but full x16, but that could be due to the nature of the MI series.
Can you show me what's listed for your GPUs when you use lspci -nnvk? This will produce a lot of output, so consider using lspci -nnvk | less.

@IMbackK

IMbackK commented Jul 22, 2023

I can also reproduce this with any combination of RX 6800 XT, MI25, and MI50s. All of my devices are connected at full x16 PCIe 3.0, running a kernel with CONFIG_HSA_AMD_P2P (i.e., ROCm can and does use PCIe P2P transfers).

@opcod3
Author

opcod3 commented Jul 22, 2023

The GPUs are in x16 slots, running at up to PCIe 4.0, though they are only connected at x8 width.

If I place the 7900XTX in the second slot, it hangs when running exllama (nothing ever gets loaded into VRAM). It works fine in the first slot. Other cards work fine in the second slot.

Running two cards with AMD_LOG_LEVEL=1 occasionally produces the following lines:

:1:devprogram.cpp           :1874: 0099726403 us: 5055 : [tid:0x7f52d92b3740] Error: The program ISA amdgcn-amd-amdhsa--gfx1030 is not compatible with the device ISA amdgcn-amd-amdhsa--gfx906:sramecc+:xnack-
Error: create kernel metadata map using COMgr
Error: Cannot Find Global Var Sizes
Error: Cannot create kernels.

It seems that PyTorch may be attempting to load the wrong kernel onto the GPU, or perhaps not all of the correct offload architectures are being compiled into the module.

@IMbackK

IMbackK commented Jul 22, 2023

This is the same issue I have with plain transformers on just one GPU; see ROCm/ROCm#2328 and huggingface/transformers#25007.

Please write a comment there that you are having a similar issue with exllama, and try the transformers script on your setup.

We should probably also refer the issue to PyTorch.

@opcod3
Author

opcod3 commented Jul 22, 2023

I can't replicate it with your scripts. But I can only use two GPUs at a time in my system.

@opcod3
Author

opcod3 commented Jul 22, 2023

After doing some more research, I believe the issue may be in rocBLAS.

After the failing hipModuleLoadData call, the code still tries to load a bunch of functions:

:3:hip_module.cpp           :73  : 16124483089 us: 57772: [tid:0x7f9bbb3e1740] hipModuleGetFunction ( 0x7ffffeff8038, 0x55d75b402040, Cijk_Ailk_Bljk_HB_MT64x64x8_SN_1LDSB0_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS0_ASE_ASGT_ASLT_ASAE01_ASCE01_ASEM1_AAC0_BL0_BS1_DTL0_DTVA0_DVO0_ETSP_EPS0_FL0_GRPM1_GRVW1_GSU1_GSUASB_GLS0_ISA000_IU1_K1_KLS_LBSPP0_LPA0_LPB0_LDL1_LRVW1_LWPMn1_LDW0_FMA_MIAV0_MDA2_MO40_NTA0_NTB0_NTC0_NTD0_NEPBS0_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR1_PLR1_RK0_SIA1_SS0_SU32_SUM0_SUS256_SCIUI1_SPO0_SRVW0_SSO0_SVW4_SNLL0_TT4_4_TLDS0_USFGRO0_VAW2_VS1_VW1_WSGRA0_WSGRB0_WS64_WG16_16_1_WGM8 )
:1:hip_code_object.cpp      :606 : 16124483101 us: 57772: [tid:0x7f9bbb3e1740] Cannot find the function: Cijk_Ailk_Bljk_HB_MT64x64x8_SN_1LDSB0_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS0_ASE_ASGT_ASLT_ASAE01_ASCE01_ASEM1_AAC0_BL0_BS1_DTL0_DTVA0_DVO0_ETSP_EPS0_FL0_GRPM1_GRVW1_GSU1_GSUASB_GLS0_ISA000_IU1_K1_KLS_LBSPP0_LPA0_LPB0_LDL1_LRVW1_LWPMn1_LDW0_FMA_MIAV0_MDA2_MO40_NTA0_NTB0_NTC0_NTD0_NEPBS0_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR1_PLR1_RK0_SIA1_SS0_SU32_SUM0_SUS256_SCIUI1_SPO0_SRVW0_SSO0_SVW4_SNLL0_TT4_4_TLDS0_USFGRO0_VAW2_VS1_VW1_WSGRA0_WSGRB0_WS64_WG16_16_1_WGM8 
:1:hip_module.cpp           :83  : 16124483116 us: 57772: [tid:0x7f9bbb3e1740] Cannot find the function: Cijk_Ailk_Bljk_HB_MT64x64x8_SN_1LDSB0_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS0_ASE_ASGT_ASLT_ASAE01_ASCE01_ASEM1_AAC0_BL0_BS1_DTL0_DTVA0_DVO0_ETSP_EPS0_FL0_GRPM1_GRVW1_GSU1_GSUASB_GLS0_ISA000_IU1_K1_KLS_LBSPP0_LPA0_LPB0_LDL1_LRVW1_LWPMn1_LDW0_FMA_MIAV0_MDA2_MO40_NTA0_NTB0_NTC0_NTD0_NEPBS0_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR1_PLR1_RK0_SIA1_SS0_SU32_SUM0_SUS256_SCIUI1_SPO0_SRVW0_SSO0_SVW4_SNLL0_TT4_4_TLDS0_USFGRO0_VAW2_VS1_VW1_WSGRA0_WSGRB0_WS64_WG16_16_1_WGM8 for module: 0x5b402040 

:3:hip_module.cpp           :84  : 16124483127 us: 57772: [tid:0x7f9bbb3e1740] hipModuleGetFunction: Returned hipErrorNotFound : 
:3:hip_module.cpp           :73  : 16124483132 us: 57772: [tid:0x7f9bbb3e1740] hipModuleGetFunction ( 0x7ffffeff8038, 0x55d75b95bef0, Cijk_Ailk_Bljk_HB_MT64x64x8_SN_1LDSB0_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS0_ASE_ASGT_ASLT_ASAE01_ASCE01_ASEM1_AAC0_BL0_BS1_DTL0_DTVA0_DVO0_ETSP_EPS0_FL0_GRPM1_GRVW1_GSU1_GSUASB_GLS0_ISA000_IU1_K1_KLS_LBSPP0_LPA0_LPB0_LDL1_LRVW1_LWPMn1_LDW0_FMA_MIAV0_MDA2_MO40_NTA0_NTB0_NTC0_NTD0_NEPBS0_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR1_PLR1_RK0_SIA1_SS0_SU32_SUM0_SUS256_SCIUI1_SPO0_SRVW0_SSO0_SVW4_SNLL0_TT4_4_TLDS0_USFGRO0_VAW2_VS1_VW1_WSGRA0_WSGRB0_WS64_WG16_16_1_WGM8 )
:1:hip_code_object.cpp      :606 : 16124483143 us: 57772: [tid:0x7f9bbb3e1740] Cannot find the function: Cijk_Ailk_Bljk_HB_MT64x64x8_SN_1LDSB0_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS0_ASE_ASGT_ASLT_ASAE01_ASCE01_ASEM1_AAC0_BL0_BS1_DTL0_DTVA0_DVO0_ETSP_EPS0_FL0_GRPM1_GRVW1_GSU1_GSUASB_GLS0_ISA000_IU1_K1_KLS_LBSPP0_LPA0_LPB0_LDL1_LRVW1_LWPMn1_LDW0_FMA_MIAV0_MDA2_MO40_NTA0_NTB0_NTC0_NTD0_NEPBS0_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR1_PLR1_RK0_SIA1_SS0_SU32_SUM0_SUS256_SCIUI1_SPO0_SRVW0_SSO0_SVW4_SNLL0_TT4_4_TLDS0_USFGRO0_VAW2_VS1_VW1_WSGRA0_WSGRB0_WS64_WG16_16_1_WGM8 
:1:hip_module.cpp           :83  : 16124483153 us: 57772: [tid:0x7f9bbb3e1740] Cannot find the function: Cijk_Ailk_Bljk_HB_MT64x64x8_SN_1LDSB0_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS0_ASE_ASGT_ASLT_ASAE01_ASCE01_ASEM1_AAC0_BL0_BS1_DTL0_DTVA0_DVO0_ETSP_EPS0_FL0_GRPM1_GRVW1_GSU1_GSUASB_GLS0_ISA000_IU1_K1_KLS_LBSPP0_LPA0_LPB0_LDL1_LRVW1_LWPMn1_LDW0_FMA_MIAV0_MDA2_MO40_NTA0_NTB0_NTC0_NTD0_NEPBS0_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR1_PLR1_RK0_SIA1_SS0_SU32_SUM0_SUS256_SCIUI1_SPO0_SRVW0_SSO0_SVW4_SNLL0_TT4_4_TLDS0_USFGRO0_VAW2_VS1_VW1_WSGRA0_WSGRB0_WS64_WG16_16_1_WGM8 for module: 0x5b95bef0 

:3:hip_module.cpp           :84  : 16124483163 us: 57772: [tid:0x7f9bbb3e1740] hipModuleGetFunction: Returned hipErrorNotFound : 
:3:hip_module.cpp           :73  : 16124483168 us: 57772: [tid:0x7f9bbb3e1740] hipModuleGetFunction ( 0x7ffffeff8038, 0x55d75b3c5600, Cijk_Ailk_Bljk_HB_MT64x64x8_SN_1LDSB0_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS0_ASE_ASGT_ASLT_ASAE01_ASCE01_ASEM1_AAC0_BL0_BS1_DTL0_DTVA0_DVO0_ETSP_EPS0_FL0_GRPM1_GRVW1_GSU1_GSUASB_GLS0_ISA000_IU1_K1_KLS_LBSPP0_LPA0_LPB0_LDL1_LRVW1_LWPMn1_LDW0_FMA_MIAV0_MDA2_MO40_NTA0_NTB0_NTC0_NTD0_NEPBS0_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR1_PLR1_RK0_SIA1_SS0_SU32_SUM0_SUS256_SCIUI1_SPO0_SRVW0_SSO0_SVW4_SNLL0_TT4_4_TLDS0_USFGRO0_VAW2_VS1_VW1_WSGRA0_WSGRB0_WS64_WG16_16_1_WGM8 )
:1:hip_code_object.cpp      :606 : 16124483178 us: 57772: [tid:0x7f9bbb3e1740] Cannot find the function: Cijk_Ailk_Bljk_HB_MT64x64x8_SN_1LDSB0_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS0_ASE_ASGT_ASLT_ASAE01_ASCE01_ASEM1_AAC0_BL0_BS1_DTL0_DTVA0_DVO0_ETSP_EPS0_FL0_GRPM1_GRVW1_GSU1_GSUASB_GLS0_ISA000_IU1_K1_KLS_LBSPP0_LPA0_LPB0_LDL1_LRVW1_LWPMn1_LDW0_FMA_MIAV0_MDA2_MO40_NTA0_NTB0_NTC0_NTD0_NEPBS0_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR1_PLR1_RK0_SIA1_SS0_SU32_SUM0_SUS256_SCIUI1_SPO0_SRVW0_SSO0_SVW4_SNLL0_TT4_4_TLDS0_USFGRO0_VAW2_VS1_VW1_WSGRA0_WSGRB0_WS64_WG16_16_1_WGM8 
:1:hip_module.cpp           :83  : 16124483191 us: 57772: [tid:0x7f9bbb3e1740] Cannot find the function: Cijk_Ailk_Bljk_HB_MT64x64x8_SN_1LDSB0_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS0_ASE_ASGT_ASLT_ASAE01_ASCE01_ASEM1_AAC0_BL0_BS1_DTL0_DTVA0_DVO0_ETSP_EPS0_FL0_GRPM1_GRVW1_GSU1_GSUASB_GLS0_ISA000_IU1_K1_KLS_LBSPP0_LPA0_LPB0_LDL1_LRVW1_LWPMn1_LDW0_FMA_MIAV0_MDA2_MO40_NTA0_NTB0_NTC0_NTD0_NEPBS0_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR1_PLR1_RK0_SIA1_SS0_SU32_SUM0_SUS256_SCIUI1_SPO0_SRVW0_SSO0_SVW4_SNLL0_TT4_4_TLDS0_USFGRO0_VAW2_VS1_VW1_WSGRA0_WSGRB0_WS64_WG16_16_1_WGM8 for module: 0x5b3c5600 

:3:hip_module.cpp           :84  : 16124483201 us: 57772: [tid:0x7f9bbb3e1740] hipModuleGetFunction: Returned hipErrorNotFound : 
:3:hip_module.cpp           :73  : 16124483206 us: 57772: [tid:0x7f9bbb3e1740] hipModuleGetFunction ( 0x7ffffeff8038, 0x55d77db8cf90, Cijk_Ailk_Bljk_HB_MT64x64x8_SN_1LDSB0_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS0_ASE_ASGT_ASLT_ASAE01_ASCE01_ASEM1_AAC0_BL0_BS1_DTL0_DTVA0_DVO0_ETSP_EPS0_FL0_GRPM1_GRVW1_GSU1_GSUASB_GLS0_ISA000_IU1_K1_KLS_LBSPP0_LPA0_LPB0_LDL1_LRVW1_LWPMn1_LDW0_FMA_MIAV0_MDA2_MO40_NTA0_NTB0_NTC0_NTD0_NEPBS0_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR1_PLR1_RK0_SIA1_SS0_SU32_SUM0_SUS256_SCIUI1_SPO0_SRVW0_SSO0_SVW4_SNLL0_TT4_4_TLDS0_USFGRO0_VAW2_VS1_VW1_WSGRA0_WSGRB0_WS64_WG16_16_1_WGM8 )
:1:hip_code_object.cpp      :606 : 16124483215 us: 57772: [tid:0x7f9bbb3e1740] Cannot find the function: Cijk_Ailk_Bljk_HB_MT64x64x8_SN_1LDSB0_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS0_ASE_ASGT_ASLT_ASAE01_ASCE01_ASEM1_AAC0_BL0_BS1_DTL0_DTVA0_DVO0_ETSP_EPS0_FL0_GRPM1_GRVW1_GSU1_GSUASB_GLS0_ISA000_IU1_K1_KLS_LBSPP0_LPA0_LPB0_LDL1_LRVW1_LWPMn1_LDW0_FMA_MIAV0_MDA2_MO40_NTA0_NTB0_NTC0_NTD0_NEPBS0_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR1_PLR1_RK0_SIA1_SS0_SU32_SUM0_SUS256_SCIUI1_SPO0_SRVW0_SSO0_SVW4_SNLL0_TT4_4_TLDS0_USFGRO0_VAW2_VS1_VW1_WSGRA0_WSGRB0_WS64_WG16_16_1_WGM8 
:1:hip_module.cpp           :83  : 16124483226 us: 57772: [tid:0x7f9bbb3e1740] Cannot find the function: Cijk_Ailk_Bljk_HB_MT64x64x8_SN_1LDSB0_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS0_ASE_ASGT_ASLT_ASAE01_ASCE01_ASEM1_AAC0_BL0_BS1_DTL0_DTVA0_DVO0_ETSP_EPS0_FL0_GRPM1_GRVW1_GSU1_GSUASB_GLS0_ISA000_IU1_K1_KLS_LBSPP0_LPA0_LPB0_LDL1_LRVW1_LWPMn1_LDW0_FMA_MIAV0_MDA2_MO40_NTA0_NTB0_NTC0_NTD0_NEPBS0_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR1_PLR1_RK0_SIA1_SS0_SU32_SUM0_SUS256_SCIUI1_SPO0_SRVW0_SSO0_SVW4_SNLL0_TT4_4_TLDS0_USFGRO0_VAW2_VS1_VW1_WSGRA0_WSGRB0_WS64_WG16_16_1_WGM8 for module: 0x7db8cf90 

:3:hip_module.cpp           :84  : 16124483236 us: 57772: [tid:0x7f9bbb3e1740] hipModuleGetFunction: Returned hipErrorNotFound : 

If I grep for one of these names inside the ROCm directory, I find that these functions are present in the following files:

grep: /opt/rocm/lib/rocblas/library/TensileLibrary_Type_HH_Contraction_l_Ailk_Bljk_Cijk_Dijk_fallback_gfx90a-xnack-.hsaco: binary file matches
grep: /opt/rocm/lib/rocblas/library/TensileLibrary_Type_HH_Contraction_l_Ailk_Bljk_Cijk_Dijk_fallback_gfx1030.hsaco: binary file matches
grep: /opt/rocm/lib/rocblas/library/TensileLibrary_Type_HH_Contraction_l_Ailk_Bljk_Cijk_Dijk_fallback_gfx900.hsaco: binary file matches
grep: /opt/rocm/lib/rocblas/library/TensileLibrary_Type_HH_Contraction_l_Ailk_Bljk_Cijk_Dijk_fallback_gfx1101.hsaco: binary file matches
grep: /opt/rocm/lib/rocblas/library/TensileLibrary_Type_HH_Contraction_l_Ailk_Bljk_Cijk_Dijk_fallback_gfx1010.hsaco: binary file matches
grep: /opt/rocm/lib/rocblas/library/TensileLibrary_Type_HH_Contraction_l_Ailk_Bljk_Cijk_Dijk_fallback_gfx803.hsaco: binary file matches
grep: /opt/rocm/lib/rocblas/library/TensileLibrary_Type_HH_Contraction_l_Ailk_Bljk_Cijk_Dijk_fallback_gfx1100.hsaco: binary file matches
grep: /opt/rocm/lib/rocblas/library/TensileLibrary_Type_HH_Contraction_l_Ailk_Bljk_Cijk_Dijk_fallback_gfx906-xnack-.hsaco: binary file matches
grep: /opt/rocm/lib/rocblas/library/TensileLibrary_Type_HH_Contraction_l_Ailk_Bljk_Cijk_Dijk_fallback_gfx90a-xnack+.hsaco: binary file matches
grep: /opt/rocm/lib/rocblas/library/TensileLibrary_Type_HH_Contraction_l_Ailk_Bljk_Cijk_Dijk_fallback_gfx1012.hsaco: binary file matches
grep: /opt/rocm/lib/rocblas/library/TensileLibrary_Type_HH_Contraction_l_Ailk_Bljk_Cijk_Dijk_fallback_gfx1102.hsaco: binary file matches
grep: /opt/rocm/lib/rocblas/library/TensileLibrary_Type_HH_Contraction_l_Ailk_Bljk_Cijk_Dijk_fallback_gfx908-xnack-.hsaco: binary file matches
grep: /opt/rocm/lib/rocblas/library/TensileLibrary_Type_HH_Contraction_l_Ailk_Bljk_Cijk_Dijk_fallback.dat: binary file matches

What is likely happening is that rocBLAS loads one of these modules once and then never checks whether the device architecture is the same on the next call, so it tries to load the functions from the already-loaded module, which obviously fails.
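
In pseudocode, the suspected failure mode would look something like this (purely illustrative; load_hsaco_for and get_function are hypothetical stand-ins, not actual rocBLAS/Tensile source):

_loaded_module = None  # cached once per process

def get_kernel(name, device_arch):
    global _loaded_module
    if _loaded_module is None:
        # First call wins: the module is picked for whichever arch runs first...
        _loaded_module = load_hsaco_for(device_arch)  # hypothetical loader
    # ...and later lookups for the other arch search the wrong module,
    # returning hipErrorNotFound instead of loading a matching .hsaco.
    return _loaded_module.get_function(name)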

@ardfork
Contributor

ardfork commented Jul 22, 2023

I have had those errors since I started making the ROCm patch, but at least on a single-GPU setup, they don't seem to have any impact.

@opcod3
Author

opcod3 commented Jul 22, 2023

Indeed; the difference is that on a single GPU the function is eventually found:

:3:hip_module.cpp           :73  : 21675563713 us: 69803: [tid:0x7fc9179e1740] hipModuleGetFunction ( 0x7ffe85f785f8, 0x55cca23c5400, Cijk_Alik_Bljk_HHS_BH_MT64x64x8_SN_1LDSB0_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS2_ASE_ASGT_ASLT_ASAE01_ASCE01_ASEM1_AAC0_BL0_BS1_DTL0_DTVA0_DVO0_ETSP_EPS0_FL0_GRPM1_GRVW2_GSU1_GSUASB_GLS0_ISA000_IU1_K1_KLS_LBSPP0_LPA0_LPB0_LDL1_LRVW2_LWPMn1_LDW0_FMA_MIAV0_MDA2_MO40_NTA0_NTB0_NTC0_NTD0_NEPBS0_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR1_PLR1_RK0_SIA1_SS0_SU32_SUM0_SUS256_SCIUI1_SPO0_SRVW0_SSO0_SVW4_SNLL0_TT4_4_TLDS0_USFGRO0_VAW2_VS1_VW2_WSGRA0_WSGRB0_WS64_WG16_16_1_WGM8 )
:1:hip_code_object.cpp      :606 : 21675563724 us: 69803: [tid:0x7fc9179e1740] Cannot find the function: Cijk_Alik_Bljk_HHS_BH_MT64x64x8_SN_1LDSB0_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS2_ASE_ASGT_ASLT_ASAE01_ASCE01_ASEM1_AAC0_BL0_BS1_DTL0_DTVA0_DVO0_ETSP_EPS0_FL0_GRPM1_GRVW2_GSU1_GSUASB_GLS0_ISA000_IU1_K1_KLS_LBSPP0_LPA0_LPB0_LDL1_LRVW2_LWPMn1_LDW0_FMA_MIAV0_MDA2_MO40_NTA0_NTB0_NTC0_NTD0_NEPBS0_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR1_PLR1_RK0_SIA1_SS0_SU32_SUM0_SUS256_SCIUI1_SPO0_SRVW0_SSO0_SVW4_SNLL0_TT4_4_TLDS0_USFGRO0_VAW2_VS1_VW2_WSGRA0_WSGRB0_WS64_WG16_16_1_WGM8 
:1:hip_module.cpp           :83  : 21675563732 us: 69803: [tid:0x7fc9179e1740] Cannot find the function: Cijk_Alik_Bljk_HHS_BH_MT64x64x8_SN_1LDSB0_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS2_ASE_ASGT_ASLT_ASAE01_ASCE01_ASEM1_AAC0_BL0_BS1_DTL0_DTVA0_DVO0_ETSP_EPS0_FL0_GRPM1_GRVW2_GSU1_GSUASB_GLS0_ISA000_IU1_K1_KLS_LBSPP0_LPA0_LPB0_LDL1_LRVW2_LWPMn1_LDW0_FMA_MIAV0_MDA2_MO40_NTA0_NTB0_NTC0_NTD0_NEPBS0_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR1_PLR1_RK0_SIA1_SS0_SU32_SUM0_SUS256_SCIUI1_SPO0_SRVW0_SSO0_SVW4_SNLL0_TT4_4_TLDS0_USFGRO0_VAW2_VS1_VW2_WSGRA0_WSGRB0_WS64_WG16_16_1_WGM8 for module: 0xa23c5400 

:3:hip_module.cpp           :84  : 21675563741 us: 69803: [tid:0x7fc9179e1740] hipModuleGetFunction: Returned hipErrorNotFound : 
:3:hip_module.cpp           :73  : 21675563746 us: 69803: [tid:0x7fc9179e1740] hipModuleGetFunction ( 0x7ffe85f785f8, 0x55cca2589470, Cijk_Alik_Bljk_HHS_BH_MT64x64x8_SN_1LDSB0_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS2_ASE_ASGT_ASLT_ASAE01_ASCE01_ASEM1_AAC0_BL0_BS1_DTL0_DTVA0_DVO0_ETSP_EPS0_FL0_GRPM1_GRVW2_GSU1_GSUASB_GLS0_ISA000_IU1_K1_KLS_LBSPP0_LPA0_LPB0_LDL1_LRVW2_LWPMn1_LDW0_FMA_MIAV0_MDA2_MO40_NTA0_NTB0_NTC0_NTD0_NEPBS0_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR1_PLR1_RK0_SIA1_SS0_SU32_SUM0_SUS256_SCIUI1_SPO0_SRVW0_SSO0_SVW4_SNLL0_TT4_4_TLDS0_USFGRO0_VAW2_VS1_VW2_WSGRA0_WSGRB0_WS64_WG16_16_1_WGM8 )
:1:hip_code_object.cpp      :606 : 21675563755 us: 69803: [tid:0x7fc9179e1740] Cannot find the function: Cijk_Alik_Bljk_HHS_BH_MT64x64x8_SN_1LDSB0_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS2_ASE_ASGT_ASLT_ASAE01_ASCE01_ASEM1_AAC0_BL0_BS1_DTL0_DTVA0_DVO0_ETSP_EPS0_FL0_GRPM1_GRVW2_GSU1_GSUASB_GLS0_ISA000_IU1_K1_KLS_LBSPP0_LPA0_LPB0_LDL1_LRVW2_LWPMn1_LDW0_FMA_MIAV0_MDA2_MO40_NTA0_NTB0_NTC0_NTD0_NEPBS0_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR1_PLR1_RK0_SIA1_SS0_SU32_SUM0_SUS256_SCIUI1_SPO0_SRVW0_SSO0_SVW4_SNLL0_TT4_4_TLDS0_USFGRO0_VAW2_VS1_VW2_WSGRA0_WSGRB0_WS64_WG16_16_1_WGM8 
:1:hip_module.cpp           :83  : 21675563764 us: 69803: [tid:0x7fc9179e1740] Cannot find the function: Cijk_Alik_Bljk_HHS_BH_MT64x64x8_SN_1LDSB0_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS2_ASE_ASGT_ASLT_ASAE01_ASCE01_ASEM1_AAC0_BL0_BS1_DTL0_DTVA0_DVO0_ETSP_EPS0_FL0_GRPM1_GRVW2_GSU1_GSUASB_GLS0_ISA000_IU1_K1_KLS_LBSPP0_LPA0_LPB0_LDL1_LRVW2_LWPMn1_LDW0_FMA_MIAV0_MDA2_MO40_NTA0_NTB0_NTC0_NTD0_NEPBS0_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR1_PLR1_RK0_SIA1_SS0_SU32_SUM0_SUS256_SCIUI1_SPO0_SRVW0_SSO0_SVW4_SNLL0_TT4_4_TLDS0_USFGRO0_VAW2_VS1_VW2_WSGRA0_WSGRB0_WS64_WG16_16_1_WGM8 for module: 0xa2589470 

:3:hip_module.cpp           :84  : 21675563776 us: 69803: [tid:0x7fc9179e1740] hipModuleGetFunction: Returned hipErrorNotFound : 
:3:hip_module.cpp           :73  : 21675563781 us: 69803: [tid:0x7fc9179e1740] hipModuleGetFunction ( 0x7ffe85f785f8, 0x55ccfd220430, Cijk_Alik_Bljk_HHS_BH_MT64x64x8_SN_1LDSB0_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS2_ASE_ASGT_ASLT_ASAE01_ASCE01_ASEM1_AAC0_BL0_BS1_DTL0_DTVA0_DVO0_ETSP_EPS0_FL0_GRPM1_GRVW2_GSU1_GSUASB_GLS0_ISA000_IU1_K1_KLS_LBSPP0_LPA0_LPB0_LDL1_LRVW2_LWPMn1_LDW0_FMA_MIAV0_MDA2_MO40_NTA0_NTB0_NTC0_NTD0_NEPBS0_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR1_PLR1_RK0_SIA1_SS0_SU32_SUM0_SUS256_SCIUI1_SPO0_SRVW0_SSO0_SVW4_SNLL0_TT4_4_TLDS0_USFGRO0_VAW2_VS1_VW2_WSGRA0_WSGRB0_WS64_WG16_16_1_WGM8 )
:3:hip_module.cpp           :87  : 21675563790 us: 69803: [tid:0x7fc9179e1740] hipModuleGetFunction: Returned hipSuccess : 

You can see it on the last line: hipModuleGetFunction: Returned hipSuccess, while with two GPUs it constantly returns hipErrorNotFound. The shocking part (at least to me) is that this doesn't cause the program to crash but instead makes it output incorrect results.

@opcod3
Author

opcod3 commented Jul 23, 2023

I have figured out this is a bug in either rocBLAS or Tensile.

I've reported it upstream: ROCm/rocBLAS/issues/1346

@turboderp
Owner

I'll close this issue here then.
