mlx5: got completion with error #5

Open
jtuni opened this issue May 2, 2023 · 1 comment

jtuni commented May 2, 2023

I'm unable to successfully transfer a file with hdRDMAcp. On the server side I am running the following:
./hdrdmacp -s -n 1 -m 8GB
and getting the following output:


=============================================
Found 1 devices
---------------------------------------------
   device 0 : mlx5_0 : uverbs0 : IB : InfiniBand channel adapter : Num. ports=1 : port num=1 : lid=30
=============================================

Device mlx5_0 opened. num_comp_vectors=60
Port attributes:
           state: 4
         max_mtu: 5
      active_mtu: 5
  port_cap_flags: 575793224
      max_msg_sz: 1073741824
    active_width: 2
    active_speed: 32
      phys_state: 5
      link_layer: 1
Created 1 buffers of 8000MB (8GB total)
Listening for connections on port ... 10470

So everything looks good. On the client side, I am trying to send a large file by running:
/hdrdmacp/hdrdmacp /home/tuni/example.file 10.2.1.85:/home/tuni/file_1g.file -n 1 -m 8GB
I get an error regardless of the file size and of the buffer count and buffer size combinations I use:


=============================================
Found 1 devices
---------------------------------------------
   device 0 : mlx5_0 : uverbs0 : IB : InfiniBand channel adapter : Num. ports=1 : port num=1 : lid=21
=============================================

Device mlx5_0 opened. num_comp_vectors=60
Port attributes:
           state: 4
         max_mtu: 5
      active_mtu: 5
  port_cap_flags: 575793224
      max_msg_sz: 1073741824
    active_width: 2
    active_speed: 32
      phys_state: 5
      link_layer: 1
Created 1 buffers of 8000MB (8GB total)
IP address: 10.2.1.85 (10.2.1.85)
Connected to 10.2.1.85:10470
Sending file: /home/tuni/example.file-> (10.2.1.85:)/home/tuni/file_1g.file   (3.15027 GB)
  queued 3150MB (3150/3150 MB -- 100%  - 24.7215 Gbps)   mlx5: pirineusknl1: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000000 42006802 0a0002e3 0000eed2

  Transferred 3.15027 GB in 1.02321 sec  (24.6306 Gbps)
  I/O rate reading from file: 1.01938 sec  (24.7231 Gbps)

Even though the output states the file was transferred at 24.6 Gbps in 1.02 seconds, the file never actually appears on the server. Both client and server use the same OFED version and are essentially identical in every respect, so I don't know what to do to fix this. Any ideas?

faustus123 (Collaborator) commented

I'll preface this by saying that this has been very stable for our application for a few years now, so I have not touched the code in some time. Here is what I would suggest you try:

  1. First, verify that there are no permission issues and that a file with the destination name does not already exist on the server. Just before starting the server, check from the same terminal that you can create the file:
    touch /home/tuni/file_1g.file
    rm /home/tuni/file_1g.file

  2. There is currently no check in the code that the output file was successfully opened at line 440 of hdRDMAThread.cc. My guess is that this is where the issue lies if you are not even seeing the output file created. I should have put a check on this using ofs->is_open(). I would suggest adding a check and printing a message if the file was not opened successfully; see the sketch after this list. (I need to do this in the repository code, but may not get to it right away and don't want to hold you up.)

  3. Uncomment lines 481-488 of hdRDMAThread.cc and rerun. This will print a message from the server when it receives a buffer marked as the last one for the file, so you can at least check whether the server is receiving that last buffer. The server only tries to close the file once that buffer is received.
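
Something along these lines is what I have in mind for the check in step 2 (a minimal, self-contained sketch; the helper name and the main() driver are illustrative, not the actual hdRDMAThread.cc code):

    #include <fstream>
    #include <iostream>
    #include <string>

    // Illustrative helper: open the output file and fail loudly if it
    // cannot be created, rather than silently writing into a bad stream.
    static std::ofstream* OpenOutputFileChecked(const std::string &filename)
    {
        auto *ofs = new std::ofstream(filename, std::ios::binary);
        if (!ofs->is_open()) {
            std::cerr << "ERROR: unable to open \"" << filename
                      << "\" for writing (check permissions and path)" << std::endl;
            delete ofs;
            return nullptr; // caller should abort the transfer
        }
        return ofs;
    }

    // Stand-alone driver so the sketch compiles on its own; in hdRDMAcp the
    // filename would come from the transfer request instead.
    int main()
    {
        std::ofstream *ofs = OpenOutputFileChecked("/home/tuni/file_1g.file");
        if (!ofs) return 1;
        ofs->close();
        delete ofs;
        return 0;
    }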

I'll be curious to hear what the problem turns out to be, so please post a follow-up with what you learn.
