
CCI sock and cci_reject #29

Open
bosilca opened this issue Mar 24, 2013 · 9 comments
@bosilca

bosilca commented Mar 24, 2013

The use of cci_reject with sock leads to either a deadlock or a segfault. The same code works fine with verbs.

Here is a stack trace from a run where it segfaults (it does not always segfault; sometimes it just deadlocks):
#0 0x00007f82f945da26 in sock_progress_queued (ep=0xa636a0) at ../../../../../trunk/src/plugins/ctp/sock/ctp_sock_api.c:2218
#1 0x00007f82f945dd55 in sock_progress_sends (ep=0xa636a0) at ../../../../../trunk/src/plugins/ctp/sock/ctp_sock_api.c:2419
#2 0x00007f82f946571f in sock_progress_thread (arg=0xa636a0) at ../../../../../trunk/src/plugins/ctp/sock/ctp_sock_api.c:4940
#3 0x00007f82ffa91b50 in start_thread (arg=) at pthread_create.c:304

Digging deeper into the core, I think I have identified the issue. When the rx event (of type cci_event_connect_request_t) is created, it is correctly initialized. On the cci_reject call, since no connection exists yet, the newly created tx event is tagged with a connection set to NULL. If this event ends up in the queues and gets processed later on by sock_progress_queued, the NULL connection is just asking for trouble.

@gvallee
Contributor

gvallee commented Apr 2, 2013

George, this problem should be fixed in my branch (gvallee/sock). There is also a test (src/tests/connect_reject.c) that exercises it. Let me know if you run into any other problems. Thanks,

@scottatchley
Contributor

Guys, is this fixed?

Scott

@gvallee
Contributor

gvallee commented May 13, 2013

This segfault is fixed and pushed to master. I have another pending improvement related to reject.

@bosilca
Author

bosilca commented May 22, 2013

On the trunk I now have another issue which might be related to the way connections are released when using the sock device. I use mpi_ping.c but changed the startup size to 16k. At this message size the first message uses the rendezvous protocol and thus forces an RDMA transfer from the beginning. I consistently get a segfault deep inside the CCI device, always triggered by the CCI internal thread. The problem seems to be an incorrect connection pointer inside what seems to be a legitimate sconn structure.

Here is a backtrace:
#0 0x0000000104aac01b in cci_conn_is_reliable (conn=0x3000000000000000) at cci_lib_types.h:145
#1 0x0000000104ab21d6 in pack_piggyback_ack (ep=0x7f8c0ac636f0, sconn=0x7f8c0c003690, tx=0x104af9000) at ctp_sock_api.c:2169
#2 0x0000000104ab1d42 in sock_progress_pending (ep=0x7f8c0ac636f0) at ctp_sock_api.c:2106
#3 0x0000000104ab2eaa in sock_progress_sends (ep=0x7f8c0ac636f0) at ctp_sock_api.c:2467
#4 0x0000000104abaed0 in sock_progress_thread (arg=0x7f8c0ac636f0) at ctp_sock_api.c:5070
#5 0x00007fff8c239742 in _pthread_start ()
#6 0x00007fff8c226181 in thread_start ()

@scottatchley
Contributor

George,

Can you try with tcp as well?

Geoffroy, can you take a look at this?

Scott


@bosilca
Author

bosilca commented May 22, 2013

Works fine with TCP. There seems to be some data scrambling on the wire with sock; I'll keep looking at this using TCP until sock is fixed.

@gvallee
Contributor

gvallee commented May 22, 2013

I just tried with my branch and did not get any segfault. I will try with master in a moment.
BTW, the BTL code with the sock transport sometimes tries to send more data than it is allowed to.

@bosilca
Author

bosilca commented May 22, 2013

The behavior I described in the ticket was with master.

I would definitely be interested in hearing about the case where it sends more data. Can you please elaborate a little on the circumstances under which this happens?

Thanks,
George.


@gvallee
Contributor

gvallee commented May 22, 2013

I cannot elaborate much more yet since the overall behavior of the test is not consistent from one run to another, which makes debugging more difficult. As soon as I have more details, I will post them here. Also, I will very soon push to my branch a few modifications to the sock transport that handle this situation correctly: it returns an error if the payload size is too big.
