
CCI sock and cci_reject #29

Open
bosilca opened this issue Mar 24, 2013 · 9 comments
@bosilca

bosilca commented Mar 24, 2013

The use of cci_reject with sock leads to either a deadlock or a segfault. The same code works fine with verbs.

Here is a stack trace from a run where it segfaults (it does not always segfault; sometimes it just deadlocks):
#0 0x00007f82f945da26 in sock_progress_queued (ep=0xa636a0) at ../../../../../trunk/src/plugins/ctp/sock/ctp_sock_api.c:2218
#1 0x00007f82f945dd55 in sock_progress_sends (ep=0xa636a0) at ../../../../../trunk/src/plugins/ctp/sock/ctp_sock_api.c:2419
#2 0x00007f82f946571f in sock_progress_thread (arg=0xa636a0) at ../../../../../trunk/src/plugins/ctp/sock/ctp_sock_api.c:4940
#3 0x00007f82ffa91b50 in start_thread (arg=) at pthread_create.c:304

Digging deeper into the core, I think I have identified the issue. When the rx event (of type cci_event_connect_request_t) is created, it is correctly initialized. On the cci_reject call, since no connection exists yet, the newly created tx event is tagged with a connection set to NULL. If this event ends up in the queues and gets processed later on by sock_progress_queued, the NULL connection is just asking for trouble.

@gvallee
Contributor

gvallee commented Apr 2, 2013

George, this problem should be fixed in my branch (gvallee/sock). There is also a test (src/tests/connect_reject.c) that exercises it. Let me know if you run into any other problems. Thanks,

@scottatchley
Contributor

Guys, is this fixed?

Scott

@gvallee
Contributor

gvallee commented May 13, 2013

This segfault is fixed and pushed to master. I have another pending improvement related to reject.

@bosilca
Author

bosilca commented May 22, 2013

On the trunk I now have another issue which might be related to the way connections are released when using the sock device. I use mpi_ping.c but changed the startup size to 16k. At this message size the first message uses the rendezvous protocol and thus forces an RDMA transfer from the beginning. I consistently get a segfault deep inside the CCI device, always triggered by the CCI internal thread. The problem seems to be an incorrect connection pointer inside what seems to be a legitimate sconn structure.

Here is a backtrace:
#0 0x0000000104aac01b in cci_conn_is_reliable (conn=0x3000000000000000) at cci_lib_types.h:145
#1 0x0000000104ab21d6 in pack_piggyback_ack (ep=0x7f8c0ac636f0, sconn=0x7f8c0c003690, tx=0x104af9000) at ctp_sock_api.c:2169
#2 0x0000000104ab1d42 in sock_progress_pending (ep=0x7f8c0ac636f0) at ctp_sock_api.c:2106
#3 0x0000000104ab2eaa in sock_progress_sends (ep=0x7f8c0ac636f0) at ctp_sock_api.c:2467
#4 0x0000000104abaed0 in sock_progress_thread (arg=0x7f8c0ac636f0) at ctp_sock_api.c:5070
#5 0x00007fff8c239742 in _pthread_start ()
#6 0x00007fff8c226181 in thread_start ()

@scottatchley
Contributor

George,

Can you try with tcp as well?

Geoffroy, can you take a look at this?

Scott


@bosilca
Author

bosilca commented May 22, 2013

Works fine with TCP. There seems to be some data scrambling on the wire with sock; I'll keep looking at this using TCP until sock is fixed.

@gvallee
Contributor

gvallee commented May 22, 2013

I just tried with my branch and did not get any segfault. I will try with master in a moment.
BTW, the BTL code with the sock transport sometimes tries to send more data than it is allowed to.

@bosilca
Author

bosilca commented May 22, 2013

The behavior I described in the ticket was with master.

I would definitely be interested in hearing about the case where it sends more data. Can you please elaborate a little on the circumstances under which this happens?

Thanks,
George.


@gvallee
Contributor

gvallee commented May 22, 2013

I cannot elaborate much more yet since the overall behavior of the test is not consistent from one run to another, which makes debugging more difficult. As soon as I have more details, I will post them here. Also, I will very soon push to my branch a few modifications to the sock transport that handle this situation correctly: it returns an error if the payload size is too big.
