CCI sock and cci_reject #29
George, this problem should be fixed in my branch (gvallee/sock). There is also a test (src/tests/connect_reject.c) to exercise it. Let me know if you run into any other problems. Thanks,
Guys, is this fixed? Scott
This segfault is fixed and pushed to master. I have another pending improvement related to reject.
On the trunk I now have another issue, which might be related to the way connections are released when using the sock device. I use mpi_ping.c but changed the startup size to 16k. At this message size the first message uses the rendezvous protocol and thus forces an RDMA transfer from the beginning. I consistently get a segfault deep inside the CCI device, always triggered by the CCI internal thread. The problem seems to be an incorrect connection pointer inside an otherwise legitimate sconn structure. Here is a backtrace:
George, can you try with tcp as well? Geoffroy, can you take a look at this? Scott On May 22, 2013, at 3:30 PM, bosilca [email protected] wrote:
Works fine with TCP. There seems to be some data scrambling on the wire; I'll take a look at this using TCP until the sock transport is fixed.
I just tried with my branch and did not get any segfault. I will try with master in a moment.
The behavior I described in the ticket was with master. I would definitely be interested in hearing about the case where it sends more data. Can you please elaborate a little on the circumstances in which this happens? Thanks, On May 22, 2013, at 17:45, gvallee [email protected] wrote:
I cannot elaborate much more yet, since the overall behavior of the test is not consistent from one run to another, which makes debugging more difficult. As soon as I have more details, I will put them here. Also, I will soon push to my branch a few modifications to the sock transport that correctly handle this situation: they return an error if the payload size is too big.
The use of cci_reject with sock leads to either a deadlock or a segfault. The same code works fine with verbs.
Here is a stack trace from a run where it segfaults (it does not always segfault; sometimes it just deadlocks).
```
#0  0x00007f82f945da26 in sock_progress_queued (ep=0xa636a0)
#1  0x00007f82f945dd55 in sock_progress_sends (ep=0xa636a0)
#2  0x00007f82f946571f in sock_progress_thread (arg=0xa636a0)
#3  0x00007f82ffa91b50 in start_thread (arg=) at pthread_create.c:304
```
Digging deeper into the core, I think I have identified the issue. When the rx event (of type cci_event_connect_request_t) is created, it is correctly initialized. On the cci_reject call, since no connection exists yet, the newly created tx event is tagged with a connection set to NULL. If this event ends up in the queues and gets processed later by sock_progress_queued, the NULL connection is just asking for trouble.