The preq variable passed to svr_strtjob2 is the wrong request. #364

Open
wants to merge 1 commit into
base: 4.2-dev

@dvandok (Contributor) commented May 24, 2016

The preq passed to svr_strtjob2 should have been the AsyrunJob request, but it
is currently the CopyFiles request used to stage the files to the mom, and that
request has already been completed and dealt with. Usually the later ack to
this request does not go through, because the file descriptor it was associated
with is already closed. In some cases, however, the fd has been reopened to
start more jobs on the same worker node. In that case the mom misinterprets the
ack as a Connect request and the job commit does not go through, leaving jobs
in a strange state. The fix in this commit mitigates the problem by preventing
the ack from being sent.

For a longer explanation, see the wiki page I wrote.
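
The fd-reuse effect described above can be reproduced in isolation. The following is only a minimal Python sketch, not Torque code: it uses pipes in place of the server-to-mom connections, and the function name `stale_ack_demo` and the `b"ACK"` payload are invented for illustration. It relies on the POSIX rule that a newly opened descriptor gets the lowest free number, so a closed fd number is typically handed out again to the next connection.

```python
import os


def stale_ack_demo():
    """Show how a late write through a stale fd number can reach a new connection.

    Hypothetical stand-ins: the first pipe plays the connection of the
    already-finished request, the second pipe plays a connection opened
    later to the same node.
    """
    # First "connection"; the finished request still remembers its fd number.
    old_r, old_w = os.pipe()
    stale_fd = old_w
    os.close(old_r)
    os.close(old_w)

    # A new "connection" is opened; the OS may hand out the same numbers again.
    new_r, new_w = os.pipe()
    reused = (new_w == stale_fd)

    received = b""
    if reused:
        # The late "ack" is written through the stale number and lands on
        # the new connection, where the reader does not expect it.
        os.write(stale_fd, b"ACK")
        received = os.read(new_r, 3)

    os.close(new_r)
    os.close(new_w)
    return reused, received


if __name__ == "__main__":
    reused, received = stale_ack_demo()
    print("fd number reused:", reused)
    print("bytes seen on the new connection:", received)
```

When the number is reused, the reader on the new connection sees bytes that belong to the old exchange, which is the same kind of confusion as the mom treating a stray ack as a Connect request. When the number is not reused, the write would simply fail on a closed descriptor, which matches the "usually the ack will not go through" case.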

@dbeer commented May 26, 2016

@dvandok great catch on this bug. I'm looking into this and I want to do some testing to see if there are other pieces that need to be added here to complete the fix.
