-
Notifications
You must be signed in to change notification settings - Fork 304
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HPCC-32483 Capture global sort/join blocked time #19181
Conversation
Jira Issue: https://hpccsystems.atlassian.net//browse/HPCC-32483 Jirabot Action Result: |
Capture blocked time for global sort and joins. Signed-off-by: Shamser Ahmed <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@shamser - please see comments.
It may be better to split this into individual activity PR changes.
@@ -299,7 +299,10 @@ class JoinSlaveActivity : public CSlaveActivity, implements ILookAheadStopNotify | |||
stopOtherInput(); | |||
throw; | |||
} | |||
asyncSecondaryStart.wait(); | |||
{ | |||
BlockedActivityTimer t(slaveTimerStats, timeActivities); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure about this.
why is this blocked time?
input is on a thread - asynchronously starting.
Totalcycles at this point, will be the total time both took to start (overlapping).
The timing in queryLocalCycles() can go wrong because it is considering totalcycles - (input1-time+intput2-time), i.e. as if started synchronously.
This timer will be : input2 time - input1 time (or 0 if input2 was quicker).
Counting it as lookahead time, means "processCycles" in queryLocalCycles() will be total+this lookahead time, and mean that when the sum total for both inputs are deduced it should be correct.
Whatever the conclusion, needs a good comment here explaining the logic clearly.
@@ -387,7 +390,10 @@ class JoinSlaveActivity : public CSlaveActivity, implements ILookAheadStopNotify | |||
{ | |||
unsigned bn=noSortPartitionSide()?2:4; | |||
ActPrintLog("JOIN waiting barrier.%d",bn); | |||
barrier->wait(false); | |||
{ | |||
BlockedActivityTimer t(slaveTimerStats, timeActivities); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I realize there's BlockActivityTimer's in other stops already (via hash agg stream),
but .. stop() time in general is not being timed (as part of total cycles).. may be it should be - but that's a separate question.
So that means, if blocked time is, then processCycles could be < blockedCycles ( and result in warnings from queryLocalCycles)
If any act measures blocked in stop(), it would mean it needs to consider stop() in the calculation in general which it doesn't.
Can you remove this (and separately other Blocked time considerations in other stop'd contexts) and open a separate JIRA to revisit measuring time in general in activities stop()'s.
return false; | ||
{ | ||
BlockedActivityTimer t(slaveTimerStats, timeActivities); | ||
if (!barrier->wait(false)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what about the barrier->wait on line 613?
@@ -894,7 +895,10 @@ class CKeyedJoinSlave : public CSlaveActivity, implements IJoinProcessor, implem | |||
void inc() | |||
{ | |||
while (incNonBlocking()) | |||
{ | |||
BlockedActivityTimer t(activity.getTotalCyclesRef(), activity.queryTimeActivities()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this block time actually holding up the nextRow (or start) of the KJ act?
If it's not and overlapping time, e.g. because full, but nextRow is not blocked, then it would be wrong to consider it "blocked" time.
The time we are trying to track is the time the totalcycles was blocked.
@@ -907,6 +911,7 @@ class CKeyedJoinSlave : public CSlaveActivity, implements IJoinProcessor, implem | |||
} | |||
void block() | |||
{ | |||
BlockedActivityTimer t(activity.getTotalCyclesRef(), activity.queryTimeActivities()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same q. as per prev. comments.
Need to think carefully whether overlapping concurrent threads that are pending on something are blocking the acts start/nextrow from making progress.
@@ -1027,6 +1027,7 @@ class CInMemJoinBase : public CSlaveActivity, public CAllOrLookupHelper<HELPER>, | |||
} | |||
inline void interChannelBarrier() | |||
{ | |||
BlockedActivityTimer t(slaveTimerStats, timeActivities); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is generally corrected to consider this block time, because most are being called in the context (on the stack of) start/nextRow,
but there are a couple which are in the context of stop() [ see other comments re. not trackingstop() time ]
I'll be splitting the changes in this PR into separate PRs. |
Capture blocked time for global sort and joins.
Type of change:
Checklist:
Smoketest:
Testing: