HPCC-31062 Include dfu file op cost in workunit aggregates #18358
@shamser - back to you for now as discussed offline.
common/workunit/workunit.cpp
Outdated
if (sst!=SSTnone)
    setStatistic(creatorType, creator, sst, currentScope.str(), StCostFileAccess, "", value, 1, 0, StatsMergeSum);
}
setStatistic(creatorType, creator, SSTglobal, "", StCostFileAccess, "", value, 1, 0, StatsMergeSum);
Discussed offline: this would cause problems if called more than once for the same scope. This will happen if the duration is more than a few seconds, and these merge-sum figures will then be wrong.
esp/scm/ws_fs.ecm
Outdated
@@ -734,7 +734,7 @@ ESPservice [
     ESPmethod [resp_xsl_default("/esp/xslt/showresult.xslt")] ShowResult(ShowResultRequest, ShowResultResponse);
     ESPmethod [resp_xsl_default("/esp/xslt/dfuwu_search.xslt")] DFUWUSearch(DFUWUSearchRequest, DFUWUSearchResponse);
     ESPmethod [resp_xsl_default("/esp/xslt/dfu_workunits.xslt")] GetDFUWorkunits(GetDFUWorkunits, GetDFUWorkunitsResponse);
-    ESPmethod [resp_xsl_default("/esp/xslt/dfu_wuid.xslt")] GetDFUWorkunit(GetDFUWorkunit, GetDFUWorkunitResponse);
+    ESPmethod [min_ver("1.26"), resp_xsl_default("/esp/xslt/dfu_wuid.xslt")] GetDFUWorkunit(GetDFUWorkunit, GetDFUWorkunitResponse);
I'm not sure what happens, but I would think this will cause all clients that are <1.26 to fail if they call this method.
This was added because 'FileAccessCost' wasn't being returned (it's a field with min_ver=1.23).
I am not sure why the service definition doesn't have a 'default_client_version', nor am I sure what the effect should be when client code compiled with a current version (1.26) hits a server (1.26), i.e. why doesn't it treat it as a 1.26 client in the absence of 'default_client_version'.
I can understand why an arbitrary SOAP call that doesn't specify a version uses 'default_client_version', but not in the case of code based on generated ecm stubs.
Let's check/discuss with @afishbeck further.
Once #18538 is merged, this change (line 737) will not be needed.
common/workunit/workunit.cpp
Outdated
iter->playProperties(aggregatesLoader);
if (iter->getScopeType()==SSToperation && !iter->nextSibling()) // don't descend operation. only top operation scope needed
Although I agree it shouldn't be necessary to recurse down, in general I think it would be better not to special-case unless necessary. In offline chat you highlighted that there's a secondary reason related to a bug.
I'd leave this code as it is, but reference the reasons and the JIRA; once that is resolved, this code can be revisited.
I have addressed the bug with #18547. Once that fix is merged, the change made at line 2697 will not be needed.
@shamser - can you remind me why this isn't/can't be implemented via the StatisticsAggregator mechanism?
@shamser see question.
@jakesmith However, if this is done with the expectation that the statistic will be handled by a future StatisticsAggregator, it won't be. The reason is that there is nothing marking the SSTdfuworkunit scope as dirty, and if the scope is not dirty it won't be used to calculate the aggregate.
That is already happening though(?), but fileservices doesn't have access to the statsAggregator. I think this approach is ok/pragmatic, but ideally all stat settings would go through the aggregator that's active for the current workunit, with a similar approach to StatisticsAggregator::recordStats, which would mark a dfu scope (or arbitrary scope) dirty; updates to the workunit would then be a separate periodic task. It would be good to add some notes into the code to clarify why fileservices stats are not aggregated via the current workunit's StatisticsAggregator, and what would have to change for that to be so, to make it clearer if this is revisited.
common/workunit/workunit.cpp
Outdated
// This would be slow and unnecessarily cumbersome.) Note also that if the stat is written
// with the expectation that the hThor/Thor stats aggregator would generate the aggregate for
// this stat, it would not be the case, as the aggregates are only created for stats marked
// 'dirty'.
The comment update wasn't quite what I had in mind.
I've opened a new JIRA to potentially revisit this: https://hpccsystems.atlassian.net/browse/HPCC-31703
And if the StatsAggregator were created here, it would need to load all the global stats from Dali.
Each filespray service call is called from a context that already has a CStatsAggregator, and has already loaded the existing stats, and has the ability to mark target scopes dirty.
However, fileservices doesn't have access to the in flight 'statsAggregator' (perhaps could have via ICodeContext).
And CStatsAggregator would need an alternate 'recordStats' mechanism that took an arbitrary scope, and possibly a CRuntimeStatisticCollection for the specific stats being updated.
The advantage would be that it would share the same rollup aggregator mechanism used elsewhere, and it would allow the stat updates vs the aggregation and publication to be done at different times (e.g. if spraying lots of small files, constantly updating Dali could become expensive).
@shamser - approving, but instead of squashing, I think it would be better to drop the 2nd commit (comment update); see my comments.
And this can then be revisited at some later date (https://hpccsystems.atlassian.net/browse/HPCC-31703).
@shamser - looks good.
@shamser I think ok to merge, but have a couple of questions.
const unsigned numStats = mapping.numStatistics();
for (unsigned i=0; i<numStats; ++i)
    filter.addOutputStatistic(mapping.getKind(i));
-filter.setDepth(1,3); // 1=global, 2=workflow, 3=graph
+filter.setDepth(0,3); // 0=global, 1=workflow, (2=graph, 3=subgraph | 2=>dfu)
I think you probably just want to delete this line. As long as you have the next line it will efficiently walk the existing global stats.
By "global", I mean the top-level scopes (e.g. parent of the workflow scopes). To iterate through the top level scopes, the depth needs to start at 0. Here is information from logging noteStatistic to record the scope names (depth 0,3):
noteStatistic Statistic CostExecute (value 415263369)
noteStatistic Statistic CostFileAccess (value 12554)
noteStatistic w1 Statistic CostExecute (value 415263369)
noteStatistic w1 Statistic CostFileAccess (value 12554)
noteStatistic w1:graph1 Statistic CostExecute (value 386667103)
noteStatistic w1:graph2 Statistic CostExecute (value 266666968)
noteStatistic w1:graph3 Statistic CostExecute (value 4056896594)
noteStatistic w1:graph3 Statistic CostFileAccess (value 323)
noteStatistic w1:graph4 Statistic CostFileAccess (value 525)
noteStatistic w1:graph4 Statistic CostExecute (value 3493194855)
noteStatistic w1:>dfu Statistic CostFileAccess (value 11706)
noteStatistic w1:>dfu:>copy Statistic CostFileAccess (value 11706)
If the depth is set to (1,3), the iterator skips the top level and starts one level down (e.g. at the workflow scope level). Here are the results for the same workunit, but with the depth set to (1,3):
noteStatistic w1 Statistic CostExecute (value 941930630)
noteStatistic w1 Statistic CostFileAccess (value 12554)
noteStatistic w1:graph1 Statistic CostExecute (value 384444878)
noteStatistic w1:graph2 Statistic CostExecute (value 246666945)
noteStatistic w1:graph3 Statistic CostExecute (value 310818807)
noteStatistic w1:graph3 Statistic CostFileAccess (value 323)
noteStatistic w1:graph4 Statistic CostFileAccess (value 525)
noteStatistic w1:graph4 Statistic CostExecute (value 1056007436)
noteStatistic w1:>dfu Statistic CostFileAccess (value 11706)
noteStatistic w1:>dfu:>copy Statistic CostFileAccess (value 11706)
I changed to 0,3 because I was seeing that the top-level totals weren't being updated. With this change, they are.
@ghalliday
Yes, I think you need to change from 1,3 to 0,3. I think you could equally delete the filter, but I think probably no harm in leaving it here either.
The intent of the filters was that the sources etc. would be automatically deduced from the filter conditions. That logic needs updating for it to be true - and in this case the source has been explicitly selected.
@@ -8941,7 +8940,7 @@ void CLocalWorkUnit::setStatistic(StatisticCreatorType creatorType, const char *
     if (mergeAction != StatsMergeAppend)
     {
         StringBuffer xpath;
-        xpath.append("Statistic[@creator='").append(creator).append("'][@scope='").append(scope).append("'][@kind='").append(kindName).append("']");
+        xpath.append("Statistic[@scope='").append(scope).append("'][@kind='").append(kindName).append("']");
This change looks correct, but what was the reason for it? The jira or the PR should include some indication of why.
The StCostFileAccess values from dfu scopes are included in the aggregation of the StCostFileAccess totals. Note that an aggregate may be written by one component and then later updated by another component. CLocalWorkUnit::setStatistic has been modified to ignore the creator when looking up the existing value for a statistic.

Signed-off-by: Shamser Ahmed <[email protected]>
The commit message has been updated to explain the change to setStatistic. I believe the change to the filter depth is necessary. Please see comment: #18358 (comment).