
HPCC-31062 Include dfu file op cost in workunit aggregates #18358

Merged — 1 commit merged into hpcc-systems:candidate-9.6.x on May 9, 2024

Conversation

@shamser (Contributor) commented Feb 29, 2024

Type of change:

  • This change is a bug fix (non-breaking change which fixes an issue).
  • This change is a new feature (non-breaking change which adds functionality).
  • This change improves the code (refactor or other change that does not change the functionality).
  • This change fixes warnings (the fix does not alter the functionality or the generated code).
  • This change is a breaking change (fix or feature that will cause existing behavior to change).
  • This change alters the query API (existing queries will have to be recompiled)

Checklist:

  • My code follows the code style of this project.
    • My code does not create any new warnings from compiler, build system, or lint.
  • The commit message is properly formatted and free of typos.
    • The commit message title makes sense in a changelog, by itself.
    • The commit is signed.
  • My change requires a change to the documentation.
    • I have updated the documentation accordingly, or...
    • I have created a JIRA ticket to update the documentation.
    • Any new interfaces or exported functions are appropriately commented.
  • I have read the CONTRIBUTORS document.
  • The change has been fully tested:
    • I have added tests to cover my changes.
    • All new and existing tests passed.
    • I have checked that this change does not introduce memory leaks.
    • I have used Valgrind or similar tools to check for potential issues.
  • I have given due consideration to all of the following potential concerns:
    • Scalability
    • Performance
    • Security
    • Thread-safety
    • Cloud-compatibility
    • Premature optimization
    • Existing deployed queries will not be broken
    • This change fixes the problem, not just the symptom
    • The target branch of this pull request is appropriate for such a change.
  • There are no similar instances of the same problem that should be addressed
    • I have addressed them here
    • I have raised JIRA issues to address them separately
  • This is a user interface / front-end modification
    • I have tested my changes in multiple modern browsers
    • The component(s) render as expected

Smoketest:

  • Send notifications about my Pull Request position in Smoketest queue.
  • Test my draft Pull Request.

Testing:

@shamser shamser changed the title Issue31062 HPCC-31062 Generates aggregates for dfu stats in workunits Feb 29, 2024
@shamser shamser marked this pull request as ready for review March 1, 2024 12:30
@shamser shamser requested a review from jakesmith March 1, 2024 12:31
@jakesmith (Member) left a comment

@shamser - back to you for now as discussed offline.

if (sst!=SSTnone)
setStatistic(creatorType, creator, sst, currentScope.str(), StCostFileAccess, "", value, 1, 0, StatsMergeSum);
}
setStatistic(creatorType, creator, SSTglobal, "", StCostFileAccess, "", value, 1, 0, StatsMergeSum);
Member:

Discussed offline: this would cause problems if called more than once for the same scope, which will happen if the duration is more than a few seconds; these merge-sum figures will then be wrong.
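The double-counting concern can be illustrated with a small sketch. StatStore and MergeAction here are hypothetical stand-ins for the real workunit statistics API: a sum-merge applied more than once to a running total for the same scope inflates the figure, whereas a replace-merge does not.

```cpp
#include <cassert>
#include <map>
#include <string>

// Illustrative merge semantics only; these names are invented for this
// sketch, not the real StatsMergeSum/setStatistic implementation.
enum MergeAction { MergeSum, MergeReplace };

struct StatStore
{
    std::map<std::string, long long> stats;

    void setStatistic(const std::string &scope, long long value, MergeAction action)
    {
        if (action == MergeSum)
            stats[scope] += value; // adds on top of whatever is already recorded
        else
            stats[scope] = value;  // replaces the previously recorded figure
    }
};

// Publishing a running total twice for the same scope with a sum-merge
// double counts it: 100 then 150 yields 250 instead of 150.
long long sumMergedTwice()
{
    StatStore store;
    store.setStatistic("w1", 100, MergeSum); // first periodic publish
    store.setStatistic("w1", 150, MergeSum); // later publish of the running total
    return store.stats["w1"];
}

long long replaceMergedTwice()
{
    StatStore store;
    store.setStatistic("w1", 100, MergeReplace);
    store.setStatistic("w1", 150, MergeReplace);
    return store.stats["w1"];
}
```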

@shamser shamser changed the base branch from master to candidate-9.6.x March 27, 2024 16:29
@shamser shamser marked this pull request as draft March 27, 2024 16:30
@shamser shamser changed the title HPCC-31062 Generates aggregates for dfu stats in workunits HPCC-31062 Include dfu file op cost in workunit aggregates Mar 27, 2024
@shamser shamser marked this pull request as ready for review March 27, 2024 16:50
@shamser shamser requested a review from jakesmith March 27, 2024 16:50
@@ -734,7 +734,7 @@ ESPservice [
ESPmethod [resp_xsl_default("/esp/xslt/showresult.xslt")] ShowResult(ShowResultRequest, ShowResultResponse);
ESPmethod [resp_xsl_default("/esp/xslt/dfuwu_search.xslt")] DFUWUSearch(DFUWUSearchRequest, DFUWUSearchResponse);
ESPmethod [resp_xsl_default("/esp/xslt/dfu_workunits.xslt")] GetDFUWorkunits(GetDFUWorkunits, GetDFUWorkunitsResponse);
ESPmethod [resp_xsl_default("/esp/xslt/dfu_wuid.xslt")] GetDFUWorkunit(GetDFUWorkunit, GetDFUWorkunitResponse);
ESPmethod [min_ver("1.26"), resp_xsl_default("/esp/xslt/dfu_wuid.xslt")] GetDFUWorkunit(GetDFUWorkunit, GetDFUWorkunitResponse);
Member:

I'm not sure what happens, but I would think this will cause all clients that are <1.26 to fail if they call this method.

Member:

This was added because 'FileAccessCost' wasn't being returned (it's a field with min_ver=1.23).

I am not sure why the service definition doesn't have a 'default_client_version', nor am I sure what the effect should be when client code compiled with a current version (1.26) hits a server (1.26), i.e. why doesn't it treat it as a 1.26 client in the absence of 'default_client_version'.
I can understand why an arbitrary SOAP call that doesn't specify a version uses 'default_client_version', but not in the case of code based on generated ecm stubs.
Let's check/discuss with @afishbeck further.

Contributor (author):

Once #18538 is merged, this change (line 737) will not be needed.

iter->playProperties(aggregatesLoader);
if (iter->getScopeType()==SSToperation && !iter->nextSibling()) // don't descend operation. only top operation scope needed
Member:

Although I agree it shouldn't be necessary to recurse down, I think in general it would be better not to special-case unless necessary. In offline chat you highlighted that there's a secondary reason related to a bug.
I'd leave this code as it is, but reference the reasons and the JIRA until they are resolved; then this code can be revisited.

@shamser (Contributor, author) commented Apr 17, 2024:
I have addressed the bug with: #18547. Once the fix is merged, the changes made in line 2697 will not be needed.
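The special case under discussion (taking only the top operation scope and not descending further) can be sketched as follows. ScopeNode and collectScopes are invented for illustration; the real code walks a workunit scope iterator rather than an explicit tree.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical scope-tree sketch, not the real scope iterator types.
enum ScopeType { ScopeGraph, ScopeOperation };

struct ScopeNode
{
    std::string name;
    ScopeType type;
    std::vector<ScopeNode> children;
};

// Collect scope names, but stop descending once an operation scope is
// reached: only the top operation scope is wanted.
void collectScopes(const ScopeNode &node, std::vector<std::string> &out)
{
    out.push_back(node.name);
    if (node.type == ScopeOperation)
        return; // don't descend below the top operation scope
    for (const ScopeNode &child : node.children)
        collectScopes(child, out);
}
```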

@shamser (Contributor, author) commented Apr 17, 2024

The following PRs must be merged before this one: #18538 and #18547 .

@jakesmith (Member):
@shamser - can you remind me why this isn't/can't be implemented via the StatisticsAggregator mechanism?

@jakesmith (Member) left a comment:

@shamser see question.

@shamser (Contributor, author) commented Apr 29, 2024

@shamser - can you remind me why this isn't/can't be implemented via the StatisticsAggregator mechanism?

@jakesmith
The SSTdfuworkunit statistic is updated by the fileservices plugin using CLocalWorkUnit::setStatistic.
To update the aggregates immediately within setStatistic, it would need to create a StatisticsAggregator object, load all of the existing statistics from the global stats, mark the SSTdfuworkunit scope as "dirty", and then execute refreshAggregates to update the totals. CStatisticCollection would need to be extended to support marking a scope as dirty. (This seems messy, slow and complicated compared to what is being done in this PR.)

However, if this is left undone in the expectation that the statistic will be handled by a future StatisticsAggregator, it won't be: nothing marks the SSTdfuworkunit scope as dirty, and a scope that is not dirty is not used to calculate the aggregate.
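The dirty-scope behaviour described above can be sketched as follows. The Aggregator type and its method names are hypothetical, not the real StatisticsAggregator interface; the point is that a stat recorded without marking its scope dirty never reaches the refreshed aggregate.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// Minimal sketch of dirty-scope aggregation (invented names).
struct Aggregator
{
    std::map<std::string, long long> pending; // per-scope deltas since last refresh
    std::set<std::string> dirty;              // scopes changed since last refresh
    long long total = 0;                      // published aggregate

    void recordStat(const std::string &scope, long long delta, bool markDirty)
    {
        pending[scope] += delta;
        if (markDirty)
            dirty.insert(scope);
    }

    // Only dirty scopes contribute to the refreshed aggregate; a stat
    // written without marking its scope dirty is silently left out.
    void refreshAggregates()
    {
        for (const std::string &scope : dirty)
        {
            total += pending[scope];
            pending[scope] = 0;
        }
        dirty.clear();
    }
};
```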

@shamser shamser requested a review from jakesmith April 29, 2024 09:30
@jakesmith (Member):

To update the aggregates immediately within setStatistic, it would need to create an StatisticsAggregator object, load all of the existing statistics from global stats

That is already happening though(?), but fileservices doesn't have access to the statsAggregator.
i.e. fileservices is being called from a running workunit, which is loading all stats/calling StatisticsAggregator::updateAggregates periodically.

I think this approach is ok/pragmatic, but ideally all stat settings would go through the aggregator that's active for the current workunit, with a similar approach to StatisticsAggregator::recordStats, which would mark a dfu scope (or arbitrary scope) dirty; updates to the workunit would then be a separate periodic task.

It would be good to add some notes into the code, to clarify why fileservices stats are not aggregated via the current workunit's StatisticsAggregator and what would have to change for that to be so, to make it clearer if this is revisited.

// This would be slow and unnecessarily cumbersome.) Note also that if the stat is written
// with the expectation that the hThor/Thor stats aggregator would generate the aggregate for
// this stat, it would not be the case as the aggregates are only created for stats marked
// 'dirty'.
Member:
The comment update wasn't quite what I had in mind.
I've opened a new JIRA to potentially revisit this: https://hpccsystems.atlassian.net/browse/HPCC-31703

And say the StatsAggregator were created here, it would need to load all the global stats from dali

Each filespray service call is called from a context that already has a CStatsAggregator, and has already loaded the existing stats, and has the ability to mark target scopes dirty.
However, fileservices doesn't have access to the in flight 'statsAggregator' (perhaps could have via ICodeContext).

And CStatsAggregator would need an alternate 'recordStats' mechanism that took an arbitrary scope, and possibly a CRuntimeStatisticsCollection for the specific stats being updated.

The advantage would be that it would share the same rollup aggregator mechanism as used elsewhere, and it would allow the stat updates and the aggregation/publication to be done at different times (e.g. if spraying lots of small files, constant updates to Dali could become expensive).

@jakesmith (Member) left a comment:

@shamser - approving, but instead of squashing, I think it would be better to drop the 2nd commit (comment update); see my comments.
This can then be revisited at some later date (https://hpccsystems.atlassian.net/browse/HPCC-31703).

@jakesmith (Member) left a comment:

@shamser - looks good.

@ghalliday (Member) left a comment:

@shamser I think ok to merge, but have a couple of questions.

const unsigned numStats = mapping.numStatistics();
for (unsigned i=0; i<numStats; ++i)
filter.addOutputStatistic(mapping.getKind(i));
filter.setDepth(1,3); // 1=global, 2=workflow, 3=graph
filter.setDepth(0,3); // 0=global, 1=workflow, (2=graph, 3=subgraph | 2=>dfu)
Member:
I think you probably just want to delete this line. As long as you have the next line it will efficiently walk the existing global stats.

@shamser (Contributor, author) commented May 8, 2024:

By "global", I mean the top-level scopes (e.g. the parent of the workflow scopes). To iterate through the top-level scopes, the depth needs to start at 0. Here is output from logging in noteStatistic, recording the scope names (depth 0,3):

noteStatistic Statistic CostExecute (value 415263369)
noteStatistic Statistic CostFileAccess (value 12554)
noteStatistic w1 Statistic CostExecute (value 415263369)
noteStatistic w1 Statistic CostFileAccess (value 12554)
noteStatistic w1:graph1 Statistic CostExecute (value 386667103)
noteStatistic w1:graph2 Statistic CostExecute (value 266666968)
noteStatistic w1:graph3 Statistic CostExecute (value 4056896594)
noteStatistic w1:graph3 Statistic CostFileAccess (value 323)
noteStatistic w1:graph4 Statistic CostFileAccess (value 525)
noteStatistic w1:graph4 Statistic CostExecute (value 3493194855)
noteStatistic w1:>dfu Statistic CostFileAccess (value 11706)
noteStatistic w1:>dfu:>copy Statistic CostFileAccess (value 11706)

If the depth is set to (1,3), the iterator skips the top level and starts one level down (e.g. at the workflow scope level). Here are the results for the same workunit but with the depth set to (1,3):

noteStatistic w1 Statistic CostExecute (value 941930630)
noteStatistic w1 Statistic CostFileAccess (value 12554)
noteStatistic w1:graph1 Statistic CostExecute (value 384444878)
noteStatistic w1:graph2 Statistic CostExecute (value 246666945)
noteStatistic w1:graph3 Statistic CostExecute (value 310818807)
noteStatistic w1:graph3 Statistic CostFileAccess (value 323)
noteStatistic w1:graph4 Statistic CostFileAccess (value 525)
noteStatistic w1:graph4 Statistic CostExecute (value 1056007436)
noteStatistic w1:>dfu Statistic CostFileAccess (value 11706)
noteStatistic w1:>dfu:>copy Statistic CostFileAccess (value 11706)

I changed it to 0,3 because I was seeing that the top-level totals weren't being updated. With this change, they are.
@ghalliday (Member):

Yes, I think you need to change from 1,3 to 0,3. I think you could equally delete the filter, but I think probably no harm in leaving it here either.
The intent of the filters was that the sources etc. would be automatically deduced from the filter conditions. That logic needs updating for it to be true - and in this case the source has been explicitly selected.
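The depth behaviour being discussed can be sketched as follows. scopeDepth and filterByDepth are illustrative, not the real filter code; they show why a depth range of (1,3) skips the global (empty) top-level scope while (0,3) includes it.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Depth of a scope id in this sketch: the global scope "" is depth 0,
// "w1" is depth 1, "w1:graph1" is depth 2, "w1:>dfu:>copy" is depth 3
// (counting ':'-separated components).
unsigned scopeDepth(const std::string &scope)
{
    if (scope.empty())
        return 0;
    return 1 + (unsigned)std::count(scope.begin(), scope.end(), ':');
}

// Keep only scopes whose depth falls within [minDepth, maxDepth].
std::vector<std::string> filterByDepth(const std::vector<std::string> &scopes,
                                       unsigned minDepth, unsigned maxDepth)
{
    std::vector<std::string> matched;
    for (const std::string &scope : scopes)
    {
        unsigned depth = scopeDepth(scope);
        if (depth >= minDepth && depth <= maxDepth)
            matched.push_back(scope);
    }
    return matched;
}
```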

@@ -8941,7 +8940,7 @@ void CLocalWorkUnit::setStatistic(StatisticCreatorType creatorType, const char *
if (mergeAction != StatsMergeAppend)
{
StringBuffer xpath;
xpath.append("Statistic[@creator='").append(creator).append("'][@scope='").append(scope).append("'][@kind='").append(kindName).append("']");
xpath.append("Statistic[@scope='").append(scope).append("'][@kind='").append(kindName).append("']");
Member:

This change looks correct, but what was the reason for it? The jira or the PR should include some indication of why.

The StCostFileAccess from dfu scopes are included in the aggregation
of the StCostFileAccess totals.

Note that an aggregate may be written by one component and then later
updated by another component.  CLocalWorkUnit::setStatistic has been
modified to ignore the creator when looking up the existing value for
a statistic.

Signed-off-by: Shamser Ahmed <[email protected]>

 #	dockerfiles/t4.yaml
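The setStatistic change described in the commit message (ignoring the creator when looking up an existing statistic) can be sketched as follows. StatEntry and this setStatistic function are illustrative stand-ins, and the creator names are invented for the example, not taken from the PR.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Illustrative stand-in for the Statistic elements matched by the xpath.
struct StatEntry
{
    std::string creator;
    std::string scope;
    std::string kind;
    long long value;
};

// Match on scope and kind only, ignoring the creator, so a statistic first
// written by one component can later be updated by another without a
// duplicate entry being created.
void setStatistic(std::vector<StatEntry> &entries, const std::string &creator,
                  const std::string &scope, const std::string &kind, long long value)
{
    for (StatEntry &entry : entries)
    {
        if (entry.scope == scope && entry.kind == kind) // creator not compared
        {
            entry.creator = creator; // record the latest writer
            entry.value = value;
            return;
        }
    }
    entries.push_back({creator, scope, kind, value});
}
```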
@shamser shamser requested a review from ghalliday May 8, 2024 09:49
@shamser (Contributor, author) commented May 8, 2024

The commit message has been updated to explain the change to setStatistic.

I believe the change to the filter depth is necessary. Please see comment: #18358 (comment).


@ghalliday ghalliday merged commit 540d967 into hpcc-systems:candidate-9.6.x May 9, 2024
50 checks passed