HPCC-32873 Prevent concurrent write to same file when spraying/despraying #19240

@jakesmith jakesmith commented Oct 25, 2024

Type of change:

  • This change is a bug fix (non-breaking change which fixes an issue).
  • This change is a new feature (non-breaking change which adds functionality).
  • This change improves the code (refactor or other change that does not change the functionality)
  • This change fixes warnings (the fix does not alter the functionality or the generated code)
  • This change is a breaking change (fix or feature that will cause existing behavior to change).
  • This change alters the query API (existing queries will have to be recompiled)

Checklist:

  • My code follows the code style of this project.
    • My code does not create any new warnings from compiler, build system, or lint.
  • The commit message is properly formatted and free of typos.
    • The commit message title makes sense in a changelog, by itself.
    • The commit is signed.
  • My change requires a change to the documentation.
    • I have updated the documentation accordingly, or...
    • I have created a JIRA ticket to update the documentation.
    • Any new interfaces or exported functions are appropriately commented.
  • I have read the CONTRIBUTORS document.
  • The change has been fully tested:
    • I have added tests to cover my changes.
    • All new and existing tests passed.
    • I have checked that this change does not introduce memory leaks.
    • I have used Valgrind or similar tools to check for potential issues.
  • I have given due consideration to all of the following potential concerns:
    • Scalability
    • Performance
    • Security
    • Thread-safety
    • Cloud-compatibility
    • Premature optimization
    • Existing deployed queries will not be broken
    • This change fixes the problem, not just the symptom
    • The target branch of this pull request is appropriate for such a change.
  • There are no similar instances of the same problem that should be addressed
    • I have addressed them here
    • I have raised JIRA issues to address them separately
  • This is a user interface / front-end modification
    • I have tested my changes in multiple modern browsers
    • The component(s) render as expected

Smoketest:

  • Send notifications about my Pull Request position in Smoketest queue.
  • Test my draft Pull Request.

Testing:

Jira Issue: https://hpccsystems.atlassian.net//browse/HPCC-32873

Jirabot Action Result:
Assigning user: [email protected]
Workflow Transition To: Merge Pending
Updated PR

@jakesmith jakesmith force-pushed the HPCC-32873-spray-avoid-concurrent-write branch from ccaf0ef to fe6eb67 Compare October 29, 2024 16:43
@jakesmith jakesmith marked this pull request as ready for review October 29, 2024 16:45
@jakesmith jakesmith requested a review from ghalliday October 29, 2024 16:45
@jakesmith jakesmith force-pushed the HPCC-32873-spray-avoid-concurrent-write branch from fe6eb67 to c4739fc Compare October 29, 2024 18:34
values[FileSyncWriteClose] = v != unsetPlaneAttrValue ? 1 : 0;
}
else
values[propNum] = false;
Member

= unsetPlaneAttrValue

@jakesmith jakesmith requested a review from ghalliday November 1, 2024 00:31
@jakesmith
Member Author

@ghalliday - please see 2nd commit.

Member

@ghalliday ghalliday left a comment

A few comments. The only one that I think needs addressing: a push of a single file will be switched to pull if the target does not support multiple writes. Probably not a major problem, although inefficient. The logging could be confusing though.

else
targetSupportsConcurrentWrite = true;
if (!targetPlane.isEmpty())
targetSupportsConcurrentWrite = 0 != getPlaneAttributeValue(targetPlane, ConcurrentWriteSupport, targetSupportsConcurrentWrite ? 1 : 0);
Member

for future: getPlaneAttributeBool() would clean the code up

// always pull
if (pushRequested)
IWARNLOG("Ignoring push option as targets < sources and target does not support concurrent write");
return true; // could be push if equal # of soruces and targets but no point
Member

trivial: typo "soruces"

Member Author

fixed

return false;
// always pull
if (pushRequested)
IWARNLOG("Ignoring push option as targets < sources and target does not support concurrent write");
Member

I think this could occur if targets > sources - including sources==1 where push may make sense, especially if it is small.

Member

Multiple writes could be involved even if targets < sources (see comment on line 4072)

Member Author

If >1 source, and pushing, then each source pusher may write to the same target, e.g. with 2 sources pushing to 3 targets, each source will push to the middle target?

sources == 1 .. ok, that is special and will only ever be pushing 1 partition to a single target?

So special-case sources==1 here and keep the push preference if specified; otherwise calculate the size, then push if < threshold and pull otherwise?
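The overlap described above can be sketched with a small model. This is only an illustration under stated assumptions, not the FileSprayer partitioner: the data is treated as evenly divided, each source and each target owns one contiguous slice, and the function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical model: the data is numSources * numTargets equal units.
// Source s owns units [s*numTargets, (s+1)*numTargets) and
// target t owns units [t*numSources, (t+1)*numSources).
// Returns, for each target, how many distinct sources overlap its slice.
std::vector<unsigned> sourcesPerTarget(unsigned numSources, unsigned numTargets)
{
    assert(numSources > 0 && numTargets > 0);
    std::vector<unsigned> counts(numTargets, 0);
    for (unsigned t = 0; t < numTargets; t++)
    {
        unsigned tStart = t * numSources;
        unsigned tEnd = tStart + numSources;
        for (unsigned s = 0; s < numSources; s++)
        {
            unsigned sStart = s * numTargets;
            unsigned sEnd = sStart + numTargets;
            // Half-open ranges overlap when the later start precedes the earlier end.
            if (std::max(tStart, sStart) < std::min(tEnd, sEnd))
                counts[t]++;
        }
    }
    return counts;
}

// True if a push would have more than one source writing to some target.
bool anyTargetHasMultipleSources(unsigned numSources, unsigned numTargets)
{
    std::vector<unsigned> counts = sourcesPerTarget(numSources, numTargets);
    return std::any_of(counts.begin(), counts.end(), [](unsigned c) { return c > 1; });
}
```

In this model, 2 sources and 3 targets give per-target counts of 1, 2, 1 (the middle target is shared), a single source never shares a target, and a 20 to 30 copy also produces shared targets.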

bool wantPush = false;
if (targets.ordinality() < sources.ordinality()) // implying multiple writes to same target, force pull, and will common up on matching target filenames
wantPull = true;
else if (targets.ordinality() > sources.ordinality()) // targets > sources. i.e. multiple splits of source files for each target
Member

If doing an M to N copy (e.g. 20 to 30) you could still have multiple writes to the same target. Probably an edge case we don't care about - since that is legal in this branch.

Member Author

This is only when targetSupportsConcurrentWrite = true (e.g. bare-metal, not cloud), so it's okay / desirable (or at least maintains previous behaviour) to continue with push even if multiple sources are being pushed to the same target in this case?

unsigned __int64 value = plane.getPropInt64(prop.c_str(), unsetPlaneAttrValue);
if (unsetPlaneAttrValue != value)
{
if (attrInfo.scale)
Member

minor: For simplicity I would have coded using a scale of 1 for the booleans and unconditionally multiplied by the scale.

Member Author

I thought about that but chose not to, with the thought that we might later have strings too, and would want to scale only where it explicitly makes sense.

{
unsigned __int64 v = plane.getPropInt64("expert/@fileSyncMaxRetrySecs", unsetPlaneAttrValue);
// NB: fileSyncMaxRetrySecs==0 is treated as set/enabled
value = v != unsetPlaneAttrValue ? 1 : 0;
Member

I know this hasn't changed, but I think this should actually be

if (v != unsetPlaneAttrValue)
   value = 1;

but changing it for 9.6.x would probably be a mistake.
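The difference between the two forms can be shown in isolation. This is a minimal sketch with hypothetical names (`unsetValue`, `resolveOverwrite`, `resolveIfSet`), assuming the value starts out holding the "unset" sentinel; it is not the plane attribute code itself.

```cpp
#include <cstdint>

// Hypothetical stand-in for the sentinel meaning "attribute not configured".
constexpr uint64_t unsetValue = UINT64_MAX;

// Ternary form: always assigns, so an unconfigured attribute becomes 0.
uint64_t resolveOverwrite(uint64_t current, uint64_t configured)
{
    current = (configured != unsetValue) ? 1 : 0;
    return current;
}

// Conditional form: assigns only when configured, so an unconfigured
// attribute keeps whatever value it already had (here, the sentinel).
uint64_t resolveIfSet(uint64_t current, uint64_t configured)
{
    if (configured != unsetValue)
        current = 1;
    return current;
}
```

Both forms return 1 when the setting is present, including when it is present with the value 0. They differ only when the setting is absent: the ternary form reports 0, while the conditional form leaves the value as unset so a caller can still tell that nothing was configured.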

Member Author

well it's all deprecated (fileSyncMaxRetrySecs) - I am just keeping what it used to be, which was that if 0, it did enable fsync with no retry.

StringBuffer fileFlagsStr;
- if (getComponentConfigSP()->getProp("expert/@enableIFileMask", fileFlagsStr) || getGlobalConfigSP()->getProp("expert/@enableIFileMask", fileFlagsStr))
+ if (componentConfig->getProp("expert/@enableIFileMask", fileFlagsStr) || getGlobalConfigSP()->getProp("expert/@enableIFileMask", fileFlagsStr))
Member

minor: Could also move getGlobalConfigSP() into local to avoid multiple calls.

Member Author

will change

@jakesmith jakesmith requested a review from ghalliday November 1, 2024 13:30
Member

@ghalliday ghalliday left a comment

Looks logically correct, but I'm not sure the code is in the right place. One trivial typo.

if (targetSupportsConcurrentWrite)
return; // push ok

bool multilpeSourcesPerTarget = false;
Member

typo: multiple

@@ -3353,6 +3405,10 @@ void FileSprayer::spray()
}
addEmptyFilesToPartition();

if (!usePullOperation())
checkPushSupported(); // will switch to pull if push not supported
Member

This placement doesn't seem right. Better would be for this call to be inside usePullOperation().... or in calcUsePull as suggested below

Member Author

it's too early. partitions are created later, but usePullOperation (and calcUsePull) are called early.

return; // push ok

IWARNLOG("Forcing pull. Multiple source partitions write to same target, and target does not support concurrent write");
cachedUsePull = true;
Member

cleaner to return a boolean and only update cachedUsePull inside usePullOperation()

Member Author

usePullOperation() is called too early, before the partitions are created (by e.g. calculateOne2OnePartition, calculateSprayPartition etc)

else
{
// may still not be allowed. After partitioning, if multiple sources write to same target, then pull is forced (see checkPushSupported())
return false;
Member

can this call checkPushSupported() at this point?

@jakesmith jakesmith marked this pull request as draft November 4, 2024 09:39
@jakesmith
Member Author

@ghalliday - pushed a new commit (not tested) to discuss later.

Basically the approach is:

  1. Check upfront, before partitioning, whether push and pull are supported for the operation (checkPushAndPullSupport()), and reject user requests that are incompatible.
  2. Partition using the preferred method (containerized will use push if the operation supports it; bare-metal decides [unchanged] based on 'calcOutput'). See the createFormatPartitioner changes.
  3. After partitioning, calculate whether to use push or pull (calcPushPullOperation -> calcUsePull), including considering checkFoMultipleSourcesPerTarget()
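The three steps above can be sketched as a small decision flow. Everything here is a hypothetical simplification for discussion: the `SprayRequest` and `Partition` types, the function names, and the default of preferring push are assumptions, and none of it is the committed FileSprayer code.

```cpp
#include <cassert>
#include <stdexcept>
#include <vector>

// Hypothetical types for illustration only.
struct Partition { unsigned source; unsigned target; };

struct SprayRequest
{
    bool pushRequested = false;
    bool pullRequested = false;
    bool pushSupported = true;                 // decided by the operation type
    bool pullSupported = true;
    bool targetSupportsConcurrentWrite = true; // e.g. a plane attribute
};

// Step 1: before partitioning, reject requests the operation cannot honour.
void rejectUnsupportedRequest(const SprayRequest &req)
{
    if (req.pushRequested && !req.pushSupported)
        throw std::invalid_argument("push requested but not supported");
    if (req.pullRequested && !req.pullSupported)
        throw std::invalid_argument("pull requested but not supported");
}

// Step 3 helper: after partitioning, does any target receive data from >1 source?
bool hasMultipleSourcesPerTarget(const std::vector<Partition> &parts, unsigned numTargets)
{
    std::vector<long> firstSource(numTargets, -1);
    for (const Partition &p : parts)
    {
        assert(p.target < numTargets);
        if (firstSource[p.target] == -1)
            firstSource[p.target] = p.source;
        else if (firstSource[p.target] != static_cast<long>(p.source))
            return true;
    }
    return false;
}

// Step 3: final decision, made once the partitions (step 2) are known.
bool decideUsePull(const SprayRequest &req, const std::vector<Partition> &parts, unsigned numTargets)
{
    if (!req.pushSupported)
        return true;
    if (!req.pullSupported)
        return false;
    if (!req.targetSupportsConcurrentWrite && hasMultipleSourcesPerTarget(parts, numTargets))
        return true; // forced pull: pushing would mean concurrent writes to one file
    return req.pullRequested; // assumption: otherwise push unless pull was asked for
}
```

The point of the ordering is that the concurrent-write check needs the partition list, so it cannot run in the early validation step.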

@jakesmith jakesmith force-pushed the HPCC-32873-spray-avoid-concurrent-write branch from 1ad9731 to f0af494 Compare November 4, 2024 17:52
prop += "@" + std::string(attrInfo.name);
switch (attrInfo.type)
{
case PlaneAttrType::integer:
Member Author

@ghalliday: NB: it wasn't okay that boolean/integer handling was commoned up here (before the last commit), because it meant "true"/"false" would not be treated correctly if defined in the config (only getPropBool looks at these literals).
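The general problem can be illustrated outside the property-tree code. This is a sketch with hypothetical helpers (`parseAsInteger`, `parseAsBool`) built on the standard library; it makes no claim about how getPropInt64 or getPropBool are implemented, only about what a plain numeric parse does with boolean literals.

```cpp
#include <cstdint>
#include <cstdlib>
#include <string>

// Hypothetical numeric reader: std::strtoull performs no conversion on
// non-numeric text and returns 0, so "true" and "false" both read as 0.
uint64_t parseAsInteger(const std::string &text)
{
    return std::strtoull(text.c_str(), nullptr, 10);
}

// Hypothetical boolean reader: recognises the literal as well as "1".
bool parseAsBool(const std::string &text)
{
    return text == "true" || text == "1";
}
```

With a shared integer path, a config value of "true" would come back as 0 and be indistinguishable from "false", which is why the boolean attributes need their own branch.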

{
unsigned __int64 v = plane.getPropInt64("expert/@fileSyncMaxRetrySecs", unsetPlaneAttrValue);
// NB: fileSyncMaxRetrySecs==0 is treated as set/enabled
if (unsetPlaneAttrValue != v)
Member Author

@ghalliday - as you pointed out before I think, this was wrong before this final commit. As it was, it caused the value to be 0 if not set (which is the default anyway, but..)

@jakesmith jakesmith force-pushed the HPCC-32873-spray-avoid-concurrent-write branch from f0af494 to 6a3a023 Compare November 4, 2024 17:59
Member Author

@jakesmith jakesmith left a comment

@ghalliday - please review. I've kept the plane config code in.
I could remove it, and hard-code getConcurrentWriteSupported to return default, and move the plane attr stuff to a new PR targeting master...

@jakesmith jakesmith requested a review from ghalliday November 4, 2024 18:01
@jakesmith jakesmith marked this pull request as ready for review November 4, 2024 20:43
Member

@ghalliday ghalliday left a comment

@jakesmith one minor comment, otherwise approved. Please squash.

if (isContainerized() && noCommon)
{
IWARNLOG("Ignoring noCommon option in containerized mode");
noCommon = false;
Member

has no effect (variable not used after this point) - coverity may complain.

Maybe restructure as:

    if (noCommon)
    {
        if (!isContainerized())
            return;
        IWARNLOG("Ignoring noCommon option in containerized mode");
    }

This does mean small sprays on bare-metal may use lots of processes, but that will be dealt with in a later PR.

Member Author

Will change noCommon test.

> This does mean small sprays on bare-metal may use lots of processes, but that will be dealt with in a later PR.

yes, no change there, but should be changed when larger refactoring is done.

@jakesmith jakesmith force-pushed the HPCC-32873-spray-avoid-concurrent-write branch from d238085 to dc4b77e Compare November 5, 2024 18:57
@jakesmith
Member Author

@ghalliday - squashed.

@ghalliday ghalliday merged commit 230ce80 into hpcc-systems:candidate-9.8.x Nov 11, 2024
49 checks passed

Jirabot Action Result:
Added fix version: 9.8.38
Workflow Transition: 'Resolve issue'
