HPCC-32873 Prevent concurrent write to same file when spraying/despraying #19240
Jira Issue: https://hpccsystems.atlassian.net//browse/HPCC-32873 Jirabot Action Result: |
ccaf0ef
to
fe6eb67
Compare
fe6eb67
to
c4739fc
Compare
system/jlib/jfile.cpp
Outdated
values[FileSyncWriteClose] = v != unsetPlaneAttrValue ? 1 : 0;
}
else
values[propNum] = false;
= unsetPlaneAttrValue
@ghalliday - please see 2nd commit.
A few comments. The only one that I think needs addressing: pushing a single file will be switched to pull if the target does not support multiple writes. Probably not a major problem, although inefficient. The logging could be confusing though.
dali/ft/filecopy.cpp
Outdated
else
targetSupportsConcurrentWrite = true;
if (!targetPlane.isEmpty())
targetSupportsConcurrentWrite = 0 != getPlaneAttributeValue(targetPlane, ConcurrentWriteSupport, targetSupportsConcurrentWrite ? 1 : 0);
for future: getPlaneAttributeBool() would clean the code up
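A minimal sketch of what such a helper could look like. Everything here is a stand-in for illustration only: the `PlaneAttr` enum, the map-backed `getPlaneAttributeValue`, and the `getPlaneAttributeBool` signature are assumptions, not the actual jlib declarations.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <utility>

// Hypothetical stand-ins; the real jlib declarations may differ.
enum PlaneAttr { ConcurrentWriteSupport };

// Toy attribute store keyed by (plane name, attribute).
static std::map<std::pair<std::string, int>, uint64_t> planeAttrs;

// Stand-in for the integer lookup: returns the stored value, or defaultValue
// when the plane does not define the attribute.
uint64_t getPlaneAttributeValue(const std::string &plane, PlaneAttr attr, uint64_t defaultValue)
{
    auto it = planeAttrs.find(std::make_pair(plane, static_cast<int>(attr)));
    return it == planeAttrs.end() ? defaultValue : it->second;
}

// Hypothetical boolean wrapper: hides the bool <-> integer conversions
// that the caller currently has to spell out.
bool getPlaneAttributeBool(const std::string &plane, PlaneAttr attr, bool defaultValue)
{
    return 0 != getPlaneAttributeValue(plane, attr, defaultValue ? 1 : 0);
}
```

With a wrapper like this, the call site would reduce to something like `targetSupportsConcurrentWrite = getPlaneAttributeBool(targetPlane, ConcurrentWriteSupport, targetSupportsConcurrentWrite);`.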
dali/ft/filecopy.cpp
Outdated
// always pull
if (pushRequested)
IWARNLOG("Ignoring push option as targets < sources and target does not support concurrent write");
return true; // could be push if equal # of soruces and targets but no point
trivial: typo "soruces"
fixed
dali/ft/filecopy.cpp
Outdated
return false;
// always pull
if (pushRequested)
IWARNLOG("Ignoring push option as targets < sources and target does not support concurrent write");
I think this could occur if targets > sources - including sources==1 where push may make sense, especially if it is small.
Multiple writes could be involved even if targets < sources (see comment on line 4072)
If there is more than 1 source and we are pushing, then each source pusher may write to the same target, e.g. with 2 sources pushing to 3 targets, each source will push to the middle target?
sources == 1 .. ok, that is special and will only ever be pushing 1 partition to a single target?
So special-case sources == 1 here and keep the push preference if specified; otherwise, calculate the size and push if it is below the threshold, and pull otherwise?
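The 2-to-3 case can be checked with a small sketch. It assumes, purely for illustration, that the data is split into equal contiguous byte ranges on both the source and target side; the sprayer's real partitioning is more involved, and `sourcesWritingTo` and `anyTargetHasMultipleWriters` are hypothetical helpers.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Returns, for each target, how many sources overlap its byte range, assuming
// totalSize bytes are split into equal contiguous ranges on both sides.
std::vector<unsigned> sourcesWritingTo(unsigned numSources, unsigned numTargets, uint64_t totalSize)
{
    std::vector<unsigned> counts(numTargets, 0);
    for (unsigned t = 0; t < numTargets; t++)
    {
        uint64_t tStart = totalSize * t / numTargets;
        uint64_t tEnd = totalSize * (t + 1) / numTargets;
        for (unsigned s = 0; s < numSources; s++)
        {
            uint64_t sStart = totalSize * s / numSources;
            uint64_t sEnd = totalSize * (s + 1) / numSources;
            if (sStart < tEnd && tStart < sEnd) // half-open ranges overlap
                counts[t]++;
        }
    }
    return counts;
}

// True when at least one target would be written by more than one source.
bool anyTargetHasMultipleWriters(unsigned numSources, unsigned numTargets, uint64_t totalSize)
{
    for (unsigned c : sourcesWritingTo(numSources, numTargets, totalSize))
        if (c > 1)
            return true;
    return false;
}
```

Under that assumption, 2 sources and 3 targets give one writer for the outer targets and two for the middle one, a single source never produces more than one writer per target, and an M to N copy such as 20 to 30 still has targets with two writers.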
dali/ft/filecopy.cpp
Outdated
bool wantPush = false;
if (targets.ordinality() < sources.ordinality()) // implying multiple writes to same target, force pull, and will common up on matching target filenames
wantPull = true;
else if (targets.ordinality() > sources.ordinality()) // targets > sources. i.e. multiple splits of source files for each target
If doing an M to N copy (e.g. 20 to 30) you could still have multiple writes to the same target. Probably an edge case we don't care about - since that is legal in this branch.
this is when targetSupportsConcurrentWrite = true only (e.g. BM not cloud), so it's okay / desirable (or at least maintaining previous behaviour) to continue with push even if multiple sources being pushed to same target in this case?
unsigned __int64 value = plane.getPropInt64(prop.c_str(), unsetPlaneAttrValue);
if (unsetPlaneAttrValue != value)
{
if (attrInfo.scale)
minor: For simplicity I would have coded using a scale of 1 for the booleans and unconditionally multiplied by the scale.
I thought about that but chose not to, with the thought that we might later have strings too, and would want to scale only where it explicitly makes sense.
system/jlib/jfile.cpp
Outdated
{
unsigned __int64 v = plane.getPropInt64("expert/@fileSyncMaxRetrySecs", unsetPlaneAttrValue);
// NB: fileSyncMaxRetrySecs==0 is treated as set/enabled
value = v != unsetPlaneAttrValue ? 1 : 0;
I know this hasn't changed, but I think this should actually be
if (v != unsetPlaneAttrValue)
value = 1;
but changing it for 9.6.x would probably be a mistake.
well it's all deprecated (fileSyncMaxRetrySecs) - I am just keeping what it used to be, which was that if 0, it did enable fsync with no retry.
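To make the difference between the two forms concrete, here is a small sketch. The sentinel value and both helper functions are assumptions for illustration; only the two assignment patterns come from the snippets in this thread.

```cpp
#include <cassert>
#include <cstdint>

// Assumed sentinel for "attribute not configured"; the real constant's value may differ.
constexpr uint64_t unsetPlaneAttrValue = UINT64_MAX;

// Ternary form: always overwrites, so an unset attribute forces the value to 0.
uint64_t resolveOverwrite(uint64_t v, uint64_t current)
{
    (void)current; // the existing value is ignored in this form
    return v != unsetPlaneAttrValue ? 1 : 0;
}

// Conditional form: only a configured attribute (including 0) changes the value;
// an unset attribute leaves whatever default was already in place.
uint64_t resolveKeepDefault(uint64_t v, uint64_t current)
{
    if (v != unsetPlaneAttrValue)
        return 1;
    return current;
}
```

Both forms treat a configured 0 as enabled. They differ only when the attribute is unset and the existing value is non-zero.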
system/jlib/jfile.cpp
Outdated
StringBuffer fileFlagsStr;
if (getComponentConfigSP()->getProp("expert/@enableIFileMask", fileFlagsStr) || getGlobalConfigSP()->getProp("expert/@enableIFileMask", fileFlagsStr))
if (componentConfig->getProp("expert/@enableIFileMask", fileFlagsStr) || getGlobalConfigSP()->getProp("expert/@enableIFileMask", fileFlagsStr))
minor: Could also move getGlobalConfigSP() into local to avoid multiple calls.
will change
Looks logically correct, but I'm not sure the code is in the right place. One trivial typo.
dali/ft/filecopy.cpp
Outdated
if (targetSupportsConcurrentWrite)
return; // push ok

bool multilpeSourcesPerTarget = false;
typo: multiple
dali/ft/filecopy.cpp
Outdated
@@ -3353,6 +3405,10 @@ void FileSprayer::spray()
}
addEmptyFilesToPartition();

if (!usePullOperation())
checkPushSupported(); // will switch to pull if push not supported
This placement doesn't seem right. Better would be for this call to be inside usePullOperation().... or in calcUsePull as suggested below
it's too early. partitions are created later, but usePullOperation (and calcUsePull) are called early.
dali/ft/filecopy.cpp
Outdated
return; // push ok

IWARNLOG("Forcing pull. Multiple source partitions write to same target, and target does not support concurrent write");
cachedUsePull = true;
cleaner to return a boolean and only update cachedUsePull inside usePullOperation()
usePullOperation() is called too early, before the partitions are created (by e.g. calculateOne2OnePartition, calculateSprayPartition etc)
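The ordering constraint can be illustrated with a simplified, hypothetical model. `SprayerSketch`, `partitionTargets`, and `pushWouldCollide` are invented for this sketch and are not FileSprayer members. It assumes each partition comes from a distinct source, and combines the suggestion of a boolean check with the point that the check can only run once partitions exist.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified model; not the actual FileSprayer class.
struct SprayerSketch
{
    bool targetSupportsConcurrentWrite = false;
    bool cachedUsePull = false;              // decided early, before partitioning
    std::vector<unsigned> partitionTargets;  // target index per source partition, filled later

    // Early decision: partitions do not exist yet, so only the cached result is returned.
    bool usePullOperation() const { return cachedUsePull; }

    // Post-partition check: true when two partitions would write to the same
    // target and the target cannot cope with concurrent writers.
    bool pushWouldCollide() const
    {
        if (targetSupportsConcurrentWrite)
            return false;
        for (std::size_t i = 0; i < partitionTargets.size(); i++)
            for (std::size_t j = i + 1; j < partitionTargets.size(); j++)
                if (partitionTargets[i] == partitionTargets[j])
                    return true;
        return false;
    }

    void spray()
    {
        // ... partitions would be calculated here ...
        if (!usePullOperation() && pushWouldCollide())
            cachedUsePull = true; // switch to pull now that partitions are known
    }
};
```

In this model the early decision can only be revised inside `spray()`, because that is the first point at which `partitionTargets` is populated.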
dali/ft/filecopy.cpp
Outdated
else
{
// may still not be allowed. After partitioning, if multiple sources write to same target, then pull is forced (see checkPushSupported())
return false;
can this call checkPushSupported() at this point?
@ghalliday - pushed a new commit (not tested) to discuss later. Basically the approach is:
|
1ad9731
to
f0af494
Compare
prop += "@" + std::string(attrInfo.name);
switch (attrInfo.type)
{
case PlaneAttrType::integer:
@ghalliday: NB: it wasn't okay that the boolean/integer handling was commoned up here (before the last commit), because it meant "true"/"false" would not be treated correctly if defined in the config (only getPropBool looks at these literals).
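A rough illustration of the problem, using plain standard-library parsing rather than the property-tree calls themselves. `readAsInteger` and `readAsBool` are hypothetical stand-ins and do not reproduce how getPropInt64 or getPropBool are implemented.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>

// Integer-style read: non-numeric text such as "true" parses as 0.
unsigned long long readAsInteger(const char *text)
{
    return std::strtoull(text, nullptr, 10);
}

// Boolean-style read (hypothetical): recognises the literal "true" and "1";
// everything else is treated as false.
bool readAsBool(const char *text)
{
    return 0 == std::strcmp(text, "true") || 0 == std::strcmp(text, "1");
}
```

If a boolean attribute configured as "true" goes through the integer path, it reads as 0 and so looks disabled, which is why the two types need separate cases.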
{
unsigned __int64 v = plane.getPropInt64("expert/@fileSyncMaxRetrySecs", unsetPlaneAttrValue);
// NB: fileSyncMaxRetrySecs==0 is treated as set/enabled
if (unsetPlaneAttrValue != v)
@ghalliday - as you pointed out before I think, this was wrong before this final commit. As it was, it caused the value to be 0 if not set (which is the default anyway, but..)
f0af494
to
6a3a023
Compare
@ghalliday - please review. I've kept the plane config code in.
I could remove it, and hard-code getConcurrentWriteSupported to return default, and move the plane attr stuff to a new PR targeting master...
@jakesmith one minor comment, otherwise approved. Please squash.
dali/ft/filecopy.cpp
Outdated
if (isContainerized() && noCommon)
{
IWARNLOG("Ignoring noCommon option in containerized mode");
noCommon = false;
has no effect (variable not used after this point) - coverity may complain.
Maybe restructure as:
if (noCommon)
{
if (!isContainerized())
return;
IWARNLOG("Ignoring noCommon option in containerized mode");
}
This does mean small sprays on bare-metal may use lots of processes, but that will be dealt with in a later PR.
Will change noCommon test.
"This does mean small sprays on bare-metal may use lots of processes, but that will be dealt with in a later PR."
yes, no change there, but should be changed when larger refactoring is done.
…ying Signed-off-by: Jake Smith <[email protected]>
d238085
to
dc4b77e
Compare
@ghalliday - squashed.