
Add filesize sniffing and parallelize importing on jobs #5133

Open
stxue1 wants to merge 37 commits into master

Conversation

@stxue1 (Contributor) commented Oct 24, 2024

Closes #5114

Changelog Entry

To be copied to the draft changelog by merger:

  • Add support for parallel file imports
    • Supersedes Support importing on workers in WDL #5103 and Allow importing on workers #5098
    • New argument --importWorkersThreshold. This sets the file size at which a file gets its own import job; smaller files are batched together into a shared import job until the batch total reaches this threshold.
    • New argument --importWorkersDisk, defaulting to 1 MiB, sets the disk allocation for import jobs. It should be increased when download streaming is not possible on a worker.
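The batching rule behind --importWorkersThreshold could be sketched roughly as follows (illustrative only; the function name and signature are invented, not Toil's actual implementation):

```python
def batch_imports(file_sizes: dict[str, int], threshold: int) -> list[list[str]]:
    """Group files into import batches by size.

    Files at or above ``threshold`` bytes get a dedicated batch (their own
    import job); smaller files are packed into a shared batch until adding
    another file would push the batch total past the threshold.
    """
    batches: list[list[str]] = []
    current: list[str] = []
    current_size = 0
    for uri, size in file_sizes.items():
        if size >= threshold:
            # Large file: gets its own import job.
            batches.append([uri])
            continue
        if current and current_size + size > threshold:
            # Shared batch is full; start a new one.
            batches.append(current)
            current, current_size = [], 0
        current.append(uri)
        current_size += size
    if current:
        batches.append(current)
    return batches
```

With a 100-byte threshold, a 500-byte file lands in its own batch while the small files share one.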

Reviewer Checklist

  • Make sure it is coming from issues/XXXX-fix-the-thing in the Toil repo, or from an external repo.
    • If it is coming from an external repo, make sure to pull it in for CI with:
      contrib/admin/test-pr otheruser theirbranchname issues/XXXX-fix-the-thing
      
    • If there is no associated issue, create one.
  • Read through the code changes. Make sure that it doesn't have:
    • Addition of trailing whitespace.
    • New variable or member names in camelCase that want to be in snake_case.
    • New functions without type hints.
    • New functions or classes without informative docstrings.
    • Changes to semantics not reflected in the relevant docstrings.
    • New or changed command line options for Toil workflows that are not reflected in docs/running/{cliOptions,cwl,wdl}.rst
    • New features without tests.
  • Comment on the lines of code where problems exist with a review comment. You can shift-click the line numbers in the diff to select multiple lines.
  • Finish the review with an overall description of your opinion.

Merger Checklist

  • Make sure the PR passes tests.
  • Make sure the PR has been reviewed since its last modification. If not, review it.
  • Merge with the GitHub "Squash and merge" feature.
    • If there are multiple authors' commits, add Co-authored-by to give credit to all contributing authors.
  • Copy its recommended changelog entry to the Draft Changelog.
  • Append the issue number in parentheses to the changelog entry.

@stxue1 stxue1 marked this pull request as ready for review October 25, 2024 17:34
@adamnovak (Member) left a comment

I think the shape of the import logic is right, but I'm concerned about some variable names and some duplicated code, and about whether the CWL-side import logic actually has all its interdependencies documented sufficiently.

Comment on lines 137 to 143
from toil.jobStores.abstractJobStore import (
AbstractJobStore,
InvalidImportExportUrlException,
LocatorException,
NoSuchFileException,
LocatorException,
InvalidImportExportUrlException,
UnimplementedURLException,
)
Member

It looks like this might be undoing some of the formatting/import sorting improvements we recently merged; maybe the PR should be run through the code formatter Makefile target?

Contributor Author

It seems like the Makefile code formatter formats quite a few files, except for cwltoil.py. I'll manually undo this change; maybe it was done by accident while I was messing with imports.

Comment on lines 1837 to 1845
def extract_files(
fileindex: Dict[str, str],
existing: Dict[str, str],
file_metadata: CWLObjectType,
mark_broken: bool = False,
skip_remote: bool = False,
) -> Optional[str]:
"""
Extract the filename from a CWL file record
Member

If this operates on just one filename, it should have a singular name.

This function also does more than just extract a filename from a record. It consults and sometimes updates fileindex and it will also update file_metadata to fill in its location from its path, and it will find the realpath (i.e. resolve symlinks) for bare file paths but for some reason not for file:// URIs. The docstring needs to explain why it does these things. Otherwise the caller will be surprised when it happens.

Contributor Author

This function borrows a lot of code from upload_file and write_file. I think that if the file:// scheme is present, some resolving was already done.

"""
Extract the filename from a CWL file record
:param fileindex: Forward mapping of filename
:param existing: Reverse mapping of filename. This function does not use this
Member

We need this argument to match a pre-defined function signature, right? Maybe we should mention where that kind of function is documented. And if there isn't any documentation for it, we should come up with a name for it and document it.

if not urlparse(location).scheme:
rp = os.path.realpath(location)
else:
rp = location
Member

Why aren't symlinks in the file:// URI's path resolved here?

Contributor Author

I borrowed this code from write_file. I believe all files should be resolved into file URIs by this point, so this is likely some edge case. From my limited testing, I can't trigger the realpath branch.
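A variant that resolves symlinks for both bare paths and file:// URIs, as the reviewer asks about, might look like this (a sketch; resolve_symlinks is an invented name, not code from the PR):

```python
import os
from urllib.parse import urlparse


def resolve_symlinks(location: str) -> str:
    """Resolve symlinks in a bare path or a file:// URI.

    Bare paths (no scheme) go straight through os.path.realpath; for
    file:// URIs the path component is resolved and the URI rebuilt.
    Remote schemes (http, s3, ...) are returned unchanged.
    """
    parsed = urlparse(location)
    if not parsed.scheme:
        # Bare local path: resolve directly.
        return os.path.realpath(location)
    if parsed.scheme == "file":
        # Resolve the path component, then rebuild the URI.
        return "file://" + os.path.realpath(parsed.path)
    return location
```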

Comment on lines 1873 to 1875
# This is a local file, or we also need to download and re-upload remote files
if location not in fileindex:
# don't download twice
Member

This function doesn't have any responsibility for deciding whether to download things; why are these comments talking about whether to download things?

"""
return is_url_with_scheme(filename, REMOTE_SCHEMES)
Resolve relative-URI files in the given environment and them then into absolute normalized URIs. Returns a dictionary of WDL file values to a tuple of the normalized URI,
Member

I don't think and them then makes sense.

We might also want to say something different than "WDL file values"; we mean the string values that would appear in the value field of a WDL.Value.File object. But you could also think that a WDL.Value.File object is itself a "WDL file value".

Comment on lines 1364 to 1386
except UnimplementedURLException as e:
# We can't find anything that can even support this URL scheme.
# Report to the user, they are probably missing an extra.
logger.critical("Error: " + str(e))
raise
except HTTPError as e:
# Something went wrong looking for it there.
logger.warning(
"Checked URL %s but got HTTP status %s", candidate_uri, e.code
)
# Try the next location.
continue
except FileNotFoundError:
# Wasn't found there
continue
except Exception:
# Something went wrong besides the file not being found. Maybe
# we have no auth.
logger.error(
"Something went wrong when testing for existence of %s",
candidate_uri,
)
raise
Member

This duplicates a lot of code with the CWL-side get_file_sizes(). Is there a common function that could be extracted here that polls a URL and returns whether it existed and, if so, the size if available, and raises on other errors?

Contributor Author

Separated it out into job.py
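The common helper the review asks for, polling a URL for existence and size, could be shaped roughly like this (a sketch mirroring the exception handling in the quoted hunk; the helper actually moved into job.py may differ):

```python
import logging
from typing import Callable, Optional, Tuple
from urllib.error import HTTPError

logger = logging.getLogger(__name__)


def poll_url(candidate_uri: str,
             get_size: Callable[[str], Optional[int]]) -> Tuple[bool, Optional[int]]:
    """Check whether a URL exists and, if possible, determine its size.

    ``get_size`` is any callable that returns the size in bytes for a URI,
    raising HTTPError or FileNotFoundError when the resource is absent.
    Returns (exists, size); size is None when it cannot be determined.
    Unexpected errors (e.g. missing credentials) propagate to the caller.
    """
    try:
        return True, get_size(candidate_uri)
    except HTTPError as e:
        # Something went wrong looking for it there; treat as not found.
        logger.warning("Checked URL %s but got HTTP status %s", candidate_uri, e.code)
        return False, None
    except FileNotFoundError:
        # Wasn't found there.
        return False, None
    except Exception:
        # Something went wrong besides the file not being found.
        logger.error("Something went wrong when testing for existence of %s", candidate_uri)
        raise
```

Both the WDL and CWL callers can then loop over candidate URIs and `continue` whenever `exists` is False.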

Comment on lines 1423 to 1428
def convert_files(
environment: WDLBindings,
file_to_id: Dict[str, FileID],
file_to_data: Dict[str, FileMetadata],
task_path: str,
) -> None:
Member

Instead of altering the File objects inside environment in place, this function should return a modified copy of environment. The WDLBindings objects in MiniWDL I think are meant to be immutable.
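The copy-instead-of-mutate pattern being requested can be illustrated with a stand-in File type (MiniWDL's real WDL.Value.File and bindings API are not reproduced here; this is only the shape of the change):

```python
from dataclasses import dataclass, replace
from typing import Dict, List


@dataclass(frozen=True)
class File:
    # Stand-in for WDL.Value.File; frozen to mimic immutable bindings.
    value: str


def convert_files_copy(environment: List[File],
                       file_to_id: Dict[str, str]) -> List[File]:
    """Return a new environment with File values rewritten to their imported
    IDs, leaving the original environment untouched."""
    return [replace(f, value=file_to_id.get(f.value, f.value))
            for f in environment]
```

The caller rebinds the result instead of relying on in-place edits, which is safer when the bindings are meant to be immutable.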

@@ -5180,8 +5401,8 @@ class WDLStartJob(WDLSectionJob):

def __init__(
self,
target: WDL.Tree.Workflow | WDL.Tree.Task,
inputs: WDLBindings,
target: Union[WDL.Tree.Workflow, WDL.Tree.Task],
Member

I think using | everywhere instead of Union is another pyupgrade change that we want to keep.



def make_root_job(
target: WDL.Tree.Workflow | WDL.Tree.Task,
inputs: WDLBindings,
inputs_search_path: list[str],
inputs_search_path: List[str],
Member

@mr-c also changed to using the new generic support in the base list, dict, etc. instead of needing to import the versions from typing, so we shouldn't undo that.

Outdated review comments on src/toil/cwl/cwltoil.py were resolved.
@adamnovak (Member) left a comment

I still don't think the idea of hoping to notice via exception whether we have exhausted our disk space is going to work. Some backends just kill the job instead of giving you an exception, whereas others let you plow right through your disk limit and interfere with other jobs. (Usually other jobs in the workflow, of which there shouldn't really be any of note during file import, but it's still not a thing jobs are meant to knowingly do.)

I am also still dubious of the extract_file_uri_once design being the best approach there (who would think to themselves "I want to get the file URI but only if it hasn't already been put in the cache"?). But it kind of has to be the shape it is to be mapped over the files, so maybe it really is the best we can do?
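The extract_file_uri_once shape being discussed, returning a URI only the first time it is seen so the function can be mapped over every file record, reduces to something like this (an illustrative sketch, not the PR's actual signature):

```python
from typing import Optional


def extract_file_uri_once(seen: set[str], location: str) -> Optional[str]:
    """Return location the first time it is seen, and None afterwards.

    Returning None for already-seen files means a caller can map this over
    all file records and collect exactly the URIs that still need to be
    imported, without scheduling any import twice.
    """
    if location in seen:
        return None
    seen.add(location)
    return location
```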

Outdated review comments on src/toil/cwl/cwltoil.py and src/toil/job.py were resolved.
stxue1 and others added 5 commits November 18, 2024 11:57
…ular import and add new argument to make user deal with import worker disk size when streaming is not available. Fix a bug with file mutation as well.
Co-authored-by: Adam Novak <[email protected]>
@adamnovak adamnovak self-requested a review November 20, 2024 18:06
Without the stream switch there's no reason to use disk_size instead of disk.
@adamnovak (Member) left a comment

I think that this looks good now.


Successfully merging this pull request may close these issues.

Add filesize sniffing to specify the import job's disk space for WDL and CWL