Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RE2022-217: Workspace uploader performance #426

Merged
merged 42 commits into from
Feb 1, 2024
Merged

Conversation

Xiangs18
Copy link
Collaborator

No description provided.

@codecov
Copy link

codecov bot commented Aug 23, 2023

Codecov Report

Attention: 102 lines in your changes are missing coverage. Please review.

Comparison is base (8b48835) 46.97% compared to head (c92fe2f) 47.00%.

Files Patch % Lines
...c/loaders/workspace_uploader/workspace_uploader.py 63.20% 46 Missing ⚠️
src/loaders/common/callback_server_wrapper.py 3.44% 28 Missing ⚠️
src/loaders/common/loader_helper.py 20.00% 24 Missing ⚠️
...aders/workspace_downloader/workspace_downloader.py 0.00% 4 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #426      +/-   ##
==========================================
+ Coverage   46.97%   47.00%   +0.02%     
==========================================
  Files          75       75              
  Lines        6518     6612      +94     
==========================================
+ Hits         3062     3108      +46     
- Misses       3456     3504      +48     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@Xiangs18
Copy link
Collaborator Author

Xiangs18 commented Sep 5, 2023

save_assemblies_from_fastas_integration_test.py::test_import_fasta_mass_timing Start uploading to 69141 in test_import_fasta_mass_timing fn
validating parameters
The following files are passed in _import_fasta_mass fn:


GCA_002506415.1_ASM250641v1_genomic.fna.gz
GCF_000007065.1_ASM706v1_genomic.fna.gz
GCF_000970165.1_ASM97016v1_genomic.fna.gz
GCF_000970185.1_ASM97018v1_genomic.fna.gz
GCF_000970205.1_ASM97020v1_genomic.fna.gz
GCF_000970245.1_ASM97024v1_genomic.fna.gz
GCF_000979375.1_gtlEnvA5udCFS_genomic.fna.gz
GCF_000979555.1_gtlEnvA5udCFS_genomic.fna.gz
GCF_000980105.1_gtlEnvA5udCFS_genomic.fna.gz
GCF_000980175.1_gtlEnvA5udCFS_genomic.fna.gz


  • staged 10 files in 5.6641364097595215 seconds
    parsing FASTA file: /kb/module/work/tmp/import_fasta_0955b238-9e96-4095-9172-dddd4dfff0c6/GCA_002506415.1_ASM250641v1_genomic.fna
  • parsed 130 contigs, 3839682 bp
    parsing FASTA file: /kb/module/work/tmp/import_fasta_878d8535-ea93-404b-b9a3-d43e67ca5f69/GCF_000007065.1_ASM706v1_genomic.fna
  • parsed 1 contigs, 4096345 bp
    parsing FASTA file: /kb/module/work/tmp/import_fasta_b4a1b8e1-2164-4822-8e8e-4bd984d722ca/GCF_000970165.1_ASM97016v1_genomic.fna
  • parsed 1 contigs, 4096482 bp
    parsing FASTA file: /kb/module/work/tmp/import_fasta_c05de458-98ad-447f-b3ed-735a2595c73b/GCF_000970185.1_ASM97018v1_genomic.fna
    1693955669.0551195: uploading file /kb/module/work/tmp/import_fasta_878d8535-ea93-404b-b9a3-d43e67ca5f69/GCF_000007065.1_ASM706v1_genomic.fna into shock node
    1693955672.2739327: uploading done into shock node: d5e6b96b-d3f1-493f-a9b6-c9ba9bfb29b2
    1693955672.2740831: uploading file /kb/module/work/tmp/import_fasta_b4a1b8e1-2164-4822-8e8e-4bd984d722ca/GCF_000970165.1_ASM97016v1_genomic.fna into shock node
    1693955677.3947856: uploading done into shock node: 6ce559d9-e733-448d-a11e-84dbaaeed99e
  • parsed 1 contigs, 4066551 bp
    parsing FASTA file: /kb/module/work/tmp/import_fasta_7135350e-3363-493c-8360-b6c68c7d9b2a/GCF_000970205.1_ASM97020v1_genomic.fna
  • parsed 1 contigs, 4142816 bp
    parsing FASTA file: /kb/module/work/tmp/import_fasta_65828b11-6633-4270-ab27-985ae2eab967/GCF_000970245.1_ASM97024v1_genomic.fna
  • parsed 1 contigs, 4166241 bp
    parsing FASTA file: /kb/module/work/tmp/import_fasta_9da2a649-9a60-42e5-832f-052fcad14ea5/GCF_000979375.1_gtlEnvA5udCFS_genomic.fna
  • parsed 136 contigs, 4092300 bp
    parsing FASTA file: /kb/module/work/tmp/import_fasta_fac65d9a-860e-41ad-b63f-0f1fe08f374b/GCF_000979555.1_gtlEnvA5udCFS_genomic.fna
  • parsed 185 contigs, 4078931 bp
    parsing FASTA file: /kb/module/work/tmp/import_fasta_259d4045-4384-433f-9ca2-718ac1d61227/GCF_000980105.1_gtlEnvA5udCFS_genomic.fna
  • parsed 260 contigs, 4079204 bp
    parsing FASTA file: /kb/module/work/tmp/import_fasta_23bb941d-b645-41df-9f3f-ff06c174330c/GCF_000980175.1_gtlEnvA5udCFS_genomic.fna
  • parsed 142 contigs, 4085981 bp
  • parsed 10 FASTA files in 5.011489391326904 seconds
    saving assemblies to KBase
    Uploading FASTA files to the Blobstore
    1693955663.00 - CallbackServer: Subjob method: DataFileUtil.file_to_shock_mass JobID: 2dbd1842-ff75-44dc-ba4c-705610b5fda6
    node
    1693955699.9393437: uploading done into shock node: 1e25e7bd-6c27-4f07-aabd-fff1df1a9989
    1693955699.9401429: uploading file /kb/module/work/tmp/import_fasta_259d4045-4384-433f-9ca2-718ac1d61227/GCF_000980105.1_gtlEnvA5udCFS_genomic.1693955663.00 - CallbackServer: WARNING: Module DataFileUtil was already used once for this job. Using cached version:
    url: https://github.com/kbaseapps/DataFileUtil
    node
    1693955699.9393437: uploading done into shock node: 1e25e7bd-6c27-4f07-aabd-fff1df1a9989
    1693955699.9401429: uploading file /kb/module/work/tmp/import_fasta_259d4045-4384-433f-9ca2-718ac1d61227/GCF_000980105.1_gtlEnvA5udCFS_genomic.1693955699.9401429: uploading file /kb/module/work/tmp/import_fasta_259d4045-4384-433f-9ca2-718ac1d61227/GCF_000980105.1_gtlEnvA5udCFS_genomic.fna into shock node
    1693955703.332774: uploading done into shock node: c243eeee-63c7-45b1-bb37-b2f3efdce2ee
    1693955703.3329513: uploading file /kb/module/work/tmp/import_fasta_23bb941d-b645-41df-9f3f-ff06c174330c/GCF_000980175.1_gtlEnvA5udCFS_genomic.fna into shock node
    1693955705.995946: uploading done into shock node: a62d7640-2cd1-46e6-93ca-d3b9c785dff0
  • saved 10 files to the Blobstore in 58.62708044052124 seconds
  • built 10 assembly objects in 0.15749096870422363 seconds
    Saving Assemblies to Workspace
    1693955721.81 - CallbackServer: Subjob method: DataFileUtil.save_objects JobID: 67f4e667-ffae-415f-8462-632141a7d8a9
    1693955721.81 - CallbackServer: WARNING: Module DataFileUtil was already used once for this job. Using cached version:
    url: https://github.com/kbaseapps/DataFileUtil
    commit: 6268e0f87425cdd96901299e618d6778aa44425a
    version: 0.2.0
    release: release

1693955723.5434816: Shock url: https://ci.kbase.us/services/shock-api

  • saved 10 assemblies to the Workspace in 5.342104434967041 seconds
    PASSEDDeleted shock node 784cdab7-3bdd-41d7-a09b-353c38c9f752
    Deleted shock node d5e6b96b-d3f1-493f-a9b6-c9ba9bfb29b2
    Deleted shock node 6ce559d9-e733-448d-a11e-84dbaaeed99e
    Deleted shock node 5e46fbd1-2522-4de7-8c87-95234efa2246
    Deleted shock node edb2e4aa-ef0a-4b10-87d5-75970d36c436
    Deleted shock node cf4dc0d1-3ab0-4595-a285-82177ddbac77
    Deleted shock node ec866e05-9337-4e12-8fc1-0bd526f4f9c5
    Deleted shock node 1e25e7bd-6c27-4f07-aabd-fff1df1a9989
    Deleted shock node c243eeee-63c7-45b1-bb37-b2f3efdce2ee
    Deleted shock node a62d7640-2cd1-46e6-93ca-d3b9c785dff0

----------- coverage: platform linux, python 3.6.3-final-0 -----------
Coverage HTML written to dir htmlcov
Coverage XML written to file coverage.xml

========================= 1 passed in 83.58s (0:01:23) =========================
Shutting down callback server...
1693955730.77 - CallbackServer: Shutting down executor service

@Xiangs18
Copy link
Collaborator Author

Xiangs18 commented Sep 5, 2023

I highlight the key time stats from the console via running kb-sdk test on _import_fasta_mass function:

  • staged 10 files in 5.6641364097595215 seconds
  • parsed 10 FASTA files in 5.011489391326904 seconds
  • saved 10 files to the Blobstore in 58.62708044052124 seconds
  • built 10 assembly objects in 0.15749096870422363 seconds
  • saved 10 assemblies to the Workspace in 5.342104434967041 seconds

@Xiangs18
Copy link
Collaborator Author

Xiangs18 commented Nov 21, 2023

@MrCreosote @Tianhao-Gu
Update since last code review:

  1. Delete all queue stuff and move batching logic && parallelization to AssemblyUtil
  2. Use a list of nameTuples as suggested to avoid buggy code
  3. Add 2 new CLI: jr_max_tasks and au_service _ver
  4. Add containers.conf file setup so we can get logs from the callback server
  5. Fix catalog params (Catalog params fix kbaseapps/AssemblyUtil#91)
  6. Rename, relocate, and update the callback server wrapper (rename & relocate workspace_downloader_helper.py file #555)

@Xiangs18 Xiangs18 mentioned this pull request Nov 29, 2023
Copy link
Member

@MrCreosote MrCreosote left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly minor stuff in this review I think

Still need to review tests, but I'll wait until the code is done

I'll be really glad when we merge this PR, it takes a long time to review...

src/loaders/workspace_uploader/workspace_uploader.py Outdated Show resolved Hide resolved
src/loaders/workspace_uploader/workspace_uploader.py Outdated Show resolved Hide resolved
src/loaders/workspace_uploader/workspace_uploader.py Outdated Show resolved Hide resolved
src/loaders/workspace_uploader/workspace_uploader.py Outdated Show resolved Hide resolved
src/loaders/workspace_uploader/workspace_uploader.py Outdated Show resolved Hide resolved
src/loaders/workspace_uploader/workspace_uploader.py Outdated Show resolved Hide resolved
@MrCreosote
Copy link
Member

Note to self: all my comments above this point have been addressed

Copy link
Member

@MrCreosote MrCreosote left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, went thought the tests finally

@Xiangs18
Copy link
Collaborator Author

Xiangs18 commented Jan 30, 2024

Addressed all comments in the tests. We are getting really close to merge this PR. Maybe after Podman test ...

Copy link
Member

@MrCreosote MrCreosote left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, looking about done. I think there are 2 things left beyond the fairly trivial comment in this review:

  • Test the code with podman, including the recovery code, now that it's fixed
  • @Tianhao-Gu should take another look as the PR has been basically completely rewritten since his review and I've looked at it so many times now for so long I've probably lost objectivity

@Xiangs18
Copy link
Collaborator Author

@MrCreosote Tested the uploader with podman, including the recovery code. Everything looks good!

Tianhao-Gu
Tianhao-Gu previously approved these changes Jan 31, 2024
Copy link
Collaborator

@Tianhao-Gu Tianhao-Gu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good except some typos. Are we planning to refactor it and upload genome all together in the future? Currently, workspace downloader is expecting every Assembly has an associated Genome obj.

If not supplied, the token must be provided in the environment variable KB_AUTH_TOKEN.
The KB_ADMIN_AUTH_TOKEN environment variable will get set by this token if the user runs as catalog admin.
au_service_ver (str): The service verison of AssemblyUtilClient
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
au_service_ver (str): The service verison of AssemblyUtilClient
au_service_ver (str): The service version of AssemblyUtilClient

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

workers (int): The number of workers to use for multiprocessing.
max_callback_server_tasks (int): The maxmium subtasks for the callback server.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
max_callback_server_tasks (int): The maxmium subtasks for the callback server.
max_callback_server_tasks (int): The maximum number of subtasks for the callback server.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

"--cbs_max_tasks",
type=int,
default=20,
help="The maxmium subtasks for the callback server",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
help="The maxmium subtasks for the callback server",
help="The maximum subtasks for the callback server",

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@MrCreosote
Copy link
Member

Are we planning to refactor it and upload genome all together in the future?

yep

Currently, workspace downloader is expecting every Assembly has an associated Genome obj.

There's no toggle to choose to d/l genomes or not?

@Tianhao-Gu
Copy link
Collaborator

Tianhao-Gu commented Jan 31, 2024

There's no toggle to choose to d/l genomes or not?

As far as I understand, no option to ignore genome existence. Looks to me, it doesn't really download the genome obj. It just saves the genome_upa to the meta file.

@MrCreosote
Copy link
Member

Yeah, it needs that for exporting genome sets. There's no real need for the genome data proper in the pipeline as yet

@Xiangs18 Xiangs18 merged commit 86ddf57 into main Feb 1, 2024
10 of 11 checks passed
@Xiangs18 Xiangs18 deleted the dv-WS_uploader_performance branch February 1, 2024 18:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants