RE2022-272: Add a bulk version of genbank_to_genome in GFU #208

Xiangs18 · 2024-01-27T03:03:15Z

No description provided.

codecov · 2024-01-27T08:07:16Z

Codecov Report

Attention: Patch coverage is 99.42197%, with 2 lines in your changes missing coverage. Please review.

Project coverage is 80.61%. Comparing base (d48a690) to head (5097ef0).

Files	Patch %	Lines
lib/GenomeFileUtil/GenomeFileUtilImpl.py	90.47%	2 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #208      +/-   ##
==========================================
+ Coverage   79.25%   80.61%   +1.35%     
==========================================
  Files          11       11              
  Lines        2902     3007     +105     
==========================================
+ Hits         2300     2424     +124     
+ Misses        602      583      -19

GenomeFileUtil.spec

lib/GenomeFileUtil/core/GenbankToGenome.py

lib/GenomeFileUtil/GenomeFileUtilImpl.py

lib/GenomeFileUtil/core/GenbankToGenome.py

test/supplemental_genbank_tests/genbank_upload_full_test.py

MrCreosote

Ok, this is looking pretty good now. Lesson learned: we need to be a lot more careful about making smaller changes per PR. If it looks like the changes are going to be big we should get together and try and hash out a way to split things up.

Here's the list of stuff I can find we've said we need to do in future PRs:

Add tests for the various input parameters for the bulk method
delete export_genome_features_protein_to_fasta from spec and recompile
validate / parse all data, both genome & assembly data, before saving anything
batch saving genomes vs multiple calls to save_one_genome
Read through the code looking for places where a lot of stuff is being loaded into memory (e.g. contigs) and be sure that it's removed from memory as soon as possible
Same thing for files - there are places where files are copied that might not be necessary or files can be deleted earlier
Handle the case where there are > 10000 inputs (workspace will reject)
parallelization

@Tianhao-Gu should do the final review & approval for this PR
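
Two of the follow-up items above, validating everything before saving and staying under the 10000-input limit, can be sketched together. This is only an illustration under stated assumptions: `chunked`, `validate_then_save`, `validate_one`, `save_batch`, and `MAX_OBJECTS_PER_SAVE` are hypothetical names, not functions from GenomeFileUtil or the workspace client, and the limit of 10000 is taken from the list above.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")

# Assumed limit, taken from the "> 10000 inputs" item in the list above.
MAX_OBJECTS_PER_SAVE = 10000


def chunked(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield successive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def validate_then_save(
    inputs: List[T],
    validate_one: Callable[[T], None],
    save_batch: Callable[[List[T]], List[R]],
    max_per_save: int = MAX_OBJECTS_PER_SAVE,
) -> List[R]:
    """Validate every input first, then save in size-limited batches.

    validate_one (hypothetical) raises on bad input.
    save_batch (hypothetical) saves one batch and returns one result per input.
    """
    # Phase 1: nothing is saved unless every input passes validation.
    for item in inputs:
        validate_one(item)

    # Phase 2: save in batches so that no single call exceeds the limit.
    results: List[R] = []
    for batch in chunked(inputs, max_per_save):
        results.extend(save_batch(batch))
    return results
```

Running all validation before the first save means a bad input late in the list cannot leave earlier genomes half-uploaded; the batching step is independent of that and could be swapped for a hard error if splitting a request is not acceptable.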

Xiangs18 · 2024-05-15T22:55:37Z

@MrCreosote Have we discussed export_genome_features_protein_to_fasta? It doesn't ring a bell. Is this function no longer in use?

MrCreosote · 2024-05-15T22:59:44Z

https://app.slack.com/client/T026VDM4X/C4E7KUGTD

lib/GenomeFileUtil/GenomeFileUtilImpl.py

Tianhao-Gu · 2024-05-16T17:25:54Z

lib/GenomeFileUtil/core/GenbankToGenome.py

        # dict with feature 'id's that have been used more than once.
        self.used_twice_identifiers = {}
+
+        # related info for genome process and upload


Does it matter that the 'gc_content', 'dna_size', and 'md5' attributes are absent from _Genome()?

It does not matter, because these attributes are assigned before they are used.
But for consistency, I can initialize them to None in _Genome() in the next PR.
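
A minimal sketch of what initializing those attributes to None could look like. `GenomeSketch` is a hypothetical stand-in, not the real `_Genome` class, and the attribute types are assumptions; only the three attribute names come from the discussion above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class GenomeSketch:
    """Hypothetical stand-in for _Genome; the field types are assumptions."""

    # Declared up front so the attributes always exist, even before
    # they are computed later in processing.
    gc_content: Optional[float] = None
    dna_size: Optional[int] = None
    md5: Optional[str] = None

    def missing_fields(self) -> List[str]:
        """Return the names of attributes that have not been assigned yet."""
        return [name for name, value in vars(self).items() if value is None]
```

Declaring the attributes with a None default makes an unassigned value an explicit state that can be checked before upload, rather than an AttributeError at the point of use.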

Tianhao-Gu

👍

Xiangs18 · 2024-05-16T20:52:03Z

#208 (comment)

@MrCreosote This link only takes me to the kbase_coders channel.

MrCreosote · 2024-05-16T20:56:37Z

ok, try https://kbase.slack.com/archives/C4E7KUGTD/p1710806243115819

Xiangs18 added 2 commits January 26, 2024 10:52

move default catalog params to GenbankToGenome.py

e3962f2

add bulk version of genbank_to_genome

ff1116f

Xiangs18 requested review from jsfillman, jkbaumohl and Tianhao-Gu as code owners January 27, 2024 03:03

Xiangs18 added 5 commits January 26, 2024 19:10

update GenomeFileUtilServer.py

f07f80c

fix typos

1245e61

fix validate_params fun call

8d16ab5

use input_params

faf2c9d

add workspace_name check

6229ca2

Xiangs18 added 11 commits January 27, 2024 00:29

replace workspace_name by workspace_id

33c278c

add genbank_to_genome bulk test

0933532

test genbank_upload_full_test.py

34c3108

debug test_genbanks_to_genomes

473b22a

run a specific test

e5bcd97

run a single fun

234c84d

use mass method to upload a single genbank

9220025

retest mass function

8cc8068

add tests to increase coverage

03fcc7f

run all tests && clean up

b711c1b

add doc string for genbanks_to_genomes

12bd88d

Xiangs18 requested a review from MrCreosote January 29, 2024 19:35

MrCreosote reviewed Jan 29, 2024

Xiangs18 added 4 commits January 30, 2024 12:05

rename functions && correct typos

62d58d4

refactor and test

090fbe3

finish refactor && clean up

f504773

make Genome class private

9706af7

MrCreosote reviewed Jan 31, 2024

refactor save assembly function

961bb93

Xiangs18 added 17 commits April 23, 2024 18:01

fix aliases bug

8939c93

check assembly handle_id, blob_id, and url

cc2a609

check genome handle ref

13733c2

add token

3302dcb

check blob_info and data md5

b6aeac3

fix format error

2b60903

check consolidated file path

a8835d4

rerun and check if checksum is changing

f7f7071

add _download_file_from_blobstore and _md5sum_string functions

96b3cc2

fix _md5sum_string bug

580b8e5

use shock to file

5edad3c

fix output dir

f732bd4

check genome md5sum

58e79d8

fix calculate_md5sum bug

a2e271d

check assembly md5sum

8a88a00

add missing rtn

c23e9a4

finish assembly md5sum check

43c420f

MrCreosote reviewed May 13, 2024

test/supplemental_genbank_tests/genbank_upload_full_test.py (resolved)

test/supplemental_genbank_tests/genbank_upload_full_test.py (outdated, resolved)

add genome upa check; blobstore filename check; reduce nums of args

5097ef0

Xiangs18 requested a review from MrCreosote May 14, 2024 21:24

MrCreosote reviewed May 15, 2024

Tianhao-Gu reviewed May 16, 2024

Tianhao-Gu approved these changes May 16, 2024

Xiangs18 merged commit e0b1dd3 into master May 21, 2024
3 checks passed

Xiangs18 mentioned this pull request Aug 12, 2024

validate and parse genome before upload #211

Merged

Xiangs18 mentioned this pull request Oct 31, 2024

batch saving genomes #212

Open
