-
Notifications
You must be signed in to change notification settings - Fork 16
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RE2022-272: Add a bulk version of genbank_to_genome in GFU #208
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #208 +/- ##
==========================================
+ Coverage 79.25% 80.61% +1.35%
==========================================
Files 11 11
Lines 2902 3007 +105
==========================================
+ Hits 2300 2424 +124
+ Misses 602 583 -19 ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, this is looking pretty good now. Lesson learned: we need to be a lot more careful about making smaller changes per PR. If it looks like the changes are going to be big we should get together and try and hash out a way to split things up.
Here's the list of stuff I can find we've said we need to do in future PRs:
- Add tests for the various input parameters for the bulk method
- delete
export_genome_features_protein_to_fasta
from spec and recompile - validate / parse all data, both genome & assembly data, before saving anything
- batch saving genomes vs multiple calls to
save_one_genome
- Read through the code looking for places where a lot of stuff is being loaded into memory (e.g. contigs) and be sure that it's removed from memory as soon as possible
- Same thing for files - there are places where files are copied that might not be necessary or files can be deleted earlier
- Handle the case where there are > 10000 inputs (workspace will reject)
- parallelization
@Tianhao-Gu should do the final review & approval for this PR
@MrCreosote Have we discussed about |
# dict with feature 'id's that have been used more than once. | ||
self.used_twice_identifiers = {} | ||
|
||
# related info for genome process and upload |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does it matter that 'gc_content', 'dna_size', and 'md5' attributes are absent from _Genome()?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does not matter because these attributes will be assigned before being used.
But for consistency, I can initiate and assign them to None
in _Genome() in the next PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
@MrCreosote This link only directs me to kbase_coders channel. |
No description provided.