RE2022-272: Add a bulk version of genbank_to_genome in GFU #208

Merged
merged 174 commits into from
May 21, 2024
Changes from 18 commits
e3962f2
move default catalog params to GenbankToGenome.py
Xiangs18 Jan 26, 2024
ff1116f
add bulk version of genbank_to_genome
Xiangs18 Jan 27, 2024
f07f80c
update GenomeFileUtilServer.py
Xiangs18 Jan 27, 2024
1245e61
fix typos
Xiangs18 Jan 27, 2024
8d16ab5
fix validate_params fun call
Xiangs18 Jan 27, 2024
faf2c9d
use input_params
Xiangs18 Jan 27, 2024
6229ca2
add workspace_name check
Xiangs18 Jan 27, 2024
33c278c
replace workspace_name by workspace_id
Xiangs18 Jan 27, 2024
0933532
add genbank_to_genome bulk test
Xiangs18 Jan 27, 2024
34c3108
test genbank_upload_full_test.py
Xiangs18 Jan 28, 2024
473b22a
debug test_genbanks_to_genomes
Xiangs18 Jan 28, 2024
e5bcd97
run a specific test
Xiangs18 Jan 28, 2024
234c84d
run a single fun
Xiangs18 Jan 28, 2024
9220025
use mass method to upload a single genbank
Xiangs18 Jan 28, 2024
8cc8068
retest mass function
Xiangs18 Jan 28, 2024
03fcc7f
add tests to increase coverage
Xiangs18 Jan 28, 2024
b711c1b
run all tests && clean up
Xiangs18 Jan 28, 2024
12bd88d
add doc string for genbanks_to_genomes
Xiangs18 Jan 29, 2024
62d58d4
rename functions && correct typos
Xiangs18 Jan 30, 2024
090fbe3
refactor and test
Xiangs18 Jan 31, 2024
f504773
finish refactor && clean up
Xiangs18 Jan 31, 2024
9706af7
make Genome class private
Xiangs18 Jan 31, 2024
961bb93
refactor save assembly function
Xiangs18 Feb 2, 2024
3911611
fix metadata key error
Xiangs18 Feb 2, 2024
5d79e1d
update client code && test
Xiangs18 Feb 2, 2024
496aa2f
fix bug
Xiangs18 Feb 2, 2024
4c70b9d
fix objects index bug
Xiangs18 Feb 2, 2024
eb33f6a
debug index out of bounds
Xiangs18 Feb 2, 2024
58e836b
add try except in _save_genomes
Xiangs18 Feb 2, 2024
09dd0da
fix missing genome_names
Xiangs18 Feb 2, 2024
46ec762
run pass all tests && clean up
Xiangs18 Feb 2, 2024
eafa232
add default param for version
Xiangs18 Feb 3, 2024
c812cca
test genbank_assembly_ref_test.py only
Xiangs18 Feb 3, 2024
6965069
update _save_assemblies function logic
Xiangs18 Feb 3, 2024
4ffcd80
rerun genbank_assembly_ref_test
Xiangs18 Feb 3, 2024
d2be8ee
run all tests && final cleanup
Xiangs18 Feb 3, 2024
81dbadc
finish ASU refactor
Xiangs18 Feb 3, 2024
65bbcef
make validate_params private
Xiangs18 Feb 6, 2024
b491fe9
add missing self
Xiangs18 Feb 6, 2024
c21f3da
validate first before upload
Xiangs18 Feb 13, 2024
b56f03a
fix objects not speicifed bug
Xiangs18 Feb 13, 2024
5418499
fix fail error message
Xiangs18 Feb 13, 2024
ae8000a
add gc_content, dna_size, and md5
Xiangs18 Feb 13, 2024
dce257d
test dfu output
Xiangs18 Feb 13, 2024
59c9f01
fix metadata type
Xiangs18 Feb 13, 2024
eb8aac1
fix type check
Xiangs18 Feb 13, 2024
a3ead74
fix specs
Xiangs18 Feb 13, 2024
9cb61de
finish && cleanup
Xiangs18 Feb 13, 2024
347a791
test genome output
Xiangs18 Feb 14, 2024
ed8abdb
fix typo
Xiangs18 Feb 14, 2024
1224189
add assembly_info
Xiangs18 Feb 14, 2024
adfb3a4
add more tests
Xiangs18 Feb 14, 2024
2f4bbc8
update release notes && run all tests
Xiangs18 Feb 14, 2024
6649f13
keep ws and remove object_info def
Xiangs18 Feb 14, 2024
cf12176
update specs; fix release version; rename file_handle
Xiangs18 Feb 16, 2024
60c4e37
add valid params test
Xiangs18 Feb 16, 2024
130e958
fix error message
Xiangs18 Feb 16, 2024
8b54339
run all tests && clean up
Xiangs18 Feb 16, 2024
c4f3c95
bump version; update GenomeFileUtilImpl.py; use get_object_info3
Xiangs18 Feb 23, 2024
3347a5e
rename _get_contigs_and_validate_existing_assembly func and move down…
Xiangs18 Feb 23, 2024
a953fcc
fix ws client error
Xiangs18 Feb 23, 2024
558ed47
add usrname, missing comment; rename fun
Xiangs18 Feb 26, 2024
e2e7da9
remove out_contigs
Xiangs18 Feb 26, 2024
31b4132
add contigs_output to avoid pass in null
Xiangs18 Feb 26, 2024
1168eb2
move input_params into _Genome
Xiangs18 Mar 5, 2024
9599d70
add _check_result_object_info_fields func
Xiangs18 Mar 5, 2024
10cb15d
seperate mass test function
Xiangs18 Mar 5, 2024
42f828c
add new test
Xiangs18 Mar 5, 2024
d9f6d4b
fix bug in get_object_info3 func call
Xiangs18 Mar 5, 2024
b34a5b9
check info output
Xiangs18 Mar 5, 2024
3b99e10
fix info bug
Xiangs18 Mar 5, 2024
bb51ab5
display metadata to check
Xiangs18 Mar 5, 2024
ace9c90
fix tests
Xiangs18 Mar 6, 2024
1d0f965
check error output
Xiangs18 Mar 6, 2024
a251871
test output
Xiangs18 Mar 6, 2024
784e4b4
fix match problem
Xiangs18 Mar 6, 2024
4e14017
fix bugs in tests
Xiangs18 Mar 6, 2024
f769404
debug assertion fail
Xiangs18 Mar 6, 2024
f77eab1
run pass all tests
Xiangs18 Mar 6, 2024
19d2c31
add missing metadata param
Xiangs18 Mar 6, 2024
fe703f3
add provenance
Xiangs18 Mar 6, 2024
5f14f52
fix assert error
Xiangs18 Mar 6, 2024
c08f235
more checks
Xiangs18 Mar 6, 2024
fb60e20
finish adding provenance test
Xiangs18 Mar 6, 2024
f1a133a
add TODO and remove print messages
Xiangs18 Mar 7, 2024
9f9da3e
rm duplicate error check
Xiangs18 Mar 7, 2024
c8dfda9
check mRNA missing annotations
Xiangs18 Mar 8, 2024
3775aee
add rmna test
Xiangs18 Mar 8, 2024
3272e72
remove logs
Xiangs18 Mar 8, 2024
40fb78a
remove idx and add input params
Xiangs18 Mar 8, 2024
93b8b91
run all tests
Xiangs18 Mar 8, 2024
d5bce8a
more tests
Xiangs18 Mar 9, 2024
d770130
test spoof
Xiangs18 Mar 11, 2024
d3ab203
display spoof rtn
Xiangs18 Mar 11, 2024
21ac1ef
check data
Xiangs18 Mar 11, 2024
6c066f9
add check spoof function
Xiangs18 Mar 11, 2024
47aae1b
display spoof warning
Xiangs18 Mar 11, 2024
41c2339
more check
Xiangs18 Mar 11, 2024
062c3d3
test diff genbank file
Xiangs18 Mar 11, 2024
5c9ea7c
pass filed check
Xiangs18 Mar 11, 2024
84fe0b8
display warnings
Xiangs18 Mar 11, 2024
27d3fee
run all tests
Xiangs18 Mar 12, 2024
daf54ca
run mRNA with no parent
Xiangs18 Mar 12, 2024
d2a9be1
add mRNA_with_no_parent.gbff file
Xiangs18 Mar 12, 2024
5e44f27
check curated file meta
Xiangs18 Mar 12, 2024
7aaa537
remove redudant tests
Xiangs18 Mar 13, 2024
33f56f5
run all tests
Xiangs18 Mar 13, 2024
a32bdc1
cover ontology
Xiangs18 Mar 18, 2024
0488a57
correct ontology genbank name
Xiangs18 Mar 18, 2024
0d64e5b
fix ontology
Xiangs18 Mar 18, 2024
c49f54b
cover all missing lines && run all tests
Xiangs18 Mar 18, 2024
989ddc8
display data/info/prov
Xiangs18 Mar 19, 2024
42916a3
check info, metadata, and prov
Xiangs18 Mar 20, 2024
4e860d9
fix datetime iso check
Xiangs18 Mar 20, 2024
f8a5e17
fix metadata bug
Xiangs18 Mar 20, 2024
f98dcc0
debug on metadata
Xiangs18 Mar 20, 2024
89bca42
fix prov and metadata
Xiangs18 Mar 20, 2024
3233347
add TODOs and delete print
Xiangs18 Mar 20, 2024
1359c39
test prov again
Xiangs18 Mar 20, 2024
1957748
fix prov tests
Xiangs18 Mar 20, 2024
cd84e97
remove print message && finish
Xiangs18 Mar 20, 2024
5a15f2d
add genome data check
Xiangs18 Mar 22, 2024
2e88a66
display genome data
Xiangs18 Mar 25, 2024
ca640a5
rerun genome data check
Xiangs18 Mar 25, 2024
11f8e34
compare sorted data
Xiangs18 Mar 25, 2024
01a7da6
test genome data with order
Xiangs18 Mar 26, 2024
f3dbe55
print genome data
Xiangs18 Mar 26, 2024
378e39c
rerun ordered comparison
Xiangs18 Mar 26, 2024
e971cca
rerun
Xiangs18 Mar 26, 2024
27d64d2
rerun small genome files
Xiangs18 Mar 27, 2024
02e4ec4
fix test failure
Xiangs18 Mar 28, 2024
909e12c
add fix to download json file
Xiangs18 Apr 2, 2024
660d145
fix check
Xiangs18 Apr 4, 2024
449da57
genome data check using downloaded files
Apr 5, 2024
3decca9
fix filename error
Xiangs18 Apr 6, 2024
f91072e
new data files
Apr 8, 2024
4e1d603
rerun tests
Xiangs18 Apr 8, 2024
087e440
fix bug
Xiangs18 Apr 8, 2024
7867216
format document
Xiangs18 Apr 9, 2024
0083baf
complete genome check
Xiangs18 Apr 9, 2024
c69f797
display assembly
Xiangs18 Apr 10, 2024
e5ab860
debug key error
Xiangs18 Apr 10, 2024
61571fc
add assembly check first cut
Xiangs18 Apr 10, 2024
51340d5
fix annotations
Xiangs18 Apr 10, 2024
a52c980
debug assembly data check
Xiangs18 Apr 11, 2024
6b937c2
fix assembly data check bug
Xiangs18 Apr 11, 2024
d5c74f6
add token and rerun
Xiangs18 Apr 11, 2024
758ed57
refactor code
Xiangs18 Apr 11, 2024
55b4273
rerun all tests
Xiangs18 Apr 11, 2024
a11415a
redownload genome files
Xiangs18 Apr 16, 2024
05d4ff6
uplaod download genome files
Apr 16, 2024
bc630b4
rerun all tests
Xiangs18 Apr 16, 2024
0aaf931
finish && cleanup
Xiangs18 Apr 16, 2024
ee5638b
1.check version; 2.validate assembly_upa; 3. remove idx arg for self.…
Xiangs18 Apr 23, 2024
0b5ad8a
fix prov and aliases
Xiangs18 Apr 23, 2024
04aa6e1
download new data
Apr 24, 2024
8939c93
fix aliases bug
Xiangs18 Apr 24, 2024
cc2a609
check assembly handle_id, blob_id, and url
Xiangs18 Apr 24, 2024
13733c2
check genome handle ref
Xiangs18 Apr 24, 2024
3302dcb
add token
Xiangs18 Apr 25, 2024
b6aeac3
check blob_info and data md5
Xiangs18 Apr 25, 2024
2b60903
fix format error
Xiangs18 Apr 25, 2024
a8835d4
check consolidated file path
Xiangs18 Apr 26, 2024
f7f7071
rerun and check if checksum is changing
Xiangs18 Apr 26, 2024
96b3cc2
add _download_file_from_blobstore and _md5sum_string functions
Xiangs18 May 3, 2024
580b8e5
fix _md5sum_string bug
Xiangs18 May 3, 2024
5edad3c
use shock to file
Xiangs18 May 9, 2024
f732bd4
fix output dir
Xiangs18 May 10, 2024
58e79d8
check genome md5sum
Xiangs18 May 10, 2024
a2e271d
fix calculate_md5sum bug
Xiangs18 May 10, 2024
8a88a00
check assembly md5sum
Xiangs18 May 10, 2024
c23e9a4
add missing rtn
Xiangs18 May 10, 2024
43c420f
finish assembly md5sum check
Xiangs18 May 10, 2024
5097ef0
add genome upa check; blobstore filename check; reduce nums of args
Xiangs18 May 14, 2024
32 changes: 32 additions & 0 deletions GenomeFileUtil.spec
@@ -65,6 +65,38 @@ module GenomeFileUtil {
funcdef genbank_to_genome(GenbankToGenomeParams params)
returns (GenomeSaveResult result) authentication required;

typedef structure {
File file;
string genome_name;

string source;
string taxon_wsname;
string taxon_id;

string release;
string generate_ids_if_needed;
int genetic_code;
string scientific_name;
usermeta metadata;
boolean generate_missing_genes;
string use_existing_assembly;
} GenbankToGenomeInput;

typedef structure {
int workspace_id;
list<GenbankToGenomeInput> inputs;
} GenbanksToGenomesParams;

/* Results for the genbanks_to_genomes function.
results - the results of the save operation in the same order as the input.
*/
typedef structure {
list<GenomeSaveResult> results;
} GenomeSaveResults;

funcdef genbanks_to_genomes(GenbanksToGenomesParams params)
returns (GenomeSaveResults results) authentication required;

/*
is_gtf - optional flag switching export to GTF format (default is 0,
which means GFF)
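
As a rough illustration of the `GenbanksToGenomesParams` structure added to the spec above, the following sketch assembles a matching Python dict. The helper name `build_bulk_params`, the workspace ID, and the file paths are placeholders invented for this example; only the field names come from the spec.

```python
from typing import Any, Dict, List


def build_bulk_params(workspace_id: int, inputs: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Assemble a GenbanksToGenomesParams-shaped dict (hypothetical helper)."""
    return {"workspace_id": workspace_id, "inputs": list(inputs)}


params = build_bulk_params(
    12345,  # placeholder workspace ID
    [
        # Each entry follows the GenbankToGenomeInput structure from the spec.
        {"file": {"path": "/tmp/genome_a.gbff"}, "genome_name": "genome_a", "source": "RefSeq"},
        {"file": {"path": "/tmp/genome_b.gbff"}, "genome_name": "genome_b",
         "generate_missing_genes": 1},  # spec boolean: 0 for false, 1 for true
    ],
)
print(params["workspace_id"], [i["genome_name"] for i in params["inputs"]])
```

Note that the workspace is given once for the whole batch, while every per-genome option lives inside its own `inputs` entry.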
65 changes: 59 additions & 6 deletions lib/GenomeFileUtil/GenomeFileUtilImpl.py
@@ -8,13 +8,13 @@
from pprint import pprint

from GenomeFileUtil.core.FastaGFFToGenome import FastaGFFToGenome
- from GenomeFileUtil.core.GenbankToGenome import GenbankToGenome
- from GenomeFileUtil.core.GenomeFeaturesToFasta import GenomeFeaturesToFasta
- from GenomeFileUtil.core.GenomeInterface import (
- GenomeInterface,
+ from GenomeFileUtil.core.GenbankToGenome import (
+ GenbankToGenome,
MAX_THREADS_DEFAULT,
THREADS_PER_CPU_DEFAULT,
)
+ from GenomeFileUtil.core.GenomeFeaturesToFasta import GenomeFeaturesToFasta
+ from GenomeFileUtil.core.GenomeInterface import GenomeInterface
from GenomeFileUtil.core.GenomeToGFF import GenomeToGFF
from GenomeFileUtil.core.GenomeToGenbank import GenomeToGenbank
from installed_clients.AssemblyUtilClient import AssemblyUtil
@@ -92,7 +92,6 @@
#END_CONSTRUCTOR
pass


def genbank_to_genome(self, ctx, params):
"""
:param params: instance of type "GenbankToGenomeParams" (genome_name
@@ -132,7 +131,7 @@
pprint(params)

importer = GenbankToGenome(self.cfg)
- result = importer.refactored_import(ctx, params)
+ result = importer.refactored_import(params)

print('import complete -- result = ')
pprint(result)
@@ -145,6 +144,60 @@
# return the results
return [result]

def genbanks_to_genomes(self, ctx, params):
"""
:param params: instance of type "GenbanksToGenomesParams" -> structure:
parameter "workspace_id" of Long, parameter "inputs" of list of
type "GenbankToGenomeInput" (genome_name - becomes the name of the
object source - Source of the file typically something like RefSeq
or Ensembl taxon_ws_name - where the reference taxons are :
ReferenceTaxons taxon_id - if defined, will try to link the Genome
to the specified taxonomy id in lieu of performing the lookup
during upload release - Release or version number of the data per
example Ensembl has numbered releases of all their data: Release 31
generate_ids_if_needed - If field used for feature id is not there,
generate ids (default behavior is raising an exception) genetic_code
- Genetic code of organism. Overwrites determined GC from taxon
object scientific_name - will be used to set the scientific name of
the genome and link to a taxon generate_missing_genes - If the file
has CDS or mRNA with no corresponding gene, generate a spoofed gene.
use_existing_assembly - Supply an existing assembly reference) ->
structure: parameter "file" of type "File" -> structure: parameter
"path" of String, parameter "shock_id" of String, parameter
"ftp_url" of String, parameter "genome_name" of String, parameter
"source" of String, parameter "taxon_wsname" of String, parameter
"taxon_id" of String, parameter "release" of String, parameter
"generate_ids_if_needed" of String, parameter "genetic_code" of
Long, parameter "scientific_name" of String, parameter "metadata"
of type "usermeta" -> mapping from String to String, parameter
"generate_missing_genes" of type "boolean" (A boolean - 0 for false,
1 for true. @range (0, 1)), parameter "use_existing_assembly" of
String
:returns: instance of type "GenomeSaveResults" -> structure: parameter
"results" of list of type "GenomeSaveResult" -> structure: parameter
"genome_ref" of String
"""
# ctx is the context object
# return variables are: results
#BEGIN genbanks_to_genomes
print('genbanks_to_genomes -- parameters = ')
pprint(params)

results = {
'results': GenbankToGenome(self.cfg).refactored_import_mass(params)
}
[Codecov / codecov/patch: check warning on line 186 in lib/GenomeFileUtil/GenomeFileUtilImpl.py. Added line #L186 was not covered by tests.]

print('import complete -- results = ')
pprint(results)
#END genbanks_to_genomes

# At some point might do deeper type checking...
if not isinstance(results, dict):
raise ValueError('Method genbanks_to_genomes return value ' +
'results is not type dict as required.')
# return the results
return [results]

def genome_to_gff(self, ctx, params):
"""
:param params: instance of type "GenomeToGFFParams" (is_gtf -
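
The spec comment for `GenomeSaveResults` states that results come back in the same order as the inputs. A minimal sketch of how a caller might rely on that ordering is shown below. It works on the raw value returned by the Impl method (a one-element list wrapping the `results` dict); the function name `pair_results` and the sample references are invented for illustration.

```python
def pair_results(inputs, impl_return):
    """Map each input's genome_name to its genome_ref, relying on result order."""
    # The Impl method returns [ {'results': [...]} ], so unwrap the outer list first.
    results = impl_return[0]["results"]
    if len(results) != len(inputs):
        raise ValueError("expected exactly one result per input")
    return {inp["genome_name"]: res["genome_ref"] for inp, res in zip(inputs, results)}


inputs = [{"genome_name": "genome_a"}, {"genome_name": "genome_b"}]
# Made-up return value shaped like the output of genbanks_to_genomes.
fake_return = [{"results": [{"genome_ref": "42/7/1"}, {"genome_ref": "42/8/1"}]}]
paired = pair_results(inputs, fake_return)
print(paired)
```

The length check is a cheap guard: pairing by position is only safe when the two lists are the same size.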
4 changes: 4 additions & 0 deletions lib/GenomeFileUtil/GenomeFileUtilServer.py
@@ -342,6 +342,10 @@ def __init__(self):
name='GenomeFileUtil.genbank_to_genome',
types=[dict])
self.method_authentication['GenomeFileUtil.genbank_to_genome'] = 'required' # noqa
self.rpc_service.add(impl_GenomeFileUtil.genbanks_to_genomes,
name='GenomeFileUtil.genbanks_to_genomes',
types=[dict])
self.method_authentication['GenomeFileUtil.genbanks_to_genomes'] = 'required' # noqa
self.rpc_service.add(impl_GenomeFileUtil.genome_to_gff,
name='GenomeFileUtil.genome_to_gff',
types=[dict])
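
The registration above exposes the method under the name `GenomeFileUtil.genbanks_to_genomes` with a single dict argument (`types=[dict]`). As a sketch only, and assuming the JSON-RPC 1.1 style envelope that KBase SDK clients commonly send (this envelope is an assumption, not something shown in this diff), a request body could be built like this. The helper `make_request` and the parameter values are hypothetical.

```python
import json
import uuid


def make_request(method, params):
    """Build a JSON-RPC 1.1 style envelope (assumed format, see lead-in)."""
    return {
        "version": "1.1",
        "id": str(uuid.uuid4()),
        "method": method,
        # One positional argument, matching types=[dict] in the registration.
        "params": [params],
    }


body = json.dumps(
    make_request(
        "GenomeFileUtil.genbanks_to_genomes",
        {"workspace_id": 12345, "inputs": [{"file": {"path": "/tmp/a.gbff"},
                                            "genome_name": "genome_a"}]},
    )
)
decoded = json.loads(body)
print(decoded["method"], len(decoded["params"]))
```

Because the method is registered with authentication `required`, a real call would also need to carry a valid auth token, which this sketch leaves out.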
182 changes: 132 additions & 50 deletions lib/GenomeFileUtil/core/GenbankToGenome.py
Member

There are about 8 places in the code where the code changed in this PR but there's no test coverage. As a general rule, if you change code you should write tests against the old code first to make sure you have a baseline, change the code, and make sure the tests pass.

Would it take a lot of effort to write tests to cover the changed code?

Contributor Author (@Xiangs18, Mar 1, 2024)

If you are referring to the 8 places in the GenbankToGenome.py file, I only changed them from self. to genome_obj.
I guess these lines were not covered before, and that's why they are showing up now?

Member (@MrCreosote, Mar 2, 2024)

Yes, that's what I'm talking about. Even small changes still need tests.

Contributor Author

Will add tests to cover the missing lines. In this PR or a separate PR?

Member

Well, the purpose of the tests is to ensure this PR hasn't broken anything, so ideally you'd add the tests, make sure they pass, make the changes in this PR, and make sure they still pass. So they should probably be part of this PR, although it's already really big.

Contributor Author

👍
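
The baseline-first workflow described in this thread can be sketched with a small characterization test: pin down what the existing code does, refactor, then rerun the same test. The function `join_skip_empty` below is a hypothetical stand-in, not code from this repository.

```python
import unittest


def join_skip_empty(lines):
    """Hypothetical stand-in for a function that is about to be refactored."""
    return "\n".join(line for line in lines if line.strip())


class JoinSkipEmptyBaseline(unittest.TestCase):
    """Written against the OLD code first, so it records current behavior."""

    def test_blank_lines_are_dropped(self):
        lines = ["LOCUS x", "", "   ", "ORIGIN"]
        self.assertEqual(join_skip_empty(lines), "LOCUS x\nORIGIN")

    def test_empty_input_gives_empty_string(self):
        self.assertEqual(join_skip_empty([]), "")


# Run the baseline without exiting the interpreter.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(JoinSkipEmptyBaseline)
outcome = unittest.TextTestRunner(verbosity=0).run(suite)
```

If the same test still passes after the refactor (for example, after moving state from `self.` to a separate object), the change has not altered the behavior the test pins down.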

@@ -28,6 +28,17 @@
MAX_MISC_FEATURE_SIZE = 10000
MAX_PARENT_LOOKUPS = 5

# catalog params
MAX_THREADS_DEFAULT = 10
THREADS_PER_CPU_DEFAULT = 1

_WSID = 'workspace_id'
_INPUTS = 'inputs'


def _upa(object_info):
return f'{object_info[6]}/{object_info[0]}/{object_info[4]}'


class GenbankToGenome:
def __init__(self, config):
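
The `_upa` helper in the hunk above builds a reference from three positions of a workspace object info tuple. Assuming the usual KBase Workspace layout (index 0 is the object ID, 4 the version, 6 the workspace ID), a standalone copy behaves as follows. All values in the sample tuple are made up.

```python
def upa(object_info):
    # Same indexing as the PR's _upa helper: workspace ID / object ID / version.
    return f"{object_info[6]}/{object_info[0]}/{object_info[4]}"


# Made-up object info in the assumed order:
# objid, name, type, save_date, version, saved_by, wsid, workspace, chsum, size, meta
info = [7, "genome_a", "SomeModule.SomeType-1.0", "2024-03-01T00:00:00+0000",
        3, "someuser", 42, "my_workspace", "0" * 32, 1024, {}]
print(upa(info))  # 42/7/3
```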
@@ -36,11 +47,15 @@
self.dfu = DataFileUtil(config.callbackURL)
self.aUtil = AssemblyUtil(config.callbackURL)
self.ws = Workspace(config.workspaceURL)
+ self.re_api_url = config.re_api_url
+ yml_text = open('/kb/module/kbase.yml').read()
+ self.version = re.search("module-version:\n\W+(.+)\n", yml_text).group(1)
+ self.reset_attributes()
+
+ def reset_attributes(self):
self._messages = []
self.time_string = str(datetime.datetime.fromtimestamp(
time.time()).strftime('%Y_%m_%d_%H_%M_%S'))
- yml_text = open('/kb/module/kbase.yml').read()
- self.version = re.search("module-version:\n\W+(.+)\n", yml_text).group(1)
self.generate_parents = False
self.generate_ids = False
self.genes = OrderedDict()
@@ -62,7 +77,6 @@
self.excluded_features = ('source', 'exon', 'fasta_record')
self.ont_mappings = load_ontology_mappings('/kb/module/data')
self.code_table = 11
- self.re_api_url = config.re_api_url
# dict with feature 'id's that have been used more than once.
self.used_twice_identifiers = {}
self.default_params = {
@@ -84,53 +98,121 @@
def messages(self):
return "\n".join(self._messages)

- def refactored_import(self, ctx, params):
- # 1) validate parameters and extract defaults
- self.validate_params(params)
-
- # 2) construct the input directory staging area
- input_directory = self.stage_input(params)
-
- # 3) update default params
- self.default_params.update(params)
- params = self.default_params
- self.generate_parents = params.get('generate_missing_genes')
- self.generate_ids = params.get('generate_ids_if_needed')
- if params.get('genetic_code'):
- self.code_table = params['genetic_code']
-
- # 4) Do the upload
- files = self._find_input_files(input_directory)
- consolidated_file = self._join_files_skip_empty_lines(files)
- genome = self.parse_genbank(consolidated_file, params)
- if params.get('genetic_code'):
- genome["genetic_code"] = params['genetic_code']
-
- result = self.gi.save_one_genome({
- 'workspace': params['workspace_name'],
- 'name': params['genome_name'],
- 'data': genome,
- "meta": params['metadata'],
- })
- ref = f"{result['info'][6]}/{result['info'][0]}/{result['info'][4]}"
- logging.info(f"Genome saved to {ref}")
-
- # 5) clear the temp directory
- shutil.rmtree(input_directory)
-
- # 6) return the result
- info = result['info']
- details = {
- 'genome_ref': ref,
- 'genome_info': info
- }

def refactored_import(self, params):
print('validating parameters')
mass_params = self._set_up_single_params(params)
return self._refactored_import_mass(mass_params)[0]

def refactored_import_mass(self, params):
print('validating parameters')
self._validate_mass_params(params)
return self._refactored_import_mass(params)

def _set_up_single_params(self, params):
inputs = dict(params)
self.validate_params(inputs)
ws_id = self._get_int(inputs.pop(_WSID, None), _WSID)
ws_name = inputs.pop('workspace_name', None)
if (bool(ws_id) == bool(ws_name)): # xnor
raise ValueError(f"Exactly one of a '{_WSID}' or a 'workspace_name' parameter must be provided")
if not ws_id:
print(f"Translating workspace name {ws_name} to a workspace ID. Prefer submitting "
+ "a workspace ID over a mutable workspace name that may cause race conditions")
ws_id = self.dfu.ws_name_to_id(ws_name)
mass_params = {_WSID: ws_id, _INPUTS: [inputs]}
return mass_params

def _validate_mass_params(self, params):
ws_id = self._get_int(params.get(_WSID), _WSID)
if not ws_id:
raise ValueError(f"{_WSID} is required")
inputs = params.get(_INPUTS)
if not inputs or type(inputs) != list:
raise ValueError(f"{_INPUTS} field is required and must be a non-empty list")
for i, inp in enumerate(inputs, start=1):
if type(inp) != dict:
raise ValueError(f"Entry #{i} in {_INPUTS} field is not a mapping as required")
self.validate_params(inp)

def _get_int(self, putative_int, name, minimum=1):
if putative_int is not None:
if type(putative_int) != int:
raise ValueError(f"{name} must be an integer, got: {putative_int}")
if putative_int < minimum:
raise ValueError(f"{name} must be an integer >= {minimum}")

[Codecov / codecov/patch: check warning on line 142 in lib/GenomeFileUtil/core/GenbankToGenome.py. Added line #L142 was not covered by tests.]
return putative_int

def _refactored_import_mass(self, params):

workspace_id = params[_WSID]
inputs = params[_INPUTS]

genome_names = []
genome_data = []
genome_meta = []

for input_params in inputs:
# 1) construct the input directory staging area
input_directory = self.stage_input(input_params)

# 2) update default params
input_params = {**self.default_params, **input_params}
self.generate_parents = input_params.get('generate_missing_genes')
self.generate_ids = input_params.get('generate_ids_if_needed')
if input_params.get('genetic_code'):
self.code_table = input_params['genetic_code']

# 3) Do the upload
files = self._find_input_files(input_directory)
consolidated_file = self._join_files_skip_empty_lines(files)
genome = self.parse_genbank(
workspace_id, consolidated_file, input_params
)
if input_params.get('genetic_code'):
genome["genetic_code"] = input_params['genetic_code']

# 4) clear the temp directory and reset attributes
shutil.rmtree(input_directory)
self.reset_attributes()

genome_data.append(genome)
genome_names.append(input_params['genome_name'])
genome_meta.append(input_params['metadata'])

results = self._save_genomes(
workspace_id, genome_names, genome_data, genome_meta
)

# 5) return the result
details = [
{'genome_ref': _upa(result["info"]), 'genome_info': result["info"]}
for result in results
]
for detail in details:
logging.info(f"Genome saved to {detail['genome_ref']}")
return details

def _save_genomes(
self,
workspace_id,
genome_names,
genome_data,
genome_meta
):
results = [
self.gi.save_one_genome(
{
'workspace': workspace_id,
'name': name,
'data': data,
"meta": meta,
}
) for name, data, meta in zip(genome_names, genome_data, genome_meta)
]
return results

@staticmethod
def validate_params(params):
if 'workspace_name' not in params:
raise ValueError('required "workspace_name" field was not defined')
if 'genome_name' not in params:
raise ValueError('required "genome_name" field was not defined')
if 'file' not in params:
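
The integer check and the "exactly one workspace identifier" rule in the hunk above can be exercised in isolation. The sketch below re-implements both as free functions for illustration; `get_int`, `check_exactly_one`, and `rejects` are names chosen for this example, not the module's API.

```python
def get_int(putative_int, name, minimum=1):
    """Standalone version of the _get_int logic: None passes through untouched."""
    if putative_int is not None:
        # type(...) != int also rejects bool, since type(True) is bool.
        if type(putative_int) != int:
            raise ValueError(f"{name} must be an integer, got: {putative_int}")
        if putative_int < minimum:
            raise ValueError(f"{name} must be an integer >= {minimum}")
    return putative_int


def check_exactly_one(ws_id, ws_name):
    """Raise unless exactly one of the two identifiers is truthy."""
    if bool(ws_id) == bool(ws_name):  # both set or both missing
        raise ValueError("Exactly one of 'workspace_id' or 'workspace_name' must be provided")


def rejects(func, *args):
    """Return True when func(*args) raises ValueError."""
    try:
        func(*args)
    except ValueError:
        return True
    return False


print(get_int(5, "workspace_id"), get_int(None, "workspace_id"))
print(rejects(get_int, True, "workspace_id"), rejects(get_int, 0, "workspace_id"))
print(rejects(check_exactly_one, 5, "my_ws"), rejects(check_exactly_one, 5, None))
```

Using an exact type comparison instead of `isinstance` is what keeps `True` from slipping through as a workspace ID of 1.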
@@ -206,15 +288,15 @@

return input_directory

- def parse_genbank(self, file_path, params):
+ def parse_genbank(self, workspace_id, file_path, params):
logging.info("Saving original file to shock")
shock_res = self.dfu.file_to_shock({
'file_path': file_path,
'make_handle': 1,
'pack': 'gzip',
})
# Write and save assembly file
- assembly_ref = self._save_assembly(file_path, params)
+ assembly_ref = self._save_assembly(workspace_id, file_path, params)
assembly_data = self.dfu.get_objects(
{'object_refs': [assembly_ref],
'ignore_errors': 0})['data'][0]['data']
@@ -319,7 +401,7 @@
logging.info(f"Feature Counts: {genome['feature_counts']}")
return genome

- def _save_assembly(self, genbank_file, params):
+ def _save_assembly(self, workspace_id, genbank_file, params):
"""Convert genbank file to fasta and save as assembly"""
contigs = Bio.SeqIO.parse(genbank_file, "genbank")
assembly_id = f"{params['genome_name']}_assembly"
@@ -367,7 +449,7 @@
Bio.SeqIO.write(out_contigs, fasta_file, "fasta")
assembly_ref = self.aUtil.save_assembly_from_fasta(
{'file': {'path': fasta_file},
- 'workspace_name': params['workspace_name'],
+ 'workspace_id': workspace_id,
'assembly_name': assembly_id,
'type': params.get('genome_type', 'isolate'),
'contig_info': extra_info})
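
One behavioral difference in this file's diff is how defaults are applied per input: the old code called `self.default_params.update(params)`, while the new loop builds `{**self.default_params, **input_params}`. The sketch below, using made-up default values, shows why the second form matters once several inputs are processed in one call.

```python
# Hypothetical defaults; the real default_params dict is not shown in full in this diff.
DEFAULTS = {"source": "Genbank", "release": None, "generate_missing_genes": 0}


def merge_per_input(defaults, inputs):
    """New style: a fresh merged dict per input, defaults left untouched."""
    return [{**defaults, **inp} for inp in inputs]


def merge_with_update(defaults, inputs):
    """Old style: update() mutates the shared dict, so values leak forward."""
    merged = []
    for inp in inputs:
        defaults.update(inp)
        merged.append(dict(defaults))
    return merged


inputs = [{"genome_name": "a", "release": "31"}, {"genome_name": "b"}]
clean = merge_per_input(dict(DEFAULTS), inputs)
leaky = merge_with_update(dict(DEFAULTS), inputs)
print(clean[1]["release"], leaky[1]["release"])  # None 31
```

The merge is shallow, so a mutable default value (such as a nested dict) would still be shared between inputs unless it is copied separately.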