
get_unite_data functions #134

Merged
64 commits merged on Nov 13, 2023

Changes from 7 commits (64 commits total)
4e59f17
Add _assemble_unite_data_urls and testing code
colinbrislawn Apr 10, 2022
2d17273
update unite url lookup
colinbrislawn Oct 5, 2023
91de774
add _assemble_unite_data_urls() and DOI dict
colinbrislawn Oct 6, 2023
0c8e1b2
Delete .vscode/settings.json
colinbrislawn Oct 6, 2023
97c23ab
Add draft _unite_download_targz()'
colinbrislawn Oct 6, 2023
98de886
Merge branch 'get_unite' of https://github.com/colinbrislawn/RESCRIPt…
colinbrislawn Oct 6, 2023
78414b1
Add examples for questions
colinbrislawn Oct 7, 2023
d7dc1e2
add imporing tests
colinbrislawn Oct 11, 2023
6a55d8f
revert get_data.py and splint out unite
colinbrislawn Oct 11, 2023
ab53997
simply _unite_doi_to_url
colinbrislawn Oct 11, 2023
8668d23
unmuddle _get_unite_* functions
colinbrislawn Oct 12, 2023
c2dd4cf
rename function, again
colinbrislawn Oct 12, 2023
2d69773
remove old Silva stuff
colinbrislawn Oct 12, 2023
ff30cb9
Merge branch 'bokulich-lab:master' into get_unite
colinbrislawn Oct 12, 2023
da73365
Add working get_unite_data()!
colinbrislawn Oct 12, 2023
f2035c7
add alt method get_unite_data2()
colinbrislawn Oct 12, 2023
4b2688d
major updateto get_unite_data
colinbrislawn Oct 14, 2023
0beac8c
linting with flake8
colinbrislawn Oct 15, 2023
a63dd1f
remove unneeded import
colinbrislawn Oct 16, 2023
d09cd46
remove unitefile.tar.gz and change cluster_id filter
colinbrislawn Oct 16, 2023
537ccdb
rename for consistancy
colinbrislawn Oct 17, 2023
7443eb6
flake8 and remove unite v8.0
colinbrislawn Oct 17, 2023
e0d3c84
match files by _dev and check for unfinished download
colinbrislawn Oct 17, 2023
e674fde
fix labels
colinbrislawn Oct 18, 2023
2607842
fixes to retry download loop
colinbrislawn Oct 18, 2023
3f072ce
major update, rename and reorg functions
colinbrislawn Oct 20, 2023
e405f81
autoformat with black
colinbrislawn Oct 20, 2023
3587f10
add early testing
colinbrislawn Oct 21, 2023
a3589c5
Add alt tests for get_doi and get_url
colinbrislawn Oct 21, 2023
a728d59
add nilsson2019unite
colinbrislawn Oct 23, 2023
3991ed0
add check to _unite_get_artifacts
colinbrislawn Oct 23, 2023
8346ea8
major update to testing code
colinbrislawn Oct 23, 2023
31cb706
add global test_get_unite_data and alt
colinbrislawn Oct 23, 2023
2414dfe
merge _unite_get_doi into _unite_get_url
colinbrislawn Oct 24, 2023
019a40a
remove unused code
colinbrislawn Oct 24, 2023
f637891
return tuple of Artifacts, without lists
colinbrislawn Oct 24, 2023
aed4d06
update unite citation
colinbrislawn Oct 24, 2023
3f9dafc
remove extra print lines
colinbrislawn Oct 24, 2023
1bc5231
update unite description and license
colinbrislawn Oct 24, 2023
657fd2b
fixing types
colinbrislawn Oct 25, 2023
b20c29a
first working build
colinbrislawn Oct 26, 2023
42e3ca1
test bad URL in _unite_get_tgz
colinbrislawn Oct 27, 2023
2781122
set all defaults for get_unite_data()
colinbrislawn Nov 1, 2023
d5cf2d4
removing unneeded / uncommon error handeling in _unite_get_tgz
colinbrislawn Nov 1, 2023
af258d5
tests
colinbrislawn Nov 1, 2023
b371d96
Update defaults
colinbrislawn Nov 2, 2023
3d4e53e
remove unneeded import of HTTPError
colinbrislawn Nov 3, 2023
0ed5f8f
Test output types for get_unite_data()
colinbrislawn Nov 6, 2023
032e220
formating
colinbrislawn Nov 6, 2023
6da7608
remove full unite abstract
colinbrislawn Nov 6, 2023
f90f25d
reword error for wrong number of cluster_id files
colinbrislawn Nov 6, 2023
29a3763
update text
colinbrislawn Nov 8, 2023
d6ed1a3
use one dlfail in _unite_get_tgz
colinbrislawn Nov 8, 2023
76d92b5
use self.assertRaisesRegex
colinbrislawn Nov 8, 2023
4317d77
Add mock download code
colinbrislawn Nov 8, 2023
52a9ed1
roll back mock download
colinbrislawn Nov 8, 2023
79e7384
Merge branch 'get_unite' of https://github.com/colinbrislawn/RESCRIPt…
colinbrislawn Nov 8, 2023
2bc4735
rollback, again
colinbrislawn Nov 8, 2023
4e39434
rollback, again
colinbrislawn Nov 8, 2023
8747b47
add back URL
colinbrislawn Nov 8, 2023
c674b00
add mocked test_unite_get_tgz2
colinbrislawn Nov 8, 2023
4f67f0f
use MixedCaseDNAFASTAFormat
colinbrislawn Nov 10, 2023
592905b
black format
colinbrislawn Nov 10, 2023
55109fc
test get_tgz with mock
colinbrislawn Nov 13, 2023
119 changes: 119 additions & 0 deletions rescript/get_data.py
@@ -12,13 +12,132 @@
import shutil
import gzip
import warnings
import requests
import tarfile

import qiime2
from urllib.request import urlretrieve
from urllib.error import HTTPError
from q2_types.feature_data import RNAFASTAFormat


def _unite_dois_to_urls(DOIs):
'''Generate UNITE urls, given their DOIs.'''
# Make DOIs iterable
DOIs = [DOIs] if isinstance(DOIs, str) else DOIs
print('Get URLs for these DOIs:', DOIs)
base_url = 'https://api.plutof.ut.ee/v1/public/dois/'\
'?format=vnd.api%2Bjson&identifier='
# Eventual output
URLs = set()
# For each DOI, get download URL of file
for DOI in DOIs:
query_data = requests.get(base_url + DOI).json()
# Files attached to a DOI can be updated, so on the advice of the PlutoF devs,
# take only the most recently uploaded file (the -1 index below)
URL = query_data['data'][0]['attributes']['media'][-1]['url']
URLs.add(URL)
return URLs
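
A minimal usage sketch of the helper above (hypothetical interactive session; assumes network access to api.plutof.ut.ee and uses the UNITE 9.0 fungi DOI from the lookup table below):

# Hypothetical usage sketch, not part of this PR.
urls = _unite_dois_to_urls('10.15156/BIO/2938079')
print(urls)  # a set containing one download URL hosted on files.plutof.ut.ee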


def _unite_get_url(version, taxon_group, singletons):
'''Generate UNITE urls, given database version and reference target.'''
# Lookup DOIs for various databases, source: https://unite.ut.ee/repository.php
unite_dois = {
'9.0': {'fungi': {False: '10.15156/BIO/2938079', True: '10.15156/BIO/2938080'},
'eukaryotes': {False: '10.15156/BIO/2938081', True: '10.15156/BIO/2938082'}},
# Old version 9.0 is not listed here
'8.3': {'fungi': {False: '10.15156/BIO/1264708', True: '10.15156/BIO/1264763'},
'eukaryotes': {False: '10.15156/BIO/1264819', True: '10.15156/BIO/1264861'}},
'8.2': {'fungi': {False: '10.15156/BIO/786385', True: '10.15156/BIO/786387'},
'eukaryotes': {False: '10.15156/BIO/786386', True: '10.15156/BIO/786388'}},
'8.0': {'fungi': {False: '', True: '10.15156/BIO/786349'},
'eukaryotes': {False: '', True: ''}}, # All other 8.0 are in zip files
}
# There's got to be a better way! See https://stackoverflow.com/questions/25833613/safe-method-to-get-value-of-nested-dictionary
try:
# Check if we have the DOI requested
target_doi = unite_dois[version][taxon_group][singletons]
except KeyError as ke:
print('Unknown DOI for this value: ' + str(ke))
raise
return _unite_dois_to_urls(target_doi).pop()
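
The StackOverflow link above asks for a cleaner nested lookup; one possible alternative, shown only as a sketch and not part of this PR, is chained dict.get() calls with an explicit check:

# Sketch of an alternative lookup (hypothetical, not in this PR);
# assumes the same unite_dois dict and function parameters as above.
target_doi = unite_dois.get(version, {}).get(taxon_group, {}).get(singletons)
if target_doi is None:
    raise ValueError('No UNITE DOI known for version=%s, taxon_group=%s, '
                     'singletons=%s' % (version, taxon_group, singletons))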

_unite_get_url(version='9.0', taxon_group='fungi', singletons=False)

# with tempfile.TemporaryDirectory() as tmp_dir:
tmp_dir = tempfile.mkdtemp()

def _unite_download_targz(url, download_path):
print('Downloading ' + url)

response = requests.get(url, stream=True)
if response.status_code != 200:
raise ValueError("Failed to download the file from " + url)

tar_file_path = os.path.join(download_path, 'unitefile.tar.gz')
with open(tar_file_path, 'wb') as f:
f.write(response.content)

# Extract only the 'developer' subdirectory
with tarfile.open(tar_file_path, 'r:gz') as tar:
# Ensure that 'developer' exists in the tar file
members = [member for member in tar.getmembers() if member.name.startswith('developer')]
if not members:
raise ValueError("No 'developer' subdirectory found in the .tar.gz file.")

for member in members:
member.name = os.path.basename(member.name) # Strip the 'developer' prefix
tar.extract(member, path=download_path)

return download_path
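
Since the request is opened with stream=True, the archive could also be written in chunks instead of buffering response.content in memory; a sketch only, reusing the response and tar_file_path names from the function above:

# Sketch: chunked write avoids holding the whole .tar.gz in memory.
with open(tar_file_path, 'wb') as f:
    for chunk in response.iter_content(chunk_size=1024 * 1024):
        f.write(chunk)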

# Test it by downloading this file
# _unite_download_targz('https://files.plutof.ut.ee/public/orig/59/12/591225E8985EFC44B595C79AF5F467421B4D9A95093A0811B13CB4CC13A6DA46.tgz', tmp_dir)

# import as artifacts
# results[name] = qiime2.Artifact.import_data(dtype, destination)

def get_unite_data(version, taxon_group, singletons=False):
url = _unite_get_url(version, taxon_group, singletons)
results = {'sequences': [], 'taxonomy': []}

# with tempfile.TemporaryDirectory() as tmp_dir:
tmp_dir = tempfile.mkdtemp()

print('Temporary directory:', tmp_dir)
_unite_download_targz(url, download_path=tmp_dir)

for root, dirs, files in os.walk(tmp_dir):
for file in files:
print(results)
if file.endswith('.fasta'):
fasta_file_name = os.path.join(root, file)
print('found fasta: ' + fasta_file_name)
with open(fasta_file_name, 'r') as fasta_file:
# Read the content of the file and append it as a Python object
fasta_content = fasta_file.read()
results['sequences'].append(qiime2.Artifact.import_data('FeatureData[RNASequence]', fasta_content))
elif file.endswith('.txt'):
txt_file_name = os.path.join(root, file)
print('found txt: ' + txt_file_name)
results['taxonomy'].append(qiime2.Artifact.import_data('FeatureData[Taxonomy]', txt_file_name))
return results

get_unite_data(version='9.0', taxon_group='fungi')
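
A hypothetical follow-on to the test call above, showing how the returned artifacts could be written to disk (output filenames are illustrative only):

# Hypothetical usage: persist the imported artifacts as .qza files.
res = get_unite_data(version='9.0', taxon_group='fungi')
for i, seqs in enumerate(res['sequences']):
    seqs.save('unite-9.0-fungi-seqs-%d.qza' % i)
for i, tax in enumerate(res['taxonomy']):
    tax.save('unite-9.0-fungi-tax-%d.qza' % i)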


# How do I import data?
with open("/tmp/tmpzprzopiu/sh_refs_qiime_ver9_97_25.07.2023_dev.fasta", 'r') as fasta_file:
qiime2.Artifact.import_data('FeatureData[RNASequence]', fasta_file)

with open("/tmp/tmpzprzopiu/sh_refs_qiime_ver9_97_25.07.2023_dev.fasta", 'r') as fasta_file:
# Read the content of the file and append it as a Python object
fasta_content = fasta_file.read()
qiime2.Artifact.import_data('FeatureData[RNASequence]', fasta_content)
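
One answer to the question above, as a sketch: qiime2.Artifact.import_data() also accepts a filesystem path, so the path can be passed directly instead of an open file handle or the file contents (path reused from the example above; this assumes the file validates against the default format for the given semantic type):

# Sketch: import by path rather than by file handle or file contents.
artifact = qiime2.Artifact.import_data(
    'FeatureData[RNASequence]',
    '/tmp/tmpzprzopiu/sh_refs_qiime_ver9_97_25.07.2023_dev.fasta')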



def get_silva_data(ctx,
version='138.1',
target='SSURef_NR99',
Expand Down