Skip to content

Commit

Permalink
Merge branch 'develop' into gonzalo-internship3
Browse files Browse the repository at this point in the history
  • Loading branch information
jakebeal authored Apr 7, 2024
2 parents f9e565c + d8df83c commit 1155790
Show file tree
Hide file tree
Showing 41 changed files with 5,249 additions and 52 deletions.
23 changes: 15 additions & 8 deletions .github/workflows/python-app.yml
Original file line number Diff line number Diff line change
Expand Up @@ -12,23 +12,25 @@ on:

jobs:
build:
runs-on: ubuntu-latest
env:
IDT_CREDENTIALS: ${{ secrets.IDT_CREDENTIALS }}
runs-on: ${{ matrix.os }}
strategy:
matrix:
# Default builds are on Ubuntu
os: [ubuntu-latest]
python-version: ['3.7', '3.8', '3.9', '3.10']
python-version: ['3.7', '3.8', '3.9', '3.10', '3.11']
include:
# Also test on macOS and Windows using the latest Python 3
- os: macos-latest
python-version: 3.x
- os: windows-latest
python-version: 3.x
python-version: 3.11 # Return to 3.x after resolution of https://github.com/RDFLib/pySHACL/issues/212
- os: windows-2019
python-version: 3.11 # Return to 3.x after resolution of https://github.com/RDFLib/pySHACL/issues/212

steps:
- uses: actions/checkout@v2
- uses: actions/checkout@v4
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v2
uses: actions/setup-python@v5
with:
python-version: ${{ matrix.python-version }}
- name: Install dependencies
Expand All @@ -37,11 +39,16 @@ jobs:
python -m pip install pytest
python -m pip install interrogate
- name: Setup Graphviz
uses: ts-graphviz/setup-graphviz@v1
uses: ts-graphviz/setup-graphviz@v2
with:
# Skip running of sometimes problematic brew update command on macOS.
# Remove after resolution of https://github.com/ts-graphviz/setup-graphviz/issues/593
macos-skip-brew-update: 'true' # default false
- name: Show Node.js version
run: |
node --version
- name: Test with pytest
run: |
pip install .
echo "$IDT_CREDENTIALS" > test_secret_idt_credentials.json
pytest --ignore=test/test_docstr_coverage.py -s
6 changes: 6 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -5,3 +5,9 @@
test/__pycache__/
sbol_utilities/__pycache__/
__pycache__/

# test secrets
test_secret*

.idea/
*.egg-info/
16 changes: 12 additions & 4 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -57,9 +57,9 @@ The `excel-to-sbol` utility reads an Excel file specifying a library of basic an
The `sbol-converter` utility converts between any of the SBOL3, SBOL2, GenBank, and FASTA formats.

Additional "macro" utilities convert specifically between SBOL3 and one of the other formats:
- `sbol2fasta` and `fasta2sbol` convert from SBOL3 to FASTA and vice versa
- `sbol2genbank` and `genbank2sbol` convert from SBOL3 to GenBank and vice versa
- `sbol3to2` and `sbol2to3` convert to and from SBOL2
- `sbol-to-fasta` and `fasta-to-sbol` convert from SBOL3 to FASTA and vice versa
- `sbol-to-genbank` and `genbank-to-sbol` convert from SBOL3 to GenBank and vice versa
- `sbol3-to-sbol2` and `sbol2-to-sbol3` convert to and from SBOL2

### Expand the combinatorial derivations in an SBOL file

Expand All @@ -69,6 +69,14 @@ The `sbol-expand-derivations` utility searches through an SBOL file for Combinat

The `sbol-calculate-sequences` utility attempts to calculate the sequence of any DNA Component that can be fully specified from the sequences of its sub-components.

### Calculate sequence synthesis complexity for DNA sequences in an SBOL file

The `sbol-calculate-complexity` utility attempts to calculate the synthesis complexity of any DNA sequence in the file, by sending sequences to be evaluated by IDT's sequence calculator service. Sequences whose complexity is known are not re-calculated.

The system uses the gBlock API, which is intended for sequences from 125 to 3000 bp in length. If it is more than 3000 bp or less than 125 bp your returned score will be 0. A complexity score in the range from 0 to 10 means your sequence is synthesizable, if the score is greater or equal than 10 means it is not synthesizable.

Note that use of this utility requires an account with IDT that is set up to use IDT's online service API (see: https://www.idtdna.com/pages/tools/apidoc)

### Compute the difference between two SBOL3 documents
The `sbol-diff` utility computes the difference between two SBOL3 documents
and reports the differences.
Expand All @@ -77,4 +85,4 @@ and reports the differences.
## Contributing

We welcome contributions that patch bugs, improve existing utilities or documentation, or add new utilities!
For guidance on how to contribute effectively to this project, see [CONTRIBUTING.md](CONTRIBUTING.md).
For guidance on how to contribute effectively to this project, see [CONTRIBUTING.md](CONTRIBUTING.md).
250 changes: 250 additions & 0 deletions sbol_utilities/calculate_complexity_scores.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,250 @@
from __future__ import annotations

import json

from typing import Optional

import datetime
import argparse
import logging
import uuid
from requests import post
from requests.auth import HTTPBasicAuth

import sbol3
import tyto

from sbol_utilities.workarounds import type_to_standard_extension

COMPLEXITY_SCORE_NAMESPACE = 'http://igem.org/IDT_complexity_score'
REPORT_ACTIVITY_TYPE = 'https://github.com/SynBioDex/SBOL-utilities/compute-sequence-complexity'


class IDTAccountAccessor:
"""Class that wraps access to the IDT API"""

_TOKEN_URL = 'https://www.idtdna.com/Identityserver/connect/token'
"""API URL for obtaining session tokens"""
_SCORE_URL = 'https://www.idtdna.com/api/complexities/screengBlockSequences'
"""APR URL for obtaining sequence scores"""
_BLOCK_SIZE = 1 # TODO: determine if it is possible to run multiple sequences in a single query
SCORE_TIMEOUT = 120
"""Number of seconds to wait for score query requests to complete"""

def __init__(self, username: str, password: str, client_id: str, client_secret: str):
"""Initialize with required access information for IDT API (see: https://www.idtdna.com/pages/tools/apidoc)
Automatically logs in and obtains a session token
:param username: Username of your IDT account
:param password: Password of your IDT account
:param client_id: ClientID key of your IDT account
:param client_secret: ClientSecret key of your IDT account
"""
self.username = username
self.password = password
self.client_id = client_id
self.client_secret = client_secret
self.token = self._get_idt_access_token()

@staticmethod
def from_json(json_object) -> IDTAccountAccessor:
"""Initialize IDT account accessor from a JSON object with field values
:param json_object: object with account information
:return: Account accessor object
"""
return IDTAccountAccessor(username=json_object['username'], password=json_object['password'],
client_id=json_object['ClientID'], client_secret=json_object['ClientSecret'])

def _get_idt_access_token(self) -> str:
"""Get access token for IDT API (see: https://www.idtdna.com/pages/tools/apidoc)
:return: access token string
"""
logging.info('Connecting to IDT API')
data = {'grant_type': 'password', 'username': self.username, 'password': self.password, 'scope': 'test'}
auth = HTTPBasicAuth(self.client_id, self.client_secret)
result = post(IDTAccountAccessor._TOKEN_URL, data, auth=auth, timeout=IDTAccountAccessor.SCORE_TIMEOUT)

if 'access_token' in result.json():
return result.json()['access_token']
else:
raise ValueError('Access token for IDT API could not be generated. Check your credentials.')

def get_sequence_scores(self, sequences: list[sbol3.Sequence]) -> list:
"""Retrieve synthesis complexity scores of sequences from the IDT API
This system uses the gBlock API, which is intended for sequences from 125 to 3000 bp in length. If it is more
than 3000 bp or less than 125 bp your returned score will be 0. A complexity score in the range from 0 to 10 means
your sequence is synthesizable, if the score is greater or equal than 10 means it is not synthesizable.
:param sequences: sequences for which we want to calculate the complexity score
:return: dictionary mapping sequences to complexity Scores
:return: List of lists of dictionaries with information about sequence synthesis features
"""
# Set up list of query dictionaries
seq_dict = [{'Name': str(seq.display_name), 'Sequence': str(seq.elements)} for seq in sequences]
# Break into query blocks
partitions_sequences = [seq_dict[x:x + 1] for x in range(0, len(seq_dict), IDTAccountAccessor._BLOCK_SIZE)]
# Send each query to IDT and collect results
results = []
for idx, partition in enumerate(partitions_sequences):
logging.debug('Sequence score request %i of %i', idx+1, len(partitions_sequences))
resp = post(IDTAccountAccessor._SCORE_URL, json=partition, timeout=IDTAccountAccessor.SCORE_TIMEOUT,
headers={'Authorization': 'Bearer {}'.format(self.token),
'Content-Type': 'application/json; charset=utf-8'})
response_list = resp.json()
if len(response_list) != len(partition):
raise ValueError(f'Unexpected complexity score: expected {len(partition)} scores, '
f'but got {len(response_list)}')
results.append(resp.json())
logging.info('Requests to IDT API finished.')
return results

def get_sequence_complexity(self, sequences: list[sbol3.Sequence]) -> dict[sbol3.Sequence, float]:
""" Extract complexity scores from IDT API for a list of SBOL Sequence objects
This works by computing full sequence evaluations, then compressing down to a single score for each sequence.
:param sequences: list of SBOL Sequences to evaluate
:return: dictionary mapping sequences to complexity Scores
"""
# Retrieve full evaluations for sequences
scores = self.get_sequence_scores(sequences)
# Compute total score for each sequence as the sum all complexity scores for the sequence
score_list = []
for score_set in scores:
for sequence_scores in score_set:
complexity_score = sum(score.get('Score') for score in sequence_scores)
score_list.append(complexity_score)
# Associate each sequence to its score
return dict(zip(sequences, score_list))


def get_complexity_score(seq: sbol3.Sequence) -> Optional[float]:
"""Given a sequence, return its previously computed complexity score, if such exists
:param seq: SBOL Sequence object to check for score
:return: score if set, None if not
"""
scores = [score for score in seq.measures if tyto.EDAM.sequence_complexity_report in score.types]
if scores:
if len(scores) > 1:
raise ValueError(f'Found multiple complexity scores on Sequence {seq.identity}')
return scores[0].value
else:
return None


def get_complexity_scores(sequences: list[sbol3.Sequence], include_missing=False) -> \
dict[sbol3.Sequence, Optional[float]]:
"""Retrieve complexity scores for a list of sequences
:param sequences: Sequences to get scores for
:param include_missing: if true, Sequences without scores are included, mapping to none
:return: dictionary mapping Sequence to score
"""
# TODO: change to run computations only on DNA sequences
score_map = {seq: get_complexity_score(seq) for seq in sequences}
if not include_missing:
score_map = {k: v for k, v in score_map.items() if v is not None}
return score_map


def idt_calculate_sequence_complexity_scores(accessor: IDTAccountAccessor, sequences: list[sbol3.Sequence]) -> \
dict[sbol3.Sequence, float]:
"""Given a list of sequences, compute the complexity scores for any sequences not currently scored
by sending the sequences to IDT's online service for calculating sequence synthesis complexity.
Also records the complexity computation with an activity
:param accessor: IDT API access object
:param sequences: list of SBOL Sequences to evaluate
:return: Dictionary mapping Sequences to complexity scores for newly computed sequences
"""
# Determine which sequences need scores
need_scores = [seq for seq, score in get_complexity_scores(sequences, include_missing=True).items()
if score is None]
if not need_scores:
return dict()

# Query for the scores of the sequences
score_dictionary = accessor.get_sequence_complexity(need_scores)

# Create report generation activity
doc = need_scores[0].document
timestamp = datetime.datetime.utcnow().isoformat(timespec='seconds') + 'Z'
report_id = f'{COMPLEXITY_SCORE_NAMESPACE}/Complexity_Report_{timestamp.replace(":", "").replace("-", "")}_' \
f'{str(uuid.uuid4())[0:8]}'
report_generation = sbol3.Activity(report_id, end_time=timestamp, types=[REPORT_ACTIVITY_TYPE])
doc.add(report_generation)

# Mark the sequences with their scores, where each score is a dimensionless measure
for sequence, score in score_dictionary.items():
measure = sbol3.Measure(score, unit=tyto.OM.number_unit, types=[tyto.EDAM.sequence_complexity_report])
measure.generated_by.append(report_generation)
sequence.measures.append(measure)
# return the dictionary of newly computed scores
return score_dictionary


def idt_calculate_complexity_scores(accessor: IDTAccountAccessor, doc: sbol3.Document) -> dict[sbol3.Sequence, float]:
"""Given an SBOL Document, compute the complexity scores for any sequences in the Document not currently scored
by sending the sequences to IDT's online service for calculating sequence synthesis complexity.
Also records the complexity computation with an activity
:param accessor: IDT API access object
:param doc: SBOL document with sequences of interest in it
:return: Dictionary mapping Sequences to complexity scores
"""
sequences = [obj for obj in doc if isinstance(obj, sbol3.Sequence)]
return idt_calculate_sequence_complexity_scores(accessor, sequences)


def main():
"""
Main wrapper: read from input file, invoke idt_calculate_complexity_scores, then write to output file
"""
parser = argparse.ArgumentParser()
parser.add_argument('-c', '--credentials',
help="""JSON file containing IDT API access credentials.
To obtain access credentials, follow the directions at https://www.idtdna.com/pages/tools/apidoc
The values of the IDT access credentials should be stored in a JSON of the following form:
{ "username": "username", "password": "password", "ClientID": "####", "ClientSecret": "XXXXXXXXXXXXXXXXXXX" }"
""")
parser.add_argument('--username', help="Username of your IDT account (if not using JSON credentials)")
parser.add_argument('--password', help="Password of your IDT account (if not using JSON credentials)")
parser.add_argument('--ClientID', help="ClientID of your IDT account (if not using JSON credentials)")
parser.add_argument('--ClientSecret', help="ClientSecret of your IDT account (if not using JSON credentials)")
parser.add_argument('input_file', help="Absolute path to sbol file with sequences")
parser.add_argument('output_name', help="Name of SBOL file to be written")
parser.add_argument('-t', '--file-type', dest='file_type', default=sbol3.SORTED_NTRIPLES,
help="Name of SBOL file to output to (excluding type)")
parser.add_argument('--verbose', '-v', dest='verbose', action='count', default=0)
args_dict = vars(parser.parse_args())

# Extract arguments:
verbosity = args_dict['verbose']
logging.getLogger().setLevel(level=(logging.WARN if verbosity == 0 else
logging.INFO if verbosity == 1 else logging.DEBUG))
input_file = args_dict['input_file']
output_name = args_dict['output_name']

if args_dict['credentials'] != None:
with open(args_dict['credentials']) as credentials:
idt_accessor = IDTAccountAccessor.from_json(json.load(credentials))
else:
idt_accessor = IDTAccountAccessor(args_dict['username'], args_dict['password'], args_dict['ClientID'],
args_dict['ClientSecret'])

extension = type_to_standard_extension[args_dict['file_type']]
outfile_name = output_name if output_name.endswith(extension) else output_name + extension

# Read file, convert, and write resulting document
logging.info('Reading SBOL file ' + input_file)
doc = sbol3.Document()
doc.read(input_file)
results = idt_calculate_complexity_scores(idt_accessor, doc)
doc.write(outfile_name, args_dict['file_type'])
logging.info('SBOL file written to %s with %i new scores calculated', outfile_name, len(results))


if __name__ == '__main__':
main()
Loading

0 comments on commit 1155790

Please sign in to comment.