Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] New release #240

Merged
merged 94 commits into from
Feb 15, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
94 commits
Select commit Hold shift + click to select a range
ed478ec
added test.txt
Oct 13, 2021
87847ad
added possibility of amplicon sequencing and choosing level of read …
Oct 14, 2021
50ec59f
deleted test.txt
Oct 14, 2021
10f0134
Style changes according to PEP 8
Oct 28, 2021
a0e0634
removed quality_bin options
Oct 29, 2021
471ecc2
improve flow
Oct 29, 2021
7112649
A2BCR-25 working tests
Oct 29, 2021
47f16ed
Merge pull request #2 from ENPICOM/amplicon
ThijsMaas Feb 16, 2022
8684d73
Fixed tests.
ThijsMaas Mar 9, 2022
b47fdac
Removed print statement
ThijsMaas Oct 5, 2022
53cd244
Removed unused import statements
ThijsMaas Nov 7, 2022
456d8c4
More code cleanup and test added for amplicon read
ThijsMaas Nov 7, 2022
9cb368d
Merge pull request #4 from ENPICOM/amp-seq-fix
ThijsMaas Nov 7, 2022
a2f2e43
More option for model generation
ThijsMaas Jan 13, 2023
6ca615d
Revert "More option for model generation"
ThijsMaas Jan 25, 2023
053e45c
Added default amplicon sequence type
ThijsMaas Feb 6, 2023
874af2a
Fized default
ThijsMaas Feb 6, 2023
b7d6915
ignore
ThijsMaas Jul 12, 2023
9b96034
ignore eggs
ThijsMaas Jul 13, 2023
dd71fab
Remove unused dependency: future
gdrosos Aug 11, 2023
f6d71ad
Improve formatting
gdrosos Aug 11, 2023
97005d0
unpin deps
HadrienG Aug 4, 2023
1e18951
migrate to pytest
HadrienG Aug 7, 2023
da31452
tomutable -> MutableSeq()
HadrienG Aug 7, 2023
0e9fd7a
fix crash on empty error models
HadrienG Aug 7, 2023
f204bd8
numpt deprecations
HadrienG Aug 8, 2023
a8b354b
GC -> gc_fraction (biopython deprecation)
HadrienG Aug 8, 2023
e2d75bd
update test matrix + switch github ci to pytest
HadrienG Aug 8, 2023
9f083c4
quote version number in ci matrix
HadrienG Aug 8, 2023
e3b623e
test compat python>3.11
HadrienG Aug 8, 2023
7ebf763
use the correct dict in draft/coverage mode
HadrienG Aug 8, 2023
b6e60a3
version bump (1.6.0)
HadrienG Aug 10, 2023
70a101c
add sequence_type arg to small generate test
HadrienG Sep 13, 2023
671d445
Merge branch 'master' into ENPICOM-amp-seq-fix
HadrienG Sep 13, 2023
0bbb744
Merge pull request #239 from HadrienG/ENPICOM-amp-seq-fix
HadrienG Sep 13, 2023
9e75b2a
Merge pull request #238 from gdrosos/master
HadrienG Sep 13, 2023
53c6d78
Merge branch 'master' into updateMaster
ThijsMaas Sep 14, 2023
cfd0ee0
Fix test
ThijsMaas Sep 14, 2023
f0b132b
Merge pull request #8 from ENPICOM/updateMaster
ThijsMaas Sep 14, 2023
4c15100
Added readcounts parser. Added readcounts file input, added alternati…
ThijsMaas Sep 21, 2023
cb73c36
Merge remote-tracking branch 'upstream/dev' into improve-amplicon-seq
ThijsMaas Sep 21, 2023
6f2368a
Merge pull request #242 from ENPICOM/improve-amplicon-seq
HadrienG Oct 6, 2023
c352361
Style update with Black, isort and flake8.
ThijsMaas Nov 13, 2023
69627af
restore main
ThijsMaas Nov 22, 2023
fd522f5
restore main functionality
ThijsMaas Nov 22, 2023
6e83c7e
Added new dev dependencies
ThijsMaas Nov 23, 2023
67b1a20
Separate pipenv run commands
ThijsMaas Nov 23, 2023
0538841
Black requirements
ThijsMaas Nov 23, 2023
30841b5
Split app.py
ThijsMaas Nov 23, 2023
e0da6ba
Single temp file for each cpu, reduces concat work
ThijsMaas Nov 23, 2023
dc68a33
Also added tomli for black on python versions < 3.11
ThijsMaas Nov 24, 2023
3ffe963
Dont run codecov on forks
ThijsMaas Nov 24, 2023
a2f04be
Merge branch 'styling-formatting-rework' into amplicon-performance-an…
ThijsMaas Nov 24, 2023
7a59bed
remove line
ThijsMaas Nov 24, 2023
b818bbc
Merge pull request #245 from ENPICOM/styling-formatting-rework
HadrienG Nov 24, 2023
996da4c
Warning instead of assertion
ThijsMaas Nov 24, 2023
6ae032e
Remove joblib
ThijsMaas Nov 24, 2023
f5a4e96
info to debug
ThijsMaas Nov 27, 2023
35b4b37
Fix coverage variable bug
ThijsMaas Nov 27, 2023
c24c81f
fix n_reads vs n_read_pairs issue
ThijsMaas Nov 27, 2023
534bbf4
Fix unrounded issue
ThijsMaas Nov 27, 2023
02fb399
fix cap of 500 on insert size distribution
HadrienG Nov 28, 2023
351a636
linting
HadrienG Nov 28, 2023
0388886
Added new fragment length params and fixes tests
ThijsMaas Nov 29, 2023
f5fe702
fix isort
ThijsMaas Nov 30, 2023
d4aa56b
docstrings
ThijsMaas Dec 19, 2023
a62f395
removed list()
ThijsMaas Dec 19, 2023
d2b1dd9
Restored default functionality. Add logging
ThijsMaas Dec 19, 2023
e6db5da
Merge branch 'experiment/fragment-length' into amplicon-performance-a…
ThijsMaas Dec 19, 2023
df39237
removed import
ThijsMaas Dec 19, 2023
fcd3490
Merge pull request #246 from ENPICOM/amplicon-performance-and-features
HadrienG Jan 8, 2024
716a700
Merge pull request #247 from HadrienG/isize
HadrienG Jan 8, 2024
aec500d
update hiseq and novaseq profiles
HadrienG Feb 1, 2024
fb425d1
Merge pull request #248 from HadrienG/hiseq_novaseq_updates
HadrienG Feb 1, 2024
2debb42
added new miSeq and Nextseq error models
StefanLelieveld Feb 1, 2024
fe9c5f4
added legacy MiSeq error model to fix the test_amp test
StefanLelieveld Feb 1, 2024
8633b9f
init storing mutations
ThijsMaas Feb 5, 2024
968d6e7
format and fix
ThijsMaas Feb 5, 2024
823fd75
Fix mut sequence output
ThijsMaas Feb 5, 2024
7108eab
Add indel storing and fix cleanup
ThijsMaas Feb 5, 2024
87cf604
comment
ThijsMaas Feb 5, 2024
886fdea
snp -> sub
ThijsMaas Feb 7, 2024
e8ff9ca
Merge pull request #250 from HadrienG/storing-mutations
HadrienG Feb 12, 2024
cac4365
removed types from precomputed_error_models
StefanLelieveld Feb 12, 2024
cd714b1
Merge pull request #249 from HadrienG/nextSeq-miSeq-error-models
HadrienG Feb 13, 2024
3e25e0a
comment out raise statement used while debugging
HadrienG Feb 13, 2024
ef4908a
sequence_type is optional since meta is the default
HadrienG Feb 13, 2024
b3ece20
ignore case in moel names
HadrienG Feb 15, 2024
f03eb94
update iss generate doc
HadrienG Feb 15, 2024
5588ad1
update readthedocs config
HadrienG Feb 15, 2024
412dfc7
update readthedocs config
HadrienG Feb 15, 2024
2150de4
update model qualities and read length
HadrienG Feb 15, 2024
4772f53
remove todo ...
HadrienG Feb 15, 2024
0b80d8e
add thijs and stefan as authors on the doc website
HadrienG Feb 15, 2024
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
5 changes: 5 additions & 0 deletions .flake8
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
[flake8]
max-line-length = 120
ignore = E203, W503, C901
exclude = .git,__pycache__,docs/source/conf.py,old,build,dist
max-complexity = 10
6 changes: 6 additions & 0 deletions .github/workflows/pythonpackage.yml
Original file line number Diff line number Diff line change
Expand Up @@ -21,11 +21,17 @@ jobs:
python -m pip install --upgrade pip
pip install pipenv
pipenv install --dev
- name: Style check
run: |
pipenv run black
pipenv run isort
pipenv run flake8
- name: Test with pytest
run: |
chmod -w data/read_only.fasta
pipenv run tests
- name: Upload to Codecov
if: github.event.repository.fork == false
uses: codecov/[email protected]
with:
token: ${{ secrets.CODECOV_TOKEN }}
Expand Down
2 changes: 2 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -3,6 +3,7 @@
*.bam
*.bam.bai
*.bt2
*.npz
test_runs/

# Byte-compiled / optimized / DLL files
Expand All @@ -11,6 +12,7 @@ __pycache__/

# Distribution / packaging
*.egg-info/
.eggs/
dist/
build/

Expand Down
7 changes: 7 additions & 0 deletions .readthedocs.yml
Original file line number Diff line number Diff line change
@@ -1,4 +1,11 @@
version: 2

build:
os: ubuntu-22.04
tools:
python: "3"
python:
install:
- requirements: doc/requirements.txt
- method: pip
path: .
10 changes: 8 additions & 2 deletions Pipfile
Original file line number Diff line number Diff line change
Expand Up @@ -4,11 +4,9 @@ verify_ssl = true
name = "pypi"

[packages]
future = "*"
numpy = "*"
scipy = "*"
biopython = "*"
joblib = "*"
pysam = "*"
requests = "*"
urllib3 = ">=1.26.5"
Expand All @@ -20,7 +18,15 @@ pycodestyle = "*"
pytest = "*"
pytest-cov = "*"
exceptiongroup = "*"
black = "*"
isort = "*"
flake8 = "*"
typing-extensions = "*"
tomli = "*"

[scripts]
iss = "python -m iss"
tests = "pytest --cov=iss ."
black = "black --check ."
isort = "isort --check ."
flake8 = "flake8 ."
875 changes: 485 additions & 390 deletions Pipfile.lock

Large diffs are not rendered by default.

5 changes: 5 additions & 0 deletions data/readcounts.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
amplicon_A 2
amplicon_T 2
amplicon_GC 4
amplicon_ATCG 1
amplicon_TA 1
2 changes: 1 addition & 1 deletion doc/index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -16,7 +16,7 @@ InSilicoSeq supports substitution, insertion, deletion errors, and models gc bia
Details
-------

* **Authors:** Hadrien Gourlé, Juliette Hayer, Oskar E. Karlsson and Erik Bongcam-Rudloff
* **Authors:** Hadrien Gourlé, Juliette Hayer, Oskar E. Karlsson, Erik Bongcam-Rudloff, Stefan Lelieveld, Thijs Maas
* **Contact:** `[email protected] <[email protected]>`_
* **GitHub:** `HadrienG/InSilicoSeq <https://github.com/HadrienG/InSilicoSeq>`_
* **License:** MIT
Expand Down
47 changes: 42 additions & 5 deletions doc/iss/generate.rst
Original file line number Diff line number Diff line change
Expand Up @@ -3,10 +3,19 @@
Generating reads
================

InSilicoSeq can simulate amplicon reads, or reads from whole metagenome sequencing (the default).
You can specify the type of reads you want to simulate with the ``--sequence_type`` option.

InSilicoSeq comes with a set of pre-computed error models to allow the user to easily generate reads from the most popular Illumina instruments:

- HiSeq
- MiSeq
- MiSeq (optionally, with various quality thresholds):
- MiSeq-20
- MiSeq-24
- MiSeq-28
- MiSeq-32
- MiSeq-36
- NextSeq
- NovaSeq

Per example generate 1 million MiSeq reads from a set of input genomes:
Expand Down Expand Up @@ -49,6 +58,19 @@ You can also provide multiple input files:
curl -O -J -L https://osf.io/37kg8/download # download another example file
iss generate --genomes SRS121011.fasta minigut.fasta --n_genomes 5 --model novaseq --output novaseq_reads

Amplicons
---------

To generate amplicon reads, use the ``--sequence_type amplicon`` option:

.. code-block:: bash

# no example data is provided here
iss generate --genomes my_amplicons.fasta ---readcount_file counts.txt -sequence_type amplicon --model nextseq --output reads

where ``counts.txt`` is a tab-delimited file containing the number of reads to generate for each amplicon sequence present in ``my_amplicons.fasta``.
Alternatively, you can use the ``--n_reads`` option to generate a fixed number of reads, together with an abundance distribution.

Draft genomes
-------------

Expand Down Expand Up @@ -230,6 +252,11 @@ coverage distribution. Can be uniform, halfnormal, exponential, lognormal or zer

file containing coverage information (default: None).

--readcount_file
^^^^^^^^^^^^^^^^

file containing read_count information (default: None).

--n_reads
^^^^^^^^^

Expand All @@ -246,16 +273,21 @@ Can be 'kde' or 'basic'
^^^^^^^^

Error model file. (default: None).
Use HiSeq, NovaSeq or MiSeq for a pre-computed error model provided with the software, or a file generated with iss model.
If you do not wish to use a model, use --mode basic.
The name of the built-in models is case insensitive.
Use HiSeq, NextSeq, NovaSeq, MiSeq or Miseq-[20,24,28,32] for a pre-computed error model provided with the software, or a file generated with iss model.
If you do not wish to use a model, use --mode basic or --mode perfect.
The name of the built-in models are case insensitive.

--gc_bias
^^^^^^^^^

If set, may fail to sequence reads with abnormal GC content.
Does not guarantee --n_reads (default: False)

--sequence_type
^^^^^^^^^^^^^^^

Type of sequencing. Can be metagenomics or amplicon (default: metagenomics).

--cpus
^^^^^^

Expand Down Expand Up @@ -284,4 +316,9 @@ Output file path and prefix (Required)
--compress
^^^^^^^^^^

Compress the output in gzip format (default: False).
Compress the output in gzip format (default: False).

--store_mutations
^^^^^^^^^^^^^^^^^

Generates an additional VCF file with the mutations introduced in the reads (bool).
1 change: 0 additions & 1 deletion doc/iss/install.rst
Original file line number Diff line number Diff line change
Expand Up @@ -18,7 +18,6 @@ To install InSilicoSeq, type the following in your terminal:
It will install InSilicoSeq as well as the following dependencies:

* biopython
* joblib
* numpy
* pysam
* scipy
Expand Down
4 changes: 3 additions & 1 deletion doc/iss/model.rst
Original file line number Diff line number Diff line change
Expand Up @@ -12,10 +12,12 @@ Available models
+------------+-------------+
| Model name | Read length |
+============+=============+
| MiSeq | 300 bp |
| MiSeq* | 300 bp |
+------------+-------------+
| HiSeq | 125 bp |
+------------+-------------+
| NextSeq | 300 bp |
+------------+-------------+
| NovaSeq | 150 bp |
+------------+-------------+

Expand Down
Binary file modified doc/iss/qualities.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
4 changes: 0 additions & 4 deletions iss/__init__.py
Original file line number Diff line number Diff line change
@@ -1,4 +0,0 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-

from iss import *
1 change: 1 addition & 0 deletions iss/__main__.py
Original file line number Diff line number Diff line change
Expand Up @@ -2,4 +2,5 @@
# -*- coding: utf-8 -*-

from iss.app import main

main()
78 changes: 51 additions & 27 deletions iss/abundance.py
Original file line number Diff line number Diff line change
@@ -1,13 +1,40 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-

from Bio import SeqIO
from scipy import stats

import logging
import os
import sys
import logging

import numpy as np
from Bio import SeqIO
from scipy import stats


def parse_readcount_file(readcount_file):
logger = logging.getLogger(__name__)

readcount_dic = {}
try:
assert os.stat(readcount_file).st_size != 0
f = open(readcount_file, "r")
except (IOError, OSError) as e:
logger.error("Failed to open file:%s" % e)
sys.exit(1)
except AssertionError:
logger.error("File seems empty: %s" % readcount_file)
sys.exit(1)
else:
with f:
for line in f:
try:
genome_id = line.split()[0]
read_count = int(line.split()[1])
except IndexError as e:
logger.error("Failed to read file: %s" % e)
sys.exit(1)
else:
readcount_dic[genome_id] = read_count
return readcount_dic


def parse_abundance_file(abundance_file):
Expand All @@ -28,12 +55,12 @@ def parse_abundance_file(abundance_file):
abundance_dic = {}
try:
assert os.stat(abundance_file).st_size != 0
f = open(abundance_file, 'r')
f = open(abundance_file, "r")
except (IOError, OSError) as e:
logger.error('Failed to open file:%s' % e)
logger.error("Failed to open file:%s" % e)
sys.exit(1)
except AssertionError as e:
logger.error('File seems empty: %s' % abundance_file)
except AssertionError:
logger.error("File seems empty: %s" % abundance_file)
sys.exit(1)
else:
with f:
Expand All @@ -42,11 +69,11 @@ def parse_abundance_file(abundance_file):
genome_id = line.split()[0]
abundance = float(line.split()[1])
except IndexError as e:
logger.error('Failed to read file: %s' % e)
logger.error("Failed to read file: %s" % e)
sys.exit(1)
else:
abundance_dic[genome_id] = abundance
logger.debug('Loaded abundance/coverage file: %s' % abundance_file)
logger.debug("Loaded abundance/coverage file: %s" % abundance_file)
return abundance_dic


Expand Down Expand Up @@ -179,16 +206,16 @@ def coverage_scaling(total_n_reads, abundance_dic, genome_file, read_length):
Returns:
dict: scaled coverage dictionary
"""
logger = logging.getLogger(__name__)
total_reads = 0
f = open(genome_file, 'r') # re-opens the file
f = open(genome_file, "r") # re-opens the file
with f:
fasta_file = SeqIO.parse(f, 'fasta')
fasta_file = SeqIO.parse(f, "fasta")
for record in fasta_file:
try:
species_coverage = abundance_dic[record.id]
except KeyError as e:
logger.error(
'Fasta record not found in abundance file: %s' % e)
logger.error("Fasta record not found in abundance file: %s" % e)
sys.exit(1)

genome_size = len(record.seq)
Expand All @@ -210,18 +237,18 @@ def to_file(abundance_dic, output, mode="abundance"):
"""
logger = logging.getLogger(__name__)
if mode == "abundance":
output_abundance = output + '_abundance.txt'
output_abundance = output + "_abundance.txt"
else:
output_abundance = output + '_coverage.txt'
output_abundance = output + "_coverage.txt"
try:
f = open(output_abundance, 'w')
f = open(output_abundance, "w")
except PermissionError as e:
logger.error('Failed to open output file: %s' % e)
logger.error("Failed to open output file: %s" % e)
sys.exit(1)
else:
with f:
for record, abundance in abundance_dic.items():
f.write('%s\t%s\n' % (record, abundance))
f.write("%s\t%s\n" % (record, abundance))


def draft(genomes, draft, distribution, output, mode="abundance"):
Expand All @@ -240,13 +267,10 @@ def draft(genomes, draft, distribution, output, mode="abundance"):
# first we get a list of contig names in draft genomes
draft_records = []
for d in draft:
draft_records.extend(
[record.name for record in SeqIO.parse(d, 'fasta')])
draft_records.extend([record.name for record in SeqIO.parse(d, "fasta")])
genomes = list(set(genomes) - set(draft_records))
abundance_dic = distribution(genomes + draft)
complete_genomes_abundance = {k: v for
k, v in abundance_dic.items()
if k not in draft}
complete_genomes_abundance = {k: v for k, v in abundance_dic.items() if k not in draft}
to_file(abundance_dic, output)
draft_dic = expand_draft_abundance(abundance_dic, draft, mode)
full_abundance_dic = {**complete_genomes_abundance, **draft_dic}
Expand All @@ -273,12 +297,12 @@ def expand_draft_abundance(abundance_dic, draft, mode="abundance"):
for key, abundance in abundance_dic.items():
if key in draft:
try:
f = open(key, 'r')
f = open(key, "r")
with f:
fasta_file = SeqIO.parse(f, 'fasta')
fasta_file = SeqIO.parse(f, "fasta")
draft_records = [record for record in fasta_file]
total_length = sum(len(record) for record in draft_records)
except Exception as e:
except Exception:
raise
else:
total_length = sum(len(record) for record in draft_records)
Expand Down
Loading
Loading