
Tools for NGS sequence analysis and bacterial classification


nhoffman/bioy


bioy: a collection of bioinformatics tools

Bio-y
(pronounced "bio-ee") The adjective form of the noun "Bio"
Authors:

  • Noah Hoffman
  • Chris Rosenthal
  • Tyler Land

Dependencies:

  • A unix-like system; tested primarily on Ubuntu 12.04
  • Python 2.7.x
  • setuptools

Some functions require the following python packages:

  • numpy
  • pandas
  • biopython

Others require external programs.

To install bioy and python dependencies, run setup.py or pip from the project directory:

% cd bioy
% python setup.py install
# or
% pip install -U .

If you don't want to install the dependencies (numpy and pandas take a while to compile), use:

% pip install --no-deps -U .

Compiling numpy and pandas requires many build dependencies (and you will likely need to compile them, because the versions in package managers are typically out of date). On Ubuntu 12.04, these build dependencies can be installed by running:

% sudo apt-get build-dep python-numpy python-pandas
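After an install with --no-deps, it can be useful to check which optional pieces are actually available. The following is a minimal sketch, not part of bioy: it assumes a Python 3 interpreter (the project itself targets Python 2.7), and the list of external executables is a hypothetical example drawn from the action descriptions, not an authoritative list.

```python
import importlib.util
import shutil

# Import names of the optional Python packages listed above.
# Note that biopython is imported under the name "Bio".
OPTIONAL_PACKAGES = ["numpy", "pandas", "Bio"]

# Hypothetical list of external executables; adjust it to the actions you use.
EXTERNAL_PROGRAMS = ["blastn", "usearch", "ssearch36", "cmalign"]


def missing_packages(names):
    """Return the top-level package names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def missing_programs(names):
    """Return the executables that are not found on PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    print("missing python packages:", missing_packages(OPTIONAL_PACKAGES))
    print("missing external programs:", missing_programs(EXTERNAL_PROGRAMS))
```

An empty list for both checks suggests that every action has what it needs; a non-empty list only matters if you plan to use the actions that depend on the missing item.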

A virtualenv containing a complete python execution environment can be created using dev/bootstrap.sh:

% dev/bootstrap.sh -h
Create a virtualenv and install all pipeline dependencies
Options:
--venv            - path of virtualenv [bioy-env]
--python          - path to the python interpreter [/usr/local/bin/python]
--wheelstreet     - path to directory containing python wheels; wheel files will be
                    in a subdirectory named according to the python interpreter version;
                    uses WHEELSTREET if defined.
                    (a suggested location is ~/wheelstreet) []
--requirements    - a file listing python packages to install [requirements.txt]

The bioy script provides the user interface, and uses standard UNIX command line syntax. Note that for development, it is convenient to run bioy from within the project directory by specifying the relative path to the script:

% ./bioy

Commands are constructed as follows. Every command starts with the name of the script, followed by an "action", followed by a series of required or optional "arguments". The name of the script, the action, and the options and their arguments are entered on the command line separated by spaces. Help text is available for both the bioy script and individual actions using the -h or --help options:

usage: bioy [-h] [-V] [-v] [-q]

            {help,align_clusters,all_pairwise,blast,children,classifier,classify,cmscores,consensus,csv2fasta,csv2hdf5,csvmod,dedup,denoise,errors,fasta,fasta2csv,fastq_stats,gb2fa,index,map_clusters,primer_trim,pull_reads,repl,reshape,reverse_complement,rldecode,rlencode,split_barcodes,split_reads,ssearch,ssearch2csv,ssearch_count,tree_edit,tsv2csv,usearch}
            ...

Tools for microbial sequence analysis and classification.

positional arguments:
  {help,align_clusters,all_pairwise,blast,children,classifier,classify,cmscores,consensus,csv2fasta,csv2hdf5,csvmod,dedup,denoise,errors,fasta,fasta2csv,fastq_stats,gb2fa,index,map_clusters,primer_trim,pull_reads,repl,reshape,reverse_complement,rldecode,rlencode,split_barcodes,split_reads,ssearch,ssearch2csv,ssearch_count,tree_edit,tsv2csv,usearch}
    help                Detailed help for actions using `help <action>`
    align_clusters      Align reads contributing to a denoised cluster.
    all_pairwise        Calculate all Smith-Waterman pairwise distances among
                        sequences.
    blast               Run blastn and produce classify friendly output
    children            Return the children of a taxtable given a list of
                        taxids
    classifier          Classify sequences by grouping blast output by
                        matching taxonomic names
    classify            Classify sequences by grouping blast output by
                        matching taxonomic names
    cmscores            Convert raw cmalign alignment scores to csv format.
    consensus           Calculate the consensus for a multiple aignment
    csv2fasta           Turn a csv file into a fasta file specifying two
                        columns
    csv2hdf5            Convert a csv file to HDF5
    csvmod              Add or rename columns in a csv file.
    dedup               Fast deduplicate sequences by coalescing identical
                        substrings
    denoise             Denoise a fasta file of clustered sequences
    errors              Tally and classify errors given ./ion rlaligns
                        reference and query sequences
    fasta               Run the fasta pairwise aligment tool and output in csv
                        format.
    fasta2csv           Turn a fasta file into a csv
    fastq_stats         Describe distributions of sequencing quality scores
    gb2fa               Outputs a standard Genbank Record File into fasta file
                        format and optional seqinfo file in format ['seqname',
                        'tax_id','accession','description','length','ambig_cou
                        nt','is_type','rdp_lineage']
    index               Add simple indices to an sqlite database
    map_clusters        Create a readmap and specimenmap and/or weights file
                        from a
    ncbi_fetch          Fetch sequences from NCBI's nucleotide database using
                        sequence identifiers (gi or gb)
    primer_trim         Parse region between primers from fasta file
    pull_reads          Parse barcode, primer, and read from a fastq file
    repl                Replace strings in one or more files.
    reshape             convert a tsv file to a csv with an optional split/add
                        columns feature
    reverse_complement  reverse complement rle and non-rle sequences
    rldecode            Run-length decode a fasta file
    rlencode            Run-length encode a fasta file
    split_barcodes      Partition reads in a fastq file by barcode and write
                        an annotated fasta file
    split_reads         Parse reads from a fasta file by read to specimen csv
                        map file
    ssearch             Run the ssearch (Smith-Waterman) pairwise aligment
                        tool and output in csv format.
    ssearch2csv         Parse ssearch36 -m10 output and print specified
                        contents
    ssearch_count       Tally ssearch base count by position
    tree_edit           Tree leaf name editor that wraps BioPython.
    tsv2csv             convert a tsv file to a csv with an optional split/add
                        columns feature
    usearch             Run usearch global and produce classify friendly
                        output

optional arguments:
  -h, --help            show this help message and exit
  -V, --version         Print the version number and exit
  -v, --verbose         Increase verbosity of screen output (eg, -v is
                        verbose, -vv more so)
  -q, --quiet           Suppress output
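The script/action/arguments structure described above maps naturally onto argparse subparsers. The sketch below is an illustration of that pattern only, not bioy's actual source: the action name rlencode is taken from the help text, but its arguments (a positional fasta and an --out option) are hypothetical.

```python
import argparse


def build_parser():
    """Build a parser with global options and one subparser per action."""
    parser = argparse.ArgumentParser(
        prog="bioy",
        description="Tools for microbial sequence analysis and classification.")
    # Global options come before the action on the command line.
    parser.add_argument("-v", "--verbose", action="count", default=0,
                        help="Increase verbosity of screen output")
    parser.add_argument("-q", "--quiet", action="store_true",
                        help="Suppress output")

    # Each action is registered as its own subparser.
    subparsers = parser.add_subparsers(dest="action")

    # Hypothetical arguments for one action, for illustration only.
    rlencode = subparsers.add_parser(
        "rlencode", help="Run-length encode a fasta file")
    rlencode.add_argument("fasta", help="input fasta file (hypothetical)")
    rlencode.add_argument("--out", default="-",
                          help="output file (hypothetical)")
    return parser


if __name__ == "__main__":
    # Equivalent to: bioy -vv rlencode reads.fasta --out reads.rle.fasta
    args = build_parser().parse_args(
        ["-vv", "rlencode", "reads.fasta", "--out", "reads.rle.fasta"])
    print(args.action, args.verbose, args.fasta, args.out)
```

With this layout, each subparser generates its own help text, which is one way to get per-action -h output like that described above.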

We use abbreviated git SHA hashes to identify the software version:

% ./bioy --version
0128.9790c13

The version information is saved in bioy_pkg/data when setup.py is run (on installation, or even by executing python setup.py -h).
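As a rough sketch of how a version string of this shape could be derived from git, the code below joins a zero-padded number and an abbreviated SHA. This is an assumption, not bioy's setup.py: in particular, reading the leading 0128 as a commit count is a guess, and the function names are hypothetical.

```python
import subprocess


def format_version(commit_count, short_sha):
    """Join a zero-padded commit count and an abbreviated SHA with a dot."""
    return "{:04d}.{}".format(commit_count, short_sha)


def git_version(repo_dir="."):
    """Derive a version string from the git history in repo_dir.

    Assumes git is installed and repo_dir is inside a git work tree.
    """
    def git(*args):
        out = subprocess.check_output(("git",) + args, cwd=repo_dir)
        return out.decode("utf-8").strip()

    count = int(git("rev-list", "--count", "HEAD"))  # commits reachable from HEAD
    sha = git("rev-parse", "--short", "HEAD")        # abbreviated SHA of HEAD
    return format_version(count, sha)


if __name__ == "__main__":
    print(format_version(128, "9790c13"))  # prints 0128.9790c13
```

Writing the result to a data file at build time, as described above, means the installed package can report its version without needing a git checkout.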

Unit tests are implemented using the unittest module in the Python standard library. The tests subdirectory is itself a Python package that imports the local version of the package (i.e., the version in the project directory, not the version installed to the system). All unit tests can be run like this:

% ./testall
...........
----------------------------------------------------------------------
Ran 11 tests in 0.059s

OK

A single unit test can be run by referring to a specific module, class, or method within the tests package using dot notation:

% ./testone -v tests.test_utils
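How ./testone resolves these names is not shown here, but the standard library's test loader supports the same dot notation. The sketch below uses an in-memory stand-in module and a stand-in test case (both hypothetical) so that it is self-contained; with the real package, the name would be something like tests.test_utils.

```python
import types
import unittest


class TestUtils(unittest.TestCase):
    """Stand-in test case; the real tests live in the tests package."""

    def test_upper(self):
        self.assertEqual("acgt".upper(), "ACGT")

    def test_length(self):
        self.assertEqual(len("acgt"), 4)


def run_by_name(name, module):
    """Load the tests addressed by a dotted name relative to module, then run them."""
    suite = unittest.defaultTestLoader.loadTestsFromName(name, module)
    runner = unittest.TextTestRunner(verbosity=2)
    return runner.run(suite)


# In-memory stand-in for a test module, so the sketch needs no files on disk.
test_utils = types.ModuleType("test_utils")
test_utils.TestUtils = TestUtils

if __name__ == "__main__":
    run_by_name("TestUtils", test_utils)             # every test in the class
    run_by_name("TestUtils.test_upper", test_utils)  # a single test method
```

The same loader accepts a fully qualified name when no module argument is given, in which case it imports the module before resolving the class or method.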

To build the Sphinx docs:

(cd docs && make html)

And to publish to GitHub pages:

ghp-import -p docs/_build/html

(Both ghp-import and Sphinx are listed in requirements.txt.)

Copyright (c) 2012 Noah Hoffman

Released under the GPLv3 License
