Skip to content

Commit

Permalink
Add the ability to manually set base frequences
Browse files Browse the repository at this point in the history
This makes it possible to specify empirical base frequences and to override base frequences in MrBayes' output.
  • Loading branch information
jmenglund committed Jul 11, 2019
1 parent 2c20912 commit afd9ee2
Show file tree
Hide file tree
Showing 5 changed files with 88 additions and 44 deletions.
18 changes: 14 additions & 4 deletions CHANGELOG.rst
Original file line number Diff line number Diff line change
Expand Up @@ -5,20 +5,30 @@ All notable changes to this project will be documented in this file.
This project adheres to `Semantic Versioning <http://semver.org/>`_.


v0.7.0 - 2019-07-11
-------------------

* Added flags ``-f`` and ``--freqs`` to manually set base frequences.
As a consequence, the flags ``-o`` and ``--out-format`` have to be
used when specifying the output format.

`View commits <https://github.com/jmenglund/predsim/compare/v0.6.0...v0.7.0>`_


v0.6.0 - 2019-05-05
-------------------

Added
~~~~~

* Ability to write used trees to a file with the ``--trees-file`` option.
* Ability to write used trees to a file with the ``--trees-file`` flag.
* Simple test data for various substitution models.


Changed
~~~~~~~

* Command-line option ``-o`` and ``--output-format`` replaced with ``-f`` and
* Command-line flags ``-o`` and ``--output-format`` replaced with ``-f`` and
``--format``, respectively.
* The file with seed numbers may now contain more numbers than being used.
* Content is written to output after completing each simulation (instead
Expand Down Expand Up @@ -66,8 +76,8 @@ v0.4.0 - 2019-03-09
Added
~~~~~

* The command-line options ``-o`` and ``--output-format`` now allow
writing output to either the PHYLIP or the NEXUS format.
* The command-line flags ``-o`` and ``--output-format`` now allow
writing output to either Phylip or Nexus format.


Changed
Expand Down
20 changes: 12 additions & 8 deletions README.rst
Original file line number Diff line number Diff line change
Expand Up @@ -4,7 +4,7 @@ predsim
|Build-Status| |Coverage-Status| |PyPI-Status| |License| |DOI-URI|

**predsim** is a simple command-line tool for simulating predictive
datasets from `MrBayes <http://mrbayes.sourceforge.net>`_ output files.
datasets from `MrBayes' <http://mrbayes.sourceforge.net>`_ output files.
Datasets can be simulated under the GTR+G+I substitution model or any nested
variant available in MrBayes (JC69, HKY85 etc.). The code is contained
within a single module that can be imported using Python's import mechanism.
Expand Down Expand Up @@ -65,9 +65,9 @@ Usage
.. code-block::
$ predsim --help
usage: predsim [-h] [-V] [-l N] [-g N] [-s N] [-n N] [-f {nexus,phylip}]
[-p FILE] [--seeds-file FILE] [--commands-file FILE]
[--trees-file FILE]
usage: predsim [-h] [-V] [-l N] [-f #A #C #G #T] [-g N] [-s N] [-n N]
[-o {nexus,phylip}] [-p FILE] [--seeds-file FILE]
[--commands-file FILE] [--trees-file FILE]
pfile tfile
A command-line utility that reads posterior output of MrBayes and simulates
Expand All @@ -81,12 +81,15 @@ Usage
-h, --help show this help message and exit
-V, --version show program's version number and exit
-l N, --length N sequence lenght (default: 1000)
-f #A #C #G #T, --freqs #A #C #G #T
base frequences (overrides any base frequences in
MrBayes' output)
-g N, --gamma-cats N number of gamma rate categories (default: continuous)
-s N, --skip N number of records (trees) to skip at the beginning of
the sample (default: 0)
-n N, --num-records N
number of records (trees) to use in the simulation
-f {nexus,phylip}, --format {nexus,phylip}
-o {nexus,phylip}, --out-format {nexus,phylip}
output format (default: "nexus")
-p FILE, --seqgen-path FILE
path to a Seq-Gen executable (default: "seq-gen")
Expand All @@ -95,9 +98,10 @@ Usage
--commands-file FILE path to output file with commands used by Seq-Gen
--trees-file FILE path to output file with trees used by Seq-Gen
* It is recommended that you use the ``--commands-file`` and the ``--trees-file``
options to check the input given to Seq-Gen.
* If base frequences are missing from MrBayes' output, these must be set manually
with the ``-f`` (or ``--freqs``) flag.
* It is recommended that you use the ``--commands-file`` and ``--trees-file``
flags to check the input given to Seq-Gen.


Running the tests
Expand Down
56 changes: 34 additions & 22 deletions predsim.py
Original file line number Diff line number Diff line change
Expand Up @@ -21,7 +21,7 @@

__author__ = 'Markus Englund'
__license__ = 'MIT'
__version__ = '0.6.0'
__version__ = '0.7.0'


def main(args=None):
Expand All @@ -41,7 +41,7 @@ def main(args=None):

result_iterator = iter_seqgen_results(
simulation_input, seq_len=parser.length, gamma_cats=parser.gamma_cats,
seqgen_path=parser.sg_filepath)
basefreqs=parser.basefreqs, seqgen_path=parser.sg_filepath)

if parser.out_format == 'nexus':
schema_kwargs = {'schema': 'nexus', 'simple': False}
Expand Down Expand Up @@ -75,6 +75,12 @@ def parse_args(args):
parser.add_argument(
'-l', '--length', action='store', default=1000, type=int,
help='sequence lenght (default: 1000)', metavar='N', dest='length')
parser.add_argument(
'-f', '--freqs', action='store', type=float, nargs=4,
help=(
'base frequences (overrides any base frequences '
'in MrBayes\' output)'),
metavar=('#A', '#C', '#G', '#T'), dest='basefreqs')
parser.add_argument(
'-g', '--gamma-cats', action='store', type=int,
help='number of gamma rate categories (default: continuous)',
Expand All @@ -88,7 +94,7 @@ def parse_args(args):
help='number of records (trees) to use in the simulation',
metavar='N', dest='num_records')
parser.add_argument(
'-f', '--format', default='nexus', choices=['nexus', 'phylip'],
'-o', '--out-format', default='nexus', choices=['nexus', 'phylip'],
help='output format (default: "nexus")', dest='out_format')
parser.add_argument(
'-p', '--seqgen-path', default='seq-gen', type=str,
Expand All @@ -111,7 +117,7 @@ def parse_args(args):
help='path to a MrBayes p-file', metavar='pfile')
parser.add_argument(
'tfile_path', action=StoreExpandedPath, type=is_file,
help='path to a MrBayes t-file', metavar='tfile', )
help='path to a MrBayes t-file', metavar='tfile')

return parser.parse_args(args)

Expand Down Expand Up @@ -200,35 +206,41 @@ def kappa_to_titv(kappa, piA, piC, piG, piT):
return titv


def get_seqgen_params(mrbayes_params):
def get_seqgen_params(mrbayes_params, basefreqs=None):
"""
Adapt MrBayes parameter values for use with Seq-Gen.
Paramters
---------
mrbayes_prams : dict
mrbayes_params : dict
Parameter values from a single row in a MrBayes p-file.
basefreqs : list of floats
Frequences for the four nucleotides A, C, G, and T to use
if missing from MrBayes output.
Returns
-------
seqgen_params : dict
"""
seqgen_params = {}
try:
seqgen_params['state_freqs'] = (
str(mrbayes_params['pi(A)']) + ',' +
str(mrbayes_params['pi(C)']) + ',' +
str(mrbayes_params['pi(G)']) + ',' +
str(mrbayes_params['pi(T)']))
except KeyError:
seqgen_params['state_freqs'] = '0.25,0.25,0.25,0.25'

if basefreqs is None:
try:
basefreqs = [
float(mrbayes_params['pi(A)']),
float(mrbayes_params['pi(C)']),
float(mrbayes_params['pi(G)']),
float(mrbayes_params['pi(T)'])]
except KeyError as exc:
msg = (
'Base frequences must be provided since they '
'are not present in MrBayes\' output.')
raise KeyError(msg) from exc

seqgen_params = {'state_freqs': ','.join([str(v) for v in basefreqs])}

try:
seqgen_params['ti_tv'] = kappa_to_titv(
float(mrbayes_params['kappa']),
float(mrbayes_params['pi(A)']),
float(mrbayes_params['pi(C)']),
float(mrbayes_params['pi(G)']),
float(mrbayes_params['pi(T)']))
float(mrbayes_params['kappa']), *basefreqs)
except KeyError:
pass
try:
Expand Down Expand Up @@ -267,11 +279,11 @@ def combine_simulation_input(tree_list, p_dicts, rng_seeds=None):


def iter_seqgen_results(
simulation_input, seq_len=1000, gamma_cats=None,
simulation_input, seq_len=1000, gamma_cats=None, basefreqs=None,
seqgen_path='seq-gen'):
"""Iterate over multiple simulations."""
for tree, p_dict, rng_seed in simulation_input:
seqgen_params = get_seqgen_params(p_dict)
seqgen_params = get_seqgen_params(p_dict, basefreqs=basefreqs)
result = simulate_matrix(
tree, seq_len=seq_len, rng_seed=rng_seed,
seqgen_path=seqgen_path, **seqgen_params)
Expand Down
2 changes: 1 addition & 1 deletion setup.py
Original file line number Diff line number Diff line change
Expand Up @@ -8,7 +8,7 @@

setup(
name='predsim',
version='0.6.0',
version='0.7.0',
description=(
'Command-line tool for simulating predictive datasets '
'from MrBayes\' output.'),
Expand Down
36 changes: 27 additions & 9 deletions test_predsim.py
Original file line number Diff line number Diff line change
Expand Up @@ -118,7 +118,8 @@ def test_equal_basefreqs(self):
assert get_seqgen_params(self.d1) == self.d2

def test_missing_basefreq(self):
assert get_seqgen_params(self.d3) == self.d2
with pytest.raises(KeyError):
get_seqgen_params(self.d3)


@seqgen_required
Expand Down Expand Up @@ -253,14 +254,17 @@ def test_parser(self):
seeds_filepath = get_testfile_path('seeds_1.txt')

parser = parse_args([
'-l', '2', '-s', '1', '-g', '5', '-n', '1', '-f', 'phylip',
'-p', 'sg', '--seeds-file', seeds_filepath,
'-l', '2', '-s', '1',
'-g', '5', '--freqs', '0.25', '0.25', '0.25', '0.25',
'-n', '1', '--out-format', 'phylip', '-p', 'sg',
'--seeds-file', seeds_filepath,
'--commands-file', self.commands_fo.name,
'--trees-file', self.trees_fo.name,
pfile_path, tfile_path])
assert parser.length == 2
assert parser.skip == 1
assert parser.gamma_cats == 5
assert parser.basefreqs == [0.25, 0.25, 0.25, 0.25]
assert parser.num_records == 1
assert parser.out_format == 'phylip'
assert parser.sg_filepath == 'sg'
Expand Down Expand Up @@ -338,7 +342,7 @@ def test_hky_outfile(self, outfile_arg):

def test_hky_phylip(self, capsys):
main([
'-l', '2', '-s', '2', '-f', 'phylip',
'-l', '2', '-s', '2', '--out-format', 'phylip',
'--seeds-file', get_testfile_path('seeds_1.txt'),
get_testfile_path('hky.p'),
get_testfile_path('hky.t')])
Expand All @@ -351,16 +355,30 @@ def test_hky_phylip(self, capsys):
'jc',
'jc_gamma',
'jc_propinvar',
'jc_invgamma',
'gtr'])
def test_subst_model(self, capsys, model_string):
'jc_invgamma'])
def test_jc(self, capsys, model_string):
pfile_path, tfile_path, expected_path = get_model_testfile_paths(
model_string, num_records=1, ext='.nex')
with open(expected_path, 'r') as fo:
expected_out = fo.read()
main([
'-l', '2', '-n', '1', '--seeds-file',
get_testfile_path('seeds_1.txt'), pfile_path, tfile_path])
'-l', '2', '-n', '1',
'--freqs', '0.25', '0.25', '0.25', '0.25',
'--seeds-file', get_testfile_path('seeds_1.txt'),
pfile_path, tfile_path])
out, err = capsys.readouterr()
assert out == expected_out
assert err == ''

def test_gtr(self, capsys):
pfile_path, tfile_path, expected_path = get_model_testfile_paths(
'gtr', num_records=1, ext='.nex')
with open(expected_path, 'r') as fo:
expected_out = fo.read()
main([
'-l', '2', '-n', '1',
'--seeds-file', get_testfile_path('seeds_1.txt'),
pfile_path, tfile_path])
out, err = capsys.readouterr()
assert out == expected_out
assert err == ''

0 comments on commit afd9ee2

Please sign in to comment.