tpms_mkdb

TPMS MKDB

INTRODUCTION

This help file explains how to create your own database, on which you can then run pattern-matching queries.

REQUIREMENTS

You need:

(#1) One species tree (filename example: speciesTree):
- one file,
- newick format,
- the tree does not have to be resolved (bifurcating),
- species names as leaves, taxon names as internal nodes,
- names enclosed in double quotes ("),
- any special character is allowed in names except the double quote, but it is safer to use only alphanumeric characters and spaces.
(#2) One directory containing all gene family trees (dirname example: familiesDir/):
- one file per family tree,
- sequence names as leaves,
- trees must be resolved (bifurcating),
- names are not quoted.
(#3) One file listing the [sequence names] -> [species] associations (filename example: seqNames2species; the assistant described below can help you build this file):
- one association per line,
- each line looks like
  
  seqName12_abc4:MY SPECIE
- the species name must exist in the species tree,
- one sequence name can be associated with only one species (but several sequences can refer to the same species),
- all sequence names of each family tree must exist in this list, otherwise the family tree is rejected.

Be careful: all sequence and species names are case-sensitive.
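The rules for files (#1) and (#3) can be checked before building the database. The following is a minimal Python sketch and not part of tpms_mkdb: the function names are hypothetical, and it assumes that whitespace around names is not significant and that a quoted name is a leaf when it directly follows "(" or ",".

```python
import re


def species_leaves(newick: str) -> set[str]:
    """Return the quoted names that appear as leaves of the species tree."""
    leaves = set()
    for match in re.finditer(r'"([^"]*)"', newick):
        before = newick[:match.start()].rstrip()
        # A label after '(' or ',' is a leaf; a label after ')' is an internal node.
        if not before or before[-1] in "(,":
            leaves.add(match.group(1))
    return leaves


def parse_associations(text: str) -> dict[str, str]:
    """Parse 'seqName:SPECIES' lines; reject malformed or duplicated entries."""
    assoc = {}
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue
        seq, sep, species = line.partition(":")
        seq, species = seq.strip(), species.strip()  # assumption: trim whitespace
        if not sep or not seq or not species:
            raise ValueError(f"line {number}: expected 'seqName:SPECIES'")
        if seq in assoc:
            raise ValueError(f"line {number}: '{seq}' is listed more than once")
        assoc[seq] = species
    return assoc


def unknown_species(assoc: dict[str, str], leaves: set[str]) -> list[str]:
    """List the sequences whose species is not a leaf of the species tree."""
    return sorted(seq for seq, species in assoc.items() if species not in leaves)


example_tree = '(("Homo sapiens","Pan troglodytes")"Hominini","Mus musculus")"Euarchontoglires";'
example_assoc = parse_associations("seqName12_abc4:Homo sapiens\notherSeq54:Mus Musculus\n")
# The comparison is case-sensitive, so "Mus Musculus" does not match "Mus musculus".
print(unknown_species(example_assoc, species_leaves(example_tree)))  # ['otherSeq54']
```

The comparison is deliberately exact, so a difference in case is reported as an unknown species.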

ASSISTANT TO BUILD FILE (#3): [sequence names] -> [species] associations

At this stage, the only requirement is the directory (#2) containing all your family trees. Assume this directory is called "familiesDir/". Check that it contains only newick trees, then execute the following command:

build/tpms_mkdb -extract-seqnames -families-trees=familiesDir/ -output=seqNames2species

You will get a new file in your working directory: seqNames2species (file #3). Its first column has already been filled in, like this:

  seqName12_abc4:
  seqName12_xyg8:
  otherSeq54:

It is up to you to complete it so that it looks like this:

  seqName12_abc4:SPECIE1
  seqName12_xyg8:SPECIE1
  otherSeq54:OTHER SPECIE
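When the species can be derived from the sequence name, the second column can be filled in by a script rather than by hand. The sketch below is only an illustration in Python: the function name and the rule it applies (the part of the name before the first underscore identifies the species) are assumptions that may not hold for your data.

```python
def fill_species(skeleton: str, prefix_to_species: dict[str, str]) -> str:
    """Complete 'seqName:' lines using a prefix -> species table.

    Lines that already have a species, or whose prefix is not in the
    table, are left unchanged so they can be completed by hand.
    """
    completed = []
    for line in skeleton.splitlines():
        seq, sep, species = line.partition(":")
        if sep and not species.strip():
            # Hypothetical naming rule: the prefix before '_' gives the species.
            prefix = seq.split("_", 1)[0]
            species = prefix_to_species.get(prefix, "")
        completed.append(f"{seq}{sep}{species}")
    return "\n".join(completed) + "\n"


skeleton = "seqName12_abc4:\nseqName12_xyg8:\notherSeq54:\n"
table = {"seqName12": "SPECIE1", "otherSeq54": "OTHER SPECIE"}
print(fill_species(skeleton, table), end="")
```

Entries left empty by the script still have to be completed manually before the file can be used.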

Experimental: automatically guessing species names using a database

If your sequence names match accession numbers in a sequence database (e.g. GenBank), you can use this command instead of the previous one:

build/tpms_mkdb -extract-seqnames -families-trees=familiesDir/ -output=seqNames2species -guess-from-db=bankName

This command uses the Remote Access ACNUC system, developed at PBIL (http://pbil.univ-lyon1.fr/databases/acnuc/acnuc.html), so you must use one of the bank names listed on this web page: http://pbil.univ-lyon1.fr/databases/acnuc/banques_raa.php

BUILDING THE DB FILE

We now need the three files described above: (#1), (#2), and (#3).

Use the command:

build/tpms_mkdb -families-trees=familiesDir/ -seq-to-species=seqNames2species -sp-tree=speciesTree -output=myDB

Arguments

-families-trees=<directory>

the directory (#2) containing the family tree files

-seq-to-species=<file>

the file (#3) listing the [sequence names] -> [species] associations

-sp-tree=<newick file>

the file (#1) containing the tree of the species

-output=<RAP file>

the new file that will be created. It will contain the new DB.
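Because a family tree is rejected when one of its sequence names is missing from file (#3), it can be useful to list such families before running the build. This is a minimal Python sketch with hypothetical function names, not a description of how tpms_mkdb parses trees. It assumes that leaf labels are unquoted names directly following "(" or "," and that the trees carry no bracketed comments.

```python
import re
from pathlib import Path

# An unquoted label directly after '(' or ',' is taken to be a leaf name.
LEAF = re.compile(r"[(,]\s*([^\s(),:;]+)")


def family_leaves(newick: str) -> set[str]:
    """Return the sequence names found at the leaves of one family tree."""
    return set(LEAF.findall(newick))


def rejected_families(families_dir: str, known_sequences: set[str]) -> dict[str, list[str]]:
    """Map each family file to the sequence names missing from file (#3)."""
    problems = {}
    for path in sorted(Path(families_dir).iterdir()):
        if not path.is_file():
            continue
        missing = family_leaves(path.read_text()) - known_sequences
        if missing:
            problems[path.name] = sorted(missing)
    return problems
```

An empty result means that every leaf of every family tree has an entry in the association file. It does not guarantee that the trees are otherwise valid.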
