tpms_mkdb
This help file explains how to create your own database. You will then be able to run pattern-matching queries on it.
You need:
- (#1) One species tree (filename example: speciesTree):
  - one file,
  - Newick format,
  - the tree does not have to be resolved (bifurcated),
  - species names as leaves, taxon names as internal nodes,
  - names enclosed in double quotes ("),
  - any special character is allowed except double quotes, but it is safer to use only alphanumeric characters and spaces.
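As an illustration of the naming rules above, here is a minimal Python sketch that lists the quoted names of a species tree and flags those that use anything other than letters, digits and spaces. The tree string and the function names are hypothetical and are not part of tpms_mkdb.

```python
import re

# Hypothetical species tree: quoted names, species at the leaves,
# taxa at the internal nodes, and one unresolved (three-child) node.
SPECIES_TREE = (
    '(("Homo sapiens","Pan troglodytes","Gorilla gorilla")"Homininae",'
    '"Mus musculus")"Euarchontoglires";'
)


def quoted_labels(newick):
    """Return every double-quoted name found in a Newick string, in order."""
    return re.findall(r'"([^"]*)"', newick)


def risky_labels(labels):
    """Return the names that use characters other than ASCII letters, digits and spaces."""
    return [name for name in labels if not re.fullmatch(r"[A-Za-z0-9 ]+", name)]


labels = quoted_labels(SPECIES_TREE)
print(labels)
print(risky_labels(labels))
```

Names reported by `risky_labels` are still allowed by the rules above; the check only points out where the "safer" recommendation is not followed.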
- (#2) One directory containing all gene family trees (dirname example: familiesDir/):
  - one file per family tree,
  - sequence names as leaves,
  - trees must be resolved (bifurcated),
  - names are not quoted.
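Before building the database, it can help to check that a family tree is fully resolved. The sketch below assumes that "resolved" means every internal node, including the root, has exactly two children, and that the tree contains no bracketed Newick comments. Whether tpms_mkdb applies exactly this test is not stated here, and `is_bifurcating` is a hypothetical helper.

```python
def is_bifurcating(newick):
    """Return True if every parenthesised group has exactly two children."""
    commas = []  # one comma counter per currently open "(" group
    for char in newick:
        if char == "(":
            commas.append(0)
        elif char == ",":
            if not commas:
                return False  # comma outside any group: malformed tree
            commas[-1] += 1
        elif char == ")":
            # exactly one comma in a group means exactly two children
            if not commas or commas.pop() != 1:
                return False
    return not commas  # every opened group must have been closed


print(is_bifurcating("((seqA:0.1,seqB:0.2):0.05,seqC:0.3);"))
print(is_bifurcating("(seqA,seqB,seqC);"))
```

Because family tree names are not quoted, commas and parentheses can only be structural characters, which is what keeps this character-by-character scan simple.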
- (#3) One file listing [sequence name] -> [species] associations (filename example: seqNames2species). You can use the assistant described below to build this file easily.
  - one association per line,
  - each line looks like
    seqName12_abc4:MY SPECIE
  - the species name must exist in the species tree,
  - one sequence name can be associated with only one species (but several sequences can refer to the same species),
  - all sequence names of each family tree must exist in this list, otherwise the family tree is rejected,
  - be careful: all sequence and species names are case sensitive.
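The rules for the association file can be checked ahead of time with a short script. The following Python sketch is an illustration only: the function names are hypothetical, it assumes that sequence names contain no colon, and it skips blank lines and tolerates exact duplicate lines, which tpms_mkdb itself may or may not do.

```python
def parse_associations(lines):
    """Map each sequence name to its species; reject malformed or conflicting lines."""
    mapping = {}
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if not line:
            continue  # blank lines are skipped here (an assumption)
        # Split on the first colon: sequence names are assumed to contain none.
        name, sep, species = line.partition(":")
        if not sep or not name or not species:
            raise ValueError(f"line {number}: expected 'seqName:SPECIES', got {line!r}")
        if mapping.get(name, species) != species:
            raise ValueError(f"line {number}: {name!r} is already mapped to {mapping[name]!r}")
        mapping[name] = species
    return mapping


def unknown_species(mapping, species_in_tree):
    """Species used in the file but absent from the species tree (case sensitive)."""
    return sorted(set(mapping.values()) - set(species_in_tree))


def missing_sequences(family_leaves, mapping):
    """Leaves of a family tree that have no association (case sensitive)."""
    return [leaf for leaf in family_leaves if leaf not in mapping]


mapping = parse_associations(["seqName12_abc4:MY SPECIE\n", "otherSeq54:OTHER SPECIE\n"])
print(unknown_species(mapping, {"MY SPECIE"}))
print(missing_sequences(["seqName12_abc4", "SEQNAME12_ABC4"], mapping))
```

A family tree for which `missing_sequences` returns a non-empty list is one that would be rejected according to the rules above.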
#ASSISTANT TO BUILD FILE (#3): [sequences names] -> [species] associations
At this stage, the only requirement is to have the directory (#2) containing all your family trees. Assume this directory is called "familiesDir/". Check that it contains only Newick trees and run the following command:
build/tpms_mkdb -extract-seqnames -families-trees=familiesDir/ -output=seqNames2species
You will get a new file in your working directory: seqNames2species (file #3). Its first column is already filled in, like this:
    seqName12_abc4:
    seqName12_xyg8:
    otherSeq54:
It is up to you to complete it so that it looks like this:
    seqName12_abc4:SPECIE1
    seqName12_xyg8:SPECIE1
    otherSeq54:OTHER SPECIE
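If you already have the species of each sequence in some other form, the second column can be filled in by a script rather than by hand. This is a minimal sketch; `fill_skeleton` and the dictionary it takes are hypothetical, and lines whose sequence name is not in the dictionary are left unchanged so that they can be completed manually.

```python
def fill_skeleton(skeleton_lines, species_by_sequence):
    """Append the species to each 'seqName:' line; leave other lines untouched."""
    completed = []
    for raw in skeleton_lines:
        line = raw.rstrip("\n")
        name = line.rstrip(":")  # only strips the trailing colon of an unfilled line
        species = species_by_sequence.get(name)
        completed.append(f"{name}:{species}" if species else line)
    return completed


skeleton = ["seqName12_abc4:\n", "seqName12_xyg8:\n", "otherSeq54:\n"]
known = {"seqName12_abc4": "SPECIE1", "seqName12_xyg8": "SPECIE1"}
for line in fill_skeleton(skeleton, known):
    print(line)
```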
If your sequence names match accession numbers in sequence databases (e.g. GenBank), you can use this command instead of the previous one:
build/tpms_mkdb -extract-seqnames -families-trees=familiesDir/ -output=seqNames2species -guess-from-db=bankName
This command uses the Remote Access ACNUC system, developed at PBIL (http://pbil.univ-lyon1.fr/databases/acnuc/acnuc.html), and you must use one of the bank names listed on this web page: http://pbil.univ-lyon1.fr/databases/acnuc/banques_raa.php
We now need the three files described above: (#1), (#2), and (#3).
Use the command:
build/tpms_mkdb -families-trees=familiesDir/ -seq-to-species=seqNames2species -sp-tree=speciesTree -output=myDB
-families-trees=<directory>
the directory (#2) containing the families trees files
-seq-to-species=<file>
the file (#3) listing [sequences names] -> [species] associations
-sp-tree=<newick file>
the file (#1) containing the tree of the species
-output=<RAP file>
the new file that will be created. It will contain the new DB.
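When the database has to be rebuilt regularly, the command can be assembled from a script. The sketch below only uses the four options documented above; the helper names and the default binary path `build/tpms_mkdb` (taken from the examples in this file) are assumptions about your local setup.

```python
import subprocess


def mkdb_command(families_dir, seq_to_species, species_tree, output,
                 binary="build/tpms_mkdb"):
    """Build the argument list for the database creation command."""
    return [
        binary,
        f"-families-trees={families_dir}",
        f"-seq-to-species={seq_to_species}",
        f"-sp-tree={species_tree}",
        f"-output={output}",
    ]


def run_mkdb(*args, **kwargs):
    """Run the command and raise CalledProcessError on a non-zero exit status."""
    subprocess.run(mkdb_command(*args, **kwargs), check=True)


command = mkdb_command("familiesDir/", "seqNames2species", "speciesTree", "myDB")
print(" ".join(command))
```

Passing the arguments as a list avoids shell quoting problems if a path contains spaces. `run_mkdb` is not called here because it requires the compiled binary.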