tbigot edited this page Oct 16, 2012 · 5 revisions

TPMS MKDB

INTRODUCTION

This help file explains how to create your own database, on which you can then run pattern-matching queries.

REQUIREMENTS

You need:

  • (#1) One species tree (filename example: speciesTree):

    • one file,
    • newick format,
    • the tree does not have to be resolved (bifurcating),
    • species names as leaves, taxon names as internal nodes,
    • names enclosed in double quotes ("),
    • any special character is allowed in names except the double quote, but it is safer to use only alphanumeric characters and spaces.
  • (#2) One directory containing all the gene family trees (dirname example: familiesDir/):

    • one file per family tree,
    • sequence names as leaves,
    • trees must be resolved (bifurcating),
    • names are not quoted.
  • (#3) One file listing [sequences names] -> [species] associations (filename example: seqNames2species); the assistant described below can build most of this file for you:

    • one association per line

    • the line looks like

      seqName12_abc4:MY SPECIE

    • the species name must exist in the species tree,

    • a sequence name can be associated with only one species (but several sequences can refer to the same species),

    • every sequence name in each family tree must appear in this list; otherwise that family tree is rejected.

Be careful: all sequence and species names are case-sensitive.
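The rules above can be checked before building the database. The following is a minimal sketch, not part of TPMS: the function names are hypothetical, and it assumes that species names never contain double quotes, that sequence names contain no whitespace or Newick punctuation, and that family trees carry no quoted labels or comments. It extracts names with regular expressions rather than parsing Newick properly.

```python
import re

def species_names(species_newick):
    """Return every double-quoted name (leaf or internal node) in the species tree."""
    return set(re.findall(r'"([^"]*)"', species_newick))

def leaf_names(family_newick):
    """Return unquoted leaf names: a label that directly follows '(' or ','."""
    return re.findall(r"[(,]\s*([^():,;\s]+)", family_newick)

def parse_associations(text):
    """Parse 'seqName:SPECIES' lines into a dict, rejecting duplicated sequences."""
    mapping = {}
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        name, sep, species = line.partition(":")
        if not sep or not name or not species:
            raise ValueError(f"line {number}: expected seqName:SPECIES")
        if name in mapping:
            raise ValueError(f"line {number}: {name} is listed more than once")
        mapping[name] = species
    return mapping

def check_inputs(species_newick, family_newick, associations_text):
    """Return a list of human-readable problems (empty when nothing is wrong)."""
    known_species = species_names(species_newick)
    mapping = parse_associations(associations_text)
    problems = []
    # Comparison is exact, so a difference in case counts as a mismatch.
    for name, species in mapping.items():
        if species not in known_species:
            problems.append(f"{name}: species {species!r} is not in the species tree")
    for leaf in leaf_names(family_newick):
        if leaf not in mapping:
            problems.append(f"{leaf}: sequence has no species association")
    return problems
```

To use it, read the content of file (#1), of one family tree from directory (#2), and of file (#3), then pass the three strings to check_inputs; an empty list means this sketch found nothing that violates the rules listed above.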

ASSISTANT TO BUILD FILE (#3): [sequences names] -> [species] associations

At this stage, the only requirement is the directory (#2) containing all your family trees. Assume this directory is called "familiesDir/". Check that it contains only Newick trees, then run the following command:

build/tpms_mkdb -extract-seqnames -families-trees=familiesDir/ -output=seqNames2species

You will get a new file in your working directory: seqNames2species (file #3). Its first column is already filled in, like this:

      seqName12_abc4:
      seqName12_xyg8:
      otherSeq54:

It is up to you to complete it so that it looks like this:

      seqName12_abc4:SPECIE1
      seqName12_xyg8:SPECIE1
      otherSeq54:OTHER SPECIE
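If you already have the species of each sequence in some other form, completing the file by hand can be avoided. Here is a minimal sketch, assuming the generated file holds one "seqName:" entry per line and that you can supply a dictionary from sequence name to species; the function fill_species and that dictionary are hypothetical helpers, not part of TPMS.

```python
def fill_species(template_text, species_by_sequence):
    """Complete 'seqName:' lines with the species found in species_by_sequence.

    Lines that are already completed, or whose sequence is unknown,
    are kept unchanged so they can be fixed by hand afterwards.
    """
    completed = []
    for line in template_text.splitlines():
        name, sep, species = line.partition(":")
        if sep and not species and name in species_by_sequence:
            line = f"{name}:{species_by_sequence[name]}"
        completed.append(line + "\n")
    return "".join(completed)
```

Sequences missing from the dictionary keep their empty species column, so they remain easy to spot in the output.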

Experimental: auto guessing species name using a database

If your sequence names match accession numbers in a sequence database (e.g. GenBank), you can use this command instead of the previous one:

build/tpms_mkdb -extract-seqnames -families-trees=familiesDir/ -output=seqNames2species -guess-from-db=bankName

This command uses the Remote Access ACNUC system, developed at PBIL (http://pbil.univ-lyon1.fr/databases/acnuc/acnuc.html). You must use one of the bank names listed on this webpage: http://pbil.univ-lyon1.fr/databases/acnuc/banques_raa.php

BUILDING THE DB FILE

We now need the three files described above: (#1), (#2), and (#3).

Use the command:

build/tpms_mkdb -families-trees=familiesDir/ -seq-to-species=seqNames2species -sp-tree=speciesTree -output=myDB

Arguments

-families-trees=<directory>

the directory (#2) containing the family tree files

-seq-to-species=<associations file>

the file (#3) listing [sequences names] -> [species] associations

-sp-tree=<newick file>

the file (#1) containing the tree of the species

-output=<RAP file>

the new file that will be created. It will contain the new DB.
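When the build is scripted, it can help to confirm that the three inputs exist before calling the tool. The sketch below assembles the command from the four arguments listed above and runs it with Python's standard subprocess module. The function names are hypothetical, and the executable path build/tpms_mkdb is taken from the examples in this page and may differ on your system.

```python
import subprocess
from pathlib import Path

def mkdb_command(families_dir, seq_to_species, species_tree, output,
                 executable="build/tpms_mkdb"):
    """Return the argument list for the database build command."""
    return [
        executable,
        f"-families-trees={families_dir}",
        f"-seq-to-species={seq_to_species}",
        f"-sp-tree={species_tree}",
        f"-output={output}",
    ]

def build_database(families_dir, seq_to_species, species_tree, output):
    """Check that the inputs exist, then run tpms_mkdb and wait for it to finish."""
    if not Path(families_dir).is_dir():
        raise FileNotFoundError(f"families directory not found: {families_dir}")
    for path in (seq_to_species, species_tree):
        if not Path(path).is_file():
            raise FileNotFoundError(f"input file not found: {path}")
    command = mkdb_command(families_dir, seq_to_species, species_tree, output)
    # check=True raises CalledProcessError if the tool exits with an error code.
    return subprocess.run(command, check=True)
```

Calling build_database("familiesDir/", "seqNames2species", "speciesTree", "myDB") runs the same command as shown in this section.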