tpms_query
The program tpms_query is the binary you use to run queries against the database.
This help file explains how to launch tpms_query and gives details about the query language.
To perform a query, you need a database in the RAP format. You can create such a database with the tpms_mkdb command (see the help file “tpms_mkdb” for instructions). You also need the tpms_query binary (if you do not have it, see the README file).
If you have just built the program from the sources and have not installed it yet, remember that it is still in the "build/" directory of the tpms main directory.
Just type this command:
tpms_query -collection=<RAP Database> -output-dir=<output Directory> [-threads=<nbthreads>] [-synonyms=yes]
-collection=<RAP Database>
the file you generated at the previous step
-output-dir=<output Directory>
a directory in which result files will be saved. If you run tpms for the first time, just create the directory.
-threads=<nbthreads>
number of threads to use. On an SMP machine (several CPUs or cores in the same machine), you can set a large number here. Some operations may speed up roughly linearly with the thread count; e.g., -threads=64 could make some steps up to 64 times faster.
-synonyms=yes
allows the use of synonyms (RAP style). If activated, you must provide a species tree in which every node (leaves and root included) has a distance, even if it is fake or empty. E.g.: ((S1:0,S2:):0):;
The distance can be zero or empty.
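If you launch tpms_query from a script, the option list above can be assembled programmatically. The following is a minimal Python sketch, not part of tpms: the function name `build_tpms_query_command` and the file names in the example are hypothetical, and only the four options documented above are used.

```python
from typing import List, Optional


def build_tpms_query_command(
    collection: str,
    output_dir: str,
    threads: Optional[int] = None,
    synonyms: bool = False,
) -> List[str]:
    """Return the argument list for a tpms_query invocation (hypothetical helper)."""
    # The two mandatory options: the RAP database and the output directory.
    args = ["tpms_query", f"-collection={collection}", f"-output-dir={output_dir}"]
    # Optional: number of threads.
    if threads is not None:
        if threads < 1:
            raise ValueError("threads must be a positive integer")
        args.append(f"-threads={threads}")
    # Optional: RAP-style synonyms.
    if synonyms:
        args.append("-synonyms=yes")
    return args


if __name__ == "__main__":
    # "families.rap" and "results" are placeholder names.
    print(" ".join(build_tpms_query_command("families.rap", "results", threads=4)))
```

The returned list can be handed to `subprocess.run`; keep in mind that tpms_query then prompts for a query name on its standard input.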
If everything is OK, you should see lines like these:
Loading preamble... [DONE]
Loading species tree... [DONE]
Loading family trees:
|0% 100%|
[#################################################]
Load sucessful!
Species tree loaded: 5 species.
Loaded 19910 families / 19910 expected.
Trees are not reconciled.
Generating trees with species names as labels:
|0% 100%|
[#################################################]
You are now prompted to give a query name. This is the name of the file that will be created in the output directory. You must give one query name per query, so each file holds the result of one query.
Composition of a query string
The query language is based on the Newick format: text strings separated by commas and parentheses, ending with a semicolon. Each text string is written in three parts separated by slashes (/). The generic pattern of such a string is:
(PARAMETERS/SUBTREE CONSTRAINTS/LEAF CONSTRAINTS)
The last part (LEAF CONSTRAINTS) can only be used on leaves. The first two parts (PARAMETERS and SUBTREE CONSTRAINTS) are optional on a leaf and may be left empty, as in this example:
//LEAF CONSTRAINTS
All parts are optional on internal nodes.
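To make the three-part layout of a leaf string concrete, here is a small Python sketch that splits such a string into its parts. It is only an illustration: `NodeParts` and `split_leaf_string` are hypothetical names, and the sketch assumes that no part contains a slash itself.

```python
from typing import NamedTuple


class NodeParts(NamedTuple):
    parameters: str
    subtree_constraints: str
    leaf_constraints: str


def split_leaf_string(text: str) -> NodeParts:
    """Split 'PARAMETERS/SUBTREE CONSTRAINTS/LEAF CONSTRAINTS' into its three parts."""
    # Assumption: slashes are used only as separators, never inside a part.
    parts = text.split("/")
    if len(parts) != 3:
        raise ValueError(f"expected three slash-separated parts, got {len(parts)}")
    return NodeParts(*parts)


if __name__ == "__main__":
    # The first two parts are empty, as in the example above.
    print(split_leaf_string("//LEAF CONSTRAINTS"))
```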
It is advised that you prepare your queries in a text editor and paste the queries into the terminal window.
Example of a simple query
If you want to search the database for the following tree pattern:
+------------Species A
|
| +--------Species B
+---|
+--------Species C
your query will be:
(//Species A,(//Species B,//Species C));
This selects all the families in the database whose gene tree contains this pattern. This very simple pattern uses only the last part of the query string (LEAF CONSTRAINTS).
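If you generate many simple patterns, the query string above can be produced from a nested structure. This is a minimal Python sketch under the assumption that only LEAF CONSTRAINTS are used; the function `to_query` and the nested-tuple representation are hypothetical conveniences, not part of tpms.

```python
from typing import Sequence, Union

# A tree is either a leaf constraint (str) or a sequence of subtrees.
Tree = Union[str, Sequence["Tree"]]


def to_query(tree: Tree) -> str:
    """Render a nested tree as a query string using only LEAF CONSTRAINTS."""

    def render(node: Tree) -> str:
        if isinstance(node, str):
            # Leaf: PARAMETERS and SUBTREE CONSTRAINTS are left empty.
            return f"//{node}"
        # Internal node: children in parentheses, separated by commas.
        return "(" + ",".join(render(child) for child in node) + ")"

    # A query always ends with a semicolon.
    return render(tree) + ";"


if __name__ == "__main__":
    print(to_query(("Species A", ("Species B", "Species C"))))
    # prints (//Species A,(//Species B,//Species C));
```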
As seen above, you can require some species on the leaves. You can use a single species or build a list using plus (+) and minus (-) operators. For instance:
homo sapiens+pan troglodytes
means you want to allow homo sapiens AND pan troglodytes sequences.
mammalia-mus musculus-homo sapiens
means you want to allow all mammalia except mouse and human.
-bacteria
means you allow everything but bacteria.
-bacteria+gammaproteobacteria
means you allow everything except bacteria, while still allowing gammaproteobacteria.
All combinations are possible. You can use any label of your species tree (taxa on internal nodes and species on leaves).
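The list syntax above is a plain concatenation of signed labels. As an illustration, the following Python sketch composes such a list from (sign, label) pairs; `species_list` is a hypothetical helper, and it only formats the string without checking that the labels exist in your species tree.

```python
from typing import Iterable, Tuple


def species_list(terms: Iterable[Tuple[str, str]]) -> str:
    """Join (sign, label) pairs into a LEAF CONSTRAINTS list."""
    pieces = []
    for index, (sign, label) in enumerate(terms):
        if sign not in ("+", "-"):
            raise ValueError(f"sign must be '+' or '-', got {sign!r}")
        if index == 0 and sign == "+":
            # A leading plus is not written, as in the examples above.
            pieces.append(label)
        else:
            pieces.append(sign + label)
    return "".join(pieces)


if __name__ == "__main__":
    print(species_list([("+", "mammalia"), ("-", "mus musculus"), ("-", "homo sapiens")]))
    # prints mammalia-mus musculus-homo sapiens
    print(species_list([("-", "bacteria"), ("+", "gammaproteobacteria")]))
    # prints -bacteria+gammaproteobacteria
```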
You can set constraints on subtrees containing a searched leaf.
Say we want to require that one branch of this subtree (here, the one leading to Mus musculus) contains only genes from species that belong to a given taxonomic group (here, Mammalia).
+(Mammalia)-------- Mus Musculus
|
| +-------- Homo Sapiens
+---|
+-------- Pan Troglodytes
The corresponding query string is:
(/Mammalia/Mus Musculus,(//Homo sapiens,//Pan troglodytes));
To put the constraint on the subtree grouping Homo sapiens and Pan troglodytes instead:
+-------- Mus Musculus
|
| +-------- Homo Sapiens
+(Mammalia)---|
+-------- Pan Troglodytes
The corresponding query string is:
(//Mus Musculus,(//Homo sapiens,//Pan troglodytes)/Mammalia);
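The two query strings above differ only in where the SUBTREE CONSTRAINTS part is written: in the second field of a leaf, or after the closing parenthesis of an internal node. The Python sketch below reproduces both strings; the helpers `leaf` and `internal` are hypothetical, and the way the internal-node label is written is inferred from the examples above rather than from a formal grammar.

```python
from typing import Sequence


def leaf(species: str, subtree: str = "", parameters: str = "") -> str:
    """Leaf string: PARAMETERS/SUBTREE CONSTRAINTS/LEAF CONSTRAINTS."""
    return f"{parameters}/{subtree}/{species}"


def internal(children: Sequence[str], subtree: str = "", parameters: str = "") -> str:
    """Internal node: children in parentheses, followed by an optional label."""
    # Inferred from the examples: the label is PARAMETERS, then '/SUBTREE' if set.
    label = parameters + (f"/{subtree}" if subtree else "")
    return "(" + ",".join(children) + ")" + label


primates = [leaf("Homo sapiens"), leaf("Pan troglodytes")]

# Constraint on the branch leading to Mus Musculus.
on_leaf = internal([leaf("Mus Musculus", subtree="Mammalia"), internal(primates)]) + ";"

# Constraint on the subtree grouping Homo sapiens and Pan troglodytes.
on_clade = internal([leaf("Mus Musculus"), internal(primates, subtree="Mammalia")]) + ";"

if __name__ == "__main__":
    print(on_leaf)
    print(on_clade)
```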
You can also set some flags in the first part (PARAMETERS) of the query string.
If you use reconciled trees (reconciled with RAP), you can require a speciation with an S or a duplication with a D.
Use the exclamation mark (!) to require a direct link between nodes.
For instance, if you want the Homo sapiens and Pan troglodytes sequences to be siblings in your tree, both leaves must be linked directly to their parent node (the stars in this figure mark direct links).
+-------- Mus musculus
|
| +********* Homo sapiens
+-(Mammalia)--|
+********* Pan troglodytes
To query this tree, use:
(//Mus musculus,(!//Homo sapiens,!//Pan troglodytes)/Mammalia);
You can also require a minimum bootstrap value on a branch. For instance, if you want the branch leading to Homo sapiens and Pan troglodytes to be supported by a bootstrap value greater than or equal to 90, as in this figure:
+-------- Mus musculus
|
| +--------- Homo sapiens
+-------90%---|
+--------- Pan troglodytes
you can use:
(//Mus musculus,(//Homo sapiens,//Pan troglodytes)$90);
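The direct-link and bootstrap examples above can be rebuilt with a few string helpers. This Python sketch is only an illustration: `clade`, `direct` and `with_min_bootstrap` are hypothetical names, and the sketch does not attempt to combine several flags in one PARAMETERS field, since this help file does not describe how flags combine.

```python
def clade(*children: str) -> str:
    """Wrap child strings in parentheses, separated by commas."""
    return "(" + ",".join(children) + ")"


def direct(node: str) -> str:
    """Require a direct link to the parent by writing '!' as the PARAMETERS part.

    Only valid here for node strings whose PARAMETERS part is empty.
    """
    return "!" + node


def with_min_bootstrap(subtree: str, threshold: int) -> str:
    """Append '$<threshold>' to a parenthesised subtree."""
    return f"{subtree}${threshold}"


# Homo sapiens and Pan troglodytes as siblings inside a Mammalia subtree.
siblings = clade(direct("//Homo sapiens"), direct("//Pan troglodytes"))
direct_query = clade("//Mus musculus", siblings + "/Mammalia") + ";"

# Branch leading to Homo sapiens and Pan troglodytes supported by at least 90.
supported = with_min_bootstrap(clade("//Homo sapiens", "//Pan troglodytes"), 90)
bootstrap_query = clade("//Mus musculus", supported) + ";"

if __name__ == "__main__":
    print(direct_query)
    print(bootstrap_query)
```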