Incremental distance tree tools

Vyacheslav Brover edited this page Sep 28, 2021 · 21 revisions

The executables are in the directory $TT/phylogeny/.
The names of all incremental distance tree scripts start with distTree_inc.
The first parameter of these scripts is an incremental distance tree directory.

distTree_inc_init.sh

Create an incremental distance tree directory with parameter files and empty scripts.

distTree_inc_init_stnd.sh

Create an incremental distance tree directory with the standard parameter files and scripts for a biological project:

  • Genome
    • bacteria
    • fungi
    • Metazoa
    • Protists
    • Viridiplantae
  • rRNA
    • bacteria (prokaryotic 16S)
    • fungi
      • ITS
      • 18S
      • 28S
    • SSU (eukaryotic 18S)
    • 5.8S (eukaryotic)
  • virus
    • SARS-CoV-2

distTree_inc_complete.sh

Given an incremental distance tree directory and a list of objects, compute a complete pairwise dissimilarity matrix, store it in the Data Master file data.dm, and build a distance tree.

On completion, the Data Master format file data.dm is created.
This file contains a two-way attribute "dissim": the dissimilarity matrix computed by inc/request2dissim.sh.
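To make "complete pairwise dissimilarity matrix" concrete, here is a minimal Python sketch. It is an illustration only: the toy objects, the Hamming dissimilarity, and the function names are assumptions made for this example, and it reproduces neither inc/request2dissim.sh nor the Data Master file format.

```python
from itertools import combinations


def pair_count(n: int) -> int:
    """Number of unordered object pairs in a complete matrix of n objects."""
    return n * (n - 1) // 2


def hamming(a: str, b: str) -> int:
    """Toy dissimilarity (an assumption for this sketch): mismatching positions."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def complete_dissim_matrix(objects: dict[str, str]) -> dict[tuple[str, str], int]:
    """Compute the dissimilarity of every unordered pair of objects."""
    return {
        (a, b): hamming(objects[a], objects[b])
        for a, b in combinations(sorted(objects), 2)
    }


# Hypothetical objects; real projects would use genomes, rRNA sequences, etc.
toy = {"obj1": "ACGT", "obj2": "ACGA", "obj3": "TCGA"}
matrix = complete_dissim_matrix(toy)
for (a, b), d in matrix.items():
    print(a, b, d)
```

The number of pairs grows quadratically with the number of objects, which is why a complete matrix is practical only for a limited list of objects.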

calibrateDissims.sh

Test specific variance or dissimilarity parameters on a pairwise dissimilarity matrix stored in a Data Master file, e.g., data.dm produced by $TT/phylogeny/distTree_inc_complete.sh.

distTree_inc.sh

Incremental distance tree building.
Requires a computer with large memory.

For a tree with 200,000 objects, new objects are added at a rate of about 7,000 objects per day using 30 threads on a 3,500 MHz computer. Theoretically, the running time is O(n log^4 n) and the space is O(n log^3 n), where n is the number of objects.
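As a rough illustration of what these bounds imply, the following Python sketch computes how the bounding functions n log^4 n and n log^3 n grow when the number of objects doubles. It assumes that the bounds are tight and that constant factors do not change with n, which the asymptotic notation does not guarantee, so the result is an order-of-magnitude guide rather than a measurement.

```python
import math


def time_model(n: float) -> float:
    """Bounding function of the running time: n * log^4(n)."""
    return n * math.log(n) ** 4


def space_model(n: float) -> float:
    """Bounding function of the space: n * log^3(n)."""
    return n * math.log(n) ** 3


def growth(model, n: float, factor: float = 2.0) -> float:
    """Ratio model(factor * n) / model(n); constant factors cancel out."""
    return model(factor * n) / model(n)


n = 200_000  # tree size quoted in the text
time_growth = growth(time_model, n)
space_growth = growth(space_model, n)
print(f"doubling n multiplies the time bound by {time_growth:.2f}")
print(f"doubling n multiplies the space bound by {space_growth:.2f}")
```

Both ratios are only slightly above 2, so under these assumptions the cost grows close to linearly in the number of objects.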

On completion, the Data Master format file leaf_errors.dm is created.
This file contains two attributes defined on objects:

  • "leaf_error": normalized object criterion, which theoretically has a standard normal distribution;
  • "deformation": relative object deformation, which theoretically has the distribution of the maximum of n chi^2 variables with 1 degree of freedom, where n is the number of objects in the tree (e.g., 100 variables for a tree with 100 objects).

Large values of these attributes identify outlier objects.
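The theoretical distributions above suggest how a cutoff for "large" values could be chosen. The Python sketch below is one possible approach, not the method used by distTree_inc.sh: it assumes that the n chi^2 variables are independent, and the function names and the 0.95 level are chosen only for this example. It uses the fact that a chi^2 variable with 1 degree of freedom is the square of a standard normal variable.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def leaf_error_upper_tail(z: float) -> float:
    """P(Z > z) for a standard normal Z: how unusual a leaf_error of z is."""
    return 1.0 - _STD_NORMAL.cdf(z)


def deformation_quantile(n_objects: int, p: float) -> float:
    """p-quantile of the maximum of n_objects independent chi^2(1) variables.

    P(max <= x) = (2 * Phi(sqrt(x)) - 1) ** n, because chi^2(1) = Z ** 2.
    Independence is an assumption of this sketch.
    """
    if n_objects < 1 or not 0.0 < p < 1.0:
        raise ValueError("need n_objects >= 1 and 0 < p < 1")
    per_object = p ** (1.0 / n_objects)  # required CDF value of one variable
    z = _STD_NORMAL.inv_cdf((1.0 + per_object) / 2.0)
    return z * z


for n in (1, 100, 200_000):
    print(n, round(deformation_quantile(n, 0.95), 2))
print(leaf_error_upper_tail(3.0))
```

The cutoff for "deformation" rises with the number of objects, so a value that marks an outlier in a small tree may be unremarkable in a large one.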

distTree_inc_delete.sh

Delete a list of objects from the tree in an incremental distance tree directory.

distTree_inc_optimize.sh

Test specific variance or dissimilarity parameters on an existing incremental distance tree directory without changing it.

distTree_inc_status.sh

Print the status of an incremental distance tree directory: version, the number of objects in the tree, etc.