Task: summary
This summarises the results from one or more runs of ARIBA.
The usage is
ariba summary out in.report.1.tsv in.report.2.tsv ...
where in.report.1.tsv, in.report.2.tsv ... is a list of report files made by separate runs of ARIBA.
It makes three output files:
- out.csv. This is a CSV file that can be viewed in your favourite spreadsheet program.
- out.phandango.{csv,tre}. These are two files that allow you to view the results in Phandango. They can be drag-and-dropped straight into Phandango. Note that ARIBA makes a rough tree, using the contents of the CSV file. You may wish to provide your own tree file to Phandango and stop ARIBA from making a tree with the --no_tree option.
By default, the output is minimal. It contains one column per
cluster and one row per sample,
with a "yes" or "no" as to whether or not each
sample has a match (as described below) to that cluster.
You can control exactly what is output, or use the option --preset
to select one of several preset combinations. In order of increasing
number of columns, the values that can be used with --preset are:
minimal, cluster_small, cluster_all, cluster_var_groups, all, all_no_filter.
Continue reading for an explanation of all the columns that can be reported.
There can be up to seven columns output per cluster:
- assembled: this is one of "no", "fragmented", "interrupted", "yes", "yes_nonunique", depending on the flag. Please see here for how it is determined.
- match: this is either "yes" or "no". It is set to "yes" if the assembled column is "yes" or "yes_nonunique", and in the case of a variants-only gene it must also have a known variant. Otherwise it is set to "no".
- ref_seq: this is set to the name of the closest reference sequence for each sample. Set to "NA" if assembled is "no".
- pct_id: this is the percent identity of the contig that has the largest value in the ref_base_assembled column of the report. Set to "NA" if assembled is "no".
- ctg_cov: the same as pct_id, except this is the mean read depth across the contig.
- known_var: "yes" or "no" depending on whether or not the sample has a known variant. Set to "NA" if assembled is "no".
- novel_var: "yes" or "no" depending on whether or not the sample has a novel variant (i.e. not specified in the original metadata). Set to "NA" if assembled is "no".
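The rule for the match column can be written out as a short Python sketch. This is only an illustration of the description above, not ARIBA's own code: the function name `match_value` and the `variants_only` parameter are hypothetical.

```python
def match_value(assembled, variants_only=False, known_var="no"):
    """Return the match column value for one sample and one cluster.

    Hypothetical helper that restates the documented rule; the
    arguments take the same string values as the summary columns.
    """
    # The cluster must have assembled completely.
    if assembled not in ("yes", "yes_nonunique"):
        return "no"
    # A variants-only gene must also have a known variant.
    if variants_only and known_var != "yes":
        return "no"
    return "yes"
```

For example, `match_value("yes_nonunique")` gives "yes", while `match_value("yes", variants_only=True, known_var="no")` gives "no".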
Which of the seven columns are output is controlled
using the option --cluster_cols. Provide a comma-separated list
of the names that you want in the output. For example:
--cluster_cols assembled,match,known_var
would report the three columns assembled, match and known_var.
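If you drive ARIBA from a script, the command line can be assembled from a list of column names. The sketch below only builds the argument list and does not run anything. The helper name `summary_command` is hypothetical, and placing the option before the positional arguments is an assumption about how the command line is parsed.

```python
import shlex


def summary_command(out_prefix, reports, cluster_cols=None):
    """Build an `ariba summary` argument list (hypothetical helper)."""
    cmd = ["ariba", "summary"]
    if cluster_cols:
        # --cluster_cols takes one comma-separated list of column names.
        cmd += ["--cluster_cols", ",".join(cluster_cols)]
    # Output prefix first, then one report file per ARIBA run.
    cmd += [out_prefix, *reports]
    return cmd


cmd = summary_command(
    "out",
    ["in.report.1.tsv", "in.report.2.tsv"],
    cluster_cols=["assembled", "match", "known_var"],
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where ARIBA is installed.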
By default, variants are not reported when running summary.
There are three types of variant columns that can be reported.
The reporting of variants can be switched on using any of
the options --v_groups, --known_variants, and --novel_variants.
- --v_groups: this only applies if you allocated the variants to groups, e.g. when running aln2meta. Otherwise, you can ignore this option. If it is used, it will output a column for each group, showing whether or not each sample has any variant from that group.
- --known_variants: output a column per variant, showing whether or not each sample has it. This applies to variants that ARIBA is already aware of because they were provided in the original metadata when running prepareref.
- --novel_variants: this is the same as --known_variants except novel variants are reported, i.e. any variants found that were not given in the original metadata.
By default, any row or column that only contains "no" or "NA" is
removed. This filtering can be changed using the options
--col_filter n and --row_filter n.
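The effect of the default filtering can be illustrated with a small Python sketch. It is an assumption-laden illustration, not ARIBA's implementation: the function `filter_table`, the nested-dict table layout, and the order in which rows and columns are filtered are all hypothetical.

```python
def filter_table(rows, row_filter=True, col_filter=True):
    """Drop rows/columns whose values are all "no" or "NA".

    rows maps sample name -> {column name -> value}. Rows are
    filtered before columns here; that order is an assumption.
    """
    empty = {"no", "NA"}
    if row_filter:
        rows = {
            sample: cols
            for sample, cols in rows.items()
            if not set(cols.values()) <= empty
        }
    if col_filter and rows:
        columns = list(next(iter(rows.values())))
        keep = [
            c for c in columns
            if any(cols[c] not in empty for cols in rows.values())
        ]
        rows = {s: {c: cols[c] for c in keep} for s, cols in rows.items()}
    return rows


table = {
    "sample1": {"cluster1": "yes", "cluster2": "no"},
    "sample2": {"cluster1": "no", "cluster2": "NA"},
}
filtered = filter_table(table)
```

Here sample2 and cluster2 contain only "no" or "NA", so both are dropped and only the sample1/cluster1 cell remains. Passing `row_filter=False, col_filter=False` leaves the table unchanged.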
Preset combinations of the columns to output are available using
the --preset option. The default is --preset minimal.
Using --preset will override the options
--cluster_cols, --v_groups, --known_variants, --novel_variants,
--col_filter, and --row_filter.
The cluster columns are set as follows depending on the preset:
| Preset | Value of --cluster_cols |
|---|---|
| minimal | match |
| cluster_small | assembled,match,ref_seq,known_var |
| cluster_all | assembled,match,ref_seq,pct_id,ctg_cov,known_var,novel_var |
| cluster_var_groups | assembled,match,ref_seq,pct_id,ctg_cov,known_var,novel_var |
| all | assembled,match,ref_seq,pct_id,ctg_cov,known_var,novel_var |
| all_no_filter | assembled,match,ref_seq,pct_id,ctg_cov,known_var,novel_var |
The variant options and row/column filtering are set as follows depending on the preset:
| Preset | --v_groups, --known_variants, --novel_variants | row_filter | col_filter |
|---|---|---|---|
| minimal | (none used) | y | y |
| cluster_small | (none used) | y | y |
| cluster_all | (none used) | y | y |
| cluster_var_groups | --v_groups | y | y |
| all | --v_groups --known_variants --novel_variants | y | y |
| all_no_filter | --v_groups --known_variants --novel_variants | n | n |
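The preset settings can also be read as a lookup from preset name to the explicit options it stands for. The Python sketch below transcribes those settings into a dictionary. The names `PRESETS` and `equivalent_options` are hypothetical and this is not how ARIBA stores its presets internally.

```python
ALL_COLS = "assembled,match,ref_seq,pct_id,ctg_cov,known_var,novel_var"
ALL_VARIANTS = ["--v_groups", "--known_variants", "--novel_variants"]

# preset name -> (cluster_cols, variant options, row_filter, col_filter)
PRESETS = {
    "minimal": ("match", [], "y", "y"),
    "cluster_small": ("assembled,match,ref_seq,known_var", [], "y", "y"),
    "cluster_all": (ALL_COLS, [], "y", "y"),
    "cluster_var_groups": (ALL_COLS, ["--v_groups"], "y", "y"),
    "all": (ALL_COLS, ALL_VARIANTS, "y", "y"),
    "all_no_filter": (ALL_COLS, ALL_VARIANTS, "n", "n"),
}


def equivalent_options(preset):
    """Return the explicit options that a preset corresponds to."""
    cluster_cols, variant_opts, row_filter, col_filter = PRESETS[preset]
    return [
        "--cluster_cols", cluster_cols,
        *variant_opts,
        "--row_filter", row_filter,
        "--col_filter", col_filter,
    ]
```

For instance, `equivalent_options("cluster_var_groups")` returns the full column list together with `--v_groups` and both filters set to `y`.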
The other options, not described above, are as follows:
- --no_tree. This stops the tree being calculated, which can be quite slow.
- --min_id. The minimum percent identity to count as assembled. Default: 90.
- --only_clusters Cluster_names. Only report data for the given comma-separated list of cluster names, e.g. cluster1,cluster2,cluster42.