Nubeam is a reference-free approach for analyzing short sequencing reads. It represents nucleotides by matrices, transforms a read into a product of matrices, and assigns numbers to the read based on that product. A sequencing sample, which is a collection of reads, thus becomes a collection of numbers that form an empirical distribution. The genetic difference between samples is then quantified by the distance between their empirical distributions.
zlib is required to compile. To install zlib, run the following commands:
wget https://www.zlib.net/zlib1211.zip
unzip zlib1211.zip
cd zlib-1.2.11/
./configure
make
sudo make install
To compile Nubeam, run the following commands:
wget --no-check-certificate --content-disposition https://github.com/daihang16/nubeam/archive/master.zip
unzip Nubeam-master.zip
cd Nubeam-master/
make
Running ./nubeam -h prints the following messages:
./nubeam [qtf, rgc_beta, rgc_res, cad, cad2]
./nubeam qtf [-iodwSnfh]
compute quadriples for reads in fastq format.
produces prefix.quad.gz (gc content is within) and prefix.quad.log.
-i : input filename
-o : output prefix
-d : length of the reads (default d=75).
-w : sliding window size (default w=d).
-S : sliding window step (default S=w).
-n : number of missing nucleotide allowed.
-f : value, plus 33 is the PHRED quality value of fastq reads.
-h : print this help
./nubeam rgc_beta [-ioh]
perform regression on gc contents from read quantification and output regression coefficients.
produces prefix.beta.log.
-i : input file name.
-o : output prefix.
-h : print this help
./nubeam rgc_res [-ioh beta]
regress out gc contents from read quantification and output residuals.
produces prefix.nogc.gz and prefix.nogc.log.
-i : input file name.
-beta : beta file name.
-o : output prefix.
-h : print this help
./nubeam cad [-iombh bf]
compute pariwise distances of a set; the inputs are nubeam qtf outputs.
produces prefix.cad.log.
-i : specifies input file which is output of nubeam.
-o : output prefix (prefix.log contains pairwise distance matrix)
-m : choice of methods: h2 (Hellinger distance), cos (Cosine dissimilarity).
-b : designating the number of bins per column of scores
-bf : the file describing how to partition the bins
-h : print this help
./nubeam cad2 [-ijombh bf]
compute pariwise cross distances between two sets; the inputs are nubeam qtf outputs.
produces prefix.cad2.log.
-i : specifies input file of first set, which is output of nubeam.
-j : specifies input file of second set, which is output of nubeam.
-o : output prefix (prefix.log contains pairwise distance matrix)
-m : choice of methods: h2 (Hellinger distance), cos (Cosine dissimilarity).
-b : designating the number of bins per column of scores
-bf : the file describing how to partition the bins
-h : print this help
- Quantify reads
./nubeam qtf -i S1.fq -o S1.fq -d 75 -a 0 -n 0 -f 0
This quantifies the reads in the input file S1.fq with a read length of 75, an adaptor size of 0, and no N allowed in a read. The output file is named S1.fq.quad.gz. It has six columns of numbers: the first four columns are the Nubeam quadruplets for each read, and the last two columns are the GC counts for the read.
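As a quick sanity check on a qtf output, the column layout described above can be verified from the shell. This is a minimal sketch that assumes the .quad.gz file is gzip-compressed, whitespace-delimited text with one read per line; the function name quad_column_counts is made up for illustration and is not part of Nubeam.

```shell
# quad_column_counts FILE.quad.gz
# Prints the distinct per-line field counts found in the first 1000 lines.
# Assumption: the file is gzip-compressed text, one read per line.
quad_column_counts() {
    gzip -dc "$1" | awk 'NR > 1000 { exit } { print NF }' | sort -u
}

# Usage (a well-formed file should print a single line containing 6):
#   quad_column_counts S1.fq.quad.gz
```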
- Regress out GC content
  - Obtain regression coefficients
First, combine all the output files produced by qtf:
cat S1.fq.quad.gz S2.fq.quad.gz S3.fq.quad.gz > all.quad.gz
Then calculate the regression coefficients for GC count:
./nubeam rgc_beta -i all.quad.gz -o all.quad
The regression coefficients are written to all.quad.beta.log.
  - Obtain residuals
For each original output file produced by qtf, run:
./nubeam rgc_res -i S1.fq.quad.gz -beta all.quad.beta.log -o S1.fq.quad
The residuals are written to S1.fq.quad.nogc.gz.
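With several samples, the rgc_res step has to be repeated once per qtf output. One way to script this is sketched below. The function name regress_out_gc and the NUBEAM variable are made up for illustration; only the rgc_res invocation itself comes from the command shown above, and the output prefix is assumed to be the input name with the trailing .gz removed, as in that command.

```shell
# NUBEAM is a hypothetical override for the path to the nubeam binary.
NUBEAM="${NUBEAM:-./nubeam}"

# regress_out_gc BETA_FILE QUAD_FILE...
# Runs `nubeam rgc_res` once per qtf output. The output prefix is the input
# name without its ".gz" suffix (S1.fq.quad.gz -> S1.fq.quad).
regress_out_gc() {
    beta="$1"
    shift
    for quad in "$@"; do
        "$NUBEAM" rgc_res -i "$quad" -beta "$beta" -o "${quad%.gz}" || return 1
    done
}

# Usage:
#   regress_out_gc all.quad.beta.log S1.fq.quad.gz S2.fq.quad.gz S3.fq.quad.gz
```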
- Quantify pair-wise distance
  - Calculate within-group distances
./nubeam cad -o output -m h2 -b 10 -bf bin.txt -i S1.fq.quad.nogc.gz -i S2.fq.quad.nogc.gz -i S3.fq.quad.nogc.gz
For n samples, the command calculates n(n-1)/2 Hellinger distances. The R^4 space is partitioned into 10^4 bins (10 bins per column of scores, as set by -b 10). If bin.txt exists, it is used for partitioning; if not, the partitioning is calculated and written to bin.txt. If the input files are too large, there may not be enough memory to calculate the bin partitioning file. In that case, down-sample the input files, use the down-sampled files to calculate a bin partitioning file, and then use this bin partitioning file together with the original input files to calculate the distance matrix. The distance matrix is at the end of output.cad.log.
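The down-sampling workaround described above could look roughly like the following. This is only a sketch: it assumes the input files are gzip-compressed text with one read per line, and it keeps every K-th line, which is a simple systematic thinning rather than a random sample. The function name downsample_quad, the S*.sub.gz file names, and the output prefix sub are made up for illustration.

```shell
# downsample_quad INPUT.gz OUTPUT.gz K
# Keeps lines 1, K+1, 2K+1, ... of a gzip-compressed text file.
# Assumption: one read per line, so dropping lines drops whole reads.
downsample_quad() {
    gzip -dc "$1" | awk -v k="$3" '(NR - 1) % k == 0' | gzip > "$2"
}

# Usage:
#   1. Thin each input file, here to every 10th read.
#        downsample_quad S1.fq.quad.nogc.gz S1.sub.gz 10
#   2. Run cad on the thinned files so that bin.txt gets written.
#        ./nubeam cad -o sub -m h2 -b 10 -bf bin.txt -i S1.sub.gz -i S2.sub.gz -i S3.sub.gz
#   3. Rerun cad on the original files; the existing bin.txt is reused.
#        ./nubeam cad -o output -m h2 -b 10 -bf bin.txt -i S1.fq.quad.nogc.gz -i S2.fq.quad.nogc.gz -i S3.fq.quad.nogc.gz
```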
  - Calculate between-group distances
./nubeam cad2 -o output -m h2 -b 10 -bf bin.txt -i S1.fq.quad.nogc.gz -i S2.fq.quad.nogc.gz -i S3.fq.quad.nogc.gz -j S4.fq.quad.gz -j S5.fq.quad.gz
For a group of n samples (specified by -i) and a group of m samples (specified by -j), the command calculates n*m Hellinger distances. The R^4 space is partitioned into 10^4 bins. Unlike cad, cad2 requires the -bf argument, and the specified file bin.txt is used for partitioning. If you don't have a bin partitioning file, calculate one using cad first. The distance matrix is at the end of output.cad2.log.
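Because cad2 needs an existing bin partitioning file, a small wrapper can check for it before starting a long run. The sketch below is illustrative only: run_cad2 and the NUBEAM variable are made-up names, and the wrapper simply passes the remaining arguments through to nubeam cad2.

```shell
# NUBEAM is a hypothetical override for the path to the nubeam binary.
NUBEAM="${NUBEAM:-./nubeam}"

# run_cad2 BIN_FILE [other cad2 arguments...]
# Refuses to start unless the bin partitioning file already exists.
run_cad2() {
    bin="$1"
    shift
    if [ ! -f "$bin" ]; then
        echo "bin partitioning file '$bin' not found; create it with nubeam cad first" >&2
        return 1
    fi
    "$NUBEAM" cad2 -bf "$bin" "$@"
}

# Usage:
#   run_cad2 bin.txt -o output -m h2 -b 10 -i S1.fq.quad.nogc.gz -j S4.fq.quad.gz
```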