Skip to content

BioInfoTools/BSVF

Repository files navigation

BSVF

Bisulfite Sequencing Virus integration Finder

Repo URL: https://github.com/BGI-SZ/BSVF

Attention

For directional libraries only. PBAT and indirectional libraries are NOT supported.

Dependencies

bwa-meth 0.10 depends on

Install

Since the project leader wants to include all relevant tools here, even if they are already provided by main Linux distributions.

For problems on compiling EMBOSS, BWA or SAMTOOLS/HTSLIB, please ask the original programmer.

pip install toolshed
apt-get install autoconf automake make gcc perl zlib1g-dev libbz2-dev liblzma-dev libcurl4-gnutls-dev libssl-dev
#yum install autoconf automake make gcc perl-Data-Dumper zlib-devel bzip2 bzip2-devel xz-devel curl-devel openssl-devel
git clone https://github.com/BGI-SZ/BSVF.git
cd BSVF
git submodule init
git submodule update
src/install.sh

In case EMBOSS failed to install, you'll need to download the binary from above sites. And put water of EMBOSS in to ./bin. Or, just link water to ./bin.

Your BSVF/bin/ should be like this:

-rwxr-xr-x  398860 Feb 20 00:48 bwa
-rwxr-xr-x   21892 Sep  1 08:37 bwameth.py
-rwxr-xr-x   27040 Feb 20 01:14 water
-rwxr-xr-x  971772 Feb 20 00:48 samtools

Debian

apt install libbam-dev libhts-dev python3-pip emboss bwa samtools
pip install toolshed
git clone https://github.com/BGI-SZ/BSVF.git
cd BSVF/src/analyser
make
cd BSVF/bin
[symbolic link `bwa`, `samtools` and `water` from /usr/bin/ or so]

Homebrew/Linuxbrew

brew tap Ensembl/homebrew-external
brew install emboss bwa samtools python
pip install toolshed

ln -s `which bwa` ./bin/
ln -s `which samtools` ./bin/
ln -s `which water` ./bin/

brew install gcc
cd ./src/analyser/
make
cp -av analyser/bsanalyser ../../bin/
cd ../../bin/
ls -l

Citation

Gao, S., Hu, X., Xu, F., Gao, C., Xiong, K., Zhao, X., … Pedersen, C. N. S. (2018). BS-virus-finder: virus integration calling using bisulfite sequencing data. GigaScience, 7(1), 1–7. https://doi.org/10.1093/gigascience/gix123

Usage

./bsuit <command> <config_file>

./bsuit prepare prj.ini
./bsuit aln prj.ini
./bsuit grep prj.ini
./bsuit analyse prj.ini

a Logo

Test Run

mkdir sim90 && cd sim90 && ./simVirusInserts.pl GRCh38_no_alt_analysis_set.fna.gz X04615.fa.gz s90 && cd ..
mkdir sim50 && cd sim50 && ./simVirusInserts.pl GRCh38_no_alt_analysis_set.fna.gz X04615.fa.gz s50 50 ../sim90/s90.ini && cd ..

./bsuit prepare sim90/s90.ini

./bsuit aln sim90/s90.ini
./run/s90_aln.sh
./bsuit grep sim90/s90.ini
./bsuit analyse sim90/s90.ini

./bsuit aln sim50/s50.ini
./run/s50_aln.sh
./bsuit grep sim50/s50.ini
./bsuit analyse sim50/s50.ini

Reference Files

Simulation

./simVirusInserts.pl GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz HBV.X04615.fa sim150 150

GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz

A gzipped file that contains FASTA format sequences for the following:

  1. chromosomes from the GRCh38 Primary Assembly unit.
    Note: the two PAR regions on chrY have been hard-masked with Ns.
    The chromosome Y sequence provided therefore has the same coordinates as the GenBank sequence but it is not identical to the GenBank sequence. Similarly, duplicate copies of centromeric arrays and WGS on chromosomes 5, 14, 19, 21 & 22 have been hard-masked with Ns (locations of the unmasked copies are given below).
  2. mitochondrial genome from the GRCh38 non-nuclear assembly unit.
  3. unlocalized scaffolds from the GRCh38 Primary Assembly unit.
  4. unplaced scaffolds from the GRCh38 Primary Assembly unit.
  5. Epstein-Barr virus (EBV) sequence
    Note: The EBV sequence is not part of the genome assembly but is included in the analysis set as a sink for alignment of reads that are often present in sequencing samples.

Format of config_file

An example

[RefFiles]
HostRef=/share/HomoGRCh38/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz
VirusRef=/share/work/bsvir/HBV.AJ507799.2.fa

[DataFiles]
780_T.1=/share/work/bsvir/F12HPCCCSZ0010_Upload/s00_C.bs_1.fq.gz
780_T.2=/share/work/bsvir/F12HPCCCSZ0010_Upload/s00_C.bs_2.fq.gz
s01_P.1=/share/work/bsvir/F12HPCCCSZ0010_Upload/s01_P.bs_1.fq.gz
s01_P.2=/share/work/bsvir/F12HPCCCSZ0010_Upload/s01_P.bs_2.fq.gz
;MultiLibExample.1=/test/Lib1/AAAA.1.fq.gz, /test/Lib2/AAAA.1.fq.gz, /test/Lib3/BBBB.1.fq.gz
;MultiLibExample.2=/test/Lib1/AAAA.2.fq.gz, /test/Lib2/AAAA_2.fq , /test/Lib3/BBBB.2.fq.gz
tSE_X.1=/share/work/bsvir/F12HPCCCSZ0010_Upload/s00_C.bs_1.fq.gz,/share/work/bsvir/F12HPCCCSZ0010_Upload/s01_P.bs_2.fq.gz

[InsertSizes]
780_T=200
780_T.SD=120
s01_P=200
s01_P.SD=30
;MultiLibExample=210
;MultiLibExample.SD=70
tSE_X=90
tSE_X.SD=1

[Output]
WorkDir=/share/work/bsvir/bsI
ProjectID=SZ2015

[Parameters]
Aligner=bwa-meth
MinVirusLength=20

Build

You'll need cmake and autoconf, automake and devel-libs, as well as gcc, g++ to compile all sources.

For Mac OS X, install Homebrew first. Then:

xcode-select --install
brew install autoconf automake cmake python
brew install --without-multilib gcc

To Build the binaries:

cd src
./download.sh
./install.sh

pip install toolshed

Details

  • For comment lines, use ; as the first character.

  • RefFiles Section

    • HostRef is Host genome.
    • VirusRef is Virus sequence.
  • DataFiles Section

    • Each Sample need an unique ID as SampleID. Use SampleID.1 and SampleID.2 to specify pair-end sequencing data.
    • For samples with multiple PE sets, join each file with comma and keep their order.
  • InsertSizes Section

    • For each SampleID, use SampleID to specify average insert sizes. And use SampleID.SD to specify its standard deviation.
  • Output Section

    • WorkDir is the output directory.
    • ProjectID is an unique ID for this analyse defined in the config_file.

Description

BSuit is a suit to analyse xxx.

Formats

病毒整合结果文件

Chr	breakpoint	virus-start virus-end virusstrand	how-many-reads-support cluster-name
Chr1	3000	200	300	+/-	20 cluster1

中间contig信息文件

clustername contig-number chrpoint virus-integration
cluster1	contig1	chr1:3000	virus:+:200-300
cluster1	contig2	chr2:4000	viurs:-:300-400

ToDo

Compare with ViralFusionSeq [VFS] and VirusFinder 2 on normal WGS data.

See also

Tool Sequencing Type Programme Language 1st Aligenment * Assembler 2nd Aligenment # Epub Date
VirusSeq RNA-Seq, WGS Perl MOSAIK to Human MOSAIK to Virus MOSAIK to Hybrid 2012 Nov 08
ViralFusionSeq RNA-Seq, WGS Perl BWA-SW to Human cap3, SSAKE Blastall to Virus 2013 Jan 12
VERSE(VirusFinder2) WGS, RNA-Seq Perl Bowtie2 to Human, BLAT to Virus, BLASTN to Virus Trinity BWA-SW to Hybrid, SVDetect,CREST 2015 Jan 20
Virus-Clip RNA-seq Perl BWA-MEM to Virus Virus-Clip BLASTN to Human 2015 May 19
Vy-PER WGS, RNA-Seq Python2 BWA-SW to Human Vy-PER BLAT to Virus 2015 Jul 13
seeksv WGS C++ BWA to Hybrid seeksv seeksv to Hybrid 2016 Sep 14
BSVF WGBS, WGS Perl,C,C++ BWA-MEM to Hybrid BSVF water(EMBOSS) to Hybrid N/A

* for virus-infected reads
# for integration infomation

One More Things

To extract relevant PE reads within 500 bp range from final result, BS.analyse for example.

perl -lane '$a=$F[2]-501;$b=$F[2]+501;print join("\t",$F[1],$a,$b)' ../W2BS_analyse/BS.analyse >zones.bed
vi zones.bed # To remove the first head line
# sort BS.bam to BS.sort.bam and index it.
samtools view -L zones.bed BS.sort.bam > zones.sam
awk '{print $1}' zones.sam | sort | uniq > zones.ids
#samtools view BS.bam | grep -F -f zones.ids >zones.PE.sam
samtools view BS.sort.bam | grep -F -f zones.ids > zones.PEs.sam # sorted one maybe more useful.