makeflow-examples/blast at master · cooperative-computing-lab/makeflow-examples

BLAST Workflow Example

This directory contains the materials needed to construct a blast workflow. Before you can run it with Makeflow, you need to install the blast software and a suitable database.

If you have not done so already, please clone this example repository like so:

git clone https://github.com/cooperative-computing-lab/makeflow-examples.git
cd ./makeflow-examples/blast

First, obtain a blast binary suitable for your architecture (about 30MB):

wget ftp://ftp.ncbi.nlm.nih.gov/blast/executables/legacy.NOTSUPPORTED/2.2.26/blast-2.2.26-x64-linux.tar.gz
tar xvzf blast-2.2.26-x64-linux.tar.gz

Next, copy the main executable into the working directory.

cp blast-2.2.26/bin/blastall .

Then, obtain a nucleotide database suitable for searching (about 400MB):

wget ftp://ftp.ncbi.nlm.nih.gov/blast/db/nt.44.tar.gz
mkdir nt
tar -C nt -xvzf nt.44.tar.gz

Now, test to make sure that blast works locally:

./blastall -p blastn -d nt/nt -i small.fasta

If everything is working correctly, you should see output that starts like this:

BLASTN 2.2.26 [Sep-21-2011]

Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer, 
Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), 
"Gapped BLAST and PSI-BLAST: a new generation of protein database search
programs",  Nucleic Acids Res. 25:3389-3402.

The output then continues for quite a while.

Now you are ready to generate a workflow that will do the job in parallel. Use the makeflow_blast script to create a workflow, blast.mf, that splits the input into several pieces, runs blast on each one, and then joins the results together:

./makeflow_blast -d nt -i small.fasta -o output.fasta -p blastn --num_seq 10 --makeflow blast.mf
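The split step can be pictured with a small shell sketch. This is only an illustration of the idea, not the code inside makeflow_blast: the function name split_fasta and the part.N file names are hypothetical, and it assumes a well-formed FASTA input in which every record begins with a ">" header line.

```shell
# Hypothetical helper: split FASTA records read from stdin into files
# PREFIX.0, PREFIX.1, ... holding at most N records each.
split_fasta() {
    awk -v n="$1" -v prefix="$2" '
        /^>/ {
            # Start a new piece every n records.
            if (count % n == 0) {
                if (out) close(out)
                out = prefix "." int(count / n)
            }
            count++
        }
        { print > out }
    '
}

# Tiny demonstration with three made-up records, two per piece.
workdir=$(mktemp -d)
printf '>s1\nACGT\n>s2\nGGCC\n>s3\nTTAA\n' | split_fasta 2 "$workdir/part"
ls "$workdir"
```

Each piece could then be searched independently, and the per-piece outputs concatenated in order, which is the pattern the generated workflow expresses as Makeflow rules.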

Then, you can use makeflow to run the whole thing as desired. For example, to run it all locally:

makeflow blast.mf

Or, to run it using HTCondor, Work Queue, or SGE:

makeflow -T condor blast.mf
makeflow -T wq blast.mf
makeflow -T sge blast.mf

To visualize the workflow that was generated:

makeflow_viz blast.mf --dot-no-labels > blast.dot
dot -Tpng blast.dot > blast.png
display blast.png

Additionally, you can generate random data to adjust the total runtime:

./fasta_generator 200 1000 > test.fasta
./makeflow_blast -d nt -i test.fasta -p blastn --num_seq 5 --makeflow blast_test.mf
makeflow blast_test.mf

Alternatively, the workflow can be run from its JX or JSON description using one of the following commands:

makeflow --jx blast.jx
makeflow --json blast.json

The number and length of sequences can be adjusted to suit your needs: the first argument to fasta_generator sets the number of contigs, and the second sets the length of each contig. fasta_generator produces contigs containing random AGCT sequences.

The values shown above (200 contigs of length 1000, split 5 sequences at a time) produce a workflow that runs in about 5 minutes on a single-core local machine.
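To make the shape of such test data concrete, here is a minimal sketch of a random FASTA generator. It is not the fasta_generator script from this directory: the function name random_fasta, the contig_N header format, and the optional seed argument are all assumptions made for illustration.

```shell
# Hypothetical stand-in for a random FASTA generator.
# Usage: random_fasta NUM_CONTIGS CONTIG_LENGTH [SEED]
random_fasta() {
    awk -v n="$1" -v len="$2" -v seed="${3:-1}" 'BEGIN {
        srand(seed)
        split("A C G T", base, " ")
        for (i = 1; i <= n; i++) {
            printf ">contig_%d\n", i
            seq = ""
            # rand() is in [0, 1), so the index is always 1..4.
            for (j = 1; j <= len; j++)
                seq = seq base[int(rand() * 4) + 1]
            print seq
        }
    }'
}

# Print two short contigs to the terminal.
random_fasta 2 8
```

Raising either argument increases the query size, and with it the total amount of blast work in the generated workflow.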

Workflow Size	Reference Size	Query Size (Number x Length)	Sequences per Split	Approx. Time : Machines
Small	NT (Fixed 565MB)	200x1000 (198K)	5	~5 min : 1 machine
Medium	NT (Fixed 565MB)	30000x2000 (58M)	100	~20 min : 20 machines
Large	NT (Fixed 565MB)	100000x2000 (193M)	1000	~30 min : 75 machines
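As a rough guide to how much parallelism each row offers, the number of pieces can be estimated by dividing the number of query sequences by the number of sequences per split. The sketch below assumes one blast task per piece and rounds up for a partial final piece; the helper name pieces is hypothetical, and the exact splitting behavior of makeflow_blast may differ.

```shell
# Hypothetical helper: ceiling division of sequences by sequences-per-split.
pieces() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

# Rows taken from the table: name, number of sequences, sequences per split.
for row in "Small 200 5" "Medium 30000 100" "Large 100000 1000"; do
    set -- $row
    echo "$1: $(pieces "$2" "$3") blast tasks"
done
# Small: 40 blast tasks
# Medium: 300 blast tasks
# Large: 100 blast tasks
```

Under these assumptions, the Medium configuration yields the most tasks, since its split size is small relative to its query count.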
