Home
aLib is a set of software tools for basic analysis of data from Illumina sequencers. The components can be used in conjunction or independently. We provide instructions both for users who wish to use aLib as a whole and for those who only need sub-components.
First, make sure you are running a Linux computer with the following:
- C++ compiler
- zlib.h (you can usually get it on Ubuntu systems by typing "sudo apt-get install zlib1g-dev")
- Python interpreter
- R
- cabal (for Haskell submodules); make sure you have the binary-strict package installed (cabal install binary-strict)
- fastqc (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/)
- freeIbis (optional) (http://github.com/grenaud/freeIbis)
- biohazard (http://github.com/udo-stenzel/biohazard/)
- network-aware-bwa (http://github.com/udo-stenzel/network-aware-bwa/)
- libgab (https://github.com/grenaud/libgab)
- bamtools (https://github.com/pezmaster31/bamtools)
- Compile bamtools (https://github.com/pezmaster31/bamtools)
- In the main directory, just type make.
The first step is to configure the config.json file. This only has to be done once; afterwards, you can run aLib on any given sequencing run. Whether you want to use the individual components or use them in conjunction, the basic configuration is stored in config.json. For the individual components, the default config.json can probably be used as is.
If you have successfully typed "make" (and, if needed, edited config.json to change some values, e.g. the barcode sequences for the demultiplexer), the various components should be ready to use.
The workflow can be described as follows:
- The read directory is where your sequencer(s) write their sequencing data (basecalls and intensities in /Data/Intensities/)
- The write directory is where aLib will produce the usable data
YYMMDD_SEQUENCERID_RUN-NUMBER_COMMENTS
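As a minimal sketch, assuming the pattern above is the naming convention for run directories, a name can be split into its parts as follows. The function name `parse_run_dir`, the choice to treat the comments part as optional, and the sample directory name are all illustrative assumptions, not part of aLib.

```python
import re
from datetime import datetime

# Assumed layout: YYMMDD_SEQUENCERID_RUN-NUMBER_COMMENTS.
# The comments part is treated as optional and may itself contain underscores.
RUN_DIR_PATTERN = re.compile(
    r"^(?P<date>\d{6})_(?P<sequencer>[^_]+)_(?P<run_number>[^_]+)"
    r"(?:_(?P<comments>.*))?$"
)


def parse_run_dir(name):
    """Split a run directory name into its parts, or return None if it does not match."""
    match = RUN_DIR_PATTERN.match(name)
    if match is None:
        return None
    try:
        date = datetime.strptime(match.group("date"), "%y%m%d").date()
    except ValueError:
        return None  # six digits that do not form a valid date
    return {
        "date": date,
        "sequencer": match.group("sequencer"),
        "run_number": match.group("run_number"),
        "comments": match.group("comments") or "",
    }


if __name__ == "__main__":
    # Hypothetical run directory name, used only for illustration.
    print(parse_run_dir("140321_M00123_0042_testrun"))
```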
The main configuration file is config.json.
Field | Meaning |
alibdir | The base directory where aLib is installed. |
fastqcdir | Directory containing fastqc |
illuminareaddir | The directory where the sequencer writes the sequencing data (basecalls and intensities) |
illuminawritedir | This is the directory where aLib will write the processed data |
sequencers | Enter the id and type of the sequencer for your sequencing center |
runstodisplay | The number of runs to display |
emailAddrToSend | Email of the administrator |
genomedirectory | Directory that contains the BWA genomic databases. (see details about setup). |
tempdirectory | Directory used by aLib to write temp files |
freeibispath | Path to freeIbis |
controlindex | 7 bp index for reads used as a phiX control spike-in |
phixref | Path to the phiX reference |
chimeras | For various protocols, define the name of the protocol, the sequence of the adapters and putative chimeric sequences |
Indices | Defines, at the top level, the indexing scheme and the mapping from index id to sequence used by the demultiplexer |
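Before running anything, it can help to confirm that a config.json contains every field listed above. The following is a rough sketch: it assumes the fields are top-level keys of the JSON document, which the table does not state, and it only checks for presence, not for valid values. The helper names and the sample values are hypothetical.

```python
import json

# Field names taken from the table above; everything else here is illustrative.
EXPECTED_FIELDS = [
    "alibdir", "fastqcdir", "illuminareaddir", "illuminawritedir",
    "sequencers", "runstodisplay", "emailAddrToSend", "genomedirectory",
    "tempdirectory", "freeibispath", "controlindex", "phixref",
    "chimeras", "Indices",
]


def missing_fields(config):
    """Return the expected field names that are absent from a parsed config."""
    return [field for field in EXPECTED_FIELDS if field not in config]


def check_config_file(path="config.json"):
    """Load a JSON config from disk and report which expected fields are missing."""
    with open(path) as handle:
        config = json.load(handle)
    return missing_fields(config)


if __name__ == "__main__":
    # Hypothetical partial config, used only to show the check.
    sample = {"alibdir": "/opt/aLib", "tempdirectory": "/tmp"}
    print(missing_fields(sample))
```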
As an option, you can run a web form to display the runs and automatically create the json file using a click-through interface.
To set it up, create a directory that is web accessible and copy the contents of webForm/ into it. In the following, we assume that the URL for this directory is http://internal.webserver.com/aLib/
The server where aLib is running should have access to the BWA genome indices. Each BWA index should be in a directory of its own, named after the build:
hg19/
and the index files should use the prefix bwa-0.4.9, as such:
hg19/bwa-0.4.9.amb hg19/bwa-0.4.9.ann ...
Also, the directory should contain a BWA index for the control genome (PhiX; not crucial but nice to have). This directory should be named:
phiX/
It should have within it the directory control/:
phiX/control/
and have the following files for the fasta genome and BWA index:
phiX/control/whole_genome.fa phiX/control/bwa-0.4.9.{amb,ann,bwt,pac,rbwt,rpac,rsa,sa}
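The layout described above can be checked with a short script. This is a sketch under stated assumptions: it assumes phiX/ sits inside the same genome directory as the other builds, and that every build uses the same eight index extensions listed for the phiX control (the hg19 example above only shows the first two). The function names are hypothetical.

```python
import os

BWA_PREFIX = "bwa-0.4.9"
# Extensions listed for the phiX control index; assumed to apply to other builds too.
BWA_EXTENSIONS = ["amb", "ann", "bwt", "pac", "rbwt", "rpac", "rsa", "sa"]


def missing_index_files(build_dir):
    """Return the BWA index files that are expected in build_dir but not found."""
    expected = [f"{BWA_PREFIX}.{ext}" for ext in BWA_EXTENSIONS]
    return [name for name in expected
            if not os.path.isfile(os.path.join(build_dir, name))]


def check_genome_directory(genome_dir, builds):
    """Map each build name (e.g. "hg19") to its list of missing files."""
    report = {}
    for build in builds:
        report[build] = missing_index_files(os.path.join(genome_dir, build))
    # The control genome lives one level deeper, in phiX/control/.
    control_dir = os.path.join(genome_dir, "phiX", "control")
    missing = missing_index_files(control_dir)
    if not os.path.isfile(os.path.join(control_dir, "whole_genome.fa")):
        missing.append("whole_genome.fa")
    report["phiX/control"] = missing
    return report
```

Calling `check_genome_directory(genomedirectory, ["hg19"])` returns an empty list for each entry whose files are all in place.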
The components of aLib can be used independently or as a pipeline (using GNU Make) to process sequencing runs.
Here is a partial list of the different components:
bam2fastq/bam2fastq | Format converter from bam to fastq |
BCL2BAM/bcl2bam | Format converter from BCL to bam |
fastq2bam/fastSingle2bam | Converts single reads into bam |
fastq2bam/fastq2bam | Converts paired reads into bam |
pipeline/generate_report | Reads the RTA report and saves it as an HTML document for archiving purposes |
pipeline/filterReads | Flags reads with high expectancy of mismatches |
pipeline/assignRG | Demultiplexes reads (assigns to read groups) and computes likelihood of belonging to these read groups. |
pipeline/errorRatePerCycle | Computes the sequencing error rate and type of error on a per cycle basis using an aligned bam file. |
tileCount/tileCount.py | Counts # of clusters in a BAM file and an Illumina cluster coordinate file |
qualScoreC++/qualScoresObsVsPred | Reads an aligned bam file and computes observed vs predicted quality scores. |
biohazard/dist/build/bam-rmdup/bam-rmdup | Removes duplicates and calls consensus using those. |
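To illustrate what "expectancy of mismatches" can mean for pipeline/filterReads, here is a sketch based on the standard Phred relation, where a base with quality Q has error probability 10^(-Q/10). Whether filterReads computes the expectation exactly this way, and whether it normalizes by read length, is an assumption; the function names and the cutoff semantics are hypothetical.

```python
def expected_mismatches(phred_scores):
    """Sum of per-base error probabilities, where P(error) = 10 ** (-Q / 10)."""
    return sum(10 ** (-q / 10) for q in phred_scores)


def normalized_expected_mismatches(phred_scores):
    """Expected mismatches divided by read length (0.0 for an empty read)."""
    if not phred_scores:
        return 0.0
    return expected_mismatches(phred_scores) / len(phred_scores)


def exceeds_cutoff(phred_scores, cutoff):
    """True if the length-normalized expectation is above the given cutoff."""
    return normalized_expected_mismatches(phred_scores) > cutoff


if __name__ == "__main__":
    good_read = [40] * 50   # every base at Q40
    poor_read = [10] * 50   # every base at Q10
    print(exceeds_cutoff(good_read, 0.01), exceeds_cutoff(poor_read, 0.01))
```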
Once the installation and setup are completed, you can run aLib as a pipeline. aLib uses GNU make to resolve dependencies. To build the makefile, you need a json file detailing the different parameters, then run json2make.py. There are two ways to generate the json file: manually or using the web form.
To generate Makefiles, you can manually write a json configuration file. An example of a json configuration file (webForm/exampleRun.json) is distributed with aLib. The program webForm/json2make.py generates the makefiles from the json configuration file. The following is a description of the fields:
parambwa | Enter either "default" for default parameters or "ancient" for mapping ancient DNA |
genomebwa | Enter the name of the genome (directory name) stored in your genomedirectory (see https://github.com/grenaud/aLib/wiki#configuring-configjson) |
usebwa | true for use of mapping, false otherwise |
indicesraw | This contains a json array of objects with two or three fields: "name", "p7" and possibly "p5" for double indices. It stores the name of the read group ("name") together with the corresponding numerical values of the indices ("p7" and "p5") |
indicesseq | Like "indicesraw", but stores the actual sequences. To generate a json file with that field created automatically, given that you have configured config.json, use webForm/jsonIndices.py. |
spikedin | true if control sequences were spiked in, false otherwise |
bustard | true if we will use the default Bustard basecalls, false otherwise |
freeibis | true if we will use the default freeIbis basecalls, false otherwise |
lanes | json array of lanes to process |
sequencer | Type of sequencer either "ga", "hiseq" or "miseq" |
Email of the person to notify regarding this data | |
TileCount | Number of tiles on the flowcell |
SwathCount | Number of swaths on the flowcell |
LaneCount | Number of lanes on the flowcell |
SurfaceCount | Number of surfaces on the flowcell |
runid | An ID for this given run. |
expname | Name of the experiment, this will be used for the name of the reads |
cyclesread1 | How many cycles were used for first (forward) read. |
cyclesread2 | How many cycles were used for second (reverse) read. Use 0 for single end runs. |
cyclesindx1 | How many cycles were used for first index. Use 0 when multiplexing is not used. |
cyclesindx2 | How many cycles were used for second index. Use 0 when only one index is used or multiplexing is not used. |
ctrlindex | When a run is multiplexed and has control, this is the index of the control reads. |
lanesdedicated | If a lane was dedicated to control reads, enter the lane # of control reads. |
adapter1 | Sequence of the first adapter |
adapter2 | Sequence of the second adapter |
chimeras | Sequence of putative chimeras |
protocol | Type of protocol that was used (see main json configuration file) |
mergeoverlap | Whether --mergeoverlap will be used by the merger program. This will merge reads that have only a partial overlap |
key1 | If a sequence key was used, enter it here |
key2 | If a sequence key was used, enter it here |
filterseqexp | true if sequences are to be filtered on expected # of mismatches, false otherwise |
seqNormExpcutoff | If filterseqexp was set to true, this is the expected # of mismatch cutoff at which aLib will fail reads. |
filterentropy | true if sequences are to be filtered on entropy, false otherwise |
entropycutoff | If filterentropy was set to true, this is the entropy cutoff at which aLib will fail reads. |
filterfrequency | true if sequences are to be filtered on sequence frequency, false otherwise |
frequencycutoff | If filterfrequency was set to true, this is the sequence frequency cutoff at which aLib will fail reads. |
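Several of the fields above depend on each other: each filter switch has a companion cutoff, and some fields accept only a fixed set of values. The sketch below checks a hand-written run configuration for these cases before it is passed to json2make.py. It assumes the switches are stored as JSON booleans, which the table does not confirm (webForm/exampleRun.json is the authoritative reference), and the function name and sample values are hypothetical.

```python
import json


def check_run_config(run):
    """Return a list of human-readable problems found in a run configuration dict."""
    problems = []
    # Each filter switch has a companion cutoff that matters only when the switch is on.
    cutoffs = {
        "filterseqexp": "seqNormExpcutoff",
        "filterentropy": "entropycutoff",
        "filterfrequency": "frequencycutoff",
    }
    for switch, cutoff in cutoffs.items():
        if run.get(switch) and cutoff not in run:
            problems.append(f"{switch} is true but {cutoff} is missing")
    if run.get("sequencer") not in ("ga", "hiseq", "miseq"):
        problems.append('sequencer must be "ga", "hiseq" or "miseq"')
    # Mapping parameters are only checked when mapping is switched on.
    if run.get("usebwa") and run.get("parambwa") not in ("default", "ancient"):
        problems.append('parambwa must be "default" or "ancient"')
    return problems


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    example = {
        "sequencer": "miseq", "usebwa": True, "parambwa": "default",
        "genomebwa": "hg19", "filterseqexp": True,
        "filterentropy": False, "filterfrequency": False,
    }
    print(json.dumps(example, indent=2))
    print(check_run_config(example))
```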
Direct a browser to the address that was configured on the webserver (e.g. http://internal.webserver.com/aLib/form.php). Ask the user to select their run and click launch. This will automatically create the json file and makefiles.
If used as a pipeline, aLib has many dependencies that can be resolved using GNU make.
To build the different parts, simply type:
make
To parallelize using multiple cores, type:
make -j N
where N is the number of cores. To do a mock run that only prints the commands without executing them, type:
make -n
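If the make invocations above need to be launched from a script, for instance by a scheduler, they can be wrapped as in the sketch below. It relies only on the `-j` and `-n` options of GNU make shown above; the function names and the `run_dir` argument (the directory that holds the generated Makefile) are hypothetical.

```python
import subprocess


def make_command(jobs=None, dry_run=False):
    """Build the argument list for a GNU make invocation."""
    command = ["make"]
    if jobs is not None:
        command += ["-j", str(jobs)]  # run up to `jobs` recipes in parallel
    if dry_run:
        command.append("-n")  # print the commands without executing them
    return command


def run_make(run_dir, jobs=None, dry_run=False):
    """Run make inside run_dir and return its exit status."""
    return subprocess.call(make_command(jobs, dry_run), cwd=run_dir)
```

A dry run first (`run_make(run_dir, dry_run=True)`) is a cheap way to inspect the planned commands before committing cores to the real run.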