- This repository contains data and scripts from the publication "RNA sequencing indicates widespread conservation of circadian clocks in marine zooplankton".
- Brief summary of the project: we explored the de novo assembled transcriptomes of a diverse set of marine zooplankton species to reveal the presence of transcripts coding for proteins that may together constitute functional circadian clocks. This was achieved by identifying candidate sequences via orthology to a select set of "bait" circadian clock protein sequences from a set of model organisms. We also annotated these transcriptomes, as very little sequencing data is available for most of these species.
- Note: some data from this study (sequencing reads, for example) are not hosted here; they can instead be found on NCBI under the accession code `PRJNA824716`.
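  Should the reads be needed, they can be pulled from the SRA; a minimal sketch assuming `sra-tools` is installed (the run accession shown is a placeholder to be replaced with one listed under `PRJNA824716`):

  ```bash
  #Fetch one sequencing run from NCBI SRA (sra-tools assumed to be installed;
  #SRRXXXXXXX is a placeholder for a real run accession under PRJNA824716).
  prefetch SRRXXXXXXX;
  fasterq-dump SRRXXXXXXX -O reads/;
  ```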
- Citations:
- Please use our issue tracker or email us for further assistance.
- `annotations` contains compressed flat files (`*.tsv.xz`) that tabulate sequence annotations for the de novo assembled transcripts. Annotations include homologs from `Swiss-Prot`, and orthologs, domains, and gene ontology terms from `EggNOG`.
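  For a quick look at one of these tables without writing the decompressed file to disk (the filename below is illustrative):

  ```bash
  #Stream-decompress an annotation table and show the first few rows;
  #the filename is illustrative.
  xz -dc annotations/some_species.tsv.xz | head -n 5;
  ```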
- `transcriptomes` and `proteomes` contain the de novo assembled transcriptome assemblies and their in silico translated protein sequence sets (proteomes), respectively.
- `circadian_clock_candidates` contains `FASTA` files of sequences identified as circadian clock proteins. The `fas_by_samp` sub-directory presents these grouped by host species, and the `fas_by_type` sub-directory presents the very same sequences grouped by clock protein type. The `domain_structure_visualizations` sub-directory contains visualizations of the functional domains in the protein sequences, grouped by circadian clock protein type. Finally, the `circadian_clock_candidates` directory also contains a file tabulating annotations for the candidate sequences (`cc_cand_sel_main_table.csv`). Within this table, each row corresponds to a particular candidate sequence, and the columns represent different kinds of annotations and sequence-related information. The contents of the columns are as follows:
orthogroup - EggNOG orthogroup the sequence was assigned to.
refseq - sequence identifier of the bait that was used to find the candidate sequence.
matchseq - sequence identifier of the candidate.
spref - species to which the bait belongs.
spmatch - species to which the candidate belongs.
refmatcat - as the search strategy also allows for the bait to identify other baits that are orthologs, the matches in the table may be classified into two categories - bait vs. bait (SOIREF__SOIREF) and bait vs. candidate (SOIREF__TRINITY).
pacc_match - PANTHER DB annotation's accession for the candidate.
pdesc_match - PANTHER DB annotation's description for the candidate.
pacc_ref - PANTHER DB annotation's accession for the bait.
pdesc_ref - PANTHER DB annotation's description for the bait.
pfacc_match - Pfam DB annotation's accession for the candidate.
pfdesc_match - Pfam DB annotation's description for the candidate.
pfloc_match - Pfam domains' locations on the sequence for the candidate.
pfacc_ref - Pfam DB annotation's accession for the bait.
pfdesc_ref - Pfam DB annotation's description for the bait.
pfloc_ref - Pfam domains' locations on the sequence for the bait.
protcat - which type of circadian clock protein the candidate is.
mconf - binary indicator of whether the PANTHER DB sub-family annotations of the bait and the candidate are identical or non-identical (1 - identical, 0 - non-identical).
upid - UniProt identifier of the best match homolog for the candidate.
swissprot_subject - UniProt/Swiss-Prot sequence header string of the best match homolog for the candidate.
swissprot_percid - pairwise percentage identity between the sequences of the best match homolog from Swiss-Prot and the candidate.
swissprot_evalue - likelihood of this sequence from Swiss-Prot having been assigned as the best match to the candidate purely by chance (E-value; the smaller the value, the more unlikely it is that the assignment was by chance).
fas_match - protein sequence of the candidate.
fas_ref - protein sequence of the bait.
pid - pairwise percentage identity between the sequences of the bait and the candidate.
candname - sequence header of the candidate.
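  Since `R` is required elsewhere in this workflow anyway, a quick tabulation of candidates per species (`spmatch`) and clock protein type (`protcat`) can be obtained from the command line like so (a sketch; the path assumes the table is read from a checkout of this repository):

  ```bash
  #Count candidates per species (spmatch) and clock protein type (protcat).
  Rscript -e 'x <- read.csv("circadian_clock_candidates/cc_cand_sel_main_table.csv"); print(table(x$spmatch, x$protcat))';
  ```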
- The scripts used to execute the tools and perform all analyses for the paper are contained in this directory. Please refer to the workflow figure included in the "Methods" section of the manuscript for a diagrammatic overview.
- The last few R scripts, together with the scripts in the `of_ref_prep` and `species_tree` directories, also produce all plots and figures in the manuscript.
- The vast majority of the scripts here are `BASH` scripts that are used to submit jobs on a cluster.
- All scripts were executed on a high performance compute cluster running `Scientific Linux release 7.9 (Nitrogen)` with `SLURM` as the scheduler.
- For easy replication of the scripts here, it is best to organize everything within a directory structure like so (with the names indicated below):

  ```
  newrun/
  |-jobfiles/
  |-cc_orthofinder_dbs/
  |-reads/
  |-outputs/
  ```
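  This layout can be created in one go (brace expansion assumes `bash`):

  ```bash
  #Create the directory skeleton expected by the scripts.
  mkdir -p newrun/{jobfiles,cc_orthofinder_dbs,reads,outputs};
  ```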
- All scripts discussed below should be placed in the `jobfiles/` directory (indicated in the directory tree above); there are some exceptions, and these will be indicated as they arise. When executed, the scripts will place outputs in the `newrun/outputs` directory, and will place `stdout` and `stderr` data + EXECUTION scripts (see further below) under `jobfiles/<appropriately_named_directory>`. There are a few `R` scripts in this workflow. These do not save `stderr` and `stdout` by default, but rather only print to the terminal; the user can optionally redirect the output to a file to create a log of the run.
- Recreating the workflow is simple once all the data and scripts and so forth are placed where they should be. The user must simply execute the scripts in the order indicated by the numerical prefixes to the script names, i.e., `1_*_mainjob.bash` must be executed first, then `2_*_mainjob.bash`, and so forth. If there are scripts with additional prefixes (e.g., `1a_` and `1b_`), the script with the prefix occurring first in alphabetical order must be executed first (i.e., `1a_` before `1b_`). The script names should be fairly self-explanatory. In nearly all cases, setting the main path (via the `MYPATH` variable or its equivalent, e.g., `mypath`) to point to wherever the `newrun/` directory is should suffice; see the sketch below.
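  For illustration, a hypothetical first step (the `sed` pattern assumes an assignment of the form `MYPATH="...";` at the top of the script; check each script first, as the variable name and format may differ):

  ```bash
  #Point MYPATH at the run directory in the first script, then submit it.
  #The sed pattern is an assumption about how the assignment is written.
  cd /path/to/newrun/jobfiles;
  sed -i 's|^MYPATH=.*|MYPATH="/path/to/newrun";|' 1_*_mainjob.bash;
  sbatch 1_*_mainjob.bash;
  ```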
- As these scripts were designed to run on compute nodes on a `SLURM`-based cluster, the thread counts and memory allocations are large in many cases (typically 16 cores and 124 GB of RAM). The main `OrthoFinder` run uses a full complement of 256 or 128 cores plus all available memory. The user will have to change these parameters according to their use case (see the example directives below). Additionally, some of the scripts contain copy/move commands to move inputs and outputs onto--or off of--local storage on the compute nodes. These must be adjusted according to local requirements.
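  The relevant requests are ordinary `SLURM` header directives near the top of each script, e.g. (values per the typical allocations described above; the exact lines in each script may differ):

  ```bash
  #Typical resource requests to adjust for the local cluster.
  #SBATCH --cpus-per-task=16
  #SBATCH --mem=124G
  ```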
- The tools used are, in most cases, installed via `conda`; `conda` activation and deactivation invocations are therefore to be found inscribed within the job scripts (see the generic pattern below). Tool version numbers are noted in the manuscript. Please modify the scripts accordingly before use should `conda` not be available.
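  The generic pattern inside the job scripts looks roughly like the following (the `conda` sourcing line and the environment name are assumptions; adapt to the local installation):

  ```bash
  #Make `conda activate` work in a non-interactive batch shell; resolving the
  #base path this way is an assumption about the local conda installation.
  source "$(conda info --base)/etc/profile.d/conda.sh";
  conda activate mmseqs2_conda; #environment name is illustrative
  #...tool invocation goes here...
  conda deactivate;
  ```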
- Most scripts here are `BASH` scripts meant to be executed on a `SLURM`-based high performance compute cluster. In most cases, they are also SUBMITTER scripts: scripts carrying a `for` loop that cycles through the inputs at a given path and creates an EXECUTION `BASH` script for each input. The SUBMITTER also submits these EXECUTION scripts to the `SLURM` scheduler via `sbatch`. That is, most scripts will have the following generic structure:
```bash
#!/bin/bash
#SBATCH <slurm_stuff>
#This is the SUBMITTER script.
#Note: the input path variable must not be named PATH, as that would clobber the shell's search path.
INPATH="/path/to/inputs";
#Constructing the EXECUTION script using a for loop holding a template.
for SAMPLE in "${INPATH}"/*;
do
#Set up some stuff.
EXECSCRIPT="/path/to/execscript/$(basename ${SAMPLE}).bash";
echo "#!/bin/bash
#SBATCH <slurm_stuff>
#This is the EXECUTION script's template body.
#Executing tool.
mytool -i ${SAMPLE} -o /path/to/output/$(basename ${SAMPLE});
" > ${EXECSCRIPT};
#Submitting the EXECUTION script to the SLURM scheduler via sbatch.
sbatch ${EXECSCRIPT};
done;
```
- Non-`SLURM` scripts also exist in this workflow. None of these are SUBMITTER scripts (i.e., this code is not parallelized). They are listed here; see the usage sketch after this list.
  - `10b_sed_td_postprocess_mainjob.bash` - pass the path to the TransDecoder output directory as input; execute directly at the command line.
  - `13_rscript_lenfilt_new_mainjob.R` - execute via `Rscript` directly from the command line; set the input paths within the script properly before execution.
  - `19_rscript_of_getsois_mainjob.R` - execute via `Rscript`; inputs must be set within the script (the input is basically the path to the OrthoFinder outputs directory). This one is a bit unusual in that it places its outputs under the `outputs/orthofinder/` directory.
  - `20c_awk_sed_interproscan_postprocessing_mainjob.bash` - pass the path to the InterProScan outputs of `19_rscript_of_getsois_mainjob.R`; execute directly from the command line.
  - `21a_rscript_ccsel_main.R` - execute via `Rscript`; inputs must be set within the script (pay heed to the comments). Must have `21b_rscript_ccsel_auxfunc.R` alongside in the same directory to function properly.
  - `17_rscript_annotscmp_mainjob.R` - execute via `Rscript`; inputs must be set within the script.
  - `23_rscript_seqstats.R` and `24_rscript_ccvissum_mainjob.R` - execute via `Rscript`; see the comments in the scripts. These two generate the final tables, plots, and figures.
  - `25_rscript_of_refcomp_prep.R` and `27_rscript_of_refcomp_mainjob.R` - execute via `Rscript`; see the comments in the scripts. These together run an analysis that generates a Venn diagram comparing the actual OrthoFinder run used to detect clock proteins in the paper with an expanded run (with the same parameters) that includes additional reference proteomes. These additional reference proteomes are basically all the "reference" quality metazoan proteomes in UniProt. They must be downloaded manually, after which `25_rscript_of_refcomp_prep.R` can be executed on them to format the headers and filenames properly for analysis with OrthoFinder. Subsequently, `26_of_refcomp_mainjob.bash` can be executed to run OrthoFinder, and then `27_rscript_of_refcomp_mainjob.R` can be executed to generate the Venn diagram.
  - `28a_ncbicomp_mainjob_noslurm.bash` and `28b_rscript_ncbicomp_mainjob.R` - these together take the `rscript_ccsel/cc_cand_sel_pub_table.csv` file created by `21a_rscript_ccsel_main.R` and the files `cc_queries.fasta` and `species_taxids.csv` (place these one level above the `outputs/` directory within which `rscript_ccsel/cc_cand_sel_pub_table.csv` should sit; both are available in this GitHub repository under `ncbicomp_files/`) and search against NCBI's transcriptome and genome assemblies for the species involved in this study, to see whether any of the proteins for which no candidates were found using our workflow perhaps have matches in the NCBI data. `28a_ncbicomp_mainjob_noslurm.bash` uses entrez `esearch` and `fastq-dump` (installed in a `conda` environment called `entrez_conda`) to get the assemblies, and uses `MMseqs2` (installed in a `conda` environment called `mmseqs2_conda`) to search against them with `cc_queries.fasta`, which just contains the 10 bait protein sequences used in this study. Note: the `bash` script here is NOT a SLURM script.
  - `29_rscript_ccseqcomp_mainjob.R` - used to take all the clock protein categories wherein a species had more than one candidate and investigate whether these sequences are paralogs or the result of sequence variation. Needs the contents of `rscript_ccsel/fas_by_type`, which is generated by `21a_rscript_ccsel_main.R`. This produces some MSA `FASTA` files and visualizations (these MSAs and visualizations can be found in this GitHub repository under `circadian_clock_candidates/multiple_candidate_msa`) as well as a histogram of pairwise sequence identity values (supplementary file S10).
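  As noted earlier, the `R` scripts only print to the terminal; when running them via `Rscript`, a log can be kept by redirection, e.g.:

  ```bash
  #Run a non-SLURM R script and capture stdout+stderr in a log file.
  Rscript 13_rscript_lenfilt_new_mainjob.R > 13_rscript_lenfilt_new_mainjob.log 2>&1;
  ```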
- As can be seen from some of the script names above (e.g., `21a_rscript_ccsel_main.R`), quite a few scripts use the `R` programming language (`v3.6.0` or greater) and concomitant packages. An exhaustive list of the packages used in these scripts (and their versions) can be found in the file `r_packages_used.csv` in the `scripts/` directory. Accessing the scripts through `RStudio` should highlight missing packages automatically and present an option to install them automatically; this does not work for the `Bioconductor` packages, nor for packages from `GitHub` (e.g., `seqvisr`), which must be installed manually (see the sketch below).
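  A sketch of such manual installation (the `Biostrings` example and the `seqvisr` repository path are assumptions; consult `r_packages_used.csv` for the authoritative package list):

  ```bash
  #Manual installation of Bioconductor and GitHub packages from the shell;
  #package names and the GitHub path are assumptions -- see r_packages_used.csv.
  Rscript -e 'install.packages(c("BiocManager", "remotes"))';
  Rscript -e 'BiocManager::install("Biostrings")';
  Rscript -e 'remotes::install_github("vragh/seqvisr")';
  ```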
- Very important: some scripts contain additional post-processing commands built into them. For instance, the script `06_trinity_mainjob.bash`, which runs the `Trinity` de novo transcriptome assembler, has a `SeqKit` call after the main tool's execution call to filter out all assembled contigs shorter than 200 nucleotides (this works around what currently appears to be a `Trinity` bug: 200 nucleotides is set as the minimum length, but shorter sequences are emitted anyway; see the sketch below). Likewise, the script that executes the `TransDecoder` sequence translation tool has quite a few post-processing commands built in. We have not provided an overview of which post-processing is built into which script, and urge the user to take note of such instances by reading the code.
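  The length filter in question amounts to something like the following (a sketch; the exact invocation and filenames in `06_trinity_mainjob.bash` may differ):

  ```bash
  #Drop assembled contigs shorter than 200 nucleotides from the Trinity output.
  seqkit seq -m 200 Trinity.fasta > Trinity.minlen200.fasta;
  ```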
- Databases: most databases used have versioning bound to their tools (e.g., `SortMeRNA` and its RNA database, `Kraken2` and the `PlusPFP` DB). The only real "external" databases used are `UniProt/SwissProt` `v2021_03` (for transcriptome annotation via `MMseqs2`), and the three reference proteomes from `UniProt` used by `OrthoFinder` to find the circadian clock proteins. These are `Danaus_plexippus_UP000007151_278856.fasta`, `Drosophila_melanogaster_UP000000803_7227.fasta`, and `Mus_musculus_UP000000589_10090.fasta` (all downloaded on or before March 2021). Copies of these are in the `of_ref_prep/proteomes` directory, so they do not have to be obtained afresh to replicate this workflow.
- In the interest of easing the burden of reproducing the results here, the most important database resources have been provided under the directory `databases_used/` in this repository. Place the `cc_orthofinder_dbs/` directory from here directly under `newrun/`, as indicated in the directory tree earlier above. The `UniProt/SwissProt` `FASTA` file is also provided here under `databases_used/` as an `xz`-compressed file. This must be decompressed (with `xz -vd <filename>`) and placed at a convenient location (e.g., directly under `newrun/`), and this path then edited into line 13 of `16b_mmseqs2_annot_mainjob.bash` prior to execution. We have not included the full set of proteomes used for the comparative analysis (see the note on `25_rscript_of_refcomp_prep.R` and `26_of_refcomp_mainjob.bash` above) as these are too large to be hosted by us, and are just reference proteomes downloaded from `UniProt`.
- The scripts and data in the `scripts/of_ref_prep` sub-directory in this collection need not necessarily be run; their outputs are already present (the `of_ref_prep/final_ref_orthofinder` directory). These constitute the set of reference proteomes supplied to `OrthoFinder` to identify circadian clock protein candidates in the other input proteomes, by means of pairwise orthology to select bait sequences (appropriately highlighted by annotations) within this set of reference proteomes.
- The script (and data) in the `scripts/ncbi_assem_counts` directory are not a part of the analytical workflow. These were used to estimate how many of the sequenced species had genomes/transcriptomes available on NCBI.
- Similarly, the script and data in the `scripts/species_tree` directory are also not a part of the analytical workflow. These were used to generate the species tree figure included in the manuscript.
- `supplementary_files` contains a copy of the supplementary materials submitted along with the manuscript.
- `ncbi_submission` contains some material concerning the NCBI submission for this project. Data uploaded to NCBI can be found under the accession code `PRJNA824716`.