The goal of this project is to use model-based statistical inference to identify the sequence-level determinants of nucleotide trimming during V(D)J recombination of adaptive immune receptor loci.
All analyses are based on R 4.1.3. The required R packages can be installed via miniconda:
conda env create -f environment.yml
conda activate mechanistic-trimming
You will also need to install IGoR (Marcou et al., Nature Communications 2018) if you wish to annotate sequences using IGoR.
Most of these analyses can be run on any machine. However, some of the data preparation steps, such as sequence annotation using IGoR, are computationally intensive and require a cluster to run efficiently. The sequence annotation script is written specifically for a cluster set up to use the Slurm job scheduler. (Some minor modifications to the sequence annotation script would allow this step to be run locally or with a different cluster workload manager.)
Table of Contents:
- Model training
- Quantifying coefficient significance
- Model evaluation
- Model validation
- Plot results
- Supplementary analyses
## Model training

0. Download the training cohort data set using the link provided in the original publication.
1. Annotate these sequences using IGoR by running the annotation script. As described above, this script is written specifically for a cluster set up to use the Slurm job scheduler. It takes five arguments:
   - the raw file path (wherever you downloaded the files in step 0)
   - a temporary directory location (wherever you want the intermediate annotation files to be stored)
   - an output directory location (wherever you want the final annotation files to be stored); you will eventually save this location as `TCR_REPERTOIRE_DATA_igor` within the config file in step 3
   - the number of possible rearrangement scenarios you want to sample for each sequence (we used 10 scenarios)
   - the number of CPUs you want to use
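As a concrete illustration, the five annotation arguments could be assembled as below. This is only a sketch: the script name `annotate_with_igor.sh`, the paths, and the CPU count are placeholders (not names from this repository), and the command is echoed rather than executed.

```shell
# Hypothetical sketch; "annotate_with_igor.sh" and all paths are placeholders.
args=(
  /path/to/raw_data     # 1. raw file path (from step 0)
  /path/to/temp_dir     # 2. temporary directory for intermediate annotation files
  /path/to/igor_output  # 3. output directory (later saved as TCR_REPERTOIRE_DATA_igor)
  10                    # 4. rearrangement scenarios sampled per sequence
  8                     # 5. number of CPUs
)
# Echo the Slurm submission instead of running it, since the names are placeholders:
echo sbatch annotate_with_igor.sh "${args[@]}"
```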
2. Download TRB gene names and germline sequences from IMGT (we downloaded these data in December 2021); save these data to a file. You will eventually save the location of this file as `WHOLE_NUCSEQS_igor` within the config file in step 3.
3. Edit the config file to be project- and/or computer-specific. See the README for more details.
4. Train the model using the model fitting script. This script takes 12 arguments and can be run locally or on a cluster:
   - the annotation type; for the general model, this should be `igor`
   - the trimming type for model fitting (either `v_trim` or `j_trim`)
   - the productivity of the sequences (either `productive`, `nonproductive`, or `both`)
   - the motif type; for the general model, this should be `unbounded`
   - the number of CPUs you want to use
   - the model group (either `all_subjects` or `individual_subjects`, depending on whether you want to train a model across all subjects or a unique model for each individual)
   - the gene weight type (either `p_gene_marginal` or `p_gene_given_subject`); for the general model, use `p_gene_marginal` to ensure that all genes are weighted the same across all subjects in the construction of the likelihood
   - the left motif count (an integer between 0 and 6), specifying the number of nucleotides 5' of the trimming site included in the motif parameter (for models without motif parameters, this should be 0)
   - the right motif count (an integer between 0 and 6), specifying the number of nucleotides 3' of the trimming site included in the motif parameter (for models without motif parameters, this should be 0)
   - the upper trim bound (the upper bound on trimming lengths); for the general model, we use a bound of 14
   - the model type; see the table below for model options
   - (optional) for models with base-count terms, the number of nucleotides to be included in the base-count 5' of the trimming site; for the general model, we use 10 nucleotides
This trained model will be stored in the models directory.
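As an illustration, the 12 training arguments might be passed as below. The script name `fit_model.R` is a placeholder (use the actual model fitting script from this repository), the CPU count is an example value, and the command is echoed rather than executed.

```shell
# Hypothetical sketch; "fit_model.R" is a placeholder script name.
args=(
  igor                              # 1. annotation type
  v_trim                            # 2. trimming type
  nonproductive                     # 3. productivity of sequences
  unbounded                         # 4. motif type
  4                                 # 5. number of CPUs (example value)
  all_subjects                      # 6. model group
  p_gene_marginal                   # 7. gene weight type
  2                                 # 8. left motif count
  2                                 # 9. right motif count
  14                                # 10. upper trim bound
  motif_two-side-base-count-beyond  # 11. model type
  10                                # 12. base-count window (optional argument)
)
echo Rscript fit_model.R "${args[@]}"
```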
Here is a summary of some of the model options. You can find a description of additional model type options here
Long model name (same as manuscript) | Code argument | Features parameterized | Other notes
---|---|---|---
null | `null` | none | |
motif | `motif` | a sequence motif | the motif size is specified by the `left_motif_count` and `right_motif_count` arguments
DNA-shape | `dna_shape-std` | DNA-shape parameters for nucleotides contained in a specified sequence window | the sequence window size is specified by the `left_motif_count` and `right_motif_count` arguments
distance | `linear-distance` | sequence-independent distance from the end of the gene (integer-valued) | |
two-side base-count | `two-side-base-count` | count of AT and GC nucleotides on either side of the trimming site | the total number of nucleotides included in the 5' AT and GC counts is specified by the 12th argument of the model training script
motif + distance | `motif_linear-distance` | a sequence motif and sequence-independent distance from the end of the gene (integer-valued) | the motif size is specified by the `left_motif_count` and `right_motif_count` arguments
motif + two-side base-count beyond | `motif_two-side-base-count-beyond` | a sequence motif and the count of AT and GC nucleotides on either side of the trimming site (including only nucleotides outside of the motif) | the motif size is specified by the `left_motif_count` and `right_motif_count` arguments, and the total number of nucleotides included in the 5' AT and GC counts is specified by the 12th argument of the model training script
## Quantifying coefficient significance

If you would like to measure the significance of each model coefficient, you can use this script. It takes 11 arguments:

- the annotation type; for the general model, this should be `igor`
- the trimming type for model fitting (either `v_trim` or `j_trim`)
- the productivity of the sequences (either `productive`, `nonproductive`, or `both`)
- the motif type; for the general model, this should be `unbounded`
- the number of CPUs you want to use
- the gene weight type (either `p_gene_marginal` or `p_gene_given_subject`); for the general model, use `p_gene_marginal` to ensure that all genes are weighted the same across all subjects in the construction of the likelihood
- the left motif count (an integer between 0 and 6), specifying the number of nucleotides 5' of the trimming site included in the motif parameter (for models without motif parameters, this should be 0)
- the right motif count (an integer between 0 and 6), specifying the number of nucleotides 3' of the trimming site included in the motif parameter (for models without motif parameters, this should be 0)
- the upper trim bound (the upper bound on trimming lengths); for the general model, we use a bound of 14
- the model type; see the table above for model options
- (optional) for models with base-count terms, the number of nucleotides to be included in the base-count 5' of the trimming site; for the general model, we use 10 nucleotides
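These arguments mirror the model training arguments, minus the model group. A hypothetical invocation sketch, with `quantify_significance.R` as a placeholder script name and the command echoed rather than executed:

```shell
# Hypothetical sketch; "quantify_significance.R" is a placeholder script name.
args=(
  igor                              # 1. annotation type
  v_trim                            # 2. trimming type
  nonproductive                     # 3. productivity of sequences
  unbounded                         # 4. motif type
  4                                 # 5. number of CPUs (example value)
  p_gene_marginal                   # 6. gene weight type
  2                                 # 7. left motif count
  2                                 # 8. right motif count
  14                                # 9. upper trim bound
  motif_two-side-base-count-beyond  # 10. model type
  10                                # 11. base-count window (optional argument)
)
echo Rscript quantify_significance.R "${args[@]}"
```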
This analysis will save a file containing the results in a directory called `model_bootstrap` located in the indicated `OUTPUT_PATH`, as specified in the config file.
## Model evaluation

If you would like to evaluate the performance of a model using various subsets of the training data set, you can use the model evaluation script. This script takes 12 arguments, which are similar to those of the model training script:

- the annotation type; for the general model, this should be `igor`
- the trimming type for model fitting (either `v_trim` or `j_trim`)
- the productivity of the sequences (either `productive`, `nonproductive`, or `both`)
- the motif type; for the general model, this should be `unbounded`
- the number of CPUs you want to use
- the gene weight type (either `p_gene_marginal` or `p_gene_given_subject`); for the general model, use `p_gene_marginal` to ensure that all genes are weighted the same across all subjects in the construction of the likelihood
- the left motif count (an integer between 0 and 6), specifying the number of nucleotides 5' of the trimming site included in the motif parameter (for models without motif parameters, this should be 0)
- the right motif count (an integer between 0 and 6), specifying the number of nucleotides 3' of the trimming site included in the motif parameter (for models without motif parameters, this should be 0)
- the upper trim bound (the upper bound on trimming lengths); for the general model, we use a bound of 14
- the model type; see the table above for model options
- the model evaluation type; see the table below for evaluation options
- (optional) for models with base-count terms, the number of nucleotides to be included in the base-count 5' of the trimming site; for the general model, we use 10 nucleotides
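A hypothetical invocation sketch, with `evaluate_model.R` as a placeholder script name and the command echoed rather than executed:

```shell
# Hypothetical sketch; "evaluate_model.R" is a placeholder script name.
args=(
  igor                              # 1. annotation type
  v_trim                            # 2. trimming type
  nonproductive                     # 3. productivity of sequences
  unbounded                         # 4. motif type
  4                                 # 5. number of CPUs (example value)
  p_gene_marginal                   # 6. gene weight type
  2                                 # 7. left motif count
  2                                 # 8. right motif count
  14                                # 9. upper trim bound
  motif_two-side-base-count-beyond  # 10. model type
  expected_log_loss                 # 11. model evaluation type
  10                                # 12. base-count window (optional argument)
)
echo Rscript evaluate_model.R "${args[@]}"
```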
All output files will be located in a directory called `temp_evaluation` within the `OUTPUT_PATH`, as specified in the config file.
Here is a summary of the model evaluation arguments. More details can be found here.

Long evaluation name (same as manuscript) | Code argument | Summary of held-out group
---|---|---
full V-gene training data set | `log_loss` | Loss computed across the full training data set
many held-out subsets of the V-gene training data set | `expected_log_loss` | Expected loss computed across 20 different random held-out subsets of the full training data set
"most different" cluster of V-genes (terminal sequences) | `v_gene_family_loss` | Loss computed across the group of V-genes that are most different according to the Hamming distances of the last 25 nucleotides of each sequence
"most different" cluster of V-genes (full sequences) | `full_v_gene_family_loss` | Loss computed across the group of V-genes that are most different according to the Hamming distances of each full sequence
full J-gene data set | `log_loss_j_gene` | Loss computed across the full J-gene data set from the training data set
## Model validation

If you would like to use the model to make predictions on a new data set and/or validate the model on a testing data set, you can follow these steps:

1. Download the processed testing data set, and make sure it contains the same column names as the training data set.
2. Depending on the locus of the testing data, download the appropriate gene names and germline sequences from IMGT (we downloaded these data in August 2022); save these data to a file. You will save the location of this file in the next step.
3. Edit the config file to be project- and/or computer-specific (be sure to add the path to the file from the last step).
4. Run the model validation script. This script takes 17 arguments, which specify the model parameters and the validation data set specifics:
   - the annotation type; for the general model, this should be `igor`
   - the trimming type used in model fitting (either `v_trim` or `j_trim`)
   - the productivity of the sequences used in model fitting (either `productive`, `nonproductive`, or `both`)
   - the motif type; for the general model, this should be `unbounded`
   - the number of CPUs you want to use
   - the model group (either `all_subjects` or `individual_subjects`, depending on whether the model was trained across all subjects or uniquely for each individual)
   - the gene weight type used for model training (either `p_gene_marginal` or `p_gene_given_subject`); for the general model, use `p_gene_marginal` to ensure that all genes are weighted the same across all subjects in the construction of the likelihood
   - the left motif count (an integer between 0 and 6), specifying the number of nucleotides 5' of the trimming site included in the motif parameter (for models without motif parameters, this should be 0)
   - the right motif count (an integer between 0 and 6), specifying the number of nucleotides 3' of the trimming site included in the motif parameter (for models without motif parameters, this should be 0)
   - the upper trim bound (the upper bound on trimming lengths); for the general model, we use a bound of 14
   - the model type; see the table above for model options
   - for models with base-count terms, the number of nucleotides to be included in the base-count 5' of the trimming site; for the general model, we use 10 nucleotides (you can specify `NA` otherwise)
   - the directory storing the validation data set
   - the validation type; see the summary below for details
   - the validation trimming type (either `v_trim` or `j_trim`); this does not need to be the same as the trimming type the model was trained with
   - the productivity of the sequences for model validation (either `productive`, `nonproductive`, or `both`)
   - the gene weight type used for loss calculation (either `p_gene_marginal` or `p_gene_given_subject`); for the general model, use `p_gene_given_subject`
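A hypothetical invocation sketch of the 17 validation arguments, with `validate_model.R` and the data path as placeholders and the command echoed rather than executed:

```shell
# Hypothetical sketch; "validate_model.R" and the path are placeholders.
args=(
  igor                              # 1. annotation type
  v_trim                            # 2. trimming type used in model fitting
  nonproductive                     # 3. productivity used in model fitting
  unbounded                         # 4. motif type
  4                                 # 5. number of CPUs (example value)
  all_subjects                      # 6. model group
  p_gene_marginal                   # 7. gene weight type used for training
  2                                 # 8. left motif count
  2                                 # 9. right motif count
  14                                # 10. upper trim bound
  motif_two-side-base-count-beyond  # 11. model type
  10                                # 12. base-count window (or NA)
  /path/to/validation_data          # 13. validation data set directory
  validation_data_alpha             # 14. validation type (TRA locus example)
  v_trim                            # 15. validation trimming type
  nonproductive                     # 16. validation productivity
  p_gene_given_subject              # 17. gene weight type for loss calculation
)
echo Rscript validate_model.R "${args[@]}"
```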
All output files will be located in a directory called `temp_evaluation` within the `OUTPUT_PATH`, as specified in the config file.
Summary of the model validation data sets used in our analysis:

Locus name | Code argument | Data used in the manuscript
---|---|---
TRB locus | `validation_data_beta` | TRB testing data set; processed data can be downloaded here
TRA locus | `validation_data_alpha` | TRA testing data set; processed data can be downloaded here
TRG locus | `validation_data_gamma` | TRG testing data set; processed data can be downloaded here
TRD locus | `validation_data_delta` | TRD not analyzed in the manuscript
IGH locus | `validation_data_igh` | productive IGH testing data set can be downloaded here; nonproductive IGH testing data set can be downloaded here
## Plot results

Plot figures from the manuscript. Have a look at the plotting README for more details.
## Supplementary analyses

If you would like to measure the relative weights of the motif and base-count-beyond parameters within the motif + two-side base-count-beyond model for a specific validation data set, you can run this script locally or on a cluster. It takes 15 arguments:

- the annotation type; for the general model, this should be `igor`
- the trimming type used in model fitting (either `v_trim` or `j_trim`)
- the productivity of the sequences used in model fitting (either `productive`, `nonproductive`, or `both`)
- the motif type; for the general model, this should be `unbounded`
- the number of CPUs you want to use
- the model group (either `all_subjects` or `individual_subjects`, depending on whether the model was trained across all subjects or uniquely for each individual)
- the gene weight type used for model training (either `p_gene_marginal` or `p_gene_given_subject`); for the general model, use `p_gene_marginal` to ensure that all genes are weighted the same across all subjects in the construction of the likelihood
- the left motif count (an integer between 0 and 6), specifying the number of nucleotides 5' of the trimming site included in the motif parameter (for models without motif parameters, this should be 0)
- the right motif count (an integer between 0 and 6), specifying the number of nucleotides 3' of the trimming site included in the motif parameter (for models without motif parameters, this should be 0)
- the upper trim bound (the upper bound on trimming lengths); for the general model, we use a bound of 14
- the number of nucleotides to be included in the base-count 5' of the trimming site; for the general model, we use 10 nucleotides
- the directory storing the validation data set
- the validation type; see the summary above for details
- the validation trimming type (either `v_trim` or `j_trim`); this does not need to be the same as the trimming type the model was trained with
- the productivity of the sequences for model validation (either `productive`, `nonproductive`, or `both`)
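A hypothetical invocation sketch of the 15 arguments, with `relative_importance.R` and the data path as placeholders and the command echoed rather than executed:

```shell
# Hypothetical sketch; "relative_importance.R" and the path are placeholders.
args=(
  igor                      # 1. annotation type
  v_trim                    # 2. trimming type used in model fitting
  nonproductive             # 3. productivity used in model fitting
  unbounded                 # 4. motif type
  4                         # 5. number of CPUs (example value)
  all_subjects              # 6. model group
  p_gene_marginal           # 7. gene weight type used for training
  2                         # 8. left motif count
  2                         # 9. right motif count
  14                        # 10. upper trim bound
  10                        # 11. base-count window
  /path/to/validation_data  # 12. validation data set directory
  validation_data_alpha     # 13. validation type (TRA locus example)
  v_trim                    # 14. validation trimming type
  nonproductive             # 15. validation productivity
)
echo Rscript relative_importance.R "${args[@]}"
```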
All output files will be located in a directory called `rel_importance` within the `OUTPUT_PATH`, as specified in the config file.
If you would like to evaluate whether model coefficients vary in the context of an Artemis locus SNP (rs41298872), you can follow the instructions in the quantifying coefficient significance section using `motif_two-side-base-count-beyond_snp-interaction-20717849` as the model type.
With this analysis, we want to quantify the sequence-level determinants of nucleotide trimming during V(D)J recombination. See the manuscript for specific model and methods details:
Russell, M. L., Simon, N., Bradley, P., & Matsen, F. A. (2022). Statistical inference reveals the role of length, breathing, and nucleotide identity in V(D)J nucleotide trimming. In bioRxiv (p. 2022.12.08.519635). https://doi.org/10.1101/2022.12.08.519635
The following packages were especially helpful in our analyses:
- data.table (Dowle and Srinivasan, 2021)
- tidyverse (Wickham et al., 2019)
- doParallel (Microsoft Corporation and Steve Weston, 2020)
- cowplot (Wilke, 2020)
- mclogit (Elff, 2021)
- Biostrings (Pagès et al., 2021)
- DECIPHER (Wright, 2016)
- ape (Paradis and Schliep, 2019)