mechanistic-trimming

The goal of this project is to use model-based statistical inference to identify the sequence-level determinants of nucleotide trimming during V(D)J recombination of adaptive immune receptor loci.

Install

Everything R 4.1.3 based. R packages that are required can be installed via miniconda:

conda env create -f environment.yml
conda activate mechanistic-trimming

You will also need to install IGoR if you wish to annotate sequences using IGoR (Marcou et.al Nature Communications 2018)

Requirements:

Most of these analyses can be run on any machine. However, some of the data preparation steps, such as sequence annotation using IGoR (Marcou et.al Nature Communications 2018), are computationally intensive and require a cluster to run efficiently. This sequence annotation script is written specifically for a cluster set up to use the Slurm job scheduler. (Some minor modifications to the sequence annotation script could allow this step to be run locally or using a different cluster workload manager.

Analysis outline:

Table of Contents:

Model training
- Summary of model types
Quantifying coefficient significance
Model evaluation
- Summary of model evaluation options
Model validation
- Summary of model validation options
Plot results
Supplementary analyses
- Meaures the relative weights of parameters across different testing data sets
- Artemis SNP analysis

Model training

Download the training cohort data set using the link provided in the original publication
Annotate these sequences using IGoR. You can run the annotation script to do this. As described above, this script is written specifically for a cluster set up to use the Slurm job scheduler. This script takes five arguments:
1. the raw file path (wherever you downloaded the files in step 0 above)
2. a temporary directory location (wherever you want the intermediate annotation files to be stored)
3. an output directory location (wherever you want the final annotation files to be stored--you will eventually save this location as TCR_REPERTOIRE_DATA_igor within the config file in step 3)
4. the number of possible rearrangment scenarios you want to sample from for each sequence (we used 10 scenarios)
5. the number of CPUs you want to use
Download TRB gene name and germline sequences from IMGT (we downloaded these data in December 2021); save these data to a file--you will eventually save the location of this file as WHOLE_NUCSEQS_igor within the config file in step 3
Edit config file to be project and/or computer specific. See the README for more details.
Train model using the model fitting script. This script takes 12 arguments, and can be run locally or on a cluster:
1. the annotation type--for the general model, this should be igor
2. the trimming type for model fitting (either v_trim or j_trim)
3. productivity of sequences (either productive, nonproductive, or both)
4. motif type--for the general model, this should be unbounded
5. the number of CPUs you want to use
6. model group (either all_subjects or individual_subjects depending on whether you want to train a model across all subjects, or a unique model for each individual)
7. gene weight type (either p_gene_marginal or p_gene_given_subject); for the general model use p_gene_marginal--this will ensure that all genes are weighted the same across all subjects in the construction of the likelihood
8. left motif count (an integer between 0 and 6)--specifies the number of nucleotides 5' of the trimming site to be included in the motif parameter (for models without motif parameters, this should be 0)
9. right motif count (an integer between 0 and 6)--specifies the number of nucleotides 3' of the trimming site to be included in the motif parameter (for models without motif parameters, this should be 0)
10. upper trim bound--the upper bound on trimming lengths; for the general model, we use a bound of 14
11. model type--see the table below for model options
12. (optional) for models with base-count terms, specify the number of nucleotides to be included in the base-count 5' of the trimming site; for the general model, we use 10 nucleotides

This trained model will be stored in the models directory.

Model types

Here is a summary of some of the model options. You can find a description of additional model type options here

Long model name (same as manuscript)	Code argument	Features parameterized	Other notes
null	`null`	none
motif	`motif`	a sequence motif	the motif size is specified by the `left_motif_count` and `right_motif_count` arguments
DNA-shape	`dna_shape-std`	DNA-shape parameters for nucleotides contained in a specified sequence window	the sequence window size is specified by the `left_motif_count` and `right_motif_count` arguments
distance	`linear-distance`	sequence-independent distance from the end of the gene (integer-valued)
two-side base-count	`two-side-base-count`	count of AT and GC nucleotides on either side of the trimming site	the total number of nucleotides included in the 5' AT and GC counts is specified by the 12th argument of the model training script
motif + distance	`motif_linear-distance`	a sequence motif and sequence-independent distance from the end of the gene (integer-valued)	the motif size is specified by the `left_motif_count` and `right_motif_count` arguments
motif + two-side base-count beyond	`motif_two-side-base-count-beyond`	a sequence motif and the count of AT and GC nucleotides on either side of the trimming site (including only nucleotides outside of the motif)	the motif size is specified by the `left_motif_count` and `right_motif_count` arguments and the total number of nucleotides included in the 5' AT and GC counts is specified by the 12th argument of the model training script

Quantifying coefficient significance

If you would like to measure the significance of each model coefficient, you can use this script. This script takes 11 arguments:

the annotation type--for the general model, this should be igor
the trimming type for model fitting (either v_trim or j_trim)
productivity of sequences (either productive, nonproductive, or both)
motif type--for the general model, this should be unbounded
the number of CPUs you want to use
gene weight type (either p_gene_marginal or p_gene_given_subject); for the general model use p_gene_marginal--this will ensure that all genes are weighted the same across all subjects in the construction of the likelihood
left motif count (an integer between 0 and 6)--specifies the number of nucleotides 5' of the trimming site to be included in the motif parameter (for models without motif parameters, this should be 0)
right motif count (an integer between 0 and 6)--specifies the number of nucleotides 3' of the trimming site to be included in the motif parameter (for models without motif parameters, this should be 0)
upper trim bound--the upper bound on trimming lengths; for the general model, we use a bound of 14
model type--see the table below for model options
(optional) for models with base-count terms, specify the number of nucleotides to be included in the base-count 5' of the trimming site; for the general model, we use 10 nucleotides

This analysis will save a file containing the results in a directory called model_bootstrap located in the indicated OUTPUT_PATH as specified in the config files

Model evaluation

If you would like to evaluate the performance of a model using various subsets of the training data set, you can use the model evaluation script. This script also takes 12 arguments which are similar to the model training script:

the annotation type--for the general model, this should be igor
the trimming type for model fitting (either v_trim or j_trim)
productivity of sequences (either productive, nonproductive, or both)
motif type--for the general model, this should be unbounded
the number of CPUs you want to use
gene weight type (either p_gene_marginal or p_gene_given_subject); for the general model use p_gene_marginal--this will ensure that all genes are weighted the same across all subjects in the construction of the likelihood
left motif count (an integer between 0 and 6)--specifies the number of nucleotides 5' of the trimming site to be included in the motif parameter (for models without motif parameters, this should be 0)
right motif count (an integer between 0 and 6)--specifies the number of nucleotides 3' of the trimming site to be included in the motif parameter (for models without motif parameters, this should be 0)
upper trim bound--the upper bound on trimming lengths; for the general model, we use a bound of 14
model type--see the table above for model options
model evaluation type--see the table below for evaluation options
(optional) for models with base-count terms, specify the number of nucleotides to be included in the base-count 5' of the trimming site; for the general model, we use 10 nucleotides

All output files will be located in a directory called temp_evaluation within the OUTPUT_PATH as specified in the config file

Model evaluation options

Here is a summary of the model evaluation arguments. More details can be found here

Long evaluation name (same as manuscript)	Code argument	Summary of held-out group
full V-gene training data set	`log_loss`	Loss computed across the full training data set
many held-out subsets of the V-gene training data set	`expected_log_loss`	Expected loss computed across 20 random different held-out subsets of the full training data set
"most different" cluster of V-genes (terminal sequences)	`v_gene_family_loss`	Loss computed across the group of V-genes which are most-different according to the hamming distances of the last 25 nucleotides of each sequence
"most different" cluster of V-genes (full sequences)	`full_v_gene_family_loss`	Loss computed across the group of V-genes which are most-different according to the hamming distances of each full sequence
full J-gene data set	`log_loss_j_gene`	Loss computed across the full J-gene data set from the training data set

Model validation

If you would like to use the model to make predictions on a new data set and/or validate the model on a testing data set, you can follow these steps:

Download the processed testing data set, and make sure it contains the same column names as the training data set
Depending on the locus of the testing data, download the appropriate gene name and germline sequences from IMGT (we downloaded these data in August 2022); save these data to a file--you will save the location of this file in the next step
Edit config file to be project and/or computer specific (be sure to add the path to the file from the last step)
Run the model validation script. This script takes 17 arguments which force you to specify model parameters and validation data set specifics:
1. the annotation type--for the general model, this should be igor
2. the trimming type used in model fitting (either v_trim or j_trim)
3. productivity of sequences used in model fitting (either productive, nonproductive, or both)
4. motif type--for the general model, this should be unbounded
5. the number of CPUs you want to use
6. model group (either all_subjects or individual_subjects depending on whether the model was trained across all subjects, or uniquely for each individual)
7. gene weight type used for model training (either p_gene_marginal or p_gene_given_subject); for the general model use p_gene_marginal--this will ensure that all genes are weighted the same across all subjects in the construction of the likelihood
8. left motif count (an integer between 0 and 6)--specifies the number of nucleotides 5' of the trimming site to be included in the motif parameter (for models without motif parameters, this should be 0)
9. right motif count (an integer between 0 and 6)--specifies the number of nucleotides 3' of the trimming site to be included in the motif parameter (for models without motif parameters, this should be 0)
10. upper trim bound--the upper bound on trimming lengths; for the general model, we use a bound of 14
11. model type--see the table below for model options
12. for models with base-count terms, specify the number of nucleotides to be included in the base-count 5' of the trimming site; for the general model, we use 10 nucleotides (you can specify NA otherwise)
13. the directory storing the validation data set
14. the validation type--see the summary below for details
15. validation trimming type (either v_trim or j_trim)--this does not need to be the same as what the model was trained with
16. productivity of sequences for model validation (either productive, nonproductive, or both)
17. gene weight type used for loss calculation (either p_gene_marginal or p_gene_given_subject); for the general use p_gene_given_subject

All output files will be located in a directory called temp_evaluation within the OUTPUT_PATH as specified in the config file

Model validation options

Summary of the model validation data sets used in our analysis.

Locus name	Code argument	Data used in the manuscript
TRB locus	`validation_data_beta`	TRB testing data set; processed data can be downloaded here
TRA locus	`validation_data_alpha`	TRA testing data set; processed data can be downloaded here
TRG locus	`validation_data_gamma`	TRG testing data set; processed data can be downloaded here
TRD locus	`validation_data_delta`	TRD not analyzed in the manuscript
IGH locus	`validation_data_igh`	productive IGH testing data set can be downloaded here; nonproductive IGH testing data set can be downloaded here

Plot results

Plot figures from the manuscript.

Also, have a look at the plotting README for more details.

Supplementary analyses

Parameter relative weight analysis

If you would like to measure the relative weights of the motif and base-count-beyond parameters within the motif + two-side base-count-beyond model for a specific validation data set, you can run this script locally or on a cluster. This script takes 15 arguments:

the annotation type--for the general model, this should be igor
the trimming type used in model fitting (either v_trim or j_trim)
productivity of sequences used in model fitting (either productive, nonproductive, or both)
motif type--for the general model, this should be unbounded
the number of CPUs you want to use
model group (either all_subjects or individual_subjects depending on whether the model was trained across all subjects, or uniquely for each individual)
gene weight type used for model training (either p_gene_marginal or p_gene_given_subject); for the general model use p_gene_marginal--this will ensure that all genes are weighted the same across all subjects in the construction of the likelihood
left motif count (an integer between 0 and 6)--specifies the number of nucleotides 5' of the trimming site to be included in the motif parameter (for models without motif parameters, this should be 0)
right motif count (an integer between 0 and 6)--specifies the number of nucleotides 3' of the trimming site to be included in the motif parameter (for models without motif parameters, this should be 0)
upper trim bound--the upper bound on trimming lengths; for the general model, we use a bound of 14
the number of nucleotides to be included in the base-count 5' of the trimming site; for the general model, we use 10 nucleotides
the directory storing the validation data set
the validation type--see the summary for details
validation trimming type (either v_trim or j_trim)--this does not need to be the same as what the model was trained with
productivity of sequences for model validation (either productive, nonproductive, or both)

All output files will be located in a directory called rel_importance within the OUTPUT_PATH as specified in the config file

Artemis SNP analysis

If you would like to evaluate whether model coefficients vary in the context of an Artemis locus SNP (rs41298872), you can follow the instructions in the quantifying coefficient significance section using motif_two-side-base-count-beyond_snp-interaction-20717849 as the model type.

About the analysis

With this analysis, we want to quantify the sequence-level determinants of nucleotide trimming during V(D)J recombination. See the manuscript for specific model and methods details:

Russell, M. L., Simon, N., Bradley, P., & Matsen, F. A. (2022). Statistical inference reveals the role of length, breathing, and nucleotide identity in V(D)J nucleotide trimming. In bioRxiv (p. 2022.12.08.519635). https://doi.org/10.1101/2022.12.08.519635

The following packages were especially helpful in our analyses:

data.table (Dowle and Srinivasan, 2021)
tidyverse (Wickham et. al, 2019)
doParallel (Corporation and Steve Weston, 2020)
cowplot (Wilke, 2020)
mclogit (Elff, 2021)
Biostrings (Pagès et. al, 2021)
DECIPHER (Wright, 2016)
ape (Paradis and Schliep, 2019)

Name		Name	Last commit message	Last commit date
Latest commit History 148 Commits
config		config
models		models
plotting_scripts		plotting_scripts
scripts		scripts
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
annotate_with_igor.sh		annotate_with_igor.sh
environment.yml		environment.yml
evaluate_coef_significance.sh		evaluate_coef_significance.sh
evaluate_model.sh		evaluate_model.sh
evaluate_rel_importance.sh		evaluate_rel_importance.sh
fit_model.sh		fit_model.sh
subsample_experiment.sh		subsample_experiment.sh
validate_model.sh		validate_model.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

mechanistic-trimming

Install

Requirements:

Analysis outline:

Model training

Model types

Quantifying coefficient significance

Model evaluation

Model evaluation options

Model validation

Model validation options

Plot results

Supplementary analyses

Parameter relative weight analysis

Artemis SNP analysis

About the analysis

About

Releases

Packages

Languages

License

magdalenarussell/mechanistic-trimming

Folders and files

Latest commit

History

Repository files navigation

mechanistic-trimming

Install

Requirements:

Analysis outline:

Model training

Model types

Quantifying coefficient significance

Model evaluation

Model evaluation options

Model validation

Model validation options

Plot results

Supplementary analyses

Parameter relative weight analysis

Artemis SNP analysis

About the analysis

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages