Skip to content

Latest commit

 

History

History
334 lines (218 loc) · 14.3 KB

README.md

File metadata and controls

334 lines (218 loc) · 14.3 KB

LN Phenocluster Manuscript Pipeline

Project Name: Lymphoid neoplasms (LN) phenocluster

Description: This repository contains the scripts and pipeline used in the LN phenocluster manuscript. Please follow the scripts listed below to generate the analysis results described in the manuscript.

Citation: "M. Guler & F. Canzian, Clustering of Lymphoid Neoplasms by Cell of Origin, Somatic Mutation and Drug Usage Profiles: A Multi-trait Genome-Wide Association Study, submitted, 2024"


Required Softwares

Software Version Link/Citation
regenie 3.2.1 regenie
plink 1.90b6.21 plink
plink2 2.00a3LM plink2
gps_cpp beta gps_cpp
R 4.3.0 or above R
METAL 2020-05-05 METAL
LDSC v1.0.1 LDSC
GWASLab v3.4.31 GWASLab

Required R Packages

R Package Version Link/Citation
ASSET 2.18.0 ASSET
tidyverse 2.0.0 tidyverse
ggplot2 3.1.3.1 gplots
dendextend 1.17.1 dendextend
data.table 1.14.10 data.table
parallel 4.3.0 parallel
magrittr 2.0.3 magrittr
colorspace 2.1.0 colorspace

Steps


1) Create phenoclusters

The data for somatic mutation patterns (SData1.txt) and approved drugs (SData2.txt) were accessed on cBioPortal and the Open Targets Platform, respectively.

For somatic mutations, each LN phenotype and somatically mutated genes were downloaded from cBioPortal. The LN subtypes-mutated gene binary matrix (0,1) was created based on detected mutated genes without their frequency or position, and the data folder created LN-mutated genes matrix provided as SData1.txt.

For the approved drugs, each phenotype was searched on the Open Targets platform, and each drug (at least phase 3 & 4) was manually searched on public databases to check FDA or EMA approval for clinical usage. The LN subtypes-drug binary matrix (0,1) was created based on this information, and the data folder created LN-drug matrix provided as SData2.txt.

After creating these two data files, hierarchical clustering (hclust) analysis methods were compared using the 00_Compare_hclust_method.R script. The script compares hierarchical clustering (hclust) algorithms by generating correlation plots and calculating the Fowlkes-Mallows Index. The resulting dendrograms are analyzed for cophenetic correlation (Pearson correlation coefficient) and Fowlkes-Mallows Index for different cluster counts (k=3, 4, 5). Finally, the script visualizes these correlations using correlation plots saved as TIFF images.

Note

For somatic mutation data method comparison done with nearly 10 000 genes that are mutated in more than 20% of subtypes becuase of computational restrictions. But, in the next step whole data set used.

After running the 00_Compare_hclust_method.R script, the selected hclust method is used in the 01_LNcluster.R script to generate and visualize phenoclusters. The script creates dendrograms using Ward's method and the Jaccard similarity coefficient (in R dist(method = "binary")). The dendrograms are then used to create heatmaps (heatmap.2 from the gplots package) that visualize relationships between drugs or genes and LN subtypes. The resulting heatmaps are saved as TIFF images (drug_plot.tiff and somatic_plot.tiff).

To run these scripts, use the following commands:

Rscript --slave --no-restore --no-save scripts/00_compare_hclust_methods.R
Rscript --slave --no-restore --no-save scripts/01_phenocluster.R

You can run these commands in your R terminal (not console). If you have access to any HPC, you can submit these scripts with a job runner script. The script 03_LNcluster.bsub is created for IBM LSF job scheduler. This script can easily be converted to commonly used schedulers such as SLURM or PBS. An example SLURM script is provided below:

#!/bin/bash
#SBATCH --partition=long                    # Set your partition
#SBATCH --job-name=LNcluster                # Name of the job
#SBATCH --ntasks=1                          # Request a core
#SBATCH --mem=32G                           # Request a total memory limit of 32GB for the job
#SBATCH --output=/your/out/dir/LN_cluster.out  # Give path for the out file
#SBATCH --error=/your/err/dir/LN_cluster.err   # Give path for the err file

# Load modules
module load R/4.3.0

# Set script dir
cd /scripts

# Rscript
Rscript --slave --no-restore --no-save scripts/00_compare_hclust_methods.R
Rscript --slave --no-restore --no-save scripts/01_phenocluster.R

List of script and their functions for the Step 1:

Script Function
00 Compare hclust methods
01 Create phenoclusters and visualize
03 Runner script for script 00 and 01


2) QC genetic data and GWAS with regenie

REGENIE uses a two-step approach. In the first step, original non-imputed genotype data is used, filtering only high-quality genotyped variants: minor allele frequency (MAF) > 1%, minor allele count (MAC) > 5, genotyping rate > 99%, Hardy-Weinberg equilibrium (HWE) test P > 1E−08, <1% missingness. The quality control of genotype data and filtering was done using plink2 software. The script 04_qc.sh combines all chromosomes and creates a list of SNPs that meet QC criteria.

bash scripts/04_qc.sh

After the QC step, REGENIE step 1 can be run:

bsub < scripts/05_regenie_step1.bsub -R "rusage[mem=32G]"

REGENIE step 1 will generate a ukb_step1_LM_pred.list file. Using this file and imputed genotype files (bgen), step 2 can be run:

bsub < scripts/06_regenie_step2.bsub -R "rusage[mem=32G]"

Step 2 will generate GWAS results for each phenotype separated by chromosomes. These files need to be merged by phenotype. To do this, run the merger script:

bash scripts/07_merge_regenie_outputs.sh

Finally, GWAS summary statistics can be filtered and formatted. The script 08_filter_format_sumstats.bsub creates unique SNP IDs (CHR:POS:REF), converts -log10P to P, and filters Info_score and SPA correction failed tests.

bsub < scripts/08_filter_format_sumstats.bsub -R "rusage[mem=32G]"

List of script and their functions for the Step 2:

Script Function
04 QC genotype for regenie step1
05 REGENIE STEP1
06 REGENIE STEP2
07 Merge REGENIE OUTPUTS
08 Filter and format SUMSTATS


3) ASSET and Meta-analysis with METAL

To identify pleiotropic variants in a hypothesis-free manner, we conducted an 'Association analysis based on the SubSETs' approach, called ASSET. ASSET is a collection of statistical methods designed to combine association signals from multiple studies or traits, especially when effects are present in only some studies and may be in opposite directions. This tool searches through all potential subsets of studies, adjusts for multiple testing, and identifies the most significant subset contributing to the overall association, while accounting for correlations due to overlapping participants. We ran the ASSET analysis using a custom R script, which enables parallel computation:

Rscript --slave --no-restore --no-save scripts/09_asset_parallel.R

To compare the results from ASSET, phenocluster, and traditional meta-analysis, we performed a meta-analysis using METAL. Since we have seven phenoclusters, we created seven separate scripts for METAL (10_s_metalLM(1-7).sh), and executed all these scripts with 10_Metal.bsub:

bsub < scripts/10_Metal.bsub -R "rusage[mem=8G]"

List of script and their functions for the Step 3:

Script Function
09 Parallel ASSET
10 Meta-analysis with METAL


4) GPS-GEV Test and LDSC

To understand the genetic correlation between LN subtypes and the created phenoclusters, we first performed linkage disequilibrium score regression (LDSC) using LDSC v1.0.1 software. The GWAS summary statistics for LN subtypes and phenoclusters were formatted using the munge_sumstats Python script from LDSC to estimate genetic correlation with HapMap3 variants, as recommended. For the genetic correlation estimation, SNPs with a minor allele frequency (MAF) of less than 5% were excluded, and the MHC region (chr6: 25-35 Mb) was also excluded from this analysis. We downloaded the pre-computed linkage disequilibrium (LD) scores for European ancestry from the Alkes Group website here.

a) GWASlab to format sumstats

bsub < scripts/11_gwaslab_ldsc.bsub -R "rusage[mem=16G]"

b) LDSC munge to format

bsub < scripts/12_ldsc_munge.bsub -R "rusage[mem=8G]"

c) LDSC

bsub < scripts/13_ldsc.bsub -R "rusage[mem=8G]"

d) Combine results

bash 14_combine_ldsc_results.sh

Due to our relatively small effective sample size (Neff = 4 / (1/Ncases + 1/Ncontrols)), most pairwise correlation tests failed. Therefore, we employed an alternative method called the genome-wide pairwise-association signal sharing (GPS) test [9]. While the GPS test does not provide a quantitative measurement of correlation like LDSC, it does offer evidence against the null hypothesis of bivariate independence.

Recently, the GPS was enhanced by fitting it with a generalized extreme value distribution (GEVD) instead of using the standard exponential transformation as initially proposed. We used the GPS-GEV test, excluding only the MHC region (chr6: 29,624,758-33,170,276). The GPS test statistic for P values from a pair of GWAS was computed using the computeGpsCLI application, and permuteTraitsCLI was used to generate null realizations of the GPS test statistic with 100 permutations. Finally, P values for the GPS statistic were obtained using the fit_gevd_and_compute_pvalue.R script, which fits a generalized extreme value distribution (GEVD) to the null realizations of the GPS statistic and reports a P value.

a) To compute GPS:

bsub < scripts/15_computeGPScli.bsub -R "rusage[mem=200G]"

b) To permute GPS:

bsub < scripts/16_permuteTraitsCLI.bsub -R "rusage[mem=200G]"

c) To fit GEVD and claculate P value: Note: Script 18_compute_pvalue.bsub uses 17_fit_gevd_compute_pvalue.R.

bsub < scripts/18_compute_pvalue.bsub -R "rusage[mem=200G]"
Script Function
LDSC
11 GWASlab
12 LDSC Munge
13 LDSC
14 Combine LDSC Results
GPS-GEV
15 Compute GPS
16 Permute GPS
17 Fit GEV-Pvalue R script
18 Runner for script 17


5) Create FUMA inputs

FUMA requires GWAS summary statistics to be in a specific format and under 600 MB in size. To meet these requirements, we utilized different custom R scripts tailored for Regenie, METAL, and ASSET GWAS summary statistics.

After obtaining the FUMA results for each set of GWAS summary statistics, the results need to be combined and summarized. This was accomplished using the script 22_fuma2functional.R.

Steps to Prepare FUMA Inputs:

a) Regenie to FUMA

Rscript --slave --no-restore --no-save scripts/19_regenie2fuma_input.R

b) METAL to FUMA

Rscript --slave --no-restore --no-save scripts/20_metal2fuma.R

c) ASSET to FUMA

Rscript --slave --no-restore --no-save scripts/21_asset2fuma.R

d) FUMA to summary functional tables

Rscript --slave --no-restore --no-save scripts/22_fuma2functional.R
Script Function
19 REGENIE2FUMA
20 [METAL2FUMA](scripts/20_metal2fuma.R
21 ASSET2FUMA
22 FUMA2FUNC


6) Plots

The results from the GPS-GEV test and LDSC were visualized using custom R scripts.

Steps to Generate Plots:

a) To generate corrplot for GPS-GEV and LDSC

Rscript --slave --no-restore --no-save scripts/23_gps_corrplot.R

b) To generate custom forest plots

Note: The script was adapted from Katherine Hoffman's blog. The source was last accessed on 30/08/2024.

Rscript --slave --no-restore --no-save scripts/24_forest_plot.R
Script Function
23 GPS_LDSC Corrplot
24 Forest plot


Note

This repository is created to enhance reproducibility. It will not be actively maintained.

Tip

If you need assistance at any step, please feel free to email me.

Important

It is not possible to share the full datasets used in this pipeline, such as UKBB data. Additionally, some steps require manual verification and the creation of data files.

Warning

This pipeline, or any part of it, should not be used for medical or diagnostic purposes.

Caution

If you use any part of this pipeline in your work, please make sure to cite the original work software/package and if possible our work.