❕ This repository is a work in progress; a working example is provided, with instructions below ❕
The scripts in 'PheWAS-cluster' implement our approach of PheWAS-based clustering of Mendelian Randomisation instruments (PWC-MR). We used this approach to investigate the large estimated causal effect of body mass index (BMI) on educational attainment (EDU), -0.19 [-0.22, -0.16]. We hypothesise that horizontal pleiotropic effects (arising from heritable confounders, dynastic effects, genetic subtypes of obesity and other pleiotropic mechanisms; see panel a in the figure below) are biasing the estimated effect of BMI on educational attainment.
The main steps of the PWC-MR method (illustrated in panel b of the figure below) are:
- Instrument selection and PheWAS
- IV clustering
- Enrichment analysis and cluster specific MR
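To make the IV clustering step more concrete, here is a minimal sketch in R on simulated data. It takes the absolute values of a SNP x trait effect matrix, normalises each row (SNP), fits K-means for a range of cluster numbers and keeps the number with the lowest AIC-style score. Several details are assumptions rather than the repository's code: the simulated matrix, the choice of unit Euclidean norm for the row normalisation, the helper kmeans_aic() with the heuristic score tot.withinss + 2 * m * k, and the reduced range of 2 to 8 clusters. The scripts in the working example may use different definitions.

```r
set.seed(42)

# Simulated SNP x trait effect matrix (hypothetical data, not the repository's):
# 150 SNPs, 12 traits, three groups of SNPs acting on different trait blocks.
n_snps <- 150
n_traits <- 12
effects <- matrix(rnorm(n_snps * n_traits, sd = 0.2), n_snps, n_traits)
effects[1:50, 1:4] <- effects[1:50, 1:4] + 3
effects[51:100, 5:8] <- effects[51:100, 5:8] - 3
effects[101:150, 9:12] <- effects[101:150, 9:12] + 3

# Row-wise normalisation of the absolute effects.
# Assumption: each SNP's row is scaled to unit Euclidean norm.
abs_eff <- abs(effects)
norm_eff <- abs_eff / sqrt(rowSums(abs_eff^2))

# Heuristic AIC-style score for a k-means fit (assumed form):
# total within-cluster sum of squares plus 2 * (traits * clusters).
kmeans_aic <- function(fit) {
  m <- ncol(fit$centers)  # number of traits (dimensions)
  k <- nrow(fit$centers)  # number of clusters
  fit$tot.withinss + 2 * m * k
}

# Fit K-means for each candidate number of clusters and score each fit.
k_range <- 2:8
fits <- lapply(k_range, function(k) {
  kmeans(norm_eff, centers = k, nstart = 50, iter.max = 50)
})
aic <- sapply(fits, kmeans_aic)

# Keep the fit with the lowest score.
best_k <- k_range[which.min(aic)]
clusters <- fits[[which.min(aic)]]$cluster

print(data.frame(k = k_range, aic = aic))
print(table(clusters))
```

This heuristic score adds an unscaled sum of squares to a parameter-count penalty, so the selected number of clusters depends on the scale of the matrix. On real data it is worth inspecting the whole AIC curve rather than only its minimum.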
Please download the working-example folder and set it as your working directory in R. You will use two subfolders, plus one file to download from here (150 MB):
- data: this subfolder contains all the data files needed to run the example scripts (the main PWC-MR analysis). These are:
  - unstdBeta_df.csv, unstdSE_df.csv, tstat_df.csv and pval_df.csv: these data frames contain the effects, standard errors, t-statistics (beta/SE) and p-values, respectively, for the 348 genome-wide significant BMI SNPs across 408 traits (filtered for sample size > 50'000). To analyse other traits, these data frames must be obtained from either UKBB or PhenoScanner, with similar filtering criteria (sample size, no duplicate traits).
  - trait_info_nfil.csv: contains information on the 408 traits, including trait, description, effective sample size, variable_type...
  - fpaths_fil_nfil.txt: contains the file paths of the traits used. This file is needed simply to remove duplicate traits that have multiple versions (a UKBB artefact).
  - sig-clumped-IVs_21001.csv: contains the genome-wide significant, clumped BMI SNPs/IVs used for the TwoSampleMR causal effect estimates, both with all SNPs and with the SNPs of each cluster. The columns needed/included are: SNP, variant, beta, se, pval.exposure, effect_allele, other_allele, chr, tstat, N.
  - clumped-IVs_845.csv: contains the effects of the same BMI SNPs/IVs on the outcome of interest, in this case EDU (Age completed full time education). The columns needed/included are: SNP, variant, beta, se, pval, effect_allele, other_allele, chr, tstat, N.
- scripts: this subfolder has 2 main scripts, numbered in order of use. In both scripts, the '# variable set-up' section at the beginning should be updated so that the variable res_dir holds the correct path to the working-example directory. The scripts are:
  - 1_QC_filtering.R: this script reads in unstdBeta_df.csv, unstdSE_df.csv, tstat_df.csv, pval_df.csv, trait_info_nfil.csv and fpaths_fil_nfil.txt from the data subfolder. It then filters out traits that have NA effects, duplicate traits (specifically exposure), and traits whose genetic correlation with the exposure is > 0.75 (this threshold can be changed). The remaining SNP-trait effect matrix is then standardised, and SNPs are removed if they are more strongly associated with traits other than the exposure. Lastly, all the variables are saved into an '.RData' file in the main directory called QCdata_21001.Rdata, to be used in the second script.
  - 2_Clumping.R: this script reads in the previously created QCdata_21001.Rdata, as well as the SNP effects for the exposure and outcome traits to be used in TwoSampleMR: sig-clumped-IVs_21001.csv and clumped-IVs_845.csv. The script then normalises the absolute value of the SNP x trait effect matrix by row (SNP), and runs K-means clustering on the matrix after determining the best number of clusters (ranging from 2 to 50) using the AIC score. The SNPs in each cluster, as well as all the SNPs together, are then used to estimate the causal effect on the outcome. Lastly, an enrichment ratio is calculated for each trait across all the clusters, and the top 10 enriched traits for each cluster are written to an output file. The script produces several outputs, including plots of the AIC score for cluster numbers ranging from 2 to 50, the allocation of SNPs to clusters, and MR estimates for the clusters in .csv and .pdf format. The top 10 enriched traits for each cluster are also output in both .csv and .pdf format.
- Hail_AllxAll.csv: a data frame containing the genetic and phenotypic correlations of multiple UKBB traits, downloaded from Neale's lab here in October 2021.
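Genetic correlations such as those in this file are presumably the input to the exposure-genetic-correlation filter in 1_QC_filtering.R. A minimal sketch of such a filter is given below. The toy table, its column names (trait1, trait2, rg), the trait labels and the use of the absolute correlation are assumptions for illustration; they do not reflect the actual layout of Hail_AllxAll.csv or the exact rule in the script.

```r
# Toy genetic-correlation table (hypothetical names and values).
rg_df <- data.frame(
  trait1 = rep("exposure", 4),
  trait2 = c("trait_A", "trait_B", "trait_C", "trait_D"),
  rg     = c(0.98, -0.85, -0.30, 0.12)
)

traits <- c("exposure", "trait_A", "trait_B", "trait_C", "trait_D")
exposure <- "exposure"
rg_threshold <- 0.75  # the threshold quoted in the README; can be changed

# Traits whose genetic correlation with the exposure exceeds the threshold.
# Assumption: the filter is applied to the absolute correlation.
is_exposure_row <- rg_df$trait1 == exposure
too_close <- rg_df$trait2[is_exposure_row & abs(rg_df$rg) > rg_threshold]

# Drop those traits but keep the exposure itself.
kept_traits <- setdiff(traits, too_close)
print(kept_traits)
```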
This analysis was run entirely in R, under both version 4.2.2 (2022-10-31) and version 4.1.3 (2022-03-10).
The analysis takes 15 minutes to run on average.
The secondary analysis (a systematic confounder search using MR, followed by MVMR) is provided as non-custom scripts, which can be modified to include the proper paths for analysis.
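For readers who want to see what a cluster-specific MR estimate involves, the sketch below computes a fixed-effect inverse-variance-weighted (IVW) estimate for all SNPs and for each cluster, in base R on simulated summary statistics. The working example uses the TwoSampleMR package instead. The function ivw(), the simulated effects and the cluster labels are hypothetical, and IVW is only one standard MR estimator, not necessarily the one reported by the scripts.

```r
set.seed(1)

# Simulated summary statistics (hypothetical values, not the repository data).
n_snps <- 60
cluster <- rep(1:3, each = 20)               # hypothetical cluster labels
beta_exp <- runif(n_snps, 0.02, 0.08)        # SNP-exposure effects
true_effect <- c(-0.3, -0.1, 0.0)[cluster]   # cluster-specific causal effects
se_out <- rep(0.005, n_snps)                 # SNP-outcome standard errors
beta_out <- true_effect * beta_exp + rnorm(n_snps, sd = se_out)

# Fixed-effect IVW estimate: weighted regression of the SNP-outcome effects
# on the SNP-exposure effects through the origin, with weights 1 / se_y^2.
ivw <- function(bx, by, se_y) {
  w <- 1 / se_y^2
  est <- sum(w * bx * by) / sum(w * bx^2)
  se <- sqrt(1 / sum(w * bx^2))
  c(estimate = est, se = se)
}

# Estimate using all SNPs together.
all_snps <- ivw(beta_exp, beta_out, se_out)

# One estimate per cluster (rows = clusters, columns = estimate and se).
per_cluster <- t(sapply(split(seq_len(n_snps), cluster), function(idx) {
  ivw(beta_exp[idx], beta_out[idx], se_out[idx])
}))

print(all_snps)
print(per_cluster)
```

In this simulation the all-SNP estimate averages over clusters with different underlying effects, which is the situation the cluster-specific estimates are meant to expose.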