Reproducing the simulations and real data analysis reported in “Doubly robust and computationally efficient high-dimensional variable selection”
Abhinav Chakraborty, Jeffrey Zhang, Eugene Katsevich
This repository contains code to reproduce the analyses reported in the paper “Doubly robust and computationally efficient high-dimensional variable selection” (arXiv, 2024).
Ensure that your system has the required software dependencies.
- R version 4.2.2 or higher
- Nextflow version 23.10.1 or higher
First, clone the symcrt2-manuscript repository onto your machine.
git clone git@github.com:Katsevich-Lab/symcrt2-manuscript.git
We used a config file to increase the portability of our code across machines. Create a config file called .research_config in your home directory.
cd
touch ~/.research_config
Define the following variables within this file:
- LOCAL_SYMCRT2_DATA_DIR: the location of the directory in which to store simulation results.
- LOCAL_EXTERNAL_DATA_DIR: the location of the directory in which to store external real data.
The contents of the .research_config file should look something like the following.
LOCAL_INTERNAL_DATA_DIR="/Users/jeffreyzhang/data/projects/"
LOCAL_SYMCRT2_DATA_DIR=$LOCAL_INTERNAL_DATA_DIR"symcrt2/"
LOCAL_EXTERNAL_DATA_DIR="/Users/jeffreyzhang/data/external/"
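Once the config file is in place, it can help to confirm that both variables are defined and that the directories exist before any data are downloaded. The following is a minimal sketch under that assumption; `setup_data_dirs` is a hypothetical helper, not part of the repository, and it also creates the `private` subdirectory that the download step later changes into.

```shell
# Hypothetical helper, not part of the repository: load a config file and
# create the data directories it names, including the private/ subdirectory.
setup_data_dirs() {
  config_file="${1:-$HOME/.research_config}"
  if [ ! -f "$config_file" ]; then
    echo "config file not found: $config_file" >&2
    return 1
  fi
  # Clear stale values so that the check below reflects the file's contents.
  unset LOCAL_SYMCRT2_DATA_DIR LOCAL_EXTERNAL_DATA_DIR
  . "$config_file"
  if [ -z "${LOCAL_SYMCRT2_DATA_DIR:-}" ] || [ -z "${LOCAL_EXTERNAL_DATA_DIR:-}" ]; then
    echo "LOCAL_SYMCRT2_DATA_DIR and LOCAL_EXTERNAL_DATA_DIR must both be set" >&2
    return 1
  fi
  # ${VAR%/} drops one trailing slash so the joined path has no double slash.
  mkdir -p "${LOCAL_SYMCRT2_DATA_DIR%/}/private" "${LOCAL_EXTERNAL_DATA_DIR%/}"
}

# Usage: setup_data_dirs            (reads ~/.research_config by default)
```

Calling `setup_data_dirs` with no argument reads `~/.research_config`; passing a path as the first argument points it at a different file, which is convenient for a dry run in a scratch directory.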
Next, create an .Rprofile file in your home directory (if you have not yet done so).
cd
touch .Rprofile
Add the following function to your .Rprofile.
# Return the value of the variable named dir_name, as defined in ~/.research_config.
.get_config_path <- function(dir_name) {
  cmd <- paste0("source ~/.research_config; echo $", dir_name)
  system(command = cmd, intern = TRUE)
}
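To see what this function returns without starting R, the same lookup can be sketched directly in the shell. The helper below, `get_config_path`, is a hypothetical stand-in for illustration only; its optional second argument (an alternative config file) does not exist in the R version. It uses `.` rather than `source` because R's `system()` runs commands through `sh`, and on machines where `sh` does not provide `source`, the `.` form may be the more portable choice.

```shell
# Shell sketch of the lookup performed by .get_config_path(dir_name).
# The second argument (config file path) is a hypothetical addition.
get_config_path() {
  # The subshell keeps the sourced variables out of the calling shell.
  ( . "${2:-$HOME/.research_config}" && eval "printf '%s\n' \"\${$1}\"" )
}

# Usage: get_config_path LOCAL_SYMCRT2_DATA_DIR
```

If the printed path is empty, the variable is probably missing from the config file or misspelled.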
Next, we recommend downloading the simulation results from Dropbox so that you can reproduce the figures without rerunning the simulations. The data are stored in .rds format. Download the results directory from the Dropbox results repository and place it in LOCAL_SYMCRT2_DATA_DIR/private.
This can also be done using the following commands. First, execute
source ~/.research_config
cd $LOCAL_SYMCRT2_DATA_DIR"/private"
wget --max-redirect=20 -O download.zip "https://www.dropbox.com/scl/fo/qz1ctahx7tn5i2barich9/ABYSILth4DWkAyzqCc9umkU?rlkey=7lp7suuzvd126vdc4jxexynv5&dl=1"
Then, execute
unzip -o download.zip
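After unzipping, a quick sanity check is to count the .rds files that arrived. The sketch below assumes the archive unpacks into a `results` directory under `private`, as described above; `count_rds_files` is a hypothetical helper, and the number of files to expect is not stated here.

```shell
# Hypothetical check, not part of the repository: count the .rds files
# under a directory (for example, the unzipped results directory).
count_rds_files() {
  find "$1" -type f -name '*.rds' | wc -l | tr -d ' '
}

# Usage (after sourcing ~/.research_config):
#   count_rds_files "${LOCAL_SYMCRT2_DATA_DIR%/}/private/results"
```

A count of zero suggests the download or the unzip step did not complete as intended.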
If you would like to rerun the simulations from scratch, do not download the results and instead follow the steps in the next section.
Navigate to the symcrt2-manuscript directory. All scripts below must be executed from this directory.
Depending on the limits of your cluster, you may need to change the max_gb and max_hours parameters for the commands below. The defaults are 7.5 and 4, respectively.
# tower PCM statistical simulation:
echo "bash code/run_simulation_pipeline.sh --sim_name split_pcm_stat_manu" | qsub -N run_all
# PCM statistical simulation:
echo "bash code/run_simulation_pipeline.sh --sim_name oat_pcm_stat_manu" | qsub -N run_all
# oracle GCM statistical simulation:
echo "bash code/run_simulation_pipeline.sh --sim_name gcm_stat_manu" | qsub -N run_all
# HRT statistical simulation:
echo "bash code/run_simulation_pipeline.sh --sim_name hrt_stat_manu" | qsub -N run_all
# tower PCM, PCM, oracle GCM computational simulations:
qsub code/run_all_simulations.sh
# HRT computational simulations:
qsub code/run_hrt_simulation_100.sh
qsub code/run_hrt_simulation_125.sh
qsub code/run_hrt_simulation_150.sh
qsub code/run_hrt_simulation_175.sh
qsub code/run_hrt_simulation_200.sh
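The submissions above follow two repeating patterns, so they can be wrapped in loops. The sketch below uses two hypothetical wrapper functions that are not part of the repository; each accepts an optional submit command, so that passing `cat` or `echo` previews the jobs instead of queueing them. The simulation names and script paths are the ones listed above.

```shell
# Hypothetical wrappers, not part of the repository. Each takes an optional
# submit command so the loop can be previewed before anything is queued.
submit_stat_simulations() {
  submit_cmd="${1:-qsub -N run_all}"
  for sim in split_pcm_stat_manu oat_pcm_stat_manu gcm_stat_manu hrt_stat_manu; do
    # $submit_cmd is left unquoted on purpose so its options split into words.
    echo "bash code/run_simulation_pipeline.sh --sim_name $sim" | $submit_cmd
  done
}

submit_hrt_comp_simulations() {
  submit_cmd="${1:-qsub}"
  for size in 100 125 150 175 200; do
    $submit_cmd "code/run_hrt_simulation_${size}.sh"
  done
}

# Preview without submitting:
#   submit_stat_simulations cat
#   submit_hrt_comp_simulations echo
```

Running either function with no argument submits through `qsub` exactly as in the individual commands.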
Before creating the figures, please ensure that your working directory is set to symcrt2-manuscript. The figures are placed in the manuscript/figures directory.
# Figure 1
Rscript figures_code/plot_figure_1.R
# Figures 2, 3, and 4
Rscript figures_code/plot_stat_figures.R
# Choosing the splitting proportion (Figures 5-8 in the Appendix)
Rscript figures_code/plot_choose_proportions.R
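Because the figure scripts depend on the working directory, a small guard can catch a wrong location before any script runs. The sketch below is a hypothetical helper, not part of the repository, and it assumes the clone kept the default directory name symcrt2-manuscript; a clone under a different name would need the check adjusted.

```shell
# Hypothetical guard, not part of the repository: succeed only when the
# current directory is named symcrt2-manuscript and contains figures_code/.
in_repo_root() {
  [ "$(basename "$PWD")" = "symcrt2-manuscript" ] && [ -d figures_code ]
}

# Usage: in_repo_root && Rscript figures_code/plot_figure_1.R
```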
Download the LIU22 directory from the Dropbox LIU22 repository and place it in LOCAL_EXTERNAL_DATA_DIR. This can also be done using the following commands. First, execute
source ~/.research_config
cd $LOCAL_EXTERNAL_DATA_DIR
wget --max-redirect=20 -O download.zip "https://www.dropbox.com/scl/fo/qz1ctahx7tn5i2barich9/ABYSILth4DWkAyzqCc9umkU?rlkey=7lp7suuzvd126vdc4jxexynv5&dl=1"
Then, execute
unzip -o download.zip
Navigate to the symcrt2-manuscript directory. All scripts below must be executed from this directory.
Rscript data_analysis/preprocess.R
# tower PCM
qsub data_analysis/run_r_script.sh data_analysis/split_pcm_da_manu.R
# tower GCM
qsub data_analysis/run_r_script.sh data_analysis/gcm_da_manu.R
# HRT
qsub data_analysis/run_r_script.sh data_analysis/hrt_da_manu_035.R
# PCM
qsub data_analysis/run_r_script.sh data_analysis/oat_pcm_da_manu_035.R
Rscript data_analysis/construct_results_table.R
We thank Timothy Barry and Ziang Niu for sharing their code and providing inspiration for our code pipeline.