This repository contains an implementation of the following paper
- Li, C., Shen, X., & Pan, W. (2023). Nonlinear causal discovery with confounders. Journal of the American Statistical Association.
The method is named Deconfounded Functional Structure Estimation (DeFuSE).
The simulations of DeFuSE are in Jupyter Notebooks:
-
./example_small.ipynb
: 50 simulations (random and hub graphs) whenp, n = 30, 500
. -
./example_large.ipynb
: 50 simulations (random and hub graphs) whenp, n = 100, 500
.
The implementation of DeFuSE is in directory ./defuse/
.
-
./defuse/defuse.py
: definesDeFuSE
class. -
./defuse/defusenet.py
: defines neural network structures, includingMLP
class andAMLP
(additive MLP) class. -
./defuse/feature.py
: defines functions for feature selection. -
./defuse/trainer.py
: definesTrainer
class. -
./defuse/utils.py
: defines utility functions, including graph and data generating functions.
The code of full simulations (including other methods) is in directory ./simulation/
.
-
./simulation/data.py
: simulates data. -
./simulation/defuse_simulation.py
: conducts simulations for DeFuSE. -
./simulation/notears_simulation.py
: conducts simulations for NOTEARS [2]. -
./simulation/simulation.R
: conducts simulations for CAM [3], RFCI [4], and LRpS-GES [1]. -
./simulation/Python/
:./simulation/Python/notears/
: contains an implementation of NOTEARS.
-
./simulation/R/
:-
./simulation/R/methods/
: contains R files defining a unified interface for CAM, RFCI, and LRpS-GES. -
utils.R
: defines utility functions, including graph metrics.
-
-
./simulation/data/
: stores simulated data. -
./simulation/results/
: stores the simulation results.
For Python, use conda to create an environment named defuse
.
git clone https://github.com/chunlinli/defuse.git
cd defuse
conda env create -f environment.yml
conda activate defuse
To install DeFuSE, run the following Bash script.
pip install .
To install NOTEARS, run the following Bash script.
pip install simulation/Python/notears
For R, the version is 4.1.1 and the following packages are used.
pkg <- c(
"CAM","lrpsadmm","pcalg","bnlearn","mvtnorm", # required
"dplyr","tidyr","progress","ggplot2","tidyverse","glue","scales","kableExtra" # suggested
)
install.packages(pkg)
NOTE: some packages have dependencies unavailable from CRAN. The user may need to install them manually.
The code is tested on a server with specs:
System Version: Ubuntu 18.04.6 LTS 4.15.0-176-generic x86_64
Model name: Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz
Total Number of Cores: 64
Memory: 528 GB
No GPU is required.
For DeFuSE simulations, run the following notebooks.
-
./example_small.ipynb
takes roughly 3 hrs to run. -
./example_large.ipynb
takes roughly 12 hrs to run.
For complete simulations, first run the following script to generate data.
python simulation/data.py
Then run the following scripts.
python simulation/defuse_simulation.py
python simulation/notears_simulation.py # requires NOTEARS
Rscript simulation/simulation.R # requires other R packages
NOTE: the complete simulations will take more than 100 hrs to complete.
If you find the code useful, please consider citing
@article{li2023nonlinear,
author = {Chunlin Li, Xiaotong Shen, Wei Pan},
title = {Nonlinear causal discovery with confounders},
year = {2023},
journal={Journal of the American Statistical Association}
}
The code is maintained on GitHub. This project is in development.
Implementing the structure learning algorithms is error-prone. If you spot any error, please file an issue here or contact me via email -- I will be grateful to be informed.
[1] Frot, B., Nandy, P., & Maathuis, M. H. (2019). Robust causal structure learning with some hidden variables, JRSSB. Open-sourced softwares: LRpS+GES is implemented by lrpsadmm and pcalg.
[2] Zheng, X., Dan, C., Aragam, B., Ravikumar, P., & Xing, E. P. (2020). Learning sparse nonparametric DAGs, AISTATS 2020. Open-sourced software: NOTEARS.
[3] Bühlmann, P., Peters, J., & Ernest, J. (2014). CAM: Causal additive models, high-dimensional order search and penalized regression, AOS. Open-sourced software: CAM.
[4] Colombo, D., Maathuis, M. H., Kalisch, M., & Richardson, T. S. (2012). Learning high-dimensional directed acyclic graphs with latent and selection variables, AOS. Open-sourced software: RFCI is implemented by pcalg.
[5] Kalisch, M., Mächler, M., Colombo, D., Maathuis, M. H., & Bühlmann, P. (2012). Causal Inference Using Graphical Models with the R Package pcalg, JSS. Open-sourced software: pcalg.
In addition, part of the simulation code is adapted from Frot's code and Zheng's code.
I would like to thank the authors of above open-sourced software.