Sorghum bicolor bicolor Phenophase Bayesian Belief Network in R & Python

This project uses:

  • The Rocker Group's Tidyverse R 4.0 (Ubuntu 18 LTS) Docker container image
  • data from the TERRA-REF project accessed through the traits R package
  • JAGS for Gibbs-sampled MCMC modeling
  • causalnex to implement the NO TEARS directed acyclic graph structure learning algorithm as described here
  • causalnex has dependencies: pandas, sklearn, and igraph

The goal is to develop a causal Bayesian network, also known as a Bayesian Belief Network, that predicts growth rate as a phenotype from the Sorghum bicolor biomass accumulation panel.

This analysis produces a causal inference Bayesian Belief Network similar to Judea Pearl's work, where the nodes (vertices) of the network represent variables and the edges (arcs) represent linked dependencies supported by conditional probabilities.


Methods

Docker Setup

To run any aspect of this analysis, it is recommended that you have Docker installed on the host machine. Alternatively, use singularity-ce to run the containers on high performance clusters.

Running the Analyses with Docker

  • All R scripts detailed below, including the growth rate modeling, can be run with the container image cyversevice/rstudio-bayes-cpu:4.0-ubuntu-jags
  • All Python code runs from the command line in the Docker container image shown in the example below, and is written so that this repository is mounted as a volume in the container at /work/phenophasebbn/
    • Ex. docker run --rm -it -v /local/path/to/phenophasebbn/:/work/phenophasebbn/ rbartelme/pytorch-causalnex:0.10.0 python /work/phenophasebbn/bbn/bbn_structure.py (See note below)
    • The current Dockerfile for this image is contained in this repository at /causal_nex/Dockerfile
  • A JupyterLab Docker container image has been created to facilitate the exploration of the python codebase

Initial Graph Embedding

Initial Graph

In order to speed up the directed acyclic graph generation for the Bayesian Belief Network, an initial graph was instantiated using lists of tuples that reference the edge/node connections and directions outlined in the conceptual diagram above.
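
As a minimal sketch of what such an edge list looks like, the example below defines directed edges as (parent, child) tuples and checks that they form a directed acyclic graph before they are used to seed a structure model. The node names are hypothetical placeholders, not the repository's actual variables, and the acyclicity check uses plain Python (Kahn's algorithm) rather than causalnex so that it runs without extra dependencies.

```python
from collections import defaultdict, deque

# Hypothetical node names, for illustration only; the repository's
# actual expert-knowledge edges live in its own Python code.
expert_edges = [
    ("season", "temperature"),
    ("season", "precipitation"),
    ("temperature", "growth_rate"),
    ("precipitation", "growth_rate"),
    ("genotype", "growth_rate"),
]


def is_acyclic(edges):
    """Return True if the directed (parent, child) edges form a DAG."""
    children = defaultdict(list)
    indegree = defaultdict(int)
    nodes = set()
    for parent, child in edges:
        children[parent].append(child)
        indegree[child] += 1
        nodes.update((parent, child))

    # Kahn's algorithm: repeatedly remove nodes with no incoming edges.
    queue = deque(n for n in nodes if indegree[n] == 0)
    visited = 0
    while queue:
        node = queue.popleft()
        visited += 1
        for child in children[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)

    # Every node is removed only when the graph has no cycle.
    return visited == len(nodes)


print(is_acyclic(expert_edges))
```

A list in this shape is the kind of input the workflow below passes to StructureModel(); checking it first is an optional safeguard, not something the repository is known to do.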

NOTE: Without GPU acceleration, learning the graph structure with the NO TEARS implementation in causalnex and no expert-knowledge graph encodings is computationally intensive, and it may not solve the graph structure when the Sorghum gene data from these analyses are included.


Network Workflow Description

This section describes how the contents of this repository were used to generate the analysis.

1. Processing raw data:

  • Weather & phenotype data processing:
    • Code: /bnprocess_functional.R
    • Exports (TSV):
      • /season4_combined.txt
      • /season6_combined.txt
      • /ksu_combined.txt (No longer used in final analysis)
  • Genomic Data:
    • Code to process the SNP frequency by Sorghum bicolor gene table from this repository can be found in /genomic_preprocessing/snp_normalization.R
    • Exports (TSV):
      • /genomic_preprocessing/genewise_snp_relative_abundance.txt where the relative abundance of single nucleotide polymorphisms is calculated relative to the Sorghum bicolor biomass accumulation panel population
  • Development work:
    • Notes and pseudocode are in /sandbox/ and /bnprocess_mac.R

2. Model Growth Rate by Sorghum bicolor Cultivar using JAGS in R:

  • /jags/ contains the development code for the growth rate modeling below; these scripts and files are used in the BBN structure learning model
  • Full logistic growth rate modeling by Jessica Guo
  • Summary plots of the logistic growth models can be found in /data_figs/
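
For readers unfamiliar with logistic growth curves, the following is a minimal sketch of a three-parameter logistic function and its derivative (the growth rate at a given time). The parameter names and the parameterization are assumptions chosen for illustration; the JAGS model in this repository may be parameterized differently.

```python
import math


def logistic_growth(t, asymptote, rate, midpoint):
    """Three-parameter logistic curve: modeled size at time t."""
    return asymptote / (1.0 + math.exp(-rate * (t - midpoint)))


def absolute_growth_rate(t, asymptote, rate, midpoint):
    """Derivative of the logistic curve with respect to time."""
    y = logistic_growth(t, asymptote, rate, midpoint)
    return rate * y * (1.0 - y / asymptote)


# Hypothetical values: the curve reaches half its asymptote at the midpoint,
# which is also where the growth rate peaks.
size_at_midpoint = logistic_growth(50, asymptote=300.0, rate=0.1, midpoint=50)
peak_rate = absolute_growth_rate(50, asymptote=300.0, rate=0.1, midpoint=50)
```

In a Bayesian fit, the three parameters would be given priors and estimated per cultivar by MCMC rather than fixed as they are here.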

3. Prepare dataset for structure learning in R & Python:

  • Join genomic, environmental, and phenotypic data
    • This is done with the Rscript /bbn/join_datasets.R
  • Exports:
    • /bbn/rgr_snp_joined.csv
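
The join itself is done in R, but the idea can be sketched in a few lines of Python: records from two tables are matched on a shared key column and merged. The column names and values below are hypothetical and do not reflect the actual schema of the joined file.

```python
def inner_join(left_rows, right_rows, key):
    """Join two lists of dict records on a shared key column."""
    # Index the right-hand table by key so each lookup is constant time.
    right_index = {}
    for row in right_rows:
        right_index.setdefault(row[key], []).append(row)

    joined = []
    for left in left_rows:
        # Rows without a match on the right are dropped (inner join).
        for right in right_index.get(left[key], []):
            joined.append({**left, **right})
    return joined


# Hypothetical records, for illustration only.
growth = [
    {"cultivar": "A", "growth_rate": 0.11},
    {"cultivar": "B", "growth_rate": 0.09},
    {"cultivar": "C", "growth_rate": 0.10},
]
snps = [
    {"cultivar": "A", "gene_1": 0.25},
    {"cultivar": "B", "gene_1": 0.40},
]
joined = inner_join(growth, snps, key="cultivar")
```

Note that an inner join silently drops cultivars missing from either table, so it is worth comparing row counts before and after joining.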

4. BBN Structure Learning in Python with NO TEARS algorithm:

  • Ingest the joined data /bbn/rgr_snp_joined.csv and learn structure with:
    • /bbn/bbn_structure.py
  • Process categorical data with LabelEncoder from scikit-learn
  • Encode expert knowledge into graph structure via a list of tuples in the first invocation of StructureModel()
    • PNG exported as /bbn/init_graph.png. As of 10-25-2021, writing the PNG takes a long time, so that step is commented out of the code. The pickle of this graph is available at /bbn/expert_sm.pickle; unpickle the structure model binary to use it for the CPD fitting in step #5.
  • Optional: learn graph structure with NO TEARS using the from_pandas function from causalnex, blacklisting spurious node + edge connections with a second list of tuples
  • Exports:
    • categorical label encodings for genotype (or cultivar) /bbn/genotype_map.json & /bbn/season_map.json
    • Structure learning currently does not finish solving the graph structure, so only the expert-knowledge-encoded graph is available
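
To illustrate the label-encoding step and the exported JSON maps, here is a minimal sketch using only the standard library. It mimics the behavior of scikit-learn's LabelEncoder, which assigns integer codes in sorted label order; the season labels below are hypothetical, and the real maps are whatever the repository writes to /bbn/genotype_map.json and /bbn/season_map.json.

```python
import json


def label_encode(values):
    """Map each distinct label to an integer code, in sorted label order."""
    classes = sorted(set(values))
    mapping = {label: code for code, label in enumerate(classes)}
    return [mapping[v] for v in values], mapping


# Hypothetical season labels, for illustration only.
seasons = ["season_6", "season_4", "season_6", "season_4"]
codes, season_map = label_encode(seasons)

# Serialize the mapping so integer codes can be decoded after modeling.
season_map_json = json.dumps(season_map, indent=2)
print(season_map_json)
```

Keeping the mapping alongside the encoded data matters because the integer codes carry no meaning on their own once the original labels are gone.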

5. Discretized Data Mapping & Conditional Probability Distribution Fitting:

  • Import Bayesian Network by structure model pickle
  • Instantiate Bayesian network with BayesianNetwork() function from causalnex
  • Map continuous variables into categories
  • A detailed checklist of these steps can be found in this GitHub issue
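
The two ideas in this step, binning continuous variables and fitting conditional probabilities, can be sketched without causalnex as follows. This is an illustration under stated assumptions: the cut points, variable names, and records are hypothetical, and the conditional probabilities are plain relative frequencies (a maximum likelihood estimate), which may differ from the estimator the repository ends up using.

```python
from bisect import bisect_right
from collections import Counter, defaultdict


def discretize(value, cut_points):
    """Return the index of the bin that value falls into.

    cut_points must be sorted; a value equal to a cut point goes
    into the higher bin.
    """
    return bisect_right(cut_points, value)


def fit_cpt(rows, child, parents):
    """Estimate P(child | parents) from relative frequencies."""
    counts = defaultdict(Counter)
    for row in rows:
        parent_state = tuple(row[p] for p in parents)
        counts[parent_state][row[child]] += 1
    return {
        state: {value: n / sum(counter.values()) for value, n in counter.items()}
        for state, counter in counts.items()
    }


# Hypothetical cut points and observations, for illustration only.
cuts = [20.0, 30.0]
records = [
    {"temp_bin": discretize(18.0, cuts), "growth": "low"},
    {"temp_bin": discretize(22.0, cuts), "growth": "high"},
    {"temp_bin": discretize(31.0, cuts), "growth": "high"},
    {"temp_bin": discretize(35.0, cuts), "growth": "low"},
]
cpt = fit_cpt(records, child="growth", parents=["temp_bin"])
```

With few observations per parent state, frequency-based estimates like these are noisy, and parent states that never occur in the data get no entry at all, which is one reason the choice of bins deserves care.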