Precursor recommendation for inorganic synthesis by machine learning materials similarity from scientific literature
Data and code for "Precursor recommendation for inorganic synthesis by machine learning materials similarity from scientific literature".
git clone https://github.com/CederGroupHub/SynthesisSimilarity.git
cd SynthesisSimilarity
pip install -e .
cd ..
# Download the data necessary for synthesis recommendation
python -m SynthesisSimilarity download_necessary_data
# The following command is not needed for synthesis recommendation.
# It only downloads optional data for benchmarking purposes.
# (optional) python -m SynthesisSimilarity download_optional_data
Precursors are recommended by referring the synthesis of a novel target material to the known recipe of a similar material, mimicking how a human designs a synthesis. The similarity of two target materials is the cosine similarity of their encoded vectors, which are generated by the synthesis context-based encoding model (PrecursorSelector encoding) developed in this work. When the precursors of the reference material do not cover all the elements in the target, a masked precursor completion (MPC) model predicts the missing precursors.
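The lookup described above can be illustrated with a minimal sketch. It is not the repo's API: the function names, the three-dimensional vectors, and the two reference recipes are made up for illustration, and a real PrecursorSelector encoding would come from the trained model. The sketch picks the most similar known target by cosine similarity and reports which target elements its precursors leave uncovered, which is the point where the MPC model would take over.

```python
import math
import re


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def elements_in(formula):
    """Element symbols in a simple formula string such as 'Li2CO3'."""
    return set(re.findall(r"[A-Z][a-z]?", formula))


def recommend_from_reference(target_formula, target_vector, known_recipes):
    """Return (reference target, its precursors, uncovered target elements)."""
    best = max(
        known_recipes,
        key=lambda recipe: cosine_similarity(target_vector, recipe["vector"]),
    )
    covered = set()
    for precursor in best["precursors"]:
        covered |= elements_in(precursor)
    missing = elements_in(target_formula) - covered
    return best["target"], best["precursors"], missing


# Toy data (assumption): vectors and recipes are illustrative only,
# not taken from the text-mined dataset or the trained encoder.
known_recipes = [
    {"target": "LiCoO2", "vector": [0.9, 0.1, 0.2],
     "precursors": ["Li2CO3", "Co3O4"]},
    {"target": "BaTiO3", "vector": [0.1, 0.8, 0.3],
     "precursors": ["BaCO3", "TiO2"]},
]

reference, precursors, missing = recommend_from_reference(
    "LiNi0.5Co0.5O2", [0.8, 0.2, 0.2], known_recipes
)
print(reference, precursors, sorted(missing))
```

In this toy case the borrowed recipe supplies no Ni source, so a completion step is needed. The sketch only detects the gap; it does not attempt to predict the missing precursor.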
In brief, the scripts that reproduce the main results of this work are in the folder "scripts". Auxiliary code is in the folders "core" and "scripts_utils". The trained models are in the folder "models". Data downloaded with "python -m SynthesisSimilarity download_necessary_data" and "python -m SynthesisSimilarity download_optional_data" are saved in the folders "rsc" and "other_rsc", respectively. The full layout is as follows.
SynthesisSimilarity
├── README.md  # A brief introduction to the repo
├── setup.py  # Used to install the repo as a Python package
├── MANIFEST.in  # MANIFEST file used by setup.py
├── requirements.txt  # Python packages required for this repo
├── requirements_optional.txt  # Optional packages not needed for basic use
└── SynthesisSimilarity  # The main directory
    ├── core  # The directory of the core modules and framework for the PrecursorSelector model in this work
    │   ├── activations  # Activation functions (from https://github.com/tensorflow/models)
    │   │   ├── gelu.py  # Activation function of gelu()
    │   │   ├── gelu_test.py  # Test for gelu()
    │   │   ├── __init__.py  # Python init script for current directory
    │   │   ├── swish.py  # Activation function of swish()
    │   │   └── swish_test.py  # Test for swish()
    │   ├── bert_modeling.py  # Attention block (from https://github.com/tensorflow/models)
    │   ├── bert_optimization.py  # Additional optimization functions (from https://github.com/tensorflow/models)
    │   ├── callbacks.py  # Callback functions for monitoring the training process and validation
    │   ├── circle_loss.py  # Circle loss (adapted from https://github.com/zhen8838/Circle-Loss)
    │   ├── encoders.py  # Encoder functions to convert the composition of a target material to an encoded vector
    │   ├── exp_models.py  # An example of how to extend the current model to other synthesis prediction tasks
    │   ├── focal_loss.py  # Focal loss (from https://github.com/artemmavrin/focal-loss)
    │   ├── __init__.py  # Python init script for current directory
    │   ├── layers.py  # Low-level neural network modules to be inserted as layers in a more complex network
    │   ├── losses.py  # The loss function used for gradient descent
    │   ├── mat_featurization.py  # An example of how to extend the input from composition to other materials features
    │   ├── model_framework.py  # The multi-task framework of the representation model in this work
    │   ├── model_utils.py  # Handy functions to use the model
    │   ├── task_models.py  # Neural network modules corresponding to different prediction tasks to be used in the multi-task framework
    │   ├── tf_utils.py  # Handy functions for TensorFlow (adapted from https://github.com/tensorflow/models)
    │   ├── utils.py  # Handy functions for data processing
    │   └── vector_utils.py  # Handy functions for operations with vectors
    ├── examples  # The directory of useful examples
    │   ├── synthesis_recommendation.py  # Precursor recommendation for the given composition of a target material
    │   └── __init__.py  # Python init script for current directory
    ├── __init__.py  # Python init script for current directory
    ├── __main__.py  # Used for module commands such as "python -m SynthesisSimilarity download_necessary_data"
    ├── models  # The directory of trained models
    │   ├── SynthesisEncoding  # The directory of minimum model files for similarity evaluation (the encoder part of the whole model)
    │   │   ├── model_config.json  # The configuration file summarizing important attributes of the model
    │   │   └── saved_model  # The directory of files for reloading a TensorFlow model
    │   │       ├── assets  # A directory for reloading a TensorFlow model
    │   │       ├── saved_model.pb  # A file for reloading a TensorFlow model
    │   │       └── variables  # A directory of files for reloading a TensorFlow model
    │   │           ├── variables.data-00000-of-00001  # A file for reloading a TensorFlow model
    │   │           └── variables.index  # A file for reloading a TensorFlow model
    │   └── SynthesisRecommendation  # The directory of model files for precursor recommendation
    │       ├── cmd_parameters.json  # The configuration file summarizing important attributes of the model
    │       ├── model_meta.pkl  # The configuration file of all attributes of the model
    │       └── saved_model  # The directory of files for reloading a TensorFlow model
    │           ├── checkpoint  # A checkpoint file for reloading a TensorFlow model
    │           ├── cp.ckpt.data-00000-of-00001  # A checkpoint file for reloading a TensorFlow model
    │           └── cp.ckpt.index  # A checkpoint file for reloading a TensorFlow model
    ├── other_rsc  # The directory of model files for benchmarking, not needed for the model in this study
    │   ├── fasttext_pretrained_matsci  # The FastText encoding model from https://figshare.com/s/70455cfcd0084a504745 (Kim, E., Jensen, Z., Grootel, A.V., Huang, K., Staib, M., Mysore, S., Chang, H.S., Strubell, E., McCallum, A., Jegelka, S. and Olivetti, E., 2020. Inorganic Materials Synthesis Planning with Literature-Trained Neural Networks. Journal of Chemical Information and Modeling)
    │   │   ├── fasttext_embeddings-MINIFIED.model  # A file for reloading the FastText model
    │   │   ├── fasttext_embeddings-MINIFIED.model.vectors_ngrams.npy  # A file for reloading the FastText model
    │   │   ├── fasttext_embeddings-MINIFIED.model.vectors.npy  # A file for reloading the FastText model
    │   │   └── fasttext_embeddings-MINIFIED.model.vectors_vocab.npy  # A file for reloading the FastText model
    │   └── matminer  # The Magpie encoding model retrieved from the matminer package (Ward, L., Dunn, A., Faghaninia, A., Zimmermann, N. E. R., Bajaj, S., Wang, Q., Montoya, J. H., Chen, J., Bystrom, K., Dylla, M., Chard, K., Asta, M., Persson, K., Snyder, G. J., Foster, I., Jain, A., Matminer: An open source toolkit for materials data mining. Comput. Mater. Sci. 152, 60-69 (2018). Ward, L., Agrawal, A., Choudhary, A., & Wolverton, C. (2016). A general-purpose machine learning framework for predicting properties of inorganic materials. npj Computational Materials, 2(1), 1-7.)
    │       ├── mp_imputer_preset_v1.0.2.pkl  # A file for reloading the Magpie model
    │       └── mp_scaler_preset_v1.0.2.pkl  # A file for reloading the Magpie model
    ├── rsc  # The directory of data files for model training and evaluation in this work
    │   ├── data_split.npz  # Data split based on publication year and prototype formula
    │   ├── ele_order_counter.json  # Statistics of how often authors put one element in front of another when writing the string for a material formula
    │   ├── pre_count_normalized_by_rxn_ss.json  # Statistics of how frequently each precursor is used in the literature-mined synthesis reactions
    │   └── reactions_v20_20210820_ss.jsonl  # The text-mined solid-state synthesis dataset from materials science papers
    ├── scripts  # The directory of useful scripts reproducing the main results in this work
    │   ├── _00_download_model_and_data.py  # Download data from Google Drive for the PrecursorSelector model
    │   ├── _01_synthesis_recommendation.py  # Precursor recommendation for the given composition of a target material
    │   ├── _02_target_material_similarity.py  # Similarity evaluation for two target materials based on the PrecursorSelector encoding
    │   ├── _03_masked_precursor_completion.py  # Prediction of the complete precursors given the target material and partial precursors
    │   ├── _04_reaction_relationship.py  # Plot relationships between targets and their shared precursors
    │   ├── _05_recommendation_benchmark.py  # Benchmark of precursor recommendation using various algorithms
    │   ├── _06_computation_time_similarity.py  # Time cost for similarity evaluation
    │   └── __init__.py  # Python init script for current directory
    └── scripts_utils  # The directory of handy functions for the scripts
        ├── benchmark_utils.py  # Handy functions for benchmarking
        ├── data_set_utils.py  # Handy functions for loading data
        ├── FastTextSimilarity_utils.py  # Handy functions for using the FastText model
        ├── __init__.py  # Python init script for current directory
        ├── MatminerSimilarity_utils.py  # Handy functions for using the Magpie model
        ├── multi_processing_utils.py  # Handy functions for parallel processing on multiple CPU cores
        ├── precursors_recommendation_utils.py  # Handy functions for precursor recommendation
        ├── recommendation_utils.py  # Handy functions for general recommendation
        ├── similarity_utils.py  # Handy functions for general similarity evaluation
        ├── TarMatSimilarity_utils.py  # Handy functions for evaluation of target similarity
        └── train_utils.py  # Handy functions for model training
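
The text-mined dataset listed above (rsc/reactions_v20_20210820_ss.jsonl) has a .jsonl extension, so it presumably stores one JSON record per line. The record schema is not described here, so the following sketch makes no assumption about field names: it only streams the records and counts their top-level keys, which is a quick way to inspect an unfamiliar file. The helper names are hypothetical and not part of the repo, and the demo runs on a throwaway file with made-up records.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def iter_jsonl(path):
    """Yield one parsed JSON value per non-empty line of a JSON Lines file."""
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def count_top_level_keys(records, limit=1000):
    """Count the top-level keys seen in the first `limit` dict records."""
    counts = Counter()
    for index, record in enumerate(records):
        if index >= limit:
            break
        if isinstance(record, dict):
            counts.update(record.keys())
    return counts


# Demo on a temporary file with made-up records (not the real dataset).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.jsonl"
    demo_path.write_text('{"a": 1, "b": 2}\n\n{"a": 3}\n', encoding="utf-8")
    demo_counts = count_top_level_keys(iter_jsonl(demo_path))

print(demo_counts)
```

To inspect the real data, pass the path of the downloaded file to iter_jsonl after running download_necessary_data. Its exact location on disk depends on where the package is installed.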