
Merge pull request #4 from SeonghwanSeo/dev-2.0.1
Version 2.0.1
SeonghwanSeo authored Aug 8, 2024
2 parents b9294d5 + 4c52b4d commit cb161c4
Showing 56 changed files with 1,023,087 additions and 498,973 deletions.
125 changes: 37 additions & 88 deletions README.md
@@ -29,112 +29,82 @@ If you have any problems or need help with the code, please add an issue or cont

## Table of Contents

- [Environment](#environment)
- [Installation](#installation)
- [Data](#data)
- [Dataset Structure](#dataset-structure)
- [Prepare Your Own Dataset](#prepare-your-own-dataset)
- [Model Training](#model-training)
- [Preprocess](#preprocess)
- [Model Training](#model-training)
- [Training](#training)
- [Generation](#generation)

## Environment
## Installation

The project can be installed with pip, using the `--find-links` argument for the torch-geometric packages.

```bash
pip install -e . --find-links https://data.pyg.org/whl/torch-2.1.2+cu121.html # CUDA
pip install -e . --find-links https://data.pyg.org/whl/torch-2.1.2+cpu.html # CPU-only
```


- python=3.9
- [PyTorch](https://pytorch.org/)=1.12
- [PyTorch Geometric](https://pytorch-geometric.readthedocs.io/en/latest/)=2.1.0
- Tensorboard=2.11.0
- [Pandas](https://pandas.pydata.org/)=1.5.1
- [RDKit](https://www.rdkit.org/docs/Install.html)=2022.3.5
- [OmegaConf](http://Omegaconf.readthedocs.io)=2.3.0
- Parmap=1.6.0

## Data

### Dataset Structure

#### Data Directory Structure

Move to `data/` directory. Initially, the structure of directory `data/` is as follows.
Initially, the structure of directory `data/` is as follows.

```bash
├── data/
├── ZINC/ (ZINC15 Database)
│ ├── smiles/
│ │ ├── train.smi (https://github.com/wengong-jin/icml18-jtnn/tree/master/data/zinc/train.txt)
│ │ ├── valid.smi (https://github.com/wengong-jin/icml18-jtnn/tree/master/data/zinc/valid.txt)
│ │ └── test.smi (https://github.com/wengong-jin/icml18-jtnn/tree/master/data/zinc/test.txt)
│ ├── get_data.py
├── ZINC/ (Constructed from https://github.com/wengong-jin/icml18-jtnn)
│ ├── data.csv
│ └── library.csv
└── 3CL_ZINC/ (Smina calculation result. (ligands: ZINC15, receptor: 7L13))
├── data.csv
├── split.csv
└── library.csv (Same to data/ZINC/library.csv)
├── 3CL_ZINC/ (Smina calculation result. (ligands: ZINC15, receptor: 7L13))
└── UniDock_ZINC/ (UniDock calculation result.)
```

- `data/ZINC/`, `data/3CL_ZINC/` : Datasets used in our paper.

#### Prepare ZINC15 Dataset

Move to the `data/ZINC` directory and run `python get_data.py`; `data.csv` and `split.csv` will then be created. The dataset for 3CL docking is already prepared.

```bash
├── data/
├── ZINC/
├── smiles/
├── get_data.py
├── data.csv new!
├── split.csv new!
└── library.csv
```

### Prepare Your Own Dataset

For your own dataset, you need to prepare `data.csv` and `split.csv` as follows.
For your own dataset, you need to prepare `data.csv` as follows.

- `./data/<OWN-DATA>/data.csv`

```
SMILES,Property1,Property2,...
c1ccccc1,10.25,32.21,...
C1CCCC1,35.1,251.2,...
KEY,SMILES,Property1,Property2,...
1,c1ccccc1,10.25,32.21,...
2,C1CCCC1,35.1,251.2,...
...
```

- SMILES must be RDKit-readable.
- `KEY` field is optional.
- `SMILES` field must be RDKit-readable.
- If you want to train a single molecule set with several different properties, you don't have to configure a separate dataset for each property. A single dataset file containing all of the property information is enough: for example, `ZINC/data.csv` contains `mw`, `logp`, `tpsa`, and `qed`, and you can train the model with any property or combination of properties, e.g. `mw` or `[mw, logp, tpsa]`.
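As a minimal sketch of the format described above (the file path, property names, and values here are placeholders, not part of the dataset), a `data.csv` can be written with the standard `csv` module:

```python
import csv

# Hypothetical rows: KEY (optional), an RDKit-readable SMILES, then property columns.
rows = [
    {"KEY": 1, "SMILES": "c1ccccc1", "logp": 1.69, "tpsa": 0.0},
    {"KEY": 2, "SMILES": "C1CCCC1", "logp": 2.05, "tpsa": 0.0},
]

with open("data.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["KEY", "SMILES", "logp", "tpsa"])
    writer.writeheader()   # header row: KEY,SMILES,logp,tpsa
    writer.writerows(rows)
```

Training can then select any subset of the property columns from this single file.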

- `./data/<OWN-DATA>/split.csv`
### Preprocess

```
train,0
train,1
...
val,125
...
test,163
...
```

- The first column is the data type (train, val, test), and the second column is the row index in `data.csv`.
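For illustration only, a `split.csv` in the two-column format described above could be produced as follows; the dataset size and the 80/10/10 assignment are arbitrary choices, not values from this repository:

```python
import csv
import random

random.seed(0)
indices = list(range(200))   # hypothetical row indices of data.csv
random.shuffle(indices)

# Arbitrary 80/10/10 assignment of shuffled indices to train/val/test.
n = len(indices)
splits = (
    [("train", i) for i in indices[: int(0.8 * n)]]
    + [("val", i) for i in indices[int(0.8 * n) : int(0.9 * n)]]
    + [("test", i) for i in indices[int(0.9 * n) :]]
)

with open("split.csv", "w", newline="") as f:
    csv.writer(f).writerows(splits)   # no header: one "type,index" pair per line
```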

And then, you need to create a ***building block library***. Go to the root directory and run `./script/get_library.py`.
You then need to preprocess the dataset. Go to the root directory and run `./script/preprocess.py`.

```shell
cd <ROOT-DIR>
python ./script/get_library.py \
--data_dir ./data/<OWN-DATA> \
--cpus <N-CPUS>
python ./script/preprocess.py \
--data_dir ./data/<DATA-DIR> \
--cpus <N-CPUS> \
--split_ratio 0.9 # train:val split ratio.
```

After this step, the structure of the directory `data/` is as follows.
After the preprocessing step, the structure of the directory `data/` is as follows.

```bash
├── data/
├── <OWN-DATA>/
├── <DATA-DIR>/
├── data.csv
├── split.csv
└── library.csv new!
├── valid_data.csv new!
├── data.pkl new!
├── library.csv new!
└── split.csv new!
```
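The effect of `--split_ratio` can be sketched as follows. This is a simplified illustration of a 0.9 train:val split over row indices, not the actual implementation in `./script/preprocess.py`:

```python
import random

def split_indices(num_data: int, split_ratio: float = 0.9, seed: int = 0):
    """Shuffle row indices and split them into train/val parts."""
    indices = list(range(num_data))
    random.Random(seed).shuffle(indices)
    n_train = round(num_data * split_ratio)
    return indices[:n_train], indices[n_train:]

train_idx, val_idx = split_indices(1000, split_ratio=0.9)
print(len(train_idx), len(val_idx))  # 900 100
```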


@@ -143,30 +113,6 @@ After this step, the structure of directory `data/` is as follows.

The model training requires less than <u>*12 hours*</u> with 1 GPU (RTX2080) and 4 CPUs (Intel Xeon Gold 6234).

### Preprocess (Optional)

You can skip data processing during training by preprocessing the data with `./script/preprocess.py`.

```shell
cd <ROOT-DIR>
python ./script/preprocess.py \
--data_dir ./data/<DATA-DIR> \
--cpus <N-CPUS>
```

After the preprocessing step, the structure of the directory `data/` is as follows. `data.csv`, `split.csv` and `library.csv` are required, and `data.pkl` is optional.

```bash
├── data/
├── <DATA-DIR>/
├── data.csv
├── data.pkl new!
├── split.csv
└── library.csv
```



### Training

```shell
@@ -203,6 +149,8 @@ python ./script/train.py \
--property affinity
```



## Generation

The model generates 20 to 30 molecules per second with 1 CPU (Intel Xeon E5-2667 v4).
@@ -279,3 +227,4 @@ Generator config (Yaml)
alpha: 0.75
max_iteration: 10
```
