Commit d62bcec: released 1.0

jagerliu committed Dec 12, 2019
1 parent ec98693 commit d62bcec

Showing 58 changed files with 92,852 additions and 1 deletion.
100 changes: 100 additions & 0 deletions .gitignore
@@ -0,0 +1,100 @@
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
env/
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
*.egg-info/
.installed.cfg
*.egg

# PyInstaller
# Usually these files are written by a Python script from a template
# before PyInstaller builds the exe, so as to inject date/other info into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
.hypothesis/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# IPython Notebook
.ipynb_checkpoints

# pyenv
.python-version

# celery beat schedule file
celerybeat-schedule

# dotenv
.env

# virtualenv
venv/
ENV/

# Spyder project settings
.spyderproject

# Rope project settings
.ropeproject

# OS and IDE files
*.idea/
*.DS_Store
*.env/
*.vscode/

# data
*_results.tsv
*.out
178 changes: 177 additions & 1 deletion README.md
@@ -1,2 +1,178 @@
# K-BERT
The source code of K-BERT will be published after the publication of the paper.
![](https://img.shields.io/badge/license-MIT-000000.svg)

Source code and datasets for ["K-BERT: Enabling Language Representation with Knowledge Graph"](https://arxiv.org/abs/1909.07606v1).


## Requirements

Hardware:
```
RAM >= 32 GB
GPU memory >= 24 GB
```

Software:
```
Python 3
PyTorch >= 1.0
argparse == 1.1
```
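
The repository also ships a ``requirements.txt`` (see the directory tree below). Assuming it pins the versions above, a minimal environment setup could be:

```sh
# Sketch: install dependencies; assumes requirements.txt in the repo root
# pins PyTorch >= 1.0 and argparse as listed above.
pip3 install -r requirements.txt
```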


## Prepare

* Download the ``google_model.bin`` from [here](https://share.weiyun.com/5GuzfVX), and save it to the ``models/`` directory.
* Download the ``CnDbpedia.spo`` from [here](https://share.weiyun.com/5BvtHyO), and save it to the ``brain/kgs/`` directory (the triple format is sketched below).
* Optional: download the datasets for evaluation from [here](https://share.weiyun.com/5Id9PVZ), unzip them, and place them in the ``datasets/`` directory.
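
Each ``*.spo`` file stores one knowledge triple per line. A minimal sketch of loading such a file into a subject-indexed lookup follows; the tab-separated subject/predicate/object column order and the helper name are assumptions, not the repository's API:

```python
# Sketch: load a .spo file into a subject -> [(predicate, object)] map.
# Assumes one tab-separated "subject\tpredicate\tobject" triple per line (UTF-8).
from collections import defaultdict

def load_spo(path):
    lookup = defaultdict(list)
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.strip().split("\t")
            if len(parts) != 3:
                continue  # skip malformed lines
            subj, pred, obj = parts
            lookup[subj].append((pred, obj))
    return lookup

# Usage: look up the triples attached to an entity found in a sentence.
# kg = load_spo("brain/kgs/CnDbpedia.spo")
# print(kg.get("苹果", []))
```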

The directory tree of K-BERT:
```
K-BERT
├── brain
│   ├── config.py
│   ├── __init__.py
│   ├── kgs
│   │   ├── CnDbpedia.spo
│   │   ├── HowNet.spo
│   │   └── Medical.spo
│   └── knowgraph.py
├── datasets
│   ├── book_review
│   │   ├── dev.tsv
│   │   ├── test.tsv
│   │   └── train.tsv
│   ├── chnsenticorp
│   │   ├── dev.tsv
│   │   ├── test.tsv
│   │   └── train.tsv
│ ...
├── models
│   ├── google_config.json
│   ├── google_model.bin
│   └── google_vocab.txt
├── outputs
├── uer
├── README.md
├── requirements.txt
├── run_kbert_cls.py
└── run_kbert_ner.py
```


## K-BERT for text classification

### Classification example

Run an example on the Book review dataset with CnDbpedia:
```sh
CUDA_VISIBLE_DEVICES='0' nohup python3 -u run_kbert_cls.py \
--pretrained_model_path ./models/google_model.bin \
--config_path ./models/google_config.json \
--vocab_path ./models/google_vocab.txt \
--train_path ./datasets/book_review/train.tsv \
--dev_path ./datasets/book_review/dev.tsv \
--test_path ./datasets/book_review/test.tsv \
--epochs_num 5 --batch_size 32 --kg_name CnDbpedia \
--output_model_path ./outputs/kbert_bookreview_CnDbpedia.bin \
> ./outputs/kbert_bookreview_CnDbpedia.log &
```
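
Because the command is launched in the background with ``nohup``, training progress can be followed in the log it writes:

```sh
# Follow the training log produced by the command above.
tail -f ./outputs/kbert_bookreview_CnDbpedia.log
```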

Results:
```
Best accuracy in dev : 88.80%
Best accuracy in test: 87.69%
```

Options of ``run_kbert_cls.py``:
```
usage: [--pretrained_model_path] - Path to the pre-trained model parameters.
       [--config_path] - Path to the model configuration file.
       [--vocab_path] - Path to the vocabulary file.
       --train_path - Path to the training dataset.
       --dev_path - Path to the validation dataset.
       --test_path - Path to the test dataset.
       [--epochs_num] - The number of training epochs.
       [--batch_size] - Batch size for training.
       [--kg_name] - The name of the knowledge graph: "HowNet", "CnDbpedia", or "Medical".
       [--output_model_path] - Path to the output model.
```

### Classification benchmarks

Accuracy (dev/test %) on different datasets:
| Dataset | HowNet | CnDbpedia |
| :----- | :----: | :----: |
| Book review | 88.75/87.75 | 88.80/87.69 |
| ChnSentiCorp | 95.00/95.50 | 94.42/95.25 |
| Shopping | 97.01/96.92 | 96.94/96.73 |
| Weibo | 98.22/98.33 | 98.29/98.33 |
| LCQMC | 88.97/87.14 | 88.91/87.20 |
| XNLI | 77.11/77.07 | 76.99/77.43 |


## K-BERT for named entity recognition (NER)

### NER example

Run an example on the msra_ner dataset with CnDbpedia:
```sh
CUDA_VISIBLE_DEVICES='0' nohup python3 -u run_kbert_ner.py \
--pretrained_model_path ./models/google_model.bin \
--config_path ./models/google_config.json \
--vocab_path ./models/google_vocab.txt \
--train_path ./datasets/msra_ner/train.tsv \
--dev_path ./datasets/msra_ner/dev.tsv \
--test_path ./datasets/msra_ner/test.tsv \
--epochs_num 5 --batch_size 16 --kg_name CnDbpedia \
--output_model_path ./outputs/kbert_msraner_CnDbpedia.bin \
> ./outputs/kbert_msraner_CnDbpedia.log &
```

Results:
```
The best in dev : precision=0.957, recall=0.962, f1=0.960
The best in test: precision=0.953, recall=0.959, f1=0.956
```

Options of ``run_kbert_ner.py``:
```
usage: [--pretrained_model_path] - Path to the pre-trained model parameters.
       [--config_path] - Path to the model configuration file.
       [--vocab_path] - Path to the vocabulary file.
       --train_path - Path to the training dataset.
       --dev_path - Path to the validation dataset.
       --test_path - Path to the test dataset.
       [--epochs_num] - The number of training epochs.
       [--batch_size] - Batch size for training.
       [--kg_name] - The name of the knowledge graph.
       [--output_model_path] - Path to the output model.
```


## K-BERT for domain-specific tasks

Experimental results on domain-specific tasks (Precision/Recall/F1):
| KG | Finance_QA | Law_QA | Finance_NER | Medicine_NER |
| :----- | :----: | :----: | :----: | :----: |
| HowNet | 0.805/0.888/0.845 | 0.842/0.903/0.871 | 0.860/0.888/0.874 | 0.935/0.939/0.937 |
| CN-DBpedia | 0.814/0.881/0.846 | 0.814/0.942/0.874 | 0.860/0.887/0.873 | 0.935/0.937/0.936 |
| MedicalKG | -- | -- | -- | 0.944/0.943/0.944 |


## Acknowledgement

This work is a joint study supported by Peking University and Tencent Inc.

If you use this code, please cite this paper:
```
@inproceedings{weijie2019kbert,
title={{K-BERT}: Enabling Language Representation with Knowledge Graph},
  author={Weijie Liu and Peng Zhou and Zhe Zhao and Zhiruo Wang and Qi Ju and Haotang Deng and Ping Wang},
booktitle={Proceedings of AAAI 2020},
year={2020}
}
```


2 changes: 2 additions & 0 deletions brain/__init__.py
@@ -0,0 +1,2 @@
# coding: utf-8
from brain.knowgraph import KnowledgeGraph
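
``brain/__init__.py`` re-exports ``KnowledgeGraph`` as the package's public entry point. As a rough sketch of how the run scripts appear to use it — the constructor arguments and the ``add_knowledge_with_vm`` method below are assumptions inferred from this commit, not a documented API:

```python
# Sketch (assumed API): inject KG triples into sentences before encoding,
# producing soft-position ids and a visible matrix.
from brain import KnowledgeGraph

# Assumption: names like "CnDbpedia" are resolved through brain.config.KGS.
kg = KnowledgeGraph(spo_files=["CnDbpedia"], predicate=True)

# Assumption: returns knowledge-expanded tokens, soft-position ids,
# the visible matrix, and segment tags for each sentence in the batch.
tokens, soft_positions, visible_matrix, segments = kg.add_knowledge_with_vm(
    ["[CLS] 苹果是一家科技公司"]
)
```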
28 changes: 28 additions & 0 deletions brain/config.py
@@ -0,0 +1,28 @@
import os


FILE_DIR_PATH = os.path.dirname(os.path.abspath(__file__))

KGS = {
'HowNet': os.path.join(FILE_DIR_PATH, 'kgs/HowNet.spo'),
'CnDbpedia': os.path.join(FILE_DIR_PATH, 'kgs/CnDbpedia.spo'),
'Medical': os.path.join(FILE_DIR_PATH, 'kgs/Medical.spo'),
}

# Cap on the number of KG entities injected per matched token.
MAX_ENTITIES = 2

# Special token words.
PAD_TOKEN = '[PAD]'
UNK_TOKEN = '[UNK]'
CLS_TOKEN = '[CLS]'
SEP_TOKEN = '[SEP]'
MASK_TOKEN = '[MASK]'
ENT_TOKEN = '[ENT]'
SUB_TOKEN = '[SUB]'
PRE_TOKEN = '[PRE]'
OBJ_TOKEN = '[OBJ]'

NEVER_SPLIT_TAG = [
PAD_TOKEN, UNK_TOKEN, CLS_TOKEN, SEP_TOKEN, MASK_TOKEN,
ENT_TOKEN, SUB_TOKEN, PRE_TOKEN, OBJ_TOKEN
]
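
These markers must survive tokenization as single units, which is what ``NEVER_SPLIT_TAG`` is for. A minimal illustration using the HuggingFace ``BertTokenizer`` — an assumption for demonstration only, since this repo ships its own tokenizer under ``uer/``:

```python
# Sketch: keep K-BERT's special markers atomic during tokenization.
# Uses HuggingFace transformers for illustration; not the tokenizer
# this repository actually ships.
from transformers import BertTokenizer
from brain.config import NEVER_SPLIT_TAG

tokenizer = BertTokenizer.from_pretrained(
    "bert-base-chinese",
    never_split=NEVER_SPLIT_TAG,  # e.g. [ENT], [SUB] stay single tokens
)
print(tokenizer.tokenize("[ENT] 苹果 [SUB]"))
```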
1 change: 1 addition & 0 deletions brain/kgs/.gitignore
@@ -0,0 +1 @@
CnDbpedia.spo