The Business Scene Dialogue corpus

©2020, The University of Tokyo

Updates

November 10, 2021: Further fix for the speaker information.
November 2, 2021: The data are updated by fixing incorrect speaker information and some misspellings in the conversation text.

Corpus Description

The Business Scene Dialogue (BSD) corpus is a Japanese-English business conversation corpus. It was constructed in three steps: 1) selecting business scenes, 2) writing monolingual conversation scenarios for the selected scenes, and 3) translating the scenarios into the other language. Half of the monolingual scenarios were written in Japanese and the other half in English. To ensure that the conversations are natural, the whole construction process was supervised by a person who meets the following conditions:

  • has the experience of being engaged in language learning programs, especially for business conversations
  • is able to smoothly communicate with others in various business scenes both in Japanese and English
  • has the experience of being involved in business

We provide balanced training, development, and evaluation splits of the BSD corpus. The documents in these sets are balanced in terms of scenes and original languages. In this repository we publicly share the full development and evaluation sets and part of the training set.

|           | Training | Development | Evaluation |
|-----------|----------|-------------|------------|
| Sentences | 20,000   | 2,051       | 2,120      |
| Scenarios | 670      | 69          | 69         |

Corpus Statistics

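| Data Set    | Scene            | JA-EN Scenarios | JA-EN Sentences | EN-JA Scenarios | EN-JA Sentences |
|-------------|------------------|-----------------|-----------------|-----------------|-----------------|
| Training    | Face-to-face     | 122             | 3525            | 103             | 2986            |
|             | Phone call       | 68              | 1944            | 75              | 2175            |
|             | General chatting | 61              | 1915            | 72              | 1883            |
|             | Meeting          | 56              | 1964            | 58              | 1787            |
|             | Training         | 12              | 562             | 19              | 463             |
|             | Presentation     | 6               | 607             | 18              | 189             |
|             | Total            | 325             | 10,000          | 345             | 10,000          |
| Development | Face-to-face     | 11              | 319             | 12              | 314             |
|             | Phone call       | 6               | 176             | 7               | 185             |
|             | General chatting | 7               | 223             | 8               | 248             |
|             | Meeting          | 7               | 240             | 7               | 219             |
|             | Training         | 1               | 40              | 1               | 23              |
|             | Presentation     | 1               | 31              | 1               | 33              |
|             | Total            | 34              | 997             | 35              | 1054            |
| Evaluation  | Face-to-face     | 12              | 381             | 11              | 345             |
|             | Phone call       | 6               | 163             | 7               | 212             |
|             | General chatting | 7               | 211             | 8               | 212             |
|             | Meeting          | 7               | 228             | 7               | 229             |
|             | Training         | 1               | 38              | 1               | 30              |
|             | Presentation     | 1               | 31              | 1               | 40              |
|             | Total            | 34              | 1052            | 35              | 1068            |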
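
Counts like the ones above can be recomputed from the data itself. The following is a minimal sketch, assuming each document carries the `tag`, `original_language`, and `conversation` fields shown in the Corpus Structure section. The function name `tally_by_scene` and the sample documents (including the tag value "meeting") are made up for illustration and are not part of the corpus.

```python
from collections import Counter


def tally_by_scene(documents):
    """Count scenarios and sentence pairs per (original_language, tag)."""
    scenarios = Counter()
    sentences = Counter()
    for doc in documents:
        key = (doc["original_language"], doc["tag"])
        scenarios[key] += 1                        # one document = one scenario
        sentences[key] += len(doc["conversation"])  # one entry = one sentence pair
    return scenarios, sentences


# Hypothetical miniature sample, NOT real corpus data.
sample = [
    {"original_language": "en", "tag": "training", "conversation": [{}, {}]},
    {"original_language": "en", "tag": "training", "conversation": [{}]},
    {"original_language": "ja", "tag": "meeting", "conversation": [{}, {}, {}]},
]

scenarios, sentences = tally_by_scene(sample)
for key in sorted(scenarios):
    print(key, scenarios[key], sentences[key])
```

Running the same function over a loaded split should give per-scene counts that can be compared with the table.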

Corpus Structure

The corpus is stored in JSON format as a list of documents. Each document has an identifier (id), the scene of the scenario (tag), the title of the scenario (title), the original language (original_language), and a list of sentence pairs (conversation). Each sentence pair has a sentence number (no), the speaker name in English and Japanese (en_speaker, ja_speaker), and the text in English and Japanese (en_sentence, ja_sentence).

[
    {
        "id": "190315_E001_17",
        "tag": "training",
        "title": "Training: How to do research",
        "original_language": "en",
        "conversation": [
            {
                "no": 1,
                "en_speaker": "Mr. Ben Sherman",
                "ja_speaker": "ベン シャーマンさん",
                "en_sentence": "I will be teaching you how to conduct research today.",
                "ja_sentence": "今日は調査の進め方についてトレーニングします。"
            },
            ...
        ]
    },
    ...
]
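
Given the structure above, one way to read a split and extract sentence pairs in translation direction is sketched below. This is an illustration rather than an official loader: the helper names `load_documents` and `translation_pairs` are hypothetical, the file path is whatever JSON file you downloaded, and treating `original_language` as the source side is an assumption about how you want to use the data.

```python
import json


def load_documents(path):
    """Load one corpus split (a JSON list of documents) from a local file."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def translation_pairs(document):
    """Yield (source, target) sentences, source being the original language."""
    if document["original_language"] == "en":
        src, tgt = "en", "ja"
    else:
        src, tgt = "ja", "en"
    for turn in document["conversation"]:
        yield turn[f"{src}_sentence"], turn[f"{tgt}_sentence"]


# The example document from this README, with the elided turns left out.
sample_doc = {
    "id": "190315_E001_17",
    "tag": "training",
    "title": "Training: How to do research",
    "original_language": "en",
    "conversation": [
        {
            "no": 1,
            "en_speaker": "Mr. Ben Sherman",
            "ja_speaker": "ベン シャーマンさん",
            "en_sentence": "I will be teaching you how to conduct research today.",
            "ja_sentence": "今日は調査の進め方についてトレーニングします。",
        }
    ],
}

pairs = list(translation_pairs(sample_doc))
for source, target in pairs:
    print(source, "->", target)
```

For a full split, call `load_documents` on the downloaded file and apply `translation_pairs` to each document in the returned list.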

License

Our dataset is released under the Creative Commons Attribution-NonCommercial-ShareAlike (CC BY-NC-SA) license.

Reference

If you use this dataset, please cite the following paper: Matīss Rikters, Ryokan Ri, Tong Li, and Toshiaki Nakazawa (2019). "Designing the Business Conversation Corpus." In Proceedings of the 6th Workshop on Asian Translation, 2019.

@inproceedings{rikters-etal-2019-designing,
    title = "Designing the Business Conversation Corpus",
    author = "Rikters, Mat{\=\i}ss  and
      Ri, Ryokan  and
      Li, Tong  and
      Nakazawa, Toshiaki",
    booktitle = "Proceedings of the 6th Workshop on Asian Translation",
    month = nov,
    year = "2019",
    address = "Hong Kong, China",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/D19-5204",
    doi = "10.18653/v1/D19-5204",
    pages = "54--61"
}

Acknowledgements

This work was supported by "Research and Development of Deep Learning Technology for Advanced Multilingual Speech Translation", the Commissioned Research of National Institute of Information and Communications Technology (NICT), JAPAN.