llm-japanese-dataset v0: Building a Japanese chat dataset for large-scale language models [in Japanese]

Masanori HIRANO, Masahiro SUZUKI, Hiroki SAKAJI

Special Interest Group on Natural Language Processing, Information Processing Society of Japan, Sep. 1, 2023


Conference

Special Interest Group on Natural Language Processing, Information Processing Society of Japan (SIG-NL, IPSJ)

Abstract

This study constructed a Japanese chat dataset for large language models. The dataset contains approximately 8.4 million records and covers a variety of tasks in chat format, such as translation and knowledge tasks. To assess the benefits of the dataset, we tuned an existing large language model on it and evaluated its performance qualitatively. The results revealed challenges in building large language models, and the language resources they require, in Japanese.

Keywords

Large Language Model; Dataset; Japanese; Chat


Paper

Official page


bibtex

@inproceedings{Hirano2023-signl257,
  title={{llm-japanese-dataset v0: Building a Japanese chat dataset for large-scale language models [in Japanese]}},
  author={Masanori HIRANO and Masahiro SUZUKI and Hiroki SAKAJI},
  booktitle={Special Interest Group on Natural Language Processing, Information Processing Society of Japan},
  url={http://id.nii.ac.jp/1001/00227482/},
  year={2023}
}