llm-japanese-dataset v0: Construction of Japanese Chat Dataset for Large Language Models and its Methodology

Masanori HIRANO, Masahiro SUZUKI, Hiroki SAKAJI

The 26th International Conference on Network-Based Information Systems, pp. 442-454, Sep. 6, 2023


Conference

The 12th International Workshop on Web Services and Social Media (WSSM-2023) in The 26th International Conference on Network-Based Information Systems (NBiS-2023)

Abstract

This study constructed a Japanese chat dataset for tuning large language models (LLMs), consisting of about 8.4 million records. LLMs have recently been developed and are gaining popularity, but high-performing LLMs are usually built mainly for English. There are two ways to make such LLMs support languages other than English: constructing LLMs from scratch or tuning existing models; in either case, datasets are an essential component. In this study, we focused on Japanese support in LLMs and constructed a dataset for training or tuning LLMs in Japanese. The dataset consists of various tasks, such as translation and knowledge tasks. In our experiment, we tuned an existing LLM with our dataset and evaluated the performance qualitatively. The results suggest that our dataset is potentially beneficial for LLMs. However, we also revealed some difficulties in constructing LLMs in languages other than English.
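
For readers who want to try the dataset, the sketch below shows one way it might be loaded and turned into instruction-tuning prompts with the Hugging Face datasets library. The repository ID izumi-lab/llm-japanese-dataset and the field names instruction, input, and output are assumptions about the released data, not details stated on this page.

from datasets import load_dataset

# Assumed repository ID; the actual location of the release may differ.
dataset = load_dataset("izumi-lab/llm-japanese-dataset", split="train")

def format_record(record: dict) -> str:
    # Assumed Alpaca-style fields: "instruction", optional "input",
    # and "output". Concatenate them into a single training example.
    prompt = record["instruction"]
    if record.get("input"):
        prompt += "\n" + record["input"]
    return prompt + "\n" + record["output"]

print(format_record(dataset[0]))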

Keywords

Large Language Model; Dataset; Japanese; Chat


Paper

arXiv:2305.12720 (doi.org/10.48550/arXiv.2305.12720), ssrn.com/abstract=4454626 (doi.org/10.2139/ssrn.4454626)

doi

10.1007/978-3-031-40978-3_47


bibtex

@inproceedings{Hirano2023-nbis,
  title={{llm-japanese-dataset v0: Construction of Japanese Chat Dataset for Large Language Models and its Methodology}},
  author={Masanori HIRANO and Masahiro SUZUKI and Hiroki SAKAJI},
  booktitle={The 26th International Conference on Network-Based Information Systems},
  pages={442--454},
  doi={10.1007/978-3-031-40978-3_47},
  archivePrefix={arXiv},
  arxivId={2305.12720},
  year={2023}
}