FinDeBERTaV2: Word-Segmentation-Free Pre-trained Language Model for Finance [in Japanese]

Masahiro Suzuki, Hiroki Sakaji, Masanori Hirano, Kiyoshi Izumi

Transactions of the Japanese Society for Artificial Intelligence, vol.39, no.4, p. FIN23-G_1-14, 2024

Abstract

The volume of financial documents grows year by year, and natural language processing (NLP) techniques are widely applied to process them. In particular, Transformer-based pre-trained models such as BERT have been successful in NLP in recent years. These models have been adapted to the financial domain by pre-training on financial corpora, but most financial pre-trained models are BERT-based and do not benefit from the mechanisms of more recent state-of-the-art models. In addition, many Japanese models require word segmentation based on morphological analysis as a preprocessing step, which may reduce portability and processing efficiency. In this study, we show that models without pre-word segmentation perform no worse than models with pre-word segmentation in both the financial and general domains, and that they may benefit from reduced computational complexity and higher processing efficiency because they produce fewer tokens. The length of tokens had little effect on performance in the downstream classification tasks. We build large-scale general and financial models without pre-word segmentation. We show that our financial pre-trained models outperform conventional models on classification tasks in the financial domain, and that the general models can serve as good baselines for adaptation to specialized domains. Our evaluation experiments show that additional pre-training is effective because it builds on the performance of models trained on large general corpora. We have released the pre-trained models at https://huggingface.co/izumi-lab.
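
The link between token count and computational cost mentioned in the abstract can be illustrated with a rough sketch. It assumes the standard property that full self-attention in a Transformer computes one score per token pair, so its cost grows with the square of the sequence length. The function name and the token counts below are hypothetical and are not figures from the paper.

```python
def relative_attention_cost(tokens_a: int, tokens_b: int) -> float:
    """Ratio of pairwise attention-score counts for two sequence lengths.

    Assumes full self-attention, where the number of scores is n * n.
    """
    if tokens_a <= 0 or tokens_b <= 0:
        raise ValueError("token counts must be positive")
    return (tokens_a ** 2) / (tokens_b ** 2)


# Hypothetical token counts for one sentence (illustrative only).
with_presegmentation = 40
without_presegmentation = 32

ratio = relative_attention_cost(without_presegmentation, with_presegmentation)
print(f"relative attention cost: {ratio:.2f}")  # prints "relative attention cost: 0.64"
```

Under this assumption, a 20% reduction in tokens would cut the attention-score computation by roughly a third. Other parts of the model scale linearly with token count, so the overall saving would be smaller than this ratio suggests.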

Keywords

language model; domain-specific pre-training; financial market; natural language processing

DOI

10.1527/tjsai.39-4_FIN23-G

BibTeX

@article{Suzuki2024-tjsai,
  title={{FinDeBERTaV2: Word-Segmentation-Free Pre-trained Language Model for Finance [in Japanese]}},
  author={Masahiro Suzuki and Hiroki Sakaji and Masanori Hirano and Kiyoshi Izumi},
  journal={Transactions of the Japanese Society for Artificial Intelligence},
  volume={39},
  number={4},
  pages={FIN23-G_1-14},
  doi={10.1527/tjsai.39-4_FIN23-G},
  year={2024}
}