Transactions of the Japanese Society for Artificial Intelligence (人工知能学会論文誌), vol. 39, no. 4, pp. FIN23-G_1-14, 2024
The volume of financial documents grows year by year, and natural language processing (NLP) techniques are widely applied to process them. In particular, Transformer-based pre-trained models such as BERT have been successful in NLP in recent years. These models have been adapted to the financial domain by pre-training on financial corpora; however, most financial pre-trained models are BERT-based and do not benefit from the mechanisms of more recent state-of-the-art architectures. In addition, many Japanese models require word segmentation based on morphological analysis before tokenization, which can reduce portability and processing efficiency. In this study, we show that models without pre-word segmentation suffer no performance disadvantage compared with models that use it, in both the financial and general domains, while they may benefit from lower computational cost and higher processing efficiency because they produce fewer tokens. Token length had little effect on performance in the downstream classification tasks. We build large-scale general and financial models without pre-word segmentation, and show that our financial pre-trained models outperform conventional models on classification tasks in the financial domain and that the general models serve as good baselines for adaptation to specialized domains. Our evaluation experiments also show that additional pre-training is effective because it leverages the performance of models built from large general corpora. We have released the pre-trained models at https://huggingface.co/izumi-lab.
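As a usage note, the released models can presumably be loaded with the Hugging Face transformers library and applied directly to raw Japanese text, since no pre-word segmentation (morphological analysis) step is required. The sketch below is illustrative only: the model identifier is an assumption, and the actual published names should be taken from https://huggingface.co/izumi-lab.

```python
# Minimal sketch: loading a word-segmentation-free Japanese model from the hub.
# The model id below is an assumption for illustration; see
# https://huggingface.co/izumi-lab for the actual released model names.
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_name = "izumi-lab/deberta-v2-base-japanese"  # assumed id; check the hub page

# The tokenizer operates directly on raw Japanese text, with no prior
# morphological-analysis-based word segmentation.
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

text = "金融市場の動向を分析する。"  # "Analyze trends in the financial market."
inputs = tokenizer(text, return_tensors="pt")
print(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]))  # subword tokens

outputs = model(**inputs)  # masked-LM logits; fine-tune for classification tasks
```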
language model; domain-specific pre-training; financial market; natural language processing
@article{Suzuki2024-tjsai,
  title={{FinDeBERTaV2: 単語分割フリーな金融事前学習言語モデル}},
  author={鈴木 雅弘 and 坂地 泰紀 and 平野 正徳 and 和泉 潔},
  journal={人工知能学会論文誌},
  volume={39},
  number={4},
  pages={FIN23-G_1-14},
  doi={10.1527/tjsai.39-4_FIN23-G},
  year={2024}
}