Information Processing & Management, vol. 60, no. 2, e103194, 2023
The application of natural language processing (NLP) to the financial domain is advancing as the number of available financial documents increases. Transformer-based models such as Bidirectional Encoder Representations from Transformers (BERT) have been highly successful in NLP in recent years. These cutting-edge models have been adapted to the financial domain both by further pre-training existing pre-trained models on financial corpora and by pre-training on financial corpora from scratch. In Japanese, by contrast, financial terminology cannot be handled with a general-purpose vocabulary without additional processing, because Japanese text is not segmented by whitespace and tokenization therefore depends heavily on the vocabulary. In this study, we construct language models suited to the financial domain. Furthermore, we compare methods for adapting language models to the financial domain, such as pre-training methods and vocabulary adaptation. We confirm that adapting the pre-training corpus and the tokenizer vocabulary to financial text is effective on several downstream financial tasks. No significant difference is observed between pre-training from scratch on the financial corpus and continued pre-training of a general language model on the financial corpus. We have released our source code and pre-trained models.
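As a concrete illustration of the two adaptation strategies compared above, the following minimal sketch (not the authors' released code) shows how a WordPiece vocabulary can be learned from a financial corpus and how masked-language-model (MLM) pre-training can then be run either from scratch or as continued pre-training from a general checkpoint, using the Hugging Face transformers and tokenizers libraries. The corpus path, vocabulary size, and the stand-in checkpoint name are illustrative assumptions; for Japanese text, a morphological pre-segmentation step (e.g., with MeCab), omitted here for brevity, is typically required before WordPiece training.

from tokenizers import BertWordPieceTokenizer
from transformers import (
    BertConfig,
    BertForMaskedLM,
    BertTokenizerFast,
    DataCollatorForLanguageModeling,
    LineByLineTextDataset,
    Trainer,
    TrainingArguments,
)

CORPUS = "financial_corpus.txt"  # assumed path: one sentence per line

# Vocabulary adaptation: learn a WordPiece vocabulary from the financial
# corpus so that domain terms are not over-fragmented into subwords.
wp = BertWordPieceTokenizer()
wp.train(files=[CORPUS], vocab_size=32000)  # vocab size is an assumption
wp.save_model(".")  # writes ./vocab.txt
fin_tokenizer = BertTokenizerFast(vocab_file="vocab.txt")

# Option A: pre-train from scratch with the financial vocabulary.
config = BertConfig(vocab_size=fin_tokenizer.vocab_size)
model = BertForMaskedLM(config)
tokenizer = fin_tokenizer

# Option B: continued pre-training from a general-domain checkpoint,
# keeping its original vocabulary. ("bert-base-uncased" is only a
# stand-in; the paper works with Japanese models.)
# tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
# model = BertForMaskedLM.from_pretrained("bert-base-uncased")

# MLM pre-training on the financial corpus: 15% of tokens are masked
# and the model is trained to recover them.
dataset = LineByLineTextDataset(tokenizer=tokenizer, file_path=CORPUS, block_size=128)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-fin"),
    data_collator=collator,
    train_dataset=dataset,
)
trainer.train()

Either option yields a checkpoint that can be fine-tuned on downstream financial tasks in the usual way; the paper's finding is that, given the financial corpus, the two starting points perform comparably.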
Language models; Domain-specific pre-training; Financial market; Natural language processing
@article{Suzuki2023-ipm,
  title   = {{Constructing and Analyzing Domain-Specific Language Model for Financial Text Mining}},
  author  = {Suzuki, Masahiro and Sakaji, Hiroki and Hirano, Masanori and Izumi, Kiyoshi},
  journal = {Information Processing \& Management},
  issn    = {0306-4573},
  volume  = {60},
  number  = {2},
  pages   = {e103194},
  doi     = {10.1016/j.ipm.2022.103194},
  year    = {2023}
}