Dept. of Computer Science, KU Leuven; Leuven.AI
IDLab (Internet and Data Science Lab), Ghent University - imec
Pre-training large transformer-based language models on gigantic corpora and later repurposing them as base models for fine-tuning on downstream tasks has proven instrumental to recent advances in computational linguistics. However, the prohibitively high cost of pre-training often prevents base models from being updated regularly to incorporate the latest linguistic developments. To address this issue, we present an approach for efficiently producing more powerful and up-to-date versions of RobBERT, our family of cutting-edge Dutch language models, by leveraging existing language models built for high-resource languages. With RobBERT-2023, we deliver a Dutch tokenizer newly trained on the latest version of the Dutch OSCAR corpus. This corpus incorporates new high-frequency terms, such as those related to the COVID-19 pandemic, cryptocurrencies, and the ongoing energy crisis, while mitigating the inclusion of previously over-represented terms from adult-oriented content. Unlike prior versions of RobBERT, which followed the training methodology of RoBERTa but required training from a fresh weight initialization, RobBERT-2023 is initialized entirely from the RoBERTa-large model. To initialize an embedding table tailored to the newly devised Dutch tokenizer, we rely on the token translation strategy introduced by Remy et al. (2023). To assess the efficacy of RobBERT-2023, we evaluate its performance on the same benchmarks used for the state-of-the-art RobBERT-2022 model. Our experimental results show that RobBERT-2023 not only surpasses its predecessor in various respects but also achieves these gains at a significantly reduced training cost. This work represents a significant step forward in keeping Dutch language models up-to-date and demonstrates the potential of model conversion techniques for reducing the environmental footprint of NLP research.
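As a rough illustration of how a token-translation-based embedding initialization can work, the sketch below re-initializes a new (Dutch) embedding table from a pretrained source embedding table by averaging the source embeddings of translated tokens. The vocabularies, the translation table, and the fallback rule are toy, hypothetical stand-ins chosen for illustration only; the actual strategy of Remy et al. (2023) derives its token mapping from bilingual resources and is not reproduced here.

# Illustrative sketch, not the authors' implementation: toy vocabularies and a
# hypothetical Dutch-to-English token mapping stand in for the real pipeline.
import numpy as np

rng = np.random.default_rng(0)

# Toy "source" (English RoBERTa-like) vocabulary with pretrained embeddings.
source_vocab = {"house": 0, "tree": 1, "the": 2, "work": 3}
source_embeddings = rng.normal(size=(len(source_vocab), 8))  # shape (V_src, d)

# Toy "target" (new Dutch tokenizer) vocabulary whose embeddings we initialize.
target_vocab = {"huis": 0, "boom": 1, "de": 2, "werkt": 3, "blockchain": 4}

# Hypothetical token translation table: each Dutch token maps to one or more
# English tokens whose pretrained embeddings are averaged.
token_translation = {
    "huis": ["house"],
    "boom": ["tree"],
    "de": ["the"],
    "werkt": ["work"],
    # "blockchain" has no mapping and falls back to the mean source embedding.
}

d = source_embeddings.shape[1]
fallback = source_embeddings.mean(axis=0)
target_embeddings = np.empty((len(target_vocab), d))

for token, idx in target_vocab.items():
    mapped = token_translation.get(token, [])
    rows = [source_embeddings[source_vocab[t]] for t in mapped if t in source_vocab]
    target_embeddings[idx] = np.mean(rows, axis=0) if rows else fallback

print(target_embeddings.shape)  # (5, 8): one initial vector per Dutch token

The intuition behind such a scheme is that most tokens of the new tokenizer start from semantically related source embeddings rather than random vectors, which is what allows the converted model to be adapted at the much lower training cost reported above.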