University of Groningen
We introduce the Dutch Model Benchmark: DUMB. The benchmark includes a diverse set of datasets for low-, medium- and high-resource tasks. The total set of nine tasks includes four tasks that were previously not available in Dutch. These new tasks are Dutch Words in Context (WiC-NL), Dutch Pronoun Resolution (DPR), Dutch Choice of Plausible Alternatives (COPA-NL) and Dutch Question Answering (SQuAD-NL). The new tasks were created by converting existing annotations and by translating from English with manual post-correction. Moreover, we include Part-of-Speech Tagging (Lassy Small), Named Entity Recognition (SoNaR-1), Natural Language Inference (SICK-NL), Sentiment Analysis (DBRD) and Abusive Language Detection (DALC). The full set of tasks is balanced across word-, sentence- and document-level tasks, which makes DUMB more general than popular English benchmarks like GLUE.
Instead of relying on a mean score across tasks, we propose Relative Error Reduction (RER), which compares the performance of models on DUMB to that of a strong baseline. The baseline serves as a fixed reference point that can be reused in future evaluations, even when different sets of models are assessed. This approach acknowledges that a small improvement on a task where scores are already high can be a better indicator of model quality than the same absolute improvement on a task with low scores.
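As a concrete illustration, the following is one natural formalization of RER, assuming task scores on a 0-100 scale and per-task error defined as 100 minus the score; this is a sketch consistent with the description above, not necessarily the exact definition given in the paper body:

```latex
% Sketch of RER for model m on task t (assumption: scores on a 0-100 scale,
% error = 100 - score; "baseline" is the fixed reference model mentioned above).
\[
\mathrm{RER}(m, t) = \frac{E_{\text{baseline}}(t) - E_{m}(t)}{E_{\text{baseline}}(t)},
\qquad E_{m}(t) = 100 - \mathrm{score}_{m}(t)
\]
```

Under this formulation, raising a task score from 95 to 96 yields an RER of 20% (error drops from 5 to 4), whereas raising a score from 50 to 51 yields only 2% (error drops from 50 to 49), which reflects the motivation stated above.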
Through a comparison of 14 pre-trained models (monolingual and multilingual, of varying sizes), we assess the internal consistency of the benchmark tasks, as well as the factors that likely enable high performance. Our results indicate that current Dutch monolingual models underperform and suggest training larger Dutch models with different architectures and pre-training objectives. At present, the highest performance is achieved by DeBERTaV3 (large), XLM-R (large) and mDeBERTaV3 (base). Regression analyses suggest that monolingual models could achieve much higher performance than current ones by increasing model size and adopting architectures such as DeBERTaV3. Moreover, our robust performance estimates, based on fine-tuning encoder models with grid search and multiple runs, can serve as a strong baseline for evaluating generative models, whose results depend on more variable prompt design and output interpretation.
In addition to highlighting the best strategies for training larger Dutch models, DUMB will foster further research on Dutch. A public leaderboard will be made available online.