Evaluating Dutch Language Models Using a Dutch Variant of the SimLex-999 Dataset

Lizzy Brans and Jelke Bloem

Faculty of Humanities, Universiteit van Amsterdam

Word embeddings have transformed the landscape of natural language processing (NLP): they represent words as dense vectors, providing a means to capture complex semantic relationships in a computational format. Despite this advancement, resources for evaluating word embeddings remain disproportionately skewed towards English, leaving a significant deficit for Dutch. This study addressed that gap by creating and validating a Dutch adaptation of the well-established SimLex-999 dataset.

We validated this Dutch version of the SimLex-999 dataset by collecting semantic similarity judgments from 235 native Dutch speakers. Participants rated word pairs on a continuous sliding scale, allowing a nuanced reflection of the semantic similarity between words, and we aggregated these ratings into a single similarity score per pair. We assessed the reliability of the newly constructed dataset using the intraclass correlation coefficient (ICC), a widely recognised statistical measure of agreement among raters. Although the single-rater ICC values ranged from fair (0.26) to good (0.59), indicating some variability in individual rating consistency, the average-rater ICCs showed a high level of consistency (0.84 to 0.96), corroborating the reliability of the dataset. Analysis of the dataset also revealed substantial semantic overlap with the original English and the German versions of SimLex-999: Spearman correlations for the Dutch-English and Dutch-German comparisons were 0.7487 and 0.7478, respectively.
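As a concrete illustration, the sketch below shows how such reliability and cross-lingual checks can be computed in Python with pingouin and SciPy. The file names and the `rater`, `pair`, and `score` columns are illustrative assumptions rather than the study's actual pipeline.

```python
# Minimal sketch (not the study's actual pipeline): reliability and
# cross-lingual checks for a SimLex-style dataset. File and column names
# ("rater", "pair", "score") are illustrative assumptions.
import pandas as pd
import pingouin as pg
from scipy.stats import spearmanr

# Long-format ratings: one row per (rater, word pair) judgment.
ratings = pd.read_csv("dutch_simlex_ratings.csv")  # columns: rater, pair, score

# Intraclass correlation: pingouin reports single-rater (ICC1-ICC3) and
# average-rater (ICC1k-ICC3k) estimates in one table.
icc = pg.intraclass_corr(data=ratings, targets="pair",
                         raters="rater", ratings="score")
print(icc[["Type", "ICC"]])

# Aggregate individual judgments into one similarity score per word pair.
dutch_scores = ratings.groupby("pair")["score"].mean()

# Cross-lingual check: Spearman correlation with the original English
# SimLex-999 scores for the corresponding pairs.
english_scores = pd.read_csv("simlex999_english.csv", index_col="pair")["score"]
rho, p = spearmanr(dutch_scores, english_scores.loc[dutch_scores.index])
print(f"Dutch-English Spearman rho = {rho:.4f} (p = {p:.3g})")
```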

We used the validated dataset to evaluate two Dutch language models, BERTje and RobBERT, which yielded intriguing insights into their behaviour. Despite variation across transformer layers and word types, BERTje aligned more closely with human semantic similarity judgments, indicating that its semantic representations are more human-like and making it the preferred choice for tasks requiring human-like semantic comprehension. RobBERT, while performing commendably, showed weaker alignment with human judgments, although different contexts or application requirements could still favour its characteristics.
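The sketch below illustrates one common recipe for this kind of layer-wise evaluation: embed each word of a pair in isolation, mean-pool its subword tokens at a given hidden layer, and correlate the resulting cosine similarities with the human ratings. The Hugging Face checkpoint and the helper functions are assumptions made for illustration; the study's exact procedure may differ.

```python
# Sketch of a layer-wise evaluation: correlate cosine similarities of word
# embeddings with human similarity ratings. One common recipe, not necessarily
# the exact procedure of the study; the checkpoint is the public BERTje model.
import torch
from scipy.stats import spearmanr
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "GroNLP/bert-base-dutch-cased"  # BERTje; RobBERT would be analogous
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()

def word_vector(word: str, layer: int) -> torch.Tensor:
    """Embed a word in isolation; mean-pool its subword tokens at one layer."""
    enc = tokenizer(word, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).hidden_states[layer]  # shape: (1, seq_len, dim)
    return hidden[0, 1:-1].mean(dim=0)              # drop [CLS] and [SEP]

def evaluate_layer(pairs, human_scores, layer):
    """Spearman correlation between model and human similarity judgments."""
    sims = [torch.cosine_similarity(word_vector(w1, layer),
                                    word_vector(w2, layer), dim=0).item()
            for w1, w2 in pairs]
    rho, _ = spearmanr(sims, human_scores)
    return rho

# Example usage (hypothetical data): score every transformer layer.
# pairs = [("oud", "nieuw"), ("slim", "intelligent"), ...]
# human_scores = [1.6, 9.2, ...]
# for layer in range(1, model.config.num_hidden_layers + 1):
#     print(layer, evaluate_layer(pairs, human_scores, layer))
```

Embedding each word in isolation keeps the comparison close to SimLex-999's context-free similarity ratings; the same loop, run with a RobBERT checkpoint, allows the two models to be compared layer by layer.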

A further observation concerned both models' handling of out-of-vocabulary (OOV) words. Although this was not a significant impediment to their overall performance, both BERTje and RobBERT left room for improvement here. Better handling of OOV words could improve the performance, robustness, and generalisation of Dutch language models in the many real-world applications where such words are common.
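One way to quantify this issue is to check which target words a model's tokenizer cannot represent as a single vocabulary item and must split into subword pieces, as in the small sketch below (the checkpoint name and example words are illustrative assumptions).

```python
# Sketch: flag words that the tokenizer cannot represent as a single vocabulary
# item and therefore splits into multiple subword pieces. Checkpoint name and
# example words are illustrative assumptions.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("GroNLP/bert-base-dutch-cased")

def is_split(word: str) -> bool:
    """True if the word is broken into more than one subword token."""
    return len(tokenizer.tokenize(word)) > 1

for word in ["huis", "vrolijk", "zeldzaamheid"]:
    pieces = tokenizer.tokenize(word)
    status = "split into subwords" if is_split(word) else "in vocabulary"
    print(f"{word}: {pieces} ({status})")
```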

In conclusion, this research contributes significantly to Dutch NLP by providing a reliable resource for evaluating Dutch word embeddings, enabling a more nuanced and comprehensive evaluation of these embeddings and fostering the development of effective Dutch language models. The study also identifies the treatment of OOV words as an area where the robustness and utility of Dutch language models in real-world applications could be further improved.

The 33rd Meeting of Computational Linguistics in The Netherlands (CLIN 33)
UAntwerpen City Campus: Building R
Rodestraat 14, Antwerp, Belgium
22 September 2023