Utrecht University
Recent advancements in language models have resulted in increased performance on various language tasks, mostly due to the availability of large datasets for pre-training and fine-tuning. However, acquiring labelled datasets for fine-tuning remains a significant challenge, particularly for low-resource tasks, due to the high cost and the need for expert knowledge. Addressing this problem is crucial for expanding the potential applications of language models and increasing their capabilities in various domains.
This paper explores data augmentation techniques as a potential solution to this problem, with a focus on Dutch low-resource tasks. Although data augmentation methods are commonly used in other fields within the machine learning domain, data augmentation for text is less commonplace. While some methods exist, they provide little to no improvement or are too complex to be applied universally. A newly proposed data augmentation method leverages Large Language Models (LLMs) to generate realistic synthetic data by providing the LLM with a description of the given task and a small number of examples. While early research into this method has been conducted, it focuses only on the English language, whereas we explore the capabilities of LLMs as a data augmentation method for Dutch.
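The augmentation method described above amounts to few-shot prompting: the prompt concatenates a task description with a handful of labelled examples and asks the model to produce new ones. A minimal sketch of such prompt construction is given below; the function name, the `Tekst:`/`Label:` field format, and the Dutch instruction text are illustrative assumptions, not the exact prompt template used in this work.

```python
def build_augmentation_prompt(task_description, examples, n_new=5):
    """Build a few-shot prompt asking an LLM for synthetic labelled examples.

    task_description: a short Dutch description of the task (assumption:
        the prompt is written in the target language).
    examples: list of (text, label) pairs drawn from the low-resource dataset.
    n_new: number of synthetic examples to request.
    """
    lines = [task_description, ""]
    for text, label in examples:
        # Present each seed example in a fixed field format so the model
        # can imitate it when generating new examples.
        lines.append(f"Tekst: {text}")
        lines.append(f"Label: {label}")
        lines.append("")
    lines.append(
        f"Genereer {n_new} nieuwe voorbeelden in hetzelfde formaat."
    )
    return "\n".join(lines)
```

The resulting string would be sent to the LLM, and the model's completion parsed back into (text, label) pairs to extend the training set.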
The generated data will be used to expand the original low-resource dataset and will be evaluated intrinsically by comparing the synthetic data to the original data on several linguistic metrics. The extrinsic evaluation will be performed by fine-tuning a Dutch pre-trained language model on each synthetic dataset generated by one of the augmentation methods, and comparing the performance of the fine-tuned models on a held-out test set.
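As an illustration of the intrinsic evaluation, one simple linguistic metric for comparing synthetic and original text is lexical diversity, e.g. the type-token ratio. The sketch below is a minimal, assumed implementation (whitespace tokenisation, lowercasing); the actual metrics used in the study may differ.

```python
def type_token_ratio(texts):
    """Lexical diversity of a corpus: unique tokens / total tokens.

    Assumes simple whitespace tokenisation after lowercasing; a real
    evaluation would likely use a proper Dutch tokeniser.
    """
    tokens = [tok for text in texts for tok in text.lower().split()]
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

# Hypothetical usage: compare diversity of original vs. synthetic data.
original = ["de kat zit op de mat", "de hond ligt in de mand"]
synthetic = ["de kat zit op de mat", "de kat zit op de mat"]
```

A synthetic set that merely repeats its seed examples would show a markedly lower ratio than the original data, flagging low-diversity generations.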