Benchmarking Zero-Shot Text Classification for Dutch

Loic De Langhe, Aaron Maladry, Luna De Bruyne, Bram Vanroy, Pranaydeep Singh, Sofie Labat, Orphée De Clercq, Els Lefever, Veronique Hoste

LT3, Language and Translation Technology Team, Ghent University

The advent and popularisation of Large Language Models (LLMs) have given rise to prompt-based NLP techniques that eliminate the need for large, manually annotated corpora and computationally expensive training or fine-tuning. Zero-shot learning (ZSL) in particular presents an attractive alternative to the classical train-development-test paradigm for many downstream tasks, as it provides a quick and inexpensive way to leverage the knowledge implicitly encoded in LLMs.
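To make this concrete, the sketch below shows what prompt-based zero-shot classification can look like in practice with an off-the-shelf instruction-tuned model from the Hugging Face transformers library. The checkpoint, prompt wording, labels and example sentence are illustrative assumptions, not the configuration used in this work.

```python
from transformers import pipeline

# Illustrative instruction-tuned checkpoint; any generative model with
# sufficient Dutch coverage could be substituted here.
generator = pipeline("text2text-generation", model="google/flan-t5-base")

text = "Wat een geweldige wedstrijd, ik heb genoten van elke minuut!"
labels = ["positief", "negatief", "neutraal"]

# The task is expressed entirely in the prompt; no parameters are updated
# and no labelled Dutch training data is required.
prompt = (
    f"Classificeer het sentiment van de volgende zin als "
    f"{', '.join(labels)}.\nZin: {text}\nSentiment:"
)

# The predicted label is read directly from the generated text.
prediction = generator(prompt, max_new_tokens=5)[0]["generated_text"].strip()
print(prediction)
```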

Despite the widespread interest in zero-shot applications across NLP as a whole, there is often no consensus on the methodology, analysis and evaluation of zero-shot pipelines. As a tentative step towards such a consensus, this work provides a detailed overview of available methods, resources, caveats and evaluation strategies for zero-shot prompting within the Dutch language domain.

Additionally, we present a centralised zero-shot benchmark covering a wide variety of Dutch NLP tasks, using a series of standardised data sets and unified evaluation strategies. To ensure that this benchmark is representative, we investigate a selection of diverse prompting strategies and methodologies for a range of state-of-the-art Dutch Natural Language Inference models, masked language models (BERTje, RobBERT) and autoregressive language models (Dutch GPT2, Flan-T5). As evaluation tasks, we include span detection tasks (aspect extraction, event detection, event argument extraction) and classification tasks (sentiment analysis, emotion detection, irony detection, news topic classification and die/dat prediction). These tasks also vary in subjectivity and domain, ranging from more subjective, social-media-oriented tasks (emotion and irony detection) to factual, news-oriented tasks (topic classification and event detection).
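For the NLI-based models, classification is typically cast as textual entailment: each candidate label is inserted into a hypothesis template, and the label whose hypothesis receives the highest entailment score is selected. The minimal sketch below uses the Hugging Face zero-shot-classification pipeline; the multilingual checkpoint, the Dutch hypothesis template and the example sentence are assumptions for illustration, not the specific models or prompts evaluated in the benchmark.

```python
from transformers import pipeline

# Example multilingual NLI checkpoint (an illustrative assumption);
# any Dutch-capable NLI model can be substituted here.
classifier = pipeline(
    "zero-shot-classification",
    model="joeddav/xlm-roberta-large-xnli",
)

text = "De regering kondigde vandaag nieuwe maatregelen aan om de inflatie te beperken."
labels = ["politiek", "sport", "economie", "cultuur"]

# Each label is slotted into a Dutch hypothesis template; the pipeline
# scores premise-hypothesis entailment for every candidate label.
result = classifier(
    text,
    candidate_labels=labels,
    hypothesis_template="Dit bericht gaat over {}.",
)

# Labels are returned sorted by entailment score, highest first.
print(result["labels"][0], round(result["scores"][0], 3))
```

Note that each candidate label requires a separate entailment pass, so inference cost grows linearly with the size of the label set, which matters for tasks with many classes such as news topic classification.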

The 33rd Meeting of Computational Linguistics in The Netherlands (CLIN 33)
UAntwerpen City Campus: Building R
Rodestraat 14, Antwerp, Belgium
22 September 2023