Controllable Sentence Simplification in Dutch

Theresa Seidl

KU Leuven

Vincent Vandeghinste

KU Leuven

Tim Van de Cruys

KU Leuven

Text simplification aims to reduce complexity in vocabulary and syntax, enhancing the readability and comprehension of text. This paper presents a supervised sentence simplification approach for Dutch, using a pre-trained large language model (T5). Given the absence of a parallel corpus in Dutch, a synthetic dataset is generated from established parallel corpora. The implementation incorporates a sentence-level discrete parametrization mechanism, enabling control over the simplification features. By incorporating control tokens into the training data, the model's output can be tailored to different simplification scenarios and target audiences. The controlled attributes include sentence length, word length, paraphrasing, and lexical and syntactic complexity.

This work contributes a set of control tokens tailored to Dutch. It shows that substantial simplification can be achieved with a synthetic dataset of as few as 2,000 parallel sentence pairs, although the best performance requires at least 10,000 pairs. The fine-tuned model achieves a SARI score of 36.85 on the test set, supporting its effectiveness for simplification.

This research contributes to the field of text simplification (TS) by describing the implementation of a supervised sentence simplification approach for Dutch. The findings highlight the potential of synthetic datasets and control tokens for effective simplification when no parallel corpus exists in the target language.
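
The control-token idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The token names (`<CHAR_RATIO_…>`, `<WORD_LEN_RATIO_…>`), the bucket size of 0.05, and the clipping range are assumptions chosen for illustration, and only two of the controlled attributes (sentence length and word length) are covered. During training, each token is computed from the complex/simple sentence pair and prepended to the source sentence. At inference time, the user chooses the values.

```python
def bucket(ratio, step=0.05, lo=0.2, hi=1.5):
    """Clip a ratio to [lo, hi] and snap it to the nearest multiple of `step`."""
    clipped = min(max(ratio, lo), hi)
    return round(round(clipped / step) * step, 2)


def avg_word_length(sentence):
    """Mean number of characters per whitespace-separated word."""
    words = sentence.split()
    return sum(len(w) for w in words) / len(words) if words else 0.0


def format_prefix(char_ratio, word_len_ratio):
    """Render the (hypothetical) control tokens for the given target ratios."""
    return (f"<CHAR_RATIO_{bucket(char_ratio):.2f}> "
            f"<WORD_LEN_RATIO_{bucket(word_len_ratio):.2f}>")


def make_training_input(source, target):
    """Training time: derive the ratios from the pair and prepend them to the source."""
    char_ratio = len(target) / len(source) if source else 1.0
    src_wl = avg_word_length(source)
    word_len_ratio = avg_word_length(target) / src_wl if src_wl else 1.0
    return f"{format_prefix(char_ratio, word_len_ratio)} {source}"


def make_inference_input(source, char_ratio, word_len_ratio):
    """Inference time: the user picks the desired ratios directly."""
    return f"{format_prefix(char_ratio, word_len_ratio)} {source}"
```

A sequence-to-sequence model such as T5 would then be fine-tuned on pairs of `make_training_input(source, target)` and `target`. Lower ratios at inference time request shorter sentences or shorter words.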

The 33rd Meeting of Computational Linguistics in The Netherlands (CLIN 33)

UAntwerpen City Campus: Building R

Rodestraat 14, Antwerp, Belgium

22 September 2023