Automatic sentence-level simplification for Dutch

Charlotte Van de Velde

KU Leuven

Bram Vanroy

KU Leuven

Vincent Vandeghinste

KU Leuven

Sentence-level simplification has recently been attempted for many languages, often with neural methods. The task involves replacing difficult words with simpler synonyms and shortening and clarifying sentence structures while preserving the overall meaning of the sentence, in order to make reading and understanding easier for several target groups of readers. These methods require a large amount of data in the form of an aligned corpus that pairs sentences in a given language with their simplified versions, and such a corpus is not readily available for Dutch.

To address this lack of data, this thesis proposes the synthetic generation of a parallel corpus of (complex) Dutch and simple Dutch, collected from the ChatGPT language model. Beforehand, several prompts and language models were compared for generating this data, based on the similarity of their output to a corpus of simple Dutch. The synthetically generated data was then used in a semi-supervised approach to fine-tune a T5 model on the task of simplification for Dutch. A T5 model pre-trained on Dutch with the UL2 objectives was chosen for fine-tuning, resulting in a base model that simplifies Dutch at the sentence level.
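
The data-generation step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used in the thesis: the prompt wording, the `build_parallel_corpus` and `to_jsonl` helpers, and the `simplify_fn` callable (a stand-in for a call to a language model such as ChatGPT) are all hypothetical names introduced here.

```python
import json
from typing import Callable, Dict, Iterable, List

# Hypothetical prompt; the prompts actually compared in the thesis are not given here.
PROMPT_TEMPLATE = "Vereenvoudig de volgende zin: {sentence}"


def build_parallel_corpus(
    complex_sentences: Iterable[str],
    simplify_fn: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Pair each complex sentence with a model-generated simplification.

    `simplify_fn` stands in for a language-model call: it receives the
    filled-in prompt and returns the simplified sentence.
    """
    pairs = []
    for sentence in complex_sentences:
        sentence = sentence.strip()
        if not sentence:
            continue  # skip empty lines
        simple = simplify_fn(PROMPT_TEMPLATE.format(sentence=sentence)).strip()
        # Drop pairs where the model returned nothing or just echoed the input.
        if simple and simple != sentence:
            pairs.append({"complex": sentence, "simple": simple})
    return pairs


def to_jsonl(pairs: List[Dict[str, str]]) -> str:
    """Serialise pairs as JSON Lines, a common input format for fine-tuning."""
    return "\n".join(json.dumps(p, ensure_ascii=False) for p in pairs)
```

Passing the model call in as a function keeps the sketch independent of any particular API and makes it possible to test the pairing and filtering logic with a stub.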

The fine-tuned model was evaluated quantitatively with SARI (52.962) and ROUGE scores, which indicated that the model functions well but leaves room for improvement. A qualitative analysis gave a similar impression: the evaluated examples were simplified accurately when the difficult source sentences were synthetically generated.
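
To make the overlap-based part of this evaluation concrete, the sketch below computes a ROUGE-1 F1 score, that is, unigram overlap between a system output and a reference simplification. It is an illustration only: `rouge1_f1` is a hypothetical helper, and lowercasing plus whitespace tokenisation are simplifying assumptions rather than the evaluation setup used in the thesis. SARI, which additionally compares the output against the source sentence, is not reproduced here.

```python
from collections import Counter


def rouge1_f1(prediction: str, reference: str) -> float:
    """Unigram-overlap F1 between a system output and a reference."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Clipped overlap: a token counts at most as often as it occurs on both sides.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

For example, `rouge1_f1("de kat", "de hond")` gives 0.5, because one of the two tokens matches on each side.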

For developing a simplification system with the method proposed here, the recommendation is to adapt the synthetic training data to include multiple types of content, style, and structure, and to incorporate human evaluation by user groups from the target audience. Given such updates to the synthetic data, there are several options for future research: applying the method to languages other than Dutch, to document-level rather than sentence-level simplification, and to specific target groups of readers, in order to make reading and understanding easier for them.

CLIN33
The 33rd Meeting of Computational Linguistics in The Netherlands (CLIN 33)
UAntwerpen City Campus: Building R
Rodestraat 14, Antwerp, Belgium
22 September 2023