Giving Dutch Some Style: Data and Models for Text Formality Transfer in Dutch

Marieke Weultjes

University of Groningen

Huiyuan Lai

University of Groningen

Malvina Nissim

University of Groningen

Text style transfer (TST) is the task of automatically transforming text written in a specific style into another, while mostly preserving the original meaning. Formality transfer, which aims to change an informal text into a formal version (or vice versa), is one instantiation of this task. One of the reasons researchers have focused on this task is data availability, in particular the GYAFC dataset, which contains parallel English sentence pairs for both training and evaluation. Building on this, the multilingual XFORMAL dataset was created, including manually crafted test data for Italian, French, and Portuguese, while the training pairs are automatically translated from English. Thanks to this data, and to the recent successes of transfer learning leveraging large multilingual models, TST systems have gone beyond English. While the availability of gold parallel training data allows for the development of better-performing models, TST systems built on multilingual models with adaptation strategies have proven quite successful, even in the absence of original training data. To date, though, formality transfer has not been addressed in Dutch, due to a complete lack of data for this task. Our contribution fills this gap by presenting the first Dutch dataset designed for the formality transfer task. We describe in detail the methodology employed in creating this dataset, including the selection of source text (focused on collecting informal data), all preprocessing steps, and the annotation process, including the manual creation of multiple references. We also provide a dataset analysis, identifying and classifying different types of formality edits and assessing the similarity between the gold reference sentences.
Next, in order to provide a TST system for Dutch, as well as to further test the feasibility of transfer learning approaches for this task, we build upon our previous work in style transfer and use the pre-trained mBART model as a base, enhanced with an adaptation training strategy. Our experiments involve both English and Dutch training data in three different settings: English training data only; Dutch data automatically translated from English; and translated Dutch and original English data together. We observe that (i) relying solely on English training data yields poor results, with output being neither in the target language nor the target style; (ii) automatically translating the training data into Dutch, creating pseudo-parallel Dutch training data, produces significantly better outcomes. Overall, most models exhibit a consistent grasp of punctuation and capitalization rules, while their proficiency in paraphrasing, sentence completion/splitting, and other formality edits varies substantially across models. Models based on adaptation strategies exhibit more limitations, generating output sentences that sometimes lack proper linguistic structure and regularly leave noisy elements in the rewritten sentences. These observations highlight the complexity of formality style transfer, especially in the absence of original training pairs, as is the case for Dutch. These results and all created datasets (manually crafted for testing, and automatically translated for training), which we make available, can serve as a solid basis for further research on the formality transfer task in Dutch.
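To make the three training-data settings concrete, the sketch below assembles toy informal-to-formal sentence pairs into the three configurations described above. The helper name and the example sentences are hypothetical illustrations, not the actual GYAFC data or the machine-translated Dutch pairs used in the experiments.

```python
# Sketch of the three training-data settings: English only,
# pseudo-parallel translated Dutch only, and the two combined.
# All names and sentence pairs here are illustrative placeholders.

def build_training_settings(en_pairs, nl_pairs):
    """Return the three training configurations as (source, target) pair lists."""
    return {
        "en_only": list(en_pairs),                  # original English pairs
        "nl_translated": list(nl_pairs),            # pseudo-parallel Dutch pairs
        "en_plus_nl": list(en_pairs) + list(nl_pairs),  # combined setting
    }

# Toy informal -> formal pairs (source, target).
en_pairs = [("u gotta see this", "You should see this.")]
# Pseudo-parallel Dutch pairs, as if machine-translated from English.
nl_pairs = [("dat moet je echt zien", "Dat moet u zeker bekijken.")]

settings = build_training_settings(en_pairs, nl_pairs)
print(len(settings["en_plus_nl"]))  # combined setting holds all pairs
```

In the combined setting, each pair keeps its own source and target language, so a multilingual model such as mBART sees both English and Dutch rewrites during fine-tuning.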

CLIN33
The 33rd Meeting of Computational Linguistics in The Netherlands (CLIN 33)
UAntwerpen City Campus: Building R
Rodestraat 14, Antwerp, Belgium
22 September 2023