University of Groningen
The Dutch language has undergone numerous spelling reforms over the past centuries. Natural Language Processing (NLP) tools designed for and trained on modern spelling struggle with such orthographic variation, requiring either domain adaptation of the tools or preprocessing of the data. Van Cranenburgh and van Noord (2022) opt for the latter approach and present a rule-based system to normalize 19th-century Dutch spelling to its modern equivalent; for example:
Wat drommel kon [ @alt die dien ] [ @alt oude ouden ] heer bewegen , zich uittegeven voor een aanbidder van mijn zusje Truitje die [ @alt zere zeere ] [ @alt ogen oogen ] had , of van mijn [ @alt broer broêr ] Gerrit die altijd met zijn neus speelde?
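The `[ @alt ... ]` annotations above pair each modern form with its historical spelling (first the normalized token, then the original). As a minimal sketch of how such annotated output can be read back into plain text, the following regex-based helpers (our own, not part of the rule-based system) extract either variant:

```python
import re

# Matches the [ @alt normalized original ] annotation format shown in the
# example sentence: group 1 is the modern form, group 2 the 19th-century one.
ALT = re.compile(r"\[ @alt (\S+) (\S+) \]")

def normalized(text: str) -> str:
    """Keep the modern spelling of each annotated token."""
    return ALT.sub(r"\1", text)

def original(text: str) -> str:
    """Keep the historical spelling of each annotated token."""
    return ALT.sub(r"\2", text)

sentence = "die [ @alt zere zeere ] [ @alt ogen oogen ] had"
# normalized(sentence) → "die zere ogen had"
# original(sentence)   → "die zeere oogen had"
```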
However, the rule-based system is incomplete, and improving the rules requires extensive domain knowledge and careful engineering, whereas modern machine translation methods can learn the spelling reforms automatically from data. Therefore, we propose and evaluate a neural machine translation (NMT) approach for historical Dutch spelling normalization. We fine-tune two pretrained language models:
• Flan-T5, an instruction-finetuned version of the original T5 (Raffel et al., 2020), and
• ByT5 (Xue et al., 2022), a token-free model which operates directly on the raw text as a sequence of bytes.
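To illustrate ByT5's token-free input representation: text is mapped to its raw UTF-8 bytes, with ids shifted past a few reserved special tokens. This is a minimal sketch; the offset of 3 (for pad/eos/unk) matches the Hugging Face ByT5Tokenizer, but treat the exact value as an assumption here.

```python
OFFSET = 3  # ids 0..2 assumed reserved for pad/eos/unk special tokens

def encode(text: str) -> list[int]:
    """ByT5-style encoding: one id per UTF-8 byte, no learned vocabulary."""
    return [b + OFFSET for b in text.encode("utf-8")]

def decode(ids: list[int]) -> str:
    """Inverse mapping back from ids to text."""
    return bytes(i - OFFSET for i in ids).decode("utf-8")
```

Because the model sees individual bytes rather than subword units, rare historical spellings such as "oogen" or "broêr" never fall out of vocabulary, which is one motivation for trying ByT5 on this task.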
For training, we collect 19th-century Dutch novels from Project Gutenberg. We normalize the spelling of these texts with the rule-based system in order to obtain silver data. For evaluation, we use gold data consisting of novel fragments with manually corrected spelling from the OpenBoek corpus (van Cranenburgh and van Noord, 2022). We perform an intrinsic evaluation using precision, recall, and error reduction rate (ERR), and also evaluate on downstream tasks (POS tagging and coreference resolution). This allows for a direct comparison between the rule-based system and the NMT models to determine which yields the best performance, and whether the NMT models are able to generalize beyond their training data (i.e., correctly normalize spelling variants not seen during training).
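As a hedged sketch of the error reduction rate metric, under one common word-level definition (the paper's exact formulation may differ): ERR is the fraction of errors with respect to the gold normalization that a system removes, relative to leaving the original text unchanged. The function names are illustrative only.

```python
def word_errors(hyp: str, gold: str) -> int:
    """Count mismatching tokens between hypothesis and gold.

    Assumes both strings are pre-tokenized and aligned token-for-token.
    """
    return sum(h != g for h, g in zip(hyp.split(), gold.split()))

def error_reduction_rate(original: str, system: str, gold: str) -> float:
    """ERR = (errors before - errors after) / errors before.

    1.0 means all spelling errors were fixed; 0.0 means no improvement;
    negative values mean the system introduced new errors.
    """
    before = word_errors(original, gold)
    after = word_errors(system, gold)
    return (before - after) / before

# Toy example: the system fixes "zeere" but leaves "oogen" unnormalized.
# error_reduction_rate("zeere oogen had", "zere oogen had", "zere ogen had")
# → 0.5
```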