Large-scale Manual and Automatic Evaluation of Crawled Parallel Corpora

Rik van Noord

University of Groningen

Antonio Toral

University of Groningen

Our work is part of the MaCoCu project: Massive collection and curation of monolingual and bilingual data. This collaborative project, funded by the Connecting Europe Facility, involves four partners: University of Groningen (represented by us), Institut Jožef Stefan (Slovenia), University of Alicante (Spain), and Prompsit Language Engineering (Spain). The primary objective of the project is to construct large, high-quality monolingual and parallel corpora for eleven under-resourced European languages: Albanian, Bosnian, Bulgarian, Croatian, Icelandic, Macedonian, Maltese, Montenegrin, Serbian, Slovenian and Turkish. Our strategy is to crawl top-level domains automatically, in contrast to existing resources that exploit Common Crawl (e.g. ParaCrawl, CCAlign, OSCAR), on the hypothesis that this yields higher-quality data.

Our work focuses on evaluating the crawled parallel corpora. We aim to show through several experiments that the crawled data is indeed of high quality. First, we perform a large-scale manual evaluation of three parallel corpora: MaCoCu (ours), ParaCrawl and CCAlign. We hired professional translators for all languages to assess the quality of the translations in each corpus. This experiment aims to establish which corpus contains the highest-quality parallel data, without taking data set size into account.

However, when training Neural Machine Translation (NMT) systems, the size of the data set is a major factor. To address this aspect, we perform an automatic evaluation of the three parallel corpora by training Transformer-based NMT systems. We either train the models from scratch or continue training a strong, publicly available NMT system. We compare several experimental configurations: using the corpora as is, controlling for the amount of compute used, and controlling for the number of unique parallel sentences employed. We also combine the three corpora to train a single system, with the aim of achieving (open-source) state-of-the-art performance. We assess performance across multiple languages, evaluation sets, domains and metrics. This evaluation is currently in progress and will be finished by August 1st.
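
As an illustration of what controlling for the number of unique parallel sentences could look like, the following is a minimal sketch and not a description of our actual pipeline. It assumes each corpus is available as a list of (source, target) string pairs, that exact string match after whitespace stripping defines a duplicate, and that every corpus is downsampled to the size of the smallest one. The function names `unique_pairs` and `size_matched` are hypothetical.

```python
import random


def unique_pairs(pairs):
    """Drop exact duplicate (source, target) pairs, keeping first occurrences.

    Assumption: two pairs are duplicates if both sides match exactly
    after stripping surrounding whitespace.
    """
    seen = set()
    unique = []
    for src, tgt in pairs:
        key = (src.strip(), tgt.strip())
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


def size_matched(corpora, seed=0):
    """Downsample every corpus to the unique-pair count of the smallest one.

    `corpora` maps a corpus name to its list of (source, target) pairs.
    A fixed seed keeps the sample reproducible across runs.
    """
    deduped = {name: unique_pairs(pairs) for name, pairs in corpora.items()}
    budget = min(len(pairs) for pairs in deduped.values())
    rng = random.Random(seed)
    return {name: rng.sample(pairs, budget) for name, pairs in deduped.items()}
```

With subsets matched in this way, differences in downstream NMT performance can be attributed to the quality of the sentence pairs rather than to how many of them each corpus happens to contain.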

CLIN33

The 33rd Meeting of Computational Linguistics in The Netherlands (CLIN 33)

UAntwerpen City Campus: Building R

Rodestraat 14, Antwerp, Belgium

22 September 2023