Classifying the Quality of Digitized VOC Documents

Carsten Schnober

Netherlands eScience Center

Kay Pepping

Huygens Instituut

Maartje Hids

Huygens Instituut

Lodewijk Petram

Huygens Instituut

The GLOBALISE project is developing an "online infrastructure that unlocks the key series of VOC documents and reports". These handwritten documents and reports, dating from the 17th and 18th centuries, have been scanned by the Dutch National Archives in The Hague as part of their ongoing effort to make the most frequently used archives in their collection accessible online. The ca. 5 million scans of the GLOBALISE corpus were converted to machine-readable text using the open-source HTR tool Loghi. Despite recent improvements in handwritten text recognition, error rates vary widely per page. This may be due to variation in handwriting, but also, for example, to different types of documents or to pages scanned in a different orientation.

For downstream tasks required in the GLOBALISE project, such as event detection or named entity recognition, pages with poor-quality HTR output are useless or even harmful. To improve the results of those tasks, we aim to identify poor-quality pages automatically and re-process them in a targeted manner. The goal is thus to identify as many poor-quality pages as possible (recall) while minimizing the number of pages selected for re-processing (precision).
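As a minimal illustration of this trade-off, the sketch below computes recall and precision for a hypothetical bad-page detector; the label and prediction arrays are placeholders, not taken from our actual evaluation.

```python
# Minimal sketch: recall/precision trade-off when flagging bad-quality pages.
# The labels and predictions below are illustrative placeholders.
from sklearn.metrics import precision_score, recall_score

# 1 = bad-quality page (should be re-processed), 0 = acceptable page
y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # manual quality judgements
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # detector output

# Recall: share of truly bad pages that the detector catches.
print("recall:", recall_score(y_true, y_pred))
# Precision: share of flagged pages that are actually bad,
# i.e. how much re-processing effort is spent on real problems.
print("precision:", precision_score(y_true, y_pred))
```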

We will present our findings from adapting the method proposed for Luxembourgish by Schneider & Maurer (2022) to our historical Dutch dataset. We have analysed our specific data and present differences, commonalities, and other findings. The resulting classifier is publicly available, as is the pipeline comprising all steps needed to train a similar classifier, e.g. for other languages or time periods.

We have manually annotated 500 documents with a quality class (Good, Medium, Bad). Of these, we eventually used the 328 documents that were not empty and on which both annotators agreed on the quality.
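A compact sketch of this filtering step is shown below, assuming the annotations are stored in a simple table with one column per annotator; the file and column names are hypothetical.

```python
# Sketch: keep only non-empty documents on which both annotators agree.
# File name and column names ("annotator_1", "annotator_2", "text") are assumptions.
import pandas as pd

annotations = pd.read_csv("quality_annotations.csv")  # hypothetical file

non_empty = annotations["text"].fillna("").str.strip().astype(bool)
agreed = annotations["annotator_1"] == annotations["annotator_2"]

gold = annotations[non_empty & agreed].copy()
gold["label"] = gold["annotator_1"]          # Good / Medium / Bad
print(len(gold), "documents retained for training")
```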

In summary, we have applied a set of features resembling those proposed by Schneider & Maurer (2022), illustrated in the sketch following this list:

- Dictionary score: the number of tokens that occur in a dictionary of the language; we have used both a modern-day Dutch dictionary and a dictionary generated from VOC documents.

- Tri-gram comparison: a metric comparing the distribution of character tri-grams in a document with the expected distribution for the target language.

- Garbage token detection: a metric based on tokens that do not appear to be real words, hence 'garbage', identified by a set of heuristics such as token length, unusual sequences of vowels or consonants, etc.
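The following sketch shows how such features can be computed. The dictionaries, the reference tri-gram distribution, the normalisations, and the heuristic thresholds are illustrative assumptions, not the exact resources and settings used in the pipeline.

```python
# Sketch of the three feature types; dictionaries, reference distribution
# and thresholds are illustrative assumptions.
import re
from collections import Counter
from math import sqrt

def dictionary_score(tokens, dictionary):
    """Share of tokens found in a word list (e.g. modern Dutch or VOC-derived)."""
    if not tokens:
        return 0.0
    return sum(t.lower() in dictionary for t in tokens) / len(tokens)

def trigram_similarity(text, reference_freqs):
    """Cosine similarity between the document's character tri-gram
    distribution and an expected distribution for the target language."""
    grams = Counter(text[i:i + 3] for i in range(len(text) - 2))
    total = sum(grams.values()) or 1
    doc_freqs = {g: c / total for g, c in grams.items()}
    dot = sum(doc_freqs.get(g, 0.0) * reference_freqs.get(g, 0.0)
              for g in set(doc_freqs) | set(reference_freqs))
    norm = (sqrt(sum(v * v for v in doc_freqs.values()))
            * sqrt(sum(v * v for v in reference_freqs.values())))
    return dot / norm if norm else 0.0

def garbage_ratio(tokens):
    """Share of tokens flagged as 'garbage' by simple heuristics:
    extreme length or long runs of vowels/consonants."""
    def is_garbage(tok):
        tok = tok.lower()
        return (len(tok) > 20
                or re.search(r"[aeiou]{4,}", tok)
                or re.search(r"[bcdfghjklmnpqrstvwxz]{6,}", tok))
    if not tokens:
        return 0.0
    return sum(bool(is_garbage(t)) for t in tokens) / len(tokens)
```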

We have used our annotated dataset to train various classifiers, including k-Nearest Neighbours and feed-forward neural networks. The latter performed best, but different classifier choices and parametrizations led to only small differences in accuracy, in the range between 0.72 and 0.74. This led to the preliminary conclusion that the choice of classifier algorithm alone cannot yield significant improvements in accuracy. Instead, non-textual features such as layout and text ordering contribute to the misclassifications. We will present our error analysis in more depth and show the possibilities and shortcomings of this text-based approach.
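As a rough sketch of this classification setup, assuming per-document feature vectors and quality labels have already been extracted, one could compare the two classifier families as follows; the file names, feature set, and hyper-parameters are illustrative only.

```python
# Sketch: comparing k-Nearest Neighbours with a small feed-forward network
# on document-level quality features. Inputs and hyper-parameters are
# illustrative assumptions, not the project's actual configuration.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X = np.load("features.npy")  # e.g. dictionary score, tri-gram similarity, garbage ratio
y = np.load("labels.npy")    # Good / Medium / Bad

for name, clf in [
    ("kNN", KNeighborsClassifier(n_neighbors=5)),
    ("feed-forward NN", MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000)),
]:
    model = make_pipeline(StandardScaler(), clf)
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"{name}: mean accuracy {scores.mean():.2f}")
```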

We will also show how well the features defined by Schneider & Maurer (2022) translate to our historical Dutch dataset. Taking a pragmatic approach, we will demonstrate how our implementation follows the principles of open science and open-source software, so that others can apply it to their own data or adapt it to their own needs.

References

Schneider, Pit, and Yves Maurer. “Rerunning OCR: A Machine Learning Approach to Quality Assessment and Enhancement Prediction.” Journal of Data Mining & Digital Humanities 2022, no. Digital humanities in languages (November 30, 2022). https://doi.org/10.46298/jdmdh.8561.
