Netherlands eScience Center
Radboud University Nijmegen
Radboud University Nijmegen
Radboud University Nijmegen
Radboud University Nijmegen
Radboud University Nijmegen
The HDSC project (Historical Database of Suriname and the Caribbean) aims to collect and distribute the life courses of inhabitants of Suriname and the Caribbean islands who lived between 1830 and 1950. To this end, the project digitizes several registers of inhabitants of the region with the help of hundreds of volunteers. The registers include slave registers, emancipation registers, civil registers, censuses and migration registers. Life courses are created by linking the persons who appear in different registers, for example in birth, marriage and death registers.
Digitizing hundreds of thousands of records requires considerable time and effort, which is why we have started exploring automatic approaches. In particular, we have applied the popular program Transkribus to recognize text in scans. The program has separate models for hand-written text and printed text. Our application is unusual in that our documents contain both. We trained a Transkribus model for hand-written text on Curaçao death records so that it recognizes both hand-written and printed text [Hoek 2023]. The model achieves a character error rate of about 3%.
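For reference, the character error rate (CER) is conventionally defined as the character-level edit distance between the recognized text and a reference transcription, divided by the length of the reference. The symbols below are introduced here for illustration and do not come from the cited work:

```latex
\mathrm{CER} = \frac{S + D + I}{N}
```

Here $S$, $D$ and $I$ are the numbers of character substitutions, deletions and insertions needed to turn the recognized text into the reference, and $N$ is the number of characters in the reference. Under this definition, a CER of about 3% corresponds to roughly 30 character edits on a page of 1000 characters.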
The character error rate of the model does not tell the complete story. Our documents contain a lot of formulaic language, which is relatively easy to recognize because it occurs frequently and appears in printed text. However, we are primarily interested in the names, dates, locations and professions. These are more challenging because they include infrequent tokens and appear in the hand-written text. The recognition quality of the formulaic language is easy to evaluate, because we know what text to expect. We have therefore developed a method that uses the recognized formulaic language to estimate the text recognition quality of individual documents. Documents with a low estimated quality will be set aside for a manual check.
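One way such a quality estimate could work is sketched below. This is a minimal illustration, not the method developed in the project: the expected phrases, the function names, the use of `difflib.SequenceMatcher` and the threshold of 0.9 are all assumptions made for the example.

```python
from difflib import SequenceMatcher


def best_phrase_similarity(phrase: str, text: str) -> float:
    """Highest similarity (0..1) between the phrase and any word window of equal length in the text."""
    phrase_words = phrase.lower().split()
    text_words = text.lower().split()
    target = " ".join(phrase_words)
    n = len(phrase_words)
    if len(text_words) < n:
        # Text is shorter than the phrase: compare against the whole text.
        return SequenceMatcher(None, target, " ".join(text_words)).ratio()
    best = 0.0
    for i in range(len(text_words) - n + 1):
        window = " ".join(text_words[i:i + n])
        best = max(best, SequenceMatcher(None, target, window).ratio())
    return best


def estimate_quality(text: str, expected_phrases: list[str]) -> float:
    """Average best-match similarity over all expected formulaic phrases."""
    if not expected_phrases:
        raise ValueError("at least one expected phrase is required")
    scores = [best_phrase_similarity(p, text) for p in expected_phrases]
    return sum(scores) / len(scores)


def needs_manual_check(text: str, expected_phrases: list[str], threshold: float = 0.9) -> bool:
    """Flag a document when its estimated quality falls below the (assumed) threshold."""
    return estimate_quality(text, expected_phrases) < threshold


# Hypothetical placeholder phrases; the real records use their own fixed formulas.
EXPECTED = ["appeared before me", "who declared that"]
clean = "On this day appeared before me John Smith who declared that his neighbour had died"
garbled = "On this day apqeured befre me John Smith wko declured thot his neighbour had died"

print(estimate_quality(clean, EXPECTED))
print(estimate_quality(garbled, EXPECTED))
```

The idea is that recognition errors in the known, printed phrases act as a proxy for errors in the hand-written parts of the same document, for which no reference text is available.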
To identify the entities in the documents, we compared two methods: (1) a machine learning approach for named entity recognition [De Vries 2020] combined with regular expressions for identifying specific entities (for example the names of the deceased), and (2) ChatGPT 3.5. The two approaches performed similarly at finding the name of the deceased person: each identified around 30% of the names. Many of the missed names involved small character recognition errors. In fact, if a Levenshtein distance of up to three were allowed when matching names, the systems would have identified more than 80% of the names. ChatGPT performed considerably better at identifying and processing dates (90% vs 60%). It is more robust in identifying and correcting the date words, which come from a small, restricted set.
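The tolerant name matching mentioned above can be sketched as follows. This is an illustrative implementation under our own assumptions: the function names, the case-insensitive comparison and the example names are hypothetical and not taken from the evaluation itself.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of character insertions, deletions and substitutions turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def names_match(predicted: str, reference: str, max_distance: int = 3) -> bool:
    """Accept a predicted name when it is within max_distance edits of the reference."""
    return levenshtein(predicted.lower(), reference.lower()) <= max_distance


# Invented example names, for illustration only.
print(names_match("Johanes Martna", "Johannes Martina"))  # two missing characters
print(names_match("Pieter", "Johannes Martina"))          # a different name
```

With `max_distance=0` this reduces to exact matching, so the same function can be used to compare the strict and the tolerant evaluation settings.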
Linking the names in the Curaçao death records has led to several matches, in particular for the witnesses. We hope that these links will help to improve the quality of the character recognition. Throughout the pipeline of document scanning, text recognition, named entity recognition and entity linking, we will keep the human volunteers involved at every step to guarantee the quality of our data sets.