Leiden University
University College London
University of York, Historic England, Archaeology Data Service
Leiden University
Archaeology is a destructive process: once a site has been excavated, the evidence survives primarily as written documentation. As such, the archaeological domain creates huge amounts of text, from books and scholarly articles to unpublished fieldwork reports (‘grey literature’). The number of archaeological investigations is increasing significantly, and the lack of easy access to the information hidden in these texts is a substantial problem for the field. In the Netherlands alone, an estimated 4,000 new grey literature reports are created each year, alongside numerous books, papers and monographs. Furthermore, as research such as desk-based assessment is increasingly carried out remotely online, these documents need to be made more easily Findable, Accessible, Interoperable and Reusable (FAIR). Searching for and analysing these documents by hand is time-consuming and often inconsistent.
In the EXALT project at Leiden University we are developing a semantic search engine for archaeology in and around the Netherlands, indexing all available open-access texts, including Dutch-, English- and German-language documents.
In this context, we are systematically researching and comparing different methods for extracting information from archaeological texts in these three languages. The specific information extraction task we address is Named Entity Recognition (NER). We develop NER methods tailored to the archaeological domain, and in the process we compare a rule-based, knowledge-driven approach (using GATE), a feature-based machine learning method (Conditional Random Fields), and a deep learning method (BERT). For each of the three target languages, we also compare the results of the multilingual BERT model, a language-specific BERT model, and an archaeology-specific BERT model obtained through further pre-training.
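To make the NER task concrete, the following minimal Python sketch illustrates the general idea behind a knowledge-driven approach: a greedy longest-match lookup against a term list that outputs BIO tags. It is not the GATE pipeline used in the project; the gazetteer terms, the label names (PERIOD, MATERIAL, ARTEFACT) and the helper functions `tokenize` and `tag_bio` are hypothetical and chosen purely for illustration.

```python
import re
from typing import Dict, List

# Hypothetical toy gazetteer: surface form -> entity label.
# Terms and label names are illustrative assumptions, not the project's vocabulary.
GAZETTEER: Dict[str, str] = {
    "bronze age": "PERIOD",
    "iron age": "PERIOD",
    "flint": "MATERIAL",
    "axe": "ARTEFACT",
}


def tokenize(text: str) -> List[str]:
    """Split text into word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def tag_bio(tokens: List[str], gazetteer: Dict[str, str]) -> List[str]:
    """Assign BIO tags by greedy longest-match lookup in the gazetteer."""
    entries = {tuple(term.split()): label for term, label in gazetteer.items()}
    max_len = max(len(term) for term in entries)
    lowered = [t.lower() for t in tokens]
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate span first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            label = entries.get(tuple(lowered[i:i + n]))
            if label is not None:
                tags[i] = "B-" + label
                for j in range(i + 1, i + n):
                    tags[j] = "I-" + label
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return tags


tokens = tokenize("A flint axe from the Bronze Age.")
tags = tag_bio(tokens, GAZETTEER)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

A lookup of this kind needs no labelled training data but only finds terms that are already in the list, whereas the CRF and BERT approaches learn from annotated examples and can generalise to unseen terms.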
Previous studies have investigated different applications of text mining in archaeological literature, but often at a relatively small scale, in isolated case studies, or as proof-of-concept work. In this study, we compare multiple methods in multiple languages, and we aim to contribute to guidelines and good practice for text mining in archaeology. Specifically, we compare not only the overall quality of each approach, but also the time, digital literacy, hardware, and labelled data needed to run each method, and we analyse the language- and domain-specific challenges. We also consider the energy usage and CO2 output of these machine learning models and their impact on climate change, an issue that is particularly pertinent during the ongoing energy crisis.
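As a rough illustration of how energy usage and CO2 output can be estimated for a single training run, the sketch below multiplies average power draw by run time to obtain energy, and energy by the carbon intensity of the electricity grid to obtain emissions. This is a generic back-of-the-envelope calculation, not necessarily the measurement procedure used in this study; the function names and all input values are hypothetical placeholders rather than measurements.

```python
def energy_kwh(avg_power_watts: float, hours: float) -> float:
    """Energy used by a run: average power (W) x duration (h), converted to kWh."""
    return avg_power_watts * hours / 1000.0


def co2_kg(energy: float, intensity_kg_per_kwh: float) -> float:
    """Emissions: energy (kWh) x grid carbon intensity (kg CO2 per kWh)."""
    return energy * intensity_kg_per_kwh


# Placeholder inputs for illustration only, not measured values.
run_energy = energy_kwh(avg_power_watts=250.0, hours=10.0)
run_co2 = co2_kg(run_energy, intensity_kg_per_kwh=0.4)
print(f"{run_energy:.2f} kWh, {run_co2:.2f} kg CO2")
```

Applying the same calculation to each method puts the rule-based, CRF and BERT approaches on a common scale, provided that power draw and run time are recorded for every run.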