Annotation pipeline for unedited Byzantine Greek

Colin Swaelens

Language Technology & Translation Team (UGent)

Ilse De Vos

Language Technology & Translation Team (UGent)

Els Lefever

Department of Linguistics (UGent)

The Database of Byzantine Book Epigrams or DBBE (Ricceri et al. 2023) contains over 12,000 epigrams. They are stored both as occurrences (the epigrams exactly as they occur in the manuscripts) and as types (their orthographically normalised counterparts). The decision to link multiple occurrences to a single type was pragmatic as well as conceptual: creating fewer types not only freed up time to trace new occurrences, it was also a straightforward way to group similar occurrences. Soon, however, this all-or-nothing system ran into its limitations: what exactly does “similar” mean, and how similar do occurrences need to be to be grouped under the same type? In order to add linguistic information enabling more advanced similarity detection and visualisation, we developed the first morphological analyser for non-normalised Byzantine Greek.

To develop a part-of-speech tagger for Ancient and Byzantine Greek, we first compared embedding representations from three transformer-based language models: BERT (Devlin et al. 2018), ELECTRA (Clark et al. 2020), and RoBERTa (Liu et al. 2019). To train these models, two data sets were compiled: one consisting of all Ancient and Byzantine Greek text corpora available online, and a second combining that set with Modern Greek Wikipedia data. This allowed us to ascertain whether or not Modern Greek contributes to the modelling of Byzantine Greek.
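To give an idea of what pre-training such a model variant on one of these corpora involves, the sketch below uses the Hugging Face transformers library to train a BERT-style masked language model on a plain-text corpus. This is a minimal illustration under assumed settings, not the authors' actual setup: the corpus file name, the reuse of the multilingual BERT tokenizer, and all hyperparameters are placeholders.

```python
# Minimal sketch (not the authors' exact setup): pre-training a BERT-style
# masked language model on a plain-text Greek corpus.
from datasets import load_dataset
from transformers import (BertConfig, BertForMaskedLM, BertTokenizerFast,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

# Hypothetical corpus file: one of the two pre-training sets described above,
# e.g. Ancient/Byzantine Greek alone, or combined with Modern Greek Wikipedia.
corpus = load_dataset("text", data_files={"train": "greek_corpus.txt"})

# Simplification: reuse an existing tokenizer instead of training one on Greek.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-multilingual-cased")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = corpus["train"].map(tokenize, batched=True, remove_columns=["text"])

# Randomly initialised BERT trained from scratch with the standard
# masked-language-modelling objective (15% of tokens masked).
model = BertForMaskedLM(BertConfig(vocab_size=tokenizer.vocab_size))
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="greek-bert", num_train_epochs=3,
                           per_device_train_batch_size=32),
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
```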

For the supervised task of fine-grained part-of-speech tagging, we compiled a training set based on existing treebanks and complemented it with a small set of 2,000 manually annotated tokens from DBBE occurrences. To train the part-of-speech tagger, we made use of the FLAIR framework (Akbik et al. 2019), in which the contextual token embeddings from DBBErt were stacked with randomly initialised character embeddings. These stacked embeddings were processed by a bi-LSTM encoder (hidden size of 256) and a CRF decoder. For evaluation, a gold standard containing 10,000 tokens of non-normalised Byzantine Greek epigrams from the DBBE corpus was compiled, manually annotated, and validated through an inter-annotator agreement study.
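As a rough illustration of this architecture, the sketch below shows how such a stacked tagger could be assembled in FLAIR: transformer embeddings combined with character embeddings, fed into a bi-LSTM with a CRF on top. The data paths, column layout, and the path to DBBErt are hypothetical placeholders, not the authors' actual configuration.

```python
# Minimal sketch of the tagging setup described above, using FLAIR.
from flair.datasets import ColumnCorpus
from flair.embeddings import (CharacterEmbeddings, StackedEmbeddings,
                              TransformerWordEmbeddings)
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# Hypothetical CoNLL-style column files: token + part-of-speech tag per line.
corpus = ColumnCorpus("data/", {0: "text", 1: "pos"},
                      train_file="train.txt", test_file="gold_standard.txt")
tag_dictionary = corpus.make_label_dictionary(label_type="pos")

# Contextual transformer embeddings stacked with randomly initialised
# character embeddings, as described in the abstract.
embeddings = StackedEmbeddings([
    TransformerWordEmbeddings("path/to/DBBErt"),  # placeholder model path
    CharacterEmbeddings(),
])

# bi-LSTM encoder (hidden size 256) with a CRF decoder on top.
tagger = SequenceTagger(hidden_size=256,
                        embeddings=embeddings,
                        tag_dictionary=tag_dictionary,
                        tag_type="pos",
                        use_crf=True)

ModelTrainer(tagger, corpus).train("models/pos-tagger", max_epochs=50)
```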

The experimental results look very promising: the BERT model trained on all Greek data achieves the best performance, both for part-of-speech tagging (82.76% accuracy) and for full-fledged morphological analysis (68.75%). A comparison with the RNN Tagger (Schmid 2019) revealed that our tagger outperforms the latter by almost 4% on the DBBE gold standard.
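For concreteness, the sketch below shows one way such a token-level accuracy comparison against a gold standard could be computed. The file names and the one-tag-per-line format are hypothetical, chosen only to make the example self-contained.

```python
# Minimal sketch: compare two taggers' outputs against gold-standard tags.
def accuracy(gold, pred):
    """Fraction of tokens whose predicted tag matches the gold tag."""
    assert len(gold) == len(pred)
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def read_tags(path):
    """Read one tag per line from a plain-text file (hypothetical format)."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

gold = read_tags("gold_standard.tags")
ours = read_tags("our_tagger.tags")
rnn = read_tags("rnn_tagger.tags")

print(f"our tagger: {accuracy(gold, ours):.2%}")
print(f"RNN Tagger: {accuracy(gold, rnn):.2%}")
print(f"difference: {accuracy(gold, ours) - accuracy(gold, rnn):.2%}")
```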
