University of Amsterdam
This study addresses the classification of nouns with animate entities as referents, encompassing proper names, profession names, institutions, and company names – i.e. nouns that denote the quality of their referents as alive, sentient, or volitional. While previous literature has shed light on this phenomenon, further research is needed, especially for less-studied languages. Such research often faces challenges due to the limited availability of large-scale human-annotated corpora in these languages. Our primary aim is to provide what is, to our knowledge, the first automatic animacy classifier for Romanian nouns, using minimal lexical resources and avoiding the need for large-scale human annotation.
To address these challenges, we propose a methodology that employs a classifier to differentiate between two classes of Romanian nouns: human and non-human. Our approach leverages a set of pre-trained word embeddings for Romanian nouns (Păiş & Tufiş, 2018) and lexical information extracted from Romanian WordNet (Dumitrescu et al., 2018). We train three classifiers (RandomForest – RF, Multi-Layer Perceptron – MLP, k-Nearest Neighbours – KNN) and compare their performance on the same dataset.
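As a rough illustration of the embedding-based setup, the sketch below implements the simplest of the three classifier types, k-nearest neighbours, in plain Python. It is not the study's implementation: the three-dimensional vectors, the example nouns in the comments, the cosine-similarity measure, and the value k=3 are all assumptions made for the example, whereas real pre-trained embeddings have hundreds of dimensions and the labels would come from Romanian WordNet.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def knn_predict(query, train, k=3):
    """Return the majority label among the k training items most similar to query."""
    ranked = sorted(train, key=lambda item: cosine(query, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy "embeddings" with labels (illustrative values only).
TRAIN = [
    ([0.9, 0.1, 0.0], "human"),      # e.g. a profession noun such as "profesor"
    ([0.8, 0.2, 0.1], "human"),
    ([0.7, 0.3, 0.0], "human"),
    ([0.1, 0.9, 0.2], "non-human"),  # e.g. an object noun such as "masă"
    ([0.0, 0.8, 0.3], "non-human"),
    ([0.2, 0.7, 0.1], "non-human"),
]

if __name__ == "__main__":
    print(knn_predict([0.85, 0.15, 0.05], TRAIN))
    print(knn_predict([0.05, 0.90, 0.20], TRAIN))
```

The same labelled vectors could be passed to a Random Forest or a Multi-Layer Perceptron; only the decision rule changes, which is what makes a comparison on the same dataset meaningful.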
Following training and evaluation, we find that the RF classifier is the most suitable for the task, achieving a precision score of 90.3%. Notably, the MLP approach achieves slightly higher accuracy but falls short in precision. The KNN classifier shows the lowest performance, with 70% precision and 76.4% accuracy. The superior precision of the RF classifier and its substantially better performance compared to the 54.04% baseline establish it as the most appropriate choice for the task.
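Because the comparison rests on the difference between precision and accuracy, a small worked example may help show how one classifier can score higher on accuracy and lower on precision. The sketch below uses invented toy labels, not the study's data, and treats "human" as the positive class, which is an assumption; the classifier names are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of all items whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def precision(y_true, y_pred, positive="human"):
    """Fraction of items predicted as positive that are truly positive."""
    true_of_predicted = [t for t, p in zip(y_true, y_pred) if p == positive]
    if not true_of_predicted:
        return 0.0
    return sum(t == positive for t in true_of_predicted) / len(true_of_predicted)


# Toy gold labels: five human nouns followed by five non-human nouns.
y_true = ["human"] * 5 + ["non-human"] * 5

# A cautious classifier labels only three nouns as human, all correctly.
cautious = ["human"] * 3 + ["non-human"] * 7

# An eager classifier finds all five human nouns but also mislabels one non-human noun.
eager = ["human"] * 6 + ["non-human"] * 4

if __name__ == "__main__":
    for name, pred in [("cautious", cautious), ("eager", eager)]:
        print(name, accuracy(y_true, pred), precision(y_true, pred))
```

Here the eager classifier is more accurate overall (9 of 10 correct against 8 of 10), yet the cautious one is more precise (3 of 3 against 5 of 6). Which metric to prioritise depends on whether false "human" labels are costlier than missed ones.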
In sum, we contribute insights into automatic animacy classification from a less-studied language, leveraging minimal existing resources and requiring no manual human annotation. This paper thus fills a gap in the literature by addressing animacy-based noun classification for Romanian. By utilising RoWordNet and a set of pre-trained word embeddings, our study enhances our understanding of animacy constraints on grammar and pragmatics. Moreover, it facilitates the development of more accurate natural language processing tools for Romanian.