Aequa-tech, Turin, Italy
LT3, Ghent University
LT3, Ghent University
LT3, Ghent University
LT3, Ghent University
In recent years, several disciplines have taken an interest in the pragmatic phenomenon of irony. From neuroscience, psychology and theoretical linguistics to Natural Language Processing (NLP), the study of figurative language, and of irony and sarcasm in particular, has proven to be both an intriguing topic and a difficult task. Indeed, the presence of irony can undermine the correct identification of sentiment in a given text and complicate other downstream tasks. Computational linguists are therefore still actively studying irony and its primary characteristics. However, because irony is highly subjective, identifying what is ironic or not, and why, is extremely complex for both humans and machines. Over the last 15 years, NLP researchers have produced a significant body of literature on methodologies for creating corpora, designing annotation schemes, engineering features, and building models for investigating irony and its automated detection (Wallace, 2015). What clearly emerges from a survey of related work is that the majority of automatic tools have been developed for English. Work on other languages exists, but to a much lesser extent, which leaves room for new and finer-grained investigation of this topic.
With this project, we aim to investigate cross-lingual and cross-cultural irony detection by applying a multi-layered scheme for the fine-grained annotation of irony (Karoui et al., 2017) to benchmark datasets for irony detection in tweets in three languages: Italian, English, and Dutch. The scheme was previously assessed on social media data in French, English and Italian (Karoui et al., 2017; Cignarella et al., 2018b). Within the project, we rely on two benchmark datasets from different sources that are annotated for irony, and we homogenize them by applying the same textual pre-processing and mapping their irony labels to a shared label set. Furthermore, we reannotate the entire datasets with two new layers: (1) a 7-point scale measuring the likelihood of the text being ironic and (2) irony activators, annotated by selecting text spans, inspired by Cignarella et al. (2020).
For English, we use the SemEval-2018 Task 3 dataset (Van Hee et al., 2018a), and for Italian, the IronITA 2018 dataset (Cignarella et al., 2018a). We combine them with a novel dataset of Dutch tweets (Van Hee et al., 2018b) and apply the same multi-layered annotation scheme to all three language portions. The resulting multilingual corpus will be balanced and will consist of about 15,000 tweets.
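To make the harmonization and the two new annotation layers more concrete, the following Python sketch shows one way a single annotated tweet could be represented. It is purely illustrative and not the project's actual format: the placeholder tokens in `preprocess`, the source label values in `LABEL_MAP`, the class `AnnotatedTweet`, and the reading of the scale as 1 = least likely ironic and 7 = most likely ironic are all assumptions made for this example.

```python
import re
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical source labels per language portion, mapped to one shared label set.
LABEL_MAP = {
    "en": {"1": "ironic", "0": "not_ironic"},
    "it": {"1": "ironic", "0": "not_ironic"},
    "nl": {"ironic": "ironic", "non-ironic": "not_ironic"},
}

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")


def preprocess(text: str) -> str:
    """Apply the same normalization to every language portion (assumed steps)."""
    text = URL_RE.sub("URL", text)        # replace links with a placeholder
    text = MENTION_RE.sub("@USER", text)  # anonymize user mentions
    return " ".join(text.split())         # collapse runs of whitespace


def harmonize_label(lang: str, source_label: str) -> str:
    """Map a dataset-specific irony label onto the shared label set."""
    return LABEL_MAP[lang][source_label]


@dataclass
class AnnotatedTweet:
    text: str
    lang: str
    irony_label: str
    likelihood: int  # layer 1: 7-point scale (assumed 1 = least, 7 = most likely ironic)
    activators: List[Tuple[int, int]] = field(default_factory=list)  # layer 2: character spans

    def __post_init__(self) -> None:
        if not 1 <= self.likelihood <= 7:
            raise ValueError("likelihood must be between 1 and 7")
        for start, end in self.activators:
            if not 0 <= start < end <= len(self.text):
                raise ValueError("activator span lies outside the text")

    def activator_texts(self) -> List[str]:
        """Return the substrings selected as irony activators."""
        return [self.text[start:end] for start, end in self.activators]


tweet = AnnotatedTweet(
    text=preprocess("I   just love Mondays @bob http://x.co/1"),
    lang="en",
    irony_label=harmonize_label("en", "1"),
    likelihood=6,
    activators=[(2, 11)],
)
print(tweet.text, tweet.activator_texts())
```

Storing activators as character offsets into the pre-processed text, rather than as copied strings, keeps each span tied to a unique position even when the same word occurs more than once in a tweet.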
The main contribution of this work is therefore an enhanced multilingual resource with fine-grained annotation labels for irony, which can be exploited both for linguistic investigation and for multilingual irony detection. The resource is particularly novel in that it combines languages that belong to different language families (Romance and Germanic), thus contributing to the expansion of multilingual datasets available for this challenging task and facilitating cross-linguistic and cross-cultural comparisons.
---
References
Cignarella, A. T., Frenda, S., Basile, V., Bosco, C., Patti, V., & Rosso, P. (2018a). Overview of the EVALITA 2018 Task on Irony Detection in Italian Tweets (IronITA). In Proceedings of the Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2018) (Vol. 2263, pp. 1-6). CEUR-WS.org.
Cignarella, A. T., Bosco, C., Patti, V., & Lai, M. (2018b). Application and Analysis of a Multi-layered Scheme for Irony on the Italian Twitter Corpus TWITTIRÒ. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018) (pp. 4204-4211). ELRA.
Cignarella, A. T., Sanguinetti, M., Bosco, C., & Rosso, P. (2020). Marking irony activators in a Universal Dependencies treebank: The case of an Italian Twitter corpus. In Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020) (pp. 5098-5105). ELRA.
Karoui, J., Benamara, F., Moriceau, V., Patti, V., Bosco, C., & Aussenac-Gilles, N. (2017). Exploring the Impact of Pragmatic Phenomena on Irony Detection in Tweets: A Multilingual Corpus Study. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2017) (Volume 1, Long Papers, pp. 262-272). ACL.
Van Hee, C., Lefever, E., & Hoste, V. (2018a). SemEval-2018 Task 3: Irony Detection in English Tweets. In Proceedings of the 12th International Workshop on Semantic Evaluation (pp. 39-50). ACL.
Van Hee, C., Lefever, E., & Hoste, V. (2018b). Exploring the fine-grained analysis and automatic detection of irony on Twitter. Language Resources and Evaluation, 52(3), 707-731.
Wallace, B. C. (2015). Computational irony: A survey and new perspectives. Artificial Intelligence Review, 43, 467-483.