A comparison of ChatGPT and fine-tuned BERT with regard to ABSA in the domain of literary criticism

Gunther Martens

Ghent University

Lore De Greve

Ghent University

Els Lefever

Ghent University

Pranaydeep Singh

Ghent University

Generative pre-trained transformers continue to be the talk of the town, but they have been met with far more limited enthusiasm within NLP research. Bracketing the admittedly vexed question of whether it is feasible and/or ethical to use sparsely documented proprietary models, this contribution assesses the relative merits and drawbacks of LLMs by pitting ChatGPT's API against our in-house Transformers-based architecture developed for the specific task of Aspect-Based Sentiment Analysis (ABSA) in the domain of literary criticism (De Greve et al. 2021).

The increase in the number of user-generated book reviews has opened up unparalleled prospects for empirical investigations of books, reading habits, and audience participation, jumpstarting various recent rule-based and transformer-based approaches in empirical literary studies and cultural analytics (Boot 2023; Salgaro 2023). Despite this flurry of attention, the process of domain adaptation continues to be time-consuming and challenging. Recently, various studies have explored whether large language models (LLMs) may provide zero-shot or few-shot alternatives. They have arrived at very mixed results, ranging from abysmal (Chen et al. 2023) to an assessment of performing on average 25% worse than SOTA models (Kocoń et al. 2023). Specifically with regard to emotion and sentiment extraction, others have been more hopeful (Rathje et al. 2023). While on subtasks of Relation Extraction and Event Extraction GPT models may “rarely exceed 30% of SOTA”, Han et al. (2023) have noticed that “almost all sub-tasks of ABSA can reach more than 50% of SOTA” via plain zero-shot prompting. While we cannot evaluate performance on the extended range of tasks employed in the cited research, we are able to compare results generated by ChatGPT's GPT-3.5 Turbo API with our BERT-based annotations and with our manual annotations for various subtasks of ABSA. In a second step, we will explore how few-shot prompting affects performance and to what extent our architecture may profit from the ability to generate fully annotated synthetic training data. Finally, we will consider to what extent the generative model's potential for steadily growing context windows, for "chaining" subtasks, and for multimodality opens up avenues for future research.
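To make the comparison described above more concrete, the following is a minimal sketch of how zero-shot and few-shot prompts for ABSA could be assembled and how a model's answers could be scored against manual annotations. It is not the setup used in this study: the prompt wording, the three-way sentiment label set, the `aspect | sentiment` answer format, and the function names `build_prompt`, `parse_response` and `micro_f1` are all assumptions made for illustration. The call to the model itself is deliberately left out.

```python
from typing import Iterable

# Hypothetical representation: one (aspect term, sentiment label) pair.
Pair = tuple[str, str]


def build_prompt(review: str,
                 examples: Iterable[tuple[str, list[Pair]]] = ()) -> str:
    """Build a zero-shot prompt, or a few-shot one if examples are passed.

    The instruction text and answer format are illustrative assumptions.
    """
    lines = [
        "Extract every aspect term from the book review and label its "
        "sentiment as positive, negative or neutral. "
        "Answer with one 'aspect | sentiment' per line."
    ]
    for text, pairs in examples:  # annotated demonstrations (few-shot only)
        lines.append(f"Review: {text}")
        lines.extend(f"{aspect} | {sentiment}" for aspect, sentiment in pairs)
    lines.append(f"Review: {review}")
    return "\n".join(lines)


def parse_response(response: str) -> set[Pair]:
    """Turn the model's text answer into a set of lower-cased pairs."""
    pairs: set[Pair] = set()
    for line in response.splitlines():
        if "|" not in line:
            continue  # skip anything that does not follow the answer format
        aspect, sentiment = (part.strip().lower()
                             for part in line.split("|", 1))
        if aspect and sentiment:
            pairs.add((aspect, sentiment))
    return pairs


def micro_f1(predicted: list[set[Pair]], gold: list[set[Pair]]) -> float:
    """Micro-averaged F1 over exact (aspect, sentiment) matches."""
    true_pos = sum(len(p & g) for p, g in zip(predicted, gold))
    n_pred = sum(len(p) for p in predicted)
    n_gold = sum(len(g) for g in gold)
    if true_pos == 0:
        return 0.0
    precision, recall = true_pos / n_pred, true_pos / n_gold
    return 2 * precision * recall / (precision + recall)
```

Under these assumptions, the same `micro_f1` function could be applied to the output of a fine-tuned BERT model and to parsed LLM responses, so that both are measured against the same manual annotations. Exact matching of aspect terms is a strict criterion; a more lenient overlap-based match would be an equally defensible choice.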

Cited references

Boot, Peter, ‘“A Pretty Sublime Mix of WTF and OMG”. Four Explorations into the Practice of Evaluation on Online Book Reviewing Platforms’, Journal of Cultural Analytics, 7.2 (2023) <https://doi.org/10.22148/001c.68086>

Chen, Xuanting, Junjie Ye, Can Zu, Nuo Xu, Rui Zheng, Minlong Peng, and others, ‘How Robust Is GPT-3.5 to Predecessors? A Comprehensive Study on Language Understanding Tasks’ (arXiv, 2023) <https://doi.org/10.48550/arXiv.2303.00293>

De Greve, Lore, Pranaydeep Singh, Cynthia Van Hee, Els Lefever, and Gunther Martens, ‘Aspect-Based Sentiment Analysis for German: Analyzing “Talk of Literature” Surrounding Literary Prizes on Social Media’, Computational Linguistics in the Netherlands Journal, 11 (2021), 85–104

Han, Ridong, Tao Peng, Chaohao Yang, Benyou Wang, Lu Liu, and Xiang Wan, ‘Is Information Extraction Solved by ChatGPT? An Analysis of Performance, Evaluation Criteria, Robustness and Errors’ (arXiv, 2023) <https://doi.org/10.48550/arXiv.2305.14450>

Kocoń, Jan, Igor Cichecki, Oliwier Kaszyca, Mateusz Kochanek, Dominika Szydło, Joanna Baran, and others, ‘ChatGPT: Jack of All Trades, Master of None’ (arXiv, 2023) <http://arxiv.org/abs/2302.10724> [accessed 19 May 2023]

Rathje, Steve, Dan-Mircea Mirea, Ilia Sucholutsky, Raja Marjieh, Claire Robertson, and Jay J. Van Bavel, ‘GPT Is an Effective Tool for Multilingual Psychological Text Analysis’ (PsyArXiv, 2023) <https://doi.org/10.31234/osf.io/sekf5>

Salgaro, Massimo, Stylistics, Stylometry and Sentiment Analysis in German Studies: The Operationalization of Literary Values (Göttingen: V&R unipress, 2023)

Zhu, Yiming, Peixian Zhang, Ehsan-Ul Haq, Pan Hui, and Gareth Tyson, ‘Can ChatGPT Reproduce Human-Generated Labels? A Study of Social Computing Tasks’ (arXiv, 2023) <http://arxiv.org/abs/2304.10145> [accessed 19 May 2023]

CLIN33

The 33rd Meeting of Computational Linguistics in The Netherlands (CLIN 33)

UAntwerpen City Campus: Building R

Rodestraat 14, Antwerp, Belgium

22 September 2023