Using naïve speakers’ judgements to train a transformer-based subjectivity regressor

Elena Savinova

Centre for Language Studies, Radboud University, Nijmegen

Fermín Moscoso del Prado Martín

Centre for Language Studies, Radboud University, Nijmegen

The problem of subjectivity detection, i.e. identifying opinions, attitudes, beliefs and private states in a text, is often approached as a preparatory binary task for sentiment analysis. According to linguistic theories, however, subjectivity is a gradual property, and some utterances are perceived as more subjective than others. In this work, we approach subjectivity analysis as a regression task and use a semi-supervised approach to train a transformer-based RoBERTa model that produces sentence-level subjectivity scores from a small subset of human annotations.

Our dataset consists of news articles and Facebook posts produced by four major UK news sources. For a subset of 398 sentences, we obtained annotations from 19 native English speakers via Prolific. The annotators were asked to rate the subjectivity of each sentence on a 7-point scale; no explicit guidelines were given beyond brief definitions of subjective (expressing opinions, attitudes and beliefs) and objective (stating factual information). After fine-tuning the RoBERTa-base model on the unlabeled part of our dataset, we trained it on the human-labeled subset (train/validation/test split: 298/50/50), with the average subjectivity score per sentence normalized to a [0, 1] scale.

Evaluation on the test set showed that our model's scores correlate strongly with the average human rater's subjectivity judgements (r = .79) and outperform a widely used rule-based "pattern" subjectivity regressor (r = .28). We also evaluated our model as a binary classifier by binarizing the human labels at a threshold of .5 and optimizing the threshold for the model's outputs (.6245) to maximize F1. The model performed well both on our test set (F1 = .80; accuracy 92%) and on the test set of the benchmark subjectivity dataset (F1 = .79; accuracy 78.2%), which consists of sentences taken from movie plot summaries (automatically labeled as objective) and movie review snippets (automatically labeled as subjective).

A closer look at the benchmark dataset suggests that some of our model's errors in fact reflect mislabeling present in that dataset: some sentences taken from movie plot summaries seem rather subjective (e.g. “What better place for a writer to pick up a girl?”). To investigate this issue further, we trained a two-way classifier based on a distilBERT-base-uncased transformer on the benchmark subjectivity dataset and tested it on our human-labeled test set. Its relatively poor performance on our data (F1 = .25; accuracy 75.5%) suggests that state-of-the-art subjectivity classifiers trained on the benchmark subjectivity dataset may be learning to distinguish the language of movie review snippets from that of movie plot descriptions, rather than to classify subjectivity as perceived by native speakers. To conclude, we show that our model generalizes better to other discourse types than the current best systems trained on the benchmark subjectivity dataset. Our work highlights the importance of using human annotations in complex tasks such as subjectivity detection and opens further discussion about the nature of subjectivity classifiers trained on the benchmark dataset.
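As an illustration of the regression setup described above, the following minimal sketch shows how a RoBERTa-base model can be fine-tuned as a sentence-level regressor with the Hugging Face transformers library. The file names, column names and hyperparameters are placeholders, not the exact configuration used in this work.

```python
# Hypothetical sketch: training a sentence-level subjectivity regressor.
# Assumes CSV files with a "sentence" column and a "score" column holding
# human subjectivity ratings already normalized to [0, 1].
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
# A single output head with problem_type="regression" gives an MSE objective.
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=1, problem_type="regression")

data = load_dataset("csv", data_files={"train": "train.csv",
                                       "validation": "dev.csv"})

def tokenize(batch):
    enc = tokenizer(batch["sentence"], truncation=True)
    enc["labels"] = [float(s) for s in batch["score"]]  # continuous targets
    return enc

data = data.map(tokenize, batched=True,
                remove_columns=data["train"].column_names)

args = TrainingArguments(output_dir="subjectivity-regressor",
                         num_train_epochs=5,
                         per_device_train_batch_size=16)
trainer = Trainer(model=model, args=args,
                  train_dataset=data["train"],
                  eval_dataset=data["validation"],
                  tokenizer=tokenizer)
trainer.train()
print(trainer.evaluate())  # MSE loss on the validation split
```

Similarly, the threshold used to read the regressor's continuous output as a binary subjective/objective decision can be chosen by maximizing F1 on held-out data, roughly as below. The score arrays here are random stand-ins for the actual validation ratings and predictions.

```python
# Hypothetical sketch: choosing the classification threshold by maximizing F1
# against human labels binarized at .5. The arrays are placeholders only.
import numpy as np
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
human_scores = rng.uniform(0.0, 1.0, 50)                    # stand-in human ratings in [0, 1]
model_scores = np.clip(human_scores + rng.normal(0, 0.1, 50), 0, 1)  # stand-in predictions

y_true = (human_scores >= 0.5).astype(int)                  # binarize human labels at .5
thresholds = np.linspace(0.0, 1.0, 1001)
f1s = [f1_score(y_true, (model_scores >= t).astype(int), zero_division=0)
       for t in thresholds]
best = thresholds[int(np.argmax(f1s))]
print(f"best threshold = {best:.4f}, F1 = {max(f1s):.2f}")
```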

CLIN33
The 33rd Meeting of Computational Linguistics in The Netherlands (CLIN 33)
UAntwerpen City Campus: Building R
Rodestraat 14, Antwerp, Belgium
22 September 2023