CLCG, University of Groningen
Connotation is a dimension of meaning that refers to the emotional or cultural associations a word may carry beyond its literal definition. Connotations can differ across communities and change over time. This work applies a fully unsupervised method to a collection of 2.197 billion Dutch tweets to quantify diachronic connotation changes over a period of nine years (2014-2022).
Our method is an adaptation of the Connotative Hyperplane (Basile et al., 2022), whose basic components are: (i) at least two corpora covering different time periods; (ii) a list of seed words representing the extreme values of a connotative dimension; (iii) a list of target words. In our case, we selected the connotative dimension of polarity and identified 8 highly positive and 8 highly negative seed words. We then trained an embedding model based on Word2Vec (Mikolov et al., 2013) on the concatenation of pairs of diachronic corpora. Before computing the embeddings, we modified the target word: in each corpus, two thirds of its occurrences are marked with a period suffix, PERIOD1 in the first corpus and PERIOD2 in the second. The concatenated corpus thus contains three versions of the target word, each with roughly the same number of occurrences: w, w_PERIOD1, and w_PERIOD2. This removes the need to align separately trained embedding spaces, since all representations come from a single space. Once the embedding space was generated, we used the embeddings of the seed words to train an SVM with a cosine kernel, where the labels are the positions of the words on the connotative axis. We can then measure how far the embedding vector of each target word lies from the hyperplane, expressed as a cosine distance. For each temporally marked target word we obtain a signed score (positive or negative), and the difference between the two scores gives the diachronic connotation shift.
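The target-word marking step can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy corpora, and the deterministic rule for choosing which occurrences to mark (every third occurrence stays unmarked) are assumptions made for illustration, since the procedure for selecting the two thirds is not specified here.

```python
from collections import Counter


def mark_target(tokens, target, suffix):
    """Append `suffix` to two out of every three occurrences of `target`."""
    marked = []
    seen = 0  # occurrences of the target encountered so far
    for token in tokens:
        if token == target:
            # Assumption: every third occurrence is left unmarked; a random
            # selection of two thirds would serve the same purpose.
            if seen % 3 != 2:
                token = f"{target}_{suffix}"
            seen += 1
        marked.append(token)
    return marked


def build_joint_corpus(corpus1, corpus2, target):
    """Mark each period's corpus, then concatenate them into one corpus."""
    return (mark_target(corpus1, target, "PERIOD1")
            + mark_target(corpus2, target, "PERIOD2"))


# Toy corpora (hypothetical data): flat token lists, six target occurrences each.
corpus_period1 = ["klimaat", "is", "belangrijk"] * 6
corpus_period2 = ["het", "klimaat", "verandert"] * 6

joint = build_joint_corpus(corpus_period1, corpus_period2, "klimaat")
counts = Counter(t for t in joint if t.startswith("klimaat"))
print(counts)
```

When the two corpora contain the target equally often, as in this toy case, the three variants end up with the same count. A single embedding model trained on the joint corpus then yields vectors for all three in one space.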
With respect to the original approach, we introduced three major changes: (i) the target words are keywords describing five specific events in Dutch society (i.e., climate change, protein transition, Dutch national elections, COVID-19, and aviation disasters) over different time spans; (ii) we improved the pre-processing steps by selecting only target words above a frequency threshold (i.e., 200 occurrences per corpus) and removing sources of noise (e.g., URLs and user names); (iii) we followed the approach in Gonen et al. (2020) to measure potential shifts in the meaning of a target word.
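Gonen et al. (2020) detect meaning change by comparing the nearest neighbours of a word across periods: a small intersection suggests that the word is used differently. The sketch below shows one plausible reading of that check, applied to the two period-marked variants within a single embedding space. How the check was actually applied in this work is not detailed here, so the function names, the toy vectors, the value of k, and the normalisation of the overlap to the range 0 to 1 are all assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_neighbours(word, embeddings, k, exclude=()):
    """Return the k words most similar to `word`, skipping `exclude`."""
    skipped = {word, *exclude}
    query = embeddings[word]
    candidates = [w for w in embeddings if w not in skipped]
    candidates.sort(key=lambda w: cosine(query, embeddings[w]), reverse=True)
    return set(candidates[:k])


def neighbour_overlap(target, embeddings, k):
    """Share of top-k neighbours common to both period-marked variants."""
    w1, w2 = f"{target}_PERIOD1", f"{target}_PERIOD2"
    variants = (target, w1, w2)  # the variants must not count as neighbours
    nn1 = nearest_neighbours(w1, embeddings, k, exclude=variants)
    nn2 = nearest_neighbours(w2, embeddings, k, exclude=variants)
    return len(nn1 & nn2) / k


# Toy 2-dimensional embeddings (hypothetical values, for illustration only).
toy_embeddings = {
    "klimaat": [1.0, 0.15],
    "klimaat_PERIOD1": [1.0, 0.1],
    "klimaat_PERIOD2": [0.9, 0.2],
    "milieu": [1.0, 0.0],
    "opwarming": [0.8, 0.3],
    "voetbal": [0.0, 1.0],
    "fiets": [-1.0, 0.2],
}

overlap = neighbour_overlap("klimaat", toy_embeddings, k=2)
print(overlap)
```

An overlap close to 1 indicates that both variants share their neighbourhood, which is consistent with a stable meaning; in that case a difference in polarity score can be read as a connotative shift and not as a change of sense.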
Our results show that none of our target words has undergone lexical semantic change, and that the observed connotative shifts vary in intensity across the different events. For instance, we observe small changes for the keywords associated with the protein transition, while the changes are larger (and on the negative side) for climate change. In a manual evaluation, the direction of the connotative shift identified by the Connotative Hyperplane agreed with manually annotated data in 57% of cases. To conclude, our results indicate that societal events can affect the connotations of words, and that the Connotative Hyperplane can be a valid approach to measuring such shifts.