Beyond Perplexity: Definition Generation for Examining Temporal Generalization of Large Language Models

Iris Luden, Raquel Fernández

Institute for Logic, Language and Computation, University of Amsterdam

The emergence of large language models (LLMs) has significantly improved performance across various Natural Language Processing (NLP) tasks. However, the field of NLP predominantly follows a static language modeling paradigm, which leads to the performance of LLMs deteriorating over time. This indicates a lack of temporal generalization, i.e., the ability to extend their capabilities to data beyond their training period. In real-life NLP applications, models are often pre-trained on data from one time period and then deployed for tasks that inherently involve temporally shifted data. So far, the performance deterioration of LLMs has primarily been attributed to factual changes over time, leading to attempts to update an LLM's factual knowledge in order to avoid performance deterioration. However, not only the facts of the world but also the language we use to describe it change constantly, and recent studies have indicated a relationship between performance deterioration and semantic change. While previous publications have demonstrated that LLM performance may deteriorate over time, this is typically measured using perplexity scores and relative performance on downstream tasks. Such aggregate comparisons of perplexity and accuracy do not explain how temporally shifted data affects LLMs in practice. Given the potential societal impact of NLP applications, it is crucial not only to understand whether LLM performance deteriorates over time, but also to gain insight into how this degradation, particularly when caused by semantic change, is reflected in the output of LLMs.

This work investigates how semantic change in temporally shifted data impacts the performance of an LLM on the downstream task of contextualized word definition generation. This approach offers a dual perspective: a quantitative measurement of performance deterioration, as well as human-interpretable output in the form of the generated definitions.

First, we construct two diachronic corpora of Twitter and Reddit data, one overlapping in time with the model's pre-training period and the other temporally shifted. Next, we use a lexical semantic change detection system to collect a set of semantically changed target words and a set of semantically stable words, and we additionally collect a set of emerging new words. Third, we evaluate the performance of the definition generation model in both time periods and analyze whether semantic change impacts performance. Fourth, we compare the results with cross-entropy and perplexity scores for the same inputs.

The results indicate that (i) the model's performance deteriorates on the task of contextualized word definition generation, (ii) performance deteriorates more for semantically changing words than for semantically stable words, (iii) the model exhibits significantly lower performance and potential bias for emerging new words, and (iv) performance does not correlate with loss or (pseudo-)perplexity scores. Overall, our results show that definition generation is a promising task for assessing a model's capacity for temporal generalization with respect to semantic and lexical change.
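To make the task setup concrete, the sketch below illustrates contextualized word definition generation with a generic instruction-tuned sequence-to-sequence model. It is an illustration only, not the definition generation model evaluated in this work; the model name (google/flan-t5-base) and the prompt format are assumptions.

```python
# Illustrative sketch of contextualized definition generation.
# Assumptions: a generic instruction-tuned seq2seq model and a simple prompt;
# this is not the authors' model or prompt.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "google/flan-t5-base"  # assumed model, for illustration only
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

def generate_definition(target_word: str, context: str) -> str:
    """Generate a definition of `target_word` as it is used in `context`."""
    prompt = f'What is the definition of "{target_word}" in the following sentence? "{context}"'
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    output_ids = model.generate(input_ids, max_new_tokens=32)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

# Example: a word whose dominant usage may have shifted in recent social media data.
print(generate_definition("fire", "That new track is absolutely fire."))
```

For the comparison in step four, the following minimal sketch shows one standard way to compute a pseudo-perplexity score with a masked language model: mask each token in turn, score the true token, and exponentiate the average negative log-likelihood. Again, the model (bert-base-uncased) is an assumption, and this is not necessarily how the scores reported in this work were computed.

```python
# Minimal pseudo-perplexity sketch for a masked language model (assumed: BERT).
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

mlm_name = "bert-base-uncased"  # assumed model, for illustration only
mlm_tokenizer = AutoTokenizer.from_pretrained(mlm_name)
mlm = AutoModelForMaskedLM.from_pretrained(mlm_name).eval()

def pseudo_perplexity(sentence: str) -> float:
    """Exponentiated average negative log-likelihood over one-at-a-time masked tokens."""
    input_ids = mlm_tokenizer(sentence, return_tensors="pt").input_ids[0]
    nlls = []
    with torch.no_grad():
        for i in range(1, input_ids.size(0) - 1):  # skip [CLS] and [SEP]
            masked = input_ids.clone()
            masked[i] = mlm_tokenizer.mask_token_id
            logits = mlm(masked.unsqueeze(0)).logits[0, i]
            nlls.append(-torch.log_softmax(logits, dim=-1)[input_ids[i]].item())
    return float(torch.exp(torch.tensor(sum(nlls) / len(nlls))))
```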

CLIN33
The 33rd Meeting of Computational Linguistics in The Netherlands (CLIN 33)
UAntwerpen City Campus: Building R
Rodestraat 14, Antwerp, Belgium
22 September 2023