Bridging the gap between LLMs and books: from short-context windows to long documents using fragment-level representations

Simone van Bruggen

Bookarang

Niels Bogaards

Bookarang

Maaike Koninx

Bookarang

The recent surge in the capability of large language models has led to major improvements in performance on many text-based tasks. However, accurately analysing very long texts remains an open problem. Even for models with larger context windows (ranging from 4096 tokens for Big Bird to the upcoming 32K-token window of new GPT models), context-dependent tasks on long sequences, such as analysing writing style or character development in fiction books, remain difficult and often inefficient. Available alternatives such as a book’s title or jacket text lack this level of information, and while short summaries can serve as simplified proxies, they are insufficiently rich for more complex tasks. Applying the power of short-context models to much longer contexts can improve performance on these tasks. Having access to the full text and its properties, rather than a condensed paragraph, can also help with identifying specific parts, allowing for more fine-grained tasks and efficient input compression.

We generate a dataset consisting of book text fragments and the corresponding responses generated by GPT-4 on several qualitative axes, including complexity, humour, and realism. Because manually collecting large-scale fragment-level annotations is extremely time-consuming, using a language model to generate a synthetic dataset is a feasible solution that approximates the required level of precision. Existing BERT-based models are fine-tuned on this dataset per axis. We apply these distilled models to score consecutive fragments of longer texts and combine the results into a coherent representation of the complete text. These full-text representations can serve as a minimal working proxy for the main text and can be used in downstream tasks.
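The fragment-level scoring step can be illustrated with a minimal sketch. It assumes that a full-text representation is simply the sequence of per-fragment score vectors, one value per axis; the abstract does not specify how scores are combined, so this is one possible reading rather than the authors' method. The names `split_into_fragments`, `represent_book`, `toy_complexity` and `toy_humour` are hypothetical, and the toy scorers merely stand in for the fine-tuned BERT-based models.

```python
from typing import Callable, Dict, List

# A scorer maps one text fragment to a score on a single qualitative axis.
# In the described pipeline this role is played by a fine-tuned BERT-based
# model; here it is any plain Python callable (an assumption for illustration).
Scorer = Callable[[str], float]


def split_into_fragments(text: str, max_words: int = 200) -> List[str]:
    """Split a text into consecutive fragments of at most max_words words."""
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


def represent_book(
    text: str, scorers: Dict[str, Scorer], max_words: int = 200
) -> List[List[float]]:
    """Return one score vector per fragment, with axes in sorted name order."""
    axes = sorted(scorers)
    return [
        [scorers[axis](fragment) for axis in axes]
        for fragment in split_into_fragments(text, max_words)
    ]


# Toy stand-in scorers (hypothetical, NOT the models from the abstract).
def toy_complexity(fragment: str) -> float:
    """Mean word length as a crude complexity proxy."""
    words = fragment.split()
    return sum(len(w) for w in words) / len(words) if words else 0.0


def toy_humour(fragment: str) -> float:
    """Count of exclamation marks as a crude humour proxy."""
    return float(fragment.count("!"))


representation = represent_book(
    "a bb ccc dddd",
    {"complexity": toy_complexity, "humour": toy_humour},
    max_words=2,
)
```

Under this assumption, the resulting sequence of score vectors preserves fragment order, which is what would allow a sequence model to consume it in a downstream task.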

We show examples of applying the representations to two tasks that depend on abstract properties of longer pieces of text. We train supervised models to score (1) writing style, determining whether a book’s style is more matter-of-fact and prosaic or rich and poetic, and (2) the level of modernity in books. Training relatively small and fast LSTM models on the full-text representation data improves performance on both analysis tasks over other proxies such as blurbs, metadata, and summaries.

CLIN33

The 33rd Meeting of Computational Linguistics in The Netherlands (CLIN 33)

UAntwerpen City Campus: Building R

Rodestraat 14, Antwerp, Belgium

22 September 2023