[insert plot here]: Leveraging LLMs to augment book descriptions for librarians.

Robin van der Markt

Bookarang

Niels Bogaard

Bookarang

In the Netherlands, around 30.000 books get published each year. An important task for librarians is to filter through these books in order to curate their collections. A reliable, readable representation of the book is vital in this acquisition decision moment.

In this work, a system has been developed to summarise book information, using an ensemble of template-based text generation, and text generation using LLM models to combine structured and unstructured information.

In the existing system, we use template-based text generation to produce a text about the book from structured data, such as genre, theme and author information. This information is collected using our information retrieval system, which gathers information about the book and author by applying information retrieval methods on the book's text, back cover blurb, graphical content and external sources.

Additional unstructured book information, such as plot and story development (for fiction) or content summaries (for non-fiction), are typically found in the blurb, which is crafted by the publisher for promotional purposes and may not be entirely objective. Capturing this information in a fixed template format proved difficult due to the variability and subjectivity of this text, as well as the need to separate key points from irrelevant details. While previous text generation approaches failed to accurately and concisely represent this unstructured information, recent large language models allow for producing a fluent and factual text that can be guided with specific directions. We extend the previously generated text with plot information from the book's blurb using Large Language Models. We instruct the model to extract plot information from the blurb, remove promotional information and subjective texts, and inject the plot information in the templated based text. Several LLMs have been analysed for this purpose, and GPT4 was found to be the best performing model for this task.

The resulting text from the ensemble is used in practice to provide purchasing information for libraries. This was previously a time-consuming job for humans who would be required to read the book and hand craft a summary. The generated summary described in this work is used by libraries to make decisions about acquiring books for their collection, cutting down the book release-to-library time by roughly 85%. By automating this process, libraries are able to make faster and better decisions about which books they want to acquire for their collection.

CLIN33

The 33rd Meeting of Computational Linguistics in The Netherlands (CLIN 33)

UAntwerpen City Campus: Building R

Rodestraat 14, Antwerp, Belgium

22 September 2023