Generating Diverse Synthetic Data using LLMs for Domain Adaptation in Dense Retrieval

Konstantinos Papakostas

Zeta Alpha, University of Amsterdam

Jakub Zavrel

Zeta Alpha

Andrew Yates

University of Amsterdam

Within the field of Neural Information Retrieval, domain transfer poses a considerable challenge, as current models often fail to generalize accurately across diverse domains. This is especially the case for dense retrieval models (bi-encoders). This study explores the potential of using Large Language Models (LLMs) to generate synthetic training data for such systems. We present a correlation analysis on the BEIR benchmark (Thakur et al., 2021) to evaluate current methodologies, and we identify the modelling components that typically result in performance improvements. Based on these findings, and on earlier work on the InPars model (Jeronymo et al., 2023), we propose new strategies for leveraging synthetic data, which involve identifying the implicit user intent, enforcing query diversity, and further refining the quality of the training data. In particular, we study the ability of LLMs to bootstrap their own prompts with an instruction that generates varied synthetic data from small sets of domain-specific examples.

We report the impact of these synthetic data generation strategies on domain adaptation, evaluated on the BEIR benchmark. The results show that using LLMs to generate synthetic data can significantly boost the performance of domain-specific retrieval systems.

References:

Vitor Jeronymo, Luiz Bonifacio, Hugo Abonizio, Marzieh Fadaee, Roberto Lotufo, Jakub Zavrel, Rodrigo Nogueira. 2023. InPars-v2: Large Language Models as Efficient Dataset Generators for Information Retrieval. https://arxiv.org/abs/2301.01820

Nandan Thakur, Nils Reimers, Andreas Rücklé, Abhishek Srivastava, Iryna Gurevych. 2021. BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models. https://arxiv.org/abs/2104.08663

CLIN33
The 33rd Meeting of Computational Linguistics in The Netherlands (CLIN 33)
UAntwerpen City Campus: Building R
Rodestraat 14, Antwerp, Belgium
22 September 2023