Zeta Alpha, University of Amsterdam
Within the field of Neural Information Retrieval, domain transfer poses a considerable challenge, as current models often fail to generalize reliably across diverse domains. This is especially the case for dense retrieval models (bi-encoders). This study explores the potential of using Large Language Models (LLMs) to generate synthetic training data for such systems. We present a correlation analysis on the BEIR benchmark (Thakur et al., 2021) to evaluate current methodologies, and we identify the modelling components that typically result in performance improvements. Based on these findings, and on earlier work on the InPars model (Jeronymo et al., 2023), we propose new strategies for leveraging synthetic data, which involve identifying the implicit user intent, enforcing query diversity, and further refining the quality of the training data. In particular, we study the ability of LLMs to bootstrap their own prompts with an instruction that generates varied synthetic data from small sets of domain-specific examples.
We report the impact of these strategies for generating synthetic data on domain adaptation across the BEIR benchmark. The results of this study show that applying LLMs to generate synthetic training data can significantly boost the performance of domain-specific retrieval systems.
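To make the prompt-bootstrapping idea concrete, the sketch below shows one way an InPars-style few-shot prompt can be assembled: a handful of domain-specific (document, query) pairs are formatted into a template that an LLM then completes with a synthetic query for a new target document. This is a minimal illustration under our own assumptions, not the authors' exact implementation; the instruction wording, example pairs, and `build_prompt` helper are all hypothetical.

```python
# Hypothetical domain-specific seed examples (document, query) pairs.
# In practice these would come from a small labelled sample of the target domain.
FEW_SHOT_EXAMPLES = [
    ("Aspirin irreversibly inhibits cyclooxygenase, reducing prostaglandin synthesis.",
     "how does aspirin reduce inflammation"),
    ("The mitochondrion generates ATP via oxidative phosphorylation.",
     "which organelle produces ATP"),
]


def build_prompt(examples, target_document):
    """Assemble an InPars-style few-shot prompt asking an LLM to write a
    relevant query for the target document, seeded with in-domain examples."""
    parts = ["Generate a relevant search query for each document.\n"]
    for doc, query in examples:
        parts.append(f"Document: {doc}\nQuery: {query}\n")
    # The LLM is expected to continue the text after the final "Query:" line.
    parts.append(f"Document: {target_document}\nQuery:")
    return "\n".join(parts)


if __name__ == "__main__":
    prompt = build_prompt(
        FEW_SHOT_EXAMPLES,
        "Statins lower LDL cholesterol by inhibiting HMG-CoA reductase.",
    )
    print(prompt)
```

The resulting prompt string would be sent to an LLM completion endpoint; the generated query, paired with the target document, becomes one synthetic training example for the retriever.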
References:
Vitor Jeronymo, Luiz Bonifacio, Hugo Abonizio, Marzieh Fadaee, Roberto Lotufo, Jakub Zavrel, Rodrigo Nogueira. 2023. InPars-v2: Large Language Models as Efficient Dataset Generators for Information Retrieval. https://arxiv.org/abs/2301.01820
Nandan Thakur, Nils Reimers, Andreas Rücklé, Abhishek Srivastava, Iryna Gurevych. 2021. BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models. https://arxiv.org/abs/2104.08663