TNO, Leiden University
TNO
Data annotation for training machine learning-based NLP systems is a demanding task: it entails data collection, data cleansing, and the careful training and controlled deployment of multiple annotators, all of which carries budgetary ramifications. Yet annotation remains essential, since high-quality training datasets are scarce and crucial for building performant NLP systems.
Recent advances in Large Language Models (LLMs, generative artificial intelligence) have enabled NLP practitioners to partially or fully automate annotation tasks through in-context learning (or prompting). Yet, while LLMs have the potential to reduce costs and to speed up the construction of training datasets, there are downsides as well: LLMs tend to hallucinate and may deviate from (or fail to obey) given instructions.
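To make prompt-based annotation concrete, the sketch below shows a minimal few-shot prompt for entity tagging. The helper `llm_complete`, the prompt wording, and the entity inventory are illustrative assumptions, not the exact setup used in this work.

```python
# Illustrative sketch only: `llm_complete` is a hypothetical stand-in for any
# chat/completion endpoint; prompt format and entity labels are assumptions.
def build_annotation_prompt(sentence: str) -> str:
    """Few-shot (in-context learning) prompt asking an LLM to tag entities."""
    return (
        "Tag PERSON, ORGANISATION and LOCATION entities.\n"
        "Sentence: Alice works for TNO in The Hague.\n"
        "Entities: [('Alice', 'PERSON'), ('TNO', 'ORGANISATION'), ('The Hague', 'LOCATION')]\n"
        f"Sentence: {sentence}\n"
        "Entities:"
    )

def annotate(sentence: str, llm_complete) -> str:
    # The raw completion still needs parsing and validation, since the model
    # may hallucinate spans or ignore the requested output format.
    return llm_complete(build_annotation_prompt(sentence))
```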
In our research, we propose a system that enables domain experts in a security use case to supply a Named Entity Recognition (NER) tagger with new training data. This is achieved by retraining the tagger on synthetic, LLM-generated training data for additional, human-preferred named entity types.
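A minimal sketch of this synthetic-data step is given below, again assuming a hypothetical `llm_complete` helper; the prompt wording, output format, and filtering step are illustrative assumptions rather than the exact pipeline of this work.

```python
import json

def generate_synthetic_examples(entity_type: str, description: str,
                                llm_complete, n: int = 20) -> list[dict]:
    """Ask an LLM to produce sentences containing a new, expert-defined entity type."""
    prompt = (
        f"Generate {n} short sentences from a security domain, each containing "
        f"exactly one {entity_type} entity ({description}). "
        "Return a JSON list of objects with keys 'text' and 'span' "
        "(the exact entity substring)."
    )
    examples = json.loads(llm_complete(prompt))
    # Keep only examples whose span actually occurs in the text,
    # a basic guard against hallucinated spans.
    return [ex for ex in examples if ex["span"] in ex["text"]]
```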
A crucial ingredient of our setup is to bring humans into the loop of the data augmentation process. Additionally, we calibrate the data augmentation process on the class entropy of the retrained NER tagger; this class entropy reflects the uncertainty of the tagger when discriminating between the various entity classes.
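As an illustration of how class entropy can steer augmentation, the sketch below computes the average predictive entropy per class from the tagger's softmax outputs and allocates an augmentation budget proportionally. The proportional budget heuristic is our own assumption for exposition, not necessarily the calibration used in this work.

```python
import numpy as np

def class_entropy(token_probs: np.ndarray, labels: np.ndarray, n_classes: int) -> np.ndarray:
    """Average predictive entropy per predicted entity class.

    token_probs: (n_tokens, n_classes) softmax outputs of the NER tagger
    labels:      (n_tokens,) predicted class index per token
    """
    eps = 1e-12
    per_token = -(token_probs * np.log(token_probs + eps)).sum(axis=1)
    return np.array([per_token[labels == c].mean() if (labels == c).any() else 0.0
                     for c in range(n_classes)])

def augmentation_budget(entropies: np.ndarray, total: int = 200) -> np.ndarray:
    # Assumed heuristic: request more synthetic examples for the classes
    # the retrained tagger is most uncertain about.
    weights = entropies / entropies.sum()
    return np.round(weights * total).astype(int)
```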