TNO, Leiden University
TNO
Data annotation for training machine learning-based NLP systems is a demanding task: it entails data collection, data cleansing, and the careful training and controlled deployment of multiple annotators, all of which carries budgetary ramifications. Yet annotation remains essential, since high-quality training datasets are scarce and crucial for building performant NLP systems.
Recent advances in Large Language Models (LLMs, generative artificial intelligence) have enabled NLP practitioners to partially or fully automate annotation tasks through in-context learning (or prompting). Yet, while LLMs have the potential to reduce costs and to speed up the construction of training datasets, there are downsides as well: LLMs tend to hallucinate and may deviate from (or fail to obey) given instructions.
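To make prompt-based annotation concrete, the sketch below shows a minimal few-shot prompt for entity tagging. The helper `llm_complete`, the prompt wording, and the entity inventory are illustrative assumptions, not the exact setup used in this work.

```python
# Illustrative sketch only: `llm_complete` is a hypothetical stand-in for any
# chat/completion endpoint; prompt format and entity labels are assumptions.
def build_annotation_prompt(sentence: str) -> str:
    """Few-shot (in-context learning) prompt asking an LLM to tag entities."""
    return (
        "Tag PERSON, ORGANISATION and LOCATION entities.\n"
        "Sentence: Alice works for TNO in The Hague.\n"
        "Entities: [('Alice', 'PERSON'), ('TNO', 'ORGANISATION'), ('The Hague', 'LOCATION')]\n"
        f"Sentence: {sentence}\n"
        "Entities:"
    )

def annotate(sentence: str, llm_complete) -> str:
    # The raw completion still needs parsing and validation, since the model
    # may hallucinate spans or ignore the requested output format.
    return llm_complete(build_annotation_prompt(sentence))
```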
In our research, we propose a system that enables domain experts in a security use case to supply a Named Entity Recognition (NER) tagger with new training data. This is achieved by retraining the tagger on synthetic, LLM-generated training data for additional, human-preferred named entity types.
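A minimal sketch of this synthetic-data step is given below, again assuming a hypothetical `llm_complete` helper; the prompt wording, output format, and filtering step are illustrative assumptions rather than the exact pipeline of this work.

```python
import json

def generate_synthetic_examples(entity_type: str, description: str,
                                llm_complete, n: int = 20) -> list[dict]:
    """Ask an LLM to produce sentences containing a new, expert-defined entity type."""
    prompt = (
        f"Generate {n} short sentences from a security domain, each containing "
        f"exactly one {entity_type} entity ({description}). "
        "Return a JSON list of objects with keys 'text' and 'span' "
        "(the exact entity substring)."
    )
    examples = json.loads(llm_complete(prompt))
    # Keep only examples whose span actually occurs in the text,
    # a basic guard against hallucinated spans.
    return [ex for ex in examples if ex["span"] in ex["text"]]
```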
A crucial ingredient of our setup is to bring humans into the loop of the data augmentation process. Additionally, we calibrate the data augmentation process on the class entropy of the retrained NER tagger; this class entropy reflects the uncertainty of the tagger when discriminating between the various entity classes.
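As an illustration of how class entropy can steer augmentation, the sketch below computes the average predictive entropy per class from the tagger's softmax outputs and allocates an augmentation budget proportionally. The proportional budget heuristic is our own assumption for exposition, not necessarily the calibration used in this work.

```python
import numpy as np

def class_entropy(token_probs: np.ndarray, labels: np.ndarray, n_classes: int) -> np.ndarray:
    """Average predictive entropy per predicted entity class.

    token_probs: (n_tokens, n_classes) softmax outputs of the NER tagger
    labels:      (n_tokens,) predicted class index per token
    """
    eps = 1e-12
    per_token = -(token_probs * np.log(token_probs + eps)).sum(axis=1)
    return np.array([per_token[labels == c].mean() if (labels == c).any() else 0.0
                     for c in range(n_classes)])

def augmentation_budget(entropies: np.ndarray, total: int = 200) -> np.ndarray:
    # Assumed heuristic: request more synthetic examples for the classes
    # the retrained tagger is most uncertain about.
    weights = entropies / entropies.sum()
    return np.round(weights * total).astype(int)
```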