Leiden Institute of Advanced Computer Science, Leiden University, Leiden, The Netherlands
Department of Public Health & Primary Care, Leiden University Medical Centre, Leiden, The Netherlands
In this paper, we apply BERT-like models to unstructured medical notes with the goal of classifying them according to the respective patients' smoking status. We additionally classify whether a patient is an active alcohol consumer and whether they actively use recreational drugs. Through these experiments we demonstrate how deeper transformer models can improve upon a string-matching baseline on the task of lifestyle classification in free-text medical notes.
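The core classification setup can be illustrated with a minimal sketch of finetuning a transformer for sequence classification using the Hugging Face transformers library. The Hub identifier "CLTL/MedRoBERTa.nl", the three-way label encoding, and the variables train_texts and train_labels are illustrative assumptions, not the exact configuration used in this study.

```python
# Illustrative sketch: finetuning a Dutch RoBERTa-style model for smoking-status
# classification. Model identifier and label scheme are assumptions for illustration.
import torch
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

MODEL_NAME = "CLTL/MedRoBERTa.nl"  # assumed Hub identifier
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=3)

class NoteDataset(torch.utils.data.Dataset):
    """Wraps tokenised clinical notes and integer smoking-status labels
    (e.g. 0 = non-smoker, 1 = current smoker, 2 = former smoker)."""
    def __init__(self, texts, labels):
        self.enc = tokenizer(texts, truncation=True, padding=True, max_length=512)
        self.labels = labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

# train_texts / train_labels stand in for the hand-labelled notes.
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="smoking-clf", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=NoteDataset(train_texts, train_labels),
)
trainer.train()
```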
Our input dataset was provided by a Dutch hospital and contains the majority of its digital medical notes. The dataset consists of 148,000 Dutch texts, ranging from consultation notes to clinical letters between medical professionals. Every text was labelled automatically with a string-matching query for smoking, drinking and drug usage status. We show that string matching has considerable flaws in the context of these texts, as a large portion of edge cases were assigned the wrong labels. Because these labels were deemed unreliable, we hand-labelled a set of 4,700 texts to serve as ground truth.
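A minimal sketch of a string-matching labeller of this kind is shown below. The Dutch keywords and the label set are hypothetical examples, not the hospital's actual query; the hard-coded ordering and lack of context handling illustrate why edge cases (negations, mentions of other persons, historical status) are easily mislabelled.

```python
# Illustrative string-matching baseline; keywords and labels are hypothetical.
import re

SMOKING_PATTERNS = {
    "non-smoker": re.compile(r"\brookt niet\b|\bniet-roker\b", re.IGNORECASE),  # "does not smoke"
    "smoker": re.compile(r"\broken\b|\brookt\b", re.IGNORECASE),                # "smoking"/"smokes"
}

def label_smoking_status(note: str) -> str:
    """Assign a coarse smoking label by keyword matching.
    The negated pattern is checked first because "rookt niet" also matches "rookt";
    anything the patterns miss falls through to "unknown"."""
    if SMOKING_PATTERNS["non-smoker"].search(note):
        return "non-smoker"
    if SMOKING_PATTERNS["smoker"].search(note):
        return "smoker"
    return "unknown"
```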
We conducted a systematic literature review of BERT models used in the clinical domain, with a preference for Dutch BERT models. Following this review, we trained an ALBERT-like model from scratch on the full collection of texts. We furthermore experimented with continued pretraining on top of three existing Dutch models: RobBERT, belabBERT and MedRoBERTa.nl. Lastly, we experimented with translating the input texts to English in order to finetune the English clinical BERT models ClinicalBERT and BioBERT, using the opus-mt-nl-en neural translation model from the University of Helsinki. We translated the 4,700 hand-labelled texts. To the best of our knowledge, translating Dutch texts to serve as input for English BERT models is a novel approach.
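The translation step could look roughly as follows, assuming the opus-mt-nl-en model is accessed through the Hugging Face transformers library under the identifier "Helsinki-NLP/opus-mt-nl-en"; the example sentence is invented.

```python
# Illustrative sketch of translating Dutch notes to English before finetuning
# English clinical BERT models; Hub identifier is an assumption.
from transformers import MarianMTModel, MarianTokenizer

MT_MODEL = "Helsinki-NLP/opus-mt-nl-en"
mt_tokenizer = MarianTokenizer.from_pretrained(MT_MODEL)
mt_model = MarianMTModel.from_pretrained(MT_MODEL)

def translate_nl_to_en(texts):
    """Translate a batch of Dutch notes to English."""
    batch = mt_tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    generated = mt_model.generate(**batch)
    return [mt_tokenizer.decode(t, skip_special_tokens=True) for t in generated]

english_notes = translate_nl_to_en(["Patiënt rookt niet en drinkt zelden alcohol."])
```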
Ultimately, the model pretrained on top of MedRoBERTa.nl ranked first in Macro F1-score on the test sets of both the smoking (0.93) and drugs (0.77) classification tasks, while performing second best on the alcohol task (0.79). In comparison, the string-matching baseline achieved a Macro F1-score of 0.22 and a shallower Stochastic Gradient Descent classifier achieved 0.85 on the smoking task. Interestingly, the ClinicalBERT model finetuned on translated data outperformed every other model on the alcohol task (0.80) and performed only slightly worse than MedRoBERTa.nl on classifying smoking status (0.92). As we translated only the hand-labelled data, not the full collection of texts, and still achieved results comparable to, and in some cases better than, the Dutch BERT models, we show promise for implementing neural translation models in non-English BERT research, and in particular that translation is a viable option to consider for future BERT classification finetuning tasks. We furthermore show that BERT models outperform classic machine learning methods and string matching on the task of lifestyle classification in clinical text.