Everyone deserves their voice to be heard: Towards fair ASR for all Dutch speakers

Rik Raes

Eindhoven University

Saskia Lensink

TNO

Mykola Pechenizkiy

Eindhoven University

Everyone deserves their voice to be heard. OpenAI's Whisper takes a significant step towards this goal through automatic speech recognition (ASR). Whisper is a multitask, multilingual ASR system that approaches human-level accuracy and robustness on English and achieves state-of-the-art (SotA) performance on many other languages. Recent evidence suggests, however, that SotA ASR systems can produce biased transcriptions: because speech varies widely between speakers, transcription quality can differ greatly across speaker groups. This research investigates to what extent Whisper displays predictive bias for Dutch, measured as performance differences across genders, age groups, and accents, and to what extent this bias can be mitigated by fine-tuning the model on more diverse speech data.

We present the performance of different Whisper models on the publicly available Dutch part of the Common Voice data set (100+ hours), and on a real-world use case from a public broadcasting corporation with an additional 100+ hours of data. Large differences in word error rate (WER) are observed across genders, age groups, and accents for all model sizes. Choosing a model involves a trade-off between model size, overall performance, and predictive bias: we find that the largest model does not necessarily perform best overall, but it shows the least predictive bias. Moreover, we report on our efforts to identify and mitigate biases through fine-tuning.
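
To make the comparison of WER across speaker groups concrete, the following is a minimal sketch of how a per-group WER and a simple bias gap could be computed. It is not the evaluation code used in this work: the function names, the lowercase whitespace tokenisation, the max-minus-min gap measure, and the two toy transcripts are all assumptions made for illustration.

```python
from collections import defaultdict


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer_by_group(samples):
    """Pool errors and reference words per group, then divide.

    `samples` holds (group, reference, hypothesis) triples. Text is
    lowercased and split on whitespace, which is an assumed normalisation.
    """
    errors = defaultdict(int)
    words = defaultdict(int)
    for group, reference, hypothesis in samples:
        ref = reference.lower().split()
        hyp = hypothesis.lower().split()
        errors[group] += edit_distance(ref, hyp)
        words[group] += len(ref)
    return {g: errors[g] / words[g] for g in words if words[g] > 0}


def wer_gap(rates):
    """Assumed bias measure: worst group WER minus best group WER."""
    return max(rates.values()) - min(rates.values())


# Hypothetical toy data, not taken from Common Voice or the broadcast set.
samples = [
    ("female", "de kat zit op de mat", "de kat zit op de mat"),
    ("male", "de hond loopt in het park", "de hond loop in park"),
]

rates = wer_by_group(samples)
print(rates)
print(wer_gap(rates))
```

Pooling errors and reference words per group, rather than averaging per-utterance WER, keeps short utterances from dominating a group's score. Other gap measures, such as a ratio between groups, would fit the same structure.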

CLIN33
The 33rd Meeting of Computational Linguistics in The Netherlands (CLIN 33)
UAntwerpen City Campus: Building R
Rodestraat 14, Antwerp, Belgium
22 September 2023