Radboud University Nijmegen
Radboud University Nijmegen
Radboud University Nijmegen and NovoLearning
Radboud University Nijmegen
Radboud University Nijmegen
The last few years have witnessed an increasing demand for Automatic Speech Recognition (ASR) technology that can be successfully implemented in educational applications supporting the development of language skills. The main reason for this is the need for digital applications that can support speaking and reading skills in the context of teacher shortages, declining reading skills, and increased learner autonomy. In general, such applications have been hindered by poor ASR performance on non-native speech and child speech, in particular for languages other than English. However, recent advancements in ASR research seem to pave the way for broader implementation of this technology in language learning applications for children and non-native speakers, including for languages that are generally less resourced than English.
In this presentation, we report on research that investigated the performance of state-of-the-art, pre-trained ASR models, Wav2Vec2.0 and Whisper, compared to a more traditional Kaldi ASR model. This was carried out with a view to developing an application that can support children acquiring Dutch as a foreign language. In this study, we addressed the following research questions:
RQ1 How do state-of-the-art ASR systems perform on different categories of child speech, in particular native speech, non-native speech, read speech, and extemporaneous speech?
RQ2 How does the performance of these ASR systems vary between primary school-aged and secondary school-aged children?
RQ3 To what extent do ASR-based evaluations of child speech quality compare to assessments by adult and child raters?
We first evaluated ASR performance on read and extemporaneous speech of native and non-native Dutch-speaking children in primary (7-11 years old) and secondary (12-16 years old) school. We then investigated how ASR-based evaluations of the children's pronunciation, fluency, intelligibility, and reading performance relate to judgements of the same speech recordings assigned by adult Dutch speakers and by children.
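To make this evaluation setup concrete, the sketch below shows one way such a model comparison can be run in Python, using HuggingFace transformers pipelines and the jiwer package to compute word error rate. The checkpoint names, file names, and reference transcripts are illustrative assumptions, not the exact models and data used in the study.

```python
# Minimal sketch of an ASR comparison, assuming 16 kHz WAV recordings paired
# with reference transcripts. The checkpoints below are publicly available
# Dutch-capable models chosen for illustration only.
import librosa
import jiwer
from transformers import pipeline

whisper_asr = pipeline("automatic-speech-recognition",
                       model="openai/whisper-large-v2")
wav2vec_asr = pipeline("automatic-speech-recognition",
                       model="facebook/wav2vec2-large-xlsr-53-dutch")

def transcribe(asr, wav_path):
    """Load one recording and return the ASR hypothesis as plain text."""
    audio, _ = librosa.load(wav_path, sr=16000)
    return asr(audio)["text"]

def evaluate(asr, recordings):
    """Compute corpus-level WER over (wav_path, reference_transcript) pairs."""
    hypotheses = [transcribe(asr, path) for path, _ in recordings]
    references = [ref for _, ref in recordings]
    return jiwer.wer(references, hypotheses)

# Hypothetical usage with a small test set of child read speech:
# test_set = [("child_001_read.wav", "de kat zit op de mat"), ...]
# print("Whisper WER:", evaluate(whisper_asr, test_set))
# print("Wav2Vec2.0 WER:", evaluate(wav2vec_asr, test_set))
```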
The results show that recent, pre-trained transformer-based ASR models outperform Kaldi and achieve acceptable performance in both read and extemporaneous modalities, despite the challenging nature of child and non-native speech. Whisper, in particular, achieves the lowest word error rate (WER) for the majority of speech types. For read speech by primary-school-aged children, Whisper achieves a WER of 14.8%, compared to 21% for Wav2Vec2.0. For extemporaneous speech by secondary-school-aged children, Whisper achieves WERs of 30.7% and 33.8% for native and non-native speakers, respectively, compared to 44.8% and 44.6% for Wav2Vec2.0. The difference between native and non-native speech becomes even more apparent for the read speech of this age group: Whisper’s WER for native speakers is 8%, a stark contrast to the 24.8% for non-native speakers.
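For reference, the WER figures above follow the standard definition based on the minimum edit distance between the ASR hypothesis and the reference transcript:

\[ \mathrm{WER} = \frac{S + D + I}{N} \]

where S, D, and I are the numbers of substituted, deleted, and inserted words, and N is the number of words in the reference transcript.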
ASR technology also makes it possible to extract additional measures that provide insights into language proficiency, even without 100 percent recognition accuracy. Our analysis showed that ASR-based feedback can be enhanced by child-assigned evaluations to develop motivating and pedagogically sound language learning applications.
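As an illustration of such additional measures, the sketch below derives a simple speech rate and pause count from word-level Whisper timestamps obtained through the HuggingFace transformers pipeline. The specific measures, the 0.25-second pause threshold, and the checkpoint name are assumptions for illustration, not the exact features used in our analysis.

```python
# Illustrative fluency measures from word-level ASR timestamps; not the
# study's actual feature set.
import librosa
from transformers import pipeline

asr = pipeline("automatic-speech-recognition",
               model="openai/whisper-large-v2")   # illustrative checkpoint

def fluency_measures(wav_path, pause_threshold=0.25):
    """Return words per second and number of silent pauses for one recording."""
    audio, sr = librosa.load(wav_path, sr=16000)
    result = asr(audio, return_timestamps="word")
    words = result["chunks"]  # [{"text": ..., "timestamp": (start, end)}, ...]
    duration = len(audio) / sr
    # Count gaps between consecutive words longer than the pause threshold.
    pauses = sum(
        1
        for prev, nxt in zip(words, words[1:])
        if nxt["timestamp"][0] - prev["timestamp"][1] > pause_threshold
    )
    return {"words_per_second": len(words) / duration, "num_pauses": pauses}

# Hypothetical usage:
# print(fluency_measures("child_001_retelling.wav"))
```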