LT3, Language and Translation Technology Team, Ghent University
Department of Linguistics, Centre for Diversity and Learning, Ghent University
In 2024, centralized testing will be launched in Flanders, starting with secondary education. Writing is among the tested skills, and with thousands of pupils participating, the need arises to explore automated scoring techniques. Automated Essay Scoring (AES) has proven suitable in this respect: it is currently employed in tests such as the TOEFL and the GRE (Richardson and Clesham, 2021) and yields high correlations with human evaluators. However, most AES research has been performed on one particular language (English) and genre (essays), mostly written for higher education purposes (Strobl et al., 2019).
In this work we explore the possibilities of automatically scoring Dutch texts written by learners in the first stage of secondary education. For this purpose we rely on authentic data from six writing assessments carried out by more than 2,300 Flemish pupils. All texts were scored holistically using pairwise comparison (van Daal et al., 2019), reaching a reliability of at least .80.
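The abstract does not detail the scoring model behind the pairwise comparisons, but comparative judgement platforms typically fit a Bradley-Terry model to the judgements to derive one holistic score per text. The following is a minimal illustrative sketch of that estimation step with toy data, not the authors' pipeline:

```python
import numpy as np

def bradley_terry(n_texts, comparisons, n_iter=200):
    """MM estimation of Bradley-Terry abilities from pairwise judgements.

    comparisons: (winner_idx, loser_idx) pairs; the comparison graph should
    be connected, with every text winning and losing at least once.
    """
    strength = np.ones(n_texts)
    for _ in range(n_iter):
        wins = np.zeros(n_texts)
        denom = np.zeros(n_texts)
        for w, l in comparisons:
            wins[w] += 1
            inv = 1.0 / (strength[w] + strength[l])
            denom[w] += inv
            denom[l] += inv
        strength = wins / denom
        strength /= strength.sum()  # fix the arbitrary scale
    return np.log(strength)  # log-abilities; rescale to grades as needed

# Toy round of judgements over four texts (indices 0-3).
print(bradley_terry(4, [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2), (1, 3)]).round(2))
```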
Following current evolutions in NLP, we explore the two main paradigms for training an AES system (Ke and Ng, 2019): feature-based and deep learning. For the feature-based approach, all data were first processed with T-Scan (Pander Maat et al., 2014) to derive a large range of linguistic features, after which joint optimization experiments were conducted. For the deep learning approach we experimented with fine-tuning monolingual Dutch language models: BERTje (de Vries et al., 2019) and RobBERT (Delobelle et al., 2020).
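For illustration, the feature-based paradigm can be sketched as a regression over T-Scan features in which feature selection and model hyperparameters are tuned jointly. The file and column names below are hypothetical, and the learner and search grid are assumptions, not the authors' exact setup:

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Hypothetical T-Scan output: one row per text, feature columns plus a score.
df = pd.read_csv("tscan_features.csv")
X, y = df.drop(columns=["score"]), df["score"]

# Tune feature selection and regressor hyperparameters jointly.
pipe = Pipeline([("scale", StandardScaler()),
                 ("select", SelectKBest(f_regression)),
                 ("svr", SVR())])
grid = {"select__k": [50, 100, 200],
        "svr__C": [0.1, 1, 10],
        "svr__epsilon": [0.1, 0.5]}
search = GridSearchCV(pipe, grid, cv=5, scoring="r2").fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

The deep learning paradigm amounts to fine-tuning a pretrained Dutch model with a single regression output. A minimal sketch with Hugging Face Transformers follows; the model identifiers are the public BERTje and RobBERT checkpoints, while the data file, columns, and hyperparameters are assumptions:

```python
import pandas as pd
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# BERTje; swap in "pdelobelle/robbert-v2-dutch-base" to fine-tune RobBERT.
MODEL = "GroNLP/bert-base-dutch-cased"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
# num_labels=1 gives a single regression output trained with MSE loss.
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=1)

# Hypothetical data file with a 'text' and a holistic 'score' column.
df = pd.read_csv("assessment1.csv")
df["labels"] = df["score"].astype("float32")
dataset = Dataset.from_pandas(df[["text", "labels"]])
dataset = dataset.map(lambda b: tokenizer(b["text"], truncation=True,
                                          max_length=512), batched=True)
splits = dataset.train_test_split(test_size=0.2, seed=42)

args = TrainingArguments(output_dir="aes-bertje", num_train_epochs=3,
                         per_device_train_batch_size=16, learning_rate=2e-5)
Trainer(model=model, args=args, tokenizer=tokenizer,
        train_dataset=splits["train"],
        eval_dataset=splits["test"]).train()
```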
Results on a first assessment reveal that both approaches perform on par. We will discuss the results on all six assessments and the strengths and weaknesses of both approaches, while also considering their complementarity. Finally, we zoom in on the challenges of working with writing products from young learners.
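As a pointer to how such comparisons are typically quantified in AES work, agreement with human scores is usually reported as a correlation or as quadratic weighted kappa. A minimal sketch with toy scores follows; the abstract does not specify which metrics the authors use:

```python
from scipy.stats import pearsonr
from sklearn.metrics import cohen_kappa_score

# Toy holistic scores; a real evaluation would use the full assessment data.
human = [3, 4, 2, 5, 4, 1, 3, 5]
feature_based = [3, 4, 3, 4, 4, 2, 3, 5]
fine_tuned = [3, 5, 2, 5, 3, 1, 3, 4]

for name, preds in [("feature-based", feature_based),
                    ("fine-tuned", fine_tuned)]:
    r, _ = pearsonr(human, preds)
    qwk = cohen_kappa_score(human, preds, weights="quadratic")
    print(f"{name}: Pearson r = {r:.2f}, QWK = {qwk:.2f}")
```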
Delobelle, P., Winters, T., & Berendt, B. (2020). RobBERT: a Dutch RoBERTa-based Language Model. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 3255–3265.
de Vries, W., van Cranenburgh, A., Bisazza, A., Caselli, T., van Noord, G., & Nissim, M. (2019). BERTje: A Dutch BERT Model. arXiv:1912.09582.
Ke, Z., & Ng, V. (2019). Automated Essay Scoring: A Survey of the State of the Art. In Proceedings of IJCAI 2019, pages 6300–6308.
Pander Maat, H.L.W., Kraf, R.L., van den Bosch, A., van Gompel, M., Kleijn, S., Sanders, T.J.M., & van der Sloot, K. (2014). T-Scan: a new tool for analyzing Dutch text. Computational Linguistics in the Netherlands Journal, 4, 53–74.
Richardson, M., & Clesham, R. (2021). Rise of the machines? The evolving role of AI technologies in high-stakes assessment. London Review of Education, 19(1).
Strobl, C., Ailhaud, E., Benetos, K., Devitt, A., Kruse, O., Proske, A., & Rapp, C. (2019). Digital support for academic writing: A review of technologies and pedagogies. Computers & Education, 131, 33–48.
van Daal, T., Lesterhuis, M., Coertjens, L., Donche, V., & De Maeyer, S. (2019). Validity of comparative judgement to assess academic writing: examining implications of its holistic character and building on a shared consensus. Assessment in Education: Principles, Policy & Practice, 26, 59–74.