Speech versus script: a language model analysis

Serkan Bay

University of Amsterdam, Institute for Logic, Language and Computation (ILLC)

Jelke Bloem

University of Amsterdam, Institute for Logic, Language and Computation (ILLC)

A model is shaped by the type and structure of its data. Since speech differs
from script, training a model on transcribed speech data should yield a model
with different properties than training it on written data. This study explored
how, by intrinsically evaluating (i.e. in terms of perplexity and word
similarity) two bidirectional LSTM language models trained on the Corpus
Gesproken Nederlands (CGN) and the Lassy Klein Corpus (LKC), respectively. The
corpora were preprocessed so that they matched in word count; however, the
structural differences (such as vocabulary variation and repetitiveness)
remained.

The CGN model predicted its own test set with more confidence (lower
perplexity) than the LKC model did on its own test set. The models were also
evaluated on each other's test sets. Here, the LKC model achieved a lower
perplexity, suggesting that it generalizes better.
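For readers less familiar with the metric, perplexity can be sketched as the exponential of the average negative log-probability a model assigns to each token of a test set. The snippet below is a minimal illustration only: the function name `perplexity` and the toy probability lists are assumptions made for this example, not the code or data used in the study.

```python
import math


def perplexity(token_probs):
    """Return exp of the mean negative log-probability per token.

    `token_probs` is a hypothetical list holding the probability the
    model assigned to each correct next token in a test set.
    """
    if not token_probs:
        raise ValueError("need at least one token probability")
    mean_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_nll)


# Toy values (assumed, not taken from the study): a model that assigns
# higher probability to the correct tokens ends up with lower perplexity.
confident = [0.5, 0.5, 0.5, 0.5]
uncertain = [0.1, 0.1, 0.1, 0.1]

print(perplexity(confident))
print(perplexity(uncertain))
```

A lower value means the model was, on average, less surprised by the test tokens, which is the sense in which "more confidence" is used above.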

Furthermore, the CGN model correctly associated more target words than the LKC
model, implying that it is more capable of learning the semantic relationships
between words.
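The abstract does not describe how word associations were scored. One common way to run such an evaluation, shown here purely as an assumed sketch, is to take each cue word's nearest neighbour by cosine similarity in the model's embedding space and count how often it matches the expected target. The helper names, the toy two-dimensional vectors and the cue-target pairs below are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest(word, embeddings):
    """Return the vocabulary word most similar to `word`, excluding itself."""
    query = embeddings[word]
    scores = {w: cosine(query, vec) for w, vec in embeddings.items() if w != word}
    return max(scores, key=scores.get)


def association_accuracy(pairs, embeddings):
    """Fraction of (cue, target) pairs where the nearest neighbour is the target."""
    hits = sum(1 for cue, target in pairs if nearest(cue, embeddings) == target)
    return hits / len(pairs)


# Toy embeddings and pairs (assumed for illustration, not from the study).
toy_embeddings = {
    "kat": [0.9, 0.1],
    "hond": [0.8, 0.2],
    "auto": [0.1, 0.9],
    "fiets": [0.2, 0.8],
}
toy_pairs = [("kat", "hond"), ("auto", "fiets")]

print(association_accuracy(toy_pairs, toy_embeddings))
```

Under this assumed setup, a model that "correctly associates more target words" is simply one with a higher accuracy on the same list of pairs.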

The 33rd Meeting of Computational Linguistics in The Netherlands (CLIN 33)
UAntwerpen City Campus: Building R
Rodestraat 14, Antwerp, Belgium
22 September 2023