University of Amsterdam, Institute for Logic, Language and Computation (ILLC)
A model is shaped by the type and structure of its data. Since speech differs from written text, training a model on transcribed speech should yield a model with different properties than training it on written data. This study explores how, by intrinsically evaluating, i.e. in terms of perplexity and word similarity, two bi-directional LSTM language models trained on the Corpus Gesproken Nederlands (CGN) and the Lassy Klein Corpus (LKC), respectively. The corpora were preprocessed to match in word count, while their structural differences (such as vocabulary variation and repetitiveness) remained.
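For reference, per-word perplexity over a held-out set of $N$ tokens is standardly defined as follows (the study's exact normalization is not stated here, so this is the common formulation):
\[
\mathrm{PPL}(w_1,\dots,w_N) \;=\; \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N} \log p(w_i \mid w_{<i})\right),
\]
so a lower perplexity means the model assigns higher probability to the observed text.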
The CGN model predicted its own test set with more confidence (lower perplexity) than the LKC model predicted its own. The models were also evaluated on each other's test sets; here, the LKC model achieved a lower perplexity, suggesting that it generalizes better.
Furthermore, the CGN model correctly associated more target words than the LKC model, implying that it is more capable of learning the semantic relationships between words.
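Word similarity is commonly scored via cosine similarity between learned embedding vectors; assuming this standard setup (the study's exact scoring function is not given here):
\[
\mathrm{sim}(\mathbf{u},\mathbf{v}) \;=\; \frac{\mathbf{u}\cdot\mathbf{v}}{\lVert\mathbf{u}\rVert\,\lVert\mathbf{v}\rVert},
\]
where $\mathbf{u}$ and $\mathbf{v}$ are the embeddings of the two words. A target word would then count as correctly associated when its nearest neighbours under this measure match the expected words.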