Out with the old, in with the new? Comparing embedding models for feature engineering

Michael Bauwens

UCLL Research & Expertise

Demi Paspont

UCLL Research & Expertise

Peter Vanbrabant

UCLL Research & Expertise

In today’s NLP landscape, pretrained models such as BERT (Devlin et al., 2019) and OpenAI’s text embeddings (Neelakantan et al., 2022) are often used to embed text to create features for machine learning or semantic search. These models quickly came to dominate NLP benchmarks (such as GLUE) and superseded earlier embedding techniques. The question arises whether those earlier techniques, such as word2vec (Mikolov et al., 2013) or fastText (Bojanowski et al., 2017), still offer advantages over the current state of the art for the purpose of feature engineering.

In this contribution, Dutch (or multilingual) pretrained embedding models are compared with respect to their performance as well as their ease of use, resource costs, and availability. This last perspective is especially relevant since previous research on this topic has often focused on English rather than Dutch. While Dutch and multilingual transformer models are readily available, this is not always the case for the earlier models. The models considered for comparison include both static embedding models (word2vec, fastText, doc2vec (Le and Mikolov, 2014), GloVe (Pennington et al., 2014)) and context-dependent models (BERT, GPT, and similar models).

Firstly, the models that operate on the word level (e.g. word2vec) are intrinsically evaluated on word similarity using our own Dutch relation identification dataset, similar to the work of Tulkens et al. (2016). This dataset is a translation and modification of a dataset offered through the Gensim library (Rehurek and Sojka, 2011), which is in turn heavily based on Mikolov et al. (2013). Secondly, all models are used to create document representations, via different pooling methods, for three downstream applications: (i) clustering, (ii) classification and regression, and (iii) semantic search. The first two tasks are applied to an internal dataset of elicited emails (Vanbrabant et al., 2023) in order to cluster similar emails (i) and predict the gender or personality of the email authors (ii). For the final task of semantic search (research still ongoing), the quality of the match between user query and search result will be evaluated through human feedback in order to gauge how well the semantic search engine performs.
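For word-level static models, the simplest of the pooling methods mentioned above is mean pooling: averaging the vectors of all in-vocabulary tokens into a single document vector. The sketch below illustrates this idea with a plain dictionary of toy vectors; in practice one would load pretrained vectors (e.g. via Gensim’s KeyedVectors), and the token names and dimensionality here are illustrative assumptions, not the authors’ actual setup.

```python
import numpy as np

def mean_pool(vectors: dict[str, np.ndarray], tokens: list[str]) -> np.ndarray:
    """Average the vectors of in-vocabulary tokens into one document vector.

    Out-of-vocabulary tokens are skipped; if no token is known,
    a zero vector of the same dimensionality is returned.
    """
    hits = [vectors[t] for t in tokens if t in vectors]
    if not hits:
        dim = len(next(iter(vectors.values())))
        return np.zeros(dim)
    return np.mean(hits, axis=0)

# Toy 3-dimensional vectors for two Dutch words (illustrative values only).
toy_vectors = {
    "mail":  np.array([1.0, 0.0, 0.0]),
    "klant": np.array([0.0, 1.0, 0.0]),
}

# "onbekend" is out of vocabulary and is simply ignored.
doc_embedding = mean_pool(toy_vectors, ["mail", "klant", "onbekend"])
```

Other pooling choices (max pooling, TF-IDF-weighted averaging) follow the same pattern and only change how the per-token vectors are combined.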

Interestingly, preliminary results show that the static models reaching the highest scores on the intrinsic evaluation are not necessarily the largest ones. For the clustering task, only OpenAI’s text embeddings produced reasonably well-defined clusters. As for classification and regression, the models were not able to correctly predict the personality of the email’s author. Gender classification was successful with an accuracy of 84%, though this was largely driven by the presence of a signature containing a male or female name. Research is ongoing to measure to what extent a single occurrence of the author’s name influences the document embedding.

The different embedding architectures have their trade-offs, but they all still have their place in NLP. Static models have advantages in terms of efficiency and simplicity, while transformers are more resource-demanding but achieve remarkable performance across various NLP tasks.

Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2017). Enriching Word Vectors with Subword Information. Transactions of the Association for Computational Linguistics, 5, 135–146.

Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. http://arxiv.org/abs/1810.04805

Le, Q. V., & Mikolov, T. (2014). Distributed Representations of Sentences and Documents. arXiv. https://doi.org/10.48550/arXiv.1405.4053

Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space. https://doi.org/10.48550/arXiv.1301.3781

Neelakantan, A., Xu, T., Puri, R., Radford, A., Han, J. M., Tworek, J., Yuan, Q., Tezak, N., Kim, J. W., Hallacy, C., Heidecke, J., Shyam, P., Power, B., Nekoul, T. E., Sastry, G., Krueger, G., Schnurr, D., Such, F. P., Hsu, K., Weng, L. (2022). Text and Code Embeddings by Contrastive Pre-Training (arXiv:2201.10005). https://doi.org/10.48550/arXiv.2201.10005

Pennington, J., Socher, R., & Manning, C. D. (2014). GloVe: Global Vectors for Word Representation. Empirical Methods in Natural Language Processing (EMNLP), 1532–1543. http://www.aclweb.org/anthology/D14-1162

Rehurek, R., & Sojka, P. (2011). Gensim–python framework for vector space modelling. NLP Centre, Faculty of Informatics, Masaryk University, Brno, Czech Republic.

Tulkens, S., Emmery, C., & Daelemans, W. (2016). Evaluating Unsupervised Dutch Word Embeddings as a Linguistic Resource. Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), 4130–4136. https://aclanthology.org/L16-1652

Vanbrabant, P., Bauwens, M., & Tummers, J. (2023). Customer segmentation in complaint emails: a corpus linguistic contextualisation of the lexical approach to personality. Presentation at the Association for Business Communication Regional Conference, Naples, Italy.

The 33rd Meeting of Computational Linguistics in The Netherlands (CLIN 33)
UAntwerpen City Campus: Building R
Rodestraat 14, Antwerp, Belgium
22 September 2023