Klinkt leuk! A tool to predict associations with names and (non)words in the Dutch language

Aron Joosse

Utrecht University

Erik-Jan van Kesteren

Utrecht University

Giovanni Cassani

Tilburg University

In this work, we present a free-to-use tool to predict people’s associations with any character string in the Dutch language. Furthermore, we predict people’s associations with real first names, computer-generated company names, and nonwords based on their written form, considering the intuitions of participants in an online survey.

Expanding on prior research on sound symbolism in names and (non)words, we address the following gaps in the literature. First, we address the absence of sound symbolism studies in the Dutch language by using Dutch stimuli and native Dutch participants. Second, we build upon prior research using embeddings extracted from Distributional Semantic Models (DSMs), specifically the semantic representations extracted using FastText (FT), and compare them to embeddings extracted from a Dutch-language transformer model, RobBERT V2. Third, we extend the analysis of sound symbolism to novel attributes by (1) collecting novel data for the associations of intelligence and trustworthiness, and (2) exploring the extent to which existing open-access word norm data covering a wide range of associations (e.g., concreteness, imageability, arousal) can be used to study sound symbolism.

For the novel data, the real first names were randomly sampled from a list of the 10,000 most common first names in the Netherlands. The company names were produced by a pseudoword generator based on a list of almost 100,000 cleaned names from a Dutch company register. The nonwords were randomly sampled from those provided in the Dutch Lexicon Project 2. In total, we sampled 200 words for all three word types. Using these words, we conducted a survey in which native Dutch-speaking students rated real first names, computer-generated company names, and nonwords on a 4-item best-worst scale. All word types were rated on four associations: (1) femininity, (2) goodness-evilness, (3) intelligence, and (4) trustworthiness.
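The abstract does not say how the best-worst responses are turned into per-item scores. A minimal sketch of one common option, simple counting, is given below: each item's score is the number of times it was chosen as best minus the number of times it was chosen as worst, divided by the number of trials in which it appeared. The function name, the trial format, and the example names are hypothetical and serve only as illustration.

```python
from collections import Counter


def best_worst_scores(trials):
    """Score items from best-worst trials by simple counting.

    `trials` is an iterable of (items, best, worst) tuples, where `items`
    holds the 4 strings shown together and `best`/`worst` are the two
    strings the participant picked. This format is an assumption.
    """
    best, worst, seen = Counter(), Counter(), Counter()
    for items, chosen_best, chosen_worst in trials:
        if chosen_best not in items or chosen_worst not in items:
            raise ValueError("choices must come from the displayed items")
        if chosen_best == chosen_worst:
            raise ValueError("best and worst must differ")
        seen.update(items)
        best[chosen_best] += 1
        worst[chosen_worst] += 1
    # Score lies in [-1, 1]: +1 means always best, -1 means always worst.
    return {item: (best[item] - worst[item]) / seen[item] for item in seen}


# Hypothetical trials for one association (e.g. trustworthiness).
trials = [
    (("Sanne", "Daan", "Emma", "Lars"), "Emma", "Lars"),
    (("Sanne", "Daan", "Emma", "Fleur"), "Emma", "Daan"),
    (("Lars", "Fleur", "Sanne", "Daan"), "Fleur", "Lars"),
]
scores = best_worst_scores(trials)
```

Counting scores are easy to compute and interpret; model-based alternatives exist, and nothing here implies which scoring the study uses.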

The open-access word norm data were identified through a literature search for Dutch word norm and word association rating data. In total, over 10 studies were found, spanning roughly 10 distinct associations rated on between 176 and 24,000 words.

Names were featurized using a custom FT model (character n-grams of length 2 to 5, window size = 5) trained on a combination of the SoNaR-500 corpus, Corpus Gesproken Nederlands (CGNL) and the Dutch part of the 2018 CommonCrawl snapshot. We sentence-tokenized the corpus, lowercased all words and removed punctuation. For BERT, the pretrained RobBERT V2 model was used. We then embedded names using these models. We predicted participant ratings in our novel and open-access data using neural network regression to cope with the large number of dimensions in the embeddings.
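FastText can embed unseen strings such as nonwords because it represents a word through its character n-grams, with `<` and `>` marking the word boundaries. The sketch below only enumerates those n-grams for the 2 to 5 range mentioned above; it is an illustration, not the library's implementation, which additionally hashes n-grams into buckets and sums their learned vectors. The function name and the example name are hypothetical.

```python
def char_ngrams(word, min_n=2, max_n=5):
    """Return the character n-grams of a lowercased word, FastText-style.

    The word is wrapped in '<' and '>' so that prefixes and suffixes
    yield n-grams distinct from word-internal ones.
    """
    token = f"<{word.lower()}>"
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n over the boundary-marked token.
        for start in range(len(token) - n + 1):
            grams.append(token[start:start + n])
    return grams


# Hypothetical example: n-grams for a first name.
grams = char_ngrams("Daan")
```

Because every string of at least one character yields some n-grams, a model built this way can assign a vector to any input, which is what makes predictions for novel names and nonwords possible.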

No preliminary results are available yet. With the validated associations, an online free-to-use tool will be built in which users can input any syntactically legal Dutch name, word, or character string to acquire an overview of the predicted sound symbolic associations. Furthermore, the tool will output a list of the input string’s nearest semantic neighbors.
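The abstract does not specify how nearest semantic neighbors will be retrieved. A minimal sketch, assuming neighbors are ranked by cosine similarity between embedding vectors, is shown below. The toy three-dimensional vectors and the function names are hypothetical; real embeddings would come from the FT or RobBERT models.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_neighbors(query, vocab, k=2):
    """Return the k vocabulary words whose vectors are most similar to `query`."""
    ranked = sorted(vocab, key=lambda word: cosine(query, vocab[word]), reverse=True)
    return ranked[:k]


# Toy vectors, invented purely for illustration.
toy_vocab = {
    "zacht": [0.9, 0.1, 0.0],
    "lief": [0.8, 0.3, 0.1],
    "hard": [-0.7, 0.2, 0.1],
}
neighbors = nearest_neighbors([1.0, 0.2, 0.0], toy_vocab, k=2)
```

A full sort is adequate for a toy vocabulary; with a vocabulary of realistic size, an approximate nearest-neighbor index would likely be a more practical choice.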

CLIN33

The 33rd Meeting of Computational Linguistics in The Netherlands (CLIN 33)

UAntwerpen City Campus: Building R

Rodestraat 14, Antwerp, Belgium

22 September 2023