Tilburg University
We investigate to what extent the cross-modal latent space learned by CLIP (Contrastive Language-Image Pre-training) captures form-meaning correspondences for existing and made-up words.
We use fictional character names as well as a set of pseudowords as targets. Character names were presented to participants in an online survey and rated on how well they fit a male/female or young/old character. Names could be real (e.g., John), talking (e.g., Bolt), or made-up (e.g., Arobynn). Talking names are hypothesized to convey semantic intuitions based on the meaning of the word they rely upon. Real names could leverage both form patterns and co-occurrences in language. Made-up names, finally, have no linguistic context of occurrence, and their interpretation is hypothesized to rely only on sub-lexical regularities in language. Pseudowords were presented as stimuli in a best-worst scaling survey in which participants indicated which adjectives out of a sample of four were the best and worst fit for one of six basic emotions (joy, sadness, anger, disgust, fear, surprise).
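For reference, best-worst scaling responses are typically aggregated into a per-item score by counting how often an item is chosen as best versus worst; the sketch below illustrates that simple counting scheme with hypothetical pseudowords and trials, not the actual scoring procedure or stimuli used in the survey.

    from collections import defaultdict

    def best_worst_scores(trials):
        """Aggregate best-worst scaling trials into per-item scores.

        Each trial shows a set of items and records which one was chosen as the
        best fit and which as the worst fit for the target emotion. The score is
        (#times chosen best - #times chosen worst) / #times shown, in [-1, 1].
        """
        best, worst, shown = defaultdict(int), defaultdict(int), defaultdict(int)
        for trial in trials:
            for item in trial["shown"]:
                shown[item] += 1
            best[trial["best"]] += 1
            worst[trial["worst"]] += 1
        return {item: (best[item] - worst[item]) / shown[item] for item in shown}

    # Hypothetical trials for the emotion "joy"; these pseudowords are invented
    # for illustration and are not the survey stimuli.
    trials = [
        {"shown": ["balimo", "trukos", "shigeh", "virosa"], "best": "balimo", "worst": "trukos"},
        {"shown": ["balimo", "shigeh", "fenira", "trukos"], "best": "fenira", "worst": "trukos"},
    ]
    print(best_worst_scores(trials))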
We then generated images using Stable Diffusion, which relies on CLIP’s latent space to guide the generation, starting from fixed prompts in which only the name or pseudoword varied (the face of a character called <name>; a(n) <pseudoword> face). The images were then fed to three off-the-shelf Computer Vision classifiers per attribute, to assess how robust the patterns are. We compute the inter-annotator agreement (IAA) of each group of classifiers using Fleiss' kappa (κ). Then, we measure the correlation between the ground truth scores (obtained in the online surveys) and the classifier scores, to assess whether the models exhibit patterns similar to those of human subjects.
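A minimal sketch of this generation-and-evaluation pipeline, assuming the Hugging Face diffusers, statsmodels, and scipy APIs; the checkpoint name, the classifier stub, the name list beyond the examples above, and the ratings are placeholders rather than the actual models and data used in the study.

    import numpy as np
    from diffusers import StableDiffusionPipeline
    from scipy.stats import pearsonr
    from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

    # The Stable Diffusion checkpoint is an assumption; the abstract does not name one.
    pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")

    names = ["John", "Bolt", "Arobynn", "Mary", "Blossom", "Ethelia"]  # illustrative subset
    prompts = {name: f"the face of a character called {name}" for name in names}
    images = {name: pipe(prompt).images[0] for name, prompt in prompts.items()}

    def classify_gender(image, classifier_idx):
        """Stand-in for one of three off-the-shelf gender classifiers.

        Replace with a real model; here it returns a dummy 0/1 label
        (0 = male, 1 = female) so the rest of the pipeline runs.
        """
        return np.random.randint(0, 2)

    # One row per image, one column per classifier ("rater").
    labels = np.array([[classify_gender(images[n], c) for c in range(3)] for n in names])

    # Inter-annotator agreement among the three classifiers (Fleiss' kappa).
    counts, _ = aggregate_raters(labels)   # items x categories count matrix
    kappa = fleiss_kappa(counts)

    # Correlation between mean classifier scores and human survey ratings
    # (hypothetical ground-truth values, one per name).
    human_ratings = np.array([0.05, 0.10, 0.30, 0.95, 0.90, 0.70])
    r, p = pearsonr(labels.mean(axis=1), human_ratings)

    print(f"Fleiss' kappa = {kappa:.3f}, Pearson r = {r:.2f} (p = {p:.3f})")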
The IAA is high for gender (κ = 0.729) and moderate for age (κ = 0.424). For gender, we see a strong positive correlation for real names (r = .94, p < .001), and slightly lower correlations for made-up (r = .72, p < .001) and talking names (r = .73, p < .001). As for age, real names (r = .72, p < .001) and talking names (r = .57, p < .001) show a robust correlation, while made-up names show only a moderate correlation (r = .39, p = .02). This shows that CLIP learns sound-symbolic associations for two prominent attributes just from statistical co-occurrences between (sub-)lexical patterns and visual features.
For emotions, the picture is less clear-cut: correlations are weak for joy (r = .17, p = .004) and surprise (r = .19, p = .001), negative for fear (r = -.13, p = .03), and close to zero for sadness (r = .04, p = .50), anger (r = .002, p = .97), and disgust (r = .05, p = .39).
These results show that cross-modal co-occurrences can explain semantic intuitions about known and unknown strings alike, relying on systematic patterns at the sub-lexical level which extend to the cross-modal domain. Since the same mechanism also accounts for patterns involving existing words, a single mechanism may underlie the interpretation of both known and unknown words.