Radboud University
Although the smallest meaning-bearing unit in language is the morpheme, people tend to think in terms of words when generating or analyzing utterances. We have dictionaries that list the meanings of words or their translations. In our community, a central tool for thinking about the relation between words and meaning is WordNet. This line of thinking shows that the peculiarities of natural language lead to a complication: words and meanings are not linked one-to-one. One word can have several meanings (senses), and one meaning can be expressed by several words (synonyms).
There are several reasons why we might want to know more about these relations. Language users may want to know what a word means, how it should be translated, or what the best choice is in a certain context. Lexicographers need an overview of all possible uses of a word in order to include it optimally in their dictionary. Linguists want an overview of the whole vocabulary – or at least of a semantic field – to understand how these relations influence language use. Computational linguists would like more insight into how words are used, in order to improve their NLP systems or – these days – to figure out how their language models actually work.
Computational linguists have already moved on from WordNet, via distributional semantics, to word embeddings. Linguists – who are the focus of this paper – partly followed as far as distributional semantics, but are hesitant to switch to word embeddings. One reason, we presume, is that the first generation of word embeddings (word2vec, GloVe, etc.) represented words rather than word senses. The other reason is the source of the embeddings: a black box with very complicated innards, where it is unclear how anything is created. The first objection has disappeared with context-dependent embeddings such as those produced by BERT. As for the second, it is not always necessary to understand the full workings of something in order to use it fruitfully (think of cars).
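To illustrate the difference, here is a minimal sketch in Python of how context-dependent token embeddings can be obtained with BERT via the Hugging Face transformers library; the model name, the use of the last hidden layer, and the subword-alignment shortcut are our illustrative assumptions, not a description of any particular system.

    # Minimal sketch: one vector per occurrence of a word, not per word type.
    import torch
    from transformers import AutoTokenizer, AutoModel

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased")

    def token_embedding(sentence: str, target: str) -> torch.Tensor:
        """Return the contextual embedding of `target` within `sentence`."""
        enc = tokenizer(sentence, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**enc).last_hidden_state[0]  # (tokens, 768)
        # Locate the first subword of the target word. This assumes the target
        # occurs once and is not split; real code needs proper alignment.
        tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
        idx = tokens.index(tokenizer.tokenize(target)[0])
        return hidden[idx]

    # The same word receives different vectors in different contexts:
    v1 = token_embedding("He sat on the bank of the river.", "bank")
    v2 = token_embedding("She deposited the check at the bank.", "bank")
    print(torch.cosine_similarity(v1, v2, dim=0))  # noticeably below 1.0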
In this paper, we investigate how linguists, especially those with no training in deep learning, can nevertheless be enabled to study word semantics by way of word sense embeddings. In a tool called ContextLens, we employ clustering and visualization techniques to cast the embeddings into a more easily interpretable form for investigation by linguists. Tests in which the SemCor corpus is used as reference material show that some hurdles remain to be overcome, but that our approach is a viable route to aligning (this area of) AI and linguistics, thus contributing to advances in linguistic word semantics, and possibly to explainable AI as well.
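Since ContextLens's internal pipeline is not detailed above, the following Python fragment is only a plausible sketch of the kind of clustering-and-projection step such a tool could build on; the choice of k-means and t-SNE, and all parameter values, are illustrative assumptions on our part, not the tool's actual implementation.

    # Hedged sketch: cluster the contextual embeddings of one word's
    # occurrences into candidate senses and project them to 2D for inspection.
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.manifold import TSNE
    import matplotlib.pyplot as plt

    def plot_sense_clusters(embeddings: np.ndarray, n_senses: int = 3):
        """embeddings: one row per occurrence of the word under study."""
        labels = KMeans(n_clusters=n_senses, n_init=10).fit_predict(embeddings)
        # t-SNE requires more samples than its perplexity value.
        coords = TSNE(n_components=2, perplexity=5).fit_transform(embeddings)
        plt.scatter(coords[:, 0], coords[:, 1], c=labels)
        plt.title("Occurrences of one word, colored by induced sense cluster")
        plt.show()

In such a plot, a linguist can inspect whether the induced clusters correspond to dictionary senses, without needing to reason about the high-dimensional vectors themselves.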
(In addition to a presentation of ContextLens, its components, and what we learned in the process of developing it – which would be best suited to a talk rather than a poster – we are also able to demonstrate the tool during a poster/demo session.)