Utrecht University
Vrije Universiteit Amsterdam
Utrecht University
The calculation of semantic similarity between texts, as well as the detection and generation of paraphrases, is a highly active research area in NLP. Most work has focused on context-independent paraphrases (e.g., "I love CLIN." and "I like CLIN very much."). Context-dependent paraphrases, however, remain underexplored: in dialog, for example, expressions like "here" and "Antwerpen" or "I" and "you" can mean the same thing in the given context but are assigned low semantic similarity by general semantic models.
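For illustration, the minimal sketch below shows how an off-the-shelf sentence encoder scores such pairs; the use of the sentence-transformers library and the specific model name are illustrative assumptions, not the semantic models referred to above.

```python
# Minimal sketch: a general-purpose sentence encoder tends to assign low
# similarity to context-dependent paraphrases. The model name is an
# illustrative choice, not the one used in our work.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

context_independent = ("I love CLIN.", "I like CLIN very much.")
context_dependent = ("I am here.", "You are in Antwerpen.")

for a, b in (context_independent, context_dependent):
    sim = util.cos_sim(model.encode(a), model.encode(b)).item()
    print(f"{a!r} vs. {b!r}: cosine similarity = {sim:.2f}")
```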
Further, in dialog, a paraphrase of the previous speaker's utterance is usually only part of a larger utterance (e.g., Speaker 1: "I AM HERE.", Speaker 2: "Great that YOU ARE IN ANTWERPEN in June. But CLIN33 is only in September."). Previous work has typically classified text pairs holistically as paraphrases or not, whereas we aim to highlight the precise selection of tokens that make up the paraphrase.
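One way to represent such token-level highlights is sketched below; the class and field names are hypothetical and only serve to illustrate the idea of marking token spans in both utterances rather than labeling the pair as a whole.

```python
# Hypothetical sketch of a token-level paraphrase annotation: instead of a
# binary label for the whole pair, we store which tokens in each utterance
# form the paraphrase. All names are illustrative.
from dataclasses import dataclass
from typing import List

@dataclass
class ParaphraseHighlight:
    guest_tokens: List[str]      # tokenized guest utterance
    host_tokens: List[str]       # tokenized host utterance
    guest_highlight: List[int]   # token indices forming the paraphrase
    host_highlight: List[int]

example = ParaphraseHighlight(
    guest_tokens="I am here .".split(),
    host_tokens="Great that you are in Antwerpen in June . But CLIN33 is only in September .".split(),
    guest_highlight=[0, 1, 2],       # "I am here"
    host_highlight=[2, 3, 4, 5],     # "you are in Antwerpen"
)
```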
We use the available "MediaSum" corpus to select 2-person CNN and NPR news interview transcripts and aim to highlight matching tokens in guest and host utterances for context-(in)dependent paraphrases. We expect the share of paraphrases to be high in news interviews relative to other settings as journalists are often trained to use paraphrases or acknowledgments of the guest's previous points.
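A minimal sketch of this selection step, assuming the released MediaSum JSON provides per-dialog "speaker" and "utt" lists (our reading of the data format, not a guaranteed schema), could look as follows:

```python
# Sketch: select 2-person interviews from MediaSum. Assumes each dialog in
# news_dialogue.json carries parallel "speaker" and "utt" lists; adjust the
# field names if the release differs.
import json

with open("news_dialogue.json") as f:
    dialogs = json.load(f)

two_person = [
    d for d in dialogs
    if len(set(d["speaker"])) == 2  # exactly one host and one guest
]
print(f"kept {len(two_person)} of {len(dialogs)} transcripts")
```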
We annotate ~1,100 utterance pairs for possible paraphrase candidates (i.e., "clearly is not a paraphrase" vs. "is possibly a paraphrase"). 750 pairs were annotated by the first author and ~350 pairs by 36 crowd-workers recruited via the platform Prolific. We identified several difficulties with paraphrase detection: similar to previous work, we find that exact equivalence is rare; often, one text is subsumed by the other (e.g., "They change their tax status." follows from "They give up their tax exemption."). Specific to our setting, we find that a partial selection of tokens (upper-cased in the following examples) can often be understood as a paraphrase but does not actually constitute one in context (e.g., "PEOPLE GETTING OUT of Mosul WILL BE IN dire NEED of assistance." and "You must be working with different scenarios because different waves of PEOPLE COMING OUT WILL INVOLVE different NEEDS."). Further, the perspective shift between interlocutors can involve more complex situational changes than the simple "I" -> "you" substitution (e.g., "WE have been the punching bag of the president" and "You said the president used CHICAGO as a punching bag.").
To further understand the difficulties and the inherent ambiguity of paraphrase annotation, we perform a crowdsourcing experiment in which 100 text pairs from news interviews are highlighted for paraphrases by a set of 20 trained annotators. Results are pending. We hope to extend recent work in NLP that has found disagreement between annotators to be a valuable resource for judging task and instance difficulty and ambiguity.
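As a hypothetical illustration (not the analysis we will report), token-level disagreement could be quantified as the share of annotators who highlighted each token, with mid-range shares pointing to ambiguous tokens:

```python
# Hypothetical sketch: per-token agreement as the fraction of annotators who
# highlighted each token index.
from collections import Counter
from typing import List, Set

def token_agreement(highlights: List[Set[int]], n_tokens: int) -> List[float]:
    counts = Counter(i for h in highlights for i in h)
    return [counts[i] / len(highlights) for i in range(n_tokens)]

# e.g., 3 of 4 annotators highlight token 2, only 1 highlights token 5
print(token_agreement([{2, 3}, {2}, {2, 5}, {3}], n_tokens=6))
```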
Finally, being able to accurately identify paraphrases in context is relevant because paraphrases are an important signal of understanding. Research in this area could, for example, help build writing tools for discussion moderation or help dialog systems send more natural signals of task reception.