SpaceNLI: Evaluating the Consistency of Predictions on Spatial Natural Language Inference

Lasha Abzianidze

Institute for Language Sciences

Joost Zwarts

Utrecht University

Yoad Winter

Utrecht University

While many natural language inference (NLI) datasets target certain semantic phenomena, e.g., negation, tense & aspect, monotonicity, and presupposition, to the best of our knowledge, there is no NLI dataset that involves diverse types of spatial expressions and reasoning. We fill this gap by semi-automatically creating an NLI dataset for spatial reasoning, called SpaceNLI. The data samples are automatically generated from a curated set of reasoning patterns, where the patterns are annotated with inference labels by experts.

We generate SpaceNLI in the following way. First, we create spatial NLI problems -- a set of pairs of premise sentences and a hypothesis sentence labeled with one of the three inference labels (entailment, contradiction, and neutral). After several phases of validation by expert annotators, we keep 160 NLI problems. For these problems, we create patterns by replacing noun phrases with NP placeholders. For instance, an NLI problem "John drove through the tunnel" entailing "John was in the tunnel" is rewritten as "NP1 drove through NP2" entailing "NP1 was in NP2". For each pattern, we generate 200 sample NLI problems by replacing NPs with concrete phrases. While doing so, we respect the selectional restrictions imposed by the context to avoid nonsensical sentences. Additionally, the patterns are organized into four groups based on the type of spatial expressions: argument orientation, directional, projective, and non-projective.
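The generation step described above can be illustrated with a minimal sketch (hypothetical code, not the authors' released pipeline): each NP placeholder carries a selectional-restriction tag, and samples are produced by filling the placeholders with compatible noun phrases from a lexicon. The lexicon entries and tag names here are illustrative assumptions.

```python
# Hypothetical sketch of sample generation from an inference pattern.
# Tags on the placeholders encode selectional restrictions, so that
# e.g. NP2 in "drove through NP2" is always a traversable location.
import itertools

# Illustrative mini-lexicon (assumed, not from the actual dataset).
LEXICON = {
    "person": ["John", "Mary", "the driver"],
    "passage": ["the tunnel", "the gate"],
}

PATTERN = {
    "premise": "{NP1} drove through {NP2}",
    "hypothesis": "{NP1} was in {NP2}",
    "label": "entailment",
    "restrictions": {"NP1": "person", "NP2": "passage"},
}

def generate_samples(pattern):
    """Yield (premise, hypothesis, label) triples for every combination
    of noun phrases compatible with the pattern's restrictions."""
    slots = sorted(pattern["restrictions"])
    choices = [LEXICON[pattern["restrictions"][s]] for s in slots]
    for values in itertools.product(*choices):
        fill = dict(zip(slots, values))
        yield (pattern["premise"].format(**fill),
               pattern["hypothesis"].format(**fill),
               pattern["label"])

for prem, hyp, label in generate_samples(PATTERN):
    print(f"{prem} => {hyp} [{label}]")
```

In the actual dataset, 200 such samples are drawn per pattern; the sketch enumerates all combinations for clarity.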

We use SpaceNLI to (1) evaluate state-of-the-art NLI models on these out-of-domain semantic phenomena and (2) gauge to what extent their predictions are sensitive to inference-preserving NP replacements.

We evaluate several large language model-based NLI models trained on various general NLI datasets. The results show that the models achieve accuracy scores in the interval of 54.1-66.5. While this shows that the models can do some spatial reasoning, they are far from being good at it.

In addition to the standard sample-based accuracy score, we introduce a pattern-based accuracy (PA) and argue that it is a more reliable and stricter measure than the standard accuracy for evaluating a system's performance on pattern-based generated data samples. The PA score with a threshold t% measures the ratio of correctly classified inference patterns, where a pattern is considered correctly classified if a system correctly classifies at least t% of the samples generated from that pattern.
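The PA metric can be computed as follows; this is a hedged sketch of the definition above (the grouping format and threshold value are illustrative assumptions, not the authors' implementation):

```python
# Pattern-based accuracy (PA): a pattern counts as correct only if at
# least `threshold` of its generated samples are classified correctly;
# PA is the fraction of patterns that pass this bar.
from collections import defaultdict

def pattern_accuracy(samples, threshold):
    """samples: iterable of (pattern_id, gold_label, predicted_label)."""
    per_pattern = defaultdict(list)
    for pattern_id, gold, pred in samples:
        per_pattern[pattern_id].append(gold == pred)
    correct_patterns = sum(
        1 for results in per_pattern.values()
        if sum(results) / len(results) >= threshold
    )
    return correct_patterns / len(per_pattern)

# Toy example: two patterns, three samples each.
samples = [
    ("P1", "entailment", "entailment"),
    ("P1", "entailment", "entailment"),
    ("P1", "entailment", "neutral"),
    ("P2", "contradiction", "contradiction"),
    ("P2", "contradiction", "contradiction"),
    ("P2", "contradiction", "contradiction"),
]
print(pattern_accuracy(samples, threshold=1.0))  # prints 0.5: only P2 passes
```

Note that with threshold t = 100%, a single misclassified sample fails the whole pattern, which is what makes PA stricter than sample-based accuracy.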

The PA-based evaluation results show that the predictions of NLI systems lack consistency across samples of the same inference pattern. The results also reveal that non-projective spatial inferences (especially those involving the preposition "between") are the most challenging ones for the models.

CLIN33
The 33rd Meeting of Computational Linguistics in The Netherlands (CLIN 33)
UAntwerpen City Campus: Building R
Rodestraat 14, Antwerp, Belgium
22 September 2023