Artificial Intelligence Laboratory, Vrije Universiteit Brussel
Faculté d'Informatique, Université de Namur
Recipes are commonly used as a test bed to assess the ability of machines to understand how to perform everyday activities. Indeed, understanding how to execute recipes is a challenging endeavour due to the underspecified and grounded nature of recipe texts and the fact that recipe execution is a knowledge-intensive and precise activity. Recipe understanding has so far been operationalised through the tasks of (i) semantic parsing of recipe texts, e.g. Mori et al. (2014), and (ii) the execution of recipes in a robotic setting, e.g. Beetz et al. (2011). However, the semantic representations obtained through semantic parsing are often not validated in simulation, making it unclear whether they are precise enough to be executed and whether they can be integrated with situated reasoning and common-sense or domain knowledge. Moreover, the robotic execution of recipes is still in an exploratory stage and is not designed to assess the rich common-sense reasoning that is needed to understand recipes in a human-like manner. In this talk, we introduce a novel benchmark that assesses the ability of an artificial chef to cook a dish specified through a natural language recipe in a simulated kitchen environment. Concretely, the benchmark comprises a corpus of 30 recipes, a procedural semantic representation language, qualitative and quantitative kitchen simulators, and a standardised evaluation procedure. The benchmark task consists in mapping recipes formulated in natural language onto concrete actions executed in one of the kitchen simulators. The procedural semantic representation language specifies 38 atomic actions that a robotic chef should be able to perform, such as bake, cut, and mix. The use of kitchen simulators, rather than an actual robotic kitchen, allows us to focus on a broader range of instructions and on the high-level logic underlying their execution, rather than on low-level robotic control. The standardised evaluation procedure focusses on the successful execution of the instructions. Consequently, the semantic parsing of a recipe text needs to yield a sequence of instructions that is concrete and detailed enough to be executed by the artificial chef. Our novel metric, the dish approximation score, quantifies how similar the prepared dish is to the gold-standard dish, independently of the steps involved in its preparation. In sum, tackling this benchmark requires reasoning over natural language texts, qualitative and quantitative physics, common-sense knowledge, knowledge of the cooking domain, and the action space of a virtual or robotic chef. The benchmark thereby addresses the growing interest in human-centric systems that combine natural language processing and situated reasoning to perform everyday activities. Our benchmark is accessible at https://ehai.ai.vub.ac.be/recipe-execution-benchmark/.
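To give a concrete flavour of the task, the minimal sketch below shows how a short recipe fragment might be rendered as a sequence of atomic actions that an artificial chef could execute. The Python encoding, the argument structure, and the action parameters are illustrative assumptions rather than the benchmark's actual procedural semantic representation language; only the action names bake, cut, and mix echo primitives mentioned above, and the fetch action is hypothetical.

```python
# Hypothetical rendering of a parsed recipe fragment as a sequence of
# atomic actions. The exact syntax and argument structure are
# assumptions for illustration, not the benchmark's actual format.
from dataclasses import dataclass, field


@dataclass
class Action:
    """One atomic instruction for the artificial chef."""
    name: str                       # e.g. "cut", "mix", "bake"
    arguments: dict = field(default_factory=dict)


# "Cut the butter into cubes, mix it with the flour, and bake for 30 minutes."
plan = [
    Action("fetch", {"item": "butter", "destination": "counter"}),  # hypothetical action
    Action("cut",   {"item": "butter", "pattern": "cubes"}),
    Action("mix",   {"items": ["butter", "flour"], "container": "bowl"}),
    Action("bake",  {"container": "bowl", "temperature-celsius": 180,
                     "duration-minutes": 30}),
]

for step in plan:
    print(step.name, step.arguments)
```

Note how every underspecified element of the natural language fragment (which container to use, where to place fetched items) must be made explicit before the plan can be executed in a simulator.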
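The dish approximation score itself is defined by the benchmark's standardised evaluation procedure; as a rough intuition for a metric that compares final dishes while ignoring the intermediate steps, consider the toy comparison below. The state encoding (ingredient-to-property maps) and the scoring formula are assumptions made purely for illustration.

```python
# Toy sketch in the spirit of the dish approximation score: compare the
# final state of the prepared dish to a gold-standard dish, ignoring
# the steps taken to get there. The representation and formula are
# illustrative assumptions, not the benchmark's actual definition.

def dish_approximation(prepared: dict, gold: dict) -> float:
    """Return a score in [0, 1]; 1.0 means the dishes match exactly."""
    if not gold:
        return 1.0 if not prepared else 0.0
    total = 0.0
    for ingredient, gold_props in gold.items():
        props = prepared.get(ingredient)
        if props is None:
            continue  # a missing ingredient contributes nothing
        shared = [p for p in gold_props if props.get(p) == gold_props[p]]
        total += len(shared) / max(len(gold_props), 1)
    # Penalise spurious ingredients absent from the gold-standard dish.
    extras = len(set(prepared) - set(gold))
    return total / (len(gold) + extras)


gold = {"dough": {"mixed": True, "baked": True},
        "icing": {"spread": True}}
prepared = {"dough": {"mixed": True, "baked": True}}  # icing forgotten
print(f"{dish_approximation(prepared, gold):.2f}")    # 0.50
```

The key design point, shared with the actual metric, is that two executions reaching the same final dish through different action sequences receive the same score.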
References
• Shinsuke Mori, Hirokuni Maeta, Yoko Yamakata, and Tetsuro Sasada, ‘Flow Graph Corpus from Recipe Texts’, in Proceedings of the 9th International Conference on Language Resources and Evaluation, pp. 2370–2377, Paris, France, (2014).
• Michael Beetz, Ulrich Klank, Ingo Kresse, Alexis Maldonado, Lorenz Mösenlechner, Dejan Pangercic, Thomas Rühr, and Moritz Tenorth, ‘Robotic Roommates Making Pancakes’, in Proceedings of the 11th IEEE-RAS International Conference on Humanoid Robots, pp. 529–536, New York, NY, USA, (2011).