University of Groningen
TNG
Effective deliberation is key to decision making. As an indispensable means of critical thinking, it underpins the success of individuals and organizations: conflicts can be resolved with rational outcomes, biased standpoints can be identified, and vital decisions can be taken collectively. However, effective deliberation requires a particular kind of knowledge exchange: diverse, concise, and credible arguments must be delivered objectively to address the various perspectives of a contested topic. To support the deliberation process, our long-term goal is to build a framework that provides users with automatically generated, high-quality argumentation knowledge for the topic under discussion.

Toward this goal, we conducted a preliminary study exploring the automatic generation of arguments that follow specific patterns of reasoning. We focused on argumentation schemes, a well-established concept in argumentation theory. Leveraging Large Language Models (LLMs), we devised a prompt-based method that generates arguments tailored to a chosen topic and scheme. For instance, given the topic "smoking" and the "Argument from Cause to Effect" scheme, our method can produce multiple arguments that satisfy both constraints. Our approach involves manually creating a set of prompt templates based on predefined "prompt types," where each prompt type consists of rules that define the required content and structure of a prompt. These prompts serve as inputs to a set of autoregressive transformer language models (used without fine-tuning) that generate arguments accordingly.

To assess how well each model-prompt combination generates high-quality, scheme-based arguments, we selected a range of controversial and diverse topics from the IBM Debater datasets (available at https://research.ibm.com/haifa/dept/vst/debating_data.shtml) and generated arguments for all combinations of model-prompt pairs and topics. Because the resulting number of arguments makes manual evaluation impractical, we propose an automatic filtering pipeline. The pipeline first evaluates each argument automatically on four metrics: content richness, stance, argumentativeness, and topic relevance. The arguments with the highest average scores are then selected for manual evaluation, which applies the same four metrics along with additional measures of plausibility and bias.

The main findings from our evaluation are as follows: (1) the GPT series produces arguments of the highest quality, and (2) argument quality varies significantly with the type and complexity of the argumentation scheme employed. Furthermore, although the best arguments were initially identified by their automatic scores, we observed a noticeable drop in quality during the manual evaluation stage.
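To make the generation step concrete, the sketch below shows how a prompt template could be filled with a topic and an argumentation scheme and passed to an off-the-shelf autoregressive model. The template wording, the scheme rule, and the gpt2 checkpoint are illustrative assumptions for a minimal example; they are not the exact prompt types or models used in the study.

```python
# Minimal sketch of scheme-conditioned argument generation.
# The template, scheme description, and model choice are illustrative
# assumptions, not the study's actual prompt types or models.
from transformers import pipeline

# One hypothetical "prompt type": a template whose rules require the topic,
# the scheme name, and a short description of its reasoning pattern.
SCHEME_PROMPT = (
    "Write an argument about the topic '{topic}' that follows the "
    "'{scheme}' argumentation scheme: {scheme_rule}\nArgument:"
)

SCHEMES = {
    "Argument from Cause to Effect": (
        "state a cause related to the topic and conclude that its "
        "effect will (or will not) occur."
    ),
}

# Any off-the-shelf autoregressive model can stand in here; no fine-tuning.
generator = pipeline("text-generation", model="gpt2")

def generate_arguments(topic: str, scheme: str, n: int = 3) -> list[str]:
    """Fill the template and sample n candidate arguments."""
    prompt = SCHEME_PROMPT.format(
        topic=topic, scheme=scheme, scheme_rule=SCHEMES[scheme]
    )
    outputs = generator(
        prompt,
        max_new_tokens=80,
        num_return_sequences=n,
        do_sample=True,
        temperature=0.9,
        return_full_text=False,
    )
    return [out["generated_text"].strip() for out in outputs]

print(generate_arguments("smoking", "Argument from Cause to Effect"))
```

Running the same function over every model, prompt type, and topic yields the full pool of candidate arguments referred to above.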
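The automatic filtering step is, at its core, a ranking over metric scores. The sketch below assumes each of the four metrics is exposed as a scoring function that maps an (argument, topic) pair to a value in [0, 1]; the dummy scorers are placeholders, not the metric implementations used in the study.

```python
# Minimal sketch of the automatic filtering step: score each argument on
# the four metrics and keep the candidates with the highest average score.
from statistics import mean
from typing import Callable

Scorer = Callable[[str, str], float]  # (argument, topic) -> score in [0, 1]

def filter_arguments(
    arguments: list[str],
    topic: str,
    scorers: dict[str, Scorer],
    top_k: int = 10,
) -> list[tuple[str, float]]:
    """Rank arguments by their mean metric score and return the top_k."""
    ranked = [
        (arg, mean(scorer(arg, topic) for scorer in scorers.values()))
        for arg in arguments
    ]
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]

# Hypothetical usage with dummy scorers standing in for the four metrics:
dummy: Scorer = lambda arg, topic: 0.5
scorers = {
    "content_richness": dummy,
    "stance": dummy,
    "argumentativeness": dummy,
    "topic_relevance": dummy,
}
# top_candidates = filter_arguments(candidates, "smoking", scorers, top_k=5)
```

The arguments surviving this filter are the ones passed on to manual evaluation with the additional plausibility and bias measures.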