LELMA Framework

LELMA is a framework, written in Python and Prolog, for verifying and iteratively improving the correctness of reasoning generated by LLMs. It was developed for reasoning about game-theoretic dilemmas but, thanks to its modular design, it can be adapted to other domains. The framework and its evaluation are described in more detail in this paper.

Overview

The framework consists of four main components:

  • Reasoner: An LLM responsible for producing reasoning.
  • Translator: An LLM that translates statements from the Reasoner's output into logical queries sent to the Solver.
  • Solver: A normal logic program implemented in Prolog.
  • Feedback loop: This mechanism provides feedback if any query evaluations fail. Each failed query is translated back to natural language and forwarded to the Reasoner using a feedback prompt.

The general overview of the architecture is shown below.

LELMA schema
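
To make the loop concrete, a minimal Python sketch is shown below. The class and method names are hypothetical and intended only to illustrate the control flow; the actual implementation lives in src/lelma.py.

def run_lelma(reasoner, translator, solver, prompt_maker, max_attempts=5):
    # Start from the instruction prompt describing the task.
    prompt = prompt_maker.make_instruction_prompt()
    for _ in range(max_attempts):
        reasoning = reasoner.generate(prompt)      # Reasoner: produce reasoning
        queries = translator.translate(reasoning)  # Translator: statements -> Prolog queries
        failed = [q for q in queries if not solver.evaluate(q)]  # Solver: evaluate each query
        if not failed:
            return reasoning                       # no failed queries: accept the reasoning
        # Feedback loop: failed queries are translated back to natural
        # language and sent to the Reasoner in a feedback prompt.
        prompt = prompt_maker.make_feedback_prompt(reasoning, failed)
    return reasoning                               # last attempt after max_attempts tries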

Installation

To install the required dependencies, run:

pip install -r requirements.txt

The framework also requires SWI-Prolog to be installed.
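
On Debian/Ubuntu, for example, it can be installed with:

sudo apt-get install swi-prolog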

Usage

To run the sample experiment, use the following command in your terminal:

python experiment.py

You can modify the parameters of the experiment by editing full_experiment.ini. GPT-4 and Gemini are used by default in the experiment; to run them, the respective API keys have to be stored in environment variables.
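
For example, assuming the variable names conventionally read by the OpenAI and Google clients (check llms/gpt4.py and llms/gemini.py for the exact names the framework expects):

export OPENAI_API_KEY=...
export GOOGLE_API_KEY=...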

Project Structure

The structure of the project is as follows:

.
├── DATA/
│   ├── CONFIG/
│   └── TEMPLATES/
├── llms/
│   ├── gemini.py
│   └── gpt4.py
├── src/
│   ├── base/
│   │   ├── base_llm.py
│   │   └── base_prompt_maker.py
│   ├── lelma.py
│   ├── prompt_maker.py
│   ├── setup_logger.py
│   └── solver.py
├── experiment.py
└── solver.pl

The base directory contains abstract classes that need to be implemented to adapt the framework to a specific use case (base_prompt_maker.py) or to use a specific LLM (base_llm.py). The DATA directory contains configuration data and the templates for prompts and predicates.
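
As an illustration, adding support for a new LLM amounts to subclassing BaseLLM, as the existing llms/gpt4.py and llms/gemini.py do. The sketch below uses a hypothetical method name; implement whichever abstract methods src/base/base_llm.py actually declares:

from src.base.base_llm import BaseLLM

class MyLLM(BaseLLM):
    # Hypothetical interface: a single text-in, text-out generation method.
    # Check src/base/base_llm.py for the abstract methods actually required.
    def generate(self, prompt: str) -> str:
        # Call the model's API here and return the completion text.
        raise NotImplementedError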

Adaptations

To adapt the framework for use in other domains, the following steps are needed:

  1. The Solver needs to be replaced with a domain-specific solver.
  2. Templates for the instruction, translation, and feedback prompts have to be provided.
  3. A predicates.csv file for the domain has to be specified. This file serves as the basis for the translation from natural language to queries and back (an illustrative row is shown after this list). The columns are as follows:

Column Name                Description
predicate                  A Prolog predicate.
regex                      A regular expression pattern used to extract the predicate's arguments.
long_desc                  A detailed natural language explanation of the predicate.
short_desc                 A brief natural language explanation of the predicate.
inverse_mapping            A regular expression for creating natural language feedback from a failed query as its negation.
inverse_mapping_positive   A regular expression for creating natural language feedback from a failed query by substituting the correct values.
  4. A class inheriting from BasePromptMaker has to be implemented to handle creating prompts from the templates (see PromptMaker as an example). Note that there are two alternative types of feedback based on failed queries: one providing the negation of a failed query (e.g. "10 is not the highest payoff for choice B") and one substituting the correct values into a failed query (e.g. "35 is the highest payoff for choice B"). The latter turned out to be more effective in the experiments.
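
For illustration, a row for a hypothetical highest-payoff predicate might contain entries along these lines (made-up values, not taken from the actual file):

predicate:                  highest_payoff(Choice, Value)
regex:                      a pattern matching statements such as "the highest payoff for choice B is 10"
long_desc:                  Value is the highest payoff the player can obtain by selecting Choice.
short_desc:                 highest payoff for a choice
inverse_mapping:            "{Value} is not the highest payoff for choice {Choice}"
inverse_mapping_positive:   "{Value} is the highest payoff for choice {Choice}"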

Evaluation

The framework was evaluated using two LLMs: GPT-4 Omni and Gemini 1.0 Pro. The models were prompted to reason about and choose an action in one-shot games: Prisoner's Dilemma (PD), Stag Hunt (SH), and Hawk-Dove (HD). The payoff matrices were as follows:

Prisoner's Dilemma

P1/P2         Betray (R)   Confess (B)
Betray (R)    (1, 1)       (5, 0)
Confess (B)   (0, 5)       (3, 3)

Stag Hunt

P1/P2         Hare (R)     Stag (B)
Hare (R)      (1, 1)       (3, 0)
Stag (B)      (0, 3)       (5, 5)

Hawk-Dove

P1/P2         Hawk (R)     Dove (B)
Hawk (R)      (0, 0)       (5, 1)
Dove (B)      (1, 5)       (3, 3)

Each model had at most five reasoning attempts in the feedback loop. Each game was repeated 30 times. The rules of the game were presented in the prompt in natural language. The name of the game was not given, and the action names were substituted with 'B' and 'R' to make the task more challenging. To assess the effectiveness of the framework in detecting and correcting reasoning errors, each reasoning sample was later manually evaluated by three independent evaluators. The evaluation protocol is available here, the logs from the experiment here, and the aggregated evaluations here.

Choice distribution

Choices GPT4 Choices Gemini

After the correction of reasoning errors, risk-averse choices become more prevalent in the final reasoning attempt than in the original attempt, especially for GPT-4 Omni.

Reasoning correctness

Correctness GPT4 Correctness Gemini

Reasoning correctness, according to the criteria specified in the evaluation protocol, increases in the final reasoning attempt, in particular for GPT-4 Omni, which is able to use the corrective feedback effectively.

Confusion matrix

Conf mat GPT4 Conf mat Gemini

The confusion matrix for the first reasoning attempt, comparing actual correctness (determined by human evaluators) with predicted correctness (determined by the absence of failed queries), shows high error-detection accuracy.

Authors

Agnieszka Mensfelt
Kostas Stathis
Vince Trencsenyi

Citing This Work

@article{mensfelt2024lelma,
  title={Logic-Enhanced Language Model Agents for Trustworthy Social Simulations},
  author={Agnieszka Mensfelt and Kostas Stathis and Vince Trencsenyi},
  year={2024},
  journal={arXiv preprint arXiv:2408.16081},
  url={https://arxiv.org/abs/2408.16081}
}
