Prompt Leakage Probing

Overview

The Prompt Leakage Probing project provides a framework for testing Large Language Model (LLM) agents for their susceptibility to system prompt leaks, it was implemented using FastAgency and AutoGen. It currently implements two attack strategies:

Simple Attack: Uses PromptGeneratorAgent and PromptLeakageClassifierAgent to attempt prompt extraction.
Base64 Attack: Enables PromptGeneratorAgent to encode sensitive parts of the prompt in Base64 to bypass sensitive prompt detection.

Prerequisites

Ensure you have the following installed:

Python >=3.10

Additionally, ensure that your OPENAI_API_KEY is exported to your environment.

Setup Instructions

1. Install the Project

Clone the repository and install the dependencies:

pip install ."[dev]"

2. Run the Project Locally

Start the application using the provided script:

./scripts/run_fastapi_locally.sh

This will start the FastAgency FastAPI provider and Mesop provider instances. You can then access the application through your browser.

3. Use the Devcontainer (Optional)

The project comes with Devcontainers configured, enabling a streamlined development environment setup if you use tools like Visual Studio Code with the Devcontainer extension.

Application Usage

When you open the application in your browser, you'll first see the workflow selection screen.

Running the Tests

After selecting "Attempt to leak the prompt from selected LLM model", you will start a workflow for probing the LLM for prompt leakage. During this process, you will:

Select the prompt leakage scenario you want to test.
Choose the model you want to test.
Specify the number of attempts to leak the prompt in the chat.

The PromptGeneratorAgent will then generate adversarial prompts aimed at making the tested agent leak its prompt. After each response from the tested agent, the PromptLeakageClassifierAgent will analyze the response and report the level of prompt leakage.

Prompt generation

In this step, the PromptGeneratorAgent generates adversarial prompts designed to elicit prompt leakage from the tested agent. The tested model is then probed with the generated prompt using a function call.

Tested agent response

The tested agent's response is the returned value from the function call initiated by the PromptGeneratorAgent. This response represents how the agent reacts to the probing prompt and serves as a input for subsequent analysis.

Response classification

The response is then passed to the PromptLeakageClassifierAgent, which evaluates it for signs of prompt leakage. This evaluation informs whether sensitive information from the original prompt has been exposed. Additionally, the response may guide further prompt refinement, enabling an iterative approach to testing the agent's resilience against prompt leakage attempts.

All response classifications are saved as CSV files in the reports folder. These files contain the prompt, response, reasoning, and leakage level. They are used to display the reports flow, which we will now demonstrate.

Displaying the Reports

In the workflow selection screen, select "Report on the prompt leak attempt". This workflow provides a detailed report for each prompt leakage scenario and model combination that has been tested.

Tested Models

The project includes three tested model endpoints that are started alongside the service. These endpoints are used to demonstrate the prompt leakage workflow and can be accessed through predefined routes. The source code for these models is located in the tested_chatbots folder.

Description of Tested Models

Model	Description	Endpoint
Easy	Uses a basic prompt without any hardening techniques. No canary words are included, and no LLM guardrail is applied.	`/low`
Medium	Applies prompt hardening techniques to improve robustness over the easy model but still lacks canary words or guardrail.	`/medium`
Hard	Combines prompt hardening with the addition of canary words and the use of a guardrail for better protection.	`/high`

Implementation Details

The endpoints for these models are defined in the tested_chatbots/chatbots_router file. They are part of the FastAPI provider and are available under the following paths:

/low: Easy model endpoint.
/medium: Medium model endpoint.
/high: Hard model endpoint.

These endpoints demonstrate different levels of susceptibility to prompt leakage and serve as examples to test the implemented agents and scenarios.

Name		Name	Last commit message	Last commit date
Latest commit History 67 Commits
.devcontainer		.devcontainer
.github		.github
docker		docker
docs/docs/assets/img		docs/docs/assets/img
prompt_leakage_probing		prompt_leakage_probing
reports		reports
scripts		scripts
tests		tests
.codespell-whitelist.txt		.codespell-whitelist.txt
.dockerignore		.dockerignore
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
.secrets.baseline		.secrets.baseline
.semgrepignore		.semgrepignore
README.md		README.md
fly.toml		fly.toml
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Prompt Leakage Probing

Overview

Prerequisites

Setup Instructions

1. Install the Project

2. Run the Project Locally

3. Use the Devcontainer (Optional)

Application Usage

Running the Tests

Prompt generation

Tested agent response

Response classification

Displaying the Reports

Tested Models

Description of Tested Models

Implementation Details

About

Releases

Packages

Contributors 3

Languages

airtai/prompt-leakage-probing

Folders and files

Latest commit

History

Repository files navigation

Prompt Leakage Probing

Overview

Prerequisites

Setup Instructions

1. Install the Project

2. Run the Project Locally

3. Use the Devcontainer (Optional)

Application Usage

Running the Tests

Prompt generation

Tested agent response

Response classification

Displaying the Reports

Tested Models

Description of Tested Models

Implementation Details

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 3

Languages

Packages