Commit

Merge branch 'main' into nb-failures-debug-1
AlejandroEsquivel committed Dec 9, 2024
2 parents 1dd3229 + 4855635 commit eb816b5
Showing 10 changed files with 375 additions and 165 deletions.
6 changes: 4 additions & 2 deletions docs/concepts/async_streaming.ipynb
@@ -4,9 +4,11 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Async Stream-validate LLM responses\n",
"# Async stream-validate LLM responses\n",
"\n",
"Asynchronous behavior is generally useful in LLM applciations. It allows multiple, long-running LLM requests to execute at once. Adding streaming to this situation allows us to make non-blocking, iterative validations over each stream as chunks arrive. This document explores how to implement this behavior using the Guardrails framework.\n",
"Asynchronous behavior is generally useful in LLM applications. It allows multiple, long-running LLM requests to execute at once. \n",
"\n",
"With streaming, you can make non-blocking, iterative validations over each stream as chunks arrive. This document explores how to implement this behavior using the Guardrails framework.\n",
"\n",
"**Note**: learn more about streaming [here](./streaming).\n"
]
75 changes: 54 additions & 21 deletions docs/concepts/deploying.md

Large diffs are not rendered by default.

37 changes: 37 additions & 0 deletions docs/concepts/performance.md
@@ -0,0 +1,37 @@
# Performance

Performance for Gen AI apps can mean two things:

* Application performance: The total time taken to return a response to a user request
* Accuracy: How often a given LLM returns an accurate answer

This document addresses application performance and strategies to minimize latency in responses. For tracking accuracy, see our [Telemetry](/docs/concepts/telemetry) page.

## Basic application performance

Guardrails consists of a guard and a series of validators that the guard uses to validate LLM responses. Generally, the guard itself executes in under 10ms, and correctly configured validators should add only around 100ms of additional latency.

The largest latency and performance issues will come from your selection of LLM. It's important to capture metrics around LLM usage and assess how different LLMs handle different workloads in terms of both performance and result accuracy. [Guardrails AI's LiteLLM support](https://www.guardrailsai.com/blog/guardrails-litellm-validate-llm-output) makes it easy to switch out LLMs with minor changes to your guard calls.
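
As a rough sketch of what that swap can look like (the model names, message payload, and validator choice here are illustrative assumptions, and the call signature follows the LiteLLM-style usage in recent Guardrails releases, so check it against your installed version):

```python
from guardrails import Guard
from guardrails.hub import ToxicLanguage  # assumes this validator is installed from the Hub

guard = Guard().use(ToxicLanguage(on_fail="fix"))

# Because Guardrails delegates LLM calls to LiteLLM, trying a different provider
# is usually just a change to the model string (hypothetical model names below).
for model in ["gpt-4o-mini", "gemini/gemini-1.5-flash"]:
    result = guard(
        model=model,
        messages=[{"role": "user", "content": "Summarize our return policy in one sentence."}],
    )
    print(model, result.validation_passed)
```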

## Performance tips

Here are a few tips to get the best performance out of your Guardrails-enabled applications.

**Use async guards for the best performance**. Use the `AsyncGuard` class to make concurrent calls to multiple LLMs and process the response chunks as they arrive. For more information, see [Async stream-validate LLM responses](/docs/async-streaming).
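
A minimal sketch of the concurrent pattern, assuming a LiteLLM-compatible model string and a validator installed from the Hub; for chunk-by-chunk validation, the same call can be made with `stream=True` as covered in the async streaming guide:

```python
import asyncio
from guardrails import AsyncGuard
from guardrails.hub import ToxicLanguage  # assumed to be installed from the Hub

guard = AsyncGuard().use(ToxicLanguage(on_fail="fix"))

async def validated_call(prompt: str):
    # Each call is awaited independently, so multiple LLM requests run concurrently
    # while the guard validates each response.
    return await guard(
        model="gpt-4o-mini",  # hypothetical model name
        messages=[{"role": "user", "content": prompt}],
    )

async def main():
    results = await asyncio.gather(
        validated_call("Summarize document A."),
        validated_call("Summarize document B."),
    )
    for outcome in results:
        print(outcome.validation_passed)

asyncio.run(main())
```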

**Use a remote server for heavy workloads**. More compute-intensive workloads, such as the ML model inference behind validators, work best when run with dedicated memory and compute. For example, guards that use a single Machine Learning (ML) model for validation can run in milliseconds on GPU-equipped machines but may take tens of seconds on ordinary CPUs. Guardrailing orchestration itself, by contrast, performs well on general-purpose compute.

To account for this, offload performance-critical validation work by:

* Using [Guardrails Server](/docs/concepts/deploying) to run certain guard executions on a dedicated server
* Leveraging [remote validation inference](/docs/concepts/remote_validation_inference) to configure validators to call a REST API for inference results instead of running inference locally (see the sketch after this list)
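
For the second option, here is a hedged sketch of the validator-level configuration. The endpoint URL is a placeholder; `use_local` and `validation_endpoint` are the constructor arguments described in the remote validation inference docs.

```python
from guardrails import Guard
from guardrails.hub import ToxicLanguage  # assumes a validator with remote-inference support

guard = Guard().use(
    ToxicLanguage(
        use_local=False,  # skip the local ML model and call the hosted inference endpoint
        # validation_endpoint="https://your-host/validate",  # placeholder for a self-hosted endpoint
        on_fail="fix",
    )
)
```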

The Guardrails client/server model is hosted via Flask. For best performance, [follow our guidelines on configuring your WSGI servers properly](/docs/concepts/deploying) for production.

**Use purpose-built LLMs for re-validators**. When a guard fails, you can decide how to handle it by setting the appropriate `on_fail` action. The `OnFailAction.REASK` and `OnFailAction.FIX_REASK` actions ask the LLM to correct its output, with `OnFailAction.FIX_REASK` re-validating the revised output. In general, re-validation works best when using a small, purpose-built LLM fine-tuned to your use case.
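
A small sketch of wiring up a re-ask flow; the `OnFailAction` import path and the `num_reasks` argument reflect common Guardrails usage but should be checked against your installed version, and the model name is illustrative:

```python
from guardrails import Guard, OnFailAction  # import path assumed; may vary by version
from guardrails.hub import ToxicLanguage    # assumes the validator is installed

# REASK sends failing output back to the LLM for correction;
# FIX_REASK additionally re-validates the corrected output.
guard = Guard().use(ToxicLanguage(on_fail=OnFailAction.REASK))

result = guard(
    model="gpt-4o-mini",  # model used for both the initial call and any re-asks in this sketch
    messages=[{"role": "user", "content": "Describe the product launch."}],
    num_reasks=1,         # bound the number of correction rounds
)
print(result.validation_passed)
```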

## Measure performance using telemetry

Guardrails supports OpenTelemetry (OTEL) and a number of OTEL-compatible telemetry providers. You can use telemetry to measure the performance and accuracy of Guardrails AI-enabled applications, as well as the performance of your LLM calls.

For more, read our [Telemetry](/docs/concepts/telemetry) documentation.
41 changes: 23 additions & 18 deletions docs/concepts/remote_validation_inference.ipynb
@@ -6,15 +6,22 @@
"source": [
"# Remote Validation Inference\n",
"\n",
"## The Need\n",
"## The problem\n",
"\n",
"As a concept, guardrailing has a few areas which, when unoptimized, can be extremely latency and resource expensive to run. The main two areas are in guardrailing orchestration and in the ML models used for validating a single guard. These two are resource heavy in slightly different ways. ML models can run with really low latency on GPU-equipped machines, while guardrailing orchestration benefits from general memory and compute resources. Some ML models used for validation run in tens of seconds on CPUs, while they run in milliseconds on GPUs.\n",
"As a concept, [guardrailing](https://www.guardrailsai.com/docs/concepts/guard) has a few areas that, when unoptimized, can introduce latency and be extremely resource-expensive. The main two areas are: \n",
"\n",
"* Guardrailing orchestration; and\n",
"* ML models that validate a single guard\n",
"\n",
"These are resource-heavy in slightly different ways. ML models can run with low latency on GPU-equipped machines. (Some ML models used for validation run in tens of seconds on CPUs, while they run in milliseconds on GPUs.) Meanwhile, guardrailing orchestration benefits from general memory and compute resources. \n",
"\n",
"## The Guardrails approach\n",
"\n",
"The Guardrails library tackles this problem by providing an interface that allows users to separate the execution of orchestraion from the exeuction of ML-based validation.\n",
"The Guardrails library tackles this problem by providing an interface that allows users to separate the execution of orchestration from the execution of ML-based validation.\n",
"\n",
"The layout of this solution is a simple upgrade to validator libraries themselves. Instead of *always* downloading and installing ML models, they can be configured to reach out to a remote endpoint. This remote endpoint hosts the ML model behind an API that has a uninfied interface for all validator models. Guardrails hosts some of these as a preview feature for free, and users can host their own models as well by following the same interface.\n",
"The layout of this solution is a simple upgrade to validator libraries themselves. Instead of *always* downloading and installing ML models, you can configure them to call a remote endpoint. This remote endpoint hosts the ML model behind an API that presents a unified interface for all validator models. \n",
"\n",
"Guardrails hosts some of these for free as a preview feature. Users can host their own models by following the same interface.\n",
"\n",
"\n",
":::note\n",
@@ -26,15 +33,15 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Using Guardrails Inferencing Endpoints\n",
"## Using Guardrails inferencing endpoints\n",
"\n",
"To use an guardrails endpoint, you simply need to find a validator that has implemented support. Validators with a Guardrails hosted endpoint are labeled as such on the [Validator Hub](https://hub.guardrailsai.com). One example is ToxicLanguage.\n",
"To use a guardrails endpoint, find a validator that has implemented support. Validators with a Guardrails-hosted endpoint are labeled as such on the [Validator Hub](https://hub.guardrailsai.com). One example is [Toxic Language](https://hub.guardrailsai.com/validator/guardrails/toxic_language).\n",
"\n",
"\n",
":::note\n",
"To use remote inferencing endpoints, you need to have a Guardrails API key. You can get one by signing up at [the Guardrails Hub](https://hub.guardrailsai.com).\n",
"To use remote inferencing endpoints, you need a Guardrails API key. You can get one by signing up at [the Guardrails Hub](https://hub.guardrailsai.com). \n",
"\n",
"Then, run `guardrails configure`\n",
"Then, run `guardrails configure`.\n",
":::"
]
},
@@ -79,7 +86,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The major benefit of hosting a validator inference endpoint is the increase in speed and throughput compared to running locally. This implementation makes use cases such as streaming much more viable!\n"
"The benefit of hosting a validator inference endpoint is the increase in speed and throughput compared to running locally. This implementation makes use cases such as [streaming](https://www.guardrailsai.com/docs/concepts/streaming) much more viable in production.\n"
]
},
{
@@ -114,11 +121,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Toggling Remote Inferencing\n",
"\n",
"To enable/disable remote inferencing, you can run the cli command `guardrails configure` or modify your `~/.guardrailsrc`.\n",
"## Toggling remote inferencing\n",
"\n",
"\n"
"To enable/disable remote inferencing, you can run the CLI command `guardrails configure` or modify your `~/.guardrailsrc`."
]
},
{
Expand All @@ -142,10 +147,10 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"To disable remote inferencing from a specific validator, you can add a `use_local` kwarg to the validator's initializer\n",
"To disable remote inferencing from a specific validator, add a `use_local` kwarg to the validator's initializer. \n",
"\n",
":::note\n",
"When runnning locally, you may need to reinstall the validator with the --install-local-models flag.\n",
"When running locally, you may need to reinstall the validator with the `--install-local-models` flag.\n",
":::"
]
},
Expand All @@ -172,9 +177,9 @@
"source": [
"## Hosting your own endpoint\n",
"\n",
"Validators are able to point to any endpoint that implements the interface that Guardrails validators expect. This interface can be found in the `_inference_remote` method of the validator.\n",
"Validators can point to any endpoint that implements the interface that Guardrails validators expect. This interface can be found in the `_inference_remote` method of the validator.\n",
"\n",
"After implementing this interface, you can host your own endpoint (for example, using gunicorn and Flask) and point your validator to it by setting the `validation_endpoint` constructor argument.\n"
"After implementing this interface, you can host your own endpoint (for example, [using gunicorn and Flask](https://flask.palletsprojects.com/en/stable/deploying/gunicorn/)) and point your validator to it by setting the `validation_endpoint` constructor argument.\n"
]
},
{
@@ -225,7 +230,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.9"
"version": "3.12.7"
}
},
"nbformat": 4,
@@ -639,7 +639,7 @@ resource "aws_lb_listener" "app_lb_listener" {
resource "aws_lb_target_group" "app_lb" {
name = "${local.deployment_name}-nlb-tg"
protocol = "TCP"
port = 80
port = var.backend_server_port
vpc_id = aws_vpc.backend.id
target_type = "ip"
Expand All @@ -650,6 +650,7 @@ resource "aws_lb_target_group" "app_lb" {
timeout = "3"
unhealthy_threshold = "3"
path = "/"
port = var.backend_server_port
}
lifecycle {
40 changes: 22 additions & 18 deletions docs/integrations/llama_index.ipynb
@@ -38,7 +38,7 @@
},
{
"cell_type": "code",
"execution_count": 13,
"execution_count": 3,
"metadata": {},
"outputs": [
{
Expand All @@ -50,15 +50,14 @@
"\n",
"\n",
"Installing hub:\u001b[35m/\u001b[0m\u001b[35m/guardrails/\u001b[0m\u001b[95mcompetitor_check...\u001b[0m\n",
"✅Successfully installed guardrails/competitor_check version \u001b[1;36m0.0\u001b[0m.\u001b[1;36m1\u001b[0m!\n",
"✅Successfully installed guardrails/competitor_check!\n",
"\n",
"\n"
]
}
],
"source": [
"! guardrails hub install hub://guardrails/detect_pii --no-install-local-models -q\n",
"! guardrails hub install hub://guardrails/competitor_check --no-install-local-models -q"
"! guardrails hub install hub://guardrails/detect_pii hub://guardrails/competitor_check --no-install-local-models -q"
]
},
{
Expand All @@ -70,7 +69,7 @@
},
{
"cell_type": "code",
"execution_count": 6,
"execution_count": 4,
"metadata": {},
"outputs": [
{
Expand All @@ -79,7 +78,7 @@
"text": [
" % Total % Received % Xferd Average Speed Time Time Time Current\n",
" Dload Upload Total Spent Left Speed\n",
"100 75042 100 75042 0 0 959k 0 --:--:-- --:--:-- --:--:-- 964k\n"
"100 75042 100 75042 0 0 353k 0 --:--:-- --:--:-- --:--:-- 354k\n"
]
}
],
Expand All @@ -99,7 +98,7 @@
},
{
"cell_type": "code",
"execution_count": 7,
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
@@ -136,7 +135,7 @@
},
{
"cell_type": "code",
"execution_count": 8,
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
Expand All @@ -148,7 +147,12 @@
" competitors=[\"Fortran\", \"Ada\", \"Pascal\"],\n",
" on_fail=\"fix\"\n",
" )\n",
").use(DetectPII(pii_entities=\"pii\", on_fail=\"fix\"))"
").use(\n",
" DetectPII(\n",
" pii_entities=[\"PERSON\", \"EMAIL_ADDRESS\"], \n",
" on_fail=\"fix\"\n",
" )\n",
")"
]
},
{
Expand All @@ -162,21 +166,21 @@
},
{
"cell_type": "code",
"execution_count": 9,
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The author worked on writing short stories and programming, starting with early attempts on an IBM 1401 using Fortran in 9th grade, and later transitioning to microcomputers like the TRS-80 and Apple II to write games, rocket prediction programs, and a word processor.\n"
"The author is Paul Graham. Growing up, he worked on writing short stories and programming, starting with the IBM 1401 in 9th grade using an early version of Fortran. Later, he transitioned to microcomputers like the TRS-80 and began programming more extensively, creating simple games and a word processor.\n"
]
}
],
"source": [
"# Use index on it's own\n",
"query_engine = index.as_query_engine()\n",
"response = query_engine.query(\"What did the author do growing up?\")\n",
"response = query_engine.query(\"Who is the author and what did they do growing up?\")\n",
"print(response)"
]
},
Expand All @@ -189,14 +193,14 @@
},
{
"cell_type": "code",
"execution_count": 11,
"execution_count": null,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The author worked on writing short stories and programming, starting with early attempts on an IBM 1401 using [COMPETITOR] in 9th <URL>er, the author transitioned to microcomputers, building a Heathkit kit and eventually getting a TRS-80 to write simple games and <URL>spite enjoying programming, the author initially planned to study philosophy in college but eventually switched to AI due to a lack of interest in philosophy courses.\n"
"The author is <PERSON>. Growing up, he worked on writing short stories and programming, starting with the IBM 1401 in 9th grade using an early version of [COMPETITOR]. Later, he transitioned to microcomputers like the TRS-80 and Apple II, where he wrote simple games, programs, and a word processor. \n"
]
}
],
Expand All @@ -206,7 +210,7 @@
"\n",
"guardrails_query_engine = GuardrailsQueryEngine(engine=query_engine, guard=guard)\n",
"\n",
"response = guardrails_query_engine.query(\"What did the author do growing up?\")\n",
"response = guardrails_query_engine.query(\"Who is the author and what did they do growing up?\")\n",
"print(response)\n",
" "
]
Expand All @@ -220,14 +224,14 @@
},
{
"cell_type": "code",
"execution_count": 12,
"execution_count": null,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The author worked on writing short stories and programming while growing <URL>ey started with early attempts on an IBM 1401 using [COMPETITOR] in 9th <URL>er, they transitioned to microcomputers, building simple games and a word processor on a TRS-80 in <DATE_TIME>.\n"
"The author is <PERSON>. Growing up, he worked on writing short stories and programming. He started with early attempts on an IBM 1401 using [COMPETITOR] in 9th grade. Later, he transitioned to microcomputers, building a Heathkit kit and eventually getting a TRS-80 to write simple games and programs. Despite enjoying programming, he initially planned to study philosophy in college but eventually switched to AI due to a lack of interest in philosophy courses. \n"
]
}
],
Expand All @@ -237,7 +241,7 @@
"chat_engine = index.as_chat_engine()\n",
"guardrails_chat_engine = GuardrailsChatEngine(engine=chat_engine, guard=guard)\n",
"\n",
"response = guardrails_chat_engine.chat(\"Tell me what the author did growing up.\")\n",
"response = guardrails_chat_engine.chat(\"Tell me who the author is and what they did growing up.\")\n",
"print(response)"
]
}