v0.0.15
MadcowD committed Nov 21, 2024
1 parent 6b4e4ab commit 56af99a
Showing 7 changed files with 253 additions and 35 deletions.
210 changes: 210 additions & 0 deletions docs/src/core_concepts/evaluation.rst.partial
@@ -0,0 +1,210 @@
Evaluations
===========

Evaluations in ELL provide a powerful framework for assessing and analyzing Language Model Programs (LMPs). This guide covers the core concepts and features of the evaluation system.

Basic Usage
-----------

Here's a simple example of creating and running an evaluation:

.. code-block:: python

   import ell
   from ell import Evaluation

   @ell.simple(model="gpt-4")
   def my_lmp(input_text: str):
       return f"Process this: {input_text}"

   # Define a metric function
   def accuracy_metric(datapoint, output):
       return float(datapoint["expected_output"].lower() in output.lower())

   # Create an evaluation
   eval = Evaluation(
       name="basic_evaluation",
       n_evals=10,
       metrics={"accuracy": accuracy_metric}
   )

   # Run the evaluation
   results = eval.run(my_lmp, n_workers=10)

Core Components
---------------

Evaluation Configuration
~~~~~~~~~~~~~~~~~~~~~~~~

The ``Evaluation`` class accepts several key parameters (a configuration sketch follows the list):

- ``name``: A unique identifier for the evaluation
- ``n_evals``: Number of evaluations to run
- ``metrics``: Dictionary of metric functions
- ``dataset``: Optional dataset for evaluation
- ``samples_per_datapoint``: Number of samples per dataset point (default: 1)
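
For instance, a dataset-driven evaluation that samples each datapoint several times could be configured as follows. This is a sketch based only on the parameters listed above; ``dataset`` and ``accuracy_metric`` are assumed to be defined as in the surrounding examples:

.. code-block:: python

   eval = Evaluation(
       name="qa_evaluation",
       dataset=dataset,                      # list of datapoints with "input" keys
       samples_per_datapoint=3,              # sample each datapoint three times
       metrics={"accuracy": accuracy_metric},
   )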

Metrics
~~~~~~~

Metrics are functions that assess the performance of your LMP. They can be:

1. Simple scalar metrics:

.. code-block:: python

   def length_metric(_, output):
       return len(output)

2. Structured metrics:

.. code-block:: python

   def detailed_metric(datapoint, output):
       return {
           "length": len(output),
           "contains_keyword": datapoint["keyword"] in output,
           "response_time": datapoint["response_time"]
       }

3. Multiple metrics:

.. code-block:: python

   metrics = {
       "accuracy": accuracy_metric,
       "length": length_metric,
       "detailed": detailed_metric
   }

Dataset Handling
~~~~~~~~~~~~~~~~

Evaluations can use custom datasets:

.. code-block:: python

   dataset = [
       {
           "input": {"question": "What is the capital of France?"},
           "expected_output": "Paris"
       },
       {
           "input": {"question": "What is the capital of Italy?"},
           "expected_output": "Rome"
       }
   ]

   eval = Evaluation(
       name="geography_quiz",
       dataset=dataset,
       metrics={"accuracy": accuracy_metric}
   )

Parallel Execution
~~~~~~~~~~~~~~~~~~

Evaluations support parallel execution for improved performance:

.. code-block:: python

   # Run with 10 parallel workers
   results = eval.run(my_lmp, n_workers=10, verbose=True)

Results and Analysis
--------------------

Result Structure
~~~~~~~~~~~~~~~~

Evaluation results include:

- Metric summaries (mean, std, min, max)
- Individual run details
- Execution metadata
- Error tracking

Accessing Results
~~~~~~~~~~~~~~~~~

.. code-block:: python

   # Get mean accuracy
   mean_accuracy = results.metrics["accuracy"].mean()

   # Get standard deviation
   std_accuracy = results.metrics["accuracy"].std()

   # Access individual runs
   for run in results.runs:
       print(f"Run ID: {run.id}")
       print(f"Success: {run.success}")
       print(f"Duration: {run.end_time - run.start_time}")

Advanced Features
-----------------

Evaluation Types
~~~~~~~~~~~~~~~~

ELL supports different types of evaluations (see the sketch after this list):

- ``METRIC``: Numerical performance metrics
- ``ANNOTATION``: Human or model annotations
- ``CRITERION``: Pass/fail criteria
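
A sketch of how these types might map onto the constructor, assuming ``annotations`` and ``criterion`` parameters mirror the ``metrics`` parameter shown above (these parameter names and the helper functions are illustrative assumptions, not confirmed API):

.. code-block:: python

   def tone_annotation(datapoint, output):
       # Free-form label rather than a number (ANNOTATION)
       return "formal" if "regards" in output.lower() else "casual"

   def under_200_chars(datapoint, output):
       # Pass/fail check (CRITERION)
       return len(output) < 200

   eval = Evaluation(
       name="typed_evaluation",
       dataset=dataset,
       metrics={"accuracy": accuracy_metric},   # METRIC
       annotations={"tone": tone_annotation},   # ANNOTATION (assumed parameter)
       criterion=under_200_chars,               # CRITERION (assumed parameter)
   )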

Version Control
~~~~~~~~~~~~~~~

Evaluations support versioning (see the setup sketch after this list):

- Version numbers
- Commit messages
- History tracking
- Multiple runs per version
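
Version history is recorded through the configured store; a minimal setup sketch (``./logdir`` is an arbitrary example path, and ``autocommit`` asks ell to generate commit messages automatically):

.. code-block:: python

   import ell

   # Persist LMP versions, evaluation runs, and commit messages locally
   ell.init(store="./logdir", autocommit=True)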

Error Handling
~~~~~~~~~~~~~~

The evaluation system provides robust error handling and reporting (a sketch of inspecting failed runs follows the list):

- Automatic error capture
- Failed run management
- Success status tracking
- Detailed error messages
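
One way to surface failed runs after an evaluation, reusing the ``success`` flag shown under Accessing Results (the ``error`` attribute is an assumption about what the run record exposes):

.. code-block:: python

   results = eval.run(my_lmp, n_workers=10)

   failed = [run for run in results.runs if not run.success]
   for run in failed:
       # `error` is assumed to hold the captured exception message
       print(f"Run {run.id} failed: {run.error}")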

ELL Studio Integration
----------------------

The evaluation system integrates with ELL Studio (launched as sketched after this list), providing:

- Visual evaluation management
- Result visualization
- Run comparisons
- Filtering and search
- Metric summaries
- Version control interface
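
ELL Studio reads from the same store used for versioning; a typical launch sketch, assuming the CLI accepts a ``--storage`` flag pointing at the store directory:

.. code-block:: bash

   ell-studio --storage ./logdir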

Best Practices
--------------

1. **Metric Design**

   - Keep metrics focused and specific
   - Use appropriate return types
   - Handle edge cases (see the sketch after this list)

2. **Dataset Management**

   - Use representative data
   - Include edge cases
   - Maintain dataset versioning

3. **Performance Optimization**

   - Use appropriate worker counts
   - Monitor resource usage
   - Cache results when possible

4. **Version Control**

   - Use meaningful commit messages
   - Track major changes
   - Maintain evaluation history
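
As an example of the edge-case handling called out under Metric Design, a defensive metric sketch that scores missing or empty fields as zero instead of raising:

.. code-block:: python

   def safe_accuracy_metric(datapoint, output):
       # Score 0.0 rather than raising when fields are missing or empty
       expected = (datapoint or {}).get("expected_output")
       if not expected or not output:
           return 0.0
       return float(expected.lower() in str(output).lower())
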
8 changes: 4 additions & 4 deletions examples/evals/summaries.py
@@ -9,7 +9,7 @@
import ell.lmp.function


dataset: List[ell.evaluation.Datapoint] = [
dataset = [
{
"input": { # I really don't like this. Forcing "input" without typing feels disgusting.
"text": "The Industrial Revolution was a period of major industrialization and innovation that took place during the late 1700s and early 1800s. It began in Great Britain and quickly spread throughout Western Europe and North America. This revolution saw a shift from an economy based on agriculture and handicrafts to one dominated by industry and machine manufacturing. Key technological advancements included the steam engine, which revolutionized transportation and manufacturing processes. The textile industry, in particular, saw significant changes with the invention of spinning jennies, water frames, and power looms. These innovations led to increased productivity and the rise of factories. The Industrial Revolution also brought about significant social changes, including urbanization, as people moved from rural areas to cities for factory work. While it led to economic growth and improved living standards for some, it also resulted in poor working conditions, child labor, and environmental pollution. The effects of this period continue to shape our modern world."
@@ -126,7 +126,7 @@ def length_criterion(_, output):
eval_list = ell.evaluation.Evaluation(
name="test_list",
dataset=dataset,
criteria=[score_criterion, length_criterion],
metrics=[score_criterion, length_criterion],
)

# Example using a dictionary of criteria (as before)
@@ -139,8 +139,8 @@ def length_criterion(_, output):
# Run evaluation with list-based criteria
print("EVAL WITH GPT-4o (list-based criteria)")
results = eval_list.run(summarizer, n_workers=4, verbose=False).results
print("Mean critic score:", results.metrics["score"].mean())
print("Mean length of completions:", results.metrics["length"].mean())
print("Mean critic score:", results.metrics["score_criterion"].mean())
print("Mean length of completions:", results.metrics["length_criterion"].mean())

# Run evaluation with dict-based criteria
print("EVAL WITH GPT-4o (dict-based criteria)")
2 changes: 1 addition & 1 deletion examples/rag/rag.py
@@ -2,7 +2,7 @@
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
from ell import ell
import ell


class VectorStore:
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -1,6 +1,6 @@
[tool.poetry]
name = "ell-ai"
version = "0.0.14"
version = "0.0.15"
description = "ell - the language model programming library"
authors = ["William Guss <[email protected]>"]
license = "MIT"
9 changes: 8 additions & 1 deletion src/ell/evaluation/evaluation.py
@@ -83,12 +83,19 @@ def wrap_callable(value):
)
for k, v in value.items()
}
elif isinstance(value, list):
return [
function(type=LMPType.LABELER)(v)
if callable(v) and not hasattr(v, "__ell_track__")
else v
for v in value
]
elif callable(value) and not hasattr(value, "__ell_track__"):
return function()(value)
elif value is None:
return value
else:
raise ValueError(f"Expected dict, callable, or None, got {type(value)}")
raise ValueError(f"Expected dict, list, callable, or None, got {type(value)}")

# Validate dataset/n_evals
if self.dataset is None and self.n_evals is None:
51 changes: 24 additions & 27 deletions src/ell/stores/sql.py
@@ -117,7 +117,7 @@ def write_invocation(

def write_evaluation(self, evaluation: SerializedEvaluation) -> str:
with Session(self.engine) as session:
try:
with session.no_autoflush: # Prevent autoflush while we query
# Check if the evaluation already exists
existing_evaluation = session.exec(
select(SerializedEvaluation).where(
@@ -136,33 +136,30 @@ def write_evaluation(self, evaluation: SerializedEvaluation) -> str:
# Add the new evaluation
session.add(evaluation)

# Process labelers
for labeler in evaluation.labelers:
existing_labeler = session.exec(
select(EvaluationLabeler).where(
(EvaluationLabeler.evaluation_id == evaluation.id)
& (EvaluationLabeler.name == labeler.name)
)
).first()

if existing_labeler:
# Update existing labeler
existing_labeler.type = labeler.type
existing_labeler.labeling_lmp_id = labeler.labeling_lmp_id
existing_labeler.labeling_rubric = labeler.labeling_rubric
else:
# Add new labeler
labeler.evaluation_id = evaluation.id
session.add(labeler)
# Process labelers
for labeler in evaluation.labelers:
existing_labeler = session.exec(
select(EvaluationLabeler).where(
and_(
EvaluationLabeler.evaluation_id == evaluation.id,
EvaluationLabeler.name == labeler.name
)
)
).first()

if existing_labeler:
# Update existing labeler
existing_labeler.type = labeler.type
existing_labeler.labeling_lmp_id = labeler.labeling_lmp_id
existing_labeler.labeling_rubric = labeler.labeling_rubric
else:
# Add new labeler
labeler.evaluation_id = evaluation.id
session.add(labeler)

session.commit()
return evaluation.id

session.commit()
return evaluation.id
except IntegrityError as e:
session.rollback()
raise ValueError(f"Error writing evaluation: {str(e)}")
except Exception as e:
session.rollback()
raise e

def write_evaluation_run(self, evaluation_run: SerializedEvaluationRun) -> int:
with Session(self.engine) as session:
6 changes: 5 additions & 1 deletion tests/.exampleignore
@@ -13,4 +13,8 @@ azure_ex.py
openrouter_ex.py
vllm_ex.py
*_ex.py
bedrock_hello.py
bedrock_hello.py
hello_postgres.py
exa/exa.py
exa.py
wikipedia_mini_rag.py
