Evals #269

Merged Nov 19, 2024 (109 commits)

Changes from 1 commit

Commits (109)
f471ca2
more metrics rambling
MadcowD Sep 29, 2024
4ab83de
basic evaluation class
MadcowD Sep 29, 2024
9cf8635
todos
MadcowD Sep 30, 2024
a6100a7
fix bedrock
MadcowD Sep 30, 2024
f622740
more ramblings
MadcowD Oct 1, 2024
3a699c2
switch to logging
MadcowD Oct 4, 2024
b21afd9
more ramblings
MadcowD Oct 4, 2024
1a6ef0d
thoughts
MadcowD Oct 4, 2024
36e81fb
working evals
MadcowD Oct 4, 2024
1048e90
eval update
MadcowD Oct 4, 2024
77cb872
Delete docs/ramblings/test.py
MadcowD Oct 4, 2024
cd64ab9
Delete docs/ramblings/typeddictpartial.py
MadcowD Oct 4, 2024
6afad20
eval update
MadcowD Oct 4, 2024
3e2a60a
cleaner specification of criteria
MadcowD Oct 5, 2024
17dc98e
more spec discussions around chat and yielding
MadcowD Oct 5, 2024
936b5e1
.
MadcowD Oct 5, 2024
d6d3c05
.
MadcowD Oct 7, 2024
866607b
interesting human eval spec
MadcowD Oct 8, 2024
49f550f
refactor types locations
MadcowD Oct 9, 2024
e303f75
new back relationships for evaluation run summaries
MadcowD Oct 9, 2024
fb86f48
some more interesting thoughts on human feedback and data collection
MadcowD Oct 9, 2024
a6cd5cf
adding idea of an invocation group but not fully implemented
MadcowD Oct 9, 2024
ed545a1
working evals with road to summarization
MadcowD Oct 9, 2024
db22e99
.
MadcowD Oct 9, 2024
ffc770d
example of a discriminator
MadcowD Oct 9, 2024
d93d735
additional update
MadcowD Oct 9, 2024
a973158
solve the no dataset problem
MadcowD Oct 10, 2024
e5df68c
working writing to the db
MadcowD Oct 10, 2024
25443a9
base models
MadcowD Oct 10, 2024
a992f35
data model stubs
MadcowD Oct 10, 2024
a5c60b5
working serverside api
MadcowD Oct 10, 2024
8c3d1a3
eval cards
MadcowD Oct 10, 2024
f1fb2ac
.
MadcowD Oct 10, 2024
26afc57
working eval cards
MadcowD Oct 11, 2024
5697612
shrinking version badges?
MadcowD Oct 11, 2024
0a087b7
horizontal layout
MadcowD Oct 11, 2024
e5c4476
cleaner layout
MadcowD Oct 11, 2024
53fe114
trendlines
MadcowD Oct 11, 2024
7577358
clean metrics
MadcowD Oct 11, 2024
6001206
latest run summary and invocation plots!
MadcowD Oct 11, 2024
d315026
basic eval page
MadcowD Oct 11, 2024
99d689c
generalize version history
MadcowD Oct 12, 2024
1cd52fa
idea for the metrics page.
MadcowD Oct 12, 2024
8538de3
more goofing
MadcowD Oct 12, 2024
4a01ff9
unified icon
MadcowD Oct 12, 2024
9d8b194
refactored computation graph a bit more
MadcowD Oct 12, 2024
fe6aa91
better layouting algorithm
MadcowD Oct 12, 2024
fcc8ed2
metric displays
MadcowD Oct 12, 2024
aee962e
metrics
MadcowD Oct 12, 2024
e66ebf6
beautiful
MadcowD Oct 12, 2024
6fbc927
.
MadcowD Oct 12, 2024
26abbf6
added error bars
MadcowD Oct 13, 2024
c30410e
Merge branch 'main' into wguss/metrics
MadcowD Oct 13, 2024
2270b98
.
MadcowD Oct 13, 2024
49379c4
datanow comes in real time
MadcowD Oct 13, 2024
8f4628b
enable parallel writes
MadcowD Oct 13, 2024
29a2e05
fix writing to wrong graph
MadcowD Oct 13, 2024
6d84d6a
.
MadcowD Oct 14, 2024
7dcdd64
.
MadcowD Oct 14, 2024
d7ea716
.
MadcowD Oct 24, 2024
a606d1c
refactor evaluation location
MadcowD Oct 24, 2024
cf0cda2
evaluations
MadcowD Oct 24, 2024
2b633fa
a bit of cleanup & refactor
MadcowD Oct 24, 2024
2e3a687
local evaluation utils
MadcowD Oct 24, 2024
1d67ef6
local util refactor pt 2
MadcowD Oct 24, 2024
5240c5b
fix bug where invocation id interface didn't work when the store wasn'…
MadcowD Oct 24, 2024
245bfa5
new tests
MadcowD Oct 24, 2024
a4d8d0f
more tests
MadcowD Oct 24, 2024
cd3b9e7
additional refactor todo
MadcowD Oct 24, 2024
5cf03bf
refactor for serialization module
MadcowD Oct 24, 2024
c8bc762
a bit cleaner version of the evaluator now
MadcowD Oct 24, 2024
e4e8bb3
push
MadcowD Oct 28, 2024
702f24f
streaming evaluations.
MadcowD Oct 29, 2024
fb42417
in progress display
MadcowD Oct 29, 2024
6514b69
hierarchical cluster sorting for parallel execution.
MadcowD Oct 29, 2024
38a5c59
partial labelers
MadcowD Oct 29, 2024
c190b60
.
MadcowD Oct 29, 2024
5d64765
convert to flat representation
MadcowD Oct 29, 2024
76fe640
flat
MadcowD Oct 29, 2024
cbfea1a
.
MadcowD Oct 31, 2024
e0de211
new individual evaluation page
MadcowD Oct 31, 2024
89c4a92
label display
MadcowD Nov 1, 2024
1f7bcff
is constant subtle bras
MadcowD Nov 1, 2024
033a542
fixing sidebar bug
MadcowD Nov 1, 2024
b69a93e
expandable groups
MadcowD Nov 1, 2024
8e648ef
fix bool bug
MadcowD Nov 2, 2024
85b7f73
histogram view
MadcowD Nov 2, 2024
ab86f21
adding histogram and better sorting
MadcowD Nov 2, 2024
e7929fd
working search and filters for the evals
MadcowD Nov 2, 2024
8775ce3
metrics show up now in the sidebar
MadcowD Nov 2, 2024
362565d
fix null renderer bug
MadcowD Nov 2, 2024
1f62914
some clean asserts
MadcowD Nov 2, 2024
bc8097e
passing tests
MadcowD Nov 2, 2024
b748679
evals
MadcowD Nov 2, 2024
e98ede6
slightly improved tables
MadcowD Nov 13, 2024
7de875c
all versions of evaluations
MadcowD Nov 13, 2024
d8caaff
evals folder
MadcowD Nov 13, 2024
09a6dea
dataset hash -> dataset id
MadcowD Nov 13, 2024
6c41f19
dataset storage
MadcowD Nov 13, 2024
1836c2b
dataset view ish
MadcowD Nov 13, 2024
b6308aa
dataset page
MadcowD Nov 13, 2024
08bd13f
merging new main:
MadcowD Nov 19, 2024
1febcda
migration for existing ell studio databases
MadcowD Nov 19, 2024
6e866d7
fixed migrations and module separation
MadcowD Nov 19, 2024
7f1f162
preferring main
MadcowD Nov 19, 2024
c1e4d04
cleanup evals
MadcowD Nov 19, 2024
b8778c5
python 3.9 fix
MadcowD Nov 19, 2024
a9f1f49
clean up eval list
MadcowD Nov 19, 2024
18c26eb
3.9 lru cache fix
MadcowD Nov 19, 2024
partial labelers
MadcowD committed Oct 29, 2024
commit 38a5c5996d1fadedf0645b9c2b76162bc653b212
src/ell/evaluation/evaluation.py (3 changes: 1 addition & 2 deletions)
@@ -28,7 +28,7 @@
 from ell.evaluation.results import *
 import dill

-class EvaluationRun(BaseModel):
+class EvaluationRun(BaseModel):
     model_config = ConfigDict(arbitrary_types_allowed=True)
     results: EvaluationResults = Field(default_factory=EvaluationResults)
     dataset: Optional[Dataset] = Field(default=None)
@@ -80,7 +80,6 @@ class Evaluation(BaseModel):
     id: Optional[str] = Field(default=None)


-    # XXX: Dones't support partial params outside of the dataset like client??
     def run(
         self,
         lmp,
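For context, a minimal, self-contained sketch of the Pydantic pattern EvaluationRun relies on above: ConfigDict(arbitrary_types_allowed=True) is what lets a field hold a type Pydantic cannot validate on its own. The Dataset class below is a hypothetical placeholder, not ell's actual type.

from typing import Optional
from pydantic import BaseModel, ConfigDict, Field

class Dataset:  # hypothetical placeholder for an arbitrary, non-Pydantic type
    def __init__(self, rows):
        self.rows = rows

class EvaluationRunSketch(BaseModel):
    # Without arbitrary_types_allowed=True, Pydantic would reject the Dataset annotation.
    model_config = ConfigDict(arbitrary_types_allowed=True)
    dataset: Optional[Dataset] = Field(default=None)
    n_evals: Optional[int] = Field(default=None)

run = EvaluationRunSketch(dataset=Dataset(rows=[{"input": "hi"}]), n_evals=1)
print(run.n_evals, len(run.dataset.rows))  # 1 1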
src/ell/evaluation/serialization.py (59 changes: 53 additions & 6 deletions)
@@ -8,6 +8,7 @@
 from ell.evaluation.util import needs_store
 from ell.lmp._track import serialize_lmp
 from ell.store import Store
+from ell.util._warnings import _autocommit_warning
 from ell.util.closure_util import ido
 from ell.util.closure_util import hsh
 import ell.util.closure
@@ -43,23 +44,28 @@ def write_evaluation(evaluation) -> None:
         evaluation.has_serialized = True
     else:
         # TODO: Merge with other versioning code.
-        version_number = (
+        version_number, latest_version = (
             max(
                 itertools.chain(
-                    map(lambda x: x.version_number, existing_versions), [-1]
-                )
+                    map(lambda x: (x.version_number, x), existing_versions),
+                    [(-1, None)]
+                ),
+                key=lambda x: x[0]
+            )
-            ) +1
         )
+        version_number += 1
+        commit_message = None
         if config.autocommit:
-            # TODO: Implement
-            pass
+            commit_message = generate_commit_message(evaluation, metrics_ids, annotation_ids, criteiron_ids, latest_version)


         # Create SerializedEvaluation
         serialized_evaluation = SerializedEvaluation(
             id=evaluation.id,
             name=evaluation.name,
             dataset_hash=dataset_hash,
             n_evals=evaluation.n_evals or len(evaluation.dataset or []),
+            commit_message=commit_message,
             version_number=version_number,
         )

@@ -86,6 +92,47 @@ def create_labelers(names, ids, labeler_type):
     evaluation.has_serialized = True
     cast(Store, config.store).write_evaluation(serialized_evaluation)

+def generate_commit_message(evaluation, metrics_ids, annotation_ids, criteiron_ids, latest_version):
+    # XXX
+    if not _autocommit_warning():
+        from ell.util.differ import write_commit_message_for_diff
+        # Get source code for all metrics, annotations and criterion.
+        # In this case we don't actually want to automatically generate a commit message
+        # using gpt-4o; we can just detect a change in the labelers and use that as the
+        # primary mechanism.
+        # Get labelers from the latest version if it exists.
+        if latest_version:
+            latest_labelers = latest_version.labelers
+
+            # Group labelers by type
+            latest_metrics = {l.name: l.labeling_lmp_id for l in latest_labelers if l.type == EvaluationLabelerType.METRIC}
+            latest_annotations = {l.name: l.labeling_lmp_id for l in latest_labelers if l.type == EvaluationLabelerType.ANNOTATION}
+            latest_criterion = next((l.labeling_lmp_id for l in latest_labelers if l.type == EvaluationLabelerType.CRITERION), None)
+
+            # Compare with current labelers
+            metrics_changed = {name: id for name, id in zip(evaluation.metrics.keys(), metrics_ids)
+                               if name not in latest_metrics or latest_metrics[name] != id}
+            annotations_changed = {name: id for name, id in zip(evaluation.annotations.keys(), annotation_ids)
+                                   if name not in latest_annotations or latest_annotations[name] != id}
+            criterion_changed = criteiron_ids[0] if criteiron_ids and (not latest_criterion or latest_criterion != criteiron_ids[0]) else None
+
+            # Generate commit message if there are changes
+            if metrics_changed or annotations_changed or criterion_changed:
+                changes = []
+                summary_parts = []
+                if metrics_changed:
+                    changes.append(f"Changed metrics: {', '.join(metrics_changed.keys())}")
+                    summary_parts.append(f"{len(metrics_changed)} metrics")
+                if annotations_changed:
+                    changes.append(f"Changed annotations: {', '.join(annotations_changed.keys())}")
+                    summary_parts.append(f"{len(annotations_changed)} annotations")
+                if criterion_changed:
+                    changes.append("Changed criterion")
+                    summary_parts.append("criterion")
+
+                summary = f"Updated {', '.join(summary_parts)}"
+                details = " | ".join(changes)
+                commit_message = f"{summary}\n\n{details}"
+                return commit_message
+

 @needs_store
 def write_evaluation_run_start(evaluation, evaluation_run) -> int:
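The generate_commit_message addition above avoids an LLM call by diffing labeler ids against the previous version. A reduced sketch of that change-detection idea, with illustrative names (summarize_labeler_changes, prev_metrics, curr_metrics, and so on) that are not part of ell's API:

from typing import Dict, Optional

def summarize_labeler_changes(
    prev_metrics: Dict[str, str],
    curr_metrics: Dict[str, str],
    prev_annotations: Dict[str, str],
    curr_annotations: Dict[str, str],
) -> Optional[str]:
    # A labeler counts as changed when it is new or its labeling-LMP id differs
    # from the previous version.
    changed_metrics = [n for n, i in curr_metrics.items() if prev_metrics.get(n) != i]
    changed_annotations = [n for n, i in curr_annotations.items() if prev_annotations.get(n) != i]
    if not (changed_metrics or changed_annotations):
        return None  # nothing changed, so no commit message
    summary_parts, details = [], []
    if changed_metrics:
        summary_parts.append(f"{len(changed_metrics)} metrics")
        details.append(f"Changed metrics: {', '.join(changed_metrics)}")
    if changed_annotations:
        summary_parts.append(f"{len(changed_annotations)} annotations")
        details.append(f"Changed annotations: {', '.join(changed_annotations)}")
    return f"Updated {', '.join(summary_parts)}\n\n{' | '.join(details)}"

# A metric whose labeling LMP was re-versioned produces a two-part message.
print(summarize_labeler_changes({"accuracy": "lmp-v1"}, {"accuracy": "lmp-v2"}, {}, {}))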