Genai user feedback evaluation #1322
I will share some thoughts on the challenges we (at Langtrace) faced while implementing this. For some context, user feedback evaluations are generally collected as a thumbs up or thumbs down on LLM generations (typically in a chatbot) to understand model performance, so this is a critical requirement for folks building with LLMs today. Challenges:
If you are curious to learn more about how we implemented this, see the link below:
| Attribute | Type | Description | Examples | [Requirement Level](https://opentelemetry.io/docs/specs/semconv/general/attribute-requirement-level/) | Stability |
|---|---|---|---|---|---|
| [`gen_ai.response.id`](/docs/attributes-registry/gen-ai.md) | string | The unique identifier for the completion. | `chatcmpl-123` | `Required` | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| [`gen_ai.system`](/docs/attributes-registry/gen-ai.md) | string | The Generative AI product as identified by the client or server instrumentation. [1] | `openai` | `Recommended` | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
I wonder why this would be needed specifically. More importantly, how can this be achieved if the evaluation happens asynchronously from the generation? I think it's better to use span links or something similar to connect this with the original generation span.
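To make the span-link idea concrete, here is a minimal sketch in plain Python. It is not the OpenTelemetry API: `SpanRef`, `EvaluationRecord`, `generation_spans`, and the `example.score` attribute key are all hypothetical names chosen for illustration. It assumes the application keeps a lookup from response id to the generation span's identifiers, so that feedback arriving later can carry a link back to that span.

```python
from dataclasses import dataclass, field
import secrets


@dataclass(frozen=True)
class SpanRef:
    """Identifiers of a finished span (hex widths follow W3C Trace Context)."""
    trace_id: str  # 32 hex characters
    span_id: str   # 16 hex characters


@dataclass
class EvaluationRecord:
    """Hypothetical record emitted when feedback arrives, possibly much later."""
    name: str
    attributes: dict[str, object]
    links: list[SpanRef] = field(default_factory=list)


# Hypothetical store filled at generation time: response id -> generation span.
generation_spans: dict[str, SpanRef] = {}


def record_generation(response_id: str) -> SpanRef:
    """Remember which span produced a given response."""
    ref = SpanRef(trace_id=secrets.token_hex(16), span_id=secrets.token_hex(8))
    generation_spans[response_id] = ref
    return ref


def record_feedback(response_id: str, score: float) -> EvaluationRecord:
    """Build the feedback record and link it to the generation span if known."""
    record = EvaluationRecord(
        name="gen_ai.evaluation.user_feedback",
        # "example.score" is a placeholder key, not a proposed attribute name.
        attributes={"gen_ai.response.id": response_id, "example.score": score},
    )
    ref = generation_spans.get(response_id)
    if ref is not None:
        record.links.append(ref)
    return record
```

The feedback record lives outside the original trace and only points back to it, so nothing needs to be written to the generation span after it has ended.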
This PR was marked stale due to lack of activity. It will be closed in 7 days.
The event name MUST be `gen_ai.evaluation.user_feedback`.

| Attribute | Type | Description | Examples | [Requirement Level](https://opentelemetry.io/docs/specs/semconv/general/attribute-requirement-level/) | Stability |
This should probably be in the common section, and we should talk about user_feedback as an example.
| Attribute | Type | Description | Examples | [Requirement Level](https://opentelemetry.io/docs/specs/semconv/general/attribute-requirement-level/) | Stability |
|---|---|---|---|---|---|
| [`gen_ai.response.id`](/docs/attributes-registry/gen-ai.md) | string | The unique identifier for the completion. | `chatcmpl-123` | `Required` | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
From my point of view, user feedback often relates to the overall output of an LLM application (which used multiple LLM completions to produce a final response to the user). `gen_ai.response.id` targets LLM completions specifically, which limits how user feedback can be used if the response id is required. I'd suggest allowing user feedback to be correlated with an id that can be set on any non-LLM-completion span, especially if this will define the schema for other evaluation metrics going forward.
We at Scorecard.io use OTEL for tracing but have our user feedback and model-graded (LLM as judge) evaluations stored separately and linked, so we would love a natural way for this data model to be standardized. Kartik makes great points (and we have similar requirements) about evaluations needing to be supported asynchronously from span generation time.
Yeah, that's the number 1 challenge. We are doing it exactly the same way at the moment. Evals are stored in a separate model and linked using the
+1, see my comment above. I think this also helps to correlate scores with non-LLM calls, which is useful.
We need a correlation (or several) that also works when span_id is not available. The trace context is not available in all situations where evaluation scores or feedback are captured. There could also be other correlations in a system (response_id, session_id, turn_id) that are meaningful to a particular application or toolset. Is there a straightforward way to offer more than one option in the conventions, such as response_id, span_id, turn_id, etc.? I'd think you want to
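One way to picture "more than one correlation option" is a feedback payload that accepts any set of correlation ids and only insists that at least one is present. The sketch below is an assumption for discussion, not proposed convention text: `UserFeedback`, `to_event`, and the `example.*` keys are hypothetical, and only the event name and `gen_ai.response.id` come from this PR.

```python
from dataclasses import dataclass, field

EVENT_NAME = "gen_ai.evaluation.user_feedback"


@dataclass
class UserFeedback:
    """Hypothetical feedback payload carrying arbitrary correlation ids."""
    score: float
    # Any mix of ids the application has, e.g. "gen_ai.response.id",
    # "example.span_id", "example.turn_id" (the example.* keys are placeholders).
    correlation_ids: dict[str, str] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Drop ids that were passed but are empty, then require at least one.
        self.correlation_ids = {k: v for k, v in self.correlation_ids.items() if v}
        if not self.correlation_ids:
            raise ValueError("at least one correlation id is required")

    def to_event(self) -> dict:
        """Flatten into an event-like dict: a name plus attributes."""
        return {
            "name": EVENT_NAME,
            "attributes": {"example.score": self.score, **self.correlation_ids},
        }
```

Usage, with feedback captured where no trace context is available:

```python
feedback = UserFeedback(score=1.0, correlation_ids={"gen_ai.response.id": "chatcmpl-123"})
event = feedback.to_event()
```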
<!-- markdownlint-capture -->
<!-- markdownlint-disable -->

The event name MUST be `gen_ai.evaluation.user_feedback`.
discussing at GenAI call:
- metrics for score are potentially more useful
gen_ai.evaluation.relevance
dimensions:
- evaluator method
- ...
| Attribute | Type | Description | Examples | [Requirement Level](https://opentelemetry.io/docs/specs/semconv/general/attribute-requirement-level/) | Stability |
|---|---|---|---|---|---|
| [`gen_ai.response.id`](/docs/attributes-registry/gen-ai.md) | string | The unique identifier for the completion. | `chatcmpl-123` | `Required` | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
- this should not be required and we should allow other (any) correlation ids.
- we should call out that all evaluations should allow adding arbitrary correlation ids
This PR was marked stale due to lack of activity. It will be closed in 7 days.
Closed as inactive. Feel free to reopen if this PR is still being worked on.
Hi all 👋 I’m a maintainer of Arize-Phoenix and OpenInference. Just want to share some experiences with evaluations as it pertains to instrumentation in case it is helpful. Both the Arize SaaS platform and Phoenix open-source application accept human evaluations submitted via the UI or programmatic evaluations (LLM or code). Since these evaluations come after the span has ended, OpenInference doesn’t attach evaluations to the OTel span under scrutiny itself. Rather, we maintain separate evaluation tables with foreign key relations back to the spans table (in the case of Phoenix, which uses a relational DB) or keep the evaluations as columns in the spans table (in the case of Arize, which uses an OLAP DB). LLM evaluations are traced in the same way as LLM calls in the application, and OpenInference doesn’t currently have semantic conventions specifically related to evaluations. We think of an evaluation as being comprised of:
This attempts to capture that some evaluations are categorical, some numeric, some both, and some accompanied by a human- or LLM-generated explanation. Given the generic nature of the evaluations we ingest, we don’t place max/min limitations on the score. Much of what we evaluate is not just LLM calls, but chains, retrievals made via RAG, entire traces or “sessions” (groups of traces corresponding to a back-and-forth conversation between a user and the application). So we allow evaluations to be attached not just to LLM spans, but to any span kind defined in the OpenInference spec (typically, we attach evaluations for traces to the root span). The evaluation interface is pretty consistent no matter what we’re evaluating.
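As a rough illustration of the "separate evaluation table with a foreign key back to spans" layout described above, here is a small SQLite sketch. It is not Phoenix's actual schema: the table and column names (`spans`, `evaluations`, `label`, `score`, `explanation`) are assumptions inferred from the description of categorical, numeric, and explained evaluations.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript(
    """
    CREATE TABLE spans (
        span_id TEXT PRIMARY KEY,
        name    TEXT NOT NULL
    );
    CREATE TABLE evaluations (
        id          INTEGER PRIMARY KEY,
        span_id     TEXT NOT NULL REFERENCES spans(span_id),
        name        TEXT NOT NULL,  -- e.g. 'user_feedback' or 'relevance'
        label       TEXT,           -- categorical result, optional
        score       REAL,           -- numeric result, optional, no min/max imposed
        explanation TEXT,           -- human- or LLM-written rationale, optional
        CHECK (label IS NOT NULL OR score IS NOT NULL)
    );
    """
)

# The span is written first; the evaluation can be added at any later time.
conn.execute("INSERT INTO spans VALUES (?, ?)", ("00f067aa0ba902b7", "chat"))
conn.execute(
    "INSERT INTO evaluations (span_id, name, label, score, explanation) "
    "VALUES (?, ?, ?, ?, ?)",
    ("00f067aa0ba902b7", "user_feedback", "thumbs_up", 1.0, None),
)

# Join evaluations back to the spans they describe.
rows = conn.execute(
    "SELECT s.name, e.name, e.label, e.score "
    "FROM evaluations AS e JOIN spans AS s ON s.span_id = e.span_id"
).fetchall()
```

Because the evaluation row only references the span by id, the same table can hold feedback for LLM spans, retrieval spans, or a root span standing in for a whole trace.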
Changes
Provides details for a user feedback event that can be used for evaluation purposes.
Note: if the PR is touching an area that is not listed in the existing areas, or the area does not have sufficient domain experts coverage, the PR might be tagged as experts needed and move slowly until experts are identified.
Merge requirement checklist
[chore]