
Genai user feedback evaluation #1322

Open
wants to merge 7 commits into base: main

Conversation

@truptiparkar7 commented Aug 6, 2024

Changes

This PR provides details for a user feedback event which can be used for evaluation purposes.

Note: if the PR is touching an area that is not listed in the existing areas, or the area does not have sufficient domain experts coverage, the PR might be tagged as experts needed and move slowly until experts are identified.

Merge requirement checklist

@truptiparkar7 truptiparkar7 requested review from a team August 6, 2024 18:50

linux-foundation-easycla bot commented Aug 6, 2024

CLA Not Signed

model/trace/gen-ai.yaml (outdated, resolved review thread)
@karthikscale3 (Contributor) commented Aug 14, 2024

I will share some thoughts on the challenges we (at Langtrace) faced while implementing this:

For some context, user feedback evaluations are generally collected as a thumbs up or thumbs down on LLM generations (typically in a chatbot) to understand model performance. So this is a critical requirement for folks building with LLMs today.

Challenges:

  1. Because the feedback can only be collected after the LLM generates the response, the span for the LLM generation has already been created. And today, as far as I can tell, there is no OTel-native way to attach an attribute to an already-created span (there is no API to do this).

  2. As a result, at Langtrace we decided to send the spanId of the LLM-generated response to the application layer through a higher-order function/decorator, which the application developer uses to capture user feedback scores. At the application layer, the developer has access to the spanId, which is then used to attach the user feedback score and other user metadata, such as a user ID that uniquely identifies the user who gave the feedback.

  3. Now, at this stage, you have two options: either generate a new span that's a child of this span (which is very tricky to establish) or store the evaluation against the spanId in a completely separate metadata store. We went with the latter approach for a few reasons:

  • Creating a new child span was very tricky to get working, especially for streaming responses or other implementations of the LLM SDK (like the Vercel AI SDK)
  • Attaching the feedback to the span by exposing a vendor-specific API on the database that stores the span was expensive and difficult to maintain (and, as a general rule of thumb, we weren't comfortable mutating trace data post-generation)
  • For conversations in a single session, this creates multiple feedback spans; when users change their feedback for the same generated response, we end up with more than one span linked to the same response ID or span ID, and it's impossible to know the actual feedback unless you sort the spans by creation time, which was not clean

If you are curious to learn more about how we implemented this, see the link below:

| Attribute | Type | Description | Examples | [Requirement Level](https://opentelemetry.io/docs/specs/semconv/general/attribute-requirement-level/) | Stability |
|---|---|---|---|---|---|
| [`gen_ai.response.id`](/docs/attributes-registry/gen-ai.md) | string | The unique identifier for the completion. | `chatcmpl-123` | `Required` | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| [`gen_ai.system`](/docs/attributes-registry/gen-ai.md) | string | The Generative AI product as identified by the client or server instrumentation. [1] | `openai` | `Recommended` | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
Contributor

I wonder why this would be needed specifically. And, more importantly - how can this be achieved if the evaluation happens asynchronously from the generation. I think it's better to use span links or something similar to connect this with the original generation span.
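The span-link suggestion above can be sketched as follows: record the feedback on a new span that links back to the original generation span instead of mutating it. Plain dicts stand in for the OTel span/link API here, and the `gen_ai.evaluation.score` attribute name is an assumption for illustration.

```python
def make_feedback_span(orig_trace_id, orig_span_id, score):
    # A feedback span carries a link (trace_id + span_id) back to the
    # generation span, so it can be created long after that span ended.
    return {
        "name": "gen_ai.evaluation.user_feedback",
        "links": [{"trace_id": orig_trace_id, "span_id": orig_span_id}],
        "attributes": {"gen_ai.evaluation.score": score},
    }

# The evaluation can happen asynchronously: only the ids of the original
# span need to survive until the feedback arrives.
feedback = make_feedback_span("trace-abc", "span-def", 1)
```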


This PR was marked stale due to lack of activity. It will be closed in 7 days.

@github-actions github-actions bot added Stale and removed Stale labels Sep 13, 2024
@truptiparkar7 truptiparkar7 requested review from a team as code owners September 19, 2024 21:10

The event name MUST be `gen_ai.evaluation.user_feedback`.

| Attribute | Type | Description | Examples | [Requirement Level](https://opentelemetry.io/docs/specs/semconv/general/attribute-requirement-level/) | Stability |
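A hypothetical body for such an event might look like the sketch below. The event name comes from the line above and `gen_ai.response.id` matches the attribute tables in this PR; the score attribute and its +1/-1 thumbs encoding are assumptions for illustration.

```python
def user_feedback_event(response_id, score):
    # Validate the assumed thumbs encoding: +1 = thumbs up, -1 = thumbs down.
    if score not in (-1, 1):
        raise ValueError("score must be -1 or 1")
    return {
        "name": "gen_ai.evaluation.user_feedback",
        "attributes": {
            "gen_ai.response.id": response_id,       # from the table above
            "gen_ai.evaluation.score": score,        # assumed attribute name
        },
    }

event = user_feedback_event("chatcmpl-123", 1)
```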
Contributor

this probably should be in the common section and we should talk about user_feedback as an example.


| Attribute | Type | Description | Examples | [Requirement Level](https://opentelemetry.io/docs/specs/semconv/general/attribute-requirement-level/) | Stability |
|---|---|---|---|---|---|
| [`gen_ai.response.id`](/docs/attributes-registry/gen-ai.md) | string | The unique identifier for the completion. | `chatcmpl-123` | `Required` | ![Experimental](https://img.shields.io/badge/-experimental-blue) |


From my point of view, user feedback often relates to the overall output of an LLM application (which used multiple LLM completions to produce a final response to the user). gen_ai.response.id targets LLM completions specifically, which limits how user feedback can be used if the response id is required. I'd suggest allowing user feedback to be correlated with an id that can be set on any non-LLM-completion span, especially if this will define the schema for other evaluation metrics going forward.

@Rutledge commented Oct 6, 2024

We at Scorecard.io use OTel for tracing but store our user feedback and model-graded (LLM-as-judge) evaluations separately and link them, so we'd love a natural way for this data model to be standardized.

Karthik makes great points (and we have similar requirements) about evaluations needing to be supported asynchronously from span generation time.

@karthikscale3 (Contributor)

> We at Scorecard.io use OTEL for tracing but have our user feedback and model graded (LLM as judge) evaluations stored separately and linked so would love if there is a natural way for this data model to be standardized.
>
> Kartik makes great points (and we have similar requirements) of evaluations needing to be supported asynchronously from span generation time.

Yeah, that's the number 1 challenge. We are doing it exactly the same way at the moment. Evals are stored in a separate model and linked using the span_id as a foreign key referencing the original trace.
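The separate-store-plus-foreign-key pattern might look like this sketch (schema and column names are assumptions, shown with SQLite for brevity). Upserting on span_id keeps only the latest feedback per span, which also addresses the changed-feedback problem raised earlier in the thread.

```python
import sqlite3

# Evaluations live in their own table, keyed by the span id of the
# original LLM generation, so trace data is never mutated.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE evaluations (
        span_id TEXT PRIMARY KEY,   -- references the original LLM span
        score   INTEGER NOT NULL,   -- e.g. +1 / -1 for thumbs up / down
        user_id TEXT
    )
""")

def upsert_feedback(span_id, score, user_id):
    # ON CONFLICT keeps only the latest feedback for a span.
    conn.execute(
        "INSERT INTO evaluations VALUES (?, ?, ?) "
        "ON CONFLICT(span_id) DO UPDATE SET score = excluded.score",
        (span_id, score, user_id),
    )

upsert_feedback("span-abc", 1, "user-42")
upsert_feedback("span-abc", -1, "user-42")  # user changed their mind
latest = conn.execute(
    "SELECT score FROM evaluations WHERE span_id = ?", ("span-abc",)
).fetchone()[0]
```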

@marcklingen

> > We at Scorecard.io use OTEL for tracing but have our user feedback and model graded (LLM as judge) evaluations stored separately and linked so would love if there is a natural way for this data model to be standardized.
> > Kartik makes great points (and we have similar requirements) of evaluations needing to be supported asynchronously from span generation time.
>
> Yeah, thats the number 1 challenge. We are doing it exactly the same way at the moment. Evals are stored in a separate model and linked using the span_id as foreign key for referencing the original trace.

+1, see my comment above. I think this also helps to correlate scores with non-LLM calls, which is useful.

@drewby (Member) commented Oct 10, 2024

> > > We at Scorecard.io use OTEL for tracing but have our user feedback and model graded (LLM as judge) evaluations stored separately and linked so would love if there is a natural way for this data model to be standardized.
> > > Kartik makes great points (and we have similar requirements) of evaluations needing to be supported asynchronously from span generation time.
> >
> > Yeah, thats the number 1 challenge. We are doing it exactly the same way at the moment. Evals are stored in a separate model and linked using the span_id as foreign key for referencing the original trace.
>
> +1, see my comment above. I think this also helps to correlate scores with non-llm calls which is useful

We need a correlation (or correlations) that also works when span_id is not available. The trace context is not available in all situations where evaluation scores or feedback are captured. There could also be other correlations in a system, such as response_id, session_id, or turn_id, that are meaningful to a particular application or toolset.

Is there a straightforward way to offer more than one option in the conventions? response_id, span_id, turn_id, etc. I'd think you want to require that at least one be present.
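The "at least one correlation id" requirement could be enforced like this sketch; the set of accepted attribute names is purely illustrative, not from the spec.

```python
# Any one of these (assumed) attribute names satisfies the requirement.
CORRELATION_KEYS = {
    "gen_ai.response.id",
    "span_id",
    "session.id",
    "turn_id",
}

def validate_feedback_event(attributes):
    # Set intersection with the dict's keys: empty means no correlation id.
    if not CORRELATION_KEYS & attributes.keys():
        raise ValueError("feedback event needs at least one correlation id")
    return attributes

ok = validate_feedback_event({"gen_ai.response.id": "chatcmpl-123", "score": 1})
```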

<!-- markdownlint-capture -->
<!-- markdownlint-disable -->

The event name MUST be `gen_ai.evaluation.user_feedback`.
Contributor

Discussed at the GenAI call:

  • metrics for score are potentially more useful

Contributor

gen_ai.evaluation.relevance
dimensions:

  • evaluator method
  • ...


| Attribute | Type | Description | Examples | [Requirement Level](https://opentelemetry.io/docs/specs/semconv/general/attribute-requirement-level/) | Stability |
|---|---|---|---|---|---|
| [`gen_ai.response.id`](/docs/attributes-registry/gen-ai.md) | string | The unique identifier for the completion. | `chatcmpl-123` | `Required` | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
@lmolkova (Contributor) commented Oct 17, 2024

  • this should not be required and we should allow other (any) correlation ids.
  • we should call out that all evaluations should allow adding arbitrary correlation ids


github-actions bot commented Nov 2, 2024

This PR was marked stale due to lack of activity. It will be closed in 7 days.

@github-actions github-actions bot added the Stale label Nov 2, 2024

Closed as inactive. Feel free to reopen if this PR is still being worked on.

@github-actions github-actions bot closed this Nov 10, 2024
@lmolkova lmolkova reopened this Nov 10, 2024
@github-actions github-actions bot removed the Stale label Nov 11, 2024
@axiomofjoy

Hi all 👋 I’m a maintainer of Arize-Phoenix and OpenInference. Just want to share some experiences with evaluations as it pertains to instrumentation in case it is helpful.

Both the Arize SaaS platform and Phoenix open-source application accept human evaluations submitted via the UI or programmatic evaluations (LLM or code). Since these evaluations come after the span has ended, OpenInference doesn’t attach evaluations to the OTel span under scrutiny itself. Rather, we maintain separate evaluation tables with foreign key relations back to the spans table (in the case of Phoenix, which uses a relational DB) or keep the evaluations as columns in the spans table (in the case of Arize, which uses an OLAP DB). LLM evaluations are traced in the same way as LLM calls in the application, and OpenInference doesn’t currently have semantic conventions specifically related to evaluations.

We think of an evaluation as consisting of:

  • name (required)
  • label (optional)
  • score (optional)
  • explanation (optional)

This attempts to capture that some evaluations are categorical, some numeric, some both, and some accompanied by a human- or LLM-generated explanation. Given the generic nature of the evaluations we ingest, we don't place max/min limits on the score.
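The shape described above can be sketched as a small data type; the field names mirror the list in this comment, while the class itself and the example values are illustrative.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Evaluation:
    name: str                          # required, e.g. "relevance"
    label: Optional[str] = None        # categorical result, e.g. "relevant"
    score: Optional[float] = None      # numeric result, deliberately unbounded
    explanation: Optional[str] = None  # human- or LLM-generated rationale

# A categorical-and-numeric evaluation with no explanation attached.
ev = Evaluation(name="relevance", label="relevant", score=0.92)
```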

Much of what we evaluate is not just LLM calls, but chains, retrievals made via RAG, entire traces or “sessions” (groups of traces corresponding to a back-and-forth conversation between a user and the application). So we allow evaluations to be attached not just to LLM spans, but to any span kind defined in the OpenInference spec (typically, we attach evaluations for traces to the root span). The evaluation interface is pretty consistent no matter what we’re evaluating.
