
feat: Instrument latency without streaming duration #290

Merged

trallnag merged 10 commits into trallnag:master from dosuken123:track-response-start-duration on Mar 11, 2024

Conversation

dosuken123
Contributor

@dosuken123 dosuken123 commented Mar 4, 2024

What does this do?


This PR adds an option to track HTTP response duration without streaming duration.

Config Example:

    from prometheus_fastapi_instrumentator import Instrumentator, metrics

    instrumentator = Instrumentator()
    instrumentator.add(
        metrics.latency(
            should_include_handler=True,
            should_include_method=True,
            should_include_status=True,
            buckets=(0.5, 1, 2.5, 5, 10, 30, 60),
        ),
        metrics.latency(
            metric_name="http_request_duration_without_streaming_seconds",
            should_include_handler=True,
            should_include_method=True,
            should_include_status=True,
            buckets=(0.5, 1, 2.5, 5, 10, 30, 60),
            should_exclude_streaming_duration=True,  # <= New option
        ),
    )

https://gitlab.com/gitlab-org/modelops/applied-ml/code-suggestions/ai-assist/-/blob/main/ai_gateway/app.py?ref_type=heads#L51-58

Output example:

# HELP http_request_response_start_duration_seconds Duration of HTTP requests in seconds
# TYPE http_request_response_start_duration_seconds histogram
http_request_response_start_duration_seconds_bucket{handler="/v2/code/generations",le="0.5",method="POST",status="2xx"} 0.0
http_request_response_start_duration_seconds_bucket{handler="/v2/code/generations",le="1.0",method="POST",status="2xx"} 1.0
http_request_response_start_duration_seconds_bucket{handler="/v2/code/generations",le="2.5",method="POST",status="2xx"} 1.0
http_request_response_start_duration_seconds_bucket{handler="/v2/code/generations",le="5.0",method="POST",status="2xx"} 1.0
http_request_response_start_duration_seconds_bucket{handler="/v2/code/generations",le="10.0",method="POST",status="2xx"} 1.0
http_request_response_start_duration_seconds_bucket{handler="/v2/code/generations",le="30.0",method="POST",status="2xx"} 1.0
http_request_response_start_duration_seconds_bucket{handler="/v2/code/generations",le="60.0",method="POST",status="2xx"} 1.0
http_request_response_start_duration_seconds_bucket{handler="/v2/code/generations",le="+Inf",method="POST",status="2xx"} 1.0
http_request_response_start_duration_seconds_count{handler="/v2/code/generations",method="POST",status="2xx"} 1.0
http_request_response_start_duration_seconds_sum{handler="/v2/code/generations",method="POST",status="2xx"} 0.6706487989995367
# HELP http_request_response_start_duration_seconds_created Duration of HTTP requests in seconds
# TYPE http_request_response_start_duration_seconds_created gauge
http_request_response_start_duration_seconds_created{handler="/v2/code/generations",method="POST",status="2xx"} 1.7095186511967359e+09
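To read the bucket lines above: a Prometheus histogram observation increments every bucket whose upper bound `le` is at least the observed value, so the single ~0.67 s observation lands in the `1.0` bucket and every wider one, but not in `0.5`. A stdlib-only sketch of that cumulative-bucket logic (the function name and defaults are illustrative, not part of the library):

```python
def histogram_buckets(observation, bounds=(0.5, 1, 2.5, 5, 10, 30, 60)):
    """Return cumulative bucket counts for a single observation,
    mirroring Prometheus histogram semantics (plus the +Inf bucket)."""
    counts = {}
    for le in bounds:
        # Buckets are cumulative: each counts observations <= its bound.
        counts[str(float(le))] = 1.0 if observation <= le else 0.0
    counts["+Inf"] = 1.0  # every observation falls in the +Inf bucket
    return counts

buckets = histogram_buckets(0.6706487989995367)
# Matches the exposition output above: the 0.5 bucket misses the
# observation, 1.0 and every larger bucket catch it.
assert buckets["0.5"] == 0.0 and buckets["1.0"] == 1.0
```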

Fixes #291

Why do we need it?

Because LLM inference APIs usually support HTTP streaming to improve the UX, users perceive latency as the arrival of the first chunk rather than the last. We want to instrument that duration.
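The difference between the two durations can be sketched with a toy streaming consumer (all names and delays below are hypothetical, purely to illustrate what the new option measures versus the default):

```python
import time

def stream_chunks(n_chunks=3, inference_delay=0.05, chunk_delay=0.02):
    """Toy LLM streaming response: one up-front inference delay,
    then chunks trickling out over the wire."""
    time.sleep(inference_delay)  # time until the first token is ready
    for i in range(n_chunks):
        yield f"chunk-{i}"
        time.sleep(chunk_delay)  # streaming time per chunk

start = time.perf_counter()
time_to_first_chunk = None
for chunk in stream_chunks():
    if time_to_first_chunk is None:
        # roughly what should_exclude_streaming_duration=True observes
        time_to_first_chunk = time.perf_counter() - start
# roughly what the default latency metric observes
total_duration = time.perf_counter() - start

assert time_to_first_chunk < total_duration
```

For a streaming endpoint the gap between the two values grows with response length, which is why tracking both histograms side by side (as in the config example above) is useful.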

Who is this for?

GitLab, software developers, LLM app optimizations

Linked issues

Related to https://gitlab.com/gitlab-com/runbooks/-/merge_requests/6928#note_1796949998

Reviewer notes


This commit adds a feature to track the latency excluding
streaming duration.
@dosuken123 dosuken123 force-pushed the track-response-start-duration branch from ff9169e to 019c8a5 Compare March 4, 2024 02:27
@dosuken123 dosuken123 changed the title Track response start duration Instrument latency without streaming duration Mar 4, 2024
@trallnag trallnag changed the title Instrument latency without streaming duration feat: Instrument latency without streaming duration Mar 11, 2024

codecov bot commented Mar 11, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 95.98%. Comparing base (c608c4e) to head (3ae762e).

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #290      +/-   ##
==========================================
+ Coverage   95.79%   95.98%   +0.19%     
==========================================
  Files           5        5              
  Lines         357      374      +17     
==========================================
+ Hits          342      359      +17     
  Misses         15       15              


@trallnag
Owner

Hi @dosuken123, thanks for the proposal and implementation. It will be included in the next version, which will be released sometime this week.

@trallnag trallnag merged commit 4530ba4 into trallnag:master Mar 11, 2024
6 checks passed
@dosuken123
Contributor Author

@trallnag Thanks for help! Much appreciated 🙇
