Why do we need it?
Because LLM inference APIs usually support HTTP streaming to improve the UX, users perceive latency as the arrival of the first chunk rather than the last. We want to instrument this duration.
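For example, from the client's point of view the perceived wait ends at the first streamed chunk, not at the end of the body. A quick sketch of that client-side view (the endpoint, port, and payload below are hypothetical):

```python
import time

import httpx

start = time.perf_counter()
# Hypothetical streaming request; the user-visible wait ends at the
# first chunk, long before the full body has finished streaming.
with httpx.stream(
    "POST", "http://localhost:5052/v2/code/generations", json={}
) as resp:
    for chunk in resp.iter_bytes():
        print(f"first chunk after {time.perf_counter() - start:.2f}s")
        break
```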
What does this do?
It adds an option to track the duration until the HTTP response starts, i.e. excluding the streaming duration.
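One way to capture this is a pure ASGI middleware that observes the elapsed time when the `http.response.start` message is sent, before any body chunks are streamed. The following is only a sketch with prometheus_client, not the actual ai_gateway code; the class name is hypothetical, while the metric name and buckets mirror the output example below:

```python
import time

from prometheus_client import Histogram

# Hypothetical metric; name and buckets mirror the output example below.
RESPONSE_START_SECONDS = Histogram(
    "http_request_response_start_duration_seconds",
    "Duration of HTTP requests in seconds",
    ["handler", "method", "status"],
    buckets=(0.5, 1.0, 2.5, 5.0, 10.0, 30.0, 60.0),
)


class ResponseStartDurationMiddleware:
    """Observes time until `http.response.start`, not until the body ends."""

    def __init__(self, app):
        self.app = app

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":
            await self.app(scope, receive, send)
            return

        start = time.perf_counter()

        async def send_wrapper(message):
            # Fires once, when status and headers are sent -- before
            # any streamed body chunks.
            if message["type"] == "http.response.start":
                RESPONSE_START_SECONDS.labels(
                    # A real implementation would label with the route
                    # template to keep label cardinality bounded.
                    handler=scope["path"],
                    method=scope["method"],
                    status=f"{message['status'] // 100}xx",
                ).observe(time.perf_counter() - start)
            await send(message)

        await self.app(scope, receive, send_wrapper)
```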
Config Example:
https://gitlab.com/gitlab-org/modelops/applied-ml/code-suggestions/ai-assist/-/blob/main/ai_gateway/app.py?ref_type=heads#L51-58
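For orientation only, wiring the sketch above into a FastAPI app could look like this; the toggle name is hypothetical, and the real configuration lives in the linked app.py:

```python
from fastapi import FastAPI

TRACK_RESPONSE_START_DURATION = True  # hypothetical toggle for the new option

app = FastAPI()
if TRACK_RESPONSE_START_DURATION:
    # ResponseStartDurationMiddleware is the sketch from above.
    app.add_middleware(ResponseStartDurationMiddleware)
```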
Output example:
# HELP http_request_response_start_duration_seconds Duration of HTTP requests in seconds
# TYPE http_request_response_start_duration_seconds histogram
http_request_response_start_duration_seconds_bucket{handler="/v2/code/generations",le="0.5",method="POST",status="2xx"} 0.0
http_request_response_start_duration_seconds_bucket{handler="/v2/code/generations",le="1.0",method="POST",status="2xx"} 1.0
http_request_response_start_duration_seconds_bucket{handler="/v2/code/generations",le="2.5",method="POST",status="2xx"} 1.0
http_request_response_start_duration_seconds_bucket{handler="/v2/code/generations",le="5.0",method="POST",status="2xx"} 1.0
http_request_response_start_duration_seconds_bucket{handler="/v2/code/generations",le="10.0",method="POST",status="2xx"} 1.0
http_request_response_start_duration_seconds_bucket{handler="/v2/code/generations",le="30.0",method="POST",status="2xx"} 1.0
http_request_response_start_duration_seconds_bucket{handler="/v2/code/generations",le="60.0",method="POST",status="2xx"} 1.0
http_request_response_start_duration_seconds_bucket{handler="/v2/code/generations",le="+Inf",method="POST",status="2xx"} 1.0
http_request_response_start_duration_seconds_count{handler="/v2/code/generations",method="POST",status="2xx"} 1.0
http_request_response_start_duration_seconds_sum{handler="/v2/code/generations",method="POST",status="2xx"} 0.6706487989995367
# HELP http_request_response_start_duration_seconds_created Duration of HTTP requests in seconds
# TYPE http_request_response_start_duration_seconds_created gauge
http_request_response_start_duration_seconds_created{handler="/v2/code/generations",method="POST",status="2xx"} 1.7095186511967359e+09
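Note the buckets are cumulative: the single ~0.67 s observation counts toward every bucket from le="1.0" up to +Inf. The exposition above can be reproduced with prometheus_client, reusing the hypothetical metric from the sketch earlier:

```python
from prometheus_client import REGISTRY, generate_latest

# One observation matching the example's _sum; it falls in the
# le="1.0" bucket and, cumulatively, in every larger one.
RESPONSE_START_SECONDS.labels(
    handler="/v2/code/generations", method="POST", status="2xx"
).observe(0.6706487989995367)

print(generate_latest(REGISTRY).decode())
```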
Who is this for?
GitLab and software developers working on LLM app optimization.
Linked issues
Related to https://gitlab.com/gitlab-com/runbooks/-/merge_requests/6928#note_1796949998
Reviewer notes