Add handler metrics to egress #526
Conversation
- HTTP proxy
- max retries
- max/min retry delay
- AWS logging option
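For context, here is a minimal sketch of how options like these could map onto the AWS SDK for Go (v1). The wiring, helper name, and values below are illustrative assumptions, not the actual code from PR#521:

```go
package main

import (
	"net/http"
	"net/url"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/client"
	"github.com/aws/aws-sdk-go/aws/session"
)

// newAWSSession is a hypothetical helper showing where an HTTP proxy,
// retry limits/delays, and SDK logging could plug into aws.Config.
func newAWSSession(proxyURL string) (*session.Session, error) {
	proxy, err := url.Parse(proxyURL)
	if err != nil {
		return nil, err
	}
	return session.NewSession(&aws.Config{
		// HTTP proxy: route all SDK requests through the given proxy.
		HTTPClient: &http.Client{
			Transport: &http.Transport{Proxy: http.ProxyURL(proxy)},
		},
		// Max retries and min/max retry delay.
		Retryer: client.DefaultRetryer{
			NumMaxRetries: 5,
			MinRetryDelay: 500 * time.Millisecond,
			MaxRetryDelay: 30 * time.Second,
		},
		// AWS logging option.
		LogLevel: aws.LogLevel(aws.LogDebugWithRequestErrors),
	})
}
```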
* histogram of upload response time with labels for type: file, manifest, segment, liveplaylist, playlist; and status: success, failure
* counter of total upload writes
* gauge of segment channel size
* gauge of playlist channel size
* counter of backup storage writes
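As a rough sketch, collectors like those listed above could be declared with `prometheus/client_golang` along these lines (identifier and metric names are my own illustration, not necessarily the PR's):

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

var (
	// Histogram of upload response time, labeled by upload type and outcome.
	// type: file|manifest|segment|liveplaylist|playlist; status: success|failure
	uploadResponseTime = prometheus.NewHistogramVec(prometheus.HistogramOpts{
		Namespace: "livekit",
		Subsystem: "egress",
		Name:      "upload_response_time_seconds",
		Help:      "Upload response time.",
	}, []string{"type", "status"})

	// Counter of total upload writes.
	uploadWrites = prometheus.NewCounter(prometheus.CounterOpts{
		Namespace: "livekit", Subsystem: "egress",
		Name: "upload_writes_total", Help: "Total upload writes.",
	})

	// Gauges of channel backlog sizes.
	segmentChannelSize = prometheus.NewGauge(prometheus.GaugeOpts{
		Namespace: "livekit", Subsystem: "egress",
		Name: "segment_channel_size", Help: "Segment channel size.",
	})
	playlistChannelSize = prometheus.NewGauge(prometheus.GaugeOpts{
		Namespace: "livekit", Subsystem: "egress",
		Name: "playlist_channel_size", Help: "Playlist channel size.",
	})

	// Counter of backup storage writes.
	backupStorageWrites = prometheus.NewCounter(prometheus.CounterOpts{
		Namespace: "livekit", Subsystem: "egress",
		Name: "backup_storage_writes_total", Help: "Backup storage writes.",
	})
)

func init() {
	prometheus.MustRegister(uploadResponseTime, uploadWrites,
		segmentChannelSize, playlistChannelSize, backupStorageWrites)
}
```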
@frostbyte73 I have made the changes as you suggested (refactoring the handler metrics into a new struct). I've built and verified the changes locally in our test environment.
Reviewing the code some more, I added some additional refactoring to further concentrate the metrics use.
There was a timing window when the handler gRPC client was available, but before it was ready to send back metrics. If metrics were gathered during that window, the whole HTTP metrics response would be something like this:

```
An error has occurred while serving metrics: [from Gatherer livekit#2] rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial unix /tmp/EGH_ME3wf4gMCrac/service_rpc.sock: connect: no such file or directory"
```

That came from prometheus `registry.go:761`, which returns an error from the `Gatherers.Gather()` invocation, which the Prometheus HTTP handler then decides to return. What I've done is to quietly swallow errors from the handler (after logging) and simply return empty metrics objects and no error.
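A minimal sketch of that approach, assuming each handler's metrics are exposed to the server as a `prometheus.Gatherer` (the wrapper name here is illustrative, not the PR's exact code):

```go
package service

import (
	"github.com/livekit/protocol/logger"
	"github.com/prometheus/client_golang/prometheus"
	dto "github.com/prometheus/client_model/go"
)

// safeGatherer wraps a per-handler Gatherer so that a failure (e.g. the
// handler's rpc socket not existing yet) is logged and swallowed, returning
// empty metrics instead of failing the entire /metrics response.
func safeGatherer(g prometheus.Gatherer) prometheus.Gatherer {
	return prometheus.GathererFunc(func() ([]*dto.MetricFamily, error) {
		mfs, err := g.Gather()
		if err != nil {
			logger.Warnw("failed to gather handler metrics", err)
			return []*dto.MetricFamily{}, nil
		}
		return mfs, nil
	})
}
```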
Fixed a typo in the refactored metrics collection that transposed the success/failure response time and counts.
Thanks! Would you be able to update the test code to the new `Upload` signature as well?
@biglittlebigben I would gladly make the change if I knew how or where. I see the failure from the test run:

But I don't have these files in my repo:

So, I don't see where to update them. When I run locally, it works and I don't get the error (which makes sense, since I don't have those files). If you tell me how I can effect the change, I would gladly fix the test code to match the new signatures.
@biglittlebigben Oh... I see the problem. Those "missing" files were introduced AFTER the branch point where I created my fork. I had tried running based on egress 1.8.0, but I ran into other problems. When I checked with @frostbyte73, he ended up recommending that I stay based on the 1.7.12 version, because the newer egress required a newer livekit-server that we did not have available yet. So, it seems like there are two ways forward...
@biglittlebigben I was unable to make the changes to fix the test (which runs on …).
Paying more attention to the failure, it looks like the current branch fails to build; it's not specifically the integration tests that fail. The code will eventually be merged with main, so it would need to build after a merge/rebase with that.
It should be a fairly straightforward rebase, so I can fix the patch if needed.
That is what I did in #529.
Withdrawing this one because #529 was merged.
The interesting work in egress happens in a separate process called a `handler`, launched for each unique egress. This architecture provides good isolation, but the `server` process, which provides the metrics to be scraped for ingestion into a time-series database, did not have visibility into what was happening in the handler processes.

This PR adds support for aggregating the metrics from each active handler process every time the server is queried to render metrics for collection (changes in `handler.go`, `process.go`, and `server.go`, as well as `ipc.proto`).

These changes expose the Go built-in metrics from the handler processes (`go_gc_duration_seconds`, etc.) for collection by scraping the Prometheus port of the server process. The PR also implements some metric collectors in the handler to provide application-level metrics.
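A sketch of what that scrape path could look like, merging the server's own registry with a gatherer per active handler on each request (the `activeHandlers` callback and `metricsHandler` name are illustrative assumptions, not the PR's exact code):

```go
package service

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// metricsHandler rebuilds the gatherer list on each scrape so that metrics
// from handlers started or stopped since the last scrape are reflected.
func metricsHandler(activeHandlers func() []prometheus.Gatherer) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		gatherers := prometheus.Gatherers{prometheus.DefaultGatherer}
		gatherers = append(gatherers, activeHandlers()...)
		promhttp.HandlerFor(gatherers, promhttp.HandlerOpts{}).ServeHTTP(w, r)
	})
}
```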
An example of the metrics provided by these collectors is captured in this file: handler-metrics-pr-example.txt
Note: This PR is on top of [PR#521](https://github.com//pull/521) but is largely independent of it (only one file in common and no conflicting lines); however, I needed the changes in PR#521 to test in my environment.