
Add handler metrics to egress #526

Closed · wants to merge 6 commits

Conversation

@dandoug commented Nov 3, 2023

The interesting work in egress happens in a separate process, called a handler, launched for each unique egress. This architecture provides good isolation, but the server process, which exposes the metrics scraped for ingestion into a time-series database, had no visibility into what was happening in the handler processes.

This PR adds support for aggregating the metrics from each active handler process every time the server is queried to render metrics for collection (changes in handler.go, process.go, and server.go, as well as ipc.proto).

These changes expose the Go built-in metrics from the handler processes (go_gc_duration_seconds, etc.) for collection by scraping the Prometheus port of the server process.

The PR also implements some metric collectors in the handler to provide application-level metrics.

  • histogram of upload response time, with labels for type (file, manifest, segment, liveplaylist, playlist) and status (success, failure)
  • counter of total upload writes
  • gauge of segment channel size
  • gauge of playlist channel size
  • counter of backup storage writes

An example of metrics provided by these is captured here in this file: handler-metrics-pr-example.txt

Note: This PR is on top of [PR #521](https://github.com//pull/521) but is largely independent of it (only one file in common and no conflicting lines); however, I needed the changes in PR #521 to test in my environment.

@dandoug commented Nov 6, 2023

@frostbyte73 I have made the changes you suggested (refactored the handler metrics into a new struct in the stats package, and moved the instrumentation of uploader response time down to the remoteUploader implementation).

I've built and verified the changes locally in our test environment.

@dandoug commented Nov 6, 2023

Reviewing the code some more, I added some additional refactoring to further concentrate the metrics use in the stats package.

There was a timing window when the handler gRPC client was available but not yet ready to send back metrics; if metrics were gathered during that window, the whole HTTP metrics response would be something like this:
```
An error has occurred while serving metrics:

[from Gatherer #2] rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial unix /tmp/EGH_ME3wf4gMCrac/service_rpc.sock: connect: no such file or directory"
```

That came from prometheus/registry.go:761, which returns an error from the `Gatherers.Gather()` invocation, which the Prometheus HTTP handler then decides to return. What I've done is to quietly swallow errors from the handler (after logging) and simply return empty metrics objects and no error.

@dandoug commented Nov 6, 2023

Fixed a typo in the refactored metrics collection that transposed the success/failure response time and counts.

@biglittlebigben
Thanks! Would you be able to update the test code to the new Upload signature as well?

@dandoug commented Nov 7, 2023

@biglittlebigben I would gladly make the change if I knew how or where.

I see the failure from the test run:

```
Step 15/36 : RUN if [ "$TARGETPLATFORM" = "linux/arm64" ]; then GOARCH=arm64; else GOARCH=amd64; fi &&     CGO_ENABLED=1 GOOS=linux GOARCH=${GOARCH} GO111MODULE=on go build -a -o egress ./cmd/server
 ---> Running in e46c7c8324d0
# github.com/livekit/egress/pkg/pipeline/sink
pkg/pipeline/sink/image.go:111:75: not enough arguments in call to s.Upload
	have (string, string, "github.com/livekit/egress/pkg/types".OutputType, bool)
	want (string, string, "github.com/livekit/egress/pkg/types".OutputType, bool, string)
pkg/pipeline/sink/image_manifest.go:74:77: not enough arguments in call to u.Upload
	have (string, string, "github.com/livekit/egress/pkg/types".OutputType, bool)
	want (string, string, "github.com/livekit/egress/pkg/types".OutputType, bool, string)
pkg/pipeline/sink/sink.go:78:44: not enough arguments in call to uploader.New
	have (config.UploadConfig, string)
	want (config.UploadConfig, string, "github.com/livekit/egress/pkg/stats".HandlerMonitor)
The command '/bin/sh -c if [ "$TARGETPLATFORM" = "linux/arm64" ]; then GOARCH=arm64; else GOARCH=amd64; fi &&     CGO_ENABLED=1 GOOS=linux GOARCH=${GOARCH} GO111MODULE=on go build -a -o egress ./cmd/server' returned a non-zero code: 1

Error: Process completed with exit code 1.
```

But I don't have these files in my repo

  • pkg/pipeline/sink/image.go
  • pkg/pipeline/sink/image_manifest.go
  • pkg/pipeline/sink/sink.go

So, I don't see where to update them. When I run

```
docker build -t test -f build/test/Dockerfile --build-arg TARGETPLATFORM=linux/amd64 .
```

locally, it works and I don't get the error (which makes sense, since I don't have those files).

If you tell me how I can effect the change, I would gladly fix the test code to match the new signatures.

@dandoug commented Nov 7, 2023

@biglittlebigben Oh... I see the problem. Those "missing" files were introduced after the branch point where I created my fork.

I had tried building on egress 1.8.0, but I ran into other problems. When I checked with @frostbyte73, he ended up recommending that I stay based on the 1.7.12 version, because the newer egress required a newer livekit-server that we did not have available yet.

So, it seems like there are two ways forward:

  1. rebase this PR on the latest code, but then we will need a compatible version of livekit-server for use in our environment, or
  2. kick off the integration test on a version of the code that does not depend on features beyond egress 1.7.12

@dandoug commented Nov 7, 2023

@biglittlebigben I was unable to make the changes to fix the test (which runs on main/latest) in this branch, so I created a new PR, #529, cherry-picked these changes, and applied the additional changes to image.go, image_manifest.go, and sink.go to make the code build. I believe the changes are correct, although I am unable to verify them in our environment, since we're pegged to v1.7.12 at the moment.

@biglittlebigben
> @biglittlebigben Oh... I see the problem. Those "missing" files were introduced after the branch point where I created my fork.
>
> I had tried building on egress 1.8.0, but I ran into other problems. When I checked with @frostbyte73, he ended up recommending that I stay based on the 1.7.12 version, because the newer egress required a newer livekit-server that we did not have available yet.
>
> So, it seems like there are two ways forward:
>
> 1. rebase this PR on the latest code, but then we will need a compatible version of livekit-server for use in our environment, or
> 2. kick off the integration test on a version of the code that does not depend on features beyond egress 1.7.12

Paying more attention to the failure, it looks like the current branch fails to build; it is not specifically the integration tests. The code will eventually be merged with main, so it would need to build after a merge/rebase with that.

@biglittlebigben
> Paying more attention to the failure, it looks like the current branch fails to build, not specifically the integration tests. The code will eventually be merged with main, so would need to build after merge/rebase with that.

It should be a fairly straightforward rebase, so I can fix the patch if needed.

@dandoug commented Nov 7, 2023

> Paying more attention to the failure, it looks like the current branch fails to build, not specifically the integration tests. The code will eventually be merged with main, so would need to build after merge/rebase with that.
>
> It should be a fairly straightforward rebase, so I can fix the patch if needed.

That is what I did in #529

@dandoug closed this Nov 14, 2023
@dandoug commented Nov 14, 2023

Withdrawing this one because #529 was merged.
