
Add handler metrics to egress #526

Closed · wants to merge 6 commits

Conversation

@dandoug commented Nov 3, 2023

The interesting work in egress happens in a separate process, called a handler, launched for each unique egress. This architecture provides good isolation, but the server process, which exposes the metrics scraped for ingestion into a time-series database, had no visibility into what was happening in the handler processes.

This PR adds support for aggregating the metrics from each active handler process every time the server is queried to render metrics for collection (changes in handler.go, process.go, and server.go, as well as ipc.proto).

These changes expose the Go built-in metrics from the handler processes (go_gc_duration_seconds, etc.) for collection by scraping the Prometheus port of the server process.

The PR also implements some metric collectors in the handler to provide application-level metrics.

  • histogram of upload response time, with labels for type (file, manifest, segment, liveplaylist, playlist) and status (success, failure)
  • counter of total upload writes
  • gauge of segment channel size
  • gauge of playlist channel size
  • counter of backup storage writes

An example of metrics provided by these is captured here in this file: handler-metrics-pr-example.txt

Note: This PR is on top of [PR #521](https://github.com//pull/521) but is largely independent of it (only one file in common and no conflicting lines); however, I needed the changes in PR #521 to test in my environment.

@dandoug commented Nov 6, 2023

@frostbyte73 I have made the changes you suggested (refactored the handler metrics into a new struct in the stats package, and moved the instrumentation of uploader response time down to the remoteUploader implementation).

I've built and verified the changes locally in our test environment.

@dandoug commented Nov 6, 2023

Reviewing the code some more, I added some additional refactoring to further concentrate the metrics use in the stats package.

There was a timing window when the handler gRPC client was available but not yet ready to send back metrics; if metrics were gathered during that window, the whole HTTP metrics response would be something like this:
```
An error has occurred while serving metrics:

[from Gatherer #2] rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial unix /tmp/EGH_ME3wf4gMCrac/service_rpc.sock: connect: no such file or directory"
```

That came from prometheus/registry.go:761, which returns an error from the `Gatherers.Gather()` invocation, which the Prometheus HTTP handler then decides to return. What I've done is to quietly swallow errors from the handler (after logging) and simply return empty metrics objects and no error.

@dandoug commented Nov 6, 2023

Fixed a typo in the refactored metrics collection that transposed the success/failure response time and counts.

@biglittlebigben
Thanks! Would you be able to update the test code to the new Upload signature as well?

@dandoug commented Nov 7, 2023

@biglittlebigben I would gladly make the change if I knew how or where.

I see the failure from the test run:

```
Step 15/36 : RUN if [ "$TARGETPLATFORM" = "linux/arm64" ]; then GOARCH=arm64; else GOARCH=amd64; fi &&     CGO_ENABLED=1 GOOS=linux GOARCH=${GOARCH} GO111MODULE=on go build -a -o egress ./cmd/server
 ---> Running in e46c7c8324d0
# github.com/livekit/egress/pkg/pipeline/sink
pkg/pipeline/sink/image.go:111:75: not enough arguments in call to s.Upload
	have (string, string, "github.com/livekit/egress/pkg/types".OutputType, bool)
	want (string, string, "github.com/livekit/egress/pkg/types".OutputType, bool, string)
pkg/pipeline/sink/image_manifest.go:74:77: not enough arguments in call to u.Upload
	have (string, string, "github.com/livekit/egress/pkg/types".OutputType, bool)
	want (string, string, "github.com/livekit/egress/pkg/types".OutputType, bool, string)
pkg/pipeline/sink/sink.go:78:44: not enough arguments in call to uploader.New
	have (config.UploadConfig, string)
	want (config.UploadConfig, string, "github.com/livekit/egress/pkg/stats".HandlerMonitor)
The command '/bin/sh -c if [ "$TARGETPLATFORM" = "linux/arm64" ]; then GOARCH=arm64; else GOARCH=amd64; fi &&     CGO_ENABLED=1 GOOS=linux GOARCH=${GOARCH} GO111MODULE=on go build -a -o egress ./cmd/server' returned a non-zero code: 1

Error: Process completed with exit code 1.
```

But I don't have these files in my repo

  • pkg/pipeline/sink/image.go
  • pkg/pipeline/sink/image_manifest.go
  • pkg/pipeline/sink/sink.go

So, I don't see where to update them. When I run

```
docker build -t test -f build/test/Dockerfile --build-arg TARGETPLATFORM=linux/amd64 .
```

locally, it works and I don't get the error (which makes sense, since I don't have those files).

If you tell me how I can effect the change, I would gladly fix the test code to match the new signatures.

@dandoug commented Nov 7, 2023

@biglittlebigben Oh... I see the problem. Those "missing" files were introduced after the branch point where I created my fork.

I had tried building on egress 1.8.0, but I ran into other problems. When I checked with @frostbyte73, he ended up recommending that I stay based on the 1.7.12 version, because the newer egress required a newer livekit-server that we did not have available yet.

So, it seems like there are two ways forward:

  1. rebase this PR on the latest code, but then we will need a compatible version of livekit-server for use in our environment, or
  2. kick off the integration test on a version of the code that does not depend on features beyond egress 1.7.12

@dandoug commented Nov 7, 2023

@biglittlebigben I was unable to make the changes to fix the test (which runs on main/latest) in this branch, so I created a new PR, #529, cherry-picked these changes, and applied the additional changes to image.go, image_manifest.go, and sink.go to make the code build. I believe the changes are correct, although I am unable to verify them in our environment, since we're pegged to v1.7.12 at the moment.

@biglittlebigben
> @biglittlebigben Oh... I see the problem. Those "missing" files were introduced after the branch point where I created my fork.
>
> I had tried building on egress 1.8.0, but I ran into other problems. When I checked with @frostbyte73, he ended up recommending that I stay based on the 1.7.12 version, because the newer egress required a newer livekit-server that we did not have available yet.
>
> So, it seems like there are two ways forward:
>
> 1. rebase this PR on the latest code, but then we will need a compatible version of livekit-server for use in our environment, or
> 2. kick off the integration test on a version of the code that does not depend on features beyond egress 1.7.12

Paying more attention to the failure, it looks like the current branch fails to build; it is not specifically the integration tests. The code will eventually be merged with main, so it would need to build after a merge/rebase with that.

@biglittlebigben
> Paying more attention to the failure, it looks like the current branch fails to build, not specifically the integration tests. The code will eventually be merged with main, so would need to build after merge/rebase with that.

It should be a fairly straightforward rebase, so I can fix the patch if needed.

@dandoug commented Nov 7, 2023

> Paying more attention to the failure, it looks like the current branch fails to build, not specifically the integration tests. The code will eventually be merged with main, so would need to build after merge/rebase with that.
>
> It should be a fairly straightforward rebase, so I can fix the patch if needed.

That is what I did in #529

@dandoug closed this Nov 14, 2023
@dandoug commented Nov 14, 2023

Withdrawing this one because #529 was merged.
