Have you ever asked yourself, "How do I know how well or badly my application is performing without waiting for it to break?" If so, this article is for you.
Today you're going to learn how to deploy an Observability Stack for your application using Kubernetes or Docker Compose.
Firstly, what's an observability stack? Basically, it's a set of components that bring insights about an application. The data these components handle is also called telemetry.
Number of requests, average response time of an endpoint, most used endpoints, how many errors happened during some process, and so on: these are all metrics that can be used to better understand how an application works and where it needs improvement.
There are many tools out there and, as with almost everything, it all depends on one's needs and preferences. For this article, the stack consists of OpenTelemetry, Jaeger, and Prometheus.
These tools are all open source, which means one can use them freely without worrying about licenses.
According to their website, "OpenTelemetry is a collection of tools, APIs, and SDKs. Use it to instrument, generate, collect, and export telemetry data (metrics, logs, and traces) to help you analyze your software’s performance and behavior".
So, basically, it's a toolkit for gathering telemetry data that can be used with many different tools and libraries.
Jaeger is an end-to-end distributed tracing tool that is widely used in the industry; its GitHub repository has over 16k stars.
Prometheus is an open-source monitoring tool focused on metrics and alerting.
Before we begin actually writing some definitions, here's a disclaimer: this is probably not the best way of doing this, nor the most effective. But this is what has worked for me and met my needs. Feel free to leave advice and best practices in the comments; I'll be glad to read them.
In a "normal" stack, there's probably some application to get the tracings and another to get the metrics. That would be the case in the current scenario, but with the latest versions of Jaeger one can make use of their spans¹ to produce metrics directly from them.
Let's walk through the architecture:
- A "MicroSim" container creates simulated spans (in this case there's no MicroSim; the spans will actually be collected from the app);
- It then sends them to the OpenTelemetry Collector. Those spans can be instrumented manually or using some library (like otelgin or otelzap, for example);
- After that, the spans are sent to Jaeger, which collects the tracing data, while the metrics derived from those spans are sent to Prometheus;
- Finally, Jaeger queries those metrics to show both metrics and traces in a single place.
Therefore, at the end of the article, there'll be only one dashboard (Jaeger UI) for both the metrics and the tracing data of the application.
As mentioned before, both metrics and traces will derive from the same spans that one instruments.
There are many ways of instrumenting an application. It depends, first, on the level of control one wants over the spans and, second, on which language one is using.
For Go, for example, one can make use of many instrumentation libraries. If, for instance, Gin is used, one can get otelgin to instrument all HTTP requests automatically. That's precisely what this article will do.
To download otelgin, just run:
go get -u go.opentelemetry.io/contrib/instrumentation/github.com/gin-gonic/gin/otelgin
And to use it is as simple as:
router := gin.New()
router.Use(otelgin.Middleware("ServiceName"))
The library acts as a Gin middleware and records every request.
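To make that concrete, here's a minimal sketch of a Gin application with the middleware wired in. The /ping route and the :8080 port are just illustrative, not part of the original setup:
package main

import (
	"net/http"

	"github.com/gin-gonic/gin"
	"go.opentelemetry.io/contrib/instrumentation/github.com/gin-gonic/gin/otelgin"
)

func main() {
	router := gin.New()
	// Every request handled by this router now produces a span.
	router.Use(otelgin.Middleware("ServiceName"))

	// Illustrative route, just to have something to trace.
	router.GET("/ping", func(c *gin.Context) {
		c.JSON(http.StatusOK, gin.H{"message": "pong"})
	})

	router.Run(":8080")
}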
Trace data is now being gathered, but it's not yet being processed and sent anywhere. That's what the next step is for.
The provider, in this case, is part of the OpenTelemetry SDK. First, install its package for Go:
go get -u go.opentelemetry.io/otel
Basically, the following things are needed to create a provider:
- a sampler;
- a processor;
- a resource; and
- an exporter.
Here are the code snippets for them.
import (
"os"
"go.opentelemetry.io/otel/sdk/trace"
)
// Helper function to define sampling.
// When in development mode, AlwaysSample is defined,
// otherwise, sample based on Parent and IDRatio will be used.
func getSampler() trace.Sampler {
ENV := os.Getenv("GO_ENV")
switch ENV {
case "development":
return trace.AlwaysSample()
case "production":
return trace.ParentBased(trace.TraceIDRatioBased(0.5))
default:
return trace.AlwaysSample()
}
}
This is a helper function that one can use to define different behavior for development and production modes. Sampling determines, in basic terms, which traces are actually recorded and exported. So, if trace.AlwaysSample() is defined, for example, every single request will be traced.
The other option blends two types of sampling: the sampler follows the parent span's decision when there is one (trace.ParentBased) and otherwise samples a given ratio of traces (trace.TraceIDRatioBased(0.5)), in this case half of them.
import (
	"context"
	"log"
	"os"

	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/sdk/resource"
	semconv "go.opentelemetry.io/otel/semconv/v1.4.0"
)
// Returns a new OpenTelemetry resource describing this application.
func newResource(ctx context.Context) *resource.Resource {
res, err := resource.New(ctx,
resource.WithFromEnv(),
resource.WithProcess(),
resource.WithTelemetrySDK(),
resource.WithHost(),
resource.WithAttributes(
// the service name used to display traces in backends
semconv.ServiceNameKey.String(os.Getenv("SERVICE_NAME")),
attribute.String("environment", os.Getenv("GO_ENV")),
),
)
if err != nil {
log.Fatalf("%s: %v", "Failed to create resource", err)
}
return res
}
The resource describes the application itself, so one can tell where the telemetry data comes from.
For the exporter, the Jaeger exporter library is needed:
go get -u go.opentelemetry.io/otel/exporters/jaeger
In a normal environment, one would use this library with Jaeger's endpoint to send the span data. But since the span data is being sent to the OpenTelemetry Collector instead, the OPEN_TELEMETRY_COLLECTOR_URL will be used.
import (
"os"
"go.opentelemetry.io/otel/exporters/jaeger"
)
// Creates Jaeger exporter
func exporterToJaeger() (*jaeger.Exporter, error) {
return jaeger.New(jaeger.WithCollectorEndpoint(jaeger.WithEndpoint(os.Getenv("OPEN_TELEMETRY_COLLECTOR_URL"))))
}
With all of this in hand, the provider can be initialized.
import (
	"context"
	"log"
	"os"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/exporters/jaeger"
	"go.opentelemetry.io/otel/sdk/resource"
	"go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.4.0"
)
// Initiates OpenTelemetry provider sending data to OpenTelemetry Collector.
func InitProviderWithJaegerExporter(ctx context.Context) (func(context.Context) error, error) {
exp, err := exporterToJaeger()
if err != nil {
	return nil, err
}
tp := trace.NewTracerProvider(
trace.WithSampler(getSampler()),
trace.WithBatcher(exp),
trace.WithResource(newResource(ctx)),
)
otel.SetTracerProvider(tp)
return tp.Shutdown, nil
}
It needs to be called before the code that creates spans, in this case the HTTP requests instrumented by the otelgin library. Something like:
func main() {
ctx := context.Background()
/// Observability
shutdown, err := telemetry.InitProviderWithJaegerExporter(ctx)
if err != nil {
log.Fatalf("%s: %v", "Failed to initialize opentelemetry provider", err)
}
defer shutdown(ctx)
/// rest of code
}
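For operations that otelgin doesn't cover, spans can also be created manually from the request context. Here's a minimal sketch; the tracer name, span name, and the doSomething helper are all hypothetical:
import (
	"github.com/gin-gonic/gin"
	"go.opentelemetry.io/otel"
)

func someHandler(c *gin.Context) {
	// Start a child of the request span that otelgin created.
	ctx, span := otel.Tracer("my-service").Start(c.Request.Context(), "expensive-operation")
	defer span.End()

	doSomething(ctx) // hypothetical helper; pass ctx on so nested spans attach to this one
}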
At this point your application is instrumented and ready to send data for the OpenTelemetry Collector to do its magic.
We'll need to deploy three services: the OpenTelemetry Collector, Jaeger, and Prometheus.
They can be deployed in different ways. Here are the Docker Compose way and the Kubernetes way.
Docker Compose is a great tool for everyone in the software development area. If one doesn't know it yet, it's probably time to learn it. It helps a lot whenever a stack needs to be deployed quickly.
To be honest, I prefer it over Kubernetes: it's simpler and more intuitive.
Here's the docker-compose.yml file.
# docker-compose.yml file
version: "3.5"
services:
jaeger:
networks:
- backend
image: jaegertracing/all-in-one:latest
volumes:
- "./jaeger-ui.json:/etc/jaeger/jaeger-ui.json"
command: --query.ui-config /etc/jaeger/jaeger-ui.json
environment:
- METRICS_STORAGE_TYPE=prometheus
- PROMETHEUS_SERVER_URL=http://prometheus:9090
ports:
- "14250:14250"
- "14268:14268"
- "6831:6831/udp"
- "16686:16686"
- "16685:16685"
otel_collector:
networks:
- backend
image: otel/opentelemetry-collector-contrib:latest
volumes:
- "./otel-collector-config.yml:/etc/otelcol/otel-collector-config.yml"
command: --config /etc/otelcol/otel-collector-config.yml
ports:
- "14278:14278"
depends_on:
- jaeger
prometheus:
networks:
- backend
image: prom/prometheus:latest
volumes:
- "./prometheus.yml:/etc/prometheus/prometheus.yml"
ports:
- "9090:9090"
networks:
backend:
And here are the configurations one can use:
- jaeger-ui.json
{
"monitor": {
"menuEnabled": true
},
"dependencies": {
"menuEnabled": true
}
}
- otel-collector-config.yml
receivers:
jaeger:
protocols:
thrift_http:
endpoint: "0.0.0.0:14278" # port opened on the otel collector
# Dummy receiver that's never used, because a pipeline is required to have one.
otlp/spanmetrics:
protocols:
grpc:
endpoint: "localhost:65535"
exporters:
prometheus:
endpoint: "0.0.0.0:8889"
otlp/jaeger: # Jaeger supports OTLP directly. The default port for OTLP/gRPC is 4317
endpoint: "jaeger:4317"
tls:
insecure: true
processors:
batch:
spanmetrics:
metrics_exporter: prometheus
service:
pipelines:
traces:
receivers: [jaeger]
processors: [spanmetrics, batch]
exporters: [otlp/jaeger]
# The exporter name in this pipeline must match the spanmetrics.metrics_exporter name.
# The receiver is just a dummy and never used; added to pass validation requiring at least one receiver in a pipeline.
metrics/spanmetrics:
receivers: [otlp/spanmetrics]
exporters: [prometheus]
- prometheus.yml
global:
scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
# scrape_timeout is set to the global default (10s).
scrape_configs:
- job_name: aggregated-trace-metrics
static_configs:
- targets: ['otel_collector:8889'] # using the name of the OpenTelemetryCollector container defined in the docker compose file
As one can see, it's pretty straightforward.
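With those three configuration files next to the docker-compose.yml, bringing everything up is one command (assuming a Docker installation with the Compose plugin):
docker compose up -d
Once the containers are running, the Jaeger UI should be available at http://localhost:16686, with the Monitor tab enabled by the jaeger-ui.json above.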
The Kubernetes version is pretty much the same as above, with only a few necessary tweaks.
If you find yourself at a point where you need to translate a Docker Compose file into Kubernetes manifests, you can use Kompose, which is a conversion tool for exactly this. Be aware that some tweaks will probably be necessary after the process, but it helps A LOT! A typical invocation is shown below.
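Assuming Kompose is installed and run from the directory that holds the compose file:
kompose convert -f docker-compose.yml
It writes one YAML file per generated Kubernetes resource, which can then be adjusted by hand.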
Here are the K8s files.
- jaeger.yaml
# Namespace where the Telemetry will be created
apiVersion: v1
kind: Namespace
metadata:
name: telemetry
---
# Jaeger ConfigMap for UI
apiVersion: v1
kind: ConfigMap
metadata:
name: jaeger-ui-configmap
namespace: telemetry
data:
jaeger-ui.json: |-
{
"monitor": {
"menuEnabled": true
},
"dependencies": {
"menuEnabled": true
}
}
---
# Jaeger Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
name: jaeger-deployment
namespace: telemetry
labels:
app: jaeger
spec:
replicas: 1
selector:
matchLabels:
app: jaeger
template:
metadata:
labels:
app: jaeger
spec:
containers:
- args:
- --query.ui-config
- /etc/jaeger/ui/jaeger-ui.json
env:
- name: METRICS_STORAGE_TYPE
value: prometheus
- name: PROMETHEUS_SERVER_URL
value: http://prometheus-service:9090
image: jaegertracing/all-in-one:latest
imagePullPolicy: IfNotPresent
name: jaeger
ports:
- containerPort: 14250
- containerPort: 14268
- containerPort: 6831
protocol: UDP
- containerPort: 16686
- containerPort: 16685
volumeMounts:
- mountPath: /etc/jaeger/ui
name: jaeger-volume
restartPolicy: Always
volumes:
- name: jaeger-volume
configMap:
name: jaeger-ui-configmap
items:
- key: jaeger-ui.json
path: jaeger-ui.json
---
# Jaeger Service
apiVersion: v1
kind: Service
metadata:
name: jaeger-service
namespace: telemetry
spec:
selector:
app: jaeger
ports:
- name: "14250"
port: 14250
targetPort: 14250
- name: "14268"
port: 14268
targetPort: 14268
- name: "6831"
port: 6831
protocol: UDP
targetPort: 6831
- name: "16686"
port: 16686
targetPort: 16686
- name: "16685"
port: 16685
targetPort: 16685
- otel-collector.yaml
# Otel-Collector ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
name: otel-collector-configmap
namespace: telemetry
data:
otel-collector-config.yml: |-
receivers:
jaeger:
protocols:
thrift_http:
endpoint: "0.0.0.0:14278" # port opened on the otel collector
# Dummy receiver that's never used, because a pipeline is required to have one.
otlp/spanmetrics:
protocols:
grpc:
endpoint: "localhost:65535"
exporters:
prometheus:
endpoint: "0.0.0.0:8889"
otlp/jaeger: # Jaeger supports OTLP directly. The default port for OTLP/gRPC is 4317
endpoint: "jaeger-service:4317"
tls:
insecure: true
processors:
batch:
spanmetrics:
metrics_exporter: prometheus
service:
pipelines:
traces:
receivers: [jaeger]
processors: [spanmetrics, batch]
exporters: [otlp/jaeger]
# The exporter name in this pipeline must match the spanmetrics.metrics_exporter name.
# The receiver is just a dummy and never used; added to pass validation requiring at least one receiver in a pipeline.
metrics/spanmetrics:
receivers: [otlp/spanmetrics]
exporters: [prometheus]
---
# Otel-Collector Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
name: otel-collector-deployment
namespace: telemetry
labels:
app: otel-collector
spec:
replicas: 1
selector:
matchLabels:
app: otel-collector
template:
metadata:
labels:
app: otel-collector
spec:
containers:
- args:
- --config
- /etc/otelcol/otel-collector-config.yml
image: otel/opentelemetry-collector-contrib:latest
imagePullPolicy: IfNotPresent
name: otel-collector
ports:
- containerPort: 14278
- containerPort: 8889
volumeMounts:
- mountPath: /etc/otelcol
name: otel-collector-volume
restartPolicy: Always
volumes:
- name: otel-collector-volume
configMap:
name: otel-collector-configmap
items:
- key: otel-collector-config.yml
path: otel-collector-config.yml
---
# Otel-Collector Service
apiVersion: v1
kind: Service
metadata:
name: otel-collector-service
namespace: telemetry
spec:
selector:
app: otel-collector
ports:
- name: "14278"
port: 14278
targetPort: 14278
- name: "8889"
port: 8889
targetPort: 8889
- prometheus.yaml
# Prometheus ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
name: prometheus-configmap
namespace: telemetry
data:
prometheus.yml: |-
global:
scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
# scrape_timeout is set to the global default (10s).
scrape_configs:
- job_name: aggregated-trace-metrics
static_configs:
- targets: ['otel-collector-service:8889']
---
# Prometheus Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
name: prometheus-deployment
namespace: telemetry
labels:
app: prometheus
spec:
replicas: 1
selector:
matchLabels:
app: prometheus
template:
metadata:
labels:
app: prometheus
spec:
containers:
- image: prom/prometheus:latest
imagePullPolicy: IfNotPresent
name: prometheus
ports:
- containerPort: 9090
volumeMounts:
- mountPath: /etc/prometheus
name: prometheus-volume
restartPolicy: Always
volumes:
- name: prometheus-volume
configMap:
name: prometheus-configmap
items:
- key: prometheus.yml
path: prometheus.yml
---
# Prometheus Service
apiVersion: v1
kind: Service
metadata:
name: prometheus-service
namespace: telemetry
spec:
selector:
app: prometheus
ports:
- name: "9090"
port: 9090
targetPort: 9090
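To deploy everything, apply the three manifests (assuming kubectl points at the target cluster):
kubectl apply -f jaeger.yaml -f otel-collector.yaml -f prometheus.yaml
And to reach the Jaeger UI from a local machine, port-forward its service:
kubectl port-forward -n telemetry svc/jaeger-service 16686:16686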
As happens with all great things, this article is coming to an end. But before that, one may be wondering:
"What's the value of that OPEN_TELEMETRY_COLLECTOR_URL variable that we pass to the exporter?"
Right, that will depend on how one deployed the stack.
Basically, it's the API endpoint where the OpenTelemetry Collector receives the trace data. It's something like:
OPEN_TELEMETRY_COLLECTOR_URL = http://{OpenTelemetryCollectorIP}:14278/api/traces
So, if one is running the stack with Docker Compose on the same machine as the Go application, for example, that would be http://localhost:14278/api/traces.
If the application runs inside the Kubernetes cluster, that would be http://otel-collector-service.telemetry.svc:14278/api/traces.
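As a sketch of how to wire this up in Kubernetes, assuming the application runs as its own Deployment in the cluster (the values below are illustrative):
# Fragment of the application's container spec (illustrative)
env:
  - name: OPEN_TELEMETRY_COLLECTOR_URL
    value: http://otel-collector-service.telemetry.svc:14278/api/traces
  - name: SERVICE_NAME
    value: my-app
  - name: GO_ENV
    value: production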
That's all, folks! If you have any questions, just let me know. Follow me for more content like this.
Cheers!
¹ Spans are the "blocks of work" in tracing tools like Jaeger: each one represents an operation and carries the context needed to produce trace data.