add intro for OTLP and Collector, basic spell checks
jtl-novatec committed Apr 16, 2024
1 parent df44eda commit 324825e
Showing 6 changed files with 298 additions and 52 deletions.
45 changes: 45 additions & 0 deletions tutorial/content/intro/goals_of_otel/_index.md
@@ -0,0 +1,45 @@
---
title: "Why is OpenTelemetry promising?"
linktitle: "Goals of OpenTelemetry"
draft: false
weight: 30
---

<!-- ### history
OpenTelemetry is the result of the merger of OpenTracing and OpenCensus. Both projects had the same goal - to standardize the instrumentation of code and how telemetry data is sent to observability backends. Neither project could solve the problem on its own, so the CNCF merged the two into OpenTelemetry. This came with two major advantages: first, both projects joined forces to create a better overall product, and second, there is now a single product instead of several. With that, standardization can be reached in a wider context of telemetry collection, which in turn should increase the adoption rate of telemetry collection in applications, since the entry barrier is much lower. The CNCF describes OpenTelemetry as the next major version of OpenTracing and OpenCensus, and as such there are even migration guides from both projects to OpenTelemetry.
-->

<!-- ### promises -->
At the time of writing, OpenTelemetry is the [second fastest-growing project](https://www.cncf.io/reports/cncf-annual-report-2023/#projects) within the CNCF.
OpenTelemetry receives so much attention because it promises to be a fundamental shift in the way we produce telemetry.
It's important to remember that observability is a fairly young discipline.
In the past, the rate of innovation and conflicts of interest prevented us from defining widely adopted standards for telemetry. <!-- quote -->
However, OpenTelemetry's timing and momentum give it a realistic chance of pushing for standardization of common aspects of telemetry.

#### Instrument once, use everywhere
A key promise of OpenTelemetry is that you *instrument code once and never again*, and that you can *use that instrumentation everywhere*.
OpenTelemetry recognizes that, should its efforts be successful, it will be a core dependency for many software projects.
Therefore, it follows strict processes to provide [*long-term stability guarantees*](https://opentelemetry.io/docs/specs/otel/versioning-and-stability/).
Once a signal is declared stable, the promise is that clients will never experience a breaking API change.

#### Separate telemetry generation from analysis
Another core idea of OpenTelemetry is to *separate the mechanisms that produce telemetry from the systems that analyze it*.
Open and vendor-agnostic instrumentation marks a fundamental *change in the observability business*.
Instead of pouring resources into building proprietary instrumentation and keeping it up to date, vendors must differentiate themselves through feature-rich analysis platforms with great usability.
OpenTelemetry *fosters competition*, because users are no longer stuck with the observability solution they chose during development.
After switching to OpenTelemetry, you can move platforms without having to re-instrument your entire system.
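
To make this concrete, below is a minimal Python sketch (assuming the `opentelemetry-sdk` and OTLP exporter packages are installed). The instrumentation only talks to the vendor-neutral API; the exporter, and therefore the backend, is picked at startup, so switching platforms is a configuration change rather than a re-instrumentation effort. The tracer, span, and attribute names are made up for illustration.

```python
import os

from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# The backend is a deployment decision: if an OTLP endpoint is configured,
# spans are shipped there; otherwise they are printed locally for development.
if os.getenv("OTEL_EXPORTER_OTLP_ENDPOINT"):
    exporter = OTLPSpanExporter()  # reads the endpoint from the environment
else:
    exporter = ConsoleSpanExporter()

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)

# The instrumentation below never changes, no matter which backend is used.
tracer = trace.get_tracer("shop.checkout")  # made-up instrumentation scope
with tracer.start_as_current_span("process-order") as span:
    span.set_attribute("order.items", 3)
```

Pointing the OTLP endpoint at a different backend is all it takes to move platforms; the instrumented code itself stays untouched.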

#### Make software observable by default
With OpenTelemetry, open-source developers are able to add *native instrumentation to their projects without introducing vendor-specific code* that burdens their users.
The idea is to *make observability a first-class citizen during development*.
By having software ship with built-in instrumentation, we no longer need elaborate mechanisms to capture and integrate it after the fact.
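
As a rough sketch of what native instrumentation can look like, the hypothetical library module below depends only on the OpenTelemetry API package. If the application never installs an SDK, the calls are cheap no-ops; if it does, the library's spans flow to whatever backend the user configured, without any vendor-specific code in the library itself. All names are invented for illustration.

```python
# A hypothetical library module ("acme/payments.py") with built-in instrumentation.
# It depends only on the OpenTelemetry API; without an SDK configured by the
# application, the calls below are cheap no-ops.
from opentelemetry import trace

tracer = trace.get_tracer("acme.payments")  # made-up instrumentation scope name


def charge(amount_cents: int, currency: str) -> bool:
    """Charge a payment and describe the operation with a span."""
    with tracer.start_as_current_span("payments.charge") as span:
        span.set_attribute("payment.amount_cents", amount_cents)
        span.set_attribute("payment.currency", currency)
        # ... the actual payment logic would live here ...
        return True


if __name__ == "__main__":
    charge(1999, "EUR")
```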

#### Improve how we use telemetry
Last (and definitely not least), OpenTelemetry tries to change how we think about and use telemetry.
Instead of having three separate silos for logs, metrics, and traces, OpenTelemetry follows a paradigm of linking telemetry signals together.
With context creating touch points between signals, the overall value and usability of telemetry increase drastically.
For instance, imagine the ability to jump from conspicuous statistics in a dashboard straight to the related logs.
Correlated telemetry data helps to reduce the cognitive load on humans operating complex systems.
Being able to take advantage of linked data will mark a new generation of observability tools.
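
A small, hedged sketch of what such a link can look like in practice: the log record below carries the IDs of the active span, which is the touch point a backend needs to jump from a log line to the corresponding trace. With only the bare API and no SDK configured, the IDs are zeroed no-op values; the logger and span names are made up.

```python
import json
import logging
import sys

from opentelemetry import trace

logging.basicConfig(stream=sys.stdout, level=logging.INFO, format="%(message)s")
logger = logging.getLogger("shop.checkout")  # made-up logger name


def log_with_trace_context(message: str) -> None:
    """Emit a structured log line carrying the IDs of the currently active span."""
    ctx = trace.get_current_span().get_span_context()
    logger.info(json.dumps({
        "message": message,
        # Hex-encoded IDs are the touch point that lets a backend link this
        # log line to the matching trace and span.
        "trace_id": f"{ctx.trace_id:032x}",
        "span_id": f"{ctx.span_id:016x}",
    }))


tracer = trace.get_tracer("shop.checkout")
with tracer.start_as_current_span("process-order"):
    log_with_trace_context("order accepted")
```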
38 changes: 19 additions & 19 deletions tutorial/content/intro/how_we_got_here/index.md
@@ -17,7 +17,7 @@ First, there is the *workload*.
These are the operations a system performs to fulfill its objectives.
For instance, when a user sends a request, a distributed system often breaks it down into smaller tasks handled by different services.
Second, there are *software abstractions* that make up the structure of the distributed system.
This includes elements such as load balancers, services, pods, containers and more.
Lastly, there are physical machines that provide computational *resources* (e.g. RAM, CPU, disk space, network) to carry out work.

{{< figure src="images/workload_resource_analysis_gregg.png" width=400 caption="workload and resource analysis [[Gregg16]](https://www.brendangregg.com/Slides/ACMApplicative2016_SystemMethodology/#18)" >}}
@@ -26,12 +26,12 @@ developers need highly detailed telemetry that they can use to pinpoint specific
-->
Depending on our background, we often have a certain bias when investigating performance or troubleshooting problems in a distributed system.
Application developers typically concentrate on workload-related aspects, whereas operations teams tend to look at physical resources.
To truly understand a system, we must combine insights from multiple angles and figure out how they relate to one another.
However, before we can analyze something, we must first capture aspects of system behavior.
As you may know, we commonly do this through a combination of *logs*, *metrics* and *traces*.
Although it seems normal today, things weren't always this way.
But why should you be concerned about the past?
The reason is that OpenTelemetry tries to address problems that are the result of historical developments. <!-- TODO: ref Ted Young -->

#### logs
{{< figure src="images/logs.png" width=600 caption="Exemplary log files" >}}
@@ -48,9 +48,9 @@ hard to agree on semantics / language
A *log* is an append-only data structure that records events occurring in a system.
A log entry consists of a timestamp that denotes when something happened and a message to describe details about the event.
However, coming up with a standardized log format is no easy task.
One reason is that different types of software often convey different pieces of information. The logs of an HTTP web server are bound to look different from those of the kernel.
But even for similar software, people often have different opinions on what good logs should look like.
Apart from content, log formats also vary with their consumers. Initially, text-based formats catered to human readability.
However, as software systems became more complex, the volume of logs soon became unmanageable.
To combat this, we started encoding events as key/value pairs to make them machine-readable.
This is commonly known as structured logging.
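
To illustrate the difference, here is a small Python sketch (the event and field names are invented): the first line is a typical human-readable log message, while the second encodes the same event as machine-readable key/value pairs.

```python
import json
from datetime import datetime, timezone

ts = datetime.now(timezone.utc)

# A free-form, text-based log line: easy for humans to read, hard to parse reliably.
print(f"{ts:%Y-%m-%d %H:%M:%S} ERROR payment failed for order 1234 (card declined)")

# The same event as a structured, machine-readable set of key/value pairs.
print(json.dumps({
    "timestamp": ts.isoformat(),
    "level": "ERROR",
    "event": "payment_failed",
    "order_id": 1234,
    "reason": "card_declined",
}))
```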
@@ -81,42 +81,42 @@ Instead of just looking at individual events—logs—tracing systems looked at
{{< figure src="images/distributed_system.drawio.png" width=400 caption="Exemplary architecture of a distributed system" >}}

As distributed systems grew in scale, it became clear that traditional logging systems often fell short when trying to debug complex problems.
The reason is that we often have to understand the chain of events in a system.
On a single machine, stack traces allow us to track an exception back to a line of code.
In a distributed environment, we don't have this luxury.
Instead, we perform extensive filtering to locate log events of interest.
To understand the larger context, we must identify other related events.
This often results in lots of manual labour (e.g. comparing timestamps) or requires extensive domain knowledge about the applications.
Recognizing this problem, Google developed [Dapper](https://storage.googleapis.com/pub-tools-public-publication-data/pdf/36356.pdf), which popularized the concept of distributed tracing.
On a fundamental level, tracing is logging on steroids.
The underlying idea is to add transactional context to logs.
By indexing logs based on this information, it is possible to infer causality and reconstruct the journey of requests in the system.
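
As a simplified sketch of that idea (using the W3C `traceparent` header format, with the helper functions invented for illustration): every request receives a trace ID that travels with it across service boundaries, and each service records its own span ID, which is exactly the context a backend needs to stitch individual events back into one journey.

```python
import secrets


def new_traceparent() -> str:
    """Start a new trace: a W3C 'traceparent' value for the initial request."""
    trace_id = secrets.token_hex(16)  # 32 hex chars, shared by the entire request
    span_id = secrets.token_hex(8)    # 16 hex chars, unique per operation
    return f"00-{trace_id}-{span_id}-01"


def continue_trace(traceparent: str) -> str:
    """A downstream service keeps the trace ID but records its own span ID."""
    version, trace_id, _parent_span_id, flags = traceparent.split("-")
    child_span_id = secrets.token_hex(8)
    return f"{version}-{trace_id}-{child_span_id}-{flags}"


# Service A starts a trace and sends the header along with its outgoing request.
outgoing = new_traceparent()
# Service B continues the same trace; because both services tag their telemetry
# with the shared trace ID, a backend can reconstruct the request's journey.
downstream = continue_trace(outgoing)
print(outgoing, downstream, sep="\n")
```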

#### three pillars of observability
On the surface, logs, metrics, and traces share many similarities in their lifecycle and components.
Everything starts with instrumentation that captures and emits data.
The data has to have a certain structure, which is defined by a format.
Then, we need a mechanism to collect and forward a piece of telemetry.
Often, there is some kind of agent to further enrich, process and batch data before ingesting it in a backend.
This typically involves a database to efficiently store, index and search large volumes of data.
Finally, there is an analysis frontend to make the data accessible to the end user.
However, in practice, we develop dedicated systems for each type of telemetry, and for good reason:
Each telemetry signal poses its own unique technical challenge.
This is mainly due to the different nature of the data.
The design of data models, interchange formats, and transmission protocols highly depends on whether you are dealing with un- or semi-structured textual information, compact numerical values inside a time series, or graph-like structures depicting causality between events.
Even for a single signal, there is no consensus on these kinds of topics.
Furthermore, the way we work with and derive insights from telemetry varies dramatically.
A system might need to perform full-text searches, inspect single events, analyze historical trends, visualize request flow, diagnose performance bottlenecks, and more.
These requirements manifest themselves in the design and optimizations of storage, access patterns, query capabilities and more.
When addressing these technical challenges, [vertical integration](https://en.wikipedia.org/wiki/Vertical_integration) emerges as a pragmatic solution.
In practice, observability vendors narrow the scope of the problem to a single signal and provide instrumentation to generate *and* tools to analyze telemetry, as a single, fully integrated solution.

{{< figure src="images/three_pillars_of_observability.drawio.png" width=400 caption="The three pillars of observability, including metrics, traces and logs" >}}

Having dedicated systems for logs, metrics, and traces is why we commonly refer to them as the *three pillars of observability*.
The notion of pillars provides a great mental framework because it emphasizes that:
- there are different categories of telemetry
- each pillar has its own unique strengths and stands on its own
- pillars are complementary / must be combined to form a stable foundation for achieving observability

