diff --git a/2023-11-16_Meet_Vector/README.md b/2023-11-16_Meet_Vector/README.md
index c1dc7cd9..5c443f0b 100644
--- a/2023-11-16_Meet_Vector/README.md
+++ b/2023-11-16_Meet_Vector/README.md
@@ -94,7 +94,7 @@ del(.message)

4️⃣ Remove `message` to avoid paying twice, for both the raw and the structured signal.

-/!\ This program will fail on error meaning that vector will stop on error.
+⚠️ This program will fail on error, meaning that Vector will stop on error.

It is also possible to handle errors in VRL, but it comes at the cost of repeating the same error handling every time.

diff --git a/2023-11-23_Vector_in_action/README.md b/2023-11-23_Vector_in_action/README.md
index ddad4b17..ce2cbacf 100644
--- a/2023-11-23_Vector_in_action/README.md
+++ b/2023-11-23_Vector_in_action/README.md
@@ -67,7 +67,7 @@ del(.message)

4️⃣ Remove `message` to avoid paying twice, for both the raw and the structured signal.

-/!\ This program will fail on error meaning that vector will stop on error.
+⚠️ This program will fail on error, meaning that Vector will stop on error.

It is also possible to handle errors in VRL, but it comes at the cost of repeating the same error handling every time.

diff --git a/2023-11-30_What_is_OpenTelemetry/README.md b/2023-11-30_What_is_OpenTelemetry/README.md
index a6fd1dc8..bc240947 100644
--- a/2023-11-30_What_is_OpenTelemetry/README.md
+++ b/2023-11-30_What_is_OpenTelemetry/README.md
@@ -80,7 +80,7 @@ gRPC has been chosen from OpenTelemetry for performance, tooling and specificati

An excellent Medium post explains perfectly why JSON and REST would not be a good choice over gRPC: https://medium.com/data-science-community-srm/json-is-incredibly-slow-heres-what-s-faster-ca35d5aaf9e8

-/!\ OTLP does not use protocol buffer streaming at all. Another post will be done later on this topic about missed opportunities. Pay close attention that large payload or long workflow cannot be integrated in OTLP directly. Another usecase, for large logs cannot fit in such configuration since all the data should be sent at once causing large memory allocation on backend and collectors.
+⚠️ OTLP does not use protocol buffer streaming at all. Another post will cover this topic and its missed opportunities later. Pay close attention: large payloads or long workflows cannot be integrated in OTLP directly. Likewise, large logs do not fit in such a configuration, since all the data must be sent at once, causing large memory allocations on backends and collectors.

### gRPC

@@ -105,7 +105,7 @@ To solve this, automatic instrumentation can be done by using framework integrat

This instrumentation can also produce metrics and traces because [doing everything with logs can be an antipattern](../2023-10-18_What_is_not_an_observability_solution/What_is_not_an_o11y_solution.md#all-you-need-is-logs).

-/!\ There is no magic, instrumentation collects and aggregate data, consumes CPU and Memory with an overhead. Also, if the storage used is the memory, which is good for low overhead, it means that all signals are lost when the process panic. Such solutions are best effort. For instance, SLA can be different between signals and logs can be flushed to the disk (and even synchronously or asynchonously depending SLA) while metrics and traces might not.
+⚠️ There is no magic: instrumentation collects and aggregates data, consuming CPU and memory as overhead. Also, if the storage used is memory, which is good for low overhead, it means that all signals are lost when the process panics. Such solutions are best effort. For instance, SLAs can differ between signals: logs can be flushed to disk (synchronously or asynchronously depending on the SLA) while metrics and traces might not be.

### Manual

diff --git a/2023-12-07_Meet_Graphite/README.md b/2023-12-07_Meet_Graphite/README.md
new file mode 100644
index 00000000..55a92b94
--- /dev/null
+++ b/2023-12-07_Meet_Graphite/README.md
@@ -0,0 +1,165 @@
+# 2023-12-07 #8 Meet Graphite
+
+Graphite was created at Orbitz (2006), an online travel company, to monitor and support its growth.
+
+Other Time Series Databases ([TSDB](https://en.wikipedia.org/wiki/Time_series_database)) already existed before, like [RRDtool](https://en.wikipedia.org/wiki/RRDtool).
+
+Why did Orbitz decide not to use [RRDtool](https://oss.oetiker.ch/rrdtool/) + [Cacti](https://www.cacti.net/) and create Graphite instead?
+
+Is Graphite still worth using compared to other new and existing solutions?
+
+## Why Graphite ?
+Reference: https://graphite.readthedocs.io/en/latest/faq.html#does-graphite-use-rrdtool
+
+A problem with RRDtool is that it does not really support a temporary absence of data (null/nil/None): it stores zero `0` instead, which is a good default for some usages but not for others.
+
+> How to calculate throughput? If the latency drops to `0`, is the throughput infinite?
+
+For this `null` latency use case, storing `0` is not a good trade-off at all, and this is the first reason why the Orbitz team decided to create Graphite ([before 2006](https://graphite.readthedocs.io/en/latest/faq.html#does-graphite-use-rrdtool)).
+
+## What is Graphite ?
+References:
+- https://graphite.readthedocs.io/en/latest/faq.html#what-is-graphite
+- https://graphite.readthedocs.io/en/latest/overview.html#about-the-project
+
+Usually, TSDB backends are used for non-functional requirements like requests per second, ...
+
+[According to the use case below](https://graphite.readthedocs.io/en/latest/faq.html#who-should-use-graphite), Graphite can also be suitable for measuring business values:
+
+> "For example, Graphite would be good at graphing stock prices because they are numbers that change over time."
+
+The [Prometheus comparison](https://prometheus.io/docs/introduction/comparison/#summary) highlights this use case:
+
+> "Prometheus offers a richer data model and query language, in addition to being easier to run and integrate into your environment. If you want a clustered solution that can hold historical data long term, Graphite may be a better choice."
+
+The [monotonicity and temporality post](../2023-11-09_Monotonicity/README.md#cumulative-vs-delta) illustrates this fact and its trade-offs.
+
+Graphite is not a monolith: it is composed of multiple components such as [carbon](https://graphite.readthedocs.io/en/latest/carbon-daemons.html).
+
+## Quickstart
+Reference: https://graphite.readthedocs.io/en/latest/install.html#docker
+
+Using [graphite with docker](https://graphite.readthedocs.io/en/latest/install.html#docker) is the easiest way to test Graphite quickly.
+
+The docker image is not production ready though: many components are installed by default to make it easy to use for development, but not for production.
+
+A demo with other backends is available in a [previous post demo](../2023-11-09_Monotonicity/demo/README.md#context).
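+
+For reference, the all-in-one image from the install page above can be started like this (a quickstart sketch: the `graphiteapp/graphite-statsd` image and the port mappings are the ones documented there and may differ in your setup):
+
+```bash
+# graphite-web on 80, carbon plaintext/pickle receivers on 2003-2004,
+# carbon aggregator on 2023-2024, statsd on 8125/udp and its admin port on 8126.
+docker run -d \
+  --name graphite \
+  --restart=always \
+  -p 80:80 \
+  -p 2003-2004:2003-2004 \
+  -p 2023-2024:2023-2024 \
+  -p 8125:8125/udp \
+  -p 8126:8126 \
+  graphiteapp/graphite-statsd
+```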
+
+## Architecture and Scalability
+
+Graphite has been forked and updated over time to support scalability at different scopes, since the project has a long history going back to 2006.
+
+Projects:
+- [Graphite](https://github.com/graphite-project)
+- [Go Graphite](https://github.com/go-graphite)
+
+An excellent (old) post from Teads explains how to scale Graphite: https://medium.com/teads-engineering/scaling-graphite-in-a-cloud-environment-6a92fb495e5
+
+Graphite can be viewed as a backend or as a protocol, and other backends such as Prometheus, Mimir or VictoriaMetrics are compatible with it, but with a different aggregation temporality, which can conflict with the main feature of Graphite ([long lived cumulative counters](../2023-11-09_Monotonicity/demo/README.md#long-lived-cumulative-counter)).
+
+⚠️ Not all backends are fully compliant with [long lived counters](../2023-11-09_Monotonicity/demo/README.md#long-lived-cumulative-counter). If this feature matters, it is important to scale the data storage first, or the other Graphite components, as the [Go Graphite](https://github.com/go-graphite) project does.
+
+### whisper
+Reference: https://github.com/graphite-project/whisper
+
+Differences with RRD: https://graphite.readthedocs.io/en/latest/whisper.html#differences-between-whisper-and-rrd
+
+Whisper is the default TSDB shipped with Graphite. Graphite can support [many more TSDBs](https://graphite.readthedocs.io/en/1.1.8/tools.html#storage-backend-alternates) with different trade-offs (ClickHouse, InfluxDB, ...).
+
+### carbon
+References:
+- https://github.com/graphite-project/carbon
+- https://graphite.readthedocs.io/en/stable/carbon-daemons.html
+
+Carbon is the write path of the metrics signal. It serves different purposes, such as:
+
+- Replicate and shard writes to the backend (ie: whisper)
+- Rewrite metrics
+- Allow or block metrics
+- Aggregate metrics
+
+### graphite-web
+Reference: https://github.com/graphite-project/graphite-web
+
+As opposed to carbon, graphite-web is responsible for the metric read path. This component serves the API and graph visualization.
+
+Usually, only the [api](https://graphite-api.readthedocs.io/en/latest/) part of graphite-web is used, in conjunction with a frontend like [grafana](https://grafana.com/grafana/).
+
+## Protocol
+Reference: https://graphite.readthedocs.io/en/latest/feeding-carbon.html
+
+Carbon supports many protocols, but the most used is the straightforward plain text protocol.
+
+### Plain Text
+`<metric path> <metric value> <metric timestamp>`
+
+```bash
+PORT=2003
+SERVER=graphite.your.org
+echo "local.random.diceroll 4 `date +%s`" | nc ${SERVER} ${PORT}
+```
+
+### Labels
+Reference: https://graphite.readthedocs.io/en/latest/tags.html
+
+Depending on the backend configuration, the `<metric path>` can contain tags (aka labels):
+
+`my.series;tag1=value1;tag2=value2`
+
+## StatsD
+Reference: https://www.etsy.com/codeascraft/measure-anything-measure-everything/
+
+[StatsD](https://github.com/statsd/statsd) was [created by Etsy](https://www.etsy.com/codeascraft/measure-anything-measure-everything/) to send metrics without performance overhead, and without impacting the SLA when the metrics backend is down. By simply using UDP to send metrics to StatsD, the observed application is no longer responsible for managing state and is decoupled from the metrics backend, which is good if the SLAs are different. StatsD also reduces the rate and sends data at a given resolution (ie: 10s), as illustrated in the sketch below.
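+
+A minimal sketch of this decoupling and aggregation, assuming a local StatsD on its default UDP port 8125 with the default 10s flush interval (the metric name is illustrative; the plain-text format is described just below):
+
+```bash
+# Fire-and-forget: each echo returns immediately, even if StatsD or Graphite is down.
+# Within a single 10s flush window, StatsD aggregates the three increments into one
+# datapoint (value 3) and forwards it to carbon at flush time.
+for i in 1 2 3; do
+  echo "checkout.completed:1|c" | nc -u -w0 127.0.0.1 8125
+done
+```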
+
+The protocol is not the same as the Graphite one, but it is simpler and still plain text: `<metric name>:<value>|<type>`
+
+```bash
+echo "foo:1|c" | nc -u -w0 127.0.0.1 8125
+```
+
+A demo is available from this previous post: [graphite + statsd vs other backends](../2023-11-09_Monotonicity/demo/README.md#context) with [this statsd udp configuration](../2023-11-09_Monotonicity/demo/graphite/statsd/udp.js).
+
+## Archiving old data
+Reference: https://graphite.readthedocs.io/en/latest/whisper.html#archives-retention-and-precision
+
+Optimizing space over time is crucial. Data can simply be deleted or compressed. Compression can be lossless or lossy and, depending on the use case, supporting both can be a good idea.
+
+It is possible to set up lossy compression by increasing the resolution period of datapoints. A datapoint can be kept at a resolution of 10s for the last 3 months, then at 1 minute, reducing space by a factor of 6 (60s / 10s). A minimal retention sketch is shown at the end of this post.
+
+## Telemetry temporality
+
+As mentioned in [OpenTelemetry metrics temporality](../2023-11-30_What_is_OpenTelemetry/README.md#metrics) and the [Monotonicity demo](../2023-11-09_Monotonicity/demo/README.md#long-lived-cumulative-counter), Graphite is a delta metrics temporality backend which supports [long lived cumulative counters](../2023-11-09_Monotonicity/demo/README.md#long-lived-cumulative-counter).
+
+## Additional tools
+
+### Grafana
+Reference: https://grafana.com/docs/grafana/latest/datasources/graphite/
+
+Grafana comes from the concatenation of 2 words, `Graphite` and `Kibana`, to make Graphite visualization as smooth as possible.
+
+The main difference between Grafana and its competitors is that datasources are queried (and cached) in place, without requiring a full synchronization, which would impact resources and costs.
+
+Grafana offers the best integration for Graphite since it was created for it in the first place.
+
+Regarding the OTLP and Prometheus Grafana integrations, Grafana metrics backends like Mimir only support cumulative metrics, while Graphite is a true delta metrics backend. A dedicated post compares [the pros and cons of delta and cumulative temporality](../2023-11-09_Monotonicity/README.md).
+
+A dedicated post will be created later for Grafana.
+
+### Datadog
+Reference: https://www.datadoghq.com/blog/dogstatsd-mapper/
+
+Datadog has a centralized model where all the data must be stored inside its database, which is a bit different from Grafana, where you can choose, via a [collector](../2023-11-30_What_is_OpenTelemetry/README.md#collector), to sync data or to fetch and cache it.
+
+Datadog is a drop-in solution for Graphite, but it seems to support delta temporality.
+
+A dedicated post will be created later for Datadog.
+
+## Backends comparison
+[Graphite vs VictoriaMetrics vs Prometheus vs Mimir demo from previous post](../2023-11-09_Monotonicity/demo/README.md#datapoints-visualization-comparison)
+
+## Conclusion
+Graphite and all the middleware/TSDBs around it (StatsD, ClickHouse, ...) have changed significantly to support labels and scalability. In the meantime, Prometheus won the battle for observability and rate monitoring, while delta modes and other use cases are not fully covered by those alternatives.
+
+As mentioned by the Prometheus team, Graphite is best at supporting [long lived cumulative counters](../2023-11-09_Monotonicity/demo/README.md#long-lived-cumulative-counter) with few labels.
+
+As soon as scalability becomes important for metrics, labels and pure observability, other solutions should be considered.
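+
+As referenced in the archiving section, here is a minimal retention sketch using carbon's `storage-schemas.conf` (the `[diceroll]` schema name, the pattern and the `/opt/graphite/conf` path are illustrative; the path matches the official docker image defaults):
+
+```bash
+# Keep 10s datapoints for 90 days, then 1-minute datapoints for 5 years (space divided by 6).
+# Carbon applies the first matching pattern when it creates a new whisper file;
+# storage-aggregation.conf controls how points are merged when downsampling (average by default).
+cat >> /opt/graphite/conf/storage-schemas.conf <<'EOF'
+[diceroll]
+pattern = ^local\.random\.
+retentions = 10s:90d,1min:5y
+EOF
+```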
\ No newline at end of file
diff --git a/README.md b/README.md
index 457dba99..47e4001c 100644
--- a/README.md
+++ b/README.md
@@ -2,12 +2,13 @@
Welcome to o11y weekly (observability weekly!). Here is the latest news and feedback about observability.

-This week, let's see [what is OpenTelemetry](./2023-11-30_What_is_OpenTelemetry/README.md) and as a conclusion what are its limitations ?
+After [OpenTelemetry](./2023-11-30_What_is_OpenTelemetry/README.md), this week let's [Meet Graphite](./2023-12-07_Meet_Graphite/README.md) and see whether, and when, it is still worth using over other solutions and protocols.

# Last Post
-[2023-11-30 #7 What is OpenTelemetry ?](./2023-11-30_What_is_OpenTelemetry/README.md)
+[2023-12-07 #8 Meet Graphite](./2023-12-07_Meet_Graphite/README.md)

## Archives
+- [2023-11-30 #7 What is OpenTelemetry ?](./2023-11-30_What_is_OpenTelemetry/README.md)
- [2023-11-23 #6 Vector in action](./2023-11-23_Vector_in_action/README.md)
- [2023-11-16 #5 Meet Vector](./2023-11-16_Meet_Vector/README.md)
- [2023-11-09 #4 Monotonicity](./2023-11-09_Monotonicity/README.md)
diff --git a/_DRAFT/2023-12_Meet_Graphite.md b/_DRAFT/2024-09_Meet_Grafana.md
similarity index 100%
rename from _DRAFT/2023-12_Meet_Graphite.md
rename to _DRAFT/2024-09_Meet_Grafana.md
diff --git a/_DRAFT/2024-XX_APM_Magic_Quadrant.md b/_DRAFT/2024-XX_APM_Magic_Quadrant.md
new file mode 100644
index 00000000..e69de29b
diff --git a/_DRAFT/2024-XX_Datadog.md b/_DRAFT/2024-XX_Datadog.md
new file mode 100644
index 00000000..e69de29b
diff --git a/_DRAFT/2024-XX_Datalifecycle.md b/_DRAFT/2024-XX_Datalifecycle.md
new file mode 100644
index 00000000..8ae4f105
--- /dev/null
+++ b/_DRAFT/2024-XX_Datalifecycle.md
@@ -0,0 +1 @@
+Data lifecycle, compaction and cleanup policies
\ No newline at end of file
diff --git a/_DRAFT/2024-XX_Grafana_Agent.md b/_DRAFT/2024-XX_Grafana_Agent.md
new file mode 100644
index 00000000..e69de29b
diff --git a/_DRAFT/2024_XX_Observability_vs_Dataplatform.md b/_DRAFT/2024_XX_Observability_vs_Dataplatform.md
new file mode 100644
index 00000000..d634442a
--- /dev/null
+++ b/_DRAFT/2024_XX_Observability_vs_Dataplatform.md
@@ -0,0 +1,7 @@
+How are observability backends built? An observability backend architecture is simply an OLAP data platform with opinionated trade-offs.
+
+https://www.influxdata.com/blog/introduction-apache-arrow/
+
+InfluxDB IOx with Rust and Apache Arrow
+
+https://github.com/metrico/influxdb_iox/commit/ab17bbc9efbb8568ea5a95ccb9d4bbddd33fc9ea
\ No newline at end of file