From cfad180775b4c8e2c92f41d52191e8e33c172f52 Mon Sep 17 00:00:00 2001 From: Max Lemieux Date: Tue, 26 Nov 2024 14:36:36 -0800 Subject: [PATCH 01/15] add --formula flag to reinstall command (#2163) this prevents reinstalling the unrelated alloy cask --- docs/sources/configure/macos.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/sources/configure/macos.md b/docs/sources/configure/macos.md index fe6fb03507..3285565649 100644 --- a/docs/sources/configure/macos.md +++ b/docs/sources/configure/macos.md @@ -47,7 +47,7 @@ To customize the {{< param "PRODUCT_NAME" >}} service on macOS, perform the foll 1. Reinstall the {{< param "PRODUCT_NAME" >}} Formula by running the following command in a terminal: ```shell - brew reinstall alloy + brew reinstall --formula alloy ``` 1. Restart the {{< param "PRODUCT_NAME" >}} service by running the command in a terminal: From cc383c1edf988fd4763582c86a2e4b85bcc0f055 Mon Sep 17 00:00:00 2001 From: Cristian Greco Date: Wed, 27 Nov 2024 12:00:53 +0100 Subject: [PATCH 02/15] database_observability: additional configuration and cleanup: (#2171) - update CHANGELOG to mention new component - add query_samples_enabled argument - show only redacted samples - improve logging --- CHANGELOG.md | 14 +++--- .../database_observability.mysql.md | 8 ++-- .../database_observability.go | 3 ++ .../mysql/collector/query_sample.go | 22 ++++----- .../mysql/collector/query_sample_test.go | 5 ++- .../mysql/collector/schema_table.go | 7 +-- .../mysql/collector/schema_table_test.go | 3 +- .../database_observability/mysql/component.go | 45 +++++++++++-------- 8 files changed, 62 insertions(+), 45 deletions(-) create mode 100644 internal/component/database_observability/database_observability.go diff --git a/CHANGELOG.md b/CHANGELOG.md index f8a3380282..5196871581 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -22,6 +22,8 @@ Main (unreleased) - Add `otelcol.exporter.syslog` component to export logs in syslog format (@dehaansa) +- 
(_Experimental_) Add a `database_observability.mysql` component to collect mysql performance data. + ### Enhancements - Add second metrics sample to the support bundle to provide delta information (@dehaansa) @@ -39,7 +41,7 @@ Main (unreleased) - Fixed an issue in the `otelcol.processor.attribute` component where the actions `delete` and `hash` could not be used with the `pattern` argument. (@wildum) -- Fixed a race condition that could lead to a deadlock when using `import` statements, which could lead to a memory leak on `/metrics` endpoint of an Alloy instance. (@thampiotr) +- Fixed a race condition that could lead to a deadlock when using `import` statements, which could lead to a memory leak on `/metrics` endpoint of an Alloy instance. (@thampiotr) - Fix a race condition where the ui service was dependent on starting after the remotecfg service, which is not guaranteed. (@dehaansa & @erikbaranowski) @@ -97,7 +99,7 @@ v1.5.0 - Add support for relative paths to `import.file`. This new functionality allows users to use `import.file` blocks in modules imported via `import.git` and other `import.file`. (@wildum) -- `prometheus.exporter.cloudwatch`: The `discovery` block now has a `recently_active_only` configuration attribute +- `prometheus.exporter.cloudwatch`: The `discovery` block now has a `recently_active_only` configuration attribute to return only metrics which have been active in the last 3 hours. - Add Prometheus bearer authentication to a `prometheus.write.queue` component (@freak12techno) @@ -110,9 +112,9 @@ v1.5.0 - Fixed a bug in `import.git` which caused a `"non-fast-forward update"` error message. (@ptodev) -- Do not log error on clean shutdown of `loki.source.journal`. (@thampiotr) +- Do not log error on clean shutdown of `loki.source.journal`. 
(@thampiotr) -- `prometheus.operator.*` components: Fixed a bug which would sometimes cause a +- `prometheus.operator.*` components: Fixed a bug which would sometimes cause a "failed to create service discovery refresh metrics" error after a config reload. (@ptodev) ### Other changes @@ -151,7 +153,7 @@ v1.4.3 - `pyroscope.scrape` no longer tries to scrape endpoints which are not active targets anymore. (@wildum @mattdurham @dehaansa @ptodev) -- Fixed a bug with `loki.source.podlogs` not starting in large clusters due to short informer sync timeout. (@elburnetto-intapp) +- Fixed a bug with `loki.source.podlogs` not starting in large clusters due to short informer sync timeout. (@elburnetto-intapp) - `prometheus.exporter.windows`: Fixed bug with `exclude` regular expression config arguments which caused missing metrics. (@ptodev) @@ -170,7 +172,7 @@ v1.4.2 - Fix parsing of the Level configuration attribute in debug_metrics config block - Ensure "optional" debug_metrics config block really is optional -- Fixed an issue with `loki.process` where `stage.luhn` and `stage.timestamp` would not apply +- Fixed an issue with `loki.process` where `stage.luhn` and `stage.timestamp` would not apply default configuration settings correctly (@thampiotr) - Fixed an issue with `loki.process` where configuration could be reloaded even if there diff --git a/docs/sources/reference/components/database_observability/database_observability.mysql.md b/docs/sources/reference/components/database_observability/database_observability.mysql.md index 2bfc85953e..e51af705f8 100644 --- a/docs/sources/reference/components/database_observability/database_observability.mysql.md +++ b/docs/sources/reference/components/database_observability/database_observability.mysql.md @@ -25,9 +25,10 @@ The following arguments are supported: | Name | Type | Description | Default | Required | | -------------------- | -------------- | 
------------------------------------------------------------------------------------------------------------------- | ------- | -------- | -| `data_source_name` | `secret` | [Data Source Name](https://github.com/go-sql-driver/mysql#dsn-data-source-name) for the MySQL server to connect to. | | yes | -| `forward_to` | `list(LogsReceiver)` | Where to forward log entries after processing. | | yes | -| `collect_interval` | `duration` | How frequently to collect query samples from database | `"10s"` | no | +| `data_source_name` | `secret` | [Data Source Name](https://github.com/go-sql-driver/mysql#dsn-data-source-name) for the MySQL server to connect to. | | yes | +| `forward_to` | `list(LogsReceiver)` | Where to forward log entries after processing. | | yes | +| `collect_interval` | `duration` | How frequently to collect information from database | `"10s"` | no | +| `query_samples_enabled` | `bool` | Whether to enable collection of query samples | `true` | no | ## Blocks @@ -67,7 +68,6 @@ loki.write "logs_service" { } } ``` - ## Compatible components diff --git a/internal/component/database_observability/database_observability.go b/internal/component/database_observability/database_observability.go new file mode 100644 index 0000000000..176aedc5eb --- /dev/null +++ b/internal/component/database_observability/database_observability.go @@ -0,0 +1,3 @@ +package database_observability + +const JobName = "integrations/db-o11y" diff --git a/internal/component/database_observability/mysql/collector/query_sample.go b/internal/component/database_observability/mysql/collector/query_sample.go index bb8d354385..8632f57085 100644 --- a/internal/component/database_observability/mysql/collector/query_sample.go +++ b/internal/component/database_observability/mysql/collector/query_sample.go @@ -12,6 +12,7 @@ import ( "github.com/xwb1989/sqlparser" "github.com/grafana/alloy/internal/component/common/loki" + "github.com/grafana/alloy/internal/component/database_observability" 
"github.com/grafana/alloy/internal/runtime/logging/level" "github.com/grafana/loki/v3/pkg/logproto" ) @@ -108,27 +109,28 @@ func (c *QuerySample) fetchQuerySamples(ctx context.Context) error { var digest, query_sample_text, query_sample_seen, query_sample_timer_wait string err := rs.Scan(&digest, &query_sample_text, &query_sample_seen, &query_sample_timer_wait) if err != nil { - level.Error(c.logger).Log("msg", "failed to scan query samples", "err", err) + level.Error(c.logger).Log("msg", "failed to scan query samples", "digest", digest, "err", err) break } - redacted, err := sqlparser.RedactSQLQuery(query_sample_text) + query_sample_redacted, err := sqlparser.RedactSQLQuery(query_sample_text) if err != nil { - level.Error(c.logger).Log("msg", "failed to redact sql query", "err", err) + level.Error(c.logger).Log("msg", "failed to redact sql query", "digest", digest, "err", err) + break } c.entryHandler.Chan() <- loki.Entry{ - Labels: model.LabelSet{"job": "integrations/db-o11y"}, + Labels: model.LabelSet{"job": database_observability.JobName}, Entry: logproto.Entry{ Timestamp: time.Unix(0, time.Now().UnixNano()), - Line: fmt.Sprintf(`level=info msg="query samples fetched" op="%s" digest="%s" query_sample_text="%s" query_sample_seen="%s" query_sample_timer_wait="%s" query_redacted="%s"`, OP_QUERY_SAMPLE, digest, query_sample_text, query_sample_seen, query_sample_timer_wait, redacted), + Line: fmt.Sprintf(`level=info msg="query samples fetched" op="%s" digest="%s" query_sample_seen="%s" query_sample_timer_wait="%s" query_sample_redacted="%s"`, OP_QUERY_SAMPLE, digest, query_sample_seen, query_sample_timer_wait, query_sample_redacted), }, } - tables := c.tablesFromQuery(query_sample_text) + tables := c.tablesFromQuery(digest, query_sample_text) for _, table := range tables { c.entryHandler.Chan() <- loki.Entry{ - Labels: model.LabelSet{"job": "integrations/db-o11y"}, + Labels: model.LabelSet{"job": database_observability.JobName}, Entry: logproto.Entry{ Timestamp: 
time.Unix(0, time.Now().UnixNano()), Line: fmt.Sprintf(`level=info msg="table name parsed" op="%s" digest="%s" table="%s"`, OP_QUERY_PARSED_TABLE_NAME, digest, table), @@ -140,15 +142,15 @@ func (c *QuerySample) fetchQuerySamples(ctx context.Context) error { return nil } -func (c QuerySample) tablesFromQuery(query string) []string { +func (c QuerySample) tablesFromQuery(digest, query string) []string { if strings.HasSuffix(query, "...") { - level.Info(c.logger).Log("msg", "skipping parsing truncated query") + level.Info(c.logger).Log("msg", "skipping parsing truncated query", "digest", digest) return []string{} } stmt, err := sqlparser.Parse(query) if err != nil { - level.Error(c.logger).Log("msg", "failed to parse sql query", "err", err) + level.Error(c.logger).Log("msg", "failed to parse sql query", "digest", digest, "err", err) return []string{} } diff --git a/internal/component/database_observability/mysql/collector/query_sample_test.go b/internal/component/database_observability/mysql/collector/query_sample_test.go index f8ed1d03b1..8288a821ef 100644 --- a/internal/component/database_observability/mysql/collector/query_sample_test.go +++ b/internal/component/database_observability/mysql/collector/query_sample_test.go @@ -7,6 +7,7 @@ import ( "time" loki_fake "github.com/grafana/alloy/internal/component/common/loki/client/fake" + "github.com/grafana/alloy/internal/component/database_observability" "github.com/prometheus/common/model" "go.uber.org/goleak" @@ -59,9 +60,9 @@ func TestQuerySample(t *testing.T) { lokiEntries := lokiClient.Received() for _, entry := range lokiEntries { - require.Equal(t, model.LabelSet{"job": "integrations/db-o11y"}, entry.Labels) + require.Equal(t, model.LabelSet{"job": database_observability.JobName}, entry.Labels) } - require.Equal(t, `level=info msg="query samples fetched" op="query_sample" digest="abc123" query_sample_text="select * from some_table where id = 1" query_sample_seen="2024-01-01T00:00:00.000Z" 
query_sample_timer_wait="1000" query_redacted="select * from some_table where id = :redacted1"`, lokiEntries[0].Line) + require.Equal(t, `level=info msg="query samples fetched" op="query_sample" digest="abc123" query_sample_seen="2024-01-01T00:00:00.000Z" query_sample_timer_wait="1000" query_sample_redacted="select * from some_table where id = :redacted1"`, lokiEntries[0].Line) require.Equal(t, `level=info msg="table name parsed" op="query_parsed_table_name" digest="abc123" table="some_table"`, lokiEntries[1].Line) err = mock.ExpectationsWereMet() diff --git a/internal/component/database_observability/mysql/collector/schema_table.go b/internal/component/database_observability/mysql/collector/schema_table.go index d4f59e8a28..e9de0e4392 100644 --- a/internal/component/database_observability/mysql/collector/schema_table.go +++ b/internal/component/database_observability/mysql/collector/schema_table.go @@ -12,6 +12,7 @@ import ( "github.com/prometheus/common/model" "github.com/grafana/alloy/internal/component/common/loki" + "github.com/grafana/alloy/internal/component/database_observability" "github.com/grafana/alloy/internal/runtime/logging/level" ) @@ -146,7 +147,7 @@ func (c *SchemaTable) extractSchema(ctx context.Context) error { schemas = append(schemas, schema) c.entryHandler.Chan() <- loki.Entry{ - Labels: model.LabelSet{"job": "integrations/db-o11y"}, + Labels: model.LabelSet{"job": database_observability.JobName}, Entry: logproto.Entry{ Timestamp: time.Unix(0, time.Now().UnixNano()), Line: fmt.Sprintf(`level=info msg="schema detected" op="%s" schema="%s"`, OP_SCHEMA_DETECTION, schema), @@ -179,7 +180,7 @@ func (c *SchemaTable) extractSchema(ctx context.Context) error { tables = append(tables, tableInfo{schema: schema, tableName: table, createTime: createTime, updateTime: updateTime}) c.entryHandler.Chan() <- loki.Entry{ - Labels: model.LabelSet{"job": "integrations/db-o11y"}, + Labels: model.LabelSet{"job": database_observability.JobName}, Entry: 
logproto.Entry{ Timestamp: time.Unix(0, time.Now().UnixNano()), Line: fmt.Sprintf(`level=info msg="table detected" op="%s" schema="%s" table="%s"`, OP_TABLE_DETECTION, schema, table), @@ -215,7 +216,7 @@ func (c *SchemaTable) extractSchema(ctx context.Context) error { c.cache.Add(cacheKey, table) c.entryHandler.Chan() <- loki.Entry{ - Labels: model.LabelSet{"job": "integrations/db-o11y"}, + Labels: model.LabelSet{"job": database_observability.JobName}, Entry: logproto.Entry{ Timestamp: time.Unix(0, time.Now().UnixNano()), Line: fmt.Sprintf(`level=info msg="create table" op="%s" schema="%s" table="%s" create_statement="%s"`, OP_CREATE_STATEMENT, table.schema, table.tableName, createStmt), diff --git a/internal/component/database_observability/mysql/collector/schema_table_test.go b/internal/component/database_observability/mysql/collector/schema_table_test.go index eb32b50585..7bf29a2977 100644 --- a/internal/component/database_observability/mysql/collector/schema_table_test.go +++ b/internal/component/database_observability/mysql/collector/schema_table_test.go @@ -9,6 +9,7 @@ import ( "github.com/DATA-DOG/go-sqlmock" "github.com/go-kit/log" loki_fake "github.com/grafana/alloy/internal/component/common/loki/client/fake" + "github.com/grafana/alloy/internal/component/database_observability" "github.com/prometheus/common/model" "github.com/stretchr/testify/require" "go.uber.org/goleak" @@ -76,7 +77,7 @@ func TestSchemaTable(t *testing.T) { lokiEntries := lokiClient.Received() for _, entry := range lokiEntries { - require.Equal(t, model.LabelSet{"job": "integrations/db-o11y"}, entry.Labels) + require.Equal(t, model.LabelSet{"job": database_observability.JobName}, entry.Labels) } require.Equal(t, `level=info msg="schema detected" op="schema_detection" schema="some_schema"`, lokiEntries[0].Line) require.Equal(t, `level=info msg="table detected" op="table_detection" schema="some_schema" table="some_table"`, lokiEntries[1].Line) diff --git 
a/internal/component/database_observability/mysql/component.go b/internal/component/database_observability/mysql/component.go index 35ed0d8857..14a7bf1a5a 100644 --- a/internal/component/database_observability/mysql/component.go +++ b/internal/component/database_observability/mysql/component.go @@ -17,6 +17,7 @@ import ( "github.com/grafana/alloy/internal/component" "github.com/grafana/alloy/internal/component/common/loki" + "github.com/grafana/alloy/internal/component/database_observability" "github.com/grafana/alloy/internal/component/database_observability/mysql/collector" "github.com/grafana/alloy/internal/component/discovery" "github.com/grafana/alloy/internal/featuregate" @@ -46,14 +47,18 @@ var ( _ syntax.Validator = (*Arguments)(nil) ) +// TODO(cristian) consider using something like "enabled_collectors" +// to allow users to enable/disable collectors. type Arguments struct { - DataSourceName alloytypes.Secret `alloy:"data_source_name,attr"` - CollectInterval time.Duration `alloy:"collect_interval,attr,optional"` - ForwardTo []loki.LogsReceiver `alloy:"forward_to,attr"` + DataSourceName alloytypes.Secret `alloy:"data_source_name,attr"` + CollectInterval time.Duration `alloy:"collect_interval,attr,optional"` + QuerySamplesEnabled bool `alloy:"query_samples_enabled,attr,optional"` + ForwardTo []loki.LogsReceiver `alloy:"forward_to,attr"` } var DefaultArguments = Arguments{ - CollectInterval: 10 * time.Second, + CollectInterval: 10 * time.Second, + QuerySamplesEnabled: true, } func (a *Arguments) SetToDefault() { @@ -155,7 +160,7 @@ func (c *Component) getBaseTarget() (discovery.Target, error) { model.SchemeLabel: "http", model.MetricsPathLabel: path.Join(httpData.HTTPPathForComponent(c.opts.ID), "metrics"), "instance": c.instanceKey(), - "job": "integrations/db-o11y", + "job": database_observability.JobName, }, nil } @@ -194,21 +199,23 @@ func (c *Component) Update(args component.Arguments) error { entryHandler := loki.NewEntryHandler(c.handler.Chan(), func() 
{}) - qsCollector, err := collector.NewQuerySample(collector.QuerySampleArguments{ - DB: dbConnection, - CollectInterval: c.args.CollectInterval, - EntryHandler: entryHandler, - Logger: c.opts.Logger, - }) - if err != nil { - level.Error(c.opts.Logger).Log("msg", "failed to create QuerySample collector", "err", err) - return err - } - if err := qsCollector.Start(context.Background()); err != nil { - level.Error(c.opts.Logger).Log("msg", "failed to start QuerySample collector", "err", err) - return err + if c.args.QuerySamplesEnabled { + qsCollector, err := collector.NewQuerySample(collector.QuerySampleArguments{ + DB: dbConnection, + CollectInterval: c.args.CollectInterval, + EntryHandler: entryHandler, + Logger: c.opts.Logger, + }) + if err != nil { + level.Error(c.opts.Logger).Log("msg", "failed to create QuerySample collector", "err", err) + return err + } + if err := qsCollector.Start(context.Background()); err != nil { + level.Error(c.opts.Logger).Log("msg", "failed to start QuerySample collector", "err", err) + return err + } + c.collectors = append(c.collectors, qsCollector) } - c.collectors = append(c.collectors, qsCollector) stCollector, err := collector.NewSchemaTable(collector.SchemaTableArguments{ DB: dbConnection, From 3bc6bfeb2a72df64068834ae8cfaf8f30db6c0a3 Mon Sep 17 00:00:00 2001 From: Sam DeHaan Date: Wed, 27 Nov 2024 09:34:45 -0500 Subject: [PATCH 03/15] fix: fully prevent panic in remotecfg ui (#2164) * Fully prevent panic in remotecfg ui * Address PR feedback --- internal/web/api/api.go | 15 +++++++++++++-- 1 file changed, 13 insertions(+), 2 deletions(-) diff --git a/internal/web/api/api.go b/internal/web/api/api.go index 4e364e6333..53592b744a 100644 --- a/internal/web/api/api.go +++ b/internal/web/api/api.go @@ -66,7 +66,12 @@ func listComponentsHandlerRemoteCfg(host service.Host) http.HandlerFunc { return } - listComponentsHandlerInternal(svc.Data().(remotecfg.Data).Host, w, r) + data := svc.Data().(remotecfg.Data) + if data.Host == nil { + 
http.Error(w, "remote config service startup in progress", http.StatusInternalServerError) + return + } + listComponentsHandlerInternal(data.Host, w, r) } } @@ -108,7 +113,13 @@ func getComponentHandlerRemoteCfg(host service.Host) http.HandlerFunc { return } - getComponentHandlerInternal(svc.Data().(remotecfg.Data).Host, w, r) + data := svc.Data().(remotecfg.Data) + if data.Host == nil { + http.Error(w, "remote config service startup in progress", http.StatusInternalServerError) + return + } + + getComponentHandlerInternal(data.Host, w, r) } } From a319983e2a26cc4fd017a6e64a0bb4d8d0531e82 Mon Sep 17 00:00:00 2001 From: Piotr <17101802+thampiotr@users.noreply.github.com> Date: Wed, 27 Nov 2024 17:37:53 +0000 Subject: [PATCH 04/15] Fix deadlock due to infinite retry (#2174) * Fix deadlock due to infinite retry * changelog --- CHANGELOG.md | 3 +- .../runtime/internal/controller/loader.go | 32 +++++++++++++------ 2 files changed, 25 insertions(+), 10 deletions(-) diff --git a/CHANGELOG.md b/CHANGELOG.md index 5196871581..6e06980152 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -35,13 +35,14 @@ Main (unreleased) ### Bugfixes - Fixed an issue in the `prometheus.exporter.postgres` component that would leak goroutines when the target was not reachable (@dehaansa) + - Fixed an issue in the `otelcol.exporter.prometheus` component that would set series value incorrectly for stale metrics (@YusifAghalar) - Fixed issue with reloading configuration and prometheus metrics duplication in `prometheus.write.queue`. (@mattdurham) - Fixed an issue in the `otelcol.processor.attribute` component where the actions `delete` and `hash` could not be used with the `pattern` argument. (@wildum) -- Fixed a race condition that could lead to a deadlock when using `import` statements, which could lead to a memory leak on `/metrics` endpoint of an Alloy instance. 
(@thampiotr) +- Fixed a few race conditions that could lead to a deadlock when using `import` statements, which could lead to a memory leak on `/metrics` endpoint of an Alloy instance. (@thampiotr) - Fix a race condition where the ui service was dependent on starting after the remotecfg service, which is not guaranteed. (@dehaansa & @erikbaranowski) diff --git a/internal/runtime/internal/controller/loader.go b/internal/runtime/internal/controller/loader.go index fae75f5865..9af5919722 100644 --- a/internal/runtime/internal/controller/loader.go +++ b/internal/runtime/internal/controller/loader.go @@ -10,6 +10,12 @@ import ( "time" "github.com/go-kit/log" + "github.com/grafana/dskit/backoff" + "github.com/hashicorp/go-multierror" + "go.opentelemetry.io/otel/attribute" + "go.opentelemetry.io/otel/codes" + "go.opentelemetry.io/otel/trace" + "github.com/grafana/alloy/internal/featuregate" "github.com/grafana/alloy/internal/runtime/internal/dag" "github.com/grafana/alloy/internal/runtime/internal/worker" @@ -19,11 +25,6 @@ import ( "github.com/grafana/alloy/syntax/ast" "github.com/grafana/alloy/syntax/diag" "github.com/grafana/alloy/syntax/vm" - "github.com/grafana/dskit/backoff" - "github.com/hashicorp/go-multierror" - "go.opentelemetry.io/otel/attribute" - "go.opentelemetry.io/otel/codes" - "go.opentelemetry.io/otel/trace" ) // The Loader builds and evaluates ComponentNodes from Alloy blocks. @@ -92,10 +93,11 @@ func NewLoader(opts LoaderOptions) *Loader { componentNodeManager: NewComponentNodeManager(globals, reg), // This is a reasonable default which should work for most cases. If a component is completely stuck, we would - // retry and log an error every 10 seconds, at most. + // retry and log an error every 10 seconds, at most. We give up after some time to prevent lasting deadlocks. 
backoffConfig: backoff.Config{ MinBackoff: 1 * time.Millisecond, MaxBackoff: 10 * time.Second, + MaxRetries: 20, // Give up after 20 attempts - it could be a deadlock instead of an overload. }, graph: &dag.Graph{}, @@ -736,19 +738,31 @@ func (l *Loader) EvaluateDependants(ctx context.Context, updatedNodes []*QueuedN l.concurrentEvalFn(nodeRef, dependantCtx, tracer, parentRef) }) if err != nil { - level.Error(l.log).Log( - "msg", "failed to submit node for evaluation - Alloy is likely overloaded "+ - "and cannot keep up with evaluating components - will retry", + level.Warn(l.log).Log( + "msg", "failed to submit node for evaluation - will retry", "err", err, "node_id", n.NodeID(), "originator_id", parent.Node.NodeID(), "retries", retryBackoff.NumRetries(), ) + // When backing off, release the mut in case the evaluation requires to interact with the loader itself. + l.mut.RUnlock() retryBackoff.Wait() + l.mut.RLock() } else { break } } + if err != nil && !retryBackoff.Ongoing() { + level.Error(l.log).Log( + "msg", "retry attempts exhausted when submitting node for evaluation to the worker pool - "+ + "this could be a deadlock, performance bottleneck or severe overload leading to goroutine starvation", + "err", err, + "node_id", n.NodeID(), + "originator_id", parent.Node.NodeID(), + "retries", retryBackoff.NumRetries(), + ) + } span.SetAttributes(attribute.Int("retries", retryBackoff.NumRetries())) if err != nil { span.SetStatus(codes.Error, err.Error()) From c27c8ac8eb0edb46a7101f323ef9e65c33b998c6 Mon Sep 17 00:00:00 2001 From: Clayton Cornell <131809008+clayton-cornell@users.noreply.github.com> Date: Wed, 27 Nov 2024 11:03:54 -0800 Subject: [PATCH 05/15] Clean up some of the linting warnings and errors (#2155) * Clean up some of the linting warnings and errors * Additional linting warning and error cleanup * More work on removing linting errors * More linting cleanup * Even more linting warning cleanup * Fix links to components * Fix link syntax in topic * Correct 
reference to AWS X-Ray * Add missing link in collect topic * Fix up some redirected links and minor syntax fixes * Fix typo in file name * Apply suggestions from code review Co-authored-by: Beverly Buchanan <131809838+BeverlyJaneJ@users.noreply.github.com> --------- Co-authored-by: Beverly Buchanan <131809838+BeverlyJaneJ@users.noreply.github.com> --- docs/sources/collect/_index.md | 2 +- docs/sources/collect/choose-component.md | 13 +- .../sources/collect/datadog-traces-metrics.md | 42 +++--- ...etry-data.md => ecs-opentelemetry-data.md} | 22 ++-- docs/sources/collect/logs-in-kubernetes.md | 95 +++++++------- docs/sources/collect/metamonitoring.md | 45 +++---- docs/sources/collect/opentelemetry-data.md | 54 ++++---- .../collect/opentelemetry-to-lgtm-stack.md | 120 +++++++++--------- docs/sources/collect/prometheus-metrics.md | 62 +++++---- docs/sources/configure/_index.md | 4 +- .../distribute-prometheus-scrape-load.md | 2 +- docs/sources/configure/kubernetes.md | 21 ++- docs/sources/configure/linux.md | 2 +- docs/sources/configure/macos.md | 4 +- docs/sources/configure/nonroot.md | 4 +- docs/sources/configure/windows.md | 6 +- docs/sources/introduction/_index.md | 18 +-- .../introduction/backward-compatibility.md | 6 +- .../introduction/estimate-resource-usage.md | 2 +- docs/sources/set-up/deploy.md | 28 ++-- docs/sources/set-up/install/_index.md | 2 +- docs/sources/set-up/install/ansible.md | 12 +- docs/sources/set-up/install/binary.md | 12 +- docs/sources/set-up/install/chef.md | 4 +- docs/sources/set-up/install/docker.md | 15 +-- docs/sources/set-up/install/kubernetes.md | 13 +- docs/sources/set-up/install/linux.md | 13 +- docs/sources/set-up/install/macos.md | 4 +- docs/sources/set-up/install/puppet.md | 12 +- docs/sources/set-up/install/windows.md | 6 +- docs/sources/set-up/migrate/from-flow.md | 52 ++++---- docs/sources/set-up/migrate/from-operator.md | 7 +- docs/sources/set-up/migrate/from-otelcol.md | 51 ++++---- 
.../sources/set-up/migrate/from-prometheus.md | 38 +++--- docs/sources/set-up/migrate/from-promtail.md | 46 +++---- docs/sources/set-up/migrate/from-static.md | 17 +-- docs/sources/set-up/run/binary.md | 2 +- docs/sources/set-up/run/linux.md | 2 +- docs/sources/set-up/run/macos.md | 2 +- docs/sources/set-up/run/windows.md | 2 +- .../sources/troubleshoot/component_metrics.md | 4 +- .../troubleshoot/controller_metrics.md | 2 +- docs/sources/troubleshoot/debug.md | 29 ++--- docs/sources/troubleshoot/profile.md | 30 ++--- docs/sources/troubleshoot/support_bundle.md | 21 ++- docs/sources/tutorials/_index.md | 2 +- .../tutorials/first-components-and-stdlib.md | 47 ++++--- .../tutorials/logs-and-relabeling-basics.md | 41 +++--- docs/sources/tutorials/processing-logs.md | 30 ++--- docs/sources/tutorials/send-logs-to-loki.md | 27 ++-- .../tutorials/send-metrics-to-prometheus.md | 17 ++- 51 files changed, 546 insertions(+), 568 deletions(-) rename docs/sources/collect/{ecs-openteletry-data.md => ecs-opentelemetry-data.md} (89%) diff --git a/docs/sources/collect/_index.md b/docs/sources/collect/_index.md index ad88e94104..8411043ddb 100644 --- a/docs/sources/collect/_index.md +++ b/docs/sources/collect/_index.md @@ -8,4 +8,4 @@ weight: 100 # Collect and forward data with {{% param "FULL_PRODUCT_NAME" %}} -{{< section >}} \ No newline at end of file +{{< section >}} diff --git a/docs/sources/collect/choose-component.md b/docs/sources/collect/choose-component.md index 05f9d4df0b..36a880d54c 100644 --- a/docs/sources/collect/choose-component.md +++ b/docs/sources/collect/choose-component.md @@ -17,10 +17,9 @@ The components you select and configure depend on the telemetry signals you want ## Metrics for infrastructure Use `prometheus.*` components to collect infrastructure metrics. -This will give you the best experience with [Grafana Infrastructure Observability][]. +This gives you the best experience with [Grafana Infrastructure Observability][]. 
-For example, you can get metrics for a Linux host using `prometheus.exporter.unix`, -and metrics for a MongoDB instance using `prometheus.exporter.mongodb`. +For example, you can get metrics for a Linux host using `prometheus.exporter.unix`, and metrics for a MongoDB instance using `prometheus.exporter.mongodb`. You can also scrape any Prometheus endpoint using `prometheus.scrape`. Use `discovery.*` components to find targets for `prometheus.scrape`. @@ -30,7 +29,7 @@ Use `discovery.*` components to find targets for `prometheus.scrape`. ## Metrics for applications Use `otelcol.receiver.*` components to collect application metrics. -This will give you the best experience with [Grafana Application Observability][], which is OpenTelemetry-native. +This gives you the best experience with [Grafana Application Observability][], which is OpenTelemetry-native. For example, use `otelcol.receiver.otlp` to collect metrics from OpenTelemetry-instrumented applications. @@ -48,12 +47,12 @@ with logs collected by `loki.*` components. For example, the label that both `prometheus.*` and `loki.*` components would use for a Kubernetes namespace is called `namespace`. On the other hand, gathering logs using an `otelcol.*` component might use the [OpenTelemetry semantics][OTel-semantics] label called `k8s.namespace.name`, -which wouldn't correspond to the `namespace` label that is common in the Prometheus ecosystem. +which wouldn't correspond to the `namespace` label that's common in the Prometheus ecosystem. ## Logs from applications Use `otelcol.receiver.*` components to collect application logs. -This will gather the application logs in an OpenTelemetry-native way, making it easier to +This gathers the application logs in an OpenTelemetry-native way, making it easier to correlate the logs with OpenTelemetry metrics and traces coming from the application. All application telemetry must follow the [OpenTelemetry semantic conventions][OTel-semantics], simplifying this correlation. 
@@ -65,7 +64,7 @@ For example, if your application runs on Kubernetes, every trace, log, and metri Use `otelcol.receiver.*` components to collect traces. -If your application is not yet instrumented for tracing, use `beyla.ebpf` to generate traces for it automatically. +If your application isn't yet instrumented for tracing, use `beyla.ebpf` to generate traces for it automatically. ## Profiles diff --git a/docs/sources/collect/datadog-traces-metrics.md b/docs/sources/collect/datadog-traces-metrics.md index 034a093e8c..2ab9da3590 100644 --- a/docs/sources/collect/datadog-traces-metrics.md +++ b/docs/sources/collect/datadog-traces-metrics.md @@ -20,9 +20,9 @@ This topic describes how to: ## Before you begin -* Ensure that at least one instance of the [Datadog Agent][] is collecting metrics and/or traces. -* Identify where you will write the collected telemetry. - Metrics can be written to [Prometheus]() or any other OpenTelemetry-compatible database such as Grafana Mimir, Grafana Cloud, or Grafana Enterprise Metrics. +* Ensure that at least one instance of the [Datadog Agent][] is collecting metrics and traces. +* Identify where to write the collected telemetry. + Metrics can be written to [Prometheus][] or any other OpenTelemetry-compatible database such as Grafana Mimir, Grafana Cloud, or Grafana Enterprise Metrics. Traces can be written to Grafana Tempo, Grafana Cloud, or Grafana Enterprise Traces. * Be familiar with the concept of [Components][] in {{< param "PRODUCT_NAME" >}}. @@ -45,7 +45,7 @@ The [otelcol.exporter.otlp][] component is responsible for delivering OTLP data Replace the following: - - _``_: The full URL of the OpenTelemetry-compatible endpoint where metrics and traces will be sent, such as `https://otlp-gateway-prod-eu-west-2.grafana.net/otlp`. + * _``_: The full URL of the OpenTelemetry-compatible endpoint where metrics and traces are sent, such as `https://otlp-gateway-prod-eu-west-2.grafana.net/otlp`. 1. 
If your endpoint requires basic authentication, paste the following inside the `endpoint` block.
 
@@ -58,8 +58,8 @@ The [otelcol.exporter.otlp][] component is responsible for delivering OTLP data
 
    Replace the following:
 
-   - _`<USERNAME>`_: The basic authentication username.
-   - _`<PASSWORD>`_: The basic authentication password or API key.
+   * _`<USERNAME>`_: The basic authentication username.
+   * _`<PASSWORD>`_: The basic authentication password or API key.
 
 ## Configure the {{% param "PRODUCT_NAME" %}} Datadog Receiver
 
@@ -78,7 +78,7 @@ The [otelcol.exporter.otlp][] component is responsible for delivering OTLP data
 
    ```alloy
    otelcol.processor.deltatocumulative "default" {
-     max_stale = “<MAX_STALE>”
+     max_stale = "<MAX_STALE>"
      max_streams = <MAX_STREAMS>
      output {
        metrics = [otelcol.processor.batch.default.input]
@@ -88,14 +88,14 @@ The [otelcol.exporter.otlp][] component is responsible for delivering OTLP data
 
    Replace the following:
 
-   - _`<MAX_STALE>`_: How long until a series not receiving new samples is removed, such as "5m".
-   - _`<MAX_STREAMS>`_: The upper limit of streams to track. New streams exceeding this limit are dropped.
+   * _`<MAX_STALE>`_: How long until a series not receiving new samples is removed, such as "5m".
+   * _`<MAX_STREAMS>`_: The upper limit of streams to track. New streams exceeding this limit are dropped.
 
1. Add the following `otelcol.receiver.datadog` component to your configuration file.
 
    ```alloy
    otelcol.receiver.datadog "default" {
-     endpoint = “<HOST>:<PORT>”
+     endpoint = "<HOST>:<PORT>"
      output {
        metrics = [otelcol.processor.deltatocumulative.default.input]
        traces  = [otelcol.processor.batch.default.input]
@@ -105,8 +105,8 @@ The [otelcol.exporter.otlp][] component is responsible for delivering OTLP data
 
    Replace the following:
 
-   - _`<HOST>`_: The host address where the receiver will listen.
-   - _`<PORT>`_: The port where the receiver will listen.
+   * _`<HOST>`_: The host address where the receiver listens.
+   * _`<PORT>`_: The port where the receiver listens.
 
1. If your endpoint requires basic authentication, paste the following inside the `endpoint` block.
@@ -119,8 +119,8 @@ The [otelcol.exporter.otlp][] component is responsible for delivering OTLP data
 
    Replace the following:
 
-   - _`<USERNAME>`_: The basic authentication username.
-   - _`<PASSWORD>`_: The basic authentication password or API key.
+   * _`<USERNAME>`_: The basic authentication username.
+   * _`<PASSWORD>`_: The basic authentication password or API key.
 
 ## Configure Datadog Agent to forward telemetry to the {{% param "PRODUCT_NAME" %}} Datadog Receiver
 
@@ -139,10 +139,10 @@ We recommend this approach for current Datadog users who want to try using {{< p
 
    Replace the following:
 
-   - _`<HOSTNAME>`_: The hostname where the {{< param "PRODUCT_NAME" >}} receiver is found.
-   - _`<PORT>`_: The port where the {{< param "PRODUCT_NAME" >}} receiver is exposed.
+   * _`<HOSTNAME>`_: The hostname where the {{< param "PRODUCT_NAME" >}} receiver is found.
+   * _`<PORT>`_: The port where the {{< param "PRODUCT_NAME" >}} receiver is exposed.
 
-Alternatively, you might want your Datadog Agent to send metrics only to {{< param "PRODUCT_NAME" >}}. 
+Alternatively, you might want your Datadog Agent to send metrics only to {{< param "PRODUCT_NAME" >}}.
 You can do this by setting up your Datadog Agent in the following way:
 
1. 
Replace the DD_URL in the configuration YAML:
 
@@ -150,8 +150,8 @@ You can do this by setting up your Datadog Agent in the following way:
 
    ```yaml
    dd_url: http://<HOST>:<PORT>
    ```
-Or by setting an environment variable:
 
+   Or by setting an environment variable:
 
    ```bash
    DD_DD_URL='{"http://<HOST>:<PORT>": ["datadog-receiver"]}'
@@ -169,7 +169,5 @@ To use this component, you need to start {{< param "PRODUCT_NAME" >}} with addit
 
 [Datadog]: https://www.datadoghq.com/
 [Datadog Agent]: https://docs.datadoghq.com/agent/
 [Prometheus]: https://prometheus.io
-[OTLP]: https://opentelemetry.io/docs/specs/otlp/
-[otelcol.exporter.otlp]: ../../reference/components/otelcol/otelcol.exporter.otlp
-[otelcol.exporter.otlp]: ../../reference/components/otelcol/otelcol.exporter.otlp
-[Components]: ../../get-started/components
+[otelcol.exporter.otlp]: ../../reference/components/otelcol/otelcol.exporter.otlp/
+[Components]: ../../get-started/components/
diff --git a/docs/sources/collect/ecs-openteletry-data.md b/docs/sources/collect/ecs-opentelemetry-data.md
similarity index 89%
rename from docs/sources/collect/ecs-openteletry-data.md
rename to docs/sources/collect/ecs-opentelemetry-data.md
index 3a7a53a483..428bf0e926 100644
--- a/docs/sources/collect/ecs-openteletry-data.md
+++ b/docs/sources/collect/ecs-opentelemetry-data.md
@@ -1,5 +1,7 @@
 ---
 canonical: https://grafana.com/docs/alloy/latest/collect/ecs-opentelemetry-data/
+aliases:
+  - ./ecs-openteletry-data/ # /docs/alloy/latest/collect/ecs-openteletry-data/
 description: Learn how to collect Amazon ECS or AWS Fargate OpenTelemetry data and forward it to any OpenTelemetry-compatible endpoint
 menuTitle: Collect ECS or Fargate OpenTelemetry data
 title: Collect Amazon Elastic Container Service or AWS Fargate OpenTelemetry data
@@ -14,7 +16,7 @@ There are three different ways you can use {{< param "PRODUCT_NAME" >}} to colle
 
1. 
[Use a custom OpenTelemetry configuration file from the SSM Parameter store](#use-a-custom-opentelemetry-configuration-file-from-the-ssm-parameter-store). 1. [Create an ECS task definition](#create-an-ecs-task-definition). -1. [Run {{< param "PRODUCT_NAME" >}} directly in your instance, or as a Kubernetes sidecar](#run-alloy-directly-in-your-instance-or-as-a-kubernetes-sidecar). +1. [Run {{< param "PRODUCT_NAME" >}} directly in your instance, or as a Kubernetes sidecar](#run-alloy-directly-in-your-instance-or-as-a-kubernetes-sidecar) ## Before you begin @@ -55,11 +57,11 @@ In ECS, you can set the values of environment variables from AWS Systems Manager 1. Choose *Create parameter*. 1. Create a parameter with the following values: - * `Name`: otel-collector-config - * `Tier`: Standard - * `Type`: String - * `Data type`: Text - * `Value`: Copy and paste your custom OpenTelemetry configuration file or [{{< param "PRODUCT_NAME" >}} configuration file][configure]. + * Name: `otel-collector-config` + * Tier: `Standard` + * Type: `String` + * Data type: `Text` + * Value: Copy and paste your custom OpenTelemetry configuration file or [{{< param "PRODUCT_NAME" >}} configuration file][configure]. ### Run your task @@ -73,16 +75,16 @@ To create an ECS Task Definition for AWS Fargate with an ADOT collector, complet 1. Download the [ECS Fargate task definition template][template] from GitHub. 1. Edit the task definition template and add the following parameters. - * `{{region}}`: The region the data is sent to. + * `{{region}}`: The region to send the data to. * `{{ecsTaskRoleArn}}`: The AWSOTTaskRole ARN. * `{{ecsExecutionRoleArn}}`: The AWSOTTaskExcutionRole ARN. * `command` - Assign a value to the command variable to select the path to the configuration file. The AWS Collector comes with two configurations. 
Select one of them based on your environment: - * Use `--config=/etc/ecs/ecs-default-config.yaml` to consume StatsD metrics, OTLP metrics and traces, and X-Ray SDK traces. - * Use `--config=/etc/ecs/container-insights/otel-task-metrics-config.yaml` to use StatsD, OTLP, Xray, and Container Resource utilization metrics. + * Use `--config=/etc/ecs/ecs-default-config.yaml` to consume StatsD metrics, OTLP metrics and traces, and AWS X-Ray SDK traces. + * Use `--config=/etc/ecs/container-insights/otel-task-metrics-config.yaml` to use StatsD, OTLP, AWS X-Ray, and Container Resource utilization metrics. 1. Follow the ECS Fargate setup instructions to [create a task definition][task] using the template. -## Run {{% param "PRODUCT_NAME" %}} directly in your instance, or as a Kubernetes sidecar +## Run Alloy directly in your instance, or as a Kubernetes sidecar SSH or connect to the Amazon ECS or AWS Fargate-managed container. Refer to [9 steps to SSH into an AWS Fargate managed container][steps] for more information about using SSH with Amazon ECS or AWS Fargate. 
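For the sidecar approach described above, a minimal {{< param "PRODUCT_NAME" >}} configuration that accepts OTLP from the other containers in the task and forwards it upstream might look like the following sketch; the listen address and backend endpoint are illustrative placeholders, not values from the patch:

```alloy
// Accept OTLP over gRPC from other containers in the task.
otelcol.receiver.otlp "default" {
  grpc {
    endpoint = "0.0.0.0:4317"
  }

  output {
    metrics = [otelcol.exporter.otlp.default.input]
    traces  = [otelcol.exporter.otlp.default.input]
  }
}

// Forward the received telemetry to an OTLP-compatible backend.
otelcol.exporter.otlp "default" {
  client {
    endpoint = "tempo.example.com:443" // placeholder endpoint
  }
}
```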
diff --git a/docs/sources/collect/logs-in-kubernetes.md b/docs/sources/collect/logs-in-kubernetes.md index 3e02efa808..d8b8b17fb2 100644 --- a/docs/sources/collect/logs-in-kubernetes.md +++ b/docs/sources/collect/logs-in-kubernetes.md @@ -19,19 +19,19 @@ This topic describes how to: ## Components used in this topic -* [discovery.kubernetes][] -* [discovery.relabel][] -* [local.file_match][] -* [loki.source.file][] -* [loki.source.kubernetes][] -* [loki.source.kubernetes_events][] -* [loki.process][] -* [loki.write][] +* [`discovery.kubernetes`][discovery.kubernetes] +* [`discovery.relabel`][discovery.relabel] +* [`local.file_match`][local.file_match] +* [`loki.source.file`][loki.source.file] +* [`loki.source.kubernetes`][loki.source.kubernetes] +* [`loki.source.kubernetes_events`][loki.source.kubernetes_events] +* [`loki.process`][loki.process] +* [`loki.write`][loki.write] ## Before you begin * Ensure that you are familiar with logs labelling when working with Loki. -* Identify where you will write collected logs. +* Identify where to write collected logs. You can write logs to Loki endpoints such as Grafana Loki, Grafana Cloud, or Grafana Enterprise Logs. * Be familiar with the concept of [Components][] in {{< param "PRODUCT_NAME" >}}. @@ -39,8 +39,8 @@ This topic describes how to: Before components can collect logs, you must have a component responsible for writing those logs somewhere. -The [loki.write][] component delivers logs to a Loki endpoint. -After a `loki.write` component is defined, you can use other {{< param "PRODUCT_NAME" >}} components to forward logs to it. +The [`loki.write`][loki.write] component delivers logs to a Loki endpoint. +After you define a `loki.write` component, you can use other {{< param "PRODUCT_NAME" >}} components to forward logs to it. 
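As a sketch of how the components listed in the hunk above fit together, the flow is discovery, then a log source, then `loki.write`; the Loki URL here is an illustrative placeholder:

```alloy
// Discover Kubernetes Pods to collect logs from.
discovery.kubernetes "pods" {
  role = "pod"
}

// Tail Pod logs through the Kubernetes API and forward them.
loki.source.kubernetes "pods" {
  targets    = discovery.kubernetes.pods.targets
  forward_to = [loki.write.default.receiver]
}

// Deliver logs to a Loki-compatible endpoint.
loki.write "default" {
  endpoint {
    url = "https://loki.example.com/loki/api/v1/push" // placeholder URL
  }
}
```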
To configure a `loki.write` component for logs delivery, complete the following steps: @@ -56,9 +56,9 @@ To configure a `loki.write` component for logs delivery, complete the following Replace the following: - - _`