Durable Queue as a WAL in Flow #370

Closed
mattdurham opened this issue Aug 28, 2023 · 24 comments · Fixed by #1564
@mattdurham
Collaborator

mattdurham commented Aug 28, 2023

Background

The current WAL implementation suffers from several downsides.

  • OOM errors while replaying
  • Replaying the WAL stops graph evaluation
  • High memory usage
  • Inability to re-send data between restarts
  • Lack of expiration promises (when a metric expires is a hard question to answer)
  • Current WAL needs shims to work with the Flow concept

Proposal

Create a disk-based queue WAL + remote_write component, with a focus on:

  • Limiting memory usage by avoiding unbounded large maps
  • Keeping memory limited to what is being read/written at the time
  • Flow native
  • Concrete expiration promises
  • Replayable
  • Performance on sending and writing samples within 90% of our current implementation
  • A simpler implementation
  • Ability to make changes to scraping/WAL within the Agent codebase more easily

With no need to cache the series themselves, memory will no longer depend on the number of series stored. Each commit from a scrape to the WAL creates a new record, and the remote write side works its way through the queue, bookmarking its position on each successful send. This gives us strong replayability guarantees.

On startup, instead of replaying the WAL, it only needs to verify the WAL's correctness.
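
Roughly, the mechanics could be sketched like this in Go (a minimal sketch only; fileQueue, Add, and Bookmark are illustrative names, not the actual Alloy implementation):

// Illustrative sketch only, not the actual Alloy code: each scrape commit
// becomes its own immutable record file, and the sender records ("bookmarks")
// the last record it shipped successfully so a restart resumes from there
// instead of replaying everything.
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

type fileQueue struct {
	dir  string // directory holding the N.committed record files
	next int    // id of the next record to write
}

// Add persists one serialized scrape commit as its own record file.
func (q *fileQueue) Add(rec []byte) (string, error) {
	q.next++
	name := filepath.Join(q.dir, fmt.Sprintf("%d.committed", q.next))
	return name, os.WriteFile(name, rec, 0o600)
}

// Bookmark stores the id of the last record sent successfully.
func (q *fileQueue) Bookmark(id int) error {
	bm := filepath.Join(q.dir, "bookmark")
	return os.WriteFile(bm, []byte(fmt.Sprint(id)), 0o600)
}

func main() {
	q := &fileQueue{dir: os.TempDir()}
	if _, err := q.Add([]byte("serialized samples")); err == nil {
		// Pretend the remote write succeeded, then move the bookmark forward.
		_ = q.Bookmark(q.next)
	}
}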

This is not meant to replace the current WAL implementation, but rather to be an experiment in using a queue-based WAL.

@mattdurham mattdurham added the proposal A proposal for new functionality. label Aug 28, 2023
@mattdurham mattdurham self-assigned this Aug 28, 2023
@mattdurham
Collaborator Author

I have an experimental implementation of the above, but I want to let the problem statement sit for a few days before submitting a PR, and also give some more time to run and double-check the benchmarks. They run for 6 hours.

@mattdurham
Collaborator Author

Added some clarity to the above: this is not meant to be a replacement for the current WAL, nor to ignore fixing the concerns with the current WAL.

@thepalbi

thepalbi commented Aug 28, 2023

@mattdurham can you link the branch? interested to see the implementation

@arkbriar

Great point! I also found that the memory consumption was surprisingly high even though there were only ~10k active series. I've also got a rough implementation of a queue-based remote write component and found that memory usage drops by 5x or more.

It would be great if someone at Grafana Labs could produce a comprehensive design to replace the current WAL-based implementation (which IMO is terrible).

@rfratto rfratto transferred this issue from grafana/agent Apr 11, 2024
@jammiemil

jammiemil commented May 10, 2024

Was there any progress on this? Some of the downsides listed above are hurting us significantly in production right now, and decoupling active series from memory consumption would help a lot.

@mattdurham
Collaborator Author

It is still on a long term roadmap for me.

@rfratto rfratto moved this to Incoming in Alloy proposals Jun 14, 2024
@freak12techno
Contributor

freak12techno commented Oct 16, 2024

@mattdurham any progress on this? We're trying to integrate Alloy into our setup, and this feature (specifically, WAL persistence after reboot) is a must for us.

@mattdurham
Collaborator Author

It's coming soon in #1564, though I will stress that this is the most experimental feature possible.

@freak12techno
Contributor

@mattdurham yeah we'll likely be one of the first production testers then lol (as all other solutions we've tried do not suit us)

@freak12techno
Contributor

freak12techno commented Oct 17, 2024

@mattdurham is there support for bearer token auth? I looked through the PR but could not find it; I wonder if that would be easy to add, as we're using a managed Prometheus solution which does not provide basic auth for remote_write, only bearer token auth.

(I can probably add it myself, if that's okay)

@mattdurham
Collaborator Author

Not at the moment; I wanted to get it out as cleanly as possible. I have #1904 where I am tracking next steps. If you want to create an issue for it, I can add it to the tracking list.

@freak12techno
Contributor

@mattdurham done: #1913. Also, feel free to assign it to me; I want to try implementing it myself, if there are no objections.

@freak12techno
Contributor

@mattdurham one thing I noticed is that there's no persistence block anywhere in the PR above (while it is referenced in the docs), and Alloy fails to start if it's provided. I assume that's a typo and it's supposed to be serialization? Please correct me if I'm wrong.

@mattdurham
Collaborator Author

Conceptually it should be persistence; that was a late name change, and I must not have updated the code.

@mattdurham
Collaborator Author

Change incoming to fix that.

@freak12techno
Contributor

@mattdurham so I am trying to use this new functionality; it works the same way as the old prometheus.remote_write, but somehow doesn't persist data if the app is rebooted. Is that expected?

Here's how I can reproduce it:

  1. start Alloy with the following config (either locally or via Docker will do):
logging {
  level  = "info"
  format = "logfmt"
}

prometheus.exporter.unix "node_exporter" { }

prometheus.scrape "node_exporter" {
  targets = [
    {
      "__address__" = prometheus.exporter.unix.node_exporter.targets[0].__address__,
      "__metrics_path__" = prometheus.exporter.unix.node_exporter.targets[0].__metrics_path__,
      "source" = "node_exporter",
    },
  ]

  scrape_interval = "10s"
  honor_labels    = true

  // Sending these scraped metrics to remote Prometheus via remote_write.
  forward_to = [prometheus.write.queue.default.receiver]
}


prometheus.write.queue "default" {
  endpoint "default"{
    url = env("PROMETHEUS_HOST")
    bearer_token = env("PROMETHEUS_TOKEN")
  }

  ttl = "168h"

  persistence {
    batch_interval = "10s"
  }
}
  2. turn off internet on the device
  3. wait like 5 minutes
  4. restart Alloy
  5. see gaps in logs

Apparently, from what I see by inserting a bunch of fmt.Printfs, it creates 1.committed, 2.committed, etc. files and immediately deletes them, even when the internet isn't up and it is not able to send them. I wonder if I misconfigured something or if it's a bug?

@mattdurham
Collaborator Author

It will cache samples in memory for outgoing data if you add parallelism. That being said, I did find where it wasn't honoring the minimum capacity for enqueued items, so instead of 1 it consumes up to 64 files. Much appreciate the feedback, and if you want something more real-time feel free to ping me on the community Slack; I am mattd on there. PR incoming to make the minimum capacity really the minimum.
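
As a toy model of that behaviour (illustrative Go only; reader, fill, and minCapacity are made-up names, not the real component or queue library), the reader drains record files into an in-memory buffer and deletes them as it goes:

// Toy model of the behaviour described above, not the real component: the
// reader pulls up to minCapacity record files into an in-memory buffer and
// deletes each file as soon as it is read, so anything written during a short
// outage lives only in memory and is lost if the process restarts.
package main

import "os"

type reader struct {
	inMemory    [][]byte
	minCapacity int // the underlying queue library floors this at 64
}

func (r *reader) fill(paths []string) error {
	for _, p := range paths {
		if len(r.inMemory) >= r.minCapacity {
			break
		}
		b, err := os.ReadFile(p)
		if err != nil {
			return err
		}
		r.inMemory = append(r.inMemory, b)
		// The on-disk copy is removed right after the read, which is why the
		// N.committed files vanish even while the device is offline.
		if err := os.Remove(p); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	r := &reader{minCapacity: 64}
	_ = r.fill([]string{"1.committed", "2.committed"})
}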

@freak12techno
Contributor

@mattdurham can't register on Slack unfortunately, probably due to me being a citizen of a sanctioned country, so will post here for now.

Just to clarify, here's our use case and why we need this:

  • we have devices that might be offline (as in, without internet) for a long time and that might sometimes be rebooted (in fact, both of these are common cases for us)
  • we need to gather metrics/logs from these devices whenever we can
  • and send them to Prometheus remote_write whenever the device is online
  • the metrics that were collected while the device was offline should not be discarded and should instead be sent to Prometheus once the device has internet again
  • metrics should be persisted across a reboot (I understand that there might be cases where a device is writing data to a file and unexpectedly shuts down mid-process, corrupting it)

Currently, both prometheus.remote_write and prometheus.write.queue (built from the main branch) work for us, except for the case when the device loses its internet access and then reboots (either Alloy restarts or the whole server reboots, it doesn't matter).

As for parallelism, I will check whether it helps and get back to you with the results.

(Also, given that this issue is closed, should we open another one, to not lose this context?)

@freak12techno
Contributor

@mattdurham okay, I added parallelism = 1, and it still behaves the same.

What concerns me here is that I added some fmt.Printfs to the Alloy code to debug it, and it seems that it writes a file to disk and immediately removes it, even when the device is offline, which apparently means it's not persisted:

ts=2024-10-18T13:35:47.594704Z level=error msg="error in sending telemetry" component_path=/ component_id=prometheus.write.queue.default name=loop url=https://prometheus-host err="Post \"https://prometheus-host\": dial tcp: lookup prometheus-host: no such host"
writing file 2.committed map[compression:snappy meta_count:0 series_count:671 strings_count:403 version:alloy.metrics.queue.v1] # this is from filequeue.add
doing work
deleting file ../storage/prometheus.write.queue.default/default/wal/2.committed # this is from filequeue.delete
ts=2024-10-18T13:35:48.596425Z level=error msg="error in sending telemetry" component_path=/ component_id=prometheus.write.queue.default name=loop url=https://prometheus-host err="Post \"https://prometheus-host\": dial tcp: lookup prometheus-host: no such host"

@mattdurham
Collaborator Author

Can you use community.grafana.com?

@mattdurham
Collaborator Author

FYI, apparently the library I am using defaults the minimum capacity of enqueued messages to 64 even if the capacity is set to 1, which is what I used for consuming files. There are additional settings to solve that, which I have added and am testing now.
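
As a rough illustration of the kind of knob involved (hypothetical Go; queueOptions and its fields are invented for this example, since the actual library and its option names aren't shown in this thread):

// Hypothetical illustration of the kind of setting involved; the actual queue
// library and its option names are not shown in this thread.
package main

import "fmt"

type queueOptions struct {
	Capacity    int // requested number of items read into memory at once
	MinCapacity int // library floor; unless overridden it stays at 64
}

func main() {
	// Pin the minimum explicitly so a capacity of 1 really means 1 file.
	opts := queueOptions{Capacity: 1, MinCapacity: 1}
	fmt.Printf("consuming %d file(s) at a time\n", opts.Capacity)
}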

@freak12techno
Contributor

@mattdurham

Can you use community.grafana.com?

Yeah, I can. I wonder what's the most convenient way for you to communicate, as I don't have a preference here and anything would work for me (except Slack, unfortunately).

@mattdurham
Collaborator Author

https://community.grafana.com/t/prometheus-write-queue-discussions/134563

@freak12techno
Contributor

@mattdurham are there any updates?

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 28, 2024