Durable Queue as a WAL in Flow #370

Closed
mattdurham opened this issue Aug 28, 2023 · 24 comments · Fixed by #1564
@mattdurham
Collaborator

mattdurham commented Aug 28, 2023

Background

The current WAL implementation suffers from several downsides.

  • OOM errors while replaying
  • Replaying the WAL stops graph evaluation
  • High memory usage
  • Inability to re-send data between restarts
  • Lack of expiration promises (when a metric expires is a hard question to answer)
  • Current WAL needs shims to work with the Flow concept

Proposal

Create a disk-based queue WAL + remote_write component, with a focus on:

  • Limiting memory usage by avoiding unbounded large maps
  • Keeping memory limited to what is being read/written at the time
  • Flow native
  • Concrete expiration promises
  • Replayable
  • Performance on sending and writing samples within 90% of our current implementation
  • A simpler implementation
  • Ability to make changes to scraping/WAL within the Agent codebase more easily

With no need to cache the series themselves, memory will no longer depend on the number of series stored. Each commit from a scrape to the WAL creates a new record, and the remote write side works its way through the queue, bookmarking its position on each successful send. This gives us strong replayability guarantees.

On startup, instead of replaying the WAL, it only needs to verify the WAL's correctness.
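
Roughly, the mechanics could be sketched like this in Go (a minimal sketch only; fileQueue, Add, and Bookmark are illustrative names, not the actual Alloy implementation):

// Illustrative sketch only, not the actual Alloy code: each scrape commit
// becomes its own immutable record file, and the sender records ("bookmarks")
// the last record it shipped successfully so a restart resumes from there
// instead of replaying everything.
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

type fileQueue struct {
	dir  string // directory holding the N.committed record files
	next int    // id of the next record to write
}

// Add persists one serialized scrape commit as its own record file.
func (q *fileQueue) Add(rec []byte) (string, error) {
	q.next++
	name := filepath.Join(q.dir, fmt.Sprintf("%d.committed", q.next))
	return name, os.WriteFile(name, rec, 0o600)
}

// Bookmark stores the id of the last record sent successfully.
func (q *fileQueue) Bookmark(id int) error {
	bm := filepath.Join(q.dir, "bookmark")
	return os.WriteFile(bm, []byte(fmt.Sprint(id)), 0o600)
}

func main() {
	q := &fileQueue{dir: os.TempDir()}
	if _, err := q.Add([]byte("serialized samples")); err == nil {
		// Pretend the remote write succeeded, then move the bookmark forward.
		_ = q.Bookmark(q.next)
	}
}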

This is not meant to replace the current WAL implementation, but rather to be an experiment in using a queue-based WAL.

@mattdurham mattdurham added the proposal A proposal for new functionality. label Aug 28, 2023
@mattdurham mattdurham self-assigned this Aug 28, 2023
@mattdurham
Collaborator Author

I have an experimental implementation of the above, but I want to let the problem statement sit for a few days before submitting a PR, and also give some more time to run and double-check the benchmarks. They run for 6 hours.

@mattdurham
Collaborator Author

Added some clarity to the above: this is not meant to be a replacement for the current WAL, nor to ignore fixing the concerns with the current WAL.

@thepalbi

thepalbi commented Aug 28, 2023

@mattdurham can you link the branch? interested to see the implementation

@arkbriar

Great point! I also found that the memory consumption was surprisingly high even though there were only ~10k active series. I've also got a rough implementation of a queue-based remote write component and found that memory usage drops by 5x or more.

It would be great if someone at Grafana Labs could produce a comprehensive design to replace the current WAL-based implementation (which IMO is terrible).

@rfratto rfratto transferred this issue from grafana/agent Apr 11, 2024
@jammiemil

jammiemil commented May 10, 2024

Was there any progress on this? Some of the downsides listed above are hurting us significantly in production right now, and decoupling active series from memory consumption would help a lot.

@mattdurham
Collaborator Author

It is still on a long term roadmap for me.

@rfratto rfratto moved this to Incoming in Alloy proposals Jun 14, 2024
@freak12techno
Contributor

freak12techno commented Oct 16, 2024

@mattdurham any progress on this? We're trying to integrate Alloy into our setup, and this feature (specifically, WAL persistence after reboot) is a must for us.

@mattdurham
Collaborator Author

It's coming soon in #1564, though I will stress that this is the most experimental feature possible.

@freak12techno
Contributor

@mattdurham yeah we'll likely be one of the first production testers then lol (as all other solutions we've tried do not suit us)

@freak12techno
Contributor

freak12techno commented Oct 17, 2024

@mattdurham is there support for bearer token auth? I looked through the PR but could not find it; I wonder if that would be easy to add, as we're using a managed Prometheus solution which does not provide basic auth for remote_write, only bearer token auth.

(I can probably add it myself, if that's okay)

@mattdurham
Collaborator Author

Not at the moment; I wanted to get it out as cleanly as possible. I have #1904 where I am tracking next steps. If you want to create an issue for it, I can add it to the tracking list.

@freak12techno
Contributor

@mattdurham done: #1913. Also, feel free to assign it to me; I want to try implementing it myself, if there are no objections.

@freak12techno
Contributor

@mattdurham one thing I noticed is that there's no persistence block anywhere in the PR above (while it is referenced in the docs), and Alloy fails to start if it's provided. I assume that's a typo and it's supposed to be serialization? Please correct me if I'm wrong.

@mattdurham
Collaborator Author

Conceptually it should be persistence; that was a late name change, and I must not have updated the code.

@mattdurham
Collaborator Author

Change incoming to fix that.

@freak12techno
Contributor

@mattdurham so I am trying to use this new functionality; it works the same way as the old prometheus.remote_write, but somehow doesn't persist data if the app is rebooted. Is that expected?

Here's how I can reproduce it:

  1. start Alloy with the following config (either locally or via Docker will do):
logging {
  level  = "info"
  format = "logfmt"
}

prometheus.exporter.unix "node_exporter" { }

prometheus.scrape "node_exporter" {
  targets = [
    {
      "__address__" = prometheus.exporter.unix.node_exporter.targets[0].__address__,
      "__metrics_path__" = prometheus.exporter.unix.node_exporter.targets[0].__metrics_path__,
      "source" = "node_exporter",
    },
  ]

  scrape_interval = "10s"
  honor_labels    = true

  // Sending these scraped metrics to remote Prometheus via remote_write.
  forward_to = [prometheus.write.queue.default.receiver]
}


prometheus.write.queue "default" {
  endpoint "default"{
    url = env("PROMETHEUS_HOST")
    bearer_token = env("PROMETHEUS_TOKEN")
  }

  ttl = "168h"

  persistence {
    batch_interval = "10s"
  }
}
  2. turn off internet on the device
  3. wait like 5 minutes
  4. restart Alloy
  5. see gaps in logs

Apparently, from what I see by inserting a bunch of fmt.Printfs, it creates 1.committed, 2.committed, etc. files and immediately deletes them, even when the internet isn't up and it is not able to send them. I wonder if I misconfigured something or if it's a bug?

@mattdurham
Collaborator Author

It will cache samples in memory for outgoing data if you add parallelism. That being said, I did find where it wasn't honoring the minimum capacity for enqueued items, so instead of 1 it consumes up to 64 files. Much appreciate the feedback, and if you want something more real-time feel free to ping me on the community Slack; I am mattd on there. PR incoming to make the minimum capacity really the minimum.
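
As a toy model of that behaviour (illustrative Go only; reader, fill, and minCapacity are made-up names, not the real component or queue library), the reader drains record files into an in-memory buffer and deletes them as it goes:

// Toy model of the behaviour described above, not the real component: the
// reader pulls up to minCapacity record files into an in-memory buffer and
// deletes each file as soon as it is read, so anything written during a short
// outage lives only in memory and is lost if the process restarts.
package main

import "os"

type reader struct {
	inMemory    [][]byte
	minCapacity int // the underlying queue library floors this at 64
}

func (r *reader) fill(paths []string) error {
	for _, p := range paths {
		if len(r.inMemory) >= r.minCapacity {
			break
		}
		b, err := os.ReadFile(p)
		if err != nil {
			return err
		}
		r.inMemory = append(r.inMemory, b)
		// The on-disk copy is removed right after the read, which is why the
		// N.committed files vanish even while the device is offline.
		if err := os.Remove(p); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	r := &reader{minCapacity: 64}
	_ = r.fill([]string{"1.committed", "2.committed"})
}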

@freak12techno
Contributor

@mattdurham can't register on Slack unfortunately, probably due to me being a citizen of a sanctioned country, so will post here for now.

Just to clarify, here's our use case and why we need this:

  • we have devices that might be offline (as in, without internet) for a long time and that might sometimes be rebooted (in fact, both of these are common cases for us)
  • we need to gather metrics/logs from these devices whenever we can
  • and send them to Prometheus remote_write whenever the device is online
  • the metrics that were collected while the device was offline should not be discarded and should instead be sent to Prometheus once the device has internet again
  • metrics should be persisted across a reboot (I understand that there might be cases where a device is writing data to a file and unexpectedly shuts down mid-process, corrupting it)

Currently, both prometheus.remote_write and prometheus.write.queue (built from the main branch) work for us, except for the case when the device loses its internet access and then reboots (either Alloy restarts or the whole server reboots, it doesn't matter).

As for parallelism, I will check whether it helps and get back to you with the results.

(Also, given that this issue is closed, should we open another one, to not lose this context?)

@freak12techno
Contributor

@mattdurham okay, I added parallelism = 1, and it still behaves the same.

What concerns me here is that I added some fmt.Printfs to the Alloy code to debug it, and it seems that it writes a file to disk and immediately removes it, even when the device is offline, which apparently means it's not persisted:

ts=2024-10-18T13:35:47.594704Z level=error msg="error in sending telemetry" component_path=/ component_id=prometheus.write.queue.default name=loop url=https://prometheus-host err="Post \"https://prometheus-host\": dial tcp: lookup prometheus-host: no such host"
writing file 2.committed map[compression:snappy meta_count:0 series_count:671 strings_count:403 version:alloy.metrics.queue.v1] # this is from filequeue.add
doing work
deleting file ../storage/prometheus.write.queue.default/default/wal/2.committed # this is from filequeue.delete
ts=2024-10-18T13:35:48.596425Z level=error msg="error in sending telemetry" component_path=/ component_id=prometheus.write.queue.default name=loop url=https://prometheus-host err="Post \"https://prometheus-host\": dial tcp: lookup prometheus-host: no such host"

@mattdurham
Collaborator Author

Can you use community.grafana.com?

@mattdurham
Collaborator Author

FYI, apparently the library I am using defaults the minimum capacity of enqueued messages to 64 even if the capacity is set to 1, which is what I used for consuming files. There are additional settings to solve that, which I have added and am testing now.
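
As a rough illustration of the kind of knob involved (hypothetical Go; queueOptions and its fields are invented for this example, since the actual library and its option names aren't shown in this thread):

// Hypothetical illustration of the kind of setting involved; the actual queue
// library and its option names are not shown in this thread.
package main

import "fmt"

type queueOptions struct {
	Capacity    int // requested number of items read into memory at once
	MinCapacity int // library floor; unless overridden it stays at 64
}

func main() {
	// Pin the minimum explicitly so a capacity of 1 really means 1 file.
	opts := queueOptions{Capacity: 1, MinCapacity: 1}
	fmt.Printf("consuming %d file(s) at a time\n", opts.Capacity)
}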

@freak12techno
Contributor

@mattdurham

Can you use community.grafana.com?

Yeah, I can. I wonder what's the most convenient way for you to communicate, as I don't have a preference here and anything would work for me (except Slack, unfortunately).

@mattdurham
Collaborator Author

https://community.grafana.com/t/prometheus-write-queue-discussions/134563

@freak12techno
Contributor

@mattdurham are there any updates?

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 28, 2024