Durable Queue as a WAL in Flow #370
Comments
I have an experimental PR with an implementation of the above, but I want to let the problem statement sit for a few days before submitting it. That also gives some more time for the benchmarks to run and to double-check them; they run for 6 hours.
Added some clarity to the above: this is not meant to be a replacement of the current WAL, nor to ignore fixing the concerns with the current WAL.
@mattdurham can you link the branch? Interested to see the implementation.
Great point! I also found that memory consumption was surprisingly high even though there were only ~10k active series. I've also got a rough implementation of a queue-based remote write component and found that memory usage drops at least 5x. It would be great if someone at Grafana Labs could put together a comprehensive design to replace the current WAL-based implementation (which IMO is terrible).
Was there any progress on this? Some of the downsides listed above are hurting us significantly in production right now, and decoupling active series from memory consumption would help a lot.
It is still on a long-term roadmap for me.
@mattdurham any progress on this? We're trying to integrate Alloy into our setup and this feature (specifically, WAL persistence after reboot) is a must for us.
It's coming soon in #1564, though I will stress that this is the most experimental feature possible.
@mattdurham yeah, we'll likely be one of the first production testers then lol (as all other solutions we've tried do not suit us).
@mattdurham is there support for bearer token auth? I looked through the PR but could not find it. I wonder if that would be easy to add, as we're using a managed Prometheus solution which does not provide us with basic auth for remote_write, only bearer token auth (I can probably add it myself, if that's okay).
Not at the moment; I wanted to get it out as cleanly as possible. I have #1904 where I am tracking next steps. If you want to create an issue for it, I can add it to the tracking list.
@mattdurham done: #1913. Also, feel free to assign it to me; I want to try implementing it myself, if there are no objections.
@mattdurham one thing I noticed is that there's no
Conceptually it should be
Change incoming to fix that. |
@mattdurham so I am trying to use this new functionality, it works the same way as the old
Here's how I can reproduce it:
Apparently, from what I see by inserting a bunch of
It will cache samples in memory for outgoing data. If you add
@mattdurham can't register on Slack unfortunately, probably due to me being a citizen of a sanctioned country, so I will post here for now. Just to clarify, here's our use case and why we need this:
Currently, both prometheus.remote_write and prometheus.write.queue (built from the main branch) work for us, except for the case when the device loses its internet access and then reboots (either Alloy restarts or the whole server reboots; it doesn't matter). For parallelism, I will check whether this helps and get back to you with the results. (Also, given that this issue is closed, should we open another one, so we don't lose this context?)
@mattdurham okay, added
What concerns me here is that I added some
Can you use
FYI, apparently the library I am using defaults to a minimum capacity of 64 enqueued messages even if the configured capacity is 1, which is what I used for consuming files. There are additional settings to address that, which I have added and am now testing.
Yeah, I can. I wonder what's the most convenient way for you to communicate; I don't have a preference here and anything would work for me (except Slack, unfortunately).
@mattdurham are there any updates? |
Background
The current WAL implementation suffers from several downsides.
Proposal
Create a disk-based queue WAL + remote_write component, with a focus on:
With no need to cache the series themselves, memory will no longer depend on the number of series stored. Each commit from a scrape to the WAL creates a new record, and the remote write side works its way through the queue, bookmarking its position on each successful send. This gives us strong replayability guarantees.
On startup, instead of replaying the WAL, it only needs to verify the correctness of the WAL.
This is not meant to replace the current WAL implementation, but to be an experiment in using a queue-based WAL.