How to calculate metrics? Bucket size, last updated, etc #137
Comments
This is sort of the same discussion as in Backblaze/B2_Command_Line_Tool#624 - the API does not expose it, and getting it through the web panel is a security issue. I hope B2 server developers one day add those things to the API, but as of now there is not much we can do; I think the unsatisfactory performance you saw is expected, since the size has to be computed by listing every file in the bucket.
As for monitoring of your backups: if no files change, monitoring will show a backup failure even though the backup works correctly and it is the source that does not change (unless your backup application writes something anyway when the incremental backup is empty; `b2 sync` itself uploads nothing when nothing has changed).
I think what you really need is the number and size of file versions newer than a certain point (the time of the last backup, assuming backups don't overlap), and a scan is needed for this. You don't really care about old versions of a file, but about the new stuff. The metric is then actually useful: if the number of new uploads grows or dies down, it can be correlated with other changes in your infrastructure to investigate and troubleshoot issues with the backup system.

You could also try to track the eviction of old file versions. Imagine a certain file being overwritten at the source every day; on the backup media you would then have several versions stored for some time, according to the retention policy. If scanning (and a bit of storage) is an option, your monitoring system can separately track the number and size of ADDED files, but it can also track HIDDEN and REMOVED file count and size, and even group them by prefix. If one of your backed-up systems drops a lot of data, you might want to see that on a chart in the backup monitoring system, so that you can restore the data before the retention policy sweeps it up. (A rough sketch of such a scan follows below.)

Can you share some details about your operation? How many systems are there, what backup application do you use, how many files are there, how do they change (new ones are added, large files are modified in place / appended to / etc.), can the monitoring system access the source, and does the source have an index of some sort on it?
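For illustration only, here is a minimal sketch of that kind of scan, written against the current b2sdk v2 API (newer than what existed when this thread started); the credentials, bucket name and the 24-hour cutoff are placeholders, not anything the thread prescribes:

```python
# Sketch: count and size file versions uploaded after the last backup time,
# grouped by top-level prefix and by action ('upload' vs. 'hide').
import time
from collections import Counter

from b2sdk.v2 import B2Api, InMemoryAccountInfo

LAST_BACKUP_TIME_MS = int((time.time() - 24 * 3600) * 1000)  # example: 24h ago

api = B2Api(InMemoryAccountInfo())
api.authorize_account("production", "<applicationKeyId>", "<applicationKey>")
bucket = api.get_bucket_by_name("<bucket-name>")

new_count = Counter()
new_bytes = Counter()
for file_version, _folder in bucket.ls(latest_only=False, recursive=True):
    if file_version.upload_timestamp < LAST_BACKUP_TIME_MS:
        continue  # version predates the last backup; not interesting here
    prefix = file_version.file_name.split("/", 1)[0]
    key = (prefix, file_version.action)  # action is 'upload' or 'hide'
    new_count[key] += 1
    new_bytes[key] += file_version.size

for (prefix, action), count in sorted(new_count.items()):
    print(f"{prefix} {action}: files={count} bytes={new_bytes[(prefix, action)]}")
```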
Thanks for your quick reply. Some info on our setup:
Is that the b2feedback@backblaze email? I am happy to send a request wherever is useful. I understand that you guys can't do anything here without it being in the API first. And yeah, scraping the web site is definitely a no-go.
Yeah, I was thinking of updating our backup script to touch the "canary" file before every backup, so that way the timestamp will change. Another option is to pick a dataset that we know changes every day, and use that as the canary. Not perfect, but it'll probably work.
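As a hedged sketch of the monitoring side of that canary idea (assuming b2sdk v2; the canary path and the two-day threshold are invented for the example):

```python
# Sketch: alert if the canary file has not been re-uploaded recently.
import time

from b2sdk.v2 import B2Api, InMemoryAccountInfo

CANARY = "monitoring/canary.txt"   # hypothetical path of the canary object
MAX_AGE = 2 * 24 * 3600            # alert if nothing touched it for two days

api = B2Api(InMemoryAccountInfo())
api.authorize_account("production", "<applicationKeyId>", "<applicationKey>")
bucket = api.get_bucket_by_name("<bucket-name>")

canary_version = None
for file_version, _ in bucket.ls(folder_to_list="monitoring", recursive=True):
    if file_version.file_name == CANARY:
        canary_version = file_version
        break

if canary_version is None:
    raise SystemExit("canary file is missing entirely")

age = time.time() - canary_version.upload_timestamp / 1000  # timestamp is in ms
if age > MAX_AGE:
    raise SystemExit(f"canary is {age / 3600:.1f} hours old - backups may be stalled")
```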
It's true that we only care about the rate of change for the number/size of files. But if we send instantaneous numbers to Prometheus, then we can use it to get the rate of change out of the timeseries and alert on that. Anyway, as it stands now, here's what I think we can probably do:
Open to any more ideas/advice though. Thanks for all the great ideas so far.
Do you know the nature of the changes? Are the files replaced, rewritten completely, modified, or removed (and new ones created)? Are large files affected differently than small ones? Are files compressible? Are there duplicate files? How is your retention policy set up? Does the script which calls the b2 CLI check its exit code?

The email I mentioned is any B2 support address (I doubt they have specialized first-line teams for different kinds of B2 issues, so it all lands in the same pool of tickets anyway).

We considered consistency of backups a while ago. Imagine a system that only has two files, a data file and an index file (sort of like .tar but split in two). Now, if a server succeeded in uploading one of them but failed to upload the other, a restore from such a cloud state would not recover the functionality of the system. Sometimes only consistent backups make sense.

I'm asking so many questions because it's rare to find a user that I can talk to, and planning future features in my head is easier when I know how people use the thing. I hope you understand :)
Jumping in to help answer a few questions (@jessicaaustin and I work at the same place).
Kind of all over the board, depending on the bucket. Some files are compressible, some already have compression applied. Duplicate files are a possibility but not common.
Generally 30 days, but again, it varies by bucket.
Yes, we use
We're tracking the exit code to send error notifications with logs when the backup fails. Since most of our buckets are large and contain many thousands/millions of files, we often get some noise about API nodes being too busy and end up with "backup incomplete" messages (as mentioned here, it would be nice to retry those failures indefinitely, or up to a defined timeout), but it works to keep us informed. However, as you guessed, we've been seeing backups run/hang for months, so we never get an alert. We recently adjusted default timeouts to be much lower (one day) and enabled stdout logging to investigate why they're hanging.
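Not something from this thread's tooling, just one way the "retry until an overall deadline" idea could be wrapped around the CLI using only the Python standard library; the command, the one-day deadline and the five-minute pause are made-up values:

```python
# Sketch: re-run "b2 sync" on failure until it succeeds or a deadline passes.
import subprocess
import time

CMD = ["b2", "sync", "/backups/borg", "b2://my-bucket/borg"]  # illustrative
DEADLINE = time.monotonic() + 24 * 3600  # give up after one day
PAUSE = 300                              # wait five minutes between attempts

while True:
    remaining = DEADLINE - time.monotonic()
    if remaining <= 0:
        raise SystemExit("backup incomplete: overall deadline exceeded")
    try:
        result = subprocess.run(CMD, timeout=remaining)
    except subprocess.TimeoutExpired:
        raise SystemExit("backup incomplete: overall deadline exceeded")
    if result.returncode == 0:
        break  # sync finished cleanly; report success to monitoring here
    time.sleep(min(PAUSE, max(0.0, DEADLINE - time.monotonic())))
```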
Yes, some of our backups include BorgBackup repositories, which are directories of inter-related files, much like git repositories:
If any file in the repo fails to sync, the resulting backup will be in an inconsistent state.
This is interesting, because you use B2 along with the b2 CLI as a sort of backup appliance. You might find the new option to sync from a past point in time interesting... but I don't think it is good enough in your case, as the backup cycle may exceed a day. In some cases (though maybe not in yours), using sync to update a borg repo once per day, when the single operation takes more than a day, will make restoring a consistent version very hard, if possible at all.

It seems like "sync sessions" may solve many of the problems that you have: they would let you track the status of your backup operations, and they would let you choose a session to restore from, knowing which sessions were consistent and which were not. This would enable "last backup completion time" for the bucket, but with more granularity (there can be many namespaces ("subdirectories") in the bucket with different keys and backup cycles, and you could monitor them individually).

Would that work for you?
Yes, we may need to break up some large b2 backup jobs into smaller chunks to make sure they complete before the next local backup cycle starts; that's on us. The bigger issue, I think, is making sure that all files in a b2 sync get backed up successfully so that a consistent backup is ensured.
Sounds promising. I hope that the namespaces in the bucket would be optional, and that you could treat the entire bucket as a single namespace, as my instinct would be to just break those namespaces into their own buckets. To clarify, would each session record the full list of files (and attributes/stats on those files) that were backed up during that session?
Yeah, that seems like an easy win, possibly with a ...
Please observe that there is a limit of 100 buckets per account, so separating everything into its own bucket may not be possible. The namespaces would be optional, and the session would include the number of files; I think that's enough. The idea is that it should be possible to say whether a session was completed in full or not: if we know how many files there were supposed to be, we can do a full listing and compare the counts.

If the numbers don't match, then maybe the session is still being written, maybe the writer has died and will never complete it, or maybe a file was deleted between the time the writer planned to upload it and the time the upload actually happened - that's hard to say. But still, we know what is consistent and what is not.

Does it make sense?
We may want to upload the session file twice: once when opening the session, to indicate the list of files to be uploaded, and then a second time where the list could also have a fileId associated with every file (and also size and checksum, maybe?). Then the consistency of the session can be deeply verified by inspecting those file IDs.

Finally, periodic intermediate uploads of the session file could happen to report partial progress.
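To make that concrete, here is a purely hypothetical sketch of what the two session files and the consistency check could look like; none of these names, fields or file layouts exist in B2 or b2sdk today, they only illustrate the proposal:

```python
# Hypothetical "sync session" manifests and consistency check.
from typing import Optional

# Uploaded when the session is opened: the plan.
session_open = {
    "session_id": "2019-03-01T02:00:00Z",
    "planned_files": ["data/db.tar", "data/db.idx"],
}

# Uploaded again when the writer believes it has finished: the result,
# now carrying the fileId (and maybe size/checksum) of every upload.
session_close = {
    "session_id": "2019-03-01T02:00:00Z",
    "uploaded_files": {
        "data/db.tar": {"file_id": "4_z...", "size": 1073741824},
        "data/db.idx": {"file_id": "4_z...", "size": 52428800},
    },
}

def session_is_consistent(opened: dict, closed: Optional[dict]) -> bool:
    """A session is consistent only if every planned file was actually uploaded."""
    if closed is None:
        # The writer never closed the session: still running, or it died.
        return False
    return set(opened["planned_files"]) == set(closed["uploaded_files"])

print(session_is_consistent(session_open, session_close))  # True
```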
We are in the process of setting up monitoring and alerts for our backblaze backups, so we are notified if one of our backup processes stops working.
Some metrics I'd like to track per bucket are:

- total size and number of files
- time since last updated
- number of versions
For total size and number of files: I found Backblaze/B2_Command_Line_Tool#404, which adds `--showSize` to the CLI. But I looked at the code that calculates this, and it recursively looks at every file in the bucket and adds it up. That's simply not going to perform well. (It worked fine for a small bucket, but when I tried it for one of our larger buckets I didn't get a result after 15 minutes of waiting and killed the process.) What's strange is, I can see this info on the Backblaze website, so it seems like you guys know these stats per bucket? Is there a chance it could be exposed somehow?

For time since last updated, we could add a "canary file" to the root of each bucket, make sure it gets updated regularly, and check that... but it would be far less brittle if Backblaze provided this info. Do you store this info? If so, is there any way to access it?

For number of versions, I could parse out `lifecycle_rules` from the bucket info, so that's fine.

Any guidance here? We use Prometheus for metrics, so the plan is to use this Python SDK and write a simple client to export the above metrics. It would be generic enough that we could open-source the project so others could use it. But as it stands right now, I can't figure out a way to get enough information to create useful metrics in a generic way.
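A bare-bones sketch of such an exporter, assuming `prometheus_client` and b2sdk v2 (the metric names, port and bucket list are invented); note that it still performs the full bucket scan discussed above, so it inherits the same performance problem on large buckets:

```python
# Sketch of a per-bucket exporter: full-scan metrics exposed as Prometheus gauges.
import time

from b2sdk.v2 import B2Api, InMemoryAccountInfo
from prometheus_client import Gauge, start_http_server

BUCKET_BYTES = Gauge("b2_bucket_size_bytes", "Total size of latest file versions", ["bucket"])
BUCKET_FILES = Gauge("b2_bucket_file_count", "Number of latest file versions", ["bucket"])
LAST_UPLOAD = Gauge("b2_bucket_last_upload_seconds", "Unix time of the newest upload", ["bucket"])

def scrape_bucket(api: B2Api, bucket_name: str) -> None:
    bucket = api.get_bucket_by_name(bucket_name)
    total_bytes = total_files = newest_ms = 0
    for file_version, _ in bucket.ls(recursive=True):  # latest versions only
        total_bytes += file_version.size
        total_files += 1
        newest_ms = max(newest_ms, file_version.upload_timestamp)
    BUCKET_BYTES.labels(bucket_name).set(total_bytes)
    BUCKET_FILES.labels(bucket_name).set(total_files)
    LAST_UPLOAD.labels(bucket_name).set(newest_ms / 1000)

if __name__ == "__main__":
    api = B2Api(InMemoryAccountInfo())
    api.authorize_account("production", "<applicationKeyId>", "<applicationKey>")
    start_http_server(9584)  # arbitrary example port for Prometheus to scrape
    while True:
        for name in ("backups-db", "backups-media"):  # example bucket names
            scrape_bucket(api, name)
        time.sleep(3600)  # one full scan per hour; too slow for huge buckets
```

The rate of change mentioned earlier can then be derived on the Prometheus side (e.g. `delta()` over these gauges), so only instantaneous values need to be exported.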
Thanks!