How to calculate metrics? Bucket size, last updated, etc #137

Open
jessicaaustin opened this issue Jun 17, 2020 · 8 comments
Labels
enhancement New feature or request

Comments

@jessicaaustin

jessicaaustin commented Jun 17, 2020

We are in the process of setting up monitoring and alerts for our backblaze backups, so we are notified if one of our backup processes stops working.

Some metrics I'd like to track per bucket are:

  • time since last updated (timestamp of most recently uploaded file)
    • this one is the most critical -- is our backup process running at all?
  • total size and number of files
    • we are adding data daily, so in general these numbers should slowly increase over time
  • number of versions
    • to make sure config didn't get accidentally messed up for buckets that should have versions

For total size and number of files -- I found Backblaze/B2_Command_Line_Tool#404, which adds --showSize to the CLI. But I looked at the code that calculates this, and it recursively lists every file in the bucket and adds up the sizes. That's simply not going to perform well. (It worked fine for a small bucket, but when I tried it on one of our larger buckets I still had no result after 15 minutes of waiting and killed the process.) What's strange is that I can see this info on the Backblaze website, so it seems like you guys already track these stats per bucket? Is there a chance they could be exposed somehow?
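
For reference, here's roughly what that scan amounts to with this SDK (a minimal sketch assuming a b2sdk v1-style API; attribute names like size and upload_timestamp may differ between SDK versions) -- it has to walk every file, which is why it's slow on large buckets:

```python
from b2sdk.v1 import InMemoryAccountInfo, B2Api

def bucket_totals(api, bucket_name):
    """Walk every file in the bucket and add up count and size.

    Same O(number of files) scan that --showSize does, so it is slow
    and costs list transactions on buckets with millions of files.
    """
    bucket = api.get_bucket_by_name(bucket_name)
    total_files = 0
    total_bytes = 0
    newest_upload_ms = 0
    for file_version, _folder in bucket.ls(recursive=True):
        total_files += 1
        total_bytes += file_version.size
        newest_upload_ms = max(newest_upload_ms, file_version.upload_timestamp)
    return total_files, total_bytes, newest_upload_ms

api = B2Api(InMemoryAccountInfo())
api.authorize_account("production", "<application-key-id>", "<application-key>")
print(bucket_totals(api, "my-backup-bucket"))
```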

For time since last updated, we could add a "canary file" to the root of each bucket, make sure it gets updated regularly, and check that... but it would be far less brittle if backblaze provided this info. Do you store this info? If so, is there any way to access it?

For number of versions, I could parse out lifecycle_rules from the bucket info, so that's fine.

Any guidance here? We use prometheus for metrics, so the plan is to use this python sdk and write a simple client to export the above metrics. It would be generic enough that we could open-source the project so others could use it. But as it stands right now, I can't figure out a way to get enough information to create useful metrics in a generic way.
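
To make that plan concrete, the exporter side could look something like this (a sketch only, using the prometheus_client package; the metric names and port are made up):

```python
import time
from prometheus_client import Gauge, start_http_server

# Hypothetical metric names, for illustration only.
FILES = Gauge("b2_bucket_files_total", "Number of files in the bucket", ["bucket"])
SIZE = Gauge("b2_bucket_size_bytes", "Total size of the bucket in bytes", ["bucket"])
LAST_UPLOAD = Gauge("b2_bucket_last_upload_timestamp_seconds",
                    "Upload time of the newest file in the bucket", ["bucket"])

def export_bucket_metrics(metrics_by_bucket):
    """Publish per-bucket numbers, e.g. the (files, bytes, newest_upload_ms)
    tuples produced by the bucket_totals() sketch above."""
    for bucket_name, (files, size_bytes, newest_ms) in metrics_by_bucket.items():
        FILES.labels(bucket=bucket_name).set(files)
        SIZE.labels(bucket=bucket_name).set(size_bytes)
        LAST_UPLOAD.labels(bucket=bucket_name).set(newest_ms / 1000.0)

if __name__ == "__main__":
    start_http_server(9184)        # any free port Prometheus can scrape
    while True:
        export_bucket_metrics({})  # plug in numbers gathered via the SDK here
        time.sleep(3600)
```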

Thanks!

@ppolewicz
Collaborator

This is sort of the same discussion as in Backblaze/B2_Command_Line_Tool#624 - the API does not expose it, and getting it through the web panel is a security issue. I hope B2 server developers one day add those things to the API, but as of now there is not much the b2-sdk-python and b2cli projects can do about it. The proper way to request this to be added to the server is to email B2 support - they then pass it to someone for consideration (I cannot do that myself).

I think the unsatisfactory performance of --showSize could be somewhat improved, but the code for that would not be elegant, so I would rather not go that way. Still, if you are willing to explore that direction, let's do it - maybe we'll discover something useful.

As for monitoring of your backups:

time since last updated (timestamp of most recently uploaded file)

if no files change, monitoring will report a backup failure even though the backup is working correctly and it's the source that isn't changing (unless your backup application writes something anyway when an incremental backup is empty, but b2cli sync doesn't)

total size and number of files

I think what you really need is the number and size of file versions newer than a certain point (the time of the last backup, assuming backups don't overlap), and a scan is needed for this. You don't really care about old versions of a file, but about the new stuff. The metric is then actually useful - if the number of new uploads grows or dies down, it can be correlated with other changes in your infrastructure to investigate and troubleshoot issues with the backup system.
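
Something along these lines (a rough sketch against a b2sdk v1-style API, where `api` is an already-authorized B2Api; the show_versions parameter is named differently in newer SDK versions):

```python
def new_versions_since(api, bucket_name, cutoff_epoch_seconds):
    """Count and size of file versions uploaded after the last backup run.

    Still a full scan -- the server has no filter by upload time -- but the
    resulting numbers describe new activity rather than total bucket size.
    """
    bucket = api.get_bucket_by_name(bucket_name)
    count = 0
    size = 0
    for file_version, _folder in bucket.ls(show_versions=True, recursive=True):
        if file_version.upload_timestamp / 1000.0 >= cutoff_epoch_seconds:
            count += 1
            size += file_version.size
    return count, size
```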

You could also try to track the eviction of old file versions. Imagine a certain file being overwritten at the source every day - then on the backup media you would have different versions stored for some time, according to the retention policy. If scanning (and a bit of storage) is an option, your monitoring system can separately track the count and size of ADDED files, but it can also track HIDDEN and REMOVED file count and size, and even group them by prefix. If one of your backed-up systems drops a lot of data, you might want to see that on a chart in your backup monitoring system so that you can restore the data before the retention policy sweeps it up.

Can you share some details about your operation? How many systems are there, what backup application do you use, how many files are there, how do they change (are new ones added, are large files modified in place / appended to / etc.), can the monitoring system access the source, does the source have an index of some sorts on it (like the one for locate)?

@ppolewicz added the more-information-needed label on Jun 18, 2020
@jessicaaustin
Author

Thanks for your quick reply. Some info on our setup:

  • We are running in a self-hosted datacenter and have full control over the setup
  • Yes, the monitoring system can access the source
  • We have 15 buckets, and out of those there are 9 that we actively push to every day
  • The data in these buckets is generally the output from sensors (think IoT or weather stations), environmental models (like weather forecasts), and scientific datasets. So it's generally increasing in size as we collect data each day
  • The buckets range in size from a few GB to 30TB, with most of them being >1TB. Definitely not all of the files in those buckets are getting updated every day, maybe 1-25% depending on the bucket.
    • to give a concrete example, one bucket currently has 19,882,172 files and is 1,168.3 GB and about 10-15% of those files change daily (the vast majority are existing files that are updated)
  • To run the backups, we have a lightweight script that calls the b2 cli. This runs in a docker container and is triggered essentially by a cron job
  • "does the source have an index of some sorts on it (like the one for locate)" -- not that i'm aware of but i can check

The API does not expose it, and getting it through the web panel is a security issue. I hope B2 server developers one day add those things to the API, but as of now there is not much the b2-sdk-python and b2cli projects can do about it. The proper way to request this to be added to the server is to email B2 support - they then pass it to someone for consideration (I cannot do that myself).

Is that the b2feedback@backblaze email? I am happy to send a request wherever is useful. I understand that you guys can't do anything here without it being in the API first. And yeah, scraping the web site is definitely a no-go.

if no files change, monitoring will report a backup failure even though the backup is working correctly and it's the source that isn't changing (unless your backup application writes something anyway when an incremental backup is empty, but b2cli sync doesn't)

Yeah, I was thinking of updating our backup script to touch the "canary" file before every backup, so that way the timestamp will change. Another option is to pick a dataset that we know changes every day, and use that as the canary. Not perfect, but it'll probably work.
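
Roughly like this, for the record (a sketch assuming a b2sdk v1-style API; the canary file name is made up):

```python
import time

CANARY_NAME = "backup-canary.txt"  # hypothetical name, one per bucket root

def touch_canary(bucket):
    """Upload a tiny canary before each backup so the newest timestamp moves."""
    bucket.upload_bytes(str(time.time()).encode(), CANARY_NAME)

def canary_age_seconds(bucket):
    """How long ago the canary was uploaded, or None if it is missing."""
    for file_version, _folder in bucket.ls(recursive=False):  # root of the bucket
        if file_version.file_name == CANARY_NAME:
            return time.time() - file_version.upload_timestamp / 1000.0
    return None
```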

I think what you really need is the number and size of file versions newer than a certain point

It's true that we only care about rate of change for number/size of files. But if we send instantaneous numbers to prometheus, then we can use it to get the rate of change out of the timeseries, and alert on that.

Anyway, as it stands now here's what I think we can probably do:

  • time since last updated
    • this one is the most critical -- is our backup process running at all? and it's also the most likely to fail
    • we'll probably go with the "canary files" approach here since it doesn't sound like this info is stored by backblaze
    • if we're making a feature request I'll request this one as well, since it would be a great generic metric that others could use
  • total size and number of files
    • I think for now, given the size of our buckets, we'll probably just post a feature request and wait for it to come out

Open to any more ideas/advice though. Thanks for all the great ideas so far.

The no-response bot removed the more-information-needed label on Jun 18, 2020
@ppolewicz
Collaborator

Do you know the nature of the changes? Are the files replaced, rewritten completely, modified or removed (and new are created)? Are large files affected differently than the small ones? Are files compressible? Are there duplicate files? How is your retention policy set up?

Does the script which calls b2 cli use the sync command? Because if it does, then it so happens that sync has to do a full scan of the remote side anyway. Theoretically we could add a --stats parameter which would display the summary of that scan, number of uploaded files, their total size, number of deleted files etc. Currently sync does not return much, maybe that needs to change - we already have this information, you paid for the transactions to get it - but it is just thrown away.

Another idea I have is that sync returns an exit code. Monitoring that one is, uh, critical. Are you tracking the exit code? Of course, if sync hangs for some reason, it will never exit and you will never know, so this must be taken into account too.
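
For example, something like this in the wrapper script covers both the exit code and the hang case (a sketch using plain subprocess, nothing B2-specific; the 24h ceiling is arbitrary):

```python
import subprocess

def run_sync(source_dir, bucket_url, max_seconds=24 * 3600):
    """Run b2 sync, report a non-zero exit code, and never hang forever."""
    try:
        result = subprocess.run(
            ["b2", "sync", source_dir, bucket_url],
            capture_output=True, text=True, timeout=max_seconds,
        )
    except subprocess.TimeoutExpired:
        return False, "sync exceeded %d seconds and was killed" % max_seconds
    return result.returncode == 0, result.stderr
```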

The email I mentioned is any of the B2 support addresses (I doubt they have specialized first-line teams for different kinds of B2 issues, so it all lands in the same pool of tickets anyway).

We considered consistency of backups a while ago. Imagine a system that only has two files, a data file and an index file (sort of like .tar but split in two). Now, if a server succeeded in uploading one of them but failed to upload the other, a restore from such a cloud state would not recover the functionality of the system. Sometimes only consistent backups make sense. sync doesn't really know much about this; there is no backup session, no log which would clearly state which session was completed successfully and which wasn't. If this was added, sync would gain the ability to inspect the current and past sessions, which could be helpful in monitoring.

I'm asking so many questions because it's rare to find a user that I can talk to and planning future features in my head is easier when I know how people use the thing. I hope you understand :)

@srstsavage

Jumping in to help answer a few questions (@jessicaaustin and I work at the same place).

Do you know the nature of the changes? Are the files replaced, rewritten completely, modified or removed (and new are created)? Are large files affected differently than the small ones? Are files compressible? Are there duplicate files?

Kind of all over the board, depending on the bucket. Some files are compressible, some already have compression applied. Duplicate files are a possibility but not common.

How is your retention policy set up?

Generally 30 days, but again, it varies by bucket.

Does the script which calls b2 cli use the sync command? Because if it does, then it so happens that sync has to do a full scan of the remote side anyway. Theoretically we could add a --stats parameter which would display the summary of that scan, number of uploaded files, their total size, number of deleted files etc. Currently sync does not return much, maybe that needs to change - we already have this information, you paid for the transactions to get it - but it is just thrown away.

Yes, we use sync to backup local directory trees to b2. Adding --stats is an interesting idea, especially because, as you point out, we're already paying the cost. If there was an option to output the stats in json format we could potentially parse it and feed the results to a prometheus push gateway, and also include the current time as the "last backup completion time" for the bucket. This would be less desirable than getting this information from the b2 api itself since we're relying on the output of the backup tool instead of asking the archive directly, but it would be better than what we have now.
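
Something like this is what I'm imagining on our side (a sketch only: --stats/--json are the proposed flags from above and don't exist today, the JSON field names are made up, and push_to_gateway is from prometheus_client):

```python
import json
import subprocess
import time
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

def sync_and_push(source_dir, bucket_url, gateway="pushgateway:9091"):
    # NOTE: --stats/--json and the fields below are hypothetical.
    result = subprocess.run(
        ["b2", "sync", "--stats", "--json", source_dir, bucket_url],
        capture_output=True, text=True,
    )
    stats = json.loads(result.stdout)

    registry = CollectorRegistry()
    Gauge("b2_sync_uploaded_files", "Files uploaded in the last sync",
          registry=registry).set(stats["uploaded_files"])
    Gauge("b2_sync_uploaded_bytes", "Bytes uploaded in the last sync",
          registry=registry).set(stats["uploaded_bytes"])
    Gauge("b2_sync_last_completion_timestamp_seconds",
          "When the last successful sync finished", registry=registry).set(time.time())
    push_to_gateway(gateway, job="b2_backup", registry=registry)
```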

Another idea I have is that sync returns an exit code. Monitoring that one is, uh, critical. Are you tracking the exit code? Of course, if sync hangs for some reason, it will never exit and you will never know, so this must be taken into account too.

We're tracking the exit code to send error notifications with logs when the backup fails. Since most of our buckets are large and contain many thousands/millions of files, we often get some noise about API nodes being too busy and end up with "backup incomplete" messages (as mentioned here, it would be nice to retry those failures indefinitely, or up to a defined timeout), but it works to keep us informed. However, as you guessed, we've been seeing backups run/hang for months, so we never get an alert. We recently adjusted default timeouts to be much lower (one day) and enabled stdout logging to investigate why they're hanging.

We considered consistency of backups a while ago. Imagine a system that only has two files, a data file and an index file (sort of like .tar but split in two). Now, if a server succeeded in uploading one of them but failed to upload the other, a restore from such a cloud state would not recover the functionality of the system. Sometimes only consistent backups make sense. sync doesn't really know much about this; there is no backup session, no log which would clearly state which session was completed successfully and which wasn't. If this was added, sync would gain the ability to inspect the current and past sessions, which could be helpful in monitoring.

Yes, some of our backups include BorgBackup repositories, which are directories of inter-related files, much like git repositories:

$ tree --filelimit=10 example_borg_repo/
example_borg_repo/
├── config
├── data
│   └── 0 [498 entries exceeds filelimit, not opening dir]
├── hints.852
├── index.852
├── integrity.852
└── README

If any file in the repo fails to sync, the resulting backup will be in an inconsistent state.

@ppolewicz
Collaborator

This is interesting, because you use b2 along with b2cli as a sort of backup appliance. You might find the new option to sync from a past time interesting... But I don't think this is good enough in your case, as the backup cycle may exceed a day. In some cases (though maybe not in yours), using sync to update a borg repo once per day, if the single operation takes more than a day, will make restoring a consistent version very hard, if possible at all.

It seems like "sync sessions" may solve many of the problems that you have: it would let you track the status of your backup operations, it would let you choose a session to restore from and you would know which session was consistent and which was not. This would enable "last backup completion time" in the bucket, but with more granularity (there can be many namespaces ("subdirectories") in the bucket with different keys and backup cycles and you could monitor them individually).

Would --backup-window (instead of "5 retries of each file"), specified in number of seconds sync is allowed to work for, be a good solution for the transient issues you are sometimes facing?

@ppolewicz added the enhancement label on Jul 4, 2020
@srstsavage

This is interesting, because you use b2 along with b2cli as a sort of backup appliance. You might find the new option to sync from a past time interesting... But I don't think this is good enough in your case, as the backup cycle may exceed a day. In some cases (though maybe not in yours), using sync to update a borg repo once per day, if the single operation takes more than a day, will make restoring a consistent version very hard, if possible at all.

Yes, we may need to break up some large b2 backup jobs into smaller chunks to make sure that they complete before the next local backup cycle starts, that's on us. The bigger issue I think is making sure that all files in a b2 sync get backed up successfully so that a consistent backup is ensured.

It seems like "sync sessions" may solve many of the problems that you have: it would let you track the status of your backup operations, it would let you choose a session to restore from and you would know which session was consistent and which was not. This would enable "last backup completion time" in the bucket, but with more granularity (there can be many namespaces ("subdirectories") in the bucket with different keys and backup cycles and you could monitor them individually).

Sounds promising. I hope that the namespaces in the bucket would be optional, and that you could treat the entire bucket as a single namespace, as my instinct would be to just break those namespaces into their own buckets. To clarify, would each session record the full list of files (and attributes/stats on those files) that were backed up during that session?

Would --backup-window (instead of "5 retries of each file"), specified in number of seconds sync is allowed to work for, be a good solution for the transient issues you are sometimes facing?

Yeah, that seems like an easy win, possibly with a 0 value to allow the backup to run indefinitely (although in practice you'd probably want a large valued sanity ceiling).

@ppolewicz
Collaborator

Please observe that there is a limit of 100 buckets per account. Separating everything into its own bucket may not be possible.

The namespaces would be optional, and the session would include the number of files. I think that's enough. The idea is that it should be possible to say whether a session was completed in full or not, so if we know how many files there were supposed to be and we can do a full b2_list_file_versions, we can compare the number of files that were planned to be uploaded with the number actually stored in the cloud.

If the numbers don't match, then maybe the session is still being written, maybe the writer has died and will never complete it, or maybe the file was deleted between the time the writer planned to upload it and when the upload actually happened - that's hard to say. But we still know what is consistent and what is not. Does it make sense?
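
In rough code terms (a sketch; the session manifest and its planned file count are hypothetical, nothing like it exists in sync today, and `api` is an authorized B2Api from this SDK):

```python
def session_looks_complete(api, bucket_name, session_prefix, planned_file_count):
    """Compare the number of file versions actually stored under a session
    prefix with the count recorded when the session was opened."""
    bucket = api.get_bucket_by_name(bucket_name)
    stored = sum(1 for _ in bucket.ls(session_prefix, recursive=True))
    # Fewer files than planned: still being written, writer died, or a source
    # file vanished before upload -- either way, not known to be consistent.
    return stored >= planned_file_count
```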

@ppolewicz
Collaborator

We may want to upload the session file twice: once when opening the session, to indicate the list of files to be uploaded, and then a second time, when the list could also have a fileId associated with every file (and maybe also size and checksum?). Then the consistency of the session could be deeply verified by inspecting those file_ids with b2api.get_file_info(), rather than scanning the entire bucket to see if the number of files actually matched. That could also make restore faster (especially for large buckets, as it'd skip the scan if restoring from a session). Sync would need new syntax, like b2 sync b2session://bucket/namespace_path/session_id ., b2 sync --list-sessions bucket namespace_path, etc.

Finally, periodic intermediate uploads of the session file could happen to report partial progress; then b2 sync --list-sessions --json bucket optional/namespace/path would show everything there is to know, even to a remote peer.
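
The deep verification step could then look something like this (a sketch; the manifest format is hypothetical, while b2api.get_file_info() is real but returns different types depending on the SDK version):

```python
def verify_session(api, manifest):
    """Check that every fileId recorded in the (hypothetical) session manifest
    still resolves on the server, without scanning the whole bucket."""
    missing = []
    for entry in manifest["files"]:  # e.g. {"name": ..., "file_id": ..., "size": ...}
        try:
            api.get_file_info(entry["file_id"])
        except Exception:  # the SDK raises a B2-specific error for unknown ids
            missing.append(entry["name"])
    return missing
```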
