Back up Rust releases and crates #122

Open
4 tasks
jdno opened this issue Jul 4, 2024 · 7 comments

@jdno
Member

jdno commented Jul 4, 2024

Currently, all of Rust's releases and all crates are stored on AWS. While we have multiple measures in place to prevent accidental deletion of releases or crates, e.g. bucket replication to a different region and restricted access, our current setup does not sufficiently protect us against a few threats:

  • The threat model for Rust's infrastructure, created by the Rust Foundation's security engineer, highlights the risk of an AWS account compromise. If a malicious actor were able to gain administrator access to our AWS account, they could bypass many of our safeguards and delete data.
  • Google accidentally deleted a customer account recently. If this happened to our AWS account, we could potentially lose our data and our backups.

Therefore, we want to set up automated out-of-band backups for both Rust releases and crates. These backups will be hosted in GCP and have totally separate access controls compared to AWS. Specifically, none of the current infra-admins should have access to this separate environment to protect against an account compromise.

Tasks

  • Investigate a synchronization mechanism between AWS and GCP
  • Design and propose separate access controls to the new environment
  • Get feedback from the Rust Foundation's security engineer on the proposed design
  • Implement the proposed solution
@jdno jdno added this to infra-team Oct 24, 2022
@jdno jdno converted this from a draft issue Jul 4, 2024
@jdno jdno moved this from Backlog to Ready in infra-team Jul 4, 2024
@marcoieni marcoieni self-assigned this Jul 22, 2024
@marcoieni
Member

marcoieni commented Jul 22, 2024

Proposed solution

  • Create one Google Object Storage for each AWS S3 bucket we want to back up.
  • Use Storage Transfer to automatically transfer the content of the S3 bucket into the Google Object Storage.

Execution plan

  1. With terraform, create the project where we want to store the objects and create an empty google object storage called crates-io. Set the storage class to "archive".
  2. With terraform, create a transfer for the crates-io bucket by entering the CloudFront domain cloudfront-static.crates.io. This step is documented in the s3-cloudfront docs. Select "Run every day" as scheduling option (let me know if you prefer hourly or weekly).
  3. For one week, monitor:
    • price: is it what we expected?
    • correctness: is the google bucket up-to-date with the S3 one?
  4. Do the same for the releases, i.e. static.rust-lang.org.
  5. Find a "monitoring system", which can range from "log in every week and manually check that everything is ok" to "configure an alert if the transfer job fails".

FAQ

Does Storage Transfer support AWS S3?

Yes. As you can see here Amazon S3 is supported as Source. Plus, it does not require agents or agent pools.

How much does everything cost?

TL;DR we should only pay for the Object Storage cost.

The Storage Transfer pricing explicitly says "No charges" for Agentless transfers. So traffic for Amazon S3 should be free.

Transferring from AWS CloudFront instead of S3 directly reduces the AWS egress cost. This cost should be negligible compared to the usual crates.io and releases traffic.

The cost of Object Storage depends on the storage class. The cost calculator is here.

Here's an estimate. Please edit "Class A" with the following number:

fn main() {
    // Placeholder inputs: replace the zeros with real numbers.
    let published: u64 = 0; // number of crates users publish every month
    let readme_percentage: f64 = 0.0; // fraction of crates with a readme (0.0 to 1.0)
    // each `cargo publish` publishes a ".crate" file
    // each `cargo publish` updates the .xml rss file
    let corresponding_rss = published;
    // each `cargo publish` renders the readme to display on crates.io
    let readmes = (published as f64 * readme_percentage).round() as u64;
    let class_a = published + corresponding_rss + readmes;
    println!("{class_a}");
}

We can drop the cost of this bucket by not storing readmes and rss.

  • Does anybody have the number of published crates every month?

It is important to estimate the number of published crates because if it's very high, "coldline storage" is cheaper than "archive storage" (try it yourself in the pricing calculator).

From my understanding, "Class A" doesn't increase much for releases, because they only happen with every Rust release. Crate publishes by users, on the other hand, are far more frequent.
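To illustrate why the publish rate decides between the two classes, here is a minimal sketch of the trade-off. The cost model (storage price per GB plus a price per 1,000 "Class A" operations) is a simplification, and the `ARCHIVE` and `COLDLINE` prices are placeholders picked so the crossover is easy to see; they are not real GCP prices, so plug in the numbers from the pricing calculator.

```rust
/// Price inputs for one storage class. Every value is supplied by the
/// caller; nothing here is a real GCP price.
struct StorageClass {
    name: &'static str,
    price_per_gb_month: f64,
    price_per_1k_class_a_ops: f64,
}

impl StorageClass {
    /// Monthly cost = storage cost + "Class A" operation cost.
    fn monthly_cost(&self, stored_gb: f64, class_a_ops: u64) -> f64 {
        stored_gb * self.price_per_gb_month
            + (class_a_ops as f64 / 1_000.0) * self.price_per_1k_class_a_ops
    }
}

/// Returns the name of the cheaper class for the given workload.
fn cheaper(a: &StorageClass, b: &StorageClass, stored_gb: f64, class_a_ops: u64) -> &'static str {
    if a.monthly_cost(stored_gb, class_a_ops) <= b.monthly_cost(stored_gb, class_a_ops) {
        a.name
    } else {
        b.name
    }
}

// Placeholder prices (hypothetical): cheap storage with expensive operations
// versus pricier storage with cheap operations.
const ARCHIVE: StorageClass = StorageClass {
    name: "archive",
    price_per_gb_month: 1.0,
    price_per_1k_class_a_ops: 5.0,
};
const COLDLINE: StorageClass = StorageClass {
    name: "coldline",
    price_per_gb_month: 4.0,
    price_per_1k_class_a_ops: 1.0,
};

fn main() {
    // Same amount of stored data, two different write rates.
    for ops in [100_000_u64, 1_000_000] {
        println!("{ops} Class A ops/month: {}", cheaper(&ARCHIVE, &COLDLINE, 1_000.0, ops));
    }
}
```

With these placeholder prices the low write rate favors "archive" and the high write rate favors "coldline", which is the effect described above.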

Can we back up only some paths of the bucket?

If we don't want to back up the entire bucket, we can use filters, which are supported in Agentless transfers.

However, I'm not sure if this solution works with CloudFront. Maybe we can just give the URL path we want to backup? E.g. static.crates.io/crates/.

Anyway, this shouldn't be necessary because we probably want to back up everything in the buckets (unless we realize we might save a lot of money by not backing up readmes and rss).
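For reference, prefix-based filtering could behave roughly like the sketch below. This is a simplified model written for illustration, not the Storage Transfer implementation, and the `crates/`, `readmes/` and `rss/` prefixes are assumptions about the bucket layout.

```rust
/// Decides whether an object key would be transferred, given include and
/// exclude prefixes. An empty include list means "everything is included".
fn should_transfer(key: &str, include: &[&str], exclude: &[&str]) -> bool {
    let included = include.is_empty() || include.iter().any(|p| key.starts_with(*p));
    let excluded = exclude.iter().any(|p| key.starts_with(*p));
    included && !excluded
}

fn main() {
    // Hypothetical keys: crate files, rendered readmes, and RSS feeds.
    let exclude = ["readmes/", "rss/"];
    let keys = [
        "crates/serde/serde-1.0.0.crate",
        "readmes/serde/serde-1.0.0.html",
        "rss/crates.xml",
    ];
    for key in keys {
        println!("{key}: {}", should_transfer(key, &[], &exclude));
    }
}
```

Excluding the readme and rss prefixes keeps only the `.crate` files in the backup, which is the cost-saving variant mentioned above.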

Do we need a multi-region backup for the object storage?

No. Multi-region only helps if we want to serve this data real-time and we want to have a fallback mechanism if a GCP region fails. We just need this object storage for backup purposes, so we don't need to pay double 👍

Questions

GCP region

Do you have a preference? Let's use the cost calculator to pick one of the cheapest regions 👍

Manual test

Should we add a step 0 where we test step 2 in a dummy GCP project without terraform? Just to validate our assumptions.

CDN for releases

I didn't put cloudfront- as a prefix for static.rust-lang.org because, from my understanding, we only serve releases via CloudFront, right?

Buckets

  • First, can you confirm that the S3 buckets we need to back up are the following:

  • Do we also need the index?

Useful docs

@marcoieni marcoieni moved this from Ready to In Progress in infra-team Jul 22, 2024
@jdno
Member Author

jdno commented Jul 23, 2024

The Storage Transfer pricing explicitly says "No charges" for Agentless transfers. So traffic for Amazon S3 should be free.

Transferring from AWS CloudFront instead of S3 directly reduces the AWS egress cost. This cost should be negligible compared to the usual crates.io and releases traffic.

These two statements seem to contradict each other. But I agree that, if we go through CloudFront, the egress costs for the backups should be absolutely negligible compared to our usual traffic. Even a full one-time backup of both releases and crates should only be in the region of ~90TB, which is marginal compared to our overall monthly volume.

Does anybody have the number of published crates every month?

This number should be easy to get from either the crates.io team or the Foundation's security engineer.

From my understanding, "Class A" doesn't increase much for releases, because they only happen with every Rust release.

Nightly releases are published every day. 😉 But the number of files that are created each day is still far lower than on crates.io.

unless we realize we might save a lot of money by not backing up readmes and rss

I would approach this from the perspective of "what do we need to back up", and then figure out how we pay for it afterwards. Given that this is intended as a backup for disaster recovery, we have a strong argument to find the necessary budget.

@jdno
Member Author

jdno commented Jul 23, 2024

A question that hasn't been addressed yet is the different access controls for the original files in AWS and the backups in GCP. Who will have access? How do we deal with any issues that monitoring might surface? Who will be able to investigate and resolve those?

@marcoieni
Member

These two statements seem to contradict each other

Let me clarify: traffic for Amazon S3 should be free on the GCP bill. We still pay the egress cost on the AWS bill 👍

Given that this is intended as a backup for disaster recovery, we have a strong argument to find the necessary budget.

Agreed 👍 I need to ask the Foundation's security engineer about this.

@marcoieni
Member

marcoieni commented Jul 24, 2024

Does anybody have the number of published crates every month?

I got an answer here
It's 1k per day, so the "Class A" cost is negligible 👍
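As a rough sanity check, the earlier estimate can be rerun with this number. The sketch below assumes a 30-day month and the worst case where every publish also writes one RSS file and one readme; both are assumptions made for the example, not measured values.

```rust
/// Upper bound on monthly "Class A" operations, assuming every publish
/// writes a ".crate" file, an RSS file, and a readme.
fn class_a_upper_bound(published_per_day: u64, days: u64) -> u64 {
    let published = published_per_day * days;
    let corresponding_rss = published;
    let readmes = published; // worst case: every crate has a readme
    published + corresponding_rss + readmes
}

fn main() {
    // 1k publishes per day (from the answer above), over an assumed 30-day month.
    println!("{}", class_a_upper_bound(1_000, 30)); // prints 90000
}
```

So the upper bound is on the order of 90k "Class A" operations per month.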

EDIT:

  • Status: waiting for feedback from the infra team.

  • Merge the HackMD into the service catalog and ask the team for a review: docs: document gcp backup #127

Bonus: try Terraform Cloud or Pulumi if I feel like it.

@marcoieni
Member

marcoieni commented Aug 20, 2024

Tasks:

  • Set up a staging environment in GCP to test. Back up the staging buckets. I can do one GCP project for staging and one for prod.
  • Set up users in Terraform.

Execution plan from hackmd:

  1. In the simpleinfra repo, with terraform, create the project where we want to store the objects and create an empty google object storage called crates-io. Set the storage class to "archive".
  2. Create a transfer for the crates-io bucket by entering the CloudFront domain cloudfront-static.crates.io. This step is documented in the s3-cloudfront docs (however, we want to do this in terraform). Select "Run every day" as scheduling option (hourly or weekly are also an option).
  3. For one week, monitor:
    • price: is it what we expected?
    • correctness: is the google bucket up-to-date with the S3 one?
  4. Do the same for the other buckets.
  5. The infra team works on a "monitoring system".
    • Initially, it's "login every week and manually check that everything is ok", i.e.:
      • Ensure the number of files and the size of the GCP buckets is the same as the respective AWS buckets by looking at the metrics
      • Ensure that only the authorized people have access to the account
    • Later we can prioritize "configuring an alert if the transfer job fails". E.g. we could create an alert in Datadog.
  6. Run the following test:
    • Upload a file in an AWS S3 bucket and check that it appears in GCP.
    • Edit the file in AWS and check that you can recover the previous version from GCP.
    • Delete the file in AWS and check that you can recover all previous versions from GCP.
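
The manual check in step 5 (same number of files and same size in both buckets) could be sketched as follows. `BucketStats`, `SyncStatus` and `compare` are hypothetical names for this example, and the numbers in `main` are made up; in practice they would be copied from the AWS and GCP metrics.

```rust
/// Object count and total size of one bucket, as read from its metrics.
#[derive(Debug, Clone, Copy, PartialEq)]
struct BucketStats {
    object_count: u64,
    total_bytes: u64,
}

#[derive(Debug, PartialEq)]
enum SyncStatus {
    InSync,
    /// Difference between source and backup (source minus backup).
    Mismatch { objects: i128, bytes: i128 },
}

/// Compares the source bucket against its backup.
fn compare(source: BucketStats, backup: BucketStats) -> SyncStatus {
    if source == backup {
        return SyncStatus::InSync;
    }
    SyncStatus::Mismatch {
        objects: source.object_count as i128 - backup.object_count as i128,
        bytes: source.total_bytes as i128 - backup.total_bytes as i128,
    }
}

fn main() {
    // Made-up numbers standing in for values read from the two consoles.
    let aws = BucketStats { object_count: 1_000, total_bytes: 5_000_000 };
    let gcp = BucketStats { object_count: 990, total_bytes: 4_900_000 };
    println!("{:?}", compare(aws, gcp));
}
```

Since the transfer runs once a day, a small mismatch right before the next run is probably expected, and if the backup bucket keeps old object versions its size may legitimately be larger than the source.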

Non blocking questions to be answered before closing this task:

  • We are backing up the public data of the crates-io database (db-dump in the crates-io bucket). Should we also back up the private data of the crates-io database? The concern here is that it contains private information (see this for the information that is included in and excluded from the database dump).

@marcoieni
Member

I'm waiting for the GCP account to be provisioned.

Projects
Status: In Review