DB Serialization for Blueprints #4793

Closed
smklein opened this issue Jan 10, 2024 · 5 comments · Fixed by #4899

smklein commented Jan 10, 2024

A "blueprint" is a concept arriving in main...dap/update-control-2 , which describes the set of software that should be running on hardware components, including versions and configurations.

The update planner will end up evaluating blueprints and generating new ones, but the "goal" blueprint describes the intended state of the system. This is a description of fleet-wide configuration intent, which has overlap with topics like sled addition and removal (See: #4787, #4719). During the update sync, we discussed that the "goal blueprint" could be the source of configuration information, such as "What DNS config should we deploy?"

This has been kind of a back-and-forth from a design perspective: is the state of sleds, disks, services, etc. "attached to the object" (e.g., in the form of a "State" row in the DB table), or is it "part of the blueprint" (e.g., if it's in the blueprint it's in-use; otherwise it is not in-use, and the state can be inferred from other factors)?

A lot of this discussion depends on the form of the blueprint when it's serialized to the database. This issue tracks that serialization specifically.

I'm basically filing this issue because I want to work on some other downstream tasks:

  • How do we decide which sleds are active, to add sled agent entries to DNS?
  • How do we decide which services are active, to propagate their entries to DNS?
  • How do we decide which services should be running on a sled, to send the corresponding PUT requests to their sled agents?
  • How do we decide which physical disks are active, to determine viable allocation targets?
  • How do we decide which physical disks are inactive, to determine which regions / services need migration?
@smklein added the database, nexus, and Update System labels on Jan 10, 2024

smklein commented Jan 10, 2024

Some dimensions to consider:

How many tables do we need to represent the blueprint?

At a base level, I'm assuming we have a blueprint structure like the following:

CREATE TABLE omicron.public.blueprint (
  id UUID NOT NULL,
  parent_id UUID,

  ...
)

But there's a question of "how do we represent the blueprint targets, like the set of active sleds?"

1. "The entirely in-DB approach", where we create a new table like this:

CREATE TABLE omicron.public.blueprint_sled_record (
  -- References `omicron.public.sled`
  sled_id UUID NOT NULL,
  -- References `omicron.public.blueprint`
  blueprint_id UUID NOT NULL
)

And we create these records anew for each blueprint.

Creating a new blueprint would mean:

  • BULK INSERT blueprint_sled_record
  • BULK INSERT for other "relationship" records
  • INSERT blueprint

Pros: Easy + fast to index using the DB alone when determining "which sleds do/don't belong to a particular blueprint"
Cons: Lots of rows to insert for new blueprints; harder to clean up old blueprints
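
A minimal sketch of that transaction, assuming bind parameters ($blueprint_id, $parent_blueprint_id) and using the omicron.public.sled table referenced above (copying every current sled into the new blueprint is just one possible policy):

BEGIN;

-- BULK INSERT blueprint_sled_record: one row per sled that should be
-- in the new blueprint.
INSERT INTO omicron.public.blueprint_sled_record (sled_id, blueprint_id)
  SELECT id, $blueprint_id FROM omicron.public.sled;

-- ...BULK INSERTs for the other "relationship" tables...

-- INSERT blueprint last, so the blueprint row only appears once its
-- relationship records are in place.
INSERT INTO omicron.public.blueprint (id, parent_id)
  VALUES ($blueprint_id, $parent_blueprint_id);

COMMIT;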

2. "The mostly dense blueprint, more in-memory approach"

Here we add an rcgen value to omicron.public.blueprint and store the blueprint's representation in memory. When we do operations that rely on that knowledge, we JOIN on the blueprint table's latest target and validate that rcgen is equal to what we read previously. In this case, the set of sleds in the blueprint could be stored as a UUID array: we wouldn't query for the "active set" in a DB operation; we'd check in memory and retry if our blueprint turned out to be out-of-date.

This would look something like the following:

CREATE TABLE omicron.public.blueprint (
  id UUID NOT NULL,
  parent_id UUID,
  -- Bumped whenever the blueprint's contents change.
  rcgen INT8 NOT NULL,

  sleds UUID[],
  ...
)

Pros: Much cheaper to create / maintain blueprints. We can use Rust rather than SQL to do the set management, which is kind of nice.
Cons: It's harder to do "fully within-the-DB" operations acting on the blueprint, since a lot of operations would need to compare that rcgen value to validate assumptions from values read in-memory.
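
For example, a write that depends on in-memory blueprint state might look like this sketch (the auxiliary table, row predicate, and bind parameters are placeholders, not real schema):

UPDATE <some auxiliary table> SET <new values>
  WHERE <row predicate>
    -- Only apply the write if the blueprint we read into memory still
    -- has the rcgen we saw; zero rows updated means "re-read and retry".
    AND EXISTS (
      SELECT 1 FROM omicron.public.blueprint
        WHERE id = $blueprint_id AND rcgen = $rcgen_read_in_memory
    );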


smklein commented Jan 11, 2024

How do we manage concurrency control with the blueprint?

If we use the latest blueprint as a "goalpost", there are a lot of downstream tasks that would like to consume this information (e.g., creating DNS records for the allocated services / sleds, deciding which disks are active/inactive, deciding which sleds are valid targets for instance allocation).

1. Store a marker on the blueprint to signify it's the latest, check it in downstream tasks

For example, this proposes the following structure:

CREATE TABLE omicron.public.blueprint (
  id UUID NOT NULL,
  parent_id UUID,
  ...
);

-- We do a similar thing for storing the "DB version"; the
-- singleton lets everyone access the "implied latest thing".
--
-- If we had a "fleet UUID", we could use that as primary
-- key here too. 
CREATE TABLE omicron.public.blueprint_metadata (
  singleton BOOL NOT NULL PRIMARY KEY,
  blueprint_id UUID UNIQUE REFERENCES blueprint (id),
  CHECK (singleton = true)
);

If an operation wants to read information about the latest blueprint, it can do so by accessing a cached, in-memory copy, and also validate that the blueprint has not changed by comparing against the metadata.

For example:

-- Do the new work based on the blueprint...
UPDATE <some auxiliary table> SET <new values>
  -- Validating that the old blueprint has not changed.
  -- This could be modified to return an explicit error, to make it more clear, if that would help.
  WHERE $old_blueprint_id = (SELECT blueprint_id FROM blueprint_metadata WHERE singleton = TRUE)

The idea here is optimistic concurrency control, similar to our usage of "rcgen" elsewhere (and we could insert generation numbers explicitly into the blueprint table too, if we wanted that extra little bit of information, but a boolean also demonstrates the concept).
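
As a sketch of the write side (my extrapolation, not spelled out above), making a newly-inserted blueprint the latest could itself be a conditional update on the singleton row:

UPDATE blueprint_metadata
  SET blueprint_id = $new_blueprint_id
  WHERE singleton = TRUE
    AND blueprint_id = $old_blueprint_id;
-- Zero rows updated means someone else moved the target first;
-- re-plan against the new target rather than clobbering it.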

We use a similar concept for concurrency control from Nexus -> external services (DNS servers take a generation number as input, as do sled agents when PUT-ing the set of Zones to be launched), which acts as a lease of validity. The generation number, to a downstream consumer, means "if this is higher than the highest number you've seen, it's a more up-to-date view of the world".

Within Nexus, there are many RPWs that may want to consume the contents of the blueprint. By using optimistic concurrency control, they can safely mutate the state of the database conditionally on the assumption that the blueprint has not changed while their calculations took place.

2. Operate entirely in the DB

This requires "The entirely in-DB approach" from this comment above: #4793 (comment).

Basically: When acting on some information from the blueprint, perform all the lookups and validation entirely within the context of a SQL operation.

For example, to select the set of active sleds:

SELECT sled_id
FROM blueprint_sled_record
  INNER JOIN blueprint ON blueprint_sled_record.blueprint_id = blueprint.id
WHERE blueprint.id = (SELECT blueprint_id FROM blueprint_metadata WHERE singleton = TRUE)

Unlike option (1), this requires no "in-memory caching" of blueprint information, since the records can be acted upon entirely within SQL. This option is arguably the "least likely to generate retries", but also the least flexible, as it requires all execution to be done purely in SQL, which may lean more heavily on CTEs (a rough sketch follows).
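
As a rough sketch of what that looks like in practice (the auxiliary table and columns are placeholders, as above):

WITH active_sleds AS (
  SELECT sled_id
  FROM blueprint_sled_record
  WHERE blueprint_id =
    (SELECT blueprint_id FROM blueprint_metadata WHERE singleton = TRUE)
)
-- e.g., flag everything on sleds that are no longer part of the target
-- blueprint, without materializing the blueprint in memory at all.
UPDATE <some auxiliary table> SET <new values>
  WHERE sled_id NOT IN (SELECT sled_id FROM active_sleds);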

3. Cache a "pretty recent" blueprint in-memory, hope for the best

This option omits concurrency control altogether. We read a "pretty recent" copy of the blueprint into memory, act on it, and keep it up-to-date periodically. Downstream tasks may operate on old versions of the blueprint, and we just trust that they'll be "eventually consistent".

This is definitely the simplest option, but it has drawbacks. Many different Nexus instances can act with different blueprints as "targets" simultaneously, which can cause problems for downstream tasks.

(Admission from Sean: I don't really understand how this option would be viable, but @davepacheco mentioned it to me in chat: "Also, the whole idea is phrasing planning+execution such that it's generally okay to attempt to execute an older blueprint" )


smklein commented Jan 16, 2024

Follow-up from a discussion with @jgallagher :

One more idea for "concurrency control" with the blueprint would be to store one or more generation numbers in the blueprint structure itself, and for downstream consumers to use those.

4. Create generation numbers within the blueprint

For example:

The DNS system currently has a version_added field, where it operates on a generation number. This allows the DNS service to receive records from "slow Nexus instances" and safely ignore them, because they're out-of-date.

One difficulty with blueprints is the issue of "deciding when and how to update this column".

  • Option 1: If we have multiple background tasks trying to consume a blueprint and update the "version_added" column, they need some way to ensure "the latest one is acting on the latest blueprint". This requires background tasks to check the blueprint table at the time they perform any update.

  • Option 2: (John's idea, I'm interested) If the creation of the blueprint also creates the generation number, background tasks can know to "only do work" if they're acting upon a generation number smaller than the one provided. This still forces background tasks to "check the DB" before doing work, but they can basically just check their local tables against the generation number from the blueprint, rather than querying "what is the latest blueprint".

Example data flow:

  • Initial state: Blueprint A, with generation number 1
  • New state: Blueprint B, with generation number 2
  • Generation Number as CRDB concurrency control: For a downstream task -- say, generating DNS records -- we can use the "new generation number" when updating CRDB records. This means new records are written with "generation number 2".
    • This prevents a slow Nexus from updating DNS records with info from Blueprint A after we've saved state from Blueprint B.
  • Generation Number as External Service concurrency control: When communicating with the DNS service, we also send the same generation number ("2") there.
    • This prevents a slow Nexus from updating the DNS server with records from Blueprint A after we've sent records from Blueprint B.

NOTE: The "Generation Number as External Service concurrency control" mechanism already exists today, and is also part of our API for requesting new services. However, the usage of "Generation Number as CRDB concurrency control" would be somewhat new -- in particular, the choice to have it be generated by the blueprint would be a novel behavior.
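
A minimal sketch of the "Generation Number as CRDB concurrency control" write (the downstream table and its generation column are assumptions, modeled on DNS's version_added):

UPDATE <downstream table> SET <new values>, generation = $blueprint_generation
  WHERE <row predicate>
    -- A slow Nexus still executing Blueprint A (generation 1) fails
    -- this predicate once records from Blueprint B (generation 2) land.
    AND generation < $blueprint_generation;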

davepacheco commented

> The update planner will end up evaluating blueprints and generating new ones, but the "goal" blueprint describes the intended state of the system.

#4804 calls this the "target" blueprint. A previous version did call it "goal" (and let me know if I missed a spot!). I don't really care which we use but I think we should use consistent terminology.

> If we use the latest blueprint as a "goalpost", there are a lot of downstream tasks that would like to consume this information (e.g., creating DNS records for the allocated services / sleds, deciding which disks are active/inactive, deciding which sleds are valid targets for instance allocation).

I think it may be important to distinguish uses of the blueprint that are part of execution vs. those that aren't. So far, we've designed execution mechanisms that can correctly ignore attempts to use an older version (the DNS version and the OmicronZonesConfig version). That may not be possible for something like "deciding which sleds are valid targets for instance allocation". But that choice is always racy -- a sled might catch fire immediately after we decide to put an instance there. So how much does it matter if we make a slightly stale choice, if the system detects that quickly and corrects it? I think we want to treat each of these uses of blueprint data individually until we have enough to generalize.

> How many tables do we need to represent the blueprint?
> How do we manage concurrency control with the blueprint?

I think I've been assuming the "entirely in-DB" approach. I think that's pretty straightforward for the Omicron zones -- we've already done the same thing for the inventory side. I'm not sure if sleds need any representation in the blueprint yet. I've been thinking a bit about the sled lifecycle, going through states like "in-service", "draining", and "decommissioned". I think we could keep storing that in the sled table, at least to start? I'd like to think about this a bit more and draft an RFD on it.

> How do we manage concurrency control with the blueprint?

I have been assuming that blueprints are immutable and that if we wanted to change the intended state of the system, we'd create a new blueprint. If blueprints could change and had a generation number or something, when would you bump that vs. generate a new blueprint?

> The DNS system currently has a version_added field, where it operates on a generation number. This allows the DNS service to receive records from "slow Nexus instances" and safely ignore them, because they're out-of-date.
> One difficulty with blueprints is the issue of "deciding when and how to update this column".

Option 2 (John's idea) sounds similar to a proposal that @smklein and I discussed in chat a few weeks ago. That was basically:

  • the blueprint contains:
    • the set of zones [and sleds] "in service" (i.e., that should have DNS records)
    • the current version of the internal DNS zone configuration
  • during execution, Nexus:
    1. fetches the current DNS config (which includes the current generation number). If it's not the one in the blueprint, abandon execution and refetch the current blueprint. (By definition, something has changed internal DNS since this blueprint was generated, which presumably means a new blueprint has been generated and made the target.)
    2. computes what the new DNS config should be, which it can do solely from the information in the blueprint
    3. attempts to write the new DNS config conditional on the DNS config not having changed in the meantime. (This is already the behavior that the datastore DNS update functions have -- when you change DNS, the write is always conditional on the current generation not having changed since you read it.)
    4. if this fails because the DNS version has changed, abort execution of this blueprint and start again fetching the latest target

I think this is not quite the same but I don't think I follow John's idea yet.
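
A sketch of what step 3's conditional write amounts to (the dns_version table and column names are my guesses at the shape; the real logic lives in the Rust datastore functions):

INSERT INTO dns_version (version)
  SELECT $blueprint_dns_version + 1
  -- Only append the new version if DNS hasn't moved past the version
  -- recorded in the blueprint.
  WHERE (SELECT max(version) FROM dns_version) = $blueprint_dns_version;
-- Zero rows inserted corresponds to step 4: abandon execution of this
-- blueprint and refetch the latest target.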


smklein commented Jan 23, 2024

Closing the loop a little bit -- I wrote up a lot of ideas in this issue as an attempt to understand "how should downstream tasks consume the blueprint, to decide what work needs to be done".

I originally did this writing with the idea that "downstream tasks will want to read the blueprint, in some form, to understand what work they need to do". However, as I'm better understanding the executor, it sounds like the blueprint executor is responsible for reading the blueprint and writing state (in the form of records, updated state, etc.) that triggers these background tasks to do work. This means that the executor (optionally) forms a layer of indirection between "the blueprint" and "tasks which consume the blueprint".

Whether the "version" or "generation numbers" live in the blueprint, in some "queued operations table", or directly in the state field of a table being acted upon is kind of an implementation detail that I suppose we'll figure out as we create more of these tasks acting downstream from the blueprint.

jgallagher added a commit that referenced this issue Jan 26, 2024
This replaces the in-memory blueprint storage added as a placeholder in
#4804 with cockroachdb-backed tables. Both the tables and related
queries are _heavily_ derived from the similar tables in the inventory
system (particularly serializing omicron zones and their related
properties). The tables are effectively identical as of this PR, but we
opted to keep them separate because we expect them to diverge some over
time (e.g., inventory might start collecting additional per-zone
properties that don't exist for blueprints, such as uptime).

The big exception to "basically the same as inventory" is the
`bp_target` table which tracks the current (and past) target blueprint.
Inserting into this table has some subtleties, and we use a CTE to check
and enforce the invariants. This is the first diesel/CTE I've written;
it's based on other similar CTEs in Nexus, but I'd still appreciate a
particularly careful look there.
 
Fixes #4793.
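
For the curious, a rough sketch of the shape of such an insert-with-invariant-check CTE (the real query in #4899 differs, and these column names are assumptions):

WITH current_target AS (
  SELECT version FROM bp_target ORDER BY version DESC LIMIT 1
)
-- Append the new target only if its version is exactly one past the
-- current target's version (bootstrapping an empty table needs
-- separate handling, elided here).
INSERT INTO bp_target (version, blueprint_id, time_made_target)
  SELECT $new_version, $blueprint_id, now()
  FROM current_target
  WHERE $new_version = current_target.version + 1;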