Extract storage manager related code from sled-agent (#4332)
This is a large PR. The changes are in many ways *just* moving code around, and at a cursory glance may appear to be a giant waste of time. To justify these changes and my rationale I will first present the goals I believe we should strive to meet in general sled-agent development. I'll then go into how I believe these goals can be realized via code patterns. I'll then discuss the concrete changes in this PR and how they implement the discussed patterns. Lastly, I'll list a few open questions about the implementation in the PR.

I want to emphasize that a lot of what I talk about below is already done in sled-agent, at least part way. We can pick the best patterns already in use and standardize on them to build a more coherent and yet decoupled architecture for easier testing and understanding.

# Goals

This PR is an attempt to start making sled-agent easier to maintain. In order to make things easier to maintain, I believe we should attempt to realize a few key goals:

1. Initial startup and control code should read linearly. You shouldn't have to jump around to follow the gist of what's going on.
2. Tasks, like functions, should live at a certain abstraction level. It must be clear what state the task is managing and how it gets mutated.
3. A developer must be able to come into the codebase and know where a given functionality should live and be able to clearly follow existing patterns such that their new code doesn't break or cross abstractions that make following the code harder in the future.
4. Tests for individual abstractions and task driven code should be easy to write. You shouldn't need to spin up the whole sled-agent in order to test an algorithm or behavior.
5. Hardware should be abstracted out, and much of the code shouldn't deal with hardware directly.

# Patterns

## Task/Handle

In my opinion, the most important pattern to follow for async rust software is what I call the "task/handle" pattern. This pattern helps code maintenance by explicitly managing state and making task driven code easier to test. All tasks that provide services to other tasks should follow this pattern.

In this pattern, all state is contained in a type owned by the task and inaccessible directly to other tasks. The task has a corresponding handle which must be `Send` and `Clone`, and can be used to make requests of the task. The handle interacts with the task solely via message passing. Mutexes are never used.
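To make the shape of this concrete, here is a minimal sketch of the pattern. The names (`StorageRequest`, `StorageManager`, `StorageHandle`, `spawn_storage_manager`) are illustrative only, not the exact types in this PR:

```rust
use tokio::sync::{mpsc, oneshot};

/// Messages sent from the handle to the task.
enum StorageRequest {
    UpsertDisk { id: String },
    /// Requests that need an answer carry their own reply channel.
    GetDiskCount { tx: oneshot::Sender<usize> },
}

/// All state lives here, owned by the task. No mutexes anywhere.
struct StorageManager {
    rx: mpsc::Receiver<StorageRequest>,
    disks: Vec<String>,
}

impl StorageManager {
    /// The task loop: pull a request, mutate local state, reply if asked.
    async fn run(mut self) {
        while let Some(req) = self.rx.recv().await {
            match req {
                StorageRequest::UpsertDisk { id } => {
                    if !self.disks.contains(&id) {
                        self.disks.push(id);
                    }
                }
                StorageRequest::GetDiskCount { tx } => {
                    let _ = tx.send(self.disks.len());
                }
            }
        }
    }
}

/// The handle is `Send + Clone` and talks to the task only via messages.
#[derive(Clone)]
struct StorageHandle {
    tx: mpsc::Sender<StorageRequest>,
}

impl StorageHandle {
    /// Fire-and-forget request; the task processes it in fifo order.
    async fn upsert_disk(&self, id: String) {
        self.tx.send(StorageRequest::UpsertDisk { id }).await.unwrap();
    }

    /// Request/response via a oneshot channel owned by this call.
    async fn disk_count(&self) -> usize {
        let (tx, rx) = oneshot::channel();
        self.tx.send(StorageRequest::GetDiskCount { tx }).await.unwrap();
        rx.await.unwrap()
    }
}

/// Spawn the task and hand back only the handle; callers never see the state.
fn spawn_storage_manager() -> StorageHandle {
    let (tx, rx) = mpsc::channel(1024);
    tokio::spawn(StorageManager { rx, disks: Vec::new() }.run());
    StorageHandle { tx }
}
```

Because all mutation happens inside `run`, the state handling can also be exercised by constructing the state type directly in a test and calling its methods, without spawning a task at all.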
A few things fall directly out of this pattern:

* Code that manipulates state inside and outside the task can be synchronous
* We don't have to worry about mutex behavior with regards to cancellation, await points, etc...
* Testing becomes easier:
  * We can test a bunch of the code without spawning a task in many cases. We just [directly construct the state object and call functions](https://github.com/oxidecomputer/omicron/pull/4332/files#diff-08315639e20a9323f2740455a7813d83ca43b05dca197ebbda66c13f750d446bR634-R656), whether sync or async.
  * When testing sequential operations that are fire and forget, we [know when they have been processed by the task loop](https://github.com/oxidecomputer/omicron/pull/4332/files#diff-08315639e20a9323f2740455a7813d83ca43b05dca197ebbda66c13f750d446bR702-R709), because our next request will only run after those operations. This is a direct result of the fifo channel between handle and task. This is not possible with state shared outside the task via a mutex. We must poll that mutex protected state until it either changes to the value we expect or we get a timeout. If we expect no change, we must use a side-channel.
  * We can write complex state handling algorithms, such as those in maghemite or bootstore, in a deterministic, synchronous style that allows property based testing and model checking in as straightforward a manner as possible.
  * We can completely [fake the task](https://github.com/oxidecomputer/omicron/pull/4332/files#diff-08315639e20a9323f2740455a7813d83ca43b05dca197ebbda66c13f750d446bR163-R223) without changing any of the calling code except the constructor. The real handle can instead send messages to a fake task. Importantly, this strategy can also be used to abstract out hardware for unit tests and simulation.

Beyond all that though, the understanding of how state is mutated and accessed is clear. State is only mutated inside the task, and can only be retrieved from code that has a copy of the handle. There is no need for separate `_locked` methods, and no need to worry about order of locking for different bits of task state. This can make the task code itself much shorter and easier to read, as demonstrated by the new `StorageManager` code in `sled_storage`. We can also maintain internal stats for the task, and all of this can be retrieved from the handle for reporting in `omdb`.

There is one common question/problem with this pattern: how do you choose a bounded channel size for handle to task messages? This can be tricky, but in general you want the channel to *act* unbounded, such that sends never fail. This makes the channels reliable, such that we never drop messages inside the process, and the caller doesn't have to choose what to do when overloaded. This simplifies things drastically for developers. However, you also don't want to make the channel actually unbounded, because that can lead to runaway memory growth and pathological behaviors, such that requests get slower over time until the system crashes. After discussions with @jgallagher and reflections from @sunshowers and @davepacheco on RFD 400, the solution is to choose a large enough bound such that we should never hit it in practice unless we are truly overloaded. Hitting the bound means that requests beyond it will start to build up and we will eventually topple over, so when we hit this bound, we just [go ahead and panic](https://github.com/oxidecomputer/omicron/pull/4332/files#diff-08315639e20a9323f2740455a7813d83ca43b05dca197ebbda66c13f750d446bR75-R77). Picking a channel bound is hard to do empirically, but practically, if requests are mostly mutating task local state, a bound of 1024 or even 8192 should be plenty. Tasks that must perform longer running ops can spawn helper tasks as necessary or include their own handles for replies rather than synchronously waiting. Memory for the queue can be kept small with boxing of large messages.

It should also be noted that mutexes also have queues that can build up and cause pathological slowness. The difference is that those queues are implicit and hard to track.
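As a rough illustration of that policy (not the exact code in this PR), the handle's send path can refuse to wait on a full queue and instead treat it as a fatal overload:

```rust
use tokio::sync::mpsc::{self, error::TrySendError};

/// Send without waiting. A full queue means the task is already falling
/// behind and requests would only pile up, so crash loudly instead of
/// blocking forever or silently dropping work.
fn send_or_panic<T>(tx: &mpsc::Sender<T>, req: T) {
    match tx.try_send(req) {
        Ok(()) => {}
        Err(TrySendError::Full(_)) => {
            panic!("handle channel full: sled-agent is overloaded");
        }
        Err(TrySendError::Closed(_)) => {
            panic!("handle channel closed: receiving task exited");
        }
    }
}
```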
## Isolate subsystems via message driven decoupling

We have a bunch of managers (`HardwareManager`, `StorageManager`, `ServiceManager`, `InstanceManager`) that interact with each other to provide sled-agent services. However, much of that interaction is ad-hoc, with state shared between tasks via an `Inner` struct protected by a mutex. It's hard to isolate each of these managers and test them independently. By ensuring their API is well defined via a `Handle`, we can fake out different managers as needed for independent testing.

However, we can go even farther, without the need for dependency injection, by having different managers announce their updates via [broadcast or watch channels](https://github.com/oxidecomputer/omicron/pull/4332/files#diff-08315639e20a9323f2740455a7813d83ca43b05dca197ebbda66c13f750d446bR129-R133). With this strategy, we can drive tests for a task by using the handle to trigger operations and then waiting for the outflow of messages that should occur, rather than mocking out the handle API of another task directly and checking the interaction via the fake. This is especially useful for lower level services that others depend upon, such as `StorageManager`, and we should do this when we can, rather than having tasks talk directly to each other. This strategy leads directly to the last pattern I really want to mention: the `monitor` pattern.
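Here is a rough sketch of what the announcement side can look like using a `tokio::sync::watch` channel; `StorageResources` and the function names are placeholders rather than the real types in this PR:

```rust
use std::collections::BTreeSet;
use tokio::sync::watch;

/// A snapshot of the resources a manager wants to announce to the world.
#[derive(Clone, Default)]
struct StorageResources {
    zpools: BTreeSet<String>,
}

/// Inside the task: after mutating local state, publish the new snapshot.
/// Subscribers only ever observe the latest value, never a backlog.
fn announce(tx: &watch::Sender<StorageResources>, resources: &StorageResources) {
    let _ = tx.send(resources.clone());
}

/// In a test or a monitor: wait for the next announcement and inspect it.
async fn next_update(rx: &mut watch::Receiver<StorageResources>) -> StorageResources {
    rx.changed().await.expect("storage manager exited");
    rx.borrow().clone()
}
```

A test holds the `watch::Receiver`, triggers an operation through the handle, and then awaits the next update to observe the result, with no mocking of another task's API.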
## High level interaction via monitors

Remember that a primary goal of these patterns is to make the sled-agent easier to read and to discover what is happening at any given point in time. This has to happen alongside the inherent concurrency of all the separate behaviors occurring in the sled-agent, such as monitoring for hardware and faults, or handling user requests from Nexus. If our low level managers announce updates via channels, we can have high level `Monitors` that wait for those updates and then dispatch them to other subsystems or across clients to other sleds or services. This helps further decouple services for easier testing, and also allows us to understand all the asynchrony in the system in fewer spots. We can put RAS code in these monitors and use them as central points to maintain stats and track the overall load on the system.

# The punchline

How do the patterns above satisfy the goals? By decoupling tasks and interacting solely via message passing we make testing easier (goal 4). By separating out managers/services into separate tasks and maintaining state locally we satisfy goals 2 and 5. By making the tasks clear in their responsibilities, and their APIs evident via their handles, we help satisfy goal 3. Goal 3 is also satisfied to a degree by the elimination of shared state and the decoupling via monitors. Lastly, by combining all the patterns, we can spawn all our top level tasks and monitors in one place after initializing any global state. This allows the code to flow linearly.

Importantly, it's also easy to violate goal 3 by dropping some mutexes into the middle of a task and oversharing handles between subsystems. I believe we can prevent this, and also make testing and APIs clearer, by separating subsystems into their own rust packages and then adding monitors for them in the sled-agent. I took a first cut at this with the `StorageManager`, as that was the subsystem I was most familiar with.

# Concrete Code changes

Keeping in line with my stated goals and patterns, I abstracted the storage manager code into its own package called `sled-storage`. The `StorageManager` uses the `task/handle` pattern, and there is a monitor for it in sled-agent that takes notifications and updates nexus and the zone bundler. There are also a bunch of unit tests where none existed before. The bulk of the rest of the code changes fell out of this. In particular, now that we have a separate `sled-storage` package, all high level disk management code lives there. Construction and formatting of partitions still happens in `sled-hardware`, but anything above the zpool level occurs in `sled-storage`. This allows testing of real storage code on any illumos based system using file backed zpools. Importantly, the key-management code now lives at this abstraction level, rather than in `sled-hardware`, which makes more sense since encryption only operates on datasets in ZFS. I'd like to do similar things to the other managers, and think that's a good way to continue cleaning up sled-agent.

The other large change in this code base is simplifying the startup of the bootstrap agent such that all tasks that run for the lifetime of the sled-agent process are spawned in `long_running_tasks`. There is a struct called `LongRunningTaskHandles` that allows interaction with all the "managers" spawned here. This change also has the side effect of removing the `wait_while_handling_hardware_updates` concurrent future code, since we have a hardware monitor running from the beginning of time that can never miss updates. It also negates the need to start a separate hardware manager when the sled-agent starts up. Because we now know which tasks are long running, we don't have to save their tokio `JoinHandle`s to signify ownership, and the handles we do keep can be cloned freely.
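For illustration, the shape of that startup code looks roughly like the following; the real struct has more members, the field names here are hypothetical, and this reuses the `StorageHandle`/`spawn_storage_manager` names from the earlier sketch:

```rust
/// Handles for every task spawned at startup that runs for the lifetime of
/// the sled-agent process. We keep only the handles; the `JoinHandle`s are
/// dropped on purpose because these tasks never exit, and a struct of plain
/// handles can be cloned and passed around freely.
#[derive(Clone)]
struct LongRunningTaskHandles {
    storage: StorageHandle,
    // ... handles for the other long running managers and monitors
}

/// Spawn everything in one place, in a fixed order, so startup reads linearly.
fn spawn_all_longrunning_tasks() -> LongRunningTaskHandles {
    let storage = spawn_storage_manager();
    LongRunningTaskHandles { storage }
}
```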
# Open Questions

* I'm not really a fan of the name `long_running_task_handles`. Is there a better name for these?
* Instead of calling `spawn_all_longrunning_tasks` should we just call the individual spawn functions directly in `bootstrap/pre_server`? Doing it this way would allow the `ServiceManager` to also live in the long running handles and remove some ordering issues. For instance, we wait for the bootdisk inside `spawn_all_longrunning_tasks` and we also upsert synthetic disks. Neither of these is really part of spawning tasks.