API endpoints:
- RFD 24: regions, AZs, etc.
- (lots more)

Work queue (see also: existing GitHub issues):
- use CARGO_BIN_EXE_<name> for paths to binaries in integration tests: https://doc.rust-lang.org/cargo/reference/environment-variables.html#environment-variables-cargo-sets-for-crates
- dropshot: allow consumers to provide error codes for dropshot errors
- general maintenance and cleanup
  - replace &Arc<T> with &T, and some instances of Arc<T> as well
  - all identifiers could be newtypes, with a prefix for the type (like AWS "i-123" for instances)
  - rethinking ApiError a bit: should it use thiserror, or at least impl std::error::Error?
  - scope out switching to sync (see RFD 79)
  - proper config + logging for sled agent
- settle on an approach for modification of resources and implement it once
- implement behavior of server restarting (e.g., sled agent starting up)
  - This would help validate some of the architectural choices. Current thinking is that the sled agent will notify OXCP of the restart, and OXCP will find the instances that are supposed to be on that server and run instance_ensure() for each. It will also want to do that for the disks associated with those instances. IMPORTANT: this process should also remove any resources that are currently on that system but are not supposed to be there, so the notification to OXCP about a restart may need to include the list of resources that the sled agent knows about and their current states.
- implement audit log
- implement alerts
- implement external user authentication
- implement external user authorization mechanism
- implement throttling and load shedding described in RFD 6
- implement hardening in RFD 10
- implement ETag / If-Match / If-None-Match
- implement limits for all types of resources
- implement scheme for API versioning
  - how to identify the requested version: header or URI?
  - translators for older versions?
  - integration of supported API versions into build artifact?
  - Should all the uses of serde_json disallow unrecognized fields? Should any?
- debugging/monitoring: Prometheus?
- debugging/monitoring: OpenTracing? OpenTelemetry?
- debugging/monitoring: Dynamic tracing?
- debugging/monitoring: Core files?
- Automated testing
  - General API testing: there’s a lot of boilerplate in hand-generated tests for each kind of resource. Would it be reasonable / possible to have a sort of omnibus test that’s given the OpenAPI spec (or something like it), creates a hierarchy with at least one of every possible resource, and does the following for each possible resource:
    - attempt to (create, GET, PUT, DELETE) one with an invalid name
    - attempt to (GET, DELETE, PUT) one that does not exist
    - attempt to create one with invalid JSON
    - attempt to create one with a duplicate name of the one we know about
    - exercise list operation with marker and limit (may need to create many of them)
    - for each required input property:
      - attempt to create a resource without that property
    - for each input property: attempt to create a resource with invalid values for that property
    - list instances of that resource and expect to find the one we know about
    - GET the one instance we know about
    - DELETE the one instance we know about
    - GET the one instance we know about again and expect it to fail
    - list instances again and expect to find nothing
- We will need archivers for deleted records, especially saga logs
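
The server-restart item in the work queue above amounts to a reconciliation between what OXCP expects on a sled and what the sled agent reports. The following is a minimal sketch of that comparison, not a design. The `State`, `Plan`, and `plan_after_restart` names are hypothetical, resources are keyed by plain strings, and skipping `instance_ensure()` for resources already in the expected state is an assumption the notes do not make.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical runtime state of one resource (instance or disk).
#[derive(Clone, Debug, PartialEq, Eq)]
enum State {
    Running,
    Stopped,
}

/// What the control plane would do after a restart notification.
#[derive(Debug, Default, PartialEq, Eq)]
struct Plan {
    /// Resources that should be on the sled and need instance_ensure()
    /// (or the disk equivalent).
    ensure: BTreeSet<String>,
    /// Resources the sled reported that are not supposed to be there.
    remove: BTreeSet<String>,
}

/// Compare the expected resources for a sled with what the sled agent
/// reported in its restart notification.
fn plan_after_restart(
    expected: &BTreeMap<String, State>,
    reported: &BTreeMap<String, State>,
) -> Plan {
    let mut plan = Plan::default();
    for (id, want) in expected {
        // Missing, or present in a different state: ensure it.
        if reported.get(id) != Some(want) {
            plan.ensure.insert(id.clone());
        }
    }
    for id in reported.keys() {
        // Present on the sled but unknown to the control plane: remove it.
        if !expected.contains_key(id) {
            plan.remove.insert(id.clone());
        }
    }
    plan
}

fn main() {
    let mut expected = BTreeMap::new();
    expected.insert("i-1".to_string(), State::Running);
    expected.insert("i-2".to_string(), State::Running);

    let mut reported = BTreeMap::new();
    reported.insert("i-1".to_string(), State::Running);
    reported.insert("i-3".to_string(), State::Stopped);

    let plan = plan_after_restart(&expected, &reported);
    println!("ensure: {:?}", plan.ensure);
    println!("remove: {:?}", plan.remove);
}
```

Having the sled agent include its resource list in the notification is what makes the `remove` set computable without a second round trip.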
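
The omnibus API test described under "Automated testing" could start as a generator that turns a per-resource description into a flat list of test cases. The sketch below covers only part of that list (nonexistent resources, bad creates, missing required properties, and the lifecycle of one known resource). `ResourceSpec`, `Case`, and `cases_for` are hypothetical names, `POST` for creation is an assumption, and a real version would derive the specs from the OpenAPI document and actually issue the requests.

```rust
/// Hypothetical description of one resource type, as might be derived
/// from an OpenAPI spec.
struct ResourceSpec {
    /// Collection URI, e.g. "/projects".
    collection_path: &'static str,
    /// Names of the input properties that are required on create.
    required_props: &'static [&'static str],
}

/// One request to issue and whether it is expected to succeed.
#[derive(Debug, PartialEq, Eq)]
struct Case {
    description: String,
    method: &'static str,
    path: String,
    expect_success: bool,
}

fn case(description: &str, method: &'static str, path: String, expect_success: bool) -> Case {
    Case { description: description.to_string(), method, path, expect_success }
}

/// Generate test cases for one resource type. `known` is the name of the
/// single resource the test harness has already created.
fn cases_for(spec: &ResourceSpec, known: &str) -> Vec<Case> {
    let coll = spec.collection_path.to_string();
    let item = |name: &str| format!("{}/{}", spec.collection_path, name);
    let mut cases = Vec::new();

    // Operations on a resource that does not exist should all fail.
    for method in ["GET", "DELETE", "PUT"] {
        cases.push(case("operate on nonexistent resource", method, item("does-not-exist"), false));
    }

    // Creation with bad input should fail.
    cases.push(case("create with invalid JSON", "POST", coll.clone(), false));
    cases.push(case("create with duplicate name", "POST", coll.clone(), false));
    for prop in spec.required_props {
        let description = format!("create without required property {}", prop);
        cases.push(case(&description, "POST", coll.clone(), false));
    }

    // Lifecycle of the one resource we know about.
    cases.push(case("list and find the known resource", "GET", coll.clone(), true));
    cases.push(case("get the known resource", "GET", item(known), true));
    cases.push(case("delete the known resource", "DELETE", item(known), true));
    cases.push(case("get the known resource after delete", "GET", item(known), false));
    cases.push(case("list and find nothing", "GET", coll, true));
    cases
}

fn main() {
    let spec = ResourceSpec { collection_path: "/projects", required_props: &["name", "description"] };
    for c in cases_for(&spec, "demo") {
        println!("{:<6} {:<28} ok={:<5} {}", c.method, c.path, c.expect_success, c.description);
    }
}
```

Keeping case generation separate from request execution would let the same table drive different transports or be printed for review.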

External dependencies / open questions:
- Should we create a more first-class notion of objects in the API?
  - This would be a good way to enforce built-in limits.
  - This would be a good way to enforce uniformity of pagination.
  - If each resource provides a way to construct ETags, we could provide automatic implementation of If-Match, etc.
  - With the right interface, we could provide automatic implementations of PUT or PATCH with JSON Merge Patch and JSON Patch given any one of these.
- would like to require that servers have unique, immutable UUIDs
- TLS:
  - How will we do TLS termination?
  - How will we manage server certificates?
  - How will we manage client certificates?
- what does bootstrapping / key management look like?
- what does internal authorization look like?
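
To make the "first-class objects" question more concrete, here is a minimal sketch of how a shared trait could give every resource the same If-Match / If-None-Match handling. The `ApiObject` trait, `Preconditions` struct, and `check` function are hypothetical, the ETag is assumed to come from a per-object generation number, and the check is simplified: one tag per header, no weak validators, no `*`. Mapping a failure to a specific HTTP status is left to the caller.

```rust
/// Hypothetical trait for a "first-class" API object.
trait ApiObject {
    /// Opaque tag identifying the current version of the object.
    fn etag(&self) -> String;
}

/// Conditional-request headers, already parsed (simplified: at most one
/// tag per header).
#[derive(Default)]
struct Preconditions<'a> {
    if_match: Option<&'a str>,
    if_none_match: Option<&'a str>,
}

#[derive(Debug, PartialEq, Eq)]
enum Precondition {
    Proceed,
    Failed,
}

/// Generic check that works for any object able to produce an ETag.
fn check<T: ApiObject>(obj: &T, pre: &Preconditions) -> Precondition {
    let current = obj.etag();
    // If-Match: proceed only if the client's tag matches the current one.
    if let Some(tag) = pre.if_match {
        if tag != current {
            return Precondition::Failed;
        }
    }
    // If-None-Match: proceed only if the client's tag does NOT match.
    if let Some(tag) = pre.if_none_match {
        if tag == current {
            return Precondition::Failed;
        }
    }
    Precondition::Proceed
}

/// Example resource; the generation counter is an assumption made for
/// this sketch.
struct Project {
    name: String,
    generation: u64,
}

impl ApiObject for Project {
    fn etag(&self) -> String {
        // ETags are quoted strings on the wire.
        format!("\"{}\"", self.generation)
    }
}

fn main() {
    let project = Project { name: "demo".to_string(), generation: 3 };
    let stale = Preconditions { if_match: Some("\"2\""), ..Default::default() };
    println!("{} with stale If-Match: {:?}", project.name, check(&project, &stale));
}
```

The same trait would be a natural place to hang limits and pagination behavior, since those are the other properties the notes want enforced uniformly.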

Other activities:
- Performance testing
- Stress testing
- Fault testing / under load
- Fuzz testing
- Security review

Nice-to-haves:
- API consistency checks: e.g., camel case everywhere

Things we’re going to want to build once:
- metric export
- structured event reporting (e.g., audit log, alert log, fault log)
- opentracing-type reporting
- client-side circuit breakers
- service discovery
- client connection pooling
- server-side throttling
- command-line utilities

Check out linkerd (for inspiration; it looks K8s-specific)