RFD - Oracle join method #49480
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,288 @@ | ||
--- | ||
authors: Andrew Burke ([email protected]) | ||
state: draft | ||
--- | ||
|
||
# RFD 192 - Oracle cloud join method | ||
|
||
## Required Approvers | ||
|
||
* Engineering: @nklaassen && @strideynet | ||
|
||
## What | ||
|
||
Add the ability for Teleport agents running on Oracle Cloud instances to join | ||
a cluster without a static token. | ||
|
||
## Why | ||
|
||
This feature removes the need to use a shared secret to establish trust between | ||
a Teleport cluster and an Oracle Cloud compute instance. | ||
|
||
## Details | ||
|
||
### Glossary | ||
|
||
- **OCI** - Oracle Cloud Infrastructure. Interchangeable with Oracle Cloud in this document. | ||
- **OCID** - Oracle Cloud Identifier. Unique ID associated with all Oracle Cloud resources. | ||
- **Tenancy** / **Tenant** - Oracle equivalent of an AWS account/Azure subscription/etc. | ||
- **Compartment** - Logical grouping of resources. Equivalent to an Azure resource group. | ||
|
||
### UX | ||
|
||
Suppose Alice is a system administrator with a Teleport cluster, and she wants | ||
to add some Oracle Cloud compute instances to it. She | ||
would first create the file `token.yaml` with the following contents: | ||
|
||
```yaml
# token.yaml
kind: token
version: v2
metadata:
  name: oci-token
spec:
  roles: [Node]
  oracle:
    allow:
      - tenancy: "ocid1.tenancy.oc1..<unique ID>" # the OCID for Alice's tenancy
        parent_compartments: ["ocid1.compartment.oc1..<unique ID>"] # the OCID for Alice's compartment
        # If needed, Alice can further restrict the compartments and regions
        # instances can join from.
```
|
||
She would then create the provision token: | ||
|
||
```sh | ||
$ tctl create token.yaml | ||
``` | ||
|
||
Next, Alice would install, configure, and start Teleport on all of her instances.
If Alice has not yet created her instances, she can set `user_data` in each
instance's metadata to add an init script for
[cloud-init](https://docs.oracle.com/en-us/iaas/Content/Compute/Tasks/launchinginstance.htm#).
Otherwise, she can run the following script locally to install Teleport on her
existing instances:
|
||
```sh
# Collect the OCIDs of the target instances (filter instances as needed).
$ INSTANCE_IDS=$(oci compute instance list --compartment-id <compartment-id> | jq -r '.data | map(.id) | join(" ")')
$ for INSTANCE_ID in $INSTANCE_IDS
do
  # The --content JSON stays on one line so that it remains valid JSON.
  # The --target JSON is double-quoted so that $INSTANCE_ID is expanded.
  oci instance-agent command create \
    --compartment-id <compartment-id> \
    --content '{"source": {"sourceType": "TEXT", "text": "curl https://cdn.teleport.dev/install.sh | bash -s <Teleport version> && teleport node configure --token oci-token --join-method oracle --proxy example.com && sudo systemctl start teleport"}}' \
    --target "{\"instanceId\": \"$INSTANCE_ID\"}"
done
```
|
||
She can confirm that the nodes have joined either in the web UI or by running `tsh ls`. | ||
|
||
### Implementation | ||
|
||
#### Token spec | ||
|
||
The provision token will be extended to include a new `oracle` section: | ||
|
||
```yaml
kind: token
version: v2
metadata:
  name: example-oci-token
spec:
  roles: [Node, Kube, Db]
  oracle:
    allow:
      # OCID of the tenancy to allow instances to join from. Required.
      - tenancy: "ocid1.tenancy.oc1..<unique ID>"
        # OCIDs of compartments to allow instances to join from. Only the direct parent
        # compartment applies; i.e. nested compartments are not taken into account.
        # If parent_compartments is empty,
        # instances can join from any compartment in the tenancy.
        parent_compartments: ["ocid1.compartment.oc1..<unique ID>"]
        # Regions to allow instances to join from. Both full names ("us-phoenix-1")
        # and abbreviations ("phx") are allowed. If regions is empty, instances can join from any region.
        regions: ["phx", "us-ashburn-1"]
      # Add more entries as necessary.
      - tenancy: "..."
        parent_compartments: ["foo", "bar"]
        regions: ["baz", "quux"]
      # ...
```
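
The matching semantics described in the comments above can be sketched in Go. This is a minimal illustration, not the RFD's implementation: the `Rule` and `Instance` types, the function names, and the two-entry `regionAliases` table are all hypothetical, and a real alias table would need to cover every Oracle Cloud region.

```go
package main

import (
	"fmt"
	"slices"
)

// regionAliases maps region abbreviations to full names. Only the two regions
// from the example token are listed here (assumption: illustrative subset).
var regionAliases = map[string]string{
	"phx": "us-phoenix-1",
	"iad": "us-ashburn-1",
}

// canonicalRegion returns the full name for a known abbreviation, or the
// input unchanged.
func canonicalRegion(r string) string {
	if full, ok := regionAliases[r]; ok {
		return full
	}
	return r
}

// Rule mirrors one entry under spec.oracle.allow (hypothetical type).
type Rule struct {
	Tenancy            string
	ParentCompartments []string
	Regions            []string
}

// Instance holds the values identified for a joining instance (hypothetical type).
type Instance struct {
	TenancyID     string
	CompartmentID string
	Region        string
}

// matches reports whether the instance satisfies this rule. The tenancy is
// required; empty compartment and region lists allow any value.
func (r Rule) matches(in Instance) bool {
	if r.Tenancy != in.TenancyID {
		return false
	}
	// Only the instance's direct parent compartment is compared.
	if len(r.ParentCompartments) > 0 && !slices.Contains(r.ParentCompartments, in.CompartmentID) {
		return false
	}
	if len(r.Regions) == 0 {
		return true
	}
	want := canonicalRegion(in.Region)
	return slices.ContainsFunc(r.Regions, func(reg string) bool {
		return canonicalRegion(reg) == want
	})
}

// allowed reports whether any rule in the token admits the instance.
func allowed(rules []Rule, in Instance) bool {
	return slices.ContainsFunc(rules, func(r Rule) bool { return r.matches(in) })
}

func main() {
	rules := []Rule{{
		Tenancy:            "ocid1.tenancy.oc1..example",
		ParentCompartments: []string{"ocid1.compartment.oc1..example"},
		Regions:            []string{"phx", "us-ashburn-1"},
	}}
	in := Instance{
		TenancyID:     "ocid1.tenancy.oc1..example",
		CompartmentID: "ocid1.compartment.oc1..example",
		Region:        "us-phoenix-1",
	}
	fmt.Println("allowed:", allowed(rules, in))
}
```

Normalizing both sides to the full region name lets a token list `phx` while the instance reports `us-phoenix-1`, or the reverse.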
|
||
#### Permissions | ||
|
||
Before the join process can begin, the nodes joining need permission to authenticate clients. | ||
|
||
- Create a [dynamic group](https://docs.oracle.com/en-us/iaas/Content/Identity/Tasks/managingdynamicgroups.htm) | ||
that matches all the instances that will join Teleport. | ||
- Create the following policy: `Allow dynamic-group <dynamic-group-name> to inspect authentication in tenancy` | ||
Review comment: Have you taken dynamic group misconfigurations into consideration, such as lax rules that would allow unauthorized resources to become group members and be allowed to join the cluster? These misconfigurations usually arise from membership conditions that depend on attributes that can be specified by low-privileged users (e.g. tags). If I understand the flow correctly, to join the cluster, an instance needs to satisfy the following pre-conditions:

In such a scenario, gaining access to the dynamic group would effectively allow the instance to join the cluster. I don't consider this an issue on Teleport's side. However, I think an update to the RFD or future documentation to highlight the dangers of such misconfiguration will definitely be useful.

Author reply: In my mind, the membership of the dynamic group is not intended to be a criterion for which instances are allowed to join; it's just the only available vehicle for giving those instances the needed permissions. The provision token (expected tenancy/compartment/region) is the sole source of who's allowed to join. Ideally the matching rules would exactly match what's in the token; I've added that recommendation to the RFD and I'll include it in the future docs.
||
|
||
As long as the criteria for which instances can join don't change, the group
and policy do not need to be updated for each new instance. We will recommend that
users configure the dynamic group matching rules to match their provision token
to limit unnecessary permissions.
|
||
#### Join process | ||
|
||
When a node initiates the Oracle join method: | ||
|
||
- The node starts a `RegisterUsingOracleMethod` gRPC request to the auth server.
- The auth server generates a 32-byte challenge string and sends it to the node.
- The node fetches credentials for its | ||
[instance principal](https://docs.oracle.com/en-us/iaas/Content/Identity/Tasks/callingservicesfrominstances.htm#concepts) | ||
via the Oracle instance metadata service. Instances are guaranteed to have a principal | ||
and always have access to the instance metadata service to fetch their credentials. | ||
- The node will create a [signed HTTP request](https://docs.oracle.com/en-us/iaas/Content/API/Concepts/signingrequests.htm) | ||
to `http://127.0.0.1`. The address doesn't matter as the node will only use the | ||
signed headers and never make the request. | ||
- The node will create a second signed request, this time to | ||
`https://auth.{region}.oraclecloud.com/v1/authentication/authenticateClient`, | ||
and include the signed headers from the previous request as the payload (the | ||
authenticateClient route is not documented in the Oracle docs, but the | ||
[request](https://docs.oracle.com/en-us/iaas/api/#/en/identity-dp/v1/datatypes/AuthenticateClientDetails) | ||
and [response](https://docs.oracle.com/en-us/iaas/api/#/en/identity-dp/v1/datatypes/AuthenticateClientResult) | ||
types are). | ||
The node will [include and sign](https://github.com/oracle/oci-go-sdk/blob/c696c320af82270e0a2fc5324600c4902b907ecc/example/example_identity_test.go#L51-L59) | ||
the challenge from the auth server under the header `x-teleport-challenge`. | ||
- The node sends the signed headers and the common token request parameters | ||
to the auth server. | ||
- The auth server extracts the instance's region from the signed headers (found
  in the JWT-formatted `keyID` field of the `Authorization` header) to
  reconstruct the `authenticateClient` URL, forwards the request to the
  Oracle API, and verifies that the request succeeds.
- The auth server maps the claims `opc-tenant`, `opc-compartment`, and `opc-instance` | ||
from the authenticateClient response to the instance's tenancy ID, compartment | ||
ID, and instance ID respectively. | ||
- The auth server validates/verifies several properties: | ||
- The tenancy ID, compartment ID, and instance ID are all valid Oracle OCIDs. | ||
- The tenancy ID, compartment ID, and region match the Teleport provision token. | ||
- The signed challenge matches. | ||
- If everything above succeeds, the node is allowed to join the cluster. | ||
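
The challenge steps in the list above can be sketched in Go. This is a hedged illustration rather than the actual implementation: it assumes the challenge is 32 random bytes encoded as URL-safe base64, and the function names are hypothetical. Only the `x-teleport-challenge` header name comes from this RFD.

```go
package main

import (
	"crypto/rand"
	"crypto/subtle"
	"encoding/base64"
	"fmt"
	"net/http"
)

// challengeHeader is the header under which the node signs the challenge.
const challengeHeader = "x-teleport-challenge"

// newChallenge returns 32 random bytes as an unpadded URL-safe base64 string
// (assumption: the RFD does not specify the encoding).
func newChallenge() (string, error) {
	buf := make([]byte, 32)
	if _, err := rand.Read(buf); err != nil {
		return "", err
	}
	return base64.RawURLEncoding.EncodeToString(buf), nil
}

// challengeMatches compares the issued challenge with the value from the
// signed headers in constant time. An empty value never matches.
func challengeMatches(want, got string) bool {
	if got == "" {
		return false
	}
	return subtle.ConstantTimeCompare([]byte(want), []byte(got)) == 1
}

func main() {
	// Auth server side: issue a challenge.
	challenge, err := newChallenge()
	if err != nil {
		panic(err)
	}

	// Node side: the challenge is placed among the headers that get signed.
	signed := http.Header{}
	signed.Set(challengeHeader, challenge)

	// Auth server side: check the value that came back.
	fmt.Println("challenge matches:", challengeMatches(challenge, signed.Get(challengeHeader)))
}
```

The comparison alone proves nothing about the sender; it only has value because the header is covered by the signature that the Oracle API verifies.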
|
||
#### Throttling | ||
|
||
If the auth server is ever | ||
[throttled by Oracle](https://docs.oracle.com/en-us/iaas/Content/API/Concepts/usingapi.htm#throttle), | ||
the TooManyRequests error will be propagated back to the node, which will try | ||
`RegisterUsingOracleMethod` again with the exponential backoff recommended by | ||
Oracle (maximum of 60 seconds). | ||
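
A rough Go sketch of the retry behavior described above follows. It assumes a plain doubling schedule without jitter, and `errTooManyRequests`, `backoffDelay`, and `retryThrottled` are hypothetical names. Only the 60-second cap is taken from this RFD.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errTooManyRequests stands in for the propagated throttling error
// (hypothetical sentinel).
var errTooManyRequests = errors.New("too many requests")

// maxBackoff is the upper bound on the delay between attempts.
const maxBackoff = 60 * time.Second

// backoffDelay returns the delay after the given 0-based attempt: base
// doubled once per attempt and capped at maxBackoff.
func backoffDelay(base time.Duration, attempt int) time.Duration {
	d := base
	for i := 0; i < attempt; i++ {
		d *= 2
		if d >= maxBackoff {
			return maxBackoff
		}
	}
	if d > maxBackoff {
		return maxBackoff
	}
	return d
}

// retryThrottled calls op up to attempts times, sleeping between attempts
// only when op reports throttling. Any other result is returned immediately.
func retryThrottled(attempts int, base time.Duration, op func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		err = op()
		if !errors.Is(err, errTooManyRequests) {
			return err
		}
		if i < attempts-1 {
			time.Sleep(backoffDelay(base, i))
		}
	}
	return err
}

func main() {
	calls := 0
	// Simulated join attempt that is throttled twice before succeeding.
	err := retryThrottled(5, 10*time.Millisecond, func() error {
		calls++
		if calls < 3 {
			return errTooManyRequests
		}
		return nil
	})
	fmt.Println("calls:", calls, "err:", err)
}
```

A production version would likely add random jitter so that many nodes throttled at the same moment do not retry in lockstep.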
|
||
#### Limitations | ||
|
||
The Oracle provision tokens will not support nested compartments, i.e. if
compartment `foo` has a child compartment `bar` and the provision token has
`parent_compartments: ["foo"]`, this will not allow instances in compartment `bar` to
join. This is for simplicity's sake; Teleport would need to make several
requests to the Oracle Cloud API to walk up the compartment tree from the
compartment the instance is in, each of which would need to be signed. This
would require a complicated back-and-forth between the auth server and the
joining node to get signed requests for each compartment.
|
||
### Security | ||
|
||
|
||
To mitigate SSRF, Teleport will verify that the region provided by the joining | ||
node is valid. | ||
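
One way to picture this check is the Go sketch below, which validates the region's shape before it is interpolated into the `authenticateClient` URL. The pattern and function name are assumptions for illustration. The RFD does not say how validity is determined, and an allowlist of known regions would be stricter.

```go
package main

import (
	"fmt"
	"regexp"
)

// regionPattern accepts names shaped like "us-phoenix-1" or "us-gov-ashburn-1".
// It rejects dots, slashes, and other characters that could redirect the
// request to a different host (assumption: shape check only, not an allowlist).
var regionPattern = regexp.MustCompile(`^[a-z]{2}(-[a-z]+)+-[0-9]+$`)

// authenticateClientURL builds the Oracle endpoint for a validated region.
func authenticateClientURL(region string) (string, error) {
	if !regionPattern.MatchString(region) {
		return "", fmt.Errorf("invalid region %q", region)
	}
	return fmt.Sprintf("https://auth.%s.oraclecloud.com/v1/authentication/authenticateClient", region), nil
}

func main() {
	for _, region := range []string{"us-phoenix-1", "evil.example.com/x#"} {
		u, err := authenticateClientURL(region)
		fmt.Println(u, err)
	}
}
```

Because the region comes from the joining node, validating it before building the URL keeps the auth server from being steered to an arbitrary host.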
|
||
On top of the signed challenge, both Teleport and the Oracle API will | ||
verify that the `X-Date` header in the signed request is | ||
[within 5 minutes](https://docs.oracle.com/en-us/iaas/Content/API/Concepts/usingapi.htm#clock) | ||
of their own clocks. | ||
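
The clock check on Teleport's side might look like the following Go sketch. It assumes the `X-Date` value uses the HTTP-date format that `http.ParseTime` accepts, and `checkXDate` and `demoNow` are hypothetical names. Only the 5-minute window comes from this RFD.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// maxClockSkew is the allowed difference between X-Date and the local clock.
const maxClockSkew = 5 * time.Minute

// demoNow is a fixed clock value so that the demo below is deterministic.
var demoNow = time.Date(2024, time.November, 27, 19, 47, 2, 0, time.UTC)

// checkXDate parses an X-Date header value and rejects it if it is more than
// maxClockSkew away from now, in either direction.
func checkXDate(value string, now time.Time) error {
	t, err := http.ParseTime(value)
	if err != nil {
		return fmt.Errorf("parsing X-Date: %w", err)
	}
	if skew := now.Sub(t).Abs(); skew > maxClockSkew {
		return fmt.Errorf("X-Date is %v away from the local clock, limit is %v", skew, maxClockSkew)
	}
	return nil
}

func main() {
	// About two minutes old: accepted.
	fmt.Println(checkXDate("Wed, 27 Nov 2024 19:45:00 GMT", demoNow))
	// About seventeen minutes old: rejected.
	fmt.Println(checkXDate("Wed, 27 Nov 2024 19:30:00 GMT", demoNow))
}
```

Passing `now` as a parameter keeps the check testable; a real caller would supply the server's current time.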
|
||
### Proto Specification | ||
|
||
Add `RegisterUsingOracleMethod` rpc to the join service: | ||
|
||
```proto
message RegisterUsingOracleMethodRequest {
  types.RegisterUsingTokenRequest register_using_token_request = 1;
  map<string,string> headers = 2;
}

message RegisterUsingOracleMethodResponse {
  oneof Response {
    string challenge = 1;
    Certs certs = 2;
  }
}

service JoinService {
  // ...
  rpc RegisterUsingOracleMethod(stream RegisterUsingOracleMethodRequest) returns (stream RegisterUsingOracleMethodResponse);
}
```
|
||
Extend provision tokens to include rules for joining Oracle instances:
|
||
```proto
message ProvisionTokenSpecV2 {
  // Existing fields...

  ProvisionTokenSpecV2Oracle Oracle = 17;
}

message ProvisionTokenSpecV2Oracle {
  message Rule {
    string Tenancy = 1;
    repeated string ParentCompartments = 2;
    repeated string Regions = 3;
  }

  repeated Rule Allow = 1;
}
```
|
||
### Audit Events | ||
|
||
Tokens created with the `oracle` join method and instances joining via Oracle | ||
tokens will be captured by the existing `ProvisionTokenCreate` and `InstanceJoin` | ||
events, respectively. | ||
|
||
### Backwards Compatibility | ||
|
||
Suppose Oracle join is released in Teleport version *X*. The expected behavior | ||
of agents with mixed versions is as follows: | ||
|
||
|   | Auth <X | Auth >=X |
|---|---|---|
| Node <X | Irrelevant | Node will not launch with unrecognized join method |
| Node >=X | Join will be rejected for unrecognized join method | Join works |
|
||
### Test Plan | ||
|
||
Add an entry to the test plan to verify that the Oracle join method works as | ||
described in the docs, just like the other join methods. | ||
|
||
### Future work | ||
|
||
Cluster admins with many Oracle Cloud compartments may wish to specify the
allowed compartments to join from by their tags, rather than having to
specify each by OCID. The `oracle` section of the provision token
spec could be extended with the `compartment_tags` field to allow filtering
by defined and/or freeform tags. Since Teleport would already fetch the compartment
from the Oracle API, no extra permissions would be required.
|
||
## Appendix A: Sample keyID JWT claims | ||
|
||
```json | ||
/* spell-checker: disable */ | ||
{ | ||
"sub": "ocid1.instance.oc1.phx.<random string>", | ||
"opc-certtype": "instance", | ||
"iss": "authService.oracle.com", | ||
"fprint": "<fingerprint>", | ||
"ptype": "instance", | ||
"aud": "oci", | ||
"opc-tag": "V3,ocid1.tenancy.oc1..<random string>,AAAAAQAAAAAAAACB,AAAAAQAAAAAAhy9d", | ||
"ttype": "x509", | ||
"opc-instance": "ocid1.instance.oc1.phx.<random string>", | ||
"exp": 1732738022, | ||
"opc-compartment": "ocid1.compartment.oc1..<random string>", | ||
"iat": 1732736822, | ||
"jti": "<jwt id>", | ||
"tenant": "ocid1.tenancy.oc1..<random string>", | ||
"jwk": "{\"kid\":\"<fingerprint>\",\"n\":\"0BOIi1uIrzoyQmNmfsew8aRv1DVNx979QqD6WoZ37QTDkFuNoGUPssk_mftatqQUGbkppKAtXutb9lXO1SsEnyOv2_tN1KxBhiahtMdRoha0wchla2GJQd7zxVxjSU70ousmuHfIAr29P6jdx3zQ15WYG-MMRcKfB8FtETzEcTBJH9ujjw00LkBmQ_CJsJIq2YFWjp4HW8DlX2YER_FYy7Apq98Rqno0Ze4lBBib-HeJP2x7q0mxJoHEJlsRBdMweMRKhsFL5oKJjWaul06TBp4wuEx7Czcr427d5RZJ-cSCYCDkf8bzMhZ4K5o2cpKV3gcqXEDuH81_B4odZ4-oLQ\",\"e\":\"AQAB\",\"kty\":\"RSA\",\"alg\":\"RS256\",\"use\":\"sig\"}", | ||
"opc-tenant": "ocid1.tenancy.oc1..<random string>" | ||
} | ||
/* spell-checker: enable */ | ||
``` |
Have you been able to find much about what exactly this policy allows, and whether it seems acceptable to allow this for all nodes that need to join Teleport? If it only allows you to call the authenticateClient endpoint, it seems okay to me; all it allows is for the node to get information about itself.
I haven't found anything that explicitly states what it allows, but these docs on the `authentication_oci` plugin for Oracle Heatwave (1, 2) suggest to me that it's only authenticateClient.