
Key management and initialization design #15

Open
njriasan opened this issue Feb 17, 2020 · 9 comments
Comments

@njriasan
Contributor

The code as written doesn't transfer the private keys securely over TLS. This change is almost certainly necessary before we can accurately gauge performance. Additionally, I should probably add an authentication mechanism (basically assign-on-first-use) to prevent users from adding data to already allocated UPC instances. From a needs perspective, this authentication is separate from the user profile that lists the legal algorithm instances (although the two can be joined by including a certificate in the user profile).

Assigning to @njriasan (because it doesn't seem that I can actually assign issues).

@njriasan
Contributor Author

I know cfc_webapp.py already has support for TLS, but I haven't tested it. Here are the TLS changes I propose to make:

  1. Enable the TLS already supported by cfc_webapp.py
  2. When a usercloud receives its first connection, log the certificate that was presented. Then require that any future data upload or download request use the exact same certificate. This way we won't accidentally mix up data across users, and users can request their data back.
  3. Upgrade the connection between the mongo container and the main server container to use a TLS connection for transferring the key.

Of course, these 3 steps also require certificate generation. I think this is the extent of what is necessary to protect data and transfer the key properly. @shankari do you know if it's also necessary to modify how mongodb transfers data to the cloud server? I'm not quite sure how that works, but I'm focusing on the steps I took to transfer the key (and on verifying that user connections/uploads are secure).
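
For reference, here is a minimal stand-alone sketch of what step 1 amounts to, using only the Python standard library rather than the actual cfc_webapp.py configuration; the certificate and key paths are placeholders, and the comment marks where the certificate pinning from step 2 could hook in if client certificates were required.

```python
import ssl
from http.server import HTTPServer, BaseHTTPRequestHandler

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # If verify_mode were set to CERT_REQUIRED on the context, the
        # client's certificate would be available here via
        # self.connection.getpeercert() and could be pinned on first use
        # (step 2 above).
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"ok\n")

# TLS context for the server side of the connection.
ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
ctx.load_cert_chain(certfile="server-cert.pem",   # placeholder paths
                    keyfile="server-key.pem")

server = HTTPServer(("0.0.0.0", 8443), Handler)
server.socket = ctx.wrap_socket(server.socket, server_side=True)
server.serve_forever()
```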

@shankari
Contributor

shankari commented Feb 25, 2020

We should definitely discuss key management in greater detail.

cfc_webapp.py already has standard HTTPS support, but that is essentially uni-directional. The server presents a CA-signed certificate to the client for authentication, but the client does not reciprocate. This ensures that the connection is encrypted and the server is valid, but does not say anything about the client. I am not quite sure how a CA-signed certificate can be easily and automatically generated on the client. There are techniques such as Let's Encrypt for automatically generating and renewing certs, but they expect that the user can prove control of a server by running a script. That won't really work for the client-side cert.

We could use a self-signed cert to get around that.

Alternatively, we could continue using a private key. I don't see why we need to switch from a regular private key to a bi-directional TLS connection. It seems like it is more complicated wrt generating signed certs, and I am not sure what it actually buys us.

Note also that we should have some way to recover the key in case people are switching phones or have lost their phone. Prior E2E encryption products have supported storing alternate representations (QR code, text representation) of the private key so that the user can store them offline for backup and recovery.
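
To make the backup/recovery idea concrete, here is a rough sketch (not a proposal for a specific format) of turning an existing private key PEM into an offline text representation and a QR image. It assumes the key already exists as a PEM file at a placeholder path and uses the third-party qrcode package only for the image.

```python
import base64
import textwrap

import qrcode  # third-party package, used only to render the QR image

def export_key_backups(pem_path="user-key.pem"):   # placeholder path
    with open(pem_path, "rb") as f:
        pem_bytes = f.read()

    # Text representation: base64 broken into short lines so it can be
    # printed or written down and re-typed later.
    encoded = base64.b64encode(pem_bytes).decode("ascii")
    with open("key-backup.txt", "w") as f:
        f.write("\n".join(textwrap.wrap(encoded, 40)))

    # QR representation of the same bytes, for scanning from a printout.
    qrcode.make(encoded).save("key-backup.png")

def restore_key(text_backup):
    # Reverse of the text backup: strip whitespace and decode back to PEM.
    return base64.b64decode("".join(text_backup.split()))
```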

@njriasan
Contributor Author

These are all fantastic points and I look forward to discussing them soon. My reasoning behind bi-directional TLS was to ensure that requests come from the same user, but I'm not sure that's necessary (and it doesn't seem to be worth the hassle). Uni-directional TLS (which I have already updated my examples to use and will PR soon) seems sufficient for the most part.

I do agree we need to talk about key recovery and the data life cycle in more detail.

@shankari changed the title from "Replace http connection transferring private key with a TLS connection" to "Key management and initialization design" on Feb 28, 2020
@shankari
Contributor

@njriasan at our discussion yesterday, you said that one of the problems with the multi-step approach (client id and then private key over TLS) was that connections could be refused if the host went down. I'd like to understand that better.

In particular, I don't think we expect complex multi-stage protocols between the client and the container. The container exposes a REST API, so the client sends a request and receives a response.

  1. If the container goes down between one REST API call and the other, the interruption will not be visible to the client since kubernetes will spawn a new container for the new incoming call. This assumes that every client call provides both the client ID and the private key, which seems reasonable.
  2. If the container goes down while executing the call - e.g. during the multi-step handshake, or while the data for the API call is being sent/received, then the client will see the interruption. But it would have seen the interruption anyway, even with classic kubernetes, right? Or does kubernetes intercept the data at the TCP level to spawn a new container and reroute even if a connection is dropped?
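
To illustrate the assumption in point 1, here is a hedged sketch of what a self-contained client call could look like. The host, endpoint, and field names are placeholders, not the actual REST API; the point is simply that every request carries everything a freshly spawned container needs.

```python
import requests

BASE_URL = "https://usercloud.example.org"   # placeholder host

def upload_data(client_id, private_key_pem, payload):
    # Each call is self-contained: it carries the client ID and the private
    # key, so a container spawned between two calls can serve this one
    # without any cached state.
    resp = requests.post(
        f"{BASE_URL}/data/upload",           # placeholder endpoint
        json={
            "client_id": client_id,
            "private_key": private_key_pem,
            "data": payload,
        },
        timeout=30,   # a container dying mid-call surfaces as a timeout/5xx
        verify=True,  # uni-directional TLS: server cert checked against CAs
    )
    resp.raise_for_status()
    return resp.json()
```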

@shankari
Contributor

shankari commented Feb 28, 2020

At a high level, I think we should focus on the case in which the container is stateless and is discarded after every API call. As David pointed out, that is definitely possible in a cloud environment, and that is the worst case for this scenario.

Keeping the containers around is essentially a caching optimization. Great when it works, but the system still has to function as expected when it doesn't.

@shankari
Contributor

And that brings me to one more question/food for thought. I think we have a pretty good sense of how key management will work for calls generated from the client. The client (e.g. the smartphone app) will be the only storage location for the key, and it will send the private key to the server with every call.

However, the differentially private query layer, when it makes the calls to the container, will not have the private key. I believe this was the reason for caching the key in the container to begin with.

But if we can't assume that the key is cached in the container, then we will need to have a server-initiated request to the client for the key, which is likely to slow down the query response. Maybe that is acceptable, but we should discuss this.

@njriasan
Contributor Author

These are definitely all great points. The issue I was raising with the REST API was rooted in how we contact each container. If we do a name lookup (the default way of doing things in Kubernetes), then yes, if the container goes down it will immediately be replaced and our calls will reach the replacement. This should be how we want to do things. However, we had some questions about secure connections that pass through this load balancer (or router, as we called it yesterday). Because of this, we talked about trying to extract the specific IP address that we could use to contact the individual container/pod directly. This could produce issues because pods can be moved, and if we contact the IP address directly we could connect to something else, or to nothing, if the container goes down or is moved.

Assuming this document can be trusted (https://cloud.google.com/kubernetes-engine/docs/how-to/ingress-multi-ssl), I think its first two paragraphs explain why we don't want to just use the load balancer to forward to the services, and would instead ideally contact them directly.
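
For what it's worth, here is a sketch of the direct-IP lookup using the official kubernetes Python client. The namespace and label selector are placeholders for whatever identifies the user's container, and the caveat in the comments is exactly the staleness issue described above.

```python
from kubernetes import client, config

def lookup_pod_ip(namespace="default", selector="app=usercloud"):
    # Resolve the current IP of a specific pod so the client can talk to it
    # directly instead of going through the service/load balancer.
    config.load_kube_config()   # or config.load_incluster_config() in-cluster
    v1 = client.CoreV1Api()
    pods = v1.list_namespaced_pod(namespace, label_selector=selector)
    running = [p for p in pods.items if p.status.phase == "Running"]
    if not running:
        raise RuntimeError("no running pod for selector " + selector)
    # Caveat from the discussion: this IP is only valid until the pod is
    # rescheduled, so it should be re-resolved before every call rather
    # than cached across calls.
    return running[0].status.pod_ip
```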

@shankari
Contributor

shankari commented Feb 29, 2020

@njriasan I can think of a solution to this. May not be the best solution, but I am pretty sure it will work. And it is actually pretty elegant because the scheduler only needs to support one operation - running a script for a client.

The scheduler does this by launching a container for the script and returning the container IP to the client. The client and container then handshake for attestation and authentication.

  • the client verifies that the container is running the script that it had requested (unsure exactly how to do this, may skip if there is no easy solution)
  • the client then passes its private key to the container
  • the container decrypts the data and continues operation

For user-initiated scripts - e.g. storing data, running inference algorithms, retrieving data - the client triggers the script in the normal way since it has the private key.

For aggregate-querier initiated scripts, the querier sends a message to the client with the query it wants to run. The client checks the user policy on participating in aggregate queries. If the query matches the policy, then the client initiates the query script using the same mechanism as usual, and passes in the IP address of the querier. The query script runs and sends the result to the querier.

If we get attestation in general to work, the script should be able to attest the querier before it sends the data to it.

wrt the design alternatives,

  1. If the scheduler wants to run a script on the user private data, then it needs to tell the client this because the client is the only source of the key.
  2. The content of the message from the scheduler to the client has to include the script to run. An alternative would be for the server to request that the client launch a container for the scheduler to run the script in. But then the scheduler can substitute the script after the container is launched, so that seems like a bad idea.
  3. The client then launches the script in a newly created container.
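
A rough sketch of the client-side handling of the querier-initiated path described above is below; the policy object, launcher function, and field names are all hypothetical, and the attestation of the querier is left out for the same reason it is an open question above.

```python
def handle_aggregate_query_request(query, user_policy, querier_ip,
                                   launch_script, private_key_pem):
    # Client-side handling of a querier-initiated request. All names here
    # are hypothetical placeholders for the eventual protocol.
    if not user_policy.allows(query):
        return {"status": "declined"}     # policy says do not participate

    # The client launches the query script exactly as it would for a
    # user-initiated script, because it is the only holder of the key.
    container_ip = launch_script(script_id=query["script_id"],
                                 client_key=private_key_pem)

    # The script itself sends its result to the querier; the client only
    # passes along where the querier can be reached.
    return {"status": "launched",
            "container_ip": container_ip,
            "result_destination": querier_ip}
```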

@shankari
Contributor

shankari commented Feb 29, 2020

This could produce issues because pods can be moved, and if we contact the IP address directly we could connect to something else, or to nothing, if the container goes down or is moved.

It seems to me that there is a fairly simple solution to this. We only contact the IP address directly for the duration of a single call.

Concretely, the flow could be something like:

  1. client sends message to scheduler/router (client ID, script ID on dockerhub)
  2. scheduler launches script container from image on dockerhub with the volume for the client mounted [1]
  3. scheduler returns IP of script container to client
    ------ start of vulnerable region ---------
  4. client verifies attestation of script container [2]
  5. client sends private key to container
  6. container uses private key to decrypt volume and run script
  7. container returns result (if any) to client
    ------- end of vulnerable region ----------

The next client call will start again with step 1.

This means that the vulnerable region is fairly short, and is bounded by the duration of a single call.

Note also that the vulnerable region is sensitive to failures other than the container being killed. For example, the client <-> container connection could be dropped at any time due to network issues. The client has to (and does) handle connection failures in the vulnerable region anyway, primarily through retrying. So handling yet another failure mode is not really that big a deal.
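
Putting the numbered steps together, here is a hedged client-side sketch. The scheduler URL, endpoints, and payload shapes are placeholders, and the attestation check (step 4) is left as the open question it is above; the main point is that a failure anywhere in the vulnerable region just restarts the flow from step 1.

```python
import requests

SCHEDULER_URL = "https://scheduler.example.org"   # placeholder

def run_script(client_id, script_id, private_key_pem, payload,
               max_retries=3):
    # Client-side view of steps 1-7 above, retried as a unit on failure.
    for attempt in range(max_retries):
        try:
            # Steps 1-3: ask the scheduler to launch the script container
            # and hand back its IP.
            resp = requests.post(f"{SCHEDULER_URL}/launch",
                                 json={"client_id": client_id,
                                       "script_id": script_id},
                                 timeout=30)
            resp.raise_for_status()
            container_ip = resp.json()["container_ip"]

            # Step 4: attestation of the script container would go here.

            # Steps 5-7: send the key and the request, get the result back.
            result = requests.post(f"https://{container_ip}/run",
                                   json={"private_key": private_key_pem,
                                         "data": payload},
                                   timeout=300)
            result.raise_for_status()
            return result.json()
        except requests.RequestException:
            # Container killed, pod rescheduled, or network drop inside the
            # vulnerable region: start over from step 1.
            continue
    raise RuntimeError("script run failed after retries")
```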
