AWS Deployment #71

Closed · rabernat opened this issue Jan 12, 2018 · 112 comments

Comments

@rabernat
Member

It would be great to deploy our jupyterhub setup on AWS. There is a lot of community investment already in AWS. At ESIP, the HDF guys mentioned they would be interested in collaborating on this.

@jreadey, @rsignell-usgs: is either of you available to work on this?

@yuvipanda
Member

Current AWS docs are at https://zero-to-jupyterhub.readthedocs.io/en/v0.5-doc/create-k8s-cluster.html#setting-up-kubernetes-on-amazon-web-services-aws, and jupyterhub/zero-to-jupyterhub-k8s#299 is the issue for us to document doing this on Amazon's new managed Kubernetes service.

@jhamman
Member

jhamman commented Jan 12, 2018

Also cc @robfatland and @atlehr who are currently running a jupyterhub instance at AWS and may be interested in migrating to the pangeo setup. Do either of you want to try this out?

@rsignell-usgs
Member

rsignell-usgs commented Jan 15, 2018

@jreadey and I are going to write a proposal to https://aws.amazon.com/earth/research-credits/ to implement this framework on AWS. Currently the AWS call for research credits is focusing on proposals that use "Earth on AWS" datasets (https://aws.amazon.com/earth/).

One of those datasets is UK met office forecast data (https://aws.amazon.com/public-datasets/mogreps/) which are in NetCDF4 files, well suited for analysis with this framework.

Our plan is to get this JupyterHub setup going on AWS and also to put that NetCDF4 data into HSDS, and perhaps compare/contrast with access via zarr. @jflasher from AWS is willing to find us help if we run into problems with the deployment.

Can someone (perhaps offline at [email protected]) give me an idea of how many credits you guys have used in the last month so that I have some idea how much to ask AWS for?

@rabernat
Member Author

@rsignell-usgs: that's great news! I support your plans 100%.

We published our original NSF proposal under a CC-BY license here:
https://figshare.com/articles/Pangeo_NSF_Earthcube_Proposal/5361094
I encourage you to reuse any parts of this you wish for the AWS proposal. And it would be great to have your proposal shared with the community under a similar license.

I see no reason why the cost details have to be communicated in private. For Jan 1-15, we did about $400 in compute and $20 in storage on GCP. We are storing about 700 GB right now, but we are about to start uploading some much bigger datasets. So I expect storage costs to increase somewhat. The compute charges reflect the usage of pangeo.pydata.org and the associated dask clusters. We have not really been doing any heavy, long-running calculations, so I also expect that to increase.

@rabernat
Member Author

This was the biggest single line item on our 15-day billing statement:

Compute Engine Standard Intel N1 2 VCPU running in Americas: 3332.485 Hours

@mrocklin
Member

mrocklin commented Jan 15, 2018 via email

@jreadey

jreadey commented Jan 16, 2018

Amazon Kubernetes (EKS) distributes containers across a set of instances provided by the account owner. So if the number of containers is highly variable (as is likely the case as users come and go), it's easy to end up with either an under-utilized or an over-committed cluster.

This project may be worth looking into: https://github.com/kubernetes/autoscaler. It supports GCE too!

@mrocklin
Member

Yeah, our use case is somewhat more complex than what Kubernetes autoscalers handle, due to the need to manage stateful pods. We're actually managing pods dynamically ourselves.

This isn't actually the kind of autoscaling we need, though. We're more interested in autoscaling the nodes themselves. Unfortunately, provisioning nodes takes significantly longer than deploying new pods.
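
For context, a minimal sketch of the pod-level scaling being described here, using the dask-kubernetes (then "daskernetes") KubeCluster; the worker-template file name is a placeholder for your own pod spec, and the import path follows the current package name rather than the 2018 one.

```python
# Minimal sketch: Dask manages its worker *pods* itself via adaptive scaling.
# "worker-template.yaml" is a placeholder for your own worker pod spec.
from dask.distributed import Client
from dask_kubernetes import KubeCluster

cluster = KubeCluster.from_yaml("worker-template.yaml")
cluster.adapt(minimum=0, maximum=20)  # Dask adds/removes worker pods with load
client = Client(cluster)              # the cluster *nodes* still scale separately
```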

@rsignell-usgs
Member

Does this mean that the pangeo framework would not benefit from EKS when deployed on AWS?

@mrocklin
Member

It would be fine. What I'm saying is that there is typically a minute or two to provision new nodes in an elastic cluster. These couple minutes can be annoying to users. That's the only issue I'm bringing up.

@yuvipanda
Member

Indeed, that is a problem both for dask and jupyterhub. I've filed kubernetes/autoscaler#148 which should vastly improve the situation for us if it gets implemented, and am playing with workarounds in berkeley-dsep-infra/data8xhub#7 until that gets implemented upstream. We also added the ability to pack nodes (rather than spread them) in JupyterHub chart (jupyterhub/zero-to-jupyterhub-k8s#384) to make the situation easier...

@robfatland
Member

robfatland commented Jan 16, 2018 via email

@mrocklin
Member

mrocklin commented Jan 16, 2018 via email

@jreadey

jreadey commented Jan 16, 2018

Has anyone looked at this project: https://github.com/kubernetes/autoscaler?

It seems like the optimal thing for JupyterLab scaling would be to always have some reserve capacity so that new containers can quickly be launched within an existing instance. When the reserve runs low, fire up a new instance. If there is excess capacity, consolidate containers and shut down an instance or two.
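
For illustration only, that reserve-capacity policy could look something like the plain-Python sketch below; the function and numbers are hypothetical, and a real autoscaler would work on pod resource requests rather than container counts.

```python
# Hypothetical reserve-capacity policy, not a real autoscaler API.
def desired_node_count(running_containers: int, capacity_per_node: int, reserve_slots: int) -> int:
    """Nodes needed to hold all running containers plus a reserve buffer."""
    needed = running_containers + reserve_slots
    nodes = -(-needed // capacity_per_node)  # ceiling division
    return max(1, nodes)                     # keep at least one node running

# 35 containers, 10 per node, 5 spare slots -> 4 nodes;
# dropping to 12 containers -> 2 nodes, so two instances could be drained.
print(desired_node_count(35, 10, 5), desired_node_count(12, 10, 5))
```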

@yuvipanda
Member

@jreadey indeed, that is the default upstream Node Autoscaler. It unfortunately only spins up a new node when your current cluster is 100% full, and new node creation can take minutes. If you see my previous reply, the issue I filed is in the same repo I linked to! The feature request is to add the concept of 'reserve capacity', which does not exist in that autoscaler yet. I also linked to one of our ongoing hack attempts to provide the concept of 'reserve capacity' until it gets added to upstream. That's really the only missing feature for it to be very useful for us, I think.

Hope that makes it a little clearer! Sorry for not providing more context in the previous comment!

@jreadey

jreadey commented Jan 16, 2018

@yuvipanda - Sorry, I should have read through this issue first!
Anyway, looks like stars are aligning; I'll keep an eye on the issue you opened.

@jreadey

jreadey commented Jan 16, 2018

Is the Pangeo team interested in utilizing the newly launched AWS EKS service: https://aws.amazon.com/eks/? Compared with a roll-your-own Kubernetes cluster, I imagine the EKS approach would involve less setup effort and provide a more stable environment.

Currently EKS is in preview though. I've applied to participate in the preview, but haven't been selected yet.

@mrocklin
Member

mrocklin commented Jan 16, 2018 via email

@amanda-tan
Contributor

We managed to bring up the JupyterHub + Dask deployment on AWS by following the steps outlined in Z2JH for spinning up a Kubernetes cluster on AWS using the Heptio template, and using the config here: https://github.com/pangeo-data/pangeo/tree/master/gce. The only modification was to config.yaml, taking out the GCS stuff. Pretty straightforward.

@rabernat
Member Author

Can someone give me an update on the status of Pangeo AWS deployment(s)?

@jhamman
Member

jhamman commented Feb 20, 2018

@rabernat - The UW eScience team (@atlehr and @robfatland) have deployed something very similar to the GCP deployment (see #95 and https://pangeo-aws.cloudmaven.org). I think @atlehr was planning to revisit a few pieces of the initial deployment, but I'm not sure of her timetable. IIRC, their next steps were to do some FUSE stuff (e.g. mount s3://nasanex/) and experiment with Kubernetes Operations (kops) for autoscaling. They may also be using their deployment for an upcoming OceanHackweek.

@rabernat
Member Author

rabernat commented Feb 20, 2018

FYI, KubeCluster is not working for me on pangeo-aws...the workers never start. I think @tjcrone is having a similar problem as he tries to prepare some OceanHackweek OOI tutorials.

@mrocklin
Member

mrocklin commented Feb 20, 2018 via email

@robfatland
Member

This is a recorded announcement as I’m afraid we’re all out at the moment preparing for Cabled Array Hack Workshop. The commercial council of Magrathea thanks you for your esteemed visit, but regrets that the entire planet is closed for business. Thank you. If you would like to leave your name, and a planet where you can be contacted, kindly speak when you hear the tone [Beep]

@robfatland
Member

By which I mean -- as the reference is perhaps too obscure (my colleague points out) -- the pangeo-aws experiment is Amanda trying to get ahead of the JupyterHub curve. However, this is a 'stretch' effort on her part, currently unfunded. As noted, we're heads-down trying to get another JupyterHub up for the ocean hack workshop. So the update is that we'd still like to get kops going, as Joe pointed out, but this is on hold for the moment.

@amanda-tan
Contributor

The error was insufficient cpu/memory. The current setup we have using Heptio does not allow for autoscaling -- I think that's the problem; we have to set the number of nodes by hand, with a maximum of 20. Clarification question: are Kubernetes nodes the same as dask worker nodes?

@mrocklin
Member

Your cloud deployment has VMs of a certain size (maybe 4 cores and 16 GB of RAM each). Kubernetes is running on each of these VMs. Both JupyterHub and Daskernetes deploy Pods to run either a Jupyter server or a dask worker. These pods will hopefully have resource constraints, ideally somewhat below your VM resources. These pods will typically run a Docker container that runs the actual Jupyter or Dask process.

So no, Kubernetes nodes are not the same as dask worker nodes; the term node here is a bit ambiguous. You will want to ensure that your cloud-provisioned VMs have more resources than are required by your Kubernetes Pods, for either Jupyter or Dask.
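
A minimal sketch of that sizing advice using the classic dask-kubernetes API: worker pods request/limit resources that fit comfortably inside a 4-core / 16 GB node. The image name and numbers are illustrative, not the Pangeo defaults.

```python
# Sketch: keep pod resource requests/limits below the VM (node) size.
from dask_kubernetes import KubeCluster, make_pod_spec

pod_spec = make_pod_spec(
    image="daskdev/dask:latest",             # illustrative worker image
    cpu_request="1", cpu_limit="2",          # well under a 4-core node
    memory_request="4G", memory_limit="7G",  # leaves headroom for the kubelet etc.
)
cluster = KubeCluster(pod_spec)
cluster.scale(4)  # four worker pods, packed onto however many nodes fit
```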

@rabernat
Member Author

rabernat commented Feb 20, 2018

@robfatland: I hear you on your focus on the Ocean Hack Workshop. That was in fact what motivated me to look into this. You might want to reach out to Tim to clarify that in fact the hack week will be using a different jupyterhub deployment.

@amanda-tan
Contributor

@mrocklin is dask-config.yaml still being used?

@jwagemann
Member

@kmpaul Yes, @rsignell-usgs did a great job. Thanks for this. So far, I have only found the documentation specific to GCP on the main site. Happy to contribute once we have gone through the setup.
To be specific regarding the data: I am planning to set up a prototype with "open" data from ECMWF, including data from Copernicus, such as the ERA5 reanalysis.
A real comparison to the CDS will be tricky, I guess, but it is part of a wider evaluation of the different types of "geospatial data systems" available. Will keep you updated.

@jacobtomlinson
Member

@jwagemann has asked me some further questions via email. Responding here for visibility.

Any preferences regarding AMIs?

We have built two different clusters on AWS. The first used kops with the default AMI provided; however, we did change from the Debian Jessie image to the Debian Stretch one due to some compatibility issues with m5-series instances.

Our new cluster is built using eksctl and we have used the default image which uses Amazon Linux 2.

What storage do you use? S3 or Amazon EBS, which provides persistent storage to Amazon EC2 instances?

All of our bulk storage for data is on S3. We either access it directly using S3 compatible tools or using FUSE mounts.

In our old cluster we used EBS for user home directories. However we have migrated to EFS as it is possible to mount on multiple systems. This is particularly useful when running large distributed dask jobs which need to access code in your home directory.
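
As a concrete illustration of the "access it directly using S3 compatible tools" route, here is a hedged sketch with s3fs and xarray/zarr; the bucket and dataset path are placeholders, not a real Pangeo bucket.

```python
# Sketch: read a Zarr store straight from S3, no FUSE mount required.
import s3fs
import xarray as xr

fs = s3fs.S3FileSystem(anon=True)                       # public data needs no credentials
store = s3fs.S3Map("my-bucket/my-dataset.zarr", s3=fs)  # placeholder bucket/key
ds = xr.open_zarr(store)
print(ds)
```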

Any advantages to have data stored on EBS than on S3?

I would imagine that data stored on EBS would be more performant; however, it would be more expensive and much harder to scale and manage.

Do your Kubernetes clusters run on EC2?

Our kops cluster was 100% EC2 (master and workers). Our new eksctl cluster uses EKS for the master service but then EC2 for the workers.

How many instances and dedicated hosts do you recommend?

We use autoscaling on our cluster and therefore avoid this question. Generally the services which must always be running for Pangeo keep the cluster at a minimum of two m4.2xlarges (basically just JupyterHub, the proxy, and maybe a couple of other web services).

We have our scaling policy set to a maximum of 50 nodes, but this is only due to networking limitations in AWS. Retrospectively I would choose a different CNI for the cluster instead of the default kubenet one as it will allow you to scale higher.

I saw from Rich's post on GitHub that you had better experiences with m4.2xlarge nodes?

We use a combination of m4.2xlarge and m5.2xlarge. This is mainly due to spot availability in London. This mix gives us the best cost to reliability ratio in that region. In other regions YMMV.

How many spot instances do you have?

We have spot nodepools and on-demand nodepools. We have rules in our Pangeo deployment which ensure that Notebook pods end up on on-demand hosts to avoid user disruption. Things like dask workers run on the spot instances. As I said above, we use autoscaling, so this ends up being between 2 and 50 nodes depending on how busy the cluster is.

Any other AWS services that are beneficial? For example Amazon Elastic Load Balancing?

When building the cluster with kops or eksctl, it will automatically create Elastic Load Balancers for all LoadBalancer-exposed services. This can become costly if you are running multiple services on the cluster (perhaps a few Pangeos for different purposes), so we tend to use an Ingress service to consolidate them into one ELB.

We also use Route53 for DNS records along with the external DNS plugin.

Any experience with S3 Intelligent-Tiering?

We use the lifecycle rules to expire old data, but haven't experimented with the Intelligent-Tiering yet.

How many requests (PUT/COPY/POST/LIST and GET/SELECT) do your deployments have?

I can't easily answer this one as it is very variable depending on what work is being done on the cluster.

Last month we did 35,405,000 PUT/COPY/POST/LIST Requests and 59,098,549 GET/SELECT Requests. However that probably isn't very useful without me telling you what work was done as a result of that, which I can't easily share with you as I don't necessarily know what all my users are working on.

How long do you estimate the set up time?

For a minimal cluster without any nice-to-have features like DNS, Ingress, EFS, etc., I think you could get something up and running with eksctl and the pangeo helm chart in around an hour.

We've just spent around two weeks rebuilding everything for our new cluster. But that has been a combination of bringing new staff up to speed with the internals of kubernetes and pangeo as well as making major enhancements to our existing cluster configuration.

Are there any bottlenecks currently? Or anything that would be helpful in general to try out / test / investigate further?

One of the main challenges we are facing at the moment is making user home directories accessible from all dask workers in a cluster, hence our move to EFS. This should be a big step forward for us.

We are also facing challenges around how tools access data from S3. Many libraries cannot speak to S3 directly and require the FUSE drivers.

The biggest pain point I would raise at the moment, though, is the amount of time the cluster takes to scale up and down. It is around 10 minutes for us at the moment, which can be painful for new users logging in or creating dask clusters. The new features around slots in JupyterHub should ease this.

@stale stale bot removed the stale label Mar 21, 2019
@jwagemann
Member

@jacobtomlinson Thank you so much for your detailed answer. Much appreciated!

@stale

stale bot commented May 20, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label May 20, 2019
@stale

stale bot commented May 27, 2019

This issue has been automatically closed because it had not seen recent activity. The issue can always be reopened at a later date.

@stale stale bot closed this as completed May 27, 2019
@rsignell-usgs
Member

Has anyone tried out the new AWS "Container Insights"?
https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/deploy-container-insights-EKS.html

@jacobtomlinson
Member

jacobtomlinson commented Jul 18, 2019

Not exactly.

I've used fluentd-cloudwatch for a while to send pod logs to CloudWatch Logs. This can be super useful for getting dask worker logs after the pod has been deleted.

I tend to use prometheus/grafana for metrics as it's not AWS specific. CloudWatch metrics could be useful for folks who want to stick to the AWS ecosystem.

@jhamman jhamman reopened this Oct 30, 2019
@stale stale bot removed the stale label Oct 30, 2019
@rsignell-usgs
Member

Folks from USGS, NOAA, and NASA met yesterday to discuss sharing Pangeo deployment knowledge and issues, since we are all using AWS. We decided that the best place to discuss would be this existing issue.

cc @apawloski (NASA), @kscocasey (NOAA), @grolston (USGS)

@rsignell-usgs
Member

Just a heads up here that discussion about how to secure our AWS instances is going on over at pangeo-data/pangeo-cloud-federation#467 (comment)

@rsignell-usgs
Member

I was looking at https://cloudprovider.dask.org/en/latest/ and it seems exciting.

Does anyone have experience setting up a Pangeo instance on AWS using Fargate?

It would be nice to not have to manage Kubernetes, to have faster spin-up for performance, and faster spin-down to save on compute resources.

@jacobtomlinson
Member

jacobtomlinson commented Nov 19, 2019

Using Fargate is definitely a compromise, but a useful one.

not have to manage Kubernetes

Fargate has much lower limits (~50 workers) and costs more per second. There is a tradeoff here.

have faster spinup for performance

This may not actually be true. The pangeo base image needs to be pulled for every worker. On Kubernetes you only pull once per node, and if you kill a worker and create a new one (e.g. restart your kernel and run your cells again), Fargate will pull the image again whereas k8s will use the cache.

faster spindown

This is definitely helpful.

Was looking at https://cloudprovider.dask.org/en/latest/ and seems exciting.

I'm really keen to get more eyes on this and do more testing and development. I know @RPrudden has done some experimenting with it.

@rsignell-usgs
Member

Thanks for clarifying @jacobtomlinson. So clearly not the easy win I thought it was. Especially on spin up time. Is there any reason Fargate couldn't use a similar caching approach for the Pangeo container?

@jacobtomlinson
Member

Everything behind Fargate is abstracted away. We have no visibility of the nodes that the containers run on. So AWS isolates everything. This means images will get pulled every time. Even if a node can hold 10 workers, each worker pulls the image. Or if a node has previously had a worker on it, it still pulls it again. I guess this is a security/management thing on AWS's part.

Putting images in ECR helps a little.

Taking back control of the cluster using ECS gives more flexibility and allows us to cache things better. But this is as much effort as managing a k8s cluster.

@rsignell-usgs
Member

I'm here at the ESIP Winter Meeting in Bethesda, and @jflasher mentioned that @niallrobinson said that AWS SageMaker (which uses a regular JupyterLab interface) has some Kubernetes capabilities and/or might now have the capability to run Dask.

@niallrobinson , do you have more info here, or did I get this wrong?

@scottyhq
Member

scottyhq commented Jan 8, 2020

@rsignell-usgs - this is something we've been looking into, but haven't found enough time for yet. You could always run a Dask LocalCluster on your SageMaker instance.

There were a bunch of new SageMaker features announced last month, but as far as I know there is no simple way to run dask-kubernetes. You'd have to set up your own EKS cluster with dask-gateway. The basic idea is that if you set up an EKS cluster with just dask-gateway installed, you could run your computations from a SageMaker notebook (even in a different region, or from your local laptop). It is very promising but in the early stages of development!
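
From the notebook side, that pattern would look roughly like the sketch below; the gateway address is a placeholder for whatever your EKS deployment exposes, and authentication is omitted.

```python
# Sketch: a SageMaker (or laptop) notebook driving Dask workers on a remote
# EKS cluster through dask-gateway. The endpoint below is hypothetical.
from dask_gateway import Gateway

gateway = Gateway("https://dask-gateway.example.com")
cluster = gateway.new_cluster()
cluster.scale(10)              # ten remote workers on the EKS cluster
client = cluster.get_client()  # subsequent dask work runs on those workers
```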

@grolston

grolston commented Jan 8, 2020

@scottyhq we will try implementing Dask Gateway so that the SageMaker Jupyter notebook can work with Dask at scale. I do see that SageMaker works with Kubernetes, but it is not exactly what we expected.

Dask Gateway is something we will try out now, and we'll keep you posted.

@rsignell-usgs
Member

rsignell-usgs commented Feb 24, 2020

I have posted some notes on how to run Pangeo on AWS with SageMaker/Fargate.

DIY! No JupyterHub or Kubernetes dev ops needed!

Basically you:

  • Create an AWS SageMaker notebook instance
  • Generate a persisted custom conda environment (one that includes dask-cloudprovider)
  • Call FargateCluster with a custom Docker image that matches your custom environment (see the sketch below)
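
A minimal sketch of the last step, assuming the dask-cloudprovider API of early 2020 (newer releases import from dask_cloudprovider.aws); the ECR image URI is a placeholder and must contain the same packages as the notebook's conda environment.

```python
# Sketch: launch Fargate-backed Dask workers from the SageMaker notebook.
from dask.distributed import Client
from dask_cloudprovider import FargateCluster  # dask_cloudprovider.aws in newer releases

cluster = FargateCluster(
    image="123456789012.dkr.ecr.us-east-1.amazonaws.com/pangeo-worker:latest",  # placeholder
    n_workers=4,
)
client = Client(cluster)
```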

@tjcrone
Contributor

tjcrone commented Feb 24, 2020

This is awesome Rich! Thanks for figuring this out and providing such great documentation. Does GCP or Azure have services to make this work as well?

@rsignell-usgs
Member

rsignell-usgs commented Feb 26, 2020

@tjcrone, here at the Dask developers workshop, @cody-dkdc is talking about a new AzureMLCluster capability for dask-cloudprovider.

See the Full Presentation

Here is slide 10: [screenshot of slide 10 from the presentation]

@stale

stale bot commented Apr 26, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Apr 26, 2020
@stale

stale bot commented May 3, 2020

This issue has been automatically closed because it had not seen recent activity. The issue can always be reopened at a later date.

@stale stale bot closed this as completed May 3, 2020