Setting up ACCESS Pangeo on AWS #26
Just a note that I first tried building the cluster with
@rsignell-usgs Did you create the ASG in each AZ because of the volume attachment issue? I remember we chatted about it, but don't remember the resolution.
@jacobtomlinson suggested I do this. His explanation was:
Great - that will be useful for our auto-scaling setup. We're running into this error using EKS:
The solution is probably to use a single-AZ deployment; I think that's doable with kops, but I'm not sure how to translate that to cloudformation templates.
@amanda-tan I've probably mentioned this before, but when I attended the AWS "Building with Containers" class last August, the instructor (from Amazon) suggested we use
We are currently exploring rebuilding our cluster with EKS. Here are a few notes we've found so far:
@amanda-tan, the kops cluster we have set up on
I had removed the Met Office flex volume stuff on the pangeo-access cluster when I was doing the initial debugging, and had been meaning to add it back in. So I just updated the jupyter-config.yaml and the worker-template.yaml; now I can write to /scratch and it shows up in the s3 bucket, and I can treat any public s3 data as a file (e.g.
Following https://zero-to-jupyterhub.readthedocs.io/en/latest/amazon/step-zero-aws.html
The instructions say:
I created this using the aws cli following the instructions on
https://github.com/kubernetes/kops/blob/master/docs/aws.md
I skipped the DNS step because I'll have a "gossip-based cluster".
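Creating the state store with the aws cli looks roughly like this; the bucket name and region below are placeholders, not the ones actually used:

```bash
# placeholder bucket name and region
aws s3api create-bucket --bucket pangeo-kops-state-store --region us-east-1
export KOPS_STATE_STORE=s3://pangeo-kops-state-store
```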
Enable versioning and encryption on the $KOPS_STATE_STORE bucket:
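A minimal sketch of the two aws cli calls (same placeholder bucket name as above):

```bash
# turn on versioning
aws s3api put-bucket-versioning --bucket pangeo-kops-state-store \
  --versioning-configuration Status=Enabled
# turn on default (AES256) server-side encryption
aws s3api put-bucket-encryption --bucket pangeo-kops-state-store \
  --server-side-encryption-configuration \
  '{"Rules":[{"ApplyServerSideEncryptionByDefault":{"SSEAlgorithm":"AES256"}}]}'
```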
create cluster:
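A minimal sketch for a gossip-based cluster; the cluster name, zone, and instance sizes are placeholders (the original cluster did start with 2 nodes):

```bash
export NAME=pangeo.k8s.local     # gossip-based cluster names must end in .k8s.local
kops create cluster $NAME \
  --zones us-east-1a \
  --node-count 2 \
  --node-size m5.2xlarge \
  --master-size m5.large \
  --yes
```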
which produced this output:
Don't try to validate the cluster yet.
First enable networking:
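For example, with Weave Net (other CNIs work too; the choice of CNI here is an assumption):

```bash
# assumption: Weave Net as the pod network
kubectl apply -f "https://cloud.weave.works/k8s/net?k8s-version=$(kubectl version | base64 | tr -d '\n')"
```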
Validate cluster. This will fail for several minutes before it works:
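i.e. keep repeating this until the cluster reports as ready:

```bash
kops validate cluster
```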
Then enable storage:
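A minimal sketch, assuming the storage step is setting a default EBS StorageClass:

```bash
# make gp2 EBS volumes the default StorageClass
kubectl apply -f - <<EOF
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: gp2
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2
EOF
```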
kubernetes secret:
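For example, the kops ssh public key secret described in the kops aws.md guide (an assumption; another secret may be meant here):

```bash
# assumption: the ssh public key secret kops uses for node access
kops create secret --name $NAME sshpublickey admin -i ~/.ssh/id_rsa.pub
```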
zero-to-jupyterhub step 0 complete!
install helm:
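A minimal sketch of a helm 2 + tiller setup with RBAC, along the lines of the zero-to-jupyterhub helm instructions:

```bash
# install the helm 2 client
curl https://raw.githubusercontent.com/kubernetes/helm/master/scripts/get | bash
# give tiller its own service account with cluster-admin rights
kubectl --namespace kube-system create serviceaccount tiller
kubectl create clusterrolebinding tiller \
  --clusterrole cluster-admin --serviceaccount=kube-system:tiller
# deploy tiller into the cluster
helm init --service-account tiller
```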
test:
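e.g.:

```bash
# should report matching client and server versions once tiller is up
helm version
```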
Install kubernetes cluster autoscaler
following
https://akomljen.com/kubernetes-cluster-autoscaling-on-aws/
create node instance groups for each subregion:
I first created an IG template:
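A sketch of a kops InstanceGroup template along the lines of the akomljen.com post; the cluster name, image, machine type, and size limits are placeholders, and AZ is the token the script below substitutes:

```bash
cat > nodes-template.yaml <<'EOF'
apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  name: nodes-AZ
  labels:
    kops.k8s.io/cluster: pangeo.k8s.local
spec:
  # cloudLabels let the cluster autoscaler discover this group
  cloudLabels:
    k8s.io/cluster-autoscaler/enabled: ""
  # copy image/machineType from the existing nodes IG (placeholders here)
  image: kope.io/k8s-1.11-debian-stretch-amd64-hvm-ebs-2018-08-17
  machineType: m5.2xlarge
  maxSize: 10
  minSize: 0
  role: Node
  subnets:
  - AZ
EOF
```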
and then I ran this script to create the IG in all 6 subregions:
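Something along these lines; the zone list is a guess at the 6 us-east-1 AZs:

```bash
for az in us-east-1a us-east-1b us-east-1c us-east-1d us-east-1e us-east-1f; do
  sed "s/AZ/${az}/g" nodes-template.yaml > nodes-${az}.yaml
  kops create -f nodes-${az}.yaml
done
```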
Then update cluster:
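i.e.:

```bash
kops update cluster $NAME --yes
```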
Now add IAM policy rules for the nodes:
and add the additionalPolicies to the spec: group, and apply the configuration:
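A sketch of that edit, following the akomljen.com post; the policy JSON below is the standard autoscaling permission set, not copied from the original cluster spec:

```bash
kops edit cluster $NAME
# then add, under spec:, something like:
#   additionalPolicies:
#     node: |
#       [
#         {
#           "Effect": "Allow",
#           "Action": [
#             "autoscaling:DescribeAutoScalingGroups",
#             "autoscaling:DescribeAutoScalingInstances",
#             "autoscaling:DescribeLaunchConfigurations",
#             "autoscaling:DescribeTags",
#             "autoscaling:SetDesiredCapacity",
#             "autoscaling:TerminateInstanceInAutoScalingGroup"
#           ],
#           "Resource": "*"
#         }
#       ]
# and apply:
kops update cluster $NAME --yes
```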
Check what version of kubernetes we are using:
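i.e.:

```bash
# the serverVersion gitVersion field is the one to note
kubectl version
```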
and note the ServerVersion => GitVersion (e.g. 1.11.6). Then go to https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler#releases and find the right CA version corresponding to your kubernetes version (e.g. 1.11.X => 1.3.X).
Then go to:
https://github.com/kubernetes/autoscaler/releases
and find the most recent release for that CA version (e.g. 1.3.5).
Specify this in your autoscaling helm chart:
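A sketch using the stable/cluster-autoscaler chart; the release name, region, and cluster name are placeholders, and only image.tag=v1.3.5 comes from the version matching above:

```bash
helm install stable/cluster-autoscaler \
  --name cluster-autoscaler \
  --namespace kube-system \
  --set autoDiscovery.clusterName=pangeo.k8s.local \
  --set awsRegion=us-east-1 \
  --set image.tag=v1.3.5 \
  --set rbac.create=true
```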
verify it's running:
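e.g.:

```bash
kubectl get pods --namespace kube-system | grep autoscaler
```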
install pangeo helm chart:
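Roughly as follows; the release and namespace names are placeholders, and jupyter-config.yaml is the config file mentioned above:

```bash
helm repo add pangeo https://pangeo-data.github.io/helm-chart/
helm repo update
helm install pangeo/pangeo --name=pangeo --namespace=pangeo \
  -f jupyter-config.yaml
```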
find the IP:
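e.g.:

```bash
# look for the EXTERNAL-IP of the LoadBalancer service
# (proxy-public in a zero-to-jupyterhub based deployment)
kubectl get svc --namespace pangeo
```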
which, in my case, produced:
set the default namespace:
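i.e., assuming the pangeo namespace used above:

```bash
kubectl config set-context $(kubectl config current-context) --namespace=pangeo
```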
After logging into JH and verifying that the cluster scaled up using the CA-enabled IGs to meet the dask workers requested, I deleted the IG for the original 2 nodes from the initial cluster creation:
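A sketch, assuming the original instance group kept the default kops name of nodes:

```bash
kops delete ig nodes --yes
kops update cluster $NAME --yes
```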