Setting up ACCESS Pangeo on AWS #26

Open
rsignell-usgs opened this issue Feb 5, 2019 · 8 comments

rsignell-usgs commented Feb 5, 2019

following https://zero-to-jupyterhub.readthedocs.io/en/latest/amazon/step-zero-aws.html

Instructions say:

Create an IAM Role

This role will be used to give your CI host permission to create and destroy resources on AWS, with these policies attached:

AmazonEC2FullAccess
IAMFullAccess
AmazonS3FullAccess
AmazonVPCFullAccess
Route53FullAccess (Optional)

I created this using the AWS CLI, following the instructions at
https://github.com/kubernetes/kops/blob/master/docs/aws.md
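
For reference, the kops doc does this with a dedicated kops group and user via the AWS CLI, roughly like the following (the group/user names are the kops defaults, not necessarily what I used):

aws iam create-group --group-name kops
aws iam attach-group-policy --policy-arn arn:aws:iam::aws:policy/AmazonEC2FullAccess --group-name kops
aws iam attach-group-policy --policy-arn arn:aws:iam::aws:policy/IAMFullAccess --group-name kops
aws iam attach-group-policy --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess --group-name kops
aws iam attach-group-policy --policy-arn arn:aws:iam::aws:policy/AmazonVPCFullAccess --group-name kops
aws iam create-user --user-name kops
aws iam add-user-to-group --user-name kops --group-name kops
aws iam create-access-key --user-name kops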

I skipped the DNS step because I'll have a "gossip-based cluster".

Enable versioning and encryption on the $KOPS_STATE_STORE bucket:

aws s3api put-bucket-versioning --bucket esip-pangeo-kops-state-store --versioning-configuration Status=Enabled

aws s3api put-bucket-encryption --bucket  esip-pangeo-kops-state-store  --server-side-encryption-configuration '{"Rules":[{"ApplyServerSideEncryptionByDefault":{"SSEAlgorithm":"AES256"}}]}'
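
(For completeness: the state-store bucket itself was created earlier and exported as $KOPS_STATE_STORE, something like the following, assuming us-east-1 and the bucket name above:)

aws s3api create-bucket --bucket esip-pangeo-kops-state-store --region us-east-1
export KOPS_STATE_STORE=s3://esip-pangeo-kops-state-store
export KOPS_CLUSTER_NAME=kopscluster.k8s.local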

create cluster:

$ kops create cluster kopscluster.k8s.local \
   --zones us-east-1a,us-east-1b,us-east-1c,us-east-1d,us-east-1e,us-east-1f  \
   --authorization RBAC \
   --master-size t2.small \
   --master-volume-size 10 \
   --node-size m4.2xlarge \
   --master-count 3 \
   --networking cni \
   --node-count 2 \
   --node-volume-size 120 \
   --image kope.io/k8s-1.8-debian-stretch-amd64-hvm-ebs-2018-02-08 \
   --yes

which produced this output:

I0206 15:13:48.491663   23761 create_cluster.go:496] Inferred --cloud=aws from zone "us-east-1a"
I0206 15:13:48.558748   23761 subnets.go:184] Assigned CIDR 172.20.32.0/19 to subnet us-east-1a
I0206 15:13:48.558862   23761 subnets.go:184] Assigned CIDR 172.20.64.0/19 to subnet us-east-1b
I0206 15:13:48.558913   23761 subnets.go:184] Assigned CIDR 172.20.96.0/19 to subnet us-east-1c
I0206 15:13:48.558962   23761 subnets.go:184] Assigned CIDR 172.20.128.0/19 to subnet us-east-1d
I0206 15:13:48.559028   23761 subnets.go:184] Assigned CIDR 172.20.160.0/19 to subnet us-east-1e
I0206 15:13:48.559071   23761 subnets.go:184] Assigned CIDR 172.20.192.0/19 to subnet us-east-1f
I0206 15:13:48.854296   23761 create_cluster.go:1407] Using SSH public key: /home/ec2-user/.ssh/id_rsa.pub
I0206 15:13:49.238073   23761 apply_cluster.go:542] Gossip DNS: skipping DNS validation
I0206 15:13:49.769488   23761 executor.go:103] Tasks: 0 done / 97 total; 34 can run
I0206 15:13:50.195611   23761 vfs_castore.go:736] Issuing new certificate: "ca"
I0206 15:13:50.509655   23761 vfs_castore.go:736] Issuing new certificate: "apiserver-aggregator-ca"
I0206 15:13:50.892007   23761 executor.go:103] Tasks: 34 done / 97 total; 29 can run
I0206 15:13:51.917987   23761 vfs_castore.go:736] Issuing new certificate: "kubecfg"
I0206 15:13:52.134623   23761 vfs_castore.go:736] Issuing new certificate: "kubelet-api"
I0206 15:13:52.333038   23761 vfs_castore.go:736] Issuing new certificate: "kube-proxy"
I0206 15:13:52.766449   23761 vfs_castore.go:736] Issuing new certificate: "kube-scheduler"
I0206 15:13:52.885908   23761 vfs_castore.go:736] Issuing new certificate: "kube-controller-manager"
I0206 15:13:53.122200   23761 vfs_castore.go:736] Issuing new certificate: "apiserver-proxy-client"
I0206 15:13:53.230130   23761 vfs_castore.go:736] Issuing new certificate: "apiserver-aggregator"
I0206 15:13:53.490835   23761 vfs_castore.go:736] Issuing new certificate: "kubelet"
I0206 15:13:53.513488   23761 vfs_castore.go:736] Issuing new certificate: "kops"
I0206 15:13:53.793454   23761 executor.go:103] Tasks: 63 done / 97 total; 26 can run
I0206 15:13:54.027747   23761 launchconfiguration.go:380] waiting for IAM instance profile "masters.kopscluster.k8s.local" to be ready
I0206 15:13:54.030642   23761 launchconfiguration.go:380] waiting for IAM instance profile "nodes.kopscluster.k8s.local" to be ready
I0206 15:13:54.066847   23761 launchconfiguration.go:380] waiting for IAM instance profile "masters.kopscluster.k8s.local" to be ready
I0206 15:13:54.179909   23761 launchconfiguration.go:380] waiting for IAM instance profile "masters.kopscluster.k8s.local" to be ready
I0206 15:14:04.639432   23761 executor.go:103] Tasks: 89 done / 97 total; 5 can run
I0206 15:14:04.997866   23761 vfs_castore.go:736] Issuing new certificate: "master"
I0206 15:14:05.726978   23761 executor.go:103] Tasks: 94 done / 97 total; 3 can run
I0206 15:14:06.331247   23761 executor.go:103] Tasks: 97 done / 97 total; 0 can run
I0206 15:14:06.406582   23761 update_cluster.go:290] Exporting kubecfg for cluster
kops has set your kubectl context to kopscluster.k8s.local

Cluster is starting.  It should be ready in a few minutes.

Suggestions:
 * validate cluster: kops validate cluster
 * list nodes: kubectl get nodes --show-labels
 * ssh to the master: ssh -i ~/.ssh/id_rsa [email protected]
 * the admin user is specific to Debian. If not using Debian please use the appropriate user based on your OS.
 * read about installing addons at: https://github.com/kubernetes/kops/blob/master/docs/addons.md.

Don't try to validate the cluster yet.
First enable networking:

 kubectl create -f https://git.io/weave-kube-1.6

Validate cluster. This will fail for several minutes before it works:

[ec2-user@ip-172-31-34-163 ~]$ kops validate cluster
Using cluster from kubectl context: kopscluster.k8s.local

Validating cluster kopscluster.k8s.local

INSTANCE GROUPS
NAME                    ROLE    MACHINETYPE     MIN     MAX     SUBNETS
master-us-east-1a       Master  t2.small        1       1       us-east-1a
master-us-east-1b       Master  t2.small        1       1       us-east-1b
master-us-east-1c       Master  t2.small        1       1       us-east-1c
nodes                   Node    m4.2xlarge      2       2       us-east-1a,us-east-1b,us-east-1c,us-east-1d,us-east-1e,us-east-1f

NODE STATUS
NAME                            ROLE    READY
ip-172-20-148-127.ec2.internal  node    True
ip-172-20-41-39.ec2.internal    master  True
ip-172-20-77-127.ec2.internal   master  True
ip-172-20-85-76.ec2.internal    node    True
ip-172-20-97-64.ec2.internal    master  True

Your cluster kopscluster.k8s.local is ready

Then enable storage:

(aws) [ec2-user@ip-172-31-34-163 ~]$ more storageclass.yml
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  annotations:
     storageclass.beta.kubernetes.io/is-default-class: "true"
  name: gp2
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2
kubectl apply -f storageclass.yml
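
As an optional check, confirm gp2 is now the default StorageClass (it should show "(default)" next to gp2):

kubectl get storageclass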

Create a Kubernetes secret with a Weave network password (used to encrypt the pod network traffic):


openssl rand -hex 128 >weave-passwd
kubectl create secret -n kube-system generic weave-passwd --from-file=./weave-passwd

 kubectl patch --namespace=kube-system daemonset/weave-net --type json -p '[ { "op": "add", "path": "/spec/template/spec/containers/0/env/0", "value": { "name": "WEAVE_PASSWORD", "valueFrom": { "secretKeyRef": { "key": "weave-passwd", "name": "weave-passwd" } } } } ]'
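
To check that the weave-net pods restart and pick up the password, something like this works (label selector assumed from the standard weave manifest):

 kubectl --namespace=kube-system get pods -l name=weave-net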

zero-to-jupyterhub step 0 complete!

install helm

curl https://raw.githubusercontent.com/kubernetes/helm/master/scripts/get | bash
kubectl --namespace kube-system create serviceaccount tiller
kubectl create clusterrolebinding tiller --clusterrole cluster-admin --serviceaccount=kube-system:tiller
helm init --service-account tiller --wait
kubectl patch deployment tiller-deploy --namespace=kube-system --type=json --patch='[{"op": "add", "path": "/spec/template/spec/containers/0/command", "value": ["/tiller", "--listen=localhost:44134"]}]'

test:

helm version
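
and optionally confirm the tiller deployment is up:

kubectl --namespace=kube-system get deployment tiller-deploy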

Install kubernetes cluster autoscaler

following
https://akomljen.com/kubernetes-cluster-autoscaling-on-aws/

Create node instance groups for each availability zone:

I first created an IG template:

(aws) [ec2-user@ip-172-31-34-163 ~]$ more node_ig_template.yaml
apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2019-02-01T14:32:59Z
  labels:
    kops.k8s.io/cluster: kopscluster.k8s.local
  name: nodes-us-east-#SUBZONE#-m4-2xlarge.kopscluster.k8s.local
spec:
  cloudLabels:
    k8s.io/cluster-autoscaler/enabled: ""
    k8s.io/cluster-autoscaler/node-template/label: ""
    kubernetes.io/cluster/kopscluster.k8s.local: owned
  image: kope.io/k8s-1.11-debian-stretch-amd64-hvm-ebs-2018-08-17
  machineType: m4.2xlarge
  maxPrice: "0.38"
  maxSize: 50
  minSize: 0
  nodeLabels:
    kops.k8s.io/instancegroup: nodes-us-east-#SUBZONE#-m4-2xlarge.kopscluster.k8s.local
  role: Node
  rootVolumeSize: 120
  subnets:
  - us-east-#SUBZONE#

and then ran this script to create an IG in each of the 6 availability zones:

#!/bin/bash
for SUBZONE in 1a 1b 1c 1d 1e 1f
do
  sed 's/#SUBZONE#/'"$SUBZONE"'/' node_ig_template.yaml > ig.yaml
  kops create -f ig.yaml
done
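
You can then list the instance groups to confirm all 6 were created:

kops get ig --name kopscluster.k8s.local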

Then update cluster:

 kops update cluster kopscluster.k8s.local --yes

Now add IAM policy rules for the nodes:

kops edit cluster

and add additionalPolicies under the spec: section:

spec:
  additionalPolicies:
    node: |
      [
        {
          "Effect": "Allow",
          "Action": [
            "autoscaling:DescribeAutoScalingGroups",
            "autoscaling:DescribeAutoScalingInstances",
            "autoscaling:SetDesiredCapacity",
            "autoscaling:DescribeLaunchConfigurations",
            "autoscaling:DescribeTags",
            "autoscaling:TerminateInstanceInAutoScalingGroup"
          ],
          "Resource": ["*"]
        }
      ]

and apply configuration:

kops update cluster --yes

Check what version of kubernetes we are using:

kubectl version

and note the ServerVersion=>GitVersion (e.g. 1.11.6).

Then go to https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler#releases and find the CA release series corresponding to your Kubernetes version (e.g. 1.11.x => CA 1.3.x).

Then go to https://github.com/kubernetes/autoscaler/releases and find the most recent patch release in that series (e.g. 1.3.5).

Specify this version in your autoscaler helm chart install:

helm install --name autoscaler \
    --namespace kube-system \
    --set image.tag=v1.3.5 \
    --set autoDiscovery.clusterName=kopscluster.k8s.local \
    --set extraArgs.balance-similar-node-groups=false \
    --set extraArgs.expander=random \
    --set rbac.create=true \
    --set rbac.pspEnabled=true \
    --set awsRegion=us-east-1 \
    --set nodeSelector."node-role\.kubernetes\.io/master"="" \
    --set tolerations[0].effect=NoSchedule \
    --set tolerations[0].key=node-role.kubernetes.io/master \
    --set cloudProvider=aws \
    stable/cluster-autoscaler

verify it's running:

kubectl --namespace=kube-system get pods -l "app=aws-cluster-autoscaler,release=autoscaler"
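
and, if it isn't scaling as expected, tail its logs (same label selector as above):

kubectl --namespace=kube-system logs -l "app=aws-cluster-autoscaler,release=autoscaler" --tail=20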

install pangeo helm chart:

helm upgrade --install esip-pangeo pangeo/pangeo --namespace esip-pangeo --version=0.1.1-ce2f7f5  -f jupyter-config-noscratch.yaml -f secret-config.yaml

find the IP:

 kubectl --namespace=esip-pangeo get svc proxy-public

which in my case produced:

NAME           TYPE           CLUSTER-IP      EXTERNAL-IP                                                              PORT(S)        AGE
proxy-public   LoadBalancer   100.64.166.60   ada21177b295e11e9a0ee0eef77e790b-963275451.us-east-1.elb.amazonaws.com   80:32541/TCP   40s

set the default namespace:

kubectl config set-context $(kubectl config current-context) --namespace=esip-pangeo
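
An optional sanity check at this point is to list the hub and proxy pods:

kubectl get pods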

After logging into JupyterHub and verifying that the cluster scaled up using the CA-enabled IGs to meet the requested dask workers, I deleted the IG for the original 2 nodes from the initial cluster creation:

kops delete ig nodes --yes

rsignell-usgs commented Feb 5, 2019

Just a note that I first tried building the cluster with m5.2xlarge nodes and specifying those in the 6 IGs, but it turned out they were in high demand and I wasn't getting the spot instances, so I switched to m4.2xlarge. I modified the recipe above to specify m4.2xlarge.

@amanda-tan

@rsignell-usgs Did you create the ASG in each AZ because of the volume attachment issue? I remember we chatted about it, but don't remember the resolution.

@rsignell-usgs

@jacobtomlinson suggested I do this. His explanation was:

The autoscaler can’t be sure that scaling that group will result in an instance that your pod can run on, so it just doesn’t do anything. I ended up duplicating our IGs and assigning them one zone each. So we have one for A and one for B. Then the autoscaler can be sure that if it needs to place a pod in B, it can increase the B group by one and it will definitely get placed.

@amanda-tan

Great - that will be useful for our auto-scaling setup. We're running into this error using EKS:

Event log
Server requested
2019-03-05 20:52:16+00:00 [Warning] 0/5 nodes are available: 2 Insufficient cpu, 2 Insufficient memory, 3 node(s) had no available volume zone.
2019-03-05 20:52:52+00:00 [Warning] Error: cannot find volume "volume-amanda-2dtan" to mount into container "notebook"
2019-03-05 20:52:52+00:00 [Normal] Successfully pulled image "783380859522.dkr.ecr.us-east-1.amazonaws.com/pangeo:a7ff12a"
Server requested
2019-03-05 20:52:52+00:00 [Normal] Successfully pulled image "783380859522.dkr.ecr.us-east-1.amazonaws.com/pangeo:a7ff12a"
2019-03-05 20:52:52+00:00 [Warning] Error: cannot find volume "volume-amanda-2dtan" to mount into container "notebook"
2019-03-05 20:53:02+00:00 [Warning] 0/5 nodes are available: 2 Insufficient cpu, 2 Insufficient memory, 3 node(s) had no available volume zone.

The solution is probably to use only a single-AZ deployment; I think that's doable with kops, but I'm not sure how to translate that to CloudFormation templates.
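
A single-AZ kops deployment would just pass one zone to --zones; a minimal sketch (cluster name and sizes are illustrative, not from this setup):

kops create cluster singleaz.k8s.local \
   --zones us-east-1a \
   --node-size m4.2xlarge \
   --node-count 2 \
   --yes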

@rsignell-usgs

@amanda-tan I've probably mentioned this before, but when I attended the AWS "Building with Containers" class last August, the instructor (from Amazon) suggested we use kops instead of EKS for production. I can't remember exactly what the reasons were, unfortunately. So I'm guessing this has changed and now it's a better idea to use EKS, right?

@jacobtomlinson

We are currently exploring rebuilding our cluster with EKS. Here are a few notes we've found so far:

* EKS is just one piece of the puzzle (managed Kubernetes masters); you need to create a lot of extra stuff yourself:
  * Nodes
  * Security Groups
  * VPCs
  * IAM roles
  * etc.
* There is a tool called eksctl which attempts to create these things for you. It's still immature compared to kops.
  * Currently we are not able to use spot instances, for example.

@rsignell-usgs

@amanda-tan, the kops cluster we have set up on pangeo-access is using spot instances and the autoscaling is working. Our burn rate dropped by a factor of two when we went to spot! I can give you the kops kubecfg if you want to deploy your hubs on this existing kubernetes cluster.

@rsignell-usgs

I had removed the Met Office flex volume stuff on the pangeo-access cluster when I was doing the initial debugging, and had been meaning to add it back in.

So I just updated the jupyter-config.yaml and the worker-template.yaml, installed the helm chart, and it's working.

I can write to /scratch and it shows up in the s3 bucket, and I can treat any public s3 dataset as a file, e.g.:

ncdump -h /s3/noaa-nwm-pds/nwm.20190314/forcing_short_range/nwm.t00z.short_range.forcing.f001.conus.nc
