
Skyhook at AF

Oksana Shadura edited this page Jan 25, 2022 · 8 revisions

Separate Rook cluster @ UNL

The Rook CRD must reference a Skyhook Ceph image of the form uccross/skyhookdm-arrow:vX.Y.Z.

Currently, we use the image uccross/skyhookdm-arrow:v0.4.0.

Kubernetes cluster configuration

After the cluster is updated, we need to deploy a Pod with the PyArrow library (including the SkyhookFileFormat API) installed to start interacting with the cluster. This can be achieved by following these steps:

Update the ConfigMap with the configuration options needed to load the Arrow CLS plugins:

kubectl apply -f cls.yaml

where cls.yaml is:

apiVersion: v1
kind: ConfigMap
data:
  config: |
    [global]
    debug ms = 1
    [osd]
    osd max write size = 250
    osd max object size = 256000000
    osd class load list = *
    osd class default list = *
    osd pool default size = 1
    osd pool default min size = 1
    osd crush chooseleaf type = 1
    osd pool default pg num = 128
    osd pool default pgp num = 128
    bluestore block create = true
    debug osd = 25
    debug bluestore = 30
    debug journal = 20
metadata:
  name: rook-config-override
  namespace: rook-ceph-skyhookdm
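
Of the options above, the two `osd class ... list` entries are the ones that allow the OSDs to load and call object classes such as the Arrow CLS plugin. The following is a minimal sketch, not part of the original setup, that parses a copy of the override text and checks that both entries are set to the wildcard used here. The names `cls_options` and `cls_loading_enabled` are hypothetical helpers, and the embedded text is a shortened copy of the ConfigMap content.

```python
import configparser

# The two options that govern object class loading in the [osd] section.
CLS_KEYS = ("osd class load list", "osd class default list")

# Shortened copy of the "config" value from cls.yaml (assumption: same layout).
CONFIG_OVERRIDE = """\
[global]
debug ms = 1
[osd]
osd max object size = 256000000
osd class load list = *
osd class default list = *
"""


def cls_options(config_text):
    """Return the class-loading options found in the [osd] section."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    osd = parser["osd"] if parser.has_section("osd") else {}
    return {key: osd[key] for key in CLS_KEYS if key in osd}


def cls_loading_enabled(config_text):
    """True only if both options are present and set to the wildcard '*'."""
    options = cls_options(config_text)
    return all(options.get(key) == "*" for key in CLS_KEYS)


print(cls_loading_enabled(CONFIG_OVERRIDE))
```

This sketch only recognizes the wildcard form; an override that lists class names explicitly would need a different check.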

Create a CephFS on the Rook cluster:

kubectl create -f filesystem.yaml

where filesystem.yaml is:

apiVersion: ceph.rook.io/v1
kind: CephFilesystem
metadata:
  name: cephfs
  namespace: rook-ceph-skyhookdm
spec:
  metadataPool:
    replicated:
      size: 3
  dataPools:
    - replicated:
        size: 3
  preserveFilesystemOnDelete: true
  metadataServer:
    activeCount: 1
    activeStandby: true

Add a dedicated CephFS storage class for use within the rook-ceph-skyhookdm namespace (in our case, backed by the pool cephfs-data0):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: rook-cephfs
# Change "rook-ceph" provisioner prefix to match the operator namespace if needed
provisioner: rook-ceph-skyhookdm.cephfs.csi.ceph.com
parameters:
  # clusterID is the namespace where operator is deployed.
  clusterID: rook-ceph-skyhookdm

  # CephFS filesystem name into which the volume shall be created
  fsName: cephfs

  # Ceph pool into which the volume shall be created
  # Required for provisionVolume: "true"
  pool: cephfs-data0

  # The secrets contain Ceph admin credentials. These are generated automatically by the operator
  # in the same namespace as the cluster.
  csi.storage.k8s.io/provisioner-secret-name: rook-csi-cephfs-provisioner
  csi.storage.k8s.io/provisioner-secret-namespace: rook-ceph-skyhookdm
  csi.storage.k8s.io/controller-expand-secret-name: rook-csi-cephfs-provisioner
  csi.storage.k8s.io/controller-expand-secret-namespace: rook-ceph-skyhookdm
  csi.storage.k8s.io/node-stage-secret-name: rook-csi-cephfs-node
  csi.storage.k8s.io/node-stage-secret-namespace: rook-ceph-skyhookdm

reclaimPolicy: Delete 
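
The pool name cephfs-data0 follows from the filesystem name. As far as we know, Rook derives pool names from the CephFilesystem name, and an unnamed data pool gets the suffix `-data<index>`. The sketch below only illustrates that convention; `data_pool_name` and `metadata_pool_name` are hypothetical helpers, not Rook functions, and the convention is an assumption that does not hold if the data pools are given explicit names in the CRD.

```python
def metadata_pool_name(fs_name):
    """Metadata pool name, assuming Rook's default naming."""
    return f"{fs_name}-metadata"


def data_pool_name(fs_name, index=0):
    """Name of the index-th unnamed data pool, assuming Rook's default naming."""
    return f"{fs_name}-data{index}"


# For the filesystem called "cephfs" in filesystem.yaml:
print(metadata_pool_name("cephfs"))
print(data_pool_name("cephfs"))
```

If the filesystem is renamed, the `pool` parameter of the storage class and the pool passed to SkyhookFileFormat need to change with it.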

Add the PVC that will be mounted by the Coffea-casa Helm charts (in this example, rook-ceph-skyhookdm is the namespace of the Skyhook-specific Rook cluster):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: skyhook-pv-claim
  namespace: rook-ceph-skyhookdm
  labels:
    app: rook-ceph-skyhookdm
spec:
  storageClassName: rook-cephfs
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 500Gi

Add Ceph-specific secrets in k8s

Check the fsid and keyring values in the Ceph configuration file and keyring of any OSD/MON Pod:

  kubectl -n [namespace] exec [any-osd/mon-pod] -- cat /var/lib/rook/[namespace]/[namespace].config
  kubectl -n [namespace] exec [any-osd/mon-pod] -- cat /var/lib/rook/[namespace]/client.admin.keyring

Add the fsid and keyring values found in the previous step as a SealedSecret:

apiVersion: bitnami.com/v1alpha1
kind: SealedSecret
metadata:
  creationTimestamp: null
  name: skyhook-secret
  namespace: opendataaf-prod
spec:
  encryptedData:
    fsid: xxxxxx
    keyring: xxxxxx
  template:
    metadata:
      creationTimestamp: null
      name: skyhook-secret
      namespace: opendataaf-prod
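
The two values can also be pulled out of the files with a short script before sealing them. This is a minimal sketch under assumptions: the configuration file has a `[global]` section with an `fsid` entry, the keyring has a `[client.admin]` section with a `key` entry, and the sample texts below are made-up placeholders, not real credentials. Whether the secret should hold only the key or the whole keyring file depends on how the image consumes it, which is not stated here. The helper names are hypothetical.

```python
import configparser


def _parse_ini(text):
    # Keyring files indent their entries with tabs; strip every line so
    # configparser never treats an entry as a continuation line.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string("\n".join(line.strip() for line in text.splitlines()))
    return parser


def extract_fsid(config_text):
    """Return the fsid from the text of the Ceph configuration file."""
    return _parse_ini(config_text)["global"]["fsid"]


def extract_admin_key(keyring_text):
    """Return the client.admin key from the text of the keyring file."""
    return _parse_ini(keyring_text)["client.admin"]["key"]


# Placeholder inputs for illustration only.
SAMPLE_CONFIG = """
[global]
fsid = 00000000-0000-0000-0000-000000000000
mon_host = 10.0.0.1:6789
"""

SAMPLE_KEYRING = """
[client.admin]
\tkey = AQexamplekeyexamplekeyexamplekey==
\tcaps mon = "allow *"
"""

print(extract_fsid(SAMPLE_CONFIG))
print(extract_admin_key(SAMPLE_KEYRING))
```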

Coffea-casa images supporting builtin Skyhook functionality

For testing, we have been using the following two CentOS 7 images (the Ubuntu images will be updated as well):

The Ceph configuration file is located at /home/cms-jovyan/.ceph/ceph.conf, and both its fsid and keyring values are populated from values stored as Kubernetes SealedSecrets.

The images may contain some UNL-specific Ceph settings. If you find one, please report it to us and we will fix it!

Integration in coffea-casa AF

In the Coffea-casa Helm charts, edit values.yaml so that the fsid and keyring from ceph.conf, added as secrets beforehand, are passed to the user Pods as environment variables:

    singleuser:
      image:
        pullPolicy: Always
      extraEnv:
        SERVICEX_HOST: http://opendataaf-servicex-servicex-app:8000
        LABEXTENTION_FACTORY_CLASS: LocalCluster
        LABEXTENTION_FACTORY_MODULE: dask.distributed
        SKYHOOK_CEPH_UUIDGEN:
          valueFrom:
            secretKeyRef:
              name: skyhook-secret
              key: fsid
        SKYHOOK_CEPH_KEYRING:
          valueFrom:
            secretKeyRef:
              name: skyhook-secret
              key: keyring
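
Inside a running user Pod, a quick way to confirm that the secret reached the environment is to read the two variables back. The sketch below is an illustration only: `read_skyhook_env` is a hypothetical helper, and it checks just that the variables are non-empty, not that their values are valid.

```python
import os

# Variable names as set under singleuser.extraEnv in values.yaml.
REQUIRED_VARS = ("SKYHOOK_CEPH_UUIDGEN", "SKYHOOK_CEPH_KEYRING")


def read_skyhook_env(environ=os.environ):
    """Return the fsid and keyring from the environment, or raise if unset."""
    missing = [name for name in REQUIRED_VARS if not environ.get(name)]
    if missing:
        raise RuntimeError("missing Skyhook variables: " + ", ".join(missing))
    return {
        "fsid": environ["SKYHOOK_CEPH_UUIDGEN"],
        "keyring": environ["SKYHOOK_CEPH_KEYRING"],
    }
```

Calling `read_skyhook_env()` in a notebook raises a `RuntimeError` naming the missing variable if the secret was not wired in.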

Also, don't forget to mount the Skyhook CephFS volume:

    singleuser:
      storage:
        extraVolumes:
        - name: skyhook-shared
          persistentVolumeClaim:
            claimName: skyhook-pv-claim
        extraVolumeMounts:
          - name: skyhook-shared
            mountPath: /mnt/cephfs
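
Once the Pod is up, the mount can be checked from a notebook. This is a small sketch using only the standard library; `describe_mount` is a hypothetical helper, and a True `is_mount` only says that something is mounted at the path, not that it is the Skyhook CephFS.

```python
import os


def describe_mount(path):
    """Return basic facts about a directory expected to hold the CephFS mount."""
    exists = os.path.isdir(path)
    return {
        "exists": exists,
        "is_mount": exists and os.path.ismount(path),
        "writable": exists and os.access(path, os.W_OK),
    }


# mountPath taken from extraVolumeMounts in values.yaml.
print(describe_mount("/mnt/cephfs"))
```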

Test and small benchmark

Check the connection status from a notebook terminal:

  $ ceph -s
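
The same check can be scripted from a notebook cell. The sketch below assumes the `ceph` CLI is on the PATH of the user image and that its JSON status output carries the health under `health.status`; `health_from_status` and `cluster_health` are hypothetical helper names.

```python
import json
import subprocess


def health_from_status(status_json):
    """Extract the health string (e.g. HEALTH_OK) from `ceph -s` JSON output."""
    status = json.loads(status_json)
    return status.get("health", {}).get("status", "UNKNOWN")


def cluster_health():
    """Run `ceph -s` with JSON output and return the reported health."""
    result = subprocess.run(
        ["ceph", "-s", "--format", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return health_from_status(result.stdout)
```

`cluster_health()` raises `subprocess.CalledProcessError` if the command fails, for instance when the keyring or fsid in ceph.conf is wrong.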

Download an example dataset into /mnt/cephfs/. For example:

cd /mnt/cephfs
wget https://raw.githubusercontent.com/JayjeetAtGithub/zips/main/nyc.zip
unzip nyc.zip

Execute a mini test:

import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

# Skyhook file format: underlying file format, Ceph config path, CephFS data pool.
format_ = ds.SkyhookFileFormat("parquet", "/home/cms-jovyan/.ceph/ceph.conf", "cephfs-data0")
# The dataset directories are read as Hive-style partitions (key=value).
partitioning_ = ds.partitioning(
    pa.schema([("payment_type", pa.int32()), ("VendorID", pa.int32())]),
    flavor="hive"
)
dataset_ = ds.dataset("file:///mnt/cephfs/nyc", partitioning=partitioning_, format=format_)
# Read three columns, keeping only rows with payment_type > 2.
print(dataset_.to_table(
    columns=['total_amount', 'DOLocationID', 'payment_type'],
    filter=(ds.field('payment_type') > 2)
).to_pandas())
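
To turn the mini test into a small benchmark, the scan can be wrapped in a timing helper and run once with the Skyhook format and once with the plain "parquet" format. The helper below is a generic sketch using only the standard library; `time_call` is a hypothetical name, and no timing results are implied.

```python
import statistics
import time


def time_call(fn, repeats=3):
    """Call fn() `repeats` times and return wall-clock timings in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return {
        "min": min(durations),
        "median": statistics.median(durations),
        "runs": durations,
    }


# Usage with the dataset from the mini test (requires the Skyhook setup):
# scan = lambda: dataset_.to_table(
#     columns=['total_amount', 'DOLocationID', 'payment_type'],
#     filter=(ds.field('payment_type') > 2),
# )
# print(time_call(scan))
```

Repeating the scan several times and looking at the minimum or median reduces the influence of caching and one-off delays on the comparison.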