# Troubleshooting Bank of Sirius

This doc describes how to debug a running instance of Bank of Sirius.

## Background

The container images used in `kubernetes-manifests/` correspond to a tagged, stable release (`v0.x.x`) that is ready for public consumption. We highly recommend using these stable image tags rather than the `latest` tag, which corresponds to the latest commit on the `master` branch and may be less stable.

No matter what image tags you're using, you may encounter errors when running the Bank of Sirius app. Use the following steps to debug and fix problems.

## Steps to Debug

1. **Make sure the pods are running.** Verify that you have `kubectl` access to your cluster, then run `kubectl get pods`. When the app is healthy, you should see 9 pods:
```
NAME                                 READY   STATUS    RESTARTS   AGE
accounts-db-0                        1/1     Running   0          14m
balancereader-d887fdb78-gdrsd        1/1     Running   1          15m
contacts-7d559f5444-5j7nj            1/1     Running   0          15m
frontend-78f948f946-qc6m9            1/1     Running   0          15m
ledger-db-0                          1/1     Running   0          14m
ledgerwriter-7d667cf86f-tvssn        1/1     Running   1          15m
loadgenerator-777bd57f48-6642p       1/1     Running   0          15m
transactionhistory-dd999969f-dgjth   1/1     Running   0          15m
userservice-5765f7bf44-7rs2r         1/1     Running   0          15m
```

One or two `RESTARTS` on a pod are expected, as the services sometimes start up before `skaffold` can deploy necessary dependencies (e.g. the `jwt-secret` mount). You're looking for `STATUS: Running` and `READY: 1/1`. If your cluster's namespace has Istio or Sirius Service Mesh, you will see `READY: 2/2`, since each pod gets a sidecar proxy container. If some pods are unhealthy, the sketch below can help surface them.
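Rather than scanning the full pod list by eye, you can ask `kubectl` for just the problem pods and the recent cluster events. A minimal sketch using standard `kubectl` flags:

```sh
# List only pods whose phase is not Running (e.g. Pending or Failed)
kubectl get pods --field-selector=status.phase!=Running

# Recent events, oldest first; these often explain why a pod is stuck
kubectl get events --sort-by=.metadata.creationTimestamp
```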

2. **Make sure all the Kubernetes services are present.** Run `kubectl get service`. You should see one service per pod, except for the loadgenerator (8 services total). The `frontend` service should have an `EXTERNAL-IP`. Try to reach that `EXTERNAL-IP` in a web browser, or with `curl` (see the sketch after the output below); you should see the Bank of Sirius login screen.
```
NAME                 TYPE           CLUSTER-IP     EXTERNAL-IP     PORT(S)        AGE
accounts-db          ClusterIP      10.48.23.153   <none>          5432/TCP       23d
balancereader        ClusterIP      10.48.26.169   <none>          8080/TCP       23d
contacts             ClusterIP      10.48.29.96    <none>          8080/TCP       23d
frontend             LoadBalancer   10.48.19.116   35.xxx.xx.xxx   80:31279/TCP   23d
ledger-db            ClusterIP      10.48.23.102   <none>          5432/TCP       23d
ledgerwriter         ClusterIP      10.48.28.89    <none>          8080/TCP       23d
transactionhistory   ClusterIP      10.48.20.206   <none>          8080/TCP       23d
userservice          ClusterIP      10.48.19.11    <none>          8080/TCP       23d
```
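To check the frontend from the command line instead of a browser, a minimal sketch (it assumes the `LoadBalancer` has already been assigned an external IP):

```sh
# Read the frontend's external IP from the Service status
EXTERNAL_IP=$(kubectl get service frontend \
  -o jsonpath='{.status.loadBalancer.ingress[0].ip}')

# A healthy frontend returns HTTP 200 for the login page
curl -s -o /dev/null -w "%{http_code}\n" "http://${EXTERNAL_IP}"
```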
3. **Try a clean redeploy.** If the first two steps don't reveal the issue, delete and re-apply the Bank of Sirius app to your cluster:
```sh
kubectl delete -f kubernetes-manifests
kubectl apply -f kubernetes-manifests
```
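After re-applying the manifests, you can watch the pods come back up instead of re-running `kubectl get pods` by hand; a minimal sketch:

```sh
# Stream pod status changes as the app redeploys (Ctrl+C to stop)
kubectl get pods -w

# Or block until every pod reports Ready, giving up after five minutes
kubectl wait --for=condition=ready pod --all --timeout=300s
```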

If your problem persists, proceed to the Common Problems section below.

## Common Problems

### Pod has `STATUS: CrashLoopBackOff`

If a pod is crash-looping, the process inside the container has exited with an error. Run `kubectl logs <pod-name>` to get the container logs; it is likely that a Java or Python exception caused the service to crash. File a GitHub issue if this happens, as it could correspond to a widespread outage (or an environment problem that could affect other users). When filing your issue, include the crash logs for the failing pods.
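Note that `kubectl logs` shows the current container, and a crash-looping pod has usually already restarted. A minimal sketch for pulling the logs that led to the crash:

```sh
# Logs from the previous (crashed) container instance
kubectl logs <pod-name> --previous

# Save them to a file to attach to your GitHub issue
kubectl logs <pod-name> --previous > crash.log
```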

### Pod is stuck in `STATUS: Pending` or `READY: 0/1`

Run `kubectl describe pod <pod-name>` to get details about the state of the Pod. At the bottom of the output, you should see a set of events:

```
Events:
  Type     Reason       Age                From                                             Message
  ----     ------       ----               ----                                             -------
  Normal   Scheduled    73s                default-scheduler                                Successfully assigned default/balancereader-fb6784fc-9fw2k to gke-toggles-default-pool-28882412-xljt
  Warning  FailedMount  72s (x2 over 72s)  kubelet, gke-toggles-default-pool-28882412-xljt  MountVolume.SetUp failed for volume "publickey" : secret "jwt-key" not found
  Normal   Pulling      70s                kubelet, gke-toggles-default-pool-28882412-xljt  Pulling image "gcr.io/my-cool-project/bank-of-sirius/gcr.io/bank-of-sirius/balancereader:v0.2.0-171-gd459ddb-dirty@sha256:5b178bd029d04e25bf68df57096b961a28dfb243717d380524a89de994d81ff6"
  Normal   Pulled       69s                kubelet, gke-toggles-default-pool-28882412-xljt  Successfully pulled image "gcr.io/my-cool-project/bank-of-sirius/gcr.io/bank-of-sirius/balancereader:v0.2.0-171-gd459ddb-dirty@sha256:5b178bd029d04e25bf68df57096b961a28dfb243717d380524a89de994d81ff6"
  Normal   Created      69s                kubelet, gke-toggles-default-pool-28882412-xljt  Created container balancereader
  Normal   Started      69s                kubelet, gke-toggles-default-pool-28882412-xljt  Started container balancereader
  Warning  Unhealthy    4s (x2 over 9s)    kubelet, gke-toggles-default-pool-28882412-xljt  Readiness probe failed: Get http://10.0.1.141:8080/ready: dial tcp 10.0.1.141:8080: connect: connection refused
```

In this case, note the `FailedMount` error that has occurred twice (`x2`). This means that the `jwt-key` Secret (the JWT public key the balancereader needs to authenticate requests) is not present in the cluster. `skaffold` should deploy this Secret automatically; if you deployed the app manually, follow the README instructions to add the `jwt-key` Secret to your cluster.
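To confirm whether the Secret exists, and to create it if it's missing, here is a minimal sketch; the `openssl` key generation and file names are assumptions, so follow the README for the canonical procedure:

```sh
# Check whether the jwt-key Secret exists in the current namespace
kubectl get secret jwt-key

# If it's missing: generate an RSA keypair and create the Secret
# (assumed file names; see the README for the exact steps)
openssl genrsa -out jwtRS256.key 4096
openssl rsa -in jwtRS256.key -pubout -outform PEM -out jwtRS256.key.pub
kubectl create secret generic jwt-key \
  --from-file=./jwtRS256.key --from-file=./jwtRS256.key.pub
```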

Another common problem that you may see in the Events is `Insufficient memory`:

```
Events:
  Type     Reason            Age                From               Message
  ----     ------            ----               ----               -------
  Warning  FailedScheduling  12s (x3 over 13s)  default-scheduler  0/1 nodes are available: 1 Insufficient memory.
```

This means that your cluster does not have enough capacity to host all the Bank of Sirius workloads; you either need to use a different cluster or increase your existing cluster's capacity, as sketched below.
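To see where memory is going before resizing, and to add nodes on GKE, a minimal sketch (`CLUSTER_NAME` and the node count are placeholders):

```sh
# Per-node capacity and the resource requests already scheduled on it
kubectl describe nodes

# On GKE: grow the cluster's node pool (placeholder name and size)
gcloud container clusters resize CLUSTER_NAME --num-nodes=4
```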

### 404 error in the frontend

You may see a `404: Not Found` error if you've added a Kubernetes Ingress resource pointing to the `frontend` service but have misconfigured that resource. Note that for GKE Ingress to work, the service must be of type `NodePort`. By default in `kubernetes-manifests`, the `frontend` service is of type `LoadBalancer`, so you'd have to change the service type (a sketch follows). See the GKE docs for more info.
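If you want to switch the service type without editing the manifest, a `kubectl patch` works; a minimal sketch:

```sh
# Change the frontend Service from LoadBalancer to NodePort
# so a GKE Ingress can route to it
kubectl patch service frontend -p '{"spec": {"type": "NodePort"}}'
```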