Add tls overlay #80

Merged
merged 25 commits into from
Oct 7, 2023

sed-i
Contributor

@sed-i sed-i commented Aug 14, 2023

This PR adds a TLS overlay.

Fixes #78.

Depends on:

Testing

First, deploy the testing overlay:

$ tox -e render-edge
$ juju deploy ./bundle.yaml --trust \
  --overlay overlays/tls-overlay.yaml \
  --overlay overlays/testing-overlay.yaml

We do not have the receive-ca-cert relation in the COS charms yet, so the external CA certificate needs to be copied in manually:

juju run external-ca/0 get-ca-certificate --format json | jq -r '."external-ca/0".results."ca-certificate"' > external-ca.crt
openssl x509 -noout -text -in external-ca.crt

echo | openssl s_client -showcerts -servername 10.43.8.206 -connect 10.43.8.206:443 | openssl x509 -text -noout

juju scp --container prometheus external-ca.crt prometheus/0:/usr/local/share/ca-certificates
juju ssh --container prometheus prometheus/0 update-ca-certificates --fresh

juju scp --container grafana external-ca.crt grafana/0:/usr/local/share/ca-certificates
juju ssh --container grafana grafana/0 update-ca-certificates --fresh
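
The copy-and-refresh step above is the same pair of commands for each workload, so it can be generated rather than typed. A minimal sketch, assuming (as in the commands above) that the container is named after the application and that unit 0 is the target; `ca_install_commands` is a hypothetical helper, not part of this PR:

```python
import shlex
from typing import List


def ca_install_commands(app: str, cert_file: str = "external-ca.crt", unit: int = 0) -> List[List[str]]:
    """Build the two juju invocations that copy a CA cert into a workload
    container and then refresh that container's trust store."""
    target = f"{app}/{unit}"
    return [
        # Copy the certificate next to the other locally trusted CAs.
        ["juju", "scp", "--container", app, cert_file,
         f"{target}:/usr/local/share/ca-certificates"],
        # Rebuild the trust store so the new CA is picked up.
        ["juju", "ssh", "--container", app, target,
         "update-ca-certificates", "--fresh"],
    ]


if __name__ == "__main__":
    # Print the commands for the two charms handled manually above.
    for app in ("prometheus", "grafana"):
        for argv in ca_install_commands(app):
            print(shlex.join(argv))
```

The helper only builds argument lists; passing them to `subprocess.run` is left to the caller so the commands can be reviewed first.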

Then:

  1. Attempt to manually curl each cos component: juju show-unit catalogue/0 | grep url.
  2. Make sure all scrape targets are healthy: curl -k https://10.43.8.206/bndl-prometheus-0/api/v1/targets | jq | grep -E "scrapeUrl|health".
  3. Make sure the watchdog alert from avalanche is firing:
    • In prometheus: curl <prometheus>/api/v1/alerts.
    • In alertmanager: curl <alertmanager>/api/v2/alerts.
  4. Make sure grafana has all the dashboards.
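
Step 2 can also be checked without grepping. A rough sketch that lists unhealthy targets from the JSON returned by `/api/v1/targets`; the `SAMPLE` payload is a trimmed, hypothetical response (its error text is a placeholder), and `unhealthy_targets` is an illustrative name:

```python
from typing import Dict, List, Tuple


def unhealthy_targets(payload: Dict) -> List[Tuple[str, str]]:
    """Return (scrapeUrl, lastError) for every active target whose health is not "up"."""
    targets = payload.get("data", {}).get("activeTargets", [])
    return [
        (t.get("scrapeUrl", ""), t.get("lastError", ""))
        for t in targets
        if t.get("health") != "up"
    ]


# Hypothetical, trimmed-down response used only for illustration.
SAMPLE = {
    "status": "success",
    "data": {
        "activeTargets": [
            {
                "scrapeUrl": "http://10.43.8.206:80/bndl-loki-0/metrics",
                "health": "down",
                "lastError": "placeholder error text",
            },
            {
                "scrapeUrl": "https://prometheus-0.prometheus-endpoints.bndl.svc.cluster.local:9090/metrics",
                "health": "up",
                "lastError": "",
            },
        ]
    },
}

if __name__ == "__main__":
    for url, error in unhealthy_targets(SAMPLE):
        print(f"{url}: {error}")
```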

@lucabello
Contributor

I just tried deploying the bundle with the TLS overlay after the Catalogue issue was closed, and I can confirm that specific problem is solved; Catalogue is now idle/active 🎉

@lucabello
Contributor

lucabello commented Aug 23, 2023

Everything eventually settles to active idle, which is good.

As per the other test conditions:

  1. ✅ I can curl the listed components (prometheus, grafana, alertmanager). ❌ However, I cannot curl catalogue: I get Bad Gateway on both HTTP and HTTPS. (It turns out I can curl catalogue directly, just not through traefik; it seems we don't really use the traefik-provided URL in catalogue at all.)
  2. ✅ All the scrape targets in Prometheus are healthy
  3. ❌ There are no alerts at that alertmanager endpoint. Prometheus, however, has two, both related to avalanche (AlwaysFiringDueToAbsentMetric/DueToNumericValue); I'm not sure whether there is supposed to be an actual Watchdog alert like there is for alertmanager.
  4. ❓ Cannot check this due to the unreachable web UI bug

@lucabello
Contributor

Following up on my previous comment:

@lucabello
Contributor

@sed-i The race in alertmanager configuration issue could be fixed now (as my guess is that this PR solved it).

You should try deploying it and check! :D

@PietroPasotti

interestingly, loki metrics report down because of a None in the url:

image

no alert seems to be firing after adding avalanche

@PietroPasotti

something wrong with loki's metrics-endpoint
image

@PietroPasotti

update for posterity: apparently fixed in edge

@PietroPasotti

for my own future benefit:

curl -k https://$TRAEFIK_IP/$MODEL_NAME-prometheus-0/api/v1/targets | jq .status # 'success'
curl -k https://$TRAEFIK_IP/$MODEL_NAME-alertmanager/api/v2/alerts | jq .status # []
curl -k https://192.168.0.105/tls-bundle3-catalogue # ??? Bad Gateway
curl -k https://192.168.0.105/tls-bundle3-grafana # <a href="/tls-bundle3-grafana/login">Found</a>. 
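
The URLs above follow a visible pattern: `<model>-<app>` for per-app ingress and `<model>-<app>-<unit>` for per-unit ingress. A small sketch that rebuilds them, with the pattern inferred only from the URLs in this thread (not from traefik's documentation); `ingress_url` is a hypothetical helper:

```python
from typing import Optional


def ingress_url(traefik_host: str, model: str, app: str,
                unit: Optional[int] = None, scheme: str = "https") -> str:
    """Build an ingress URL in the shape seen in this thread.

    Per-app ingress (e.g. alertmanager, catalogue) has no unit suffix;
    per-unit ingress (e.g. prometheus, loki) appends the unit number.
    """
    name = f"{model}-{app}" if unit is None else f"{model}-{app}-{unit}"
    return f"{scheme}://{traefik_host}/{name}"


if __name__ == "__main__":
    print(ingress_url("192.168.0.105", "tls-bundle3", "catalogue"))
    print(ingress_url("10.43.8.206", "bndl", "prometheus", unit=0))
```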

@PietroPasotti

PietroPasotti commented Sep 12, 2023

result of my day of sweat:

  • after deploy, all datasources in grafana are dead

  • I add to /etc/hosts: <traefik ip> traefik.local and run `juju config traefik external_hostname="traefik.local"`

  • log into grafana; when testing the datasources I get:

    • some error telling me it can't route to traefik
    • I juju ssh to grafana and add to /etc/hosts: <traefik ip> traefik.local
    • error changes to: Error reading Prometheus: Post "https://traefik.local/api/v1/query": x509: certificate is valid for 5c7c9fdf4db7d6d9269c72bc1453af14.33c28df6f153744b27cff01be4a94ca2.traefik.default, not traefik.local
  • no alerts are visible in the alertmanager ui

@sed-i
Contributor Author

sed-i commented Sep 13, 2023

Deployed just now,

$ tox -e render-edge
$ juju deploy ./bundle.yaml --trust \
  --overlay overlays/tls-overlay.yaml \
  --overlay overlays/testing-overlay.yaml

And noticed a few issues with the scrape targets:

curl -sk https://10.43.8.206/bndl-prometheus-0/api/v1/targets | jq | grep -E "scrapeUrl|health"
        "scrapeUrl": "http://10.43.8.206:80/bndl-alertmanager/metrics",
        "health": "down",
        "scrapeUrl": "http://10.1.166.74:9001/metrics",
        "health": "up",
        "scrapeUrl": "http://10.1.166.82:9001/metrics",
        "health": "up",
        "scrapeUrl": "https://grafana-0.grafana-endpoints.bndl.svc.cluster.local:3000/metrics",
        "health": "up",
        "scrapeUrl": "http://10.43.8.206:80/bndl-loki-0/metrics",
        "health": "down",
        "scrapeUrl": "http://traefik-0.traefik-endpoints.bndl.svc.cluster.local:8082/metrics",
        "health": "up",
        "scrapeUrl": "https://prometheus-0.prometheus-endpoints.bndl.svc.cluster.local:9090/metrics",
        "health": "up",
  1. Alertmanager scrape URL has http instead of https, and the port is :80 instead of blank (or :443). (Manually curling https://10.43.8.206/bndl-alertmanager/metrics works.)
  2. Avalanche is not related to traefik, but it is reporting unit IP instead of fqdn. (Low priority)
  3. Loki scrape URL has http instead of https, and the port is :80 instead of blank (or :443). In addition, loki tells traefik over relation data that it's http instead of https, so the metrics endpoint isn't reachable via ingress URL at all.
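
The three observations above boil down to checks on scheme, port and host, which can be applied to each scrape URL mechanically. A minimal sketch under that assumption; `scrape_url_issues` and its `tls_expected` flag are illustrative names, and the flag exists because some targets (such as avalanche) are not expected to serve TLS:

```python
import ipaddress
from typing import List
from urllib.parse import urlsplit


def scrape_url_issues(url: str, tls_expected: bool = True) -> List[str]:
    """Return a list of human-readable problems found in a scrape URL."""
    parts = urlsplit(url)
    issues = []
    if tls_expected and parts.scheme != "https":
        issues.append("scheme is not https")
    if parts.port == 80:
        issues.append("explicit port 80")
    try:
        # ip_address() raises ValueError for anything that is not a literal IP.
        ipaddress.ip_address(parts.hostname or "")
        issues.append("host is an IP address, not an FQDN")
    except ValueError:
        pass
    return issues


if __name__ == "__main__":
    for url in (
        "http://10.43.8.206:80/bndl-alertmanager/metrics",
        "https://grafana-0.grafana-endpoints.bndl.svc.cluster.local:3000/metrics",
    ):
        print(url, scrape_url_issues(url))
```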

@sed-i
Contributor Author

sed-i commented Sep 14, 2023

Heads up, @PietroPasotti:

I added another relation to the tls overlay between traefik and a new charm, "external-ca". That's how it should be.
The problem is that it made things worse.
Before, when traefik was related to ca:certificates, curling prometheus worked; now it returns Internal Server Error.
Need to figure out why.

@PietroPasotti

seems to go away after adding the https redirect block to traefik config and restarting a couple of times

@PietroPasotti

results from testing with traefik's [fix-tls branch](https://github.com/canonical/traefik-k8s-operator/pull/245#event-10405698061):

  • whole bundle goes active/idle 💯
  • all metrics up except loki: 🤔
        "globalUrl": "https://192.168.0.105:443/clite-tls-test-loki-0/metrics",
        "lastError": "Get \"https://192.168.0.105:443/clite-tls-test-loki-0/metrics\": x509: cannot validate certificate for 192.168.0.105 because it doesn't contain any IP SANs",
        "lastScrape": "2023-09-19T09:40:23.227981499Z",
        "lastScrapeDuration": 0.002484311,
        "health": "down",
  • prometheus shows avalanche alert 👌
  • alertmanager does not: 💥
❯ curl -k  https://192.168.0.105/clite-tls-test-alertmanager/api/v2/alerts   
[]                                                                                                                                                       
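
The mismatch above (an alert visible in prometheus but an empty list from alertmanager) can be detected by comparing the two API responses. A sketch assuming the usual response shapes of prometheus `/api/v1/alerts` and alertmanager `/api/v2/alerts`; the sample payloads are hypothetical and the function names are illustrative:

```python
from typing import Dict, List


def prometheus_firing(payload: Dict) -> set:
    """Alert names that prometheus reports in the "firing" state."""
    alerts = payload.get("data", {}).get("alerts", [])
    return {a["labels"]["alertname"] for a in alerts if a.get("state") == "firing"}


def alertmanager_names(alerts: List[Dict]) -> set:
    """Alert names present in an alertmanager /api/v2/alerts response."""
    return {a["labels"]["alertname"] for a in alerts}


def missing_in_alertmanager(prom_payload: Dict, am_alerts: List[Dict]) -> List[str]:
    """Alerts firing in prometheus that never reached alertmanager, sorted by name."""
    return sorted(prometheus_firing(prom_payload) - alertmanager_names(am_alerts))


# Hypothetical payloads: one firing avalanche alert, and an empty alertmanager.
PROM_SAMPLE = {
    "status": "success",
    "data": {"alerts": [
        {"labels": {"alertname": "AlwaysFiringDueToAbsentMetric"}, "state": "firing"},
    ]},
}
AM_SAMPLE: List[Dict] = []

if __name__ == "__main__":
    print(missing_in_alertmanager(PROM_SAMPLE, AM_SAMPLE))
```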

after copying over the cert as described in traefik's pr description:

image
image

@sed-i
Contributor Author

sed-i commented Sep 19, 2023

Yeah, IP for SAN doesn't work yet, I suspect because of canonical/tls-certificates-interface#71.
With external hostname such as trfk.local curl worked for me.
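
The two x509 errors in this thread come from the same rule: a client connecting by IP needs a matching IP SAN, and a client connecting by name needs a matching DNS SAN. A deliberately simplified sketch of that rule (wildcard names are ignored, which real verifiers do handle); `san_covers` is a hypothetical helper:

```python
import ipaddress
from typing import Iterable


def san_covers(host: str, dns_sans: Iterable[str], ip_sans: Iterable[str] = ()) -> bool:
    """Return True if a certificate with the given SANs would match `host`.

    Simplification: DNS names are compared exactly (case-insensitively);
    wildcard entries are not expanded.
    """
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal: only DNS SANs can match.
        return host.lower() in {name.lower() for name in dns_sans}
    # IP literal: only IP SANs can match, DNS SANs are never consulted.
    return ip in {ipaddress.ip_address(s) for s in ip_sans}


if __name__ == "__main__":
    print(san_covers("192.168.0.105", ["trfk.local"]))  # connecting by IP
    print(san_covers("trfk.local", ["trfk.local"]))     # connecting by hostname
```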

*"run external-ca/0 get-ca-certificate --format=json --no-color".split()
)
cert = json.loads(stdout)["external-ca/0"]["results"]["ca-certificate"]
cert_path.write_text(cert)

Check failure

Code scanning / CodeQL

Clear-text storage of sensitive information

This expression stores sensitive data (certificate) as clear text.
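
The parsing step in the snippet under review can be exercised on its own. A minimal sketch, assuming the JSON shape implied by the snippet and by the jq filter in the PR description; `extract_ca_cert` is a hypothetical name and the sample certificate body is a placeholder, not real data:

```python
import json


def extract_ca_cert(juju_run_json: str, unit: str = "external-ca/0") -> str:
    """Pull the CA certificate out of `juju run ... --format json` output."""
    return json.loads(juju_run_json)[unit]["results"]["ca-certificate"]


# Placeholder output; a real run would contain a full PEM certificate.
SAMPLE_OUTPUT = json.dumps({
    "external-ca/0": {
        "results": {
            "ca-certificate": "-----BEGIN CERTIFICATE-----\nPLACEHOLDER\n-----END CERTIFICATE-----\n"
        }
    }
})

if __name__ == "__main__":
    print(extract_ca_cert(SAMPLE_OUTPUT))
```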
@sed-i sed-i marked this pull request as ready for review October 3, 2023 15:53
Contributor

@Abuelodelanada Abuelodelanada left a comment

After deploying the bundle as indicated in the testing instructions, I have the following:

$ jst                                                            
Model     Controller  Cloud/Region        Version  SLA          Timestamp
cos-lite  microk8s    microk8s/localhost  3.1.5    unsupported  15:36:51-03:00

App           Version  Status  Scale  Charm                     Channel  Rev  Address         Exposed  Message
alertmanager  0.25.0   active      2  alertmanager-k8s          edge      92  10.152.183.146  no       
avalanche              active      2  avalanche-k8s             edge      35  10.152.183.89   no       
ca                     active      1  self-signed-certificates  edge      37  10.152.183.75   no       
catalogue              active      1  catalogue-k8s             edge      27  10.152.183.204  no       
external-ca            active      1  self-signed-certificates  edge      37  10.152.183.217  no       
grafana       9.2.1    active      1  grafana-k8s               edge      92  10.152.183.218  no       
loki          2.7.4    active      1  loki-k8s                  edge      99  10.152.183.38   no       
prometheus    2.46.0   active      1  prometheus-k8s            edge     150  10.152.183.123  no       
traefik       2.10.4   active      1  traefik-k8s               edge     157  192.168.1.250   no       

Unit             Workload  Agent  Address      Ports  Message
alertmanager/0   active    idle   10.1.38.87          
alertmanager/1*  active    idle   10.1.38.98          
avalanche/0      active    idle   10.1.38.76          
avalanche/1*     active    idle   10.1.38.122         
ca/0*            active    idle   10.1.38.78          
catalogue/0*     active    idle   10.1.38.80          
external-ca/0*   active    idle   10.1.38.102         
grafana/0*       active    idle   10.1.38.109         
loki/0*          active    idle   10.1.38.71          
prometheus/0*    active    idle   10.1.38.70          
traefik/0*       active    idle   10.1.38.73          

Relation provider                   Requirer                     Interface              Type     Message
alertmanager:alerting               loki:alertmanager            alertmanager_dispatch  regular  
alertmanager:alerting               prometheus:alertmanager      alertmanager_dispatch  regular  
alertmanager:grafana-dashboard      grafana:grafana-dashboard    grafana_dashboard      regular  
alertmanager:grafana-source         grafana:grafana-source       grafana_datasource     regular  
alertmanager:replicas               alertmanager:replicas        alertmanager_replica   peer     
alertmanager:self-metrics-endpoint  prometheus:metrics-endpoint  prometheus_scrape      regular  
avalanche:metrics-endpoint          prometheus:metrics-endpoint  prometheus_scrape      regular  
avalanche:replicas                  avalanche:replicas           avalanche_replica      peer     
ca:certificates                     alertmanager:certificates    tls-certificates       regular  
ca:certificates                     catalogue:certificates       tls-certificates       regular  
ca:certificates                     grafana:certificates         tls-certificates       regular  
ca:certificates                     loki:certificates            tls-certificates       regular  
ca:certificates                     prometheus:certificates      tls-certificates       regular  
ca:send-ca-cert                     traefik:receive-ca-cert      certificate_transfer   regular  
catalogue:catalogue                 alertmanager:catalogue       catalogue              regular  
catalogue:catalogue                 grafana:catalogue            catalogue              regular  
catalogue:catalogue                 prometheus:catalogue         catalogue              regular  
catalogue:replicas                  catalogue:replicas           catalogue_replica      peer     
external-ca:certificates            traefik:certificates         tls-certificates       regular  
grafana:grafana                     grafana:grafana              grafana_peers          peer     
grafana:metrics-endpoint            prometheus:metrics-endpoint  prometheus_scrape      regular  
grafana:replicas                    grafana:replicas             grafana_replicas       peer     
loki:grafana-dashboard              grafana:grafana-dashboard    grafana_dashboard      regular  
loki:grafana-source                 grafana:grafana-source       grafana_datasource     regular  
loki:metrics-endpoint               prometheus:metrics-endpoint  prometheus_scrape      regular  
loki:replicas                       loki:replicas                loki_replica           peer     
prometheus:grafana-dashboard        grafana:grafana-dashboard    grafana_dashboard      regular  
prometheus:grafana-source           grafana:grafana-source       grafana_datasource     regular  
prometheus:prometheus-peers         prometheus:prometheus-peers  prometheus_peers       peer     
traefik:ingress                     alertmanager:ingress         ingress                regular  
traefik:ingress                     catalogue:ingress            ingress                regular  
traefik:ingress-per-unit            loki:ingress                 ingress_per_unit       regular  
traefik:ingress-per-unit            prometheus:ingress           ingress_per_unit       regular  
traefik:metrics-endpoint            prometheus:metrics-endpoint  prometheus_scrape      regular  
traefik:peers                       traefik:peers                traefik_peers          peer     
traefik:traefik-route               grafana:ingress              traefik_route          regular  

Then:

Attempt to manually curl each cos component: juju show-unit catalogue/0 | grep url.

╭─ubuntu@charm-dev-juju-31 ~/repos/cos-lite-bundle ‹feature/tls-overlay●› [microk8s:cos-lite]
╰─$ for i in $(juju show-unit catalogue/0 | grep url: | awk '{print $2}'); do echo $i; done
https://192.168.1.250/cos-lite-catalogue
https://192.168.1.250/cos-lite-grafana
https://192.168.1.250/cos-lite-prometheus-0
http://192.168.1.250/cos-lite-alertmanager
╭─ubuntu@charm-dev-juju-31 ~/repos/cos-lite-bundle ‹feature/tls-overlay●› [microk8s:cos-lite]
╰─$ for i in $(juju show-unit catalogue/0 | grep url: | awk '{print $2}'); do echo "-->" $i: `curl -o /dev/null -sk -w '%{http_code}' $i`; done
--> https://192.168.1.250/cos-lite-catalogue: 502
--> https://192.168.1.250/cos-lite-grafana: 302
--> https://192.168.1.250/cos-lite-prometheus-0: 302
--> https://192.168.1.250/cos-lite-alertmanager: 200
  • Catalogue URL is not working
  • Alertmanager is using HTTP instead of HTTPS
  • Grafana and Prometheus work OK
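
The status codes from the loop above can be bucketed automatically. A rough sketch that treats 2xx and 3xx as reachable (an assumption: the 302s here are taken to be ordinary redirects) and everything else as failing; `summarize` is an illustrative name:

```python
from typing import Dict, List


def summarize(results: Dict[str, int]) -> Dict[str, List[str]]:
    """Split URLs into "ok" (2xx/3xx) and "failing" (anything else)."""
    summary: Dict[str, List[str]] = {"ok": [], "failing": []}
    for url, code in results.items():
        bucket = "ok" if 200 <= code < 400 else "failing"
        summary[bucket].append(url)
    return summary


# Status codes as reported by the curl loop above.
RESULTS = {
    "https://192.168.1.250/cos-lite-catalogue": 502,
    "https://192.168.1.250/cos-lite-grafana": 302,
    "https://192.168.1.250/cos-lite-prometheus-0": 302,
    "https://192.168.1.250/cos-lite-alertmanager": 200,
}

if __name__ == "__main__":
    print(summarize(RESULTS)["failing"])
```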

Make sure all scrape targets are healthy: curl -k https://10.43.8.206/bndl-prometheus-0/api/v1/targets | jq | grep -E "scrapeUrl|health".

Everything is OK:

$ curl -sk https://192.168.1.250/cos-lite-prometheus-0/api/v1/targets | jq | grep -E "scrapeUrl|health"
        "scrapeUrl": "https://alertmanager-1.alertmanager-endpoints.cos-lite.svc.cluster.local:9093/metrics",
        "health": "up",
        "scrapeUrl": "http://10.1.38.76:9001/metrics",
        "health": "up",
        "scrapeUrl": "http://10.1.38.122:9001/metrics",
        "health": "up",
        "scrapeUrl": "https://grafana-0.grafana-endpoints.cos-lite.svc.cluster.local:3000/metrics",
        "health": "up",
        "scrapeUrl": "https://loki-0.loki-endpoints.cos-lite.svc.cluster.local:3100/metrics",
        "health": "up",
        "scrapeUrl": "http://traefik-0.traefik-endpoints.cos-lite.svc.cluster.local:8082/metrics",
        "health": "up",
        "scrapeUrl": "https://prometheus-0.prometheus-endpoints.cos-lite.svc.cluster.local:9090/metrics",
        "health": "up",

Make sure the watchdog alert from avalanche is firing:

  • In prometheus: curl <prometheus>/api/v1/alerts.
  • In alertmanager: curl <alertmanager>/api/v2/alerts.

Alerts are firing:

Prometheus:
image

Alertmanager:
image

Make sure grafana has all the dashboards.

Dashboards in Grafana:

image

@sed-i
Contributor Author

sed-i commented Oct 3, 2023

Confirmed findings by @Abuelodelanada:

  • After https is enabled in alertmanager and traefik, catalogue still has an http URL for alertmanager.
  • Curling the catalogue ingress url (with https) gives "bad gateway" (catalogue has a certs relation in place).

@Abuelodelanada
Contributor

@sed-i Manually tested again. Everything works OK!!

@sed-i sed-i merged commit fd3297b into main Oct 7, 2023
@sed-i sed-i deleted the feature/tls-overlay branch October 7, 2023 00:19
Successfully merging this pull request may close these issues.

Integration tests for TLS
5 participants