When a service health check is configured, the service status will be healthy as soon as a health check passes.
If enough health checks fail (maxConsecutiveFailures
), the status will change to unhealthy.
To add a health check, give the port mapping a name and create a healthChecks
element that refers to it.
# updated service definition
$ cat > minitwit.json << EOF
{
"id": "/minitwit",
"instances": 1,
"cpus": 1,
"mem": 512,
"container": {
"docker": {
"image": "karlkfi/minitwit",
"forcePullImage": false,
"privileged": false,
"portMappings": [
{
"hostPort": 80,
"containerPort": 80,
"protocol": "tcp",
"name": "http-api"
}
],
"network": "BRIDGE"
}
},
"acceptedResourceRoles": [
"slave_public"
],
"requirePorts": true,
"healthChecks": [
{
"protocol": "TCP",
"gracePeriodSeconds": 30,
"intervalSeconds": 15,
"timeoutSeconds": 5,
"maxConsecutiveFailures": 2,
"portName": "http-api"
}
]
}
EOF
# create service
$ dcos marathon app add < minitwit.json
# check service health
$ dcos marathon app list
ID MEM CPUS TASKS HEALTH DEPLOYMENT CONTAINER CMD
/minitwit 256 1 1/1 1/1 --- DOCKER None
When a service readiness check is configured, the service status will remain in the deploying stage until the readiness check is successful. This is most useful when services take a long time to initialize. This way they wont be marked as unhealthy until after they are ready.
To add a readiness check, create a readinessChecks
element that refers to a port by name.
# updated service definition
$ cat > minitwit.json << EOF
{
"id": "/minitwit",
"instances": 1,
"cpus": 1,
"mem": 512,
"container": {
"docker": {
"image": "karlkfi/minitwit",
"forcePullImage": false,
"privileged": false,
"portMappings": [
{
"hostPort": 80,
"containerPort": 80,
"protocol": "tcp",
"name": "http-api"
}
],
"network": "BRIDGE"
}
},
"acceptedResourceRoles": [
"slave_public"
],
"requirePorts": true,
"healthChecks": [
{
"protocol": "TCP",
"gracePeriodSeconds": 30,
"intervalSeconds": 15,
"timeoutSeconds": 5,
"maxConsecutiveFailures": 2,
"portName": "http-api"
}
],
"readinessChecks": [
{
"protocol": "HTTP",
"path": "/",
"portName": "http-api",
"intervalSeconds": 15,
"timeoutSeconds": 5,
"httpStatusCodesForReady": [ 200, 302 ],
"preserveLastResponse": true
}
]
}
EOF
If at any point you need to debug a DC/OS component, job, or service, you may need to SSH into the cluster.
In production, not all machines in a cluster should be internet accessible. So you may need to VPN into the cluster network and SSH into a bootstrap or master node as a jump box.
In the provided AWS clusters, the master nodes are internet accessible. So you can SSH into them and/or use them to proxy into agent nodes.
First, add the SSH key to your SSH agent:
-
Download or write SSH private key to
~/.ssh/dcoskey
-
Set SSH private key permissions:
chmod 600 ~/.ssh/dcoskey
-
Make sure the local SSH agent is running:
eval $(ssh-agent -s)
-
Add SSH private key to SSH agent:
ssh-add ~/.ssh/dcoskey
If you get the error
Could not open a connection to your authentication agent.
, then start the SSH agent witheval $(ssh-agent -s)
and then retry the previous command.
To SSH into a master node:
ssh -A core@${MASTER_IP}
The -A
passes in the SSH agent config so the private key can be used while SSHed into the master node.
To print a list of agent nodes (including IPs and Mesos Agent IDs):
dcos node
To SSH into an agent node from a master node:
ssh core@${AGENT_IP}
Alternatively, it's possible to use the DC/OS CLI to handle master node proxying for you:
dcos node ssh --master-proxy --mesos-id=${AGENT_ID}
To find the internal IP(s) of the public agent nodes:
dcos node --json | jq -r '.[] | select(.attributes.public_ip == "true") | .hostname'
For more SSH details, see https://dcos.io/docs/1.8/administration/sshcluster/
Component List: https://dcos.io/docs/1.8/overview/components/
One of the most common logs to view are Marathon's because it handles deploying services.
To view Marathon's logs first SSH into a master node, then use journalctl:
$ journalctl -u dcos-marathon
To tail the logs, use -f
:
$ journalctl -u dcos-marathon -f
Other components can be found using systemctl
:
$ systemctl | grep dcos-
dcos-3dt.service loaded active running Diagnostics: DC/OS Distributed Diagnostics Tool Master API and Aggregation Service
dcos-adminrouter.service loaded active running Admin Router Master: A high performance web server and a reverse proxy server
dcos-cosmos.service loaded active running Package Service: DC/OS Packaging API
dcos-epmd.service loaded active running Erlang Port Mapping Daemon: DC/OS Erlang Port Mapping Daemon
dcos-exhibitor.service loaded active running Exhibitor: Zookeeper Supervisor Service
dcos-history.service loaded active running Mesos History: DC/OS Resource Metrics History Service/API
dcos-marathon.service loaded active running Marathon: DC/OS Init System
dcos-mesos-dns.service loaded active running Mesos DNS: DNS based Service Discovery
dcos-mesos-master.service loaded active running Mesos Master: DC/OS Mesos Master Service
dcos-metronome.service loaded active running Jobs Service: DC/OS Metronome
dcos-minuteman.service loaded active running Layer 4 Load Balancer: DC/OS Layer 4 Load Balancing Service
dcos-navstar.service loaded active running Navstar: A distributed systems & network overlay orchestration engine
dcos-oauth.service loaded active running OAuth: OAuth Authentication Service
dcos-pkgpanda-api.service loaded active running Pkgpanda API: Package Management Service
dcos-spartan.service loaded active running DNS Dispatcher: An RFC5625 Compliant DNS Forwarder
dcos-pkgpanda-api.socket loaded active running Pkgpanda API socket: Package Management Service Socket
dcos-adminrouter-reload.timer loaded active waiting Admin Router Reloader Timer: Periodically reload admin router nginx config to pickup new dns
dcos-gen-resolvconf.timer loaded active waiting Generate resolv.conf Timer: Periodically update systemd-resolved for mesos-dns
dcos-logrotate-master.timer loaded active waiting Logrotate Timer: Timer to trigger every 2 minutes
dcos-signal.timer loaded active waiting Signal Timer: Timer for DC/OS Signal Service
dcos-spartan-watchdog.timer loaded active waiting DNS Dispatcher Watchdog Timer: Periodically check is Spartan is working
Note that master and agent nodes have different components.
Mesosphere Enterprise DC/OS includes user groups, fine-grained authorization, and service accounts, along with other security features. Service accounts in particular allow for the creation of long-lived authentication tokens, which are useful for programmatic interaction.
For more on service authentication, see https://docs.mesosphere.com/1.8/administration/id-and-access-mgt/service-auth/
In the future, service accounts and auth tokens will likely be open sourced, but for now, to programmatically avoid the OAuth flow, it is possible to SSH into the master nodes and curl the component APIs directly.
For example, to access the Marathon API:
$ ssh -i ${SSH_KEY_PATH} ${SSH_USER}@${DCOS_ADDRESS} curl -s -f http://marathon.mesos:8080/
Or access the Mesos API:
$ ssh -i ${SSH_KEY_PATH} ${SSH_USER}@${DCOS_ADDRESS} curl -s -f http://leader.mesos:5050/