OpenShift provides a wealth of monitoring and alerting; however, it can sometimes be handy to surface a simple up/down signal for an OpenShift cluster that can be easily interpreted by tools like UptimeRobot. This lets you provide an operational or business-level dashboard of your cluster’s status to users and application owners who may not be familiar with all of the nuances of OpenShift’s or Kubernetes’ internals.
The health of an OpenShift cluster depends on many things, such as etcd, operators, nodes and the API, so how do we aggregate all of this information? While you could certainly write your own code to do it, fortunately a utility called Cerberus already provides this capability. Cerberus was born out of Red Hat’s Performance and Scaling group and was designed to be used with Kraken, a chaos engineering tool. A chaos engineering tool isn’t very useful if you can’t determine the status of the cluster, and thus Cerberus was born.
A number of blog posts have already been written about Kraken and Cerberus from a chaos engineering point of view, which you can view here and here. Here we are going to focus on the basics of using it for simple health checking.
One thing to note about Cerberus is that it is aggressive about returning an unhealthy state even if the cluster is operational. For example, if you set it to monitor a namespace, any pod failure in a multi-pod deployment in that namespace will trigger an unhealthy flag, even if the other pods in the deployment are running and still servicing requests. As a result, some tuning of Cerberus or the use of custom checks is required if you want to use it for a more SLA-focused view.
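To make the SLA-focused idea a little more concrete, here is a minimal sketch of the kind of logic such a check might apply: a deployment counts as healthy as long as a minimum fraction of its replicas are ready, rather than failing on any single pod. The function name, the 0.5 threshold and the way the replica counts are obtained are all assumptions for illustration. How the logic is wired into Cerberus through custom_checks is covered by the Cerberus documentation and is not shown here.

```python
def deployment_is_healthy(ready_replicas, desired_replicas, min_ready_fraction=0.5):
    """Return True when enough replicas are ready to keep serving requests.

    Hypothetical helper: the 0.5 default is an arbitrary example threshold,
    not a Cerberus setting.
    """
    if desired_replicas <= 0:
        # A deployment scaled to zero is treated as healthy here by choice.
        return True
    return (ready_replicas / desired_replicas) >= min_ready_fraction
```

With this rule, a deployment with 2 of 3 pods ready would still be reported as healthy, while one with 0 of 3 ready would not. The right threshold depends on what your SLA actually promises.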
To get started with Cerberus simply clone the git repo into an appropriate directory on the system where you want to run it. While you can run it inside of OpenShift, it is highly recommended to run it outside the cluster since your cluster monitoring tool should not be dependent on the cluster itself. To clone the repo, just run the following:
git clone https://github.com/cloud-bulldozer/cerberus
In order for Cerberus to run, it requires access to a kubeconfig file for a user that has already been authenticated. For security purposes, I would highly recommend using a service account with the cluster-reader role rather than a user with cluster-admin. The commands below create a service account in the openshift-monitoring namespace, bind it to the cluster-reader role and generate a kubeconfig that Cerberus can use to authenticate to the cluster.
oc create sa cerberus -n openshift-monitoring
oc adm policy add-cluster-role-to-user cluster-reader -z cerberus -n openshift-monitoring
oc serviceaccounts create-kubeconfig cerberus -n openshift-monitoring > config/kubeconfig
Cerberus can automatically create a token for the prometheus-k8s service account so that it can access Prometheus to pull metrics. To enable this, we need to define a role that gives Cerberus the necessary permissions and bind it to the cerberus service account. Create a file with the following content:
- kind: ServiceAccount
And then apply it with “oc apply -f” to the cluster.
To configure Cerberus you can edit the existing config.yaml file in the repo or create a new one. Creating a new one is highly recommended so that a git pull doesn’t clobber your changes:
cp config/config.yaml config/my-config.yaml
Once you have the config file, you can go through the options and set what you need. Here is an example of my config file which is really just the example config with the kubeconfig parameter tweaked.
distribution: openshift # Distribution can be kubernetes or openshift
kubeconfig_path: /opt/cerberus/config/kubeconfig # Path to kubeconfig
port: 8080 # http server port where cerberus status is published
watch_nodes: True # Set to True for the cerberus to monitor the cluster nodes
watch_cluster_operators: True # Set to True for cerberus to monitor cluster operators
watch_url_routes: # Route url's you want to monitor, this is a double array with the url and optional authorization parameter
watch_namespaces: # List of namespaces to be monitored
- openshift-sdn # When enabled, it will check for the cluster sdn and monitor that namespace
cerberus_publish_status: True # When enabled, cerberus starts a light weight http server and publishes the status
inspect_components: False # Enable it only when OpenShift client is supported to run
# When enabled, cerberus collects logs, events and metrics of failed components
prometheus_url: # The prometheus url/route is automatically obtained in case of OpenShift, please set it when the distribution is Kubernetes.
prometheus_bearer_token: # The bearer token is automatically obtained in case of OpenShift, please set it when the distribution is Kubernetes. This is needed to authenticate with prometheus.
# This enables Cerberus to query prometheus and alert on observing high Kube API Server latencies.
slack_integration: False # When enabled, cerberus reports the failed iterations in the slack channel
# The following env vars need to be set: SLACK_API_TOKEN ( Bot User OAuth Access Token ) and SLACK_CHANNEL ( channel to send notifications in case of failures )
# When slack_integration is enabled, a watcher can be assigned for each day. The watcher of the day is tagged while reporting failures in the slack channel. Values are slack member ID's.
watcher_slack_ID: # (NOTE: Defining the watcher id's is optional and when the watcher slack id's are not defined, the slack_team_alias tag is used if it is set else no tag is used while reporting failures in the slack channel.)
slack_team_alias: # The slack team alias to be tagged while reporting failures in the slack channel when no watcher is assigned
custom_checks: # Relative paths of files containing additional user defined checks
iterations: 5 # Iterations to loop before stopping the watch, it will be replaced with infinity when the daemon mode is enabled
sleep_time: 60 # Sleep duration between each iteration
kube_api_request_chunk_size: 250 # Large requests will be broken into the specified chunk size to reduce the load on API server and improve responsiveness.
daemon_mode: True # Iterations are set to infinity which means that the cerberus will monitor the resources forever
cores_usage_percentage: 0.5 # Set the fraction of cores to be used for multiprocessing
database_path: /tmp/cerberus.db # Path where cerberus database needs to be stored
reuse_database: False # When enabled, the database is reused to store the failures
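A couple of these options interact, so here is a small sketch of how I read the comments above: daemon_mode overrides iterations, and the status endpoint only exists when cerberus_publish_status is enabled. This is not Cerberus’ own code. The helper names, the flat dictionary of options, the default values and the localhost host are all assumptions made for illustration.

```python
import math


def effective_iterations(config):
    """Number of monitoring loops implied by the config comments.

    Hypothetical helper: with daemon_mode enabled the iterations value is
    ignored and monitoring runs forever.
    """
    if config.get("daemon_mode", False):
        return math.inf
    return config.get("iterations", 0)


def status_endpoint(config, host="localhost"):
    """URL where the status would be published, or None when publishing is off.

    The host argument is an assumption; the config only defines the port.
    """
    if not config.get("cerberus_publish_status", False):
        return None
    return f"http://{host}:{config.get('port', 8080)}"
```

For the settings shown above, this reading gives an unbounded number of iterations and an endpoint on port 8080, which matches the log output you will see when Cerberus starts.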
At this point you can run Cerberus manually and test it out as follows:
$ sudo python3 /opt/cerberus/start_cerberus.py --config /opt/cerberus/config/config-home.yaml
___ ___ _ __| |__ ___ _ __ _ _ ___
/ __/ _ \ '__| '_ \ / _ \ '__| | | / __|
| (_| __/ | | |_) | __/ | | |_| \__ \
\___\___|_| |_.__/ \___|_| \__,_|___/
2021-01-29 12:01:01,030 [INFO] Starting ceberus
2021-01-29 12:01:01,037 [INFO] Initializing client to talk to the Kubernetes cluster
2021-01-29 12:01:01,144 [INFO] Fetching cluster info
2021-01-29 12:01:01,260 [INFO]
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.6.12 True False 3d20h Cluster version is 4.6.12
2021-01-29 12:01:01,365 [INFO] Kubernetes master is running at https://api.home.ocplab.com:6443
2021-01-29 12:01:01,365 [INFO] Publishing cerberus status at http://0.0.0.0:8080
2021-01-29 12:01:01,381 [INFO] Starting http server at http://0.0.0.0:8080
2021-01-29 12:01:01,623 [INFO] Daemon mode enabled, cerberus will monitor forever
2021-01-29 12:01:01,623 [INFO] Ignoring the iterations set
2021-01-29 12:01:01,955 [INFO] Iteration 1: Node status: True
2021-01-29 12:01:02,244 [INFO] Iteration 1: Cluster Operator status: True
2021-01-29 12:01:02,380 [INFO] Iteration 1: openshift-ingress: True
2021-01-29 12:01:02,392 [INFO] Iteration 1: openshift-apiserver: True
2021-01-29 12:01:02,396 [INFO] Iteration 1: openshift-sdn: True
2021-01-29 12:01:02,399 [INFO] Iteration 1: openshift-kube-scheduler: True
2021-01-29 12:01:02,400 [INFO] Iteration 1: openshift-machine-api: True
2021-01-29 12:01:02,406 [INFO] Iteration 1: openshift-kube-controller-manager: True
2021-01-29 12:01:02,425 [INFO] Iteration 1: openshift-etcd: True
2021-01-29 12:01:02,443 [INFO] Iteration 1: openshift-monitoring: True
2021-01-29 12:01:02,445 [INFO] Iteration 1: openshift-kube-apiserver: True
2021-01-29 12:01:02,446 [INFO] HTTP requests served: 0
2021-01-29 12:01:02,446 [WARNING] Iteration 1: Masters without NoSchedule taint: ['home-jcn2d-master-0', 'home-jcn2d-master-1', 'home-jcn2d-master-2']
2021-01-29 12:01:02,592 [INFO] 
2021-01-29 12:01:02,592 [INFO] Sleeping for the specified duration: 60
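If you ever want to pick the failing components out of output like this, a small parser is enough. The sketch below matches the "Iteration N: component: True" lines shown above and assumes that failures are logged in the same shape with False, which the sample output does not actually show. The function name is hypothetical.

```python
import re

# Matches lines such as:
# 2021-01-29 12:01:02,380 [INFO] Iteration 1: openshift-ingress: True
ITERATION_LINE = re.compile(
    r"\[INFO\] Iteration (?P<iteration>\d+): (?P<component>.+): (?P<status>True|False)$"
)


def failed_components(log_text):
    """Return (iteration, component) pairs for every check logged as False."""
    failures = []
    for line in log_text.splitlines():
        match = ITERATION_LINE.search(line.strip())
        if match and match.group("status") == "False":
            failures.append((int(match.group("iteration")), match.group("component")))
    return failures
```

Lines that do not follow this pattern, such as the warning about masters without the NoSchedule taint, are simply skipped.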
Great, Cerberus is up and running now, but wouldn’t it be great if it would run automatically as a service? Let’s go ahead and set that up by creating a systemd service. First let’s set up a bash script called start.sh in the root of our cerberus directory as follows:
#!/bin/bash
echo "Starting Cerberus..."
python3 /opt/cerberus/start_cerberus.py --config /opt/cerberus/config/my-config.yaml
Next, create a systemd service at /etc/systemd/system/cerberus.service and add the following to it:
Description=Cerberus OpenShift Health Check
To enable the service and start Cerberus, use the following commands:
systemctl enable cerberus.service
systemctl start cerberus.service
Check the status of the service after starting it. If the service failed, you may need to delete the cerberus files in /tmp that were created when you ran it manually earlier. You can also check the endpoint at http://localhost:8080 to see the result it returns, which is a simple text string with either “True” or “False”.
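If you would rather check the endpoint from a script than from a browser, something like the following minimal sketch works. It assumes the endpoint returns exactly the text “True” or “False” as described above and treats anything else as an error. The function names and the 5 second timeout are my own choices, not part of Cerberus.

```python
import urllib.request


def parse_status(body):
    """Convert the plain-text Cerberus response into a boolean."""
    text = body.strip()
    if text == "True":
        return True
    if text == "False":
        return False
    raise ValueError(f"unexpected Cerberus response: {text!r}")


def cluster_is_healthy(url="http://localhost:8080", timeout=5):
    """Fetch the published status and return it as a boolean.

    The default URL matches the port set in the config file above; the
    timeout value is an arbitrary example.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return parse_status(response.read().decode("utf-8"))
```

Calling cluster_is_healthy() on the host running the service returns the current status. A connection error from urllib most likely means the service itself is not running.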
At this point we can add our monitor to UptimeRobot, assuming the Cerberus port is exposed to the internet. Below is an image of my monitor configuration:
And there you have it: you should start seeing the results in your status page, as per the screenshot at the top of the page.