{"id":465,"date":"2021-02-01T17:28:00","date_gmt":"2021-02-01T17:28:00","guid":{"rendered":"http:\/\/gexperts.com\/wp\/?p=465"},"modified":"2021-02-01T17:28:00","modified_gmt":"2021-02-01T17:28:00","slug":"building-a-simple-up-down-status-dashboard-for-openshift","status":"publish","type":"post","link":"https:\/\/gexperts.com\/wp\/building-a-simple-up-down-status-dashboard-for-openshift\/","title":{"rendered":"Building a Simple Up\/Down Status Dashboard for OpenShift"},"content":{"rendered":"<p>OpenShift provides a wealth of monitoring and alerts however sometimes it can be handy to surface a simple up\/down signal for an OpenShift cluster that can be easily interpreted by tools like UptimeRobot. This enables you to provide an operational or business level <a href=\"https:\/\/stats.uptimerobot.com\/XqRRounE0\" rel=\"noopener\" target=\"_blank\">dashboard<\/a> of the status of your cluster to users and application owners that may not necessarily familiar with all of the nuances of OpenShift&#8217;s or Kubernete&#8217;s internals.<\/p>\n<p><a href=\"http:\/\/gexperts.com\/wp\/wp-content\/uploads\/2021\/01\/uptimerobot.png\"><img loading=\"lazy\" decoding=\"async\" src=\"http:\/\/gexperts.com\/wp\/wp-content\/uploads\/2021\/01\/uptimerobot-1024x695.png\" alt=\"UptimeRobot Dashboard\" width=\"584\" height=\"396\" class=\"aligncenter size-large wp-image-466\" srcset=\"https:\/\/gexperts.com\/wp\/wp-content\/uploads\/2021\/01\/uptimerobot-1024x695.png 1024w, https:\/\/gexperts.com\/wp\/wp-content\/uploads\/2021\/01\/uptimerobot-300x203.png 300w, https:\/\/gexperts.com\/wp\/wp-content\/uploads\/2021\/01\/uptimerobot-768x521.png 768w, https:\/\/gexperts.com\/wp\/wp-content\/uploads\/2021\/01\/uptimerobot-442x300.png 442w, https:\/\/gexperts.com\/wp\/wp-content\/uploads\/2021\/01\/uptimerobot.png 1063w\" sizes=\"auto, (max-width: 584px) 100vw, 584px\" \/><\/a><\/p>\n<p>The health status of an OpenShift cluster depends on many things such as etcd, operators, nodes, api, etc 
so how do we aggregate all of this information? While you could certainly run your own code to do it, fortunately a cool utility called <a href=\"https:\/\/github.com\/cloud-bulldozer\/cerberus\" rel=\"noopener\" target=\"_blank\">Cerberus<\/a> already provides this capability. Cerberus was born out of Red Hat&#8217;s Performance and Scaling group and was designed to be used with Kraken, a chaos engineering tool. A chaos engineering tool isn&#8217;t very useful if you can&#8217;t determine the status of the cluster, and thus Cerberus was born.<\/p>\n<p>A number of blog posts have already been written about Kraken and Cerberus from a chaos engineering point of view, which you can view <a href=\"https:\/\/www.openshift.com\/blog\/introduction-to-kraken-a-chaos-tool-for-openshift\/kubernetes\" rel=\"noopener\" target=\"_blank\">here<\/a> and <a href=\"https:\/\/www.openshift.com\/blog\/openshift-scale-ci-part-4-introduction-to-cerberus-guardian-of-kubernetes\/openshift-clouds\" rel=\"noopener\" target=\"_blank\">here<\/a>. Here we are going to focus on the basics of using it for simple health checking.<\/p>\n<p>One thing to note about Cerberus is that it is aggressive about returning an unhealthy state even if the cluster is operational. For example, if you set it to monitor a namespace, any pod failures in a multi-pod deployment in that namespace will trigger an unhealthy flag even if the other pods in the deployment are running and still servicing requests. As a result, some tuning of Cerberus or the utilization of custom checks is required if you want to use it for a more SLA-focused view.<\/p>\n<p>To get started with Cerberus, simply clone the git repo into an appropriate directory on the system where you want to run it. While you can run it inside of OpenShift, it is highly recommended to run it outside the cluster since your cluster monitoring tool should not be dependent on the cluster itself. 
To clone the repo, just run the following:<\/p>\n<pre lang=\"bash\">\r\ngit clone https:\/\/github.com\/cloud-bulldozer\/cerberus\r\n<\/pre>\n<p>In order for Cerberus to run, it requires access to a kubeconfig file where a user has already been authenticated. For security purposes, I would highly recommend using a serviceaccount with the cluster-reader role rather than using a user with cluster-admin. The commands below will create a serviceaccount in the openshift-monitoring namespace, bind it to the cluster-reader role and generate a kubeconfig that Cerberus can use to authenticate to the cluster.<\/p>\n<pre lang=\"bash\">\r\noc create sa cerberus -n openshift-monitoring\r\noc adm policy add-cluster-role-to-user cluster-reader -z cerberus -n openshift-monitoring\r\noc serviceaccounts create-kubeconfig cerberus -n openshift-monitoring > config\/kubeconfig\r\n<\/pre>\n<p>Cerberus can automatically create a token for the prometheus-k8s service account so that it can access Prometheus to pull metrics. To enable this, we need to define a role that gives Cerberus the necessary permissions and bind it to the cerberus service account. 
Create a file with the following content:<\/p>\n<pre lang=\"yaml\">\r\n---\r\napiVersion: rbac.authorization.k8s.io\/v1\r\nkind: Role\r\nmetadata:\r\n  name: cerberus\r\n  namespace: openshift-monitoring\r\nrules:\r\n  - apiGroups:\r\n      - \"\"\r\n    resources:\r\n      - serviceaccounts\r\n      - secrets\r\n    verbs:\r\n      - get\r\n      - list\r\n      - patch\r\n---\r\napiVersion: rbac.authorization.k8s.io\/v1\r\nkind: RoleBinding\r\nmetadata:\r\n  name: cerberus-service-account-token\r\n  namespace: openshift-monitoring  \r\nroleRef:\r\n  apiGroup: rbac.authorization.k8s.io\r\n  kind: Role\r\n  name: cerberus\r\nsubjects:\r\n  - kind: ServiceAccount\r\n    name: cerberus\r\n    namespace: openshift-monitoring\r\n<\/pre>\n<p>And then apply it with &#8220;oc apply -f&#8221; to the cluster.<\/p>\n<p>To configure Cerberus you can edit the existing <a href=\"https:\/\/github.com\/cloud-bulldozer\/cerberus\/blob\/master\/config\/config.yaml\" rel=\"noopener\" target=\"_blank\">config.yaml<\/a> file in the repo or create a new one; creating a new one is highly recommended so that a git pull doesn&#8217;t clobber your changes:<\/p>\n<pre lang=\"bash\">\r\ncp config\/config.yaml config\/my-config.yaml\r\n<\/pre>\n<p>Once you have the config file, you can go through the options and set what you need. 
Here is an example of my config file, which is really just the sample config with the kubeconfig_path parameter tweaked.<\/p>\n<pre lang=\"yaml\">\r\ncerberus:\r\n    distribution: openshift                              # Distribution can be kubernetes or openshift\r\n    kubeconfig_path: \/opt\/cerberus\/config\/kubeconfig     # Path to kubeconfig\r\n    port: 8080                                           # http server port where cerberus status is published\r\n    watch_nodes: True                                    # Set to True for the cerberus to monitor the cluster nodes\r\n    watch_cluster_operators: True                        # Set to True for cerberus to monitor cluster operators\r\n    watch_url_routes:                                    # Route url's you want to monitor, this is a double array with the url and optional authorization parameter\r\n    watch_namespaces:                                    # List of namespaces to be monitored\r\n        -    openshift-etcd\r\n        -    openshift-apiserver\r\n        -    openshift-kube-apiserver\r\n        -    openshift-monitoring\r\n        -    openshift-kube-controller-manager\r\n        -    openshift-machine-api\r\n        -    openshift-kube-scheduler\r\n        -    openshift-ingress\r\n        -    openshift-sdn                               # When enabled, it will check for the cluster sdn and monitor that namespace\r\n    cerberus_publish_status: True                        # When enabled, cerberus starts a lightweight http server and publishes the status\r\n    inspect_components: False                            # Enable it only when OpenShift client is supported to run\r\n                                                         # When enabled, cerberus collects logs, events and metrics of failed components\r\n\r\n    prometheus_url:                                      # The prometheus url\/route is automatically obtained in case of OpenShift, please set it when the distribution is 
Kubernetes.\r\n    prometheus_bearer_token:                             # The bearer token is automatically obtained in case of OpenShift, please set it when the distribution is Kubernetes. This is needed to authenticate with prometheus.\r\n                                                         # This enables Cerberus to query prometheus and alert on observing high Kube API Server latencies. \r\n\r\n    slack_integration: False                             # When enabled, cerberus reports the failed iterations in the slack channel\r\n                                                         # The following env vars need to be set: SLACK_API_TOKEN ( Bot User OAuth Access Token ) and SLACK_CHANNEL ( channel to send notifications in case of failures )\r\n                                                         # When slack_integration is enabled, a watcher can be assigned for each day. The watcher of the day is tagged while reporting failures in the slack channel. Values are slack member ID's.\r\n    watcher_slack_ID:                                        # (NOTE: Defining the watcher id's is optional and when the watcher slack id's are not defined, the slack_team_alias tag is used if it is set else no tag is used while reporting failures in the slack channel.)\r\n        Monday:\r\n        Tuesday:\r\n        Wednesday:\r\n        Thursday:\r\n        Friday:\r\n        Saturday:\r\n        Sunday:\r\n    slack_team_alias:                                    # The slack team alias to be tagged while reporting failures in the slack channel when no watcher is assigned\r\n\r\n    custom_checks:                                       # Relative paths of files containing additional user-defined checks\r\n\r\ntunings:\r\n    iterations: 5                                        # Iterations to loop before stopping the watch, it will be replaced with infinity when the daemon mode is enabled\r\n    sleep_time: 60                                       # Sleep duration between 
each iteration\r\n    kube_api_request_chunk_size: 250                     # Large requests will be broken into the specified chunk size to reduce the load on API server and improve responsiveness.\r\n    daemon_mode: True                                    # Iterations are set to infinity which means that the cerberus will monitor the resources forever\r\n    cores_usage_percentage: 0.5                          # Set the fraction of cores to be used for multiprocessing\r\n\r\ndatabase:\r\n    database_path: \/tmp\/cerberus.db                      # Path where cerberus database needs to be stored\r\n    reuse_database: False                                # When enabled, the database is reused to store the failures\r\n<\/pre>\n<p>At this point you can run Cerberus manually and test it out as follows:<\/p>\n<pre>\r\n$ sudo python3 \/opt\/cerberus\/start_cerberus.py --config \/opt\/cerberus\/config\/config-home.yaml\r\n               _                         \r\n  ___ ___ _ __| |__   ___ _ __ _   _ ___ \r\n \/ __\/ _ \\ '__| '_ \\ \/ _ \\ '__| | | \/ __|\r\n| (_|  __\/ |  | |_) |  __\/ |  | |_| \\__ \\\r\n \\___\\___|_|  |_.__\/ \\___|_|   \\__,_|___\/\r\n                                         \r\n\r\n2021-01-29 12:01:01,030 [INFO] Starting ceberus\r\n2021-01-29 12:01:01,037 [INFO] Initializing client to talk to the Kubernetes cluster\r\n2021-01-29 12:01:01,144 [INFO] Fetching cluster info\r\n2021-01-29 12:01:01,260 [INFO] \r\nNAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS\r\nversion   4.6.12    True        False         3d20h   Cluster version is 4.6.12\r\n\r\n2021-01-29 12:01:01,365 [INFO] Kubernetes master is running at https:\/\/api.home.ocplab.com:6443\r\n\r\n2021-01-29 12:01:01,365 [INFO] Publishing cerberus status at http:\/\/0.0.0.0:8080\r\n2021-01-29 12:01:01,381 [INFO] Starting http server at http:\/\/0.0.0.0:8080\r\n\r\n2021-01-29 12:01:01,623 [INFO] Daemon mode enabled, cerberus will monitor forever\r\n2021-01-29 12:01:01,623 [INFO] Ignoring 
the iterations set\r\n\r\n2021-01-29 12:01:01,955 [INFO] Iteration 1: Node status: True\r\n2021-01-29 12:01:02,244 [INFO] Iteration 1: Cluster Operator status: True\r\n2021-01-29 12:01:02,380 [INFO] Iteration 1: openshift-ingress: True\r\n2021-01-29 12:01:02,392 [INFO] Iteration 1: openshift-apiserver: True\r\n2021-01-29 12:01:02,396 [INFO] Iteration 1: openshift-sdn: True\r\n2021-01-29 12:01:02,399 [INFO] Iteration 1: openshift-kube-scheduler: True\r\n2021-01-29 12:01:02,400 [INFO] Iteration 1: openshift-machine-api: True\r\n2021-01-29 12:01:02,406 [INFO] Iteration 1: openshift-kube-controller-manager: True\r\n2021-01-29 12:01:02,425 [INFO] Iteration 1: openshift-etcd: True\r\n2021-01-29 12:01:02,443 [INFO] Iteration 1: openshift-monitoring: True\r\n2021-01-29 12:01:02,445 [INFO] Iteration 1: openshift-kube-apiserver: True\r\n2021-01-29 12:01:02,446 [INFO] HTTP requests served: 0 \r\n\r\n2021-01-29 12:01:02,446 [WARNING] Iteration 1: Masters without NoSchedule taint: ['home-jcn2d-master-0', 'home-jcn2d-master-1', 'home-jcn2d-master-2']\r\n\r\n2021-01-29 12:01:02,592 [INFO] []\r\n\r\n2021-01-29 12:01:02,592 [INFO] Sleeping for the specified duration: 60\r\n<\/pre>\n<p>Great, Cerberus is up and running, but wouldn&#8217;t it be great if it ran automatically as a service? Let&#8217;s go ahead and set that up by creating a systemd service. 
First, let&#8217;s set up a bash script called start.sh in the root of our cerberus directory as follows:<\/p>\n<pre lang=\"bash\">\r\n#!\/bin\/bash\r\n\r\necho \"Starting Cerberus...\"\r\n\r\npython3 \/opt\/cerberus\/start_cerberus.py --config \/opt\/cerberus\/config\/my-config.yaml\r\n<\/pre>\n<p>Next, create a systemd service at \/etc\/systemd\/system\/cerberus.service and add the following to it:<\/p>\n<pre>\r\n[Unit]\r\nDescription=Cerberus OpenShift Health Check\r\n\r\nWants=network.target\r\nAfter=syslog.target network-online.target\r\n\r\n[Service]\r\nType=simple\r\nExecStart=\/bin\/bash \/opt\/cerberus\/start.sh\r\nRestart=on-failure\r\nRestartSec=10\r\nKillMode=control-group\r\n\r\n[Install]\r\nWantedBy=multi-user.target\r\n<\/pre>\n<p>To have the service run Cerberus, use the following commands:<\/p>\n<pre lang=\"bash\">\r\nsystemctl daemon-reload\r\nsystemctl enable cerberus.service\r\nsystemctl start cerberus.service\r\n<\/pre>\n<p>Check the status of the service after starting it; if the service failed, you may need to delete the Cerberus files in \/tmp that were created when it was run manually earlier. You can also check the endpoint at http:\/\/localhost:8080 to see the result it returns, which is a simple text string with either &#8220;True&#8221; or &#8220;False&#8221;.<\/p>\n<p>At this point we can add our monitor to UptimeRobot, assuming the Cerberus port is exposed to the internet. 
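If you want to script your own check against that endpoint instead of (or in addition to) UptimeRobot, it is just a matter of interpreting the plain-text "True"/"False" body. Here is a minimal sketch; it is not part of the Cerberus repo, and it assumes you fetch the body yourself (for example with curl -s http://localhost:8080):

```shell
#!/bin/sh
# Minimal sketch (not from the Cerberus repo): map the plain-text
# "True"/"False" body published by the Cerberus status endpoint to a
# human-readable message. Fetch the body first, e.g.:
#   status=$(curl -s http://localhost:8080)
interpret_status() {
    case "$1" in
        True)  echo "cluster healthy" ;;
        False) echo "cluster unhealthy" ;;
        *)     echo "cerberus unreachable or returned unexpected output" ;;
    esac
}

interpret_status "True"    # prints "cluster healthy"
```

A keyword monitor in UptimeRobot achieves the same effect by alerting whenever the response body does not contain the string "True".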
Below is an image of my monitor configuration:<\/p>\n<p><a href=\"http:\/\/gexperts.com\/wp\/wp-content\/uploads\/2021\/01\/uptime-monitor.png\"><img loading=\"lazy\" decoding=\"async\" src=\"http:\/\/gexperts.com\/wp\/wp-content\/uploads\/2021\/01\/uptime-monitor-1024x467.png\" alt=\"\" width=\"584\" height=\"266\" class=\"aligncenter size-large wp-image-477\" srcset=\"https:\/\/gexperts.com\/wp\/wp-content\/uploads\/2021\/01\/uptime-monitor-1024x467.png 1024w, https:\/\/gexperts.com\/wp\/wp-content\/uploads\/2021\/01\/uptime-monitor-300x137.png 300w, https:\/\/gexperts.com\/wp\/wp-content\/uploads\/2021\/01\/uptime-monitor-768x350.png 768w, https:\/\/gexperts.com\/wp\/wp-content\/uploads\/2021\/01\/uptime-monitor-500x228.png 500w, https:\/\/gexperts.com\/wp\/wp-content\/uploads\/2021\/01\/uptime-monitor.png 1182w\" sizes=\"auto, (max-width: 584px) 100vw, 584px\" \/><\/a><\/p>\n<p>And there you have it: you should start seeing results on your status page, as in the screenshot at the top of this post.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>OpenShift provides a wealth of monitoring and alerts; however, sometimes it can be handy to surface a simple up\/down signal for an OpenShift cluster that can be easily interpreted by tools like UptimeRobot. 
This enables you to provide an operational &hellip; <a href=\"https:\/\/gexperts.com\/wp\/building-a-simple-up-down-status-dashboard-for-openshift\/\">Continue reading <span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"footnotes":""},"categories":[12],"tags":[],"class_list":["post-465","post","type-post","status-publish","format-standard","hentry","category-openshift"],"_links":{"self":[{"href":"https:\/\/gexperts.com\/wp\/wp-json\/wp\/v2\/posts\/465","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/gexperts.com\/wp\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/gexperts.com\/wp\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/gexperts.com\/wp\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/gexperts.com\/wp\/wp-json\/wp\/v2\/comments?post=465"}],"version-history":[{"count":22,"href":"https:\/\/gexperts.com\/wp\/wp-json\/wp\/v2\/posts\/465\/revisions"}],"predecessor-version":[{"id":489,"href":"https:\/\/gexperts.com\/wp\/wp-json\/wp\/v2\/posts\/465\/revisions\/489"}],"wp:attachment":[{"href":"https:\/\/gexperts.com\/wp\/wp-json\/wp\/v2\/media?parent=465"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/gexperts.com\/wp\/wp-json\/wp\/v2\/categories?post=465"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/gexperts.com\/wp\/wp-json\/wp\/v2\/tags?post=465"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}