Enabling metrics in RH-SSO

Red Hat’s productized version of Keycloak is Red Hat Single Sign-On (RH-SSO). If you are not familiar with Keycloak, it is a popular open source identity and access management project. RH-SSO is a core piece of application infrastructure at many organizations, and monitoring it effectively is critical to ensuring service goals are being met.

Out of the box, the RH-SSO 7.4 image exposes Prometheus metrics, however these metrics are for the underlying JBoss EAP platform that RH-SSO runs on rather than Keycloak-specific metrics. While these low-level JBoss EAP metrics are very useful and we definitely want to capture them, wouldn’t it be great if we could also get higher-level metrics on the number of logins, failed logins, client logins, etc from Keycloak?

This is where the community Aerogear Keycloak Metrics SPI project comes into play: it is a Keycloak extension that provides these metrics by leveraging Keycloak’s eventing capabilities. Using this extension with RH-SSO, while not directly supported by Red Hat, is easy and straightforward. Note that this article was written using RH-SSO 7.4; your mileage may vary on other versions but conceptually the process should be the same.

The first order of business is to create a container image that deploys the Aerogear extension. Here is the Containerfile that I am using:

FROM registry.redhat.io/rh-sso-7/sso74-openshift-rhel8:latest
 
ARG aerogear_version=2.5.0
 
RUN cd /opt/eap/standalone/deployments && \
    curl -LO https://github.com/aerogear/keycloak-metrics-spi/releases/download/${aerogear_version}/keycloak-metrics-spi-${aerogear_version}.jar && \
    touch keycloak-metrics-spi-${aerogear_version}.jar.dodeploy && \
    cd -

This Containerfile references the default rh-sso image from Red Hat and then downloads and installs the Aerogear SPI extension. I expect that many organizations using RH-SSO have already created their own image to support themes and other extensions; you can either put your own image in the FROM block or simply incorporate the above into your own Containerfile. Once you have created the custom image you can deploy it into your OpenShift cluster.
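
For reference, building and pushing the custom image can be done with podman; the registry and tag below are placeholders for your own:

podman build -t quay.io/myorg/sso74-metrics:latest -f Containerfile .
podman push quay.io/myorg/sso74-metrics:latest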

NOTE: Currently this metrics SPI exposes the Keycloak metrics on the default https port with no authentication, which is a significant security concern as documented here. There is a pull request (PR) in progress to mitigate this in OpenShift here; I will update this blog once the PR is merged.

One other thing that needs to be done as part of the deployment is exposing the EAP metrics, because we want to capture them as well. By default RH-SSO exposes these metrics on the management port, which only binds to localhost, thereby preventing Prometheus from scraping them. In order to enable Prometheus to scrape these metrics you will need to bind the management port to all IP addresses (0.0.0.0) so it can be read from the pod IP. To do this, add -Djboss.bind.address.management=0.0.0.0 to the existing JAVA_OPTS_APPEND environment variable for the Deployment or StatefulSet you are using to deploy RH-SSO. If the variable doesn’t exist, just add it.
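
As a sketch, assuming the RH-SSO StatefulSet has a container named sso (adjust the names and merge with any options you already have), the environment variable would look something like this:

spec:
  template:
    spec:
      containers:
      - name: sso
        env:
        # Append to any options already present in JAVA_OPTS_APPEND
        - name: JAVA_OPTS_APPEND
          value: "-Djboss.bind.address.management=0.0.0.0"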

Once the SPI is deployed you then need to configure the realms you want to monitor to route events to the metrics-listener. To do this go to Manage > Events > Config and make the change to Event Listeners as per the screenshot below, being careful not to delete existing listeners.

This needs to be done on every realm for which you want to track metrics.

Once you have the SPI deployed and have added the event listener to the realms to be monitored, you are ready to validate that it is working. The SPI exposes metrics by appending /metrics to each realm URL. For example, to view the metrics for the master realm you would use the path /auth/realms/master/metrics. To test the metrics, rsh into one of the SSO pods and run the following two curl commands:

# Test keycloak metrics for master realm on pod IP
$ curl -k https://$(hostname -i):8443/auth/realms/master/metrics
 
# HELP keycloak_user_event_CLIENT_REGISTER_ERROR Generic KeyCloak User event
# TYPE keycloak_user_event_CLIENT_REGISTER_ERROR counter
# HELP keycloak_user_event_INTROSPECT_TOKEN Generic KeyCloak User event
...
 
# Test EAP metrics on pod IP
curl http://$(hostname -i):9990/metrics
 
# HELP base_cpu_processCpuLoad Displays the "recent cpu usage" for the Java Virtual Machine process.
# TYPE base_cpu_processCpuLoad gauge
base_cpu_processCpuLoad 0.009113504556752278
...

If everything worked you should see a lot of output after each curl command, with the first few lines being similar to the outputs shown. Now comes the next step: having Prometheus scrape this data. In this blog I am using OpenShift’s User Workload Monitoring feature that I blogged about here, so I will not go into the intricacies of setting up the Prometheus operator again.
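
If you have not yet enabled user workload monitoring, on recent OpenShift 4 releases it boils down to a ConfigMap along these lines (the exact flag has changed between versions, so check the documentation for yours):

apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    enableUserWorkload: true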

To configure scraping of the EAP metrics we define a PodMonitor, since this port isn’t typically defined in the SSO service. For my deployment the PodMonitor appears as follows:

apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: eap
spec:
  selector:
    matchLabels:
      app: sso
  podMetricsEndpoints:
  - targetPort: 9990

Note that my deployment of SSO has the pods labelled app: sso; make sure to update the selector above to match a label on your SSO pods. After that we define a ServiceMonitor to scrape the Aerogear Keycloak SPI metrics:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: keycloak
spec:
  jobLabel: keycloak
  selector:
    matchLabels:
      app: sso
  endpoints:
  - port: keycloak
    path: /auth/realms/master/metrics
    scheme: https
    tlsConfig:
      insecureSkipVerify: true
    relabelings:
    - targetLabel: job
      replacement: keycloak
    - targetLabel: provider
      replacement: keycloak
    - targetLabel: instance
      replacement: sso
  - port: keycloak
    path: /auth/realms/openshift/metrics
    scheme: https
    tlsConfig:
      insecureSkipVerify: true
    relabelings:
    - targetLabel: job
      replacement: keycloak
    - targetLabel: provider
      replacement: keycloak
    - targetLabel: instance
      replacement: sso
  - port: keycloak
    path: /auth/realms/3scale/metrics
    scheme: https
    tlsConfig:
      insecureSkipVerify: true
    relabelings:
    - targetLabel: job
      replacement: keycloak
    - targetLabel: provider
      replacement: keycloak
    - targetLabel: instance
      replacement: sso

A couple of items to note here. First, be aware that each realm’s metrics are on a separate path, so multiple endpoints must be defined, one per realm. Second, my SSO deployment is set to re-encrypt and is using a self-signed certificate at the service level; as a result we need to set insecureSkipVerify to true, otherwise Prometheus will not scrape it due to an invalid certificate. Similar to the PodMonitor, update the selector to match labels on your service.

I’m using relabelings to set various labels that will appear with the metrics. This is needed because the Grafana dashboard I am using from the Grafana library expects certain labels like job and provider to be set to keycloak, otherwise its queries will not find the metrics. Setting these labels here is easier than modifying the dashboard. Finally, I set the instance label to sso; if you don’t set this, the instance label will default to the IP and port, so this is a friendlier way of presenting it.

At this point we can deploy some Grafana dashboards. Again, I covered deploying and connecting Grafana to the cluster monitoring in a previous article so will not be covering it again. To deploy the Keycloak dashboard we can reference the existing one in the Grafana library in a dashboard object as follows:

apiVersion: integreatly.org/v1alpha1
kind: GrafanaDashboard
metadata:
  name: sso-dashboard
  labels:
    app: grafana
spec:
  url: https://grafana.com/api/dashboards/10441/revisions/1/download
  datasources:
  - inputName: "DS_PROMETHEUS"
    datasourceName: "Prometheus"
  plugins:
    - name: grafana-piechart-panel
      version: 1.3.9

When rendered the dashboard appears as follows; note that my environment is not under load so it’s not quite as interesting as it would be in a real production environment.

Keycloak Dashboard

You can see that various metrics around heap, logins and login failures, as well as other statistics, are presented, making it easier to understand what is happening with your SSO installation at any given time.

Next we do the same thing to create an EAP dashboard so we can visualize the EAP metrics:

apiVersion: integreatly.org/v1alpha1
kind: GrafanaDashboard
metadata:
  name: eap-dashboard
  labels:
    app: grafana
spec:
  url: https://grafana.com/api/dashboards/10313/revisions/1/download
  datasources:
  - inputName: "DS_PROMETHEUS"
    datasourceName: "Prometheus"

And here is the EAP dashboard in all its glory:

JBoss EAP Dashboard

The dashboard displays detailed metrics on the JVM heap status, but you can also monitor other EAP platform components like databases and caches by customizing the dashboard. One of the benefits of Grafana is that it enables you to design dashboards that make the most sense for your specific use case and organization. You can start with an off-the-shelf dashboard and then modify it as needed to get the visualization that is required.

RH-SSO is a key infrastructure component for many organizations and monitoring it effectively is important to ensure that SLAs and performance expectations are being met. Hopefully this article will provide a starting point for your organization to define and create a monitoring strategy around RH-SSO.

GitOps and OpenShift Operators Best Practices

In OpenShift, Operators are typically installed through the Operator Lifecycle Manager (OLM), which provides a great user interface and experience. Unfortunately OLM was really designed around a UI experience, and as a result when moving to a GitOps approach there are a few things to be aware of in order to get the best outcomes. The purpose of this blog is to outline a handful of best practices that we’ve found after doing this for a while, so without further ado here is the list:

1. Omit startingCSV in Subscriptions

When bringing an operator into GitOps, it’s pretty common to install the operator manually and then extract the yaml for the subscription and push it into a git repo. The result often looks like this example:

apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  labels:
    operators.coreos.com/amq7-cert-manager-operator.openshift-operators: ""
  name: amq7-cert-manager-operator
  namespace: openshift-operators
spec:
  channel: 1.x
  installPlanApproval: Automatic
  name: amq7-cert-manager-operator
  source: redhat-operators
  sourceNamespace: openshift-marketplace
  startingCSV: amq7-cert-manager.v1.0.1

OLM will automatically populate the startingCSV for you, which represents the specific version of the operator that you want to install. The problem with this is that operator versions change regularly with updates, meaning that every time the version changes you will need to update it in the git repo. The majority of the time we simply want to consume the latest and greatest operator; omitting the startingCSV accomplishes that goal and greatly reduces the maintenance required for the subscription yaml.
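
For reference, the same Subscription with startingCSV omitted simply tracks the latest version available in the channel:

apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: amq7-cert-manager-operator
  namespace: openshift-operators
spec:
  channel: 1.x
  installPlanApproval: Automatic
  name: amq7-cert-manager-operator
  source: redhat-operators
  sourceNamespace: openshift-marketplace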

Of course if you have a requirement to install a very specific version of the operator, by all means include it; however in my experience this requirement tends to be rare.

2. Create OperatorGroup with namespaces

An OperatorGroup, to quote the documentation, “provides multitenant configuration to OLM-installed Operators”. Every time you install an operator there must be one and only one OperatorGroup in the namespace. Some default namespaces, like openshift-operators, have an OperatorGroup out of the box and you do not need to create a new one from GitOps. However if you want to install operators into your own namespaces you will need to have an OperatorGroup.

When using kustomize there is a temptation to bundle the OperatorGroup with a Subscription. This should be avoided because if you want to install multiple operators, say Prometheus and Grafana, in the same namespace you will end up creating multiple OperatorGroups, which prevents the operators from installing.

As a result, when I need to install operators via GitOps I much prefer creating the OperatorGroup as part of the same kustomize folder where I’m creating the namespace. This allows me to aggregate multiple operators across different bases without getting into OperatorGroup confusion.
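
As a sketch, an OperatorGroup scoped to a single namespace (the namespace name here is just a placeholder) is only a few lines and sits naturally next to the Namespace definition:

apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: my-operators
  namespace: my-operators
spec:
  targetNamespaces:
  - my-operators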

3. Omit the olm.providedAPIs annotation in OperatorGroup

Similar to startingCSV, when manually installing operators you will notice that OLM populates an annotation called olm.providedAPIs. Since OLM will populate it automatically, there is no need to include it in the yaml in git, as it becomes one more element that you would need to maintain.

4. Prefer manual installation mode

When installing an operator via OLM you can choose to install it in manual or automatic mode. For production clusters you should prefer the manual installation mode in order to control when operator upgrades happen. Unfortunately, when using manual mode OLM requires you to approve the initial installation of the operator. While this is easy to do in the console UI, it’s a little more challenging with a GitOps tool.

Fortunately my colleague Andrew Pitt has you covered and wrote an excellent tool to handle this: installplan-approver. This is a Kubernetes job that you can deploy alongside the operator Subscription; it watches for the InstallPlan that OLM creates and automatically approves it. This gives you the desired workflow of automatic installation but manual approval of upgrades.

Since this runs as a Kubernetes job it only runs once and will not accidentally approve upgrades. In other words, subsequent synchronizations from a GitOps tool like Argo CD will not cause the job to run again since, from the GitOps tool’s perspective, the job already exists and is synchronized.
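
I won’t reproduce Andrew’s tool here, but conceptually the approval step boils down to something like the following sketch (the namespace and subscription names are placeholders, and the real tool is more robust than this):

# Wait for OLM to create the InstallPlan referenced by the Subscription, then approve it
until oc get subscription my-operator -n my-namespace -o jsonpath='{.status.installPlanRef.name}' | grep -q .; do
  sleep 5
done
IP=$(oc get subscription my-operator -n my-namespace -o jsonpath='{.status.installPlanRef.name}')
oc patch installplan $IP -n my-namespace --type merge --patch '{"spec":{"approved":true}}'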

5. Check out the operators already available in the Red Hat COP gitops-catalog

Instead of reinventing the wheel, check out the operators that have already been made available for GitOps in the Red Hat Community of Practice (COP) gitops-catalog. This catalog has a number of commonly used operators already available for use with OpenShift GitOps (OpenShift Pipelines, OpenShift GitOps, Service Mesh, Logging and more). While this catalog is not officially supported by Red Hat, it provides a starting point for you to create your own in-house catalog while benefiting from the work of others.

Well that’s it for now; if you have more best practices feel free to add them in the comments.

Deploying Red Hat Advanced Cluster Security (aka Stackrox) with GitOps

I’ve been running Red Hat Advanced Cluster Security (RHACS) in my personal cluster via the stackrox helm chart for quite a while, however now that the RHACS operator is available I figured it was time to step up my game and integrate it into my GitOps cluster configuration instead of deploying it manually.

Broadly speaking, when installing RHACS manually on a cluster there are four steps that you typically need to perform:

  1. Subscribe the operator into your cluster via Operator Hub into the stackrox namespace
  2. Deploy an instance of Central which provides the UI, dashboards, etc (i.e. the single pane of glass) to interact with the product using the Central CRD API
  3. Create and download a cluster-init bundle in Central for the sensors and deploy it into the stackrox namespace
  4. Deploy the sensors via the SecuredCluster

When looking at these steps there are a couple of challenges to overcome for the process to be done via GitOps:

  • The steps need to happen sequentially, in particular the cluster-init bundle needs to be deployed before the SecuredCluster
  • Retrieving the cluster-init bundle requires interacting with the Central API as it is not managed via a kubernetes CRD

Fortunately both of these challenges are easily overcome. For the first challenge we can leverage Sync Waves in Argo CD to deploy items in a defined order. To do this, we simply annotate the objects with the desired order, aka wave, using argocd.argoproj.io/sync-wave. For example, here is the operator Subscription, which goes first since we place it in wave ‘0’:


apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  annotations:
    argocd.argoproj.io/sync-wave: "0"
  labels:
    operators.coreos.com/rhacs-operator.openshift-operators: ''
  name: rhacs-operator
  namespace: openshift-operators
spec:
  channel: latest
  installPlanApproval: Automatic
  name: rhacs-operator
  source: redhat-operators
  sourceNamespace: openshift-marketplace
  startingCSV: rhacs-operator.v3.62.0

The second challenge, retrieving the cluster-init bundle, is straightforward using the RHACS Central API. To invoke the API we create a small Kubernetes job that Argo CD will deploy after Central is up and running but before the SecuredCluster. The job uses a ServiceAccount with just enough permissions to retrieve the password and then interact with the API; an abbreviated version of the job highlighting the meat of it appears below:

echo "Configuring cluster-init bundle"
export DATA={\"name\":\"local-cluster\"}
curl -k -o /tmp/bundle.json -X POST -u "admin:$PASSWORD" -H "Content-Type: application/json" --data $DATA https://central/v1/cluster-init/init-bundles
echo "Bundle received"
 
echo "Applying bundle"
# No jq in container, python to the rescue
cat /tmp/bundle.json | python3 -c "import sys, json; print(json.load(sys.stdin)['kubectlBundle'])" | base64 -d | oc apply -f -

The last thing that needs to happen to make this work is defining a custom health check in Argo CD for Central. Without this health check Argo CD will not wait for Central to be fully deployed before moving on to the next item in the wave, which causes issues when the job tries to execute and no Central is available. In your Argo CD resource customizations you need to add the following:

    platform.stackrox.io/Central:
      health.lua: |
        hs = {}
        if obj.status ~= nil and obj.status.conditions ~= nil then
            for i, condition in ipairs(obj.status.conditions) do
              if condition.status == "True" and condition.reason == "InstallSuccessful" then
                  hs.status = "Healthy"
                  hs.message = condition.message
                  return hs
              end
            end
        end
        hs.status = "Progressing"
        hs.message = "Waiting for Central to deploy."
        return hs

A full example of the healthcheck is in the repo I use to install the OpenShift GitOps operator here.

At this point you should have a fully functional RHACS deployment in your cluster being managed by the OpenShift GitOps operator (Argo CD). Going further, you can extend the example by using the Central API to integrate with RH-SSO and other components in your infrastructure using the same job technique to fetch the cluster-init-bundle.

The complete example of this approach is available in the Red Hat Canada GitOps Catalog repo in the acs-operator folder.

Discovering OpenShift Resources in Quarkus

I have a product-catalog application that I have been using as a demo for a while now. It’s essentially a three tier application, as per the topology view below, with the front-end (client) using React, the back-end (server) written in Quarkus and a MariaDB database.

The client application is a Single Page Application (SPA) using React that talks directly to the server application via REST API calls. As a result, the Quarkus server back-end needs to have CORS configured in order to accept requests from the front-end application. While a wildcard, i.e. ‘*’, certainly works, in cases where it’s not a public API I prefer a more restrictive setting for CORS, i.e. http://client-product-catalog-dev.apps.home.ocplab.com.
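
In a static setup that restrictive origin would typically be hard-coded in application.properties, something along these lines (the URL is from my environment):

# application.properties: static CORS configuration
quarkus.http.cors=true
quarkus.http.cors.origins=http://client-product-catalog-dev.apps.home.ocplab.com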

The downside of this restrictive approach is that I need to customize the CORS setting for every namespace and cluster I deploy the application into, since the client route is unique in each of those cases. While tools like kustomize or helm can help with this, the client URL needed for the CORS configuration is already defined as a route in OpenShift, so why not just have the application discover the URL at runtime via the Kubernetes API?

This was my first stab at using the openshift-client in Quarkus and it was surprisingly easy to get going. The Quarkus guide on using the kubernetes/openshift client is excellent, as is par for the course with Quarkus guides. Following the guide, the first step is adding the extension to your pom.xml:

./mvnw quarkus:add-extension -Dextensions="openshift-client"

After that it’s just a matter of writing some code to discover the route. I opted to label the route with endpoint:client and to search for the route by that label. The first step was to create a LabelSelector as follows:

LabelSelector selector = new LabelSelectorBuilder().withMatchLabels(Map.ofEntries(entry("endpoint", "client"))).build();

Now that we have the label selector we can then ask for a list of routes matching that selector:

List<Route> routes = openshiftClient.routes().withLabelSelector(selector).list().getItems();

Finally, with the list of routes I opt to use the first match. Note that for simplicity I’m omitting a bunch of checking and logging that I do if there are zero matches or multiple matches; the full class with all of those checks appears further below.

Route route = routes.get(0);
String host = route.getSpec().getHost();
boolean tls = false;
// Use https when the route has TLS termination configured
if (route.getSpec().getTls() != null && !"".equals(route.getSpec().getTls().getTermination())) {
    tls = true;
}
String corsOrigin = (tls?"https":"http") + "://" + host;

Once we have our corsOrigin, we set it as a system property to override the default setting:

System.setProperty("quarkus.http.cors.origins", corsOrigin);

In OpenShift you will need to give the view role to the serviceaccount that is running the pod in order for it to be able to interact with the Kubernetes API. This can be done via the CLI as follows:

oc adm policy add-role-to-user view -z default

Alternatively, if using kustomize or GitOps, the equivalent yaml would be as follows:

apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: default-view
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: view
subjects:
- kind: ServiceAccount
  name: default

So that’s basically it: with a little bit of code I’ve reduced the amount of configuration that needs to be done to deploy the app on a per namespace/cluster basis. The complete code appears below.

package com.redhat.demo;
 
import java.util.Map;
import static java.util.Map.entry;
 
import java.util.List;
 
import javax.enterprise.context.ApplicationScoped;
import javax.enterprise.event.Observes;
import javax.inject.Inject;
 
import io.fabric8.kubernetes.api.model.LabelSelector;
import io.fabric8.kubernetes.api.model.LabelSelectorBuilder;
import io.fabric8.openshift.api.model.Route;
import io.fabric8.openshift.client.OpenShiftClient;
import io.quarkus.runtime.ShutdownEvent;
import io.quarkus.runtime.StartupEvent;
 
import org.eclipse.microprofile.config.ConfigProvider;
import org.jboss.logging.Logger;
 
@ApplicationScoped
public class OpenShiftSettings {
 
    private static final Logger LOGGER = Logger.getLogger("ListenerBean");
 
    @Inject
    OpenShiftClient openshiftClient;
 
    void onStart(@Observes StartupEvent ev) {
        // Test if we are running in a pod
        String k8sSvcHost = System.getenv("KUBERNETES_SERVICE_HOST");
        if (k8sSvcHost == null || "".equals(k8sSvcHost)) {
            LOGGER.infof("Not running in kubernetes, using CORS_ORIGIN environment '%s' variable",
                    ConfigProvider.getConfig().getValue("quarkus.http.cors.origins", String.class));
            return;
        }
 
        if (System.getenv("CORS_ORIGIN") != null) {
            LOGGER.infof("CORS_ORIGIN explicitly defined bypassing route lookup");
            return;
        }
 
        // Look for route with label endpoint:client
        if (openshiftClient.getMasterUrl() == null) {
            LOGGER.info("Kubernetes context is not available");
        } else {
            LOGGER.infof("Application is running in OpenShift %s, checking for labelled route",
                    openshiftClient.getMasterUrl());
 
            LabelSelector selector = new LabelSelectorBuilder()
                    .withMatchLabels(Map.ofEntries(entry("endpoint", "client"))).build();
            List<Route> routes = null;
            try {
                routes = openshiftClient.routes().withLabelSelector(selector).list().getItems();
            } catch (Exception e) {
                LOGGER.info("Unexpected error occurred retrieving routes, using environment variable CORS_ORIGIN", e);
                return;
            }
            if (routes == null || routes.size() == 0) {
                LOGGER.info("No routes found with label 'endpoint:client', using environment variable CORS_ORIGIN");
                return;
            } else if (routes.size() > 1) {
                LOGGER.warn("More then one route found with 'endpoint:client', using first one");
            }
 
            Route route = routes.get(0);
            String host = route.getSpec().getHost();
            boolean tls = false;
            // Use https when the route has TLS termination configured
            if (route.getSpec().getTls() != null && !"".equals(route.getSpec().getTls().getTermination())) {
                tls = true;
            }
            String corsOrigin = (tls ? "https" : "http") + "://" + host;
            System.setProperty("quarkus.http.cors.origins", corsOrigin);
        }
        LOGGER.infof("Using host %s for cors origin",
                ConfigProvider.getConfig().getValue("quarkus.http.cors.origins", String.class));
    }
 
    void onStop(@Observes ShutdownEvent ev) {
        LOGGER.info("The application is stopping...");
    }
}

This code is also in my public repository.

RH-SSO (Keycloak) and GitOps

One of the underappreciated benefits of OpenShift is the included and supported SSO product called, originally enough, Red Hat Single Sign-On (RH-SSO). This is the productized version of the very popular upstream Keycloak community project which has seen widespread adoption amongst many different organizations.

While deploying RH-SSO (or Keycloak) from a GitOps perspective is super easy, managing the configuration of the product using GitOps is decidedly not. In fact I’ve been wanting to deploy and use RH-SSO in my demo clusters for quite a while but balked at manually managing the configuration or resorting to the import/export capabilities. Also, while the Keycloak Operator provides some capabilities in this area, it is limited in the number of objects it supports (Realms, Clients and Users) and is still maturing, so it wasn’t an option either.

An alternative tool that I stumbled upon is Keycloakmigration, which enables you to configure your Keycloak instance using yaml. It was designed to support pipelines where updates need to constantly flow into Keycloak; as a result it follows a changelog model rather than the purely declarative form I would prefer for GitOps. Having said that, in basic testing it works well in the GitOps context, but my testing to date, as mentioned, has been basic.

Let’s look at how the changelog works; here is an example changelog file:

includes:
  - path: 01-realms.yml
  - path: 02-clients-private.yml
  - path: 03-openshift-users-private.yml
  - path: 04-google-idp-private.yml

Notice that it simply specifies a set of files, with each file in the changelog representing a set of changes to make to Keycloak. For example, 01-realms.yml adds two realms called openshift and 3scale:

id: add-realms
author: gnunn
changes:
  - addRealm:
      name: openshift
  - addRealm:
      name: 3scale

The file to add new clients to the openshift realm, 02-clients-private.yml, appears as follows:

id: add-openshift-client
author: gnunn
realm: openshift
changes:
# OpenShift client
- addSimpleClient:
    clientId: openshift
    secret: xxxxxxxxxxxxxxxxxxxxxxxx
    redirectUris:
      - "https://oauth-openshift.apps.home.ocplab.com/oauth2callback/rhsso"
- updateClient:
    clientId: openshift
    standardFlowEnabled: true
    implicitFlowEnabled: false
    directAccessGrantEnabled: true
# Stackrox client
- addSimpleClient:
    clientId: stackrox
    secret: xxxxxxxxxxxxxxxxxxxxx
    redirectUris:
      - "https://central-stackrox.apps.home.ocplab.com/sso/providers/oidc/callback"
      - "https://central-stackrox.apps.home.ocplab.com/auth/response/oidc"
- updateClient:
    clientId: stackrox
    standardFlowEnabled: true
    implicitFlowEnabled: false
    directAccessGrantEnabled: true

To create this changelog in kustomize, we can simply use the secret generator:

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

namespace: sso

generatorOptions:
  disableNameSuffixHash: true

secretGenerator:
- name: keycloak-migration
  files:
  - secrets/keycloak-changelog.yml
  - secrets/01-realms.yml
  - secrets/02-clients-private.yml
  - secrets/03-openshift-users-private.yml
  - secrets/04-google-idp-private.yml

Now it should be noted that many of these files potentially contain sensitive information, including client secrets and user passwords; as a result I would strongly recommend encrypting the secret before storing it in git using something like Sealed Secrets. I personally keep the generator commented out and only enable it when I need to generate the secret before sealing it. All of the files with the -private suffix are not stored in git.
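
For completeness, sealing the generated Secret before committing it is roughly a one-liner with kubeseal (depending on how Sealed Secrets was installed you may also need to point it at your controller with --controller-name and --controller-namespace):

# keycloak-migration-secret.yaml is the plain Secret produced by the generator above
kubeseal --format yaml < keycloak-migration-secret.yaml > keycloak-migration-sealedsecret.yaml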

Once you have the secret generated with the changelog and associated files, a PostSync job in Argo CD can be used to execute the Keycloakmigration tool to perform the updates in Keycloak. Here is the job I am using:

apiVersion: batch/v1
kind: Job
metadata:
  name: keycloak-migration
  namespace: sso
  annotations:
    argocd.argoproj.io/hook: PostSync
    argocd.argoproj.io/hook-delete-policy: HookSucceeded
spec:
  template:
    spec:
      containers:
      - image: klg71/keycloakmigration
        env:
        - name: BASEURL
          value: "https://sso-sso.apps.home.ocplab.com/auth"
        - name: CORRECT_HASHES
          value: "true"
        - name: ADMIN_USERNAME
          valueFrom:
            secretKeyRef:
              name: sso-admin-credential
              key: ADMIN_USERNAME
        - name: ADMIN_PASSWORD
          valueFrom:
            secretKeyRef:
              name: sso-admin-credential
              key: ADMIN_PASSWORD
        imagePullPolicy: Always
        name: keycloak-migration
        volumeMounts:
        - name: keycloak-migration
          mountPath: "/migration"
          readOnly: true
        - name: logs
          mountPath: "/logs"
      dnsPolicy: ClusterFirst
      restartPolicy: OnFailure
      terminationGracePeriodSeconds: 30
      volumes:
      - name: keycloak-migration
        secret:
          secretName: keycloak-migration
      - name: logs
        emptyDir: {}

In this job the various parameters are passed as environment variables. We directly mount the SSO admin secret into the container so that the tool can interact with Keycloak. The other interesting parameter to note is CORRECT_HASHES. I had some issues where, if I manually changed an object, the migration would refuse to run since the object no longer matched the changelog. Since my environment is ephemeral and subject to troubleshooting, I opted to add this parameter to force the process to continue. I do need to test this further before deciding whether to leave it or remove it.

In summary, this shows one possible approach to configuring Keycloak using a GitOps approach. While my testing to this point has been very basic, I’m optimistic about the possibilities and look forward to trying it out more.

Managing OpenShift Cluster Configuration with GitOps

Where are all the people going
Round and round till we reach the end
One day leading to another
Get up, go out, do it again

Do It Again, The Kinks

Introduction

If you manage multiple Kubernetes or OpenShift clusters long enough, particularly ephemeral clusters which come and go, you’ve probably experienced that “Do it Again” feeling of monotonously repeating the same tasks over and over again to provision and set up a cluster. This is where GitOps comes into play, helping automate those tasks in a reliable and consistent fashion.

First off, what is GitOps and why should you care about it? Simply put, GitOps is the process of continuously reconciling the state of a system with the state declared in a Git repository; at the end of the day that’s all it does. But buried in that simple statement, coupled with the declarative nature of Kubernetes, is what enables you to build, deliver, update and manage clusters at scale reliably and effectively, and that’s why you should care.

Essentially, in a GitOps approach the state of our cluster configuration is stored in git, and as changes occur in git a GitOps tool automatically updates the cluster state to match. Just as importantly, if someone changes the state of a cluster directly by modifying or deleting a resource via kubectl/oc or a GUI console, the GitOps tool can automatically bring the cluster back in line with the state declared in git.

This can be thought of as a reconciliation loop where the GitOps tool is constantly ensuring the state of the cluster matches the declared state in git. In organizations where configuration drift is a serious issue, this capability should not be under-estimated in terms of dramatically improving the reliability and consistency of cluster configuration and deployments. It also provides a strong audit trail of changes since every cluster change is represented by a git commit.

The concept of managing the state of a system in Git is not new, developers have been using source control for many years. On the operations side the concept of “Infrastructure as Code” has also existed for many years with middling success and adoption.

What’s different now is Kubernetes, which provides a declarative rather than imperative platform, and the benefits of being able to encapsulate the state of a system and have the system itself be responsible for matching this desired state are enormous. This almost (but not quite completely) eliminates the need for the complex and often brittle imperative scripts or playbooks to manage the state of the system that we often saw when organizations attempted “Infrastructure as Code”.

In a nutshell, Kubernetes provides its own reconciliation loop; it is constantly ensuring the state of the cluster matches the desired declared state. For example, when you deploy an application and change the number of replicas from 2 to 3 you are changing the desired state, and the Kubernetes controller is responsible for making that happen. At the end of the day GitOps is doing the same thing, just taking it one level higher.

This is why GitOps with Kubernetes is such a good fit that it becomes the natural approach.

GitOps and Kubernetes

Tools of the Trade

Now that you have hopefully been sold on the benefits of adopting a GitOps approach to cluster configuration let’s look at some of the tools that we will be using in this article.

Kustomize. When starting with GitOps many folks begin with storing raw yaml in their git repository. While this works, it quickly leads to a lot of duplication (i.e. copy and paste) of yaml as one needs to tweak and tailor the yaml for specific use cases, environments or clusters. Over time this becomes burdensome to maintain, leading folks to look at alternatives. There are two choices that folks typically gravitate towards: Helm or Kustomize.

Helm is a templating framework that provides package management of applications in a kubernetes cluster. Kustomize on the other hand is not a templating framework but rather a patching framework. Kustomize works by enabling developers to inherit, compose and aggregate yaml and make changes to this yaml using various patching strategies such as merging or JSON patching. Since it is a patching framework, it can feel quite different to those used to more conventional templating frameworks such as Helm, OpenShift Templates or Ansible Jinja.

Kustomize works on the concept of bases and overlays. Bases are essentially, as the name implies, the base raw yaml for a specific piece of functionality. For example, I could have a base to deploy a database into my cluster. Overlays on the other hand inherit from one or more bases and are where the bases are patched for specific environments or clusters. So taking the previous example, I could have a database base for deploying MariaDB and an overlay that patches that base for an environment to use a specific password.
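
As a tiny illustration of that MariaDB example (the file names and patch are hypothetical), a base and an overlay are each just a kustomization.yaml plus the yaml they reference:

# base/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- mariadb-deployment.yaml
- mariadb-service.yaml

# overlays/dev/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
bases:
- ../../base
patchesStrategicMerge:
- patch-mariadb-password.yaml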

My strong personal preference is to use kustomize for GitOps in enterprise teams where the team owns the yaml. One recommendation I would make when using kustomize is to come up with an organizational standard for the folder structure of bases and overlays in order to provide consistency and readability across repos and teams. My personal standard, which we will be using in this article, is located in my standards repository. By no means am I saying this standard is the one true way; however, regardless of what standard you put in place, having a standard is critical.

ArgoCD. While kustomize helps you organize and manage your yaml in git repos, we need a tool that can manage the GitOps integration with the cluster and provide the reconciliation loop we are looking for. In this article we will focus on ArgoCD; however, there are a plethora of tools in this space including Flux, Advanced Cluster Management (ACM) and more.

I’m using ArgoCD for a few reasons. First I like it. Second it will be supported as part of OpenShift as an operator called OpenShift GitOps. For OpenShift customers with larger scale needs I would recommend checking out ACM in conjunction with ArgoCD and the additional capabilities it brings to the table.

ArgoCD

Some key concepts to be aware of with ArgoCD include:

  • Applications. ArgoCD uses the concept of an Application to represent an item (git repo + context path) in git that is deployed to the cluster. While the term Application is used, this does not necessarily correspond 1:1 to an application: the deployment of a set of Roles and RoleBindings to the cluster could be an Application, an operator subscription could be an Application, a three tier app could be a single Application, etc. Basically don’t get hung up on the term Application, it’s really just the level of encapsulation.
  • In short, at the end of the day an Application is really just a reference to a git repository as per the example below:

    apiVersion: argoproj.io/v1alpha1
    kind: Application
    metadata:
      name: config-groups-and-membership
    spec:
      destination:
        namespace: argocd
        server: https://kubernetes.default.svc
      project: cluster-config
      source:
        path: manifests/configs/groups-and-membership/overlays/default
        repoURL: https://github.com/gnunn-gitops/cluster-config.git
        targetRevision: master
  • Projects. As per the ArgoCD website, “Projects provide a logical grouping of applications” which can be useful when organizing applications deployed into ArgoCD. It is also where you can apply RBAC and restrictions around applications in terms of the namespaces where applications can be deployed, what k8s APIs they can use, etc. In general I primarily use projects as an organization tool and prefer the model of deploying separate namespace scoped instances of ArgoCD on a per team level (not per app!) to provide isolation.
  • App of Apps. The “App of App” pattern refers to using an ArgoCD application to declaratively deploy other ArgoCD applications. Essentially you have an ArgoCD application that points to a git repository with other ArgoCD applications in it. The benefit of this approach is that it enables you to deploy a single application to deliver a wide swath of functionality without having to deploy each application individually. That’s correct, it’s turtles all the way down. Note though that at some point in the future the App of Apps pattern will likely be replaced by ApplicationSets.
  • Sync Waves. In Kubernetes there is often a need to handle dependencies, i.e. to deploy one thing before another. In ArgoCD this capability is provided by sync waves, which enable you to annotate an application with the wave number it is part of. This is particularly powerful with the “App of App” pattern where we can use it to deploy our applications in a particular order, which we will see when we do the cluster configuration (I’m getting there, I promise!)

Sealed Secrets. When you first start with GitOps the first reaction is typically “Awesome, we are storing all our stuff in git” shortly followed by “Crap, we are storing all of our stuff in git, including secrets”. To use GitOps effectively you need a way to either manage your secrets externally from git or encrypt them in git. There are a huge number of tools available for this; in Red Hat Canada we’ve settled on Sealed Secrets as it provides a straightforward way to encrypt/decrypt secrets in git and is easy to manage for our demos. Having said that, we don’t have a hard and fast recommendation here; if your organization has an existing secret management solution (like HashiCorp’s Vault for example) I would highly recommend looking at using that as a first step.

Sealed Secrets runs as a controller in the cluster that will automatically decrypt a SealedSecret CR into a corresponding Secret. Secrets are encrypted using a private key which is most commonly associated with a specific cluster, i.e. the production cluster would have a different key than the development cluster. Secrets are also tied to a namespace and can only be decrypted in the namespace for which they are intended. Finally, a CLI called kubeseal allows users to quickly create a new SealedSecret for a particular cluster and namespace.
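
For context, the SealedSecret CR that ends up in git looks roughly like this; the name, namespace and encrypted blob below are placeholders, with the real ciphertext generated by kubeseal:

apiVersion: bitnami.com/v1alpha1
kind: SealedSecret
metadata:
  name: my-secret
  namespace: my-namespace
spec:
  encryptedData:
    password: AgB3...   # ciphertext produced by kubeseal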

Bringing it all Together

With the background out of the way, let’s talk about bringing it all together to manage cluster configuration with GitOps. Assuming you have a freshly installed cluster, all shiny and gleaming, the first step is to deploy ArgoCD into the cluster. There’s always a bit of a chicken and egg problem here in that you need to get the GitOps tool deployed before you can actually start GitOps’ing. For simplicity we will deploy ArgoCD manually here using kustomize; however a more enterprise solution would be to use something like ACM, which can push out Argo to clusters on its own.

The Red Hat Canada GitOps organization has a repo with a standardized deployment and configuration of ArgoCD that we share in our team, thanks to the hard work of Andrew Pitt. Our ArgoCD configuration includes resource customizations and exclusions that we have found make sense in our OpenShift environments. These changes help ArgoCD work better with certain resources to determine whether an application is in or out of sync.

To deploy ArgoCD to a cluster, you can simply clone the repo and use the included setup.sh script, which deploys the operator followed by the ArgoCD instance in the argocd namespace.

Once you have ArgoCD deployed and ready to go you can start creating a cluster configuration repository. My cluster configuration is located in github at https://github.com/gnunn-gitops/cluster-config; my recommendation would be to start from scratch with your own repo rather than forking mine, and slowly build it up to meet your needs. Having said that, let’s walk through how my repo is set up as an example.

The first thing you will notice is the structure with three key folders at the root level: clusters, environments and manifests. I cover these extensively in my standards document but here is a quick recap:

  • manifests. A base set of kustomize manifests and yaml for applications, operators, configuration and ArgoCD app/project definitions. Everything is inherited from here
  • environments. Environment specific aggregation and patching is found here. Unlike app environments (prod/test/qa), these are environments that share the same configuration (production vs non-production, AWS versus Azure, etc). Each environment aggregates the ArgoCD applications you wish to deploy and is consumed by the next level in the hierarchy, clusters, using an app of app pattern.
  • clusters. Cluster specific configuration; it does not directly aggregate the environments but instead employs an app-of-app pattern to define one or more applications that point to the environment set of applications. It also includes anything that needs to be directly bootstrapped, e.g. a specific sealed-secrets key.

The relationship between these folders is shown in the diagram above. The clusters folder can consume kustomize bases/overlays from both environments and manifests, while environments can only consume from manifests, never from clusters. This organizational rule helps keep things sane and logical.

So let’s look in a bit more detail at how things are organized. If you look at my environments folder you will see three overlays are present: bootstrap, local and cloud. Local and cloud represent my on-prem and cloud based environments, but what’s bootstrap and why does it exist?

Regardless of the cluster you are configuring, there is a need to bootstrap some things directly in the cluster outside of a GitOps context. If you look at the kustomization file you will see there are two items in particular that get bootstrapped directly:

  • ArgoCD Project. We need to add an ArgoCD project to act as a logical grouping for our cluster configuration. In my case the project is called cluster-config
  • Sealed Secret Key. I like to provision a known key for decrypting my SealedSecret objects in the cluster so that I have a known state to work from rather than having Sealed Secrets generate a new key on install. This also makes it possible to restore a cluster from scratch without having to re-encrypt all the secrets in git. Note that the kustomization in bootstrap references a file sealed-secrets-secret.yaml which is not in git; this is the private key and is essentially the keys to the kingdom. I include this file in my .gitignore so it never gets accidentally committed to git.

Next, if you examine the local environment kustomize file, notice that it imports all of the ArgoCD applications that will be included in this environment along with any environment-specific patching required.

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

namespace: argocd

bases:
- ../../../manifests/argocd/apps/sealed-secrets-operator/base
- ../../../manifests/argocd/apps/letsencrypt-certs/base
- ../../../manifests/argocd/apps/storage/base
- ../../../manifests/argocd/apps/alertmanager/base
- ../../../manifests/argocd/apps/prometheus-user-app/base
- ../../../manifests/argocd/apps/console-links/base
- ../../../manifests/argocd/apps/helm-repos/base
- ../../../manifests/argocd/apps/oauth/base
- ../../../manifests/argocd/apps/container-security-operator/base
- ../../../manifests/argocd/apps/compliance-operator/base
- ../../../manifests/argocd/apps/pipelines-operator/base
- ../../../manifests/argocd/apps/web-terminal-operator/base
- ../../../manifests/argocd/apps/groups-and-membership/base
- ../../../manifests/argocd/apps/namespace-configuration-operator/base

patches:
- target:
    group: argoproj.io
    version: v1alpha1
    kind: Application
  path: patch-application.yaml
- target:
    group: argoproj.io
    version: v1alpha1
    kind: Application
    name: config-authentication
  path: patch-authentication-application.yaml

Now if we move up to the clusters folder you will see two folders at the time of this writing, ocplab and home, which are the two clusters I typically manage. The ocplab cluster is an ephemeral cluster that is installed and removed periodically in AWS; the home cluster is the one sitting in my homelab. Drilling into the clusters/overlays/home folder you will see the following sub-folders:

  • apps
  • argocd
  • configs

The apps and configs folders mirror the same folders in manifests; these are apps and configs that are specific to a cluster or that need to be patched for a specific cluster. If you look at the argocd folder and drill into the cluster-config/clusters/overlays/home/argocd/apps/kustomization.yaml file you will see the following kustomization:

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

bases:
- ../../../../../environments/overlays/local

resources:
- ../../../../../manifests/argocd/apps/cost-management-operator/base

patches:
# Patch console links for cluster routes
- target:
    group: argoproj.io
    version: v1alpha1
    kind: Application
    name: config-console-links
  path: patch-console-link-app.yaml
# Patch so compliance scan only runs on masters and doesn't get double-run
- target:
    group: argoproj.io
    version: v1alpha1
    kind: Application
    name: config-compliance-security
  path: patch-compliance-operator-app.yaml
# Patch cost management to use Home source
- target:
    group: argoproj.io
    version: v1alpha1
    kind: Application
    name: config-cost-management
  path: patch-cost-management-operator-app.yaml

Notice this inherits the local environment as its base, so it pulls in all of the ArgoCD applications from there and applies cluster specific patching as needed. Remember way back when we talked about the App of App pattern? Let’s look at that next.

Bringing up the /clusters/overlays/home/argocd/manager/cluster-config-manager-app.yaml file, this is the App of App, which I typically suffix with “-manager” since it manages the other applications. This file appears as follows:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: cluster-config-manager
  labels:
    gitops.ownedBy: cluster-config
spec:
  destination:
    namespace: argocd
    server: https://kubernetes.default.svc
  project: cluster-config
  source:
    path: clusters/overlays/home/argocd/apps
    repoURL: https://github.com/gnunn-gitops/cluster-config.git
    targetRevision: master
  syncPolicy:
    automated:
      prune: false
      selfHeal: true

Note that the path is telling ArgoCD to deploy what we looked at earlier, i.e. where all of the cluster applications are defined by referencing the local environment. Thus deploying this manager application pulls in all of the other applications and deploys them as well, so running this single command:

kustomize build clusters/overlays/home/argocd/manager | oc apply -f -

Results in this:

Now as mentioned, all of the cluster configuration is deployed in a specific order using ArgoCD sync waves. In this repository the following order is used:

Wave  Item
1     Sealed Secrets
2     Lets Encrypt for wildcard routes
3     Storage (iscsi storageclass and PVs)
11    Cluster Configuration (Authentication, AlertManager, etc)
21    Operators (Pipelines, CSO, Compliance, Namespace Operator, etc)

You can see these waves defined as annotations in the various ArgoCD applications, for example the sealed-secrets application has the following:

  annotations:
    argocd.argoproj.io/sync-wave: "1"

Conclusion

Well, that brings this entry to a close. GitOps is a game changing way to manage your clusters and deploy applications. While there is some work and learning involved in getting everything set up, once you do it you’ll never want to go back to manual processes again.

If you are making changes in a GUI console you are doing it wrong
Me

Acknowledgements

I want to thank my cohort in GitOps, Andrew Pitt. A lot of the stuff I talked about here comes from Andrew; he did all the initial work with ArgoCD in our group and was responsible for evangelizing it. I started with Kustomize, Andrew started with ArgoCD and we ended up meeting in the middle, a perfect team!

Updating Kustomize Version in ArgoCD

I love kustomize, particularly when paired with ArgoCD, and find that it’s been a great way to reduce yaml duplication. As much as I love it, there have been some annoying bugs with it over the months, particularly in how it handles remote repositories.

For those not familiar with using remote repositories, you can have a kustomization that imports bases and resources from a git repository instead of having them on your local file system. This makes it possible to develop a common set of kustomizations that can be re-used across an organization. This is essentially what we do in the Red Hat Canada Catalog repo where we share common components across our team. Here is an example of using a remote repository where my cluster-config repo imports the cost management operator from the Red Hat Canada Catalog:

kind: Kustomization
apiVersion: kustomize.config.k8s.io/v1beta1

bases:
- github.com/redhat-canada-gitops/catalog/cost-management-operator/overlays/default

patchesJson6902:
  - path: patch-source-and-name.yaml
    target:
      group: koku-metrics-cfg.openshift.io
      kind: KokuMetricsConfig
      name: instance
      version: v1beta1

This works really well, but as mentioned previously, bugs prevail: the format to reference the git repository has worked/not worked in different ways over previous versions and, most annoyingly, importing a kustomization which in turn has bases that nest more than one level deep in the repo will fail with an evalsymlink error. A lot of these issues were tied to the usage of go-getter.

Fortunately this all seems to have been fixed in the 4.x versions of kustomize with the dropping of go-getter; unfortunately ArgoCD was still using 3.7.3 last time I checked. The good news is that it is easy enough to create your own version of the ArgoCD image and include whatever version of kustomize you want. The ArgoCD documentation goes through the options for including custom tools, however at the moment the operator only supports embedding new tools in an image.

As a result the first step to using a custom version of kustomize (lol alliteration!) is to create the image through a Dockerfile:

FROM docker.io/argoproj/argocd:v1.7.12
 
# Switch to root for the ability to perform install
USER root
 
ARG KUSTOMIZE_VERSION=v4.0.1
 
# Install tools needed for your repo-server to retrieve & decrypt secrets, render manifests 
# (e.g. curl, awscli, gpg, sops)
RUN apt-get update && \
    apt-get install -y \
        curl \
        awscli \
        gpg && \
    apt-get clean && \
    curl -o /tmp/kustomize.tar.gz -L https://github.com/kubernetes-sigs/kustomize/releases/download/kustomize%2F${KUSTOMIZE_VERSION}/kustomize_${KUSTOMIZE_VERSION}_linux_amd64.tar.gz && \
    ls /tmp && \
    tar -xvf /tmp/kustomize.tar.gz -C /usr/local/bin && \
    rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*
 
# Switch back to non-root user
USER argocd

Note in the Dockerfile above I have chosen to overwrite the existing kustomize version. As per the ArgoCD Custom Tooling documentation, you can add multiple versions of kustomize and reference specific versions in your applications. However I see my fix here as a temporary measure until the ArgoCD image catches up with kustomize, so I would prefer to keep my application yaml unencumbered with kustomize version references.

To build the image, simply run the following, substituting my image repo and name with your own registry and repository:

docker build . -t quay.io/gnunn/argocd:v1.7.12
docker push quay.io/gnunn/argocd:v1.7.12

Once we have the image, we can just update the ArgoCD CR that the operator uses to reference our image as per the example below:

apiVersion: argoproj.io/v1alpha1
kind: ArgoCD
metadata:
  name: example-argocd
  namespace: argocd
spec:
  image: quay.io/gnunn/argocd
  version: v1.7.12

Building a Simple Up/Down Status Dashboard for OpenShift

OpenShift provides a wealth of monitoring and alerts, however sometimes it can be handy to surface a simple up/down signal for an OpenShift cluster that can be easily interpreted by tools like UptimeRobot. This enables you to provide an operational or business level dashboard of the status of your cluster to users and application owners that may not necessarily be familiar with all of the nuances of OpenShift’s or Kubernetes’ internals.

UptimeRobot Dashboard

The health status of an OpenShift cluster depends on many things such as etcd, operators, nodes, the API, etc, so how do we aggregate all of this information? While you could certainly write your own code to do it, fortunately a cool utility called Cerberus already provides this capability. Cerberus was born out of Red Hat’s Performance and Scaling group and was designed to be used with Kraken, a chaos engineering tool. A chaos engineering tool isn’t very useful if you can’t determine the status of the cluster, and thus Cerberus was born.

A number of blog posts have already been written about Kraken and Cerberus from a chaos engineering point of view, which you can view here and here. In this post we are going to focus on the basics of using Cerberus for simple health checking.

One thing to note about Cerberus is that it is aggressive about returning an unhealthy state even if the cluster is operational. For example, if you set it to monitor a namespace, any pod failure in a multi-pod deployment in that namespace will trigger an unhealthy flag even if the other pods in the deployment are running and still servicing requests. As a result, some tuning of Cerberus or the use of custom checks is required if you want to use it for a more SLA-focused view.

To get started with Cerberus simply clone the git repo into an appropriate directory on the system where you want to run it. While you can run it inside of OpenShift, it is highly recommended to run it outside the cluster since your cluster monitoring tool should not be dependent on the cluster itself. To clone the repo, just run the following:

git clone https://github.com/cloud-bulldozer/cerberus

In order for Cerberus to run, it requires access to a kubeconfig file where a user has already been authenticated. For security purposes I would highly recommend using a serviceaccount with the cluster-reader role rather than using a user with cluster-admin. The commands below will create a serviceaccount in the openshift-monitoring namespace, bind it to the cluster-reader role and generate a kubeconfig that Cerberus can use to authenticate to the cluster.

oc create sa cerberus -n openshift-monitoring
oc adm policy add-cluster-role-to-user cluster-reader -z cerberus -n openshift-monitoring
oc serviceaccounts create-kubeconfig cerberus -n openshift-monitoring > config/kubeconfig
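Before pointing Cerberus at it, it's worth a quick sanity check that the generated kubeconfig actually works, something along these lines:

# The cluster-reader role should allow listing nodes with the new kubeconfig
oc --kubeconfig config/kubeconfig get nodes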

Cerberus can automatically create a token for the prometheus-k8s service account so it can access Prometheus to pull metrics. To enable this we need to define a role that gives Cerberus the necessary permissions and bind it to the cerberus service account. Create a file with the following content:

---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: cerberus
  namespace: openshift-monitoring
rules:
  - apiGroups:
      - ""
    resources:
      - serviceaccounts
      - secrets
    verbs:
      - get
      - list
      - patch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: cerberus-service-account-token
  namespace: openshift-monitoring  
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: cerberus
subjects:
  - kind: ServiceAccount
    name: cerberus
    namespace: openshift-monitoring

Then apply it to the cluster with “oc apply -f”.

To configure Cerberus you can edit the existing config.yaml file in the repo or create a new one; creating a new one is highly recommended so that a future git pull doesn't clobber your changes:

cp config/config.yaml config/my-config.yaml

Once you have the config file, you can go through the options and set what you need. Here is an example of my config file, which is really just the example config with the kubeconfig parameter tweaked.

cerberus:
    distribution: openshift                              # Distribution can be kubernetes or openshift
    kubeconfig_path: /opt/cerberus/config/kubeconfig     # Path to kubeconfig
    port: 8080                                           # http server port where cerberus status is published
    watch_nodes: True                                    # Set to True for the cerberus to monitor the cluster nodes
    watch_cluster_operators: True                        # Set to True for cerberus to monitor cluster operators
    watch_url_routes:                                    # Route url's you want to monitor, this is a double array with the url and optional authorization parameter
    watch_namespaces:                                    # List of namespaces to be monitored
        -    openshift-etcd
        -    openshift-apiserver
        -    openshift-kube-apiserver
        -    openshift-monitoring
        -    openshift-kube-controller-manager
        -    openshift-machine-api
        -    openshift-kube-scheduler
        -    openshift-ingress
        -    openshift-sdn                               # When enabled, it will check for the cluster sdn and monitor that namespace
    cerberus_publish_status: True                        # When enabled, cerberus starts a light weight http server and publishes the status
    inspect_components: False                            # Enable it only when OpenShift client is supported to run
                                                         # When enabled, cerberus collects logs, events and metrics of failed components

    prometheus_url:                                      # The prometheus url/route is automatically obtained in case of OpenShift, please set it when the distribution is Kubernetes.
    prometheus_bearer_token:                             # The bearer token is automatically obtained in case of OpenShift, please set it when the distribution is Kubernetes. This is needed to authenticate with prometheus.
                                                         # This enables Cerberus to query prometheus and alert on observing high Kube API Server latencies. 

    slack_integration: False                             # When enabled, cerberus reports the failed iterations in the slack channel
                                                         # The following env vars needs to be set: SLACK_API_TOKEN ( Bot User OAuth Access Token ) and SLACK_CHANNEL ( channel to send notifications in case of failures )
                                                         # When slack_integration is enabled, a watcher can be assigned for each day. The watcher of the day is tagged while reporting failures in the slack channel. Values are slack member ID's.
    watcher_slack_ID:                                        # (NOTE: Defining the watcher id's is optional and when the watcher slack id's are not defined, the slack_team_alias tag is used if it is set else no tag is used while reporting failures in the slack channel.)
        Monday:
        Tuesday:
        Wednesday:
        Thursday:
        Friday:
        Saturday:
        Sunday:
    slack_team_alias:                                    # The slack team alias to be tagged while reporting failures in the slack channel when no watcher is assigned

    custom_checks:                                       # Relative paths of files containing additional user defined checks

tunings:
    iterations: 5                                        # Iterations to loop before stopping the watch, it will be replaced with infinity when the daemon mode is enabled
    sleep_time: 60                                       # Sleep duration between each iteration
    kube_api_request_chunk_size: 250                     # Large requests will be broken into the specified chunk size to reduce the load on API server and improve responsiveness.
    daemon_mode: True                                    # Iterations are set to infinity which means that the cerberus will monitor the resources forever
    cores_usage_percentage: 0.5                          # Set the fraction of cores to be used for multiprocessing

database:
    database_path: /tmp/cerberus.db                      # Path where cerberus database needs to be stored
    reuse_database: False                                # When enabled, the database is reused to store the failures

At this point you can run Cerberus manually and test it out as follows:

$ sudo python3 /opt/cerberus/start_cerberus.py --config /opt/cerberus/config/config-home.yaml
               _                         
  ___ ___ _ __| |__   ___ _ __ _   _ ___ 
 / __/ _ \ '__| '_ \ / _ \ '__| | | / __|
| (_|  __/ |  | |_) |  __/ |  | |_| \__ \
 \___\___|_|  |_.__/ \___|_|   \__,_|___/
                                         

2021-01-29 12:01:01,030 [INFO] Starting ceberus
2021-01-29 12:01:01,037 [INFO] Initializing client to talk to the Kubernetes cluster
2021-01-29 12:01:01,144 [INFO] Fetching cluster info
2021-01-29 12:01:01,260 [INFO] 
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.12    True        False         3d20h   Cluster version is 4.6.12

2021-01-29 12:01:01,365 [INFO] Kubernetes master is running at https://api.home.ocplab.com:6443

2021-01-29 12:01:01,365 [INFO] Publishing cerberus status at http://0.0.0.0:8080
2021-01-29 12:01:01,381 [INFO] Starting http server at http://0.0.0.0:8080

2021-01-29 12:01:01,623 [INFO] Daemon mode enabled, cerberus will monitor forever
2021-01-29 12:01:01,623 [INFO] Ignoring the iterations set

2021-01-29 12:01:01,955 [INFO] Iteration 1: Node status: True
2021-01-29 12:01:02,244 [INFO] Iteration 1: Cluster Operator status: True
2021-01-29 12:01:02,380 [INFO] Iteration 1: openshift-ingress: True
2021-01-29 12:01:02,392 [INFO] Iteration 1: openshift-apiserver: True
2021-01-29 12:01:02,396 [INFO] Iteration 1: openshift-sdn: True
2021-01-29 12:01:02,399 [INFO] Iteration 1: openshift-kube-scheduler: True
2021-01-29 12:01:02,400 [INFO] Iteration 1: openshift-machine-api: True
2021-01-29 12:01:02,406 [INFO] Iteration 1: openshift-kube-controller-manager: True
2021-01-29 12:01:02,425 [INFO] Iteration 1: openshift-etcd: True
2021-01-29 12:01:02,443 [INFO] Iteration 1: openshift-monitoring: True
2021-01-29 12:01:02,445 [INFO] Iteration 1: openshift-kube-apiserver: True
2021-01-29 12:01:02,446 [INFO] HTTP requests served: 0 

2021-01-29 12:01:02,446 [WARNING] Iteration 1: Masters without NoSchedule taint: ['home-jcn2d-master-0', 'home-jcn2d-master-1', 'home-jcn2d-master-2']

2021-01-29 12:01:02,592 [INFO] []

2021-01-29 12:01:02,592 [INFO] Sleeping for the specified duration: 60

Great, Cerberus is up and running now, but wouldn't it be great if it ran automatically as a service? Let's go ahead and set that up by creating a systemd service. First let's set up a bash script called start.sh in the root of our cerberus directory as follows:

#!/bin/bash
 
echo "Starting Cerberus..."
 
python3 /opt/cerberus/start_cerberus.py --config /opt/cerberus/config/my-config.yaml

Next, create a systemd service at /etc/systemd/system/cerberus.service and add the following to it:

[Unit]
Description=Cerberus OpenShift Health Check

Wants=network.target
After=syslog.target network-online.target

[Service]
Type=simple
ExecStart=/bin/bash /opt/cerberus/start.sh
Restart=on-failure
RestartSec=10
KillMode=control-group

[Install]
WantedBy=multi-user.target

To have the service run Cerberus, use the following commands:

systemctl daemon-reload
systemctl enable cerberus.service
systemctl start cerberus.service

Check the status of the service after starting it; if the service failed you may need to delete the Cerberus files in /tmp that were created when it was run manually earlier. You can also check the endpoint at http://localhost:8080 to see the result it returns, which is a simple text string of either “True” or “False”.
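For example, something along these lines; the True/False string is what Cerberus publishes, although the exact log output will vary:

# Confirm the service is healthy and tail its logs
systemctl status cerberus.service
journalctl -u cerberus.service -f

# Query the published cluster status, returns a simple True or False
curl http://localhost:8080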

At this point we can add our monitor to UptimeRobot, assuming the Cerberus port is exposed to the internet. Below is an image of my monitor configuration:

And there you have it, you should start seeing the results in your status page as per the screenshot at the top of the page.

API Testing in OpenShift Pipelines with Newman

If you are writing REST-based API applications you probably have some familiarity with Postman, a tool which allows you to test your APIs via an interactive GUI. However, did you know that there is a CLI equivalent of Postman called Newman that works with your existing Postman collections? Newman enables you to re-use your existing collections to integrate API testing into automated processes where a GUI would not be appropriate. While we will not go into the details of Postman or Newman here, if you are new to the tools you can check out this blog which provides a good overview of both.

Integrating Newman into OpenShift Pipelines, aka Tekton, is very easy and straightforward. In this blog we are going to look at how I am using it in my product catalog demo to test the back-end API built in Quarkus as part of the CI/CD process powered by OpenShift Pipelines. This CI/CD process is shown in the diagram below (click for a bigger version); note the two tasks where we do our API testing in the Development and Test environments, dev-test and test-test (unfortunate name) respectively. These tests are run after the new image is built and deployed in each environment and are thus considered integration tests rather than unit tests.

Product Catalog Server CICD

One of the things I love about Tekton, and thus OpenShift Pipelines, is its extensibility: it's very easy to extend by creating custom tasks that use either an existing container image or an image that you have created yourself. If you are not familiar with OpenShift Pipelines or Tekton I would highly recommend checking out the concepts documentation which provides a good overview.

The first step to using Newman in OpenShift Pipelines is to create a custom task for it. Tasks in Tekton represent a sequence of steps to accomplish a specific goal or, as the name implies, task. Each step uses a specified container image to perform its function. Fortunately, in our case there is an existing container image for Newman at docker.io/postman/newman that we can leverage without having to create our own. Our definition for the newman task appears below:

apiVersion: tekton.dev/v1beta1
kind: Task
metadata:
  name: newman
spec:
  params:
  - name: COLLECTION
    description: The collection to run, typically a remote URL pointing to the collection
    type: string
  - name: ENVIRONMENT
    description: The environment file to use from the newman-env configmap
    default: "newman-env.json"
  steps:
    - name: collections-test
      image: docker.io/postman/newman:latest
      command:
        - newman
      args:
        - run
        - $(inputs.params.COLLECTION)
        - -e
        - /config/$(inputs.params.ENVIRONMENT)
        - --bail
      volumeMounts:
        - name: newman-env
          mountPath: /config
  volumes:
    - name: newman-env
      configMap:
        name: newman-env

There are two parameters declared as part of this task, COLLECTION and ENVIRONMENT. The COLLECTION parameter references a URL to the test suite that you want to run; it's typically created using the Postman GUI and exported as a JSON file. For the pipeline in the product catalog we use this product-catalog-server-tests.json. Each entry in the collection represents a request/response to the API along with some simple tests to ensure conformance with the desired results.
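Since the task is just a thin wrapper around the newman CLI, you can run the same collection locally before wiring it into the pipeline. A quick sketch, assuming you have the collection and an environment file checked out locally:

# Run the exported Postman collection against the dev environment,
# stopping on the first failure just like the task does with --bail
newman run product-catalog-server-tests.json -e newman-dev-env.json --bail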

For example, when requesting a list of products, we test that the response code was 200 and 12 products were returned as per the picture below:

Postman

Postman

The ENVIRONMENT parameter selects a file from a configmap containing the customizations the test suite requires for the specific environment being tested. For example, the API has different URLs in the development and test environments, so we need to parameterize this in order to re-use the same test suite across all environments. You can see the environments for dev and test in my github repo. The task is designed so that the configmap, newman-env, contains all of the environments as separate files, as per the example here:

apiVersion: v1
data:
  newman-dev-env.json: "{\n\t\"id\": \"30c331d4-e961-4606-aecb-5a60e8e15213\",\n\t\"name\": \"product-catalog-dev-service\",\n\t\"values\": [\n\t\t{\n\t\t\t\"key\": \"host\",\n\t\t\t\"value\": \"server.product-catalog-dev:8080\",\n\t\t\t\"enabled\": true\n\t\t},\n\t\t{\n\t\t\t\"key\": \"scheme\",\n\t\t\t\"value\": \"http\",\n\t\t\t\"enabled\": true\n\t\t}\n\t],\n\t\"_postman_variable_scope\": \"environment\"\n}"
  newman-test-env.json: "{\n\t\"id\": \"30c331d4-e961-4606-aecb-5a60e8e15213\",\n\t\"name\": \"product-catalog-dev-service\",\n\t\"values\": [\n\t\t{\n\t\t\t\"key\": \"host\",\n\t\t\t\"value\": \"server.product-catalog-test:8080\",\n\t\t\t\"enabled\": true\n\t\t},\n\t\t{\n\t\t\t\"key\": \"scheme\",\n\t\t\t\"value\": \"http\",\n\t\t\t\"enabled\": true\n\t\t}\n\t],\n\t\"_postman_variable_scope\": \"environment\"\n}"
kind: ConfigMap
metadata:
  name: newman-env
  namespace: product-catalog-cicd
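Rather than hand-editing the escaped JSON in the configmap, one option is to generate it from the individual environment files. A sketch of that approach, assuming the files live in a local tests directory (adjust the paths to your own repo layout):

# Build the newman-env configmap from the per-environment files and
# capture it as yaml so it can be committed to your GitOps repo
oc create configmap newman-env \
  --from-file=newman-dev-env.json=tests/newman-dev-env.json \
  --from-file=newman-test-env.json=tests/newman-test-env.json \
  -n product-catalog-cicd \
  --dry-run=client -o yaml > newman-env-configmap.yaml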

In the raw configmap the environments are hard to read due to the formatting, so below is what newman-dev-env.json looks like when formatted properly. Notice the host points to the service in the product-catalog-dev namespace.

{
	"id": "30c331d4-e961-4606-aecb-5a60e8e15213",
	"name": "product-catalog-dev-service",
	"values": [
		{
			"key": "host",
			"value": "server.product-catalog-dev:8080",
			"enabled": true
		},
		{
			"key": "scheme",
			"value": "http",
			"enabled": true
		}
	],
	"_postman_variable_scope": "environment"
}

So now that we have our task, our test suite and our environments, we need to add the task to the pipeline to test an environment. You can see the complete pipeline here; an excerpt showing the pipeline testing the dev environment appears below:

    - name: dev-test
      taskRef:
        name: newman
      runAfter:
        - deploy-dev
      params:
        - name: COLLECTION
          value: https://raw.githubusercontent.com/gnunn-gitops/product-catalog-server/master/tests/product-catalog-server-tests.json
        - name: ENVIRONMENT
          value: newman-dev-env.json

When you run the task, Newman will log the results of the tests and, if any of the tests fail, will return an error code which propagates up to the pipeline and causes the pipeline itself to fail. Here is the result from testing the Dev environment:

newman
Quarkus Product Catalog
→ Get Products
GET http://server.product-catalog-dev:8080/api/product [200 OK, 3.63KB, 442ms]
✓ response is ok
✓ data valid
→ Get Existing Product
GET http://server.product-catalog-dev:8080/api/product/1 [200 OK, 388B, 14ms]
✓ response is ok
✓ Data is correct
→ Get Missing Product
GET http://server.product-catalog-dev:8080/api/product/99 [404 Not Found, 115B, 18ms]
✓ response is missing
→ Login
POST http://server.product-catalog-dev:8080/api/auth [200 OK, 165B, 145ms]
→ Get Missing User
GET http://server.product-catalog-dev:8080/api/user/8 [404 Not Found, 111B, 12ms]
✓ Is status code 404
→ Get Existing User
GET http://server.product-catalog-dev:8080/api/user/1 [200 OK, 238B, 20ms]
✓ response is ok
✓ Data is correct
→ Get Categories
GET http://server.product-catalog-dev:8080/api/category [200 OK, 458B, 16ms]
✓ response is ok
✓ data valid
→ Get Existing Category
GET http://server.product-catalog-dev:8080/api/category/1 [200 OK, 192B, 9ms]
✓ response is ok
✓ Data is correct
→ Get Missing Category
GET http://server.product-catalog-dev:8080/api/category/99 [404 Not Found, 116B, 9ms]
✓ response is missing
┌─────────────────────────┬───────────────────┬───────────────────┐
│                         │          executed │            failed │
├─────────────────────────┼───────────────────┼───────────────────┤
│              iterations │                 1 │                 0 │
├─────────────────────────┼───────────────────┼───────────────────┤
│                requests │                 9 │                 0 │
├─────────────────────────┼───────────────────┼───────────────────┤
│            test-scripts │                 8 │                 0 │
├─────────────────────────┼───────────────────┼───────────────────┤
│      prerequest-scripts │                 0 │                 0 │
├─────────────────────────┼───────────────────┼───────────────────┤
│              assertions │                13 │                 0 │
├─────────────────────────┴───────────────────┴───────────────────┤
│ total run duration: 883ms                                        │
├──────────────────────────────────────────────────────────────────┤
│ total data received: 4.72KB (approx)                             │
├──────────────────────────────────────────────────────────────────┤
│ average response time: 76ms [min: 9ms, max: 442ms, s.d.: 135ms]  │
└──────────────────────────────────────────────────────────────────┘

So to summarize, integrating API testing with OpenShift Pipelines is quick and easy. While in this example we showed the process using Newman, other API testing tools can be integrated following a similar process.

OpenShift User Application Monitoring and Grafana the GitOps way!

Update: All of the work outlined in this article is now available as a kustomize overlay in the Red Hat Canada GitOps repo here.

Traditionally in OpenShift, the monitoring stack provided out-of-the-box (OOTB) only covered cluster components. Administrators could not configure it to support their own application workloads, necessitating the deployment of a separate monitoring stack (typically the community Prometheus and Grafana). However this has changed in OpenShift 4.6, as the cluster monitoring operator now supports deploying a separate Prometheus instance for application workloads.

One great capability provided by the OpenShift cluster monitoring is that it deploys Thanos to aggregate metrics from both the cluster and application monitoring stacks, thus providing a central point for queries. At this point in time you still need to deploy your own Grafana stack for visualizations, but I expect a future version of OpenShift will support custom dashboards right in the console alongside the default ones. The monitoring stack architecture for OpenShift 4.6 is shown in the diagram (click for the architecture documentation) below:

Monitoring Architecture

In this blog entry we cover deploying the user application monitoring feature (super easy) as well as a Grafana instance (not super easy) using GitOps, specifically in this case with ArgoCD. This blog post is going to assume some familiarity with Prometheus and Grafana and will concentrate on the more challenging aspects of using GitOps to deploy everything.

The first thing we need to do is deploy the user application monitoring in OpenShift; this would typically be done as part of your cluster configuration. To do this, as per the docs, we simply need to deploy the following configmap in the openshift-monitoring namespace:

apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    enableUserWorkload: true

You can see this in my GitOps cluster-config here. Once deployed you should see the user monitoring components deployed in the openshift-user-workload-monitoring project as per below:
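If you prefer to verify this from the CLI rather than the console, a quick check is to list the pods in that project; the exact pod names and counts will vary by version:

oc get pods -n openshift-user-workload-monitoring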

Now that the user monitoring is up and running we can configure the monitoring of our applications by adding a ServiceMonitor object to define the monitoring targets. This is typically done as part of the application deployment by application teams; it is a separate activity from the deployment of the user monitoring itself, which is done as part of the cluster configuration by cluster administrators. Here is an example that I have for my product-catalog demo that monitors my Quarkus back-end:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: server
  namespace: product-catalog-dev
spec:
  endpoints:
  - path: /metrics
    port: http
    scheme: http
  selector:
    matchLabels:
      quarkus-prometheus: "true"

The ServiceMonitor above specifies that any Kubernetes service in the same namespace as the ServiceMonitor which has the label quarkus-prometheus set to true will have its metrics collected on the port named 'http' using the path '/metrics'. Of course, your application needs to expose Prometheus metrics, and most modern frameworks like Quarkus make this easy. From a GitOps perspective, deploying the ServiceMonitor is just another yaml to deploy along with the application, as you can see in my product-catalog manifests here.
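For the selector above to match, the Kubernetes Service fronting the application needs the quarkus-prometheus label and a port named http. A minimal sketch of such a Service appears below; the port numbers and pod selector are illustrative rather than copied from my manifests:

apiVersion: v1
kind: Service
metadata:
  name: server
  namespace: product-catalog-dev
  labels:
    quarkus-prometheus: "true"   # matched by the ServiceMonitor selector
spec:
  selector:
    app: server                  # illustrative pod selector
  ports:
    - name: http                 # must match the port name referenced by the ServiceMonitor
      port: 8080
      targetPort: 8080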

As an aside, please note that the user monitoring in OpenShift does not support the namespace selector in ServiceMonitor for security reasons; as a result, the ServiceMonitor must be deployed in the same namespace as the targets it defines. Thus if you have the same application in three different namespaces (say dev, test and prod) you will need to deploy the ServiceMonitor in each of those namespaces independently.

Now if I were to stop here it would hardly merit a blog post; however for most folks, once they deploy the user monitoring the next step is deploying something to visualize the metrics, and in this example that will be Grafana. Deploying the Grafana operator via GitOps in OpenShift is somewhat involved since we will use the Operator Lifecycle Manager (OLM) to do it and OLM is asynchronous. Specifically, with OLM you push the Subscription and OperatorGroup and OLM asynchronously installs and deploys the operator. From a GitOps perspective, managing the deployment of the operator and the Custom Resources (CRs) becomes tricky since the CRs cannot be applied until the operator's Custom Resource Definitions (CRDs) are installed.
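For context, the OLM resources involved are just an OperatorGroup and a Subscription. A rough sketch of what they might look like for the community Grafana operator is below; the channel and catalog source names are assumptions on my part, so check the operator's entry in OperatorHub for the current values:

apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: product-catalog-monitor
  namespace: product-catalog-monitor
spec:
  targetNamespaces:
    - product-catalog-monitor
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: grafana-operator
  namespace: product-catalog-monitor
spec:
  channel: alpha                         # assumed channel, verify in OperatorHub
  name: grafana-operator
  source: community-operators
  sourceNamespace: openshift-marketplace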

Fortunately ArgoCD has a number of features available to work around this; specifically, adding the `argocd.argoproj.io/sync-options: SkipDryRunOnMissingResource=true` annotation to our resources will instruct ArgoCD not to error out if some resources cannot be added initially. You can also combine this with retries in your ArgoCD application for more complex operators that take significant time to initialize; for Grafana though, the annotation seems to be sufficient. In my product-catalog example, I am adding this annotation across all resources using kustomize:

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

namespace: product-catalog-monitor

commonAnnotations:
    argocd.argoproj.io/sync-options: SkipDryRunOnMissingResource=true

bases:
- https://github.com/redhat-canada-gitops/catalog/grafana-operator/overlays/aggregate?ref=grafana
- ../../../manifests/app/monitor/base

resources:
- namespace.yaml
- operator-group.yaml
- cluster-monitor-view-rb.yaml

patchesJson6902:
- target:
    version: v1
    group: rbac.authorization.k8s.io
    kind: ClusterRoleBinding
    name: grafana-proxy
  path: patch-proxy-namespace.yaml
- target:
    version: v1alpha1
    group: integreatly.org
    kind: Grafana
    name: grafana
  path: patch-grafana-sar.yaml

Now it's beyond the scope of this blog to go into a detailed description of kustomize, but in a nutshell it's a patching framework that enables you to aggregate resources from either local or remote bases as well as add new resources. In the kustomize file above, we are using the Red Hat Canada standard deployment of Grafana, which includes OpenShift OAuth integration, and combining it with my application-specific Grafana monitoring resources such as datasources and dashboards, which is what we will look at next.

Continuing along, we need to set up the plumbing to connect Grafana to the cluster monitoring Thanos instance in the openshift-monitoring namespace. This blog article, Custom Grafana dashboards for Red Hat OpenShift Container Platform 4, does a great job of walking you through the process and I am not going to repeat it here; however, please do read that article before carrying on.

The first step is to define a GrafanaDataSource object:

apiVersion: integreatly.org/v1alpha1
kind: GrafanaDataSource
metadata:
  name: prometheus
spec:
  datasources:
    - access: proxy
      editable: true
      isDefault: true
      jsonData:
        httpHeaderName1: 'Authorization'
        timeInterval: 5s
        tlsSkipVerify: true
      name: Prometheus
      secureJsonData:
        httpHeaderValue1: 'Bearer ${BEARER_TOKEN}'
      type: prometheus
      url: 'https://thanos-querier.openshift-monitoring.svc.cluster.local:9091'
  name: prometheus.yaml

Notice in httpHeaderValue1 we are expected to provide a bearer token; this token comes from the grafana-serviceaccount and can only be determined at runtime, which makes it a bit of a challenge from a GitOps perspective. To manage this, we deploy a Kubernetes job as an ArgoCD PostSync hook in order to patch the GrafanaDataSource with the appropriate token:


apiVersion: batch/v1
kind: Job
metadata:
  name: patch-grafana-ds
  annotations:
    argocd.argoproj.io/hook: PostSync
    argocd.argoproj.io/hook-delete-policy: HookSucceeded
spec:
  template:
    spec:
      containers:
        - image: registry.redhat.io/openshift4/ose-cli:v4.6
          command:
            - /bin/bash
            - -c
            - |
              set -e
              echo "Patching grafana datasource with token for authentication to prometheus"
              TOKEN=`oc serviceaccounts get-token grafana-serviceaccount -n product-catalog-monitor`
              oc patch grafanadatasource prometheus --type='json' -p='[{"op":"add","path":"/spec/datasources/0/secureJsonData/httpHeaderValue1","value":"Bearer '${TOKEN}'"}]'
          imagePullPolicy: Always
          name: patch-grafana-ds
      dnsPolicy: ClusterFirst
      restartPolicy: OnFailure
      serviceAccount: patch-grafana-ds-job
      serviceAccountName: patch-grafana-ds-job
      terminationGracePeriodSeconds: 30

This job runs using a special ServiceAccount which gives the job just enough access to retrieve the token and patch the datasource; once that's done the job is deleted by ArgoCD.
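The exact RBAC lives in the repo, but as a rough sketch of what "just enough access" means here (the names below are for illustration, not copied from the repo), the job's service account needs to read the grafana-serviceaccount token and patch the GrafanaDataSource, plus a RoleBinding tying this Role to the patch-grafana-ds-job service account:

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: patch-grafana-ds-job
  namespace: product-catalog-monitor
rules:
  # Needed by 'oc serviceaccounts get-token', which reads the SA and its token secret
  - apiGroups: [""]
    resources: ["serviceaccounts", "secrets"]
    verbs: ["get", "list"]
  # Needed to patch the GrafanaDataSource with the bearer token
  - apiGroups: ["integreatly.org"]
    resources: ["grafanadatasources"]
    verbs: ["get", "patch"]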

The other thing we want to do is control access to Grafana; basically we want to grant access to OpenShift users who have view access on the Grafana route in the namespace. The Grafana operator uses the OpenShift OAuth Proxy to integrate with OpenShift. This proxy enables the definition of a Subject Access Review (SAR) to determine who is authorized to use Grafana; the SAR is simply a check on a particular object that acts as a way to determine access. For example, to only allow cluster administrators to have access to the Grafana instance we can specify that the user must have access to get namespaces:

-openshift-sar={"resource": "namespaces", "verb": "get"}

In our case we want anyone who has view access to the grafana route in the namespace where Grafana is hosted, product-catalog-monitor, to have access. So our SAR appears as follows:

-openshift-sar={"namespace":"product-catalog-monitor","resource":"routes","name":"grafana-route","verb":"get"}

To make this easy for kustomize to patch, the Red Hat Canada grafana implementation passes the SAR as an environment variable. To patch the value we can include a kustomize patch as follows:

- op: replace
  path: /spec/containers/0/env/0/value
  value: '-openshift-sar={"namespace":"product-catalog-monitor","resource":"routes","name":"grafana-route","verb":"get"}'

You can see this patch being applied at the environment level in my product-catalog example here. In my GitOps standards, the environment level is where the namespace is created, and thus it makes sense that any namespace-specific patching is done at this level.

After this it is simply a matter of including the other resources, such as the cluster-monitor-view rolebinding for the grafana-serviceaccount, so that Grafana is authorized to retrieve the metrics.
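As a sketch, that rolebinding simply grants the grafana-serviceaccount the cluster-monitoring-view cluster role; I am assuming the standard OpenShift role name here, see the actual file in the repo for the exact contents:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: grafana-cluster-monitoring-view
subjects:
  - kind: ServiceAccount
    name: grafana-serviceaccount
    namespace: product-catalog-monitor
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-monitoring-view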

If everything has gone well to this point you should be able to create a dashboard to view your application metrics.