Update: All of the work outlined in this article is now available as a kustomize overlay in the Red Hat Canada GitOps repo here.
Traditionally in OpenShift, the monitoring stack provided out-of-the-box (OOTB) was only available for cluster monitoring. Administrators could not configure it to support their own application workloads, necessitating the deployment of a separate monitoring stack (typically community Prometheus and Grafana). However, this has changed in OpenShift 4.6 as the cluster monitoring operator now supports deploying a separate Prometheus instance for application workloads.
One great capability provided by the OpenShift cluster monitoring is that it deploys Thanos to aggregate metrics from both the cluster and application monitoring stacks, thus providing a central point for queries. At this point in time you still need to deploy your own Grafana stack for visualizations, but I expect a future version of OpenShift will support custom dashboards right in the console alongside the default ones. The monitoring stack architecture for OpenShift 4.6 is shown in the diagram below (click for the architecture documentation):
In this blog entry we cover deploying the user application monitoring feature (super easy) as well as a Grafana instance (not super easy) using GitOps, specifically in this case with ArgoCD. This blog post is going to assume some familiarity with Prometheus and Grafana and will concentrate on the more challenging aspects of using GitOps to deploy everything.
The first thing we need to do is deploy the user application monitoring in OpenShift; this would typically be done as part of your cluster configuration. To do this, as per the docs, we simply need to deploy the following ConfigMap in the openshift-monitoring namespace:
```
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    enableUserWorkload: true
```
You can see this in my GitOps cluster-config here. Once deployed you should see the user monitoring components deployed in the openshift-user-workload-monitoring project as per below:
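If you prefer to verify from the command line, a quick sanity check is to list the pods in that project (exact pod names and counts will vary by cluster version):

```
oc get pods -n openshift-user-workload-monitoring
```

You should see the prometheus-operator, prometheus-user-workload and thanos-ruler-user-workload pods running.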
Now that the user monitoring is up and running, we can configure the monitoring of our applications by adding a ServiceMonitor object to define the monitoring targets. This is typically done by application teams as part of the application deployment; it is a separate activity from the deployment of the user monitoring itself, which is done by cluster administrators as part of the cluster configuration. Here is an example that I have for my product-catalog demo that monitors my Quarkus back-end:
```
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: server
  namespace: product-catalog-dev
spec:
  endpoints:
    - path: /metrics
      port: http
      scheme: http
  selector:
    matchLabels:
      quarkus-prometheus: "true"
```
The ServiceMonitor above defines that any Kubernetes Service in the same namespace as the ServiceMonitor which has the label quarkus-prometheus set to true will have its metrics collected on the port named ‘http’ using the path ‘/metrics’. Of course, your application needs to be enabled for Prometheus metrics, and most modern frameworks like Quarkus make this easy. From a GitOps perspective, deploying the ServiceMonitor is just another piece of YAML to deploy along with the application, as you can see in my product-catalog manifests here.
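For reference, a Service that this ServiceMonitor would select might look like the following. This is a hypothetical example rather than the actual manifest from the repo; the important bits are the quarkus-prometheus label and a port named http matching the endpoint definition above:

```
apiVersion: v1
kind: Service
metadata:
  name: server
  namespace: product-catalog-dev
  labels:
    quarkus-prometheus: "true"
spec:
  selector:
    app: server
  ports:
    # the port name must match 'port: http' in the ServiceMonitor endpoint
    - name: http
      port: 8080
      targetPort: 8080
```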
As an aside, please note that the user monitoring in OpenShift does not support the namespace selector in ServiceMonitor for security reasons; as a result, the ServiceMonitor must be deployed in the same namespace as the targets being defined. Thus if you have the same application in three different namespaces (say dev, test and prod) you will need to deploy the ServiceMonitor in each of those namespaces independently.
Now if I were to stop here it would hardly merit a blog post; however, for most folks, once they deploy the user monitoring the next step is deploying something to visualize the metrics, and in this example that will be Grafana. Deploying the Grafana operator via GitOps in OpenShift is somewhat involved since we will use the Operator Lifecycle Manager (OLM) to do it, but OLM is asynchronous. Specifically, with OLM you push the Subscription and OperatorGroup, and asynchronously OLM will install and deploy the operator. From a GitOps perspective, managing the deployment of the operator and its Custom Resources (CRs) becomes tricky since the CRs cannot be installed until the operator's Custom Resource Definitions (CRDs) are installed.
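To make the OLM piece concrete, the operator installation itself boils down to an OperatorGroup plus a Subscription along the following lines. This is a sketch rather than the exact manifests from the repo; in particular the channel and catalog source depend on which version of the Grafana operator you are installing:

```
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: product-catalog-monitor
  namespace: product-catalog-monitor
spec:
  # only watch the namespace Grafana is deployed in
  targetNamespaces:
    - product-catalog-monitor
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: grafana-operator
  namespace: product-catalog-monitor
spec:
  channel: alpha
  name: grafana-operator
  source: community-operators
  sourceNamespace: openshift-marketplace
```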
Fortunately ArgoCD has a number of features available to work around this; specifically, adding the `argocd.argoproj.io/sync-options: SkipDryRunOnMissingResource=true` annotation to our resources will instruct ArgoCD not to error out if some resources cannot be added initially. You can also combine this with retries in your ArgoCD Application for more complex operators that take significant time to initialize; for Grafana though, the annotation seems to be sufficient. In my product-catalog example, I am adding this annotation across all resources using kustomize:
```
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

namespace: product-catalog-monitor

commonAnnotations:
  argocd.argoproj.io/sync-options: SkipDryRunOnMissingResource=true

bases:
  - https://github.com/redhat-canada-gitops/catalog/grafana-operator/overlays/aggregate?ref=grafana
  - ../../../manifests/app/monitor/base

resources:
  - namespace.yaml
  - operator-group.yaml
  - cluster-monitor-view-rb.yaml

patchesJson6902:
  - target:
      version: v1
      group: rbac.authorization.k8s.io
      kind: ClusterRoleBinding
      name: grafana-proxy
    path: patch-proxy-namespace.yaml
  - target:
      version: v1alpha1
      group: integreatly.org
      kind: Grafana
      name: grafana
    path: patch-grafana-sar.yaml
```
Now it’s beyond the scope of this blog to go into a detailed description of kustomize, but in a nutshell it’s a patching framework that enables you to aggregate resources from either local or remote bases as well as add new resources. In the kustomize file above, we are using the Red Hat Canada standard deployment of Grafana, which includes OpenShift OAuth integration, and combining it with my application-specific monitoring resources such as Grafana datasources and dashboards, which is what we will look at next.
Continuing along, we need to set up the plumbing to connect Grafana to the cluster monitoring Thanos instance in the openshift-monitoring namespace. This blog article, Custom Grafana dashboards for Red Hat OpenShift Container Platform 4, does a great job of walking you through the process and I am not going to repeat it here; however, please do read that article before carrying on.
The first step is to define a GrafanaDataSource object:
```
apiVersion: integreatly.org/v1alpha1
kind: GrafanaDataSource
metadata:
  name: prometheus
spec:
  name: prometheus.yaml
  datasources:
    - access: proxy
      editable: true
      isDefault: true
      jsonData:
        httpHeaderName1: 'Authorization'
        timeInterval: 5s
        tlsSkipVerify: true
      name: Prometheus
      secureJsonData:
        httpHeaderValue1: 'Bearer ${BEARER_TOKEN}'
      type: prometheus
      url: 'https://thanos-querier.openshift-monitoring.svc.cluster.local:9091'
```
Notice in httpHeaderValue1 we are expected to provide a bearer token; this token comes from the grafana-serviceaccount and can only be determined at runtime, which makes it a bit of a challenge from a GitOps perspective. To manage this, we deploy a Kubernetes Job as an ArgoCD PostSync hook in order to patch the GrafanaDataSource with the appropriate token:
```
apiVersion: batch/v1
kind: Job
metadata:
  name: patch-grafana-ds
  annotations:
    argocd.argoproj.io/hook: PostSync
    argocd.argoproj.io/hook-delete-policy: HookSucceeded
spec:
  template:
    spec:
      containers:
        - name: patch-grafana-ds
          image: registry.redhat.io/openshift4/ose-cli:v4.6
          imagePullPolicy: Always
          command:
            - /bin/bash
            - -c
            - |
              set -e
              echo "Patching grafana datasource with token for authentication to prometheus"
              TOKEN=`oc serviceaccounts get-token grafana-serviceaccount -n product-catalog-monitor`
              oc patch grafanadatasource prometheus --type='json' -p='[{"op":"add","path":"/spec/datasources/0/secureJsonData/httpHeaderValue1","value":"Bearer '${TOKEN}'"}]'
      dnsPolicy: ClusterFirst
      restartPolicy: OnFailure
      serviceAccount: patch-grafana-ds-job
      serviceAccountName: patch-grafana-ds-job
      terminationGracePeriodSeconds: 30
```
This job runs using a special ServiceAccount which gives the job just enough access to retrieve the token and patch the datasource; once that’s done, the job is deleted by ArgoCD.
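The actual RBAC for that ServiceAccount lives in the repo; as a rough sketch it amounts to something like the following (the Role rules here are an assumption based on what the script does, namely reading the service account token and patching the GrafanaDataSource):

```
apiVersion: v1
kind: ServiceAccount
metadata:
  name: patch-grafana-ds-job
  namespace: product-catalog-monitor
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: patch-grafana-ds-job
  namespace: product-catalog-monitor
rules:
  # read the grafana-serviceaccount and its token secret
  - apiGroups: [""]
    resources: ["serviceaccounts", "secrets"]
    verbs: ["get", "list"]
  # patch the GrafanaDataSource with the bearer token
  - apiGroups: ["integreatly.org"]
    resources: ["grafanadatasources"]
    verbs: ["get", "patch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: patch-grafana-ds-job
  namespace: product-catalog-monitor
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: patch-grafana-ds-job
subjects:
  - kind: ServiceAccount
    name: patch-grafana-ds-job
    namespace: product-catalog-monitor
```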
The other thing we want to do is control access to Grafana: basically, we want to grant access to any OpenShift user who has view access on the Grafana route in the namespace. The Grafana operator uses the OpenShift OAuth Proxy to integrate with OpenShift. This proxy enables the definition of a Subject Access Review (SAR) to determine who is authorized to use Grafana; the SAR is simply a check against a particular object that acts as a way to determine access. For example, to only allow cluster administrators to have access to the Grafana instance, we can specify that the user must have access to get namespaces:
-openshift-sar={"resource": "namespaces", "verb": "get"}
In our case we want anyone who has view access to the grafana route in the namespace where Grafana is hosted, product-catalog-monitor, to have access. So our SAR would appear as follows:
-openshift-sar={"namespace":"product-catalog-monitor","resource":"routes","name":"grafana-route","verb":"get"}
To make this easy for kustomize to patch, the Red Hat Canada Grafana implementation passes the SAR as an environment variable. To patch the value we can include a kustomize patch as follows:
```
- op: replace
  path: /spec/containers/0/env/0/value
  value: '-openshift-sar={"namespace":"product-catalog-monitor","resource":"routes","name":"grafana-route","verb":"get"}'
```
You can see this patch being applied at the environment level in my product-catalog example here. In my GitOps standards, the environments layer is where the namespace is created, and thus it makes sense that any namespace-specific patching that is required is done at this level.
After this it is simply a matter of including the other resources, such as the cluster-monitor-view RoleBinding for the grafana-serviceaccount, so that Grafana is authorized to retrieve the metrics.
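For reference, that RoleBinding is essentially a ClusterRoleBinding granting the built-in cluster-monitoring-view ClusterRole to the Grafana service account, roughly as follows (the binding name here is an assumption; check cluster-monitor-view-rb.yaml in the repo for the exact manifest):

```
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: grafana-cluster-monitoring-view
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  # built-in OpenShift role that allows reading cluster metrics via the Thanos Querier
  name: cluster-monitoring-view
subjects:
  - kind: ServiceAccount
    name: grafana-serviceaccount
    namespace: product-catalog-monitor
```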
If everything has gone well to this point you should be able to create a dashboard to view your application metrics.
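Dashboards themselves can also live in Git and be deployed via the GrafanaDashboard CR. A minimal sketch is shown below; the name and label are hypothetical, the json block is whatever you export from the Grafana UI, and the label must match the dashboardLabelSelector configured in your Grafana CR:

```
apiVersion: integreatly.org/v1alpha1
kind: GrafanaDashboard
metadata:
  name: product-catalog
  namespace: product-catalog-monitor
  labels:
    app: grafana
spec:
  json: |
    {
      "title": "Product Catalog",
      "panels": []
    }
```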
In newer versions of Grafana they introduced a File Provider option that greatly simplifies this setup:
https://grafana.com/docs/grafana/latest/administration/configuration/#file-provider
Instead of setting up the job to patch the token you can use the File provider to reference the SA secret mounted to the filesystem:
```
httpHeaderValue1: Bearer $__file{/var/run/secrets/kubernetes.io/serviceaccount/token}
```