Updating Kustomize Version in ArgoCD

I love kustomize, particularly when paired with ArgoCD, and have found it to be a great way to reduce YAML duplication. As much as I love it, there have been some annoying bugs over the months, particularly in how it handles remote repositories.

For those not familiar with remote repositories, a kustomization can import bases and resources from a git repository instead of requiring them to be on your local file system. This makes it possible to develop a common set of kustomizations that can be re-used across an organization, which is essentially what we do in the Red Hat Canada Catalog repo where we share common components across our team. Here is an example of using a remote repository, where my cluster-config repo imports the cost management operator from the Red Hat Canada Catalog:

kind: Kustomization
apiVersion: kustomize.config.k8s.io/v1beta1

bases:
- github.com/redhat-canada-gitops/catalog/cost-management-operator/overlays/default

patchesJson6902:
  - path: patch-source-and-name.yaml
    target:
      group: koku-metrics-cfg.openshift.io
      kind: KokuMetricsConfig
      name: instance
      version: v1beta1
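
For reference, the patch-source-and-name.yaml file referenced above is a standard JSON 6902 patch. Here is a minimal sketch of what it could contain; the field paths and values are illustrative assumptions about the KokuMetricsConfig spec, so adjust them to match the actual resource:

# patch-source-and-name.yaml (illustrative sketch, field paths assumed)
- op: replace
  path: /spec/source/name
  value: my-cluster-cost-source
- op: replace
  path: /spec/source/create_source
  value: true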

This works really well but, as mentioned previously, bugs prevail. The format for referencing the git repository has worked and not worked in different ways across versions and, most annoyingly, importing a kustomization which in turn has bases nested more than one level deep in the repo will fail with an evalsymlink error. A lot of these issues were tied to the usage of go-getter.

Fortunately this all seems to have been fixed in the 4.x versions of kustomize with the dropping of go-getter; unfortunately ArgoCD was still using 3.7.3 last time I checked. The good news is that it is easy enough to create your own version of the ArgoCD image and include whatever version of kustomize you want. The ArgoCD documentation goes through the options for including custom tools, however at the moment the operator only supports embedding new tools in an image.

As a result, the first step to using a custom version of kustomize (lol alliteration!) is to create the image through a Dockerfile:

FROM docker.io/argoproj/argocd:v1.7.12
 
# Switch to root for the ability to perform install
USER root
 
ARG KUSTOMIZE_VERSION=v4.0.1
 
# Install tools needed for your repo-server to retrieve & decrypt secrets, render manifests 
# (e.g. curl, awscli, gpg, sops)
RUN apt-get update && \
    apt-get install -y \
        curl \
        awscli \
        gpg && \
    apt-get clean && \
    curl -o /tmp/kustomize.tar.gz -L https://github.com/kubernetes-sigs/kustomize/releases/download/kustomize%2F${KUSTOMIZE_VERSION}/kustomize_${KUSTOMIZE_VERSION}_linux_amd64.tar.gz && \
    ls /tmp && \
    tar -xvf /tmp/kustomize.tar.gz -C /usr/local/bin && \
    rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*
 
# Switch back to non-root user
USER argocd

Note that in the Dockerfile above I have chosen to overwrite the existing kustomize version. As per the ArgoCD Custom Tooling documentation, you can instead add multiple versions of kustomize and reference specific versions in your applications, as sketched below. However I see my fix here as a temporary measure until the ArgoCD image catches up with kustomize, so I would prefer to keep my application YAML unencumbered with kustomize version references.
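
In case you do want the multi-version approach, here is a rough sketch of what it looks like based on the upstream Custom Tooling documentation; the configmap key format and binary path are assumptions to verify against your ArgoCD version. You register the extra binary in the argocd-cm ConfigMap and then pin it from the Application:

# argocd-cm ConfigMap: register an additional kustomize binary baked into the image
data:
  kustomize.path.v4.0.1: /custom-tools/kustomize_4_0_1

# Application: pin this application to that kustomize version
spec:
  source:
    kustomize:
      version: v4.0.1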

To build the image, simply run the following, substituting your own registry and repository for mine:

docker build . -t quay.io/gnunn/argocd:v1.7.12
docker push quay.io/gnunn/argocd:v1.7.12
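
If you want to sanity check the image before deploying it, you can invoke the kustomize binary inside it directly. This is just a quick verification using the tag built above:

docker run --rm --entrypoint kustomize quay.io/gnunn/argocd:v1.7.12 version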

Once we have the image, we just need to update the ArgoCD CR that the operator uses so that it references our image, as per the example below:

apiVersion: argoproj.io/v1alpha1
kind: ArgoCD
metadata:
  name: example-argocd
  namespace: argocd
spec:
  image: quay.io/gnunn/argocd
  version: v1.7.12
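
Once the operator rolls out the new image you can confirm that the repo server is actually running the updated kustomize. Something along these lines should work; the deployment name here assumes the operator's <argocd-name>-repo-server naming convention, so adjust it for your instance and namespace:

oc -n argocd exec deploy/example-argocd-repo-server -- kustomize version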

Razer Book 13 (2020) and Linux

Razer Book 13

So I bought a new laptop this week, a Razer Book 13, and I thought it might be helpful to others to write a few words about how well Linux works on this laptop. TLDR; despite Razer’s terrible reputation when it comes to Linux, things are pretty awesome, with only configuration of the Chroma keyboard missing.

I was looking for a machine with more portability than my XPS 15, since I bought a desktop server last year for all of my heavy workloads and no longer needed a super powerful laptop with gobs of memory. As well, my XPS 15 is almost four years old and was having issues with the touchpad and fans that were getting on the noisy side. Going with an XPS 13 would have been the obvious choice, but after four years with the XPS 15 I really wanted to change things up, so enter the Razer Book 13.

First let’s talk a bit about the overall design of the laptop. It’s a stunning looking machine with its CNC machined case that’s just gorgeous to look at. The display is very good, though I do get a certain amount of IPS glow that is particularly noticeable with the GDM grey login screen. Other than that the display is gorgeous and super bright, so no issues with outdoor visibility. When using the laptop display I use the experimental fractional scaling support in Wayland to set it at 125%, which works well for me; as I get older my eyes are not so hot for small text.

If you check online there have been some complaints about the keyboard being inconsistent with regards to key presses. I haven’t had any issues myself, but honestly I’m a two finger (ok, maybe three) typist who is very heavy on the keyboard. I haven’t had any missed keystrokes and it’s a perfectly reasonable typing experience. The touchpad is a glass precision touchpad and works really well, no issues with it at all.

My one complaint is that the mid-level model (16 GB, touchscreen) which I purchased only comes with a 256 GB SSD, which is way too small nowadays. Fortunately it’s very straightforward to open the case and I ended up replacing it with a Western Digital SN750 1 TB SSD. Unfortunately I didn’t realize that Tiger Lake now supports PCIe 4.0, so an SN850 might have been a better choice, albeit more expensive. I’m not doing anything disk intensive so the SN750 is fine for me.

In terms of a Linux installation I went with my usual choice of Arch Linux and set it up in a dual boot configuration with Windows. I wanted to have Windows available in order to perform firmware updates, since Razer doesn’t participate in LVFS like Dell does. I installed Windows in the first 128 GB partition and then went ahead and installed Arch Linux. Another mistake I made here was not overriding the default size Windows uses for the EFI partition: it’s only 100 MB, which is too small once you add KMS and things like Plymouth. For now I’ve disabled the fallback image, but on my next installation I’ll make sure to correct that.

For the Arch installation I typically use LVM and Ext4 with LUKS, however in the spirit of changing things up I decided to go with BTRFS this time. It’s the default now on Fedora, and if it’s good enough for Fedora it’s good enough for me! There is an excellent write-up on performing the installation that I followed, which can be found here; since all the details are there I will not go into depth on them. Having said that, BTRFS subvolumes are super cool and well worth checking out.

Once Arch was installed I proceeded to install GNOME and test everything out, and to my delight all of the hardware works fine. The wifi, bluetooth, trackpad, and brightness controls (display and keyboard) all work great. The Chroma keyboard works in its default lighting mode, Spectrum, but there’s no way to configure a different lighting mode or per-key lighting. I expected that, and the default lighting doesn’t bother me at all. I had hoped that I could set the lighting in Windows and it would carry over when booting into Linux, but no luck. Someone has provided the necessary info to the openrazer project so hopefully support will come sooner rather than later. Update: the Chroma keyboard can now be configured via openrazer with this PR, hopefully it gets merged soon.

I ended up replacing my Dell TB16 dock as it would not charge the Razer; apparently it only puts out 60 watts of power for non-Dell laptops. I went with a Pluggable Thunderbolt 3 dock as a replacement that does everything I need. The dock works well with both Windows and Linux and I haven’t had any issues with it so far.

Pluggable Dock

So all in all I am very happy with my purchase and hopefully I will have many years of enjoyment with this laptop. If you have any questions or something you want me to check feel free to ping me in the comments below or in the Razer Book 13 Linux thread I started on reddit.

Building a Simple Up/Down Status Dashboard for OpenShift

OpenShift provides a wealth of monitoring and alerting, however sometimes it can be handy to surface a simple up/down signal for an OpenShift cluster that can be easily interpreted by tools like UptimeRobot. This enables you to provide an operational or business level dashboard of the status of your cluster to users and application owners who may not necessarily be familiar with all of the nuances of OpenShift’s or Kubernetes’ internals.

UptimeRobot Dashboard

The health status of an OpenShift cluster depends on many things such as etcd, operators, nodes, the API, etc., so how do we aggregate all of this information? While you could certainly write your own code to do it, fortunately a cool utility called Cerberus already provides this capability. Cerberus was born out of Red Hat’s Performance and Scaling group and was designed to be used with Kraken, a chaos engineering tool. A chaos engineering tool isn’t very useful if you can’t determine the status of the cluster, and thus Cerberus was born.

A number of blog posts have already been written about Kraken and Cerberus from a chaos engineering point of view, which you can view here and here. In this post we are going to focus on the basics of using Cerberus for simple health checking.

One thing to note about Cerberus is that it is aggressive about returning an unhealthy state even if the cluster is operational. For example, if you set it to monitor a namespace, any pod failure in a multi-pod deployment in that namespace will trigger an unhealthy flag even if the other pods in the deployment are running and still servicing requests. As a result, some tuning of Cerberus or the use of custom checks is required if you want a more SLA focused view.

To get started with Cerberus simply clone the git repo into an appropriate directory on the system where you want to run it. While you can run it inside of OpenShift, it is highly recommended to run it outside the cluster since your cluster monitoring tool should not be dependent on the cluster itself. To clone the repo, just run the following:

git clone https://github.com/cloud-bulldozer/cerberus
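
Cerberus is a Python tool, so its dependencies need to be installed as well. At the time of writing the repo ships a requirements.txt for this; assuming that is still the case:

cd cerberus
pip3 install -r requirements.txt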

In order for Cerberus to run, it requires access to a kubeconfig file where a user has already been authenticated. For security purposes I would highly recommend using a serviceaccount with the cluster-reader role rather than using a user with cluster-admin. The commands below will create a serviceaccount in the openshift-monitoring namespace, bind it to the cluster-reader role and generate a kubeconfig that Cerberus can use to authenticate to the cluster.

oc create sa cerberus -n openshift-monitoring
oc adm policy add-cluster-role-to-user cluster-reader -z cerberus -n openshift-monitoring
oc serviceaccounts create-kubeconfig cerberus -n openshift-monitoring > config/kubeconfig
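
Before wiring it into Cerberus, it is worth confirming that the generated kubeconfig works and that the cluster-reader binding took effect. A simple read-only query such as the following should succeed:

oc --kubeconfig config/kubeconfig get nodes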

Cerberus can automatically create a token for the prometheus-k8s service account so that it can access Prometheus to pull metrics. To enable this we need to define a role that gives Cerberus the necessary permissions and bind it to the cerberus service account. Create a file with the following content:

---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: cerberus
  namespace: openshift-monitoring
rules:
  - apiGroups:
      - ""
    resources:
      - serviceaccounts
      - secrets
    verbs:
      - get
      - list
      - patch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: cerberus-service-account-token
  namespace: openshift-monitoring  
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: cerberus
subjects:
  - kind: ServiceAccount
    name: cerberus
    namespace: openshift-monitoring

Then apply it to the cluster with "oc apply -f".
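
For example, assuming you saved the manifests above to a file called cerberus-rbac.yaml (the filename is arbitrary):

oc apply -f cerberus-rbac.yaml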

To configure Cerberus you can edit the existing config.yaml file in the repo or create a new one. Creating a new one is highly recommended so that a git pull doesn’t clobber your changes:

cp config/config.yaml config/my-config.yaml

Once you have the config file, you can go through the options and set what you need. Here is an example of my config file which is really just the example config with the kubeconfig parameter tweaked.

cerberus:
    distribution: openshift                              # Distribution can be kubernetes or openshift
    kubeconfig_path: /opt/cerberus/config/kubeconfig     # Path to kubeconfig
    port: 8080                                           # http server port where cerberus status is published
    watch_nodes: True                                    # Set to True for the cerberus to monitor the cluster nodes
    watch_cluster_operators: True                        # Set to True for cerberus to monitor cluster operators
    watch_url_routes:                                    # Route url's you want to monitor, this is a double array with the url and optional authorization parameter
    watch_namespaces:                                    # List of namespaces to be monitored
        -    openshift-etcd
        -    openshift-apiserver
        -    openshift-kube-apiserver
        -    openshift-monitoring
        -    openshift-kube-controller-manager
        -    openshift-machine-api
        -    openshift-kube-scheduler
        -    openshift-ingress
        -    openshift-sdn                               # When enabled, it will check for the cluster sdn and monitor that namespace
    cerberus_publish_status: True                        # When enabled, cerberus starts a light weight http server and publishes the status
    inspect_components: False                            # Enable it only when OpenShift client is supported to run
                                                         # When enabled, cerberus collects logs, events and metrics of failed components

    prometheus_url:                                      # The prometheus url/route is automatically obtained in case of OpenShift, please set it when the distribution is Kubernetes.
    prometheus_bearer_token:                             # The bearer token is automatically obtained in case of OpenShift, please set it when the distribution is Kubernetes. This is needed to authenticate with prometheus.
                                                         # This enables Cerberus to query prometheus and alert on observing high Kube API Server latencies. 

    slack_integration: False                             # When enabled, cerberus reports the failed iterations in the slack channel
                                                         # The following env vars needs to be set: SLACK_API_TOKEN ( Bot User OAuth Access Token ) and SLACK_CHANNEL ( channel to send notifications in case of failures )
                                                         # When slack_integration is enabled, a watcher can be assigned for each day. The watcher of the day is tagged while reporting failures in the slack channel. Values are slack member ID's.
    watcher_slack_ID:                                        # (NOTE: Defining the watcher id's is optional and when the watcher slack id's are not defined, the slack_team_alias tag is used if it is set else no tag is used while reporting failures in the slack channel.)
        Monday:
        Tuesday:
        Wednesday:
        Thursday:
        Friday:
        Saturday:
        Sunday:
    slack_team_alias:                                    # The slack team alias to be tagged while reporting failures in the slack channel when no watcher is assigned

    custom_checks:                                       # Relative paths of files containing additional user defined checks

tunings:
    iterations: 5                                        # Iterations to loop before stopping the watch, it will be replaced with infinity when the daemon mode is enabled
    sleep_time: 60                                       # Sleep duration between each iteration
    kube_api_request_chunk_size: 250                     # Large requests will be broken into the specified chunk size to reduce the load on API server and improve responsiveness.
    daemon_mode: True                                    # Iterations are set to infinity which means that the cerberus will monitor the resources forever
    cores_usage_percentage: 0.5                          # Set the fraction of cores to be used for multiprocessing

database:
    database_path: /tmp/cerberus.db                      # Path where cerberus database needs to be stored
    reuse_database: False                                # When enabled, the database is reused to store the failures

At this point you can run Cerberus manually and test it out as follows:

$ sudo python3 /opt/cerberus/start_cerberus.py --config /opt/cerberus/config/config-home.yaml
               _                         
  ___ ___ _ __| |__   ___ _ __ _   _ ___ 
 / __/ _ \ '__| '_ \ / _ \ '__| | | / __|
| (_|  __/ |  | |_) |  __/ |  | |_| \__ \
 \___\___|_|  |_.__/ \___|_|   \__,_|___/
                                         

2021-01-29 12:01:01,030 [INFO] Starting ceberus
2021-01-29 12:01:01,037 [INFO] Initializing client to talk to the Kubernetes cluster
2021-01-29 12:01:01,144 [INFO] Fetching cluster info
2021-01-29 12:01:01,260 [INFO] 
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.12    True        False         3d20h   Cluster version is 4.6.12

2021-01-29 12:01:01,365 [INFO] Kubernetes master is running at https://api.home.ocplab.com:6443

2021-01-29 12:01:01,365 [INFO] Publishing cerberus status at http://0.0.0.0:8080
2021-01-29 12:01:01,381 [INFO] Starting http server at http://0.0.0.0:8080

2021-01-29 12:01:01,623 [INFO] Daemon mode enabled, cerberus will monitor forever
2021-01-29 12:01:01,623 [INFO] Ignoring the iterations set

2021-01-29 12:01:01,955 [INFO] Iteration 1: Node status: True
2021-01-29 12:01:02,244 [INFO] Iteration 1: Cluster Operator status: True
2021-01-29 12:01:02,380 [INFO] Iteration 1: openshift-ingress: True
2021-01-29 12:01:02,392 [INFO] Iteration 1: openshift-apiserver: True
2021-01-29 12:01:02,396 [INFO] Iteration 1: openshift-sdn: True
2021-01-29 12:01:02,399 [INFO] Iteration 1: openshift-kube-scheduler: True
2021-01-29 12:01:02,400 [INFO] Iteration 1: openshift-machine-api: True
2021-01-29 12:01:02,406 [INFO] Iteration 1: openshift-kube-controller-manager: True
2021-01-29 12:01:02,425 [INFO] Iteration 1: openshift-etcd: True
2021-01-29 12:01:02,443 [INFO] Iteration 1: openshift-monitoring: True
2021-01-29 12:01:02,445 [INFO] Iteration 1: openshift-kube-apiserver: True
2021-01-29 12:01:02,446 [INFO] HTTP requests served: 0 

2021-01-29 12:01:02,446 [WARNING] Iteration 1: Masters without NoSchedule taint: ['home-jcn2d-master-0', 'home-jcn2d-master-1', 'home-jcn2d-master-2']

2021-01-29 12:01:02,592 [INFO] []

2021-01-29 12:01:02,592 [INFO] Sleeping for the specified duration: 60

Great, Cerberus is up and running now, but wouldn’t it be great if it ran automatically as a service? Let’s go ahead and set that up by creating a systemd service. First let’s set up a bash script called start.sh in the root of our cerberus directory as follows:

#!/bin/bash
 
echo "Starting Cerberus..."
 
python3 /opt/cerberus/start_cerberus.py --config /opt/cerberus/config/my-config.yaml

Next, create a systemd service at /etc/systemd/system/cerberus.service and add the following to it:

[Unit]
Description=Cerberus OpenShift Health Check

Wants=network.target
After=syslog.target network-online.target

[Service]
Type=simple
ExecStart=/bin/bash /opt/cerberus/start.sh
Restart=on-failure
RestartSec=10
KillMode=control-group

[Install]
WantedBy=multi-user.target

To have the service run Cerberus, use the following commands:

systemctl daemon-reload
systemctl enable cerberus.service
systemctl start cerberus.service

Check the status of the service after starting it; if the service failed you may need to delete the Cerberus files in /tmp that were created when it was run manually earlier. You can also check the endpoint at http://localhost:8080 to see the result it returns, which is a simple text string of either “True” or “False”.
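
A couple of quick commands cover both checks; the “True” on the last line is simply the text the Cerberus status endpoint returns when the cluster is healthy:

$ systemctl status cerberus.service
$ curl http://localhost:8080
True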

At this point we can add our monitor to UptimeRobot, assuming the Cerberus port is exposed to the internet. Below is an image of my monitor configuration:

And there you have it, you should start seeing the results in your status page as per the screenshot at the top of the page.