Virtualisation, Storage and various other ramblings.

Category: Kubernetes (Page 1 of 4)

Changing the default apps wildcard certificate in OCP4

In a standard OCP4 installation, several route objects are created by default and secured with an internally signed wildcard certificate.

These routes are configured as <app-name>.apps.<domain>. In my example, I have a cluster with the assigned domain ocp-acm.virtualthoughts.co.uk, which results in the routes below:

oauth-openshift.apps.ocp-acm.virtualthoughts.co.uk
console-openshift-console.apps.ocp-acm.virtualthoughts.co.uk
grafana-openshift-monitoring.apps.ocp-acm.virtualthoughts.co.uk
thanos-querier-openshift-monitoring.apps.ocp-acm.virtualthoughts.co.uk
prometheus-k8s-openshift-monitoring.apps.ocp-acm.virtualthoughts.co.uk
alertmanager-main-openshift-monitoring.apps.ocp-acm.virtualthoughts.co.uk

Inspecting console-openshift-console.apps.ocp-acm.virtualthoughts.co.uk shows us the default wildcard TLS certificate used by the Ingress Operator:

Because it’s internally signed, it’s not trusted by default by external clients. However, this can be changed.

Installing Cert-Manager

OperatorHub includes the upstream cert-manager operator, as well as one maintained by Red Hat. Either can be installed to manage the lifecycle of our new certificate. Navigate to Operators -> OperatorHub -> cert-manager and install.

Create Secret, Issuer and Certificate resources

With Cert-Manager installed, we need to provide configuration so it knows how to issue challenges and generate certificates. In this example:

  • Secret – A client secret created from my cloud provider, used by cert-manager to authenticate when satisfying the challenge. In this example the provider is AzureDNS, as I’m using the DNS-01 challenge type to prove ownership of the domain.
  • ClusterIssuer – A cluster-wide configuration that, when referenced, determines how certificates are obtained (issued). A cluster can have multiple issuers, either namespace scoped (Issuer) or cluster scoped (ClusterIssuer), pointing to different providers and configurations.
  • Certificate – TLS certificates can be generated automatically from ingress annotations. In this example, however, a Certificate resource is used to request and store the certificate with its own lifecycle, not tied to a specific ingress object.

Let’s Encrypt provides wildcard certificates, but only through the DNS-01 challenge. The HTTP-01 challenge cannot be used to issue wildcard certificates. This is reflected in the config:

apiVersion: v1
kind: Secret
metadata:
  name: azuredns-config
  namespace: cert-manager
type: Opaque
data:
  client-secret: <Base64 Encoded Secret from Azure>
---
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  # ClusterIssuer is cluster scoped, so no namespace is set
  name: letsencrypt-production
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: <email>
    privateKeySecretRef:
      name: letsencrypt
    solvers:
    - dns01:
        azureDNS:
          clientID: <clientID>
          clientSecretSecretRef:
            name: azuredns-config
            key: client-secret
          subscriptionID: <subscriptionID>
          tenantID: <tenantID>
          resourceGroupName: <resourceGroupName>
          hostedZoneName: virtualthoughts.co.uk
          # Azure Cloud Environment, default to AzurePublicCloud
          environment: AzurePublicCloud
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: wildcard-apps-certificate
  namespace: openshift-ingress
spec:
  secretName: apps-wildcard-tls
  issuerRef:
    name: letsencrypt-production
    kind: ClusterIssuer
  commonName: "*.apps.ocp-acm.virtualthoughts.co.uk"
  dnsNames:
  - "*.apps.ocp-acm.virtualthoughts.co.uk"

Applying the above creates the objects required to request, receive and store a wildcard certificate from Let’s Encrypt, using the DNS-01 challenge with AzureDNS.
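The client-secret value in the Secret must be Base64 encoded. A minimal sketch of producing that value is below; the function name encode_secret is hypothetical, and it assumes a base64 utility is on the PATH. printf is used rather than echo so that no trailing newline ends up inside the encoded secret.

```shell
# Hypothetical helper: Base64-encode a secret for use under `data:` in a
# Kubernetes Secret manifest.
encode_secret() {
  # printf '%s' avoids the trailing newline that echo would add;
  # tr strips the line wrapping some base64 implementations apply.
  printf '%s' "$1" | base64 | tr -d '\n'
}

# Example (placeholder value, not a real secret):
#   encode_secret 'my-azure-client-secret'
```

Alternatively, placing the raw value under stringData instead of data lets Kubernetes do the encoding.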

The certificate may take around two minutes to become Ready, as the DNS-01 challenge depends on a TXT record being created and propagated.

oc get cert -A

NAMESPACE           NAME                        READY   SECRET              AGE
openshift-ingress   wildcard-apps-certificate   True    apps-wildcard-tls   33m
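
For a DNS-01 challenge, the ACME server looks for a TXT record named _acme-challenge.<domain>, and for a wildcard name the leading *. is dropped first. A small sketch of deriving that record name follows, which can help when checking whether the record has appeared in the DNS zone while the certificate is still pending; the function name acme_challenge_record is hypothetical.

```shell
# Hypothetical helper: print the TXT record name validated by a DNS-01
# challenge for the given certificate DNS name.
acme_challenge_record() {
  # ${1#\*.} strips a leading "*." (wildcard) if present.
  printf '_acme-challenge.%s\n' "${1#\*.}"
}

# Example: check whether the record is visible yet (requires dig):
#   dig +short TXT "$(acme_challenge_record '*.apps.ocp-acm.virtualthoughts.co.uk')"
```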

Patch the Ingress Operator

With the certificate created, the Ingress Operator needs reconfiguring to reference the secret that holds our new certificate:

oc patch ingresscontroller.operator default \
--type=merge -p \
'{"spec":{"defaultCertificate":{"name":"apps-wildcard-tls"}}}' \
--namespace=openshift-ingress-operator
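
If the secret is named differently, only the patch body changes. The sketch below builds the same merge patch for an arbitrary secret name; the function name ingress_cert_patch is hypothetical, and the secret is assumed to live in the openshift-ingress namespace, as in the Certificate resource earlier.

```shell
# Hypothetical helper: print the merge patch that points the default
# IngressController at the named TLS secret.
ingress_cert_patch() {
  printf '{"spec":{"defaultCertificate":{"name":"%s"}}}\n' "$1"
}

# Example, equivalent to the oc patch command in this post:
#   oc patch ingresscontroller.operator default --type=merge \
#     -p "$(ingress_cert_patch apps-wildcard-tls)" \
#     --namespace=openshift-ingress-operator
```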

Validate

After applying, navigating back to the cluster’s console will present the new wildcard cert:

Improving the CI/build process for the community Rancher Exporter

One of my side projects is developing and maintaining an unofficial Prometheus Exporter for Rancher. It exposes metrics pertaining to Rancher-specific resources including, but not limited to, managed clusters, Kubernetes versions, and more. An example dashboard based on these metrics is shown below.

overview-dashboard.png

Incidentally, if you are using Rancher, I’d love to hear your thoughts/feedback.

Previous CI workflow

The flowchart below outlines the existing process. Whilst automated, pushing directly to latest is bad practice.

To improve this, several additional steps were added. The first acquires the latest versioned image tag of the exporter and saves it to the $GITHUB_OUTPUT environment file:

      - name: Retrieve latest Docker image version
        id: get_version
        run: |
          echo "image_version=$(curl -s "https://registry.hub.docker.com/v2/repositories/virtualthoughts/prometheus-rancher-exporter/tags/" | jq -r '.results[].name' | grep -v latest | sort -V | tail -n 1)" >> $GITHUB_OUTPUT
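
The tag selection in that pipeline relies on sort -V, which orders version strings numerically rather than lexically. A minimal sketch of the same filter, runnable offline against a hand-written tag list, is below; the function name latest_version and the sample tags are illustrative assumptions, and sort -V requires GNU coreutils or a compatible sort.

```shell
# Hypothetical helper: read tag names on stdin, one per line, and print the
# highest versioned tag, ignoring "latest".
latest_version() {
  grep -v latest | sort -V | tail -n 1
}

# Example with made-up tags; a plain lexical sort would rank 0.0.9 last.
#   printf '%s\n' 0.0.9 latest 0.0.10 0.0.2 | latest_version
```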


Referencing this, the next version can be generated based on MAJOR.MINOR.PATCH by incrementing the PATCH version. In the future, this will be modified to add more flexibility to change MAJOR and MINOR versions.

      - name: Increment version
        id: increment_version
        run: |
          # Increment the retrieved version
          echo "updated_version=$(echo "${{ steps.get_version.outputs.image_version }}" | awk -F. -v OFS=. '{$NF++;print}')" >> $GITHUB_OUTPUT
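
The awk expression in that step increments the last dot-separated field. One possible way to extend it to MAJOR and MINOR bumps is sketched below; this is an illustration rather than the exporter's actual workflow, and the function name bump_version and its part argument are hypothetical.

```shell
# Hypothetical helper: bump one component of a MAJOR.MINOR.PATCH version.
# Usage: bump_version <version> [major|minor|patch]   (default: patch)
bump_version() (
  version=$1
  part=${2:-patch}
  case $part in
    major|minor|patch) ;;
    *) echo "unknown part: $part" >&2; exit 1 ;;
  esac
  # Assigning to a field makes awk rebuild the record using OFS.
  echo "$version" | awk -F. -v OFS=. -v part="$part" '
    part == "major" { $1++; $2 = 0; $3 = 0 }
    part == "minor" { $2++; $3 = 0 }
    part == "patch" { $3++ }
    { print }'
)

# Example:
#   bump_version 1.2.3 minor
```

Bumping a higher component resets the lower ones to zero, which follows the usual semantic versioning convention.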

With the version generated, the subsequent step can tag and push both the incremented version, and latest.

      - name: Build and push
        uses: docker/build-push-action@v3
        with:
          context: .
          push: true
          tags: |
            virtualthoughts/prometheus-rancher-exporter:${{ steps.increment_version.outputs.updated_version }}
            virtualthoughts/prometheus-rancher-exporter:latest

Lastly, the GitHub Action also modifies the YAML manifest file to reference the most recent versioned image:

      - name: Update Kubernetes YAML manifest
        run: |
          # Install yq
          curl -sL https://github.com/mikefarah/yq/releases/latest/download/yq_linux_amd64 -o yq
          chmod +x yq
          sudo mv yq /usr/local/bin/
          
          # Find and update the image tag in the YAML file
          IMAGE_NAME="virtualthoughts/prometheus-rancher-exporter"
          NEW_TAG="${{ steps.increment_version.outputs.updated_version }}"
          OLD_TAG=$(yq eval '.spec.template.spec.containers[] | select(.name == "rancher-exporter").image' manifests/exporter.yaml | cut -d":" -f2)
          NEW_IMAGE="${IMAGE_NAME}:${NEW_TAG}"
          sed -i "s|${IMAGE_NAME}:${OLD_TAG}|${NEW_IMAGE}|" manifests/exporter.yaml
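
The same kind of substitution can be tried locally before committing it to a workflow. The sketch below is a simplified variant that skips the yq lookup and matches whatever tag currently follows the image name; the function name set_image_tag is hypothetical, and it writes through a temporary file instead of using sed -i so that it behaves the same with GNU and BSD sed.

```shell
# Hypothetical helper: replace the tag of <image> in <file> with <new-tag>.
# Usage: set_image_tag <file> <image> <new-tag>
set_image_tag() (
  file=$1
  image=$2
  tag=$3
  tmp=$(mktemp) || exit 1
  # Match "<image>:<anything up to whitespace or a quote>" and swap the tag.
  sed "s|${image}:[^[:space:]\"']*|${image}:${tag}|" "$file" > "$tmp" &&
    cat "$tmp" > "$file"
  status=$?
  rm -f "$tmp"
  exit "$status"
)

# Example:
#   set_image_tag manifests/exporter.yaml virtualthoughts/prometheus-rancher-exporter 0.1.5
```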

Which results in:

Debugging cloud-init not executing runcmd commands

Background

Rancher leverages cloud-init for the provisioning of Virtual Machines on a number of infrastructure providers, as below:

I recently encountered an issue whereby vSphere based clusters using an Ubuntu VM template would successfully provision, but SLES based VM templates would not.

What does Rancher use cloud-init for?

This is covered in the Masterclass session I co-hosted, but as a refresher, particularly with the vSphere driver, Rancher will mount an ISO image to the VM to deliver the user-data portion of a cloud-init configuration. The contents of which look like this:

#cloud-config
groups:
- staff
hostname: scale-aio-472516f5-s82pz
runcmd:
- sh /usr/local/custom_script/install.sh
set_hostname:
- scale-aio-472516f5-s82pz
users:
- create_groups: false
  groups: staff
  lock_passwd: true
  name: docker
  no_user_group: true
  ssh_authorized_keys:
  - |
    ssh-rsa AAAAB3NzaC1yc.......
  sudo: ALL=(ALL) NOPASSWD:ALL
write_files:
- content: H4sIAAAAAAAA/wAAA...........
  encoding: gzip+b64
  path: /usr/local/custom_script/install.sh
  permissions: "0644"

Note: This is automatically generated. Any additional cloud-init config you include in the cluster configuration (below) is merged with the above.

It saves a script with write_files and then runs this with runcmd – this will install the rancher-system-agent service and begin the process of installing RKE2/K3s.
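
Because the script is stored with encoding: gzip+b64, it is not human-readable in the user-data. A minimal sketch for decoding such a value is below, which can be handy for reading what install.sh will do; the function name decode_gzip_b64 and the file name content.b64 are hypothetical, and it assumes base64 -d and gunzip are available.

```shell
# Hypothetical helper: decode a gzip+b64 encoded write_files value read
# from stdin (Base64 decode first, then gunzip).
decode_gzip_b64() {
  base64 -d | gunzip -c
}

# Example: save the full content value (the one in this post is truncated)
# into a file named content.b64, then run:
#   decode_gzip_b64 < content.b64
```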

The Issue

When I provisioned SLES based clusters using my existing Packer template, Rancher would indicate it was waiting for the agent to check in:

Investigating

Thinking cloud-init didn’t ingest the config, I ssh’d into the node to do some debugging. I noticed that the node name had changed:

sles-15-sp3-pool1-15a47a8f-xcspb:~ #

Which I verified with:

sles-15-sp3-pool1-15a47a8f-xcspb:/ # cat /var/lib/cloud/instance/user-data.txt | grep hostname
hostname: sles-15-sp3-pool1-15a47a8f-xcspb

Inspecting user-data.txt from that directory also matched what was in the mounted ISO. I could also see /usr/local/custom_script/install.sh was created, but nothing indicated that it was executed. It appeared everything else from the cloud-init file was processed – SSH keys, groups, writing the script, etc, but nothing from runcmd was executed.

I ruled out the script by creating a new cluster and adding my own command:

As expected, this was merged into the user-data.iso file mounted to the VM, but /tmp/test.txt didn’t exist, so it was never executed.

Checking cloud-init logs

Cloud-init has an easy way to collect logs: the cloud-init collect-logs command, which generates a tarball:

sles-15-sp3-pool1-15a47a8f-xcspb:/ # cloud-init collect-logs
Wrote /cloud-init.tar.gz

I noted in cloud-init.log I could see the script file being saved:

2023-01-18 09:56:22,917 - helpers.py[DEBUG]: Running config-write-files using lock (<FileLock using file '/var/lib/cloud/instances/nocloud/sem/config_write_files'>)
2023-01-18 09:56:22,927 - util.py[DEBUG]: Writing to /usr/local/custom_script/install.sh - wb: [644] 29800 bytes
2023-01-18 09:56:22,928 - util.py[DEBUG]: Changing the ownership of /usr/local/custom_script/install.sh to 0:0

But nothing indicating it was executed.

I decided to extract a list of all the cloud-init modules that were initiated:

cat cloud-init.log | grep "Running module"

stages.py[DEBUG]: Running module migrator
stages.py[DEBUG]: Running module seed_random 
stages.py[DEBUG]: Running module bootcmd 
stages.py[DEBUG]: Running module write-files 
stages.py[DEBUG]: Running module growpart 
stages.py[DEBUG]: Running module resizefs 
stages.py[DEBUG]: Running module disk_setup
stages.py[DEBUG]: Running module mounts 
stages.py[DEBUG]: Running module set_hostname
stages.py[DEBUG]: Running module update_hostname 
stages.py[DEBUG]: Running module update_etc_hosts 
stages.py[DEBUG]: Running module rsyslog 
stages.py[DEBUG]: Running module users-groups 
stages.py[DEBUG]: Running module ssh

But still, no sign of runcmd.

Checking cloud-init configuration

Outside of the log bundle, /etc/cloud/cloud.cfg holds the configuration for cloud-init. Suspecting the runcmd module may not be loaded, I checked, but it was present:

# The modules that run in the 'config' stage
cloud_config_modules:
 - ssh-import-id
 - locale
 - set-passwords
 - zypper-add-repo
 - ntp
 - timezone
 - disable-ec2-metadata
 - runcmd

However, I noticed that nothing from the cloud_config_modules block was mentioned in cloud-init.log, whereas everything from cloud_init_modules was:

# The modules that run in the 'init' stage
cloud_init_modules:
 - migrator
 - seed_random
 - bootcmd
 - write-files
 - growpart
 - resizefs
 - disk_setup
 - mounts
 - set_hostname
 - update_hostname
 - update_etc_hosts
 - ca-certs
 - rsyslog
 - users-groups
 - ssh

So, it appeared the entire cloud_config_modules step wasn’t running. Weird.
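
The comparison done by eye here can also be scripted. The sketch below lists modules that are configured under a given key in cloud.cfg but never show up in a "Running module" log line; the function names are hypothetical, and the parsing assumes the simple one-entry-per-line list layout shown in this post.

```shell
# Hypothetical helper: print the module names listed under <key> in <cfg>.
configured_modules() {
  awk -v key="$1:" '
    $1 == key                    { in_block = 1; next }  # start of the list
    in_block && $1 == "-"        { print $2; next }      # one list entry
    in_block && NF && $1 !~ /^#/ { in_block = 0 }        # another key ends it
  ' "$2"
}

# Hypothetical helper: print configured modules with no "Running module"
# entry in the log. Usage: modules_not_run <key> <cfg> <log>
modules_not_run() (
  for module in $(configured_modules "$1" "$2"); do
    grep -Eq "Running module ${module}( |\$)" "$3" || echo "$module"
  done
)

# Example (paths as used in this post):
#   modules_not_run cloud_config_modules /etc/cloud/cloud.cfg cloud-init.log
```

If every module under a key is reported, the whole stage probably never ran, which points at the service for that stage rather than at any single module.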

Fixing

After speaking with someone from the cloud-init community, I found out that several cloud-init services exist on a host machine, each dedicated to a specific stage.

Default config on a SLES 15 SP4 machine:

sles-15-sp3-pool1-15a47a8f-xcspb:/ # sudo systemctl list-unit-files | grep cloud
cloud-config.service                    disabled        disabled     
cloud-final.service                     disabled        disabled     
cloud-init-local.service                disabled        disabled     
cloud-init.service                      enabled         disabled     
cloud-config.target                     static          -            
cloud-init.target                       enabled-runtime disabled

Default config on an Ubuntu 22.04 machine:

packerbuilt@SRV-RNC-1:~$ sudo systemctl list-unit-files | grep cloud
cloud-config.service                        enabled         enabled
cloud-final.service                         enabled         enabled
cloud-init-hotplugd.service                 static          -
cloud-init-local.service                    enabled         enabled
cloud-init.service                          enabled         enabled
cloud-init-hotplugd.socket                  enabled         enabled
cloud-config.target                         static          -
cloud-init.target                           enabled-runtime enabled

The cloud-config service was not enabled and therefore would not run any of the related modules. To rectify, I added the following to my Packer script when building the template:

# Ensure cloud-init services are enabled
systemctl enable cloud-init.service
systemctl enable cloud-init-local.service
systemctl enable cloud-config.service
systemctl enable cloud-final.service

After which, provisioning SLES based machines from Rancher worked.
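
A quick check along these lines could also be added to a template build to catch the problem early. The sketch below filters systemctl list-unit-files output for cloud-init service units whose state is disabled; the function name disabled_cloud_units is hypothetical.

```shell
# Hypothetical helper: read `systemctl list-unit-files` output on stdin and
# print cloud-init related services whose STATE column is "disabled".
disabled_cloud_units() {
  awk '$1 ~ /^cloud-.*\.service$/ && $2 == "disabled" { print $1 }'
}

# Example, run on the template VM:
#   systemctl list-unit-files | disabled_cloud_units
```

An empty result means none of the cloud-init services are disabled.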


© 2024 Virtual Thoughts
