K3s Cluster Maintenance

Step-by-step operational procedure for K3s cluster maintenance: pre-checks, backup, node drain, binary upgrade, health verification, and rollback.

Prerequisites

Before starting, confirm the following:

  • You have SSH access to the K3s node(s) via WireGuard
  • kubectl is configured and can reach the cluster (kubectl cluster-info)
  • You know the current K3s version (k3s --version)
  • You have reviewed the K3s release notes for the target version
  • No active deployments or rollouts are in progress (kubectl rollout status on critical workloads)
  • Monitoring is operational (Grafana/Prometheus dashboards accessible)
  • You have at least 30 minutes of uninterrupted time
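
A quick pre-flight sketch that bundles the checkable items above into one pass; the deployment names and namespaces passed to kubectl rollout status are placeholders for your own critical workloads:

# Record the current version for the changelog and for rollback reference
k3s --version

# Confirm the API server is reachable over WireGuard
kubectl cluster-info

# Confirm no rollouts are mid-flight (deployment names/namespaces are examples)
kubectl rollout status deployment/grafana -n monitoring --timeout=30s
kubectl rollout status deployment/authentik-server -n authentik --timeout=30s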

Warning: This procedure assumes a single-node K3s cluster on Proxmox. Multi-node clusters require additional coordination for control plane and agent upgrades.

Step 1: Pre-Maintenance Health Check

Verify the cluster is healthy before making any changes.

# Cluster status
kubectl cluster-info
kubectl get nodes -o wide

# List pods not in the Running phase (completed Jobs show as Succeeded; anything else needs attention)
kubectl get pods -A --field-selector=status.phase!=Running

# Check for pending or failed pods
kubectl get pods -A | grep -E 'Pending|Error|CrashLoopBackOff|ImagePullBackOff'

# Resource pressure
kubectl describe node | grep -A 5 "Conditions:"

# Disk usage on node
df -h /var/lib/rancher/k3s

Stop here if:

  • Any system pods (coredns, traefik, metrics-server) are not Running
  • The node shows MemoryPressure, DiskPressure, or PIDPressure
  • Disk usage on /var/lib/rancher/k3s exceeds 80%

Resolve these issues before proceeding with maintenance.
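
The stop conditions above can be scripted so the check fails loudly; a minimal sketch, intended to be saved and run as a small script on the machine with kubectl access:

# Abort if the node reports pressure or the K3s data directory is over 80% full
NODE_NAME=$(kubectl get nodes -o jsonpath='{.items[0].metadata.name}')

for cond in MemoryPressure DiskPressure PIDPressure; do
  status=$(kubectl get node "$NODE_NAME" \
    -o jsonpath="{.status.conditions[?(@.type==\"$cond\")].status}")
  if [ "$status" = "True" ]; then
    echo "STOP: node reports $cond"; exit 1
  fi
done

usage=$(df --output=pcent /var/lib/rancher/k3s | tail -1 | tr -dc '0-9')
if [ "$usage" -gt 80 ]; then
  echo "STOP: /var/lib/rancher/k3s is at ${usage}% (limit 80%)"; exit 1
fi
echo "Pre-maintenance checks passed"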

Step 2: Backup Cluster State

K3s uses SQLite by default (single-node) or etcd (HA). Back up accordingly.

SQLite (default single-node)

# Stop writes temporarily
sudo systemctl stop k3s

# Copy the SQLite database
sudo cp /var/lib/rancher/k3s/server/db/state.db \
  /var/lib/rancher/k3s/server/db/state.db.backup-$(date +%Y%m%d)
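
# Also copy SQLite WAL/SHM sidecar files if kine left any behind
# (a hedged extra step -- the files may not exist after a clean shutdown)
for f in state.db-wal state.db-shm; do
  sudo test -f /var/lib/rancher/k3s/server/db/$f && \
    sudo cp /var/lib/rancher/k3s/server/db/$f \
      /var/lib/rancher/k3s/server/db/$f.backup-$(date +%Y%m%d)
done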

# Restart K3s
sudo systemctl start k3s

etcd (if configured for HA)

# Snapshot etcd
sudo k3s etcd-snapshot save --name pre-maintenance-$(date +%Y%m%d)

# Verify snapshot
sudo k3s etcd-snapshot ls

Backup manifests and Helm values

# Export workload resources (note: 'kubectl get all' does not include ConfigMaps, Secrets, or Ingresses)
kubectl get all -A -o yaml > /tmp/k3s-backup-all-$(date +%Y%m%d).yaml

# Export Helm releases
helm ls -A -o yaml > /tmp/k3s-backup-helm-$(date +%Y%m%d).yaml

# Copy auto-deployed manifests
sudo cp -r /var/lib/rancher/k3s/server/manifests \
  /tmp/k3s-manifests-backup-$(date +%Y%m%d)

Verify: Confirm backup files exist and are non-empty before proceeding.
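
A minimal sketch of that verification, assuming the backup paths used above:

# Exported YAML files should exist and be non-empty
for f in /tmp/k3s-backup-all-$(date +%Y%m%d).yaml \
         /tmp/k3s-backup-helm-$(date +%Y%m%d).yaml; do
  test -s "$f" && echo "OK: $f" || echo "MISSING OR EMPTY: $f"
done

# The SQLite copy lives in the K3s data directory, so inspect it with sudo
sudo test -s /var/lib/rancher/k3s/server/db/state.db.backup-$(date +%Y%m%d) \
  && echo "OK: SQLite backup" || echo "MISSING OR EMPTY: SQLite backup"

# The copied manifests directory should list your auto-deployed manifests
sudo ls /tmp/k3s-manifests-backup-$(date +%Y%m%d)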

Step 3: Drain the Node

Cordon the node to prevent new scheduling, then drain existing workloads.

# Get the node name
NODE_NAME=$(kubectl get nodes -o jsonpath='{.items[0].metadata.name}')

# Cordon (prevent new pods from scheduling)
kubectl cordon $NODE_NAME

# Drain (evict existing pods gracefully)
kubectl drain $NODE_NAME \
  --ignore-daemonsets \
  --delete-emptydir-data \
  --grace-period=60 \
  --timeout=120s

Warning: On a single-node cluster, draining evicts all non-DaemonSet pods. Services will be unavailable during the maintenance window. Plan accordingly.

Verify:

# Node should show SchedulingDisabled
kubectl get nodes
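
If the drain stalls, the usual culprits are a PodDisruptionBudget or a pod holding emptyDir data; a hedged troubleshooting sketch:

# See which pods the drain is still waiting on
kubectl get pods -A --field-selector spec.nodeName=$NODE_NAME

# Inspect PodDisruptionBudgets that may be blocking eviction
kubectl get pdb -A

# Last resort on a single-node cluster: delete instead of evict (bypasses PDBs)
kubectl drain $NODE_NAME --ignore-daemonsets --delete-emptydir-data \
  --disable-eviction --timeout=120s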

Step 4: Upgrade K3s Binary

# Record current version
k3s --version
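
# Optionally keep the version string on disk so the rollback section can
# reference it later (the file path below is only an example)
k3s --version | tee /tmp/k3s-version-pre-upgrade.txt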

# Stop K3s
sudo systemctl stop k3s

# Install the target version
# Replace INSTALL_K3S_VERSION with the target (e.g., v1.29.2+k3s1)
curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION="v1.29.2+k3s1" sh -

# If you use custom install flags, pass them again:
# curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION="v1.29.2+k3s1" \
#   INSTALL_K3S_EXEC="--disable=servicelb --write-kubeconfig-mode=644" sh -

Verify:

k3s --version
# Should show the target version

sudo systemctl status k3s
# Should be active (running)

Step 5: Verify Cluster Health Post-Upgrade

# Wait for the node to become Ready (may take 30-60 seconds)
kubectl get nodes -w

# Uncordon the node to allow scheduling
kubectl uncordon $NODE_NAME

# Verify all system pods recover
kubectl get pods -n kube-system

# Wait for all pods to return to Running
watch kubectl get pods -A

Expected recovery time: System pods (coredns, traefik, metrics-server) should be Running within 2 minutes. Application pods will be rescheduled once the node is uncordoned.
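
If you prefer commands that block until recovery rather than watching, kubectl wait can gate the next step; a sketch assuming the default kube-system deployments:

# Block until the node reports Ready (give up after 2 minutes)
kubectl wait --for=condition=Ready node/$NODE_NAME --timeout=120s

# Block until every kube-system deployment reports Available again
kubectl wait --for=condition=Available deployment --all -n kube-system --timeout=180s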

Verify each critical workload:

# Check Traefik ingress
kubectl get pods -n kube-system -l app.kubernetes.io/name=traefik
kubectl logs -n kube-system -l app.kubernetes.io/name=traefik --tail=20

# Check CoreDNS
kubectl get pods -n kube-system -l k8s-app=kube-dns

# Test DNS resolution
kubectl run dns-test --image=busybox:1.36 --rm -it --restart=Never \
  -- nslookup kubernetes.default.svc.cluster.local

# Verify ingress routes are responding
curl -s -o /dev/null -w "%{http_code}" https://grafana.valkyrienexus.com

Step 6: Validate Application Workloads

# Check all deployments have desired replica count
kubectl get deployments -A

# Check for any pods stuck in non-Running state
kubectl get pods -A --field-selector=status.phase!=Running

# Verify Authentik forward-auth is working
curl -s -o /dev/null -w "%{http_code}" \
  -H "Host: grafana.valkyrienexus.com" \
  http://localhost:80
# Should return 302 (redirect to Authentik) or 200 (if session exists)
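
To sweep several published routes at once, a short loop is enough; every hostname below other than grafana.valkyrienexus.com is a placeholder for your own ingress hosts:

# Print the HTTP status for each published hostname (expect 200 or a 3xx redirect)
for host in grafana.valkyrienexus.com example-app.valkyrienexus.com; do
  code=$(curl -s -o /dev/null -w "%{http_code}" "https://$host")
  echo "$host -> $code"
done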

Step 7: Post-Maintenance Cleanup

# Remove old backup files after confirming stability (keep for 7 days minimum)
# Do NOT delete immediately -- wait until next maintenance window

# Check for any orphaned resources
kubectl get pods -A | grep Evicted
kubectl delete pods -A --field-selector=status.phase==Failed
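
Once the cluster has been stable past the retention window, on-node backups can be trimmed as well; a hedged sketch (the prune command applies only if you run embedded etcd):

# List SQLite backups older than 7 days -- review before deleting anything
sudo find /var/lib/rancher/k3s/server/db -maxdepth 1 \
  -name 'state.db.backup-*' -mtime +7

# Keep only the five most recent etcd snapshots (embedded etcd only)
sudo k3s etcd-snapshot prune --snapshot-retention 5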

Rollback

If the upgrade causes issues:

Scenario: K3s fails to start after upgrade

# Restore the previous K3s binary if a copy was preserved at /usr/local/bin/k3s.old
# (not every install keeps one -- see the reinstall alternative after this block)
sudo systemctl stop k3s
sudo cp /usr/local/bin/k3s.old /usr/local/bin/k3s
sudo systemctl start k3s
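
If no preserved copy of the old binary exists, re-running the installer pinned to the version recorded in Step 4 has the same effect; a sketch with a placeholder version (if the new release already wrote to the datastore, combine this with the restore in the next scenario):

# Reinstall the previous release (substitute the version recorded in Step 4)
curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION="<previous-version>" sh -

# The installer restarts the service; confirm it is active (running)
sudo systemctl status k3s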

Scenario: Cluster state is corrupted

# Stop K3s
sudo systemctl stop k3s

# Restore SQLite backup
sudo cp /var/lib/rancher/k3s/server/db/state.db.backup-YYYYMMDD \
  /var/lib/rancher/k3s/server/db/state.db

# Restore previous K3s binary if needed
sudo cp /usr/local/bin/k3s.old /usr/local/bin/k3s

# Start K3s
sudo systemctl start k3s

# Verify
kubectl get nodes
kubectl get pods -A

Scenario: etcd snapshot restore

sudo systemctl stop k3s
sudo k3s server --cluster-reset \
  --cluster-reset-restore-path=/var/lib/rancher/k3s/server/db/snapshots/pre-maintenance-YYYYMMDD
sudo systemctl start k3s

Scenario: Applications not recovering

If individual application pods are not recovering after uncordon:

# Force delete stuck pods
kubectl delete pod <pod-name> -n <namespace> --force --grace-period=0

# Restart the deployment
kubectl rollout restart deployment/<deployment-name> -n <namespace>

# Check events for the namespace
kubectl get events -n <namespace> --sort-by='.lastTimestamp'

Maintenance Notes

  • Schedule maintenance during low-usage windows (weekday mornings for homelab)
  • Always upgrade one minor version at a time (v1.28 to v1.29, not v1.28 to v1.30)
  • After upgrading K3s, check that Helm chart versions in /var/lib/rancher/k3s/server/manifests are compatible (see the example after this list)
  • Document the upgrade in the infrastructure changelog with the date, source version, and target version
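
A quick way to see which chart versions the auto-deploy manifests pin, assuming the default HelmChart layout under the manifests directory:

# Show chart references and pinned versions in the auto-deploy manifests
sudo grep -rE 'chart:|version:' /var/lib/rancher/k3s/server/manifests/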