Prerequisites
Before starting, confirm the following:
- You have SSH access to the K3s node(s) via WireGuard
- kubectl is configured and can reach the cluster (kubectl cluster-info)
- You know the current K3s version (k3s --version)
- You have reviewed the K3s release notes for the target version
- No active deployments or rollouts are in progress (kubectl rollout status on critical workloads)
- Monitoring is operational (Grafana/Prometheus dashboards accessible)
- You have at least 30 minutes of uninterrupted time
Warning: This procedure assumes a single-node K3s cluster on Proxmox. Multi-node clusters require additional coordination for control plane and agent upgrades.
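A quick topology check before continuing (a minimal sketch; skip it if you already know your layout):
# This runbook assumes exactly one node; anything greater than 1 means the multi-node caveat above applies
kubectl get nodes --no-headers | wc -l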
Step 1: Pre-Maintenance Health Check
Verify the cluster is healthy before making any changes.
# Cluster status
kubectl cluster-info
kubectl get nodes -o wide
# Pods not in Running phase (Succeeded/Completed job pods may appear here and are expected)
kubectl get pods -A --field-selector=status.phase!=Running
# Check for pending or failed pods
kubectl get pods -A | grep -E 'Pending|Error|CrashLoopBackOff|ImagePullBackOff'
# Resource pressure
kubectl describe node | grep -A 5 "Conditions:"
# Disk usage on node
df -h /var/lib/rancher/k3s
Stop here if:
- Any system pods (coredns, traefik, metrics-server) are not Running
- The node shows MemoryPressure, DiskPressure, or PIDPressure
- Disk usage on /var/lib/rancher/k3s exceeds 80%
Resolve these issues before proceeding with maintenance.
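To check the pressure conditions in one pass, a jsonpath one-liner works (a sketch that assumes the single node sits at .items[0]; every pressure condition should report False and Ready should report True):
# Print each node condition as type=status
kubectl get nodes -o jsonpath='{range .items[0].status.conditions[*]}{.type}={.status}{"\n"}{end}'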
Step 2: Backup Cluster State
K3s uses SQLite by default (single-node) or etcd (HA). Back up accordingly.
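If you are unsure which datastore this node uses, the db directory tells you (path is the K3s default):
# state.db indicates SQLite; an etcd/ subdirectory indicates embedded etcd
sudo ls /var/lib/rancher/k3s/server/db/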
SQLite (default single-node)
# Stop writes temporarily
sudo systemctl stop k3s
# Copy the SQLite database
sudo cp /var/lib/rancher/k3s/server/db/state.db \
/var/lib/rancher/k3s/server/db/state.db.backup-$(date +%Y%m%d)
# Restart K3s
sudo systemctl start k3s
etcd (if configured for HA)
# Snapshot etcd
sudo k3s etcd-snapshot save --name pre-maintenance-$(date +%Y%m%d)
# Verify snapshot
sudo k3s etcd-snapshot ls
Backup manifests and Helm values
# Export core workload resources (note: 'kubectl get all' omits ConfigMaps, Secrets, and Ingress resources)
kubectl get all -A -o yaml > /tmp/k3s-backup-all-$(date +%Y%m%d).yaml
# Export Helm releases
helm ls -A -o yaml > /tmp/k3s-backup-helm-$(date +%Y%m%d).yaml
# Copy auto-deployed manifests
sudo cp -r /var/lib/rancher/k3s/server/manifests \
/tmp/k3s-manifests-backup-$(date +%Y%m%d)
Verify: Confirm backup files exist and are non-empty before proceeding.
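One way to check, assuming the SQLite and /tmp paths used above (substitute the snapshot directory if you took an etcd snapshot instead):
# Backups should exist and be non-empty
sudo ls -lh /var/lib/rancher/k3s/server/db/state.db.backup-$(date +%Y%m%d)
ls -lh /tmp/k3s-backup-*-$(date +%Y%m%d).yaml
ls -ld /tmp/k3s-manifests-backup-$(date +%Y%m%d)
for f in /tmp/k3s-backup-*-$(date +%Y%m%d).yaml; do [ -s "$f" ] || echo "EMPTY OR MISSING: $f"; done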
Step 3: Drain the Node
Cordon the node to prevent new scheduling, then drain existing workloads.
# Get the node name
NODE_NAME=$(kubectl get nodes -o jsonpath='{.items[0].metadata.name}')
# Cordon (prevent new pods from scheduling)
kubectl cordon $NODE_NAME
# Drain (evict existing pods gracefully)
kubectl drain $NODE_NAME \
--ignore-daemonsets \
--delete-emptydir-data \
--grace-period=60 \
--timeout=120s
Warning: On a single-node cluster, draining evicts all non-DaemonSet pods. Services will be unavailable during the maintenance window. Plan accordingly.
Verify:
# Node should show SchedulingDisabled
kubectl get nodes
Step 4: Upgrade K3s Binary
# Record current version
k3s --version
# Stop K3s
sudo systemctl stop k3s
# Install the target version
# Replace INSTALL_K3S_VERSION with the target (e.g., v1.29.2+k3s1)
curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION="v1.29.2+k3s1" sh -
# If you use custom install flags, pass them again:
# curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION="v1.29.2+k3s1" \
# INSTALL_K3S_EXEC="--disable=servicelb --write-kubeconfig-mode=644" sh -
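If you do not remember which flags the existing install used, they can usually be recovered before re-running the installer (paths are the K3s defaults; the config file exists only if one was created):
# Flags passed at install time end up in the systemd unit's ExecStart
grep -A 5 '^ExecStart=' /etc/systemd/system/k3s.service
# Persistent configuration, if any
sudo cat /etc/rancher/k3s/config.yaml 2>/dev/null || echo "no config.yaml present"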
Verify:
k3s --version
# Should show the target version
sudo systemctl status k3s
# Should be active (running)
Step 5: Verify Cluster Health Post-Upgrade
# Wait for the node to become Ready (may take 30-60 seconds)
kubectl get nodes -w
# Uncordon the node to allow scheduling
kubectl uncordon $NODE_NAME
# Verify all system pods recover
kubectl get pods -n kube-system
# Wait for all pods to return to Running
watch kubectl get pods -A
Expected recovery time: System pods (coredns, traefik, metrics-server) should be Running within 2 minutes. Application pods will be rescheduled once the node is uncordoned.
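Instead of watching by hand, you can block until the packaged components report Ready (a sketch; the label selectors assume the default K3s-bundled CoreDNS, Traefik, and metrics-server):
# Wait up to 2 minutes for each core component
kubectl wait --for=condition=Ready pod -l k8s-app=kube-dns -n kube-system --timeout=120s
kubectl wait --for=condition=Ready pod -l app.kubernetes.io/name=traefik -n kube-system --timeout=120s
kubectl wait --for=condition=Ready pod -l k8s-app=metrics-server -n kube-system --timeout=120s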
Verify each critical workload:
# Check Traefik ingress
kubectl get pods -n kube-system -l app.kubernetes.io/name=traefik
kubectl logs -n kube-system -l app.kubernetes.io/name=traefik --tail=20
# Check CoreDNS
kubectl get pods -n kube-system -l k8s-app=kube-dns
# Test DNS resolution
kubectl run dns-test --image=busybox:1.36 --rm -it --restart=Never \
-- nslookup kubernetes.default.svc.cluster.local
# Verify ingress routes are responding
curl -s -o /dev/null -w "%{http_code}" https://grafana.valkyrienexus.com
Step 6: Validate Application Workloads
# Check all deployments have desired replica count
kubectl get deployments -A
# Check for any pods stuck in non-Running state
kubectl get pods -A --field-selector=status.phase!=Running
# Verify Authentik forward-auth is working
curl -s -o /dev/null -w "%{http_code}" \
-H "Host: grafana.valkyrienexus.com" \
http://localhost:80
# Should return 302 (redirect to Authentik) or 200 (if session exists)
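If several services are exposed through the same ingress, a small loop avoids repeating the check (the hostnames below are placeholders; substitute your own routes):
# Expect 200 or 302 from every route behind Traefik/Authentik
for host in grafana.valkyrienexus.com prometheus.valkyrienexus.com; do
  echo "$host -> $(curl -s -o /dev/null -w "%{http_code}" "https://$host")"
done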
Step 7: Post-Maintenance Cleanup
# Remove old backup files after confirming stability (keep for 7 days minimum)
# Do NOT delete immediately -- wait until next maintenance window
# Check for any orphaned resources
kubectl get pods -A | grep Evicted
kubectl delete pods -A --field-selector=status.phase==Failed
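When the retention window has passed, something like the following can prune the Step 2 backups (a sketch assuming the paths used earlier; review the output before adding -delete everywhere):
# List (and optionally delete) backups older than 7 days
find /tmp -maxdepth 1 -name 'k3s-backup-*.yaml' -mtime +7 -print -delete
find /tmp -maxdepth 1 -type d -name 'k3s-manifests-backup-*' -mtime +7 -print
sudo find /var/lib/rancher/k3s/server/db -maxdepth 1 -name 'state.db.backup-*' -mtime +7 -print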
Rollback
If the upgrade causes issues:
Scenario: K3s fails to start after upgrade
# Restore the previous K3s binary
# The installer keeps the previous binary at /usr/local/bin/k3s.old (if available)
sudo systemctl stop k3s
sudo cp /usr/local/bin/k3s.old /usr/local/bin/k3s
sudo systemctl start k3s
Scenario: Cluster state is corrupted
# Stop K3s
sudo systemctl stop k3s
# Restore SQLite backup
sudo cp /var/lib/rancher/k3s/server/db/state.db.backup-YYYYMMDD \
/var/lib/rancher/k3s/server/db/state.db
# Restore previous K3s binary if needed
sudo cp /usr/local/bin/k3s.old /usr/local/bin/k3s
# Start K3s
sudo systemctl start k3s
# Verify
kubectl get nodes
kubectl get pods -A
Scenario: etcd snapshot restore
sudo systemctl stop k3s
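# Use the exact file name reported by 'k3s etcd-snapshot ls'; saved snapshots carry
# the node name and a timestamp suffix after the name given at save time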
sudo k3s server --cluster-reset \
--cluster-reset-restore-path=/var/lib/rancher/k3s/server/db/snapshots/pre-maintenance-YYYYMMDD
sudo systemctl start k3s
Scenario: Applications not recovering
If individual application pods are not recovering after uncordon:
# Force delete stuck pods
kubectl delete pod <pod-name> -n <namespace> --force --grace-period=0
# Restart the deployment
kubectl rollout restart deployment/<deployment-name> -n <namespace>
# Check events for the namespace
kubectl get events -n <namespace> --sort-by='.lastTimestamp'
Maintenance Notes
- Schedule maintenance during low-usage windows (weekday mornings for homelab)
- Always upgrade one minor version at a time (v1.28 to v1.29, not v1.28 to v1.30)
- After upgrading K3s, check that Helm chart versions in /var/lib/rancher/k3s/server/manifests are compatible (see the sketch below)
- Document the upgrade in the infrastructure changelog with the date, source version, and target version
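A quick way to see which charts the manifests directory currently deploys and pins (a sketch assuming the default K3s helm-controller and its HelmChart CRD):
# List HelmChart resources and their pinned chart versions
kubectl get helmcharts.helm.cattle.io -A \
  -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,CHART:.spec.chart,VERSION:.spec.version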