Learning k8s the Cooper way via Claude and Ansible …

Building Kubernetes the Hard Way — Then Automating It All with Ansible

How I went from zero k8s knowledge to a fully IPv6-only, BGP-routed, HA Kubernetes cluster in 7 weeks — and what I learned along the way.

  • Please note 99.69% of this blog post was generated by claude code with minimal editing by Cooper.

The Starting Point: Why "The Hard Way"?

In early February 2026, I decided it was time to properly learn Kubernetes. Not the managed cloud kind where you click a button and get a cluster. Not the kubeadm init kind where a tool does the thinking for you. I wanted to understand every single core component — why each certificate exists, what flags the API server needs, how etcd consensus actually works, and what happens at the network level when a pod talks to a service. Friends have asked about my adventure, so this blog post attempts to share the experience. This has all been done while navigating life with a 2-3 month old newborn, which is why it took longer than it would have historically for me (but AI really helped speed things up).

So I started with Kubernetes the Hard Way by Kelsey Hightower — the canonical tutorial for engineers who want to understand Kubernetes from the ground up.

The philosophy is simple: no shortcuts. You provision machines, generate every TLS certificate with openssl, write every systemd unit file, configure every kubeconfig, and route every packet yourself. When you're done, you don't just have a cluster — you have a mental model of how it all fits together.

But I didn't stop there. My goal was to take that hard-won understanding and encode it into Ansible — building a cluster that could be reproduced, upgraded, and maintained entirely through automation. And I added a twist: IPv6-only networking throughout.

I can't thank claude code enough for working out the sharp edges, creating the VMs, and then trawling them to help convert the setup to Ansible. This really accelerated my learning. That said, I am still a HUGE believer in building your infra in a way that, if AI is down, you can still drive it and debug it. This is where using Ansible as both "infrastructure as code" and "documentation" really helps.


The Infrastructure: KVM VMs Across Two Sites

I already had a mature Ansible-managed home infrastructure — 87 roles managing everything from routers to Docker containers to ZFS pools. The k8s cluster would live on KVM virtual machines spread across two physical hypervisors (my home linux "routers" home1 and home2) for pretend redundancy. k8s and the OSS world love their fucking consensus, so I had to double the VMs on home1 to allow these systems to have their odd numbers of members in their HA components. I dislike this a lot. I'm a simple N+1 guy. I get it, but still annoying.

┌─────────────────────────────────────┐   ┌─────────────────────────────────────┐
│         home1.cooperlees.com        │   │         home2.cooperlees.com        │
│         (Primary Hypervisor)        │   │        (Secondary Hypervisor)       │
│                                     │   │                                     │
│  ┌──────────┐  ┌──────────────────┐ │   │  ┌──────────┐  ┌────────────────┐  │
│  │  server   │  │  server3         │ │   │  │  server2  │  │  node-2        │  │
│  │ CP + etcd │  │  CP + etcd       │ │   │  │ CP + etcd │  │  Worker        │  │
│  │ 3GB RAM   │  │  3GB RAM         │ │   │  │ 3GB RAM   │  │  2GB RAM       │  │
│  │ NVMe etcd │  │  NVMe etcd       │ │   │  │           │  │                │  │
│  └──────────┘  └──────────────────┘ │   │  └──────────┘  └────────────────┘  │
│  ┌──────────┐  ┌──────────────────┐ │   │                                     │
│  │  node-0   │  │  node-1          │ │   │                                     │
│  │  Worker   │  │  Worker          │ │   │                                     │
│  │ 2GB RAM   │  │  2GB RAM         │ │   │                                     │
│  └──────────┘  └──────────────────┘ │   │                                     │
│                                     │   │                                     │
│         ┌──────────────┐            │   │         ┌──────────────┐            │
│         │ br-k8 bridge │            │   │         │ br-k8 bridge │            │
│         └──────┬───────┘            │   │         └──────┬───────┘            │
└────────────────┼────────────────────┘   └────────────────┼────────────────────┘
                 │                                          │
                 └──────────── L2/L3 Link ─────────────────┘

Six nodes total: 3 control plane nodes running etcd + API server + controller-manager + scheduler, and 3 worker nodes. The control plane nodes are tainted with NoSchedule so workloads only land on workers. Every VM is provisioned via Ansible with cloud-init, static IPs, and autostart enabled in libvirt.
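One way to apply that taint (and how I like to think about it) is at kubelet registration time rather than after the fact, so there is never a window on reboot where workloads can land on a control plane node. A minimal sketch; `registerWithTaints` is a real KubeletConfiguration field, but the file path and layout here are illustrative, not my actual repo:

```shell
# Illustrative sketch: taint control plane nodes at registration time so
# workloads only schedule onto workers. The path is just an example.
cat > kubelet-config.yaml <<'EOF'
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
registerWithTaints:
  - key: node-role.kubernetes.io/control-plane
    effect: NoSchedule
EOF
```

The same effect can be had later with kubectl taint, but doing it at registration avoids the race entirely.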


Phase 1: The Genesis (February 2026)

Day 1: First Boot

The first commit on February 1st targeted Kubernetes 1.32 — the same version Kelsey's tutorial uses. I started with the classic KTHW layout: a jumpbox for distributing binaries, manually installed etcd, hand-crafted systemd units for every control plane component.

But unlike the tutorial's single control plane node, I immediately went for 3 control plane nodes for proper etcd quorum. I also set up Prometheus monitoring from day one — I've been running a Prometheus stack on my VPS for years, and I wasn't about to fly blind like so many people happily do in our industry (why??).

                    ┌──────────────────────────────────────────┐
                    │          Phase 1: Genesis                │
                    │                                          │
                    │  ┌────────┐ ┌────────┐ ┌────────┐       │
                    │  │ server │ │server2 │ │server3 │       │
                    │  │  etcd  │ │  etcd  │ │  etcd  │       │
                    │  │  api   │ │  api   │ │  api   │       │
                    │  │  ctrl  │ │  ctrl  │ │  ctrl  │       │
                    │  │  sched │ │  sched │ │  sched │       │
                    │  └───┬────┘ └───┬────┘ └───┬────┘       │
                    │      │          │          │             │
                    │      └──────────┼──────────┘             │
                    │                 │                         │
                    │  ┌────────┐ ┌────────┐ ┌────────┐       │
                    │  │ node-0 │ │ node-1 │ │ node-2 │       │
                    │  │kubelet │ │kubelet │ │kubelet │       │
                    │  │contd   │ │contd   │ │contd   │       │
                    │  └────────┘ └────────┘ └────────┘       │
                    │                                          │
                    │  Networking: Dual-stack (IPv4 + IPv6)    │
                    │  CNI: None yet                           │
                    │  HA: None yet                            │
                    └──────────────────────────────────────────┘

The PKI Challenge

One of the first things KTHW teaches you is that Kubernetes runs on a lot of TLS certificates. I generated them all and stored them in Ansible Vault:

  • CA certificate — the root of trust for the entire cluster
  • kube-apiserver cert — with SANs for every node IP, the VIP, DNS names, service IPs, and ::1
  • Per-component certs — controller-manager, scheduler, kube-proxy each get their own identity
  • Per-node kubelet certs — each worker authenticates with its own certificate
  • Service account signing key — for token generation
  • Encryption key — for etcd secrets-at-rest (AES-CBC)

Every certificate, every kubeconfig, every encryption key lives in group_vars/k8s/vault.yml or per-node host_vars/*/vault.yml, encrypted with ansible-vault. The Ansible role deploys them to the right machines, in the right paths, with the right permissions.
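To give a flavour of what generating these by hand involves, here is a heavily simplified sketch of the CA plus an API server cert carrying IPv6 SANs. Filenames, subjects, and the SAN list are illustrative; the real cluster templates far more SANs than this:

```shell
# Simplified sketch of the KTHW-style PKI: a self-signed CA, then an API
# server cert whose SANs include the VIP and ::1. Not the real SAN list.
openssl genrsa -out ca.key 2048
openssl req -x509 -new -key ca.key -sha256 -days 3650 \
  -subj "/CN=kubernetes-ca" -out ca.crt

openssl genrsa -out kube-apiserver.key 2048
openssl req -new -key kube-apiserver.key \
  -subj "/CN=kube-apiserver" -out kube-apiserver.csr

printf 'subjectAltName=DNS:kubernetes,DNS:lab-k8s.cooperlees.com,IP:fd00:4::4,IP:::1\n' > san.cnf
openssl x509 -req -in kube-apiserver.csr -CA ca.crt -CAkey ca.key \
  -CAcreateserial -days 365 -sha256 -extfile san.cnf \
  -out kube-apiserver.crt

# Sanity check the chain
openssl verify -CAfile ca.crt kube-apiserver.crt
```

Multiply this by every component and node listed above and you can see why encoding it into Ansible was worth it.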


Phase 2: Going IPv6-Only (Late February)

This is where things got interesting. My home network already runs dual-stack with heavy IPv6 usage — I have ULA (fd00:) addresses for internal services and GUA for internet-facing traffic. So I asked myself: can I run a Kubernetes cluster entirely on IPv6?

The answer is yes - and everything really just works (TM) now (even for me). So nobody has an excuse anymore that k8s blocks them from serving IPv6 internet services.

The IPv6 Address Plan

┌────────────────────────────────────────────────────────────────┐
│                    IPv6 Address Allocation                      │
├────────────────────────────────────────────────────────────────┤
│                                                                │
│  Node Network:     fd00:4::/64                                 │
│    server:         fd00:4::20                                  │
│    server2:        fd00:4::120                                 │
│    server3:        fd00:4::21                                  │
│    node-0:         fd00:4::30                                  │
│    node-1:         fd00:4::31                                  │
│    node-2:         fd00:4::132                                 │
│                                                                │
│  API Server VIP:   fd00:4::4        (BGP-advertised)           │
│                                                                │
│  Pod CIDR:         fd00:4:69::/56   (per-node /64 slices)      │
│    node-0 pods:    fd00:4:69::/64                              │
│    node-1 pods:    fd00:4:69:1::/64                            │
│    node-2 pods:    fd00:4:69:2::/64                            │
│                                                                │
│  Service CIDR:     fd00:4:32::/112                             │
│  Cluster DNS:      fd00:4:32::53                               │
│                                                                │
└────────────────────────────────────────────────────────────────┘

etcd on IPv6

Moving etcd to IPv6 was the first step. etcd is the brain of the cluster — if it breaks, everything breaks. I went through a memorable TLS toggle cycle: disabled TLS to debug IPv6 connectivity issues, then re-enabled it once I confirmed the IPv6 transport worked. The health checks needed updating too — 127.0.0.1 became [::1].
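The change itself is mostly flags. A sketch of the IPv6-relevant etcd options for one member (the flag names are real etcd flags; the values follow the address plan above, shown here for the server node):

```shell
# Sketch of the IPv6 etcd flags for one member (server, fd00:4::20).
# Note health checks hit [::1], not 127.0.0.1.
cat > etcd-ipv6-flags.txt <<'EOF'
--listen-client-urls=https://[fd00:4::20]:2379,https://[::1]:2379
--advertise-client-urls=https://[fd00:4::20]:2379
--listen-peer-urls=https://[fd00:4::20]:2380
--initial-advertise-peer-urls=https://[fd00:4::20]:2380
--initial-cluster=server=https://[fd00:4::20]:2380,server2=https://[fd00:4::120]:2380,server3=https://[fd00:4::21]:2380
--listen-metrics-urls=http://[::]:2381
EOF
```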

The etcd cluster stabilized at v3.6.8 after starting on v3.6.0-rc.3. Running a release candidate in what was becoming a real cluster didn't feel right, so I moved to the latest stable release as soon as one shipped with the support I needed.

The etcd Tuning Rabbit Hole

etcd performance turned out to be a fascinating tuning exercise. My cluster experienced leader thrashing — the leader election would flip between nodes, causing brief API server hiccups. The fix was multi-layered:

  • Dedicated NVMe disks: etcd is extremely sensitive to disk latency. I passed through NVMe storage from the hypervisors to the etcd VMs, with cache=none for direct I/O
  • I/O priority: Set ionice to realtime class for the etcd process
  • Heartbeat tuning: Adjusted etcd's election timeout and heartbeat interval
  • Disabled btrfs COW: The hypervisor's btrfs filesystem was causing write amplification on the VM disk images — chattr +C fixed the fragmentation
  • Weekly defrag timer: etcd's B+ tree grows monotonically; a coordinated weekly defrag on Sunday at 3am keeps it healthy

┌────────────────────────────────────────────────────────────────────┐
│                    etcd Performance Stack                          │
│                                                                    │
│  ┌──────────────────────────────────────────────────────────────┐  │
│  │  etcd v3.6.8 (systemd service, ionice realtime)             │  │
│  │  Listening: https://[fd00:4::X]:2379                        │  │
│  │  Metrics:   http://[::]:2381                                │  │
│  └─────────────────────────┬────────────────────────────────────┘  │
│                             │                                      │
│  ┌─────────────────────────▼────────────────────────────────────┐  │
│  │  /var/lib/etcd (or /scratch/k8s_etcd/)                      │  │
│  │  Dedicated 4GB qcow2 disk, cache=none                       │  │
│  └─────────────────────────┬────────────────────────────────────┘  │
│                             │                                      │
│  ┌─────────────────────────▼────────────────────────────────────┐  │
│  │  NVMe SSD (host passthrough)                                │  │
│  │  btrfs with COW disabled (chattr +C)                        │  │
│  └──────────────────────────────────────────────────────────────┘  │
│                                                                    │
│  Maintenance: Weekly defrag (Sunday 3am), snapshot before update   │
└────────────────────────────────────────────────────────────────────┘

I even had to replace my home server's old NVMe drive, as etcd took its life. I now have a new 120GB drive with a 5 year warranty!
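The weekly defrag itself is just a systemd timer driving etcdctl defrag. A sketch of the pair; the unit names and cert paths are mine for this example, not necessarily what the Ansible role deploys, and in practice you would stagger the timer per member so the whole quorum never defrags at once:

```shell
# Illustrative systemd service/timer pair for the Sunday 3am etcd defrag.
cat > etcd-defrag.service <<'EOF'
[Unit]
Description=Defragment local etcd member

[Service]
Type=oneshot
ExecStart=/usr/local/bin/etcdctl \
  --endpoints=https://[::1]:2379 \
  --cacert=/etc/etcd/ca.crt --cert=/etc/etcd/etcd.crt --key=/etc/etcd/etcd.key \
  defrag
EOF

cat > etcd-defrag.timer <<'EOF'
[Unit]
Description=Weekly etcd defrag

[Timer]
OnCalendar=Sun *-*-* 03:00:00
Persistent=true

[Install]
WantedBy=timers.target
EOF
```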


Phase 3: BGP-Based High Availability

With the control plane on IPv6, I needed a way to make the API server highly available. Enter kube-vip and BGP ECMP (Equal-Cost Multi-Path).

How It Works

Each control plane node runs kube-vip as a static pod. kube-vip advertises a shared Virtual IP (fd00:4::4) via BGP to my home routers, which run FRRouting. The routers see three equal-cost paths to the VIP and load-balance across all three API servers using ECMP.

┌──────────────────────────────────────────────────────────────────────┐
│                     BGP HA Architecture                              │
│                                                                      │
│                        ┌────────────┐                                │
│                        │   Client   │                                │
│                        │  kubectl   │                                │
│                        └─────┬──────┘                                │
│                              │                                       │
│                    lab-k8s.cooperlees.com                            │
│                         fd00:4::4                                    │
│                              │                                       │
│              ┌───────────────┴───────────────┐                       │
│              │                               │                       │
│       ┌──────▼───────┐                ┌──────▼───────┐               │
│       │ home1 router │                │ home2 router │               │
│       │  FRRouting   │    ECMP LB     │  FRRouting   │               │
│       │  AS 65069    │                │  AS 65070    │               │
│       └──────┬───────┘                └──────┬───────┘               │
│              │                               │                       │
│      ┌───────┴───────┐                       │                       │
│      │               │                       │                       │
│  ┌───▼────┐    ┌─────▼────┐            ┌─────▼────┐                  │
│  │ server │    │ server3  │            │ server2  │                  │
│  │  :6443 │    │  :6443   │            │  :6443   │                  │
│  │kube-vip│    │ kube-vip │            │ kube-vip │                  │
│  │AS 65004│    │ AS 65004 │            │ AS 65004 │                  │
│  └────────┘    └──────────┘            └──────────┘                  │
│     home1                                  home2                     │
│                                                                      │
│  All 3 nodes advertise fd00:4::4 via BGP                             │
│  Routers perform ECMP load balancing                                 │
│  If a node fails, BGP withdraws the route                            │
└──────────────────────────────────────────────────────────────────────┘

The beauty of this approach is hitless failover. When a control plane node goes down — whether for maintenance or failure — kube-vip withdraws the BGP route, and within seconds (BGP hold timer: 9s, keepalive: 3s) the routers converge on the remaining paths. As the VIP lives inside the same /64 all nodes got, I also had to enable gratuitous Neighbor Discovery / Solicitation.
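On the router side, the FRR config for this is small. A sketch of the home1 side with those timers; the peer-group name "K8SVIP" is invented for this example, while the ASNs and addresses are the ones used throughout this post:

```shell
# Sketch of home1's FRR peering with the kube-vip speakers:
# 3s keepalive / 9s hold, and ECMP across the three paths.
cat > frr-k8s-bgp.conf <<'EOF'
router bgp 65069
 neighbor K8SVIP peer-group
 neighbor K8SVIP remote-as 65004
 neighbor K8SVIP timers 3 9
 neighbor fd00:4::20 peer-group K8SVIP
 neighbor fd00:4::21 peer-group K8SVIP
 neighbor fd00:4::120 peer-group K8SVIP
 !
 address-family ipv6 unicast
  neighbor K8SVIP activate
  maximum-paths 3
 exit-address-family
EOF
```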

kube-vip's Growing Pains

kube-vip v1.1.0 introduced a regression that de-configured BGP peers under certain conditions. I temporarily switched to a community fix image until v1.1.1 shipped with the fix — a good reminder that even infrastructure components have bugs, and that version pinning with automated freshness checks (more on that later) is essential.
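For completeness, kube-vip in BGP mode is configured almost entirely through environment variables on its static pod. A heavily hedged sketch — the env var names (vip_address, bgp_enable, bgp_as, bgp_peers) and the peer string format are from kube-vip's docs as I remember them, so double-check them against your version; only home1 (fd00:4::2) is shown as a peer:

```shell
# Hedged sketch of a kube-vip static pod for BGP mode. In a real cluster
# this lives under the kubelet's static pod manifest directory.
cat > kube-vip.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: kube-vip
  namespace: kube-system
spec:
  hostNetwork: true
  containers:
    - name: kube-vip
      image: ghcr.io/kube-vip/kube-vip:v1.1.1
      args: ["manager"]
      env:
        - name: vip_address
          value: "fd00:4::4"
        - name: bgp_enable
          value: "true"
        - name: bgp_as
          value: "65004"
        - name: bgp_peers
          value: "fd00:4::2:65069::false"
EOF
```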


Phase 4: Cilium — The Network Brain

For the CNI (Container Network Interface), I chose Cilium — and it turned out to be one of the best decisions of the project.

Why Cilium?

Cilium uses eBPF to implement networking directly in the Linux kernel, which means:

  • No overlay network — native IPv6 routing, no VXLAN or Geneve tunnels
  • BGP pod CIDR advertising — workers announce their pod CIDRs to the routers
  • Hubble — a built-in observability layer that gives you flow logs, DNS visibility, and HTTP metrics
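The BGP advertising piece is driven by a CiliumBGPPeeringPolicy. A sketch close in spirit to mine — the nodeSelector label is invented for this example, while the ASNs and home1's br-k8 address (fd00:4::2) match the rest of this post:

```shell
# Sketch of a CiliumBGPPeeringPolicy (cilium.io/v2alpha1) that has
# matching nodes advertise their pod /64s to home1.
cat > cilium-bgp-policy.yaml <<'EOF'
apiVersion: cilium.io/v2alpha1
kind: CiliumBGPPeeringPolicy
metadata:
  name: worker-pod-cidrs
spec:
  nodeSelector:
    matchLabels:
      node-role.cooperlees.com/worker: "true"   # illustrative label
  virtualRouters:
    - localASN: 65004
      exportPodCIDR: true
      neighbors:
        - peerAddress: "fd00:4::2/128"
          peerASN: 65069
EOF
```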

The Full Network Picture

┌────────────────────────────────────────────────────────────────────────┐
│                  Complete Network Architecture                         │
│                                                                        │
│  ┌──────────────────────────┐       ┌──────────────────────────┐       │
│  │    home1 Router          │       │    home2 Router          │       │
│  │    FRRouting AS 65069    │◄─────►│    FRRouting AS 65070    │       │
│  │                          │       │                          │       │
│  │  Receives BGP routes:    │       │  Receives BGP routes:    │       │
│  │  • fd00:4::4/128 (VIP)  │       │  • fd00:4::4/128 (VIP)  │       │
│  │  • fd00:4:69::/64 (n0)  │       │  • fd00:4:69::/64 (n0)  │       │
│  │  • fd00:4:69:1::/64 (n1)│       │  • fd00:4:69:1::/64 (n1)│       │
│  │  • fd00:4:69:2::/64 (n2)│       │  • fd00:4:69:2::/64 (n2)│       │
│  └────────────┬─────────────┘       └────────────┬─────────────┘       │
│               │                                   │                    │
│    ┌──────────┴───────────────────────────────────┴──────────┐         │
│    │                    br-k8 Network                         │         │
│    │                  fd00:4::/64                             │         │
│    └──┬──────────┬──────────┬──────────┬──────────┬──────┬───┘         │
│       │          │          │          │          │      │              │
│  ┌────▼───┐ ┌───▼────┐ ┌───▼────┐ ┌───▼───┐ ┌───▼──┐ ┌─▼─────┐       │
│  │ server │ │server2 │ │server3 │ │node-0 │ │node-1│ │node-2 │       │
│  │  CP    │ │  CP    │ │  CP    │ │Worker │ │Worker│ │Worker │       │
│  │        │ │        │ │        │ │Cilium │ │Cilium│ │Cilium │       │
│  │kube-vip│ │kube-vip│ │kube-vip│ │  BGP  │ │ BGP  │ │ BGP   │       │
│  │BGP:VIP │ │BGP:VIP │ │BGP:VIP │ │pods   │ │pods  │ │pods   │       │
│  └────────┘ └────────┘ └────────┘ └───────┘ └──────┘ └───────┘       │
│                                                                        │
│  Control Plane: kube-vip advertises API VIP via BGP                    │
│  Workers: Cilium advertises pod CIDRs via BGP                          │
│  Result: Any device on the network can reach any pod natively          │
└────────────────────────────────────────────────────────────────────────┘

The combination of kube-vip (for the control plane VIP) and Cilium BGP (for pod CIDRs) means that any device on my home network can reach any pod directly by its IPv6 address — no port forwarding, no ingress controllers needed for internal access. The routers handle all the routing via BGP.

Here is the view from frr on home1 to the k8s cluster:

home1.cooperlees.com# sh ip bgp sum

IPv4 Unicast Summary:
BGP router identifier 10.255.0.3, local AS number 65069 VRF default vrf-id 0
BGP table version 22
RIB entries 37, using 4736 bytes of memory
Peers 11, using 259 KiB of memory
Peer groups 3, using 192 bytes of memory

Neighbor        V         AS   MsgRcvd   MsgSent   TblVer  InQ OutQ  Up/Down State/PfxRcd   PfxSnt Desc
fd00::1         4      65001       345       352       22    0    0 00:16:00            8       17 US Hub
fd00:1::3       4      65070       352       352       22    0    0 00:16:00           12       17 home2
fd00:4::30      4      65004       323       346        0    0    0 00:15:58        NoNeg    NoNeg node-0
fd00:4::31      4      65004       323       346        0    0    0 00:15:58        NoNeg    NoNeg node-1
fd00:4::132     4      65004       323       346        0    0    0 00:15:58        NoNeg    NoNeg node-2
fd00:4::220     4      65004       322       344        0    0    0 00:15:55        NoNeg    NoNeg server
fd00:4::221     4      65004       323       346        0    0    0 00:15:57        NoNeg    NoNeg server3
fd00:4::320     4      65004       323       346        0    0    0 00:15:57        NoNeg    NoNeg server2
fd00:4::20      4      65004       351       346        0    0    0 00:15:58        NoNeg    NoNeg kube-vip-server
fd00:4::21      4      65004       351       346        0    0    0 00:15:58        NoNeg    NoNeg kube-vip-server3
fd00:4::120     4      65004       351       346        0    0    0 00:15:58        NoNeg    NoNeg kube-vip-server2

Total number of neighbors 11
home1.cooperlees.com# show ipv6 route fd00:4::/32 longer-prefixes 
Codes: K - kernel route, C - connected, L - local, S - static,
       R - RIPng, O - OSPFv3, I - IS-IS, B - BGP, N - NHRP,
       T - Table, v - VNC, V - VNC-Direct, A - Babel, F - PBR,
       f - OpenFabric, t - Table-Direct,
       > - selected route, * - FIB route, q - queued, r - rejected, b - backup
       t - trapped, o - offload failure

IPv6 unicast VRF default:
C>* fd00:4::/64 is directly connected, br-k8, weight 1, 00:16:11
L>* fd00:4::2/128 is directly connected, br-k8, weight 1, 00:16:11
B>* fd00:4::4/128 [20/0] via fd00:4::20, br-k8, weight 1, 00:16:05
  *                      via fd00:4::21, br-k8, weight 1, 00:16:05
  *                      via fd00:4::120, br-k8, weight 1, 00:16:05
S>* fd00:4:69::/56 [1/0] via fd00:4::30, br-k8, weight 1, 00:16:07
  *                      via fd00:4::31, br-k8, weight 1, 00:16:07
  *                      via fd00:4::132, br-k8, weight 1, 00:16:07
  *                      via fd00:4::220, br-k8, weight 1, 00:16:07
  *                      via fd00:4::221, br-k8, weight 1, 00:16:07
  *                      via fd00:4::320, br-k8, weight 1, 00:16:07
B>* fd00:4:69::/64 [20/0] via fd00:4::220, br-k8, weight 1, 00:16:02
B>* fd00:4:69:1::/64 [20/0] via fd00:4::320, br-k8, weight 1, 00:16:04
B>* fd00:4:69:2::/64 [20/0] via fd00:4::221, br-k8, weight 1, 00:16:04
B>* fd00:4:69:3::/64 [20/0] via fd00:4::30, br-k8, weight 1, 00:16:05
B>* fd00:4:69:4::/64 [20/0] via fd00:4::132, br-k8, weight 1, 00:16:05
B>* fd00:4:69:5::/64 [20/0] via fd00:4::31, br-k8, weight 1, 00:16:05

Phase 5: The Observability Stack

I'm a firm believer that infrastructure you can't observe is infrastructure you can't trust. From the very first commit, Prometheus monitoring was part of the design. Over the seven weeks, the observability stack evolved significantly.

The Metrics Pipeline

┌────────────────────────────────────────────────────────────────────────┐
│                    Observability Architecture                          │
│                                                                        │
│  ┌─────────────────────────────────────────────────────────────────┐   │
│  │                     Prometheus (VPS)                             │   │
│  │               prometheus.cooperlees.com                         │   │
│  │                                                                 │   │
│  │  Scrape Jobs:                                                   │   │
│  │  ┌─────────────────┬───────────────────┬──────────────────┐     │   │
│  │  │ kube-apiserver  │ kubelet           │ kubelet-cadvisor │     │   │
│  │  │ :6443/metrics   │ :10250/metrics    │ :10250/cadvisor  │     │   │
│  │  │ via VIP         │ all 6 nodes       │ all 6 nodes      │     │   │
│  │  ├─────────────────┼───────────────────┼──────────────────┤     │   │
│  │  │ etcd            │ kube-state-metrics│ CoreDNS          │     │   │
│  │  │ :2381/metrics   │ NodePort 30080    │ NodePort 30053   │     │   │
│  │  │ control planes  │ workers           │ workers          │     │   │
│  │  ├─────────────────┼───────────────────┼──────────────────┤     │   │
│  │  │ Cilium          │ Hubble            │ Kuberhealthy     │     │   │
│  │  │ :9962/metrics   │ :9965/metrics     │ NodePort 30082   │     │   │
│  │  │ all nodes       │ all nodes         │ workers          │     │   │
│  │  ├─────────────────┼───────────────────┼──────────────────┤     │   │
│  │  │ node_exporter   │ Grafana Alloy     │ monitord         │     │   │
│  │  │ :9100/metrics   │ (replaces 3 tools)│ :9740/metrics    │     │   │
│  │  │ all nodes       │ all nodes         │ all nodes        │     │   │
│  │  └─────────────────┴───────────────────┴──────────────────┘     │   │
│  └─────────────────────────────────────────────────────────────────┘   │
│                              │                                         │
│                              ▼                                         │
│                    ┌───────────────────┐                                │
│                    │   Grafana (VPS)   │                                │
│                    │   Dashboards +    │                                │
│                    │   Alerting        │                                │
│                    └───────────────────┘                                │
└────────────────────────────────────────────────────────────────────────┘

Key Observability Decisions

Grafana Alloy replaced three tools at once. I was previously running promtail (log shipping), node_exporter (OS metrics), and blackbox_exporter (synthetic probes) as separate components. Grafana Alloy is a single binary that does all three — a significant operational simplification.

kube-state-metrics exposes the state of Kubernetes objects (pods, deployments, services) as Prometheus metrics. Early on, I had a triple-counting bug because I was scraping it from each node individually instead of through the VIP. The fix was to scrape via a single NodePort endpoint.
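The resulting Prometheus job is deliberately a single target. A sketch of the scrape config; node-0's address is an arbitrary choice here, since a NodePort answers on every node:

```shell
# Sketch of the single-target kube-state-metrics job that fixed the
# triple counting: one NodePort endpoint instead of per-node scrapes.
cat > ksm-scrape.yml <<'EOF'
scrape_configs:
  - job_name: kube-state-metrics
    static_configs:
      - targets:
          - "[fd00:4::30]:30080"
EOF
```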

etcd metrics evolved too. Initially I ran a sidecar proxy (the etcd_metrics_proxy role) to expose etcd's metrics endpoint without requiring TLS client certs. Later, etcd v3.6.8 added a dedicated metrics port (2381) that serves metrics without authentication, making the proxy obsolete.

Metrics Server — the in-cluster metrics API that powers kubectl top pods and horizontal pod autoscaling — required enabling the API aggregation layer on the kube-apiserver. Another KTHW learning moment: you appreciate these features more when you have to wire them up yourself.
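For the curious, enabling the aggregation layer boils down to a handful of kube-apiserver flags. The flag names below are real; the cert paths are illustrative rather than my actual layout:

```shell
# Sketch of the kube-apiserver flags that enable the API aggregation
# layer (needed for Metrics Server and other extension APIs).
cat > apiserver-aggregation-flags.txt <<'EOF'
--requestheader-client-ca-file=/var/lib/kubernetes/front-proxy-ca.crt
--requestheader-allowed-names=front-proxy-client
--requestheader-extra-headers-prefix=X-Remote-Extra-
--requestheader-group-headers=X-Remote-Group
--requestheader-username-headers=X-Remote-User
--proxy-client-cert-file=/var/lib/kubernetes/front-proxy-client.crt
--proxy-client-key-file=/var/lib/kubernetes/front-proxy-client.key
--enable-aggregator-routing=true
EOF
```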


Phase 6: Synthetic Health Checks with Kuberhealthy

Metrics tell you what is happening. Synthetic checks tell you whether things actually work. I deployed Kuberhealthy to continuously verify cluster health by running real operations.

This one turned into an adventure. Kuberhealthy had two problems in my environment:

  1. No IPv6 support — the deployment health check pods couldn't communicate on an IPv6-only cluster
  2. Leader election bugs — the ConfigMap-based leader election would deadlock under certain conditions

I ended up forking Kuberhealthy, fixing the IPv6 issues, and rewriting the leader election to use Kubernetes Leases instead of ConfigMaps. The fork lives at ghcr.io/cooperlees/kuberhealthy:fix-leader-election. Contributing IPv6 fixes to the ecosystem became an unexpected side benefit of this project.

  • Turns out I can just configure the upstream version to use the Lease-based leader election and it all works ... I will probably clean that up

Phase 7: CoreDNS — Tuned for IPv6

CoreDNS is the cluster DNS server, and it needed special attention for IPv6-only operation.

The Tuning

  • bufsize 1232 — IPv6 has a minimum MTU of 1280 bytes. DNS responses larger than this get fragmented, and IPv6 fragmentation is notoriously unreliable. Setting the EDNS0 buffer size to 1232 ensures responses stay within a single packet.
  • autopath — reduces the number of DNS lookups for pods using the default ndots:5 search path. Instead of trying svc.cluster.local, svc.default.svc.cluster.local, etc., autopath rewrites the query to the correct FQDN on the server side.
  • Cache tuning — 5-minute success TTL, 30-second denial TTL, 9984-entry capacity
  • PodDisruptionBudget — minAvailable: 1 ensures at least one CoreDNS replica is always running during node drains
  • topologySpreadConstraints — replaced the older soft anti-affinity to spread replicas across nodes, ensuring DNS survives a single node failure
  • Forwarding — external DNS queries go to my PiHole instances (fd00:68::69 and fd00:70::69) rather than public resolvers
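Put together, the Corefile looks roughly like this. This is a sketch assembled from the settings above rather than copied from the cluster; note that autopath additionally requires `pods verified` in the kubernetes block:

```shell
# Sketch of a Corefile matching the tuning above: bufsize 1232, autopath,
# cache TTLs/capacity, and forwarding to the PiHoles.
cat > Corefile <<'EOF'
.:53 {
    errors
    health
    ready
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods verified
        fallthrough in-addr.arpa ip6.arpa
    }
    autopath @kubernetes
    bufsize 1232
    prometheus :9153
    cache {
        success 9984 300
        denial 9984 30
    }
    forward . fd00:68::69 fd00:70::69
    loop
    reload
}
EOF
```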

The Upgrade Story: Proving Hitless Operations

One of the most satisfying aspects of this project was proving that Kubernetes upgrades can be done with zero downtime — even on a "hard way" cluster without kubeadm.

The Upgrade Procedure

┌────────────────────────────────────────────────────────────────────┐
│                  Hitless Upgrade Procedure                         │
│                                                                    │
│  1. Update k8s_patch_version in group_vars/all/vars.yml            │
│                                                                    │
│  2. For each control plane node (one at a time):                   │
│     ┌──────────────────────────────────────────────────────────┐   │
│     │  $ kubectl drain server --ignore-daemonsets               │   │
│     │      ↓ Node marked unschedulable                         │   │
│     │      ↓ Existing pods evicted (respecting PDBs)           │   │
│     │                                                          │   │
│     │  $ ansible-playbook site.yaml --tags k8s_node_setup      │   │
│     │      ↓ New k8s binaries installed                        │   │
│     │      ↓ Systemd services restarted                        │   │
│     │      ↓ kube-vip withdraws BGP route during restart       │   │
│     │      ↓ ECMP routes to remaining 2 API servers            │   │
│     │                                                          │   │
│     │  $ kubectl uncordon server                               │   │
│     │      ↓ Node marked schedulable again                     │   │
│     │      ↓ kube-vip re-advertises BGP route                  │   │
│     └──────────────────────────────────────────────────────────┘   │
│                                                                    │
│  3. Repeat for worker nodes                                        │
│                                                                    │
│  Result: At no point is the cluster unavailable.                   │
│  API server: Always reachable via remaining BGP paths              │
│  Workloads: PDBs ensure minimum availability                       │
│  DNS: CoreDNS PDB ensures at least 1 replica                      │
└────────────────────────────────────────────────────────────────────┘

I proved this works by upgrading from Kubernetes 1.32 to 1.35.2 — a three-minor-version jump — with zero downtime. The BGP-based HA means there's no single point of failure. When one API server restarts, the routers route around it. PodDisruptionBudgets ensure CoreDNS and other critical workloads maintain minimum replica counts during drains.

Version Freshness Automation

To make sure components don't silently go stale, I built a version_check role that runs at the end of every Ansible playbook execution. It checks every pinned version against the latest GitHub release:

Component            Pinned    Latest   Status
Kubernetes           1.35.2    ?        Checked
etcd                 v3.6.8    ?        Checked
Cilium               1.19.1    ?        Checked
CoreDNS              1.14.1    ?        Checked
kube-vip             1.1.1     ?        Checked
kube-state-metrics   v2.18.0   ?        Checked

If anything is behind, the playbook output tells me. Combined with ansible_shed (a daemon that auto-runs the playbook every 2 hours), I get continuous drift detection for free.
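The core of such a check is one GitHub API call per component. A hedged sketch of what a single task pair might look like (variable and register names here are illustrative, not the role's actual ones):

```yaml
- name: Fetch latest etcd release tag from GitHub
  ansible.builtin.uri:
    url: https://api.github.com/repos/etcd-io/etcd/releases/latest
    return_content: true
  register: etcd_latest

- name: Flag drift between pinned and latest etcd
  ansible.builtin.debug:
    msg: "etcd pinned {{ etcd_version }}, latest {{ etcd_latest.json.tag_name }}"
  changed_when: etcd_version != etcd_latest.json.tag_name
```

One caveat: unauthenticated GitHub API calls are rate-limited, which matters for a check that runs every 2 hours.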


What I Learned

1. "The Hard Way" Pays Compound Interest

Every debugging session was faster because I understood the components. When etcd was thrashing leaders, I knew to check heartbeat intervals and disk latency — not because a runbook told me, but because I'd configured those parameters myself. When Cilium BGP wasn't advertising pod CIDRs, I knew to check the node labels because I'd written the CiliumBGPPeeringPolicy manifest by hand.
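For context, a hand-written CiliumBGPPeeringPolicy is roughly this shape (the ASNs, addresses, and labels below are placeholders, not my cluster's values):

```yaml
apiVersion: cilium.io/v2alpha1
kind: CiliumBGPPeeringPolicy
metadata:
  name: worker-bgp
spec:
  nodeSelector:
    matchLabels:
      bgp: enabled         # the policy only applies to nodes with this label
  virtualRouters:
    - localASN: 65010
      exportPodCIDR: true  # advertise this node's pod CIDR to the peer
      neighbors:
        - peerAddress: "fd00:4::1/128"
          peerASN: 65001
```

The nodeSelector is the gotcha: if a node is missing the label, the policy silently matches nothing and no routes are advertised.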

2. IPv6-Only is Production-Ready

The cluster runs entirely on IPv6 ULA addresses. The core Kubernetes components handle IPv6 well. The gaps are in the ecosystem — I had to fork Kuberhealthy for IPv6 support, and some tooling defaults to 127.0.0.1 where it should use [::1]. These are solvable problems, and contributing fixes upstream makes the ecosystem better for everyone.

3. BGP is the Right HA Primitive for Home Labs

Floating VIPs with NDP (IPv6's equivalent of ARP) are fragile. BGP ECMP is elegant: each node independently advertises routes, routers independently make forwarding decisions, and convergence is fast and well-understood. If you already run FRRouting on your home routers (and you should), adding BGP peering for Kubernetes HA is natural. In practice the combination has worked flawlessly, and as a bonus I get ECMP load balancing across my control plane nodes.
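kube-vip's BGP mode is driven by environment variables on its static pod. A sketch of the relevant excerpt (peer values are placeholders, and the exact variable names should be verified against the kube-vip docs for your version):

```yaml
# Excerpt from a kube-vip static pod manifest (BGP mode, no ARP/NDP)
env:
  - name: cp_enable        # load-balance the control plane
    value: "true"
  - name: address          # the VIP to advertise (fd00:4::4 in this cluster)
    value: "fd00:4::4"
  - name: bgp_enable
    value: "true"
  - name: bgp_as           # local ASN (AS 65004 in this cluster)
    value: "65004"
  - name: bgp_peeraddress  # the FRRouting peer (placeholder address)
    value: "fd00:4::1"
  - name: bgp_peeras
    value: "65001"
```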

4. Observability is Not Optional

From Prometheus scraping etcd metrics on day one to Hubble flow logs and Kuberhealthy synthetic checks, every layer of observability caught real problems:

  • etcd metrics revealed leader thrashing before it became user-visible
  • Kubelet cadvisor metrics showed memory pressure that led to bumping the control plane VMs to 3GB
    • Predominantly due to etcd I/O struggles
  • CoreDNS metrics exposed cache hit rates that guided cache tuning
  • Cilium Hubble metrics caught dropped packets from misconfigured pod CIDR routes
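The etcd scrape that caught the leader thrashing is only a few lines of Prometheus config. A sketch with placeholder node addresses (etcd serves metrics over plain HTTP on the port given to its --listen-metrics-urls flag):

```yaml
scrape_configs:
  - job_name: etcd
    scrape_interval: 15s
    static_configs:
      - targets:           # etcd's metrics listeners (:2381)
          - "[fd00:4::11]:2381"
          - "[fd00:4::12]:2381"
          - "[fd00:4::13]:2381"
```

Alerting on etcd_server_leader_changes_seen_total is what surfaces leader thrashing before users notice.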

5. Ansible is a Great Fit for "Hard Way" Kubernetes

The mapping from KTHW manual steps to Ansible tasks is almost 1:1. Each role corresponds to a chapter in the tutorial. The difference is that Ansible makes it repeatable and upgradeable. When Kubernetes 1.36 drops, I'll update one variable and roll it out node by node with zero downtime.
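In practice that "one variable" looks something like this (the paths, variable names, and role layout are my guesses at a typical setup, not the repo's actual contents):

```yaml
# group_vars/k8s.yaml: the single pin to bump for an upgrade
kubernetes_version: "1.35.2"

# roles/k8s_node_setup/tasks/main.yaml: binaries follow the pin
- name: Download kubelet matching the pinned version
  ansible.builtin.get_url:
    url: "https://dl.k8s.io/release/v{{ kubernetes_version }}/bin/linux/amd64/kubelet"
    dest: /usr/local/bin/kubelet
    mode: "0755"
```

Combined with a serial: 1 play (or per-host --limit runs) plus the cordon/drain flow shown earlier, that one-line change rolls through the cluster node by node.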


The Final Architecture

┌─────────────────────────────────────────────────────────────────────────┐
│                                                                         │
│                    Kubernetes Cluster Architecture                       │
│                         (March 2026)                                    │
│                                                                         │
│  ┌─── Control Plane (3x, HA) ────────────────────────────────────────┐  │
│  │                                                                   │  │
│  │  • kube-apiserver (IPv6, :6443)                                   │  │
│  │  • kube-controller-manager (leader-elected)                       │  │
│  │  • kube-scheduler (leader-elected)                                │  │
│  │  • etcd v3.6.8 (NVMe-backed, TLS, weekly defrag)                 │  │
│  │  • kube-vip 1.1.1 (BGP VIP: fd00:4::4, AS 65004)                │  │
│  │  • Tainted: NoSchedule                                            │  │
│  │                                                                   │  │
│  └───────────────────────────────────────────────────────────────────┘  │
│                                                                         │
│  ┌─── Workers (3x) ─────────────────────────────────────────────────┐  │
│  │                                                                   │  │
│  │  • kubelet + containerd                                           │  │
│  │  • Cilium 1.19.1 (native IPv6 routing, BGP pod CIDR ads)         │  │
│  │  • Hubble (flow logs, DNS visibility, HTTP metrics)               │  │
│  │                                                                   │  │
│  └───────────────────────────────────────────────────────────────────┘  │
│                                                                         │
│  ┌─── Cluster Services ─────────────────────────────────────────────┐  │
│  │                                                                   │  │
│  │  • CoreDNS 1.14.1 (2 replicas, PDB, autopath, cache tuned)       │  │
│  │  • Metrics Server v0.8.1 (kubectl top, HPA-ready)                │  │
│  │  • kube-state-metrics v2.18.0 (object metrics → Prometheus)      │  │
│  │  • Kuberhealthy (synthetic health checks, IPv6-fixed fork)        │  │
│  │                                                                   │  │
│  └───────────────────────────────────────────────────────────────────┘  │
│                                                                         │
│  ┌─── Observability ────────────────────────────────────────────────┐  │
│  │                                                                   │  │
│  │  • Grafana Alloy (node metrics, logs, probes — replaces 3 tools) │  │
│  │  • Prometheus (VPS) scraping 10+ metric endpoints                 │  │
│  │  • Grafana dashboards + alerting                                  │  │
│  │  • Cilium Hubble metrics (dns, tcp, flow, drop, http)             │  │
│  │  • etcd native metrics (:2381)                                    │  │
│  │                                                                   │  │
│  └───────────────────────────────────────────────────────────────────┘  │
│                                                                         │
│  ┌─── Networking ───────────────────────────────────────────────────┐  │
│  │                                                                   │  │
│  │  • IPv6-only: fd00:4::/64 (nodes), fd00:4:69::/56 (pods)         │  │
│  │  • BGP: kube-vip (VIP) + Cilium (pod CIDRs) → FRRouting routers  │  │
│  │  • DNS: CoreDNS → PiHole (fd00:68::69, fd00:70::69)              │  │
│  │  • No overlay, no tunnels — native IPv6 routing                   │  │
│  │                                                                   │  │
│  └───────────────────────────────────────────────────────────────────┘  │
│                                                                         │
│  ┌─── Automation ───────────────────────────────────────────────────┐  │
│  │                                                                   │  │
│  │  • Ansible: 10+ k8s roles, vault-encrypted secrets                │  │
│  │  • ansible_shed: auto-runs site.yaml every 2 hours                │  │
│  │  • version_check: flags stale component versions                  │  │
│  │  • CI: ansible-lint + syntax-check on every push                  │  │
│  │                                                                   │  │
│  └───────────────────────────────────────────────────────────────────┘  │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘

Timeline

Date        Milestone
Feb 1       First k8s commit — targeting Kubernetes 1.32
Feb 12      etcd Prometheus monitoring added
Feb 19      3-node etcd cluster, VMs across two sites
Feb 24-25   etcd and control plane moved to IPv6-only
Feb 25      kube-vip + BGP ECMP for API server HA
Feb 26      Cilium CNI deployed with Hubble
Feb 26      CoreDNS v1.14.1 for IPv6-only DNS
Feb 26      etcd upgraded from v3.6.0-rc.3 to v3.6.8
Feb 27      Kubernetes upgraded from 1.32 to 1.35.2
Feb 27      version_check role for freshness tracking
Feb 28      Cilium BGP pod CIDR advertising on workers
Mar 1-5     Grafana Alloy replaces promtail + node_exporter + blackbox
Mar 2       Kuberhealthy deployed (with IPv6 fork)
Mar 6       Metrics Server + API aggregation layer
Mar 10-16   NVMe etcd disks, I/O tuning, btrfs COW fix
Mar 17      CoreDNS tuning: PDB, topology spread, bufsize, autopath, cache
Mar 18      kube-state-metrics Ansible role
Mar 20      kube-vip 1.1.1, jumpbox removed — cluster is self-sufficient

What's Next

The cluster is stable, monitored, and upgradeable. Future plans include:

  • Running real workloads — the infrastructure is ready; time to deploy applications
  • Exploring Cilium network policies — eBPF-based microsegmentation
  • Testing chaos scenarios — killing nodes, partitioning networks, corrupting etcd
  • Contributing IPv6 fixes upstream — the Kuberhealthy fork is just the start

Seven weeks ago, I had never run a Kubernetes cluster. Today I have a production-grade, IPv6-only, BGP-routed, fully automated cluster that I understand down to every certificate and systemd unit file. That's the power of doing it the hard way first — and then automating what you've learned.


All infrastructure code is managed with Ansible. The cluster runs on 6 KVM virtual machines across 2 hypervisors, using approximately 15GB of RAM total. No cloud services were harmed in the making of this cluster.
