The Kubernetes Skills Nobody Tests You On (But Everyone Needs in Production)
You've passed your CKA with flying colors. Your CKAD badge shines on LinkedIn. You can explain the nuances of CustomResourceDefinitions while half-asleep. But then day one hits in production, and suddenly you're staring at a cluster that's been running for two years, spewing errors you've never seen, with no documentation and a Slack channel blowing up with "Is it fixed yet?" messages.
Certifications validate theory. Production demands survival skills. They're like learning chess rules versus playing in a tournament where your opponent cheats, the board tilts randomly, and half the pieces have undocumented special moves. This blog dives deep into the untested skills that separate certificate holders from battle-hardened Kubernetes operators—the ones that keep clusters alive when everything else fails.
Mastering Log Archaeology
Kubernetes exams drill kubectl logs, kubectl describe, and basic pod states into your brain. CrashLoopBackOff? ImagePullBackOff? You got this. But production logs are a different beast: terabytes of noise from unfamiliar apps, cryptic vendor-specific errors, and multi-layer failures spanning application, kubelet, container runtime, and kernel.
The real art lies in correlation. A seemingly healthy pod reports "connection refused," but digging reveals the kubelet evicted a sidecar 30 seconds earlier due to ephemeral storage pressure. You trace this through journalctl -u kubelet, cross-reference with CRI-O logs (crictl logs), and spot the kernel dmesg entry showing inode exhaustion. No exam teaches this chain-of-custody debugging.
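Here's roughly what that correlation looks like on a node running systemd and CRI-O (unit names, paths, and the placeholder container names will vary in your environment):

```bash
# Walk the stack bottom-up on the affected node.
# 1. Kernel: storage/inode pressure and OOM activity.
dmesg -T | grep -Ei 'no space|inode|out of memory' | tail -20

# 2. Kubelet: confirm the eviction and its stated reason (timestamps matter).
journalctl -u kubelet --since "30 min ago" | grep -Ei 'evict|ephemeral|DiskPressure'

# 3. Container runtime: pull the last logs of the evicted sidecar.
crictl ps -a | grep <sidecar-name>          # find the dead container's ID
crictl logs --tail 100 <container-id>

# 4. Back at the cluster level: line up node events with the pod's timeline.
kubectl get events --field-selector involvedObject.kind=Node --sort-by=.lastTimestamp
```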
Consider a real scenario: your API gateway logs sporadic 5xx errors. Pod logs show clean HTTP 200s. The network team blames you. The culprit? tcpdump on the node captures fragmented packets dropped by iptables thanks to an MTU mismatch between the Calico VXLAN overlay (typically 1450 on a standard 1500-byte link) and the AWS VPC's jumbo frames (9001 MTU). Fix: adjust the CNI MTU and node sysctls. This multi-tool detective work—fluency in strace, bpftrace, Falco, and log aggregators like Loki or the ELK stack—takes years to hone.
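A sketch of confirming that diagnosis from the node—interface names, and the assumption that Calico is installed via the Tigera operator, are specific to your environment:

```bash
# Path MTU check: a DF-flagged ping sized to fit a 1500-byte link (1472 + 28 bytes of headers).
ping -M do -s 1472 -c 3 <peer-pod-or-service-ip>

# Watch the node for fragments and ICMP "frag needed" messages.
sudo tcpdump -ni any 'icmp[icmptype] == icmp-unreach or (ip[6:2] & 0x3fff != 0)' -c 20

# Compare MTUs on the VPC-facing NIC vs. the overlay interfaces.
ip -o link show | awk '{print $2, $5}' | grep -E 'eth|ens|vxlan|cali'

# Calico via the Tigera operator keeps the MTU on the Installation resource
# (the value here is illustrative -- size it for your encapsulation overhead).
kubectl patch installation default --type merge -p '{"spec":{"calicoNetwork":{"mtu":1450}}}'
```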
Pro tip: Build a mental model of failure propagation. Apps lie; the kernel tells the truth. Always start at the bottom (dmesg, auditd) and work up.
Demystifying Network Chaos
Exams cover Services (ClusterIP, NodePort, LoadBalancer) and basic NetworkPolicies. You know the kube-proxy modes (iptables, IPVS). Production? Packets vanish into the ether, DNS flakes under load, and east-west traffic chokes nodes.
Intuition for traffic flows separates juniors from seniors. Trace a request: Ingress → Istio Envoy → Service → Pod IP → CNI overlay (Flannel VXLAN? Cilium eBPF?). Overlapping CIDRs? Traffic for pod 10.244.1.10 gets routed to the local host at 10.244.1.1 instead of the peer pod. Debug with ip route, conntrack -L, and cilium monitor.
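A minimal trace from the node's point of view (the pod and Service IPs are placeholders; the last step assumes Cilium):

```bash
# Which route does the node actually pick for the destination pod IP?
ip route get 10.244.1.10

# Is there conntrack state for the flow already?
conntrack -L -d 10.244.1.10 2>/dev/null | head

# Which kube-proxy chains does the Service resolve through? (iptables mode)
iptables-save | grep -E 'KUBE-(SERVICES|SVC|SEP)' | grep <service-cluster-ip>

# On Cilium, watch datapath verdicts live instead of guessing.
kubectl -n kube-system exec ds/cilium -- cilium monitor --type drop
```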
Load spikes reveal DNS thundering herds: 1,000 pods query CoreDNS simultaneously for "myservice.default.svc.cluster.local," overwhelming the two-replica deployment. Scale CoreDNS? Yes—and add NodeLocal DNSCache and sane ndots settings so each lookup doesn't fan out across every search domain; save Headless Services and ExternalDNS for the discovery patterns that actually need them.
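One way to quantify the herd before reaching for fixes—the metrics port is CoreDNS's default, the replica count and pod name are illustrative:

```bash
# CoreDNS exposes Prometheus metrics on :9153 by default.
kubectl -n kube-system port-forward deploy/coredns 9153:9153 &
sleep 2
curl -s localhost:9153/metrics | grep -E 'coredns_dns_requests_total|coredns_cache_(hits|misses)_total' | head

# If the cache is healthy but raw QPS is the problem, add replicas
# (or let cluster-proportional-autoscaler own this number).
kubectl -n kube-system scale deploy/coredns --replicas=4

# And check how many lookups each client fires per name: the default ndots:5
# turns a single query into one per search domain.
kubectl exec <some-pod> -- cat /etc/resolv.conf
```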
MTU hell strikes at the edges: pods talk fine internally (1450 MTU), but the AWS NLB → EKS node → pod path drops 1500-byte packets. ping -M do -s 1472 confirms it. Fix: CNI plugins like Calico support MTU auto-detection—or set it explicitly.
Service meshes amplify the complexity. Istio's Envoy sidecars proxy everything, and the Envoy access logs (streamed to the sidecar's stdout, if enabled) reveal mTLS handshake failures caused by expired certificates. No exam covers rotating Istio's CA or digging through Envoy's proxy stats.
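If istioctl is on your path, the sidecar's own view of its certificates is a few commands away (pod and namespace are placeholders):

```bash
# Are all sidecars synced with the control plane?
istioctl proxy-status

# Inspect the workload certificate Envoy is actually serving -- check the expiry.
istioctl proxy-config secret <pod-name> -n <namespace>

# Envoy's own TLS failure counters, via the sidecar's admin endpoint.
kubectl exec <pod-name> -n <namespace> -c istio-proxy -- \
  pilot-agent request GET stats | grep -E 'ssl.handshake|ssl.connection_error'
```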
Mastery means thinking like a packet: entry point, CNI encapsulation, iptables traversal, kernel forwarding, egress NAT. Tools: k9s for a live view of the cluster, tshark for deep packet inspection, Hubble (on Cilium) for flow visibility.
Capacity Planning Wizardry
Requests and limits? Exams say set CPU 100m, memory 128Mi. Production laughs: one Java app's GC pauses for 30 seconds, starving latency-sensitive neighbors. The true skill: profiling real usage across percentiles.
Vertical Pod Autoscaler (VPA) recommendations? Use them as a starting point, not gospel. Prometheus queries reveal p99 CPU at 750m during spikes, yet sizing requests for the peak wastes nodes the other 95% of the time. Solution: set requests near steady state and let pods burst above them (in-place pod resizing, alpha since 1.27, helps here), plus node affinity to isolate noisy workloads.
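A sketch of those profiling queries against the Prometheus HTTP API—the service URL and label selectors are assumptions about your setup:

```bash
PROM=http://prometheus.monitoring:9090   # assumed in-cluster Prometheus endpoint

# p99 CPU for one workload over the last day (5m rate windows via a subquery).
curl -sG "$PROM/api/v1/query" --data-urlencode \
  'query=quantile_over_time(0.99, rate(container_cpu_usage_seconds_total{namespace="prod", pod=~"api-.*"}[5m])[1d:5m])'

# Requested vs. actually used CPU, per namespace -- the gap is your waste.
curl -sG "$PROM/api/v1/query" --data-urlencode \
  'query=sum by (namespace) (kube_pod_container_resource_requests{resource="cpu"}) - sum by (namespace) (rate(container_cpu_usage_seconds_total[5m]))'
```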
Cloud quirks matter: EKS Fargate bills per vCPU-hour yet still throttles at the configured limit—profile with the cAdvisor CPU throttling metrics. GKE's regional persistent disks share IOPS across zones; oversubscribe them and watch latencies explode.
Node sizing wars: 100 m5.large nodes (2 vCPU) versus 10 m5.4xlarge (16 vCPU)? Small nodes maximize bin-packing precision, but control-plane and per-node overhead (kubelet heartbeats, CNI routes, DaemonSet copies) climbs past a hundred nodes. Large nodes reduce that overhead, but a single node failure wipes out 10% of capacity. Be data-driven: track kubectl top nodes trends and size kube-system add-ons with the cluster-proportional-autoscaler.
Predict failures: around 80% memory utilization the OOM killer looms (victims scored via oom_score_adj). Alert when node_memory_MemAvailable_bytes drops below ~100MB per node. Proactive option: the Descheduler can evict and rebalance long-running pods before pressure peaks.
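A quick pressure check, assuming jq on your workstation and shell access to the suspect node:

```bash
# Which nodes already report MemoryPressure?
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="MemoryPressure")].status}{"\n"}{end}'

# Current headroom per node.
kubectl top nodes

# On a suspect node: has the OOM killer already fired, and at whom?
dmesg -T | grep -i 'killed process' | tail -5

# Which pods were its victims, cluster-wide?
kubectl get pods -A -o json | jq -r '.items[]
  | select(.status.containerStatuses[]?.lastState.terminated.reason == "OOMKilled")
  | "\(.metadata.namespace)/\(.metadata.name)"'
```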
Tool stack: Kubecost for cost allocation, Goldilocks as a dashboard over VPA recommendations, Keptn for SLO-driven delivery. The real art: balancing overprovisioning (5-10% headroom) against cost control.
Thriving in Documentation Deserts
Inherited clusters: Custom operators in Go, Helm charts with 50 values.yaml overrides, ArgoCD apps pointing to unmerged PRs. No README. Panic? No—systematic recon.
Step 1: kubectl get crd -o yaml | grep -i operator reveals extensions. Dive into operator logs, then GitHub (common naming). Step 2: helm list -A + helm get values decodes configs. Step 3: kubectl get networkpolicy --all-namespaces -o yaml maps isolation.
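Condensed into a rough, read-only first-hour recon script (assumes helm and jq are available):

```bash
#!/usr/bin/env bash
# First-hour recon on an inherited cluster: read-only, everything dumped to a scratch dir.
set -euo pipefail
OUT=recon-$(date +%F); mkdir -p "$OUT"

# 1. What custom machinery is installed?
kubectl get crd -o name > "$OUT/crds.txt"
kubectl get pods -A -o wide | grep -Ei 'operator|controller' > "$OUT/operators.txt" || true

# 2. How is it configured?
helm list -A > "$OUT/helm-releases.txt"
while read -r name ns; do
  helm get values "$name" -n "$ns" > "$OUT/values-$ns-$name.yaml" || true
done < <(helm list -A -o json | jq -r '.[] | "\(.name) \(.namespace)"')

# 3. What talks to what, and what is allowed to?
kubectl get networkpolicy -A -o yaml > "$OUT/netpol.yaml"
kubectl get ingress,svc -A -o wide > "$OUT/entrypoints.txt"
```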
Undocumented dependencies? strace -p <pid> on a crashing pod exposes library mismatches. Legacy PodSecurityPolicies (removed in 1.25)? On older clusters, kubectl get psp lists the permissive policies that will block your upgrade.
Temporary fixes first: kubectl debug ephemeral containers for introspection, kubectl patch for hotfixes. Permanent: Incremental refactoring—migrate one namespace at a time.
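An ephemeral-container sketch for poking at a distroless or crashing pod—names, images, and the patch payload are placeholders:

```bash
# Attach a throwaway debug container that shares the target container's process namespace.
kubectl debug -it pod/payments-7c9f --image=busybox:1.36 --target=app -- sh

# Inside it, inspect the real process without restarting anything:
#   ps aux
#   ls /proc/1/root/etc
#   cat /proc/1/environ | tr '\0' '\n'

# Hotfix while the real fix goes through review: patch the image tag in place.
kubectl patch deploy payments --type=json \
  -p='[{"op":"replace","path":"/spec/template/spec/containers/0/image","value":"registry.example.com/payments:known-good"}]'
```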
Mindset shift: comfort in chaos. Ask "What changed last?"—Git history, plus the API server audit log if auditing is configured (it lands wherever the audit backend points, not automatically in kubectl logs). Leave breadcrumbs: PRs that pair fixes with docs.
Security: From Checkboxes to Battlements
CKS tests RBAC audits and Kyverno policies. The production threat model: insider devs pushing malicious images, runtime exploits like Dirty COW, etcd breaches.
Image pipeline: Cosign signatures plus Trivy scanning in CI; reject unsigned images at admission. Runtime: Falco rules catch things like a chmod +s /bin/sh exec inside a container.
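A minimal CI sketch; the registry path and keyless (GitHub OIDC) signing are assumptions about your pipeline:

```bash
# CI stage: fail the build on critical CVEs, then sign what survives.
trivy image --exit-code 1 --severity CRITICAL,HIGH registry.example.com/app:${GIT_SHA}
cosign sign --yes registry.example.com/app:${GIT_SHA}        # keyless (OIDC) signing

# Verification gate before deploy -- the same check a Kyverno verifyImages
# policy would enforce at admission.
cosign verify \
  --certificate-identity-regexp 'https://github.com/your-org/.*' \
  --certificate-oidc-issuer https://token.actions.githubusercontent.com \
  registry.example.com/app:${GIT_SHA}
```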
Secrets anti-pattern: base64 blobs sitting in etcd. Migrate to the External Secrets Operator pulling from Vault or AWS SSM Parameter Store, or the Secrets Store CSI driver for direct pod injection.
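The ExternalSecret shape looks roughly like this (the API version, store name, and backend paths are assumptions about your install):

```bash
kubectl apply -f - <<'EOF'
# Pulls a value from the configured backend and materializes a regular Secret.
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: payments-db
  namespace: prod
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: vault-backend          # assumed, pre-existing (Cluster)SecretStore
    kind: ClusterSecretStore
  target:
    name: payments-db            # the Kubernetes Secret that gets created
  data:
    - secretKey: password
      remoteRef:
        key: prod/payments/db    # path in Vault / SSM
        property: password
EOF
```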
Sandboxing: gVisor for untrusted workloads (expect a 10-20% performance hit), Kata Containers for VM-level isolation. Pod Security Admission enforces the baseline or restricted profiles per namespace.
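Assuming the runsc handler is already installed on your nodes, the wiring is a RuntimeClass plus one pod field (image name is a placeholder):

```bash
kubectl apply -f - <<'EOF'
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: gvisor
handler: runsc           # must match the handler configured in containerd/CRI-O
---
apiVersion: v1
kind: Pod
metadata:
  name: untrusted-tool
spec:
  runtimeClassName: gvisor
  containers:
    - name: tool
      image: registry.example.com/untrusted-tool:latest
EOF
```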
Audit everything: OPA Gatekeeper for admission control, audit2rbac to generate least-privilege roles from audit logs. Compliance: Falco events shipped to OpenSearch for SOC 2 evidence (retain 90 days).
Human factor: GitOps RBAC limits PR merges to approvers. OPA policies reject privileged pods.
Navigating Organizational Minefields
Devs demand MongoDB in-cluster (resist: use Atlas). The CTO wants a 99.99% SLA on a $10k/month budget. Upgrades: dual-stack IPv6? Test on canary nodes first.
Cost wars: Karpenter versus Cluster Autoscaler—Karpenter provisions right-sized spot capacity faster, but spot interruptions can cascade. Balance: roughly 70% spot, 30% on-demand, with interruption queues wired up.
Politics: translate "pods pending" into "we need $2k/month more in nodes." Document tradeoffs: "The service mesh adds 15% latency but gives us mTLS everywhere."
Simplicity as Superpower
Premature Istio? Skip unless >50 services. Kustomize > Helm for <100 manifests. Measure: If debugging > deploying, simplify.
Eternal Vigilance: Staying Current
1.31 removed the last in-tree cloud providers; PodSecurityPolicy went away back in 1.25. Track what's next via kubeadm upgrade plan and the deprecated API migration guide. Cloud-specific: EKS requires upgrading the control plane before your node groups.
Community: KubeCon talks > hype blogs. Experiment: kind clusters for PoCs.
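Spinning up a throwaway multi-node cluster pinned to the release you want to poke at costs a couple of minutes (the node image tag is whatever version you're evaluating):

```bash
# Multi-node throwaway cluster pinned to the version under evaluation.
cat <<'EOF' > kind-poc.yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
  - role: worker
  - role: worker
EOF
kind create cluster --name poc --config kind-poc.yaml --image kindest/node:v1.31.0
kubectl cluster-info --context kind-poc

# Tear it down when the experiment is done.
kind delete cluster --name poc
```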
Performance Autopsy Skills
Slow despite ample resources? kubectl top pod --containers hides CFS throttling—check container_cpu_cfs_throttled_seconds_total. Memory: compare container_memory_working_set_bytes against RSS; the working set is what eviction and the OOM killer act on.
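The cgroup counters kubectl top hides, queried against Prometheus (URL and pod label are assumptions):

```bash
PROM=http://prometheus.monitoring:9090   # assumed Prometheus endpoint

# Fraction of CFS periods in which the container was throttled over the last 5m.
curl -sG "$PROM/api/v1/query" --data-urlencode \
  'query=rate(container_cpu_cfs_throttled_periods_total{pod="api-0"}[5m]) / rate(container_cpu_cfs_periods_total{pod="api-0"}[5m])'

# Working set (what eviction and the OOM killer act on) vs. resident set size.
curl -sG "$PROM/api/v1/query" --data-urlencode \
  'query=container_memory_working_set_bytes{pod="api-0"} - container_memory_rss{pod="api-0"}'
```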
I/O: iostat -x 1 on the nodes backing your PersistentVolumes. Network: ss -m to inspect per-socket buffer memory.
Kernel: sysctl net.core.somaxconn=4096, raise fs.file-max. Tune iteratively and re-measure after every change.
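The node-level pass, run on the host—the sysctl values are starting points, not gospel:

```bash
# Disk: per-device utilization and await on the node hosting the PersistentVolume.
iostat -x 1 5

# Sockets: listen-queue overflows and per-socket buffer memory.
nstat -az TcpExtListenOverflows
ss -ltm          # -m prints skmem (socket memory) per listening socket

# Kernel knobs: raise the accept backlog and file-handle ceiling, then re-measure.
sysctl -w net.core.somaxconn=4096
sysctl -w fs.file-max=2097152
```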
Aligning with Kubernetes Philosophy
Fighting stateful workloads? Use StatefulSets plus operators. Long-running jobs? Use the Job API instead of bare pods. Embrace declarative: GitOps everything.
Bridging the Chasm
Certifications open doors; these skills keep you employed. Production forges them—volunteer for on-call, build side projects, contribute upstream. The untested skills turn theory into mastery.