I get the sense that Kubernetes was not designed for stateful services, although it can run them. I once watched a video in which a respected retired Google engineer who used to operate Kubernetes said, “People want to put everything inside Kubernetes, trying to build a cloud inside the cloud itself, and that doesn’t make any sense.” That really hit me.
Architecting Stateful Services in K8s: From Local NVMe to RDMA-Powered Block Storage
In the world of cloud-native infrastructure, the “Data Gravity” problem remains the final frontier. While Kubernetes (K8s) has perfected the orchestration of stateless microservices, deploying high-performance stateful workloads—like a self-managed MySQL cluster—presents a unique set of challenges involving identity, persistence, and I/O latency.
I recently had a conversation with a former colleague who now works at Shopee, building a self-managed RDS inside EKS (Elastic Kubernetes Service). It is cool and challenging work.
This post explores the evolution of storage strategies for K8s stateful services, moving from standard local disks to the cutting edge of RDMA-accelerated network storage.
1. The Foundation: StatefulSet and Pod Identity
When deploying a database in K8s, the first hurdle is ensuring that a Pod maintains its identity across restarts. Unlike a Deployment, a StatefulSet provides:
- Stable Network ID: Each Pod gets a sticky DNS name (e.g., mysql-0.db-service) via a Headless Service.
- Ordered Deployment: Pods are created in order from ordinal 0 to N-1, ensuring predictable startup sequences.
- Volume Retention: When a Pod is rescheduled, it automatically re-attaches to its specific Persistent Volume (PV).
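The identity guarantees above follow fixed naming conventions, which a short Python sketch can make concrete. The StatefulSet name `mysql` and Service name `db-service` come from the example above; the `default` namespace, the `cluster.local` cluster domain, the claim template name `data`, and the helper `stateful_identities` are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PodIdentity:
    pod_name: str   # stable Pod name: <statefulset>-<ordinal>
    dns_name: str   # stable DNS record served through the headless Service
    pvc_name: str   # claim the Pod re-binds to: <template>-<statefulset>-<ordinal>


def stateful_identities(statefulset, service, replicas,
                        namespace="default", claim_template="data",
                        cluster_domain="cluster.local"):
    """Return identities in the order the controller creates the Pods (0..N-1)."""
    identities = []
    for ordinal in range(replicas):
        pod = f"{statefulset}-{ordinal}"
        identities.append(PodIdentity(
            pod_name=pod,
            dns_name=f"{pod}.{service}.{namespace}.svc.{cluster_domain}",
            pvc_name=f"{claim_template}-{pod}",
        ))
    return identities


if __name__ == "__main__":
    for identity in stateful_identities("mysql", "db-service", replicas=3):
        print(identity.pod_name, identity.dns_name, identity.pvc_name)
```

Because the names depend only on the ordinal, a replacement `mysql-1` Pod resolves at the same DNS name and binds to the same claim as its predecessor.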
2. Local SSD: The Performance King vs. The Operational Burden
For high-throughput databases, Local Persistent Volumes (Local PVs) utilizing NVMe SSDs offer the lowest possible latency. However, they introduce a “Strict Coupling” problem.
Key Challenges:
- Node Affinity: A Pod becomes “pinned” to a specific physical node. If that node goes down, the Pod cannot start elsewhere because its data is physically trapped on that node's disk.
- Data Integrity: K8s does not replicate data at the storage layer for Local PVs. You must implement high availability at the application layer (e.g., MySQL Group Replication or Galera).
- Node Maintenance: Draining a node for kernel updates requires a full database rebuild on a new node, which is time-consuming and risky for large datasets.
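The pinning described above comes from the node affinity that every Local PV has to declare. The following is a minimal sketch, not a production manifest: the PV name, node names, mount path, capacity, and the `local-nvme` storage class are hypothetical, and `schedulable_nodes` is a deliberately simplified stand-in for what the scheduler does.

```python
import json


def local_pv_manifest(name, node_name, path, capacity, storage_class="local-nvme"):
    """Build a PersistentVolume manifest for a local disk, pinned to one node."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolume",
        "metadata": {"name": name},
        "spec": {
            "capacity": {"storage": capacity},
            "volumeMode": "Filesystem",
            "accessModes": ["ReadWriteOnce"],
            "persistentVolumeReclaimPolicy": "Retain",
            "storageClassName": storage_class,
            "local": {"path": path},
            # Local PVs must declare node affinity; this is the source of the pinning.
            "nodeAffinity": {
                "required": {
                    "nodeSelectorTerms": [{
                        "matchExpressions": [{
                            "key": "kubernetes.io/hostname",
                            "operator": "In",
                            "values": [node_name],
                        }]
                    }]
                }
            },
        },
    }


def schedulable_nodes(pv, ready_nodes):
    """Simplified model: ready nodes on which a Pod using this PV could run."""
    allowed = set()
    for term in pv["spec"]["nodeAffinity"]["required"]["nodeSelectorTerms"]:
        for expr in term["matchExpressions"]:
            if expr["key"] == "kubernetes.io/hostname" and expr["operator"] == "In":
                allowed.update(expr["values"])
    return [node for node in ready_nodes if node in allowed]


if __name__ == "__main__":
    pv = local_pv_manifest("mysql-0-data", "node-a", "/mnt/nvme0", "500Gi")
    print(json.dumps(pv, indent=2))  # kubectl apply -f also accepts JSON
    print(schedulable_nodes(pv, ["node-a", "node-b"]))  # node-a is healthy
    print(schedulable_nodes(pv, ["node-b"]))            # node-a is down
```

With `node-a` out of the ready set, the second call returns an empty list: there is nowhere the Pod can legally land until the node returns or the data is rebuilt elsewhere.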
3. Remote Block Storage: The Quest for Elasticity
To solve the coupling issue, we move to Network Block Storage (EBS, Ceph, or SAN). This separates compute from storage, allowing a Pod to drift across the cluster while its data follows.
The Latency Wall
The traditional TCP/IP stack introduces significant overhead:
- Context Switching: Moving data between User Space and Kernel Space.
- Protocol Overhead: Encapsulating block commands into TCP packets.
- Network Jitter: Standard Ethernet is “lossy,” leading to retransmissions and unpredictable Tail Latency (P99).
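To see why rare retransmissions matter more for P99 than for the median, here is a deterministic toy model. Every number in it is an assumption for illustration, not a measurement: a 0.5 ms base round trip, a packet loss on every 50th request (2%), and a 200 ms retransmission timeout.

```python
def percentile(samples, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    ordered = sorted(samples)
    rank = (pct * len(ordered) + 99) // 100  # integer ceil(pct * n / 100)
    return ordered[max(rank, 1) - 1]


def simulate_latencies(n_requests, base_ms, retransmit_every, rto_ms):
    """Toy model: every `retransmit_every`-th request waits one retransmission timeout."""
    return [
        base_ms + (rto_ms if i % retransmit_every == 0 else 0.0)
        for i in range(1, n_requests + 1)
    ]


# Assumed, illustrative parameters (not benchmark results).
latencies = simulate_latencies(n_requests=1000, base_ms=0.5,
                               retransmit_every=50, rto_ms=200.0)
p50 = percentile(latencies, 50)
p99 = percentile(latencies, 99)

if __name__ == "__main__":
    print(f"P50 = {p50} ms, P99 = {p99} ms")
```

In this model the median stays at the base latency while the P99 absorbs the full timeout, which is the pattern a database sees as occasional stalled commits rather than uniformly slower I/O.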
4. The Game Changer: RDMA and NVMe-over-Fabrics
To bridge the performance gap between Local SSDs and Network Storage, we look to RDMA (Remote Direct Memory Access).
Beyond Zero-Copy
RDMA isn’t just about reducing memory copies; it’s a complete hardware-software bypass:
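- Kernel Bypass: Applications communicate directly with the NIC (RNIC), skipping the OS kernel on the data path. The kernel is still involved in setup steps such as creating connections and registering memory.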
- CPU Offload: The NIC hardware handles packet ordering, flow control, and retransmissions, freeing the CPU for database queries.
- Lossless Ethernet (RoCE v2): By using hardware-level protocols like PFC (Priority Flow Control), the network becomes “lossless,” preventing the drops that plague standard TCP.
NVMe-over-Fabrics (NVMe-oF)
While iSCSI was designed for the era of spinning disks, NVMe-oF is designed for the age of flash. It maps the massive parallelism of NVMe (up to 64K queues) directly over an RDMA fabric.
- Result: You get the flexibility of remote storage with a latency penalty on the order of microseconds compared to local NVMe, a far cry from the millisecond-level delays of traditional network storage.
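On a Linux host, attaching a remote NVMe-oF namespace is commonly done with `nvme-cli`. The sketch below only builds the command line and does not execute it. The target address and subsystem NQN are placeholders, and `nvme_connect_argv` is a hypothetical helper; port 4420 is the standard NVMe-oF service port.

```python
import shlex


def nvme_connect_argv(target_addr, subsystem_nqn, port=4420, transport="rdma"):
    """Build (but do not run) an nvme-cli command that attaches an NVMe-oF subsystem."""
    if transport not in {"rdma", "tcp", "fc", "loop"}:
        raise ValueError(f"unsupported transport: {transport}")
    return [
        "nvme", "connect",
        f"--transport={transport}",
        f"--traddr={target_addr}",
        f"--trsvcid={port}",
        f"--nqn={subsystem_nqn}",
    ]


if __name__ == "__main__":
    # Placeholder address (documentation range) and a made-up NQN.
    argv = nvme_connect_argv("192.0.2.10", "nqn.2014-08.org.example:mysql-data")
    print(shlex.join(argv))
```

Once connected, the remote namespace shows up as an ordinary `/dev/nvmeXnY` block device. In a Kubernetes cluster this step would normally be carried out by a CSI driver on the node rather than by hand.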
5. Architectural Summary: The Modern Choice
| Feature | Local NVMe SSD | Standard Block (TCP) | RDMA + NVMe-oF |
|---|---|---|---|
| Latency | Lowest | Highest (millisecond-level) | Near-local |
| Mobility | Low (Pinned to Node) | High | High |
| Complexity | High (App-level HA) | Low | Moderate (HW/Network config) |
| Best For | Extreme IOPS / TempDB | General Purpose | High-perf Production DBs |
Conclusion
For a self-managed K8s database cluster, the path forward is clear. While Local PVs remain the choice for raw speed, the operational complexity of node pinning is a significant drawback.
The industry is rapidly shifting toward a Disaggregated Storage model. By leveraging RoCE v2 and NVMe-oF, we can finally achieve the “Holy Grail” of K8s stateful services: the performance of a local disk with the resilience and elasticity of the cloud.