Stateful Services in K8s: From Local NVMe to RDMA-Powered Block Storage

I sense that Kubernetes was not designed for stateful services, although it can run them. I once watched a video in which a respected, retired Google engineer who had operated Kubernetes said, “People want to put everything inside Kubernetes, trying to build a cloud inside the cloud itself. That doesn’t make any sense.” That really hit me.

Architecting Stateful Services in K8s: From Local NVMe to RDMA-Powered Block Storage

In the world of cloud-native infrastructure, the “Data Gravity” problem remains the final frontier. While Kubernetes (K8s) has perfected the orchestration of stateless microservices, deploying high-performance stateful workloads—like a self-managed MySQL cluster—presents a unique set of challenges involving identity, persistence, and I/O latency.

I had a conversation with a former colleague, now at Shopee, who is building a self-managed RDS inside EKS (Elastic Kubernetes Service). It is cool and challenging work.

This post explores the evolution of storage strategies for K8s stateful services, moving from standard local disks to the cutting edge of RDMA-accelerated network storage.

1. The Foundation: StatefulSet and Pod Identity

When deploying a database in K8s, the first hurdle is ensuring that a Pod maintains its identity across restarts. Unlike a Deployment, a StatefulSet provides:

Stable Network ID: Each Pod gets a sticky DNS name (e.g., mysql-0.db-service) via a Headless Service.
Ordered Deployment: Pods are created in order from $0$ to $N-1$, ensuring predictable startup sequences.
Volume Retention: When a Pod is rescheduled, it automatically re-attaches to its specific Persistent Volume (PV).
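To make the three guarantees above concrete, here is a minimal Python sketch that builds a Headless Service and a StatefulSet manifest as plain dictionaries and derives the identity each Pod would get. The names `mysql`, `db-service`, `fast-ssd`, the `mysql:8.0` image, and the helper functions are assumptions for illustration only; the manifest fields themselves are standard Kubernetes API fields.

```python
def headless_service(name: str, app: str, port: int) -> dict:
    """A Service with clusterIP "None" gives each Pod its own DNS record."""
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name},
        "spec": {
            "clusterIP": "None",  # this is what makes the Service headless
            "selector": {"app": app},
            "ports": [{"name": "mysql", "port": port}],
        },
    }


def statefulset(name: str, service_name: str, replicas: int,
                storage_class: str, size: str) -> dict:
    """A StatefulSet with one volumeClaimTemplate, so each Pod gets its own PVC."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "StatefulSet",
        "metadata": {"name": name},
        "spec": {
            "serviceName": service_name,
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [{
                    "name": "mysql",
                    "image": "mysql:8.0",  # hypothetical image tag
                    "ports": [{"containerPort": 3306}],
                    "volumeMounts": [{"name": "data",
                                      "mountPath": "/var/lib/mysql"}],
                }]},
            },
            "volumeClaimTemplates": [{
                "metadata": {"name": "data"},
                "spec": {
                    "accessModes": ["ReadWriteOnce"],
                    "storageClassName": storage_class,
                    "resources": {"requests": {"storage": size}},
                },
            }],
        },
    }


def pod_identities(sts: dict) -> list:
    """Return (pod name, in-namespace DNS name, PVC name) per ordinal 0..N-1."""
    name = sts["metadata"]["name"]
    svc = sts["spec"]["serviceName"]
    claim = sts["spec"]["volumeClaimTemplates"][0]["metadata"]["name"]
    return [
        (f"{name}-{i}", f"{name}-{i}.{svc}", f"{claim}-{name}-{i}")
        for i in range(sts["spec"]["replicas"])
    ]


if __name__ == "__main__":
    sts = statefulset("mysql", "db-service", 3, "fast-ssd", "100Gi")
    for pod, dns, pvc in pod_identities(sts):
        print(pod, dns, pvc)
```

The PVC name follows the pattern of claim template name, StatefulSet name, and ordinal, which is why a rescheduled `mysql-0` finds the same claim again.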

2. Local SSD: The Performance King vs. The Operational Burden

For high-throughput databases, Local Persistent Volumes (Local PVs) utilizing NVMe SSDs offer the lowest possible latency. However, they introduce a “Strict Coupling” problem.

Key Challenges:

Node Affinity: A Pod becomes “pinned” to a specific physical node. If that node goes down, the Pod cannot start elsewhere because its data is physically trapped.
Data Integrity: K8s does not replicate data at the storage layer for Local PVs. You must implement high availability at the application layer (e.g., MySQL Group Replication or Galera).
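Node Maintenance: Draining a node for kernel updates leaves the Pod unschedulable until that same node comes back. If the node is replaced rather than rebooted, the database replica must be fully rebuilt on a new node, which is time-consuming and risky for large datasets.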
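The pinning described above comes from the `nodeAffinity` block that every Local PV must carry. Here is a rough Python sketch of such a PV and its StorageClass as plain dictionaries, plus a deliberately simplified check of where a Pod bound to that PV could run. The node names, the path `/mnt/nvme0`, the class name `local-nvme`, and the helper `nodes_allowed_by` are assumptions for illustration; the real scheduler evaluates many more rules.

```python
def local_storage_class(name: str = "local-nvme") -> dict:
    """StorageClass for statically provisioned local volumes."""
    return {
        "apiVersion": "storage.k8s.io/v1",
        "kind": "StorageClass",
        "metadata": {"name": name},
        "provisioner": "kubernetes.io/no-provisioner",
        # Delay binding until a Pod is scheduled, so the PV choice
        # and the node choice are made together.
        "volumeBindingMode": "WaitForFirstConsumer",
    }


def local_pv(name: str, node: str, path: str,
             capacity: str, storage_class: str) -> dict:
    """A Local PersistentVolume tied to exactly one node by hostname."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolume",
        "metadata": {"name": name},
        "spec": {
            "capacity": {"storage": capacity},
            "volumeMode": "Filesystem",
            "accessModes": ["ReadWriteOnce"],
            "persistentVolumeReclaimPolicy": "Retain",
            "storageClassName": storage_class,
            "local": {"path": path},
            "nodeAffinity": {"required": {"nodeSelectorTerms": [{
                "matchExpressions": [{
                    "key": "kubernetes.io/hostname",
                    "operator": "In",
                    "values": [node],
                }],
            }]}},
        },
    }


def nodes_allowed_by(pv: dict, ready_nodes: set) -> list:
    """Simplified: only understands hostname/In expressions, one per term."""
    terms = pv["spec"]["nodeAffinity"]["required"]["nodeSelectorTerms"]
    allowed = set()
    for term in terms:
        for expr in term["matchExpressions"]:
            if (expr["key"] == "kubernetes.io/hostname"
                    and expr["operator"] == "In"):
                allowed.update(expr["values"])
    return sorted(allowed & set(ready_nodes))


if __name__ == "__main__":
    pv = local_pv("mysql-0-data", "node-a", "/mnt/nvme0", "1Ti", "local-nvme")
    print(nodes_allowed_by(pv, {"node-a", "node-b"}))  # node-a is up
    print(nodes_allowed_by(pv, {"node-b"}))            # node-a is down
```

When `node-a` is missing from the ready set, the result is empty: there is simply nowhere the Pod is allowed to run, which is the "data is physically trapped" situation.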

3. Remote Block Storage: The Quest for Elasticity

To solve the coupling issue, we move to Network Block Storage (EBS, Ceph, or SAN). This separates compute from storage, allowing a Pod to drift across the cluster while its data follows.

The Latency Wall

The traditional TCP/IP stack introduces significant overhead:

Context Switching: Moving data between User Space and Kernel Space.
Protocol Overhead: Encapsulating block commands into TCP packets.
Network Jitter: Standard Ethernet is “lossy,” leading to retransmissions and unpredictable Tail Latency (P99).
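To see why a small loss rate hurts P99 far more than the median, here is a toy Python simulation. It is a sketch under stated assumptions, not a measurement: the base latency, jitter, 2% loss rate, and 200 ms retransmission penalty are all illustrative numbers I picked, and the functions are hypothetical helpers.

```python
import math
import random


def simulate_latencies_us(n: int, base_us: float, jitter_us: float,
                          loss_prob: float, retransmit_penalty_us: float,
                          seed: int = 0) -> list:
    """Each I/O costs base + uniform jitter; a lost packet adds a fixed penalty."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    samples = []
    for _ in range(n):
        latency = base_us + rng.uniform(0, jitter_us)
        if rng.random() < loss_prob:
            latency += retransmit_penalty_us
        samples.append(latency)
    return samples


def percentile(samples: list, p: float) -> float:
    """Nearest-rank percentile, p in (0, 100]."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


if __name__ == "__main__":
    samples = simulate_latencies_us(
        n=10_000,
        base_us=500,                    # assumed base round trip
        jitter_us=200,                  # assumed jitter range
        loss_prob=0.02,                 # assumed 2% of I/Os hit a loss
        retransmit_penalty_us=200_000,  # assumed 200 ms timeout-style penalty
    )
    print("P50 (us):", round(percentile(samples, 50)))
    print("P99 (us):", round(percentile(samples, 99)))
```

With these assumed inputs the median stays within the base-plus-jitter range, while P99 lands on the retransmission penalty, because more than 1% of the requests paid it.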

4. The Game Changer: RDMA and NVMe-over-Fabrics

To bridge the performance gap between Local SSDs and Network Storage, we look to RDMA (Remote Direct Memory Access).

Beyond Zero-Copy

RDMA isn’t just about reducing memory copies; it’s a complete hardware-software bypass:

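Kernel Bypass: Applications communicate directly with the NIC (RNIC) on the data path, skipping the OS kernel for each I/O. The kernel is involved only in setup, such as registering memory and creating queue pairs.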
CPU Offload: The NIC hardware handles packet ordering, flow control, and retransmissions, freeing the CPU for database queries.
Lossless Ethernet (RoCE v2): RoCE v2 carries RDMA over UDP/IP and is typically deployed on a fabric made “lossless” with mechanisms like PFC (Priority Flow Control), avoiding the packet drops and retransmissions that hurt tail latency on standard Ethernet.

NVMe-over-Fabrics (NVMe-oF)

While iSCSI was designed for the era of spinning disks, NVMe-oF is designed for the age of flash. It maps the massive parallelism of NVMe (up to 64K queues) directly over an RDMA fabric.

Result: You get the flexibility of remote storage with a fabric latency penalty of roughly 10% to 20% compared to local NVMe in favorable setups, a far cry from the millisecond-level delays of traditional network storage. End-to-end latency also depends on the storage backend behind the fabric.

5. Architectural Summary: The Modern Choice

Feature	Local NVMe SSD	Standard Block (TCP)	RDMA + NVMe-oF
Latency	$10 \sim 100\ \mu s$	$2 \sim 5\ \text{ms}$	$150 \sim 400\ \mu s$
Mobility	Low (Pinned to Node)	High	High
Complexity	High (App-level HA)	Low	Moderate (HW/Network config)
Best For	Extreme IOPS / TempDB	General Purpose	High-perf Production DBs
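The table above can also be read as a simple filter. Here is a small Python sketch that encodes its latency and mobility rows and returns the options that fit a latency budget. The figures are copied from the table, using the upper bound of each range as the worst case; the `StorageOption` class and `candidates` function are hypothetical names for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StorageOption:
    name: str
    latency_min_us: int
    latency_max_us: int
    mobile: bool  # can the Pod move to another node and keep its data?


# Values taken from the summary table (ms converted to microseconds).
OPTIONS = [
    StorageOption("Local NVMe SSD", 10, 100, mobile=False),
    StorageOption("Standard Block (TCP)", 2_000, 5_000, mobile=True),
    StorageOption("RDMA + NVMe-oF", 150, 400, mobile=True),
]


def candidates(latency_budget_us: int, need_mobility: bool) -> list:
    """Options whose worst-case latency fits the budget and mobility need."""
    return [
        option.name
        for option in OPTIONS
        if option.latency_max_us <= latency_budget_us
        and (option.mobile or not need_mobility)
    ]


if __name__ == "__main__":
    print(candidates(500, need_mobility=True))
    print(candidates(500, need_mobility=False))
    print(candidates(50, need_mobility=True))
```

A 500 µs budget with mobility required leaves only the RDMA option, and a 50 µs budget leaves nothing, which is a reminder that no row in the table guarantees its best case.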

Conclusion

For a self-managed K8s database cluster, the path forward is clear. While Local PVs remain the choice for raw speed, the operational complexity of node pinning is a significant drawback.

The industry is rapidly shifting toward a Disaggregated Storage model. By leveraging RoCE v2 and NVMe-oF, we can finally achieve the “Holy Grail” of K8s stateful services: the performance of a local disk with the resilience and elasticity of the cloud.