Zero-Downtime Migration: Sharding a 30M-Row MySQL Single Table
In high-growth distributed systems, you eventually hit the “performance wall” of a single MySQL table. When a table surpasses 30 million rows, B+ Tree depth increases, query latency spikes, and DDL operations become high-risk maneuvers.
As a Staff Engineer, the challenge isn’t just “splitting the table”—it’s doing so while the engine is running at 5,000 RPM. This guide outlines a production-grade, zero-downtime strategy for migrating from a single table to a sharded architecture.
1. Architectural Strategy: The “Pragmatic Architect” Approach
Before touching the database, we must establish a foundation that supports both current scale and 10x future growth.
Sharding Key & Topology
- Shard Key Selection: Choose a high-cardinality field (e.g.,
user_idortenant_id) to ensure even data distribution. - Logical vs. Physical: Map 256 logical shards to a smaller number of physical databases (e.g., 8 or 16). This allows for easy future rebalancing without re-sharding.
Distributed ID Generation
Standard auto_increment fails in a sharded environment.
- Action: Implement a Snowflake or Segment-based ID generator (like Baidu’s UidGenerator or Meituan’s Leaf) before starting the migration.
2. The Migration Pipeline: Full Load + CDC
To achieve zero downtime, we use a combination of Full Load (historical data) and Change Data Capture (CDC) (real-time increments).
Technical Comparison: Canal vs. AWS DMS
- AWS DMS: Best for managed environments. It automates the transition from full load to CDC but acts as a “black box.”
- Canal + MQ: Best for high customization. It allows you to intercept Binlogs, transform data in transit, and handle complex sharding logic via code.
3. Step-by-Step Execution Playbook
Phase 1: Preparation & Dual-Writing
- Enable Binlog: Ensure MySQL
binlog_formatis set toROW. - Start Incremental Sync: Deploy Canal to listen to the single table’s Binlog. Stream changes into a Message Queue (Kafka/RocketMQ).
- Perform Full Load: Use a migration tool (e.g., DataX) to copy historical rows.
- Tip: Use
INSERT ... ON DUPLICATE KEY UPDATEto ensure the CDC data (which might arrive earlier) isn’t overwritten by stale historical data.
- Tip: Use
Phase 2: The “Shadow Read” Validation
Before switching traffic, verify data integrity without affecting the user.
- Logic: The middleware intercepts a percentage of
SELECTqueries, executes them against both the old and new tables, and logs any discrepancies.
graph TD
A[App Request] --> B{Middleware}
B -->|Read| C[Old Single Table]
B -.->|Async Shadow Read| D[New Sharded Tables]
D --> E[Comparison Engine]
E -->|Mismatch| F[Alert/Log]
C --> G[Return Result to User]
4. The Critical Switching Window (The “Cutover”)
The most sensitive part of the process is the Write-Switch. We aim for a “silence period” of less than 60 seconds.
The 10-Minute Playbook
- Set Read-Only: Trigger a global configuration (e.g., via Apollo/Nacos) to put the application into “Maintenance Mode” (Rejecting writes, allowing reads).
- Catch-Up: Wait for the CDC lag to hit zero.
- Update Routing: Update the Sharding Middleware (e.g., ShardingSphere) to point all traffic to the new sharded tables.
- Enable Reverse Sync: Immediately start a sync process from the New Tables back to the Old Table.
- Resume Writes: Disable “Maintenance Mode.”
5. Rollback & Safety Nets
A Staff Engineer never executes a plan without a “Reverse Gear.”
The Reverse Sync (The Safety Net)
By syncing data from the new shards back to the original single table, you maintain a “Hot Standby.”
- Scenario: If a bug is discovered 30 minutes after cutover, you can flip the routing switch back to the single table instantly. Since the single table has been receiving updates via Reverse Sync, no data is lost.
Circular Mirroring Prevention
To prevent Old -> New -> Old infinite loops:
- Tagging: Ensure the migration tool connects using a specific DB user. Configure the CDC listeners to ignore transactions generated by that specific user.
6. Post-Migration Cleanup
- Observability: Monitor for “Cross-Shard Joins” that might degrade performance.
- Decommissioning: Keep the old table for one full business cycle (usually 7 days). Once stability is confirmed, drop the old table during off-peak hours to reclaim storage.
Conclusion
Moving 30 million rows from a single table to a sharded architecture is a test of operational discipline. By utilizing CDC-driven synchronization, Shadow Reads, and Reverse Sync, you transform a high-risk migration into a controlled, reversible evolution.
Key Takeaway: The middleware is your steering wheel, but the Binlog is your fuel. Master both, and the migration becomes invisible to your users.