Scaling a Payment Orchestration System from 100 to 10,000 TPS on AWS

In real life, hardly any chance to do this project; if it really comes, it’s more than gold. This post is just a skeleton, such a project will be more complex

Background

This post walks through a real design exercise: taking a payment + virtual account (VA) system from 100 TPS to 10,000 TPS on AWS, balancing cost and performance at each stage. Before diving in, I want to call out something worth keeping in mind throughout.

10,000 TPS for a payment system is an almost fictional number.

Stripe, one of the most sophisticated payment infrastructure companies in the world, processes roughly 250 million transactions per day — that’s around 2,900 TPS averaged across 24 hours, with peak periods perhaps 3–5× that. So when we say “10,000 TPS payment core,” we’re describing something in the same weight class as Stripe at full peak, or larger. In practice, most fintech products — even well-funded, mature ones — operate comfortably in the 50–500 TPS range. A Series B payments company hitting 1,000 TPS is doing serious volume.

Why design to 10,000 TPS then? Because the architecture you’d build to get there is instructive at every scale below it. The strategies that take you from 100 to 1,000 TPS are the same ones that take you to 10,000 — just applied more aggressively. Think of this as a scaling roadmap, not a literal deployment target.

Baseline: ~190 TPS

The starting point is deliberately modest — the kind of stack a small engineering team would deploy for a real production system:

Backend: 2 × (2 vCPU / 4 GB) application servers (ECS tasks or EC2)
Database: Single RDS PostgreSQL instance (4 vCPU / 8 GB)
No caching layer, no queue, no read replicas
Provider layer (bank APIs / payment rails): avg 100–400ms, p99=300ms–1s depending on rail
Payment API (end-to-end): avg 150–450ms, p99=400ms–1s — dominated by provider rail latency
VA account API (read): avg 5–15ms, p99=50ms — no provider dependency, DB/cache only

Benchmarked on 2 × 2-vCPU backend nodes + single Postgres, no Redis or queue, with a log-normal connector simulator (p50=120ms, p99=380ms):

Rate	Payment API p99	Sys errors	Dropped	Verdict
100/s	44ms	0%	0	PASS
200/s	470ms	0%	~0	edge
300/s	11.7s	9.6%	2,904	collapse

Clean ceiling: ~190/s. The bottleneck is not CPU — both backend nodes sat at 151% of 200% capacity and Postgres at 272% of 400% when the system collapsed. The actual wall is the DB connection pool: with 40 connections per node and a ~420ms mean hold time (dominated by the provider call), Little’s Law caps throughput at 80 connections ÷ 0.42s ≈ 190/s. Every in-flight payment holds a DB connection for its entire duration, including the wait on the external payment rail.

One counterintuitive finding: adding Redis + Kafka raised the clean ceiling from ~190/s to ~250/s — because Redis offloads provider-health and routing reads off Postgres, reducing per-payment DB connection pressure. The infra layer is a net throughput win, not overhead. The first bottleneck then shifts to backend CPU, with Postgres still having headroom.

The path past this ceiling is PgBouncer in transaction pooling mode — it releases the real Postgres backend connection during the external provider call, breaking the pool ÷ hold-time ceiling entirely. That’s exactly what Strategy 2 addresses.

The Six Scaling Strategies

1. Multi-layer Cache: Absorb Read Traffic Before It Hits the DB

At moderate TPS, your read traffic will dwarf your write traffic — especially for a VA account system where users frequently check balances. Putting a cache in front of the database is the highest-leverage move available.

Layer 1 — API Gateway response cache (TTL 1–5s)

For endpoints like GET /va-accounts/{id}/balance, a 2-second TTL at the gateway level eliminates the majority of read load before a single byte reaches your application tier. At 10,000 TPS with 60% reads, this alone can absorb 4,000–5,000 req/s. It’s free throughput. The trade-off is eventual consistency — only apply this to endpoints where a few seconds of staleness is acceptable.

Layer 2 — ElastiCache Redis Cluster

This is the primary workhorse. The key design decisions are what you cache and how you invalidate it:

Cache Key	TTL	Strategy
`va:balance:{account_id}`	5–30s	Write-through on every mutation
`acct:state:{account_id}`	60s	Cache-aside, short TTL
`idem:{idempotency_key}`	24h	Write-once after payment completes
`ratelimit:{user_id}:{window}`	sliding	Redis INCR + EXPIRE

Write-through for balances means every balance mutation writes to both DB and Redis in the same operation. This keeps the cache warm without a separate invalidation step, and avoids the “thundering herd” problem where a cache miss under high load sends thousands of requests simultaneously to the database.

Idempotency keys deserve special mention: every payment request should carry a client-generated key, and the service should check Redis before processing. This makes retries safe regardless of TPS — critical in a payment system where clients aggressively retry on network errors.

Cluster sizing: start with a 3-shard ElastiCache cluster (r7g.large nodes, ~6 GB each). At 10,000 TPS with a 60% cache hit rate, you’re handling around 6,000 Redis ops/sec per shard — well within the ~100,000 ops/sec ceiling of a single node.

2. Read/Write Splitting: Multiply DB Read Capacity Linearly

Once you’ve maximized cache hit rate, the remaining DB read traffic needs to be distributed. PostgreSQL read replicas let you scale read capacity roughly linearly by adding nodes.

The most important thing to add before replicas: PgBouncer in transaction pooling mode.

Without connection pooling, 40 ECS tasks × 10 DB connections each = 400 real PostgreSQL connections — and that number grows with every new task you spin up. PostgreSQL becomes unstable above a few hundred connections. PgBouncer sits in front of RDS and multiplexes application connections onto a much smaller pool of actual server connections:

pool_mode = transaction        ; critical — session mode doesn't help
max_client_conn = 2000         ; what your app sees
default_pool_size = 25         ; actual server connections per db/user
server_idle_timeout = 30

AWS RDS Proxy is worth considering as a managed alternative — it handles connection pooling transparently, integrates with IAM for credentials, and automatically routes failovers. For teams that don’t want to operate PgBouncer themselves, it’s a compelling choice.

Read replica routing rules:

Writes and reads-after-writes → primary only. A payment that has just completed must read its own write.
Balance queries, transaction history → replicas, routed via PgBouncer or RDS Proxy.
Reports and dashboards → replicas or (better) Redshift.

At 10,000 TPS, you’ll want a db.r7g.2xlarge primary (8 vCPU, 64 GB RAM) with 2–3 read replicas across availability zones.

TPS band	Primary instance	Read replicas	PgBouncer pool size
100	db.t3.xlarge	—	50
1,000	db.r7g.large	1×	100
5,000	db.r7g.xlarge	2×	200
10,000	db.r7g.2xlarge	2–3×	400

3. Microservice Split: Payment Core and VA Account on Separate Databases

At high TPS, co-locating payment ledger writes and VA account reads on the same database is a liability. They have fundamentally different workload profiles:

Payment Core — write-heavy, strict ACID requirements, sequential consistency on the ledger. Every row matters. No eventual consistency.
VA Account Service — read-heavy, high-frequency balance queries, tolerates a few seconds of staleness for reads (backed by the Redis cache).

Sharing a database means their connection pools compete, their buffer caches fight each other, and their IOPS budgets overlap. Splitting them gives each service independent vertical scaling headroom and means a slow query in one can’t block the other.

Service boundaries:

Payment Core API                  VA Account Service
────────────────────              ────────────────────
POST /payments                    GET  /va-accounts/{id}
POST /refunds                     GET  /va-accounts/{id}/balance  
GET  /payments/{id}               POST /va-accounts/{id}/topup
Internal: ledger write            Internal: balance cache write

Inter-service communication should be asynchronous. The Payment Core should never call the VA Account Service synchronously during a request path — that creates coupling and a latency chain. Instead, Payment Core emits a payment.completed event; VA Account Service consumes it and updates the balance. Both services scale independently, and neither blocks the other.

4. Async Decoupling via SQS/SNS: The Shock Absorber

Message queues are a default consideration for any payment system — they’re what lets you decouple your API throughput from your processing throughput.

Two core patterns:

Pattern A — Async settlement

Payment Core validates the request, writes an intent record to the DB, and immediately returns 202 Accepted. A background consumer picks up the SQS message and executes the actual settlement with the payment rail (bank API, card network, etc.). This means your public API TPS ceiling is no longer tied to the latency of an external HTTP call to a bank.

Pattern B — SNS fan-out

A single payment.completed SNS topic fans out to multiple SQS queues:

VA Account Service → update balance
Notification Service → send receipt email/push
Audit Service → append to compliance log
Fraud Service → run async risk scoring

Each consumer scales independently. Adding a new downstream consumer means creating a new SQS subscription — no changes to Payment Core.

Use SQS FIFO with MessageGroupId = account_id for payment events. This guarantees ordering per account, which matters: if two payments arrive for the same account, you don’t want the second processed before the first.

Dead-letter queues are non-negotiable — any failed message after N retries should land in a DLQ with an alarm on queue depth. In a payment system, a silently dropped message is a worse outcome than a loud failure.

5. Cold/Hot Data Split: Keep OLTP Lean, Move Analytics to OLAP

As transaction volume grows, your transactions table becomes the single most dangerous object in your database. Reporting queries against it compete with live payment writes. Historical data inflates table size, slows down index scans, and adds cost to every VACUUM cycle.

The split:

Hot data (OLTP, RDS): Last 90 days of transactions, all active account states. Use PostgreSQL table partitioning (PARTITION BY RANGE (created_at)) with monthly partitions. The query planner skips irrelevant partitions entirely, and old partitions detach cleanly for archival.
Cold data (S3/Glacier): Transactions older than 90 days, archived via a nightly Lambda job. Keep a thin transaction_archive pointer table in RDS ({id, s3_path, period}) for lookup without pulling data back.
OLAP (Redshift/Athena): Use AWS DMS with CDC (Change Data Capture) to stream new transactions from RDS to Redshift in near-real-time. Finance reports, reconciliation, and fraud analytics all query Redshift — they never touch the OLTP primary. For ad-hoc queries on the S3 archive, Athena is pay-per-query and requires no warehouse infrastructure.

This split is primarily a cost and reliability move. You protect OLTP performance, reduce RDS storage costs, and give your data team a warehouse they can query without asking you to add read replicas.

6. Connection Pooling and Compute Right-Sizing

Two things kill payment systems at high TPS before the architecture even gets stressed: connection exhaustion and over-provisioned compute that still can’t handle bursts.

ECS auto-scaling configuration that actually works:

Payment services are I/O-bound, not CPU-bound — the application is mostly waiting on DB or Redis responses. CPU-based auto-scaling will scale too late. Use ALB RequestCountPerTarget as the scaling metric instead:

Scale out when requests/tasks> 200 (sustained 2 min)
Scale in when requests/task < 80 (sustained 10 min — conservative, avoids thrash)
Min tasks: 2, Max tasks: 40

Instance sizing: 2 vCPU / 4 GB per Fargate task is typically sufficient for payment API tasks — profile before going larger. The throughput ceiling for a payment task is almost always the DB connection or downstream I/O, not the CPU.

Aurora PostgreSQL vs RDS PostgreSQL at scale: At 5,000+ TPS, Aurora becomes meaningfully better. Its shared storage architecture means read replicas are available almost immediately (no replication lag ramp-up), its buffer pool is shared across replicas, and it supports up to 15 read replicas vs 5 for RDS. The cost premium is real, but the operational simplicity at large scale pays for it.

The Full Architecture at 10,000 TPS

Putting it all together, the request path looks like this:

Client
  └─ CloudFront + WAF (TLS, DDoS, geo-routing, static edge cache)
      └─ API Gateway (auth, rate-limit, L1 response cache TTL 1–5s)
          └─ ElastiCache Redis (L2: balance, account state, idempotency keys)
              ├─ Payment Core API (ECS Fargate, 2–40 tasks)
              │    └─ PgBouncer / RDS Proxy
              │        ├─ RDS Primary (writes)
              │        └─ RDS Replicas ×2 (reads)
              │            └─ SQS FIFO → async settlement, SNS fan-out
              └─ VA Account Service (ECS Fargate, 2–40 tasks)
                   └─ PgBouncer / RDS Proxy
                       ├─ RDS Primary (writes)
                       └─ RDS Replicas ×2 (reads)

Cold path: RDS → S3/Glacier (nightly archive, >90 days)
Analytics: DMS CDC → Redshift | Athena on S3

Cost vs Performance Levers

At 10,000 TPS, the biggest cost traps are over-provisioned RDS and under-utilized ElastiCache. Tune in this order:

1. Maximize Redis cache hit rate first. Every 1% improvement in cache hit rate directly reduces RDS load. Measure cache_hits / (cache_hits + cache_misses) per endpoint family, not globally — a 90% hit rate on balance reads masking a 10% hit rate on account state reads will mislead you.

2. Commit to reserved instances on DB primaries. At 10,000 TPS, the primary DB instance is always-on infrastructure. A 1-year reserved instance saves 40–50% over on-demand. This is the single highest ROI infrastructure cost decision.

3. Right-size Fargate tasks before scaling out. Profile memory and CPU utilization on a single task under load before deciding to add tasks. Payment API tasks are frequently under-utilizing CPU at 20–30% while waiting on I/O. Adding vCPUs doesn’t help.

4. Use Athena for historical queries instead of keeping history in RDS. Storing 2 years of transactions in RDS to support “download statement” features is expensive. S3 + Athena handles this at a fraction of the cost with acceptable latency for a non-realtime operation.

The Hidden Topology Driver: 🤦Data Residency Law

Here’s something that doesn’t show up in most scaling guides, but completely reshapes this architecture in the real world.

If this payment system is global — and any system at 10,000 TPS almost certainly is — regulation decides your database topology before engineering does.

GDPR mandates that EU residents’ financial data must be stored and processed within the EU. It cannot legally transit to or rest in US infrastructure. US financial regulations have their own requirements. Some countries go further — India’s RBI requires payment data to be stored exclusively on Indian soil. You don’t architect around this. You comply with it first, then engineer within the constraints.

The practical consequence: you cannot have a single global payment-core database. You run independent stacks per region, each fully isolated at the data layer.

EU Region (Frankfurt / Ireland)       US Region (us-east-1 / us-west-2)
──────────────────────────────        ──────────────────────────────────
API Gateway (EU)                      API Gateway (US)
Payment Core API                      Payment Core API
VA Account Service                    VA Account Service
RDS PostgreSQL (EU-resident data)     RDS PostgreSQL (US-resident data)
ElastiCache Redis                     ElastiCache Redis
SQS / SNS                             SQS / SNS

No cross-region DB replication. No shared primary. Completely independent stacks, by law.

And this is actually good news for the TPS math.

If your business splits 40% EU / 60% US — a typical global fintech distribution — you’re no longer designing a single system for 10,000 TPS. You’re designing two systems: one that needs to handle ~4,000 TPS peak, and one that handles ~6,000 TPS peak. Neither needs to be Stripe-scale on its own. The architecture becomes more tractable, and the cost drops proportionally.

What can be shared across regions:

Component	Shared?	Reason
Payment Core DB	No	Data residency law
VA Account DB	No	Same
ElastiCache (Redis)	No	Contains PII and financial state
Fraud ML models	Yes (read-only sync)	Logic only, no PII
Merchant/product config	Yes (CDN-distributed)	No PII
Aggregated analytics	Yes (anonymized)	No individual-level data

The one engineering problem this creates: a German customer traveling to the US and making a payment. The request hits the US API gateway, but the account is EU-domiciled. You need a lightweight global routing layer — essentially a small DynamoDB Global Table or Route 53 Latency Routing rule — that maps account_id prefix to home region, then proxies or redirects accordingly. This routing table contains no financial data, just region mappings, so it can legally live anywhere.

The pattern is: one global router, N fully isolated regional stacks. The router knows where your account lives; the regional stack handles everything else.

This also means your disaster recovery story is cleaner than a single-region design. An EU outage doesn’t affect US customers at all — they’re on a separate stack. You get fault isolation as a compliance side effect.

A Reality Check on the Numbers

10,000 TPS payment processing is an engineering thought experiment more than a deployment target for most teams. For context:

Stripe processes ~250M transactions/day → ~2,900 TPS average, peak perhaps 3–5× that
A healthy Series B fintech might see 100–500 TPS peak
A major regional bank’s core payment processing might peak at 1,000–2,000 TPS

The architecture described here handles 10,000 TPS — but the same layered approach is what you’d use at 500 TPS or 2,000 TPS, just with smaller instances and fewer replicas. The progression is incremental. You don’t build the 10,000 TPS stack on day one; you build the 500 TPS stack well, instrument it thoroughly, and add layers as the data tells you to.

The most important rule in payment system scaling: don’t add infrastructure ahead of measured bottlenecks. A well-tuned 2-node setup with PgBouncer and a warm Redis cache can handle far more than its raw specs suggest. Profile first, scale second.

Summary

Strategy	Primary benefit	When to add
Multi-layer cache (Redis + Gateway)	Absorb 60–80% of read traffic	Before first scaling crisis
Read/write splitting + PgBouncer	Multiply read capacity, fix connection exhaustion	~500 TPS
Microservice + DB split	Independent scaling, workload isolation	~1,000–2,000 TPS
SQS/SNS async decoupling	Decouple throughput from processing latency	Early — default for payments
Cold/hot data split (OLAP)	Protect OLTP, enable analytics	~6–12 months post-launch
Aurora + ECS auto-scaling	High availability, burst handling	~5,000 TPS
Regional stack isolation	Data residency compliance, fault isolation	Day one if global

The architecture isn’t about reaching 10,000 TPS. It’s about building a system that degrades gracefully under load, scales predictably as volume grows, and doesn’t require a full rewrite at each order of magnitude. That’s the goal worth engineering toward.

Written based on practical AWS architecture design for financial systems. Infrastructure numbers are illustrative; always benchmark your specific workload before committing to instance types and cluster sizes.