Scaling Payment Systems: Lessons from Processing 20M+ Daily Transactions
Key architectural decisions and technical strategies for building high-performance payment systems — from microservices design to real-time monitoring and failure planning.
The Challenge: 20M+ Daily Transactions
At Safaricom, we processed over 20 million USSD requests daily for mobile data purchases. That's 230+ transactions per second, each requiring validation, processing, and confirmation — reliably, securely, with minimal latency, for users ranging from urban professionals to first-time internet users on basic phones in areas with intermittent connectivity.
Any system failure meant millions of users couldn't purchase data and Safaricom lost revenue immediately. The stakes were not abstract. This shaped every architectural decision we made.
Key Architectural Decisions
Moving to Microservices
The first critical decision was decomposing the monolith into microservices. This let us scale components independently based on their specific load characteristics. Payment processing, user management, and notifications each had different scaling needs — treating them as a single unit meant over-provisioning everything for the peak of the heaviest component.
Beyond scaling, failure isolation was essential. A bug in the notification service shouldn't prevent a transaction from completing. With microservices, a failure in one domain can be contained, allowing the system to continue operating with reduced functionality rather than crashing entirely.
Development velocity improved too. Teams could work on, test, and deploy individual services independently — no more coordinating across the entire system for every release.
Database Architecture for High Throughput
Payment systems require both high throughput and strict ACID compliance — a genuinely hard combination. We addressed it through several layered strategies.
Read replicas distributed read operations across multiple database instances. Balance queries, transaction history lookups, and user account reads happened against replicas, leaving the primary for writes and transactions that required real-time consistency.
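The routing rule above can be sketched as a small dispatcher. This is an illustrative sketch, not our production code — the class, connection names, and the simple prefix-based write detection are all assumptions for the example:

```python
import random

class ReplicaRouter:
    """Route writes (and consistency-sensitive reads) to the primary;
    spread everything else across read replicas."""

    def __init__(self, primary, replicas):
        self.primary = primary
        self.replicas = replicas

    def route(self, query, needs_consistency=False):
        # Naive write detection for illustration; a real router would
        # inspect the statement properly or take an explicit flag.
        is_write = query.strip().upper().startswith(("INSERT", "UPDATE", "DELETE"))
        if is_write or needs_consistency:
            return self.primary
        return random.choice(self.replicas)
```

A balance check immediately after a debit would pass `needs_consistency=True` so it reads the primary; a transaction-history page can tolerate replica lag and goes to any replica.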
Connection pooling was critical at our connection volumes. We tuned pool sizes carefully based on observed load patterns — too small and requests queued; too large and the database itself became the bottleneck.
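The "too small and requests queue" behaviour is easiest to see in a minimal fixed-size pool. This is a sketch for illustration (real deployments would use a battle-tested pooler), showing how callers block once every connection is checked out:

```python
import queue

class ConnectionPool:
    """Minimal fixed-size pool: acquire blocks (or times out) when all
    connections are checked out, which is exactly the queueing effect
    an undersized pool produces under load."""

    def __init__(self, size, connect):
        self._pool = queue.Queue(maxsize=size)
        for _ in range(size):
            self._pool.put(connect())

    def acquire(self, timeout=None):
        # Raises queue.Empty if no connection frees up within the timeout.
        return self._pool.get(timeout=timeout)

    def release(self, conn):
        self._pool.put(conn)
```

Sizing the pool is then a matter of measurement: watch how long `acquire` waits under realistic load and grow the pool until waiting stops being the bottleneck, before database-side contention takes over.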
Multi-layer caching with Redis handled sessions, frequently accessed user data, and configuration. Cache invalidation was the constant challenge — getting this wrong meant serving stale balances, which in a payment system is a significant problem.
Real-time Monitoring and Circuit Breakers
At 230 transactions per second, problems escalate faster than any human can react. Automated alerting, carefully tuned to minimise false positives while catching real anomalies quickly, was non-negotiable. The key insight: you need to deeply understand what normal looks like before you can detect meaningful deviation from it.
Circuit breakers were implemented to prevent cascading failures. When a downstream service started degrading, the circuit breaker tripped — failing fast rather than allowing slow failures to back up and overwhelm the system. This meant accepting degraded functionality in some scenarios rather than a complete outage.
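The pattern can be sketched in a few lines. This is a simplified illustration (the thresholds and the single-state reset are assumptions, not our production implementation): trip after N consecutive failures, fail fast while open, and allow a trial call after a cooldown:

```python
import time

class CircuitBreaker:
    """Open after `failure_threshold` consecutive failures; fail fast
    while open; allow one trial call after `reset_timeout` seconds."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: let one call probe downstream
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # any success resets the failure count
        return result
```

The important property is the fast `RuntimeError`: callers get an immediate, cheap failure they can handle gracefully, instead of threads piling up behind a slow, dying dependency.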
Performance Optimisation
Asynchronous Processing for Non-Critical Paths
Not every operation needs to block the transaction response. We moved logging, fraud analysis, and notification sending to asynchronous queues. Transactions completed and confirmed quickly; background processing continued without affecting the user-facing response time.
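The split between the critical path and deferred work looks roughly like this. A minimal sketch with an in-process queue and worker thread — production systems would use a durable broker, and the task names here are illustrative:

```python
import queue
import threading

work_q = queue.Queue()
processed = []  # stands in for real side effects: log writes, fraud scores, SMS

def background_worker():
    while True:
        task = work_q.get()
        if task is None:  # shutdown sentinel
            work_q.task_done()
            break
        kind, txn_id = task
        processed.append((kind, txn_id))  # real processing would happen here
        work_q.task_done()

threading.Thread(target=background_worker, daemon=True).start()

def complete_transaction(txn_id):
    # Critical path (validation, debit, confirmation) is synchronous and
    # elided here; everything that can wait is queued so it never delays
    # the user-facing response.
    for kind in ("audit_log", "fraud_check", "notification"):
        work_q.put((kind, txn_id))
    return {"txn_id": txn_id, "status": "confirmed"}
```

The response returns as soon as the enqueue succeeds; logging, fraud scoring, and notification happen on the worker's schedule, not the user's.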
Fraud detection is worth noting specifically — running complex ML models synchronously on every transaction at our scale was infeasible. Moving it async meant we accepted a short window where a fraudulent transaction could complete before being flagged, but the alternative was unacceptable latency for every legitimate user.
Caching Strategy
We implemented caching at multiple levels: application-level caching for session data and user preferences; query result caching for expensive, stable queries; Redis for anything requiring fast read/write with acceptable durability trade-offs. The discipline was knowing what could tolerate staleness and by how much — and enforcing that consistently across the team.
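That staleness discipline can be made explicit in code by having each caller declare its own freshness budget. A cache-aside sketch under assumed names (the keys and TTLs are illustrative, not production values):

```python
import time

class TTLCache:
    """Cache-aside with an explicit per-read staleness budget in seconds."""

    def __init__(self):
        self._store = {}  # key -> (value, stored_at)

    def get_or_load(self, key, loader, max_age):
        entry = self._store.get(key)
        now = time.monotonic()
        if entry is not None and now - entry[1] <= max_age:
            return entry[0]  # fresh enough for this caller's budget
        value = loader()  # miss or too stale: go to the source of truth
        self._store[key] = (value, now)
        return value

    def invalidate(self, key):
        # Called from the write path, so reads never serve a known-stale value.
        self._store.pop(key, None)
```

User preferences might pass `max_age=300`; a balance read after a debit passes a budget of zero or relies on write-path invalidation. Making the budget a parameter forces every call site to answer "how stale is acceptable here?" — which is the discipline the team has to enforce.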
Load Balancing and Auto-scaling
Our infrastructure scaled horizontally based on real-time metrics. Application servers sat behind load balancers that distributed traffic across instances, adding capacity during peak periods automatically. The key was defining the right metrics to trigger scaling — CPU alone was insufficient; request queue depth and response time percentiles gave us a more accurate picture of actual load.
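The multi-signal trigger can be expressed as a simple predicate. The thresholds below are illustrative placeholders, not the values we ran in production — the point is that any one pressure signal is enough to add capacity:

```python
def should_scale_out(cpu_pct, queue_depth, p95_latency_ms,
                     cpu_limit=75, queue_limit=100, latency_limit_ms=500):
    """Scale out when any signal shows real pressure. CPU alone misses
    I/O-bound saturation, which queue depth and tail latency catch."""
    return (cpu_pct > cpu_limit
            or queue_depth > queue_limit
            or p95_latency_ms > latency_limit_ms)
```

A host at 50% CPU with a deep request queue is still saturated — this is exactly the case a CPU-only policy fails to act on.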
Security at Scale
Payment systems are high-value targets. Security wasn't a layer we added at the end — it was woven into the architecture from the start.
Encryption at rest and in transit was foundational. Rate limiting protected against abuse and denial-of-service attempts, with logic sophisticated enough to distinguish legitimate high-volume users from malicious traffic patterns. PCI DSS compliance required regular audits, documentation, and process discipline — not just technical controls.
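A token bucket is one standard way to implement the rate-limiting side of this. A minimal sketch (the rates are illustrative; a real limiter would also key buckets per user or endpoint and persist state across instances): it allows a sustained rate while tolerating the short bursts that legitimate high-volume users produce:

```python
import time

class TokenBucket:
    """Token-bucket limiter: refills at `rate_per_sec`, allows bursts up
    to `burst` tokens. `now` is injectable for deterministic testing."""

    def __init__(self, rate_per_sec, burst, now=None):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = float(burst)
        self.updated = time.monotonic() if now is None else now

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        # Refill proportionally to elapsed time, capped at the burst size.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A burst above capacity is rejected immediately, while a steady legitimate user never notices the limiter — which is the distinction the prose above describes.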
Security debt is particularly dangerous in payment systems. We treated security findings with the same urgency as production outages, never letting them accumulate in a backlog.
Lessons Learned
Architecture Decisions Age Poorly When Made Under Pressure
The most expensive lesson: architectural decisions made early are incredibly difficult to change later. The cost of a monolith that wasn't designed for scale isn't just technical — it's the organisational friction of trying to change fundamental structures while the system is in production, under load, with users depending on it.
Invest in the architecture upfront, even when the pressure is to ship. Use proven technologies over cutting-edge when reliability is the primary constraint. Design for the traffic you'll have in 18 months, not just today.
Test at the Scale You'll Actually Run
We learned the hard way that small-scale tests don't reveal scale-specific problems. Performance characteristics at 10 transactions per second often bear no resemblance to behaviour at 230. Load test with realistic data volumes. Stress test beyond expected capacity — you want to know where the breaking points are before your users find them. Practice chaos engineering to understand how the system fails, not just how it succeeds.
Measure Everything — Especially Business Outcomes
Technical metrics are necessary but not sufficient. The metric that mattered most to Safaricom was transaction completion rate — and that number told a direct story about revenue. Tying technical performance to business outcomes makes prioritisation decisions clearer and makes it far easier to justify investment in reliability and performance work.
Plan for Failure, Not Just Success
Circuit breakers, graceful degradation, backup systems, and detailed runbooks for failure scenarios aren't pessimism — they're engineering discipline. The question is never whether something will fail, but when and how badly. Systems designed with failure in mind fail gracefully. Systems that assume success fail catastrophically.
The Results
These decisions and optimisations produced measurable outcomes: 99.9% uptime during peak periods, an 11% revenue increase through higher transaction completion rates, and a 50% reduction in processing latency. More importantly, the system became something the team could reason about, modify, and extend with confidence — which is the real measure of good architecture.