A wake-up call at 3 a.m. changed my API design philosophy forever. It was a Tuesday when my phone buzzed with the alert: our API, the gateway to our services, had crashed, taking down three downstream services with it. By the time I reached my laptop, customer support tickets were pouring in. The root cause? A single database replica failure, with no fallback plan in place. The lesson was expensive: a $14,000 hit in service-level agreement credits and a loss of customer trust. But that night also taught me more about API design than anything that came before it.
The 3 a.m. Test: A Simple Yet Powerful Idea
I devised a simple test, which I now apply to every API I build: The 3 a.m. Test. It's a straightforward question: When this system breaks at 3 a.m., will the on-call engineer be able to diagnose and fix it quickly? This test has helped me eliminate 'clever' design choices that often complicate matters. For instance, complex error codes that require documentation lookups or implicit state that depends on previous requests are a no-go.
After that fateful night, I rebuilt our API infrastructure from scratch. Over three years and handling 50 million daily requests, I developed five key principles that transformed our reliability from 99.2% to 99.95%.
Principle 1: Embrace Partial Failure
Six months later, we faced another outage. This time, a downstream payment processor went unresponsive, and our API waited indefinitely for its responses until connections piled up and the whole service crashed. I realized we needed systems that degrade gracefully rather than fail catastrophically.
Here's what we implemented:
- A resilient service client class with primary and fallback URLs.
- A circuit breaker pattern to handle timeouts and connection errors.
- A degraded response mechanism to handle failures, ensuring users get something rather than an error.
The key insight: A degraded response is almost always better than an error. Users can work with limited functionality or stale data, but a 500 error stops them in their tracks.
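As a sketch of how those three pieces fit together (the class names and structure here are my illustration, not our production code), a minimal resilient client can combine a circuit breaker with a fallback source and a canned degraded payload:

```python
import time


class CircuitBreaker:
    """Open after `threshold` consecutive failures; allow a probe after `cooldown` seconds."""

    def __init__(self, threshold=3, cooldown=30.0):
        self.threshold = threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None

    def allow(self):
        if self.opened_at is None:
            return True
        # Half-open: let a single probe request through once the cooldown has elapsed.
        return time.monotonic() - self.opened_at >= self.cooldown

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = time.monotonic()


class ResilientClient:
    """Try primary, then fallback; serve a degraded payload instead of raising."""

    def __init__(self, primary, fallback, degraded):
        self.primary = primary      # callable that fetches from the primary URL
        self.fallback = fallback    # callable that fetches from the fallback URL
        self.degraded = degraded    # stale/static payload served as a last resort
        self.breaker = CircuitBreaker()

    def get(self):
        if self.breaker.allow():
            try:
                data = self.primary()
                self.breaker.record_success()
                return {"status": "ok", "data": data}
            except Exception:
                self.breaker.record_failure()  # timeouts and connection errors trip the breaker
        try:
            return {"status": "ok", "data": self.fallback()}
        except Exception:
            return {"status": "degraded", "data": self.degraded}
```

Note that `get` never raises: the worst case is a `"degraded"` status with stale data, which the caller can render with a warning banner instead of a 500.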
Principle 2: Idempotency is Non-Negotiable
This principle came at a cost: a $27,000 mistake. A mobile client bug aggressively retried failed requests, including a payment request that carried no idempotency key. The result? A customer was charged 23 times for the same order. Refunds, customer service hours, and the engineering time to fix the issue added up to a costly lesson.
Now, every mutating endpoint requires an idempotency key. Our idempotent endpoint class handles this, caching responses and preventing data corruption from retry bugs.
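A minimal sketch of that pattern (the in-memory dictionary and names are illustrative; a production version would back the cache with something like Redis and a TTL):

```python
import threading


class IdempotentEndpoint:
    """Cache the first response per idempotency key so retries never repeat the side effect."""

    def __init__(self, handler):
        self.handler = handler  # performs the actual mutation, e.g. charging a card
        self._responses = {}    # idempotency key -> cached response
        self._lock = threading.Lock()

    def handle(self, idempotency_key, payload):
        if not idempotency_key:
            raise ValueError("Idempotency-Key header is required on mutating endpoints")
        # Lock held across the handler for simplicity; production would lock per key.
        with self._lock:
            if idempotency_key not in self._responses:
                self._responses[idempotency_key] = self.handler(payload)
            return self._responses[idempotency_key]  # retries replay the cached response
```

With this in place, the 23-retries bug becomes harmless: retry number two through twenty-three all replay the cached response from the first charge.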
Principle 3: Version in the URL, Not the Header
I learned this lesson by watching a junior engineer spend six hours debugging an issue. Our API was versioned through a custom header, which seemed clean and tidy. But when something went wrong, the logs showed only the URL and response code, not the headers, so there was no way to tell which API version the failing requests were hitting.
We moved versioning to the URL path, ensuring version information is visible in every log entry, trace, error report, and monitoring dashboard. We also established clear versioning rules to prevent future headaches.
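In practice this just means the version is an ordinary path segment. A tiny illustrative dispatcher (the route table here is hypothetical) shows why it pays off: the version string lands in every access-log line with no extra work:

```python
import re

# Hypothetical route table: the version is just another path segment.
ROUTES = {
    ("v1", "orders"): lambda: "v1 order list",
    ("v2", "orders"): lambda: "v2 order list, paginated",
}


def route(path):
    """Dispatch /vN/resource paths; anything logging the path logs the version too."""
    match = re.fullmatch(r"/(v\d+)/(\w+)", path)
    if match is None:
        return 404, "not found"
    handler = ROUTES.get((match.group(1), match.group(2)))
    if handler is None:
        return 404, "unknown version or resource"
    return 200, handler()
```

A grep for `/v1/` in the access log now answers "which version was that failing request on?" in seconds rather than six hours.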
Principle 4: Rate Limit Early, Not When You Need To
We almost learned this lesson the hard way when a partner company's integration bug caused an infinite retry loop, sending 50,000 requests per second. Fortunately, our load balancer's basic protection kicked in, but legitimate traffic was also affected.
We implemented tiered rate limiting, with different limits for different client tiers and per endpoint. This ensured that when the partner's bug happened again six months later, their requests were rate-limited, and our other clients were unaffected.
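A sketch of that scheme (the tier names and numbers are made up for illustration) using a token bucket per (client, endpoint) pair, so one noisy partner exhausts only its own bucket:

```python
import time

# Hypothetical tiers: sustained requests per second (also used as the burst capacity).
TIER_LIMITS = {"free": 10, "partner": 100, "internal": 1000}


class TieredRateLimiter:
    """Token bucket per (client, endpoint) pair, refilled at the client's tier rate."""

    def __init__(self, limits=TIER_LIMITS, clock=time.monotonic):
        self.limits = limits
        self.clock = clock          # injectable for testing
        self.buckets = {}           # (client_id, endpoint) -> [tokens, last_refill_time]

    def allow(self, client_id, tier, endpoint):
        rate = self.limits[tier]
        now = self.clock()
        key = (client_id, endpoint)
        bucket = self.buckets.get(key)
        if bucket is None:
            bucket = self.buckets[key] = [float(rate), now]
        # Refill proportionally to elapsed time, capped at the burst capacity.
        bucket[0] = min(float(rate), bucket[0] + (now - bucket[1]) * rate)
        bucket[1] = now
        if bucket[0] >= 1.0:
            bucket[0] -= 1.0
            return True
        return False  # caller should respond 429 Too Many Requests
```

Keying the bucket on the endpoint as well as the client means a retry storm against one endpoint can't starve a client's legitimate traffic to the rest of the API.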
Principle 5: If You Can't See It, You Can't Fix It
The scariest outages are not when everything breaks but when something is subtly wrong, and you don't notice for days. We had an issue where 3% of requests were failing with a specific error code, which went unnoticed for two weeks.
After that, we built observability into every endpoint, ensuring we could catch issues quickly. Our minimum observability requirements now include request counts, latency percentiles, error rates, distributed tracing, and alerts at a 1% error rate.
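A minimal sketch of that per-endpoint instrumentation (the names and the raw-list latency store are simplifications; a production setup would export histograms and percentiles to a real metrics backend):

```python
import time
from collections import defaultdict


class EndpointMetrics:
    """Per-endpoint request counts, latencies, and error-rate alerting."""

    def __init__(self, alert_threshold=0.01):   # alert at a 1% error rate
        self.alert_threshold = alert_threshold
        self.requests = defaultdict(int)
        self.errors = defaultdict(int)
        self.latencies = defaultdict(list)      # production: histograms, not raw lists

    def observe(self, endpoint, handler, *args, **kwargs):
        start = time.monotonic()
        self.requests[endpoint] += 1
        try:
            return handler(*args, **kwargs)
        except Exception:
            self.errors[endpoint] += 1
            raise                               # record, but never swallow, the failure
        finally:
            self.latencies[endpoint].append(time.monotonic() - start)

    def error_rate(self, endpoint):
        total = self.requests[endpoint]
        return self.errors[endpoint] / total if total else 0.0

    def should_alert(self, endpoint):
        return self.error_rate(endpoint) > self.alert_threshold
```

The point of the 1% threshold is exactly the two-week blind spot described above: a 3% failure rate on a single endpoint trips the alert on the first scrape, not fourteen days later.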
The Results
After three years of applying these principles, our API infrastructure saw significant improvements:
- Monthly availability increased from 99.2% to 99.95%.
- Mean time to detection reduced from 45 minutes to 3 minutes.
- Mean time to recovery decreased from 2 hours to 18 minutes.
- Client-reported errors dropped from 340/week to 23/week.
- 3 a.m. pages dropped from 8 per month to 0.5.
What I'd Tell My Past Self
If I could go back, I'd tell myself to build for failure from day one. Every external call and database will eventually fail, so design for it proactively. Make the safe thing the easy thing. Invest in observability early, as you can't fix what you can't see. And remember, boring is good. The clever solution that's hard to debug at 3 a.m. isn't clever.
APIs don't survive by accident; they survive by design. By designing for the moment when everything goes wrong, you can ensure your API is resilient and reliable.
Now, when my phone buzzes at 3 a.m., it's usually just spam. And that's exactly how I like it.