I am not going to name the system. The lawyers would not approve and, honestly, the patterns below repeat across every large consumer product so it makes little difference. What I will tell you is that it served just over a billion daily active users at peak, ran across four continents, and — by the time I left — moved somewhere close to five petabytes of application-layer traffic a day.

These are the patterns that actually held. Not what we planned. What survived contact with reality.

1. Cache aggressively, invalidate carefully, and own the staleness budget

Caching is the first tool you reach for and the easiest one to get wrong. The cheap version — "put Redis in front of Postgres, cache for five minutes" — works at 100k users and quietly explodes at 100 million.
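The difference between the cheap version and the grown-up version is mostly about who owns the staleness. A minimal sketch of cache-aside with an explicit, owned staleness budget and a write-path invalidation hook (all names here are illustrative, not from the actual system):

```python
import time

class CacheAside:
    """Cache-aside with an explicit per-key staleness budget.

    `fetch` is the authoritative read (e.g. a database query); `ttl_s` is
    a staleness budget someone has signed off on, not a default copied
    from a tutorial.
    """

    def __init__(self, fetch, ttl_s):
        self._fetch = fetch          # loader: key -> value (assumed)
        self._ttl_s = ttl_s
        self._store = {}             # key -> (value, expires_at)

    def get(self, key):
        hit = self._store.get(key)
        if hit is not None:
            value, expires_at = hit
            if time.monotonic() < expires_at:
                return value         # within the staleness budget
        value = self._fetch(key)     # miss or stale: go to the source
        self._store[key] = (value, time.monotonic() + self._ttl_s)
        return value

    def invalidate(self, key):
        # Called from the write path, not left to TTL expiry alone.
        self._store.pop(key, None)
```

The point of the sketch is the shape, not the code: the TTL is a named, deliberate number, and writes invalidate explicitly rather than waiting for expiry.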

Three specific things mattered at scale:

2. Shard early, but shard on the right key

Everyone knows you have to shard. What nobody tells you is that the choice of sharding key is essentially irreversible, and that you will make the wrong choice the first time.

We sharded initially by user ID. This was fine until we discovered that about 0.1% of our users — celebrities, influencers, bots, a single extremely active test account that got forgotten — accounted for about 40% of the read traffic. The shards containing these users became hot and the rest sat idle.

The fix was not more shards. The fix was a second routing layer that identified the heavy hitters and replicated them across every shard, with a slightly more expensive write path in exchange for a dramatically cheaper read path. This is the pattern Cassandra calls "multi-copy" and Facebook calls "hot-key split". Every mature sharded system reinvents it.
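In sketch form, the second routing layer is small: normal keys hash to one home shard, known-hot keys fan writes out to every shard so reads can land anywhere. (The class and the heavy-hitter list are hypothetical; the real system discovered hot keys with an offline job.)

```python
import hashlib
import random

class HotKeyRouter:
    """Two-tier shard routing: hot keys are replicated everywhere.

    Writes to a hot key go to all shards (the more expensive write path);
    reads of a hot key can be served by any shard (the dramatically
    cheaper read path).
    """

    def __init__(self, num_shards, hot_keys):
        self._n = num_shards
        self._hot = set(hot_keys)    # fed by a heavy-hitter detection job

    def _home(self, key):
        # Stable hash so a key's home shard never moves between calls.
        return int(hashlib.md5(key.encode()).hexdigest(), 16) % self._n

    def shards_for_write(self, key):
        return list(range(self._n)) if key in self._hot else [self._home(key)]

    def shard_for_read(self, key):
        return random.randrange(self._n) if key in self._hot else self._home(key)
```

The asymmetry is the whole trick: the 0.1% of keys pay a fan-out cost on write so that 40% of the read traffic spreads across the fleet.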

3. Asynchronous by default, synchronous by exception

A rule that survived every redesign: if a user-visible action has to block on a downstream service, we owe the user an explanation of why. Everything else goes through a queue.

This sounds slightly mystical until you have watched a single slow downstream service take the entire user-facing API down with it for fifteen minutes. Asynchronous work is not about efficiency. It is about fault isolation. A failed write to the newsfeed-ranking pipeline should not prevent the user from posting.
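The rule reduces to a small amount of code in the handler. A sketch (the function and field names are illustrative, not from the real system): the one write the user is actually waiting on happens synchronously, and everything downstream is enqueued so that a broken consumer delays fan-out without ever failing the post.

```python
import queue

def handle_post(user_id, text, posts_db, fanout_queue):
    """Synchronous path: only the durable write the user is blocking on."""
    post_id = len(posts_db) + 1
    posts_db[post_id] = (user_id, text)           # the one blocking write
    try:
        fanout_queue.put_nowait({"post_id": post_id, "user": user_id})
    except queue.Full:
        # Fan-out is degraded, but the post itself succeeded; a sweeper
        # job (not shown) would re-enqueue from the posts table.
        pass
    return post_id
```

Note the failure mode: when the queue is saturated, the user-visible action still completes, which is exactly the isolation property the rule is buying.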

4. One datastore per workload, not one datastore per team

The platonic ideal of "one database" dies at scale. What replaces it is not chaos. It is a small number of datastores, each picked deliberately, each owned by the platform team, each documented with the precise workload it is suitable for.

We ended up with: a sharded relational store for user identity and money; a wide-column store for timelines and feeds; a search index for full-text; an in-memory cache for session state; and a blob store for everything bigger than 1 MB. That was it. The rule was "no new datastore without a Staff engineer's signature on the request". The rule was more important than the particular set.
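The rule enforces itself best when the mapping is written down in code rather than in a wiki. A toy version of such a registry, with invented workload and store names standing in for the real set:

```python
APPROVED_DATASTORES = {
    # workload             -> store (illustrative names only)
    "identity_and_money":   "sharded-relational",
    "timelines_and_feeds":  "wide-column",
    "full_text_search":     "search-index",
    "session_state":        "in-memory-cache",
    "large_blobs":          "blob-store",    # anything bigger than 1 MB
}

def datastore_for(workload):
    """Fail loudly: a new workload needs an approved entry, not an ad-hoc
    store spun up by whichever team hit the limitation first."""
    try:
        return APPROVED_DATASTORES[workload]
    except KeyError:
        raise LookupError(
            f"no approved datastore for {workload!r}; "
            "new stores require platform-team sign-off"
        )
```

The `LookupError` is the point: the request for a sixth datastore surfaces as an explicit conversation instead of a fait accompli.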

5. Capacity is about the 99th percentile, not the average

When you run a service that serves a billion people, the tail of your latency distribution is the service. The median user has a fantastic experience because the median request is fast. The bottom 5% of users — whose requests happen to land on a slow shard, or during a GC pause, or inside a bad deploy — are the ones whose experience shapes whether your product succeeds.

The things that actually moved the p99 in my experience were, in order: fewer network hops, shorter timeouts, smarter retries (with jitter, always with jitter), and — the single biggest win — admission control. A service that serves 100% of incoming requests at a p99 of 4 seconds is worse than a service that serves 97% of incoming requests at a p99 of 200 ms and rejects the other 3% in under 10 ms. The rejected requests get a retry budget; the system stays healthy; the overall experience improves.
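Admission control sounds heavyweight but the core mechanism is tiny. A sketch of a concurrency-based controller, assuming a fixed in-flight limit (the class is hypothetical; production versions usually adapt the limit from observed latency):

```python
import threading

class AdmissionController:
    """Shed load in microseconds instead of queueing behind slow work.

    Requests beyond `max_inflight` get an immediate rejection (a cheap
    429 with a retry budget) rather than a slot in a growing queue,
    which is what keeps the p99 of the admitted requests flat.
    """

    def __init__(self, max_inflight):
        self._slots = threading.Semaphore(max_inflight)

    def try_admit(self):
        # Non-blocking acquire: False means "rejected, retry later".
        return self._slots.acquire(blocking=False)

    def release(self):
        # Called when the admitted request finishes, success or failure.
        self._slots.release()
```

The non-blocking acquire is the load-shedding decision: saturation is detected and answered without touching any downstream dependency.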

6. The org chart is part of the architecture

Conway's Law is not a law. It is an observation that you cannot escape. If two teams must coordinate every week to ship a feature, the bit of software they both touch will be unhealthy within a year. If a team owns a service end to end, the service will be healthier than its peers even if it is technically worse.

At scale the single most consequential architectural decision is not "microservices or monolith". It is "what are the team boundaries and do the service boundaries follow them?" Get that right and a mediocre architecture survives. Get it wrong and a brilliant architecture rots.

What didn't work

For completeness, and because tech blogs only ever tell the victory stories:

None of the above is specific to a particular stack or cloud provider. The patterns appear, under different names, in every serious system I've worked on. If you only take one thing away from this post: scale is mostly about picking fewer, better-understood primitives and having the discipline not to add more.

Nivaan