API Design Patterns That Scale to Millions of Requests
Every production system we build at LockedIn Labs communicates through APIs. Internal services talk to each other over gRPC. Mobile clients consume GraphQL. Third-party integrations hit REST endpoints. The APIs are the contracts that hold the system together, and poorly designed contracts create friction that compounds with every new consumer, every new version, and every spike in traffic. We have designed, built, and operated APIs serving billions of requests per month across fintech, healthcare, and enterprise SaaS. This article shares the patterns that have survived production at scale.
The common mistake is treating API design as a backend implementation detail. It is not. Your API is a product. It has consumers who depend on its behavior, its performance characteristics, and its stability. Changing an API after consumers have integrated is expensive and politically difficult. The time to get the design right is before the first consumer writes their first line of integration code.
REST vs GraphQL vs gRPC: Choosing the Right Protocol
The protocol decision is not about which technology is best in the abstract. It is about which technology fits your specific consumer profiles, performance requirements, and operational constraints. REST is the default for public-facing APIs because every HTTP client in every language can consume it, the caching model is well understood, and the tooling ecosystem is mature. GraphQL is superior for mobile and frontend clients that need flexible data fetching, because it eliminates over-fetching and under-fetching and reduces the number of round trips. gRPC is the choice for internal service-to-service communication where type safety, code generation, and streaming are more important than human readability.
Most production systems use more than one protocol. At LockedIn Labs, a typical architecture exposes REST for third-party integrations, GraphQL for the web and mobile frontends, and gRPC for internal microservice communication. An API gateway sits at the edge, routing external requests to the appropriate backend protocol. This is not added complexity for its own sake — each protocol serves the needs of its consumers better than a one-size-fits-all choice would.
Protocol Selection Matrix
REST: Public APIs, third-party integrations, webhooks, CRUD-heavy domains, browser-to-server
GraphQL: Mobile/web frontends, complex data relationships, client-specific data needs, rapid iteration
gRPC: Internal services, high-throughput streaming, polyglot environments, latency-sensitive paths
Pagination: The Detail That Breaks at Scale
Pagination seems trivial until it is not. Offset-based pagination — skip 100, take 20 — is simple to implement and works fine for small datasets. But when your table has 50 million rows, OFFSET 1000000 LIMIT 20 forces the database to scan and discard a million rows before returning the 20 you want. Query time grows linearly with the offset value, and at scale this makes deep pagination unusably slow.
Cursor-based pagination solves this by using a stable reference point — typically the primary key or a sort key of the last item on the current page — as the starting point for the next query. The database can use an index seek to jump directly to the cursor position, making page 10,000 as fast as page 1. The tradeoff is that cursor pagination does not support jumping to arbitrary page numbers, which makes it unsuitable for UI patterns that show numbered pages. For API consumers, this is rarely a problem — most consumers iterate sequentially through result sets.
Cursor-based pagination response format
// Request
GET /api/v1/orders?limit=20&cursor=eyJpZCI6MTAwMH0=
// Response
{
  "data": [
    { "id": 1001, "status": "completed", "total": 149.99 },
    { "id": 1002, "status": "pending", "total": 89.50 },
    // ... 18 more items
  ],
  "pagination": {
    "next_cursor": "eyJpZCI6MTAyMH0=",
    "has_more": true,
    "limit": 20
  }
}
// The cursor is an opaque, base64-encoded token.
// The server decodes it to: { "id": 1020 }
// Next query: SELECT * FROM orders WHERE id > 1020 ORDER BY id LIMIT 20
// This uses an index seek — O(log n) regardless of position.

For APIs that need to support complex sorting and filtering with pagination, keyset pagination extends the cursor concept to composite keys. If the consumer sorts by created_at descending, the cursor encodes both the created_at value and the id as a tiebreaker. The WHERE clause becomes created_at < cursor_created_at OR (created_at = cursor_created_at AND id < cursor_id). This maintains stable ordering even when new items are inserted between page fetches.
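The composite-key cursor described above can be sketched in a few lines. This is a minimal Python sketch, assuming a created_at DESC, id DESC sort order; the function and column names are illustrative, not from any specific codebase:

```python
import base64
import json


def encode_cursor(created_at: str, id_: int) -> str:
    """Encode the composite sort key of the last item on a page as an opaque token."""
    payload = {"created_at": created_at, "id": id_}
    return base64.urlsafe_b64encode(json.dumps(payload).encode()).decode()


def decode_cursor(cursor: str) -> dict:
    """Decode the opaque token back into the composite key."""
    return json.loads(base64.urlsafe_b64decode(cursor.encode()))


def keyset_predicate(cursor: dict) -> tuple[str, tuple]:
    """Build the WHERE fragment for a created_at DESC, id DESC sort.

    Matches rows strictly after the cursor position in sort order:
    created_at < :ca OR (created_at = :ca AND id < :id)
    """
    sql = "(created_at < %s OR (created_at = %s AND id < %s))"
    params = (cursor["created_at"], cursor["created_at"], cursor["id"])
    return sql, params


# Round trip: the token returned with one page feeds the predicate for the next.
token = encode_cursor("2026-01-15T10:30:00Z", 1020)
where, args = keyset_predicate(decode_cursor(token))
```

Because the token is opaque to consumers, the server is free to change the cursor's internal fields later without breaking the API contract.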
Rate Limiting: Protecting Your System and Your Consumers
Rate limiting is not punishment — it is a reliability mechanism that protects both the API provider and the API consumers. Without rate limiting, a single misbehaving client can saturate your backend, degrading performance for every other consumer. With poorly designed rate limiting, legitimate traffic gets rejected while abusive patterns slip through. Good rate limiting requires understanding your traffic patterns and designing limits that accommodate normal usage while preventing abuse.
We implement rate limiting at multiple layers. The API gateway enforces global rate limits per API key — typically 1,000 to 10,000 requests per minute for production integrations, with higher limits available on request. Behind the gateway, individual services enforce resource-specific limits based on the cost of the operation. A lightweight read endpoint might allow 100 requests per second per consumer, while a compute-intensive analytics endpoint might allow 10 per minute. The rate limit headers are always present in the response: X-RateLimit-Limit (the maximum), X-RateLimit-Remaining (how many are left), and X-RateLimit-Reset (when the window resets). These headers let consumers implement client-side throttling and avoid hitting the limit in the first place.
The token bucket algorithm is our default implementation because it accommodates burst traffic while enforcing an average rate. A consumer with a 100 requests per minute limit can send 20 requests in a one-second burst (if they have accumulated tokens) without being rejected, as long as their sustained rate stays within the limit. For distributed systems, we use a centralized rate limiter backed by Redis with the sliding window log algorithm, which provides more accurate rate limiting than fixed windows at the cost of slightly more memory usage. The rate limiter itself must be fast and resilient — if the rate limiter is down, we fail open (allow traffic) rather than fail closed (reject everything), because blocking all traffic is worse than temporarily allowing excess traffic.
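A minimal in-process token bucket illustrates the burst behavior described above. This is an illustrative sketch, not a production limiter; at scale the same logic would run against Redis so that all gateway instances share one bucket per consumer:

```python
import time


class TokenBucket:
    """Token bucket: allows bursts up to `capacity` tokens while enforcing an
    average rate of `refill_rate` tokens per second."""

    def __init__(self, capacity: float, refill_rate: float, clock=time.monotonic):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = capacity          # start full so early bursts are allowed
        self.clock = clock              # injectable clock for testing
        self.last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self.clock()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# 100 requests/minute on average, with bursts of up to 20 allowed.
bucket = TokenBucket(capacity=20, refill_rate=100 / 60)
```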
Versioning: Managing Change Without Breaking Consumers
API versioning is a problem of backward compatibility, not software versioning. The question is: when you change the API, how do you avoid breaking existing consumers? There are three dominant strategies, each with clear tradeoffs. URL versioning (/api/v1/orders, /api/v2/orders) is the most visible and the easiest for consumers to understand. Header versioning (Accept: application/vnd.api+json;version=2) keeps URLs clean but is harder to discover and test. Query parameter versioning (?version=2) is a compromise that works but feels like an afterthought.
We use URL versioning as the default because it makes the version explicit, cacheable, and debuggable. When a consumer reports an issue, there is no ambiguity about which version they are using. The major version in the URL changes only for breaking changes — removing a field, changing a field type, altering the semantics of an operation. Additive changes — new optional fields, new endpoints, new query parameters — are made within the existing version. This means most APIs stay on v1 for years, with v2 only introduced when a fundamental redesign is necessary.
Versioning with sunset headers for deprecation
// Response from deprecated v1 endpoint
HTTP/1.1 200 OK
Content-Type: application/json
Sunset: Mon, 01 Jun 2026 00:00:00 GMT
Deprecation: true
Link: </api/v2/orders>; rel="successor-version"
// Response body includes deprecation notice
{
"data": { ... },
"meta": {
"api_version": "v1",
"deprecation_notice": "This endpoint will be removed on 2026-06-01. Please migrate to /api/v2/orders. See https://docs.example.com/migration-guide"
}
}

Caching: The Fastest Request Is the One You Never Make
Caching is the single most effective technique for scaling API throughput. A well-designed caching strategy can reduce backend load by 90% or more for read-heavy workloads. But caching also introduces the hardest problems in computer science: cache invalidation and cache consistency. The design tradeoffs depend on how much stale data your consumers can tolerate and how quickly data changes in your system.
We cache at three layers. HTTP caching uses Cache-Control headers to tell consumers and CDNs how long a response can be reused. For public, stable data (product catalog, configuration), we set long TTLs with CDN caching. For user-specific or frequently changing data, we use short TTLs or no-cache with ETag validation. Application-layer caching uses Redis to store computed results that are expensive to generate — aggregated analytics, search results, recommendation feeds. The cache key includes all parameters that affect the result, and cache entries are invalidated when underlying data changes through event-driven invalidation rather than TTL expiry. Database query caching is the innermost layer — frequently executed queries with stable results are cached at the ORM level with short TTLs.
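The application-layer piece can be sketched as follows. A plain dict stands in for Redis here, and the names are illustrative; the two ideas to notice are the cache key derived from every parameter that affects the result, and invalidation by namespace when a data-change event arrives rather than by TTL:

```python
import hashlib
import json


class AppCache:
    """Application-layer cache sketch. A dict stands in for Redis; keys include
    every parameter that affects the result, and the event handler invalidates
    a whole namespace when underlying data changes."""

    def __init__(self):
        self.store = {}

    @staticmethod
    def key(namespace: str, **params) -> str:
        # Stable serialization so the same params always produce the same key,
        # regardless of argument order.
        digest = hashlib.sha256(
            json.dumps(params, sort_keys=True).encode()).hexdigest()[:16]
        return f"{namespace}:{digest}"

    def get_or_compute(self, key: str, compute):
        if key not in self.store:
            self.store[key] = compute()   # expensive work runs only on a miss
        return self.store[key]

    def invalidate_namespace(self, namespace: str):
        """Called from the event handler when underlying data changes."""
        prefix = namespace + ":"
        for k in [k for k in self.store if k.startswith(prefix)]:
            del self.store[k]
```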
The stale-while-revalidate pattern is particularly powerful for APIs with high read traffic. When a cache entry expires, the first request triggers an asynchronous background refresh while immediately returning the stale cached value. Subsequent requests within the refresh window also receive the stale value. Only after the refresh completes do new requests receive the updated value. This eliminates cache stampedes — the thundering herd problem where hundreds of concurrent requests all miss the cache simultaneously and hammer the backend.
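The mechanics can be sketched like this. It is an illustrative single-threaded version: a real implementation would run the refresh on a background worker, but the core logic, serving stale while only one caller refreshes, is the same:

```python
import time


class SWRCache:
    """Stale-while-revalidate sketch. Expired entries are served stale while a
    single caller refreshes them, which is what prevents cache stampedes.
    (Here the refresh runs inline; in production it would be asynchronous.)"""

    def __init__(self, ttl: float, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock
        self.entries = {}        # key -> (value, fetched_at)
        self.refreshing = set()  # keys with a refresh already in flight

    def get(self, key, fetch):
        entry = self.entries.get(key)
        if entry is None:
            value = fetch()                        # cold miss: must fetch
            self.entries[key] = (value, self.clock())
            return value
        value, fetched_at = entry
        if self.clock() - fetched_at > self.ttl and key not in self.refreshing:
            self.refreshing.add(key)               # first caller refreshes...
            try:
                self.entries[key] = (fetch(), self.clock())
            finally:
                self.refreshing.discard(key)
            return value                           # ...but still serves stale
        return value
```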
Documentation: The API Is Only as Good as Its Docs
An undocumented API is an unusable API. The best API design in the world is worthless if consumers cannot discover endpoints, understand request formats, or debug error responses. We treat API documentation as code: it lives in the same repository as the API implementation, it is generated from the same source of truth (OpenAPI specification for REST, SDL for GraphQL, protobuf definitions for gRPC), and it is deployed through the same CI/CD pipeline. If the documentation disagrees with the implementation, the CI build fails.
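One lightweight way to make the build fail on drift is to diff a documented response shape against a live response captured in tests. A simplified stdlib-only sketch of that idea (a real pipeline would validate against the full OpenAPI specification rather than a hand-written shape):

```python
def schema_drift(documented: dict, actual: dict, path: str = "") -> list[str]:
    """Compare a documented response shape against an actual response payload.

    `documented` maps field names to expected Python types, with nested dicts
    for nested objects. Returns a list of drift messages; a non-empty list
    fails the CI build.
    """
    problems = []
    for field, expected in documented.items():
        where = f"{path}.{field}".lstrip(".")
        if field not in actual:
            problems.append(f"documented field missing from response: {where}")
        elif isinstance(expected, dict):
            if isinstance(actual[field], dict):
                problems.extend(schema_drift(expected, actual[field], where))
            else:
                problems.append(f"expected object at {where}")
        elif not isinstance(actual[field], expected):
            problems.append(f"type mismatch at {where}")
    for field in actual:
        if field not in documented:
            where = f"{path}.{field}".lstrip(".")
            problems.append(f"undocumented field in response: {where}")
    return problems
```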
Good API documentation includes more than endpoint references. It includes getting started guides that take a new consumer from zero to their first successful API call in under five minutes. It includes authentication examples in multiple programming languages. It includes error response catalogs with explanations and suggested remediation for every error code. It includes rate limiting documentation that explains limits, headers, and retry strategies. It includes changelog documentation that describes every change to the API, tagged by version, with migration guides for breaking changes.
API Design Principles We Follow on Every Project
p50 < 50ms: Median API response time target for all synchronous endpoints
99.99%: API availability SLO across all production services
0: Breaking changes without a deprecation period and migration guide
API design is engineering discipline, not creative expression. The best APIs are boring — they follow established conventions, they behave predictably, they handle errors gracefully, and they scale without surprising their consumers. The patterns in this article are not novel. They are battle-tested approaches that we have refined across dozens of production systems. The value is not in knowing them but in applying them consistently, from the first endpoint to the thousandth.
Building APIs that need to scale?
Our backend engineering team designs and builds APIs that handle millions of requests without breaking a sweat.