
DevOps at Scale: How We Maintain 99.99% Uptime Across 200+ Services

LockedIn Labs Engineering Team · January 25, 2026 · 9 min read

At LockedIn Labs, our platform engineering team operates over 200 services across multiple cloud providers and Kubernetes clusters, serving aggregate traffic measured in billions of requests per month. Our trailing twelve-month uptime across production services is 99.993%. That number did not happen by accident. It is the product of deliberate engineering decisions, hard-won lessons from production incidents, and an operational culture built around observability, automation, and blameless accountability.

This article shares the frameworks, tools, and practices we use to maintain that level of reliability. It is not a theoretical overview of SRE principles — it is a practitioner’s guide drawn from real war stories, real outages, and real recovery. Every pattern described here has been battle-tested in production under genuine load.

The Observability Stack: See Everything, Alert on What Matters

Observability is not monitoring with a better name. Monitoring tells you when something is broken. Observability lets you ask arbitrary questions about your system’s behavior without deploying new code. The distinction matters because in a distributed system with 200+ services, you cannot predict which questions you will need to ask during an incident. You need the ability to slice and dice telemetry in real time, correlating metrics, logs, and traces across service boundaries.

Our observability stack rests on four pillars, each serving a distinct purpose. Metrics provide the quantitative heartbeat: request rates, error rates, latency percentiles, saturation levels. We use Prometheus for collection and Grafana for visualization, with custom recording rules that pre-compute the aggregations we query most frequently during incidents. Every service exposes a standard set of RED metrics (Rate, Errors, Duration) plus domain-specific business metrics.
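As a concrete sketch, RED aggregation over a window of request records can be computed like this. This is pure-Python illustration, not our production schema; in practice these values come from Prometheus recording rules, and the field names here are assumptions:

```python
import math
from dataclasses import dataclass
from typing import List

@dataclass
class Request:
    duration_ms: float
    ok: bool

def red_metrics(requests: List[Request], window_s: float) -> dict:
    """Aggregate the RED signals (Rate, Errors, Duration) over a window."""
    n = len(requests)
    if n == 0:
        return {"rate_rps": 0.0, "error_ratio": 0.0, "p99_ms": 0.0}
    durations = sorted(r.duration_ms for r in requests)
    p99 = durations[math.ceil(0.99 * n) - 1]  # nearest-rank percentile
    return {
        "rate_rps": n / window_s,                                  # Rate
        "error_ratio": sum(1 for r in requests if not r.ok) / n,   # Errors
        "p99_ms": p99,                                             # Duration
    }
```

A recording rule that pre-computes exactly these aggregations is what lets dashboards stay fast during an incident, when everyone is querying the same windows at once.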

Structured logs are the narrative layer. Every log line is JSON, every log line carries a trace ID and span ID, and every log line includes the service name, version, and deployment environment. We route logs through a centralized pipeline that enriches them with Kubernetes metadata — pod name, node, namespace — before indexing. The ability to jump from a log line to the corresponding trace, and from a trace span to the corresponding metrics dashboard, is the glue that makes observability work at scale.
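A minimal version of that convention, sketched as a Python `logging` formatter. The field names and service metadata are illustrative, not our exact schema; trace context is expected to be attached per-record (e.g. via `extra=`):

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render every log record as one JSON line carrying trace context
    plus static service metadata (service, version, environment)."""

    def __init__(self, service: str, version: str, env: str):
        super().__init__()
        self.static = {"service": service, "version": version, "env": env}

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "msg": record.getMessage(),
            # Trace context attached by instrumentation, e.g.
            # logger.info("...", extra={"trace_id": ..., "span_id": ...})
            "trace_id": getattr(record, "trace_id", None),
            "span_id": getattr(record, "span_id", None),
            **self.static,
        }
        return json.dumps(entry)
```

The Kubernetes enrichment (pod, node, namespace) happens later in the pipeline, so application code never needs to know where it is running.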

Distributed traces are the map. When a request touches twelve services, the trace shows the full journey: which service added the most latency, where retries happened, which downstream call failed. We use OpenTelemetry for instrumentation and maintain a sampling strategy that captures 100% of error traces and 10% of successful traces, with adaptive sampling that increases during anomalies.

Continuous profiling is the fourth pillar. We run always-on CPU and memory profiling across production services at a 1% sampling rate. When a service shows gradual memory growth or unexpected CPU usage, the profiling data is already there — no need to deploy a profiling build and wait for the problem to recur.
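The sampling policy can be sketched as decision logic like the following. This is a hand-rolled illustration, not the OpenTelemetry sampler API; only the 100%/10% split comes from the text above — the boosted rate, anomaly threshold, and window size are assumed values:

```python
import random

class ErrorBiasedSampler:
    """Keep every error trace, a base fraction of successes, and a larger
    fraction of successes while the recent error ratio looks anomalous."""

    def __init__(self, base_rate=0.10, boosted_rate=0.50,
                 error_threshold=0.05, window=1000):
        self.base_rate = base_rate          # 10% of successful traces
        self.boosted_rate = boosted_rate    # assumed rate during anomalies
        self.error_threshold = error_threshold
        self.window = window
        self.recent = []                    # sliding window: True = error

    def observe(self, is_error: bool) -> None:
        self.recent.append(is_error)
        if len(self.recent) > self.window:
            self.recent.pop(0)

    def should_sample(self, is_error: bool, rng=random.random) -> bool:
        self.observe(is_error)
        if is_error:
            return True                     # 100% of error traces
        error_ratio = sum(self.recent) / len(self.recent)
        rate = (self.boosted_rate if error_ratio > self.error_threshold
                else self.base_rate)
        return rng() < rate
```

In production this decision typically runs at the collector (tail sampling) so the verdict can account for the whole trace, not just one span.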

The Four Pillars of Observability

Metrics: RED metrics per service, Prometheus + Grafana, custom recording rules for incident queries

Logs: Structured JSON, trace-correlated, enriched with K8s metadata, centralized indexing

Traces: OpenTelemetry instrumentation, 100% error capture, adaptive sampling during anomalies

Profiling: Always-on CPU + memory profiling at 1% sampling, no deploy needed for analysis

Deployment Strategies: Zero-Downtime as a Non-Negotiable

Deployments are the single largest source of production incidents. In our historical incident data, 72% of outages were directly caused by a deployment. That statistic drove us to treat deployment safety as the highest-leverage investment in reliability. Every deployment at LockedIn Labs follows a progressive rollout strategy, and no deployment reaches 100% traffic without automated validation at each stage.

For stateless services, our default strategy is canary deployment. A new version is deployed to a small subset of pods (typically 5%) and monitored for a configurable bake time — usually 10 minutes for non-critical services, 30 minutes for Tier 1 services. During the bake period, our deployment controller compares error rates, latency percentiles, and business metrics between the canary and the stable population. If any metric degrades beyond a configured threshold, the deployment automatically rolls back. No human intervention required.
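The core of the automated canary comparison reduces to a threshold check like the one below. The error-delta and latency-ratio thresholds shown are assumptions for illustration, not our actual configured values:

```python
def canary_verdict(stable: dict, canary: dict,
                   max_error_delta: float = 0.005,
                   max_p99_ratio: float = 1.25) -> str:
    """Compare canary vs. stable populations after the bake period.

    Each dict carries aggregated metrics for its population, e.g.
    {"error_ratio": 0.001, "p99_ms": 200.0}. Returns 'promote' or
    'rollback'; any breached threshold triggers rollback.
    """
    if canary["error_ratio"] - stable["error_ratio"] > max_error_delta:
        return "rollback"   # canary errors degraded beyond tolerance
    if stable["p99_ms"] > 0 and canary["p99_ms"] / stable["p99_ms"] > max_p99_ratio:
        return "rollback"   # canary tail latency degraded beyond tolerance
    return "promote"
```

The important property is that the comparison is relative (canary vs. stable, same time window), so seasonal traffic shifts don't produce false rollbacks.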

For stateful services and database migrations, we use blue-green deployments with traffic shifting at the load balancer level. The new environment is fully provisioned and warmed up before any traffic is shifted. We run synthetic transactions against the green environment to validate functionality, and only then begin a gradual traffic shift: 1%, 5%, 25%, 50%, 100%, with automated health checks at each step. The blue environment remains available for instant rollback for 48 hours after cutover.
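The traffic-shifting loop, reduced to its essentials. The weight-setting and health-check callbacks stand in for whatever load balancer API is in use; only the 1/5/25/50/100 schedule comes from the text:

```python
TRAFFIC_STEPS = [1, 5, 25, 50, 100]  # percent of traffic on green

def shift_traffic(set_green_weight, health_check) -> bool:
    """Walk the weight schedule, gating each step on a health check.

    `set_green_weight(pct)` applies a weight at the load balancer;
    `health_check()` returns True if the green environment is healthy.
    On any failure, snap back to blue (weight 0) and stop.
    """
    for pct in TRAFFIC_STEPS:
        set_green_weight(pct)
        if not health_check():
            set_green_weight(0)   # instant rollback to blue
            return False
    return True
```

Injecting the two callbacks keeps the sequencing logic testable without a real load balancer, which is also how we exercise the rollback path in CI.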

Progressive rollouts extend the canary concept to feature-level granularity. Using feature flags integrated with our deployment pipeline, we can release new code to production but expose the new behavior to a controlled subset of users — starting with internal teams, expanding to beta users, and finally rolling out to general availability. This decouples deployment from release, which is one of the most impactful operational improvements a team can make.
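A common way to implement that kind of percentage rollout is deterministic hashing, sketched here with illustrative flag and user identifiers (this mirrors what most feature-flag systems do internally; it is not our specific vendor's API):

```python
import hashlib

def flag_enabled(flag: str, user_id: str, rollout_pct: int) -> bool:
    """Deterministic percentage rollout: hash (flag, user) into [0, 100).

    The same user always lands in the same bucket for a given flag, so
    raising the percentage only ever adds users to the cohort — nobody
    flip-flops between old and new behavior across requests.
    """
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < rollout_pct
```

Keying the hash on the flag name as well as the user means different flags get independent cohorts, so the same early adopters aren't exposed to every experiment at once.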

Key Insight

72% of our historical production incidents were deployment-related. Since implementing automated canary analysis with mandatory bake times, deployment-caused incidents dropped by 91%. The investment in deployment safety infrastructure paid for itself within three months.

Incident Response: SLOs, Error Budgets, and Blameless Postmortems

Reliability is not about preventing all incidents. It is about detecting incidents fast, responding effectively, and learning from every failure. Our incident response framework is built on three complementary systems: Service Level Objectives that define what “reliable enough” means, error budgets that quantify the cost of unreliability, and blameless postmortems that convert incidents into organizational learning.

Every production service has a defined SLO. For customer-facing APIs, the standard SLO is 99.95% of requests completing successfully within 300ms over a rolling 30-day window. For async processing services, the SLO measures processing completeness — 99.99% of messages processed within the contracted time. Each SLO has an associated error budget: the amount of unreliability the business is willing to tolerate. A 99.95% SLO gives you 21.6 minutes of total downtime per month. That budget is consumed by incidents, and when it runs low, the team shifts from feature development to reliability work. This is not a suggestion — it is an automated policy enforced by our platform tooling.
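The error-budget arithmetic is simple enough to verify inline:

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Total downtime an availability SLO permits over its rolling window.

    e.g. a 99.95% SLO over 30 days allows (1 - 0.9995) * 30 * 24 * 60
    = 21.6 minutes, matching the figure in the text.
    """
    return (1.0 - slo) * window_days * 24 * 60
```

Tracking the budget as a consumable quantity is what makes the "stop feature work" policy enforceable by tooling rather than by argument.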

When an incident occurs, the on-call engineer follows a structured response protocol. The first action is always to assess blast radius — which users are affected, which services are impacted, and what is the business impact. Classification into severity levels (SEV1 through SEV4) triggers the appropriate communication channels and escalation paths. SEV1 incidents (customer-facing service down) automatically page the incident commander, notify executive stakeholders, and create a dedicated war room. SEV4 incidents (minor degradation, no user impact) are handled asynchronously by the owning team.

Every incident, regardless of severity, results in a blameless postmortem. The word “blameless” is loaded, and we take it seriously. The postmortem focuses on systemic causes, not individual actions. Questions like “why did the system allow this to happen?” replace “who made this mistake?” Every postmortem produces concrete action items with owners and deadlines, and those action items are tracked in our reliability backlog with the same priority as feature work. We review postmortem action item completion rates monthly, and our target is 90% of items resolved within 30 days.

Infrastructure as Code: Reproducibility as a Feature

Every piece of infrastructure we operate is defined in code, reviewed in pull requests, and deployed through automated pipelines. There are zero exceptions. No ClickOps, no manual console changes, no “temporary” SSH sessions. This is not ideological purity — it is a pragmatic response to the reality that manual changes are the second largest source of production incidents (after deployments) and the hardest to diagnose because they leave no audit trail.

We use Terraform for cloud infrastructure provisioning with a strict module architecture. Every infrastructure component — VPCs, Kubernetes clusters, databases, message queues, CDN configurations — is encapsulated in a versioned Terraform module with well-defined inputs and outputs. Teams consume these modules through a service catalog, and the modules enforce organizational standards: encryption at rest, network policies, backup schedules, monitoring integration.

Kubernetes resources are managed through Helm charts with ArgoCD handling GitOps-style continuous delivery. When an engineer merges a change to the infrastructure repository, ArgoCD detects the drift between the desired state in Git and the actual state in the cluster, and reconciles automatically. This eliminates the entire class of “I forgot to apply that change” incidents and gives us a complete audit trail of every infrastructure change, who approved it, and when it was applied.

War Stories from Production

Theory is useful. Production experience is indispensable. Here are three incidents that shaped our current practices.

The Cascading Timeout

A single downstream service experienced increased latency, causing callers to hold connections open longer than expected. Thread pools saturated across six upstream services in under three minutes. The root cause was a missing circuit breaker on a non-critical dependency. The fix was straightforward — add circuit breakers with sensible timeouts — but the lesson was deeper: every external dependency, even optional ones, needs a failure isolation boundary. We now enforce circuit breaker configuration as a deployment gate. No service can deploy to production without defined timeout, retry, and circuit breaker policies for every outbound call.
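A minimal circuit breaker capturing the pattern might look like this. The thresholds and the consecutive-failure policy are illustrative; production libraries add half-open probe limits, per-endpoint state, and metrics:

```python
import time

class CircuitBreaker:
    """Open after `max_failures` consecutive failures, fail fast while
    open, and allow a probe ("half-open") after `reset_after` seconds."""

    def __init__(self, max_failures=5, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock          # injectable for testing
        self.failures = 0
        self.opened_at = None       # None = closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None   # half-open: let one probe through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()   # trip the breaker
            raise
        self.failures = 0           # success closes the loop
        return result
```

The value in the cascading-timeout incident would have been the fail-fast branch: saturated callers get an immediate error instead of holding a thread on a slow dependency.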

The Silent Data Corruption

A schema migration in a shared database silently changed column precision, causing financial calculations to round differently. The error was within tolerance for individual transactions but accumulated to material discrepancies over time. We detected it 11 days later through a reconciliation batch job. This incident drove two changes: first, we implemented continuous data validation checks that compare aggregated business metrics against expected ranges in real time. Second, we established a policy that every schema migration affecting numeric columns requires an explicit precision validation test before deployment.

The DNS Propagation Incident

During a planned migration between cloud regions, a DNS TTL misconfiguration caused 30% of traffic to route to the decommissioned region for 47 minutes. The region was healthy but running on reduced capacity, resulting in degraded performance rather than an outage. The lesson: every migration plan now includes explicit DNS TTL pre-warming, where we reduce TTLs to 60 seconds at least 48 hours before the migration window, and we validate TTL propagation across major resolver networks before proceeding with traffic shifts.
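Those two preconditions are straightforward to encode as a pre-flight gate. A sketch, assuming a plan that records when TTLs were lowered and a map of TTLs observed per resolver (the resolver addresses are illustrative):

```python
from datetime import datetime, timedelta

def dns_prewarm_ok(ttl_lowered_at: datetime, migration_start: datetime,
                   observed_ttls: dict, target_ttl: int = 60,
                   min_lead: timedelta = timedelta(hours=48)) -> bool:
    """Gate a traffic shift on the DNS preconditions from the incident:
    TTLs dropped to 60s at least 48 hours before the window, and the low
    TTL actually observed across resolvers (resolver name -> seconds)."""
    lead_ok = (migration_start - ttl_lowered_at) >= min_lead
    propagated = all(ttl <= target_ttl for ttl in observed_ttls.values())
    return lead_ok and propagated
```

Checking observed TTLs, not just the zone file, is the point: the original incident was caused by what resolvers had cached, not by what we had configured.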

By the Numbers

99.993%: trailing twelve-month uptime across all production services

4.2 min: mean time to detection for customer-impacting incidents

91%: reduction in deployment-caused incidents since implementing automated canary analysis

Reliability at scale is not a destination — it is a practice. Every incident teaches you something, every quarter brings new challenges, and the system you operate today will not be the system you operate next year. The organizations that maintain high availability over long periods are the ones that invest continuously in observability, deployment safety, incident response, and a culture where learning from failure is valued more than preventing it. Incidents will happen. The question is whether your organization gets better because of them.
