Observability & SRE

Beyond Monitoring: Building an Observability-First Engineering Culture

LockedIn Labs Engineering Team · January 8, 2026 · 10 min read

Most engineering organizations confuse monitoring with observability. They set up dashboards, configure alerts on CPU and memory, and call it done. Then an incident happens — latency spikes for 10% of users, but only in one region, only for requests that touch a specific downstream service, only when the payload exceeds a certain size. The dashboards are green. The alerts did not fire. The team spends four hours in a war room manually querying logs, guessing at hypotheses, and deploying debug builds to production. This is the cost of monitoring without observability.

Observability is the ability to understand the internal state of your system by examining its external outputs. Monitoring tells you what is broken. Observability lets you ask why it is broken, even when you could not have predicted the failure mode in advance. The distinction matters because modern distributed systems fail in novel, combinatorial ways that no predefined dashboard can anticipate. At LockedIn Labs, we build observability into every system from day one. This article shares the architecture, tooling, and cultural practices that make observability work at scale.

OpenTelemetry: The Instrumentation Standard

OpenTelemetry has won the instrumentation war. It is the second most active CNCF project after Kubernetes, with SDKs for every major language and integrations with every major observability backend. If you are starting a new observability initiative in 2026, OpenTelemetry is not optional — it is the foundation. The value proposition is vendor-neutral instrumentation: you instrument your code once, and you can send telemetry to any backend — Datadog, Grafana Cloud, Honeycomb, New Relic, or your own open-source stack. This eliminates vendor lock-in and lets you switch backends without touching application code.

We instrument at three levels. Automatic instrumentation captures the basics without code changes: HTTP requests, database queries, gRPC calls, message queue operations. This gets you 70% of the value with zero effort. Manual instrumentation adds business context: which user made this request, which feature flag was active, which experiment variant was selected, what was the business outcome. This is where observability becomes powerful — you can slice telemetry by business dimensions, not just technical ones. Custom metrics capture domain-specific measurements: order processing time, ML model inference latency, payment gateway response distribution. These feed into SLOs and business dashboards.

OpenTelemetry manual instrumentation example (Node.js)

import { trace, metrics, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('order-service');
const meter = metrics.getMeter('order-service');

const orderDuration = meter.createHistogram('order.processing.duration', {
  description: 'Time to process an order end-to-end',
  unit: 'ms',
});

async function processOrder(order: Order) {
  return tracer.startActiveSpan('processOrder', async (span) => {
    const start = Date.now();

    // Add business context as span attributes
    span.setAttribute('order.id', order.id);
    span.setAttribute('order.total', order.total);
    span.setAttribute('order.item_count', order.items.length);
    span.setAttribute('customer.tier', order.customer.tier);

    try {
      await validateInventory(order);
      span.addEvent('inventory_validated');

      await processPayment(order);
      span.addEvent('payment_processed');

      await fulfillOrder(order);
      span.addEvent('order_fulfilled');

      span.setStatus({ code: SpanStatusCode.OK });
    } catch (error) {
      // In strict TypeScript, a caught value is `unknown` — narrow before use
      const err = error instanceof Error ? error : new Error(String(error));
      span.setStatus({ code: SpanStatusCode.ERROR, message: err.message });
      span.recordException(err);
      throw error;
    } finally {
      orderDuration.record(Date.now() - start, {
        'customer.tier': order.customer.tier,
        'order.type': order.type,
      });
      span.end();
    }
  });
}

Distributed Tracing: Following Requests Across Service Boundaries

In a monolithic application, a stack trace tells you everything you need to know about a request’s journey. In a distributed system with dozens of services, the stack trace is useless — it only shows what happened within one service. Distributed tracing solves this by propagating a trace context across every service boundary. A single user request generates a trace that spans the API gateway, the authentication service, the business logic service, the database, the cache, the message queue, and the notification service. Every operation is a span within that trace, forming a tree that shows exactly what happened, in what order, and how long each step took.
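The propagation itself is handled by the OpenTelemetry SDK, but the wire format is worth understanding because it is what you will see in request headers when debugging. A minimal sketch of the W3C `traceparent` header, the format OpenTelemetry propagates by default (the helper names and the ids in the comments are illustrative, not part of any SDK):

```typescript
// W3C Trace Context traceparent header: version-traceId-spanId-flags
// e.g. "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
// 32 hex chars of trace id, 16 of parent span id, "01" = sampled.

function buildTraceparent(traceId: string, spanId: string, sampled: boolean): string {
  const flags = sampled ? "01" : "00";
  return `00-${traceId}-${spanId}-${flags}`;
}

function parseTraceparent(
  header: string
): { traceId: string; spanId: string; sampled: boolean } | null {
  const parts = header.split("-");
  // Reject anything that does not match version 00's field widths
  if (parts.length !== 4 || parts[1].length !== 32 || parts[2].length !== 16) return null;
  return { traceId: parts[1], spanId: parts[2], sampled: parts[3] === "01" };
}
```

Every service that receives this header continues the same trace; every service that drops it starts a disconnected one, which is the most common cause of "missing" spans in a trace tree.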

The critical engineering decision with tracing is the sampling strategy. Capturing 100% of traces is prohibitively expensive at scale — a service handling 10,000 requests per second generates terabytes of trace data per day. But sampling randomly means you might miss the exact trace that would have explained today’s incident. We use a multi-tier sampling strategy: 100% capture for errors and slow requests (above the p99 latency threshold), 100% capture for requests matching debug criteria (specific user IDs, feature flags, experiment variants), 10% random sampling for baseline performance data, and adaptive sampling that increases the rate when anomaly detection triggers.
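A head-based version of that multi-tier policy can be sketched as a single decision function. This is illustrative, not SDK code — the thresholds, the debug user set, and the `RequestTelemetry` shape are assumptions standing in for whatever your service actually records:

```typescript
interface RequestTelemetry {
  isError: boolean;
  durationMs: number;
  userId?: string;
}

// Illustrative policy values mirroring the tiers described above
const P99_THRESHOLD_MS = 1000;
const DEBUG_USER_IDS = new Set(["user-42"]); // hypothetical debug criteria
const BASELINE_RATE = 0.1;

function shouldSample(t: RequestTelemetry, rand: () => number = Math.random): boolean {
  if (t.isError) return true;                               // 100% of errors
  if (t.durationMs > P99_THRESHOLD_MS) return true;         // 100% of slow requests
  if (t.userId && DEBUG_USER_IDS.has(t.userId)) return true; // 100% of debug matches
  return rand() < BASELINE_RATE;                            // 10% random baseline
}
```

Injecting the random source as a parameter keeps the policy deterministic under test, which matters once the sampling rules themselves become part of your incident-response guarantees.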

Tail-based sampling, where the sampling decision is made after the trace is complete rather than at the head, is the gold standard but requires more infrastructure. The OpenTelemetry Collector supports tail-based sampling through its tail sampling processor. We deploy the collector as a gateway service that receives all spans, holds them in memory for 30 seconds, and then makes an informed sampling decision based on the complete trace. If any span in the trace has an error status, high latency, or matches a debug rule, the entire trace is kept. This ensures you never lose a trace that contains useful debugging information.
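A sketch of what that gateway configuration looks like in the Collector. The keys follow the tail sampling processor's schema, but the thresholds and policy names here are illustrative rather than our production values:

```yaml
processors:
  tail_sampling:
    decision_wait: 30s            # hold spans until the trace is complete
    policies:
      - name: keep-errors         # any error span keeps the whole trace
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: keep-slow           # latency above threshold keeps the trace
        type: latency
        latency: {threshold_ms: 1000}
      - name: baseline            # random sample of everything else
        type: probabilistic
        probabilistic: {sampling_percentage: 10}
```

Policies are evaluated per trace, and a trace is kept if any policy matches, which is exactly the "never lose a useful trace" property described above.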

SLOs vs SLAs: Measuring What Matters to Users

Service Level Agreements are contractual obligations to customers — they define the penalty when you fail. Service Level Objectives are internal targets that your engineering team operates against — they define the standard you aim for. The SLO should always be stricter than the SLA, creating a buffer that lets you detect and fix degradation before it becomes a contractual breach. If your SLA promises 99.9% availability, your SLO should target 99.95% or higher. Your error budget is the unreliability the SLO itself permits — for a 99.95% target, 0.05% of requests over the window — and the gap between the SLO and the SLA is the safety margin that gives you time to respond before customers are contractually impacted.
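The arithmetic is worth internalizing. A quick sketch of what a 99.95% availability target allows over a 30-day window, expressed as minutes of total unavailability (the function name is ours, not a library's):

```typescript
// Error budget = 1 - SLO target.
// A 30-day window has 30 * 24 * 60 = 43,200 minutes.
function errorBudgetMinutes(sloTarget: number, windowDays: number): number {
  const windowMinutes = windowDays * 24 * 60;
  return (1 - sloTarget) * windowMinutes;
}

// errorBudgetMinutes(0.9995, 30) ≈ 21.6 minutes (the 99.95% SLO)
// errorBudgetMinutes(0.999, 30)  ≈ 43.2 minutes (the 99.9% SLA)
```

The gap between those two numbers — roughly 21 minutes per month — is how much runway the stricter SLO buys you to detect and fix a degradation before the SLA is breached.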

The most common mistake we see is defining SLOs around infrastructure metrics rather than user experience. CPU utilization, memory usage, and disk IOPS are interesting to operations teams but meaningless to users. Users care about whether the page loaded, whether the API returned the correct result, and whether it happened fast enough. Good SLOs are defined in terms of user-facing behavior: the percentage of requests that return a successful response within 200 milliseconds, the percentage of checkout flows that complete without error, the percentage of search queries that return results within one second.

SLO Framework: Our Standard Template

Availability SLO: 99.95% of requests return a non-5xx response, measured over a rolling 30-day window

Latency SLO: p50 under 100ms, p95 under 500ms, p99 under 1,000ms, measured per endpoint

Correctness SLO: 99.99% of responses return the correct result, validated through continuous synthetic testing

Freshness SLO: data pipeline end-to-end latency under 5 minutes for 99% of records, measured at the consumer

Alerting Strategy: Reducing Noise, Increasing Signal

Alert fatigue is the silent killer of incident response. When engineers receive hundreds of alerts per week, they stop paying attention. Critical alerts get lost in the noise. Pages go unacknowledged. Incidents that should have been caught in minutes are discovered hours later when customers complain. The solution is not better alert routing — it is fewer, better alerts.

Our alerting philosophy is built on one principle: every alert should be actionable. If an alert fires and the on-call engineer cannot take a meaningful action in response — if the correct response is to watch and wait, or to check a dashboard and confirm everything is fine — then the alert should not exist. We categorize alerts into three tiers. Tier 1 pages are customer-impacting incidents that require immediate human intervention: the service is down, data is being corrupted, a security breach is in progress. These page the on-call engineer with an escalation path. Tier 2 notifications are degradations that need attention within business hours: SLO burn rate is elevated, a canary deployment is showing higher error rates, disk utilization is approaching threshold. These create tickets and send Slack notifications. Tier 3 informational alerts are logged but do not notify anyone — they exist for post-incident analysis and trend tracking.

SLO-based alerting with multi-window burn rate

# Alert when SLO error budget is being consumed too quickly
# Multi-window burn rate catches both fast and slow burns

groups:
  - name: slo-alerts
    rules:
      # Fast burn: 2% of 30-day budget in 1 hour (page immediately)
      - alert: SLOBudgetFastBurn
        expr: |
          (
            sum(rate(http_requests_total{code=~"5.."}[1h]))
            / sum(rate(http_requests_total[1h]))
          ) > (14.4 * 0.0005)
          and
          (
            sum(rate(http_requests_total{code=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m]))
          ) > (14.4 * 0.0005)
        labels:
          severity: page
          tier: "1"
        annotations:
          summary: "High error rate burning SLO budget rapidly"

      # Slow burn: 5% of 30-day budget in 6 hours (ticket)
      # Paired with a 30m window so the alert resolves promptly once fixed
      - alert: SLOBudgetSlowBurn
        expr: |
          (
            sum(rate(http_requests_total{code=~"5.."}[6h]))
            / sum(rate(http_requests_total[6h]))
          ) > (6 * 0.0005)
          and
          (
            sum(rate(http_requests_total{code=~"5.."}[30m]))
            / sum(rate(http_requests_total[30m]))
          ) > (6 * 0.0005)
        labels:
          severity: ticket
          tier: "2"

Dashboards: Designing for Incident Response

Dashboards are not decorations. They are tools for making decisions under pressure. During an incident, the on-call engineer needs to answer three questions within sixty seconds: is this real, what is the impact, and where should I look next? Every dashboard should be designed to answer these questions for its domain.

We use a three-layer dashboard hierarchy. The top layer is the service overview — a single dashboard per service that shows the RED metrics (Rate, Errors, Duration), the SLO status, and the current error budget. This is what the on-call engineer opens first. The middle layer is the operational dashboard — deeper metrics for each service component: database query performance, cache hit rates, downstream dependency health, queue depths. This is where the engineer drills down after identifying which service is affected. The bottom layer is the debug dashboard — detailed metrics, recent traces, and log queries for specific subsystems. This is where the engineer goes to understand the root cause.

Dashboard design principles matter more than tooling. Keep the information density high — remove every panel that is not used during incidents. Use consistent color coding — red means bad, green means good, across every dashboard. Use consistent time ranges — if one panel shows the last hour and the adjacent panel shows the last 24 hours, the engineer will waste time reconciling timelines during a stressful incident. Link dashboards to each other and to traces — clicking on a spike in the error rate panel should navigate to the relevant traces. These details seem minor, but they compound during an incident when every minute of diagnosis time matters.

Building the Culture: Observability as an Engineering Practice

The hardest part of observability is not the technology — it is the culture change. Observability requires every engineer to think about operability from the start of feature development, not as an afterthought before release. Every new endpoint needs instrumentation. Every new feature needs SLO coverage. Every new integration needs health checks and circuit breakers. This only happens when observability is treated as a first-class engineering practice, not a task delegated to the ops team.

We embed observability into the development workflow through three mechanisms. First, observability is part of the definition of done for every feature. A feature without instrumentation, dashboards, and SLO coverage is not complete, regardless of whether the code works. Second, we run regular game days where we inject failures into production-like environments and have teams practice incident response. This builds muscle memory for using observability tools under pressure and reveals gaps in instrumentation before real incidents expose them. Third, we conduct blameless post-incident reviews for every significant event, and the review always includes an assessment of observability — did we have the data we needed to diagnose the issue? If not, what instrumentation would have helped? The improvement actions from these reviews feed directly into the next sprint.

Observability is not a destination — it is a practice that improves continuously. Every incident teaches you something about the gaps in your instrumentation. Every new feature introduces new failure modes that need coverage. The organizations that maintain reliable, performant systems at scale are the ones that invest in observability as a core engineering discipline, not a compliance checkbox. The cost of building it is significant. The cost of not having it when you need it is catastrophic.

Ready to build observability into your platform?

Our SRE team can design and implement an observability stack tailored to your architecture in weeks.