Client identity protected by NDA · Reference available upon request
LOGISTICS · 12 US METRO MARKETS

Fleet Orchestration & Route Intelligence
for a National Last-Mile Carrier

How we replaced a city-scale PHP monolith with a real-time Go orchestration platform handling 50K+ daily deliveries across 12 markets — with zero delivery disruptions during a 14-week live migration.

50K+
Daily Deliveries
12
Metro Markets
35%
Cost Reduction
99.97%
Platform Uptime
The Client

A fast-growing carrier that outgrew its infrastructure overnight

Our client is a national last-mile delivery carrier that expanded from 2 markets to 12 in under 18 months through a combination of organic growth and two regional acquisitions. What had been a manageable PHP-based dispatch system for a single city was now being force-fitted to coordinate 50K+ deliveries per day across a dozen metropolitan markets, each with different driver pools, SLA commitments, and regulatory requirements.

The engineering team had done heroic work keeping the legacy system alive. But by the time we were engaged, 70% of their sprint capacity was going to incident response and production firefighting. There was no realistic path to building the real-time optimization capabilities the business needed without first replacing the foundation.

The Challenge

Scaling logistics beyond manual coordination

The legacy system had no real-time optimization capability. Route planning ran as a nightly batch job, meaning the system was always solving for yesterday's conditions. When a driver called in sick, traffic patterns shifted, or a pickup was added mid-route, dispatchers were manually redistributing stops via phone and spreadsheet. The system had no awareness that anything had changed.

Scaling had exposed every architectural assumption in the original codebase. The PHP monolith used session-level database locking that caused cascading timeouts when concurrent route calculations exceeded 30 simultaneous operations. In a single-city environment, this was acceptable. Across 12 markets with different peak windows, it was a production incident waiting to happen — and often did.

Customer SLAs were slipping. Delivery costs were climbing because routes weren't optimized. And the engineering team had no capacity to build anything new. The constraint was absolute: the new platform had to be built and migrated while 50K+ deliveries continued every day on the existing system.

What Broke First

Three assumptions that didn't survive contact with production

The VRP solver was too slow for real-time re-optimization

Our initial route optimization approach used a full VRP solver for both initial route planning and real-time re-optimization events. In testing, re-optimization completed in 4–6 seconds. In production during morning peak, with 400+ concurrent re-optimization events, it was taking 45–90 seconds, long enough for conditions to change before the solution was delivered. We had to re-architect to a tiered approach: full VRP for initial planning, a faster greedy insertion algorithm for real-time re-optimization, and the full solver running asynchronously as a background improvement pass. The re-architecture took four weeks.

Market-specific regulatory data was incomplete at contract signing

We scoped regulatory requirements by gathering data from the client's ops team for each of the 12 markets. In practice, three markets had regulatory constraints — specific delivery window restrictions and building access rules — that the ops team wasn't aware of until drivers flagged them during the pilot. These required last-minute additions to the market configuration layer. The lesson: validate operational assumptions with drivers and dispatchers, not just management.

The strangler fig migration created a 3-week data consistency problem

During the strangler fig market migration, we ran both systems in parallel for each market before cutover. In Market 4, a race condition in the handoff logic caused a 72-hour period where delivery status updates from the new system were not propagating back to the legacy system's customer-facing tracking. We caught it before it affected SLAs, but it triggered an emergency architecture review and an unplanned event synchronization layer that took roughly three weeks to design, build, and validate.

Our Approach

Event-driven orchestration at city scale

The platform was built in Go for raw throughput and predictable latency under concurrent load. The orchestration engine consumes fleet events — driver check-ins, delivery confirmations, traffic updates, weather alerts, customer rescheduling — through Apache Kafka and processes them against real-time constraint solvers to dynamically rebalance routes across the active fleet.
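The consume-and-dispatch pattern described above can be sketched as follows. This is an illustrative model only: the event type names and `handleEvent` routing are assumptions, not the client's actual schema, and the channel stands in for what is a Kafka consumer group in production.

```go
package main

import "fmt"

// FleetEvent is a simplified envelope for the kinds of events the
// orchestration engine consumes. Field and type names are illustrative.
type FleetEvent struct {
	Type    string // e.g. "driver_checkin", "traffic_update", "delivery_confirmed"
	Market  string
	Payload string
}

// handleEvent routes each event to the appropriate reaction; in the
// real platform the source is Kafka, not an in-memory channel.
func handleEvent(ev FleetEvent) string {
	switch ev.Type {
	case "driver_checkin":
		return "assign pending stops in " + ev.Market
	case "traffic_update":
		return "trigger re-optimization in " + ev.Market
	case "delivery_confirmed":
		return "update SLA tracker in " + ev.Market
	default:
		return "log and ignore"
	}
}

func main() {
	// A buffered channel stands in for the Kafka topic.
	events := make(chan FleetEvent, 2)
	events <- FleetEvent{Type: "driver_checkin", Market: "austin"}
	events <- FleetEvent{Type: "traffic_update", Market: "denver", Payload: "i25-slowdown"}
	close(events)

	for ev := range events {
		fmt.Println(handleEvent(ev))
	}
}
```

The value of this shape is that every fleet signal, regardless of origin, flows through one ordered pipeline per market, which is what makes dynamic rebalancing tractable.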

The routing engine uses a tiered optimization approach: an initial full VRP solver for route planning at dispatch time, combined with a faster greedy insertion algorithm for real-time re-optimization events during the shift. A background optimization pass runs the full solver against active routes continuously, surfacing improvements that dispatchers can accept or override.
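The fast path of the tiered approach is cheapest-insertion: place the new stop wherever it adds the least extra travel, and let the background VRP pass clean up later. A minimal sketch, using straight-line distance and simplified 2D coordinates rather than the production cost model:

```go
package main

import (
	"fmt"
	"math"
)

// Stop is a delivery stop; coordinates are simplified to a 2D plane.
type Stop struct {
	ID   string
	X, Y float64
}

func dist(a, b Stop) float64 {
	return math.Hypot(a.X-b.X, a.Y-b.Y)
}

// routeCost sums leg distances along the stop sequence.
func routeCost(route []Stop) float64 {
	total := 0.0
	for i := 1; i < len(route); i++ {
		total += dist(route[i-1], route[i])
	}
	return total
}

// insertGreedy places a new stop at the position that adds the least
// extra distance. Position 0 (the depot/driver location) is never
// displaced.
func insertGreedy(route []Stop, s Stop) []Stop {
	bestPos, bestDelta := 1, math.Inf(1)
	for i := 1; i <= len(route); i++ {
		delta := dist(route[i-1], s)
		if i < len(route) {
			// Inserting mid-route replaces one existing leg with two.
			delta += dist(s, route[i]) - dist(route[i-1], route[i])
		}
		if delta < bestDelta {
			bestDelta, bestPos = delta, i
		}
	}
	out := make([]Stop, 0, len(route)+1)
	out = append(out, route[:bestPos]...)
	out = append(out, s)
	out = append(out, route[bestPos:]...)
	return out
}

func main() {
	route := []Stop{{"depot", 0, 0}, {"a", 2, 0}, {"b", 4, 0}}
	route = insertGreedy(route, Stop{"new", 3, 0})
	fmt.Println(routeCost(route)) // the new stop slots between a and b
}
```

Greedy insertion is O(n) per event rather than an NP-hard solve, which is what makes it viable for hundreds of concurrent mid-shift events; the asynchronous full-solver pass then recovers most of the solution quality the shortcut leaves on the table.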

Each market is isolated as an independent tenant within a shared Kubernetes cluster, with market-specific configuration for local routing constraints, delivery window regulations, and SLA thresholds. Migrating one market at a time using a strangler fig pattern — lowest volume first, highest volume last — meant that market-level failures during migration were contained and the risk was manageable at each step.
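Per-market configuration of the kind described above might look like the following. The struct and its fields are hypothetical, chosen to illustrate the shape of tenant isolation, not the client's actual schema:

```go
package main

import "fmt"

// MarketConfig holds the per-tenant settings that vary by metro:
// SLA thresholds, permitted delivery windows, and regulatory
// restrictions. All field names here are illustrative.
type MarketConfig struct {
	Market          string
	SLAMinutes      int      // delivery promise window
	DeliveryWindow  [2]int   // earliest and latest permitted hour
	RestrictedZones []string // regulatory no-delivery areas
}

// windowOpen reports whether deliveries are permitted at a given hour
// in this market.
func (c MarketConfig) windowOpen(hour int) bool {
	return hour >= c.DeliveryWindow[0] && hour < c.DeliveryWindow[1]
}

func main() {
	nyc := MarketConfig{
		Market:         "nyc",
		SLAMinutes:     120,
		DeliveryWindow: [2]int{8, 21},
	}
	fmt.Println(nyc.windowOpen(7), nyc.windowOpen(12)) // false true
}
```

Keeping these rules in configuration rather than code is also what made the late-discovered regulatory constraints from the pilot an additive change instead of a rewrite.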

The driver-facing React Native app provides turn-by-turn navigation, real-time package scanning, proof-of-delivery capture, and automatic schedule updates when routes are re-optimized mid-shift. Dispatchers get a command-center dashboard with live fleet visualization and one-click manual override for edge cases the algorithm can't handle.

Architecture

Distributed systems, deterministic outcomes

Tiered Route Optimization

Full VRP solver for initial planning, greedy insertion for real-time re-optimization, and a background improvement pass — balancing solution quality with responsiveness at scale.

Multi-Market Tenancy

Isolated market configurations within shared Kubernetes clusters. Region-specific routing constraints, compliance rules, and independent SLA thresholds per market.

Event-Driven Backbone

Apache Kafka for all fleet events with exactly-once semantics and full delivery lifecycle audit trail. gRPC for low-latency service mesh communication across 20+ microservices.

Full Observability Stack

Prometheus metrics, Grafana dashboards, and distributed tracing across every delivery lifecycle event. Automated SLA breach alerting with sub-minute detection.

Results

Faster deliveries, lower costs, freed engineers

50K+ Deliveries/Day

Peak throughput across all 12 active markets with real-time status tracking, dynamic re-routing, and sub-minute SLA breach detection.

12 Metro Markets

Each market runs as an isolated tenant with region-specific routing constraints, local compliance rules, and independent SLA thresholds.

35% Cost Reduction

Operational logistics costs reduced through AI-powered route optimization, predictive load balancing, and dynamic fleet allocation — validated against 12-month pre-migration baseline.

99.97% Uptime

Production availability maintained through multi-region Kubernetes clusters with automated failover, circuit breakers, and graceful degradation under market-specific event spikes.

The 14-week strangler fig migration completed with zero delivery disruptions. Markets were migrated sequentially — starting with the lowest-volume region and finishing with the highest — with automated reconciliation running throughout to confirm delivery status parity between old and new systems before each market was cut over.
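The essence of that reconciliation is a parity diff between the two systems' delivery status snapshots. A minimal sketch, assuming status maps keyed by delivery ID (the real pipeline would also handle deliveries present in only one system):

```go
package main

import "fmt"

// diffStatuses compares delivery status snapshots from the legacy and
// new systems and returns the IDs whose statuses disagree -- a
// simplified version of the parity check gating each market cutover.
func diffStatuses(legacy, next map[string]string) []string {
	var mismatched []string
	for id, status := range legacy {
		if next[id] != status {
			mismatched = append(mismatched, id)
		}
	}
	return mismatched
}

func main() {
	legacy := map[string]string{"d1": "delivered", "d2": "in_transit"}
	next := map[string]string{"d1": "delivered", "d2": "delivered"}
	fmt.Println(diffStatuses(legacy, next)) // only d2 disagrees
}
```

A cutover proceeds only when the diff has been empty for the required observation window; any mismatch holds the market on the legacy system.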

Within 90 days of full deployment, the 35% cost reduction was validated against the 12-month pre-migration baseline. The engineering team reclaimed 60% of their sprint capacity from incident response back to feature development — the metric the client's CTO cited first when asked about the project's success.

Key Technical Decisions

  • Tiered optimization over a single VRP solver — the key to achieving sub-second re-optimization in production at scale
  • Strangler fig migration over a big-bang cutover — the only approach that didn't require a delivery disruption window
  • Go over Node.js for the orchestration core — the throughput and goroutine concurrency model handled fleet event bursts better under load testing
  • Market-level tenant isolation — contained migration failures to individual markets and prevented cross-market incidents
Technology

The stack

Go · Kubernetes · Apache Kafka · PostgreSQL · React Native · Redis · gRPC · Prometheus · Grafana · Terraform · AWS

Reference Available Upon Request

This client is referenceable. We can arrange a direct conversation with their VP of Engineering or Head of Operations for qualified enterprise prospects under mutual NDA.

Request a Reference

Scaling your logistics?
Let's talk.

We build distributed systems that handle real-world operational complexity at scale. Tell us what you're working with.