Make failures visible. Make recovery safe.
I build event-driven systems, streaming pipelines, and control-plane services that stay diagnosable under partial failure and can be replayed without corrupting data.
Currently working on
- Reliability under partial failure and dependency instability
- Safe recovery: DLQ strategy, quarantine, and replay runbooks
- Low-noise telemetry with correlation IDs and traces
- Game Days and readiness checks that keep systems honest
Core focus areas
Resilience engineering for distributed systems: failure-mode-driven design, bounded retries, and safe recovery paths.
Diagnosability by design
Signals that match real failure modes: end-to-end correlation, traces and structured events, plus low-noise alerts and runbooks that shorten time-to-diagnosis.
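As a minimal sketch of what end-to-end correlation can look like in practice (assuming a Go HTTP service, Go 1.21's log/slog, and the github.com/google/uuid module; the header and helper names are illustrative, not a prescribed API):

```go
package middleware

import (
	"context"
	"log/slog"
	"net/http"

	"github.com/google/uuid"
)

type ctxKey string

const correlationIDKey ctxKey = "correlation_id"

// WithCorrelationID reuses an inbound X-Correlation-ID header or mints a new
// one, stores it on the request context, and echoes it back to the caller so
// every log line and downstream call can carry the same identifier.
func WithCorrelationID(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		id := r.Header.Get("X-Correlation-ID")
		if id == "" {
			id = uuid.NewString()
		}
		ctx := context.WithValue(r.Context(), correlationIDKey, id)
		w.Header().Set("X-Correlation-ID", id)
		next.ServeHTTP(w, r.WithContext(ctx))
	})
}

// Logger returns a structured logger pre-tagged with the correlation ID, so
// events from one request line up across services and alert queries stay simple.
func Logger(ctx context.Context) *slog.Logger {
	id, _ := ctx.Value(correlationIDKey).(string)
	return slog.Default().With("correlation_id", id)
}
```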
Resilient workflows and pipelines
Retries, backoff, and backpressure designed around dependency behavior so partial failures don’t turn into cascades.
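A sketch of the shape this takes (not a drop-in library): bounded retries with capped exponential backoff and full jitter in Go. The names and parameters are illustrative, and it assumes base and maxDelay are positive.

```go
package retry

import (
	"context"
	"errors"
	"math/rand"
	"time"
)

// Do runs fn up to attempts times with capped exponential backoff and full
// jitter, and stops early when the context is cancelled so callers can keep
// retries inside an overall latency budget. base and maxDelay must be > 0.
func Do(ctx context.Context, attempts int, base, maxDelay time.Duration, fn func(context.Context) error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = fn(ctx); err == nil {
			return nil
		}
		// Full jitter: sleep a random duration in [0, min(maxDelay, base*2^i)).
		delay := base << i
		if delay > maxDelay || delay <= 0 { // <= 0 guards against shift overflow
			delay = maxDelay
		}
		select {
		case <-time.After(time.Duration(rand.Int63n(int64(delay)))):
		case <-ctx.Done():
			return errors.Join(err, ctx.Err())
		}
	}
	return err
}
```

Jitter is what keeps a fleet of clients from retrying in lockstep; the context check is what keeps retries inside the caller's latency budget rather than piling load onto a struggling dependency.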
Safe recovery design
Replay and reprocessing with guardrails: idempotency, quarantine, and verification steps that prevent corruption.
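One of those guardrails, idempotency keys, can be sketched like this in Go. Store is a stand-in for whatever atomic check-and-set the pipeline already has (a unique index, Redis SETNX, a conditional put); the names are illustrative.

```go
package replay

import (
	"context"
	"fmt"
)

// Store is whatever dedup storage the pipeline already uses. SetIfAbsent must
// be atomic: exactly one caller per key sees firstTime == true.
type Store interface {
	SetIfAbsent(ctx context.Context, key string) (firstTime bool, err error)
}

// ApplyOnce runs the side effect only if this idempotency key has never been
// seen, so replaying the same message cannot double-apply it.
func ApplyOnce(ctx context.Context, store Store, idempotencyKey string, apply func(context.Context) error) error {
	first, err := store.SetIfAbsent(ctx, idempotencyKey)
	if err != nil {
		return fmt.Errorf("dedup check: %w", err)
	}
	if !first {
		return nil // already applied; replay is a no-op
	}
	return apply(ctx)
}
```

A real implementation also has to handle the crash window between marking the key and applying the effect, typically by recording the key and the outcome in the same transaction as the side effect.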
Outcomes from production systems
Selected writing
Short notes on retries, tail latency, safe replay, and operability design.
Technical notes
Operability First: Policy, Not Hope
Throughput is technical. Operability is sociotechnical. Treat retries and replay as a control plane with explicit policy, bounded failure, and safe recovery.
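One way to read "policy, not hope": the retry/replay contract lives in data that can be reviewed and observed, not in constants scattered through handlers. A hypothetical Go shape, not the post's actual schema:

```go
package policy

import "time"

// RetryPolicy makes the retry/replay contract for one dependency explicit and
// reviewable, instead of living as magic numbers at each call site.
type RetryPolicy struct {
	MaxAttempts  int           // bounded failure: stop, then dead-letter
	BaseBackoff  time.Duration // starting delay for exponential backoff
	MaxBackoff   time.Duration // cap so worst-case delays stay predictable
	RetryBudget  float64       // e.g. 0.1 == at most 10% extra load from retries
	DeadLetterTo string        // where exhausted work goes for safe replay later
}

// PaymentsAPI is an example of policy-as-data: changed in a reviewed PR,
// exported to telemetry at runtime, enforced in one place.
var PaymentsAPI = RetryPolicy{
	MaxAttempts:  4,
	BaseBackoff:  100 * time.Millisecond,
	MaxBackoff:   5 * time.Second,
	RetryBudget:  0.1,
	DeadLetterTo: "payments-dlq",
}
```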
Safe DLQ replay checklist
A practical runbook for replaying dead-letter messages without corrupting data or melting dependencies, with SQS/SNS and Kafka appendices.
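The checklist's guardrails reduce to a small amount of code. A hedged Go skeleton, with hypothetical Source and quarantine hooks standing in for the real SQS or Kafka clients, and assuming perSecond is positive:

```go
package dlq

import (
	"context"
	"time"
)

// Message and Source stand in for whatever the real queue client provides;
// the guardrails (rate limit, re-quarantine, clean abort) are the point.
type Message struct {
	ID   string
	Body []byte
}

type Source interface {
	Next(ctx context.Context) (Message, bool)
}

// Replay drains a dead-letter source at a fixed rate, re-quarantines anything
// that still fails, and stops cleanly on cancellation so an operator can abort.
func Replay(ctx context.Context, src Source, perSecond int,
	process func(context.Context, Message) error,
	quarantine func(context.Context, Message, error) error) error {

	tick := time.NewTicker(time.Second / time.Duration(perSecond))
	defer tick.Stop()

	for {
		msg, ok := src.Next(ctx)
		if !ok {
			return nil // DLQ drained
		}
		select {
		case <-tick.C:
		case <-ctx.Done():
			return ctx.Err()
		}
		if err := process(ctx, msg); err != nil {
			// No tight retry loop here: park the message again with the error
			// attached so a human can look before the next replay pass.
			if qErr := quarantine(ctx, msg, err); qErr != nil {
				return qErr
			}
		}
	}
}
```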
Why redress
My retry philosophy: classification-first, bounded unknowns, capped exponential backoff, and observability hooks.
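The "classification-first" part can be sketched as a decision made before any backoff math runs. The categories below follow the description above; they are not any particular library's API.

```go
package classify

import (
	"context"
	"errors"
	"net/http"
)

type Decision int

const (
	Retry    Decision = iota // transient: worth another bounded attempt
	GiveUp                   // terminal: retrying cannot help
	RetryFew                 // unknown: allow only a small, bounded allowance
)

// Classify is called after a failed attempt, before any backoff is scheduled.
// Unknown errors are not retried forever "just in case"; they get a bounded few.
func Classify(err error, status int) Decision {
	switch {
	case errors.Is(err, context.DeadlineExceeded),
		status == http.StatusTooManyRequests,
		status >= 500:
		return Retry
	case status >= 400 && status < 500:
		return GiveUp // the request itself is wrong; resending won't fix it
	default:
		return RetryFew
	}
}
```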
Why recourse
Policy-driven resilience for Go services: consistent retries, explicit backpressure budgets, hedging, and circuit breaking, with explainable observability.
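As an illustration of the circuit-breaking piece (a minimal sketch, not recourse's actual API): trip after consecutive failures, fail fast during a cool-down, then let a probe through.

```go
package breaker

import (
	"errors"
	"sync"
	"time"
)

var ErrOpen = errors.New("circuit open: failing fast")

// Breaker trips after threshold consecutive failures and fails fast until the
// cool-down elapses, turning a struggling dependency into quick, explainable
// errors instead of queued-up latency. Concurrent probes are allowed; a real
// implementation would usually gate them.
type Breaker struct {
	mu        sync.Mutex
	failures  int
	threshold int
	cooldown  time.Duration
	openedAt  time.Time
}

func New(threshold int, cooldown time.Duration) *Breaker {
	return &Breaker{threshold: threshold, cooldown: cooldown}
}

func (b *Breaker) Call(fn func() error) error {
	b.mu.Lock()
	if b.failures >= b.threshold && time.Since(b.openedAt) < b.cooldown {
		b.mu.Unlock()
		return ErrOpen
	}
	b.mu.Unlock()

	err := fn()

	b.mu.Lock()
	defer b.mu.Unlock()
	if err != nil {
		b.failures++
		if b.failures >= b.threshold {
			b.openedAt = time.Now() // re-open (or stay open) on each new failure
		}
		return err
	}
	b.failures = 0 // a success closes the breaker
	return nil
}
```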
Musings
The Monolith and the Swarm
How Stories Bind Us and What Happens When They Break
Made me treat tail latency as a first-class failure mode: retries must be load-aware and latency-aware.
Reinforced that backoff isn’t just math; it’s a control system where caps, jitter, and budgets matter.
A concrete demonstration of why naïve exponential backoff without jitter synchronizes clients and creates retry storms.
Error budgets and toil aren’t management slogans; they’re constraints that shape system design and operations.
Tools I maintain and contribute to
Principal engineering in high-throughput, high-assurance environments
I’ve worked on systems where resilience is a product feature: latency-sensitive workflows, streaming pipelines, and distributed control planes in fintech and defense contexts.
What I’ve shipped
- Latency-sensitive gRPC workflows in fintech environments
- Payment and streaming pipelines with replay safety
- Control-plane reconciliation loops and policy enforcement
- Partition-tolerant workflows for intermittent networks
How I work
- Define failure modes first, then instrument to observe them
- Prefer small reversible changes over risky refactors
- Write the runbook before the incident happens
- Validate recovery with drills, not hope
Depth across systems, platforms, and resilience patterns
Systems and architecture
- Event-driven workflows and streaming ingestion
- Control-plane reconciliation loops
- Idempotency design and replay safety for side effects
- Alert noise control and signal hygiene
Tools and languages
- AWS (Lambda, SQS, Step Functions, EventBridge)
- Kubernetes/EKS, Gateway API, Envoy
- OpenTelemetry, Prometheus, Grafana
- Go, Python, SQL
Let’s connect
If you’re operating event-driven systems and thinking about retries, DLQs, safe replay, or keeping telemetry quiet and useful, I’m always happy to compare notes.