Make failures visible. Make recovery safe.
I build event-driven systems, streaming pipelines, and control-plane services that stay diagnosable under partial failure and can be replayed without corrupting data.
Currently working on
- Reliability under partial failure and dependency instability
- Safe recovery: DLQ strategy, quarantine, and replay runbooks
- Low-noise telemetry with correlation IDs and traces
- Game Days and readiness checks that keep systems honest
Core focus areas
Resilience engineering for distributed systems: failure-mode-driven design, bounded retries, and safe recovery paths.
Diagnosability by design
Signals that match real failure modes: end-to-end correlation, traces and structured events, plus low-noise alerts and runbooks that shorten time-to-diagnosis.
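As a minimal sketch of what end-to-end correlation can look like in practice (assuming a Go HTTP service, Go 1.21's log/slog, and the github.com/google/uuid module; the header and helper names are illustrative, not a prescribed API):

```go
package middleware

import (
	"context"
	"log/slog"
	"net/http"

	"github.com/google/uuid"
)

type ctxKey string

const correlationIDKey ctxKey = "correlation_id"

// WithCorrelationID reuses an inbound X-Correlation-ID header or mints a new
// one, stores it on the request context, and echoes it back to the caller so
// every log line and downstream call can carry the same identifier.
func WithCorrelationID(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		id := r.Header.Get("X-Correlation-ID")
		if id == "" {
			id = uuid.NewString()
		}
		ctx := context.WithValue(r.Context(), correlationIDKey, id)
		w.Header().Set("X-Correlation-ID", id)
		next.ServeHTTP(w, r.WithContext(ctx))
	})
}

// Logger returns a structured logger pre-tagged with the correlation ID, so
// events from one request line up across services and alert queries stay simple.
func Logger(ctx context.Context) *slog.Logger {
	id, _ := ctx.Value(correlationIDKey).(string)
	return slog.Default().With("correlation_id", id)
}
```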
Resilient workflows and pipelines
Retries, backoff, and backpressure designed around dependency behavior so partial failures don’t turn into cascades.
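A sketch of the shape this takes (not a drop-in library): bounded retries with capped exponential backoff and full jitter in Go. The names and parameters are illustrative, and it assumes base and maxDelay are positive.

```go
package retry

import (
	"context"
	"errors"
	"math/rand"
	"time"
)

// Do runs fn up to attempts times with capped exponential backoff and full
// jitter, and stops early when the context is cancelled so callers can keep
// retries inside an overall latency budget. base and maxDelay must be > 0.
func Do(ctx context.Context, attempts int, base, maxDelay time.Duration, fn func(context.Context) error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = fn(ctx); err == nil {
			return nil
		}
		// Full jitter: sleep a random duration in [0, min(maxDelay, base*2^i)).
		delay := base << i
		if delay > maxDelay || delay <= 0 { // <= 0 guards against shift overflow
			delay = maxDelay
		}
		select {
		case <-time.After(time.Duration(rand.Int63n(int64(delay)))):
		case <-ctx.Done():
			return errors.Join(err, ctx.Err())
		}
	}
	return err
}
```

Jitter is what keeps a fleet of clients from retrying in lockstep; the context check is what keeps retries inside the caller's latency budget rather than piling load onto a struggling dependency.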
Safe recovery design
Replay and reprocessing with guardrails: idempotency, quarantine, and verification steps that prevent corruption.
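One of those guardrails, idempotency keys, can be sketched like this in Go. Store is a stand-in for whatever atomic check-and-set the pipeline already has (a unique index, Redis SETNX, a conditional put); the names are illustrative.

```go
package replay

import (
	"context"
	"fmt"
)

// Store is whatever dedup storage the pipeline already uses. SetIfAbsent must
// be atomic: exactly one caller per key sees firstTime == true.
type Store interface {
	SetIfAbsent(ctx context.Context, key string) (firstTime bool, err error)
}

// ApplyOnce runs the side effect only if this idempotency key has never been
// seen, so replaying the same message cannot double-apply it.
func ApplyOnce(ctx context.Context, store Store, idempotencyKey string, apply func(context.Context) error) error {
	first, err := store.SetIfAbsent(ctx, idempotencyKey)
	if err != nil {
		return fmt.Errorf("dedup check: %w", err)
	}
	if !first {
		return nil // already applied; replay is a no-op
	}
	return apply(ctx)
}
```

A real implementation also has to handle the crash window between marking the key and applying the effect, typically by recording the key and the outcome in the same transaction as the side effect.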
Outcomes from production systems
Selected writing
Short notes on retries, tail latency, safe replay, and operability design.
Technical notes
Operability First: Policy, Not Hope
Throughput is technical. Operability is sociotechnical. Treat retries and replay as a control plane with explicit policy, bounded failure, and safe recovery.
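One way to read "policy, not hope": the retry/replay contract lives in data that can be reviewed and observed, not in constants scattered through handlers. A hypothetical Go shape, not the post's actual schema:

```go
package policy

import "time"

// RetryPolicy makes the retry/replay contract for one dependency explicit and
// reviewable, instead of living as magic numbers at each call site.
type RetryPolicy struct {
	MaxAttempts  int           // bounded failure: stop, then dead-letter
	BaseBackoff  time.Duration // starting delay for exponential backoff
	MaxBackoff   time.Duration // cap so worst-case delays stay predictable
	RetryBudget  float64       // e.g. 0.1 == at most 10% extra load from retries
	DeadLetterTo string        // where exhausted work goes for safe replay later
}

// PaymentsAPI is an example of policy-as-data: changed in a reviewed PR,
// exported to telemetry at runtime, enforced in one place.
var PaymentsAPI = RetryPolicy{
	MaxAttempts:  4,
	BaseBackoff:  100 * time.Millisecond,
	MaxBackoff:   5 * time.Second,
	RetryBudget:  0.1,
	DeadLetterTo: "payments-dlq",
}
```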
Safe DLQ replay checklist
A practical runbook for replaying dead-letter messages without corrupting data or melting dependencies, with SQS/SNS and Kafka appendices.
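The checklist's guardrails reduce to a small amount of code. A hedged Go skeleton, with hypothetical Source and quarantine hooks standing in for the real SQS or Kafka clients, and assuming perSecond is positive:

```go
package dlq

import (
	"context"
	"time"
)

// Message and Source stand in for whatever the real queue client provides;
// the guardrails (rate limit, re-quarantine, clean abort) are the point.
type Message struct {
	ID   string
	Body []byte
}

type Source interface {
	Next(ctx context.Context) (Message, bool)
}

// Replay drains a dead-letter source at a fixed rate, re-quarantines anything
// that still fails, and stops cleanly on cancellation so an operator can abort.
func Replay(ctx context.Context, src Source, perSecond int,
	process func(context.Context, Message) error,
	quarantine func(context.Context, Message, error) error) error {

	tick := time.NewTicker(time.Second / time.Duration(perSecond))
	defer tick.Stop()

	for {
		msg, ok := src.Next(ctx)
		if !ok {
			return nil // DLQ drained
		}
		select {
		case <-tick.C:
		case <-ctx.Done():
			return ctx.Err()
		}
		if err := process(ctx, msg); err != nil {
			// No tight retry loop here: park the message again with the error
			// attached so a human can look before the next replay pass.
			if qErr := quarantine(ctx, msg, err); qErr != nil {
				return qErr
			}
		}
	}
}
```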
Why redress
My retry philosophy: classification-first, bounded unknowns, capped exponential backoff, and observability hooks.
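The "classification-first" part can be sketched as a decision made before any backoff math runs. The categories below follow the description above; they are not any particular library's API.

```go
package classify

import (
	"context"
	"errors"
	"net/http"
)

type Decision int

const (
	Retry    Decision = iota // transient: worth another bounded attempt
	GiveUp                   // terminal: retrying cannot help
	RetryFew                 // unknown: allow only a small, bounded allowance
)

// Classify is called after a failed attempt, before any backoff is scheduled.
// Unknown errors are not retried forever "just in case"; they get a bounded few.
func Classify(err error, status int) Decision {
	switch {
	case errors.Is(err, context.DeadlineExceeded),
		status == http.StatusTooManyRequests,
		status >= 500:
		return Retry
	case status >= 400 && status < 500:
		return GiveUp // the request itself is wrong; resending won't fix it
	default:
		return RetryFew
	}
}
```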
Why recourse
Policy-driven resilience for Go services: consistent retries, explicit backpressure budgets, hedging, and circuit breaking, with explainable observability.
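As an illustration of the circuit-breaking piece (a minimal sketch, not recourse's actual API): trip after consecutive failures, fail fast during a cool-down, then let a probe through.

```go
package breaker

import (
	"errors"
	"sync"
	"time"
)

var ErrOpen = errors.New("circuit open: failing fast")

// Breaker trips after threshold consecutive failures and fails fast until the
// cool-down elapses, turning a struggling dependency into quick, explainable
// errors instead of queued-up latency. Concurrent probes are allowed; a real
// implementation would usually gate them.
type Breaker struct {
	mu        sync.Mutex
	failures  int
	threshold int
	cooldown  time.Duration
	openedAt  time.Time
}

func New(threshold int, cooldown time.Duration) *Breaker {
	return &Breaker{threshold: threshold, cooldown: cooldown}
}

func (b *Breaker) Call(fn func() error) error {
	b.mu.Lock()
	if b.failures >= b.threshold && time.Since(b.openedAt) < b.cooldown {
		b.mu.Unlock()
		return ErrOpen
	}
	b.mu.Unlock()

	err := fn()

	b.mu.Lock()
	defer b.mu.Unlock()
	if err != nil {
		b.failures++
		if b.failures >= b.threshold {
			b.openedAt = time.Now() // re-open (or stay open) on each new failure
		}
		return err
	}
	b.failures = 0 // a success closes the breaker
	return nil
}
```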
Musings
The Monolith and the Swarm
How Stories Bind Us and What Happens When They Break
Made me treat tail latency as a first-class failure mode: retries must be load-aware and latency-aware.
Reinforced that backoff isn’t just math; it’s a control system where caps, jitter, and budgets matter.
A concrete demonstration of why naïve exponential backoff without jitter synchronizes clients and creates retry storms.
Error budgets and toil aren’t management slogans; they’re constraints that shape system design and operations.
Tools I maintain and contribute to
Principal engineering in high-throughput, high-assurance environments
I’ve worked on systems where resilience is a product feature: latency-sensitive workflows, streaming pipelines, and distributed control planes in fintech and defense contexts.
What I’ve shipped
- Latency-sensitive gRPC workflows in fintech environments
- Payment and streaming pipelines with replay safety
- Control-plane reconciliation loops and policy enforcement
- Partition-tolerant workflows for intermittent networks
How I work
- Define failure modes first, then instrument to observe them
- Prefer small reversible changes over risky refactors
- Write the runbook before the incident happens
- Validate recovery with drills, not hope
Depth across systems, platforms, and resilience patterns
Systems and architecture
- Event-driven workflows and streaming ingestion
- Control-plane reconciliation loops
- Idempotency design and replay safety for side effects
- Alert noise control and signal hygiene
Tools and languages
- AWS (Lambda, SQS, Step Functions, EventBridge)
- Kubernetes/EKS, Gateway API, Envoy
- OpenTelemetry, Prometheus, Grafana
- Go, Python, SQL
Let’s connect
If you’re operating event-driven systems and thinking about retries, DLQs, safe replay, or keeping telemetry quiet and useful, I’m always happy to compare notes.