About
High-assurance operability in distributed systems: diagnosable failures, bounded retries, and safe recovery you can actually run.
I’m a principal-level engineer focused on the part of distributed systems that most teams wish would just work: operability under partial failure.
I build and harden event-driven workflows, streaming pipelines, and control-plane style services—the kinds of systems where the happy path is easy, and the real work is making failure visible, diagnosis fast, and recovery safe.
My north star is simple:
- Failures should be explainable (not “something timed out somewhere”).
- Recovery should be repeatable (not “click buttons and pray”).
- Retries should be disciplined (not accidental load testing against your own dependencies).
- Telemetry should be quiet but complete (signal, not noise).
If you’ve ever had an incident where “reprocess the DLQ” felt like rolling dice, that’s exactly the category of problems I obsess over.
What I do
I make failure diagnosable
I’m not interested in “more logs.” I’m interested in answers.
That usually looks like:
- Correlation and trace context that actually propagates end-to-end (HTTP/gRPC headers, message attributes, workflow/task context)
- Structured logs with a consistent schema, so you can slice by operation, outcome, error class, and dependency (sketched after this list)
- RED/USE metrics aligned to what operators need during an incident
- Dashboards that match the system shape (queue/worker, workflow engine, streaming consumer group, controller loop, gateway)
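
A minimal sketch of the first two items, in Go with just the standard library. The header and field names here (X-Correlation-ID, operation, outcome, error_class) are placeholders for whatever your team standardizes on; the point is one correlation ID and one schema on every log line.

```go
package main

import (
	"context"
	"crypto/rand"
	"encoding/hex"
	"log/slog"
	"net/http"
	"os"
)

type ctxKey struct{}

var logger = slog.New(slog.NewJSONHandler(os.Stdout, nil))

// newID mints a correlation ID when the caller didn't send one.
func newID() string {
	b := make([]byte, 8)
	_, _ = rand.Read(b)
	return hex.EncodeToString(b)
}

// withCorrelation attaches a request-scoped logger carrying the shared schema
// fields, so every line logged during this request is sliceable the same way.
func withCorrelation(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		id := r.Header.Get("X-Correlation-ID")
		if id == "" {
			id = newID()
		}
		w.Header().Set("X-Correlation-ID", id) // echo back so callers can correlate
		l := logger.With("correlation_id", id, "operation", r.URL.Path)
		ctx := context.WithValue(r.Context(), ctxKey{}, l)
		next.ServeHTTP(w, r.WithContext(ctx))
	})
}

func handle(w http.ResponseWriter, r *http.Request) {
	l := r.Context().Value(ctxKey{}).(*slog.Logger)
	// Same schema on every outcome: operation, outcome, error_class.
	l.Info("request handled", "outcome", "success", "error_class", "")
	w.WriteHeader(http.StatusOK)
}

func main() {
	http.Handle("/orders", withCorrelation(http.HandlerFunc(handle)))
	http.ListenAndServe(":8080", nil)
}
```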
I put reliability guardrails where systems actually break
The big reliability failures I’ve seen rarely come from exotic bugs. They come from:
- retries without classification
- unbounded concurrency
- missing timeouts/budgets
- noisy alerts that train people to ignore the pager
- “replay” as an afterthought
So I design guardrails like:
- retry classification + budgets (what is retriable, how many times, and why; see the sketch after this list)
- exponential backoff + jitter (because synchronized retries are self-inflicted pain)
- concurrency caps / backpressure (because throughput without control is fragility)
- circuit breakers / rate limiting on unstable dependencies
- idempotency + dedupe at side-effect boundaries (because replay without side-effect control is corruption-by-another-name)
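
A minimal sketch of the first two guardrails in Go. The retriable error (errThrottled) and the call signature are illustrative, and it assumes base > 0; the shape that matters is classify first, cap the attempts, and jitter the backoff.

```go
package retry

import (
	"context"
	"errors"
	"math/rand"
	"time"
)

// errThrottled stands in for a dependency's explicitly retriable failure.
var errThrottled = errors.New("throttled")

// classify decides whether an error is worth retrying at all.
func classify(err error) bool {
	// Only transient, known-retriable failures qualify; everything else fails fast.
	return errors.Is(err, errThrottled) || errors.Is(err, context.DeadlineExceeded)
}

// Do runs call within an explicit attempt budget and per-attempt timeout,
// backing off with full jitter so retries never synchronize.
func Do(ctx context.Context, attempts int, base, perCall time.Duration, call func(context.Context) error) error {
	var err error
	for i := 0; i < attempts; i++ {
		cctx, cancel := context.WithTimeout(ctx, perCall)
		err = call(cctx)
		cancel()
		if err == nil || !classify(err) {
			return err // success, or not retriable: stop spending budget
		}
		if i == attempts-1 {
			break // budget exhausted: no point sleeping before giving up
		}
		// Full jitter: sleep a random duration within an exponentially growing window.
		window := base << i
		select {
		case <-time.After(time.Duration(rand.Int63n(int64(window)))):
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return err
}
```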
I treat recovery as a first-class feature
DLQ/quarantine isn’t a trash can; it’s a recovery mechanism.
I build operating models around:
- quarantine rules (what belongs there, what doesn’t)
- replay preconditions (what must be true before you hit “go”)
- rate plans (how you prevent a replay from melting dependencies; a sketch with a stop condition follows this list)
- verification steps (how you prove the replay is doing what you think)
- stop conditions (what makes you halt and reassess)
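
A minimal sketch of a rate plan plus stop condition in Go, assuming a hypothetical fetch/reprocess pair over whatever store you quarantine into. The numbers are illustrative; the point is that pacing and halting are code, not operator intuition.

```go
package replay

import (
	"context"
	"fmt"
	"time"
)

// Message is a placeholder for whatever your quarantine actually stores.
type Message struct{ ID string }

// Replay drains quarantined messages at a fixed rate and halts if failures
// start piling up, so a bad replay can't quietly melt a downstream dependency.
func Replay(ctx context.Context, fetch func() (*Message, bool), reprocess func(context.Context, *Message) error) error {
	const (
		perSecond     = 5 // rate plan: messages replayed per second
		maxConsecFail = 3 // stop condition: halt and reassess after this many failures in a row
	)
	tick := time.NewTicker(time.Second / perSecond)
	defer tick.Stop()

	failures := 0
	for {
		msg, ok := fetch()
		if !ok {
			return nil // quarantine drained: replay complete
		}
		select {
		case <-tick.C: // pace the replay according to the rate plan
		case <-ctx.Done():
			return ctx.Err()
		}
		if err := reprocess(ctx, msg); err != nil {
			failures++
			if failures >= maxConsecFail {
				return fmt.Errorf("halting replay after %d consecutive failures: %w", failures, err)
			}
			continue
		}
		failures = 0 // one success resets the stop counter
	}
}
```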
How I work
My default workflow is deliberately boring—and that’s the point.
1. Name the failure modes first. If we can’t describe how it fails, we can’t reliably detect it.
2. Define invariants at the side-effect boundaries. What must never happen twice? What can be retried? What requires idempotency keys? (A sketch of the idempotency-key piece follows this list.)
3. Instrument to observe those failure modes and invariants. Not “add telemetry,” but “add the minimum signals that let a human explain what happened.”
4. Add guardrails before refactors. I prefer small reversible changes (timeouts, caps, classification, schema, correlation) over risky “rewrite it” fixes.
5. Write the runbook before the next incident. A runbook isn’t documentation; it’s an operational interface. If it isn’t specific and verifiable, it’s fiction.
6. Validate recovery with drills, not hope. If a recovery path isn’t exercised, it will fail when you need it most.
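
A minimal sketch of the step-2 invariant in Go: a side effect keyed by an idempotency key, so replay becomes a no-op. The in-memory store is a stand-in; in a real system the dedupe record lives next to the side effect, ideally in the same transaction.

```go
package effects

import (
	"context"
	"sync"
)

// Store remembers which idempotency keys have already produced their side effect.
// An in-memory map is only a stand-in for a durable record kept with the data.
type Store struct {
	mu      sync.Mutex
	claimed map[string]bool
}

func NewStore() *Store { return &Store{claimed: make(map[string]bool)} }

// Once executes fn at most once per idempotency key: the key is claimed
// before the side effect runs and released only if the side effect fails.
func (s *Store) Once(ctx context.Context, key string, fn func(context.Context) error) error {
	s.mu.Lock()
	if s.claimed[key] {
		s.mu.Unlock()
		return nil // already applied or in flight: a replay is a no-op
	}
	s.claimed[key] = true // claim before acting, so concurrent replays can't double-fire
	s.mu.Unlock()

	if err := fn(ctx); err != nil {
		// Release the claim so a later, deliberate retry is still allowed.
		s.mu.Lock()
		delete(s.claimed, key)
		s.mu.Unlock()
		return err
	}
	return nil
}
```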
Principles I keep coming back to
- Retries are load. Treat them as a controlled resource, not a default behavior.
- “Safe replay” is an engineering problem, not an operational task. If replay can corrupt state, the system is incomplete.
- Observability is only valuable if it shortens the path from symptom → cause → action.
- Low-noise beats high-volume. Paging should be meaningful; dashboards should be navigable.
- Artifacts over opinions. I trust dashboards, alerts, runbooks, drills, and tests more than architecture diagrams.
Where this comes from
My background spans fintech and global banking production systems, plus earlier work in defense research and HPC/optimization-heavy environments. That mix shaped my bias toward discipline:
- Fintech taught me about tail latency, cascading failure, and volatility.
- Banking taught me about correctness, idempotency, and safe processing under disruption.
- Research/HPC taught me to think in budgets, invariants, and throughput-versus-stability tradeoffs, and to respect the physics of systems under load.
- Defense contexts taught me that “degraded mode” isn’t theoretical—it’s real, and systems need to remain operable when assumptions break.
(If you care about the details, they’re in the resume/capability statement linked elsewhere on the site.)
Stack and tools I reach for
I’m platform-pragmatic, but my current emphasis is:
- AWS: Lambda, SQS, Step Functions, EventBridge, ECS/Fargate, IAM/KMS (plus the usual VPC/networking realities)
- Kubernetes/EKS: production workloads, deployment patterns, and guardrails
- Edge / traffic policy: Envoy, Gateway API-style policy, timeouts/retries/rate limits done deliberately
- Observability: OpenTelemetry, Prometheus, Grafana (and cloud-native telemetry when appropriate)
- IaC / delivery: Terraform/OpenTofu, Helm/Kustomize, GitOps-friendly patterns
- Languages: Go, Python, SQL (with enough C/C++ and systems work in the past to be dangerous)