Case Study · Incident Response · Healthcare Ops

Results-Routing Incident Retrospective

A blameless retrospective on a misrouting incident in a lab ↔ EMR results pipeline. When the report API failed, the pipeline could deliver a stale cached report, and the manual re-push used to correct it could reach the wrong recipient if the routing values were matched incorrectly. I led the response: detection, containment, transparent disclosure through the Privacy Officer, and a systemic fix that removed the error class rather than asking people to be more careful. The focus throughout was the system that allowed the mistake, not the person who made it.

Role: Technical Program Manager
Scope: Incident response · cross-functional
Stack: HL7 v2 · results pipeline
Outcome: Contained, disclosed transparently, root cause engineered out.

Anonymized and recreated for illustration. This retrospective contains no patient data, no real dates, and no real practice, provider, company, or vendor names. The specifics, including the timeline and any figures, are representative: they show how I lead an incident and do not report a particular event.

The Problem

How caching plus an API failure created a misrouting risk class

The setup: results reports were cached so the pipeline could keep serving them quickly and survive transient upstream hiccups. That cache was a reasonable performance and resilience choice on its own.

The failure mode: when the report API failed or returned incomplete data, the pipeline could fall back to a cached report that was stale or incorrect and deliver it to an EMR. Recovering from that meant a manual re-push of the correct report.

Why the recovery was fragile: a manual re-push depended on a person matching the correct HL7 MSH values and the right practice and provider by hand. That hand-matching is exactly the step where a report can be sent to the wrong recipient. In this incident, a report was delivered to the wrong recipient practice. Together, a cache that could serve bad data and a manual recovery with no guardrails formed a whole class of misrouting risk, not a one-off slip.
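To make the hand-matching step concrete, here is a minimal sketch of where routing information sits in an HL7 v2 message header. In the MSH segment, MSH-5 and MSH-6 carry the receiving application and receiving facility. Which fields the team actually matched on is not stated here, so the field choice, the `MshRouting` type, the `parse_msh_routing` helper, and the sample identifiers are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MshRouting:
    sending_application: str    # MSH-3
    sending_facility: str       # MSH-4
    receiving_application: str  # MSH-5
    receiving_facility: str     # MSH-6


def parse_msh_routing(message: str) -> MshRouting:
    """Extract the routing fields from the MSH segment of an HL7 v2 message."""
    # HL7 v2 segments are separated by carriage returns; tolerate newlines too.
    first_segment = message.replace("\n", "\r").split("\r")[0]
    if not first_segment.startswith("MSH") or len(first_segment) < 4:
        raise ValueError("message does not start with an MSH segment")
    # MSH-1 is the field separator itself (the character right after "MSH").
    separator = first_segment[3]
    fields = first_segment.split(separator)
    # fields[0] is "MSH" and fields[1] is MSH-2, so MSH-n sits at index n - 1.
    if len(fields) < 6:
        raise ValueError("MSH segment is missing routing fields")
    return MshRouting(fields[2], fields[3], fields[4], fields[5])


# Hypothetical message with placeholder identifiers and a placeholder timestamp.
sample = (
    "MSH|^~\\&|LAB_APP|LAB_FAC|EMR_APP|PRACTICE_A|20000101000000||ORU^R01|MSG0001|P|2.5.1\r"
    "PID|1"
)
routing = parse_msh_routing(sample)
print(routing.receiving_application, routing.receiving_facility)
```

A person doing this by eye has to get every one of these values right for every re-push, which is why the step is a natural place for a misroute.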

Incident Response · Timeline

The incident timeline

Illustrative

I ran the response as a clear sequence so nothing was improvised under pressure: detect, contain the spread, escalate to the right owners, then resolve and confirm. The stages below are representative, with no real dates or clock times.

Detection

Mismatch surfaced

A recipient mismatch between the delivered report and the expected practice was flagged.

Containment

Stopped the spread

Paused further automated delivery on the affected path so no additional reports could misroute.

Escalation

Right owners in

Looped in the Privacy Officer and Engineering to handle disclosure and the technical investigation in parallel.

Resolution

Corrected and confirmed

Correct report delivered to the right recipient; written confirmation obtained that the misdelivered report was deleted.

Incident Response · Disclosure

Handling it transparently, by the book

Once the misroute was confirmed, the priority was correct, transparent handling rather than a quiet fix. I routed the disclosure through the people whose job it is to get it right and made transparency to both affected parties the default.

Engaged the Privacy Officer

Brought in the Privacy Officer for proper handling before acting, so disclosure followed the correct process rather than my best guess in the moment.

Notified both parties

Notified both affected parties for transparency: the intended recipient and the practice that received the report in error, so no one was left unaware.

Confirmed deletion

Obtained written confirmation that the incorrectly received report was deleted, closing the loop with a record rather than a verbal assurance.

Incident Response · Coordination

Who I coordinated, and how comms flowed

An incident is a coordination problem as much as a technical one. I kept a small, clear set of owners moving in parallel and sequenced communication so the right party heard the right thing at the right time, with no mixed messages.

Privacy & Compliance

Disclosure owner

Owned the disclosure decision and process. Handoff: I gave them the confirmed facts and the affected parties; they set how and what we communicated.

Engineering

Containment & fix

Paused the affected delivery path, ran the technical investigation, and built the systemic fix. Handoff: a shared, reproducible picture of the failure mode.

Support

Party comms

Carried the approved messaging to the affected parties and captured the written deletion confirmation. Handoff: a single, agreed script from Privacy.

Program management

Coordinator

I sequenced the response, kept one source of truth on status, and made sure containment, disclosure, and the fix advanced together rather than tripping over each other.

Analysis · RCA

Root-cause analysis

Illustrative

The point of the analysis was to name the system conditions that let this happen, not to find someone to blame. Two contributing factors combined into the misroute, and both were fixable in the system itself.

Cache served bad data on failure

When the report API failed or returned incomplete data, the pipeline could fall back to a cached report that was stale or incorrect and deliver it anyway. There was no validation gate to stop a known-bad report from going out on the failure path.

Manual re-push had no guardrails

Recovery leaned on a person matching HL7 MSH values and the right practice and provider by hand. With no automated verification of that match before delivery, a single mismatch could route a report to the wrong recipient.

The Fix

Engineering the error class out, not retraining humans

The durable fix was to make the misroute impossible by design rather than to ask people to be more careful on a fragile manual step. We closed both contributing factors in the pipeline itself and pulled the manual re-push off the critical path.

Cache validation on failure

On an API failure or incomplete response, the pipeline now validates the cached report before it can be delivered, so a stale or incomplete report is held rather than served. A known-bad report no longer leaves the system on the failure path.
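As a rough illustration of such a validation gate, the sketch below decides whether a cached report may go out when the report API fails. The specific checks (order match, freshness window, version, completeness), the `CachedReport` fields, and the 300-second default are assumptions chosen for the example, not the criteria the real pipeline uses.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class Decision(Enum):
    DELIVER = "deliver"
    HOLD = "hold"


@dataclass(frozen=True)
class CachedReport:
    order_id: str
    version: int        # hypothetical: increases each time the report is revised
    age_seconds: float  # time since the entry was cached
    is_complete: bool   # hypothetical flag: all expected result sections present
    body: str


def decide_on_api_failure(
    cached: Optional[CachedReport],
    expected_order_id: str,
    latest_known_version: int,
    max_age_seconds: float = 300.0,  # assumed freshness window
) -> Tuple[Decision, str]:
    """Decide whether a cached report may be delivered when the report API fails."""
    if cached is None:
        return Decision.HOLD, "no cached report"
    if cached.order_id != expected_order_id:
        return Decision.HOLD, "cached report belongs to a different order"
    if cached.age_seconds > max_age_seconds:
        return Decision.HOLD, "cached report is stale"
    if cached.version < latest_known_version:
        return Decision.HOLD, "cached report is superseded"
    if not cached.is_complete or not cached.body.strip():
        return Decision.HOLD, "cached report is incomplete"
    return Decision.DELIVER, "cached report passed validation"


# A stale entry is held instead of being served on the failure path.
stale = CachedReport("ORD-1", version=2, age_seconds=900.0, is_complete=True, body="...")
print(decide_on_api_failure(stale, "ORD-1", latest_known_version=2))
```

The gate defaults to holding: every path that cannot positively confirm the report returns `HOLD` with a reason, so a failure in validation surfaces as a delayed report rather than a wrong one.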

Automated MSH / practice verification

Recipient matching is verified automatically before delivery: the HL7 MSH values and the practice and provider are checked against the expected destination, so a mismatch is caught by the system instead of riding on a hand-match.
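A minimal sketch of that pre-delivery check might look like the following. It assumes the expected destination comes from a system record such as the original order rather than from anything typed by hand. The `Destination` fields, the `RecipientMismatch` exception, and the `deliver_if_verified` helper are hypothetical names, and exact string comparison after trimming whitespace is an assumed matching rule.

```python
from dataclasses import dataclass, fields
from typing import Callable, List


@dataclass(frozen=True)
class Destination:
    receiving_application: str  # e.g. taken from MSH-5
    receiving_facility: str     # e.g. taken from MSH-6 (the practice)
    provider_id: str            # hypothetical provider identifier


class RecipientMismatch(Exception):
    """Raised when the outbound routing does not match the expected destination."""


def mismatched_fields(outbound: Destination, expected: Destination) -> List[str]:
    """Return the names of routing fields that differ; an empty list means a match."""
    return [
        f.name
        for f in fields(Destination)
        if getattr(outbound, f.name).strip() != getattr(expected, f.name).strip()
    ]


def deliver_if_verified(
    outbound: Destination,
    expected: Destination,
    send: Callable[[Destination], None],
) -> None:
    """Call send only when every routing field matches the expected destination."""
    bad = mismatched_fields(outbound, expected)
    if bad:
        raise RecipientMismatch("held: mismatch on " + ", ".join(bad))
    send(outbound)


# Usage with placeholder identifiers: the wrong practice is caught before sending.
expected = Destination("EMR_APP", "PRACTICE_A", "PROV-1")
wrong = Destination("EMR_APP", "PRACTICE_B", "PROV-1")
sent: List[Destination] = []
try:
    deliver_if_verified(wrong, expected, sent.append)
except RecipientMismatch as err:
    print(err)
```

Because `send` is only reachable after the comparison passes, a mismatch stops the delivery instead of depending on someone noticing it.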

What changed in the pipeline

Guardrails now enforced
Cached reports validated before delivery on any API failure
Stale or incomplete reports held, not served
HL7 MSH and practice / provider match verified automatically
Manual re-push pulled off the critical path for the common case

The principle: taking the manual re-push off the critical path removed the step where a human had to get the matching right under pressure. The error class was designed out of the system, so the same mistake cannot recur the same way.

Reflection

Blameless postmortem, and what changed afterward
