Case Study · Incident Response · Healthcare Ops

Results-Routing Incident Retrospective

I led an incident response that contained a results-misrouting risk, disclosed it transparently through the Privacy Officer, and then engineered the error class out of the pipeline. The response covered detection, containment, disclosure, and a systemic fix, all aimed at the system that allowed the mistake, never the person who made it. How a team handles its worst day is the real test of program leadership.

Role: Technical Program Manager

Scope: Incident response · cross-functional

Stack: HL7 v2 · results pipeline

Outcome: Contained, disclosed transparently, root cause engineered out.

Anonymized and recreated for illustration. This retrospective contains no patient data; no real practice, provider, company, or vendor names; and no real dates. The specifics, including the timeline and any figures, are representative and presented to show how I lead an incident, not to report a particular event.

The Problem

How caching plus an API failure created a misrouting risk class

The setup: results reports were cached so the pipeline could keep serving them quickly and survive transient upstream hiccups. That cache was a reasonable performance and resilience choice on its own.

The failure mode: when the report API failed or returned incomplete data, the pipeline could fall back to a cached report that was stale or incorrect and deliver it to an EMR. Recovering from that meant a manual re-push of the correct report.
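The failure mode can be made concrete with a minimal sketch of an unguarded fallback. The names here (`Report`, `ReportApiError`, `fetch_unguarded`, the `complete` flag) are hypothetical and stand in for whatever the real pipeline used; the point is only the shape of the problem, where the failure path returns whatever the cache holds.

```python
from dataclasses import dataclass


@dataclass
class Report:
    order_id: str
    body: str
    complete: bool  # hypothetical flag: False when the API returned partial data


class ReportApiError(Exception):
    """Raised by the hypothetical upstream report API on failure."""


def fetch_unguarded(order_id, api_fetch, cache):
    """Return a report for delivery, falling back to the cache with no checks."""
    try:
        report = api_fetch(order_id)
        if report.complete:
            cache[order_id] = report  # refresh the cache on a good response
            return report
    except ReportApiError:
        pass
    # Failure path: whatever the cache holds is served, stale or not.
    return cache.get(order_id)
```

Nothing on the last line asks whether the cached entry is still correct, which is the gap the rest of this retrospective is about.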

Why the recovery was fragile: a manual re-push depended on a person matching the correct HL7 MSH values and the right practice and provider by hand. That hand-matching is exactly the step where a report can be sent to the wrong recipient. In one incident, a report was delivered to the wrong recipient practice. The conditions, a cache that could serve bad data plus a manual recovery with no guardrails, formed a whole class of misrouting risk, not a one-off slip.
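To show what the hand-match involved, here is a small sketch that pulls the routing fields out of an HL7 v2 MSH segment. In HL7 v2, MSH-3 and MSH-4 identify the sending application and facility, and MSH-5 and MSH-6 identify the receiving application and facility. Which of these fields the real pipeline relied on is an assumption, and `MshRouting` and `parse_msh_routing` are illustrative names.

```python
from typing import NamedTuple


class MshRouting(NamedTuple):
    sending_application: str    # MSH-3
    sending_facility: str       # MSH-4
    receiving_application: str  # MSH-5
    receiving_facility: str     # MSH-6


def parse_msh_routing(message: str) -> MshRouting:
    """Extract the routing fields from the MSH segment of an HL7 v2 message."""
    # Segments are separated by carriage returns; tolerate newlines as well.
    first_segment = message.replace("\n", "\r").split("\r", 1)[0]
    if not first_segment.startswith("MSH") or len(first_segment) < 4:
        raise ValueError("message does not start with an MSH segment")
    separator = first_segment[3]  # MSH-1 is the field separator itself
    fields = first_segment.split(separator)
    # fields[0] == "MSH" and fields[1] == MSH-2 (encoding characters),
    # so MSH-n for n >= 2 sits at index n - 1.
    if len(fields) < 6:
        raise ValueError("MSH segment is missing routing fields")
    return MshRouting(*fields[2:6])
```

This sketch treats each field as an opaque string. Real messages may carry components inside a field, so a production check would need to decide which component identifies the recipient.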

Incident Response · Timeline

The incident timeline

Illustrative

I ran the response as a clear sequence so nothing was improvised under pressure: detect, contain the spread, escalate to the right owners, then resolve and confirm. The stages below are representative, with no real dates or clock times.

Detection

Mismatch surfaced

A recipient mismatch between the delivered report and the expected practice was flagged.

Containment

Stopped the spread

Paused further automated delivery on the affected path so no additional reports could misroute.

Escalation

Right owners in

Looped in the Privacy Officer and Engineering to handle disclosure and the technical investigation in parallel.

Resolution

Corrected and confirmed

Correct report delivered to the right recipient; written confirmation obtained that the misdelivered report was deleted.
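The containment step, pausing automated delivery on only the affected path, can be sketched as a per-path switch. This is an illustration under assumptions, not the actual mechanism: `DeliveryGate`, the path names, and the `send` callback are all hypothetical.

```python
import threading


class DeliveryGate:
    """Per-path pause switch; paused paths hold reports instead of sending."""

    def __init__(self):
        self._lock = threading.Lock()
        self._paused = set()
        self.held = []  # (path, report_id) pairs awaiting review

    def pause(self, path):
        with self._lock:
            self._paused.add(path)

    def resume(self, path):
        with self._lock:
            self._paused.discard(path)

    def deliver(self, path, report_id, send):
        """Call send(report_id) unless the path is paused; return True if sent."""
        with self._lock:
            if path in self._paused:
                self.held.append((path, report_id))
                return False
        send(report_id)
        return True
```

Holding reports rather than dropping them keeps a record of what was stopped, so delivery can be replayed once the path is confirmed safe, while unaffected paths keep flowing.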

Incident Response · Disclosure

Handling it transparently, by the book

Once the misroute was confirmed, the priority was correct, transparent handling rather than a quiet fix. I routed the disclosure through the people whose job it is to get it right and made transparency to both affected parties the default.

Engaged the Privacy Officer

Brought in the Privacy Officer for proper handling before acting, so disclosure followed the correct process rather than my best guess in the moment.

Notified both parties

Notified both affected parties for transparency: the intended recipient and the practice that received the report in error, so no one was left unaware.

Confirmed deletion

Obtained written confirmation that the incorrectly received report was deleted, closing the loop with a record rather than a verbal assurance.

Incident Response · Coordination

Who I coordinated, and how comms flowed

An incident is a coordination problem as much as a technical one. I kept a small, clear set of owners moving in parallel and sequenced communication so the right party heard the right thing at the right time, with no mixed messages.

Privacy & Compliance

Disclosure owner

Owned the disclosure decision and process. Handoff: I gave them the confirmed facts and the affected parties; they set how and what we communicated.

Engineering

Containment & fix

Paused the affected delivery path, ran the technical investigation, and built the systemic fix. Handoff: a shared, reproducible picture of the failure mode.

Support

Party comms

Carried the approved messaging to the affected parties and captured the written deletion confirmation. Handoff: a single, agreed script from Privacy.

Program management

Coordinator

I sequenced the response, kept one source of truth on status, and made sure containment, disclosure, and the fix advanced together rather than tripping over each other.

Analysis · RCA

Root-cause analysis

Illustrative

The point of the analysis was to name the system conditions that let this happen, not to find someone to blame. Two contributing factors combined into the misroute, and both were fixable in the system itself.

Cache served bad data on failure

When the report API failed or returned incomplete data, the pipeline could fall back to a cached report that was stale or incorrect and deliver it anyway. There was no validation gate to stop a known-bad report from going out on the failure path.

Manual re-push had no guardrails

Recovery leaned on a person matching HL7 MSH values and the right practice and provider by hand. With no automated verification of that match before delivery, a single mismatch could route a report to the wrong recipient.

The Fix

Engineering the error class out, not retraining humans

The durable fix was to make the misroute impossible by design rather than to ask people to be more careful on a fragile manual step. We closed both contributing factors in the pipeline itself and pulled the manual re-push off the critical path.

Cache validation on failure

On an API failure or incomplete response, the pipeline now validates the cached report before it can be delivered, so a stale or incomplete report is held rather than served. A known-bad report no longer leaves the system on the failure path.
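A minimal sketch of that validation gate follows. The source does not define what counted as stale, so the age threshold, the `complete` flag, and every name here (`CachedReport`, `validate_cached`, `fallback_on_failure`, `MAX_AGE`) are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class CachedReport:
    order_id: str
    body: str
    complete: bool       # hypothetical: all expected sections are present
    cached_at: datetime  # when this copy was written to the cache


MAX_AGE = timedelta(minutes=15)  # illustrative threshold, not a real setting


def validate_cached(report: Optional[CachedReport], order_id: str, now: datetime) -> list:
    """Return the reasons a cached report must be held; empty means deliverable."""
    if report is None:
        return ["no cached report"]
    reasons = []
    if report.order_id != order_id:
        reasons.append("order id mismatch")
    if not report.complete:
        reasons.append("incomplete report")
    if now - report.cached_at > MAX_AGE:
        reasons.append("stale report")
    return reasons


def fallback_on_failure(cache: dict, order_id: str, now: datetime):
    """On API failure: ('deliver', report) if valid, else ('hold', reasons)."""
    report = cache.get(order_id)
    reasons = validate_cached(report, order_id, now)
    if reasons:
        return ("hold", reasons)
    return ("deliver", report)
```

Returning the reasons alongside the hold decision gives whoever reviews a held report a starting point, instead of a bare refusal.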

Automated MSH / practice verification

Recipient matching is verified automatically before delivery: the HL7 MSH values and the practice and provider are checked against the expected destination, so a mismatch is caught by the system instead of riding on a hand-match.
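The automated check can be sketched as a comparison of the outgoing routing fields against the expected destination, with delivery blocked on any difference. This assumes the relevant MSH values are MSH-5 and MSH-6 (receiving application and facility); the `Destination` record, the identifier formats, and the function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Destination:
    receiving_application: str  # expected MSH-5
    receiving_facility: str     # expected MSH-6
    practice_id: str
    provider_id: str


def verify_recipient(msh_fields: dict, practice_id: str, provider_id: str,
                     expected: Destination) -> list:
    """Return mismatch descriptions; an empty list means the recipient matches."""
    actual = {
        "receiving_application": msh_fields.get("MSH-5", ""),
        "receiving_facility": msh_fields.get("MSH-6", ""),
        "practice_id": practice_id,
        "provider_id": provider_id,
    }
    mismatches = []
    for name, value in actual.items():
        want = getattr(expected, name)
        if value != want:
            mismatches.append(f"{name}: got {value!r}, expected {want!r}")
    return mismatches


def deliver_if_verified(report_id, msh_fields, practice_id, provider_id, expected, send):
    """Send only when every routing field matches; otherwise block and say why."""
    mismatches = verify_recipient(msh_fields, practice_id, provider_id, expected)
    if mismatches:
        return ("blocked", mismatches)
    send(report_id)
    return ("sent", [])
```

Because `send` is only reachable after the comparison passes, a re-push that goes through this function cannot skip the check the way a hand-match could.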

What changed in the pipeline

Guardrails now enforced

Cached reports validated before delivery on any API failure

Stale or incomplete reports held, not served

HL7 MSH and practice / provider match verified automatically

Manual re-push pulled off the critical path for the common case

The principle: reducing reliance on a manual re-push removed the step where a human had to get the matching right under pressure. The error class was designed out of the system, so the same mistake cannot recur the same way.

Reflection

Blameless postmortem, and what changed afterward

Fix the system, not the person. The retrospective stayed blameless throughout. A manual step with no guardrails will eventually produce a wrong outcome regardless of who runs it, so the honest conclusion was to remove the fragile step, not to coach the person who hit it.
Guardrails now prevent the misroute class. Cache validation on failure and automated recipient verification close both contributing factors: a stale report is held rather than served, and a report addressed to a mismatched recipient is blocked before delivery.
The disclosure path is codified. Engaging the Privacy Officer first, notifying both parties, and capturing written confirmation of deletion became the known, repeatable response, so the next team to face an incident is not improvising the handling.
