BCP/DR Tabletop Exercise, Annual

Owner: David McHale (CEO + CTO + Security lead). Authored: 2026-05-11. First run: scheduled 2026-Q3. Cadence: annual, plus ad-hoc after any near-miss incident.

Audience: HailBytes participants in the exercise. Also useful to procurement reviewers as evidence that BCP/DR planning at HailBytes is operationalized rather than purely documented.

Purpose: Provide a runnable tabletop exercise script for HailBytes’ annual BCP/DR drill. Two scenarios are paired into a single half-day session so the exercise covers both the highest-stakes scenario for a security vendor (supply-chain compromise) and the highest-stakes scenario for a small bootstrapped company (compound key-person loss).

1. Logistics

Duration: 4 hours total, run as two 90-minute scenario blocks with a 60-minute combined debrief.
Format: in-person if all three named individuals are co-located that quarter; otherwise video conference with a shared incident-channel mock-up.
Participants (required):
- David McHale (Incident Commander; serves on the day in whichever role the scenario does not remove)
- John Shedd (Communication lead; commercial-relationships continuity)
- Boden McHale (Technical lead; release-pipeline continuity)
Facilitator (external, required): Lost Rabbit Digital facilitates the first run and subsequent annual exercises.
Materials:
- Printed copies of bcp-dr-plan.md, key-person-succession.md, byoc-architecture.md, security-evidence-package.md, and this file.
- The participants’ real laptops (read-only access to GitHub, Cloudflare, AWS Marketplace seller portal, Azure Partner Center, Google Workspace admin). The exercise is verbal, but participants may consult these systems to ground decisions.
- A whiteboard or shared document for capturing decision points and lessons learned.
What “passing” looks like: the participants reach end-of-scenario with a written timeline of decisions, a clear who-did-what, and a remediation list. There is no grade. The point is to surface gaps before they matter.

2. Pre-brief (15 minutes before scenario 1)

The facilitator reads the following:

Today is a HailBytes tabletop exercise. The two scenarios are intentionally severe but plausible for a security vendor at HailBytes’ scale. Treat them as if they are happening now. Do not optimize for what the exercise is testing; respond as you would respond in a real incident. The exercise can be paused at any time if a participant needs to disengage. Everything said in this room is for HailBytes’ internal use. We will write down the decisions made and the lessons learned at the end.

3. Scenario 1, Supply-chain compromise of the HailBytes release pipeline

This scenario maps to bcp-dr-plan.md §2.3.

3.1 Setup (T+0; 5 minutes)

The facilitator reads:

It is Tuesday at 10:14 UTC. David is on a customer call, John is in inbox triage, Boden is reviewing a pull request. An email lands at security@hailbytes.com from a customer’s security team: “Our nightly ASM scan showed new behavior we didn’t expect. We ran the documented cosign verify against the image digest we pulled and it succeeded, but the certificate-identity claim points to a GitHub workflow path we don’t recognize from prior verifications, and the commit SHA in the certificate subject does not resolve in HailBytes’ public main branch. We’re pausing further updates until you confirm provenance.”

3.2 Inject sequence

Each inject is read by the facilitator at the listed wall-clock time. Participants respond verbally; the facilitator records decisions.

Wall-clock	Inject	Discussion prompt
T+5 min	Boden checks `ghcr.io/hailbytes/hailbytes-asm-web` and confirms the image digest the customer reported was published at 09:32 UTC. He runs the documented `cosign verify` against the digest himself and reaches the same conclusion the customer did: the certificate’s subject SHA does not exist in the `main` branch but does exist in a feature branch `boden/perf-tweaks` that was pushed at 09:30 UTC.	Is this a real compromise, a misconfigured branch protection, or Boden’s own work that accidentally got tagged for distribution? Who decides? What is the immediate containment step regardless?
T+12 min	Boden confirms `boden/perf-tweaks` has not yet been merged to `main`, but the Trivy SARIF scan that runs on every push completed successfully and the image was tagged `:edge`. The Marketplace VM image generation pipeline pulls `:edge` for the dev-channel AMI.	Is `:edge` an immediate customer-visible artifact, or is there a gap before customers would pull this? Check the marketplace pipeline status.
T+20 min	A second customer (an enterprise prospect in active evaluation) emails `support@hailbytes.com` independently reporting the same behavior change in the ASM web container. They have not yet run `cosign verify`.	Does this second customer need a direct phone call? Who makes it? What do we tell them given we are still in containment?
T+35 min	Boden inspects the image and finds two changes from the previous `:edge`: a 200-line performance optimization in `web/celery_workers/dispatch.py` that does roughly what its commit message says, and a single added line that disables a CSP nonce check on one HTML response.	Real attack with plausible cover, or sloppy commit by Boden himself? What is the containment step regardless? How do we tell the difference within the next 30 minutes?
T+50 min	David receives a Slack DM from a contact at one of the threat-intel sources HailBytes pulls from saying “saw your image identity in our feed, anything going on?”	Coordinated disclosure or accidental data exposure on Boden’s own pull request branch? Does this change the containment calculus?
T+70 min	The on-call cloud account in AWS shows that the Packer build VM for the dev-channel AMI pulled `:edge` at 09:35 UTC and is mid-build.	Kill the build VM or let it complete to preserve forensic evidence? Who decides?
T+85 min	A third customer reports the same anomaly. Their security team is asking for a written statement within the hour.	Public statement, customer-private statement, or hold until containment is complete? Whose decision? What goes into the statement?
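
The anomaly at the center of this scenario is that signature verification succeeds while the signer identity and source commit are not the expected ones. The following is a minimal sketch of that second check, intended only to ground the discussion. The identity pattern, the `example-org/example-repo` path, and the function and class names are all hypothetical; the real expected identity is whatever security-evidence-package.md §3 documents, and the set of `main` commits would come from the repository itself.

```python
from __future__ import annotations

import re
from dataclasses import dataclass

# Hypothetical expected signer identity: a release workflow running on main.
# Replace with the identity the published verification docs actually state.
EXPECTED_IDENTITY = re.compile(
    r"https://github\.com/example-org/example-repo/"
    r"\.github/workflows/release\.yml@refs/heads/main"
)


@dataclass(frozen=True)
class SigningClaims:
    identity: str    # certificate identity claim (workflow path and ref)
    commit_sha: str  # commit SHA recorded in the signing certificate


def provenance_anomalies(claims: SigningClaims, main_shas: set[str]) -> list[str]:
    """Return the anomalies found; an empty list means nothing unexpected.

    This runs after signature verification has already succeeded. A valid
    signature only proves that some workflow signed the image, not that the
    expected workflow did.
    """
    anomalies = []
    if not EXPECTED_IDENTITY.fullmatch(claims.identity):
        anomalies.append("unrecognized workflow identity")
    if claims.commit_sha not in main_shas:
        anomalies.append("commit not on main")
    return anomalies
```

In the scenario as written, an image built from the `boden/perf-tweaks` branch would trip both checks, which is the signal the first customer reports.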

3.3 End-of-scenario decision points captured

At T+90 the facilitator pauses the scenario and the participants write down on the whiteboard:

The decision sequence that occurred (containment, communication, forensics).
The named individual who held each decision at each point.
Any decision point where the participants disagreed in real time.
Any moment where a participant did not know what to do (an honestly surfaced gap).

3.4 Specific questions to answer at end of scenario

These are the gaps bcp-dr-plan.md §2.3 commits to closing. The tabletop tests them:

How long did it take for the customer’s cosign verify anomaly report to reach the right HailBytes responder? (The verification itself succeeded; the anomaly was the unrecognized identity and commit.) Document the time between image publication (09:32) and the customer’s email (10:14). Is the security-inbox triage cadence fast enough to act on a customer-reported supply-chain anomaly?
Was image yanking attempted? Who has the credentials? How long did it take to remove the image from ghcr.io?
Did the signed customer advisory get drafted? What was the wording? Who has authority to publish it without further sign-off?
Was the build VM preserved for forensics? If not, why not?
Did anyone communicate with the customers who emailed at T+20 and T+85? If yes, what was the message? If no, why not?
Is the customer-side cosign verify documentation discoverable enough that customers actually run it? This scenario depends on at least one customer having run the documented verification. Is the security-evidence-package.md §3 verification documentation linked from the install-time touchpoints customers actually read? If not, fix it.
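
The timing questions above come down to simple clock arithmetic on the times given in the scenario. A minimal sketch follows, assuming T+0 is the 10:14 UTC customer email; the calendar date is an arbitrary placeholder and the helper names are illustrative only.

```python
from datetime import datetime, timedelta, timezone

# Arbitrary placeholder date; only the clock times come from the scenario.
DAY = datetime(2026, 1, 6, tzinfo=timezone.utc)


def at(hour: int, minute: int) -> datetime:
    """Clock time on the placeholder scenario day, in UTC."""
    return DAY.replace(hour=hour, minute=minute)


BRANCH_PUSHED = at(9, 30)
IMAGE_PUBLISHED = at(9, 32)
PACKER_PULLED_EDGE = at(9, 35)
CUSTOMER_EMAIL = at(10, 14)  # T+0

INJECT_OFFSETS_MIN = [5, 12, 20, 35, 50, 70, 85]


def minutes_between(start: datetime, end: datetime) -> int:
    """Whole minutes elapsed from start to end."""
    return int((end - start).total_seconds() // 60)


def inject_clock(offset_min: int) -> str:
    """Wall-clock time (HH:MM UTC) of an inject at T+offset_min."""
    return (CUSTOMER_EMAIL + timedelta(minutes=offset_min)).strftime("%H:%M")


if __name__ == "__main__":
    print("publication to customer email:",
          minutes_between(IMAGE_PUBLISHED, CUSTOMER_EMAIL), "min")
    for offset in INJECT_OFFSETS_MIN:
        print(f"T+{offset} min = {inject_clock(offset)} UTC")
```

With the scenario's times, the gap between publication and the customer's email is 42 minutes, and the final inject at T+85 lands at 11:39 UTC.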

4. Scenario 2, Compound key-person loss during an active customer engagement

This scenario maps to bcp-dr-plan.md §2.6 and key-person-succession.md §4.2.

4.1 Setup (5 minutes)

The facilitator reads:

It is Monday morning. David is unreachable for what will become 72 hours (medical event; participants do not know the duration at the start of the scenario). Enterprise Customer A is at contract-signing stage, with a security review meeting scheduled for Wednesday at 14:00 in their local time (the day after tomorrow). Enterprise Customer B is in the DPA-redlining stage and the customer’s procurement lead emails this morning asking for a clarification on the controller/processor framing in our LGPD document. A small ASM v2.3.1 patch release is ready to ship; the only outstanding step is David’s final review.

4.2 Inject sequence

Wall-clock	Inject	Discussion prompt
T+10 min	Boden discovers David is unreachable; the team confirms over the next 30 minutes that they will not hear from David for at least 24 hours.	What is the first communication to send, to whom, and through what channel? Who sends it?
T+45 min	The Customer A security review meeting is roughly 49 hours away. The customer’s lead architect has asked for David specifically.	Do we keep the meeting, reschedule, or substitute? Who attends if substitute? Who tells the customer?
T+90 min	Customer B’s procurement lead has reminded the team that they need the DPA clarification by Wednesday.	Who responds? With what authority? What if the clarification requires a substantive change to the DPA position?
T+2 hours	A separate customer reports a P1 production issue in their ASM deployment (memory leak in the celery worker). Their support contract specifies a 4-hour response.	Who triages? Who responds? Is the response written, called, or both?
T+4 hours	The v2.3.1 patch release contains a fix that addresses a moderate-severity finding in the most recent Trivy SARIF scan. Boden is the only person who has a clear view of the change.	Ship without David’s review or hold? Under what review constraint? Document the decision rationale.
T+24 hours	David is now expected to be unreachable for 72 hours total. The Customer A meeting is in 26 hours.	Reaffirm or revise the Customer A meeting decision.
T+50 hours	The Customer A meeting begins. Who is in the (real or rehearsed) meeting? What is the briefing the substitute(s) need?	The participants briefly role-play the first five minutes of that meeting.

4.3 End-of-scenario decision points captured

Did all customer-facing communications go out on time, with the right authority, from the right person?
Did the team’s effective decision authority match the documented succession plan in key-person-succession.md §4.2?
Did the v2.3.1 release ship? Under what review process?
Was anything dropped (an inbox message that didn’t get a response, a meeting that wasn’t rescheduled, a customer call that didn’t get returned)?

4.4 Specific questions to answer at end of scenario

Are the customer-relationship channels documented well enough that John and Boden can respond without David? Specifically: does each named enterprise customer’s contract have the four-contact list documented in key-person-succession.md §3 in place today, or only in roadmap?
Is there a release-shipping playbook for the David-unavailable case? Document it now if not. The CI gates are not the only blocker; the decision to ship under reduced review is a policy question that should be settled before the next absence.
Did anyone burn out attempting to cover for David in real time? Even in a 4-hour tabletop, this risk is visible. In a real 72-hour incident it is the dominant failure mode for compound key-person loss.

5. Combined debrief (60 minutes)

The participants and facilitator review:

Decision-record consistency. Walk through the whiteboard timelines for both scenarios. Note any decision that, in hindsight, the participants would make differently. Note any decision where the documented plan and the participants’ real-time judgment disagreed.
Communication gaps. Identify any moment in either scenario where a customer-facing communication, an internal communication, or a regulator-facing communication did not get sent or was sent with incorrect content.
Tooling gaps. Identify any specific tooling that, if it existed, would have changed the response.
- Most likely candidates: a status-page subscription mechanism, a customer-contact roster cross-referenced to each contract, a pre-drafted “David unavailable” briefing template, a “supply-chain anomaly reported by customer” runbook.
Documentation gaps. Identify any place where a participant defaulted to “I’ll figure it out” rather than “I’ll consult the runbook” because no runbook existed.
Remediation list. For each gap, write a one-line remediation item with an owner (David, John, Boden) and a target date. Add to the internal operational tracker within the same week.
Lessons-learned write-up. David writes a 500-word post-tabletop reflection within 5 business days and posts it internally to the team. At the next quarterly review cycle, he updates the trust-package documents that the tabletop showed to need updating.
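
The remediation list described above (one line per gap, with an owner and a target date) can be captured in a small structure before it is copied into the operational tracker. This is only a sketch: the class, field names, and output format are hypothetical, and the tracker's actual schema is not specified in this document.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import date

# Owners are limited to the three named participants.
OWNERS = {"David", "John", "Boden"}


@dataclass(frozen=True)
class RemediationItem:
    gap: str      # what the tabletop revealed
    action: str   # the one-line remediation
    owner: str    # must be one of OWNERS
    target: date  # target completion date

    def __post_init__(self) -> None:
        # Reject items without a valid named owner at creation time.
        if self.owner not in OWNERS:
            raise ValueError(f"owner must be one of {sorted(OWNERS)}")

    def one_line(self) -> str:
        """Render the item as a single tracker-ready line."""
        return f"{self.target.isoformat()} | {self.owner} | {self.action} (gap: {self.gap})"


def overdue(items: list[RemediationItem], today: date) -> list[RemediationItem]:
    """Items whose target date has passed, oldest target first."""
    return sorted((i for i in items if i.target < today), key=lambda i: i.target)
```

Rejecting an item with no valid owner at creation time keeps the list honest: a gap without a named owner is not yet a remediation item.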

6. Post-exercise update of trust-package documents

After the first exercise (2026-Q3):

Update bcp-dr-plan.md “Last tested” date.
Update §7 of bcp-dr-plan.md if the cadence changes.
Resolve the relevant entry in the internal operational tracker for the BCP/DR tabletop.
If the exercise revealed any control gaps mapped to CAIQ-Lite, update the relevant caiq-lite.md row.
If the exercise revealed any succession-plan ambiguity, update the key-person succession plan accordingly.

7. Exercise variants for future years

The tabletop is intended to evolve. Suggested scenario rotation for future years:

Year 2 (2027): GitHub organization compromise (broader than just signing-pipeline); customer-tenant ransomware with HailBytes support obligations.
Year 3 (2028): Cloud-provider outage during a Marketplace listing freeze; coordinated disclosure of a critical CVE affecting both products simultaneously.
Year 4 (2029): HailBytes-vanishing scenario rehearsed from the customer’s perspective with a real customer participating under NDA.

The scenarios above are written so the same exercise format (90-minute scenario blocks, 60-minute combined debrief, structured decision capture) applies year over year.

Cross-references: bcp-dr-plan.md §2.3, §2.6, §7; key-person succession plan §4 (available on request); security-evidence-package.md §3 (the customer-side cosign verify capability the scenario depends on).