{"uuid": "80c30b40-4f0a-4cfd-9f31-84028ee0473c", "vulnerability_lookup_origin": "1a89b78e-f703-45f3-bb86-59eb712668bd", "author": "9f56dd64-161d-43a6-b9c3-555944290a09", "vulnerability": "cve-2024-3094", "type": "seen", "source": "https://gist.github.com/YoraiLevi/d788d3ecbc8545d40c41e0957683ca22", "content": "# The Bootstrap Conundrum\n\n### A Rookie's Live Book on How Real Companies Solve the Trust Problem\n\n---\n\n## Front matter\n\n> **Who this book is for.** You've just been told that your team runs \"infrastructure\" \u2014 Vault, Kubernetes, GitHub Actions runners, whatever \u2014 and that there's a \"Day 1 setup\" you need to do, and then \"Day 2 operations,\" and then \"disaster recovery,\" and at every step you keep running into the same uncomfortable feeling: *\"how can I configure X if X is what I'm trying to set up?\"* You're not crazy. You've discovered the **bootstrap problem**. This book is everything our team learned from 10 weeks of reading the industry's accumulated answer to that question, distilled so you don't have to re-derive it.\n>\n> **Why this book exists.** The bootstrap problem is one of the oldest open questions in computer security and distributed systems. Ken Thompson identified its deepest form in his 1984 Turing Award lecture; the DNS root operators have been performing scripted ceremonies to ground it since 2010; HashiCorp built a company on monetizing one slice of it. The lessons are scattered across academic papers, conference talks, regulator publications, vendor blogs, postmortems, and the working knowledge of senior SREs. We pulled them together because we couldn't find this book and decided to write the one we wished existed.\n>\n> **How to read it.** Read Part I from start to finish \u2014 it builds the conceptual scaffold and you cannot skip it. Then read whichever chapter of Part II is closest to your immediate problem. 
Part III ties everything back to action items for our DGX fleet project, but the recommendations generalize.\n>\n> **What this book is not.** It is not a Vault tutorial, an Ansible reference, a Kubernetes how-to, or a step-by-step \"do this.\" It is the *vocabulary* and *mental model* that lets you make sense of every Vault tutorial, every Ansible reference, every Kubernetes how-to. Once you have the model, the tutorials become legible.\n>\n> **Project context (for the curious).** This book emerged from operating a small fleet \u2014 two NVIDIA DGX Spark inference boxes plus a Proxmox VM running Vault + k3s + actions-runner-controller + ARA. We are two engineers, not fifty. Many of the lessons here are over-engineered for that scale; we mark which ones are \"do this today\" vs \"save for when you grow.\"\n>\n> **A note on humility.** Every recommendation in this book is something we found in the wild from people doing real work at real scale. We did not invent any of it. The agents that researched each chapter cited URLs aggressively \u2014 follow them. Where we disagree with consensus, we say so explicitly.\n\n---\n\n## Table of contents\n\n### Part I \u2014 The problem\n1. [The naive state \u2014 what you think before you read this](#1-the-naive-state)\n2. [The bootstrap problem, properly named](#2-the-bootstrap-problem)\n3. [The three buckets \u2014 identity, data, config](#3-the-three-buckets)\n4. [Day 0, Day 1, Day 2 \u2014 the time-scale axis](#4-day-0-day-1-day-2)\n\n### Part II \u2014 How real companies do it\n5. [Running HashiCorp Vault for the long haul](#5-vault-in-production)\n6. [The Google SRE lens \u2014 toil, DiRT, error budgets](#6-google-sre)\n7. [Key ceremonies \u2014 30 years of PKI wisdom](#7-key-ceremonies)\n8. [GitOps for stateful infrastructure](#8-gitops)\n9. [Tested DR \u2014 chaos engineering and game days](#9-chaos)\n10. [Kubernetes and etcd \u2014 the parallel problem](#10-kubernetes)\n11. 
[Workload identity \u2014 the death of secret zero](#11-workload-identity)\n12. [Supply chain trust \u2014 trust has no floor](#12-supply-chain)\n13. [Compliance-driven DR \u2014 what auditors actually want](#13-compliance)\n14. [Boring infrastructure for two engineers \u2014 the homelab way](#14-homelab)\n\n### Part III \u2014 Synthesis and action\n15. [The unified methodology](#15-unified-methodology)\n16. [Concrete next steps](#16-concrete-next-steps)\n17. [Reading order for going deeper](#17-reading-order)\n\n---\n\n# Part I \u2014 The problem\n\n## 1. The naive state\n\nBefore you've thought hard about infrastructure, you carry an unspoken model that goes something like this:\n\n1. *Setup* is a thing you do once. It's annoying, but it's a one-time cost.\n2. After setup, the system *runs*. You just have to maintain it.\n3. If something breaks, you fix it. If it breaks badly, you \"rebuild from backups.\"\n4. Secrets go in a *secret manager*. The secret manager is just another piece of software you set up once.\n\nEach of those four statements is wrong. Specifically:\n\n1. **Setup never ends.** Every system needs new policies, new users, new rotations, new upgrades. The \"setup\" capability is a *permanent service*, not a one-time event.\n2. **The \"running\" state hides ongoing configuration drift.** Every minute that passes, the system you're operating diverges from the documentation that describes it. Unless you fight that drift explicitly, the documentation rots, the runbooks rot, and Day-90 you cannot reproduce what Day-1 you did.\n3. **\"Rebuild from backups\" is a fantasy until you've actually rebuilt from backups.** Untested backups are not backups; they are stories you tell yourself about backups. The industry's canonical example of this trap is GitLab in 2017 (five backup mechanisms, none worked).\n4. 
**The secret manager has the bootstrap problem worst of all.** It holds the keys to everything else, which means it needs *its own* keys, which means *its own* secret manager, which means... we'll get to it.\n\nHold all four of those in your head. The rest of Part I unpacks why each is wrong, and Part II shows you the patterns the industry has converged on for each.\n\n`\u2605 Insight \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500`\nThe discomfort you feel reading the four naive statements above is your brain doing the work. If a senior engineer rolls their eyes when you say \"we'll just restore from backup,\" that eye-roll is the result of having lived through what happens when statement 3 turns out to be wrong. The point of this book is to inherit those reflexes without first living through the outages that produced them.\n`\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500`\n\n## 2. The bootstrap problem\n\nDefine a **trust dependency** as: *system A cannot operate until it can authenticate to system B*. Now consider three trust dependencies on top of each other:\n\n- Your fleet of inference servers needs to authenticate to Vault to fetch secrets.\n- Vault needs to be authenticated-to via a credential the fleet servers hold.\n- The credential is itself a secret that needs to be distributed somehow.\n\nIf you store the credential **in Vault**, then the fleet needs to authenticate to Vault before it can fetch the credential to authenticate to Vault. That's a cycle. 
The cycle has a name: it's the **bootstrap problem**, and the specific case of \"the credential you need to fetch your other credentials\" is called **secret zero**. Every system that holds secrets faces secret zero at the boundary where the system meets the world.\n\nKen Thompson's 1984 paper *Reflections on Trusting Trust* \u2014 which we recommend you read before any other paper in this field \u2014 generalizes the bootstrap problem to its deepest form: *you cannot trust code you didn't write unless you can trust the compiler that compiled it, and you cannot trust the compiler unless you trust the compiler that compiled the compiler, and so on, all the way down*. The infinite regress only terminates at a **root of trust** that you accept on some basis other than further verification \u2014 a piece of paper in a safe, a hardware token, a human's memory, the manufacturer's signature on a TPM chip.\n\nThis is not an academic problem. It shows up everywhere in your project:\n\n- **Vault TLS** needs a hostname and a certificate. The certificate could come from a CA. The CA's private key could live in Vault. \u2192 cycle. Resolution: self-signed cert + IP address for Day-1; revisit with a real internal CA when one exists.\n- **External Secrets Operator (ESO)** needs to authenticate to Vault. ESO runs in Kubernetes. The auth token needs to come from somewhere. \u2192 cycle. Resolution: a bootstrap k8s Secret seeded manually once, then ESO migrates to Kubernetes auth method (which uses the cluster's own ServiceAccount JWT as the trust anchor).\n- **ARC** (actions-runner-controller) needs a GitHub App private key. The key could live in Vault. ESO could materialize it. But ARC is what brought up ESO. \u2192 cycle. Resolution: operator-manual `vault kv put`, then ESO materializes, then ARC reads.\n- **AppRole `secret_id`** must live on each Spark for Vault auth. It can't be checked into git. \u2192 cycle. 
Resolution: wrapped-token delivery with 10-minute TTL, operator-driven once per host.\n\nIn every case the cycle is broken the same way: **an operator hand-injects the minimum trust at the smallest possible scope, then immediately reduces their own privilege.** That's it. That's the entire technique. The 1984 paper, the DNSSEC ceremony, the Vault unseal procedure, the GitHub App seeding \u2014 all the same shape, different costumes.\n\nWhat changes is the *cost of getting the hand-injection wrong*:\n\n- Get the Vault unseal shards wrong \u2192 you have locked yourself out of your own Vault. Recovery is impossible if you also lose the data.\n- Get the GitHub App key seeding wrong \u2192 you have a short period of degraded CI. Annoying.\n- Get the SSH key for the operator account wrong \u2192 you go in via the Proxmox console. Mild inconvenience.\n\nThe **discipline** of operating infrastructure is partly: *recognize the bootstrap moments, treat the high-cost ones with ceremony, automate around the low-cost ones, and document the difference.*\n\n## 3. 
The three buckets\n\nThe single most useful mental model in this entire book \u2014 the one piece of conceptual machinery that makes every subsequent chapter legible \u2014 is the three-bucket split:\n\n| Bucket | What it is | Where it lives | How it survives destruction |\n|---|---|---|---|\n| **Identity** | Root keys, unseal shards, signing keys, operator credentials | **Offline / out-of-band** (humans, hardware tokens, sealed envelopes, HSMs) | Cannot be re-derived from anything else; redundancy via N-of-M distribution |\n| **Data** | Actual secrets *inside* Vault, NAS-resident model weights, Raft state, audit logs, database rows | **Backups** (snapshots, replication, off-host copies) | Restored from snapshot; periodicity bounds your recovery point objective |\n| **Config** | Policies, auth methods, KV mount layouts, RBAC roles, network firewall rules, deployment manifests | **Code** (git repo, applied idempotently by Ansible/Terraform/etc.) | Re-applied from `git clone` |\n\nThe rule that makes recovery tractable: **never let any two buckets mix.** Config never embeds secrets. Data never embeds policy. Identity never embeds in code. The moment one bucket leaks into another, you've created a recovery dependency that crosses categories \u2014 and recovery becomes \"you need three things at once,\" which is much harder to rehearse than \"you need three procedures, each independent.\"\n\nA concrete example: it is *very* tempting to write a play that stores a fresh AppRole `secret_id` into a file in the repo. You're saving yourself a manual step! The cost is that you've mixed identity (the `secret_id`) into config (the repo). Now anyone with read access to git can authenticate to Vault. You've made recovery *harder* by making setup easier. 
Every shortcut across bucket lines is a future incident.\n\nThe three-bucket split also gives you a clean answer to the reproducibility question:\n\n> *\"If everything goes to dust, can I rebuild?\"*\n\nThat question has three sub-answers:\n\n- **Identity**: you need it offline, you need redundancy, and you need to have *tested* that 3 of 5 custodians can actually reach you within an hour. If you cannot answer \"yes\" to that, your identity bucket has the same problem GitLab's backups had.\n- **Data**: you need snapshots, you need them on a different blast radius than the source, and you need a *tested* restore procedure. Snapshot exists \u2260 restore works. The compliance frameworks all agree on this point (Chapter 13).\n- **Config**: you need everything in code, the code needs to be applied idempotently, and the same `bash bootstrap.sh` command needs to work on a fresh checkout (Chapter 14).\n\nIf you can answer \"yes, tested\" to all three, you have a recoverable system. If you can only answer \"yes, in theory,\" you have a write-only recovery plan, which is no recovery plan at all.\n\n## 4. Day 0, Day 1, Day 2\n\nThe Kubernetes community popularized a vocabulary that turns out to be the right vocabulary for the whole bootstrap conversation:\n\n- **Day 0** \u2014 *design*. Drawing on a whiteboard, picking which tools, deciding the topology.\n- **Day 1** \u2014 *deploy*. Provisioning the boxes, installing the software, getting it running for the first time.\n- **Day 2** \u2014 *operate*. Everything from then on: upgrades, rotations, drills, incidents, refactors, capacity planning.\n\nThe defining insight: **Day 2 is forever**. Day 0 happens once. Day 1 happens once. Day 2 happens every day until the system is retired. The mistake new engineers make is thinking \"Day 2\" is a separate, optional concern that happens after \"real work\" is done. It is the real work. 
Day 1 is just the first day of Day 2.\n\nThe bootstrap problem replays at every Day 2 cadence, just at a longer time scale:\n\n| Cadence | The cyclic dependency | The resolution |\n|---|---|---|\n| **Initial** (hours) | Vault config needs Vault running | Vault-free `bootstrap.yml` + one-shot init runbook with privileged token, then revoke |\n| **Configuration drift** (continuous) | Vault config needs to evolve; applying config requires Vault auth | Scoped `vault-config-applier` identity + `roles/vault_config/` in code (Chapter 8) |\n| **Backup** (daily) | Vault's own data must be preserved, but the data includes the encryption keys | Raft snapshots, encrypted with Shamir keys, shipped off-host (Chapter 5) |\n| **Disaster recovery** (rare) | Restoring requires the original shards, which lived offline precisely because they couldn't be in any system that might be destroyed | The 5-destination shard distribution survives single-point failures (Chapter 7) |\n| **Improvement** (perpetual) | Bootstrap infra ages; you can't change it without using it | PR-reviewed runbook changes + staging Vault + quarterly drills (Chapter 9) |\n\nThe pattern that resolves every layer is the same: *separate identity from data from config, and verify each independently.* Part II of this book is ten different industries showing you ten different applications of this same pattern. Once you see it, you cannot unsee it.\n\nWith Part I in your head, Part II will feel like coming home.\n\n---\n\n# Part II \u2014 How real companies do it\n\n> Each chapter that follows was produced by a dedicated research agent reading the open web for ~10 minutes, iterating queries, and synthesizing what real practitioners do. The agents were briefed on the three-bucket model and asked to surface concrete procedures, war stories, and actionable lessons \u2014 not vendor marketing. URLs are preserved so you can verify and dig deeper. 
The agents wrote their sections independently; we've added connecting tissue but preserved their reporting verbatim where it works.\n\n---\n\n## 5. Vault in production\n\n> *The pattern in 2-3 sentences:* In real production, Vault is not \"install, init, done\" \u2014 it is a continuously-operated stateful service whose hardest problem is **bootstrapping trust without already having trust**. Mature shops solve this with one of two patterns: (a) **cloud-KMS auto-unseal + short-lived workload identity** for cloud-native fleets, or (b) **Shamir-split keys held offline by humans + a \"trusted orchestrator\" that response-wraps short-TTL AppRole SecretIDs** for on-prem fleets. Everything else (DR, rotation, upgrades) reduces to \"have raft snapshots in a foreign blast-radius, do rolling restarts, and never let one human hold a quorum.\"\n\n### Concrete real-world examples\n\n1. **Adobe** \u2014 The most-documented at-scale Vault Enterprise deployment. By 2 years in, they ran **3 clusters per region, one primary + 11 performance secondaries, ~6 operators, ~130 onboarded teams, ~800M requests/month, p50 ~3 ms, p99 ~14 ms, 99.97% availability**. DR cluster lived in a different cloud (AWS) and a different coast. Auto-unseal triggered by **Rundeck via AWS KMS**, config managed via **Terraform + SaltStack**, monitoring via **Splunk + Prometheus \u2192 Grafana**. They tracked sealed-state, replication-lag, and ops/sec as their three critical signals. ([HashiCorp \u2013 2 Years of Vault Enterprise at Adobe](https://www.hashicorp.com/en/resources/keep-it-secret-safe-everywhere-2-years-vault-enterprise-adobe))\n2. **G-Research** \u2014 Runs **1,000+ Vault namespaces** with self-service onboarding via Jenkins + Kubernetes + Terraform, explicitly to keep the central IAM team out of the loop. Config-as-code is the unblocking pattern.\n3. 
**Roblox** \u2014 Open-sourced their [**vault-cookbook**](https://github.com/Roblox/vault-cookbook), a Chef cookbook that installs Vault as a systemd service, with library resources `vault_config`, `vault_service`, `vault_secret`. Test-Kitchen on CentOS 7/Ubuntu 16.04. They treat Vault as a managed daemon, not a Kubernetes pod.\n4. **Sky Betting & Gaming** \u2014 Developers grab **dynamic credentials** without specifying which; the platform team encodes the mapping. The pattern: humans never copy-paste secrets, ever.\n5. **Hippo Technologies** \u2014 Published [a HashiCorp \"trenches\" piece](https://www.hashicorp.com/en/resources/vault-configuration-as-code-via-terraform-stories-from-the-trenches) on Vault config-as-code via Terraform with one PR per auth-method/mount/policy change and state stored in a separate, hardened backend (Terraform-state-is-a-secret).\n\n### Specific tools / runbooks / cadences\n\n- **Snapshots**: Vault Enterprise has built-in `/sys/storage/raft/snapshot-auto` (1.6+) supporting S3/GCS/Azure Blob. OSS users run the [**Argelbargel/vault-raft-snapshot-agent**](https://github.com/Argelbargel/vault-raft-snapshot-agent) \u2014 defaults to `1h` frequency, supports multiple destinations simultaneously, authenticates via AppRole/Kubernetes/etc. Typical production cadence: **hourly to local + daily to S3 with 365-day retention**.\n- **Auto-unseal**: AWS KMS / GCP Cloud KMS / Azure Key Vault / **Vault Transit** (one Vault unseals another). Trade-off: auto-unseal removes the human-in-the-loop but **recovery keys cannot unseal Vault if the KMS is gone** \u2014 they only authorize root-token generation. So even with auto-unseal, you still print 5 recovery shards and put them in safes.\n- **Upgrades**: There is **no true zero-downtime upgrade** \u2014 expect a few hundred ms during leader step-down. Standard procedure: snapshot \u2192 upgrade followers one-by-one \u2192 step-down leader. 
Vault 1.11+ Enterprise has **Autopilot automated upgrades** that promote newer-version voters and demote older ones.\n- **Config-as-code**: Terraform Vault provider for policies, auth methods, mounts. State backend must be hardened \u2014 it contains secrets-about-secrets.\n- **Secret zero**: Vault Agent with `secret_id_response_wrapping_path`, `secret_id_num_uses=1`, `secret_id_bound_cidrs`, RoleID baked into the AMI/image, wrapped SecretID delivered just-in-time by a trusted orchestrator (Ansible/Jenkins/SaltStack).\n\n### War stories / failure modes\n\n- **Auto-unseal traps people on restore**: HashiCorp [Issue #7595](https://github.com/hashicorp/vault/issues/7595) \u2014 restoring a snapshot from a different cluster fails to unseal because the KMS-wrapped root key in the snapshot doesn't match the new cluster's seal. Lesson: **DR-restore drills must use the same KMS keyring**, or you must rekey before restore.\n- **Random re-sealing every 5\u20137 days** ([Issue #5593](https://github.com/hashicorp/vault/issues/5593)) \u2014 usually a downstream storage hiccup. Watch `vault.core.unsealed` and alert on transitions.\n- **Vault Agent does not survive restart with response-wrapped tokens** ([Issue #16148](https://github.com/hashicorp/vault/issues/16148)) \u2014 the wrap is single-use. Production fix: persist the unwrapped token in a tmpfs sink and re-fetch wrap on systemd restart.\n- **Adobe \"10 million KV pairs\" incident** \u2014 one team accidentally created 10M K/V entries, and \"600\u2013700k tokens at once.\" Vault held up, but the lesson is **per-namespace quotas + chargeback-style usage reports** (Adobe used KV-v2 metadata for cost-center tagging).\n\n### Lessons for any small Vault deployment\n\n1. **Codify the operator-only runbooks now, in detail, with one-line copy-paste blocks.** Adobe and G-Research both confirm: the privileged-init step *must* stay human, but it must also be **shockingly boring to execute** so the operator does not improvise. 
Put the exact `vault operator init -key-shares=5 -key-threshold=3` invocation with the 1Password CLI command beside it.\n2. **Adopt the trusted-orchestrator + response-wrapped SecretID pattern.** RoleID lives in the deployment manifest (non-secret); the SecretID is generated by an Ansible play with `-wrap-ttl=300s`, `secret_id_num_uses=1`, written to a tmpfile, and consumed by Vault Agent. Never check SecretIDs into git.\n3. **Hourly raft snapshots to a foreign blast radius from day one.** Run `vault-raft-snapshot-agent` (OSS) writing to an S3-compatible target *not* on the same host (Backblaze B2, an off-site MinIO, a friend's NAS). Default `1h` cadence, 168-snapshot retention. Test restore quarterly on a throwaway VM.\n4. **Defer auto-unseal until you have a second machine that can be the transit-unseal source.** A single-VM deployment with KMS-on-the-same-host is a circular dependency. Stay on Shamir; rehearse the 3-of-5 unseal quarterly with the actual humans.\n5. **Vault upgrades trigger a documented runbook, never an auto-apply.** Even a 300 ms blip cascades to ESO/ARC reconcile failures. 
Snapshot first, follower-first, leader last.\n\n### Further reading\n\n- [Adobe \u2013 2 Years of Vault Enterprise](https://www.hashicorp.com/en/resources/keep-it-secret-safe-everywhere-2-years-vault-enterprise-adobe) \u2014 the most concrete at-scale numbers public anywhere\n- [Vault AppRole production best-practice patterns](https://developer.hashicorp.com/vault/docs/auth/approle/approle-pattern) \u2014 anti-patterns table is the single most useful page\n- [Secure introduction tutorial](https://developer.hashicorp.com/vault/tutorials/app-integration/secure-introduction) \u2014 trusted-orchestrator + wrap_ttl walkthrough\n- [Vault SOP: Restore](https://docs.hashicorp.com/vault/tutorials/standard-procedures/sop-restore) and [SOP: Upgrade](https://docs.hashicorp.com/vault/tutorials/standard-procedures/sop-upgrade) \u2014 the two runbooks to clone\n- [Argelbargel/vault-raft-snapshot-agent](https://github.com/Argelbargel/vault-raft-snapshot-agent) \u2014 production OSS snapshot daemon\n- [Sealing best practices](https://developer.hashicorp.com/vault/docs/configuration/seal/seal-best-practices) \u2014 auto-unseal + recovery-key guidance\n- [Chris Zembower \u2013 Recommended Patterns for Vault Unseal and Recovery Key Management](https://medium.com/@czembower/recommended-patterns-for-vault-unseal-and-recovery-key-management-d6366a2f4607)\n- [Validated pattern: Vault Agent + AppRole for brownfield apps](https://developer.hashicorp.com/validated-patterns/vault/vault-agent-approle) \u2014 operator-administrator role split\n- [Roblox vault-cookbook](https://github.com/Roblox/vault-cookbook) \u2014 concrete reference for \"Vault as systemd daemon\" vs \"Vault as pod\"\n- [Hippo Technologies \"stories from the trenches\"](https://www.hashicorp.com/en/resources/vault-configuration-as-code-via-terraform-stories-from-the-trenches) \u2014 Terraform-Vault provider in real life, including state-backend hardening\n\n---\n\n## 6. 
Google SRE\n\n> *The pattern in 2-3 sentences:* Google SRE treats every manual, repeatable operator procedure as a *bug in the system* (toil) but accepts that some procedures \u2014 especially those touching root credentials or cold-start dependencies \u2014 cannot be safely automated and must instead be ruthlessly *practiced*. The discipline that holds it together is DiRT (Disaster and Recovery Testing): annual, company-wide, intentionally-disruptive drills that exercise the recovery procedures themselves, including the break-glass paths used to recover the recovery infrastructure. Toil-elimination and drill-the-runbook are not opposites \u2014 they are two sides of the same coin, gated by SLOs/error budgets that force the team to actually do the engineering work.\n\n### Concrete real-world examples\n\n1. **Google DiRT \u2014 annual multi-day company-wide drill.** Founded in 2006, run for ~9 years by Kripa Krishnan. Tests include disconnecting entire data centers, diverting live traffic, bringing services up with known bugs, and explicitly *removing key personnel* to expose knowledge single points of failure. Rules require revert plans, cross-functional review, and VP approval for high-risk tests. ([Weathering the Unexpected, ACM Queue](https://queue.acm.org/detail.cfm?id=2371516); [10 Years of Crashing Google \u2014 LISA15](https://www.usenix.org/conference/lisa15/conference-program/presentation/krishnan))\n2. **Google break-glass credential rehearsal during DiRT.** From *Building Secure and Reliable Systems* ch. 16: \"SREs tested the procedure and functionality of break-glass credentials: could they gain emergency access to the corporate and production networks when standard ACL services were down?\" This is the literal Vault-root-token-during-an-outage problem. ([BSRS ch.16](https://google.github.io/building-secure-and-reliable-systems/raw/ch16.html))\n3. 
**The 50% Rule and the \"if a human touches it, it's a bug\" doctrine.** Google caps SRE toil at 50% of each engineer's time, currently averaging ~33%. The framing: *\"If a human operator needs to touch your system during normal operations, you have a bug.\"* ([SRE Book ch.5 \u2014 Eliminating Toil](https://sre.google/sre-book/eliminating-toil/))\n4. **AWS GameDay + Operational Readiness Review (ORR).** Werner Vogels formalized GameDays as one of six cloud-architecture principles. ORR is a curated checklist distilled from real AWS incidents that every new service must pass before launch, and again periodically. ([REL12-BP05 Conduct game days regularly](https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/rel_testing_resiliency_game_days_resiliency.html); [AWS ORR docs](https://docs.aws.amazon.com/wellarchitected/latest/operational-readiness-reviews/wa-operational-readiness-reviews.html))\n5. **Wheel of Misfortune (Google's low-stakes drill format).** A \"Dungeon Master\" simulates an outage; on-call engineers must walk through diagnosis using real dashboards. Crucially the DM is allowed to declare \"the engineer who knows that is on a plane with no signal\" \u2014 exposing knowledge SPOFs without a real outage. ([Cloud Blog: Shrinking time to mitigate](https://cloud.google.com/blog/products/management-tools/shrinking-the-time-to-mitigate-production-incidents))\n\n### Specific tools / runbooks / cadences\n\n- **DiRT cadence:** Annual company-wide event + continuous smaller scenarios. Test rules require revert plans, comms protocols, and tiered approval.\n- **Wheel of Misfortune:** Weekly to monthly per-team, ~1 hour, no production impact. Templates published.\n- **AWS ORR:** Built into Well-Architected Tool; checklist re-run on every major change, not once.\n- **Error Budget Policy:** When budget burns past threshold, feature freeze until reliability work catches up. 
Forces toil-reducing engineering into the schedule ([SRE Workbook \u2014 Error Budget Policy](https://sre.google/workbook/error-budget-policy/)).\n- **MTTM (Mean Time To Mitigation) as the headline SLI** for incident response, not MTTR \u2014 measured from page-ack to user-impact-stopped.\n\n### War stories / notable failures\n\n- **GitLab 2017** \u2014 the canonical \"five backup methods, none worked\" disaster: pg_dump silently failed, DMARC rejected the alert emails, LVM snapshots weren't enabled on the DB server, replication had stopped. They lost 6 hours of data and recovery took 18 hours over throttled disk. Lesson: *untested backups are not backups, and tested-once is not tested.* ([GitLab postmortem](https://about.gitlab.com/blog/postmortem-of-database-outage-of-january-31/))\n- **Google's own change-induced emergency** (SRE Book, Emergency Response): rollback procedures had *never been tested in a test environment*, so when they were needed in prod they were flawed and extended the outage. Direct quote: *\"Because we hadn't tested our rollback procedures in a test environment, these procedures were flawed.\"* ([SRE Book ch.13](https://sre.google/sre-book/emergency-response/))\n- **DiRT hidden-dependency find:** Krishnan documents an exercise blocking access to \"just one of a hundred\" MySQL shards and discovering services in unrelated parts of Google were hard-coded against that specific shard.\n\n### Lessons for any small operation\n\n1. **Operator-only runbooks are toil that we explicitly accept as toil \u2014 and the price of admission is rehearsal.** Without rehearsal you have a write-only runbook, which is worse than no runbook.\n2. **Test break-glass before you need it.** Schedule a \"Vault is sealed, the operator with the unseal key shards is unreachable, here's the runbook \u2014 go\" drill. If the answer involves logging into something protected by ESO-from-Vault, you've found a circular dependency to fix.\n3. 
**Add an ORR-style checklist gate before each phase completes.** Each phase milestone should have an \"ORR\" sub-task with checklist items derived from the chapter \u2014 *backups tested by restore, not by existence; alerts tested by intentional failure; key personnel removed from one drill per quarter.*\n4. **Define an SLO for the control-plane itself**, not just the workloads on top. \"Vault unseal-to-serving < 15 min after reboot, measured monthly.\" Burn the error budget through a deliberate restart drill. This converts \"did Vault come back?\" from anecdote to data.\n5. **Treat \"the engineer who knows\" as a SPOF.** Run one Wheel-of-Misfortune per month where the person who built the system is in the spectator chair, not the player chair.\n\n### Further reading\n\n- [SRE Book ch.5 \u2014 Eliminating Toil](https://sre.google/sre-book/eliminating-toil/)\n- [SRE Book ch.13 \u2014 Emergency Response](https://sre.google/sre-book/emergency-response/)\n- [Weathering the Unexpected \u2014 ACM Queue (Krishnan, 2012)](https://queue.acm.org/detail.cfm?id=2371516)\n- [10 Years of Crashing Google \u2014 LISA15 (Krishnan)](https://www.usenix.org/conference/lisa15/conference-program/presentation/krishnan)\n- [Cloud Blog: Shrinking time to mitigate](https://cloud.google.com/blog/products/management-tools/shrinking-the-time-to-mitigate-production-incidents)\n- [Building Secure and Reliable Systems, ch.16](https://google.github.io/building-secure-and-reliable-systems/raw/ch16.html)\n- [SRE Workbook \u2014 Error Budget Policy](https://sre.google/workbook/error-budget-policy/)\n- [AWS Operational Readiness Reviews](https://docs.aws.amazon.com/wellarchitected/latest/operational-readiness-reviews/wa-operational-readiness-reviews.html)\n- [AWS REL12-BP05 \u2014 Conduct game days regularly](https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/rel_testing_resiliency_game_days_resiliency.html)\n- [GitLab 2017 
Postmortem](https://about.gitlab.com/blog/postmortem-of-database-outage-of-january-31/)\n- [HashiCorp Vault: Emergency break-glass features](https://developer.hashicorp.com/vault/tutorials/operations/emergency-break-glass-features)\n\n---\n\n## 7. Key ceremonies\n\n&gt; *The pattern in 2-3 sentences:* A **key ceremony** is a scripted, witnessed, video-recorded, tamper-evident event in which a quorum of unaffiliated humans physically come together to generate, activate, or rotate an irreducible root secret stored in (or splittable into) shares \u2014 typically an HSM master key, a CA root private key, or in our case Vault unseal shares. The discipline rests on four pillars: **scripted procedure read aloud**, **dual control / M-of-N quorum**, **tamper-evident physical artifacts** (bags, safes, smartcards with serial numbers), and **independent witnesses + audit logs that outlive every participant**. The whole point: no single human, no single facility, no single moment of compromise can fabricate the root.\n\n### Concrete real-world examples\n\n1. **ICANN DNSSEC Root KSK Ceremony** \u2014 the gold standard. Held quarterly in two geographically-separated facilities (El Segundo, CA and Culpeper, VA). **3-of-7 Crypto Officers** activate the HSM (operational signing); **5-of-7 Recovery Key Share Holders** can reconstruct the Storage Master Key if disaster recovery is needed. Officers are unaffiliated \"Trusted Community Representatives\" \u2014 not ICANN/PTI/Verisign employees. Each ceremony follows a pre-published script, is filmed on multiple cameras (live-streamed), witnessed by external auditors, and produces hash-chained audit logs retained for 10 years. Cadence: Q1/Q2/Q3/Q4, no later than 33 days before each cycle. Ceremony scripts and videos are published publicly, as is the DNSSEC Practice Statement (DPS).\n\n2. **Let's Encrypt Root Ceremonies** \u2014 \"Generation Y\" hierarchy (ISRG Root YR RSA-4096 / ISRG Root YE ECDSA P-384) generated September 2025. 
Let's Encrypt publishes **ceremony configs in git before the ceremony**, using their open-source `boulder` ceremony tool. Pre-ceremony previews are posted to the Mozilla `dev-security-policy` list for community scrutiny. This is the modern \"ceremony-as-code\" pattern: pinned tool version, declarative YAML inputs, reproducible outputs.\n\n3. **Public CA root ceremonies under CA/Browser Forum rules** \u2014 every WebTrust-audited CA (DigiCert, Sectigo, GlobalSign, Entrust) follows the same skeleton mandated by Ballot SC40 (\"Security Requirements for Air-Gapped CA Systems\"): offline root, FIPS 140-2 L3 HSM, M-of-N smartcard activation, tamper-evident bags, dual-tier physical access, signed video recording. Each publishes a Certificate Policy / Certification Practice Statement (CP/CPS) describing the procedure to the byte.\n\n4. **Code-signing HSMs (Microsoft / Apple / Google Play)** \u2014 since June 2023 the CA/Browser Forum mandates code-signing private keys be generated and held only on FIPS 140-2 L3+ HSMs (Azure Dedicated HSM, AWS CloudHSM, Google Cloud HSM). Initial generation requires a Key Generation Ceremony (KGC); a \"Bring Your Own Auditor\" can witness, but the report must cover every mandated step or the ceremony is invalidated and must be redone.\n\n5. **AWS CloudHSM / Thales Luna / nCipher nShield smartcard ceremonies** \u2014 the operational primitive most enterprises copy. Luna's PED (PIN Entry Device) ceremony splits the HSM activation secret across N \"blue/black/red\" smartcards using Shamir; nShield uses ACS/OCS card sets. Both are the direct architectural ancestors of Vault's Shamir scheme.\n\n### Specific tools / runbooks / cadences\n\n- **Pre-published scripts**: ICANN's quarterly scripts (PDF, signed) \u2014 every word the Ceremony Administrator will read is fixed weeks ahead. 
\n- **Ceremony-as-code**: Let's Encrypt's `ceremony-demos` repo pins boulder tool versions via `CEREMONY_BIN_YYYY` env vars; configs are YAML; outputs are publicly diffable.\n- **Tamper-evident kits**: numbered/signed evidence bags (e.g., TydenBrooks 7000 series), safes with dual-keyed locks (CO key + Safe Custodian key \u2014 neither alone suffices), HSM smartcards in individual sealed envelopes.\n- **Cadence**: ICANN quarterly; KSK rollover every ~5 years; Vault quarterly **unseal drills** are HashiCorp's explicit recommendation.\n- **Witness logs**: paper logbook, scribe transcript, video \u2014 three independent streams so any single tampering is detectable.\n\n### War stories / notable failures\n\n- **DigiNotar (2011)** \u2014 total CA compromise; **531+ fraudulent certs** (including \\*.google.com used to MITM Iranian users). Root cause cascade: single admin account spanning all 8 CA servers, **no central logging**, no critical-system segregation, and log files lived on the same compromised hosts so forensics couldn't reconstruct what was issued. DigiNotar was distrusted by every major browser within weeks and went bankrupt. Documented in the Black Tulip report. **Lesson:** logs that live on the audited system are not logs.\n- **Comodo (2011)** \u2014 an RA partner (InstantSSL.it) with plaintext credentials in CSR submission allowed 9 fraudulent certs (mail.google.com, login.yahoo.com, addons.mozilla.org). Comodo had outsourced trust to RAs whose security it didn't oversee. **Lesson:** delegated trust without continuous attestation is no trust.\n- **SSLMate's CA failure timeline** is the single best survey of what goes wrong: misissuance, weak validation, EOL crypto, lost private keys.\n\n### Lessons for a Vault Shamir 3-of-5 ceremony\n\n1. 
**Write the script before the ceremony, commit it to git, read it aloud.** Treat the unseal ceremony like ICANN: every action the operator will take is pre-typed in `runbooks/vault-init.md`, including verbatim verification lines (\"Operator reads aloud: 'Share 1 envelope serial XXXXX, seal intact, signed by custodian Alice'\"). The script is reviewed in PR before it's executed.\n2. **Quarterly unseal drills.** HashiCorp explicitly recommends this. Schedule a calendar reminder; rehearse with all 5 custodians; record times-to-quorum. This is also the only way to catch \"custodian Alice changed jobs and lost her share\" before a real outage.\n3. **PGP-encrypt the shares at `vault operator init` time** using each custodian's published GPG key (`-pgp-keys=alice.asc,bob.asc,...`). The operator who runs `init` never sees a plaintext share. This closes Vault's documented init-time exposure window where shares are printed to one human's terminal.\n4. **Tamper-evident physical storage + geographic distribution.** Shares in numbered tamper-evident bags inside a sealed envelope inside a safe; **no two shares in the same building**. Mirror ICANN's logic: a fire, a subpoena, or a coercion event against one site cannot reach quorum.\n5. **Two-stream audit log.** Write actions to (a) a paper logbook signed by the operator + an independent witness present in the room or on video, and (b) a video recording stored outside the controller VM. 
DigiNotar's lesson: logs on the audited system are not logs.\n\n### Further reading\n\n- ICANN Root KSK ceremony scripts, videos, and audit packages (the gold standard reference)\n- DNSSEC Practice Statement\n- The actual ~100-page procedure document\n- Accessible explainer for someone new to ceremonies\n- Let's Encrypt ceremony configs (the \"ceremony-as-code\" reference)\n- Example of community pre-review\n- CA/Browser Forum air-gapped CA requirements\n- Black Tulip / DigiNotar postmortem\n- Comprehensive CA failure timeline\n- HashiCorp's official Shamir best practices\n\n---\n\n## 8. GitOps for stateful infrastructure\n\n&gt; *The pattern in 2-3 sentences:* The mature pattern treats Vault like any other stateful database: a one-time operator-driven `init` (secret zero) followed by **declarative config reconciliation** \u2014 policies, auth methods, KV mounts, and roles live in git, applied by either Terraform (`hashicorp/vault` provider) or Ansible (`community.hashi_vault`) under a **scoped admin token**, never root. Consumers (workloads, ESO, ARC) never touch the operator path \u2014 they use **Kubernetes auth** so the Kubernetes service-account JWT becomes \"secret zero,\" eliminating the chicken-and-egg of storing a Vault token in a K8s Secret.\n\n### Concrete real-world examples\n\n1. **HashiCorp Validated Patterns \u2014 \"Define Vault policies with HCP Terraform\"** is the canonical reference. A Terraform workspace whose state holds policy HCL, auth method config, and KV mounts; the apply runs with a Vault token bound to a scoped admin policy (not root) and is gated by PR review. URL: https://developer.hashicorp.com/validated-patterns/vault/manage-vault-with-terraform\n\n2. 
**Celonis engineering** wrote a detailed postmortem on a multi-cluster ESO rollout: 0.9.1 \u2192 0.16.2 broke because the new ESO dropped `externalsecrets/v1alpha1` while Argo CD still pushed v1alpha1 manifests, and rollback was impossible because storage-version conversion had already mutated some objects. They now version-pin ESO per-cluster and stage CRD-bearing upgrades. https://careers.celonis.com/blog/updating-crds-through-breaking-changes\n\n3. **`ncorrare/vault-policy`** (GitHub) \u2014 a public reference repo showing the AppRole-scoped CI pattern: every PR runs `vault policy fmt` + `vault policy write -name=test`, plan/diff is posted as a PR comment, and merge to `main` triggers `vault policy write` with a short-lived token. https://github.com/ncorrare/vault-policy\n\n4. **Codefresh blog** (a real platform team's writeup) layers Argo CD + Vault + ESO in production: Argo manages workload manifests, ESO syncs `ExternalSecret` CRs from Vault paths, and Vault's *own* config (policies, auth methods) is managed by a separate Terraform workspace. https://codefresh.io/blog/gitops-secrets-with-argo-cd-hashicorp-vault-and-the-external-secret-operator/\n\n5. **Container Solutions / b-nova / Verifa** consultancy writeups consistently show ESO Kubernetes-auth setups where the cluster's API server signs the JWT that ESO presents to Vault, and Vault's `token_reviewer_jwt` validates it via `system:auth-delegator`. This is the **secret-zero-via-Kubernetes-trust** pattern. https://verifa.io/blog/comparing-methods-for-accessing-secrets-in-vault-from-kubernetes/\n\n### Specific tools / runbooks / cadences\n\n- **Terraform `hashicorp/vault` provider** v4.x: resources `vault_policy`, `vault_auth_backend`, `vault_kubernetes_auth_backend_role`, `vault_kv_secret_backend_v2`. Run in CI via `terraform plan` \u2192 PR comment \u2192 manual approve \u2192 `terraform apply`. 
State file lives in S3/GCS with locking.\n- **Ansible `community.hashi_vault` collection** for shops without Terraform: use `vault_write` with `changed_when: false` for known-idempotent writes, since most write modules can't guarantee idempotence \u2014 this is documented and is the single biggest landmine. https://docs.ansible.com/projects/ansible/latest/collections/community/hashi_vault/vault_write_module.html\n- **Drift cadence**: nightly `terraform plan` in CI; non-empty plan posts to Slack and opens a \"drift\" issue. Several teams run `pre-commit` hooks (`terraform fmt`, `vault policy fmt`).\n- **CRD upgrade ritual for ESO**: stage per-cluster, never trust `helm upgrade --reuse-values` for CRD bumps, snapshot etcd before storage-version migrations.\n- **Vault auth methods**: enable `kubernetes` first (for in-cluster workloads + ESO), `approle` for CI runners (short TTLs, response wrapping), keep `userpass` disabled in prod.\n\n### War stories / notable failures\n\n- **Celonis ESO CRD upgrade** (above) \u2014 the canonical \"GitOps tool ate its own state\" failure. Lesson: storage-version conversion is one-way; you cannot trivially downgrade once the operator has rewritten objects.\n- **`terraform-provider-vault` issue #1907**: in 3.16.0+ the provider validates the token at *init* time, so the classic \"bootstrap Vault and configure it in one `terraform apply`\" idiom broke silently. Many teams' Day-1 pipelines snapped overnight on a minor-version bump. https://github.com/hashicorp/terraform-provider-vault/issues/1907\n- **`vault_generic_endpoint` perpetual drift** (issue #842) \u2014 this resource always reports changes; teams who used it for \"everything not yet a typed resource\" got noisy plans and learned to wrap it with `lifecycle { ignore_changes = [...] }` or graduate to typed resources.\n- **ESO operator hang** (kubernetes-external-secrets #362): the sync loop silently stops, no errors, metric flatlines. 
Lesson: alert on `external_secrets_sync_calls_total` rate, not just on operator pod liveness.\n\n### Lessons for our project\n\n1. **Add a `roles/vault_config/` Ansible role** that owns policies, auth methods, KV mounts, and Kubernetes-auth roles \u2014 kept distinct from `roles/vault/` which owns the binary and listener config. The config role is the *only* thing that should run on Day-2; the server role is frozen after Day-1.\n\n2. **Create a scoped `controlplane-configurer` Vault policy** \u2014 `auth/*`, `sys/policies/acl/*`, `sys/mounts/*` with `create,read,update,delete,list` but **no** `read` on `kv/*` data paths. Generate a token with this policy, store it in 1Password, export it as `VAULT_TOKEN` in CI / on operator's laptop. Document the rotation cadence in `runbooks/vault-configurer-rotate.md`.\n\n3. **Every `community.hashi_vault.vault_write` task must have explicit `changed_when:`** (lint rule). Until then we're flying blind on idempotence.\n\n4. **k3s workloads (ARC, ESO) authenticate via Kubernetes auth**, never AppRole, never static token. The `kubernetes_host` and `token_reviewer_jwt` config are managed by the `vault_config` role; the bound role's policies live as files under `roles/vault_config/files/policies/`.\n\n5. **Nightly drift check**: a CI job runs `ansible-playbook playbooks/vault_config.yml --check --diff` against the live Vault and fails loudly if diff is non-empty. 
This is our equivalent of `terraform plan` drift detection.\n\n### Further reading\n\n- https://developer.hashicorp.com/validated-patterns/vault/manage-vault-with-terraform \u2014 HashiCorp's own validated pattern for codifying Vault\n- https://developer.hashicorp.com/vault/docs/configuration/programmatic-best-practices \u2014 official \"don't use root token\" doctrine\n- https://external-secrets.io/latest/provider/hashicorp-vault/ \u2014 ESO's Vault provider reference (read the Kubernetes-auth section carefully)\n- https://external-secrets.io/latest/examples/gitops-using-fluxcd/ \u2014 FluxCD + ESO worked example\n- https://www.hashicorp.com/en/blog/solving-secret-zero-with-vault-and-openshift-virtualization \u2014 HashiCorp's framing of the secret-zero problem\n- https://careers.celonis.com/blog/updating-crds-through-breaking-changes \u2014 the ESO CRD-upgrade postmortem; required reading before any ESO bump\n- https://github.com/hashicorp/terraform-provider-vault \u2014 the provider, plus its issue tracker which is a goldmine of gotchas\n- https://github.com/ncorrare/vault-policy \u2014 minimal reference of policies-in-git with CI\n- https://docs.ansible.com/projects/ansible/latest/collections/community/hashi_vault/index.html \u2014 the Ansible collection; note idempotency caveats on every write module\n- https://github.com/jeffsanicola/vault-policy-guide \u2014 opinionated guide to policy authoring\n- https://verifa.io/blog/comparing-methods-for-accessing-secrets-in-vault-from-kubernetes/ \u2014 side-by-side of VSO vs ESO vs Agent Injector\n\n---\n\n## 9. Chaos\n\n&gt; *The pattern in 2-3 sentences:* A disaster recovery plan you have never executed is a fiction. 
The chaos-engineering and game-day discipline \u2014 pioneered at Netflix, formalized at Google, and now standard across AWS, Stripe, and Shopify \u2014 says: schedule deliberate, time-boxed failures against production-like systems on a recurring cadence, document what breaks (including the runbooks and the humans), and turn every surprise into a fix before it surprises you in the middle of the night.\n\n### Concrete real-world examples\n\n1. **Netflix \u2014 Chaos Monkey / Simian Army (2011 \u2192 today).** Originated the practice. Chaos Monkey randomly terminated EC2 instances *in production* during business hours so engineers had to design for instance failure. The original SimianArmy repo is archived; Chaos Monkey lives on as a Spinnaker-integrated standalone tool. The broader family \u2014 Chaos Gorilla (kill an AZ), Chaos Kong (kill a region), Latency Monkey, Conformity Monkey \u2014 encoded the idea of progressively wider blast radii. https://netflix.github.io/chaosmonkey/ and https://github.com/Netflix/SimianArmy.\n\n2. **Google \u2014 DiRT (Disaster Recovery Testing), since 2006.** Kripa Krishnan's \"Weathering the Unexpected\" (CACM, 2012) describes an *annual, company-wide, multi-day* event. Two characteristic rules: (a) tests run against live systems, and (b) \"critical personnel, area experts, and leaders\" are explicitly excluded so the bench gets exercised. Scenarios range from datacenter loss to \"the person who knows how to do X is on a plane.\" https://cacm.acm.org/magazines/2012/11/156583-weathering-the-unexpected/fulltext and the LISA15 retrospective \"10 Years of Crashing Google\" https://www.usenix.org/conference/lisa15/conference-program/presentation/krishnan.\n\n3. 
**Stripe \u2014 `kill -9` on Redis primary.** Stripe's public write-up \"Game Day Exercises at Stripe\" describes a fraud-scoring service whose three-node Redis cluster *lost all data* when they killed the primary, because a config change made the day before turned out to be incompatible with snapshotting. Found in a planned afternoon, not at 3 a.m. Their quick-start checklist: dev team + someone with network/provisioning rights + a blocked afternoon. https://stripe.com/blog/game-day-exercises-at-stripe.\n\n4. **Shopify \u2014 BFCM Game Days.** Shopify runs Game Days every spring through fall against the analytics pipeline and \"Critical Journey\" endpoints in preparation for Black Friday/Cyber Monday. 2024 surfaced: Kafka partition counts insufficient for spikes, API memory leaks, connection-pool exhaustion. https://shopify.engineering/bfcm-readiness-2025.\n\n5. **AWS \u2014 GameDay framework, codified in the Well-Architected Reliability Pillar (REL12-BP05).** AWS's \"Build Your Own Game Day\" prescribes a 5-stage method: identify key services \u2192 map people/process/tech \u2192 define and run failure scenarios \u2192 observe \u2192 retrospective. Tooling: AWS Fault Injection Simulator (FIS), CloudWatch, X-Ray. https://docs.aws.amazon.com/wellarchitected/latest/framework/rel_testing_resiliency_game_days_resiliency.html and https://aws.amazon.com/blogs/architecture/build-your-own-game-day-to-support-operational-resilience/.\n\n### Specific tools, runbooks, cadences\n\n- **Principles (the canonical text):** https://principlesofchaos.org/ \u2014 four steps (steady-state \u2192 hypothesis \u2192 inject variable \u2192 try to disprove); advanced principles emphasise *minimize blast radius* and *prefer production*.\n- **Kubernetes-native injection:** Chaos Mesh (CNCF, CRD-driven: PodChaos, NetworkChaos, StressChaos, IOChaos) https://chaos-mesh.org/; LitmusChaos with its ChaosHub of pre-built experiments https://litmuschaos.io/. 
Both work on k3s.\n- **Vault DR Fire-Drill:** HashiCorp's documented procedure \u2014 back up primary, demote, promote secondary using DR operation tokens, redirect traffic, then fail back. https://support.hashicorp.com/hc/en-us/articles/4408969695763-How-to-perform-vault-DR-Fire-Drill. Snapshot/restore SOP: https://developer.hashicorp.com/vault/docs/sysadmin/snapshots/restore.\n- **Gremlin's GameDay roles** (Owner / Coordinator / Reporter / Observer): https://www.gremlin.com/community/tutorials/how-to-run-a-gameday \u2014 a tight 4-role model that maps cleanly to small teams.\n- **Typical session length:** 2\u20136 hours. Plan ~1 month in advance, calendar-block the team, retro afterwards.\n\n### War stories\n\n- **Stripe Redis (above):** the lesson wasn't \"kill -9 is dangerous\" \u2014 it was that two safe-looking config changes (snapshotting + replication tweak) combined into data loss only under the exact failover the runbook claimed to support.\n- **Google DiRT, recurring finding:** when the on-call expert is \"on a plane,\" the documented runbook frequently turns out to be in the head of the person who is unreachable. This is *why* DiRT bans experts from participating.\n- **Shopify analytics:** Kafka partition counts that were \"fine\" passed every load test but starved consumers at BFCM peak \u2014 only discovered because a Game Day simulated peak + a failed broker simultaneously.\n- **AWS Well-Architected war games:** teams running \"lose production account\" scenarios discovered their CloudFormation templates *did* let them rebuild in another region \u2014 but their runbooks and SOPs lagged the templates by months.\n\n### Lessons for our project\n\n1. **Quarterly cadence, not annual.** Google's annual DiRT works because they have a chaos team; we don't. Quarterly keeps muscle memory fresh and matches typical Vault unseal-key rotation and runner-image rebuild cycles. Each drill: ~half a day.\n2. 
**Start with a tabletop, escalate to live fire over 2 quarters.** Tabletops are cheap and surface \"who does what\" gaps; live exercises surface the gaps tabletops can't (silent dependencies, stale runbooks). Source: https://uptimelabs.io/learn/tabletop-vs-live-incident-response/.\n3. **First live drill: Vault snapshot restore on a throwaway VM.** Smallest blast radius, highest payoff \u2014 Vault losing data takes the whole control plane down. Follow the documented fire-drill procedure but on a *cloned* VM, not production. Confirm `runbooks/vault-restart.md` reflects reality.\n4. **Second drill: kill the k3s node, watch ARC.** Power off the Proxmox VM. Time how long until runners are unreachable, until alerts fire, until the operator-only `arc-github-app-seed.md` would be re-needed. Use Chaos Mesh `PodChaos` for finer-grained drills once the basics work.\n5. **Bench the expert.** Borrow Google's rule: rotate who runs the drill. If only one person knows how to recover Vault, that's the bug \u2014 not the answer. Capture every \"huh, that's odd\" moment as a `PITFALLS.md` entry the same day.\n6. **Bound blast radius with a written rollback.** Every drill ticket must include the exact `git revert` / `vault operator raft snapshot restore` / `ansible-playbook --tags=\u2026` rollback command before the drill starts. Owner has the kill switch.\n\n### Further reading\n\n- https://principlesofchaos.org/ \u2014 the canonical 4-step / 5-principle definition. 
Read first.\n- https://stripe.com/blog/game-day-exercises-at-stripe \u2014 short, concrete, \"we lost all our Redis data and were grateful\" story.\n- https://cacm.acm.org/magazines/2012/11/156583-weathering-the-unexpected/fulltext \u2014 Krishnan on DiRT's philosophy and rules of engagement.\n- https://www.usenix.org/conference/lisa15/conference-program/presentation/krishnan \u2014 \"10 Years of Crashing Google,\" LISA15 talk video + slides.\n- https://docs.aws.amazon.com/wellarchitected/latest/framework/rel_testing_resiliency_game_days_resiliency.html \u2014 AWS REL12-BP05, the most prescriptive enterprise checklist.\n- https://aws.amazon.com/blogs/architecture/build-your-own-game-day-to-support-operational-resilience/ \u2014 AWS's 5-stage build-your-own framework.\n- https://www.gremlin.com/community/tutorials/how-to-run-a-gameday \u2014 Owner/Coordinator/Reporter/Observer role split.\n- https://shopify.engineering/bfcm-readiness-2025 \u2014 large-scale game-day program tied to a known peak event.\n- https://support.hashicorp.com/hc/en-us/articles/4408969695763-How-to-perform-vault-DR-Fire-Drill \u2014 Vault-specific drill steps.\n- https://chaos-mesh.org/ and https://litmuschaos.io/ \u2014 CNCF chaos tooling that runs natively on k3s.\n- https://github.com/dastergon/awesome-chaos-engineering \u2014 curated index of further reports, papers, and tools.\n\n---\n\n## 10. Kubernetes\n\n&gt; *The pattern in 2-3 sentences:* Kubernetes has the same circular dependency Vault does: the cluster's source of truth (etcd) holds the credentials, the credentials authenticate kubeadm/kubelets that talk to the API server, and the API server lives on top of etcd. 
The community resolves this with **(a) bootstrap tokens** \u2014 short-lived, prefix-identified bearer tokens that establish initial bidirectional trust between joining nodes and the control plane; and **(b) a \"bootstrap-and-pivot\" pattern** in Cluster API, where a throwaway local cluster (kind) creates the real management cluster, then transfers its CAPI objects into the target and self-destructs.\n\n### Concrete real-world examples\n\n1. **kubeadm bootstrap tokens** \u2014 the canonical primitive. Format `[a-z0-9]{6}.[a-z0-9]{16}` (e.g., `abcdef.0123456789abcdef`), 24-hour TTL by default. `kubeadm token create --ttl 2h --print-join-command` produces the join string; the discovery half uses a CA pubkey hash so the joiner can't be phished by a fake API server.\n\n2. **Cluster API (CAPI) bootstrap-and-pivot** \u2014 kubernetes-sigs/cluster-api's accepted workflow. Create a local kind cluster \u2192 install CAPI controllers \u2192 declare your management cluster as a CR \u2192 `clusterctl move` pivots all CAPI objects into the freshly-built cluster \u2192 tear down kind. Scott Lowe's writeup is the rookie-readable version.\n\n3. **Rancher backup-restore-operator** \u2014 Helm-installable operator that backs up *the Rancher app's CRs* (clusters, users, settings) to S3, independent of any single downstream cluster's etcd. Lets you nuke the local cluster and redeploy Rancher onto a fresh one.\n\n4. **Velero** \u2014 the de-facto \"Kubernetes-resource-aware\" backup tool. Backs up cluster API objects (selectively, by namespace/label) + PV data via CSI snapshots or restic/kopia. Crucially complementary to etcd snapshots: etcd snapshots protect against infrastructure failure; Velero protects against fat-fingered deletes and lets you *restore into a different cluster*.\n\n5. **k3s embedded-etcd snapshot to S3** \u2014 built-in. `k3s server --etcd-s3 --etcd-s3-bucket=... --etcd-snapshot-schedule-cron='0 */12 * * *' --etcd-snapshot-retention=5` runs automatic snapshots. 
Restore is `systemctl stop k3s &amp;&amp; k3s server --cluster-reset --cluster-reset-restore-path=` then start normally.\n\n### Specific tools / runbooks / cadences\n\n- **etcd snapshots:** `ETCDCTL_API=3 etcdctl --endpoints=$EP --cacert=ca.crt --cert=server.crt --key=server.key snapshot save snap.db`. Recommended cadence: every 30 min for high-write clusters, every 6\u201312h for low-churn (our case). Test restore monthly in a sandbox.\n- **k3s snapshots:** scheduled snapshots are **on by default** at 00:00 and 12:00, retention 5. Add `--etcd-s3*` flags to replicate offsite.\n- **Velero schedules:** `velero schedule create daily-full --schedule=\"@daily\" --ttl=720h0m0s` is the standard incantation. Combine with `--include-namespaces` to scope.\n- **CNCF \"Surgeon's Handbook\":** five-step surgical restore \u2014 restore snapshot to a *local* etcd, `etcdctl get /registry/configmaps//`, decode with `auger`, validate, `kubectl apply --dry-run=server`. Recover individual resources without nuking the cluster.\n- **Server token discipline:** k3s encrypts bootstrap data inside the snapshot with the server token (`/var/lib/rancher/k3s/server/token`). **Lose the token, lose the snapshot.** Back up the token file alongside every snapshot.\n\n### War stories and notable failures\n\n- **etcd v3.5.0\u2013v3.5.2 data inconsistency** (2022) \u2014 silent divergence between members. CNCF advisory: do not run these versions; `--experimental-initial-corrupt-check` now default in v3.6.\n- **Monzo 2017 outage** \u2014 etcd + Linkerd interaction killed the payments platform; presented at KubeCon.\n- **Grafana Cloud Hosted Prometheus, 2020** \u2014 stuck TCP connection to a dead etcd master broke the cluster despite quorum.\n- **Tesla 2018 kubeconfig leak** \u2014 exposed Kubernetes dashboard let attackers mine crypto on Tesla's AWS bill. 
The canonical \"do not leave admin kubeconfig lying around\" story.\n- **Power-outage etcd corruption** \u2014 repeatedly reported (kubernetes/kubernetes#128735, etcd-io/etcd#18881). Single-node etcd + dirty shutdown = WAL corruption. Recovery requires snapshot restore; live data is gone.\n- **Rancher k3s S3-restore wipes cluster bug** \u2014 rancher/rancher#42251 \u2014 restoring from S3 in some k3s versions wiped the cluster cleanly when token mismatched. Hard-won lesson: **token != snapshot key != optional**.\n\n### Lessons for our single-node k3s + ARC stack\n\n1. **Embedded etcd, not SQLite.** k3s defaults to SQLite for single-node. Switch to embedded etcd (`--cluster-init`) so `k3s etcd-snapshot save` and the `--cluster-reset` restore flow are available. The added complexity is worth the production-grade DR story.\n2. **Snapshots to S3 from day one, with the token.** Add `--etcd-s3 --etcd-s3-bucket=dgx-k3s-snapshots` to the k3s systemd unit. **Separately back up `/var/lib/rancher/k3s/server/token`** \u2014 to Vault KV or 1Password \u2014 because the snapshot is useless without it. Document in a `runbooks/k3s-restore.md`.\n3. **Velero on top of etcd snapshots.** Two layers: etcd snapshots for \"the box died\" disasters; Velero for \"Claude deleted the ARC namespace at 2am\" disasters. Velero backups go to the same S3 bucket, different prefix.\n4. **Bootstrap the admin kubeconfig like a Vault root token.** Treat `/etc/rancher/k3s/k3s.yaml` as a break-glass credential: encrypt at rest, never commit, rotate after every operator use. Daily work uses a `cluster-admin` ServiceAccount kubeconfig.\n5. **Quarterly restore drill.** Restore the latest S3 snapshot into a kind cluster on the operator's laptop and confirm ARC's RunnerScaleSets, ESO, and Vault references all reconcile. Untested backups are theatre. 
Add a `runbooks/k3s-restore-drill.md` and put a calendar reminder in `tasks.json`.\n\n### Further reading\n\n- Authoritative spec for kubeadm bootstrap tokens\n- Official \"operating etcd for k8s\" guide\n- The etcd project's own DR doc, including multi-member rebuild\n- k3s-specific backup/restore\n- Full reference for `k3s etcd-snapshot` flags\n- CAPI quickstart; shows the kind-bootstrap \u2192 pivot flow\n- Readable walkthrough of \"chicken-and-egg\" pivot\n- Velero's own DR playbook\n- Surgical single-resource recovery from etcd snapshots\n- Curated index of public postmortems\n\n---\n\n## 11. Workload identity\n\n&gt; *The pattern in 2-3 sentences:* \"Secret zero\" is the bootstrap credential a workload needs to fetch all its other credentials \u2014 and if it leaks, the whole castle falls. The workload-identity movement (SPIFFE/SPIRE, AWS IRSA, GCP WIF, Azure Workload ID, Vault's Kubernetes auth method) eliminates that bootstrap secret by deriving identity from **cryptographically-verifiable platform attestation** (TPM measurements, cloud-metadata APIs, signed Kubernetes ServiceAccount JWTs) instead of distributing a shared password. The workload proves *what it is* \u2014 and the identity infrastructure issues a short-lived SVID/token in response, with no human-managed key ever touching disk.\n\n### Concrete real-world examples\n\n1. **Uber** runs SPIRE across **~250,000 nodes and 4,500 services** spanning GCP, OCI, AWS and on-premise. SPIRE agent ships in their \"golden\" OS image; node attestation uses container-launcher trust + node aliases; an LRU cache let them fit **2.5\u00d7 more workloads per host group** while cutting server CPU 40%. [Uber Engineering Blog (2023)](https://www.uber.com/us/en/blog/our-journey-adopting-spiffe-spire/)\n2. 
**Bloomberg** authored the open-source [`spire-tpm-plugin`](https://github.com/bloomberg/spire-tpm-plugin) using TPM 2.0 credential activation \u2014 node identity is rooted in the TPM endorsement key, signed by the manufacturer CA. Presented at KubeCon NA 2020 / SPIFFE Production Identity Day.\n3. **Square** uses SPIFFE/SPIRE to issue mTLS identities to payment-processing services *and* AWS Lambda functions across their hybrid infrastructure \u2014 the same trust domain spans VMs, containers, and serverless.\n4. **Google Cloud** has [standardized on SPIFFE across GCP](https://github.com/spiffe/spire/blob/main/ADOPTERS.md) as the unified workload-identity platform; Anthropic, Netflix, Pinterest, ByteDance/TikTok, GitHub, Twilio, Niantic and Duke Energy (microgrid TPM attestation) are listed adopters.\n5. **AWS IRSA** is the de-facto pattern on EKS: pods get an OIDC-signed ServiceAccount JWT, hand it to STS `AssumeRoleWithWebIdentity`, and receive short-lived IAM credentials \u2014 **no static AWS keys in any pod**. [AWS docs](https://docs.aws.amazon.com/eks/latest/userguide/iam-roles-for-service-accounts.html).\n\n### Specific tools, runbooks, cadences\n\n- **SPIRE server + agent**, one agent per node, communicating with workloads via a **Unix Domain Socket** (`/run/spire/agent.sock`) \u2014 no network auth needed because UDS peer credentials prove the caller's PID, which workload attestors then map to a pod/process. [SPIRE Concepts](https://spiffe.io/docs/latest/spire-about/spire-concepts/)\n- **Node attestors** (pick one): `k8s_psat` (projected ServiceAccount tokens validated against the K8s TokenReview API), `aws_iid`, `gcp_iit`, `tpm_devid` (Bloomberg plugin), `join_token` (one-shot, manual \u2014 last resort).\n- **Workload attestors**: `unix` (UID/GID/binary path), `k8s` (queries kubelet for pod metadata), `docker`.\n- **SVIDs**: X.509 certs (default ~1 hr TTL) or JWT-SVIDs. 
Workloads re-fetch automatically before expiry \u2014 agents keep a cache so brief server outages don't kill the fleet.\n- **Cadence**: trust-bundle / root-CA rotation requires planned coordination (especially in federated trust domains); test in staging. [SPIFFE scaling docs](https://spiffe.io/docs/latest/planning/scaling_spire/)\n- **Vault K8s auth**: pod presents its projected SA token to Vault \u2192 Vault calls `TokenReview` on the K8s API \u2192 returns short-lived Vault token. Bind a \"reviewer\" SA with the **`system:auth-delegator`** ClusterRole. [Vault K8s auth docs](https://developer.hashicorp.com/vault/docs/auth/kubernetes)\n\n### War stories &amp; notable failures\n\n- **The bootstrap chicken-and-egg with Vault K8s auth**: when Vault lives *outside* the cluster (the exact topology of our project \u2014 Proxmox VM, external k3s), you must hand Vault a long-lived `token_reviewer_jwt` so it can call TokenReview. That long-lived token *is* a small secret zero. HashiCorp explicitly calls this out: [\"External Vault Kubernetes Auth\" discuss thread](https://discuss.hashicorp.com/t/external-vault-kubernetes-auth/36318). Mitigation: rotate it on a schedule, or run Vault inside the cluster so the projected SA token is used automatically.\n- **SPIRE \"trust the platform\" caveat**: standard attestors (`k8s_psat`, `aws_iid`, `join_token`) trust the orchestrator \u2014 if an attacker compromises the K8s API server or EC2 metadata, they can register rogue agents. Only **TPM/TEE-backed attestation** (Bloomberg's plugin, Red Hat's Keylime+SPIRE integration) closes that gap: [Red Hat \u2014 SPIFFE/SPIRE + Keylime](https://next.redhat.com/2025/01/24/spiffe-spire-and-keylime-software-identity-based-on-secure-machine-state/).\n- **\"Almost no one can afford to build it right\"**: Aembit and others note that successful SPIRE adopters (Uber, Indeed, Macquarie Bank) dedicated **platform-engineering teams** for months and upstreamed fixes. 
A 2-VM homelab is the wrong scale for full SPIRE.\n- **Single SPIRE server = SPOF**: must run nested topology or HA replicas; DB read-replicas are how Uber scaled.\n\n### Lessons for our stack\n\n1. **Yes \u2014 use Vault's Kubernetes auth method for ESO and ARC, not static tokens.** Pods present their projected SA JWT; Vault validates via TokenReview. This is the smallest, most idiomatic step away from secret zero we can take *today*, and it's what ESO docs recommend for production.\n2. **Accept the one residual secret: the `token_reviewer_jwt`.** Because Vault lives in a Proxmox VM outside k3s, document its rotation in a runbook (`docs/runbooks/vault-token-reviewer-rotate.md`) on a 90-day cadence. Consider migrating Vault *into* k3s in Phase 7 to eliminate even that.\n3. **Skip SPIRE for Phase 1.** Two DGX nodes + one VM is below the threshold where SPIRE pays for itself \u2014 Vault K8s auth + ServiceAccount projected tokens covers ~90% of the value at ~10% of the operational cost. Revisit SPIRE only if/when we add a third trust boundary.\n4. **Plan for TPM attestation on the DGX Sparks themselves.** The Sparks have TPMs. Even without SPIRE, we can use the TPM for `tang`/`clevis` LUKS unlock and (later) for Vault's TPM auth method or a future SPIRE TPM node-attestor migration. This is the *only* path that fully solves the bottom turtle.\n5. **Use Vault response wrapping (`-wrap-ttl`) for any unavoidable secret hand-off** \u2014 e.g., the initial GitHub App private key seeding in `arc-github-app-seed.md`. 
The operator unwraps once; if interception occurred, the wrapper is already consumed.\n\n### Further reading\n\n- [Solving the Bottom Turtle (free PDF book)](https://spiffe.io/pdf/Solving-the-bottom-turtle-SPIFFE-SPIRE-Book.pdf) \u2014 the canonical 200-page introduction\n- [Uber Engineering \u2014 Our Journey Adopting SPIFFE/SPIRE at Scale](https://www.uber.com/us/en/blog/our-journey-adopting-spiffe-spire/) \u2014 most detailed production case study\n- [Bloomberg `spire-tpm-plugin`](https://github.com/bloomberg/spire-tpm-plugin) \u2014 hardware-rooted node identity\n- [SPIRE Concepts](https://spiffe.io/docs/latest/spire-about/spire-concepts/) \u2014 agent/server architecture\n- [SPIRE ADOPTERS.md](https://github.com/spiffe/spire/blob/main/ADOPTERS.md) \u2014 current list of named production users\n- [HashiCorp Vault \u2014 Kubernetes auth method](https://developer.hashicorp.com/vault/docs/auth/kubernetes)\n- [GitGuardian \u2014 The Secret Zero Problem](https://www.gitguardian.com/nhi-hub/the-secret-zero-problem-solutions-and-alternatives) \u2014 accessible rookie explainer\n- [CyberArk \u2014 Can SPIFFE Solve the Secret Zero Problem?](https://developer.cyberark.com/blog/can-spiffe-solve-the-secret-zero-problem/)\n- [AWS IRSA introduction](https://aws.amazon.com/blogs/opensource/introducing-fine-grained-iam-roles-service-accounts/)\n- [Red Hat \u2014 SPIFFE/SPIRE + Keylime](https://next.redhat.com/2025/01/24/spiffe-spire-and-keylime-software-identity-based-on-secure-machine-state/)\n\n---\n\n## 12. Supply chain\n\n&gt; *The pattern in 2-3 sentences:* You cannot \"audit\" your way to a trustworthy binary by reading source \u2014 Ken Thompson proved in 1984 that a malicious compiler can hide a backdoor in *itself* and in any program it builds, leaving source code clean. 
The modern answer isn't a single audit; it's a stack of overlapping techniques \u2014 **reproducible (bit-identical) builds, signed provenance attestations, transparency logs, and bootstrappable toolchains** \u2014 so that \"trust\" becomes \"independently re-derivable from a tiny, hand-auditable seed plus public source.\" The XZ backdoor (2024) and SolarWinds (2020) are the two case studies that prove every layer of this stack matters.\n\n### Concrete real-world examples\n\n1. **Debian Reproducible Builds \u2014 bit-for-bit verifiable archive.** Debian 14 (\"Forky\") will mandate reproducible packages: every `.deb` must be byte-identical when independently rebuilt by anyone. They run a public rebuilder farm at `tests.reproducible-builds.org` that continuously rebuilds the archive and publishes diffs (`diffoscope`, the tool, is the workhorse \u2014 it recursively unpacks any container/archive and diffs the contents semantically).\n2. **NixOS / Bootstrappable Builds \u2014 256-byte seed.** NixOS PR #227914 builds the entire `stdenv` from a 256-byte hex seed via `stage0-posix` \u2192 `M2-Planet` \u2192 `GNU Mes` \u2192 `tinycc` \u2192 `gcc`. Combined with GNU Guix's parallel effort, Mes has now been DDC-verified across three distros \u00d7 three GCC versions, all producing bit-identical output.\n3. **Sigstore (Cosign + Fulcio + Rekor) \u2014 keyless signing for OCI.** Kubernetes, Distroless, npm (since 2023), and PyPI all publish to Rekor. The Cosign workflow in GitHub Actions issues a short-lived cert tied to the workflow's OIDC identity (e.g. `https://github.com/org/repo/.github/workflows/release.yml@refs/tags/v1.2.3`) \u2014 there is no long-lived signing key to steal.\n4. **SLSA v1.0 build track (Google/OpenSSF).** Levels L0 \u2192 L3. L1 = provenance exists. L2 = hosted build with signed provenance. L3 = build isolation + signing material inaccessible to user steps. The reference implementation is the `slsa-github-generator` set of reusable workflows.\n5. 
**in-toto layouts (CNCF).** A signed YAML \"layout\" declares the expected supply chain steps (`clone \u2192 test \u2192 build \u2192 sign`), the authorized actor per step, and how artifact hashes must flow between steps. Used in production by Datadog, SolarWinds (post-incident), and Toradex's Torizon OS for IoT provenance.\n\n### Specific tools / runbooks / cadences\n\n- **`diffoscope`** \u2014 investigate non-determinism between two builds; output is a recursive HTML/JSON diff.\n- **`reprotest`** \u2014 Debian tool that builds a package twice while varying time, locale, path, hostname, build user, etc., to surface non-determinism sources.\n- **`cosign sign --yes `** in GitHub Actions with `id-token: write` permission, paired with `cosign verify --certificate-identity-regexp ... --certificate-oidc-issuer https://token.actions.githubusercontent.com` in admission/CI.\n- **`slsa-verifier`** \u2014 validates SLSA provenance attestations against expected source repo, builder, and entry point.\n- **Cadence:** Debian rebuilders run continuously; Sigstore root key ceremonies happen ~yearly with public attendance and recorded video; SLSA-aware projects regenerate provenance per release tag.\n\n### War stories\n\n- **XZ Utils / CVE-2024-3094 (March 2024).** \"Jia Tan\" spent ~2.5 years building maintainer trust, then planted the payload only in the *release tarball* (in `build-to-host.m4` and binary test fixtures) \u2014 the git repo was clean. Discovered by accident by Andres Freund chasing a 500 ms sshd latency regression. Lesson: **release tarballs \u2260 git source**. Reproducible builds from VCS would have caught the divergence immediately. The maintainer-burnout vector (Lasse Collin overwhelmed, one person on vacation) is now the canonical example.\n- **SolarWinds SUNBURST (2020).** Attackers planted SUNSPOT on the build server itself, watching for `MsBuild.exe` invocations and swapping a source file *during* compilation. 
Signed Orion DLLs went out to ~18,000 customers. The source repo was never modified. Lesson: **even signed binaries from a \"trusted\" vendor are worthless if the build platform is the threat.** SLSA L3 (build isolation) is the direct response.\n- **event-stream / flatmap-stream (Nov 2018).** Dominic Tarr handed npm publishing rights to a stranger (\"right9ctrl\") who asked nicely. The new maintainer pulled in `flatmap-stream`, which contained a Copay-wallet-targeted Bitcoin stealer activated only in a specific build environment. Lesson: **package-manager identity is weaker than the social-engineering attack against it.**\n- **log4shell (2021).** Not a backdoor, but proved that a single transitively-included library can put every JVM-running enterprise into incident-response mode for months \u2014 a forcing function for SBOMs and dependency-pinning hygiene.\n\n### Lessons for our project\n\n1. **Pin Ansible collections by checksum, not version.** `requirements.yml` accepts `version:` but not native checksum verification \u2014 wrap installs with `ansible-galaxy collection install --offline` from a tarball whose SHA-256 is committed alongside `versions.env`. Cross-check against the upstream Galaxy API in CI.\n2. **Sign the runner image with cosign keyless from GitHub Actions, verify on pull in k3s.** Use the `policy-controller` or Kyverno's `verifyImages` rule to enforce that every image admitted to the cluster has a Rekor entry whose certificate identity matches the fleet's release workflow. Same pattern for the inference image.\n3. **Generate SLSA v1.0 provenance for the runner image.** The `slsa-github-generator` reusable workflow gives this for free at L3, and `slsa-verifier` can run in the k3s admission controller. This makes the SolarWinds attack class infeasible against us \u2014 an attacker compromising a maintainer laptop cannot forge provenance tied to our GitHub OIDC issuer.\n4. 
**Verify upstream tarballs against their git tags before vendoring.** The XZ lesson: anything we pull as a tarball (Helm charts in particular \u2014 they ship as opaque `.tgz`) gets a `chart-from-git-vs-released-tgz` diff in CI. If they diverge without justification, fail the build.\n5. **Reproducibility as a release gate, not an aspiration.** When we bake the runner image, build it twice in two isolated runners and compare digests. If they don't match, fix the non-determinism (timestamps, locale, build paths) before shipping. Debian's `reprotest` is the model.\n\n### Further reading\n\n- [Reflections on Trusting Trust \u2014 Ken Thompson, 1984](https://www.cs.cmu.edu/~rdriley/487/papers/Thompson_1984_ReflectionsonTrustingTrust.pdf) \u2014 read this once\n- [Fully Countering Trusting Trust through Diverse Double-Compiling \u2014 David A. Wheeler](https://dwheeler.com/trusting-trust/) \u2014 the formal answer\n- [SLSA v1.0 specification](https://slsa.dev/spec/v1.0/levels) \u2014 read `levels` and `provenance` pages\n- [Sigstore Cosign signing overview](https://docs.sigstore.dev/cosign/signing/overview/)\n- [XZ backdoor \u2014 Sam James's timeline gist](https://gist.github.com/thesamesam/223949d5a074ebc3dce9ee78baad9e27)\n- [Datadog Security Labs \u2014 XZ deep dive](https://securitylabs.datadoghq.com/articles/xz-backdoor-cve-2024-3094/)\n- [CrowdStrike SUNSPOT technical analysis](https://www.crowdstrike.com/en-us/blog/sunspot-malware-technical-analysis/)\n- [npm event-stream post-mortem (Snyk)](https://snyk.io/blog/a-post-mortem-of-the-malicious-event-stream-backdoor/)\n- [Debian ReproducibleBuilds wiki](https://wiki.debian.org/ReproducibleBuilds)\n- [NixOS PR #227914 \u2014 256-byte stdenv bootstrap](https://github.com/NixOS/nixpkgs/pull/227914)\n- [in-toto attestation framework](https://github.com/in-toto/attestation)\n\n---\n\n## 13. 
Compliance\n\n&gt; *The pattern in 2-3 sentences:* Every major regulatory regime (NIST 800-34, SOC 2 CC7, ISO 27001 A.17 / 2022 A.5.29-30, FedRAMP, HIPAA 164.308, PCI DSS 3.6) converges on the same five demands: a written contingency plan, a documented key-management lifecycle with named custodians, **annual tested** restores (not just backups), tabletop exercises with after-action reports, and an immutable evidence trail proving the test happened. The single most common audit failure is \"backups exist, restores never tested.\" For a control plane that runs Vault and k3s, this means the unseal ceremony, ARC GitHub-App seed, and `etcd` snapshot restore each need a runbook, a scheduled drill, and a signed After-Action Report \u2014 even if no auditor will ever read them.\n\n### Concrete real-world examples (with URLs)\n\n1. **NIST SP 800-34 Rev 1 \u2014 CP-4 contingency-plan testing.** Mandates annual functional exercises for moderate-impact systems; documented test plans must precede the drill, and \"theoretical estimates do not satisfy contingency plan testing requirements.\" Full PDF: [nvlpubs.nist.gov/nistpubs/legacy/sp/nistspecialpublication800-34r1.pdf](https://nvlpubs.nist.gov/nistpubs/legacy/sp/nistspecialpublication800-34r1.pdf)\n2. **NIST SP 800-57 Part 1 Rev 5 \u2014 key lifecycle.** Defines generation \u2192 distribution \u2192 storage \u2192 use \u2192 rotation \u2192 archival \u2192 destruction, with cryptoperiods tied either to key age or message volume. [csrc.nist.gov/pubs/sp/800/57/pt1/r5/final](https://csrc.nist.gov/pubs/sp/800/57/pt1/r5/final)\n3. **SOC 2 CC7.4 / CC7.5 \u2014 system operations.** Requires a SIEM, vulnerability scanning, incident-response plan, and **annual DR test**. 
In a Type 2 period with zero incidents, a **tabletop exercise is the minimum bar** to demonstrate the IRP is \"operationally known to the team.\" [auditfront.com/frameworks/soc-2/common-criteria/cc7-1](https://www.auditfront.com/frameworks/soc-2/common-criteria/cc7-1/)\n4. **ISO 27001:2022 A.5.29 + A.5.30** (replacing the older A.17). A.5.30 is the new \"ICT readiness for business continuity\" \u2014 explicit RTO/RPO per service and tested recovery procedures. [certpro.com/iso-27001-2022-controls](https://certpro.com/iso-27001-2022-controls/)\n5. **PCI DSS v4.0 \u00a73.6.8** \u2014 **cryptographic key custodians must formally acknowledge in writing** that they accept their responsibilities; manual processes require split knowledge and dual control (e.g., 3-of-5 Shamir quorum). [blog.rsisecurity.com/pci-compliance-key-management-requirements](https://blog.rsisecurity.com/pci-compliance-key-management-requirements/)\n6. **HIPAA 45 CFR \u00a7164.308(a)(7).** Five required implementation specs: Data Backup Plan, Disaster Recovery Plan, Emergency Mode Operation Plan, Testing &amp; Revision Procedures, Applications/Data Criticality Analysis. [ecfr.gov/current/title-45/.../section-164.308](https://www.ecfr.gov/current/title-45/subtitle-A/subchapter-C/part-164/subpart-C/section-164.308)\n7. **FedRAMP Moderate ISCP template** \u2014 the actual `.docx` agencies fill out. Daily incremental + weekly full backups; annual functional exercise. [fedramp.gov/.../REV_4_SSP-A06-FedRAMP-ISCP-Template.docx](https://www.fedramp.gov/assets/resources/documents/rev4/REV_4_SSP-A06-FedRAMP-ISCP-Template.docx)\n\n### Specific runbook/document templates auditors expect\n\n- **Information System Contingency Plan (ISCP)** \u2014 FedRAMP template above is the gold standard; sections: roles, BIA, RTO/RPO matrix, activation criteria, recovery procedures, reconstitution, plan-maintenance schedule.\n- **Key Custodian Acknowledgement Form** \u2014 one-page signed PDF per custodian (PCI 3.6.8). 
Lists the named key, the custodian's quorum share number, rotation cadence, and an escape clause on termination.\n- **Key Ceremony Script** \u2014 step-by-step, two-person-rule, with checkboxes initialled by each custodian; counterpart in our world is `vault-init.md`.\n- **After-Action Report (AAR)** \u2014 date, scenario, participants, timeline of events vs. plan, gaps, owners, due dates. Often the *only* artefact a SOC 2 auditor asks to see for CC7.5.\n- **Test Schedule** \u2014 quarterly tabletop, annual full restore, ad-hoc after major change. Tracked in a register the auditor pulls.\n- **BIA (Business Impact Analysis)** \u2014 what dies if Vault stays sealed for 24 h? For 7 days? Drives the RTO.\n\n### War stories\n\n- **Equifax (2017).** GAO report [GAO-18-559](https://www.gao.gov/assets/gao-18-559.pdf) and the [MIT CAMS case study](https://cams.mit.edu/wp-content/uploads/2021-06-PUBLISHED-MISQE-Applying-the-Lessons-from-the-Equifax-Cybersecurity-Incident.pdf) document that **internal audit had flagged patching and certificate-expiry gaps before the breach but findings were never closed**. Cost: $1.38 B and the CEO's job. The lesson is not \"patch faster\" \u2014 it's \"make audit findings have owners and due dates that the board sees quarterly.\"\n- **Capital One (2019).** An $80 M regulatory fine plus a $190 M class-action settlement, despite state-of-the-art cloud. Root cause: an over-privileged WAF role + missing detective controls. Their key management *was* good; their access-management telemetry was not.\n- **The \"9-month SOC 2 remediation.\"** Per [compassitc.com](https://www.compassitc.com/blog/what-happens-if-you-fail-a-soc-2-examination), a client with no DR capability had to build the program, configure backups, set RTO/RPO, run tests, and write runbooks \u2014 and the auditor still demanded **nine months of operating evidence** before clearing the exception. Moral: start the clock today, not the quarter before your first audit.\n\n### Lessons for an unregulated shop\n\n1. 
**Sign a Vault Key-Custodian Acknowledgement.** Three Shamir-share holders, named, dated, with rotation policy. Costs 20 minutes; satisfies PCI 3.6.8 thinking; forces honest discussion of \"what if a custodian quits?\"\n2. **Quarterly tabletop, annual restore.** Pick a date, run \"Vault VM lost\"; do an actual `etcd snapshot restore` + Vault unseal in the lab from cold media. Write the AAR to `docs/runbooks/drill-reports/YYYY-MM-DD.md`. This is the single artefact a SOC 2 auditor would ask for first.\n3. **Publish RTO/RPO per service.** Vault: RTO 1 h / RPO 24 h. k3s control plane: RTO 4 h / RPO 24 h. ARA: best-effort. Without these numbers, \"DR plan\" is just vibes.\n4. **Make findings have owners and due dates.** Treat every PITFALLS entry like an audit finding: open ticket, owner, ETA, close-out evidence. Equifax died from open findings, not unknown ones.\n5. **Auto-generate evidence.** The runbook should `tee` to a timestamped log; commit it. \"Cutover-style immutable audit log\" is overkill for us, but a `git log` of drill reports is the same idea on a budget.\n\n### Further reading\n\n- [NIST SP 800-34 Rev 1](https://nvlpubs.nist.gov/nistpubs/legacy/sp/nistspecialpublication800-34r1.pdf) \u2014 read \u00a73.4 (Testing, Training, and Exercises)\n- [NIST SP 800-57 Part 1 Rev 5](https://nvlpubs.nist.gov/nistpubs/specialpublications/nist.sp.800-57pt1r5.pdf) \u2014 \u00a75.3 cryptoperiods, \u00a78 lifecycle states\n- [CMS Key Management Handbook](https://security.cms.gov/learn/cms-key-management-handbook) \u2014 readable real-world translation\n- [AuditFront SOC 2 CC7.1 implementation guide](https://www.auditfront.com/frameworks/soc-2/common-criteria/cc7-1/)\n- [FedRAMP ISCP template](https://www.fedramp.gov/assets/resources/documents/rev4/REV_4_SSP-A06-FedRAMP-ISCP-Template.docx)\n- [eCFR 45 CFR \u00a7164.308](https://www.ecfr.gov/current/title-45/subtitle-A/subchapter-C/part-164/subpart-C/section-164.308)\n- [PCI key-management deep dive (RSI 
Security)](https://blog.rsisecurity.com/pci-compliance-key-management-requirements/)\n- [ISO 27001:2022 changes (CertPro)](https://certpro.com/iso-27001-2022-controls/)\n- [USENIX: Running DR Tabletop Exercises](https://www.usenix.org/publications/loginonline/running-disaster-recovery-plan-tabletop-exercises)\n- [GAO-18-559: Equifax response report](https://www.gao.gov/assets/gao-18-559.pdf)\n\n---\n\n## 14. Homelab\n\n&gt; *The pattern in 2-3 sentences:* Small-team and homelab operators have converged on a remarkably consistent stack: a **declarative Git repo as the source of truth**, a **single bootstrap command** that gets you from bare metal to reconciling cluster, a **pull-based GitOps agent** (Flux or Argo CD) that closes the loop, and **encrypted-secrets-in-Git** (SOPS+age) instead of a separately-operated secret server. The discipline that distinguishes the survivors from the cargo-cult is **testing the bootstrap by actually destroying and rebuilding** \u2014 and writing the recovery doc as a printed PDF stored offline, not a wiki page on the server you just lost.\n\n### Concrete real-world examples\n\n1. **[onedr0p/home-ops](https://github.com/onedr0p/home-ops)** \u2014 The canonical \"k8s-at-home\" reference. Talos + Flux + Renovate + external-secrets-operator backed by 1Password Connect. Volsync handles PVC backup/restore. Demonstrates that one person can run Ceph, Cilium, Istio, and a Cloudflared ingress without losing weekends \u2014 because Renovate auto-PRs every dependency and Flux applies on merge.\n2. **[khuedoan/homelab](https://github.com/khuedoan/homelab)** \u2014 \"Empty disk to running services with a single `make` command.\" PXE boot \u2192 Ansible \u2192 k3s \u2192 Argo CD root-app \u2192 everything else. A great study in the **root-app pattern**: bootstrap installs exactly one Argo Application that manages itself plus all children.\n3. 
**[Stan's blog: Moving my personal infra to single-node k3s](https://stanislas.blog/2025/04/moving-to-k8s/)** \u2014 A working example of the *exact* shape we're building: one node, k3s, Flux, Renovate, cert-manager, Restic. Bootstrap = one curl command. Recovery = `rsync /etc/rancher /var/lib/rancher` and re-run install. Explicitly evaluated and rejected Longhorn + Velero as \"overkill.\"\n4. **[nlewo/comin](https://github.com/nlewo/comin)** \u2014 GitOps-for-NixOS in pull mode. The NixOS analogue to Flux: machines poll their own config from Git. Worth knowing exists even if you stay on Ubuntu \u2014 it's the cleanest expression of \"the machine reconciles itself.\"\n5. **[chrisleekr/homelab-infrastructure](https://github.com/chrisleekr/homelab-infrastructure)** \u2014 Two-stage IaC (Ansible bootstrap \u2192 Terraform app stack) on single-node k8s. Closest structural analogue to our control-plane + fleet split.\n\n### Specific tools / patterns / cadences\n\n- **Renovate, not Dependabot.** Renovate understands Helm charts, container digests, Ansible Galaxy roles, and Terraform modules. Configure it for `automerge: true` on patch updates of trusted images and `dependencyDashboard: true` so all pending updates show up as one tracking issue.\n- **Pinning by digest, not tag.** Every serious homelab repo pins container images and Helm charts by SHA. Renovate rewrites the digest on update.\n- **SOPS + age, not Vault.** For a two-person team, running Vault *as a consumer* is sensible; running it just to store five secrets is not. The homelab pattern is `.sops.yaml` with two age recipients (one per engineer), keys backed up on YubiKeys or a paper-printed master.\n- **Pull-mode reconciliation.** Flux/Argo/comin all *pull* from Git. The control plane never needs an inbound CI hook. 
Tailscale ACLs (or just SSH-over-Tailscale) handle the human-access side.\n- **The \"weekly chore\" cadence.** Solo operators don't do quarterly DR drills \u2014 they do a Sunday-morning `restic check` and a monthly \"destroy a non-prod VM and re-bootstrap from scratch\" exercise. That's the cadence that keeps the runbook honest.\n\n### War stories\n\n- **SourceHut's [PXE-boot post-mortem-as-feature](https://sourcehut.org/blog/)** \u2014 After a server refused to boot from disk and they had to \"hastily assemble DHCP configs,\" they built a permanent PXE-boot setup specifically so the *next* recovery would take minutes. The lesson: every incident should leave behind one piece of permanent automation.\n- **Drew DeVault's [in-housing of DNS](https://drewdevault.com/)** \u2014 SourceHut moved DNS *back* in-house against industry advice, on the principle that \"we are proud believers in owning our infrastructure rather than operating at planet scale.\" Small teams can absolutely run real infrastructure; the trick is choosing fewer pieces, not fewer features.\n- **[Tailscale's three-person infra team](https://tailscale.com/blog/infra-team-stays-small)** \u2014 They explicitly eliminated entire problem categories rather than scaling staffing: ACLs replace per-resource auth, overlay networking replaces VPC peering, their internal `setec` replaces per-cloud secrets managers, automatic TLS replaces a PKI rotation rota. **The pattern: collapse N tools into 1 wherever identity already crosses the boundary.**\n- **[Grumpy.systems' bus-factor post](https://grumpy.systems/2023/preparing-for-you-homelabs-demise/)** \u2014 Best concrete bus-factor advice on the open web. Weekly auto-generated PDF of recovery procedures + encrypted config bundle, uploaded to cloud *and printed*. Tested with \"could my partner actually recover photos if I died tomorrow?\"\n\n### Lessons for our project (the 80/20)\n\n1. 
**Treat the bootstrap as the test.** Our \"done\" definition already runs `bash bootstrap.sh` from a fresh checkout \u2014 make that a literal CI job that spins up a Proxmox VM weekly and runs end-to-end. khuedoan and onedr0p both do this; it's why their READMEs aren't stale.\n2. **Renovate over manual digest bumps.** We already have a fleet-repo workflow that PRs digest updates via `versions.env`. Extend Renovate config to also bump Helm chart digests (ARC, ESO, ARA) and Ansible collection versions. Auto-merge patch, human-approve minor, block major.\n3. **Print the runbook. Literally.** Our three operator-only runbooks (`vault-init.md`, `arc-github-app-seed.md`, `auto-unseal-migration.md`) should generate a PDF artifact in CI, with the operator-only token-handling sections highlighted. Store on a USB drive in a fireproof box. This is the homelab community's actual practice, and it's the right move for a two-person team where one person's laptop is the bus factor.\n4. **Drop \"observability backend\" until you've used it in anger.** Stan's blog explicitly cut Longhorn + Velero as overkill on a single node. For two engineers + one control plane VM + two DGX boxes, Prometheus + Loki + a dashboard is fine; don't add Tempo, Mimir, or an SLO platform until you've actually had an incident that needed them.\n5. **Keep Vault, but only as a *consumer* problem.** The homelab consensus is \"don't run Vault as a hobbyist.\" We have a real reason (ESO + ARC need a KV store with identity-aware auth), so keep it \u2014 but lean hard on the existing rule that Vault never restarts on play apply, and treat the operator-only init runbook as sacred. 
Don't add Vault as the store for ops-only secrets when SOPS+age would do; that's the line where homelab pragmatism beats enterprise instinct.\n\n### Further reading\n\n- [Tailscale \u2014 How our infrastructure team stays small](https://tailscale.com/blog/infra-team-stays-small) \u2014 the canonical \"boring infra for tiny teams\" essay\n- [onedr0p/home-ops](https://github.com/onedr0p/home-ops) \u2014 read the `kubernetes/apps/` layout and `.github/renovate.json5` configs\n- [khuedoan/homelab](https://github.com/khuedoan/homelab) \u2014 the single-command bootstrap pattern\n- [Stan \u2014 Moving to single-node k3s](https://stanislas.blog/2025/04/moving-to-k8s/) \u2014 closest analogue to our control-plane VM shape\n- [Grumpy Systems \u2014 Preparing for your homelab's demise](https://grumpy.systems/2023/preparing-for-you-homelabs-demise/) \u2014 bus-factor planning\n- [SourceHut blog](https://sourcehut.org/blog/) \u2014 Drew DeVault on owning DNS, PXE boot for fast recovery\n- [nlewo/comin](https://github.com/nlewo/comin) \u2014 pull-mode GitOps for NixOS\n- [Managing secrets with SOPS in your homelab (codedge)](https://www.codedge.de/posts/managing-secrets-sops-homelab/)\n- [k3sup (alexellis)](https://github.com/alexellis/k3sup) \u2014 bootstrap k3s over SSH in under 60 seconds\n- [Tailscale \u2014 Workload identity federation](https://tailscale.com/blog/workload-identity-beta)\n\n---\n\n# Part III \u2014 Synthesis and action\n\n## 15. Unified methodology\n\nAcross 10 chapters, 50+ companies, and 100+ cited URLs, the same pattern keeps emerging. We call it the **Bootstrap Operations Methodology** (BOM), and it is six rules:\n\n### Rule 1: Separate the three buckets, ruthlessly\n\n- **Identity** (root keys, unseal shards, operator credentials) lives offline, in N-of-M distribution, on devices that survive the destruction of any other system.\n- **Data** (Vault contents, etcd, NAS files) lives in backups on a *different blast radius* than the source. 
Test restores, don't just count snapshots.\n- **Config** (policies, manifests, playbooks) lives in code, applied idempotently by an agent that holds only the privilege to apply config \u2014 never to read data.\n\nEvery mixed bucket is a future incident. The discipline is permanent.\n\n### Rule 2: Reduce privilege immediately after every bootstrap\n\n- Day-0 root tokens exist for ~30 minutes during init, then revoke.\n- Day-1 admin tokens exist for the duration of one operator session, then revoke.\n- Day-2 config-applier tokens exist for the duration of one CI run, then revoke.\n- The longest-lived credential should be the smallest-privileged one.\n\nThis is the principle of least privilege applied to its own meta-configuration. It's how Vault's `admin` policy is structured, how Google's break-glass procedures work, how AWS's IAM Role assumption works.\n\n### Rule 3: Treat operator-only runbooks as code\n\n- Versioned in git, reviewed in PRs, executed verbatim from a printed copy.\n- Each contains an explicit \"this is operator-only because X\" justification.\n- Each is rehearsed on a scheduled cadence (quarterly minimum).\n- Each generates an After-Action Report on every execution (real or drill).\n\nThe fact that humans execute them doesn't make them informal \u2014 it makes the *review discipline tighter*, not looser, because there's no CI to catch errors.\n\n### Rule 4: Untested DR is fiction\n\n- Backups exist? Restore them. Quarterly.\n- Runbook exists? Execute it. Quarterly.\n- Person knows it? Remove them from the drill. Quarterly.\n- Document the After-Action Report. Always.\n\nGoogle DiRT, AWS GameDays, ICANN ceremonies, FedRAMP, SOC 2, NIST 800-34 \u2014 every mature operator has converged on the same answer: only tested recovery counts. 
GitLab 2017 is the parable everyone cites for a reason.\n\n### Rule 5: Improve the bootstrap, continuously\n\n- Every incident produces one piece of permanent automation (SourceHut's PXE lesson).\n- Every PITFALLS entry has an owner and a due date (Equifax's lesson).\n- Every runbook is dated; if it's been &gt; 12 months since execution, it's stale.\n- Every cycle through the system reduces toil somewhere (Google's 50% rule).\n\nThe bootstrap layer is itself a product. It needs roadmaps, retrospectives, and pruning.\n\n### Rule 6: Match practices to scale\n\n- A 2-engineer team should not adopt SPIRE.\n- A 50-engineer team should not run SOPS+age for production.\n- A 500-engineer team should not skip key ceremonies.\n- A 5000-engineer team must run DiRT-class drills.\n\nThe practices scale, but blindly applying enterprise practices to a homelab is cargo cult. Pick the smallest practice that *actually* addresses the failure modes you face today, and add one practice per quarter as the scale demands.\n\n---\n\n## 16. Concrete next steps\n\nFor our DGX fleet + Proxmox-Vault project specifically, here is the prioritized backlog the research surfaced. Each item is a candidate `tasks.json` entry:\n\n### Immediate (before Phase 1 even runs)\n\n1. **Write a Key-Custodian Acknowledgement form** for the 5 Shamir shard holders. One page, signed PDF, includes named custodian, share number, rotation cadence, escape clause. Store with the runbook.\n2. **Pre-write the `vault-init.md` script word-by-word**, including verbatim verification lines (\"Operator reads aloud: ...\"). Have a peer review the script before any ceremony.\n3. **Pre-decide Vault shard destinations and verify reachability**: tabletop \"can 3 of 5 custodians be reached within 1 hour by a single operator?\" If the answer is no, redistribute.\n\n### Phase 1 (during initial bring-up)\n\n4. **Use Vault's Kubernetes auth method for ESO**, not the bootstrap-token pattern. 
Migrate to it during Phase 1; document the `token_reviewer_jwt` rotation in a new `runbooks/vault-token-reviewer-rotate.md`.\n5. **Add a `roles/vault_config/` Ansible role** that idempotently applies Vault policies, auth methods, and KV mounts. This is the GitOps escape hatch for the \"Vault config rot\" problem. It should run on every `site.yml` apply, scoped to a `vault-config-applier` policy that can change Vault config but not read secrets.\n6. **Switch k3s to embedded etcd** (`--cluster-init`) so snapshot-and-restore is first-class. Configure `--etcd-s3 --etcd-snapshot-schedule-cron='0 */6 * * *' --etcd-snapshot-retention=28` for 6-hour snapshots, 7-day retention (4 snapshots per day, so 28 snapshots cover 7 days). Back up the server token to Vault.\n\n### Phase 6-7 (hardening)\n\n7. **Generate a printed PDF of operator-only runbooks** as a CI artifact on every release tag. Store on USB, fireproof box.\n8. **Pin Ansible collections by SHA-256**, not just version. Verify against Galaxy API in CI.\n9. **Sign the runner image with cosign keyless** from GitHub Actions. Verify in k3s admission (Kyverno or policy-controller).\n10. **Schedule quarterly drills**:\n    - Q1: 5-shard unseal drill (just verify shards are reachable from 3 destinations)\n    - Q2: Vault Raft snapshot restore on a throwaway VM\n    - Q3: k3s etcd snapshot restore on a kind cluster, verify ARC/ESO/Vault references reconcile\n    - Q4: GitHub App key rotation against staging Vault + test repo first\n\n### Future (defer until scale demands)\n\n11. **SPIRE/SPIFFE on the Sparks themselves** for workload identity \u2014 once we have a third trust boundary that isn't Vault K8s auth.\n12. **TPM- or network-bound unlock (`clevis` with its `tpm2` or `tang` pin)** for the Spark LUKS volumes \u2014 once we have a real threat model that demands measured boot.\n13. **SLSA L3 provenance** for inference images \u2014 once we have an inference image we ship externally.\n14. 
**Reproducible-builds release gate** for the runner image \u2014 once we have a regression where build non-determinism caused a confusing failure.\n\nThe first 10 items are doable in ~4 person-weeks of work, spread across Phases 1-7 as they naturally fit. The remaining items are explicit \"not yet\" decisions.\n\n---\n\n## 17. Reading order\n\nIf you read three things from this book and only three:\n\n1. **[Ken Thompson \u2014 Reflections on Trusting Trust (1984)](https://www.cs.cmu.edu/~rdriley/487/papers/Thompson_1984_ReflectionsonTrustingTrust.pdf)** \u2014 three pages, the deepest formulation of the bootstrap problem. The lecture that every infrastructure engineer should encounter at least once.\n2. **[Google SRE Book ch.5 \u2014 Eliminating Toil](https://sre.google/sre-book/eliminating-toil/)** \u2014 the language of \"toil\" reframes what you do every day. After you read this, you cannot un-see toil in your own workflows.\n3. **[Solving the Bottom Turtle (SPIFFE book)](https://spiffe.io/pdf/Solving-the-bottom-turtle-SPIFFE-SPIRE-Book.pdf)** \u2014 200 pages on workload identity, which is the field's most coherent answer to \"what does it mean to identify a piece of software.\" Even if you never run SPIRE, the mental model is foundational.\n\nIf you have a weekend:\n\n4. **[Weathering the Unexpected (Kripa Krishnan, 2012)](https://queue.acm.org/detail.cfm?id=2371516)** \u2014 Google's DiRT philosophy, written by the person who ran it for nine years.\n5. **[GitLab 2017 Postmortem](https://about.gitlab.com/blog/postmortem-of-database-outage-of-january-31/)** \u2014 the canonical \"five backups, none worked\" parable.\n6. **[Tailscale \u2014 How our infrastructure team stays small](https://tailscale.com/blog/infra-team-stays-small)** \u2014 the right antidote to enterprise-cargo-cult thinking.\n\nIf you have a week:\n\n7. 
**[Site Reliability Engineering (full book, free online)](https://sre.google/sre-book/table-of-contents/)** \u2014 read chapters 1, 3, 5, 11, 14, 15. Skip the rest until you need them.\n8. **[Building Secure and Reliable Systems (Google book, free online)](https://google.github.io/building-secure-and-reliable-systems/raw/toc.html)** \u2014 chapters 8 (design for resilience), 16 (disaster planning), 17 (crisis management).\n9. **[Sam James's XZ backdoor timeline](https://gist.github.com/thesamesam/223949d5a074ebc3dce9ee78baad9e27)** \u2014 what a real supply chain attack looks like, day-by-day.\n10. **[NIST SP 800-34 Rev 1](https://nvlpubs.nist.gov/nistpubs/legacy/sp/nistspecialpublication800-34r1.pdf)** \u2014 the regulator's view, which is surprisingly readable and surprisingly sensible.\n\nIf you have a month:\n\n11. **[Designing Distributed Systems (Brendan Burns)](https://azure.microsoft.com/en-us/resources/designing-distributed-systems/)** \u2014 the canonical patterns.\n12. **[The Twelve-Factor App](https://12factor.net/)** \u2014 30 minutes, foundational discipline.\n13. **[ICANN Root KSK ceremony scripts](https://www.iana.org/dnssec/ceremonies)** \u2014 actually download one of the PDFs and read it cover-to-cover. The level of choreography is humbling.\n14. **[SLSA v1.0 specification](https://slsa.dev/spec/v1.0/levels)** \u2014 supply chain levels, what each level demands.\n\n---\n\n## Closing\n\nYou started this book uncertain whether your infrastructure was reproducible. You should now be uncertain about something much more specific: *which of the three buckets is least well-defended on each tier of your stack, and what's the smallest cheapest experiment that would tell you?*\n\nThat's the question to take to your next planning meeting. The answer determines what you build next.\n\nThe bootstrap problem is not solvable in the sense that anxiety dreams of. 
It is *manageable* in the sense that hundreds of mature operators have shown \u2014 with separation of concerns, scoped privilege, tested recovery, and continuous improvement. The five Vault shards in five destinations, the printed runbook in the fireproof box, the quarterly drill on the calendar, the After-Action Report committed to git: these are the artifacts of an organization that has accepted that bootstrap is permanent and has built a discipline around it.\n\nYou don't need to be enterprise-scale to adopt any of this. You need to pick one practice per quarter that addresses your actual current failure mode. In four years you will have a remarkably resilient stack. In eight years your runbooks will have outlived three generations of underlying software. That is the practice.\n\nWelcome to Day 2.\n\n---\n\n*This book was assembled from 10 parallel research-agent investigations across HashiCorp Vault production deployments, Google SRE methodology, PKI key ceremonies, GitOps for stateful infrastructure, chaos engineering, Kubernetes/etcd disaster recovery, workload identity, supply chain trust, compliance-driven DR, and homelab small-team practices. Total citations: 100+. Total companies referenced: 50+. Produced: 2026-05-09.*\n", "creation_timestamp": "2026-05-13T16:50:55.000000Z"}