{"uuid": "0456183c-4ab9-4658-a8b5-be62e2b6e5a1", "vulnerability_lookup_origin": "1a89b78e-f703-45f3-bb86-59eb712668bd", "author": "9f56dd64-161d-43a6-b9c3-555944290a09", "vulnerability": "CVE-2025-32711", "type": "seen", "source": "https://gist.github.com/niallmerrigan/b43ce627736adaa3dfe9d7c582b89190", "content": "# LLM Red-Team: Mitigations & Further Reading (Attendee Handout)\n\nA one-page-per-section field guide to defending against the attacks covered in this talk \u2014\nplus a curated, source-backed reading list. Covers both directions: **attacks on LLMs** and\n**LLMs used to attack people**.\n\n> Scan the QR or open the gist. Slides reference the numbered categories below.\n> Full corpus (technical deep-dives, incidents, references): see the project site / repo.\n\n---\n\n## How to use this handout\n\n- **Universal controls** apply across every category \u2014 start here.\n- **Per-category mitigations** give 3\u20136 concrete, do-this-Monday controls plus the residual risk you can't engineer away.\n- **Framework crosswalk** maps each category to OWASP LLM Top 10 (2025), MITRE ATLAS, and NIST AI RMF / AI 600-1.\n- **Further reading** is grouped Standards \u2192 Vendor guidance \u2192 Notable incidents.\n\n---\n\n## Universal controls (the cross-cutting top 10)\n\nThese reduce risk in *every* category. If you do nothing else, do these.\n\n1. **Treat all model input as untrusted data, never as instructions** \u2014 user text, retrieved docs, tool results, web pages, emails, images. There is no reliable parser boundary between \"data\" and \"commands\" in natural language.\n2. **Keep secrets and authorization out of prompts** \u2014 prompts are recoverable configuration, not a vault. Enforce authz in code/policy, not in the system prompt.\n3. **Least privilege for tools and agents** \u2014 scope tokens narrowly, separate read from write, and gate high-impact actions (payment, email send, deploy, delete) behind explicit human approval.\n4. 
**Break the path: untrusted content \u2192 privileged tool \u2192 external sink.** Most agentic and injection harm requires all three links; cut any one.\n5. **Provenance on everything** \u2014 tag the source and trust level of every retrieved item, dataset, model, adapter, and tool. Reputation (download counts, stars) is not provenance.\n6. **Defense in depth, not one classifier** \u2014 combine model-level safety, input/output filtering, and application containment. Any single layer will be bypassed eventually.\n7. **Constrain outputs** \u2014 small deterministic schemas, allow-listed actions, and output validation beat free-form generation feeding downstream systems.\n8. **Log, monitor, and rate-limit** \u2014 retrieval telemetry, tool-call audit trails, anomaly detection, and unbounded-consumption caps. You can't respond to what you can't see.\n9. **Identity and workflow controls beat content judgment** \u2014 for social-engineering categories, make *accurate context insufficient for authorization*; use phishing-resistant MFA, callbacks, and out-of-band verification.\n10. **Red-team continuously and assume residual risk** \u2014 repeated sampling and new strategies find rare failures. 
Plan for detection and recovery, not just prevention.\n\n---\n\n## Per-category mitigations\n\n### 01 \u2014 Direct prompt injection\n*Risk: user-turn text overrides intended model behavior.*\n- State an explicit instruction hierarchy and label user content as data, not commands.\n- Add input classifiers (jailbreak/leak phrasing, odd encodings) and output classifiers (sensitive disclosure, schema breaks, unexpected tool plans).\n- Keep task scope narrow with deterministic output contracts for classifiers/extractors.\n- Never place secrets or authz rules in the prompt; delimiters aid readability but are **not** enforcement.\n- **Residual risk:** no prompt or classifier perfectly separates instructions from data.\n\n### 02 \u2014 Indirect prompt injection\n*Risk: payloads arrive via retrieved email, web, docs, images, tool results.*\n- Attach provenance + trust level to every retrieved artifact; render untrusted content inertly.\n- Do **not** auto-execute tools from retrieved content; require approval for high-impact actions.\n- Strip/escape active markup (Markdown links, images, hidden text) before it reaches the model.\n- Apply per-modality filtering (text, HTML, image-embedded text) and egress controls on data sinks.\n- **Residual risk:** assistants must read hostile content to be useful (cf. 
CVE-2025-32711).\n\n### 03 \u2014 Jailbreaks & policy bypass\n*Risk: DAN, Skeleton Key, Crescendo, many-shot, GCG defeat refusals.*\n- Layer model hardening + safety classifiers (e.g., Prompt Shields / Content Safety) + app containment.\n- Cap multi-turn escalation; watch for Crescendo-style gradual boundary erosion across a session.\n- Constrain long-context and repeated-sampling abuse with budgets and anomaly detection.\n- Run automated red-team suites (e.g., PyRIT) against your exact workflow, not generic benchmarks.\n- **Residual risk:** enough sampling + novel phrasing still finds rare refusal failures.\n\n### 04 \u2014 System-prompt leak & extraction\n*Risk: Sydney/GPTs-style prompt disclosure; model-stealing.*\n- Assume the prompt **will** leak; remove secrets, keys, and enforcement logic from it.\n- Move authorization and business rules to server-side code with their own access checks.\n- Rate-limit and monitor extraction patterns (repeated \"repeat the above\", translation/summarization tricks).\n- Treat prompts as versioned, recoverable configuration \u2014 not as a security boundary.\n- **Residual risk:** models can quote, summarize, translate, or infer hidden context.\n\n### 05 \u2014 Training-data poisoning\n*Risk: sleeper agents and web-scale poisoning survive filtering.*\n- Treat datasets as supply-chain artifacts: provenance, immutable snapshots, signed manifests (SLSA).\n- Add promotion gates and trigger-conditioned evaluation (test for backdoor triggers, not just accuracy).\n- Constrain and vet web-scraped corpora; prefer curated, attestable sources for high-stakes models.\n- Keep dataset bills-of-materials and the ability to trace any example back to a source.\n- **Residual risk:** a few poisoned examples can survive and fire only under rare triggers.\n\n### 06 \u2014 Model supply-chain backdoors\n*Risk: pickle RCE, malicious LoRAs, model squatting, conversion jobs.*\n- Treat models, adapters, tokenizers, and inference servers like executable 
dependencies.\n- Prefer safetensors over pickle; scan artifacts; sign and verify (Sigstore) across the pipeline.\n- Pin versions and verify integrity (hashes/manifests); never trust download counts as provenance.\n- Sandbox conversion/loading jobs; lock down inference servers (cf. ShadowRay).\n- **Residual risk:** model ecosystems still mix code and data; reputation \u2260 provenance.\n\n### 07 \u2014 RAG corpus poisoning\n*Risk: PoisonedRAG, retrieval hijacking, embedding attacks.*\n- Govern the corpus as an executable influence surface: source provenance + chunk-level controls.\n- Add retrieval telemetry and gate actions taken on retrieved \"evidence.\"\n- Filter/score documents on ingest; isolate untrusted or user-contributed sources.\n- Apply least-privilege over what the retriever can reach (cf. M365 Copilot data boundaries).\n- **Residual risk:** a user-authorized but malicious doc can still be retrieved and synthesized.\n\n### 08 \u2014 Agentic tool & MCP abuse\n*Risk: confused-deputy, tool poisoning, MCP supply chain, agent worms.*\n- Cut the graph: untrusted content \u2192 privileged tool \u2192 external sink. 
Require approval at sinks.\n- Treat tool descriptions and tool results as untrusted natural-language influence surfaces.\n- Pin and verify MCP servers/tools (integrity manifests); follow MCP security best practices.\n- Enforce per-tool least privilege, allow-listed actions, and full tool-call audit logging.\n- **Residual risk:** every tool surface can steer the agent despite prompt instructions (CWE-441).\n\n### 09 \u2014 LLM-augmented phishing\n*Risk: WormGPT/FraudGPT, polymorphic, localized BEC at scale.*\n- Stop relying on typos/grammar as the tell; shift to identity, workflow, and payment controls.\n- Deploy phishing-resistant MFA (FIDO2) and verified-sender/auth (DMARC/BIMI) on email infrastructure.\n- Add out-of-band verification + dual-approval for payments and vendor bank-detail changes.\n- Train staff on *interactive* AI follow-up, not just static lures.\n- **Residual risk:** AI makes plausible, personalized, multilingual messaging nearly free.\n\n### 10 \u2014 Deepfake vishing & CFO fraud\n*Risk: Arup $25M, Ferrari, WPP \u2014 synthetic voice/video on calls.*\n- Make finance/identity workflows independent of voice, video, hierarchy, and urgency.\n- Require a callback to known-good numbers + code words for any high-value/urgent transfer.\n- Apply dual control and a hold/cooling-off period on large or unusual payments; no exceptions for \"the CEO.\"\n- Adopt content-provenance signals (C2PA) where available; don't rely on detection alone.\n- **Residual risk:** synthetic media exploits legitimate trust signals, not just detection gaps.\n\n### 11 \u2014 Spear-phishing & OSINT augmentation\n*Risk: LLM-driven victimology from public footprints.*\n- Make accurate context **insufficient** for authorization \u2014 knowing details \u2260 being authorized.\n- Reduce unnecessary public process leakage (org charts, workflows, vendor lists, travel).\n- Strengthen recruiter/exec/developer flows that attackers target with tailored pretexts.\n- Verify requests through 
role-based, out-of-band channels regardless of how convincing they seem.\n- **Residual risk:** professionals cannot avoid public footprints that attackers can mine.\n\n### 12 \u2014 Voice clone & real-time impersonation\n*Risk: ElevenLabs/Voice Engine-class cloning; grandparent scams.*\n- Remove voice as sufficient proof of identity; pre-agree family/finance **callback** procedures.\n- Use shared code words and out-of-band confirmation before money or sensitive action moves.\n- Educate high-risk groups (older adults, finance teams) before panic-driven moments arrive.\n- Pair provenance/watermarking (C2PA) with policy; note FCC ruling on AI-voice robocalls.\n- **Residual risk:** cloned voices exploit deep trust and reach via phone, apps, and robocalls.\n\n---\n\n## Framework crosswalk\n\n| # | Category | OWASP LLM Top 10 (2025) | MITRE ATLAS | NIST AI RMF / AI 600-1 |\n|---|---|---|---|---|\n| 01 | Direct prompt injection | LLM01 | AML.T0051 / .000 | Govern/Map/Measure/Manage; GAI: CBRN, Info Integrity |\n| 02 | Indirect prompt injection | LLM01 | AML.T0051.001 | Manage 4.x; Info Integrity |\n| 03 | Jailbreaks & policy bypass | LLM01 | AML.T0051 | Measure 2.x (red-team), Manage |\n| 04 | System-prompt leak | LLM07 / LLM02 | AML.T0051 | Map/Measure; Sensitive Info |\n| 05 | Training-data poisoning | LLM04 | AML.T0020 (data poisoning) | AI 100-2e2025; Govern data |\n| 06 | Model supply-chain backdoors | LLM03 | AML (supply chain) | SLSA/Sigstore-aligned; Govern |\n| 07 | RAG corpus poisoning | LLM04 / LLM08 | AML.T0051.001 | Manage; Info Integrity |\n| 08 | Agentic tool & MCP abuse | LLM06 (Excessive Agency) | AML.T0051 + CWE-441 | Manage 4.x; human-in-loop |\n| 09 | LLM-augmented phishing | LLM09 (Misinformation) | AML (offensive use) | AI RMF + NIST 800-63B |\n| 10 | Deepfake vishing & CFO fraud | \u2014 (human-facing) | AML (offensive use) | 800-63B; C2PA; FCC/FTC |\n| 11 | Spear-phishing & OSINT | LLM09 | AML (offensive use) | AI RMF; 800-63B |\n| 12 | Voice clone & 
real-time | \u2014 (human-facing) | AML (offensive use) | 800-63B; C2PA; FCC |\n\n*Crosswalk is indicative \u2014 see the per-folder `frameworks/` files and `references.md` for exact technique IDs.*\n\n---\n\n## Further reading (curated, source-backed)\n\n### Standards & government guidance\n- **OWASP GenAI \u2014 LLM Top 10 (2025).** https://genai.owasp.org/llm-top-10/\n- **OWASP \u2014 LLM Prompt Injection Prevention Cheat Sheet.** https://cheatsheetseries.owasp.org/cheatsheets/LLM_Prompt_Injection_Prevention_Cheat_Sheet.html\n- **MITRE ATLAS (adversarial ML knowledge base).** https://atlas.mitre.org/\n- **NIST AI Risk Management Framework.** https://www.nist.gov/itl/ai-risk-management-framework\n- **NIST AI 600-1 \u2014 Generative AI Profile.** https://doi.org/10.6028/NIST.AI.600-1\n- **NIST AI 100-2e2025 \u2014 Adversarial ML: Taxonomy & Mitigations.** https://csrc.nist.gov/pubs/ai/100/2/e2025/final\n- **NIST SP 800-63B \u2014 Digital Identity / Authentication.** https://pages.nist.gov/800-63-3/sp800-63b.html\n- **MCP \u2014 Security Best Practices.** https://modelcontextprotocol.io/specification/2025-06-18/basic/security_best_practices\n- **SLSA \u2014 Supply-chain Levels for Software Artifacts.** https://slsa.dev/spec/v1.0/\n- **Sigstore \u2014 signing & verification.** https://docs.sigstore.dev/\n- **C2PA \u2014 content provenance specs.** https://c2pa.org/specifications/specifications/2.2/index.html\n- **CISA \u2014 Avoiding Social Engineering & Phishing.** https://www.cisa.gov/news-events/news/avoiding-social-engineering-and-phishing-attacks\n- **FCC \u2014 AI-generated voices in robocalls are illegal.** https://www.fcc.gov/document/fcc-makes-ai-generated-voices-robocalls-illegal\n\n### Vendor & practitioner guidance\n- **Microsoft \u2014 Defend against indirect prompt injection.** https://learn.microsoft.com/en-us/security/zero-trust/sfi/defend-indirect-prompt-injection\n- **Microsoft \u2014 Prompt Shields / jailbreak detection.** 
https://learn.microsoft.com/en-us/azure/ai-services/content-safety/concepts/jailbreak-detection\n- **Microsoft \u2014 Azure AI Content Safety overview.** https://learn.microsoft.com/en-us/azure/ai-services/content-safety/overview\n- **Microsoft \u2014 Mitigating Skeleton Key jailbreaks.** https://www.microsoft.com/en-us/security/blog/2024/06/26/mitigating-skeleton-key-a-new-type-of-generative-ai-jailbreak-technique/\n- **Microsoft \u2014 Open automation framework to red-team GenAI (PyRIT).** https://www.microsoft.com/en-us/security/blog/2024/02/22/announcing-microsofts-open-automation-framework-to-red-team-generative-ai-systems/\n- **Microsoft/OpenAI \u2014 Staying ahead of threat actors in the age of AI.** https://www.microsoft.com/en-us/security/blog/2024/02/14/staying-ahead-of-threat-actors-in-the-age-of-ai/\n- **Microsoft \u2014 Disrupting a global cybercrime network abusing GenAI.** https://blogs.microsoft.com/on-the-issues/2025/02/27/disrupting-cybercrime-abusing-gen-ai/\n- **MSRC \u2014 CVE-2025-32711 (M365 Copilot indirect injection).** https://msrc.microsoft.com/update-guide/vulnerability/CVE-2025-32711\n\n### Notable incidents (talk anchors)\n- **Arup $25M deepfake video call (CNN, 2024).** https://www.cnn.com/2024/05/16/tech/arup-deepfake-scam-loss-hong-kong-intl-hnk/index.html\n- **Finance worker pays $25M after deepfake \"CFO\" call (FT, 2024).** https://www.ft.com/content/6108c15d-948e-4d3e-8a64-6b4b6c9e7b5e\n- **How Ferrari hit the brakes on a deepfake CEO (MIT SMR, 2025).** https://sloanreview.mit.edu/article/how-ferrari-hit-the-brakes-on-a-deepfake-ceo/\n- **Fraudsters mimic CEO's voice (WSJ, 2019).** https://www.wsj.com/articles/fraudsters-use-ai-to-mimic-ceos-voice-in-unusual-cybercrime-case-11567157402\n- **Bing Chat prompt-leak (CBC, 2023).** https://www.cbc.ca/news/science/bing-chatbot-ai-hack-1.6752490\n- **ShadowRay \u2014 exposed AI infra exploited (MITRE ATT&CK C0045).** https://attack.mitre.org/campaigns/C0045/\n\n> Full 
bibliography (157 deduped references across academic, vendor, government, news, and community sources): see `research/REFERENCES.md` in the corpus.\n\n---\n\n*Handout generated for the talk. Mitigations distilled from the 12 per-category defense briefs in the\nresearch corpus. Numbered categories match the slides and the project site's taxonomy.*\n", "creation_timestamp": "2026-05-31T20:52:14.000000Z"}