{"uuid": "be37e4e2-c686-40b0-bcd5-7f964dd0fd0c", "vulnerability_lookup_origin": "1a89b78e-f703-45f3-bb86-59eb712668bd", "author": "9f56dd64-161d-43a6-b9c3-555944290a09", "vulnerability": "GHSA-mxfr-6hcw-j9rq", "type": "seen", "source": "https://gist.github.com/Julius-mercator/b78185acb27329feb3508768444bd583", "content": "# Chat-to-Dashboard over a Logistics DB \u2014 Research Report\n\n*Prior art, required layers, anti-patterns, architecture trade-offs, and evaluation \u2014 for building \"Atlas\" (a read-only conversational data-explorer + dashboard-builder over Mercator's multi-tenant logistics data).*\n\nScope of the underlying data (from the [WIP pipeline writeup](https://gist.github.com/Julius-mercator/4923ad4040445258565f505b3d941e72)): canonical, multi-tenant `po_line`/shipment/WIP rows keyed by a synthetic `entity_id`, with soft-deletes (`is_current`) and per-field change history (`field / old / new`). That shape matters and is referenced throughout.\n\n&gt; **Sourcing note.** Every accuracy number below is tagged with its provenance and reliability. Two widely-circulated figures were **adversarially refuted during research and are deliberately excluded** (Snowflake's \"90%+/2\u00d7 GPT-4o\" marketing claim; the \"ChatGPT 40% vs human 93% on BIRD\" stat). Vendor benchmarks are flagged as such. Treat all percentages as *directional snapshots*, not current SOTA.\n\n---\n\n## 0. The one lesson that frames everything\n\n**An LLM will happily produce SQL that runs cleanly and returns a number that is simply wrong.** This is the \"looks-right-but-wrong-number\" trust problem, and it's structural, not a tuning issue. Uber documented it in production: QueryGPT generated `WHERE status = 'Finished'` when the real enum value was `'Completed'` \u2014 *\"resulting in a successful run with no output\"* \u2014 and also emitted queries *\"with tables that don't exist or with columns that don't exist on those tables.\"* ([Uber QueryGPT](https://www.uber.com/en-GB/blog/query-gpt/))\n\ndbt put the contrast crisply in their 2026 benchmark: **\"With text-to-SQL, failure looks like a plausible but incorrect answer. With the Semantic Layer, failure looks like an error message.\"** ([dbt](https://docs.getdbt.com/blog/semantic-layer-vs-text-to-sql-2026)) A wrong answer that looks right is the worst possible failure for a logistics ops team making shipment decisions \u2014 so the entire architecture below is organized around *making wrong answers loud instead of silent.*\n\nEverything else follows from this:\n1. Constrain the model to a **curated semantic/metrics layer** instead of raw schema.\n2. **Decompose** generation into checkable steps (agentic), don't single-shot.\n3. **Confirm and clarify** with the human before committing to an answer (HITL).\n4. **Abstain** when a question is infeasible rather than guessing.\n5. **Evaluate** on your own domain data with semantic-equivalence checks, not vendor benchmark scores.\n\n---\n\n## 1. Prior art \u2014 what's been built, what worked, what failed\n\n### Uber \u2014 QueryGPT (the most useful published case study)\nFirst-party engineering blog *with negative disclosures*, which makes it the highest-signal source. ([Uber](https://www.uber.com/us/en/blog/query-gpt/))\n\n- **Started naive, it broke.** Single-shot text-to-SQL with flat RAG over all schema + SQL samples. *\"As we started to onboard more tables and their associated SQL samples into the service, we started seeing declining accuracy.\"* Plain similarity search on a raw NL prompt *\"doesn't return relevant results.\"*\n- **Fix = decompose into specialized agents** over a vector DB:\n  - **Intent Agent** \u2192 maps the question to a curated business-domain **\"Workspace\"** (e.g. Ads, Mobility, Core Services) to *narrow the retrieval radius* before searching.\n  - **Table Agent** \u2192 selects candidate tables, then **sends them to the user to ACK or edit** (human-in-the-loop) *before* SQL is generated.\n  - **Column Prune Agent** \u2192 an LLM call that drops irrelevant columns to shrink context.\n- **Unsolved:** a recursive self-correcting **Validation agent** *\"remains an area that we haven't completely solved yet.\"* So even Uber doesn't fully trust auto-generated SQL \u2014 they lean on the human ACK step.\n\n**Takeaways for Atlas:** scope retrieval by business domain (PO / shipment / WIP / vendor); decompose generation; gate on a human confirmation of *what data* will be queried before running.\n\n### Pinterest \u2014 text-to-SQL ([Pinterest Eng](https://medium.com/pinterest-engineering/how-we-built-text-to-sql-at-pinterest-30bad30dabff))\n- The hard part wasn't writing SQL \u2014 it was **finding the right tables**: *\"identifying the correct tables amongst the hundreds of thousands in our data warehouse is actually a significant challenge for users.\"*\n- Built a **RAG table-selection layer**: an offline job embeds **table summaries + historical queries** into a vector index (OpenSearch). Candidate tables are returned **to the user for validation/alteration before SQL generation** \u2014 same HITL pattern as Uber, arrived at independently.\n- **The cheapest accuracy lever was documentation.** Adding table documentation to the embeddings moved their (self-reported, *offline*) search hit-rate from **40% \u2192 90%**, increasing roughly linearly with the weight on docs. *Caveat: Pinterest itself called the eval data \"insufficient\" \u2014 treat as offline/directional.*\n- **Limitation they flagged:** HITL covers *table* selection only; generated SQL was *\"directly returned to the user without validation.\"*\n\n**Takeaway:** invest in high-quality table/column descriptions \u2014 it's the highest ROI, lowest-effort accuracy work you'll do.\n\n### Snowflake \u2014 Cortex Analyst (the semantic-layer argument)\n- Core thesis from their docs: *\"Generic AI solutions often struggle with text-to-SQL conversions when given only a database schema, as schemas lack critical knowledge like business process definitions and metrics handling.\"* Cortex Analyst *\"uses a semantic model to bridge the gap.\"* ([Snowflake docs](https://docs.snowflake.com/en/user-guide/snowflake-cortex/cortex-analyst))\n- Generation is constrained: the model is given *\"the metadata provided in the semantic model YAML file (table names, column names, value type, descriptions, etc.) only\"* \u2014 plus governed sample values \u2014 **never raw schema introspection**. The output is therefore bounded to the modeled business layer.\n- **Honest accuracy signal:** on Snowflake's *internal* real-world BI eval, GPT-4o *\"accuracy plummeted to 51%\"* \u2014 versus 90%+ headline numbers on the public Spider leaderboard. *(Vendor-reported, private 150-question set. Their splashier \"90%+/2\u00d7 GPT-4o\" marketing claim was refuted in research \u2014 not cited.)*\n\n**Takeaway:** a semantic layer is the single strongest published accuracy lever \u2014 *for questions within modeled scope.*\n\n### dbt \u2014 Semantic Layer vs text-to-SQL benchmark (2026)\n- Independent-of-Snowflake corroboration of the semantic-layer thesis: **semantic-layer accuracy 98.2% (Claude Sonnet 4.6) / 100% (GPT-5.3)** vs **raw text-to-SQL 90.0% / 84.1%**. ([dbt](https://docs.getdbt.com/blog/semantic-layer-vs-text-to-sql-2026))\n- The quotable insight (repeated from \u00a70): semantic-layer failure is an *error message*; text-to-SQL failure is a *plausible wrong answer*. *Caveat: dbt is itself a semantic-layer vendor and these are modeled-scope questions only.*\n\n### WrenAI / Vanna.ai / Dataherald (open-source build stacks)\n- **WrenAI** is built around the same lesson \u2014 *\"the semantic layer is essential for reliable text-to-SQL.\"* ([WrenAI](https://medium.com/wrenai/why-the-semantic-layer-is-essential-for-reliable-text-to-sql-and-how-wren-ai-brings-it-to-life-c54cc0e6e4bc)) Good reference architecture if you want to study a complete open implementation (semantic modeling + RAG + generation + viz).\n- **Vanna.ai** ([GitHub](https://github.com/vanna-ai/vanna)) is a lighter RAG-on-your-own-schema approach: you \"train\" it on DDL, documentation, and example question\u2192SQL pairs, and it retrieves those at query time. Useful pattern: *example SQL pairs are training data*, not just prompt decoration.\n\n### Databricks \u2014 Genie / AI-BI + multi-agent visualization\n- Relevant for the **charting** layer (see \u00a73, Layer 6): a Supervisor agent delegates to a Vega-Lite chart tool that *\"uses a concise JSON specification to declaratively describe charts, allowing agents to generate visualizations without writing imperative plotting code\"* \u2014 with schema-based validation for self-correction. ([Databricks](https://www.databricks.com/blog/bringing-visualizations-life-multi-agent-systems-vega-lite))\n\n### The commercial landscape (ThoughtSpot Sage, Looker + Gemini, Power BI Copilot, Tableau Pulse/Ask Data, QuickSight Q)\nThese didn't yield strong *published engineering post-mortems*, but the architectural convergence is the signal: **every mature commercial product sits on a curated semantic/metrics model** (ThoughtSpot's worksheets, Looker's LookML, Power BI's tabular model, Tableau's data model, QuickSight's topics). None of them point a raw LLM at a raw schema. That unanimity is itself evidence.\n\n---\n\n## 2. The benchmark-to-production gap (why vendor accuracy numbers don't transfer)\n\nThis deserves its own section because it's the most common scoping mistake.\n\n| Benchmark | What it measures | Best score (snapshot) | Why it matters |\n|---|---|---|---|\n| **Spider 1.0** | Cross-domain, *clean* academic schemas | top models **90%+**; GPT-4o ~86.6% | The number everyone quotes. Not representative. |\n| **Spider 2.0** | *Real* enterprise DBs (1,000+ columns, BigQuery/Snowflake), multi-step | o1-preview agent **21.3%** (vs 91.2% on Spider 1.0); newer agentic ~35\u201336% | The collapse is the point. ([Spider 2.0](https://arxiv.org/abs/2411.07763), [repo](https://github.com/xlang-ai/Spider2)) |\n| **BIRD** | 12,751 Q-SQL pairs, 95 DBs (33.4GB), 37 domains; *dirty values + external knowledge* | \u2014 | Mirrors your reality: dirty enums, free-text, domain rules. ([BIRD](https://bird-bench.github.io/), [paper](https://arxiv.org/pdf/2305.03111)) |\n| **Snowflake internal BI** | Real BI questions | GPT-4o **51%** (vendor) | Same model that scores ~87% on Spider. |\n\nThe durable lesson (not the exact percentages, which move monthly): **public scores overstate real-world performance by 30\u201370 points** once schemas get big and questions get realistic. Spider 2.0 tasks *\"frequently require understanding and searching through database metadata, dialect documentation, and even project-level codebases\"* and generating SQL *\"exceeding 100 lines\"* \u2014 which is *why single-shot is insufficient and agentic/multi-step exploration is required.*\n\n**Implication for Atlas:** do not scope confidence off any leaderboard. Build a domain eval set on your *actual* PO/shipment/WIP schema before you trust anything (\u00a76).\n\n---\n\n## 3. The necessary layers (and why each must exist)\n\nOrdered from data outward. Each layer's job is to *remove a class of wrong answers* the layer below can't.\n\n```\n User question\n      \u2502\n \u250c\u2500\u2500\u2500\u2500\u25bc\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n \u2502 L8  Observability + eval logging (every Q\u2192SQL\u2192result)      \u2502\n \u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500 \u2524\n \u2502 L7  Cache (semantic + result)                              \u2502\n \u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524\n \u2502 L6  Result \u2192 chart (structured Vega-Lite/recharts spec)    \u2502\n \u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524\n \u2502 L5  HITL clarification + abstention                        \u2502\n \u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524\n \u2502 L4  Validation &amp; guardrails (allow-list, EXPLAIN, limits)  \u2502\n \u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524\n \u2502 L3  Query planning / decomposition (intent\u2192table\u2192prune\u2192gen)\u2502\n \u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524\n \u2502 L2  Schema/metadata retrieval (RAG, domain-scoped)         \u2502\n \u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524\n \u2502 L1  Semantic / metrics layer  \u2190 biggest accuracy lever     \u2502\n \u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524\n \u2502 L0  Read-only execution + tenant isolation (RLS / RO role) \u2502\n \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n      \u2502\n   Postgres (po_line, shipments, WIP, change_history)\n```\n\n### L0 \u2014 Read-only execution + tenant isolation *(non-negotiable, build first)*\nThe database, not the prompt, is your last line of defense.\n- **Dedicated read-only role.** A Langroid advisory is the cautionary tale: an LLM SQL agent with a DB role that had code-exec/filesystem privileges was coerced via prompt injection into **remote code execution on the DB host** ([advisory](https://github.com/langroid/langroid/security/advisories/GHSA-mxfr-6hcw-j9rq)). Least privilege: read-only, scoped to allow-listed tables/views. *(You already have `mercator_atlas_ro` \u2014 keep it strictly SELECT-only and revoke everything else.)*\n- **Tenant isolation via Postgres Row-Level Security (RLS).** The key property for AI-generated SQL: *\"PostgreSQL enforces RLS policies automatically on every query, regardless of how the query was generated\"* \u2014 so **the LLM never needs to (and can't be trusted to) add `WHERE tenant_id = \u2026`**, and it physically cannot leak across tenants even if it hallucinates a join ([Crunchy Data](https://www.crunchydata.com/blog/row-level-security-for-tenants-in-postgres)). Pattern: per-request `SET app.current_tenant = \u2026` on a session variable that RLS policies read, or per-tenant DB roles.\n\n&gt; **Why this is Layer 0:** every layer above can fail, and L0 still guarantees no cross-tenant leak, no writes, no RCE. Defense that doesn't depend on the model behaving.\n\n### L1 \u2014 Semantic / metrics layer *(the biggest accuracy lever)*\nA curated model that sits between the raw schema and the LLM, defining **measures** (e.g. *on-time %*, *units in transit*, *avg ETA slip*), **dimensions** (vendor, region, delivery_month, REG/EXT), **joins**, **allowed enum values**, and **business definitions** \u2014 so the LLM picks a *metric*, not a raw column, and the layer expands it to correct SQL deterministically.\n\nWhy it's the top lever (\u00a71, \u00a72): schemas *\"lack business process definitions and metrics handling\"* (Snowflake); modeling them deterministically converts silent wrong answers into errors (dbt). For your data specifically:\n- **Soft-deletes:** define `is_current = true` *inside* the measure so \"open orders\" never silently includes retired rows. The LLM should never have to remember the flag.\n- **Change history:** model \"ETA slip\" as a first-class measure over your `field/old/new` history notes \u2014 this is a capability generic text-to-SQL stacks *don't* have, and it's a differentiator you get for free from your existing data model.\n- **Enums:** feed the *allowed* status values into the model so the LLM can't invent `'Finished'` vs `'Completed'` (the exact Uber bug).\n\n### L2 \u2014 Schema/metadata retrieval (RAG, domain-scoped)\nEven with a semantic layer, for anything beyond a handful of entities you must *retrieve* the relevant subset of metadata per question. Lessons: scope retrieval by business domain first (Uber's Workspaces) \u2014 flat similarity search degrades as you add tables; and **rich table/column documentation is the cheapest win** (Pinterest 40%\u219290%). Index semantic-model entities + curated example question\u2192SQL pairs (Vanna's pattern).\n\n### L3 \u2014 Query planning / decomposition (agentic, not single-shot)\nSpider 2.0 proves single-shot fails on real schemas. Decompose into checkable steps, mirroring Uber:\n`intent / domain classification \u2192 table selection (HITL) \u2192 column pruning \u2192 SQL generation \u2192 self-validation`.\nEach step is independently loggable and attributable when something goes wrong (\u00a76).\n\n### L4 \u2014 Validation &amp; guardrails\nBelt-and-suspenders on top of L0:\n- **Static SQL checks:** parse the generated SQL; reject non-SELECT, reject tables/columns not in the allow-list (catches the \"non-existent table\" hallucination before execution).\n- **`EXPLAIN` cost gate + `statement_timeout` + `LIMIT` injection + row caps** to kill runaway/expensive queries. Resource limits (CPU/memory/timeout) are standard guidance for agent tools ([OWASP LLM Prompt Injection Prevention](https://cheatsheetseries.owasp.org/cheatsheets/LLM_Prompt_Injection_Prevention_Cheat_Sheet.html)).\n- **Prompt-injection-via-data defense.** Your DB rows are *untrusted input* \u2014 a vendor name or free-text note could contain \"ignore previous instructions.\" Two patterns from the literature: **runtime authorization checks** that fire at the point of execution and *\"trust neither storage nor the model\"* ([Cisco](https://blogs.cisco.com/ai/prompt-injection-is-the-new-sql-injection-and-guardrails-arent-enough)); and the **dual-LLM pattern** \u2014 a *privileged* LLM that holds tools but never reads raw row content, and a *quarantined* LLM that reads untrusted content but cannot act ([Design Patterns for Securing LLM Agents, arXiv 2506.08837](https://arxiv.org/pdf/2506.08837)). For Atlas, the cleanest version: the LLM works over *metadata + aggregates*, and raw row text is rendered to the user, never fed back into the planning prompt.\n\n### L5 \u2014 Human-in-the-loop clarification + abstention\n- **Clarify ambiguity instead of guessing.** AmbiSQL: detect ambiguity, turn it into a *multiple-choice clarification question*, rewrite the query from the user's pick \u2014 accuracy on ambiguous queries **42.5% \u2192 92.5%** ([arXiv 2508.15276](https://www.arxiv.org/pdf/2508.15276)). Logistics is full of overloadable terms: \"late,\" \"open orders,\" \"in transit,\" \"this season\" \u2014 each is a clarification trigger.\n- **Confirm table/scope selection** before running (Uber + Pinterest both do this).\n- **Abstain on infeasible questions.** TrustSQL reframes the goal as *correct SQL for feasible questions **and** abstention on infeasible ones*; current systems *\"silently emit wrong SQL\"* for both ([arXiv 2403.15879](https://arxiv.org/html/2403.15879v6)). Build an explicit \"I can't answer that from this data\" path.\n\n### L6 \u2014 Result \u2192 chart (structured spec, not generated code)\nGenerate a **declarative chart spec (JSON)**, not plotting code. Databricks' lessons for why JSON specs beat free-form code ([Databricks](https://www.databricks.com/blog/bringing-visualizations-life-multi-agent-systems-vega-lite)):\n- **LLM-friendly:** *\"Compact specs are typically easier to generate and validate in constrained context windows.\"*\n- **Self-correcting:** *\"Schema-based validation enables fast correction.\"*\n- **Secure:** *\"Declarative JSON avoids risks of generated plotting code.\"*\n- **Portable:** renders consistently across surfaces.\n\nThe pipeline: validate result data \u2192 infer field types (categorical vs quantitative) \u2192 select chart type \u2192 construct spec. *Your stack uses recharts \u2014 same principle: have the LLM emit a validated recharts-config JSON, not React/JS code.* (Vega-Lite is worth considering precisely because it has a published JSON Schema you can validate/repair against; recharts you'll validate against your own typed config schema.)\n\n### L7 \u2014 Caching\nTwo tiers: **semantic cache** (normalized question \u2192 plan/SQL) and **result cache** (SQL \u2192 rows, with tenant-scoped keys and short TTLs given your weekly-updating data). Cuts cost and latency for the repeated questions an ops team asks daily.\n\n### L8 \u2014 Observability + eval logging\nLog every `question \u2192 intent \u2192 tables \u2192 SQL \u2192 result \u2192 (chart) \u2192 user feedback`. This is both your debugging trail and the raw material for your golden eval set (\u00a76). Capture thumbs-up/down and \"this number looks wrong\" signals explicitly.\n\n---\n\n## 4. What to avoid (anti-patterns, each tied to a failure mode)\n\n| Anti-pattern | Why it bites | Cheapest mitigation |\n|---|---|---|\n| **Raw schema \u2192 LLM, no semantic layer** | Business definitions get re-invented per query; silent wrong numbers | L1 semantic/metrics layer |\n| **Single-shot generation** | Collapses on real schemas (Spider 2.0 21%) | L3 decomposition |\n| **Flat RAG over all metadata** | Accuracy *declines* as you add tables (Uber) | L2 domain-scoped retrieval |\n| **Trusting that SQL runs = SQL correct** | The `'Finished'`/`'Completed'` class of bug | L4 self-validation + L6 show-the-SQL + L5 clarify |\n| **No tenant isolation in the DB** | One hallucinated join leaks another customer's data | L0 RLS / per-tenant role |\n| **Over-privileged DB role** | Prompt injection \u2192 RCE (Langroid) | L0 read-only, allow-listed |\n| **Feeding raw row text back into the prompt** | Prompt injection via data | L4 dual-LLM / quarantine |\n| **No query limits** | Runaway/expensive queries, DoS | L4 timeout + LIMIT + EXPLAIN gate |\n| **Guessing on ambiguous questions** | Confidently wrong (AmbiSQL: 42.5%) | L5 clarification questions |\n| **No abstention path** | Silent wrong answers on infeasible questions (TrustSQL) | L5 explicit \"can't answer\" |\n| **Scoping confidence off benchmark scores** | 30\u201370pt gap to production | \u00a76 your own eval set |\n| **LLM-generated plotting code** | Insecure, hard to validate | L6 structured chart spec |\n\n---\n\n## 5. Architecture decisions by case (the \"why X for Y\" matrix)\n\n**Semantic layer vs raw text-to-SQL**\n- *Semantic layer* when questions are **recurring, business-defined, and trust matters** (ops dashboards, KPIs, anything someone acts on). Cost: upfront modeling; only answers in-scope questions. \u2192 **This is the right default for Atlas.**\n- *Raw text-to-SQL* only for **exploratory, low-stakes, long-tail** questions a model can't cover \u2014 and even then, gate behind show-the-SQL + HITL. Best as a *fallback path* when the semantic layer abstains, not the primary path.\n\n**Single-shot vs agentic/multi-step**\n- *Single-shot:* small/simple schema, low latency need, trivial questions. Rarely sufficient for real enterprise data.\n- *Agentic:* large schema, multi-table joins, need to attribute failures, need self-correction. \u2192 Atlas, given your multi-table PO/shipment/WIP/history model.\n\n**Predefined metrics vs free-form SQL**\n- *Predefined metrics:* the trustworthy core. Curated, tested, fast, cacheable.\n- *Free-form:* the escape hatch for novel questions \u2014 always with the SQL visible and a results-may-be-wrong affordance.\n- Pattern: **metrics-first, free-form-fallback, abstain-last.**\n\n**Chart: structured spec vs DSL vs templates**\n- *Structured JSON spec (Vega-Lite/validated recharts config):* best general default \u2014 validatable, secure, portable.\n- *Predefined chart templates keyed off query shape* (1 dim + 1 measure \u2192 bar; time + measure \u2192 line): most reliable for the common cases; combine with structured-spec fallback for the rest.\n- *Generated plotting code:* avoid.\n\n**How multi-tenancy/RBAC changes the design**\n- Push isolation **into the database (RLS / per-tenant role)** so it's enforced regardless of generated SQL \u2014 never rely on the LLM to scope tenants.\n- Consider **per-tenant semantic models** if customers' data shapes diverge.\n- Apply **column/PII masking** at the role/view level, not in the prompt.\n\n---\n\n## 6. Evaluation \u2014 how to actually measure this\n\n**Benchmarks to know (for literacy, not for scoping your confidence):** Spider 1.0 (clean, saturated), **Spider 2.0** (realistic, brutal), **BIRD** (dirty data + domain knowledge), **TrustSQL** (reliability + abstention), **AmbiSQL/AmbiQT** (ambiguity).\n\n**Build your own harness \u2014 multi-dimensional, not execution-success-only.** Execution success is necessary but insufficient (a query can run and return wrong data). Uber's 5 dimensions are a good template ([Uber](https://www.uber.com/en-GB/blog/query-gpt/)):\n1. **Intent accuracy** \u2014 did it route to the right domain?\n2. **Table-overlap score** (0\u20131) \u2014 fraction of correct tables selected.\n3. **Successful execution** \u2014 did it run?\n4. **Output presence** \u2014 did it return &gt;0 rows?\n5. **Qualitative similarity** (LLM-judged, 0\u20131) vs a golden SQL.\n\nDecomposing eval per stage lets you attribute failures (intent vs table-selection vs generation).\n\n**Measure the \"wrong-number\" class specifically with semantic-equivalence / result-set matching**, not exact string match. Use **execution accuracy** (compare result sets of generated vs golden SQL) and **distilled test suites** to catch SQL that's syntactically different but semantically equal \u2014 and vice versa ([test-suite-sql-eval](https://github.com/taoyds/test-suite-sql-eval); formal-verification approaches like [SpotIT](https://www.arxiv.org/pdf/2510.26840)).\n\n**Weight reliability over raw accuracy (TrustSQL).** Penalize confident-wrong answers heavily \u2014 TrustSQL's scheme: +1 correct or correct-abstention, 0 for abstaining on a feasible question, **\u2212c for any wrong SQL or any attempt at an infeasible question** (a T5 model scored 57.8% unpenalized \u2192 **\u2212364.5%** at moderate penalty without abstention). For a logistics ops tool, a wrong shipment number is far costlier than \"I can't answer that\" \u2014 your eval should reflect that asymmetry.\n\n**Golden-set process:** seed from your L8 query logs (real questions), have a domain expert label golden SQL + expected result, and run the harness on every prompt/model/semantic-model change. This is the only number that should drive your go/no-go.\n\n**Trust UX (the eval that happens in the user's head):**\n- **Always show the generated SQL** (Pinterest does, even ungated).\n- **Confirm tables/scope** before running (Uber + Pinterest).\n- **Ask clarifying questions** on ambiguity (AmbiSQL).\n- **Abstain visibly** when infeasible (TrustSQL).\n- Surface the **semantic-layer definitions** used (\"on-time = delivered \u2264 ETA, current rows only\") so users can verify the *business logic*, not just the SQL.\n\n---\n\n## 7. Recommended shape for Atlas (concrete)\n\n1. **L0 first:** lock `mercator_atlas_ro` to SELECT-only on an allow-list of views; enforce tenant isolation with **Postgres RLS** keyed off a per-request session variable. Verify with a red-team test that no prompt can cross tenants or write.\n2. **L1 semantic layer** over your canonical rows: measures (*on-time %, units in transit, avg ETA slip, open POs*), dimensions (vendor, region, delivery_month, REG/EXT), with `is_current` baked into measures, **change-history modeled as first-class \"slip\" measures**, and allowed enum values declared.\n3. **Metrics-first agent** (L3): intent/domain classify \u2192 retrieve semantic entities (L2, doc-rich) \u2192 confirm scope (L5) \u2192 expand metric to SQL \u2192 self-validate (L4) \u2192 execute (L0) \u2192 structured chart spec (L6).\n4. **Free-form SQL fallback** only when the semantic layer abstains \u2014 SQL always visible.\n5. **Eval harness** (L6/\u00a76) seeded from real logged questions, scored multi-dimensionally with semantic-equivalence + abstention penalties.\n\n### Example flows\n\n**Happy path \u2014** *\"What's our on-time delivery rate for Tropic this season?\"*\n\u2192 intent: shipment-performance \u2192 retrieves `on_time_rate` measure + vendor/season dims \u2192 confirms scope (\"Vendor = Tropic, Season = SP26, current rows\") \u2192 semantic layer expands to SQL (RLS auto-scopes tenant, `is_current` baked in) \u2192 1 measure + 1 dim \u2192 bar/number card. *Trustable because the business definition was curated, not inferred.*\n\n**Edge (ambiguity) \u2014** *\"Show me late orders.\"*\n\u2192 \"late\" is ambiguous \u2192 clarification card: **(a)** ETA already passed, not delivered; **(b)** ETA slipped vs original; **(c)** past original delivery_month \u2192 user picks (b) \u2192 query rewritten against change-history \"slip\" measure \u2192 renders. *AmbiSQL pattern; avoids a confident wrong answer.*\n\n**Failure (infeasible / would-be hallucination) \u2014** *\"What's the profit margin per shipment?\"*\n\u2192 no `margin`/cost measure exists in the semantic model \u2192 **abstain**: \"I don't have cost/margin data \u2014 I can show freight cost or units shipped instead.\" *TrustSQL abstention \u2014 beats inventing a `WHERE`-clause number. Contrast with raw text-to-SQL, which would likely emit a plausible-looking but meaningless query (the Uber `'Finished'` failure mode).*\n\n---\n\n## 8. Open questions / where the literature is thin\n- **Per-tenant semantic models + column/PII masking** in a conversational setting \u2014 RLS handles rows; the masking/role design is yours to specify.\n- **Production golden-set maintenance** for a weekly-updating logistics domain \u2014 how to keep goldens fresh as data drifts.\n- **Result-set semantic-equivalence checks** tuned to your data (e.g. tolerance for ordering, nulls rendered as `\u2014`, soft-deleted rows).\n\n---\n\n## Sources (graded)\n\n**Primary \u2014 production case studies (highest signal, incl. negative disclosures):**\n- Uber QueryGPT \u2014 https://www.uber.com/en-GB/blog/query-gpt/ (and /us/en/blog/query-gpt/)\n- Pinterest text-to-SQL \u2014 https://medium.com/pinterest-engineering/how-we-built-text-to-sql-at-pinterest-30bad30dabff\n- Snowflake Cortex Analyst docs \u2014 https://docs.snowflake.com/en/user-guide/snowflake-cortex/cortex-analyst\n- Snowflake Cortex Analyst accuracy blog \u2014 https://www.snowflake.com/en/blog/engineering/cortex-analyst-text-to-sql-accuracy-bi/\n\n**Primary \u2014 benchmarks/papers:**\n- Spider 2.0 \u2014 https://arxiv.org/abs/2411.07763 \u00b7 https://github.com/xlang-ai/Spider2\n- BIRD \u2014 https://bird-bench.github.io/ \u00b7 https://arxiv.org/pdf/2305.03111\n- TrustSQL (reliability/abstention) \u2014 https://arxiv.org/html/2403.15879v6\n- AmbiSQL (ambiguity) \u2014 https://www.arxiv.org/pdf/2508.15276 \u00b7 AmbiQT https://arxiv.org/abs/2310.13659\n- Design Patterns for Securing LLM Agents vs Prompt Injection \u2014 https://arxiv.org/pdf/2506.08837\n- test-suite-sql-eval (semantic equivalence) \u2014 https://github.com/taoyds/test-suite-sql-eval\n- SpotIT (formal verification of eval) \u2014 https://www.arxiv.org/pdf/2510.26840\n\n**Secondary \u2014 vendor/engineering blogs:**\n- dbt Semantic Layer vs text-to-SQL (2026 benchmark) \u2014 https://docs.getdbt.com/blog/semantic-layer-vs-text-to-sql-2026\n- WrenAI semantic layer \u2014 https://medium.com/wrenai/why-the-semantic-layer-is-essential-for-reliable-text-to-sql-and-how-wren-ai-brings-it-to-life-c54cc0e6e4bc\n- Vanna.ai \u2014 https://github.com/vanna-ai/vanna\n- Databricks multi-agent Vega-Lite viz \u2014 https://www.databricks.com/blog/bringing-visualizations-life-multi-agent-systems-vega-lite\n- Crunchy Data \u2014 Postgres RLS for tenants \u2014 https://www.crunchydata.com/blog/row-level-security-for-tenants-in-postgres\n- OWASP LLM Prompt Injection Prevention \u2014 https://cheatsheetseries.owasp.org/cheatsheets/LLM_Prompt_Injection_Prevention_Cheat_Sheet.html\n- Cisco \u2014 prompt injection is the new SQL injection \u2014 https://blogs.cisco.com/ai/prompt-injection-is-the-new-sql-injection-and-guardrails-arent-enough\n- Langroid SQLChatAgent RCE advisory \u2014 https://github.com/langroid/langroid/security/advisories/GHSA-mxfr-6hcw-j9rq\n\n**Excluded (refuted during research \u2014 do not cite):** Snowflake \"90%+/2\u00d7 GPT-4o/+14% vs competitors\" marketing claim; \"ChatGPT 40.08% vs human 92.96% on BIRD.\"\n\n*Method: deep-research harness \u2014 5 search angles, 23 sources fetched, 113 claims extracted, 25 adversarially verified (3-vote), 23 confirmed / 2 refuted; supplemented with targeted searches on multi-tenancy/RLS, prompt-injection defense, charting, and semantic-equivalence eval.*\n", "creation_timestamp": "2026-06-16T03:07:04.000000Z"}