Skip to content

AC Analyst — Design Document, v0.1

Status: v0.1 design (2026-05-25). The first substantive design pass at Coframe's AI-powered analytical workspace. Captures architectural commitments, tool surface, context engineering, and harness-quality disciplines agreed in the strategic conversations of 2026-05-25.

Author: reeeneeee Audience: implementers of the v1 Analyst (slice A2–A8 follow this doc); reviewers checking architecture before code lands; future maintainers who need to understand load-bearing decisions.

Scope. This document specifies the AC Analyst — a per-AC AI-powered analytical workspace, sitting alongside Frame-QL, MCP, and HTTP API as one of the AC Surfaces (coframe_platform_design_v2_1_supplement.md §10). It covers architecture, tool surface, context engineering, conversation orchestration, artifact UX, ANALYST.md memo specification, eval/observability infrastructure, reproducibility, and the phased implementation plan.

Out of scope. The MCP server (a separate Surface that exposes the same tool surface to external clients — covered in a future design doc); cross-AC analytics (Pro tier, v2+); multi-tenant deployment (Pro tier); fine-tuned per-installation models (Pro tier); production telemetry dashboards (Pro tier). Where relevant, this doc notes the v1/Pro split per the commercial strategy memo (coframe_commercial_strategy_v0_1.md).

Working principle. AC Analyst is harness engineering applied to analytical work. Quality comes from doing all of context engineering, tool ergonomics, loop discipline, author grounding, observability, and reproducibility well — not from picking the right LLM. The design discipline throughout this doc reflects that.


1. The Claude Code parallel (load-bearing framing)

The AC Analyst is to the Coframe AC what Claude Code is to a codebase. The architectural pattern is identical; only the substrate differs.

Claude Code AC Analyst
Substrate LLM weights (hundreds of billions of parameters of frozen knowledge) The AC (structurally-complete analytical surface over the data)
State of substrate alone Can produce text but cannot do anything Can answer queries in principle but is inert in practice
What the harness adds Bounded tools (Read, Edit, Bash, Grep, Glob) + observe-think-act loop Bounded tools (list_families, execute_frame_ql, profile_column) + observe-think-act loop
What the harness doesn't add New knowledge — the model already knew everything New data — the AC already had everything
Where the value comes from Turning latent capability into applied work Turning latent data into applied insight
Size ratio Harness ~100K LOC vs model ~1T params (10^7×) Harness ~10K LOC vs AC + data ~TB (10^8×)
Audience the harness unlocks Anyone with a terminal — not just ML engineers Anyone with a question — not just SQL writers

The parallel isn't decorative. It's prescriptive. Things Claude Code gets right that the AC Analyst must:

  1. A small, deliberate tool surface. Claude Code has ~12 tools. Each one is a precisely one capability — none is a swiss army knife. AC Analyst targets ~12-15 tools, same discipline.
  2. Transparency as load-bearing. Every tool call in Claude Code is shown to the user. AC Analyst's equivalent: every Frame-QL the analyst runs, every QM card it consults, every refusal it received — all visible. Audit-by-default.
  3. The loop is the system. Claude Code's value isn't any single tool; it's the observe-think-act cycle. AC Analyst inherits this discipline.
  4. Bounded recursion + escape hatches. Claude Code has bounded turn counts, sleep-loop detection, self-correction prompts. AC Analyst must too.
  5. The substrate must be prepared. Anthropic builds models that are tool-use-trained. Coframe builds ACs that are structurally complete + constructively correct. The harness depends on the substrate's preparation.
  6. Author-supplied grounding (CLAUDE.md analog). Claude Code's CLAUDE.md mechanism is what makes generic Claude into a project-specific assistant. AC Analyst's analog — ANALYST.md at installation + AC levels — is the single highest-leverage feature in the design.

The position article frames this in measured terms: Coframe's substrate enables the analytical-agent layer. This design doc operationalizes that framing.


2. What AC Analyst is — and isn't

2.1 What it is

A per-AC AI-powered analytical workspace: a two-panel UI where the user converses with an AI analyst on the left and a stack of typed artifacts (Frames, charts, narrations, simulations, suggested follow-ups, etc.) accumulates on the right. The conversation is the primary artifact; Frames are moments within it. Sessions persist, branch, export, and replay. Every artifact carries its Frame-QL provenance, verification level, and trace ID.

The analyst speaks the AC's vocabulary natively (families, dimensions, anchors, verification levels) and uses a small set of AC-aware tools to introspect, query, narrate, and propose. The user does not need to know any of this vocabulary; the analyst translates.

2.2 What it is NOT

  • Not text-to-SQL. It doesn't generate SQL. It generates Frame-QL, which the resolver validates against the AC's structural commitments. The cost of a wrong query is refusal, not a silently-wrong number.
  • Not a chatbot. It uses tools to ground claims in actual data. Narration is grounded in Frames the user can inspect.
  • Not a single-shot translator. It engages in multi-turn dialogue: clarifies ambiguous intent, asks before destructive choices, surfaces verification caveats, proposes follow-ups.
  • Not a BI replacement. It complements the Workbench surfaces (Tables, Lineage, Verify, Declare, Engine, Attest, Query) — those remain available for power-user direct control. AC Analyst is the universal interface; Workbench is the manual interface.
  • Not a generic AI agent. It's scoped to one AC at a time (v1). It cannot edit the AC, mutate data, or perform actions outside the analytical-query surface.

2.3 The three operating modes

The Analyst's behavior spans a spectrum based on user intent:

Mode Trigger Behavior Latency
Translation User types something close to Frame-QL ("revenue by region for Q4") One-shot translate → execute → return ~2-4s
Guided User types vague intent ("how's my business doing?") Multi-turn: discover, clarify, propose, execute, narrate ~10-60s of dialogue
Analysis User states a business problem ("Q4 revenue looks weak, help me figure out why") Proactive exploration — multiple Frames, comparisons, narrations, suggested next steps minutes of dialogue

Modes are emergent, not declared. The orchestrator + system prompt + AC vocabulary together let the analyst pick the right mode for the user's intent.


3. The six harness-engineering principles

Operating at expert level changes the design discipline. Six concerns are co-equal and load-bearing:

3.1 Tool ergonomics

Each tool has strict input + output schemas; structured error formats with actionable context; in-description worked examples; an explicit read/mutate flag. Output is compact and machine-friendly — never a dump of a Pydantic model.

Bad: list_families() returns [MetricFamily(...full model...), ...] → 500 tokens of nested fields.

Good: list_families() returns [{name, kind, derived, anchor_locked, primary_schema, summary}, ...] → 50 tokens, exactly the fields the LLM needs to plan its next step.

3.2 Context architecture (with caching)

A 4-tier model. Each tier has different update cadence and cache strategy:

Tier Contents Cache strategy
L1 — Static System prompt + tool definitions + AC catalog summary Prompt-cached (Anthropic 5-min cache); recomputed when AC COMMITs
L2 — Author memo ANALYST.md (installation + AC, composed) Prompt-cached; recomputed when memo edits
L3 — Conversation Recent N turns verbatim; older turns summarized Cached up to most recent user message
L4 — Artifacts NOT in context by default; LLM holds references; injects on demand via recall_artifact(id) On-demand

Prompt caching is mandatory, not optional. The L1+L2 tiers are large and stable per-AC; without caching, every conversation pays the full ingestion cost per turn, which is both expensive (~$0.10-0.50/turn for a real-sized AC) and slow (~2-5s of latency on context ingestion alone). With caching, L1+L2 cost drops by ~90% on cache hits.

3.3 Loop discipline

Bounded turns per user intent (default 8). Tool-call deduplication (same tool + same args twice → fail-fast). Stuck detection at 5 consecutive tool calls without an artifact emitted to the user (orchestrator injects "are you stuck?" turn). Parallel tool calls when independent. These are not polish — they're what separates a working agent from one that loops or hallucinates under stress.

3.4 Author-supplied grounding (ANALYST.md)

The single highest-leverage feature. Per-installation ANALYST.md carries org-wide conventions (currency, fiscal calendar, exclusions, voice). Per-AC ANALYST.md carries AC-specific semantics (what families mean in business terms, scope, refusal patterns). Both auto-load. Hierarchical: AC overrides installation on conflict. Without ANALYST.md, the analyst is generic and frustratingly cautious. With it, the analyst sounds like it works at the company.

3.5 Observability + evals

Trace logging (every session is a structured log: messages, tool calls + timings, artifacts emitted, errors — replayable). Eval corpus (hand-curated (user_message, expected_tool_calls, expected_artifact_kinds) triples against the retail AC — runs on every commit). Quality metrics (turns-to-resolution, refusal rate, user-confirm-vs-reject, artifact-pin rate — surfaced in the Engine page). Prompt versioning (system prompt has a version; old sessions remain replayable against their generation-time prompt).

3.6 Reproducibility

Every artifact carries provenance (Frame-QL, served_from, verification level, tool call ID, conversation turn ID). Session export to JSON. Session forking from any artifact (continues conversation with that artifact as anchor). Crucial for exploratory analysis where users want to backtrack.


4. The two-panel workspace UX + artifact taxonomy

4.1 Layout

┌─────────────────────────────────────┬──────────────────────────────────────┐
│ DIALOGUE                            │ ARTIFACTS                            │
│                                     │                                      │
│ user: "show me revenue trends"      │  ┌────────────────────────────────┐  │
│                                     │  │ Frame · monthly revenue        │  │
│ analyst:                            │  │ ────────────────────────────── │  │
│   I'll look at monthly totals       │  │ [datagrid]                     │  │
│   across the 14-month window…       │  │ provenance ↓                   │  │
│   [tool: execute_frame_ql]          │  └────────────────────────────────┘  │
│   ↳ engine_cache hit                │  ┌────────────────────────────────┐  │
│   ↳ Frame ↑                          │  │ Chart · line                   │  │
│                                     │  │  [rendered Vega-Lite spec]     │  │
│   Trending up through Q3 then       │  └────────────────────────────────┘  │
│   dipping in November. Want me      │  ┌────────────────────────────────┐  │
│   to drill into that month?         │  │ Follow-ups (clickable)         │  │
│                                     │  │  • Break Nov by region         │  │
│ [user input box]                    │  │  • Compare to Nov 2024         │  │
│ ┌─────────────────────────────────┐ │  │  • Look at units instead       │  │
│ │ Yeah, break Nov by region      │ │  └────────────────────────────────┘  │
│ └─────────────────────────────────┘ │                                      │
│ [Send]                              │                                      │
└─────────────────────────────────────┴──────────────────────────────────────┘

Left is the conversation log + input. Right is the artifact stack — typed artifacts that accumulate as conversation progresses, persist across turns, can be pinned, branched from, exported.

4.2 Artifact kinds

The analyst's response is a typed list of artifacts. Each has a dedicated renderer:

Kind Renderer Carries
narration Markdown card The analyst's prose — explanation, observation, caveat. NEVER hallucinated; always grounded in a referenced Frame or tool result.
frame Datagrid (sortable, scrollable) A Frame with Frame-QL + served_from + verification level (collapsible "see query" / "see provenance")
chart Vega-Lite renderer Auto-suggested viz over a Frame; chart type chosen based on Frame shape (1 dim + 1 metric → bar; time + metric → line; 2 metrics → scatter)
qm_card Column-profile card Distinct values, cardinality, nunique — works for derived dims (via Unit A QM synthesis)
structure Small interactive graph FD-DAG or dimension-family hierarchy view
comparison Two-Frame side-by-side "This period vs last period"; structured diff
simulation Interactive scenario Sliders + parameterised Frame that recomputes (see §11)
followups Clickable chips Suggested next questions; clicking sends them back as user messages
refusal Diagnostic card When the AC can't answer (dubious, anchor-locked, etc.) — plain-language explanation + suggested reformulation
provenance Footer on any Frame Frame-QL, served_from badge, verification level — always accessible, surfaced on demand
clarification Plain question + multiple-choice chips When intent is ambiguous, the analyst asks before running

Each artifact has an ID (art_{session_id}_{turn}_{index}) the LLM can reference (recall_artifact(id)) and the user can pin / branch from / export.

4.3 The provenance commitment

This is the single most important UI commitment. Every Frame in the artifact stack has — accessible via a "provenance" expansion — the exact Frame-QL that produced it, the served_from indicator (cache vs backend), and the verification level (A/AA/AAA) the answer inherits. This is the auditability claim made visible. Without it, AC Analyst would be one more confident liar; with it, every numerical claim is reproducible by hand.

4.4 Session-level operations

Op Behavior
Pin Mark artifact as important; orchestrator de-prioritizes pruning
Branch Fork conversation from a specific artifact; new session inherits up to that point
Export Download session as JSON (messages + artifacts + traces); replayable on another machine
Share Generate URL pointing to a read-only view of the session
Clear Reset conversation (artifacts archived, not deleted)

5. Tool surface (the contract with the LLM)

Twelve tools, organized by function. Each entry specifies input schema, output shape, error formats, and a worked example. The discipline: every output is compact + structured + actionable.

5.1 Introspection tools

list_families()

Purpose: enumerate the AC's metric + dimension families. Read-only.

Input: none.

Output (compact):

{
  "metric_families": [
    {"name": "revenue", "kind": "metric", "derived": false,
     "anchor_locked": false, "primary_schema": "transactions",
     "ip_reducers": ["SUM"], "summary": "Transaction revenue, SUM-rollable"},
    {"name": "profit", "kind": "metric", "derived": true,
     "anchor_locked": false, "formula": "revenue - cost",
     "inputs": ["revenue", "cost"], "summary": "Derived: revenue - cost"}
  ],
  "dimension_families": [
    {"name": "geography", "members": ["store", "city", "region", "coarse_region"],
     "base_level": "store", "has_derived_dims": true,
     "summary": "Store geography; coarse_region derived from region"},
    ...
  ],
  "verification_level": "AA",
  "ac_filter": []
}

Errors: none (always succeeds for a loaded AC).

Example: see above.

describe_family(name)

Purpose: detailed view of one metric or dimension family. Read-only.

Input: {name: str}.

Output (varies by kind):

For a primitive metric family:

{
  "kind": "metric",
  "name": "revenue",
  "family_root": {"schema": "transactions", "column": "revenue"},
  "ip_reducers": [{"operator": "SUM", "a_block": []}],
  "cache_hint": {"materialize_at": [["region"], ["region", "day"]]},
  "sibling_schemas": ["region_daily_summary"],
  "verification_status": "verified"
}

For a derived metric family:

{
  "kind": "metric",
  "name": "profit",
  "derived": {"formula": "revenue - cost", "inputs": ["revenue", "cost"]},
  "inherited_ip_reducers": [{"operator": "SUM", "a_block": []}],
  "verification_status": "inherits_from_components"
}

For a dimension family:

{
  "kind": "dimension",
  "name": "geography",
  "base_level": "store",
  "members": ["store", "city", "region", "coarse_region"],
  "hierarchies": [{"name": "administrative", "path": ["store", "city", "region"]}],
  "derived_dimensions": [
    {"name": "coarse_region", "derived_from": "region",
     "mapping_size": 3, "default": null}
  ]
}

Errors:

{"error_kind": "unknown_family", "name": "revneue",
 "available": ["revenue", "cost", "profit", "units", ...],
 "suggestion": "Did you mean 'revenue'?"}

column_profile(schema, column)

Purpose: retrieve QM profile for a physical column or derived dim (the latter via Unit A synthesis). Read-only.

Input: {schema: str, column: str}. Use schema="<virtual>" for derived dims.

Output:

{
  "name": "region",
  "dtype": "string",
  "nunique": 3,
  "n_rows": 523023,
  "null_rate": 0.0,
  "cardinality_class": "very_low",
  "kind_hint": "dimension",
  "top_values": [
    ["East", 218401], ["Central", 175829], ["West", 128793]
  ]
}

Errors:

{"error_kind": "not_profiled", "schema": "transactions", "column": "region",
 "suggestion": "Run profile_table('transactions') first."}

discover_fd_dag(schema, columns?)

Purpose: data-attested FD-DAG over a schema's columns. Read-only.

Input: {schema: str, columns?: list[str]}.

Output: condensed FDDiscoveryReport — constants, keys, equivalence classes, reduced DAG edges.

verification_level(ac_name?)

Purpose: current AC verification level + which conditions contribute. Read-only.

Output:

{"level": "AA", "level_a_count": 12, "level_aa_count": 8, "level_aaa_count": 0,
 "outstanding": ["sibling_coherence:revenue@region_daily_summary"]}

5.2 Execution tools

propose_frame_ql(intent_summary)

Purpose: construct a Frame-QL string from an intent description WITHOUT executing it. Read-only. Useful for clarification dialogue before commitment.

Input: {intent_summary: str} — the analyst's own summary of what it thinks the user wants.

Output:

{
  "frame_ql": "SELECT region, SUM(revenue) AS r AT region",
  "justification": "Grain (region,) — geography family; metric revenue with SUM ip_reducer; no filter inferred",
  "would_resolve": true,
  "predicted_served_from": "engine_cache"
}

Errors (if proposal would fail to resolve):

{"error_kind": "would_refuse", "frame_ql": "...",
 "reason": "dubious — 'revenue' resolves to two siblings in different schemas",
 "suggested_disambiguation": "FROM transactions or FROM region_daily_summary"}

execute_frame_ql(frame_ql)

Purpose: parse + resolve + execute Frame-QL. Returns a Frame (or refusal).

Input: {frame_ql: str}.

Output (success):

{
  "frame_id": "art_sess_abc123_turn_3_0",
  "columns": ["region", "r"],
  "rows": [["East", 3868636.36], ["West", ...], ["Central", ...]],
  "row_count": 3,
  "served_from": "engine_cache",
  "verification_level": "AA",
  "frame_ql_executed": "SELECT region, SUM(revenue) AS r AT region"
}

Output (refusal — first-class, not an exception):

{
  "error_kind": "refusal",
  "refusal_kind": "dubious_query",
  "frame_ql": "SELECT revenue AT region",
  "reason": "'revenue' resolves to multiple sibling families; disambiguation required",
  "suggestion": "Add FROM clause: SELECT revenue AT region FROM transactions"
}

Refusals are the LLM's correction signal — they're how the resolver communicates "your query was structurally invalid; here's why; here's how to fix it." The LLM reacts by reformulating, not by hallucinating an answer.

5.3 Synthesis tools

narrate_frame(frame_id, user_intent)

Purpose: Frame → prose summary. Grounded: the LLM is given the Frame's actual data + the user's stated intent and asked to summarize. Cannot hallucinate.

Input: {frame_id: str, user_intent: str}.

Output:

{
  "narration": "Revenue trended up through Q3 then dipped in November...",
  "frame_id": "art_sess_abc123_turn_3_0",
  "tokens_used": 142
}

propose_followups(frame_id, conversation_summary)

Purpose: suggest 2-4 next questions based on current artifact + conversation context.

Output:

{
  "followups": [
    "Break the November dip down by region",
    "Compare November to November of last year",
    "Look at units sold for the same period"
  ]
}

recall_artifact(artifact_id)

Purpose: inject a prior artifact's full content into the current LLM context. Useful when the user refers back ("compare this to the chart from earlier").

Input: {artifact_id: str}.

Output: the artifact's full content (Frame data, narration text, etc.).

5.4 Simulation tools (v1 scope; see §11)

simulate_filter(frame_id, filter_clause)

Purpose: re-run the Frame's Frame-QL with an additional filter. E.g., "exclude West region from this view."

Input: {frame_id: str, filter_clause: str} — filter in Frame-QL WHERE syntax.

Output: new Frame artifact reflecting the filter.

simulate_perturbation(frame_id, column, perturbation)

Purpose: apply a value perturbation to a column in the existing Frame (post-output transform; doesn't re-execute).

Input: {frame_id: str, column: str, perturbation: {kind: "multiply"|"add", value: float, filter?: dict}}.

Output: new Frame artifact with the perturbation applied; provenance notes the perturbation.

simulate_goal_seek(frame_id, column, target_value, vary: str)

Purpose: solve for the value of vary that makes column reach target_value. Works for linear aggregates.

Output: new artifact stating the required change.

5.5 Tool design summary

Property Applied to all tools
Read-only by default All v1 tools are read-only — analyst cannot mutate the AC or data
Structured errors {error_kind, context, suggestion} — actionable by LLM
Compact outputs Tuned for LLM consumption, not human
In-description examples Each tool's docstring has worked example
Idempotent where possible Same call = same result
Parallel-friendly Independent tools can be called in parallel

6. Context architecture in detail

The 4-tier model with caching:

┌──────────────────────────────────────────────────────────────────────┐
│ L1 (static; prompt-cached, TTL ~hours/days)                          │
│                                                                       │
│  ┌────────────────────────────────────────────────────────────────┐  │
│  │ Base system prompt                                              │  │
│  │  - You are the AC Analyst...                                    │  │
│  │  - Your job: help the user understand their data                │  │
│  │  - Always ground claims in tool results                         │  │
│  │  - Refuse rather than guess; ask the user when ambiguous        │  │
│  │  - [Claude Code parallel commentary for the LLM]                │  │
│  └────────────────────────────────────────────────────────────────┘  │
│  ┌────────────────────────────────────────────────────────────────┐  │
│  │ Tool definitions (with worked examples)                         │  │
│  │  - 12-15 tools, full JSON schemas, examples                     │  │
│  └────────────────────────────────────────────────────────────────┘  │
│  ┌────────────────────────────────────────────────────────────────┐  │
│  │ AC catalog summary                                              │  │
│  │  - Family names + kinds (primitive/derived)                     │  │
│  │  - Dimension hierarchies                                        │  │
│  │  - Sample schemas                                               │  │
│  │  - Verification level summary                                   │  │
│  └────────────────────────────────────────────────────────────────┘  │
│                                                                       │
│  Cache key: hash(prompt_version, ac_version)                         │
│  Invalidated on: AC COMMIT, prompt version bump                      │
└──────────────────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────────────────┐
│ L2 (author memo; prompt-cached)                                      │
│                                                                       │
│  ANALYST.md composed (installation-level then AC-level)              │
│                                                                       │
│  Cache key: hash(installation_memo, ac_memo)                          │
│  Invalidated on: memo edits                                          │
└──────────────────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────────────────┐
│ L3 (conversation; cached up to last user message)                    │
│                                                                       │
│  Recent N turns verbatim (default N=10)                              │
│  Older turns summarized (one-paragraph each)                         │
│  Tool call results inlined                                           │
│                                                                       │
│  Cache key: rolling — cached up to most-recent stable prefix         │
└──────────────────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────────────────┐
│ L4 (artifacts; NOT in context by default)                            │
│                                                                       │
│  Artifact stack — Frames, charts, narrations, etc.                   │
│  LLM holds references only ("art_sess_abc123_turn_3_0")              │
│  Full content injected on `recall_artifact(id)` only                 │
│                                                                       │
│  Why not in context: Frame data can be huge (thousands of rows);    │
│  most artifacts are for the user, not the LLM                       │
└──────────────────────────────────────────────────────────────────────┘

6.1 Why L4 is critical

Frame data is the user's primary artifact, not the LLM's. A 5000-row Frame is interesting to the user (they can scroll, sort, filter); the LLM doesn't need to see all of it to plan the next step. The LLM holds an artifact reference + a summary stat (rows, columns); if the user asks about specifics, the LLM calls recall_artifact and gets the full content for that turn only.

This is the difference between a working agent and one that runs out of context window. Without L4 discipline, every Frame the analyst produces is in the LLM's working memory forever — 10-15 turns later, the analyst is paying for context it doesn't need.

6.2 Caching economics

With L1 (static) at ~5000 tokens and L2 (memo) at ~1500 tokens, a typical conversation pays the full ~6500-token ingestion cost on the first turn only. Subsequent turns pay ~50 tokens (the cache-read indicator) instead. For a 20-turn conversation, this is the difference between ~$2 and ~$0.20 in LLM cost. Multiply across a few hundred sessions/day and it's the difference between viable and not.

6.3 Prompt versioning

The L1 system prompt has a semantic version (prompt_v0.3.1). When we update the prompt, the version bumps. Old sessions remain replayable against their generation-time prompt — the trace log records the prompt version. This is what makes evals across model/prompt changes possible.


7. ANALYST.md — the author memo

7.1 Two-level hierarchy

Per the conversation on 2026-05-25:

installation.coframe/
  installation.yaml
  ANALYST.md                  ← Installation-level memo
                                (organizational conventions, cross-AC caveats)

retail.coframe/
  ac.yaml
  ANALYST.md                  ← AC-level memo
                                (this AC's family semantics, refusal scope)

Both optional. Hierarchical composition: installation memo loaded first; AC memo loaded second; AC overrides on conflict (more specific wins).

7.2 Installation-level memo example

# Analyst Memo — Retail Demo Installation

## Organizational conventions
- All currency: USD, no FX conversions
- Fiscal year starts in February (Q1 = Feb–Apr)
- Date format: ISO 8601 (YYYY-MM-DD)
- "Last quarter" means most recently *completed* fiscal quarter

## Cross-cutting caveats
- Store S012 was rebuilt 2025-06 → exclude from same-store-sales comparisons
- All metrics are net of returns unless stated otherwise

## Analyst voice
- Direct and concrete; avoid hedging language
- Surface verification levels when below AAA
- Cite Frame-QL in every numerical claim

7.3 AC-level memo example

# Analyst Memo — Retail Full AC

## Purpose
- Operational analytics across transactions, stores, products, inventory.
- NOT for forecasting (no model registered).
- NOT for individual-customer analytics (no customer dim).

## Family semantics (where names need interpretation)
- `revenue`: per-transaction sale amount before discounts
- `cost`: per-transaction unit cost (COGS-equivalent)
- `profit`: derived as `revenue - cost` (post-aggregation)
- `coarse_region`: 3-bucket coarsening — Atlantic / Pacific / Plains
- When users say "margin", interpret as `profit / revenue` at the output grain

## Recommended starting points
- Executive views: (region, month) grain
- Operational views: (store, day) grain
- Sub-transaction grain doesn't exist; refuse politely if asked

## Refusal scope
- Don't speculate about future periods
- Don't claim causation from correlation; suggest follow-ups instead
- Don't compare individual stores without normalizing for size

7.4 Composition mechanics

def build_analyst_memo(installation_path, ac_path) -> str:
    parts = []
    inst_memo = (installation_path / "ANALYST.md").read_text() if exists else None
    ac_memo = (ac_path / "ANALYST.md").read_text() if exists else None
    if inst_memo:
        parts.append("### Installation-level guidance\n\n" + inst_memo)
    if ac_memo:
        parts.append(
            "### AC-specific guidance (overrides installation on conflict)\n\n"
            + ac_memo
        )
    return "\n\n---\n\n".join(parts) if parts else ""

The composed text goes into L2 of the context. Conflicts are resolved by the LLM via the explicit "overrides on conflict" note (rather than programmatic merging) — same pattern Claude Code uses for multi-level CLAUDE.md composition.

7.5 Forward-compatible levels

The architecture should support (but v1 doesn't ship) user-level + session-level memos:

Level Contents When to add
User-level Per-user preferences ("show currency in millions") When multi-user installations land
Session-level "For this conversation, only look at East region" When session-pin features mature

Add by extending the loader chain: installation → user → AC → session. Same precedence rule.


8. Orchestration loop discipline

8.1 The loop

def serve_user_message(session, user_message: str) -> Response:
    session.append_message("user", user_message)
    turn_count = 0
    artifacts_this_intent = []

    while turn_count < MAX_TURNS_PER_INTENT:  # default 8
        # Build LLM context (L1 cached + L2 cached + L3 + tool defs)
        messages = build_context(session)

        # Call LLM with tools
        response = adapter.send(messages, tools=AC_TOOLS, model=route_model(turn_count))

        # If response is a message to user, emit + return
        if response.is_message_to_user:
            session.append_message("assistant", response.text)
            session.add_artifacts(artifacts_this_intent)
            return Response(text=response.text, artifacts=artifacts_this_intent)

        # If tool calls, execute (with dedup check)
        for tool_call in response.tool_calls:
            if is_duplicate(tool_call, session.recent_tool_calls()):
                inject_error(session, "duplicate_tool_call", tool_call)
                continue
            result = TOOLS[tool_call.name](**tool_call.args)
            session.append_tool_result(tool_call, result)

            # Some tool results produce artifacts
            if is_artifact_producing(tool_call):
                artifact = artifact_from_tool_result(tool_call, result)
                artifacts_this_intent.append(artifact)

        turn_count += 1

        # Stuck detection
        if turn_count >= STUCK_THRESHOLD and not artifacts_this_intent:
            inject_stuck_warning(session)

    # Exceeded MAX_TURNS — force the LLM to summarize what it has
    return force_summary_turn(session, artifacts_this_intent)

8.2 Bounds + escape hatches

Bound Default What it prevents
MAX_TURNS_PER_INTENT 8 tool calls between user messages Runaway agents
STUCK_THRESHOLD 5 consecutive tool calls with no artifact Tool-call loops
DEDUP_WINDOW Same tool + same args within last 3 calls → reject Confused retries
MAX_PARALLEL_TOOLS 4 tools called in parallel per turn Resource exhaustion
TOOL_TIMEOUT 30s per tool call Hung backends
MAX_CONTEXT_TOKENS 80% of model's context window OOM at LLM layer

8.3 Parallel tool calls

When the LLM requests multiple independent tools in one turn (e.g., column_profile for 3 columns), execute them in parallel. Claude supports this natively via the tool-use API. Reduces latency 2-4× for common discovery flows.

8.4 Stuck detection

After 5 consecutive tool calls with no artifact emitted to the user, inject a synthetic system message:

"Status: 5 consecutive tool calls without emitting an artifact. Are you stuck? Consider: - Asking the user for clarification - Producing a partial result and noting limitations - Refusing with a structured error if the AC can't answer"

This breaks loops where the LLM keeps probing without converging. Usually causes the next turn to be a user-facing message.


9. Confidence + verification-level signaling

9.1 Patterns

The analyst should be explicit about what it knows vs. what it's inferring. Concrete patterns:

Situation Analyst behavior
Ambiguous intent "I'm going to interpret this as X — does that match what you wanted?" (uses clarification artifact, not just plain text)
AC at Level A on this surface "This Frame relies on declared FD-edges that aren't fully data-attested yet. Take regional rollups as approximate." (provenance card shows level)
Anomaly in the data "I notice November is unusually low relative to the prior period. I don't know if this is a data issue or a real event." (doesn't speculate beyond what data shows)
Refusal received Pass through the resolver's structured error in plain language; suggest reformulation
Multiple valid interpretations Surface as clarification artifact with multiple-choice chips

9.2 Verification-level surfacing

Every Frame artifact carries the AC's verification level at execution time. The provenance card shows it. The analyst's narration mentions it when below AAA:

"Total revenue by region for Q4 was: [Frame]. Note: this AC is at verification Level AA — sibling-coherence for revenue against the region_daily_summary pre-aggregate hasn't been fully attested for Q4. The numbers are consistent within transactions; if the pre-aggregate disagrees, that's a data issue we'd surface separately."

This is the "AAA attestation" claim made operationally visible. The analyst is structurally incapable of presenting an A-attested number as if it were AAA, because the verification level is in the context.


10. Narration — grounded, never free

10.1 The grounding pattern

Narration is itself a tool: narrate_frame(frame_id, user_intent). The implementation:

def narrate_frame(frame_id: str, user_intent: str) -> str:
    frame = artifact_store.get(frame_id)
    # Build a focused narration prompt with the Frame's data + intent
    prompt = (
        f"User asked: {user_intent}\n\n"
        f"Frame produced:\n{frame.to_compact_text()}\n\n"
        f"Summarize this Frame in 2-3 sentences for the user. "
        f"Mention any notable patterns. Don't speculate beyond the data."
    )
    return adapter.send(prompt, model=NARRATION_MODEL).text

The LLM call for narration is separate from the planning LLM call. It has the Frame's data in hand. It cannot hallucinate "revenue went up" when the data shows a dip — because the data is in the prompt.

10.2 Why this matters

Without the grounding pattern, a multi-step analyst can be many turns away from the original tool result by the time it narrates. The Frame data has been summarized, the LLM is operating on a memory of a memory, and confabulation creeps in. Forcing narration through a dedicated tool with the original data attached prevents this.

10.3 Narration scope

The narration tool produces a narration artifact (separate from the Frame it describes). The user sees both: the Frame as a datagrid, the narration as a prose card. They cite each other (the narration includes the Frame's ID; the Frame's provenance includes any narrations it triggered).


11. Simulation — v1 scope

11.1 What's in v1

Four counterfactual patterns, each cheap to implement and high-value for analytical work:

Pattern Tool Mechanism Example
Filter substitution simulate_filter(frame_id, filter_clause) Rerun Frame-QL with additional WHERE "exclude West region"
Output perturbation simulate_perturbation(frame_id, column, perturbation) Post-output transform; doesn't re-execute "what if revenue grew 10% in East"
Goal-seek simulate_goal_seek(frame_id, column, target, vary) Solve for required input value "what would Q4 revenue need to be to hit $5M target"
Time-shift comparison simulate_filter with time-shifted clause Rerun with modified date WHERE "show as if it were Q4 last year"

11.2 What's deferred

Pattern When to add
Sensitivity sweep (vary X ± 20%, show Y impact) v2 — requires parameter binding + multi-Frame composition
Forecast / extrapolation (project next 3 months) v3 — requires statistical model
Counterfactual structural ("what if we had 20% more stores") v3+ — requires explicit data modeling

11.3 Simulation artifact

Simulation results are simulation artifacts. Each carries: - The source Frame's ID - The simulation type + parameters - The result Frame - Provenance: "this is a simulation, not an observed result"

The UI renders simulations distinctly from observed Frames (different border color, "SIM" badge). This is the same auditability discipline as provenance — the user should never confuse a simulated number with an observed one.


12. Eval framework + trace logging

12.1 Trace logging

Every session produces a structured trace:

{
  "session_id": "sess_abc123",
  "ac_name": "retail_full",
  "started_at": "2026-05-25T14:23:00Z",
  "prompt_version": "v0.3.1",
  "model": "claude-sonnet-4.5",
  "events": [
    {"t": 0.0, "kind": "user_message", "text": "show me revenue trends"},
    {"t": 0.4, "kind": "llm_response", "tool_calls": ["list_families"]},
    {"t": 0.6, "kind": "tool_result", "tool": "list_families", "tokens": 280},
    {"t": 0.8, "kind": "llm_response", "tool_calls": ["execute_frame_ql"]},
    {"t": 1.4, "kind": "tool_result", "tool": "execute_frame_ql",
     "frame_ql": "SELECT month, SUM(revenue) AS r AT month",
     "served_from": "engine_cache", "row_count": 14},
    {"t": 1.6, "kind": "artifact", "kind_detail": "frame", "id": "art_sess_abc123_turn_1_0"},
    {"t": 2.1, "kind": "llm_response", "tool_calls": ["narrate_frame"]},
    {"t": 2.6, "kind": "artifact", "kind_detail": "narration",
     "id": "art_sess_abc123_turn_1_1"},
    {"t": 2.7, "kind": "assistant_message",
     "text": "Revenue trended up through Q3 then dipped in November..."},
    ...
  ]
}

The trace is the unit of replay. Same trace + same prompt version + same model = bit-equivalent replay.

12.2 Eval corpus

Curated (user_message, expected_tool_calls, expected_artifact_kinds, expected_frame_ql_pattern) triples against the retail AC. Examples:

- name: "monthly revenue overview"
  user_message: "show me revenue trends"
  expected_tool_calls:
    - {tool: "list_families"}  # optional discovery first
    - {tool: "execute_frame_ql",
       frame_ql_pattern: "SELECT month, SUM\\(revenue\\) .* AT month"}
    - {tool: "narrate_frame"}
  expected_artifact_kinds: ["frame", "narration"]
  expected_artifact_count_range: [2, 4]
  expected_max_turns: 4

- name: "ambiguous revenue (cousin)"
  user_message: "what's revenue by region"
  expected_artifact_kinds: ["clarification"]
  # Should ask which schema (transactions vs region_daily_summary)
  expected_max_turns: 2

- name: "anchor-locked refusal"
  user_message: "average unit_price by region"
  expected_tool_calls:
    - {tool: "execute_frame_ql",
       frame_ql_pattern: "SELECT region, .*unit_price.* AT region",
       expected_outcome: "refusal"}
  expected_artifact_kinds: ["refusal"]

Run on every commit. Catches regressions when the prompt or model changes.

12.3 Quality metrics

Surfaced in the Engine page (alongside cache stats):

Metric What it measures
turns_per_intent (p50/p99) How quickly the analyst resolves user messages
refusal_rate Fraction of intents that end in refusal
user_confirm_rate When clarification artifacts are shown, fraction confirmed (vs reformulated)
artifact_pin_rate Fraction of artifacts the user pins (positive engagement signal)
branch_rate Fraction of sessions that fork — indicates exploratory use
cache_hit_rate (L1/L2) Prompt cache effectiveness
eval_pass_rate Fraction of eval corpus passing — tracked per prompt version

13. Reproducibility + sharing

13.1 Session export

GET /analyst/sessions/{id}/export
→ application/json
{
  "session_id": "sess_abc123",
  "ac_name": "retail_full",
  "prompt_version": "v0.3.1",
  "model": "claude-sonnet-4.5",
  "messages": [...],
  "artifacts": [...],
  "trace": [...]
}

Importable on another machine. Replayable against the same AC + same prompt version.

13.2 Session forking

POST /analyst/sessions/{id}/fork
{
  "from_artifact": "art_sess_abc123_turn_3_0"
}
→ new session_id

The new session inherits messages + artifacts up to the chosen artifact. The user can then continue the conversation in a new direction without losing context.

13.3 Session sharing

POST /analyst/sessions/{id}/share
→ {"share_url": "/sessions/shared/abc123def456"}

Read-only URL. Anyone with the URL can view the session (messages + artifacts + provenance). No write access; no continued conversation. For showing colleagues "here's what I found."


14. Model routing architecture (v1 = single; arch supports multi)

14.1 Per-task routing (post-v1)

Task class Model Why
Planning / tool selection / refusal Sonnet 4.5 / Opus Reasoning-heavy
Narration of a Frame Haiku Fast; just summarizing data
Follow-up generation Haiku Fast
Multi-step counterfactual Opus Hard reasoning

14.2 v1 implementation

Single model (Claude Sonnet 4.5 default). The adapter exposes a select_model(task_class) hook that v1 wires to a constant; v2 routes per task class.

14.3 Provider abstraction

class LLMAdapter(ABC):
    @abstractmethod
    def send(
        self,
        messages: list[Message],
        tools: list[ToolDef],
        model: str | None = None,
        cache_breakpoints: list[int] | None = None,
    ) -> LLMResponse:
        ...

    @abstractmethod
    def select_model(self, task_class: TaskClass) -> str:
        ...

class ClaudeAdapter(LLMAdapter): ...
class OpenAIAdapter(LLMAdapter): ...
class OpenLLMAdapter(LLMAdapter): ...  # for self-hosted Llama etc.

Per the commercial strategy memo: Claude is the default reference (Anthropic strategic alignment); GPT + OpenLLM ship as peer reference adapters in Core (model neutrality). Pluggable architecture from day 1.


15. Skills / extensibility (architecture only in v1)

15.1 The pattern

Per-installation custom tools, loaded from .coframe/analyst_skills.py:

# .coframe/analyst_skills.py
from coframe.runtime.analyst import skill

@skill(name="seasonality_check")
def check_seasonality(metric_family: str, dimension: str = "month") -> dict:
    """Report seasonality strength of a metric over a time dim."""
    ...

The orchestrator picks up analyst_skills.py at session start; registered tools become available to the LLM alongside the core tools.

15.2 What v1 ships

Architecture only — the skill registration mechanism exists, with no built-in skills. Vertical packs (retail seasonality, finance period-close, etc.) are Pro features (per commercial strategy memo §3).

15.3 Why architect for it now

If we ship skills as a v2 retrofit, every existing deployment needs migration. Building the registration hook into the v1 orchestrator is cheap; making skills work later is expensive. Standard architectural-affordance discipline.


16. LLM adapter — Claude default, pluggable architecture

Per coframe_commercial_strategy_v0_1.md §6 Decision B (pluggable + Claude default), the v1 adapter ships three reference implementations:

Adapter Status Default?
ClaudeAdapter Reference, fully tested Yes
OpenAIAdapter Reference, fully tested No
OpenLLMAdapter (Llama / local) Reference, basic No

Selection via installation.yaml:

analyst:
  enabled: true
  default_adapter: claude
  adapter_config:
    claude:
      api_key_env: ANTHROPIC_API_KEY
      default_model: claude-sonnet-4.5
    openai:
      api_key_env: OPENAI_API_KEY
      default_model: gpt-4o

User brings their own LLM key (env var). Coframe never holds keys in default deployment. This is the standard pattern (dbt Cloud, etc.) and the right operational posture.


17. Phasing — A1 through A8

A1 is this document. A2 through A8 are implementation slices, reshuffled per harness-engineering priorities (substrate concerns first, polish later).

A1 — Design doc (this document)

Scope: the document you're reading. ~1300 lines. Establishes architecture, tool surface, context model, harness disciplines, evals, reproducibility, phasing.

Deliverable: committed to drafts/.

Duration: ~3 hours (just landed).

A2 — Adapter abstraction + tool registry + AC-aware prompt builder

Scope: - LLMAdapter abstract + Claude reference + OpenAI reference - Tool typed registry; 12-15 core tools (per §5) - AC catalog → prompt summary builder (L1 content) - ANALYST.md loader + composer (L2 content) - Prompt caching wired through to Anthropic API - Tests for tool I/O schemas, error formats, prompt construction

Deliverable: coframe-runtime/src/coframe/runtime/analyst/{adapter.py, tools.py, prompt.py, memo.py} + tests.

Estimate: ~800 LOC + 30 tests; 1.5 days.

A3 — Orchestrator + artifact taxonomy + session state

Scope: - Session model: messages, artifacts, traces, state - Typed Artifact models (11 kinds per §4.2) - Orchestrator turn loop (per §8): tool dispatch, dedup, stuck detection, parallel tools, bounded turns, model routing hook - Trace logging infrastructure - Tests for loop discipline (each escape hatch tested explicitly)

Deliverable: analyst/{session.py, artifacts.py, orchestrator.py, trace.py} + tests.

Estimate: ~700 LOC + 25 tests; 1.5 days.

A4 — Narration grounding + simulation v1 + HTTP routes

Scope: - narrate_frame grounded narration pattern (per §10) - Simulation tools: filter, perturbation, goal-seek, time-shift (per §11) - HTTP routes: POST /analyst/sessions, POST /analyst/{id}/message, GET /analyst/{id}/state, POST /analyst/{id}/fork, POST /analyst/{id}/share, GET /analyst/{id}/export - Session persistence (in-memory store + serialization to .coframe/sessions/)

Deliverable: analyst/{narration.py, simulation.py, http.py, store.py} + tests.

Estimate: ~600 LOC + 20 tests; 1 day.

A5 — Workbench UI: two-panel workspace

Scope: - New top-level AnalystPage.tsx (fourth UI alongside Workbench/Management/Query) - DialoguePanel.tsx — message log + input - ArtifactStack.tsx — typed artifact renderers - Per-artifact renderers (11 kinds): Frame, Chart, Narration, QM, Followups, Refusal, Provenance, etc. - Vega-Lite renderer for chart artifacts - Sidebar navigation + cross-link from QueryPage

Deliverable: packages/coframe-frontend/src/analyst/ with AnalystPage.tsx + sub-components.

Estimate: ~700 LOC frontend; 1 day.

A6 — Eval framework + trace replay

Scope: - Eval corpus format (drafts/data/retail_demo/analyst_eval_corpus.yaml) - Eval runner: load corpus, run sessions, compare traces, report pass/fail - Trace replay infrastructure (deterministic replay with mocked LLM responses) - ~20 initial eval cases on the retail AC

Deliverable: analyst/eval/ + initial corpus.

Estimate: ~500 LOC + corpus; 1 day.

A7 — Retail demo polish + ANALYST.md authoring

Scope: - Write the retail AC's ANALYST.md files (installation + AC) - End-to-end retail-demo walkthrough doc (docs/analyst_demo_walkthrough.md) - Workbench tour wiring (how an author lands on AC Analyst as the default) - Quality metrics visible on the Engine page (per §12.3) - Performance tuning: prompt cache verification, parallel tool execution, narration latency

Deliverable: demo materials + UX polish.

Estimate: ~300 LOC + docs; 0.5 day.

A8 — Position article + Manual chapter update

Scope: - Manual gets a new Chapter 12 — The AC Analyst Surface (architecture, tool surface, ANALYST.md spec, eval/reproducibility) - Position article's "What this enables" section gets a brief update referencing the now-shipped AC Analyst as concrete instantiation - Cross-references to design doc, commercial strategy memo - Glossary updates

Deliverable: Manual + position-article updates.

Estimate: ~600 lines docs; 0.5 day.

Total

~5-6 days of focused work, ~3600 LOC + ~600 lines docs + ~100 tests + initial eval corpus. Bigger than the derived-metrics/dims arcs (~3-4 days) but proportional to the ambition.


18. Non-goals

Explicitly out of scope for v1, with notes on when each lands:

Non-goal When
Cross-AC analyst (one session reasoning over multiple ACs) Pro v2; needs careful design
Streaming token-by-token in narration v2; perfectly fine as batched in v1
Multi-modal artifacts (PDF, audio narration, embedded SQL view for power users) v2; v1 ships text + charts + interactive
Persistent session state across server restarts (today: in-memory + file dumps) v2 with proper DB
Production-grade telemetry dashboard for sessions Pro v2
Fine-tuned per-installation models Pro v3
User-level / session-level ANALYST.md memos When user accounts ship (Pro multi-user feature)
Built-in vertical skills (retail seasonality, etc.) Pro v2 — extensibility hook ships in v1
Real-time collaboration (multiple users in one session) v3+
Compute simulations (sensitivity sweeps, parametric scenarios) v2 simulation extension

19. Open questions

Questions to resolve before / during implementation:

  1. Tool naming consistency. I've used list_families, column_profile, execute_frame_ql — should we converge on a naming pattern (verb_noun? all snake_case?). Bikeshed-y but worth deciding once.
  2. Artifact ID format. art_{session_id}_{turn}_{index} is verbose but unambiguous. Alternative: short hashes. Decision affects how artifacts appear in narrations + share URLs.
  3. Cache breakpoint placement. Anthropic prompt caching needs explicit breakpoints. Default plan: breakpoint after L1, breakpoint after L2, no L3 breakpoint. Verify performance under realistic loads.
  4. Refusal pass-through detail. Some resolver refusals carry rich context (block-set conflict details, FD-reachability failures); how much detail does the analyst need to surface to the user vs. internalize? Probably full detail to LLM, summarized to user.
  5. Frame size limits. Should execute_frame_ql cap row count? E.g., refuse Frames > 10,000 rows with "this would be slow; add a filter or LIMIT." Or stream them. Decide before A4.
  6. Cross-session memory. v1 sessions are independent. Should we surface "your prior sessions" to the user even if the LLM doesn't see them? Probably yes — it's UX, not LLM context.
  7. Authentication / multi-user. v1 assumes single-user-per-deployment. Multi-user with per-user sessions is implied but deferred. Need to confirm v1 scope.
  8. Skill loading security. analyst_skills.py is arbitrary Python — same trust model as Claude Code's CLAUDE.md (trusted within installation). Confirm acceptable; alternative is sandboxed skill format (e.g., declarative JSON), which is more work.

Each is a small design decision; flag during implementation, capture in commit messages.


20. Summary

The AC Analyst is Coframe's analytical-agent harness — the operational realization of the "substrate for the analytical-agent layer" claim in the position article. It uses the AC's structural completeness as the substrate the way Claude Code uses a codebase + a language server: as a bounded, structured world the LLM can compose over with confidence.

The design discipline is harness-engineering, not LLM-engineering:

  1. Tool ergonomics — strict schemas, structured errors, compact outputs, worked examples
  2. Context architecture — 4-tier with caching; L4 keeps artifact data out of the LLM by default
  3. Loop discipline — bounded turns, dedup, stuck detection, parallel tools
  4. Author-supplied grounding — ANALYST.md at installation + AC levels, hierarchically composed
  5. Observability + evals — trace logging, replay, corpus, per-prompt-version evaluation
  6. Reproducibility — export, fork, share; every artifact carries provenance

The two-panel workspace (dialogue + artifact stack) is the UX shape. Eleven typed artifacts give the analyst rich expression beyond text. Sessions persist, branch, replay.

Per the commercial strategy memo: AC Analyst v1 ships in Coframe Core as the strategic flagship demonstrating "what Claude can do for enterprise data." Advanced features (custom skills library, multi-AC analyst, fine-tuned models, sensitivity simulation) belong in Pro v2+. The pluggable LLM adapter ships in Core with Claude as the default reference; GPT + OpenLLM are peer references (model neutrality).

Implementation is A2–A8, ~5-6 days of focused work, delivering an end-to-end AI Analyst experience at quality bar high enough to be publicly demoable. Once shipped, MCP server (separate Surface, exposing the same tools to external clients like Claude Desktop) becomes the natural follow-up — the wide-distribution complement to the high-quality inside experience.


End of AC Analyst v0.1 design.

Next: A2 implementation begins — adapter abstraction, tool registry, prompt builder.