Coframe Core¶
The Complete Technical Manual¶
A framework for constructively-correct analytics: column-native construction over a shared coordinate space, with grammatical correctness guaranteed by rule and data integrity verified as a separate discipline.
Edition: Coframe Core, catalog coframe-core-catalog/1.0, data-API coframe-data-api/1.0
Compiled: May 2026
Table of Contents¶
Front Matter
Chapters
- Chapter 1: What Coframe Constructs
- Chapter 2: Foundations
- Chapter 3: The AC Specification
- Chapter 6: The Data-API Protocol
- Chapter 7: Data Quality and Structural Verification
- Chapter 8: Frame-QL
- Chapter 9: Query Resolution
- Chapter 10: The Operator Catalog
- Chapter 11: The Metric Engine
Appendices
Preface¶
Who this manual is for¶
This manual is written for engineers, analysts, and architects who work with analytical data: people whose daily work involves describing how to derive correct answers from collections of warehouse tables, semantic-layer models, or analytical platforms. The reader is assumed to know SQL well, to have encountered dimensional modeling (Kimball-style fact and dimension tables, conformed dimensions, slowly-changing dimensions), and to have at least passing acquaintance with the semantic-layer category (LookML, dbt MetricFlow, Cube, AtScale, or similar).
The reader is not assumed to know any specific academic literature on multidimensional data models. Where the manual draws on or differs from that literature, it cites and explains.
What you will be able to do after reading¶
After working through this manual you will be able to:
- Read and write Coframe Analytic Collections (ACs) for analytical workloads of moderate complexity.
- Declare the structural commitments — anchors, dimension families, metric families, operators, lineage — that allow Frame-QL queries to be answered constructively.
- Run the framework's data-quality and integrity-verification process against an AC and interpret the results.
- Write Frame-QL queries that describe desired output frames and understand why the framework accepts, refuses, or flags them.
- Reason about the trade-offs between Coframe's posture and the postures of adjacent tools (semantic layers, lakehouse query engines, BI tools).
The manual is a complete reference for Coframe Core. Pro features are mentioned where they extend Core's machinery; they are specified in a separate Pro manual.
The framework's posture¶
Five commitments shape how Coframe approaches analytical data. They are stated here so that the reader can hold them in mind as the rest of the manual unfolds.
Structural rigor is binary. An AC either honors the framework's structural commitments or it does not. There is no "permissive mode" in which the framework relaxes its checks to accept ill-formed declarations. Commitments that cannot be verified against data are reported explicitly. The framework's correctness guarantees follow from the rigor; weakening rigor would weaken the guarantees.
Grammar and semantics are separate. The framework reasons about grammar — the structural relationships among columns expressed through declared commitments. It does not reason about semantics — what the columns mean in the business sense. Whether revenue should include returns, whether customer_segment reflects current or historical classification, whether a particular ratio is the right indicator for the business question — these are the AC author's responsibility. The framework verifies that the AC author's structural declarations are internally consistent and data-attested; it does not interpret what the declarations are claims about.
Names are opaque labels. A column's name field is used by the framework only for equality comparison (family membership) and naming-function verification (when a naming function is declared). The framework does not parse names, look for substrings, or infer meaning from naming conventions. If an AC author wants the framework to know that two columns are related, that relationship must be declared structurally — through lineage, through shared family-name, through declared FD-edges. Conventions in naming are the AC author's tool for human readability; they are not framework signals.
Coframe describes data, not transactions. The framework's structural conditions are conditions on data: anchors, missingness, functional dependencies, partition-invariance, identity-preservation. These are properties the data either has or does not have, independent of any particular framework. Coframe does not invent these conditions; it formalizes them and makes them declarable and verifiable. An AC author whose data does not satisfy the conditions is not encountering a Coframe limitation — they are encountering a condition the data does not honor.
Coframe is the Analytic Layer — a peer category to the Semantic Layer, not a semantic layer. This deserves emphasis because the surface similarity to semantic-layer tools is real: both Coframe and semantic layers expose a logical analytical surface above physical tables; both let consumers query without writing joins; both centralize metric definitions. The difference is in what is centralized and how. Semantic layers centralize named metrics with operational logic — revenue: SUM(transactions.amount) — attaching business definitions to physical models. Coframe centralizes structural metadata — anchors, family-genealogy, dimension families, operator catalog — without bundling business definitions into metric declarations. Frame-QL queries answer themselves from the AC's structural commitments; they do not look up business logic embedded in named metric definitions. A reader from the semantic-layer category will find some familiar surface (query without joins) and substantial structural difference underneath. The two layers are complementary peers in a modern data stack: the semantic layer answers what does this metric mean?; the Analytic Layer answers is this derivation structurally correct? (See coframe_platform_design_v2_1_supplement.md §1 for the category framing in full.)
How to read this framework¶
Coframe has a strong guiding philosophy, and the rest of this manual states it in many forms — structural rigor, grammar over semantics, describe what you want rather than how to compute it. A philosophy this clear invites a particular mistake, and naming it here will save the reader time.
The principles in this manual are constraints the design satisfies, not formulas to re-derive the design from. It is tempting, once the philosophy clicks, to use it as a generator — to predict what Coframe "should" do by reasoning forward from the philosophy, to purify one's ACs and queries toward an aesthetic, or to reject a perfectly good construct because it "feels procedural" or "feels like SQL." This reads the principles backward. A principle used as a generator fiddles endlessly with surface while, paradoxically, missing substance: it will reject a clean, sanctioned construct for looking impure and wave through a genuinely disallowed one for feeling elegant.
The discipline that actually works is to apply the specific structural test the relevant chapter provides, not the philosophy's general flavor. The framework's distinctions are decidable, and each chapter hands the reader the test for its domain. A few of the load-bearing ones:
- Is a metric rollup valid in a given direction? Apply the partition-invariance test and the ip_reducer's block set (Chapter 2 §2.6, §2.8) — not the intuition that "summing feels right here."
- Can a quantity be navigated across grain, or is it anchor-locked? Apply the test of whether a partition-invariant ip_reducer exists for its family (§2.8.4) — not the sense that "it's a number, so it should aggregate."
- Is a computation expressible in Frame-QL? Apply the test of whether it describes a column's value at its own grain (expressible) or requires referencing other rows of the assembled result (not expressible in Core) — not the question of whether it "looks like SQL." (Chapter 8 §8.2.)
- Should a derived quantity be stored or computed? Apply the test of whether it can be reconstructed from navigable components at the query grain (Chapter 10 §10.8.6) — not the instinct to "expose everything for completeness."
Surface resemblance to SQL is not a defect; Frame-QL deliberately shares familiar surface (SELECT, FROM, WHERE) and differs in substance — what it queries (an AC, not tables) and what the author must know (the AC, not the backend). When in doubt about whether something belongs, reach for the chapter's test, not the manual's mood. The principles tell you why the tests are what they are; the tests tell you what to do.
Conventions¶
Italicized definitions like *Analytic Collection (AC)* introduce framework concepts on first use.
Code identifiers are backtick-formatted (`column_name`, `revenue`, `family-root`).
Hyphenation: family-name and family-root are hyphenated; siblings and cousins are not.
Operations between metrics use the notation m_pred --op--> m, with m_pred = (name_pred, A_pred) for the predecessor and m = (name, A) for the successor.
Cross-references use chapter-and-section numbers where unambiguous (e.g., "see §3.4"); chapter titles or numbers are used for cross-chapter references (e.g., "see Chapter 7" or "see ColumnSpec §3.7").
The retail AC example in the front matter is referenced throughout. Where a chapter needs a structural case not present in the retail AC, it constructs a minimal extension and notes the deviation.
A note on the manual's structure¶
The manual is organized in the dependency order of its ideas, not in the order an implementer would build the framework from scratch.
Chapter 1 establishes what Coframe constructs and why. Chapter 2 (Foundations) specifies the grammar-level primitives the construction rests on. Subsequent chapters elaborate: the AC specification format including ColumnSpec and schema.init (Chapter 3), the data-API protocol (Chapter 6), the data-quality and integrity-verification process (Chapter 7), the Frame-QL query language (Chapter 8), query resolution (Chapter 9), the operator catalog (Chapter 10), the optional Metric Engine acceleration substrate (Chapter 11), and the AC Analyst — an AI-powered analytical workspace built on top of the AC (Chapter 12). Appendices contain the BNF grammar and glossary.
The chapter numbering reflects the manual's deliberate scope: this manual specifies the Coframe Core framework. Workflow guidance for AC authoring (originally planned as Chapter 4) and the MCP Surface (originally a separate companion-document chapter) are addressed in separate companion documents — they are products built on the framework rather than parts of the framework's specification. The chapter numbers are preserved with gaps to make the boundary explicit.
A reader who is impatient may skim Chapter 1, read Chapter 2 carefully, then dip into later chapters as needed. A reader who wants the full argument should read Chapters 1 and 2 in order.
The Retail AC: A Worked Example¶
This artifact establishes the concrete retail Analytic Collection (AC) referenced throughout the manual. Subsequent chapters build their illustrations from this example.
Setting¶
A regional retailer operates several hundred stores across multiple countries. The retailer's data infrastructure consists of three primary tables maintained by separate teams:
- A transactions table recording each point-of-sale event.
- A stores table listing each store with its location and operational metadata.
- A store-monthly-inventory table from the supply-chain team, recording end-of-month inventory snapshots per store.
The retail AC selects columns from these three sources, organizes them into named dimension families and metric families, and declares the structural commitments that allow Frame-QL queries to be answered constructively with guaranteed correctness.
Dimension families¶
The AC declares four dimension families.
time dimension family¶
Multiple hierarchies share a common base level.
- Base level: `day`
- Hierarchies:
  - `calendar`: day → month → quarter → year
  - `fiscal`: day → fiscal_period → fiscal_quarter → fiscal_year
  - `week`: day → iso_week → iso_year
The base level day is the finest temporal AC-dimension. Coarser AC-dimensions (month, fiscal_quarter, etc.) are reachable from day through declared FD-edges.
geography dimension family¶
Single hierarchy.
- Base level: `store`
- Hierarchies:
  - `administrative`: store → city → region → country
store is the base level, identifying individual retail locations. Coarser geographic AC-dimensions roll up through the administrative hierarchy.
product dimension family¶
Single hierarchy.
- Base level: `sku`
- Hierarchies:
  - `merchandising`: sku → product_line → product_category → department
Each transaction line item references a sku; SKUs roll up into product lines, categories, and departments.
transaction dimension family¶
Single-level dimension family (no coarser levels).
- Base level: `transaction`
- Hierarchies: none
The transaction AC-dimension identifies each point-of-sale event. It serves as the finest grain for transactional metric families. Coarser anchorings of transaction-level metrics use other dimension families (time, geography, product) rather than coarser transaction-levels.
Schemas¶
The AC declares three schemas, each binding to a backend source.
transactions schema¶
Source: the backend transactions table.
ColumnSpecs (excerpt, in conceptual form):
| Column | Anchor `A` | Family | Operator | Lineage | Notes |
|---|---|---|---|---|---|
| `transaction` | `{self}` | `transaction` | `OBSERVED` | self | Grain-role here; AC-dimension; base level of transaction dimension family |
| `store` | `{transaction}` | `store` | `OBSERVED` | self | Non-grain-role here; AC-dimension (base level of geography, grain role in stores) |
| `sku` | `{transaction}` | `sku` | `OBSERVED` | self | Non-grain-role here; AC-dimension (base level of product) |
| `day` | `{transaction}` | `day` | `OBSERVED` | self | Non-grain-role here; AC-dimension (base level of time) |
| `revenue` | `{transaction}` | `revenue` | `OBSERVED` | self | AC-metric; observation-rooted |
| `units_sold` | `{transaction}` | `units_sold` | `OBSERVED` | self | AC-metric; observation-rooted |
| `cost` | `{transaction}` | `cost` | `OBSERVED` | self | AC-metric; observation-rooted |
Declared scope: non-degenerate on time, geography, product, transaction.
Note on roles: a column's AC-level trichotomy (AC-dimension / AC-attribute / AC-metric) is distinct from its role within a particular schema (grain-role or non-grain-role). In the transactions schema, store, sku, and day are in non-grain role — they reference dimension families without being unique per row — but at the AC level they are AC-dimensions, since each is in grain role somewhere or is a declared dimension-family member.
stores schema¶
Source: the backend stores table.
| Column | Anchor `A` | Family | Operator | Lineage | Notes |
|---|---|---|---|---|---|
| `store` | `{self}` | `store` | `OBSERVED` | self | Grain-role here; AC-dimension; base level of geography |
| `city` | `{store}` | `city` | `OBSERVED` | self | AC-dimension (geography member); FD-edge store → city |
| `region` | `{store}` | `region` | `OBSERVED` | self | AC-dimension (geography member); FD-edge city → region |
| `country` | `{store}` | `country` | `OBSERVED` | self | AC-dimension (geography member); FD-edge region → country |
| `store_open_date` | `{store}` | `store_open_date` | `OBSERVED` | self | AC-attribute (date-valued; not a coordinate, not in any dimension family) |
Declared scope: non-degenerate on geography; degenerate on time, product, transaction.
store_monthly_inventory schema¶
Source: the backend store_monthly_inventory table.
| Column | Anchor `A` | Family | Operator | Lineage | Notes |
|---|---|---|---|---|---|
| `store` | `{store, month}` | `store` | `OBSERVED` | self | Grain-role contributor |
| `month` | `{store, month}` | `month` | `OBSERVED` | self | Grain-role contributor |
| `eom_inventory` | `{store, month}` | `eom_inventory` | `SUM` | self | AC-metric; reducer-rooted; family-root |
| `peak_inventory` | `{store, month}` | `peak_inventory` | `MAX` | self | AC-metric; reducer-rooted; family-root |
The schema's grain is the composite (store, month): each row records one store's inventory state for one month.
Declared scope: non-degenerate on geography and time; degenerate on product and transaction.
Metric families¶
The AC's metric families illustrate the framework's main structural cases.
revenue — observation-rooted, fully partition-invariant¶
- Family-root: `revenue` at anchor `{transaction}` in `transactions` schema.
- Family-root operator: `OBSERVED`.
- ip_reducers: `(SUM, A_block = ∅)`.
- Properties: observation-rooted (data is the root); operator-attested (siblings at any reachable anchor confirm SUM rollup).
- Siblings: any sibling materialized at a coarser anchor (e.g., `revenue` at `{region, quarter}`) belongs to this family with lineage `(revenue, {transaction}, OBSERVED)` and op `SUM`.
This is the simplest case: a flow measure observed at the finest grain, fully additive in every direction.
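To make "fully additive in every direction" concrete, the following is a minimal sketch on toy data: rolling revenue up by SUM gives the same grand total whichever way the transaction rows are partitioned. The rows and the helper `rollup_sum` are illustrative assumptions, not part of any Coframe API.

```python
from collections import defaultdict

# Hypothetical transaction-grain rows: (transaction, region, quarter, revenue).
rows = [
    ("t1", "north", "2024Q1", 100),
    ("t2", "north", "2024Q2", 50),
    ("t3", "south", "2024Q1", 70),
    ("t4", "south", "2024Q2", 30),
]

def rollup_sum(rows, key_index):
    """SUM revenue into blocks keyed by the column at key_index."""
    blocks = defaultdict(int)
    for row in rows:
        blocks[row[key_index]] += row[3]
    return dict(blocks)

by_region = rollup_sum(rows, 1)    # partition the rows by region
by_quarter = rollup_sum(rows, 2)   # partition the same rows by quarter
grand_total = sum(row[3] for row in rows)

# Partition-invariance: summing the block sums recovers the same total.
print(sum(by_region.values()), sum(by_quarter.values()), grand_total)
```

Both partitions recover the same total, which is the property an empty block set asserts for SUM on this family.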
eom_inventory — reducer-rooted, semi-additive¶
- Family-root: `eom_inventory` at anchor `{store, month}` in `store_monthly_inventory` schema.
- Family-root operator: `SUM`.
- ip_reducers:
  - `(SUM, A_block = {time})`: SUM is safe across the geography dimension family but not across time.
  - `(MAX, A_block = ∅)`: MAX rollup is safe in any direction (peak inventory).
- Properties: reducer-rooted (no finer-grained sibling in the AC); operator-asserted under SUM (the framework trusts the AC author's declaration); operator-attested under MAX if any sibling exists.
- Block-set semantics: a query asking for "total inventory by region in 2024" via SUM rollup across time is refused as blocked; a query asking for "peak inventory by region in 2024" via MAX is accepted.
This case exercises the semi-additive machinery. The same metric supports two different ip_reducers, each with its own block set; the query resolver selects based on requested rollup direction.
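The block-set check can be sketched as a set intersection. The dictionary encoding and the function `check_rollup` below are hypothetical, chosen only to illustrate the accept/refuse behavior described above; the actual resolver is specified in Chapter 9.

```python
# Hypothetical encoding of the eom_inventory family's ip_reducers:
# each reducer maps to the dimension families it must not roll up across.
IP_REDUCERS = {
    "SUM": {"time"},   # A_block = {time}
    "MAX": set(),      # A_block = empty set
}

def check_rollup(reducer, rollup_families):
    """Return 'accepted' or a refusal reason for rolling up across the given families."""
    if reducer not in IP_REDUCERS:
        return f"refused: {reducer} is not a declared ip_reducer"
    blocked = IP_REDUCERS[reducer] & set(rollup_families)
    if blocked:
        return f"refused: {reducer} is blocked across {sorted(blocked)}"
    return "accepted"

# {store, month} -> {region, year} rolls up across geography and time.
print(check_rollup("SUM", ["geography", "time"]))
print(check_rollup("MAX", ["geography", "time"]))
# {store, month} -> {region, month} rolls up across geography only.
print(check_rollup("SUM", ["geography"]))
```

In this sketch the "total inventory by region in 2024" request is refused under SUM because the rollup crosses `time`, while the MAX request and the geography-only SUM request are accepted.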
peak_inventory — extremal, identity-preserving family¶
- Family-root: `peak_inventory` at anchor `{store, month}`.
- Family-root operator: `MAX`.
- ip_reducers: `(MAX, A_block = ∅)`.
- Properties: MAX is partition-invariant (monoidal on totally-ordered values) and identity-preserving in the sense that peak-of-peaks is still a peak. The family rolls up by MAX in any direction.
Note that peak_inventory and eom_inventory are different families despite drawing from the same underlying inventory data. The AC author has chosen to materialize them separately and declare them as separate family-roots. They are not siblings; they are not cousins (different names); they are different families.
gross_margin_pct — singleton, anchor-locked¶
This family is illustrative. It is included to exercise singleton handling, but it also illustrates a design default: an intensive ratio like this would normally not be stored as an AC-metric at all — it would be computed in Frame-QL via RATIO_OF(revenue, cost) or SUM(revenue)/SUM(cost) at the query grain. See the note below.
- Family-root: `gross_margin_pct` at anchor `{transaction}` in `transactions` schema (illustrative materialization).
- Family-root operator: `MAP_DIV` (multi-input).
- Lineage: tuple of predecessor snapshots, `(revenue, {transaction}, OBSERVED)` and `(cost, {transaction}, OBSERVED)`.
- ip_reducers: none. Anchor-locked.
- Properties: singleton; multi-input lineage; no rollup via name-preserving aggregation. A query requesting `gross_margin_pct` at any coarser anchor cannot navigate from this transaction-grain singleton (it is anchor-locked); it must reconstruct the ratio from `revenue` and `cost` rolled up to that anchor (a Frame-QL expression, not a family rollup).
This case exercises singleton handling and the boundary between in-family reasoning and cross-family computed expressions.
Why this would normally not be stored. A stored intensive ratio is anchor-locked: queryable only at {transaction} and unreachable from region, quarter, or category grain — exactly the grains at which margin is actually queried. Every such query reconstructs the ratio from revenue and cost anyway, so the stored singleton is navigationally inert. The framework's default is therefore to compute intensive ratios in Frame-QL (RATIO_OF), not store them. A stored ratio earns its place only when the Frame-QL expression is unavailable — a component is not in the AC, the ratio is a primary observation, or the denominator is non-navigable (Chapter 10 §10.8.6). gross_margin_pct is retained in this example AC purely to demonstrate the singleton structure; a production AC would express it in queries.
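A small numeric sketch shows why the stored ratio cannot stand in for the coarser-grain answer. The values are made up, and averaging is just one naive way a stored ratio might be "rolled up"; the point is only that it disagrees with reconstruction from the SUM-rolled components.

```python
from fractions import Fraction

# Hypothetical transaction-grain values: (revenue, cost).
transactions = [(100, 50), (10, 8)]

# Stored per-transaction ratios (the anchor-locked singleton).
stored_ratios = [Fraction(rev, cost) for rev, cost in transactions]

# Naive rollup: average the stored ratios.
naive = sum(stored_ratios) / len(stored_ratios)

# Reconstruction: roll revenue and cost up by SUM, then take the ratio.
reconstructed = Fraction(
    sum(rev for rev, _ in transactions),
    sum(cost for _, cost in transactions),
)

print(naive, reconstructed)  # 13/8 55/29
```

The two results differ because the naive average weights a small transaction the same as a large one; the reconstruction is what a ratio of rolled-up `revenue` and `cost` yields at the coarser grain.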
avg_basket_size — intensive, anchor-locked¶
This family is illustrative: it is not materialized in any of the three retail-AC schemas above. It is described here to exercise the anchor-locked case. An AC author wanting it would add a schema (e.g., a store_daily_summary table) materializing it at {store, day}.
- Family-root: `avg_basket_size` at anchor `{store, day}` (illustrative; would require a materializing schema).
- Family-root operator: `AVG`.
- ip_reducers: none (AVG is not partition-invariant).
- Properties: anchor-locked; the family has no ip_reducer. Cross-grain navigation refused.
A query asking for avg_basket_size by region is refused unless avg_basket_size is independently materialized at {region, day} (creating a cousin) or computed inline from revenue and a transaction-count metric at the target grain.
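The failure of partition-invariance for AVG can be seen on two made-up store-day rows. The sketch assumes a hypothetical transaction-count column alongside revenue; neither the data nor the variable names come from the retail AC.

```python
from fractions import Fraction

# Hypothetical store-day rows in one region: (store, revenue, transaction_count).
store_days = [("s1", 900, 90), ("s2", 100, 2)]

# avg_basket_size as it would be materialized at {store, day}.
avg_by_store = {store: Fraction(rev, n) for store, rev, n in store_days}

# Averaging the averages ignores how many transactions each store had.
avg_of_avgs = sum(avg_by_store.values()) / len(avg_by_store)

# Inline computation at the target grain from navigable components.
region_avg = Fraction(
    sum(rev for _, rev, _ in store_days),
    sum(n for _, _, n in store_days),
)

print(avg_of_avgs, region_avg)  # 30 250/23
```

The average of averages (30) is far from the regional average basket (250/23, about 10.9), which is why the family declares no ip_reducer and the inline computation uses revenue and a count instead.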
Cross-cutting structure¶
FD-DAG¶
Each dimension family's hierarchy contributes FD-edges to the AC's FD-DAG:
- Within `time`: day → month → quarter → year; day → fiscal_period → fiscal_quarter → fiscal_year; day → iso_week → iso_year.
- Within `geography`: store → city → region → country.
- Within `product`: sku → product_line → product_category → department.
- Within `transaction`: no FD-edges (single-level).
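The FD-edges listed above can be written down as an adjacency map and traversed to find which coarser levels are reachable from a base level. This is a sketch only: the dictionary layout and the function `reachable` are assumptions for illustration, not the framework's internal representation.

```python
# FD-edges from the retail AC's declared hierarchies (finer -> coarser).
FD_EDGES = {
    "day": ["month", "fiscal_period", "iso_week"],
    "month": ["quarter"],
    "quarter": ["year"],
    "fiscal_period": ["fiscal_quarter"],
    "fiscal_quarter": ["fiscal_year"],
    "iso_week": ["iso_year"],
    "store": ["city"],
    "city": ["region"],
    "region": ["country"],
    "sku": ["product_line"],
    "product_line": ["product_category"],
    "product_category": ["department"],
}

def reachable(start):
    """Return every level reachable from start by following FD-edges."""
    seen, stack = set(), [start]
    while stack:
        for nxt in FD_EDGES.get(stack.pop(), []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

print(sorted(reachable("store")))   # ['city', 'country', 'region']
print(reachable("transaction"))     # set()
```

Because there are no cross-family edges, nothing in `geography` is reachable from `day`, and `transaction` reaches nothing at all.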
Cross-family FD-edges are absent in this AC. Transactions reference geography and product through non-grain-role AC-dimension columns (store, sku) whose values identify positions in those dimension families; the relationship between a transaction and its store/sku/day is captured at the schema level (these columns are anchored at {transaction}), not through cross-family FD-edges between dimension families.
Operator-attested vs. operator-asserted¶
- `revenue`: operator-attested (any sibling at a coarser anchor verifies SUM rollup against the root).
- `eom_inventory` under SUM: operator-asserted (no finer-grained sibling exists in the AC; the framework trusts the AC author's declaration of SUM as the ip_reducer; the block set `A_block = {time}` reflects the AC author's commitment that SUM is safe only across geography).
- `eom_inventory` under MAX: operator-attested if a MAX-rolled sibling is materialized; operator-asserted otherwise.
- `peak_inventory` under MAX: depends on whether siblings exist.
- `gross_margin_pct`: singleton; operator-attestation does not apply (no rollup lineage upward).
- `avg_basket_size`: anchor-locked; no rollup; no attestation question.
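Attestation of a SUM rollup can be pictured as recomputing the rollup from the root and comparing it with the materialized sibling. The sketch below assumes a hypothetical sibling of `revenue` at `{region}` held as a plain dictionary; `attest_sum` is an illustrative helper, and the real verification process is the subject of Chapter 7.

```python
from collections import defaultdict

# Hypothetical root rows at {transaction}: (transaction, region, revenue).
root = [("t1", "north", 100), ("t2", "north", 50), ("t3", "south", 70)]

def attest_sum(root_rows, sibling):
    """Return the regions where the sibling disagrees with SUM over the root."""
    rolled = defaultdict(int)
    for _, region, revenue in root_rows:
        rolled[region] += revenue
    # Compare over the union of keys so missing regions are reported too.
    return sorted(
        key for key in rolled.keys() | sibling.keys()
        if rolled.get(key) != sibling.get(key)
    )

print(attest_sum(root, {"north": 150, "south": 70}))  # []
print(attest_sum(root, {"north": 150, "south": 75}))  # ['south']
```

An empty result means the data attests the declared operator; a non-empty one pinpoints where the sibling and the root disagree. With no sibling to compare against, as for `eom_inventory` under SUM, there is nothing to check and the declaration can only be asserted.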
Cousins and the dubious-query mechanism¶
Suppose the AC also materializes revenue at {region, day} in a separate schema regional_daily_summary, but the AC author does not declare its lineage as (revenue, {transaction}, OBSERVED) — declaring it instead as a self-referential observation root.
This creates a cousin of the transaction-level revenue: same name (revenue), different family-root. A query referencing revenue at the regional-daily grain has two candidate sources (rollup from the transaction sibling vs. direct observation from the regional cousin). The framework refuses the query as dubious and requires disambiguation.
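As a simplified illustration of the mechanism, the sketch below flags a name as dubious when its columns trace back to more than one family-root. The tuple layout and `classify` are hypothetical, and the check is coarser than the framework's: it works per name rather than per requested grain.

```python
# Hypothetical declarations: (name, anchor, family_root), where family_root is
# the (name, anchor, operator) snapshot the column's lineage traces back to.
ROOT_TXN = ("revenue", frozenset({"transaction"}), "OBSERVED")
ROOT_REGIONAL = ("revenue", frozenset({"region", "day"}), "OBSERVED")
ROOT_COST = ("cost", frozenset({"transaction"}), "OBSERVED")

columns = [
    ("revenue", frozenset({"transaction"}), ROOT_TXN),
    # Declared as its own observation root instead of a sibling of ROOT_TXN.
    ("revenue", frozenset({"region", "day"}), ROOT_REGIONAL),
    ("cost", frozenset({"transaction"}), ROOT_COST),
]

def classify(name):
    """A name is 'dubious' when its columns trace back to more than one family-root."""
    roots = {root for col_name, _, root in columns if col_name == name}
    if not roots:
        return "unknown"
    return "dubious" if len(roots) > 1 else "unambiguous"

print(classify("revenue"))  # dubious
print(classify("cost"))     # unambiguous
```

Declaring the regional column with lineage `(revenue, {transaction}, OBSERVED)` would give both `revenue` columns the same family-root, making them siblings and removing the ambiguity.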
The cousin case in this AC is constructed deliberately to illustrate the mechanism; well-formed ACs typically declare such columns as siblings (with proper lineage) rather than as cousins.
What this example exercises¶
This AC is intentionally small but covers the framework's main structural cases:
| Concept | Where exercised |
|---|---|
| Anchors and the (A, M) pair | All columns |
| AC-dimension / AC-attribute / AC-metric trichotomy | All columns |
| Dimension families with multiple hierarchies | time |
| Dimension families with single hierarchies | geography, product |
| Single-level dimension families (no hierarchy) | transaction |
| Base levels | day, store, sku, transaction |
| Observation-rooted families | revenue, units_sold, cost |
| Reducer-rooted families | eom_inventory, peak_inventory |
| Semi-additive measures via block sets | eom_inventory under SUM |
| Multiple ip_reducers per family | eom_inventory (SUM and MAX) |
| Identity-preserving vs. non-identity-preserving operators | SUM preserves family (revenue → revenue); COUNT does not (revenue → revenue_count) |
| Singleton multi-input families | gross_margin_pct |
| Anchor-locked families | avg_basket_size |
| Operator-attested vs. operator-asserted | revenue vs. eom_inventory under SUM |
| Cousin disambiguation | revenue cousin if mis-declared |
| Schema scope and declared degeneracy | All three schemas |
Subsequent chapters of the manual reference this AC by name when illustrating concepts. Where a chapter needs a structural case not present here, it constructs a minimal extension and notes the deviation.
Chapter 1: What Coframe Constructs¶
The framework's purpose: a constructive algorithm for answering rich analytical queries against curated collections of data, with guaranteed correctness derived from declared and verified structural commitments.
1.1 The analytical-query problem¶
Consider an analytical question a retail business might ask:
Total revenue by region by quarter for the last three years, comparing each quarter to the same quarter the year before, filtered to the consumer-electronics department, with rollups that correctly handle stores that opened or closed during the period.
This is a representative analytical query. It is not unusually complex; questions like it arise daily in any data-driven business. But producing a correct answer involves a sequence of decisions, each of which can quietly go wrong:
- The revenue figures must come from transaction-level data, summed correctly to the (region, quarter) grain.
- The same-quarter-prior-year comparison requires aligning two grain levels of the time dimension correctly.
- The department filter requires connecting product-level transaction lines to the merchandising hierarchy.
- The region rollup must handle store-to-region mapping, including stores that changed regions or did not exist for parts of the period.
- The result must be defensible: a colleague asking "is this the right number?" should receive a structural answer, not a hopeful one.
The question is rich. So is the space of ways to get it wrong.
A data engineer producing this answer in SQL today has substantial latitude. They choose which tables to join, which keys to join on, how to handle null timestamps, whether to filter before or after aggregation, whether to use the calendar quarter or a fiscal one. Each choice can be defended; some choices silently produce wrong answers. The query author carries the entire correctness burden.
A semantic-layer tool (LookML, Cube, MetricFlow, AtScale) moves some of this burden into named metric definitions: revenue is defined once, in one place, and consumers reference the named metric without writing the underlying SQL. But the correctness of cross-grain rollups, cross-table coherence, and dimensional consistency still depends on the model author's care in defining the metric and on the engine's interpretation of the named metric's operational logic.
Coframe takes a different stance. Its purpose is to find the minimum set of grammar-level structural commitments that, when declared on a curated collection of columns and verified against the underlying data, allow the constructible facets of a query like the one above to be answered constructively and with guaranteed correctness. "Constructively" means by an algorithm that the framework executes; "guaranteed correctness" means the answer is a consequence of the declared and verified commitments, not of any semantic interpretation the framework supplies. (As §1.5 will show, one facet of the query above, the year-over-year comparison, falls outside Coframe Core's scope; the framework constructs what it can guarantee and is explicit about where its scope ends. That boundary is itself part of the stance.)
The rest of this chapter explains what that stance means and why it is structurally distinctive.
1.2 The traditional settling point: queries against tables¶
For most of the history of analytical computing, the unit of querying has been the table. A query author selects columns from one or more tables, joins them on declared keys, filters and aggregates, and produces a result. This is what SQL was designed for, and it remains the dominant paradigm in warehouses, lakehouses, and OLAP engines.
This settling point has consequences for where correctness lives. When the unit of querying is the table:
- The query author must know which tables exist, what each table's grain is, and which columns are joinable to which others.
- The query author must choose joins. Different join choices produce different results, sometimes silently. A LEFT JOIN where an INNER JOIN was needed yields a different aggregate; an extra join multiplies row counts.
- The query author must understand semi-additive measures. Summing inventory across stores is correct; summing the same inventory across months is wrong. SQL does not stop the author from writing the wrong sum.
- The query author must reconcile pre-aggregated tables with detail tables. If `daily_revenue_by_store` was pre-computed from the transactions table, the author must trust (or verify) that the two agree. SQL provides no machinery for this.
- The query author must handle conformance. If two tables both have a `region` column, the author must determine whether the values agree, whether the columns refer to the same regional taxonomy, whether they were populated at consistent times.
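The semi-additive pitfall is easy to reproduce. The sketch below uses Python's standard `sqlite3` module with a made-up four-row inventory table; the table mirrors the retail example's column names, but the data is invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE store_monthly_inventory "
    "(store TEXT, month TEXT, eom_inventory INTEGER)"
)
conn.executemany(
    "INSERT INTO store_monthly_inventory VALUES (?, ?, ?)",
    [("s1", "2024-01", 100), ("s1", "2024-02", 110),
     ("s2", "2024-01", 40), ("s2", "2024-02", 50)],
)

# Summing across stores within one month is a meaningful total.
jan_total = conn.execute(
    "SELECT SUM(eom_inventory) FROM store_monthly_inventory "
    "WHERE month = '2024-01'"
).fetchone()[0]

# Summing across months counts the same stock twice, yet SQL runs it anyway.
all_months = conn.execute(
    "SELECT SUM(eom_inventory) FROM store_monthly_inventory"
).fetchone()[0]

print(jan_total, all_months)  # 140 300
conn.close()
```

Both statements are valid SQL and both return a number; nothing in the engine distinguishes the meaningful total from the double-counted one.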
A skilled data engineer handles these concerns competently. Many are second nature. But the database engine does not enforce them. The engine's correctness guarantee is operational: the query, as written, produces a result consistent with the operational semantics of its operators. Whether that result is what the business wants is the author's problem.
Semantic layers are an attempt to relocate some of this burden. By naming metrics centrally and exposing them to consumers, semantic layers ensure that the metric revenue is defined once and queried consistently. But the structural correctness questions (cross-grain rollup safety, dimensional conformance, definitional consistency between pre-aggregated and detail tables) are largely still the model author's responsibility, encoded in the semantic-layer model's metric definitions and join paths. The semantic layer makes consistent querying easier; it does not provide structural guarantees about the queries it answers.
1.3 An alternative settling point: queries against ACs¶
Coframe proposes a different unit of querying: the Analytic Collection (AC). The AC is the primitive of the Analytic Layer — the category Coframe instantiates, peer to the Semantic Layer (see Preface).
An AC is a curated, named, structurally-committed collection of columns drawn from one or more backend tables. The AC author selects which columns to include, declares structural commitments about each, and exposes the resulting collection as an analytical surface. Consumers query the AC — not the underlying tables, not joins between them — by referencing the AC's exposed AC-dimensions and AC-metrics and describing the desired output frame.
When the unit of querying shifts from the table to the AC, several things move:
- The AC's exposed surface is what the query author knows. The AC author has chosen which AC-dimensions and AC-metrics to expose. The consumer sees these, not the backend tables. The consumer does not write joins; there are no joins to write at the AC level.
- The AC's structural commitments are what the framework reasons over. Each AC-dimension is part of a named dimension family with declared hierarchies. Each AC-metric belongs to a metric family with a declared family-root, declared ip_reducers, and a declared block-set per ip_reducer. The framework's algorithm consults these declarations.
- Correctness becomes a framework guarantee. When the framework accepts a Frame-QL query against an AC, the answer it produces is a consequence of the AC's declared and verified commitments. When the framework cannot produce a guaranteed answer, it refuses the query and explains why. There is no third option of "the framework guesses."
This shift is the framework's central commitment, and the rest of the manual is the answer to: what structural commitments must be declared and verified on an AC to support this?
The shift does not eliminate the AC author's work. The AC author still chooses columns, declares commitments, and verifies them against data. But the work shifts in kind: the AC author's job is to describe the data correctly in the framework's vocabulary, after which the framework handles query construction. The query author's job — for whomever queries the AC — becomes substantially smaller: describe the desired output frame and consume the result.
1.3.1 The framework is column-native throughout¶
There is a deeper way to read the shift from tables to ACs, and it is worth stating plainly because it shapes everything that follows: Coframe is column-native end to end. The column — a name, an anchor (the coordinates its values are keyed by), and its values — is the only first-class object in the framework. Tables and frames are emergent groupings of columns, not primitives.
This is true at every layer:
- There is no schema-level specification. Everything the framework knows about a schema is derived from its per-column declarations (ColumnSpecs, Chapter 3). The schema's grain, its scope, its place in the AC — all are computed from columns.
- There is no table operation in construction. The resolver does not join, group, or pivot tables. It performs a small set of operations over individual columns: deduplicate (when a column's anchor is coarser than a table's row grain), reduce (apply an operator to a single column across an anchor), fill-up (broadcast a coarser column's values to a finer grain), and collect (arrange aligned columns into output order). Cross-schema combination is alignment of columns on their shared coordinates followed by collect — not a join (Chapter 9).
- A query is not operations. A Frame-QL query is a collection of lightweight ColumnSpecs describing the desired output (Chapter 8). The output Frame is itself just a set of columns.
- An AC is a collection of columns, not of tables. The backend tables are the source of data; the AC's conceptual content is columns and the commitments on them.
This is why the framework can dispense with joins entirely — and why a whole class of errors becomes not detectable but inexpressible. Fanout (double-counting from a one-to-many join) cannot occur because there is no join to express it. The absence of the error and the absence of the operation are the same fact.
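The four column operations named in the list above can be pictured with plain Python dictionaries standing in for keyed columns. This is only an illustrative sketch: the function names, the dict-based column representation, and the sample data are assumptions made here, not the resolver's actual interface (Chapter 9 specifies that).

```python
from collections import defaultdict

# A "column" here is a dict: anchor key (tuple of coordinates) -> value.
# Every name and value below is hypothetical.

def deduplicate(rows, key_fields, value_field):
    """Collapse table rows to one value per anchor key."""
    col = {}
    for row in rows:
        key = tuple(row[f] for f in key_fields)
        if key in col and col[key] != row[value_field]:
            raise ValueError(f"value is not determined by the anchor at {key}")
        col[key] = row[value_field]
    return col

def reduce_column(col, mapping, op):
    """Apply an operator to a single column, regrouping keys through `mapping`."""
    groups = defaultdict(list)
    for key, value in col.items():
        groups[mapping(key)].append(value)
    return {key: op(values) for key, values in groups.items()}

def fill_up(coarse_col, fine_keys, mapping):
    """Broadcast a coarser column's values to a finer grain."""
    return {key: coarse_col[mapping(key)] for key in fine_keys}

def collect(keys, *cols):
    """Arrange already-aligned columns into output rows."""
    return [(key, *(col[key] for col in cols)) for key in keys]

# region has anchor {store}, but these rows are at transaction grain.
rows = [
    {"txn": 1, "store": "s1", "region": "north"},
    {"txn": 2, "store": "s1", "region": "north"},
    {"txn": 3, "store": "s2", "region": "north"},
    {"txn": 4, "store": "s3", "region": "south"},
]
region_of_store = deduplicate(rows, ["store"], "region")

revenue_by_store = {("s1",): 15.0, ("s2",): 7.0, ("s3",): 2.0}
cost_by_store = {("s1",): 9.0, ("s2",): 4.0, ("s3",): 1.0}

def to_region(store_key):
    return (region_of_store[store_key],)

revenue_by_region = reduce_column(revenue_by_store, to_region, sum)
cost_by_region = reduce_column(cost_by_store, to_region, sum)

# Both columns now share the anchor {region}; combining them is alignment plus collect.
frame = collect(sorted(revenue_by_region), revenue_by_region, cost_by_region)

# Broadcasting the regional total back down to store grain.
region_revenue_at_store = fill_up(revenue_by_region, revenue_by_store, to_region)
```

Nothing in the sketch pairs rows from two tables, so there is no place where a one-to-many match could duplicate a value: each column is reduced on its own and the results are lined up on shared keys.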
What makes a collection of keyed columns coherent — rather than a heap of unrelated values — is that anchors are not arbitrary keys but positions in a shared coordinate space: the dimension families, organized by functional-dependency edges (§2.5). That shared space is what makes columns commensurable and is the precise boundary of the framework's applicability: Coframe's guarantees hold to the degree that a shared coordinate space can be laid over the data. Relational and dimensional data carry that space essentially pre-built, which is why they are the framework's domain.
1.4 What this requires¶
Making AC-queries constructively answerable with guaranteed correctness requires answering a sequence of structural questions:
What is an analytical value? Every column carries values that depend on identifiable things at identifiable grain. The framework formalizes this through anchors — the set of things a column's value is about. The (A, M) paired declaration on each column commits to what the value depends on (A) and how missingness arises (M).
How are dimensions structured? Dimension columns in an AC are not isolated; they are organized into named dimension families with declared hierarchies. The time dimension family contains AC-dimensions day, month, quarter, year with declared rollup edges; the geography dimension family contains store, city, region, country. Cross-grain navigation traverses these declared hierarchies.
How are metrics structured? Metric columns belong to metric families — named groupings related through lineage edges. The revenue family contains every column named revenue connected through same-named lineage to a common family-root. Siblings within a family represent the same metric observed at different anchors. The framework uses family structure to determine when one column can be derived from another via rollup.
What aggregation operators are safe? Operators are catalog-defined with declared structural properties: partition-invariance (does the operator commute with arbitrary partitioning of its input — an algebraic monoid property), identity-preservation (does the operator preserve the family-name across application), type signatures, missing-value treatment. The catalog accommodates not only numeric operators (SUM, MAX, AVG) but also operators on non-numeric value spaces (HLL_MERGE, theta-set-union, t-digest-merge) — the partition-invariance property is algebraic, not numeric.
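Partition-invariance can be probed with a few lines of Python. The helper below is a hypothetical illustration, not part of the operator catalog: it checks one particular partition, so a `True` result is only evidence, while a `False` result is a genuine counterexample. Exact set union stands in for the sketch-merge operators mentioned above, which is an assumption made to keep the example dependency-free.

```python
from statistics import mean

def commutes_with_partition(op, values, partition):
    """True if reducing the per-part results equals reducing all values at once."""
    return op([op(part) for part in partition]) == op(values)

values = [10.0, 20.0, 30.0, 100.0]
partition = [[10.0, 20.0, 30.0], [100.0]]  # one arbitrary split of `values`

sum_ok = commutes_with_partition(sum, values, partition)
max_ok = commutes_with_partition(max, values, partition)
avg_ok = commutes_with_partition(mean, values, partition)

def union(sets):
    """Merge any number of sets; a non-numeric operator with the same property."""
    return frozenset().union(*sets)

id_sets = [{"a", "b"}, {"b", "c"}, {"d"}]
id_partition = [[{"a", "b"}], [{"b", "c"}, {"d"}]]
union_ok = commutes_with_partition(union, id_sets, id_partition)
```

SUM, MAX, and set union pass on this split. The mean does not: averaging the two part-averages (20 and 100) gives 60, whereas the average of all four values is 40.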
Which rollups are semantically valid? Partition-invariance of the operator is necessary but not sufficient. A SUM rollup of inventory across time is meaningless even though SUM is algebraically partition-invariant. The framework handles this through A_block annotations on ip_reducers: each metric family declares, per ip_reducer, which dimension families that operator must not be applied along. Inventory's SUM-ip-reducer has A_block = {time}; rollup across time is refused.
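The block-set check amounts to a set intersection. In the sketch below the declaration layout, the function name, and the empty block set for revenue are illustrative assumptions; only inventory's A_block = {time} for SUM comes from the text above.

```python
# Hypothetical declarations: family -> reducer -> dimension families it must not cross.
A_BLOCK = {
    "revenue": {"SUM": set()},          # assumed: revenue sums along any family
    "inventory": {"SUM": {"time"}},     # from the text: no SUM across time
}

def check_rollup(family, reducer, rolled_up_families):
    """Return (allowed, reason) for applying `reducer` along the given families."""
    reducers = A_BLOCK[family]
    if reducer not in reducers:
        return False, f"{reducer} is not a declared ip_reducer of {family}"
    blocked = reducers[reducer] & set(rolled_up_families)
    if blocked:
        return False, f"{reducer} on {family} is blocked along {sorted(blocked)}"
    return True, "ok"

across_time = check_rollup("inventory", "SUM", {"time"})
across_stores = check_rollup("inventory", "SUM", {"geography"})
```

Summing inventory across stores within one month is allowed; summing it across months is refused with a reason that names the blocked family.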
How is cross-schema coherence verified? When a metric family has siblings materialized in multiple schemas at different grains, the framework must verify they agree. Per-lineage-edge attestation during Data Quality Phase 3 checks that each declared lineage edge holds against the data: the predecessor's data, aggregated via the family's ip_reducer at the successor's anchor, agrees with the successor's observed values (respecting block sets and missing-value treatment).
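A minimal sketch of what attesting one SUM lineage edge involves is shown below. It is not the Phase 3 procedure itself (Chapter 7 specifies that): the function name, the tolerance parameter, and the sample data are assumptions, and the sketch ignores block sets and missing-value treatment.

```python
import math
from collections import defaultdict

def attest_sum_edge(predecessor, to_successor_key, successor, rel_tol=1e-9):
    """Roll the predecessor up to the successor's anchor with SUM; list disagreements."""
    rolled = defaultdict(float)
    for key, value in predecessor.items():
        rolled[to_successor_key(key)] += value
    mismatches = []
    for key in sorted(set(rolled) | set(successor)):
        expected, observed = rolled.get(key), successor.get(key)
        if (expected is None or observed is None
                or not math.isclose(expected, observed, rel_tol=rel_tol)):
            mismatches.append((key, expected, observed))
    return mismatches

# Hypothetical detail data, keyed by transaction id.
txn_revenue = {1: 100.0, 2: 27.5, 3: 40.0}
txn_coords = {
    1: ("north", "2024-Q3"),
    2: ("north", "2024-Q3"),
    3: ("south", "2024-Q3"),
}
# Hypothetical pre-aggregated sibling at (region, quarter); the south row is off by 1.
summary_revenue = {("north", "2024-Q3"): 127.5, ("south", "2024-Q3"): 41.0}

mismatches = attest_sum_edge(txn_revenue, txn_coords.get, summary_revenue)
```

The north row agrees with the rolled-up detail; the south row is reported as a mismatch carrying both the expected and the observed value, which is the kind of evidence an integrity report needs.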
How is the AC's surface exposed for querying? Frame-QL (Chapter 8) is the query language for describing desired output frames. A Frame-QL query references AC-dimensions and AC-metrics, describes filtering and grouping, and specifies what the output should look like. The framework's resolver (Chapter 9) constructs the algorithm that produces the desired frame from the AC's declared and verified commitments.
These are the framework's grammar-level primitives. Each is introduced in Foundations (Chapter 2) and specified in detail in subsequent chapters. The manual's structure mirrors this list: each chapter contributes one component of the answer.
1.5 Frame-QL: describing output frames¶
Frame-QL is the query language for AC-queries. A Frame-QL query is not a sequence of operations to perform; it is a description of the desired output frame, expressed in terms of the AC's exposed AC-dimensions and AC-metrics.
The motivating query from §1.1 — total revenue by region and quarter, filtered to consumer electronics, with a derived margin — written in Frame-QL against the retail AC, looks something like:
SELECT
    region,
    quarter,
    SUM(revenue) AS revenue,
    SUM(revenue) - SUM(cost) AS profit,
    (SUM(revenue) - SUM(cost)) / SUM(revenue) AS gross_margin_pct
FROM transactions
WHERE
    product_category = 'consumer_electronics'
    AND year IN (2022, 2023, 2024)
AT (region, quarter)
The query describes the output frame: its columns (the SELECT clause), its grain (AT (region, quarter)), and its input scope (WHERE). It does not reference any backend table beyond naming the AC's transactions schema, write any join, or commit to any execution strategy.
Several things distinguish this from SQL:
- No joins. The query references AC family-names (revenue, cost) and AC-dimensions (region, quarter, product_category, year), which are structural objects exposed by the AC. The framework handles whatever cross-schema reach is needed (here, aligning transactions with store geography to obtain region) without an author-written join.
- Grain-aware navigation. region is in the geography dimension family; quarter and year are in the time dimension family. The framework navigates each dimension family's hierarchy from wherever revenue and cost are materialized to the requested (region, quarter) grain, applying each metric family's ip_reducer with its block set.
- Derived columns are per-row specs, not procedures. profit and gross_margin_pct are computed at the output grain from the metric columns in the same row. They are additional lightweight column specifications in the output frame, not operations over other rows. (Naming convention used throughout: profit = revenue − cost; gross profit = revenue − COGS; margin = profit / revenue; gross margin = gross profit / revenue. The retail dataset's cost is the all-in per-transaction cost, so revenue − cost is profit, not gross profit.)
The motivating question in §1.1 also asked to compare each quarter to the same quarter the year before — a year-over-year comparison. That facet is deliberately absent from the query above: period-over-period comparison is a window analytic, referencing other rows of the assembled frame, and Coframe Core does not support window functions. The framework constructs the per-(region, quarter) revenue frame correctly; the cross-year comparison itself is performed in Coframe Pro (when available) or outside Coframe.
This boundary is worth understanding by its test, not just as a limitation. The question to ask of any column is: can its value be specified from its own row's grain, or does it require referencing other rows of the assembled result? The first is expressible in Frame-QL; the second is not (in Core). gross_margin_pct for a (region, quarter) row is computed entirely from that row's own revenue and cost — it passes. A year-over-year change for a (region, quarter, year) row must reach into a different row (the prior year's) — it does not. The same test decides every case, and a reader who carries it can predict what Frame-QL will and will not express without consulting a feature list. It also illustrates the framework's posture: it constructs what it can guarantee and is explicit about where its scope ends, rather than approximating a pattern it cannot construct correctly.
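The row-locality test shows up directly in function signatures. In the hypothetical sketch below (the frame contents and function names are invented for illustration), the margin needs only its own row, while the year-over-year change cannot be written without access to the rest of the frame.

```python
# A hypothetical assembled frame at (region, quarter, year) grain.
frame = [
    {"region": "north", "quarter": "Q3", "year": 2023, "revenue": 160.0, "cost": 120.0},
    {"region": "north", "quarter": "Q3", "year": 2024, "revenue": 200.0, "cost": 150.0},
]

def gross_margin_pct(row):
    """Row-local: uses only this row's own revenue and cost."""
    return (row["revenue"] - row["cost"]) / row["revenue"]

def yoy_revenue_change(row, whole_frame):
    """Not row-local: must find a different row (the prior year's)."""
    wanted = (row["region"], row["quarter"], row["year"] - 1)
    for other in whole_frame:
        if (other["region"], other["quarter"], other["year"]) == wanted:
            return row["revenue"] - other["revenue"]
    return None  # no prior-year row in the frame

margin_2024 = gross_margin_pct(frame[1])
change_2024 = yoy_revenue_change(frame[1], frame)
```

The second parameter of `yoy_revenue_change` is the whole point: a column whose definition needs `whole_frame` is a window analytic and falls outside what Frame-QL expresses in Core.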
Chapter 8 specifies Frame-QL in full. The syntax above is illustrative; the actual grammar may differ in details. The conceptual claim is what matters: Frame-QL describes what the output frame should be, and the framework constructs how to produce it.
1.6 The constructive-correctness guarantee¶
When the framework is given a Frame-QL query against an AC, it commits to one of three outcomes:
1. A constructed answer. The framework's resolver finds a derivation path from the AC's materialized columns to the requested output frame, traversing dimension-family hierarchies and applying ip_reducers per their declared block sets. The answer is computed and returned. Every step in the derivation is grounded in a declared and verified structural commitment.
2. A refusal with structural explanation. Some queries cannot be answered from the AC's commitments. A query referencing a metric family at an anchor unreachable through the family's ip_reducers; a query whose requested rollup is blocked by the ip_reducer's A_block; a query referencing an anchor-locked family at a coarser grain. The framework refuses such queries and explains the structural reason. The refusal is determinate: a query is either answerable or it is not, and the framework names which case applies.
3. A flagging as dubious. A subset of refusals: queries whose answer is not unique under the AC's commitments. The cousin case (a family-name with multiple non-equivalent family-roots in the AC) is the canonical example. The framework names the ambiguity and requires the query to disambiguate.
There is no fourth option. The framework does not guess; it does not interpolate; it does not impute missing values; it does not infer relationships not declared. If a query asks for something the AC does not commit to, the framework refuses with explanation rather than answering speculatively. This is the framework's epistemic commitment.
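One way to picture the closed set of outcomes is as a small tagged result type. The class and field names below are assumptions chosen for illustration; the framework's actual response format is not specified in this chapter.

```python
from dataclasses import dataclass, field

@dataclass
class Constructed:
    """Outcome 1: an answer plus the grounded steps that produced it."""
    frame: list
    derivation: list

@dataclass
class Refused:
    """Outcome 2: no answer, with the structural reason."""
    reason: str

@dataclass
class Dubious(Refused):
    """Outcome 3: a refusal whose cause is ambiguity; lists the candidates."""
    candidates: list = field(default_factory=list)

def describe(outcome):
    # Check the narrower case first, since Dubious is a kind of Refused.
    if isinstance(outcome, Dubious):
        return f"dubious: {outcome.reason}; choose among {outcome.candidates}"
    if isinstance(outcome, Refused):
        return f"refused: {outcome.reason}"
    return f"answered with {len(outcome.frame)} rows"

blocked = Refused("SUM on inventory is blocked along time")
cousins = Dubious("family-name has two non-equivalent family-roots",
                  ["root_a", "root_b"])
```

Making `Dubious` a subclass of `Refused` mirrors the text: a dubious flag is a subset of refusals, and there is no variant that carries a guessed answer.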
The corollary is that an AC's expressiveness — the range of queries it can answer — depends on the richness of its declarations. An AC with comprehensive metric family declarations, well-populated dimension families, and verified lineage edges supports a broad range of queries. An AC with sparse declarations supports fewer. The AC author's work is, in this sense, the work of expanding the AC's query-answering surface — declaring more, verifying more, exposing more.
A well-formed AC's correctness is a structural property — provable from its declarations and the framework's reasoning — rather than an empirical one (does the answer look right?). When the framework reports an answer, that answer is correct in the framework's sense: every step is grounded.
1.6.1 Two kinds of correctness¶
It is worth being precise about what "correct" means here, because the word carries two distinct senses that the framework keeps separate.
Grammatical correctness is the sense the framework guarantees. A constructed result is grammatically correct when it is what the query denotes under the rules of the grammar: the operator matches the operand type, the metric family resolves to a single source, the navigation path is valid, no fanout is possible. This is rule-based and mathematical. Once a query passes the rules, the result is correct in the same sense a well-typed expression is correct — and, as §1.3.1 noted, the framework achieves this less by checking than by making incorrect queries inexpressible. This guarantee does not depend on the data; it depends only on the query and the declared structure.
Integrity is a separate matter: whether the data satisfies the structural premises the grammar relied on — whether the declared functional dependencies actually hold, whether a pre-aggregated table actually equals the rolled-up detail. An integrity failure produces a grammatically-correct construction over false premises: wrong data, not a wrong query. Integrity is verified by a separate process (Chapter 7), and a structural premise can earn its warrant three ways — author assertion, affirmation by the processing code that produced the data, or data attestation (testing the premise against the data directly).
The two are related — a grammatically-correct construction over failed premises returns wrong data — but they are different kinds of incorrectness with different remedies, and the framework reports them separately. The grammatical guarantee is intact regardless of integrity status; integrity tells the consumer how much to trust the premises the guarantee stood on.
A third sense — whether the result matches the user's intent — the framework does not address. Bridging a query to what a user meant is the work of a semantic layer (see the Preface). Coframe operates at the grammar layer: it matches operations to queries by rule. It is deliberately not in the business of interpreting intent.
This is the central claim Coframe is built to deliver. The rest of the manual specifies how.
1.7 Map of the manual¶
The chapters that follow are organized in the dependency order of their ideas.
Chapter 2 — Foundations. The grammar-level primitives: anchors, dimension families, metric families, operators, lineage, structural rules, integrity conditions. Every subsequent chapter draws on this vocabulary.
Chapter 3 — The AC specification. The per-column declaration format (ColumnSpec), the AC-level declarations (dimension families, metric families, naming function, attestation), the schema declarations, and the schema.init YAML document that holds them.
Chapter 6 — Data-API protocol. How the framework communicates with the backend. The operations the framework requires (introspection, verification, projection, aggregation) and the protocol for invoking them.
Chapter 7 — Data quality and structural verification. The DQ process: the three phases, the integrity conditions verified, the asserted-but-not-verified facts, the per-lineage-edge attestation regime, the AC verification levels (A, AA, AAA).
Chapter 8 — Frame-QL. The query language for describing output frames. Lexical structure, top-level constructs, frame clauses, expressions, semantics, error categories.
Chapter 9 — Query resolution. The construction algorithm: how the framework finds derivation paths from the AC's materialized columns to the requested output frame. Cross-grain navigation, cross-schema coherence, the dubious-query mechanism, and Multi-Table Invariance as the plan-construction and optimization-licensing mechanism.
Chapter 10 — Operator catalog. Coframe Core's operator catalog: each operator's type, partition-invariance, identity-preservation, type signature, naming-function entry, and missing-value treatment.
Chapter 11 — The Metric Engine. The optional per-AC acceleration substrate: opt-in model, the METRIC + QM domains on a shared Polars+Parquet store, the serve() three-branch dispatch (exact match → FD-DAG rollup → backend fallback), compose() for multi-metric Frames, cache_hint push pre-materialisation, the lazy-fill / promotion lifecycle, the served_from indicator, verification-level inheritance, and stability-window invalidation.
Appendices. BNF grammar for Frame-QL (A); glossary (B); related work (C).
The chapter numbering skips 4 and 5 deliberately. Chapter 4 (AC authoring workflow) is addressed in a separate authoring guide; Chapter 5 was merged into Chapter 3 during the manual's organization. The MCP Surface (formerly slotted as Chapter 11 in early drafts; per the v2.1 supplement §10.2 amendment, MCP is one of the AC Surfaces) is now addressed in a separate product reference; the Chapter 11 slot is occupied by the Metric Engine, which is language-level Core. This manual specifies the framework; companion documents address products and practices built on it.
Chapter 2: Foundations¶
The grammar-level primitives — anchors, dimension families, metric families, operators, lineage — and the structural rules and integrity conditions that bind them. The foundations on which the framework's constructive-correctness guarantee rests.
2.1 What this chapter does¶
Chapter 1 established what Coframe constructs: an algorithm for answering Frame-QL queries against an Analytic Collection (AC) with guaranteed correctness. That construction rests on a small set of structural primitives — concepts the framework reasons over. This chapter introduces those primitives.
The order of introduction is dependency-driven. We begin with a single analytical observation and ask what is needed to describe it (§2.2). We then introduce the AC as the structural object that holds many such descriptions (§2.3). We specify how each column commits to what its values are about, through anchors and the missingness anchor (§2.4). We organize dimension columns into named dimension families with hierarchies (§2.5). We introduce operators and their algebraic properties (§2.6). We organize metric columns into named metric families through lineage (§2.7). We extend ip_reducers with block-set annotations to handle semi-additive measures and intensive families (§2.8). We specify schemas as the binding layer between structural commitments and physical data (§2.9). We collect the framework's structural rules and integrity conditions (§2.10). Finally, we crystallize the two principles the structure encodes (§2.11).
A reader who absorbs this chapter has a working mental model of the framework. Subsequent chapters specify how the primitives are declared (Chapter 3), verified (Chapters 6, 7), and used (Chapters 8, 9, 10).
The retail AC introduced in the front matter is referenced throughout. Where additional structural cases need illustration, this chapter constructs them and flags the deviation.
2.2 An analytical observation¶
Consider a single number from a retail database:
The cell value 127.43, in row 8,392,104 of the transactions table, in column revenue_amount.
What would a reader need to know about this number to reason about it?
What is it about? The number describes a particular transaction — transaction 8,392,104. Without knowing this, the number is a floating-point literal; we cannot meaningfully compare it to another, sum it with others, or filter it. The number's value depends on the identity of the transaction it records.
What does it measure? It measures revenue — a specific quantity (currency amount received by the retailer for the sale recorded in that transaction). It does not measure cost, or tax, or units sold, even though those quantities also pertain to the same transaction. The number's conceptual content is "revenue".
How was it produced? The number was recorded directly from the point-of-sale system; it was not computed from anything else. It is an observation, not a derivation.
Now consider a different number, in a different table:
The cell value 48,392.17, in row 412 of the region_quarterly_summary table, in column revenue.
Asking the same questions:
What is it about? It is about the combination of a region and a quarter — say, the North-Atlantic region in 2024-Q3. It is not about a particular transaction.
What does it measure? Revenue — the same conceptual quantity as the transaction-level number.
How was it produced? It was computed by summing the revenue_amount values of all transactions that occurred in stores in the North-Atlantic region during 2024-Q3. It is a derivation, not a direct observation. Specifically, it is the result of applying the operator SUM to the transaction-level revenue values, partitioned by (region, quarter).
Three things have emerged from these two examples:
- A column's value depends on identifiable things. Transaction 8,392,104; the (region, quarter) pair (North-Atlantic, 2024-Q3). We will call this the column's anchor.
- A column carries a conceptual quantity that may be shared across columns at different anchors. Both numbers carry "revenue." We will call this the column's family.
- A column may be derived from another by an operator. The region-quarterly revenue was produced by applying SUM to transaction-level revenue. We will call this the column's lineage, and the operator the lineage's operator.
These three — anchor, family, operator — are the framework's fundamental primitives. Every column in an AC carries an anchor declaration (what it is about), participates in a family (what it measures), and either is observed directly or was produced from a predecessor through an operator (how it arises).
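The two example cells can be described with a small record carrying the three primitives. This is a hypothetical sketch for orientation only: the class and its field names are assumptions, and it is not the ColumnSpec format, which Chapter 3 specifies.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class ColumnSketch:
    schema: str
    name: str
    anchor: frozenset                    # what the value is about
    family: str                          # what the value measures
    operator: Optional[str] = None       # None means observed directly
    predecessor: Optional[str] = None    # "schema.column" it was derived from

txn_revenue = ColumnSketch(
    schema="transactions",
    name="revenue_amount",
    anchor=frozenset({"transaction"}),
    family="revenue",
)

summary_revenue = ColumnSketch(
    schema="region_quarterly_summary",
    name="revenue",
    anchor=frozenset({"region", "quarter"}),
    family="revenue",
    operator="SUM",
    predecessor="transactions.revenue_amount",
)

same_family = txn_revenue.family == summary_revenue.family
same_anchor = txn_revenue.anchor == summary_revenue.anchor
```

The two columns share a family but not an anchor, and only the second names an operator and a predecessor. That combination is exactly what the two worked examples established.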
The rest of this chapter specifies each of these in detail. But before we elaborate, we need the structural object that holds many such column-descriptions together: the Analytic Collection.
2.3 The Analytic Collection¶
An Analytic Collection (AC) is the framework's unit of analytical reasoning. An AC is a deliberate, curated selection of columns from one or more backend tables, organized into schemas, with structural commitments declared at multiple levels.
2.3.1 What an AC contains¶
An AC has the following structural components:
- A name (ac_name) and identifying metadata.
- A backend the AC binds to. The backend may be a single database, a logical warehouse, or a federation of sources; the data-API protocol (Chapter 6) abstracts the binding.
- A set of schemas, each binding to one backend source (table or materialized view) and declaring ColumnSpecs for its analytically-relevant columns.
- A set of dimension families, each a named container for related AC-dimensions arranged in one or more hierarchies (§2.5).
- A set of metric families, each a named grouping of AC-metrics related through lineage edges to a common family-root (§2.7).
- A naming function declaration, specifying how the AC author's column names relate to operational lineage (Chapter 3 specifies; we reference it briefly here).
- An attestation configuration, specifying per-lineage-edge verification behavior (Chapter 7).
- Verified integrity status, the result of running the framework's verification process against the AC's data.
The AC is the complete analytical surface the framework reasons over. Queries are written against the AC's exposed AC-dimensions and AC-metrics. The framework's algorithms — query resolution, cross-grain navigation, dubious-query detection, per-lineage-edge attestation — operate over the AC's declared and verified commitments.
2.3.2 AC plurality¶
An AC is one possible curation of the underlying backend data. Multiple ACs can coexist over the same backend, exposing different subsets of columns, different naming conventions, different scope choices, or different structural commitments. An AC for the retail analytics team and an AC for the supply-chain team may both draw from the same transactions table but expose different families, different naming functions, or different schemas.
ACs do not depend on each other. Each AC is structurally closed: every commitment it declares is in terms of its own ColumnSpecs and its own AC-level declarations. There are no cross-AC references in Coframe Core.
2.3.3 The AC scope¶
An AC's scope is the set of structural commitments the AC makes. It comprises:
- Selection: which columns from the backend are included as ColumnSpecs.
- Naming: the AC's chosen name for each ColumnSpec, plus the optional naming function tying names to operational lineage.
- Structural commitments: the per-ColumnSpec declarations (anchor, missingness anchor, operator, lineage, family-name) plus the AC-level declarations (dimension families, metric families, FD-DAG, attestation configuration).
- Verified integrity status: the result of structural verification (Chapter 7).
Backend columns not included as ColumnSpecs are outside the AC scope and not visible to queries against the AC. An AC author's choice to exclude a column is as deliberate as their choice to include one; both shape the AC's surface.
2.3.4 The AC is not a translation surface¶
A point worth being explicit about, because it distinguishes Coframe from adjacent tools: an AC is not a translation surface from physical data to a "semantic layer" in the conventional sense. An AC is a primary structural object, not a derived view of underlying tables that consumers query as if it were one big table.
The distinction matters when we consider what queries against an AC mean. Queries do not lower into table-level SQL through a translation step that supplies join paths and metric definitions. The AC's structural commitments are what queries reason over; the framework's resolver constructs the algorithm to produce the requested frame from those commitments. The backend tables are the source of data; they are not the conceptual model the query references.
A reader from the semantic-layer tradition will find this an unusual stance. Semantic layers position themselves as a translation layer above tables, with the translation being the value-add. Coframe positions the AC itself as the conceptual model — the tables are inputs to its verification, not the model being queried.
2.3.5 The AC's three structural objects¶
An AC's structural reasoning involves three top-level objects, each with its own organizing principle:
- Schemas — the binding layer: each schema binds the AC to a single backend source through ColumnSpecs.
- Dimension families — the coordinate layer: each dimension family is a named grouping of AC-dimensions, organized into hierarchies, capturing the structure of analytical coordinates.
- Metric families — the observation layer: each metric family is a named grouping of AC-metrics, related through lineage, capturing the structure of analytical observations.
These three are introduced separately in the sections that follow:
- §2.4 introduces the per-column commitments (anchor, missingness) that ColumnSpecs declare.
- §2.5 specifies dimension families.
- §2.6 specifies operators.
- §2.7 specifies metric families.
- §2.9 specifies schemas (which contain ColumnSpecs and are the unit of physical binding).
The framework's overall structure is: each schema's ColumnSpecs declare per-column commitments; the AC's dimension families and metric families organize related columns into named structural groups; the framework's algorithms reason over both layers.
2.4 Anchors¶
The anchor of a column is the set of things the column's value depends on. It is declared per ColumnSpec through the A field (read as "the anchor"), paired with the missingness anchor M (§2.4.3).
2.4.1 A(c, S): the anchor declaration¶
For a column c in a schema S, the anchor A(c, S) is a set of AC-dimensions. The column's value, for each row in the schema, is determined by the values of these AC-dimensions in that row.
Examples from the retail AC:
- revenue in the transactions schema has A = {transaction}. Each row of transactions records a revenue value determined by the transaction it represents.
- region in the stores schema has A = {store}. Each row of stores records a region value determined by the store it represents.
- eom_inventory in store_monthly_inventory has A = {store, month}. Each row records inventory determined by the (store, month) pair.
The anchor is a commitment, not a description: the AC author commits that the column's value depends on exactly the AC-dimensions named in A. The framework verifies this commitment through cross-schema consistency checks (Chapter 7).
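Read as a functional dependency, an anchor commitment says that equal anchor coordinates always carry an equal value. The helper below sketches that reading against in-memory rows; it is an illustration with invented names and data, not the Chapter 7 verification process, and it checks only that A determines the value, not that A is minimal.

```python
def anchor_holds(rows, anchor, column):
    """True if rows that agree on every anchor coordinate also agree on `column`."""
    seen = {}
    for row in rows:
        key = tuple(row[dim] for dim in sorted(anchor))
        # setdefault stores the first value seen for this key and returns it.
        if seen.setdefault(key, row[column]) != row[column]:
            return False
    return True

# Hypothetical rows: region is committed to depend on store alone.
consistent = [
    {"store": "s1", "region": "north"},
    {"store": "s1", "region": "north"},
    {"store": "s2", "region": "south"},
]
violating = consistent + [{"store": "s2", "region": "north"}]

holds = anchor_holds(consistent, {"store"}, "region")
broken = anchor_holds(violating, {"store"}, "region")
```

In the second data set store s2 appears with two different regions, so the commitment A = {store} for region fails there.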
2.4.2 Grain-role columns¶
A column whose anchor consists of itself — A(c, S) = {c} — is in grain role in schema S. Such a column identifies the schema's rows uniquely; its values are the schema's primary identifiers.
In the retail AC:
- transaction in transactions has A = {transaction}: it is in grain role.
- store in stores has A = {store}: it is in grain role.
- (store, month) together are in grain role in store_monthly_inventory, meaning store has A = {store, month} and month has A = {store, month} together, forming a composite grain.
A column may be in grain role in one schema and in non-grain role in another. store is in grain role in stores (one row per store) and in non-grain role in transactions (where its values identify the store for each transaction without being unique per row).
2.4.3 M(c, S): the missingness anchor¶
Paired with A on every ColumnSpec is the missingness anchor M(c, S): the set of coordinates on whose values the column's missingness depends. Where A declares what the column's value depends on, M declares what its being-missing depends on. M is a subset of the column's anchor extended with the self-marker:
M(c, S) ⊆ A(c, S) ∪ {self}
The members of M are read as: an AC-dimension in M means missingness depends on that coordinate; self in M means missingness depends on the column's own (possibly unobserved) value.
M is declared as a single set. The familiar three-way classification of missingness mechanisms is a derived, lossy summary of M, not a separately-declared field:
- M = ∅ → MCAR (Missing Completely At Random): missingness depends on nothing in the data.
- M ⊆ A, self ∉ M, M ≠ ∅ → MAR (Missing At Random): missingness depends only on other (observed) coordinates, not on the value itself.
- self ∈ M → MNAR (Missing Not At Random): missingness depends on the column's own value, possibly in addition to other coordinates.
The category is a projection: it loses information that M retains. M = {self} and M = {self, age} are both MNAR but are different missingness anchors — the first says missingness depends only on the true value; the second, on the true value and age. The framework reasons over the set M, deriving the category label only for human readability. (This is consistent with the missing-data literature, where the MAR/MNAR boundary shifts with the conditioning set; the set M is exactly the conditioning information the bare category cannot carry. See Appendix C.)
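Because the category is derived from M, it can be written as a small function. The sketch below assumes M and A are represented as Python sets of names, with the string "self" as the self-marker; both function names are hypothetical.

```python
def check_missingness_anchor(A, M):
    """Enforce M ⊆ A ∪ {self}; raise if M names a coordinate outside that set."""
    extra = set(M) - (set(A) | {"self"})
    if extra:
        raise ValueError(f"M names coordinates outside A and self: {sorted(extra)}")

def missingness_category(M):
    """Derive the lossy MCAR / MAR / MNAR label from a missingness anchor."""
    if "self" in M:
        return "MNAR"
    if not M:
        return "MCAR"
    return "MAR"

A = {"customer", "age"}
labels = {
    "empty": missingness_category(set()),
    "age": missingness_category({"age"}),
    "self": missingness_category({"self"}),
    "self_and_age": missingness_category({"self", "age"}),
}
check_missingness_anchor(A, {"self", "age"})  # passes silently
```

The last two entries of `labels` are both "MNAR" even though the sets differ, which is the information loss described above: the label is for human readers, and reasoning has to use M itself.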
For grain-role columns, the framework auto-derives M = {self} and treats missing values as inadmissible — a missing value in a grain-role column is an integrity violation, since the column is supposed to identify the schema's rows.
2.4.4 The twin anchor (A, M)¶
A declares what the value depends on; M declares what its missingness depends on. Both are sets of coordinates over the same space (M ranging over A ∪ {self}), so together they form a twin anchor — two coordinate-sets attached to every column, one for the value and one for its missingness. Neither is optional; the framework requires both on every ColumnSpec.
The twin anchor's reasoning role:
- Operator catalog entries specify missing-value treatment as a function of the operator and of how M relates to the rollup (§10.12). The treatment is not fixed per operator: the same reducer handles a column differently depending on M, and the same M is handled differently by different operators. This joint sensitivity is possible precisely because M is a per-column coordinate the catalog can dispatch on.
- Cross-schema consistency checks (§2.10) verify that same-named columns across schemas declare consistent (A, M) twin anchors where the trichotomy requires it.
- Query resolution (Chapter 9) consults A to determine cross-grain navigability and consults M (specifically, whether M's coordinates survive into the output grain or are coarsened away) to determine which missing-value treatment applies (§10.12).
2.4.5 The trichotomy¶
Every column in an AC is classified into one of three categories. The classification is derived from A patterns and dimension-family membership across the AC — the framework computes it; the AC author does not declare it directly.
- AC-dimension: a column that is either (a) in grain role in at least one schema (A(c, S) = {c} somewhere), or (b) reachable from a grain-role column through the AC's FD-DAG, i.e., declared as a member of some dimension family (§2.5). AC-dimensions function as analytical coordinates. Their values identify positions in the AC's coordinate space.
- AC-attribute: a column that is not an AC-dimension and not an AC-metric. AC-attributes have a fixed single anchor (|A| = 1) across all schemas where they appear, and they belong to no dimension family. They describe their anchor without serving as a coordinate. Example in the retail AC: store_open_date describes a store but is not a coordinate axis.
- AC-metric: a column whose A(c, S) may vary across schemas: the column is observed (or observable) at multiple anchors and represents a measure. AC-metrics are organized into metric families (§2.7).
The trichotomy is exhaustive (every column falls in at least one category), mutually exclusive (no column falls in two), and metadata-derivable (computed from declarations without consulting data).
The expanded AC-dimension definition reflects a practical reality: not every dimension-family member needs to be materialized in grain role. The time dimension family's members include day, month, quarter, year; the AC may materialize only day (or month in a composite-grain schema) and reach the others through function-derived FD-edges (§2.5.6) or through the data-attested FD-DAG. All members of a dimension family are AC-dimensions because they all serve as analytical coordinates the framework reasons over.
In the retail AC:
- AC-dimensions (all dimension-family members): transaction, store, city, region, country, sku, product_line, product_category, department, day, month, quarter, year, fiscal_period, fiscal_quarter, fiscal_year, iso_week, iso_year. The atomic AC-dimensions (those in grain role somewhere) are transaction (grain role in transactions), store (grain role in stores), and month (grain-role contributor in store_monthly_inventory). The remaining AC-dimensions are derived: reachable from atomic AC-dimensions through the FD-DAG (city from store, quarter from day, etc.).
- AC-attributes: store_open_date. (A column describing a store but not serving as a coordinate.)
- AC-metrics: revenue, cost, units_sold, eom_inventory, peak_inventory, and the singleton gross_margin_pct.
The trichotomy is the column's structural role in the AC. It is independent of the column's family membership (which is determined by name) and of its lineage (which is determined by the lineage field). It is also independent of the column's value type — a date-valued column may be an AC-attribute; a string-valued column may be an AC-dimension if it functions as a coordinate.
2.5 Dimension families¶
2.5.1 What a dimension family is¶
A dimension family is a named grouping of related AC-dimensions, organized into one or more hierarchies. The AC declares dimension families at the AC level; each AC-dimension belongs to exactly one dimension family.
In the retail AC, four dimension families are declared:
- The time dimension family contains AC-dimensions day, month, quarter, year, fiscal_period, fiscal_quarter, fiscal_year, iso_week, iso_year.
- The geography dimension family contains store, city, region, country.
- The product dimension family contains sku, product_line, product_category, department.
- The transaction dimension family contains the single AC-dimension transaction.
Each dimension family is a coordinate space the AC reasons over. Where queries navigate "across geography" or "across time," they navigate within a dimension family.
2.5.2 Base level¶
Each dimension family has a designated base level: the finest AC-dimension in the family, from which coarser AC-dimensions are reachable through FD-edges.
In the retail AC:
- The base level of time is day.
- The base level of geography is store.
- The base level of product is sku.
- The base level of transaction is transaction.
The base level is identified by being the unique AC-dimension in the family from which all other AC-dimensions in the family are reachable through declared FD-edges. If a dimension family contains only one AC-dimension (like transaction), that AC-dimension is the base level by default.
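The identification rule can be sketched as a reachability check. The edge map below is derived from the time hierarchies listed in §2.5.3; the function names and the dictionary representation are assumptions made for illustration, not the framework's API.

```python
def reachable(start, edges):
    """Return every AC-dimension reachable from `start` via FD-edges."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in edges.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def base_level(members, edges):
    """Return the unique member from which all other members are reachable."""
    candidates = [d for d in members if reachable(d, edges) >= members - {d}]
    if len(candidates) != 1:
        raise ValueError(f"expected exactly one base level, found {sorted(candidates)}")
    return candidates[0]

TIME = {"day", "month", "quarter", "year", "fiscal_period",
        "fiscal_quarter", "fiscal_year", "iso_week", "iso_year"}
TIME_EDGES = {  # d1 -> [d2, ...], one entry per declared FD-edge
    "day": ["month", "fiscal_period", "iso_week"],
    "month": ["quarter"],
    "quarter": ["year"],
    "fiscal_period": ["fiscal_quarter"],
    "fiscal_quarter": ["fiscal_year"],
    "iso_week": ["iso_year"],
}

time_base = base_level(TIME, TIME_EDGES)
single_base = base_level({"transaction"}, {})
```

A single-member family passes the check trivially, which matches the default described above.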
2.5.3 Hierarchies¶
A hierarchy within a dimension family is a named path from the base level to a designated top, traversing declared FD-edges. A dimension family may have one or more named hierarchies.
In the retail AC, the time dimension family declares three named hierarchies:
- calendar: day → month → quarter → year
- fiscal: day → fiscal_period → fiscal_quarter → fiscal_year
- week: day → iso_week → iso_year
The geography and product dimension families each declare a single hierarchy. The transaction dimension family declares no hierarchies (single-level).
Multiple hierarchies enable distinct rollup conventions over the same data. A query asking for revenue by quarter traverses the calendar hierarchy; a query asking for revenue by fiscal_quarter traverses the fiscal hierarchy. Both queries use the same base-level day data.
The hierarchies within a dimension family may share the base level and the top (the implicit All aggregate is the top of every hierarchy) but otherwise traverse distinct AC-dimensions. They are not required to be disjoint above the base.
2.5.4 FD-edges and the FD-DAG¶
The structural relationships within a dimension family are functional-dependency edges (FD-edges). An FD-edge d1 → d2 declares that every value of d1 determines a unique value of d2: knowing d1, you know d2. In the retail AC, store → region is an FD-edge in the geography dimension family — every store has a unique region.
The collection of all FD-edges in the AC forms the FD-DAG. The FD-DAG is partitioned by dimension family: each FD-edge connects AC-dimensions within a single dimension family. Cross-dimension-family FD-edges are not used in Coframe Core.
The FD-DAG must be acyclic. Cycles among AC-dimensions in a dimension family would mean that two AC-dimensions mutually determine each other — they would be the same AC-dimension under different names. The framework rejects ACs with cyclic FD-DAGs.
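As a rough illustration of the acyclicity requirement, the check below uses Python's standard `graphlib` module (Python 3.9+). The function name and the edge-pair representation are hypothetical; how Coframe itself performs the check is not specified here.

```python
from graphlib import TopologicalSorter, CycleError

def check_acyclic(fd_edges):
    """Return a fine-to-coarse ordering of AC-dimensions, or raise on a cycle.

    fd_edges is an iterable of (d1, d2) pairs, each meaning d1 → d2.
    """
    graph = {}
    for d1, d2 in fd_edges:
        graph.setdefault(d1, set())
        graph.setdefault(d2, set()).add(d1)  # d2 is determined by d1
    try:
        return list(TopologicalSorter(graph).static_order())
    except CycleError as exc:
        # args[1] holds the list of nodes forming the detected cycle
        raise ValueError(f"cyclic FD-DAG: {exc.args[1]}") from exc

geography = [("store", "city"), ("city", "region"), ("region", "country")]
order = check_acyclic(geography)
```

For the geography edges the returned order places store before city, city before region, and region before country.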
2.5.5 Candidate vs. data-driven FD-DAGs¶
The AC declares a candidate FD-DAG: the set of FD-edges the AC author commits to. The framework also examines the data to determine the data-driven FD-DAG: the set of FD-edges actually attested by the data.
The integrity condition is:
Candidate FD-DAG ⊆ Data-driven FD-DAG.
Every declared FD-edge must be attested by the data. The data may also contain FD-edges the AC has not declared; this is permitted but does not extend the AC's declared structure.
Verification of declared FD-edges against data is part of Data Quality Phase 2 (Chapter 7).
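The integrity condition amounts to checking, for each declared edge d1 → d2, that no d1-value maps to more than one d2-value in the data. A minimal sketch follows, assuming rows are available as dictionaries. The sample rows and function names are invented for illustration and do not reflect the actual Phase 2 procedure.

```python
def fd_violations(rows, d1, d2):
    """Return {d1_value: set of d2_values} for d1-values with several d2-values."""
    seen = {}
    for row in rows:
        seen.setdefault(row[d1], set()).add(row[d2])
    return {k: v for k, v in seen.items() if len(v) > 1}

def unattested_edges(candidate_edges, rows):
    """Return the declared FD-edges that the data contradicts."""
    return [(d1, d2) for d1, d2 in candidate_edges if fd_violations(rows, d1, d2)]

# Hypothetical store rows; s3 appears with two different regions.
rows = [
    {"store": "s1", "city": "Lyon", "region": "ARA"},
    {"store": "s2", "city": "Lyon", "region": "ARA"},
    {"store": "s3", "city": "Nice", "region": "PACA"},
    {"store": "s3", "city": "Nice", "region": "ARA"},
]
bad = unattested_edges([("store", "city"), ("store", "region")], rows)
```

Here store → city is attested while store → region is not, so only the second edge would fail the condition.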
2.5.6 Function-derived FD-edges¶
An FD-edge may be grounded in two ways:
- Data-attested: the mapping d1 → d2 is observed in the backend data, with each d1-value mapping to a unique d2-value across rows.
- Function-derived: the mapping is computed by a declared function from the operator catalog, for example month = MONTH_OF(day), quarter = QUARTER_OF(day), iso_week = ISO_WEEK_OF(day). These FD-edges are grounded by the function's deterministic semantics (given day, the function produces a unique month) combined with the framework's trust in the data engine to evaluate functions correctly.
Both grounding regimes contribute identically to the FD-DAG. The framework's reasoning over the FD-DAG does not distinguish them. An AC author choosing whether to materialize month as a stored column (data-attested) or compute it on demand (function-derived) is making an operational choice with no structural consequence.
This duality has an analogue on the metric side, discussed in §2.7.
2.5.7 What dimension families enable¶
Dimension families serve several reasoning purposes:
- Cross-grain navigation: a query asking for revenue by region navigates the geography dimension family's hierarchy from wherever revenue is materialized to the region level.
- Block-set declaration: a metric family's ip_reducer block set (§2.8) is declared in terms of dimension families. eom_inventory's SUM ip_reducer has A_block = {time}, blocking the time dimension family as a whole, not individual time AC-dimensions.
- Multiple-hierarchy reasoning: queries can specify which hierarchy to traverse when a dimension family has multiple. AT quarter vs. AT fiscal_quarter selects different hierarchies in time.
- Agent-facing exposure: external interfaces (such as the MCP server integration; see the separate MCP reference) expose dimension families with their hierarchies, allowing agents to reason about the AC's analytical coordinate space.
2.6 Operators¶
2.6.1 What operators are¶
An operator is a catalog-defined relationship between a predecessor column and a successor column. Each operator is specified once in the operator catalog (Chapter 10) with its structural properties; ColumnSpecs reference catalog entries by name.
Single-input operators come in two types:
- Reducers: operators that aggregate values across a predecessor's anchor, producing a successor at a coarser anchor (A_pred ⊇ A_self). Examples: SUM, MAX, MIN, COUNT, AVG, MEDIAN, HLL_MERGE.
- Functions: operators that transform values without aggregation, producing a successor at the same anchor as the predecessor (A_pred = A_self). Examples: MONTH_OF, ABS, BUCKET, SUBSTR.
A third category — multi-input operators — takes multiple predecessors and produces a successor whose anchor is the common anchor of the inputs. Examples: MAP_DIV(revenue, cost), IF(cond, a, b). Multi-input operators produce singleton family-roots (§2.7.7).
2.6.2 Partition-invariance — the algebraic property¶
The most important property an operator can have is partition-invariance. An operator R is partition-invariant if its result on a set of values is the same as its result on the partition-wise results, for any partitioning of the input.
Formally: for any multiset X partitioned into sub-multisets X_1, ..., X_n:
$$R(X) = R(R(X_1), R(X_2), \ldots, R(X_n))$$
This is a monoidal property. The triple (V, R, e) — value space, operator, identity element — forms a commutative monoid if and only if R is partition-invariant on V.
Partition-invariance is what licenses rollup. If R is partition-invariant, the framework can apply R to predecessor values at a fine anchor and produce a correct successor value at a coarser anchor, regardless of the path through the FD-DAG. If R is not partition-invariant, rollup is path-dependent and the framework cannot guarantee correctness.
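The property can be spot-checked numerically. The sketch below compares a direct rollup with a rollup of partition-wise results for SUM and MAX, and contrasts that with a naive average of averages. It tests one partition of made-up values, so it illustrates the definition rather than proving it; all names are illustrative.

```python
from functools import reduce
from operator import add
from statistics import mean

def rollup(values, combine, identity):
    """Fold `values` with a binary `combine`, starting from `identity`."""
    return reduce(combine, values, identity)

def agrees_on_partition(partition, combine, identity):
    """Compare a direct rollup with a rollup of the partition-wise results."""
    flat = [v for part in partition for v in part]
    direct = rollup(flat, combine, identity)
    partials = [rollup(part, combine, identity) for part in partition]
    return rollup(partials, combine, identity) == direct

partition = [[3, 5], [2], [8, 1]]  # unequal partition sizes on purpose
sum_ok = agrees_on_partition(partition, add, 0)
max_ok = agrees_on_partition(partition, max, float("-inf"))

# AVG applied to partition averages does not reproduce the overall average.
flat = [v for part in partition for v in part]
avg_direct = mean(flat)
avg_of_avgs = mean(mean(part) for part in partition)
```

SUM and MAX agree on this partition, while `avg_direct` and `avg_of_avgs` differ because the partitions have unequal sizes.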
2.6.3 The breadth of partition-invariant operators¶
Coframe Core's operator catalog includes partition-invariant operators across diverse value spaces:
- Numeric values, additive: SUM, COUNT (partial counts are combined by addition). The triple (ℝ, +, 0) is a commutative monoid.
- Numeric values, extremal: MAX and MIN. The triple (ℝ ∪ {-∞}, max, -∞) is a commutative monoid for MAX; (ℝ ∪ {+∞}, min, +∞) is the corresponding monoid for MIN.
- Boolean values: AND, OR. (𝔹, ∧, true) and (𝔹, ∨, false) are commutative monoids.
- Bit-vectors: BIT_AND, BIT_OR, BIT_XOR.
- Set values: UNION, INTERSECTION (where each row carries a set).
- HLL sketches: HLL_MERGE. The triple (HLL-sketches, merge, empty-sketch) is a commutative monoid; merging sketches of partition-pieces gives the same sketch as sketching the unpartitioned data.
- Theta sketches: THETA_UNION, THETA_INTERSECTION.
- Quantile sketches: T_DIGEST_MERGE, KLL_MERGE.
The unifying property is that the (value space, operator, identity) triple forms a commutative monoid. This is more general than "the operator does arithmetic on numbers"; the framework accommodates non-numeric monoidal value spaces as first-class.
2.6.4 Liftable operators (algebraic in Gray's sense)¶
Some operators are not partition-invariant in their natural value space but become partition-invariant when the value space is enriched.
The canonical example is AVG. AVG over real numbers is not partition-invariant: AVG(AVG(X_1), ..., AVG(X_n)) ≠ AVG(X) in general, because the partitions may have unequal sizes. But the pair (SUM, COUNT) is partition-invariant on (ℝ × ℕ), and AVG can be recovered by the final projection (s, n) → s/n. If the framework carries the (SUM, COUNT) pair through aggregations and only computes AVG at the final step, partition-invariance is recovered.
Such operators are liftable: non-partition-invariant in their natural value space but partition-invariant in an enriched space, with a final projection extracting the desired result.
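The AVG lift can be sketched as follows: each partition is summarized by a (sum, count) pair, pairs are merged component-wise, and the average is computed only at the end. The function names are illustrative and say nothing about how the catalog actually declares lifts.

```python
from functools import reduce

IDENTITY = (0, 0)  # (sum, count) of the empty multiset

def lift(values):
    """Summarize raw values as an enriched (sum, count) state."""
    return (sum(values), len(values))

def merge(a, b):
    """Combine two enriched states; associative and commutative."""
    return (a[0] + b[0], a[1] + b[1])

def project(state):
    """Final projection (s, n) -> s / n; undefined for an empty state."""
    total, count = state
    return total / count if count else None

partition = [[3, 5], [2], [8, 1]]
states = [lift(part) for part in partition]
merged = reduce(merge, states, IDENTITY)
average = project(merged)
```

The merged state is the same as lifting the unpartitioned values directly, so the projected average does not depend on how the data was partitioned.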
The operator catalog (Chapter 10) declares for each operator:
- Whether it is natively monoidal (partition-invariant in its declared value space; no lift needed).
- Whether it is liftably monoidal (partition-invariant in an enriched value space, with a declared lift and final projection).
- Whether it is holistic (not partition-invariant in any finite-state enrichment — exact MEDIAN over arbitrary data, exact COUNT_DISTINCT over large data).
Coframe Core treats natively monoidal operators directly, may treat liftable operators by carrying the enriched state internally, and refuses to roll up holistic operators except where the AC author has independently materialized siblings at the requested grain.
2.6.5 Identity-preservation — the orthogonal property¶
A second property of operators, independent of partition-invariance, is identity-preservation. An operator R is identity-preserving for a predecessor m_pred if applying R to m_pred produces a column in the same family as m_pred — same family-name, same conceptual quantity.
SUM applied to revenue produces revenue at a coarser anchor. The family-name persists; SUM is identity-preserving for revenue.
MAX applied to a temperature reading produces a different conceptual quantity — the peak temperature in the partition. The family-name does not persist (or, more precisely, the convention is to name the result peak_temperature rather than temperature). MAX is not identity-preserving for temperature.
Identity-preservation is declared per (operator, predecessor-family) pair, mediated by the AC's naming function (Chapter 3). The naming function specifies what name an operator's output should carry given its predecessor's name and the operator. If the function's output equals the predecessor's name, the operator is identity-preserving for that predecessor; otherwise it is not.
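A toy version of this test is shown below. The table format is purely hypothetical (the real naming function is specified in Chapter 3); its entries are the example names used in this section.

```python
# Hypothetical naming-function table: (operator, predecessor name) -> output name.
NAMING = {
    ("SUM", "revenue"): "revenue",
    ("COUNT", "revenue"): "revenue_count",
    ("MAX", "revenue"): "peak_revenue",
    ("MAX", "temperature"): "peak_temperature",
}

def is_identity_preserving(op, pred_name, naming=NAMING):
    """True when the operator's output keeps the predecessor's family-name."""
    return naming[(op, pred_name)] == pred_name

sum_keeps_name = is_identity_preserving("SUM", "revenue")
max_keeps_name = is_identity_preserving("MAX", "temperature")
```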
Identity-preservation and partition-invariance are orthogonal:
| Operator | Partition-invariant | Identity-preserving (for typical predecessor) |
|---|---|---|
| SUM | yes | yes (revenue → revenue) |
| COUNT | yes | no (revenue → revenue_count) |
| MAX | yes | varies (revenue → peak_revenue; temperature → peak_temperature; but the family may be conventionally MAX-rooted) |
| AVG | no (liftable) | varies |
| MAP_DIV | not applicable (multi-input) | no (multi-input; produces singleton) |
A family's ip_reducer (introduced in §2.8) is an operator that is both partition-invariant and identity-preserving for the family.
2.6.6 Operator catalog properties¶
Each operator catalog entry declares:
- Type: reducer / function / multi-input.
- Anchor relation: A_pred ⊇ A_self for reducers; A_pred = A_self for functions.
- Partition-invariance classification: natively monoidal / liftably monoidal / holistic, with the lift specification where applicable.
- Identity-preservation: per (operator, predecessor-family) — mediated by the naming function.
- Type signature: input value space(s), output value space.
- Missing-value treatment: per (operator, predecessor-M) combination.
- Catalog metadata: name, description, default naming-function entry.
Chapter 10 specifies the catalog format and contents. This section's purpose is to make explicit what operators contribute to the framework's reasoning: they are the declared semantics of column-to-column transformations, with structural properties that govern how the framework reasons about derivation.
2.7 Metric families¶
2.7.1 Lineage¶
Every ColumnSpec has a lineage field recording its immediate predecessor: a triple (name_pred, A_pred, op_pred) snapshotting the predecessor's name, anchor, and operator. The op field on the ColumnSpec records the operator that produced this column from the predecessor.
For a root column — one with no predecessor — the lineage is self-referential: lineage = (name_self, A_self, op_self). Walking lineage backward from a root yields the root itself.
For a non-root column, lineage points to a different ColumnSpec elsewhere in the AC: in any schema, including the same schema. The framework verifies that a ColumnSpec matching (name_pred, A_pred, op_pred) exists in the AC.
Walking lineage from any column backward yields a lineage chain — a sequence of predecessors terminating at a root.
2.7.2 Lineage edges and the lineage graph¶
A lineage edge is the relationship between a column and its predecessor. The collection of all lineage edges across the AC's ColumnSpecs forms the AC's lineage graph. Walking the lineage graph backward yields ancestry; walking forward yields descendants.
The lineage graph is a forest of trees with multi-input nodes. Each tree is rooted at a self-referential ColumnSpec. Multi-input operators (singletons, §2.7.7) introduce nodes with multiple incoming edges.
2.7.3 Family-name and family¶
A family-name is a name value appearing in one or more ColumnSpecs in the AC. Two columns with the same name belong to the same metric family.
Family membership is determined by name equality. The framework does not parse names; identity-of-name is identity-of-family, and the AC author commits to this when assigning names.
2.7.4 Family-root¶
To find a column's family-root, walk lineage backward as long as name_pred equals name_self. The family-root is the last column reached while names matched.
- If a column's lineage is self-referential and the column is alone in its lineage chain (no predecessor in the AC), it is its own family-root.
- If a column's lineage chain contains earlier columns with the same name, the family-root is the earliest one.
- If walking backward reaches a column whose name_pred differs from name_self, that point marks the boundary of the family. That column, the last one reached while names still matched, is the family-root.
Two columns share a family-root iff their same-name lineage chains terminate at the same column.
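The walk can be sketched in Python as below. The `ColumnSpec` dataclass is a simplified stand-in for the real structure, the coarser revenue siblings are hypothetical, and `NORMALIZE` is an invented operator name used only to show a family boundary.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ColumnSpec:
    name: str
    anchor: frozenset
    op: str
    lineage: tuple  # (name_pred, A_pred, op_pred)

    @property
    def key(self):
        return (self.name, self.anchor, self.op)

def family_root(col, index):
    """Walk lineage backward while the predecessor's name matches."""
    current = col
    # Stop at a self-referential root or where the predecessor's name differs.
    while current.lineage != current.key and current.lineage[0] == current.name:
        current = index[current.lineage]
    return current

txn = frozenset({"transaction"})
root = ColumnSpec("revenue", txn, "OBSERVED", ("revenue", txn, "OBSERVED"))
store_day = ColumnSpec("revenue", frozenset({"store", "day"}), "SUM", root.key)
region_month = ColumnSpec("revenue", frozenset({"region", "month"}), "SUM", store_day.key)
normalized = ColumnSpec("revenue_normalized", txn, "NORMALIZE", root.key)

index = {c.key: c for c in (root, store_day, region_month, normalized)}
```

With this index, `family_root(region_month, index)` returns `root`, while `family_root(normalized, index)` returns `normalized` itself because its predecessor carries a different name.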
2.7.5 Structural relations within and across families¶
Two columns in the AC can be related in four structurally distinct ways:
- Identical: same name, same anchor, same family-root. Structurally interchangeable.
- Siblings: same name, different anchors, same family-root. The same metric observed at different anchors. Cross-anchor navigation between siblings is well-defined under any of the family's applicable ip_reducers (§2.8).
- Cousins: same name, different family-root. Independent observations sharing a family-name. Applying an ip_reducer to two cousins at a target anchor can produce different results because the underlying observations differ. Queries referencing a family-name that resolves to multiple cousins are refused as dubious.
- Different families: different names. No structural relation under grammar-layer reasoning. Any conceptual relationship between differently-named families must be encoded explicitly via lineage (one family's columns derive from the other's through declared operators).
The cousin relation is the framework's main subtlety. Two columns named revenue in different schemas, both with self-referential lineage (each is its own family-root), are cousins. The framework cannot determine whether they should agree at a common coarsening: they were independently observed, and the AC author has not declared a lineage relationship between them. The dubious-query mechanism (Chapter 9) requires the AC author to either declare one as derived from the other (making them siblings) or to disambiguate at query time.
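The four relations reduce to comparisons on name, family-root, and anchor. The sketch below assumes each column has already been resolved to a (name, anchor, family-root identifier) triple; the identifiers and schema names are made up for the example.

```python
def relation(a, b):
    """Classify two columns given (name, anchor, family_root_id) triples."""
    name_a, anchor_a, root_a = a
    name_b, anchor_b, root_b = b
    if name_a != name_b:
        return "different families"
    if root_a != root_b:
        return "cousins"
    return "identical" if anchor_a == anchor_b else "siblings"

# Hypothetical columns; root ids name the schema holding each family-root.
txn_revenue = ("revenue", frozenset({"transaction"}), "transactions.revenue")
daily_revenue = ("revenue", frozenset({"store", "day"}), "transactions.revenue")
ledger_revenue = ("revenue", frozenset({"store", "month"}), "ledger.revenue")
txn_cost = ("cost", frozenset({"transaction"}), "transactions.cost")

pairs = {
    "rollup": relation(txn_revenue, daily_revenue),
    "independent": relation(txn_revenue, ledger_revenue),
    "unrelated": relation(txn_revenue, txn_cost),
}
```

The independently observed `ledger_revenue` classifies as a cousin of `txn_revenue`, which is the case the dubious-query mechanism is designed to catch.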
2.7.6 Operator-attested vs. operator-asserted families¶
A family-root's op field declares the operator under which the family's columns roll up — i.e., the family's ip_reducer (§2.8). For a family with op = OBSERVED, no rollup operator is being claimed; the root is direct observation. For a family with op set to a reducer (SUM, MAX, HLL_MERGE), the AC author is committing that the family's columns roll up via that operator.
The framework's reasoning over the family depends on this operator-claim. A query asking for the family at a coarser anchor will roll up via the declared ip_reducer. But how does the framework know the declared operator is the right one — that the AC author's claim is consistent with the data?
The answer depends on whether the family has an in-AC sibling whose data can independently test the rollup:
- Operator-attested: there exists at least one ColumnSpec in the AC that is a sibling of the family-root: a same-named column at a different anchor, with lineage pointing to the family-root via the declared operator. Phase 3 of DQ (Chapter 7) tests this sibling by aggregating the family-root's data via the declared ip_reducer at the sibling's anchor and comparing to the sibling's observed values. The operator's applicability to this family is verified against data.
- Operator-asserted: no in-AC sibling exists. The family-root's data is the only materialization in the AC. The framework cannot independently test that the declared ip_reducer is consistent with the family's data, because there is nothing in the AC to compare against. The framework trusts the catalog's algebraic claim about the operator (partition-invariance) and the AC author's choice; the DQ deliverable records this commitment as asserted-not-verified.
In the retail AC:
- revenue: operator-attested if any coarser sibling of transactions.revenue is materialized in the AC. The OBSERVED root in transactions is the data; a coarser sibling can be Phase-3-tested against it. Without any sibling, the family is operator-asserted; whether revenue is attested or asserted depends on what siblings are materialized.
- eom_inventory under SUM: operator-asserted. No finer-grained sibling exists in the AC; the data only exists at {store, month}. The AC author commits that SUM is the ip_reducer with A_block = {time}; the framework cannot independently verify this commitment because no daily-grain inventory data is in the AC.
- eom_inventory under MAX: similarly operator-asserted unless a sibling materialized via MAX rollup is added.
The distinction is not a quality judgment. Operator-asserted families are legitimate — sometimes the data simply does not exist at a finer grain, or the AC author has chosen not to materialize finer-grained siblings. The DQ deliverable reports the distinction so AC authors and consumers know which commitments are data-attested and which are trusted from declaration.
2.7.7 Singleton families¶
Multi-input operators produce singleton columns. The gross_margin_pct column in the retail AC is computed from revenue and cost via MAP_DIV: its lineage is a tuple of predecessor snapshots ((revenue, {transaction}, OBSERVED), (cost, {transaction}, OBSERVED)), and its op is MAP_DIV.
A singleton is structurally a leaf in the AC's metric genealogy: other columns do not derive from it through lineage. It participates in no further family genealogy beyond its own definition.
Singletons exist for AC authors to expose computed columns (ratios, differences, indicator variables) as named members of the AC's surface. They are useful for agent-facing exposure and for query-language brevity. They are not, in Coframe Core, the start of new family lineages.
The boundary between a singleton and a derived family is structural. If an AC author wants a computed quantity to roll up across grains, they materialize it (or have the framework compute it via Frame-QL inline expressions) at multiple anchors. If they want it as a standalone, queryable column, they declare it as a singleton.
2.7.8 Primitive vs. derived families¶
A family-root with op = OBSERVED and no further lineage upward is in a primitive family — the family does not derive from any other family in the AC. The data is the family's root.
A family-root whose lineage points to a column in a different family is in a derived family. The family-root inherits structure from the predecessor family. An edge in the family-DAG records the derivation.
The family-DAG is acyclic (lineage chains terminate at column-roots; family-roots inherit this acyclicity).
In the retail AC, all metric families are primitive in the as-declared version. A derived family example would be a revenue_normalized family rooted at a column whose lineage points to revenue via a normalization operator. This case is not exercised in the retail AC but is structurally available.
2.7.9 What metric families enable¶
Metric families serve several reasoning purposes:
- Cross-anchor navigation within a family: siblings of a family-root represent the same metric at different anchors, navigable via the family's ip_reducer.
- Dubious-query detection: cousins (same name, different family-roots) trigger refusal with disambiguation requirement.
- Per-lineage-edge attestation: each lineage edge is a verifiable structural commitment.
- Agent-facing exposure: agents see families as named, structured metric concepts with available rollup paths.
- Frame-QL family references: queries reference metric families by name; the framework resolves to the appropriate sibling based on requested grain.
The metric family is the primary observation-side structural object in Coframe — the parallel to the dimension family on the coordinate side. The framework's reasoning operates on both layers: dimension families organize the coordinate space; metric families organize the observation space; the FD-DAG and lineage graph are the structural connectives within each.
2.8 ip_reducers and block sets¶
2.8.1 What an ip_reducer is¶
An ip_reducer — short for identity-preserving reducer — is the operator under which a metric family's columns roll up consistently. It is the framework's mechanism for cross-anchor navigation within a family.
An ip_reducer of a family is a pair (R, A_block) where:
- R is a partition-invariant operator (in the algebraic monoid sense of §2.6).
- A_block is a set of dimension families along which R must not be applied for this family.
When A_block = ∅, the family is fully partition-invariant under R: rollup is safe along any path reachable through the FD-DAG. When A_block ≠ ∅, the family is partially partition-invariant under R: rollup is safe along dimension families not in A_block, refused along dimension families in A_block.
A family may declare multiple ip_reducers, each with its own block set. The family eom_inventory in the retail AC declares two:
- (SUM, A_block = {time}): SUM rolls inventory up safely across geography (summing across stores within a month makes sense) but is blocked across time (summing inventory across months does not).
- (MAX, A_block = ∅): MAX rolls inventory up safely in any direction (peak inventory is a meaningful quantity at any grain).
Queries select an ip_reducer based on the requested rollup direction. A query asking for "regional monthly inventory" navigates SUM across stores (safe; geography is not blocked). A query asking for "regional yearly peak inventory" navigates MAX in both directions (safe; MAX has no block). A query asking for "regional yearly inventory" without specifying an aggregation finds no ip_reducer whose block set permits both the geography and time rollups under a single operator — the query is refused as needing disambiguation.
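The block-set check itself can be sketched as a set-intersection filter. This models only which declared reducers are unblocked for a given rollup direction. How a query picks among several unblocked reducers, and when it is refused as needing disambiguation, belongs to query resolution (Chapter 9) and is not modeled here. The function name and data layout are assumptions.

```python
def unblocked_reducers(ip_reducers, rollup_families):
    """Return operators whose block set does not intersect the rollup direction.

    ip_reducers: list of (operator, blocked_dimension_families) pairs.
    rollup_families: dimension families the query coarsens along.
    """
    wanted = frozenset(rollup_families)
    return [op for op, blocked in ip_reducers if not (blocked & wanted)]

# The two ip_reducers declared for eom_inventory in the retail AC.
EOM_INVENTORY = [
    ("SUM", frozenset({"time"})),
    ("MAX", frozenset()),
]

across_stores = unblocked_reducers(EOM_INVENTORY, {"geography"})
across_months = unblocked_reducers(EOM_INVENTORY, {"time"})
```

SUM survives the filter only when the rollup stays within geography; any rollup that coarsens along time leaves MAX as the sole unblocked operator.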
2.8.2 Why the block set lives at the family level¶
An ip_reducer's block set is declared at the metric family level, not at the operator catalog level and not at the per-column level. This placement is deliberate.
Block sets are not properties of operators. SUM is algebraically partition-invariant on numbers, period; the operator's catalog entry does not declare which dimensions are safe to sum across. Whether summing across time is meaningful depends on the measure: revenue is safe; inventory is not. The operator catalog cannot know this.
Block sets are not properties of individual columns. Every column in the eom_inventory family should respect the same block. A new column added to the family later — say, eom_inventory at {region, month} materialized as a sibling — inherits the family's block set without redeclaration.
Block sets are properties of the family — the named metric concept. The AC author declares them once, at the family level, and the framework applies them uniformly across the family's columns.
2.8.3 The block set is downward-closed along the FD-DAG¶
A block set declared at the family level must be downward-closed along the FD-DAG within each blocked dimension family. If month is in the block set, then every AC-dimension reachable from month via FD-edges (quarter and year, under the calendar hierarchy of §2.5.3) is also blocked.
The block set is declared at the dimension family level (e.g., A_block = {time} means the entire time dimension family is blocked, including all its AC-dimensions and hierarchies). The framework expands this declaration to the implied set of blocked AC-dimensions and verifies downward-closure as a well-formedness condition.
Declaring block at the dimension-family level (rather than enumerating AC-dimensions) is also more robust to evolution. If the AC later adds a new hierarchy within the time dimension family (say, an academic-calendar hierarchy), the block set automatically applies to the new AC-dimensions without redeclaration.
How the block set generalizes semi-additivity¶
The additive / semi-additive / non-additive classification of measures is long established in data-warehousing practice and in the summarizability literature (Appendix C). The block set is an encoding of semi-additivity, but it generalizes the established notion on two axes that most prior treatments do not cover.
First, it does not privilege time. Conventional treatments of semi-additive measures — and the first-child / last-child machinery in many OLAP tools — center on time as the special non-additive axis (opening and closing balances, snapshot semantics). Here the blocked directions are arbitrary dimension families. A measure that must not be summed across time and a measure that must not be summed across some non-temporal family are the same kind of constraint, handled by the same mechanism; time is not special.
Second, it applies to arbitrary partition-invariant reducers, not only additive ones. Because partition-invariance is the monoid property (§2.6.2), a block set constrains the rollup direction of any monoidal reducer — HLL_MERGE, MAX, a sketch union — including reducers with no additive character at all. "Semi-additivity" generalizes here to "direction-restricted partition-invariant aggregation," of which additive semi-additivity is the special case. This generality falls directly out of treating aggregation algebraically rather than arithmetically.
2.8.4 Anchor-locked families¶
A family whose family-root operator is not partition-invariant has no ip_reducer. Such a family is anchor-locked: its columns exist at specific anchors and cannot be derived to other anchors via name-preserving aggregation.
Examples in the retail AC and adjacent cases:
- avg_basket_size rooted at AVG: anchor-locked. AVG is not partition-invariant.
- median_transaction_value rooted at MEDIAN: anchor-locked. MEDIAN is holistic.
- An exact-distinct-count family rooted at COUNT_DISTINCT: anchor-locked unless approximated via HLL (a different family).
Anchor-locked families are not framework failures; they reflect the underlying truth that not every metric meaningfully rolls up. The framework refuses cross-grain queries against anchor-locked families and offers no synthetic answer.
If an AC author wants cross-grain accessibility for a concept like "average basket size at the regional level," they declare a separate family — for example, regional_avg_basket_size — with its own family-root materialized at the regional grain. Because it has a different name, this is a different family (not a cousin, which would require the same name); queries select which family they want by name. (Had the author instead materialized a second avg_basket_size at the regional grain with its own self-referential root, the two would be cousins — same name, different family-root — and a bare avg_basket_size reference would be dubious.)
2.8.5 Multiple ip_reducers and query semantics¶
When a family declares multiple ip_reducers, queries select one based on what they ask for. Frame-QL provides syntax for explicit selection (Chapter 8); when a query is unambiguous (only one ip_reducer's block set permits the requested rollup), the framework selects automatically.
The framework does not assert that two ip_reducers on the same family yield the same answer. SUM-rolled regional monthly inventory and MAX-rolled regional monthly inventory are different quantities, each legitimate under its own ip_reducer. The query author commits to which they want by referring to the appropriate ip_reducer.
This is a feature, not a bug. The asymmetry reflects what is true: peak inventory and total inventory are different measures over the same underlying data. The family exposes both; the query commits to one.
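The selection rule can be sketched as follows. This is an illustrative model only: the dictionary encoding of ip_reducers and the function names `permitted_reducers` and `select_reducer` are assumptions, and Frame-QL's actual selection syntax is specified in Chapter 8.

```python
from typing import Dict, List, Optional, Set

# Hypothetical encoding of the eom_inventory family's ip_reducers.
EOM_INVENTORY_REDUCERS: List[Dict] = [
    {"operator": "SUM", "a_block": {"time"}},
    {"operator": "MAX", "a_block": set()},
]


def permitted_reducers(reducers: List[Dict], rolled_up_families: Set[str]) -> List[str]:
    """Operators whose block set does not intersect the rolled-up dimension families."""
    return [r["operator"] for r in reducers
            if not (r["a_block"] & rolled_up_families)]


def select_reducer(reducers: List[Dict], rolled_up_families: Set[str],
                   requested: Optional[str] = None) -> str:
    """Pick an ip_reducer: explicit if requested, automatic if exactly one is permitted."""
    allowed = permitted_reducers(reducers, rolled_up_families)
    if requested is not None:
        if requested not in allowed:
            raise ValueError(f"{requested} is blocked or undeclared for this rollup")
        return requested
    if not allowed:
        raise ValueError("no ip_reducer permits this rollup")
    if len(allowed) == 1:
        return allowed[0]
    raise ValueError(f"explicit selection required; candidates: {allowed}")


# Rolling up across time: SUM is blocked, so MAX is selected automatically.
print(select_reducer(EOM_INVENTORY_REDUCERS, {"time"}))

# Rolling up across geography: both are permitted, and they are different quantities.
store_inventory = [120, 80, 100]
print(sum(store_inventory), max(store_inventory))
```

In the geography case the sketch raises unless the caller names an operator, mirroring the requirement that the query commit to one ip_reducer.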
2.8.6 ip_reducer attestation¶
Per-lineage-edge attestation (Chapter 7) extends naturally to ip_reducers with block sets:
- For an edge predecessor → successor under ip_reducer (R, A_block):
  - Compute the rolled-up dimension set Δ = A_pred \ A_succ (the AC-dimensions coarsened by the rollup).
  - Compute the dimension families containing the AC-dimensions in Δ.
  - If any of these dimension families is in A_block, the edge is blocked under R; attestation under this ip_reducer does not apply.
  - Otherwise, the edge is attestable under R: aggregate the predecessor's data via R at A_succ, and compare to the successor's observed values.
Blocked edges are recorded as a distinct category in the DQ report — neither passing nor failing, but structurally inapplicable for this ip_reducer. (They may be attestable under a different ip_reducer of the same family.)
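The classification step of this procedure is small enough to sketch directly. The lookup table `FAMILY_OF` and the function name `classify_edge` are hypothetical; the aggregation and comparison step is left to Chapter 7.

```python
from typing import Dict, Set

# Hypothetical lookup from AC-dimension to its dimension family (assumed for this sketch).
FAMILY_OF: Dict[str, str] = {
    "store": "geography", "region": "geography",
    "month": "time", "quarter": "time",
}


def classify_edge(a_pred: Set[str], a_succ: Set[str], a_block: Set[str]) -> str:
    """Classify one lineage edge under one ip_reducer as 'blocked' or 'attestable'."""
    delta = a_pred - a_succ                              # Δ = A_pred \ A_succ
    rolled_families = {FAMILY_OF[dim] for dim in delta}  # families containing Δ
    return "blocked" if rolled_families & a_block else "attestable"


# eom_inventory under SUM with A_block = {time}:
print(classify_edge({"store", "month"}, {"region", "month"}, {"time"}))   # geography rollup
print(classify_edge({"store", "month"}, {"store", "quarter"}, {"time"}))  # time rollup
```

Running the same two edges with an empty block set (the MAX ip_reducer of the same family) classifies both as attestable, which is the parenthetical point above.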
2.9 Schemas¶
2.9.1 What a schema is¶
A schema in an AC is a structural object binding to a single backend source — a physical table or materialized view — and declaring ColumnSpecs for the analytically-relevant columns of that source. The AC's schemas are the units of physical binding between the AC's declarations and the underlying data.
Each schema declares:
- A schema_name: unique within the AC.
- A source binding: the backend table or view name (and connection metadata) that this schema observes.
- A declared_scope: per dimension family, whether the schema is non-degenerate or degenerate on that family.
- A list of column_specs: one per analytically-relevant column in the source.
The retail AC declares three schemas: transactions, stores, store_monthly_inventory — each binding to one backend table.
2.9.2 Virtual tables¶
A schema binds to a virtual table: the conceptual table the schema observes. The virtual table need not correspond directly to a physical table; it may be a view, the result of a query, a pre-materialized aggregate, or a federation. The data-API protocol (Chapter 6) abstracts the binding.
For most purposes the manual uses "schema" and "virtual table" interchangeably. The distinction matters when an AC binds to logical-rather-than-physical sources, but in the retail AC each schema corresponds to a single physical backend table.
2.9.3 Schema grain¶
A schema's grain is the anchor at which the schema's rows are uniquely identified. It is derived from the grain-role ColumnSpecs in the schema: those with A = {self} for a single-column grain, or with A equal to the full composite for a composite grain (§2.10.1).
For the retail schemas:
- transactions: grain {transaction}. One row per transaction.
- stores: grain {store}. One row per store.
- store_monthly_inventory: grain {store, month}. One row per (store, month) pair; a composite-grain fact.
A schema may have a single-column grain (one grain-role ColumnSpec) or a composite grain (multiple grain-role ColumnSpecs that together identify rows). The framework verifies grain uniqueness as part of DQ Phase 2 (Chapter 7).
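A grain-uniqueness check amounts to counting rows per grain key. The sketch below is an assumption about shape only (rows as dictionaries, a hypothetical `duplicate_grain_keys` helper, invented sample values); the actual DQ Phase 2 procedure is specified in Chapter 7.

```python
from collections import Counter
from typing import Dict, Iterable, List, Mapping, Tuple


def duplicate_grain_keys(rows: Iterable[Mapping], grain: List[str]) -> Dict[Tuple, int]:
    """Return each grain key that identifies more than one row, with its row count."""
    counts = Counter(tuple(row[col] for col in grain) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}


# Illustrative rows for store_monthly_inventory (values are made up for the sketch).
inventory_rows = [
    {"store": "s1", "month": "2024-01", "eom_inventory": 120},
    {"store": "s1", "month": "2024-02", "eom_inventory": 110},
    {"store": "s2", "month": "2024-01", "eom_inventory": 80},
    {"store": "s1", "month": "2024-01", "eom_inventory": 125},  # repeats (s1, 2024-01)
]

print(duplicate_grain_keys(inventory_rows, ["store", "month"]))
```

An empty result means the composite grain identifies rows uniquely; the same helper works for a single-column grain by passing a one-element list.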
2.9.4 Declared scope and degeneracy¶
A schema is non-degenerate on a dimension family if its rows can be meaningfully classified along that dimension family's hierarchies. It is degenerate on a dimension family if it has no analytical anchoring in that dimension family.
The retail schemas declare:
- transactions: non-degenerate on all four dimension families (each transaction has a time, geography, product, and transaction identity).
- stores: non-degenerate on geography; degenerate on time, product, and transaction (a row in stores describes a store, not a temporal observation, a product, or any specific transaction).
- store_monthly_inventory: non-degenerate on geography and time; degenerate on product and transaction.
Declared scope tells the framework which dimension families a schema's queries can navigate. A cross-grain query that traverses a dimension family uses a schema's data only if that schema is non-degenerate on that family.
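As a small illustration of how declared scope filters candidate schemas, the sketch below encodes the retail scopes listed above as sets of non-degenerate families. The `NON_DEGENERATE` table and the function `schemas_navigable_on` are hypothetical names, not part of the specification.

```python
from typing import Dict, List, Set

# Retail AC declared scope, encoded as the families each schema is non-degenerate on.
NON_DEGENERATE: Dict[str, Set[str]] = {
    "transactions": {"time", "geography", "product", "transaction"},
    "stores": {"geography"},
    "store_monthly_inventory": {"geography", "time"},
}


def schemas_navigable_on(families: Set[str]) -> List[str]:
    """Schemas whose declared scope is non-degenerate on every requested family."""
    return sorted(name for name, scope in NON_DEGENERATE.items() if families <= scope)


print(schemas_navigable_on({"time"}))
print(schemas_navigable_on({"product"}))
```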
2.9.5 Schema type¶
Schemas are informally classified into types based on their column composition. The classification is metadata-derivable and does not drive framework behavior; it is documentation:
- Reference schemas contain primarily AC-attributes and AC-dimensions describing the schema's grain. stores is a reference schema: its rows describe stores via attributes (city, region, country, open_date).
- Fact schemas contain primarily AC-metrics at a single grain. transactions is a fact schema: each row records metric observations (revenue, cost, units_sold) about one transaction.
- Composite-grain fact schemas are fact schemas with composite grains. store_monthly_inventory is a composite-grain fact schema: each row records metrics about a (store, month) pair.
These types are descriptive; they do not constrain the framework's reasoning. A schema may contain any mix of AC-dimensions, AC-attributes, and AC-metrics.
2.9.6 ColumnSpec structure (preview)¶
Each ColumnSpec in a schema is structurally divided into four parts. Chapter 3 specifies the format in detail; this section previews the structure for the reader's orientation:
| Part | Fields | Role |
|---|---|---|
| Backend-facing | src_name, data_type | Bind to physical data |
| Anchor-facing | A, M | The paired value-determining commitment |
| Operator-lineage | op, lineage | The column's operational lineage |
| Cross-schema linkage | name | Family identifier |
The four parts have distinct structural roles and are independently declared. The framework reasons over each separately. Together they constitute the column's complete structural commitment.
2.10 Structural rules and integrity conditions¶
The framework's reasoning depends on the AC honoring certain structural conditions. These conditions are organized into three layers by the granularity at which they apply and the means of verification.
2.10.1 Per-column rules¶
These rules can be checked one ColumnSpec at a time, consulting its lineage predecessor where a rule refers to one.
- (A, M) paired declaration: every ColumnSpec has both A and M declared (with M auto-derived for grain-role columns).
- Anchor cardinality: for an AC-dimension or AC-attribute in non-grain role, |A| = 1 (it is anchored at a single coordinate). For a grain-role column, A is the schema's grain: A = {self} for a single-column grain, or the full composite (e.g., A = {store, month}) for a column contributing to a composite grain. For AC-metrics, |A| ≥ 1 (the metric's anchor, possibly composite).
- Operator-type-appropriate anchor relation: for each non-root ColumnSpec, the anchor relation between the column and its lineage predecessor matches the operator's type: A_pred ⊇ A_self for reducer ops, A_pred = A_self for function ops.
- Naming consistency (when a naming function is declared): the column's declared name equals the AC's naming function called with the column's lineage predecessor and operator (or name = name_pred if op is identity-preserving for the predecessor).
2.10.2 Per-schema rules¶
These rules involve multiple ColumnSpecs within one schema.
- No-all-dimensions: a schema cannot consist entirely of AC-dimensions with no AC-attributes or AC-metrics; such a schema would observe coordinates without observations.
- Type consistency: same-named columns within a schema must have compatible data_type.
- Same-name uniqueness within schema: two ColumnSpecs in the same schema do not share a name.
- Schema grain well-formedness: the schema's grain-role columns identify rows uniquely (verified per DQ Phase 2).
2.10.3 AC-level rules¶
These rules involve multiple schemas or AC-level constructs.
- Candidate FD-DAG acyclicity: within each dimension family, the declared FD-edges form an acyclic graph.
- Dimension family well-formedness: each AC-dimension belongs to exactly one dimension family; each dimension family has a designated base level (the unique AC-dimension from which all others in the family are FD-reachable).
- Family-root uniqueness within (name, A): two ColumnSpecs in the AC with the same (name, A) walk lineage to the same family-root. Violation indicates that two non-equivalent metrics share an identity claim, which is a structural inconsistency.
- Block-set well-formedness: every ip_reducer's A_block is a set of dimension families, downward-closed along the FD-DAG in the implied AC-dimensions.
- ip_reducer partition-invariance: every operator declared as an ip_reducer is partition-invariant per the operator catalog.
2.10.4 Data-attested integrity conditions¶
These conditions are verified against backend data during the DQ process. They are not derivable from declarations alone.
- Candidate FD-DAG ⊆ Data-driven FD-DAG: every declared FD-edge is attested by the data.
- Schema scope honoring: the schema's declared scope on each dimension family is honored — non-degenerate scopes have data; degenerate scopes have none.
- Grain combo-key uniqueness: the schema's grain-role columns identify rows uniquely in the data.
- Cross-schema value-mapping consistency: same-named AC-dimensions and AC-attributes across schemas have consistent values for the same key.
2.10.5 The cross-schema metric coherence lemma¶
A key structural claim — the basis for cross-grain query correctness:
Cross-schema metric coherence. Across schemas containing siblings of the same family-root, the metric values at common coarsenings agree (i.e., applying the family's ip_reducer to the finer-grained sibling at the common coarsening produces values matching the coarser-grained sibling), respecting the ip_reducer's block set.
The framework asserts this from Principle 2 (§2.11) plus the ip_reducer's partition-invariance. It verifies this per attestable lineage edge during DQ Phase 3 (Chapter 7). Verification compares the predecessor's data, aggregated via the family's ip_reducer at the successor's anchor (respecting the ip_reducer's block set and operator missing-value treatment), against the successor's observed values; deltas surface as integrity violations or advisories per the AC's failure-mode setting.
This lemma is the basis for Multi-Table Invariance (MTI) in query resolution (Chapter 9): siblings produce equivalent query results because the AC's commitments guarantee their coherence at common coarsenings — which is what licenses the resolver to serve a query from whichever sibling is cheapest.
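The comparison behind the lemma can be sketched for a SUM ip_reducer. Everything here is illustrative: the row layout, the function name `coherence_deltas`, and the sample values are assumptions, and the use of `math.isclose` is only one plausible reading of how the relative and absolute epsilons combine (Chapter 7 defines the real semantics).

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, Mapping, Optional, Tuple


def coherence_deltas(fine_rows: Iterable[Mapping], coarse_rows: Iterable[Mapping],
                     key: str, metric: str,
                     rel_tol: float = 1e-9, abs_tol: float = 0.0
                     ) -> Dict[str, Tuple[Optional[float], Optional[float]]]:
    """SUM the finer sibling up to `key` and report keys where it disagrees with the coarser one."""
    rolled: Dict[str, float] = defaultdict(float)
    for row in fine_rows:
        rolled[row[key]] += row[metric]
    observed = {row[key]: row[metric] for row in coarse_rows}

    deltas = {}
    for k in rolled.keys() | observed.keys():
        a, b = rolled.get(k), observed.get(k)
        missing = a is None or b is None
        if missing or not math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol):
            deltas[k] = (a, b)
    return deltas


# Finer sibling: revenue per transaction. Coarser sibling: revenue per store.
transactions = [
    {"transaction": "t1", "store": "s1", "revenue": 10.0},
    {"transaction": "t2", "store": "s1", "revenue": 5.0},
    {"transaction": "t3", "store": "s2", "revenue": 7.5},
]
store_revenue = [
    {"store": "s1", "revenue": 15.0},
    {"store": "s2", "revenue": 8.0},  # disagrees with the rolled-up 7.5
]

print(coherence_deltas(transactions, store_revenue, key="store", metric="revenue"))
```

Each reported pair is (rolled-up value, observed value); whether such a delta is a violation or an advisory depends on the AC's failure-mode setting.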
2.10.6 The integrity-condition posture¶
Three categories of facts emerge from the rules above, classified by how the framework relates to their verification:
- Structural rules: checkable from declarations alone, mechanically and immediately. The framework refuses to operate on an AC that violates these.
- Data-attested conditions: verified against backend data through the DQ process. The framework's reasoning depends on these being verified; opt-out is possible but visible.
- Asserted-not-verified facts: trusted from the catalog or principle, not separately verified per AC. The operator catalog's partition-invariance declarations are an example; Principle 2 itself is another. These are the framework's leap-of-faith items; they are explicit and documented.
The framework's correctness guarantee is conditional on all three categories holding. Structural rules are always verified before the framework operates. Data-attested conditions are verified per AC during DQ; opt-out shifts a verified fact into the asserted-not-verified column. Asserted-not-verified facts are the framework's irreducible commitments.
2.11 The two principles¶
We arrive at the principles last, by way of summary. They are not axioms imposed on the reader; they are the commitments the framework's structural primitives encode. Having seen the primitives, the reader can see the principles as their natural distillation.
A note on terminology before the statements: the principles speak of entities — the things observations are about. This is the natural informal word for "the world-things a column's values describe." The framework's anchor (A) is the declaration mechanism by which a column commits to which entities its value depends on. The principles are about the underlying truth that anchors declare.
2.11.1 Principle 1: Column-borne information¶
Every column's value is determined by the entities it is anchored to, as declared by A(c, S).
This is what A declares: the column's value depends on the AC-dimensions named in its anchor, and on nothing else outside that anchor's value space. Two rows of a schema agreeing on A should agree on c (modulo declared missingness M). The framework's reasoning over (A, M), the column trichotomy, and cross-schema consistency all draw from this principle.
A column whose value depends on something not in its anchor is structurally ill-formed. The framework's verification (Chapter 7) tests this: same-named AC-dimensions and AC-attributes across schemas should yield consistent values for the same key; if they do not, either an FD declaration is wrong or a column is observing something other than what its anchor claims.
2.11.2 Principle 2: Same universe of observation¶
All schemas in an AC observe the same underlying entities. Cross-schema commitments — FD-edges, family genealogies, sibling coherence — are structurally meaningful because the entities being observed are shared across schemas.
This is what makes cross-schema queries possible. When the transactions schema and the store_monthly_inventory schema both reference store, they reference the same conceptual store — not just an identifier that happens to overlap. When eom_inventory siblings exist at multiple anchors in different schemas, they observe the same conceptual quantity over the same conceptual universe.
The framework's verification of cross-schema value-mapping consistency, cross-schema metric coherence, and sibling-coherence at common coarsenings all derive from Principle 2. Without it, the AC would be a federation of structurally-disjoint tables; cross-schema reasoning would be unsound.
Schema-level degeneracy is the controlled exception. A schema may be declared degenerate on a dimension family (the schema does not observe that family). Degeneracy is a deliberate declaration, not a structural failure: the schema commits that it has no analytical observations along the degenerate dimension family. Queries respect this declaration.
2.11.3 What the principles guarantee¶
Combined, the two principles deliver the framework's central commitment:
An AC honoring Principles 1 and 2, with declared structural commitments and verified integrity conditions, supports Frame-QL queries answerable by an algorithm whose answers are correct as consequences of the AC's commitments.
This is the constructive-correctness guarantee from Chapter 1. The structure of this chapter has been to develop the primitives that make this guarantee deliverable: anchors and missingness (so that values are well-determined); the trichotomy (so that columns' structural roles are known); dimension families with hierarchies (so that the coordinate space is navigable); operators with partition-invariance and identity-preservation (so that derivations are sound); metric families with lineage (so that the observation space is structured); ip_reducers with block sets (so that rollups are anchor-aware); schemas with declared scope (so that the binding to physical data is principled); integrity conditions verified through DQ (so that the AC's commitments are grounded against data).
Each subsequent chapter builds on this foundation. The reader who has absorbed §§2.2–2.10 can now read the AC specification format (Chapter 3), the DQ verification process (Chapter 7), and Frame-QL with its resolution algorithm (Chapters 8–9) as elaborations of the structure introduced here.
The framework is not large. It is, at the foundational level, this: anchors, dimension families, metric families, operators, lineage, structural rules. Everything else is operational consequence.
Chapter 3: The AC Specification¶
The per-column declaration format (ColumnSpec), the AC-level declarations (dimension families, metric families, naming function, schemas), and the schema.init YAML document that holds them. Foundations introduced the primitives; this chapter specifies how they are declared.
3.1 What this chapter does¶
Foundations (Chapter 2) introduced the structural primitives the framework reasons over: anchors, dimension families, metric families, operators, lineage, ip_reducers with block sets, schemas. This chapter specifies how those primitives are declared in an actual AC.
An AC is specified by a schema.init file — a YAML document that declares the AC's identity, AC-level structural objects (dimension families, metric families, naming function, attestation config), schemas, and the ColumnSpecs within each schema. The framework consumes the file, verifies the declarations against its structural rules, and (during the DQ process, Chapter 7) verifies the data-attested integrity conditions against the backend.
The chapter is organized outside-in: the file's top-level structure first (§3.2), then the AC-level declarations (§§3.3–3.6), then schemas (§3.7), then the ColumnSpec format (§3.8), then derived properties (§3.9) and a worked example (§3.11). A reader writing an AC by hand can follow the sections in order.
This chapter is a specification, not an exposition. The "why" of each construct is in Foundations; this chapter says "how to declare it" with examples. Where a field's role requires deeper explanation, the chapter references the relevant Foundations section.
3.2 The schema.init file¶
3.2.1 Top-level structure¶
A schema.init file is a YAML document with the following top-level fields:
```yaml
ac_name: <string>
ac_description: <string>
ac_metadata: <map>               # optional
attestation: <object>            # optional; defaults apply
naming_function: <object>        # optional; defaults to catalog defaults
dimension_families:              # AC-level
  - <dimension-family-object>
metric_families:                 # AC-level
  - <metric-family-object>
schemas:                         # the schemas containing ColumnSpecs
  - <schema-object>
```
The top-level order is conventional; the framework parses the file as a YAML mapping and is order-insensitive. AC authors are encouraged to follow the order above for readability.
Notation conventions¶
Two conventions used throughout this chapter and the rest of the manual:
- Field-name capitalization. The single-letter symbolic fields A (anchor) and M (missingness) are capitalized, matching their mathematical notation in Foundations. All other field names use lowercase snake_case (name, op, lineage, src_name, data_type, a_block, dimension_families, etc.).
- Sets vs. lists. Conceptually, an anchor A and a block set A_block are sets: order is not significant and duplicates are nonsensical. The manual's prose uses set notation for them (A = {transaction}, A_block = {time}, the empty set ∅). In YAML, these serialize as lists (A: [transaction], a_block: [time], the empty list []), since YAML has no native set type. The framework treats the YAML list as a set: the order of elements in A or a_block carries no meaning, and the framework rejects duplicate elements.
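The list-as-set convention can be modeled with a small normalizing helper. This is a sketch of the stated behavior only (order ignored, duplicates rejected); the name `as_set` and the error wording are hypothetical.

```python
from typing import FrozenSet, Iterable


def as_set(values: Iterable[str], field: str) -> FrozenSet[str]:
    """Normalize a YAML list for A or a_block into a set, rejecting duplicate elements."""
    seen = set()
    for value in values:
        if value in seen:
            raise ValueError(f"duplicate element {value!r} in {field}")
        seen.add(value)
    return frozenset(seen)


# Element order carries no meaning: both spellings denote the same anchor.
print(as_set(["store", "month"], "A") == as_set(["month", "store"], "A"))
```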
3.2.2 File scope and naming¶
A schema.init file specifies one AC. ACs are not split across files in Coframe Core (Pro may relax this).
The file's name is conventionally schema.init.yaml or schema.init.yml. The framework reads any file matching the *.init.yaml pattern in a designated AC directory; AC tooling may impose stronger conventions.
3.2.3 References within the file¶
ColumnSpecs reference each other (through the lineage field), AC-dimensions (through A), dimension families (through dimension-family membership), and metric families (through name and through the metric family's family-root reference).
All references within a schema.init file are by name. The framework resolves references during validation; unresolved references are integrity violations.
3.3 AC-level declarations¶
3.3.1 ac_name¶
The AC's primary identifier. A string. Conventionally lowercase with underscores (e.g., retail_analytics_v1). Used by the framework in tooling, agent-facing exposure, and persistence.
3.3.2 ac_description¶
A free-form string describing the AC's purpose and analytical scope. Engineer-facing; not used by the framework's reasoning.
3.3.3 ac_metadata¶
Optional. A YAML map carrying arbitrary structured metadata: author, date, version, organizational scope. The framework preserves but does not interpret these fields.
```yaml
ac_metadata:
  author: retail-analytics-team
  version: "2024.11"
  created: "2024-11-15"
  notes: "Phase 1 of the unified retail AC; covers transactions, stores, monthly inventory"
```
3.3.4 attestation¶
Optional. Configures per-lineage-edge attestation behavior (Chapter 7 specifies the operational semantics). When absent, defaults apply.
```yaml
attestation:
  enabled: true                    # default; set to false to opt out
  failure_mode: soft               # soft (default) | hard | tolerated
  tolerated_edges: []              # required when failure_mode == tolerated
  epsilon_relative: 1.0e-9
  epsilon_absolute: 0
  strict_row_sets: false
  sampling_threshold_rows: 100_000_000
  sampling_fraction: 0.01
  sampling_min_rows_per_stratum: 10_000
  sampling_confidence_target: 0.99
  force_full: false
```
Opting out of attestation requires setting enabled: false explicitly — making opt-out a deliberate, visible choice. The framework treats absence of the attestation block as defaults, not as opt-out.
Field semantics are specified in Chapter 7 §7.6.8.
3.4 Dimension family declarations¶
3.4.1 Structure¶
Each entry in dimension_families declares one named dimension family:
```yaml
dimension_families:
  - name: <string>
    description: <string>          # optional; engineer-facing
    base_level: <ac-dimension-name>
    members: [<ac-dimension-name>, ...]
    hierarchies:
      - name: <string>
        path: [<ac-dimension-name>, ...]
      - ...
```
- name: the dimension family's identifier. Referenced from block sets (A_block) and from Frame-QL queries.
- base_level: the finest AC-dimension in the family. Must be one of members.
- members: the list of AC-dimensions belonging to this dimension family. Each AC-dimension belongs to exactly one dimension family.
- hierarchies: a list of named paths from the base level upward. The path is an ordered list of AC-dimensions starting with the base level.
3.4.2 Single-level dimension families¶
A dimension family with a single member declares an empty or absent hierarchies list:
3.4.3 Multi-hierarchy dimension families¶
A dimension family may declare multiple hierarchies, sharing the base level but otherwise traversing distinct paths:
```yaml
- name: time
  base_level: day
  members: [day, month, quarter, year, fiscal_period, fiscal_quarter, fiscal_year, iso_week, iso_year]
  hierarchies:
    - name: calendar
      path: [day, month, quarter, year]
    - name: fiscal
      path: [day, fiscal_period, fiscal_quarter, fiscal_year]
    - name: week
      path: [day, iso_week, iso_year]
```
Hierarchies must share the base_level as their starting AC-dimension. Above the base, they may share or diverge as the dimension family's structure dictates.
3.4.4 FD-edges from hierarchies¶
Each consecutive pair in a hierarchy's path implies an FD-edge. The calendar hierarchy above implies day → month, month → quarter, quarter → year. The AC's candidate FD-DAG is the union of FD-edges implied by all hierarchies across all dimension families.
FD-edges may also be declared explicitly through the extra_fd_edges mechanism described in §3.4.5, supplementing the hierarchy-implied edges.
3.4.5 Explicit FD-edge declaration¶
For FD-edges that do not appear in any hierarchy (rare in Coframe Core), the AC may declare them explicitly within a dimension family:
```yaml
- name: geography
  base_level: store
  members: [store, city, region, country, metro_area]
  hierarchies:
    - name: administrative
      path: [store, city, region, country]
  extra_fd_edges:
    - [store, metro_area]          # store determines metro_area
    - [metro_area, region]         # metro_area determines region
```
The extra_fd_edges field lists pairs of AC-dimensions establishing additional FD-edges. The framework verifies that all declared FD-edges are within the same dimension family.
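The derivation of a family's candidate FD-edges from its hierarchies and extra_fd_edges can be sketched as below. The dictionary mirrors the geography declaration above; the function name `implied_fd_edges` is hypothetical, and bare AC-dimension names are assumed for path entries.

```python
from typing import Dict, Set, Tuple


def implied_fd_edges(family: Dict) -> Set[Tuple[str, str]]:
    """Union of the consecutive-pair edges of every hierarchy and the explicit extra edges."""
    edges: Set[Tuple[str, str]] = set()
    for hierarchy in family.get("hierarchies", []):
        path = hierarchy["path"]
        edges.update(zip(path, path[1:]))   # each consecutive pair implies an FD-edge
    edges.update(tuple(edge) for edge in family.get("extra_fd_edges", []))
    return edges


geography = {
    "name": "geography",
    "base_level": "store",
    "members": ["store", "city", "region", "country", "metro_area"],
    "hierarchies": [{"name": "administrative",
                     "path": ["store", "city", "region", "country"]}],
    "extra_fd_edges": [["store", "metro_area"], ["metro_area", "region"]],
}

for src, dst in sorted(implied_fd_edges(geography)):
    print(f"{src} -> {dst}")
```

The AC's candidate FD-DAG would then be the union of these per-family edge sets.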
3.4.6 Function-derived hierarchies¶
A hierarchy may contain function-derived FD-edges grounded by operator-catalog functions:
```yaml
- name: time
  base_level: day
  hierarchies:
    - name: calendar
      path:
        - day
        - {ac_dimension: month, derived_by: MONTH_OF}
        - {ac_dimension: quarter, derived_by: QUARTER_OF}
        - {ac_dimension: year, derived_by: YEAR_OF}
```
Where path entries are objects (rather than bare AC-dimension names), the framework interprets derived_by as the operator catalog entry that produces the AC-dimension from its predecessor in the path. Function-derived FD-edges are populated identically to data-attested ones (per Foundations §2.5.6).
3.4.7 Verification¶
The framework verifies at AC validation time:
- Each AC-dimension belongs to exactly one dimension family.
- base_level is among members.
- Every hierarchy's path starts with base_level.
- Every hierarchy path entry is among members.
- The implied FD-DAG within each dimension family is acyclic.
- Any AC-dimension appearing in a ColumnSpec's A field belongs to some declared dimension family.
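A validator for the first five checks might look like the sketch below. It is an assumption-laden illustration: families are plain dictionaries with bare AC-dimension names in paths, `has_cycle` and `verify_dimension_families` are hypothetical names, and the ColumnSpec check is omitted because it needs the schemas as well.

```python
from typing import Dict, List, Set, Tuple


def has_cycle(edges: Set[Tuple[str, str]]) -> bool:
    """Kahn-style check: the graph is acyclic iff every node can be peeled off."""
    nodes = {n for edge in edges for n in edge}
    indegree = {n: 0 for n in nodes}
    for _, dst in edges:
        indegree[dst] += 1
    ready = [n for n in nodes if indegree[n] == 0]
    removed = 0
    while ready:
        node = ready.pop()
        removed += 1
        for src, dst in edges:
            if src == node:
                indegree[dst] -= 1
                if indegree[dst] == 0:
                    ready.append(dst)
    return removed != len(nodes)


def verify_dimension_families(families: List[Dict]) -> List[str]:
    """Return human-readable violations; an empty list means the checks pass."""
    errors: List[str] = []
    owner: Dict[str, str] = {}
    for fam in families:
        name, members = fam["name"], set(fam["members"])
        for dim in sorted(members):
            if dim in owner:
                errors.append(f"{dim} belongs to both {owner[dim]} and {name}")
            owner[dim] = name
        if fam["base_level"] not in members:
            errors.append(f"{name}: base_level not among members")
        edges: Set[Tuple[str, str]] = set()
        for h in fam.get("hierarchies", []):
            path = h["path"]
            if not path or path[0] != fam["base_level"]:
                errors.append(f"{name}/{h['name']}: path must start at base_level")
            errors.extend(f"{name}/{h['name']}: {d} not among members"
                          for d in path if d not in members)
            edges.update(zip(path, path[1:]))
        if has_cycle(edges):
            errors.append(f"{name}: implied FD-DAG has a cycle")
    return errors


time_family = {
    "name": "time", "base_level": "day",
    "members": ["day", "month", "quarter", "year", "fiscal_period",
                "fiscal_quarter", "fiscal_year", "iso_week", "iso_year"],
    "hierarchies": [
        {"name": "calendar", "path": ["day", "month", "quarter", "year"]},
        {"name": "fiscal", "path": ["day", "fiscal_period", "fiscal_quarter", "fiscal_year"]},
        {"name": "week", "path": ["day", "iso_week", "iso_year"]},
    ],
}
print(verify_dimension_families([time_family]))
```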
3.5 Metric family declarations¶
3.5.1 Structure¶
Each entry in metric_families declares one metric family:
```yaml
metric_families:
  - name: <family-name>
    description: <string>          # optional
    family_root:
      schema: <schema-name>
      column: <column-name>
    ip_reducers:
      - operator: <operator-name>
        a_block: [<dimension-family-name>, ...]
      - ...
```
- name: the family-name. Every ColumnSpec in this family has this string as its name field. The family-name participates in Frame-QL queries.
- family_root: identifies the family-root ColumnSpec by the schema it lives in and the column name within that schema. The family-root's own name must equal the family's name.
- ip_reducers: a list of identity-preserving reducers under which the family rolls up. Each entry pairs an operator with its block set.
3.5.2 The ip_reducer list¶
A family may declare zero, one, or many ip_reducers.
- Zero ip_reducers: the family is anchor-locked. Cross-grain navigation refused.
- One ip_reducer: the typical case for additive measures (revenue under SUM) or extremal measures (peak_inventory under MAX).
- Multiple ip_reducers: a family supporting different rollup semantics under different operators (eom_inventory under both SUM with A_block = {time} and MAX with no block).
3.5.3 Block set declaration¶
A block set is a list of dimension family names. Each name must reference a declared dimension family.
```yaml
ip_reducers:
  - operator: SUM
    a_block: [time]                # SUM rollup across time blocked
  - operator: MAX
    a_block: []                    # MAX rollup unrestricted
```
The framework expands a block set declaration [time] to the set of all AC-dimensions in the time dimension family (e.g., {day, month, quarter, year, fiscal_period, ...}) for purposes of rollup-direction checking. Declaring at the dimension family level (rather than enumerating AC-dimensions) ensures the block set automatically tracks evolution of the dimension family's membership.
3.5.4 Verification¶
The framework verifies at AC validation time:
- Every metric family's family_root reference resolves to an existing ColumnSpec.
- The referenced ColumnSpec's name equals the family's name.
- The referenced ColumnSpec's lineage is self-referential (it is in fact a family-root in Foundations' sense).
- Every operator listed in ip_reducers is partition-invariant per the operator catalog.
- Every dimension family named in any a_block is a declared dimension family.
- The implied set of blocked AC-dimensions in each a_block is downward-closed along the FD-DAG (an AC-dimension blocked under R implies all FD-reachable AC-dimensions are blocked).
3.5.5 Operator-attested vs. operator-asserted (derived)¶
The framework derives, per metric family:
- Operator-attested: the family-root has at least one sibling in the AC (a ColumnSpec with the same name and lineage pointing to the family-root). The family's ip_reducer is verifiable per-lineage-edge against this sibling during DQ Phase 3.
- Operator-asserted: the family-root has no in-AC sibling. The family's ip_reducer is trusted from declaration, not data-attested.
This is a derived property reported in the DQ deliverable (Chapter 7), not declared by the AC author.
3.5.6 Examples¶
The retail AC's metric families:
```yaml
metric_families:
  - name: revenue
    description: "Transaction-level revenue; rolls up by SUM in any direction"
    family_root:
      schema: transactions
      column: revenue
    ip_reducers:
      - operator: SUM
        a_block: []
  - name: units_sold
    family_root:
      schema: transactions
      column: units_sold
    ip_reducers:
      - operator: SUM
        a_block: []
  - name: cost
    family_root:
      schema: transactions
      column: cost
    ip_reducers:
      - operator: SUM
        a_block: []
  - name: eom_inventory
    description: "End-of-month inventory snapshots; semi-additive (SUM blocked across time)"
    family_root:
      schema: store_monthly_inventory
      column: eom_inventory
    ip_reducers:
      - operator: SUM
        a_block: [time]
      - operator: MAX
        a_block: []
  - name: peak_inventory
    family_root:
      schema: store_monthly_inventory
      column: peak_inventory
    ip_reducers:
      - operator: MAX
        a_block: []
  - name: gross_margin_pct
    description: "Per-transaction gross margin percentage; singleton (no rollup)"
    family_root:
      schema: transactions
      column: gross_margin_pct
    ip_reducers: []
```
3.5.7 Cache hint (optional)¶
A metric family may declare a cache_hint block listing anchor-sets the Metric Engine should pre-materialise at COMMIT-time. The hint is advisory to the engine, not part of the family's semantic specification — omitting it does not change what the family means, only how aggressively the engine warms its cache.
metric_families:
  - name: revenue
    family_root:
      schema: transactions
      column: revenue
    ip_reducers:
      - operator: SUM
        a_block: []
    cache_hint:                # optional
      materialize_at:
        - [region]             # pre-materialise revenue@(region,)
        - [region, day]        # pre-materialise revenue@(region, day)
Semantics:
- Each entry under materialize_at is a list of dimension-family names. The engine pre-materialises the family at that anchor (using the family's first ip_reducer).
- The engine validates each anchor against the AC at COMMIT time: every named dimension family must exist and the resulting anchor must be FD-reachable.
- The Metric Engine's promotion-recommendation surface (see Chapter 11 §11.7) emits paste-able cache_hint stanzas based on observed query patterns; the intended workflow is declare → observe → promote.
- cache_hint is ignored when the engine is disabled in installation.yaml (see Chapter 11 §11.2). The family's correctness is independent of the hint.
The hint is the Manual's only language-level acknowledgment that the engine exists; everything else about engine behavior lives in Chapter 11.
3.6 The naming function declaration¶
3.6.1 What the naming function does¶
The naming function is the AC-level declaration that maps (name_pred, A_pred, op) to name_self. It allows the framework to verify that a column's declared name is consistent with its operational lineage: given that a column derives from predecessor m_pred via operator op, the naming function determines what name the resulting column should carry.
For identity-preserving operators, the naming function returns name_pred (the name is preserved). For non-identity-preserving operators, the naming function maps to a derived name — revenue under MAX becomes peak_revenue; revenue under COUNT becomes revenue_count.
The naming function is the bridge between the AC author's naming choices and the framework's structural verification.
3.6.2 Declaration options¶
The AC has four options for declaring the naming function:
Option 1: Adopt the operator catalog's defaults.
The operator catalog (Chapter 10) ships with default naming-function entries per operator. This option uses them as-is. Suitable for ACs following catalog conventions.
Option 2: Override per-operator.
naming_function:
  use_catalog_defaults: true
  overrides:
    MAX: "{name_pred}_max"     # overrides catalog's "peak_{name_pred}"
    COUNT: "n_{name_pred}"
The catalog defaults apply, with named overrides for specific operators. The override format uses {name_pred} as a substitution token. Other tokens may be supported ({op}, {A_pred}) per Chapter 10.
Option 3: Custom naming function.
naming_function:
  rules:
    - if: {op: SUM}
      then: "{name_pred}"      # SUM is identity-preserving
    - if: {op: MAX}
      then: "peak_{name_pred}"
    - if: {op: MIN}
      then: "min_{name_pred}"
    - if: {op: COUNT}
      then: "{name_pred}_count"
    - if: {op: AVG}
      then: "avg_{name_pred}"
    - if: {op: COUNT_DISTINCT}
      then: "distinct_{name_pred}"
    - default: "{name_pred}_{op}"
A custom rule list. The first matching rule applies; the default rule applies if no rule matches.
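The first-match semantics can be illustrated with a small Python sketch. It assumes the rule list above has already been parsed into (condition, template) pairs; `RULES`, `DEFAULT`, and `naming_function` are illustrative names, and the way conditions are matched (equality on each listed key) is an assumption rather than documented behavior.

```python
# Rule list from Option 3, as (condition, template) pairs in declaration order.
RULES = [
    ({"op": "SUM"}, "{name_pred}"),
    ({"op": "MAX"}, "peak_{name_pred}"),
    ({"op": "MIN"}, "min_{name_pred}"),
    ({"op": "COUNT"}, "{name_pred}_count"),
    ({"op": "AVG"}, "avg_{name_pred}"),
    ({"op": "COUNT_DISTINCT"}, "distinct_{name_pred}"),
]
DEFAULT = "{name_pred}_{op}"


def naming_function(name_pred, a_pred, op):
    """Map (name_pred, A_pred, op) to name_self: first matching rule wins."""
    context = {"name_pred": name_pred, "A_pred": a_pred, "op": op}
    for condition, template in RULES:
        if all(context.get(key) == value for key, value in condition.items()):
            return template.format(**context)
    # No rule matched: fall through to the default template.
    return DEFAULT.format(**context)


preserved = naming_function("revenue", ("transaction",), "SUM")
derived = naming_function("revenue", ("transaction",), "MAX")
fallback = naming_function("revenue", ("transaction",), "MEDIAN")
```

Here SUM keeps the name revenue, MAX yields peak_revenue, and an operator with no rule (MEDIAN in this sketch) falls through to the default template, giving revenue_MEDIAN.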
Option 4: Decline structured naming.
The AC declines to have its column names verified against operational lineage. ColumnSpecs declare name freely; the framework does not check naming consistency. The AC's family membership is still determined by name equality, but the framework cannot verify that a column's name "follows from" its operator.
Option 4 trades verifiability for flexibility. It is documented but discouraged for new ACs.
3.6.3 Verification¶
When the naming function is declared (Options 1, 2, or 3), the framework verifies for every non-root ColumnSpec:
name_self = naming_function(name_pred, A_pred, op)
Mismatches are structural violations. The DQ report lists violations; the AC fails validation if any exist (under strict mode) or proceeds with warnings (under permissive mode — Pro feature; not in Core).
Under Option 4 (declined), no naming verification is performed. The framework consumes name for family-membership purposes only.
3.7 Schema declarations¶
3.7.1 Structure¶
Each entry in schemas declares one schema:
schemas:
  - schema_name: <string>
    description: <string>            # optional
    source:
      type: table | view | query
      connection: <connection-name>
      reference: <string>            # table name, view name, or SQL query
    declared_scope:
      <dimension-family-name>: non_degenerate | degenerate
      ...
    column_specs:
      - <ColumnSpec>
      - ...
3.7.2 schema_name¶
A string identifying the schema within the AC. Must be unique among schemas in this AC.
3.7.3 source¶
Specifies the backend binding:
source:
  type: table
  connection: warehouse_primary       # references a declared connection
  reference: retail_db.transactions   # backend table identifier
The type field is one of table, view, or query. For query, the reference field contains a SQL query string the backend evaluates to produce the virtual table.
The connection field references a backend connection declared elsewhere (Chapter 6 specifies the connection format). For ACs with a single backend, the connection may be implicit.
3.7.4 declared_scope¶
Per dimension family, declares whether the schema is non-degenerate or degenerate. Dimension families not listed default to non_degenerate if any of the schema's ColumnSpecs anchor in that family, degenerate otherwise.
declared_scope:
  time: non_degenerate
  geography: non_degenerate
  product: degenerate
  transaction: degenerate
Explicit declaration is preferred for clarity. The framework verifies during DQ Phase 1 that the schema's data honors its declared scope.
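The defaulting rule for unlisted dimension families can be sketched as follows. This is a minimal illustration: `resolve_declared_scope` and its arguments are hypothetical, and anchors are assumed to be already resolved (no [self] markers).

```python
def resolve_declared_scope(declared, column_anchors, family_of, all_families):
    """Fill in the default scope for dimension families the schema does not list."""
    # Families touched by at least one ColumnSpec anchor in this schema.
    anchored = {family_of[dim] for anchor in column_anchors for dim in anchor}
    resolved = dict(declared)
    for family in all_families:
        resolved.setdefault(
            family, "non_degenerate" if family in anchored else "degenerate"
        )
    return resolved


# Assumed mapping from AC-dimension to its dimension family (retail AC subset).
family_of = {
    "store": "geography", "city": "geography",
    "day": "time", "month": "time",
    "sku": "product", "transaction": "transaction",
}
all_families = ["time", "geography", "product", "transaction"]

# A stores-like schema: nothing declared, every non-grain column anchored at store.
stores_scope = resolve_declared_scope(
    declared={},
    column_anchors=[["store"], ["store"], ["store"]],
    family_of=family_of,
    all_families=all_families,
)
```

With nothing declared, the sketch resolves geography to non_degenerate and the other three families to degenerate. Explicit entries in `declared` are never overwritten.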
3.7.5 column_specs¶
The list of ColumnSpecs for the analytically-relevant columns in this schema. Specified next (§3.8).
3.7.6 Virtual splitting¶
A schema may declare additional virtual splits — projections of the schema's data into separate logical schemas, useful for cases where one physical table contains data at multiple grains:
schemas:
  - schema_name: events
    source:
      type: table
      reference: events_raw
    virtual_splits:
      - schema_name: events_session
        filter: event_type = 'session_start'
      - schema_name: events_pageview
        filter: event_type = 'pageview'
Each virtual split inherits the source schema's connection but applies a filter to project a subset of rows. Each split is treated as its own schema with its own ColumnSpec list.
Virtual splits are an optional feature; most ACs do not need them.
3.8 ColumnSpec format¶
A ColumnSpec declares one column within a schema. It is the AC's per-column structural commitment. The structure is divided into four parts as introduced in Foundations §2.9.6.
3.8.1 Backend-facing fields¶
src_name¶
The physical column name in the backend table the schema binds to. A string. Backend-facing; not used in queries.
If src_name equals the column's name, the AC author may omit src_name. The framework defaults src_name from name when absent.
data_type¶
The column's data type as exposed by the backend. The framework recognizes:
| Type | Description |
|---|---|
| numeric | Integer or floating-point. |
| integer | Integer-valued. |
| string | Character data. |
| boolean | TRUE / FALSE. |
| date | Calendar dates. |
| timestamp | Dates with time-of-day. |
| hll_sketch | HyperLogLog sketch (for approximate distinct counts). |
| theta_sketch | Theta sketch (for approximate set operations). |
| t_digest | t-digest sketch (for approximate quantiles). |
| kll_sketch | KLL sketch (for approximate quantiles). |
| bitmap | Compressed bitmap. |
The non-numeric monoidal types (sketch types, bitmap) are first-class. ColumnSpecs may declare these as data_type; operators in the catalog acting on them (HLL_MERGE, THETA_UNION, T_DIGEST_MERGE) are partition-invariant in the algebraic sense.
Backend-specific types (decimal precision, varchar length, geographic types) map to recognized types per the backend's data-API.
data_type participates in:
- Type-consistency rules across same-named columns (§3.9).
- Operator-applicability checks: each catalog operator declares its input type signature.
- Frame-QL expression type-checking (Chapter 8).
The framework rejects ColumnSpecs whose data_type is incompatible with their op (e.g., SUM on a string column, HLL_MERGE on a numeric column).
3.8.2 Anchor-facing fields¶
A¶
The column's anchor: the set of AC-dimensions on whose values the column's value depends. Declared as a list:
A: [transaction] # anchored at transaction
A: [store] # anchored at store
A: [store, month] # anchored at (store, month) — composite
A: [self] # grain-role column
The cardinality of A is constrained by the column's trichotomy and its role in the schema:
- An AC-dimension or AC-attribute in non-grain role: |A| = 1 (anchored at a single coordinate).
- A grain-role column: A = [self] for a single-column grain (a special marker the framework resolves to the column itself), or the full composite grain (e.g., A: [store, month]) for a column contributing to a composite grain.
- An AC-metric: |A| ≥ 1 (its anchor, possibly composite).
The framework verifies during AC validation that every AC-dimension named in A belongs to some declared dimension family. The named AC-dimension need not itself be in grain role in any schema — derived AC-dimensions (reachable through the FD-DAG, such as region reachable from store) are valid anchor members. What the framework requires is that each AC-dimension's values be grounded — that is, the AC-dimension either (a) appears in at least one ColumnSpec in any role, grain or non-grain (so its values are directly observed), or (b) is reachable through the FD-DAG (data-attested or function-derived edges) from an AC-dimension that is so grounded. This ensures every AC-dimension's values can be obtained from observed data, whether by direct read or by FD-navigation.
For example, in the retail AC the time dimension family's base level day is never in grain role, but it appears as a non-grain-role observed column in the transactions schema (anchored at {transaction}) — grounding (a). The coarser time AC-dimensions month, quarter, year are grounded by reachability from day — grounding (b), via data-attested or function-derived FD-edges. A base level need not be in grain role; it need only be grounded.
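The grounding requirement amounts to a reachability computation over the FD-DAG. The sketch below is illustrative only: `grounded_dimensions` and `ungrounded_members` are hypothetical names, and the FD-edges are supplied as a plain adjacency mapping from finer to coarser AC-dimension.

```python
def grounded_dimensions(observed, fd_edges):
    """AC-dimensions grounded by (a) direct observation or (b) FD-reachability."""
    grounded = set(observed)
    frontier = list(observed)
    while frontier:
        source = frontier.pop()
        for target in fd_edges.get(source, ()):
            if target not in grounded:
                grounded.add(target)
                frontier.append(target)
    return grounded


def ungrounded_members(anchor, observed, fd_edges):
    """Members of an anchor that cannot be obtained from observed data."""
    grounded = grounded_dimensions(observed, fd_edges)
    return [dim for dim in anchor if dim not in grounded]


# Assumed subset of the retail AC: day and store are observed columns.
observed = {"day", "store", "transaction"}
fd_edges = {
    "day": ["month"], "month": ["quarter"], "quarter": ["year"],
    "store": ["city"], "city": ["region"], "region": ["country"],
}
ok = ungrounded_members(["region", "year"], observed, fd_edges)
bad = ungrounded_members(["store", "fiscal_year"], observed, fd_edges)
```

In this subset, region and year are grounded by reachability from store and day, while fiscal_year is reported as ungrounded because no FD-edge leads to it.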
M¶
The column's missingness anchor: the set of coordinates on whose values the column's missingness depends. Declared as a single list, parallel to A, ranging over A ∪ [self]:
M: [] # missingness depends on nothing (MCAR)
M: [region] # missingness depends on region (MAR)
M: [self] # missingness depends on the value itself (MNAR)
M: [self, region] # depends on the value AND region (MNAR)
Constraints:
- M ⊆ A ∪ [self]: every member of M is either an AC-dimension in the column's anchor or the self marker.
The three-way mechanism classification is derived from M, not separately declared (Foundations §2.4.3): M = [] is MCAR; M ⊆ A without self is MAR; any M containing self is MNAR. The set carries strictly more information than the category — [self] and [self, region] are both MNAR but are different missingness anchors — so the framework declares and reasons over the set, treating the category as a readable summary only.
Examples:
# MCAR — missingness unrelated to anything
M: []
# MAR — missingness depends on an observed coordinate
M: [region]
# MNAR — missingness depends on the value itself
M: [self]
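Because the mechanism is derived from the set, the classification reduces to two membership tests. The following sketch shows the derivation together with the subset constraint; both function names are illustrative, not framework APIs.

```python
def classify_missingness(M):
    """Derive the readable mechanism label from a missingness anchor."""
    if not M:
        return "MCAR"
    return "MNAR" if "self" in M else "MAR"


def missingness_anchor_violations(M, A):
    """Members of M that break the constraint M ⊆ A ∪ [self]."""
    allowed = set(A) | {"self"}
    return [member for member in M if member not in allowed]


labels = [
    classify_missingness([]),
    classify_missingness(["region"]),
    classify_missingness(["self"]),
    classify_missingness(["self", "region"]),
]
violations = missingness_anchor_violations(["self", "sku"], A=["store", "month"])
```

The last two labels are both MNAR even though the underlying sets differ, which is why the set, not the label, is what gets declared.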
Auto-derivation for grain-role columns¶
For grain-role columns (A = [self]), the framework auto-derives:
- M = [self] (missingness, were it permitted, would depend on the identifying value itself).
- Missing values are inadmissible: a missing value in a grain-role column is an integrity violation, since the column must identify the schema's rows.
The AC author declares A = [self]; M may be omitted. If M is also declared, it must match the derived value.
For a composite-grain contributor (a grain-role column whose A is the explicit composite, e.g. A: [store, month]), the same treatment applies — M = [self], missing values inadmissible — but because A is not the [self] marker, the AC author declares M explicitly (as the worked example in §3.11 does) rather than relying on auto-derivation.
3.8.3 Operator and lineage fields¶
op¶
The operator that produced this column. References a catalog entry by name (Chapter 10).
op: OBSERVED # for raw observations
op: SUM # for SUM aggregations
op: MAX # for MAX aggregations
op: HLL_MERGE # for HLL sketch merges
op: MAP_DIV # for multi-input division
op: MONTH_OF # for function-derived AC-dimensions
For a root column (lineage self-referential), op is the column's root operator. For an OBSERVED root, op = OBSERVED indicates direct observation. For a reducer-rooted family (no in-AC predecessor), op is the family's intended ip_reducer.
For grain-role columns (A = [self]), op defaults to OBSERVED if not explicitly declared.
The framework verifies that op references a valid catalog entry and that the operator's type signature accepts the column's data_type (and the predecessor's data_type for non-root columns).
lineage¶
The column's predecessor record:
# For a non-root column
lineage:
  name: <predecessor-family-name>
  A: [<ac-dimensions>]
  op: <predecessor-operator>

# For a root column (self-referential)
lineage:
  name: <self-family-name>
  A: [<self-A>]
  op: <self-op>
For non-root columns, the framework verifies that a ColumnSpec matching (name, A, op) exists in the AC. The matching ColumnSpec is the lineage predecessor.
For root columns, the lineage field equals (name_self, A_self, op_self) — the column's own (name, A, op). The framework recognizes self-referential lineage as marking the column as a root.
Multi-input lineage (singletons)¶
For multi-input operators (producing singletons, per Foundations §2.7.7), lineage is a list of predecessor records:
# Singleton: gross_margin_pct = MAP_DIV(revenue, cost)
lineage:
  - name: revenue
    A: [transaction]
    op: OBSERVED
  - name: cost
    A: [transaction]
    op: OBSERVED
op: MAP_DIV
The framework verifies each predecessor record references an existing ColumnSpec and that all predecessors share a common anchor (the singleton's anchor).
Auto-derivation¶
For grain-role columns, lineage defaults to self-referential if not declared:
# Equivalent declarations for the transaction AC-dimension in transactions schema
A: [self]
# lineage and op auto-derived: lineage = (transaction, [transaction], OBSERVED), op = OBSERVED
3.8.4 Cross-schema linkage: name¶
The column's family-name. A string. Two ColumnSpecs in the AC with the same name belong to the same metric family.
The framework uses name to determine family membership. Naming-function verification (when declared) ensures name is consistent with the column's lineage and operator.
name and src_name are independent. A column may have src_name = revenue_amount (backend physical name) and name = revenue (AC family-name). The AC's naming reflects the AC author's choices; the backend's naming reflects the source system's choices.
3.8.5 Required vs. derivable fields¶
All four parts of a ColumnSpec must be present (declared or derived) at AC validation time. The framework auto-derives:
- For grain-role columns (A = [self]):
  - M = [self].
  - op = OBSERVED (if not declared).
  - lineage = self-referential (if not declared).
- For columns where src_name equals name:
  - src_name defaults from name.
Missing required fields are integrity violations.
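A minimal sketch of these derivation rules, operating on a ColumnSpec held as a plain dictionary. The helper name `derive_defaults` is hypothetical. Resolving [self] to the column's own name in the derived lineage follows the transaction example in §3.8.3; raising ValueError stands in for reporting an integrity violation.

```python
REQUIRED = ("name", "data_type", "A", "M", "op", "lineage")


def derive_defaults(spec):
    """Return a copy of a ColumnSpec dict with derivable fields filled in."""
    out = dict(spec)
    out.setdefault("src_name", out["name"])
    if out.get("A") == ["self"]:
        # Grain-role column: M is derived and, if declared, must match.
        if out.setdefault("M", ["self"]) != ["self"]:
            raise ValueError(f"{out['name']}: declared M must equal [self]")
        out.setdefault("op", "OBSERVED")
        out.setdefault(
            "lineage",
            {"name": out["name"], "A": [out["name"]], "op": out["op"]},
        )
    missing = [field for field in REQUIRED if field not in out]
    if missing:
        raise ValueError(f"{out['name']}: missing required fields {missing}")
    return out


grain = derive_defaults({"name": "transaction", "data_type": "string", "A": ["self"]})
```

A non-grain column that omits M, op, or lineage gets no help from the derivation and is rejected by the final check.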
3.8.6 ColumnSpec examples¶
Examples from the retail AC.
A grain-role AC-dimension (the transaction column in the transactions schema):
- name: transaction
  data_type: string
  A: [self]
An AC-dimension in non-grain role (the store column in the transactions schema, referencing the geography dimension family):
- name: store
  data_type: string
  A: [transaction]
  M: []
  op: OBSERVED
  lineage:
    name: store
    A: [transaction]
    op: OBSERVED
Here store is an AC-dimension (it is in grain role in the stores schema and is the base level of the geography dimension family), but its role in the transactions schema is non-grain: its anchor is {transaction}, meaning each transaction references a store. The trichotomy classification (AC-dimension) is an AC-level property; the grain-vs-non-grain role is per-schema.
An observation-rooted AC-metric (the revenue column in the transactions schema):
- name: revenue
  src_name: revenue_amount
  data_type: numeric
  A: [transaction]
  M: []
  op: OBSERVED
  lineage:
    name: revenue
    A: [transaction]
    op: OBSERVED
A reducer-rooted AC-metric (the eom_inventory column in the store_monthly_inventory schema):
- name: eom_inventory
  data_type: numeric
  A: [store, month]
  M: [store, month]
  op: SUM                      # family's intended ip_reducer
  lineage:
    name: eom_inventory        # self-referential: this is the family-root
    A: [store, month]
    op: SUM
A function-derived AC-dimension (a month column derived from day via MONTH_OF, declared in some schema):
Note that the month column is in grain role (A = [self]), but its op is MONTH_OF and its lineage points to a day predecessor. This is a function-derived AC-dimension — common when temporal dimension members are computed from a base date.
A singleton (the gross_margin_pct column in the transactions schema):
- name: gross_margin_pct
  data_type: numeric
  A: [transaction]
  M: [transaction]
  op: MAP_DIV
  lineage:
    - name: revenue
      A: [transaction]
      op: OBSERVED
    - name: cost
      A: [transaction]
      op: OBSERVED
The multi-input lineage field captures both predecessors.
3.9 Derived properties¶
The framework computes several properties from declared ColumnSpecs and AC-level declarations. These are derived, not declared; they appear in the AC's compiled metadata, available to query resolution, agent-facing exposure, and the DQ deliverable.
3.9.1 Trichotomy classification¶
Per Foundations §2.4.5, every column is classified as AC-dimension, AC-attribute, or AC-metric, derived from A patterns across schemas.
3.9.2 Family membership¶
Per Foundations §2.7.3, every column with a given name belongs to that named metric family. Membership is computed from name equality.
3.9.3 Family-root¶
Per Foundations §2.7.4, the family-root of a metric family is the column reached by walking lineage backward as long as name_pred equals name_self. The metric_families declarations name the expected family-root; the framework verifies the derived family-root matches.
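The backward walk can be sketched as below. The sketch assumes single-input lineage and represents each ColumnSpec by its (name, A, op) tuple; `family_root` and `lineage_of` are illustrative names, and the cycle guard is a defensive addition, not something the manual specifies.

```python
def family_root(start, lineage_of):
    """Walk lineage backward while the predecessor keeps the same name."""
    current, seen = start, {start}
    while True:
        pred = lineage_of[current]
        # Stop at a self-referential root or where the family name changes.
        if pred == current or pred[0] != current[0]:
            return current
        if pred in seen:
            raise ValueError(f"lineage cycle at {pred}")
        seen.add(pred)
        current = pred


# Assumed example lineage: revenue rolled up twice, plus a MAX-derived family.
observed = ("revenue", ("transaction",), "OBSERVED")
by_store = ("revenue", ("store",), "SUM")
by_region = ("revenue", ("region",), "SUM")
peak = ("peak_revenue", ("region",), "MAX")
lineage_of = {
    observed: observed,
    by_store: observed,
    by_region: by_store,
    peak: by_store,
}
root_of_region = family_root(by_region, lineage_of)
root_of_peak = family_root(peak, lineage_of)
```

The region-grain revenue column walks back to the observed transaction-grain column, while peak_revenue is its own family-root because its predecessor carries a different name.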
3.9.4 Sibling, cousin, and identical relations¶
Per Foundations §2.7.5, structural relations among same-named columns are derived from lineage walks.
3.9.5 Schema grain¶
Per Foundations §2.9.3, each schema's grain is derived from the set of grain-role ColumnSpecs in the schema.
3.9.6 FD-DAG¶
Per Foundations §2.5.4, the AC's candidate FD-DAG is computed as the union of hierarchy-implied FD-edges, explicit FD-edge declarations (§3.4.5), and function-derived FD-edges (§3.4.6).
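As a rough sketch of this union, the code below assumes that a hierarchy path implies one FD-edge per consecutive pair of levels (finer to coarser); that reading, and the function names, are assumptions for illustration. The acyclicity check uses Kahn's algorithm.

```python
def hierarchy_edges(paths):
    """One FD-edge per consecutive pair along each hierarchy path."""
    return {(a, b) for path in paths for a, b in zip(path, path[1:])}


def candidate_fd_dag(paths, explicit=(), function_derived=()):
    """Union of hierarchy-implied, explicit, and function-derived FD-edges."""
    return hierarchy_edges(paths) | set(explicit) | set(function_derived)


def is_acyclic(edges):
    """Kahn's algorithm: the graph is acyclic iff every node can be removed."""
    nodes = {node for edge in edges for node in edge}
    indegree = {node: 0 for node in nodes}
    for _, target in edges:
        indegree[target] += 1
    ready = [node for node, degree in indegree.items() if degree == 0]
    removed = 0
    while ready:
        node = ready.pop()
        removed += 1
        for source, target in edges:
            if source == node:
                indegree[target] -= 1
                if indegree[target] == 0:
                    ready.append(target)
    return removed == len(nodes)


# Hierarchy paths of the time dimension family from the retail AC (§3.11).
time_paths = [
    ["day", "month", "quarter", "year"],
    ["day", "fiscal_period", "fiscal_quarter", "fiscal_year"],
    ["day", "iso_week", "iso_year"],
]
edges = candidate_fd_dag(time_paths)
```

For the time family this produces eight edges fanning out from day, and the result passes the acyclicity check. Adding a back-edge such as year → day makes the check fail.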
3.9.7 Operator-attested vs. operator-asserted¶
Per §3.5.5, every metric family is classified as operator-attested or operator-asserted based on whether its family-root has an in-AC sibling.
3.9.8 Lineage edges and the lineage graph¶
Per Foundations §2.7.2, the lineage graph is computed as the directed graph of lineage edges across all ColumnSpecs.
3.10 AC-level integrity conditions¶
The framework verifies the following conditions at AC validation time. These reify Foundations §2.10's rules in the schema.init context.
3.10.1 Structural integrity (per-column, per-schema, AC-level)¶
- Every ColumnSpec has name, data_type, A, M, op, lineage declared or derivable.
- Anchor cardinality respects the trichotomy.
- lineage references resolve to existing ColumnSpecs.
- Operator type signatures match data_type declarations.
- Naming consistency holds per the naming function (when declared).
- Each dimension family is well-formed (single base level, members partition the AC-dimensions).
- The candidate FD-DAG is acyclic within each dimension family.
- Family-root uniqueness within (name, A) is maintained.
- Block sets reference declared dimension families and are downward-closed along the FD-DAG.
- ip_reducer operators are partition-invariant per the catalog.
3.10.2 Schema integrity¶
- Each schema's source binding resolves.
- Each schema's grain-role ColumnSpecs together identify rows uniquely (verified during DQ).
- Declared scope is internally consistent (a schema declared degenerate on dimension family D has no ColumnSpec with A referencing AC-dimensions from D).
3.10.3 Data-attested conditions¶
The following conditions require backend data and are verified during the DQ process (Chapter 7):
- Candidate FD-DAG ⊆ Data-driven FD-DAG.
- Declared scope honored in data.
- Grain combo-key uniqueness.
- Cross-schema value-mapping consistency for same-named AC-dimensions and AC-attributes.
- Per-lineage-edge value attestation for metric family siblings.
3.11 A complete example¶
This section presents the retail AC (introduced in the front matter) compiled into complete schema.init form. It is excerpted; some columns and metadata are abbreviated for readability.
ac_name: retail_analytics_v1
ac_description: |
  Unified retail AC covering transactions, store reference, and monthly inventory.
ac_metadata:
  author: retail-analytics-team
  version: "2024.11"
attestation:
  enabled: true
  failure_mode: soft
  epsilon_relative: 1.0e-9
naming_function:
  use_catalog_defaults: true
dimension_families:
  - name: time
    base_level: day
    members: [day, month, quarter, year, fiscal_period, fiscal_quarter, fiscal_year, iso_week, iso_year]
    hierarchies:
      - name: calendar
        path: [day, month, quarter, year]
      - name: fiscal
        path: [day, fiscal_period, fiscal_quarter, fiscal_year]
      - name: week
        path: [day, iso_week, iso_year]
  - name: geography
    base_level: store
    members: [store, city, region, country]
    hierarchies:
      - name: administrative
        path: [store, city, region, country]
  - name: product
    base_level: sku
    members: [sku, product_line, product_category, department]
    hierarchies:
      - name: merchandising
        path: [sku, product_line, product_category, department]
  - name: transaction
    base_level: transaction
    members: [transaction]
    hierarchies: []
metric_families:
  - name: revenue
    family_root: {schema: transactions, column: revenue}
    ip_reducers:
      - {operator: SUM, a_block: []}
  - name: cost
    family_root: {schema: transactions, column: cost}
    ip_reducers:
      - {operator: SUM, a_block: []}
  - name: units_sold
    family_root: {schema: transactions, column: units_sold}
    ip_reducers:
      - {operator: SUM, a_block: []}
  - name: eom_inventory
    description: "End-of-month inventory; semi-additive"
    family_root: {schema: store_monthly_inventory, column: eom_inventory}
    ip_reducers:
      - {operator: SUM, a_block: [time]}
      - {operator: MAX, a_block: []}
  - name: peak_inventory
    family_root: {schema: store_monthly_inventory, column: peak_inventory}
    ip_reducers:
      - {operator: MAX, a_block: []}
  - name: gross_margin_pct
    family_root: {schema: transactions, column: gross_margin_pct}
    ip_reducers: []
schemas:
  - schema_name: transactions
    source:
      type: table
      connection: warehouse_primary
      reference: retail_db.transactions
    declared_scope:
      time: non_degenerate
      geography: non_degenerate
      product: non_degenerate
      transaction: non_degenerate
    column_specs:
      - name: transaction
        data_type: string
        A: [self]
      - name: store
        data_type: string
        A: [transaction]
        M: []
        op: OBSERVED
        lineage: {name: store, A: [transaction], op: OBSERVED}
      - name: sku
        data_type: string
        A: [transaction]
        M: []
        op: OBSERVED
        lineage: {name: sku, A: [transaction], op: OBSERVED}
      - name: day
        data_type: date
        A: [transaction]
        M: []
        op: OBSERVED
        lineage: {name: day, A: [transaction], op: OBSERVED}
      - name: revenue
        src_name: revenue_amount
        data_type: numeric
        A: [transaction]
        M: []
        op: OBSERVED
        lineage: {name: revenue, A: [transaction], op: OBSERVED}
      - name: units_sold
        data_type: integer
        A: [transaction]
        M: []
        op: OBSERVED
        lineage: {name: units_sold, A: [transaction], op: OBSERVED}
      - name: cost
        data_type: numeric
        A: [transaction]
        M: [transaction]
        op: OBSERVED
        lineage: {name: cost, A: [transaction], op: OBSERVED}
      - name: gross_margin_pct
        data_type: numeric
        A: [transaction]
        M: [transaction]
        op: MAP_DIV
        lineage:
          - {name: revenue, A: [transaction], op: OBSERVED}
          - {name: cost, A: [transaction], op: OBSERVED}
  - schema_name: stores
    source:
      type: table
      connection: warehouse_primary
      reference: retail_db.stores
    declared_scope:
      geography: non_degenerate
      time: degenerate
      product: degenerate
      transaction: degenerate
    column_specs:
      - name: store
        data_type: string
        A: [self]
      - name: city
        data_type: string
        A: [store]
        M: []
        op: OBSERVED
        lineage: {name: city, A: [store], op: OBSERVED}
      - name: region
        data_type: string
        A: [store]
        M: []
        op: OBSERVED
        lineage: {name: region, A: [store], op: OBSERVED}
      - name: country
        data_type: string
        A: [store]
        M: []
        op: OBSERVED
        lineage: {name: country, A: [store], op: OBSERVED}
      - name: store_open_date
        data_type: date
        A: [store]
        M: []
        op: OBSERVED
        lineage: {name: store_open_date, A: [store], op: OBSERVED}
  - schema_name: store_monthly_inventory
    source:
      type: table
      connection: warehouse_primary
      reference: retail_db.store_monthly_inventory
    declared_scope:
      geography: non_degenerate
      time: non_degenerate
      product: degenerate
      transaction: degenerate
    column_specs:
      - name: store
        data_type: string
        A: [store, month]
        M: [self]
        op: OBSERVED
        lineage: {name: store, A: [store, month], op: OBSERVED}
      - name: month
        data_type: date
        A: [store, month]
        M: [self]
        op: OBSERVED
        lineage: {name: month, A: [store, month], op: OBSERVED}
      - name: eom_inventory
        data_type: numeric
        A: [store, month]
        M: [store, month]
        op: SUM
        lineage: {name: eom_inventory, A: [store, month], op: SUM}
      - name: peak_inventory
        data_type: numeric
        A: [store, month]
        M: [store, month]
        op: MAX
        lineage: {name: peak_inventory, A: [store, month], op: MAX}
This example exercises every major feature of the schema.init format: AC-level metadata and attestation config, dimension family declarations with multiple hierarchies, metric family declarations with ip_reducers and block sets, naming function declaration, schema declarations with declared scope, and ColumnSpec declarations spanning grain-role columns, AC-attributes, AC-metrics, observation-rooted families, reducer-rooted families, and singletons.
Subsequent chapters reference this declaration as the canonical retail AC throughout.
Chapter 6: The Data-API Protocol¶
The interface between Coframe and a backend. The operations the framework requires the backend to support, the request/response shapes, and the protocol for invoking them. Chapter 7's verification workflow is written against this interface; Chapter 9's query resolution executes through it.
6.1 What this chapter does¶
Coframe does not store data. The data lives in a backend — a relational database, a lakehouse query engine, a data-warehouse service, a federation of sources — and Coframe interacts with it through a specified protocol. This chapter specifies that protocol.
The protocol is divided into operations across four areas:
- Connection and introspection (§6.3): how the framework discovers and binds to backend tables.
- Structural verification (§6.4): operations that check declared structural commitments against data.
- Projection and filtering (§6.5): operations that read data with structural awareness.
- Aggregation (§6.6): operations that compute reducer outputs at requested grains.
Each operation is specified with its request parameters, its return shape, and its semantics. The chapter is reference-style; readers may skim and return when implementing or invoking.
The protocol is designed to be implementable over any SQL-like backend. The framework supplies reference implementations for common backends (enumerated in §6.8); custom backends integrate by implementing the protocol.
6.2 Protocol model¶
6.2.1 Operations and responses¶
Each operation is a named function with typed parameters and a typed return. Operations are stateless: a request carries everything the backend needs to fulfill it. The backend's only persistent state is the data itself.
Operations are organized by the four areas listed in §6.1 (connection and introspection, structural verification, projection and filtering, aggregation). The framework invokes operations in the order required by its current task; the backend honors them in any order. There are no implicit dependencies between operations.
6.2.2 Error model¶
Every operation may return:
- Success: the operation completed; the return shape carries the requested data.
- Backend error: the backend could not fulfill the operation due to a backend-side condition (table missing, type mismatch, permissions, transient connectivity). The error response carries an error code and message.
- Protocol error: the request was malformed or referenced unknown identifiers. The error response carries the protocol-side reason.
The framework treats backend errors as recoverable (the AC's verification may proceed with reduced confidence; the DQ report flags missing verifications) and protocol errors as bugs (in either the framework or a custom backend implementation).
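The three outcomes and the framework's differing treatment of them can be modelled as a small tagged union. This is a sketch under stated assumptions: the class names, the field names, and the `handle` function are hypothetical and do not reflect the protocol's actual wire shapes.

```python
from dataclasses import dataclass
from typing import Any, Union


@dataclass
class Success:
    data: Any


@dataclass
class BackendError:
    code: str
    message: str


@dataclass
class ProtocolError:
    reason: str


Response = Union[Success, BackendError, ProtocolError]


def handle(operation: str, response: Response, dq_flags: list):
    """Backend errors are recoverable and flagged; protocol errors are bugs."""
    if isinstance(response, Success):
        return response.data
    if isinstance(response, BackendError):
        # Verification proceeds with reduced confidence; the DQ report records the gap.
        dq_flags.append(f"{operation}: not verified ({response.code})")
        return None
    raise RuntimeError(f"protocol error in {operation}: {response.reason}")


flags = []
result = handle("verify_grain_uniqueness", Success({"unique": True}), flags)
skipped = handle("verify_fd_edge", BackendError("TABLE_MISSING", "no such table"), flags)
```

After these two calls `flags` holds a single entry for the skipped FD-edge check, while a ProtocolError passed to `handle` would raise instead of being recorded.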
6.2.3 Identifier conventions¶
Operations reference backend tables, columns, schemas, and connections by string identifiers. The framework consumes the AC's schema.init declarations for these identifiers; the backend interprets them as appropriate for its native naming.
Backend-quoted identifiers (Snowflake's case-sensitive quoting, Postgres's lower-casing) are handled by the backend's protocol implementation; the framework passes identifiers as the AC declares them.
6.2.4 Type model¶
The protocol carries values in a small set of canonical types, mapped from the backend's native types per the backend's protocol implementation:
| Canonical type | Description |
|---|---|
| numeric | floating-point or arbitrary-precision number |
| integer | integer-valued number |
| string | character data |
| boolean | TRUE / FALSE / NULL |
| date | calendar date |
| timestamp | date with time-of-day |
| hll_sketch | HyperLogLog sketch (opaque to the framework; opaquely manipulated via HLL_MERGE) |
| theta_sketch | Theta sketch |
| t_digest | t-digest sketch |
| kll_sketch | KLL sketch |
| bitmap | compressed bitmap |
| null | the null value of any type |
The backend's protocol implementation declares its mapping from native types to canonical types via the introspection protocol (§6.3.3). The framework uses canonical types for all reasoning.
6.2.5 Engine-extension operations¶
When the Metric Engine (Chapter 11) is enabled for an AC, the engine requires two additional operations on the backend beyond the core protocol of §§6.3–6.6. These are de-facto required extensions: they are not part of the typed Protocol surface (omitting them does not break a backend used without the engine), but every backend shipped in Coframe Core implements them, and any backend used with metric_engine.enabled: true must implement them.
| Operation | Purpose |
|---|---|
| extract_to_lazyframe(table_name) → LazyFrame | Stream a backend table out as a Polars LazyFrame so the engine can ingest it into a Parquet entry. The engine's serving substrate is Polars; this operation is the bridge. Backends with native Polars/Arrow producers (DuckDB's .pl(), Polars-native scan_*) implement this trivially. SQLite implements it via pl.read_database_uri. |
| operator_registry (attribute, returns OperatorRegistry) | The backend's L2 operator registry (see Chapter 10 §10.13 and the v2.1 platform-design supplement §W4 for L1/L2/L3 layering). The engine consults the registry when materialising metrics (partition-invariance, native MEDIAN/BOOL_AND/BOOL_OR/APPROX_COUNT_DISTINCT support, missing-value treatment) without re-implementing operator semantics. |
Both extensions are stable across Coframe Core's three reference backends (SQLite, Polars, DuckDB) and are exercised by the cross-backend invariant test suite (packages/coframe-duckdb/tests/test_cross_backend_invariants.py). Custom backend authors who want engine compatibility should implement both alongside the core protocol. A backend without them cannot host engine-managed metrics, but it still returns correct results: the resolver falls back to the §13.1 legacy path (no engine, no memoisation, direct backend execution).
A future Manual revision may promote these to formal Protocol methods once the engine ships as the default execution path. Until then, treat them as required-in-practice when shipping an engine-aware backend.
6.3 Connection and introspection¶
6.3.1 connect¶
Establish a session with the backend.
Request:
- connection_id: the identifier declared in the AC's schema.init.
- credentials: opaque to the framework; passed to the backend per the connection's declared auth scheme.
Response: session handle (used by subsequent operations) or backend error.
6.3.2 list_tables¶
List the tables/views available through the connection.
Request: list_tables(schema_filter) — optional namespace filter.
Response:
6.3.3 describe_table¶
Retrieve the column list and types for a backend table.
Request: describe_table(table_name).
Response:
{
table_name: <string>,
columns: [
{
column_name: <string>,
canonical_type: <canonical-type>,
native_type: <string>, # backend's native type name
nullable: <boolean>
},
...
],
row_count: <integer | null> # null if backend doesn't support cheap row count
}
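As an illustration of how a backend might populate this response, here is a sketch against SQLite using only the standard library. The `CANONICAL_TYPES` mapping and the function signature are assumptions made for the example; the manual's canonical type vocabulary is defined elsewhere.

```python
import sqlite3

# Assumed mapping from SQLite declared types to canonical type names.
CANONICAL_TYPES = {"INTEGER": "integer", "TEXT": "string", "REAL": "float"}


def describe_table(conn: sqlite3.Connection, table_name: str) -> dict:
    """Build a describe_table-shaped response for one SQLite table."""
    quoted = '"' + table_name.replace('"', '""') + '"'
    # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
    info = conn.execute(f"PRAGMA table_info({quoted})").fetchall()
    columns = [
        {
            "column_name": name,
            "canonical_type": CANONICAL_TYPES.get(decl.upper(), "unknown"),
            "native_type": decl,
            "nullable": not notnull,
        }
        for _cid, name, decl, notnull, _default, _pk in info
    ]
    # COUNT(*) is cheap enough here; a backend without a cheap count
    # would return None for row_count instead.
    row_count = conn.execute(f"SELECT COUNT(*) FROM {quoted}").fetchone()[0]
    return {"table_name": table_name, "columns": columns, "row_count": row_count}


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stores (store_id INTEGER NOT NULL, region TEXT)")
conn.executemany("INSERT INTO stores VALUES (?, ?)", [(1, "east"), (2, "west")])
description = describe_table(conn, "stores")
```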
6.3.4 evaluate_query¶
For schemas whose source is type: query, evaluate the query and treat it as a virtual table.
Request: evaluate_query(query_text).
Response: the same shape as describe_table applied to the query's result.
The framework uses evaluate_query to materialize query-defined virtual tables for introspection and later operations.
6.4 Structural verification operations¶
These operations check declared structural commitments against backend data. The DQ process (Chapter 7) invokes them.
6.4.1 verify_grain_uniqueness¶
Verify that a schema's grain-role columns identify rows uniquely.
Request:
Response:
{
unique: <boolean>,
total_rows: <integer>,
distinct_grain_rows: <integer>,
duplicate_examples: [
{ values: { <col>: <value>, ... }, count: <integer> },
...
] # capped at a reasonable sample
}
A schema's grain integrity holds iff unique == true, equivalently total_rows == distinct_grain_rows.
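One way a SQL backend could compute this response is sketched below with SQLite. The signature (a connection plus table and column names) and the `max_examples` cap are assumptions for illustration, not the protocol's declared request shape.

```python
import sqlite3


def verify_grain_uniqueness(conn, table_name, grain_columns, max_examples=5):
    """Count total vs. distinct grain rows and sample duplicated grain keys."""
    table = '"' + table_name.replace('"', '""') + '"'
    cols = ", ".join('"' + c.replace('"', '""') + '"' for c in grain_columns)
    total = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    distinct = conn.execute(
        f"SELECT COUNT(*) FROM (SELECT DISTINCT {cols} FROM {table})"
    ).fetchone()[0]
    # Grain keys that occur more than once, capped at max_examples.
    dup_rows = conn.execute(
        f"SELECT {cols}, COUNT(*) FROM {table} "
        f"GROUP BY {cols} HAVING COUNT(*) > 1 ORDER BY {cols} LIMIT ?",
        (max_examples,),
    ).fetchall()
    examples = [
        {"values": dict(zip(grain_columns, row[:-1])), "count": row[-1]}
        for row in dup_rows
    ]
    return {
        "unique": total == distinct,
        "total_rows": total,
        "distinct_grain_rows": distinct,
        "duplicate_examples": examples,
    }


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (store_id INTEGER, day TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [(1, "2026-01-01", 10.0), (1, "2026-01-01", 12.0), (2, "2026-01-01", 7.0)],
)
grain_report = verify_grain_uniqueness(conn, "sales", ["store_id", "day"])
```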
6.4.2 verify_fd_edge¶
Verify a declared FD-edge d1 → d2: each value of d1 maps to a unique value of d2.
Request:
Response:
{
fd_holds: <boolean>,
distinct_source_values: <integer>,
source_values_with_multiple_targets: <integer>,
violation_examples: [
{
source_value: <value>,
target_values: [<value>, ...]
},
...
]
}
The framework iterates over the AC's candidate FD-DAG, invoking verify_fd_edge per edge during DQ Phase 2.
For function-derived FD-edges (per §3.4.6), the framework does not invoke verify_fd_edge — the edge is grounded by the function's deterministic semantics. The framework may invoke a verification of the function's evaluation correctness (§6.4.7) for additional confidence.
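The FD check itself is small: collect the set of target values seen for each source value and flag any source with more than one. The sketch below does this in plain Python over (d1, d2) pairs; a real backend would more likely push the grouping into SQL, and the function signature here is an assumption for illustration.

```python
from collections import defaultdict


def verify_fd_edge(pairs, max_examples=5):
    """Check that each d1 value maps to exactly one d2 value."""
    targets = defaultdict(set)
    for source_value, target_value in pairs:
        targets[source_value].add(target_value)
    violations = {s: t for s, t in targets.items() if len(t) > 1}
    # Sorting assumes the values are mutually comparable (true in this demo).
    examples = [
        {"source_value": s, "target_values": sorted(violations[s])}
        for s in sorted(violations)[:max_examples]
    ]
    return {
        "fd_holds": not violations,
        "distinct_source_values": len(targets),
        "source_values_with_multiple_targets": len(violations),
        "violation_examples": examples,
    }


fd_report = verify_fd_edge(
    [("s1", "east"), ("s2", "west"), ("s1", "east"), ("s2", "north")]
)
```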
6.4.3 verify_scope¶
Verify a schema's declared scope on a dimension family.
Request:
verify_scope(
table_name,
dimension_family: <family-name>,
scope_declaration: "non_degenerate" | "degenerate",
anchor_columns: [<column-name>, ...] # the schema's grain-role cols anchoring in this family
)
Response:
For scope_declaration: non_degenerate, the operation verifies that the schema has data along the dimension family (non-null anchor values). For degenerate, the operation verifies that the schema does not anchor in the dimension family (or that any such anchor columns are uniformly null / absent).
6.4.4 verify_cross_schema_value_mapping¶
Verify that same-named AC-dimensions or AC-attributes across schemas yield consistent values for the same key.
Request:
verify_cross_schema_value_mapping(
column_name: <string>,
schema_a: <schema-name>,
schema_b: <schema-name>,
anchor_key: [<column-name>, ...]
)
Response:
{
consistent: <boolean>,
total_keys_examined: <integer>,
inconsistent_keys: <integer>,
inconsistency_examples: [
{
key: { <col>: <value>, ... },
schema_a_value: <value>,
schema_b_value: <value>
},
...
]
}
Used to verify Principle 2 (Foundations §2.11.2): cross-schema observations of the same entity should agree on AC-dimension and AC-attribute values.
6.4.5 verify_lineage_edge¶
The framework's centerpiece verification: check that a metric family's lineage edge holds against data. The predecessor's data, aggregated via the family's ip_reducer at the successor's anchor (respecting the ip_reducer's block set), should agree with the successor's observed values.
Request:
verify_lineage_edge(
predecessor: {
schema: <schema-name>,
column: <column-name>,
A: [<ac-dimension>, ...]
},
successor: {
schema: <schema-name>,
column: <column-name>,
A: [<ac-dimension>, ...]
},
ip_reducer: {
operator: <operator-name>,
a_block: [<dimension-family>, ...]
},
comparison: {
epsilon_relative: <number>,
epsilon_absolute: <number>,
strict_row_sets: <boolean>
},
sampling: {
fraction: <number | null>,
strategy: "full" | "stratified" | "uniform"
}
)
Response:
{
status: "passed" | "failed" | "blocked_under_R" | "unattestable",
reason: <string>, # populated when status != "passed"
rolled_up_dimensions: [<ac-dimension>, ...],
rows_compared: <integer>,
rows_matched: <integer>,
delta_summary: {
max_relative_delta: <number>,
max_absolute_delta: <number>,
failing_examples: [
{
successor_key: { <col>: <value>, ... },
successor_observed_value: <value>,
predecessor_rolled_up_value: <value>,
delta: <number>
},
...
]
}
}
Status outcomes:
- passed: the rollup matched within epsilon for all compared rows.
- failed: deltas exceeded epsilon for at least one row. delta_summary carries diagnostics.
- blocked_under_R: the edge crosses a dimension family in the ip_reducer's a_block. Attestation is structurally inapplicable for this ip_reducer; the framework may attempt other ip_reducers of the family.
- unattestable: the edge could not be verified for backend reasons (data unavailable, predecessor missing, type mismatch). Distinct from failed; the framework records that verification was attempted but not completable.
The comparison block parameterizes match tolerance per AC attestation configuration. The sampling block parameterizes whether full or sampled verification is performed.
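To make the status outcomes concrete, here is a simplified sketch of the decision flow for a SUM ip_reducer. It assumes predecessor rows have already been mapped to the successor's anchor key and that rolled-up dimensions are given as family names; both simplifications, and the function signature, are illustrative rather than part of the protocol.

```python
from collections import defaultdict


def verify_lineage_edge(pred_rows, succ_values, rolled_up_families, a_block,
                        epsilon_relative=1e-9, epsilon_absolute=1e-9):
    """Roll predecessor values up with SUM and compare to the successor.

    pred_rows: list of (successor_key, value) pairs.
    succ_values: dict of successor_key -> observed value.
    """
    if set(rolled_up_families) & set(a_block):
        return {"status": "blocked_under_R",
                "reason": "edge coarsens a dimension family in a_block"}
    if not pred_rows or not succ_values:
        return {"status": "unattestable",
                "reason": "predecessor or successor data missing"}
    rolled_up = defaultdict(float)
    for successor_key, value in pred_rows:
        rolled_up[successor_key] += value
    compared = matched = 0
    for key, observed in succ_values.items():
        if key not in rolled_up:
            continue  # row-set mismatch, not a value-level delta
        compared += 1
        tolerance = max(epsilon_absolute, epsilon_relative * abs(observed))
        if abs(observed - rolled_up[key]) <= tolerance:
            matched += 1
    passed = matched == compared
    return {"status": "passed" if passed else "failed",
            "reason": "" if passed else "deltas exceeded epsilon",
            "rows_compared": compared, "rows_matched": matched}


daily = [("s1", 10.0), ("s1", 5.0), ("s2", 7.0)]   # (store, day) rolled to store
by_store = {"s1": 15.0, "s2": 8.0}
edge_result = verify_lineage_edge(daily, by_store, ["time"], a_block=[])
edge_blocked = verify_lineage_edge(daily, by_store, ["time"], a_block=["time"])
```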
6.4.6 verify_lineage_edge_under_alternative_reducers¶
Convenience operation for families with multiple ip_reducers: attempt verify_lineage_edge under each ip_reducer in turn, returning the result of the first one whose a_block permits the edge.
Request: as verify_lineage_edge, but ip_reducer is a list rather than a single entry.
Response: the result of the first verification that was not blocked (that is, under the first ip_reducer whose a_block permits the edge), plus the list of attempted-and-blocked ip_reducers.
This is implemented via repeated verify_lineage_edge calls; some backends may optimize.
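The repeated-call strategy can be sketched as a loop over the candidate ip_reducers. In this sketch `verify_one` stands in for a single verify_lineage_edge call, and the response keys `result` and `attempted_and_blocked` are hypothetical names, since the manual does not spell out the response shape.

```python
def verify_under_alternative_reducers(ip_reducers, verify_one):
    """Try each ip_reducer in order; stop at the first that is not blocked."""
    blocked = []
    for reducer in ip_reducers:
        result = verify_one(reducer)
        if result["status"] == "blocked_under_R":
            blocked.append(reducer)
            continue
        return {"result": result, "attempted_and_blocked": blocked}
    # Every ip_reducer was blocked for this edge.
    return {"result": None, "attempted_and_blocked": blocked}


def fake_verify(reducer):
    """Stand-in verifier: blocked whenever the edge's family is in a_block."""
    if "time" in reducer["a_block"]:
        return {"status": "blocked_under_R"}
    return {"status": "passed"}


reducers = [
    {"operator": "SUM", "a_block": ["time"]},
    {"operator": "MAX", "a_block": []},
]
alt_outcome = verify_under_alternative_reducers(reducers, fake_verify)
```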
6.4.7 verify_function_evaluation¶
Verify that a function-derived FD-edge's underlying function evaluates correctly on backend data.
Request:
verify_function_evaluation(
table_name,
source_column,
derived_column,
function: <operator-catalog-function-name>,
sample_size: <integer>
)
Response:
{
evaluation_consistent: <boolean>,
sample_size: <integer>,
inconsistencies: <integer>,
example_failures: [...]
}
Used for additional confidence in function-derived FD-edges. Optional in Coframe Core's default DQ; mandatory only when the AC author requests it via the attestation configuration.
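The check amounts to recomputing the derived column on a sample and counting disagreements. A minimal sketch follows, with `month_of` as a stand-in for the catalog's MONTH_OF and with rows supplied as in-memory (source, derived) pairs; both are assumptions for illustration.

```python
import random
from datetime import date


def verify_function_evaluation(rows, function, sample_size, seed=0):
    """Recompute a derived column on a sample and count disagreements.

    rows: list of (source_value, derived_value) pairs read from the backend.
    """
    rng = random.Random(seed)
    sample = rows if len(rows) <= sample_size else rng.sample(rows, sample_size)
    failures = [
        {"source": src, "stored": stored, "recomputed": function(src)}
        for src, stored in sample
        if function(src) != stored
    ]
    return {
        "evaluation_consistent": not failures,
        "sample_size": len(sample),
        "inconsistencies": len(failures),
        "example_failures": failures[:5],
    }


def month_of(d):
    """Stand-in for the catalog's MONTH_OF function."""
    return d.month


stored_rows = [(date(2026, 1, 15), 1), (date(2026, 2, 3), 2), (date(2026, 3, 9), 4)]
fn_report = verify_function_evaluation(stored_rows, month_of, sample_size=10)
```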
6.5 Projection and filtering operations¶
These operations read backend data with structural awareness, used by Chapter 9's query resolution.
6.5.1 project¶
Read a projection of a schema's data.
Request:
project(
schema: <schema-name>,
columns: [<column-name>, ...],
filters: <filter-expression>, # optional
limit: <integer | null>,
order_by: [<column-name>, ...]
)
Response: a rowset with the requested columns. Streaming or paged delivery is backend-dependent; the framework treats the response as an iterable.
filters is a structured expression (not raw SQL) over AC-dimensions and AC-attributes; the protocol implementation translates to backend-appropriate syntax. The framework constrains filters to conditions over the queried schema's available columns; cross-schema filtering is handled at the framework level, not in the protocol.
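The manual does not define the filter expression's concrete structure here, so the following sketch assumes a hypothetical one: nested dicts with `and`/`or` lists and `{column, op, value}` leaves. It shows how an implementation might translate such an expression into a parameterized SQL fragment while rejecting columns the schema does not expose.

```python
ALLOWED_OPS = {"=", "!=", "<", "<=", ">", ">="}


def to_sql(expr, available_columns):
    """Translate a structured filter into (sql_fragment, parameters)."""
    if "and" in expr or "or" in expr:
        joiner = "and" if "and" in expr else "or"
        parts, params = [], []
        for child in expr[joiner]:
            sql, child_params = to_sql(child, available_columns)
            parts.append(f"({sql})")
            params.extend(child_params)
        return f" {joiner.upper()} ".join(parts), params
    column, op = expr["column"], expr["op"]
    if column not in available_columns:
        raise ValueError(f"column {column!r} is not available on this schema")
    if op not in ALLOWED_OPS:
        raise ValueError(f"unsupported operator {op!r}")
    # Values travel as bound parameters, never as inlined SQL text.
    return f'"{column}" {op} ?', [expr["value"]]


flt = {"and": [
    {"column": "region", "op": "=", "value": "east"},
    {"column": "amount", "op": ">", "value": 100},
]}
where_sql, where_params = to_sql(flt, {"region", "amount"})
```

Keeping the filter structured rather than raw SQL is what lets each backend choose its own quoting and placeholder syntax.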
6.5.2 count¶
Return a row count for a projection.
Request: count(schema, filters).
Response: { row_count: <integer> }.
6.5.3 get_distinct_values¶
Return distinct values of a column, optionally filtered.
Request: get_distinct_values(schema, column, filters, limit).
Response: { values: [<value>, ...], total_distinct: <integer> }.
Used during introspection (§6.3) and during query resolution for filter expansion.
6.5.4 get_pair_mapping¶
Return the mapping between two columns: for each distinct value of column A, the corresponding value(s) of column B.
Request: get_pair_mapping(schema, source_column, target_column, filters).
Response:
Used during FD-edge verification (§6.4.2) and during cross-schema value-mapping verification (§6.4.4).
6.6 Aggregation operations¶
These operations compute reducer outputs at requested grains, used by query resolution and by lineage-edge verification.
6.6.1 aggregate¶
Compute a reducer over a column at a requested grain.
Request:
aggregate(
schema: <schema-name>,
metric_column: <column-name>,
group_by: [<ac-dimension>, ...],
operator: <operator-catalog-entry>,
filters: <filter-expression>,
missing_value_treatment: <treatment-config>
)
Response:
{
groups: [
{ key: { <col>: <value>, ... }, value: <value> },
...
],
rows_examined: <integer>,
null_rows_handled: { excluded: <integer>, included_as_zero: <integer>, ... }
}
The operator references a catalog entry. The protocol implementation translates to backend-appropriate aggregate-function calls (SUM, MAX, HLL_MERGE, etc.).
The missing_value_treatment block parameterizes the (operator, M) pair's behavior per the operator catalog (Chapter 10).
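The effect of the missing-value treatment is easiest to see on a small SUM. The sketch below is an in-memory illustration only; the treatment names `"exclude"` and `"zero"` are assumptions chosen to line up with the `excluded` and `included_as_zero` counters in the response shape.

```python
def aggregate_sum(rows, group_by, metric_column, treatment="exclude"):
    """SUM a metric per group, handling None per the chosen treatment."""
    if treatment not in ("exclude", "zero"):
        raise ValueError(f"unknown treatment {treatment!r}")
    sums = {}
    handled = {"excluded": 0, "included_as_zero": 0}
    for row in rows:
        key = tuple(row[c] for c in group_by)
        value = row[metric_column]
        if value is None:
            if treatment == "exclude":
                handled["excluded"] += 1
                continue  # the row contributes nothing, not even its group
            handled["included_as_zero"] += 1
            value = 0
        sums[key] = sums.get(key, 0) + value
    groups = [
        {"key": dict(zip(group_by, k)), "value": v} for k, v in sorted(sums.items())
    ]
    return {"groups": groups, "rows_examined": len(rows), "null_rows_handled": handled}


sales = [
    {"region": "east", "amount": 10},
    {"region": "east", "amount": None},
    {"region": "west", "amount": None},
]
excluded = aggregate_sum(sales, ["region"], "amount", treatment="exclude")
zeroed = aggregate_sum(sales, ["region"], "amount", treatment="zero")
```

Note that the two treatments differ in more than the counters: under exclusion the all-null `west` group disappears, while under zero-inclusion it appears with value 0.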
6.6.2 aggregate_with_join¶
Aggregation across schemas that share an anchor. The framework uses this for cross-schema query resolution.
Request:
aggregate_with_join(
primary_schema: <schema-name>,
joining_schemas: [
{ schema: <schema-name>, join_keys: [<ac-dimension>, ...] },
...
],
metric_columns: [
{ schema: <schema-name>, column: <column-name>, operator: <op> },
...
],
group_by: [<ac-dimension>, ...],
filters: <filter-expression>
)
Response: same shape as aggregate, but with multiple metrics joined per group.
The framework's resolver (Chapter 9) constructs aggregate_with_join requests when a query references metrics in multiple schemas — performing the join at the backend rather than at the framework level.
6.6.3 compute_function¶
Apply a function operator to a column or set of columns.
Request:
compute_function(
schema: <schema-name>,
operator: <function-operator>,
inputs: [<column-name>, ...],
filters: <filter-expression>,
output_column_name: <string>
)
Response: rowset including the computed function output.
Used for function-derived columns and for Frame-QL inline expressions.
6.7 Operator-catalog mapping¶
Each operator in Coframe's catalog (Chapter 10) has a protocol mapping — the SQL or backend-native syntax the protocol implementation generates when the operator is invoked. The catalog declares the canonical mapping; per-backend implementations may override for backend-specific syntax.
Examples of canonical mappings:
| Operator | Canonical SQL |
|---|---|
| SUM | SUM(<col>) |
| MAX | MAX(<col>) |
| MIN | MIN(<col>) |
| COUNT | COUNT(<col>) |
| COUNT_DISTINCT | COUNT(DISTINCT <col>) |
| AVG | AVG(<col>), or lifted as (SUM(<col>), COUNT(<col>)) for partition-invariant carrying |
| HLL_MERGE | backend-specific: APPROX_COUNT_DISTINCT_COMBINE, HLL_UNION, etc. |
| T_DIGEST_MERGE | backend-specific |
| MONTH_OF | EXTRACT(MONTH FROM <col>) or MONTH(<col>) per dialect |
| MAP_DIV | <col_a> / NULLIF(<col_b>, 0) |
Per-backend implementations provide overrides for backends with non-standard syntax.
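A canonical-mapping-with-overrides lookup might be organized as below. The dictionaries and `render_operator` are hypothetical; the SQLite override is included because SQLite has no EXTRACT, so a strftime-based expression is used in its place.

```python
# Canonical templates, taken from the mapping table above.
CANONICAL_SQL = {
    "SUM": "SUM({col})",
    "MAX": "MAX({col})",
    "MIN": "MIN({col})",
    "COUNT": "COUNT({col})",
    "COUNT_DISTINCT": "COUNT(DISTINCT {col})",
    "AVG": "AVG({col})",
    "MONTH_OF": "EXTRACT(MONTH FROM {col})",
}

# Hypothetical per-backend overrides for non-standard syntax.
BACKEND_OVERRIDES = {
    "sqlite": {"MONTH_OF": "CAST(strftime('%m', {col}) AS INTEGER)"},
}


def render_operator(operator, col, backend=None):
    """Render an operator call, preferring a backend override if present."""
    template = BACKEND_OVERRIDES.get(backend, {}).get(operator)
    if template is None:
        template = CANONICAL_SQL.get(operator)
    if template is None:
        raise KeyError(f"no mapping for operator {operator!r}")
    return template.format(col=col)


canonical_month = render_operator("MONTH_OF", "day", backend="duckdb")
sqlite_month = render_operator("MONTH_OF", "day", backend="sqlite")
```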
6.8 Reference implementations¶
Coframe Core ships protocol implementations for common backends. Each implementation maps the protocol's operations to backend-native operations (SQL, API calls).
Supported in Core:
- PostgreSQL (and Postgres-compatible: Aurora, Redshift, ParadeDB)
- Snowflake
- BigQuery
- Databricks SQL
- DuckDB
- SQLite
Sketch-type operations require backend support for the relevant sketch implementations. Coframe Core's reference implementations enable sketch operations where the backend supports them natively; backends without native sketch support cannot host AC-metrics declared with sketch data_type.
Custom backends are implemented by providing a protocol adapter that maps each operation in §§6.3–6.6 to backend operations. The adapter interface is specified in the Coframe API documentation.
6.9 Protocol versioning¶
The data-API protocol is versioned. The current version is coframe-data-api/1.0. Protocol implementations declare the version they support; the framework checks compatibility on connect.
Pro extends the protocol with additional operations (time-varying schema, fact-table federation, write-back). These are specified separately and use a Pro protocol version.
6.10 Summary¶
The protocol is the boundary between Coframe's structural reasoning and the backend's data. The framework consumes the protocol; backends implement it. The protocol's operations are organized into:
- Connection and introspection (§6.3): bind to the backend, discover tables.
- Structural verification (§6.4): test declared commitments against data — grain integrity, FD-edges, scope, cross-schema mapping, lineage-edge attestation. Each operation returns a structured result the DQ process consumes.
- Projection and filtering (§6.5): read data with structural awareness.
- Aggregation (§6.6): compute reducer outputs at requested grains, including cross-schema joins.
Chapter 7 uses these operations to verify an AC; Chapter 9 uses them to resolve Frame-QL queries.
Chapter 7: Data Quality and Structural Verification¶
The framework's process for verifying that an AC's declared structural commitments hold against backend data. The three-phase DQ workflow, the integrity conditions verified at each phase, the per-lineage-edge attestation regime, and the DQ deliverable that the framework consumes for query resolution.
7.1 What this chapter does¶
An AC declares structural commitments: dimension families with their hierarchies, metric families with their ip_reducers and block sets, schemas with their declared scopes, ColumnSpecs with their anchors and lineage. Foundations (Chapter 2) specified the commitments; Chapter 3 specified how they are declared. This chapter specifies the verification process — the workflow that checks each commitment against the backend's actual data.
The verification process serves two purposes:
- Catch declaration errors. ACs are written by humans (or generated by AI-assisted authoring). They contain mistakes: an FD-edge that doesn't hold in the data, a metric family whose sibling values diverge, a scope declaration that doesn't match reality. DQ catches these and reports them.
- Ground the framework's correctness guarantee. The framework's constructive-correctness claim (Chapter 1 §1.6) is conditional on the AC's commitments being valid. When DQ verifies a commitment, the framework can rely on it for query resolution. When DQ cannot verify a commitment (data unavailable, opt-out, asserted-not-verified), the framework reports this in the AC's verification status.
The chapter is organized as a workflow specification. §7.2 introduces the three phases. §§7.3–7.6 specify each phase's operations, integrity conditions verified, and outputs. §7.7 specifies the DQ deliverable. §7.8 specifies the AC verification levels (A, AA, AAA). §7.9 specifies the iteration cycle when DQ surfaces issues.
A reader implementing the DQ process or interpreting its outputs can read this chapter as a procedural specification. A reader who only needs to understand the framework's commitments can skim §7.2 and §7.8 for the high-level structure.
7.2 The three phases¶
Coframe's DQ process operates in three sequential phases, each addressing a distinct class of commitments. Each phase consumes outputs of the prior phases; later phases require earlier ones to have completed.
7.2.1 Phase 1: Declaration integrity¶
What it checks: the AC's declarations are internally consistent and well-formed per Foundations §2.10.1–§2.10.3 and Chapter 3 §3.10.1.
What it consumes: the schema.init file alone. No backend data.
What it produces: a list of declaration violations (typically empty for a well-formed AC) plus a derived properties manifest (the framework's computation of trichotomy classifications, family memberships, FD-DAG, and other derived properties per Chapter 3 §3.9).
Operations invoked: none from the data-API; this phase is entirely static.
7.2.2 Phase 2: Structural data verification¶
What it checks: structural conditions on backend data — grain uniqueness, FD-edges, declared scope, cross-schema value mapping. These conditions are independent of metric values; they concern the structural shape of the data.
What it consumes: the Phase 1 outputs and access to the backend via the data-API.
What it produces: a report of per-condition verification outcomes (pass / fail / unverifiable / opted-out) with diagnostic details for failures.
Operations invoked: verify_grain_uniqueness, verify_fd_edge, verify_scope, verify_cross_schema_value_mapping (per §§6.4.1–6.4.4).
7.2.3 Phase 3: Metric-value attestation¶
What it checks: cross-schema metric coherence — the lemma of Foundations §2.10.5. For each attestable lineage edge in the AC's metric genealogy, the predecessor's data aggregated via the family's ip_reducer at the successor's anchor should agree with the successor's observed values.
What it consumes: Phase 1 and Phase 2 outputs (some Phase 3 attestations require Phase 2-verified FD-edges to be valid), plus backend access.
What it produces: a per-lineage-edge attestation report classifying each edge as passed / failed / blocked_under_R / unattestable / opted-out.
Operations invoked: verify_lineage_edge (per §6.4.5) for each attestable edge in the AC.
7.2.4 Why three phases¶
The phases correspond to three distinct epistemic categories:
- Phase 1 verifies what is checkable from declarations alone — the AC's internal consistency.
- Phase 2 verifies structural conditions on data — the shape of the data matches the AC's structural claims.
- Phase 3 verifies value-level coherence — pre-aggregated and detail-level metric values agree.
Each phase has different cost characteristics. Phase 1 is fast (no data access). Phase 2 is moderate (per-FD-edge queries; per-schema scope verification). Phase 3 is the most expensive (per-lineage-edge aggregation and comparison; potentially the largest cost in DQ).
The phases also depend on one another: failures in Phase 1 may make Phase 2 meaningless (a malformed declaration cannot be tested against data); failures in Phase 2 may make Phase 3 meaningless (a violated FD-edge breaks the assumptions of cross-grain coherence).
7.2.5 The DQ pipeline¶
The framework runs DQ as a pipeline:
Each phase may produce blocking failures (Phase 1 violations of structural rules; Phase 2 violations of grain integrity) that halt the pipeline, or non-blocking failures (Phase 2 FD-edge violations; Phase 3 attestation deltas) that the pipeline reports but continues past.
The pipeline's behavior on non-blocking failures is configurable per the AC's attestation block (§7.5.8) and per the AC verification level the AC author has chosen (§7.8). At the strictest level, all failures are blocking; at more permissive levels, the pipeline accumulates failures and reports them in the deliverable.
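The halting behavior can be sketched as a loop over phases in which each finding carries a blocking flag. This is an illustrative model only: the finding shape, the `strict` switch (standing in for the strictest verification level), and the report keys are all assumptions.

```python
def run_dq_pipeline(phases, strict=False):
    """Run phases in order; halt at the first blocking failure.

    phases: list of (name, callable); each callable returns a list of
    findings shaped like {"issue": str, "blocking": bool}.
    """
    report = {"completed": [], "findings": [], "halted_at": None}
    for name, run_phase in phases:
        findings = run_phase()
        report["findings"].extend({"phase": name, **f} for f in findings)
        # Under strict mode every failure is treated as blocking.
        if any(f["blocking"] or strict for f in findings):
            report["halted_at"] = name
            break
        report["completed"].append(name)
    return report


phases = [
    ("phase1", lambda: []),
    ("phase2", lambda: [{"issue": "FD-edge store -> region violated", "blocking": False}]),
    ("phase3", lambda: []),
]
permissive_run = run_dq_pipeline(phases)
strict_run = run_dq_pipeline(phases, strict=True)
```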
7.3 Phase 1: Declaration integrity¶
Phase 1 runs once at AC validation time, before any backend access. It is the framework's earliest check on the AC's well-formedness.
7.3.1 What Phase 1 verifies¶
Phase 1 verifies the structural rules from Foundations §2.10.1–§2.10.3:
Per-column rules:
- Every ColumnSpec has all required fields declared or derivable (per Chapter 3 §3.8.5).
- Anchor cardinality matches the trichotomy and schema role: an AC-dimension or AC-attribute in non-grain role has |A| = 1; a grain-role column has A = [self] or the full composite grain; AC-metrics have |A| ≥ 1 (per Chapter 3 §3.8.2).
- M is well-formed: it is a subset of A ∪ {self} (every member is an anchor coordinate or the self-marker). The derived mechanism category (MCAR / MAR / MNAR) follows from M and is not separately declared (Foundations §2.4.3).
- For non-root columns, the operator-type-appropriate anchor relation: A_pred ⊇ A_self for reducer ops; A_pred = A_self for function ops.
- For non-root columns, the lineage's name, A, op triple matches an existing ColumnSpec in the AC.
- Naming consistency holds per the declared naming function (when not declined).
Per-schema rules:
- Schema names are unique within the AC.
- Same-named columns within a schema have compatible data_type.
- No-all-dimensions rule: no schema consists entirely of AC-dimensions.
AC-level rules:
- Each AC-dimension belongs to exactly one dimension family.
- Each dimension family has a unique base level reachable to all members through declared paths.
- The candidate FD-DAG within each dimension family is acyclic.
- Each metric family's family_root reference resolves; its declared root has self-referential lineage.
- Every operator in any ip_reducers list is partition-invariant per the operator catalog.
- Every dimension family referenced in any a_block is a declared dimension family.
- Each a_block is downward-closed along the FD-DAG.
- Family-root uniqueness within (name, A) holds (per Foundations §2.10.3).
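Several of these rules are simple set or graph checks. The sketch below illustrates three of them (M well-formedness, the anchor relation, and FD-DAG acyclicity) under assumed representations: anchors as lists of names compared as sets, the self-marker as the string `"self"`, and FD-edges as (source, target) pairs.

```python
def m_is_well_formed(M, A):
    """M must be a subset of A ∪ {self}."""
    return set(M) <= set(A) | {"self"}


def anchor_relation_holds(op_type, A_pred, A_self):
    """Reducer ops need A_pred ⊇ A_self; function ops need A_pred = A_self."""
    if op_type == "reducer":
        return set(A_pred) >= set(A_self)
    if op_type == "function":
        return set(A_pred) == set(A_self)
    raise ValueError(f"unknown operator type {op_type!r}")


def fd_dag_is_acyclic(edges):
    """Kahn's algorithm: the graph is acyclic iff every node can be removed."""
    nodes = {n for edge in edges for n in edge}
    indegree = {n: 0 for n in nodes}
    for _source, target in edges:
        indegree[target] += 1
    ready = [n for n in nodes if indegree[n] == 0]
    visited = 0
    while ready:
        node = ready.pop()
        visited += 1
        for source, target in edges:
            if source == node:
                indegree[target] -= 1
                if indegree[target] == 0:
                    ready.append(target)
    return visited == len(nodes)


hierarchy_ok = fd_dag_is_acyclic([("store", "city"), ("city", "region")])
hierarchy_cyclic = fd_dag_is_acyclic([("city", "region"), ("region", "city")])
```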
7.3.2 Phase 1 outputs¶
Phase 1 produces:
- A declaration validity verdict: pass / fail.
- A declaration error list: details of any structural violations.
- A derived properties manifest:
- Trichotomy classification per column.
- Family membership map: family-name → list of member ColumnSpec references.
- Family-root manifest: family-name → family-root ColumnSpec reference, with self-referential lineage confirmed.
- Operator-attested vs. operator-asserted classification per metric family.
- Lineage graph: nodes (ColumnSpecs) and edges (lineage predecessors).
- FD-DAG: per dimension family, the union of hierarchy-implied, explicit, and function-derived FD-edges.
- Singleton list: ColumnSpecs with multi-input lineage.
- Schema grain per schema.
The derived properties manifest is consumed by Phase 2, Phase 3, and by Chapter 9's query resolution.
7.3.3 Phase 1 failure modes¶
Phase 1 failures are always blocking. A declaration-error AC cannot be sensibly tested against data; verification proceeds only after declaration errors are corrected.
Common failure modes:
- Missing lineage references: a ColumnSpec's lineage points to (name, A, op) for which no matching ColumnSpec exists in the AC.
- Anchor cardinality violations: an AC-dimension or AC-attribute in non-grain role declared with |A| > 1 (a non-grain-role column must anchor at a single coordinate; only grain-role contributors to a composite grain carry a multi-element A).
- Ill-formed M: a missingness anchor with a member that is neither in A nor the self-marker (i.e., M ⊄ A ∪ {self}).
- Cyclic FD-DAG: declared FD-edges form a cycle within a dimension family.
- Family-root mismatch: declared family root in metric_families doesn't have self-referential lineage.
- Non-partition-invariant ip_reducer: a metric family's ip_reducers entry references an operator the catalog declares as non-partition-invariant.
- Naming function violation: when the naming function is declared (not Option 4 in Chapter 3 §3.6.2), a column's name does not equal the function's output on its lineage.
The Phase 1 error report identifies the exact ColumnSpec or AC-level declaration responsible and the rule violated.
7.4 Phase 2: Structural data verification¶
Phase 2 begins after Phase 1 passes. It opens backend connections per the AC's schemas and verifies structural conditions on data.
7.4.1 What Phase 2 verifies¶
Phase 2 verifies the data-attested conditions from Foundations §2.10.4 that do not involve metric values:
- Grain uniqueness: each schema's grain-role columns uniquely identify rows in the schema's data.
- FD-edge attestation: each declared FD-edge in the candidate FD-DAG (data-attested ones; function-derived ones are skipped) holds in the data: each source value maps to a unique target value.
- Declared scope honoring: schemas declared non-degenerate on a dimension family have data; schemas declared degenerate do not.
- Cross-schema value-mapping consistency: same-named AC-dimensions and AC-attributes across schemas yield consistent values for the same anchor key.
7.4.2 Operations performed¶
Phase 2 invokes the following data-API operations, in this order:
- For each schema: verify_grain_uniqueness(schema, grain_columns).
- For each data-attested FD-edge d1 → d2 declared in any dimension family: verify_fd_edge(table, d1, d2). The framework selects a schema containing both d1 and d2 for the verification.
- For each (schema, dimension family) pair with an explicit declared_scope entry: verify_scope(table, dimension_family, scope, anchor_columns).
- For each AC-dimension or AC-attribute appearing in multiple schemas: pairwise verify_cross_schema_value_mapping invocations.
Phase 2 may run operations in parallel where the backend supports it; the framework's reference DQ implementation parallelizes across non-dependent operations.
7.4.3 Phase 2 outputs¶
Phase 2 produces a structured report per condition:
{
grain_integrity: {
<schema>: { status, total_rows, distinct_grain_rows, ... },
...
},
fd_edge_attestation: {
<dimension-family>: {
<fd-edge>: {
status: "passed" | "failed" | "skipped_function_derived" | "unattestable",
details: ...
},
...
},
...
},
scope_honoring: {
<(schema, dimension-family)>: { status, details },
...
},
cross_schema_value_mapping: {
<(column, schema_a, schema_b)>: { status, inconsistent_keys, ... },
...
}
}
The framework aggregates Phase 2 outputs into the DQ deliverable (§7.7).
7.4.4 Phase 2 failure modes¶
Phase 2 outcomes per condition:
- Passed: the condition holds in data.
- Failed: the condition is violated in the data. The framework's diagnostics identify failing rows or values.
- Unattestable: backend errors prevented verification. The condition's status in the DQ deliverable becomes "unverified" rather than "passed" or "failed."
- Opted-out: the AC's attestation block declared this condition as not-to-be-verified (rare; reserved for cases where the AC author has external verification).
Some failures are blocking under strict verification modes:
- Grain integrity violation: a schema's declared grain does not uniquely identify rows. The framework cannot proceed to Phase 3 for this schema until grain integrity is restored — Phase 3's lineage-edge attestation depends on the predecessor's grain being well-formed.
- Critical FD-edge violation: a declared FD-edge in a dimension family that is referenced by Phase 3's attestation paths. Some FD-edge failures are tolerable; others would propagate as Phase 3 failures and are blocking.
Other failures are reported but non-blocking under permissive modes — they appear in the DQ deliverable as warnings or as advisory issues per the AC verification level.
7.4.5 Common Phase 2 failure patterns¶
A few patterns surface in real Phase 2 runs:
- Soft FD-edge violations: an FD-edge like store → region may have a handful of stores that switched regions historically. The framework reports the violation; the AC author may declare a Slowly-Changing-Dimension override (Pro feature) or correct the data.
- Cross-schema value drift: store.region in the stores schema and transactions.region (denormalized into transactions) may have a small number of mismatches due to ETL race conditions. The framework reports the violation; the AC author resolves the source-of-truth question.
- Scope honoring failures: a schema declared degenerate on a dimension family may have unexpected non-null values; or one declared non-degenerate may be sparsely populated. The framework reports the divergence.
- Grain uniqueness failures: a schema's declared grain may have duplicates due to ETL bugs, data corruption, or grain miscount. This is among the most serious Phase 2 failures because subsequent verification depends on grain integrity.
7.5 Phase 3: Metric-value attestation¶
Phase 3 is the framework's most distinctive verification step. It tests cross-schema metric coherence per the lemma of Foundations §2.10.5 — that pre-aggregated sibling tables in fact agree with detail-level data at common coarsenings under their family's ip_reducer.
7.5.1 What Phase 3 verifies¶
For each attestable lineage edge in the AC's lineage graph, Phase 3 verifies that the successor's observed values agree with the predecessor's data aggregated via the family's ip_reducer at the successor's anchor (respecting the ip_reducer's block set and the operator's missing-value treatment).
A lineage edge is attestable iff:
1. The edge connects a metric family's columns (not an AC-dimension or AC-attribute lineage).
2. The edge's predecessor exists in the AC (it cannot be self-referential).
3. At least one of the family's ip_reducers is partition-invariant (per the operator catalog).
4. For at least one of the family's ip_reducers, the dimensions coarsened by the edge do not intersect the ip_reducer's a_block.
If condition 4 fails for all the family's ip_reducers, the edge is blocked under all ip_reducers and Phase 3 records it as such. If condition 3 fails (anchor-locked family with no ip_reducer), the edge is not attestable; the family is structurally anchor-locked and no rollup verification applies.
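The four conditions and the two non-attestable outcomes can be expressed as a small classifier. This is a sketch under assumed inputs: ip_reducers as dicts with `operator` and `a_block`, coarsened dimensions given as family names, and the catalog's partition-invariant operators passed in as a set. The return labels are illustrative.

```python
def classify_edge(is_metric_edge, has_predecessor, ip_reducers,
                  coarsened_families, partition_invariant_ops):
    """Return "attestable", "blocked_under_all", or "not_attestable"."""
    # First and second conditions: a metric-family edge with a real predecessor.
    if not is_metric_edge or not has_predecessor:
        return "not_attestable"
    # Third condition: at least one partition-invariant ip_reducer.
    if not any(r["operator"] in partition_invariant_ops for r in ip_reducers):
        return "not_attestable"  # anchor-locked family
    # Fourth condition: some ip_reducer whose a_block the edge does not cross.
    coarsened = set(coarsened_families)
    if not any(coarsened.isdisjoint(r["a_block"]) for r in ip_reducers):
        return "blocked_under_all"
    return "attestable"


family_reducers = [{"operator": "SUM", "a_block": ["time"]}]
over_time = classify_edge(True, True, family_reducers, ["time"], {"SUM"})
over_geo = classify_edge(True, True, family_reducers, ["geo"], {"SUM"})
anchor_locked = classify_edge(True, True, [], ["geo"], {"SUM"})
```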
7.5.2 Operations performed¶
For each attestable lineage edge, Phase 3 invokes verify_lineage_edge (per §6.4.5). The framework selects the appropriate ip_reducer for the edge:
- If the family has exactly one ip_reducer whose a_block permits the edge, that ip_reducer is used.
- If multiple ip_reducers permit the edge, the framework invokes verify_lineage_edge_under_alternative_reducers (per §6.4.6) to attempt each in turn, recording results for each that's not blocked.
7.5.3 Attestation arithmetic¶
The comparison between the successor's observed values and the predecessor's rolled-up values uses the AC's declared epsilon_relative and epsilon_absolute (from the attestation block):
For each compared row (successor key plus value pair):
A row passes if |observed - rolled_up| ≤ max(epsilon_absolute, epsilon_relative × |observed|).
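The pass condition translates directly into code. The sketch below uses illustrative epsilon values (not the framework's defaults) to show why both terms matter: the relative term scales the tolerance for large values, while the absolute term keeps the tolerance from collapsing to zero for values near zero.

```python
def row_passes(observed, rolled_up, epsilon_absolute, epsilon_relative):
    """Apply the numeric pass condition to one compared row."""
    tolerance = max(epsilon_absolute, epsilon_relative * abs(observed))
    return abs(observed - rolled_up) <= tolerance


# Large value: the relative term dominates (tolerance is about 1.0).
large_ok = row_passes(1_000_000.0, 1_000_000.5, 0.01, 1e-6)
# Near-zero value: the absolute floor applies (tolerance is 0.01).
small_ok = row_passes(0.0, 0.005, 0.01, 1e-6)
# Mid-sized value with a 0.5 delta: exceeds the 0.01 tolerance.
mid_fails = row_passes(100.0, 100.5, 0.01, 1e-6)
```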
For non-numeric value spaces (sketch types), the comparison semantics are operator-specific:
- For HLL sketches under HLL_MERGE: the sketches are compared structurally (HLL register arrays), with cardinality estimates also compared to within a relative epsilon.
- For theta sketches: similar structural comparison with cardinality estimates.
- For t-digest / KLL sketches: quantile estimates at sampled percentiles are compared within a relative epsilon.
The AC's attestation block may declare per-operator epsilon overrides for sketch-type metrics where the default numeric epsilons are inappropriate.
7.5.4 Row-set comparison¶
The attestation operation compares:
- The successor's actual rowset at anchor A_succ.
- The predecessor's data aggregated via the ip_reducer at anchor A_succ.
In a well-formed AC where both sides cover the same underlying universe (per Principle 2), the row sets match (every successor key has a corresponding predecessor-rolled-up key, and vice versa).
In practice, deviations occur:
- Successor has keys the predecessor doesn't: the successor table includes anchor combinations the predecessor data doesn't generate. May indicate missing predecessor data, ETL drift, or scope mismatches.
- Predecessor produces keys the successor doesn't: the successor table omits anchor combinations the predecessor data does cover. May indicate filtering, scope declarations, or successor incompleteness.
The strict_row_sets configuration (from the attestation block) controls behavior:
- strict_row_sets: true: row-set mismatches fail attestation.
- strict_row_sets: false (default): row-set mismatches are reported as deltas but do not fail attestation; only value-level deltas matter.
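Row-set comparison is a pair of set differences over anchor keys. The sketch below shows how the strict_row_sets flag could change the outcome; the result keys are hypothetical names chosen for illustration.

```python
def compare_row_sets(successor_keys, predecessor_keys, strict_row_sets=False):
    """Report keys present on only one side and whether that fails attestation."""
    succ, pred = set(successor_keys), set(predecessor_keys)
    only_successor = sorted(succ - pred)
    only_predecessor = sorted(pred - succ)
    mismatch = bool(only_successor or only_predecessor)
    return {
        "only_in_successor": only_successor,
        "only_in_predecessor": only_predecessor,
        # Mismatches are always reported; they fail only under strict mode.
        "fails_attestation": strict_row_sets and mismatch,
    }


succ_keys = [("s1",), ("s2",), ("s9",)]
pred_keys = [("s1",), ("s2",), ("s3",)]
lenient = compare_row_sets(succ_keys, pred_keys)
strict = compare_row_sets(succ_keys, pred_keys, strict_row_sets=True)
```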
7.5.5 Sampling and full attestation¶
For large data volumes, full attestation may be prohibitively expensive. Phase 3 supports sampled attestation:
- For tables below sampling_threshold_rows (default 100M): full attestation.
- For tables at or above the threshold: stratified sampling by anchor key. The sampling fraction is configurable; the framework's default is to sample enough to achieve sampling_confidence_target (default 99%) for the deltas.
- The AC's attestation block may declare force_full: true to disable sampling.
Sampled attestation produces statistical bounds on delta magnitudes. The DQ deliverable reports whether sampling was used and what confidence bounds apply.
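A minimal Python sketch of the mode decision and of stratified sampling by anchor key follows. The function names are hypothetical, and the sampler uses a fixed fraction per stratum with at least one row each; that is a simplifying assumption, since the framework's default derives the fraction from sampling_confidence_target by a rule this section does not spell out.

```python
import math
import random
from collections import defaultdict


def attestation_mode(row_count, sampling_threshold_rows=100_000_000, force_full=False):
    """Full attestation below the threshold or when force_full is set; sampled otherwise."""
    if force_full or row_count < sampling_threshold_rows:
        return "full"
    return "sampled"


def stratified_sample(rows, key, fraction, seed=0):
    """Sample the same fraction from every anchor-key stratum (at least one row each)."""
    strata = defaultdict(list)
    for row in rows:
        strata[key(row)].append(row)
    rng = random.Random(seed)  # seeded so repeated runs pick the same rows
    sample = []
    for stratum_key in sorted(strata):
        members = strata[stratum_key]
        k = min(len(members), max(1, math.ceil(fraction * len(members))))
        sample.extend(rng.sample(members, k))
    return sample


rows = ([{"region": "west", "id": i} for i in range(10)]
        + [{"region": "east", "id": i} for i in range(4)])
sample = stratified_sample(rows, key=lambda r: r["region"], fraction=0.25)
```

Stratifying by anchor key keeps small strata represented, which a uniform sample over the whole table would not guarantee.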
7.5.6 Phase 3 outputs¶
Per attestable lineage edge:
{
  edge: (predecessor_spec, successor_spec),
  ip_reducer_used: (operator, a_block),
  status: "passed" | "failed" | "blocked_under_all" | "unattestable",
  rolled_up_dimensions: [...],
  comparison_summary: { rows_compared, rows_matched, max_delta, ... },
  failure_examples: [...]
}
Aggregated across the AC:
- A per-metric-family attestation roll-up.
- A per-schema attestation roll-up.
- An AC-level attestation pass rate.
7.5.7 Operator-asserted families and Phase 3¶
Operator-asserted families (per Chapter 3 §3.5.5) have no in-AC sibling to verify against. Phase 3 cannot attest these families' ip_reducers; the framework reports them as operator-asserted-not-verified in the deliverable.
The AC author has two options for grounding an operator-asserted family:
- Materialize a sibling. Adding even one finer-grained sibling to the AC converts the family from operator-asserted to operator-attested, enabling Phase 3 verification.
- Document the assertion. The AC explicitly declares that the ip_reducer is asserted-not-verified, and the consumer of the AC accepts this. The DQ deliverable surfaces the assertion so downstream queries against this family carry appropriate caveats.
7.5.8 Failure modes and remediation¶
When Phase 3 surfaces failures, the AC author has several remediation paths:
- Investigate ETL/source-data quality: the most common cause of attestation failures is upstream data drift — the pre-aggregated tables were computed under different rules or at different times than the detail tables.
- Reconcile definitional differences: the predecessor and successor may be computed under subtly different conventions (one excludes voided transactions; the other includes them).
- Adjust the ip_reducer's block set: a failed attestation may indicate that the rollup direction in question is not actually safe — the family is more semi-additive than declared.
- Declare scope adjustments: a successor may be intentionally scoped (e.g., only includes stores in certain regions); the schema's declared scope should reflect this.
- Accept the divergence: under failure_mode: tolerated, specific edges may be declared as known-divergent and excluded from blocking attestation.
The DQ deliverable provides diagnostic details to support each remediation path.
7.6 Asserted-not-verified facts¶
The framework's correctness guarantee depends on certain facts being true. Some of these facts are not verifiable per-AC: they are inherited from the operator catalog, from the framework's principles, or from architectural commitments. These are asserted-not-verified facts — explicit, documented, but not separately tested for each AC.
7.6.1 Catalog-asserted facts¶
The operator catalog (Chapter 10) declares for each operator:
- Partition-invariance classification.
- Identity-preservation per (operator, predecessor-family) — mediated by the naming function.
- Type signatures.
- Missing-value treatment per (operator, M).
These declarations are catalog properties — properties of the operator itself, not of any AC. The framework trusts the catalog. ACs do not re-verify partition-invariance of SUM, MAX, HLL_MERGE, or any catalog operator.
The catalog's correctness is established at catalog publication; updates to the catalog are versioned and accompanied by reasoning. ACs may pin to a specific catalog version (Pro feature).
7.6.2 Principle-asserted facts¶
The framework's two principles (Foundations §2.11) are asserted, not verified:
- Principle 1: every column's value is determined by its declared anchor. Phase 1 verifies the declaration is well-formed; Phase 2's cross-schema value-mapping checks indirectly test consistency. But the framework cannot directly verify "this column's value depends on this anchor and only this anchor"; it can only test for inconsistencies among multiple observations.
- Principle 2: all schemas observe the same underlying entities. Cross-schema value-mapping (Phase 2) and metric coherence (Phase 3) provide strong indirect evidence, but the principle itself is the framework's epistemic commitment.
When the framework reports an AC as verified, it implicitly assumes the principles hold. The DQ deliverable surfaces this as part of the AC's verification status.
7.6.3 Schema-asserted facts¶
Some commitments the AC author declares are not separately verified:
- Naming-function declaration (per Chapter 3 §3.6.2): when the AC declines structured naming (Option 4), the framework cannot verify name-vs-operator consistency. The declaration is asserted; consumers of the AC see the assertion.
- Sketch-type fidelity: when an AC declares a sketch-type column (HLL, t-digest), the framework trusts the backend's sketch implementation. Coframe does not separately verify that the backend's HLL_MERGE faithfully implements the HLL algorithm; this is a backend property.
- Function-derived FD-edges: the framework trusts the operator catalog's declaration that MONTH_OF deterministically maps day to month. It does not verify this against data (though optional sampling via verify_function_evaluation is available).
7.6.4 Assertion reporting¶
The DQ deliverable's verification status (§7.7) reports asserted-not-verified facts separately from data-attested ones. A consumer reading the deliverable can distinguish:
- Data-attested with status: pass, fail, opted-out, or unverifiable.
- Asserted-not-verified: a commitment the framework trusts without per-AC verification, with the source of the assertion (catalog, principle, declaration).
This distinction is what allows users to reason about the AC's epistemic posture: which facts have been independently grounded against the data and which rest on the framework's trust model.
7.7 The DQ deliverable¶
The DQ process produces a structured deliverable — the DQ report — that aggregates Phase 1, Phase 2, and Phase 3 outputs into a single artifact. The deliverable is consumed by:
- The AC author, who reviews it to identify and fix issues.
- The framework's runtime, which consults the deliverable during query resolution to determine which commitments hold and which are caveated.
- The AC's Validation Surface, one of the AC Surfaces (per v2.1 supplement §10.2; the umbrella term for the AC's access protocols). The Validation Surface specifically exposes the DQ deliverable, the AC's verification level, and the per-condition warrant status to consumers and AI agents. Other AC Surfaces (Frame-QL, NL Query, MCP, HTTP API, Workbench) consult the Validation Surface's content when their own responses need to reflect verification posture. Concrete implementations (historically the coframe-mcp server, now also the HTTP host in coframe-runtime) are deployment-time choices that realise the Validation Surface contract.
7.7.1 Deliverable structure¶
{
  ac_name: <string>,
  ac_version: <string>,
  dq_run_timestamp: <iso-timestamp>,
  catalog_version: <string>,
  dq_runtime_version: <string>,
  overall_status: "pass" | "fail" | "partial",
  ac_verification_level: "A" | "AA" | "AAA",
  phase_1: {
    status: "passed" | "failed",
    declaration_errors: [...],
    derived_properties: { ... }
  },
  phase_2: {
    status: "passed" | "passed_with_warnings" | "failed",
    grain_integrity: { ... },
    fd_edge_attestation: { ... },
    scope_honoring: { ... },
    cross_schema_value_mapping: { ... },
    summary: { passed_count, failed_count, unattestable_count }
  },
  phase_3: {
    status: "passed" | "passed_with_warnings" | "failed" | "skipped",
    per_family_attestation: { ... },
    per_lineage_edge_attestation: { ... },
    operator_asserted_families: [ ... ],
    summary: { ... }
  },
  asserted_not_verified: {
    catalog_assertions: { ... },
    principle_assertions: { ... },
    declaration_assertions: { ... }
  },
  ac_caveats: [
    { type, description, severity, ... },
    ...
  ],
  resolver_metadata: { ... }
}
The deliverable is persisted alongside the AC and is re-generated each time DQ runs (either on AC publication, on a schedule, or on demand).
7.7.2 Caveats¶
The ac_caveats field surfaces issues a consumer should know about when using the AC:
- Failed conditions that did not block (per the AC's failure-mode settings).
- Unattestable conditions the framework could not verify.
- Opted-out conditions the AC author chose not to verify.
- Operator-asserted families with no in-AC sibling for verification.
- Tolerated edges declared as known-divergent.
Each caveat includes a severity level (info / warning / error) and a structured description supporting both human reading and programmatic consumption.
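As an illustration of programmatic consumption, the following Python sketch filters caveats by severity. The field names type, description, and severity come from the deliverable structure in §7.7.1; the specific type values, the descriptions, and the helper name `caveats_at_or_above` are hypothetical.

```python
# Severity levels as listed above, ranked from least to most serious.
SEVERITY_RANK = {"info": 0, "warning": 1, "error": 2}


def caveats_at_or_above(ac_caveats, min_severity):
    """Keep caveats whose severity is at least min_severity."""
    floor = SEVERITY_RANK[min_severity]
    return [c for c in ac_caveats if SEVERITY_RANK[c["severity"]] >= floor]


# Hypothetical caveat entries; the type strings are illustrative only.
ac_caveats = [
    {"type": "tolerated_edge", "description": "known-divergent edge", "severity": "info"},
    {"type": "operator_asserted_family", "description": "no in-AC sibling", "severity": "warning"},
    {"type": "failed_condition", "description": "non-blocking failure", "severity": "error"},
]
to_review = caveats_at_or_above(ac_caveats, "warning")
```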
7.7.3 Resolver metadata¶
The deliverable includes data the query resolver (Chapter 9) needs at runtime:
- The verified FD-DAG (declared edges that Phase 2 attested as data-valid).
- The verified metric family genealogy (the lineage graph with edge attestation status).
- The set of operator-attested vs. operator-asserted families.
- Per-lineage-edge attestation pass/fail status.
- The set of caveats relevant to query construction.
The resolver consults this metadata when constructing query plans. Verified commitments support full resolver-side reasoning; caveated commitments trigger appropriate dubious-query flagging.
7.7.4 Caching and invalidation¶
The DQ deliverable is cached. Re-running DQ regenerates it. Cache invalidation triggers:
- AC declaration changes (schema.init modifications).
- Operator catalog version changes.
- Backend schema changes (column additions, type changes — detected by introspection).
- Configured periodic re-verification (e.g., daily DQ runs against production).
Mid-deliverable updates (without full regeneration) are not supported in Coframe Core; Pro may extend.
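One way to picture the invalidation triggers is a fingerprint over the inputs that the cached deliverable depends on. The Python sketch below is an assumption-laden illustration, not Coframe's mechanism: `dq_fingerprint`, `needs_regeneration`, and the 24-hour default stand in for whatever the runtime actually uses.

```python
import hashlib
import json


def dq_fingerprint(ac_declaration: dict, catalog_version: str, backend_schema: dict) -> str:
    """Hash the three inputs whose change invalidates the cached deliverable."""
    payload = json.dumps(
        {"ac": ac_declaration, "catalog": catalog_version, "backend": backend_schema},
        sort_keys=True,  # stable key order, so equal inputs hash equally
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def needs_regeneration(cached_fp, current_fp, cache_age_hours, max_age_hours=24.0):
    """Regenerate on any input change, or when periodic re-verification is due."""
    return cached_fp != current_fp or cache_age_hours >= max_age_hours


declaration = {"schemas": ["transactions", "daily_sales"]}
backend_v1 = {"transactions": {"revenue": "decimal"}}
backend_v2 = {"transactions": {"revenue": "decimal", "discount": "decimal"}}  # column added

fp_cached = dq_fingerprint(declaration, "coframe-core-catalog/1.0", backend_v1)
fp_same = dq_fingerprint(declaration, "coframe-core-catalog/1.0", backend_v1)
fp_changed = dq_fingerprint(declaration, "coframe-core-catalog/1.0", backend_v2)
```

Here a column addition in the backend changes the fingerprint, which corresponds to the "backend schema changes" trigger; an unchanged fingerprint still leads to regeneration once the configured interval has elapsed.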
7.8 AC verification levels¶
The framework recognizes three AC verification levels — accreditation levels indicating how strictly an AC has been verified. The AC author chooses the target level when configuring DQ; the deliverable reports the achieved level.
7.8.1 Level A (basic structural)¶
Achieved when:
- Phase 1 passes (all declarations are well-formed).
- Phase 2 partially passes: grain integrity holds for every schema; declared FD-edges are either attested by data or designated function-derived; scope declarations are honored.
Level A says: the AC's structure is internally consistent and matches the data's structural shape.
Phase 3 may or may not have been run. Phase 3 results, if present, are advisory at Level A.
7.8.2 Level AA (structural + value coherence)¶
Achieved when Level A is achieved plus:
- Phase 3 has been run for all attestable lineage edges.
- All attestable edges pass or are explicitly tolerated.
- No failed attestations exist outside the tolerated set.
Level AA says: the AC's structural commitments hold against data, and value-level coherence among pre-aggregated and detail tables is established.
This is the recommended level for production ACs.
7.8.3 Level AAA (full + assertions documented)¶
Achieved when Level AA is achieved plus:
- All asserted-not-verified facts (catalog, principle, declaration) are explicitly documented in the deliverable.
- All operator-asserted metric families are documented with their assertion source.
- All caveats are reviewed and signed off by the AC author (the AC's ac_metadata may include a verification_sign_off block).
- No unverified opt-outs of attestation exist (or each is explicitly justified in the deliverable).
Level AAA says: the AC has been verified to the maximum extent possible, with the remaining trust commitments fully documented and acknowledged.
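The three sets of criteria nest, which the following Python sketch makes explicit. It is a simplified reading of §7.8.1 to §7.8.3: the function name and every parameter name are hypothetical, and each boolean stands in for a group of checks that the real DQ process evaluates in detail.

```python
def achieved_level(*, phase_1_passed, phase_2_structural_ok, phase_3_run=False,
                   failed_edges=(), tolerated_edges=(), assertions_documented=False,
                   caveats_signed_off=False, unjustified_opt_outs=0):
    """Return "A", "AA", "AAA", or None when even Level A is not met."""
    if not (phase_1_passed and phase_2_structural_ok):
        return None
    level = "A"
    # Level AA: Phase 3 has run and no failed edge lies outside the tolerated set.
    untolerated = set(failed_edges) - set(tolerated_edges)
    if phase_3_run and not untolerated:
        level = "AA"
        # Level AAA: assertions documented, caveats signed off, opt-outs justified.
        if assertions_documented and caveats_signed_off and unjustified_opt_outs == 0:
            level = "AAA"
    return level


level_dev = achieved_level(phase_1_passed=True, phase_2_structural_ok=True)
level_prod = achieved_level(phase_1_passed=True, phase_2_structural_ok=True,
                            phase_3_run=True, failed_edges=["e1"], tolerated_edges=["e1"])
```

Because each level requires the one below it, a failure outside the tolerated set caps the AC at Level A regardless of how thoroughly its assertions are documented.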
7.8.4 Choosing a level¶
- Level A is appropriate for development ACs, exploratory analytical work, or ACs over evolving data where full attestation cost is not yet justified.
- Level AA is the production-quality default.
- Level AAA is appropriate for regulated environments, compliance-sensitive work, or ACs serving as the primary source of truth for the organization.
The level the AC has achieved appears in the DQ deliverable and is exposed through external interfaces (such as the MCP server). Consumers and AI agents can adjust their behavior accordingly (e.g., an LLM agent might add interpretive caveats when querying a Level A AC).
7.8.5 Verification-level inheritance through the Metric Engine¶
When the Metric Engine (Chapter 11) serves a query from a memoised entry rather than from the live backend, the result inherits the verification level the AC carried at the time the entry was materialised. The engine does not lower the level — a Level AA AC served from engine cache remains a Level AA serve — but it also does not raise it: if the entry was materialised against a Level A AC and the AC has since been re-verified to AA, the memoised entry still reflects its materialisation-time level until refreshed.
This is the same epistemic posture as L2 metadata in v2.1's stability filter: a memoised computation is a cached commitment whose grounding is fixed at the moment of computation. The engine's evict() and refresh() operations (Chapter 11 §11.5) provide the explicit mechanisms to re-ground entries after the AC's verification posture changes.
Practical guidance:
- After bumping an AC from Level A → AA → AAA, run engine.refresh() (or wait for natural eviction) before relying on the new level for cached serves.
- The served_from indicator (Chapter 11 §11.6), engine_cache vs engine_backend, is the operational signal for whether a given query result reflects current AC state (engine_backend always does) or possibly-stale memoised state (engine_cache reflects the AC as of materialisation time).
- Operationally, the engine's manifest records the AC's verification_level at the time each entry was materialised; consumers needing strict freshness can filter entries on this field.
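Filtering manifest entries on the recorded level can be sketched in Python as below. The `ManifestEntry` class and both helper functions are hypothetical; only the verification_level field and the A / AA / AAA ordering come from this chapter, and the real manifest format is specified in Chapter 11.

```python
from dataclasses import dataclass


@dataclass
class ManifestEntry:
    query_id: str
    verification_level: str  # AC level at the time the entry was materialised


LEVEL_RANK = {"A": 1, "AA": 2, "AAA": 3}


def entries_meeting(entries, required_level):
    """Entries whose materialisation-time level is at least required_level."""
    needed = LEVEL_RANK[required_level]
    return [e for e in entries if LEVEL_RANK[e.verification_level] >= needed]


def stale_entries(entries, current_ac_level):
    """Entries materialised under a level other than the AC's current one."""
    return [e for e in entries if e.verification_level != current_ac_level]


# The AC was re-verified from A to AA after q1 was materialised.
entries = [ManifestEntry("q1", "A"), ManifestEntry("q2", "AA")]
usable = entries_meeting(entries, "AA")
to_refresh = stale_entries(entries, "AA")
```

In this example q1 still carries Level A even though the AC is now AA, so it is the entry a refresh would re-ground.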
7.9 The iteration cycle¶
DQ is not a one-time event; it is iterative. An AC under development goes through repeated cycles of declaration, verification, and refinement.
7.9.1 First-run discovery¶
An AC's first DQ run typically surfaces issues:
- Declaration errors (Phase 1).
- Grain uniqueness violations (Phase 2).
- FD-edge mismatches (Phase 2).
- Cross-schema value drift (Phase 2).
- Sibling attestation failures (Phase 3).
The AC author reviews the deliverable, identifies issues, and either corrects the AC declarations, fixes upstream data, or declares tolerated exceptions.
7.9.2 Continuous DQ¶
For ACs in production, DQ runs continuously (typically on a schedule). Each run produces a fresh deliverable; deviations from prior runs surface as drift.
Common drift patterns:
- FD-edge violations appearing where none existed before: typically indicates upstream data quality regression.
- Attestation failures spiking after a deployment: typically indicates an ETL pipeline change that altered pre-aggregation logic.
- Scope honoring failures emerging: typically indicates a schema's declared scope no longer matches the data's actual coverage.
Continuous DQ allows these to be detected and remediated before downstream queries silently produce wrong answers.
7.9.3 Refinement of declarations¶
DQ may also lead to refinement of the AC's declarations:
- An attestation failure may reveal that an ip_reducer's a_block is too narrow (or too broad); the AC author adjusts.
- A scope honoring failure may reveal that a schema's declared scope is wrong; the AC author corrects.
- A cross-schema value-mapping failure may reveal that a column should have a more restrictive M; the AC author re-declares.
The DQ cycle is the AC author's primary feedback loop. The framework's role is to surface discrepancies between declared structure and observed data; the AC author's role is to interpret and resolve them.
7.10 Summary¶
The DQ process is the framework's mechanism for grounding the constructive-correctness guarantee against data:
- Phase 1 verifies the AC's declarations are well-formed (structural rules from Foundations §2.10.1–§2.10.3).
- Phase 2 verifies structural conditions on data (grain integrity, FD-edges, scope, cross-schema mapping).
- Phase 3 verifies metric-value coherence (per-lineage-edge attestation, respecting ip_reducer block sets and missing-value treatment).
The process produces a DQ deliverable consumed by the AC author, the framework's runtime, and any external interface exposing the AC (such as the MCP server). The deliverable distinguishes data-attested commitments from asserted-not-verified ones, supporting transparent reasoning about the AC's epistemic posture.
Three verification levels — A, AA, AAA — indicate how strictly an AC has been verified. The AC author chooses the target; the deliverable reports the achieved level.
The DQ cycle is iterative: the AC author declares, the framework verifies, the deliverable surfaces issues, the AC author refines. Production ACs run DQ continuously.
Subsequent chapters consume the DQ deliverable: Chapter 9's query resolution consults it to determine which commitments support query construction and which trigger dubious-query flagging. External interfaces (such as the MCP server, specified in a separate reference) expose the deliverable to AI agents for verification-aware reasoning.
Chapter 8: Frame-QL¶
The query language for Coframe Core. A Frame-QL query describes the output frame the author wants — a lightweight output schema of columns at a declared grain — and the framework constructs the algorithm to produce it. Frame-QL is consumed by the resolver (Chapter 9).
8.1 What this chapter does¶
Frame-QL is Coframe Core's query language. A Frame-QL query is expressed at the grammar level — referencing columns by their AC family-names and AC-dimensions — rather than at the physical level of tables and joins. The query author needs to know the AC's exposed surface; they do not need to know the backend's tables.
The chapter is organized as follows. §8.2 introduces Frame-QL and what it is (and is not). §8.3 specifies lexical structure. §8.4 specifies the top-level query structure: the Frame and its accessories. §8.5 specifies the clauses. §8.6 specifies expressions. §8.7 specifies WITH-blocks. §8.8 walks through examples by increasing capability. §8.9 specifies disambiguation. §8.10 summarizes semantics. §8.11 catalogs errors. §8.12 compares Frame-QL with SQL.
Query resolution — how the framework routes a Frame-QL query to backend data and constructs the algorithm — is specified in Chapter 9.
8.2 What Frame-QL is¶
Frame-QL is a declarative query language. The guiding principle is the same one that governs every interface to an AC: the author specifies what the output should be, not the procedure for producing it. A Frame-QL query describes a desired result; the framework determines how to construct it from the AC's declared and verified commitments.
The result of a Frame-QL query is a Frame: a rectangular datagrid of columns at a declared grain. Conceptually, a Frame is a lightweight output schema — a set of lightweight ColumnSpecs. This is the framework's central insight applied to query output: just as an AC is built from ColumnSpecs (Chapter 3) and has no genuine table-level operations, a query's output is likewise just a set of columns assembled into a grid. The Frame is the AC's notion of a schema, minus the parts that would make the output re-ingestable into the AC as a permanent member.
This identity — Frame as lightweight output schema — is conceptual. In daily use, Frame-QL provides syntax sugar that makes queries compact and familiar:
- The output grain is declared with an AT clause rather than by writing each dimension column's full anchor.
- A column's anchor is inherited from context (the schema grain, the AT grain) rather than restated per column.
- Aggregation is implied by the column's metric family ip_reducer and the AT grain, rather than spelled out as an explicit procedure.
So a Frame-QL query reads as a compact query while denoting an output schema underneath.
This identity is also the framework's most useful test. Because a Frame is a set of column specs and nothing more — there are no genuine frame-level operations — the question of whether any computation belongs in Frame-QL Core reduces to a single decidable check:
The frame test. Can the column's value be specified from its own row — its own grain, its own components — possibly by aggregating input rows up to that grain? If so, it is expressible: it is a column spec. Does the column's value instead require referencing other rows of the assembled output frame? If so, it is a frame-level operation and is outside Core.
This test, not the general flavor of "declarative," is what decides what Frame-QL expresses. It is worth applying deliberately rather than by feel, because feel misleads in both directions. It will tempt an author to reject a clean construct for looking like SQL (a bare SELECT, an explicit FROM) — when surface resemblance is irrelevant and those constructs pass the test cleanly. And it will tempt an author to reach for a window-style or pivot computation because doing everything in one query feels elegant — when those fail the test (they read across the assembled frame) and belong to Coframe Pro or to processing outside Coframe. The sections that follow specify the surface; the frame test is what tells the reader, in any new case, which side of the line a computation falls on.
8.2.1 The Frame and its accessories¶
A query expresses more than the output Frame. It also expresses scope and arrangement — and these are kept separate from the Frame, because that is how a query lives in an author's mind:
- FROM: which schemas of the AC the query may draw from (usually inferred; specified only to disambiguate).
- WHERE: pre-filtering; scope the input rows before the Frame is computed.
- The Frame (SELECT + AT): the output schema, meaning the columns and the grain.
- HAVING: post-filtering; scope the output rows after the Frame is computed.
- ORDER BY: arrange the output rows.
- LIMIT: cap the output rows; optionally per-group via the PER subclause (§8.5.7).
The Frame is the core; FROM, WHERE, HAVING, ORDER BY, LIMIT are accessories around it. The framework does not jam filtering and ordering into the Frame; they are genuinely distinct components of a query, matching the natural decomposition of how people think about querying. The guiding principle ("describe what you want") is honored across all of them: each accessory describes a property of the desired result — its input scope, its output scope, its arrangement — not a data-processing instruction.
8.2.2 Frame-QL as one of the AC Surfaces¶
An AC is equipped with several interfaces — collectively, the AC's Surfaces — all of which honor the same guiding principle of describing-what-you-want rather than instructing-how-to-process:
- Frame-QL Surface (console and API) — for data engineers, data analysts, and business analysts. It is the core that drives the backend construction path (resolution and computation, Chapter 9).
- NL Query Surface — for AI tools and business users, sitting on top of the Frame-QL Surface. NL queries are executed by first translating to Frame-QL.
- Other Surfaces (such as the MCP Surface, HTTP API Surface, and Workbench / Validation Surfaces) — likewise built on the Frame-QL layer.
Frame-QL is the foundational query form; the other Surfaces reduce to it. AC Surfaces is the umbrella term for these access protocols — each a separately-documented conformance contract. (See coframe_platform_design_v2_1_supplement.md §10.2 for the full enumeration and rationale.)
A second use of Frame-QL is worth noting: because a Frame is a lightweight output schema, a data engineer can use a Frame-QL output to augment the AC directly — the output datagrid becomes new AC content. In Coframe Core, this re-ingestion is not provided (Frame-QL outputs are session-local; §8.7.2); since the broader range of use cases does not need it, the re-ingestion requirement is relaxed. Persistent re-ingestion is a Coframe Pro capability.
8.2.3 What Frame-QL is not¶
Frame-QL is not SQL, though it shares some surface vocabulary. The differences that matter:
- No JOIN clause. Cross-schema reach is automatic via the four-rule filter (Chapter 9). The author never writes a join.
- No GROUP BY clause. The AT clause specifies the output grain; aggregation follows from the grain and the metric family ip_reducers.
- No subqueries except WITH-chained frames (§8.7).
- No window functions in Coframe Core. Window analytics, meaning operations that reference other rows of the assembled frame such as period-over-period comparison (PRIOR / LEAD), ranking, and running totals, are not supported in Coframe Core. They operate over the assembled frame rather than describing a column's own spec, so they fall outside Core's frame model. They are Coframe Pro territory, or are performed outside Coframe.
Frame-QL is also not a programming language. It is a declarative description of a desired result.
8.2.4 What Coframe Core supports¶
Coframe Core supports a defined subset of Frame-QL's full capability surface, described by rungs of increasing query complexity (§8.8): reading columns (Rung 0), identity-preserving reduction (Rung 1), broadcast (Rung 2), multi-input expressions (Rung 6), cross-schema reach (Rung 7), and WITH-chained frames (Rung 9).
Within this scope, Frame-QL expresses most analytical needs. Outside it — window analytics, pivot, holistic-within-self reductions, query-time epoch transitions, custom operators, multi-backend queries, persistent re-ingestion — is Coframe Pro territory, or patterns better performed outside Coframe entirely. The framework does not attempt to support every analytical pattern; it supports the frame-shaped ones correctly and refuses cleanly outside that scope.
8.3 Lexical structure¶
8.3.1 Tokens¶
Frame-QL source consists of tokens separated by whitespace. The token classes:
- Identifiers: alphanumeric sequences starting with a letter or underscore, optionally containing dots for qualified references. Examples: revenue, peak_revenue, transactions.revenue, revenue_per_customer.
- Keywords: reserved words including SELECT, FROM, WHERE, AT, BY, USING, HAVING, ORDER, LIMIT, PER, WITH, AS, AND, OR, NOT, IF, THEN, ELSE, CASE, WHEN, END, IS, NULL, MISSING, TRUE, FALSE, DISTINCT, IN, BETWEEN, LIKE, ASC, DESC.
- Operators: arithmetic (+, -, *, /, %, ^), comparison (<, <=, =, <> or !=, >=, >). Logical operators are the keywords AND, OR, NOT.
- Literals: numeric (42, 3.14), string (single-quoted: 'west'), boolean (TRUE, FALSE), missing (NULL or MISSING, which are equivalent in Frame-QL).
- Punctuation: parentheses ( ), brackets [ ], comma (,), semicolon (;).
8.3.2 Comments¶
Single-line comments start with -- and run to end of line. Multi-line comments are delimited by /* and */.
8.3.3 Case sensitivity¶
Keywords and built-in operator names are case-insensitive (SELECT and select are equivalent). Identifiers are case-sensitive by default (matching the AC's registered names exactly); an implementation may offer case-insensitive identifier matching as configuration.
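The default rule (keywords fold case, identifiers do not) can be illustrated with a short Python sketch. The `classify_word` helper is hypothetical and is not the Frame-QL parser; the keyword set is copied from §8.3.1, which introduces it with "including", so it may not be exhaustive.

```python
# Reserved words as listed in §8.3.1 (possibly not exhaustive).
KEYWORDS = frozenset(
    "SELECT FROM WHERE AT BY USING HAVING ORDER LIMIT PER WITH AS AND OR NOT "
    "IF THEN ELSE CASE WHEN END IS NULL MISSING TRUE FALSE DISTINCT IN BETWEEN "
    "LIKE ASC DESC".split()
)


def classify_word(word):
    """Fold keywords to upper case; keep identifiers exactly as written."""
    upper = word.upper()
    if upper in KEYWORDS:
        return ("KEYWORD", upper)
    return ("IDENTIFIER", word)


kw_lower = classify_word("select")
kw_upper = classify_word("SELECT")
ident_a = classify_word("revenue")
ident_b = classify_word("Revenue")
```

Under this rule `select` and `SELECT` denote the same token, while `revenue` and `Revenue` would be matched against the AC's registered names as two different identifiers.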
8.3.4 String literals¶
Single-quoted strings use SQL-style escaping: a doubled single-quote within a string is a literal single-quote ('it''s'). String literals support no other escape sequences in Coframe Core; backends handle Unicode per their native conventions.
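A minimal Python sketch of this escaping rule follows. The function `parse_string_literal` is a hypothetical helper, not the framework's lexer; it assumes the token has already been isolated, including its surrounding quotes.

```python
def parse_string_literal(token: str) -> str:
    """Decode a single-quoted literal in which '' stands for one quote."""
    if len(token) < 2 or token[0] != "'" or token[-1] != "'":
        raise ValueError("not a single-quoted string literal")
    body = token[1:-1]
    # Once doubled quotes are removed, any remaining quote was unescaped.
    if "'" in body.replace("''", ""):
        raise ValueError("unescaped single quote inside literal")
    return body.replace("''", "'")


decoded = parse_string_literal("'it''s'")
# Backslashes are ordinary characters, since no other escape sequences exist.
raw_backslash = parse_string_literal(r"'a\nb'")
```

Because doubling is the only escape, a backslash followed by `n` stays as two characters rather than becoming a newline.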
8.3.5 Date and timestamp literals¶
Date and timestamp literals use explicit syntax: DATE '2026-01-01', TIMESTAMP '2026-01-01 09:30:00', or via cast CAST('2026-01-01' AS DATE). String literals are not implicitly converted to dates; the explicit form is required.
8.3.6 Identifiers as AC family-names¶
AC family-names appear as identifiers in queries. Per Foundations §2.11 (names as opaque labels), the framework treats names as labels for equality comparison against AC declarations — the parser consumes them as identifier strings; it does not infer meaning from them.
Identifiers with characters outside the alphanumeric-underscore-dot set must be backtick-quoted (`unusual name`) where the backend supports it.
8.4 Top-level structure¶
8.4.1 Query forms¶
A Frame-QL query is one of:
- A Frame: a SELECT-form (or sugar-form) query with its accessories. Outer Frames require an AT clause.
- A WITH-block: one or more inner Frames followed by an outer Frame (§8.7).
A Frame has two equivalent surface forms:
- Explicit form: begins with the SELECT keyword: SELECT select_item_list [FROM ...] [WHERE ...] AT ... [HAVING ...] [ORDER BY ...] [LIMIT ...].
- Sugar form: the SELECT keyword is omitted; the query begins directly with the select-item list.
Outer Frames must declare AT explicitly. Inner Frames in a WITH-block may omit AT, inheriting from the outer Frame (§8.7).
8.4.2 Top-level grammar¶
query := frame | with_block
frame := select_clause [from_clause] [where_clause] at_clause
         [having_clause] [order_by_clause] [limit_clause]
select_clause := [SELECT] select_item_list
with_block := WITH inner_frame_list outer_frame
This grammar is abbreviated; the full BNF is in Appendix A.
The clause ordering above is conventional and matches the natural decomposition (§8.2.1): the Frame core (SELECT + AT) sits among its accessories (FROM, WHERE, HAVING, ORDER BY, LIMIT).
8.5 Frame clauses¶
8.5.1 SELECT clause — the Frame's columns¶
The SELECT clause specifies the columns of the output Frame. Each select-item is one of:
- A bare column reference: revenue, region, customer_name. References an AC family-name. Its anchor is inherited from the query's AT grain (§8.2).
- A qualified reference: transactions.revenue. Qualifies which schema's appearance of the family-name to use, for disambiguation (§8.9).
- A reducer expression: SUM(revenue), MAX(peak_revenue), COUNT(*), COUNT_DISTINCT(customer). Aggregates per the output grain.
- A mapper expression: revenue / units_sold, UPPER(customer_name). Operates row-wise at the output grain.
- A composed expression: combinations of mappers and reducers, parenthesized for grouping.
- A literal: 42, 'west', TRUE.
- An aliased item: expression AS name, e.g. SUM(revenue) AS total_revenue.
Items are comma-separated. Each select-item is, conceptually, one lightweight ColumnSpec in the output schema.
8.5.2 FROM clause — schema scope (accessory)¶
The FROM clause is optional. When present, it lists schemas that may contribute to the query:
Uses:
- Disambiguation when automatic schema selection is ambiguous (multiple schemas could serve a column).
- Restricting the framework to specific schemas (e.g., for performance).
- Cousin disambiguation: restricting to the schemas containing the intended sibling group (§8.9).
When FROM is omitted, the framework selects schemas via the four-rule filter (Chapter 9).
8.5.3 WHERE clause — pre-filtering (accessory)¶
The WHERE clause filters input rows before the Frame is computed. Its expression evaluates to a boolean at the input grain:
WHERE expressions reference AC-dimensions and AC-attributes (not aggregated metrics — those are filtered in HAVING). They follow three-valued logic (TRUE / FALSE / NULL); rows evaluating to FALSE or NULL are excluded.
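The exclusion of NULL-valued predicates can be illustrated with a small Python sketch of three-valued logic, using None for NULL. The helper names are hypothetical, and treating a comparison against a missing value as NULL follows the usual SQL convention, which is an assumption here rather than something this section states.

```python
def eq3(x, y):
    """Equality that yields NULL (None) when either operand is missing."""
    return None if x is None or y is None else x == y


def and3(a, b):
    if a is False or b is False:
        return False
    if a is None or b is None:
        return None
    return True


def or3(a, b):
    if a is True or b is True:
        return True
    if a is None or b is None:
        return None
    return False


def where_filter(rows, predicate):
    """Keep only rows whose predicate is TRUE; FALSE and NULL are both excluded."""
    return [r for r in rows if predicate(r) is True]


rows = [
    {"region": "west", "channel": "web"},
    {"region": "east", "channel": None},
    {"region": None, "channel": "store"},
]
# Roughly: WHERE region = 'west' OR channel = 'web'
kept = where_filter(rows, lambda r: or3(eq3(r["region"], "west"), eq3(r["channel"], "web")))
```

The second and third rows evaluate to NULL rather than FALSE (one operand of the OR is unknown), and both are excluded just as a FALSE row would be.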
8.5.4 AT clause — the Frame's grain¶
The AT clause specifies the output grain — the AC-dimensions each output row represents:
The AT clause is mandatory on outer Frames. It declares the grain so the framework knows what level the Frame is anchored at. It may reference:
- A single AC-dimension (AT region).
- A tuple of AC-dimensions (AT (region, year)).
- The grain of a specific schema (AT transaction means "at transaction grain").
The framework navigates from each metric's source anchor to the output grain via the FD-DAG, applying the metric family's ip_reducer with its block set (Chapter 9). Together, SELECT and AT constitute the Frame — the output schema.
A note on the keyword. Earlier drafts used BY (echoing SQL's GROUP BY). The framework prefers AT, because it is locative — it names the grain the Frame's values sit at — rather than operative (an instruction to "group"). This matches the framework's stance: a query describes the output frame it wants, it does not instruct a grouping procedure. The same word underlies the per-operand grain annotation @ (§8.6.6): AT sets the Frame's grain; @ sets a sub-term's grain. BY is still accepted as a synonym for AT for readers coming from SQL, but it is deprecated and may be removed in a future version; new ACs and examples should use AT.
When a dimension family has multiple hierarchies and a referenced AC-dimension is reachable through more than one of them, the AT ... USING <hierarchy-name> form selects which hierarchy to traverse:
In most ACs this is unnecessary because distinct hierarchies use distinct AC-dimension names (the retail AC's quarter and fiscal_quarter are different AC-dimensions, so naming the AC-dimension already selects the hierarchy). USING is the explicit disambiguator for the rarer case where an AC-dimension is shared across hierarchies (Chapter 9 §9.7.2.4).
8.5.5 HAVING clause — post-filtering (accessory)¶
The HAVING clause filters output rows after the Frame is computed. Its expression evaluates at the output grain and may reference aggregated values:
WHERE filters input rows; HAVING filters output rows. The two are distinct accessories on opposite sides of the Frame computation.
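The two filters sit on opposite sides of the aggregation, which a short Python sketch can make concrete. This is an illustration under assumed data, not the resolver's plan: the `transactions` rows, the year filter, and the threshold of 80 are all hypothetical.

```python
from collections import defaultdict

transactions = [
    {"region": "East", "year": 2026, "revenue": 100},
    {"region": "East", "year": 2025, "revenue": 900},
    {"region": "West", "year": 2026, "revenue": 40},
    {"region": "West", "year": 2026, "revenue": 30},
]

# Pre-filter (WHERE role): evaluated per input row, before any aggregation.
inputs = [t for t in transactions if t["year"] == 2026]

# Frame computation: total revenue at region grain.
totals = defaultdict(int)
for t in inputs:
    totals[t["region"]] += t["revenue"]

# Post-filter (HAVING role): evaluated per output row, on aggregated values.
frame = {region: total for region, total in totals.items() if total >= 80}
```

East's 2025 row never reaches the sum because the pre-filter removes it; West is computed (total 70) and then dropped by the post-filter.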
8.5.6 ORDER BY clause — arrangement (accessory)¶
The ORDER BY clause sorts the output rows. Each sort key is ASC (default) or DESC; multiple keys are applied lexicographically. Sort keys may reference SELECT columns or AT AC-dimensions.
LIMIT applies after ORDER BY; with the PER subclause (§8.5.7) the cap applies within each group rather than across the entire frame, while sort order continues to be determined by ORDER BY.
8.5.7 LIMIT clause — capping (accessory)¶
The LIMIT clause caps the number of output rows. In its bare form it caps the entire frame; with the optional PER subclause it caps per group:
Bare LIMIT n. The first n rows of the frame, after ORDER BY. Without an ORDER BY, which rows survive is implementation-defined.
LIMIT n PER cols. The first n rows of the frame within each group defined by cols, after ORDER BY. cols is one or more AC-dimensions; each must appear in the AT clause. The form covers the top-N-per-group pattern — top three stores per region by revenue, last five orders per customer, three highest test scores per district — that would otherwise require a window function.
The framework treats PER as pure output filtering: it discards rows of the frame, it does not introduce per-row values that read across rows. It therefore remains on the "describe the column's spec" side of the frame test (§8.2) and is not a window function (§8.2.3).
Example. Top three stores per region by revenue:
SELECT region, store, SUM(revenue) AS total_revenue
AT (region, store)
ORDER BY total_revenue DESC
LIMIT 3 PER region
The frame is computed at (region, store) grain (one row per store with its total revenue), sorted by revenue descending, then truncated to three rows per region. The output has at most 3 × |regions| rows.
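The sort-then-truncate behaviour described above can be sketched in Python. This is a hedged model of the semantics only: `limit_per` is a hypothetical helper, and the store totals are made-up data.

```python
from collections import defaultdict

def limit_per(ordered_rows, n, per):
    """Keep the first n rows within each group of `per`, preserving the given order."""
    counts = defaultdict(int)
    kept = []
    for row in ordered_rows:
        group = tuple(row[c] for c in per)
        if counts[group] < n:
            counts[group] += 1
            kept.append(row)
    return kept

# Frame at (region, store) grain: one row per store with its total revenue.
frame = [
    {"region": "East", "store": "E1", "total_revenue": 90},
    {"region": "East", "store": "E2", "total_revenue": 70},
    {"region": "East", "store": "E3", "total_revenue": 50},
    {"region": "East", "store": "E4", "total_revenue": 30},
    {"region": "West", "store": "W1", "total_revenue": 80},
    {"region": "West", "store": "W2", "total_revenue": 60},
]

# ORDER BY total_revenue DESC, then LIMIT 3 PER region.
ordered = sorted(frame, key=lambda r: r["total_revenue"], reverse=True)
top3 = limit_per(ordered, 3, ["region"])
```

The helper only discards rows; it never computes a value that reads across rows, which mirrors the distinction the section draws between PER and a window function.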
Validity. The grammar requires n ≥ 0 and at least one dimension after PER. The resolver additionally requires:
- (L1) Each entry in cols must be an AC-dimension that appears in the AT clause. References to measures, expressions over columns, or dimensions outside the grain are rejected.
- (L2) PER requires LIMIT. PER standalone is a grammar error.
- (L3) LIMIT 0 PER ... is grammatical and yields an empty frame; the resolver emits a warning.
Determinism within group. When ORDER BY contains a sort key that is not in cols — typically a measure expression — per-group selection is deterministic. When ORDER BY contains only the cols themselves, or no ORDER BY is present, per-group selection is implementation-defined: the operator is responsible for supplying a within-group tiebreaker. The resolver emits a lint warning in this case.
Difference from SQL. SQL has no per-group form of LIMIT; the same pattern requires a window function (ROW_NUMBER() OVER (PARTITION BY …) plus an outer filter). Frame-QL's PER is not a window function — it discards rows of the frame rather than computing per-row values — so it does not violate the no-window rule (§8.2.3). Authors migrating SQL queries should expect bare LIMIT n to behave identically to SQL's LIMIT; per-group capping requires the additional PER keyword rather than reinterpreting existing syntax.
8.6 Expressions¶
A Frame-QL expression that computes a column at the output grain from terms available at that grain is a frame expression. A term is available at the frame grain if it is (a) a bare family column, which resolves to that family at the frame grain via the family's ip_reducer; (b) an explicit reducer applied at the grain; (c) an @-anchored sub-term whose value is staged at another grain and brought to the frame grain (§8.6.6); or (d) any arithmetic combination of these. The defining property of a frame expression is that every term resolves to a single value per output row — it describes a column's value in terms of its own grain, never by referencing other rows of the assembled frame. This is the line that keeps Frame-QL within the "describe a column's spec" model and excludes window functions (§8.2.3).
Frame-QL's expression sublanguage has three kinds — reducers, mappers, and their compositions — plus a small set of registered convenience operators (§8.6.4) and the cross-grain @ annotation (§8.6.6).
8.6.1 Reducer expressions¶
Reducers aggregate over input rows to the output grain. Per the operator catalog (Chapter 10):
- SUM(c), MAX(c), MIN(c): natively monoidal reducers.
- COUNT(*), COUNT(c), COUNT_DISTINCT(c): count operations.
- AVG(c), STDEV(c), VARIANCE(c): liftably monoidal reducers (computed via their lifts; Chapter 10 §10.5).
- APPROX_DISTINCT(c), APPROX_PERCENTILE(c, p): sketch-backed approximate reducers.
- MEDIAN(c), MODE(c), EXACT_DISTINCT(c): holistic reducers. Usable when the output grain matches the data grain (no rollup); a query that would require rolling these up across grain is refused, since the family is anchor-locked (Chapter 9).
- BOOL_AND(c), BOOL_OR(c), BIT_AND(c), BIT_OR(c), BIT_XOR(c): boolean and bitwise reducers.
- HLL_MERGE(c), THETA_UNION(c), T_DIGEST_MERGE(c): sketch-merge reducers (for sketch-typed metric families).
- STRING_AGG(c, sep), ARRAY_AGG(c): collection reducers.
Missing-value treatment per reducer is in Chapter 10.
8.6.2 Mapper expressions¶
Mappers operate row-wise at the output grain. They reference only values within the same row — never other rows — which is what distinguishes them from window functions.
- Arithmetic: +, -, *, /, %, ^, with standard precedence.
- Comparison: =, <>, <, <=, >, >=, three-valued.
- Logical: AND, OR, NOT, three-valued.
- String functions: UPPER, LOWER, TRIM, SUBSTRING, LENGTH, CONCAT, REPLACE.
- Date/time functions: DATE_ADD, DATE_DIFF, EXTRACT(field FROM d).
- Type conversion: CAST(expr AS type).
- Missing-value handling: COALESCE(c1, c2, ...), IFNULL(c, repl), NULLIF(c, v).
- Conditional: CASE WHEN ... THEN ... [ELSE ...] END, IF(cond, t, f).
Missing-value treatment per mapper is in Chapter 10.
8.6.3 Composite expressions¶
Mappers and reducers compose:
- revenue / units_sold: mapper over two metric references at the output grain.
- SUM(revenue) / SUM(units_sold): composition of two reducers.
- 100 * SUM(revenue) / SUM(market_revenue): a ratio of two reducers at the same (output) grain, scaled to a percentage. (This is a same-grain ratio; the share-of-a-coarser-total measure is the two-anchor PCT operator, §8.6.6.)
The framework handles precedence and missing-value propagation per Chapter 10.
A note on where a ratio is taken: SUM(revenue) / SUM(units_sold) computes the two sums at the output grain, then divides — a per-output-row computation. This differs from a per-input-row ratio, which would be a singleton metric family in the AC anchored at the input grain (Foundations §2.7.7). Both are legitimate; they differ in the grain at which the division happens.
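A small numeric sketch in Python shows why the grain of the division matters. The two input rows are hypothetical, and the mean of the row ratios is shown only as one plausible way a reader might try to summarise per-row ratios; it is not a reducer the manual prescribes.

```python
# Two input rows that roll up to the same output row (region = East).
rows = [
    {"region": "East", "revenue": 100, "units_sold": 10},
    {"region": "East", "revenue": 300, "units_sold": 20},
]

# Division at the output grain: sum both components first, then divide once.
output_grain_ratio = (
    sum(r["revenue"] for r in rows) / sum(r["units_sold"] for r in rows)
)

# Division at the input grain: one ratio per input row.
row_ratios = [r["revenue"] / r["units_sold"] for r in rows]

# Summarising the per-row ratios afterwards gives a different number.
mean_of_row_ratios = sum(row_ratios) / len(row_ratios)
```

Here the output-grain ratio is 400 / 30 while the per-row ratios are 10 and 15, so the two quantities answer different questions even though both are legitimate.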
8.6.4 Registered ratio and count operators¶
For common ratios and conditional counts, Coframe Core provides convenience operators that compute at the output grain:
- RATIO_OF(numerator, denominator): the ratio of two reducers (by default sums) at the output grain, with catalog-specified missing-value treatment. Equivalent to SUM(numerator) / SUM(denominator) taken at the AT grain.
- COUNT_OF(filter_expression): counts input rows where the filter is TRUE; useful for conditional counts. Equivalent to a conditional count, rolled up like any additive measure.
Both are query-time expressions, not stored metrics. This is a deliberate choice for the ratio case: an intensive ratio (margin, revenue-per-unit, revenue-per-customer) has no partition-invariant reducer, so a stored ratio would be anchor-locked — queryable only at the grain where it was materialized and unreachable from any other grain. Computing the ratio at the output grain from its extensive components is instead correct at every grain and costs nothing extra. COUNT_OF, by contrast, is extensive and rolls up cleanly; it is provided as an expression purely for convenience. The full reasoning, and the narrow cases where exposing a ratio as a stored metric is warranted, are in Chapter 10 §10.8.6.
For the two-anchor measures (§8.6.6) — those that combine a value at the frame grain with an aggregate of a metric at another grain — Coframe Core provides three named operators as sugar over the underlying @-anchored frame expressions:
- PCT(m @ a): m's share of its a-grain total. Sugar for m / (m @ a), where a is a coarsening of the frame grain. With a the top of a hierarchy (PCT(m @ All)), this is share-of-grand-total.
- WEIGHTED_AVG(m, w @ a): the average of m weighted by w, with a the weighting grain (the grain at which m and w are paired). Sugar for SUM(m * w @ a) / SUM(w @ a) carried to the frame grain.
- INDEX(m @ coord): m relative to its value at a designated base coordinate. Sugar for m / (m @ coord).
These operators are the surface at which the dubiousness law (§8.6.6) is enforced: when the relevant grain (the weighting grain for WEIGHTED_AVG, the companion grain for PCT/INDEX) is not uniquely determined and not explicitly given via @, the operator is refused as dubious. A native, accelerated implementation of these patterns is a possible future-engine optimization (a Pro concern); in Coframe Core they are defined by their desugaring, and the desugared form is correct at the cost of intermediate staging.
8.6.5 Qualified references¶
When a family-name appears in multiple schemas with different family-roots — cousins (Foundations §2.7.5) — the author can qualify references to select one cousin's family:
- transactions.revenue: the revenue column whose family-root is in the transactions schema.
- regional_summary.revenue: the revenue column whose family-root is in the regional_summary schema.
Qualified references constrain which schema serves a specific column, overriding automatic selection for that column. The four-rule filter still applies at the query level. See §8.9 for the cousin-disambiguation use case.
8.6.6 Cross-grain sub-terms (@) and two-anchor measures¶
Most frame expressions combine terms that all live at the frame grain. Some analytical measures, however, combine a value at the frame grain with an aggregate of a metric taken at a different grain. The percentage of a regional total, a weighted average, and an index to a base period are the canonical examples. These are two-anchor measures, and Frame-QL expresses them with the per-operand grain annotation @.
The @ annotation¶
<term> @ <anchor> evaluates <term> at <anchor> rather than at the frame grain, and brings the result to the frame grain so it can combine with same-row terms. Two relationships between the sub-term's anchor and the frame grain are permitted, and only these two:
- Coarsening by composite-subset: the sub-term anchor drops one or more coordinates from a composite frame grain (frame AT (store, month), sub-term @ store).
- Coarsening by nested dimension: the sub-term anchor climbs an FD edge within a dimension family to a coarser AC-dimension (frame AT day, sub-term @ week).
In both cases the sub-term anchor is a coarsening of the frame grain reachable in the coordinate structure. The coarse value is then filled up (broadcast) back to the frame grain along the FD-DAG (Chapter 9). Sideways moves, finer grains, and grains in unrelated dimension families are not permitted and are refused at validation.
A bare-term @ reference (revenue @ region) resolves to that family at the coarser grain via its ip_reducer — it is the same conceptual quantity, navigated up. An explicit reducer (MAX(revenue @ region)) applies that reducer at the coarser grain and may denote a different quantity (Foundations §2.6.5).
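The coarsen-then-broadcast step can be sketched in Python for the share-of-total case, m / (m @ a). This is a minimal model under stated assumptions: the measure rolls up by SUM (as revenue does in the retail AC), `pct_at` is a hypothetical helper, the rows are invented, and missing-value treatment (Chapter 10) is not modelled.

```python
from collections import defaultdict

def pct_at(frame, measure, anchor):
    """Compute measure / (measure @ anchor) for each row of the frame."""
    # Step 1: evaluate the sub-term at the coarser anchor (SUM assumed).
    coarse = defaultdict(float)
    for row in frame:
        coarse[tuple(row[d] for d in anchor)] += row[measure]
    # Step 2: broadcast the coarse value back to every frame row and divide.
    out = []
    for row in frame:
        total = coarse[tuple(row[d] for d in anchor)]
        out.append({**row, "pct": row[measure] / total})
    return out

# Frame at (region, store) grain; sub-term anchored at region
# (coarsening by composite-subset: the store coordinate is dropped).
frame = [
    {"region": "East", "store": "E1", "revenue": 60},
    {"region": "East", "store": "E2", "revenue": 40},
    {"region": "West", "store": "W1", "revenue": 50},
]
shares = pct_at(frame, "revenue", ["region"])
```

Each output row still depends only on its own coordinates: the regional total is a value attached to the row's region, not a read across other rows of the assembled frame.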
Reduce-of-expression¶
A reducer may be applied to an expression rather than a bare column — SUM(revenue - cost), SUM(rating * enrollment). Such a reduce-of-expression requires an explicit formation anchor:
The rule: a reducer applied to anything other than a single bare family column requires an explicit formation anchor. The bare family column is the sole exception — it carries its own grain via its anchor and reduces by its ip_reducer, so it needs no annotation.
The requirement is conservative by design. For linear expressions under distributive reducers the formation grain does not in fact change the result — SUM(revenue - cost) equals SUM(revenue) - SUM(cost) at any grain. But for nonlinear expressions it does: SUM(rating * enrollment) formed at school grain differs from the same expression formed at district grain, because multiplication does not commute with regrouping. Core does not analyze each expression for linearity and distributivity; it requires the formation anchor uniformly whenever the operand is not a bare column, so that the grain at which the expression is formed is always explicit and the result is always well-defined. Declaring the anchor on a case that happens to be grain-invariant is harmless; omitting it on a case that is not would be silently wrong.
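The grain sensitivity described above can be checked numerically. The Python sketch below assumes, purely for illustration, that every column rolls up to the formation anchor by SUM; the helper `sum_formed_at` and the two school rows are hypothetical and do not come from the manual's retail AC.

```python
from collections import defaultdict

schools = [
    {"district": "D1", "school": "S1", "rating": 2, "enrollment": 10,
     "revenue": 50, "cost": 20},
    {"district": "D1", "school": "S2", "rating": 4, "enrollment": 30,
     "revenue": 70, "cost": 40},
]

def sum_formed_at(rows, anchor, columns, expr):
    """Roll `columns` up to `anchor` by SUM, form `expr` per group, then SUM."""
    groups = defaultdict(lambda: defaultdict(int))
    for row in rows:
        key = tuple(row[d] for d in anchor)
        for c in columns:
            groups[key][c] += row[c]
    return sum(expr(values) for values in groups.values())

def product(v):
    return v["rating"] * v["enrollment"]   # nonlinear in its operands

def margin(v):
    return v["revenue"] - v["cost"]        # linear in its operands

# Nonlinear expression: the formation anchor changes the result.
product_at_school = sum_formed_at(schools, ["school"], ["rating", "enrollment"], product)
product_at_district = sum_formed_at(schools, ["district"], ["rating", "enrollment"], product)

# Linear expression: the formation anchor does not change the result.
margin_at_school = sum_formed_at(schools, ["school"], ["revenue", "cost"], margin)
margin_at_district = sum_formed_at(schools, ["district"], ["revenue", "cost"], margin)
```

With these rows the product is 2·10 + 4·30 = 140 when formed at school grain and (2 + 4)·(10 + 30) = 240 when formed at district grain, while the margin is 60 either way.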
A reduce-of-expression desugars to a staged frame (§8.7): form the expression at its declared anchor, then reduce to the output grain. Nesting is permitted — a reduce-of-expression whose operand contains another reduce-of-expression at a coarser grain — and desugars to a longer staged chain, provided each level's anchor is a coherent coarsening of the level below it. (An implementation may support a limited nesting depth and report deeper forms as not-yet-supported; this is an implementation limit, not a language restriction.)
The dubiousness law¶
A reduce-of-expression over a nonlinear combination is dubious when its formation anchor is neither explicitly given nor uniquely determined. The weighted average is the canonical case: WEIGHTED_AVG(m, w) with no stated weighting grain is genuinely ambiguous whenever m and w are co-native at more than one grain, because the weighted average of fine-grain values differs from the weighted average of coarser-grain values — they answer different questions. The framework cannot choose between them without guessing, so it does not:
A two-anchor measure whose formation (weighting or companion) grain is ambiguous and undeclared is refused as dubious, and the author must declare the grain via @. When the grain is uniquely determined (the metrics are co-native at exactly one grain), it is inferred and no annotation is required.
This is the same posture as cousin disambiguation (§8.9): where the answer is not unique under the AC's commitments, the framework refuses and requires the author to disambiguate, rather than silently returning one of several possible answers. It generalizes the dubious-query mechanism from family-root ambiguity to grain ambiguity of nonlinear combinations. The practical effect is that a weighted average (a measure most tools compute by silently fixing whatever grain is convenient) must, in Coframe, state the grain it weights at, or be refused. The number is then reproducible and defensible rather than quietly dependent on an unstated choice.
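The refuse-or-infer rule can be sketched in Python together with a numeric case where the two grains disagree. Everything named here is an assumption for illustration: `DubiousQuery`, `resolve_weighting_grain`, `weighted_avg`, and the two tables, which stand for rating and enrollment being stored at both school and district grain.

```python
class DubiousQuery(Exception):
    """Raised when the weighting grain is neither declared nor unique."""

def resolve_weighting_grain(co_native_grains, declared=None):
    # A declared grain always wins (validating it is outside this sketch).
    if declared is not None:
        return declared
    # Exactly one co-native grain: infer it, no annotation required.
    if len(co_native_grains) == 1:
        return co_native_grains[0]
    # Otherwise refuse instead of guessing. The empty case is not covered
    # by the section; this sketch simply refuses it as well.
    raise DubiousQuery(f"no unique weighting grain among {co_native_grains}")

def weighted_avg(rows, m, w):
    return sum(r[m] * r[w] for r in rows) / sum(r[w] for r in rows)

tables = {
    "school": [
        {"school": "S1", "rating": 2.0, "enrollment": 10},
        {"school": "S2", "rating": 4.0, "enrollment": 30},
        {"school": "S3", "rating": 5.0, "enrollment": 10},
    ],
    "district": [
        {"district": "D1", "rating": 3.0, "enrollment": 40},
        {"district": "D2", "rating": 5.0, "enrollment": 10},
    ],
}

grain = resolve_weighting_grain(list(tables), declared="school")
at_school = weighted_avg(tables[grain], "rating", "enrollment")
at_district = weighted_avg(tables["district"], "rating", "enrollment")
```

With this data the school-grain weighted average is 190 / 50 and the district-grain one is 170 / 50, so an undeclared grain would leave the result dependent on an unstated choice.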
8.7 WITH-blocks¶
8.7.1 Structure¶
WITH-blocks let authors define intermediate Frames (Rung 9). Each inner Frame produces a result that subsequent Frames can reference:
WITH
inner_frame_1 AS (
SELECT ... AT ...
),
inner_frame_2 AS (
SELECT ... FROM inner_frame_1 ... AT ...
)
outer_frame_query
The outer Frame is the query's actual output. Inner Frames are intermediate; their content is available within the WITH-block.
8.7.2 Session-local intermediates¶
In Coframe Core, WITH-Frame outputs are session-local:
- They exist for the duration of the query session.
- They are not registered as persistent AC schemas.
- They are not visible across sessions.
- They are not re-queryable through a future session's four-rule filter.
This is a Coframe Core simplification. Persistent Frame-QL outputs — the re-ingestion workflow in which a Frame's lightweight output schema is promoted to a full AC schema — are a Coframe Pro capability.
8.7.3 Inner Frame semantics¶
Inner Frames behave like outer Frames: they may have SELECT, FROM, WHERE, AT, HAVING, ORDER BY, LIMIT. An inner Frame may omit AT when it is conceptually at the same grain as the outer Frame (the framework infers the grain). Inner Frames may reference earlier inner Frames in the same WITH-block, building a query step by step.
8.7.4 Example¶
WITH
region_revenue AS (
SELECT region, SUM(revenue) AS total
AT region
),
region_customer_count AS (
SELECT region, COUNT_DISTINCT(customer) AS customers
AT region
)
SELECT region, total / customers AS revenue_per_customer
FROM region_revenue, region_customer_count
AT region
ORDER BY revenue_per_customer DESC
The inner Frames compute regional totals and customer counts; the outer Frame combines them into a per-region ratio at the shared region grain. (This example assumes a customer AC-dimension extending the retail AC, as the examples in §8.8 do.)
8.8 Examples by rung¶
This section illustrates each Coframe Core-supported rung against the retail AC. The rungs are a pedagogy of increasing capability, not a feature taxonomy the author must learn; most queries combine several.
A note on the examples: several below reference a customer AC-dimension and a customer_name AC-attribute. These are not part of the canonical retail AC of the front matter; they are a minimal extension assumed here to illustrate reads, broadcasts, and distinct counts. Picture the transactions schema additionally carrying a customer column (anchored at {transaction}, a non-grain-role AC-dimension) and a customers reference schema providing customer_name. The structural reasoning shown is unchanged by the extension.
8.8.1 Rung 0: Read¶
Read column values directly, no aggregation.
Each customer with their name.
Each 2026 transaction with its revenue and store.
8.8.2 Rung 2: Broadcast¶
Broadcast a coarser-grain attribute or dimension across finer-grain rows. The resolver applies broadcast automatically when an attribute from a coarser-grain schema is requested at a finer-grain anchor.
Each transaction with its revenue and the region of the transaction's store. The framework resolves region via FD-DAG navigation (transaction → store → region) and broadcasts from the stores schema.
8.8.3 Rung 1: Identity-preserving reduction¶
Aggregate via the metric family's ip_reducer.
Total revenue per region. The framework navigates from revenue's source anchor ({transaction} in the transactions schema) to {region} via the FD-DAG, summing along the way under the revenue family's ip_reducer (SUM, A_block = ∅).
Total revenue per (region, year) — the same identity-preserving reduction at a composite grain.
Semi-additive case¶
Recall that eom_inventory declares two ip_reducers: (SUM, A_block = {time}) and (MAX, A_block = ∅).
-- Total inventory by region within a month: SUM across stores (time not crossed)
SELECT region, month, SUM(eom_inventory) AS total_inventory
AT (region, month)
The resolver selects (SUM, A_block = {time}): the rollup coarsens store → region (geography, not blocked) and stays at month (time not crossed). Valid.
-- Peak inventory by region across a year: MAX in any direction
SELECT region, year, MAX(eom_inventory) AS peak_inventory
AT (region, year)
The resolver selects (MAX, A_block = ∅): MAX rolls up in any direction.
-- Refused: total inventory by region across a year would SUM across time
SELECT region, year, SUM(eom_inventory) AS total_inventory
AT (region, year)
The resolver attempts (SUM, A_block = {time}): the rollup coarsens month → year (time, blocked). Refused with a block-set diagnostic (§8.11). The author either chooses MAX explicitly or stays within a non-time-crossing grain.
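The three cases above reduce to one check: does the rollup cross a dimension family that the chosen reducer's block set names? The Python sketch below models only that check. `BlockedRollup`, `check_rollup`, and the family labels "geography" and "time" as plain strings are illustrative assumptions, not the resolver's data structures.

```python
class BlockedRollup(Exception):
    """Raised when a rollup crosses a dimension family in the block set."""

# eom_inventory's two ip_reducers, each with its block set (A_block).
EOM_INVENTORY = {"SUM": {"time"}, "MAX": set()}

def check_rollup(reducer, ip_reducers, crossed_families):
    """Return the reducer if the rollup is allowed, else raise BlockedRollup."""
    if reducer not in ip_reducers:
        raise KeyError(f"{reducer} is not a declared ip_reducer")
    blocked = ip_reducers[reducer] & set(crossed_families)
    if blocked:
        raise BlockedRollup(f"{reducer} is blocked across {sorted(blocked)}")
    return reducer

# (region, month): store -> region crosses geography only.
ok_sum = check_rollup("SUM", EOM_INVENTORY, {"geography"})

# (region, year) with MAX: crosses geography and time, nothing is blocked.
ok_max = check_rollup("MAX", EOM_INVENTORY, {"geography", "time"})

# (region, year) with SUM: month -> year crosses time, which is blocked.
try:
    check_rollup("SUM", EOM_INVENTORY, {"geography", "time"})
    sum_across_time = "allowed"
except BlockedRollup as exc:
    sum_across_time = str(exc)
```

Keeping the block set on the reducer rather than on the column is what lets the same family be summable across stores yet only MAX-able across time.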
8.8.4 Rung 6: Multi-input expressions¶
Compose mappers across multiple columns within a Frame.
Per region, the ratio of total revenue to total units sold (the division taken at the region output grain).
SELECT region, year,
SUM(revenue) AS revenue,
COUNT_DISTINCT(customer) AS customers,
SUM(revenue) / COUNT_DISTINCT(customer) AS revenue_per_customer
AT (region, year)
Per (region, year): revenue, customer count, and revenue per customer.
8.8.5 Rung 7: Cross-schema reach¶
Queries drawing on multiple schemas, with the four-rule filter determining reachability.
SELECT region, month,
SUM(revenue) AS total_revenue,
SUM(eom_inventory) AS total_inventory
AT (region, month)
revenue comes from transactions (navigated to {region, month}); eom_inventory comes from store_monthly_inventory (navigated store → region, staying at month under the SUM ip_reducer's block set). The framework aligns the two at {region, month} and collects them. Per Multi-Table Invariance (Chapter 9), the result is well-defined regardless of which valid plan the resolver chooses.
8.8.6 Rung 9: WITH-chained frames¶
Session-local intermediate Frames that subsequent Frames reference. See §8.7.4 for a worked example.
8.8.7 What Coframe Core does not support¶
Several rungs and patterns are simplified or out of scope in Coframe Core:
- Window analytics: period-over-period comparison (such as year-over-year), ranking, running totals, percent-of-total across the assembled frame. These reference other rows of the output frame, which is outside Core's frame model. A year-over-year comparison, for example, must be performed in Coframe Pro (when available) or outside Coframe. The motivating query of Chapter 1 §1.1 includes a year-over-year facet precisely to mark this boundary: Core constructs the per-(region, quarter) revenue frame; the cross-year comparison itself is out of Core scope.
- Pivot: reshaping rows into columns. Out of Core scope.
- Holistic-within-self reductions and query-time epoch transitions (constructing new metric families at query time): Coframe Pro territory. In Core, derived columns are declared via lineage in schema.init (Chapter 3), not constructed through query-time transitions.
- Type-changing reductions beyond the catalog's COUNT / COUNT_DISTINCT: Coframe Pro.
- Custom operators, multi-backend queries, and persistent re-ingestion: Coframe Pro.
The framework's posture is to support frame-shaped output correctly and to be explicit about where Core stops, rather than to approximate patterns it cannot construct with guaranteed correctness.
8.9 Disambiguation¶
8.9.1 When disambiguation is needed¶
A Frame-QL query is ambiguous when:
- A bare family-name resolves to multiple cousins (same family-name, different family-roots; Foundations §2.7.5).
- The four-rule filter produces multiple incompatible survivors the framework cannot reconcile.
- A column appears at multiple anchorings across schemas and the AT grain is reachable by multiple non-equivalent paths.
When ambiguous, the framework refuses with a dubious-query diagnostic (§8.11.4 and Chapter 9).
8.9.2 Disambiguation mechanisms¶
The author disambiguates via:
- Qualified references: transactions.revenue constrains the family-name to the column whose family-root is in the transactions schema, isolating one cousin.
- Explicit FROM: FROM transactions, stores restricts the query to specific schemas; cousins outside the FROM list are excluded.
- AT-clause grain anchors: a more specific grain (AT transaction rather than AT region) can make the resolution path unambiguous.
After disambiguation the framework re-resolves; if the disambiguation suffices, the query proceeds.
These mechanisms operate at the query level and pick among structurally distinct candidates (different cousins, different schemas). They do not let the author override the framework's correctness reasoning — a query that is dubious because two cousins genuinely differ is resolved by naming which cousin, not by instructing the framework to merge them.
8.9.3 Cousin disambiguation example¶
Consider an AC where peak_concurrent_users appears as two cousins:
- In a system_metrics_hourly schema, anchored at {server, hour}.
- In a product_analytics_daily schema, anchored at {region, day}, computed from session logs.
Both share the family-name but trace to different family-roots — they are cousins.
A query that references the bare name peak_concurrent_users is refused as dubious:
DUBIOUS: 'peak_concurrent_users' resolves to two family-roots in this AC:
- family-root in system_metrics_hourly (A = {server, hour})
- family-root in product_analytics_daily (A = {region, day})
These are cousins and produce different results. Disambiguate with a
qualified reference (e.g., system_metrics_hourly.peak_concurrent_users)
or an explicit FROM clause.
The author disambiguates:
The query now references one cousin's family; the framework resolves it.
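The refusal and the two disambiguation routes can be modelled as a small name-resolution function. This Python sketch is an assumption-laden illustration, not the binding algorithm of Chapter 9: `FAMILY_ROOTS`, `resolve_family`, and `DubiousQuery` are hypothetical names, and a family-root is represented only by the schema that holds it.

```python
class DubiousQuery(Exception):
    """Raised when a family-name resolves to more than one family-root."""

# family-name -> schemas holding a family-root with that name (cousins).
FAMILY_ROOTS = {
    "peak_concurrent_users": ["system_metrics_hourly", "product_analytics_daily"],
}

def resolve_family(reference, family_roots, from_schemas=None):
    """Bind a bare or schema-qualified reference to exactly one family-root."""
    schema, _, name = reference.rpartition(".")
    candidates = family_roots.get(name, [])
    if schema:                      # qualified reference isolates one cousin
        candidates = [s for s in candidates if s == schema]
    elif from_schemas is not None:  # explicit FROM excludes cousins outside it
        candidates = [s for s in candidates if s in from_schemas]
    if not candidates:
        raise LookupError(f"no family-root found for {reference!r}")
    if len(candidates) > 1:
        raise DubiousQuery(f"{name!r} resolves to family-roots in {candidates}")
    return candidates[0], name

qualified = resolve_family("system_metrics_hourly.peak_concurrent_users", FAMILY_ROOTS)
via_from = resolve_family("peak_concurrent_users", FAMILY_ROOTS,
                          from_schemas=["product_analytics_daily"])
try:
    resolve_family("peak_concurrent_users", FAMILY_ROOTS)
    bare = "resolved"
except DubiousQuery:
    bare = "dubious"
```

Both routes narrow the candidate set; neither merges the cousins, which matches the point that disambiguation names one candidate rather than overriding the framework's reasoning.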
8.10 Semantics summary¶
Operationally, a Frame-QL query proceeds through:
1. Parse: the text is parsed into an AST per the grammar (§8.4).
2. Resolve names: every column reference is bound to an AC family-name or AC-dimension (Foundations §2.7.3).
3. Type-check: every operator and expression is validated per the operator catalog (Chapter 10).
4. Schema selection: the four-rule filter (Chapter 9) selects schemas to serve each column.
5. Plan construction: the framework constructs an execution plan over the selected schemas.
6. Execute: the data-API performs the plan; the Frame is returned.
Steps 4 and 5 are specified in Chapter 9.
8.11 Errors¶
8.11.1 Parse errors¶
Lexical or grammatical errors. The framework reports the position, the expected tokens, and a suggested correction:
Parse error at character 47:
SELECT region, SUM(revenue AT region
^
Expected ')' to close SUM(, but found 'AT'.
8.11.2 Binding errors¶
Failures resolving a family-name against the AC:
8.11.3 Resolution errors¶
Failures during the four-rule filter or schema selection (Chapter 9):
Resolution error: no schema can serve 'revenue' at grain (country) —
schemas containing revenue do not reach the country anchor via the FD-DAG.
8.11.4 Dubious-query errors¶
A query that resolves to multiple non-equivalent plans (§8.9.1):
DUBIOUS: 'revenue' resolves to multiple family-roots:
- revenue with family-root in transactions
- revenue with family-root in regional_summary
These are cousins (different family-roots) and produce different results.
Disambiguate via qualified reference or explicit FROM.
A two-anchor measure whose formation grain is ambiguous and undeclared is also dubious (§8.6.6):
DUBIOUS: WEIGHTED_AVG(rating, enrollment) has no unique weighting grain.
'rating' and 'enrollment' are co-native at both {school} and {district},
and the weighted average differs between them.
Declare the weighting grain, e.g. WEIGHTED_AVG(rating, enrollment @ school).
8.11.5 Block-set errors¶
A query whose requested rollup is blocked by a metric family's ip_reducer block set:
BLOCKED: 'eom_inventory' under SUM is blocked across the 'time' dimension family.
The requested grain (region, year) would SUM inventory across time.
Use MAX(eom_inventory) for a peak, or query at a non-time-crossing grain.
8.11.6 Anchor-locked errors¶
A query requesting a metric family at a grain its ip_reducers cannot reach (an anchor-locked family with no partition-invariant ip_reducer):
ANCHOR-LOCKED: 'avg_basket_size' (family-root operator AVG) has no ip_reducer
and cannot be rolled up to grain (region). Materialize a sibling at the desired
grain, or compute the average inline from its components.
8.11.7 Integrity-condition errors¶
A query whose resolution would depend on an integrity condition that failed in DQ (Chapter 7):
INTEGRITY: this query depends on the FD-edge 'store -> region', which the data
does not attest (DQ Phase 2 failed). Re-run DQ or remove the dependency.
8.11.8 Operator-rule errors¶
Type-checking failures:
8.11.9 Backend errors¶
Errors reported by the backend during execution are passed through with the backend's native message.
8.11.10 Diagnostic conventions¶
Every diagnostic includes the error category, the query position or column where it was detected, a descriptive message, and a suggested remediation when possible.
8.12 Comparison with SQL¶
For engineers familiar with SQL, Frame-QL is similar in spirit but different in mechanics:
| Concept | SQL | Frame-QL |
|---|---|---|
| Unit of querying | tables | an AC (a curated structural object) |
| Joins | explicit (JOIN ... ON ...) | none; the four-rule filter handles cross-schema reach |
| Aggregation | explicit (SUM(c) + GROUP BY) | implicit per metric family ip_reducer and AT grain |
| Cross-grain reasoning | manual | automatic via FD-DAG navigation |
| Column references | physical column names | AC family-names |
| Pre/post filtering | WHERE / HAVING | WHERE / HAVING (same roles) |
| Sub-queries | inline subqueries, CTEs | WITH-blocks |
| Projection | SELECT | SELECT (the Frame's columns) |
| Grain specification | GROUP BY | AT |
| Window functions | yes | no (Coframe Pro / external) |
| Ambiguous joins | silently produces a result | refused as dubious until disambiguated |
| Integrity | author's responsibility | checked before execution; queries on broken integrity fail fast |
| Correctness model | operational (the query as written computes a result) | constructive (the algorithm is built from declared and verified commitments) |
Frame-QL queries are typically shorter than equivalent SQL because the framework handles structural complexity automatically. SQL makes structural decisions explicit; Frame-QL makes analytical intent explicit and lets the framework handle structure. The trade-off is that Frame-QL refuses queries SQL would accept — those whose answer the AC's commitments do not uniquely determine.
8.13 Summary¶
Frame-QL is the declarative query language for Coframe Core. A query describes a desired output Frame (a lightweight output schema of columns at an AT grain) surrounded by accessories that scope and arrange the result (FROM, WHERE, HAVING, ORDER BY, LIMIT). The Frame is the core; the accessories are kept separate, matching how queries live in an author's mind.
The guiding principle throughout is to describe what the output should be, not how to process the data. The framework resolves a Frame-QL query by binding names to AC declarations, selecting schemas via the four-rule filter, navigating the FD-DAG, applying metric family ip_reducers with their block sets, and constructing an execution plan — refusing cleanly (as dubious, blocked, anchor-locked, or integrity-failing) where the AC's commitments do not uniquely determine an answer.
Coframe Core supports a defined subset of Frame-QL's capability surface (Rungs 0, 1, 2, 6, 7, 9). Window analytics, pivot, and other cross-frame patterns are Coframe Pro territory or are performed outside Coframe. The framework supports frame-shaped output correctly rather than approximating patterns it cannot construct with guaranteed correctness.
For how queries are resolved — the four-rule filter, Multi-Table Invariance, schema selection — see Chapter 9.
Chapter 9: Query Resolution¶
How the framework constructs the algorithm that produces a Frame-QL query's output frame from the AC's declared and verified commitments. The resolution pipeline, the four-rule filter, single-schema and cross-schema resolution, Multi-Table Invariance and plan construction, the dubious-query mechanism.
9.1 What this chapter does¶
A Frame-QL query (Chapter 8) describes what output frame is desired. This chapter specifies how the framework constructs the algorithm that produces it.
Resolution is the framework's central operational deliverable. It is what distinguishes Coframe from a semantic-layer translator: the resolver does not lower the query into a fixed pre-defined SQL template; it constructs a query plan from first principles using the AC's structural commitments. Two ACs with the same surface but different commitments may resolve the same Frame-QL query into different plans; an AC with stronger commitments may answer queries another AC refuses.
The chapter is organized as a pipeline specification. §9.2 introduces the resolution problem and the pipeline. §9.3 specifies the four-rule filter, the resolver's core mechanism for evaluating candidate plans. §9.4 specifies single-schema resolution (queries whose data comes from one schema). §9.5 specifies cross-schema resolution (queries spanning multiple schemas). §9.6 specifies Multi-Table Invariance, the property that licenses the resolver to choose among equivalent plans and optimize. §9.7 specifies the dubious-query mechanism. §9.8 specifies error reporting. §9.9 notes considerations for resolver implementers, and §9.10 summarizes the chapter.
The chapter is for readers implementing the resolver, debugging unexpected query behavior, or reasoning about what queries an AC can answer. AC authors who just want to know "will my query work" can skim §9.2 and read the dubious-query section (§9.7).
9.2 The resolution problem¶
9.2.1 What resolution produces¶
Given a Frame-QL query Q and an AC AC with its DQ deliverable, the resolver produces one of:
- A query plan: a sequence of data-API operations (Chapter 6) that, when executed, produces the output frame Q describes. Each operation in the plan is grounded in a verified commitment from the AC.
- A refusal: a structured error indicating that Q cannot be uniquely constructed from AC's commitments. Refusals include the structural reason (cousin ambiguity, anchor-locked family, block-set conflict, etc.) and where possible suggest remediation (disambiguating syntax, AC-level changes, alternative queries).
The resolver is deterministic: given the same (AC, DQ deliverable, Q), it produces the same plan or the same refusal. This is part of the constructive-correctness guarantee — queries that succeed succeed reliably; queries that fail fail informatively.
9.2.2 Resolution pipeline¶
The resolver operates in seven stages:
- Parse: tokenize and parse the Frame-QL query. Output: an abstract syntax tree (AST).
- Symbol resolution: bind identifiers to AC objects — AC-dimensions to their dimension families, metric family names to family-root declarations, operators to catalog entries. Output: a resolved AST with bindings.
- Grain analysis: determine the output frame's grain from the BY clause; verify that each referenced AC-dimension belongs to a declared dimension family. Output: a grain specification.
- Family resolution: for each metric family reference in the SELECT clause, determine which family-root applies, which ip_reducer is selected, which sibling materializations are available. Output: per-metric resolution plan including the chosen ip_reducer and source ColumnSpec.
- Plan construction: build the candidate query plans: combinations of source ColumnSpecs, navigation paths through the FD-DAG, ip_reducer applications, and joins (cross-schema). Output: one or more candidate plans, each scored.
- Plan selection: apply the four-rule filter (§9.3) to candidate plans. If exactly one plan satisfies all four rules, that is the resolved plan. If zero plans satisfy them, refuse. If multiple plans satisfy them and MTI guarantees they are equivalent (§9.6), select one deterministically (§9.9.3). If multiple plans satisfy them and produce non-equivalent results, refuse as dubious.
- Plan execution: hand the resolved plan to the data-API for execution; consume results into the output frame.
The first six stages happen entirely within the framework, without backend access. Only stage 7 invokes the backend.
9.2.3 What "resolution" guarantees¶
When the resolver returns a query plan, the framework commits to: the plan, when executed, produces an output frame consistent with the AC's verified commitments. This is the operational form of the constructive-correctness guarantee from Chapter 1.
Specifically:
- Every navigation in the plan traverses a verified FD-edge (Phase 2 attestation) or a function-derived FD-edge (catalog-attested).
- Every ip_reducer application uses a partition-invariant operator (catalog-attested) with A_block respected.
- Every cross-schema join uses a verified cross-schema value mapping (Phase 2) or shares grain-role columns whose anchor cardinality is verified.
- Every lineage edge implicitly used (when rolling up a sibling) is either operator-attested (Phase 3 passed) or operator-asserted (and the AC's policy permits unverified assertions).
If any of these conditions fail, the resolver refuses rather than constructing a plan.
9.3 The four-rule filter¶
The resolver's central decision mechanism is the four-rule filter. Each candidate plan is evaluated against four structural rules; plans that fail any rule are rejected. The four rules are independent of one another, and each is necessary: a plan must satisfy all four.
9.3.1 Rule 1: Family resolution¶
Every metric family reference in the query must resolve to exactly one family-root in the AC.
For each metric family name m in the query's SELECT clause:
- Find all ColumnSpecs in AC with name = m whose lineage is self-referential (family-roots).
- If zero exist: refused (resolution error); the family is undeclared.
- If exactly one exists: family resolution succeeds.
- If multiple exist (cousins): refused (dubious-query: cousin ambiguity) unless the query explicitly disambiguates.
This rule operationalizes the cousin distinction from Foundations §2.7.5. Multiple cousins with the same name make the family reference ambiguous; the framework refuses rather than picking one.
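The three outcomes of Rule 1 can be sketched in Python. This is a minimal illustration, assuming a hypothetical in-memory `ColumnSpec` whose `lineage` field holds the `(schema, name)` pair it derives from, so that a family-root is a spec whose lineage points at itself. The class, the `resolve_family` function, the outcome strings, and the `monthly_rollup` schema are assumptions for illustration, not part of the Coframe API. The `qualifier` argument is a simplified stand-in for a qualified reference and matches only the family-root's own schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnSpec:
    schema: str
    name: str
    lineage: tuple  # (schema, name) of the column this one derives from

    @property
    def is_family_root(self) -> bool:
        # A family-root's lineage is self-referential.
        return self.lineage == (self.schema, self.name)


def resolve_family(specs, name, qualifier=None):
    """Return ("ok", root), ("undeclared", None) or ("cousin_ambiguity", roots)."""
    roots = [s for s in specs if s.name == name and s.is_family_root]
    if qualifier is not None:
        # Explicit disambiguation by the query author.
        roots = [r for r in roots if r.schema == qualifier]
    if not roots:
        return ("undeclared", None)
    if len(roots) == 1:
        return ("ok", roots[0])
    return ("cousin_ambiguity", roots)


specs = [
    ColumnSpec("transactions", "revenue", ("transactions", "revenue")),
    ColumnSpec("regional_summary", "revenue", ("regional_summary", "revenue")),
    # A sibling, not a root: its lineage points at transactions.revenue.
    ColumnSpec("monthly_rollup", "revenue", ("transactions", "revenue")),
]

bare = resolve_family(specs, "revenue")
qualified = resolve_family(specs, "revenue", qualifier="transactions")
```

In this sketch the bare reference is refused because two family-roots share the name, while the sibling in `monthly_rollup` does not count as a cousin.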
9.3.2 Rule 2: Anchor-set capability¶
For each metric family m referenced in the query, the family must be capable of producing values at the query's grain via some ip_reducer.
Given:
- The family's family-root anchor A_root.
- The query's grain anchor A_target (the set of AC-dimensions in BY).
- The family's set of ip_reducers {(R_1, A_block_1), ..., (R_n, A_block_n)}.
Compute the coarsened dimension set: Δ = the set of AC-dimensions in A_root not present in A_target, plus AC-dimensions in A_target reachable from A_root via the FD-DAG. (For anchors at completely different levels, this is the navigation from A_root to A_target.)
An ip_reducer (R_i, A_block_i) is applicable if:
- The coarsened AC-dimensions are reachable through the FD-DAG (the dimension family hierarchies permit the navigation).
- The dimension families of the coarsened AC-dimensions do not intersect A_block_i.
- R_i is partition-invariant per the operator catalog.
If no ip_reducer is applicable, the family cannot reach the query's grain. Refused (anchor-locked error or block-set conflict).
If exactly one ip_reducer is applicable, that's the chosen ip_reducer.
If multiple are applicable and would produce equivalent results (rare but possible — e.g., when Δ = ∅ and the chosen reducer is irrelevant), select deterministically.
If multiple are applicable and produce non-equivalent results: refused (dubious-query: multi-ip_reducer ambiguity) unless the query explicitly disambiguates.
This rule operationalizes the ip_reducer-with-block-set machinery from Foundations §2.8. It is the rule that handles semi-additive measures: a query asking for inventory across time activates the SUM ip_reducer's A_block = {time}, and the rule refuses or routes the query to MAX.
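The block-set part of the applicability test is a set-intersection check. The sketch below assumes the dimension families of the coarsened AC-dimensions have already been computed and that FD-DAG reachability has been checked separately; `applicable_reducers` and `PARTITION_INVARIANT` are hypothetical names, the latter standing in for the operator catalog.

```python
def applicable_reducers(reducers, coarsened_families, partition_invariant):
    """Keep the (operator, block_set) pairs that may perform this rollup.

    reducers: list of (operator, block_set) pairs, block_set being a set
    of dimension-family names. FD-DAG reachability is assumed checked.
    """
    return [
        (op, block)
        for op, block in reducers
        if op in partition_invariant and not (coarsened_families & block)
    ]


# eom_inventory as described in the text: SUM blocked on time, MAX unblocked.
eom_inventory = [("SUM", frozenset({"time"})), ("MAX", frozenset())]
PARTITION_INVARIANT = {"SUM", "MAX", "MIN"}  # stand-in for the operator catalog

# Rolling stores up to regions coarsens only the geography family.
across_geography = applicable_reducers(eom_inventory, {"geography"}, PARTITION_INVARIANT)
# Rolling months up to years coarsens the time family, which SUM blocks.
across_time = applicable_reducers(eom_inventory, {"time"}, PARTITION_INVARIANT)
```

Across geography both reducers survive, so the query must name one; across time only MAX survives. A family that declared only the blocked SUM would return an empty list, which corresponds to the refusal case.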
9.3.3 Rule 3: Schema selection¶
The plan must select schema(s) that can supply the requested data:
- The schema must contain a sibling of the metric family (a ColumnSpec in the family) at the family-root's anchor or at an intermediate anchor reachable to the query's grain through the chosen ip_reducer.
- The schema's declared scope on each dimension family in the query's grain must be non_degenerate (or compatible with the query's filter).
- The schema's grain integrity (Phase 2 verified) must hold.
If multiple schemas could supply the data, the resolver prefers the schema with the finest sibling anchor (to minimize the rollup distance) and with verified DQ status (operator-attested over operator-asserted).
If no schema can supply the data: refused.
9.3.4 Rule 4: Cross-schema coherence¶
When the plan draws from multiple schemas (for different metric families or for joining metric data with AC-attribute data), the schemas must agree on shared AC-dimension and AC-attribute values.
This is the cross-schema value-mapping condition from Phase 2 (§7.4.1). If Phase 2 verified the relevant cross-schema mappings, this rule is satisfied. If Phase 2 surfaced inconsistencies, the rule's behavior depends on the AC's configuration:
- Under strict mode: refused.
- Under permissive mode: the plan proceeds but the response carries a caveat.
9.3.5 The filter in summary¶
| Rule | What it checks | Failure category |
|---|---|---|
| 1. Family resolution | Each metric family name resolves to exactly one family-root | Cousin ambiguity (refused) |
| 2. Anchor-set capability | The family's ip_reducers can reach the query's grain | Anchor-locked / block-set conflict (refused) |
| 3. Schema selection | A schema can supply the data with verified DQ status | Schema unavailable (refused) |
| 4. Cross-schema coherence | Multi-schema plans honor cross-schema value mappings | Mapping inconsistency (refused or caveated) |
A plan passes the filter iff all four rules hold. The resolver constructs plans that pass; refuses queries with no passing plan; refuses queries with multiple non-equivalent passing plans (dubious).
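As a sketch of how the table maps rule verdicts to failure categories, the following Python assumes each candidate plan has already been reduced to four boolean verdicts, one per rule. The dictionary keys and the `run_filter` function are hypothetical; the failure categories are taken from the table above.

```python
from typing import Optional

# (verdict key, failure category) in rule order; keys are illustrative.
RULES = [
    ("family_resolved", "cousin ambiguity"),
    ("grain_reachable", "anchor-locked / block-set conflict"),
    ("schema_available", "schema unavailable"),
    ("mappings_consistent", "mapping inconsistency"),
]


def run_filter(plan: dict) -> Optional[str]:
    """Return None if the plan passes all four rules, else the first failure category."""
    for key, failure_category in RULES:
        if not plan[key]:
            return failure_category
    return None


passing = {
    "family_resolved": True,
    "grain_reachable": True,
    "schema_available": True,
    "mappings_consistent": True,
}
blocked = {**passing, "grain_reachable": False}
```

Here `run_filter(passing)` returns `None`, and `run_filter(blocked)` returns the Rule 2 category. Under permissive mode a Rule 4 failure would yield a caveat rather than a refusal, which this sketch does not model.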
9.4 Single-schema resolution¶
The simplest resolution case: the query's metrics all come from a single schema, and grain navigation stays within the FD-DAG reachable from that schema's columns.
9.4.1 Construction¶
For a single-schema query:
- Determine the chosen ip_reducer per metric family (Rule 2).
- Identify the schema's sibling for each metric family (Rule 3).
- Determine the navigation path: from each sibling's anchor to the query's grain.
- Build a single aggregate request (per §6.6.1) per metric:
  - schema: the chosen schema.
  - metric_column: the sibling's column name.
  - group_by: the query's grain AC-dimensions.
  - operator: the chosen ip_reducer.
  - filters: translated from the query's WHERE clause.
  - missing_value_treatment: per the operator catalog's (op, M) entry.
- Apply the query's WHERE clause as a filter on the input rows.
- Apply post-grain computations: composed expressions in SELECT, HAVING, ORDER BY, LIMIT. When LIMIT n PER cols is present (§8.5.7), the cap can be pushed into per-group construction so the engine produces at most n rows per group rather than materializing the full frame and discarding the excess.
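The request-building step can be sketched as a plain function that assembles the fields listed above. The field names follow the list; the function name, the tuple encoding of filters, and the `"skip"` treatment value are assumptions for illustration and do not describe the actual data-API wire format.

```python
def build_aggregate_request(schema, sibling_column, grain, reducer,
                            where_filters, missing_value_treatment):
    """Assemble one aggregate request for one metric (illustrative shape only)."""
    return {
        "schema": schema,
        "metric_column": sibling_column,
        "group_by": list(grain),
        "operator": reducer,
        "filters": list(where_filters),
        "missing_value_treatment": missing_value_treatment,
    }


request = build_aggregate_request(
    schema="transactions",
    sibling_column="revenue",
    grain=("region", "quarter"),
    reducer="SUM",
    where_filters=[("year", "=", 2025)],  # hypothetical filter encoding
    missing_value_treatment="skip",       # hypothetical catalog value
)
```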
9.4.2 Example: single-schema query¶
The resolver:
- Resolves revenue to family-root (transactions, revenue) (Rule 1).
- Selects ip_reducer (SUM, A_block = ∅) (Rule 2; only one ip_reducer; no blocks).
- Selects schema transactions (Rule 3; the family-root is there).
- Navigates transaction → region: transactions reference store (an AC-attribute in transactions), and stores have region (via the FD-edge store → region in the geography dimension family). The navigation joins transactions to stores via the data-API.
- Constructs an aggregate_with_join request: primary schema transactions, joining schema stores on the shared AC-dimension store, aggregating transactions.revenue via SUM grouped by stores.region.
(Note: this example is "single-schema" in terms of the metric family — only revenue is requested — but the resolver still joins with stores to access region. The single-schema-vs-cross-schema distinction in this chapter refers to metric sources, not to AC-dimension/AC-attribute lookups.)
9.4.3 Single-schema queries with multiple metrics¶
When multiple metrics come from the same schema:
SELECT region, quarter, SUM(revenue) AS revenue, SUM(cost) AS cost, SUM(units_sold) AS units_sold
AT (region, quarter)
The resolver constructs a single aggregate request with multiple metric columns:
aggregate(
schema: transactions,
metrics: [
{ column: revenue, operator: SUM },
{ column: cost, operator: SUM },
{ column: units_sold, operator: SUM }
],
group_by: [region, quarter],
filters: ...
)
The backend evaluates all metrics in a single pass.
9.4.4 Single-schema with grain-coarsening through hierarchy¶
Queries that coarsen along a dimension family's hierarchy:
The resolver:
- The time dimension family has hierarchy calendar: day → month → quarter → year.
- Transactions' day AC-attribute provides the base level.
- The navigation day → year traverses day → month → quarter → year (or directly if the backend supports YEAR_OF(day) as a function-derived hop).
- The resolver constructs an aggregate grouped by YEAR_OF(transactions.day) (or by the corresponding navigation through the time dimension family's declared structure).
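Finding the hop sequence through a hierarchy is a path search over the FD-DAG. A minimal sketch, assuming the FD-edges are held as an adjacency mapping (the `FD_EDGES` dictionary and `navigation_path` function are hypothetical; the edges are the ones named in this chapter):

```python
from collections import deque

# FD-edges mentioned in this chapter, as child -> [parents].
FD_EDGES = {
    "day": ["month"],
    "month": ["quarter"],
    "quarter": ["year"],
    "store": ["region"],
    "region": ["country"],
}


def navigation_path(edges, start, target):
    """Breadth-first search; returns the hops from start to target, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for nxt in edges.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


time_path = navigation_path(FD_EDGES, "day", "year")
no_path = navigation_path(FD_EDGES, "day", "region")
```

`time_path` is the four-level calendar traversal, and `no_path` is `None` because no FD-edge leads from the time family into geography. A function-derived hop such as YEAR_OF(day) would appear in this model as an extra direct edge from day to year.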
9.5 Cross-schema resolution¶
When metrics come from multiple schemas, the resolver must construct a plan that joins data appropriately.
9.5.1 The cross-schema challenge¶
Cross-schema resolution arises in cases like:
SELECT region, month, SUM(revenue) AS revenue, SUM(eom_inventory) AS eom_inventory
AT (region, month)
revenue is in the transactions schema; eom_inventory is in the store_monthly_inventory schema. The resolver must:
- Resolve revenue via SUM rollup from transactions to {region, month} (navigating transaction → store → region and transaction → day → month).
- Resolve eom_inventory via SUM rollup (under (SUM, A_block = {time})) from store_monthly_inventory to {region, month} (navigating store → region).
- Combine the two at {region, month} grain.
9.5.2 Resolution strategy¶
The resolver:
- For each metric, construct the sub-plan that rolls it up to the query's grain (Rule 2 + Rule 3 per metric).
- Verify cross-schema coherence (Rule 4): the schemas' shared AC-dimensions (here, region and month) must have consistent values per Phase 2.
- Combine the sub-plans: a join at the query's grain produces the multi-metric output frame.
The combined plan typically uses aggregate_with_join (per §6.6.2):
aggregate_with_join(
primary_schema: transactions,
joining_schemas: [
{
schema: store_monthly_inventory,
join_keys: [region, month]
}
],
metric_columns: [
{ schema: transactions, column: revenue, operator: SUM },
{ schema: store_monthly_inventory, column: eom_inventory, operator: SUM }
],
group_by: [region, month],
filters: ...
)
The backend executes the join and aggregation in a single optimized pass.
9.5.3 The MTI guarantee for cross-schema¶
Cross-schema resolution depends on a key claim: rolling each metric to the query's grain independently and joining at the grain produces the same result as a (hypothetical) join at the family-root grains followed by joint rollup.
This is Multi-Table Invariance (MTI) (§9.6). Without MTI, cross-schema resolution could be path-dependent — different plans yielding different results. With MTI (and its underlying assumptions of partition-invariance, declared scope, and verified cross-schema coherence), cross-schema resolution is well-defined.
9.5.4 Multi-step navigation through hierarchies¶
When the query's grain is far from any sibling's anchor, the resolver may traverse multiple FD-edges:
The resolver navigates from transactions (at {transaction}) through:
- transaction → store → region → country (geography hierarchy).
- transaction → day → month → quarter → year (time hierarchy).
The navigation is composed; the resolver constructs a single aggregate grouping by the deepest navigation hops.
9.5.5 When no single schema can supply data at the family-root grain¶
Some queries reference metrics whose family-root data isn't available at the query's filter granularity:
SELECT product_category, SUM(eom_inventory) AS eom_inventory
AT product_category
HAVING SUM(revenue) > 1000000
The HAVING condition SUM(revenue) > 1000000 filters on the post-grain metric revenue. The inventory rollup needs to honor this filter.
The resolver attempts to construct a multi-stage plan:
- Compute revenue per product_category from transactions.
- Filter to product_category values where revenue > 1000000.
- Compute eom_inventory for those product categories from store_monthly_inventory. This step cannot be constructed: the inventory schema is degenerate on product, so it cannot supply data filtered by product_category. The resolver detects the conflict (Rule 3) and refuses the query.
The error message identifies the structural reason and suggests remediation (the AC author would need to extend the inventory schema to be non-degenerate on product, which is a data-modeling change).
9.6 Multi-Table Invariance and plan construction¶
Multi-Table Invariance (MTI) is the property that licenses the resolver to choose among equivalent construction paths — and therefore to optimize. It is not, primarily, a proof that a correct solution exists: a correct solution always exists trivially, by navigating from the family-root (the finest-grained materialization) and reducing up to the target anchor. That path is always available and always correct by partition-invariance. What MTI provides is the guarantee that this is not the only correct path — that any verified pre-aggregated sibling, or any cross-schema combination the four-rule filter accepts, yields the same result — and so the resolver may serve a query from whichever path is cheapest.
This is the soundness condition for serving a query from a materialized rollup instead of from detail: the established problem of answering queries using materialized views. MTI is what makes pre-aggregated siblings usable as a trustworthy acceleration layer (a cache or a dynamically-managed cube): the resolver may substitute a coarser sibling for the detail precisely because MTI guarantees the substitution does not change the answer, and attestation (Chapter 7 Phase 3) verifies the premise that the sibling equals the rolled-up detail. MTI licenses the substitution; attestation certifies it is safe.
9.6.1 The invariance¶
Given an AC whose declared structural commitments are verified (Phases 1, 2, 3 of DQ have passed for the relevant subset), and given a Frame-QL query Q:
MTI. Any two query plans P_1 and P_2 that the four-rule filter accepts as resolving Q produce the same output frame, modulo:
- Floating-point arithmetic differences within declared epsilons.
- Sketch-type approximation differences within their declared confidence bounds.
- Ordering of rows within the output frame (when no ORDER BY is specified).
The invariance follows directly from its preconditions — partition-invariance of the reducers, verified FD-navigation, and cross-schema value consistency — rather than being a deep result in its own right; the preconditions do the work. Its value is operational: because accepted plans are interchangeable, the resolver's plan selection (§9.9.3) is a cost-based choice over an equivalence class, free to pick the cheapest plan without affecting correctness.
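A small numeric illustration of the invariance, for SUM only and with hypothetical rows: one plan reduces the detail straight to region, the other reduces to a per-store sibling first and then rolls that sibling up. Integer values are used so that no epsilon is involved. The result depends on the store → region FD holding in the data (each store maps to one region), which is the kind of premise Phase 2 verifies.

```python
from collections import defaultdict

# (store, region, revenue): hypothetical detail rows.
detail = [
    ("s1", "east", 100), ("s1", "east", 50),
    ("s2", "east", 70),
    ("s3", "west", 30), ("s3", "west", 20),
]


def rollup(rows, key_index, value_index):
    """SUM the value column grouped by the key column."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key_index]] += row[value_index]
    return dict(totals)


# Plan 1: reduce the detail straight to region.
from_detail = rollup(detail, key_index=1, value_index=2)

# Plan 2: reduce to a per-store sibling, then roll the sibling up to region.
store_region = {store: region for store, region, _ in detail}
per_store = rollup(detail, key_index=0, value_index=2)
sibling_rows = [(store, store_region[store], total) for store, total in per_store.items()]
from_sibling = rollup(sibling_rows, key_index=1, value_index=2)
```

Both plans yield 220 for east and 50 for west. If a store appeared under two regions, `store_region` would silently keep only one of them and the two plans could diverge, which is why the invariance is stated only for verified commitments.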
9.6.2 What MTI rests on¶
MTI relies on a chain of structural commitments. Each link is independently necessary:
- Partition-invariance of ip_reducers (catalog-attested per Foundations §2.6.2): the operator's algebra guarantees commutativity with arbitrary partitioning. Without this, different plans aggregating along different paths could produce different results.
- Block-set well-formedness (Foundations §2.8.3, §2.10.3): the ip_reducer's A_block is downward-closed along the FD-DAG within each blocked dimension family. This ensures that block-set-aware plans cannot legitimately disagree on which dimensions are safe to roll up.
- FD-DAG attestation (Phase 2): declared FD-edges hold in data. Without this, navigation through the FD-DAG could produce different results depending on the navigation path.
- Cross-schema value mapping consistency (Phase 2): same-named AC-dimensions and AC-attributes have consistent values across schemas. Without this, cross-schema joins could disagree on the join keys.
- Per-lineage-edge attestation (Phase 3, when applicable): pre-aggregated siblings agree with rolled-up detail data. Without this, plans using different siblings as sources could produce different results.
- Principle 1 (asserted): every column's value depends on its declared anchor and nothing else. Without this, queries could produce values dependent on unmodeled state.
- Principle 2 (asserted): all schemas observe the same underlying entities. Without this, cross-schema reasoning would be unsound.
When any link fails, MTI's guarantee weakens. The DQ deliverable surfaces which links are verified, which are asserted-not-verified, and which have failed. The framework's reasoning over MTI is conditional on the DQ status of each link.
9.6.3 What MTI does not guarantee¶
MTI is a structural statement; it does not guarantee:
- Performance equivalence: different plans may have different execution costs. The resolver's plan selection optimizes for cost when multiple plans pass the filter (§9.9.3 describes the selection heuristics).
- Equivalence under failed attestation: when Phase 3 attestation fails, MTI's underlying chain is broken. The framework either refuses such queries under strict mode or proceeds with caveats under permissive mode; in the latter case, the answer's correctness depends on the failed attestation's significance, which the framework cannot determine automatically.
- Equivalence for sketch-type metrics across epsilon thresholds: different plans using sketch types may produce slightly different approximate values within declared confidence bounds. The differences are bounded but non-zero.
- Equivalence with non-determinism in the backend: if the backend executes operations non-deterministically (rare), plans may produce different orderings or marginal arithmetic differences. The framework's epsilon thresholds accommodate this.
9.6.4 MTI and the dubious-query mechanism¶
MTI's guarantee is what makes the dubious-query mechanism meaningful. When the resolver refuses a query as dubious, it is saying: multiple plans pass the filter, but MTI does not guarantee they produce equivalent results, because the AC's commitments do not constrain which result is "correct."
This is most visible in:
- Cousin ambiguity: two family-roots share a name but have no lineage relationship; MTI does not apply across cousins.
- Multi-ip_reducer ambiguity: two ip_reducers on the same family produce non-equivalent results (peak vs. total); the query must commit to one.
- Hierarchy ambiguity: rare; resolved by the AC-dimension's hierarchy membership in most cases.
The framework's commitment is: refuse rather than choose. The query author either disambiguates explicitly or accepts the structural reason the AC cannot answer the query.
9.7 The dubious-query mechanism¶
Dubious-query handling is one of the framework's most distinctive features. This section specifies its full machinery.
9.7.1 Definition¶
A Frame-QL query is dubious relative to an AC iff:
- The query passes the four-rule filter for some candidate plan.
- More than one structurally-valid plan exists.
- The plans do not produce equivalent results under MTI (or the framework cannot establish equivalence).
The framework refuses dubious queries rather than picking among the candidate plans.
9.7.2 Sources of dubiosity¶
Five major sources:
9.7.2.1 Cousin ambiguity¶
The query references a metric family name that resolves to multiple family-roots (per Rule 1's multiple-family-roots case).
Example: two revenue family-roots exist in the AC — one rooted at transactions.revenue (the proper observation root), another rooted at regional_summary.revenue (declared as its own family-root). A query selecting a bare revenue cannot determine which.
Remediation: the author disambiguates by naming which cousin they mean — via a qualified reference (transactions.revenue) or an explicit FROM clause restricting the query to the schemas containing the intended sibling group (Chapter 8 §8.9). This is consistent with the framework's principle of describing what you want: the author is specifying which observation they mean, not instructing the framework to merge or relate the cousins. The framework does not reason about the relationship between cousins (they are independent observations); it simply lets the author select one. Alternatively, if the two were meant to be the same metric, the AC author corrects the declaration to make one a sibling of the other (declaring its lineage to point at the proper family-root) — a structural fix at the AC level.
9.7.2.2 Multi-ip_reducer ambiguity¶
The query references a metric family with multiple ip_reducers, more than one of which is applicable to the requested rollup, and the applicable ip_reducers produce non-equivalent results.
Example: SELECT region, eom_inventory AT (region, year). The SUM ip_reducer with A_block = {time} is blocked (year is in time); the MAX ip_reducer with no block is applicable, but the query did not request MAX explicitly. Strictly speaking, this case is not dubious: only MAX is applicable, so the resolver could in principle select it. Whether the framework still requires explicit disambiguation when only one ip_reducer applies is a configuration choice. In strict mode the framework requires explicit selection whenever multiple ip_reducers are declared, even if only one applies; in permissive mode the only applicable ip_reducer is auto-selected.
The genuinely dubious case is when multiple ip_reducers are applicable and produce different results. Example (hypothetical): a metric family declares both (SUM, A_block = ∅) and (MAX, A_block = ∅) — both applicable everywhere, but they yield different values. The query must select.
Disambiguation: the query explicitly applies the reducer (SUM(eom_inventory) or MAX(eom_inventory) in the SELECT clause).
9.7.2.3 Schema selection ambiguity¶
The query could be answered from multiple schemas whose data does not agree (Phase 2 cross-schema value-mapping failed or has not been verified).
Example: revenue is materialized at {region, month} in two different schemas; both could supply the data for AT (region, month). If Phase 2 cross-schema value mapping is verified, the resolver can select either (they agree by MTI). If Phase 2 surfaced inconsistencies, the resolver refuses as dubious.
Disambiguation: the AC author fixes the underlying data or declares one as authoritative.
9.7.2.4 Hierarchy ambiguity (rare)¶
The query's BY clause references an AC-dimension whose hierarchy is ambiguous within its dimension family.
This is rare because most AC-dimensions belong to exactly one hierarchy. It arises when an AC-dimension is shared across hierarchies and the query's other AC-dimensions don't disambiguate.
Disambiguation: the BY ... USING <hierarchy-name> syntax (Chapter 8 §8.5.4).
9.7.2.5 Operator-asserted family with failed verification¶
A family declared as operator-asserted (no in-AC sibling for Phase 3 verification) is used in a query, and the AC's policy on operator-asserted families is strict (requires verified).
Example: eom_inventory is operator-asserted under SUM (no daily-level sibling). A query under strict policy would refuse; under permissive policy, it proceeds with a caveat in the response.
Disambiguation: relax the policy, materialize a finer-grained sibling for verification, or accept the asserted commitment with caveats.
9.7.3 Refusal reporting¶
When the resolver refuses a query as dubious, the error response includes:
- The source of dubiosity (one of §9.7.2's categories).
- Identification of the conflict: the multiple plans considered, the structural reason they differ, the AC commitments that should but do not uniquely determine the answer.
- Suggested remediations:
  - Query-level: syntactic disambiguation the query author can apply.
  - AC-level: structural changes the AC author could make.
  - Configuration-level: AC settings (strict vs. permissive policies) that affect dubiosity handling.
The error is structured (machine-readable) and human-readable. AI agents accessing the framework through any AC Surface — the MCP Surface historically, also the Frame-QL and HTTP API Surfaces in the v2.1 architecture (see Glossary for AC Surfaces) — consume the structured error and propose corrected queries. The error format is part of the Surface conformance contract.
9.7.4 Dubiosity vs. ambiguity¶
Frame-QL distinguishes dubiosity (structural refusal due to non-equivalent plans) from ambiguity (resolved by deterministic selection when MTI guarantees equivalence).
When MTI guarantees plan equivalence, the resolver selects deterministically (typically the cheapest plan, or the plan with the most-verified sources). The choice does not change the result; it only changes performance characteristics. This is ambiguity and it is auto-resolved.
When MTI does not guarantee equivalence (the structural commitments don't establish that all candidate plans produce the same result), the resolver refuses. This is dubiosity.
The distinction is the framework's epistemic line: where the commitments determine the answer, the resolver acts; where they don't, the resolver refuses.
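The line between ambiguity and dubiosity can be sketched as a single decision function. The sketch assumes the plans passed in have already cleared the four-rule filter and that a separate step has established whether MTI guarantees their equivalence; `select_plan`, the outcome strings, and the `cost` and `name` fields are hypothetical.

```python
def select_plan(passing_plans, equivalent_under_mti):
    """Classify the outcome for the plans that passed the four-rule filter."""
    if not passing_plans:
        return ("refused", None)
    if len(passing_plans) == 1:
        return ("resolved", passing_plans[0])
    if equivalent_under_mti:
        # Ambiguity: plans are interchangeable, so pick the cheapest.
        # Ties break on name to keep the choice deterministic.
        return ("resolved", min(passing_plans, key=lambda p: (p["cost"], p["name"])))
    # Dubiosity: the commitments do not determine which result is correct.
    return ("dubious", None)


plans = [{"name": "from_detail", "cost": 900}, {"name": "from_sibling", "cost": 40}]
auto_resolved = select_plan(plans, equivalent_under_mti=True)
refused_as_dubious = select_plan(plans, equivalent_under_mti=False)
```

The same two plans resolve to the cheaper one when equivalence is established and are refused when it is not, which matches the distinction drawn above.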
9.8 Error reporting and remediation¶
9.8.1 Error categories¶
The resolver produces errors in these categories (matching Chapter 8's Frame-QL error categories at the language level):
| Category | Source | Recovery |
|---|---|---|
| Parse error | Malformed Frame-QL | Fix the query syntax |
| Resolution error | Unknown identifier | Verify AC declarations; check spelling |
| Type error | Operator-operand type mismatch | Adjust expression or operator choice |
| Family-resolution error | Cousin ambiguity | Qualified reference or FROM; or fix AC declaration |
| Anchor-set-capability error | Family cannot reach grain | Use different grain; declare new family member |
| Schema-selection error | No schema can supply | Add schema or extend declared scope |
| Cross-schema-coherence error | Phase 2 inconsistency | Fix data or AC; verify Phase 2 |
| Multi-ip_reducer dubious | Multiple ip_reducers applicable | Explicit reducer selection in query |
| Verification-status error | Operator-asserted or unverified | Relax policy or extend verification |
| Backend execution error | Backend operational failure | Investigate backend |
9.8.2 Error structure¶
Each resolver error includes:
- An error code (machine-readable).
- A human-readable message.
- Identification of the structural element involved (which family, which schema, which FD-edge).
- A suggested remediation per the table above.
- Where applicable, references to the AC's DQ deliverable for context.
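One possible shape for such an error, sketched as a Python dataclass serialized to JSON. The five fields mirror the list above; the field names, the error code string, and the message wording are assumptions, since the actual format is defined by the Surface conformance contract.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class ResolverError:
    code: str                           # machine-readable error code
    message: str                        # human-readable message
    element: dict                       # structural element involved
    remediation: str                    # suggested remediation
    dq_reference: Optional[str] = None  # pointer into the DQ deliverable, if any


err = ResolverError(
    code="FAMILY_RESOLUTION_COUSIN_AMBIGUITY",  # hypothetical code
    message="Metric family 'revenue' resolves to more than one family-root.",
    element={
        "family": "revenue",
        "roots": ["transactions.revenue", "regional_summary.revenue"],
    },
    remediation="Qualified reference or FROM; or fix AC declaration",
)

payload = json.dumps(asdict(err), sort_keys=True)
```

A consumer such as an AI agent could branch on `code` and present `remediation` to the query author, while `message` serves human readers.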
9.8.3 Resolver as a teaching tool¶
The resolver's refusals are not just errors; they are informative about the AC's structure. A query refused for cousin ambiguity tells the AC author that their family declarations have a structural issue. A query refused on anchor-set capability tells the query author that the family is structurally limited.
This is the framework's pedagogical aspect: through refusals, the resolver communicates what the AC can and cannot do. Over time, AC authors and query authors develop accurate mental models of the AC's analytical surface from the resolver's responses.
9.9 Resolver implementation considerations¶
This section is brief; it identifies considerations for implementers of the resolver. Full implementation guidance is in the framework's source documentation.
9.9.1 Statelessness¶
The resolver is stateless across queries. Each query is resolved independently. The AC's compiled metadata (from the DQ deliverable) is loaded once and reused; the resolver does not maintain query-specific state between invocations.
9.9.2 Plan-space exploration¶
The resolver enumerates candidate plans along two axes:
- For each metric family: candidate sibling sources (Rule 3).
- For each navigation: candidate hierarchies (Rule 2 when hierarchies are alternative).
For typical queries the candidate space is small (often a single plan). For complex queries (multi-metric, multi-schema, multiple hierarchies), the space may have tens of candidates. The resolver evaluates the four-rule filter on each.
9.9.3 Plan selection heuristics¶
When multiple plans pass the filter and MTI guarantees equivalence, the resolver selects by:
- Most-verified sources (operator-attested over operator-asserted).
- Finest sibling anchor (minimizes rollup distance).
- Single-schema over cross-schema (reduces join cost).
- Plan with lowest estimated row count (cost-based).
These are heuristics; their relative weighting is configurable per the framework's deployment.
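One way to combine the four heuristics is a strict lexicographic ordering, shown below. This is only one possible weighting, chosen here for illustration; the plan fields are hypothetical, and `rollup_distance` is a numeric stand-in for the sibling-anchor criterion (lower is preferred).

```python
def selection_key(plan):
    """Lexicographic preference: earlier tuple positions dominate later ones."""
    return (
        0 if plan["attested"] else 1,  # most-verified sources first
        plan["rollup_distance"],       # smaller rollup distance
        plan["schema_count"],          # single-schema before cross-schema
        plan["estimated_rows"],        # lowest estimated row count
    )


candidates = [
    {"name": "plan_a", "attested": True, "rollup_distance": 3,
     "schema_count": 1, "estimated_rows": 1_000_000},
    {"name": "plan_b", "attested": True, "rollup_distance": 1,
     "schema_count": 1, "estimated_rows": 5_000},
    {"name": "plan_c", "attested": False, "rollup_distance": 0,
     "schema_count": 1, "estimated_rows": 200},
]

best = min(candidates, key=selection_key)
```

Under this ordering `plan_b` wins: `plan_c` is cheaper but rests on an unattested source, and `plan_a` has a longer rollup. Because MTI guarantees the candidates are equivalent, a different weighting would change cost, not the result.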
9.9.4 Caching¶
The resolver may cache compiled plans for frequently-issued queries. Cache invalidation triggers:
- AC declaration changes.
- DQ deliverable updates.
- Operator catalog version changes.
Plan caching is an optimization; the framework's correctness is independent of caching behavior.
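One way to honor all three invalidation triggers without explicit eviction is to fold them into the cache key, so that a change to any of them simply produces a miss. The following is a sketch under that assumption; `plan_cache_key`, `PlanCache`, and the version strings other than the catalog identifier are hypothetical.

```python
import hashlib
import json

def plan_cache_key(query_text, ac_declaration_hash, dq_deliverable_version, catalog_version):
    # Any change to the AC declarations, the DQ deliverable, or the catalog
    # version changes the key, so stale plans are never looked up.
    payload = json.dumps([query_text, ac_declaration_hash,
                          dq_deliverable_version, catalog_version])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

class PlanCache:
    def __init__(self):
        self._plans = {}

    def get_or_compile(self, key, compile_plan):
        # compile_plan is a zero-argument callable invoked only on a miss.
        if key not in self._plans:
            self._plans[key] = compile_plan()
        return self._plans[key]

calls = []
def compile_plan():
    calls.append(1)
    return "compiled-plan"

cache = PlanCache()
key = plan_cache_key("example query text", "ac-hash-1", "dq-7", "coframe-core-catalog/1.0")
cache.get_or_compile(key, compile_plan)
cache.get_or_compile(key, compile_plan)
print(len(calls))  # 1: the second lookup is served from the cache
```

Since the cached value is only ever a compiled plan, discarding the whole cache at any time leaves results unchanged, which is consistent with correctness being independent of caching.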
9.9.5 Tracing¶
The resolver supports tracing: every decision in stages 1–6 of the pipeline (§9.2.2) can be logged. Traces include:
- Identifiers resolved.
- Plans considered.
- The four-rule filter's verdict per plan.
- The selected plan.
- The data-API operations generated.
Tracing is essential for debugging unexpected refusals and for understanding the resolver's behavior in development. Tracing is opt-in (it adds overhead).
9.10 Summary¶
The resolver is the framework's central operational deliverable: the component that takes a Frame-QL query against an AC and constructs the algorithm to produce the output frame.
The resolver operates in seven stages: parse, symbol resolution, grain analysis, family resolution, plan construction, plan selection, plan execution. Stage 6 (plan selection) applies the four-rule filter: family resolution, anchor-set capability, schema selection, cross-schema coherence. Plans passing the filter are candidates; plans failing any rule are rejected.
Multi-Table Invariance guarantees that candidate plans produce equivalent results when the AC's commitments are verified, which is what lets the resolver choose the cheapest plan. When MTI's underlying chain is broken (verification failed, opt-out, asserted-not-verified), the framework reports caveats; when multiple plans pass the filter but their results are not guaranteed equivalent, the framework refuses as dubious.
The dubious-query mechanism is the framework's epistemic line: where the AC's commitments uniquely determine the answer, the resolver acts; where they don't, it refuses. Sources of dubiosity include cousin ambiguity, multi-ip_reducer ambiguity, schema selection ambiguity, hierarchy ambiguity, and operator-asserted families under strict policy.
The resolver's refusals are informative: each refusal identifies the structural reason and suggests remediation, supporting iterative refinement of both queries and AC declarations.
Resolution is the framework's deliverable. Foundations specifies the primitives; Chapter 3 specifies how to declare them; Chapter 7 specifies how to verify them. Chapter 9 specifies what they enable: a constructive algorithm for answering rich analytical queries against curated collections of data, with guaranteed correctness derived from the AC's declared and verified commitments.
This completes the framework's core operational specification. Chapter 10 specifies the operator catalog that the resolver and the data-API protocol both consult; Chapter 11 specifies the optional Metric Engine that sits between the resolver and the backend, memoising per-(family, anchor) aggregates and serving them transparently when enabled.
Chapter 10: The Operator Catalog¶
This chapter gives the full specification of operators in Coframe Core: each operator's type, partition-invariance classification, identity-preservation behavior, type signature, missing-value treatment, and naming-function entry. The catalog is the source of truth that ColumnSpec op references and that the resolver consults during plan construction.
10.1 What this chapter does¶
Every operator a Coframe AC uses — SUM, MAX, HLL_MERGE, MAP_DIV, MONTH_OF, OBSERVED — is specified in the operator catalog. This chapter is the catalog: a reference enumerating each operator with its structural properties.
The framework's catalog is versioned. The version specified in this chapter is coframe-core-catalog/1.0. ACs may declare the catalog version they target via ac_metadata; the framework resolves operator references against the declared version. Catalog updates produce new versions; existing ACs continue to bind to the version they were authored against.
This chapter is reference-style. Readers consult it when:
- Authoring an AC and choosing the op for a ColumnSpec.
- Implementing a custom backend protocol adapter (per Chapter 6 §6.7).
- Debugging an unexpected resolver decision involving operator behavior.
- Verifying that a query's operators are catalog-defined.
The chapter is organized by operator class. §§10.2–10.3 specify the catalog entry format and conventions. §§10.4–10.6 enumerate reducers by partition-invariance classification. §10.7 enumerates functions. §10.8 enumerates multi-input operators and the registered ratio/count operators (and explains why intensive ratios are computed rather than stored). §10.9 explains why windowed operators are out of Core scope. §10.10 specifies OBSERVED. §10.11 specifies naming-function entries. §10.12 specifies missing-value treatment. §10.13 collects cross-cutting notes.
10.2 Catalog entry format¶
Each operator catalog entry is a structured record with the following fields:
operator:
  name: <string>                       # canonical operator name
  display_name: <string>               # human-friendly name
  description: <string>                # what the operator does
  type: reducer | function | multi_input | observed
  partition_invariance:
    classification: natively_monoidal | liftably_monoidal | holistic | not_applicable
    monoid:                            # present iff natively_monoidal
      value_space: <type>
      identity: <value>
      associative: true
      commutative: true
    lift:                              # present iff liftably_monoidal
      enriched_value_space: <type>
      lift_operator: <name>            # operator producing enriched values
      final_projection: <name>         # operator extracting final result
    explanation: <string>              # why classified this way
  identity_preservation:
    default: preserved | not_preserved
    overrides:                         # per-family-name overrides if any
      - family: <name>
        behavior: preserved | not_preserved
  type_signature:
    inputs: [<type>, ...]
    output: <type>
  anchor_relation:
    type: subset_or_equal | equal | unrelated   # A_pred relation to A_self
    explanation: <string>
  missing_value_treatment:
    # the selected behavior is gated by the M-vs-Δ relationship at query time (§10.12);
    # this field declares the operator's per-category defaults and its skip-unbiased predicate
    - input_M_category: MCAR | MAR | MNAR       # derived label for the input's missingness anchor M
      skip_unbiased_when: <predicate over M and Δ, accounting for extensivity>
      behavior: exclude | propagate | include_as_zero | propagate_null | per_operator_rule | impute   # impute is Pro
      output_M: <derived missingness anchor>
  naming_function_default: <template-string>
  canonical_sql: <string>              # backend-agnostic SQL form
  backend_overrides:                   # per-backend syntax differences
    snowflake: <string>
    bigquery: <string>
    postgres: <string>
    ...
  notes: <string>                      # commentary, references
Each operator has exactly one catalog entry. The entry is the framework's complete specification for that operator.
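The two "present iff" constraints in the entry format lend themselves to a mechanical check. The sketch below assumes an entry has already been parsed into a Python dictionary with the field names shown above; the function name `check_partition_invariance` is hypothetical and the check covers only those two constraints, not the full entry format.

```python
def check_partition_invariance(entry):
    """Return a list of violations of the 'present iff' rules for one entry."""
    pi = entry.get("partition_invariance", {})
    classification = pi.get("classification")
    problems = []
    # monoid block: present exactly when the classification is natively_monoidal
    if ("monoid" in pi) != (classification == "natively_monoidal"):
        problems.append("monoid must be present iff classification is natively_monoidal")
    # lift block: present exactly when the classification is liftably_monoidal
    if ("lift" in pi) != (classification == "liftably_monoidal"):
        problems.append("lift must be present iff classification is liftably_monoidal")
    return problems

sum_entry = {
    "name": "SUM",
    "type": "reducer",
    "partition_invariance": {
        "classification": "natively_monoidal",
        "monoid": {"value_space": "numeric", "identity": 0},
    },
}
broken_entry = {
    "name": "BROKEN",  # hypothetical operator, for illustration only
    "type": "reducer",
    "partition_invariance": {"classification": "holistic", "monoid": {"identity": 0}},
}
print(check_partition_invariance(sum_entry))     # []
print(check_partition_invariance(broken_entry))  # one violation
```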
10.2.1 Field semantics¶
name: the canonical identifier used in ColumnSpec op fields, in Frame-QL expressions, and in metric family ip_reducers declarations. All-caps by convention (SUM, HLL_MERGE).
type: the operator's classification:
- reducer: aggregates values across the predecessor's anchor, producing a successor at a coarser anchor (A_pred ⊇ A_self).
- function: transforms values without aggregation (A_pred = A_self).
- multi_input: takes multiple inputs at a common anchor and produces a singleton output.
- observed: marks a column as directly-observed data with no derivation.
partition_invariance.classification: per Foundations §2.6.4:
- natively_monoidal: the operator's (value space, operator, identity) triple forms a commutative monoid. SUM, MAX, HLL_MERGE.
- liftably_monoidal: not partition-invariant in its natural value space but becomes partition-invariant in an enriched space with a final projection. AVG (via SUM/COUNT), STDDEV (via SUM/SUM_OF_SQUARES/COUNT).
- holistic: not partition-invariant under any finite enrichment. Exact MEDIAN, exact COUNT_DISTINCT, MODE, exact percentile.
- not_applicable: for non-reducer operators where partition-invariance is not the relevant axis (functions, multi-inputs, observed).
identity_preservation.default: whether applying this operator to a family-named column produces output that the AC's naming function maps back to the same family-name (preserved) or to a different family-name (not_preserved). SUM is preserved for most numeric families; MAX is not_preserved (output is conventionally "peak_X").
anchor_relation: per Foundations §2.6.1:
- subset_or_equal (A_pred ⊇ A_self): reducer behavior.
- equal (A_pred = A_self): function behavior.
- unrelated: for the observed operator, where the predecessor relation is special (self-referential).
missing_value_treatment: per Foundations §2.4.3, specifies how the operator handles values whose M declaration indicates missingness. Critical for ensuring rollups are consistent with missingness commitments.
naming_function_default: a template string the default naming function uses to compute the output column's name given the predecessor's name. {name_pred} is the substitution token.
canonical_sql: the backend-agnostic SQL form the data-API generates when this operator is invoked. Backend implementations may override.
backend_overrides: the catalog's per-backend defaults for substrate-specific syntax. These seed the installation's L2 operator_registry (v2.1 supplement §3.6) at backend-bind time. At runtime, the effective registry is L2 ⊕ L3 (per-AC operator_overrides); the catalog's backend_overrides is the L1 → L2 seed, not the runtime source of truth. See §10.13.4.
10.3 Conventions and notation¶
10.3.1 Value type abbreviations¶
In type signatures, the following abbreviations are used:
- numeric includes both integer and floating-point.
- ord indicates any totally-ordered type (numeric, date, timestamp, string).
- comparable indicates any type supporting equality comparison.
- any indicates any canonical type.
10.3.2 Template strings¶
Template strings for naming-function entries use {name_pred} as a token for the predecessor's family-name. Other tokens supported in some entries:
- {op}: the operator name in lowercase.
- {A_pred}: the predecessor's anchor (rare).
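For single-input operators, the template tokens map directly onto Python's `str.format` placeholders, which gives a compact way to see what a default naming function produces. This is a sketch only: `apply_naming_template` is a hypothetical helper, and it does not handle the indexed multi-input form or operator-specific tokens.

```python
def apply_naming_template(template, name_pred, op):
    # {name_pred} is the predecessor's family-name; {op} is the operator name in lowercase.
    return template.format(name_pred=name_pred, op=op.lower())

print(apply_naming_template("{name_pred}", "revenue", "SUM"))           # revenue
print(apply_naming_template("peak_{name_pred}", "temperature", "MAX"))  # peak_temperature
print(apply_naming_template("{name_pred}_count", "order_id", "COUNT"))  # order_id_count
print(apply_naming_template("{op}_{name_pred}", "price", "ABS"))        # abs_price
```

A template equal to "{name_pred}" leaves the name unchanged, which is how identity-preserving operators such as SUM keep the family-name.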
10.3.3 SQL placeholders¶
In canonical_sql and backend_overrides entries, angle-bracket tokens are placeholders the data-API substitutes at SQL-generation time:
- <col>: the backend column reference for the operator's input.
- <col_a>, <col_b>, <col_c>: for multi-input operators, the inputs in order.
- <width>, <start>, <length>, <replacement>, etc.: operator-specific parameters.
These are not Coframe variables; they mark where the data-API inserts the concrete column references and parameters when constructing a backend query.
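The substitution step can be sketched with a regular expression over the angle-bracket tokens. The example below is an assumption-laden illustration: `render_sql` is a hypothetical helper, it reads `backend_overrides` straight from the catalog entry (ignoring the registry layering described in §10.2.1), and the column references are made up. The SQL strings are taken from the MONTH_OF and MAP_DIV entries in this chapter.

```python
import re

def render_sql(entry, backend, bindings):
    """Pick the backend-specific template if one exists, then fill placeholders."""
    template = entry.get("backend_overrides", {}).get(backend, entry["canonical_sql"])

    def substitute(match):
        # Raises KeyError when a placeholder has no binding, rather than
        # emitting SQL with an unresolved token.
        return bindings[match.group(1)]

    return re.sub(r"<(\w+)>", substitute, template)

month_of = {
    "canonical_sql": "DATE_TRUNC('month', <col>)",
    "backend_overrides": {"bigquery": "DATE_TRUNC(<col>, MONTH)"},
}
map_div = {"canonical_sql": "<col_a> / NULLIF(<col_b>, 0)"}

print(render_sql(month_of, "bigquery", {"col": "orders.order_date"}))
print(render_sql(month_of, "postgres", {"col": "orders.order_date"}))
print(render_sql(map_div, "postgres", {"col_a": "o.revenue", "col_b": "o.cost"}))
```

In a real adapter the bound values would be backend identifiers resolved from the AC, not free-form strings.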
10.3.4 Identity-preservation by family-name convention¶
For some operators, identity-preservation depends on family-name conventions in the AC. The catalog declares the default behavior; AC authors may override via the naming function declaration (Chapter 3 §3.6.2).
For example, MAX applied to a temperature family conventionally produces peak_temperature (not preserved). MAX applied to a peak_inventory family that's already rooted at MAX is preserved (peak-of-peak is still peak).
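The default-plus-overrides lookup can be sketched as follows. The sketch assumes that override patterns such as `peak_*` are glob-style and that the first matching override wins; neither assumption is stated by the catalog, and `identity_preserved` is a hypothetical helper. The override list mirrors the MAX entry in §10.4.2.

```python
from fnmatch import fnmatchcase

def identity_preserved(entry, family_name):
    """Resolve identity preservation for one operator applied to one family."""
    ip = entry["identity_preservation"]
    for override in ip.get("overrides", []):
        # Assumed semantics: glob match on the family-name, first match wins.
        if fnmatchcase(family_name, override["family"]):
            return override["behavior"] == "preserved"
    return ip["default"] == "preserved"

max_entry = {
    "identity_preservation": {
        "default": "not_preserved",
        "overrides": [
            {"family": "peak_*", "behavior": "preserved"},
            {"family": "max_*", "behavior": "preserved"},
        ],
    }
}

print(identity_preserved(max_entry, "temperature"))     # False: output becomes peak_temperature
print(identity_preserved(max_entry, "peak_inventory"))  # True: peak-of-peak is still peak
```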
10.3.5 Missing-value treatment abbreviations¶
- exclude: missing values are excluded from the operation (most common for reducers).
- include_as_zero: missing values are treated as zero (alternative for some additive reducers).
- propagate_null: missing values propagate through (typical for functions).
- per_operator_rule: the operator has a custom rule specified in its entry.
10.4 Natively monoidal reducers¶
These operators are partition-invariant in their declared value space. The (value space, operator, identity) triple is a commutative monoid.
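Partition-invariance can be checked concretely: reducing each partition and then reducing the partial results must give the same answer as reducing the unpartitioned data, and an empty partition must contribute the identity. The sketch below models a few of the monoids in this section with plain Python values; the `MONOIDS` table and helper names are illustrative, not catalog structures.

```python
from functools import reduce

# (combine, identity) pairs mirroring the catalog's monoid declarations.
MONOIDS = {
    "SUM": (lambda a, b: a + b, 0),
    "MAX": (max, float("-inf")),
    "MIN": (min, float("inf")),
    "AND_AGG": (lambda a, b: a and b, True),
    "OR_AGG": (lambda a, b: a or b, False),
}

def fold(op_name, values):
    combine, identity = MONOIDS[op_name]
    return reduce(combine, values, identity)

def rolled_up(op_name, partitions):
    # Reduce each partition, then reduce the partial results.
    return fold(op_name, [fold(op_name, part) for part in partitions])

values = [3, 9, 4, 7, 1]
partitions = [[3, 9], [], [4, 7, 1]]   # includes an empty partition on purpose

for op_name in ("SUM", "MAX", "MIN"):
    assert rolled_up(op_name, partitions) == fold(op_name, values)

flags = [True, False, True]
assert rolled_up("AND_AGG", [[True], [False, True]]) == fold("AND_AGG", flags)
assert rolled_up("OR_AGG", [[True], [False, True]]) == fold("OR_AGG", flags)
print("partition-invariance holds for the sampled operators")
```

The empty partition is where the identity element earns its place: it folds to 0 for SUM and to -∞ for MAX, so it cannot disturb the combined result.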
10.4.1 SUM¶
name: SUM
display_name: "Sum"
description: "Sum of numeric values."
type: reducer
partition_invariance:
classification: natively_monoidal
monoid:
value_space: numeric
identity: 0
associative: true
commutative: true
explanation: "(ℝ, +, 0) is the canonical commutative monoid."
identity_preservation:
default: preserved
overrides: []
type_signature:
inputs: [numeric]
output: numeric
anchor_relation:
type: subset_or_equal
missing_value_treatment:
- input_M_category: MCAR
behavior: exclude
output_M: MCAR-derived
- input_M_category: MAR
behavior: exclude
output_M: MAR-derived (surviving M-coordinates propagate)
- input_M_category: MNAR
behavior: per_operator_rule
output_M: MNAR-derived; the result inherits self-dependence
naming_function_default: "{name_pred}"
canonical_sql: "SUM(<col>)"
notes: |
The most common reducer. Partition-invariant under all anchor coordinates;
identity-preserving for most family-name conventions.
10.4.2 MAX¶
name: MAX
display_name: "Maximum"
description: "Largest value in the group."
type: reducer
partition_invariance:
classification: natively_monoidal
monoid:
value_space: ord
identity: -infinity
associative: true
commutative: true
explanation: "(ord ∪ {-∞}, max, -∞) is a commutative monoid for any totally-ordered type."
identity_preservation:
default: not_preserved
overrides:
- family: peak_*
behavior: preserved
- family: max_*
behavior: preserved
type_signature:
inputs: [ord]
output: ord
anchor_relation:
type: subset_or_equal
missing_value_treatment:
- input_M_category: MCAR
behavior: exclude
- input_M_category: MAR
behavior: exclude
- input_M_category: MNAR
behavior: per_operator_rule
naming_function_default: "peak_{name_pred}"
canonical_sql: "MAX(<col>)"
notes: |
Partition-invariant on totally-ordered types. Not identity-preserving by default
(MAX of revenue is conventionally peak_revenue, a different family). When applied to
a family already rooted at MAX (e.g., peak_inventory), it is identity-preserving via
the override mechanism.
10.4.3 MIN¶
name: MIN
description: "Smallest value in the group."
type: reducer
partition_invariance:
classification: natively_monoidal
monoid:
value_space: ord
identity: +infinity
identity_preservation:
default: not_preserved
overrides:
- family: min_*
behavior: preserved
- family: lowest_*
behavior: preserved
type_signature:
inputs: [ord]
output: ord
naming_function_default: "min_{name_pred}"
canonical_sql: "MIN(<col>)"
10.4.4 COUNT¶
name: COUNT
description: "Count of non-null values."
type: reducer
partition_invariance:
classification: natively_monoidal
monoid:
value_space: integer
identity: 0
explanation: "Counts compose by addition; (ℕ, +, 0) is a commutative monoid."
identity_preservation:
default: not_preserved
type_signature:
inputs: [any]
output: integer
missing_value_treatment:
- input_M_category: MCAR
behavior: exclude
- input_M_category: MAR
behavior: exclude
- input_M_category: MNAR
behavior: exclude
naming_function_default: "{name_pred}_count"
canonical_sql: "COUNT(<col>)"
notes: |
Never identity-preserving: COUNT produces a new family (a count of the original), not
the original family itself. The naming-function entry produces the canonical "X_count"
convention.
10.4.5 COUNT_STAR¶
name: COUNT_STAR
display_name: "Count rows"
description: "Count of all rows including nulls."
type: reducer
partition_invariance:
classification: natively_monoidal
monoid:
value_space: integer
identity: 0
identity_preservation:
default: not_preserved
type_signature:
inputs: []
output: integer
naming_function_default: "row_count"
canonical_sql: "COUNT(*)"
10.4.6 AND_AGG / OR_AGG¶
name: AND_AGG
description: "Logical AND across boolean values."
type: reducer
partition_invariance:
classification: natively_monoidal
monoid:
value_space: boolean
identity: true
type_signature:
inputs: [boolean]
output: boolean
naming_function_default: "all_{name_pred}"
canonical_sql: "BOOL_AND(<col>)"
name: OR_AGG
description: "Logical OR across boolean values."
type: reducer
partition_invariance:
classification: natively_monoidal
monoid:
value_space: boolean
identity: false
type_signature:
inputs: [boolean]
output: boolean
naming_function_default: "any_{name_pred}"
canonical_sql: "BOOL_OR(<col>)"
10.4.7 BIT_AND / BIT_OR / BIT_XOR¶
name: BIT_AND
type: reducer
partition_invariance:
classification: natively_monoidal
monoid:
value_space: bitmap | integer
identity: all-ones-bitmap | -1
type_signature:
inputs: [bitmap | integer]
output: bitmap | integer
canonical_sql: "BIT_AND(<col>)"
Similar entries for BIT_OR (identity: all-zeros) and BIT_XOR (identity: all-zeros, with the additional property that double-XOR cancels).
10.4.8 HLL_MERGE¶
name: HLL_MERGE
display_name: "HyperLogLog merge"
description: "Combine HLL sketches preserving approximate distinct-count semantics."
type: reducer
partition_invariance:
classification: natively_monoidal
monoid:
value_space: hll_sketch
identity: empty_hll_sketch
explanation: |
HLL sketches form a commutative monoid under merge. Merging sketches of
partition-pieces gives the sketch one would obtain by merging the unpartitioned
data, exactly (no approximation in the merge — approximation enters at the
cardinality-extraction step via APPROX_DISTINCT).
identity_preservation:
default: preserved
type_signature:
inputs: [hll_sketch]
output: hll_sketch
missing_value_treatment:
- input_M_category: MCAR | MAR
behavior: exclude
- input_M_category: MNAR
behavior: per_operator_rule
naming_function_default: "{name_pred}"
canonical_sql: "HLL_UNION(<col>)"
backend_overrides:
snowflake: "HLL_COMBINE(<col>)"
bigquery: "HLL_COUNT.MERGE_PARTIAL(<col>)"
notes: |
Coframe treats sketch-merge operators as first-class monoidal reducers. The fact
that the value space is non-numeric does not affect the framework's reasoning —
partition-invariance is an algebraic property of the (value space, operator)
pair, not a property of arithmetic over numbers.
10.4.9 THETA_UNION / THETA_INTERSECTION¶
name: THETA_UNION
description: "Combine theta sketches preserving approximate set-union semantics."
type: reducer
partition_invariance:
classification: natively_monoidal
monoid:
value_space: theta_sketch
identity: empty_theta_sketch
type_signature:
inputs: [theta_sketch]
output: theta_sketch
identity_preservation:
default: preserved
canonical_sql: "THETA_SKETCH_UNION(<col>)"
Similar entry for THETA_INTERSECTION (whose identity element is the universal sketch).
10.4.10 T_DIGEST_MERGE / KLL_MERGE¶
name: T_DIGEST_MERGE
description: "Combine t-digest sketches for approximate quantile estimation."
type: reducer
partition_invariance:
classification: natively_monoidal
monoid:
value_space: t_digest
identity: empty_t_digest
type_signature:
inputs: [t_digest]
output: t_digest
identity_preservation:
default: preserved
canonical_sql: "TDIGEST_UNION(<col>)"
Similar entry for KLL_MERGE.
10.4.11 Summary of natively monoidal reducers¶
| Operator | Value space | Identity | Identity-preserving by default |
|---|---|---|---|
| SUM | numeric | 0 | yes |
| MAX | totally-ordered | -∞ | no (see overrides) |
| MIN | totally-ordered | +∞ | no |
| COUNT | integer | 0 | no |
| COUNT_STAR | integer | 0 | no |
| AND_AGG | boolean | true | no |
| OR_AGG | boolean | false | no |
| BIT_AND | bitmap/integer | all-ones | yes (for bitmap unions of same family) |
| BIT_OR | bitmap/integer | all-zeros | yes |
| BIT_XOR | bitmap/integer | all-zeros | no |
| HLL_MERGE | hll_sketch | empty | yes |
| THETA_UNION | theta_sketch | empty | yes |
| THETA_INTERSECTION | theta_sketch | universal | yes |
| T_DIGEST_MERGE | t_digest | empty | yes |
| KLL_MERGE | kll_sketch | empty | yes |
Every operator in this table is eligible as an ip_reducer for metric families. The classification matches Foundations §2.6.3.
10.5 Liftably monoidal reducers¶
These operators are not partition-invariant in their natural output value space but become partition-invariant when an enriched value space is carried through aggregation, with a final projection extracting the desired output.
The catalog declares such operators with their lift block: the enriched value space, the operator that produces enriched values, and the final projection operator.
10.5.1 AVG¶
name: AVG
display_name: "Arithmetic mean"
type: reducer
partition_invariance:
classification: liftably_monoidal
lift:
enriched_value_space: (numeric, integer) # (sum, count) pair
lift_operator: AVG_LIFT # produces (sum, count) pair
final_projection: AVG_PROJECT # computes sum/count
explanation: |
AVG is not partition-invariant on numerics: AVG of partition averages ≠
AVG of the unpartitioned data when partitions have unequal sizes. But carrying
(SUM, COUNT) pairs is partition-invariant (both components roll up monoidally), and
the final AVG is recovered as SUM/COUNT at the final grain.
identity_preservation:
default: not_preserved
type_signature:
inputs: [numeric]
output: numeric
missing_value_treatment:
- input_M_category: MCAR
behavior: exclude
- input_M_category: MAR
behavior: exclude
- input_M_category: MNAR
behavior: per_operator_rule
naming_function_default: "avg_{name_pred}"
canonical_sql: "AVG(<col>)"
notes: |
Coframe Core can use AVG as an ip_reducer only if the lift is carried through
aggregation (a Pro feature involving (SUM, COUNT) pair tracking). In Core,
AVG-rooted families are typically treated as anchor-locked, with cross-grain
AVG queries refused unless siblings are independently materialized.
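The lift described in the AVG entry can be illustrated numerically. In the sketch below, `avg_lift`, `merge`, and `avg_project` are illustrative Python stand-ins for the catalog's AVG_LIFT, the pairwise monoidal rollup, and AVG_PROJECT; the data is made up, with deliberately unequal partition sizes.

```python
def avg_lift(values):
    # Enriched state: a (sum, count) pair.
    return (sum(values), len(values))

def merge(a, b):
    # Component-wise addition; the identity is (0, 0).
    return (a[0] + b[0], a[1] + b[1])

def avg_project(state):
    total, count = state
    return total / count if count else None

partitions = [[10, 20, 30, 40], [100]]

# Naive rollup: average of partition averages. Wrong for unequal sizes.
naive = sum(sum(p) / len(p) for p in partitions) / len(partitions)

# Lifted rollup: merge (sum, count) pairs, project once at the final grain.
state = (0, 0)
for part in partitions:
    state = merge(state, avg_lift(part))
lifted = avg_project(state)

print(naive)   # 62.5
print(lifted)  # 40.0, the true mean of all five values
```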
10.5.2 STDDEV and VARIANCE¶
name: VARIANCE
description: "Population variance."
type: reducer
partition_invariance:
classification: liftably_monoidal
lift:
enriched_value_space: (numeric, numeric, integer) # (sum, sum_of_squares, count)
lift_operator: VARIANCE_LIFT
final_projection: VARIANCE_PROJECT
explanation: |
Variance = E[X²] - E[X]². The triple (SUM, SUM_OF_SQUARES, COUNT) is
partition-invariant; variance is recovered by the standard formula at the final grain.
type_signature:
inputs: [numeric]
output: numeric
naming_function_default: "var_{name_pred}"
Similar entry for STDDEV (square root of variance; same lift).
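The same pattern applies to the variance triple. The sketch below uses illustrative Python stand-ins for VARIANCE_LIFT and VARIANCE_PROJECT and a small made-up data set whose population variance is exactly 4. Note that the E[X²] - E[X]² form can lose precision through cancellation on large or nearly constant inputs; the sketch does not address that.

```python
import math

def variance_lift(values):
    # Enriched state: (sum, sum of squares, count).
    return (sum(values), sum(v * v for v in values), len(values))

def merge(a, b):
    # Component-wise addition; the identity is (0, 0, 0).
    return tuple(x + y for x, y in zip(a, b))

def variance_project(state):
    total, total_sq, n = state
    if n == 0:
        return None
    mean = total / n
    return total_sq / n - mean * mean   # E[X^2] - E[X]^2 (population variance)

partitions = [[2, 4, 4, 4], [5, 5, 7, 9]]
state = (0, 0, 0)
for part in partitions:
    state = merge(state, variance_lift(part))

variance = variance_project(state)
print(state)                # (40, 232, 8)
print(variance)             # 4.0
print(math.sqrt(variance))  # 2.0, the STDDEV projection of the same state
```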
10.5.3 APPROX_DISTINCT¶
name: APPROX_DISTINCT
display_name: "Approximate distinct count"
description: "Estimated number of distinct values, via HLL or similar sketch."
type: reducer
partition_invariance:
classification: liftably_monoidal
lift:
enriched_value_space: hll_sketch
lift_operator: HLL_MERGE
final_projection: HLL_CARDINALITY
explanation: |
Exact distinct count is holistic (no finite enrichment). HLL provides a
bounded-error approximation: HLL sketches are partition-invariant under
HLL_MERGE (a natively monoidal reducer on HLL sketches), and the cardinality
estimate is recovered as the final projection.
type_signature:
inputs: [any]
output: integer
identity_preservation:
default: not_preserved
naming_function_default: "approx_distinct_{name_pred}"
canonical_sql: "APPROX_COUNT_DISTINCT(<col>)"
backend_overrides:
snowflake: "APPROX_COUNT_DISTINCT(<col>)"
bigquery: "APPROX_COUNT_DISTINCT(<col>)"
postgres: "<custom HLL extension call>"
notes: |
In Coframe Core, an AC's HLL-typed metric family carries HLL sketches as
first-class values (per §10.4.8). Queries can request APPROX_DISTINCT as a
final projection, computed at the query's grain via the standard liftable
pattern: HLL_MERGE the sketches, then apply HLL_CARDINALITY to the merged sketch.
10.5.4 APPROX_PERCENTILE¶
name: APPROX_PERCENTILE
description: "Approximate percentile via t-digest or KLL sketch."
type: reducer
partition_invariance:
classification: liftably_monoidal
lift:
enriched_value_space: t_digest
lift_operator: T_DIGEST_MERGE
final_projection: T_DIGEST_PERCENTILE
type_signature:
inputs: [numeric, numeric] # (value, percentile)
output: numeric
naming_function_default: "p{percentile}_{name_pred}"
10.5.5 Why liftable reducers are powerful¶
The liftable classification matters for the framework's expressiveness. Without it, the framework would refuse cross-grain AVG queries; with it (and the AC's commitment to carry the enriched state), the framework can produce correct AVG rollups.
The implementation cost: the framework must carry the enriched state through aggregation. This requires:
- Catalog support for the lift_operator and final_projection.
- Backend support for the enriched value space (e.g., HLL sketches for APPROX_DISTINCT; (sum, count) tuples for AVG).
- DQ verification that the lift's components are themselves partition-invariant.
Coframe Core supports HLL-backed APPROX_DISTINCT as a first-class reducer and supports liftable AVG/VARIANCE in advanced cases. Pro extends the lift mechanism to user-defined enrichments.
10.6 Holistic reducers¶
These operators are not partition-invariant under any finite-state enrichment. They cannot roll up from finer to coarser anchors via the catalog's standard mechanism; metric families rooted at them are anchor-locked.
10.6.1 EXACT_DISTINCT¶
name: EXACT_DISTINCT
display_name: "Exact distinct count"
description: "Exact count of distinct values; holistic — not partition-invariant."
type: reducer
partition_invariance:
classification: holistic
explanation: |
Computing exact distinct counts requires either the full multiset of values
(to deduplicate) or a sketch (which is approximation, not exactness).
No fixed-size enrichment of the partition state suffices.
type_signature:
inputs: [any]
output: integer
identity_preservation:
default: not_preserved
naming_function_default: "distinct_{name_pred}"
canonical_sql: "COUNT(DISTINCT <col>)"
notes: |
EXACT_DISTINCT cannot serve as an ip_reducer. A metric family rooted at
EXACT_DISTINCT is anchor-locked. For approximate distinct counts that can
roll up, use APPROX_DISTINCT (§10.5.3) with HLL backing.
10.6.2 EXACT_PERCENTILE / EXACT_MEDIAN¶
name: EXACT_PERCENTILE
description: "Exact percentile; holistic."
type: reducer
partition_invariance:
classification: holistic
explanation: |
Exact percentiles require sorted access to the full data set. No fixed-size
enrichment captures the necessary information.
type_signature:
inputs: [numeric, numeric]
output: numeric
notes: |
Anchor-locked. Use APPROX_PERCENTILE for cross-grain percentile queries.
EXACT_MEDIAN is EXACT_PERCENTILE(<col>, 0.5).
10.6.3 MODE¶
name: MODE
description: "Most frequent value; holistic."
type: reducer
partition_invariance:
classification: holistic
explanation: |
The mode requires frequency counts across the full multiset. Partition-level
modes do not compose: the mode of the union is not derivable from the modes
of the partitions.
type_signature:
inputs: [any]
output: any
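A small counterexample makes the holistic classification concrete for both MODE and EXACT_DISTINCT. The data is made up for illustration, and `mode` is a local helper rather than a catalog function.

```python
from collections import Counter

def mode(values):
    # Most frequent value; the sample data below has no ties.
    return Counter(values).most_common(1)[0][0]

p1 = ["a", "a", "b", "b", "b"]
p2 = ["a", "a", "c", "c", "c"]
union = p1 + p2

# MODE: the partition modes are b and c, yet the mode of the union is a.
print(mode(p1), mode(p2), mode(union))   # b c a

# EXACT_DISTINCT: per-partition distinct counts do not add up.
print(len(set(p1)) + len(set(p2)))       # 4
print(len(set(union)))                   # 3
```

No function of the partition-level outputs alone recovers the union-level answer here, which is why families rooted at these reducers are anchor-locked.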
10.6.4 Why holistic reducers exist in the catalog¶
The framework includes holistic reducers for two reasons:
- Query-time use: Frame-QL queries may apply holistic reducers directly (SELECT EXACT_DISTINCT(customer_id)) when the query's grain is the same as the data's grain, so no rollup is needed.
- Family-root declaration: an AC author may declare a metric family rooted at a holistic reducer (e.g., unique_customers rooted at EXACT_DISTINCT). Such families are anchor-locked; cross-grain queries refuse. This is appropriate when the AC author has independently materialized the result at the desired grain.
What holistic reducers cannot do is serve as ip_reducers. Their entries in the catalog make this explicit; the framework refuses to use them as ip_reducers.
10.7 Functions¶
Functions are operators with anchor_relation: equal — they transform values without changing the anchor. Function-derived FD-edges (Foundations §2.5.6) use these.
10.7.1 Date/time extraction functions¶
name: MONTH_OF
description: "Extract month from a date or timestamp."
type: function
partition_invariance:
classification: not_applicable
identity_preservation:
default: not_preserved
type_signature:
inputs: [date]
output: date # or integer for month-number form
anchor_relation:
type: equal
naming_function_default: "month_of_{name_pred}" # or "month" by convention
canonical_sql: "DATE_TRUNC('month', <col>)"
backend_overrides:
postgres: "DATE_TRUNC('month', <col>)"
snowflake: "DATE_TRUNC('MONTH', <col>)"
bigquery: "DATE_TRUNC(<col>, MONTH)"
notes: |
Common entry in dimension family hierarchies. The function is deterministic, so
function-derived FD-edges (day → month) are grounded by this entry.
Similar entries for QUARTER_OF, YEAR_OF, WEEK_OF, DAY_OF, HOUR_OF, DATE_OF (extracts date from timestamp), FISCAL_PERIOD_OF, FISCAL_QUARTER_OF, FISCAL_YEAR_OF, ISO_WEEK_OF, ISO_YEAR_OF.
Each fiscal-related function takes the fiscal-calendar definition as catalog metadata (the AC's ac_metadata can override fiscal-year-start defaults).
10.7.2 Arithmetic functions¶
name: ABS
description: "Absolute value."
type: function
type_signature:
inputs: [numeric]
output: numeric
partition_invariance:
classification: not_applicable
identity_preservation:
default: not_preserved
anchor_relation:
type: equal
naming_function_default: "abs_{name_pred}"
canonical_sql: "ABS(<col>)"
Similar entries for: ROUND, FLOOR, CEIL, TRUNC, MOD, LOG, LN, EXP, SQRT, POW, SIGN.
10.7.3 String functions¶
name: SUBSTR
type: function
type_signature:
inputs: [string, integer, integer]
output: string
naming_function_default: "{name_pred}_substr"
canonical_sql: "SUBSTRING(<col>, <start>, <length>)"
Similar for: UPPER, LOWER, TRIM, LENGTH, CONCAT, REPLACE, STARTS_WITH, CONTAINS.
10.7.4 Bucketing / discretization¶
name: BUCKET
description: "Discretize a numeric value into bucket boundaries."
type: function
type_signature:
inputs: [numeric, numeric] # (value, bucket_width)
output: integer # bucket index
partition_invariance:
classification: not_applicable
anchor_relation:
type: equal
naming_function_default: "{name_pred}_bucket"
canonical_sql: "FLOOR(<col> / <width>) * <width>"
notes: |
Often used to produce function-derived AC-dimensions (price_bucket from price)
enabling cross-grain navigation along discretized continuous variables.
Similar entry for BUCKET_BY_BOUNDARIES (custom boundary list).
10.7.5 Type-coercion functions¶
name: CAST_NUMERIC
type: function
type_signature:
inputs: [string | integer | date]
output: numeric
naming_function_default: "{name_pred}" # the cast doesn't conventionally rename
canonical_sql: "CAST(<col> AS NUMERIC)"
Similar for CAST_INTEGER, CAST_STRING, CAST_DATE, CAST_TIMESTAMP, CAST_BOOLEAN.
10.7.6 Null-handling functions¶
name: COALESCE_TO
type: function
type_signature:
inputs: [<type>, <type>] # (column, replacement value)
output: <type>
naming_function_default: "{name_pred}"
canonical_sql: "COALESCE(<col>, <replacement>)"
notes: |
Identity-preserving by default — the name is preserved across null replacement.
name: IS_NULL
type: function
type_signature:
inputs: [any]
output: boolean
naming_function_default: "{name_pred}_is_null"
canonical_sql: "(<col> IS NULL)"
10.7.7 Sketch-projection functions¶
Functions that extract scalar values from sketch-type columns:
name: HLL_CARDINALITY
description: "Extract cardinality estimate from HLL sketch."
type: function
type_signature:
inputs: [hll_sketch]
output: integer
anchor_relation:
type: equal
naming_function_default: "approx_distinct_{name_pred}"
canonical_sql: "HLL_ESTIMATE(<col>)"
notes: |
Combined with HLL_MERGE (a reducer), this function provides the final projection
for APPROX_DISTINCT (§10.5.3).
Similar entries for T_DIGEST_PERCENTILE, T_DIGEST_QUANTILE, KLL_PERCENTILE, THETA_CARDINALITY, THETA_INTERSECT_CARDINALITY.
10.8 Multi-input operators¶
Multi-input operators take two or more inputs at a common anchor and produce a singleton output (per Foundations §2.7.7).
10.8.1 Arithmetic combinators¶
name: MAP_DIV
display_name: "Pointwise division"
description: "Divide two metrics elementwise at a shared anchor."
type: multi_input
partition_invariance:
classification: not_applicable
identity_preservation:
default: not_preserved
type_signature:
inputs: [numeric, numeric]
output: numeric
anchor_relation:
type: equal # both inputs share the singleton's anchor
naming_function_default: "{name_pred[0]}_per_{name_pred[1]}"
canonical_sql: "<col_a> / NULLIF(<col_b>, 0)"
notes: |
Singleton-producing. Used in the retail AC for gross_margin_pct (revenue / cost).
Note: MAP_DIV is not the same as AVG. AVG = SUM / COUNT at a final anchor;
MAP_DIV is pointwise division at the per-row level. Use MAP_DIV when you want
a ratio at the input row's grain.
Similar for: MAP_MUL, MAP_SUB, MAP_ADD, MAP_POW.
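The behavior of the canonical SQL for MAP_DIV, and the reason a pointwise ratio is tied to the grain at which it is computed, can be sketched in a few lines. `map_div` below is an illustrative Python model of `<col_a> / NULLIF(<col_b>, 0)`, with `None` standing in for SQL NULL; the rows are made up.

```python
def map_div(numerator, denominator):
    # NULL input or a zero denominator yields NULL, as with NULLIF(<col_b>, 0).
    if numerator is None or denominator is None or denominator == 0:
        return None
    return numerator / denominator

rows = [
    {"revenue": 100.0, "cost": 50.0},
    {"revenue": 30.0, "cost": 10.0},
]

# Ratio at the input rows' grain.
per_row = [map_div(r["revenue"], r["cost"]) for r in rows]
print(per_row)                      # [2.0, 3.0]

# Ratio at a coarser grain: roll up the inputs first, then divide.
coarse = map_div(sum(r["revenue"] for r in rows), sum(r["cost"] for r in rows))
print(coarse)                       # 130 / 60, about 2.1667

# Averaging the per-row ratios gives a different number.
print(sum(per_row) / len(per_row))  # 2.5
print(map_div(5.0, 0))              # None
```

The mismatch between the last two ratios is the practical reason to recompute a ratio from its rolled-up inputs at the output grain instead of rolling up a stored ratio.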
10.8.2 Logical combinators¶
name: MAP_IF
type: multi_input
type_signature:
inputs: [boolean, <type>, <type>]
output: <type>
naming_function_default: "{name_pred[0]}_if_{name_pred[1]}_else_{name_pred[2]}"
canonical_sql: "CASE WHEN <col_a> THEN <col_b> ELSE <col_c> END"
Similar for: MAP_AND, MAP_OR, MAP_NOT.
10.8.3 Mixed combinators¶
name: MAP_COMPARE_GT
type: multi_input
type_signature:
  inputs: [<ord>, <ord>]
  output: boolean
naming_function_default: "{name_pred[0]}_gt_{name_pred[1]}"
canonical_sql: "<col_a> > <col_b>"
Similar for: MAP_COMPARE_LT, MAP_COMPARE_EQ, etc.
10.8.4 Multi-input vs. functions¶
Multi-input operators and single-input functions are similar (both transform without aggregating). The distinction:
- A function has exactly one input: its op references the predecessor, and the predecessor's family is the function's natural domain.
- A multi-input operator has multiple inputs: its lineage is a tuple of predecessors; it produces a singleton family (no further family lineage upward).
The framework reasons about them differently in cross-grain navigation. Functions can participate in function-derived FD-edges (Foundations §2.5.6); multi-input operators cannot (they produce singletons that are leaves in the metric genealogy).
10.8.5 Registered ratio and count operators¶
Two commonly-needed patterns are provided as registered convenience operators. Both are query-time constructs — they appear in Frame-QL expressions (Chapter 8 §8.6.4) and compute at the output grain. Neither requires a stored metric.
COUNT_OF¶
name: COUNT_OF
display_name: "Conditional count"
description: "Count of input rows where a predicate holds, at the output grain."
type: reducer
partition_invariance:
  classification: natively_monoidal
  monoid:
    value_space: integer
    identity: 0
  explanation: |
    COUNT_OF is an additive count: it composes by SUM across partitions exactly
    as COUNT does. It is fully partition-invariant and navigable across grain.
identity_preservation:
  default: not_preserved
type_signature:
  inputs: [boolean]   # the predicate expression
  output: integer
anchor_relation:
  type: subset_or_equal
naming_function_default: "count_of_{predicate}"
canonical_sql: "COUNT(CASE WHEN <predicate> THEN 1 END)"
notes: |
  COUNT_OF(p) is sugar for a conditional count. Because counts are extensive and
  roll up by SUM, COUNT_OF behaves as an ordinary Rung-1 reducer: it navigates
  across grain via the FD-DAG like any additive measure. It is therefore
  structurally unproblematic: the expression form and any stored form are
  equivalent, so Coframe Core provides it as an expression operator and does not
  treat it as a special stored metric.
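A minimal sketch of why the conditional count is navigable: counts taken per partition add up to the count taken over the whole input. The rows, the `region` and `amount` fields, and the `is_large` predicate are made-up illustration data, not part of any AC.

```python
# Hypothetical rows; field names are illustrative only.
rows = [
    {"region": "east", "amount": 120},
    {"region": "east", "amount": 40},
    {"region": "west", "amount": 75},
    {"region": "west", "amount": 10},
    {"region": "west", "amount": 200},
]


def count_of(rows, predicate):
    """Conditional count, mirroring COUNT(CASE WHEN <predicate> THEN 1 END)."""
    return sum(1 for r in rows if predicate(r))


def is_large(row):
    """Hypothetical predicate used for the illustration."""
    return row["amount"] >= 50


# Count at region grain, then roll up by SUM to the grand-total grain.
by_region = {}
for r in rows:
    by_region.setdefault(r["region"], []).append(r)
per_region = {region: count_of(part, is_large) for region, part in by_region.items()}
rolled_up = sum(per_region.values())

# Count directly at the grand-total grain; the two results agree.
direct = count_of(rows, is_large)
```

The agreement between `rolled_up` and `direct` is the partition-invariance the entry declares.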
RATIO_OF¶
name: RATIO_OF
display_name: "Output-grain ratio"
description: "Ratio of two reducers, computed at the output grain."
type: multi_input   # two reducer inputs, divided at the output grain
partition_invariance:
  classification: not_applicable   # the ratio itself is intensive (see notes)
identity_preservation:
  default: not_preserved
type_signature:
  inputs: [numeric, numeric]   # (numerator family, denominator family)
  output: numeric
anchor_relation:
  type: equal   # evaluated at the output grain
naming_function_default: "{name_pred[0]}_per_{name_pred[1]}"
canonical_sql: "SUM(<numerator>) / NULLIF(SUM(<denominator>), 0)"
notes: |
  RATIO_OF(num, den) is sugar for `SUM(num) / SUM(den)` taken at the query's
  output grain, i.e., the numerator and denominator are each rolled up to the
  AT grain via their families' ip_reducers, then divided. The default reducer is
  SUM; a registered variant may specify alternative numerator/denominator
  reducers where the catalog defines them.

  The ratio itself is intensive: it has no ip_reducer and does not roll up. This
  is exactly why RATIO_OF is a query-time operator rather than a stored metric.
  Computing the ratio at the output grain from extensive components is correct at
  every grain; storing a ratio at one grain would be navigationally inert (see
  §10.8.6). RATIO_OF therefore composes two navigable reducers and divides at the
  end, rather than rolling up a stored intensive value.
Two-anchor operators: PCT, WEIGHTED_AVG, INDEX¶
Three further registered operators express two-anchor measures — measures that combine a frame-grain value with an aggregate of a metric at a coarser coordinate (Chapter 8 §8.6.6). Like RATIO_OF, they are query-time sugar, not stored metrics; unlike RATIO_OF, their companion term lives at a grain coarser than the frame and is broadcast back.
- PCT(m @ a): m's share of its a-grain total. Desugars to m / (m @ a).
- WEIGHTED_AVG(m, w @ a): m averaged weighted by w, with a the weighting grain. Desugars to SUM(m * w @ a) / SUM(w @ a).
- INDEX(m @ coord): m relative to its value at a base coordinate. Desugars to m / (m @ coord).
These desugar to @-anchored frame expressions and, in turn, to staged frames (Chapter 8 §8.7). They are the surface at which the grain-ambiguity dubiousness law is enforced: the relevant grain must be uniquely determined or explicitly given via @, else the operator is refused as dubious. The WEIGHTED_AVG weighting grain is the canonical case — a weighted average with an ambiguous, undeclared weighting grain is refused rather than silently computed at a convenient grain.
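As a rough illustration of the staged desugaring, the following plain-Python sketch evaluates something shaped like PCT(revenue @ region) over a frame at (region, store) grain: one stage computes the coarser companion term, a second broadcasts it back and divides. The data and variable names are hypothetical, and a real resolver would plan this through staged frames rather than in-memory loops.

```python
from collections import defaultdict

# Hypothetical frame at (region, store) grain: (region, store, revenue).
frame = [
    ("east", "s1", 60.0),
    ("east", "s2", 40.0),
    ("west", "s3", 50.0),
]

# Stage 1: the companion term, revenue @ region (a coarser coordinate).
region_total = defaultdict(float)
for region, _store, revenue in frame:
    region_total[region] += revenue

# Stage 2: broadcast the regional total back to every store row and divide.
# Every store in a region divides by the same total.
pct = {
    (region, store): revenue / region_total[region]
    for region, store, revenue in frame
}
```

Because the companion value is constant within each region cell, the shares within a region sum to 1.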
A note on implementation: in Coframe Core these operators are defined by their desugaring — correct at the cost of intermediate staging. Recognizing and natively accelerating the two-anchor pattern in the backend engine (a single fused pass rather than staged frames) is a possible future optimization and a Coframe Pro concern; it does not change the Core semantics.
10.8.6 Why intensive ratios are computed, not stored¶
A note on a design choice the above operators embody, since it recurs whenever an AC author considers exposing a ratio.
A ratio of two extensive metrics — revenue / cost, revenue / units_sold, revenue / customer_count — is intensive: it has no partition-invariant reducer (Foundations §2.8.4). It does not sum, does not max, does not meaningfully average across partitions. Structurally, a stored intensive ratio is an anchor-locked metric: queryable only at the exact grain where it was materialized, and unreachable from any other grain.
This has a sharp consequence. A gross_margin_pct stored at transaction grain cannot be navigated to region, quarter, or category grain — the anchor-lock forbids it. Yet those coarser grains are where margin is actually queried. So a query for margin-by-region must reconstruct the ratio from revenue and cost rolled up to region grain regardless of whether a transaction-grain margin was stored. The stored value provides no navigational capability the components do not already provide; it is inert for every query except one at its own grain.
Therefore the framework's default is: express intensive ratios in Frame-QL (via RATIO_OF or an explicit reducer composition), do not store them as AC-metrics. The expression form is correct at every grain and costs nothing extra; the stored form is correct at one grain and unreachable elsewhere.
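A small worked example of the difference, using invented transaction-grain numbers: averaging stored per-transaction ratios gives a different answer from dividing the rolled-up components, and only the latter is the ratio at the coarser grain.

```python
# Hypothetical transaction-grain data for one region: (revenue, cost).
transactions = [(100.0, 50.0), (10.0, 8.0)]

# What a stored, anchor-locked ratio would hold at transaction grain.
stored_ratios = [rev / cost for rev, cost in transactions]  # [2.0, 1.25]

# Naively "rolling up" the stored ratios by averaging them.
avg_of_ratios = sum(stored_ratios) / len(stored_ratios)     # 1.625

# RATIO_OF-style: roll each component up by SUM, then divide once.
total_rev = sum(rev for rev, _ in transactions)              # 110.0
total_cost = sum(cost for _, cost in transactions)           # 58.0
# Guard against a zero denominator, as NULLIF does in the canonical SQL.
ratio_of_sums = total_rev / total_cost if total_cost != 0 else None
```

Here `ratio_of_sums` is 110 / 58 (about 1.897), while the average of the stored ratios is 1.625; the stored values contribute nothing to the region-grain answer.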
There are three cases where exposing a ratio as a stored metric does earn its place — precisely the cases where the Frame-QL expression alternative is unavailable:
1. A component is not in the AC. If the denominator (a market-size figure, a budget target) is not an AC-exposed metric, the user cannot write the expression. Expose the ratio (or the missing component) directly.
2. The ratio is a primary observation, not a derivation. Some ratio-shaped quantities are recorded directly (an externally-supplied index, a sensor-reported rate, a survey-derived percentage) and are not reconstructable from in-AC components. These are observation-rooted metrics, typically anchor-locked (a reported rate does not roll up), and belong in the AC as data.
3. The denominator is non-navigable. If the denominator is itself holistic and anchor-locked (e.g., an exact COUNT_DISTINCT the AC cannot roll up), the user cannot recompute it at an arbitrary grain, so the ratio is only available where it was materialized. This is a special case of (1).
In all three, the structural treatment is the singleton (a one-member family) or anchor-locked family: the only construction path the resolver needs is lookup at the metric's own grain — there is no cross-grain resolution to perform. This is the correct and principled treatment for the cases where exposure is warranted; the refinement relative to "expose all ratios" is simply that these cases are narrow.
This is one instance of a recurring discipline in authoring an AC: prefer the construct the framework can guarantee at every grain (here, the navigable components plus a query-time ratio) over the one that merely feels more complete (a stored value at every grain). The discriminating question — can this be reconstructed from navigable components at the query grain? — is a decidable test, not a matter of taste, and it generalizes well beyond ratios. When the answer is yes, computing is the principled default; when it is no, storing as a singleton or anchor-locked family is exactly right. Authoring decisions made by applying this test tend to produce lean, navigable ACs; those made by the instinct to "expose everything for completeness" tend to accumulate inert, anchor-locked storage that no query can reach.
Coframe Pro forward note. A more sophisticated treatment, outside Coframe Core, is a computed metric family: a metric family whose family-root is a declared rule (e.g., "gross_margin_pct ≡ RATIO_OF(revenue, cost)") rather than data or a reducer over data. Such a family exposes a named, discoverable metric in the AC vocabulary without storing any values; the resolver reconstructs it at any grain reachable by both components. This introduces a third kind of family-root (rule-rooted, alongside observation-rooted and reducer-rooted), which is a genuine extension of the Foundations primitives, and so is deliberately reserved for Coframe Pro rather than folded into Core.
10.9 Windowed operators are not in Coframe Core¶
Window analytics — operators that compute a value for an output row by referencing other rows of the assembled output frame — are not part of the Coframe Core catalog. This category includes:
- Period-over-period comparison (PRIOR, LEAD): referencing a row offset along a dimension.
- Ranking (RANK, DENSE_RANK, ROW_NUMBER, NTILE, PERCENT_RANK).
- Running aggregations (RUNNING_SUM, CUMSUM, RUNNING_AVG, moving averages, and similar).
- Within-frame combinators that partition the assembled result by arbitrary columns (DIFF_FROM_AVG, Z_SCORE).
These operators do not fit Coframe Core's frame model. Every Core operator either describes a column's value at its own grain (functions, mappers, multi-input operators) or aggregates input rows up to the output grain (reducers). Window operators do neither: they read across the rows of the assembled frame, which is a frame-level operation. Coframe's central insight is that there are no genuine frame-level operations in Core — a frame is just a set of column specs — so window analytics has no place in the Core catalog.
10.9.1 The boundary: two-anchor measures vs. window functions¶
Some measures look like they belong here but do not, and the distinction is worth making precise because it is exactly where Core's boundary sits. A two-anchor measure (PCT, WEIGHTED_AVG, INDEX; Chapter 8 §8.6.6) combines a frame-grain value with an aggregate of a metric at a coarser coordinate — reached by FD-navigation and broadcast back. A window function computes over other rows of the assembled result. The single discriminator:
Is the companion value the same for every fine row in a coordinate cell (→ two-anchor measure, Core), or does it vary row-by-row according to position or ordering among other rows (→ window function, not Core)?
PCT(revenue @ region) is a two-anchor measure: every store in a region divides by the same regional total, which is a coordinate-cell aggregate broadcast down the FD-DAG — Core. A cumulative sum is a window function: each row's running total is different, defined by that row's position in an ordering, reaching into all earlier rows — not Core. This is why PCT (share of a coarser coordinate total) is in Core while a window-style "percent of an arbitrarily-partitioned group" or a CUMSUM is not: the former navigates the coordinate space, the latter reaches across assembled rows.
Window analytics is Coframe Pro territory, or is performed outside Coframe (for example, in a notebook or BI tool consuming a Frame-QL result). Frame-QL Core refuses queries that would require window operators rather than approximating them.
This is a deliberate scope boundary, consistent with the framework's posture: support frame-shaped output with guaranteed correctness, and decline patterns that cannot be expressed as column specs.
10.10 The OBSERVED operator¶
OBSERVED is a special operator that marks a column as directly-observed data, not derived from any predecessor.
name: OBSERVED
display_name: "Directly observed"
description: "The column's values are observations, not derivations."
type: observed
partition_invariance:
  classification: not_applicable
  explanation: |
    OBSERVED is not a transformation operator; it carries no aggregation algebra.
    A column with op: OBSERVED is a family-root with self-referential lineage.
identity_preservation:
  default: preserved
  explanation: |
    OBSERVED inherently preserves the family-name; it doesn't transform anything.
type_signature:
  inputs: []
  output: any
anchor_relation:
  type: unrelated
naming_function_default: "{name_pred}"   # identity; OBSERVED preserves name
canonical_sql: (none; OBSERVED columns are read directly via project operations)
notes: |
  Every observation-rooted column in an AC has op: OBSERVED with self-referential
  lineage. The framework treats OBSERVED as the marker indicating "data is the root."
  No transformation, no algebra; the column is the data.
For grain-role columns (A = [self]), the framework auto-derives op: OBSERVED.
10.11 Naming-function entries¶
Each operator's naming_function_default provides the default rule the AC's naming function applies. This subsection collects them for reference and discusses overrides.
10.11.1 Default templates¶
| Operator class | Template | Example output |
|---|---|---|
| Identity-preserving reducers (SUM, BIT_OR, HLL_MERGE) | {name_pred} | revenue |
| MAX | peak_{name_pred} | peak_revenue |
| MIN | min_{name_pred} | min_revenue |
| COUNT | {name_pred}_count | revenue_count |
| AVG | avg_{name_pred} | avg_revenue |
| EXACT_DISTINCT | distinct_{name_pred} | distinct_customer_id |
| APPROX_DISTINCT | approx_distinct_{name_pred} | approx_distinct_customer_id |
| VARIANCE / STDDEV | var_{name_pred} / std_{name_pred} | var_revenue |
| MAP_DIV (multi-input) | {name_pred[0]}_per_{name_pred[1]} | revenue_per_cost (overridable to gross_margin_pct) |
| Function operators | varies (e.g., month_of_day) | per operator |
| OBSERVED | {name_pred} | name preserved |
10.11.2 Overrides¶
AC authors override defaults via the naming_function declaration in schema.init (per Chapter 3 §3.6.2). Common overrides:
- MAX → {name_pred}_max (replacing peak_{name_pred}).
- MIN → {name_pred}_min.
- MAP_DIV for specific family pairs → custom names (gross_margin_pct instead of revenue_per_cost).
The catalog provides the defaults; the AC author specializes per their naming conventions.
10.11.3 Naming-function verification¶
When the AC declares a naming function (Options 1, 2, or 3 in Chapter 3 §3.6.2), the framework verifies every non-root ColumnSpec's name matches the function applied to its lineage. Mismatches are Phase 1 errors.
The catalog's default templates are the framework's source-of-truth when Option 1 (use catalog defaults) is selected.
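The check can be pictured with a short sketch. It assumes the templates can be applied with Python's `str.format`, which happens to accept both the `{name_pred}` and `{name_pred[0]}` forms used in §10.11.1; the function names `expected_name` and `verify_name` are hypothetical, and per-family-pair overrides are not modelled.

```python
# A subset of the default templates from §10.11.1.
DEFAULT_TEMPLATES = {
    "SUM": "{name_pred}",
    "MAX": "peak_{name_pred}",
    "COUNT": "{name_pred}_count",
    "AVG": "avg_{name_pred}",
    "MAP_DIV": "{name_pred[0]}_per_{name_pred[1]}",
}


def expected_name(op, predecessors, overrides=None):
    """Apply the naming template for `op` to the predecessor name(s)."""
    templates = {**DEFAULT_TEMPLATES, **(overrides or {})}
    template = templates[op]
    # Single-input operators see a plain string; multi-input operators see a list.
    name_pred = predecessors[0] if len(predecessors) == 1 else list(predecessors)
    return template.format(name_pred=name_pred)


def verify_name(declared, op, predecessors, overrides=None):
    """True when the declared column name matches the naming function."""
    return declared == expected_name(op, predecessors, overrides)


# Usage: defaults, then an AC-level override of the MAX template.
default_max = expected_name("MAX", ["revenue"])
ratio_name = expected_name("MAP_DIV", ["revenue", "cost"])
override_ok = verify_name("revenue_max", "MAX", ["revenue"], {"MAX": "{name_pred}_max"})
```

A declared name that fails `verify_name` corresponds to the Phase 1 mismatch error described above.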
10.12 Missing-value treatment specification¶
Each operator's missing_value_treatment field specifies how the operator handles missing values. The treatment is not a fixed property of the operator alone, nor of the column's missingness anchor M alone — it is a function of both, together with how M relates to the rollup direction. This joint sensitivity is what makes missingness handling principled rather than ad hoc, and it is possible only because M is a first-class per-column coordinate (Foundations §2.4.3) the catalog can dispatch on.
10.12.1 The governing condition: does the rollup neutralize or activate the missingness?¶
Recall (§9.3.2) that a rollup from anchor A_pred to A_succ coarsens away the dimension set Δ = A_pred \ A_succ. The key question for missing-value handling is whether the coordinates that drive missingness survive into the output grain or are coarsened away:
- If M's coordinates are retained in the output grain (none of them lie in Δ) and self ∉ M, then within each output cell the missingness-determining coordinates are held fixed: missingness does not vary within the cell, so excluding (skipping) missing values is unbiased. The grouping has neutralized the mechanism.
- If any of M's coordinates are coarsened away (lie in Δ), then within an output cell the missingness mechanism varies, so skipping produces a biased result. The mechanism is active.
- If self ∈ M (MNAR), no grouping can neutralize the mechanism, because one cannot group on the unobserved value itself. Skipping is never guaranteed unbiased.
This condition — an intersection test between M and Δ, structurally parallel to (but distinct from) the block-set test (§2.8) — is what determines the appropriate treatment. (M and A_block are independent declarations governing different things: A_block governs whether the aggregation is meaningful; M governs how missing values are handled. A measure may be freely additive yet still bias a skip — see Foundations §2.8.3.)
10.12.2 Extensive vs. intensive operators¶
One refinement: the test above concerns bias from the missingness mechanism. A second, orthogonal issue is incompleteness of an extensive aggregate. A SUM over present-only values undercounts the true total whenever any value is missing, regardless of the mechanism — a sum of some values is not an estimate of the sum of all. An AVG, being intensive, does not have this problem: the mean of the present values is an unbiased estimate of the whole when the mechanism is neutralized.
So each operator declares its own predicate for "skip is unbiased here," and the predicate accounts for the operator's extensivity:
- For intensive operators (AVG, ratios): skip is unbiased iff the mechanism is neutralized (M ∩ Δ = ∅ and self ∉ M).
- For extensive operators (SUM, COUNT): skip undercounts whenever data is actually missing, so skip is essentially never the unbiased choice under missingness.
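A small numeric illustration of the extensive/intensive split, using invented positive values for a single output cell. Each value is dropped in turn as a stand-in for one value going missing independently of anything that varies within the cell (the neutralized case).

```python
# Hypothetical complete values for one output cell (all positive).
values = [8.0, 12.0, 10.0, 14.0]
true_sum = sum(values)              # 44.0
true_avg = true_sum / len(values)   # 11.0

# One "present-only" multiset per choice of which value is missing.
present_sets = [values[:i] + values[i + 1:] for i in range(len(values))]

skip_sums = [sum(p) for p in present_sets]            # [36.0, 32.0, 34.0, 30.0]
skip_avgs = [sum(p) / len(p) for p in present_sets]

# Averaged over which value is missing, the skip-AVG recovers the true mean...
mean_of_skip_avgs = sum(skip_avgs) / len(skip_avgs)
# ...but every skip-SUM falls short of the true total.
all_sums_short = all(s < true_sum for s in skip_sums)
```

Any single skip-AVG may land above or below 11.0, but it is right on average; no skip-SUM ever reaches 44.0.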
10.12.3 The Coframe Core treatment rule¶
Coframe Core applies a single, conservative rule, per (operator, missingness):
If an unbiased result is achievable by skipping missing values, skip. Otherwise, propagate.
Skip (exclude) drops missing values from the input multiset; it is selected only when the operator's skip-unbiased predicate (§10.12.2) holds. Propagate marks the affected output cells as carrying missingness rather than returning a silently-biased number — the missingness analogue of the framework's "refuse rather than guess" posture (Chapter 1 §1.6). Core does not impute and does not silently launder a biased aggregate into a clean-looking value.
This rule is decidable from structure alone — the operator, M, and Δ — and requires no statistical model. It is deliberately conservative: it tells the truth about missingness rather than attempting to correct for it.
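The decision can be sketched as a pure function of the operator, M, and the two anchors. This is an illustrative reading of the rule, not the catalog's actual predicate format: the `INTENSIVE` set, the `"self"` marker, and the function names are assumptions, and the sketch presumes the column actually has missing values.

```python
INTENSIVE = {"AVG"}   # assumed classification, for illustration only
SELF = "self"         # hypothetical marker for self-dependence (MNAR) in M


def mechanism_neutralized(m, a_pred, a_succ):
    """True when no coordinate of M is coarsened away and self is not in M."""
    delta = set(a_pred) - set(a_succ)   # Δ = A_pred \ A_succ
    return SELF not in m and not (set(m) & delta)


def treatment(op, m, a_pred, a_succ):
    """Return 'skip' when skipping is unbiased for this operator, else 'propagate'."""
    # Extensive operators (SUM, COUNT) undercount under a skip, so only
    # intensive operators with a neutralized mechanism qualify for skip.
    if op in INTENSIVE and mechanism_neutralized(m, a_pred, a_succ):
        return "skip"
    return "propagate"


# Usage: roll up from (region, store, day) to (region, day), so Δ = {store}.
fine = ("region", "store", "day")
coarse = ("region", "day")
avg_region_driven = treatment("AVG", {"region"}, fine, coarse)   # M survives
avg_store_driven = treatment("AVG", {"store"}, fine, coarse)     # M hits Δ
sum_region_driven = treatment("SUM", {"region"}, fine, coarse)   # extensive
```

No statistical model appears anywhere in the function, which is the sense in which the rule is decidable from structure.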
10.12.4 Coframe Pro: imputation and tolerance¶
Correcting for missingness — rather than skipping-when-safe or propagating — requires information or judgment the framework does not have: an imputation value or model, or a tolerance for residual bias. Coframe Pro exposes these to the AC author and consumer:
- Imputation: replacing missing values with a declared value or a model-derived estimate (e.g., a group mean, or a value from an external source). This is a judgment call encoding external knowledge, so it is user-controlled, not automatic.
- Tolerance levels: declaring how much mechanism-induced bias is acceptable before a result is flagged or refused, rather than always propagating.
These are Pro because they require a human decision; Core confines itself to what is decidable from the declared structure.
10.12.5 Output missingness anchor derivation¶
When a rollup is applied to a column with declared missingness, the framework derives the output column's missingness anchor from the input's M, the operator, and Δ:
- Skip selected (mechanism neutralized): the output's missingness is reduced, because the coordinates that drove it are now fixed within each cell.
- Propagate selected: the output inherits the relevant coordinates of M (those that survive into the output grain), and inherits self-dependence if the input was MNAR.
Chapter 3's M declaration on a derived column is verified against this derivation: the framework checks that a declared output M matches what the lineage predecessor's M and the operator imply.
10.12.6 Behavior primitives¶
The treatments above are realized by these primitives, recorded per operator:
- exclude (skip): missing values are dropped from the input multiset.
- propagate: affected output cells are marked as carrying missingness.
- propagate_null: for functions, a null input yields a null output (ABS(NULL) → NULL).
- include_as_zero: missing treated as 0, only when the AC author explicitly commits to this semantics (not a default).
- impute (Pro): replace with a declared or model-derived value.
10.13 Cross-cutting notes¶
10.13.1 The complete monoid summary¶
For quick reference, here are the natively monoidal operators with their commutative monoid structures:
| Reducer | Value space | Monoid operation | Identity |
|---|---|---|---|
| SUM | ℝ | + | 0 |
| MAX | ord | max | -∞ |
| MIN | ord | min | +∞ |
| COUNT | ℕ | + | 0 |
| AND_AGG | 𝔹 | ∧ | true |
| OR_AGG | 𝔹 | ∨ | false |
| BIT_AND | bitmap | ∧ | all-ones |
| BIT_OR | bitmap | ∨ | all-zeros |
| BIT_XOR | bitmap | ⊕ | all-zeros |
| HLL_MERGE | hll_sketch | merge | empty sketch |
| THETA_UNION | theta_sketch | union | empty sketch |
| THETA_INTERSECTION | theta_sketch | intersection | universal sketch |
| T_DIGEST_MERGE | t_digest | merge | empty digest |
| KLL_MERGE | kll_sketch | merge | empty sketch |
This table is the catalog's central structural fact: each row is a partition-invariant reducer eligible as an ip_reducer in a metric family.
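Partition-invariance can be checked directly for a few of the rows above: reducing each partition and then reducing the partial results gives the same answer as reducing everything at once, and an empty partition contributes only the identity. The `MONOIDS` mapping and helper names are illustrative; COUNT is left out because it first maps each row to 1 and then behaves like SUM.

```python
from functools import reduce
import math

# (combine, identity) pairs for a few rows of the table above.
MONOIDS = {
    "SUM": (lambda a, b: a + b, 0),
    "MAX": (max, -math.inf),
    "MIN": (min, math.inf),
    "AND_AGG": (lambda a, b: a and b, True),
    "OR_AGG": (lambda a, b: a or b, False),
}


def fold(name, values):
    """Reduce a sequence with the named monoid, starting from its identity."""
    combine, identity = MONOIDS[name]
    return reduce(combine, values, identity)


def fold_partitioned(name, partitions):
    """Reduce each partition separately, then reduce the partial results."""
    return fold(name, [fold(name, part) for part in partitions])


# Usage: one empty partition is included on purpose.
partitions = [[3, 1], [4], [], [1, 5]]
flat = [v for part in partitions for v in part]
agree = all(
    fold_partitioned(name, partitions) == fold(name, flat)
    for name in ("SUM", "MAX", "MIN")
)
```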
10.13.2 Sketch types as first-class¶
The catalog includes operators on sketch types (HLL, theta, t-digest, KLL) as first-class natively monoidal reducers. This is a distinguishing feature of Coframe's framing relative to OLAP literature that treats sketches as approximation tools rather than as algebraic objects.
Consequences:
- An AC may declare metric families backed by sketches (e.g., customer_visitor_sketch typed hll_sketch).
- The framework reasons about cross-grain navigation of sketch-typed families identically to numeric-typed families.
- Per-lineage-edge attestation (Chapter 7 §7.5) compares sketches structurally, with sketch-specific epsilon semantics.
- Frame-QL queries can request APPROX_DISTINCT(sketch_metric) as a final projection over the sketch family.
The framework's monoidal framing is what enables this uniformity.
10.13.3 Catalog versioning¶
Catalog versions are semver-style: coframe-core-catalog/<major>.<minor>.<patch>. Patch versions add operator entries or refine specifications without changing existing behavior. Minor versions may add operator classes. Major versions may change behavior (rare).
ACs target a specific catalog version. The framework includes shimming for compatibility across minor and patch versions.
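A minimal sketch of parsing a catalog version tag and testing whether two versions fall inside the same major line, which is the range the shimming is described as covering. The function names are hypothetical, treating omitted trailing components as 0 is an assumption, and the actual shim logic is not specified here.

```python
PREFIX = "coframe-core-catalog/"


def parse_catalog_version(tag):
    """Parse 'coframe-core-catalog/<major>.<minor>.<patch>' into a 3-tuple."""
    if not tag.startswith(PREFIX):
        raise ValueError(f"not a catalog version tag: {tag!r}")
    parts = [int(p) for p in tag[len(PREFIX):].split(".")]
    if not 1 <= len(parts) <= 3:
        raise ValueError(f"expected at most three components: {tag!r}")
    # Assumption: omitted trailing components are treated as 0.
    return tuple(parts + [0] * (3 - len(parts)))


def same_major(ac_target, installed):
    """True when both versions share a major number (minor/patch may differ)."""
    return parse_catalog_version(ac_target)[0] == parse_catalog_version(installed)[0]
```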
10.13.4 Backend compatibility¶
Some catalog operators require specific backend capabilities. The catalog declares per-operator backend compatibility; an AC that references an operator the backend doesn't support is flagged at validation time.
Common dependencies:
- Sketch operators (HLL_MERGE, etc.) require backend sketch implementations.
- Approximate-distinct functions vary in support across backends.
- Specific date functions vary in syntax and semantics.
The data-API protocol (Chapter 6) abstracts most of these via backend-specific overrides; an AC bound to a backend whose protocol implementation lacks an operator's support encounters a backend-incompatibility error.
Where the runtime mapping lives. The catalog's backend_overrides field (§10.2) is the catalog-time defaults per known backend. At runtime, the mapping from catalog operator names to physical operator names is layered:
- L1 (catalog): defines what operators exist and their structural properties (this chapter).
- L2 (installation operator_registry): maps L1 operator names → physical operator names for the installation's bound backend. Seeded from backend_overrides; per-installation customizations land on top.
- L3 (per-AC operator_overrides): an optional override patch (usually empty or a small handful of entries) that the AC author uses to bind a custom UDF or shadow a default.
Effective at runtime = L2 ⊕ L3-overrides, with L3 winning on overlap. A query referencing an operator the effective registry has no binding for is refused with a structured error (UnknownOperatorError if the operator isn't in L1; OperatorUnsupportedError if it's in L1 but unbound at L2 and L3). See coframe_platform_design_v2_1_supplement.md §3.6 for the full layering specification.
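The layering can be sketched with plain dictionaries. The two error names come from the text above, but their shape as Python exception classes, the function names, and the physical operator names in the usage lines are assumptions made for illustration.

```python
class UnknownOperatorError(Exception):
    """The operator is not defined in the L1 catalog."""


class OperatorUnsupportedError(Exception):
    """The operator is in L1 but has no binding at L2 or L3."""


def effective_registry(l2, l3):
    """L2 ⊕ L3-overrides, with L3 winning on overlap."""
    return {**l2, **l3}


def resolve_operator(name, l1, l2, l3=None):
    """Map a catalog operator name to its physical name, or refuse."""
    if name not in l1:
        raise UnknownOperatorError(name)
    effective = effective_registry(l2, l3 or {})
    if name not in effective:
        raise OperatorUnsupportedError(name)
    return effective[name]


# Usage with hypothetical bindings.
l1 = {"SUM", "HLL_MERGE", "APPROX_DISTINCT"}
l2 = {"SUM": "sum", "APPROX_DISTINCT": "approx_count_distinct"}
l3 = {"APPROX_DISTINCT": "my_udf_distinct"}   # per-AC shadowing of a default
bound_sum = resolve_operator("SUM", l1, l2, l3)
bound_distinct = resolve_operator("APPROX_DISTINCT", l1, l2, l3)
```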
10.13.5 Extending the catalog (Pro)¶
Coframe Pro supports user-defined operators added to the catalog. The Pro mechanism requires the user to specify:
- The operator's structural properties (the full catalog entry format).
- A proof or attestation of partition-invariance (when claimed).
- A backend protocol implementation (the SQL or native form per backend).
- DQ-test scaffolding for the operator's properties.
Coframe Core's catalog is fixed; Core ACs use only the operators specified in this chapter.
10.14 Summary¶
The operator catalog is the framework's source of truth for operator behavior. Each entry specifies:
- The operator's classification (reducer / function / multi-input / observed).
- Its partition-invariance (natively monoidal / liftably monoidal / holistic / not applicable).
- Its identity-preservation default and overrides.
- Its type signature and anchor relation.
- Its missing-value treatment per input M signature.
- Its default naming-function template.
- Its canonical SQL form and backend overrides.
The catalog's monoidal framing is what enables Coframe Core to treat sketch types (HLL, theta, t-digest, KLL) as first-class partition-invariant reducers, accommodating non-numeric value spaces uniformly. This is the framework's algebraic core: partition-invariance is a property of (value space, operator) pairs forming commutative monoids, regardless of whether the value space is numeric.
Liftably monoidal reducers (AVG, VARIANCE, APPROX_DISTINCT) extend the framework's expressiveness by carrying enriched state through aggregation and projecting the result at the final grain. Holistic reducers (EXACT_DISTINCT, EXACT_PERCENTILE, MODE) are anchor-locked — useful at fixed grains but not for cross-grain navigation.
The catalog is versioned. ACs target specific versions; the framework supports backward-compatible shimming across minor and patch versions.
Subsequent chapters reference this catalog for ColumnSpec op field values (Chapter 3), for the framework's reasoning over partition-invariance and identity-preservation (Foundations §2.6), for the data-API protocol's SQL generation (Chapter 6), for missing-value treatment in DQ verification (Chapter 7), and for Frame-QL expressions (Chapter 8). The catalog ties all of these together.
Chapter 11: The Metric Engine¶
An optional per-AC query engine that sits between the resolver and the backend. Memoises per-(family, anchor) aggregates on a Polars + Parquet substrate; serves them via three-branch dispatch (exact match → FD-DAG rollup → backend fallback); composes multi-metric Frames in one operation; hosts metric and quasi-metadata domains on a shared store. The chapter specifies the engine's language-level semantics — its domains, its serve / compose lifecycle, its memoisation policy, its interaction with verification levels and the stability window, and the cache_hint surface the AC author drives it through.
11.1 What this chapter does¶
The framework specified in Chapters 1–10 makes no commitment about where computed values live between the moment a Frame-QL query is resolved and the moment its result is returned. Chapter 6 establishes a Protocol with the backend; Chapter 9 specifies how a query is planned into Protocol calls; the backend supplies the rest. The engine introduced in this chapter is an optional acceleration substrate layered above that boundary: when enabled, it caches per-(metric family, anchor) aggregates and serves them in place of backend calls, transparently to the resolver and to the AC author.
The engine is language-level — meaning, it is part of what Coframe Core specifies, not a Pro extension or an implementation detail of a specific backend. An AC's behavior with the engine enabled is fully determined by its declarations (the cache_hint blocks of §3.5.7, the engine block of §11.2's installation.yaml) plus the engine's serve / compose / eviction lifecycle specified here. The engine adds no new query language, no new query semantics, no new attestation regime. What it adds is a deterministic specification of when a Frame is computed from materialised state versus from the live backend, and how those two paths agree.
This chapter is reference-style. Readers implementing the engine, integrating it into a custom resolver, or interpreting its outputs can read it as a procedural specification. Readers who only need to understand the AC's behavior can skim §11.1, §11.2, and §11.6 — the rest is implementation-facing detail.
The chapter is organised as: §11.2 the opt-in model + on-disk layout; §11.3 the multi-domain substrate (METRIC + QM); §11.4 the serve() lifecycle (the three-branch dispatch); §11.5 compose(), eviction, refresh; §11.6 the served_from indicator and what it means for the caller; §11.7 cache_hint pre-materialisation and promotion recommendations; §11.8 verification-level inheritance; §11.9 stability-window invalidation; §11.10 the integration boundary with Chapter 9's resolver; §11.11 forward-compatible domains; §11.12 summary.
11.2 The opt-in model¶
The engine is per-AC, opt-in, declared in the installation's installation.yaml. The relevant block:
backend:
  type: duckdb
  source: ./retail_demo.duckdb
metric_engine:                   # optional; omit to disable
  enabled: true
  max_bytes_per_ac: 1073741824   # 1 GiB byte budget per AC
  # Future: per-domain budgets, eviction-policy tuning, etc.
Two axes of opt-in:
- Per-installation: omit the metric_engine block (or set enabled: false) and no engine instance is constructed. The resolver falls back to direct backend execution per Chapter 9. ACs work identically, just without memoisation.
- Per-AC budget: max_bytes_per_ac bounds the engine's on-disk footprint for each AC. Materialised entries beyond the budget trigger eviction (see §11.5). A common starting value is 1 GiB; production deployments may tune this per AC's working-set size.
On-disk layout. Each engine instance owns a directory rooted at <dev_data>/metric_engine/<installation_id>/<ac_name>/:
<engine_root>/
├── manifest.db # SQLite — the EngineEntry catalog
└── entries/
├── <hash1>.parquet # one Parquet file per materialised entry
├── <hash2>.parquet
└── ...
The manifest is a single-process SQLite database holding the entries table (one row per EngineEntry — see §11.3.2). The Parquet files hold the entries' columnar data. The layout is intentionally simple: no per-domain subdirectories, no per-family namespacing on disk. The manifest's (domain, dataset_id, anchor_signature) composite key is the only index the engine consults.
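A rough sketch of a manifest keyed this way, using Python's standard `sqlite3` module with an in-memory database standing in for manifest.db. The column set is reduced to the composite key plus a path, and the comma-joined signature and the SHA-256 file naming are assumptions; the text only establishes that the key is (domain, dataset_id, anchor_signature) and that each entry has one hashed Parquet file.

```python
import hashlib
import sqlite3


def anchor_signature(anchor):
    """Sort dimension names so set-equal anchors get the same signature."""
    return ",".join(sorted(anchor))


def entry_filename(domain, dataset_id, anchor):
    """Hypothetical naming scheme: a truncated SHA-256 of the composite key."""
    key = f"{domain}|{dataset_id}|{anchor_signature(anchor)}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16] + ".parquet"


conn = sqlite3.connect(":memory:")   # stand-in for manifest.db
conn.execute(
    """
    CREATE TABLE entries (
        domain           TEXT NOT NULL,
        dataset_id       TEXT NOT NULL,
        anchor_signature TEXT NOT NULL,
        parquet_path     TEXT NOT NULL,
        PRIMARY KEY (domain, dataset_id, anchor_signature)
    )
    """
)


def register(domain, dataset_id, anchor):
    """Record an entry under its composite key and return its file name."""
    path = entry_filename(domain, dataset_id, anchor)
    conn.execute(
        "INSERT OR REPLACE INTO entries VALUES (?, ?, ?, ?)",
        (domain, dataset_id, anchor_signature(anchor), path),
    )
    return path


def lookup(domain, dataset_id, anchor):
    """Exact-match lookup on the composite key; None when absent."""
    row = conn.execute(
        "SELECT parquet_path FROM entries "
        "WHERE domain = ? AND dataset_id = ? AND anchor_signature = ?",
        (domain, dataset_id, anchor_signature(anchor)),
    ).fetchone()
    return row[0] if row else None
```

Registering `revenue` at `("region", "quarter")` and looking it up at `("quarter", "region")` returns the same path, since both anchors canonicalise to the same signature.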
The engine is process-local: a single Python process owns the manifest's write lock; cross-process coordination (multi-worker uvicorn, multiple machines sharing a deployment) is a Pro concern and out of scope for Coframe Core. The runtime / workbench HTTP apps each construct their own EngineRegistry (one engine per (installation_id, ac_name)) and serve their own subset of queries; sharing across processes requires either a single-worker deployment or Pro-tier coordination.
11.3 The multi-domain substrate¶
11.3.1 Two domains, one store¶
The engine hosts two domains on a shared substrate:
| Domain | Stores | Used by |
|---|---|---|
| METRIC | Per-(metric family, anchor) aggregates, e.g., revenue@(region,) materialised as a 3-row Parquet | Frame-QL resolution (Chapter 9); the canonical engine workload |
| QM | Quasi-metadata: per-column profiles (cardinality, nulls, type, sample values), keyed at (schema, column) | DQ verification (Chapter 7); the Workbench's Table Explorer and Lineage pages |
Both domains share the manifest, the Parquet substrate, the access-tracking machinery, and the eviction policy. They differ in two ways:
- METRIC entries support FD-DAG rollup (§11.4 branch 2). The engine can serve a query at a coarser grain by aggregating a finer cached entry.
- QM entries are exact-match-or-compute (§11.4 branch 3, no rollup). A request for (schema, column) either hits the manifest exactly or triggers re-profiling via the backend's extract_to_lazyframe (Chapter 6 §6.2.5).
The shared substrate is what makes the engine more than a metric cache. The same machinery the AC's authors interact with at COMMIT time to declare hot metric grains also powers the Workbench's introspection views; the same eviction policy that bounds metric memoisation bounds QM memoisation. Future domains (LINEAGE, INTEGRITY_RESULTS, SKETCH — see §11.11) extend the same substrate.
11.3.2 The EngineEntry record¶
Each materialised entry is captured by an EngineEntry:
| Field | Type | Notes |
|---|---|---|
| domain | METRIC / QM | Domain discriminator |
| dataset_id | string | METRIC: the metric family name (e.g., "revenue"). QM: the profile kind (e.g., "column_profile") |
| anchor | tuple of dimension names | Canonicalised (sorted) so set-equal anchors hash identically |
| parquet_path | string | Relative path under <engine_root>/entries/ |
| operator | string \| null | METRIC-only: the reducer used (SUM, MAX, ...) |
| partition_invariant | bool \| null | METRIC-only: true iff the entry is eligible for further FD-DAG rollup |
| mvt | string \| null | METRIC-only: skip / propagate / impute |
| imputation_value | any \| null | METRIC-only: literal used when mvt=impute |
| source_schemas | list of strings | Schemas this entry was materialised from |
| source_filter | dict \| null | The AC's effective filter at materialisation time |
| row_count | int | |
| byte_size | int | The Parquet file's size on disk; the budget enforcement basis |
| materialized_at | datetime | Wall-clock at first materialisation |
| stability_cutoff | datetime \| null | The AC's stability hold-off cutoff at materialisation time (§11.9) |
| last_access_at | datetime | Bumped on every cache hit |
| access_count | int | Counter for the recommendation surface (§11.7) |
| verification_level | A/AA/AAA \| null | The AC's verification level at materialisation time (§11.8) |
The full schema lives in coframe.metric_engine.types.EngineEntry. The fields with null defaults distinguish METRIC-domain entries (which carry operator metadata) from QM-domain entries (which don't).
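To illustrate why the anchor is canonicalised, here is a small sketch for METRIC anchors: sorting the dimension names makes set-equal anchors produce the same manifest key and therefore the same entry file. The helper names and the hashing scheme (truncated SHA-256 over the composite key) are assumptions made for this example; the manual does not specify how the <hash>.parquet names are derived.

```python
import hashlib
from typing import Iterable, Tuple


def canonical_anchor(dims: Iterable[str]) -> Tuple[str, ...]:
    """Sort and de-duplicate so set-equal anchors compare and hash identically."""
    return tuple(sorted(set(dims)))


def entry_filename(domain: str, dataset_id: str, dims: Iterable[str]) -> str:
    """Hypothetical naming scheme: hash of (domain, dataset_id, canonical anchor)."""
    key = "\x1f".join([domain, dataset_id, *canonical_anchor(dims)])
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return digest[:16] + ".parquet"


# ("region", "day") and ("day", "region") name the same grain.
same_a = entry_filename("METRIC", "revenue", ("region", "day"))
same_b = entry_filename("METRIC", "revenue", ("day", "region"))
```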
11.4 The serve() lifecycle¶
11.4.1 Signature¶
The engine's central API call:
engine.serve(
    domain: Domain,
    dataset_id: str,
    anchor: tuple[str, ...],
    *,
    operator: str | None = None,
    partition_invariant: bool = False,
    mvt: str | None = None,
    imputation_value: Any = None,
    source_table: str | None = None,
    source_column: str | None = None,
    source_schemas: list[str] | None = None,
    backend: DataAPIBackend | None = None,
) -> pl.LazyFrame
Returns a single-metric Polars LazyFrame shaped [<anchor dims>, <dataset_id>]: the anchor's dimension columns plus one value column named after the dataset. The METRIC kwargs (operator + partition-invariance + mvt + source metadata + backend) are needed only for branches that may fall back to the backend; branches that resolve from cache ignore them.
11.4.2 The three branches (METRIC domain)¶
serve(METRIC, dataset_id, anchor)
│
├── Branch 1: Exact match
│ manifest.lookup(METRIC, dataset_id, anchor) is not None
│ → scan that entry's Parquet, bump access stats, return
│
├── Branch 2: FD-DAG rollup
│ find a partition-invariant cached entry whose anchor is a
│ strict superset of `anchor` (i.e., finer than what we need)
│ → scan it, group_by(anchor), agg via operator
│ → memoise the rollup at (dataset_id, anchor)
│ → return
│
└── Branch 3: Backend fallback
no eligible cached entry exists
→ backend.aggregate(...) for the single metric at the requested anchor
→ memoise as a new entry
→ return
Each branch's exit point is a pl.LazyFrame; the caller (typically compose() — see §11.5) treats all three uniformly.
Branch 2's reachability rule (v0.1). The engine considers a cached entry $E$ a rollup source for target anchor $A$ iff:
- $E$ is partition-invariant (AVG, MEDIAN, COUNT_DISTINCT are excluded; they cannot be further rolled up).
- $E$'s anchor is a strict superset of $A$ (subset reachability).
Cost heuristic: the cheapest candidate by row_count wins. The framework specification leaves the cost model as "minimum rows to scan + group"; implementations may refine. A richer reachability rule that follows FD-edges (e.g., serving revenue@(region,) from a revenue@(store_id,) entry by traversing the store_id → region FD) is forward-compatible for a v0.2 refinement.
Memoisation in branches 2 and 3. The result is written back to the manifest under the requested (dataset_id, anchor) so the next call at the same grain hits Branch 1. This is the engine's lazy memoisation discipline (§11.7.1): observed workload populates the cache.
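The three-branch dispatch and its memoisation can be sketched in plain Python. This is an illustrative stand-in, not the engine's implementation: TinyEngine keeps rows as lists of dicts in memory instead of a manifest plus Parquet files, treats every cached entry as partition-invariant when the operator is one of SUM / MAX / MIN, and takes the backend as a plain callable. All names here are hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple

Row = Dict[str, object]
Key = Tuple[str, Tuple[str, ...]]  # (dataset_id, canonical anchor)
REDUCERS = {"SUM": sum, "MAX": max, "MIN": min}  # reducers that can be re-applied


class TinyEngine:
    """In-memory stand-in for the manifest + Parquet store (METRIC domain only)."""

    def __init__(self) -> None:
        self.cache: Dict[Key, List[Row]] = {}
        self.last_branch: Optional[int] = None

    def serve(self, dataset_id: str, anchor: Tuple[str, ...], operator: str,
              backend: Callable[[str, Tuple[str, ...]], List[Row]]) -> List[Row]:
        anchor = tuple(sorted(anchor))
        key = (dataset_id, anchor)
        if key in self.cache:  # Branch 1: exact match
            self.last_branch = 1
            return self.cache[key]
        # Branch 2 candidates: same dataset, anchor a strict superset of the target.
        finer = [rows for (ds, a), rows in self.cache.items()
                 if ds == dataset_id and set(a) > set(anchor)]
        if finer and operator in REDUCERS:
            source = min(finer, key=len)  # cheapest candidate by row count
            groups: Dict[tuple, list] = {}
            for row in source:
                group_key = tuple(row[d] for d in anchor)
                groups.setdefault(group_key, []).append(row[dataset_id])
            rows = [{**dict(zip(anchor, g)), dataset_id: REDUCERS[operator](vals)}
                    for g, vals in groups.items()]
            self.last_branch = 2
        else:  # Branch 3: backend fallback
            rows = backend(dataset_id, anchor)
            self.last_branch = 3
        self.cache[key] = rows  # memoise so the next call hits Branch 1
        return rows


backend_calls: List[Tuple[str, ...]] = []


def fake_backend(dataset_id: str, anchor: Tuple[str, ...]) -> List[Row]:
    """Test double: records the call and returns day x region rows."""
    backend_calls.append(anchor)
    return [
        {"day": "d1", "region": "east", "revenue": 10},
        {"day": "d1", "region": "west", "revenue": 5},
        {"day": "d2", "region": "east", "revenue": 7},
    ]


engine = TinyEngine()
engine.serve("revenue", ("region", "day"), "SUM", fake_backend)        # Branch 3
by_region = engine.serve("revenue", ("region",), "SUM", fake_backend)  # Branch 2
engine.serve("revenue", ("region",), "SUM", fake_backend)              # Branch 1
```

In this run the backend is consulted once, for the finest grain; the per-region result is rolled up from the cached entry and then served by exact match on the third call.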
11.4.3 The QM-domain branches¶
QM has the same three branches, simpler:
- Branch 1 (exact match): hit on (QM, "column_profile", (schema, column)) → return.
- Branch 2 is not used: QM entries do not support rollup. The dimension grain is (schema, column) and there is no coarser grain to roll up to.
- Branch 3 (fallback): re-profile via engine.profile_table(schema, lazyframe), which calls the backend's extract_to_lazyframe (Chapter 6 §6.2.5), computes the profile, and memoises every column's profile in one pass.
The QM domain's smaller API (profile_table, column_profile, table_profile) wraps these branches.
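The exact-match-or-compute behaviour can be sketched as follows. The function names echo the ones mentioned above, but the signatures, the in-memory cache, and the profile fields are assumptions for illustration only; the point shown is that a single miss profiles the whole table, so later requests for sibling columns are exact-match hits.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, object]
Profile = Dict[str, object]
QMCache = Dict[Tuple[str, str], Profile]  # keyed at (schema, column)


def profile_table(cache: QMCache, schema: str, rows: List[Row]) -> None:
    """Profile every column in one pass and memoise each under (schema, column)."""
    columns = list(rows[0].keys()) if rows else []
    for col in columns:
        values = [r[col] for r in rows]
        non_null = [v for v in values if v is not None]
        cache[(schema, col)] = {
            "cardinality": len(set(non_null)),
            "nulls": len(values) - len(non_null),
            "type": type(non_null[0]).__name__ if non_null else "null",
        }


def column_profile(cache: QMCache, schema: str, column: str,
                   extract: Callable[[str], List[Row]]) -> Profile:
    if (schema, column) not in cache:   # miss: Branch 3, re-profile the table
        profile_table(cache, schema, extract(schema))
    return cache[(schema, column)]      # hit: Branch 1


extract_calls: List[str] = []


def fake_extract(schema: str) -> List[Row]:
    """Test double standing in for the backend extraction call."""
    extract_calls.append(schema)
    return [{"region": "east", "units": 3},
            {"region": "west", "units": None},
            {"region": "east", "units": 4}]


qm: QMCache = {}
region = column_profile(qm, "sales", "region", fake_extract)
units = column_profile(qm, "sales", "units", fake_extract)  # served from the same pass
```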
11.5 compose(), evict(), refresh()¶
11.5.1 compose() — multi-metric Frame assembly¶
After per-metric serve() calls return single-column LazyFrames, compose() joins them at the target grain and applies post-grain operations:
engine.compose(
    entries: list[pl.LazyFrame],
    target_grain: tuple[str, ...],
    *,
    frame_expression: Callable[[pl.LazyFrame], pl.LazyFrame] | None = None,
    having: pl.Expr | None = None,
    order_by: list[tuple[str, bool]] | None = None,
    limit: int | None = None,
    limit_per: tuple[list[str], int] | None = None,
) -> pl.DataFrame
Order of operations: join → frame_expression → having → order_by → limit_per → limit → collect.
The frame_expression parameter is the engine's affordance for derived metrics computed post-aggregation — well-defined for any expression over already-aggregated SUMs / MAXes / etc. The canonical example is profit = revenue - cost:
import polars as pl

def derive(lf: pl.LazyFrame) -> pl.LazyFrame:
    # Derived metric computed after aggregation: profit = revenue - cost.
    return lf.with_columns(
        (pl.col("revenue") - pl.col("cost")).alias("profit"),
    )

# One serve() per metric at the shared grain, then a single compose().
revenue_lf = engine.serve(METRIC, "revenue", ("region",), operator="SUM", partition_invariant=True)
cost_lf = engine.serve(METRIC, "cost", ("region",), operator="SUM", partition_invariant=True)
frame = engine.compose([revenue_lf, cost_lf], ("region",), frame_expression=derive)
(Naming note: per the convention profit = revenue − cost; gross profit = revenue − COGS; margin = profit / revenue; gross margin = gross profit / revenue. The retail dataset's cost is the all-in per-transaction cost, so the subtraction is profit, not gross profit.)
The frame_expression is arbitrary Polars; weighted ratios, percent-of-total, cohort math, anything Polars can express on a LazyFrame works. AC-level declaration of derived families (so a Frame-QL author can write SELECT region, profit AT region directly) is forward-compatible resolver work — the design is captured in drafts/coframe_derived_metrics_design_v0_1.md; today the engine exposes the affordance programmatically.
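The order of operations listed above (join → frame_expression → having → order_by → limit_per → limit) can be made concrete with a plain-Python sketch over lists of dicts. It is not the engine's compose(): the function name compose_rows is hypothetical, the join is assumed to merge rows that share a grain key, and the boolean in each order_by pair is assumed to mean "descending", which the signature above does not state.

```python
from typing import Callable, Dict, List, Optional, Tuple

Row = Dict[str, object]


def compose_rows(entries: List[List[Row]], target_grain: Tuple[str, ...], *,
                 frame_expression: Optional[Callable[[List[Row]], List[Row]]] = None,
                 having: Optional[Callable[[Row], bool]] = None,
                 order_by: Optional[List[Tuple[str, bool]]] = None,
                 limit_per: Optional[Tuple[List[str], int]] = None,
                 limit: Optional[int] = None) -> List[Row]:
    # 1. join: merge per-metric rows that share the same grain key
    joined: Dict[tuple, Row] = {}
    for rows in entries:
        for row in rows:
            joined.setdefault(tuple(row[d] for d in target_grain), {}).update(row)
    out = list(joined.values())
    # 2. frame_expression: derived columns over already-aggregated values
    if frame_expression is not None:
        out = frame_expression(out)
    # 3. having: filter on the (possibly derived) columns
    if having is not None:
        out = [r for r in out if having(r)]
    # 4. order_by: stable sorts applied from the last key to the first
    for col, descending in reversed(order_by or []):
        out.sort(key=lambda r: r[col], reverse=descending)
    # 5. limit_per: keep the first n rows of each partition
    if limit_per is not None:
        part_cols, n = limit_per
        seen: Dict[tuple, int] = {}
        kept = []
        for r in out:
            k = tuple(r[c] for c in part_cols)
            seen[k] = seen.get(k, 0) + 1
            if seen[k] <= n:
                kept.append(r)
        out = kept
    # 6. limit: global row cap, applied last
    if limit is not None:
        out = out[:limit]
    return out


revenue = [{"region": "east", "revenue": 17}, {"region": "west", "revenue": 5}]
cost = [{"region": "east", "cost": 9}, {"region": "west", "cost": 4}]


def derive(rows: List[Row]) -> List[Row]:
    return [dict(r, profit=r["revenue"] - r["cost"]) for r in rows]


frame = compose_rows([revenue, cost], ("region",), frame_expression=derive,
                     having=lambda r: r["profit"] > 0,
                     order_by=[("profit", True)], limit=1)
```

Because having runs after frame_expression, the filter can reference the derived profit column; because limit runs last, it caps the already-ordered result.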
11.5.2 evict(entry) — explicit removal¶
evict(domain, dataset_id, anchor) removes one entry: deletes the Parquet, removes the manifest row. The engine also evicts opportunistically when the byte budget is exceeded:
- After each ingest, the engine sums byte_size across all entries.
- If the sum exceeds max_bytes_per_ac, the engine evicts entries by ascending eviction score until the budget is honored.
- The eviction score is a function of (access_count, age, byte_size): entries that are cheap to recompute (small) and rarely accessed are evicted first.
The exact score formula is implementation-defined within Coframe Core; the specification commits only that the score is monotonic in the listed factors and that the budget is honored.
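A minimal sketch of budget enforcement follows. Since the score formula is implementation-defined, the one used here is purely hypothetical: it grows with access_count and byte_size and shrinks with age, so small, cold, old entries score lowest and are evicted first. The Entry class and function names are likewise invented for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Entry:
    name: str
    byte_size: int
    access_count: int
    age_days: float


def eviction_score(e: Entry) -> float:
    """Hypothetical score; lower means evicted sooner."""
    return (1 + e.access_count) * e.byte_size / (1.0 + e.age_days)


def enforce_budget(entries: List[Entry], max_bytes: int) -> List[Entry]:
    """Evict by ascending score until total byte_size fits; returns evicted entries."""
    total = sum(e.byte_size for e in entries)
    evicted: List[Entry] = []
    for e in sorted(entries, key=eviction_score):
        if total <= max_bytes:
            break
        entries.remove(e)
        evicted.append(e)
        total -= e.byte_size
    return evicted


entries = [
    Entry("hot_big", byte_size=600, access_count=50, age_days=1.0),
    Entry("cold_small", byte_size=100, access_count=0, age_days=30.0),
    Entry("warm", byte_size=400, access_count=5, age_days=2.0),
]
evicted = enforce_budget(entries, max_bytes=1000)
```

Here the three entries total 1100 bytes against a 1000-byte budget, so evicting the lowest-scoring entry alone is enough to honor the budget.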
11.5.3 refresh(entry) — re-materialise in place¶
refresh(domain, dataset_id, anchor) evicts and re-materialises the entry against the current backend state, preserving the manifest's dataset_id / anchor / operator but resetting materialized_at, last_access_at, and access_count. The new entry inherits the AC's current verification_level and stability cutoff (§11.8, §11.9).
Refresh is the explicit response to verification-level changes, schema reloads, and stability-cutoff advances. The framework does not auto-refresh: cached entries reflect their materialisation-time state until the AC author (or a scheduled job) calls refresh().
11.6 The served_from indicator¶
Every Frame the engine serves carries a served_from field on the response:
| Value | Meaning |
|---|---|
| engine_cache | All requested metrics were served from cached entries (Branch 1 or memoised Branch 2 / Branch 3). The backend was not consulted on this call. |
| engine_backend | All requested metrics fell through to backend fallback (Branch 3) and were materialised on the way. The next call at the same grain will be engine_cache. |
| engine_mixed | Multi-metric query where some metrics hit cache and others fell back. |
| backend | The engine was disabled for this AC (no metric_engine block, or enabled: false); the resolver executed directly against the backend per Chapter 9. |
The indicator is observational, not contractual: a Frame's correctness is not a function of which branch served it. The framework guarantees equality of Frames across branches (Chapter 6 §6.7 — backend-protocol mappings honor the catalog's operator semantics; the engine's rollup obeys partition-invariance per the catalog).
served_from exposes the engine's serving path so that:
- Workbench's Query UI can display a visible badge (green / amber / grey) so the AC author sees engine activity.
- Operators can monitor cache effectiveness without instrumenting the backend.
- Tests can assert deterministic cache behavior, e.g., "the warm-up walker pre-materialised this grain, so the canonical demo query is engine_cache on first call."
Callers that need strict freshness (e.g., a query that must see the latest backend state) can construct the resolver with an engine override or explicitly refresh() the relevant entries; served_from: engine_cache is the operational signal that this hasn't been done.
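One way to read the served_from table is as a function of the per-metric branch outcomes. The sketch below is an assumption about how the value could be derived, not the engine's code: it treats Branches 1 and 2 as cache-served (the backend is not consulted) and Branch 3 as backend-served, and uses None to mean the engine is disabled.

```python
from typing import Optional, Sequence


def served_from(branches: Optional[Sequence[int]]) -> str:
    """Map per-metric branch numbers (1, 2, 3) to a served_from value."""
    if branches is None:          # engine disabled: resolver went straight to the backend
        return "backend"
    if not branches:
        raise ValueError("a served Frame needs at least one metric")
    hit_backend = [b == 3 for b in branches]
    if all(hit_backend):
        return "engine_backend"
    if any(hit_backend):
        return "engine_mixed"
    return "engine_cache"
```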
11.7 cache_hint and the promotion lifecycle¶
11.7.1 Hybrid push + pull¶
The engine combines two memoisation triggers:
- Push (declared): cache_hint.materialize_at blocks in ac.yaml (specified in §3.5.7) declare which (family, anchor) pairs the engine should pre-materialise at COMMIT time. The COMMIT-time walker (pre_materialise) calls engine.serve(...) for each declared anchor with partition_invariant=True; on first run this goes through Branch 3 (backend fallback + memoise); on subsequent commits the walker is idempotent.
- Pull (lazy): queries that miss the cache and fall through to Branch 3 (or roll up via Branch 2) write their result back to the manifest. Workload patterns the AC author didn't anticipate are still cached after their first call.
The two are not redundant. Push catches the declared-hot grains that demos and dashboards depend on; pull catches the unanticipated grains that emerge from real workload.
11.7.2 The promotion surface¶
After workload accumulates, the engine's access_count field identifies entries the AC author should consider promoting from ambient cache to declared cache_hint:
from coframe.metric_engine.recommendations import promotion_recommendations

for rec in promotion_recommendations(
    engine,
    min_hits=100,
    min_age_days=7,
    min_hits_per_day=1.0,
):
    print(rec.dataset_id, rec.anchor, rec.hits_per_day)
    print(rec.suggested_yaml)
Each PromotionRecommendation carries the entry's hit stats and a paste-able YAML stanza:
# Add to metric_family 'revenue' in ac.yaml:
cache_hint:
  materialize_at:
    - [day]   # Engine-recommended: 142 hits / 9 days = 15.8 hits/day
The intended workflow is declare → observe → promote:
- The AC author declares a starting cache_hint set (or none).
- Workload runs; lazy memoisation fills hot grains the declaration missed.
- The Workbench's Engine page (or promotion_recommendations() directly) surfaces the entries that have accumulated sustained access.
- The author pastes the recommended YAML into ac.yaml, re-COMMITs.
- The next warmup pass pre-materialises the promoted grain deterministically; it now starts warm on every fresh installation.
The engine does not auto-promote. Promotion is an AC-author action because it edits the AC's declared surface, and AC declarations are the framework's authoritative artifact (see Chapter 3).
11.7.3 Defaults¶
The recommendation surface's defaults — min_hits=100, min_age_days=7, min_hits_per_day=1.0 — are conservative: an entry must have sustained hot access over a week's window to surface. The Workbench Engine page lets the operator tune these interactively to surface fresher or weaker candidates (useful in demos and during initial AC tuning). Programmatic callers can override the defaults per call.
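The three thresholds can be read as a conjunction over an entry's access statistics. The sketch below is an assumed reading, not the library's implementation: EntryStats and should_recommend are hypothetical names, and hits per day is taken to be access_count divided by the entry's age in days.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class EntryStats:
    dataset_id: str
    anchor: Tuple[str, ...]
    access_count: int
    age_days: float


def hits_per_day(s: EntryStats) -> float:
    return s.access_count / s.age_days if s.age_days > 0 else 0.0


def should_recommend(s: EntryStats, *, min_hits: int = 100, min_age_days: float = 7,
                     min_hits_per_day: float = 1.0) -> bool:
    """An entry surfaces only if it clears all three thresholds."""
    return (s.access_count >= min_hits
            and s.age_days >= min_age_days
            and hits_per_day(s) >= min_hits_per_day)


# The example from the YAML stanza above: 142 hits over 9 days.
revenue_by_day = EntryStats("revenue", ("day",), access_count=142, age_days=9)
burst = EntryStats("revenue", ("region",), access_count=500, age_days=2)
```

With the defaults, the 142-hit entry surfaces (about 15.8 hits per day over 9 days), while the two-day-old burst entry does not until min_age_days is lowered.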
11.8 Verification-level inheritance¶
Per the amendment introduced in §7.8.5: a memoised entry inherits the AC's verification level at materialisation time. The engine's manifest records this on EngineEntry.verification_level; serving a query from such an entry produces a Frame whose served_from reflects the cache hit, and whose underlying epistemic posture reflects the entry's recorded level.
The engine does not lower the level (a Level AA AC served from cache remains AA) and does not raise it (an entry materialised at Level A stays at A even after the AC is re-verified to AA; refresh required). The evict() and refresh() operations (§11.5) are the explicit re-grounding mechanisms.
Consumers needing strict freshness can filter manifest entries on verification_level, or treat served_from: engine_backend as the only branch guaranteeing currency at the AC's present level.
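Filtering on verification_level amounts to an ordered comparison. In the sketch below the ranking A < AA < AAA follows the chapter's usage; the function name is hypothetical, and treating a null level (for example on QM entries) as not satisfying any requirement is an assumption.

```python
from typing import Optional

LEVEL_RANK = {"A": 1, "AA": 2, "AAA": 3}


def usable_at(entry_level: Optional[str], required_level: str) -> bool:
    """True if an entry materialised at entry_level satisfies required_level."""
    if entry_level is None:  # assumption: no recorded level never satisfies a requirement
        return False
    return LEVEL_RANK[entry_level] >= LEVEL_RANK[required_level]
```

An entry materialised at Level A therefore fails a Level AA requirement even if the AC has since been re-verified, until it is refreshed.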
11.9 Stability-window invalidation¶
ACs declared under v2.1's stability filter (platform design v2.1 §5–§7) carry a stability_cutoff — a wall-clock point past which the AC's effective view of data is frozen. The engine respects this by storing stability_cutoff on each entry at materialisation time and treating an entry as stale when the AC's effective cutoff has advanced past it.
Stale entries are not automatically evicted. The framework's posture is:
- Read: serve() will still return data from a stale entry; the resolver's stability-aware path (Chapter 9) is responsible for filtering data to the effective cutoff at execution time, which the engine respects by storing pre-filtered values.
- Refresh: stable entries become stale when the AC's cutoff advances. The AC author (or a scheduled job) re-materialises via refresh() to roll the cutoff forward.
- Recommendation: the engine's promotion_recommendations() surface flags entries whose stability_cutoff is more than a configurable threshold behind the AC's current effective cutoff, signaling that refresh is overdue.
This is the same discipline as L2 metadata in v2.1: cached commitments are valid as of the moment they were committed; explicit re-grounding is required to advance them.
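The staleness rule and the overdue flag reduce to two datetime comparisons. A minimal sketch, with hypothetical function names and the assumption that an entry or AC without a cutoff is never considered stale:

```python
from datetime import datetime, timedelta
from typing import Optional


def is_stale(entry_cutoff: Optional[datetime], ac_cutoff: Optional[datetime]) -> bool:
    """Stale when the AC's effective cutoff has advanced past the entry's."""
    if entry_cutoff is None or ac_cutoff is None:
        return False
    return ac_cutoff > entry_cutoff


def refresh_overdue(entry_cutoff: Optional[datetime], ac_cutoff: Optional[datetime],
                    threshold: timedelta) -> bool:
    """Overdue when the entry lags the AC's cutoff by more than the threshold."""
    return is_stale(entry_cutoff, ac_cutoff) and (ac_cutoff - entry_cutoff) > threshold


entry_cutoff = datetime(2026, 5, 1)
ac_cutoff = datetime(2026, 5, 10)
```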
11.10 Integration with Chapter 9's resolver¶
The engine sits below the resolver in the request flow:
Frame-QL query
│
▼
Lex + parse + four-rule resolution (Chapter 9)
│ → ResolvedPlan = per-metric ServingRequests + target grain + post-grain ops
▼
For each metric in the plan: ┐
lf = engine.serve(METRIC, family, anchor, ...) │
│ replaces the
After all metric serves return: │ legacy "build
frame = engine.compose( │ JOIN'd SQL,
per_metric_lfs, │ execute via
target_grain, │ backend.aggregate"
frame_expression=..., │ path of v2.0
having=..., │
order_by=..., │
limit=..., │
) │
│ │
▼ ┘
Frame (served_from = "engine_cache" | "engine_mixed" | "engine_backend")
The resolver's surface to the engine is the EngineRegistry — keyed by (installation_id, ac_name), returning the engine instance for that AC, or None if the engine is disabled for the installation. When None, the resolver falls through to the legacy v2.0 path (build SQL, call backend.aggregate, build Frame); the result is tagged served_from: backend.
The engine introduces no new query language and no new resolver rules. What it changes is the execution-time decomposition: where v2.0 issued one cross-schema JOIN'd aggregate call per query, v2.1+ with the engine issues one serve() per metric (each of which may or may not consult the backend) and one compose() at the end. The engine and the v2.0 path produce equal Frames for any query they both serve (validated by the cross-backend invariant suite — platform design v2.2 §8).
11.11 Forward-compatible domains¶
The engine's two-domain substrate (METRIC + QM) is extensible. Forward-compatible domains the v0.1 design specification anticipates:
| Future domain | Stores | Used by |
|---|---|---|
| LINEAGE | Per-column lineage edges | Chapter 7 attestation; Workbench Lineage page |
| INTEGRITY_RESULTS | Cached structural / sibling-coherence attestation outcomes | DQ continuous runs; the Validation Surface |
| SKETCH | Per-anchor HLL / t-digest / Theta sketches | Approximate-distinct queries; sketch-typed metric families |
Each future domain adds its own dataset_id conventions and its own serve / compose semantics on top of the shared manifest + Parquet substrate. The substrate itself does not change. Adding a new domain is an additive Coframe Core minor-version bump.
This is the same design discipline as the operator catalog (Chapter 10 §10.13.3): the substrate's structure is stable; expressive additions are versioned in.
11.12 Summary¶
The Metric Engine is an optional, per-AC, language-level acceleration substrate. It memoises per-(family, anchor) aggregates on a Polars + Parquet store keyed by a SQLite manifest, serves them via three-branch dispatch (exact match → FD-DAG rollup → backend fallback), composes multi-metric Frames in one operation, and hosts both METRIC and QM domains on a shared substrate.
The engine's behavior is fully determined by the AC's declarations (§3.5.7's cache_hint blocks + the metric_engine block of installation.yaml) and the lifecycle specified in this chapter. The engine adds no new query language and no new attestation regime. Frames it serves are equal to Frames the backend would serve for the same query; the served_from indicator (§11.6) exposes the serving path for observability.
The engine's relationship to the AC's declared surface is one of ambient working memory to long-term declared memory — pull memoisation catches unanticipated workload, push pre-materialisation honors declared hot grains, and the promotion recommendation surface (§11.7) is the bridge between them. The intended workflow is declare → observe → promote, with the AC's cache_hint evolving as workload patterns surface.
Subsequent reading: the v2.2 platform-design supplement (coframe_platform_design_v2_2_supplement.md) for the platform-side concerns (packaging, deployment topology, EngineRegistry); the engine's design doc (coframe_metric_engine_design_v0_1.md) for the implementation-facing detail (manifest schema, Parquet layout, eviction-score formula).
Chapter 12: The AC Analyst¶
The AC Analyst is an AI-powered analytical workspace that sits on top of a registered AC. It accepts natural-language questions from a user, composes structurally-correct Frame-QL queries against the AC, runs them through the resolver and (optionally) the Metric Engine, narrates the results, and proposes follow-up directions. This chapter specifies the surface: its tier-1 framing, the four-tier context model, the tool catalog, the bounded turn loop, the two-panel workspace UX, ANALYST.md authoring, and the deterministic eval framework that gates regressions.
The Analyst is a surface built on Coframe Core, not part of the Core specification itself. Like the Workbench (authoring) and the Query UI (manual Frame-QL), it is an application of the framework's structural surface — Frame-QL, the FD-DAG, verification levels, the Metric Engine — to a particular workflow. The framework gives the Analyst structurally-correct queries, refusals as first-class results, and provenance per Frame. The Analyst gives the framework a conversational interface that cannot fabricate answers around it.
The implementation lives in the coframe.runtime.analyst package. The full design rationale (operating posture, harness-engineering principles, simulation v1 scope, model routing) is in the companion design document coframe_ac_analyst_design_v0_1.md; this chapter is the user-facing specification.
12.1 What this chapter does¶
This chapter specifies the AC Analyst surface in the same posture as the other framework surfaces: what the user sees, what the framework promises, what the implementation contract looks like. It does not re-derive design choices — those live in the design document. It does specify everything an installation operator, an AC author, or an integration partner needs to know to use the Analyst against a registered AC.
The reader should be familiar with Chapter 1 (what the AC is), Chapter 8 (Frame-QL), Chapter 9 (resolution + refusals), and ideally Chapter 11 (the Metric Engine, which the Analyst routes through when enabled). Chapter 7 (verification levels) is the source of the AAA / AA / A annotations the Analyst surfaces in narrations.
12.2 What the Analyst is (and isn't)¶
It is:
- A workspace — two panels (dialogue + artifacts), a session that persists across turns, an artifact stack the user can pin / branch / export.
- A bounded agent — every tool call is auditable; refusals are first-class; narrations are grounded in the actual Frame data; replay is bit-equivalent.
- A structurally-honest surface — every numerical claim traces to a Frame the user can see, with the Frame's Frame-QL, served_from indicator, and verification level visible.
It is NOT:
- A SQL workbench — the analyst doesn't write SQL, and the user can't type SQL into it. The contract surface is Frame-QL (which the analyst composes) and natural language (which the user supplies).
- A free LLM chat — there's no system prompt the user can override, no "you are a helpful assistant" framing. The analyst's role is fixed: compose queries against the AC and narrate results.
- A forecasting tool, a what-if modeller, or a customer-analytics tool unless the AC declares those capabilities. The analyst's scope is bounded by the AC.
12.3 The three operating modes¶
The same Analyst handles three different user postures:
| Mode | User input | Analyst behaviour |
|---|---|---|
| Translation | A user types something close to a query ("revenue by region for Q4") | Translate one-shot, execute, narrate, suggest follow-ups |
| Guided | A user types a vague intent ("how's my business doing?") | Discover the relevant families, ask clarifying questions, then execute and narrate |
| Analysis | A user states a business problem ("Q4 revenue looks weak, help me figure out why") | Explore proactively, run multiple Frames, compare, narrate patterns |
The Analyst does not declare which mode it's in; it picks the right one based on the user's message. The bounded turn loop (§12.6) keeps Analysis-mode exploration from running away.
12.4 The four-tier context model¶
LLM context is finite + expensive. The Analyst uses a four-tier model that keeps the small / hot data in the prompt and pulls the large / cold data only on demand.
L1 Static system prompt + tool definitions + AC catalog summary
(prompt-cached, TTL hours-to-days; invalidates on AC COMMIT or
prompt-version bump).
L2 Composed ANALYST.md (installation-level memo + AC-level memo;
prompt-cached separately so memo edits don't invalidate L1).
L3 Conversation history (recent turns verbatim; older turns
summarised; tool results inlined).
L4 Artifact stack (NOT in context by default; LLM holds artifact
IDs only; full payload pulled via recall_artifact when needed).
The L4 discipline is what makes a 20-turn conversation that produces a dozen 5000-row Frames stay within the model's context window. The LLM sees the Frame's (rows, cols, frame_ql, served_from) summary; the user sees the full Frame in the right panel; the data goes back into LLM context only when the user explicitly refers back ("compare this to the chart from earlier" → recall_artifact).
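The L4 discipline can be sketched as a store that keeps full payloads out of the prompt and hands the LLM only a compact summary. The ArtifactStack class and its method signatures are hypothetical; only the summary fields (rows, cols, frame_ql, served_from) and the recall_artifact name come from the text above, and the Frame-QL string is a placeholder.

```python
from typing import Dict, List

Row = Dict[str, object]


class ArtifactStack:
    """Holds full Frame payloads outside the LLM context (the L4 tier)."""

    def __init__(self) -> None:
        self._payloads: Dict[str, List[Row]] = {}

    def add_frame(self, artifact_id: str, rows: List[Row],
                  frame_ql: str, served_from: str) -> dict:
        """Store the payload; return only the compact summary the LLM sees."""
        self._payloads[artifact_id] = rows
        return {
            "artifact_id": artifact_id,
            "rows": len(rows),
            "cols": list(rows[0].keys()) if rows else [],
            "frame_ql": frame_ql,
            "served_from": served_from,
        }

    def recall_artifact(self, artifact_id: str) -> List[Row]:
        """Pull the full payload back into context on explicit request."""
        return self._payloads[artifact_id]


stack = ArtifactStack()
rows = [{"month": f"2025-{m:02d}", "revenue": 100 + m} for m in range(1, 13)]
summary = stack.add_frame("art_s1_t1_i0", rows, "<frame-ql text>", "engine_cache")
```

The summary's size is independent of the Frame's row count, which is what keeps many large Frames from accumulating in the conversation context.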
Anthropic prompt caching is wired with cache breakpoints at the L1 and L2 boundaries. For a 20-turn conversation with L1 ≈ 5k tokens and L2 ≈ 1.5k tokens, this is the difference between paying the ingestion cost per turn (~$2 / session) and paying it once (~$0.20 / session).
12.5 The tool catalog¶
The Analyst has thirteen tools, organised by function. Each has a strict JSON-Schema input, a compact output, and a structured error format {error_kind, suggestion, ...context} that the LLM can react to programmatically.
Introspection (5 tools):
| Tool | Purpose |
|---|---|
| list_families | Enumerate the AC's metric + dimension families with one-line summaries |
| describe_family(name) | Detail one family: root, ip_reducers, hierarchies, derived dims, verification status |
| column_profile(schema, column) | QM profile for a physical column or derived dim (via Unit A synthesis) |
| discover_fd_dag(schema, columns?) | Data-attested FD-DAG over a schema's columns |
| verification_level(ac_name?) | Current AC verification level + outstanding conditions |
Execution (2 tools):
| Tool | Purpose |
|---|---|
| propose_frame_ql(intent_summary) | Construct a Frame-QL proposal WITHOUT executing; useful for clarification dialogue before commitment |
| execute_frame_ql(frame_ql) | Parse + resolve + execute via the shared execute_query entry. Returns a Frame artifact or a structured refusal |
Synthesis (3 tools):
| Tool | Purpose |
|---|---|
| narrate_frame(frame_id, user_intent) | Summarise a Frame in prose: grounded LLM subcall with the Frame's actual data in the prompt (§12.7) |
| propose_followups(frame_id, conversation_summary) | Suggest 2-4 clickable follow-up questions |
| recall_artifact(artifact_id) | Inject a prior artifact's full content into context (L4 discipline) |
Simulation (3 tools, v1 scope):
| Tool | Mechanism |
|---|---|
| simulate_filter(frame_id, filter_clause) | Re-run the source Frame-QL with an added WHERE clause |
| simulate_perturbation(frame_id, column, perturbation) | Post-output transform (multiply / add); doesn't re-execute |
| simulate_goal_seek(frame_id, column, target_value, vary) | Linear-aggregate solver: what value of vary makes column reach target_value |
Every simulation produces a simulation artifact with provenance.is_simulation: true so the UI renders it distinctly (different border, "SIM" badge) — the auditability discipline made visible.
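As a sketch of the post-output transform behind simulate_perturbation, the function below applies a multiply or add to one column of an already-materialised Frame and tags the result as a simulation. The dict shapes, the op / amount parameters, and every provenance field other than is_simulation are assumptions for illustration; the source Frame is left untouched and nothing is re-executed.

```python
from typing import Dict


def simulate_perturbation(frame: Dict, column: str, op: str, amount: float) -> Dict:
    """Return a simulation artifact derived from `frame` without re-executing it."""
    if op == "multiply":
        transform = lambda v: v * amount
    elif op == "add":
        transform = lambda v: v + amount
    else:
        raise ValueError(f"unsupported perturbation: {op}")
    rows = [{**row, column: transform(row[column])} for row in frame["rows"]]
    return {
        "kind": "simulation",
        "rows": rows,
        "provenance": {
            "is_simulation": True,            # drives the distinct "SIM" rendering
            "source_frame_id": frame["id"],   # hypothetical field
            "perturbation": {"column": column, "op": op, "amount": amount},
        },
    }


source = {"id": "art_s1_t1_i0",
          "rows": [{"region": "east", "revenue": 10}, {"region": "west", "revenue": 5}]}
sim = simulate_perturbation(source, "revenue", "multiply", 2)
```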
12.6 The bounded turn loop¶
One user message triggers an LLM turn loop bounded by:
| Bound | Default | What it prevents |
|---|---|---|
| max_turns_per_intent | 8 | Runaway agents |
| stuck_threshold | 5 | Tool-call loops without progress |
| dedup_window | 3 | Same (tool, args) repeated within last N calls → rejected |
| max_parallel_tools | 4 | Resource exhaustion |
| tool_timeout_seconds | 30 | Hung backends |
Per-session overrides may relax these for power users; the defaults are tuned for typical interactive work. When max_turns_per_intent is exhausted, the orchestrator emits a synthetic summary message rather than asking the LLM (which is the thing that just ran out of budget). When the stuck threshold fires, a synthetic system message is injected suggesting the LLM ask for clarification, produce a partial result, or refuse with a structured error.
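A rough sketch of how three of these bounds could interact is shown below. The TurnGuard class is hypothetical, and several choices are assumptions the text does not settle: a rejected duplicate still consumes a turn, only admitted calls enter the dedup window, and "progress" is whatever the caller reports through record_result.

```python
import json
from collections import deque


class TurnGuard:
    """Tracks max_turns_per_intent, dedup_window and stuck_threshold for one intent."""

    def __init__(self, max_turns: int = 8, stuck_threshold: int = 5,
                 dedup_window: int = 3) -> None:
        self.max_turns = max_turns
        self.stuck_threshold = stuck_threshold
        self.recent = deque(maxlen=dedup_window)  # last N admitted (tool, args) calls
        self.turns = 0
        self.no_progress = 0

    def admit(self, tool: str, args: dict) -> str:
        """Return 'ok', 'duplicate', 'stuck' or 'exhausted' for a proposed call."""
        if self.turns >= self.max_turns:
            return "exhausted"   # orchestrator emits a synthetic summary
        if self.no_progress >= self.stuck_threshold:
            return "stuck"       # orchestrator injects a synthetic system message
        self.turns += 1
        call = (tool, json.dumps(args, sort_keys=True))
        if call in self.recent:
            self.no_progress += 1
            return "duplicate"
        self.recent.append(call)
        return "ok"

    def record_result(self, made_progress: bool) -> None:
        self.no_progress = 0 if made_progress else self.no_progress + 1


guard = TurnGuard(max_turns=4)
outcomes = [
    guard.admit("list_families", {}),
    guard.admit("list_families", {}),                   # same call inside the window
    guard.admit("describe_family", {"name": "revenue"}),
    guard.admit("verification_level", {}),
    guard.admit("list_families", {}),                   # budget already spent
]
```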
12.7 Narration grounding¶
Narration is itself a tool: narrate_frame(frame_id, user_intent). The handler issues a separate LLM subcall — distinct from the planning loop — with the Frame's actual rows in the prompt, instructed to summarise in 2-3 sentences without speculation:
User asked: <user_intent>
Frame (<n> rows, columns: <cols>):
<compact pipe-table rendering of the Frame's data>
Verification level: <level>
Frame-QL: <source query>
Summarise this Frame in 2-3 sentences.
The LLM cannot hallucinate "revenue went up" when the rows show a dip — the rows are in the prompt. The narration tool returns a narration artifact tied by provenance to the source Frame.
The same grounding pattern applies to propose_followups: the follow-up generator sees the Frame's shape so its suggestions are specific ("break November down by region") rather than generic ("ask another question").
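The prompt template above can be assembled with a few lines of string formatting. This is a sketch with hypothetical helper names, and the Frame-QL string in the usage is a placeholder rather than verified syntax; it shows only that the Frame's actual rows are rendered into the subcall's prompt.

```python
from typing import Dict, List

Row = Dict[str, object]


def render_pipe_table(rows: List[Row]) -> str:
    """Compact pipe-table rendering: one header line, one line per row."""
    if not rows:
        return "(no rows)"
    cols = list(rows[0].keys())
    lines = ["| " + " | ".join(cols) + " |"]
    lines += ["| " + " | ".join(str(r[c]) for c in cols) + " |" for r in rows]
    return "\n".join(lines)


def narration_prompt(user_intent: str, rows: List[Row], level: str, frame_ql: str) -> str:
    cols = ", ".join(rows[0].keys()) if rows else ""
    return (
        f"User asked: {user_intent}\n"
        f"Frame ({len(rows)} rows, columns: {cols}):\n"
        f"{render_pipe_table(rows)}\n"
        f"Verification level: {level}\n"
        f"Frame-QL: {frame_ql}\n"
        "Summarise this Frame in 2-3 sentences."
    )


prompt = narration_prompt(
    "show me revenue trends",
    [{"month": "2025-10", "revenue": 120}, {"month": "2025-11", "revenue": 80}],
    "AA",
    "<source query>",
)
```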
12.8 The two-panel workspace¶
┌─────────────────────────────────────┬──────────────────────────────────────┐
│ DIALOGUE │ ARTIFACTS │
│ │ │
│ user: "show me revenue trends" │ ┌────────────────────────────────┐ │
│ │ │ Frame · monthly revenue │ │
│ analyst: │ │ [datagrid] │ │
│ I'll look at monthly totals… │ │ ↳ provenance (frame_ql, │ │
│ [tool: execute_frame_ql] │ │ served_from, verification) │ │
│ ↳ Frame ↑ │ └────────────────────────────────┘ │
│ │ ┌────────────────────────────────┐ │
│ Trending up through Q3 then │ │ Follow-ups (clickable chips) │ │
│ dipping in November. │ │ • Break Nov by region │ │
│ │ │ • Compare to Nov last year │ │
│ [input box] │ │ • Look at units instead │ │
│ [Send] │ └────────────────────────────────┘ │
└─────────────────────────────────────┴──────────────────────────────────────┘
The dialogue panel is the conversation log + input. The artifact panel is the stack — eleven kinds of typed artifacts that accumulate as conversation progresses:
| Kind | Renderer | Carries |
|---|---|---|
| narration | Markdown card | Prose summary, grounded in a referenced Frame |
| frame | Sortable datagrid | Columns + rows + frame_ql + served_from + verification level |
| chart | Vega-Lite renderer | Auto-suggested viz; chart type from Frame shape |
| qm_card | Column-profile card | Distinct values, cardinality, nunique (works for derived dims via Unit A synthesis) |
| structure | Interactive graph | FD-DAG or dimension-family hierarchy view |
| comparison | Two-Frame side-by-side | "This period vs last period" with structured diff |
| simulation | Parameterised Frame | Distinct border + "SIM" badge per §12.5 |
| followups | Clickable chips | Next-question suggestions; clicking sends as user message |
| refusal | Diagnostic card | Plain-language explanation + suggested reformulation |
| provenance | Footer on any Frame | Frame-QL, served_from, verification; always accessible |
| clarification | Question + multiple-choice | When intent is ambiguous, ask before running |
Each artifact has a stable ID (art_{session_id}_t{turn}_i{index}) the LLM can reference via recall_artifact(id) and the user can pin / branch / export.
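The ID pattern is simple enough to build and parse with a regular expression. The helper names below are hypothetical, and the sketch assumes turn and index are non-negative integers; the session ID is matched greedily so it may itself contain underscores.

```python
import re
from typing import Tuple

ARTIFACT_ID = re.compile(r"^art_(?P<session_id>.+)_t(?P<turn>\d+)_i(?P<index>\d+)$")


def make_artifact_id(session_id: str, turn: int, index: int) -> str:
    return f"art_{session_id}_t{turn}_i{index}"


def parse_artifact_id(artifact_id: str) -> Tuple[str, int, int]:
    """Split an artifact ID back into (session_id, turn, index)."""
    match = ARTIFACT_ID.match(artifact_id)
    if match is None:
        raise ValueError(f"not an artifact ID: {artifact_id!r}")
    return match["session_id"], int(match["turn"]), int(match["index"])
```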
Session-level operations: pin (de-prioritises pruning), branch (fork from a specific artifact; new session inherits up to that point), export (download session as JSON; replayable on another machine), share (read-only URL), clear (archive conversation; artifacts remain).
12.9 ANALYST.md — the author memo¶
The AC's structural surface tells the LLM what's possible; the memo tells it what's meant. Two-level hierarchy (mirrors Claude Code's CLAUDE.md):
installation.coframe/ANALYST.md ← org-wide conventions + voice
retail.coframe/ANALYST.md ← AC-specific family semantics
The installation-level memo carries organisational conventions (currency, fiscal-year alignment, date format), cross-cutting caveats (known data quirks), and analyst voice. The AC-level memo carries family semantics ("when users say 'margin', interpret as profit / revenue at the output grain"), recommended starting points ("executive views = region × month grain"), and refusal scope ("don't speculate about future periods").
Composition: both loaded at session start, installation memo first, AC memo second, with an explicit "AC overrides installation on conflict" note so the LLM resolves conflicts in the right direction. Both memos prompt-cached as L2, separately from L1, so memo edits don't invalidate the AC catalog cache.
A real example lives at drafts/data/retail_demo/installation.coframe/ANALYST.md and drafts/data/retail_demo/retail.coframe/ANALYST.md. The format is freeform markdown — no schema to satisfy, no fields to fill in. The framework reads the file and passes it as L2 context verbatim.
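The composition order described above could be sketched as follows. This is an illustration under stated assumptions, not the framework's loader: the function name, the section headings, and the exact wording of the override note are all hypothetical. Only the ordering (installation first, AC second, then the override note) and the verbatim pass-through of memo text come from the description.

```python
# Hypothetical wording; the source only says the note states that the AC memo
# overrides the installation memo on conflict.
OVERRIDE_NOTE = (
    "Note: where the two memos conflict, the AC memo overrides the installation memo."
)


def compose_memo_context(installation_memo: str, ac_memo: str) -> str:
    """Assemble the L2 memo context: installation memo, AC memo, override note.

    The memo bodies are passed through verbatim; only the (assumed) headings
    and the override note are added around them.
    """
    sections = [
        "# Installation memo\n" + installation_memo,
        "# AC memo\n" + ac_memo,
        OVERRIDE_NOTE,
    ]
    return "\n\n".join(sections)
```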
12.10 Confidence + verification signaling¶
The Analyst should be explicit about what it knows vs. what it's inferring. Concrete patterns:
| Situation | Analyst behaviour |
|---|---|
| Ambiguous intent | "I'm going to interpret this as X — does that match?" (uses clarification artifact, not plain text) |
| AC at Level A on this surface | "This Frame relies on declared FD-edges that aren't fully data-attested yet. Take regional rollups as approximate." |
| Anomaly in the data | "I notice November is unusually low. I don't know if this is a data issue or a real event." (states the anomaly; doesn't speculate beyond what data shows) |
| Refusal received | Passes the resolver's structured error through in plain language; suggests reformulation |
| Multiple valid interpretations | Surfaces as clarification artifact with multiple-choice chips |
Every Frame artifact carries the AC's verification level at execution time. The provenance card shows it. The narration mentions it when below AAA. This is the "AAA attestation" claim made operationally visible — the Analyst is structurally incapable of presenting an A-attested number as if it were AAA, because the verification level is in the context.
Register discipline (A9). The Analyst is bilingual: it thinks in Coframe vocabulary internally and speaks the user's analytical vocabulary externally. The verification-level mention above ("we have high confidence", "with a small caveat") is the example pattern — not "Level AA" or "sibling-coherence opted out for revenue@region_daily_summary." Three pieces enforce the discipline:
- A framework primer ships as part of L1 (versioned as FRAMEWORK_KNOWLEDGE_VERSION, separately from the system-prompt version). It teaches the Analyst what each Coframe concept means and pairs each with the plain-language frame to use when speaking to users: anchor-locked becomes "computed per transaction; weight to roll up"; FD-DAG becomes "verified column relationships"; verification-level becomes "high confidence" / "small caveat" / "structurally valid but not data-verified."
- A register-translations table ships alongside the primer as a compact Coframe-term ↔ user-vocab reference. Both primer and table are pure content (coframe.runtime.analyst.framework_knowledge module); they will be republished as MCP resources so third-party harnesses get the same framework fluency without consuming our system prompt.
- A register rule in the system prompt instructs the Analyst to match the user's register: default plain, mirror power users who open with framework vocabulary. The installation-level ANALYST.md may override the default via a ## User register section (e.g., an installation deployed to data engineers can opt into free use of framework terminology).
The eval framework gates this via expected_text_must_not_contain (§12.14) — scenarios assert that user-facing prose does not leak framework terms when the user is in the plain-analytical register. Case-insensitive substring match; failures emit register discipline violated so the regression is named.
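The check described above amounts to a case-insensitive substring scan. The following is a minimal sketch of that idea; the function name and the exact failure-message layout are assumptions, with only the "register discipline violated" label taken from the text.

```python
def find_register_leaks(text: str, must_not_contain: list[str]) -> list[str]:
    """Return one failure message per forbidden term found in user-facing text.

    Matching is a case-insensitive substring test, as described for
    expected_text_must_not_contain.
    """
    lowered = text.lower()
    return [
        f"register discipline violated: {term!r}"
        for term in must_not_contain
        if term.lower() in lowered
    ]
```

An empty result means the prose stayed in the plain-analytical register for the listed terms.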
The substantive claim: Coframe's value proposition rests on "structurally-correct answers without the user having to know the structure." If the Analyst surfaces the structure to the user, that proposition partly leaks — the user is doing the disambiguation work the framework was supposed to do. Register discipline is not polish; it is the user-side completion of the framework's correctness claim.
12.11 The HTTP surface¶
The Analyst surface mounts on the runtime HTTP app (the same process that serves /query) under /analyst:
| Method + path | Purpose |
|---|---|
| POST /analyst/sessions | Open a session against an installation + AC |
| POST /analyst/sessions/{id}/messages | Send a user message; receive (text, artifacts, turns_used, …) |
| GET /analyst/sessions/{id} | Session metadata + recent artifacts |
| GET /analyst/sessions/{id}/artifacts | Full artifact list |
| GET /analyst/sessions/{id}/artifacts/{aid} | One artifact |
| GET /analyst/sessions/{id}/trace | Structured trace log (the unit of replay; §12.13) |
| GET /analyst/sessions/{id}/export | Portable session bundle: messages + artifacts + trace |
The runtime app mounts the analyst router only when an analyst_adapter_factory is supplied, so installations that don't want the surface don't pay the dependency. Deployments choose their LLM provider via the factory: the default ClaudeAdapter requires the [analyst] extra; OpenAIAdapter is pluggable but its SDK isn't pulled by default.
12.12 Pluggable LLM adapter¶
The Analyst is model-neutral by architecture. A single LLMAdapter ABC abstracts the provider:
class LLMAdapter(ABC):
    def send(self, messages, *, tools, model, max_tokens,
             cache_breakpoints) -> LLMResponse: ...
    def select_model(self, task_class: TaskClass) -> str: ...
    @property
    def provider_name(self) -> str: ...
Two reference implementations ship: ClaudeAdapter (default — Anthropic SDK, prompt-cache markers translated into cache_control blocks) and OpenAIAdapter (peer reference — OpenAI's function-calling shape, no cache primitives). A third reference adapter for an open-weights model surface is planned for v2.
The select_model(task_class) hook is the architectural seam for per-task model routing — narration to a fast small model, planning to a heavier reasoning model, simulation to medium. V1 returns the same model for every task class; v2 wires the split.
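The difference between the V1 behaviour and a per-task split can be sketched with two small routers. Everything named here is hypothetical: the TaskClass members are inferred from the tasks mentioned above (narration, planning, simulation), and the class names and model labels are placeholders rather than the shipped API.

```python
from enum import Enum


class TaskClass(Enum):
    """Assumed task classes, taken from the tasks named in the text."""
    NARRATION = "narration"
    PLANNING = "planning"
    SIMULATION = "simulation"


class SingleModelRouter:
    """V1-style routing: every task class maps to the same model."""

    def __init__(self, model: str) -> None:
        self.model = model

    def select_model(self, task_class: TaskClass) -> str:
        return self.model


class PerTaskRouter:
    """Split routing: an explicit model per task class, with a fallback."""

    def __init__(self, routes: dict[TaskClass, str], default: str) -> None:
        self.routes = dict(routes)
        self.default = default

    def select_model(self, task_class: TaskClass) -> str:
        return self.routes.get(task_class, self.default)
```

Because both routers expose the same select_model signature, swapping one for the other would not touch the orchestrator, which is the point of keeping the hook as a seam.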
12.13 Trace logging + replay¶
Every session emits a structured trace:
{
  "format_version": "v0.1",
  "session_id": "sess_abc123",
  "ac_name": "retail_full",
  "prompt_version": "v0.1.0",
  "model": "claude-sonnet-4-5",
  "started_at": 1779730000.0,
  "events": [
    {"t": 0.0, "kind": "session_start", "payload": {...}},
    {"t": 0.0, "kind": "user_message", "payload": {"text": "show me revenue trends"}},
    {"t": 0.4, "kind": "llm_response", "payload": {...}},
    {"t": 0.6, "kind": "tool_call", "payload": {"tool": "list_families", ...}},
    {"t": 0.6, "kind": "tool_result", "payload": {"tool": "list_families", "ok": true}},
    {"t": 1.4, "kind": "artifact", "payload": {"artifact_id": "...", "artifact_kind": "frame"}},
    {"t": 2.7, "kind": "assistant_message", "payload": {"text": "..."}}
  ]
}
The trace is the unit of replay. Given the same trace + same prompt version + same model + same AC version, replay reproduces the session bit-equivalently. The ReplayAdapter.from_trace(trace) factory builds an adapter whose send() returns the recorded llm_response events in order; the orchestrator re-executes against it.
What replay verifies: the orchestrator + handlers still dispatch the same tools in the same order; artifact IDs still match; dedup / stuck warnings still fire at the same points. What replay can't verify: whether the recorded LLM was right (eval-corpus job), or whether handlers produce the same data (backends move; replay against a frozen backend snapshot for that).
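The replay mechanism can be pictured as an adapter that hands back recorded responses in order. The sketch below is an assumption-laden illustration, not the shipped ReplayAdapter: the class is deliberately named ReplayAdapterSketch, it returns the raw llm_response payloads rather than LLMResponse objects, and ReplayExhausted is a hypothetical error type.

```python
class ReplayExhausted(RuntimeError):
    """Raised when the orchestrator asks for more responses than were recorded."""


class ReplayAdapterSketch:
    """Serves recorded llm_response payloads in order instead of calling a provider."""

    def __init__(self, responses):
        self._responses = list(responses)
        self._cursor = 0

    @classmethod
    def from_trace(cls, trace: dict) -> "ReplayAdapterSketch":
        # Keep only llm_response events, preserving their order in the trace.
        recorded = [
            event["payload"]
            for event in trace["events"]
            if event["kind"] == "llm_response"
        ]
        return cls(recorded)

    def send(self, messages, *, tools=None, model=None, max_tokens=None,
             cache_breakpoints=None):
        # The arguments are accepted for signature compatibility but ignored:
        # the recorded response is returned regardless of the request.
        if self._cursor >= len(self._responses):
            raise ReplayExhausted("trace has no further llm_response events")
        response = self._responses[self._cursor]
        self._cursor += 1
        return response
```

Running out of recorded responses is itself a useful signal: it suggests the orchestrator now takes more turns than it did when the trace was captured.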
12.14 Eval framework + quality metrics¶
The eval framework is deterministic by construction. Each scenario carries both the user input AND the LLM's intended responses (llm_script field), so CI runs the same way every time without burning live-LLM tokens. Live-LLM eval is a separate, slower path for one-off canaries.
A scenario:
- name: revenue_by_region
  user_message: "show me revenue by region"
  llm_script:
    - tool_calls:
        - name: execute_frame_ql
          arguments:
            frame_ql: "SELECT region, SUM(revenue) AS r AT region"
    - text: "East leads at $3.87M, then Central, then West."
  expected_tool_calls:
    - tool: execute_frame_ql
      frame_ql_pattern: "SELECT region, SUM\\(revenue\\).*AT region"
  expected_outcome: success
  expected_artifact_kinds: [frame]
  expected_max_turns: 2
  expected_text_contains: ["East"]
ScenarioRunner drives the orchestrator with a scripted adapter built from llm_script, then asserts on the resulting trace + artifacts. Failures emit human-readable diagnostics (expected 'execute_frame_ql' got 'list_families', frame_ql_pattern '...' did not match '...', etc.).
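One of the assertions described above, comparing an expected tool call against the recorded one, might look like the sketch below. The function name and the dictionary shapes are hypothetical, and treating frame_ql_pattern as a regular expression applied with re.search (rather than a full match) is an assumption; the diagnostic strings follow the examples quoted in the text.

```python
import re


def check_tool_call(expected: dict, actual: dict) -> list[str]:
    """Compare one expected_tool_calls entry with one recorded tool call.

    Returns a list of human-readable failures; an empty list means a match.
    """
    if expected["tool"] != actual["name"]:
        return [f"expected {expected['tool']!r} got {actual['name']!r}"]

    failures = []
    pattern = expected.get("frame_ql_pattern")
    if pattern is not None:
        frame_ql = actual.get("arguments", {}).get("frame_ql", "")
        # Assumption: the pattern may match anywhere in the Frame-QL text.
        if re.search(pattern, frame_ql) is None:
            failures.append(
                f"frame_ql_pattern {pattern!r} did not match {frame_ql!r}"
            )
    return failures
```

Note that the YAML value "SUM\\(revenue\\)" decodes to the regex SUM\(revenue\), so the parentheses are matched literally.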
Quality metrics aggregate over batches of traces (§12.3 of the design doc):
| Metric | What it measures |
|---|---|
| turns_per_intent_p50/p99/max | How quickly the analyst resolves user messages |
| refusal_rate | Fraction of intents ending in a refusal |
| stuck_warning_rate | Fraction of intents in which the stuck warning fired |
| budget_exceeded_rate | Fraction of intents that hit MAX_TURNS_PER_INTENT |
| cache_hit_rate | Anthropic prompt-cache effectiveness (from cached_input_tokens) |
| eval_pass_rate | Fraction of corpus scenarios passing, tracked per prompt version |
The Workbench Engine page renders these alongside Metric Engine cache stats; CI emits them as part of the eval summary.
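As an illustration of how the per-intent metrics in the table could be aggregated, here is a minimal sketch. The per-intent record shape (turns, refused, stuck, budget_exceeded) is hypothetical, and the nearest-rank percentile is an assumed choice; the source does not say which percentile method the framework uses.

```python
import math


def percentile(values, q):
    """Nearest-rank percentile (an assumed method) of a non-empty sequence."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def summarise_intents(intents):
    """Aggregate hypothetical per-intent records into the table's rate metrics."""
    total = len(intents)
    turns = [intent["turns"] for intent in intents]
    return {
        "turns_per_intent_p50": percentile(turns, 50),
        "turns_per_intent_p99": percentile(turns, 99),
        "turns_per_intent_max": max(turns),
        "refusal_rate": sum(intent["refused"] for intent in intents) / total,
        "stuck_warning_rate": sum(intent["stuck"] for intent in intents) / total,
        "budget_exceeded_rate": sum(intent["budget_exceeded"] for intent in intents) / total,
    }
```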
12.15 What the Analyst gives the framework¶
A column-native AC with verified structural integrity is the substrate the Analyst rests on. The reverse claim is also true: the Analyst is what makes the AC's investment pay off for non-author users. The AC author works in Frame-QL and the operator catalog; the AC consumer never needs to. The Analyst's discipline — refuse rather than guess, ground every claim in a Frame, surface verification levels — is the framework's structural guarantees made operationally visible to the user who never edits ac.yaml.
In the language of the position article: Coframe is the analytical layer; the AC Analyst is the dominant access pattern to it for human users.
Subsequent reading: the design document (coframe_ac_analyst_design_v0_1.md) for the rationale, the harness-engineering principles, the model-routing architecture, and the deferred-to-v2 simulation patterns. The retail demo's analyst_eval/ directory has six worked scenarios. The coframe.runtime.analyst package and its eval/ subpackage carry the implementation.
Appendix A: Frame-QL Grammar¶
The complete grammar for Coframe Core's Frame-QL, written in EBNF-style notation and formalizing the language specified in Chapter 8. This grammar covers the Core subset; Coframe Pro extensions (window functions, persistent re-ingestion, multi-backend queries) are out of scope and specified separately.
A.1 Notation¶
This grammar uses a standard EBNF-style notation:
- ::= defines a production.
- | separates alternatives.
- [ x ] denotes an optional element (zero or one).
- { x } denotes repetition (zero or more).
- ( x ) groups elements.
- Terminals are written in UPPERCASE (keywords) or as quoted literals (',', '(').
- Nonterminals are written in lower_snake_case.
- Keywords are case-insensitive (§8.3.3); they are shown uppercase for clarity.
The grammar describes syntax only. Semantic constraints — that a metric family reference resolves to exactly one family-root, that the AT grain is reachable, that an ip_reducer's block set permits the requested rollup — are specified in Chapters 8 and 9 and verified during resolution, not by the grammar.
A.2 Top-level structure¶
query            ::= frame | with_block
with_block       ::= WITH inner_frame_list outer_frame
inner_frame_list ::= inner_frame { ',' inner_frame }
inner_frame      ::= identifier AS '(' frame ')'
outer_frame      ::= frame
frame            ::= select_clause
                     [ from_clause ]
                     [ where_clause ]
                     at_clause
                     [ having_clause ]
                     [ order_by_clause ]
                     [ limit_clause ]
                     [ ';' ]
A query is either a single Frame or a WITH-block. An outer Frame must include an at_clause; an inner Frame within a WITH-block may omit it, inheriting the outer Frame's grain (§8.7.3). The frame production above makes at_clause mandatory. For an inner Frame the clause may be absent, in which case the Frame is parsed without an at_clause and the resolver supplies the inherited grain at resolution time.
To make that explicit, the inner_frame production is restated over a relaxed Frame, which replaces the inner_frame production above:
inner_frame   ::= identifier AS '(' relaxed_frame ')'
relaxed_frame ::= select_clause
                  [ from_clause ]
                  [ where_clause ]
                  [ at_clause ]
                  [ having_clause ]
                  [ order_by_clause ]
                  [ limit_clause ]
A.3 The Frame core: SELECT and AT¶
A.3.1 SELECT clause¶
select_clause ::= [ SELECT ] select_item_list
select_item_list ::= select_item { ',' select_item }
select_item ::= expression [ AS identifier ]
The SELECT keyword is optional (the sugar form, §8.4.1): a Frame may begin directly with the select_item_list. Each select_item is an expression, optionally aliased with AS.
A.3.2 AT clause (grain)¶
at_clause     ::= ( AT | BY ) at_target [ using_clause ]
at_target     ::= dimension_ref
                | '(' dimension_ref { ',' dimension_ref } ')'
using_clause  ::= USING identifier
dimension_ref ::= identifier
                | qualified_name
The at_clause declares the output grain as a single AC-dimension or a parenthesized tuple. The keyword is AT (canonical); BY is accepted as a deprecated synonym (§8.5.4). The optional using_clause selects a hierarchy by name when an AC-dimension is reachable through more than one hierarchy of its dimension family (§8.5.4).
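To make the at_clause productions concrete, here is a small hand-written parser for that clause alone. It is a sketch under stated assumptions, not the framework's parser: the function names and the returned dictionary shape are hypothetical, backtick-quoted identifiers are not handled, and reserved-word checks are left out.

```python
import re

# One token: an identifier (optionally qualified as schema.column) or one of ( ) ,
_TOKEN = re.compile(r"\s*([A-Za-z][A-Za-z0-9_]*(?:\.[A-Za-z][A-Za-z0-9_]*)?|[(),])")
_PUNCT = {"(", ")", ","}


def tokenize(text):
    text = text.strip()
    tokens, pos = [], 0
    while pos < len(text):
        match = _TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected input: {text[pos:]!r}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def _name(tokens, i, what):
    # A name is any token that exists and is not punctuation.
    if i >= len(tokens) or tokens[i] in _PUNCT:
        raise ValueError(f"expected {what}")
    return tokens[i]


def parse_at_clause(text):
    """Parse ( AT | BY ) at_target [ USING identifier ] into a small dict."""
    tokens = tokenize(text)
    if not tokens or tokens[0].upper() not in ("AT", "BY"):
        raise ValueError("expected AT or BY")
    i, dims = 1, []
    if i < len(tokens) and tokens[i] == "(":
        # Parenthesized tuple: '(' dimension_ref { ',' dimension_ref } ')'
        dims.append(_name(tokens, i + 1, "dimension_ref"))
        i += 2
        while i < len(tokens) and tokens[i] == ",":
            dims.append(_name(tokens, i + 1, "dimension_ref"))
            i += 2
        if i >= len(tokens) or tokens[i] != ")":
            raise ValueError("expected ')'")
        i += 1
    else:
        # Single dimension_ref
        dims.append(_name(tokens, i, "dimension_ref"))
        i += 1
    using = None
    if i < len(tokens) and tokens[i].upper() == "USING":
        using = _name(tokens, i + 1, "identifier")
        if "." in using:
            raise ValueError("USING takes a bare identifier")
        i += 2
    if i != len(tokens):
        raise ValueError(f"unexpected token {tokens[i]!r}")
    return {"keyword": tokens[0].upper(), "dimensions": dims, "using": using}
```

The parser accepts the syntax only; whether the named grain is reachable, or whether USING is actually needed, remains a resolution-time question.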
A.4 Accessory clauses¶
A.4.1 FROM clause¶
The from_clause lists schemas (or WITH-block inner-frame names) the query may draw from. Optional; when absent, the resolver selects schemas via the four-rule filter (§9.3).
A.4.2 WHERE clause¶
The where_clause filters input rows before the Frame is computed. Its expression is boolean-valued at the input grain (§8.5.3).
A.4.3 HAVING clause¶
The having_clause filters output rows after the Frame is computed. Its expression is boolean-valued at the output grain and may reference reducer expressions (§8.5.5).
A.4.4 ORDER BY clause¶
order_by_clause ::= ORDER BY sort_key_list
sort_key_list ::= sort_key { ',' sort_key }
sort_key ::= expression [ ASC | DESC ]
ASC is the default when neither direction is given. Multiple keys are applied lexicographically (§8.5.6).
A.4.5 LIMIT clause¶
limit_clause ::= LIMIT integer_literal [ per_clause ]
per_clause ::= PER dimension_ref { ',' dimension_ref }
Bare LIMIT n caps the entire frame after ORDER BY. LIMIT n PER cols caps per group, where cols must be a subset of the AT dimensions and each entry must be an AC-dimension reference (not a measure or expression). See §8.5.7 for the full semantics, the L1–L3 validity rules, and the within-group determinism guidance.
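The difference between bare LIMIT n and LIMIT n PER cols can be shown on in-memory rows. This is a minimal sketch of the capping step only, assuming the rows are already in ORDER BY order; the function name and the row representation (a list of dicts) are hypothetical, and the L1–L3 validity rules of §8.5.7 are reduced here to the subset check stated above.

```python
from collections import defaultdict


def limit_rows(ordered_rows, n, per=(), at_dims=()):
    """Keep at most n rows overall (per empty) or per distinct value of per.

    ordered_rows must already be sorted by the query's ORDER BY keys.
    """
    if per and not set(per) <= set(at_dims):
        raise ValueError("PER columns must be a subset of the AT dimensions")
    kept_per_group = defaultdict(int)
    kept = []
    for row in ordered_rows:
        # With no PER columns the key is the empty tuple, so every row falls
        # in one group and the function behaves like a bare LIMIT n.
        key = tuple(row[col] for col in per)
        if kept_per_group[key] < n:
            kept_per_group[key] += 1
            kept.append(row)
    return kept
```

Because the cap is applied in input order, which rows survive within a group depends entirely on the ordering, which is why within-group determinism matters.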
A.5 Expressions¶
The expression sublanguage covers reducers, mappers, composites, and registered ratio/count operators (§8.6). It contains no window functions (those are out of Core scope, §8.2.3).
A.5.1 Expression grammar¶
expression          ::= bool_expression
bool_expression     ::= bool_term { OR bool_term }
bool_term           ::= bool_factor { AND bool_factor }
bool_factor         ::= [ NOT ] bool_primary
bool_primary        ::= comparison
                      | '(' bool_expression ')'
comparison          ::= additive_expr [ comparison_op additive_expr ]
                      | additive_expr IS [ NOT ] NULL
                      | additive_expr [ NOT ] IN '(' value_list ')'
                      | additive_expr [ NOT ] BETWEEN additive_expr AND additive_expr
                      | additive_expr [ NOT ] LIKE string_literal
comparison_op       ::= '=' | '<>' | '!=' | '<' | '<=' | '>' | '>='
additive_expr       ::= multiplicative_expr { ( '+' | '-' ) multiplicative_expr }
multiplicative_expr ::= unary_expr { ( '*' | '/' | '%' | '^' ) unary_expr }
unary_expr          ::= [ '-' ] primary_expr
primary_expr        ::= literal
                      | column_ref [ at_annotation ]
                      | reducer_expr
                      | function_expr
                      | multi_input_expr
                      | registered_ratio_expr
                      | case_expr
                      | '(' expression ')'
The boolean layer wraps the arithmetic layer so that WHERE and HAVING expressions (which must be boolean) and SELECT expressions (which may be any type) share one expression nonterminal; type-checking (Chapter 8 §8.10, Chapter 10) — not the grammar — enforces that a where_clause's expression is boolean.
A.5.2 Column references¶
column_ref     ::= identifier
                 | qualified_name
qualified_name ::= identifier '.' identifier
value_list     ::= literal { ',' literal }
A bare identifier references an AC family-name or AC-dimension by name. A qualified_name (schema.column) selects a specific schema's appearance of a family-name, used for cousin disambiguation (§8.6.5, §8.9).
A.5.3 Reducer, function, and multi-input expressions¶
reducer_expr     ::= reducer_name '(' reducer_arg ')'
                   | COUNT '(' '*' ')'
reducer_arg      ::= [ DISTINCT ] expression [ at_annotation ]
                   | expression ',' expression   (* e.g. APPROX_PERCENTILE(c, p) *)
at_annotation    ::= '@' at_target               (* formation/companion grain; §8.6.6 *)
reducer_name     ::= SUM | MAX | MIN | COUNT | COUNT_DISTINCT
                   | AVG | STDDEV | VARIANCE
                   | APPROX_DISTINCT | APPROX_PERCENTILE
                   | MEDIAN | MODE | EXACT_DISTINCT | EXACT_PERCENTILE
                   | BOOL_AND | BOOL_OR | BIT_AND | BIT_OR | BIT_XOR
                   | HLL_MERGE | THETA_UNION | T_DIGEST_MERGE
                   | STRING_AGG | ARRAY_AGG
                   | identifier                  (* catalog-defined reducer *)
function_expr    ::= function_name '(' expression { ',' expression } ')'
function_name    ::= UPPER | LOWER | TRIM | SUBSTRING | LENGTH | CONCAT | REPLACE
                   | DATE_ADD | DATE_DIFF | EXTRACT
                   | CAST
                   | COALESCE | IFNULL | NULLIF
                   | ABS | ROUND | FLOOR | CEIL
                   | identifier                  (* catalog-defined function *)
multi_input_expr ::= multi_input_name '(' expression { ',' expression } ')'
multi_input_name ::= MAP_DIV | MAP_MUL | MAP_SUB | MAP_ADD
                   | identifier                  (* catalog-defined multi-input *)
The reducer_name, function_name, and multi_input_name alternatives ending in identifier allow catalog-defined operators not enumerated in the grammar; the resolver validates the name against the operator catalog (Chapter 10). EXTRACT and CAST use special forms, EXTRACT(field FROM expression) and CAST(expression AS type_name):
function_expr ::= ...
                | EXTRACT '(' identifier FROM expression ')'
                | CAST '(' expression AS type_name ')'
A.5.4 Registered convenience and two-anchor operators¶
registered_ratio_expr ::= RATIO_OF '(' expression ',' expression ')'
                        | COUNT_OF '(' bool_expression ')'
                        | two_anchor_expr
two_anchor_expr       ::= PCT '(' expression at_annotation ')'
                        | WEIGHTED_AVG '(' expression ',' expression at_annotation ')'
                        | INDEX '(' expression at_annotation ')'
RATIO_OF and COUNT_OF are the registered convenience operators (§8.6.4, Chapter 10 §10.8.5). RATIO_OF takes a numerator and denominator; COUNT_OF takes a boolean predicate.
PCT, WEIGHTED_AVG, and INDEX are the two-anchor-measure operators (§8.6.6). The @-annotation supplies the companion or weighting grain; it is required when that grain is not uniquely determined, and the operator is refused as dubious if an ambiguous grain is left undeclared. A bare column_ref may also carry an at_annotation (revenue @ region), evaluating that family at the named coarser grain and broadcasting it to the frame grain. The two permitted grain relationships (composite-subset, nested-dimension) and the reduce-of-expression anchor rule are specified in §8.6.6; the grammar admits the annotation syntactically and the resolver enforces the relationships.
A.5.5 CASE expressions¶
case_expr   ::= CASE when_clause { when_clause } [ ELSE expression ] END
              | IF '(' bool_expression ',' expression ',' expression ')'
when_clause ::= WHEN bool_expression THEN expression
A.6 Literals and identifiers¶
literal           ::= numeric_literal
                    | string_literal
                    | boolean_literal
                    | date_literal
                    | timestamp_literal
                    | null_literal
numeric_literal   ::= integer_literal | decimal_literal
integer_literal   ::= digit { digit }
decimal_literal   ::= digit { digit } '.' digit { digit }
string_literal    ::= "'" { string_char } "'"   (* doubled '' for a literal quote *)
boolean_literal   ::= TRUE | FALSE
date_literal      ::= DATE string_literal
timestamp_literal ::= TIMESTAMP string_literal
null_literal      ::= NULL | MISSING
identifier        ::= letter { letter | digit | '_' }
                    | '`' { any_char_except_backtick } '`'
type_name         ::= NUMERIC | INTEGER | STRING | BOOLEAN | DATE | TIMESTAMP
letter            ::= 'a'..'z' | 'A'..'Z'
digit             ::= '0'..'9'
NULL and MISSING are equivalent (§8.3.1). Backtick-quoted identifiers preserve case and may contain characters outside the bare-identifier set (§8.3.6).
A.7 Reserved keywords¶
The following are reserved and may not be used as bare identifiers (backtick-quote them if a column name collides):
SELECT FROM WHERE AT BY USING HAVING ORDER LIMIT WITH AS
AND OR NOT IN BETWEEN LIKE IS NULL MISSING TRUE FALSE
DISTINCT ASC DESC CASE WHEN THEN ELSE END IF
Operator names (SUM, MAX, MAP_DIV, RATIO_OF, PCT, WEIGHTED_AVG, INDEX, …) are not reserved keywords in the strict sense — they are resolved against the operator catalog — but using them as column names is discouraged and requires backtick-quoting to disambiguate from an operator invocation.
Operator names follow a single convention: a widely-recognized short abbreviation where practitioners read it as such (SUM, MAX, MIN, AVG, PCT), the full word where the full word is clearer (INDEX, COUNT, MEDIAN), and underscore-separated parts for compounds (HLL_MERGE, THETA_UNION, WEIGHTED_AVG). The rule is legibility, not uniform length — which is why AVG is abbreviated but INDEX is not.
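The quoting rule for identifiers and reserved keywords can be sketched as a helper that decides when backticks are needed. The function name is hypothetical, and the sketch covers only the keyword list printed above; it does not know about operator names in the catalog, which the text says also need backtick-quoting when used as column names.

```python
import re

# The reserved keywords exactly as listed in A.7.
RESERVED = frozenset("""
SELECT FROM WHERE AT BY USING HAVING ORDER LIMIT WITH AS
AND OR NOT IN BETWEEN LIKE IS NULL MISSING TRUE FALSE
DISTINCT ASC DESC CASE WHEN THEN ELSE END IF
""".split())

# Bare identifier per A.6: a letter followed by letters, digits, or underscores.
_BARE = re.compile(r"[A-Za-z][A-Za-z0-9_]*\Z")


def quote_identifier(name: str) -> str:
    """Return name as a bare identifier when legal, else backtick-quoted."""
    # Keywords are case-insensitive, so compare in upper case.
    if _BARE.match(name) and name.upper() not in RESERVED:
        return name
    if "`" in name:
        raise ValueError("a backtick cannot appear inside a quoted identifier")
    return f"`{name}`"
```

For example, a column named order collides with the ORDER keyword and comes back as a backtick-quoted identifier, while region passes through unchanged.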
A.8 Comments and whitespace¶
Whitespace separates tokens but is otherwise insignificant (§8.3.3). Comments may appear anywhere whitespace may.
A.9 Worked parse: the §1.5 query¶
For reference, the motivating query from Chapter 1 §1.5 parses as follows:
SELECT
  region,
  quarter,
  SUM(revenue) AS revenue,
  SUM(revenue) - SUM(cost) AS profit,
  (SUM(revenue) - SUM(cost)) / SUM(revenue) AS gross_margin_pct
FROM transactions
WHERE
  product_category = 'consumer_electronics'
  AND year IN (2022, 2023, 2024)
AT (region, quarter)
- select_clause: five select_items: two bare column_refs (region, quarter) and three aliased expressions (a reducer_expr, an additive_expr over two reducer_exprs, and a multiplicative_expr over composites).
- from_clause: one schema_ref (transactions).
- where_clause: a bool_expression joining a comparison and an IN predicate with AND.
- at_clause: a parenthesized at_target of two dimension_refs, no using_clause.
No having_clause, order_by_clause, or limit_clause is present. The grammar accepts this query; resolution (Chapter 9) determines whether the AC's commitments construct it.
A.10 What the grammar excludes¶
The grammar deliberately omits constructs that are not part of Coframe Core Frame-QL:
- JOIN: there is no join syntax; cross-schema reach is automatic (§8.2.3).
- GROUP BY: grain is declared by AT, not by a separate grouping clause.
- Window functions (OVER, PARTITION BY, PRIOR, LEAD, RANK, running aggregates): out of Core scope (§8.2.3, §8.8.7, Chapter 10 §10.9).
- Subqueries other than WITH-block inner Frames (§8.2.3).
- DDL/DML: Frame-QL is a query language only; it does not create, alter, or write data. (Persistent re-ingestion of a Frame's output is a Pro capability, §8.7.2.)
A parser encountering these constructs reports a parse error (§8.11.1) identifying the unsupported construct.
Appendix B: Glossary¶
Definitions of the terms used throughout this manual. Each entry gives a concise definition and a pointer to the section where the term is specified in full. Where a term has a precise structural meaning, the definition states it; informal or motivational uses are noted as such.
A (anchor). See anchor.
AC. See Analytic Collection.
AC Surfaces. The umbrella term for an AC's access protocols — Frame-QL, NL Query, MCP, HTTP API, Workbench, Validation. Each Surface is a separately-documented conformance contract (what operations it supports, what its semantics guarantee, what authentication it requires); a given AC deployment may offer some Surfaces and not others. Distinct from the data-API protocol (Chapter 6), which is the framework-to-backend interface, not a consumer-facing AC Surface. (§8.2.2; see coframe_platform_design_v2_1_supplement.md §10.2 for the full enumeration.)
AC-attribute. A column that is neither an AC-dimension nor an AC-metric: it has a fixed single anchor (|A| = 1) across all schemas where it appears, belongs to no dimension family, and describes its anchor without serving as an analytical coordinate. Example: store_open_date. One of the three categories of the trichotomy. (§2.4.5)
AC-dimension. A column that is either (a) in grain role in at least one schema, or (b) reachable from a grain-role column through the FD-DAG — equivalently, a declared member of some dimension family. AC-dimensions function as analytical coordinates. One of the three categories of the trichotomy. (§2.4.5)
AC-metric. A column whose anchor may vary across schemas — observed (or observable) at multiple anchors and representing a measure. AC-metrics are organized into metric families. One of the three categories of the trichotomy. (§2.4.5)
Analytic Collection (AC). The framework's unit of analytical reasoning: a curated, named, structurally-committed selection of columns from one or more backend tables, organized into schemas, with structural commitments declared at the column, schema, and AC levels. Queries are written against an AC, not against backend tables. The AC is the primitive of the Analytic Layer. (§2.3)
Analytic Layer. The architectural category Coframe instantiates: the layer in a data stack that guarantees derivations of metrics are structurally correct — through declared structural commitments, verified integrity conditions, and a graded verification regime (A/AA/AAA). Peer to the Semantic Layer category (Looker, dbt MetricFlow, Cube, AtScale, Snowflake Semantic Views, Databricks Metric Views), which centralizes named metric meanings. The two layers are complementary; they answer different questions. (Preface; see coframe_platform_design_v2_1_supplement.md §1 for the category framing in full.)
Anchor (A). The set of AC-dimensions on whose values a column's value depends. Declared per ColumnSpec via the A field. A commitment, not a description: the AC author commits that the column's value depends on exactly the AC-dimensions named in A. (§2.4.1)
Anchor-locked family. A metric family whose family-root operator is not partition-invariant and which therefore has no ip_reducer. Its columns exist at specific anchors and cannot be derived to other anchors via name-preserving aggregation; cross-grain queries against it are refused. Example: a family rooted at AVG or MEDIAN. (§2.8.4)
AT clause. The Frame-QL clause declaring the output grain — the AC-dimensions each output row represents. Canonical keyword; BY is an accepted but deprecated synonym. AT is locative (it names the grain values sit at) rather than operative (an instruction to group). (§8.5.4)
@ (grain annotation). A per-operand annotation, term @ anchor, evaluating a sub-term at a grain other than the frame grain and bringing it to the frame grain. Used in two-anchor measures and reduce-of-expression. Permitted only for coarsenings of the frame grain (composite-subset or nested-dimension). Where AT sets the Frame's grain, @ sets a sub-term's grain. (§8.6.6)
Asserted-not-verified fact. A fact the framework's correctness depends on but does not verify per-AC — inherited from the operator catalog, the framework's principles, or an architectural commitment. The DQ deliverable reports these distinctly from data-attested conditions. (§2.10.6, §7.6)
Attestation. See per-lineage-edge attestation.
Base level. The finest AC-dimension in a dimension family — the unique member from which all other members are reachable through declared FD-edges. The base level need not be in grain role in any schema; it need only be grounded. Example: day is the base level of the time dimension family. (§2.5.2, §3.8.2)
Block set (A_block). A set of dimension families along which an ip_reducer must not be applied for a given metric family. Declared per ip_reducer at the metric-family level; downward-closed along the FD-DAG. Written {time} in prose, [time] in YAML. Example: eom_inventory's SUM ip_reducer has A_block = {time}. (§2.8.1, §2.8.3)
cache_hint. An optional block on a metric family declaring which (family, anchor) pairs the Metric Engine should pre-materialise at COMMIT time. Advisory only — omitting it does not change the family's semantics. The engine's promotion-recommendation surface emits paste-able cache_hint stanzas for hot entries (§11.7). (§3.5.7, §11.2, §11.7)
Candidate FD-DAG. The set of FD-edges the AC author declares. Verified against the data-driven FD-DAG during DQ Phase 2; the integrity condition is candidate ⊆ data-driven. (§2.5.5)
ColumnSpec. The AC's per-column structural commitment, declared within a schema. Has four parts: backend-facing (src_name, data_type), anchor-facing (A, M), operator-lineage (op, lineage), and cross-schema linkage (name). (§2.9.6, §3.8)
Constructive correctness. The framework's central guarantee: when it accepts a Frame-QL query, the answer is a consequence of the AC's declared and verified commitments — constructed by an algorithm, not produced by semantic interpretation or guessing. (§1.6)
Cousin. One of two columns with the same name but different family-roots — independent observations sharing a family-name. A query referencing a name that resolves to multiple cousins is refused as dubious unless disambiguated by qualified reference or FROM. (§2.7.5)
Data-API protocol. The specified interface between the framework and a backend: the operations the framework requires (connection, introspection, structural verification, projection, aggregation), their request/response shapes, and the protocol for invoking them. (Chapter 6)
Data-driven FD-DAG. The set of FD-edges actually attested by the backend data, as opposed to the candidate FD-DAG the AC declares. (§2.5.5)
Declared scope. Per schema, per dimension family, whether the schema is non-degenerate (has analytical anchoring in that family) or degenerate (has none). (§2.9.4)
Degenerate. A schema is degenerate on a dimension family when it has no analytical anchoring there — a deliberate declaration, the controlled exception to Principle 2. (§2.9.4, §2.11.2)
Derived family. A metric family whose family-root's lineage points to a column in a different family; it inherits structure from the predecessor family via the family-DAG. Contrast primitive family. (§2.7.8)
Dimension family. A named grouping of related AC-dimensions, organized into one or more hierarchies, with a designated base level. The coordinate layer of an AC. Example: the geography dimension family contains store, city, region, country. (§2.5.1)
Dubious query. A query whose answer is not unique under the AC's commitments, so the framework refuses it rather than choosing. Two sources: (a) a family-name resolving to multiple non-equivalent family-roots (cousins); (b) a two-anchor measure whose formation grain is ambiguous and undeclared — e.g. a weighted average whose weighting grain is not uniquely determined (§8.6.6). In both cases the author must disambiguate. (§1.6, §8.6.6, §9.7)
EngineEntry. One materialised row in the Metric Engine's manifest: domain (METRIC / QM), dataset_id, anchor, parquet path, operator + partition-invariance + missing-value treatment (METRIC-only), source schemas + filter, byte size, materialised-at + last-access timestamps, access count, verification level. The on-disk identity of one cached aggregate or QM profile. (§11.3.2)
FD-DAG. The collection of all functional-dependency edges in the AC, partitioned by dimension family, required to be acyclic. (§2.5.4)
FD-edge (functional-dependency edge). A declaration d1 → d2 that every value of d1 determines a unique value of d2. Grounded either as data-attested (observed in the data) or function-derived (computed by a catalog function). Example: store → region. (§2.5.4, §2.5.6)
FDStep. One edge in a Metric Engine ServingPath: a transition from a finer source anchor to a coarser target anchor via a partition-invariant operator. The atom of FD-DAG rollup in the engine's serve() branch 2. (§11.4.2)
Family. See metric family and dimension family. Unqualified, "family" usually refers to a metric family.
Family-DAG. The acyclic graph of derivation edges between metric families, where a derived family's root depends on a predecessor family. (§2.7.8)
Family-name. A name value appearing in one or more ColumnSpecs. Two columns with the same name belong to the same metric family. Names are opaque labels: the framework compares them for equality only, never parsing meaning. (§2.7.3, §2.11)
Family-root. The earliest column reached by walking a column's lineage backward while the predecessor name equals the current name. The structural origin of a metric family. A family-root has self-referential lineage. (§2.7.4)
Four-rule filter. The resolver's central plan-evaluation mechanism: a candidate plan must pass all four rules — family resolution, anchor-set capability, schema selection, cross-schema coherence. (§9.3)
Frame. The output of a Frame-QL query: a rectangular datagrid of columns at a declared grain. Conceptually, a Frame is a lightweight output schema — a set of lightweight ColumnSpecs, minus what would make it re-ingestable as a permanent AC member. The core of a Frame-QL query, surrounded by accessory clauses. (§8.2)
Frame expression. A Frame-QL expression computing a column at the output grain from terms each available at that grain (a bare family column, an explicit reducer, an @-anchored sub-term, or arithmetic over these). Every term resolves to a single value per output row; no term references other rows of the assembled frame. (§8.6)
Frame-QL. Coframe Core's declarative query language. A query describes the desired output Frame (via SELECT + AT) surrounded by accessories (FROM, WHERE, HAVING, ORDER BY, LIMIT); the framework constructs the algorithm that produces it. (Chapter 8)
Frame test. The decidable check for whether a computation is expressible in Frame-QL Core: can a column's value be specified from its own row's grain (expressible — it is a column spec), or does it require referencing other rows of the assembled result (not expressible in Core — a frame-level operation)? (§8.2)
Function (operator). An operator that transforms values without aggregation, producing a successor at the same anchor as its predecessor (A_pred = A_self). Example: MONTH_OF, ABS. (§2.6.1, §10.7)
Function-derived FD-edge. An FD-edge grounded by a deterministic catalog function rather than observed in data. Example: month = MONTH_OF(day). Contributes to the FD-DAG identically to a data-attested edge. (§2.5.6)
Grain. Of a schema: the anchor at which its rows are uniquely identified, derived from its grain-role columns. Of a query/Frame: the output anchor declared by AT. (§2.9.3, §8.5.4)
Grain role. A column is in grain role in a schema when its anchor is the schema's grain — A = {self} for a single-column grain, or the full composite (e.g., {store, month}) for a composite-grain contributor. A column may be in grain role in one schema and non-grain role in another. (§2.4.2)
Grounded. An AC-dimension is grounded when its values can be obtained from observed data: it either appears in some ColumnSpec in any role, or is FD-reachable from such an AC-dimension. The requirement that lets a base level (e.g., day) be valid without itself being in grain role. (§3.8.2)
Hierarchy. A named path within a dimension family from the base level upward, traversing FD-edges. A dimension family may have multiple hierarchies sharing the base level. Example: the time family's calendar, fiscal, and week hierarchies. (§2.5.3)
Holistic operator. A reducer that is not partition-invariant under any finite-state enrichment — it cannot roll up across grain. Example: exact MEDIAN, exact COUNT_DISTINCT, MODE. Families rooted at holistic operators are anchor-locked. (§2.6.4, §10.6)
Identical. Two columns with the same name, same anchor, and same family-root — structurally interchangeable. (§2.7.5)
Identity-preservation. A property of an operator relative to a predecessor family: whether applying the operator yields a column in the same family (same name). Orthogonal to partition-invariance. SUM is identity-preserving for revenue (→ revenue); COUNT is not (→ revenue_count). Mediated by the naming function. (§2.6.5)
ip_reducer (identity-preserving reducer). The operator under which a metric family's columns roll up consistently — both partition-invariant and identity-preserving for the family. Declared as a pair (R, A_block). A family may declare zero (anchor-locked), one, or several. (§2.8.1)
Lineage. A ColumnSpec's record of its immediate predecessor: a triple (name_pred, A_pred, op_pred). Self-referential for a root column. Multi-input operators carry a list of predecessor records. (§2.7.1)
Lineage edge. The relationship between a column and its predecessor. The collection of all lineage edges is the AC's lineage graph. (§2.7.2)
Liftably monoidal operator. A reducer not partition-invariant in its natural value space but partition-invariant in an enriched space, with a final projection extracting the result. Example: AVG via (SUM, COUNT). (§2.6.4, §10.5)
Missingness anchor (M). The per-ColumnSpec set of coordinates (a subset of A ∪ {self}) on whose values the column's missingness depends. The MCAR / MAR / MNAR mechanism category is a derived, lossy summary of M, not a separately-declared field. Paired with A to form the twin anchor on every ColumnSpec. (§2.4.3)
Metric Engine. Coframe Core's optional per-AC acceleration substrate. A query engine layered on Polars + Parquet + a SQLite manifest; memoises per-(metric family, anchor) aggregates; serves them via three-branch dispatch (exact match → FD-DAG rollup → backend fallback); composes multi-metric Frames in one operation. Hosts METRIC + QM domains on a shared substrate. Opt-in via the installation.yaml metric_engine block. (Chapter 11)
Metric family. A named grouping of AC-metrics related through lineage to a common family-root. The observation layer of an AC. Example: the revenue family. (§2.7)
Monoid (commutative). The algebraic structure (V, R, e) — value space, operator, identity — that an operator forms iff it is partition-invariant on that value space. The framework's general criterion for rollup safety, accommodating numeric and non-numeric (sketch) value spaces alike. (§2.6.2)
Multi-input operator. An operator taking two or more inputs at a common anchor and producing a singleton output. Example: MAP_DIV. (§2.6.1, §10.8)
Multi-Table Invariance (MTI). The theorem that any two query plans the four-rule filter accepts for the same query produce the same output frame (modulo declared epsilons and row ordering), given verified commitments. The guarantee underlying cross-schema resolution. (§9.6)
Naming function. The AC-level declaration mapping (name_pred, A_pred, op) to the successor's name, allowing the framework to verify that a column's name is consistent with its operational lineage. (§3.6)
Non-degenerate. A schema is non-degenerate on a dimension family when its rows can be meaningfully classified along that family's hierarchies. Contrast degenerate. (§2.9.4)
OBSERVED. The operator marking a column as directly-observed data with no derivation; a family-root with self-referential lineage. (§10.10)
Operator. A catalog-defined relationship between a predecessor column and a successor column, with declared structural properties (type, partition-invariance, identity-preservation, type signature, missing-value treatment). Reducers, functions, multi-input operators, and OBSERVED. (§2.6, Chapter 10)
Operator-asserted family. A metric family whose ip_reducer cannot be tested against data because no in-AC sibling exists. The framework trusts the declaration; the DQ deliverable records the commitment as asserted-not-verified. Contrast operator-attested. (§2.7.6)
Operator-attested family. A metric family with at least one in-AC sibling against which Phase 3 can test the ip_reducer. (§2.7.6)
Partition-invariance. The property of an operator R that R(X) = R(R(X_1), …, R(X_n)) for any partitioning of X — equivalently, that (V, R, e) forms a commutative monoid. What licenses rollup. (§2.6.2)
Per-lineage-edge attestation. The DQ Phase 3 verification that, for each attestable lineage edge, the predecessor's data aggregated via the family's ip_reducer at the successor's anchor agrees with the successor's observed values (respecting block sets and missing-value treatment). (§2.10.5, §7.5)
PER subclause. Optional subclause of LIMIT that makes the row cap apply within groups defined by a subset of the AT dimensions, rather than across the entire frame. LIMIT 3 PER region returns up to three rows per region. Pure output filtering — not a window function (§8.5.7).
Primitive family. A metric family rooted at OBSERVED with no further lineage upward — the data is the family's root. Contrast derived family. (§2.7.8)
Principle 1 (column-borne information). Every column's value is determined by the entities it is anchored to, as declared by A. (§2.11.1)
Principle 2 (same universe of observation). All schemas in an AC observe the same underlying entities, which is what makes cross-schema commitments meaningful. (§2.11.2)
Promotion recommendation. A surface emitted by the Metric Engine identifying cached entries whose sustained access pattern (hit count over an age window) qualifies them for promotion from ambient cache to the AC's declared cache_hint. Each recommendation carries a paste-able YAML stanza the AC author appends to the relevant family. The bridge between lazy memoisation and push pre-materialisation in the declare → observe → promote flywheel. (§11.7)
Reduce-of-expression. A reducer applied to an expression rather than a bare family column (e.g. SUM(rating * enrollment @ school)). Requires an explicit formation anchor (@) — the bare family column is the sole exception. The requirement is conservative: it is enforced uniformly rather than only for the nonlinear cases where the grain provably matters. Desugars to a staged frame (§8.7). (§8.6.6)
Reducer. An operator that aggregates values across a predecessor's anchor, producing a successor at a coarser anchor (A_pred ⊇ A_self). Example: SUM, MAX, HLL_MERGE. (§2.6.1)
Registered ratio / count operator. A query-time convenience operator: RATIO_OF(num, den) (ratio of two reducers at the output grain) and COUNT_OF(predicate) (conditional count). Not stored metrics. (§8.6.4, §10.8.5)
Resolver. The framework component that takes a Frame-QL query against an AC and constructs the query plan (or a refusal). Operates in seven stages, applying the four-rule filter at plan selection. (Chapter 9)
Rung. A level in the pedagogy of increasing query capability used to organize Frame-QL examples. Coframe Core supports rungs 0 (read), 1 (identity-preserving reduction), 2 (broadcast), 6 (multi-input expressions), 7 (cross-schema reach), and 9 (WITH-chained frames). (§8.8)
Schema. A structural object within an AC binding to a single backend source and declaring ColumnSpecs for its analytically-relevant columns. The binding layer of an AC. (§2.9.1)
Schema.init. The YAML document holding an AC's complete declaration: AC-level metadata, dimension families, metric families, naming function, attestation config, and schemas with their ColumnSpecs. (§3.2)
Semi-additive measure. A metric that rolls up additively along some dimension families but not others — captured by an ip_reducer with a non-empty block set. Example: eom_inventory (SUM across geography, blocked across time). (§2.8.1)
served_from. The observational indicator the Metric Engine attaches to every Frame: engine_cache (all metrics from cached entries), engine_backend (all metrics fell through to backend fallback + were memoised on the way), engine_mixed (some hit cache + some fell back), or backend (the engine is disabled for the AC). Observational, not contractual — Frames are equal across all four branches; the indicator exposes the serving path for observability and tests. (§11.6)
ServingPath. The Metric Engine's plan for resolving a METRIC-domain serve() from materialised state: a source EngineEntry to scan, a sequence of FDStep rollups to traverse, the operator at each step, and the final target anchor. Empty rollup_steps = exact match (Branch 1); non-empty = FD-DAG rollup (Branch 2). Branch 3 (backend fallback) does not produce a ServingPath. (§11.4)
Sibling. One of two columns with the same name, different anchors, and the same family-root — the same metric observed at different anchors, navigable via the family's ip_reducers. (§2.7.5)
Singleton (family). A one-member metric family produced by a multi-input operator (or otherwise leaf in the genealogy). Anchor-locked; no further family lineage upward. Example: a stored gross_margin_pct. (§2.7.7)
Trichotomy. The exhaustive, mutually-exclusive, metadata-derivable classification of every AC column as AC-dimension, AC-attribute, or AC-metric. Distinct from a column's per-schema grain-vs-non-grain role. (§2.4.5)
Two-anchor measure. An analytical measure combining a value at the frame grain with an aggregate of a metric taken at a coarser grain (reached by FD-navigation, then broadcast back). Examples: percentage of a coarser total (PCT), weighted average (WEIGHTED_AVG), index to a base coordinate (INDEX). Expressed via @-anchored frame expressions; the named operators are sugar and the surface at which the grain-ambiguity dubiousness law is enforced. Distinct from window functions, which reference other rows of the assembled result and are out of Core. (§8.6.6)
Twin anchor ((A, M)). The pair of coordinate-sets attached to every column: the anchor A (what the column's value depends on) and the missingness anchor M (what its being-missing depends on, ranging over A ∪ {self}). Both are sets over the same coordinate space; the framework requires both on every ColumnSpec. (§2.4.4)
Top-N per group. The pattern of returning at most n rows within each group defined by a subset of the output grain — e.g. top three stores per region by revenue. Expressed in Frame-QL with LIMIT n PER cols; pure output filtering rather than a window function, since it discards rows of the frame rather than computing per-row values that read across rows. (§8.5.7)
Verification level (A / AA / AAA). The accreditation level indicating how strictly an AC has been verified: A (basic structural), AA (structural + value coherence), AAA (full + assertions documented). (§7.8)
Virtual table. The conceptual table a schema binds to — possibly a physical table, view, query result, or federation. The data-API abstracts the binding. (§2.9.2)
WITH-block. A Frame-QL construct defining one or more session-local inner Frames followed by an outer Frame that may reference them. The mechanism for rung-9 chained frames. (§8.7)
Appendix C: Related Work and Provenance of Ideas¶
This appendix situates Coframe Core against the established literature and the contemporary tool landscape. It states plainly what the framework borrows, what it generalizes, and what is genuinely new in its construction. The intent is to make Coframe's contribution legible by placing it accurately: a framework is more credible, not less, for citing the work it builds on and naming the precise delta it adds. Several of Coframe's structural conditions are necessary truths about correct aggregation that the framework uncovers and operationalizes rather than invents; this appendix says so.
C.1 Summarizability and the conditions of correct aggregation¶
The question of when an aggregate can be correctly computed at a coarser level from values at a finer level is studied in the data-warehousing literature under the name summarizability. The notion was introduced by Lenz and Shoshani ("Summarizability in OLAP and Statistical Data Bases," SSDBM 1997) and developed by many successors, who established conditions — disjointness, completeness, and type compatibility of measures with aggregation functions — under which summarization is valid.
The taxonomy of measures as additive, semi-additive, and non-additive (holistic) is established practitioner vocabulary, popularized in dimensional-modeling practice (Kimball and Ross, The Data Warehouse Toolkit), and the algebraic study of which aggregation functions distribute over partitioning appears in the OLAP-algebra literature (e.g., the distributive / algebraic / holistic classification of aggregate functions due to Gray et al., "Data Cube," 1997).
What Coframe takes from this work: the conditions of summarizability, the additive/semi-additive/holistic taxonomy, and the distributive/algebraic/holistic distinction (which appears here as natively-monoidal / liftably-monoidal / holistic, §2.6.4). Coframe does not claim to have discovered these. The structural conditions its Data Quality process verifies (Chapter 7) — grain uniqueness, functional-dependency validity, cross-schema value consistency, pre-aggregate/detail coherence — are necessary conditions for any correct analytical aggregation, independent of Coframe; they are properties of the data and the question.
What Coframe adds: it operationalizes these conditions as a closed checklist derived from the AC's column declarations, unifies their checking under one structural model rather than leaving them to scattered, ad-hoc tests, and runs them as a standing, phased verification regime that publishes a verification status (the verification levels of §7.8) which the query constructor consults. Because the checklist is derived from the structure, it is scoped to exactly the conditions the structure makes load-bearing for answerable queries — which a generic data-quality tool cannot determine, lacking the structural model that says which dependencies are navigated.
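Two of the structural conditions named above, functional-dependency validity and grain uniqueness, can be illustrated with a minimal sketch. It is not Coframe's Data Quality implementation: the row layout, the function names `fd_violations` and `grain_duplicates`, and the sample data are all assumptions made for illustration.

```python
from collections import defaultdict

def fd_violations(rows, determinant, dependent):
    """Return {d1 value: set of d2 values} for every d1 value that maps to
    more than one d2 value, i.e. every witness against the FD d1 -> d2."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[determinant]].add(row[dependent])
    return {k: v for k, v in seen.items() if len(v) > 1}

def grain_duplicates(rows, grain):
    """Return {grain key: row count} for every grain key that occurs more than once."""
    counts = defaultdict(int)
    for row in rows:
        counts[tuple(row[c] for c in grain)] += 1
    return {k: n for k, n in counts.items() if n > 1}

# Hypothetical rows at the composite grain {store, month}.
rows = [
    {"store": "s1", "region": "north", "month": "2026-01", "revenue": 100},
    {"store": "s2", "region": "north", "month": "2026-01", "revenue": 50},
    {"store": "s1", "region": "south", "month": "2026-02", "revenue": 70},
]

print(fd_violations(rows, "store", "region"))       # s1 maps to two regions
print(grain_duplicates(rows, ("store", "month")))   # no duplicate grain keys
```

In this sample the grain is unique but the declared edge store → region does not hold, so a rollup from store to region would be unsafe even though each row is individually well-formed.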
C.2 Attribute-centric control of aggregation correctness¶
The idea that the correctness of an aggregate query can be governed by per-attribute properties — describing, for each attribute, which aggregation functions correctly aggregate it along which sets of dimensions — is developed by Simon, Amann, Liu, and Gancarski ("Controlling the Correctness of Aggregation Operations During Sessions of Interactive Analytic Queries," arXiv:2111.13927, 2021). That work introduces aggregable properties and uses them to automatically detect, and refuse, semantically incorrect aggregate queries composed from filter, project, join, aggregate, union, difference, and pivot over analytic tables and views.
What Coframe shares: the attribute-centric stance that aggregation correctness is governed by per-column properties relating operators to dimensions, and the disposition to refuse incorrect aggregations rather than compute them. Coframe's per-column anchor together with its family's ip_reducers and block sets (Chapter 2) is recognizably in this family of ideas.
What Coframe adds: the setting and the construction differ substantially. Coframe's query is posed against an Analytic Collection that has no tables and no join operation in its query model (§1.3.1); the per-column constraint generalizes to arbitrary partition-invariant reducers, not only additive measures (§2.8.3); and the constraint is embedded in a constructive resolution algorithm that builds the answer column by column (Chapter 9), rather than in a checker that validates a table-algebra query after the fact. The correctness idea has a close neighbor; the column-native, table-free architecture it is embedded in does not.
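The per-column constraint described above (an ip_reducer paired with a block set) can be sketched as a simple admissibility check. This is an illustrative reading, not the resolver's algorithm: the function `rollup_allowed`, the `FAMILY_OF` mapping, and the rule "a rollup is refused if it moves along any blocked dimension family" are assumptions, and the sketch does not check that the target is FD-reachable from the source.

```python
def rollup_allowed(source_anchor, target_anchor, family_of, block_set):
    """Return (allowed, blocked_families) for a proposed rollup.

    Assumption: a dimension family is 'touched' when the source and target
    anchors differ on any of its dimensions, and a rollup is refused when a
    touched family is in the reducer's block set.
    """
    moved = source_anchor ^ target_anchor          # dimensions dropped or introduced
    touched = {family_of[d] for d in moved}
    blocked = touched & block_set
    return (not blocked, blocked)

# Hypothetical dimension-family membership.
FAMILY_OF = {"store": "geography", "region": "geography",
             "day": "time", "month": "time"}

# eom_inventory: SUM across geography, blocked across time.
EOM_INVENTORY_BLOCK = frozenset({"time"})

src = frozenset({"store", "day"})
print(rollup_allowed(src, frozenset({"region", "day"}), FAMILY_OF, EOM_INVENTORY_BLOCK))
print(rollup_allowed(src, frozenset({"store", "month"}), FAMILY_OF, EOM_INVENTORY_BLOCK))
```

The first call moves only along geography and is allowed; the second moves along time and is refused. The same check applies unchanged to a non-additive reducer, since nothing in it depends on the operator being SUM.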
C.3 Aggregate awareness and materialized-view rewriting¶
Commercial analytics tools have long routed queries to pre-aggregated tables for performance. Aggregate awareness (automatically substituting a smaller rollup table for the detail when a query's grain permits) appears in Looker (aggregate awareness), Cube (additive rollup pre-aggregations), and Holistics, among others. The general problem of answering a query from precomputed summaries is known as answering queries using views, or materialized-view query rewriting, a well-studied area of database research.
What Coframe shares: the goal of serving queries from pre-aggregated summaries when it is sound to do so. Coframe's Multi-Table Invariance (§9.6) is the soundness condition that licenses this substitution — the same role materialized-view rewriting plays in a relational optimizer.
What Coframe adds: verification of the premise these systems assume. Aggregate-aware tools generally assume the pre-aggregate equals the detail and route to it for speed; they do not verify the equality. Coframe pairs the substitution license (MTI) with per-lineage-edge attestation (Chapter 7 Phase 3), which tests that the rollup actually equals the rolled-up detail under the declared operator, within declared tolerance, respecting block sets. The result is acceleration with a verified correctness premise, rather than acceleration on trust. To the authors' knowledge, this routine, contractual attestation of pre-aggregate/detail coherence, together with its surfacing as a published verification level, is not standard in mainstream aggregate-aware tools.
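The attestation idea can be sketched for a single lineage edge with SUM as the declared operator. This is a minimal illustration under stated assumptions, not the Phase 3 implementation: `attest_sum_edge`, the dictionary layout, and the `rel_tol` parameter are hypothetical, and block sets and missing-value treatment are not modeled.

```python
import math
from collections import defaultdict

def attest_sum_edge(detail, rollup, fd, rel_tol=1e-9):
    """Compare a stored pre-aggregate against the detail rolled up via SUM.

    detail: {store: value} at the finer anchor
    rollup: {region: value} at the coarser anchor (the stored pre-aggregate)
    fd:     {store: region}, the FD-edge used to navigate between them
    Returns {region: (expected_from_detail, observed_in_rollup)} for every
    region where the two disagree beyond the tolerance or one side is absent.
    """
    rolled = defaultdict(float)
    for store, value in detail.items():
        rolled[fd[store]] += value

    mismatches = {}
    for region in set(rolled) | set(rollup):
        expected = rolled.get(region)
        observed = rollup.get(region)
        if (expected is None or observed is None
                or not math.isclose(expected, observed, rel_tol=rel_tol)):
            mismatches[region] = (expected, observed)
    return mismatches

# Hypothetical data: the stored 'south' total disagrees with the detail.
detail = {"s1": 100.0, "s2": 50.0, "s3": 30.0}
fd = {"s1": "north", "s2": "north", "s3": "south"}
rollup = {"north": 150.0, "south": 31.0}

print(attest_sum_edge(detail, rollup, fd))
```

An empty result means the pre-aggregate may be substituted for the detail on this edge; a non-empty result is the kind of finding an aggregate-aware router that assumes equality would never surface.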
C.4 Semantic layers and the metrics-definition ecosystem¶
The contemporary semantic layer category — LookML (Looker), the dbt Semantic Layer powered by MetricFlow, Cube, AtScale, Snowflake Semantic Views, and Databricks Metric Views — centralizes metric definitions so that consumers query named measures without writing the underlying SQL, and increasingly carries grain-awareness and additivity metadata. The Open Semantic Interchange (OSI) initiative (with participation from dbt Labs, Snowflake, Salesforce, and others) pursues vendor-neutral, define-once metric definitions consumable across tools. A shared motivation across this ecosystem, stated explicitly by its proponents, is that AI agents must not be left to guess joins, grains, filters, and windows — that governed semantics are required for trustworthy automated analytics.
What Coframe shares: the goal of querying a governed model rather than raw tables, and the conviction that systems (and agents) should rely on declared structure rather than guesswork. Coframe does not claim originality for either; both are common ground in this category.
What Coframe does differently — a different decomposition, not a different category. The prevailing semantic layer fuses two layers into one artifact: the operational rules of a measure (how it may be aggregated, along which dimensions) and the named business measure itself. Coframe separates them: aggregation rules live at the structural/grammar layer and apply generically to any column that satisfies them (Chapter 2), while naming is a thin, customizable surface. Two consequences follow that distinguish Coframe within the category:
- Open, query-time composition. Because the safety rules attach to a column's structure rather than to a pre-blessed named metric, any column may participate in any grammatically-valid expression at the output grain (Chapter 8). The composable surface is not limited to compositions declared in advance.
- Contextual multiplicity. Because an AC is derived entirely from column declarations and detached from physical tables, ACs are cheap and scoped-to-purpose: many ACs over one backend, in which a term such as revenue may legitimately resolve to a different, individually-rigorous definition per department, project, or audience (internal-facing versus agent-facing). This is a deliberate alternative to the "single canonical definition for the whole organization" posture that the semantic-layer category generally adopts. It trades the guarantee of one negotiated meaning for the flexibility of many honest, context-specific meanings, and places on the organization the corresponding responsibility to reconcile them where reconciliation is required.
Coframe is therefore best understood not as a replacement for the semantic layer but as a structurally-verified, column-native layer that can complement the metrics ecosystem — providing provably-safe acceleration and contextual scoping — rather than as a competitor for the canonical-metric-definition role.
C.5 The algebra of aggregation and sketch monoids¶
Coframe's treatment of partition-invariance as the commutative-monoid property of a (value space, operator, identity) triple (§2.6.2) is the algebraic view of aggregation. That additive aggregates, extremal aggregates, and probabilistic sketches (HyperLogLog for distinct counts; theta sketches for set operations; t-digest and KLL for quantiles) all form monoids under their merge operations is well known in the streaming-algorithms and approximate-query literature.
What Coframe takes: the monoid framing and the fact that sketches are mergeable monoidal objects.
What Coframe adds: uniform first-class treatment of sketch reducers within the same machinery as numeric reducers — including in cross-grain navigation, in block-set constraints (§2.8.3), and in per-lineage-edge attestation (where sketches are compared structurally, §7.5). Many tools special-case approximate-distinct as a separate feature; Coframe treats HLL_MERGE as the same kind of object as SUM, differing only in its value space.
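The "same kind of object, different value space" claim can be made concrete with a small sketch in which every reducer is a (value space, merge, identity) triple plus an optional final projection. The `Monoid` class and the helper names are hypothetical, not catalog definitions. The `DISTINCT` entry uses an exact set purely as a stand-in for a sketch such as HyperLogLog: it has the same merge shape but unbounded state, so it is not itself a finite-state enrichment.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Any, Callable

@dataclass(frozen=True)
class Monoid:
    lift: Callable[[Any], Any]           # raw value -> element of the value space V
    merge: Callable[[Any, Any], Any]     # associative and commutative on V
    identity: Any                        # neutral element e
    project: Callable[[Any], Any] = lambda v: v   # final projection out of V

    def fold(self, values):
        """Reduce raw values to a single element of V (no projection)."""
        return reduce(self.merge, (self.lift(x) for x in values), self.identity)

SUM = Monoid(lift=lambda x: x, merge=lambda a, b: a + b, identity=0)
MAX = Monoid(lift=lambda x: x, merge=max, identity=float("-inf"))
# AVG is not partition-invariant on plain numbers, but is on (sum, count) pairs.
AVG = Monoid(lift=lambda x: (x, 1),
             merge=lambda a, b: (a[0] + b[0], a[1] + b[1]),
             identity=(0, 0),
             project=lambda s: s[0] / s[1] if s[1] else None)
# Stand-in for a distinct-count sketch: merge is set union, projection is size.
DISTINCT = Monoid(lift=lambda x: frozenset([x]),
                  merge=lambda a, b: a | b,
                  identity=frozenset(),
                  project=len)

def rollup(m, partitions):
    """Reduce each partition, then merge the partial states: R(R(X_1), ..., R(X_n))."""
    partials = [m.fold(p) for p in partitions]
    return m.project(reduce(m.merge, partials, m.identity))

def direct(m, partitions):
    """Reduce the unpartitioned data in one pass: R(X)."""
    return m.project(m.fold([x for p in partitions for x in p]))

partitions = [[3, 1, 3], [7], [1, 2]]
for name, m in [("SUM", SUM), ("MAX", MAX), ("AVG", AVG), ("DISTINCT", DISTINCT)]:
    print(name, rollup(m, partitions), direct(m, partitions))
```

For each of the four reducers the rolled-up and direct results agree on this data, which is the partition-invariance property of §2.6.2 exercised through one code path. Averaging the per-partition averages instead would give a different answer, which is why AVG needs the enriched (SUM, COUNT) space.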
C.5a Missing-data mechanisms¶
The classification of missingness as MCAR (missing completely at random), MAR (missing at random), and MNAR (missing not at random) is due to Rubin ("Inference and Missing Data," Biometrika, 1976) and developed in Little and Rubin, Statistical Analysis with Missing Data. The taxonomy turns on what the probability of missingness depends on: nothing (MCAR), observed data only (MAR), or the unobserved value itself even after conditioning on the observed (MNAR).
What Coframe takes: the taxonomy itself, unmodified. Coframe claims no originality for the mechanism classification.
What Coframe adds: the placement of missingness as a first-class per-column coordinate — the missingness anchor M (Foundations §2.4.3), a set of coordinates (over A ∪ {self}) on whose values the column's missingness depends. The three-way category is treated as a derived, lossy summary of M: the set distinguishes cases the category conflates ({self} and {self, region} are both MNAR but are different anchors), which aligns with the literature's observation that the MAR/MNAR boundary shifts with the conditioning set. Because M is a coordinate that travels through aggregation, Coframe selects missing-value treatment as a joint function of the operator and of whether M's coordinates survive into the output grain (§10.12) — skip when the rollup neutralizes the mechanism and an unbiased result is achievable, otherwise propagate. This operator-and-measure-sensitive treatment is awkward to express when missingness is handled only at load time, only on the measure, or only on the operator. Coframe's contribution is the placement and the structural treatment-selection discipline; it does not perform statistical inference under missingness (imputation and bias-tolerance, which require external knowledge or judgment, are deferred to Coframe Pro).
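The statement that the three-way category is a lossy summary of M can be sketched directly. The function `mechanism_category` and the use of the literal token "self" are assumptions for illustration; the treatment-selection rules of §10.12 are not modeled here.

```python
def mechanism_category(M, self_token="self"):
    """Derive the MCAR / MAR / MNAR label from a missingness anchor M.

    M is a set of coordinates drawn from A ∪ {self}:
      - empty set            -> missingness depends on nothing        -> MCAR
      - contains the self    -> depends on the unobserved value itself -> MNAR
      - otherwise            -> depends on observed coordinates only   -> MAR
    """
    M = frozenset(M)
    if not M:
        return "MCAR"
    if self_token in M:
        return "MNAR"
    return "MAR"

# Hypothetical missingness anchors for four columns.
examples = [set(), {"region"}, {"self"}, {"self", "region"}]
for M in examples:
    print(sorted(M), mechanism_category(M))
```

The last two anchors differ as sets but collapse to the same label, which is the conflation described above: the mapping from M to the category cannot be inverted.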
C.6 Summary of provenance¶
| Idea in Coframe | Established prior art | Coframe's delta |
|---|---|---|
| Summarizability conditions; DQ checks | Lenz & Shoshani (1997) and successors | Operationalized as a structure-derived, unified, standing verification regime with published levels |
| Additive / semi-additive / holistic measures | Kimball; Gray et al. "Data Cube" (1997) | Encoded as block sets; generalized off time and to arbitrary monoidal reducers |
| Per-attribute aggregation-correctness control | Simon, Amann, Liu, Gancarski (2021) | Embedded in a table-free, column-native constructive model rather than a table-algebra checker |
| Aggregate awareness | Looker, Cube, Holistics | Premise verified by attestation rather than assumed |
| Answering queries using views | Materialized-view rewriting literature | MTI as the in-framework soundness license, paired with attestation |
| Governed semantics for querying / agents | dbt Semantic Layer / MetricFlow, Cube, AtScale, Snowflake/Databricks, OSI | Different decomposition (grammar + thin naming); contextual multiplicity rather than canonical singularity |
| Sketches as mergeable summaries | Streaming-algorithms / approximate-query literature | First-class monoidal reducers within one uniform mechanism |
| MCAR / MAR / MNAR missingness mechanisms | Rubin (1976); Little & Rubin | Missingness as a first-class per-column coordinate; structural treatment-selection by operator and rollup direction (not statistical inference) |
The honest summary: Coframe's contribution is architectural and integrative — a column-native, table-free decomposition of the analytical stack that makes classes of error inexpressible and classes of acceleration provably safe — rather than the discovery of new conditions for correct aggregation. The conditions are largely known; the architecture that makes them constructive, unified, and verifiable is the new part.
C.7 A note on what remains open¶
Two areas are acknowledged as not yet addressed by Coframe Core, and are noted here so the framework's scope is not overstated:
- Time-varying dimensions. Slowly-changing dimensions, reorganizations, late-arriving data, and bitemporality are the most demanding case for any framework whose guarantees rest on functional dependencies, since a declared FD such as store → region may not hold across time. Coframe Core treats FDs as stable and defers temporal validity to future work; this is the framework's most significant open problem at the structural layer.
- Generalization beyond the relational frame. Coframe's column-as-keyed-values foundation is, in principle, more general than tables, but its correctness and acceleration guarantees rest on a navigable shared coordinate space and monoidal reducers, structures that relational and dimensional data afford and that arbitrary key-value, graph, or object data may not. Any extension beyond the relational frame is a research direction, not a current capability, and is deliberately kept out of this specification.
End of the Coframe Core manual.