Coframe Metric Engine: Design Document, v0.2
Per-AC multi-domain query engine with persistent memoization, built on Polars. Hosts METRIC and QM (quasi-metadata) domains in v0.1; additional domains (LINEAGE, SKETCH, ...) are forward-compatible. Subsumes what an earlier iteration called the "cache layer" into a first-class execution component. Phase 8 deliverable per the v2.1 supplement roadmap.
0. Status
- Draft v0.2, 2026-05.
- v0.2 change: broadened scope from metric-only to multi-domain (METRIC + QM). The QM domain hosts quasi-metadata as engine-managed LazyFrames; Pydantic types become projections. Eliminates the duplicated per-backend kind-hint heuristic. See §2.7, §5.5, §13.4.
- All 7 open questions from v0.1 resolved (see §15).
- Not yet implemented. Slices 2-8 follow this doc.
- The naming throughout uses "Metric Engine" rather than "cache" — see §1.5 for why. (Despite hosting QM as a co-resident domain, the public-facing name stays "Metric Engine" since metrics are the dominant use case.)
1. Purpose, scope, and the reframe from "cache"
1.1 What changed in the design
This document started as a cache-layer design. Through brainstorm, it became clear the natural shape is not a cache wrapping a backend — it is a single-column query engine with persistent memoization as a byproduct of its operation. The collapse happened in three steps:
1. The atomic logical unit of cached state is (metric_family, anchor) → values: a single column with its anchor. Wide-table caching introduces grain mismatches, duplications, and forced co-locations that single-column form avoids.
2. Once the cache is asked "can you serve this metric at this anchor (possibly via FD-DAG rollup from a finer cached entry)?", the cache is no longer passive storage; it is making semantic decisions. If it can answer yes, it might as well serve.
3. With (1) and (2), the distinction between "cache hit," "cache hit with rollup," and "fresh backend call" collapses into one per-metric serve() operation; what differs is only whether the answer is computed from materialized state or from the underlying backend. The materialization is a side effect of execution, not its precondition.
What this implies architecturally: the Metric Engine is the execution
substrate for coframe queries. The DataAPIBackend becomes a thin
single-metric-aggregate provider invoked only for misses. Frame
composition, FD-DAG rollups, post-grain operations, and cross-source
unification all move into the Metric Engine.
1.2 Why now
Three converging forces:
- Phase 7 (coframe-polars) landed. We now have a production-grade Polars-based DataAPIBackend with cross-backend invariants verified against SQLite. Polars's columnar engine + lazy evaluation + Parquet IO are exactly the substrate a single-column query engine needs; we're not starting from scratch.
- The W4 abstraction (L1/L2/L3 operator registry) is de-risked. The cross-backend invariant tests in Phase 7 establish that the planner's IR (AggregateRequest) is genuinely backend-agnostic. Inserting a Metric Engine layer between planner and backend doesn't fight any existing abstraction.
- The FD-DAG is a first-class object. Most warehouse materialized-view advisors decide what to cache from workload statistics + cost models alone. Coframe's Metric Engine reasons from workload + the AC's declared FD-DAG: knowing that revenue@(region, month) is FD-reachable from revenue@(region, day) isn't a statistical hypothesis but a declared invariant. This is a real differentiator and it deserves a first-class component to express it.
1.3 Scope (what the engine does)
The engine manages tabular AC-substrate data in two domains in v0.1:
- Metric values: per-metric, per-anchor aggregates the planner routes through the engine when serving Frame-QL queries.
- Quasi-metadata (QM): per-column profile rows + auxiliary distribution data (top-N values, etc.) the AC needs for integrity reasoning, kind-hint classification, and cost estimation.
Both domains share storage, manifest, and serve machinery; they differ in their semantics layer (FD-DAG traversal applies to metrics; QM is exact-match-or-compute). See §2.7.
For each AC, the engine:
- Answers per-dataset serve requests: serve(domain, dataset_id, anchor) → LazyFrame, returning a single-column-with-anchor result.
- Persistently memoizes serve results in a per-AC store (Polars/Parquet on disk).
- For the metric domain, walks the FD-DAG to find the cheapest serving path: exact match, rollup from a finer cached node, or backend fallback.
- For the QM domain, returns exact-match or computes via the backend's table extraction + the engine's shared profiling logic.
- Composes multi-metric query results by merging per-metric LazyFrames at a target grain (metric domain).
- Evaluates frame-expressions (derived metrics, post-grain ops: HAVING, ORDER BY, LIMIT [PER]).
- Evicts memoized entries under workload pressure (LRU + FD-DAG-aware for metrics; pure LRU for QM).
- Honors stability-window-driven invalidation (v2.1 §6).
- Surfaces materialization provenance so verification levels propagate correctly.
The Pydantic types (TableProfile, ColumnProfile, etc.) remain
as the API contract surface — they're constructed as projections
over engine-backed LazyFrames when callers want typed objects.
Storage is Polars-native; the typed view is computed on read.
1.4 Non-scope (what the engine does not do)
- No cross-AC sharing in v0.1. Each AC has its own Metric Engine instance with its own materialized store. Storage duplication for shared physical metrics is accepted; cross-AC sharing is a future optimization (see §15).
- No backend-side computation of cross-schema joins. The Metric Engine assembles cross-schema results in its Polars substrate; the backend's job is single-metric scan + aggregate.
- No automatic schema promotion. When a memoized entry has been consistently hot, the Metric Engine can surface a recommendation to promote it to a declared schema (and possibly a physical pre-aggregate), but the promotion itself is an AC-author action via the Workbench, not an engine action.
- No real-time invalidation channels. The engine relies on the stability filter's hold-off window for validity guarantees. Sources of data freshness signals beyond that (e.g., warehouse CDC streams) are out of scope for v0.1.
1.5 Naming
The choice of "Metric Engine" over "Cache" is deliberate. "Cache"
frames it as an optimization on top of something else; "Metric Engine"
names it as a first-class execution component whose persistent
materialization is a byproduct, not its purpose. The component sits
at the same architectural level as DataAPIBackend — peer, not
adapter.
- Package name: coframe-metric-engine.
- Module path: coframe.metric_engine.
- Core type: MetricEngine (per-AC instance).
2. Design principles
2.1 Per-AC identity
Each AC owns its own Metric Engine instance. The engine knows the AC's
FD-DAG, declared filter, and name_map; every memoized entry is
pre-scoped to the AC's filter. Cross-AC sharing of physical-name
caches is a future concern.
Rationale: cross-AC sharing introduces messy interactions around filter scopes, name remapping, and invalidation correlation. We pay a storage duplication cost to keep the design simple. The duplication is small relative to total compute saved; we revisit if profiling shows it bites.
2.2 Single-column-with-anchor as the atomic unit
The fundamental cache entry shape is (metric_family, anchor_tuple) →
column_of_values. Each entry maps exactly to one node in the FD-DAG.
Storage may physically group hot co-occurring entries into shared
Parquet files for IO efficiency, but the logical contract stays
atomic.
Implications:
- Adding/removing metrics from materialization doesn't require schema-wide updates.
- Incremental population (one new metric at a time) is trivial.
- Cache reachability is FD-DAG graph reachability — clean algebra.
- Sketch reducers (HLL, t-digest) fit naturally as single-column entries with sketch-state values.
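The reachability claim can be illustrated with a small graph search. This is a minimal sketch in plain Python, not the engine's implementation: the `FD_EDGES` table, `dimension_reaches`, and `anchor_reaches` are hypothetical names, and the rule that an anchor serves a coarser anchor when every coarser dimension is determined by some finer dimension is an assumption about how the FD-DAG is read.

```python
from collections import deque

# Hypothetical FD edges: each dimension maps to the coarser dimensions it determines.
FD_EDGES = {
    "day": {"month"},
    "month": {"year"},
    "store": {"region"},
}

def dimension_reaches(fine: str, coarse: str) -> bool:
    """True if `coarse` is determined by `fine` (reflexive and transitive)."""
    seen, queue = {fine}, deque([fine])
    while queue:
        dim = queue.popleft()
        if dim == coarse:
            return True
        for nxt in FD_EDGES.get(dim, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

def anchor_reaches(finer: tuple[str, ...], coarser: tuple[str, ...]) -> bool:
    """True if every dimension of `coarser` is determined by some dimension of `finer`."""
    return all(any(dimension_reaches(f, c) for f in finer) for c in coarser)

# revenue@(region, day) can serve revenue@(region, month), but not the reverse.
day_serves_month = anchor_reaches(("region", "day"), ("region", "month"))
month_serves_day = anchor_reaches(("region", "month"), ("region", "day"))
```

Whether a reachable entry may actually be rolled up still depends on the operator, which the soundness rules in §5.3 cover.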
2.3 Materialization is a byproduct, not a precondition
serve() always returns a LazyFrame. Whether it came from a
materialized entry, a rollup over a finer materialized entry, or a
fresh backend call is opaque to the caller. The decision of whether
to materialize a serve result for future reuse is internal policy
(§9), not part of the contract.
2.4 The Metric Engine is the execution substrate
Frame composition, FD-DAG rollups, post-grain operations, and
cross-source unification all live in the engine's Polars substrate.
Backends become thin single-metric providers; today's
coframe-resolution execution layer (merge_blocks, post_grain_ops)
collapses into the engine.
This is the largest architectural shift in the design. It is justified by:
- Polars's LazyFrame composition gives compounding query planning for free (column pruning + predicate pushdown across the whole chain of cache reads + fresh-result merges + post-grain ops).
- The cross-backend invariant becomes stronger: composition runs in one engine (Polars), not in Python lists with engine-dependent semantics.
- Backends get simpler — including hypothetical join-less ones (the Metric Engine handles all multi-source assembly).
2.5 FD-DAG is an input to the engine, not just to the planner
Most warehouse cache advisors reason from workload statistics + cost models alone. The Metric Engine additionally uses the AC's declared FD-DAG to know:
- Which rollup paths are sound (exact, not heuristic).
- Which operators are partition-invariant (cached value can be further rolled up).
- Which materialized entries serve which broader request sets.
This gives the engine principled, not heuristic, decisions about serving paths and memoization priorities.
2.6 Honor the stability filter for invariants
The v2.1 §6 stability filter defines a clean cache-validity window:
data older than hold_off_days is provably stable and won't be
modified. Cached aggregates over stable data don't expire from data
change; they expire only when the stability window rolls forward.
This is a stronger guarantee than typical OLTP caches enjoy — we don't need invalidation channels or version vectors. A daily cron at the stability boundary is sufficient.
2.7 Multi-domain substrate
The engine's storage layer is domain-agnostic. Every entry is
keyed by (domain, dataset_id, anchor_signature); the same
Polars+Parquet storage + SQLite manifest serves all domains. The
serve API takes a domain parameter; semantics layered above the
storage layer specialize per domain.
v0.1 domains:
| Domain | dataset_id | Anchor shape | Semantics |
|---|---|---|---|
| METRIC | metric family name (e.g., revenue) | grain tuple (e.g., (region, day)) | FD-DAG rollup, partition-invariance, compose-to-Frame |
| QM | profile-kind id (e.g., column_profile, top_values) | (schema, column) or finer | exact-match-or-compute; no FD-DAG; Pydantic projections on read |
Why this matters: quasi-metadata is naturally tabular and benefits from the same columnar/lazy/Parquet substrate the metric domain needs. Today QM compute is duplicated across backends (SQLite via SQL aggregates + Python lists, Polars via Series ops); both implementations re-derive the same kind-hint heuristic. Hosting QM in the engine eliminates the duplication: backends extract tables to Polars LazyFrames; the engine's shared profiling logic runs once.
The Pydantic types stay (TableProfile, ColumnProfile,
NumericStats, …) — but they shift from being the storage medium
to being projections materialized from engine-backed LazyFrames
when callers want typed objects. Best of both: typed contracts at
the API surface, columnar storage underneath.
Future domains worth considering (not in v0.1):
- LINEAGE: per-column lineage edges (today in Python dicts).
- INTEGRITY_RESULTS: cached attestation results (when refreshed against stable data, valid until stability rolls forward).
- SKETCH: per-anchor HLL / t-digest sketches as first-class cached objects.
These are explicitly future scope; v0.1 ships METRIC + QM only.
3. Architectural shape
3.1 Where the Metric Engine sits
surface ──→ coframe-resolution (parse, resolve, plan)
│
▼
list[(metric_family, anchor, mvt, filter)]
│
▼
┌─────────────────────────────────────────────┐
│ Metric Engine (per-AC) │
│ │
│ serve(metric_family, anchor) │
│ ├─ exact materialized? → return │
│ ├─ finer cached → roll up → return │
│ └─ neither → backend.aggregate (single) │
│ → memoize (per §9) → return│
│ │
│ compose(entries, target_grain) → Frame │
│ ├─ merge per-metric LazyFrames on anchor│
│ ├─ apply frame-expression (derived) │
│ └─ post-grain ops (HAVING/ORDER/LIMIT) │
└────────────────────┬────────────────────────┘
│
▼ (misses only)
┌──────────────────────┐
│ DataAPIBackend │
│ (single-metric calls)│
└──────────────────────┘
3.2 Lifecycle
The Metric Engine instance is created at AC COMMIT time (per v2.1
§5.2). The AC's structure is frozen at COMMIT, so the FD-DAG, filter,
and name_map the engine relies on are stable for the engine's
lifetime. Forking an AC creates a new engine instance with the
forked AC's structure; the old engine continues to serve the
unforked AC.
- Lazy-load mode: engine is constructed in-process on first query for that AC. Materialized store is read from disk if it exists.
- Pre-load mode: a runtime/workbench bootstrap walks the installation's ACs and instantiates engines eagerly. Useful for cold-start latency reduction in long-running runtimes.
3.3 Process model (v0.1)
v0.1 is shared-process within a single Python process that owns
the AC. All serve() and compose() calls — whether issued from
FastAPI request handlers, NLQ-generated queries, workbench ops, or
direct programmatic use — share the same in-process engine instance.
In-process thread safety is provided by a simple threading.Lock
around manifest mutations; readers and writers do not need to
coordinate via the filesystem.
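The lock-around-mutations idea can be sketched as follows. This is an illustration only: the `Manifest` class here is a toy in-memory stand-in (the real manifest is SQLite-backed), and its method names are assumptions.

```python
import threading

class Manifest:
    """Toy in-memory manifest; hypothetical stand-in for the SQLite-backed one."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._access_count: dict[str, int] = {}

    def record_access(self, key: str) -> None:
        # The read-modify-write runs under the lock so concurrent handlers
        # cannot lose updates.
        with self._lock:
            self._access_count[key] = self._access_count.get(key, 0) + 1

    def access_count(self, key: str) -> int:
        with self._lock:
            return self._access_count.get(key, 0)

manifest = Manifest()

def worker() -> None:
    for _ in range(1000):
        manifest.record_access("revenue@region")

# Eight threads stand in for concurrent request handlers sharing one engine.
threads = [threading.Thread(target=worker) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

A single coarse lock keeps the sketch simple; it matches the document's "simple threading.Lock" wording but says nothing about how finely the engine would actually scope it.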
Single-process assumption. v0.1 assumes exactly one Python
process owns the engine for any given AC. Multi-worker deployments
(e.g., uvicorn --workers N over the same installation) should
route each AC to a single worker, or run with a single worker per
installation. The engine does not attempt cross-process coordination.
Coframe Pro will address multi-process and distributed deployments — cross-worker engine coordination, optional dedicated engine processes (Redis-style), and possibly distributed cache fronts. These are out of scope for v0.1.
The simplification is real: no lock file, no manifest revalidation on read, no version vectors. SQLite's own concurrency handles incidental cross-process reads safely; we just don't depend on it.
3.4 On-disk layout
<installation>/.coframe/metric_engine/<ac_name>/
├── manifest.sqlite # entry catalog + metadata
├── lock # writer lock file
└── data/
├── <metric_family_a>/
│ ├── anchor=region/
│ │ └── part-000.parquet
│ ├── anchor=region,day/
│ │ └── part-000.parquet
│ └── anchor=region,month/
│ └── part-000.parquet
└── <metric_family_b>/
└── ...
- One Parquet directory per (metric_family, anchor_signature), partitioned by anchor (so Polars can scan-by-grain selectively).
- Manifest is SQLite (small, transactional, well-understood). One row per cache entry; see §6 for the schema.
- The .coframe/metric_engine/ directory sits alongside the workbench's .coframe/session.json, scoped per installation.
4. The serve() API
4.1 Signature
class MetricEngine:
def serve(
self,
domain: Domain, # METRIC | QM
dataset_id: str, # metric_family | qm-kind id
anchor: tuple[str, ...],
*,
mvt: MissingValueTreatment | None = None, # metric domain only
backend: DataAPIBackend,
) -> pl.LazyFrame:
"""Return a single-column LazyFrame: anchor cols + the data column.
Behavior is domain-specific:
METRIC domain:
1. Materialized entry at exact (dataset_id, anchor): scan + return.
2. Materialized entry at a finer FD-DAG node serving this
family + reachable to this anchor via rollup: scan + roll
up via the operator's partition-invariant rule + return.
3. No usable materialized entry: invoke backend.aggregate for
a single-metric request; memoize per §9; return.
QM domain:
1. Materialized entry at exact (dataset_id, anchor): scan + return.
2. No usable materialized entry: backend extracts the relevant
table to a LazyFrame; engine's shared profiling logic
computes the QM entry; memoize per §9; return.
(No FD-DAG rollup for QM — anchors are (schema, column) or
finer, with no algebra above them.)
"""
4.2 Inputs
- domain: METRIC or QM. Selects the per-domain semantics layer.
- dataset_id: domain-specific identifier.
  - METRIC: the AC-logical metric family name (e.g., revenue).
  - QM: the profile-kind id (e.g., column_profile, top_values).
- anchor: tuple of AC-dimension names (METRIC) or (schema_name, column_name[, value]) (QM) defining the grain.
- mvt: optional missing-value treatment override (METRIC only).
- backend: the AC's bound DataAPIBackend (needed only for misses).
4.3 Outputs
A Polars LazyFrame with columns [anchor_col_1, ..., anchor_col_n,
<dataset_id_column>]. The frame is lazy; the caller (typically
compose() for METRIC, or a Pydantic-projection helper for QM)
determines when to collect.
4.4 Why backend is a per-call argument, not engine-state
The Metric Engine instance is per-AC. The AC's backend binding is known at engine-creation time and could be stored as engine state. But passing it per-call keeps the engine pure with respect to data source: it can be reused if the AC's backend binding changes (e.g., testing with a stub backend) without re-instantiating the engine. The materialized store is independent of which backend produced the contents; provenance metadata records the source.
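A toy version of this contract shows why the per-call backend is convenient for testing. The sketch below is an assumption-laden illustration: `TinyEngine` covers only the exact-hit and miss branches (no rollup, no QM), returns plain dicts instead of LazyFrames, and the `Backend` callable shape is hypothetical rather than the DataAPIBackend protocol.

```python
from typing import Callable

Anchor = tuple[str, ...]
Rows = dict[tuple, float]                 # anchor cell -> value
Backend = Callable[[str, Anchor], Rows]   # hypothetical single-metric provider

class TinyEngine:
    """Toy stand-in for MetricEngine: exact-match-or-backend only."""

    def __init__(self) -> None:
        self._store: dict[tuple[str, Anchor], Rows] = {}

    def serve(self, metric_family: str, anchor: Anchor, *, backend: Backend) -> Rows:
        key = (metric_family, anchor)
        if key not in self._store:
            # Miss: call the backend once, then memoize as a side effect.
            self._store[key] = backend(metric_family, anchor)
        return self._store[key]

calls: list[tuple[str, Anchor]] = []

def stub_backend(metric_family: str, anchor: Anchor) -> Rows:
    """Stub used in place of a real backend; records each invocation."""
    calls.append((metric_family, anchor))
    return {("EU",): 10.0, ("US",): 32.0}

engine = TinyEngine()
first = engine.serve("revenue", ("region",), backend=stub_backend)
second = engine.serve("revenue", ("region",), backend=stub_backend)
```

The second call is answered from the memoized entry, so the stub is invoked only once; the caller cannot tell the two paths apart from the return value.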
5. FD-DAG traversal and ServingPath
5.1 ServingPath
@dataclass(frozen=True)
class ServingPath:
"""The plan for serving a metric request from materialized state.
Captures the chain: which materialized entry to start from + how
to roll up + which operator + which intermediate anchors.
"""
source_entry: MetricEntry # The materialized entry to scan
rollup_steps: tuple[FDStep, ...] # FD-DAG edges to traverse
operator: str # Reducer used at each step
target_anchor: tuple[str, ...] # Final grain
5.2 Path-selection algorithm
Given (metric_family, target_anchor):
1. Look up exact match. If (metric_family, target_anchor) is in the manifest, return ServingPath(source_entry=that_entry, rollup_steps=()).
2. Look up finer cached nodes. Enumerate all materialized entries for the same metric_family at anchors finer than target_anchor per the FD-DAG. Filter to those whose operator is partition_invariant (only those can be rolled up further).
3. Pick the cheapest path. For each candidate finer node, compute the FD-DAG path to target_anchor. Cost heuristic for v0.1: row count of the candidate node (fewer rows are cheaper to scan). Return the minimum-cost path.
4. No materialized node serves. Return None; caller falls back to backend.
In v0.1 we use the row-count heuristic. A future revision can add column-cardinality estimates, IO cost models, etc.
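The four steps can be sketched in a few lines. This is a simplified illustration under stated assumptions: `Entry` is a hypothetical reduction of a manifest row, `is_finer` is an injected predicate standing in for real FD-DAG traversal, and the function returns the source entry rather than a full ServingPath.

```python
from dataclasses import dataclass
from typing import Callable, Optional

Anchor = tuple[str, ...]

@dataclass(frozen=True)
class Entry:
    """Hypothetical manifest row, reduced to what path selection needs."""
    metric_family: str
    anchor: Anchor
    row_count: int
    partition_invariant: bool

def select_source(
    entries: list[Entry],
    metric_family: str,
    target: Anchor,
    is_finer: Callable[[Anchor, Anchor], bool],
) -> Optional[Entry]:
    """Return the entry to scan, or None when the caller must hit the backend."""
    same_family = [e for e in entries if e.metric_family == metric_family]
    # Step 1: an exact match wins outright.
    for e in same_family:
        if e.anchor == target:
            return e
    # Step 2: finer entries whose operator can be rolled up further.
    candidates = [
        e for e in same_family
        if e.partition_invariant and is_finer(e.anchor, target)
    ]
    # Steps 3 and 4: cheapest candidate by row count, else None.
    return min(candidates, key=lambda e: e.row_count, default=None)

# Toy FD knowledge: these (finer, coarser) anchor pairs are declared reachable.
FINER = {
    (("region", "day"), ("region",)),
    (("region", "month"), ("region",)),
}

def is_finer(a: Anchor, b: Anchor) -> bool:
    return (a, b) in FINER

entries = [
    Entry("revenue", ("region", "day"), 3650, True),
    Entry("revenue", ("region", "month"), 120, True),
]
chosen = select_source(entries, "revenue", ("region",), is_finer)
```

Here the month-level entry is chosen over the day-level one because it has fewer rows to scan.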
5.3 Soundness
The FD-DAG declares which rollups are semantically valid. Combined
with the operator's partition_invariant flag, the engine knows
which rollups are also arithmetically sound:
- SUM, COUNT, MIN, MAX, BOOL_AND, BOOL_OR: partition-invariant; the cached value can be rolled up further via the same operator.
- AVG, MEDIAN, COUNT_DISTINCT: not partition-invariant; the cached entry is usable at its own anchor but not further-rollable.
The partition_invariant flag is part of the Operator declaration
in coframe.operators (already exists, used by W4).
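A small worked example makes the distinction concrete. The rows and the `group` helper below are made up for illustration, and plain Python stands in for the Polars expressions the engine would use.

```python
from collections import defaultdict
from statistics import mean

# (region, day, revenue) source rows; values are invented for the example.
rows = [
    ("EU", "2026-05-01", 10.0),
    ("EU", "2026-05-01", 20.0),
    ("EU", "2026-05-02", 60.0),
]

def group(rows, key, reducer):
    """Group rows by key(row) and reduce the last field of each row."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[key(row)].append(row[-1])
    return {k: reducer(v) for k, v in buckets.items()}

# Finer grain: (region, day).
sum_by_day = group(rows, lambda r: (r[0], r[1]), sum)
avg_by_day = group(rows, lambda r: (r[0], r[1]), mean)

# Roll the finer results up to (region,) with the same operator.
sum_rolled = group([(k[0], v) for k, v in sum_by_day.items()], lambda r: (r[0],), sum)
avg_rolled = group([(k[0], v) for k, v in avg_by_day.items()], lambda r: (r[0],), mean)

# Ground truth computed directly from the source rows.
sum_direct = group(rows, lambda r: (r[0],), sum)
avg_direct = group(rows, lambda r: (r[0],), mean)
```

The rolled-up SUM matches the direct SUM (90.0), while the average of the daily averages is 37.5 against a true average of 30.0, which is why AVG entries are served only at their own anchor.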
5.4 Missing-value treatment under rollup
Rolling up a propagate-treated metric: the cached entry already
carries NULL for any group that contained a NULL; a further SUM rollup
preserves that NULL as long as the rollup sum does not skip NULLs
(NULL + anything = NULL). The engine respects this; the rollup
expression doesn't need to re-apply the propagate guard.
Rolling up a skip-treated metric: native Polars SUM continues to
skip NULL; the rollup behaves as expected.
Rolling up an impute-treated metric: the imputation has already
substituted at the source-row level; the rollup is over already-imputed
values. No re-imputation needed.
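The propagate and skip cases can be checked with a toy model in which None plays the role of NULL. The two reducers below are illustrative helpers, not engine functions, and the choice that an all-NULL group sums to 0.0 under skip is an assumption of this sketch.

```python
from typing import Optional

def sum_propagate(values: list[Optional[float]]) -> Optional[float]:
    """SUM that yields NULL (None) as soon as any input is NULL."""
    if any(v is None for v in values):
        return None
    return sum(values)

def sum_skip(values: list[Optional[float]]) -> float:
    """SUM that ignores NULLs (an all-NULL group sums to 0.0 in this sketch)."""
    return sum(v for v in values if v is not None)

# Source rows for one month, split across two days; d2 contains a NULL.
source = {"d1": [2.0, 3.0], "d2": [None, 4.0]}

# Day-level entries as they would be cached under each treatment.
cached_propagate = {day: sum_propagate(vals) for day, vals in source.items()}
cached_skip = {day: sum_skip(vals) for day, vals in source.items()}

# Month-level rollup computed from the cached day-level entries.
month_propagate = sum_propagate(list(cached_propagate.values()))
month_skip = sum_skip(list(cached_skip.values()))

# Month-level values computed directly from the source rows, for comparison.
all_rows = [v for vals in source.values() for v in vals]
direct_propagate = sum_propagate(all_rows)
direct_skip = sum_skip(all_rows)
```

In both treatments the rollup over cached entries agrees with the direct computation: NULL under propagate, 9.0 under skip.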
5.5 QM-domain serve semantics
QM serve is intentionally simpler than METRIC serve — no FD-DAG traversal, no rollup algebra:
- Exact-match-or-compute. The engine looks up (QM, dataset_id, anchor) in the manifest. If present, scan + return. If not, compute via the backend's extract-to-LazyFrame path, then memoize + return.
- No rollup. QM data doesn't compose along an FD-DAG. A column profile at (stores, region) doesn't roll up into anything; it's a leaf observation about the data.
- Anchor shapes per QM kind:
  - column_profile: anchor = (schema_name, column_name). The single row of per-column stats lives here.
  - top_values: anchor = (schema_name, column_name, value). Each anchor cell carries one (value, count) pair; query top-N by sorting + limiting at read time.
  - histogram: anchor = (schema_name, column_name, bin_index). Each cell carries one bin's range + count.
  - Future kinds add their own anchor shape.
- Shared compute logic. The profiling algorithm (cardinality classification, kind-hint heuristic, per-kind stats blocks) lives in the engine as Polars expressions. Backends provide table-extraction; the engine runs the heuristic.
- Pydantic projections. ColumnProfile, TableProfile, etc. become helper functions that read the relevant engine entries and assemble the typed view: engine.column_profile(schema, column) → ColumnProfile. The Pydantic object is constructed from a fresh LazyFrame collect; the storage is the LazyFrame.
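The top_values anchor shape implies that top-N is a read-time operation over stored cells. A minimal sketch, assuming the entry is modeled as a plain dict keyed by (schema_name, column_name, value); the `top_n` helper and the tie-break on value are hypothetical choices, not part of the design.

```python
# Hypothetical top_values cells: (schema_name, column_name, value) -> count.
top_values_entry = {
    ("stores", "region", "EU"): 40,
    ("stores", "region", "US"): 75,
    ("stores", "region", "APAC"): 12,
    ("stores", "city", "Paris"): 9,
}

def top_n(entry, schema: str, column: str, n: int):
    """Return the n most frequent (value, count) pairs for one column."""
    cells = [
        (value, count)
        for (s, c, value), count in entry.items()
        if (s, c) == (schema, column)
    ]
    # Sort by count descending, then by value so ties are deterministic.
    cells.sort(key=lambda vc: (-vc[1], vc[0]))
    return cells[:n]

top_two_regions = top_n(top_values_entry, "stores", "region", 2)
```

Because each cell is stored individually, the same entry answers top-2, top-10, or a full distribution without recomputation.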
6. Manifest schema
The SQLite manifest stores one row per materialized entry plus metadata for the engine itself.
CREATE TABLE entries (
id INTEGER PRIMARY KEY AUTOINCREMENT,
domain TEXT NOT NULL, -- 'METRIC' | 'QM'
dataset_id TEXT NOT NULL, -- metric_family | qm-kind id
anchor_signature TEXT NOT NULL, -- canonical sorted-tuple repr
parquet_path TEXT NOT NULL,
-- Metric-domain fields (NULL for QM):
operator TEXT, -- the reducer used
partition_invariant BOOLEAN,
mvt TEXT, -- skip|propagate|impute
imputation_value TEXT, -- JSON literal (when mvt=impute)
-- Shared:
source_schemas TEXT NOT NULL, -- JSON list of source schema names
source_filter TEXT, -- JSON: the filter applied
row_count INTEGER NOT NULL,
byte_size INTEGER NOT NULL,
materialized_at TIMESTAMP NOT NULL,
stability_cutoff TIMESTAMP, -- the hold-off cutoff at materialize time
last_access_at TIMESTAMP NOT NULL,
access_count INTEGER NOT NULL DEFAULT 0,
verification_level TEXT, -- A|AA|AAA, inherited from source
UNIQUE(domain, dataset_id, anchor_signature)
);
CREATE INDEX entries_by_domain_dataset ON entries(domain, dataset_id);
CREATE INDEX entries_by_access ON entries(last_access_at);
CREATE TABLE engine_meta (
key TEXT PRIMARY KEY,
value TEXT NOT NULL
);
-- engine_meta rows: ac_name, ac_fingerprint, schema_version, etc.
The manifest is small (one row per entry) and SQLite gives us transactional updates, multi-reader concurrency, and a familiar operational story.
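A reduced version of the entries table can be exercised with the standard-library sqlite3 module. The column subset, the `anchor_signature` helper (sorted names joined by commas, one reading of "canonical sorted-tuple repr"), and the `register`/`lookup` helpers are assumptions made for this sketch.

```python
import sqlite3

def anchor_signature(anchor: tuple[str, ...]) -> str:
    """Assumed canonical form: dimension names sorted and comma-joined."""
    return ",".join(sorted(anchor))

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE entries (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        domain TEXT NOT NULL,
        dataset_id TEXT NOT NULL,
        anchor_signature TEXT NOT NULL,
        parquet_path TEXT NOT NULL,
        row_count INTEGER NOT NULL,
        UNIQUE(domain, dataset_id, anchor_signature)
    )
    """
)

def register(domain, dataset_id, anchor, parquet_path, row_count) -> None:
    # `with conn` commits on success and rolls back if the insert fails.
    with conn:
        conn.execute(
            "INSERT INTO entries "
            "(domain, dataset_id, anchor_signature, parquet_path, row_count) "
            "VALUES (?, ?, ?, ?, ?)",
            (domain, dataset_id, anchor_signature(anchor), parquet_path, row_count),
        )

def lookup(domain, dataset_id, anchor):
    """Return (parquet_path, row_count) for an exact match, else None."""
    return conn.execute(
        "SELECT parquet_path, row_count FROM entries "
        "WHERE domain = ? AND dataset_id = ? AND anchor_signature = ?",
        (domain, dataset_id, anchor_signature(anchor)),
    ).fetchone()

PATH = "data/revenue/anchor=region,day/part-000.parquet"
register("METRIC", "revenue", ("region", "day"), PATH, 3650)
```

With a sorted signature, lookups are insensitive to the order in which a caller lists the anchor dimensions, and the UNIQUE constraint rejects a second entry for the same key.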
7. Backend interaction
7.1 Single-metric calls
When serve() falls through to the backend, it constructs an
AggregateRequest with exactly one AggregateMetric. The existing
backend protocol handles this without modification.
Cross-schema queries previously fanned out into multiple
AggregateRequests in the planner's merge_blocks path. With the
Metric Engine, this fan-out shifts to per-metric serve() calls; the
backend never sees a multi-metric, cross-schema request — each call is
narrow.
7.2 Backend batching (deferred)
A future optimization: when multiple cache misses in a single
composition target the same source schema, the Metric Engine could
batch them into one multi-metric AggregateRequest for IO efficiency.
This preserves the per-metric serve() API; batching is an engine
internal optimization.
In v0.1 we issue one aggregate() call per miss. The simplicity is
worth more than the IO savings for the initial release.
7.3 Join elimination at the backend
Because the Metric Engine assembles cross-schema results in Polars,
the planner no longer needs to emit AggregateJoin entries for
dim-table navigation. Each backend call asks for metric@anchor
where the metric and the anchor's dimensions live on the same physical
table — or, when they don't, the engine first calls the backend for
the metric at its native anchor, then joins to a dim-table entry the
engine has separately memoized.
This realizes the join-less-backend hypothetical we discussed during
Phase 7 — without changing the DataAPIBackend protocol, backends
can now ignore joins entirely (the engine never sends them).
8. Composition
8.1 compose() API
def compose(
self,
entries: list[pl.LazyFrame],
target_grain: tuple[str, ...],
frame_expression: FrameExpression | None = None,
post_grain_ops: PostGrainOps | None = None,
) -> Frame:
"""Combine multiple single-column LazyFrames into a Frame.
Each input LazyFrame has the shape [anchor_cols..., metric].
The output Frame has [grain_cols..., metrics...] after merge,
frame_expression evaluation, and post-grain ops.
"""
8.2 Steps
1. Grain reconciliation: each input LazyFrame is at some anchor; if any are finer than target_grain, roll them up via the engine's FD-DAG traversal (re-using the same ServingPath machinery).
2. Join on target_grain: outer-join all LazyFrames on the grain columns. Polars's lazy join + the engine's columnar layout make this cheap.
3. Evaluate frame_expression: derived metrics (e.g., profit = revenue - cost) are Polars expressions over the joined frame.
4. Post-grain ops: HAVING (filter), ORDER BY (sort), LIMIT [PER] (top-N per group). Applied in declared order.
5. Materialize the final result into a Frame (the coframe-resolution output type) and return.
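Steps 2 through 4 can be sketched without Polars to show the data flow. Everything here is a simplified stand-in: inputs are dicts already at the target grain (so step 1 is skipped), the `compose` signature differs from the engine's, LIMIT PER is omitted, and the post-grain ops run in a fixed HAVING, ORDER BY, LIMIT order.

```python
from typing import Callable, Optional

Row = dict[str, object]

def compose(
    metrics: dict[str, dict[tuple, float]],
    grain: tuple[str, ...],
    derived: Optional[dict[str, Callable[[Row], object]]] = None,
    having: Optional[Callable[[Row], bool]] = None,
    order_by: Optional[str] = None,
    descending: bool = False,
    limit: Optional[int] = None,
) -> list[Row]:
    # Outer join: the key set is the union of every metric's anchor cells.
    keys = sorted(set().union(*(m.keys() for m in metrics.values())))
    rows: list[Row] = []
    for key in keys:
        row: Row = dict(zip(grain, key))
        for name, cells in metrics.items():
            row[name] = cells.get(key)  # None where a metric has no cell
        rows.append(row)
    # Derived metrics are evaluated over the joined rows.
    for name, fn in (derived or {}).items():
        for row in rows:
            row[name] = fn(row)
    # Post-grain ops: HAVING, then ORDER BY, then LIMIT.
    if having is not None:
        rows = [r for r in rows if having(r)]
    if order_by is not None:
        rows.sort(key=lambda r: r[order_by], reverse=descending)
    return rows if limit is None else rows[:limit]

revenue = {("EU",): 100.0, ("US",): 250.0, ("APAC",): 40.0}
cost = {("EU",): 70.0, ("US",): 90.0}

def profit(row: Row):
    if row["revenue"] is None or row["cost"] is None:
        return None
    return row["revenue"] - row["cost"]

joined = compose({"revenue": revenue, "cost": cost}, ("region",))
top = compose(
    {"revenue": revenue, "cost": cost},
    ("region",),
    derived={"profit": profit},
    having=lambda r: r["profit"] is not None and r["profit"] > 0,
    order_by="profit",
    descending=True,
    limit=1,
)
```

The outer join keeps APAC with a missing cost; the HAVING predicate then drops it before sorting, so the single returned row is the US one.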
8.3 What this replaces
- Today's merge_blocks in coframe-resolution/execution.py ➝ becomes a thin shim that calls compose().
- Today's post_grain_ops in coframe-resolution/execution.py ➝ moves into compose() step 4.
- The post-aggregation Frame transformations stay conceptually the same; the execution shifts from Python list-of-lists to Polars LazyFrame.
8.4 Backwards compatibility
execute_query(ac, backend, src) continues to return the same Frame
type. The engine is wired between the resolver and the surface; the
public API surface is unchanged.
9. Memoization policy (the hybrid lazy + push-opt-in)
9.1 Lazy by default
Every backend call's result is memoized by default. Cold start has no materialized entries; the engine populates as queries arrive.
9.2 Push opt-in via cache_hint
The AC can declare hot grains to pre-materialize:
metric_families:
- name: revenue
family_root: {schema: transactions, column: revenue}
ip_reducers:
- operator: SUM
a_block: [time]
cache_hint:
materialize_at:
- [region]
- [region, day]
- [region, month]
A new lifecycle step at AC COMMIT walks every cache_hint and
schedules pre-materialization (either eager at COMMIT, or deferred to
a "warmup" cron). The pre-materialization invokes the same
serve() machinery; the only difference is when it runs (not at
first query but at COMMIT).
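The COMMIT-time walk can be sketched as a generator over the parsed AC spec. The dict below mirrors the YAML above after parsing; the `warmup_requests` helper and the second family without a hint are illustrative assumptions.

```python
# Parsed form of the AC spec (only the fields the walk needs are shown).
ac_spec = {
    "metric_families": [
        {
            "name": "revenue",
            "cache_hint": {
                "materialize_at": [["region"], ["region", "day"], ["region", "month"]],
            },
        },
        {"name": "cost"},  # no cache_hint: populated lazily on first query
    ]
}

def warmup_requests(spec: dict):
    """Yield (metric_family, anchor) pairs to pre-materialize at COMMIT."""
    for family in spec.get("metric_families", []):
        hint = family.get("cache_hint", {})
        for anchor in hint.get("materialize_at", []):
            yield family["name"], tuple(anchor)

requests = list(warmup_requests(ac_spec))
```

Each yielded pair would then be handed to the same serve() path a query uses, either immediately or from a deferred warmup job.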
9.3 Promotion recommendations
The Metric Engine tracks per-entry access frequency. When an entry has
sustained hot access over a configurable window (e.g., 7 days, 100+
hits), it surfaces a recommendation: "consider promoting this entry to
a declared schema." The promotion itself is an AC-author action
through the Workbench (not an engine action) — the engine emits the
SchemaSpec YAML stanza the author can paste.
This gives the progression:
- Lazy memoization catches unknown workload patterns.
- Push opt-in catches known-hot patterns the AC author can declare in advance.
- Promotion catches steady-state patterns the engine identifies from observed workload, lifting them out of the engine into the AC's declared schema set (and optionally to a physical pre-aggregate table).
9.4 What never gets memoized
- Results of singleton lookups (single anchor cell, single value): not worth the memoization overhead.
- Results explicitly tagged no_cache by the query (a future Frame-QL hint; v0.1 just memoizes everything).
- Results whose source schema is in SELECTING phase (not yet COMMITTED): the AC isn't stable, so memoizing would risk staleness.
10. Eviction
10.1 LRU baseline
Manifest tracks last_access_at per entry. When the engine's
materialized store exceeds a configurable byte budget, evict
least-recently-accessed entries.
10.2 FD-DAG-aware enhancement
Pure LRU can evict a node that's the only reachability path to many ancestor grains, defeating future serves at those grains. The engine's eviction scorer biases against evicting such nodes by weighting each entry's downstream_fanout: the number of FD-DAG ancestor nodes that would lose their cheapest serving path if this entry were evicted.
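One possible shape for such a scorer is sketched below. The formula (idle time divided by one plus weighted fanout), the `Candidate` fields, and `pick_victims` are assumptions chosen for illustration, not the engine's actual scoring rule.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Candidate:
    name: str
    idle_seconds: float      # now - last_access_at
    byte_size: int
    downstream_fanout: int   # ancestors that would lose their cheapest path

def eviction_score(c: Candidate, fanout_weight: float = 1.0) -> float:
    """Higher score means evict sooner. Hypothetical: LRU age damped by fanout."""
    return c.idle_seconds / (1.0 + fanout_weight * c.downstream_fanout)

def pick_victims(candidates: list[Candidate], bytes_to_free: int) -> list[str]:
    """Evict highest-scoring entries until enough bytes are freed."""
    victims, freed = [], 0
    for c in sorted(candidates, key=eviction_score, reverse=True):
        if freed >= bytes_to_free:
            break
        victims.append(c.name)
        freed += c.byte_size
    return victims

candidates = [
    Candidate("revenue@region,day", 1000.0, 500, 3),
    Candidate("revenue@region,month", 600.0, 50, 0),
    Candidate("revenue@region", 100.0, 10, 0),
]
victims = pick_victims(candidates, bytes_to_free=50)
```

Pure LRU would evict the day-level entry first because it has been idle longest; with the fanout bias its score drops to 250 and the month-level entry (score 600) goes instead, preserving the rollup source.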
10.3 Bytes budget
- Default: 1 GB per AC.
- Configurable via installation.yaml.
11. Stability-window invalidation
11.1 Per-entry stability cutoff
Each entry records the stability_cutoff it was materialized over —
i.e., the date cutoff applied to the source data at materialization
time. As long as the engine's current effective cutoff is ≤ this
value, the entry is valid.
11.2 Rolling forward
When the engine's effective cutoff rolls forward (typically daily, at the stability boundary), entries materialized over older cutoffs become stale: they're missing the newly-stable rows that were previously in the hold-off window.
Two strategies:
- Invalidate: drop stale entries from the manifest; next serve repopulates with the fresh cutoff.
- Append-and-merge: if the operator is partition-invariant, compute the incremental aggregate over the newly-stable rows only and merge with the stale entry. Faster than full recompute when the increment is small.
v0.1 implements invalidate-only. Append-and-merge is a future optimization that requires source-table partition awareness on the backend side.
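The invalidate-only strategy reduces to a date comparison per entry. In this sketch the cutoff is assumed to be today minus hold_off_days, the manifest is modeled as a dict from entry name to stability_cutoff, and the helper names are hypothetical.

```python
from datetime import date, timedelta

def effective_cutoff(today: date, hold_off_days: int) -> date:
    """Assumed definition: data on or before this date is considered stable."""
    return today - timedelta(days=hold_off_days)

def is_valid(entry_cutoff: date, current_cutoff: date) -> bool:
    # An entry stays valid while the current cutoff has not moved past
    # the cutoff it was materialized over.
    return current_cutoff <= entry_cutoff

def invalidate_stale(manifest: dict[str, date], current_cutoff: date) -> list[str]:
    """Drop stale entries from the manifest and return their names."""
    stale = [name for name, cutoff in manifest.items()
             if not is_valid(cutoff, current_cutoff)]
    for name in stale:
        del manifest[name]
    return stale

manifest = {
    "revenue@region": date(2026, 5, 1),
    "revenue@region,day": date(2026, 5, 2),
}
current = effective_cutoff(date(2026, 5, 5), hold_off_days=3)
dropped = invalidate_stale(manifest, current)
```

After the cutoff rolls forward to 2026-05-02, the entry materialized over 2026-05-01 is dropped and will be repopulated by the next serve.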
11.3 Cron / on-demand
The roll-forward check runs at engine serve() time (cheap: compare
two timestamps) and is also exposed as an explicit engine.refresh()
call for batch operations.
12. Verification level interaction
12.1 Inheritance
Each materialized entry records the verification level (A / AA / AAA) of its source schema(s). The entry inherits the minimum level of its sources — a cached entry can be no better-verified than its weakest input.
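The minimum-level rule is a one-liner once the levels are ranked. The ordering A < AA < AAA and the `inherited_level` helper are assumptions of this sketch.

```python
# Assumed ordering of verification levels, weakest to strongest.
LEVEL_RANK = {"A": 1, "AA": 2, "AAA": 3}

def inherited_level(source_levels: list[str]) -> str:
    """Return the weakest level among the entry's source schemas."""
    return min(source_levels, key=LEVEL_RANK.__getitem__)

# An entry built from an AAA schema and an AA schema is itself AA.
level = inherited_level(["AAA", "AA"])
```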
12.2 Soundness of rollup
If the source schema is AAA-verified and the operator is partition-invariant, the cached rollup is trivially AAA-attestable: the materialized entry is the rollup of the source, so the sibling-coherence attestation between them is satisfied by construction (the cached node is itself the answer to the sibling-coherence check). No re-attestation needed.
Non-partition-invariant operators (AVG, MEDIAN, COUNT_DISTINCT)
produce entries that are usable at their own anchor; their AAA status
is inherited from the source but the entry is flagged
not_further_rollable so the engine doesn't attempt to roll them up.
12.3 Surfaced provenance
When the engine serves a request from a materialized entry, the
returned Frame carries provenance metadata identifying the entry
(and its verification level) for each cell. This lets the
verification surface report "this row came from <metric_engine_entry>
materialized at <timestamp> from <source_schema>@<level>."
13. Integration with coframe-resolution
13.1 What changes in coframe-resolution
- execute_query(ac, backend, src) accepts an optional metric_engine: MetricEngine | None argument. When present, the engine handles execution; when absent, the legacy backend.aggregate path runs.
- build_plan() emits a list of (metric_family, anchor, mvt, filter) requests instead of AggregateRequests when the engine is in use. (The legacy AggregateRequest path remains for engine-less execution.)
- merge_blocks and post_grain_ops are removed in favor of engine.compose().
13.2 Default off in v0.1
The Metric Engine is opt-in via per-installation configuration. It
must soak through real workloads before becoming default. The default
path remains: direct backend invocation through today's
execute_query.
13.3 The cross-backend invariant is retained¶
The Phase 7 cross-backend invariant tests (same query → same Frame through SQLite vs Polars) become a forcing function for the engine: when the engine is enabled, the same query through the engine-on-SQLite path must produce identical Frames to the engine-on-Polars path. This is the engine's primary correctness gate.
13.4 QM backend migration¶
The QM domain landing as a co-resident substrate (per §2.7, §5.5) collapses the duplicated quasi-metadata compute paths across backends into one. After the migration:
- SQLiteBackend.compute_table_profile → extract_to_lazyframe(table) → engine.profile_table(lazyframe) → engine.serve(QM, "column_profile", (schema, col)) for each column; the typed TableProfile is a projection assembled from the per-column entries.
- PolarsBackend.compute_table_profile → same path: extract to LazyFrame (trivial, since it already has one) → shared profiling.
- The per-backend kind-hint heuristic in coframe.sqlite.backend and coframe.polars.backend is removed; the canonical implementation lives in the engine.
- Backend protocol grows one method: extract_to_lazyframe(table) → pl.LazyFrame (the QM ingestion door). SQLite reads via sqlite3 → arrow → polars; Polars returns its native frame.
- Existing callers of backend.compute_table_profile() continue to work: the call now routes through the engine when enabled, or falls back to the legacy per-backend impl when disabled.
The migration removes ~200 lines of duplicated logic and eliminates the cross-backend invariant tests for QM (they become structurally trivial — same code path runs for both backends).
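To make the "TableProfile is a projection" idea concrete, here is a dependency-free sketch. It substitutes dataclasses for the Pydantic types and a dict of column lists for the LazyFrame, and the kind-hint rule shown is a toy stand-in, not the engine's canonical heuristic. `ToyEngine` and all field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnProfile:
    name: str
    kind_hint: str
    null_count: int


@dataclass(frozen=True)
class TableProfile:
    table: str
    columns: tuple


def kind_hint(values):
    """Toy heuristic: numeric if every non-null value is an int or float."""
    non_null = [v for v in values if v is not None]
    if non_null and all(isinstance(v, (int, float)) for v in non_null):
        return "numeric"
    return "text"


class ToyEngine:
    """Stand-in for the QM domain: entries keyed by (schema, column)."""

    def __init__(self):
        self.entries = {}

    def profile_table(self, schema, columns):
        for col, values in columns.items():
            self.entries[(schema, col)] = ColumnProfile(
                name=col,
                kind_hint=kind_hint(values),
                null_count=sum(v is None for v in values),
            )

    def serve(self, domain, family, key):
        return self.entries[key]


def compute_table_profile(engine, schema, columns):
    """Profile once through the engine, then project per-column entries."""
    engine.profile_table(schema, columns)
    return TableProfile(
        table=schema,
        columns=tuple(engine.serve("QM", "column_profile", (schema, c)) for c in columns),
    )


profile = compute_table_profile(ToyEngine(), "sales", {"sku": ["a", "b", None], "qty": [1, 2, 3]})
print([(c.name, c.kind_hint, c.null_count) for c in profile.columns])
# [('sku', 'text', 1), ('qty', 'numeric', 0)]
```

Because both backends would feed the same profiling function, the heuristic exists in exactly one place.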
14. Slice plan (after this design doc)¶
| Slice | Deliverable |
|---|---|
| 2 | Package skeleton (coframe-metric-engine). Domain-aware types: Domain enum (METRIC/QM), EngineEntry, Manifest, ServingPath, FDStep. Skeleton tests. |
| 3 | Storage + manifest. Polars+Parquet writes, SQLite manifest CRUD with (domain, dataset_id, anchor_signature) unique key. Domain-agnostic. |
| 4 | QM as engine-managed substrate. Port both backends' compute_table_profile to: extract → LazyFrame → shared engine.profile_table() → store as QM entries → return Pydantic projection. Eliminates the duplicated kind-hint heuristic. Proves the substrate before metric serving is load-bearing. |
| 5 | METRIC-domain serve() with FD-DAG traversal (exact, rollable, fallback). Memoization on backend miss. |
| 6 | METRIC-domain compose() — multi-metric merge + frame-expression + post-grain ops. Replaces merge_blocks + post_grain_ops. |
| 7 | Eviction (LRU + FD-DAG-aware for METRIC; pure LRU for QM) + stability-window invalidation. |
| 8 | Integration with coframe-resolution: rewire execute_query behind the metric_engine.enabled flag. End-to-end retail test through both paths (engine on + off) returning identical Frames. |
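As an illustration of the slice 3 manifest key, the sketch below creates an in-memory SQLite table with a (domain, dataset_id, anchor_signature) uniqueness constraint. The column types, the parquet_path column, and the helper name `upsert_entry` are assumptions; the real manifest schema is left to slice 3.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE manifest (
        domain TEXT NOT NULL,
        dataset_id TEXT NOT NULL,
        anchor_signature TEXT NOT NULL,
        parquet_path TEXT NOT NULL,
        UNIQUE (domain, dataset_id, anchor_signature)
    )
    """
)


def upsert_entry(domain, dataset_id, anchor_signature, parquet_path):
    """Insert an entry, replacing any existing row with the same unique key."""
    conn.execute(
        "INSERT OR REPLACE INTO manifest "
        "(domain, dataset_id, anchor_signature, parquet_path) VALUES (?, ?, ?, ?)",
        (domain, dataset_id, anchor_signature, parquet_path),
    )


upsert_entry("METRIC", "sales", "store|day", "a.parquet")
upsert_entry("METRIC", "sales", "store|day", "b.parquet")  # same key: replaces
upsert_entry("QM", "sales", "store|day", "c.parquet")      # other domain: new row

rows = conn.execute(
    "SELECT domain, parquet_path FROM manifest ORDER BY domain"
).fetchall()
print(rows)  # [('METRIC', 'b.parquet'), ('QM', 'c.parquet')]
```

Including the domain in the key is what lets METRIC and QM entries for the same dataset and anchor coexist in one manifest.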
Why slice 4 (QM) before slice 5 (metric serve): QM's serve
semantics are simpler (no FD-DAG rollup, no partition-invariance
reasoning), so it validates the storage substrate end-to-end before
the metric domain's complexity lands on top. QM unification also
removes a duplicated heuristic immediately — visible win in the
first substantive slice. The metric serve() machinery in slice 5
lands on a proven substrate rather than co-developing with one.
15. Open questions¶
15.1 Per-process vs shared-process Metric Engine — RESOLVED¶
Resolved per author guidance. v0.1 ships shared-process within a single Python process per AC (see §3.3). Cross-process coordination, dedicated engine processes, and distributed-engine futures are explicit Coframe Pro tier concerns; v0.1 keeps the lightweight shape.
15.2 Cache hint placement in the AC — RESOLVED¶
Resolved: cache_hint lives on MetricFamily (per §9.2). In
v0.1, almost all metric families ship a single ip_reducer; the
granularity benefit of per-reducer hints is theoretical for most
ACs, and MetricFamily-level matches how authors think about hot
grains ("revenue is hot at these grains," not "revenue's SUM
reducer specifically is hot at these grains").
Forward-compat: extending to a per-reducer override later is
additive — an optional cache_hint on IpReducer would shadow the
family-level hint for that reducer only. Lands if/when real
workloads demand it.
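The shadowing rule could be sketched as follows. The representation of a hint as a tuple of grain names and the method name `effective_hint` are assumptions for illustration; §9.2 defines the actual cache_hint shape.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class IpReducer:
    name: str
    # Hypothetical per-reducer override; None means "use the family's hint".
    cache_hint: Optional[tuple] = None


@dataclass
class MetricFamily:
    name: str
    cache_hint: tuple
    reducers: list = field(default_factory=list)

    def effective_hint(self, reducer: IpReducer) -> tuple:
        """A reducer-level hint, if present, shadows the family-level hint."""
        if reducer.cache_hint is not None:
            return reducer.cache_hint
        return self.cache_hint


sum_reducer = IpReducer("SUM")
max_reducer = IpReducer("MAX", cache_hint=("region",))
revenue = MetricFamily("revenue", cache_hint=("store", "day"),
                       reducers=[sum_reducer, max_reducer])

print(revenue.effective_hint(sum_reducer))  # ('store', 'day')
print(revenue.effective_hint(max_reducer))  # ('region',)
```

Since the override defaults to None, families that never set it behave exactly as they do with the family-level hint alone.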
15.3 Which call paths route through the engine — RESOLVED¶
Resolved: the engine intercepts only the analytical-query
path — serve() and its descendants under compose(). Every
other DataAPIBackend method goes direct to the backend, untouched
by the engine.
| Call path | Routes through engine? |
|---|---|
| execute_query(ac, backend, src) (Frame-QL → Frame) | Yes |
| Per-metric serve() from compose() | Yes (by definition) |
| backend.attest_fd_edge(...) | No (verification op) |
| backend.attest_sibling_coherence(...) | No (verification op) |
| backend.compute_table_profile(...) | No (L2 QM has its own cache) |
| backend.list_tables() / describe_table() | No (cheap introspection) |
| backend.check_update_timestamp_column() | No (stability-filter probe) |
| backend.apply_stability_filter() | No (stability-filter probe) |
Rationale: attestation results are operational diagnostics, not analytical answers; memoizing them muddles provenance ("did I just verify this, or am I reading a verification from 3 days ago?"). Introspection + profiling are cheap and have their own L2 quasi-metadata caching story. Stability probes are O(1) lookups.
Contributor guidance: when adding a new method to
DataAPIBackend, ask whether it's an analytical-query path. If yes,
route through serve(). If no, call the backend directly.
15.4 NLQ and engine — RESOLVED¶
Resolved per author guidance. Not an engine concern. NLQ
produces Frame-QL — highly structured input — that goes through the
same execute_query path as any other source. The engine sees no
distinction. Speculation about engine-state-driven NLQ suggestions
("you've cached X; want to ask about it?") would be NLQ-side work,
not engine-side, and is out of scope here.
15.5 Cross-AC sharing — RESOLVED¶
Resolved: Coframe Pro tier territory. Same shape as §15.1: v0.1 keeps per-AC stores for simplicity (filter-scope isolation, naming isolation, no shared-write coordination). Physical-name-keyed cross-AC sharing with L3 name remapping is a Pro-tier optimization that lands if real workloads show meaningful storage duplication or hit-rate loss from non-sharing. Storage cost is the least of v0.1's concerns.
15.6 Backend batching for misses — RESOLVED¶
Resolved: Coframe Pro tier territory. Batching multiple
single-metric misses into multi-metric backend calls is an
IO-overhead optimization that matters for remote warehouses
(Snowflake, BigQuery — round-trip latency dominates) but not for
the local engines Core targets (SQLite, Polars — per-call overhead
is negligible). Coframe Core keeps serve() clean and atomic;
batching machinery lives in the Pro tier alongside the remote-
warehouse backends that benefit from it.
Forward-compat: batching is purely an engine-internal optimization;
adding it later doesn't change serve(), compose(), or
DataAPIBackend signatures.
15.7 Results too large to memoize — RESOLVED¶
Resolved: serve normally; skip memoization when the result exceeds a per-entry size cap; surface a coarser-grain recommendation.
- Per-entry size cap: 10% of the engine's max_bytes_per_ac budget (default 100MB on the default 1GB budget). Configurable via installation.yaml as metric_engine.max_entry_bytes.
- Behavior at the cap: serve the result from the backend normally (no error, no degraded answer); skip writing to the materialized store; log a recommendation that "a coarser-grain version of this query would be cacheable", which feeds the promotion-recommendation machinery (§9.3).
- No errors thrown: query correctness is preserved unconditionally. The engine only opts out of memoization, never out of serving.
The 10% default is a starting point. Real workloads will inform whether to raise or lower it.
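The cap arithmetic and the "serve but skip memoization" behavior can be sketched as below. It assumes decimal units (1GB = 10^9 bytes), a plain dict as the materialized store, and byte strings as results; `entry_cap` and this `serve` signature are hypothetical helpers, not the engine's API.

```python
from typing import Optional

DEFAULT_MAX_BYTES_PER_AC = 1_000_000_000  # default 1GB budget (decimal units assumed)


def entry_cap(max_bytes_per_ac: int, max_entry_bytes: Optional[int] = None) -> int:
    """Explicit metric_engine.max_entry_bytes wins; otherwise 10% of the budget."""
    if max_entry_bytes is not None:
        return max_entry_bytes
    return max_bytes_per_ac // 10


def serve(key, compute, store, cap, recommendations):
    """Always return the result; memoize only when it fits under the cap."""
    result = compute()  # stands in for the backend call on a miss
    if len(result) <= cap:
        store[key] = result
    else:
        # No error and no degraded answer: only a logged recommendation.
        recommendations.append(f"a coarser-grain version of {key} would be cacheable")
    return result


print(entry_cap(DEFAULT_MAX_BYTES_PER_AC))  # 100000000

store, recs = {}, []
small = serve("k1", lambda: b"x" * 10, store, cap=10, recommendations=recs)
big = serve("k2", lambda: b"x" * 11, store, cap=10, recommendations=recs)
print(sorted(store), len(big), len(recs))  # ['k1'] 11 1
```

The oversized result is still returned in full; only the write to the store is skipped.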
16. Out of scope for this doc¶
- Concrete Polars expression generation for rollup (slice 5 detail).
- Concrete Parquet partition scheme tuning (slice 3 detail).
- Concurrency primitive choices in Python (slice 3 detail).
- Telemetry / observability hooks (worth its own design pass).
- Distributed-engine futures (clearly out of v1.0).
Next: review + iterate this doc; lock the design (or flag open questions for further brainstorm); then proceed to slice 2 (package skeleton + types) per §14.