AC-Level Derived Dimensions: Design Document, v0.1
Status: v0.1 design (2026-05-25). Companion to coframe_derived_metrics_design_v0_1.md. Captures the Shape A — declarative mapping/alias dimensions approach agreed on 2026-05-25 before implementation.
Author: reeeneeee
Audience: implementers of Coframe Core v2.3+; reviewers checking the architectural commitments before code lands.
Scope. This document specifies how Coframe Core represents and executes AC-level derived dimensions — dimensions that are not physically present in any backend table but are derived from existing dimensions via an AC-declared mapping (e.g., coarse_region = MAP({East: Atlantic, West: Pacific, Central: Plains})). It complements the existing function-derived dimension machinery (month = MONTH_OF(day)) by adding user-declared mappings as a first-class AC declaration.
Out of scope. Expression-derived dimensions over metric values (revenue_tier = CASE WHEN revenue > 1000 THEN 'high' ELSE 'low' END) — the Shape C case from the architecture discussion. Those break the dimension/metric trichotomy and are Pro-tier. Arbitrary user catalog functions (Shape B beyond what's already in Chapter 10) are a separate registry concern, also deferred.
0. Naming + framing
A derived dimension is an AC-dimension whose values are computed at query time from an existing AC-dimension via an AC-declared mapping. The mapping is a pure function source → derived with no data dependency beyond the source value.
This is structurally similar to function-derived dimensions (the existing MONTH_OF(day) machinery, Manual §3.4.6) — both produce a per-row value from another dimension via a function. The difference:
| | Function-derived (existing) | Mapping-derived (this design) |
|---|---|---|
| Declared via | derived_by: MONTH_OF in a hierarchy path | mapping: {East: Atlantic, ...} in a derived_dimensions block |
| Function source | Chapter 10 operator catalog | The AC's own declaration |
| Authoring overhead | Catalog operator must exist | None: pure data in YAML |
| Example | day → month | region → coarse_region |
Mapping-derived dimensions are the natural extension of function-derived ones: same structural slot in the FD-DAG, same per-row computation model, just with an AC-supplied lookup instead of a catalog operator.
1. Motivation
Three common real-world needs Coframe Core doesn't directly support today:
- Coarsening: coarse_region = {East: Atlantic, West: Pacific, Central: Plains} for executive dashboards.
- Aliasing: country_code = {USA: us, Canada: ca, Mexico: mx} to bridge a friendly column name to a system code.
- Bucketing fixed sets: priority_tier = {p1: critical, p2: critical, p3: standard, p4: standard, p5: low} for ops dashboards that need fewer categories than the source.
In all three cases the author can today only get this by:
- Materialising the derived column in the warehouse (heavy, has to be ETL'd),
- Asking the user to write the mapping inline in every query (error-prone, not centralised),
- Or registering a custom catalog operator (Pro-tier, overkill for a literal mapping).
The design ships a fourth option: declare the mapping on the AC, query against the derived dimension as if it were physical.
2. The Reading B principle (load-bearing)
The architectural commitment, consistent with derived metrics:
The user does not know coarse_region is derived. The backend doesn't know either (no SQL pushdown in v0.1; v0.2+ could push down a CASE). The metric engine knows and executes the mapping. The planner is partially aware: it has to substitute the derived dim's source when building the per-metric request, because grouping by the derived dim requires reading rows at the source-dim grain.
Slightly different asymmetry from derived metrics: dimensions affect the grain of the query, so the planner can't be completely blind. But the awareness is contained — Rule 1/2 see the derived dim as a first-class dim (via the auto-added FD-DAG edge), and only the per-metric request construction in execution.py needs to know about the substitution.
The principle is mapping, not data: a derived dimension is a declaration of how to map, not a new column to materialise. The substrate stays primitive — each metric is cached at the source dim's grain (e.g., revenue@(region,)); the derived view is computed on demand by mapping + re-aggregating.
3. AC schema for derived dimensions
The derived_dimensions block lives on DimensionFamily — because a derived dimension naturally belongs to one family (it coarsens or aliases within that family's coordinate space).
dimension_families:
  - name: geography
    description: "Store geography."
    base_level: store
    members: [store, city, region]   # primitive members; derived ones auto-appended
    hierarchies:
      - name: administrative
        path: [store, city, region]
    derived_dimensions:
      - name: coarse_region
        description: "Coarsened region grouping for executive views."
        derived_from: region
        mapping:
          East: Atlantic
          West: Pacific
          Central: Plains
        default: null   # optional; null = strict (reject unknowns)
After loading:
- coarse_region is appended to members (so the loaded family lists it alongside the primitive members)
- The FD-DAG gets a new edge: region → coarse_region, with source="mapping_derived" and a reference to the mapping
- The mapping is stored on the AC for runtime access via ac.derived_dimension(name) → DerivedDimensionSpec
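The three post-load effects above can be sketched in plain Python. This is a minimal illustration only: dataclasses stand in for the Pydantic models, the FDEdge field names (head, tail), the mapping_edges helper, and the use of __post_init__ are assumptions made for the example, and the accessor is placed on the family here for brevity although the design puts it on the AC.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DerivedDimensionSpec:
    name: str
    derived_from: str
    mapping: dict[str, str]
    default: Optional[str] = None


@dataclass(frozen=True)
class FDEdge:
    # Hypothetical lean edge: it carries no mapping payload.
    head: str
    tail: str
    source: str


@dataclass
class DimensionFamily:
    name: str
    members: list[str]
    derived_dimensions: tuple[DerivedDimensionSpec, ...] = ()

    def __post_init__(self) -> None:
        # Effect 1: derived names are appended to members if missing.
        for spec in self.derived_dimensions:
            if spec.name not in self.members:
                self.members.append(spec.name)

    def mapping_edges(self) -> list[FDEdge]:
        # Effect 2: one source -> derived edge per declared mapping.
        return [
            FDEdge(spec.derived_from, spec.name, "mapping_derived")
            for spec in self.derived_dimensions
        ]

    def derived_dimension(self, name: str) -> DerivedDimensionSpec:
        # Effect 3: the mapping is looked up on demand by derived-dim name.
        for spec in self.derived_dimensions:
            if spec.name == name:
                return spec
        raise KeyError(name)


geography = DimensionFamily(
    name="geography",
    members=["store", "city", "region"],
    derived_dimensions=(
        DerivedDimensionSpec(
            name="coarse_region",
            derived_from="region",
            mapping={"East": "Atlantic", "West": "Pacific", "Central": "Plains"},
        ),
    ),
)
edges = geography.mapping_edges()
```

With this sketch, geography.members ends in coarse_region and edges holds a single region → coarse_region edge tagged mapping_derived.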
Pydantic shape
from pydantic import BaseModel, ConfigDict

class DerivedDimensionSpec(BaseModel):
    name: str                      # the derived dim's AC-dimension name
    description: str | None = None
    derived_from: str              # source dim; must be in this family's members
    mapping: dict[str, str]        # source-value → derived-value
    default: str | None = None     # value for unmapped source values; None = strict refuse
    model_config = ConfigDict(frozen=True, extra="forbid")
The mapping's key/value types are str for v0.1 — Coframe dimensions are almost always strings (region names, categories, codes). Numeric / date sources can be supported in a later iteration by widening the value type.
4. Validation rules
At AC load + cross-ref time:
| Rule | Detail |
|---|---|
| DD-100 Source dim exists | derived_from resolves to a declared member of the SAME family |
| DD-101 Derived name unique | name doesn't collide with any other AC-dimension (across all families) |
| DD-102 Source not itself derived | derived_from points to a primitive or function-derived dimension, not another mapping-derived one (v0.1 conservative; v0.2 may relax) |
| DD-103 Mapping non-empty | At least one entry in mapping; empty is forbidden as it'd produce empty Frames |
| DD-104 Default optional | If default is None, strict mode — unknown source values produce an error at query time; if a string, that string is used for unmapped values |
The "DD-" prefix parallels "D-" for derived metrics; both join the integrity catalog (Manual §2.10) under a new naming scheme.
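The load-time rules in the table could be checked along the following lines. This is a sketch under assumptions: the function name, the dict-shaped spec, and the three name sets passed in are hypothetical, and DD-104 is left out because it governs query-time behaviour rather than load-time rejection.

```python
def check_derived_dimension(spec, family_members, other_dimension_names, mapping_derived_names):
    """Return the codes of the load-time rules that a spec violates.

    spec                   -- dict with "name", "derived_from", "mapping"
    family_members         -- members of the family that declares the spec
    other_dimension_names  -- every AC-dimension name declared elsewhere
    mapping_derived_names  -- names that are themselves mapping-derived
    """
    violations = []
    if spec["derived_from"] not in family_members:
        violations.append("DD-100")  # source dim must exist in the same family
    if spec["name"] in other_dimension_names:
        violations.append("DD-101")  # derived name must be unique across the AC
    if spec["derived_from"] in mapping_derived_names:
        violations.append("DD-102")  # no nested mapping derivation in v0.1
    if not spec["mapping"]:
        violations.append("DD-103")  # mapping must have at least one entry
    return violations


family = {"store", "city", "region"}
declared = {"store", "city", "region", "day", "month"}

valid = check_derived_dimension(
    {"name": "coarse_region", "derived_from": "region", "mapping": {"East": "Atlantic"}},
    family, declared, set(),
)
invalid = check_derived_dimension(
    {"name": "region", "derived_from": "continent", "mapping": {}},
    family, declared, set(),
)
nested = check_derived_dimension(
    {"name": "continent", "derived_from": "coarse_region", "mapping": {"Atlantic": "Americas"}},
    family | {"coarse_region"}, declared, {"coarse_region"},
)
```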
5. FD-DAG integration
Mapping-derived dimensions extend the existing FD-DAG edge taxonomy:
from typing import Literal

FDEdgeSource = Literal[
    "hierarchy",          # existing: implied by path: [d1, d2, d3]
    "function_derived",   # existing: implied by {ac_dimension: m, derived_by: F}
    "data_attested",      # existing: verified at Phase 2
    "extra",              # existing: extra_fd_edges list
    "mapping_derived",    # NEW
]
A mapping-derived edge carries no derived_by (no catalog operator) but a reference to the mapping. Storage choice for v0.1: keep the FDEdge model lean (no payload), and look up the mapping on demand via ac.derived_dimension(tail_name).
Reachability behaviour is unchanged — derived dims become FD-reachable from their source, so all existing schema-selection / cross-schema-coherence logic flows through transparently.
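To make the reachability claim concrete, here is a small sketch of a breadth-first walk over FD edges in which the mapping-derived edge is treated like any other. The tuple-based edge representation and the reachable_from helper are assumptions for illustration; the real FD-DAG structure may differ.

```python
from collections import deque

# (head, tail, source) triples; only the last edge is new in this design.
EDGES = [
    ("store", "city", "hierarchy"),
    ("city", "region", "hierarchy"),
    ("region", "coarse_region", "mapping_derived"),
]


def reachable_from(start, edges):
    """Return every dimension that is FD-reachable from start."""
    successors = {}
    for head, tail, _source in edges:
        successors.setdefault(head, []).append(tail)

    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in successors.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


from_store = reachable_from("store", EDGES)
from_derived = reachable_from("coarse_region", EDGES)
```

The walk never inspects the edge source, which is the sense in which existing reachability logic needs no change: coarse_region is reachable from store, and nothing is reachable from coarse_region.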
6. Execution: engine-side mapping + re-aggregate
When the engine sees a request for metric@(<derived_dim>, ...):
serve(METRIC, "revenue", ("coarse_region",))
│
├── Branch 0: not derived (skip)
├── Branch 1: exact match on (coarse_region,)? Usually no.
├── Branch 2a (existing): subset rollup from a finer cached entry.
│ E.g., revenue@(region, day) → revenue@(coarse_region,) would need
│ substitution; not handled by current subset-only Branch 2.
│
├── Branch 2b (NEW for derived dims): FD-edge rollup via mapping.
│ - Detect: the requested anchor contains a derived dim
│ - Substitute: serve at the source-dim-substituted anchor instead
│ (recursive: serve(METRIC, "revenue", ("region",)))
│ - Apply mapping in Polars: with_columns(col("region").replace_strict(mapping).alias("coarse_region"))
│ - Re-aggregate: group_by("coarse_region").agg(sum("revenue"))
│ - Return as LazyFrame with [coarse_region, revenue] columns
│
└── Branch 3: backend fallback (only if Branch 2b's recursion couldn't
serve the source anchor either).
The re-aggregation step uses the metric family's first ip_reducer (typically SUM) — same partition-invariance rule as the existing Branch 2.
Crucially: the engine never materialises an entry at the derived-dim grain. The substrate stays primitive (entries at source-dim grains only); the derived view is computed on demand by mapping + re-aggregation. Same principle as derived metrics (formula not data).
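The map-then-re-aggregate step of Branch 2b can be illustrated with a plain-Python stand-in for the Polars pipeline. The function name, the row-dict representation, the source dim name priority, and the tickets metric are all hypothetical; the priority_tier mapping is the one from §1, chosen because it collapses several source values into one derived value, and SUM is assumed as the reducer.

```python
from collections import defaultdict

PRIORITY_TIER = {
    "p1": "critical",
    "p2": "critical",
    "p3": "standard",
    "p4": "standard",
    "p5": "low",
}


def rollup_via_mapping(rows, source_dim, derived_dim, mapping, metric):
    """Map each source value to its derived value, then SUM per derived value."""
    totals = defaultdict(int)
    for row in rows:
        derived_value = mapping[row[source_dim]]  # strict: KeyError if unmapped
        totals[derived_value] += row[metric]
    # Output rows carry the derived dim's name, not the source dim's.
    return [
        {derived_dim: value, metric: total}
        for value, total in sorted(totals.items())
    ]


# Rows as they would come back from serving the metric at the source-dim grain.
source_rows = [
    {"priority": "p1", "tickets": 3},
    {"priority": "p2", "tickets": 4},
    {"priority": "p3", "tickets": 5},
    {"priority": "p4", "tickets": 1},
    {"priority": "p5", "tickets": 2},
]

result = rollup_via_mapping(source_rows, "priority", "priority_tier", PRIORITY_TIER, "tickets")
```

Nothing is stored at the derived grain in this sketch; the result is recomputed from the source-grain rows on every call, which mirrors the "mapping, not data" commitment.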
Multi-derived-dim anchors
If the requested anchor has multiple derived dims (e.g., (coarse_region, coarse_quarter)), the engine substitutes each independently → serves at (region, quarter) → applies each mapping → re-aggregates on the derived-dim columns. Conceptually clean; implementation iterates over the anchor.
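A sketch of the multi-derived-dim case, again in plain Python rather than Polars. The rollup_multi name, the (source, mapping) tuple layout, and the coarse_quarter mapping values are assumptions for illustration, and SUM is assumed as the reducer.

```python
def rollup_multi(source_rows, anchor, derived_specs, metric):
    """Substitute each derived dim, map every column, and SUM on the derived anchor.

    derived_specs maps derived-dim name -> (source-dim name, mapping dict).
    Returns the substituted source anchor and the re-aggregated rows.
    """
    source_anchor = tuple(
        derived_specs[dim][0] if dim in derived_specs else dim for dim in anchor
    )
    totals = {}
    for row in source_rows:
        key = tuple(
            derived_specs[dim][1][row[src]] if dim in derived_specs else row[src]
            for dim, src in zip(anchor, source_anchor)
        )
        totals[key] = totals.get(key, 0) + row[metric]
    frame = [
        dict(zip(anchor, key), **{metric: total})
        for key, total in sorted(totals.items())
    ]
    return source_anchor, frame


specs = {
    "coarse_region": ("region", {"East": "Atlantic", "West": "Pacific", "Central": "Plains"}),
    # Hypothetical mapping: the design does not spell out coarse_quarter's values.
    "coarse_quarter": ("quarter", {"Q1": "H1", "Q2": "H1", "Q3": "H2", "Q4": "H2"}),
}
rows_at_source_grain = [
    {"region": "East", "quarter": "Q1", "revenue": 10},
    {"region": "East", "quarter": "Q2", "revenue": 5},
    {"region": "West", "quarter": "Q1", "revenue": 7},
    {"region": "West", "quarter": "Q3", "revenue": 2},
    {"region": "East", "quarter": "Q3", "revenue": 1},
]

source_anchor, frame = rollup_multi(
    rows_at_source_grain, ("coarse_region", "coarse_quarter"), specs, "revenue"
)
```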
7. The planner's narrow involvement
Unlike derived metrics (where the planner stays fully unaware), derived dims require a small planner touch:
- Rule 1 (family resolution): unchanged.
- Rule 2 (anchor-set capability): the derived dim is just a member of its family; block-set checks treat it like any other dim.
- Rule 3 (schema selection): the derived dim isn't a physical column, so the planner's "find sibling at this grain" search must skip it. Instead, the planner substitutes derived dims with their source dims when building the candidate schema search.
- Rule 4 (cross-schema coherence): unchanged.
In _resolve_metric / apply_rule_3: before searching for a schema that hosts a sibling at target_grain, substitute every derived dim in target_grain with its source dim. Pass the substituted grain to the schema search. The schema selection finds a schema that has the source dim physically. The substitution metadata is stashed on the ResolvedMetric for the executor to apply.
The execution path is then:
1. Backend computes aggregates at the source-dim grain
2. Engine maps source → derived values + re-aggregates
3. Returned Frame's columns carry the derived dim's name (not the source)
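The planner-side substitution might look roughly like this. The function name and return shape are hypothetical, and de-duplicating the grain when a derived dim and its own source are both requested is an assumption of this sketch, not something the design specifies.

```python
def substitute_derived_dims(target_grain, derived_sources):
    """Replace each derived dim in target_grain with its source dim.

    derived_sources maps derived-dim name -> source-dim name.
    Returns (substituted_grain, substitutions); the second value is the
    metadata that would be stashed for the executor to apply later.
    """
    substituted = []
    substitutions = {}
    for dim in target_grain:
        source = derived_sources.get(dim, dim)
        if source != dim:
            substitutions[dim] = source
        if source not in substituted:  # assumption: keep each source dim once
            substituted.append(source)
    return tuple(substituted), substitutions


DERIVED_SOURCES = {"coarse_region": "region"}

grain, meta = substitute_derived_dims(("coarse_region", "day"), DERIVED_SOURCES)
plain_grain, plain_meta = substitute_derived_dims(("city", "day"), DERIVED_SOURCES)
```

The schema search would then run on grain, which contains only physically hosted dims, while meta records which columns the engine must map back afterwards.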
8. Trade-offs accepted
8.1 No SQL pushdown of mappings in v0.1
The backend always groups by the source dim; the engine handles the mapping post-read. Costs: one extra Python/Polars pass after each metric serve. For typical AC sizes (a few dozen distinct values in a mapping) this is sub-millisecond. Pro could add SQL pushdown by extending AggregateRequest.group_by to allow CASE expressions — forward-compat, not needed for v0.1.
8.2 v0.1 doesn't support nested derivation
A mapping-derived dim cannot itself be the source of another mapping-derived dim (coarse_region → continent rejected). This keeps the FD-DAG cycle check simple. v0.2 could relax this.
8.3 Strict vs default mode for unmapped values
When the backend returns a source value not in the mapping:
- Strict mode (default: null): the engine raises UnmappedDimensionValueError with the offending value(s).
- Default mode (default: "Other"): unmapped values get the default; the result is grouped correctly.
Strict is the v0.1 default. Forces the AC author to be explicit about coverage.
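The two modes can be sketched as follows. UnmappedDimensionValueError is the error named above; its base class, its message format, and the map_values helper are assumptions for illustration.

```python
class UnmappedDimensionValueError(ValueError):
    """Raised in strict mode when a source value has no mapping entry."""


def map_values(values, mapping, default=None):
    """Map source values to derived values under strict or default mode."""
    unmapped = sorted({value for value in values if value not in mapping})
    if unmapped and default is None:
        # Strict mode: refuse, and report every offending value at once.
        raise UnmappedDimensionValueError(f"unmapped source values: {unmapped}")
    # Default mode (or full coverage): fall back to the declared default.
    return [mapping.get(value, default) for value in values]


MAPPING = {"East": "Atlantic", "West": "Pacific", "Central": "Plains"}

covered = map_values(["East", "West"], MAPPING)
with_default = map_values(["East", "North"], MAPPING, default="Other")
```

Calling map_values(["East", "North"], MAPPING) without a default raises, which is the behaviour that forces the AC author to either extend the mapping or declare a default.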
8.4 Re-aggregation requires a partition-invariant ip_reducer
If a metric is anchor-locked (no ip_reducer) at the source-dim grain, querying it at a derived-dim grain is refused (the mapping might collapse multiple source values into one derived value; we'd need to re-aggregate, but there's no reducer). Same rule as the existing Branch 2 rollup. Failure is a clear AnchorLockedError.
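A minimal sketch of the refusal, assuming a metric's reducers are available as a tuple of names. AnchorLockedError is the error named above; the reducer_for_rollup helper, the error message, and the median_basket metric are hypothetical.

```python
class AnchorLockedError(Exception):
    """Raised when a metric has no ip_reducer to re-aggregate with."""


def reducer_for_rollup(metric_name, ip_reducers):
    """Pick the reducer for a derived-dim rollup, or refuse if there is none."""
    if not ip_reducers:
        raise AnchorLockedError(
            f"{metric_name} is anchor-locked: no ip_reducer to re-aggregate "
            "source values that map to the same derived value"
        )
    # Same convention as the existing Branch 2 rollup: use the first reducer.
    return ip_reducers[0]


revenue_reducer = reducer_for_rollup("revenue", ("SUM",))
```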
9. Non-goals
| Non-goal | Note |
|---|---|
| Expression-derived dims over metrics (revenue_tier) | Shape C; Pro-tier; breaks the trichotomy |
| Catalog-function dims beyond mapping (QUARTER_OF) | Already supported via Manual §3.4.6 hierarchy syntax + Chapter 10 catalog |
| Nested mappings (mapping-derived from mapping-derived) | v0.2 affordance |
| Mappings with non-string source/target types | v0.2 affordance |
| User-supplied mapping override at query time | Would defeat the AC-as-source-of-truth posture |
10. Implementation plan
Three slices, in dependency order:
Slice 1: AC schema + validation + FD-DAG integration
- Extend derived.py with DerivedDimensionSpec Pydantic model + a sentinel MAPPING_DERIVED_EDGE source.
- Extend DimensionFamily with optional derived_dimensions: tuple[DerivedDimensionSpec, ...].
- After-construction validator: appends derived names to members if not already there.
- Extend FDEdgeSource literal with "mapping_derived".
- Extend _edges_from_dimension_family to emit the new edges.
- AC cross-ref: DD-100..DD-104 validation rules.
- AC accessors: is_derived_dimension, derived_dimension(name).
- Tests: declaration, FD-DAG reachability, validation rules.
Slice 2: Engine execution
- New _serve_via_derived_dim_rollup helper in engine.py (Branch 2b).
- Detect derived dims in serve()'s requested anchor; substitute → recursive serve → mapping + re-aggregate.
- Polars implementation: with_columns(col(src).replace_strict(mapping)) + group_by + agg.
- Strict / default mode behaviour.
- Tests: warm components, cold components, multi-derived-dim anchors, anchor-locked refusal.
Slice 3: Planner + retail demo + end-to-end
- Planner: substitute derived dims with source dims in schema search (apply_rule_3 + _resolve_metric).
- Stash substitution metadata on ResolvedMetric.
- Executor: pass substituted grain to backend; apply derived-dim transform after engine returns.
- Retail demo: declare coarse_region (East: Atlantic, West: Pacific, Central: Plains).
- Walkthrough Step 7c: SELECT coarse_region, SUM(revenue) AT coarse_region.
- End-to-end test through Frame-QL.
Estimated effort: ~250–400 LOC + tests across the three slices.
11. Summary
AC-level derived dimensions are added as mappings declared on the AC, executed by the engine, with narrow planner support. The key commitments:
- Mapping, not data: derived dims never get a backend column or a cached entry at the derived grain. The mapping is applied on demand post-read.
- FD-DAG integration is structural: derived dims appear as ordinary nodes reachable from their source via a new mapping_derived edge type. Reachability + schema selection flow through transparently.
- Planner is mostly unaware: Rules 1, 2, 4 unchanged. Rule 3 gets one targeted substitution (derived → source) for schema search.
- Engine owns the execution: new Branch 2b (substitute, recursive-serve at source grain, apply mapping + re-aggregate).
- Backend never sees them: only primitive dim columns go over the protocol. The mapping is engine-internal.
The design follows the Reading B principle adapted for the dimension/metric asymmetry (dims must be applied pre-aggregation, so the planner can't be quite as blind). The cleanest principle-preserving extension: declared mappings, engine-side execution, no SQL pushdown in v0.1.
When implementation lands (per §10), the canonical retail AC will declare coarse_region directly, and Frame-QL authors will write SELECT coarse_region, SUM(revenue) AT coarse_region and have it just work — with served_from: engine_cache whenever revenue@(region,) is warmed.