AC-Level Derived Dimensions: Design Document, v0.1

Status: v0.1 design (2026-05-25). Companion to coframe_derived_metrics_design_v0_1.md. Captures the Shape A — declarative mapping/alias dimensions approach agreed on 2026-05-25 before implementation.

Author: reeeneeee

Audience: implementers of Coframe Core v2.3+; reviewers checking the architectural commitments before code lands.

Scope. This document specifies how Coframe Core represents and executes AC-level derived dimensions — dimensions that are not physically present in any backend table but are derived from existing dimensions via an AC-declared mapping (e.g., coarse_region = MAP({East: Atlantic, West: Pacific, Central: Plains})). It complements the existing function-derived dimension machinery (month = MONTH_OF(day)) by adding user-declared mappings as a first-class AC declaration.

Out of scope. Expression-derived dimensions over metric values (revenue_tier = CASE WHEN revenue > 1000 THEN 'high' ELSE 'low' END) — the Shape C case from the architecture discussion. Those break the dimension/metric trichotomy and are Pro-tier. Arbitrary user catalog functions (Shape B beyond what's already in Chapter 10) are a separate registry concern, also deferred.

0. Naming + framing

A derived dimension is an AC-dimension whose values are computed at query time from an existing AC-dimension via an AC-declared mapping. The mapping is a pure function source → derived with no data dependency beyond the source value.

This is structurally similar to function-derived dimensions (the existing MONTH_OF(day) machinery, Manual §3.4.6) — both produce a per-row value from another dimension via a function. The difference:

| | Function-derived (existing) | Mapping-derived (this design) |
| --- | --- | --- |
| Declared via | `derived_by: MONTH_OF` in a hierarchy path | `mapping: {East: Atlantic, ...}` in a `derived_dimensions` block |
| Function source | Chapter 10 operator catalog | The AC's own declaration |
| Authoring overhead | Catalog operator must exist | None (pure data in YAML) |
| Example | `day → month` | `region → coarse_region` |

Mapping-derived dimensions are the natural extension of function-derived ones: same structural slot in the FD-DAG, same per-row computation model, just with an AC-supplied lookup instead of a catalog operator.

1. Motivation

Three common real-world needs Coframe Core doesn't directly support today:

Coarsening — coarse_region = {East: Atlantic, West: Pacific, Central: Plains} for executive dashboards.
Aliasing — country_code = {USA: us, Canada: ca, Mexico: mx} to bridge a friendly column name to a system code.
Bucketing fixed sets — priority_tier = {p1: critical, p2: critical, p3: standard, p4: standard, p5: low} for ops dashboards that need fewer categories than the source.

In all three cases, the author today has only three ways to get this:

- Materialising the derived column in the warehouse (heavy; it has to be ETL'd).
- Asking the user to write the mapping inline in every query (error-prone, not centralised).
- Registering a custom catalog operator (Pro-tier, overkill for a literal mapping).

The design ships a fourth option: declare the mapping on the AC, query against the derived dimension as if it were physical.

2. The Reading B principle (load-bearing)

The architectural commitment, consistent with derived metrics:

The user does not know coarse_region is derived. The backend doesn't know either (no SQL pushdown in v0.1; v0.2+ could push down a CASE). The metric engine knows and executes the mapping. The planner is partially aware — it has to substitute the derived dim's source when building the per-metric request, because grouping by the derived dim requires reading rows at the source-dim grain.

Slightly different asymmetry from derived metrics: dimensions affect the grain of the query, so the planner can't be completely blind. But the awareness is contained — Rule 1/2 see the derived dim as a first-class dim (via the auto-added FD-DAG edge), and only the per-metric request construction in execution.py needs to know about the substitution.

The principle is mapping, not data: a derived dimension is a declaration of how to map, not a new column to materialise. The substrate stays primitive — each metric is cached at the source dim's grain (e.g., revenue@(region,)); the derived view is computed on demand by mapping + re-aggregating.

3. AC schema for derived dimensions

The derived_dimensions block lives on DimensionFamily — because a derived dimension naturally belongs to one family (it coarsens or aliases within that family's coordinate space).

dimension_families:
  - name: geography
    description: "Store geography."
    base_level: store
    members: [store, city, region]      # primitive members; derived ones auto-appended
    hierarchies:
      - name: administrative
        path: [store, city, region]
    derived_dimensions:
      - name: coarse_region
        description: "Coarsened region grouping for executive views."
        derived_from: region
        mapping:
          East:    Atlantic
          West:    Pacific
          Central: Plains
        default: null                   # optional; null = strict (reject unknowns)

After loading:

- coarse_region is appended to the family's members list, so it appears alongside the primitive members.
- The FD-DAG gets a new edge, region → coarse_region, with source="mapping_derived" and a reference to the mapping.
- The mapping is stored on the AC for runtime access via ac.derived_dimension(name) → DerivedDimensionSpec.
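The load-time effects can be sketched as follows. This is a minimal illustration, assuming plain dataclasses in place of the real Pydantic models; `DerivedDim`, `Family`, and `load_family` are hypothetical names and not part of Coframe Core.

```python
from dataclasses import dataclass, field
from typing import Optional


# Hypothetical stand-ins for the real Coframe models; names are illustrative.
@dataclass(frozen=True)
class DerivedDim:
    name: str
    derived_from: str
    mapping: dict
    default: Optional[str] = None


@dataclass
class Family:
    name: str
    members: list
    derived_dimensions: list = field(default_factory=list)


def load_family(family):
    """Append derived names to members and return the new FD-DAG edges."""
    edges = []
    for dd in family.derived_dimensions:
        if dd.name not in family.members:
            family.members.append(dd.name)
        # (source dim, derived dim, edge source tag); the mapping stays on the spec.
        edges.append((dd.derived_from, dd.name, "mapping_derived"))
    return edges


geography = Family(
    name="geography",
    members=["store", "city", "region"],
    derived_dimensions=[
        DerivedDim(
            name="coarse_region",
            derived_from="region",
            mapping={"East": "Atlantic", "West": "Pacific", "Central": "Plains"},
        )
    ],
)
edges = load_family(geography)
print(geography.members)
print(edges)
```

The edge tuple carries no mapping payload; the mapping is looked up from the spec when needed.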

Pydantic shape

from pydantic import BaseModel, ConfigDict

class DerivedDimensionSpec(BaseModel):
    name: str                    # the derived dim's AC-dimension name
    description: str | None = None
    derived_from: str            # source dim — must be in this family's members
    mapping: dict[str, str]      # source-value → derived-value
    default: str | None = None   # value for unmapped source values; None = strict refuse

    model_config = ConfigDict(frozen=True, extra="forbid")

The mapping's key/value types are str for v0.1 — Coframe dimensions are almost always strings (region names, categories, codes). Numeric / date sources can be supported in a later iteration by widening the value type.

4. Validation rules

At AC load + cross-ref time:

| Rule | Detail |
| --- | --- |
| DD-100 Source dim exists | `derived_from` resolves to a declared member of the SAME family |
| DD-101 Derived name unique | `name` doesn't collide with any other AC-dimension (across all families) |
| DD-102 Source not itself derived | `derived_from` points to a primitive or function-derived dimension, not another mapping-derived one (v0.1 conservative; v0.2 may relax) |
| DD-103 Mapping non-empty | At least one entry in `mapping`; empty is forbidden as it'd produce empty Frames |
| DD-104 Default optional | If `default` is None, strict mode: unknown source values produce an error at query time; if a string, that string is used for unmapped values |

The "DD-" prefix parallels "D-" for derived metrics; both join the integrity catalog (Manual §2.10) under a new naming scheme.
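A rough sketch of how the load-time rules DD-100 through DD-103 could be checked. It assumes the AC is represented as plain dicts, with `members` holding only the primitive members (before derived names are appended); `check_derived_dimensions` is a hypothetical helper, not the actual cross-ref code. DD-104 describes query-time behaviour and is not checked here.

```python
def check_derived_dimensions(families):
    """Return (rule, message) tuples for violations of DD-100..DD-103.

    `families` maps family name -> {"members": [...], "derived": [spec, ...]}.
    "members" holds primitive members only; each spec is a dict with
    name / derived_from / mapping keys.
    """
    violations = []
    derived_names = {s["name"] for f in families.values() for s in f["derived"]}

    # Count every AC-dimension name across all families (for DD-101).
    counts = {}
    for f in families.values():
        for name in list(f["members"]) + [s["name"] for s in f["derived"]]:
            counts[name] = counts.get(name, 0) + 1

    for f in families.values():
        for s in f["derived"]:
            if s["derived_from"] in derived_names:
                violations.append(("DD-102", f"{s['name']}: source is mapping-derived"))
            elif s["derived_from"] not in f["members"]:
                violations.append(("DD-100", f"{s['name']}: unknown source"))
            if counts[s["name"]] > 1:
                violations.append(("DD-101", f"{s['name']}: name collides"))
            if not s["mapping"]:
                violations.append(("DD-103", f"{s['name']}: empty mapping"))
    return violations


ac = {
    "geography": {
        "members": ["store", "city", "region"],
        "derived": [
            {"name": "coarse_region", "derived_from": "region",
             "mapping": {"East": "Atlantic", "West": "Pacific", "Central": "Plains"}},
            # Nested derivation: rejected by DD-102 in v0.1.
            {"name": "continent", "derived_from": "coarse_region",
             "mapping": {"Atlantic": "North America"}},
            # Empty mapping: rejected by DD-103.
            {"name": "region_alias", "derived_from": "region", "mapping": {}},
        ],
    }
}
codes = [rule for rule, _ in check_derived_dimensions(ac)]
print(codes)
```

The `continent` and `region_alias` entries are invented purely to trigger the rules; only `coarse_region` comes from the design.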

5. FD-DAG integration

Mapping-derived dimensions extend the existing FD-DAG edge taxonomy:

from typing import Literal

FDEdgeSource = Literal[
    "hierarchy",         # existing — implied by path: [d1, d2, d3]
    "function_derived",  # existing — implied by {ac_dimension: m, derived_by: F}
    "data_attested",     # existing — verified at Phase 2
    "extra",             # existing — extra_fd_edges list
    "mapping_derived",   # NEW
]

A mapping-derived edge carries no derived_by (no catalog operator) but a reference to the mapping. Storage choice for v0.1: keep the FDEdge model lean (no payload), and look up the mapping on demand via ac.derived_dimension(tail_name).

Reachability behaviour is unchanged — derived dims become FD-reachable from their source, so all existing schema-selection / cross-schema-coherence logic flows through transparently.
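To illustrate why reachability needs no special casing, here is a small sketch that treats FD edges as `(source, derived, edge_source)` tuples and runs an ordinary graph traversal. The tuple layout and the `reachable` helper are assumptions made for illustration; the real FDEdge model may differ.

```python
def reachable(edges, start):
    """Return every dimension FD-reachable from `start` (excluding `start`)."""
    adjacency = {}
    for src, dst, _edge_source in edges:
        adjacency.setdefault(src, []).append(dst)

    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in adjacency.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


edges = [
    ("store", "city", "hierarchy"),
    ("city", "region", "hierarchy"),
    # The new edge type is traversed exactly like the existing ones.
    ("region", "coarse_region", "mapping_derived"),
]
print(reachable(edges, "store"))
```

The traversal never inspects the edge source tag, which is the sense in which the derived dim is an ordinary node.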

6. Execution: engine-side mapping + re-aggregate

When the engine sees a request for metric@(<derived_dim>, ...):

serve(METRIC, "revenue", ("coarse_region",))
│
├── Branch 0: not derived (skip)
├── Branch 1: exact match on (coarse_region,)? Usually no.
├── Branch 2a (existing): subset rollup from a finer cached entry.
│   E.g., revenue@(region, day) → revenue@(coarse_region,) would need
│   substitution; not handled by current subset-only Branch 2.
│
├── Branch 2b (NEW for derived dims): FD-edge rollup via mapping.
│   - Detect: the requested anchor contains a derived dim
│   - Substitute: serve at the source-dim-substituted anchor instead
│     (recursive: serve(METRIC, "revenue", ("region",)))
│   - Apply mapping in Polars:
│     with_columns(col("region").replace_strict(mapping).alias("coarse_region"))
│   - Re-aggregate: group_by("coarse_region").agg(sum("revenue"))
│   - Return as LazyFrame with [coarse_region, revenue] columns
│
└── Branch 3: backend fallback (only if Branch 2b's recursion couldn't
              serve the source anchor either).

The re-aggregation step uses the metric family's first ip_reducer (typically SUM) — same partition-invariance rule as the existing Branch 2.

Crucially: the engine never materialises an entry at the derived-dim grain. The substrate stays primitive (entries at source-dim grains only); the derived view is computed on demand by mapping + re-aggregation. Same principle as derived metrics (formula not data).
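The map-then-re-aggregate step of Branch 2b can be sketched in plain Python. This is a stand-in for the Polars pipeline, not the engine's code: `rollup_via_mapping` is a hypothetical helper, rows are plain dicts, and the `priority` / `tickets` names are invented for the example (the mapping itself is the priority_tier one from §1).

```python
def rollup_via_mapping(rows, source_dim, derived_dim, mapping, metric, reducer=sum):
    """Map source-dim values to derived values, then re-aggregate the metric.

    `rows` are entries at the source-dim grain; the result is at the
    derived-dim grain. A missing key raises KeyError (strict behaviour).
    """
    groups = {}
    for row in rows:
        derived_value = mapping[row[source_dim]]
        groups.setdefault(derived_value, []).append(row[metric])
    return [{derived_dim: key, metric: reducer(values)} for key, values in groups.items()]


priority_tier = {
    "p1": "critical", "p2": "critical",
    "p3": "standard", "p4": "standard",
    "p5": "low",
}
# Pretend this is the cached entry at the source grain: tickets@(priority,).
cached = [
    {"priority": "p1", "tickets": 3},
    {"priority": "p2", "tickets": 5},
    {"priority": "p3", "tickets": 10},
    {"priority": "p4", "tickets": 2},
    {"priority": "p5", "tickets": 1},
]
result = rollup_via_mapping(cached, "priority", "priority_tier", priority_tier, "tickets")
print(result)
```

Because several source values collapse into one derived value, the reducer is what makes the result correct; nothing is stored at the derived grain.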

Multi-derived-dim anchors

If the requested anchor has multiple derived dims (e.g., (coarse_region, coarse_quarter)), the engine substitutes each independently → serves at (region, quarter) → applies each mapping → re-aggregates on the derived-dim columns. Conceptually clean; implementation iterates over the anchor.
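A sketch of the multi-derived-dim case under simplifying assumptions: rows are plain dicts, the reducer is hard-coded to SUM, and `serve_multi_derived` plus the `coarse_quarter` mapping are hypothetical illustrations rather than anything declared in the design.

```python
def serve_multi_derived(rows, anchor, derived_specs, metric):
    """Substitute each derived dim, apply each mapping, and re-aggregate (SUM).

    `derived_specs` maps derived dim name -> (source dim, mapping).
    Returns the substituted source anchor and the totals keyed by derived values.
    """
    source_anchor = tuple(
        derived_specs[dim][0] if dim in derived_specs else dim for dim in anchor
    )
    totals = {}
    for row in rows:
        key = tuple(
            derived_specs[dim][1][row[src]] if dim in derived_specs else row[src]
            for dim, src in zip(anchor, source_anchor)
        )
        totals[key] = totals.get(key, 0) + row[metric]
    return source_anchor, totals


specs = {
    "coarse_region": ("region", {"East": "Atlantic", "West": "Pacific", "Central": "Plains"}),
    # Hypothetical second derived dim, for illustration only.
    "coarse_quarter": ("quarter", {"Q1": "H1", "Q2": "H1", "Q3": "H2", "Q4": "H2"}),
}
rows = [
    {"region": "East", "quarter": "Q1", "revenue": 10},
    {"region": "East", "quarter": "Q2", "revenue": 5},
    {"region": "West", "quarter": "Q3", "revenue": 7},
]
source_anchor, totals = serve_multi_derived(
    rows, ("coarse_region", "coarse_quarter"), specs, "revenue"
)
print(source_anchor)
print(totals)
```

Primitive dims in the anchor pass through untouched, so the same loop handles mixed anchors.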

7. The planner's narrow involvement

Unlike derived metrics (where the planner stays fully unaware), derived dims require a small planner touch:

Rule 1 (family resolution): unchanged.
Rule 2 (anchor-set capability): the derived dim is just a member of its family; block-set checks treat it like any other dim.
Rule 3 (schema selection): the derived dim isn't a physical column, so the planner's "find sibling at this grain" search must skip it. Instead, the planner substitutes derived dims with their source dims when building the candidate schema search.
Rule 4 (cross-schema coherence): unchanged.

In _resolve_metric / apply_rule_3: before searching for a schema that hosts a sibling at target_grain, substitute every derived dim in target_grain with its source dim. Pass the substituted grain to the schema search. The schema selection finds a schema that has the source dim physically. The substitution metadata is stashed on the ResolvedMetric for the executor to apply.

The execution path is then:

1. Backend computes aggregates at the source-dim grain.
2. Engine maps source → derived values and re-aggregates.
3. Returned Frame's columns carry the derived dim's name (not the source).
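The Rule 3 substitution can be sketched as a small pure function. `substitute_derived` is a hypothetical name, and collapsing duplicates (when the source dim is already in the grain) is an assumption of this sketch rather than something the design specifies.

```python
def substitute_derived(target_grain, derived_sources):
    """Replace derived dims with their source dims for the schema search.

    Returns the substituted grain plus the substitution metadata that the
    executor would later use to apply the mappings.
    """
    substituted = []
    substitutions = {}
    for dim in target_grain:
        source = derived_sources.get(dim, dim)
        if source != dim:
            substitutions[dim] = source
        # Assumption: a source dim already present in the grain is not repeated.
        if source not in substituted:
            substituted.append(source)
    return tuple(substituted), substitutions


derived_sources = {"coarse_region": "region"}
grain, meta = substitute_derived(("coarse_region", "day"), derived_sources)
print(grain, meta)
```

The returned metadata plays the role of what the design stashes on the ResolvedMetric.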

8. Trade-offs accepted

8.1 No SQL pushdown of mappings in v0.1

The backend always groups by the source dim; the engine handles the mapping post-read. Cost: one extra Polars pass after each metric serve. For typical AC sizes (a few dozen distinct values in a mapping) this should be negligible, on the order of sub-millisecond. Pro could add SQL pushdown by extending AggregateRequest.group_by to allow CASE expressions; that is a forward-compatible extension and not needed for v0.1.
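For orientation only, a possible pushdown could render the mapping as a SQL CASE expression along these lines. This is not part of v0.1; `mapping_to_case` is a hypothetical helper, and the quoting shown (doubled single quotes, double-quoted identifiers) is generic SQL that a real dialect layer would have to own.

```python
def mapping_to_case(source_column, mapping, default=None):
    """Render a mapping as a simple SQL CASE expression (illustrative only)."""

    def lit(value):
        # Standard SQL string literal: escape embedded single quotes by doubling.
        return "'" + value.replace("'", "''") + "'"

    ident = '"' + source_column.replace('"', '""') + '"'
    branches = " ".join(f"WHEN {lit(k)} THEN {lit(v)}" for k, v in mapping.items())
    else_part = f" ELSE {lit(default)}" if default is not None else ""
    return f"CASE {ident} {branches}{else_part} END"


sql = mapping_to_case("region", {"East": "Atlantic", "West": "Pacific"}, default="Other")
print(sql)
```

Without a default the expression yields NULL for unmapped values, so strict mode would still need a separate check on the engine side.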

8.2 v0.1 doesn't support nested derivation

A mapping-derived dim cannot itself be the source of another mapping-derived dim (coarse_region → continent rejected). This keeps the FD-DAG cycle check simple. v0.2 could relax this.

8.3 Strict vs default mode for unmapped values

When the backend returns a source value not in the mapping:

- Strict mode (default: null): the engine raises UnmappedDimensionValueError with the offending value(s).
- Default mode (default: "Other"): unmapped values get the default; the result is grouped correctly.

Strict is the v0.1 default; it forces the AC author to be explicit about coverage.
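The two modes can be sketched as one function. The exception name comes from this section, but its definition here and the `apply_mapping` helper are illustrative assumptions, not the engine's implementation.

```python
class UnmappedDimensionValueError(ValueError):
    """Stand-in for the engine's error type; defined here for the sketch."""


def apply_mapping(values, mapping, default=None):
    """Map source values; strict when `default` is None, lenient otherwise."""
    unmapped = sorted({v for v in values if v not in mapping})
    if unmapped and default is None:
        # Strict mode: report every offending value at once.
        raise UnmappedDimensionValueError(f"unmapped source values: {unmapped}")
    return [mapping.get(v, default) for v in values]


coarse_region = {"East": "Atlantic", "West": "Pacific", "Central": "Plains"}

lenient = apply_mapping(["East", "North"], coarse_region, default="Other")
print(lenient)

try:
    apply_mapping(["East", "North"], coarse_region)
except UnmappedDimensionValueError as exc:
    print(exc)
```

Collecting all unmapped values before raising lets the AC author fix the whole coverage gap in one pass.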

8.4 Re-aggregation requires a partition-invariant ip_reducer

If a metric is anchor-locked (no ip_reducer) at the source-dim grain, querying it at a derived-dim grain is refused (the mapping might collapse multiple source values into one derived value; we'd need to re-aggregate, but there's no reducer). Same rule as the existing Branch 2 rollup. Failure is a clear AnchorLockedError.
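A minimal sketch of the refusal, assuming a metric exposes its reducers as a list of names. `reducer_for_rollup`, the `REDUCERS` table, and the exception definition are hypothetical; only the error name and the "first ip_reducer" rule come from this document.

```python
class AnchorLockedError(Exception):
    """Stand-in for the engine's refusal error; defined here for the sketch."""


# Illustrative reducer table; the real catalog is not shown in this document.
REDUCERS = {"SUM": sum, "MIN": min, "MAX": max}


def reducer_for_rollup(metric_name, ip_reducers):
    """Pick the first ip_reducer, or refuse if the metric is anchor-locked."""
    if not ip_reducers:
        raise AnchorLockedError(
            f"{metric_name} is anchor-locked: no ip_reducer to re-aggregate with"
        )
    return REDUCERS[ip_reducers[0]]


total = reducer_for_rollup("revenue", ["SUM"])([100, 50, 25])
print(total)
```

Checking for a reducer before any mapping work keeps the failure early and unambiguous.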

9. Non-goals

| Non-goal | Note |
| --- | --- |
| Expression-derived dims over metrics (`revenue_tier`) | Shape C; Pro-tier; breaks the trichotomy |
| Catalog-function dims beyond mapping (`QUARTER_OF`) | Already supported via Manual §3.4.6 hierarchy syntax + Chapter 10 catalog |
| Nested mappings (mapping-derived from mapping-derived) | v0.2 affordance |
| Mappings with non-string source/target types | v0.2 affordance |
| User-supplied mapping override at query time | Would defeat the AC-as-source-of-truth posture |

10. Implementation plan

Three slices, in dependency order:

Slice 1: AC schema + validation + FD-DAG integration

Extend derived.py with DerivedDimensionSpec Pydantic model + a sentinel MAPPING_DERIVED_EDGE source.
Extend DimensionFamily with optional derived_dimensions: tuple[DerivedDimensionSpec, ...].
After-construction validator: appends derived names to members if not already there.
Extend FDEdgeSource literal with "mapping_derived".
Extend _edges_from_dimension_family to emit the new edges.
AC cross-ref: DD-100..DD-104 validation rules.
AC accessors: is_derived_dimension, derived_dimension(name).
Tests: declaration, FD-DAG reachability, validation rules.

Slice 2: Engine execution

New _serve_via_derived_dim_rollup helper in engine.py (Branch 2b).
Detect derived dims in serve()'s requested anchor; substitute → recursive serve → mapping + re-aggregate.
Polars implementation: with_columns(col(src).replace_strict(mapping).alias(derived)) + group_by + agg.
Strict / default mode behaviour.
Tests: warm components, cold components, multi-derived-dim anchors, anchor-locked refusal.

Slice 3: Planner + retail demo + end-to-end

Planner: substitute derived dims with source dims in schema search (apply_rule_3 + _resolve_metric).
Stash substitution metadata on ResolvedMetric.
Executor: pass substituted grain to backend; apply derived-dim transform after engine returns.
Retail demo: declare coarse_region (East: Atlantic, West: Pacific, Central: Plains).
Walkthrough Step 7c: SELECT coarse_region, SUM(revenue) AT coarse_region.
End-to-end test through Frame-QL.

Estimated effort: ~250–400 LOC + tests across the three slices.

11. Summary

AC-level derived dimensions are added as mappings declared on the AC, executed by the engine, with narrow planner support. The key commitments:

Mapping, not data — derived dims never get a backend column or a cached entry at the derived grain. The mapping is applied on demand post-read.
FD-DAG integration is structural — derived dims appear as ordinary nodes reachable from their source via a new mapping_derived edge type. Reachability + schema selection flow through transparently.
Planner is mostly unaware — Rules 1, 2, 4 unchanged. Rule 3 gets one targeted substitution (derived → source) for schema search.
Engine owns the execution — new Branch 2b: substitute, recursive-serve at source grain, apply mapping + re-aggregate.
Backend never sees them — only primitive dim columns go over the protocol. The mapping is engine-internal.

The design follows the Reading B principle adapted for the dimension/metric asymmetry (dims must be applied pre-aggregation, so the planner can't be quite as blind). The cleanest principle-preserving extension: declared mappings, engine-side execution, no SQL pushdown in v0.1.

When implementation lands (per §10), the canonical retail AC will declare coarse_region directly, and Frame-QL authors will write SELECT coarse_region, SUM(revenue) AT coarse_region and have it just work — with served_from: engine_cache whenever revenue@(region,) is warmed.