← The Framework

Component 6: The Lineage Category

Provenance as a free category on a directed acyclic graph.

Every resource in the platform : datasets, transforms, entity types, link types, actions, applications : depends on other resources. These dependencies form a directed acyclic graph. The lineage category is the free category on this graph; its morphisms are dependency paths, and its structure gives provenance, impact analysis, and build ordering. The key word is free: the universal property of the free construction gives all of this from the graph alone, with no additional choices.

1. Free Categories on Graphs

Definition 1.1 : Quiver

A quiver Q = (V, E, s, t): a set of vertices, a set of edges, and source/target functions s, t : EV.

Definition 1.2 : Free Category

The free category Path(Q) has vertices as objects and directed paths as morphisms. A morphism from v to w is a sequence (e₁, …, ek) of edges with matching endpoints: s(e₁) = v, t(ek) = w, and t(ei) = s(ei+1). Composition is concatenation. Identity is the empty path.

2. Universal Property

Theorem 2.1 : Free ⊣ U

The free category construction is left adjoint to the forgetful functor U : CatGraph. For any category C and graph morphism F : QU(C), there exists a unique functor : Path(Q) → C extending F.

Proof sketch. Define on a path (e₁, …, ek) as F(ek) ∘ … ∘ F(e₁). Well-defined by associativity; unique because paths are generated by edges. ∎

The practical content: to define a functor out of the free category, you only need to define it on vertices and edges : everything else is forced. Assign meaning to each resource and each direct dependency, and all multi-hop dependency chains get their meaning for free.

3. The Lineage Graph

Definition 3.1 : Lineage Graph

G = (V, E, s, t) where V = all platform resources (each with a unique RID) and E = dependency relations (r₁ → r₂ means "r₂ depends on r₁").

Example 3.2 : Edges
Proposition 3.3

G is a DAG. Circular dependencies are structurally prevented at definition time : an invariant of the platform, not a runtime check.

4. The Lineage Category and Functor

Definition 4.1 : Lin

Lin = Path(G). Objects are resources. Morphisms are dependency chains. Composition is concatenation.

Definition 4.2 : Lineage Functor

Lineage : SysLin is forgetful: it strips data content and retains only the dependency wiring. Pipeline chaining maps to path concatenation; identities map to empty paths.

5. Acyclicity

Theorem 5.1 : Trivial Endomorphisms

If G is a DAG, then HomLin(r, r) = {idr} for all r. The only endomorphism of any resource is the identity.

Proof. A non-identity path from r to r would be a directed cycle in G, contradicting acyclicity. ∎

This is a structural property, not a runtime check. In most systems you verify acyclicity by running cycle detection (O(|V| + |E|)) every time an edge is added, but the free category on a DAG simply does not contain non-identity endomorphisms : you cannot even write down a circular dependency in the language of Lin.

6. Impact Analysis and Root Cause

Definition 6.1 : Downstream / Upstream

Down(r) = {r′ ∈ V | Hom(r, r′) ≠ ∅} : everything reachable from r. A change to r potentially affects all of Down(r).

Up(r) = {r′ ∈ V | Hom(r′, r) ≠ ∅} : everything r depends on. A failure anywhere in Up(r) is a potential root cause of issues in r.

Proposition 6.2 : Computability

Both are computable in O(|V| + |E|) by BFS or DFS on G. The DAG structure guarantees termination.

7. Topological Sort and Build Order

Definition 7.1 : Topological Ordering

A total order ≤ on V such that r₁ → r₂ in E implies r₁ ≤ r₂ : dependencies before dependents.

Proposition 7.2

Every DAG admits at least one topological ordering. The sort is functorial: it extends to a functor Path(G) → (V, ≤). Processing resources in this order guarantees every resource is built after all its dependencies. Pipeline scheduling is a topological sort of G.

8. Column-Level Lineage

The resource-level graph G is often too coarse. When you need to know which output columns depend on which input columns, you need a refinement.

Definition 8.1 : Column-Level Lineage Graph

Gcol has vertices (D, c) : resource-column pairs : and edges (D₁, c₁) → (D₂, c₂) whenever column c₂ of D₂ is computed from column c₁ of D₁. It is also a DAG, and Path(Gcol) gives column-level provenance.

Remark 8.2

A forgetful functor Path(Gcol) → Path(G) projects column-level lineage down to resource-level. You get fine-grained analysis when you need it, and the simpler resource-level view as the default.


Summary

Lin = Path(G) is the free category on the dependency DAG: acyclicity is structural, impact and root cause reduce to graph reachability, and topological sort gives build ordering. Column-level lineage refines to Path(Gcol) and forgets back to resource-level : everything follows from the universal property of the free construction.

Return to The Framework.