Provenance as a free category on a directed acyclic graph.
Every resource in the platform : datasets, transforms, entity types, link types, actions, applications : depends on other resources. These dependencies form a directed acyclic graph. The lineage category is the free category on this graph; its morphisms are dependency paths, and its structure gives provenance, impact analysis, and build ordering. The key word is free: the universal property of the free construction gives all of this from the graph alone, with no additional choices.
A quiver Q = (V, E, s, t): a set of vertices, a set of edges, and source/target functions s, t : E → V.
The free category Path(Q) has vertices as objects and directed paths as morphisms. A morphism from v to w is a sequence (e₁, …, ek) of edges with matching endpoints: s(e₁) = v, t(ek) = w, and t(ei) = s(ei+1). Composition is concatenation. Identity is the empty path.
The free category construction is left adjoint to the forgetful functor U : Cat → Graph. For any category C and graph morphism F : Q → U(C), there exists a unique functor F̃ : Path(Q) → C extending F.
Proof sketch. Define F̃ on a path (e₁, …, ek) as F(ek) ∘ … ∘ F(e₁). Well-defined by associativity; unique because paths are generated by edges. ∎
The practical content: to define a functor out of the free category, you only need to define it on vertices and edges : everything else is forced. Assign meaning to each resource and each direct dependency, and all multi-hop dependency chains get their meaning for free.
G = (V, E, s, t) where V = all platform resources (each with a unique RID) and E = dependency relations (r₁ → r₂ means "r₂ depends on r₁").
G is a DAG. Circular dependencies are structurally prevented at definition time : an invariant of the platform, not a runtime check.
Lin = Path(G). Objects are resources. Morphisms are dependency chains. Composition is concatenation.
Lineage : Sys → Lin is forgetful: it strips data content and retains only the dependency wiring. Pipeline chaining maps to path concatenation; identities map to empty paths.
If G is a DAG, then HomLin(r, r) = {idr} for all r. The only endomorphism of any resource is the identity.
Proof. A non-identity path from r to r would be a directed cycle in G, contradicting acyclicity. ∎
This is a structural property, not a runtime check. In most systems you verify acyclicity by running cycle detection (O(|V| + |E|)) every time an edge is added, but the free category on a DAG simply does not contain non-identity endomorphisms : you cannot even write down a circular dependency in the language of Lin.
Down(r) = {r′ ∈ V | Hom(r, r′) ≠ ∅} : everything reachable from r. A change to r potentially affects all of Down(r).
Up(r) = {r′ ∈ V | Hom(r′, r) ≠ ∅} : everything r depends on. A failure anywhere in Up(r) is a potential root cause of issues in r.
Both are computable in O(|V| + |E|) by BFS or DFS on G. The DAG structure guarantees termination.
A total order ≤ on V such that r₁ → r₂ in E implies r₁ ≤ r₂ : dependencies before dependents.
Every DAG admits at least one topological ordering. The sort is functorial: it extends to a functor Path(G) → (V, ≤). Processing resources in this order guarantees every resource is built after all its dependencies. Pipeline scheduling is a topological sort of G.
The resource-level graph G is often too coarse. When you need to know which output columns depend on which input columns, you need a refinement.
Gcol has vertices (D, c) : resource-column pairs : and edges (D₁, c₁) → (D₂, c₂) whenever column c₂ of D₂ is computed from column c₁ of D₁. It is also a DAG, and Path(Gcol) gives column-level provenance.
A forgetful functor Path(Gcol) → Path(G) projects column-level lineage down to resource-level. You get fine-grained analysis when you need it, and the simpler resource-level view as the default.
Lin = Path(G) is the free category on the dependency DAG: acyclicity is structural, impact and root cause reduce to graph reachability, and topological sort gives build ordering. Column-level lineage refines to Path(Gcol) and forgets back to resource-level : everything follows from the universal property of the free construction.
Return to The Framework.