Data Lineage in RAG with PROV‑O
A Retrieval‑Augmented Generation (RAG) system is only as trustworthy as the evidence it can show about where its content came from.
PROV‑O, the W3C Provenance Ontology, lets you describe who or what produced a piece of data, when, where, and from which inputs, and ties those facts together across every step of the pipeline.
Below we focus on building and using a PROV‑O‑based lineage graph that follows a document from ingestion to answer generation.
1. What is Data Lineage?
Data lineage is the audit trail that tracks a piece of data through its life cycle:
- Ingestion – the raw source (e.g., a PDF, an API endpoint, a database dump).
- Pre‑processing – tokenisation, chunking, embedding.
- Storage – index insertion, metadata enrichment.
- Retrieval – query matching, passage selection.
- Generation – model inference and response assembly.
PROV‑O provides a lightweight RDF vocabulary for recording each step as an Activity and connecting it to the Agents (the software or people that performed it) and the Entities (the data items it used or generated).
---
config:
  theme: 'base'
  themeVariables:
    primaryColor: '#dab785'
    primaryTextColor: '#031d44'
    primaryBorderColor: '#04395e'
    lineColor: '#d5896f'
    secondaryColor: '#70a288'
    tertiaryColor: '#fff'
    background: '#fff'
  class:
    hideEmptyMembersBox: true
---
classDiagram
    class ProvEntity {
        +URI id
        +string label
        +dateTime createdAt
        +dateTime updatedAt
    }
    class ProvActivity {
        +URI id
        +dateTime startedAt
        +dateTime endedAt
        +ProvAgent performedBy
    }
    class ProvAgent {
        +URI id
        +string name
    }
    class DownloadActivity {
        +string sourceURL
    }
    class ChunkingActivity {
        +int chunkSize
    }
    class EmbeddingActivity {
        +int dimension
    }
    class RetrievalActivity {
        +int topK
    }
    class GenerationActivity {
        +string model
    }
    ProvEntity <|-- Document
    ProvEntity <|-- Passage
    ProvEntity <|-- EmbeddingVector
    ProvEntity <|-- IndexRecord
    ProvEntity <|-- GeneratedAnswer
    ProvActivity <|-- DownloadActivity
    ProvActivity <|-- ChunkingActivity
    ProvActivity <|-- EmbeddingActivity
    ProvActivity <|-- RetrievalActivity
    ProvActivity <|-- GenerationActivity
    ProvActivity o-- ProvEntity : uses
    ProvActivity o-- ProvEntity : generates
    ProvActivity o-- ProvAgent : performedBy
    ProvEntity --> ProvActivity : wasGeneratedBy
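The class diagram above could be mirrored in application code roughly as follows. This is a minimal sketch using Python dataclasses, not a prescribed implementation: the field names, the `generate` helper, and the default `chunk_size` of 512 are illustrative assumptions.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass(frozen=True)
class ProvAgent:
    id: str    # IRI of the software or person
    name: str


@dataclass
class ProvEntity:
    id: str    # IRI of the data item
    label: str
    # Back-reference for prov:wasGeneratedBy; excluded from repr/eq to avoid cycles.
    generated_by: Optional["ProvActivity"] = field(default=None, repr=False, compare=False)


@dataclass
class ProvActivity:
    id: str
    performed_by: ProvAgent                                     # prov:wasAssociatedWith
    started_at: datetime
    ended_at: Optional[datetime] = None
    used: List[ProvEntity] = field(default_factory=list)        # prov:used
    generated: List[ProvEntity] = field(default_factory=list)   # prov:generated

    def generate(self, entity: ProvEntity) -> ProvEntity:
        """Link an output entity to this activity in both directions."""
        entity.generated_by = self
        self.generated.append(entity)
        return entity


@dataclass
class ChunkingActivity(ProvActivity):
    chunk_size: int = 512  # illustrative default, not taken from the text


# Usage: one chunking run that reads a document and emits a passage.
agent = ProvAgent("https://example.org/chunkingScript", "chunking script")
doc = ProvEntity("https://example.org/doc123", "Annual Report 2023")
chunking = ChunkingActivity(
    id="https://example.org/chunking_002",
    performed_by=agent,
    started_at=datetime.now(timezone.utc),
    used=[doc],
)
passage = chunking.generate(ProvEntity("https://example.org/passage_456", "Passage 456"))
```

Keeping both directions of the link (`generated` on the activity, `generated_by` on the entity) matches the way PROV‑O offers `prov:generated` and `prov:wasGeneratedBy` as inverse properties.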
2. PROV‑O Elements in a RAG Pipeline
| Element | PROV‑O Class | Example |
|---|---|---|
| Source document | prov:Entity | ex:doc123 |
| Chunked passage | prov:Entity | ex:passage_456 |
| Embedding vector | prov:Entity | ex:embedding_456 |
| Index record | prov:Entity | ex:indexRecord_456 |
| Retrieval (query execution) | prov:Activity | ex:retrieval_789 |
| Generated answer | prov:Entity | ex:answer_1011 |
| RAG engine | prov:Agent | ex:ragEngine |
| User (requester) | prov:Agent | ex:researcher |
Core PROV‑O Triples
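@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:   <https://example.org/> .

# PROV-O defines no label property, so the human-readable name uses rdfs:label.
ex:doc123 a prov:Entity ;
    rdfs:label "Annual Report 2023" ;
    prov:wasGeneratedBy ex:download_001 .

ex:download_001 a prov:Activity ;
    prov:startedAtTime "2025-08-17T08:00:00Z"^^xsd:dateTime ;
    prov:endedAtTime "2025-08-17T08:00:05Z"^^xsd:dateTime ;
    prov:wasAssociatedWith ex:downloaderScript .

ex:passage_456 a prov:Entity ;
    prov:wasGeneratedBy ex:chunking_002 ;
    prov:hadPrimarySource ex:doc123 .

ex:chunking_002 a prov:Activity ;
    prov:wasAssociatedWith ex:chunkingScript ;
    prov:used ex:doc123 ;
    prov:generated ex:passage_456 .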
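Triples like the chunking statements above could be produced programmatically at the moment a step runs. The following is a hedged sketch in plain Python that emits N‑Triples lines. The helper names `chunking_triples` and `to_ntriples` are hypothetical, and the `prov:wasDerivedFrom` link from passage to document is an assumption added for easier querying. A real pipeline would more likely use an RDF library.

```python
from typing import List, Tuple

PROV = "http://www.w3.org/ns/prov#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
EX = "https://example.org/"  # placeholder namespace used throughout this article

Triple = Tuple[str, str, str]


def chunking_triples(doc_id: str, passage_ids: List[str],
                     activity_id: str, agent_id: str) -> List[Triple]:
    """Return (subject, predicate, object) IRI triples for one chunking run."""
    activity = EX + activity_id
    doc = EX + doc_id
    triples = [
        (activity, RDF_TYPE, PROV + "Activity"),
        (activity, PROV + "wasAssociatedWith", EX + agent_id),
        (activity, PROV + "used", doc),
    ]
    for pid in passage_ids:
        passage = EX + pid
        triples += [
            (passage, RDF_TYPE, PROV + "Entity"),
            (passage, PROV + "wasGeneratedBy", activity),
            (passage, PROV + "wasDerivedFrom", doc),  # assumption: explicit derivation link
        ]
    return triples


def to_ntriples(triples: List[Triple]) -> str:
    """Serialize IRI-only triples as N-Triples lines (no literals handled)."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)


triples = chunking_triples("doc123", ["passage_456"], "chunking_002", "chunkingScript")
print(to_ntriples(triples))
```

N‑Triples is used here only because it needs no prefix handling. The same data can be loaded into any triplestore that accepts the format.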
3. Building the Lineage Graph
- Annotate every data item as it is created.
- Persist the triples in a triplestore (e.g., Blazegraph, Stardog) or embed them in your vector index as side‑car metadata.
- Query the graph to trace back any answer or passage to its root source.
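One way to make the first step routine is to wrap each pipeline stage so that provenance is captured as a side effect. The sketch below is an assumption‑laden illustration. `LineageRecorder` is a hypothetical in‑memory stand‑in for a triplestore client, timestamps are stored as plain ISO strings rather than typed literals, and every output is assumed to derive from every input of the step.

```python
from contextlib import contextmanager
from datetime import datetime, timezone

PROV = "http://www.w3.org/ns/prov#"


def _now() -> str:
    return datetime.now(timezone.utc).isoformat()


class LineageRecorder:
    """Collects provenance triples in memory (hypothetical triplestore stand-in)."""

    def __init__(self):
        self.triples = set()

    def add(self, s, p, o):
        self.triples.add((s, p, o))

    @contextmanager
    def activity(self, activity_id, agent_id, used=()):
        """Record one pipeline step; outputs are registered via the yielded callback."""
        self.add(activity_id, PROV + "wasAssociatedWith", agent_id)
        self.add(activity_id, PROV + "startedAtTime", _now())
        for entity_id in used:
            self.add(activity_id, PROV + "used", entity_id)

        def generated(entity_id):
            self.add(entity_id, PROV + "wasGeneratedBy", activity_id)
            for source_id in used:  # assumption: each output derives from each input
                self.add(entity_id, PROV + "wasDerivedFrom", source_id)
            return entity_id

        try:
            yield generated
        finally:
            # The end time is recorded even if the step raises.
            self.add(activity_id, PROV + "endedAtTime", _now())

    def match(self, s=None, p=None, o=None):
        """Return triples matching the pattern; None acts as a wildcard."""
        return [t for t in self.triples
                if (s is None or t[0] == s)
                and (p is None or t[1] == p)
                and (o is None or t[2] == o)]


rec = LineageRecorder()
with rec.activity("ex:chunking_002", "ex:chunkingScript", used=["ex:doc123"]) as generated:
    passage_id = generated("ex:passage_456")  # the real chunking work would happen here
```

Persisting then amounts to flushing `rec.triples` to the chosen store, and `match` gives a rough feel for the pattern queries a SPARQL endpoint would answer.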
Example: Tracing an Answer Back to the Original PDF
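PREFIX prov: <http://www.w3.org/ns/prov#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX ex: <https://example.org/>

SELECT ?sourceLabel ?sourceURL
WHERE {
  # Answer -> generation activity -> index record it used
  ex:answer_1011 prov:wasGeneratedBy ?generation .
  ?generation prov:used ?indexRecord .
  # Index record -> passage -> source document
  ?indexRecord prov:wasDerivedFrom ?passage .
  ?passage (prov:wasDerivedFrom|prov:hadPrimarySource) ?source .
  ?source a prov:Entity ;
          rdfs:label ?sourceLabel ;
          prov:wasGeneratedBy ?download .
  # Assumes the download activity recorded the source URL via prov:used
  ?download prov:used ?sourceURL .
}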
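The result lists the label of the original PDF and its download URL, documenting the chain of custody from answer to source, provided that every intermediate link was recorded during ingestion and generation.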
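The same backward walk can be sketched without a SPARQL endpoint, which is handy for unit tests. This is a minimal illustration over an in‑memory list of triples. The identifiers `ex:generation_003` and `ex:sourceLocation` are hypothetical placeholders, and the set of "upstream" predicates is an assumption about which links a pipeline records.

```python
from collections import deque

# Predicates that point "upstream" from a node toward its origins.
UPSTREAM = {
    "prov:wasGeneratedBy",
    "prov:used",
    "prov:wasDerivedFrom",
    "prov:hadPrimarySource",
}


def trace_sources(triples, start):
    """Breadth-first walk upstream from `start`; return nodes with no upstream edge."""
    upstream = {}
    for s, p, o in triples:
        if p in UPSTREAM:
            upstream.setdefault(s, set()).add(o)

    seen, roots, queue = {start}, set(), deque([start])
    while queue:
        node = queue.popleft()
        parents = upstream.get(node, set())
        if not parents and node != start:
            roots.add(node)  # nothing further upstream: a root source
        for parent in parents - seen:
            seen.add(parent)
            queue.append(parent)
    return roots


graph = [
    ("ex:answer_1011", "prov:wasGeneratedBy", "ex:generation_003"),
    ("ex:generation_003", "prov:used", "ex:indexRecord_456"),
    ("ex:indexRecord_456", "prov:wasDerivedFrom", "ex:passage_456"),
    ("ex:passage_456", "prov:wasDerivedFrom", "ex:doc123"),
    ("ex:doc123", "prov:wasGeneratedBy", "ex:download_001"),
    ("ex:download_001", "prov:used", "ex:sourceLocation"),
]

print(trace_sources(graph, "ex:answer_1011"))
```

Unlike the fixed‑length SPARQL pattern, the traversal does not assume a particular number of hops, so it also copes with pipelines that add or skip intermediate steps.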
4. Illustrative Flow Diagram (Mermaid)
Below is a simple diagram that visualizes how PROV‑O links the steps in a RAG pipeline.
Copy the diagram into a Markdown preview that supports Mermaid to view it.
---
config:
  theme: 'base'
  themeVariables:
    primaryColor: '#dab785'
    primaryTextColor: '#031d44'
    primaryBorderColor: '#04395e'
    lineColor: '#d5896f'
    secondaryColor: '#70a288'
    tertiaryColor: '#fff'
    background: '#fff'
---
flowchart LR
    A[User Query] --> B{RAG Engine}
    B --> C[Retrieve Passages]
    C --> D[Generated Answer]
    D --> E[Return to User]
    subgraph Ingestion
        F[Source Doc] --> G[Passages] --> H[Embeddings] --> I[Index Records]
    end
    subgraph Provenance
        F -- "prov:wasGeneratedBy" --> J[Download Activity]
        G -- "prov:wasGeneratedBy" --> K[Chunking Activity]
        H -- "prov:wasGeneratedBy" --> L[Embedding Activity]
        I -- "prov:wasGeneratedBy" --> M[Index Activity]
        D -- "prov:wasGeneratedBy" --> N[Generation Activity]
        N -- "prov:used" --> I
    end
5. Why Focus on Lineage?
| Benefit | PROV‑O in RAG |
|---|---|
| Auditability | Every answer can be traced to its origin. |
| Compliance | License and attribution can be verified automatically. |
| Debugging | Faulty passages are easier to isolate and fix. |
| Explainability | Users see the evidence chain behind a response. |
6. Quick Implementation Checklist
| Step | What to Do | Tooling |
|---|---|---|
| 1 | Capture provenance on ingestion | Custom script, pyprov |
| 2 | Store triples in a triplestore | Blazegraph, Apache Jena |
| 3 | Attach provenance IRIs to vector records as side‑car metadata | Elasticsearch, Pinecone |
| 4 | Query lineage in response | SPARQL endpoint, GraphQL wrapper |
| 5 | Visualise lineage | Mermaid, Graphviz |
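The side‑car idea from step 3 can be illustrated with a toy index in which each vector record carries the IRI of its `prov:Entity`. This is a sketch under stated assumptions: `IndexRecord`, `retrieve`, and `generation_triples` are hypothetical names, the vectors and texts are toy values, and the dot‑product ranking stands in for whatever similarity search the vector store provides.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class IndexRecord:
    vector: Sequence[float]
    text: str
    prov_id: str  # IRI of the matching prov:Entity in the triplestore


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def retrieve(index: List[IndexRecord], query_vec: Sequence[float],
             top_k: int = 2) -> List[IndexRecord]:
    """Rank records by dot product with the query vector (toy scoring)."""
    return sorted(index, key=lambda r: dot(r.vector, query_vec), reverse=True)[:top_k]


def generation_triples(answer_id: str, activity_id: str,
                       hits: List[IndexRecord]) -> List[Tuple[str, str, str]]:
    """Link a generated answer to the index records retrieved for it."""
    triples = [(answer_id, "prov:wasGeneratedBy", activity_id)]
    triples += [(activity_id, "prov:used", hit.prov_id) for hit in hits]
    return triples


# Toy data: two records with hand-picked 2-D vectors.
index = [
    IndexRecord((1.0, 0.0), "toy passage A", "ex:indexRecord_456"),
    IndexRecord((0.0, 1.0), "toy passage B", "ex:indexRecord_457"),
]
hits = retrieve(index, (0.9, 0.1), top_k=1)
triples = generation_triples("ex:answer_1011", "ex:generation_003", hits)
```

Because only the IRI travels with the vector, the index stays small while the full lineage remains queryable in the triplestore.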
By treating provenance as a first‑class citizen in your RAG pipeline, you build a system that not only delivers accurate answers but also shows where each one came from, which provides the transparency needed for rigorous scientific or enterprise applications.