Commit
·
0510038
1
Parent(s):
66a4b03
Phase 2: Enhanced lineage extraction with export to data catalogs
Browse filesFeatures added:
- Upgraded to Gradio 6.0.0 for hackathon compliance
- Export to 4 data catalog formats: OpenLineage, Collibra, Purview, Alation
- 6 new comprehensive sample data files (dbt, Airflow, SQL DDL, warehouse, ETL, complex demo)
- Complete USER_GUIDE.md with tutorials and examples
- BUILD_PLAN.md with competition roadmap
- Real lineage parsing (not stubs) with Mermaid visualization
- MCP server integration UI
- Demo Gallery tab for quick exploration
- Enhanced test suite (12 tests)
Competition: Gradio Agents & MCP Hackathon Winter 2025
Track: MCP in Action (Productivity)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
- .gitignore +2 -2
- BUILD_PLAN.md +302 -0
- README.md +286 -69
- USER_GUIDE.md +550 -0
- app.py +598 -239
- exporters/__init__.py +27 -0
- exporters/alation.py +242 -0
- exporters/base.py +199 -0
- exporters/collibra.py +243 -0
- exporters/openlineage.py +177 -0
- exporters/purview.py +206 -0
- memories/graph_visualizer/tools.json +1 -0
- memories/subagents/tools.json +1 -0
- memories/tools.json +1 -0
- requirements.txt +1 -2
- samples/airflow_dag_sample.json +150 -0
- samples/complex_lineage_demo.json +425 -0
- samples/dbt_manifest_sample.json +196 -0
- samples/etl_pipeline_sample.json +252 -0
- samples/sample_api_metadata.json +8 -0
- samples/sample_metadata.json +12 -0
- samples/sql_ddl_sample.sql +269 -0
- samples/warehouse_lineage_sample.json +216 -0
- tests/test_app.py +86 -4
.gitignore
CHANGED
|
@@ -39,10 +39,10 @@ ENV/
|
|
| 39 |
.DS_Store
|
| 40 |
Thumbs.db
|
| 41 |
|
| 42 |
-
# Credentials
|
| 43 |
-
*.json
|
| 44 |
service-account-*.json
|
| 45 |
credentials.json
|
|
|
|
| 46 |
|
| 47 |
# Logs
|
| 48 |
*.log
|
|
|
|
| 39 |
.DS_Store
|
| 40 |
Thumbs.db
|
| 41 |
|
| 42 |
+
# Credentials (but allow sample json files)
|
|
|
|
| 43 |
service-account-*.json
|
| 44 |
credentials.json
|
| 45 |
+
!samples/*.json
|
| 46 |
|
| 47 |
# Logs
|
| 48 |
*.log
|
BUILD_PLAN.md
ADDED
|
@@ -0,0 +1,302 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# BUILD PLAN - Lineage Graph Accelerator
|
| 2 |
+
|
| 3 |
+
## Competition: Gradio Agents & MCP Hackathon - Winter 2025
|
| 4 |
+
**Deadline:** November 30, 2025
|
| 5 |
+
**Track:** Track 2 - MCP in Action (Productivity)
|
| 6 |
+
|
| 7 |
+
---
|
| 8 |
+
|
| 9 |
+
## Judging Criteria Alignment
|
| 10 |
+
|
| 11 |
+
| Criteria | Weight | Current Status | Target |
|
| 12 |
+
|----------|--------|----------------|--------|
|
| 13 |
+
| Design/Polished UI-UX | High | Basic Gradio UI | Professional, intuitive interface with themes |
|
| 14 |
+
| Functionality | High | Stub extractors | Full MCP integration + agentic chatbot |
|
| 15 |
+
| Creativity | High | Standard lineage tool | Multi-format export, catalog integration |
|
| 16 |
+
| Documentation | High | Basic README | Comprehensive guide + demo video |
|
| 17 |
+
| Real-world Impact | High | Concept | Production-ready for enterprises |
|
| 18 |
+
|
| 19 |
+
---
|
| 20 |
+
|
| 21 |
+
## Submission Requirements Checklist
|
| 22 |
+
|
| 23 |
+
- [ ] HuggingFace Space deployed
|
| 24 |
+
- [ ] Social media post (LinkedIn/X) published
|
| 25 |
+
- [ ] README with complete documentation
|
| 26 |
+
- [ ] Demo video (1-5 minutes)
|
| 27 |
+
- [ ] All team member HF usernames in Space README
|
| 28 |
+
|
| 29 |
+
---
|
| 30 |
+
|
| 31 |
+
## Phase 2 Implementation Plan
|
| 32 |
+
|
| 33 |
+
### 2.1 HuggingFace MCP Server Integration
|
| 34 |
+
**Priority:** Critical
|
| 35 |
+
**Status:** Not Started
|
| 36 |
+
|
| 37 |
+
#### Tasks:
|
| 38 |
+
- [ ] Research available MCP servers on HuggingFace
|
| 39 |
+
- [ ] Implement connection to HF-hosted MCP servers
|
| 40 |
+
- [ ] Add MCP server discovery/selection UI
|
| 41 |
+
- [ ] Create fallback chain: HF MCP -> Local MCP -> Stub
|
| 42 |
+
- [ ] Add health check and status indicators
|
| 43 |
+
- [ ] Support for multiple MCP server endpoints
|
| 44 |
+
|
| 45 |
+
#### Files to Modify:
|
| 46 |
+
- `app.py` - Add HF MCP integration
|
| 47 |
+
- `mcp_example/server.py` - Enhance for HF deployment
|
| 48 |
+
|
| 49 |
+
---
|
| 50 |
+
|
| 51 |
+
### 2.2 Comprehensive Sample Test Data
|
| 52 |
+
**Priority:** Critical
|
| 53 |
+
**Status:** Not Started
|
| 54 |
+
|
| 55 |
+
#### Tasks:
|
| 56 |
+
- [ ] Create realistic dbt manifest sample
|
| 57 |
+
- [ ] Create Airflow DAG metadata sample
|
| 58 |
+
- [ ] Create SQL DDL with complex lineage sample
|
| 59 |
+
- [ ] Create data warehouse lineage sample (Snowflake/BigQuery style)
|
| 60 |
+
- [ ] Create API-based data pipeline sample
|
| 61 |
+
- [ ] Create ETL workflow sample
|
| 62 |
+
- [ ] Add "Try Demo" one-click examples in UI
|
| 63 |
+
|
| 64 |
+
#### New Files:
|
| 65 |
+
- `samples/dbt_manifest_sample.json`
|
| 66 |
+
- `samples/airflow_dag_sample.json`
|
| 67 |
+
- `samples/sql_ddl_sample.sql`
|
| 68 |
+
- `samples/warehouse_lineage_sample.json`
|
| 69 |
+
- `samples/etl_pipeline_sample.json`
|
| 70 |
+
- `samples/complex_lineage_demo.json`
|
| 71 |
+
|
| 72 |
+
---
|
| 73 |
+
|
| 74 |
+
### 2.3 Export to Data Catalogs (Collibra, Purview, Alation)
|
| 75 |
+
**Priority:** High
|
| 76 |
+
**Status:** Not Started
|
| 77 |
+
|
| 78 |
+
#### Tasks:
|
| 79 |
+
- [ ] Design universal lineage export format (JSON-LD/OpenLineage)
|
| 80 |
+
- [ ] Implement Collibra export format
|
| 81 |
+
- [ ] Implement Microsoft Purview export format
|
| 82 |
+
- [ ] Implement Alation export format
|
| 83 |
+
- [ ] Implement Apache Atlas export format
|
| 84 |
+
- [ ] Add export UI with format selection
|
| 85 |
+
- [ ] Add download buttons for each format
|
| 86 |
+
- [ ] Create export documentation
|
| 87 |
+
|
| 88 |
+
#### Export Formats:
|
| 89 |
+
```
|
| 90 |
+
exports/
|
| 91 |
+
├── openlineage/ # OpenLineage standard format
|
| 92 |
+
├── collibra/ # Collibra Data Intelligence
|
| 93 |
+
├── purview/ # Microsoft Purview
|
| 94 |
+
├── alation/ # Alation Data Catalog
|
| 95 |
+
└── atlas/ # Apache Atlas
|
| 96 |
+
```
|
| 97 |
+
|
| 98 |
+
#### Files to Create:
|
| 99 |
+
- `exporters/__init__.py`
|
| 100 |
+
- `exporters/base.py`
|
| 101 |
+
- `exporters/openlineage.py`
|
| 102 |
+
- `exporters/collibra.py`
|
| 103 |
+
- `exporters/purview.py`
|
| 104 |
+
- `exporters/alation.py`
|
| 105 |
+
|
| 106 |
+
---
|
| 107 |
+
|
| 108 |
+
### 2.4 User Guide with Sample Lineage Examples
|
| 109 |
+
**Priority:** High
|
| 110 |
+
**Status:** Not Started
|
| 111 |
+
|
| 112 |
+
#### Tasks:
|
| 113 |
+
- [ ] Create comprehensive USER_GUIDE.md
|
| 114 |
+
- [ ] Add getting started section
|
| 115 |
+
- [ ] Document all input formats supported
|
| 116 |
+
- [ ] Create step-by-step tutorials
|
| 117 |
+
- [ ] Add troubleshooting section
|
| 118 |
+
- [ ] Include sample lineage scenarios with expected outputs
|
| 119 |
+
- [ ] Add integration guides for each data catalog
|
| 120 |
+
|
| 121 |
+
#### Sample Scenarios to Document:
|
| 122 |
+
1. Simple table-to-table lineage
|
| 123 |
+
2. Multi-hop data pipeline lineage
|
| 124 |
+
3. dbt model dependency graph
|
| 125 |
+
4. Airflow DAG task dependencies
|
| 126 |
+
5. Cross-database lineage
|
| 127 |
+
6. API-to-database data flow
|
| 128 |
+
7. ETL job lineage
|
| 129 |
+
|
| 130 |
+
---
|
| 131 |
+
|
| 132 |
+
### 2.5 Gradio 6 Upgrade & UI/UX Enhancement
|
| 133 |
+
**Priority:** Critical (Competition Requirement)
|
| 134 |
+
**Status:** Not Started
|
| 135 |
+
|
| 136 |
+
#### Tasks:
|
| 137 |
+
- [ ] Upgrade to Gradio 6 (competition requirement)
|
| 138 |
+
- [ ] Implement agentic chatbot interface
|
| 139 |
+
- [ ] Add dark/light theme toggle
|
| 140 |
+
- [ ] Improve layout and responsiveness
|
| 141 |
+
- [ ] Add progress indicators and loading states
|
| 142 |
+
- [ ] Implement error handling with user-friendly messages
|
| 143 |
+
- [ ] Add interactive graph zoom/pan
|
| 144 |
+
- [ ] Add lineage node click interactions
|
| 145 |
+
|
| 146 |
+
#### UI Improvements:
|
| 147 |
+
- Professional color scheme
|
| 148 |
+
- Clear visual hierarchy
|
| 149 |
+
- Tooltips and help text
|
| 150 |
+
- Export buttons with icons
|
| 151 |
+
- Collapsible sections
|
| 152 |
+
- Mobile-friendly design
|
| 153 |
+
|
| 154 |
+
---
|
| 155 |
+
|
| 156 |
+
### 2.6 Agentic Chatbot Integration
|
| 157 |
+
**Priority:** Critical (Competition Judging)
|
| 158 |
+
**Status:** Not Started
|
| 159 |
+
|
| 160 |
+
#### Tasks:
|
| 161 |
+
- [ ] Implement conversational interface for lineage queries
|
| 162 |
+
- [ ] Add natural language input for lineage extraction
|
| 163 |
+
- [ ] Enable follow-up questions about lineage
|
| 164 |
+
- [ ] Integrate with Anthropic/OpenAI APIs
|
| 165 |
+
- [ ] Add streaming responses
|
| 166 |
+
- [ ] Implement context memory for conversations
|
| 167 |
+
|
| 168 |
+
---
|
| 169 |
+
|
| 170 |
+
### 2.7 Demo Video Production
|
| 171 |
+
**Priority:** Critical (Submission Requirement)
|
| 172 |
+
**Status:** Not Started
|
| 173 |
+
|
| 174 |
+
#### Video Content Plan (1-5 minutes):
|
| 175 |
+
1. Introduction (15s)
|
| 176 |
+
2. Problem statement (20s)
|
| 177 |
+
3. Live demo - Text input (30s)
|
| 178 |
+
4. Live demo - Sample data (30s)
|
| 179 |
+
5. Export features (30s)
|
| 180 |
+
6. MCP integration (30s)
|
| 181 |
+
7. Real-world use cases (30s)
|
| 182 |
+
8. Call to action (15s)
|
| 183 |
+
|
| 184 |
+
---
|
| 185 |
+
|
| 186 |
+
## Technical Architecture Updates
|
| 187 |
+
|
| 188 |
+
### Current Architecture:
|
| 189 |
+
```
|
| 190 |
+
User -> Gradio UI -> Stub Extractors -> Mermaid Render
|
| 191 |
+
```
|
| 192 |
+
|
| 193 |
+
### Target Architecture:
|
| 194 |
+
```
|
| 195 |
+
User -> Gradio 6 UI -> Agentic Chatbot
|
| 196 |
+
-> MCP Server (HF/Local)
|
| 197 |
+
-> Lineage Parser
|
| 198 |
+
-> Graph Visualizer
|
| 199 |
+
-> Export Engine -> [Collibra|Purview|Alation|Atlas]
|
| 200 |
+
-> Mermaid/DOT/Text Render
|
| 201 |
+
```
|
| 202 |
+
|
| 203 |
+
---
|
| 204 |
+
|
| 205 |
+
## Dependencies to Add
|
| 206 |
+
|
| 207 |
+
```txt
|
| 208 |
+
# requirements.txt additions
|
| 209 |
+
gradio>=6.0.0
|
| 210 |
+
anthropic>=0.25.0
|
| 211 |
+
openai>=1.0.0
|
| 212 |
+
openlineage-integration-common>=1.0.0
|
| 213 |
+
```
|
| 214 |
+
|
| 215 |
+
---
|
| 216 |
+
|
| 217 |
+
## Testing Plan
|
| 218 |
+
|
| 219 |
+
### Unit Tests:
|
| 220 |
+
- [ ] Test all export formats
|
| 221 |
+
- [ ] Test MCP server integration
|
| 222 |
+
- [ ] Test sample data loading
|
| 223 |
+
- [ ] Test visualization rendering
|
| 224 |
+
|
| 225 |
+
### Integration Tests:
|
| 226 |
+
- [ ] End-to-end lineage extraction
|
| 227 |
+
- [ ] Export file validation
|
| 228 |
+
- [ ] MCP server communication
|
| 229 |
+
|
| 230 |
+
### Manual Tests:
|
| 231 |
+
- [ ] UI/UX on different browsers
|
| 232 |
+
- [ ] Mobile responsiveness
|
| 233 |
+
- [ ] Load testing with large graphs
|
| 234 |
+
|
| 235 |
+
---
|
| 236 |
+
|
| 237 |
+
## Deployment Checklist
|
| 238 |
+
|
| 239 |
+
### HuggingFace Space:
|
| 240 |
+
- [ ] Update Space SDK to Gradio 6
|
| 241 |
+
- [ ] Configure environment variables
|
| 242 |
+
- [ ] Set up secrets for API keys
|
| 243 |
+
- [ ] Test on HF infrastructure
|
| 244 |
+
- [ ] Verify MCP server connectivity
|
| 245 |
+
|
| 246 |
+
### Documentation:
|
| 247 |
+
- [ ] README.md complete
|
| 248 |
+
- [ ] USER_GUIDE.md complete
|
| 249 |
+
- [ ] Demo video uploaded
|
| 250 |
+
- [ ] Social media post drafted
|
| 251 |
+
|
| 252 |
+
---
|
| 253 |
+
|
| 254 |
+
## Timeline (Remaining Days)
|
| 255 |
+
|
| 256 |
+
### Immediate (Days 1-2):
|
| 257 |
+
1. Upgrade to Gradio 6
|
| 258 |
+
2. Create comprehensive sample data
|
| 259 |
+
3. Implement basic export functionality
|
| 260 |
+
|
| 261 |
+
### Short-term (Days 3-5):
|
| 262 |
+
4. Implement agentic chatbot
|
| 263 |
+
5. HuggingFace MCP integration
|
| 264 |
+
6. UI/UX enhancements
|
| 265 |
+
|
| 266 |
+
### Final (Days 6-7):
|
| 267 |
+
7. Create user guide
|
| 268 |
+
8. Record demo video
|
| 269 |
+
9. Final testing and deployment
|
| 270 |
+
10. Social media post
|
| 271 |
+
|
| 272 |
+
---
|
| 273 |
+
|
| 274 |
+
## Risk Mitigation
|
| 275 |
+
|
| 276 |
+
| Risk | Mitigation |
|
| 277 |
+
|------|------------|
|
| 278 |
+
| Gradio 6 breaking changes | Test incrementally, have rollback plan |
|
| 279 |
+
| MCP server unavailability | Implement robust fallback chain |
|
| 280 |
+
| API rate limits | Cache responses, implement retry logic |
|
| 281 |
+
| Export format compatibility | Validate against official schemas |
|
| 282 |
+
|
| 283 |
+
---
|
| 284 |
+
|
| 285 |
+
## Success Metrics
|
| 286 |
+
|
| 287 |
+
- [ ] All judging criteria addressed
|
| 288 |
+
- [ ] Submission requirements complete
|
| 289 |
+
- [ ] Demo runs without errors
|
| 290 |
+
- [ ] Export files validate against schemas
|
| 291 |
+
- [ ] MCP integration functional
|
| 292 |
+
- [ ] UI is polished and intuitive
|
| 293 |
+
- [ ] Documentation is comprehensive
|
| 294 |
+
|
| 295 |
+
---
|
| 296 |
+
|
| 297 |
+
## Notes
|
| 298 |
+
|
| 299 |
+
- Competition ends November 30, 2025 at 11:59 PM UTC
|
| 300 |
+
- Focus on "Productivity" track for Track 2
|
| 301 |
+
- Leverage sponsor APIs for enhanced functionality
|
| 302 |
+
- Consider ElevenLabs integration for voice features (bonus prize)
|
README.md
CHANGED
|
@@ -1,125 +1,342 @@
|
|
| 1 |
---
|
| 2 |
title: Lineage Graph Accelerator
|
| 3 |
emoji: 🔥
|
| 4 |
-
colorFrom:
|
| 5 |
-
colorTo:
|
| 6 |
sdk: gradio
|
| 7 |
-
sdk_version:
|
| 8 |
app_file: app.py
|
| 9 |
-
pinned:
|
| 10 |
license: mit
|
| 11 |
-
short_description:
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 12 |
---
|
| 13 |
|
| 14 |
# Lineage Graph Accelerator 🔥
|
| 15 |
|
| 16 |
-
|
| 17 |
|
| 18 |
-
|
|
|
|
|
|
|
| 19 |
|
| 20 |
-
|
|
|
|
|
|
|
| 21 |
|
| 22 |
-
|
| 23 |
|
| 24 |
-
|
| 25 |
-
flowchart TD
|
| 26 |
-
A[User/UI (Gradio)] --> B[Main Agent / Orchestrator]
|
| 27 |
-
B --> C[Metadata Parser Sub-Agent]
|
| 28 |
-
B --> D[Graph Visualizer Sub-Agent]
|
| 29 |
-
B --> E[Integration Adapters]
|
| 30 |
-
E --> E1[BigQuery Adapter]
|
| 31 |
-
E --> E2[URL / API Adapter]
|
| 32 |
-
E --> E3[dbt / Airflow Adapter]
|
| 33 |
-
C --> F[Lineage Model / Relations]
|
| 34 |
-
F --> D
|
| 35 |
-
D --> G[Mermaid / DOT Renderer]
|
| 36 |
-
G --> H[UI Visualization]
|
| 37 |
-
style B fill:#f9f,stroke:#333,stroke-width:1px
|
| 38 |
-
style C fill:#bbf,stroke:#333,stroke-width:1px
|
| 39 |
-
style D fill:#bfb,stroke:#333,stroke-width:1px
|
| 40 |
-
style E fill:#ffd,stroke:#333,stroke-width:1px
|
| 41 |
-
```
|
| 42 |
|
| 43 |
-
|
| 44 |
|
| 45 |
-
-
|
| 46 |
-
-
|
| 47 |
-
-
|
| 48 |
-
-
|
| 49 |
|
| 50 |
-
|
| 51 |
|
| 52 |
-
|
| 53 |
-
-
|
| 54 |
-
-
|
|
|
|
|
|
|
| 55 |
|
| 56 |
-
|
| 57 |
|
| 58 |
-
##
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 59 |
|
| 60 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 61 |
|
| 62 |
```bash
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 63 |
python3 -m venv .venv
|
| 64 |
source .venv/bin/activate
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 65 |
```
|
| 66 |
|
| 67 |
-
|
| 68 |
|
| 69 |
-
|
| 70 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 71 |
```
|
| 72 |
|
| 73 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 74 |
|
| 75 |
```bash
|
| 76 |
-
|
| 77 |
```
|
| 78 |
|
| 79 |
-
|
| 80 |
|
| 81 |
-
|
| 82 |
|
| 83 |
-
|
| 84 |
|
| 85 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 86 |
|
| 87 |
```bash
|
|
|
|
| 88 |
source .venv/bin/activate
|
|
|
|
|
|
|
| 89 |
python -m unittest tests.test_app -v
|
|
|
|
|
|
|
|
|
|
| 90 |
```
|
| 91 |
|
| 92 |
-
|
|
|
|
|
|
|
| 93 |
|
| 94 |
-
-
|
| 95 |
-
-
|
| 96 |
-
-
|
| 97 |
|
| 98 |
-
|
| 99 |
|
| 100 |
-
|
| 101 |
|
| 102 |
-
|
| 103 |
|
| 104 |
-
|
|
|
|
| 105 |
|
| 106 |
-
|
| 107 |
|
| 108 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 109 |
|
| 110 |
-
|
| 111 |
|
| 112 |
-
|
| 113 |
-
# Activate venv first if you use one
|
| 114 |
-
uvicorn mcp_example.server:app --reload --port 9000
|
| 115 |
-
```
|
| 116 |
|
| 117 |
-
|
| 118 |
|
| 119 |
-
|
| 120 |
-
|
| 121 |
-
|
|
|
|
|
|
|
| 122 |
|
| 123 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 124 |
|
| 125 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
---
|
| 2 |
title: Lineage Graph Accelerator
|
| 3 |
emoji: 🔥
|
| 4 |
+
colorFrom: purple
|
| 5 |
+
colorTo: blue
|
| 6 |
sdk: gradio
|
| 7 |
+
sdk_version: 6.0.0
|
| 8 |
app_file: app.py
|
| 9 |
+
pinned: true
|
| 10 |
license: mit
|
| 11 |
+
short_description: AI data lineage extraction & export to data catalogs
|
| 12 |
+
tags:
|
| 13 |
+
- data-lineage
|
| 14 |
+
- mcp
|
| 15 |
+
- gradio
|
| 16 |
+
- data-governance
|
| 17 |
+
- dbt
|
| 18 |
+
- airflow
|
| 19 |
+
- etl
|
| 20 |
---
|
| 21 |
|
| 22 |
# Lineage Graph Accelerator 🔥
|
| 23 |
|
| 24 |
+
**AI-powered data lineage extraction and visualization for modern data platforms**
|
| 25 |
|
| 26 |
+
[](https://huggingface.co/spaces/YOUR_SPACE)
|
| 27 |
+
[](https://opensource.org/licenses/MIT)
|
| 28 |
+
[](https://gradio.app)
|
| 29 |
|
| 30 |
+
> 🎉 **Built for the Gradio Agents & MCP Hackathon - Winter 2025** 🎉
|
| 31 |
+
>
|
| 32 |
+
> Celebrating MCP's 1st Birthday! This project demonstrates the power of MCP integration for enterprise data governance.
|
| 33 |
|
| 34 |
+
---
|
| 35 |
|
| 36 |
+
## 🌟 What is Lineage Graph Accelerator?
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 37 |
|
| 38 |
+
Lineage Graph Accelerator is an AI-powered tool that helps data teams:
|
| 39 |
|
| 40 |
+
- **Extract** data lineage from dbt, Airflow, BigQuery, Snowflake, and more
|
| 41 |
+
- **Visualize** complex data dependencies with interactive Mermaid diagrams
|
| 42 |
+
- **Export** lineage to enterprise data catalogs (Collibra, Microsoft Purview, Alation)
|
| 43 |
+
- **Integrate** with MCP servers for enhanced AI-powered processing
|
| 44 |
|
| 45 |
+
### Why Data Lineage Matters
|
| 46 |
|
| 47 |
+
Understanding where your data comes from and where it goes is critical for:
|
| 48 |
+
- **Data Quality**: Track data transformations and identify issues
|
| 49 |
+
- **Compliance**: Document data flows for GDPR, CCPA, and other regulations
|
| 50 |
+
- **Impact Analysis**: Understand downstream effects of schema changes
|
| 51 |
+
- **Data Discovery**: Help analysts find and trust data assets
|
| 52 |
|
| 53 |
+
---
|
| 54 |
|
| 55 |
+
## 🎯 Key Features
|
| 56 |
+
|
| 57 |
+
### Multi-Source Support
|
| 58 |
+
| Source | Status | Description |
|
| 59 |
+
|--------|--------|-------------|
|
| 60 |
+
| dbt Manifest | ✅ | Parse dbt's manifest.json for model dependencies |
|
| 61 |
+
| Airflow DAG | ✅ | Extract task dependencies from DAG definitions |
|
| 62 |
+
| SQL DDL | ✅ | Parse CREATE statements for table lineage |
|
| 63 |
+
| BigQuery | ✅ | Query INFORMATION_SCHEMA for metadata |
|
| 64 |
+
| Custom JSON | ✅ | Flexible node/edge format for any source |
|
| 65 |
+
| Snowflake | 🔄 | Coming via MCP integration |
|
| 66 |
+
|
| 67 |
+
### Export to Data Catalogs
|
| 68 |
+
| Catalog | Status | Format |
|
| 69 |
+
|---------|--------|--------|
|
| 70 |
+
| OpenLineage | ✅ | Universal open standard |
|
| 71 |
+
| Collibra | ✅ | Data Intelligence Platform |
|
| 72 |
+
| Microsoft Purview | ✅ | Azure Data Governance |
|
| 73 |
+
| Alation | ✅ | Data Catalog |
|
| 74 |
+
| Apache Atlas | 🔄 | Coming soon |
|
| 75 |
+
|
| 76 |
+
### Visualization Options
|
| 77 |
+
- **Mermaid Diagrams**: Interactive, client-side rendering
|
| 78 |
+
- **Subgraph Grouping**: Organize by data layer (raw, staging, marts)
|
| 79 |
+
- **Color-Coded Nodes**: Distinguish sources, tables, models, reports
|
| 80 |
+
- **Edge Labels**: Show transformation types
|
| 81 |
|
| 82 |
+
---
|
| 83 |
+
|
| 84 |
+
## 🚀 Quick Start
|
| 85 |
+
|
| 86 |
+
### Try Online (HuggingFace Space)
|
| 87 |
+
|
| 88 |
+
1. Visit [Lineage Graph Accelerator on HuggingFace](https://huggingface.co/spaces/YOUR_SPACE)
|
| 89 |
+
2. Click "Load Sample" to load example data
|
| 90 |
+
3. Click "Extract Lineage" to see the visualization
|
| 91 |
+
4. Explore the Demo Gallery for more examples
|
| 92 |
+
|
| 93 |
+
### Run Locally
|
| 94 |
|
| 95 |
```bash
|
| 96 |
+
# Clone the repository
|
| 97 |
+
git clone https://github.com/YOUR_REPO/lineage-graph-accelerator.git
|
| 98 |
+
cd lineage-graph-accelerator
|
| 99 |
+
|
| 100 |
+
# Create virtual environment
|
| 101 |
python3 -m venv .venv
|
| 102 |
source .venv/bin/activate
|
| 103 |
+
|
| 104 |
+
# Install dependencies
|
| 105 |
+
pip install -r requirements.txt
|
| 106 |
+
|
| 107 |
+
# Run the app
|
| 108 |
+
python app.py
|
| 109 |
```
|
| 110 |
|
| 111 |
+
Open http://127.0.0.1:7860 in your browser.
|
| 112 |
|
| 113 |
+
---
|
| 114 |
+
|
| 115 |
+
## 📖 Usage Guide
|
| 116 |
+
|
| 117 |
+
### 1. Text/File Metadata Tab
|
| 118 |
+
|
| 119 |
+
Paste your metadata directly:
|
| 120 |
+
|
| 121 |
+
```json
|
| 122 |
+
{
|
| 123 |
+
"nodes": [
|
| 124 |
+
{"id": "source_db", "type": "source", "name": "Source Database"},
|
| 125 |
+
{"id": "staging", "type": "table", "name": "Staging Table"},
|
| 126 |
+
{"id": "analytics", "type": "table", "name": "Analytics Table"}
|
| 127 |
+
],
|
| 128 |
+
"edges": [
|
| 129 |
+
{"from": "source_db", "to": "staging"},
|
| 130 |
+
{"from": "staging", "to": "analytics"}
|
| 131 |
+
]
|
| 132 |
+
}
|
| 133 |
+
```
|
| 134 |
+
|
| 135 |
+
### 2. Sample Data
|
| 136 |
+
|
| 137 |
+
Load pre-built samples to explore different scenarios:
|
| 138 |
+
- **Simple JSON**: Basic node/edge lineage
|
| 139 |
+
- **dbt Manifest**: Full dbt project with 15+ models
|
| 140 |
+
- **Airflow DAG**: ETL pipeline with 15 tasks
|
| 141 |
+
- **Data Warehouse**: Snowflake-style multi-layer architecture
|
| 142 |
+
- **ETL Pipeline**: Complex multi-source pipeline
|
| 143 |
+
- **Complex Demo**: 50+ node e-commerce platform
|
| 144 |
+
|
| 145 |
+
### 3. Export to Data Catalogs
|
| 146 |
+
|
| 147 |
+
1. Extract lineage from your metadata
|
| 148 |
+
2. Expand "Export to Data Catalog"
|
| 149 |
+
3. Select format (OpenLineage, Collibra, Purview, Alation)
|
| 150 |
+
4. Click "Generate Export"
|
| 151 |
+
5. Copy the JSON for import into your catalog
|
| 152 |
+
|
| 153 |
+
---
|
| 154 |
+
|
| 155 |
+
## 🔌 MCP Integration
|
| 156 |
+
|
| 157 |
+
Connect to MCP (Model Context Protocol) servers for enhanced processing:
|
| 158 |
+
|
| 159 |
+
```
|
| 160 |
+
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
|
| 161 |
+
│ Lineage Graph │────▶│ MCP Server │────▶│ AI Model │
|
| 162 |
+
│ Accelerator │ │ (HuggingFace) │ │ (Claude) │
|
| 163 |
+
└─────────────────┘ └─────────────────┘ └─────────────────┘
|
| 164 |
```
|
| 165 |
|
| 166 |
+
### Configuration
|
| 167 |
+
|
| 168 |
+
1. Expand "MCP Server Configuration" in the UI
|
| 169 |
+
2. Enter your MCP server URL
|
| 170 |
+
3. Add API key (if required)
|
| 171 |
+
4. Click "Test Connection"
|
| 172 |
+
|
| 173 |
+
### Run Local MCP Server
|
| 174 |
|
| 175 |
```bash
|
| 176 |
+
uvicorn mcp_example.server:app --reload --port 9000
|
| 177 |
```
|
| 178 |
|
| 179 |
+
Then use `http://localhost:9000/mcp` as your server URL.
|
| 180 |
|
| 181 |
+
---
|
| 182 |
|
| 183 |
+
## 🏗️ Architecture
|
| 184 |
|
| 185 |
+
```mermaid
|
| 186 |
+
flowchart TD
|
| 187 |
+
A[User Interface - Gradio] --> B[Input Parser]
|
| 188 |
+
B --> C{Source Type}
|
| 189 |
+
C -->|dbt| D[dbt Parser]
|
| 190 |
+
C -->|Airflow| E[Airflow Parser]
|
| 191 |
+
C -->|SQL| F[SQL Parser]
|
| 192 |
+
C -->|JSON| G[JSON Parser]
|
| 193 |
+
D & E & F & G --> H[LineageGraph]
|
| 194 |
+
H --> I[Mermaid Generator]
|
| 195 |
+
H --> J[Export Engine]
|
| 196 |
+
I --> K[Visualization]
|
| 197 |
+
J --> L[OpenLineage]
|
| 198 |
+
J --> M[Collibra]
|
| 199 |
+
J --> N[Purview]
|
| 200 |
+
J --> O[Alation]
|
| 201 |
+
|
| 202 |
+
subgraph Optional
|
| 203 |
+
P[MCP Server] --> H
|
| 204 |
+
end
|
| 205 |
+
```
|
| 206 |
+
|
| 207 |
+
### Project Structure
|
| 208 |
+
|
| 209 |
+
```
|
| 210 |
+
lineage-graph-accelerator/
|
| 211 |
+
├── app.py # Main Gradio application
|
| 212 |
+
├── exporters/ # Data catalog exporters
|
| 213 |
+
│ ├── __init__.py
|
| 214 |
+
│ ├── base.py # Base classes
|
| 215 |
+
│ ├── openlineage.py # OpenLineage format
|
| 216 |
+
│ ├── collibra.py # Collibra format
|
| 217 |
+
│ ├── purview.py # Microsoft Purview format
|
| 218 |
+
│ └── alation.py # Alation format
|
| 219 |
+
├── samples/ # Sample data files
|
| 220 |
+
│ ├── sample_metadata.json
|
| 221 |
+
│ ├── dbt_manifest_sample.json
|
| 222 |
+
│ ├── airflow_dag_sample.json
|
| 223 |
+
│ ├── sql_ddl_sample.sql
|
| 224 |
+
│ ├── warehouse_lineage_sample.json
|
| 225 |
+
│ ├── etl_pipeline_sample.json
|
| 226 |
+
│ └── complex_lineage_demo.json
|
| 227 |
+
├── mcp_example/ # Example MCP server
|
| 228 |
+
│ └── server.py
|
| 229 |
+
├── tests/ # Unit tests
|
| 230 |
+
│ └── test_app.py
|
| 231 |
+
├── memories/ # Agent configuration
|
| 232 |
+
├── USER_GUIDE.md # Comprehensive user guide
|
| 233 |
+
├── BUILD_PLAN.md # Development roadmap
|
| 234 |
+
└── requirements.txt
|
| 235 |
+
```
|
| 236 |
+
|
| 237 |
+
---
|
| 238 |
+
|
| 239 |
+
## 🧪 Testing
|
| 240 |
|
| 241 |
```bash
|
| 242 |
+
# Activate virtual environment
|
| 243 |
source .venv/bin/activate
|
| 244 |
+
|
| 245 |
+
# Run unit tests
|
| 246 |
python -m unittest tests.test_app -v
|
| 247 |
+
|
| 248 |
+
# Run setup validation
|
| 249 |
+
python test_setup.py
|
| 250 |
```
|
| 251 |
|
| 252 |
+
---
|
| 253 |
+
|
| 254 |
+
## 📋 Requirements
|
| 255 |
|
| 256 |
+
- Python 3.9+
|
| 257 |
+
- Gradio 5.49.1+
|
| 258 |
+
- See `requirements.txt` for full dependencies
|
| 259 |
|
| 260 |
+
---
|
| 261 |
|
| 262 |
+
## 🎖️ Competition Submission
|
| 263 |
|
| 264 |
+
**Track**: Track 2 - MCP in Action (Productivity)
|
| 265 |
|
| 266 |
+
**Team Members**:
|
| 267 |
+
- [Your HuggingFace Username]
|
| 268 |
|
| 269 |
+
### Judging Criteria Alignment
|
| 270 |
|
| 271 |
+
| Criteria | Implementation |
|
| 272 |
+
|----------|----------------|
|
| 273 |
+
| **UI/UX Design** | Clean, professional interface with tabs, accordions, and color-coded visualizations |
|
| 274 |
+
| **Functionality** | Full MCP integration, multiple input formats, 4 export formats |
|
| 275 |
+
| **Creativity** | Novel approach to data lineage visualization with AI-powered parsing |
|
| 276 |
+
| **Documentation** | Comprehensive README, USER_GUIDE.md, inline comments |
|
| 277 |
+
| **Real-world Impact** | Solves critical enterprise need for data governance and compliance |
|
| 278 |
|
| 279 |
+
### Demo Video
|
| 280 |
|
| 281 |
+
[Link to demo video - Coming Soon]
|
|
|
|
|
|
|
|
|
|
| 282 |
|
| 283 |
+
### Social Media Post
|
| 284 |
|
| 285 |
+
[Link to LinkedIn/X post - Coming Soon]
|
| 286 |
+
|
| 287 |
+
---
|
| 288 |
+
|
| 289 |
+
## 🔜 Roadmap
|
| 290 |
|
| 291 |
+
- [ ] Gradio 6 upgrade for enhanced UI components
|
| 292 |
+
- [ ] Agentic chatbot for natural language queries
|
| 293 |
+
- [ ] Apache Atlas export support
|
| 294 |
+
- [ ] File upload functionality
|
| 295 |
+
- [ ] Graph export as PNG/SVG
|
| 296 |
+
- [ ] Batch processing API
|
| 297 |
+
- [ ] Column-level lineage
|
| 298 |
+
|
| 299 |
+
---
|
| 300 |
+
|
| 301 |
+
## 🤝 Contributing
|
| 302 |
+
|
| 303 |
+
Contributions welcome! Please:
|
| 304 |
+
|
| 305 |
+
1. Fork the repository
|
| 306 |
+
2. Create a feature branch
|
| 307 |
+
3. Make your changes
|
| 308 |
+
4. Submit a pull request
|
| 309 |
+
|
| 310 |
+
See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.
|
| 311 |
+
|
| 312 |
+
---
|
| 313 |
+
|
| 314 |
+
## 📄 License
|
| 315 |
+
|
| 316 |
+
MIT License - see [LICENSE](LICENSE) for details.
|
| 317 |
+
|
| 318 |
+
---
|
| 319 |
+
|
| 320 |
+
## 🙏 Acknowledgments
|
| 321 |
+
|
| 322 |
+
- **Anthropic** - MCP Protocol and Claude
|
| 323 |
+
- **Gradio Team** - Amazing UI framework
|
| 324 |
+
- **HuggingFace** - Hosting and community
|
| 325 |
+
- **dbt Labs** - Inspiration for metadata standards
|
| 326 |
+
- **OpenLineage** - Open lineage specification
|
| 327 |
+
|
| 328 |
+
---
|
| 329 |
+
|
| 330 |
+
## 📞 Support
|
| 331 |
+
|
| 332 |
+
- **Documentation**: [USER_GUIDE.md](USER_GUIDE.md)
|
| 333 |
+
- **Issues**: [GitHub Issues](https://github.com/YOUR_REPO/issues)
|
| 334 |
+
- **Discussion**: [HuggingFace Community](https://huggingface.co/spaces/YOUR_SPACE/discussions)
|
| 335 |
+
|
| 336 |
+
---
|
| 337 |
|
| 338 |
+
<p align="center">
|
| 339 |
+
Built with ❤️ for the <strong>Gradio Agents & MCP Hackathon - Winter 2025</strong>
|
| 340 |
+
<br>
|
| 341 |
+
Celebrating MCP's 1st Birthday! 🎂
|
| 342 |
+
</p>
|
USER_GUIDE.md
ADDED
|
@@ -0,0 +1,550 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Lineage Graph Accelerator - User Guide
|
| 2 |
+
|
| 3 |
+
A comprehensive guide to using the Lineage Graph Accelerator for extracting, visualizing, and exporting data lineage from your data platforms.
|
| 4 |
+
|
| 5 |
+
---
|
| 6 |
+
|
| 7 |
+
## Table of Contents
|
| 8 |
+
|
| 9 |
+
1. [Getting Started](#getting-started)
|
| 10 |
+
2. [Input Formats](#input-formats)
|
| 11 |
+
3. [Sample Lineage Examples](#sample-lineage-examples)
|
| 12 |
+
4. [Export to Data Catalogs](#export-to-data-catalogs)
|
| 13 |
+
5. [MCP Server Integration](#mcp-server-integration)
|
| 14 |
+
6. [Troubleshooting](#troubleshooting)
|
| 15 |
+
7. [FAQ](#faq)
|
| 16 |
+
|
| 17 |
+
---
|
| 18 |
+
|
| 19 |
+
## Getting Started
|
| 20 |
+
|
| 21 |
+
### Quick Start (3 Steps)
|
| 22 |
+
|
| 23 |
+
1. **Open the App**: Navigate to the Lineage Graph Accelerator on HuggingFace Spaces
|
| 24 |
+
2. **Load Sample Data**: Click "Load Sample" to try pre-built examples
|
| 25 |
+
3. **Extract Lineage**: Click "Extract Lineage" to visualize the data flow
|
| 26 |
+
|
| 27 |
+
### Interface Overview
|
| 28 |
+
|
| 29 |
+
The application has four main tabs:
|
| 30 |
+
|
| 31 |
+
| Tab | Purpose |
|
| 32 |
+
|-----|---------|
|
| 33 |
+
| **Text/File Metadata** | Paste or upload metadata directly |
|
| 34 |
+
| **BigQuery** | Connect to Google BigQuery for schema extraction |
|
| 35 |
+
| **URL/API** | Fetch metadata from REST APIs |
|
| 36 |
+
| **Demo Gallery** | One-click demos of various lineage scenarios |
|
| 37 |
+
|
| 38 |
+
---
|
| 39 |
+
|
| 40 |
+
## Input Formats
|
| 41 |
+
|
| 42 |
+
The Lineage Graph Accelerator supports multiple metadata formats:
|
| 43 |
+
|
| 44 |
+
### 1. Simple JSON (Nodes & Edges)
|
| 45 |
+
|
| 46 |
+
The simplest format with explicit nodes and edges:
|
| 47 |
+
|
| 48 |
+
```json
|
| 49 |
+
{
|
| 50 |
+
"nodes": [
|
| 51 |
+
{"id": "raw_customers", "type": "table", "name": "raw_customers"},
|
| 52 |
+
{"id": "clean_customers", "type": "table", "name": "clean_customers"},
|
| 53 |
+
{"id": "analytics_customers", "type": "table", "name": "analytics_customers"}
|
| 54 |
+
],
|
| 55 |
+
"edges": [
|
| 56 |
+
{"from": "raw_customers", "to": "clean_customers"},
|
| 57 |
+
{"from": "clean_customers", "to": "analytics_customers"}
|
| 58 |
+
]
|
| 59 |
+
}
|
| 60 |
+
```
|
| 61 |
+
|
| 62 |
+
**Result**: A linear graph showing `raw_customers → clean_customers → analytics_customers`
|
| 63 |
+
|
| 64 |
+
---
|
| 65 |
+
|
| 66 |
+
### 2. dbt Manifest Format
|
| 67 |
+
|
| 68 |
+
Extract lineage from dbt's `manifest.json`:
|
| 69 |
+
|
| 70 |
+
```json
|
| 71 |
+
{
|
| 72 |
+
"metadata": {
|
| 73 |
+
"dbt_version": "1.7.0",
|
| 74 |
+
"project_name": "my_project"
|
| 75 |
+
},
|
| 76 |
+
"nodes": {
|
| 77 |
+
"source.my_project.raw.customers": {
|
| 78 |
+
"resource_type": "source",
|
| 79 |
+
"name": "customers",
|
| 80 |
+
"schema": "raw"
|
| 81 |
+
},
|
| 82 |
+
"model.my_project.stg_customers": {
|
| 83 |
+
"resource_type": "model",
|
| 84 |
+
"name": "stg_customers",
|
| 85 |
+
"schema": "staging",
|
| 86 |
+
"depends_on": {
|
| 87 |
+
"nodes": ["source.my_project.raw.customers"]
|
| 88 |
+
}
|
| 89 |
+
},
|
| 90 |
+
"model.my_project.dim_customers": {
|
| 91 |
+
"resource_type": "model",
|
| 92 |
+
"name": "dim_customers",
|
| 93 |
+
"schema": "marts",
|
| 94 |
+
"depends_on": {
|
| 95 |
+
"nodes": ["model.my_project.stg_customers"]
|
| 96 |
+
}
|
| 97 |
+
}
|
| 98 |
+
}
|
| 99 |
+
}
|
| 100 |
+
```
|
| 101 |
+
|
| 102 |
+
**Result**: A graph showing the dbt model dependencies from source to staging to marts.
|
| 103 |
+
|
| 104 |
+
---
|
| 105 |
+
|
| 106 |
+
### 3. Airflow DAG Format
|
| 107 |
+
|
| 108 |
+
Extract task dependencies from Airflow DAGs:
|
| 109 |
+
|
| 110 |
+
```json
|
| 111 |
+
{
|
| 112 |
+
"dag_id": "etl_pipeline",
|
| 113 |
+
"tasks": [
|
| 114 |
+
{
|
| 115 |
+
"task_id": "extract_data",
|
| 116 |
+
"operator": "PythonOperator",
|
| 117 |
+
"upstream_dependencies": []
|
| 118 |
+
},
|
| 119 |
+
{
|
| 120 |
+
"task_id": "transform_data",
|
| 121 |
+
"operator": "SparkSubmitOperator",
|
| 122 |
+
"upstream_dependencies": ["extract_data"]
|
| 123 |
+
},
|
| 124 |
+
{
|
| 125 |
+
"task_id": "load_data",
|
| 126 |
+
"operator": "SnowflakeOperator",
|
| 127 |
+
"upstream_dependencies": ["transform_data"]
|
| 128 |
+
}
|
| 129 |
+
]
|
| 130 |
+
}
|
| 131 |
+
```
|
| 132 |
+
|
| 133 |
+
**Result**: A DAG visualization showing `extract_data → transform_data → load_data`
|
| 134 |
+
|
| 135 |
+
---
|
| 136 |
+
|
| 137 |
+
### 4. Data Warehouse Lineage Format
|
| 138 |
+
|
| 139 |
+
For Snowflake, BigQuery, or other warehouse lineage:
|
| 140 |
+
|
| 141 |
+
```json
|
| 142 |
+
{
|
| 143 |
+
"warehouse": {
|
| 144 |
+
"platform": "Snowflake",
|
| 145 |
+
"database": "ANALYTICS_DW"
|
| 146 |
+
},
|
| 147 |
+
"lineage": {
|
| 148 |
+
"datasets": [
|
| 149 |
+
{"id": "raw.customers", "type": "table", "schema": "RAW"},
|
| 150 |
+
{"id": "staging.customers", "type": "view", "schema": "STAGING"},
|
| 151 |
+
{"id": "marts.dim_customer", "type": "table", "schema": "MARTS"}
|
| 152 |
+
],
|
| 153 |
+
"relationships": [
|
| 154 |
+
{"source": "raw.customers", "target": "staging.customers", "type": "transform"},
|
| 155 |
+
{"source": "staging.customers", "target": "marts.dim_customer", "type": "transform"}
|
| 156 |
+
]
|
| 157 |
+
}
|
| 158 |
+
}
|
| 159 |
+
```
|
| 160 |
+
|
| 161 |
+
---
|
| 162 |
+
|
| 163 |
+
### 5. ETL Pipeline Format
|
| 164 |
+
|
| 165 |
+
For complex multi-stage ETL pipelines:
|
| 166 |
+
|
| 167 |
+
```json
|
| 168 |
+
{
|
| 169 |
+
"pipeline": {
|
| 170 |
+
"name": "customer_analytics",
|
| 171 |
+
"schedule": "daily"
|
| 172 |
+
},
|
| 173 |
+
"stages": [
|
| 174 |
+
{
|
| 175 |
+
"id": "extract",
|
| 176 |
+
"steps": [
|
| 177 |
+
{"id": "ext_crm", "name": "Extract CRM Data", "inputs": []},
|
| 178 |
+
{"id": "ext_payments", "name": "Extract Payments", "inputs": []}
|
| 179 |
+
]
|
| 180 |
+
},
|
| 181 |
+
{
|
| 182 |
+
"id": "transform",
|
| 183 |
+
"steps": [
|
| 184 |
+
{"id": "tfm_customers", "name": "Transform Customers", "inputs": ["ext_crm", "ext_payments"]}
|
| 185 |
+
]
|
| 186 |
+
},
|
| 187 |
+
{
|
| 188 |
+
"id": "load",
|
| 189 |
+
"steps": [
|
| 190 |
+
{"id": "load_warehouse", "name": "Load to Warehouse", "inputs": ["tfm_customers"]}
|
| 191 |
+
]
|
| 192 |
+
}
|
| 193 |
+
]
|
| 194 |
+
}
|
| 195 |
+
```
|
| 196 |
+
|
| 197 |
+
---
|
| 198 |
+
|
| 199 |
+
## Sample Lineage Examples
|
| 200 |
+
|
| 201 |
+
### Example 1: Simple E-Commerce Lineage
|
| 202 |
+
|
| 203 |
+
**Scenario**: Track data flow from raw transaction data to analytics reports.
|
| 204 |
+
|
| 205 |
+
```
|
| 206 |
+
Source Systems → Raw Layer → Staging → Data Marts → Reports
|
| 207 |
+
```
|
| 208 |
+
|
| 209 |
+
**Input**:
|
| 210 |
+
```json
|
| 211 |
+
{
|
| 212 |
+
"nodes": [
|
| 213 |
+
{"id": "shopify_api", "type": "source", "name": "Shopify API"},
|
| 214 |
+
{"id": "raw_orders", "type": "table", "name": "raw.orders"},
|
| 215 |
+
{"id": "stg_orders", "type": "model", "name": "staging.stg_orders"},
|
| 216 |
+
{"id": "fct_orders", "type": "fact", "name": "marts.fct_orders"},
|
| 217 |
+
{"id": "rpt_daily_sales", "type": "report", "name": "Daily Sales Report"}
|
| 218 |
+
],
|
| 219 |
+
"edges": [
|
| 220 |
+
{"from": "shopify_api", "to": "raw_orders", "type": "ingest"},
|
| 221 |
+
{"from": "raw_orders", "to": "stg_orders", "type": "transform"},
|
| 222 |
+
{"from": "stg_orders", "to": "fct_orders", "type": "transform"},
|
| 223 |
+
{"from": "fct_orders", "to": "rpt_daily_sales", "type": "aggregate"}
|
| 224 |
+
]
|
| 225 |
+
}
|
| 226 |
+
```
|
| 227 |
+
|
| 228 |
+
**Expected Output**: A Mermaid diagram showing the complete data flow with color-coded nodes by type.
|
| 229 |
+
|
| 230 |
+
---
|
| 231 |
+
|
| 232 |
+
### Example 2: Multi-Source Customer 360
|
| 233 |
+
|
| 234 |
+
**Scenario**: Combine data from multiple sources to create a unified customer view.
|
| 235 |
+
|
| 236 |
+
```
|
| 237 |
+
CRM + Payments + Website → Identity Resolution → Customer 360
|
| 238 |
+
```
|
| 239 |
+
|
| 240 |
+
**Input**:
|
| 241 |
+
```json
|
| 242 |
+
{
|
| 243 |
+
"nodes": [
|
| 244 |
+
{"id": "salesforce", "type": "source", "name": "Salesforce CRM"},
|
| 245 |
+
{"id": "stripe", "type": "source", "name": "Stripe Payments"},
|
| 246 |
+
{"id": "ga4", "type": "source", "name": "Google Analytics"},
|
| 247 |
+
{"id": "identity_resolution", "type": "model", "name": "Identity Resolution"},
|
| 248 |
+
{"id": "customer_360", "type": "dimension", "name": "Customer 360"}
|
| 249 |
+
],
|
| 250 |
+
"edges": [
|
| 251 |
+
{"from": "salesforce", "to": "identity_resolution"},
|
| 252 |
+
{"from": "stripe", "to": "identity_resolution"},
|
| 253 |
+
{"from": "ga4", "to": "identity_resolution"},
|
| 254 |
+
{"from": "identity_resolution", "to": "customer_360"}
|
| 255 |
+
]
|
| 256 |
+
}
|
| 257 |
+
```
|
| 258 |
+
|
| 259 |
+
---
|
| 260 |
+
|
| 261 |
+
### Example 3: dbt Project with Multiple Layers
|
| 262 |
+
|
| 263 |
+
**Scenario**: A complete dbt project with staging, intermediate, and mart layers.
|
| 264 |
+
|
| 265 |
+
Load the "dbt Manifest" sample from the dropdown to see a full example with:
|
| 266 |
+
- 4 source tables
|
| 267 |
+
- 4 staging models
|
| 268 |
+
- 2 intermediate models
|
| 269 |
+
- 3 mart tables
|
| 270 |
+
- 2 reporting views
|
| 271 |
+
|
| 272 |
+
---
|
| 273 |
+
|
| 274 |
+
### Example 4: Airflow ETL Pipeline
|
| 275 |
+
|
| 276 |
+
**Scenario**: A daily ETL pipeline with parallel extraction, sequential transformation, and loading.
|
| 277 |
+
|
| 278 |
+
Load the "Airflow DAG" sample to see:
|
| 279 |
+
- Parallel extract tasks
|
| 280 |
+
- Transform tasks with dependencies
|
| 281 |
+
- Load tasks to data warehouse
|
| 282 |
+
- Final notification task
|
| 283 |
+
|
| 284 |
+
---
|
| 285 |
+
|
| 286 |
+
## Export to Data Catalogs
|
| 287 |
+
|
| 288 |
+
The Lineage Graph Accelerator can export lineage to major enterprise data catalogs.
|
| 289 |
+
|
| 290 |
+
### Supported Formats
|
| 291 |
+
|
| 292 |
+
| Format | Platform | Description |
|
| 293 |
+
|--------|----------|-------------|
|
| 294 |
+
| **OpenLineage** | Universal | Open standard, works with Marquez, Atlan, DataHub |
|
| 295 |
+
| **Collibra** | Collibra Data Intelligence | Enterprise data governance platform |
|
| 296 |
+
| **Purview** | Microsoft Purview | Azure native data governance |
|
| 297 |
+
| **Alation** | Alation Data Catalog | Self-service analytics catalog |
|
| 298 |
+
|
| 299 |
+
### How to Export
|
| 300 |
+
|
| 301 |
+
1. **Enter or load your metadata** in the Text/File Metadata tab
|
| 302 |
+
2. **Extract the lineage** to verify it looks correct
|
| 303 |
+
3. **Expand "Export to Data Catalog"** accordion
|
| 304 |
+
4. **Select your format** from the dropdown
|
| 305 |
+
5. **Click "Generate Export"** to create the export file
|
| 306 |
+
6. **Copy or download** the JSON output
|
| 307 |
+
|
| 308 |
+
### Export Format Details
|
| 309 |
+
|
| 310 |
+
#### OpenLineage Export
|
| 311 |
+
|
| 312 |
+
The OpenLineage export follows the [OpenLineage specification](https://openlineage.io/):
|
| 313 |
+
|
| 314 |
+
```json
|
| 315 |
+
{
|
| 316 |
+
"producer": "lineage-accelerator",
|
| 317 |
+
"schemaURL": "https://openlineage.io/spec/1-0-0/OpenLineage.json",
|
| 318 |
+
"events": [
|
| 319 |
+
{
|
| 320 |
+
"eventType": "COMPLETE",
|
| 321 |
+
"job": {"namespace": "...", "name": "..."},
|
| 322 |
+
"inputs": [...],
|
| 323 |
+
"outputs": [...]
|
| 324 |
+
}
|
| 325 |
+
]
|
| 326 |
+
}
|
| 327 |
+
```
|
| 328 |
+
|
| 329 |
+
#### Collibra Export
|
| 330 |
+
|
| 331 |
+
Ready for Collibra's Import API:
|
| 332 |
+
|
| 333 |
+
```json
|
| 334 |
+
{
|
| 335 |
+
"community": {"name": "Data Lineage"},
|
| 336 |
+
"domain": {"name": "Physical Data Dictionary"},
|
| 337 |
+
"assets": [...],
|
| 338 |
+
"relations": [...]
|
| 339 |
+
}
|
| 340 |
+
```
|
| 341 |
+
|
| 342 |
+
#### Microsoft Purview Export
|
| 343 |
+
|
| 344 |
+
Compatible with Purview's bulk import:
|
| 345 |
+
|
| 346 |
+
```json
|
| 347 |
+
{
|
| 348 |
+
"collection": {"referenceName": "lineage-accelerator"},
|
| 349 |
+
"entities": [...],
|
| 350 |
+
"processes": [...]
|
| 351 |
+
}
|
| 352 |
+
```
|
| 353 |
+
|
| 354 |
+
#### Alation Export
|
| 355 |
+
|
| 356 |
+
Ready for Alation's bulk upload:
|
| 357 |
+
|
| 358 |
+
```json
|
| 359 |
+
{
|
| 360 |
+
"datasource": {"id": 1, "title": "..."},
|
| 361 |
+
"tables": [...],
|
| 362 |
+
"columns": [...],
|
| 363 |
+
"lineage": [...],
|
| 364 |
+
"dataflows": [...]
|
| 365 |
+
}
|
| 366 |
+
```
|
| 367 |
+
|
| 368 |
+
---
|
| 369 |
+
|
| 370 |
+
## MCP Server Integration
|
| 371 |
+
|
| 372 |
+
Connect to external MCP (Model Context Protocol) servers for enhanced processing.
|
| 373 |
+
|
| 374 |
+
### What is MCP?
|
| 375 |
+
|
| 376 |
+
MCP (Model Context Protocol) is a standard for AI model integration. The Lineage Graph Accelerator can connect to MCP servers hosted on HuggingFace Spaces for:
|
| 377 |
+
|
| 378 |
+
- Enhanced lineage extraction with AI
|
| 379 |
+
- Support for additional metadata formats
|
| 380 |
+
- Custom processing pipelines
|
| 381 |
+
|
| 382 |
+
### Configuration
|
| 383 |
+
|
| 384 |
+
1. **Expand "MCP Server Configuration"** at the top of the app
|
| 385 |
+
2. **Enter the MCP Server URL**: e.g., `https://your-space.hf.space/mcp`
|
| 386 |
+
3. **Add API Key** (if required)
|
| 387 |
+
4. **Click "Test Connection"** to verify
|
| 388 |
+
|
| 389 |
+
### Example MCP Servers
|
| 390 |
+
|
| 391 |
+
| Server | URL | Description |
|
| 392 |
+
|--------|-----|-------------|
|
| 393 |
+
| Demo Server | `http://localhost:9000/mcp` | Local testing |
|
| 394 |
+
| HuggingFace | `https://your-space.hf.space/mcp` | Production deployment |
|
| 395 |
+
|
| 396 |
+
### Running Your Own MCP Server
|
| 397 |
+
|
| 398 |
+
See `mcp_example/server.py` for a FastAPI-based MCP server example:
|
| 399 |
+
|
| 400 |
+
```bash
|
| 401 |
+
cd mcp_example
|
| 402 |
+
uvicorn server:app --reload --port 9000
|
| 403 |
+
```
|
| 404 |
+
|
| 405 |
+
---
|
| 406 |
+
|
| 407 |
+
## Troubleshooting
|
| 408 |
+
|
| 409 |
+
### Common Issues
|
| 410 |
+
|
| 411 |
+
#### "No data to display"
|
| 412 |
+
|
| 413 |
+
**Cause**: The input metadata couldn't be parsed.
|
| 414 |
+
|
| 415 |
+
**Solutions**:
|
| 416 |
+
1. Verify your JSON is valid (use a JSON validator)
|
| 417 |
+
2. Check that the format matches one of the supported types
|
| 418 |
+
3. Try loading a sample first to see the expected format
|
| 419 |
+
|
| 420 |
+
#### "Export functionality not available"
|
| 421 |
+
|
| 422 |
+
**Cause**: The exporters module isn't loaded.
|
| 423 |
+
|
| 424 |
+
**Solutions**:
|
| 425 |
+
1. Ensure you're running the latest version
|
| 426 |
+
2. Check that the `exporters/` directory exists
|
| 427 |
+
3. Restart the application
|
| 428 |
+
|
| 429 |
+
#### MCP Connection Failed
|
| 430 |
+
|
| 431 |
+
**Cause**: Cannot reach the MCP server.
|
| 432 |
+
|
| 433 |
+
**Solutions**:
|
| 434 |
+
1. Verify the URL is correct
|
| 435 |
+
2. Check if the server is running
|
| 436 |
+
3. Ensure network/firewall allows the connection
|
| 437 |
+
4. Try without the API key first
|
| 438 |
+
|
| 439 |
+
#### Mermaid Diagram Not Rendering
|
| 440 |
+
|
| 441 |
+
**Cause**: JavaScript loading issue.
|
| 442 |
+
|
| 443 |
+
**Solutions**:
|
| 444 |
+
1. Refresh the page
|
| 445 |
+
2. Try a different browser
|
| 446 |
+
3. Check browser console for errors
|
| 447 |
+
4. Ensure JavaScript is enabled
|
| 448 |
+
|
| 449 |
+
### Error Messages
|
| 450 |
+
|
| 451 |
+
| Error | Meaning | Solution |
|
| 452 |
+
|-------|---------|----------|
|
| 453 |
+
| "JSONDecodeError" | Invalid JSON input | Fix JSON syntax |
|
| 454 |
+
| "KeyError" | Missing required field | Check input format |
|
| 455 |
+
| "Timeout" | MCP server slow/unreachable | Increase timeout or check server |
|
| 456 |
+
|
| 457 |
+
---
|
| 458 |
+
|
| 459 |
+
## FAQ
|
| 460 |
+
|
| 461 |
+
### General Questions
|
| 462 |
+
|
| 463 |
+
**Q: What file formats are supported?**
|
| 464 |
+
|
| 465 |
+
A: JSON is the primary format. We also support SQL DDL (with limitations) and can parse dbt manifests, Airflow DAGs, and custom formats.
|
| 466 |
+
|
| 467 |
+
**Q: Can I upload files?**
|
| 468 |
+
|
| 469 |
+
A: Currently, you need to paste content into the text box. File upload is planned for a future release.
|
| 470 |
+
|
| 471 |
+
**Q: Is my data stored?**
|
| 472 |
+
|
| 473 |
+
A: No. All processing happens in your browser session. No data is stored on servers.
|
| 474 |
+
|
| 475 |
+
### Export Questions
|
| 476 |
+
|
| 477 |
+
**Q: Which export format should I use?**
|
| 478 |
+
|
| 479 |
+
A:
|
| 480 |
+
- Use **OpenLineage** for universal compatibility
|
| 481 |
+
- Use **Collibra/Purview/Alation** if you use those specific platforms
|
| 482 |
+
|
| 483 |
+
**Q: Can I customize the export?**
|
| 484 |
+
|
| 485 |
+
A: The current exports use default settings. Advanced customization is available through the API.
|
| 486 |
+
|
| 487 |
+
### Technical Questions
|
| 488 |
+
|
| 489 |
+
**Q: What's the maximum graph size?**
|
| 490 |
+
|
| 491 |
+
A: The UI handles graphs up to ~500 nodes smoothly. Larger graphs may be slow to render.
|
| 492 |
+
|
| 493 |
+
**Q: Can I use this programmatically?**
|
| 494 |
+
|
| 495 |
+
A: Yes! See `integration_example.py` for API usage examples.
|
| 496 |
+
|
| 497 |
+
**Q: Is there a rate limit?**
|
| 498 |
+
|
| 499 |
+
A: The HuggingFace Space has standard rate limits. For heavy usage, deploy your own instance.
|
| 500 |
+
|
| 501 |
+
---
|
| 502 |
+
|
| 503 |
+
## Support
|
| 504 |
+
|
| 505 |
+
- **Issues**: [GitHub Issues](https://github.com/your-repo/issues)
|
| 506 |
+
- **Documentation**: This guide and README.md
|
| 507 |
+
- **Community**: HuggingFace Discussions
|
| 508 |
+
|
| 509 |
+
---
|
| 510 |
+
|
| 511 |
+
## Appendix: Complete Sample Data
|
| 512 |
+
|
| 513 |
+
### E-Commerce Platform (Complex)
|
| 514 |
+
|
| 515 |
+
This sample demonstrates a complete e-commerce analytics platform with:
|
| 516 |
+
- 9 source systems (Shopify, Stripe, GA4, etc.)
|
| 517 |
+
- 50+ nodes across all data layers
|
| 518 |
+
- 80+ lineage relationships
|
| 519 |
+
- Multiple output destinations (BI tools, reverse ETL)
|
| 520 |
+
|
| 521 |
+
Load the "Complex Demo" sample to explore the full graph.
|
| 522 |
+
|
| 523 |
+
### Node Types Reference
|
| 524 |
+
|
| 525 |
+
| Type | Color | Description |
|
| 526 |
+
|------|-------|-------------|
|
| 527 |
+
| `source` | Light Blue | External data sources |
|
| 528 |
+
| `table` | Light Green | Database tables |
|
| 529 |
+
| `view` | Light Purple | Database views |
|
| 530 |
+
| `model` | Light Orange | Transformation models |
|
| 531 |
+
| `report` | Light Pink | Reports and dashboards |
|
| 532 |
+
| `dimension` | Cyan | Dimension tables |
|
| 533 |
+
| `fact` | Light Yellow | Fact tables |
|
| 534 |
+
| `destination` | Light Red | Output destinations |
|
| 535 |
+
|
| 536 |
+
### Edge Types Reference
|
| 537 |
+
|
| 538 |
+
| Type | Arrow | Description |
|
| 539 |
+
|------|-------|-------------|
|
| 540 |
+
| `transform` | `-->` | Data transformation |
|
| 541 |
+
| `reference` | `-.->` | Reference/lookup |
|
| 542 |
+
| `ingest` | `-->` | Data ingestion |
|
| 543 |
+
| `export` | `-->` | Data export |
|
| 544 |
+
| `join` | `-->` | Table join |
|
| 545 |
+
| `aggregate` | `-->` | Aggregation |
|
| 546 |
+
|
| 547 |
+
---
|
| 548 |
+
|
| 549 |
+
*Last updated: November 2025*
|
| 550 |
+
*Version: 1.0.0*
|
app.py
CHANGED
|
@@ -1,24 +1,52 @@
|
|
| 1 |
"""
|
| 2 |
-
Lineage Graph
|
| 3 |
-
A Gradio-based
|
|
|
|
|
|
|
| 4 |
"""
|
| 5 |
|
| 6 |
import gradio as gr
|
| 7 |
import json
|
| 8 |
import os
|
| 9 |
import requests
|
| 10 |
-
from typing import Optional, Tuple
|
|
|
|
| 11 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 12 |
|
| 13 |
-
|
| 14 |
-
|
|
|
|
| 15 |
|
| 16 |
-
|
| 17 |
-
|
| 18 |
-
"""
|
| 19 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 20 |
safe_viz = viz_code.replace("<", "<").replace(">", ">")
|
| 21 |
-
# Script will wait for mermaid to be available then initialize diagrams.
|
| 22 |
init_script = (
|
| 23 |
"<script>"
|
| 24 |
"(function(){"
|
|
@@ -28,14 +56,243 @@ def render_mermaid(viz_code: str) -> str:
|
|
| 28 |
" } run();})();"
|
| 29 |
"</script>"
|
| 30 |
)
|
| 31 |
-
return f"
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 32 |
|
| 33 |
|
| 34 |
-
|
| 35 |
-
|
|
|
|
| 36 |
|
| 37 |
-
|
| 38 |
-
"""
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 39 |
if not server_url:
|
| 40 |
return "", "No MCP server URL configured."
|
| 41 |
try:
|
|
@@ -44,18 +301,17 @@ def send_to_mcp(server_url: str, api_key: str, metadata_text: str, source_type:
|
|
| 44 |
"source_type": source_type,
|
| 45 |
"viz_format": viz_format,
|
| 46 |
}
|
| 47 |
-
headers = {}
|
| 48 |
if api_key:
|
| 49 |
headers["Authorization"] = f"Bearer {api_key}"
|
| 50 |
-
resp = requests.post(server_url, json=payload, headers=headers, timeout=
|
| 51 |
-
if
|
| 52 |
data = resp.json()
|
| 53 |
viz = data.get("visualization") or data.get("viz") or data.get("mermaid", "")
|
| 54 |
summary = data.get("summary", "Processed by MCP server.")
|
| 55 |
if viz:
|
| 56 |
return render_mermaid(viz), summary
|
| 57 |
-
|
| 58 |
-
return "", summary
|
| 59 |
else:
|
| 60 |
return "", f"MCP server returned status {resp.status_code}: {resp.text[:200]}"
|
| 61 |
except Exception as e:
|
|
@@ -63,193 +319,281 @@ def send_to_mcp(server_url: str, api_key: str, metadata_text: str, source_type:
|
|
| 63 |
|
| 64 |
|
| 65 |
def test_mcp_connection(server_url: str, api_key: str) -> str:
|
| 66 |
-
"""
|
| 67 |
if not server_url:
|
| 68 |
return "No MCP server URL configured."
|
| 69 |
try:
|
| 70 |
headers = {}
|
| 71 |
if api_key:
|
| 72 |
headers["Authorization"] = f"Bearer {api_key}"
|
| 73 |
-
resp = requests.get(server_url,
|
|
|
|
| 74 |
return f"MCP server responded: {resp.status_code} {resp.reason}"
|
| 75 |
except Exception as e:
|
| 76 |
return f"Error contacting MCP server: {e}"
|
| 77 |
|
| 78 |
|
| 79 |
-
#
|
| 80 |
-
|
| 81 |
-
|
| 82 |
-
viz, summary = send_to_mcp(mcp_server, mcp_api_key, metadata_text, source_type, visualization_format)
|
| 83 |
-
# If MCP returned something, use it. Otherwise fall back to local.
|
| 84 |
-
if viz or (summary and not summary.startswith("Error")):
|
| 85 |
-
return viz, summary
|
| 86 |
-
return extract_lineage_from_text(metadata_text, source_type, visualization_format)
|
| 87 |
|
|
|
|
|
|
|
|
|
|
|
|
|
| 88 |
|
| 89 |
-
|
| 90 |
-
|
| 91 |
-
# Send query as metadata to MCP; source_type indicates BigQuery
|
| 92 |
-
viz, summary = send_to_mcp(mcp_server, mcp_api_key, query, "BigQuery", visualization_format)
|
| 93 |
-
if viz or (summary and not summary.startswith("Error")):
|
| 94 |
-
return viz, summary
|
| 95 |
-
return extract_lineage_from_bigquery(project_id, query, api_key, visualization_format)
|
| 96 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 97 |
|
| 98 |
-
|
| 99 |
-
|
| 100 |
-
|
| 101 |
-
|
| 102 |
-
|
| 103 |
-
|
| 104 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 105 |
|
| 106 |
-
|
| 107 |
-
#
|
|
|
|
|
|
|
| 108 |
|
| 109 |
def extract_lineage_from_text(
|
| 110 |
metadata_text: str,
|
| 111 |
source_type: str,
|
| 112 |
-
visualization_format: str
|
|
|
|
|
|
|
| 113 |
) -> Tuple[str, str]:
|
| 114 |
-
"""
|
| 115 |
-
|
| 116 |
-
|
| 117 |
-
|
| 118 |
-
|
| 119 |
-
|
| 120 |
-
|
| 121 |
-
|
| 122 |
-
|
| 123 |
-
|
| 124 |
-
|
| 125 |
-
|
| 126 |
-
|
| 127 |
-
|
| 128 |
-
|
| 129 |
-
|
| 130 |
-
|
| 131 |
-
|
| 132 |
-
|
|
|
|
| 133 |
|
| 134 |
def extract_lineage_from_bigquery(
|
| 135 |
project_id: str,
|
| 136 |
query: str,
|
| 137 |
api_key: str,
|
| 138 |
-
visualization_format: str
|
|
|
|
|
|
|
| 139 |
) -> Tuple[str, str]:
|
| 140 |
-
"""
|
| 141 |
-
|
| 142 |
-
|
| 143 |
-
|
| 144 |
-
|
| 145 |
-
|
| 146 |
-
|
| 147 |
-
|
| 148 |
-
|
| 149 |
-
|
| 150 |
-
|
| 151 |
-
|
| 152 |
-
|
| 153 |
-
|
| 154 |
-
|
| 155 |
-
return (
|
| 156 |
-
|
| 157 |
-
f"Extracted lineage from BigQuery project: {project_id}"
|
| 158 |
-
)
|
| 159 |
|
| 160 |
def extract_lineage_from_url(
|
| 161 |
url: str,
|
| 162 |
-
visualization_format: str
|
|
|
|
|
|
|
| 163 |
) -> Tuple[str, str]:
|
| 164 |
-
"""
|
| 165 |
-
|
| 166 |
-
|
| 167 |
-
|
| 168 |
-
|
| 169 |
-
|
| 170 |
-
|
| 171 |
-
|
| 172 |
-
|
| 173 |
-
|
| 174 |
-
|
| 175 |
-
|
| 176 |
-
|
| 177 |
-
|
| 178 |
-
|
| 179 |
-
|
| 180 |
-
)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 181 |
|
| 182 |
-
#
|
| 183 |
-
with gr.Blocks(
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 184 |
gr.Markdown("""
|
| 185 |
-
#
|
| 186 |
-
|
| 187 |
-
|
| 188 |
-
|
| 189 |
-
|
| 190 |
-
|
| 191 |
-
|
| 192 |
-
|
| 193 |
-
- **URLs/APIs**: Fetch metadata from web endpoints
|
| 194 |
-
- **dbt, Airflow, Snowflake**: Through MCP integration (when configured)
|
| 195 |
""")
|
| 196 |
-
|
| 197 |
-
#
|
| 198 |
-
|
| 199 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 200 |
visible=False
|
| 201 |
)
|
| 202 |
-
|
|
|
|
| 203 |
with gr.Tabs():
|
| 204 |
# Tab 1: Text/File Input
|
| 205 |
-
with gr.Tab("Text/File Metadata"):
|
| 206 |
with gr.Row():
|
| 207 |
-
with gr.Column():
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 208 |
metadata_input = gr.Textbox(
|
| 209 |
label="Metadata Content",
|
| 210 |
-
placeholder="Paste your metadata here (JSON, YAML, SQL, etc.)",
|
| 211 |
-
lines=
|
| 212 |
-
)
|
| 213 |
-
load_sample_text_btn = gr.Button("Load sample metadata")
|
| 214 |
-
source_type_text = gr.Dropdown(
|
| 215 |
-
choices=["dbt Manifest", "Airflow DAG", "SQL DDL", "Custom JSON", "Other"],
|
| 216 |
-
label="Source Type",
|
| 217 |
-
value="Custom JSON"
|
| 218 |
-
)
|
| 219 |
-
viz_format_text = gr.Dropdown(
|
| 220 |
-
choices=["Mermaid", "DOT/Graphviz", "Text", "All"],
|
| 221 |
-
label="Visualization Format",
|
| 222 |
-
value="Mermaid"
|
| 223 |
-
)
|
| 224 |
-
extract_btn_text = gr.Button("Extract Lineage", variant="primary")
|
| 225 |
-
|
| 226 |
-
with gr.Column():
|
| 227 |
-
output_viz_text = gr.HTML(
|
| 228 |
-
value="",
|
| 229 |
-
label="Lineage Visualization"
|
| 230 |
)
|
| 231 |
-
|
| 232 |
-
|
| 233 |
-
|
| 234 |
-
|
| 235 |
-
|
| 236 |
-
|
| 237 |
-
|
| 238 |
-
|
| 239 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 240 |
)
|
| 241 |
-
|
| 242 |
-
p = os.path.join(os.path.dirname(__file__), "samples", "sample_metadata.json")
|
| 243 |
-
try:
|
| 244 |
-
with open(p, "r") as f:
|
| 245 |
-
return f.read()
|
| 246 |
-
except Exception:
|
| 247 |
-
return "{\"error\": \"Could not load sample metadata\"}"
|
| 248 |
-
|
| 249 |
-
load_sample_text_btn.click(fn=load_sample_text, inputs=[], outputs=[metadata_input])
|
| 250 |
-
|
| 251 |
# Tab 2: BigQuery
|
| 252 |
-
with gr.Tab("BigQuery"):
|
| 253 |
with gr.Row():
|
| 254 |
with gr.Column():
|
| 255 |
bq_project = gr.Textbox(
|
|
@@ -259,104 +603,119 @@ with gr.Blocks(title="Lineage Graph Extractor", theme=gr.themes.Soft()) as demo:
|
|
| 259 |
bq_query = gr.Textbox(
|
| 260 |
label="Metadata Query",
|
| 261 |
placeholder="SELECT * FROM `project.dataset.INFORMATION_SCHEMA.TABLES`",
|
| 262 |
-
lines=
|
| 263 |
)
|
| 264 |
-
|
| 265 |
-
|
| 266 |
-
label="
|
| 267 |
-
placeholder="Enter your credentials",
|
| 268 |
type="password"
|
| 269 |
)
|
| 270 |
-
|
| 271 |
-
choices=["Mermaid", "DOT/Graphviz", "Text"
|
| 272 |
label="Visualization Format",
|
| 273 |
value="Mermaid"
|
| 274 |
)
|
| 275 |
-
|
| 276 |
-
|
| 277 |
with gr.Column():
|
| 278 |
-
|
| 279 |
-
|
| 280 |
-
|
| 281 |
-
|
| 282 |
-
|
| 283 |
-
|
| 284 |
-
lines=5
|
| 285 |
-
)
|
| 286 |
-
|
| 287 |
-
extract_btn_bq.click(
|
| 288 |
-
fn=handle_extract_bigquery,
|
| 289 |
-
inputs=[bq_project, bq_query, bq_api_key, viz_format_bq, mcp_server, mcp_api_key],
|
| 290 |
-
outputs=[output_viz_bq, output_summary_bq]
|
| 291 |
)
|
| 292 |
-
|
| 293 |
-
|
| 294 |
-
|
| 295 |
-
|
| 296 |
-
|
| 297 |
-
|
| 298 |
-
|
| 299 |
-
|
| 300 |
-
load_sample_bq_btn.click(fn=load_sample_bq, inputs=[], outputs=[bq_query])
|
| 301 |
-
|
| 302 |
# Tab 3: URL/API
|
| 303 |
-
with gr.Tab("URL/API"):
|
| 304 |
with gr.Row():
|
| 305 |
with gr.Column():
|
| 306 |
url_input = gr.Textbox(
|
| 307 |
-
label="URL",
|
| 308 |
placeholder="https://api.example.com/metadata"
|
| 309 |
)
|
| 310 |
-
|
| 311 |
-
|
| 312 |
-
choices=["Mermaid", "DOT/Graphviz", "Text"
|
| 313 |
label="Visualization Format",
|
| 314 |
value="Mermaid"
|
| 315 |
)
|
| 316 |
-
|
| 317 |
-
|
| 318 |
with gr.Column():
|
| 319 |
-
|
| 320 |
-
|
| 321 |
-
|
| 322 |
-
|
| 323 |
-
|
| 324 |
-
|
| 325 |
-
|
| 326 |
-
|
| 327 |
-
|
| 328 |
-
|
| 329 |
-
|
| 330 |
-
|
| 331 |
-
outputs=[output_viz_url, output_summary_url]
|
| 332 |
)
|
| 333 |
-
|
| 334 |
-
|
| 335 |
-
|
| 336 |
-
|
| 337 |
-
|
| 338 |
-
|
| 339 |
-
|
| 340 |
-
|
| 341 |
-
|
| 342 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 343 |
gr.Markdown("""
|
| 344 |
---
|
| 345 |
-
|
| 346 |
-
|
| 347 |
-
|
| 348 |
-
|
| 349 |
-
|
| 350 |
-
|
| 351 |
-
|
| 352 |
-
|
| 353 |
-
|
| 354 |
-
|
| 355 |
-
|
| 356 |
-
|
| 357 |
-
|
|
|
|
|
|
|
| 358 |
""")
|
| 359 |
|
| 360 |
-
# Launch
|
| 361 |
if __name__ == "__main__":
|
| 362 |
demo.launch()
|
|
|
|
| 1 |
"""
|
| 2 |
+
Lineage Graph Accelerator - Hugging Face Space
|
| 3 |
+
A Gradio-based AI agent for extracting and visualizing data lineage from various sources.
|
| 4 |
+
|
| 5 |
+
Built for the Gradio Agents & MCP Hackathon - Winter 2025
|
| 6 |
"""
|
| 7 |
|
| 8 |
import gradio as gr
|
| 9 |
import json
|
| 10 |
import os
|
| 11 |
import requests
|
| 12 |
+
from typing import Optional, Tuple, Dict, Any, List
|
| 13 |
+
from datetime import datetime
|
| 14 |
|
| 15 |
+
# Import exporters
|
| 16 |
+
try:
|
| 17 |
+
from exporters import (
|
| 18 |
+
LineageGraph, LineageNode, LineageEdge,
|
| 19 |
+
OpenLineageExporter, CollibraExporter, PurviewExporter, AlationExporter
|
| 20 |
+
)
|
| 21 |
+
EXPORTERS_AVAILABLE = True
|
| 22 |
+
except ImportError:
|
| 23 |
+
EXPORTERS_AVAILABLE = False
|
| 24 |
|
| 25 |
+
# ============================================================================
|
| 26 |
+
# Constants and Configuration
|
| 27 |
+
# ============================================================================
|
| 28 |
|
| 29 |
+
SAMPLE_FILES = {
|
| 30 |
+
"simple": "sample_metadata.json",
|
| 31 |
+
"dbt": "dbt_manifest_sample.json",
|
| 32 |
+
"airflow": "airflow_dag_sample.json",
|
| 33 |
+
"sql": "sql_ddl_sample.sql",
|
| 34 |
+
"warehouse": "warehouse_lineage_sample.json",
|
| 35 |
+
"etl": "etl_pipeline_sample.json",
|
| 36 |
+
"complex": "complex_lineage_demo.json",
|
| 37 |
+
"api": "sample_api_metadata.json",
|
| 38 |
+
"bigquery": "sample_bigquery.sql"
|
| 39 |
+
}
|
| 40 |
+
|
| 41 |
+
EXPORT_FORMATS = ["OpenLineage", "Collibra", "Purview", "Alation"]
|
| 42 |
+
|
| 43 |
+
# ============================================================================
|
| 44 |
+
# Mermaid Rendering
|
| 45 |
+
# ============================================================================
|
| 46 |
+
|
| 47 |
+
def render_mermaid(viz_code: str) -> str:
|
| 48 |
+
"""Wrap mermaid source in HTML and initialize mermaid when the HTML is inserted."""
|
| 49 |
safe_viz = viz_code.replace("<", "<").replace(">", ">")
|
|
|
|
| 50 |
init_script = (
|
| 51 |
"<script>"
|
| 52 |
"(function(){"
|
|
|
|
| 56 |
" } run();})();"
|
| 57 |
"</script>"
|
| 58 |
)
|
| 59 |
+
return f"""
|
| 60 |
+
<div style="background: white; padding: 20px; border-radius: 8px; overflow: auto;">
|
| 61 |
+
<div class="mermaid">{safe_viz}</div>
|
| 62 |
+
</div>
|
| 63 |
+
{init_script}
|
| 64 |
+
"""
|
| 65 |
|
| 66 |
|
| 67 |
+
# ============================================================================
|
| 68 |
+
# Lineage Parsing and Visualization Generation
|
| 69 |
+
# ============================================================================
|
| 70 |
|
| 71 |
+
def parse_metadata_to_graph(metadata_text: str, source_type: str) -> Tuple[LineageGraph, str]:
|
| 72 |
+
"""Parse metadata text into a LineageGraph structure."""
|
| 73 |
+
try:
|
| 74 |
+
# Try to parse as JSON first
|
| 75 |
+
if metadata_text.strip().startswith('{') or metadata_text.strip().startswith('['):
|
| 76 |
+
data = json.loads(metadata_text)
|
| 77 |
+
else:
|
| 78 |
+
# For SQL or other text formats, create a simple structure
|
| 79 |
+
data = {"raw_content": metadata_text, "source_type": source_type}
|
| 80 |
+
|
| 81 |
+
graph = LineageGraph(name=f"Lineage from {source_type}")
|
| 82 |
+
|
| 83 |
+
# Handle different formats
|
| 84 |
+
if "lineage_graph" in data:
|
| 85 |
+
# Complex lineage demo format
|
| 86 |
+
lg = data["lineage_graph"]
|
| 87 |
+
for node_data in lg.get("nodes", []):
|
| 88 |
+
node = LineageNode(
|
| 89 |
+
id=node_data.get("id"),
|
| 90 |
+
name=node_data.get("name"),
|
| 91 |
+
type=node_data.get("type", "table"),
|
| 92 |
+
category=node_data.get("category"),
|
| 93 |
+
description=node_data.get("description"),
|
| 94 |
+
metadata=node_data.get("metadata"),
|
| 95 |
+
tags=node_data.get("tags")
|
| 96 |
+
)
|
| 97 |
+
graph.add_node(node)
|
| 98 |
+
for edge_data in lg.get("edges", []):
|
| 99 |
+
edge = LineageEdge(
|
| 100 |
+
source=edge_data.get("from"),
|
| 101 |
+
target=edge_data.get("to"),
|
| 102 |
+
type=edge_data.get("type", "transform")
|
| 103 |
+
)
|
| 104 |
+
graph.add_edge(edge)
|
| 105 |
+
|
| 106 |
+
elif "nodes" in data and "edges" in data:
|
| 107 |
+
# Simple node/edge format
|
| 108 |
+
for node_data in data.get("nodes", []):
|
| 109 |
+
node = LineageNode(
|
| 110 |
+
id=node_data.get("id"),
|
| 111 |
+
name=node_data.get("name", node_data.get("id")),
|
| 112 |
+
type=node_data.get("type", "table")
|
| 113 |
+
)
|
| 114 |
+
graph.add_node(node)
|
| 115 |
+
for edge_data in data.get("edges", []):
|
| 116 |
+
edge = LineageEdge(
|
| 117 |
+
source=edge_data.get("from"),
|
| 118 |
+
target=edge_data.get("to"),
|
| 119 |
+
type=edge_data.get("type", "transform")
|
| 120 |
+
)
|
| 121 |
+
graph.add_edge(edge)
|
| 122 |
+
|
| 123 |
+
elif "nodes" in data:
|
| 124 |
+
# dbt manifest format
|
| 125 |
+
for node_id, node_data in data.get("nodes", {}).items():
|
| 126 |
+
node = LineageNode(
|
| 127 |
+
id=node_id,
|
| 128 |
+
name=node_data.get("name", node_id.split(".")[-1]),
|
| 129 |
+
type=node_data.get("resource_type", "model"),
|
| 130 |
+
schema=node_data.get("schema"),
|
| 131 |
+
database=node_data.get("database"),
|
| 132 |
+
description=node_data.get("description")
|
| 133 |
+
)
|
| 134 |
+
graph.add_node(node)
|
| 135 |
+
# Add edges from depends_on
|
| 136 |
+
deps = node_data.get("depends_on", {}).get("nodes", [])
|
| 137 |
+
for dep in deps:
|
| 138 |
+
edge = LineageEdge(source=dep, target=node_id, type="transform")
|
| 139 |
+
graph.add_edge(edge)
|
| 140 |
+
|
| 141 |
+
elif "tasks" in data:
|
| 142 |
+
# Airflow DAG format
|
| 143 |
+
for task in data.get("tasks", []):
|
| 144 |
+
node = LineageNode(
|
| 145 |
+
id=task.get("task_id"),
|
| 146 |
+
name=task.get("task_id"),
|
| 147 |
+
type="task",
|
| 148 |
+
description=task.get("description")
|
| 149 |
+
)
|
| 150 |
+
graph.add_node(node)
|
| 151 |
+
# Add edges from upstream dependencies
|
| 152 |
+
for dep in task.get("upstream_dependencies", []):
|
| 153 |
+
edge = LineageEdge(source=dep, target=task.get("task_id"), type="dependency")
|
| 154 |
+
graph.add_edge(edge)
|
| 155 |
+
|
| 156 |
+
elif "lineage" in data:
|
| 157 |
+
# Warehouse lineage format
|
| 158 |
+
lineage = data.get("lineage", {})
|
| 159 |
+
for dataset in lineage.get("datasets", []):
|
| 160 |
+
node = LineageNode(
|
| 161 |
+
id=dataset.get("id"),
|
| 162 |
+
name=dataset.get("name", dataset.get("id")),
|
| 163 |
+
type=dataset.get("type", "table"),
|
| 164 |
+
schema=dataset.get("schema"),
|
| 165 |
+
database=dataset.get("database"),
|
| 166 |
+
description=dataset.get("description"),
|
| 167 |
+
owner=dataset.get("owner"),
|
| 168 |
+
tags=dataset.get("tags")
|
| 169 |
+
)
|
| 170 |
+
graph.add_node(node)
|
| 171 |
+
for rel in lineage.get("relationships", []):
|
| 172 |
+
edge = LineageEdge(
|
| 173 |
+
source=rel.get("source"),
|
| 174 |
+
target=rel.get("target"),
|
| 175 |
+
type=rel.get("type", "transform"),
|
| 176 |
+
job_name=rel.get("job")
|
| 177 |
+
)
|
| 178 |
+
graph.add_edge(edge)
|
| 179 |
+
|
| 180 |
+
elif "stages" in data:
|
| 181 |
+
# ETL pipeline format
|
| 182 |
+
for stage in data.get("stages", []):
|
| 183 |
+
for step in stage.get("steps", []):
|
| 184 |
+
node = LineageNode(
|
| 185 |
+
id=step.get("id"),
|
| 186 |
+
name=step.get("name", step.get("id")),
|
| 187 |
+
type="step",
|
| 188 |
+
category=stage.get("id"),
|
| 189 |
+
description=step.get("description") or step.get("logic")
|
| 190 |
+
)
|
| 191 |
+
graph.add_node(node)
|
| 192 |
+
# Add edges from inputs
|
| 193 |
+
for inp in step.get("inputs", []):
|
| 194 |
+
edge = LineageEdge(source=inp, target=step.get("id"), type="transform")
|
| 195 |
+
graph.add_edge(edge)
|
| 196 |
+
else:
|
| 197 |
+
# Fallback: create sample nodes
|
| 198 |
+
graph.add_node(LineageNode(id="source", name="Source", type="source"))
|
| 199 |
+
graph.add_node(LineageNode(id="target", name="Target", type="table"))
|
| 200 |
+
graph.add_edge(LineageEdge(source="source", target="target", type="transform"))
|
| 201 |
+
|
| 202 |
+
summary = f"Parsed {len(graph.nodes)} nodes and {len(graph.edges)} relationships from {source_type}"
|
| 203 |
+
return graph, summary
|
| 204 |
+
|
| 205 |
+
except json.JSONDecodeError as e:
|
| 206 |
+
# Handle SQL or plain text
|
| 207 |
+
graph = LineageGraph(name=f"Lineage from {source_type}")
|
| 208 |
+
graph.add_node(LineageNode(id="input", name="Input Data", type="source"))
|
| 209 |
+
graph.add_node(LineageNode(id="output", name="Output Data", type="table"))
|
| 210 |
+
graph.add_edge(LineageEdge(source="input", target="output", type="transform"))
|
| 211 |
+
return graph, f"Created placeholder lineage (could not parse as JSON: {str(e)[:50]})"
|
| 212 |
+
except Exception as e:
|
| 213 |
+
graph = LineageGraph(name="Error")
|
| 214 |
+
return graph, f"Error parsing metadata: {str(e)}"
|
| 215 |
+
|
| 216 |
+
|
| 217 |
+
def generate_mermaid_from_graph(graph: LineageGraph) -> str:
|
| 218 |
+
"""Generate Mermaid diagram code from a LineageGraph."""
|
| 219 |
+
if not graph.nodes:
|
| 220 |
+
return "graph TD\n A[No data to display]"
|
| 221 |
+
|
| 222 |
+
lines = ["graph TD"]
|
| 223 |
+
|
| 224 |
+
# Group nodes by category for subgraphs
|
| 225 |
+
categories = {}
|
| 226 |
+
for node in graph.nodes:
|
| 227 |
+
cat = node.category or "default"
|
| 228 |
+
if cat not in categories:
|
| 229 |
+
categories[cat] = []
|
| 230 |
+
categories[cat].append(node)
|
| 231 |
+
|
| 232 |
+
# Generate nodes with styling
|
| 233 |
+
node_styles = {
|
| 234 |
+
"source": "fill:#e1f5fe",
|
| 235 |
+
"external_api": "fill:#e1f5fe",
|
| 236 |
+
"table": "fill:#e8f5e9",
|
| 237 |
+
"view": "fill:#f3e5f5",
|
| 238 |
+
"model": "fill:#fff3e0",
|
| 239 |
+
"report": "fill:#fce4ec",
|
| 240 |
+
"dimension": "fill:#e0f7fa",
|
| 241 |
+
"fact": "fill:#fff8e1",
|
| 242 |
+
"destination": "fill:#ffebee",
|
| 243 |
+
"task": "fill:#f5f5f5"
|
| 244 |
+
}
|
| 245 |
+
|
| 246 |
+
# Add subgraphs for categories
|
| 247 |
+
if len(categories) > 1:
|
| 248 |
+
for cat, nodes in categories.items():
|
| 249 |
+
if cat != "default":
|
| 250 |
+
lines.append(f" subgraph {cat.replace('_', ' ').title()}")
|
| 251 |
+
for node in nodes:
|
| 252 |
+
shape = f"[{node.name}]" if node.type in ["table", "model"] else f"({node.name})"
|
| 253 |
+
lines.append(f" {node.id}{shape}")
|
| 254 |
+
lines.append(" end")
|
| 255 |
+
else:
|
| 256 |
+
for node in nodes:
|
| 257 |
+
shape = f"[{node.name}]" if node.type in ["table", "model"] else f"({node.name})"
|
| 258 |
+
lines.append(f" {node.id}{shape}")
|
| 259 |
+
else:
|
| 260 |
+
for node in graph.nodes:
|
| 261 |
+
shape = f"[{node.name}]" if node.type in ["table", "model"] else f"({node.name})"
|
| 262 |
+
lines.append(f" {node.id}{shape}")
|
| 263 |
+
|
| 264 |
+
# Add edges
|
| 265 |
+
edge_labels = {
|
| 266 |
+
"transform": "-->",
|
| 267 |
+
"reference": "-.->",
|
| 268 |
+
"ingest": "-->",
|
| 269 |
+
"export": "-->",
|
| 270 |
+
"join": "-->",
|
| 271 |
+
"aggregate": "-->",
|
| 272 |
+
"dependency": "-->"
|
| 273 |
+
}
|
| 274 |
+
|
| 275 |
+
for edge in graph.edges:
|
| 276 |
+
arrow = edge_labels.get(edge.type, "-->")
|
| 277 |
+
if edge.type and edge.type not in ["transform", "dependency"]:
|
| 278 |
+
lines.append(f" {edge.source} {arrow}|{edge.type}| {edge.target}")
|
| 279 |
+
else:
|
| 280 |
+
lines.append(f" {edge.source} {arrow} {edge.target}")
|
| 281 |
+
|
| 282 |
+
# Add styling
|
| 283 |
+
for node in graph.nodes:
|
| 284 |
+
style = node_styles.get(node.type, "fill:#f5f5f5")
|
| 285 |
+
lines.append(f" style {node.id} {style}")
|
| 286 |
+
|
| 287 |
+
return "\n".join(lines)
|
| 288 |
+
|
| 289 |
+
|
| 290 |
+
# ============================================================================
|
| 291 |
+
# MCP Server Integration
|
| 292 |
+
# ============================================================================
|
| 293 |
+
|
| 294 |
+
def send_to_mcp(server_url: str, api_key: str, metadata_text: str, source_type: str, viz_format: str) -> Tuple[str, str]:
|
| 295 |
+
"""Send metadata to an external MCP server and return visualization + summary."""
|
| 296 |
if not server_url:
|
| 297 |
return "", "No MCP server URL configured."
|
| 298 |
try:
|
|
|
|
| 301 |
"source_type": source_type,
|
| 302 |
"viz_format": viz_format,
|
| 303 |
}
|
| 304 |
+
headers = {"Content-Type": "application/json"}
|
| 305 |
if api_key:
|
| 306 |
headers["Authorization"] = f"Bearer {api_key}"
|
| 307 |
+
resp = requests.post(server_url, json=payload, headers=headers, timeout=30)
|
| 308 |
+
if 200 <= resp.status_code < 300:
|
| 309 |
data = resp.json()
|
| 310 |
viz = data.get("visualization") or data.get("viz") or data.get("mermaid", "")
|
| 311 |
summary = data.get("summary", "Processed by MCP server.")
|
| 312 |
if viz:
|
| 313 |
return render_mermaid(viz), summary
|
| 314 |
+
return "", summary
|
|
|
|
| 315 |
else:
|
| 316 |
return "", f"MCP server returned status {resp.status_code}: {resp.text[:200]}"
|
| 317 |
except Exception as e:
|
|
|
|
| 319 |
|
| 320 |
|
| 321 |
def test_mcp_connection(server_url: str, api_key: str) -> str:
|
| 322 |
+
"""Health-check to MCP server."""
|
| 323 |
if not server_url:
|
| 324 |
return "No MCP server URL configured."
|
| 325 |
try:
|
| 326 |
headers = {}
|
| 327 |
if api_key:
|
| 328 |
headers["Authorization"] = f"Bearer {api_key}"
|
| 329 |
+
resp = requests.get(server_url.replace("/mcp", "/health").replace("/api", "/health"),
|
| 330 |
+
headers=headers, timeout=10)
|
| 331 |
return f"MCP server responded: {resp.status_code} {resp.reason}"
|
| 332 |
except Exception as e:
|
| 333 |
return f"Error contacting MCP server: {e}"
|
| 334 |
|
| 335 |
|
| 336 |
+
# ============================================================================
|
| 337 |
+
# Export Functions
|
| 338 |
+
# ============================================================================
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 339 |
|
| 340 |
+
def export_lineage(metadata_text: str, source_type: str, export_format: str) -> Tuple[str, str]:
|
| 341 |
+
"""Export lineage to the specified data catalog format."""
|
| 342 |
+
if not EXPORTERS_AVAILABLE:
|
| 343 |
+
return "", "Export functionality not available. Please install the exporters module."
|
| 344 |
|
| 345 |
+
try:
|
| 346 |
+
graph, _ = parse_metadata_to_graph(metadata_text, source_type)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 347 |
|
| 348 |
+
if export_format == "OpenLineage":
|
| 349 |
+
exporter = OpenLineageExporter(graph)
|
| 350 |
+
elif export_format == "Collibra":
|
| 351 |
+
exporter = CollibraExporter(graph)
|
| 352 |
+
elif export_format == "Purview":
|
| 353 |
+
exporter = PurviewExporter(graph)
|
| 354 |
+
elif export_format == "Alation":
|
| 355 |
+
exporter = AlationExporter(graph)
|
| 356 |
+
else:
|
| 357 |
+
return "", f"Unknown export format: {export_format}"
|
| 358 |
|
| 359 |
+
exported_content = exporter.export()
|
| 360 |
+
filename = f"lineage_export_{export_format.lower()}_{datetime.now().strftime('%Y%m%d_%H%M%S')}.json"
|
| 361 |
+
|
| 362 |
+
return exported_content, f"Exported to {export_format} format. Download the file below."
|
| 363 |
+
|
| 364 |
+
except Exception as e:
|
| 365 |
+
return "", f"Export error: {str(e)}"
|
| 366 |
+
|
| 367 |
+
|
| 368 |
+
# ============================================================================
|
| 369 |
+
# Sample Data Loading
|
| 370 |
+
# ============================================================================
|
| 371 |
+
|
| 372 |
+
def load_sample(sample_type: str) -> str:
|
| 373 |
+
"""Load a sample file."""
|
| 374 |
+
filename = SAMPLE_FILES.get(sample_type)
|
| 375 |
+
if not filename:
|
| 376 |
+
return json.dumps({"error": f"Unknown sample type: {sample_type}"})
|
| 377 |
+
|
| 378 |
+
filepath = os.path.join(os.path.dirname(__file__), "samples", filename)
|
| 379 |
+
try:
|
| 380 |
+
with open(filepath, "r") as f:
|
| 381 |
+
return f.read()
|
| 382 |
+
except Exception as e:
|
| 383 |
+
return json.dumps({"error": f"Could not load sample: {str(e)}"})
|
| 384 |
|
| 385 |
+
|
| 386 |
+
# ============================================================================
|
| 387 |
+
# Main Extraction Handlers
|
| 388 |
+
# ============================================================================
|
| 389 |
|
| 390 |
def extract_lineage_from_text(
|
| 391 |
metadata_text: str,
|
| 392 |
source_type: str,
|
| 393 |
+
visualization_format: str,
|
| 394 |
+
mcp_server: str = "",
|
| 395 |
+
mcp_api_key: str = ""
|
| 396 |
) -> Tuple[str, str]:
|
| 397 |
+
"""Extract lineage from provided metadata text."""
|
| 398 |
+
# Try MCP server first if configured
|
| 399 |
+
if mcp_server:
|
| 400 |
+
viz, summary = send_to_mcp(mcp_server, mcp_api_key, metadata_text, source_type, visualization_format)
|
| 401 |
+
if viz or (summary and not summary.startswith("Error")):
|
| 402 |
+
return viz, summary
|
| 403 |
+
|
| 404 |
+
# Local processing
|
| 405 |
+
if not metadata_text.strip():
|
| 406 |
+
return "", "Please provide metadata content."
|
| 407 |
+
|
| 408 |
+
if EXPORTERS_AVAILABLE:
|
| 409 |
+
graph, summary = parse_metadata_to_graph(metadata_text, source_type)
|
| 410 |
+
mermaid_code = generate_mermaid_from_graph(graph)
|
| 411 |
+
return render_mermaid(mermaid_code), summary
|
| 412 |
+
else:
|
| 413 |
+
# Fallback stub
|
| 414 |
+
viz = "graph TD\n A[Sample Node] --> B[Output Node]"
|
| 415 |
+
return render_mermaid(viz), f"Processed {source_type} metadata."
|
| 416 |
+
|
| 417 |
|
| 418 |
def extract_lineage_from_bigquery(
|
| 419 |
project_id: str,
|
| 420 |
query: str,
|
| 421 |
api_key: str,
|
| 422 |
+
visualization_format: str,
|
| 423 |
+
mcp_server: str = "",
|
| 424 |
+
mcp_api_key: str = ""
|
| 425 |
) -> Tuple[str, str]:
|
| 426 |
+
"""Extract lineage from BigQuery."""
|
| 427 |
+
if mcp_server:
|
| 428 |
+
viz, summary = send_to_mcp(mcp_server, mcp_api_key, query, "BigQuery", visualization_format)
|
| 429 |
+
if viz or (summary and not summary.startswith("Error")):
|
| 430 |
+
return viz, summary
|
| 431 |
+
|
| 432 |
+
# Local stub - would integrate with BigQuery API in production
|
| 433 |
+
viz = f"""graph TD
|
| 434 |
+
subgraph BigQuery Project: {project_id or 'your-project'}
|
| 435 |
+
A[Source Tables] --> B[Query Execution]
|
| 436 |
+
B --> C[Destination Table]
|
| 437 |
+
end
|
| 438 |
+
style A fill:#e1f5fe
|
| 439 |
+
style B fill:#fff3e0
|
| 440 |
+
style C fill:#e8f5e9"""
|
| 441 |
+
return render_mermaid(viz), f"BigQuery lineage from project: {project_id or 'not specified'}"
|
| 442 |
+
|
|
|
|
|
|
|
| 443 |
|
| 444 |
def extract_lineage_from_url(
|
| 445 |
url: str,
|
| 446 |
+
visualization_format: str,
|
| 447 |
+
mcp_server: str = "",
|
| 448 |
+
mcp_api_key: str = ""
|
| 449 |
) -> Tuple[str, str]:
|
| 450 |
+
"""Extract lineage from URL/API endpoint."""
|
| 451 |
+
if mcp_server:
|
| 452 |
+
viz, summary = send_to_mcp(mcp_server, mcp_api_key, url, "URL", visualization_format)
|
| 453 |
+
if viz or (summary and not summary.startswith("Error")):
|
| 454 |
+
return viz, summary
|
| 455 |
+
|
| 456 |
+
# Try to fetch the URL
|
| 457 |
+
if url:
|
| 458 |
+
try:
|
| 459 |
+
resp = requests.get(url, timeout=10)
|
| 460 |
+
if resp.status_code == 200:
|
| 461 |
+
return extract_lineage_from_text(resp.text, "API Response", visualization_format)
|
| 462 |
+
except Exception as e:
|
| 463 |
+
pass
|
| 464 |
+
|
| 465 |
+
viz = "graph TD\n A[API Source] --> B[Data Pipeline] --> C[Output]"
|
| 466 |
+
return render_mermaid(viz), f"Lineage from URL: {url or 'not specified'}"
|
| 467 |
+
|
| 468 |
+
|
| 469 |
+
# ============================================================================
|
| 470 |
+
# Gradio UI
|
| 471 |
+
# ============================================================================
|
| 472 |
|
| 473 |
+
# Build the Gradio interface (Gradio 6 compatible)
|
| 474 |
+
with gr.Blocks(
|
| 475 |
+
title="Lineage Graph Accelerator",
|
| 476 |
+
fill_height=True
|
| 477 |
+
) as demo:
|
| 478 |
+
|
| 479 |
+
# Header
|
| 480 |
gr.Markdown("""
|
| 481 |
+
# Lineage Graph Accelerator
|
| 482 |
+
|
| 483 |
+
**AI-powered data lineage extraction and visualization for modern data platforms**
|
| 484 |
+
|
| 485 |
+
Extract, visualize, and export data lineage from dbt, Airflow, BigQuery, Snowflake, and more.
|
| 486 |
+
Export to enterprise data catalogs like Collibra, Microsoft Purview, and Alation.
|
| 487 |
+
|
| 488 |
+
---
|
|
|
|
|
|
|
| 489 |
""")
|
| 490 |
+
|
| 491 |
+
# MCP Server Configuration (collapsible)
|
| 492 |
+
with gr.Accordion("MCP Server Configuration (Optional)", open=False):
|
| 493 |
+
with gr.Row():
|
| 494 |
+
mcp_server = gr.Textbox(
|
| 495 |
+
label="MCP Server URL",
|
| 496 |
+
placeholder="https://your-mcp-server.hf.space/mcp",
|
| 497 |
+
info="Connect to a HuggingFace-hosted MCP server for enhanced processing"
|
| 498 |
+
)
|
| 499 |
+
mcp_api_key = gr.Textbox(
|
| 500 |
+
label="API Key",
|
| 501 |
+
placeholder="Optional API key",
|
| 502 |
+
type="password"
|
| 503 |
+
)
|
| 504 |
+
test_btn = gr.Button("Test Connection", size="sm")
|
| 505 |
+
mcp_status = gr.Textbox(label="Connection Status", interactive=False)
|
| 506 |
+
test_btn.click(fn=test_mcp_connection, inputs=[mcp_server, mcp_api_key], outputs=[mcp_status])
|
| 507 |
+
|
| 508 |
+
# Mermaid.js loader
|
| 509 |
+
gr.HTML(
|
| 510 |
+
value='<script src="https://cdn.jsdelivr.net/npm/mermaid@10/dist/mermaid.min.js"></script>'
|
| 511 |
+
'<script>mermaid.initialize({startOnLoad:false, theme:"default"});</script>',
|
| 512 |
visible=False
|
| 513 |
)
|
| 514 |
+
|
| 515 |
+
# Main Tabs
|
| 516 |
with gr.Tabs():
|
| 517 |
# Tab 1: Text/File Input
|
| 518 |
+
with gr.Tab("Text/File Metadata", id="text"):
|
| 519 |
with gr.Row():
|
| 520 |
+
with gr.Column(scale=1):
|
| 521 |
+
gr.Markdown("### Input")
|
| 522 |
+
|
| 523 |
+
# Sample selector
|
| 524 |
+
with gr.Row():
|
| 525 |
+
sample_selector = gr.Dropdown(
|
| 526 |
+
choices=[
|
| 527 |
+
("Simple JSON", "simple"),
|
| 528 |
+
("dbt Manifest", "dbt"),
|
| 529 |
+
("Airflow DAG", "airflow"),
|
| 530 |
+
("SQL DDL", "sql"),
|
| 531 |
+
("Data Warehouse", "warehouse"),
|
| 532 |
+
("ETL Pipeline", "etl"),
|
| 533 |
+
("Complex Demo", "complex")
|
| 534 |
+
],
|
| 535 |
+
label="Load Sample Data",
|
| 536 |
+
value="simple"
|
| 537 |
+
)
|
| 538 |
+
load_sample_btn = gr.Button("Load Sample", size="sm")
|
| 539 |
+
|
| 540 |
metadata_input = gr.Textbox(
|
| 541 |
label="Metadata Content",
|
| 542 |
+
placeholder="Paste your metadata here (JSON, YAML, SQL, dbt manifest, Airflow DAG, etc.)",
|
| 543 |
+
lines=18
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 544 |
)
|
| 545 |
+
|
| 546 |
+
with gr.Row():
|
| 547 |
+
source_type = gr.Dropdown(
|
| 548 |
+
choices=["dbt Manifest", "Airflow DAG", "SQL DDL", "Data Warehouse", "ETL Pipeline", "Custom JSON", "Other"],
|
| 549 |
+
label="Source Type",
|
| 550 |
+
value="Custom JSON"
|
| 551 |
+
)
|
| 552 |
+
viz_format = gr.Dropdown(
|
| 553 |
+
choices=["Mermaid", "DOT/Graphviz", "Text"],
|
| 554 |
+
label="Visualization Format",
|
| 555 |
+
value="Mermaid"
|
| 556 |
+
)
|
| 557 |
+
|
| 558 |
+
extract_btn = gr.Button("Extract Lineage", variant="primary", size="lg")
|
| 559 |
+
|
| 560 |
+
with gr.Column(scale=1):
|
| 561 |
+
gr.Markdown("### Visualization")
|
| 562 |
+
output_viz = gr.HTML(label="Lineage Graph")
|
| 563 |
+
output_summary = gr.Textbox(label="Summary", lines=3)
|
| 564 |
+
|
| 565 |
+
# Export section
|
| 566 |
+
with gr.Accordion("Export to Data Catalog", open=False):
|
| 567 |
+
export_format = gr.Dropdown(
|
| 568 |
+
choices=EXPORT_FORMATS,
|
| 569 |
+
label="Export Format",
|
| 570 |
+
value="OpenLineage"
|
| 571 |
+
)
|
| 572 |
+
export_btn = gr.Button("Generate Export", variant="secondary")
|
| 573 |
+
export_output = gr.Code(label="Export Content", language="json", lines=10)
|
| 574 |
+
export_status = gr.Textbox(label="Export Status", interactive=False)
|
| 575 |
+
|
| 576 |
+
# Event handlers
|
| 577 |
+
load_sample_btn.click(
|
| 578 |
+
fn=load_sample,
|
| 579 |
+
inputs=[sample_selector],
|
| 580 |
+
outputs=[metadata_input]
|
| 581 |
+
)
|
| 582 |
+
|
| 583 |
+
extract_btn.click(
|
| 584 |
+
fn=extract_lineage_from_text,
|
| 585 |
+
inputs=[metadata_input, source_type, viz_format, mcp_server, mcp_api_key],
|
| 586 |
+
outputs=[output_viz, output_summary]
|
| 587 |
+
)
|
| 588 |
+
|
| 589 |
+
export_btn.click(
|
| 590 |
+
fn=export_lineage,
|
| 591 |
+
inputs=[metadata_input, source_type, export_format],
|
| 592 |
+
outputs=[export_output, export_status]
|
| 593 |
)
|
| 594 |
+
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 595 |
# Tab 2: BigQuery
|
| 596 |
+
with gr.Tab("BigQuery", id="bigquery"):
|
| 597 |
with gr.Row():
|
| 598 |
with gr.Column():
|
| 599 |
bq_project = gr.Textbox(
|
|
|
|
| 603 |
bq_query = gr.Textbox(
|
| 604 |
label="Metadata Query",
|
| 605 |
placeholder="SELECT * FROM `project.dataset.INFORMATION_SCHEMA.TABLES`",
|
| 606 |
+
lines=10
|
| 607 |
)
|
| 608 |
+
load_bq_sample = gr.Button("Load Sample Query", size="sm")
|
| 609 |
+
bq_creds = gr.Textbox(
|
| 610 |
+
label="Service Account JSON (optional)",
|
|
|
|
| 611 |
type="password"
|
| 612 |
)
|
| 613 |
+
bq_viz_format = gr.Dropdown(
|
| 614 |
+
choices=["Mermaid", "DOT/Graphviz", "Text"],
|
| 615 |
label="Visualization Format",
|
| 616 |
value="Mermaid"
|
| 617 |
)
|
| 618 |
+
bq_extract_btn = gr.Button("Extract Lineage", variant="primary")
|
| 619 |
+
|
| 620 |
with gr.Column():
|
| 621 |
+
bq_output_viz = gr.HTML(label="Lineage Graph")
|
| 622 |
+
bq_output_summary = gr.Textbox(label="Summary", lines=3)
|
| 623 |
+
|
| 624 |
+
load_bq_sample.click(
|
| 625 |
+
fn=lambda: load_sample("bigquery"),
|
| 626 |
+
outputs=[bq_query]
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 627 |
)
|
| 628 |
+
|
| 629 |
+
bq_extract_btn.click(
|
| 630 |
+
fn=extract_lineage_from_bigquery,
|
| 631 |
+
inputs=[bq_project, bq_query, bq_creds, bq_viz_format, mcp_server, mcp_api_key],
|
| 632 |
+
outputs=[bq_output_viz, bq_output_summary]
|
| 633 |
+
)
|
| 634 |
+
|
|
|
|
|
|
|
|
|
|
| 635 |
# Tab 3: URL/API
|
| 636 |
+
with gr.Tab("URL/API", id="url"):
|
| 637 |
with gr.Row():
|
| 638 |
with gr.Column():
|
| 639 |
url_input = gr.Textbox(
|
| 640 |
+
label="Metadata URL",
|
| 641 |
placeholder="https://api.example.com/metadata"
|
| 642 |
)
|
| 643 |
+
load_url_sample = gr.Button("Load Sample API Metadata", size="sm")
|
| 644 |
+
url_viz_format = gr.Dropdown(
|
| 645 |
+
choices=["Mermaid", "DOT/Graphviz", "Text"],
|
| 646 |
label="Visualization Format",
|
| 647 |
value="Mermaid"
|
| 648 |
)
|
| 649 |
+
url_extract_btn = gr.Button("Extract Lineage", variant="primary")
|
| 650 |
+
|
| 651 |
with gr.Column():
|
| 652 |
+
url_output_viz = gr.HTML(label="Lineage Graph")
|
| 653 |
+
url_output_summary = gr.Textbox(label="Summary", lines=3)
|
| 654 |
+
|
| 655 |
+
load_url_sample.click(
|
| 656 |
+
fn=lambda: load_sample("api"),
|
| 657 |
+
outputs=[url_input]
|
| 658 |
+
)
|
| 659 |
+
|
| 660 |
+
url_extract_btn.click(
|
| 661 |
+
fn=extract_lineage_from_url,
|
| 662 |
+
inputs=[url_input, url_viz_format, mcp_server, mcp_api_key],
|
| 663 |
+
outputs=[url_output_viz, url_output_summary]
|
|
|
|
| 664 |
)
|
| 665 |
+
|
| 666 |
+
# Tab 4: Demo Gallery
|
| 667 |
+
with gr.Tab("Demo Gallery", id="gallery"):
|
| 668 |
+
gr.Markdown("""
|
| 669 |
+
## Sample Lineage Visualizations
|
| 670 |
+
|
| 671 |
+
Click any example below to see the lineage visualization.
|
| 672 |
+
""")
|
| 673 |
+
|
| 674 |
+
with gr.Row():
|
| 675 |
+
demo_simple = gr.Button("E-Commerce (Simple)")
|
| 676 |
+
demo_dbt = gr.Button("dbt Project")
|
| 677 |
+
demo_airflow = gr.Button("Airflow DAG")
|
| 678 |
+
with gr.Row():
|
| 679 |
+
demo_warehouse = gr.Button("Data Warehouse")
|
| 680 |
+
demo_etl = gr.Button("ETL Pipeline")
|
| 681 |
+
demo_complex = gr.Button("Complex Platform")
|
| 682 |
+
|
| 683 |
+
demo_viz = gr.HTML(label="Demo Visualization")
|
| 684 |
+
demo_summary = gr.Textbox(label="Description", lines=2)
|
| 685 |
+
|
| 686 |
+
# Demo handlers
|
| 687 |
+
for btn, sample_type in [(demo_simple, "simple"), (demo_dbt, "dbt"),
|
| 688 |
+
(demo_airflow, "airflow"), (demo_warehouse, "warehouse"),
|
| 689 |
+
(demo_etl, "etl"), (demo_complex, "complex")]:
|
| 690 |
+
btn.click(
|
| 691 |
+
fn=lambda st=sample_type: extract_lineage_from_text(
|
| 692 |
+
load_sample(st),
|
| 693 |
+
st.replace("_", " ").title(),
|
| 694 |
+
"Mermaid"
|
| 695 |
+
),
|
| 696 |
+
outputs=[demo_viz, demo_summary]
|
| 697 |
+
)
|
| 698 |
+
|
| 699 |
+
# Footer
|
| 700 |
gr.Markdown("""
|
| 701 |
---
|
| 702 |
+
|
| 703 |
+
### Export Formats Supported
|
| 704 |
+
|
| 705 |
+
| Format | Description | Use Case |
|
| 706 |
+
|--------|-------------|----------|
|
| 707 |
+
| **OpenLineage** | Open standard for lineage | Universal compatibility |
|
| 708 |
+
| **Collibra** | Collibra Data Intelligence | Enterprise data governance |
|
| 709 |
+
| **Purview** | Microsoft Purview | Azure ecosystem |
|
| 710 |
+
| **Alation** | Alation Data Catalog | Self-service analytics |
|
| 711 |
+
|
| 712 |
+
---
|
| 713 |
+
|
| 714 |
+
Built with Gradio for the **Gradio Agents & MCP Hackathon - Winter 2025**
|
| 715 |
+
|
| 716 |
+
[GitHub](https://github.com) | [Documentation](USER_GUIDE.md) | [HuggingFace](https://huggingface.co)
|
| 717 |
""")
|
| 718 |
|
| 719 |
+
# Launch
|
| 720 |
if __name__ == "__main__":
|
| 721 |
demo.launch()
|
exporters/__init__.py
ADDED
|
@@ -0,0 +1,27 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""
|
| 2 |
+
Data Lineage Exporters - Export lineage graphs to various data catalog formats.
|
| 3 |
+
|
| 4 |
+
Supported formats:
|
| 5 |
+
- OpenLineage (standard format)
|
| 6 |
+
- Collibra Data Intelligence
|
| 7 |
+
- Microsoft Purview
|
| 8 |
+
- Alation Data Catalog
|
| 9 |
+
- Apache Atlas
|
| 10 |
+
"""
|
| 11 |
+
|
| 12 |
+
from .base import LineageExporter, LineageGraph, LineageNode, LineageEdge
|
| 13 |
+
from .openlineage import OpenLineageExporter
|
| 14 |
+
from .collibra import CollibraExporter
|
| 15 |
+
from .purview import PurviewExporter
|
| 16 |
+
from .alation import AlationExporter
|
| 17 |
+
|
| 18 |
+
__all__ = [
|
| 19 |
+
'LineageExporter',
|
| 20 |
+
'LineageGraph',
|
| 21 |
+
'LineageNode',
|
| 22 |
+
'LineageEdge',
|
| 23 |
+
'OpenLineageExporter',
|
| 24 |
+
'CollibraExporter',
|
| 25 |
+
'PurviewExporter',
|
| 26 |
+
'AlationExporter',
|
| 27 |
+
]
|
exporters/alation.py
ADDED
|
@@ -0,0 +1,242 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""
|
| 2 |
+
Alation Exporter - Export to Alation Data Catalog format.
|
| 3 |
+
|
| 4 |
+
Alation is an enterprise data catalog and data governance platform.
|
| 5 |
+
https://www.alation.com/
|
| 6 |
+
"""
|
| 7 |
+
|
| 8 |
+
from typing import Dict, Any, List
|
| 9 |
+
from datetime import datetime
|
| 10 |
+
import uuid
|
| 11 |
+
from .base import LineageExporter, LineageGraph, LineageNode, LineageEdge
|
| 12 |
+
|
| 13 |
+
|
| 14 |
+
class AlationExporter(LineageExporter):
|
| 15 |
+
"""Export lineage to Alation format."""
|
| 16 |
+
|
| 17 |
+
def __init__(self, graph: LineageGraph, datasource_id: int = 1,
|
| 18 |
+
datasource_name: str = "Lineage Accelerator"):
|
| 19 |
+
super().__init__(graph)
|
| 20 |
+
self.datasource_id = datasource_id
|
| 21 |
+
self.datasource_name = datasource_name
|
| 22 |
+
|
| 23 |
+
@property
|
| 24 |
+
def format_name(self) -> str:
|
| 25 |
+
return "Alation"
|
| 26 |
+
|
| 27 |
+
@property
|
| 28 |
+
def file_extension(self) -> str:
|
| 29 |
+
return ".json"
|
| 30 |
+
|
| 31 |
+
def _node_type_to_alation_otype(self, node_type: str) -> str:
|
| 32 |
+
"""Map internal node types to Alation object types."""
|
| 33 |
+
type_mapping = {
|
| 34 |
+
"table": "table",
|
| 35 |
+
"view": "view",
|
| 36 |
+
"model": "table",
|
| 37 |
+
"source": "datasource",
|
| 38 |
+
"destination": "table",
|
| 39 |
+
"column": "attribute",
|
| 40 |
+
"database": "schema",
|
| 41 |
+
"schema": "schema",
|
| 42 |
+
"report": "bi_report",
|
| 43 |
+
"dimension": "table",
|
| 44 |
+
"fact": "table",
|
| 45 |
+
"feature_set": "table",
|
| 46 |
+
"semantic_model": "bi_datasource",
|
| 47 |
+
"external_api": "datasource",
|
| 48 |
+
"extract": "table"
|
| 49 |
+
}
|
| 50 |
+
return type_mapping.get(node_type.lower(), "table")
|
| 51 |
+
|
| 52 |
+
def _create_table_object(self, node: LineageNode) -> Dict[str, Any]:
|
| 53 |
+
"""Create an Alation table object from a node."""
|
| 54 |
+
obj = {
|
| 55 |
+
"key": self._get_key(node),
|
| 56 |
+
"title": node.name,
|
| 57 |
+
"description": node.description or "",
|
| 58 |
+
"ds_id": self.datasource_id,
|
| 59 |
+
"schema_name": node.schema or "default",
|
| 60 |
+
"table_name": node.name,
|
| 61 |
+
"table_type": node.type.upper() if node.type else "TABLE"
|
| 62 |
+
}
|
| 63 |
+
|
| 64 |
+
# Add custom fields
|
| 65 |
+
custom_fields = []
|
| 66 |
+
|
| 67 |
+
if node.category:
|
| 68 |
+
custom_fields.append({
|
| 69 |
+
"field_name": "Data Layer",
|
| 70 |
+
"value": node.category
|
| 71 |
+
})
|
| 72 |
+
|
| 73 |
+
if node.owner:
|
| 74 |
+
custom_fields.append({
|
| 75 |
+
"field_name": "Data Owner",
|
| 76 |
+
"value": node.owner
|
| 77 |
+
})
|
| 78 |
+
|
| 79 |
+
if node.tags:
|
| 80 |
+
custom_fields.append({
|
| 81 |
+
"field_name": "Tags",
|
| 82 |
+
"value": ", ".join(node.tags)
|
| 83 |
+
})
|
| 84 |
+
|
| 85 |
+
if node.database:
|
| 86 |
+
custom_fields.append({
|
| 87 |
+
"field_name": "Database",
|
| 88 |
+
"value": node.database
|
| 89 |
+
})
|
| 90 |
+
|
| 91 |
+
if custom_fields:
|
| 92 |
+
obj["custom_fields"] = custom_fields
|
| 93 |
+
|
| 94 |
+
return obj
|
| 95 |
+
|
| 96 |
+
def _get_key(self, node: LineageNode) -> str:
|
| 97 |
+
"""Get Alation-style key for a node."""
|
| 98 |
+
parts = [str(self.datasource_id)]
|
| 99 |
+
if node.schema:
|
| 100 |
+
parts.append(node.schema)
|
| 101 |
+
else:
|
| 102 |
+
parts.append("default")
|
| 103 |
+
parts.append(node.name)
|
| 104 |
+
return ".".join(parts)
|
| 105 |
+
|
| 106 |
+
def _create_column_objects(self, node: LineageNode) -> List[Dict[str, Any]]:
|
| 107 |
+
"""Create Alation column objects from a node's columns."""
|
| 108 |
+
if not node.columns:
|
| 109 |
+
return []
|
| 110 |
+
|
| 111 |
+
column_objects = []
|
| 112 |
+
table_key = self._get_key(node)
|
| 113 |
+
|
| 114 |
+
for idx, col in enumerate(node.columns):
|
| 115 |
+
col_obj = {
|
| 116 |
+
"key": f"{table_key}.{col.get('name')}",
|
| 117 |
+
"column_name": col.get("name"),
|
| 118 |
+
"column_type": col.get("type") or col.get("data_type", "string"),
|
| 119 |
+
"description": col.get("description", ""),
|
| 120 |
+
"table_key": table_key,
|
| 121 |
+
"position": idx + 1
|
| 122 |
+
}
|
| 123 |
+
|
| 124 |
+
# Check for primary key
|
| 125 |
+
if col.get("isPrimaryKey"):
|
| 126 |
+
col_obj["is_primary_key"] = True
|
| 127 |
+
|
| 128 |
+
# Check for foreign key
|
| 129 |
+
if col.get("isForeignKey"):
|
| 130 |
+
col_obj["is_foreign_key"] = True
|
| 131 |
+
if col.get("references"):
|
| 132 |
+
col_obj["fk_reference"] = col.get("references")
|
| 133 |
+
|
| 134 |
+
column_objects.append(col_obj)
|
| 135 |
+
|
| 136 |
+
return column_objects
|
| 137 |
+
|
| 138 |
+
def _create_lineage_object(self, edge: LineageEdge) -> Dict[str, Any]:
|
| 139 |
+
"""Create an Alation lineage object from an edge."""
|
| 140 |
+
source_node = self.graph.get_node(edge.source)
|
| 141 |
+
target_node = self.graph.get_node(edge.target)
|
| 142 |
+
|
| 143 |
+
lineage = {
|
| 144 |
+
"source_key": self._get_key(source_node) if source_node else edge.source,
|
| 145 |
+
"target_key": self._get_key(target_node) if target_node else edge.target,
|
| 146 |
+
"lineage_type": edge.type or "DIRECT"
|
| 147 |
+
}
|
| 148 |
+
|
| 149 |
+
# Add job information if available
|
| 150 |
+
if edge.job_name:
|
| 151 |
+
lineage["dataflow_name"] = edge.job_name
|
| 152 |
+
if edge.job_id:
|
| 153 |
+
lineage["dataflow_id"] = edge.job_id
|
| 154 |
+
|
| 155 |
+
# Add transformation description
|
| 156 |
+
if edge.transformation:
|
| 157 |
+
lineage["transformation_description"] = edge.transformation
|
| 158 |
+
|
| 159 |
+
return lineage
|
| 160 |
+
|
| 161 |
+
def _create_dataflow(self, edge: LineageEdge) -> Dict[str, Any]:
|
| 162 |
+
"""Create an Alation dataflow object from an edge."""
|
| 163 |
+
dataflow_name = edge.job_name or f"dataflow_{edge.source}_to_{edge.target}"
|
| 164 |
+
|
| 165 |
+
dataflow = {
|
| 166 |
+
"external_id": edge.job_id or str(uuid.uuid4()),
|
| 167 |
+
"title": dataflow_name,
|
| 168 |
+
"description": f"Data transformation: {edge.type}",
|
| 169 |
+
"dataflow_type": edge.type.upper() if edge.type else "ETL"
|
| 170 |
+
}
|
| 171 |
+
|
| 172 |
+
return dataflow
|
| 173 |
+
|
| 174 |
+
def export(self) -> str:
|
| 175 |
+
"""Export to Alation JSON format."""
|
| 176 |
+
return self.to_json(indent=2)
|
| 177 |
+
|
| 178 |
+
def _to_dict(self) -> Dict[str, Any]:
|
| 179 |
+
"""Convert to Alation bulk import dictionary."""
|
| 180 |
+
# Collect tables
|
| 181 |
+
tables = []
|
| 182 |
+
columns = []
|
| 183 |
+
|
| 184 |
+
for node in self.graph.nodes:
|
| 185 |
+
tables.append(self._create_table_object(node))
|
| 186 |
+
columns.extend(self._create_column_objects(node))
|
| 187 |
+
|
| 188 |
+
# Collect lineage
|
| 189 |
+
lineage_objects = [self._create_lineage_object(edge) for edge in self.graph.edges]
|
| 190 |
+
|
| 191 |
+
# Collect unique dataflows
|
| 192 |
+
dataflows = []
|
| 193 |
+
seen_dataflows = set()
|
| 194 |
+
for edge in self.graph.edges:
|
| 195 |
+
dataflow_name = edge.job_name or f"dataflow_{edge.source}_to_{edge.target}"
|
| 196 |
+
if dataflow_name not in seen_dataflows:
|
| 197 |
+
dataflows.append(self._create_dataflow(edge))
|
| 198 |
+
seen_dataflows.add(dataflow_name)
|
| 199 |
+
|
| 200 |
+
return {
|
| 201 |
+
"exportInfo": {
|
| 202 |
+
"producer": "Lineage Graph Accelerator",
|
| 203 |
+
"exportedAt": self.graph.generated_at,
|
| 204 |
+
"sourceLineageName": self.graph.name,
|
| 205 |
+
"format": "Alation Bulk API",
|
| 206 |
+
"version": "1.0"
|
| 207 |
+
},
|
| 208 |
+
"datasource": {
|
| 209 |
+
"id": self.datasource_id,
|
| 210 |
+
"title": self.datasource_name,
|
| 211 |
+
"ds_type": "custom"
|
| 212 |
+
},
|
| 213 |
+
"schemas": self._extract_schemas(),
|
| 214 |
+
"tables": tables,
|
| 215 |
+
"columns": columns,
|
| 216 |
+
"lineage": lineage_objects,
|
| 217 |
+
"dataflows": dataflows,
|
| 218 |
+
"summary": {
|
| 219 |
+
"totalTables": len(tables),
|
| 220 |
+
"totalColumns": len(columns),
|
| 221 |
+
"totalLineageEdges": len(lineage_objects),
|
| 222 |
+
"totalDataflows": len(dataflows),
|
| 223 |
+
"schemas": list(set(t.get("schema_name", "default") for t in tables))
|
| 224 |
+
}
|
| 225 |
+
}
|
| 226 |
+
|
| 227 |
+
def _extract_schemas(self) -> List[Dict[str, Any]]:
|
| 228 |
+
"""Extract unique schemas from nodes."""
|
| 229 |
+
schemas = {}
|
| 230 |
+
for node in self.graph.nodes:
|
| 231 |
+
schema_name = node.schema or "default"
|
| 232 |
+
if schema_name not in schemas:
|
| 233 |
+
schemas[schema_name] = {
|
| 234 |
+
"key": f"{self.datasource_id}.{schema_name}",
|
| 235 |
+
"schema_name": schema_name,
|
| 236 |
+
"ds_id": self.datasource_id,
|
| 237 |
+
"description": f"Schema: {schema_name}"
|
| 238 |
+
}
|
| 239 |
+
if node.database:
|
| 240 |
+
schemas[schema_name]["db_name"] = node.database
|
| 241 |
+
|
| 242 |
+
return list(schemas.values())
|
exporters/base.py
ADDED
|
@@ -0,0 +1,199 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""
|
| 2 |
+
Base classes for lineage export functionality.
|
| 3 |
+
"""
|
| 4 |
+
|
| 5 |
+
from dataclasses import dataclass, field
|
| 6 |
+
from typing import List, Dict, Optional, Any
|
| 7 |
+
from abc import ABC, abstractmethod
|
| 8 |
+
import json
|
| 9 |
+
from datetime import datetime, timezone
|
| 10 |
+
|
| 11 |
+
|
| 12 |
+
@dataclass
|
| 13 |
+
class LineageNode:
|
| 14 |
+
"""Represents a node in the lineage graph."""
|
| 15 |
+
id: str
|
| 16 |
+
name: str
|
| 17 |
+
type: str # table, view, model, source, destination, etc.
|
| 18 |
+
category: Optional[str] = None # raw, staging, marts, reporting, etc.
|
| 19 |
+
database: Optional[str] = None
|
| 20 |
+
schema: Optional[str] = None
|
| 21 |
+
description: Optional[str] = None
|
| 22 |
+
columns: Optional[List[Dict[str, Any]]] = None
|
| 23 |
+
metadata: Optional[Dict[str, Any]] = None
|
| 24 |
+
tags: Optional[List[str]] = None
|
| 25 |
+
owner: Optional[str] = None
|
| 26 |
+
|
| 27 |
+
def to_dict(self) -> Dict[str, Any]:
|
| 28 |
+
"""Convert node to dictionary."""
|
| 29 |
+
return {k: v for k, v in {
|
| 30 |
+
'id': self.id,
|
| 31 |
+
'name': self.name,
|
| 32 |
+
'type': self.type,
|
| 33 |
+
'category': self.category,
|
| 34 |
+
'database': self.database,
|
| 35 |
+
'schema': self.schema,
|
| 36 |
+
'description': self.description,
|
| 37 |
+
'columns': self.columns,
|
| 38 |
+
'metadata': self.metadata,
|
| 39 |
+
'tags': self.tags,
|
| 40 |
+
'owner': self.owner,
|
| 41 |
+
}.items() if v is not None}
|
| 42 |
+
|
| 43 |
+
|
| 44 |
+
@dataclass
|
| 45 |
+
class LineageEdge:
|
| 46 |
+
"""Represents an edge (relationship) in the lineage graph."""
|
| 47 |
+
source: str # source node id
|
| 48 |
+
target: str # target node id
|
| 49 |
+
type: str # transform, reference, ingest, export, etc.
|
| 50 |
+
job_id: Optional[str] = None
|
| 51 |
+
job_name: Optional[str] = None
|
| 52 |
+
transformation: Optional[str] = None
|
| 53 |
+
metadata: Optional[Dict[str, Any]] = None
|
| 54 |
+
|
| 55 |
+
def to_dict(self) -> Dict[str, Any]:
|
| 56 |
+
"""Convert edge to dictionary."""
|
| 57 |
+
return {k: v for k, v in {
|
| 58 |
+
'source': self.source,
|
| 59 |
+
'target': self.target,
|
| 60 |
+
'type': self.type,
|
| 61 |
+
'job_id': self.job_id,
|
| 62 |
+
'job_name': self.job_name,
|
| 63 |
+
'transformation': self.transformation,
|
| 64 |
+
'metadata': self.metadata,
|
| 65 |
+
}.items() if v is not None}
|
| 66 |
+
|
| 67 |
+
|
| 68 |
+
@dataclass
|
| 69 |
+
class LineageGraph:
|
| 70 |
+
"""Represents a complete lineage graph."""
|
| 71 |
+
name: str
|
| 72 |
+
nodes: List[LineageNode] = field(default_factory=list)
|
| 73 |
+
edges: List[LineageEdge] = field(default_factory=list)
|
| 74 |
+
metadata: Optional[Dict[str, Any]] = None
|
| 75 |
+
generated_at: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat().replace('+00:00', 'Z'))
|
| 76 |
+
|
| 77 |
+
def add_node(self, node: LineageNode) -> None:
|
| 78 |
+
"""Add a node to the graph."""
|
| 79 |
+
self.nodes.append(node)
|
| 80 |
+
|
| 81 |
+
def add_edge(self, edge: LineageEdge) -> None:
|
| 82 |
+
"""Add an edge to the graph."""
|
| 83 |
+
self.edges.append(edge)
|
| 84 |
+
|
| 85 |
+
def get_node(self, node_id: str) -> Optional[LineageNode]:
|
| 86 |
+
"""Get a node by ID."""
|
| 87 |
+
for node in self.nodes:
|
| 88 |
+
if node.id == node_id:
|
| 89 |
+
return node
|
| 90 |
+
return None
|
| 91 |
+
|
| 92 |
+
def get_upstream(self, node_id: str) -> List[LineageNode]:
|
| 93 |
+
"""Get all upstream nodes for a given node."""
|
| 94 |
+
upstream_ids = [e.source for e in self.edges if e.target == node_id]
|
| 95 |
+
return [n for n in self.nodes if n.id in upstream_ids]
|
| 96 |
+
|
| 97 |
+
def get_downstream(self, node_id: str) -> List[LineageNode]:
|
| 98 |
+
"""Get all downstream nodes for a given node."""
|
| 99 |
+
downstream_ids = [e.target for e in self.edges if e.source == node_id]
|
| 100 |
+
return [n for n in self.nodes if n.id in downstream_ids]
|
| 101 |
+
|
| 102 |
+
def to_dict(self) -> Dict[str, Any]:
|
| 103 |
+
"""Convert graph to dictionary."""
|
| 104 |
+
return {
|
| 105 |
+
'name': self.name,
|
| 106 |
+
'generated_at': self.generated_at,
|
| 107 |
+
'nodes': [n.to_dict() for n in self.nodes],
|
| 108 |
+
'edges': [e.to_dict() for e in self.edges],
|
| 109 |
+
'metadata': self.metadata,
|
| 110 |
+
}
|
| 111 |
+
|
| 112 |
+
@classmethod
|
| 113 |
+
def from_dict(cls, data: Dict[str, Any]) -> 'LineageGraph':
|
| 114 |
+
"""Create a LineageGraph from a dictionary."""
|
| 115 |
+
graph = cls(
|
| 116 |
+
name=data.get('name', 'Untitled'),
|
| 117 |
+
metadata=data.get('metadata'),
|
| 118 |
+
generated_at=data.get('generated_at', datetime.utcnow().isoformat() + 'Z')
|
| 119 |
+
)
|
| 120 |
+
|
| 121 |
+
# Parse nodes
|
| 122 |
+
for node_data in data.get('nodes', []):
|
| 123 |
+
node = LineageNode(
|
| 124 |
+
id=node_data.get('id'),
|
| 125 |
+
name=node_data.get('name'),
|
| 126 |
+
type=node_data.get('type', 'unknown'),
|
| 127 |
+
category=node_data.get('category'),
|
| 128 |
+
database=node_data.get('database'),
|
| 129 |
+
schema=node_data.get('schema'),
|
| 130 |
+
description=node_data.get('description'),
|
| 131 |
+
columns=node_data.get('columns'),
|
| 132 |
+
metadata=node_data.get('metadata'),
|
| 133 |
+
tags=node_data.get('tags'),
|
| 134 |
+
owner=node_data.get('owner'),
|
| 135 |
+
)
|
| 136 |
+
graph.add_node(node)
|
| 137 |
+
|
| 138 |
+
# Parse edges
|
| 139 |
+
for edge_data in data.get('edges', []):
|
| 140 |
+
edge = LineageEdge(
|
| 141 |
+
source=edge_data.get('source') or edge_data.get('from'),
|
| 142 |
+
target=edge_data.get('target') or edge_data.get('to'),
|
| 143 |
+
type=edge_data.get('type', 'transform'),
|
| 144 |
+
job_id=edge_data.get('job_id'),
|
| 145 |
+
job_name=edge_data.get('job_name') or edge_data.get('job'),
|
| 146 |
+
transformation=edge_data.get('transformation'),
|
| 147 |
+
metadata=edge_data.get('metadata'),
|
| 148 |
+
)
|
| 149 |
+
graph.add_edge(edge)
|
| 150 |
+
|
| 151 |
+
return graph
|
| 152 |
+
|
| 153 |
+
@classmethod
|
| 154 |
+
def from_json(cls, json_str: str) -> 'LineageGraph':
|
| 155 |
+
"""Create a LineageGraph from JSON string."""
|
| 156 |
+
data = json.loads(json_str)
|
| 157 |
+
# Handle nested structure (lineage_graph key)
|
| 158 |
+
if 'lineage_graph' in data:
|
| 159 |
+
data = data['lineage_graph']
|
| 160 |
+
return cls.from_dict(data)
|
| 161 |
+
|
| 162 |
+
|
| 163 |
+
class LineageExporter(ABC):
|
| 164 |
+
"""Abstract base class for lineage exporters."""
|
| 165 |
+
|
| 166 |
+
def __init__(self, graph: LineageGraph):
|
| 167 |
+
self.graph = graph
|
| 168 |
+
|
| 169 |
+
@property
|
| 170 |
+
@abstractmethod
|
| 171 |
+
def format_name(self) -> str:
|
| 172 |
+
"""Return the name of the export format."""
|
| 173 |
+
pass
|
| 174 |
+
|
| 175 |
+
@property
|
| 176 |
+
@abstractmethod
|
| 177 |
+
def file_extension(self) -> str:
|
| 178 |
+
"""Return the file extension for the export format."""
|
| 179 |
+
pass
|
| 180 |
+
|
| 181 |
+
@abstractmethod
|
| 182 |
+
def export(self) -> str:
|
| 183 |
+
"""Export the lineage graph to the target format."""
|
| 184 |
+
pass
|
| 185 |
+
|
| 186 |
+
def export_to_file(self, filepath: str) -> None:
|
| 187 |
+
"""Export the lineage graph to a file."""
|
| 188 |
+
content = self.export()
|
| 189 |
+
with open(filepath, 'w') as f:
|
| 190 |
+
f.write(content)
|
| 191 |
+
|
| 192 |
+
def to_json(self, indent: int = 2) -> str:
|
| 193 |
+
"""Convert export to JSON string."""
|
| 194 |
+
return json.dumps(self._to_dict(), indent=indent)
|
| 195 |
+
|
| 196 |
+
@abstractmethod
|
| 197 |
+
def _to_dict(self) -> Dict[str, Any]:
|
| 198 |
+
"""Convert export to dictionary (for JSON serialization)."""
|
| 199 |
+
pass
|
exporters/collibra.py
ADDED
|
@@ -0,0 +1,243 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""
|
| 2 |
+
Collibra Exporter - Export to Collibra Data Intelligence format.
|
| 3 |
+
|
| 4 |
+
Collibra is an enterprise data governance and catalog platform.
|
| 5 |
+
https://www.collibra.com/
|
| 6 |
+
"""
|
| 7 |
+
|
| 8 |
+
from typing import Dict, Any, List
|
| 9 |
+
from datetime import datetime
|
| 10 |
+
import uuid
|
| 11 |
+
from .base import LineageExporter, LineageGraph, LineageNode, LineageEdge
|
| 12 |
+
|
| 13 |
+
|
| 14 |
+
class CollibraExporter(LineageExporter):
|
| 15 |
+
"""Export lineage to Collibra import format."""
|
| 16 |
+
|
| 17 |
+
def __init__(self, graph: LineageGraph, community_name: str = "Data Lineage",
|
| 18 |
+
domain_name: str = "Physical Data Dictionary"):
|
| 19 |
+
super().__init__(graph)
|
| 20 |
+
self.community_name = community_name
|
| 21 |
+
self.domain_name = domain_name
|
| 22 |
+
|
| 23 |
+
@property
|
| 24 |
+
def format_name(self) -> str:
|
| 25 |
+
return "Collibra"
|
| 26 |
+
|
| 27 |
+
@property
|
| 28 |
+
def file_extension(self) -> str:
|
| 29 |
+
return ".json"
|
| 30 |
+
|
| 31 |
+
def _node_type_to_collibra_type(self, node_type: str) -> str:
|
| 32 |
+
"""Map internal node types to Collibra asset types."""
|
| 33 |
+
type_mapping = {
|
| 34 |
+
"table": "Table",
|
| 35 |
+
"view": "View",
|
| 36 |
+
"model": "Data Set",
|
| 37 |
+
"source": "Data Source",
|
| 38 |
+
"destination": "Data Target",
|
| 39 |
+
"column": "Column",
|
| 40 |
+
"database": "Database",
|
| 41 |
+
"schema": "Schema",
|
| 42 |
+
"report": "Report",
|
| 43 |
+
"dimension": "Dimension Table",
|
| 44 |
+
"fact": "Fact Table",
|
| 45 |
+
"feature_set": "Data Set",
|
| 46 |
+
"semantic_model": "Business Intelligence Report",
|
| 47 |
+
"external_api": "Data Source",
|
| 48 |
+
"extract": "Data Set"
|
| 49 |
+
}
|
| 50 |
+
return type_mapping.get(node_type.lower(), "Data Set")
|
| 51 |
+
|
| 52 |
+
def _edge_type_to_collibra_relation(self, edge_type: str) -> str:
|
| 53 |
+
"""Map internal edge types to Collibra relation types."""
|
| 54 |
+
relation_mapping = {
|
| 55 |
+
"transform": "is source of",
|
| 56 |
+
"reference": "references",
|
| 57 |
+
"ingest": "is source of",
|
| 58 |
+
"export": "is target of",
|
| 59 |
+
"join": "is source of",
|
| 60 |
+
"aggregate": "is source of",
|
| 61 |
+
"model": "is source of",
|
| 62 |
+
"publish": "is target of",
|
| 63 |
+
"reverse_etl": "is target of"
|
| 64 |
+
}
|
| 65 |
+
return relation_mapping.get(edge_type.lower(), "is source of")
|
| 66 |
+
|
| 67 |
+
def _create_asset(self, node: LineageNode) -> Dict[str, Any]:
|
| 68 |
+
"""Create a Collibra asset from a node."""
|
| 69 |
+
asset = {
|
| 70 |
+
"resourceType": "Asset",
|
| 71 |
+
"identifier": {
|
| 72 |
+
"name": node.name,
|
| 73 |
+
"domain": {
|
| 74 |
+
"name": self.domain_name,
|
| 75 |
+
"community": {
|
| 76 |
+
"name": self.community_name
|
| 77 |
+
}
|
| 78 |
+
}
|
| 79 |
+
},
|
| 80 |
+
"type": {
|
| 81 |
+
"name": self._node_type_to_collibra_type(node.type)
|
| 82 |
+
},
|
| 83 |
+
"displayName": node.name,
|
| 84 |
+
"attributes": {}
|
| 85 |
+
}
|
| 86 |
+
|
| 87 |
+
# Add description
|
| 88 |
+
if node.description:
|
| 89 |
+
asset["attributes"]["Description"] = [{"value": node.description}]
|
| 90 |
+
|
| 91 |
+
# Add database and schema
|
| 92 |
+
if node.database:
|
| 93 |
+
asset["attributes"]["Technical Data Type"] = [{"value": node.database}]
|
| 94 |
+
if node.schema:
|
| 95 |
+
asset["attributes"]["Schema Name"] = [{"value": node.schema}]
|
| 96 |
+
|
| 97 |
+
# Add owner
|
| 98 |
+
if node.owner:
|
| 99 |
+
asset["attributes"]["Data Owner"] = [{"value": node.owner}]
|
| 100 |
+
|
| 101 |
+
# Add tags as business terms
|
| 102 |
+
if node.tags:
|
| 103 |
+
asset["attributes"]["Tags"] = [{"value": ", ".join(node.tags)}]
|
| 104 |
+
|
| 105 |
+
# Add category
|
| 106 |
+
if node.category:
|
| 107 |
+
asset["attributes"]["Category"] = [{"value": node.category}]
|
| 108 |
+
|
| 109 |
+
return asset
|
| 110 |
+
|
| 111 |
+
def _create_relation(self, edge: LineageEdge) -> Dict[str, Any]:
|
| 112 |
+
"""Create a Collibra relation from an edge."""
|
| 113 |
+
source_node = self.graph.get_node(edge.source)
|
| 114 |
+
target_node = self.graph.get_node(edge.target)
|
| 115 |
+
|
| 116 |
+
relation = {
|
| 117 |
+
"resourceType": "Relation",
|
| 118 |
+
"source": {
|
| 119 |
+
"name": source_node.name if source_node else edge.source,
|
| 120 |
+
"domain": {
|
| 121 |
+
"name": self.domain_name,
|
| 122 |
+
"community": {
|
| 123 |
+
"name": self.community_name
|
| 124 |
+
}
|
| 125 |
+
}
|
| 126 |
+
},
|
| 127 |
+
"target": {
|
| 128 |
+
"name": target_node.name if target_node else edge.target,
|
| 129 |
+
"domain": {
|
| 130 |
+
"name": self.domain_name,
|
| 131 |
+
"community": {
|
| 132 |
+
"name": self.community_name
|
| 133 |
+
}
|
| 134 |
+
}
|
| 135 |
+
},
|
| 136 |
+
"type": {
|
| 137 |
+
"role": self._edge_type_to_collibra_relation(edge.type),
|
| 138 |
+
"coRole": "has source",
|
| 139 |
+
"sourceType": {
|
| 140 |
+
"name": self._node_type_to_collibra_type(
|
| 141 |
+
source_node.type if source_node else "table"
|
| 142 |
+
)
|
| 143 |
+
},
|
| 144 |
+
"targetType": {
|
| 145 |
+
"name": self._node_type_to_collibra_type(
|
| 146 |
+
target_node.type if target_node else "table"
|
| 147 |
+
)
|
| 148 |
+
}
|
| 149 |
+
}
|
| 150 |
+
}
|
| 151 |
+
|
| 152 |
+
return relation
|
| 153 |
+
|
| 154 |
+
def _create_column_assets(self, node: LineageNode) -> List[Dict[str, Any]]:
|
| 155 |
+
"""Create Collibra column assets from a node's columns."""
|
| 156 |
+
if not node.columns:
|
| 157 |
+
return []
|
| 158 |
+
|
| 159 |
+
column_assets = []
|
| 160 |
+
for col in node.columns:
|
| 161 |
+
column_asset = {
|
| 162 |
+
"resourceType": "Asset",
|
| 163 |
+
"identifier": {
|
| 164 |
+
"name": f"{node.name}.{col.get('name')}",
|
| 165 |
+
"domain": {
|
| 166 |
+
"name": self.domain_name,
|
| 167 |
+
"community": {
|
| 168 |
+
"name": self.community_name
|
| 169 |
+
}
|
| 170 |
+
}
|
| 171 |
+
},
|
| 172 |
+
"type": {
|
| 173 |
+
"name": "Column"
|
| 174 |
+
},
|
| 175 |
+
"displayName": col.get("name"),
|
| 176 |
+
"attributes": {
|
| 177 |
+
"Technical Data Type": [{"value": col.get("type") or col.get("data_type", "string")}]
|
| 178 |
+
},
|
| 179 |
+
"relations": {
|
| 180 |
+
"Column is part of Table": [{
|
| 181 |
+
"name": node.name,
|
| 182 |
+
"domain": {
|
| 183 |
+
"name": self.domain_name,
|
| 184 |
+
"community": {
|
| 185 |
+
"name": self.community_name
|
| 186 |
+
}
|
| 187 |
+
}
|
| 188 |
+
}]
|
| 189 |
+
}
|
| 190 |
+
}
|
| 191 |
+
|
| 192 |
+
if col.get("description"):
|
| 193 |
+
column_asset["attributes"]["Description"] = [{"value": col.get("description")}]
|
| 194 |
+
|
| 195 |
+
column_assets.append(column_asset)
|
| 196 |
+
|
| 197 |
+
return column_assets
|
| 198 |
+
|
| 199 |
+
def export(self) -> str:
|
| 200 |
+
"""Export to Collibra JSON import format."""
|
| 201 |
+
return self.to_json(indent=2)
|
| 202 |
+
|
| 203 |
+
def _to_dict(self) -> Dict[str, Any]:
|
| 204 |
+
"""Convert to Collibra import dictionary."""
|
| 205 |
+
# Collect all assets (nodes)
|
| 206 |
+
assets = []
|
| 207 |
+
for node in self.graph.nodes:
|
| 208 |
+
assets.append(self._create_asset(node))
|
| 209 |
+
# Add column assets if present
|
| 210 |
+
assets.extend(self._create_column_assets(node))
|
| 211 |
+
|
| 212 |
+
# Collect all relations (edges)
|
| 213 |
+
relations = [self._create_relation(edge) for edge in self.graph.edges]
|
| 214 |
+
|
| 215 |
+
return {
|
| 216 |
+
"exportInfo": {
|
| 217 |
+
"producer": "Lineage Graph Accelerator",
|
| 218 |
+
"exportedAt": self.graph.generated_at,
|
| 219 |
+
"sourceLineageName": self.graph.name,
|
| 220 |
+
"format": "Collibra Import API",
|
| 221 |
+
"version": "2.0"
|
| 222 |
+
},
|
| 223 |
+
"community": {
|
| 224 |
+
"name": self.community_name,
|
| 225 |
+
"description": f"Data lineage imported from {self.graph.name}"
|
| 226 |
+
},
|
| 227 |
+
"domain": {
|
| 228 |
+
"name": self.domain_name,
|
| 229 |
+
"type": "Physical Data Dictionary",
|
| 230 |
+
"community": {
|
| 231 |
+
"name": self.community_name
|
| 232 |
+
}
|
| 233 |
+
},
|
| 234 |
+
"assets": assets,
|
| 235 |
+
"relations": relations,
|
| 236 |
+
"summary": {
|
| 237 |
+
"totalAssets": len(assets),
|
| 238 |
+
"totalRelations": len(relations),
|
| 239 |
+
"assetTypes": list(set(
|
| 240 |
+
self._node_type_to_collibra_type(n.type) for n in self.graph.nodes
|
| 241 |
+
))
|
| 242 |
+
}
|
| 243 |
+
}
|
exporters/openlineage.py
ADDED
|
@@ -0,0 +1,177 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""
|
| 2 |
+
OpenLineage Exporter - Export to OpenLineage standard format.
|
| 3 |
+
|
| 4 |
+
OpenLineage is an open standard for metadata and lineage collection.
|
| 5 |
+
https://openlineage.io/
|
| 6 |
+
"""
|
| 7 |
+
|
| 8 |
+
from typing import Dict, Any, List
|
| 9 |
+
from datetime import datetime
|
| 10 |
+
import uuid
|
| 11 |
+
from .base import LineageExporter, LineageGraph, LineageNode, LineageEdge
|
| 12 |
+
|
| 13 |
+
|
| 14 |
+
class OpenLineageExporter(LineageExporter):
|
| 15 |
+
"""Export lineage to OpenLineage format."""
|
| 16 |
+
|
| 17 |
+
def __init__(self, graph: LineageGraph, namespace: str = "lineage-accelerator"):
|
| 18 |
+
super().__init__(graph)
|
| 19 |
+
self.namespace = namespace
|
| 20 |
+
|
| 21 |
+
@property
|
| 22 |
+
def format_name(self) -> str:
|
| 23 |
+
return "OpenLineage"
|
| 24 |
+
|
| 25 |
+
@property
|
| 26 |
+
def file_extension(self) -> str:
|
| 27 |
+
return ".json"
|
| 28 |
+
|
| 29 |
+
def _create_dataset(self, node: LineageNode) -> Dict[str, Any]:
|
| 30 |
+
"""Create an OpenLineage dataset from a node."""
|
| 31 |
+
dataset = {
|
| 32 |
+
"namespace": self.namespace,
|
| 33 |
+
"name": self._get_qualified_name(node),
|
| 34 |
+
"facets": {}
|
| 35 |
+
}
|
| 36 |
+
|
| 37 |
+
# Add schema facet if columns are present
|
| 38 |
+
if node.columns:
|
| 39 |
+
dataset["facets"]["schema"] = {
|
| 40 |
+
"_producer": "lineage-accelerator",
|
| 41 |
+
"_schemaURL": "https://openlineage.io/spec/facets/1-0-0/SchemaDatasetFacet.json",
|
| 42 |
+
"fields": [
|
| 43 |
+
{
|
| 44 |
+
"name": col.get("name"),
|
| 45 |
+
"type": col.get("type") or col.get("data_type", "string"),
|
| 46 |
+
"description": col.get("description")
|
| 47 |
+
}
|
| 48 |
+
for col in node.columns
|
| 49 |
+
]
|
| 50 |
+
}
|
| 51 |
+
|
| 52 |
+
# Add documentation facet
|
| 53 |
+
if node.description:
|
| 54 |
+
dataset["facets"]["documentation"] = {
|
| 55 |
+
"_producer": "lineage-accelerator",
|
| 56 |
+
"_schemaURL": "https://openlineage.io/spec/facets/1-0-0/DocumentationDatasetFacet.json",
|
| 57 |
+
"description": node.description
|
| 58 |
+
}
|
| 59 |
+
|
| 60 |
+
# Add ownership facet
|
| 61 |
+
if node.owner:
|
| 62 |
+
dataset["facets"]["ownership"] = {
|
| 63 |
+
"_producer": "lineage-accelerator",
|
| 64 |
+
"_schemaURL": "https://openlineage.io/spec/facets/1-0-0/OwnershipDatasetFacet.json",
|
| 65 |
+
"owners": [{"name": node.owner, "type": "MAINTAINER"}]
|
| 66 |
+
}
|
| 67 |
+
|
| 68 |
+
# Add custom facet for additional metadata
|
| 69 |
+
custom_facet = {}
|
| 70 |
+
if node.type:
|
| 71 |
+
custom_facet["nodeType"] = node.type
|
| 72 |
+
if node.category:
|
| 73 |
+
custom_facet["category"] = node.category
|
| 74 |
+
if node.tags:
|
| 75 |
+
custom_facet["tags"] = node.tags
|
| 76 |
+
if node.metadata:
|
| 77 |
+
custom_facet.update(node.metadata)
|
| 78 |
+
|
| 79 |
+
if custom_facet:
|
| 80 |
+
dataset["facets"]["custom"] = {
|
| 81 |
+
"_producer": "lineage-accelerator",
|
| 82 |
+
"_schemaURL": "https://openlineage.io/spec/1-0-0/OpenLineage.json#/definitions/CustomFacet",
|
| 83 |
+
**custom_facet
|
| 84 |
+
}
|
| 85 |
+
|
| 86 |
+
return dataset
|
| 87 |
+
|
| 88 |
+
def _get_qualified_name(self, node: LineageNode) -> str:
|
| 89 |
+
"""Get fully qualified name for a node."""
|
| 90 |
+
parts = []
|
| 91 |
+
if node.database:
|
| 92 |
+
parts.append(node.database)
|
| 93 |
+
if node.schema:
|
| 94 |
+
parts.append(node.schema)
|
| 95 |
+
parts.append(node.name)
|
| 96 |
+
return ".".join(parts)
|
| 97 |
+
|
| 98 |
+
def _create_job(self, edge: LineageEdge) -> Dict[str, Any]:
|
| 99 |
+
"""Create an OpenLineage job from an edge."""
|
| 100 |
+
job_name = edge.job_name or f"transform_{edge.source}_to_{edge.target}"
|
| 101 |
+
|
| 102 |
+
job = {
|
| 103 |
+
"namespace": self.namespace,
|
| 104 |
+
"name": job_name,
|
| 105 |
+
"facets": {}
|
| 106 |
+
}
|
| 107 |
+
|
| 108 |
+
# Add job type facet
|
| 109 |
+
if edge.type:
|
| 110 |
+
job["facets"]["jobType"] = {
|
| 111 |
+
"_producer": "lineage-accelerator",
|
| 112 |
+
"_schemaURL": "https://openlineage.io/spec/facets/1-0-0/JobTypeJobFacet.json",
|
| 113 |
+
"processingType": "BATCH",
|
| 114 |
+
"integration": "CUSTOM",
|
| 115 |
+
"jobType": edge.type.upper()
|
| 116 |
+
}
|
| 117 |
+
|
| 118 |
+
return job
|
| 119 |
+
|
| 120 |
+
def _create_run_event(self, edge: LineageEdge) -> Dict[str, Any]:
|
| 121 |
+
"""Create an OpenLineage run event for an edge."""
|
| 122 |
+
source_node = self.graph.get_node(edge.source)
|
| 123 |
+
target_node = self.graph.get_node(edge.target)
|
| 124 |
+
|
| 125 |
+
event = {
|
| 126 |
+
"eventType": "COMPLETE",
|
| 127 |
+
"eventTime": self.graph.generated_at,
|
| 128 |
+
"run": {
|
| 129 |
+
"runId": str(uuid.uuid4()),
|
| 130 |
+
"facets": {}
|
| 131 |
+
},
|
| 132 |
+
"job": self._create_job(edge),
|
| 133 |
+
"inputs": [],
|
| 134 |
+
"outputs": []
|
| 135 |
+
}
|
| 136 |
+
|
| 137 |
+
if source_node:
|
| 138 |
+
event["inputs"].append(self._create_dataset(source_node))
|
| 139 |
+
|
| 140 |
+
if target_node:
|
| 141 |
+
output_dataset = self._create_dataset(target_node)
|
| 142 |
+
# Add lineage facet to output
|
| 143 |
+
if source_node:
|
| 144 |
+
output_dataset["facets"]["columnLineage"] = {
|
| 145 |
+
"_producer": "lineage-accelerator",
|
| 146 |
+
"_schemaURL": "https://openlineage.io/spec/facets/1-0-0/ColumnLineageDatasetFacet.json",
|
| 147 |
+
"fields": {}
|
| 148 |
+
}
|
| 149 |
+
event["outputs"].append(output_dataset)
|
| 150 |
+
|
| 151 |
+
return event
|
| 152 |
+
|
| 153 |
+
def export(self) -> str:
|
| 154 |
+
"""Export to OpenLineage JSON format."""
|
| 155 |
+
return self.to_json(indent=2)
|
| 156 |
+
|
| 157 |
+
def _to_dict(self) -> Dict[str, Any]:
|
| 158 |
+
"""Convert to dictionary."""
|
| 159 |
+
# Create run events for each edge
|
| 160 |
+
events = [self._create_run_event(edge) for edge in self.graph.edges]
|
| 161 |
+
|
| 162 |
+
# Create a summary structure
|
| 163 |
+
return {
|
| 164 |
+
"producer": "lineage-accelerator",
|
| 165 |
+
"schemaURL": "https://openlineage.io/spec/1-0-0/OpenLineage.json",
|
| 166 |
+
"generatedAt": self.graph.generated_at,
|
| 167 |
+
"lineageName": self.graph.name,
|
| 168 |
+
"namespace": self.namespace,
|
| 169 |
+
"events": events,
|
| 170 |
+
"datasets": [self._create_dataset(node) for node in self.graph.nodes],
|
| 171 |
+
"summary": {
|
| 172 |
+
"totalNodes": len(self.graph.nodes),
|
| 173 |
+
"totalEdges": len(self.graph.edges),
|
| 174 |
+
"nodeTypes": list(set(n.type for n in self.graph.nodes)),
|
| 175 |
+
"edgeTypes": list(set(e.type for e in self.graph.edges))
|
| 176 |
+
}
|
| 177 |
+
}
|
exporters/purview.py
ADDED
|
@@ -0,0 +1,206 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""
|
| 2 |
+
Microsoft Purview Exporter - Export to Microsoft Purview format.
|
| 3 |
+
|
| 4 |
+
Microsoft Purview is a unified data governance service.
|
| 5 |
+
https://azure.microsoft.com/en-us/products/purview
|
| 6 |
+
"""
|
| 7 |
+
|
| 8 |
+
from typing import Dict, Any, List
|
| 9 |
+
from datetime import datetime
|
| 10 |
+
import uuid
|
| 11 |
+
from .base import LineageExporter, LineageGraph, LineageNode, LineageEdge
|
| 12 |
+
|
| 13 |
+
|
| 14 |
+
class PurviewExporter(LineageExporter):
|
| 15 |
+
"""Export lineage to Microsoft Purview format."""
|
| 16 |
+
|
| 17 |
+
def __init__(self, graph: LineageGraph, collection_name: str = "lineage-accelerator"):
|
| 18 |
+
super().__init__(graph)
|
| 19 |
+
self.collection_name = collection_name
|
| 20 |
+
|
| 21 |
+
@property
|
| 22 |
+
def format_name(self) -> str:
|
| 23 |
+
return "Microsoft Purview"
|
| 24 |
+
|
| 25 |
+
@property
|
| 26 |
+
def file_extension(self) -> str:
|
| 27 |
+
return ".json"
|
| 28 |
+
|
| 29 |
+
def _node_type_to_purview_type(self, node_type: str) -> str:
|
| 30 |
+
"""Map internal node types to Purview entity types."""
|
| 31 |
+
type_mapping = {
|
| 32 |
+
"table": "azure_sql_table",
|
| 33 |
+
"view": "azure_sql_view",
|
| 34 |
+
"model": "DataSet",
|
| 35 |
+
"source": "DataSource",
|
| 36 |
+
"destination": "DataSet",
|
| 37 |
+
"column": "azure_sql_column",
|
| 38 |
+
"database": "azure_sql_db",
|
| 39 |
+
"schema": "azure_sql_schema",
|
| 40 |
+
"report": "PowerBI_Report",
|
| 41 |
+
"dimension": "azure_sql_table",
|
| 42 |
+
"fact": "azure_sql_table",
|
| 43 |
+
"feature_set": "DataSet",
|
| 44 |
+
"semantic_model": "PowerBI_Dataset",
|
| 45 |
+
"external_api": "DataSource",
|
| 46 |
+
"extract": "DataSet"
|
| 47 |
+
}
|
| 48 |
+
return type_mapping.get(node_type.lower(), "DataSet")
|
| 49 |
+
|
| 50 |
+
def _create_entity(self, node: LineageNode) -> Dict[str, Any]:
|
| 51 |
+
"""Create a Purview entity from a node."""
|
| 52 |
+
qualified_name = self._get_qualified_name(node)
|
| 53 |
+
|
| 54 |
+
entity = {
|
| 55 |
+
"typeName": self._node_type_to_purview_type(node.type),
|
| 56 |
+
"attributes": {
|
| 57 |
+
"name": node.name,
|
| 58 |
+
"qualifiedName": qualified_name,
|
| 59 |
+
"description": node.description or f"Data asset: {node.name}"
|
| 60 |
+
},
|
| 61 |
+
"guid": str(uuid.uuid5(uuid.NAMESPACE_DNS, qualified_name)),
|
| 62 |
+
"status": "ACTIVE"
|
| 63 |
+
}
|
| 64 |
+
|
| 65 |
+
# Add database-specific attributes
|
| 66 |
+
if node.database:
|
| 67 |
+
entity["attributes"]["databaseName"] = node.database
|
| 68 |
+
if node.schema:
|
| 69 |
+
entity["attributes"]["schemaName"] = node.schema
|
| 70 |
+
|
| 71 |
+
# Add owner
|
| 72 |
+
if node.owner:
|
| 73 |
+
entity["attributes"]["owner"] = node.owner
|
| 74 |
+
|
| 75 |
+
# Add custom attributes
|
| 76 |
+
entity["attributes"]["sourceSystem"] = "lineage-accelerator"
|
| 77 |
+
if node.category:
|
| 78 |
+
entity["attributes"]["layer"] = node.category
|
| 79 |
+
if node.tags:
|
| 80 |
+
entity["attributes"]["userTags"] = node.tags
|
| 81 |
+
|
| 82 |
+
return entity
|
| 83 |
+
|
| 84 |
+
def _get_qualified_name(self, node: LineageNode) -> str:
|
| 85 |
+
"""Get Purview-style qualified name."""
|
| 86 |
+
parts = [self.collection_name]
|
| 87 |
+
if node.database:
|
| 88 |
+
parts.append(node.database)
|
| 89 |
+
if node.schema:
|
| 90 |
+
parts.append(node.schema)
|
| 91 |
+
parts.append(node.name)
|
| 92 |
+
return "://".join(parts[:1]) + "/" + "/".join(parts[1:])
|
| 93 |
+
|
| 94 |
+
def _create_column_entities(self, node: LineageNode) -> List[Dict[str, Any]]:
|
| 95 |
+
"""Create Purview column entities from a node's columns."""
|
| 96 |
+
if not node.columns:
|
| 97 |
+
return []
|
| 98 |
+
|
| 99 |
+
column_entities = []
|
| 100 |
+
parent_qualified_name = self._get_qualified_name(node)
|
| 101 |
+
|
| 102 |
+
for col in node.columns:
|
| 103 |
+
col_qualified_name = f"{parent_qualified_name}#{col.get('name')}"
|
| 104 |
+
column_entity = {
|
| 105 |
+
"typeName": "azure_sql_column",
|
| 106 |
+
"attributes": {
|
| 107 |
+
"name": col.get("name"),
|
| 108 |
+
"qualifiedName": col_qualified_name,
|
| 109 |
+
"data_type": col.get("type") or col.get("data_type", "string"),
|
| 110 |
+
"description": col.get("description", "")
|
| 111 |
+
},
|
| 112 |
+
"guid": str(uuid.uuid5(uuid.NAMESPACE_DNS, col_qualified_name)),
|
| 113 |
+
"status": "ACTIVE",
|
| 114 |
+
"relationshipAttributes": {
|
| 115 |
+
"table": {
|
| 116 |
+
"typeName": self._node_type_to_purview_type(node.type),
|
| 117 |
+
"guid": str(uuid.uuid5(uuid.NAMESPACE_DNS, parent_qualified_name))
|
| 118 |
+
}
|
| 119 |
+
}
|
| 120 |
+
}
|
| 121 |
+
column_entities.append(column_entity)
|
| 122 |
+
|
| 123 |
+
return column_entities
|
| 124 |
+
|
| 125 |
+
def _create_process(self, edge: LineageEdge) -> Dict[str, Any]:
|
| 126 |
+
"""Create a Purview process entity for lineage."""
|
| 127 |
+
source_node = self.graph.get_node(edge.source)
|
| 128 |
+
target_node = self.graph.get_node(edge.target)
|
| 129 |
+
|
| 130 |
+
process_name = edge.job_name or f"process_{edge.source}_to_{edge.target}"
|
| 131 |
+
process_qualified_name = f"{self.collection_name}://processes/{process_name}"
|
| 132 |
+
|
| 133 |
+
process = {
|
| 134 |
+
"typeName": "Process",
|
| 135 |
+
"attributes": {
|
| 136 |
+
"name": process_name,
|
| 137 |
+
"qualifiedName": process_qualified_name,
|
| 138 |
+
"description": f"Data transformation: {edge.type}"
|
| 139 |
+
},
|
| 140 |
+
"guid": str(uuid.uuid5(uuid.NAMESPACE_DNS, process_qualified_name)),
|
| 141 |
+
"status": "ACTIVE",
|
| 142 |
+
"relationshipAttributes": {
|
| 143 |
+
"inputs": [],
|
| 144 |
+
"outputs": []
|
| 145 |
+
}
|
| 146 |
+
}
|
| 147 |
+
|
| 148 |
+
# Add input reference
|
| 149 |
+
if source_node:
|
| 150 |
+
source_qualified_name = self._get_qualified_name(source_node)
|
| 151 |
+
process["relationshipAttributes"]["inputs"].append({
|
| 152 |
+
"typeName": self._node_type_to_purview_type(source_node.type),
|
| 153 |
+
"guid": str(uuid.uuid5(uuid.NAMESPACE_DNS, source_qualified_name)),
|
| 154 |
+
"qualifiedName": source_qualified_name
|
| 155 |
+
})
|
| 156 |
+
|
| 157 |
+
# Add output reference
|
| 158 |
+
if target_node:
|
| 159 |
+
target_qualified_name = self._get_qualified_name(target_node)
|
| 160 |
+
process["relationshipAttributes"]["outputs"].append({
|
| 161 |
+
"typeName": self._node_type_to_purview_type(target_node.type),
|
| 162 |
+
"guid": str(uuid.uuid5(uuid.NAMESPACE_DNS, target_qualified_name)),
|
| 163 |
+
"qualifiedName": target_qualified_name
|
| 164 |
+
})
|
| 165 |
+
|
| 166 |
+
return process
|
| 167 |
+
|
| 168 |
+
def export(self) -> str:
|
| 169 |
+
"""Export to Microsoft Purview JSON format."""
|
| 170 |
+
return self.to_json(indent=2)
|
| 171 |
+
|
| 172 |
+
def _to_dict(self) -> Dict[str, Any]:
|
| 173 |
+
"""Convert to Purview bulk import dictionary."""
|
| 174 |
+
# Collect all entities
|
| 175 |
+
entities = []
|
| 176 |
+
|
| 177 |
+
# Add node entities
|
| 178 |
+
for node in self.graph.nodes:
|
| 179 |
+
entities.append(self._create_entity(node))
|
| 180 |
+
# Add column entities
|
| 181 |
+
entities.extend(self._create_column_entities(node))
|
| 182 |
+
|
| 183 |
+
# Add process entities for lineage
|
| 184 |
+
processes = [self._create_process(edge) for edge in self.graph.edges]
|
| 185 |
+
|
| 186 |
+
return {
|
| 187 |
+
"exportInfo": {
|
| 188 |
+
"producer": "Lineage Graph Accelerator",
|
| 189 |
+
"exportedAt": self.graph.generated_at,
|
| 190 |
+
"sourceLineageName": self.graph.name,
|
| 191 |
+
"format": "Microsoft Purview Bulk Import",
|
| 192 |
+
"version": "1.0"
|
| 193 |
+
},
|
| 194 |
+
"collection": {
|
| 195 |
+
"referenceName": self.collection_name,
|
| 196 |
+
"type": "CollectionReference"
|
| 197 |
+
},
|
| 198 |
+
"entities": entities,
|
| 199 |
+
"processes": processes,
|
| 200 |
+
"referredEntities": {},
|
| 201 |
+
"summary": {
|
| 202 |
+
"totalEntities": len(entities),
|
| 203 |
+
"totalProcesses": len(processes),
|
| 204 |
+
"entityTypes": list(set(e["typeName"] for e in entities))
|
| 205 |
+
}
|
| 206 |
+
}
|
memories/graph_visualizer/tools.json
ADDED
|
@@ -0,0 +1 @@
|
|
|
|
|
|
|
| 1 |
+
{"tools":[],"interrupt_config":{}}
|
memories/subagents/tools.json
ADDED
|
@@ -0,0 +1 @@
|
|
|
|
|
|
|
| 1 |
+
{"tools":["bigquery_execute_query","read_url_content","google_sheets_read_range"],"interrupt_config":{}}
|
memories/tools.json
ADDED
|
@@ -0,0 +1 @@
|
|
|
|
|
|
|
| 1 |
+
{"tools":["bigquery_execute_query","read_url_content","google_sheets_read_range","tavily_web_search"],"interrupt_config":{"bigquery_execute_query":false,"read_url_content":false,"google_sheets_read_range":false,"tavily_web_search":false}}
|
requirements.txt
CHANGED
|
@@ -1,6 +1,5 @@
|
|
| 1 |
-
gradio>=
|
| 2 |
anthropic>=0.25.0
|
| 3 |
google-cloud-bigquery>=3.10.0
|
| 4 |
requests>=2.31.0
|
| 5 |
pyyaml>=6.0
|
| 6 |
-
|
|
|
|
| 1 |
+
gradio>=6.0.0
|
| 2 |
anthropic>=0.25.0
|
| 3 |
google-cloud-bigquery>=3.10.0
|
| 4 |
requests>=2.31.0
|
| 5 |
pyyaml>=6.0
|
|
|
samples/airflow_dag_sample.json
ADDED
|
@@ -0,0 +1,150 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
{
|
| 2 |
+
"dag_id": "ecommerce_etl_pipeline",
|
| 3 |
+
"description": "Daily ETL pipeline for e-commerce data warehouse",
|
| 4 |
+
"schedule_interval": "0 2 * * *",
|
| 5 |
+
"start_date": "2025-01-01",
|
| 6 |
+
"catchup": false,
|
| 7 |
+
"tags": ["etl", "ecommerce", "daily"],
|
| 8 |
+
"default_args": {
|
| 9 |
+
"owner": "data_engineering",
|
| 10 |
+
"retries": 3,
|
| 11 |
+
"retry_delay_minutes": 5,
|
| 12 |
+
"email_on_failure": true
|
| 13 |
+
},
|
| 14 |
+
"tasks": [
|
| 15 |
+
{
|
| 16 |
+
"task_id": "extract_customers",
|
| 17 |
+
"operator": "PythonOperator",
|
| 18 |
+
"description": "Extract customer data from source database",
|
| 19 |
+
"upstream_dependencies": [],
|
| 20 |
+
"downstream_dependencies": ["transform_customers"],
|
| 21 |
+
"source": "postgres://source_db/customers",
|
| 22 |
+
"target": "s3://data-lake/raw/customers/"
|
| 23 |
+
},
|
| 24 |
+
{
|
| 25 |
+
"task_id": "extract_orders",
|
| 26 |
+
"operator": "PythonOperator",
|
| 27 |
+
"description": "Extract orders data from source database",
|
| 28 |
+
"upstream_dependencies": [],
|
| 29 |
+
"downstream_dependencies": ["transform_orders"],
|
| 30 |
+
"source": "postgres://source_db/orders",
|
| 31 |
+
"target": "s3://data-lake/raw/orders/"
|
| 32 |
+
},
|
| 33 |
+
{
|
| 34 |
+
"task_id": "extract_products",
|
| 35 |
+
"operator": "PythonOperator",
|
| 36 |
+
"description": "Extract products data from source database",
|
| 37 |
+
"upstream_dependencies": [],
|
| 38 |
+
"downstream_dependencies": ["transform_products"],
|
| 39 |
+
"source": "postgres://source_db/products",
|
| 40 |
+
"target": "s3://data-lake/raw/products/"
|
| 41 |
+
},
|
| 42 |
+
{
|
| 43 |
+
"task_id": "extract_order_items",
|
| 44 |
+
"operator": "PythonOperator",
|
| 45 |
+
"description": "Extract order items from source database",
|
| 46 |
+
"upstream_dependencies": [],
|
| 47 |
+
"downstream_dependencies": ["transform_order_items"],
|
| 48 |
+
"source": "postgres://source_db/order_items",
|
| 49 |
+
"target": "s3://data-lake/raw/order_items/"
|
| 50 |
+
},
|
| 51 |
+
{
|
| 52 |
+
"task_id": "transform_customers",
|
| 53 |
+
"operator": "SparkSubmitOperator",
|
| 54 |
+
"description": "Clean and transform customer data",
|
| 55 |
+
"upstream_dependencies": ["extract_customers"],
|
| 56 |
+
"downstream_dependencies": ["load_dim_customers"],
|
| 57 |
+
"source": "s3://data-lake/raw/customers/",
|
| 58 |
+
"target": "s3://data-lake/transformed/customers/"
|
| 59 |
+
},
|
| 60 |
+
{
|
| 61 |
+
"task_id": "transform_orders",
|
| 62 |
+
"operator": "SparkSubmitOperator",
|
| 63 |
+
"description": "Clean and transform orders data",
|
| 64 |
+
"upstream_dependencies": ["extract_orders"],
|
| 65 |
+
"downstream_dependencies": ["load_fct_orders"],
|
| 66 |
+
"source": "s3://data-lake/raw/orders/",
|
| 67 |
+
"target": "s3://data-lake/transformed/orders/"
|
| 68 |
+
},
|
| 69 |
+
{
|
| 70 |
+
"task_id": "transform_products",
|
| 71 |
+
"operator": "SparkSubmitOperator",
|
| 72 |
+
"description": "Clean and transform products data",
|
| 73 |
+
"upstream_dependencies": ["extract_products"],
|
| 74 |
+
"downstream_dependencies": ["load_dim_products"],
|
| 75 |
+
"source": "s3://data-lake/raw/products/",
|
| 76 |
+
"target": "s3://data-lake/transformed/products/"
|
| 77 |
+
},
|
| 78 |
+
{
|
| 79 |
+
"task_id": "transform_order_items",
|
| 80 |
+
"operator": "SparkSubmitOperator",
|
| 81 |
+
"description": "Clean and transform order items data",
|
| 82 |
+
"upstream_dependencies": ["extract_order_items"],
|
| 83 |
+
"downstream_dependencies": ["load_fct_orders"],
|
| 84 |
+
"source": "s3://data-lake/raw/order_items/",
|
| 85 |
+
"target": "s3://data-lake/transformed/order_items/"
|
| 86 |
+
},
|
| 87 |
+
{
|
| 88 |
+
"task_id": "load_dim_customers",
|
| 89 |
+
"operator": "SnowflakeOperator",
|
| 90 |
+
"description": "Load customer dimension to Snowflake",
|
| 91 |
+
"upstream_dependencies": ["transform_customers"],
|
| 92 |
+
"downstream_dependencies": ["build_customer_metrics"],
|
| 93 |
+
"source": "s3://data-lake/transformed/customers/",
|
| 94 |
+
"target": "snowflake://warehouse/analytics.dim_customers"
|
| 95 |
+
},
|
| 96 |
+
{
|
| 97 |
+
"task_id": "load_dim_products",
|
| 98 |
+
"operator": "SnowflakeOperator",
|
| 99 |
+
"description": "Load product dimension to Snowflake",
|
| 100 |
+
"upstream_dependencies": ["transform_products"],
|
| 101 |
+
"downstream_dependencies": ["build_sales_report"],
|
| 102 |
+
"source": "s3://data-lake/transformed/products/",
|
| 103 |
+
"target": "snowflake://warehouse/analytics.dim_products"
|
| 104 |
+
},
|
| 105 |
+
{
|
| 106 |
+
"task_id": "load_fct_orders",
|
| 107 |
+
"operator": "SnowflakeOperator",
|
| 108 |
+
"description": "Load orders fact table to Snowflake",
|
| 109 |
+
"upstream_dependencies": ["transform_orders", "transform_order_items"],
|
| 110 |
+
"downstream_dependencies": ["build_customer_metrics", "build_sales_report"],
|
| 111 |
+
"source": ["s3://data-lake/transformed/orders/", "s3://data-lake/transformed/order_items/"],
|
| 112 |
+
"target": "snowflake://warehouse/analytics.fct_orders"
|
| 113 |
+
},
|
| 114 |
+
{
|
| 115 |
+
"task_id": "build_customer_metrics",
|
| 116 |
+
"operator": "SnowflakeOperator",
|
| 117 |
+
"description": "Calculate customer lifetime value and metrics",
|
| 118 |
+
"upstream_dependencies": ["load_dim_customers", "load_fct_orders"],
|
| 119 |
+
"downstream_dependencies": ["publish_to_bi"],
|
| 120 |
+
"source": ["analytics.dim_customers", "analytics.fct_orders"],
|
| 121 |
+
"target": "snowflake://warehouse/analytics.rpt_customer_metrics"
|
| 122 |
+
},
|
| 123 |
+
{
|
| 124 |
+
"task_id": "build_sales_report",
|
| 125 |
+
"operator": "SnowflakeOperator",
|
| 126 |
+
"description": "Build daily sales report",
|
| 127 |
+
"upstream_dependencies": ["load_dim_products", "load_fct_orders"],
|
| 128 |
+
"downstream_dependencies": ["publish_to_bi"],
|
| 129 |
+
"source": ["analytics.dim_products", "analytics.fct_orders"],
|
| 130 |
+
"target": "snowflake://warehouse/analytics.rpt_daily_sales"
|
| 131 |
+
},
|
| 132 |
+
{
|
| 133 |
+
"task_id": "publish_to_bi",
|
| 134 |
+
"operator": "PythonOperator",
|
| 135 |
+
"description": "Publish reports to BI tool",
|
| 136 |
+
"upstream_dependencies": ["build_customer_metrics", "build_sales_report"],
|
| 137 |
+
"downstream_dependencies": ["notify_stakeholders"],
|
| 138 |
+
"source": ["analytics.rpt_customer_metrics", "analytics.rpt_daily_sales"],
|
| 139 |
+
"target": "tableau://server/ecommerce_dashboard"
|
| 140 |
+
},
|
| 141 |
+
{
|
| 142 |
+
"task_id": "notify_stakeholders",
|
| 143 |
+
"operator": "EmailOperator",
|
| 144 |
+
"description": "Send completion notification",
|
| 145 |
+
"upstream_dependencies": ["publish_to_bi"],
|
| 146 |
+
"downstream_dependencies": []
|
| 147 |
+
}
|
| 148 |
+
],
|
| 149 |
+
"notes": "Sample Airflow DAG representing a complete ETL pipeline with extract, transform, load, and reporting stages."
|
| 150 |
+
}
|
samples/complex_lineage_demo.json
ADDED
|
@@ -0,0 +1,425 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
{
|
| 2 |
+
"title": "E-Commerce Analytics Platform - Complete Data Lineage",
|
| 3 |
+
"description": "A comprehensive demonstration of data lineage tracking across an entire e-commerce analytics platform, showcasing multi-source ingestion, transformation layers, and cross-system dependencies.",
|
| 4 |
+
"version": "1.0",
|
| 5 |
+
"generated_at": "2025-11-20T15:00:00Z",
|
| 6 |
+
"lineage_graph": {
|
| 7 |
+
"nodes": [
|
| 8 |
+
{
|
| 9 |
+
"id": "shopify_orders",
|
| 10 |
+
"name": "Shopify Orders API",
|
| 11 |
+
"type": "source",
|
| 12 |
+
"category": "external_api",
|
| 13 |
+
"description": "Order data from Shopify e-commerce platform",
|
| 14 |
+
"metadata": {
|
| 15 |
+
"platform": "Shopify",
|
| 16 |
+
"refresh_frequency": "real-time webhook",
|
| 17 |
+
"data_volume": "~50K orders/day"
|
| 18 |
+
}
|
| 19 |
+
},
|
| 20 |
+
{
|
| 21 |
+
"id": "shopify_products",
|
| 22 |
+
"name": "Shopify Products API",
|
| 23 |
+
"type": "source",
|
| 24 |
+
"category": "external_api"
|
| 25 |
+
},
|
| 26 |
+
{
|
| 27 |
+
"id": "shopify_customers",
|
| 28 |
+
"name": "Shopify Customers API",
|
| 29 |
+
"type": "source",
|
| 30 |
+
"category": "external_api"
|
| 31 |
+
},
|
| 32 |
+
{
|
| 33 |
+
"id": "stripe_payments",
|
| 34 |
+
"name": "Stripe Payments",
|
| 35 |
+
"type": "source",
|
| 36 |
+
"category": "external_api",
|
| 37 |
+
"description": "Payment transaction data from Stripe"
|
| 38 |
+
},
|
| 39 |
+
{
|
| 40 |
+
"id": "stripe_subscriptions",
|
| 41 |
+
"name": "Stripe Subscriptions",
|
| 42 |
+
"type": "source",
|
| 43 |
+
"category": "external_api"
|
| 44 |
+
},
|
| 45 |
+
{
|
| 46 |
+
"id": "ga4_events",
|
| 47 |
+
"name": "Google Analytics 4",
|
| 48 |
+
"type": "source",
|
| 49 |
+
"category": "analytics",
|
| 50 |
+
"description": "Website behavior and conversion events"
|
| 51 |
+
},
|
| 52 |
+
{
|
| 53 |
+
"id": "fb_ads",
|
| 54 |
+
"name": "Facebook Ads",
|
| 55 |
+
"type": "source",
|
| 56 |
+
"category": "marketing"
|
| 57 |
+
},
|
| 58 |
+
{
|
| 59 |
+
"id": "google_ads",
|
| 60 |
+
"name": "Google Ads",
|
| 61 |
+
"type": "source",
|
| 62 |
+
"category": "marketing"
|
| 63 |
+
},
|
| 64 |
+
{
|
| 65 |
+
"id": "zendesk_tickets",
|
| 66 |
+
"name": "Zendesk Support",
|
| 67 |
+
"type": "source",
|
| 68 |
+
"category": "support",
|
| 69 |
+
"description": "Customer support ticket data"
|
| 70 |
+
},
|
| 71 |
+
{
|
| 72 |
+
"id": "raw_orders",
|
| 73 |
+
"name": "raw.orders",
|
| 74 |
+
"type": "table",
|
| 75 |
+
"category": "raw_layer",
|
| 76 |
+
"schema": "raw",
|
| 77 |
+
"database": "analytics_dw"
|
| 78 |
+
},
|
| 79 |
+
{
|
| 80 |
+
"id": "raw_products",
|
| 81 |
+
"name": "raw.products",
|
| 82 |
+
"type": "table",
|
| 83 |
+
"category": "raw_layer"
|
| 84 |
+
},
|
| 85 |
+
{
|
| 86 |
+
"id": "raw_customers",
|
| 87 |
+
"name": "raw.customers",
|
| 88 |
+
"type": "table",
|
| 89 |
+
"category": "raw_layer"
|
| 90 |
+
},
|
| 91 |
+
{
|
| 92 |
+
"id": "raw_payments",
|
| 93 |
+
"name": "raw.payments",
|
| 94 |
+
"type": "table",
|
| 95 |
+
"category": "raw_layer"
|
| 96 |
+
},
|
| 97 |
+
{
|
| 98 |
+
"id": "raw_subscriptions",
|
| 99 |
+
"name": "raw.subscriptions",
|
| 100 |
+
"type": "table",
|
| 101 |
+
"category": "raw_layer"
|
| 102 |
+
},
|
| 103 |
+
{
|
| 104 |
+
"id": "raw_web_events",
|
| 105 |
+
"name": "raw.web_events",
|
| 106 |
+
"type": "table",
|
| 107 |
+
"category": "raw_layer"
|
| 108 |
+
},
|
| 109 |
+
{
|
| 110 |
+
"id": "raw_ad_spend",
|
| 111 |
+
"name": "raw.ad_spend",
|
| 112 |
+
"type": "table",
|
| 113 |
+
"category": "raw_layer"
|
| 114 |
+
},
|
| 115 |
+
{
|
| 116 |
+
"id": "raw_support_tickets",
|
| 117 |
+
"name": "raw.support_tickets",
|
| 118 |
+
"type": "table",
|
| 119 |
+
"category": "raw_layer"
|
| 120 |
+
},
|
| 121 |
+
{
|
| 122 |
+
"id": "stg_orders",
|
| 123 |
+
"name": "staging.stg_orders",
|
| 124 |
+
"type": "model",
|
| 125 |
+
"category": "staging_layer",
|
| 126 |
+
"transformation": "Clean, dedupe, add calculated fields"
|
| 127 |
+
},
|
| 128 |
+
{
|
| 129 |
+
"id": "stg_order_items",
|
| 130 |
+
"name": "staging.stg_order_items",
|
| 131 |
+
"type": "model",
|
| 132 |
+
"category": "staging_layer"
|
| 133 |
+
},
|
| 134 |
+
{
|
| 135 |
+
"id": "stg_products",
|
| 136 |
+
"name": "staging.stg_products",
|
| 137 |
+
"type": "model",
|
| 138 |
+
"category": "staging_layer"
|
| 139 |
+
},
|
| 140 |
+
{
|
| 141 |
+
"id": "stg_customers",
|
| 142 |
+
"name": "staging.stg_customers",
|
| 143 |
+
"type": "model",
|
| 144 |
+
"category": "staging_layer"
|
| 145 |
+
},
|
| 146 |
+
{
|
| 147 |
+
"id": "stg_payments",
|
| 148 |
+
"name": "staging.stg_payments",
|
| 149 |
+
"type": "model",
|
| 150 |
+
"category": "staging_layer"
|
| 151 |
+
},
|
| 152 |
+
{
|
| 153 |
+
"id": "stg_subscriptions",
|
| 154 |
+
"name": "staging.stg_subscriptions",
|
| 155 |
+
"type": "model",
|
| 156 |
+
"category": "staging_layer"
|
| 157 |
+
},
|
| 158 |
+
{
|
| 159 |
+
"id": "stg_web_sessions",
|
| 160 |
+
"name": "staging.stg_web_sessions",
|
| 161 |
+
"type": "model",
|
| 162 |
+
"category": "staging_layer",
|
| 163 |
+
"transformation": "Sessionize events, calculate engagement"
|
| 164 |
+
},
|
| 165 |
+
{
|
| 166 |
+
"id": "stg_ad_campaigns",
|
| 167 |
+
"name": "staging.stg_ad_campaigns",
|
| 168 |
+
"type": "model",
|
| 169 |
+
"category": "staging_layer"
|
| 170 |
+
},
|
| 171 |
+
{
|
| 172 |
+
"id": "stg_support_cases",
|
| 173 |
+
"name": "staging.stg_support_cases",
|
| 174 |
+
"type": "model",
|
| 175 |
+
"category": "staging_layer"
|
| 176 |
+
},
|
| 177 |
+
{
|
| 178 |
+
"id": "int_customer_orders",
|
| 179 |
+
"name": "intermediate.int_customer_orders",
|
| 180 |
+
"type": "model",
|
| 181 |
+
"category": "intermediate_layer",
|
| 182 |
+
"transformation": "Join customers with order history"
|
| 183 |
+
},
|
| 184 |
+
{
|
| 185 |
+
"id": "int_order_payments",
|
| 186 |
+
"name": "intermediate.int_order_payments",
|
| 187 |
+
"type": "model",
|
| 188 |
+
"category": "intermediate_layer",
|
| 189 |
+
"transformation": "Match orders with payments"
|
| 190 |
+
},
|
| 191 |
+
{
|
| 192 |
+
"id": "int_customer_attribution",
|
| 193 |
+
"name": "intermediate.int_customer_attribution",
|
| 194 |
+
"type": "model",
|
| 195 |
+
"category": "intermediate_layer",
|
| 196 |
+
"transformation": "Multi-touch attribution model"
|
| 197 |
+
},
|
| 198 |
+
{
|
| 199 |
+
"id": "int_product_performance",
|
| 200 |
+
"name": "intermediate.int_product_performance",
|
| 201 |
+
"type": "model",
|
| 202 |
+
"category": "intermediate_layer"
|
| 203 |
+
},
|
| 204 |
+
{
|
| 205 |
+
"id": "int_customer_support_history",
|
| 206 |
+
"name": "intermediate.int_customer_support_history",
|
| 207 |
+
"type": "model",
|
| 208 |
+
"category": "intermediate_layer"
|
| 209 |
+
},
|
| 210 |
+
{
|
| 211 |
+
"id": "dim_customers",
|
| 212 |
+
"name": "marts.dim_customers",
|
| 213 |
+
"type": "dimension",
|
| 214 |
+
"category": "marts_layer",
|
| 215 |
+
"description": "Customer dimension with lifetime metrics",
|
| 216 |
+
"grain": "customer"
|
| 217 |
+
},
|
| 218 |
+
{
|
| 219 |
+
"id": "dim_products",
|
| 220 |
+
"name": "marts.dim_products",
|
| 221 |
+
"type": "dimension",
|
| 222 |
+
"category": "marts_layer",
|
| 223 |
+
"grain": "product"
|
| 224 |
+
},
|
| 225 |
+
{
|
| 226 |
+
"id": "dim_date",
|
| 227 |
+
"name": "marts.dim_date",
|
| 228 |
+
"type": "dimension",
|
| 229 |
+
"category": "marts_layer",
|
| 230 |
+
"grain": "day"
|
| 231 |
+
},
|
| 232 |
+
{
|
| 233 |
+
"id": "fct_orders",
|
| 234 |
+
"name": "marts.fct_orders",
|
| 235 |
+
"type": "fact",
|
| 236 |
+
"category": "marts_layer",
|
| 237 |
+
"grain": "order"
|
| 238 |
+
},
|
| 239 |
+
{
|
| 240 |
+
"id": "fct_order_items",
|
| 241 |
+
"name": "marts.fct_order_items",
|
| 242 |
+
"type": "fact",
|
| 243 |
+
"category": "marts_layer",
|
| 244 |
+
"grain": "order_item"
|
| 245 |
+
},
|
| 246 |
+
{
|
| 247 |
+
"id": "fct_web_sessions",
|
| 248 |
+
"name": "marts.fct_web_sessions",
|
| 249 |
+
"type": "fact",
|
| 250 |
+
"category": "marts_layer"
|
| 251 |
+
},
|
| 252 |
+
{
|
| 253 |
+
"id": "fct_marketing_spend",
|
| 254 |
+
"name": "marts.fct_marketing_spend",
|
| 255 |
+
"type": "fact",
|
| 256 |
+
"category": "marts_layer"
|
| 257 |
+
},
|
| 258 |
+
{
|
| 259 |
+
"id": "fct_support_tickets",
|
| 260 |
+
"name": "marts.fct_support_tickets",
|
| 261 |
+
"type": "fact",
|
| 262 |
+
"category": "marts_layer"
|
| 263 |
+
},
|
| 264 |
+
{
|
| 265 |
+
"id": "rpt_daily_sales",
|
| 266 |
+
"name": "reporting.rpt_daily_sales",
|
| 267 |
+
"type": "report",
|
| 268 |
+
"category": "reporting_layer",
|
| 269 |
+
"description": "Daily sales summary by channel and category"
|
| 270 |
+
},
|
| 271 |
+
{
|
| 272 |
+
"id": "rpt_customer_ltv",
|
| 273 |
+
"name": "reporting.rpt_customer_ltv",
|
| 274 |
+
"type": "report",
|
| 275 |
+
"category": "reporting_layer",
|
| 276 |
+
"description": "Customer lifetime value analysis"
|
| 277 |
+
},
|
| 278 |
+
{
|
| 279 |
+
"id": "rpt_marketing_roi",
|
| 280 |
+
"name": "reporting.rpt_marketing_roi",
|
| 281 |
+
"type": "report",
|
| 282 |
+
"category": "reporting_layer",
|
| 283 |
+
"description": "Marketing ROI by channel and campaign"
|
| 284 |
+
},
|
| 285 |
+
{
|
| 286 |
+
"id": "rpt_product_analytics",
|
| 287 |
+
"name": "reporting.rpt_product_analytics",
|
| 288 |
+
"type": "report",
|
| 289 |
+
"category": "reporting_layer"
|
| 290 |
+
},
|
| 291 |
+
{
|
| 292 |
+
"id": "rpt_customer_health",
|
| 293 |
+
"name": "reporting.rpt_customer_health",
|
| 294 |
+
"type": "report",
|
| 295 |
+
"category": "reporting_layer",
|
| 296 |
+
"description": "Customer health score combining all signals"
|
| 297 |
+
},
|
| 298 |
+
{
|
| 299 |
+
"id": "ml_churn_features",
|
| 300 |
+
"name": "features.churn_prediction",
|
| 301 |
+
"type": "feature_set",
|
| 302 |
+
"category": "ml_features",
|
| 303 |
+
"description": "Features for churn prediction model"
|
| 304 |
+
},
|
| 305 |
+
{
|
| 306 |
+
"id": "ml_ltv_features",
|
| 307 |
+
"name": "features.ltv_prediction",
|
| 308 |
+
"type": "feature_set",
|
| 309 |
+
"category": "ml_features"
|
| 310 |
+
},
|
| 311 |
+
{
|
| 312 |
+
"id": "looker_model",
|
| 313 |
+
"name": "Looker Semantic Layer",
|
| 314 |
+
"type": "semantic_model",
|
| 315 |
+
"category": "bi_layer"
|
| 316 |
+
},
|
| 317 |
+
{
|
| 318 |
+
"id": "tableau_extract",
|
| 319 |
+
"name": "Tableau Extract",
|
| 320 |
+
"type": "extract",
|
| 321 |
+
"category": "bi_layer"
|
| 322 |
+
},
|
| 323 |
+
{
|
| 324 |
+
"id": "salesforce_sync",
|
| 325 |
+
"name": "Salesforce Account Sync",
|
| 326 |
+
"type": "destination",
|
| 327 |
+
"category": "reverse_etl"
|
| 328 |
+
},
|
| 329 |
+
{
|
| 330 |
+
"id": "hubspot_sync",
|
| 331 |
+
"name": "HubSpot Contact Sync",
|
| 332 |
+
"type": "destination",
|
| 333 |
+
"category": "reverse_etl"
|
| 334 |
+
}
|
| 335 |
+
],
|
| 336 |
+
"edges": [
|
| 337 |
+
{"from": "shopify_orders", "to": "raw_orders", "type": "ingest"},
|
| 338 |
+
{"from": "shopify_products", "to": "raw_products", "type": "ingest"},
|
| 339 |
+
{"from": "shopify_customers", "to": "raw_customers", "type": "ingest"},
|
| 340 |
+
{"from": "stripe_payments", "to": "raw_payments", "type": "ingest"},
|
| 341 |
+
{"from": "stripe_subscriptions", "to": "raw_subscriptions", "type": "ingest"},
|
| 342 |
+
{"from": "ga4_events", "to": "raw_web_events", "type": "ingest"},
|
| 343 |
+
{"from": "fb_ads", "to": "raw_ad_spend", "type": "ingest"},
|
| 344 |
+
{"from": "google_ads", "to": "raw_ad_spend", "type": "ingest"},
|
| 345 |
+
{"from": "zendesk_tickets", "to": "raw_support_tickets", "type": "ingest"},
|
| 346 |
+
|
| 347 |
+
{"from": "raw_orders", "to": "stg_orders", "type": "transform"},
|
| 348 |
+
{"from": "raw_orders", "to": "stg_order_items", "type": "transform"},
|
| 349 |
+
{"from": "raw_products", "to": "stg_products", "type": "transform"},
|
| 350 |
+
{"from": "raw_customers", "to": "stg_customers", "type": "transform"},
|
| 351 |
+
{"from": "raw_payments", "to": "stg_payments", "type": "transform"},
|
| 352 |
+
{"from": "raw_subscriptions", "to": "stg_subscriptions", "type": "transform"},
|
| 353 |
+
{"from": "raw_web_events", "to": "stg_web_sessions", "type": "transform"},
|
| 354 |
+
{"from": "raw_ad_spend", "to": "stg_ad_campaigns", "type": "transform"},
|
| 355 |
+
{"from": "raw_support_tickets", "to": "stg_support_cases", "type": "transform"},
|
| 356 |
+
|
| 357 |
+
{"from": "stg_customers", "to": "int_customer_orders", "type": "join"},
|
| 358 |
+
{"from": "stg_orders", "to": "int_customer_orders", "type": "join"},
|
| 359 |
+
{"from": "stg_orders", "to": "int_order_payments", "type": "join"},
|
| 360 |
+
{"from": "stg_payments", "to": "int_order_payments", "type": "join"},
|
| 361 |
+
{"from": "stg_customers", "to": "int_customer_attribution", "type": "join"},
|
| 362 |
+
{"from": "stg_web_sessions", "to": "int_customer_attribution", "type": "join"},
|
| 363 |
+
{"from": "stg_ad_campaigns", "to": "int_customer_attribution", "type": "join"},
|
| 364 |
+
{"from": "stg_products", "to": "int_product_performance", "type": "join"},
|
| 365 |
+
{"from": "stg_order_items", "to": "int_product_performance", "type": "join"},
|
| 366 |
+
{"from": "stg_customers", "to": "int_customer_support_history", "type": "join"},
|
| 367 |
+
{"from": "stg_support_cases", "to": "int_customer_support_history", "type": "join"},
|
| 368 |
+
|
| 369 |
+
{"from": "int_customer_orders", "to": "dim_customers", "type": "model"},
|
| 370 |
+
{"from": "int_customer_attribution", "to": "dim_customers", "type": "model"},
|
| 371 |
+
{"from": "int_customer_support_history", "to": "dim_customers", "type": "model"},
|
| 372 |
+
{"from": "stg_products", "to": "dim_products", "type": "model"},
|
| 373 |
+
{"from": "int_product_performance", "to": "dim_products", "type": "model"},
|
| 374 |
+
|
| 375 |
+
{"from": "int_order_payments", "to": "fct_orders", "type": "model"},
|
| 376 |
+
{"from": "dim_customers", "to": "fct_orders", "type": "reference"},
|
| 377 |
+
{"from": "stg_order_items", "to": "fct_order_items", "type": "model"},
|
| 378 |
+
{"from": "dim_products", "to": "fct_order_items", "type": "reference"},
|
| 379 |
+
{"from": "fct_orders", "to": "fct_order_items", "type": "reference"},
|
| 380 |
+
{"from": "stg_web_sessions", "to": "fct_web_sessions", "type": "model"},
|
| 381 |
+
{"from": "dim_customers", "to": "fct_web_sessions", "type": "reference"},
|
| 382 |
+
{"from": "stg_ad_campaigns", "to": "fct_marketing_spend", "type": "model"},
|
| 383 |
+
{"from": "int_customer_attribution", "to": "fct_marketing_spend", "type": "reference"},
|
| 384 |
+
{"from": "stg_support_cases", "to": "fct_support_tickets", "type": "model"},
|
| 385 |
+
{"from": "dim_customers", "to": "fct_support_tickets", "type": "reference"},
|
| 386 |
+
|
| 387 |
+
{"from": "fct_orders", "to": "rpt_daily_sales", "type": "aggregate"},
|
| 388 |
+
{"from": "fct_order_items", "to": "rpt_daily_sales", "type": "aggregate"},
|
| 389 |
+
{"from": "dim_products", "to": "rpt_daily_sales", "type": "reference"},
|
| 390 |
+
{"from": "dim_customers", "to": "rpt_customer_ltv", "type": "aggregate"},
|
| 391 |
+
{"from": "fct_orders", "to": "rpt_customer_ltv", "type": "aggregate"},
|
| 392 |
+
{"from": "fct_marketing_spend", "to": "rpt_marketing_roi", "type": "aggregate"},
|
| 393 |
+
{"from": "fct_orders", "to": "rpt_marketing_roi", "type": "aggregate"},
|
| 394 |
+
{"from": "int_customer_attribution", "to": "rpt_marketing_roi", "type": "reference"},
|
| 395 |
+
{"from": "dim_products", "to": "rpt_product_analytics", "type": "aggregate"},
|
| 396 |
+
{"from": "fct_order_items", "to": "rpt_product_analytics", "type": "aggregate"},
|
| 397 |
+
{"from": "dim_customers", "to": "rpt_customer_health", "type": "aggregate"},
|
| 398 |
+
{"from": "fct_orders", "to": "rpt_customer_health", "type": "aggregate"},
|
| 399 |
+
{"from": "fct_web_sessions", "to": "rpt_customer_health", "type": "aggregate"},
|
| 400 |
+
{"from": "fct_support_tickets", "to": "rpt_customer_health", "type": "aggregate"},
|
| 401 |
+
|
| 402 |
+
{"from": "dim_customers", "to": "ml_churn_features", "type": "export"},
|
| 403 |
+
{"from": "fct_orders", "to": "ml_churn_features", "type": "export"},
|
| 404 |
+
{"from": "fct_web_sessions", "to": "ml_churn_features", "type": "export"},
|
| 405 |
+
{"from": "fct_support_tickets", "to": "ml_churn_features", "type": "export"},
|
| 406 |
+
{"from": "dim_customers", "to": "ml_ltv_features", "type": "export"},
|
| 407 |
+
{"from": "fct_orders", "to": "ml_ltv_features", "type": "export"},
|
| 408 |
+
|
| 409 |
+
{"from": "rpt_daily_sales", "to": "looker_model", "type": "publish"},
|
| 410 |
+
{"from": "rpt_customer_ltv", "to": "looker_model", "type": "publish"},
|
| 411 |
+
{"from": "rpt_marketing_roi", "to": "looker_model", "type": "publish"},
|
| 412 |
+
{"from": "rpt_product_analytics", "to": "looker_model", "type": "publish"},
|
| 413 |
+
{"from": "rpt_customer_health", "to": "looker_model", "type": "publish"},
|
| 414 |
+
{"from": "rpt_daily_sales", "to": "tableau_extract", "type": "export"},
|
| 415 |
+
|
| 416 |
+
{"from": "rpt_customer_ltv", "to": "salesforce_sync", "type": "reverse_etl"},
|
| 417 |
+
{"from": "rpt_customer_health", "to": "salesforce_sync", "type": "reverse_etl"},
|
| 418 |
+
{"from": "rpt_customer_ltv", "to": "hubspot_sync", "type": "reverse_etl"}
|
| 419 |
+
]
|
| 420 |
+
},
|
| 421 |
+
"expected_visualization": {
|
| 422 |
+
"mermaid": "graph LR\n subgraph Sources\n shopify_orders[Shopify Orders]\n shopify_products[Shopify Products]\n shopify_customers[Shopify Customers]\n stripe_payments[Stripe Payments]\n ga4_events[GA4 Events]\n fb_ads[Facebook Ads]\n zendesk_tickets[Zendesk]\n end\n \n subgraph Raw\n raw_orders[raw.orders]\n raw_products[raw.products]\n raw_customers[raw.customers]\n raw_payments[raw.payments]\n raw_web_events[raw.web_events]\n end\n \n subgraph Staging\n stg_orders[staging.stg_orders]\n stg_customers[staging.stg_customers]\n stg_products[staging.stg_products]\n end\n \n subgraph Marts\n dim_customers[marts.dim_customers]\n dim_products[marts.dim_products]\n fct_orders[marts.fct_orders]\n end\n \n subgraph Reporting\n rpt_daily_sales[reporting.rpt_daily_sales]\n rpt_customer_ltv[reporting.rpt_customer_ltv]\n end\n \n shopify_orders --> raw_orders\n raw_orders --> stg_orders\n stg_orders --> fct_orders\n fct_orders --> rpt_daily_sales"
|
| 423 |
+
},
|
| 424 |
+
"notes": "This comprehensive demo showcases a real-world e-commerce analytics platform with 50+ nodes and 80+ edges across multiple data layers, from source systems through to BI tools and reverse ETL destinations."
|
| 425 |
+
}
|
samples/dbt_manifest_sample.json
ADDED
|
@@ -0,0 +1,196 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
{
|
| 2 |
+
"metadata": {
|
| 3 |
+
"dbt_schema_version": "https://schemas.getdbt.com/dbt/manifest/v10.json",
|
| 4 |
+
"dbt_version": "1.7.0",
|
| 5 |
+
"project_name": "ecommerce_analytics",
|
| 6 |
+
"generated_at": "2025-11-20T10:30:00Z"
|
| 7 |
+
},
|
| 8 |
+
"nodes": {
|
| 9 |
+
"source.ecommerce.raw.customers": {
|
| 10 |
+
"resource_type": "source",
|
| 11 |
+
"name": "customers",
|
| 12 |
+
"schema": "raw",
|
| 13 |
+
"database": "ecommerce_db",
|
| 14 |
+
"columns": {
|
| 15 |
+
"customer_id": {"name": "customer_id", "data_type": "integer"},
|
| 16 |
+
"email": {"name": "email", "data_type": "varchar"},
|
| 17 |
+
"created_at": {"name": "created_at", "data_type": "timestamp"},
|
| 18 |
+
"country": {"name": "country", "data_type": "varchar"}
|
| 19 |
+
}
|
| 20 |
+
},
|
| 21 |
+
"source.ecommerce.raw.orders": {
|
| 22 |
+
"resource_type": "source",
|
| 23 |
+
"name": "orders",
|
| 24 |
+
"schema": "raw",
|
| 25 |
+
"database": "ecommerce_db",
|
| 26 |
+
"columns": {
|
| 27 |
+
"order_id": {"name": "order_id", "data_type": "integer"},
|
| 28 |
+
"customer_id": {"name": "customer_id", "data_type": "integer"},
|
| 29 |
+
"order_date": {"name": "order_date", "data_type": "date"},
|
| 30 |
+
"total_amount": {"name": "total_amount", "data_type": "decimal"},
|
| 31 |
+
"status": {"name": "status", "data_type": "varchar"}
|
| 32 |
+
}
|
| 33 |
+
},
|
| 34 |
+
"source.ecommerce.raw.products": {
|
| 35 |
+
"resource_type": "source",
|
| 36 |
+
"name": "products",
|
| 37 |
+
"schema": "raw",
|
| 38 |
+
"database": "ecommerce_db",
|
| 39 |
+
"columns": {
|
| 40 |
+
"product_id": {"name": "product_id", "data_type": "integer"},
|
| 41 |
+
"product_name": {"name": "product_name", "data_type": "varchar"},
|
| 42 |
+
"category": {"name": "category", "data_type": "varchar"},
|
| 43 |
+
"price": {"name": "price", "data_type": "decimal"}
|
| 44 |
+
}
|
| 45 |
+
},
|
| 46 |
+
"source.ecommerce.raw.order_items": {
|
| 47 |
+
"resource_type": "source",
|
| 48 |
+
"name": "order_items",
|
| 49 |
+
"schema": "raw",
|
| 50 |
+
"database": "ecommerce_db",
|
| 51 |
+
"columns": {
|
| 52 |
+
"order_item_id": {"name": "order_item_id", "data_type": "integer"},
|
| 53 |
+
"order_id": {"name": "order_id", "data_type": "integer"},
|
| 54 |
+
"product_id": {"name": "product_id", "data_type": "integer"},
|
| 55 |
+
"quantity": {"name": "quantity", "data_type": "integer"},
|
| 56 |
+
"unit_price": {"name": "unit_price", "data_type": "decimal"}
|
| 57 |
+
}
|
| 58 |
+
},
|
| 59 |
+
"model.ecommerce.stg_customers": {
|
| 60 |
+
"resource_type": "model",
|
| 61 |
+
"name": "stg_customers",
|
| 62 |
+
"schema": "staging",
|
| 63 |
+
"database": "ecommerce_db",
|
| 64 |
+
"depends_on": {
|
| 65 |
+
"nodes": ["source.ecommerce.raw.customers"]
|
| 66 |
+
},
|
| 67 |
+
"columns": {
|
| 68 |
+
"customer_id": {"name": "customer_id", "data_type": "integer"},
|
| 69 |
+
"email": {"name": "email", "data_type": "varchar"},
|
| 70 |
+
"signup_date": {"name": "signup_date", "data_type": "date"},
|
| 71 |
+
"country": {"name": "country", "data_type": "varchar"}
|
| 72 |
+
}
|
| 73 |
+
},
|
| 74 |
+
"model.ecommerce.stg_orders": {
|
| 75 |
+
"resource_type": "model",
|
| 76 |
+
"name": "stg_orders",
|
| 77 |
+
"schema": "staging",
|
| 78 |
+
"database": "ecommerce_db",
|
| 79 |
+
"depends_on": {
|
| 80 |
+
"nodes": ["source.ecommerce.raw.orders"]
|
| 81 |
+
},
|
| 82 |
+
"columns": {
|
| 83 |
+
"order_id": {"name": "order_id", "data_type": "integer"},
|
| 84 |
+
"customer_id": {"name": "customer_id", "data_type": "integer"},
|
| 85 |
+
"order_date": {"name": "order_date", "data_type": "date"},
|
| 86 |
+
"total_amount": {"name": "total_amount", "data_type": "decimal"},
|
| 87 |
+
"order_status": {"name": "order_status", "data_type": "varchar"}
|
| 88 |
+
}
|
| 89 |
+
},
|
| 90 |
+
"model.ecommerce.stg_products": {
|
| 91 |
+
"resource_type": "model",
|
| 92 |
+
"name": "stg_products",
|
| 93 |
+
"schema": "staging",
|
| 94 |
+
"database": "ecommerce_db",
|
| 95 |
+
"depends_on": {
|
| 96 |
+
"nodes": ["source.ecommerce.raw.products"]
|
| 97 |
+
}
|
| 98 |
+
},
|
| 99 |
+
"model.ecommerce.stg_order_items": {
|
| 100 |
+
"resource_type": "model",
|
| 101 |
+
"name": "stg_order_items",
|
| 102 |
+
"schema": "staging",
|
| 103 |
+
"database": "ecommerce_db",
|
| 104 |
+
"depends_on": {
|
| 105 |
+
"nodes": ["source.ecommerce.raw.order_items"]
|
| 106 |
+
}
|
| 107 |
+
},
|
| 108 |
+
"model.ecommerce.int_orders_enriched": {
|
| 109 |
+
"resource_type": "model",
|
| 110 |
+
"name": "int_orders_enriched",
|
| 111 |
+
"schema": "intermediate",
|
| 112 |
+
"database": "ecommerce_db",
|
| 113 |
+
"depends_on": {
|
| 114 |
+
"nodes": [
|
| 115 |
+
"model.ecommerce.stg_orders",
|
| 116 |
+
"model.ecommerce.stg_order_items",
|
| 117 |
+
"model.ecommerce.stg_products"
|
| 118 |
+
]
|
| 119 |
+
},
|
| 120 |
+
"description": "Orders joined with order items and product details"
|
| 121 |
+
},
|
| 122 |
+
"model.ecommerce.int_customer_orders": {
|
| 123 |
+
"resource_type": "model",
|
| 124 |
+
"name": "int_customer_orders",
|
| 125 |
+
"schema": "intermediate",
|
| 126 |
+
"database": "ecommerce_db",
|
| 127 |
+
"depends_on": {
|
| 128 |
+
"nodes": [
|
| 129 |
+
"model.ecommerce.stg_customers",
|
| 130 |
+
"model.ecommerce.stg_orders"
|
| 131 |
+
]
|
| 132 |
+
},
|
| 133 |
+
"description": "Customers joined with their orders"
|
| 134 |
+
},
|
| 135 |
+
"model.ecommerce.fct_orders": {
|
| 136 |
+
"resource_type": "model",
|
| 137 |
+
"name": "fct_orders",
|
| 138 |
+
"schema": "marts",
|
| 139 |
+
"database": "ecommerce_db",
|
| 140 |
+
"depends_on": {
|
| 141 |
+
"nodes": [
|
| 142 |
+
"model.ecommerce.int_orders_enriched",
|
| 143 |
+
"model.ecommerce.int_customer_orders"
|
| 144 |
+
]
|
| 145 |
+
},
|
| 146 |
+
"description": "Fact table for order analytics"
|
| 147 |
+
},
|
| 148 |
+
"model.ecommerce.dim_customers": {
|
| 149 |
+
"resource_type": "model",
|
| 150 |
+
"name": "dim_customers",
|
| 151 |
+
"schema": "marts",
|
| 152 |
+
"database": "ecommerce_db",
|
| 153 |
+
"depends_on": {
|
| 154 |
+
"nodes": ["model.ecommerce.int_customer_orders"]
|
| 155 |
+
},
|
| 156 |
+
"description": "Customer dimension with order metrics"
|
| 157 |
+
},
|
| 158 |
+
"model.ecommerce.dim_products": {
|
| 159 |
+
"resource_type": "model",
|
| 160 |
+
"name": "dim_products",
|
| 161 |
+
"schema": "marts",
|
| 162 |
+
"database": "ecommerce_db",
|
| 163 |
+
"depends_on": {
|
| 164 |
+
"nodes": ["model.ecommerce.stg_products"]
|
| 165 |
+
},
|
| 166 |
+
"description": "Product dimension table"
|
| 167 |
+
},
|
| 168 |
+
"model.ecommerce.rpt_daily_sales": {
|
| 169 |
+
"resource_type": "model",
|
| 170 |
+
"name": "rpt_daily_sales",
|
| 171 |
+
"schema": "reporting",
|
| 172 |
+
"database": "ecommerce_db",
|
| 173 |
+
"depends_on": {
|
| 174 |
+
"nodes": [
|
| 175 |
+
"model.ecommerce.fct_orders",
|
| 176 |
+
"model.ecommerce.dim_products"
|
| 177 |
+
]
|
| 178 |
+
},
|
| 179 |
+
"description": "Daily sales report by product category"
|
| 180 |
+
},
|
| 181 |
+
"model.ecommerce.rpt_customer_ltv": {
|
| 182 |
+
"resource_type": "model",
|
| 183 |
+
"name": "rpt_customer_ltv",
|
| 184 |
+
"schema": "reporting",
|
| 185 |
+
"database": "ecommerce_db",
|
| 186 |
+
"depends_on": {
|
| 187 |
+
"nodes": [
|
| 188 |
+
"model.ecommerce.fct_orders",
|
| 189 |
+
"model.ecommerce.dim_customers"
|
| 190 |
+
]
|
| 191 |
+
},
|
| 192 |
+
"description": "Customer lifetime value analysis"
|
| 193 |
+
}
|
| 194 |
+
},
|
| 195 |
+
"notes": "Sample dbt manifest representing an e-commerce analytics project with staging, intermediate, mart, and reporting layers."
|
| 196 |
+
}
|
samples/etl_pipeline_sample.json
ADDED
|
@@ -0,0 +1,252 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
{
|
| 2 |
+
"pipeline": {
|
| 3 |
+
"name": "customer_analytics_pipeline",
|
| 4 |
+
"description": "End-to-end customer analytics data pipeline",
|
| 5 |
+
"version": "2.1.0",
|
| 6 |
+
"owner": "data-engineering@company.com",
|
| 7 |
+
"created": "2025-01-15",
|
| 8 |
+
"schedule": "daily at 02:00 UTC"
|
| 9 |
+
},
|
| 10 |
+
"sources": [
|
| 11 |
+
{
|
| 12 |
+
"id": "src_salesforce",
|
| 13 |
+
"name": "Salesforce CRM",
|
| 14 |
+
"type": "api",
|
| 15 |
+
"connection": {
|
| 16 |
+
"endpoint": "https://company.salesforce.com/api/v52.0",
|
| 17 |
+
"auth": "oauth2"
|
| 18 |
+
},
|
| 19 |
+
"objects": ["Account", "Contact", "Opportunity", "Lead"],
|
| 20 |
+
"incremental_field": "LastModifiedDate"
|
| 21 |
+
},
|
| 22 |
+
{
|
| 23 |
+
"id": "src_stripe",
|
| 24 |
+
"name": "Stripe Payments",
|
| 25 |
+
"type": "api",
|
| 26 |
+
"connection": {
|
| 27 |
+
"endpoint": "https://api.stripe.com/v1",
|
| 28 |
+
"auth": "api_key"
|
| 29 |
+
},
|
| 30 |
+
"objects": ["charges", "customers", "subscriptions", "invoices"]
|
| 31 |
+
},
|
| 32 |
+
{
|
| 33 |
+
"id": "src_postgres_app",
|
| 34 |
+
"name": "Application Database",
|
| 35 |
+
"type": "database",
|
| 36 |
+
"connection": {
|
| 37 |
+
"host": "app-db.internal",
|
| 38 |
+
"port": 5432,
|
| 39 |
+
"database": "production"
|
| 40 |
+
},
|
| 41 |
+
"tables": ["users", "user_events", "feature_flags", "subscriptions"]
|
| 42 |
+
},
|
| 43 |
+
{
|
| 44 |
+
"id": "src_segment",
|
| 45 |
+
"name": "Segment Events",
|
| 46 |
+
"type": "stream",
|
| 47 |
+
"connection": {
|
| 48 |
+
"type": "kafka",
|
| 49 |
+
"topic": "segment-events",
|
| 50 |
+
"bootstrap_servers": "kafka.internal:9092"
|
| 51 |
+
},
|
| 52 |
+
"events": ["page", "track", "identify"]
|
| 53 |
+
},
|
| 54 |
+
{
|
| 55 |
+
"id": "src_google_analytics",
|
| 56 |
+
"name": "Google Analytics 4",
|
| 57 |
+
"type": "api",
|
| 58 |
+
"connection": {
|
| 59 |
+
"property_id": "GA4-123456789"
|
| 60 |
+
},
|
| 61 |
+
"metrics": ["sessions", "users", "conversions", "revenue"]
|
| 62 |
+
}
|
| 63 |
+
],
|
| 64 |
+
"stages": [
|
| 65 |
+
{
|
| 66 |
+
"id": "extract",
|
| 67 |
+
"name": "Data Extraction",
|
| 68 |
+
"steps": [
|
| 69 |
+
{
|
| 70 |
+
"id": "ext_salesforce",
|
| 71 |
+
"source": "src_salesforce",
|
| 72 |
+
"output": "landing/salesforce/",
|
| 73 |
+
"format": "parquet",
|
| 74 |
+
"partitions": ["date"],
|
| 75 |
+
"mode": "incremental"
|
| 76 |
+
},
|
| 77 |
+
{
|
| 78 |
+
"id": "ext_stripe",
|
| 79 |
+
"source": "src_stripe",
|
| 80 |
+
"output": "landing/stripe/",
|
| 81 |
+
"format": "parquet",
|
| 82 |
+
"mode": "incremental"
|
| 83 |
+
},
|
| 84 |
+
{
|
| 85 |
+
"id": "ext_postgres",
|
| 86 |
+
"source": "src_postgres_app",
|
| 87 |
+
"output": "landing/app_db/",
|
| 88 |
+
"format": "parquet",
|
| 89 |
+
"mode": "cdc"
|
| 90 |
+
},
|
| 91 |
+
{
|
| 92 |
+
"id": "ext_segment",
|
| 93 |
+
"source": "src_segment",
|
| 94 |
+
"output": "landing/segment/",
|
| 95 |
+
"format": "parquet",
|
| 96 |
+
"mode": "streaming"
|
| 97 |
+
},
|
| 98 |
+
{
|
| 99 |
+
"id": "ext_ga4",
|
| 100 |
+
"source": "src_google_analytics",
|
| 101 |
+
"output": "landing/ga4/",
|
| 102 |
+
"format": "parquet",
|
| 103 |
+
"mode": "batch"
|
| 104 |
+
}
|
| 105 |
+
]
|
| 106 |
+
},
|
| 107 |
+
{
|
| 108 |
+
"id": "transform",
|
| 109 |
+
"name": "Data Transformation",
|
| 110 |
+
"steps": [
|
| 111 |
+
{
|
| 112 |
+
"id": "tfm_customer_identity",
|
| 113 |
+
"name": "Customer Identity Resolution",
|
| 114 |
+
"inputs": ["ext_salesforce", "ext_stripe", "ext_postgres"],
|
| 115 |
+
"output": "curated/customer_identity/",
|
| 116 |
+
"logic": "Match and merge customer identities across systems using email, phone, and probabilistic matching",
|
| 117 |
+
"technology": "Spark"
|
| 118 |
+
},
|
| 119 |
+
{
|
| 120 |
+
"id": "tfm_event_enrichment",
|
| 121 |
+
"name": "Event Enrichment",
|
| 122 |
+
"inputs": ["ext_segment", "ext_ga4", "tfm_customer_identity"],
|
| 123 |
+
"output": "curated/events_enriched/",
|
| 124 |
+
"logic": "Join events with customer identity and add session context"
|
| 125 |
+
},
|
| 126 |
+
{
|
| 127 |
+
"id": "tfm_revenue_calc",
|
| 128 |
+
"name": "Revenue Calculation",
|
| 129 |
+
"inputs": ["ext_stripe", "ext_salesforce", "tfm_customer_identity"],
|
| 130 |
+
"output": "curated/revenue/",
|
| 131 |
+
"logic": "Calculate MRR, ARR, churn, and expansion revenue metrics"
|
| 132 |
+
},
|
| 133 |
+
{
|
| 134 |
+
"id": "tfm_product_usage",
|
| 135 |
+
"name": "Product Usage Metrics",
|
| 136 |
+
"inputs": ["ext_postgres", "tfm_event_enrichment"],
|
| 137 |
+
"output": "curated/product_usage/",
|
| 138 |
+
"logic": "Aggregate product usage by customer and feature"
|
| 139 |
+
}
|
| 140 |
+
]
|
| 141 |
+
},
|
| 142 |
+
{
|
| 143 |
+
"id": "model",
|
| 144 |
+
"name": "Data Modeling",
|
| 145 |
+
"steps": [
|
| 146 |
+
{
|
| 147 |
+
"id": "mdl_dim_customer",
|
| 148 |
+
"name": "Customer Dimension",
|
| 149 |
+
"inputs": ["tfm_customer_identity", "tfm_revenue_calc"],
|
| 150 |
+
"output": "warehouse.dim_customer",
|
| 151 |
+
"type": "scd_type_2"
|
| 152 |
+
},
|
| 153 |
+
{
|
| 154 |
+
"id": "mdl_dim_product",
|
| 155 |
+
"name": "Product Dimension",
|
| 156 |
+
"inputs": ["ext_postgres"],
|
| 157 |
+
"output": "warehouse.dim_product"
|
| 158 |
+
},
|
| 159 |
+
{
|
| 160 |
+
"id": "mdl_fct_events",
|
| 161 |
+
"name": "Events Fact",
|
| 162 |
+
"inputs": ["tfm_event_enrichment", "mdl_dim_customer", "mdl_dim_product"],
|
| 163 |
+
"output": "warehouse.fct_events",
|
| 164 |
+
"grain": "event"
|
| 165 |
+
},
|
| 166 |
+
{
|
| 167 |
+
"id": "mdl_fct_revenue",
|
| 168 |
+
"name": "Revenue Fact",
|
| 169 |
+
"inputs": ["tfm_revenue_calc", "mdl_dim_customer"],
|
| 170 |
+
"output": "warehouse.fct_revenue",
|
| 171 |
+
"grain": "transaction"
|
| 172 |
+
},
|
| 173 |
+
{
|
| 174 |
+
"id": "mdl_fct_usage",
|
| 175 |
+
"name": "Usage Fact",
|
| 176 |
+
"inputs": ["tfm_product_usage", "mdl_dim_customer", "mdl_dim_product"],
|
| 177 |
+
"output": "warehouse.fct_usage",
|
| 178 |
+
"grain": "daily_customer_feature"
|
| 179 |
+
}
|
| 180 |
+
]
|
| 181 |
+
},
|
| 182 |
+
{
|
| 183 |
+
"id": "aggregate",
|
| 184 |
+
"name": "Aggregations & Marts",
|
| 185 |
+
"steps": [
|
| 186 |
+
{
|
| 187 |
+
"id": "agg_customer_360",
|
| 188 |
+
"name": "Customer 360 View",
|
| 189 |
+
"inputs": ["mdl_dim_customer", "mdl_fct_events", "mdl_fct_revenue", "mdl_fct_usage"],
|
| 190 |
+
"output": "marts.customer_360",
|
| 191 |
+
"refresh": "hourly"
|
| 192 |
+
},
|
| 193 |
+
{
|
| 194 |
+
"id": "agg_revenue_metrics",
|
| 195 |
+
"name": "Revenue Metrics",
|
| 196 |
+
"inputs": ["mdl_fct_revenue", "mdl_dim_customer"],
|
| 197 |
+
"output": "marts.revenue_metrics",
|
| 198 |
+
"refresh": "daily"
|
| 199 |
+
},
|
| 200 |
+
{
|
| 201 |
+
"id": "agg_product_analytics",
|
| 202 |
+
"name": "Product Analytics",
|
| 203 |
+
"inputs": ["mdl_fct_usage", "mdl_fct_events", "mdl_dim_product"],
|
| 204 |
+
"output": "marts.product_analytics",
|
| 205 |
+
"refresh": "daily"
|
| 206 |
+
},
|
| 207 |
+
{
|
| 208 |
+
"id": "agg_health_score",
|
| 209 |
+
"name": "Customer Health Score",
|
| 210 |
+
"inputs": ["agg_customer_360", "agg_revenue_metrics", "agg_product_analytics"],
|
| 211 |
+
"output": "marts.customer_health_score",
|
| 212 |
+
"logic": "ML-based health score prediction"
|
| 213 |
+
}
|
| 214 |
+
]
|
| 215 |
+
},
|
| 216 |
+
{
|
| 217 |
+
"id": "publish",
|
| 218 |
+
"name": "Data Publishing",
|
| 219 |
+
"steps": [
|
| 220 |
+
{
|
| 221 |
+
"id": "pub_looker",
|
| 222 |
+
"name": "Looker Semantic Layer",
|
| 223 |
+
"inputs": ["agg_customer_360", "agg_revenue_metrics", "agg_product_analytics"],
|
| 224 |
+
"output": "looker://models/customer_analytics",
|
| 225 |
+
"type": "semantic_model"
|
| 226 |
+
},
|
| 227 |
+
{
|
| 228 |
+
"id": "pub_salesforce_sync",
|
| 229 |
+
"name": "Salesforce Sync",
|
| 230 |
+
"inputs": ["agg_customer_360", "agg_health_score"],
|
| 231 |
+
"output": "salesforce://Account.HealthScore__c",
|
| 232 |
+
"type": "reverse_etl"
|
| 233 |
+
},
|
| 234 |
+
{
|
| 235 |
+
"id": "pub_ml_features",
|
| 236 |
+
"name": "ML Feature Store",
|
| 237 |
+
"inputs": ["agg_customer_360", "agg_product_analytics"],
|
| 238 |
+
"output": "feast://customer_features",
|
| 239 |
+
"type": "feature_store"
|
| 240 |
+
}
|
| 241 |
+
]
|
| 242 |
+
}
|
| 243 |
+
],
|
| 244 |
+
"data_quality": {
|
| 245 |
+
"rules": [
|
| 246 |
+
{"table": "mdl_dim_customer", "check": "unique", "column": "customer_id"},
|
| 247 |
+
{"table": "mdl_fct_revenue", "check": "not_null", "columns": ["customer_id", "amount", "transaction_date"]},
|
| 248 |
+
{"table": "agg_revenue_metrics", "check": "freshness", "max_delay_hours": 2}
|
| 249 |
+
]
|
| 250 |
+
},
|
| 251 |
+
"notes": "Comprehensive ETL pipeline sample showing data flow from multiple sources through transformation, modeling, and publishing stages."
|
| 252 |
+
}
|
samples/sample_api_metadata.json
ADDED
|
@@ -0,0 +1,8 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
{
|
| 2 |
+
"service": "example-api",
|
| 3 |
+
"endpoints": [
|
| 4 |
+
{"path": "/customers", "method": "GET", "produces": "raw_customers"},
|
| 5 |
+
{"path": "/orders", "method": "POST", "produces": "orders"}
|
| 6 |
+
],
|
| 7 |
+
"notes": "Sample API metadata representing sources that produce tables."
|
| 8 |
+
}
|
samples/sample_metadata.json
ADDED
|
@@ -0,0 +1,12 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
{
|
| 2 |
+
"nodes": [
|
| 3 |
+
{"id": "raw_customers", "type": "table", "name": "raw_customers"},
|
| 4 |
+
{"id": "clean_customers", "type": "table", "name": "clean_customers"},
|
| 5 |
+
{"id": "orders", "type": "table", "name": "orders"}
|
| 6 |
+
],
|
| 7 |
+
"edges": [
|
| 8 |
+
{"from": "raw_customers", "to": "clean_customers"},
|
| 9 |
+
{"from": "clean_customers", "to": "orders"}
|
| 10 |
+
],
|
| 11 |
+
"notes": "Sample JSON manifest representing a tiny lineage graph."
|
| 12 |
+
}
|
samples/sql_ddl_sample.sql
ADDED
|
@@ -0,0 +1,269 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
-- Sample SQL DDL with complex lineage relationships
|
| 2 |
+
-- E-commerce Data Warehouse Schema
|
| 3 |
+
|
| 4 |
+
-- ============================================
|
| 5 |
+
-- RAW LAYER - Source tables
|
| 6 |
+
-- ============================================
|
| 7 |
+
|
| 8 |
+
CREATE TABLE raw.customers (
|
| 9 |
+
customer_id INTEGER PRIMARY KEY,
|
| 10 |
+
email VARCHAR(255) NOT NULL,
|
| 11 |
+
first_name VARCHAR(100),
|
| 12 |
+
last_name VARCHAR(100),
|
| 13 |
+
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
|
| 14 |
+
country VARCHAR(50),
|
| 15 |
+
segment VARCHAR(50)
|
| 16 |
+
);
|
| 17 |
+
|
| 18 |
+
CREATE TABLE raw.orders (
|
| 19 |
+
order_id INTEGER PRIMARY KEY,
|
| 20 |
+
customer_id INTEGER REFERENCES raw.customers(customer_id),
|
| 21 |
+
order_date DATE NOT NULL,
|
| 22 |
+
total_amount DECIMAL(10,2),
|
| 23 |
+
currency VARCHAR(3) DEFAULT 'USD',
|
| 24 |
+
status VARCHAR(20),
|
| 25 |
+
shipping_address_id INTEGER
|
| 26 |
+
);
|
| 27 |
+
|
| 28 |
+
CREATE TABLE raw.products (
|
| 29 |
+
product_id INTEGER PRIMARY KEY,
|
| 30 |
+
product_name VARCHAR(255) NOT NULL,
|
| 31 |
+
category VARCHAR(100),
|
| 32 |
+
subcategory VARCHAR(100),
|
| 33 |
+
brand VARCHAR(100),
|
| 34 |
+
price DECIMAL(10,2),
|
| 35 |
+
cost DECIMAL(10,2)
|
| 36 |
+
);
|
| 37 |
+
|
| 38 |
+
CREATE TABLE raw.order_items (
|
| 39 |
+
order_item_id INTEGER PRIMARY KEY,
|
| 40 |
+
order_id INTEGER REFERENCES raw.orders(order_id),
|
| 41 |
+
product_id INTEGER REFERENCES raw.products(product_id),
|
| 42 |
+
quantity INTEGER NOT NULL,
|
| 43 |
+
unit_price DECIMAL(10,2),
|
| 44 |
+
discount_percent DECIMAL(5,2) DEFAULT 0
|
| 45 |
+
);
|
| 46 |
+
|
| 47 |
+
-- ============================================
|
| 48 |
+
-- STAGING LAYER - Cleaned data
|
| 49 |
+
-- ============================================
|
| 50 |
+
|
| 51 |
+
CREATE VIEW staging.stg_customers AS
|
| 52 |
+
SELECT
|
| 53 |
+
customer_id,
|
| 54 |
+
LOWER(TRIM(email)) as email,
|
| 55 |
+
INITCAP(first_name) as first_name,
|
| 56 |
+
INITCAP(last_name) as last_name,
|
| 57 |
+
DATE(created_at) as signup_date,
|
| 58 |
+
UPPER(country) as country,
|
| 59 |
+
COALESCE(segment, 'Unknown') as segment
|
| 60 |
+
FROM raw.customers
|
| 61 |
+
WHERE email IS NOT NULL;
|
| 62 |
+
-- LINEAGE: raw.customers -> staging.stg_customers
|
| 63 |
+
|
| 64 |
+
CREATE VIEW staging.stg_orders AS
|
| 65 |
+
SELECT
|
| 66 |
+
order_id,
|
| 67 |
+
customer_id,
|
| 68 |
+
order_date,
|
| 69 |
+
total_amount,
|
| 70 |
+
currency,
|
| 71 |
+
CASE
|
| 72 |
+
WHEN status IN ('completed', 'shipped', 'delivered') THEN 'Fulfilled'
|
| 73 |
+
WHEN status IN ('pending', 'processing') THEN 'In Progress'
|
| 74 |
+
ELSE 'Other'
|
| 75 |
+
END as order_status
|
| 76 |
+
FROM raw.orders
|
| 77 |
+
WHERE order_date >= '2024-01-01';
|
| 78 |
+
-- LINEAGE: raw.orders -> staging.stg_orders
|
| 79 |
+
|
| 80 |
+
CREATE VIEW staging.stg_products AS
|
| 81 |
+
SELECT
|
| 82 |
+
product_id,
|
| 83 |
+
product_name,
|
| 84 |
+
category,
|
| 85 |
+
subcategory,
|
| 86 |
+
brand,
|
| 87 |
+
price,
|
| 88 |
+
cost,
|
| 89 |
+
(price - cost) / NULLIF(price, 0) * 100 as margin_percent
|
| 90 |
+
FROM raw.products
|
| 91 |
+
WHERE price > 0;
|
| 92 |
+
-- LINEAGE: raw.products -> staging.stg_products
|
| 93 |
+
|
| 94 |
+
CREATE VIEW staging.stg_order_items AS
|
| 95 |
+
SELECT
|
| 96 |
+
order_item_id,
|
| 97 |
+
order_id,
|
| 98 |
+
product_id,
|
| 99 |
+
quantity,
|
| 100 |
+
unit_price,
|
| 101 |
+
discount_percent,
|
| 102 |
+
quantity * unit_price * (1 - discount_percent/100) as line_total
|
| 103 |
+
FROM raw.order_items;
|
| 104 |
+
-- LINEAGE: raw.order_items -> staging.stg_order_items
|
| 105 |
+
|
| 106 |
+
-- ============================================
|
| 107 |
+
-- INTERMEDIATE LAYER - Business logic
|
| 108 |
+
-- ============================================
|
| 109 |
+
|
| 110 |
+
CREATE TABLE intermediate.int_customer_orders AS
|
| 111 |
+
SELECT
|
| 112 |
+
c.customer_id,
|
| 113 |
+
c.email,
|
| 114 |
+
c.first_name,
|
| 115 |
+
c.last_name,
|
| 116 |
+
c.signup_date,
|
| 117 |
+
c.country,
|
| 118 |
+
c.segment,
|
| 119 |
+
COUNT(DISTINCT o.order_id) as total_orders,
|
| 120 |
+
SUM(o.total_amount) as total_spent,
|
| 121 |
+
MIN(o.order_date) as first_order_date,
|
| 122 |
+
MAX(o.order_date) as last_order_date,
|
| 123 |
+
AVG(o.total_amount) as avg_order_value
|
| 124 |
+
FROM staging.stg_customers c
|
| 125 |
+
LEFT JOIN staging.stg_orders o ON c.customer_id = o.customer_id
|
| 126 |
+
GROUP BY c.customer_id, c.email, c.first_name, c.last_name,
|
| 127 |
+
c.signup_date, c.country, c.segment;
|
| 128 |
+
-- LINEAGE: staging.stg_customers, staging.stg_orders -> intermediate.int_customer_orders
|
| 129 |
+
|
| 130 |
+
CREATE TABLE intermediate.int_order_details AS
|
| 131 |
+
SELECT
|
| 132 |
+
o.order_id,
|
| 133 |
+
o.customer_id,
|
| 134 |
+
o.order_date,
|
| 135 |
+
o.order_status,
|
| 136 |
+
oi.product_id,
|
| 137 |
+
p.product_name,
|
| 138 |
+
p.category,
|
| 139 |
+
p.brand,
|
| 140 |
+
oi.quantity,
|
| 141 |
+
oi.unit_price,
|
| 142 |
+
oi.line_total,
|
| 143 |
+
p.margin_percent
|
| 144 |
+
FROM staging.stg_orders o
|
| 145 |
+
JOIN staging.stg_order_items oi ON o.order_id = oi.order_id
|
| 146 |
+
JOIN staging.stg_products p ON oi.product_id = p.product_id;
|
| 147 |
+
-- LINEAGE: staging.stg_orders, staging.stg_order_items, staging.stg_products -> intermediate.int_order_details
|
| 148 |
+
|
| 149 |
+
-- ============================================
|
| 150 |
+
-- MARTS LAYER - Dimensional model
|
| 151 |
+
-- ============================================
|
| 152 |
+
|
| 153 |
+
CREATE TABLE marts.dim_customers AS
|
| 154 |
+
SELECT
|
| 155 |
+
customer_id,
|
| 156 |
+
email,
|
| 157 |
+
first_name || ' ' || last_name as full_name,
|
| 158 |
+
signup_date,
|
| 159 |
+
country,
|
| 160 |
+
segment,
|
| 161 |
+
total_orders,
|
| 162 |
+
total_spent,
|
| 163 |
+
first_order_date,
|
| 164 |
+
last_order_date,
|
| 165 |
+
avg_order_value,
|
| 166 |
+
CASE
|
| 167 |
+
WHEN total_spent > 10000 THEN 'Platinum'
|
| 168 |
+
WHEN total_spent > 5000 THEN 'Gold'
|
| 169 |
+
WHEN total_spent > 1000 THEN 'Silver'
|
| 170 |
+
ELSE 'Bronze'
|
| 171 |
+
END as customer_tier,
|
| 172 |
+
DATEDIFF(day, signup_date, first_order_date) as days_to_first_order
|
| 173 |
+
FROM intermediate.int_customer_orders;
|
| 174 |
+
-- LINEAGE: intermediate.int_customer_orders -> marts.dim_customers
|
| 175 |
+
|
| 176 |
+
CREATE TABLE marts.dim_products AS
|
| 177 |
+
SELECT
|
| 178 |
+
product_id,
|
| 179 |
+
product_name,
|
| 180 |
+
category,
|
| 181 |
+
subcategory,
|
| 182 |
+
brand,
|
| 183 |
+
price,
|
| 184 |
+
cost,
|
| 185 |
+
margin_percent,
|
| 186 |
+
CASE
|
| 187 |
+
WHEN margin_percent > 50 THEN 'High Margin'
|
| 188 |
+
WHEN margin_percent > 25 THEN 'Medium Margin'
|
| 189 |
+
ELSE 'Low Margin'
|
| 190 |
+
END as margin_tier
|
| 191 |
+
FROM staging.stg_products;
|
| 192 |
+
-- LINEAGE: staging.stg_products -> marts.dim_products
|
| 193 |
+
|
| 194 |
+
CREATE TABLE marts.fct_orders AS
|
| 195 |
+
SELECT
|
| 196 |
+
od.order_id,
|
| 197 |
+
od.customer_id,
|
| 198 |
+
od.product_id,
|
| 199 |
+
od.order_date,
|
| 200 |
+
od.order_status,
|
| 201 |
+
od.quantity,
|
| 202 |
+
od.unit_price,
|
| 203 |
+
od.line_total,
|
| 204 |
+
od.margin_percent,
|
| 205 |
+
dc.customer_tier,
|
| 206 |
+
dp.margin_tier,
|
| 207 |
+
dp.category as product_category
|
| 208 |
+
FROM intermediate.int_order_details od
|
| 209 |
+
JOIN marts.dim_customers dc ON od.customer_id = dc.customer_id
|
| 210 |
+
JOIN marts.dim_products dp ON od.product_id = dp.product_id;
|
| 211 |
+
-- LINEAGE: intermediate.int_order_details, marts.dim_customers, marts.dim_products -> marts.fct_orders
|
| 212 |
+
|
| 213 |
+
-- ============================================
|
| 214 |
+
-- REPORTING LAYER - Analytics views
|
| 215 |
+
-- ============================================
|
| 216 |
+
|
| 217 |
+
CREATE VIEW reporting.rpt_daily_sales AS
|
| 218 |
+
SELECT
|
| 219 |
+
order_date,
|
| 220 |
+
product_category,
|
| 221 |
+
COUNT(DISTINCT order_id) as num_orders,
|
| 222 |
+
SUM(quantity) as units_sold,
|
| 223 |
+
SUM(line_total) as gross_revenue,
|
| 224 |
+
AVG(line_total) as avg_order_value
|
| 225 |
+
FROM marts.fct_orders
|
| 226 |
+
GROUP BY order_date, product_category;
|
| 227 |
+
-- LINEAGE: marts.fct_orders -> reporting.rpt_daily_sales
|
| 228 |
+
|
| 229 |
+
CREATE VIEW reporting.rpt_customer_ltv AS
|
| 230 |
+
SELECT
|
| 231 |
+
customer_id,
|
| 232 |
+
full_name,
|
| 233 |
+
customer_tier,
|
| 234 |
+
country,
|
| 235 |
+
total_orders,
|
| 236 |
+
total_spent as lifetime_value,
|
| 237 |
+
avg_order_value,
|
| 238 |
+
days_to_first_order,
|
| 239 |
+
DATEDIFF(day, first_order_date, last_order_date) as customer_lifespan_days,
|
| 240 |
+
total_spent / NULLIF(DATEDIFF(month, first_order_date, last_order_date), 0) as monthly_value
|
| 241 |
+
FROM marts.dim_customers
|
| 242 |
+
WHERE total_orders > 0;
|
| 243 |
+
-- LINEAGE: marts.dim_customers -> reporting.rpt_customer_ltv
|
| 244 |
+
|
| 245 |
+
CREATE VIEW reporting.rpt_product_performance AS
|
| 246 |
+
SELECT
|
| 247 |
+
dp.product_id,
|
| 248 |
+
dp.product_name,
|
| 249 |
+
dp.category,
|
| 250 |
+
dp.brand,
|
| 251 |
+
dp.margin_tier,
|
| 252 |
+
COUNT(DISTINCT fo.order_id) as times_ordered,
|
| 253 |
+
SUM(fo.quantity) as total_units_sold,
|
| 254 |
+
SUM(fo.line_total) as total_revenue,
|
| 255 |
+
AVG(fo.margin_percent) as avg_margin
|
| 256 |
+
FROM marts.dim_products dp
|
| 257 |
+
LEFT JOIN marts.fct_orders fo ON dp.product_id = fo.product_id
|
| 258 |
+
GROUP BY dp.product_id, dp.product_name, dp.category, dp.brand, dp.margin_tier;
|
| 259 |
+
-- LINEAGE: marts.dim_products, marts.fct_orders -> reporting.rpt_product_performance
|
| 260 |
+
|
| 261 |
+
-- ============================================
|
| 262 |
+
-- SUMMARY: Lineage Flow
|
| 263 |
+
-- ============================================
|
| 264 |
+
-- raw.customers -> staging.stg_customers -> intermediate.int_customer_orders -> marts.dim_customers -> reporting.rpt_customer_ltv
|
| 265 |
+
-- raw.orders -> staging.stg_orders -> intermediate.int_customer_orders
|
| 266 |
+
-- raw.orders -> staging.stg_orders -> intermediate.int_order_details -> marts.fct_orders -> reporting.rpt_daily_sales
|
| 267 |
+
-- raw.products -> staging.stg_products -> intermediate.int_order_details
|
| 268 |
+
-- raw.products -> staging.stg_products -> marts.dim_products -> marts.fct_orders
|
| 269 |
+
-- raw.order_items -> staging.stg_order_items -> intermediate.int_order_details
|
samples/warehouse_lineage_sample.json
ADDED
|
@@ -0,0 +1,216 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
{
|
| 2 |
+
"warehouse": {
|
| 3 |
+
"platform": "Snowflake",
|
| 4 |
+
"account": "xy12345.us-east-1",
|
| 5 |
+
"database": "ANALYTICS_DW"
|
| 6 |
+
},
|
| 7 |
+
"lineage": {
|
| 8 |
+
"datasets": [
|
| 9 |
+
{
|
| 10 |
+
"id": "raw.customers",
|
| 11 |
+
"type": "table",
|
| 12 |
+
"database": "ANALYTICS_DW",
|
| 13 |
+
"schema": "RAW",
|
| 14 |
+
"name": "CUSTOMERS",
|
| 15 |
+
"description": "Raw customer data from CRM",
|
| 16 |
+
"columns": [
|
| 17 |
+
{"name": "CUSTOMER_ID", "type": "NUMBER", "isPrimaryKey": true},
|
| 18 |
+
{"name": "EMAIL", "type": "VARCHAR", "pii": true},
|
| 19 |
+
{"name": "NAME", "type": "VARCHAR"},
|
| 20 |
+
{"name": "CREATED_AT", "type": "TIMESTAMP_NTZ"},
|
| 21 |
+
{"name": "SOURCE_SYSTEM", "type": "VARCHAR"}
|
| 22 |
+
],
|
| 23 |
+
"tags": ["pii", "raw"],
|
| 24 |
+
"owner": "data-platform-team"
|
| 25 |
+
},
|
| 26 |
+
{
|
| 27 |
+
"id": "raw.transactions",
|
| 28 |
+
"type": "table",
|
| 29 |
+
"database": "ANALYTICS_DW",
|
| 30 |
+
"schema": "RAW",
|
| 31 |
+
"name": "TRANSACTIONS",
|
| 32 |
+
"description": "Raw transaction events from payment gateway",
|
| 33 |
+
"columns": [
|
| 34 |
+
{"name": "TRANSACTION_ID", "type": "VARCHAR", "isPrimaryKey": true},
|
| 35 |
+
{"name": "CUSTOMER_ID", "type": "NUMBER", "isForeignKey": true, "references": "raw.customers.CUSTOMER_ID"},
|
| 36 |
+
{"name": "AMOUNT", "type": "NUMBER"},
|
| 37 |
+
{"name": "CURRENCY", "type": "VARCHAR"},
|
| 38 |
+
{"name": "TRANSACTION_DATE", "type": "DATE"},
|
| 39 |
+
{"name": "STATUS", "type": "VARCHAR"}
|
| 40 |
+
],
|
| 41 |
+
"tags": ["financial", "raw"],
|
| 42 |
+
"owner": "data-platform-team"
|
| 43 |
+
},
|
| 44 |
+
{
|
| 45 |
+
"id": "raw.products",
|
| 46 |
+
"type": "table",
|
| 47 |
+
"database": "ANALYTICS_DW",
|
| 48 |
+
"schema": "RAW",
|
| 49 |
+
"name": "PRODUCTS",
|
| 50 |
+
"description": "Product catalog from inventory system"
|
| 51 |
+
},
|
| 52 |
+
{
|
| 53 |
+
"id": "staging.customers_cleaned",
|
| 54 |
+
"type": "view",
|
| 55 |
+
"database": "ANALYTICS_DW",
|
| 56 |
+
"schema": "STAGING",
|
| 57 |
+
"name": "CUSTOMERS_CLEANED",
|
| 58 |
+
"description": "Deduplicated and cleaned customer records",
|
| 59 |
+
"transformation": "DEDUP + CLEAN + VALIDATE",
|
| 60 |
+
"owner": "analytics-engineering"
|
| 61 |
+
},
|
| 62 |
+
{
|
| 63 |
+
"id": "staging.transactions_enriched",
|
| 64 |
+
"type": "view",
|
| 65 |
+
"database": "ANALYTICS_DW",
|
| 66 |
+
"schema": "STAGING",
|
| 67 |
+
"name": "TRANSACTIONS_ENRICHED",
|
| 68 |
+
"description": "Transactions with currency conversion and status mapping",
|
| 69 |
+
"transformation": "ENRICH + CONVERT + MAP"
|
| 70 |
+
},
|
| 71 |
+
{
|
| 72 |
+
"id": "marts.dim_customer",
|
| 73 |
+
"type": "table",
|
| 74 |
+
"database": "ANALYTICS_DW",
|
| 75 |
+
"schema": "MARTS",
|
| 76 |
+
"name": "DIM_CUSTOMER",
|
| 77 |
+
"description": "Customer dimension with SCD Type 2",
|
| 78 |
+
"transformation": "SCD_TYPE_2 + AGGREGATE"
|
| 79 |
+
},
|
| 80 |
+
{
|
| 81 |
+
"id": "marts.fct_transaction",
|
| 82 |
+
"type": "table",
|
| 83 |
+
"database": "ANALYTICS_DW",
|
| 84 |
+
"schema": "MARTS",
|
| 85 |
+
"name": "FCT_TRANSACTION",
|
| 86 |
+
"description": "Transaction fact table with dimensions"
|
| 87 |
+
},
|
| 88 |
+
{
|
| 89 |
+
"id": "reporting.customer_360",
|
| 90 |
+
"type": "view",
|
| 91 |
+
"database": "ANALYTICS_DW",
|
| 92 |
+
"schema": "REPORTING",
|
| 93 |
+
"name": "CUSTOMER_360",
|
| 94 |
+
"description": "Complete customer view for BI tools"
|
| 95 |
+
},
|
| 96 |
+
{
|
| 97 |
+
"id": "reporting.revenue_dashboard",
|
| 98 |
+
"type": "materialized_view",
|
| 99 |
+
"database": "ANALYTICS_DW",
|
| 100 |
+
"schema": "REPORTING",
|
| 101 |
+
"name": "REVENUE_DASHBOARD",
|
| 102 |
+
"description": "Aggregated revenue metrics for executive dashboard",
|
| 103 |
+
"refresh_schedule": "DAILY at 06:00 UTC"
|
| 104 |
+
},
|
| 105 |
+
{
|
| 106 |
+
"id": "external.crm_export",
|
| 107 |
+
"type": "external_table",
|
| 108 |
+
"location": "s3://company-exports/crm/",
|
| 109 |
+
"description": "CRM data export to S3"
|
| 110 |
+
},
|
| 111 |
+
{
|
| 112 |
+
"id": "external.bi_semantic_layer",
|
| 113 |
+
"type": "semantic_model",
|
| 114 |
+
"platform": "Looker",
|
| 115 |
+
"description": "Looker semantic model for self-service analytics"
|
| 116 |
+
}
|
| 117 |
+
],
|
| 118 |
+
"relationships": [
|
| 119 |
+
{
|
| 120 |
+
"source": "raw.customers",
|
| 121 |
+
"target": "staging.customers_cleaned",
|
| 122 |
+
"type": "transform",
|
| 123 |
+
"job": "dbt_staging_customers",
|
| 124 |
+
"schedule": "hourly"
|
| 125 |
+
},
|
| 126 |
+
{
|
| 127 |
+
"source": "raw.transactions",
|
| 128 |
+
"target": "staging.transactions_enriched",
|
| 129 |
+
"type": "transform",
|
| 130 |
+
"job": "dbt_staging_transactions"
|
| 131 |
+
},
|
| 132 |
+
{
|
| 133 |
+
"source": "staging.customers_cleaned",
|
| 134 |
+
"target": "marts.dim_customer",
|
| 135 |
+
"type": "transform",
|
| 136 |
+
"job": "dbt_marts_dim_customer"
|
| 137 |
+
},
|
| 138 |
+
{
|
| 139 |
+
"source": "staging.transactions_enriched",
|
| 140 |
+
"target": "marts.fct_transaction",
|
| 141 |
+
"type": "transform"
|
| 142 |
+
},
|
| 143 |
+
{
|
| 144 |
+
"source": "raw.products",
|
| 145 |
+
"target": "marts.fct_transaction",
|
| 146 |
+
"type": "reference"
|
| 147 |
+
},
|
| 148 |
+
{
|
| 149 |
+
"source": "marts.dim_customer",
|
| 150 |
+
"target": "marts.fct_transaction",
|
| 151 |
+
"type": "reference"
|
| 152 |
+
},
|
| 153 |
+
{
|
| 154 |
+
"source": "marts.dim_customer",
|
| 155 |
+
"target": "reporting.customer_360",
|
| 156 |
+
"type": "transform"
|
| 157 |
+
},
|
| 158 |
+
{
|
| 159 |
+
"source": "marts.fct_transaction",
|
| 160 |
+
"target": "reporting.customer_360",
|
| 161 |
+
"type": "transform"
|
| 162 |
+
},
|
| 163 |
+
{
|
| 164 |
+
"source": "marts.fct_transaction",
|
| 165 |
+
"target": "reporting.revenue_dashboard",
|
| 166 |
+
"type": "aggregate"
|
| 167 |
+
},
|
| 168 |
+
{
|
| 169 |
+
"source": "marts.dim_customer",
|
| 170 |
+
"target": "reporting.revenue_dashboard",
|
| 171 |
+
"type": "reference"
|
| 172 |
+
},
|
| 173 |
+
{
|
| 174 |
+
"source": "reporting.customer_360",
|
| 175 |
+
"target": "external.crm_export",
|
| 176 |
+
"type": "export",
|
| 177 |
+
"job": "airflow_crm_sync"
|
| 178 |
+
},
|
| 179 |
+
{
|
| 180 |
+
"source": "reporting.revenue_dashboard",
|
| 181 |
+
"target": "external.bi_semantic_layer",
|
| 182 |
+
"type": "publish",
|
| 183 |
+
"job": "looker_sync"
|
| 184 |
+
}
|
| 185 |
+
],
|
| 186 |
+
"jobs": [
|
| 187 |
+
{
|
| 188 |
+
"id": "dbt_staging_customers",
|
| 189 |
+
"type": "dbt",
|
| 190 |
+
"schedule": "0 * * * *",
|
| 191 |
+
"description": "Hourly customer staging refresh"
|
| 192 |
+
},
|
| 193 |
+
{
|
| 194 |
+
"id": "dbt_staging_transactions",
|
| 195 |
+
"type": "dbt",
|
| 196 |
+
"schedule": "0 * * * *"
|
| 197 |
+
},
|
| 198 |
+
{
|
| 199 |
+
"id": "dbt_marts_dim_customer",
|
| 200 |
+
"type": "dbt",
|
| 201 |
+
"schedule": "0 2 * * *"
|
| 202 |
+
},
|
| 203 |
+
{
|
| 204 |
+
"id": "airflow_crm_sync",
|
| 205 |
+
"type": "airflow",
|
| 206 |
+
"schedule": "0 6 * * *"
|
| 207 |
+
},
|
| 208 |
+
{
|
| 209 |
+
"id": "looker_sync",
|
| 210 |
+
"type": "api",
|
| 211 |
+
"schedule": "0 7 * * *"
|
| 212 |
+
}
|
| 213 |
+
]
|
| 214 |
+
},
|
| 215 |
+
"notes": "Sample Snowflake data warehouse lineage with multi-layer architecture (raw, staging, marts, reporting) and external system integrations."
|
| 216 |
+
}
|
tests/test_app.py
CHANGED
|
@@ -17,21 +17,103 @@ class TestLineageExtractors(unittest.TestCase):
|
|
| 17 |
self.assertIn('mermaid.init', html)
|
| 18 |
|
| 19 |
def test_extract_lineage_from_text_returns_html_and_summary(self):
|
| 20 |
-
|
|
|
|
|
|
|
| 21 |
self.assertIsInstance(html, str)
|
| 22 |
self.assertIsInstance(summary, str)
|
| 23 |
self.assertIn('<div class="mermaid">', html)
|
| 24 |
-
self.assertIn('
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 25 |
|
| 26 |
def test_extract_lineage_from_bigquery_returns_html_and_summary(self):
|
| 27 |
html, summary = extract_lineage_from_bigquery("proj", "SELECT 1", "key", "Mermaid")
|
| 28 |
self.assertIn('<div class="mermaid">', html)
|
| 29 |
-
self.assertIn('
|
| 30 |
|
| 31 |
def test_extract_lineage_from_url_returns_html_and_summary(self):
|
| 32 |
html, summary = extract_lineage_from_url("https://example.com", "Mermaid")
|
| 33 |
self.assertIn('<div class="mermaid">', html)
|
| 34 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 35 |
|
| 36 |
|
| 37 |
if __name__ == '__main__':
|
|
|
|
| 17 |
self.assertIn('mermaid.init', html)
|
| 18 |
|
| 19 |
def test_extract_lineage_from_text_returns_html_and_summary(self):
|
| 20 |
+
# Test with valid JSON input
|
| 21 |
+
sample_json = '{"nodes": [{"id": "a", "name": "A"}], "edges": []}'
|
| 22 |
+
html, summary = extract_lineage_from_text(sample_json, "Custom JSON", "Mermaid")
|
| 23 |
self.assertIsInstance(html, str)
|
| 24 |
self.assertIsInstance(summary, str)
|
| 25 |
self.assertIn('<div class="mermaid">', html)
|
| 26 |
+
self.assertIn('Parsed', summary)
|
| 27 |
+
|
| 28 |
+
def test_extract_lineage_from_text_empty_input(self):
|
| 29 |
+
# Test with empty input
|
| 30 |
+
html, summary = extract_lineage_from_text("", "dbt Manifest", "Mermaid")
|
| 31 |
+
self.assertIsInstance(html, str)
|
| 32 |
+
self.assertIsInstance(summary, str)
|
| 33 |
+
self.assertIn('provide metadata', summary.lower())
|
| 34 |
|
| 35 |
def test_extract_lineage_from_bigquery_returns_html_and_summary(self):
|
| 36 |
html, summary = extract_lineage_from_bigquery("proj", "SELECT 1", "key", "Mermaid")
|
| 37 |
self.assertIn('<div class="mermaid">', html)
|
| 38 |
+
self.assertIn('BigQuery', summary)
|
| 39 |
|
| 40 |
def test_extract_lineage_from_url_returns_html_and_summary(self):
|
| 41 |
html, summary = extract_lineage_from_url("https://example.com", "Mermaid")
|
| 42 |
self.assertIn('<div class="mermaid">', html)
|
| 43 |
+
# Summary can be either 'Lineage' or 'Parsed' depending on response
|
| 44 |
+
self.assertTrue('Lineage' in summary or 'Parsed' in summary)
|
| 45 |
+
|
| 46 |
+
|
| 47 |
+
class TestExporters(unittest.TestCase):
|
| 48 |
+
def test_openlineage_export(self):
|
| 49 |
+
from exporters import LineageGraph, LineageNode, LineageEdge, OpenLineageExporter
|
| 50 |
+
|
| 51 |
+
graph = LineageGraph(name="test")
|
| 52 |
+
graph.add_node(LineageNode(id="a", name="Node A", type="table"))
|
| 53 |
+
graph.add_node(LineageNode(id="b", name="Node B", type="table"))
|
| 54 |
+
graph.add_edge(LineageEdge(source="a", target="b", type="transform"))
|
| 55 |
+
|
| 56 |
+
exporter = OpenLineageExporter(graph)
|
| 57 |
+
output = exporter.export()
|
| 58 |
+
|
| 59 |
+
self.assertIn("openlineage", output.lower())
|
| 60 |
+
self.assertIn("Node A", output)
|
| 61 |
+
|
| 62 |
+
def test_collibra_export(self):
|
| 63 |
+
from exporters import LineageGraph, LineageNode, LineageEdge, CollibraExporter
|
| 64 |
+
|
| 65 |
+
graph = LineageGraph(name="test")
|
| 66 |
+
graph.add_node(LineageNode(id="a", name="Node A", type="table"))
|
| 67 |
+
|
| 68 |
+
exporter = CollibraExporter(graph)
|
| 69 |
+
output = exporter.export()
|
| 70 |
+
|
| 71 |
+
self.assertIn("Collibra", output)
|
| 72 |
+
self.assertIn("Node A", output)
|
| 73 |
+
|
| 74 |
+
def test_purview_export(self):
|
| 75 |
+
from exporters import LineageGraph, LineageNode, LineageEdge, PurviewExporter
|
| 76 |
+
|
| 77 |
+
graph = LineageGraph(name="test")
|
| 78 |
+
graph.add_node(LineageNode(id="a", name="Node A", type="table"))
|
| 79 |
+
|
| 80 |
+
exporter = PurviewExporter(graph)
|
| 81 |
+
output = exporter.export()
|
| 82 |
+
|
| 83 |
+
self.assertIn("Purview", output)
|
| 84 |
+
self.assertIn("Node A", output)
|
| 85 |
+
|
| 86 |
+
def test_alation_export(self):
|
| 87 |
+
from exporters import LineageGraph, LineageNode, LineageEdge, AlationExporter
|
| 88 |
+
|
| 89 |
+
graph = LineageGraph(name="test")
|
| 90 |
+
graph.add_node(LineageNode(id="a", name="Node A", type="table"))
|
| 91 |
+
|
| 92 |
+
exporter = AlationExporter(graph)
|
| 93 |
+
output = exporter.export()
|
| 94 |
+
|
| 95 |
+
self.assertIn("Alation", output)
|
| 96 |
+
self.assertIn("Node A", output)
|
| 97 |
+
|
| 98 |
+
|
| 99 |
+
class TestSampleDataLoading(unittest.TestCase):
|
| 100 |
+
def test_load_sample_simple(self):
|
| 101 |
+
from app import load_sample
|
| 102 |
+
content = load_sample("simple")
|
| 103 |
+
self.assertIn("nodes", content)
|
| 104 |
+
self.assertIn("edges", content)
|
| 105 |
+
|
| 106 |
+
def test_load_sample_dbt(self):
|
| 107 |
+
from app import load_sample
|
| 108 |
+
content = load_sample("dbt")
|
| 109 |
+
self.assertIn("metadata", content)
|
| 110 |
+
self.assertIn("nodes", content)
|
| 111 |
+
|
| 112 |
+
def test_load_sample_airflow(self):
|
| 113 |
+
from app import load_sample
|
| 114 |
+
content = load_sample("airflow")
|
| 115 |
+
self.assertIn("dag_id", content)
|
| 116 |
+
self.assertIn("tasks", content)
|
| 117 |
|
| 118 |
|
| 119 |
if __name__ == '__main__':
|