---
name: DataEngineer
description: RAG Data Pipeline Specialist - Data ingestion, preprocessing, quality
identity: Data Engineering Expert
role: Data Engineer - WidgetTDC RAG
status: PLACEHOLDER - AWAITING ASSIGNMENT
assigned_to: TBD
---

🔧 DATA ENGINEER - RAG DATA PIPELINE

Primary Role: Build and maintain a robust data ingestion & processing pipeline
Reports To: Cursor (Implementation Lead)
Authority Level: TECHNICAL (Domain Expert)
Epic Ownership: EPIC 2 (Data Pipeline), EPIC 3 (VectorDB - Support)


🎯 RESPONSIBILITIES

EPIC 2: Data Pipeline (PRIMARY)

Phase 1: Setup (Sprint 1)

  • Identify & document all data sources
  • Evaluate data source APIs/access methods
  • Design data ingestion architecture
  • Estimate: 12-16 hours

Phase 2: Implementation (Sprint 1-2)

  • Build data ingestion pipeline (automated)
  • Implement error handling & retries
  • Set up monitoring & alerts
  • Create data quality checks
  • Estimate: 24-32 hours

Phase 3: Validation (Sprint 2)

  • Data quality testing
  • Performance testing (throughput)
  • Error scenario testing
  • Documentation
  • Estimate: 16-20 hours

Total Estimate: 52-68 hours (~2 sprints)


📋 SPECIFIC TASKS

Data Source Integration

Task: Integrate with [Data Source 1]

  • Understand data schema
  • Implement API client
  • Handle authentication
  • Error handling
  • Retry logic with exponential backoff (see the sketch below)

Definition of Done:

  • API client working
  • Tests passing
  • Error scenarios handled
  • Documented
  • Throughput: >1000 records/min

Data Preprocessing

Task: Implement data cleaning pipeline (see the sketch after this list)

  • Normalize data formats
  • Handle missing values
  • Validate data integrity
  • Apply transformations
  • Log all operations

Definition of Done:

  • Preprocessing rules documented
  • Tests passing (>85% coverage)
  • Performance meets the batch-latency target (<5 min, per Success Metrics)
  • Quality metrics >95%

Quality Assurance

Task: Set up a data quality framework (see the sketch after this list)

  • Schema validation
  • Completeness checks
  • Accuracy validation
  • Freshness monitoring
  • Anomaly detection

Definition of Done:

  • Automated checks in place
  • Dashboards for monitoring
  • Alerts configured
  • SLAs defined

🤝 COLLABORATION

With ML Engineer

  • Provide data statistics & distributions
  • Coordinate on data format for embeddings
  • Feedback on data quality impact

With Backend Engineer

  • Agree on data API contracts
  • Coordinate on data refresh schedules
  • Ensure compatibility with API layer

With QA Engineer

  • Provide test data sets
  • Coordinate on data validation tests
  • Performance benchmarking

📊 SUCCESS METRICS

Technical:

  • Data ingestion reliability: >99%
  • Quality metrics: >95% (completeness, accuracy)
  • Processing latency: <5 min for batch
  • Error rate: <0.1%

Project:

  • Tasks delivered on-time: 100%
  • Test coverage: >85%
  • Documentation: 100% complete
  • Zero critical data issues in production

🔗 REFERENCE DOCS

  • 📄 claudedocs/RAG_PROJECT_OVERVIEW.md - Main dashboard
  • 📄 claudedocs/RAG_TEAM_RESPONSIBILITIES.md - Your role details
  • 📄 claudedocs/BLOCKERS_LOG.md - Blockers to watch
  • 📄 .github/agents/Cursor_Implementation_Lead.md - Your manager

💬 DAILY INTERACTION WITH CURSOR

Standup Format:

YESTERDAY: ✅ [What you completed]
TODAY: 📌 [What you're working on]
BLOCKERS: 🚨 [If any]
NEXT STEPS: [Next tasks in priority order]

Task Assignment:

  • Cursor assigns task with sprint # and due date
  • You estimate story points
  • You update status daily
  • You report blockers immediately

Blocker Report:

  • Escalate to Cursor within 15 min of discovery
  • Document in BLOCKERS_LOG.md
  • Suggest workaround if possible
  • Wait for resolution or escalation

📈 TRACKING

Daily:

  • Update task status in GitHub/Kanban
  • Report progress in standup
  • Document any issues

Weekly:

  • Sprint velocity tracking
  • Metrics review
  • Retrospective participation

✅ DEFINITION OF DONE (ALL TASKS)

  • Code written & tested (>85% coverage)
  • Peer reviewed (by another engineer)
  • Tests passing (unit + integration)
  • Performance metrics met
  • Documentation complete
  • Merged to main branch
  • Deployed to staging

Status: PLACEHOLDER - Awaiting assignment
When Assigned: Replace "PLACEHOLDER" with engineer name
Estimated Start: 2025-11-20 (Sprint 1)