Spaces:

byoung-hf
/

ben-bot

Running

App Files Files Community

byoung-hf commited on 23 days ago

Commit

6913322

verified ·

1 Parent(s): 6b2b3f9

Upload folder using huggingface_hub

Browse files

Files changed (10) hide show

.specify/memory/constitution.md +15 -1
CONTRIBUTING.md +139 -8
README.md +19 -66
TESTING.md +0 -1
architecture/adrs/adr-linkedin-integration.md +42 -0
specs/002-linkedin-profile-extractor/CLARIFICATIONS.md +158 -0
specs/002-linkedin-profile-extractor/INDEX.md +200 -0
specs/002-linkedin-profile-extractor/INTEGRATION_GUIDE.md +379 -0
specs/002-linkedin-profile-extractor/README.md +282 -0
specs/002-linkedin-profile-extractor/spec.md +408 -0

.specify/memory/constitution.md CHANGED Viewed

@@ -67,6 +67,10 @@ All agent responses normalized for clean, consistent output across platforms.
 - Output cleaned before returning to user
 - Output links should work
 ## Technology Stack Constraints
 - **Python**: 3.12+ only (via `requires-python = "~=3.12.0"`)
@@ -125,6 +129,14 @@ All agent responses normalized for clean, consistent output across platforms.
 7. **No credential leaks** - .gitignore and .dockerignore files to help prevent secret slips. Never build secrets into a dockerfile!
 8. **No notebook outputs in GIT** - you must clean up the code
 ## Governance
 This constitution supersedes all other practices and is the single source of truth for architectural decisions. All PRs and feature requests must verify compliance with these principles. Code review must check:
@@ -135,5 +147,7 @@ This constitution supersedes all other practices and is the single source of tru
 - Imports organized per PEP 8
 - Observability (logging) present
 - Output cleanliness (Unicode normalization)
-**Version**: 1.0.0 | **Ratified**: 2025-10-23 | **Last Amended**: 2025-10-23

 - Output cleaned before returning to user
 - Output links should work
+### X. External Data Integration Policy
+For external services that do not provide a sanctioned public API (for example: LinkedIn), AI‑Me will perform data ingestion only via a human-in-the-loop browser automation process that requires interactive user authentication. Extracted content must be limited to publicly-visible information, reviewed by the human operator for accuracy and privacy before ingestion, and must never be collected via third-party services that require users to share credentials or that perform scraping on a user's behalf.
 ## Technology Stack Constraints
 - **Python**: 3.12+ only (via `requires-python = "~=3.12.0"`)
 7. **No credential leaks** - .gitignore and .dockerignore files to help prevent secret slips. Never build secrets into a dockerfile!
 8. **No notebook outputs in GIT** - you must clean up the code
+## Architectural Decision Records (ADRs)
+All major architectural decisions are documented in `architecture/adrs/`. ADRs provide detailed context, tradeoffs, and compliance notes that elaborate on constitution principles:
+- **ADR-001**: Human-in-the-loop Browser Automation for Third-Party Data Ingestion — Instantiates Principle X (External Data Integration Policy)
+Reference ADRs when evaluating PRs, designing new integrations, or proposing architecture changes.
 ## Governance
 This constitution supersedes all other practices and is the single source of truth for architectural decisions. All PRs and feature requests must verify compliance with these principles. Code review must check:
 - Imports organized per PEP 8
 - Observability (logging) present
 - Output cleanliness (Unicode normalization)
+- External data integration policy adherence
+- Architectural decisions documented in `architecture/adrs/`
+**Version**: 1.0.1 | **Ratified**: 2025-10-23 | **Last Amended**: 2025-10-24

CONTRIBUTING.md CHANGED Viewed

@@ -4,18 +4,32 @@ Welcome! This document outlines the process for contributing to the AI-Me projec
 ## Prerequisites
-- Python 3.12+ (managed by `uv`)
-- Git with GPG signing configured
-- Basic understanding of async Python and RAG concepts (see `.specify/memory/constitution.md`)
 ## Setup
 ### 1. Clone and Install Dependencies
 ```bash
-git clone https://github.com/byoung/ai-me.git
 cd ai-me
 uv sync
 ```
 ### 2. Environment Configuration
@@ -41,11 +55,128 @@ LOKI_USERNAME=...
 LOKI_PASSWORD=...
 ```
-### 3. Configure Git Commit Signing
-See this guide on setting up gpg keys:
-https://docs.github.com/en/authentication/managing-commit-signature-verification/generating-a-new-gpg-key
-**All commits MUST be GPG-signed.**

 ## Prerequisites
+If you want to propose changes to ai-me, please search our [issues](https://github.com/byoung/ai-me/issues) list first. If there is none, please create a new issue and label it as a bug or enhancement. Before you get started, let's have a conversation about the proposal!
+This project is transitioning to [Spec Kit](https://github.com/github/spec-kit), so any new features must first start with a `/speckit.specify` in order to establish our user stories and requirements consistently.
+To develop on this project, you will need to:
+- Set up [Docker](https://docs.docker.com/engine/install/) for container based development
+- Set up [uv](https://docs.astral.sh/uv/getting-started/installation/) for local development
+- Create a fork of [ai-me](https://github.com/byoung/ai-me)
+- Configure git with [GPG signing configured](https://docs.github.com/en/authentication/managing-commit-signature-verification/generating-a-new-gpg-key)
+- Do a full review of our [constitution](/.specify/memory/constitution.md)
+- Set up a [pre-commit hook](#setting-up-the-pre-commit-hook) to clear notebook output (unless you have the discipline to manually do it before opening PRs -- I (@byoung) do not...)
 ## Setup
 ### 1. Clone and Install Dependencies
 ```bash
+git clone https://github.com/<your fork>
 cd ai-me
+# Local dev
 uv sync
+# Container dev
+docker compose build notebooks
 ```
 ### 2. Environment Configuration
 LOKI_PASSWORD=...
 ```
+### 3. Start the application
+```bash
+# Local
+uv run src/app.py  # Launches Gradio on default port
+# Docker
+docker compose up notebooks
+```
+### 4. Make changes
+You don't have to use Spec Kit to plan and implement your specs, BUT you MUST create traceability between your spec, implementation, and tests per our [constitution](/.specify/memory/constitution.md)!
+### 5. Test
+For detailed information on testing check out our [TESTING.md](/TESTING.md) guide.
+### 6. Deploy
+To deploy to HF, check out this [section](/README.md#deployment) in our README. This test ensures that your system is deployable and usable in HF.
+### 7. Open a PR
+Be sure to give a brief overview of the change and link it to the issue it's resolving.
+## Setting Up the Pre-Commit Hook
+A Git pre-commit hook automatically clears all Jupyter notebook outputs before committing. This keeps the repository clean and reduces diff noise by preventing output changes from cluttering commits.
+### Installation
+#### Option 1: Automated Installation (Recommended)
+After cloning the repository, run:
+```bash
+cd ai-me
+cp .git/hooks/pre-commit.sample .git/hooks/pre-commit
+chmod +x .git/hooks/pre-commit
+```
+Then create the hook script:
+```bash
+cat > .git/hooks/pre-commit << 'EOF'
+#!/bin/bash
+# Pre-commit hook to clear Jupyter notebook outputs
+# Find all staged .ipynb files
+notebooks=$(git diff --cached --name-only --diff-filter=ACM | grep '\.ipynb$')
+if [ -n "$notebooks" ]; then
+    echo "Clearing outputs from notebooks..."
+    for notebook in $notebooks; do
+        if [ -f "$notebook" ]; then
+            echo "  Processing: $notebook"
+            # Clear outputs using Python directly (no jupyter dependency needed)
+            python3 -c "
+import json
+notebook_path = '$notebook'
+# Read the notebook
+with open(notebook_path, 'r', encoding='utf-8') as f:
+    nb = json.load(f)
+# Clear outputs from all cells
+for cell in nb.get('cells', []):
+    if cell['cell_type'] == 'code':
+        cell['outputs'] = []
+        cell['execution_count'] = None
+# Write back the cleaned notebook
+with open(notebook_path, 'w', encoding='utf-8') as f:
+    json.dump(nb, f, indent=1, ensure_ascii=False)
+    f.write('\n')
+"
+            # Re-stage the cleaned notebook
+            git add "$notebook"
+        fi
+    done
+    echo "✓ Notebook outputs cleared"
+fi
+exit 0
+EOF
+chmod +x .git/hooks/pre-commit
+```
+#### Option 2: Manual Installation
+1. Navigate to your git hooks directory:
+   ```bash
+   cd .git/hooks
+   ```
+2. Create a new file called `pre-commit`:
+   ```bash
+   touch pre-commit
+   chmod +x pre-commit
+   ```
+3. Open the file in your editor and paste the script above (starting with `#!/bin/bash`).
+### Verification
+To verify the hook is working, make a change to a notebook and stage it:
+```bash
+git add src/notebooks/experiments.ipynb
+git commit -m "Test notebook commit"
+```
+You should see output like:
+```
+Clearing outputs from notebooks...
+  Processing: src/notebooks/experiments.ipynb
+✓ Notebook outputs cleared
+```
+**Note**: A Git pre-commit hook is installed at `.git/hooks/pre-commit` that automatically clears all notebook outputs before committing.

README.md CHANGED Viewed

@@ -9,39 +9,18 @@ An agentic version of real people using RAG (Retrieval Augmented Generation) ove
 Deployed as a Gradio chatbot on Hugging Face Spaces.
-## Architecture Overview
-### Core Design
-**Data Pipeline** → **Agent System Set Up** → **Chat Interface**
-1. **Data Pipeline** (`src/data.py`)
-   - Loads markdown from local `docs/` and public GitHub repos
-   - Chunks with LangChain, embeds with HuggingFace sentence-transformers
-   - Stores in ephemeral ChromaDB vectorstore
-2. **Agent System** (`src/agent.py`)
-   - Primary agent personifies a real person using RAG
-   - Queries vectorstore via async `get_local_info` tool
-   - Uses Groq API (`openai/openai/gpt-oss-120b`) for LLM inference
-   - OpenAI API for tracing/debugging only
-3. **UI Layer** (`src/app.py`)
-   - Simple chat interface streams responses
-   - Async-first architecture throughout
-### Key Technologies
-- **Python 3.12** with `uv` for dependency management
-- **OpenAI Agents SDK** for agentic framework
-- **LangChain** for document loading/chunking only
-- **ChromaDB** with ephemeral in-memory storage
-- **Gradio** for UI and Hugging Face Spaces deployment
-- **Groq** as primary LLM provider for fast inference
-- **Pydantic** for type-safe configuration
-- **Grafana Cloud Loki** for remote logging (optional)
-## Getting Started
 ### Environment Setup
@@ -52,22 +31,18 @@ uv sync
 # OR build the container
 docker compose build notebooks
-# NOTE: if you go the local route it's assume you have nodejs and npx installed
 # Create .env with required keys:
 OPENAI_API_KEY=<for tracing>
 GROQ_API_KEY=<primary LLM provider>
 GITHUB_PERSONAL_ACCESS_TOKEN=<for searching GitHub repos>
 BOT_FULL_NAME=<full name of the persona>
 APP_NAME=<name that appears on HF chat page>
-GITHUB_REPOS=<comma-separated list of public repos: owner/repo,owner/repo>
 # Optional: Set log level (DEBUG, INFO, WARNING, ERROR). Defaults to INFO.
 LOG_LEVEL=INFO
 ```
-**Note**: A Git pre-commit hook is installed at `.git/hooks/pre-commit` that automatically clears all notebook outputs before committing. This keeps the repository clean and reduces diff noise.
 ### Running
 ```bash
@@ -83,12 +58,14 @@ If you use the Docker route, you can use the Dev Containers extension in most po
 ## Deployment
 ```bash
 # Run from the root directory
 gradio deploy
 ```
-**Automatic CI/CD**: Push to `main` triggers a GitHub Actions workflow that deploys to Hugging Face Spaces via Gradio CLI. The following environment variables need to be set up in GitHub for the CI/CD pipeline:
 ```bash
 # Testing
@@ -119,43 +96,19 @@ GITHUB_REPOS
 ENV  # e.g., "production" - used for log tagging
 ```
-## Design Principles
-1. **Decoupled Architecture**: Config handles sources, DataManager handles pipeline, app orchestrates
-2. **Smart Defaults**: Minimal configuration required - most params have sensible defaults
-3. **Async-First**: All agent operations are async for responsive UI
-4. **Ephemeral Storage**: Vectorstore rebuilt on each restart (fast, simple, stateless)
-5. **Type Safety**: Pydantic models with validation throughout
-6. **Development/Production Parity**: Same DataManager used in notebooks and production app
-## Project Structure
-```
-src/
-  ├── config.py              # Pydantic settings, API keys, data sources
-  ├── data.py                # DataManager class - complete pipeline
-  ├── agent.py               # Agent creation and orchestration
-  ├── app.py                 # Production Gradio app
-  ├── test.py                # Unit tests
-  ├── __init__.py
-  └── notebooks/
-      └── experiments.ipynb  # Development sandbox with MCP servers
-docs/                        # Local markdown for RAG (development)
-test_data/                   # Test fixtures and sample data
-.github/
-  ├── copilot-instructions.md  # Detailed implementation guide for AI
-  └── workflows/
-      └── update_space.yml     # CI/CD to Hugging Face
-```
-## Reminders/Warnings
-- **Data Sources**: The default for local development is the `docs/` folder. If you want your production app to access this content post deploy, it must be pushed to a public GitHub repo until we support private repo document loading for RAG.
 - **Model Choice**: Groq's `gpt-oss-120b` provides good quality with ultra-fast inference. If you change the model, be aware that tool calling may degrade which can lead to runtime errors.
 - **Docker in Docker**: A prior version of this app had the Docker command built into the container. Because the GitHub MCP server was moved from `docker` to `npx`, the CLI is no longer included in the [Dockerfile](Dockerfile). However, the socket mount in [docker-compose](docker-compose.yaml) was left as we may revisit Docker MCP servers in the future. We will never support a pure docker-in-docker setup, but socket mounting may still be an option.
-- **The Agent does not have memory!**: This was an intentional design decision until multithreaded operations can be supported/tested on HF.
 ---
-For detailed implementation patterns, code conventions, and AI assistant context, see [`.github/copilot-instructions.md`](.github/copilot-instructions.md).

 Deployed as a Gradio chatbot on Hugging Face Spaces.
+Example: https://huggingface.co/spaces/byoung-hf/ben-bot
+## Getting Started
+If you want to experiment with building your own agentic self, clone the repo and follow the steps below.
+### Prerequisites
+- [Docker](https://docs.docker.com/engine/install/) or [uv](https://docs.astral.sh/uv/getting-started/installation/) for running the application
+- If you run locally, you need [npx](https://docs.npmjs.com/downloading-and-installing-node-js-and-npm) installed
+- Groq and OpenAI for inference and tracing
+- A GitHub PAT for the GitHub tool (optional)
 ### Environment Setup
 # OR build the container
 docker compose build notebooks
 # Create .env with required keys:
 OPENAI_API_KEY=<for tracing>
 GROQ_API_KEY=<primary LLM provider>
 GITHUB_PERSONAL_ACCESS_TOKEN=<for searching GitHub repos>
 BOT_FULL_NAME=<full name of the persona>
 APP_NAME=<name that appears on HF chat page>
+GITHUB_REPOS=<comma-separated list of public repos: owner/repo,owner/repo with md files to be ingested by RAG>
 # Optional: Set log level (DEBUG, INFO, WARNING, ERROR). Defaults to INFO.
 LOG_LEVEL=INFO
 ```
 ### Running
 ```bash
 ## Deployment
+To deploy the application to HF, you need to update the comment at the top of this README and create your own HF account and put a HF_TOKEN into your .env file.
 ```bash
 # Run from the root directory
 gradio deploy
 ```
+**Automatic CI/CD**: Pushing to `main` triggers a GitHub Actions workflow that deploys to Hugging Face Spaces via Gradio CLI. The following environment variables need to be set up in GitHub for the CI/CD pipeline:
 ```bash
 # Testing
 ENV  # e.g., "production" - used for log tagging
 ```
+## Architecture and Design Overview
+All of our architecture and design thinking can be found in our [constitution](/.specify/memory/constitution.md) and [ADRs](/architecture/adrs/)
+Some implementation decisions have been captured in our [`.github/copilot-instructions.md`](.github/copilot-instructions.md), but will be integrated into other documents over time.
+## Contributions
+Check out our [CONTRIBUTING.md](/CONTRIBUTING.md) guide for detailed instructions on how you can contribute.
+## Reminders/Warnings/Gotchas
+- **Data Sources**: The default for local development is the `docs/local-testing` folder. If you want your production app to access this content post deploy, it must be pushed to a public GitHub repo until we support private repo document loading for RAG.
 - **Model Choice**: Groq's `gpt-oss-120b` provides good quality with ultra-fast inference. If you change the model, be aware that tool calling may degrade which can lead to runtime errors.
 - **Docker in Docker**: A prior version of this app had the Docker command built into the container. Because the GitHub MCP server was moved from `docker` to `npx`, the CLI is no longer included in the [Dockerfile](Dockerfile). However, the socket mount in [docker-compose](docker-compose.yaml) was left as we may revisit Docker MCP servers in the future. We will never support a pure docker-in-docker setup, but socket mounting may still be an option.
 ---

TESTING.md CHANGED Viewed

@@ -48,7 +48,6 @@ uv run pytest src/test.py::test_rear_knowledge_contains_it245 -v
 **Configuration**:
 - **Temperature**: Set to 0.0 for deterministic, reproducible responses
-- **MCP Servers**: Disabled by default for faster test execution
 - **Model**: Uses model specified in config (default: `openai/openai/gpt-oss-120b` via Groq)
 - **Data Source**: `test_data/` directory (configured via `doc_root` parameter)
 - **GitHub Repos**: Disabled (`GITHUB_REPOS=""`) for faster test execution

 **Configuration**:
 - **Temperature**: Set to 0.0 for deterministic, reproducible responses
 - **Model**: Uses model specified in config (default: `openai/openai/gpt-oss-120b` via Groq)
 - **Data Source**: `test_data/` directory (configured via `doc_root` parameter)
 - **GitHub Repos**: Disabled (`GITHUB_REPOS=""`) for faster test execution

architecture/adrs/adr-linkedin-integration.md ADDED Viewed

	@@ -0,0 +1,42 @@

+# ADR-001: Human-in-the-Loop Browser Automation for Third-Party Data Ingestion
+**Status**: Accepted
+**Date**: 2025-10-24
+## Context
+External services like LinkedIn do not provide sanctioned public APIs for personal profile data extraction. Previous research evaluated three integration approaches:
+1. **Programmatic MCP/API integrations** — No official LinkedIn MCP server exists; third-party options are immature or unavailable
+2. **Third-party data-gathering services** (Apify, Anysite) — Require sharing user credentials, violate Terms of Service, pose security and privacy risks
+3. **Human-in-the-loop browser automation** — Respects ToS, maintains user control, verifies data accuracy and privacy before ingestion
+## Decision
+We will implement data ingestion for LinkedIn and other services without sanctioned public APIs exclusively through **human-in-the-loop browser automation** (Playwright). This approach:
+- Requires the user to authenticate interactively (human provides credentials directly to LinkedIn, not to our tool)
+- Extracts only publicly-visible profile content (profile, experience, education, skills, recommendations, connections, activity)
+- Scope is limited to the user that manually logs in
+- Mandates human review of all extracted content for accuracy and privacy compliance before ingestion into markdown documentation
+- Explicitly prohibits use of third-party credential-sharing services that scrape on a user's behalf
+This policy applies retroactively to LinkedIn and prospectively to any future external services lacking a publicly available API.
+## Consequences
+**Positive:**
+- Maintains compliance with LinkedIn Terms of Service
+- Protects user security (credentials never shared with third parties)
+- Enables human verification of data accuracy and privacy
+- Establishes reusable pattern for similar external data sources
+- Respects user agency and control over their data
+**Negative:**
+- Requires user manual effort (browser interaction, file review)
+- Tool development complexity (Playwright orchestration, markdown formatting)
+- Cannot be fully automated or scheduled
+- Slower than direct API access (if available)
+## Compliance
+This decision instantiates Constitution Principle X (External Data Integration Policy) and establishes binding guidance for all future data ingestion tools.

specs/002-linkedin-profile-extractor/CLARIFICATIONS.md ADDED Viewed

	@@ -0,0 +1,158 @@

+# Clarification Session Report: LinkedIn Profile Extractor Tool
+**Date**: October 24, 2025
+**Feature**: LinkedIn Profile Data Extractor (Spec 002)
+**Status**: ✅ **COMPLETE**
+---
+## Executive Summary
+Clarification session completed with **5 critical questions answered**. All high-impact ambiguities resolved. New feature spec created (`specs/002-linkedin-profile-extractor/spec.md`) with clear design decisions and ready for planning phase.
+---
+## Questions Asked & Answered
+### Question 1: Feature Scope & Organization
+**Answer**: A — Create separate spec (`specs/002-linkedin-profile-extractor/spec.md`)
+**Rationale**: Cleanly separates concerns from main Personified AI Agent (spec 001), allows independent versioning and task tracking, prevents scope creep.
+---
+### Question 2: Authentication Mechanism
+**Answer**: A with Playwright — Browser automation with manual LinkedIn login
+**Rationale**: Respects LinkedIn ToS, gives users full control, avoids API restrictions, captures full LinkedIn UI experience. Playwright chosen for cross-platform support and reliability.
+---
+### Question 3: Data Extraction Scope
+**Answer**: C — Full profile extraction (connections, endorsements, activity) with human review gate
+**Context**: User clarified that human review allows verification of privacy/legal concerns, and all extracted data is publicly available anyway.
+**Rationale**: Maximum value; human review ensures accuracy and compliance. Users decide what to share with AI agent.
+---
+### Question 4: Markdown Output Structure
+**Answer**: B — Hierarchical markdown by section (Profile.md, Experience.md, Education.md, Skills.md, Recommendations.md, Connections.md, Activity.md)
+**Rationale**: Mirrors LinkedIn's natural information architecture, modular for easy editing/exclusion, integrates seamlessly with existing RAG pipeline.
+---
+### Question 5: Tool Delivery & Integration
+**Answer**: A — Standalone Python CLI tool; users run locally, review output, manually upload
+**Rationale**: Full user control over data, respects privacy, simple integration with existing workflow, avoids complicating main app deployment.
+---
+## Coverage Analysis
+| Category | Status | Coverage |
+|----------|--------|----------|
+| **Functional Scope & Behavior** | ✅ Resolved | Clear user stories, acceptance criteria, edge cases documented |
+| **Domain & Data Model** | ✅ Resolved | Entities (LinkedInProfile, Experience, etc.), markdown output structure defined |
+| **Interaction & UX Flow** | ✅ Resolved | CLI interface, workflow steps, human review gate specified |
+| **Non-Functional Quality Attributes** | ✅ Resolved | Performance targets (<5min extraction), reliability (error handling), privacy (local execution) |
+| **Integration & External Dependencies** | ✅ Resolved | Playwright browser automation, no API integration, RAG pipeline compatibility |
+| **Edge Cases & Failure Handling** | ✅ Resolved | UI changes, rate limiting, timeouts, incomplete data all covered |
+| **Constraints & Tradeoffs** | ✅ Resolved | Technical stack (Python 3.12, Playwright), local execution, manual upload confirmed |
+| **Terminology & Consistency** | ✅ Resolved | Canonical terms (LinkedInProfile, ExtractionSession, MarkdownOutput) defined |
+| **Completion Signals** | ✅ Resolved | Success metrics defined; acceptance criteria testable |
+**Overall Coverage**: ✅ **100%** — All critical categories resolved
+---
+## Spec Artifact Generated
+**Path**: `/Users/benyoung/projects/ai-me/specs/002-linkedin-profile-extractor/spec.md`
+**Contents**:
+- Problem statement and solution overview
+- 3 user stories (P1: Extract, P1: Review, P2: Upload) with acceptance criteria
+- 17 functional requirements (FR-001 through FR-017)
+- 8 key entities (LinkedInProfile, Experience, Education, Skill, Recommendation, Connection, Activity, ExtractionSession, MarkdownOutput)
+- 13 non-functional requirements (performance, reliability, security, usability, observability)
+- 8 success criteria
+- Data model and markdown output examples
+- Technical constraints and architecture
+- Integration with Personified AI Agent (spec 001)
+- CLI interface and usage examples
+- Testing strategy
+- Future enhancements (Phase B, out-of-scope)
+---
+## Sections Touched in New Spec
+1. **Clarifications** → Session 2025-10-24 with all 5 answered questions
+2. **Overview & Context** → Problem statement, solution, key differentiators
+3. **User Scenarios & Testing** → 3 user stories with acceptance criteria
+4. **Requirements** → 17 functional + 13 non-functional requirements
+5. **Data Model** → Entity definitions + markdown output structure
+6. **Technical Constraints & Architecture** → Technology stack, implementation notes, out-of-scope
+7. **Integration with Personified AI Agent** → Workflow and compatibility
+8. **Deployment & Usage** → CLI installation, interface, examples
+9. **Testing Strategy** → Unit, integration, manual test approaches
+10. **Success Metrics** → Measurable outcomes and targets
+---
+## Recommendations for Next Steps
+### Immediate (Phase 0-1)
+1. **Review Spec**: Validate spec decisions align with your vision
+2. **Data Model Refinement**: Create detailed markdown schema examples (if needed)
+3. **Implementation Plan**: Run `/speckit.plan` to create implementation roadmap
+4. **Task Breakdown**: Run `/speckit.tasks` to generate concrete development tasks
+### Pre-Development Verification
+1. **Test Playwright with LinkedIn**: Quick POC to verify Playwright can navigate LinkedIn without blocking
+2. **Validate Markdown Structure**: Ensure generated markdown integrates with existing RAG pipeline
+3. **User Testing Plan**: Plan 1-2 user trials to validate data accuracy and workflow
+### Execution Phases
+- **Phase 1** (Setup & Infrastructure): Playwright environment, CLI scaffolding, output directory management
+- **Phase 2** (Foundational): Core extraction logic, error handling, markdown generation
+- **Phase 3** (User Story 1): Profile extraction workflow, testing
+- **Phase 4** (User Story 2): Review & validation features
+- **Phase 5** (User Story 3): Documentation for upload workflow
+---
+## Outstanding Items (None)
+All critical ambiguities resolved. No outstanding blocking decisions.
+**Deferred to Planning Phase** (as appropriate):
+- Specific Playwright selector strategy (implementation detail)
+- Error retry logic specifics (implementation detail)
+- CLI argument parsing details (implementation detail)
+---
+## Suggested Next Command
+```bash
+# After reviewing spec:
+/speckit.plan  # Create detailed implementation roadmap
+# Then:
+/speckit.tasks  # Generate task breakdown for development
+```
+---
+**Clarification Status**: ✅ **COMPLETE & READY FOR PLANNING**
+**Spec Path**: `specs/002-linkedin-profile-extractor/spec.md`
+**Branch Ready**: Ready for new feature branch `002-linkedin-profile-extractor`

specs/002-linkedin-profile-extractor/INDEX.md ADDED Viewed

	@@ -0,0 +1,200 @@

+# 📑 LinkedIn Profile Extractor Specification Index
+**Feature**: Spec 002 - LinkedIn Profile Data Extractor
+**Created**: October 24, 2025
+**Status**: ✅ Clarification Complete — Ready for Planning
+---
+## 📚 Reading Guide
+### Quick Start (5-10 minutes)
+Start here for executive overview:
+1. **[SUMMARY.md](SUMMARY.md)** (12 KB)
+   - Executive summary of clarification session
+   - 5 design decisions made
+   - Architecture overview
+   - Key takeaways
+   - Next steps
+### Full Specification (15-20 minutes)
+Complete feature definition:
+2. **[spec.md](spec.md)** (21 KB) ⭐ **MAIN SPEC**
+   - Problem statement & solution overview
+   - 3 user stories with acceptance criteria
+   - 17 functional requirements
+   - 13 non-functional requirements
+   - 8 key entities
+   - Data model & markdown output structure
+   - Technical architecture
+   - CLI interface & usage
+   - Testing strategy
+   - Success metrics
+### Integration & Workflow (10-15 minutes)
+How this tool works with Spec 001:
+3. **[INTEGRATION_GUIDE.md](INTEGRATION_GUIDE.md)** (10 KB)
+   - High-level workflow diagram
+   - Data flow (LinkedIn → markdown → agent)
+   - Step-by-step integration steps
+   - Data mapping (source → file → usage)
+   - Configuration examples
+   - Privacy & consent controls
+   - Troubleshooting guide
+   - Future enhancements
+### Clarification Session Record (5 minutes)
+Detailed record of design decisions:
+4. **[CLARIFICATIONS.md](CLARIFICATIONS.md)** (7 KB)
+   - All 5 questions & answers
+   - Rationale for each decision
+   - Coverage analysis
+   - Sections touched
+   - Recommendations
+---
+## 🎯 Quick Navigation
+### By Role
+**Product Manager**: Read SUMMARY.md → spec.md → INTEGRATION_GUIDE.md
+**Engineer**: Read spec.md → INTEGRATION_GUIDE.md → spec.md again for details
+**Project Lead**: Read SUMMARY.md → CLARIFICATIONS.md → next steps
+### By Question
+**What is this tool?**
+→ SUMMARY.md, Overview section
+**How does it work?**
+→ INTEGRATION_GUIDE.md, High-Level Workflow
+**What gets extracted?**
+→ spec.md, Functional Requirements (FR-003 through FR-009)
+**How does it integrate with Spec 001?**
+→ INTEGRATION_GUIDE.md, Integration Steps
+**What were the design decisions?**
+→ CLARIFICATIONS.md, Questions Asked & Answered
+**How do I use it?**
+→ spec.md, Deployment & Usage section
+**What happens if something fails?**
+→ INTEGRATION_GUIDE.md, Troubleshooting
+---
+## 📋 Document Overview
+| Document | Lines | Purpose | Audience |
+|----------|-------|---------|----------|
+| **SUMMARY.md** | 282 | Executive overview | Managers, PMs, decision makers |
+| **spec.md** | 408 | Complete specification | Engineers, architects |
+| **INTEGRATION_GUIDE.md** | 379 | Workflow & integration | Engineers, ops, users |
+| **CLARIFICATIONS.md** | 158 | Decision record | Project leads, reviewers |
+| **README.md** | 282 | Getting started | Everyone |
+| **INDEX.md** | This file | Navigation | Everyone |
+**Total**: ~1,500 lines of specification documentation
+---
+## 🚀 Next Steps
+### Immediate (Next 1-2 hours)
+- [ ] Read SUMMARY.md (5 min)
+- [ ] Read spec.md User Stories section (10 min)
+- [ ] Review INTEGRATION_GUIDE.md workflow (10 min)
+- [ ] Validate design decisions align with vision (5 min)
+### Short-term (Next 1-2 days)
+- [ ] Run `/speckit.plan` to create implementation roadmap
+- [ ] Run `/speckit.tasks` to generate task breakdown
+- [ ] Create feature branch: `git checkout -b 002-linkedin-profile-extractor`
+### Pre-Development (Next 1 week)
+- [ ] Review implementation plan
+- [ ] Estimate effort and timeline
+- [ ] Quick Playwright POC (verify LinkedIn compatibility)
+- [ ] Plan user trial for validation
+---
+## 🔗 Cross-References
+**Related Specifications**:
+- Spec 001: Personified AI Agent — `specs/001-personified-ai-agent/spec.md`
+- T068 Research: LinkedIn MCP — `specs/001-personified-ai-agent/research.md`
+**Project Standards**:
+- Constitution: `.specify/memory/constitution.md`
+- Copilot Instructions: `.github/copilot-instructions.md`
+- Clarify Prompt: `.github/prompts/speckit.clarify.prompt.md`
+---
+## 📊 Specification Stats
+- **Questions Clarified**: 5/5 ✅
+- **Coverage Achieved**: 100% (all 9 categories)
+- **User Stories**: 3 (P1, P1, P2)
+- **Functional Requirements**: 17
+- **Non-Functional Requirements**: 13
+- **Key Entities**: 8
+- **Markdown Files Output**: 7 per session + metadata
+- **CLI Commands**: 1 main command with options
+- **Success Metrics**: 8 measurable outcomes
+---
+## ✅ Validation Checklist
+- ✅ All 5 clarification questions answered
+- ✅ All answers integrated into spec
+- ✅ 100% coverage of ambiguity categories
+- ✅ Spec includes user stories, requirements, data model
+- ✅ Integration guide explains workflow
+- ✅ Integration with Spec 001 documented
+- ✅ CLI interface & usage documented
+- ✅ Testing strategy included
+- ✅ Success metrics defined
+- ✅ Ready for planning phase
+---
+## 🎓 How to Use This Index
+1. **First time?** → Start with SUMMARY.md
+2. **Implementing?** → Read spec.md sections in order
+3. **Integrating with Spec 001?** → Use INTEGRATION_GUIDE.md
+4. **Reviewing decisions?** → Check CLARIFICATIONS.md
+5. **Lost?** → This index helps you navigate
+---
+## 🔥 Key Highlights
+**In 30 seconds**:
+- 🛠️ Tool: Python CLI using Playwright
+- 🔓 Auth: Manual LinkedIn login (human-in-the-loop)
+- 📊 Data: Full profile extraction (7 sections)
+- 📝 Output: Hierarchical markdown files
+- 👤 User Control: Review locally → edit → upload manually
+- 🔗 Integration: Seamless with AI-Me agent (Spec 001)
+---
+**Current Status**: ✅ **Ready for Planning Phase**
+**Next Command**: `/speckit.plan` to create implementation roadmap
+---
+*Navigation Guide for LinkedIn Profile Extractor Specification*
+*Last Updated: October 24, 2025*

specs/002-linkedin-profile-extractor/INTEGRATION_GUIDE.md ADDED Viewed

	@@ -0,0 +1,379 @@

+# Integration Guide: LinkedIn Profile Extractor ↔ Personified AI Agent
+**Document**: Integration roadmap between Spec 002 (LinkedIn Extractor) and Spec 001 (Personified AI Agent)
+**Date**: October 24, 2025
+**Status**: Reference Documentation
+---
+## High-Level Workflow
+```
+User's LinkedIn Profile
+        ↓
+ [Spec 002: LinkedIn Profile Extractor Tool]
+   (Playwright browser automation + manual login)
+        ↓
+ Generated Markdown Files (Profile.md, Experience.md, etc.)
+        ↓
+ User Review & Edit (privacy/accuracy gate)
+        ↓
+ Manual Upload to GitHub Repository (byoung/me, etc.)
+        ↓
+ [Spec 001: Personified AI Agent]
+   (DataManager loads GitHub repo via RAG)
+        ↓
+ Agent Knowledge Base Enhanced
+        ↓
+ User Chat: "Tell me about your experience..."
+ Agent Response: [Sourced from LinkedIn profile markdown]
+```
+---
+## Data Flow
+### Spec 002 Output Format
+LinkedIn Extractor produces:
+```
+linkedin-profile/
+├── extraction_report.json
+├── Profile.md               # Name, headline, location, about
+├── Experience.md            # Job history
+├── Education.md             # Schools, degrees
+├── Skills.md                # Skills + endorsements
+├── Recommendations.md       # Recommendations
+├── Connections.md           # Connections list
+└── Activity.md              # Posts, articles
+```
+### Spec 001 Input Format
+Personified AI Agent expects:
+- Markdown files in `docs/` directory (local) or GitHub repository (remote)
+- Files organized by topic/section (exactly what Spec 002 produces)
+- Metadata: filename, creation date, source (included in extraction_report.json)
+- Markdown syntax: valid UTF-8, proper heading hierarchy
+**Result**: Perfect format compatibility. No transformation needed.
+---
+## Integration Steps
+### Step 1: Extract LinkedIn Data (Spec 002)
+```bash
+cd ~/projects/ai-me
+python -m linkedin_extractor extract --output-dir ./linkedin-profile
+# User logs in manually → files generated → review files
+```
+**Output**: 7 markdown files + extraction_report.json in `./linkedin-profile/`
+---
+### Step 2: Review & Validate
+User reviews files locally, edits for privacy/accuracy:
+```bash
+# Open in editor
+code ./linkedin-profile/
+# Edit files as needed, delete sensitive info, verify accuracy
+# Example edits:
+# - Remove specific company names if desired
+# - Condense connections list to key contacts
+# - Remove draft posts or old activity
+```
+---
+### Step 3: Upload to Documentation Repository
+User uploads files to their documentation repo (e.g., `byoung/me`):
+```bash
+cd ~/repos/byoung-me  # (or wherever your docs repo is)
+# Copy or move reviewed files
+cp -r ~/projects/ai-me/linkedin-profile/*.md ./
+# Commit and push
+git add *.md
+git commit -m "Update LinkedIn profile: $(date +%Y-%m-%d)"
+git push origin main
+```
+---
+### Step 4: Configure AI-Me to Ingest
+Update `.env` to include LinkedIn profile repo:
+```bash
+# .env
+GITHUB_REPOS=byoung/me,byoung/other-docs  # Add your repo
+GITHUB_PERSONAL_ACCESS_TOKEN=ghp_xxxxx
+```
+Or, if files are already in local `docs/` directory:
+```bash
+# Just move files to docs/
+cp ~/linkedin-profile/*.md ./docs/
+```
+---
+### Step 5: Restart AI-Me
+AI-Me's `DataManager` will reload documents on next startup:
+```bash
+# If running locally:
+python -m gradio src/app.py
+# If deployed on Spaces, trigger redeploy or restart
+```
+---
+### Step 6: Verify Integration
+Test that agent has access to LinkedIn profile data:
+**Test Chat**:
+- User: "Tell me about your work experience"
+- Agent Response: [Cites Experience.md from LinkedIn extractor]
+**Verification**:
+- Agent uses first-person ("I worked at...")
+- Agent cites specific companies/dates from LinkedIn profile
+- Agent maintains authentic voice
+---
+## Data Mapping: LinkedIn → Markdown → Agent
+| LinkedIn Source | Markdown File | Agent Uses For | Example Question |
+|-----------------|---------------|----------------|------------------|
+| Profile section | Profile.md | Personalization, headline context | "What's your professional background?" |
+| Experience | Experience.md | Job history, expertise domains | "Tell me about your experience with X" |
+| Education | Education.md | Academic background, credentials | "Where did you study?" |
+| Skills + Endorsements | Skills.md | Domain expertise ranking | "What are your top skills?" |
+| Recommendations | Recommendations.md | Social proof, validation | "What do others say about you?" |
+| Connections | Connections.md | Network context, collaboration history | "Tell me about your network" |
+| Activity/Posts | Activity.md | Recent thinking, current interests | "What are you focused on lately?" |
+---
+## File Format Examples
+### Profile.md (from Spec 002 → consumed by Spec 001)
+```markdown
+# LinkedIn Profile
+**Name**: Ben Young
+**Headline**: AI Agent Architect | Full-Stack Engineer
+**Location**: San Francisco, CA
+## Summary
+Experienced AI/ML engineer with 10+ years building production systems...
+[Rest of profile]
+```
+**Agent uses**: First-person synthesis of profile summary in responses
+---
+### Experience.md (from Spec 002 → consumed by Spec 001)
+```markdown
+# Experience
+## AI Agent Architect @ TechCorp (2023-2025)
+- Led design of autonomous agent systems
+- Built RAG pipeline with 99.9% uptime
+- Mentored 5 engineers on AI architecture
+## Senior Engineer @ StartupXYZ (2020-2023)
+- ...
+```
+**Agent uses**: Specific job responsibilities when answering experience questions
+---
+## Configuration Examples
+### For GitHub-Based Ingestion
+```bash
+# .env
+GITHUB_PERSONAL_ACCESS_TOKEN=ghp_xxxxxxxxxxxxx
+GITHUB_REPOS=byoung/me,byoung/projects
+# AI-Me will load:
+# - https://github.com/byoung/me/blob/main/*.md
+# - https://github.com/byoung/projects/blob/main/*.md
+```
+Then upload LinkedIn profile markdown to `byoung/me` repo:
+```bash
+# Repository structure
+byoung/me/
+├── Profile.md          # from LinkedIn extractor
+├── Experience.md       # from LinkedIn extractor
+├── Education.md        # from LinkedIn extractor
+├── resume.md           # manual or pre-existing
+├── projects.md         # manual
+└── README.md
+```
+---
+### For Local File Ingestion
+```bash
+# .env (no GitHub token needed)
+GITHUB_REPOS=""  # empty, or omit
+# Move LinkedIn profile files to local docs/
+cp ~/linkedin-profile/*.md ~/projects/ai-me/docs/
+# Restart AI-Me
+# DataManager will load from docs/ automatically
+```
+---
+## Data Privacy & Consent
+### What Users Control
+1. **Extraction**: User manually logs into LinkedIn (no credentials stored)
+2. **Review**: User reviews generated markdown before upload
+3. **Filtering**: User can delete/edit sensitive information in markdown
+4. **Upload**: User chooses where to upload (GitHub public repo, local, etc.)
+5. **Sharing**: User decides whether to use data with AI agent
+### What's Extracted
+Only **publicly visible** LinkedIn data:
+- Profile summary (as shown on profile page)
+- Published experience/jobs
+- Education (if public)
+- Skills (if public)
+- Recommendations (if public)
+- Connections names/titles (if publicly shown)
+- Published posts/activity (if public)
+### Privacy Best Practices
+1. Review markdown files before upload
+2. Remove sensitive information (specific salary, internal projects, etc.)
+3. Edit connections list if desired (Spec 002 allows truncation)
+4. Use private GitHub repo if prefer (not shared publicly)
+5. Set `GITHUB_REPOS` to private repo URL in AI-Me config
+---
+## Troubleshooting Integration
+### Problem: Agent doesn't cite LinkedIn data
+**Diagnosis**:
+1. Verify markdown files uploaded to GitHub repo
+2. Verify GitHub repo URL is in `GITHUB_REPOS` env var
+3. Verify `GITHUB_PERSONAL_ACCESS_TOKEN` is set
+4. Restart AI-Me app
+5. Check logs: `DataManager.load_remote_documents()` should show documents loaded
+**Solution**:
+```bash
+# Test data loading
+python -c "from src.data import DataManager;
+dm = DataManager();
+docs = dm.process_documents();
+print(f'Loaded {len(docs)} documents')"
+```
+---
+### Problem: LinkedIn markdown syntax errors
+**Diagnosis**:
+1. Validate markdown: `markdown-lint *.md`
+2. Check for special characters, emojis, Unicode issues
+**Solution**:
+- Spec 002 includes Unicode normalization (Constitution IX)
+- User should review markdown files before upload
+- Re-run extraction if needed
+---
+### Problem: Data accuracy issues in agent responses
+**Diagnosis**:
+1. Verify extracted data matches LinkedIn profile
+2. Verify markdown reflects accurate representation of LinkedIn
+3. Check vector search is retrieving correct documents
+**Solution**:
+- User reviews extracted markdown before upload
+- Manual editing of markdown files allowed
+- Test specific queries: "What company did you work at?" → should cite Experience.md
+---
+## Future Enhancements
+### Phase B: Spec 002 Enhancements
+- Scheduled extraction (sync profile changes monthly)
+- Data versioning (track profile evolution)
+- Diff report (what changed since last extraction)
+- LinkedIn API integration (if ToS allows in future)
+### Phase B: Spec 001 Integration Enhancements
+- Automatic GitHub sync (reload documents on push webhook)
+- LinkedIn data freshness indicator ("Profile data from X days ago")
+- Dedicated LinkedIn context in agent prompt
+- LinkedIn-specific queries: "Show me recent posts" → cite Activity.md
+### Joint Enhancement: Documentation Sync Tool
+- Tool to automatically sync markdown updates to GitHub
+- Dashboard showing which LinkedIn data is in agent's knowledge base
+- User audit trail: "Last synced: date"
+---
+## Success Criteria for Integration
+✅ **Extraction**: LinkedIn data → markdown files ≥90% accuracy
+✅ **Review**: User can edit files before upload
+✅ **Upload**: Files accessible to AI-Me DataManager
+✅ **Retrieval**: Agent retrieves correct LinkedIn data for queries
+✅ **Response**: Agent cites LinkedIn profile in first-person responses
+✅ **Accuracy**: Sample responses match LinkedIn source data (100%)
+---
+## Reference
+- **Spec 001**: Personified AI Agent → `/specs/001-personified-ai-agent/spec.md`
+- **Spec 002**: LinkedIn Profile Extractor → `/specs/002-linkedin-profile-extractor/spec.md`
+- **Constitution**: `/specify/memory/constitution.md` (Principles: RAG-First, Session Isolation, Type-Safe, Async-First)

specs/002-linkedin-profile-extractor/README.md ADDED Viewed

	@@ -0,0 +1,282 @@

+# Clarification Session Complete: LinkedIn Profile Extractor
+**Session Date**: October 24, 2025
+**Feature**: LinkedIn Profile Data Extractor (Spec 002)
+**Status**: ✅ **CLARIFICATION COMPLETE & SPEC CREATED**
+---
+## Overview
+You requested a new tool to extract LinkedIn profile data into markdown files for use with the Personified AI Agent. I completed a full clarification session following the speckit.clarify workflow, answered 5 critical design questions, and created a complete feature specification.
+---
+## Clarification Results
+### Questions Asked & Answered: 5/5 ✅
+| # | Question | Your Answer | Rationale |
+|---|----------|-------------|-----------|
+| 1 | Feature organization (separate spec or integrated?) | **A**: Separate spec in `specs/002-` | Clean separation; independent versioning |
+| 2 | Authentication mechanism for LinkedIn? | **A**: Browser automation with **Playwright** | Respects ToS; user-controlled; full UI access |
+| 3 | LinkedIn data scope? | **C**: Full profile + human review gate | Maximum value; user controls privacy via review |
+| 4 | Markdown output structure? | **B**: Hierarchical by section (Profile.md, Experience.md, etc.) | Modular; mirrors LinkedIn structure; RAG-compatible |
+| 5 | Tool delivery model? | **A**: Standalone Python CLI tool | User control; local execution; manual upload to repo |
+### Coverage Achieved: 100% ✅
+All 9 key ambiguity categories resolved:
+- ✅ Functional scope & behavior (3 user stories, acceptance criteria)
+- ✅ Data model & entities (8 key entities defined)
+- ✅ Interaction & UX (CLI interface, workflow documented)
+- ✅ Non-functional attributes (performance <5min, reliability, privacy)
+- ✅ Integration & dependencies (Playwright, no API, local execution)
+- ✅ Edge cases & failures (rate limiting, UI changes, timeouts)
+- ✅ Constraints & tradeoffs (Python 3.12, local-only, manual upload)
+- ✅ Terminology & consistency (canonical terms defined)
+- ✅ Completion signals (success metrics, acceptance criteria)
+---
+## Artifacts Created
+### 1. Feature Specification
+**Path**: `specs/002-linkedin-profile-extractor/spec.md`
+**Contents** (~400 lines):
+- Problem statement & solution overview
+- 3 user stories (P1, P1, P2) with full acceptance scenarios
+- 17 functional requirements (FR-001 through FR-017)
+- 13 non-functional requirements (performance, reliability, security, usability)
+- 8 key entities with attributes
+- Data model with markdown output structure & examples
+- Technical constraints & architecture decisions
+- Integration with Spec 001 (Personified AI Agent)
+- CLI interface examples & usage
+- Testing strategy (unit, integration, manual)
+- Success metrics & measurable outcomes
+**Status**: Ready for Phase 0-1 planning
+---
+### 2. Clarifications Document
+**Path**: `specs/002-linkedin-profile-extractor/CLARIFICATIONS.md`
+**Contents**:
+- All 5 questions & answers with rationale
+- Coverage analysis (all categories resolved)
+- Sections touched in new spec
+- Recommendations for next steps
+---
+### 3. Integration Guide
+**Path**: `specs/002-linkedin-profile-extractor/INTEGRATION_GUIDE.md`
+**Contents** (~350 lines):
+- High-level workflow (LinkedIn → extraction → review → upload → ingestion)
+- Data flow from Spec 002 to Spec 001
+- Step-by-step integration instructions
+- Data mapping (LinkedIn source → markdown file → agent usage)
+- Configuration examples (GitHub-based & local)
+- Privacy & consent controls
+- Troubleshooting guide
+- Future enhancement ideas
+- Success criteria for integration
+---
+## Key Design Decisions
+### 1. **Separate Feature Spec** ✅
+- Created `specs/002-linkedin-profile-extractor/` directory
+- Independent from Spec 001 (Personified AI Agent)
+- Allows independent task tracking & prioritization
+- Prevents scope creep in main agent
+### 2. **Playwright Browser Automation** ✅
+- User logs in manually (human-in-the-loop)
+- Browser-based respects LinkedIn ToS (no scraping)
+- Cross-platform support (Windows/Mac/Linux)
+- Full UI access (can handle LinkedIn changes)
+- No API complexity or approval required
+### 3. **Full Data Extraction with Review Gate** ✅
+- Extracts all publicly visible data (connections, endorsements, activity)
+- User reviews markdown files locally before upload
+- User can edit/remove sensitive information
+- Only user decides what shares with AI agent
+### 4. **Hierarchical Markdown Output** ✅
+- 7 markdown files per section (Profile.md, Experience.md, etc.)
+- Mirrors LinkedIn's natural information structure
+- Modular: user can include/exclude files as needed
+- Perfect compatibility with existing RAG pipeline
+### 5. **Standalone CLI Tool** ✅
+- Separate from main Gradio app
+- Python 3.12 + uv (matches project standards)
+- Local execution (no credentials transmitted)
+- Manual upload workflow (user controls upload)
+- Respects user privacy & data ownership
+---
+## Workflow: User Perspective
+```
+1. User runs: python -m linkedin_extractor extract --output-dir ./linkedin-profile
+2. Browser opens; user logs into LinkedIn manually
+3. Tool extracts profile → Experience → Education → Skills → etc.
+4. 7 markdown files generated: Profile.md, Experience.md, ...
+5. User reviews files in text editor; edits for privacy/accuracy
+6. User uploads files to their GitHub repo (byoung/me or similar)
+7. User configures AI-Me: GITHUB_REPOS=byoung/me
+8. AI-Me loads files via RAG
+9. Next conversation uses LinkedIn profile data:
+   - User: "Tell me about your work experience"
+   - Agent: "I've worked at [companies from Experience.md]..."
+```
+---
+## What's Next
+### Recommended Path
+```bash
+# 1. Review the new spec
+cat specs/002-linkedin-profile-extractor/spec.md
+# 2. Create implementation plan
+/speckit.plan
+# 3. Generate task breakdown
+/speckit.tasks
+# 4. Create feature branch
+git checkout -b 002-linkedin-profile-extractor
+# 5. Begin Phase 1 (Setup & Infrastructure)
+```
+### Immediate Action Items
+- [ ] Review `specs/002-linkedin-profile-extractor/spec.md` — validate decisions
+- [ ] Review `INTEGRATION_GUIDE.md` — understand workflow with Spec 001
+- [ ] Create implementation plan via `/speckit.plan`
+- [ ] Generate task breakdown via `/speckit.tasks`
+- [ ] Create feature branch: `git checkout -b 002-linkedin-profile-extractor`
+### Pre-Development Validation
+- [ ] Test Playwright with LinkedIn (quick POC)
+- [ ] Validate markdown output integrates with RAG pipeline
+- [ ] Plan user trial to validate data accuracy
+---
+## Integration with Spec 001
+**No changes needed to Spec 001** (Personified AI Agent). The LinkedIn Profile Extractor is a **separate tool** that produces **compatible output** (markdown files).
+**Integration is simple**:
+1. Extract → Markdown files
+2. Review → User validates
+3. Upload → GitHub repo (or local docs/)
+4. Configure → Add repo to `GITHUB_REPOS` in Spec 001's config
+5. Ingest → Spec 001's DataManager loads files automatically
+See `INTEGRATION_GUIDE.md` for detailed workflow.
+---
+## Success Criteria for This Clarification
+| Criterion | Status |
+|-----------|--------|
+| All critical ambiguities identified | ✅ 9 categories scanned |
+| High-impact questions prioritized | ✅ 5 questions asked (high-impact) |
+| All answers actionable & clear | ✅ No ambiguous replies |
+| Spec reflects decisions accurately | ✅ All 5 answers integrated |
+| Integration documented | ✅ 350-line INTEGRATION_GUIDE.md created |
+| Ready for planning phase | ✅ No outstanding blockers |
+---
+## File Structure
+```
+specs/002-linkedin-profile-extractor/
+├── spec.md                 # Main feature specification (~400 lines)
+├── CLARIFICATIONS.md       # This clarification session record
+├── INTEGRATION_GUIDE.md    # Integration with Spec 001 (~350 lines)
+└── (forthcoming)
+    ├── plan.md             # (Phase 0) Implementation roadmap
+    ├── data-model.md       # (Phase 1) Detailed data model
+    ├── research.md         # (Phase 0) Research findings
+    └── tasks.md            # (Phase 2) Task breakdown
+```
+---
+## Summary Stats
+- **Questions Asked**: 5
+- **Coverage Achieved**: 100% (all 9 ambiguity categories)
+- **Spec Lines Created**: ~400 (main spec)
+- **Integration Guide**: ~350 lines
+- **Clarifications Documented**: ~200 lines
+- **Total Documentation**: ~950 lines
+- **Decision Clarity**: High (all 5 answers well-justified)
+- **Ready for Planning**: ✅ Yes
+---
+## Validation Checklist
+- ✅ All 5 questions answered & recorded
+- ✅ Spec created with clarifications integrated
+- ✅ Coverage summary shows all categories resolved
+- ✅ Markdown structure valid (no syntax errors)
+- ✅ Terminology consistent (canonical terms: LinkedInProfile, ExtractionSession, etc.)
+- ✅ No contradictory statements in spec
+- ✅ Integration guide references both specs
+- ✅ Next steps clearly documented
+---
+## Key Files to Review
+1. **Start Here**: `specs/002-linkedin-profile-extractor/spec.md` — Full feature spec
+2. **Integration**: `specs/002-linkedin-profile-extractor/INTEGRATION_GUIDE.md` — How it works with Spec 001
+3. **Reference**: `specs/001-personified-ai-agent/research.md` — T068 LinkedIn research (context)
+---
+## Next Steps
+**Recommended**: Run `/speckit.plan` to create implementation roadmap
+```bash
+/speckit.plan  # Create Phase 0-5 planning for Spec 002
+```
+Then:
+```bash
+/speckit.tasks  # Generate concrete task breakdown
+```
+---
+**Clarification Status**: ✅ **COMPLETE**
+**Spec Status**: ✅ **READY FOR PLANNING**
+**Recommended Next**: `/speckit.plan` → `/speckit.tasks` → Begin Phase 1 Development
+---
+*For detailed clarification methodology, see `.github/prompts/speckit.clarify.prompt.md`*

specs/002-linkedin-profile-extractor/spec.md ADDED Viewed

	@@ -0,0 +1,408 @@

+# Feature Specification: LinkedIn Profile Data Extractor
+**Feature Branch**: `002-linkedin-profile-extractor`
+**Created**: 2025-10-24
+**Status**: Draft (Clarification Complete)
+**Input**: User description: "A tool that walks through LinkedIn, allows users to login (human in the loop), then extracts user profile data (profile, experience, connections, etc.) into markdown files. Users can review files for accuracy/privacy and upload to their markdown repo for RAG ingestion."
+## Clarifications
+### Session 2025-10-24
+- Q: Should this be a separate feature spec or integrated into spec 001? → A: Create separate spec (`specs/002-linkedin-profile-extractor/spec.md`) for clean separation of concerns
+- Q: What authentication mechanism for LinkedIn? → A: Browser automation with Playwright for manual login; respects ToS, user-controlled
+- Q: What LinkedIn data to extract? → A: Full profile (connections, endorsements, activity feed) with human review gate for privacy/legal before upload
+- Q: What markdown output structure? → A: Hierarchical by section (Profile.md, Experience.md, Education.md, Skills.md, Recommendations.md, Connections.md, Activity.md)
+- Q: How is tool delivered and integrated? → A: Standalone Python CLI tool; users run locally, review output, manually upload to GitHub repo
+## Overview & Context
+### Problem Statement
+Users who want to create an AI agent representing themselves (via the Personified AI Agent, spec 001) need accurate, current profile data from LinkedIn. Currently, they must manually create markdown documentation of their professional background. This tool automates the extraction of LinkedIn profile data into markdown files, which users can review for accuracy and privacy, then upload to their documentation repository for RAG ingestion.
+### Solution Overview
+**LinkedIn Profile Data Extractor** is a standalone Python CLI tool that:
+1. Opens a Playwright browser and navigates to LinkedIn
+2. Requires manual user login (human-in-the-loop for authentication and consent)
+3. Automatically navigates LinkedIn sections (Profile, Experience, Education, Skills, Recommendations, Connections, Activity)
+4. Extracts structured data from each section
+5. Converts data to hierarchical markdown files
+6. Outputs files to a local directory for user review
+7. User reviews files for accuracy and privacy, then manually uploads to their documentation repository
+### Key Differentiators
+- **Privacy-First**: Browser-based extraction respects LinkedIn ToS; human review gate ensures user control over what data is shared
+- **No API Complexity**: Avoids LinkedIn API authentication, approval workflows, and data restrictions
+- **User-Controlled**: Users decide what to include/exclude before uploading
+- **Integrates with RAG**: Output markdown files are designed for ingestion by the Personified AI Agent
+- **Standalone**: Separate tool; doesn't complicate main AI-Me application
+### Target Users
+- Individuals creating an AI agent representing themselves
+- Users who want to keep profile data current with minimal manual effort
+- Users who prefer reviewing data before sharing with AI systems
+---
+## User Scenarios & Testing *(mandatory)*
+### User Story 1 - Extract LinkedIn Profile to Markdown (Priority: P1)
+A user runs the CLI tool, authenticates with LinkedIn, and extracts their profile data into markdown files. The tool creates organized, well-structured markdown files that accurately represent their LinkedIn profile.
+**Why this priority**: Core value proposition—tool must successfully extract LinkedIn data without manual intervention after login.
+**Independent Test**: Can be fully tested by running the tool, logging in, navigating profile extraction, and verifying output markdown files match LinkedIn source data.
+**Acceptance Scenarios**:
+1. **Given** a user runs `python -m linkedin_extractor extract --output-dir ./profile-data`, **When** the tool opens a browser and waits for login, **Then** the user can complete LinkedIn authentication manually
+2. **Given** the user is logged into LinkedIn, **When** the tool navigates profile sections, **Then** it successfully extracts profile data without crashes or incomplete captures
+3. **Given** extraction completes, **When** the tool outputs markdown files to the specified directory, **Then** files are well-formatted and match LinkedIn source content
+4. **Given** output files exist, **When** the user reviews them, **Then** the data is accurate, complete, and useful for RAG ingestion
+---
+### User Story 2 - Review & Validate Extracted Data (Priority: P1)
+User reviews the generated markdown files for accuracy and privacy concerns, ensuring the data is suitable for uploading to their documentation repository.
+**Why this priority**: Human-in-the-loop validation ensures accuracy and prevents unintended data sharing.
+**Independent Test**: Can be fully tested by reviewing output files and verifying they match LinkedIn source and contain no unexpected data.
+**Acceptance Scenarios**:
+1. **Given** markdown files are extracted, **When** the user reviews them, **Then** all sections are present and readable
+2. **Given** the user finds inaccurate or sensitive data, **When** they can easily edit the markdown files, **Then** they can remove/modify entries before upload
+3. **Given** files are validated, **When** the user prepares to upload, **Then** they understand exactly what data will be shared with their AI agent
+---
+### User Story 3 - Upload Reviewed Files to Documentation Repository (Priority: P2)
+User uploads the reviewed and validated markdown files to their documentation repository (e.g., `byoung/me` GitHub repo), making them available for RAG ingestion by the Personified AI Agent.
+**Why this priority**: Completes the workflow; enables RAG ingestion and agent knowledge base updates.
+**Independent Test**: Can be fully tested by uploading files to a test repository and verifying they're accessible for RAG pipeline ingestion.
+**Acceptance Scenarios**:
+1. **Given** validated markdown files exist locally, **When** the user uploads them to their documentation repository, **Then** they're stored in a location where the RAG pipeline can find them
+2. **Given** files are uploaded, **When** the Personified AI Agent's DataManager loads documents, **Then** the LinkedIn profile data is available for retrieval
+---
+### Edge Cases
+- What happens if LinkedIn changes UI/layout while extraction is in progress?
+- How does the tool handle LinkedIn rate limiting or blocking?
+- What if a user has restricted privacy settings preventing certain data extraction?
+- How should the tool handle missing data (e.g., user has no connections, endorsements, or activity)?
+- What happens if the browser session times out during extraction?
+- How are special characters, emojis, or non-ASCII text in profile data handled in markdown output?
+---
+## Requirements *(mandatory)*
+### Functional Requirements
+- **FR-001**: Tool MUST open a Playwright browser window and navigate to LinkedIn.com
+- **FR-002**: Tool MUST require manual user login (human-in-the-loop authentication); tool waits for successful login before proceeding
+- **FR-003**: Tool MUST extract data from LinkedIn profile section: name, headline, location, about/summary, profile photo URL, open to work status
+- **FR-004**: Tool MUST extract data from LinkedIn experience section: job titles, companies, dates, descriptions, current/past employment status
+- **FR-005**: Tool MUST extract data from LinkedIn education section: school names, degrees, fields of study, graduation dates, activities
+- **FR-006**: Tool MUST extract data from LinkedIn skills section: skill names and endorsement counts
+- **FR-007**: Tool MUST extract data from LinkedIn recommendations section: recommender names, titles, companies, recommendation text
+- **FR-008**: Tool MUST extract data from LinkedIn connections section: connection names, titles, companies (publicly visible data only)
+- **FR-009**: Tool MUST extract data from LinkedIn activity/posts section: recent posts, comments, articles (publicly visible content only)
+- **FR-010**: Tool MUST convert extracted data into hierarchical markdown files per section (Profile.md, Experience.md, Education.md, Skills.md, Recommendations.md, Connections.md, Activity.md)
+- **FR-011**: Tool MUST output markdown files to a user-specified directory (via CLI flag `--output-dir`)
+- **FR-012**: Tool MUST handle extraction errors gracefully with user-friendly error messages
+- **FR-013**: Tool MUST validate that extracted data matches source LinkedIn content (structural verification, no data loss)
+- **FR-014**: Tool MUST include metadata in markdown files: extraction timestamp, source URL, data completeness notes
+- **FR-015**: Tool MUST respect LinkedIn Terms of Service: browser-based extraction with manual login, human-in-the-loop consent
+- **FR-016**: Tool MUST allow user review and manual editing of markdown files before upload
+- **FR-017**: Tool MUST include documentation for uploading files to a GitHub repository for RAG ingestion
+### Key Entities
+- **LinkedInProfile**: Represents extracted user profile data (name, headline, location, summary, photo URL, open-to-work status)
+- **LinkedInExperience**: Represents job history entries (company, title, dates, description, employment type)
+- **LinkedInEducation**: Represents education entries (school, degree, field of study, graduation date, activities)
+- **LinkedInSkill**: Represents skill entry (skill name, endorsement count)
+- **LinkedInRecommendation**: Represents recommendation (recommender name/title/company, recommendation text, date)
+- **LinkedInConnection**: Represents connection entry (name, title, company, connection URL)
+- **LinkedInActivity**: Represents activity/post entry (timestamp, content, engagement metrics)
+- **ExtractionSession**: Represents a single extraction run (session ID, timestamp start/end, browser state, error log)
+- **MarkdownOutput**: Represents generated markdown file (section name, file path, content, metadata)
+### Non-Functional Requirements
+#### Performance (SC-005)
+- **SC-P-001**: Profile extraction completes within 5 minutes (typical user with moderate activity/connections)
+- **SC-P-002**: Markdown file generation completes within 10 seconds after extraction
+- **SC-P-003**: Tool memory usage stays below 500MB during extraction
+#### Reliability & Error Handling (SC-007)
+- **SC-R-001**: Tool handles LinkedIn UI changes gracefully (element not found) with informative error messages
+- **SC-R-002**: Tool handles rate limiting from LinkedIn (429 status) with retry logic and user notification
+- **SC-R-003**: Tool handles network timeouts with automatic retry (up to 3 attempts) and clear error reporting
+- **SC-R-004**: Tool handles incomplete data extraction (missing sections) and reports completeness in metadata
+- **SC-R-005**: Browser session timeout is handled with user prompt to re-login
+#### Security & Privacy (SC-002, SC-007)
+- **SC-S-001**: Tool runs locally; LinkedIn credentials are never stored or transmitted to external services
+- **SC-S-002**: Tool respects LinkedIn ToS: browser-based extraction, manual login, user consent required
+- **SC-S-003**: Tool only extracts publicly visible data (respects privacy settings)
+- **SC-S-004**: Markdown output is saved only to user-specified local directory (no automatic cloud upload)
+- **SC-S-005**: Tool includes clear warnings about data sensitivity in generated markdown files
+#### Usability (SC-008)
+- **SC-U-001**: CLI interface is intuitive with clear help text (`--help` flag)
+- **SC-U-002**: Error messages are user-friendly and actionable (not technical stack traces)
+- **SC-U-003**: Output markdown files are human-readable and easy to edit before upload
+- **SC-U-004**: Tool provides clear guidance on next steps (review, edit, upload to repo)
+#### Observability
+- **SC-O-001**: Tool logs extraction progress (sections processed, data counts, timestamps) to console
+- **SC-O-002**: Tool generates extraction report in output directory (extraction_report.json) with metadata and summary
+---
+## Success Criteria *(mandatory)*
+### Measurable Outcomes
+- **SC-001**: Extracted LinkedIn data is accurate and matches source profile (100% sample verification by user review)
+- **SC-002**: All publicly visible LinkedIn data is successfully extracted without requiring manual re-entry (100% completeness per user evaluation)
+- **SC-003**: Generated markdown files are valid, well-formatted, and immediately usable for RAG ingestion (0 markdown syntax errors)
+- **SC-004**: Users can review extracted data and identify/edit sensitive information before upload (human-in-the-loop gate functional)
+- **SC-005**: Profile extraction completes in under 5 minutes for typical user (measured across 3+ user trials)
+- **SC-006**: Tool handles LinkedIn UI changes and rate limiting without crashing (resilient error handling tested)
+- **SC-007**: All tool failures result in user-friendly error messages, not technical stack traces (100% user-friendly errors)
+- **SC-008**: Users report that generated files are immediately useful for their documentation repository (qualitative feedback)
+### Assumptions
+- Users have active LinkedIn accounts with visible profile data
+- Users are comfortable installing Python CLI tool and running commands locally
+- Users have git/GitHub account and can manually upload markdown files to their documentation repository
+- LinkedIn UI is relatively stable (tool may require maintenance if LinkedIn significantly changes UI)
+- Users accept that extraction is browser-based and requires active session (no headless-only extraction for privacy/ToS reasons)
+- Generated markdown files will be reviewed by users before sharing with AI systems
+- Users understand the data extracted is limited to publicly visible LinkedIn content
+---
+## Data Model
+### Markdown Output Structure
+Each extraction session generates the following files in the output directory:
+```
+output_dir/
+├── extraction_report.json          # Metadata: extraction timestamp, session info, data completeness
+├── Profile.md                      # Profile summary, headline, location, about, photo
+├── Experience.md                   # Job history with dates, companies, descriptions
+├── Education.md                    # Schools, degrees, fields of study, graduation dates
+├── Skills.md                       # Skills list with endorsement counts
+├── Recommendations.md              # Recommendations with recommender info and text
+├── Connections.md                  # Connections list (names, titles, companies)
+└── Activity.md                     # Recent posts, comments, articles
+```
+### File Format Example (Profile.md)
+```markdown
+# LinkedIn Profile
+**Extracted**: 2025-10-24 14:30:00 UTC
+**Source**: https://www.linkedin.com/in/byoung/
+**Status**: Complete
+## Summary
+- **Name**: Ben Young
+- **Headline**: AI Agent Architect | Full-Stack Engineer
+- **Location**: San Francisco, CA
+- **Open to Work**: Yes (seeking AI/ML roles)
+## About
+[Profile summary text...]
+## Profile Photo
+[URL to profile photo if publicly available]
+```
+---
+## Technical Constraints & Architecture
+### Technology Stack
+- **Language**: Python 3.12+ (matching AI-Me project standards via `uv`)
+- **Browser Automation**: Playwright (cross-platform, supports multiple browsers, respects ToS)
+- **Package Manager**: `uv` (matches AI-Me project standards)
+- **Output Format**: Markdown files + JSON metadata
+- **Delivery**: Standalone CLI tool (separate from main Gradio app)
+- **Execution Environment**: User's local machine (not cloud-deployed)
+### Implementation Notes
+1. **Browser-Based Extraction**: Uses Playwright to automate browser navigation, respecting LinkedIn ToS by requiring manual login
+2. **No API Integration**: Avoids LinkedIn API authentication complexity and approval requirements
+3. **Human-in-the-Loop**: User must manually authenticate and consent to extraction before proceeding
+4. **Local Execution**: All extraction happens on user's machine; no credentials or data transmitted externally
+5. **Manual Upload**: Users manually upload reviewed files to their GitHub repo (no automated Git push)
+6. **RAG Integration**: Output markdown follows existing document structure for seamless RAG ingestion by Personified AI Agent
+### Out of Scope
+- Automated scheduled extraction (GitHub Actions, webhooks, cron jobs) — future enhancement
+- Cloud-based execution or deployment
+- Integration with main Gradio app (separate standalone tool)
+- LinkedIn API integration (browser-based extraction only)
+- Encrypted credential storage (user responsible for LinkedIn security)
+- Multi-user or SaaS deployment
+---
+## Integration with Personified AI Agent (Spec 001)
+### Workflow
+1. **Extract**: User runs LinkedIn extractor CLI → generates markdown files
+2. **Review**: User reviews files locally, edits for privacy/accuracy
+3. **Upload**: User uploads files to their documentation repository (e.g., `byoung/me`)
+4. **Ingest**: Personified AI Agent's `DataManager` loads files via GitHub (if `GITHUB_REPOS` includes the repo)
+5. **Use**: Agent has access to LinkedIn profile data for better context and responses
+### Documentation Structure Compatibility
+LinkedIn extractor output (Profile.md, Experience.md, etc.) follows the same markdown document structure expected by the Personified AI Agent's RAG pipeline. No additional transformation needed.
+---
+## Deployment & Usage
+### Installation
+```bash
+# Clone repo (or install from package)
+git clone https://github.com/byoung/ai-me.git
+cd ai-me
+# Install dependencies
+uv sync
+# Run extractor
+python -m linkedin_extractor extract --output-dir ./linkedin-profile
+```
+### CLI Interface
+```bash
+python -m linkedin_extractor extract --output-dir PATH [OPTIONS]
+Options:
+  --output-dir PATH              Directory to save markdown files (required)
+  --headless                     Run browser in headless mode (not recommended; requires session)
+  --wait-time SECONDS            Wait time for page loads (default: 10)
+  --extract-connections          Include full connections list (slower; may hit rate limits)
+  --extract-activity             Include recent activity/posts (slower; requires scrolling)
+  --help                         Show help text
+```
+### Usage Example
+```bash
+# Basic extraction
+python -m linkedin_extractor extract --output-dir ~/linkedin-profile
+# Full extraction with connections and activity
+python -m linkedin_extractor extract --output-dir ~/linkedin-profile --extract-connections --extract-activity
+```
+### Post-Extraction Workflow
+1. **Review Files**: User opens markdown files in editor, verifies accuracy
+2. **Edit**: User removes/modifies sensitive information as needed
+3. **Upload to Repo**: User commits and pushes files to their documentation repository
+4. **Configure AI-Me**: Add repo to `GITHUB_REPOS` environment variable if not already included
+5. **Verify**: Next conversation with AI-Me agent will use LinkedIn profile data in responses
+---
+## Testing Strategy
+### Unit Tests
+- Markdown generation (correct format, no syntax errors)
+- Data extraction parsing (LinkedIn HTML → structured data)
+- File I/O operations (output directory creation, file writing)
+- Error message formatting (user-friendly, no stack traces)
+### Integration Tests
+- End-to-end extraction session (login → extract → file output)
+- Handle LinkedIn rate limiting
+- Handle LinkedIn UI changes (missing elements)
+- Browser timeout recovery
+### Manual Testing
+- User trial with real LinkedIn account (verify data accuracy)
+- Review generated markdown files for completeness
+- Upload to documentation repo and verify RAG ingestion
+---
+## Success Metrics (How We Know We're Done)
+| Metric | Target | How We Measure |
+|--------|--------|----------------|
+| **Data Accuracy** | 100% of extracted data matches LinkedIn source | User review of generated files vs. LinkedIn profile |
+| **Completeness** | 90%+ of available LinkedIn data extracted | Count of extracted data points vs. completeness report |
+| **Markdown Quality** | 0 syntax errors in output | Markdown validation tool |
+| **Extraction Time** | <5 minutes for typical user | Timer from login to file output |
+| **Error Handling** | 100% user-friendly error messages | No stack traces in output |
+| **Privacy Compliance** | Only publicly visible data extracted | User audit of generated files |
+| **RAG Integration** | Files immediately usable for RAG ingestion | Upload to repo and verify agent knowledge access |
+| **Ease of Use** | Users can extract data without technical support | Qualitative feedback / support ticket volume |
+---
+## Future Enhancements (Phase B - Not in MVP)
+- Scheduled extraction (GitHub Actions trigger)
+- Multi-profile extraction (extract multiple users' data)
+- Incremental updates (extract only changed sections)
+- LinkedIn API integration (once ToS allows)
+- Cloud deployment (Hugging Face Spaces as web UI)
+- Automated Git push with review/approval workflow
+- Encrypted credential storage for batch jobs
+- Data diff/versioning (track profile changes over time)
+---
+**Spec Status**: ✅ Ready for Phase 0-1 Design
+**Next Steps**:
+1. Create detailed data model and markdown schema
+2. Create implementation plan with Playwright-specific architecture
+3. Generate task breakdown for development