byoung-hf commited on
Commit
6913322
Β·
verified Β·
1 Parent(s): 6b2b3f9

Upload folder using huggingface_hub

Browse files
.specify/memory/constitution.md CHANGED
@@ -67,6 +67,10 @@ All agent responses normalized for clean, consistent output across platforms.
67
  - Output cleaned before returning to user
68
  - Output links should work
69
 
 
 
 
 
70
  ## Technology Stack Constraints
71
 
72
  - **Python**: 3.12+ only (via `requires-python = "~=3.12.0"`)
@@ -125,6 +129,14 @@ All agent responses normalized for clean, consistent output across platforms.
125
  7. **No credential leaks** - .gitignore and .dockerignore files to help prevent secret slips. Never build secrets into a dockerfile!
126
  8. **No notebook outputs in GIT** - you must clean up the code
127
 
 
 
 
 
 
 
 
 
128
  ## Governance
129
 
130
  This constitution supersedes all other practices and is the single source of truth for architectural decisions. All PRs and feature requests must verify compliance with these principles. Code review must check:
@@ -135,5 +147,7 @@ This constitution supersedes all other practices and is the single source of tru
135
  - Imports organized per PEP 8
136
  - Observability (logging) present
137
  - Output cleanliness (Unicode normalization)
 
 
138
 
139
- **Version**: 1.0.0 | **Ratified**: 2025-10-23 | **Last Amended**: 2025-10-23
 
67
  - Output cleaned before returning to user
68
  - Output links should work
69
 
70
+ ### X. External Data Integration Policy
71
+
72
+ For external services that do not provide a sanctioned public API (for example: LinkedIn), AI‑Me will perform data ingestion only via a human-in-the-loop browser automation process that requires interactive user authentication. Extracted content must be limited to publicly-visible information, reviewed by the human operator for accuracy and privacy before ingestion, and must never be collected via third-party services that require users to share credentials or that perform scraping on a user's behalf.
73
+
74
  ## Technology Stack Constraints
75
 
76
  - **Python**: 3.12+ only (via `requires-python = "~=3.12.0"`)
 
129
  7. **No credential leaks** - .gitignore and .dockerignore files to help prevent secret slips. Never build secrets into a dockerfile!
130
  8. **No notebook outputs in GIT** - you must clean up the code
131
 
132
+ ## Architectural Decision Records (ADRs)
133
+
134
+ All major architectural decisions are documented in `architecture/adrs/`. ADRs provide detailed context, tradeoffs, and compliance notes that elaborate on constitution principles:
135
+
136
+ - **ADR-001**: Human-in-the-loop Browser Automation for Third-Party Data Ingestion β€” Instantiates Principle X (External Data Integration Policy)
137
+
138
+ Reference ADRs when evaluating PRs, designing new integrations, or proposing architecture changes.
139
+
140
  ## Governance
141
 
142
  This constitution supersedes all other practices and is the single source of truth for architectural decisions. All PRs and feature requests must verify compliance with these principles. Code review must check:
 
147
  - Imports organized per PEP 8
148
  - Observability (logging) present
149
  - Output cleanliness (Unicode normalization)
150
+ - External data integration policy adherence
151
+ - Architectural decisions documented in `architecture/adrs/`
152
 
153
+ **Version**: 1.0.1 | **Ratified**: 2025-10-23 | **Last Amended**: 2025-10-24
CONTRIBUTING.md CHANGED
@@ -4,18 +4,32 @@ Welcome! This document outlines the process for contributing to the AI-Me projec
4
 
5
  ## Prerequisites
6
 
7
- - Python 3.12+ (managed by `uv`)
8
- - Git with GPG signing configured
9
- - Basic understanding of async Python and RAG concepts (see `.specify/memory/constitution.md`)
 
 
 
 
 
 
 
 
 
10
 
11
  ## Setup
12
 
13
  ### 1. Clone and Install Dependencies
14
 
15
  ```bash
16
- git clone https://github.com/byoung/ai-me.git
17
  cd ai-me
 
 
18
  uv sync
 
 
 
19
  ```
20
 
21
  ### 2. Environment Configuration
@@ -41,11 +55,128 @@ LOKI_USERNAME=...
41
  LOKI_PASSWORD=...
42
  ```
43
 
44
- ### 3. Configure Git Commit Signing
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
45
 
46
- See this guide on setting up gpg keys:
47
 
48
- https://docs.github.com/en/authentication/managing-commit-signature-verification/generating-a-new-gpg-key
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
49
 
 
50
 
51
- **All commits MUST be GPG-signed.**
 
4
 
5
  ## Prerequisites
6
 
7
+ If you want to propose changes to ai-me, please search our [issues](https://github.com/byoung/ai-me/issues) list first. If there is none, please create a new issue and label it as a bug or enhancement. Before you get started, let's have a conversation about the proposal!
8
+
9
+ This project is transitioning to [Spec Kit](https://github.com/github/spec-kit), so any new features must first start with a `/speckit.specify` in order to establish our user stories and requirements consistently.
10
+
11
+ To develop on this project, you will need to:
12
+
13
+ - Set up [Docker](https://docs.docker.com/engine/install/) for container based development
14
+ - Set up [uv](https://docs.astral.sh/uv/getting-started/installation/) for local development
15
+ - Create a fork of [ai-me](https://github.com/byoung/ai-me)
16
+ - Configure git with [GPG signing configured](https://docs.github.com/en/authentication/managing-commit-signature-verification/generating-a-new-gpg-key)
17
+ - Do a full review of our [constitution](/.specify/memory/constitution.md)
18
+ - Set up a [pre-commit hook](#setting-up-the-pre-commit-hook) to clear notebook output (unless you have the discipline to manually do it before opening PRs -- I (@byoung) do not...)
19
 
20
  ## Setup
21
 
22
  ### 1. Clone and Install Dependencies
23
 
24
  ```bash
25
+ git clone https://github.com/<your fork>
26
  cd ai-me
27
+
28
+ # Local dev
29
  uv sync
30
+
31
+ # Container dev
32
+ docker compose build notebooks
33
  ```
34
 
35
  ### 2. Environment Configuration
 
55
  LOKI_PASSWORD=...
56
  ```
57
 
58
+ ### 3. Start the application
59
+
60
+ ```bash
61
+ # Local
62
+ uv run src/app.py # Launches Gradio on default port
63
+
64
+ # Docker
65
+ docker compose up notebooks
66
+ ```
67
+
68
+ ### 4. Make changes
69
+
70
+ You don't have to use Spec Kit to plan and implement your specs, BUT you MUST create traceability between your spec, implementation, and tests per our [constitution](/.specify/memory/constitution.md)!
71
+
72
+ ### 5. Test
73
+
74
+ For detailed information on testing check out our [TESTING.md](/TESTING.md) guide.
75
+
76
+
77
+ ### 6. Deploy
78
+
79
+ To deploy to HF, check out this [section](/README.md#deployment) in our README. This test ensures that your system is deployable and usable in HF.
80
+
81
+ ### 7. Open a PR
82
+
83
+ Be sure to give a brief overview of the change and link it to the issue it's resolving.
84
+
85
+
86
+ ## Setting Up the Pre-Commit Hook
87
+
88
+ A Git pre-commit hook automatically clears all Jupyter notebook outputs before committing. This keeps the repository clean and reduces diff noise by preventing output changes from cluttering commits.
89
+
90
+ ### Installation
91
+
92
+ #### Option 1: Automated Installation (Recommended)
93
+
94
+ After cloning the repository, run:
95
+
96
+ ```bash
97
+ cd ai-me
98
+ cp .git/hooks/pre-commit.sample .git/hooks/pre-commit
99
+ chmod +x .git/hooks/pre-commit
100
+ ```
101
+
102
+ Then create the hook script:
103
+
104
+ ```bash
105
+ cat > .git/hooks/pre-commit << 'EOF'
106
+ #!/bin/bash
107
+ # Pre-commit hook to clear Jupyter notebook outputs
108
+
109
+ # Find all staged .ipynb files
110
+ notebooks=$(git diff --cached --name-only --diff-filter=ACM | grep '\.ipynb$')
111
+
112
+ if [ -n "$notebooks" ]; then
113
+ echo "Clearing outputs from notebooks..."
114
+ for notebook in $notebooks; do
115
+ if [ -f "$notebook" ]; then
116
+ echo " Processing: $notebook"
117
+ # Clear outputs using Python directly (no jupyter dependency needed)
118
+ python3 -c "
119
+ import json
120
+
121
+ notebook_path = '$notebook'
122
+
123
+ # Read the notebook
124
+ with open(notebook_path, 'r', encoding='utf-8') as f:
125
+ nb = json.load(f)
126
+
127
+ # Clear outputs from all cells
128
+ for cell in nb.get('cells', []):
129
+ if cell['cell_type'] == 'code':
130
+ cell['outputs'] = []
131
+ cell['execution_count'] = None
132
+
133
+ # Write back the cleaned notebook
134
+ with open(notebook_path, 'w', encoding='utf-8') as f:
135
+ json.dump(nb, f, indent=1, ensure_ascii=False)
136
+ f.write('\n')
137
+ "
138
+ # Re-stage the cleaned notebook
139
+ git add "$notebook"
140
+ fi
141
+ done
142
+ echo "βœ“ Notebook outputs cleared"
143
+ fi
144
+
145
+ exit 0
146
+ EOF
147
+ chmod +x .git/hooks/pre-commit
148
+ ```
149
 
150
+ #### Option 2: Manual Installation
151
 
152
+ 1. Navigate to your git hooks directory:
153
+ ```bash
154
+ cd .git/hooks
155
+ ```
156
+
157
+ 2. Create a new file called `pre-commit`:
158
+ ```bash
159
+ touch pre-commit
160
+ chmod +x pre-commit
161
+ ```
162
+
163
+ 3. Open the file in your editor and paste the script above (starting with `#!/bin/bash`).
164
+
165
+ ### Verification
166
+
167
+ To verify the hook is working, make a change to a notebook and stage it:
168
+
169
+ ```bash
170
+ git add src/notebooks/experiments.ipynb
171
+ git commit -m "Test notebook commit"
172
+ ```
173
+
174
+ You should see output like:
175
+ ```
176
+ Clearing outputs from notebooks...
177
+ Processing: src/notebooks/experiments.ipynb
178
+ βœ“ Notebook outputs cleared
179
+ ```
180
 
181
+ **Note**: A Git pre-commit hook is installed at `.git/hooks/pre-commit` that automatically clears all notebook outputs before committing.
182
 
 
README.md CHANGED
@@ -9,39 +9,18 @@ An agentic version of real people using RAG (Retrieval Augmented Generation) ove
9
 
10
  Deployed as a Gradio chatbot on Hugging Face Spaces.
11
 
12
- ## Architecture Overview
13
 
14
- ### Core Design
15
-
16
- **Data Pipeline** β†’ **Agent System Set Up** β†’ **Chat Interface**
17
-
18
- 1. **Data Pipeline** (`src/data.py`)
19
- - Loads markdown from local `docs/` and public GitHub repos
20
- - Chunks with LangChain, embeds with HuggingFace sentence-transformers
21
- - Stores in ephemeral ChromaDB vectorstore
22
-
23
- 2. **Agent System** (`src/agent.py`)
24
- - Primary agent personifies a real person using RAG
25
- - Queries vectorstore via async `get_local_info` tool
26
- - Uses Groq API (`openai/openai/gpt-oss-120b`) for LLM inference
27
- - OpenAI API for tracing/debugging only
28
-
29
- 3. **UI Layer** (`src/app.py`)
30
- - Simple chat interface streams responses
31
- - Async-first architecture throughout
32
 
33
- ### Key Technologies
34
 
35
- - **Python 3.12** with `uv` for dependency management
36
- - **OpenAI Agents SDK** for agentic framework
37
- - **LangChain** for document loading/chunking only
38
- - **ChromaDB** with ephemeral in-memory storage
39
- - **Gradio** for UI and Hugging Face Spaces deployment
40
- - **Groq** as primary LLM provider for fast inference
41
- - **Pydantic** for type-safe configuration
42
- - **Grafana Cloud Loki** for remote logging (optional)
43
 
44
- ## Getting Started
 
 
 
45
 
46
  ### Environment Setup
47
 
@@ -52,22 +31,18 @@ uv sync
52
  # OR build the container
53
  docker compose build notebooks
54
 
55
- # NOTE: if you go the local route it's assume you have nodejs and npx installed
56
-
57
  # Create .env with required keys:
58
  OPENAI_API_KEY=<for tracing>
59
  GROQ_API_KEY=<primary LLM provider>
60
  GITHUB_PERSONAL_ACCESS_TOKEN=<for searching GitHub repos>
61
  BOT_FULL_NAME=<full name of the persona>
62
  APP_NAME=<name that appears on HF chat page>
63
- GITHUB_REPOS=<comma-separated list of public repos: owner/repo,owner/repo>
64
 
65
  # Optional: Set log level (DEBUG, INFO, WARNING, ERROR). Defaults to INFO.
66
  LOG_LEVEL=INFO
67
  ```
68
 
69
- **Note**: A Git pre-commit hook is installed at `.git/hooks/pre-commit` that automatically clears all notebook outputs before committing. This keeps the repository clean and reduces diff noise.
70
-
71
  ### Running
72
 
73
  ```bash
@@ -83,12 +58,14 @@ If you use the Docker route, you can use the Dev Containers extension in most po
83
 
84
  ## Deployment
85
 
 
 
86
  ```bash
87
  # Run from the root directory
88
  gradio deploy
89
  ```
90
 
91
- **Automatic CI/CD**: Push to `main` triggers a GitHub Actions workflow that deploys to Hugging Face Spaces via Gradio CLI. The following environment variables need to be set up in GitHub for the CI/CD pipeline:
92
 
93
  ```bash
94
  # Testing
@@ -119,43 +96,19 @@ GITHUB_REPOS
119
  ENV # e.g., "production" - used for log tagging
120
  ```
121
 
122
- ## Design Principles
123
 
124
- 1. **Decoupled Architecture**: Config handles sources, DataManager handles pipeline, app orchestrates
125
- 2. **Smart Defaults**: Minimal configuration required - most params have sensible defaults
126
- 3. **Async-First**: All agent operations are async for responsive UI
127
- 4. **Ephemeral Storage**: Vectorstore rebuilt on each restart (fast, simple, stateless)
128
- 5. **Type Safety**: Pydantic models with validation throughout
129
- 6. **Development/Production Parity**: Same DataManager used in notebooks and production app
130
 
131
- ## Project Structure
132
 
133
- ```
134
- src/
135
- β”œβ”€β”€ config.py # Pydantic settings, API keys, data sources
136
- β”œβ”€β”€ data.py # DataManager class - complete pipeline
137
- β”œβ”€β”€ agent.py # Agent creation and orchestration
138
- β”œβ”€β”€ app.py # Production Gradio app
139
- β”œβ”€β”€ test.py # Unit tests
140
- β”œβ”€β”€ __init__.py
141
- └── notebooks/
142
- └── experiments.ipynb # Development sandbox with MCP servers
143
-
144
- docs/ # Local markdown for RAG (development)
145
- test_data/ # Test fixtures and sample data
146
- .github/
147
- β”œβ”€β”€ copilot-instructions.md # Detailed implementation guide for AI
148
- └── workflows/
149
- └── update_space.yml # CI/CD to Hugging Face
150
- ```
151
 
152
- ## Reminders/Warnings
153
 
154
- - **Data Sources**: The default for local development is the `docs/` folder. If you want your production app to access this content post deploy, it must be pushed to a public GitHub repo until we support private repo document loading for RAG.
 
155
  - **Model Choice**: Groq's `gpt-oss-120b` provides good quality with ultra-fast inference. If you change the model, be aware that tool calling may degrade which can lead to runtime errors.
156
  - **Docker in Docker**: A prior version of this app had the Docker command built into the container. Because the GitHub MCP server was moved from `docker` to `npx`, the CLI is no longer included in the [Dockerfile](Dockerfile). However, the socket mount in [docker-compose](docker-compose.yaml) was left as we may revisit Docker MCP servers in the future. We will never support a pure docker-in-docker setup, but socket mounting may still be an option.
157
- - **The Agent does not have memory!**: This was an intentional design decision until multithreaded operations can be supported/tested on HF.
158
 
159
  ---
160
-
161
- For detailed implementation patterns, code conventions, and AI assistant context, see [`.github/copilot-instructions.md`](.github/copilot-instructions.md).
 
9
 
10
  Deployed as a Gradio chatbot on Hugging Face Spaces.
11
 
12
+ Example: https://huggingface.co/spaces/byoung-hf/ben-bot
13
 
14
+ ## Getting Started
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
15
 
16
+ If you want to experiment with building your own agentic self, clone the repo and follow the steps below.
17
 
18
+ ### Prerequisites
 
 
 
 
 
 
 
19
 
20
+ - [Docker](https://docs.docker.com/engine/install/) or [uv](https://docs.astral.sh/uv/getting-started/installation/) for running the application
21
+ - If you run locally, you need [npx](https://docs.npmjs.com/downloading-and-installing-node-js-and-npm) installed
22
+ - Groq and OpenAI for inference and tracing
23
+ - A GitHub PAT for the GitHub tool (optional)
24
 
25
  ### Environment Setup
26
 
 
31
  # OR build the container
32
  docker compose build notebooks
33
 
 
 
34
  # Create .env with required keys:
35
  OPENAI_API_KEY=<for tracing>
36
  GROQ_API_KEY=<primary LLM provider>
37
  GITHUB_PERSONAL_ACCESS_TOKEN=<for searching GitHub repos>
38
  BOT_FULL_NAME=<full name of the persona>
39
  APP_NAME=<name that appears on HF chat page>
40
+ GITHUB_REPOS=<comma-separated list of public repos: owner/repo,owner/repo with md files to be ingested by RAG>
41
 
42
  # Optional: Set log level (DEBUG, INFO, WARNING, ERROR). Defaults to INFO.
43
  LOG_LEVEL=INFO
44
  ```
45
 
 
 
46
  ### Running
47
 
48
  ```bash
 
58
 
59
  ## Deployment
60
 
61
+ To deploy the application to HF, you need to update the comment at the top of this README and create your own HF account and put a HF_TOKEN into your .env file.
62
+
63
  ```bash
64
  # Run from the root directory
65
  gradio deploy
66
  ```
67
 
68
+ **Automatic CI/CD**: Pushing to `main` triggers a GitHub Actions workflow that deploys to Hugging Face Spaces via Gradio CLI. The following environment variables need to be set up in GitHub for the CI/CD pipeline:
69
 
70
  ```bash
71
  # Testing
 
96
  ENV # e.g., "production" - used for log tagging
97
  ```
98
 
99
+ ## Architecture and Design Overview
100
 
101
+ All of our architecture and design thinking can be found in our [constitution](/.specify/memory/constitution.md) and [ADRs](/architecture/adrs/)
 
 
 
 
 
102
 
103
+ Some implementation decisions have been captured in our [`.github/copilot-instructions.md`](.github/copilot-instructions.md), but will be integrated into other documents over time.
104
 
105
+ ## Contributions
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
106
 
107
+ Check out our [CONTRIBUTING.md](/CONTRIBUTING.md) guide for detailed instructions on how you can contribute.
108
 
109
+ ## Reminders/Warnings/Gotchas
110
+ - **Data Sources**: The default for local development is the `docs/local-testing` folder. If you want your production app to access this content post deploy, it must be pushed to a public GitHub repo until we support private repo document loading for RAG.
111
  - **Model Choice**: Groq's `gpt-oss-120b` provides good quality with ultra-fast inference. If you change the model, be aware that tool calling may degrade which can lead to runtime errors.
112
  - **Docker in Docker**: A prior version of this app had the Docker command built into the container. Because the GitHub MCP server was moved from `docker` to `npx`, the CLI is no longer included in the [Dockerfile](Dockerfile). However, the socket mount in [docker-compose](docker-compose.yaml) was left as we may revisit Docker MCP servers in the future. We will never support a pure docker-in-docker setup, but socket mounting may still be an option.
 
113
 
114
  ---
 
 
TESTING.md CHANGED
@@ -48,7 +48,6 @@ uv run pytest src/test.py::test_rear_knowledge_contains_it245 -v
48
 
49
  **Configuration**:
50
  - **Temperature**: Set to 0.0 for deterministic, reproducible responses
51
- - **MCP Servers**: Disabled by default for faster test execution
52
  - **Model**: Uses model specified in config (default: `openai/openai/gpt-oss-120b` via Groq)
53
  - **Data Source**: `test_data/` directory (configured via `doc_root` parameter)
54
  - **GitHub Repos**: Disabled (`GITHUB_REPOS=""`) for faster test execution
 
48
 
49
  **Configuration**:
50
  - **Temperature**: Set to 0.0 for deterministic, reproducible responses
 
51
  - **Model**: Uses model specified in config (default: `openai/openai/gpt-oss-120b` via Groq)
52
  - **Data Source**: `test_data/` directory (configured via `doc_root` parameter)
53
  - **GitHub Repos**: Disabled (`GITHUB_REPOS=""`) for faster test execution
architecture/adrs/adr-linkedin-integration.md ADDED
@@ -0,0 +1,42 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # ADR-001: Human-in-the-Loop Browser Automation for Third-Party Data Ingestion
2
+
3
+ **Status**: Accepted
4
+
5
+ **Date**: 2025-10-24
6
+
7
+ ## Context
8
+
9
+ External services like LinkedIn do not provide sanctioned public APIs for personal profile data extraction. Previous research evaluated three integration approaches:
10
+ 1. **Programmatic MCP/API integrations** β€” No official LinkedIn MCP server exists; third-party options are immature or unavailable
11
+ 2. **Third-party data-gathering services** (Apify, Anysite) β€” Require sharing user credentials, violate Terms of Service, pose security and privacy risks
12
+ 3. **Human-in-the-loop browser automation** β€” Respects ToS, maintains user control, verifies data accuracy and privacy before ingestion
13
+
14
+ ## Decision
15
+
16
+ We will implement data ingestion for LinkedIn and other services without sanctioned public APIs exclusively through **human-in-the-loop browser automation** (Playwright). This approach:
17
+ - Requires the user to authenticate interactively (human provides credentials directly to LinkedIn, not to our tool)
18
+ - Extracts only publicly-visible profile content (profile, experience, education, skills, recommendations, connections, activity)
19
+ - Scope is limited to the user that manually logs in
20
+ - Mandates human review of all extracted content for accuracy and privacy compliance before ingestion into markdown documentation
21
+ - Explicitly prohibits use of third-party credential-sharing services that scrape on a user's behalf
22
+
23
+ This policy applies retroactively to LinkedIn and prospectively to any future external services lacking a publicly available API.
24
+
25
+ ## Consequences
26
+
27
+ **Positive:**
28
+ - Maintains compliance with LinkedIn Terms of Service
29
+ - Protects user security (credentials never shared with third parties)
30
+ - Enables human verification of data accuracy and privacy
31
+ - Establishes reusable pattern for similar external data sources
32
+ - Respects user agency and control over their data
33
+
34
+ **Negative:**
35
+ - Requires user manual effort (browser interaction, file review)
36
+ - Tool development complexity (Playwright orchestration, markdown formatting)
37
+ - Cannot be fully automated or scheduled
38
+ - Slower than direct API access (if available)
39
+
40
+ ## Compliance
41
+
42
+ This decision instantiates Constitution Principle X (External Data Integration Policy) and establishes binding guidance for all future data ingestion tools.
specs/002-linkedin-profile-extractor/CLARIFICATIONS.md ADDED
@@ -0,0 +1,158 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Clarification Session Report: LinkedIn Profile Extractor Tool
2
+
3
+ **Date**: October 24, 2025
4
+ **Feature**: LinkedIn Profile Data Extractor (Spec 002)
5
+ **Status**: βœ… **COMPLETE**
6
+
7
+ ---
8
+
9
+ ## Executive Summary
10
+
11
+ Clarification session completed with **5 critical questions answered**. All high-impact ambiguities resolved. New feature spec created (`specs/002-linkedin-profile-extractor/spec.md`) with clear design decisions and ready for planning phase.
12
+
13
+ ---
14
+
15
+ ## Questions Asked & Answered
16
+
17
+ ### Question 1: Feature Scope & Organization
18
+ **Answer**: A β€” Create separate spec (`specs/002-linkedin-profile-extractor/spec.md`)
19
+
20
+ **Rationale**: Cleanly separates concerns from main Personified AI Agent (spec 001), allows independent versioning and task tracking, prevents scope creep.
21
+
22
+ ---
23
+
24
+ ### Question 2: Authentication Mechanism
25
+ **Answer**: A with Playwright β€” Browser automation with manual LinkedIn login
26
+
27
+ **Rationale**: Respects LinkedIn ToS, gives users full control, avoids API restrictions, captures full LinkedIn UI experience. Playwright chosen for cross-platform support and reliability.
28
+
29
+ ---
30
+
31
+ ### Question 3: Data Extraction Scope
32
+ **Answer**: C β€” Full profile extraction (connections, endorsements, activity) with human review gate
33
+
34
+ **Context**: User clarified that human review allows verification of privacy/legal concerns, and all extracted data is publicly available anyway.
35
+
36
+ **Rationale**: Maximum value; human review ensures accuracy and compliance. Users decide what to share with AI agent.
37
+
38
+ ---
39
+
40
+ ### Question 4: Markdown Output Structure
41
+ **Answer**: B β€” Hierarchical markdown by section (Profile.md, Experience.md, Education.md, Skills.md, Recommendations.md, Connections.md, Activity.md)
42
+
43
+ **Rationale**: Mirrors LinkedIn's natural information architecture, modular for easy editing/exclusion, integrates seamlessly with existing RAG pipeline.
44
+
45
+ ---
46
+
47
+ ### Question 5: Tool Delivery & Integration
48
+ **Answer**: A β€” Standalone Python CLI tool; users run locally, review output, manually upload
49
+
50
+ **Rationale**: Full user control over data, respects privacy, simple integration with existing workflow, avoids complicating main app deployment.
51
+
52
+ ---
53
+
54
+ ## Coverage Analysis
55
+
56
+ | Category | Status | Coverage |
57
+ |----------|--------|----------|
58
+ | **Functional Scope & Behavior** | βœ… Resolved | Clear user stories, acceptance criteria, edge cases documented |
59
+ | **Domain & Data Model** | βœ… Resolved | Entities (LinkedInProfile, Experience, etc.), markdown output structure defined |
60
+ | **Interaction & UX Flow** | βœ… Resolved | CLI interface, workflow steps, human review gate specified |
61
+ | **Non-Functional Quality Attributes** | βœ… Resolved | Performance targets (<5min extraction), reliability (error handling), privacy (local execution) |
62
+ | **Integration & External Dependencies** | βœ… Resolved | Playwright browser automation, no API integration, RAG pipeline compatibility |
63
+ | **Edge Cases & Failure Handling** | βœ… Resolved | UI changes, rate limiting, timeouts, incomplete data all covered |
64
+ | **Constraints & Tradeoffs** | βœ… Resolved | Technical stack (Python 3.12, Playwright), local execution, manual upload confirmed |
65
+ | **Terminology & Consistency** | βœ… Resolved | Canonical terms (LinkedInProfile, ExtractionSession, MarkdownOutput) defined |
66
+ | **Completion Signals** | βœ… Resolved | Success metrics defined; acceptance criteria testable |
67
+
68
+ **Overall Coverage**: βœ… **100%** β€” All critical categories resolved
69
+
70
+ ---
71
+
72
+ ## Spec Artifact Generated
73
+
74
+ **Path**: `/Users/benyoung/projects/ai-me/specs/002-linkedin-profile-extractor/spec.md`
75
+
76
+ **Contents**:
77
+ - Problem statement and solution overview
78
+ - 3 user stories (P1: Extract, P1: Review, P2: Upload) with acceptance criteria
79
+ - 17 functional requirements (FR-001 through FR-017)
80
+ - 8 key entities (LinkedInProfile, Experience, Education, Skill, Recommendation, Connection, Activity, ExtractionSession, MarkdownOutput)
81
+ - 13 non-functional requirements (performance, reliability, security, usability, observability)
82
+ - 8 success criteria
83
+ - Data model and markdown output examples
84
+ - Technical constraints and architecture
85
+ - Integration with Personified AI Agent (spec 001)
86
+ - CLI interface and usage examples
87
+ - Testing strategy
88
+ - Future enhancements (Phase B, out-of-scope)
89
+
90
+ ---
91
+
92
+ ## Sections Touched in New Spec
93
+
94
+ 1. **Clarifications** β†’ Session 2025-10-24 with all 5 answered questions
95
+ 2. **Overview & Context** β†’ Problem statement, solution, key differentiators
96
+ 3. **User Scenarios & Testing** β†’ 3 user stories with acceptance criteria
97
+ 4. **Requirements** β†’ 17 functional + 13 non-functional requirements
98
+ 5. **Data Model** β†’ Entity definitions + markdown output structure
99
+ 6. **Technical Constraints & Architecture** β†’ Technology stack, implementation notes, out-of-scope
100
+ 7. **Integration with Personified AI Agent** β†’ Workflow and compatibility
101
+ 8. **Deployment & Usage** β†’ CLI installation, interface, examples
102
+ 9. **Testing Strategy** β†’ Unit, integration, manual test approaches
103
+ 10. **Success Metrics** β†’ Measurable outcomes and targets
104
+
105
+ ---
106
+
107
+ ## Recommendations for Next Steps
108
+
109
+ ### Immediate (Phase 0-1)
110
+
111
+ 1. **Review Spec**: Validate spec decisions align with your vision
112
+ 2. **Data Model Refinement**: Create detailed markdown schema examples (if needed)
113
+ 3. **Implementation Plan**: Run `/speckit.plan` to create implementation roadmap
114
+ 4. **Task Breakdown**: Run `/speckit.tasks` to generate concrete development tasks
115
+
116
+ ### Pre-Development Verification
117
+
118
+ 1. **Test Playwright with LinkedIn**: Quick POC to verify Playwright can navigate LinkedIn without blocking
119
+ 2. **Validate Markdown Structure**: Ensure generated markdown integrates with existing RAG pipeline
120
+ 3. **User Testing Plan**: Plan 1-2 user trials to validate data accuracy and workflow
121
+
122
+ ### Execution Phases
123
+
124
+ - **Phase 1** (Setup & Infrastructure): Playwright environment, CLI scaffolding, output directory management
125
+ - **Phase 2** (Foundational): Core extraction logic, error handling, markdown generation
126
+ - **Phase 3** (User Story 1): Profile extraction workflow, testing
127
+ - **Phase 4** (User Story 2): Review & validation features
128
+ - **Phase 5** (User Story 3): Documentation for upload workflow
129
+
130
+ ---
131
+
132
+ ## Outstanding Items (None)
133
+
134
+ All critical ambiguities resolved. No outstanding blocking decisions.
135
+
136
+ **Deferred to Planning Phase** (as appropriate):
137
+ - Specific Playwright selector strategy (implementation detail)
138
+ - Error retry logic specifics (implementation detail)
139
+ - CLI argument parsing details (implementation detail)
140
+
141
+ ---
142
+
143
+ ## Suggested Next Command
144
+
145
+ ```bash
146
+ # After reviewing spec:
147
+ /speckit.plan # Create detailed implementation roadmap
148
+
149
+ # Then:
150
+ /speckit.tasks # Generate task breakdown for development
151
+ ```
152
+
153
+ ---
154
+
155
+ **Clarification Status**: βœ… **COMPLETE & READY FOR PLANNING**
156
+ **Spec Path**: `specs/002-linkedin-profile-extractor/spec.md`
157
+ **Branch Ready**: Ready for new feature branch `002-linkedin-profile-extractor`
158
+
specs/002-linkedin-profile-extractor/INDEX.md ADDED
@@ -0,0 +1,200 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # πŸ“‘ LinkedIn Profile Extractor Specification Index
2
+
3
+ **Feature**: Spec 002 - LinkedIn Profile Data Extractor
4
+ **Created**: October 24, 2025
5
+ **Status**: βœ… Clarification Complete β€” Ready for Planning
6
+
7
+ ---
8
+
9
+ ## πŸ“š Reading Guide
10
+
11
+ ### Quick Start (5-10 minutes)
12
+ Start here for executive overview:
13
+
14
+ 1. **[SUMMARY.md](SUMMARY.md)** (12 KB)
15
+ - Executive summary of clarification session
16
+ - 5 design decisions made
17
+ - Architecture overview
18
+ - Key takeaways
19
+ - Next steps
20
+
21
+ ### Full Specification (15-20 minutes)
22
+ Complete feature definition:
23
+
24
+ 2. **[spec.md](spec.md)** (21 KB) ⭐ **MAIN SPEC**
25
+ - Problem statement & solution overview
26
+ - 3 user stories with acceptance criteria
27
+ - 17 functional requirements
28
+ - 13 non-functional requirements
29
+ - 8 key entities
30
+ - Data model & markdown output structure
31
+ - Technical architecture
32
+ - CLI interface & usage
33
+ - Testing strategy
34
+ - Success metrics
35
+
36
+ ### Integration & Workflow (10-15 minutes)
37
+ How this tool works with Spec 001:
38
+
39
+ 3. **[INTEGRATION_GUIDE.md](INTEGRATION_GUIDE.md)** (10 KB)
40
+ - High-level workflow diagram
41
+ - Data flow (LinkedIn β†’ markdown β†’ agent)
42
+ - Step-by-step integration steps
43
+ - Data mapping (source β†’ file β†’ usage)
44
+ - Configuration examples
45
+ - Privacy & consent controls
46
+ - Troubleshooting guide
47
+ - Future enhancements
48
+
49
+ ### Clarification Session Record (5 minutes)
50
+ Detailed record of design decisions:
51
+
52
+ 4. **[CLARIFICATIONS.md](CLARIFICATIONS.md)** (7 KB)
53
+ - All 5 questions & answers
54
+ - Rationale for each decision
55
+ - Coverage analysis
56
+ - Sections touched
57
+ - Recommendations
58
+
59
+ ---
60
+
61
+ ## 🎯 Quick Navigation
62
+
63
+ ### By Role
64
+
65
+ **Product Manager**: Read SUMMARY.md β†’ spec.md β†’ INTEGRATION_GUIDE.md
66
+ **Engineer**: Read spec.md β†’ INTEGRATION_GUIDE.md β†’ spec.md again for details
67
+ **Project Lead**: Read SUMMARY.md β†’ CLARIFICATIONS.md β†’ next steps
68
+
69
+ ### By Question
70
+
71
+ **What is this tool?**
72
+ β†’ SUMMARY.md, Overview section
73
+
74
+ **How does it work?**
75
+ β†’ INTEGRATION_GUIDE.md, High-Level Workflow
76
+
77
+ **What gets extracted?**
78
+ β†’ spec.md, Functional Requirements (FR-003 through FR-009)
79
+
80
+ **How does it integrate with Spec 001?**
81
+ β†’ INTEGRATION_GUIDE.md, Integration Steps
82
+
83
+ **What were the design decisions?**
84
+ β†’ CLARIFICATIONS.md, Questions Asked & Answered
85
+
86
+ **How do I use it?**
87
+ β†’ spec.md, Deployment & Usage section
88
+
89
+ **What happens if something fails?**
90
+ β†’ INTEGRATION_GUIDE.md, Troubleshooting
91
+
92
+ ---
93
+
94
+ ## πŸ“‹ Document Overview
95
+
96
+ | Document | Lines | Purpose | Audience |
97
+ |----------|-------|---------|----------|
98
+ | **SUMMARY.md** | 282 | Executive overview | Managers, PMs, decision makers |
99
+ | **spec.md** | 408 | Complete specification | Engineers, architects |
100
+ | **INTEGRATION_GUIDE.md** | 379 | Workflow & integration | Engineers, ops, users |
101
+ | **CLARIFICATIONS.md** | 158 | Decision record | Project leads, reviewers |
102
+ | **README.md** | 282 | Getting started | Everyone |
103
+ | **INDEX.md** | This file | Navigation | Everyone |
104
+
105
+ **Total**: ~1,500 lines of specification documentation
106
+
107
+ ---
108
+
109
+ ## πŸš€ Next Steps
110
+
111
+ ### Immediate (Next 1-2 hours)
112
+ - [ ] Read SUMMARY.md (5 min)
113
+ - [ ] Read spec.md User Stories section (10 min)
114
+ - [ ] Review INTEGRATION_GUIDE.md workflow (10 min)
115
+ - [ ] Validate design decisions align with vision (5 min)
116
+
117
+ ### Short-term (Next 1-2 days)
118
+ - [ ] Run `/speckit.plan` to create implementation roadmap
119
+ - [ ] Run `/speckit.tasks` to generate task breakdown
120
+ - [ ] Create feature branch: `git checkout -b 002-linkedin-profile-extractor`
121
+
122
+ ### Pre-Development (Next 1 week)
123
+ - [ ] Review implementation plan
124
+ - [ ] Estimate effort and timeline
125
+ - [ ] Quick Playwright POC (verify LinkedIn compatibility)
126
+ - [ ] Plan user trial for validation
127
+
128
+ ---
129
+
130
+ ## πŸ”— Cross-References
131
+
132
+ **Related Specifications**:
133
+ - Spec 001: Personified AI Agent β€” `specs/001-personified-ai-agent/spec.md`
134
+ - T068 Research: LinkedIn MCP β€” `specs/001-personified-ai-agent/research.md`
135
+
136
+ **Project Standards**:
137
+ - Constitution: `.specify/memory/constitution.md`
138
+ - Copilot Instructions: `.github/copilot-instructions.md`
139
+ - Clarify Prompt: `.github/prompts/speckit.clarify.prompt.md`
140
+
141
+ ---
142
+
143
+ ## πŸ“Š Specification Stats
144
+
145
+ - **Questions Clarified**: 5/5 βœ…
146
+ - **Coverage Achieved**: 100% (all 9 categories)
147
+ - **User Stories**: 3 (P1, P1, P2)
148
+ - **Functional Requirements**: 17
149
+ - **Non-Functional Requirements**: 13
150
+ - **Key Entities**: 8
151
+ - **Markdown Files Output**: 7 per session + metadata
152
+ - **CLI Commands**: 1 main command with options
153
+ - **Success Metrics**: 8 measurable outcomes
154
+
155
+ ---
156
+
157
+ ## βœ… Validation Checklist
158
+
159
+ - βœ… All 5 clarification questions answered
160
+ - βœ… All answers integrated into spec
161
+ - βœ… 100% coverage of ambiguity categories
162
+ - βœ… Spec includes user stories, requirements, data model
163
+ - βœ… Integration guide explains workflow
164
+ - βœ… Integration with Spec 001 documented
165
+ - βœ… CLI interface & usage documented
166
+ - βœ… Testing strategy included
167
+ - βœ… Success metrics defined
168
+ - βœ… Ready for planning phase
169
+
170
+ ---
171
+
172
+ ## πŸŽ“ How to Use This Index
173
+
174
+ 1. **First time?** β†’ Start with SUMMARY.md
175
+ 2. **Implementing?** β†’ Read spec.md sections in order
176
+ 3. **Integrating with Spec 001?** β†’ Use INTEGRATION_GUIDE.md
177
+ 4. **Reviewing decisions?** β†’ Check CLARIFICATIONS.md
178
+ 5. **Lost?** β†’ This index helps you navigate
179
+
180
+ ---
181
+
182
+ ## πŸ”₯ Key Highlights
183
+
184
+ **In 30 seconds**:
185
+ - πŸ› οΈ Tool: Python CLI using Playwright
186
+ - πŸ”“ Auth: Manual LinkedIn login (human-in-the-loop)
187
+ - πŸ“Š Data: Full profile extraction (7 sections)
188
+ - πŸ“ Output: Hierarchical markdown files
189
+ - πŸ‘€ User Control: Review locally β†’ edit β†’ upload manually
190
+ - πŸ”— Integration: Seamless with AI-Me agent (Spec 001)
191
+
192
+ ---
193
+
194
+ **Current Status**: βœ… **Ready for Planning Phase**
195
+ **Next Command**: `/speckit.plan` to create implementation roadmap
196
+
197
+ ---
198
+
199
+ *Navigation Guide for LinkedIn Profile Extractor Specification*
200
+ *Last Updated: October 24, 2025*
specs/002-linkedin-profile-extractor/INTEGRATION_GUIDE.md ADDED
@@ -0,0 +1,379 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Integration Guide: LinkedIn Profile Extractor ↔ Personified AI Agent
2
+
3
+ **Document**: Integration roadmap between Spec 002 (LinkedIn Extractor) and Spec 001 (Personified AI Agent)
4
+ **Date**: October 24, 2025
5
+ **Status**: Reference Documentation
6
+
7
+ ---
8
+
9
+ ## High-Level Workflow
10
+
11
+ ```
12
+ User's LinkedIn Profile
13
+ ↓
14
+ [Spec 002: LinkedIn Profile Extractor Tool]
15
+ (Playwright browser automation + manual login)
16
+ ↓
17
+ Generated Markdown Files (Profile.md, Experience.md, etc.)
18
+ ↓
19
+ User Review & Edit (privacy/accuracy gate)
20
+ ↓
21
+ Manual Upload to GitHub Repository (byoung/me, etc.)
22
+ ↓
23
+ [Spec 001: Personified AI Agent]
24
+ (DataManager loads GitHub repo via RAG)
25
+ ↓
26
+ Agent Knowledge Base Enhanced
27
+ ↓
28
+ User Chat: "Tell me about your experience..."
29
+ Agent Response: [Sourced from LinkedIn profile markdown]
30
+ ```
31
+
32
+ ---
33
+
34
+ ## Data Flow
35
+
36
+ ### Spec 002 Output Format
37
+
38
+ LinkedIn Extractor produces:
39
+
40
+ ```
41
+ linkedin-profile/
42
+ β”œβ”€β”€ extraction_report.json
43
+ β”œβ”€β”€ Profile.md # Name, headline, location, about
44
+ β”œβ”€β”€ Experience.md # Job history
45
+ β”œβ”€β”€ Education.md # Schools, degrees
46
+ β”œβ”€β”€ Skills.md # Skills + endorsements
47
+ β”œβ”€β”€ Recommendations.md # Recommendations
48
+ β”œβ”€β”€ Connections.md # Connections list
49
+ └── Activity.md # Posts, articles
50
+ ```
51
+
52
+ ### Spec 001 Input Format
53
+
54
+ Personified AI Agent expects:
55
+
56
+ - Markdown files in `docs/` directory (local) or GitHub repository (remote)
57
+ - Files organized by topic/section (exactly what Spec 002 produces)
58
+ - Metadata: filename, creation date, source (included in extraction_report.json)
59
+ - Markdown syntax: valid UTF-8, proper heading hierarchy
60
+
61
+ **Result**: Perfect format compatibility. No transformation needed.
62
+
63
+ ---
64
+
65
+ ## Integration Steps
66
+
67
+ ### Step 1: Extract LinkedIn Data (Spec 002)
68
+
69
+ ```bash
70
+ cd ~/projects/ai-me
71
+ python -m linkedin_extractor extract --output-dir ./linkedin-profile
72
+ # User logs in manually β†’ files generated β†’ review files
73
+ ```
74
+
75
+ **Output**: 7 markdown files + extraction_report.json in `./linkedin-profile/`
76
+
77
+ ---
78
+
79
+ ### Step 2: Review & Validate
80
+
81
+ User reviews files locally, edits for privacy/accuracy:
82
+
83
+ ```bash
84
+ # Open in editor
85
+ code ./linkedin-profile/
86
+
87
+ # Edit files as needed, delete sensitive info, verify accuracy
88
+ # Example edits:
89
+ # - Remove specific company names if desired
90
+ # - Condense connections list to key contacts
91
+ # - Remove draft posts or old activity
92
+ ```
93
+
94
+ ---
95
+
96
+ ### Step 3: Upload to Documentation Repository
97
+
98
+ User uploads files to their documentation repo (e.g., `byoung/me`):
99
+
100
+ ```bash
101
+ cd ~/repos/byoung-me # (or wherever your docs repo is)
102
+
103
+ # Copy or move reviewed files
104
+ cp -r ~/projects/ai-me/linkedin-profile/*.md ./
105
+
106
+ # Commit and push
107
+ git add *.md
108
+ git commit -m "Update LinkedIn profile: $(date +%Y-%m-%d)"
109
+ git push origin main
110
+ ```
111
+
112
+ ---
113
+
114
+ ### Step 4: Configure AI-Me to Ingest
115
+
116
+ Update `.env` to include LinkedIn profile repo:
117
+
118
+ ```bash
119
+ # .env
120
+ GITHUB_REPOS=byoung/me,byoung/other-docs # Add your repo
121
+ GITHUB_PERSONAL_ACCESS_TOKEN=ghp_xxxxx
122
+ ```
123
+
124
+ Or, if files are already in local `docs/` directory:
125
+
126
+ ```bash
127
+ # Just move files to docs/
128
+ cp ~/linkedin-profile/*.md ./docs/
129
+ ```
130
+
131
+ ---
132
+
133
+ ### Step 5: Restart AI-Me
134
+
135
+ AI-Me's `DataManager` will reload documents on next startup:
136
+
137
+ ```bash
138
+ # If running locally:
139
+ python -m gradio src/app.py
140
+
141
+ # If deployed on Spaces, trigger redeploy or restart
142
+ ```
143
+
144
+ ---
145
+
146
+ ### Step 6: Verify Integration
147
+
148
+ Test that agent has access to LinkedIn profile data:
149
+
150
+ **Test Chat**:
151
+ - User: "Tell me about your work experience"
152
+ - Agent Response: [Cites Experience.md from LinkedIn extractor]
153
+
154
+ **Verification**:
155
+ - Agent uses first-person ("I worked at...")
156
+ - Agent cites specific companies/dates from LinkedIn profile
157
+ - Agent maintains authentic voice
158
+
159
+ ---
160
+
161
+ ## Data Mapping: LinkedIn β†’ Markdown β†’ Agent
162
+
163
+ | LinkedIn Source | Markdown File | Agent Uses For | Example Question |
164
+ |-----------------|---------------|----------------|------------------|
165
+ | Profile section | Profile.md | Personalization, headline context | "What's your professional background?" |
166
+ | Experience | Experience.md | Job history, expertise domains | "Tell me about your experience with X" |
167
+ | Education | Education.md | Academic background, credentials | "Where did you study?" |
168
+ | Skills + Endorsements | Skills.md | Domain expertise ranking | "What are your top skills?" |
169
+ | Recommendations | Recommendations.md | Social proof, validation | "What do others say about you?" |
170
+ | Connections | Connections.md | Network context, collaboration history | "Tell me about your network" |
171
+ | Activity/Posts | Activity.md | Recent thinking, current interests | "What are you focused on lately?" |
172
+
173
+ ---
174
+
175
+ ## File Format Examples
176
+
177
+ ### Profile.md (from Spec 002 β†’ consumed by Spec 001)
178
+
179
+ ```markdown
180
+ # LinkedIn Profile
181
+
182
+ **Name**: Ben Young
183
+ **Headline**: AI Agent Architect | Full-Stack Engineer
184
+ **Location**: San Francisco, CA
185
+
186
+ ## Summary
187
+
188
+ Experienced AI/ML engineer with 10+ years building production systems...
189
+
190
+ [Rest of profile]
191
+ ```
192
+
193
+ **Agent uses**: First-person synthesis of profile summary in responses
194
+
195
+ ---
196
+
197
+ ### Experience.md (from Spec 002 β†’ consumed by Spec 001)
198
+
199
+ ```markdown
200
+ # Experience
201
+
202
+ ## AI Agent Architect @ TechCorp (2023-2025)
203
+ - Led design of autonomous agent systems
204
+ - Built RAG pipeline with 99.9% uptime
205
+ - Mentored 5 engineers on AI architecture
206
+
207
+ ## Senior Engineer @ StartupXYZ (2020-2023)
208
+ - ...
209
+ ```
210
+
211
+ **Agent uses**: Specific job responsibilities when answering experience questions
212
+
213
+ ---
214
+
215
+ ## Configuration Examples
216
+
217
+ ### For GitHub-Based Ingestion
218
+
219
+ ```bash
220
+ # .env
221
+ GITHUB_PERSONAL_ACCESS_TOKEN=ghp_xxxxxxxxxxxxx
222
+ GITHUB_REPOS=byoung/me,byoung/projects
223
+
224
+ # AI-Me will load:
225
+ # - https://github.com/byoung/me/blob/main/*.md
226
+ # - https://github.com/byoung/projects/blob/main/*.md
227
+ ```
228
+
229
+ Then upload LinkedIn profile markdown to `byoung/me` repo:
230
+
231
+ ```bash
232
+ # Repository structure
233
+ byoung/me/
234
+ β”œβ”€β”€ Profile.md # from LinkedIn extractor
235
+ β”œβ”€β”€ Experience.md # from LinkedIn extractor
236
+ β”œβ”€β”€ Education.md # from LinkedIn extractor
237
+ β”œβ”€β”€ resume.md # manual or pre-existing
238
+ β”œβ”€β”€ projects.md # manual
239
+ └── README.md
240
+ ```
241
+
242
+ ---
243
+
244
+ ### For Local File Ingestion
245
+
246
+ ```bash
247
+ # .env (no GitHub token needed)
248
+ GITHUB_REPOS="" # empty, or omit
249
+
250
+ # Move LinkedIn profile files to local docs/
251
+ cp ~/linkedin-profile/*.md ~/projects/ai-me/docs/
252
+
253
+ # Restart AI-Me
254
+ # DataManager will load from docs/ automatically
255
+ ```
256
+
257
+ ---
258
+
259
+ ## Data Privacy & Consent
260
+
261
+ ### What Users Control
262
+
263
+ 1. **Extraction**: User manually logs into LinkedIn (no credentials stored)
264
+ 2. **Review**: User reviews generated markdown before upload
265
+ 3. **Filtering**: User can delete/edit sensitive information in markdown
266
+ 4. **Upload**: User chooses where to upload (GitHub public repo, local, etc.)
267
+ 5. **Sharing**: User decides whether to use data with AI agent
268
+
269
+ ### What's Extracted
270
+
271
+ Only **publicly visible** LinkedIn data:
272
+ - Profile summary (as shown on profile page)
273
+ - Published experience/jobs
274
+ - Education (if public)
275
+ - Skills (if public)
276
+ - Recommendations (if public)
277
+ - Connections names/titles (if publicly shown)
278
+ - Published posts/activity (if public)
279
+
280
+ ### Privacy Best Practices
281
+
282
+ 1. Review markdown files before upload
283
+ 2. Remove sensitive information (specific salary, internal projects, etc.)
284
+ 3. Edit connections list if desired (Spec 002 allows truncation)
285
+ 4. Use private GitHub repo if prefer (not shared publicly)
286
+ 5. Set `GITHUB_REPOS` to private repo URL in AI-Me config
287
+
288
+ ---
289
+
290
+ ## Troubleshooting Integration
291
+
292
+ ### Problem: Agent doesn't cite LinkedIn data
293
+
294
+ **Diagnosis**:
295
+ 1. Verify markdown files uploaded to GitHub repo
296
+ 2. Verify GitHub repo URL is in `GITHUB_REPOS` env var
297
+ 3. Verify `GITHUB_PERSONAL_ACCESS_TOKEN` is set
298
+ 4. Restart AI-Me app
299
+ 5. Check logs: `DataManager.load_remote_documents()` should show documents loaded
300
+
301
+ **Solution**:
302
+ ```bash
303
+ # Test data loading
304
+ python -c "from src.data import DataManager;
305
+ dm = DataManager();
306
+ docs = dm.process_documents();
307
+ print(f'Loaded {len(docs)} documents')"
308
+ ```
309
+
310
+ ---
311
+
312
+ ### Problem: LinkedIn markdown syntax errors
313
+
314
+ **Diagnosis**:
315
+ 1. Validate markdown: `markdown-lint *.md`
316
+ 2. Check for special characters, emojis, Unicode issues
317
+
318
+ **Solution**:
319
+ - Spec 002 includes Unicode normalization (Constitution IX)
320
+ - User should review markdown files before upload
321
+ - Re-run extraction if needed
322
+
323
+ ---
324
+
325
+ ### Problem: Data accuracy issues in agent responses
326
+
327
+ **Diagnosis**:
328
+ 1. Verify extracted data matches LinkedIn profile
329
+ 2. Verify markdown reflects accurate representation of LinkedIn
330
+ 3. Check vector search is retrieving correct documents
331
+
332
+ **Solution**:
333
+ - User reviews extracted markdown before upload
334
+ - Manual editing of markdown files allowed
335
+ - Test specific queries: "What company did you work at?" β†’ should cite Experience.md
336
+
337
+ ---
338
+
339
+ ## Future Enhancements
340
+
341
+ ### Phase B: Spec 002 Enhancements
342
+
343
+ - Scheduled extraction (sync profile changes monthly)
344
+ - Data versioning (track profile evolution)
345
+ - Diff report (what changed since last extraction)
346
+ - LinkedIn API integration (if ToS allows in future)
347
+
348
+ ### Phase B: Spec 001 Integration Enhancements
349
+
350
+ - Automatic GitHub sync (reload documents on push webhook)
351
+ - LinkedIn data freshness indicator ("Profile data from X days ago")
352
+ - Dedicated LinkedIn context in agent prompt
353
+ - LinkedIn-specific queries: "Show me recent posts" β†’ cite Activity.md
354
+
355
+ ### Joint Enhancement: Documentation Sync Tool
356
+
357
+ - Tool to automatically sync markdown updates to GitHub
358
+ - Dashboard showing which LinkedIn data is in agent's knowledge base
359
+ - User audit trail: "Last synced: date"
360
+
361
+ ---
362
+
363
+ ## Success Criteria for Integration
364
+
365
+ βœ… **Extraction**: LinkedIn data β†’ markdown files β‰₯90% accuracy
366
+ βœ… **Review**: User can edit files before upload
367
+ βœ… **Upload**: Files accessible to AI-Me DataManager
368
+ βœ… **Retrieval**: Agent retrieves correct LinkedIn data for queries
369
+ βœ… **Response**: Agent cites LinkedIn profile in first-person responses
370
+ βœ… **Accuracy**: Sample responses match LinkedIn source data (100%)
371
+
372
+ ---
373
+
374
+ ## Reference
375
+
376
+ - **Spec 001**: Personified AI Agent β†’ `/specs/001-personified-ai-agent/spec.md`
377
+ - **Spec 002**: LinkedIn Profile Extractor β†’ `/specs/002-linkedin-profile-extractor/spec.md`
378
+ - **Constitution**: `/specify/memory/constitution.md` (Principles: RAG-First, Session Isolation, Type-Safe, Async-First)
379
+
specs/002-linkedin-profile-extractor/README.md ADDED
@@ -0,0 +1,282 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Clarification Session Complete: LinkedIn Profile Extractor
2
+
3
+ **Session Date**: October 24, 2025
4
+ **Feature**: LinkedIn Profile Data Extractor (Spec 002)
5
+ **Status**: βœ… **CLARIFICATION COMPLETE & SPEC CREATED**
6
+
7
+ ---
8
+
9
+ ## Overview
10
+
11
+ You requested a new tool to extract LinkedIn profile data into markdown files for use with the Personified AI Agent. I completed a full clarification session following the speckit.clarify workflow, answered 5 critical design questions, and created a complete feature specification.
12
+
13
+ ---
14
+
15
+ ## Clarification Results
16
+
17
+ ### Questions Asked & Answered: 5/5 βœ…
18
+
19
+ | # | Question | Your Answer | Rationale |
20
+ |---|----------|-------------|-----------|
21
+ | 1 | Feature organization (separate spec or integrated?) | **A**: Separate spec in `specs/002-` | Clean separation; independent versioning |
22
+ | 2 | Authentication mechanism for LinkedIn? | **A**: Browser automation with **Playwright** | Respects ToS; user-controlled; full UI access |
23
+ | 3 | LinkedIn data scope? | **C**: Full profile + human review gate | Maximum value; user controls privacy via review |
24
+ | 4 | Markdown output structure? | **B**: Hierarchical by section (Profile.md, Experience.md, etc.) | Modular; mirrors LinkedIn structure; RAG-compatible |
25
+ | 5 | Tool delivery model? | **A**: Standalone Python CLI tool | User control; local execution; manual upload to repo |
26
+
27
+ ### Coverage Achieved: 100% βœ…
28
+
29
+ All 9 key ambiguity categories resolved:
30
+ - βœ… Functional scope & behavior (3 user stories, acceptance criteria)
31
+ - βœ… Data model & entities (8 key entities defined)
32
+ - βœ… Interaction & UX (CLI interface, workflow documented)
33
+ - βœ… Non-functional attributes (performance <5min, reliability, privacy)
34
+ - βœ… Integration & dependencies (Playwright, no API, local execution)
35
+ - βœ… Edge cases & failures (rate limiting, UI changes, timeouts)
36
+ - βœ… Constraints & tradeoffs (Python 3.12, local-only, manual upload)
37
+ - βœ… Terminology & consistency (canonical terms defined)
38
+ - βœ… Completion signals (success metrics, acceptance criteria)
39
+
40
+ ---
41
+
42
+ ## Artifacts Created
43
+
44
+ ### 1. Feature Specification
45
+ **Path**: `specs/002-linkedin-profile-extractor/spec.md`
46
+
47
+ **Contents** (~400 lines):
48
+ - Problem statement & solution overview
49
+ - 3 user stories (P1, P1, P2) with full acceptance scenarios
50
+ - 17 functional requirements (FR-001 through FR-017)
51
+ - 13 non-functional requirements (performance, reliability, security, usability)
52
+ - 8 key entities with attributes
53
+ - Data model with markdown output structure & examples
54
+ - Technical constraints & architecture decisions
55
+ - Integration with Spec 001 (Personified AI Agent)
56
+ - CLI interface examples & usage
57
+ - Testing strategy (unit, integration, manual)
58
+ - Success metrics & measurable outcomes
59
+
60
+ **Status**: Ready for Phase 0-1 planning
61
+
62
+ ---
63
+
64
+ ### 2. Clarifications Document
65
+ **Path**: `specs/002-linkedin-profile-extractor/CLARIFICATIONS.md`
66
+
67
+ **Contents**:
68
+ - All 5 questions & answers with rationale
69
+ - Coverage analysis (all categories resolved)
70
+ - Sections touched in new spec
71
+ - Recommendations for next steps
72
+
73
+ ---
74
+
75
+ ### 3. Integration Guide
76
+ **Path**: `specs/002-linkedin-profile-extractor/INTEGRATION_GUIDE.md`
77
+
78
+ **Contents** (~350 lines):
79
+ - High-level workflow (LinkedIn β†’ extraction β†’ review β†’ upload β†’ ingestion)
80
+ - Data flow from Spec 002 to Spec 001
81
+ - Step-by-step integration instructions
82
+ - Data mapping (LinkedIn source β†’ markdown file β†’ agent usage)
83
+ - Configuration examples (GitHub-based & local)
84
+ - Privacy & consent controls
85
+ - Troubleshooting guide
86
+ - Future enhancement ideas
87
+ - Success criteria for integration
88
+
89
+ ---
90
+
91
+ ## Key Design Decisions
92
+
93
+ ### 1. **Separate Feature Spec** βœ…
94
+ - Created `specs/002-linkedin-profile-extractor/` directory
95
+ - Independent from Spec 001 (Personified AI Agent)
96
+ - Allows independent task tracking & prioritization
97
+ - Prevents scope creep in main agent
98
+
99
+ ### 2. **Playwright Browser Automation** βœ…
100
+ - User logs in manually (human-in-the-loop)
101
+ - Browser-based respects LinkedIn ToS (no scraping)
102
+ - Cross-platform support (Windows/Mac/Linux)
103
+ - Full UI access (can handle LinkedIn changes)
104
+ - No API complexity or approval required
105
+
106
+ ### 3. **Full Data Extraction with Review Gate** βœ…
107
+ - Extracts all publicly visible data (connections, endorsements, activity)
108
+ - User reviews markdown files locally before upload
109
+ - User can edit/remove sensitive information
110
+ - Only user decides what shares with AI agent
111
+
112
+ ### 4. **Hierarchical Markdown Output** βœ…
113
+ - 7 markdown files per section (Profile.md, Experience.md, etc.)
114
+ - Mirrors LinkedIn's natural information structure
115
+ - Modular: user can include/exclude files as needed
116
+ - Perfect compatibility with existing RAG pipeline
117
+
118
+ ### 5. **Standalone CLI Tool** βœ…
119
+ - Separate from main Gradio app
120
+ - Python 3.12 + uv (matches project standards)
121
+ - Local execution (no credentials transmitted)
122
+ - Manual upload workflow (user controls upload)
123
+ - Respects user privacy & data ownership
124
+
125
+ ---
126
+
127
+ ## Workflow: User Perspective
128
+
129
+ ```
130
+ 1. User runs: python -m linkedin_extractor extract --output-dir ./linkedin-profile
131
+ 2. Browser opens; user logs into LinkedIn manually
132
+ 3. Tool extracts profile β†’ Experience β†’ Education β†’ Skills β†’ etc.
133
+ 4. 7 markdown files generated: Profile.md, Experience.md, ...
134
+ 5. User reviews files in text editor; edits for privacy/accuracy
135
+ 6. User uploads files to their GitHub repo (byoung/me or similar)
136
+ 7. User configures AI-Me: GITHUB_REPOS=byoung/me
137
+ 8. AI-Me loads files via RAG
138
+ 9. Next conversation uses LinkedIn profile data:
139
+ - User: "Tell me about your work experience"
140
+ - Agent: "I've worked at [companies from Experience.md]..."
141
+ ```
142
+
143
+ ---
144
+
145
+ ## What's Next
146
+
147
+ ### Recommended Path
148
+
149
+ ```bash
150
+ # 1. Review the new spec
151
+ cat specs/002-linkedin-profile-extractor/spec.md
152
+
153
+ # 2. Create implementation plan
154
+ /speckit.plan
155
+
156
+ # 3. Generate task breakdown
157
+ /speckit.tasks
158
+
159
+ # 4. Create feature branch
160
+ git checkout -b 002-linkedin-profile-extractor
161
+
162
+ # 5. Begin Phase 1 (Setup & Infrastructure)
163
+ ```
164
+
165
+ ### Immediate Action Items
166
+
167
+ - [ ] Review `specs/002-linkedin-profile-extractor/spec.md` β€” validate decisions
168
+ - [ ] Review `INTEGRATION_GUIDE.md` β€” understand workflow with Spec 001
169
+ - [ ] Create implementation plan via `/speckit.plan`
170
+ - [ ] Generate task breakdown via `/speckit.tasks`
171
+ - [ ] Create feature branch: `git checkout -b 002-linkedin-profile-extractor`
172
+
173
+ ### Pre-Development Validation
174
+
175
+ - [ ] Test Playwright with LinkedIn (quick POC)
176
+ - [ ] Validate markdown output integrates with RAG pipeline
177
+ - [ ] Plan user trial to validate data accuracy
178
+
179
+ ---
180
+
181
+ ## Integration with Spec 001
182
+
183
+ **No changes needed to Spec 001** (Personified AI Agent). The LinkedIn Profile Extractor is a **separate tool** that produces **compatible output** (markdown files).
184
+
185
+ **Integration is simple**:
186
+ 1. Extract β†’ Markdown files
187
+ 2. Review β†’ User validates
188
+ 3. Upload β†’ GitHub repo (or local docs/)
189
+ 4. Configure β†’ Add repo to `GITHUB_REPOS` in Spec 001's config
190
+ 5. Ingest β†’ Spec 001's DataManager loads files automatically
191
+
192
+ See `INTEGRATION_GUIDE.md` for detailed workflow.
193
+
194
+ ---
195
+
196
+ ## Success Criteria for This Clarification
197
+
198
+ | Criterion | Status |
199
+ |-----------|--------|
200
+ | All critical ambiguities identified | βœ… 9 categories scanned |
201
+ | High-impact questions prioritized | βœ… 5 questions asked (high-impact) |
202
+ | All answers actionable & clear | βœ… No ambiguous replies |
203
+ | Spec reflects decisions accurately | βœ… All 5 answers integrated |
204
+ | Integration documented | βœ… 350-line INTEGRATION_GUIDE.md created |
205
+ | Ready for planning phase | βœ… No outstanding blockers |
206
+
207
+ ---
208
+
209
+ ## File Structure
210
+
211
+ ```
212
+ specs/002-linkedin-profile-extractor/
213
+ β”œβ”€β”€ spec.md # Main feature specification (~400 lines)
214
+ β”œβ”€β”€ CLARIFICATIONS.md # This clarification session record
215
+ β”œβ”€β”€ INTEGRATION_GUIDE.md # Integration with Spec 001 (~350 lines)
216
+ └── (forthcoming)
217
+ β”œβ”€β”€ plan.md # (Phase 0) Implementation roadmap
218
+ β”œβ”€β”€ data-model.md # (Phase 1) Detailed data model
219
+ β”œβ”€β”€ research.md # (Phase 0) Research findings
220
+ └── tasks.md # (Phase 2) Task breakdown
221
+ ```
222
+
223
+ ---
224
+
225
+ ## Summary Stats
226
+
227
+ - **Questions Asked**: 5
228
+ - **Coverage Achieved**: 100% (all 9 ambiguity categories)
229
+ - **Spec Lines Created**: ~400 (main spec)
230
+ - **Integration Guide**: ~350 lines
231
+ - **Clarifications Documented**: ~200 lines
232
+ - **Total Documentation**: ~950 lines
233
+ - **Decision Clarity**: High (all 5 answers well-justified)
234
+ - **Ready for Planning**: βœ… Yes
235
+
236
+ ---
237
+
238
+ ## Validation Checklist
239
+
240
+ - βœ… All 5 questions answered & recorded
241
+ - βœ… Spec created with clarifications integrated
242
+ - βœ… Coverage summary shows all categories resolved
243
+ - βœ… Markdown structure valid (no syntax errors)
244
+ - βœ… Terminology consistent (canonical terms: LinkedInProfile, ExtractionSession, etc.)
245
+ - βœ… No contradictory statements in spec
246
+ - βœ… Integration guide references both specs
247
+ - βœ… Next steps clearly documented
248
+
249
+ ---
250
+
251
+ ## Key Files to Review
252
+
253
+ 1. **Start Here**: `specs/002-linkedin-profile-extractor/spec.md` β€” Full feature spec
254
+ 2. **Integration**: `specs/002-linkedin-profile-extractor/INTEGRATION_GUIDE.md` β€” How it works with Spec 001
255
+ 3. **Reference**: `specs/001-personified-ai-agent/research.md` β€” T068 LinkedIn research (context)
256
+
257
+ ---
258
+
259
+ ## Next Steps
260
+
261
+ **Recommended**: Run `/speckit.plan` to create implementation roadmap
262
+
263
+ ```bash
264
+ /speckit.plan # Create Phase 0-5 planning for Spec 002
265
+ ```
266
+
267
+ Then:
268
+
269
+ ```bash
270
+ /speckit.tasks # Generate concrete task breakdown
271
+ ```
272
+
273
+ ---
274
+
275
+ **Clarification Status**: βœ… **COMPLETE**
276
+ **Spec Status**: βœ… **READY FOR PLANNING**
277
+ **Recommended Next**: `/speckit.plan` β†’ `/speckit.tasks` β†’ Begin Phase 1 Development
278
+
279
+ ---
280
+
281
+ *For detailed clarification methodology, see `.github/prompts/speckit.clarify.prompt.md`*
282
+
specs/002-linkedin-profile-extractor/spec.md ADDED
@@ -0,0 +1,408 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Feature Specification: LinkedIn Profile Data Extractor
2
+
3
+ **Feature Branch**: `002-linkedin-profile-extractor`
4
+ **Created**: 2025-10-24
5
+ **Status**: Draft (Clarification Complete)
6
+ **Input**: User description: "A tool that walks through LinkedIn, allows users to login (human in the loop), then extracts user profile data (profile, experience, connections, etc.) into markdown files. Users can review files for accuracy/privacy and upload to their markdown repo for RAG ingestion."
7
+
8
+ ## Clarifications
9
+
10
+ ### Session 2025-10-24
11
+
12
+ - Q: Should this be a separate feature spec or integrated into spec 001? β†’ A: Create separate spec (`specs/002-linkedin-profile-extractor/spec.md`) for clean separation of concerns
13
+ - Q: What authentication mechanism for LinkedIn? β†’ A: Browser automation with Playwright for manual login; respects ToS, user-controlled
14
+ - Q: What LinkedIn data to extract? β†’ A: Full profile (connections, endorsements, activity feed) with human review gate for privacy/legal before upload
15
+ - Q: What markdown output structure? β†’ A: Hierarchical by section (Profile.md, Experience.md, Education.md, Skills.md, Recommendations.md, Connections.md, Activity.md)
16
+ - Q: How is tool delivered and integrated? β†’ A: Standalone Python CLI tool; users run locally, review output, manually upload to GitHub repo
17
+
18
+ ## Overview & Context
19
+
20
+ ### Problem Statement
21
+
22
+ Users who want to create an AI agent representing themselves (via the Personified AI Agent, spec 001) need accurate, current profile data from LinkedIn. Currently, they must manually create markdown documentation of their professional background. This tool automates the extraction of LinkedIn profile data into markdown files, which users can review for accuracy and privacy, then upload to their documentation repository for RAG ingestion.
23
+
24
+ ### Solution Overview
25
+
26
+ **LinkedIn Profile Data Extractor** is a standalone Python CLI tool that:
27
+ 1. Opens a Playwright browser and navigates to LinkedIn
28
+ 2. Requires manual user login (human-in-the-loop for authentication and consent)
29
+ 3. Automatically navigates LinkedIn sections (Profile, Experience, Education, Skills, Recommendations, Connections, Activity)
30
+ 4. Extracts structured data from each section
31
+ 5. Converts data to hierarchical markdown files
32
+ 6. Outputs files to a local directory for user review
33
+ 7. User reviews files for accuracy and privacy, then manually uploads to their documentation repository
34
+
35
+ ### Key Differentiators
36
+
37
+ - **Privacy-First**: Browser-based extraction respects LinkedIn ToS; human review gate ensures user control over what data is shared
38
+ - **No API Complexity**: Avoids LinkedIn API authentication, approval workflows, and data restrictions
39
+ - **User-Controlled**: Users decide what to include/exclude before uploading
40
+ - **Integrates with RAG**: Output markdown files are designed for ingestion by the Personified AI Agent
41
+ - **Standalone**: Separate tool; doesn't complicate main AI-Me application
42
+
43
+ ### Target Users
44
+
45
+ - Individuals creating an AI agent representing themselves
46
+ - Users who want to keep profile data current with minimal manual effort
47
+ - Users who prefer reviewing data before sharing with AI systems
48
+
49
+ ---
50
+
51
+ ## User Scenarios & Testing *(mandatory)*
52
+
53
+ ### User Story 1 - Extract LinkedIn Profile to Markdown (Priority: P1)
54
+
55
+ A user runs the CLI tool, authenticates with LinkedIn, and extracts their profile data into markdown files. The tool creates organized, well-structured markdown files that accurately represent their LinkedIn profile.
56
+
57
+ **Why this priority**: Core value propositionβ€”tool must successfully extract LinkedIn data without manual intervention after login.
58
+
59
+ **Independent Test**: Can be fully tested by running the tool, logging in, navigating profile extraction, and verifying output markdown files match LinkedIn source data.
60
+
61
+ **Acceptance Scenarios**:
62
+
63
+ 1. **Given** a user runs `python -m linkedin_extractor extract --output-dir ./profile-data`, **When** the tool opens a browser and waits for login, **Then** the user can complete LinkedIn authentication manually
64
+ 2. **Given** the user is logged into LinkedIn, **When** the tool navigates profile sections, **Then** it successfully extracts profile data without crashes or incomplete captures
65
+ 3. **Given** extraction completes, **When** the tool outputs markdown files to the specified directory, **Then** files are well-formatted and match LinkedIn source content
66
+ 4. **Given** output files exist, **When** the user reviews them, **Then** the data is accurate, complete, and useful for RAG ingestion
67
+
68
+ ---
69
+
70
+ ### User Story 2 - Review & Validate Extracted Data (Priority: P1)
71
+
72
+ User reviews the generated markdown files for accuracy and privacy concerns, ensuring the data is suitable for uploading to their documentation repository.
73
+
74
+ **Why this priority**: Human-in-the-loop validation ensures accuracy and prevents unintended data sharing.
75
+
76
+ **Independent Test**: Can be fully tested by reviewing output files and verifying they match LinkedIn source and contain no unexpected data.
77
+
78
+ **Acceptance Scenarios**:
79
+
80
+ 1. **Given** markdown files are extracted, **When** the user reviews them, **Then** all sections are present and readable
81
+ 2. **Given** the user finds inaccurate or sensitive data, **When** they can easily edit the markdown files, **Then** they can remove/modify entries before upload
82
+ 3. **Given** files are validated, **When** the user prepares to upload, **Then** they understand exactly what data will be shared with their AI agent
83
+
84
+ ---
85
+
86
+ ### User Story 3 - Upload Reviewed Files to Documentation Repository (Priority: P2)
87
+
88
+ User uploads the reviewed and validated markdown files to their documentation repository (e.g., `byoung/me` GitHub repo), making them available for RAG ingestion by the Personified AI Agent.
89
+
90
+ **Why this priority**: Completes the workflow; enables RAG ingestion and agent knowledge base updates.
91
+
92
+ **Independent Test**: Can be fully tested by uploading files to a test repository and verifying they're accessible for RAG pipeline ingestion.
93
+
94
+ **Acceptance Scenarios**:
95
+
96
+ 1. **Given** validated markdown files exist locally, **When** the user uploads them to their documentation repository, **Then** they're stored in a location where the RAG pipeline can find them
97
+ 2. **Given** files are uploaded, **When** the Personified AI Agent's DataManager loads documents, **Then** the LinkedIn profile data is available for retrieval
98
+
99
+ ---
100
+
101
+ ### Edge Cases
102
+
103
+ - What happens if LinkedIn changes UI/layout while extraction is in progress?
104
+ - How does the tool handle LinkedIn rate limiting or blocking?
105
+ - What if a user has restricted privacy settings preventing certain data extraction?
106
+ - How should the tool handle missing data (e.g., user has no connections, endorsements, or activity)?
107
+ - What happens if the browser session times out during extraction?
108
+ - How are special characters, emojis, or non-ASCII text in profile data handled in markdown output?
109
+
110
+ ---
111
+
112
+ ## Requirements *(mandatory)*
113
+
114
+ ### Functional Requirements
115
+
116
+ - **FR-001**: Tool MUST open a Playwright browser window and navigate to LinkedIn.com
117
+ - **FR-002**: Tool MUST require manual user login (human-in-the-loop authentication); tool waits for successful login before proceeding
118
+ - **FR-003**: Tool MUST extract data from LinkedIn profile section: name, headline, location, about/summary, profile photo URL, open to work status
119
+ - **FR-004**: Tool MUST extract data from LinkedIn experience section: job titles, companies, dates, descriptions, current/past employment status
120
+ - **FR-005**: Tool MUST extract data from LinkedIn education section: school names, degrees, fields of study, graduation dates, activities
121
+ - **FR-006**: Tool MUST extract data from LinkedIn skills section: skill names and endorsement counts
122
+ - **FR-007**: Tool MUST extract data from LinkedIn recommendations section: recommender names, titles, companies, recommendation text
123
+ - **FR-008**: Tool MUST extract data from LinkedIn connections section: connection names, titles, companies (publicly visible data only)
124
+ - **FR-009**: Tool MUST extract data from LinkedIn activity/posts section: recent posts, comments, articles (publicly visible content only)
125
+ - **FR-010**: Tool MUST convert extracted data into hierarchical markdown files per section (Profile.md, Experience.md, Education.md, Skills.md, Recommendations.md, Connections.md, Activity.md)
126
+ - **FR-011**: Tool MUST output markdown files to a user-specified directory (via CLI flag `--output-dir`)
127
+ - **FR-012**: Tool MUST handle extraction errors gracefully with user-friendly error messages
128
+ - **FR-013**: Tool MUST validate that extracted data matches source LinkedIn content (structural verification, no data loss)
129
+ - **FR-014**: Tool MUST include metadata in markdown files: extraction timestamp, source URL, data completeness notes
130
+ - **FR-015**: Tool MUST respect LinkedIn Terms of Service: browser-based extraction with manual login, human-in-the-loop consent
131
+ - **FR-016**: Tool MUST allow user review and manual editing of markdown files before upload
132
+ - **FR-017**: Tool MUST include documentation for uploading files to a GitHub repository for RAG ingestion
133
+
134
+ ### Key Entities
135
+
136
+ - **LinkedInProfile**: Represents extracted user profile data (name, headline, location, summary, photo URL, open-to-work status)
137
+ - **LinkedInExperience**: Represents job history entries (company, title, dates, description, employment type)
138
+ - **LinkedInEducation**: Represents education entries (school, degree, field of study, graduation date, activities)
139
+ - **LinkedInSkill**: Represents skill entry (skill name, endorsement count)
140
+ - **LinkedInRecommendation**: Represents recommendation (recommender name/title/company, recommendation text, date)
141
+ - **LinkedInConnection**: Represents connection entry (name, title, company, connection URL)
142
+ - **LinkedInActivity**: Represents activity/post entry (timestamp, content, engagement metrics)
143
+ - **ExtractionSession**: Represents a single extraction run (session ID, timestamp start/end, browser state, error log)
144
+ - **MarkdownOutput**: Represents generated markdown file (section name, file path, content, metadata)
145
+
146
+ ### Non-Functional Requirements
147
+
148
+ #### Performance (SC-005)
149
+
150
+ - **SC-P-001**: Profile extraction completes within 5 minutes (typical user with moderate activity/connections)
151
+ - **SC-P-002**: Markdown file generation completes within 10 seconds after extraction
152
+ - **SC-P-003**: Tool memory usage stays below 500MB during extraction
153
+
154
+ #### Reliability & Error Handling (SC-007)
155
+
156
+ - **SC-R-001**: Tool handles LinkedIn UI changes gracefully (element not found) with informative error messages
157
+ - **SC-R-002**: Tool handles rate limiting from LinkedIn (429 status) with retry logic and user notification
158
+ - **SC-R-003**: Tool handles network timeouts with automatic retry (up to 3 attempts) and clear error reporting
159
+ - **SC-R-004**: Tool handles incomplete data extraction (missing sections) and reports completeness in metadata
160
+ - **SC-R-005**: Browser session timeout is handled with user prompt to re-login
161
+
162
+ #### Security & Privacy (SC-002, SC-007)
163
+
164
+ - **SC-S-001**: Tool runs locally; LinkedIn credentials are never stored or transmitted to external services
165
+ - **SC-S-002**: Tool respects LinkedIn ToS: browser-based extraction, manual login, user consent required
166
+ - **SC-S-003**: Tool only extracts publicly visible data (respects privacy settings)
167
+ - **SC-S-004**: Markdown output is saved only to user-specified local directory (no automatic cloud upload)
168
+ - **SC-S-005**: Tool includes clear warnings about data sensitivity in generated markdown files
169
+
170
+ #### Usability (SC-008)
171
+
172
+ - **SC-U-001**: CLI interface is intuitive with clear help text (`--help` flag)
173
+ - **SC-U-002**: Error messages are user-friendly and actionable (not technical stack traces)
174
+ - **SC-U-003**: Output markdown files are human-readable and easy to edit before upload
175
+ - **SC-U-004**: Tool provides clear guidance on next steps (review, edit, upload to repo)
176
+
177
+ #### Observability
178
+
179
+ - **SC-O-001**: Tool logs extraction progress (sections processed, data counts, timestamps) to console
180
+ - **SC-O-002**: Tool generates extraction report in output directory (extraction_report.json) with metadata and summary
181
+
182
+ ---
183
+
184
+ ## Success Criteria *(mandatory)*
185
+
186
+ ### Measurable Outcomes
187
+
188
+ - **SC-001**: Extracted LinkedIn data is accurate and matches source profile (100% sample verification by user review)
189
+ - **SC-002**: All publicly visible LinkedIn data is successfully extracted without requiring manual re-entry (100% completeness per user evaluation)
190
+ - **SC-003**: Generated markdown files are valid, well-formatted, and immediately usable for RAG ingestion (0 markdown syntax errors)
191
+ - **SC-004**: Users can review extracted data and identify/edit sensitive information before upload (human-in-the-loop gate functional)
192
+ - **SC-005**: Profile extraction completes in under 5 minutes for typical user (measured across 3+ user trials)
193
+ - **SC-006**: Tool handles LinkedIn UI changes and rate limiting without crashing (resilient error handling tested)
194
+ - **SC-007**: All tool failures result in user-friendly error messages, not technical stack traces (100% user-friendly errors)
195
+ - **SC-008**: Users report that generated files are immediately useful for their documentation repository (qualitative feedback)
196
+
197
+ ### Assumptions
198
+
199
+ - Users have active LinkedIn accounts with visible profile data
200
+ - Users are comfortable installing Python CLI tool and running commands locally
201
+ - Users have git/GitHub account and can manually upload markdown files to their documentation repository
202
+ - LinkedIn UI is relatively stable (tool may require maintenance if LinkedIn significantly changes UI)
203
+ - Users accept that extraction is browser-based and requires active session (no headless-only extraction for privacy/ToS reasons)
204
+ - Generated markdown files will be reviewed by users before sharing with AI systems
205
+ - Users understand the data extracted is limited to publicly visible LinkedIn content
206
+
207
+ ---
208
+
209
+ ## Data Model
210
+
211
+ ### Markdown Output Structure
212
+
213
+ Each extraction session generates the following files in the output directory:
214
+
215
+ ```
216
+ output_dir/
217
+ β”œβ”€β”€ extraction_report.json # Metadata: extraction timestamp, session info, data completeness
218
+ β”œβ”€β”€ Profile.md # Profile summary, headline, location, about, photo
219
+ β”œβ”€β”€ Experience.md # Job history with dates, companies, descriptions
220
+ β”œβ”€β”€ Education.md # Schools, degrees, fields of study, graduation dates
221
+ β”œβ”€β”€ Skills.md # Skills list with endorsement counts
222
+ β”œβ”€β”€ Recommendations.md # Recommendations with recommender info and text
223
+ β”œβ”€β”€ Connections.md # Connections list (names, titles, companies)
224
+ └── Activity.md # Recent posts, comments, articles
225
+ ```
226
+
227
+ ### File Format Example (Profile.md)
228
+
229
+ ```markdown
230
+ # LinkedIn Profile
231
+
232
+ **Extracted**: 2025-10-24 14:30:00 UTC
233
+ **Source**: https://www.linkedin.com/in/byoung/
234
+ **Status**: Complete
235
+
236
+ ## Summary
237
+
238
+ - **Name**: Ben Young
239
+ - **Headline**: AI Agent Architect | Full-Stack Engineer
240
+ - **Location**: San Francisco, CA
241
+ - **Open to Work**: Yes (seeking AI/ML roles)
242
+
243
+ ## About
244
+
245
+ [Profile summary text...]
246
+
247
+ ## Profile Photo
248
+
249
+ [URL to profile photo if publicly available]
250
+ ```
251
+
252
+ ---
253
+
254
+ ## Technical Constraints & Architecture
255
+
256
+ ### Technology Stack
257
+
258
+ - **Language**: Python 3.12+ (matching AI-Me project standards via `uv`)
259
+ - **Browser Automation**: Playwright (cross-platform, supports multiple browsers, respects ToS)
260
+ - **Package Manager**: `uv` (matches AI-Me project standards)
261
+ - **Output Format**: Markdown files + JSON metadata
262
+ - **Delivery**: Standalone CLI tool (separate from main Gradio app)
263
+ - **Execution Environment**: User's local machine (not cloud-deployed)
264
+
265
+ ### Implementation Notes
266
+
267
+ 1. **Browser-Based Extraction**: Uses Playwright to automate browser navigation, respecting LinkedIn ToS by requiring manual login
268
+ 2. **No API Integration**: Avoids LinkedIn API authentication complexity and approval requirements
269
+ 3. **Human-in-the-Loop**: User must manually authenticate and consent to extraction before proceeding
270
+ 4. **Local Execution**: All extraction happens on user's machine; no credentials or data transmitted externally
271
+ 5. **Manual Upload**: Users manually upload reviewed files to their GitHub repo (no automated Git push)
272
+ 6. **RAG Integration**: Output markdown follows existing document structure for seamless RAG ingestion by Personified AI Agent
273
+
274
+ ### Out of Scope
275
+
276
+ - Automated scheduled extraction (GitHub Actions, webhooks, cron jobs) β€” future enhancement
277
+ - Cloud-based execution or deployment
278
+ - Integration with main Gradio app (separate standalone tool)
279
+ - LinkedIn API integration (browser-based extraction only)
280
+ - Encrypted credential storage (user responsible for LinkedIn security)
281
+ - Multi-user or SaaS deployment
282
+
283
+ ---
284
+
285
+ ## Integration with Personified AI Agent (Spec 001)
286
+
287
+ ### Workflow
288
+
289
+ 1. **Extract**: User runs LinkedIn extractor CLI β†’ generates markdown files
290
+ 2. **Review**: User reviews files locally, edits for privacy/accuracy
291
+ 3. **Upload**: User uploads files to their documentation repository (e.g., `byoung/me`)
292
+ 4. **Ingest**: Personified AI Agent's `DataManager` loads files via GitHub (if `GITHUB_REPOS` includes the repo)
293
+ 5. **Use**: Agent has access to LinkedIn profile data for better context and responses
294
+
295
+ ### Documentation Structure Compatibility
296
+
297
+ LinkedIn extractor output (Profile.md, Experience.md, etc.) follows the same markdown document structure expected by the Personified AI Agent's RAG pipeline. No additional transformation needed.
298
+
299
+ ---
300
+
301
+ ## Deployment & Usage
302
+
303
+ ### Installation
304
+
305
+ ```bash
306
+ # Clone repo (or install from package)
307
+ git clone https://github.com/byoung/ai-me.git
308
+ cd ai-me
309
+
310
+ # Install dependencies
311
+ uv sync
312
+
313
+ # Run extractor
314
+ python -m linkedin_extractor extract --output-dir ./linkedin-profile
315
+ ```
316
+
317
+ ### CLI Interface
318
+
319
+ ```bash
320
+ python -m linkedin_extractor extract --output-dir PATH [OPTIONS]
321
+
322
+ Options:
323
+ --output-dir PATH Directory to save markdown files (required)
324
+ --headless Run browser in headless mode (not recommended; requires session)
325
+ --wait-time SECONDS Wait time for page loads (default: 10)
326
+ --extract-connections Include full connections list (slower; may hit rate limits)
327
+ --extract-activity Include recent activity/posts (slower; requires scrolling)
328
+ --help Show help text
329
+ ```
330
+
331
+ ### Usage Example
332
+
333
+ ```bash
334
+ # Basic extraction
335
+ python -m linkedin_extractor extract --output-dir ~/linkedin-profile
336
+
337
+ # Full extraction with connections and activity
338
+ python -m linkedin_extractor extract --output-dir ~/linkedin-profile --extract-connections --extract-activity
339
+ ```
340
+
341
+ ### Post-Extraction Workflow
342
+
343
+ 1. **Review Files**: User opens markdown files in editor, verifies accuracy
344
+ 2. **Edit**: User removes/modifies sensitive information as needed
345
+ 3. **Upload to Repo**: User commits and pushes files to their documentation repository
346
+ 4. **Configure AI-Me**: Add repo to `GITHUB_REPOS` environment variable if not already included
347
+ 5. **Verify**: Next conversation with AI-Me agent will use LinkedIn profile data in responses
348
+
349
+ ---
350
+
351
+ ## Testing Strategy
352
+
353
+ ### Unit Tests
354
+
355
+ - Markdown generation (correct format, no syntax errors)
356
+ - Data extraction parsing (LinkedIn HTML β†’ structured data)
357
+ - File I/O operations (output directory creation, file writing)
358
+ - Error message formatting (user-friendly, no stack traces)
359
+
360
+ ### Integration Tests
361
+
362
+ - End-to-end extraction session (login β†’ extract β†’ file output)
363
+ - Handle LinkedIn rate limiting
364
+ - Handle LinkedIn UI changes (missing elements)
365
+ - Browser timeout recovery
366
+
367
+ ### Manual Testing
368
+
369
+ - User trial with real LinkedIn account (verify data accuracy)
370
+ - Review generated markdown files for completeness
371
+ - Upload to documentation repo and verify RAG ingestion
372
+
373
+ ---
374
+
375
+ ## Success Metrics (How We Know We're Done)
376
+
377
+ | Metric | Target | How We Measure |
378
+ |--------|--------|----------------|
379
+ | **Data Accuracy** | 100% of extracted data matches LinkedIn source | User review of generated files vs. LinkedIn profile |
380
+ | **Completeness** | 90%+ of available LinkedIn data extracted | Count of extracted data points vs. completeness report |
381
+ | **Markdown Quality** | 0 syntax errors in output | Markdown validation tool |
382
+ | **Extraction Time** | <5 minutes for typical user | Timer from login to file output |
383
+ | **Error Handling** | 100% user-friendly error messages | No stack traces in output |
384
+ | **Privacy Compliance** | Only publicly visible data extracted | User audit of generated files |
385
+ | **RAG Integration** | Files immediately usable for RAG ingestion | Upload to repo and verify agent knowledge access |
386
+ | **Ease of Use** | Users can extract data without technical support | Qualitative feedback / support ticket volume |
387
+
388
+ ---
389
+
390
+ ## Future Enhancements (Phase B - Not in MVP)
391
+
392
+ - Scheduled extraction (GitHub Actions trigger)
393
+ - Multi-profile extraction (extract multiple users' data)
394
+ - Incremental updates (extract only changed sections)
395
+ - LinkedIn API integration (once ToS allows)
396
+ - Cloud deployment (Hugging Face Spaces as web UI)
397
+ - Automated Git push with review/approval workflow
398
+ - Encrypted credential storage for batch jobs
399
+ - Data diff/versioning (track profile changes over time)
400
+
401
+ ---
402
+
403
+ **Spec Status**: βœ… Ready for Phase 0-1 Design
404
+ **Next Steps**:
405
+ 1. Create detailed data model and markdown schema
406
+ 2. Create implementation plan with Playwright-specific architecture
407
+ 3. Generate task breakdown for development
408
+