Annotation Tool - Guide
A HuggingFace Spaces app for validating AI-extracted dataset mentions in World Bank documents.
Quick Start
For Annotators
- Go to the Space URL and click Sign in with HuggingFace
- You'll see only your assigned documents in the dropdown
- Navigate pages with ← Prev / Next →
- Open the Data Mentions panel to validate each mention
- Track your progress in the top-right:
Progress: PDF 3/55 | Page 2/12 | Verified 4/8
Validation Actions
| Action | What it does |
|---|---|
| Correct | Confirms the AI extraction is a real dataset mention |
| Incorrect | Marks the extraction as wrong / not a dataset |
| Click tag badge | Change dataset type (named, descriptive, generic) |
| Highlight text → Annotate | Manually add a dataset mention the AI missed |
| Delete | Remove a dataset entry entirely |
Tip: If you try to click "Next" with unverified mentions, you'll get a confirmation prompt.
Document Assignments
Each annotator sees only their assigned documents. A configurable percentage of the documents (default 10%) is set aside as overlap documents, which are assigned to every annotator so that inter-annotator agreement can be measured.
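The guide does not say which agreement metric is used on the overlap documents. As a minimal sketch, assuming simple percent agreement between two annotators and hypothetical `doc:page:mention` keys for each verdict, it could be computed like this:

```python
from typing import Dict, Optional


def percent_agreement(a: Dict[str, bool], b: Dict[str, bool]) -> Optional[float]:
    """Share of mentions judged by BOTH annotators on which the verdicts match."""
    shared = a.keys() & b.keys()
    if not shared:
        return None  # nothing in common, so agreement is undefined
    matches = sum(1 for key in shared if a[key] == b[key])
    return matches / len(shared)


# Hypothetical verdicts (True = correct mention, False = incorrect).
ann_a = {"2:1:0": True, "2:1:1": False, "2:3:0": True}
ann_b = {"2:1:0": True, "2:1:1": True, "2:3:0": True, "9:1:0": False}

# Only the three mentions both annotators judged are compared.
print(percent_agreement(ann_a, ann_b))
```

Percent agreement does not correct for chance; a chance-corrected statistic such as Cohen's kappa may be more appropriate for reporting.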
Configuration File
annotation_data/annotator_config.yaml:
settings:
  overlap_percent: 10 # % of docs shared between all annotators
annotators:
  - username: rafmacalaba # HuggingFace username
    docs: [2, 3, 14, ...] # assigned doc indices
  - username: rafaelmacalaba
    docs: [1, 2, 10, ...]
Auto-Generate Assignments
# Preview assignment distribution:
uv run --with pyyaml python3 generate_assignments.py --dry-run
# Generate and save locally:
uv run --with pyyaml python3 generate_assignments.py
# Generate, save, and upload to HF:
uv run --with pyyaml,huggingface_hub python3 generate_assignments.py --upload
The script:
- Reads `annotator_config.yaml` for the annotator list and overlap %
- Shuffles all available docs (deterministic, seed=42)
- Reserves `overlap_percent`% of the docs as overlap docs shared by ALL annotators
- Splits the rest evenly across annotators
- Saves the assignments back to the YAML
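The steps above can be sketched as follows. This is not the actual `generate_assignments.py`; the function name `assign_docs`, the rounding of the overlap count, and the round-robin split are assumptions made for illustration.

```python
import random
from typing import Dict, List


def assign_docs(
    doc_ids: List[int],
    usernames: List[str],
    overlap_percent: float = 10,
    seed: int = 42,
) -> Dict[str, List[int]]:
    """Shuffle docs, reserve a shared overlap set, split the rest evenly."""
    docs = list(doc_ids)
    random.Random(seed).shuffle(docs)  # same seed gives the same order every run

    # Assumption: the overlap count is rounded to the nearest whole document.
    n_overlap = round(len(docs) * overlap_percent / 100)
    overlap, rest = docs[:n_overlap], docs[n_overlap:]

    assignments: Dict[str, List[int]] = {}
    for i, user in enumerate(usernames):
        own = rest[i :: len(usernames)]  # round-robin keeps shares within one doc
        assignments[user] = sorted(overlap + own)
    return assignments


result = assign_docs(list(range(20)), ["alice", "bob"], overlap_percent=10)
for user, docs in result.items():
    print(user, docs)
```

Because the shuffle is seeded, re-running with the same inputs reproduces the same assignments. Adding an annotator changes the split, so existing assignments may move.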
Adding a New Annotator
- Add to `annotation_data/annotator_config.yaml`:

      - username: new_hf_username
        docs: []

- Re-run:

      uv run --with pyyaml,huggingface_hub python3 generate_assignments.py --upload

- Add the username to `ALLOWED_USERS` in the Space settings
Manual Editing
You can manually edit the docs array for any annotator in the YAML file, then upload:
uv run --with huggingface_hub python3 -c "
from huggingface_hub import HfApi
api = HfApi()
# upload_file takes keyword-only arguments
api.upload_file(
    path_or_fileobj='annotation_data/annotator_config.yaml',
    path_in_repo='annotation_data/annotator_config.yaml',
    repo_id='ai4data/annotation_data',
    repo_type='dataset',
)
"
Per-Annotator Validation (Overlap Support)
Each dataset mention stores validations per-annotator in a validations array:
{
  "dataset_name": { "text": "DHS Survey", "confidence": 0.95 },
  "dataset_tag": "named",
  "validations": [
    {
      "annotator": "rafmacalaba",
      "human_validated": true,
      "human_verdict": true,
      "human_notes": null,
      "validated_at": "2025-02-24T11:00:00Z"
    },
    {
      "annotator": "rafaelmacalaba",
      "human_validated": true,
      "human_verdict": false,
      "human_notes": "This is a study name, not a dataset",
      "validated_at": "2025-02-24T11:05:00Z"
    }
  ]
}
Key behavior:
- Each annotator only sees their own validation status (no bias)
- Progress bar and "Next" prompt count only your verifications
- Tag edits (`dataset_tag`) are shared: they're factual, not judgment-based
- Re-validating updates your existing entry (doesn't create duplicates)
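The update-in-place and per-annotator counting rules can be sketched like this. The Space implements this logic in its own API route, so the helper names `upsert_validation` and `count_verified` are hypothetical; only the field names come from the JSON example above.

```python
from datetime import datetime, timezone
from typing import Any, Dict, List, Optional


def upsert_validation(
    mention: Dict[str, Any],
    annotator: str,
    verdict: bool,
    notes: Optional[str] = None,
) -> None:
    """Replace this annotator's entry if one exists, otherwise append it."""
    entry = {
        "annotator": annotator,
        "human_validated": True,
        "human_verdict": verdict,
        "human_notes": notes,
        "validated_at": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
    validations = mention.setdefault("validations", [])
    for i, existing in enumerate(validations):
        if existing.get("annotator") == annotator:
            validations[i] = entry  # re-validation: overwrite, no duplicate
            return
    validations.append(entry)


def count_verified(mentions: List[Dict[str, Any]], annotator: str) -> int:
    """Count mentions that this annotator (and only this one) has validated."""
    return sum(
        any(
            v.get("annotator") == annotator and v.get("human_validated")
            for v in m.get("validations", [])
        )
        for m in mentions
    )


mention = {"dataset_tag": "named", "validations": []}
upsert_validation(mention, "rafmacalaba", True)
upsert_validation(mention, "rafmacalaba", False, "Changed my mind")
upsert_validation(mention, "rafaelmacalaba", True)
print(len(mention["validations"]))                            # two entries, one per annotator
print(count_verified([mention, {"validations": []}], "rafmacalaba"))
```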
Data Pipeline
prepare_data.py: Prepare & Upload Documents
# Dry run (scan only):
uv run --with huggingface_hub,requests,langdetect python3 prepare_data.py --dry-run
# Upload missing docs + update links:
uv run --with huggingface_hub,requests,langdetect python3 prepare_data.py
# Only update wbg_pdf_links.json:
uv run --with huggingface_hub,requests,langdetect python3 prepare_data.py --links-only
This script:
- Scans local `annotation_data/wbg_extractions/` for real `_direct_judged.jsonl` files
- Detects language using `langdetect` (excludes non-English: Arabic, French)
- Uploads English docs to HF dataset
- Updates `wbg_pdf_links.json` with `has_revalidation` and `language` fields
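The scanning step might look roughly like the sketch below. It is not the code in `prepare_data.py`: the function name is hypothetical, the recursive directory layout is assumed, and treating "real" as "non-empty" is a guess, since the guide does not define it. Language detection and upload are left out.

```python
from pathlib import Path
from typing import List

SUFFIX = "_direct_judged.jsonl"


def find_judged_files(root: str) -> List[Path]:
    """Return non-empty *_direct_judged.jsonl files under root, sorted by path."""
    return sorted(
        p
        for p in Path(root).rglob(f"*{SUFFIX}")
        # Assumption: a "real" file is a regular file with some content.
        if p.is_file() and p.stat().st_size > 0
    )


if __name__ == "__main__":
    for path in find_judged_files("annotation_data/wbg_extractions"):
        print(path)
```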
Leaderboard
Click Leaderboard in the top bar to see annotator rankings.
| Metric | Description |
|---|---|
| Verified | Number of mentions validated |
| Added | Manually added dataset mentions |
| Docs | Number of documents worked on |
| Score | Verified + Added |
Leaderboard results are cached for 2 minutes, so recent validations may take up to 2 minutes to appear.
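A minimal sketch of the scoring and caching described here, written in Python for illustration (the Space computes this in `leaderboard/route.js`). The event tuple format, ranking by descending score, and the `TTLCache` class are assumptions; only the metrics and the 2-minute lifetime come from this guide.

```python
import time
from collections import defaultdict
from typing import Any, Callable, Dict, Iterable, List, Tuple


def build_leaderboard(events: Iterable[Tuple[str, int, str]]) -> List[Dict[str, Any]]:
    """events: (annotator, doc_index, kind) with kind 'verified' or 'added'."""
    stats: Dict[str, Dict[str, Any]] = defaultdict(
        lambda: {"verified": 0, "added": 0, "docs": set()}
    )
    for annotator, doc_index, kind in events:
        stats[annotator][kind] += 1
        stats[annotator]["docs"].add(doc_index)
    rows = [
        {
            "annotator": name,
            "verified": s["verified"],
            "added": s["added"],
            "docs": len(s["docs"]),
            "score": s["verified"] + s["added"],  # Score = Verified + Added
        }
        for name, s in stats.items()
    ]
    # Assumption: highest score first, ties broken alphabetically.
    return sorted(rows, key=lambda r: (-r["score"], r["annotator"]))


class TTLCache:
    """Recompute a value only after ttl_seconds have passed (default 2 minutes)."""

    def __init__(self, ttl_seconds: float = 120,
                 clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._value: Any = None
        self._expires_at: Any = None

    def get(self, compute: Callable[[], Any]) -> Any:
        now = self._clock()
        if self._expires_at is None or now >= self._expires_at:
            self._value = compute()
            self._expires_at = now + self._ttl
        return self._value


events = [("ann", 2, "verified"), ("ann", 2, "added"),
          ("bob", 3, "verified"), ("ann", 5, "verified")]
cache = TTLCache()
for row in cache.get(lambda: build_leaderboard(events)):
    print(row)
```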
API Endpoints
| Endpoint | Method | Description |
|---|---|---|
| `/api/documents?user=X` | GET | List documents (filtered by user assignment) |
| `/api/document?index=X&page=Y` | GET | Get page data for a specific document |
| `/api/validate` | PUT | Submit validation for a dataset mention |
| `/api/validate?doc=X&page=Y&idx=Z` | DELETE | Remove a dataset entry |
| `/api/leaderboard` | GET | Annotator rankings |
| `/api/pdf-proxy?url=X` | GET | Proxy PDF downloads (bypasses CORS) |
| `/api/auth/login` | GET | Start HF OAuth flow |
| `/api/auth/callback` | GET | OAuth callback |
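As a small illustration of how the query strings in this table are formed, the sketch below builds request URLs with the standard library. `BASE_URL` is a placeholder, not the real Space address, and `api_url` is a hypothetical helper. The PUT body for `/api/validate` is not documented here, so it is not sketched.

```python
from urllib.parse import urlencode, urljoin

BASE_URL = "https://example.invalid"  # placeholder: substitute your Space URL


def api_url(path: str, **params) -> str:
    """Join the base URL and path, then append percent-encoded query parameters."""
    url = urljoin(BASE_URL, path)
    query = urlencode(params)
    return f"{url}?{query}" if query else url


print(api_url("/api/documents", user="rafmacalaba"))
print(api_url("/api/document", index=3, page=2))
# The proxied PDF address is itself a URL, so it must be percent-encoded.
print(api_url("/api/pdf-proxy", url="https://a.org/x.pdf?y=1"))
```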
Architecture
hf_spaces_docker/
├── app/
│   ├── page.js                       # Main app (client component)
│   ├── globals.css                   # All styles
│   ├── api/
│   │   ├── documents/route.js        # Doc listing + user filtering
│   │   ├── document/route.js         # Single page data
│   │   ├── validate/route.js         # Validate/delete mentions
│   │   ├── leaderboard/route.js      # Leaderboard stats
│   │   ├── pdf-proxy/route.js        # PDF CORS proxy
│   │   └── auth/                     # HF OAuth login/callback
│   └── components/
│       ├── AnnotationPanel.js        # Side panel with dataset cards
│       ├── AnnotationModal.js        # Manual annotation dialog
│       ├── DocumentSelector.js       # Document dropdown
│       ├── Leaderboard.js            # Leaderboard modal
│       ├── MarkdownAnnotator.js      # Text viewer with highlighting
│       ├── PageNavigator.js          # Prev/Next page buttons
│       ├── PdfViewer.js              # PDF iframe with loading state
│       └── ProgressBar.js            # PDF/Page/Verified pills
├── annotation_data/
│   ├── annotator_config.yaml         # Annotator assignments
│   └── wbg_data/
│       └── wbg_pdf_links.json        # Doc registry with URLs
├── prepare_data.py                   # Upload docs to HF
└── generate_assignments.py           # Auto-assign docs to annotators
Environment Variables
| Variable | Required | Description |
|---|---|---|
| `HF_TOKEN` | Yes | HuggingFace API token (read/write) |
| `OAUTH_CLIENT_ID` | Yes (Space) | HF OAuth app client ID |
| `OAUTH_CLIENT_SECRET` | Yes (Space) | HF OAuth app client secret |
| `ALLOWED_USERS` | Yes (Space) | Comma-separated HF usernames |
| `NEXTAUTH_SECRET` | Yes | Secret for cookie signing |
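Since `ALLOWED_USERS` is a comma-separated list, a check against it might look like the sketch below. It is written in Python for illustration only; the app does this in its own JavaScript code, and the helper names, whitespace trimming, and exact (case-sensitive) matching are assumptions.

```python
import os
from typing import Optional, Set


def parse_allowed_users(raw: Optional[str]) -> Set[str]:
    """Split a comma-separated value, trimming whitespace and dropping blanks."""
    return {name.strip() for name in (raw or "").split(",") if name.strip()}


def is_allowed(username: str, raw: Optional[str] = None) -> bool:
    """Check a username against ALLOWED_USERS (read from the environment by default)."""
    if raw is None:
        raw = os.environ.get("ALLOWED_USERS", "")
    return username.strip() in parse_allowed_users(raw)


print(is_allowed("rafmacalaba", "rafmacalaba, rafaelmacalaba"))
print(is_allowed("someone_else", "rafmacalaba, rafaelmacalaba"))
```

With this parsing, an unset or empty `ALLOWED_USERS` allows nobody, which is the safer default for a restricted tool.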