srmsoumya committed on
Commit
d8d4856
·
1 Parent(s): 5623a00

FEAT: Add pipeline to create SLM training data

.gitignore CHANGED
@@ -133,6 +133,8 @@ dmypy.json
133
  # Pyre type checker
134
  .pyre/
135
 
136
-
137
  data/
138
- output/
133
  # Pyre type checker
134
  .pyre/
135
 
136
+ # Dataset
137
  data/
138
+ output/
139
+ *.parquet
140
+ *.jsonl
dataset/PIPELINE_FLOW.md ADDED
@@ -0,0 +1,414 @@
1
+ # Dataset Generation Pipeline Flow
2
+
3
+ This document explains how the optimized pipeline processes data with concrete examples.
4
+
5
+ **Example Configuration:**
6
+ - Countries: `['EC', 'BE', 'KE', 'AE', 'SG']` (5 countries)
7
+ - Sample targets: 100 per family × 8 families = 800 samples
8
+ - Retry multiplier: 2 (generate 1,600 attempts to get 800 valid samples)
9
+ - Max workers: 8
10
+
11
+ ---
12
+
13
+ ## Step 0: Build Entity Inventory (One-Time Setup)
14
+
15
+ **Script:** `build_inventory.py`
16
+
17
+ **What it does:** Extracts compact metadata from the full parquet files for fast sampling.
18
+
19
+ **Input:**
20
+ - `divisions_area.parquet` (~500K entities globally)
21
+ - `natural_earth.parquet` (~50K entities globally)
22
+
23
+ **Process:**
24
+ ```sql
25
+ -- For each parquet, extract:
26
+ SELECT
27
+ id,
28
+ names."primary" AS name,
29
+ subtype,
30
+ country,
31
+ region,
32
+ admin_level,
33
+ ST_Area(geometry) AS area_sq_deg,
34
+ -- bounding box for spatial filtering
35
+ FROM read_parquet(...)
36
+ WHERE names."primary" IS NOT NULL
37
+ ```
38
+
39
+ **Output:**
40
+ - `intermediate/divisions_area_inventory.parquet` (~50K rows for 5 countries)
41
+ - `intermediate/natural_earth_inventory.parquet` (~50K rows)
42
+
43
+ **Parallelization:** None (runs once, fast enough)
44
+
45
+ ---
46
+
47
+ ## Step 1: Build Relation Tables (Parallelized)
48
+
49
+ **Script:** `build_relations.py`
50
+
51
+ **What it does:** Pre-computes spatial relationships between entities so sample generation doesn't need to run expensive spatial joins.
52
+
53
+ ### Before Optimization (Sequential)
54
+ ```
55
+ Total time: adjacency (60s) + containment (15s) + intersection (10s) + cross_source (8s) = 93s
56
+ ```
57
+
58
+ ### After Optimization (Parallel)
59
+ ```
60
+ Total time: max(60s, 15s, 10s, 8s) = 60s
61
+ ```
62
+
63
+ **4 concurrent tasks, each with its own DuckDB connection:**
64
+
65
+ #### Task 1: Adjacency Pairs (60s)
66
+ ```sql
67
+ -- Find all touching boundaries within 5 countries
68
+ WITH features AS (
69
+ SELECT id, name, subtype, country, geometry, ST_Envelope(geometry) AS bbox
70
+ FROM divisions_area
71
+ WHERE country IN ('EC', 'BE', 'KE', 'AE', 'SG')
72
+ )
73
+ SELECT
74
+ a.id AS anchor_id,
75
+ a.name AS anchor_name,
76
+ b.id AS target_id,
77
+ b.subtype AS target_subtype
78
+ FROM features a
79
+ JOIN features b ON (
80
+ a.id < b.id
81
+ AND ST_Intersects(a.bbox, b.bbox) -- Fast bbox pre-filter
82
+ AND ST_Touches(a.geometry, b.geometry) -- Expensive but necessary
83
+ )
84
+ LIMIT 50000
85
+ ```
86
+ **Output:** `adjacency_pairs.parquet` (50,000 rows)
87
+
88
+ #### Task 2: Containment Pairs (15s)
89
+ ```sql
90
+ -- Find all parent-child relationships
91
+ -- Example: Ecuador contains Quito
92
+ SELECT
93
+ container.id AS container_id,
94
+ container.name AS container_name,
95
+ contained.id AS contained_id,
96
+ contained.subtype AS contained_subtype
97
+ FROM features container
98
+ JOIN features contained ON (
99
+ container.admin_level < contained.admin_level -- Parent has lower level
100
+ AND ST_Within(contained.geometry, container.geometry)
101
+ )
102
+ LIMIT 1000
103
+ ```
104
+ **Output:** `containment_pairs.parquet` (1,000 rows)
105
+
106
+ #### Task 3: Intersection Pairs (10s)
107
+ ```sql
108
+ -- Find overlapping regions (not touching, not containing)
109
+ -- Example: Two administrative regions that overlap
110
+ SELECT a.id, a.name, b.id, b.subtype
111
+ FROM features a
112
+ JOIN features b ON (
113
+ ST_Intersects(a.geometry, b.geometry)
114
+ AND NOT ST_Touches(a.geometry, b.geometry)
115
+ AND NOT ST_Within(a.geometry, b.geometry)
116
+ )
117
+ LIMIT 500
118
+ ```
119
+ **Output:** `intersection_pairs.parquet` (500 rows)
120
+
121
+ #### Task 4: Cross-Source Relations (8s)
122
+ ```sql
123
+ -- Find relationships between divisions and natural features
124
+ -- Example: Ecuador intersects Pacific Ocean
125
+ SELECT
126
+ d.id AS division_id,
127
+ d.name AS division_name,
128
+ n.id AS natural_id,
129
+ n.name AS natural_name,
130
+ CASE
131
+ WHEN ST_Touches(...) THEN 'touches'
132
+ WHEN ST_Intersects(...) THEN 'intersects'
133
+ END AS relation_type
134
+ FROM divisions d
135
+ JOIN natural_features n ON ST_Intersects(d.geometry, n.geometry)
136
+ WHERE d.country IN ('EC', 'BE', 'KE', 'AE', 'SG')
137
+ AND n.subtype IN ('sea', 'ocean', 'Lake', 'River')
138
+ LIMIT 500
139
+ ```
140
+ **Output:** `cross_source_relations.parquet` (500 rows)
141
+
142
+ **ThreadPoolExecutor with 4 workers runs all tasks concurrently.**
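+
+ A minimal sketch of that fan-out, assuming `tasks` is a list of `(name, compute_fn, limit)` tuples built from the four functions above and `countries` is the configured country list (the production version lives in `build_relations.py`):
+
+ ```python
+ from concurrent.futures import ThreadPoolExecutor, as_completed
+ import duckdb
+
+ def run_task(name, compute_fn, countries, limit):
+     # Each task opens its own connection, so the four spatial joins run independently.
+     con = duckdb.connect()
+     con.execute("INSTALL spatial")
+     con.execute("LOAD spatial")
+     try:
+         return name, compute_fn(con, countries, limit)
+     finally:
+         con.close()
+
+ with ThreadPoolExecutor(max_workers=4) as executor:
+     futures = [executor.submit(run_task, name, fn, countries, limit)
+                for name, fn, limit in tasks]
+     for future in as_completed(futures):
+         name, df = future.result()
+         print(f"{name}: {len(df)} pairs")  # wall-clock bounded by the slowest task (~60s)
+ ```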
143
+
144
+ ---
145
+
146
+ ## Step 2: Generate Samples (Batch-Parallelized)
147
+
148
+ **Script:** `generate_samples.py`
149
+
150
+ **What it does:** Creates training samples by:
151
+ 1. Sampling anchors from relation tables
152
+ 2. Rendering SQL templates
153
+ 3. Executing SQL to verify it works
154
+ 4. Building candidate lists with distractors
155
+ 5. Generating questions
156
+
157
+ ### Work Item Preparation
158
+
159
+ **Total work items:** 8 families × 100 targets × 2 retry_multiplier = **1,600 items**
160
+
161
+ ```python
162
+ work_items = [
163
+ ('adjacency', template_dict_1, 'sample_001', '/path/to/intermediate'),
164
+ ('containment', template_dict_2, 'sample_002', '/path/to/intermediate'),
165
+ ('adjacency', template_dict_3, 'sample_003', '/path/to/intermediate'),
166
+ # ... 1,597 more items
167
+ ]
168
+
169
+ # Shuffle for balanced batches
170
+ random.shuffle(work_items)
171
+
172
+ # Partition into 8 batches (one per worker)
173
+ batch_size = 1600 / 8 = 200 items per batch
174
+ batches = [
175
+ batch_1: items[0:200], # ~25 of each family (mixed)
176
+ batch_2: items[200:400],
177
+ batch_3: items[400:600],
178
+ # ... 8 batches total
179
+ ]
180
+ ```
181
+
182
+ ### Before Optimization (Per-Sample Workers)
183
+
184
+ ```
185
+ For each of 1,600 samples:
186
+ - Fork new process
187
+ - Create DuckDB connection
188
+ - INSTALL spatial (5-10ms)
189
+ - LOAD spatial (5-10ms)
190
+ - Import sql_templates module
191
+ - Load 4 relation parquet files (50-100ms)
192
+ - Generate 1 sample (20-50ms)
193
+ - Close connection
194
+
195
+ Total overhead per sample: ~100ms
196
+ Total overhead: 1,600 × 100ms = 160 seconds of pure overhead
197
+ ```
198
+
199
+ ### After Optimization (Batch Workers)
200
+
201
+ ```
202
+ 8 workers run in parallel, each processes 200 samples:
203
+
204
+ Worker 1 (batch of 200 items):
205
+ - Create DuckDB connection (once)
206
+ - INSTALL + LOAD spatial (once, 10ms)
207
+ - Import sql_templates (once)
208
+ - Load 4 relation tables (once, 100ms)
209
+
210
+ FOR EACH of 200 items:
211
+ - Sample anchor from pre-loaded table (instant)
212
+ - Render SQL template
213
+ - Execute SQL to verify (20-50ms)
214
+ - Build candidate list with Jaro-Winkler (10-30ms)
215
+ - Generate question
216
+
217
+ - Close connection
218
+
219
+ Total overhead per worker: ~110ms (one-time)
220
+ Total overhead across 8 workers: ~110ms (parallel)
221
+ ```
222
+
223
+ **Speedup:** 160s → 0.11s overhead = **~1,450x faster on initialization overhead**
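+
+ A simplified batch worker illustrating this structure (a sketch: `generate_one_sample`, `batches`, and `intermediate_dir` are placeholders here; the actual per-family logic lives in `generate_samples.py`):
+
+ ```python
+ from concurrent.futures import ProcessPoolExecutor
+ from pathlib import Path
+ import duckdb
+ import pandas as pd
+
+ def process_batch(batch, intermediate_dir):
+     # One-time setup per worker: connection, spatial extension, relation tables.
+     con = duckdb.connect()
+     con.execute("INSTALL spatial")
+     con.execute("LOAD spatial")
+     tables = {p.stem: pd.read_parquet(p) for p in Path(intermediate_dir).glob("*.parquet")}
+
+     results = []
+     for family, template, sample_id, _ in batch:
+         # Per-item work reuses the warm connection and pre-loaded tables.
+         # generate_one_sample stands in for the per-family sampling/rendering steps.
+         results.append(generate_one_sample(con, tables, family, template, sample_id))
+
+     con.close()
+     return results
+
+ with ProcessPoolExecutor(max_workers=8) as executor:
+     futures = [executor.submit(process_batch, batch, intermediate_dir) for batch in batches]
+ ```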
224
+
225
+ ### Sample Generation Example
226
+
227
+ **Adjacency sample generation:**
228
+
229
+ ```python
230
+ # 1. Sample anchor from pre-loaded adjacency_pairs DataFrame
231
+ row = adjacency_df.sample(n=1).iloc[0]
232
+ # Result: anchor_id='EC-123', anchor_name='Quito', target_subtype='locality'
233
+
234
+ # 2. Render SQL template
235
+ sql = f"""
236
+ WITH a AS (
237
+ SELECT geometry FROM divisions_area WHERE id = 'EC-123'
238
+ )
239
+ SELECT b.id, b.names."primary" AS name, b.geometry
240
+ FROM divisions_area AS b, a
241
+ WHERE b.id != 'EC-123'
242
+ AND b.subtype = 'locality'
243
+ AND ST_Touches(a.geometry, b.geometry)
244
+ """
245
+
246
+ # 3. Execute to verify (returns 5 neighboring localities)
247
+ result = con.execute(sql).fetchdf() # 30ms
248
+ # ✓ Not empty, sample is valid
249
+
250
+ # 4. Build candidate list (10 candidates: 1 true + 9 distractors)
251
+ candidates = build_candidate_list(
252
+ con, 'EC-123', 'Quito', 'divisions_area', num_candidates=10
253
+ )
254
+ # Uses Jaro-Winkler to find similar names:
255
+ SELECT id, name, subtype, country,
256
+ jaro_winkler_similarity(lower(name), lower('Quito')) AS similarity
257
+ FROM divisions_area
258
+ WHERE id != 'EC-123'
259
+ ORDER BY similarity DESC
260
+ LIMIT 9
261
+
262
+ # Results: ['Quito', 'Cuito', 'Quijos', 'Quinindé', ...]
263
+ # Shuffle and reassign IDs: c1, c2, ..., c10
264
+
265
+ # 5. Generate question
266
+ question = "Which localities border Quito?"
267
+
268
+ # 6. Return TrainingSample with sql_verified=True
269
+ ```
270
+
271
+ ### Batch Progress Tracking
272
+
273
+ ```
274
+ Console output:
275
+
276
+ Generating 1600 samples across 8 families...
277
+ Split into 8 batches of ~200 items (1 DuckDB init per batch)
278
+
279
+ Progress: 200/1600 samples (1/8 batches) # Worker 1 done
280
+ Progress: 400/1600 samples (2/8 batches) # Worker 2 done
281
+ Progress: 600/1600 samples (3/8 batches) # Worker 3 done
282
+ ...
283
+ Progress: 1600/1600 samples (8/8 batches) # All done
284
+
285
+ Results by family:
286
+ adjacency : 185 success / 15 failed (92.5% success rate, target: 100)
287
+ aggregation : 178 success / 22 failed (89.0% success rate, target: 100)
288
+ buffer : 192 success / 8 failed (96.0% success rate, target: 100)
289
+ containment : 188 success / 12 failed (94.0% success rate, target: 100)
290
+ direct_lookup : 200 success / 0 failed (100% success rate, target: 100)
291
+ intersection : 181 success / 19 failed (90.5% success rate, target: 100)
292
+ partial_selection : 175 success / 25 failed (87.5% success rate, target: 100)
293
+ set_operations : 190 success / 10 failed (95.0% success rate, target: 100)
294
+
295
+ Total: 1,489 valid samples from 1,600 attempts
296
+ ```
297
+
298
+ ---
299
+
300
+ ## Step 3: Validate Dataset (Optimized)
301
+
302
+ **Script:** `validate_dataset.py`
303
+
304
+ **What it does:** Validates samples in parallel, but **skips SQL re-execution** for samples with `sql_verified: True`.
305
+
306
+ ### Before Optimization
307
+
308
+ ```
309
+ For each of 1,489 samples:
310
+ - Execute SQL to verify (30ms)
311
+ - Validate candidates (1ms)
312
+ - Check question (1ms)
313
+
314
+ Total: 1,489 × 32ms = 47.6 seconds
315
+ ```
316
+
317
+ ### After Optimization
318
+
319
+ ```
320
+ For each of 1,489 samples:
321
+ - Check metadata.sql_verified flag
322
+ - IF True: skip SQL execution (saved 30ms)
323
+ - Validate candidates (1ms)
324
+ - Check question (1ms)
325
+
326
+ Total: 1,489 × 2ms = 3.0 seconds
327
+ ```
328
+
329
+ **Speedup:** 47.6s → 3.0s = **~16x faster**
330
+
331
+ **Parallelization:** 8 workers process samples in parallel batches
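+
+ A sketch of the per-sample check (assuming the rendered SQL sits alongside the `sql_verified` flag in `metadata`; the real logic is in `validate_dataset.py`):
+
+ ```python
+ def validate_sample(sample, con):
+     errors = []
+     # Trust generation-time verification: only re-run SQL when the flag is missing or False.
+     if not sample["metadata"].get("sql_verified", False):
+         result = con.execute(sample["metadata"]["sql"]).fetchdf()  # ~30ms per query
+         if result.empty:
+             errors.append("SQL returned no rows")
+     # Cheap structural checks always run (~2ms total).
+     if not sample.get("candidates"):
+         errors.append("empty candidate list")
+     if not sample.get("question", "").strip():
+         errors.append("missing question")
+     return errors
+ ```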
332
+
333
+ ---
334
+
335
+ ## Step 4: Export Splits
336
+
337
+ **Script:** `export_training_data.py`
338
+
339
+ **What it does:** Stratified split into train/val/test (80/10/10) by task family.
340
+
341
+ **Input:** `dataset_validated.jsonl` (1,489 samples)
342
+
343
+ **Process:**
344
+ ```python
345
+ # Group by family
346
+ adjacency_samples: 185
347
+ aggregation_samples: 178
348
+ buffer_samples: 192
349
+ # ... etc
350
+
351
+ # Split each family 80/10/10
352
+ adjacency_train: 148, val: 19, test: 18
353
+ aggregation_train: 142, val: 18, test: 18
354
+ # ... etc
355
+
356
+ # Combine and shuffle
357
+ train: 1,191 samples
358
+ val: 149 samples
359
+ test: 149 samples
360
+ ```
361
+
362
+ **Output:**
363
+ - `output/train.jsonl` (1,191 samples)
364
+ - `output/val.jsonl` (149 samples)
365
+ - `output/test.jsonl` (149 samples)
366
+
367
+ **Parallelization:** None needed (fast enough)
368
+
369
+ ---
370
+
371
+ ## Overall Pipeline Timing
372
+
373
+ ### Before Optimizations
374
+ ```
375
+ Step 0: Build Inventory : 5s (one-time)
376
+ Step 1: Build Relations : 93s (sequential)
377
+ Step 2: Generate Samples : 320s (160s overhead + 160s generation)
378
+ Step 3: Validate Dataset : 48s (re-executing all SQL)
379
+ Step 4: Export Splits : 2s
380
+ ------
381
+ Total : 468s (~7.8 minutes)
382
+ ```
383
+
384
+ ### After Optimizations
385
+ ```
386
+ Step 0: Build Inventory : 5s (one-time)
387
+ Step 1: Build Relations : 60s (parallel, limited by slowest task)
388
+ Step 2: Generate Samples : 165s (0.11s overhead + 165s generation)
389
+ Step 3: Validate Dataset : 3s (skips SQL re-execution)
390
+ Step 4: Export Splits : 2s
391
+ ------
392
+ Total : 235s (~3.9 minutes)
393
+ ```
394
+
395
+ **Overall speedup:** 468s → 235s = **~2x faster**
396
+
397
+ **At 10K scale (100x more samples):**
398
+ - Before: ~780 minutes (13 hours)
399
+ - After: ~390 minutes (6.5 hours)
400
+ - With further optimizations (sampling without replacement, better caching): **<2 hours**
401
+
402
+ ---
403
+
404
+ ## Key Optimizations Summary
405
+
406
+ | Optimization | Impact | Where |
407
+ |-------------|--------|-------|
408
+ | **Batch workers** | 1,450x on init overhead | `generate_samples.py` |
409
+ | **Parallel relations** | 1.5x on relation building | `build_relations.py` |
410
+ | **Jaro-Winkler** | 2-3x on distractor search | `generate_samples.py` |
411
+ | **Skip SQL re-validation** | 16x on validation | `validate_dataset.py` |
412
+ | **Drop individual JSON files** | 1.2x on I/O | `generate_samples.py` |
413
+
414
+ **Combined:** Enables scaling from hundreds to tens of thousands of samples efficiently.
dataset/README.md ADDED
@@ -0,0 +1,108 @@
1
+ # Dataset Generation CLI
2
+
3
+ Generate synthetic text-to-SQL training datasets.
4
+
5
+ ## Quick Start
6
+
7
+ ```bash
8
+ # Install
9
+ uv sync
10
+
11
+ # Generate dataset
12
+ gazet-dataset full-pipeline --config dataset/config.yaml
13
+ ```
14
+
15
+ ## Configuration
16
+
17
+ Edit `dataset/config.yaml`:
18
+
19
+ ```yaml
20
+ countries:
21
+ - EC # Ecuador
22
+ - BE # Belgium
23
+ - KE # Kenya
24
+ - AE # UAE
25
+ - SG # Singapore
26
+ - CH # Switzerland
27
+
28
+ sample_targets:
29
+ direct_lookup: 100
30
+ adjacency: 100
31
+ containment: 100
32
+ intersection: 100
33
+ buffer: 100
34
+ set_operations: 100
35
+ partial_selection: 100
36
+ aggregation: 100
37
+
38
+ generation:
39
+ max_workers: 8
40
+ retry_multiplier: 2
41
+ append_mode: true
42
+
43
+ auto_scaling:
44
+ safety_factor: 1.5 # Auto-calculates relation limits
45
+ ```
46
+
47
+ ## Growing Your Dataset
48
+
49
+ ### Start Small
50
+ ```bash
51
+ # Generate initial dataset (e.g., 100 samples)
52
+ gazet-dataset full-pipeline --config dataset/config.yaml
53
+ ```
54
+
55
+ ### Add More Samples (Same Countries)
56
+ ```bash
57
+ # Increase sample_targets in config.yaml, then:
58
+ gazet-dataset full-pipeline --config dataset/config.yaml --append
59
+ ```
60
+
61
+ ### Add New Countries
62
+ ```bash
63
+ # Add countries to config.yaml, then:
64
+ gazet-dataset full-pipeline --config dataset/config.yaml --append
65
+ # Auto-rebuilds relations if countries changed
66
+ ```
67
+
68
+ ### Scale to 10K+
69
+ ```yaml
70
+ # config.yaml - increase targets
71
+ sample_targets:
72
+ adjacency: 1000
73
+ containment: 1000
74
+ intersection: 1000
75
+ # ... etc
76
+ ```
77
+
78
+ ```bash
79
+ gazet-dataset full-pipeline --config dataset/config.yaml --append
80
+ ```
81
+
82
+ ## Commands
83
+
84
+ ```bash
85
+ gazet-dataset full-pipeline --config <path> # Run everything
86
+ gazet-dataset build-relations --config <path> # Build spatial relations
87
+ gazet-dataset generate-samples --config <path> # Generate samples
88
+ gazet-dataset validate --config <path> # Validate dataset
89
+ gazet-dataset export --config <path> # Export train/val/test
90
+ ```
91
+
92
+ **Options:**
93
+ - `--append`: Add to existing dataset instead of overwriting
94
+
95
+ ## Output
96
+
97
+ - `dataset/output/dataset_raw.jsonl` - Generated samples
98
+ - `dataset/output/dataset_validated.jsonl` - Validated samples
99
+ - `dataset/output/train.jsonl` - Training split
100
+ - `dataset/output/val.jsonl` - Validation split
101
+ - `dataset/output/test.jsonl` - Test split
102
+
103
+ ## Tips
104
+
105
+ - Start with 2-3 countries and small sample targets
106
+ - Use `--append` to grow dataset incrementally
107
+ - Relation limits auto-calculate from sample targets
108
+ - Check success rates in output summary
dataset/__init__.py ADDED
@@ -0,0 +1 @@
1
+ """Synthetic dataset generation package."""
dataset/config.yaml ADDED
@@ -0,0 +1,52 @@
1
+ # Dataset Generation Configuration
2
+ # This config controls which countries to process and how many samples to generate
3
+
4
+ # Countries to include in relation building
5
+ # Use ISO 3166-1 alpha-2 codes
6
+ countries:
7
+ # - EC # Ecuador
8
+ # - BE # Belgium
9
+ # - KE # Kenya
10
+ # - AE # UAE
11
+ # - SG # Singapore
12
+ # - CH # Switzerland
13
+ - IN # India
14
+ - PK # Pakistan
15
+ # - LK # Sri Lanka
16
+ # - BD # Bangladesh
17
+
18
+ # Sample generation targets per family
19
+ # Relation limits are auto-calculated from these targets
20
+ sample_targets:
21
+ direct_lookup: 20
22
+ adjacency: 20
23
+ containment: 20
24
+ intersection: 20
25
+ buffer: 20
26
+ set_operations: 20
27
+ partial_selection: 20
28
+ aggregation: 20
29
+
30
+ # Generation settings
31
+ generation:
32
+ max_workers: 8 # Number of parallel workers
33
+ retry_multiplier: 2 # Generate 2x samples to account for failures
34
+ append_mode: true # If true, append to existing dataset instead of overwriting
35
+
36
+ # Auto-scaling configuration
37
+ # Relation limits are automatically calculated: target * retry_multiplier * safety_factor
38
+ auto_scaling:
39
+ safety_factor: 1.5 # Extra buffer to ensure enough unique pairs
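+ # e.g., a target of 20 → 20 * 2 (retry_multiplier) * 1.5 (safety_factor) = 60 pairs needed for that family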
40
+
41
+ # Manual overrides (optional) - uncomment to override auto-calculated limits
42
+ manual_limits: {}
43
+ # adjacency: 10000 # Uncomment to manually set
44
+ # containment: 2000
45
+ # intersection: 1000
46
+ # cross_source: 500
47
+
48
+ # Output paths (relative to dataset directory)
49
+ output:
50
+ samples_dir: "output/samples"
51
+ dataset_file: "output/dataset_raw.jsonl"
52
+ intermediate_dir: "intermediate"
dataset/scripts/__init__.py ADDED
@@ -0,0 +1 @@
1
+ """Dataset generation scripts package."""
dataset/scripts/build_inventory.py ADDED
@@ -0,0 +1,111 @@
1
+ """
2
+ Build entity inventory from divisions_area and natural_earth parquet files.
3
+
4
+ This script creates compact inventory tables containing only the fields needed
5
+ for candidate sampling and distractor generation.
6
+
7
+ Output:
8
+ - intermediate/divisions_area_inventory.parquet
9
+ - intermediate/natural_earth_inventory.parquet
10
+ """
11
+
12
+ import duckdb
13
+ import pandas as pd
14
+ from pathlib import Path
15
+
16
+ from gazet.config import DIVISIONS_AREA_PATH, NATURAL_EARTH_PATH
17
+
18
+
19
+ def build_divisions_area_inventory(con: duckdb.DuckDBPyConnection) -> pd.DataFrame:
20
+ """Extract compact inventory from divisions_area."""
21
+ query = """
22
+ SELECT
23
+ 'divisions_area' AS source,
24
+ id,
25
+ names."primary" AS name,
26
+ subtype,
27
+ country,
28
+ region,
29
+ admin_level,
30
+ class,
31
+ is_land,
32
+ is_territorial,
33
+ division_id,
34
+ ST_Area(geometry) AS area_sq_deg,
35
+ ST_XMin(geometry) AS xmin,
36
+ ST_YMin(geometry) AS ymin,
37
+ ST_XMax(geometry) AS xmax,
38
+ ST_YMax(geometry) AS ymax
39
+ FROM read_parquet(?)
40
+ WHERE names."primary" IS NOT NULL
41
+ AND trim(names."primary") != ''
42
+ """
43
+
44
+ df = con.execute(query, [DIVISIONS_AREA_PATH]).fetchdf()
45
+ print(f"Divisions area inventory: {len(df)} entities")
46
+ print(f"Subtypes: {df['subtype'].value_counts().to_dict()}")
47
+ print(f"Countries: {df['country'].nunique()} unique")
48
+
49
+ return df
50
+
51
+
52
+ def build_natural_earth_inventory(con: duckdb.DuckDBPyConnection) -> pd.DataFrame:
53
+ """Extract compact inventory from natural_earth."""
54
+ query = """
55
+ SELECT
56
+ 'natural_earth' AS source,
57
+ id,
58
+ names."primary" AS name,
59
+ subtype,
60
+ country,
61
+ region,
62
+ admin_level,
63
+ class,
64
+ is_land,
65
+ is_territorial,
66
+ ST_Area(geometry) AS area_sq_deg,
67
+ ST_XMin(geometry) AS xmin,
68
+ ST_YMin(geometry) AS ymin,
69
+ ST_XMax(geometry) AS xmax,
70
+ ST_YMax(geometry) AS ymax
71
+ FROM read_parquet(?)
72
+ WHERE names."primary" IS NOT NULL
73
+ AND trim(names."primary") != ''
74
+ """
75
+
76
+ df = con.execute(query, [NATURAL_EARTH_PATH]).fetchdf()
77
+ print(f"\nNatural earth inventory: {len(df)} entities")
78
+ print(f"Subtypes: {df['subtype'].value_counts().to_dict()}")
79
+
80
+ return df
81
+
82
+
83
+ def main():
84
+ """Build and save inventory tables."""
85
+ output_dir = Path(__file__).parent.parent / "intermediate"
86
+ output_dir.mkdir(exist_ok=True, parents=True)
87
+
88
+ con = duckdb.connect()
89
+ con.execute("INSTALL spatial")
90
+ con.execute("LOAD spatial")
91
+
92
+ print("Building divisions_area inventory...")
93
+ divisions_df = build_divisions_area_inventory(con)
94
+ divisions_path = output_dir / "divisions_area_inventory.parquet"
95
+ divisions_df.to_parquet(divisions_path, index=False)
96
+ print(f"Saved to {divisions_path}")
97
+
98
+ print("\nBuilding natural_earth inventory...")
99
+ natural_earth_df = build_natural_earth_inventory(con)
100
+ natural_earth_path = output_dir / "natural_earth_inventory.parquet"
101
+ natural_earth_df.to_parquet(natural_earth_path, index=False)
102
+ print(f"Saved to {natural_earth_path}")
103
+
104
+ con.close()
105
+
106
+ print("\n✓ Inventory build complete")
107
+ print(f" Total entities: {len(divisions_df) + len(natural_earth_df)}")
108
+
109
+
110
+ if __name__ == "__main__":
111
+ main()
dataset/scripts/build_relations.py ADDED
@@ -0,0 +1,288 @@
1
+ """
2
+ Precompute spatial relation tables for efficient anchor sampling.
3
+
4
+ This script computes:
5
+ - Adjacency pairs (touching features)
6
+ - Containment pairs (features within other features)
7
+ - Intersection pairs (overlapping features)
8
+ - Cross-source relations (divisions_area ↔ natural_earth)
9
+
10
+ Output:
11
+ - intermediate/adjacency_pairs.parquet
12
+ - intermediate/containment_pairs.parquet
13
+ - intermediate/intersection_pairs.parquet
14
+ - intermediate/cross_source_relations.parquet
15
+ """
16
+
17
+ import duckdb
18
+ import pandas as pd
19
+ from pathlib import Path
20
+ from concurrent.futures import ThreadPoolExecutor, as_completed
21
+
22
+ from gazet.config import DIVISIONS_AREA_PATH, NATURAL_EARTH_PATH
23
+
24
+
25
+ def compute_adjacency_pairs(
26
+ con: duckdb.DuckDBPyConnection,
27
+ countries: list,
28
+ limit: int
29
+ ) -> pd.DataFrame:
30
+ """Find all pairs of features that touch (share a boundary)."""
31
+ print("Computing adjacency pairs (optimized with spatial index)...")
32
+
33
+ # Use bounding box pre-filter to avoid full cartesian product
34
+ query = """
35
+ WITH features AS (
36
+ SELECT
37
+ id,
38
+ names."primary" AS name,
39
+ subtype,
40
+ country,
41
+ admin_level,
42
+ geometry,
43
+ ST_Envelope(geometry) AS bbox
44
+ FROM read_parquet(?)
45
+ WHERE country IN (SELECT unnest(?))
46
+ )
47
+ SELECT
48
+ a.id AS anchor_id,
49
+ a.name AS anchor_name,
50
+ a.subtype AS anchor_subtype,
51
+ a.country AS anchor_country,
52
+ b.id AS target_id,
53
+ b.name AS target_name,
54
+ b.subtype AS target_subtype,
55
+ b.country AS target_country,
56
+ 'adjacency' AS relation_type
57
+ FROM features AS a
58
+ JOIN features AS b ON (
59
+ a.id < b.id
60
+ AND ST_Intersects(a.bbox, b.bbox)
61
+ AND ST_Touches(a.geometry, b.geometry)
62
+ )
63
+ LIMIT ?
64
+ """
65
+
66
+ df = con.execute(query, [DIVISIONS_AREA_PATH, countries, limit]).fetchdf()
67
+ print(f"Found {len(df)} adjacency pairs")
68
+
69
+ return df
70
+
71
+
72
+ def compute_containment_pairs(
73
+ con: duckdb.DuckDBPyConnection,
74
+ countries: list,
75
+ limit: int
76
+ ) -> pd.DataFrame:
77
+ """Find all pairs where one feature contains another."""
78
+ print("\nComputing containment pairs (optimized)...")
79
+
80
+ query = """
81
+ WITH features AS (
82
+ SELECT
83
+ id,
84
+ names."primary" AS name,
85
+ subtype,
86
+ country,
87
+ admin_level,
88
+ geometry,
89
+ ST_Envelope(geometry) AS bbox
90
+ FROM read_parquet(?)
91
+ WHERE country IN (SELECT unnest(?))
92
+ )
93
+ SELECT
94
+ a.id AS container_id,
95
+ a.name AS container_name,
96
+ a.subtype AS container_subtype,
97
+ b.id AS contained_id,
98
+ b.name AS contained_name,
99
+ b.subtype AS contained_subtype,
100
+ 'containment' AS relation_type
101
+ FROM features AS a
102
+ JOIN features AS b ON (
103
+ a.id != b.id
104
+ AND a.admin_level < b.admin_level
105
+ AND ST_Intersects(a.bbox, b.bbox)
106
+ AND ST_Within(b.geometry, a.geometry)
107
+ )
108
+ LIMIT ?
109
+ """
110
+
111
+ df = con.execute(query, [DIVISIONS_AREA_PATH, countries, limit]).fetchdf()
112
+ print(f"Found {len(df)} containment pairs")
113
+
114
+ return df
115
+
116
+
117
+ def compute_intersection_pairs(
118
+ con: duckdb.DuckDBPyConnection,
119
+ countries: list,
120
+ limit: int
121
+ ) -> pd.DataFrame:
122
+ """Find pairs that intersect but don't touch or contain."""
123
+ print("\nComputing intersection pairs (optimized)...")
124
+
125
+ query = """
126
+ WITH features AS (
127
+ SELECT
128
+ id,
129
+ names."primary" AS name,
130
+ subtype,
131
+ country,
132
+ admin_level,
133
+ geometry,
134
+ ST_Envelope(geometry) AS bbox
135
+ FROM read_parquet(?)
136
+ WHERE country IN (SELECT unnest(?))
137
+ )
138
+ SELECT
139
+ a.id AS anchor_id,
140
+ a.name AS anchor_name,
141
+ a.subtype AS anchor_subtype,
142
+ b.id AS target_id,
143
+ b.name AS target_name,
144
+ b.subtype AS target_subtype,
145
+ 'intersection' AS relation_type
146
+ FROM features AS a
147
+ JOIN features AS b ON (
148
+ a.id < b.id
149
+ AND ST_Intersects(a.bbox, b.bbox)
150
+ AND ST_Intersects(a.geometry, b.geometry)
151
+ AND NOT ST_Touches(a.geometry, b.geometry)
152
+ AND NOT ST_Within(a.geometry, b.geometry)
153
+ AND NOT ST_Within(b.geometry, a.geometry)
154
+ )
155
+ LIMIT ?
156
+ """
157
+
158
+ df = con.execute(query, [DIVISIONS_AREA_PATH, countries, limit]).fetchdf()
159
+ print(f"Found {len(df)} same-source intersection pairs")
160
+
161
+ return df
162
+
163
+
164
+ def compute_cross_source_relations(
165
+ con: duckdb.DuckDBPyConnection,
166
+ countries: list,
167
+ limit: int
168
+ ) -> pd.DataFrame:
169
+ """Find relations between divisions_area and natural_earth."""
170
+ print("\nComputing cross-source relations...")
171
+
172
+ query = """
173
+ WITH divisions AS (
174
+ SELECT
175
+ id,
176
+ names."primary" AS name,
177
+ subtype,
178
+ country,
179
+ geometry
180
+ FROM read_parquet(?)
181
+ WHERE country IN (SELECT unnest(?))
182
+ ),
183
+ natural_features AS (
184
+ SELECT
185
+ id,
186
+ names."primary" AS name,
187
+ subtype,
188
+ geometry
189
+ FROM read_parquet(?)
190
+ WHERE subtype IN ('sea', 'ocean', 'Lake', 'River', 'Basin', 'gulf', 'bay')
191
+ LIMIT 200
192
+ )
193
+ SELECT
194
+ d.id AS division_id,
195
+ d.name AS division_name,
196
+ d.subtype AS division_subtype,
197
+ d.country AS division_country,
198
+ n.id AS natural_id,
199
+ n.name AS natural_name,
200
+ n.subtype AS natural_subtype,
201
+ CASE
202
+ WHEN ST_Touches(d.geometry, n.geometry) THEN 'touches'
203
+ WHEN ST_Within(d.geometry, n.geometry) THEN 'within'
204
+ WHEN ST_Contains(d.geometry, n.geometry) THEN 'contains'
205
+ WHEN ST_Intersects(d.geometry, n.geometry) THEN 'intersects'
206
+ END AS relation_type
207
+ FROM divisions AS d
208
+ JOIN natural_features AS n ON ST_Intersects(d.geometry, n.geometry)
209
+ LIMIT ?
210
+ """
211
+
212
+ df = con.execute(query, [DIVISIONS_AREA_PATH, countries, NATURAL_EARTH_PATH, limit]).fetchdf()
213
+ print(f"Found {len(df)} cross-source relations")
214
+
215
+ return df
216
+
217
+
218
+ def _make_connection():
219
+ """Create a new DuckDB connection with spatial extension loaded."""
220
+ con = duckdb.connect()
221
+ con.execute("INSTALL spatial")
222
+ con.execute("LOAD spatial")
223
+ return con
224
+
225
+
226
+ def _compute_and_save(compute_fn, countries, limit, output_path):
227
+ """Compute a relation table and save it to parquet. Uses its own DuckDB connection."""
228
+ con = _make_connection()
229
+ try:
230
+ df = compute_fn(con, countries, limit)
231
+ df.to_parquet(output_path, index=False)
232
+ print(f"Saved to {output_path}")
233
+ return df
234
+ finally:
235
+ con.close()
236
+
237
+
238
+ def main(countries: list = None, relation_limits: dict = None):
239
+ """Compute and save all relation tables in parallel.
240
+
241
+ Args:
242
+ countries: List of country codes to process
243
+ relation_limits: Dict with keys: adjacency, containment, intersection, cross_source
244
+ """
245
+ # Defaults
246
+ if countries is None:
247
+ countries = ['EC', 'BE', 'KE', 'AE', 'SG', 'CH']
248
+ if relation_limits is None:
249
+ relation_limits = {
250
+ 'adjacency': 50000,
251
+ 'containment': 1000,
252
+ 'intersection': 500,
253
+ 'cross_source': 500
254
+ }
255
+
256
+ output_dir = Path(__file__).parent.parent / "intermediate"
257
+ output_dir.mkdir(exist_ok=True, parents=True)
258
+
259
+ # Define all relation tasks
260
+ tasks = [
261
+ ("adjacency", compute_adjacency_pairs, relation_limits['adjacency'], output_dir / "adjacency_pairs.parquet"),
262
+ ("containment", compute_containment_pairs, relation_limits['containment'], output_dir / "containment_pairs.parquet"),
263
+ ("intersection", compute_intersection_pairs, relation_limits['intersection'], output_dir / "intersection_pairs.parquet"),
264
+ ("cross_source", compute_cross_source_relations, relation_limits['cross_source'], output_dir / "cross_source_relations.parquet"),
265
+ ]
266
+
267
+ print(f"Computing {len(tasks)} relation types in parallel...")
268
+
269
+ # Run all relation computations concurrently
270
+ with ThreadPoolExecutor(max_workers=len(tasks)) as executor:
271
+ futures = {
272
+ executor.submit(_compute_and_save, compute_fn, countries, limit, path): name
273
+ for name, compute_fn, limit, path in tasks
274
+ }
275
+
276
+ for future in as_completed(futures):
277
+ name = futures[future]
278
+ try:
279
+ future.result()
280
+ except Exception as e:
281
+ print(f"ERROR computing {name}: {e}")
282
+ raise
283
+
284
+ print("\n✓ Relation tables build complete")
285
+
286
+
287
+ if __name__ == "__main__":
288
+ main()
dataset/scripts/cli.py ADDED
@@ -0,0 +1,305 @@
1
+ #!/usr/bin/env python3
2
+ """
3
+ CLI for synthetic dataset generation.
4
+
5
+ Usage:
6
+ python cli.py build-relations --config ../config.yaml
7
+ python cli.py generate-samples --config ../config.yaml
8
+ python cli.py generate-samples --config ../config.yaml --append
9
+ python cli.py full-pipeline --config ../config.yaml
10
+ """
11
+
12
+ import argparse
13
+ import sys
14
+ from pathlib import Path
15
+ import yaml
16
+ import subprocess
17
+ from typing import Dict, Set
18
+ import pandas as pd
19
+
20
+
21
+ def load_config(config_path: Path) -> dict:
22
+ """Load configuration from YAML file."""
23
+ with open(config_path) as f:
24
+ return yaml.safe_load(f)
25
+
26
+
27
+ def should_rebuild_relations(config: dict, intermediate_dir: Path, append: bool) -> bool:
28
+ """Check if relation tables need to be rebuilt.
29
+
30
+ Returns True if:
31
+ - Not in append mode (always rebuild)
32
+ - Relation tables don't exist
33
+ - Countries in config differ from countries in existing relation tables
34
+ """
35
+ if not append:
36
+ return True
37
+
38
+ # Check if relation tables exist
39
+ adjacency_file = intermediate_dir / "adjacency_pairs.parquet"
40
+ if not adjacency_file.exists():
41
+ print("WARNING: Relation tables not found, will rebuild despite append mode")
42
+ return True
43
+
44
+ # Check if countries have changed
45
+ try:
46
+ df = pd.read_parquet(adjacency_file)
47
+ if 'anchor_country' in df.columns:
48
+ existing_countries = set(df['anchor_country'].unique())
49
+ config_countries = set(config['countries'])
50
+
51
+ if existing_countries != config_countries:
52
+ print(f"WARNING: Countries changed:")
53
+ print(f" Previous: {sorted(existing_countries)}")
54
+ print(f" New: {sorted(config_countries)}")
55
+ print(f" Will rebuild relation tables to include new countries")
56
+ return True
57
+ else:
58
+ print(f"Countries unchanged: {sorted(config_countries)}")
59
+ return False
60
+ else:
61
+ # Can't determine countries, rebuild to be safe
62
+ print("WARNING: Cannot determine countries from existing tables, will rebuild")
63
+ return True
64
+ except Exception as e:
65
+ print(f"WARNING: Error reading existing relation tables: {e}")
66
+ print(" Will rebuild to be safe")
67
+ return True
68
+
69
+
70
+ def calculate_relation_limits(config: dict) -> Dict[str, int]:
71
+ """Auto-calculate relation limits based on sample targets."""
72
+ sample_targets = config['sample_targets']
73
+ retry_mult = config['generation']['retry_multiplier']
74
+ safety = config.get('auto_scaling', {}).get('safety_factor', 1.5)
75
+
76
+ # Map task families to relation types they need
77
+ family_to_relation = {
78
+ 'adjacency': 'adjacency',
79
+ 'containment': 'containment',
80
+ 'intersection': 'intersection',
81
+ 'buffer': 'adjacency', # Buffer uses adjacency pairs
82
+ 'set_operations': 'intersection', # Set ops use intersection pairs
83
+ 'partial_selection': 'containment', # Partial uses containment
84
+ 'aggregation': 'containment', # Aggregation uses containment
85
+ 'direct_lookup': None, # Uses inventory only
86
+ }
87
+
88
+ # Calculate required limits by summing needs per relation type
89
+ relation_needs = {}
90
+ for family, target in sample_targets.items():
91
+ relation_type = family_to_relation.get(family)
92
+ if relation_type:
93
+ needed = int(target * retry_mult * safety)
94
+ relation_needs[relation_type] = relation_needs.get(relation_type, 0) + needed
95
+
96
+ # Add cross-source (used by mixed-source partial selection)
97
+ partial_target = sample_targets.get('partial_selection', 0)
98
+ relation_needs['cross_source'] = int(partial_target * retry_mult * safety * 0.3)
99
+
100
+ # Apply manual overrides if specified
101
+ manual = config.get('auto_scaling', {}).get('manual_limits', {})
102
+ relation_needs.update(manual)
103
+
104
+ return relation_needs
105
+
106
+
107
+ def build_relations(config_path: Path):
108
+ """Run relation building with config."""
109
+ config = load_config(config_path)
110
+
111
+ # Auto-calculate relation limits
112
+ relation_limits = calculate_relation_limits(config)
113
+
114
+ print("=" * 60)
115
+ print("STEP 1: Building Relation Tables")
116
+ print("=" * 60)
117
+ print(f"Countries: {', '.join(config['countries'])}")
118
+ print(f"\nAuto-calculated relation limits:")
119
+ for rel_type, limit in relation_limits.items():
120
+ print(f" {rel_type:20s}: {limit:,}")
121
+ print()
122
+
123
+ # Import and run the relation builder
124
+ from dataset.scripts import build_relations
125
+
126
+ # Run with config parameters
127
+ build_relations.main(
128
+ countries=config['countries'],
129
+ relation_limits=relation_limits
130
+ )
131
+
132
+ print("\n✓ Relation tables built successfully")
133
+
134
+
135
+ def generate_samples(config_path: Path, append: bool = False):
136
+ """Run sample generation with config."""
137
+ config = load_config(config_path)
138
+
139
+ print("=" * 60)
140
+ print("STEP 2: Generating Samples")
141
+ print("=" * 60)
142
+ print(f"Targets: {config['sample_targets']}")
143
+ print(f"Workers: {config['generation']['max_workers']}")
144
+ print(f"Append mode: {append or config['generation']['append_mode']}")
145
+ print()
146
+
147
+ # Simple import - no number prefixes needed
148
+ from dataset.scripts import generate_samples as gs_module
149
+
150
+ # Override config values
151
+ gs_module.TARGET_COUNTS = config['sample_targets']
152
+ gs_module.MAX_WORKERS = config['generation']['max_workers']
153
+ gs_module.RETRY_MULTIPLIER = config['generation']['retry_multiplier']
154
+ gs_module.APPEND_MODE = append or config['generation']['append_mode']
155
+
156
+ # Run the main function
157
+ gs_module.main()
158
+
159
+ print("\n✓ Samples generated successfully")
160
+
161
+
162
+ def validate_dataset(config_path: Path):
163
+ """Run dataset validation."""
164
+ print("=" * 60)
165
+ print("STEP 3: Validating Dataset")
166
+ print("=" * 60)
167
+
168
+ script_dir = Path(__file__).parent
169
+ result = subprocess.run(
170
+ [sys.executable, str(script_dir / "validate_dataset.py")],
171
+ check=True
172
+ )
173
+
174
+ print("\n✓ Dataset validated successfully")
175
+
176
+
177
+ def export_dataset(config_path: Path):
178
+ """Run dataset export."""
179
+ print("=" * 60)
180
+ print("STEP 4: Exporting Dataset")
181
+ print("=" * 60)
182
+
183
+ script_dir = Path(__file__).parent
184
+ result = subprocess.run(
185
+ [sys.executable, str(script_dir / "export_training_data.py")],
186
+ check=True
187
+ )
188
+
189
+ print("\n✓ Dataset exported successfully")
190
+
191
+
192
+ def full_pipeline(config_path: Path, append: bool = False):
193
+ """Run the full pipeline."""
194
+ print("\n" + "=" * 60)
195
+ print("RUNNING FULL DATASET GENERATION PIPELINE")
196
+ print("=" * 60 + "\n")
197
+
198
+ config = load_config(config_path)
199
+
200
+ # Check if inventory exists, create if not
201
+ script_dir = Path(__file__).parent
202
+ intermediate_dir = script_dir.parent / "intermediate"
203
+ inventory_files = [
204
+ intermediate_dir / "divisions_area_inventory.parquet",
205
+ intermediate_dir / "natural_earth_inventory.parquet"
206
+ ]
207
+
208
+ inventory_missing = any(not f.exists() for f in inventory_files)
209
+
210
+ if inventory_missing:
211
+ print("=" * 60)
212
+ print("STEP 0: Building Entity Inventory")
213
+ print("=" * 60)
214
+ print("Inventory files not found. Building inventory...\n")
215
+
216
+ from dataset.scripts import build_inventory
217
+ build_inventory.main()
218
+
219
+ print("\n✓ Inventory built successfully\n")
220
+
221
+ # Check if we need to rebuild relations
222
+ need_rebuild = should_rebuild_relations(config, intermediate_dir, append)
223
+
224
+ if need_rebuild:
225
+ build_relations(config_path)
226
+ else:
227
+ print("Using existing relation tables (append mode, same countries)")
228
+
229
+ generate_samples(config_path, append=append)
230
+ validate_dataset(config_path)
231
+ export_dataset(config_path)
232
+
233
+ print("\n" + "=" * 60)
234
+ print("✓ PIPELINE COMPLETED SUCCESSFULLY")
235
+ print("=" * 60)
236
+
237
+
238
+ def main():
239
+ parser = argparse.ArgumentParser(
240
+ description="Synthetic dataset generation CLI",
241
+ formatter_class=argparse.RawDescriptionHelpFormatter,
242
+ epilog="""
243
+ Examples:
244
+ # Build relation tables only
245
+ python cli.py build-relations --config ../config.yaml
246
+
247
+ # Generate samples only
248
+ python cli.py generate-samples --config ../config.yaml
249
+
250
+ # Generate and append to existing dataset
251
+ python cli.py generate-samples --config ../config.yaml --append
252
+
253
+ # Run full pipeline
254
+ python cli.py full-pipeline --config ../config.yaml
255
+
256
+ # Run full pipeline in append mode (skip relation building)
257
+ python cli.py full-pipeline --config ../config.yaml --append
258
+ """
259
+ )
260
+
261
+ parser.add_argument(
262
+ 'command',
263
+ choices=['build-relations', 'generate-samples', 'validate', 'export', 'full-pipeline'],
264
+ help='Command to run'
265
+ )
266
+
267
+ parser.add_argument(
268
+ '--config',
269
+ type=Path,
270
+ required=True,
271
+ help='Path to config YAML file'
272
+ )
273
+
274
+ parser.add_argument(
275
+ '--append',
276
+ action='store_true',
277
+ help='Append to existing dataset instead of overwriting'
278
+ )
279
+
280
+ args = parser.parse_args()
281
+
282
+ # Validate config file exists
283
+ if not args.config.exists():
284
+ print(f"Error: Config file not found: {args.config}")
285
+ sys.exit(1)
286
+
287
+ # Run the appropriate command
288
+ try:
289
+ if args.command == 'build-relations':
290
+ build_relations(args.config)
291
+ elif args.command == 'generate-samples':
292
+ generate_samples(args.config, args.append)
293
+ elif args.command == 'validate':
294
+ validate_dataset(args.config)
295
+ elif args.command == 'export':
296
+ export_dataset(args.config)
297
+ elif args.command == 'full-pipeline':
298
+ full_pipeline(args.config, args.append)
299
+ except Exception as e:
300
+ print(f"\n✗ Error: {e}")
301
+ sys.exit(1)
302
+
303
+
304
+ if __name__ == "__main__":
305
+ main()
dataset/scripts/export_training_data.py ADDED
@@ -0,0 +1,191 @@
1
+ """
2
+ Export validated dataset to train/val/test splits.
3
+
4
+ This script:
5
+ 1. Loads validated samples
6
+ 2. Splits into train (80%), val (10%), test (10%)
7
+ 3. Ensures balanced splits across task families
8
+ 4. Exports to JSONL format
9
+
10
+ Output:
11
+ - output/train.jsonl (80% of samples)
12
+ - output/val.jsonl (10% of samples)
13
+ - output/test.jsonl (10% of samples)
14
+ """
15
+
16
+ import json
17
+ import random
18
+ from pathlib import Path
19
+ from typing import List, Dict, Any
20
+ from collections import defaultdict
21
+
22
+
23
+ def load_samples(jsonl_path: Path) -> List[Dict[str, Any]]:
24
+ """Load samples from JSONL file."""
25
+ samples = []
26
+ with open(jsonl_path, 'r') as f:
27
+ for line in f:
28
+ samples.append(json.loads(line))
29
+ return samples
30
+
31
+
32
+ def stratified_split(
33
+ samples: List[Dict[str, Any]],
34
+ train_ratio: float = 0.8,
35
+ val_ratio: float = 0.1,
36
+ test_ratio: float = 0.1,
37
+ random_seed: int = 42
38
+ ) -> tuple[List[Dict], List[Dict], List[Dict]]:
39
+ """Split samples by task family to ensure balanced distribution."""
40
+
41
+ random.seed(random_seed)
42
+
43
+ # Group by task family
44
+ by_family = defaultdict(list)
45
+ for sample in samples:
46
+ family = sample['metadata']['task_family']
47
+ by_family[family].append(sample)
48
+
49
+ train_samples = []
50
+ val_samples = []
51
+ test_samples = []
52
+
53
+ # Split each family
54
+ for family, family_samples in by_family.items():
55
+ # Shuffle
56
+ random.shuffle(family_samples)
57
+
58
+ n = len(family_samples)
59
+ n_train = int(n * train_ratio)
60
+ n_val = int(n * val_ratio)
61
+
62
+ train_samples.extend(family_samples[:n_train])
63
+ val_samples.extend(family_samples[n_train:n_train + n_val])
64
+ test_samples.extend(family_samples[n_train + n_val:])
65
+
66
+ # Shuffle final splits
67
+ random.shuffle(train_samples)
68
+ random.shuffle(val_samples)
69
+ random.shuffle(test_samples)
70
+
71
+ return train_samples, val_samples, test_samples
72
+
73
+
74
+ def save_split(samples: List[Dict[str, Any]], output_path: Path):
75
+ """Save samples to JSONL file."""
76
+ with open(output_path, 'w') as f:
77
+ for sample in samples:
78
+ f.write(json.dumps(sample) + '\n')
79
+
80
+
81
+ def print_split_stats(split_name: str, samples: List[Dict[str, Any]]):
82
+ """Print statistics for a split."""
83
+ families = defaultdict(int)
84
+ for sample in samples:
85
+ family = sample['metadata']['task_family']
86
+ families[family] += 1
87
+
88
+ print(f"\n{split_name}:")
89
+ print(f" Total: {len(samples)}")
90
+ for family, count in sorted(families.items()):
91
+ print(f" {family:20s}: {count:3d}")
92
+
93
+
94
+ def print_country_stats(samples: List[Dict[str, Any]]):
95
+ """Print country distribution statistics."""
96
+ country_counts = defaultdict(int)
97
+
98
+ # Extract countries from selected/answer candidates only
99
+ for sample in samples:
100
+ selected_ids = set(sample.get('target', {}).get('selected_candidates', []))
101
+ countries_in_sample = set()
102
+ for candidate in sample.get('candidates', []):
103
+ if candidate.get('candidate_id') in selected_ids:
104
+ country = candidate.get('country')
105
+ if country:
106
+ countries_in_sample.add(country)
107
+
108
+ # Count each unique country once per sample
109
+ for country in countries_in_sample:
110
+ country_counts[country] += 1
111
+
112
+ if not country_counts:
113
+ print("\nNo country information found in samples")
114
+ return
115
+
116
+ print(f"\nCOUNTRY DISTRIBUTION:")
117
+ print(f" Total unique countries: {len(country_counts)}")
118
+ print(f"\n Top countries by sample count:")
119
+
120
+ # Sort by count descending
121
+ sorted_countries = sorted(country_counts.items(), key=lambda x: x[1], reverse=True)
122
+
123
+ # Show top 20
124
+ for country, count in sorted_countries[:20]:
125
+ percentage = (count / len(samples)) * 100
126
+ print(f" {country:3s}: {count:4d} samples ({percentage:5.1f}%)")
127
+
128
+ if len(sorted_countries) > 20:
129
+ remaining = len(sorted_countries) - 20
130
+ remaining_count = sum(c for _, c in sorted_countries[20:])
131
+ print(f" ... and {remaining} more countries ({remaining_count} samples)")
132
+
133
+
134
+ def main():
135
+ """Export dataset splits."""
136
+
137
+ script_dir = Path(__file__).parent
138
+ output_dir = script_dir.parent / "output"
139
+
140
+ validated_file = output_dir / "dataset_validated.jsonl"
141
+ train_file = output_dir / "train.jsonl"
142
+ val_file = output_dir / "val.jsonl"
143
+ test_file = output_dir / "test.jsonl"
144
+
145
+ if not validated_file.exists():
146
+ print(f"Error: {validated_file} not found. Run 05_validate_dataset.py first.")
147
+ return
148
+
149
+ # Load validated samples
150
+ print("Loading validated samples...")
151
+ samples = load_samples(validated_file)
152
+ print(f"Loaded {len(samples)} samples")
153
+
154
+ # Split
155
+ print("\nSplitting dataset (80/10/10)...")
156
+ train_samples, val_samples, test_samples = stratified_split(samples)
157
+
158
+ # Save splits
159
+ print("\nSaving splits...")
160
+ save_split(train_samples, train_file)
161
+ save_split(val_samples, val_file)
162
+ save_split(test_samples, test_file)
163
+
164
+ print(f" Train: {train_file} ({len(train_samples)} samples)")
165
+ print(f" Val: {val_file} ({len(val_samples)} samples)")
166
+ print(f" Test: {test_file} ({len(test_samples)} samples)")
167
+
168
+ # Print statistics
169
+ print("\n" + "=" * 60)
170
+ print("SPLIT STATISTICS")
171
+ print("=" * 60)
172
+
173
+ print_split_stats("TRAIN", train_samples)
174
+ print_split_stats("VAL", val_samples)
175
+ print_split_stats("TEST", test_samples)
176
+
177
+ # Print country distribution
178
+ print("\n" + "=" * 60)
179
+ print("GEOGRAPHIC DISTRIBUTION")
180
+ print("=" * 60)
181
+ print_country_stats(samples)
182
+
183
+ print("\n✓ Export complete")
184
+ print(f"\nReady for training!")
185
+ print(f" Training data: {train_file}")
186
+ print(f" Validation data: {val_file}")
187
+ print(f" Test data: {test_file}")
188
+
189
+
190
+ if __name__ == "__main__":
191
+ main()
dataset/scripts/generate_samples.py ADDED
@@ -0,0 +1,1091 @@
1
+ """
2
+ Generate synthetic training samples for text-to-SQL task.
3
+
4
+ This script:
5
+ 1. Loads relation tables and entity inventories
6
+ 2. For each SQL template, samples valid anchors
7
+ 3. Renders and executes SQL to verify it works
8
+ 4. Builds candidate lists with controlled distractors
9
+ 5. Generates natural language questions using LLM
10
+ 6. Saves complete training samples
11
+
12
+ Output:
13
+ - output/samples/sample_*.json (individual samples)
14
+ - output/dataset_raw.jsonl (all samples)
15
+ """
16
+
17
+ import json
18
+ import random
19
+ import warnings
20
+ from pathlib import Path
21
+ from typing import List, Dict, Any, Optional
22
+ from concurrent.futures import ProcessPoolExecutor, as_completed
23
+ from functools import partial
24
+
25
+ import duckdb
26
+ import pandas as pd
27
+ from pydantic import BaseModel
28
+
29
+ # Suppress warnings
30
+ warnings.filterwarnings('ignore')
31
+
32
+ from gazet.config import DIVISIONS_AREA_PATH, NATURAL_EARTH_PATH
33
+
34
+ # Configurable parameters (can be overridden by CLI)
35
+ TARGET_COUNTS = None # Will be set in main() or by CLI
36
+ MAX_WORKERS = 8
37
+ RETRY_MULTIPLIER = 2
38
+ APPEND_MODE = False
39
+
40
+ # Import templates from same directory
41
+ from . import sql_templates
42
+ TEMPLATES = sql_templates.TEMPLATES
43
+ SQLTemplate = sql_templates.SQLTemplate
44
+ get_templates_by_family = sql_templates.get_templates_by_family
45
+
46
+
47
+ class Candidate(BaseModel):
48
+ """Candidate entity for grounding."""
49
+ candidate_id: str
50
+ source: str
51
+ id: str
52
+ name: str
53
+ subtype: Optional[str] = None
54
+ country: Optional[str] = None
55
+ region: Optional[str] = None
56
+ admin_level: Optional[int] = None
57
+ similarity: float = 0.0
58
+
59
+
60
+ class TrainingSample(BaseModel):
61
+ """Complete training sample."""
62
+ id: str
63
+ question: str
64
+ candidates: List[Candidate]
65
+ target: Dict[str, Any]
66
+ metadata: Dict[str, Any]
67
+
68
+
69
+ def load_relation_tables(intermediate_dir: Path, quiet: bool = False) -> Dict[str, pd.DataFrame]:
70
+ """Load all precomputed relation tables."""
71
+ tables = {}
72
+
73
+ for file in intermediate_dir.glob("*.parquet"):
74
+ name = file.stem
75
+ tables[name] = pd.read_parquet(file)
76
+ if not quiet:
77
+ print(f" {name}: {len(tables[name])} rows")
78
+
79
+ return tables
80
+
81
+
82
+ def sample_adjacency_anchor(adjacency_df: pd.DataFrame) -> Optional[Dict[str, Any]]:
83
+ """Sample a random adjacency pair."""
84
+ if adjacency_df.empty:
85
+ return None
86
+
87
+ row = adjacency_df.sample(n=1).iloc[0]
88
+ return {
89
+ 'anchor_id': row['anchor_id'],
90
+ 'anchor_name': row['anchor_name'],
91
+ 'anchor_subtype': row['anchor_subtype'],
92
+ 'anchor_country': row.get('anchor_country'), # May not exist in all tables
93
+ 'target_subtype': row.get('target_subtype')
94
+ }
95
+
96
+
97
+ def sample_intersection_anchor(intersection_df: pd.DataFrame) -> Optional[Dict[str, Any]]:
98
+ """Sample a random intersection pair."""
99
+ if intersection_df.empty:
100
+ return None
101
+
102
+ row = intersection_df.sample(n=1).iloc[0]
103
+ return {
104
+ 'anchor_id': row['anchor_id'],
105
+ 'anchor_name': row['anchor_name'],
106
+ 'anchor_subtype': row['anchor_subtype'],
107
+ 'target_id': row.get('target_id'),
108
+ 'target_name': row.get('target_name'),
109
+ 'target_subtype': row.get('target_subtype')
110
+ }
111
+
112
+
113
+ def sample_containment_anchor(containment_df: pd.DataFrame) -> Optional[Dict[str, Any]]:
114
+ """Sample a random containment pair."""
115
+ if containment_df.empty:
116
+ return None
117
+
118
+ row = containment_df.sample(n=1).iloc[0]
119
+ return {
120
+ 'container_id': row['container_id'],
121
+ 'container_name': row['container_name'],
122
+ 'container_subtype': row['container_subtype'],
123
+ 'contained_subtype': row['contained_subtype']
124
+ }
125
+
126
+
127
+ def sample_cross_source_anchor(cross_source_df: pd.DataFrame) -> Optional[Dict[str, Any]]:
128
+ """Sample a random cross-source relation."""
129
+ if cross_source_df.empty:
130
+ return None
131
+
132
+ row = cross_source_df.sample(n=1).iloc[0]
133
+ return {
134
+ 'division_id': row['division_id'],
135
+ 'division_name': row['division_name'],
136
+ 'division_subtype': row['division_subtype'],
137
+ 'natural_id': row['natural_id'],
138
+ 'natural_name': row['natural_name'],
139
+ 'natural_subtype': row['natural_subtype'],
140
+ 'relation_type': row['relation_type']
141
+ }
142
+
143
+
144
+ def build_candidate_list(
145
+ con: duckdb.DuckDBPyConnection,
146
+ anchor_id: str,
147
+ anchor_name: str,
148
+ anchor_source: str,
149
+ num_candidates: int = 10,
150
+ difficulty: str = "medium"
151
+ ) -> List[Candidate]:
152
+ """Build candidate list with true anchor + distractors."""
153
+
154
+ # Helper to convert pandas NA to None
155
+ def safe_get(row, key, default=None):
156
+ val = row.get(key, default)
157
+ return None if pd.isna(val) else val
158
+
159
+ # Get the true anchor
160
+ if anchor_source == "divisions_area":
161
+ query = """
162
+ SELECT
163
+ id,
164
+ names."primary" AS name,
165
+ subtype,
166
+ country,
167
+ region,
168
+ admin_level
169
+ FROM read_parquet(?)
170
+ WHERE id = ?
171
+ """
172
+ anchor_row = con.execute(query, [DIVISIONS_AREA_PATH, anchor_id]).fetchdf().iloc[0]
173
+ else:
174
+ query = """
175
+ SELECT
176
+ id,
177
+ names."primary" AS name,
178
+ subtype
179
+ FROM read_parquet(?)
180
+ WHERE id = ?
181
+ """
182
+ anchor_row = con.execute(query, [NATURAL_EARTH_PATH, anchor_id]).fetchdf().iloc[0]
183
+
184
+ # Build true candidate
185
+ true_candidate = Candidate(
186
+ candidate_id="c1",
187
+ source=anchor_source,
188
+ id=anchor_id,
189
+ name=safe_get(anchor_row, 'name'),
190
+ subtype=safe_get(anchor_row, 'subtype'),
191
+ country=safe_get(anchor_row, 'country'),
192
+ region=safe_get(anchor_row, 'region'),
193
+ admin_level=safe_get(anchor_row, 'admin_level'),
194
+ similarity=1.0
195
+ )
196
+
197
+ # Build distractors based on difficulty
198
+ distractors = build_distractors(
199
+ con,
200
+ anchor_name,
201
+ anchor_source,
202
+ anchor_id,
203
+ num_candidates - 1,
204
+ difficulty
205
+ )
206
+
207
+ # Combine and shuffle
208
+ candidates = [true_candidate] + distractors
209
+ random.shuffle(candidates)
210
+
211
+ # Reassign candidate IDs after shuffling
212
+ for i, cand in enumerate(candidates, 1):
213
+ cand.candidate_id = f"c{i}"
214
+
215
+ return candidates
216
+
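+ # Usage sketch (ids and names below are hypothetical): with the spatial extension
+ # loaded on `con`, this returns the true entity plus fuzzy-name distractors,
+ # shuffled and re-labelled c1..cN:
+ #   candidates = build_candidate_list(con, anchor_id="div_123", anchor_name="Quito",
+ #                                     anchor_source="divisions_area", num_candidates=10)
+ #   [c.name for c in candidates]  # e.g. ["Quito", "Quito Canton", ...] in random order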
217
+
218
+ def build_distractors(
219
+ con: duckdb.DuckDBPyConnection,
220
+ anchor_name: str,
221
+ anchor_source: str,
222
+ exclude_id: str,
223
+ num_distractors: int,
224
+ difficulty: str
225
+ ) -> List[Candidate]:
226
+ """Build distractor candidates using fuzzy search."""
227
+
228
+ # Fuzzy search for similar names
229
+ if anchor_source == "divisions_area":
230
+ query = """
231
+ SELECT
232
+ id,
233
+ names."primary" AS name,
234
+ subtype,
235
+ country,
236
+ region,
237
+ admin_level,
238
+ jaro_winkler_similarity(lower(names."primary"), lower(?)) AS similarity
239
+ FROM read_parquet(?)
240
+ WHERE id != ?
241
+ AND names."primary" IS NOT NULL
242
+ ORDER BY similarity DESC
243
+ LIMIT ?
244
+ """
245
+ df = con.execute(query, [
246
+ anchor_name, DIVISIONS_AREA_PATH, exclude_id, num_distractors
247
+ ]).fetchdf()
248
+ source = "divisions_area"
249
+ else:
250
+ query = """
251
+ SELECT
252
+ id,
253
+ names."primary" AS name,
254
+ subtype,
255
+ jaro_winkler_similarity(lower(names."primary"), lower(?)) AS similarity
256
+ FROM read_parquet(?)
257
+ WHERE id != ?
258
+ AND names."primary" IS NOT NULL
259
+ ORDER BY similarity DESC
260
+ LIMIT ?
261
+ """
262
+ df = con.execute(query, [
263
+ anchor_name, NATURAL_EARTH_PATH, exclude_id, num_distractors
264
+ ]).fetchdf()
265
+ source = "natural_earth"
266
+
267
+ # Helper to convert pandas NA to None
268
+ def safe_get(row, key, default=None):
269
+ val = row.get(key, default)
270
+ return None if pd.isna(val) else val
271
+
272
+ distractors = []
273
+ for _, row in df.iterrows():
274
+ distractors.append(Candidate(
275
+ candidate_id="temp", # Will be reassigned
276
+ source=source,
277
+ id=row['id'],
278
+ name=safe_get(row, 'name'),
279
+ subtype=safe_get(row, 'subtype'),
280
+ country=safe_get(row, 'country'),
281
+ region=safe_get(row, 'region'),
282
+ admin_level=safe_get(row, 'admin_level'),
283
+ similarity=float(row['similarity'])
284
+ ))
285
+
286
+ return distractors
287
+
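+ # Note: jaro_winkler_similarity is DuckDB's built-in string-similarity function
+ # (1.0 for identical strings, lower for dissimilar ones), so the highest-ranked
+ # distractors are the closest-sounding names in the same source, e.g.
+ #   SELECT jaro_winkler_similarity('Quito', 'Quito');  -- 1.0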
288
+
289
+ def generate_adjacency_sample(
290
+ con: duckdb.DuckDBPyConnection,
291
+ adjacency_df: pd.DataFrame,
292
+ sample_id: str
293
+ ) -> Optional[TrainingSample]:
294
+ """Generate a sample for adjacency task."""
295
+
296
+ anchor = sample_adjacency_anchor(adjacency_df)
297
+ if not anchor:
298
+ return None
299
+
300
+ # Build SQL
301
+ sql = f"""WITH a AS (
302
+ SELECT geometry FROM read_parquet('{DIVISIONS_AREA_PATH}')
303
+ WHERE id = '{anchor['anchor_id']}'
304
+ )
305
+ SELECT b.id, b.names."primary" AS name, b.geometry
306
+ FROM read_parquet('{DIVISIONS_AREA_PATH}') AS b, a
307
+ WHERE b.id != '{anchor['anchor_id']}'
308
+ AND b.subtype = '{anchor['target_subtype']}'
309
+ AND ST_Touches(a.geometry, b.geometry)"""
310
+
311
+ # Execute to verify
312
+ try:
313
+ result = con.execute(sql).fetchdf()
314
+ if result.empty:
315
+ return None
316
+ except Exception as e:
317
+ print(f"SQL execution failed: {e}")
318
+ return None
319
+
320
+ # Build candidates
321
+ candidates = build_candidate_list(
322
+ con,
323
+ anchor['anchor_id'],
324
+ anchor['anchor_name'],
325
+ "divisions_area",
326
+ num_candidates=10,
327
+ difficulty="medium"
328
+ )
329
+
330
+ # Find which candidate is the true anchor
331
+ selected_candidate_ids = [c.candidate_id for c in candidates if c.id == anchor['anchor_id']]
332
+
333
+ # Generate question
334
+ question = f"Which {anchor['target_subtype']}s border {anchor['anchor_name']}?"
335
+
336
+ return TrainingSample(
337
+ id=sample_id,
338
+ question=question,
339
+ candidates=candidates,
340
+ target={
341
+ "selected_candidates": selected_candidate_ids,
342
+ "sql": sql
343
+ },
344
+ metadata={
345
+ "task_family": "adjacency",
346
+ "sql_difficulty": "medium",
347
+ "grounding_difficulty": "medium",
348
+ "template_id": "adj_02",
349
+ "num_candidates": len(candidates),
350
+ "anchor_source": "divisions_area",
351
+ "sql_verified": True
352
+ }
353
+ )
354
+
355
+
356
+ def generate_containment_sample(
357
+ con: duckdb.DuckDBPyConnection,
358
+ containment_df: pd.DataFrame,
359
+ sample_id: str
360
+ ) -> Optional[TrainingSample]:
361
+ """Generate a sample for containment task."""
362
+
363
+ anchor = sample_containment_anchor(containment_df)
364
+ if not anchor:
365
+ return None
366
+
367
+ # Build SQL
368
+ sql = f"""WITH a AS (
369
+ SELECT geometry FROM read_parquet('{DIVISIONS_AREA_PATH}')
370
+ WHERE id = '{anchor['container_id']}'
371
+ )
372
+ SELECT b.id, b.names."primary" AS name, b.geometry
373
+ FROM read_parquet('{DIVISIONS_AREA_PATH}') AS b, a
374
+ WHERE b.id != '{anchor['container_id']}'
375
+ AND b.subtype = '{anchor['contained_subtype']}'
376
+ AND ST_Within(b.geometry, a.geometry)"""
377
+
378
+ # Execute to verify
379
+ try:
380
+ result = con.execute(sql).fetchdf()
381
+ if result.empty:
382
+ return None
383
+ except Exception as e:
384
+ print(f"SQL execution failed: {e}")
385
+ return None
386
+
387
+ # Build candidates
388
+ candidates = build_candidate_list(
389
+ con,
390
+ anchor['container_id'],
391
+ anchor['container_name'],
392
+ "divisions_area",
393
+ num_candidates=10,
394
+ difficulty="medium"
395
+ )
396
+
397
+ # Find which candidate is the true anchor
398
+ selected_candidate_ids = [c.candidate_id for c in candidates if c.id == anchor['container_id']]
399
+
400
+ # Generate question
401
+ question = f"What {anchor['contained_subtype']}s are in {anchor['container_name']}?"
402
+
403
+ return TrainingSample(
404
+ id=sample_id,
405
+ question=question,
406
+ candidates=candidates,
407
+ target={
408
+ "selected_candidates": selected_candidate_ids,
409
+ "sql": sql
410
+ },
411
+ metadata={
412
+ "task_family": "containment",
413
+ "sql_difficulty": "medium",
414
+ "grounding_difficulty": "medium",
415
+ "template_id": "contain_01",
416
+ "num_candidates": len(candidates),
417
+ "anchor_source": "divisions_area",
418
+ "sql_verified": True
419
+ }
420
+ )
421
+
422
+
423
+ def sample_random_entity(
424
+ con: duckdb.DuckDBPyConnection,
425
+ inventory_df: pd.DataFrame,
426
+ source: str
427
+ ) -> Optional[Dict[str, Any]]:
428
+ """Sample a random entity from inventory."""
429
+ if inventory_df.empty:
430
+ return None
431
+
432
+ row = inventory_df.sample(n=1).iloc[0]
433
+ return {
434
+ 'id': row['id'],
435
+ 'name': row['name'],
436
+ 'subtype': row.get('subtype'),
437
+ 'country': row.get('country'),
438
+ 'source': source
439
+ }
440
+
441
+
442
+ def generate_template_based_sample(
443
+ con: duckdb.DuckDBPyConnection,
444
+ template: SQLTemplate,
445
+ tables: Dict[str, pd.DataFrame],
446
+ sample_id: str
447
+ ) -> Optional[TrainingSample]:
448
+ """Generate a sample based on a SQL template."""
449
+
450
+ # Sample anchor based on template requirements
451
+ if template.family == "direct_lookup":
452
+ # Just pick a random entity
453
+ if template.anchor_source == "divisions_area":
454
+ anchor = sample_random_entity(con, tables['divisions_area_inventory'], 'divisions_area')
455
+ else:
456
+ anchor = sample_random_entity(con, tables['natural_earth_inventory'], 'natural_earth')
457
+
458
+ if not anchor:
459
+ return None
460
+
461
+ # Render SQL
462
+ sql = template.sql_template.format(
463
+ DIVISIONS_AREA_PATH=DIVISIONS_AREA_PATH,
464
+ NATURAL_EARTH_PATH=NATURAL_EARTH_PATH,
465
+ anchor_id=anchor['id']
466
+ )
467
+
468
+ # Build candidates
469
+ candidates = build_candidate_list(
470
+ con, anchor['id'], anchor['name'], anchor['source'],
471
+ num_candidates=10, difficulty="easy"
472
+ )
473
+
474
+ # Question
475
+ question = random.choice(template.question_hints).format(anchor_name=anchor['name'])
476
+
477
+ elif template.family == "adjacency":
478
+ anchor = sample_adjacency_anchor(tables['adjacency_pairs'])
479
+ if not anchor:
480
+ return None
481
+
482
+ sql = template.sql_template.format(
483
+ DIVISIONS_AREA_PATH=DIVISIONS_AREA_PATH,
484
+ NATURAL_EARTH_PATH=NATURAL_EARTH_PATH,
485
+ anchor_id=anchor['anchor_id'],
486
+ target_subtype=anchor['target_subtype']
487
+ )
488
+
489
+ candidates = build_candidate_list(
490
+ con, anchor['anchor_id'], anchor['anchor_name'], 'divisions_area',
491
+ num_candidates=10, difficulty="medium"
492
+ )
493
+
494
+ question = random.choice(template.question_hints).format(
495
+ anchor_name=anchor['anchor_name'],
496
+ target_subtype=anchor['target_subtype']
497
+ )
498
+
499
+ elif template.family == "containment":
500
+ anchor = sample_containment_anchor(tables['containment_pairs'])
501
+ if not anchor:
502
+ return None
503
+
504
+ sql = template.sql_template.format(
505
+ DIVISIONS_AREA_PATH=DIVISIONS_AREA_PATH,
506
+ NATURAL_EARTH_PATH=NATURAL_EARTH_PATH,
507
+ anchor_id=anchor['container_id'],
508
+ target_subtype=anchor['contained_subtype']
509
+ )
510
+
511
+ candidates = build_candidate_list(
512
+ con, anchor['container_id'], anchor['container_name'], 'divisions_area',
513
+ num_candidates=10, difficulty="medium"
514
+ )
515
+
516
+ question = random.choice(template.question_hints).format(
517
+ anchor_name=anchor['container_name'],
518
+ target_subtype=anchor['contained_subtype']
519
+ )
520
+
521
+ elif template.family == "intersection":
522
+ if template.anchor_source == "natural_earth":
523
+ anchor = sample_cross_source_anchor(tables['cross_source_relations'])
524
+ if not anchor:
525
+ return None
526
+
527
+ sql = template.sql_template.format(
528
+ NATURAL_EARTH_PATH=NATURAL_EARTH_PATH,
529
+ DIVISIONS_AREA_PATH=DIVISIONS_AREA_PATH,
530
+ anchor_id=anchor['natural_id'],
531
+ target_subtype='country'
532
+ )
533
+
534
+ candidates = build_candidate_list(
535
+ con, anchor['natural_id'], anchor['natural_name'], 'natural_earth',
536
+ num_candidates=10, difficulty="medium"
537
+ )
538
+
539
+ question = random.choice(template.question_hints).format(
540
+ anchor_name=anchor['natural_name'],
541
+ target_subtype='country'
542
+ )
543
+ else:
544
+ # Same-source intersection
545
+ anchor = sample_intersection_anchor(tables['intersection_pairs'])
546
+ if not anchor:
547
+ return None
548
+
549
+ # Use a generic subtype if not available
550
+ target_subtype = anchor.get('target_subtype') or 'region'
551
+
552
+ sql = template.sql_template.format(
553
+ DIVISIONS_AREA_PATH=DIVISIONS_AREA_PATH,
554
+ NATURAL_EARTH_PATH=NATURAL_EARTH_PATH,
555
+ anchor_id=anchor['anchor_id'],
556
+ target_subtype=target_subtype
557
+ )
558
+
559
+ candidates = build_candidate_list(
560
+ con, anchor['anchor_id'], anchor['anchor_name'], 'divisions_area',
561
+ num_candidates=10, difficulty="medium"
562
+ )
563
+
564
+ question = random.choice(template.question_hints).format(
565
+ anchor_name=anchor['anchor_name'],
566
+ target_subtype=target_subtype
567
+ )
568
+
569
+ elif template.family == "set_operations":
570
+ # Union of two entities
571
+ anchor1 = sample_random_entity(con, tables['divisions_area_inventory'], 'divisions_area')
572
+ anchor2 = sample_random_entity(con, tables['divisions_area_inventory'], 'divisions_area')
573
+
574
+ if not anchor1 or not anchor2:
575
+ return None
576
+
577
+ sql = template.sql_template.format(
578
+ DIVISIONS_AREA_PATH=DIVISIONS_AREA_PATH,
579
+ NATURAL_EARTH_PATH=NATURAL_EARTH_PATH,
580
+ anchor_id_1=anchor1['id'],
581
+ anchor_id_2=anchor2['id']
582
+ )
583
+
584
+ # Build candidates for both anchors
585
+ candidates1 = build_candidate_list(
586
+ con, anchor1['id'], anchor1['name'], 'divisions_area',
587
+ num_candidates=5, difficulty="medium"
588
+ )
589
+ candidates2 = build_candidate_list(
590
+ con, anchor2['id'], anchor2['name'], 'divisions_area',
591
+ num_candidates=5, difficulty="medium"
592
+ )
593
+
594
+ # Combine and deduplicate
595
+ candidates = candidates1 + candidates2
596
+ seen_ids = set()
597
+ unique_candidates = []
598
+ for c in candidates:
599
+ if c.id not in seen_ids:
600
+ unique_candidates.append(c)
601
+ seen_ids.add(c.id)
602
+ candidates = unique_candidates[:10]
603
+
604
+ # Reassign IDs
605
+ for i, c in enumerate(candidates, 1):
606
+ c.candidate_id = f"c{i}"
607
+
608
+ question = f"{anchor1['name']} and {anchor2['name']}"
609
+
610
+ elif template.family == "buffer":
611
+ # Buffer operations
612
+ if template.num_anchors == 1:
613
+ anchor = sample_random_entity(con, tables['divisions_area_inventory'], 'divisions_area')
614
+ if not anchor:
615
+ return None
616
+
617
+ buffer_degrees = random.choice([0.1, 0.5, 1.0])
618
+
619
+ sql = template.sql_template.format(
620
+ DIVISIONS_AREA_PATH=DIVISIONS_AREA_PATH,
621
+ NATURAL_EARTH_PATH=NATURAL_EARTH_PATH,
622
+ anchor_id=anchor['id'],
623
+ buffer_degrees=buffer_degrees
624
+ )
625
+
626
+ candidates = build_candidate_list(
627
+ con, anchor['id'], anchor['name'], 'divisions_area',
628
+ num_candidates=10, difficulty="medium"
629
+ )
630
+
631
+ question = random.choice(template.question_hints).format(
632
+ anchor_name=anchor['name'],
633
+ buffer_degrees=buffer_degrees
634
+ )
635
+ else:
636
+ # Two anchor buffer
637
+ anchor1 = sample_random_entity(con, tables['divisions_area_inventory'], 'divisions_area')
638
+ anchor2 = sample_random_entity(con, tables['divisions_area_inventory'], 'divisions_area')
639
+
640
+ if not anchor1 or not anchor2:
641
+ return None
642
+
643
+ buffer_degrees = random.choice([0.1, 0.5])
644
+
645
+ sql = template.sql_template.format(
646
+ DIVISIONS_AREA_PATH=DIVISIONS_AREA_PATH,
647
+ NATURAL_EARTH_PATH=NATURAL_EARTH_PATH,
648
+ anchor_id_1=anchor1['id'],
649
+ anchor_id_2=anchor2['id'],
650
+ buffer_degrees=buffer_degrees
651
+ )
652
+
653
+ candidates1 = build_candidate_list(
654
+ con, anchor1['id'], anchor1['name'], 'divisions_area',
655
+ num_candidates=5, difficulty="medium"
656
+ )
657
+ candidates2 = build_candidate_list(
658
+ con, anchor2['id'], anchor2['name'], 'divisions_area',
659
+ num_candidates=5, difficulty="medium"
660
+ )
661
+
662
+ candidates = candidates1 + candidates2
663
+ seen_ids = set()
664
+ unique_candidates = []
665
+ for c in candidates:
666
+ if c.id not in seen_ids:
667
+ unique_candidates.append(c)
668
+ seen_ids.add(c.id)
669
+ candidates = unique_candidates[:10]
670
+
671
+ for i, c in enumerate(candidates, 1):
672
+ c.candidate_id = f"c{i}"
673
+
674
+ question = random.choice(template.question_hints).format(
675
+ anchor_1_name=anchor1['name'],
676
+ anchor_2_name=anchor2['name'],
677
+ buffer_degrees=buffer_degrees
678
+ )
679
+
680
+ elif template.family == "partial_selection":
681
+ # Partial selection (northern half, clipping, etc.)
682
+ anchor = sample_random_entity(con, tables['divisions_area_inventory'], 'divisions_area')
683
+ if not anchor:
684
+ return None
685
+
686
+ if template.num_anchors == 1:
687
+ sql = template.sql_template.format(
688
+ DIVISIONS_AREA_PATH=DIVISIONS_AREA_PATH,
689
+ NATURAL_EARTH_PATH=NATURAL_EARTH_PATH,
690
+ anchor_id=anchor['id']
691
+ )
692
+
693
+ question = random.choice(template.question_hints).format(anchor_name=anchor['name'])
694
+ else:
695
+ # Mixed source clipping
696
+ clip_feature = sample_random_entity(con, tables['natural_earth_inventory'], 'natural_earth')
697
+ if not clip_feature:
698
+ return None
699
+
700
+ sql = template.sql_template.format(
701
+ DIVISIONS_AREA_PATH=DIVISIONS_AREA_PATH,
702
+ NATURAL_EARTH_PATH=NATURAL_EARTH_PATH,
703
+ anchor_id=anchor['id'],
704
+ clip_feature_id=clip_feature['id']
705
+ )
706
+
707
+ question = random.choice(template.question_hints).format(
708
+ anchor_name=anchor['name'],
709
+ clip_feature_name=clip_feature['name']
710
+ )
711
+
712
+ candidates = build_candidate_list(
713
+ con, anchor['id'], anchor['name'], 'divisions_area',
714
+ num_candidates=10, difficulty="hard"
715
+ )
716
+
717
+ elif template.family == "aggregation":
718
+ # Aggregation queries (e.g., largest N localities in a region)
719
+ top_n = random.choice([3, 5, 10])
720
+
721
+ # Check if this is a country-level query (agg_04, agg_05)
722
+ if template.template_id in ['agg_04', 'agg_05']:
723
+ # Country-level aggregation
724
+ anchor = sample_random_entity(con, tables['divisions_area_inventory'], 'divisions_area')
725
+ if not anchor:
726
+ return None
727
+
728
+ country = anchor.get('country', 'EC')
729
+ target_subtype = random.choice(['locality', 'region'])
730
+
731
+ sql = template.sql_template.format(
732
+ DIVISIONS_AREA_PATH=DIVISIONS_AREA_PATH,
733
+ NATURAL_EARTH_PATH=NATURAL_EARTH_PATH,
734
+ country=country,
735
+ target_subtype=target_subtype,
736
+ top_n=top_n
737
+ )
738
+
739
+ candidates = build_candidate_list(
740
+ con, anchor['id'], anchor['name'], 'divisions_area',
741
+ num_candidates=10, difficulty="hard"
742
+ )
743
+
744
+ question = random.choice(template.question_hints).format(
745
+ top_n=top_n,
746
+ target_subtype=target_subtype,
747
+ country=country
748
+ )
749
+ else:
750
+ # Container-based aggregation (within a region)
751
+ anchor = sample_containment_anchor(tables['containment_pairs'])
752
+ if not anchor:
753
+ return None
754
+
755
+ target_subtype = anchor.get('contained_subtype', 'locality')
756
+
757
+ sql = template.sql_template.format(
758
+ DIVISIONS_AREA_PATH=DIVISIONS_AREA_PATH,
759
+ NATURAL_EARTH_PATH=NATURAL_EARTH_PATH,
760
+ anchor_id=anchor['container_id'],
761
+ target_subtype=target_subtype,
762
+ top_n=top_n
763
+ )
764
+
765
+ candidates = build_candidate_list(
766
+ con, anchor['container_id'], anchor['container_name'], 'divisions_area',
767
+ num_candidates=10, difficulty="hard"
768
+ )
769
+
770
+ question = random.choice(template.question_hints).format(
771
+ top_n=top_n,
772
+ target_subtype=target_subtype,
773
+ anchor_name=anchor['container_name']
774
+ )
775
+
776
+ else:
777
+ # Skip unsupported families
778
+ return None
779
+
780
+ # Execute SQL to verify
781
+ try:
782
+ result = con.execute(sql).fetchdf()
783
+ if result.empty:
784
+ return None
785
+     except Exception:
786
+         # SQL failures are simply dropped here; the batch worker counts the attempt as failed
787
+ return None
788
+
789
+ # Find selected candidates
790
+     if template.family == "set_operations" or (template.family == "buffer" and template.num_anchors == 2):
791
+         # Two-anchor branches define anchor1/anchor2 (no single `anchor`), so both are targets
+         selected_candidate_ids = [c.candidate_id for c in candidates if c.id in [anchor1['id'], anchor2['id']]]
792
+ else:
793
+ anchor_id_to_find = anchor.get('anchor_id') or anchor.get('container_id') or anchor.get('natural_id') or anchor.get('id')
794
+ selected_candidate_ids = [c.candidate_id for c in candidates if c.id == anchor_id_to_find]
795
+
796
+ return TrainingSample(
797
+ id=sample_id,
798
+ question=question,
799
+ candidates=candidates,
800
+ target={
801
+ "selected_candidates": selected_candidate_ids,
802
+ "sql": sql
803
+ },
804
+ metadata={
805
+ "task_family": template.family,
806
+ "sql_difficulty": template.sql_difficulty,
807
+ "grounding_difficulty": "medium",
808
+ "template_id": template.template_id,
809
+ "num_candidates": len(candidates),
810
+ "anchor_source": template.anchor_source,
811
+ "sql_verified": True
812
+ }
813
+ )
814
+
815
+
816
+ def generate_cross_source_sample(
817
+ con: duckdb.DuckDBPyConnection,
818
+ cross_source_df: pd.DataFrame,
819
+ sample_id: str
820
+ ) -> Optional[TrainingSample]:
821
+ """Generate a sample for cross-source intersection task."""
822
+
823
+ anchor = sample_cross_source_anchor(cross_source_df)
824
+ if not anchor:
825
+ return None
826
+
827
+ # Build SQL (natural feature -> divisions)
828
+ sql = f"""WITH a AS (
829
+ SELECT geometry FROM read_parquet('{NATURAL_EARTH_PATH}')
830
+ WHERE id = '{anchor['natural_id']}'
831
+ )
832
+ SELECT b.id, b.names."primary" AS name, b.geometry
833
+ FROM read_parquet('{DIVISIONS_AREA_PATH}') AS b, a
834
+ WHERE b.subtype = 'country'
835
+ AND ST_Intersects(b.geometry, a.geometry)"""
836
+
837
+ # Execute to verify
838
+ try:
839
+ result = con.execute(sql).fetchdf()
840
+ if result.empty:
841
+ return None
842
+ except Exception as e:
843
+ print(f"SQL execution failed: {e}")
844
+ return None
845
+
846
+ # Build candidates for natural feature
847
+ candidates = build_candidate_list(
848
+ con,
849
+ anchor['natural_id'],
850
+ anchor['natural_name'],
851
+ "natural_earth",
852
+ num_candidates=10,
853
+ difficulty="medium"
854
+ )
855
+
856
+ # Find which candidate is the true anchor
857
+ selected_candidate_ids = [c.candidate_id for c in candidates if c.id == anchor['natural_id']]
858
+
859
+ # Generate question
860
+ question = f"Which countries intersect the {anchor['natural_name']}?"
861
+
862
+ return TrainingSample(
863
+ id=sample_id,
864
+ question=question,
865
+ candidates=candidates,
866
+ target={
867
+ "selected_candidates": selected_candidate_ids,
868
+ "sql": sql
869
+ },
870
+ metadata={
871
+ "task_family": "intersection",
872
+ "sql_difficulty": "medium-hard",
873
+ "grounding_difficulty": "medium",
874
+ "template_id": "intersect_02",
875
+ "num_candidates": len(candidates),
876
+ "anchor_source": "natural_earth",
877
+ "sql_verified": True
878
+ }
879
+ )
880
+
881
+
882
+ def generate_sample_batch_worker(args):
883
+ """Worker function that processes a batch of work items with a single DuckDB connection.
884
+
885
+     Initializes DuckDB, the spatial extension, and the relation tables
886
+ ONCE per batch, then processes all items sequentially.
887
+ """
888
+ from pathlib import Path
889
+
890
+ work_items, intermediate_dir_str = args
891
+
892
+ # Convert string back to Path
893
+ intermediate_dir = Path(intermediate_dir_str)
894
+
895
+ # Initialize DuckDB ONCE for the entire batch
896
+ con = duckdb.connect()
897
+ con.execute("SET enable_progress_bar=false")
898
+ con.execute("INSTALL spatial")
899
+ con.execute("LOAD spatial")
900
+
901
+ # Load relation tables ONCE
902
+ tables = load_relation_tables(intermediate_dir, quiet=True)
903
+
904
+ # Process all items in batch
905
+ results = []
906
+ for family, template_dict, sample_id, _ in work_items:
907
+ # Reconstruct template from dict (sql_templates is already imported at module level)
908
+ template = sql_templates.SQLTemplate(**template_dict)
909
+ try:
910
+ sample = generate_template_based_sample(con, template, tables, sample_id)
911
+ if sample:
912
+ results.append((sample, family, template.template_id, None))
913
+ else:
914
+ results.append((None, family, template.template_id, "Empty result"))
915
+ except Exception as e:
916
+ results.append((None, family, template_dict.get('template_id', 'unknown'), str(e)))
917
+
918
+ con.close()
919
+ return results
920
+
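+ # Sketch of how a batch reaches this worker (values are illustrative; see main() below):
+ #   items = [("adjacency", template_dict, "sample_001", str(intermediate_dir)), ...]
+ #   results = generate_sample_batch_worker((items, str(intermediate_dir)))
+ #   # -> [(TrainingSample | None, family, template_id, error | None), ...]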
921
+
922
+ def main():
923
+ """Generate training samples."""
924
+ global TARGET_COUNTS, MAX_WORKERS, RETRY_MULTIPLIER, APPEND_MODE
925
+
926
+ # Setup paths
927
+ script_dir = Path(__file__).parent
928
+ intermediate_dir = script_dir.parent / "intermediate"
929
+ output_dir = script_dir.parent / "output"
930
+
931
+ output_dir.mkdir(exist_ok=True, parents=True)
932
+
933
+ # Load relation tables once to check availability
934
+ print("Loading relation tables...")
935
+ tables = load_relation_tables(intermediate_dir, quiet=False)
936
+
937
+ # Use configured target counts or defaults
938
+ if TARGET_COUNTS is None:
939
+ target_counts = {
940
+ 'direct_lookup': 100,
941
+ 'adjacency': 200,
942
+ 'containment': 100,
943
+ 'intersection': 150,
944
+ 'buffer': 100,
945
+ 'set_operations': 150,
946
+ 'partial_selection': 100,
947
+ 'aggregation': 100
948
+ }
949
+ else:
950
+ target_counts = TARGET_COUNTS
951
+
952
+ # Load existing samples if in append mode
953
+ existing_samples = []
954
+ existing_sample_ids = set()
955
+ jsonl_file = output_dir / "dataset_raw.jsonl"
956
+
957
+ if APPEND_MODE and jsonl_file.exists():
958
+ print(f"\nAppend mode: Loading existing samples from {jsonl_file}")
959
+ with open(jsonl_file, 'r') as f:
960
+ for line in f:
961
+ if line.strip():
962
+ sample_data = json.loads(line)
963
+ existing_samples.append(sample_data)
964
+ existing_sample_ids.add(sample_data['id'])
965
+ print(f" Found {len(existing_samples)} existing samples")
966
+
967
+ # Determine starting sample counter
968
+ max_existing_id = max([int(s['id'].split('_')[1]) for s in existing_samples if s['id'].startswith('sample_')], default=0)
969
+ sample_counter = max_existing_id + 1
970
+ else:
971
+ sample_counter = 1
972
+
973
+ # Prepare work items for parallel processing
974
+ work_items = []
975
+ starting_sample_counter = sample_counter # Track starting point for logging
976
+
977
+ for family, target_count in target_counts.items():
978
+ if target_count == 0:
979
+ continue
980
+
981
+ # Get templates for this family
982
+ family_templates = [t for t in TEMPLATES if t.family == family]
983
+ if not family_templates:
984
+ print(f"No templates found for {family}, skipping...")
985
+ continue
986
+
987
+ # Create work items (try retry_multiplier * target to account for failures)
988
+ for _ in range(target_count * RETRY_MULTIPLIER):
989
+ template = random.choice(family_templates)
990
+ # Convert template to dict for pickling
991
+ template_dict = {
992
+ 'template_id': template.template_id,
993
+ 'family': template.family,
994
+ 'sql_difficulty': template.sql_difficulty,
995
+ 'anchor_source': template.anchor_source,
996
+ 'num_anchors': template.num_anchors,
997
+ 'sql_template': template.sql_template,
998
+ 'question_hints': template.question_hints,
999
+ 'target_subtype': template.target_subtype,
1000
+ 'requires_buffer': template.requires_buffer,
1001
+ 'requires_aggregation': template.requires_aggregation
1002
+ }
1003
+ work_items.append((
1004
+ family,
1005
+ template_dict,
1006
+ f"sample_{sample_counter:03d}",
1007
+ str(intermediate_dir) # Convert Path to string for pickling
1008
+ ))
1009
+ sample_counter += 1
1010
+
1011
+ # Shuffle work items for balanced batches across families
1012
+ random.shuffle(work_items)
1013
+
1014
+ # Partition work items into batches (one per worker)
1015
+ num_workers = min(MAX_WORKERS, len(work_items))
1016
+ if num_workers == 0:
1017
+ print("No work items to process")
1018
+ return
1019
+ batch_size = (len(work_items) + num_workers - 1) // num_workers
1020
+ batches = []
1021
+ for i in range(0, len(work_items), batch_size):
1022
+ batch = work_items[i:i + batch_size]
1023
+ batches.append((batch, str(intermediate_dir)))
1024
+
1025
+ # Generate samples in parallel (one batch per worker)
1026
+ active_families = len([f for f in target_counts.values() if f > 0])
1027
+ print(f"\nGenerating {len(work_items)} samples across {active_families} families...")
1028
+ print(f" Split into {len(batches)} batches of ~{batch_size} items (1 DuckDB init per batch)")
1029
+ if APPEND_MODE and existing_samples:
1030
+ print(f"Appending: starting from sample_{starting_sample_counter:03d}")
1031
+
1032
+ all_samples = []
1033
+ family_progress = {f: {'success': 0, 'failed': 0} for f in target_counts.keys() if target_counts[f] > 0}
1034
+
1035
+ with ProcessPoolExecutor(max_workers=num_workers) as executor:
1036
+ # Submit one batch per worker
1037
+ futures = {executor.submit(generate_sample_batch_worker, batch): i for i, batch in enumerate(batches)}
1038
+
1039
+ # Collect results as batches complete
1040
+ batches_done = 0
1041
+ for future in as_completed(futures):
1042
+ try:
1043
+ batch_results = future.result()
1044
+ for sample, family, template_id, error in batch_results:
1045
+ if sample:
1046
+ all_samples.append(sample)
1047
+ family_progress[family]['success'] += 1
1048
+ else:
1049
+ family_progress[family]['failed'] += 1
1050
+ except Exception as e:
1051
+ print(f"\n Batch failed: {e}")
1052
+
1053
+ batches_done += 1
1054
+ total_done = sum(p['success'] + p['failed'] for p in family_progress.values())
1055
+ print(f"\r Progress: {total_done}/{len(work_items)} samples ({batches_done}/{len(batches)} batches) ", end='', flush=True)
1056
+
1057
+ print() # New line after progress
1058
+
1059
+ # Show distribution (keep all samples, no filtering)
1060
+ print("\nResults by family:")
1061
+ for family in sorted(family_progress.keys()):
1062
+ success = family_progress[family]['success']
1063
+ failed = family_progress[family]['failed']
1064
+ target = target_counts.get(family, 0)
1065
+ total = success + failed
1066
+ success_rate = (success / total * 100) if total > 0 else 0
1067
+ print(f" {family:20s}: {success:3d} success / {failed:3d} failed ({success_rate:5.1f}% success rate, target: {target})")
1068
+
1069
+ # Save combined JSONL (skip individual JSON files for speed at scale)
1070
+ print(f"\nSaving {len(all_samples)} new samples...")
1071
+ if APPEND_MODE and existing_samples:
1072
+ # Append to existing dataset
1073
+ print(f"Appending to existing dataset ({len(existing_samples)} existing samples)")
1074
+ with open(jsonl_file, 'a') as f:
1075
+ for sample in all_samples:
1076
+ f.write(json.dumps(sample.model_dump()) + '\n')
1077
+ total_samples = len(existing_samples) + len(all_samples)
1078
+ else:
1079
+ # Overwrite dataset
1080
+ with open(jsonl_file, 'w') as f:
1081
+ for sample in all_samples:
1082
+ f.write(json.dumps(sample.model_dump()) + '\n')
1083
+ total_samples = len(all_samples)
1084
+
1085
+ print(f"\nGenerated {len(all_samples)} new samples")
1086
+ print(f"Total dataset size: {total_samples} samples")
1087
+ print(f" Dataset: {jsonl_file}")
1088
+
1089
+
1090
+ if __name__ == "__main__":
1091
+ main()
dataset/scripts/sql_templates.py ADDED
@@ -0,0 +1,317 @@
1
+ """
2
+ SQL template definitions for synthetic data generation.
3
+
4
+ Each template includes:
5
+ - Template ID
6
+ - Task family
7
+ - SQL difficulty level
8
+ - Required anchor types
9
+ - SQL template string with placeholders
10
+ - Question generation hints
11
+ """
12
+
13
+ from dataclasses import dataclass
14
+ from typing import List, Literal, Optional
15
+
16
+
17
+ @dataclass
18
+ class SQLTemplate:
19
+ """SQL template for synthetic data generation."""
20
+
21
+ template_id: str
22
+ family: str
23
+ sql_difficulty: Literal["easy", "medium", "medium-hard", "hard"]
24
+ anchor_source: Literal["divisions_area", "natural_earth", "mixed"]
25
+ num_anchors: int
26
+ sql_template: str
27
+ question_hints: List[str]
28
+     target_subtype: Optional[str] = None
29
+ requires_buffer: bool = False
30
+ requires_aggregation: bool = False
31
+
32
+
33
+ # Template catalog
34
+ TEMPLATES = [
35
+ # DIRECT LOOKUP (10 samples)
36
+ SQLTemplate(
37
+ template_id="lookup_01",
38
+ family="direct_lookup",
39
+ sql_difficulty="easy",
40
+ anchor_source="divisions_area",
41
+ num_anchors=1,
42
+ sql_template="""SELECT geometry, names."primary" AS name, id, subtype FROM read_parquet('{DIVISIONS_AREA_PATH}') WHERE id = '{anchor_id}'""",
43
+ question_hints=["Show me {anchor_name}", "Get the geometry of {anchor_name}", "Find {anchor_name}"]
44
+ ),
45
+
46
+ SQLTemplate(
47
+ template_id="lookup_02",
48
+ family="direct_lookup",
49
+ sql_difficulty="easy",
50
+ anchor_source="natural_earth",
51
+ num_anchors=1,
52
+ sql_template="""SELECT geometry, names."primary" AS name, id, subtype FROM read_parquet('{NATURAL_EARTH_PATH}') WHERE id = '{anchor_id}'""",
53
+ question_hints=["Show me the {anchor_name}", "Get {anchor_name}", "Find the {anchor_name}"]
54
+ ),
55
+
56
+ # ADJACENCY (20 samples)
57
+ SQLTemplate(
58
+ template_id="adj_01",
59
+ family="adjacency",
60
+ sql_difficulty="medium",
61
+ anchor_source="divisions_area",
62
+ num_anchors=1,
63
+ sql_template="""WITH a AS (SELECT geometry FROM read_parquet('{DIVISIONS_AREA_PATH}') WHERE id = '{anchor_id}') SELECT b.id, b.names."primary" AS name, b.geometry FROM read_parquet('{DIVISIONS_AREA_PATH}') AS b, a WHERE b.id != '{anchor_id}' AND ST_Touches(a.geometry, b.geometry)""",
64
+ question_hints=["Which regions border {anchor_name}?", "What borders {anchor_name}?", "List places adjacent to {anchor_name}"]
65
+ ),
66
+
67
+ SQLTemplate(
68
+ template_id="adj_02",
69
+ family="adjacency",
70
+ sql_difficulty="medium",
71
+ anchor_source="divisions_area",
72
+ num_anchors=1,
73
+ target_subtype="region",
74
+ sql_template="""WITH a AS (SELECT geometry FROM read_parquet('{DIVISIONS_AREA_PATH}') WHERE id = '{anchor_id}') SELECT b.id, b.names."primary" AS name, b.geometry FROM read_parquet('{DIVISIONS_AREA_PATH}') AS b, a WHERE b.id != '{anchor_id}' AND b.subtype = '{target_subtype}' AND ST_Touches(a.geometry, b.geometry)""",
75
+ question_hints=["Which {target_subtype}s border {anchor_name}?", "What {target_subtype}s touch {anchor_name}?"]
76
+ ),
77
+
78
+ SQLTemplate(
79
+ template_id="adj_03",
80
+ family="adjacency",
81
+ sql_difficulty="medium",
82
+ anchor_source="divisions_area",
83
+ num_anchors=1,
84
+ target_subtype="sea",
85
+ sql_template="""WITH a AS (SELECT geometry FROM read_parquet('{DIVISIONS_AREA_PATH}') WHERE id = '{anchor_id}') SELECT n.id, n.names."primary" AS name, n.geometry FROM read_parquet('{NATURAL_EARTH_PATH}') AS n, a WHERE n.subtype = '{target_subtype}' AND ST_Touches(a.geometry, n.geometry)""",
86
+ question_hints=["Which {target_subtype}s touch {anchor_name}?", "What {target_subtype}s border {anchor_name}?"]
87
+ ),
88
+
89
+ # CONTAINMENT (15 samples)
90
+ SQLTemplate(
91
+ template_id="contain_01",
92
+ family="containment",
93
+ sql_difficulty="medium",
94
+ anchor_source="divisions_area",
95
+ num_anchors=1,
96
+ target_subtype="locality",
97
+ sql_template="""WITH a AS (SELECT geometry FROM read_parquet('{DIVISIONS_AREA_PATH}') WHERE id = '{anchor_id}') SELECT b.id, b.names."primary" AS name, b.geometry FROM read_parquet('{DIVISIONS_AREA_PATH}') AS b, a WHERE b.id != '{anchor_id}' AND b.subtype = '{target_subtype}' AND ST_Within(b.geometry, a.geometry)""",
98
+ question_hints=["What {target_subtype}s are in {anchor_name}?", "Which {target_subtype}s are within {anchor_name}?"]
99
+ ),
100
+
101
+ SQLTemplate(
102
+ template_id="contain_02",
103
+ family="containment",
104
+ sql_difficulty="medium",
105
+ anchor_source="divisions_area",
106
+ num_anchors=1,
107
+ target_subtype="country",
108
+ sql_template="""WITH a AS (SELECT geometry FROM read_parquet('{DIVISIONS_AREA_PATH}') WHERE id = '{anchor_id}') SELECT b.id, b.names."primary" AS name, b.geometry FROM read_parquet('{DIVISIONS_AREA_PATH}') AS b, a WHERE b.id != '{anchor_id}' AND b.subtype = '{target_subtype}' AND ST_Contains(b.geometry, a.geometry)""",
109
+ question_hints=["What {target_subtype} contains {anchor_name}?", "Which {target_subtype} is {anchor_name} in?"]
110
+ ),
111
+
112
+ SQLTemplate(
113
+ template_id="contain_03",
114
+ family="containment",
115
+ sql_difficulty="medium",
116
+ anchor_source="natural_earth",
117
+ num_anchors=1,
118
+ target_subtype="region",
119
+ sql_template="""WITH a AS (SELECT geometry FROM read_parquet('{NATURAL_EARTH_PATH}') WHERE id = '{anchor_id}') SELECT b.id, b.names."primary" AS name, b.geometry FROM read_parquet('{DIVISIONS_AREA_PATH}') AS b, a WHERE b.subtype = '{target_subtype}' AND ST_Within(b.geometry, a.geometry)""",
120
+ question_hints=["Which {target_subtype}s are in the {anchor_name}?", "What {target_subtype}s fall within the {anchor_name}?"]
121
+ ),
122
+
123
+ # INTERSECTION (15 samples)
124
+ SQLTemplate(
125
+ template_id="intersect_01",
126
+ family="intersection",
127
+ sql_difficulty="medium-hard",
128
+ anchor_source="divisions_area",
129
+ num_anchors=1,
130
+ target_subtype="region",
131
+ sql_template="""WITH a AS (SELECT geometry FROM read_parquet('{DIVISIONS_AREA_PATH}') WHERE id = '{anchor_id}') SELECT b.id, b.names."primary" AS name, b.geometry FROM read_parquet('{DIVISIONS_AREA_PATH}') AS b, a WHERE b.id != '{anchor_id}' AND b.subtype = '{target_subtype}' AND ST_Intersects(b.geometry, a.geometry)""",
132
+ question_hints=["Which {target_subtype}s intersect {anchor_name}?", "What {target_subtype}s overlap with {anchor_name}?"]
133
+ ),
134
+
135
+ SQLTemplate(
136
+ template_id="intersect_02",
137
+ family="intersection",
138
+ sql_difficulty="medium-hard",
139
+ anchor_source="natural_earth",
140
+ num_anchors=1,
141
+ target_subtype="country",
142
+ sql_template="""WITH a AS (SELECT geometry FROM read_parquet('{NATURAL_EARTH_PATH}') WHERE id = '{anchor_id}') SELECT b.id, b.names."primary" AS name, b.geometry FROM read_parquet('{DIVISIONS_AREA_PATH}') AS b, a WHERE b.subtype = '{target_subtype}' AND ST_Intersects(b.geometry, a.geometry)""",
143
+ question_hints=["Which {target_subtype}s intersect the {anchor_name}?", "What {target_subtype}s touch the {anchor_name}?"]
144
+ ),
145
+
146
+ # BUFFER OPERATIONS (10 samples)
147
+ SQLTemplate(
148
+ template_id="buffer_01",
149
+ family="buffer",
150
+ sql_difficulty="hard",
151
+ anchor_source="divisions_area",
152
+ num_anchors=1,
153
+ requires_buffer=True,
154
+ sql_template="""WITH a AS (SELECT ST_Buffer(geometry, {buffer_degrees}) AS geom FROM read_parquet('{DIVISIONS_AREA_PATH}') WHERE id = '{anchor_id}') SELECT b.id, b.names."primary" AS name, b.geometry FROM read_parquet('{DIVISIONS_AREA_PATH}') AS b, a WHERE b.id != '{anchor_id}' AND ST_Intersects(b.geometry, a.geom)""",
155
+ question_hints=["A {buffer_degrees} degree buffer around {anchor_name}", "Features within {buffer_degrees} degrees of {anchor_name}"]
156
+ ),
157
+
158
+ SQLTemplate(
159
+ template_id="buffer_02",
160
+ family="buffer",
161
+ sql_difficulty="hard",
162
+ anchor_source="divisions_area",
163
+ num_anchors=2,
164
+ requires_buffer=True,
165
+ sql_template="""WITH a AS (SELECT geometry AS g1 FROM read_parquet('{DIVISIONS_AREA_PATH}') WHERE id = '{anchor_id_1}'), b AS (SELECT geometry AS g2 FROM read_parquet('{DIVISIONS_AREA_PATH}') WHERE id = '{anchor_id_2}'), boundary AS (SELECT ST_Buffer(ST_Intersection(a.g1, b.g2), {buffer_degrees}) AS geom FROM a, b WHERE ST_Touches(a.g1, b.g2)) SELECT geom AS geometry FROM boundary""",
166
+ question_hints=["A {buffer_degrees} degree buffer around the border between {anchor_1_name} and {anchor_2_name}"]
167
+ ),
168
+
169
+ # SET OPERATIONS (15 samples)
170
+ SQLTemplate(
171
+ template_id="union_01",
172
+ family="set_operations",
173
+ sql_difficulty="medium-hard",
174
+ anchor_source="divisions_area",
175
+ num_anchors=2,
176
+ sql_template="""SELECT ST_Union_Agg(geometry) AS geometry, array_agg(names."primary") AS names FROM read_parquet('{DIVISIONS_AREA_PATH}') WHERE id IN ('{anchor_id_1}', '{anchor_id_2}')""",
177
+ question_hints=["{anchor_1_name} and {anchor_2_name}", "The union of {anchor_1_name} and {anchor_2_name}"]
178
+ ),
179
+
180
+ # PARTIAL SELECTION (10 samples)
181
+ SQLTemplate(
182
+ template_id="partial_01",
183
+ family="partial_selection",
184
+ sql_difficulty="hard",
185
+ anchor_source="divisions_area",
186
+ num_anchors=1,
187
+ sql_template="""WITH a AS (SELECT geometry FROM read_parquet('{DIVISIONS_AREA_PATH}') WHERE id = '{anchor_id}'), bbox AS (SELECT ST_XMin(geometry) AS xmin, ST_XMax(geometry) AS xmax, ST_YMin(geometry) AS ymin, ST_YMax(geometry) AS ymax FROM a), north_half AS (SELECT ST_MakeEnvelope(xmin, (ymin + ymax) / 2, xmax, ymax) AS half_geom FROM bbox) SELECT ST_Intersection(a.geometry, nh.half_geom) AS geometry FROM a, north_half AS nh""",
188
+ question_hints=["The northern half of {anchor_name}", "Northern part of {anchor_name}"]
189
+ ),
190
+
191
+ SQLTemplate(
192
+ template_id="partial_02",
193
+ family="partial_selection",
194
+ sql_difficulty="hard",
195
+ anchor_source="divisions_area",
196
+ num_anchors=1,
197
+ sql_template="""WITH a AS (SELECT geometry FROM read_parquet('{DIVISIONS_AREA_PATH}') WHERE id = '{anchor_id}'), bbox AS (SELECT ST_XMin(geometry) AS xmin, ST_XMax(geometry) AS xmax, ST_YMin(geometry) AS ymin, ST_YMax(geometry) AS ymax FROM a), south_half AS (SELECT ST_MakeEnvelope(xmin, ymin, xmax, (ymin + ymax) / 2) AS half_geom FROM bbox) SELECT ST_Intersection(a.geometry, sh.half_geom) AS geometry FROM a, south_half AS sh""",
198
+ question_hints=["The southern half of {anchor_name}", "Southern part of {anchor_name}"]
199
+ ),
200
+
201
+ SQLTemplate(
202
+ template_id="partial_04",
203
+ family="partial_selection",
204
+ sql_difficulty="hard",
205
+ anchor_source="divisions_area",
206
+ num_anchors=1,
207
+ sql_template="""WITH a AS (SELECT geometry FROM read_parquet('{DIVISIONS_AREA_PATH}') WHERE id = '{anchor_id}'), bbox AS (SELECT ST_XMin(geometry) AS xmin, ST_XMax(geometry) AS xmax, ST_YMin(geometry) AS ymin, ST_YMax(geometry) AS ymax FROM a), east_half AS (SELECT ST_MakeEnvelope((xmin + xmax) / 2, ymin, xmax, ymax) AS half_geom FROM bbox) SELECT ST_Intersection(a.geometry, eh.half_geom) AS geometry FROM a, east_half AS eh""",
208
+ question_hints=["The eastern half of {anchor_name}", "Eastern part of {anchor_name}"]
209
+ ),
210
+
211
+ SQLTemplate(
212
+ template_id="partial_05",
213
+ family="partial_selection",
214
+ sql_difficulty="hard",
215
+ anchor_source="divisions_area",
216
+ num_anchors=1,
217
+ sql_template="""WITH a AS (SELECT geometry FROM read_parquet('{DIVISIONS_AREA_PATH}') WHERE id = '{anchor_id}'), bbox AS (SELECT ST_XMin(geometry) AS xmin, ST_XMax(geometry) AS xmax, ST_YMin(geometry) AS ymin, ST_YMax(geometry) AS ymax FROM a), west_half AS (SELECT ST_MakeEnvelope(xmin, ymin, (xmin + xmax) / 2, ymax) AS half_geom FROM bbox) SELECT ST_Intersection(a.geometry, wh.half_geom) AS geometry FROM a, west_half AS wh""",
218
+ question_hints=["The western half of {anchor_name}", "Western part of {anchor_name}"]
219
+ ),
220
+
221
+ SQLTemplate(
222
+ template_id="partial_03",
223
+ family="partial_selection",
224
+ sql_difficulty="hard",
225
+ anchor_source="mixed",
226
+ num_anchors=2,
227
+ sql_template="""WITH a AS (SELECT geometry AS g1 FROM read_parquet('{DIVISIONS_AREA_PATH}') WHERE id = '{anchor_id}'), b AS (SELECT geometry AS g2 FROM read_parquet('{NATURAL_EARTH_PATH}') WHERE id = '{clip_feature_id}') SELECT ST_Intersection(a.g1, b.g2) AS geometry FROM a, b WHERE ST_Intersects(a.g1, b.g2)""",
228
+ question_hints=["The part of {anchor_name} that is in the {clip_feature_name}", "{anchor_name} within the {clip_feature_name}"]
229
+ ),
230
+
231
+ # AGGREGATION (5 samples)
232
+ SQLTemplate(
233
+ template_id="agg_01",
234
+ family="aggregation",
235
+ sql_difficulty="hard",
236
+ anchor_source="divisions_area",
237
+ num_anchors=1,
238
+ target_subtype="locality",
239
+ requires_aggregation=True,
240
+ sql_template="""WITH a AS (SELECT geometry FROM read_parquet('{DIVISIONS_AREA_PATH}') WHERE id = '{anchor_id}') SELECT b.id, b.names."primary" AS name, b.geometry, ST_Area(b.geometry) AS area FROM read_parquet('{DIVISIONS_AREA_PATH}') AS b, a WHERE ST_Within(b.geometry, a.geometry) AND b.subtype = '{target_subtype}' ORDER BY area DESC LIMIT {top_n}""",
241
+ question_hints=["Top {top_n} largest {target_subtype}s in {anchor_name}", "Biggest {target_subtype}s in {anchor_name}", "{top_n} largest {target_subtype}s in {anchor_name}"]
242
+ ),
243
+
244
+ SQLTemplate(
245
+ template_id="agg_02",
246
+ family="aggregation",
247
+ sql_difficulty="hard",
248
+ anchor_source="divisions_area",
249
+ num_anchors=1,
250
+ target_subtype="locality",
251
+ requires_aggregation=True,
252
+ sql_template="""WITH a AS (SELECT geometry FROM read_parquet('{DIVISIONS_AREA_PATH}') WHERE id = '{anchor_id}') SELECT b.id, b.names."primary" AS name, b.geometry, ST_Area(b.geometry) AS area FROM read_parquet('{DIVISIONS_AREA_PATH}') AS b, a WHERE ST_Within(b.geometry, a.geometry) AND b.subtype = '{target_subtype}' ORDER BY area ASC LIMIT {top_n}""",
253
+ question_hints=["Top {top_n} smallest {target_subtype}s in {anchor_name}", "Smallest {target_subtype}s in {anchor_name}", "{top_n} smallest {target_subtype}s in {anchor_name}"]
254
+ ),
255
+
256
+ SQLTemplate(
257
+ template_id="agg_03",
258
+ family="aggregation",
259
+ sql_difficulty="hard",
260
+ anchor_source="divisions_area",
261
+ num_anchors=1,
262
+ target_subtype="region",
263
+ requires_aggregation=True,
264
+ sql_template="""WITH a AS (SELECT geometry FROM read_parquet('{DIVISIONS_AREA_PATH}') WHERE id = '{anchor_id}') SELECT b.id, b.names."primary" AS name, b.geometry FROM read_parquet('{DIVISIONS_AREA_PATH}') AS b, a WHERE ST_Within(b.geometry, a.geometry) AND b.subtype = '{target_subtype}' ORDER BY RANDOM() LIMIT {top_n}""",
265
+ question_hints=["{top_n} random {target_subtype}s in {anchor_name}", "Any {top_n} {target_subtype}s in {anchor_name}"]
266
+ ),
267
+
268
+ SQLTemplate(
269
+ template_id="agg_04",
270
+ family="aggregation",
271
+ sql_difficulty="hard",
272
+ anchor_source="divisions_area",
273
+ num_anchors=1,
274
+ target_subtype="locality",
275
+ requires_aggregation=True,
276
+ sql_template="""SELECT id, names."primary" AS name, geometry, ST_Area(geometry) AS area FROM read_parquet('{DIVISIONS_AREA_PATH}') WHERE country = '{country}' AND subtype = '{target_subtype}' ORDER BY area DESC LIMIT {top_n}""",
277
+ question_hints=["Top {top_n} largest {target_subtype}s in {country}", "{top_n} biggest {target_subtype}s in {country}"]
278
+ ),
279
+
280
+ SQLTemplate(
281
+ template_id="agg_05",
282
+ family="aggregation",
283
+ sql_difficulty="hard",
284
+ anchor_source="divisions_area",
285
+ num_anchors=1,
286
+ target_subtype="locality",
287
+ requires_aggregation=True,
288
+ sql_template="""SELECT id, names."primary" AS name, geometry, ST_Area(geometry) AS area FROM read_parquet('{DIVISIONS_AREA_PATH}') WHERE country = '{country}' AND subtype = '{target_subtype}' ORDER BY area ASC LIMIT {top_n}""",
289
+ question_hints=["Top {top_n} smallest {target_subtype}s in {country}", "{top_n} smallest {target_subtype}s in {country}"]
290
+ ),
291
+ ]
292
+
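+ # Rendering sketch: generate_samples.py fills the placeholders with str.format;
+ # the paths and id below are placeholders, not real data:
+ #   sql = TEMPLATES[0].sql_template.format(
+ #       DIVISIONS_AREA_PATH="data/divisions_area.parquet",
+ #       NATURAL_EARTH_PATH="data/natural_earth.parquet",
+ #       anchor_id="div_123",
+ #   )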
293
+
294
+ def get_templates_by_family(family: str) -> List[SQLTemplate]:
295
+ """Get all templates for a specific family."""
296
+ return [t for t in TEMPLATES if t.family == family]
297
+
298
+
299
+ def get_template_by_id(template_id: str) -> SQLTemplate:
300
+ """Get a specific template by ID."""
301
+ for t in TEMPLATES:
302
+ if t.template_id == template_id:
303
+ return t
304
+ raise ValueError(f"Template {template_id} not found")
305
+
306
+
307
+ if __name__ == "__main__":
308
+ # Print template summary
309
+ families = {}
310
+ for t in TEMPLATES:
311
+ families[t.family] = families.get(t.family, 0) + 1
312
+
313
+ print("SQL Template Catalog")
314
+ print("=" * 60)
315
+ for family, count in sorted(families.items()):
316
+ print(f"{family:20s}: {count:2d} templates")
317
+ print(f"{'TOTAL':20s}: {len(TEMPLATES):2d} templates")
dataset/scripts/validate_dataset.py ADDED
@@ -0,0 +1,275 @@
1
+ """
2
+ Validate the generated dataset and report statistics.
3
+
4
+ This script:
5
+ 1. Loads all generated samples
6
+ 2. Validates SQL executability
7
+ 3. Checks candidate list quality
8
+ 4. Keeps only the samples that pass all checks
9
+ 5. Generates dataset statistics
11
+
12
+ Output:
13
+ - output/dataset_validated.jsonl
14
+ - output/dataset_stats.json
15
+ """
16
+
17
+ import json
18
+ from pathlib import Path
19
+ from typing import List, Dict, Any, Tuple, Optional
20
+ from collections import Counter
21
+ from concurrent.futures import ProcessPoolExecutor, as_completed
22
+
23
+ import duckdb
24
+ import pandas as pd
25
+
26
+
27
+ def load_samples(jsonl_path: Path) -> List[Dict[str, Any]]:
28
+ """Load samples from JSONL file."""
29
+ samples = []
30
+ with open(jsonl_path, 'r') as f:
31
+ for line in f:
32
+ samples.append(json.loads(line))
33
+ return samples
34
+
35
+
36
+ def validate_sql(con: duckdb.DuckDBPyConnection, sql: str) -> tuple[bool, str]:
37
+ """Validate that SQL executes without error."""
38
+ try:
39
+ result = con.execute(sql).fetchdf()
40
+ if result.empty:
41
+ return False, "Empty result"
42
+ return True, "OK"
43
+ except Exception as e:
44
+ return False, str(e)
45
+
46
+
47
+ def validate_candidates(sample: Dict[str, Any]) -> tuple[bool, str]:
48
+ """Validate candidate list quality."""
49
+ candidates = sample['candidates']
50
+ selected = sample['target']['selected_candidates']
51
+
52
+ # Check we have candidates
53
+ if not candidates:
54
+ return False, "No candidates"
55
+
56
+ # Check selected candidates exist
57
+ candidate_ids = {c['candidate_id'] for c in candidates}
58
+ for sel_id in selected:
59
+ if sel_id not in candidate_ids:
60
+ return False, f"Selected candidate {sel_id} not in candidate list"
61
+
62
+ # Check for duplicates
63
+ ids = [c['id'] for c in candidates]
64
+ if len(ids) != len(set(ids)):
65
+ return False, "Duplicate candidates"
66
+
67
+ return True, "OK"
68
+
69
+
70
+ def validate_sample(con: duckdb.DuckDBPyConnection, sample: Dict[str, Any]) -> tuple[bool, List[str]]:
71
+ """Validate a single sample. Returns (is_valid, list_of_issues)."""
72
+ issues = []
73
+
74
+ # Skip SQL re-execution if already verified during generation
75
+ if not sample.get('metadata', {}).get('sql_verified', False):
76
+ sql_valid, sql_msg = validate_sql(con, sample['target']['sql'])
77
+ if not sql_valid:
78
+ issues.append(f"SQL: {sql_msg}")
79
+
80
+ # Validate candidates
81
+ cand_valid, cand_msg = validate_candidates(sample)
82
+ if not cand_valid:
83
+ issues.append(f"Candidates: {cand_msg}")
84
+
85
+ # Check question exists
86
+ if not sample.get('question') or len(sample['question'].strip()) == 0:
87
+ issues.append("Empty question")
88
+
89
+ return len(issues) == 0, issues
90
+
91
+
92
+ def validate_sample_worker(sample: Dict[str, Any]) -> Tuple[str, bool, List[str], Optional[Dict[str, Any]]]:
94
+     """Worker function for parallel validation. Returns (sample_id, is_valid, issues, sample_or_None)."""
94
+ # Each worker creates its own DuckDB connection
95
+ con = duckdb.connect()
96
+ con.execute("SET enable_progress_bar=false")
97
+ con.execute("INSTALL spatial")
98
+ con.execute("LOAD spatial")
99
+
100
+ try:
101
+ is_valid, issues = validate_sample(con, sample)
102
+ con.close()
103
+ return (sample['id'], is_valid, issues, sample if is_valid else None)
104
+ except Exception as e:
105
+ con.close()
106
+ return (sample['id'], False, [f"Validation error: {str(e)}"], None)
107
+
108
+
109
+ def compute_statistics(samples: List[Dict[str, Any]]) -> Dict[str, Any]:
110
+ """Compute dataset statistics."""
111
+
112
+ stats = {
113
+ 'total_samples': len(samples),
114
+ 'task_families': {},
115
+ 'sql_difficulty': {},
116
+ 'grounding_difficulty': {},
117
+ 'anchor_sources': {},
118
+ 'avg_candidates_per_sample': 0,
119
+ 'avg_question_length': 0,
120
+ 'countries_covered': set(),
121
+ 'subtypes_covered': set()
122
+ }
123
+
124
+ total_candidates = 0
125
+ total_question_length = 0
126
+
127
+ for sample in samples:
128
+ meta = sample['metadata']
129
+
130
+ # Count by family
131
+ family = meta['task_family']
132
+ stats['task_families'][family] = stats['task_families'].get(family, 0) + 1
133
+
134
+ # Count by SQL difficulty
135
+ sql_diff = meta['sql_difficulty']
136
+ stats['sql_difficulty'][sql_diff] = stats['sql_difficulty'].get(sql_diff, 0) + 1
137
+
138
+ # Count by grounding difficulty
139
+ ground_diff = meta['grounding_difficulty']
140
+ stats['grounding_difficulty'][ground_diff] = stats['grounding_difficulty'].get(ground_diff, 0) + 1
141
+
142
+ # Count by anchor source
143
+ anchor_src = meta['anchor_source']
144
+ stats['anchor_sources'][anchor_src] = stats['anchor_sources'].get(anchor_src, 0) + 1
145
+
146
+ # Candidates
147
+ total_candidates += len(sample['candidates'])
148
+
149
+ # Question length
150
+ total_question_length += len(sample['question'].split())
151
+
152
+ # Countries and subtypes (from selected/answer candidates only)
153
+ selected_ids = set(sample.get('target', {}).get('selected_candidates', []))
154
+ for cand in sample['candidates']:
155
+ if cand['candidate_id'] in selected_ids:
156
+ if cand.get('country'):
157
+ stats['countries_covered'].add(cand['country'])
158
+ if cand.get('subtype'):
159
+ stats['subtypes_covered'].add(cand['subtype'])
160
+
161
+ stats['avg_candidates_per_sample'] = total_candidates / len(samples) if samples else 0
162
+ stats['avg_question_length'] = total_question_length / len(samples) if samples else 0
163
+ stats['countries_covered'] = sorted(list(stats['countries_covered']))
164
+ stats['subtypes_covered'] = sorted(list(stats['subtypes_covered']))
165
+
166
+ return stats
167
+
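+ # Illustrative output shape (numbers are made up):
+ #   {"total_samples": 800,
+ #    "task_families": {"adjacency": 180, "containment": 95, ...},
+ #    "sql_difficulty": {"medium": 420, ...},
+ #    "avg_candidates_per_sample": 9.7,
+ #    "countries_covered": ["AE", "BE", "EC", "KE", "SG"], ...}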
168
+
169
+ def main():
170
+ """Validate and analyze dataset."""
171
+
172
+ script_dir = Path(__file__).parent
173
+ output_dir = script_dir.parent / "output"
174
+
175
+ raw_file = output_dir / "dataset_raw.jsonl"
176
+ validated_file = output_dir / "dataset_validated.jsonl"
177
+ stats_file = output_dir / "dataset_stats.json"
178
+
179
+ if not raw_file.exists():
180
+ print(f"Error: {raw_file} not found. Run generate_samples.py first.")
181
+ return
182
+
183
+ # Load samples
184
+ print("Loading samples...")
185
+ samples = load_samples(raw_file)
186
+ print(f"Loaded {len(samples)} samples")
187
+
188
+ # Validate samples in parallel
189
+ print("\nValidating samples in parallel...")
190
+ valid_samples = []
191
+ invalid_samples = []
192
+
193
+ with ProcessPoolExecutor(max_workers=8) as executor:
194
+ # Submit all validation tasks
195
+ futures = {executor.submit(validate_sample_worker, sample): sample for sample in samples}
196
+
197
+ # Collect results as they complete
198
+ completed = 0
199
+ for future in as_completed(futures):
200
+ sample_id, is_valid, issues, validated_sample = future.result()
201
+
202
+ if is_valid:
203
+ valid_samples.append(validated_sample)
204
+ else:
205
+ invalid_samples.append((sample_id, issues))
206
+
207
+ completed += 1
208
+ if completed % 50 == 0 or completed == len(samples):
209
+ print(f"\r Progress: {completed}/{len(samples)} ", end='', flush=True)
210
+
211
+ print() # New line after progress
212
+
213
+ print(f"\nValidation results:")
214
+ print(f" Valid: {len(valid_samples)}")
215
+ print(f" Invalid: {len(invalid_samples)}")
216
+
217
+ if invalid_samples and len(invalid_samples) <= 20:
218
+ print("\nInvalid samples:")
219
+ for sample_id, issues in invalid_samples[:20]:
220
+ print(f" {sample_id}: {', '.join(issues)}")
221
+ elif invalid_samples:
222
+ print(f"\n{len(invalid_samples)} invalid samples (showing first 20):")
223
+ for sample_id, issues in invalid_samples[:20]:
224
+ print(f" {sample_id}: {', '.join(issues)}")
225
+
226
+ # Save validated samples
227
+ if valid_samples:
228
+ with open(validated_file, 'w') as f:
229
+ for sample in valid_samples:
230
+ f.write(json.dumps(sample) + '\n')
231
+ print(f"\nSaved {len(valid_samples)} valid samples to {validated_file}")
232
+
233
+ # Compute statistics
234
+ print("\nComputing statistics...")
235
+ stats = compute_statistics(valid_samples)
236
+
237
+ # Save statistics
238
+ # Convert sets to lists for JSON serialization
239
+ stats_json = {k: (list(v) if isinstance(v, set) else v) for k, v in stats.items()}
240
+ with open(stats_file, 'w') as f:
241
+ json.dump(stats_json, f, indent=2)
242
+ print(f"Saved statistics to {stats_file}")
243
+
244
+ # Print summary
245
+ print("\n" + "=" * 60)
246
+ print("DATASET STATISTICS")
247
+ print("=" * 60)
248
+ print(f"\nTotal samples: {stats['total_samples']}")
249
+
250
+ print("\nTask families:")
251
+ for family, count in sorted(stats['task_families'].items()):
252
+ print(f" {family:20s}: {count:3d}")
253
+
254
+ print("\nSQL difficulty:")
255
+ for diff, count in sorted(stats['sql_difficulty'].items()):
256
+ print(f" {diff:20s}: {count:3d}")
257
+
258
+ print("\nGrounding difficulty:")
259
+ for diff, count in sorted(stats['grounding_difficulty'].items()):
260
+ print(f" {diff:20s}: {count:3d}")
261
+
262
+ print("\nAnchor sources:")
263
+ for src, count in sorted(stats['anchor_sources'].items()):
264
+ print(f" {src:20s}: {count:3d}")
265
+
266
+ print(f"\nAverage candidates per sample: {stats['avg_candidates_per_sample']:.1f}")
267
+ print(f"Average question length (words): {stats['avg_question_length']:.1f}")
268
+ print(f"Countries covered: {len(stats['countries_covered'])}")
269
+ print(f"Subtypes covered: {len(stats['subtypes_covered'])}")
270
+
271
+ print("\n✓ Validation complete")
272
+
273
+
274
+ if __name__ == "__main__":
275
+ main()
pyproject.toml CHANGED
@@ -19,5 +19,11 @@ dependencies = [
19
  ]
20
  optional-dependencies = { demo = ["streamlit", "requests", "pydeck"], dev = ["ruff"] }
21
 
22
  [tool.hatch.build.targets.wheel]
23
- packages = ["src/gazet"]
19
  ]
20
  optional-dependencies = { demo = ["streamlit", "requests", "pydeck"], dev = ["ruff"] }
21
 
22
+ [project.scripts]
23
+ gazet-dataset = "dataset.scripts.cli:main"
24
+
25
  [tool.hatch.build.targets.wheel]
26
+ packages = ["src/gazet", "dataset"]
27
+
28
+ [dependency-groups]
29
+ dataset = []
uv.lock CHANGED
The diff for this file is too large to render. See raw diff