
Data Ingestion Scripts

Documentation for scripts that download and process geographic datasets.


Overview

Data ingestion scripts in backend/scripts/ automate downloading and processing of various data sources:

  • OpenStreetMap via Geofabrik
  • Humanitarian Data Exchange (HDX)
  • World Bank Open Data
  • STRI GIS Portal
  • Kontur Population
  • Global datasets

## Scripts Reference

1. download_geofabrik.py

Downloads OpenStreetMap data for Panama from Geofabrik.

Usage:

cd backend
python scripts/download_geofabrik.py

What it downloads:

  • Roads network
  • Buildings
  • POI (points of interest)
  • Natural features

Output: GeoJSON files in backend/data/osm/

Schedule: Run monthly for updates
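A quick way to verify the output is to load each file and print a feature count. This is a minimal sketch; it assumes geopandas is installed and the snippet is run from the repository root:

from pathlib import Path
import geopandas as gpd

osm_dir = Path("backend/data/osm")

# Report a feature count and geometry types for every GeoJSON the script produced
for geojson in sorted(osm_dir.glob("*.geojson")):
    gdf = gpd.read_file(geojson)
    print(f"{geojson.name}: {len(gdf)} features, {list(gdf.geom_type.unique())}")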


2. download_hdx_panama.py

Downloads administrative boundaries from Humanitarian Data Exchange.

Usage:

python scripts/download_hdx_panama.py

Downloads:

  • Level 1: Provinces (10 features)
  • Level 2: Districts (81 features)
  • Level 3: Corregimientos (679 features)

Output: backend/data/hdx/pan_admin{1,2,3}_2021.geojson

Schedule: Annual updates
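The feature counts above make a convenient sanity check after the download. A minimal sketch using the output paths listed above, run from the repository root:

from pathlib import Path
import geopandas as gpd

hdx_dir = Path("backend/data/hdx")
expected = {1: 10, 2: 81, 3: 679}  # provinces, districts, corregimientos

for level, count in expected.items():
    gdf = gpd.read_file(hdx_dir / f"pan_admin{level}_2021.geojson")
    assert len(gdf) == count, f"admin{level}: expected {count}, got {len(gdf)}"
    print(f"admin{level}: {len(gdf)} features OK")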


3. download_worldbank.py

Downloads World Bank development indicators.

Usage:

python scripts/download_worldbank.py

Indicators:

  • GDP per capita
  • Life expectancy
  • Access to electricity
  • Internet usage
  • And more...

Output: backend/data/worldbank/indicators.geojson

Processing: Joins indicator data with country geometries
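The join itself is a plain attribute merge between a tabular indicator frame and country geometries. A rough sketch only; the file path, the ISO-code column names, and the placeholder value are illustrative assumptions, not the script's actual names:

import geopandas as gpd
import pandas as pd

# Country polygons and tabular indicators, both keyed by an ISO3 country code
countries = gpd.read_file("backend/data/global/countries.geojson")      # assumed path
indicators = pd.DataFrame({"iso3": ["PAN"], "gdp_per_capita": [0.0]})   # placeholder value

# Attach indicator columns to each country geometry, then write GeoJSON
joined = countries.merge(indicators, left_on="ISO_A3", right_on="iso3", how="left")
joined.to_file("backend/data/worldbank/indicators.geojson", driver="GeoJSON")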


4. download_stri_data.py

Downloads datasets from STRI GIS Portal.

Usage:

python scripts/download_stri_data.py

Downloads:

  • Protected areas
  • Forest cover
  • Environmental datasets

Output: backend/data/stri/*.geojson

Note: Uses ArcGIS REST API


5. stri_catalog_scraper.py

Discovers and catalogs all available STRI datasets.

Usage:

python scripts/stri_catalog_scraper.py

Output: JSON catalog of 100+ STRI datasets with metadata

Features:

  • Priority scoring
  • Temporal dataset detection
  • REST endpoint generation
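Once generated, the catalog can be filtered or sorted offline. A sketch of typical use; the output filename and field names here are assumptions, so check the script for the real ones:

import json

# Hypothetical output path and fields
with open("backend/data/stri/stri_catalog.json") as f:
    catalog = json.load(f)

# List the ten highest-priority datasets with their REST endpoints
for entry in sorted(catalog, key=lambda d: d.get("priority", 0), reverse=True)[:10]:
    print(entry.get("name"), "->", entry.get("rest_url"))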

6. create_province_layer.py

Creates province-level socioeconomic data layer.

Usage:

python scripts/create_province_layer.py

Combines:

  • INEC Census data
  • MPI (poverty index)
  • Administrative geometries

Output: backend/data/socioeconomic/province_socioeconomic.geojson


7. download_global_datasets.py

Downloads global reference datasets.

Usage:

python scripts/download_global_datasets.py

Downloads:

  • Natural Earth country boundaries
  • Global admin boundaries
  • Reference layers

Output: backend/data/global/*.geojson


8. register_global_datasets.py

Registers global datasets in catalog.json.

Usage:

python scripts/register_global_datasets.py

Action: Adds dataset entries to backend/data/catalog.json
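Conceptually, registration is a merge into the existing catalog dictionary. A minimal sketch; the example entry is illustrative, not one the script actually writes:

import json
from pathlib import Path

catalog_path = Path("backend/data/catalog.json")
catalog = json.loads(catalog_path.read_text())

# Illustrative entry; real entries follow the schema shown under "Adding New Data Sources"
catalog["ne_countries"] = {
    "path": "global/ne_countries.geojson",
    "description": "Natural Earth country boundaries",
    "categories": ["boundaries"],
    "tags": ["countries", "global"]
}

catalog_path.write_text(json.dumps(catalog, indent=2))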


Adding New Data Sources

Step-by-Step Guide

1. Create Download Script

Create backend/scripts/download_mycustom_data.py:

import requests
from pathlib import Path

def download_custom_data():
    """Download custom dataset."""
    
    # Define output path
    output_dir = Path(__file__).parent.parent / "data" / "custom"
    output_dir.mkdir(parents=True, exist_ok=True)
    
    # Download data
    url = "https://example.com/data.geojson"
    response = requests.get(url, timeout=60)
    response.raise_for_status()
    
    # Save as GeoJSON
    output_file = output_dir / "custom_data.geojson"
    with open(output_file, 'w') as f:
        f.write(response.text)
    
    print(f"Downloaded to {output_file}")

if __name__ == "__main__":
    download_custom_data()

2. Update Catalog

Add entry to backend/data/catalog.json:

{
  "custom_data": {
    "path": "custom/custom_data.geojson",
    "description": "Short description for display",
    "semantic_description": "Detailed description mentioning key concepts that help AI discovery. Include what data represents, coverage area, and typical use cases.",
    "categories": ["infrastructure"],
    "tags": ["roads", "transport", "panama"],
    "schema": {
      "columns": ["name", "type", "length_km", "geom"],
      "geometry_type": "LineString"
    }
  }
}

Key Fields:

  • path: Relative path from backend/data/
  • description: Human-readable short description
  • semantic_description: Detailed description for AI semantic search
  • categories: Classify dataset
  • tags: Keywords for filtering
  • schema: Optional column and geometry info
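Before moving on, it can help to confirm the new entry carries the fields listed above. A small sketch; it checks field presence only, not content:

import json

required = {"path", "description", "semantic_description", "categories", "tags"}

with open("backend/data/catalog.json") as f:
    catalog = json.load(f)

missing = required - catalog["custom_data"].keys()
assert not missing, f"catalog entry is missing fields: {missing}"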

3. Regenerate Embeddings

cd backend
rm data/embeddings.npy
python -c "from backend.core.semantic_search import get_semantic_search; get_semantic_search()"

This generates vector embeddings for the new dataset description.
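Conceptually, regeneration re-encodes each catalog description into a vector and stores the array. The following is a rough sketch of the idea only, assuming a sentence-transformers model; the project's actual logic lives in backend/core/semantic_search.py and may differ:

# Conceptual sketch; not the project's actual implementation
import json
import numpy as np
from sentence_transformers import SentenceTransformer

with open("data/catalog.json") as f:
    catalog = json.load(f)

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model
texts = [e.get("semantic_description", e["description"]) for e in catalog.values()]

np.save("data/embeddings.npy", model.encode(texts))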

4. Test Discovery

# Start backend
uvicorn backend.main:app --reload

# Test query
curl -X POST http://localhost:8000/api/chat \
  -H "Content-Type: application/json" \
  -d '{"message":"show me [your new data]","history":[]}'

Verify the AI can discover and query your dataset.


Script Templates

Basic Download Template

#!/usr/bin/env python3
"""
Download script for [DATA SOURCE NAME]
"""

import geopandas as gpd
from pathlib import Path
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Constants
DATA_URL = "https://example.com/data.geojson"
OUTPUT_DIR = Path(__file__).parent.parent / "data" / "category"

def download_data():
    """Download and process data."""
    OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
    
    logger.info(f"Downloading from {DATA_URL}")
    
    # Download
    gdf = gpd.read_file(DATA_URL)
    
    # Process (example: project to WGS84)
    if gdf.crs and gdf.crs != "EPSG:4326":
        gdf = gdf.to_crs("EPSG:4326")
    
    # Save
    output_file = OUTPUT_DIR / "data.geojson"
    gdf.to_file(output_file, driver="GeoJSON")
    
    logger.info(f"Saved {len(gdf)} features to {output_file}")

if __name__ == "__main__":
    download_data()

API Download Template

import requests
import json
from pathlib import Path

# Placeholder endpoint and output path; replace with the real service values
API_URL = "https://example.com/arcgis/rest/services/layer/0/query"
OUTPUT_FILE = Path(__file__).parent.parent / "data" / "category" / "data.geojson"

def download_from_api():
    """Download from a REST API."""
    
    # Query API (ArcGIS-style parameters)
    params = {
        "where": "country='Panama'",
        "outFields": "*",
        "f": "geojson"
    }
    
    response = requests.get(API_URL, params=params, timeout=60)
    response.raise_for_status()
    
    # Parse and save
    geojson = response.json()
    
    OUTPUT_FILE.parent.mkdir(parents=True, exist_ok=True)
    with open(OUTPUT_FILE, 'w') as f:
        json.dump(geojson, f)

if __name__ == "__main__":
    download_from_api()

Data Processing Best Practices

1. Coordinate System

Always save in WGS84 (EPSG:4326):

if gdf.crs and gdf.crs != "EPSG:4326":
    gdf = gdf.to_crs("EPSG:4326")

2. Column Naming

Use lowercase with underscores:

gdf.columns = gdf.columns.str.lower().str.replace(' ', '_')

3. Null Handling

Remove or fill nulls:

gdf['name'] = gdf['name'].fillna('Unknown')
gdf = gdf.dropna(subset=['geom'])

4. Simplify Geometry (if needed)

For large datasets:

gdf['geom'] = gdf['geom'].simplify(tolerance=0.001)

5. Validate GeoJSON

import json

# Check valid JSON
with open(output_file) as f:
    data = json.load(f)
    
assert data['type'] == 'FeatureCollection'
assert 'features' in data

Data Sources Reference

| Source | Script | Frequency | Size |
|--------|--------|-----------|------|
| Geofabrik (OSM) | download_geofabrik.py | Monthly | ~100MB |
| HDX | download_hdx_panama.py | Annual | ~5MB |
| World Bank | download_worldbank.py | Annual | ~1MB |
| STRI | download_stri_data.py | As updated | ~50MB |
| Kontur | Manual | Quarterly | ~200MB |

Next Steps