
Data Ingestion Scripts

Documentation for scripts that download and process geographic datasets.


Overview

Data ingestion scripts in backend/scripts/ automate downloading and processing of various data sources:

  • OpenStreetMap via Geofabrik
  • Humanitarian Data Exchange (HDX)
  • World Bank Open Data
  • STRI GIS Portal
  • Kontur Population
  • Global datasets

## Scripts Reference

1. download_geofabrik.py

Downloads OpenStreetMap data for Panama from Geofabrik.

Usage:

cd backend
python scripts/download_geofabrik.py

What it downloads:

  • Roads network
  • Buildings
  • POI (points of interest)
  • Natural features

Output: GeoJSON files in backend/data/osm/

Schedule: Run monthly for updates
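A quick way to verify the output is to load each file and print a feature count. This is a minimal sketch; it assumes geopandas is installed and the snippet is run from the repository root:

from pathlib import Path
import geopandas as gpd

osm_dir = Path("backend/data/osm")

# Report a feature count and geometry types for every GeoJSON the script produced
for geojson in sorted(osm_dir.glob("*.geojson")):
    gdf = gpd.read_file(geojson)
    print(f"{geojson.name}: {len(gdf)} features, {list(gdf.geom_type.unique())}")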


2. download_hdx_panama.py

Downloads administrative boundaries from Humanitarian Data Exchange.

Usage:

python scripts/download_hdx_panama.py

Downloads:

  • Level 1: Provinces (10 features)
  • Level 2: Districts (81 features)
  • Level 3: Corregimientos (679 features)

Output: backend/data/hdx/pan_admin{1,2,3}_2021.geojson

Schedule: Annual updates
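The feature counts above make a convenient sanity check after the download. A minimal sketch using the output paths listed above, run from the repository root:

from pathlib import Path
import geopandas as gpd

hdx_dir = Path("backend/data/hdx")
expected = {1: 10, 2: 81, 3: 679}  # provinces, districts, corregimientos

for level, count in expected.items():
    gdf = gpd.read_file(hdx_dir / f"pan_admin{level}_2021.geojson")
    assert len(gdf) == count, f"admin{level}: expected {count}, got {len(gdf)}"
    print(f"admin{level}: {len(gdf)} features OK")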


3. download_worldbank.py

Downloads World Bank development indicators.

Usage:

python scripts/download_worldbank.py

Indicators:

  • GDP per capita
  • Life expectancy
  • Access to electricity
  • Internet usage
  • And more...

Output: backend/data/worldbank/indicators.geojson

Processing: Joins indicator data with country geometries
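The join itself is a plain attribute merge between a tabular indicator frame and country geometries. A rough sketch only; the file path, the ISO-code column names, and the placeholder value are illustrative assumptions, not the script's actual names:

import geopandas as gpd
import pandas as pd

# Country polygons and tabular indicators, both keyed by an ISO3 country code
countries = gpd.read_file("backend/data/global/countries.geojson")      # assumed path
indicators = pd.DataFrame({"iso3": ["PAN"], "gdp_per_capita": [0.0]})   # placeholder value

# Attach indicator columns to each country geometry, then write GeoJSON
joined = countries.merge(indicators, left_on="ISO_A3", right_on="iso3", how="left")
joined.to_file("backend/data/worldbank/indicators.geojson", driver="GeoJSON")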


4. download_stri_data.py

Downloads datasets from STRI GIS Portal.

Usage:

python scripts/download_stri_data.py

Downloads:

  • Protected areas
  • Forest cover
  • Environmental datasets

Output: backend/data/stri/*.geojson

Note: Uses ArcGIS REST API


5. stri_catalog_scraper.py

Discovers and catalogs all available STRI datasets.

Usage:

python scripts/stri_catalog_scraper.py

Output: JSON catalog of 100+ STRI datasets with metadata

Features:

  • Priority scoring
  • Temporal dataset detection
  • REST endpoint generation
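Once generated, the catalog can be filtered or sorted offline. A sketch of typical use; the output filename and field names here are assumptions, so check the script for the real ones:

import json

# Hypothetical output path and fields
with open("backend/data/stri/stri_catalog.json") as f:
    catalog = json.load(f)

# List the ten highest-priority datasets with their REST endpoints
for entry in sorted(catalog, key=lambda d: d.get("priority", 0), reverse=True)[:10]:
    print(entry.get("name"), "->", entry.get("rest_url"))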

6. create_province_layer.py

Creates province-level socioeconomic data layer.

Usage:

python scripts/create_province_layer.py

Combines:

  • INEC Census data
  • MPI (poverty index)
  • Administrative geometries

Output: backend/data/socioeconomic/province_socioeconomic.geojson


7. download_global_datasets.py

Downloads global reference datasets.

Usage:

python scripts/download_global_datasets.py

Downloads:

  • Natural Earth country boundaries
  • Global admin boundaries
  • Reference layers

Output: backend/data/global/*.geojson


8. register_global_datasets.py

Registers global datasets in catalog.json.

Usage:

python scripts/register_global_datasets.py

Action: Adds dataset entries to backend/data/catalog.json
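Conceptually, registration is a merge into the existing catalog dictionary. A minimal sketch; the example entry is illustrative, not one the script actually writes:

import json
from pathlib import Path

catalog_path = Path("backend/data/catalog.json")
catalog = json.loads(catalog_path.read_text())

# Illustrative entry; real entries follow the schema shown under "Adding New Data Sources"
catalog["ne_countries"] = {
    "path": "global/ne_countries.geojson",
    "description": "Natural Earth country boundaries",
    "categories": ["boundaries"],
    "tags": ["countries", "global"]
}

catalog_path.write_text(json.dumps(catalog, indent=2))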


Adding New Data Sources

Step-by-Step Guide

1. Create Download Script

Create backend/scripts/download_mycustom_data.py:

import requests
from pathlib import Path

def download_custom_data():
    """Download custom dataset."""
    
    # Define output path
    output_dir = Path(__file__).parent.parent / "data" / "custom"
    output_dir.mkdir(parents=True, exist_ok=True)
    
    # Download data
    url = "https://example.com/data.geojson"
    response = requests.get(url, timeout=60)
    response.raise_for_status()
    
    # Save as GeoJSON
    output_file = output_dir / "custom_data.geojson"
    with open(output_file, 'w') as f:
        f.write(response.text)
    
    print(f"Downloaded to {output_file}")

if __name__ == "__main__":
    download_custom_data()

2. Update Catalog

Add entry to backend/data/catalog.json:

{
  "custom_data": {
    "path": "custom/custom_data.geojson",
    "description": "Short description for display",
    "semantic_description": "Detailed description mentioning key concepts that help AI discovery. Include what data represents, coverage area, and typical use cases.",
    "categories": ["infrastructure"],
    "tags": ["roads", "transport", "panama"],
    "schema": {
      "columns": ["name", "type", "length_km", "geom"],
      "geometry_type": "LineString"
    }
  }
}

Key Fields:

  • path: Relative path from backend/data/
  • description: Human-readable short description
  • semantic_description: Detailed description for AI semantic search
  • categories: Classify dataset
  • tags: Keywords for filtering
  • schema: Optional column and geometry info
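Before moving on, it can help to confirm the new entry carries the fields listed above. A small sketch; it checks field presence only, not content:

import json

required = {"path", "description", "semantic_description", "categories", "tags"}

with open("backend/data/catalog.json") as f:
    catalog = json.load(f)

missing = required - catalog["custom_data"].keys()
assert not missing, f"catalog entry is missing fields: {missing}"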

3. Regenerate Embeddings

cd backend
rm data/embeddings.npy
python -c "from backend.core.semantic_search import get_semantic_search; get_semantic_search()"

This generates vector embeddings for the new dataset description.
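Conceptually, regeneration re-encodes each catalog description into a vector and stores the array. The following is a rough sketch of the idea only, assuming a sentence-transformers model; the project's actual logic lives in backend/core/semantic_search.py and may differ:

# Conceptual sketch; not the project's actual implementation
import json
import numpy as np
from sentence_transformers import SentenceTransformer

with open("data/catalog.json") as f:
    catalog = json.load(f)

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model
texts = [e.get("semantic_description", e["description"]) for e in catalog.values()]

np.save("data/embeddings.npy", model.encode(texts))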

4. Test Discovery

# Start backend
uvicorn backend.main:app --reload

# Test query
curl -X POST http://localhost:8000/api/chat \
  -H "Content-Type: application/json" \
  -d '{"message":"show me [your new data]","history":[]}'

Verify the AI can discover and query your dataset.


Script Templates

Basic Download Template

#!/usr/bin/env python3
"""
Download script for [DATA SOURCE NAME]
"""

import geopandas as gpd
from pathlib import Path
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Constants
DATA_URL = "https://example.com/data.geojson"
OUTPUT_DIR = Path(__file__).parent.parent / "data" / "category"

def download_data():
    """Download and process data."""
    OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
    
    logger.info(f"Downloading from {DATA_URL}")
    
    # Download
    gdf = gpd.read_file(DATA_URL)
    
    # Process (example: project to WGS84)
    if gdf.crs and gdf.crs != "EPSG:4326":
        gdf = gdf.to_crs("EPSG:4326")
    
    # Save
    output_file = OUTPUT_DIR / "data.geojson"
    gdf.to_file(output_file, driver="GeoJSON")
    
    logger.info(f"Saved {len(gdf)} features to {output_file}")

if __name__ == "__main__":
    download_data()

API Download Template

import requests
import json
from pathlib import Path

# Placeholder endpoint and output path; replace with the real service values
API_URL = "https://example.com/arcgis/rest/services/layer/0/query"
OUTPUT_FILE = Path(__file__).parent.parent / "data" / "category" / "data.geojson"

def download_from_api():
    """Download from a REST API."""
    
    # Query API (ArcGIS-style parameters)
    params = {
        "where": "country='Panama'",
        "outFields": "*",
        "f": "geojson"
    }
    
    response = requests.get(API_URL, params=params, timeout=60)
    response.raise_for_status()
    
    # Parse and save
    geojson = response.json()
    
    OUTPUT_FILE.parent.mkdir(parents=True, exist_ok=True)
    with open(OUTPUT_FILE, 'w') as f:
        json.dump(geojson, f)

if __name__ == "__main__":
    download_from_api()

Data Processing Best Practices

1. Coordinate System

Always save in WGS84 (EPSG:4326):

if gdf.crs and gdf.crs != "EPSG:4326":
    gdf = gdf.to_crs("EPSG:4326")

2. Column Naming

Use lowercase with underscores:

gdf.columns = gdf.columns.str.lower().str.replace(' ', '_')

3. Null Handling

Remove or fill nulls:

gdf['name'] = gdf['name'].fillna('Unknown')
gdf = gdf.dropna(subset=['geom'])

4. Simplify Geometry (if needed)

For large datasets:

gdf['geom'] = gdf['geom'].simplify(tolerance=0.001)

5. Validate GeoJSON

import json

# Check valid JSON
with open(output_file) as f:
    data = json.load(f)
    
assert data['type'] == 'FeatureCollection'
assert 'features' in data

Data Sources Reference

| Source | Script | Frequency | Size |
|--------|--------|-----------|------|
| Geofabrik (OSM) | download_geofabrik.py | Monthly | ~100MB |
| HDX | download_hdx_panama.py | Annual | ~5MB |
| World Bank | download_worldbank.py | Annual | ~1MB |
| STRI | download_stri_data.py | As updated | ~50MB |
| Kontur | Manual | Quarterly | ~200MB |

Next Steps