# Data Ingestion Scripts
Documentation for scripts that download and process geographic datasets.
## Overview

Data ingestion scripts in `backend/scripts/` automate downloading and processing of various data sources:
- OpenStreetMap via Geofabrik
- Humanitarian Data Exchange (HDX)
- World Bank Open Data
- STRI GIS Portal
- Kontur Population
- Global datasets
## Scripts Reference
### 1. download_geofabrik.py

Downloads OpenStreetMap data for Panama from Geofabrik.

Usage:

```bash
cd backend
python scripts/download_geofabrik.py
```

What it downloads:

- Roads network
- Buildings
- POI (points of interest)
- Natural features

Output: GeoJSON files in `backend/data/osm/`

Schedule: Run monthly for updates
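
Geofabrik distributes OSM data as `.osm.pbf` extracts, so splitting the download into thematic layers is part of the processing step. Below is a minimal sketch of that idea, assuming a GDAL build with the OSM driver; the filename and layer mapping are illustrative assumptions, and the actual script may use a different toolchain.

```python
# Minimal sketch: split a Geofabrik .osm.pbf extract into thematic GeoJSON layers.
# Assumes GDAL's OSM driver is available; the filename is hypothetical.
from pathlib import Path

import geopandas as gpd

PBF_FILE = Path("backend/data/osm/panama-latest.osm.pbf")  # hypothetical download location
OUTPUT_DIR = PBF_FILE.parent

# GDAL's OSM driver exposes fixed layer names: points, lines, multipolygons, ...
layers = {
    "roads": "lines",
    "buildings": "multipolygons",
    "pois": "points",
}

for name, layer in layers.items():
    gdf = gpd.read_file(PBF_FILE, layer=layer)
    gdf.to_file(OUTPUT_DIR / f"{name}.geojson", driver="GeoJSON")
```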
### 2. download_hdx_panama.py

Downloads administrative boundaries from the Humanitarian Data Exchange.

Usage:

```bash
python scripts/download_hdx_panama.py
```

Downloads:

- Level 1: Provinces (10 features)
- Level 2: Districts (81 features)
- Level 3: Corregimientos (679 features)

Output: `backend/data/hdx/pan_admin{1,2,3}_2021.geojson`

Schedule: Annual updates
### 3. download_worldbank.py

Downloads World Bank development indicators.

Usage:

```bash
python scripts/download_worldbank.py
```

Indicators:

- GDP per capita
- Life expectancy
- Access to electricity
- Internet usage
- And more...

Output: `backend/data/worldbank/indicators.geojson`

Processing: Joins indicator data with country geometries
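
The join step might look roughly like the sketch below; the file paths and the `iso3`/`ISO_A3` column names are illustrative assumptions rather than the script's actual implementation.

```python
# Illustrative sketch: attach tabular indicator values to country geometries.
# Paths and column names are assumptions; the real script may differ.
import geopandas as gpd
import pandas as pd

countries = gpd.read_file("backend/data/global/countries.geojson")  # country polygons
indicators = pd.read_csv("backend/data/worldbank/indicators.csv")   # hypothetical intermediate table

# Attribute join on a shared ISO-3 country code, then write GeoJSON
joined = countries.merge(indicators, left_on="ISO_A3", right_on="iso3", how="left")
joined.to_file("backend/data/worldbank/indicators.geojson", driver="GeoJSON")
```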
### 4. download_stri_data.py

Downloads datasets from the STRI GIS Portal.

Usage:

```bash
python scripts/download_stri_data.py
```

Downloads:

- Protected areas
- Forest cover
- Environmental datasets

Output: `backend/data/stri/*.geojson`

Note: Uses the ArcGIS REST API
### 5. stri_catalog_scraper.py

Discovers and catalogs all available STRI datasets.

Usage:

```bash
python scripts/stri_catalog_scraper.py
```

Output: JSON catalog of 100+ STRI datasets with metadata

Features:

- Priority scoring
- Temporal dataset detection
- REST endpoint generation
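
For illustration only, a single scraped entry might look something like the dictionary below; the field names are hypothetical and simply show the kind of metadata (priority score, temporal flag, REST endpoint) the catalog records.

```python
# Hypothetical shape of one catalog entry; field names are illustrative,
# not the scraper's actual output schema.
example_entry = {
    "name": "protected_areas",
    "rest_endpoint": "https://example.com/arcgis/rest/services/protected_areas/FeatureServer/0",
    "priority_score": 0.9,   # used to rank datasets worth ingesting first
    "is_temporal": False,    # True when the dataset has multiple time steps
    "source": "STRI GIS Portal",
}
```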
### 6. create_province_layer.py

Creates a province-level socioeconomic data layer.

Usage:

```bash
python scripts/create_province_layer.py
```

Combines:

- INEC Census data
- MPI (poverty index)
- Administrative geometries

Output: `backend/data/socioeconomic/province_socioeconomic.geojson`
### 7. download_global_datasets.py

Downloads global reference datasets.

Usage:

```bash
python scripts/download_global_datasets.py
```

Downloads:

- Natural Earth country boundaries
- Global admin boundaries
- Reference layers

Output: `backend/data/global/*.geojson`
### 8. register_global_datasets.py

Registers global datasets in `catalog.json`.

Usage:

```bash
python scripts/register_global_datasets.py
```

Action: Adds dataset entries to `backend/data/catalog.json`
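
Conceptually, registration just rewrites the JSON catalog on disk. A minimal sketch is shown below with a hypothetical entry; the real script may construct its entries differently (the full entry schema is documented under "Update Catalog" further down).

```python
# Minimal sketch of appending an entry to backend/data/catalog.json.
# The dataset key and fields here are hypothetical placeholders.
import json
from pathlib import Path

catalog_path = Path("backend/data/catalog.json")
catalog = json.loads(catalog_path.read_text())

catalog["ne_countries"] = {
    "path": "global/ne_countries.geojson",
    "description": "Natural Earth country boundaries",
    "categories": ["boundaries"],
    "tags": ["global", "countries"],
}

catalog_path.write_text(json.dumps(catalog, indent=2))
```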
## Adding New Data Sources

### Step-by-Step Guide

#### 1. Create Download Script

Create `backend/scripts/download_mycustom_data.py`:
```python
import geopandas as gpd
import requests
from pathlib import Path


def download_custom_data():
    """Download custom dataset."""
    # Define output path
    output_dir = Path(__file__).parent.parent / "data" / "custom"
    output_dir.mkdir(parents=True, exist_ok=True)

    # Download data
    url = "https://example.com/data.geojson"
    response = requests.get(url)

    # Save as GeoJSON
    output_file = output_dir / "custom_data.geojson"
    with open(output_file, 'w') as f:
        f.write(response.text)

    print(f"Downloaded to {output_file}")


if __name__ == "__main__":
    download_custom_data()
```
#### 2. Update Catalog

Add an entry to `backend/data/catalog.json`:
```json
{
  "custom_data": {
    "path": "custom/custom_data.geojson",
    "description": "Short description for display",
    "semantic_description": "Detailed description mentioning key concepts that help AI discovery. Include what the data represents, coverage area, and typical use cases.",
    "categories": ["infrastructure"],
    "tags": ["roads", "transport", "panama"],
    "schema": {
      "columns": ["name", "type", "length_km", "geom"],
      "geometry_type": "LineString"
    }
  }
}
```
Key Fields:

- `path`: Relative path from `backend/data/`
- `description`: Human-readable short description
- `semantic_description`: Detailed description for AI semantic search
- `categories`: Dataset classification
- `tags`: Keywords for filtering
- `schema`: Optional column and geometry info
#### 3. Regenerate Embeddings

```bash
cd backend
rm data/embeddings.npy
python -c "from backend.core.semantic_search import get_semantic_search; get_semantic_search()"
```
This generates vector embeddings for the new dataset description.
#### 4. Test Discovery

```bash
# Start backend
uvicorn backend.main:app --reload

# Test query
curl -X POST http://localhost:8000/api/chat \
  -H "Content-Type: application/json" \
  -d '{"message":"show me [your new data]","history":[]}'
```
Verify the AI can discover and query your dataset.
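
As an additional sanity check, a short snippet like the hypothetical one below confirms the entry is registered and the file parses:

```python
# Hypothetical smoke test: entry is registered and the file loads cleanly.
import json

import geopandas as gpd

with open("backend/data/catalog.json") as f:
    catalog = json.load(f)

assert "custom_data" in catalog, "dataset not registered in catalog.json"

gdf = gpd.read_file("backend/data/" + catalog["custom_data"]["path"])
print(f"{len(gdf)} features, CRS: {gdf.crs}")
```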
## Script Templates

### Basic Download Template
```python
#!/usr/bin/env python3
"""
Download script for [DATA SOURCE NAME]
"""
import geopandas as gpd
import requests
from pathlib import Path
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Constants
DATA_URL = "https://example.com/data.geojson"
OUTPUT_DIR = Path(__file__).parent.parent / "data" / "category"


def download_data():
    """Download and process data."""
    OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
    logger.info(f"Downloading from {DATA_URL}")

    # Download
    gdf = gpd.read_file(DATA_URL)

    # Process (example: project to WGS84)
    if gdf.crs and gdf.crs != "EPSG:4326":
        gdf = gdf.to_crs("EPSG:4326")

    # Save
    output_file = OUTPUT_DIR / "data.geojson"
    gdf.to_file(output_file, driver="GeoJSON")
    logger.info(f"Saved {len(gdf)} features to {output_file}")


if __name__ == "__main__":
    download_data()
```
### API Download Template

```python
import json
from pathlib import Path

import requests

# Example ArcGIS-style query endpoint and output location (adjust for your source)
API_URL = "https://example.com/arcgis/rest/services/layer/FeatureServer/0/query"
OUTPUT_FILE = Path(__file__).parent.parent / "data" / "custom" / "api_data.geojson"


def download_from_api():
    """Download from a REST API."""
    # Query API
    params = {
        "where": "country='Panama'",
        "outFields": "*",
        "f": "geojson"
    }
    response = requests.get(API_URL, params=params)
    response.raise_for_status()

    # Parse and save
    geojson = response.json()
    OUTPUT_FILE.parent.mkdir(parents=True, exist_ok=True)
    with open(OUTPUT_FILE, 'w') as f:
        json.dump(geojson, f)
```
## Data Processing Best Practices

### 1. Coordinate System

Always save in WGS84 (EPSG:4326):

```python
if gdf.crs != "EPSG:4326":
    gdf = gdf.to_crs("EPSG:4326")
```
### 2. Column Naming

Use lowercase with underscores:

```python
gdf.columns = gdf.columns.str.lower().str.replace(' ', '_')
```
### 3. Null Handling

Remove or fill nulls:

```python
gdf['name'] = gdf['name'].fillna('Unknown')
gdf = gdf.dropna(subset=['geom'])
```
### 4. Simplify Geometry (if needed)

For large datasets:

```python
gdf['geom'] = gdf['geom'].simplify(tolerance=0.001)
```
### 5. Validate GeoJSON

```python
import json

# Check valid JSON
with open(output_file) as f:
    data = json.load(f)

assert data['type'] == 'FeatureCollection'
assert 'features' in data
```
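
Beyond the file structure, it can also be worth checking geometry validity; the snippet below is an optional extra check, not something the existing scripts necessarily perform.

```python
# Optional extra check: report invalid geometries so they can be inspected.
import geopandas as gpd

gdf = gpd.read_file(output_file)
invalid = gdf[~gdf.geometry.is_valid]
if not invalid.empty:
    print(f"{len(invalid)} invalid geometries found; inspect or repair before publishing")
```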
## Data Sources Reference

| Source | Script | Frequency | Size |
|---|---|---|---|
| Geofabrik (OSM) | `download_geofabrik.py` | Monthly | ~100MB |
| HDX | `download_hdx_panama.py` | Annual | ~5MB |
| World Bank | `download_worldbank.py` | Annual | ~1MB |
| STRI | `download_stri_data.py` | As updated | ~50MB |
| Kontur | Manual | Quarterly | ~200MB |
## Next Steps
- Dataset Sources: ../data/DATASET_SOURCES.md
- Core Services: CORE_SERVICES.md