GeoQuery / docs /DATA_FLOW.md
GerardCB's picture
Deploy to Spaces (Final Clean)
4851501

Data Flow: End-to-End Request Processing

This document provides a detailed walkthrough of how a user query flows through the GeoQuery system from input to visualization.


Overview

User Query → Intent Detection → Semantic Search → SQL Generation → 
Query Execution → Result Formatting → Explanation → Map Rendering

Timeline: 2-8 seconds for typical queries


Step-by-Step Walkthrough

Example Query

User Input: "Show me hospitals in Panama City"


Step 1: Frontend - User Submits Query

Component: ChatPanel.tsx

const handleSubmit = async (message: string) => {
  // Add user message to chat
  setMessages(prev => [...prev, { role: 'user', content: message }]);
  
  // Send to backend via SSE
  const response = await fetch('http://localhost:8000/api/chat', {
    method: 'POST',
    headers: {'Content-Type': 'application/json'},
    body: JSON.stringify({ message, history })
  });
  
  // Start streaming response
  const reader = response.body.getReader();
  ...
};

Request Payload:

{
  "message": "Show me hospitals in Panama City",
  "history": []
}

Step 2: Backend - API Endpoint Receives Request

File: backend/api/endpoints/chat.py

@router.post("/chat")
async def chat(request: ChatRequest):
    # Initialize executor
    executor = QueryExecutor()
    
    # Process query with streaming
    async for event in executor.process_query_stream(
        request.message, 
        request.history
    ):
        yield sse_format(event)

Action: Routes to QueryExecutor.process_query_stream()


Step 3: Intent Detection

Service: LLMGateway.detect_intent()
File: backend/core/llm_gateway.py

LLM Prompt:

Analyze this user query and determine the best output type.

User Query: "Show me hospitals in Panama City"

THINK STEP BY STEP:
1. What is the user asking for?
2. Does this require geographic visualization (map)?
...

Respond with ONLY ONE of these exact words:
- GENERAL_CHAT
- DATA_QUERY
- MAP_REQUEST
- SPATIAL_OP
- STAT_QUERY

Gemini Response:

Thinking: The user wants to SEE hospitals on a map, explicitly asks to "show"
Response: MAP_REQUEST

Streaming to Frontend:

{
  "event": "intent",
  "data": {"intent": "MAP_REQUEST"}
}

Frontend: Displays intent badge in chat


Step 4: Semantic Discovery

Service: SemanticSearch.search_table_names()
File: backend/core/semantic_search.py

Process:

  1. Convert query to embedding vector (384 dimensions)

    query_embedding = model.encode("Show me hospitals in Panama City")
    
  2. Calculate cosine similarity with all dataset embeddings

    similarities = cosine_similarity(query_embedding, catalog_embeddings)
    
  3. Return top-k matches

    top_k_indices = np.argsort(similarities)[-15:][::-1]
    

Result:

[
  "panama_healthsites_geojson",  # similarity: 0.89
  "osm_amenities",                # similarity: 0.76
  "panama_hospitals",             # similarity: 0.74
  "osm_healthcare",               # similarity: 0.71
  ...
]

Streaming to Frontend:

{
  "event": "status",
  "data": {"status": "📚 Searching data catalog..."}
}

Performance: <10ms for 100+ datasets


Step 5: Table Schema Retrieval

Service: GeoEngine.ensure_table_loaded()
File: backend/core/geo_engine.py

Process:

  1. Check if table already loaded in DuckDB

    if "panama_healthsites_geojson" not in self.loaded_tables:
    
  2. Load GeoJSON file

    gdf = gpd.read_file("backend/data/osm/healthsites.geojson")
    
  3. Register in DuckDB

    conn.register("panama_healthsites_geojson", gdf)
    self.loaded_tables["panama_healthsites_geojson"] = gdf
    

Schema Retrieved:

Table: panama_healthsites_geojson
Columns: osm_id, name, amenity, operator, addr:city, geom
Row count: 986
Geometry type: Point

Streaming to Frontend:

{
  "event": "status",
  "data": {"status": "💾 Loading tables: panama_healthsites_geojson..."}
}

Step 6: SQL Generation

Service: LLMGateway.generate_analytical_sql()
File: backend/core/llm_gateway.py

Prompt to Gemini:

You are a DuckDB SQL expert. Generate a valid DuckDB SQL query.

Table: panama_healthsites_geojson
Columns: osm_id, name, amenity, operator, addr:city, geom
Row count: 986

User Request: "Show me hospitals in Panama City"

Rules:
1. Return ONLY the SQL query
2. ALWAYS include 'geom' for map visualization
3. Use ILIKE for case-insensitive matching
4. Use ST_Intersects for spatial filtering
...

Generate SQL:

Gemini Thinking (streamed):

"I need to:
1. Filter for amenity='hospital'
2. Filter for Panama City (district name is 'Panamá')
3. Use spatial intersection with admin boundaries
4. Include geometry for map rendering"

Gemini Response:

SELECT 
  name, 
  amenity, 
  operator,
  geom 
FROM panama_healthsites_geojson 
WHERE amenity = 'hospital' 
  AND ST_Intersects(
    geom, 
    (SELECT geom FROM pan_admin2 WHERE adm2_name = 'Panamá')
  )

Streaming to Frontend:

{
  "event": "chunk",
  "data": {"type": "thought", "content": "I need to filter for hospitals..."}
}

Performance: ~1 second for SQL generation


Step 7: Query Execution

Service: GeoEngine.execute_spatial_query()
File: backend/core/geo_engine.py

Execution:

try:
    result = conn.execute(sql).fetchdf()
    geojson = json.loads(result.to_json())
except Exception as e:
    # Try SQL correction
    corrected_sql = await llm.correct_sql(query, sql, str(e), schema)
    result = conn.execute(corrected_sql).fetchdf()

Result:

# GeoDataFrame with 45 rows
   name                      amenity   operator              geom
0  Hospital Santo Tomás      hospital  MINSA       POINT(...)
1  Hospital del Niño         hospital  CSS         POINT(...)
...

Convert to GeoJSON:

{
  "type": "FeatureCollection",
  "features": [
    {
      "type": "Feature",
      "geometry": {"type": "Point", "coordinates": [-79.5, 8.98]},
      "properties": {"name": "Hospital Santo Tomás", "amenity": "hospital"}
    },
    ...
  ]
}

Streaming to Frontend:

{
  "event": "status",
  "data": {"status": "✅ Found 45 results"}
}

Performance: 100ms - 2s depending on data size


Step 8: Result Formatting

Service: ResponseFormatter.format_geojson_layer()
File: backend/services/response_formatter.py

Layer Name Generation:

layer_info = await llm.generate_layer_name(query, sql)
# Returns: {"name": "Hospitals in Panama City", "emoji": "🏥", "pointStyle": "icon"}

GeoJSON Enhancement:

geojson["properties"] = {
    "layer_id": "abc123",
    "layer_name": "Hospitals in Panama City",
    "style": {
        "color": "#E63946",
        "fillColor": "#E63946",
        "opacity": 0.8,
        "fillOpacity": 0.4
    },
    "pointMarker": {
        "icon": "🏥",
        "style": "icon",
        "color": "#E63946",
        "size": 32
    },
    "choropleth": {"enabled": false}
}

Auto-Detection:

  • Detects geometry type (Point)
  • Checks for numeric columns (none meaningful)
  • Configures point marker style based on pointStyle: "icon"

Step 9: Explanation Generation

Service: LLMGateway.stream_explanation()
File: backend/core/llm_gateway.py

Prompt to Gemini:

Explain the results of this data query to the user.

User Question: "Show me hospitals in Panama City"
SQL Query: SELECT name, amenity, geom FROM ... WHERE amenity='hospital'...
Data Result Summary: Found 45 features (Points)

Instructions:
1. Keep response concise
2. Only describe ACTUAL data returned
3. Cite data source
4. Speak as GeoQuery

Gemini Response (streamed):

"I have located 45 hospitals within Panama City district.

The map displays each hospital as a 🏥 icon. You can click on any 
marker to see details including the hospital name and operator.

**Key facilities include**:
- Hospital Santo Tomás (MINSA)
- Hospital del Niño (CSS)
- Hospital Punta Pacifica (Private)

Source: Healthcare facility data from OpenStreetMap via Healthsites.io"

Streaming to Frontend:

{
  "event": "chunk",
  "data": {"type": "text", "content": "I have located 45 hospitals..."}
}

Performance: ~1-2 seconds for explanation


Step 10: Final Result Event

Event Sent:

{
  "event": "result",
  "data": {
    "response": "I have located 45 hospitals within Panama City...",
    "sql_query": "SELECT name, amenity, geom FROM ...",
    "geojson": { /* GeoJSON with 45 features */ },
    "chart_data": null,
    "raw_data": [ /* 45 rows of data */ ],
    "data_citations": [
      "Healthcare facility data from OpenStreetMap via Healthsites.io"
    ]
  }
}

Step 11: Frontend - Map Rendering

Component: MapViewer.tsx

Process:

  1. Receive GeoJSON from result event

  2. Create new MapLayer

    const newLayer: MapLayer = {
      id: geojson.properties.layer_id,
      name: geojson.properties.layer_name,
      data: geojson,
      visible: true,
      style: geojson.properties.style,
      pointMarker: geojson.properties.pointMarker
    };
    
  3. Render with Leaflet

    <GeoJSON
      data={layer.data}
      pointToLayer={(feature, latlng) => {
        if (layer.pointMarker?.style === "icon") {
          return L.marker(latlng, {
            icon: L.divIcon({
              html: `<div style="font-size: 32px">${layer.pointMarker.icon}</div>`
            })
          });
        }
      }}
    />
    
  4. Auto-fit bounds to show all hospitals

  5. Display layer in legend panel

Result: Interactive map with 45 hospital markers (🏥 icons)


Performance Breakdown

Step Service Time Async
1. Frontend Submit - <10ms -
2. API Routing FastAPI <5ms -
3. Intent Detection Gemini ~500ms
4. Semantic Search SentenceTransformer <10ms
5. Schema Loading DuckDB 50-200ms -
6. SQL Generation Gemini ~1s ✓ Streamed
7. Query Execution DuckDB 100ms-2s -
8. Formatting Python 10-50ms -
9. Explanation Gemini ~1s ✓ Streamed
10. Frontend Render Leaflet 50-200ms -

Total: 2-5 seconds (perception: faster due to streaming)


Error Handling Flow

SQL Execution Failure

SQL Error → Extract Error Message → Send to LLM → Generate Corrected SQL → 
Retry Execution → If Still Fails → Return Error to User

Example:

try:
    result = execute_query(sql)
except Exception as e:
    # Error: column "hospitals" does not exist
    corrected_sql = await llm.correct_sql(query, sql, str(e), schema)
    # LLM fixes: hospitals → panama_healthsites_geojson
    result = execute_query(corrected_sql)

Data Unavailable

LLM Realizes Data Missing → Returns Special Marker → 
System Detects Marker → Returns Helpful Error Message

Example:

-- ERROR: DATA_UNAVAILABLE
-- Requested: crime statistics
-- Available: admin boundaries, hospitals, schools

Streaming Architecture

Benefits of SSE (Server-Sent Events):

  1. Progressive Disclosure: User sees thinking process
  2. Faster Perceived Performance: Content streams in
  3. Transparency: Shows "why" behind answers
  4. Simple Protocol: HTTP-based, works everywhere

Event Types:

  • status: Processing updates ("🔍 Searching...", "⚡ Executing...")
  • intent: Detected intent category
  • chunk: Streamed content (thought or text)
  • result: Final payload with all data

Complex Query Flow

For queries requiring multiple steps (e.g., "Compare hospital density with school density by province"):

  1. Complexity Detection: QueryPlanner identifies multi-dataset query
  2. Step Decomposition: Break into atomic steps
    • Step 1: Count hospitals per province
    • Step 2: Count schools per province
    • Step 3: Calculate ratios
  3. Parallel Execution: Execute independent steps concurrently
  4. Result Combination: Merge results for final answer
  5. Unified Explanation: LLM explains combined analysis

See backend/core/query_planner.py for implementation.


Next Steps