Draichi commited on
Commit
59fc0cc
1 Parent(s): 9a1be75

feat: init advanced_text_to_SQL.ipynb

Browse files
multi-agents-analysis/advanced_text_to_SQL.ipynb ADDED
@@ -0,0 +1,943 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "metadata": {},
6
+ "source": [
7
+ "# Query Pipeline for Advanced Text-to-SQL¶\n",
8
+ "\n",
9
+ "In this guide we show you how to setup a text-to-SQL pipeline over your data with our query pipeline syntax.\n",
10
+ "\n",
11
+ "This gives you flexibility to enhance text-to-SQL with additional techniques. We show these in the below sections:\n",
12
+ "\n",
13
+ "1. Query-Time Table Retrieval: Dynamically retrieve relevant tables in the text-to-SQL prompt.\n",
14
+ "2. Query-Time Sample Row retrieval: Embed/Index each row, and dynamically retrieve example rows for each table in the text-to-SQL prompt.\n",
15
+ " Our out-of-the box pipelines include our NLSQLTableQueryEngine and SQLTableRetrieverQueryEngine. (if you want to check out our text-to-SQL guide using these modules, take a look here). This guide implements an advanced version of those modules, giving you the utmost flexibility to apply this to your own setting.\n",
16
+ "\n",
17
+ "NOTE: Any Text-to-SQL application should be aware that executing arbitrary SQL queries can be a security risk. It is recommended to take precautions as needed, such as using restricted roles, read-only databases, sandboxing, etc.\n",
18
+ "\n",
19
+ "## Load and Ingest Data\n",
20
+ "\n",
21
+ "### Load Data\n",
22
+ "\n",
23
+ "We use the [WikiTableQuestions](https://github.com/ppasupat/WikiTableQuestions/releases) dataset (Pasupat and Liang 2015) as our test dataset.\n",
24
+ "\n",
25
+ "We go through all the csv's in one folder, store each in a sqlite database (we will then build an object index over each table schema).\n"
26
+ ]
27
+ },
28
+ {
29
+ "cell_type": "code",
30
+ "execution_count": 9,
31
+ "metadata": {},
32
+ "outputs": [
33
+ {
34
+ "name": "stdout",
35
+ "output_type": "stream",
36
+ "text": [
37
+ "processing file: WikiTableQuestions/csv/200-csv/0.csv\n",
38
+ "processing file: WikiTableQuestions/csv/200-csv/1.csv\n",
39
+ "processing file: WikiTableQuestions/csv/200-csv/10.csv\n",
40
+ "processing file: WikiTableQuestions/csv/200-csv/11.csv\n",
41
+ "processing file: WikiTableQuestions/csv/200-csv/12.csv\n",
42
+ "processing file: WikiTableQuestions/csv/200-csv/14.csv\n",
43
+ "processing file: WikiTableQuestions/csv/200-csv/15.csv\n",
44
+ "Error parsing WikiTableQuestions/csv/200-csv/15.csv: Error tokenizing data. C error: Expected 4 fields in line 16, saw 5\n",
45
+ "\n",
46
+ "processing file: WikiTableQuestions/csv/200-csv/17.csv\n",
47
+ "Error parsing WikiTableQuestions/csv/200-csv/17.csv: Error tokenizing data. C error: Expected 6 fields in line 5, saw 7\n",
48
+ "\n",
49
+ "processing file: WikiTableQuestions/csv/200-csv/18.csv\n",
50
+ "processing file: WikiTableQuestions/csv/200-csv/20.csv\n",
51
+ "processing file: WikiTableQuestions/csv/200-csv/22.csv\n",
52
+ "processing file: WikiTableQuestions/csv/200-csv/24.csv\n",
53
+ "processing file: WikiTableQuestions/csv/200-csv/25.csv\n",
54
+ "processing file: WikiTableQuestions/csv/200-csv/26.csv\n",
55
+ "processing file: WikiTableQuestions/csv/200-csv/28.csv\n",
56
+ "processing file: WikiTableQuestions/csv/200-csv/29.csv\n",
57
+ "processing file: WikiTableQuestions/csv/200-csv/3.csv\n",
58
+ "processing file: WikiTableQuestions/csv/200-csv/30.csv\n",
59
+ "processing file: WikiTableQuestions/csv/200-csv/31.csv\n",
60
+ "processing file: WikiTableQuestions/csv/200-csv/32.csv\n",
61
+ "processing file: WikiTableQuestions/csv/200-csv/33.csv\n",
62
+ "processing file: WikiTableQuestions/csv/200-csv/34.csv\n",
63
+ "Error parsing WikiTableQuestions/csv/200-csv/34.csv: Error tokenizing data. C error: Expected 4 fields in line 6, saw 13\n",
64
+ "\n",
65
+ "processing file: WikiTableQuestions/csv/200-csv/35.csv\n",
66
+ "processing file: WikiTableQuestions/csv/200-csv/36.csv\n",
67
+ "processing file: WikiTableQuestions/csv/200-csv/37.csv\n",
68
+ "processing file: WikiTableQuestions/csv/200-csv/38.csv\n",
69
+ "processing file: WikiTableQuestions/csv/200-csv/4.csv\n",
70
+ "processing file: WikiTableQuestions/csv/200-csv/41.csv\n",
71
+ "processing file: WikiTableQuestions/csv/200-csv/42.csv\n",
72
+ "processing file: WikiTableQuestions/csv/200-csv/44.csv\n",
73
+ "processing file: WikiTableQuestions/csv/200-csv/45.csv\n",
74
+ "processing file: WikiTableQuestions/csv/200-csv/46.csv\n",
75
+ "processing file: WikiTableQuestions/csv/200-csv/47.csv\n",
76
+ "processing file: WikiTableQuestions/csv/200-csv/48.csv\n",
77
+ "processing file: WikiTableQuestions/csv/200-csv/7.csv\n",
78
+ "processing file: WikiTableQuestions/csv/200-csv/8.csv\n",
79
+ "processing file: WikiTableQuestions/csv/200-csv/9.csv\n"
80
+ ]
81
+ }
82
+ ],
83
+ "source": [
84
+ "import pandas as pd\n",
85
+ "from pathlib import Path\n",
86
+ "\n",
87
+ "data_dir = Path(\"./WikiTableQuestions/csv/200-csv\")\n",
88
+ "csv_files = sorted([f for f in data_dir.glob(\"*.csv\")])\n",
89
+ "dfs = []\n",
90
+ "for csv_file in csv_files:\n",
91
+ " print(f\"processing file: {csv_file}\")\n",
92
+ " try:\n",
93
+ " df = pd.read_csv(csv_file)\n",
94
+ " dfs.append(df)\n",
95
+ " except Exception as e:\n",
96
+ " print(f\"Error parsing {csv_file}: {str(e)}\")"
97
+ ]
98
+ },
99
+ {
100
+ "cell_type": "markdown",
101
+ "metadata": {},
102
+ "source": [
103
+ "### Extract Table Name and Summary from each Table\n",
104
+ "\n",
105
+ "Here we use gpt-3.5 to extract a table name (with underscores) and summary from each table with our Pydantic program.\n"
106
+ ]
107
+ },
108
+ {
109
+ "cell_type": "code",
110
+ "execution_count": null,
111
+ "metadata": {},
112
+ "outputs": [],
113
+ "source": [
114
+ "from llama_index.core.program import LLMTextCompletionProgram\n",
115
+ "from llama_index.core.bridge.pydantic import BaseModel, Field\n",
116
+ "from llama_index.llms.openai import OpenAI\n",
117
+ "\n",
118
+ "\n",
119
+ "class TableInfo(BaseModel):\n",
120
+ " \"\"\"Information regarding a structured table.\"\"\"\n",
121
+ "\n",
122
+ " table_name: str = Field(\n",
123
+ " ..., description=\"table name (must be underscores and NO spaces)\"\n",
124
+ " )\n",
125
+ " table_summary: str = Field(\n",
126
+ " ..., description=\"short, concise summary/caption of the table\"\n",
127
+ " )\n",
128
+ "\n",
129
+ "\n",
130
+ "prompt_str = \"\"\"\\\n",
131
+ "Give me a summary of the table with the following JSON format.\n",
132
+ "\n",
133
+ "- The table name must be unique to the table and describe it while being concise. \n",
134
+ "- Do NOT output a generic table name (e.g. table, my_table).\n",
135
+ "\n",
136
+ "Do NOT make the table name one of the following: {exclude_table_name_list}\n",
137
+ "\n",
138
+ "Table:\n",
139
+ "{table_str}\n",
140
+ "\n",
141
+ "Summary: \"\"\"\n",
142
+ "\n",
143
+ "program = LLMTextCompletionProgram.from_defaults(\n",
144
+ " output_cls=TableInfo,\n",
145
+ " llm=OpenAI(model=\"gpt-3.5-turbo\"),\n",
146
+ " prompt_template_str=prompt_str,\n",
147
+ ")\n",
148
+ "\n",
149
+ "print(program)"
150
+ ]
151
+ },
152
+ {
153
+ "cell_type": "code",
154
+ "execution_count": null,
155
+ "metadata": {},
156
+ "outputs": [],
157
+ "source": [
158
+ "import json\n",
159
+ "\n",
160
+ "\n",
161
+ "def _get_tableinfo_with_index(idx: int):\n",
162
+ " results_gen = Path(\"WikiTableQuestions_TableInfo\").glob(f\"{idx}_*\")\n",
163
+ " results_list = list(results_gen)\n",
164
+ " if len(results_list) == 0:\n",
165
+ " return None\n",
166
+ " if len(results_list) == 1:\n",
167
+ " path = results_list[0]\n",
168
+ " return TableInfo.parse_file(path)\n",
169
+ " else:\n",
170
+ " raise ValueError(\n",
171
+ " f\"More than one file matching index: {list(results_gen)}\"\n",
172
+ " )\n",
173
+ "\n",
174
+ "\n",
175
+ "table_names = set()\n",
176
+ "table_infos = []\n",
177
+ "for idx, df in enumerate(dfs):\n",
178
+ " table_info = _get_tableinfo_with_index(idx)\n",
179
+ " if table_info:\n",
180
+ " table_infos.append(table_info)\n",
181
+ " else:\n",
182
+ " while True:\n",
183
+ " df_str = df.head(10).to_csv()\n",
184
+ " table_info = program(\n",
185
+ " table_str=df_str,\n",
186
+ " exclude_table_name_list=str(list(table_names)),\n",
187
+ " )\n",
188
+ " table_name = table_info.table_name\n",
189
+ " print(f\"Processed table: {table_name}\")\n",
190
+ " if table_name not in table_names:\n",
191
+ " table_names.add(table_name)\n",
192
+ " break\n",
193
+ " else:\n",
194
+ " # try again\n",
195
+ " print(f\"Table name {table_name} already exists, trying again.\")\n",
196
+ " pass\n",
197
+ "\n",
198
+ " out_file = f\"WikiTableQuestions_TableInfo/{idx}_{table_name}.json\"\n",
199
+ " json.dump(table_info.dict(), open(out_file, \"w\"))\n",
200
+ " table_infos.append(table_info)"
201
+ ]
202
+ },
203
+ {
204
+ "cell_type": "markdown",
205
+ "metadata": {},
206
+ "source": [
207
+ "### Put Data in SQL Database\n",
208
+ "\n",
209
+ "We use sqlalchemy, a popular SQL database toolkit, to load all the tables.\n"
210
+ ]
211
+ },
212
+ {
213
+ "cell_type": "code",
214
+ "execution_count": null,
215
+ "metadata": {},
216
+ "outputs": [],
217
+ "source": [
218
+ "# put data into sqlite db\n",
219
+ "from sqlalchemy import (\n",
220
+ " create_engine,\n",
221
+ " MetaData,\n",
222
+ " Table,\n",
223
+ " Column,\n",
224
+ " String,\n",
225
+ " Integer,\n",
226
+ ")\n",
227
+ "import re\n",
228
+ "\n",
229
+ "\n",
230
+ "# Function to create a sanitized column name\n",
231
+ "def sanitize_column_name(col_name):\n",
232
+ " # Remove special characters and replace spaces with underscores\n",
233
+ " return re.sub(r\"\\W+\", \"_\", col_name)\n",
234
+ "\n",
235
+ "\n",
236
+ "# Function to create a table from a DataFrame using SQLAlchemy\n",
237
+ "def create_table_from_dataframe(\n",
238
+ " df: pd.DataFrame, table_name: str, engine, metadata_obj\n",
239
+ "):\n",
240
+ " # Sanitize column names\n",
241
+ " sanitized_columns = {col: sanitize_column_name(col) for col in df.columns}\n",
242
+ " df = df.rename(columns=sanitized_columns)\n",
243
+ "\n",
244
+ " # Dynamically create columns based on DataFrame columns and data types\n",
245
+ " columns = [\n",
246
+ " Column(col, String if dtype == \"object\" else Integer)\n",
247
+ " for col, dtype in zip(df.columns, df.dtypes)\n",
248
+ " ]\n",
249
+ "\n",
250
+ " # Create a table with the defined columns\n",
251
+ " table = Table(table_name, metadata_obj, *columns)\n",
252
+ "\n",
253
+ " # Create the table in the database\n",
254
+ " metadata_obj.create_all(engine)\n",
255
+ "\n",
256
+ " # Insert data from DataFrame into the table\n",
257
+ " with engine.connect() as conn:\n",
258
+ " for _, row in df.iterrows():\n",
259
+ " insert_stmt = table.insert().values(**row.to_dict())\n",
260
+ " conn.execute(insert_stmt)\n",
261
+ " conn.commit()\n",
262
+ "\n",
263
+ "\n",
264
+ "engine = create_engine(\"sqlite:///:memory:\")\n",
265
+ "metadata_obj = MetaData()\n",
266
+ "for idx, df in enumerate(dfs):\n",
267
+ " tableinfo = _get_tableinfo_with_index(idx)\n",
268
+ " print(f\"Creating table: {tableinfo.table_name}\")\n",
269
+ " create_table_from_dataframe(df, tableinfo.table_name, engine, metadata_obj)"
270
+ ]
271
+ },
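+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "As the NOTE at the top recommends, it is safer to execute LLM-generated SQL against a read-only connection. A minimal sketch (assuming you persist the tables to a file-based SQLite database instead of `:memory:`):\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Hypothetical hardened setup: ingest with a writable engine...\n",
+ "write_engine = create_engine(\"sqlite:///wiki_table_questions.db\")\n",
+ "\n",
+ "# ...and point the query pipeline at a read-only engine (SQLite URI mode),\n",
+ "# so generated SQL cannot modify the data.\n",
+ "readonly_engine = create_engine(\n",
+ "    \"sqlite:///file:wiki_table_questions.db?mode=ro&uri=true\"\n",
+ ")"
+ ]
+ },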
272
+ {
273
+ "cell_type": "markdown",
274
+ "metadata": {},
275
+ "source": [
276
+ "Setup Arize Phoenix for observability\n"
277
+ ]
278
+ },
279
+ {
280
+ "cell_type": "code",
281
+ "execution_count": null,
282
+ "metadata": {},
283
+ "outputs": [],
284
+ "source": [
285
+ "from openinference.instrumentation.llama_index import LlamaIndexInstrumentor\n",
286
+ "from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter\n",
287
+ "from opentelemetry.sdk import trace as trace_sdk\n",
288
+ "from opentelemetry.sdk.trace.export import SimpleSpanProcessor\n",
289
+ "\n",
290
+ "endpoint = \"http://127.0.0.1:6006/v1/traces\" # Phoenix receiver address\n",
291
+ "\n",
292
+ "tracer_provider = trace_sdk.TracerProvider()\n",
293
+ "tracer_provider.add_span_processor(\n",
294
+ " SimpleSpanProcessor(OTLPSpanExporter(endpoint)))\n",
295
+ "\n",
296
+ "LlamaIndexInstrumentor().instrument(tracer_provider=tracer_provider)"
297
+ ]
298
+ },
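+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The exporter above assumes a Phoenix instance is already listening on port 6006. If you don't have one running, a minimal sketch (assuming the `arize-phoenix` package is installed) is to launch it in-process:\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import phoenix as px\n",
+ "\n",
+ "# Starts a local Phoenix app serving the UI and the OTLP receiver on port 6006\n",
+ "px.launch_app()"
+ ]
+ },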
299
+ {
300
+ "cell_type": "markdown",
301
+ "metadata": {},
302
+ "source": [
303
+ "## Advanced Capability 1: Text-to-SQL with Query-Time Table Retrieval.\n",
304
+ "\n",
305
+ "We now show you how to setup an e2e text-to-SQL with table retrieval.\n",
306
+ "\n",
307
+ "Here we define the core modules.\n",
308
+ "\n",
309
+ "1. Object index + retriever to store table schemas\n",
310
+ "2. SQLDatabase object to connect to the above tables + SQLRetriever.\n",
311
+ "3. Text-to-SQL Prompt\n",
312
+ "4. Response synthesis Prompt\n",
313
+ "5. LLM\n",
314
+ "\n",
315
+ "### 1. Object index, retriever, SQLDatabase\n"
316
+ ]
317
+ },
318
+ {
319
+ "cell_type": "code",
320
+ "execution_count": null,
321
+ "metadata": {},
322
+ "outputs": [],
323
+ "source": [
324
+ "from llama_index.core.objects import (\n",
325
+ " SQLTableNodeMapping,\n",
326
+ " ObjectIndex,\n",
327
+ " SQLTableSchema,\n",
328
+ ")\n",
329
+ "from llama_index.core import SQLDatabase, VectorStoreIndex\n",
330
+ "\n",
331
+ "sql_database = SQLDatabase(engine)\n",
332
+ "\n",
333
+ "table_node_mapping = SQLTableNodeMapping(sql_database)\n",
334
+ "\n",
335
+ "table_schema_objs = [\n",
336
+ " SQLTableSchema(table_name=t.table_name, context_str=t.table_summary)\n",
337
+ " for t in table_infos\n",
338
+ "] # add a SQLTableSchema for each table\n",
339
+ "\n",
340
+ "obj_index = ObjectIndex.from_objects(objects=table_schema_objs,\n",
341
+ " object_mapping=table_node_mapping,\n",
342
+ " index_cls=VectorStoreIndex,\n",
343
+ " )\n",
344
+ "obj_retriever = obj_index.as_retriever(similarity_top_k=3)"
345
+ ]
346
+ },
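+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "For reference, the out-of-the-box modules mentioned in the intro (NLSQLTableQueryEngine and SQLTableRetrieverQueryEngine) wrap most of the pipeline we are about to build by hand. A rough sketch, not used in the rest of this guide:\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from llama_index.core.query_engine import (\n",
+ "    NLSQLTableQueryEngine,\n",
+ "    SQLTableRetrieverQueryEngine,\n",
+ ")\n",
+ "\n",
+ "# Text-to-SQL over an explicitly listed table...\n",
+ "simple_query_engine = NLSQLTableQueryEngine(\n",
+ "    sql_database=sql_database,\n",
+ "    tables=[table_infos[0].table_name],\n",
+ ")\n",
+ "\n",
+ "# ...or with table schemas retrieved from the object index at query time.\n",
+ "retriever_query_engine = SQLTableRetrieverQueryEngine(\n",
+ "    sql_database,\n",
+ "    obj_index.as_retriever(similarity_top_k=1),\n",
+ ")"
+ ]
+ },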
347
+ {
348
+ "cell_type": "markdown",
349
+ "metadata": {},
350
+ "source": [
351
+ "### 2. SQLRetriever + Table Parser\n"
352
+ ]
353
+ },
354
+ {
355
+ "cell_type": "code",
356
+ "execution_count": null,
357
+ "metadata": {},
358
+ "outputs": [],
359
+ "source": [
360
+ "from llama_index.core.retrievers import SQLRetriever\n",
361
+ "from typing import List\n",
362
+ "from llama_index.core.query_pipeline import FnComponent\n",
363
+ "\n",
364
+ "sql_retriever = SQLRetriever(sql_database)\n",
365
+ "\n",
366
+ "\n",
367
+ "def get_table_context_str(table_schema_objs: List[SQLTableSchema]):\n",
368
+ " \"\"\"Get table context string.\"\"\"\n",
369
+ " context_strs = []\n",
370
+ " for table_schema_obj in table_schema_objs:\n",
371
+ " table_info = sql_database.get_single_table_info(\n",
372
+ " table_schema_obj.table_name\n",
373
+ " )\n",
374
+ " if table_schema_obj.context_str:\n",
375
+ " table_opt_context = \" The table description is: \"\n",
376
+ " table_opt_context += table_schema_obj.context_str\n",
377
+ " table_info += table_opt_context\n",
378
+ "\n",
379
+ " context_strs.append(table_info)\n",
380
+ " return \"\\n\\n\".join(context_strs)\n",
381
+ "\n",
382
+ "\n",
383
+ "table_parser_component = FnComponent(fn=get_table_context_str)"
384
+ ]
385
+ },
386
+ {
387
+ "cell_type": "markdown",
388
+ "metadata": {},
389
+ "source": [
390
+ "### 3. Text-to-SQL Prompt + Output Parser\n"
391
+ ]
392
+ },
393
+ {
394
+ "cell_type": "code",
395
+ "execution_count": null,
396
+ "metadata": {},
397
+ "outputs": [],
398
+ "source": [
399
+ "from llama_index.core.prompts.default_prompts import DEFAULT_TEXT_TO_SQL_PROMPT\n",
400
+ "from llama_index.core import PromptTemplate\n",
401
+ "from llama_index.core.query_pipeline import FnComponent\n",
402
+ "from llama_index.core.llms import ChatResponse\n",
403
+ "\n",
404
+ "\n",
405
+ "def extract_sql_query(content: str) -> str:\n",
406
+ " sql_query_start = content.find(\"SQLQuery:\")\n",
407
+ " if sql_query_start == -1:\n",
408
+ " raise ValueError(\"No 'SQLQuery:' marker found in the response content\")\n",
409
+ "\n",
410
+ " query_content = content[sql_query_start + len(\"SQLQuery:\"):]\n",
411
+ " sql_result_start = query_content.find(\"SQLResult:\")\n",
412
+ "\n",
413
+ " if sql_result_start != -1:\n",
414
+ " query_content = query_content[:sql_result_start]\n",
415
+ "\n",
416
+ " return query_content\n",
417
+ "\n",
418
+ "\n",
419
+ "def clean_sql_query(query: str) -> str:\n",
420
+ " return query.strip().strip(\"```\").strip()\n",
421
+ "\n",
422
+ "\n",
423
+ "def parse_response_to_sql(response: ChatResponse) -> str:\n",
424
+ " \"\"\"\n",
425
+ " Parse a ChatResponse object to extract the SQL query.\n",
426
+ "\n",
427
+ " This function takes a ChatResponse object, which is expected to contain\n",
428
+ " an SQL query within its content, and extracts the SQL query string.\n",
429
+ " The function looks for specific markers ('SQLQuery:' and 'SQLResult:')\n",
430
+ " to identify the SQL query portion of the response.\n",
431
+ "\n",
432
+ " Args:\n",
433
+ " response (ChatResponse): A ChatResponse object containing the response\n",
434
+ " from a text-to-SQL model.\n",
435
+ "\n",
436
+ " Returns:\n",
437
+ " str: The extracted SQL query as a string, with surrounding whitespace\n",
438
+ " and code block markers (```) removed.\n",
439
+ "\n",
440
+ " Raises:\n",
441
+ " AttributeError: If the input doesn't have the expected 'message.content' attribute.\n",
442
+ " ValueError: If no 'SQLQuery:' marker is found in the response content.\n",
443
+ "\n",
444
+ " Note:\n",
445
+ " - The function assumes that the SQL query is preceded by 'SQLQuery:' \n",
446
+ " and optionally followed by 'SQLResult:'.\n",
447
+ " - Any content before 'SQLQuery:' or after 'SQLResult:' is discarded.\n",
448
+ " - The function removes leading/trailing whitespace and code block markers.\n",
449
+ "\n",
450
+ " Example:\n",
451
+ " >>> response = ChatResponse(message=Message(content=\"Some text\\nSQLQuery: SELECT * FROM table\\nSQLResult: ...\"))\n",
452
+ " >>> sql_query = parse_response_to_sql(response)\n",
453
+ " >>> print(sql_query)\n",
454
+ " SELECT * FROM table\n",
455
+ " \"\"\"\n",
456
+ " try:\n",
457
+ " content = str(response.message.content)\n",
458
+ " except AttributeError:\n",
459
+ " raise ValueError(\n",
460
+ " \"Input must be a ChatResponse object with a 'message.content' attribute\")\n",
461
+ "\n",
462
+ " sql_query = extract_sql_query(content)\n",
463
+ " return clean_sql_query(sql_query)\n",
464
+ "\n",
465
+ "\n",
466
+ "sql_parser_component = FnComponent(fn=parse_response_to_sql)\n",
467
+ "\n",
468
+ "text2sql_prompt = DEFAULT_TEXT_TO_SQL_PROMPT.partial_format(\n",
469
+ " dialect=engine.dialect.name\n",
470
+ ")\n",
471
+ "print(text2sql_prompt.template)"
472
+ ]
473
+ },
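+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "A quick sanity check of the parsing helpers on a hypothetical completion string (the table and column names here are made up):\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# The default prompt asks the LLM to answer in the form\n",
+ "# \"SQLQuery: ... SQLResult: ...\", which is what these helpers strip apart.\n",
+ "sample_completion = \"SQLQuery: SELECT year FROM albums WHERE artist = 'X' SQLResult: 1993\"\n",
+ "print(clean_sql_query(extract_sql_query(sample_completion)))"
+ ]
+ },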
474
+ {
475
+ "cell_type": "markdown",
476
+ "metadata": {},
477
+ "source": [
478
+ "### 4. Response Synthesis Prompt\n"
479
+ ]
480
+ },
481
+ {
482
+ "cell_type": "code",
483
+ "execution_count": null,
484
+ "metadata": {},
485
+ "outputs": [],
486
+ "source": [
487
+ "response_synthesis_prompt_str = (\n",
488
+ " \"Given an input question, synthesize a response from the query results.\\n\"\n",
489
+ " \"Query: {query_str}\\n\"\n",
490
+ " \"SQL: {sql_query}\\n\"\n",
491
+ " \"SQL Response: {context_str}\\n\"\n",
492
+ " \"Response: \"\n",
493
+ ")\n",
494
+ "response_synthesis_prompt = PromptTemplate(\n",
495
+ " response_synthesis_prompt_str,\n",
496
+ ")"
497
+ ]
498
+ },
499
+ {
500
+ "cell_type": "markdown",
501
+ "metadata": {},
502
+ "source": [
503
+ "### 5. LLM\n"
504
+ ]
505
+ },
506
+ {
507
+ "cell_type": "code",
508
+ "execution_count": null,
509
+ "metadata": {},
510
+ "outputs": [],
511
+ "source": [
512
+ "llm = OpenAI(model=\"gpt-3.5-turbo\")"
513
+ ]
514
+ },
515
+ {
516
+ "cell_type": "markdown",
517
+ "metadata": {},
518
+ "source": [
519
+ "#### Define Query Pipeline\n",
520
+ "\n",
521
+ "Now that the components are in place, let's define the query pipeline!\n"
522
+ ]
523
+ },
524
+ {
525
+ "cell_type": "code",
526
+ "execution_count": null,
527
+ "metadata": {},
528
+ "outputs": [],
529
+ "source": [
530
+ "from llama_index.core.query_pipeline import (\n",
531
+ " QueryPipeline as QP,\n",
532
+ " Link,\n",
533
+ " InputComponent,\n",
534
+ " CustomQueryComponent,\n",
535
+ ")\n",
536
+ "\n",
537
+ "qp = QP(\n",
538
+ " modules={\n",
539
+ " \"input\": InputComponent(),\n",
540
+ " \"table_retriever\": obj_retriever,\n",
541
+ " \"table_output_parser\": table_parser_component,\n",
542
+ " \"text2sql_prompt\": text2sql_prompt,\n",
543
+ " \"text2sql_llm\": llm,\n",
544
+ " \"sql_output_parser\": sql_parser_component,\n",
545
+ " \"sql_retriever\": sql_retriever,\n",
546
+ " \"response_synthesis_prompt\": response_synthesis_prompt,\n",
547
+ " \"response_synthesis_llm\": llm,\n",
548
+ " },\n",
549
+ " verbose=True,\n",
550
+ ")\n",
551
+ "qp"
552
+ ]
553
+ },
554
+ {
555
+ "cell_type": "code",
556
+ "execution_count": null,
557
+ "metadata": {},
558
+ "outputs": [],
559
+ "source": [
560
+ "qp.add_chain([\"input\", \"table_retriever\", \"table_output_parser\"])\n",
561
+ "qp.add_link(\"input\", \"text2sql_prompt\", dest_key=\"query_str\")\n",
562
+ "qp.add_link(\"table_output_parser\", \"text2sql_prompt\", dest_key=\"schema\")\n",
563
+ "qp.add_chain(\n",
564
+ " [\"text2sql_prompt\", \"text2sql_llm\", \"sql_output_parser\", \"sql_retriever\"]\n",
565
+ ")\n",
566
+ "qp.add_link(\n",
567
+ " \"sql_output_parser\", \"response_synthesis_prompt\", dest_key=\"sql_query\"\n",
568
+ ")\n",
569
+ "qp.add_link(\n",
570
+ " \"sql_retriever\", \"response_synthesis_prompt\", dest_key=\"context_str\"\n",
571
+ ")\n",
572
+ "qp.add_link(\"input\", \"response_synthesis_prompt\", dest_key=\"query_str\")\n",
573
+ "qp.add_link(\"response_synthesis_prompt\", \"response_synthesis_llm\")"
574
+ ]
575
+ },
576
+ {
577
+ "cell_type": "markdown",
578
+ "metadata": {},
579
+ "source": [
580
+ "#### Visualize Query Pipeline\n",
581
+ "\n",
582
+ "A really nice property of the query pipeline syntax is you can easily visualize it in a graph via networkx.\n"
583
+ ]
584
+ },
585
+ {
586
+ "cell_type": "code",
587
+ "execution_count": null,
588
+ "metadata": {},
589
+ "outputs": [],
590
+ "source": [
591
+ "from pyvis.network import Network\n",
592
+ "\n",
593
+ "net = Network(notebook=True, cdn_resources=\"in_line\", directed=True)\n",
594
+ "net.from_nx(qp.dag)"
595
+ ]
596
+ },
597
+ {
598
+ "cell_type": "code",
599
+ "execution_count": null,
600
+ "metadata": {},
601
+ "outputs": [],
602
+ "source": [
603
+ "# Save the network as \"text2sql_dag.html\"\n",
604
+ "net.write_html(\"text2sql_dag.html\")"
605
+ ]
606
+ },
607
+ {
608
+ "cell_type": "code",
609
+ "execution_count": null,
610
+ "metadata": {},
611
+ "outputs": [],
612
+ "source": [
613
+ "from IPython.display import display, HTML\n",
614
+ "\n",
615
+ "# Read the contents of the HTML file\n",
616
+ "with open(\"text2sql_dag.html\", \"r\") as file:\n",
617
+ " html_content = file.read()\n",
618
+ "\n",
619
+ "# Display the HTML content\n",
620
+ "display(HTML(html_content))"
621
+ ]
622
+ },
623
+ {
624
+ "cell_type": "markdown",
625
+ "metadata": {},
626
+ "source": [
627
+ "### Run Some Queries!\n",
628
+ "\n",
629
+ "Now we're ready to run some queries across this entire pipeline.\n"
630
+ ]
631
+ },
632
+ {
633
+ "cell_type": "code",
634
+ "execution_count": null,
635
+ "metadata": {},
636
+ "outputs": [],
637
+ "source": [
638
+ "response = qp.run(\n",
639
+ " query=\"What was the year that The Notorious B.I.G was signed to Bad Boy?\"\n",
640
+ ")\n",
641
+ "print(str(response))"
642
+ ]
643
+ },
644
+ {
645
+ "cell_type": "code",
646
+ "execution_count": null,
647
+ "metadata": {},
648
+ "outputs": [],
649
+ "source": [
650
+ "response = qp.run(query=\"Who won best director in the 1972 academy awards\")\n",
651
+ "print(str(response))"
652
+ ]
653
+ },
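+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "If you want to inspect what each module produced (retrieved tables, generated SQL, raw rows), the query pipeline can also return intermediate outputs. A sketch, assuming a recent llama-index version that exposes `run_with_intermediates`:\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "output, intermediates = qp.run_with_intermediates(\n",
+ "    query=\"Who won best director in the 1972 academy awards\"\n",
+ ")\n",
+ "# e.g. the SQL string emitted by the sql_output_parser module\n",
+ "print(intermediates[\"sql_output_parser\"].outputs)"
+ ]
+ },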
654
+ {
655
+ "cell_type": "markdown",
656
+ "metadata": {},
657
+ "source": [
658
+ "## Advanced Capability 2: Text-to-SQL with Query-Time Row Retrieval (along with Table Retrieval)\n",
659
+ "\n",
660
+ "One problem in the previous example is that if the user asks a query that asks for \"The Notorious BIG\" but the artist is stored as \"The Notorious B.I.G\", then the generated SELECT statement will likely not return any matches.\n",
661
+ "\n",
662
+ "We can alleviate this problem by fetching a small number of example rows per table. A naive option would be to just take the first k rows. Instead, we embed, index, and retrieve k relevant rows given the user query to give the text-to-SQL LLM the most contextually relevant information for SQL generation.\n",
663
+ "\n",
664
+ "We now extend our query pipeline.\n",
665
+ "\n",
666
+ "## Index Each Table\n",
667
+ "\n",
668
+ "We embed/index the rows of each table, resulting in one index per table.\n"
669
+ ]
670
+ },
671
+ {
672
+ "cell_type": "code",
673
+ "execution_count": null,
674
+ "metadata": {},
675
+ "outputs": [],
676
+ "source": [
677
+ "import logging\n",
678
+ "from pathlib import Path\n",
679
+ "from typing import Dict, Optional\n",
680
+ "from llama_index.core import VectorStoreIndex, load_index_from_storage\n",
681
+ "from llama_index.core.schema import TextNode\n",
682
+ "from llama_index.core import StorageContext\n",
683
+ "from sqlalchemy.exc import SQLAlchemyError\n",
684
+ "from sqlalchemy import text\n",
685
+ "\n",
686
+ "logger = logging.getLogger(__name__)\n",
687
+ "\n",
688
+ "\n",
689
+ "def get_table_rows(engine, table_name: str):\n",
690
+ " try:\n",
691
+ " with engine.connect() as conn:\n",
692
+ " cursor = conn.execute(text(f'SELECT * FROM \"{table_name}\"'))\n",
693
+ " return [tuple(row) for row in cursor.fetchall()]\n",
694
+ " except SQLAlchemyError as e:\n",
695
+ " logger.error(f\"Error fetching rows from table {table_name}: {str(e)}\")\n",
696
+ " raise\n",
697
+ "\n",
698
+ "\n",
699
+ "def create_index(rows, index_path: Path):\n",
700
+ " nodes = [TextNode(text=str(t)) for t in rows]\n",
701
+ " index = VectorStoreIndex(nodes)\n",
702
+ " index.set_index_id(\"vector_index\")\n",
703
+ " index.storage_context.persist(str(index_path))\n",
704
+ " return index\n",
705
+ "\n",
706
+ "\n",
707
+ "def load_existing_index(index_path: Path):\n",
708
+ " storage_context = StorageContext.from_defaults(persist_dir=str(index_path))\n",
709
+ " return load_index_from_storage(storage_context, index_id=\"vector_index\")\n",
710
+ "\n",
711
+ "\n",
712
+ "def index_all_tables(\n",
713
+ " sql_database,\n",
714
+ " table_index_dir: str = \"table_index_dir\",\n",
715
+ " force_refresh: bool = False,\n",
716
+ " tables_to_index: Optional[list] = None\n",
717
+ ") -> Dict[str, VectorStoreIndex]:\n",
718
+ " \"\"\"\n",
719
+ " Create or load vector store indexes for specified tables in the given SQL database.\n",
720
+ "\n",
721
+ " Args:\n",
722
+ " sql_database: An instance of SQLDatabase containing the tables to be indexed.\n",
723
+ " table_index_dir (str): The directory where the indexes will be stored.\n",
724
+ " force_refresh (bool): If True, recreate all indexes even if they already exist.\n",
725
+ " tables_to_index (Optional[list]): List of table names to index. If None, index all usable tables.\n",
726
+ "\n",
727
+ " Returns:\n",
728
+ " Dict[str, VectorStoreIndex]: A dictionary of table names to their VectorStoreIndex objects.\n",
729
+ "\n",
730
+ " Raises:\n",
731
+ " OSError: If there's an error creating or accessing the table_index_dir.\n",
732
+ " SQLAlchemyError: If there's an error connecting to the database or executing SQL queries.\n",
733
+ " \"\"\"\n",
734
+ " index_dir = Path(table_index_dir)\n",
735
+ " index_dir.mkdir(parents=True, exist_ok=True)\n",
736
+ "\n",
737
+ " vector_index_dict = {}\n",
738
+ " tables = tables_to_index or sql_database.get_usable_table_names()\n",
739
+ "\n",
740
+ " for table_name in tables:\n",
741
+ " index_path = index_dir / table_name\n",
742
+ " logger.info(f\"Processing table: {table_name}\")\n",
743
+ "\n",
744
+ " try:\n",
745
+ " if not index_path.exists() or force_refresh:\n",
746
+ " logger.info(f\"Creating new index for table: {table_name}\")\n",
747
+ " rows = get_table_rows(sql_database.engine, table_name)\n",
748
+ " index = create_index(rows, index_path)\n",
749
+ " else:\n",
750
+ " logger.info(f\"Loading existing index for table: {table_name}\")\n",
751
+ " index = load_existing_index(index_path)\n",
752
+ "\n",
753
+ " vector_index_dict[table_name] = index\n",
754
+ "\n",
755
+ " except (OSError, SQLAlchemyError) as e:\n",
756
+ " logger.error(f\"Error processing table {table_name}: {str(e)}\")\n",
757
+ " # Decide whether to continue with other tables or raise the exception\n",
758
+ "\n",
759
+ " return vector_index_dict\n",
760
+ "\n",
761
+ "\n",
762
+ "vector_index_dict = index_all_tables(sql_database)"
763
+ ]
764
+ },
765
+ {
766
+ "cell_type": "code",
767
+ "execution_count": null,
768
+ "metadata": {},
769
+ "outputs": [],
770
+ "source": [
771
+ "test_retriever = vector_index_dict[\"Bad_Boy_Artists\"].as_retriever(\n",
772
+ " similarity_top_k=1\n",
773
+ ")\n",
774
+ "nodes = test_retriever.retrieve(\"P. Diddy\")\n",
775
+ "print(nodes[0].get_content())"
776
+ ]
777
+ },
778
+ {
779
+ "cell_type": "markdown",
780
+ "metadata": {},
781
+ "source": [
782
+ "### Define Expanded Table Parser Component\n",
783
+ "\n",
784
+ "We expand the capability of our table_parser_component to not only return the relevant table schemas, but also return relevant rows per table schema.\n",
785
+ "\n",
786
+ "It now takes in both table_schema_objs (output of table retriever), but also the original query_str which will then be used for vector retrieval of relevant rows.\n"
787
+ ]
788
+ },
789
+ {
790
+ "cell_type": "code",
791
+ "execution_count": null,
792
+ "metadata": {},
793
+ "outputs": [],
794
+ "source": [
795
+ "from llama_index.core.retrievers import SQLRetriever\n",
796
+ "from typing import List\n",
797
+ "from llama_index.core.query_pipeline import FnComponent\n",
798
+ "\n",
799
+ "sql_retriever = SQLRetriever(sql_database)\n",
800
+ "\n",
801
+ "\n",
802
+ "def get_table_context_and_rows_str(\n",
803
+ " query_str: str, table_schema_objs: List[SQLTableSchema]\n",
804
+ "):\n",
805
+ " \"\"\"Get table context string.\"\"\"\n",
806
+ " context_strs = []\n",
807
+ " for table_schema_obj in table_schema_objs:\n",
808
+ " # first append table info + additional context\n",
809
+ " table_info = sql_database.get_single_table_info(\n",
810
+ " table_schema_obj.table_name\n",
811
+ " )\n",
812
+ " if table_schema_obj.context_str:\n",
813
+ " table_opt_context = \" The table description is: \"\n",
814
+ " table_opt_context += table_schema_obj.context_str\n",
815
+ " table_info += table_opt_context\n",
816
+ "\n",
817
+ " # also lookup vector index to return relevant table rows\n",
818
+ " vector_retriever = vector_index_dict[\n",
819
+ " table_schema_obj.table_name\n",
820
+ " ].as_retriever(similarity_top_k=2)\n",
821
+ " relevant_nodes = vector_retriever.retrieve(query_str)\n",
822
+ " if len(relevant_nodes) > 0:\n",
823
+ " table_row_context = \"\\nHere are some relevant example rows (values in the same order as columns above)\\n\"\n",
824
+ " for node in relevant_nodes:\n",
825
+ " table_row_context += str(node.get_content()) + \"\\n\"\n",
826
+ " table_info += table_row_context\n",
827
+ "\n",
828
+ " context_strs.append(table_info)\n",
829
+ " return \"\\n\\n\".join(context_strs)\n",
830
+ "\n",
831
+ "\n",
832
+ "table_parser_component = FnComponent(fn=get_table_context_and_rows_str)"
833
+ ]
834
+ },
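+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "To see what the text-to-SQL prompt will now receive as its schema context (table info plus retrieved example rows), you can call the parser directly. A small sketch using the objects defined above:\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "sample_query = \"What year was The Notorious BIG signed to Bad Boy?\"\n",
+ "# the object retriever returns SQLTableSchema objects for the top tables\n",
+ "sample_schemas = obj_retriever.retrieve(sample_query)\n",
+ "print(get_table_context_and_rows_str(sample_query, sample_schemas))"
+ ]
+ },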
835
+ {
836
+ "cell_type": "markdown",
837
+ "metadata": {},
838
+ "source": [
839
+ "### Define Expanded Query Pipeline\n",
840
+ "\n",
841
+ "This looks similar to the query pipeline in section 1, but with an upgraded table_parser_component.\n"
842
+ ]
843
+ },
844
+ {
845
+ "cell_type": "code",
846
+ "execution_count": null,
847
+ "metadata": {},
848
+ "outputs": [],
849
+ "source": [
850
+ "from llama_index.core.query_pipeline import (\n",
851
+ " QueryPipeline as QP,\n",
852
+ " Link,\n",
853
+ " InputComponent,\n",
854
+ " CustomQueryComponent,\n",
855
+ ")\n",
856
+ "\n",
857
+ "qp = QP(\n",
858
+ " modules={\n",
859
+ " \"input\": InputComponent(),\n",
860
+ " \"table_retriever\": obj_retriever,\n",
861
+ " \"table_output_parser\": table_parser_component,\n",
862
+ " \"text2sql_prompt\": text2sql_prompt,\n",
863
+ " \"text2sql_llm\": llm,\n",
864
+ " \"sql_output_parser\": sql_parser_component,\n",
865
+ " \"sql_retriever\": sql_retriever,\n",
866
+ " \"response_synthesis_prompt\": response_synthesis_prompt,\n",
867
+ " \"response_synthesis_llm\": llm,\n",
868
+ " },\n",
869
+ " verbose=True,\n",
870
+ ")\n",
871
+ "qp"
872
+ ]
873
+ },
874
+ {
875
+ "cell_type": "code",
876
+ "execution_count": null,
877
+ "metadata": {},
878
+ "outputs": [],
879
+ "source": [
880
+ "qp.add_link(\"input\", \"table_retriever\")\n",
881
+ "qp.add_link(\"input\", \"table_output_parser\", dest_key=\"query_str\")\n",
882
+ "qp.add_link(\n",
883
+ " \"table_retriever\", \"table_output_parser\", dest_key=\"table_schema_objs\"\n",
884
+ ")\n",
885
+ "qp.add_link(\"input\", \"text2sql_prompt\", dest_key=\"query_str\")\n",
886
+ "qp.add_link(\"table_output_parser\", \"text2sql_prompt\", dest_key=\"schema\")\n",
887
+ "qp.add_chain(\n",
888
+ " [\"text2sql_prompt\", \"text2sql_llm\", \"sql_output_parser\", \"sql_retriever\"]\n",
889
+ ")\n",
890
+ "qp.add_link(\n",
891
+ " \"sql_output_parser\", \"response_synthesis_prompt\", dest_key=\"sql_query\"\n",
892
+ ")\n",
893
+ "qp.add_link(\n",
894
+ " \"sql_retriever\", \"response_synthesis_prompt\", dest_key=\"context_str\"\n",
895
+ ")\n",
896
+ "qp.add_link(\"input\", \"response_synthesis_prompt\", dest_key=\"query_str\")\n",
897
+ "qp.add_link(\"response_synthesis_prompt\", \"response_synthesis_llm\")"
898
+ ]
899
+ },
900
+ {
901
+ "cell_type": "markdown",
902
+ "metadata": {},
903
+ "source": [
904
+ "### Run Some Queries\n",
905
+ "\n",
906
+ "We can now ask about relevant entries even if it doesn't exactly match the entry in the database.\n"
907
+ ]
908
+ },
909
+ {
910
+ "cell_type": "code",
911
+ "execution_count": null,
912
+ "metadata": {},
913
+ "outputs": [],
914
+ "source": [
915
+ "response = qp.run(\n",
916
+ " query=\"What was the year that The Notorious BIG was signed to Bad Boy?\"\n",
917
+ ")\n",
918
+ "print(str(response))"
919
+ ]
920
+ }
921
+ ],
922
+ "metadata": {
923
+ "kernelspec": {
924
+ "display_name": "llama",
925
+ "language": "python",
926
+ "name": "python3"
927
+ },
928
+ "language_info": {
929
+ "codemirror_mode": {
930
+ "name": "ipython",
931
+ "version": 3
932
+ },
933
+ "file_extension": ".py",
934
+ "mimetype": "text/x-python",
935
+ "name": "python",
936
+ "nbconvert_exporter": "python",
937
+ "pygments_lexer": "ipython3",
938
+ "version": "3.11.9"
939
+ }
940
+ },
941
+ "nbformat": 4,
942
+ "nbformat_minor": 2
943
+ }