{ "cells": [ { "cell_type": "markdown", "id": "c96b164c", "metadata": {}, "source": [ "# Overview " ] }, { "cell_type": "markdown", "id": "22c9cad3", "metadata": {}, "source": [ "In this notebook, we create a workflow between LangChain and the Mixtral LLM.
\n", "We want to accomplish the following:
\n", "1. Establish a pipeline to read in a CSV and pass it to our LLM. \n", "2. Establish a Pandas DataFrame agent around our LLM.\n", "3. Connect the CSV pipeline to this agent to enable an end-to-end EDA pipeline." ] },
{ "cell_type": "markdown", "id": "3bd62bd1", "metadata": {}, "source": [ "# Setting up the Environment " ] },
{ "cell_type": "code", "execution_count": 1, "id": "76b3d212", "metadata": {}, "outputs": [], "source": [ "####################################################################################################\n", "import os\n", "import re\n", "import subprocess\n", "import sys\n", "from PIL import Image\n", "\n", "\n", "from langchain import hub\n", "from langchain.agents.agent_types import AgentType\n", "from langchain_experimental.agents.agent_toolkits import create_pandas_dataframe_agent\n", "from langchain_experimental.utilities import PythonREPL\n", "from langchain.agents import Tool\n", "from langchain.agents.format_scratchpad.openai_tools import (\n", "    format_to_openai_tool_messages,\n", ")\n", "from langchain.agents.output_parsers.openai_tools import OpenAIToolsAgentOutputParser\n", "\n", "from langchain.agents import AgentExecutor\n", "\n", "from langchain.agents import create_structured_chat_agent\n", "from langchain.memory import ConversationBufferWindowMemory\n", "\n", "from langchain.output_parsers import PandasDataFrameOutputParser\n", "from langchain.prompts import PromptTemplate\n", "\n", "from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder\n", "\n", "from langchain_community.chat_models import ChatAnyscale\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "\n", "plt.style.use('ggplot')\n", "####################################################################################################" ] },
{ "cell_type": "code", "execution_count": 2, "id": "9da52e1f", "metadata": {}, "outputs": [], "source": [ "# insert your API key here\n", "os.environ[\"ANYSCALE_API_KEY\"] = \"esecret_8btufnh3615vnbpd924s1t3q7p\"\n", "memory_key = \"history\"" ] },
{ "cell_type": "markdown", "id": "edcd6ca7", "metadata": {}, "source": [ "# Using out of the box agents\n", " Here we simply run the Pandas DataFrame agent provided by LangChain out of the box. It uses a CSV parser and has just one tool, the __PythonAstREPLTool__, which lets it execute Python commands in a shell and display the output. A minimal sketch of this setup is shown below. " ] },
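{ "cell_type": "markdown", "id": "0f3a9c21", "metadata": {}, "source": [ "The next cell is only a minimal sketch of the out-of-the-box setup described above: it wires `create_pandas_dataframe_agent` (imported earlier) to the Anyscale-hosted Mixtral model. The variable names, the sample query, and the assumption that a `df.csv` file already sits in the working directory are illustrative only." ] },
{ "cell_type": "code", "execution_count": null, "id": "1d4b7e92", "metadata": {}, "outputs": [], "source": [ "# Minimal sketch: LangChain's stock pandas dataframe agent, out of the box.\n", "# Assumes a df.csv file is present in the working directory.\n", "df = pd.read_csv('df.csv')\n", "\n", "mixtral_llm = ChatAnyscale(model_name='mistralai/Mixtral-8x7B-Instruct-v0.1', temperature=0.1)\n", "\n", "pandas_agent = create_pandas_dataframe_agent(\n", "    mixtral_llm,\n", "    df,\n", "    agent_type=AgentType.ZERO_SHOT_REACT_DESCRIPTION,\n", "    verbose=True,\n", ")\n", "\n", "# Example query; with Mixtral this agent struggled to use its Python tool correctly\n", "# pandas_agent.invoke('How many rows and columns does the dataframe have?')" ] },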
{ "cell_type": "markdown", "id": "64a086f2", "metadata": {}, "source": [ "# DECLARE YOUR LLM HERE " ] },
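{ "cell_type": "markdown", "id": "2b6c8d13", "metadata": {}, "source": [ "A minimal sketch of the LLM declaration: it defines the module-level `llm` that the custom agent below expects. The model name mirrors the one used in `run_agent` further down; the temperature of 0.1 is an assumption." ] },
{ "cell_type": "code", "execution_count": null, "id": "3c7d9e24", "metadata": {}, "outputs": [], "source": [ "# Sketch: module-level LLM consumed by the custom agent below.\n", "# The model name mirrors run_agent(); temperature 0.1 is an assumed default.\n", "llm = ChatAnyscale(model_name='mistralai/Mixtral-8x7B-Instruct-v0.1', temperature=0.1)" ] },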
{ "cell_type": "markdown", "id": "083bf9ab", "metadata": {}, "source": [ "# Building my own DF chain\n", " The above agent does not work well for Mistral out of the box. The Mistral LLM is unable to use the tools properly; it seems to need much more hand-holding in its prompts.\n", "\n", "### Post-Implementation Observations:\n", " We were able to observe the following:\n", "\n", " 1. The agent now works very well, almost always returning the desired output.\n", " 2. The agent is unable to return plots inline; plots are instead saved to the 'graphs' folder and read back from disk." ] },
{ "cell_type": "markdown", "id": "c5caa356", "metadata": {}, "source": [ "# Building an Agent " ] },
{ "cell_type": "code", "execution_count": 4, "id": "44c07305", "metadata": {}, "outputs": [], "source": [ "prompt = ChatPromptTemplate.from_messages(\n", "    [\n", "        (\n", "            \"system\",\n", "            f\"\"\"You are a Data Analysis assistant. Your job is to use your tools to answer a user query in the best\\\n", "            manner possible. Your job is to respond to user queries by generating a python code file.\\\n", "            Make sure to include all necessary imports.\\\n", "            In case you make plots, make sure to label the axes and add a good title too. \\\n", "            You must save any plots in the 'graphs' folder as png only.\\\n", "            Provide no explanation for your code.\\\n", "            Read the data from 'df.csv'.\\\n", "            ** Enclose all your code between triple backticks ``` **\\\n", "            RECTIFY ANY ERRORS FROM THE PREVIOUS RUNS.\n", "            \"\"\",\n", "        ),\n", "        (\"user\", \"Dataframe named df: {df}\\nQuery: {input}\\nTools:{tools}\"),\n", "        MessagesPlaceholder(variable_name=\"agent_scratchpad\"),\n", "    ]\n", ")" ] },
{ "cell_type": "code", "execution_count": 5, "id": "00f1ebc8", "metadata": {}, "outputs": [], "source": [ "python_repl = PythonREPL()\n", "\n", "repl_tool = Tool(\n", "    name=\"python_repl\",\n", "    description=\"\"\"A Python shell. Shell can display charts too. Use this to execute python commands.\\\n", "    You have access to all libraries in python including but not limited to sklearn, pandas, numpy,\\\n", "    matplotlib.pyplot, seaborn etc. Input should be a valid python command. If the user has not explicitly\\\n", "    asked you to plot the results, always print the final output using print(...)\"\"\",\n", "    func=python_repl.run,\n", ")\n", "\n", "# make sure the folders for generated code and saved plots exist\n", "if 'code' not in os.listdir():\n", "    os.mkdir('code')\n", "\n", "if 'graphs' not in os.listdir():\n", "    os.mkdir('graphs')\n", "\n", "tools = [repl_tool]" ] },
{ "cell_type": "code", "execution_count": 6, "id": "dd3b5930", "metadata": {}, "outputs": [], "source": [ "def delete_png_files(dir_path):\n", "    # clear out plots left over from a previous run\n", "    for filename in os.listdir(dir_path):\n", "        if filename.endswith(\".png\"):\n", "            os.remove(os.path.join(dir_path, filename))" ] },
{ "cell_type": "code", "execution_count": 7, "id": "da4b2061", "metadata": {}, "outputs": [], "source": [ "def run_code(code):\n", "    # write the generated code to disk and run it in a fresh interpreter\n", "    with open(f'{os.getcwd()}/code/code.py', 'w') as file:\n", "        file.write(code)\n", "\n", "    try:\n", "        print(\"Running code ...\\n\")\n", "        result = subprocess.run([sys.executable, 'code/code.py'], capture_output=True, text=True, check=True, timeout=20)\n", "        return result.stdout, False\n", "\n", "    except subprocess.CalledProcessError as e:\n", "        print(\"Error in code!\\n\")\n", "        return e.stdout + e.stderr, True\n", "\n", "    except subprocess.TimeoutExpired:\n", "        return \"Execution timed out.\", True" ] },
{ "cell_type": "code", "execution_count": 8, "id": "b82adfda", "metadata": {}, "outputs": [], "source": [ "def infer_EDA(user_input:str = ''):\n", "    agent = (\n", "        {\n", "            \"input\": lambda x: x[\"input\"],\n", "            \"tools\": lambda x: x['tools'],\n", "            \"df\": lambda x: x['df'],\n", "            \"agent_scratchpad\": lambda x: format_to_openai_tool_messages(\n", "                x[\"intermediate_steps\"]\n", "            )\n", "        }\n", "        | prompt\n", "        | llm\n", "        | OpenAIToolsAgentOutputParser()\n", "    )\n", "\n", "    EDA_executor = AgentExecutor(agent=agent, tools=tools, df = 'df.csv', verbose=True)\n", "\n", "    # Running Inference\n", "    error_flag = True\n", "    image_path = f'{os.getcwd()}/graphs'\n", "\n", "    delete_png_files(image_path)\n", "    res = None\n", "\n", "    while error_flag:\n",
"        result = list(EDA_executor.stream({\"input\": user_input,\n", "                                           \"df\": pd.read_csv('df.csv'),\n", "                                           \"tools\": tools}))\n", "\n", "        # extract the generated code from between the triple backticks\n", "        pattern = r\"```python\\n(.*?)\\n```\"\n", "        matches = re.findall(pattern, result[0]['output'], re.DOTALL)\n", "        code = \"import matplotlib\\nmatplotlib.use('Agg')\\n\"\n", "        code += \"\\n\".join(matches)\n", "\n", "        # execute the code; retry while it errors\n", "        res, error_flag = run_code(code)\n", "\n", "    image = None\n", "\n", "    # pick up the first plot the generated code saved, if any\n", "    for file in os.listdir(image_path):\n", "        if file.endswith(\".png\"):\n", "            image = np.array(Image.open(f'{image_path}/{file}'))\n", "            break\n", "\n", "    final_val = result[-1]['output'] + f'\\n{res}'\n", "\n", "    return final_val, image" ] },
{ "cell_type": "code", "execution_count": 9, "id": "1735f7c8", "metadata": {}, "outputs": [], "source": [ "def run_agent(data_path:str = '', provider:str = 'Mistral', agent_type:str = 'Data Explorer', query:str = '', layers:str = '', temp:float = 0.1):\n", "    # rebind the module-level llm so infer_EDA picks up the selected model and temperature\n", "    global llm\n", "\n", "    df = pd.read_csv(data_path)\n", "    df.to_csv('./df.csv', index=False)\n", "\n", "    if provider.lower() == 'mistral':\n", "        llm = ChatAnyscale(model_name='mistralai/Mixtral-8x7B-Instruct-v0.1', temperature=temp)\n", "\n", "    if agent_type == 'Data Explorer':\n", "        EDA_response, EDA_image = infer_EDA(user_input=query)\n", "        return EDA_response, EDA_image\n", "\n", "    return None" ] },
{ "cell_type": "markdown", "id": "c2224b34", "metadata": {}, "source": [ "# Gradio Demo" ] },
{ "cell_type": "code", "execution_count": 10, "id": "471a2947", "metadata": { "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Running on local URL: http://127.0.0.1:7860\n", "\n", "Could not create share link. Please check your internet connection or our status page: https://status.gradio.app.\n" ] }, { "data": { "text/plain": [] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" }, { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", "\u001b[1m> Entering new AgentExecutor chain...\u001b[0m\n", "\u001b[32;1m\u001b[1;3m ```python\n", "import pandas as pd\n", "import numpy as np\n", "\n", "# Read the data from 'df.csv'\n", "df = pd.read_csv('df.csv')\n", "\n", "# Calculate the number of null values in each column\n", "null_counts = df.isnull().sum()\n", "\n", "# Calculate the total number of null values in the dataframe\n", "total_null_count = df.isnull().sum().sum()\n", "\n", "# Print the distribution of null values in the dataframe\n", "print(\"Null Value Distribution:\")\n", "print(null_counts)\n", "print(f\"Total Null Count: {total_null_count}\")\n", "```\u001b[0m\n", "\n", "\u001b[1m> Finished chain.\u001b[0m\n", "Running code ...\n", "\n", "\n", "\n", "\u001b[1m> Entering new AgentExecutor chain...\u001b[0m\n", "\u001b[32;1m\u001b[1;3m ```python\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "\n", "# Read the data from 'df.csv'\n", "df = pd.read_csv('df.csv')\n", "\n", "# Calculate the number of null values in each column\n", "null_counts = df.isnull().sum()\n", "\n", "# Prepare data for plotting\n", "data = null_counts.sort_values(ascending=False)\n", "\n", "# Create a bar plot of null values distribution\n", "plt.figure(figsize=(10, 6))\n", "plt.bar(data.index, data.values)\n", "plt.title('Distribution of Null Values in the Dataframe')\n", "plt.xlabel('Column Names')\n", "plt.ylabel('Number of Null Values')\n", "plt.xticks(rotation=45)\n", "plt.savefig('graphs/null_values_distribution.png')\n", "\n", "# Print the null values distribution dataframe\n", "print(data)\n", "```\u001b[0m\n", "\n", "\u001b[1m> Finished chain.\u001b[0m\n", "Running code ...\n", "\n" ] } ], "source": [ "import gradio as gr\n", "\n", "demo = gr.Interface(\n", "    run_agent,\n", "    [\n", "        gr.UploadButton(label=\"Upload your CSV!\"),\n", "        gr.Radio([\"Mistral\", \"GPT\"], label=\"Select Your LLM\"),\n", "        gr.Radio([\"Data Explorer\", \"Hypothesis Tester\", \"Basic Inference\", \"Super Inference\"], label=\"Choose Your Agent\"),\n", "        gr.Text(label=\"Query\", info=\"Your input to the Agent\"),\n", "        gr.Text(label=\"Architecture\", info=\"Specify the layer-by-layer architecture (Super Inference Agent only)\"),\n", "        gr.Slider(minimum=0.0, maximum=1.0, value=0.1, label=\"Model Temperature\", info=\"Slide right to make your model more creative!\")\n", "    ],\n", "    [\n", "        gr.Text(label=\"Agent Response\", info=\"The agent's response and the output of the executed code.\"),\n", "        gr.Image(label=\"Graph\")\n", "    ]\n", ")\n", "demo.launch(share=True)" ] },
{ "cell_type": "code", "execution_count": null, "id": "d351a2c2", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.5" } }, "nbformat": 4, "nbformat_minor": 5 }