{ "cells": [ { "cell_type": "markdown", "id": "c96b164c", "metadata": {}, "source": [ "# Overview " ] }, { "cell_type": "markdown", "id": "22c9cad3", "metadata": {}, "source": [ "In this notebook, we will try to create a workflow between Langchain and Mixtral LLM.
\n", "We want to accomplish the following:
\n", "1. Establish a pipeline to read in a csv and pass it to our LLM. \n", "2. Establish a Basic Inference Agent.\n", "3. Test the Basic Inference on a few tasks." ] }, { "cell_type": "markdown", "id": "3bd62bd1", "metadata": {}, "source": [ "# Setting up the Environment " ] }, { "cell_type": "code", "execution_count": 2, "id": "76b3d212", "metadata": {}, "outputs": [], "source": [ "####################################################################################################\n", "import os\n", "import re\n", "import pandas as pd \n", "import matplotlib.pyplot as plt \n", "import numpy as np\n", "import openai\n", "\n", "plt.style.use('ggplot')\n", "####################################################################################################" ] }, { "cell_type": "code", "execution_count": 3, "id": "9da52e1f", "metadata": {}, "outputs": [], "source": [ "# insert your API key here\n", "client = openai.OpenAI(\n", " base_url = \"https://api.endpoints.anyscale.com/v1\",\n", " api_key = \"esecret_8btufnh3615vnbpd924s1t3q7p\"\n", ")" ] }, { "cell_type": "markdown", "id": "edcd6ca7", "metadata": {}, "source": [ "# Reading the Dataframe and saving it \n", "We are saving the dataframe in a csv file in the cwd because we will iteratively update this df if need be between each inference. This way the model will continuously have the most upto date version of the dataframe at its disposal" ] }, { "cell_type": "code", "execution_count": 4, "id": "45f55df2", "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv('../data/train.csv')\n", "df.to_csv('./df.csv', index=False)" ] }, { "cell_type": "markdown", "id": "64a086f2", "metadata": {}, "source": [ "# DECLARE YOUR LLM HERE " ] }, { "cell_type": "code", "execution_count": 4, "id": "3f686305", "metadata": {}, "outputs": [], "source": [ "llm = ChatAnyscale(model_name='mistralai/Mixtral-8x7B-Instruct-v0.1', temperature=0)\n", "agent = create_pandas_dataframe_agent(llm, df, verbose=True)" ] }, { "cell_type": "markdown", "id": "4b1d8f8f", "metadata": {}, "source": [ "# REPL Tool \n", "This tool enables our LLM to 'execute' python code so that we can actually display the results to the end-user." ] }, { "cell_type": "code", "execution_count": 5, "id": "23045c5d", "metadata": {}, "outputs": [], "source": [ "python_repl = PythonREPL()\n", "\n", "repl_tool = Tool(\n", " name=\"python_repl\",\n", " description=\"\"\"A Python shell. Shell can dislay charts too. Use this to execute python commands.\\\n", " You have access to all libraries in python including but not limited to sklearn, pandas, numpy,\\\n", " matplotlib.pyplot, seaborn etc. Input should be a valid python command. If the user has not explicitly\\\n", " asked you to plot the results, always print the final output using print(...)\"\"\",\n", " func=python_repl.run,\n", ")\n", "\n", "tools = [repl_tool]" ] }, { "cell_type": "markdown", "id": "c5caa356", "metadata": {}, "source": [ "# Building an Agent " ] }, { "cell_type": "code", "execution_count": 6, "id": "44c07305", "metadata": {}, "outputs": [], "source": [ "prompt = ChatPromptTemplate.from_messages(\n", " [\n", " (\n", " \"system\",\n", " \"\"\"You are Machine Learning Inference agent. Your job is to use your tools to answer a user query\\\n", " in the best manner possible.\\\n", " Provide no explanation for your code. 
{ "cell_type": "markdown", "id": "c5caa356", "metadata": {}, "source": [ "# Building an Agent " ] }, { "cell_type": "code", "execution_count": 6, "id": "44c07305", "metadata": {}, "outputs": [], "source": [ "prompt = ChatPromptTemplate.from_messages(\n", " [\n", " (\n", " \"system\",\n", " \"\"\"You are a Machine Learning Inference agent. Your job is to use your tools to answer a user query\\\n", " in the best manner possible.\\\n", " Provide no explanation for your code. Enclose all your code between triple backticks ``` \"\"\",\n", " ),\n", " (\"user\", \"Dataframe named df:\\n{df}\\nQuery: {input}\\nList of Tools: {tools}\"),\n", " MessagesPlaceholder(variable_name=\"agent_scratchpad\"),\n", " ]\n", ")" ] }, { "cell_type": "code", "execution_count": 7, "id": "67dae2d6", "metadata": {}, "outputs": [], "source": [ "agent = (\n", " {\n", " \"input\": lambda x: x[\"input\"],\n", " \"tools\": lambda x: x['tools'],\n", " \"df\": lambda x: x['df'],\n", " \"agent_scratchpad\": lambda x: format_to_openai_tool_messages(\n", " x[\"intermediate_steps\"]\n", " )\n", " }\n", " | prompt\n", " | llm\n", " | OpenAIToolsAgentOutputParser()\n", ")" ] }, { "cell_type": "code", "execution_count": 8, "id": "45129995", "metadata": {}, "outputs": [], "source": [ "agent_executor = AgentExecutor(agent=agent, tools=tools, verbose=True)" ] }, { "cell_type": "code", "execution_count": 9, "id": "9602539f", "metadata": {}, "outputs": [], "source": [ "# building an execution chain\n", "\n", "def infer(user_input):\n", " # fetch the response from the agent\n", " result = list(agent_executor.stream({\"input\": user_input, \n", " \"df\": pd.read_csv('df.csv'), \"tools\": tools}))\n", " \n", " # extract the code that the agent enclosed in triple backticks\n", " pattern = r\"```python\\n(.*?)\\n```\"\n", " matches = re.findall(pattern, result[0]['output'], re.DOTALL)\n", " code = \"\\n\".join(matches)\n", "\n", " # execute the extracted code, then persist the (possibly updated) dataframe for the next call\n", " exec(code)\n", " final_line = \"df.to_csv('./df.csv', index=False)\"\n", " try:\n", " exec(final_line)\n", " except Exception:\n", " pass\n", " return None" ] },
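{ "cell_type": "markdown", "id": "9a41be04", "metadata": {}, "source": [ "A minimal sketch of the extraction step that `infer` relies on: the system prompt asks the agent to fence its code with triple backticks, and the regex above pulls those fenced blocks out before handing them to `exec`. The sample response below is made up purely for illustration; it is not real agent output." ] }, { "cell_type": "code", "execution_count": null, "id": "9a41be05", "metadata": {}, "outputs": [], "source": [ "# Illustrative only: how infer() strips fenced code out of a model response.\n", "sample_response = \"Here you go:\\n```python\\nprint(df.shape)\\n```\"\n", "extracted = re.findall(r\"```python\\n(.*?)\\n```\", sample_response, re.DOTALL)\n", "print(extracted) # expected: ['print(df.shape)']" ] },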
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
0103Braund, Mr. Owen Harrismale22.010A/5 211717.25NaNS
\n", "
" ], "text/plain": [ " PassengerId Survived Pclass Name Sex Age SibSp \\\n", "0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 \n", "\n", " Parch Ticket Fare Cabin Embarked \n", "0 0 A/5 21171 7.25 NaN S " ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head(1)" ] }, { "cell_type": "code", "execution_count": 11, "id": "634e0004", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", "\u001b[1m> Entering new AgentExecutor chain...\u001b[0m\n", "\u001b[32;1m\u001b[1;3m ```python\n", "df.drop(columns=['Name', 'Ticket', 'Cabin', 'PassengerId'], inplace=True)\n", "```\n", "```\u001b[0m\n", "\n", "\u001b[1m> Finished chain.\u001b[0m\n" ] } ], "source": [ "infer(\"Please drop the following columns inplace from the original dataframe: Name, Ticket, Cabin, PassengerId.\")" ] }, { "cell_type": "code", "execution_count": 12, "id": "53ad6af5", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", "\u001b[1m> Entering new AgentExecutor chain...\u001b[0m\n", "\u001b[32;1m\u001b[1;3m ```python\n", "df['Sex'] = df['Sex'].map({'male': 0, 'female': 1})\n", "df['Embarked'] = df['Embarked'].map({'S': 0, 'C': 1, 'Q': 2})\n", "print(df.head())\n", "```\u001b[0m\n", "\n", "\u001b[1m> Finished chain.\u001b[0m\n", " Survived Pclass Sex Age SibSp Parch Fare Embarked\n", "0 0 3 0 22.0 1 0 7.2500 0.0\n", "1 1 1 1 38.0 1 0 71.2833 1.0\n", "2 1 3 1 26.0 0 0 7.9250 0.0\n", "3 1 1 1 35.0 1 0 53.1000 0.0\n", "4 0 3 0 35.0 0 0 8.0500 0.0\n" ] } ], "source": [ "infer(\"Please map the strings in the following columns of the original dataframe itself to integers: sex, embarked\")" ] }, { "cell_type": "code", "execution_count": 15, "id": "6000364b", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", "\u001b[1m> Entering new AgentExecutor chain...\u001b[0m\n", "\u001b[32;1m\u001b[1;3m ```python\n", "for col in df.select_dtypes(include='object').columns:\n", " df[col].fillna(df[col].mode()[0], inplace=True)\n", "\n", "for col in df.select_dtypes(include='float').columns:\n", " df[col].fillna(df[col].median(), inplace=True)\n", "```\u001b[0m\n", "\n", "\u001b[1m> Finished chain.\u001b[0m\n" ] } ], "source": [ "infer(\"Please remove NaNs with most frequent value in the respective column. Do this inplace in original dataframe.\")" ] }, { "cell_type": "markdown", "id": "2c9d0c90", "metadata": {}, "source": [ "### Linear Regression to predict fare" ] }, { "cell_type": "code", "execution_count": 16, "id": "93e026c0", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", "\u001b[1m> Entering new AgentExecutor chain...\u001b[0m\n", "\u001b[32;1m\u001b[1;3m ```python\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.linear_model import LinearRegression\n", "from sklearn.metrics import mean_squared_error\n", "\n", "# Split the dataframe into X and y. 
Where y is Fare column.\n", "X = df.drop('Fare', axis=1)\n", "y = df['Fare']\n", "\n", "# Split X and y into train, test sets.\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)\n", "\n", "# Fit a linear regression model to the train set.\n", "model = LinearRegression()\n", "model.fit(X_train, y_train)\n", "\n", "# Predict on test set.\n", "y_pred = model.predict(X_test)\n", "\n", "# print the loss on the predictions.\n", "print(mean_squared_error(y_test, y_pred))\n", "```\u001b[0m\n", "\n", "\u001b[1m> Finished chain.\u001b[0m\n", "935.49501752569\n" ] } ], "source": [ "infer(\"\"\"Now I want you to follow all these steps in order:\\n\n", "1. Split the dataframe into X and y. Where y is Fare column.\n", "2. Split X and y into train, test sets.\n", "3. Fit a linear regression model to the train set.\n", "4. Predict on test set.\n", "5. print the loss on the predictions.\n", "\n", "Run all steps from 1 to 5.\n", "\n", "\"\"\")" ] }, { "cell_type": "markdown", "id": "ded19d06", "metadata": {}, "source": [ "### SVR to predict Fare" ] }, { "cell_type": "code", "execution_count": 17, "id": "586cad9b", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", "\u001b[1m> Entering new AgentExecutor chain...\u001b[0m\n", "\u001b[32;1m\u001b[1;3m ```python\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.svm import SVR\n", "from sklearn.metrics import mean_squared_error\n", "\n", "# Split the dataframe into X and y. Where y is Fare column.\n", "X = df.drop('Fare', axis=1)\n", "y = df['Fare']\n", "\n", "# Split X and y into train, test sets.\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)\n", "\n", "# Fit a SVR model to the train set.\n", "model = SVR()\n", "model.fit(X_train, y_train)\n", "\n", "# Predict on test set.\n", "predictions = model.predict(X_test)\n", "\n", "# print the MSE loss on the predictions.\n", "print(mean_squared_error(y_test, predictions))\n", "```\u001b[0m\n", "\n", "\u001b[1m> Finished chain.\u001b[0m\n", "1709.714026652724\n" ] } ], "source": [ "infer(\"\"\"Now I want you to follow all these steps in order:\\n\n", "1. Split the dataframe into X and y. Where y is Fare column.\n", "2. Split X and y into train, test sets.\n", "3. Fit a SVR model to the train set.\n", "4. Predict on test set.\n", "5. 
print the MSE loss on the predictions.\n", "\n", "Run all steps from 1 to 5.\n", "\"\"\")" ] }, { "cell_type": "markdown", "id": "d1d8d758", "metadata": {}, "source": [ "### Logistic Regression to Classify Survival" ] }, { "cell_type": "code", "execution_count": 18, "id": "3e0c12cd", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", "\u001b[1m> Entering new AgentExecutor chain...\u001b[0m\n", "\u001b[32;1m\u001b[1;3m ```python\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.linear_model import LogisticRegression\n", "from sklearn.metrics import accuracy_score\n", "\n", "# Step 1\n", "X = df.drop('Survived', axis=1)\n", "y = df['Survived']\n", "\n", "# Step 2\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)\n", "\n", "# Step 3\n", "model = LogisticRegression()\n", "model.fit(X_train, y_train)\n", "\n", "# Step 4\n", "y_pred = model.predict(X_test)\n", "\n", "# Step 5\n", "accuracy = round(accuracy_score(y_test, y_pred), 4)\n", "print(f\"Accuracy: {accuracy}\")\n", "```\u001b[0m\n", "\n", "\u001b[1m> Finished chain.\u001b[0m\n", "Accuracy: 0.7989\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/Users/khaquansheikh/anaconda3/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:460: ConvergenceWarning: lbfgs failed to converge (status=1):\n", "STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.\n", "\n", "Increase the number of iterations (max_iter) or scale the data as shown in:\n", " https://scikit-learn.org/stable/modules/preprocessing.html\n", "Please also refer to the documentation for alternative solver options:\n", " https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression\n", " n_iter_i = _check_optimize_result(\n" ] } ], "source": [ "infer(\"\"\"Now I want you to follow all these steps in order:\\n\n", "1. Split the dataframe into X and y. Where y is Survived column.\n", "2. Split X and y into train, test sets.\n", "3. Fit a Logistic Regression model to the train set.\n", "4. Predict on test set.\n", "5. print the accuracy of the predictions. Round to 4 decimal places and print it out neatly.\n", "\n", "Run all steps from 1 to 5.\n", "\"\"\")" ] }, { "cell_type": "markdown", "id": "77dd622b", "metadata": {}, "source": [ "### SVM to Classify Survival " ] }, { "cell_type": "code", "execution_count": 19, "id": "5f7cdf6a", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", "\u001b[1m> Entering new AgentExecutor chain...\u001b[0m\n", "\u001b[32;1m\u001b[1;3m ```python\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.svm import SVC\n", "from sklearn.metrics import accuracy_score\n", "\n", "# Step 1\n", "X = df.drop('Survived', axis=1)\n", "y = df['Survived']\n", "\n", "# Step 2\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)\n", "\n", "# Step 3\n", "svm_model = SVC(kernel='poly', degree=3, coef0=1, probability=True, random_state=42)\n", "svm_model.fit(X_train, y_train)\n", "\n", "# Step 4\n", "y_pred = svm_model.predict(X_test)\n", "\n", "# Step 5\n", "accuracy = round(accuracy_score(y_test, y_pred), 4)\n", "print(f\"Accuracy: {accuracy}\")\n", "```\u001b[0m\n", "\n", "\u001b[1m> Finished chain.\u001b[0m\n", "Accuracy: 0.6816\n" ] } ], "source": [ "infer(\"\"\"Now I want you to follow all these steps in order:\\n\n", "1. Split the dataframe into X and y. Where y is Survived column.\n", "2. 
Split X and y into train, test sets.\n", "3. Fit a Gaussian SVM with Poly kernel to the train set.\n", "4. Predict on test set.\n", "5. print the accuracy of the predictions. Round to 4 decimal places and print it out neatly.\n", "\n", "Run all steps from 1 to 5.\n", "\"\"\")" ] }, { "cell_type": "markdown", "id": "583e657d", "metadata": {}, "source": [ "### Naive Bayes to Classify Survival" ] }, { "cell_type": "code", "execution_count": 20, "id": "cd88d9d7", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", "\u001b[1m> Entering new AgentExecutor chain...\u001b[0m\n", "\u001b[32;1m\u001b[1;3m ```python\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.naive_bayes import GaussianNB\n", "from sklearn.metrics import accuracy_score\n", "\n", "# Step 1\n", "X = df.drop('Survived', axis=1)\n", "y = df['Survived']\n", "\n", "# Step 2\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)\n", "\n", "# Step 3\n", "model = GaussianNB()\n", "model.fit(X_train, y_train)\n", "\n", "# Step 4\n", "y_pred = model.predict(X_test)\n", "\n", "# Step 5\n", "accuracy = round(accuracy_score(y_test, y_pred), 4)\n", "print(\"Accuracy:\", accuracy)\n", "```\u001b[0m\n", "\n", "\u001b[1m> Finished chain.\u001b[0m\n", "Accuracy: 0.7765\n" ] } ], "source": [ "infer(\"\"\"Now I want you to follow all these steps in order:\\n\n", "1. Split the dataframe into X and y. Where y is Survived column.\n", "2. Split X and y into train, test sets.\n", "3. Fit a Naive Bayes model to the train set.\n", "4. Predict on test set.\n", "5. print the accuracy of the predictions. Round to 4 decimal places and print it out neatly.\n", "\n", "Run all steps from 1 to 5.\n", "\"\"\")" ] }, { "cell_type": "markdown", "id": "4a6e463b", "metadata": {}, "source": [ "### KNN to Classify Survival\n", "We perform a grid search to figure out the best n_neighbors parameter!" ] }, { "cell_type": "code", "execution_count": 21, "id": "d978330c", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", "\u001b[1m> Entering new AgentExecutor chain...\u001b[0m\n", "\u001b[32;1m\u001b[1;3m ```python\n", "from sklearn.model_selection import train_test_split, GridSearchCV\n", "from sklearn.neighbors import KNeighborsClassifier\n", "from sklearn.metrics import accuracy_score\n", "\n", "# Step 1\n", "X = df.drop('Survived', axis=1)\n", "y = df['Survived']\n", "\n", "# Step 2\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)\n", "\n", "# Step 3\n", "knn = KNeighborsClassifier()\n", "\n", "# Step 4\n", "param_grid = {'n_neighbors': range(1, 11)}\n", "grid_search = GridSearchCV(knn, param_grid, cv=5, scoring='accuracy')\n", "grid_search.fit(X_train, y_train)\n", "\n", "# Step 5\n", "best_knn = grid_search.best_estimator_\n", "y_pred = best_knn.predict(X_test)\n", "\n", "# Step 6\n", "accuracy = round(accuracy_score(y_test, y_pred), 4)\n", "print(f'Accuracy: {accuracy}')\n", "```\u001b[0m\n", "\n", "\u001b[1m> Finished chain.\u001b[0m\n", "Accuracy: 0.7207\n" ] } ], "source": [ "infer(\"\"\"Now I want you to follow all these steps in order:\\n\n", "1. Split the dataframe into X and y. Where y is Survived column.\n", "2. Split X and y into train and test sets.\n", "3. Fit a KNN model to the train set.\n", "4. Find the best n_neighbors parameter out of range (1,10). Import some grid search method to do this.\n", "5. Predict with best model on test set.\n", "6. 
print the accuracy of the predictions. Round to 4 decimal places and print it out neatly.\n", "\n", "Run all steps from 1 to 6.\n", "\"\"\")" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.5" } }, "nbformat": 4, "nbformat_minor": 5 }