# Overview 

<span style="color:red; font-size:18px; font-weight:bold;">In this notebook, we build a workflow between LangChain and the Mixtral LLM.<br>
We want to accomplish the following:</span>
1. Establish a pipeline to read in a CSV and pass it to our LLM. 
2. Establish a Basic Inference Agent.
3. Test the Basic Inference on a few tasks.

# Setting up the Environment 

In [2]:
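####################################################################################################
import os
import re
import pandas as pd 
import matplotlib.pyplot as plt 
import numpy as np
import openai

# LangChain imports used in later cells (module paths may differ between LangChain versions)
from langchain.agents import AgentExecutor, Tool
from langchain.agents.format_scratchpad.openai_tools import format_to_openai_tool_messages
from langchain.agents.output_parsers.openai_tools import OpenAIToolsAgentOutputParser
from langchain_community.chat_models import ChatAnyscale
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_experimental.agents import create_pandas_dataframe_agent
from langchain_experimental.utilities import PythonREPL

plt.style.use('ggplot')
####################################################################################################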

In [3]:
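# read your API key from the environment instead of hard-coding it in the notebook
# (ANYSCALE_API_KEY is the variable name assumed here)
client = openai.OpenAI(
    base_url = "https://api.endpoints.anyscale.com/v1",
    api_key = os.environ["ANYSCALE_API_KEY"]
)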

# Reading the Dataframe and saving it 
<span style="font-size:16px;color:red;">We save the dataframe to a CSV file in the current working directory because each inference may update it. Reloading that file before every call gives the model the most up-to-date version of the dataframe.</span>

In [4]:
df = pd.read_csv('../data/train.csv')
df.to_csv('./df.csv', index=False)
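
The save-and-reload idea described above can be sketched in isolation. This is a minimal illustration, assuming a hypothetical toy dataframe (`df_demo`) and a temporary directory in place of the notebook's `df.csv`; it is not part of the original pipeline.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy dataframe standing in for the Titanic training data.
df_demo = pd.DataFrame({"Age": [22.0, 38.0], "Sex": ["male", "female"]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "df.csv")

    # Persist without the index so no "Unnamed: 0" column appears on reload.
    df_demo.to_csv(path, index=False)

    # Simulate one inference step that mutates the dataframe, then saves it.
    df_demo["Sex"] = df_demo["Sex"].map({"male": 0, "female": 1})
    df_demo.to_csv(path, index=False)

    # The next step reloads the file and sees the updated values.
    reloaded = pd.read_csv(path)

print(reloaded)
```

Writing with `index=False` matters here: without it, every save and reload cycle would add another index column to the frame the model sees.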

# DECLARE YOUR LLM HERE 

In [4]:
llm = ChatAnyscale(model_name='mistralai/Mixtral-8x7B-Instruct-v0.1', temperature=0)
agent = create_pandas_dataframe_agent(llm, df, verbose=True)

# REPL Tool 
<span style="font-size:16px;color:red;">This tool lets our LLM execute Python code so that we can display the results to the end user.</span>

In [5]:
python_repl = PythonREPL()

repl_tool = Tool(
    name="python_repl",
    description="""A Python shell. Shell can display charts too. Use this to execute python commands.\
    You have access to all libraries in python including but not limited to sklearn, pandas, numpy,\
    matplotlib.pyplot, seaborn etc. Input should be a valid python command. If the user has not explicitly\
    asked you to plot the results, always print the final output using print(...)""",
    func=python_repl.run,
)

tools = [repl_tool]
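
To make the idea of a REPL tool concrete, here is a rough standard-library sketch of what such a tool does conceptually: run a command and hand back whatever it printed. The function name `run_python` is hypothetical, and this is not LangChain's actual `PythonREPL` implementation.

```python
import contextlib
import io


def run_python(command: str) -> str:
    """Execute a Python command and return whatever it printed."""
    buffer = io.StringIO()
    try:
        # Capture stdout so the printed result can be handed back as text.
        with contextlib.redirect_stdout(buffer):
            exec(command, {})
    except Exception as exc:
        # Return the error as text instead of raising, so the caller can react.
        return f"{type(exc).__name__}: {exc}"
    return buffer.getvalue()


print(run_python("print(1 + 1)"))
print(run_python("1 / 0"))
```

This is why the tool description asks the model to `print(...)` its final result: only printed text is returned to the caller.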

# Building an Agent 

In [6]:
prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            """You are a Machine Learning Inference agent. Your job is to use your tools to answer a user query\
            in the best manner possible.\
            Provide no explanation for your code. Enclose all your code between triple backticks ``` """,
        ),
        ("user", "Dataframe named df:\n{df}\nQuery: {input}\nList of Tools: {tools}"),
        MessagesPlaceholder(variable_name="agent_scratchpad"),
    ]
)

In [7]:
agent = (
    {
        "input": lambda x: x["input"],
        "tools": lambda x:x['tools'],
        "df": lambda x:x['df'],
        "agent_scratchpad": lambda x: format_to_openai_tool_messages(
            x["intermediate_steps"]
        )
    }
    | prompt
    | llm
    | OpenAIToolsAgentOutputParser()
)

In [8]:
agent_executor = AgentExecutor(agent=agent, tools=tools, verbose=True)

In [9]:
# building an execution chain

def infer(user_input):
    # fetch the response from the agent, passing in the latest saved dataframe
    result = list(agent_executor.stream({"input": user_input, 
                                         "df":pd.read_csv('df.csv'), "tools":tools}))
    
    # extract the code from every fenced python block in the response
    pattern = r"```python\n(.*?)\n```"
    matches = re.findall(pattern, result[0]['output'], re.DOTALL)
    code = "\n".join(matches)

    # execute the generated code (it operates on the notebook-level df)
    exec(code)

    # save the updated dataframe so the next call reads the latest version
    try:
        df.to_csv('./df.csv', index=False)
    except OSError as err:
        print(f"Could not save df.csv: {err}")
    return None
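
The extraction step inside `infer` can be tried on its own. The sketch below assumes a hypothetical model reply (`reply`) and builds the fence marker programmatically; it uses the same regular expression as the function above and runs the extracted code in a separate namespace rather than in the notebook's globals.

```python
import re

# Three backticks, built programmatically so this example's own fence stays intact.
FENCE = "`" * 3

# Hypothetical model reply containing two fenced Python blocks.
reply = (
    f"{FENCE}python\nx = 2\n{FENCE}\n"
    "Some text the parser should ignore.\n"
    f"{FENCE}python\ny = x * 21\n{FENCE}"
)

# Same pattern as in infer(): capture everything between the fences.
pattern = FENCE + r"python\n(.*?)\n" + FENCE
matches = re.findall(pattern, reply, re.DOTALL)
code = "\n".join(matches)

# Run the extracted code in an isolated namespace.
namespace = {}
exec(code, namespace)
print(namespace["y"])
```

Note that a reply whose fence lacks the `python` tag produces no matches, so `code` is empty and nothing is executed.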

# Testing the Agent

### First we perform some data cleaning. Observe how each infer call updates the df and passes it onwards.

In [10]:
df.head(1)

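|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |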


In [11]:
infer("Please drop the following columns inplace from the original dataframe: Name, Ticket, Cabin, PassengerId.")



> Entering new AgentExecutor chain...
```python
df.drop(columns=['Name', 'Ticket', 'Cabin', 'PassengerId'], inplace=True)
```

> Finished chain.


In [12]:
infer("Please map the strings in the following columns of the original dataframe itself to integers: sex, embarked")



> Entering new AgentExecutor chain...
```python
df['Sex'] = df['Sex'].map({'male': 0, 'female': 1})
df['Embarked'] = df['Embarked'].map({'S': 0, 'C': 1, 'Q': 2})
print(df.head())
```

> Finished chain.
   Survived  Pclass  Sex   Age  SibSp  Parch     Fare  Embarked
0         0       3    0  22.0      1      0   7.2500       0.0
1         1       1    1  38.0      1      0  71.2833       1.0
2         1       3    1  26.0      0      0   7.9250       0.0
3         1       1    1  35.0      1      0  53.1000       0.0
4         0       3    0  35.0      0      0   8.0500       0.0


In [15]:
infer("Please remove NaNs with most frequent value in the respective column. Do this inplace in original dataframe.")



> Entering new AgentExecutor chain...
```python
for col in df.select_dtypes(include='object').columns:
    df[col].fillna(df[col].mode()[0], inplace=True)

for col in df.select_dtypes(include='float').columns:
    df[col].fillna(df[col].median(), inplace=True)
```

> Finished chain.
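
Note that the generated code above fills float columns with the median, although the query asked for the most frequent value. A minimal sketch of a mode-only imputation, assuming a hypothetical toy frame (`toy`) rather than the notebook's `df`, could look like this:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with gaps in a numeric and a string column.
toy = pd.DataFrame(
    {
        "Age": [22.0, np.nan, 22.0, 35.0],
        "Embarked": ["S", "C", None, "S"],
    }
)

# DataFrame.mode() returns the modes per column; the first row holds the
# most frequent value of each column (ties are resolved by sort order).
most_frequent = toy.mode().iloc[0]

# Fill every column's missing entries with that column's most frequent value.
filled = toy.fillna(most_frequent)
print(filled)
```

Returning a new frame from `fillna` also avoids calling `fillna(..., inplace=True)` on a column selection, which newer pandas versions warn about.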


### Linear Regression to predict fare

In [16]:
infer("""Now I want you to follow all these steps in order:\n
1. Split the dataframe into X and y. Where y is Fare column.
2. Split X and y into train, test sets.
3. Fit a linear regression model to the train set.
4. Predict on test set.
5. print the loss on the predictions.

Run all steps from 1 to 5.

""")



> Entering new AgentExecutor chain...
```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Split the dataframe into X and y. Where y is Fare column.
X = df.drop('Fare', axis=1)
y = df['Fare']

# Split X and y into train, test sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit a linear regression model to the train set.
model = LinearRegression()
model.fit(X_train, y_train)

# Predict on test set.
y_pred = model.predict(X_test)

# print the loss on the predictions.
print(mean_squared_error(y_test, y_pred))
```

> Finished chain.
935.49501752569


### SVR to predict Fare

In [17]:
infer("""Now I want you to follow all these steps in order:\n
1. Split the dataframe into X and y. Where y is Fare column.
2. Split X and y into train, test sets.
3. Fit a SVR model to the train set.
4. Predict on test set.
5. print the MSE loss on the predictions.

Run all steps from 1 to 5.
""")



> Entering new AgentExecutor chain...
```python
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error

# Split the dataframe into X and y. Where y is Fare column.
X = df.drop('Fare', axis=1)
y = df['Fare']

# Split X and y into train, test sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit a SVR model to the train set.
model = SVR()
model.fit(X_train, y_train)

# Predict on test set.
predictions = model.predict(X_test)

# print the MSE loss on the predictions.
print(mean_squared_error(y_test, predictions))
```

> Finished chain.
1709.714026652724
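
The two MSE values are easier to compare once converted to RMSE, which is expressed in the same units as the Fare column. A small sketch, using only the numbers printed by the two runs above:

```python
import math

# MSE values copied from the linear regression and SVR runs above.
mse_linear = 935.49501752569
mse_svr = 1709.714026652724

# RMSE is the square root of MSE, so it is in the same units as Fare.
rmse_linear = math.sqrt(mse_linear)
rmse_svr = math.sqrt(mse_svr)

print(f"Linear regression RMSE: {rmse_linear:.2f}")
print(f"SVR RMSE: {rmse_svr:.2f}")
```

On this split the linear model has the lower error; the SVR was fit with default settings and unscaled features, which may partly explain the gap.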


### Logistic Regression to Classify Survival

In [18]:
infer("""Now I want you to follow all these steps in order:\n
1. Split the dataframe into X and y. Where y is Survived column.
2. Split X and y into train, test sets.
3. Fit a Logistic Regression model to the train set.
4. Predict on test set.
5. print the accuracy of the predictions. Round to 4 decimal places and print it out neatly.

Run all steps from 1 to 5.
""")



> Entering new AgentExecutor chain...
```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Step 1
X = df.drop('Survived', axis=1)
y = df['Survived']

# Step 2
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 3
model = LogisticRegression()
model.fit(X_train, y_train)

# Step 4
y_pred = model.predict(X_test)

# Step 5
accuracy = round(accuracy_score(y_test, y_pred), 4)
print(f"Accuracy: {accuracy}")
```

> Finished chain.
Accuracy: 0.7989


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
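
The warning above suggests scaling the data or raising `max_iter`. One possible way to do both, sketched here on synthetic stand-in data from `make_classification` (an assumption; the notebook's `df` is not used), is to wrap the model in a scikit-learn pipeline:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; the notebook's df is not used here.
X, y = make_classification(n_samples=300, n_features=7, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Standardize the features, then fit; max_iter is raised above the default of 100.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

accuracy = round(model.score(X_test, y_test), 4)
print(f"Accuracy: {accuracy}")
```

Because the scaler sits inside the pipeline, it is fit on the training split only and then reused on the test split.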


### SVM to Classify Survival 

In [19]:
infer("""Now I want you to follow all these steps in order:\n
1. Split the dataframe into X and y. Where y is Survived column.
2. Split X and y into train, test sets.
3. Fit a Gaussian SVM with Poly kernel to the train set.
4. Predict on test set.
5. print the accuracy of the predictions. Round to 4 decimal places and print it out neatly.

Run all steps from 1 to 5.
""")



> Entering new AgentExecutor chain...
```python
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Step 1
X = df.drop('Survived', axis=1)
y = df['Survived']

# Step 2
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 3
svm_model = SVC(kernel='poly', degree=3, coef0=1, probability=True, random_state=42)
svm_model.fit(X_train, y_train)

# Step 4
y_pred = svm_model.predict(X_test)

# Step 5
accuracy = round(accuracy_score(y_test, y_pred), 4)
print(f"Accuracy: {accuracy}")
```

> Finished chain.
Accuracy: 0.6816


### Naive Bayes to Classify Survival

In [20]:
infer("""Now I want you to follow all these steps in order:\n
1. Split the dataframe into X and y. Where y is Survived column.
2. Split X and y into train, test sets.
3. Fit a Naive Bayes model to the train set.
4. Predict on test set.
5. print the accuracy of the predictions. Round to 4 decimal places and print it out neatly.

Run all steps from 1 to 5.
""")



> Entering new AgentExecutor chain...
```python
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Step 1
X = df.drop('Survived', axis=1)
y = df['Survived']

# Step 2
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 3
model = GaussianNB()
model.fit(X_train, y_train)

# Step 4
y_pred = model.predict(X_test)

# Step 5
accuracy = round(accuracy_score(y_test, y_pred), 4)
print("Accuracy:", accuracy)
```

> Finished chain.
Accuracy: 0.7765


### KNN to Classify Survival
<span style="font-size:16px;">We perform a grid search to figure out the best n_neighbors parameter!</span>

In [21]:
infer("""Now I want you to follow all these steps in order:\n
1. Split the dataframe into X and y. Where y is Survived column.
2. Split X and y into train and test sets.
3. Fit a KNN model to the train set.
4. Find the best n_neighbors parameter out of range (1,10). Import some grid search method to do this.
5. Predict with best model on test set.
6. print the accuracy of the predictions. Round to 4 decimal places and print it out neatly.

Run all steps from 1 to 6.
""")



> Entering new AgentExecutor chain...
```python
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Step 1
X = df.drop('Survived', axis=1)
y = df['Survived']

# Step 2
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 3
knn = KNeighborsClassifier()

# Step 4
param_grid = {'n_neighbors': range(1, 11)}
grid_search = GridSearchCV(knn, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

# Step 5
best_knn = grid_search.best_estimator_
y_pred = best_knn.predict(X_test)

# Step 6
accuracy = round(accuracy_score(y_test, y_pred), 4)
print(f'Accuracy: {accuracy}')
```

> Finished chain.
Accuracy: 0.7207
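
The run above prints only the final accuracy, not which `n_neighbors` the grid search selected. A hedged sketch of how the chosen value could be inspected, again on synthetic stand-in data rather than the notebook's `df`:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data; the notebook's df is not used here.
X, y = make_classification(n_samples=300, n_features=7, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Search n_neighbors from 1 to 10 with 5-fold cross-validation.
param_grid = {"n_neighbors": list(range(1, 11))}
grid_search = GridSearchCV(
    KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy"
)
grid_search.fit(X_train, y_train)

# best_params_ and best_score_ describe the winning configuration.
best_k = grid_search.best_params_["n_neighbors"]
print(f"Best n_neighbors: {best_k}")
print(f"Mean CV accuracy: {grid_search.best_score_:.4f}")
print(f"Test accuracy: {grid_search.score(X_test, y_test):.4f}")
```

Adding a line such as "print the best parameters" to the query would be one way to have the agent report this value itself.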
