<a href="https://colab.research.google.com/github/towardsai/ai-tutor-rag-system/blob/main/notebooks/14-Adding_Chat.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# Install Packages and Setup Variables


In [1]:
!pip install -q llama-index==0.10.57 openai==1.37.0 llama-index-finetuning llama-index-embeddings-huggingface llama-index-embeddings-cohere llama-index-readers-web cohere==5.6.2 tiktoken==0.7.0 chromadb==0.5.5 html2text sentence_transformers pydantic llama-index-vector-stores-chroma==0.1.10 kaleido==0.2.1 llama-index-llms-gemini==0.1.11

In [1]:
import os

# Set the following API keys in the Python environment; they will be used later.
os.environ["OPENAI_API_KEY"] = "<YOUR_OPENAI_KEY>"
os.environ["GOOGLE_API_KEY"] = "<YOUR_API_KEY>"

In [2]:
# Allows running asyncio in environments with an existing event loop, like Jupyter notebooks.

import nest_asyncio

nest_asyncio.apply()
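Jupyter already runs an event loop, and a plain `asyncio.run()` call refuses to start another one from inside it. The standard-library sketch below reproduces that error by calling `asyncio.run()` from within a running coroutine, which is the situation `nest_asyncio.apply()` patches around. The coroutine names are illustrative only.

```python
import asyncio


async def fetch_answer() -> str:
    # Hypothetical stand-in for an async LLM or retrieval call.
    return "done"


async def notebook_cell() -> str:
    # Simulates code running inside Jupyter, where an event loop is already active.
    coro = fetch_answer()
    try:
        return asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"


message = asyncio.run(notebook_cell())
print(message)
```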

# Load a Model


In [3]:
from llama_index.llms.gemini import Gemini

llm = Gemini(model="models/gemini-1.5-flash", temperature=1, max_tokens=512)



# Create a VectorStore


In [4]:
import chromadb

# Create a client and a new collection.
# PersistentClient saves data to disk at the given path; chromadb.EphemeralClient would keep it in memory only.
chroma_client = chromadb.PersistentClient(path="./mini-llama-articles")
chroma_collection = chroma_client.create_collection("mini-llama-articles")

In [5]:
from llama_index.vector_stores.chroma import ChromaVectorStore

# Wrap the Chroma collection in a LlamaIndex vector store object.
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)

# Load the Dataset (CSV)


## Download


The dataset contains several articles from the TowardsAI blog that explain the LLaMA2 model in depth. Download the CSV file; it is then read row by row, one article per row.


In [6]:
!curl -o ./mini-llama-articles.csv https://raw.githubusercontent.com/AlaFalaki/tutorial_notebooks/main/data/mini-llama-articles.csv

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  169k  100  169k    0     0   784k      0 --:--:-- --:--:-- --:--:--  785k


## Read File


In [7]:
import csv

rows = []

# Load the CSV file; each row (after the header) is one article.
with open("./mini-llama-articles.csv", mode="r", encoding="utf-8", newline="") as file:
    csv_reader = csv.reader(file)

    for idx, row in enumerate(csv_reader):
        # Skip the header row
        if idx == 0:
            continue
        rows.append(row)

# The number of articles (rows) in the dataset.
len(rows)

14
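The loop above skips the header by checking the row index. A slightly more idiomatic sketch consumes the header once with `next()` and then reads the rest. The in-memory CSV below is made-up sample data that only mimics the column order the notebook relies on (title, text, url, source name); its header names are hypothetical and it is not the real dataset.

```python
import csv
import io

# Made-up sample mimicking the column order used above: title, text, url, source name.
sample_csv = io.StringIO(
    "title,content,url,source\n"
    'Article A,"Llama 2 text, part one",https://example.com/a,towards_ai\n'
    "Article B,More text,https://example.com/b,towards_ai\n"
)

sample_reader = csv.reader(sample_csv)
sample_header = next(sample_reader)  # consume the header row once
sample_rows = list(sample_reader)    # every remaining row is one article

print(sample_header)
print(len(sample_rows))  # number of articles, not characters
```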

# Convert to Document Objects


In [8]:
from llama_index.core import Document

# Convert each CSV row (article) into a Document object so the LlamaIndex framework can process it.
documents = [
    Document(
        text=row[1], metadata={"title": row[0], "url": row[2], "source_name": row[3]}
    )
    for row in rows
]

# Transforming


In [9]:
from llama_index.core.text_splitter import TokenTextSplitter

# Define the splitter object that splits the text into segments of 512 tokens,
# with a 128-token overlap between consecutive segments.
text_splitter = TokenTextSplitter(separator=" ", chunk_size=512, chunk_overlap=128)
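To make `chunk_size` and `chunk_overlap` concrete, here is a simplified sliding-window splitter. It is a sketch under a loud assumption: it treats space-separated words as "tokens", whereas `TokenTextSplitter` counts tokens with a real tokenizer, so its chunk boundaries will differ. The helper name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into windows of `chunk_size` words that overlap by `chunk_overlap` words."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    words = text.split(" ")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reaches the end of the text
    return chunks


demo_text = " ".join(f"w{i}" for i in range(10))
for chunk in sliding_window_chunks(demo_text, chunk_size=4, chunk_overlap=1):
    print(chunk)
```

Each window repeats the last `chunk_overlap` words of the previous one, so a sentence cut at a boundary still appears whole in at least one neighboring chunk when the overlap is large enough.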

In [10]:
from llama_index.core.extractors import (
    SummaryExtractor,
    QuestionsAnsweredExtractor,
    KeywordExtractor,
)
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core.ingestion import IngestionPipeline

# Create the pipeline to apply the transformation on each chunk,
# and store the transformed text in the chroma vector store.
pipeline = IngestionPipeline(
    transformations=[
        text_splitter,
        QuestionsAnsweredExtractor(questions=3, llm=llm),
        SummaryExtractor(summaries=["prev", "self"], llm=llm),
        KeywordExtractor(keywords=10, llm=llm),
        OpenAIEmbedding(),
    ],
    vector_store=vector_store,
)

nodes = pipeline.run(documents=documents, show_progress=True)


Parsing nodes: 100%|██████████| 14/14 [00:00<00:00, 28.48it/s]
100%|██████████| 108/108 [00:59<00:00,  1.82it/s]
100%|██████████| 108/108 [01:46<00:00,  1.02it/s]
100%|██████████| 108/108 [00:28<00:00,  3.75it/s]
Generating embeddings: 100%|██████████| 108/108 [00:01<00:00, 59.76it/s]


In [11]:
len(nodes)

108
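Conceptually, the pipeline passes the output of each transformation to the next one: the splitter turns documents into nodes, the extractors add metadata (questions, summaries, keywords) to each node, and the embedding model attaches a vector. The plain-Python sketch below mimics only that chaining. `DemoNode`, `split_sentences`, `extract_keyword`, and `toy_embed` are hypothetical stand-ins, not LlamaIndex APIs; the real extractors prompt the LLM and `OpenAIEmbedding` calls the OpenAI API.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class DemoNode:
    # Hypothetical stand-in for a LlamaIndex node: text, metadata, and an optional vector.
    text: str
    metadata: dict = field(default_factory=dict)
    embedding: Optional[List[float]] = None


def split_sentences(nodes: List[DemoNode]) -> List[DemoNode]:
    # Toy splitter: one node per ". "-separated sentence, copying the parent metadata.
    out = []
    for node in nodes:
        for part in node.text.split(". "):
            if part.strip():
                out.append(DemoNode(text=part, metadata=dict(node.metadata)))
    return out


def extract_keyword(nodes: List[DemoNode]) -> List[DemoNode]:
    # Toy metadata extractor: store the longest word as the node's "keyword".
    for node in nodes:
        words = node.text.split()
        node.metadata["keyword"] = max(words, key=len) if words else ""
    return nodes


def toy_embed(nodes: List[DemoNode]) -> List[DemoNode]:
    # Toy "embedding": [character count, word count] instead of a learned vector.
    for node in nodes:
        node.embedding = [float(len(node.text)), float(len(node.text.split()))]
    return nodes


def run_pipeline(docs: List[DemoNode], transformations: List[Callable]) -> List[DemoNode]:
    result = docs
    for transform in transformations:
        result = transform(result)  # each step consumes the previous step's output
    return result


demo_docs = [DemoNode(text="Llama 2 is open. It has four sizes", metadata={"title": "Demo"})]
demo_nodes = run_pipeline(demo_docs, [split_sentences, extract_keyword, toy_embed])
for node in demo_nodes:
    print(node.text, node.metadata, node.embedding)
```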

In [12]:
# Compress the vector store directory to a zip file to be able to download and use later.
!zip -r vectorstore.zip mini-llama-articles

updating: mini-llama-articles/ (stored 0%)
updating: mini-llama-articles/chroma.sqlite3 (deflated 65%)
  adding: mini-llama-articles/1a47984b-079a-4e72-809a-387c43e980b6/ (stored 0%)
  adding: mini-llama-articles/1a47984b-079a-4e72-809a-387c43e980b6/data_level0.bin (deflated 100%)
  adding: mini-llama-articles/1a47984b-079a-4e72-809a-387c43e980b6/length.bin (deflated 63%)
  adding: mini-llama-articles/1a47984b-079a-4e72-809a-387c43e980b6/link_lists.bin (stored 0%)
  adding: mini-llama-articles/1a47984b-079a-4e72-809a-387c43e980b6/header.bin (deflated 61%)
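If shell commands such as `zip` and `unzip` are unavailable, Python's standard `shutil` module can create and extract an equivalent archive. A minimal sketch, run on a throwaway temporary directory with a made-up placeholder file rather than the real `mini-llama-articles` folder:

```python
import os
import shutil
import tempfile

with tempfile.TemporaryDirectory() as workdir:
    # Create a small stand-in for the vector store directory.
    store_dir = os.path.join(workdir, "mini-llama-articles")
    os.makedirs(store_dir)
    with open(os.path.join(store_dir, "placeholder.txt"), "w") as f:
        f.write("demo")

    # Roughly equivalent to `zip -r vectorstore.zip mini-llama-articles`.
    archive_path = shutil.make_archive(
        os.path.join(workdir, "vectorstore"), "zip",
        root_dir=workdir, base_dir="mini-llama-articles",
    )

    # Roughly equivalent to `unzip vectorstore.zip`, extracted into a fresh folder.
    restore_dir = os.path.join(workdir, "restored")
    shutil.unpack_archive(archive_path, restore_dir)
    restored_ok = os.path.isfile(
        os.path.join(restore_dir, "mini-llama-articles", "placeholder.txt")
    )

print(os.path.basename(archive_path), restored_ok)
```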


# Load Indexes


If you have already uploaded the zip file for the vector store checkpoint, uncomment the code in the following cell to extract its contents. You can then load the vector store from local storage instead of rebuilding it.


In [13]:
# !unzip vectorstore.zip

In [14]:
import chromadb
from llama_index.vector_stores.chroma import ChromaVectorStore

# Load the vector store from the local storage.
db = chromadb.PersistentClient(path="./mini-llama-articles")
chroma_collection = db.get_or_create_collection("mini-llama-articles")
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)

In [15]:
from llama_index.core import VectorStoreIndex

# Create the index based on the vector store.
vector_index = VectorStoreIndex.from_vector_store(vector_store)
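When the index answers a query, its retriever returns the stored nodes whose embeddings are closest to the query's embedding, and each source node carries a relevance `Score`. The brute-force sketch below illustrates top-k retrieval with cosine similarity over hand-made vectors. The vectors, node ids, and the `top_k` helper are hypothetical; a real vector store such as Chroma uses an approximate nearest-neighbor index and its own distance metric, so its scores will not match this calculation.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    # Dot product divided by the product of the vector lengths.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query: list[float], vectors: dict[str, list[float]], k: int = 2) -> list[tuple[str, float]]:
    # Score every stored vector against the query, then keep the k highest scores.
    scored = [(node_id, cosine_similarity(query, vec)) for node_id, vec in vectors.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Hypothetical 3-dimensional "embeddings" for three stored nodes.
stored = {
    "node-params": [0.9, 0.1, 0.0],
    "node-license": [0.1, 0.9, 0.1],
    "node-ghost-attention": [0.0, 0.2, 0.9],
}
query_vec = [1.0, 0.0, 0.0]
for node_id, score in top_k(query_vec, stored, k=2):
    print(f"{node_id}\t{score:.3f}")
```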

# Display Results


In [16]:
# A simple function to show the response and the sources.
def display_res(response):
    print("Response:\n\t", response.response.replace("\n", ""))

    print("Sources:")
    if response.source_nodes:
        for src in response.source_nodes:
            print("\tNode ID\t", src.node_id)
            print("\tText\t", src.text)
            print("\tScore\t", src.score)
            print("\t" + "-_" * 20)
    else:
        print("\tNo sources used!")

# Chat Engine


In [17]:
# Define the chat engine on top of the index. chat_mode defaults to "best",
# which picks an agent type based on the LLM's capabilities.
chat_engine = vector_index.as_chat_engine(llm=llm)

In [18]:
# First Question:
response = chat_engine.chat(
    "Use the tool to answer, How many parameters LLaMA2 model has?"
)
display_res(response)

Response:
	 The LLaMA2 model has four different model sizes with varying parameters: 7 billion, 13 billion, 34 billion, and 70 billion parameters.
Sources:
	Node ID	 c3239b40-e206-4a80-b020-eea87cf471cc
	Text	 I. Llama 2: Revolutionizing Commercial Use Unlike its predecessor Llama 1, which was limited to research use, Llama 2 represents a major advancement as an open-source commercial model. Businesses can now integrate Llama 2 into products to create AI-powered applications. Availability on Azure and AWS facilitates fine-tuning and adoption. However, restrictions apply to prevent exploitation. Companies with over 700 million active daily users cannot use Llama 2. Additionally, its output cannot be used to improve other language models.  II. Llama 2 Model Flavors Llama 2 is available in four different model sizes: 7 billion, 13 billion, 34 billion, and 70 billion parameters. While 7B, 13B, and 70B have already been released, the 34B model is still awaited. The pretrained variant, train

In [19]:
# Second Question:
response = chat_engine.chat("Tell me a joke?")
display_res(response)

Response:
	 Why did the scarecrow win an award? Because he was outstanding in his field!
Sources:
	Node ID	 8685e48d-1fdb-4f55-8f62-6f2ea4cfaf5d
	Text	 with their larger size, outperform Llama 2, this is expected due to their capacity for handling complex language tasks. Llama 2's impressive ability to compete with larger models highlights its efficiency and potential in the market. However, Llama 2 does face challenges in coding and math problems, where models like Chat GPT 4 excel, given their significantly larger size. Chat GPT 4 performed significantly better than Llama 2 for coding (HumanEval benchmark)and math problem tasks (GSM8k benchmark). Open-source AI technologies, like Llama 2, continue to advance, offering strong competition to closed-source models.  V. Ghost Attention: Enhancing Conversational Continuity One unique feature in Llama 2 is Ghost Attention, which ensures continuity in conversations. This means that even after multiple interactions, the model remembers its in

In [20]:
# Third Question: (check if it can recall previous interactions)
response = chat_engine.chat("What was the first question I asked?")
display_res(response)

Response:
	 The first question you asked was "How many parameters LLaMA2 model has?"
Sources:
	No sources used!


In [21]:
# Reset the session to clear the memory
chat_engine.reset()

In [22]:
# Fourth Question: (after the reset, the engine should no longer recall previous interactions.)
response = chat_engine.chat("What was the first question I asked?")
display_res(response)

Response:
	 The first question you asked was "How can a Q&A bot be built over private documents using OpenAI and LangChain?"
Sources:
	Node ID	 baa8a99c-f38b-4818-b854-5741598c0776
	Text	 Private data to be used The example provided can be used with any dataset. I am using a data set that has Analyst recommendations from various stocks. For the purpose of demonstration, I have gathered publicly available analyst recommendations to showcase its capabilities. You can replace this with your own information to try this. Below is a partial extract of the information commonly found in these documents. If you wish to try it yourself, you can download analyst recommendations for your preferred stocks from online sources or access them through subscription platforms like Barron's. Although the example provided focuses on analyst recommendations, the underlying structure can be utilized to query various other types of documents in any industry as well. I have assembled such data for a few stocks
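The chat engine keeps earlier turns in a memory buffer and includes them with each new message, which is why the third question could be answered. `reset()` empties that buffer, so afterwards the model has no record of the first question, and the answer to the fourth question is not a genuine recollection. A stripped-down sketch of that behavior, using a hypothetical `ChatMemory` class rather than LlamaIndex's own memory classes:

```python
from typing import List, Optional, Tuple


class ChatMemory:
    """Hypothetical minimal chat memory that keeps (role, content) turns in order."""

    def __init__(self) -> None:
        self.turns: List[Tuple[str, str]] = []

    def add(self, role: str, content: str) -> None:
        self.turns.append((role, content))

    def first_user_message(self) -> Optional[str]:
        # Scan the stored history for the earliest user turn.
        for role, content in self.turns:
            if role == "user":
                return content
        return None

    def reset(self) -> None:
        # Analogous to chat_engine.reset(): forget the whole conversation.
        self.turns.clear()


demo_memory = ChatMemory()
demo_memory.add("user", "How many parameters does LLaMA2 have?")
demo_memory.add("assistant", "It comes in 7B, 13B, 34B, and 70B sizes.")
before_reset = demo_memory.first_user_message()
demo_memory.reset()
after_reset = demo_memory.first_user_message()
print(before_reset)
print(after_reset)
```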

# Streaming
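With streaming, `stream_chat` returns right away and its `response_gen` is a generator that yields text pieces as the LLM produces them; printing each piece with `end=""` rebuilds the answer incrementally. A self-contained sketch of that consumption pattern, with a hypothetical `fake_llm_stream` generator standing in for the real model:

```python
import time
from typing import Iterator


def fake_llm_stream(answer: str, delay: float = 0.0) -> Iterator[str]:
    # Hypothetical stand-in for a streaming LLM: yields one word at a time.
    words = answer.split(" ")
    for i, word in enumerate(words):
        time.sleep(delay)  # simulate per-token generation latency
        # Re-attach the space that split() removed, except after the last word.
        yield word if i == len(words) - 1 else word + " "


received = []
for piece in fake_llm_stream("Llama 2 comes in four sizes."):
    print(piece, end="", flush=True)  # show each piece as soon as it arrives
    received.append(piece)
print()

full_text = "".join(received)
```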


In [23]:
# Stream the words as soon as they are available instead of waiting for the model to finish generation.
streaming_response = chat_engine.stream_chat(
    "Write a paragraph about the LLaMA2 model's capabilities."
)
for token in streaming_response.response_gen:
    print(token, end="")

Here is a paragraph about the LLaMA2 model's capabilities:

"The Llama 2 model showcases impressive capabilities in the realm of open-source language models. It introduces innovative features like Ghost Attention, which enhances conversational continuity by ensuring consistent responses throughout interactions. Additionally, Llama 2 boasts a groundbreaking temporal capability that organizes information based on time relevance, leading to more contextually accurate responses. Despite facing challenges in coding and math problems compared to larger models like Chat GPT 4, Llama 2 demonstrates efficiency and potential in the market, competing well with both open-source and closed-source models. Its ability to balance helpfulness and safety in optimizing responses further solidifies its position as a reliable and advanced language model for commercial use."

## Condense Question


This mode condenses the previous chat history and the current question into a single standalone question. The rewritten question is then used to retrieve the relevant nodes.
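The point is that a follow-up such as "What is it useful for?" means nothing to the retriever on its own, because the retriever only sees one query string. The sketch below only builds a condense prompt from the history; the template wording and the `build_condense_prompt` helper are hypothetical (LlamaIndex ships its own default template). In the real engine, this prompt goes to the LLM, and the LLM's rewritten question is what gets sent to the index.

```python
from typing import List, Tuple

# Hypothetical template; LlamaIndex's built-in condense prompt is worded differently.
CONDENSE_TEMPLATE = (
    "Given the conversation below and a follow-up question, "
    "rewrite the follow-up as a standalone question.\n\n"
    "Chat history:\n{history}\n\n"
    "Follow-up question: {question}\n"
    "Standalone question:"
)


def build_condense_prompt(history: List[Tuple[str, str]], question: str) -> str:
    # Flatten (role, content) pairs into "role: content" lines.
    history_text = "\n".join(f"{role}: {content}" for role, content in history)
    return CONDENSE_TEMPLATE.format(history=history_text, question=question)


demo_history = [
    ("user", "Which company released the LLaMA2 model?"),
    ("assistant", "Meta released LLaMA2."),
]
condense_prompt = build_condense_prompt(demo_history, "What is it useful for?")
print(condense_prompt)
```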


In [24]:
from llama_index.llms.openai import OpenAI

# Define the GPT-4o model that the chat_engine will use to rewrite (condense) the query.
gpt4 = OpenAI(temperature=0.9, model="gpt-4o")

In [25]:
chat_engine = vector_index.as_chat_engine(
    chat_mode="condense_question", llm=gpt4, verbose=True
)

In [26]:
response = chat_engine.chat(
    "Use the tool to answer, which company released LLaMA2 model? What is the model useful for?"
)
display_res(response)

Querying with: Using the tool at your disposal, can you please determine which company released the LLaMA2 model and explain what specific functionality or purpose this model is known for?
Response:
	 The LLaMA2 model was released by Meta. The model is known for its temporal awareness feature which enhances the accuracy of its responses by delivering more contextually accurate responses based on time relevance. For example, for the question, "How long ago did Barack Obama become president?", it only considers information relevant after 2008. Meta's open-sourcing of LLaMA2 provides developers and researchers with commercial access to the advanced language model, which represents a significant shift in the AI industry.
Sources:
	Node ID	 7adec56f-6714-4376-8ebf-180b694c4d59
	Text	 LLaMA: Meta's new AI tool According to the official release, LLaMA is a foundational language model developed to assist 'researchers and academics' in their work (as opposed to the average web user) to understa

## ReAct


ReAct is an agent-based chat mode. On each message, the LLM runs a reasoning loop in which it decides whether to query the data engine (the index) as a tool. This is flexible, but answer quality depends heavily on the LLM's reasoning, so the mode needs careful monitoring to avoid inaccurate responses.
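The ReAct loop alternates between a reasoning step, where the LLM picks the next action (call a tool or finish), and an observation step that feeds the tool's output back in. The toy loop below shows only that control flow: the LLM is replaced by a scripted `scripted_llm` function and the query engine by a dictionary lookup, `lookup_tool`. Both, along with `DEMO_FACTS` and `react_loop`, are hypothetical and are not LlamaIndex's agent implementation.

```python
from typing import Callable, Dict, List

# Hypothetical knowledge base standing in for the query engine over the index.
DEMO_FACTS: Dict[str, str] = {
    "Which company released the LLaMA2 model?": "Meta released the LLaMA2 model.",
    "What is the LLaMA2 model useful for?": "Building AI-powered applications.",
}


def lookup_tool(question: str) -> str:
    # Hypothetical tool: exact-match lookup instead of vector retrieval.
    return DEMO_FACTS.get(question, "No relevant information found.")


def scripted_llm(observations: List[str]) -> Dict[str, str]:
    # Hypothetical "LLM": asks one sub-question per step, then gives a final answer.
    sub_questions = list(DEMO_FACTS)
    if len(observations) < len(sub_questions):
        return {"action": "lookup_tool", "input": sub_questions[len(observations)]}
    return {"action": "finish", "answer": " ".join(observations)}


def react_loop(llm: Callable[[List[str]], Dict[str, str]], max_steps: int = 5) -> str:
    observations: List[str] = []
    for _ in range(max_steps):
        decision = llm(observations)  # Reason: choose the next action
        if decision["action"] == "finish":
            return decision["answer"]
        observation = lookup_tool(decision["input"])  # Act: call the tool
        print(f"Tool input: {decision['input']!r} -> {observation!r}")
        observations.append(observation)  # Observe: feed the result back in
    return "Stopped: step limit reached."


final_answer = react_loop(scripted_llm)
print(final_answer)
```

The `max_steps` cap matters in practice: if the LLM keeps choosing tool calls, the loop has to stop somewhere, which is one reason ReAct answers depend so much on the model's judgment.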


In [27]:
chat_engine = vector_index.as_chat_engine(chat_mode="react", verbose=True, llm=llm)

In [28]:
response = chat_engine.chat(
    "Which company released LLaMA2 model? What is the model useful for?"
)

Added user message to memory: Which company released LLaMA2 model? What is the model useful for?
=== Calling Function ===
Calling function: query_engine_tool with args: {"input": "Which company released LLaMA2 model?"}
Got output: Meta released the LLaMA2 model.

=== Calling Function ===
Calling function: query_engine_tool with args: {"input": "What is the LLaMA2 model useful for?"}
Got output: The Llama 2 model is useful for businesses to integrate into products to create AI-powered applications.



In [29]:
display_res(response)

Response:
	 The LLaMA2 model was released by Meta. It is useful for businesses to integrate into products to create AI-powered applications.
Sources:
	Node ID	 7adec56f-6714-4376-8ebf-180b694c4d59
	Text	 LLaMA: Meta's new AI tool According to the official release, LLaMA is a foundational language model developed to assist 'researchers and academics' in their work (as opposed to the average web user) to understand and study these NLP models. Leveraging AI in such a way could give researchers an edge in terms of time spent. You may not know this, but this would be Meta's third LLM after Blender Bot 3 and Galactica. However, the two LLMs were shut down soon, and Meta stopped their further development, as it produced erroneous results. Before moving further, it is important to emphasize that LLaMA is NOT a chatbot like ChatGPT. As I mentioned before, it is a 'research tool' for researchers. We can expect the initial versions of LLaMA to be a bit more technical and indirect to use as oppose