Open-Source AI Cookbook documentation

用 Gemma, MongoDB 和开源模型构建 RAG 系统

Open-Source AI Cookbook

Join the Hugging Face community

and get access to the augmented documentation experience

Collaborate on models, datasets and Spaces

Faster examples with accelerated inference

Switch between documentation themes

to get started

用 Gemma, MongoDB 和开源模型构建 RAG 系统

作者: Richmond Alake

第一步：安装库

这些命令是用来安装一些软件包的，这些软件包可以帮助你使用和操作 LLMs，处理数据，并且和数据库进行交流。这些库简化了RAG系统的开发，将复杂性减少到少量的代码：

PyMongo：一个用于与 MongoDB 交互的 Python 库，它提供了连接到集群和查询存储在集合和文档中的数据的功能。
Pandas：提供了一个数据结构，用于使用 Python 进行高效的数据处理和分析。
Hugging Face datasets：包含音频、视觉和文本数据集。
Hugging Face Accelerate：抽象了编写利用硬件加速器（如GPU）的代码的复杂性。在实现中，利用 Accelerate 在 GPU 资源上利用 Gemma 模型。
Hugging Face Transformers：访问大量预训练模型。
Hugging Face Sentence Transformers：提供对句子、文本和图像嵌入的访问。

!pip install datasets pandas pymongo sentence_transformers
!pip install -U transformers
# Install below if using GPU
!pip install accelerate

第二步：数据源选择和准备

本教程使用的数据来源于 Hugging Face datasets，具体是 AIatMongoDB/embedded_movies 数据集。

# Load Dataset
from datasets import load_dataset
import pandas as pd

# https://huggingface.co/datasets/AIatMongoDB/embedded_movies
dataset = load_dataset("AIatMongoDB/embedded_movies")

# Convert the dataset to a pandas dataframe
dataset_df = pd.DataFrame(dataset["train"])

dataset_df.head(5)

以下代码片段中的操作侧重于确保数据的完整性和质量。

第一个过程确保每个数据点的 fullplot 属性不为空，因为这是我们嵌入过程中主要使用的数据。
这一步还确保我们移除所有数据点的 plot_embedding 属性，因为这将被一个不同的嵌入模型 gte-large 创建的新嵌入所替换。

>>> # Data Preparation

>>> # Remove data point where plot coloumn is missing
>>> dataset_df = dataset_df.dropna(subset=["fullplot"])
>>> print("\nNumber of missing values in each column after removal:")
>>> print(dataset_df.isnull().sum())

>>> # Remove the plot_embedding from each data point in the dataset as we are going to create new embeddings with an open source embedding model from Hugging Face
>>> dataset_df = dataset_df.drop(columns=["plot_embedding"])
>>> dataset_df.head(5)

Number of missing values in each column after removal:
num_mflix_comments      0
genres                  0
countries               0
directors              12
fullplot                0
writers                13
awards                  0
runtime                14
type                    0
rated                 279
metacritic            893
poster                 78
languages               1
imdb                    0
plot                    0
cast                    1
plot_embedding          1
title                   0
dtype: int64

第三步：生成嵌入

代码片段中的步骤如下：

导入 SentenceTransformer 类以访问嵌入模型。
使用 SentenceTransformer 构造函数加载嵌入模型，以实例化 gte-large 嵌入模型。
定义 get_embedding 函数，该函数接受一个文本字符串作为输入，并返回一个代表嵌入的浮点数列表。该函数首先检查输入文本是否为空（去除空白后）。如果文本为空，则返回一个空列表。否则，它使用加载的模型生成嵌入。
通过将 get_embedding 函数应用于 dataset_df DataFrame 的 “fullplot” 列，为每个电影的剧情生成嵌入。生成的嵌入列表被分配到一个名为 embedding 的新列中。

注意：由于我们可以确保文本长度保持在可管理的范围内，因此不需要对完整剧情文本进行分块处理。

from sentence_transformers import SentenceTransformer

# https://huggingface.co/thenlper/gte-large
embedding_model = SentenceTransformer("thenlper/gte-large")


def get_embedding(text: str) -> list[float]:
    if not text.strip():
        print("Attempted to get embedding for empty text.")
        return []

    embedding = embedding_model.encode(text)

    return embedding.tolist()


dataset_df["embedding"] = dataset_df["fullplot"].apply(get_embedding)

dataset_df.head()

第4步：数据库设置和连接

MongoDB 既是一个操作数据库，也是一个向量数据库。它提供了一个数据库解决方案，有效地存储、查询和检索向量嵌入。其优势在于数据库维护、管理和成本的简单性。

创建新的 MongoDB 数据库，设置数据库集群：

前往MongoDB官网，注册一个免费的 MongoDB Atlas 账户，或者对于现有用户，登录 MongoDB Atlas。
在左侧窗格中选择 ‘Database’ 选项，这将导航到数据库部署页面，你可以在其中查看任何现有集群的部署规格。点击 “+Create” 按钮，创建一个新的数据库集群。
选择适用于数据库集群的所有配置。选择所有配置选项后，点击 “Create Cluster” 按钮以部署新创建的集群。MongoDB 还在 “Shared Tab” 上启用了免费集群的创建。

注意：创建概念证明时，不要忘记将 Python 主机的 IP 列入白名单，或设置 0.0.0.0/0 用于任何IP。
成功创建和部署集群后，集群将在 ‘Database Deployment’ 页面中变得可访问。
点击集群的 “Connect” 按钮，查看通过各种语言驱动程序设置与集群的连接的选项。
本教程只需要集群的 URI（唯一资源标识符）。获取 URI 并将其复制到 Google Colabs Secrets 环境中的名为 MONGO_URI 的变量中，或者将其放入 .env 文件或等效文件中。

4.1 数据库和集合设置

在继续之前，请确保满足以下先决条件

在 MongoDB Atlas 上设置数据库集群
获取到你的集群的 URI

有关数据库集群设置和获取 URI 的帮助，请参阅我们的指南：设置 MongoDB 集群和获取你的连接字符串

创建集群后，通过在集群概览页面点击+创建数据库，在 MongoDB Atlas 集群中创建数据库和集合。

这里有关于创建数据库和集合的指南 数据库将被命名为 movies。 集合将被命名为 movie_collection_2。

第5步：创建向量搜索索引

在这一点上，请确保你的向量索引是通过 MongoDB Atlas 创建的。

接下来，你必须做一个非常重要的步骤，那就是在 movie_collection_2 这个数据库的文档里，为那些用来表示电影特点的向量建立一个特殊的搜索索引。这个索引就像是图书馆里的图书索引卡，它帮助计算机快速准确地找到与你的搜索最相似的电影向量。没有这个索引，计算机就得一篇一篇地翻找，效率会非常低。所以，建立这个索引是为了让搜索变得又快又准。

点击此处了解更多关于MongoDB 向量搜索索引的信息。

{
 "fields": [{
     "numDimensions": 1024,
     "path": "embedding",
     "similarity": "cosine",
     "type": "vector"
   }]
}

numDimensions 字段的 1024 值对应于由 gte-large 嵌入模型生成的向量的维度。如果你使用 gte-base 或 gte-small 嵌入模型，向量搜索索引中的 numDimensions 值必须分别设置为 768 和 384。

第6步：建立数据连接

下面的代码片段还使用了 PyMongo 来创建一个 MongoDB 客户端对象，该对象代表与集群的连接，并允许访问其数据库和集合。

>>> import pymongo
>>> from google.colab import userdata


>>> def get_mongo_client(mongo_uri):
...     """Establish connection to the MongoDB."""
...     try:
...         client = pymongo.MongoClient(mongo_uri)
...         print("Connection to MongoDB successful")
...         return client
...     except pymongo.errors.ConnectionFailure as e:
...         print(f"Connection failed: {e}")
...         return None


... mongo_uri = userdata.get("MONGO_URI")
... if not mongo_uri:
...     print("MONGO_URI not set in environment variables")

... mongo_client = get_mongo_client(mongo_uri)

... # Ingest data into MongoDB
... db = mongo_client["movies"]
... collection = db["movie_collection_2"]

Connection to MongoDB successful

# Delete any existing records in the collection
collection.delete_many({})

从 pandas DataFrame 中将数据导入 MongoDB 集合是一个简单的过程，可以通过将 DataFrame 转换为字典，然后在集合上使用 insert_many 方法来传递转换后的数据集记录，从而高效完成。

>>> documents = dataset_df.to_dict("records")
>>> collection.insert_many(documents)

>>> print("Data ingestion into MongoDB completed")

Data ingestion into MongoDB completed

第7步：对用户查询执行向量搜索

下一步实现了一个函数，该函数通过生成查询嵌入并定义一个 MongoDB 聚合流水线来返回一个向量搜索结果。

该流水线包括 $vectorSearch 和 $project 阶段，它使用生成的向量执行查询，并格式化结果以仅包括所需信息，如剧情、标题和类型，同时为每个结果引入一个搜索分数。

def vector_search(user_query, collection):
    """
    Perform a vector search in the MongoDB collection based on the user query.

    Args:
    user_query (str): The user's query string.
    collection (MongoCollection): The MongoDB collection to search.

    Returns:
    list: A list of matching documents.
    """

    # Generate embedding for the user query
    query_embedding = get_embedding(user_query)

    if query_embedding is None:
        return "Invalid query or embedding generation failed."

    # Define the vector search pipeline
    pipeline = [
        {
            "$vectorSearch": {
                "index": "vector_index",
                "queryVector": query_embedding,
                "path": "embedding",
                "numCandidates": 150,  # Number of candidate matches to consider
                "limit": 4,  # Return top 4 matches
            }
        },
        {
            "$project": {
                "_id": 0,  # Exclude the _id field
                "fullplot": 1,  # Include the plot field
                "title": 1,  # Include the title field
                "genres": 1,  # Include the genres field
                "score": {"$meta": "vectorSearchScore"},  # Include the search score
            }
        },
    ]

    # Execute the search
    results = collection.aggregate(pipeline)
    return list(results)

第 8 步：处理用户查询和加载 Gemma

def get_search_result(query, collection):

    get_knowledge = vector_search(query, collection)

    search_result = ""
    for result in get_knowledge:
        search_result += f"Title: {result.get('title', 'N/A')}, Plot: {result.get('fullplot', 'N/A')}\n"

    return search_result

>>> # Conduct query with retrival of sources
>>> query = "What is the best romantic movie to watch and why?"
>>> source_information = get_search_result(query, collection)
>>> combined_information = f"Query: {query}\nContinue to answer the query by using the Search Results:\n{source_information}."

>>> print(combined_information)

Query: What is the best romantic movie to watch and why?
Continue to answer the query by using the Search Results:
Title: Shut Up and Kiss Me!, Plot: Ryan and Pete are 27-year old best friends in Miami, born on the same day and each searching for the perfect woman. Ryan is a rookie stockbroker living with his psychic Mom. Pete is a slick surfer dude yet to find commitment. Each meets the women of their dreams on the same day. Ryan knocks heads in an elevator with the gorgeous Jessica, passing out before getting her number. Pete falls for the insatiable Tiara, but Tiara's uncle is mob boss Vincent Bublione, charged with her protection. This high-energy romantic comedy asks to what extent will you go for true love?
Title: Pearl Harbor, Plot: Pearl Harbor is a classic tale of romance set during a war that complicates everything. It all starts when childhood friends Rafe and Danny become Army Air Corps pilots and meet Evelyn, a Navy nurse. Rafe falls head over heels and next thing you know Evelyn and Rafe are hooking up. Then Rafe volunteers to go fight in Britain and Evelyn and Danny get transferred to Pearl Harbor. While Rafe is off fighting everything gets completely whack and next thing you know everybody is in the middle of an air raid we now know as "Pearl Harbor."
Title: Titanic, Plot: The plot focuses on the romances of two couples upon the doomed ship's maiden voyage. Isabella Paradine (Catherine Zeta-Jones) is a wealthy woman mourning the loss of her aunt, who reignites a romance with former flame Wynn Park (Peter Gallagher). Meanwhile, a charming ne'er-do-well named Jamie Perse (Mike Doyle) steals a ticket for the ship, and falls for a sweet innocent Irish girl on board. But their romance is threatened by the villainous Simon Doonan (Tim Curry), who has discovered about the ticket and makes Jamie his unwilling accomplice, as well as having sinister plans for the girl.
Title: China Girl, Plot: A modern day Romeo & Juliet story is told in New York when an Italian boy and a Chinese girl become lovers, causing a tragic conflict between ethnic gangs.
.

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b-it")
# CPU Enabled uncomment below 👇🏽
# model = AutoModelForCausalLM.from_pretrained("google/gemma-2b-it")
# GPU Enabled use below 👇🏽
model = AutoModelForCausalLM.from_pretrained("google/gemma-2b-it", device_map="auto")

>>> # Moving tensors to GPU
>>> input_ids = tokenizer(combined_information, return_tensors="pt").to("cuda")
>>> response = model.generate(**input_ids, max_new_tokens=500)
>>> print(tokenizer.decode(response[0]))

Based on the search results, the best romantic movie to watch is **Shut Up and Kiss Me!** because it is a romantic comedy that explores the complexities of love and relationships. The movie is funny, heartwarming, and thought-provoking.

Update on GitHub

←构建一个基于 Gemma、Elasticsearch 和 Hugging Face 模型的 RAG 系统用 Hugging Face Zephyr 和 LangChain 针对 Github issues 构建简单的 RAG→