Building RAG with Custom Unstructured Data

Authored by: Maria Khalusova

If you’re new to RAG, please explore the basics of RAG first in this other notebook, and then come back here to learn about building RAG with custom data.

Whether you’re building your own RAG-based personal assistant, a pet project, or an enterprise RAG system, you will quickly discover that a lot of important knowledge is stored in various formats like PDFs, emails, Markdown files, PowerPoint presentations, HTML pages, Word documents, and so on.

How do you preprocess all of this data in a way that you can use it for RAG? In this quick tutorial, you’ll learn how to build a RAG system that will incorporate data from multiple data types. You’ll use Unstructured for data preprocessing, open-source models from Hugging Face Hub for embeddings and text generation, ChromaDB as a vector store, and LangChain for bringing everything together.

Let’s go! We’ll begin by installing the required dependencies:

!pip install -q torch transformers accelerate bitsandbytes sentence-transformers unstructured[all-docs] langchain chromadb langchain_community

Next, let’s get a mix of documents. Suppose I want to build a RAG system that’ll help me manage pests in my garden. For this purpose, I’ll use diverse documents that cover the topic of IPM (integrated pest management):

  • PDF: https://www.gov.nl.ca/ecc/files/env-protection-pesticides-business-manuals-applic-chapter7.pdf
  • Powerpoint: https://ipm.ifas.ufl.edu/pdfs/Citrus_IPM_090913.pptx
  • EPUB: https://www.gutenberg.org/ebooks/45957
  • HTML: https://blog.fifthroom.com/what-to-do-about-harmful-garden-and-plant-insects-and-pests.html

Feel free to use your own documents for your topic of choice from the list of document types supported by Unstructured: .eml, .html, .md, .msg, .rst, .rtf, .txt, .xml, .png, .jpg, .jpeg, .tiff, .bmp, .heic, .csv, .doc, .docx, .epub, .odt, .pdf, .ppt, .pptx, .tsv, .xlsx.

!mkdir -p "./documents"
!wget https://www.gov.nl.ca/ecc/files/env-protection-pesticides-business-manuals-applic-chapter7.pdf -O "./documents/env-protection-pesticides-business-manuals-applic-chapter7.pdf"
!wget https://ipm.ifas.ufl.edu/pdfs/Citrus_IPM_090913.pptx -O "./documents/Citrus_IPM_090913.pptx"
!wget https://www.gutenberg.org/ebooks/45957.epub3.images -O "./documents/45957.epub"
!wget https://blog.fifthroom.com/what-to-do-about-harmful-garden-and-plant-insects-and-pests.html -O "./documents/what-to-do-about-harmful-garden-and-plant-insects-and-pests.html"

Unstructured data preprocessing

You can use the Unstructured library to preprocess documents one by one and write your own script to walk through a directory, but it’s easier to use a Local source connector to ingest all documents in a given directory. Unstructured can ingest documents from local directories, S3 buckets, blob storage, SFTP, and many other places where your documents might be stored. Ingestion from these sources is very similar, differing mostly in authentication options. Here you’ll use the Local source connector, but feel free to explore other options in the Unstructured documentation.

Optionally, you can also choose a destination for the processed documents - this could be MongoDB, Pinecone, Weaviate, etc. In this notebook, we’ll keep everything local.

# Optional cell to reduce the amount of logs

import logging

logger = logging.getLogger("unstructured.ingest")
logger.root.removeHandler(logger.root.handlers[0])
>>> import os

>>> from unstructured.ingest.connector.local import SimpleLocalConfig
>>> from unstructured.ingest.interfaces import PartitionConfig, ProcessorConfig, ReadConfig
>>> from unstructured.ingest.runner import LocalRunner

>>> output_path = "./local-ingest-output"

>>> runner = LocalRunner(
...     processor_config=ProcessorConfig(
...         # logs verbosity
...         verbose=True,
...         # the local directory to store outputs
...         output_dir=output_path,
...         num_processes=2,
...     ),
...     read_config=ReadConfig(),
...     partition_config=PartitionConfig(
...         partition_by_api=True,
...         api_key="YOUR_UNSTRUCTURED_API_KEY",
...     ),
...     connector_config=SimpleLocalConfig(
...         input_path="./documents",
...         # whether to get the documents recursively from given directory
...         recursive=False,
...     ),
... )
>>> runner.run()
INFO: NumExpr defaulting to 2 threads.

Let’s take a closer look at the configs that we have here.

ProcessorConfig controls various aspects of the processing pipeline, including output locations, number of workers, error handling behavior, logging verbosity and more. The only mandatory parameter here is the output_dir - the local directory where you want to store the outputs.

ReadConfig can be used to customize the data reading process for different scenarios, such as re-downloading data, preserving downloaded files, or limiting the number of documents processed. In most cases the default ReadConfig will work.

In the PartitionConfig you can choose whether to partition the documents locally or via the API. This example uses the API, and for this reason requires an Unstructured API key. You can get yours here. The free Unstructured API is capped at 1000 pages and offers better OCR models for image-based documents than a local installation of Unstructured. If you remove these two parameters, the documents will be processed locally, but you may need to install additional dependencies if the documents require OCR and/or document understanding models. Namely, you may need to install poppler and tesseract in this case, which you can get with brew:

!brew install poppler
!brew install tesseract

If you’re on Windows, you can find alternative installation instructions in the Unstructured docs.
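
For reference, here’s what a local-partitioning variant of the runner could look like. This is just a minimal sketch: it mirrors the configuration from the earlier cell, with the two API parameters removed from the PartitionConfig so that everything runs on your machine.

# Minimal sketch: the same LocalRunner as above, but partitioning locally.
# Omitting partition_by_api and api_key makes Unstructured process the documents
# on your machine (which may require poppler/tesseract, as noted above).
local_runner = LocalRunner(
    processor_config=ProcessorConfig(
        verbose=True,
        output_dir=output_path,
        num_processes=2,
    ),
    read_config=ReadConfig(),
    partition_config=PartitionConfig(),
    connector_config=SimpleLocalConfig(
        input_path="./documents",
        recursive=False,
    ),
)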

Finally, in the SimpleLocalConfig you need to specify where your original documents reside, and whether you want to walk through the directory recursively.

Once the documents are processed, you’ll find four JSON files in the local-ingest-output directory, one per processed document. Unstructured partitions all types of documents in a uniform manner and returns JSON with document elements.

Document elements have a type, e.g. NarrativeText, Title, or Table; they contain the extracted text and the metadata that Unstructured was able to obtain. Some metadata is common to all elements, such as the filename of the document the element comes from. Other metadata depends on the file type or element type. For example, a Table element will contain the table’s representation as HTML in its metadata, and the metadata for emails will contain information about senders and recipients.

Let’s import element objects from these json files.

from unstructured.staging.base import elements_from_json

elements = []

for filename in os.listdir(output_path):
    filepath = os.path.join(output_path, filename)
    elements.extend(elements_from_json(filepath))
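
To get a feel for what partitioning produced, you can peek at a few elements, their types, and where they came from (a quick sketch; the exact element categories and metadata fields depend on your documents):

# Print the type, a text snippet, and the source filename for a few elements
for element in elements[:5]:
    print(element.category, "|", element.text[:80], "|", element.metadata.filename)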

Now that you have extracted the elements from the documents, you can chunk them to fit the context window of the embeddings model.

Chunking

If you are familiar with chunking methods that split long text documents into smaller chunks, you’ll notice that Unstructured’s chunking methods slightly differ, since the partitioning step already divides an entire document into its structural elements: titles, list items, tables, text, etc. By partitioning documents this way, you can avoid a situation where unrelated pieces of text end up in the same element, and then in the same chunk.

Now, when you chunk the document elements with Unstructured, the individual elements are already small, so they will only be split if they exceed the desired maximum chunk size. Otherwise, they will remain as they are. You can also optionally choose to combine consecutive small elements, such as individual list items, that will together fit within the chunk size limit.

from unstructured.chunking.title import chunk_by_title

chunked_elements = chunk_by_title(
    elements,
    # maximum for chunk size
    max_characters=512,
    # You can choose to combine consecutive elements that are too small
    # e.g. individual list items
    combine_text_under_n_chars=200,
)
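
As a quick sanity check, you can look at how many chunks you ended up with and confirm that none of them exceed the limit (the exact numbers will vary with your documents):

# Number of chunks and the length of the longest one
print(len(chunked_elements))
print(max(len(chunk.text) for chunk in chunked_elements))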

The chunks are ready for RAG. To use them with LangChain, you can easily convert Unstructured elements to LangChain documents.

from langchain_core.documents import Document

documents = []
for chunked_element in chunked_elements:
    metadata = chunked_element.metadata.to_dict()
    metadata["source"] = metadata["filename"]
    del metadata["languages"]
    documents.append(Document(page_content=chunked_element.text, metadata=metadata))

Setting up the retriever

This example uses ChromaDB as a vector store and the BAAI/bge-base-en-v1.5 embeddings model; feel free to use any other vector store.

from langchain_community.vectorstores import Chroma
from langchain.embeddings import HuggingFaceEmbeddings

from langchain.vectorstores import utils as chromautils

# ChromaDB doesn't support complex metadata, e.g. lists, so we drop it here.
# If you're using a different vector store, you may not need to do this
docs = chromautils.filter_complex_metadata(documents)

embeddings = HuggingFaceEmbeddings(model_name="BAAI/bge-base-en-v1.5")
vectorstore = Chroma.from_documents(docs, embeddings)
retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 3})
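
Before wiring up the full chain, you can quickly check that the retriever returns relevant chunks for a test query (a minimal sketch; the query and results here are only illustrative):

# Retrieve the top-3 chunks for a test query and show which documents they came from
for doc in retriever.get_relevant_documents("How do I control aphids?"):
    print(doc.metadata["source"], "|", doc.page_content[:80])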

If you plan to use a gated model from the Hugging Face Hub, be it an embeddings or text generation model, you’ll need to authenticate yourself with your Hugging Face token, which you can get in your Hugging Face profile’s settings.

from huggingface_hub import notebook_login

notebook_login()

RAG with LangChain

Let’s bring everything together and build RAG with LangChain. In this example we’ll be using Llama-3-8B-Instruct from Meta. To make sure it can run smoothly in the free T4 runtime from Google Colab, you’ll need to quantize it.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, pipeline

from langchain.chains import RetrievalQA
from langchain.llms import HuggingFacePipeline
from langchain.prompts import PromptTemplate

model_name = "meta-llama/Meta-Llama-3-8B-Instruct"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True, bnb_4bit_use_double_quant=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.bfloat16
)

model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb_config)
tokenizer = AutoTokenizer.from_pretrained(model_name)

terminators = [tokenizer.eos_token_id, tokenizer.convert_tokens_to_ids("<|eot_id|>")]

text_generation_pipeline = pipeline(
    model=model,
    tokenizer=tokenizer,
    task="text-generation",
    temperature=0.2,
    do_sample=True,
    repetition_penalty=1.1,
    return_full_text=False,
    max_new_tokens=200,
    eos_token_id=terminators,
)

llm = HuggingFacePipeline(pipeline=text_generation_pipeline)

prompt_template = """
<|start_header_id|>user<|end_header_id|>
You are an assistant for answering questions using provided context.
You are given the extracted parts of a long document and a question. Provide a conversational answer.
If you don't know the answer, just say "I do not know." Don't make up an answer.
Question: {question}
Context: {context}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
"""

prompt = PromptTemplate(
    input_variables=["context", "question"],
    template=prompt_template,
)


qa_chain = RetrievalQA.from_chain_type(llm, retriever=retriever, chain_type_kwargs={"prompt": prompt})

Results and next steps

Now that you have your RAG chain, let’s ask it about aphids. Are they a pest in my garden?

question = "Are aphids a pest?"

qa_chain.invoke(question)["result"]

Output:

Yes, aphids are considered pests because they feed on the nutrient-rich liquids within plants, causing damage and potentially spreading disease. In fact, they're known to multiply quickly, which is why it's essential to control them promptly. As mentioned in the text, aphids can also attract ants, which are attracted to the sweet, sticky substance they produce called honeydew. So, yes, aphids are indeed a pest that requires attention to prevent further harm to your plants!

This looks like a promising start! Now that you know the basics of preprocessing complex unstructured data for RAG, you can continue improving upon this example. Here are some ideas:

  • You can connect to a different source to ingest the documents from, for example, an S3 bucket.
  • You can add return_source_documents=True in the qa_chain arguments to make the chain return the documents that were passed to the prompt as context. This can be useful to understand what sources were used to generate the answer (see the sketch after this list).
  • If you want to leverage the elements metadata at the retrieval stage, consider using Hugging Face agents and creating a custom retriever tool as described in this other notebook.
  • There are many things you could do to improve search results. For instance, you could use Hybrid search instead of a single similarity-search retriever. Hybrid search combines multiple search algorithms to improve the accuracy and relevance of search results. Typically it’s a combination of keyword-based search algorithms with vector search methods.
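
For example, here is a minimal sketch combining two of these ideas: a hybrid retriever built from LangChain’s BM25Retriever and EnsembleRetriever (the keyword part may require pip install rank_bm25), and a chain that also returns its source documents. The weights and the query are only illustrative:

from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever

# Keyword-based retriever over the same chunks used for the vector store
bm25_retriever = BM25Retriever.from_documents(docs)
bm25_retriever.k = 3

# Combine keyword search with the similarity-search retriever from earlier
hybrid_retriever = EnsembleRetriever(retrievers=[bm25_retriever, retriever], weights=[0.5, 0.5])

qa_chain_with_sources = RetrievalQA.from_chain_type(
    llm,
    retriever=hybrid_retriever,
    chain_type_kwargs={"prompt": prompt},
    return_source_documents=True,
)

result = qa_chain_with_sources.invoke("Are aphids a pest?")
print(result["result"])
print([doc.metadata["source"] for doc in result["source_documents"]])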

Have fun building RAG applications with Unstructured data!
