# Imports

In [1]:
import json
from pathlib import Path
import pickle
from tqdm.auto import tqdm

from haystack.nodes.preprocessor import PreProcessor

In [2]:
proj_dir = Path.cwd().parent
print(proj_dir)

/home/ec2-user/RAGDemo


# Config

In [13]:
file_in = proj_dir / 'data/consolidated/simple_wiki.json'
file_out = proj_dir / 'data/processed/simple_wiki_processed.pkl'

# Preprocessing

It's important to choose good pre-processing options.

Cleaning whitespace helps each stage of RAG. Extra whitespace adds noise to the embeddings and wastes space in the prompt.
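
To make the whitespace point concrete, here is a minimal sketch of the kind of cleanup meant here. It is not `haystack`'s internal implementation; the helper name `clean_whitespace_sketch` is hypothetical and only illustrates stripping lines, collapsing repeated spaces, and dropping empty lines.

```python
import re


def clean_whitespace_sketch(text: str) -> str:
    """Hypothetical helper: strip each line, collapse runs of spaces/tabs, drop empty lines."""
    # Collapse repeated spaces and tabs inside each line, then trim the ends.
    lines = (re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines())
    # Keep only lines that still contain text.
    return "\n".join(line for line in lines if line)


raw = "  Paris   is the capital \n\n\n   of France.  "
cleaned = clean_whitespace_sketch(raw)
print(repr(cleaned))
```

The cleaned string is shorter than the raw one, so fewer characters reach both the embedding model and the prompt.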

I chose to split by word because tokenizing here would be tedious and doesn't scale well. Many embedding models have a context length of 512 tokens, which is roughly 400 words.

I like to respect sentence boundaries, so I left a ~50 word buffer below that limit (a `split_length` of 350).
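
The word budget and the overlap can be sketched with a plain sliding window. This is a simplified illustration, not what `PreProcessor` does internally: it ignores sentence boundaries, and the function name `split_by_words` is hypothetical. The 400-word figure is the rough estimate from the text above.

```python
APPROX_WORD_BUDGET = 400  # rough word equivalent of a 512-token context (estimate)
SPLIT_LENGTH = 350        # words per chunk
SPLIT_OVERLAP = 50        # words shared between consecutive chunks

buffer_words = APPROX_WORD_BUDGET - SPLIT_LENGTH  # headroom left under the budget


def split_by_words(text, split_length=SPLIT_LENGTH, split_overlap=SPLIT_OVERLAP):
    """Hypothetical helper: fixed-size word windows with overlap (no sentence handling)."""
    if split_overlap >= split_length:
        raise ValueError("split_overlap must be smaller than split_length")
    words = text.split()
    stride = split_length - split_overlap
    chunks = []
    for start in range(0, len(words), stride):
        chunks.append(" ".join(words[start:start + split_length]))
        # Stop once this window has reached the end of the text.
        if start + split_length >= len(words):
            break
    return chunks


text = " ".join(f"w{i}" for i in range(800))
chunks = split_by_words(text)
print(buffer_words, [len(c.split()) for c in chunks])
```

Each chunk starts 300 words after the previous one, so the last 50 words of one chunk are repeated at the start of the next.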

In [4]:
pp = PreProcessor(
    clean_whitespace=True,
    clean_header_footer=False,
    clean_empty_lines=True,
    remove_substrings=None,
    split_by='word',
    split_length=350,   # stay under the ~400-word budget
    split_overlap=50,   # preserve some local context between chunks
    split_respect_sentence_boundary=True,
    tokenizer_model_folder=None,
    language="en",
    id_hash_keys=None,
    progress_bar=True,
    add_page_number=False,
    max_chars_check=10_000,
)

[nltk_data] Downloading package punkt to /home/ec2-user/nltk_data...
[nltk_data] Unzipping tokenizers/punkt.zip.


In [5]:
with open(file_in, 'r', encoding='utf-8') as f:
 list_of_articles = json.load(f)

In [6]:
documents = pp.process(list_of_articles)

Preprocessing:   0%|▌ | 1551/332023 [00:02<09:44, 565.82docs/s]
We found one or more sentences whose word count is higher than the split length.
Preprocessing:  83%|████████▌ | 276427/332023 [02:12<00:20, 2652.57docs/s]
Document 81972e5bc1997b1ed4fb86d17f061a41 is 21206 characters long after preprocessing, where the maximum length should be 10000. Something might be wrong with the splitting, check the document affected to prevent issues at query time. This document will be now hard-split at 10000 chars recursively.
Document 5e63e848e42966ddc747257fb7cf4092 is 11206 characters long after preprocessing, where the maximum length should be 10000. Something might be wrong with the splitting, check the document affected to prevent issues at query time. This document will be now hard-split at 10000 chars recursively.
Preprocessing: 100%|██████████

When we break a Wikipedia article up, we lose some of the context. The local context is somewhat preserved by the `split_overlap`. I'm trying to preserve the global context by adding a prefix that contains the article's title.

You could enhance this with the summary as well. This is mostly to help the retrieval step of RAG. Note that the way I'm doing it alters some of `haystack`'s features, like the hash and the lengths, but those aren't essential here.

A more advanced way for many business applications would be to summarize the document and add that as a prefix for sub-documents.

One last thing to note: in some use-cases it would be prudent to preserve the original document without the summary to give to the reader (retrieve with the summary but prompt without). Since this is a simple use-case, I won't be doing that.
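
The title-prefix idea can be sketched as follows. This is an illustration under stated assumptions, not the exact code used in this notebook: `Doc` is a hypothetical stand-in for a `haystack` document (only `content` and `meta`), the `title` meta key and the `original_content` key are assumed names, and the prefix wording is arbitrary.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a haystack Document (content + meta only)."""
    content: str
    meta: dict = field(default_factory=dict)


def add_title_prefix(docs, title_key="title", keep_original=True):
    """Prepend the article title to each chunk so retrieval sees the global context."""
    for doc in docs:
        title = doc.meta.get(title_key)
        if not title:
            continue  # nothing to prepend for this chunk
        if keep_original:
            # Keep the unprefixed text so the reader could be prompted without the prefix.
            doc.meta["original_content"] = doc.content
        doc.content = f"Title: {title}. Content: {doc.content}"
    return docs


docs = add_title_prefix([
    Doc("April is the fourth month of the year.", {"title": "April", "_split_id": 0}),
    Doc("It has 30 days.", {"title": "April", "_split_id": 1}),
])
print(docs[1].content)
```

Storing the original text in `meta` is one way to support the "retrieve with the prefix but prompt without" variant described above, at the cost of roughly doubling the stored text.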

In [7]:
documents[0]

 0%| | 0/268980 [00:00

In [9]:
documents[1]



In [10]:
documents[10102]



In [11]:
print(f'Number of Articles: {len(list_of_articles)}')
processed_articles = len([d for d in documents if d.meta['_split_id'] == 0])
print(f'Number of processed articles: {processed_articles}')
print(f'Number of processed documents: {len(documents)}')

Number of Articles: 332023
Number of processed articles: 237724
Number of processed documents: 268980
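
A quick bit of arithmetic on the counts above helps when reading them. The numbers are copied from the output; the interpretation in the comments is an assumption, since nothing here says why some articles produced no first chunk (empty or whitespace-only articles would be one possible explanation).

```python
n_articles = 332_023      # input articles
n_first_chunks = 237_724  # documents with _split_id == 0
n_documents = 268_980     # all documents after splitting

# Articles for which no first chunk exists in the output.
articles_without_chunks = n_articles - n_first_chunks

# Documents that are a second or later chunk of some article.
extra_chunks = n_documents - n_first_chunks

# Average number of chunks per article that produced at least one chunk.
avg_chunks = n_documents / n_first_chunks

print(articles_without_chunks, extra_chunks, f"{avg_chunks:.2f}")
```

So most Simple Wikipedia articles that survive preprocessing fit in a single chunk, with about 1.13 chunks per article on average.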


# Write to file

In [14]:
with open(file_out, 'wb') as handle:
 pickle.dump(documents, handle, protocol=pickle.HIGHEST_PROTOCOL)
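
For completeness, a downstream step would read the file back with `pickle.load`. The sketch below round-trips a small placeholder list through a temporary file rather than the real output path; with the real file, `haystack` would need to be importable so the document objects can be reconstructed, and pickle files should only be loaded from trusted sources.

```python
import pickle
import tempfile
from pathlib import Path

# Placeholder records standing in for the processed documents.
records = [{"content": "April is the fourth month.", "meta": {"_split_id": 0}}]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "docs.pkl"
    # Write with the same protocol setting as above.
    with open(path, "wb") as handle:
        pickle.dump(records, handle, protocol=pickle.HIGHEST_PROTOCOL)
    # Read the objects back.
    with open(path, "rb") as handle:
        restored = pickle.load(handle)

print(restored == records)
```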