{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "b1b28232-b65d-41ce-88de-fd70b93a528d",
   "metadata": {},
   "source": [
    "# Imports"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "abb5186b-ee67-4e1e-882d-3d8d5b4575d4",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "import json\n",
    "from pathlib import Path\n",
    "import pickle\n",
    "from tqdm.auto import tqdm\n",
    "\n",
    "from haystack.nodes.preprocessor import PreProcessor"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "c4b82ea2-8b30-4c2e-99f0-9a30f2f1bfb7",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "/home/ec2-user/RAGDemo\n"
     ]
    }
   ],
   "source": [
    "proj_dir = Path.cwd().parent\n",
    "print(proj_dir)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "76119e74-f601-436d-a253-63c5a19d1c83",
   "metadata": {},
   "source": [
    "# Config"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "f6f74545-54a7-4f41-9f02-96964e1417f0",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "file_in = proj_dir / 'data/consolidated/simple_wiki.json'\n",
    "file_out = proj_dir / 'data/processed/simple_wiki_processed.pkl'"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6a643cf2-abce-48a9-b4e0-478bcbee28c3",
   "metadata": {},
   "source": [
    "# Preprocessing"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a8f9630e-447e-423e-9f6c-e1dbc654f2dd",
   "metadata": {},
   "source": [
    "Its important to choose good pre-processing options. \n",
    "\n",
    "Clean whitespace helps each stage of RAG. It adds noise to the embeddings, and wastes space when we prompt with it.\n",
    "\n",
    "I chose to split by word as it would be tedious to tokenize here, and that doesnt scale well. The context length for most embedding models ends up being 512 tokens. This is ~400 words. \n",
    "\n",
    "I like to respect the sentence boundary, thats why I gave a ~50 word buffer."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "18807aea-24e4-4d74-bf10-55b24f3cb52c",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[nltk_data] Downloading package punkt to /home/ec2-user/nltk_data...\n",
      "[nltk_data]   Unzipping tokenizers/punkt.zip.\n"
     ]
    }
   ],
   "source": [
    "pp = PreProcessor(clean_whitespace = True,\n",
    "             clean_header_footer = False,\n",
    "             clean_empty_lines = True,\n",
    "             remove_substrings = None,\n",
    "             split_by='word',\n",
    "             split_length = 350,\n",
    "             split_overlap = 50,\n",
    "             split_respect_sentence_boundary = True,\n",
    "             tokenizer_model_folder = None,\n",
    "             language = \"en\",\n",
    "             id_hash_keys = None,\n",
    "             progress_bar = True,\n",
    "             add_page_number = False,\n",
    "             max_chars_check = 10_000)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "dab1658a-79a7-40f2-9a8c-1798e0d124bf",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "with open(file_in, 'r', encoding='utf-8') as f:\n",
    "    list_of_articles = json.load(f)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "4ca6e576-4b7d-4c1a-916f-41d1b82be647",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Preprocessing:   0%|▌                                                                                                                      | 1551/332023 [00:02<09:44, 565.82docs/s]We found one or more sentences whose word count is higher than the split length.\n",
      "Preprocessing:  83%|████████████████████████████████████████████████████████████████████████████████████████████████▌                   | 276427/332023 [02:12<00:20, 2652.57docs/s]Document 81972e5bc1997b1ed4fb86d17f061a41 is 21206 characters long after preprocessing, where the maximum length should be 10000. Something might be wrong with the splitting, check the document affected to prevent issues at query time. This document will be now hard-split at 10000 chars recursively.\n",
      "Document 5e63e848e42966ddc747257fb7cf4092 is 11206 characters long after preprocessing, where the maximum length should be 10000. Something might be wrong with the splitting, check the document affected to prevent issues at query time. This document will be now hard-split at 10000 chars recursively.\n",
      "Preprocessing: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 332023/332023 [02:29<00:00, 2219.16docs/s]\n"
     ]
    }
   ],
   "source": [
    "documents = pp.process(list_of_articles)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f00dbdb2-906f-4d5a-a3f1-b0d84385d85a",
   "metadata": {},
   "source": [
    "When we break a wikipedia article up, we lose some of the context. The local context is somewhat preserved by the `split_overlap`. Im trying to preserve the global context by adding a prefix that has the article's title.\n",
    "\n",
    "You could enhance this with the summary as well. This is mostly to help the retrieval step of RAG. Note that the way Im doing it alters some of `haystack`'s features like the hash and the lengths, but those arent too necessary. \n",
    "\n",
    "A more advanced way for many business applications would be to summarize the document and add that as a prefix for sub-documents.\n",
    "\n",
    "One last thing to note, is that it would be prudent (in some use-cases) to preserve the original document without the summary to give to the reader (retrieve with the summary but prompt without), but since this is a simple use-case I wont be doing that."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "076e115d-3e88-49d2-bc5d-f725a94e4964",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "ba764e7bf29f4202a74e08576a29f4e4",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "  0%|          | 0/268980 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# Prefix each document's content\n",
    "for document in tqdm(documents):\n",
    "    if document.meta['_split_id'] != 0:\n",
    "        document.content = f'Title: {document.meta[\"title\"]}. ' + document.content"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "72c1849c-1f4d-411f-b74b-6208b1e48217",
   "metadata": {},
   "source": [
    "## Pre-processing Examples"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "02c1c6c8-6283-49a8-9d29-c355f1b08540",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<Document: {'content': \"April (Apr.) is the fourth month of the year in the Julian and Gregorian calendars, and comes between March and May. It is one of the four months to have 30 days.\\nApril always begins on the same day of the week as July, and additionally, January in leap years. April always ends on the same day of the week as December.\\nThe Month.\\nApril comes between March and May, making it the fourth month of the year. It also comes first in the year out of the four months that have 30 days, as June, September and November are later in the year.\\nApril begins on the same day of the week as July every year and on the same day of the week as January in leap years. April ends on the same day of the week as December every year, as each other's last days are exactly 35 weeks (245 days) apart.\\nIn common years, April starts on the same day of the week as October of the previous year, and in leap years, May of the previous year. In common years, April finishes on the same day of the week as July of the previous year, and in leap years, February and October of the previous year. In common years immediately after other common years, April starts on the same day of the week as January of the previous year, and in leap years and years immediately after that, April finishes on the same day of the week as January of the previous year.\\nIn years immediately before common years, April starts on the same day of the week as September and December of the following year, and in years immediately before leap years, June of the following year. In years immediately before common years, April finishes on the same day of the week as September of the following year, and in years immediately before leap years, March and June of the following year.\\nApril is a spring month in the Northern Hemisphere and an autumn/fall month in the Southern Hemisphere. \", 'content_type': 'text', 'score': None, 'meta': {'id': '1', 'revid': '9086769', 'url': 'https://simple.wikipedia.org/wiki?curid=1', 'title': 'April', '_split_id': 0, '_split_overlap': [{'doc_id': '79a74c1e6444dd0a1acd72840e9dd7c0', 'range': (1529, 1835)}]}, 'id_hash_keys': ['content'], 'embedding': None, 'id': 'a1c2acf337dbc3baa6f7f58403dfb95d'}>"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "documents[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "b34890bf-9dba-459a-9b0d-aa4b5929cbe8",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<Document: {'content': 'Title: April. In years immediately before common years, April finishes on the same day of the week as September of the following year, and in years immediately before leap years, March and June of the following year.\\nApril is a spring month in the Northern Hemisphere and an autumn/fall month in the Southern Hemisphere. In each hemisphere, it is the seasonal equivalent of October in the other.\\nIt is unclear as to where April got its name. A common theory is that it comes from the Latin word \"aperire\", meaning \"to open\", referring to flowers opening in spring. Another theory is that the name could come from Aphrodite, the Greek goddess of love. It was originally the second month in the old Roman Calendar, before the start of the new year was put to January 1.\\nQuite a few festivals are held in this month. In many Southeast Asian cultures, new year is celebrated in this month (including Songkran). In Western Christianity, Easter can be celebrated on a Sunday between March 22 and April 25. In Orthodox Christianity, it can fall between April 4 and May 8. At the end of the month, Central and Northern European cultures celebrate Walpurgis Night on April 30, marking the transition from winter into summer.\\nApril in poetry.\\nPoets use \"April\" to mean the end of winter. For example: \"April showers bring May flowers.\"', 'content_type': 'text', 'score': None, 'meta': {'id': '1', 'revid': '9086769', 'url': 'https://simple.wikipedia.org/wiki?curid=1', 'title': 'April', '_split_id': 1, '_split_overlap': [{'doc_id': 'a1c2acf337dbc3baa6f7f58403dfb95d', 'range': (0, 306)}]}, 'id_hash_keys': ['content'], 'embedding': None, 'id': '79a74c1e6444dd0a1acd72840e9dd7c0'}>"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "documents[1]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "e6f50c27-a486-47e9-ba60-d567f5e530db",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<Document: {'content': 'Title: Chief Joseph. He knew he could not trust them anymore. He was tired of being considered a savage. He felt it was not fair for people who were born on the same land to be treated differently. He delivered a lot of speeches on this subject, which are still really good examples of eloquence. But he did not feel listened to, and when he died in his reservation in 1904, the doctor said he \"died from sadness\". He was buried in Colville Native American Burial Ground, in Washington State.', 'content_type': 'text', 'score': None, 'meta': {'id': '19310', 'revid': '16695', 'url': 'https://simple.wikipedia.org/wiki?curid=19310', 'title': 'Chief Joseph', '_split_id': 1, '_split_overlap': [{'doc_id': '4bdf9cecd46c3bfac6b225aed940e798', 'range': (0, 275)}]}, 'id_hash_keys': ['content'], 'embedding': None, 'id': '91bc8240c5d067ab24f35c11f8916fc6'}>"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "documents[10102]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "5485cc27-3d3f-4b96-8884-accf5324da2d",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of Articles: 332023\n",
      "Number of processed articles: 237724\n",
      "Number of processed documents: 268980\n"
     ]
    }
   ],
   "source": [
    "print(f'Number of Articles: {len(list_of_articles)}')\n",
    "processed_articles = len([d for d in documents if d.meta['_split_id'] == 0])\n",
    "print(f'Number of processed articles: {processed_articles}')\n",
    "print(f'Number of processed documents: {len(documents)}')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "23ce57a8-d14e-426d-abc2-0ce5cdbc881a",
   "metadata": {},
   "source": [
    "# Write to file"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "0d044870-7a30-4e09-aad2-42f24a52780d",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "with open(file_out, 'wb') as handle:\n",
    "    pickle.dump(documents, handle, protocol=pickle.HIGHEST_PROTOCOL)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c5833dba-1bf6-48aa-be6f-0d70c71e54aa",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}