{ "cells": [ { "cell_type": "markdown", "id": "883a8a6a-d0b5-40ea-90a0-5b33d3332360", "metadata": {}, "source": [ "# Get Data\n", "The Wikipedia dump is distributed as XML. This notebook shows a relatively simple way to convert it into a single JSON file for our purposes." ] }, { "cell_type": "markdown", "id": "a7d66da5-185c-409e-9568-f211ca4b725e", "metadata": {}, "source": [ "## Initialize Variables" ] }, { "cell_type": "code", "execution_count": 1, "id": "ea8ae64c-f597-4c94-b93d-1b78060d7953", "metadata": { "tags": [] }, "outputs": [], "source": [ "from pathlib import Path\n", "import sys" ] }, { "cell_type": "code", "execution_count": 16, "id": "2f9527f9-4756-478b-99ac-a3c8c26ab63e", "metadata": { "tags": [] }, "outputs": [], "source": [ "proj_dir_path = Path.cwd().parent\n", "proj_dir = str(proj_dir_path)\n", "\n", "# So we can import later\n", "sys.path.append(proj_dir)" ] }, { "cell_type": "markdown", "id": "860da614-743b-4060-9d22-673896414cbd", "metadata": {}, "source": [ "## Install Libraries" ] }, { "cell_type": "code", "execution_count": 3, "id": "8bec29e3-8434-491f-914c-13f303dc68f3", "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Note: you may need to restart the kernel to use updated packages.\n" ] } ], "source": [ "%pip install -q -r \"$proj_dir\"/requirements.txt" ] }, { "cell_type": "markdown", "id": "b928c71f-7e34-47ee-b55e-aa12d5118ba7", "metadata": {}, "source": [ "## Download Latest Simple Wikipedia" ] }, { "cell_type": "markdown", "id": "f1dc5f57-c877-43e3-8131-4f351b99168d", "metadata": {}, "source": [ "I'm getting \"latest\", but it's good to see which version it is nonetheless."
] }, { "cell_type": "code", "execution_count": 4, "id": "fe4b357f-88fe-44b5-9fce-354404b1447f", "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Last-Modified: Sun, 01 Oct 2023 23:32:27 GMT\n" ] } ], "source": [ "!curl -I https://dumps.wikimedia.org/simplewiki/latest/simplewiki-latest-pages-articles-multistream.xml.bz2 --silent | grep \"Last-Modified\"" ] }, { "cell_type": "markdown", "id": "fe62d4a3-b59b-40c4-9a8c-bf0a447a9ec2", "metadata": {}, "source": [ "Download Simple Wikipedia" ] }, { "cell_type": "code", "execution_count": 5, "id": "0f309c12-12de-4460-a03f-bd5b6fcc942c", "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "--2023-10-18 10:55:38-- https://dumps.wikimedia.org/simplewiki/latest/simplewiki-latest-pages-articles-multistream.xml.bz2\n", "Resolving dumps.wikimedia.org (dumps.wikimedia.org)... 208.80.154.142, 2620:0:861:2:208:80:154:142\n", "Connecting to dumps.wikimedia.org (dumps.wikimedia.org)|208.80.154.142|:443... connected.\n", "HTTP request sent, awaiting response... 200 OK\n", "Length: 286759308 (273M) [application/octet-stream]\n", "Saving to: ‘/home/ec2-user/RAGDemo/data/raw/simplewiki-latest-pages-articles-multistream.xml.bz2’\n", "\n", "100%[======================================>] 286,759,308 4.22MB/s in 66s \n", "\n", "2023-10-18 10:56:45 (4.13 MB/s) - ‘/home/ec2-user/RAGDemo/data/raw/simplewiki-latest-pages-articles-multistream.xml.bz2’ saved [286759308/286759308]\n", "\n" ] } ], "source": [ "!wget -nc -P \"$proj_dir\"/data/raw https://dumps.wikimedia.org/simplewiki/latest/simplewiki-latest-pages-articles-multistream.xml.bz2" ] }, { "cell_type": "markdown", "id": "46af5df6-5785-400a-986c-54a2c98768ea", "metadata": {}, "source": [ "## Extract from XML\n", "The download from Wikipedia is in XML. `wikiextractor` will convert this into JSON Lines (jsonl) format, split across many folders and files."
] }, { "cell_type": "code", "execution_count": 9, "id": "c22dedcd-73b3-4aad-8eb7-1063954967ed", "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "INFO: Preprocessing '/home/ec2-user/RAGDemo/data/raw/simplewiki-latest-pages-articles-multistream.xml.bz2' to collect template definitions: this may take some time.\n", "INFO: Preprocessed 100000 pages\n", "INFO: Preprocessed 200000 pages\n", "INFO: Preprocessed 300000 pages\n", "INFO: Preprocessed 400000 pages\n", "INFO: Loaded 36594 templates in 54.1s\n", "INFO: Starting page extraction from /home/ec2-user/RAGDemo/data/raw/simplewiki-latest-pages-articles-multistream.xml.bz2.\n", "INFO: Using 3 extract processes.\n", "INFO: Extracted 100000 articles (3481.4 art/s)\n", "INFO: Extracted 200000 articles (3764.9 art/s)\n", "INFO: Extracted 300000 articles (4175.8 art/s)\n", "INFO: Finished 3-process extraction of 332024 articles in 86.9s (3822.7 art/s)\n" ] } ], "source": [ "!wikiextractor -o \"$proj_dir\"/data/raw/output --json \"$proj_dir\"/data/raw/simplewiki-latest-pages-articles-multistream.xml.bz2 " ] }, { "cell_type": "markdown", "id": "bb8063c6-1bed-49f0-948a-eeb9a7933b4a", "metadata": {}, "source": [ "## Consolidate into JSON\n", "\n", "The split format is tedious to deal with, so now we will consolidate it into one JSON file. This is fine since our data fits easily in RAM; if it didn't, there would be better options.\n", "\n", "Feel free to check out the [consolidate file](../src/preprocessing/consolidate.py) for more details."
] }, { "cell_type": "code", "execution_count": 14, "id": "0a4ce3aa-9c1e-45e4-8219-a1714f482371", "metadata": { "tags": [] }, "outputs": [], "source": [ "from src.preprocessing.consolidate import folder_to_json" ] }, { "cell_type": "code", "execution_count": 17, "id": "3e93da6a-e304-450c-a81e-ffecaf0d8a9a", "metadata": { "tags": [] }, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "3f045c61ef544f34a1d6f7c4236b206c", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Processing: 0%| | 0/206 [00:00