{ "cells": [ { "cell_type": "markdown", "id": "883a8a6a-d0b5-40ea-90a0-5b33d3332360", "metadata": {}, "source": [ "# Get Data\n", "The data from Wikipedia starts as XML; this notebook is a relatively simple way to convert it into a single JSON file for our purposes." ] }, { "cell_type": "markdown", "id": "a7d66da5-185c-409e-9568-f211ca4b725e", "metadata": {}, "source": [ "## Initialize Variables" ] }, { "cell_type": "code", "execution_count": 1, "id": "ea8ae64c-f597-4c94-b93d-1b78060d7953", "metadata": { "tags": [] }, "outputs": [], "source": [ "from pathlib import Path\n", "import sys" ] }, { "cell_type": "code", "execution_count": 2, "id": "2f9527f9-4756-478b-99ac-a3c8c26ab63e", "metadata": { "tags": [] }, "outputs": [], "source": [ "proj_dir_path = Path.cwd().parent\n", "proj_dir = str(proj_dir_path)\n", "\n", "# Add the project root to sys.path so the `src` package can be imported later\n", "sys.path.append(proj_dir)" ] }, { "cell_type": "markdown", "id": "860da614-743b-4060-9d22-673896414cbd", "metadata": {}, "source": [ "## Install Libraries" ] }, { "cell_type": "code", "execution_count": 3, "id": "8bec29e3-8434-491f-914c-13f303dc68f3", "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Note: you may need to restart the kernel to use updated packages.\n" ] } ], "source": [ "%pip install -q -r \"$proj_dir\"/requirements.txt" ] }, { "cell_type": "markdown", "id": "b928c71f-7e34-47ee-b55e-aa12d5118ba7", "metadata": {}, "source": [ "## Download Latest Arabic Wikipedia" ] }, { "cell_type": "markdown", "id": "f1dc5f57-c877-43e3-8131-4f351b99168d", "metadata": {}, "source": [ "I'm getting \"latest\", but it's good to see what version it is nonetheless."
] }, { "cell_type": "code", "execution_count": 4, "id": "fe4b357f-88fe-44b5-9fce-354404b1447f", "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Last-Modified: Sat, 21 Oct 2023 02:57:42 GMT\n" ] } ], "source": [ "!curl -I https://dumps.wikimedia.org/arwiki/latest/arwiki-latest-pages-articles-multistream.xml.bz2 --silent | grep \"Last-Modified\"" ] }, { "cell_type": "markdown", "id": "fe62d4a3-b59b-40c4-9a8c-bf0a447a9ec2", "metadata": {}, "source": [ "Download the Arabic Wikipedia dump. The `-nc` flag makes `wget` skip the download if the file already exists." ] }, { "cell_type": "code", "execution_count": 5, "id": "0f309c12-12de-4460-a03f-bd5b6fcc942c", "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "--2023-10-28 08:09:45-- https://dumps.wikimedia.org/arwiki/latest/arwiki-latest-pages-articles-multistream.xml.bz2\n", "Resolving dumps.wikimedia.org (dumps.wikimedia.org)... 208.80.154.142, 2620:0:861:2:208:80:154:142\n", "Connecting to dumps.wikimedia.org (dumps.wikimedia.org)|208.80.154.142|:443... connected.\n", "HTTP request sent, awaiting response... 200 OK\n", "Length: 1671369109 (1.6G) [application/octet-stream]\n", "Saving to: ‘/home/ec2-user/arabic-wiki/data/raw/arwiki-latest-pages-articles-multistream.xml.bz2’\n", "\n", "100%[====================================>] 1,671,369,109 4.54MB/s in 5m 54s \n", "\n", "2023-10-28 08:15:39 (4.51 MB/s) - ‘/home/ec2-user/arabic-wiki/data/raw/arwiki-latest-pages-articles-multistream.xml.bz2’ saved [1671369109/1671369109]\n", "\n" ] } ], "source": [ "!wget -nc -P \"$proj_dir\"/data/raw https://dumps.wikimedia.org/arwiki/latest/arwiki-latest-pages-articles-multistream.xml.bz2" ] }, { "cell_type": "markdown", "id": "46af5df6-5785-400a-986c-54a2c98768ea", "metadata": {}, "source": [ "## Extract from XML\n", "The Wikipedia dump is in XML. `wikiextractor` converts it into JSON Lines files (one JSON object per article per line), split across many folders and files."
] }, { "cell_type": "code", "execution_count": 6, "id": "c22dedcd-73b3-4aad-8eb7-1063954967ed", "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "INFO: Preprocessing '/home/ec2-user/arabic-wiki/data/raw/arwiki-latest-pages-articles-multistream.xml.bz2' to collect template definitions: this may take some time.\n", "INFO: Preprocessed 100000 pages\n", "INFO: Preprocessed 200000 pages\n", "INFO: Preprocessed 300000 pages\n", "INFO: Preprocessed 400000 pages\n", "INFO: Preprocessed 500000 pages\n", "INFO: Preprocessed 600000 pages\n", "INFO: Preprocessed 700000 pages\n", "INFO: Preprocessed 800000 pages\n", "INFO: Preprocessed 900000 pages\n", "INFO: Preprocessed 1000000 pages\n", "INFO: Preprocessed 1100000 pages\n", "INFO: Preprocessed 1200000 pages\n", "INFO: Preprocessed 1300000 pages\n", "INFO: Preprocessed 1400000 pages\n", "INFO: Preprocessed 1500000 pages\n", "INFO: Preprocessed 1600000 pages\n", "INFO: Preprocessed 1700000 pages\n", "INFO: Preprocessed 1800000 pages\n", "INFO: Preprocessed 1900000 pages\n", "INFO: Preprocessed 2000000 pages\n", "INFO: Preprocessed 2100000 pages\n", "INFO: Preprocessed 2200000 pages\n", "INFO: Preprocessed 2300000 pages\n", "INFO: Preprocessed 2400000 pages\n", "INFO: Preprocessed 2500000 pages\n", "INFO: Preprocessed 2600000 pages\n", "INFO: Preprocessed 2700000 pages\n", "INFO: Preprocessed 2800000 pages\n", "INFO: Preprocessed 2900000 pages\n", "INFO: Preprocessed 3000000 pages\n", "INFO: Preprocessed 3100000 pages\n", "INFO: Preprocessed 3200000 pages\n", "INFO: Preprocessed 3300000 pages\n", "INFO: Preprocessed 3400000 pages\n", "INFO: Preprocessed 3500000 pages\n", "INFO: Loaded 130917 templates in 407.5s\n", "INFO: Starting page extraction from /home/ec2-user/arabic-wiki/data/raw/arwiki-latest-pages-articles-multistream.xml.bz2.\n", "INFO: Using 3 extract processes.\n", "INFO: Extracted 100000 articles (1963.4 art/s)\n", "INFO: Extracted 200000 articles (3371.1 art/s)\n", 
"INFO: Extracted 300000 articles (3008.9 art/s)\n", "INFO: Extracted 400000 articles (2935.2 art/s)\n", "INFO: Extracted 500000 articles (3563.3 art/s)\n", "INFO: Extracted 600000 articles (4209.0 art/s)\n", "INFO: Extracted 700000 articles (6075.1 art/s)\n", "INFO: Extracted 800000 articles (3531.2 art/s)\n", "INFO: Extracted 900000 articles (3466.4 art/s)\n", "INFO: Extracted 1000000 articles (3789.2 art/s)\n", "INFO: Extracted 1100000 articles (2282.6 art/s)\n", "INFO: Extracted 1200000 articles (4499.8 art/s)\n", "INFO: Extracted 1300000 articles (5143.6 art/s)\n", "INFO: Extracted 1400000 articles (5474.2 art/s)\n", "INFO: Extracted 1500000 articles (6086.9 art/s)\n", "INFO: Extracted 1600000 articles (5453.4 art/s)\n", "INFO: Extracted 1700000 articles (5911.6 art/s)\n", "INFO: Extracted 1800000 articles (5087.4 art/s)\n", "INFO: Extracted 1900000 articles (3782.4 art/s)\n", "INFO: Extracted 2000000 articles (2493.9 art/s)\n", "INFO: Extracted 2100000 articles (2742.2 art/s)\n", "INFO: Extracted 2200000 articles (2416.5 art/s)\n", "INFO: Finished 3-process extraction of 2254650 articles in 641.2s (3516.3 art/s)\n" ] } ], "source": [ "!wikiextractor -o \"$proj_dir\"/data/raw/output --json \"$proj_dir\"/data/raw/arwiki-latest-pages-articles-multistream.xml.bz2 " ] }, { "cell_type": "markdown", "id": "bb8063c6-1bed-49f0-948a-eeb9a7933b4a", "metadata": {}, "source": [ "## Consolidate into JSON\n", "\n", "The split format is tedious to deal with, so now we will consolidate it into a single JSON file. This is fine because our data fits easily in RAM; if it didn't, there are better options.\n", "\n", "Feel free to check out the [consolidate file](../src/preprocessing/consolidate.py) for more details."
] }, { "cell_type": "code", "execution_count": 9, "id": "0a4ce3aa-9c1e-45e4-8219-a1714f482371", "metadata": { "tags": [] }, "outputs": [], "source": [ "from src.preprocessing.consolidate import folder_to_json" ] }, { "cell_type": "code", "execution_count": 10, "id": "3e93da6a-e304-450c-a81e-ffecaf0d8a9a", "metadata": { "tags": [] }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Processing: 100%|█████████████████| 6119/6119 [01:11<00:00, 85.38file/s, File: wiki_18 | Dir: /home/ec2-user/arabic-wiki/data/raw/output/CJ]\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Wiki processed in 72.87 seconds!\n" ] } ], "source": [ "folder = proj_dir_path / 'data/raw/output'\n", "folder_out = proj_dir_path / 'data/consolidated/'\n", "folder_to_json(folder, folder_out, 'ar_wiki')" ] }, { "cell_type": "code", "execution_count": null, "id": "553039d0-6315-40b8-b8f5-c6672598a5f3", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.13" } }, "nbformat": 4, "nbformat_minor": 5 }