# Get Data
Wikipedia dumps are distributed as XML. This notebook shows a relatively simple way to convert a dump into a single JSON file for our purposes.

## Initialize Variables

In [1]:
from pathlib import Path
import sys

In [2]:
proj_dir_path = Path.cwd().parent
proj_dir = str(proj_dir_path)

# So we can import later
sys.path.append(proj_dir)

## Install Libraries

In [3]:
%pip install -q -r "$proj_dir"/requirements.txt

Note: you may need to restart the kernel to use updated packages.


## Download Latest Arabic Wikipedia

I'm getting "latest", but it's good to see which version that actually is.

In [4]:
!curl -I https://dumps.wikimedia.org/arwiki/latest/arwiki-latest-pages-articles-multistream.xml.bz2 --silent | grep "Last-Modified"

Last-Modified: Sat, 21 Oct 2023 02:57:42 GMT
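
If you want to record the dump version alongside the data, the header value can be turned into a date. This is only a small sketch: the string is copied by hand from the output above, and the variable names are illustrative rather than part of the project.

```python
from email.utils import parsedate_to_datetime

# Header value copied from the curl output above (not fetched here).
last_modified = "Sat, 21 Oct 2023 02:57:42 GMT"

# Parse the RFC 2822 style date used in HTTP headers.
dump_time = parsedate_to_datetime(last_modified)

# ISO date string, handy as a version tag for file names or logs.
dump_date = dump_time.date().isoformat()
print(dump_date)
```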


Download Arabic Wikipedia

In [5]:
!wget -nc -P "$proj_dir"/data/raw https://dumps.wikimedia.org/arwiki/latest/arwiki-latest-pages-articles-multistream.xml.bz2

--2023-10-28 08:09:45-- https://dumps.wikimedia.org/arwiki/latest/arwiki-latest-pages-articles-multistream.xml.bz2
Resolving dumps.wikimedia.org (dumps.wikimedia.org)... 208.80.154.142, 2620:0:861:2:208:80:154:142
Connecting to dumps.wikimedia.org (dumps.wikimedia.org)|208.80.154.142|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1671369109 (1.6G) [application/octet-stream]
Saving to: ‘/home/ec2-user/arabic-wiki/data/raw/arwiki-latest-pages-articles-multistream.xml.bz2’


2023-10-28 08:15:39 (4.51 MB/s) - ‘/home/ec2-user/arabic-wiki/data/raw/arwiki-latest-pages-articles-multistream.xml.bz2’ saved [1671369109/1671369109]



## Extract from XML
The Wikipedia download is in XML. `wikiextractor` will convert it into JSON Lines (jsonl) format, split across many folders and files.

In [6]:
!wikiextractor -o "$proj_dir"/data/raw/output --json "$proj_dir"/data/raw/arwiki-latest-pages-articles-multistream.xml.bz2 

INFO: Preprocessing '/home/ec2-user/arabic-wiki/data/raw/arwiki-latest-pages-articles-multistream.xml.bz2' to collect template definitions: this may take some time.
INFO: Preprocessed 100000 pages
INFO: Preprocessed 200000 pages
INFO: Preprocessed 300000 pages
INFO: Preprocessed 400000 pages
INFO: Preprocessed 500000 pages
INFO: Preprocessed 600000 pages
INFO: Preprocessed 700000 pages
INFO: Preprocessed 800000 pages
INFO: Preprocessed 900000 pages
INFO: Preprocessed 1000000 pages
INFO: Preprocessed 1100000 pages
INFO: Preprocessed 1200000 pages
INFO: Preprocessed 1300000 pages
INFO: Preprocessed 1400000 pages
INFO: Preprocessed 1500000 pages
INFO: Preprocessed 1600000 pages
INFO: Preprocessed 1700000 pages
INFO: Preprocessed 1800000 pages
INFO: Preprocessed 1900000 pages
INFO: Preprocessed 2000000 pages
INFO: Preprocessed 2100000 pages
INFO: Preprocessed 2200000 pages
INFO: Preprocessed 2300000 pages
INFO: Preprocessed 2400000 pages
INFO: Preprocessed 2500000 pages
INFO: Preprocessed 
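
To get a feel for what `wikiextractor` wrote, one of the output files can be read line by line. The sketch below assumes each non-empty line is a standalone JSON object (the JSON Lines layout produced by `--json`). `read_articles` is a hypothetical helper, not part of this project or of `wikiextractor`.

```python
import json
from pathlib import Path
from typing import Iterator


def read_articles(path: Path) -> Iterator[dict]:
    """Yield one dict per non-empty line of a JSON Lines file.

    Hypothetical helper for inspection only.
    """
    with path.open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)
```

For example, `next(read_articles(proj_dir_path / "data/raw/output/AA/wiki_00"))` should return the first article as a dict, assuming the first shard lands at `AA/wiki_00`. I would expect keys such as `id`, `url`, `title`, and `text`, but check your own output since the exact fields can vary by `wikiextractor` version.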

## Consolidate into json

The split format is tedious to deal with, so we will now consolidate it into a single JSON file. This is fine since our data fits easily in RAM. If it didn't, there are better options.

Feel free to check out the [consolidate file](../src/preprocessing/consolidate.py) for more details.
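
For a rough idea of what consolidation involves, here is a minimal sketch. It is not the code in `consolidate.py`: the function name `consolidate_folder`, the `wiki_*` file pattern, and the output file name are assumptions made for illustration.

```python
import json
from pathlib import Path


def consolidate_folder(folder: Path, folder_out: Path, name: str) -> Path:
    """Merge every JSON Lines shard under `folder` into one JSON array file.

    Hypothetical sketch; everything is held in memory before writing.
    """
    articles = []
    # Assumed layout: <folder>/<two-letter dir>/wiki_NN
    for file_path in sorted(folder.rglob("wiki_*")):
        if not file_path.is_file():
            continue
        with file_path.open(encoding="utf-8") as f:
            articles.extend(json.loads(line) for line in f if line.strip())

    folder_out.mkdir(parents=True, exist_ok=True)
    out_path = folder_out / f"{name}.json"
    with out_path.open("w", encoding="utf-8") as f:
        # ensure_ascii=False keeps Arabic text readable instead of \u escapes
        json.dump(articles, f, ensure_ascii=False)
    return out_path
```

Writing with `ensure_ascii=False` matters for Arabic text: the default would escape every non-ASCII character, which inflates the file and makes it hard to read.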

In [9]:
from src.preprocessing.consolidate import folder_to_json

In [10]:
folder = proj_dir_path / 'data/raw/output'
folder_out = proj_dir_path / 'data/consolidated/'
folder_to_json(folder, folder_out, 'ar_wiki')

Processing: 100%|█████████████████| 6119/6119 [01:11<00:00, 85.38file/s, File: wiki_18 | Dir: /home/ec2-user/arabic-wiki/data/raw/output/CJ]


Wiki processed in 72.87 seconds!
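
If the data did not fit in RAM, one option would be to stream the shards into a single JSON Lines file instead of building one big list. The sketch below is one such approach under the same assumed `wiki_*` layout. `stream_to_jsonl` is a hypothetical name, not something the project provides.

```python
from pathlib import Path


def stream_to_jsonl(folder: Path, out_path: Path) -> int:
    """Copy every non-empty line from the shards into one JSON Lines file.

    Only one line is held in memory at a time. Returns the number of
    records written. `out_path` should live outside `folder`.
    """
    count = 0
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with out_path.open("w", encoding="utf-8") as out:
        for file_path in sorted(folder.rglob("wiki_*")):
            if not file_path.is_file():
                continue
            with file_path.open(encoding="utf-8") as f:
                for line in f:
                    if line.strip():
                        # Normalize line endings so each record is one line
                        out.write(line.rstrip("\r\n") + "\n")
                        count += 1
    return count
```

The trade-off is that the result is JSON Lines rather than a single JSON array, so downstream code has to read it line by line.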
