SaiedAlshahrani committed on
Commit
360dd3d
1 Parent(s): 8a8b943

Upload 7 files

Files changed (7)
  1. LICENSE +21 -0
  2. README.md +48 -13
  3. packages.txt +2 -0
  4. report.py +313 -0
  5. requirements.txt +8 -0
  6. update-daemon.sh +20 -0
  7. update-metadata.py +196 -0
LICENSE ADDED
@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2023 Saied Alshahrani
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
README.md CHANGED
@@ -1,13 +1,48 @@
1
- ---
2
- title: Wikipedia Corpora Report
3
- emoji: 📈
4
- colorFrom: gray
5
- colorTo: pink
6
- sdk: streamlit
7
- sdk_version: 1.33.0
8
- app_file: app.py
9
- pinned: false
10
- license: mit
11
- ---
12
-
13
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
1
+ # Wikipedia Corpora Meta Report
2
+ In this repository, we share with the community the Python scripts behind the “**Wikipedia Corpora Meta Report**”, an online metadata report (dashboard) that sheds light on how bots and humans generate and edit Wikipedia editions. It gives the NLP community detailed metadata about each Wikipedia edition’s articles, so researchers can make informed decisions about using these articles to train their NLP tasks and systems.
3
+
4
+ This dashboard interactively displays the metadata of each Wikipedia edition as a sunburst visualization and lets users view the metadata in tabular form and download it as a CSV file. The dashboard is open-sourced on GitHub under the MIT license and publicly hosted on Streamlit Community Cloud at [https://wikipedia-corpora-report.streamlit.app](https://wikipedia-corpora-report.streamlit.app/).
5
+
6
+ This dashboard was presented as a *transparency* tool in our paper, [**Performance Implications of Using Unrepresentative Corpora in Arabic Natural Language Processing**](https://aclanthology.org/2023.arabicnlp-1.19.pdf), at [*The First Arabic Natural Language Processing Conference (ArabicNLP 2023)*](https://sites.google.com/view/wanlp2023), co-located with [EMNLP 2023](https://2023.emnlp.org/) in Singapore (hybrid conference), December 7, 2023.
7
+
8
+
9
+ ### Running the Dashboard Locally
10
+ The dashboard is publicly hosted on Streamlit Community Cloud, but if you want to run it locally on your machine, follow these steps.
11
+
12
+ 1- Clone the dashboard's GitHub repository to your machine. Use this command in your terminal:
13
+
14
+ ```bash
15
+ git clone https://github.com/SaiedAlshahrani/Wikipedia-Corpora-Report.git
16
+ cd Wikipedia-Corpora-Report
17
+ ```
18
+
19
+ 2- Install the required Python packages. Use this command in your terminal:
20
+
21
+ ```bash
22
+ pip install -r requirements.txt
23
+ ```
24
+
25
+ 3- Run the Streamlit local server. Use this command in your terminal:
26
+
27
+ ```bash
28
+ streamlit run report.py
29
+ ```
30
+
31
+
32
+ ### BibTeX Citation
33
+
34
+ ```bibtex
35
+ @inproceedings{alshahrani-etal-2023-performance,
36
+ title = "{Performance Implications of Using Unrepresentative Corpora in {A}rabic Natural Language Processing}",
37
+ author = "Alshahrani, Saied and Alshahrani, Norah and Dey, Soumyabrata and Matthews, Jeanna",
38
+ booktitle = "Proceedings of the First Arabic Natural Language Processing Conference (ArabicNLP 2023)",
39
+ month = dec,
40
+ year = "2023",
41
+ address = "Singapore (Hybrid)",
42
+ publisher = "Association for Computational Linguistics",
43
+ url = "https://aclanthology.org/2023.arabicnlp-1.19",
44
+ doi = "10.18653/v1/2023.arabicnlp-1.19",
45
+ pages = "218--231",
46
+ abstract = "Wikipedia articles are a widely used source of training data for Natural Language Processing (NLP) research, particularly as corpora for low-resource languages like Arabic. However, it is essential to understand the extent to which these corpora reflect the representative contributions of native speakers, especially when many entries in a given language are directly translated from other languages or automatically generated through automated mechanisms. In this paper, we study the performance implications of using inorganic corpora that are not representative of native speakers and are generated through automated techniques such as bot generation or automated template-based translation. The case of the Arabic Wikipedia editions gives a unique case study of this since the Moroccan Arabic Wikipedia edition (ARY) is small but representative, the Egyptian Arabic Wikipedia edition (ARZ) is large but unrepresentative, and the Modern Standard Arabic Wikipedia edition (AR) is both large and more representative. We intrinsically evaluate the performance of two main NLP upstream tasks, namely word representation and language modeling, using word analogy evaluations and fill-mask evaluations using our two newly created datasets: Arab States Analogy Dataset (ASAD) and Masked Arab States Dataset (MASD). We demonstrate that for good NLP performance, we need both large and organic corpora; neither alone is sufficient. We show that producing large corpora through automated means can be a counter-productive, producing models that both perform worse and lack cultural richness and meaningful representation of the Arabic language and its native speakers.",
47
+ }
48
+ ```
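
The README above summarizes the dashboard's core flow: load the per-wiki metadata, draw a sunburst, and offer the same rows as a CSV download. As a rough, minimal sketch of that flow (not part of the commit), assuming the public `SaiedAlshahrani/Wikipedia-Corpora-Report` dataset keeps the column layout produced by `update-metadata.py`, and using `"Arabic (ar)"` only as an example wiki label:

```python
# Rough sketch of the dashboard's data flow (see report.py in this commit).
import datasets
import plotly.express as px

# Load the pre-computed metadata from the Hugging Face Hub and pick one wiki.
metadata = datasets.load_dataset("SaiedAlshahrani/Wikipedia-Corpora-Report", split="train").to_pandas()
wiki = metadata[metadata["Wiki"] == "Arabic (ar)"]  # labels follow the "<Language> (<code>)" pattern

# Nested sunburst: Wiki -> Metric -> Sub-Metric -> Editors, with slices summing to their parents.
fig = px.sunburst(wiki, path=["Wiki", "Metric", "Sub-Metric", "Editors"],
                  values="Values", branchvalues="total")
fig.show()

# The "Download Metadata" button serves these same rows as a CSV file.
wiki.to_csv("Arabic-Metadata.csv", index=False)
```
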
packages.txt ADDED
@@ -0,0 +1,2 @@
1
+ wget
2
+ firefox-esr
report.py ADDED
@@ -0,0 +1,313 @@
1
+ import ssl
2
+ import warnings
3
+ import datasets
4
+ import subprocess
5
+ import pandas as pd
6
+ import urllib.request
7
+ from time import sleep
8
+ import streamlit as st
9
+ from datetime import date
10
+ import plotly.express as px
11
+ from urllib.error import HTTPError
12
+
13
+
14
+ warnings.simplefilter("ignore", UserWarning)
15
+ warnings.simplefilter("ignore", FutureWarning)
16
+ pd.options.display.float_format = '{:.2f}'.format
17
+ ssl._create_default_https_context = ssl._create_unverified_context
18
+
19
+ st.set_page_config(page_title="Wikipedia Corpora Report", page_icon="https://webspace.clarkson.edu/~alshahsf/images/wikipedia1.png")
20
+
21
+ st.markdown("""
22
+ <h1 style='text-align: center';>Wikipedia Corpora Meta Report</h1>
23
+ <h5 style='text-align: center';>A Metadata Report of How Wikipedia Editions Are Generated and Edited</h5>
24
+ """, unsafe_allow_html=True)
25
+
26
+
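+ # Scrape the list of Wikipedias into a {code: language} mapping, dropping closed
+ # wikis; fall back to the hard-coded mapping below if scraping fails.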
27
+ def fetch_wikis_codes():
28
+ try:
29
+ urls = [r'https://en.wikipedia.org/wiki/Statistics_of_Wikipedias',
30
+ r'https://meta.wikimedia.org/wiki/List_of_Wikipedias']
31
+
32
+ for url in urls:
33
+ try: tables = pd.read_html(url)
34
+ except urllib.error.HTTPError: continue
35
+
36
+ for i in range(len(tables)):
37
+ dataframe = tables[i]
38
+ columns = list(dataframe.columns.values)
39
+
40
+ if(set(['Language', 'Wiki']).issubset(set(columns))):
41
+ wikis_codes = tables[i]
42
+ break
43
+
44
+ wikis_codes = wikis_codes[['Wiki', 'Language']]
45
+ wikis_codes = wikis_codes[wikis_codes["Language"].str.contains("(closed)", regex=False) == False]
46
+ wikis_codes = wikis_codes.set_index('Wiki').to_dict()['Language']
47
+ return wikis_codes
48
+
49
+ except Exception:
50
+ wikis_codes = {'en': 'English', 'ceb': 'Cebuano', 'de': 'German', 'sv': 'Swedish', 'fr': 'French', 'nl': 'Dutch', 'ru': 'Russian',
51
+ 'es': 'Spanish', 'it': 'Italian', 'arz': 'Egyptian Arabic', 'pl': 'Polish', 'ja': 'Japanese', 'zh': 'Chinese', 'vi':
52
+ 'Vietnamese', 'uk': 'Ukrainian', 'war': 'Waray', 'ar': 'Arabic', 'pt': 'Portuguese', 'fa': 'Persian', 'ca': 'Catalan',
53
+ 'sr': 'Serbian', 'id': 'Indonesian', 'ko': 'Korean', 'no': 'Norwegian (Bokmål)', 'ce': 'Chechen', 'fi': 'Finnish', 'cs':
54
+ 'Czech', 'tr': 'Turkish', 'hu': 'Hungarian', 'tt': 'Tatar', 'sh': 'Serbo-Croatian', 'ro': 'Romanian', 'zh-min-nan':
55
+ 'Southern Min', 'eu': 'Basque', 'ms': 'Malay', 'eo': 'Esperanto', 'he': 'Hebrew', 'hy': 'Armenian', 'da': 'Danish', 'bg':
56
+ 'Bulgarian', 'cy': 'Welsh', 'sk': 'Slovak', 'azb': 'South Azerbaijani', 'uz': 'Uzbek', 'et': 'Estonian', 'simple':
57
+ 'Simple English', 'be': 'Belarusian', 'kk': 'Kazakh', 'min': 'Minangkabau', 'el': 'Greek', 'hr': 'Croatian', 'lt': 'Lithuanian',
58
+ 'gl': 'Galician', 'az': 'Azerbaijani', 'ur': 'Urdu', 'sl': 'Slovene', 'lld': 'Ladin', 'ka': 'Georgian', 'nn': 'Norwegian (Nynorsk)',
59
+ 'hi': 'Hindi', 'th': 'Thai', 'ta': 'Tamil', 'bn': 'Bengali', 'la': 'Latin', 'mk': 'Macedonian', 'zh-yue': 'Cantonese', 'ast':
60
+ 'Asturian', 'lv': 'Latvian', 'af': 'Afrikaans', 'tg': 'Tajik', 'my': 'Burmese', 'mg': 'Malagasy', 'mr': 'Marathi', 'sq': 'Albanian',
61
+ 'bs': 'Bosnian', 'oc': 'Occitan', 'te': 'Telugu', 'ml': 'Malayalam', 'nds': 'Low German', 'be-tarask': 'Belarusian (Taraškievica)',
62
+ 'br': 'Breton', 'ky': 'Kyrgyz', 'sw': 'Swahili', 'jv': 'Javanese', 'lmo': 'Lombard', 'new': 'Newar', 'pnb': 'Western Punjabi', 'vec':
63
+ 'Venetian', 'ht': 'Haitian Creole', 'pms': 'Piedmontese', 'ba': 'Bashkir', 'lb': 'Luxembourgish', 'su': 'Sundanese', 'ku': 'Kurdish (Kurmanji)',
64
+ 'ga': 'Irish', 'szl': 'Silesian', 'is': 'Icelandic', 'fy': 'West Frisian', 'cv': 'Chuvash', 'ckb': 'Kurdish (Sorani)', 'pa': 'Punjabi', 'tl':
65
+ 'Tagalog', 'an': 'Aragonese', 'wuu': 'Wu Chinese', 'diq': 'Zaza', 'io': 'Ido', 'sco': 'Scots', 'vo': 'Volapük', 'yo': 'Yoruba', 'ne': 'Nepali',
66
+ 'ia': 'Interlingua', 'kn': 'Kannada', 'gu': 'Gujarati', 'als': 'Alemannic German', 'ha': 'Hausa', 'avk': 'Kotava', 'bar': 'Bavarian', 'crh':
67
+ 'Crimean Tatar', 'scn': 'Sicilian', 'bpy': 'Bishnupriya Manipuri', 'qu': 'Quechua (Southern Quechua)', 'nv': 'Navajo', 'mn': 'Mongolian', 'xmf':
68
+ 'Mingrelian', 'ban': 'Balinese', 'si': 'Sinhala', 'tum': 'Tumbuka', 'ps': 'Pashto', 'frr': 'North Frisian', 'os': 'Ossetian', 'mzn': 'Mazanderani',
69
+ 'bat-smg': 'Samogitian', 'or': 'Odia', 'ig': 'Igbo', 'sah': 'Yakut', 'cdo': 'Eastern Min', 'gd': 'Scottish Gaelic', 'bug': 'Buginese', 'yi': 'Yiddish',
70
+ 'sd': 'Sindhi', 'ilo': 'Ilocano', 'am': 'Amharic', 'nap': 'Neapolitan', 'li': 'Limburgish', 'bcl': 'Central Bikol', 'fo': 'Faroese', 'gor': 'Gorontalo',
71
+ 'hsb': 'Upper Sorbian', 'map-bms': 'Banyumasan', 'mai': 'Maithili', 'shn': 'Shan', 'eml': 'Emilian-Romagnol', 'ace': 'Acehnese', 'zh-classical':
72
+ 'Classical Chinese', 'sa': 'Sanskrit', 'as': 'Assamese', 'wa': 'Walloon', 'ie': 'Interlingue', 'hyw': 'Western Armenian', 'lij': 'Ligurian', 'mhr':
73
+ 'Meadow Mari', 'zu': 'Zulu', 'sn': 'Shona', 'hif': 'Fiji Hindi', 'mrj': 'Hill Mari', 'bjn': 'Banjarese', 'mni': 'Meitei', 'km': 'Khmer', 'hak':
74
+ 'Hakka Chinese', 'roa-tara': 'Tarantino', 'pam': 'Kapampangan', 'sat': 'Santali', 'rue': 'Rusyn', 'nso': 'Northern Sotho', 'bh': 'Bihari (Bhojpuri)',
75
+ 'so': 'Somali', 'mi': 'Māori', 'se': 'Northern Sámi', 'myv': 'Erzya', 'vls': 'West Flemish', 'nds-nl': 'Dutch Low Saxon', 'dag': 'Dagbani', 'sc':
76
+ 'Sardinian', 'ary': 'Moroccan Arabic', 'co': 'Corsican', 'kw': 'Cornish', 'bo': 'Lhasa Tibetan', 'vep': 'Veps', 'glk': 'Gilaki', 'tk': 'Turkmen', 'kab':
77
+ 'Kabyle', 'gan': 'Gan Chinese', 'rw': 'Kinyarwanda', 'fiu-vro': 'Võro', 'ab': 'Abkhaz', 'gv': 'Manx', 'ug': 'Uyghur', 'nah': 'Nahuatl', 'zea': 'Zeelandic',
78
+ 'skr': 'Saraiki', 'frp': 'Franco-Provençal', 'udm': 'Udmurt', 'pcd': 'Picard', 'mt': 'Maltese', 'kv': 'Komi', 'csb': 'Kashubian', 'gn': 'Guarani', 'smn':
79
+ 'Inari Sámi', 'ay': 'Aymara', 'nrm': 'Norman', 'ks': 'Kashmiri', 'lez': 'Lezgian', 'lfn': 'Lingua Franca Nova', 'olo': 'Livvi-Karelian', 'mwl': 'Mirandese',
80
+ 'stq': 'Saterland Frisian', 'lo': 'Lao', 'ang': 'Old English', 'mdf': 'Moksha', 'fur': 'Friulian', 'rm': 'Romansh', 'lad': 'Judaeo-Spanish', 'kaa': 'Karakalpak',
81
+ 'gom': 'Konkani (Goan Konkani)', 'ext': 'Extremaduran', 'koi': 'Permyak', 'tyv': 'Tuvan', 'pap': 'Papiamento', 'av': 'Avar', 'dsb': 'Lower Sorbian', 'ln':
82
+ 'Lingala', 'dty': 'Doteli', 'tw': 'Twi', 'cbk-zam': 'Chavacano (Zamboanga)', 'dv': 'Maldivian', 'ksh': 'Ripuarian', 'za': 'Zhuang (Standard Zhuang)', 'gag':
83
+ 'Gagauz', 'bxr': 'Buryat (Russia Buriat)', 'pfl': 'Palatine German', 'lg': 'Luganda', 'szy': 'Sakizaya', 'pag': 'Pangasinan', 'blk': "Pa'O", 'pi': 'Pali',
84
+ 'tay': 'Atayal', 'haw': 'Hawaiian', 'awa': 'Awadhi', 'inh': 'Ingush', 'krc': 'Karachay-Balkar', 'xal': 'Kalmyk Oirat', 'pdc': 'Pennsylvania Dutch', 'to':
85
+ 'Tongan', 'atj': 'Atikamekw', 'tcy': 'Tulu', 'arc': 'Aramaic (Syriac)', 'mnw': 'Mon', 'jam': 'Jamaican Patois', 'shi': 'Shilha', 'kbp': 'Kabiye', 'wo':
86
+ 'Wolof', 'anp': 'Angika', 'kbd': 'Kabardian', 'nia': 'Nias', 'nov': 'Novial', 'om': 'Oromo', 'ki': 'Kikuyu', 'nqo': "N'Ko", 'bi': 'Bislama', 'xh': 'Xhosa',
87
+ 'tpi': 'Tok Pisin', 'tet': 'Tetum', 'ff': 'Fula', 'roa-rup': 'Aromanian', 'jbo': 'Lojban', 'fj': 'Fijian', 'kg': 'Kongo (Kituba)', 'lbe': 'Lak', 'ty': 'Tahitian',
88
+ 'guw': 'Gun', 'cu': 'Old Church Slavonic', 'trv': 'Seediq', 'ami': 'Amis', 'srn': 'Sranan Tongo', 'sm': 'Samoan', 'mad': 'Madurese', 'alt': 'Southern Altai',
89
+ 'ltg': 'Latgalian', 'gcr': 'French Guianese Creole', 'chr': 'Cherokee', 'tn': 'Tswana', 'ny': 'Chewa', 'st': 'Sotho', 'pih': 'Norfuk', 'rmy': 'Romani (Vlax Romani)',
90
+ 'got': 'Gothic', 'ee': 'Ewe', 'pcm': 'Nigerian Pidgin', 'bm': 'Bambara', 'ss': 'Swazi', 'ts': 'Tsonga', 've': 'Venda', 'kcg': 'Tyap', 'chy': 'Cheyenne', 'rn':
91
+ 'Kirundi', 'ch': 'Chamorro', 'gur': 'Frafra', 'ik': 'Iñupiaq', 'ady': 'Adyghe', 'pnt': 'Pontic Greek', 'guc': 'Wayuu', 'iu': 'Inuktitut', 'pwn': 'Paiwan', 'sg':
92
+ 'Sango', 'din': 'Dinka', 'ti': 'Tigrinya', 'kl': 'Greenlandic', 'dz': 'Dzongkha', 'cr': 'Cree', 'ak': 'Akan'}
93
+ return wikis_codes
94
+
95
+
96
+ def run_daemon(args):
97
+ result = subprocess.run(args, capture_output=True, text=True)
98
+ result.check_returncode()  # raises CalledProcessError if the daemon script failed
100
+
101
+
102
+ labels = []
103
+ wiki_codes = fetch_wikis_codes()
104
+ for key, value in wiki_codes.items():
105
+ labels.append(f"{value} ({key})")
106
+
107
+ # st.markdown("<br>",unsafe_allow_html=True)
108
+
109
+ selected_language = st.selectbox("Select or Search for a Wikipedia language:", labels, placeholder="Select or Search for a Wikipedia language")
110
+
111
+
112
+ @st.cache_data
113
+ def fetch_metadata_dataset():
114
+ # HF_TOKEN = st.secrets["HF_TOKEN"]
115
+ dataset = datasets.load_dataset("SaiedAlshahrani/Wikipedia-Corpora-Report", split="train")#, use_auth_token=HF_TOKEN)
116
+ dataset = dataset.to_pandas()
117
+ return dataset
118
+
119
+ dataset = fetch_metadata_dataset()
120
+
121
+ metadata = dataset[dataset['Wiki'] == selected_language]
122
+
123
+ retrieval_date = metadata['Retrieval-Date'].iloc[0]
124
+
125
+ now_date = date.today()
126
+ data_date = date(int(retrieval_date.split('-')[0]), int(retrieval_date.split('-')[1]), int(retrieval_date.split('-')[2]))
127
+ delta = now_date - data_date
128
+
129
+ # if delta.days > 45: run_daemon(["bash", "update-daemon.sh"])
130
+
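+ # The dataset stores eight rows per wiki in a fixed order (Pages then Edits,
+ # Articles then Non-Articles, Bots then Humans), so positional indexing is safe here.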
131
+ pages_content_bots = metadata['Values'].iloc[0]
132
+ pages_content_humans = metadata['Values'].iloc[1]
133
+ pages_non_content_bots = metadata['Values'].iloc[2]
134
+ pages_non_content_humans = metadata['Values'].iloc[3]
135
+
136
+ edits_content_bots = metadata['Values'].iloc[4]
137
+ edits_content_humans = metadata['Values'].iloc[5]
138
+ edits_non_content_bots = metadata['Values'].iloc[6]
139
+ edits_non_content_humans = metadata['Values'].iloc[7]
140
+
141
+ pages_content_pages = pages_content_bots+pages_content_humans
142
+ pages_non_content_pages = pages_non_content_bots+pages_non_content_humans
143
+ total_pages = pages_content_pages+pages_non_content_pages
144
+
145
+ edits_content_pages = edits_content_bots+edits_content_humans
146
+ edits_non_content_pages = edits_non_content_bots+edits_non_content_humans
147
+ total_edits = edits_content_pages + edits_non_content_pages
148
+
149
+ wiki_metadata = pd.DataFrame(metadata).reset_index(drop=True)
150
+
151
+ col1 , cc, col2 = st.columns([1.5, 1.75, 1], gap="small")
152
+
153
+ with col1:
154
+ display_data_table = st.checkbox(f'Display metadata in a table.', value=False)
155
+
156
+ with cc:
157
+ st.markdown(f"<p style='color:lightgray;font-family:'IBM Plex Sans',sans-serif;font-size:18px;'> &#9432; Latest Metadata Update: {retrieval_date}</p>", unsafe_allow_html=True)
158
+
159
+ with col2:
160
+ download_button = st.download_button(label="Download Metadata", data=wiki_metadata.to_csv().encode('utf-8'),
161
+ file_name=f'{selected_language.split("(")[0].strip(" ")}-Metadata-{retrieval_date}.csv', mime='text/csv',)
162
+
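+ # Sunburst rings nest Wiki -> Metric -> Sub-Metric -> Editors; with branchvalues="total",
+ # the slices of each ring sum to their parent's value.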
163
+ fig = px.sunburst(data_frame=wiki_metadata,
164
+ path=['Wiki','Metric', 'Sub-Metric', 'Editors'],
165
+ values='Values',
166
+ branchvalues="total",
167
+ color_discrete_sequence=['darkgray', 'black'],
168
+ template='xgridoff')
169
+
170
+ fig.update_traces(textinfo='label+percent parent')
171
+ fig.update_traces(hovertemplate="Label=%{label}<br>Value=%{value}<br>Parent=%{parent}</br>")
172
+ fig.update_layout(margin=dict(t=0, l=0, r=0, b=0))
173
+ # fig.update_layout(uniformtext=dict(minsize=12, mode='hide'))
174
+ fig.add_layout_image(dict(x=.430, y=.615, sizex=0.23, sizey=0.23, opacity=0.22, layer="below",
175
+ source="https://upload.wikimedia.org/wikipedia/commons/6/63/Wikipedia-logo.png"))
176
+
177
+ # st.markdown("<br>",unsafe_allow_html=True)
178
+
179
+ st.plotly_chart(fig, theme=None, use_container_width=True, config={'displayModeBar': False})
180
+
181
+ # st.markdown("##")
182
+ # st.markdown("<br>",unsafe_allow_html=True)
183
+
184
+
185
+ if display_data_table:
186
+ table_st_style = """
187
+ <style>
188
+ table {
189
+ border-collapse: collapse;
190
+ border: 1px solid black;
191
+ border-spacing: 0;
192
+ margin-left: 0;
193
+ margin-right: 0;
194
+ width: 100%;}
195
+
196
+ page {
197
+ border-collapse: collapse;}
198
+
199
+ td, th, tr {
200
+ border: 1px solid black;
201
+ padding: 0;}
202
+
203
+ .contentTableHeader {
204
+ background-color: lightgray;
205
+ color: black;}
206
+ </style>
207
+ """
208
+ st.markdown(table_st_style, unsafe_allow_html=True)
209
+
210
+ st.markdown(f"""
211
+ <table border="1" width="100%" cellpadding="0" cellspacing="0">
212
+ <thead class="contentTableHeader">
213
+ <tr>
214
+ <td style="text-align:center"><b>Wikipedia</b></td>
215
+ <td style="text-align:center"><b>Totals</b></td>
216
+ <td style="text-align:center"><b>Pages</b></td>
217
+ <td style="text-align:center"><b>Editors</b></td>
218
+ </tr>
219
+ </thead>
220
+ <tbody style="margin: 0;padding: 0">
221
+ <tr>
222
+ <td style="text-align:center"; rowspan=8>{selected_language}</td>
223
+ <td style="text-align:center"; rowspan=4>Pages ({total_pages:,})</td>
224
+ <td style="text-align:center"; rowspan=2>Articles ({pages_content_pages:,})</td>
225
+ <td style="text-align:center">Bots ({pages_content_bots:,})</td>
226
+ </tr>
227
+ <tr>
228
+ <td style="text-align:center">Humans ({pages_content_humans:,})</td>
229
+ </tr>
230
+ <tr>
231
+ <td style="text-align:center"; rowspan=2>Non-Articles ({pages_non_content_pages:,})</td>
232
+ <td style="text-align:center">Bots ({pages_non_content_bots:,})</td>
233
+ </tr>
234
+ <tr>
235
+ <td style="text-align:center">Humans ({pages_non_content_humans:,})</td>
236
+ </tr>
237
+ <tr>
238
+ <td style="text-align:center"; rowspan=4>Edits ({total_edits:,})</td>
239
+ <td style="text-align:center"; rowspan=2>Articles ({edits_content_pages:,})</td>
240
+ <td style="text-align:center">Bots ({edits_content_bots:,})</td>
241
+ </tr>
242
+ <tr>
243
+ <td style="text-align:center"; >Humans ({edits_content_humans:,})</td>
244
+ </tr>
245
+ <tr>
246
+ <td style="text-align:center"; rowspan=2>Non-Articles ({edits_non_content_pages:,})</td>
247
+ <td style="text-align:center">Bots ({edits_non_content_bots:,})</td>
248
+ </tr>
249
+ <tr>
250
+ <td style="text-align:center">Humans ({edits_non_content_humans:,})</td>
251
+ </tr>
252
+ </tbody>
253
+ </table>
254
+ """, unsafe_allow_html=True)
255
+
256
+ fonts_style = """
257
+ <style>
258
+ @import url('https://fonts.googleapis.com/css2?family=IBM+Plex+Sans:wght@200&display=swap');
259
+ </style>
260
+ """
261
+ st.markdown(fonts_style, unsafe_allow_html=True)
262
+
263
+ hide_st_style = """
264
+ <style>
265
+ #MainMenu {visibility: hidden;}
266
+ header {visibility: hidden;}
267
+ footer {visibility: hidden;}
268
+ button[title="View fullscreen"]{visibility: hidden;}
269
+ </style>
270
+ """
271
+ st.markdown(hide_st_style, unsafe_allow_html=True)
272
+
273
+ footer="""
274
+ <style>
275
+ .footer {
276
+ position: fixed;
277
+ left: 0;
278
+ bottom: 0;
279
+ width: 100%;
280
+ background-color: white;
281
+ color: #737373;
282
+ text-align: center;}
283
+
284
+ .p1 {
285
+ font-family: 'IBM Plex Sans', sans-serif;
286
+ font-size: 12px}
287
+
288
+ </style>
289
+
290
+ <div class="footer"> <p class="p1">Copyright © 2023 by Saied Alshahrani<br>Hosted with Streamlit Community Cloud</p> </div>
291
+
292
+ """
293
+ st.markdown(footer, unsafe_allow_html=True)
294
+
295
+ st.markdown("""
296
+ <style>
297
+ .block-container {
298
+ padding-top: 0rem;
299
+ padding-bottom: 0rem;
300
+ padding-left: 0rem;
301
+ padding-right: 0rem;
302
+ }
303
+ </style>
304
+ """, unsafe_allow_html=True)
305
+
306
+ st.markdown("""
307
+ <style>
308
+ .br {
309
+ display: block;
310
+ margin: 0px 0;
311
+ }
312
+ </style>
313
+ """, unsafe_allow_html=True)
requirements.txt ADDED
@@ -0,0 +1,8 @@
1
+ lxml==4.9.1
2
+ pandas==1.4.3
3
+ plotly==5.15.0
4
+ datasets==2.14.6
5
+ streamlit==1.30.0
6
+ selenium==3.141.0
7
+ geckodriver-autoinstaller==0.1.0
8
+
update-daemon.sh ADDED
@@ -0,0 +1,20 @@
1
+ python update-metadata.py
2
+
3
+ # git lfs install
4
+ git clone https://huggingface.co/datasets/SaiedAlshahrani/Wikipedia-Corpora-Report
5
+ cd Wikipedia-Corpora-Report/
6
+
7
+ head -n1 ../English--Wikipedia--Metadata.csv > Wikipedia-Corpora-Report.csv
8
+ sed -i 1d ../*--Wikipedia--Metadata.csv  # GNU sed; with BSD/macOS sed, use: sed -i '' 1d ../*--Wikipedia--Metadata.csv
9
+ cat ../*--Wikipedia--Metadata.csv >> Wikipedia-Corpora-Report.csv
10
+ # cp -r ../all-metadata .
11
+
12
+ git add .
13
+ git status
14
+ git commit -m "Update Wikipedia-Corpora-Report.csv"
15
+ git push
16
+
17
+ rm ../*--Wikipedia--Metadata.csv
18
+ cp Wikipedia-Corpora-Report.csv ..
19
+ cd ..
20
+
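
The `head`/`sed`/`cat` steps above merge every per-wiki `*--Wikipedia--Metadata.csv` file into a single `Wikipedia-Corpora-Report.csv` while keeping only one header row. For reference, a pandas equivalent of that merge (a sketch, assuming it runs from the directory holding the per-wiki files):

```python
# pandas equivalent of the head/sed/cat merge in update-daemon.sh:
# concatenate all per-wiki metadata files and keep a single header row.
import glob
import pandas as pd

parts = [pd.read_csv(path) for path in sorted(glob.glob("*--Wikipedia--Metadata.csv"))]
merged = pd.concat(parts, ignore_index=True)
merged.to_csv("Wikipedia-Corpora-Report/Wikipedia-Corpora-Report.csv", index=False)
```
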
update-metadata.py ADDED
@@ -0,0 +1,196 @@
1
+ import selenium
2
+ import os, warnings
3
+ import urllib.request
4
+ from time import sleep
5
+ import pandas as pd, ssl
6
+ from selenium import webdriver
7
+ from urllib.error import HTTPError
8
+
9
+ warnings.simplefilter("ignore", UserWarning)
10
+ warnings.simplefilter("ignore", FutureWarning)
11
+ pd.options.display.float_format = '{:.2f}'.format
12
+ ssl._create_default_https_context = ssl._create_unverified_context
13
+
14
+ def fetch_wikis_codes():
15
+ try:
16
+ urls = [r'https://en.wikipedia.org/wiki/Statistics_of_Wikipedias',
17
+ r'https://meta.wikimedia.org/wiki/List_of_Wikipedias']
18
+
19
+ for url in urls:
20
+ try: tables = pd.read_html(url)
21
+ except urllib.error.HTTPError: continue
22
+
23
+ for i in range(len(tables)):
24
+ dataframe = tables[i]
25
+ columns = list(dataframe.columns.values)
26
+
27
+ if(set(['Language', 'Wiki']).issubset(set(columns))):
28
+ wikis_codes = tables[i]
29
+ break
30
+
31
+ wikis_codes = wikis_codes[['Wiki', 'Language']]
32
+ wikis_codes = wikis_codes[wikis_codes["Language"].str.contains("(closed)", regex=False) == False]
33
+ wikis_codes = wikis_codes.set_index('Wiki').to_dict()['Language']
34
+ return wikis_codes
35
+
36
+ except Exception:
37
+ wikis_codes = {'en': 'English', 'ceb': 'Cebuano', 'de': 'German', 'sv': 'Swedish', 'fr': 'French', 'nl': 'Dutch', 'ru': 'Russian',
38
+ 'es': 'Spanish', 'it': 'Italian', 'arz': 'Egyptian Arabic', 'pl': 'Polish', 'ja': 'Japanese', 'zh': 'Chinese', 'vi':
39
+ 'Vietnamese', 'uk': 'Ukrainian', 'war': 'Waray', 'ar': 'Arabic', 'pt': 'Portuguese', 'fa': 'Persian', 'ca': 'Catalan',
40
+ 'sr': 'Serbian', 'id': 'Indonesian', 'ko': 'Korean', 'no': 'Norwegian (Bokmål)', 'ce': 'Chechen', 'fi': 'Finnish', 'cs':
41
+ 'Czech', 'tr': 'Turkish', 'hu': 'Hungarian', 'tt': 'Tatar', 'sh': 'Serbo-Croatian', 'ro': 'Romanian', 'zh-min-nan':
42
+ 'Southern Min', 'eu': 'Basque', 'ms': 'Malay', 'eo': 'Esperanto', 'he': 'Hebrew', 'hy': 'Armenian', 'da': 'Danish', 'bg':
43
+ 'Bulgarian', 'cy': 'Welsh', 'sk': 'Slovak', 'azb': 'South Azerbaijani', 'uz': 'Uzbek', 'et': 'Estonian', 'simple':
44
+ 'Simple English', 'be': 'Belarusian', 'kk': 'Kazakh', 'min': 'Minangkabau', 'el': 'Greek', 'hr': 'Croatian', 'lt': 'Lithuanian',
45
+ 'gl': 'Galician', 'az': 'Azerbaijani', 'ur': 'Urdu', 'sl': 'Slovene', 'lld': 'Ladin', 'ka': 'Georgian', 'nn': 'Norwegian (Nynorsk)',
46
+ 'hi': 'Hindi', 'th': 'Thai', 'ta': 'Tamil', 'bn': 'Bengali', 'la': 'Latin', 'mk': 'Macedonian', 'zh-yue': 'Cantonese', 'ast':
47
+ 'Asturian', 'lv': 'Latvian', 'af': 'Afrikaans', 'tg': 'Tajik', 'my': 'Burmese', 'mg': 'Malagasy', 'mr': 'Marathi', 'sq': 'Albanian',
48
+ 'bs': 'Bosnian', 'oc': 'Occitan', 'te': 'Telugu', 'ml': 'Malayalam', 'nds': 'Low German', 'be-tarask': 'Belarusian (Taraškievica)',
49
+ 'br': 'Breton', 'ky': 'Kyrgyz', 'sw': 'Swahili', 'jv': 'Javanese', 'lmo': 'Lombard', 'new': 'Newar', 'pnb': 'Western Punjabi', 'vec':
50
+ 'Venetian', 'ht': 'Haitian Creole', 'pms': 'Piedmontese', 'ba': 'Bashkir', 'lb': 'Luxembourgish', 'su': 'Sundanese', 'ku': 'Kurdish (Kurmanji)',
51
+ 'ga': 'Irish', 'szl': 'Silesian', 'is': 'Icelandic', 'fy': 'West Frisian', 'cv': 'Chuvash', 'ckb': 'Kurdish (Sorani)', 'pa': 'Punjabi', 'tl':
52
+ 'Tagalog', 'an': 'Aragonese', 'wuu': 'Wu Chinese', 'diq': 'Zaza', 'io': 'Ido', 'sco': 'Scots', 'vo': 'Volapük', 'yo': 'Yoruba', 'ne': 'Nepali',
53
+ 'ia': 'Interlingua', 'kn': 'Kannada', 'gu': 'Gujarati', 'als': 'Alemannic German', 'ha': 'Hausa', 'avk': 'Kotava', 'bar': 'Bavarian', 'crh':
54
+ 'Crimean Tatar', 'scn': 'Sicilian', 'bpy': 'Bishnupriya Manipuri', 'qu': 'Quechua (Southern Quechua)', 'nv': 'Navajo', 'mn': 'Mongolian', 'xmf':
55
+ 'Mingrelian', 'ban': 'Balinese', 'si': 'Sinhala', 'tum': 'Tumbuka', 'ps': 'Pashto', 'frr': 'North Frisian', 'os': 'Ossetian', 'mzn': 'Mazanderani',
56
+ 'bat-smg': 'Samogitian', 'or': 'Odia', 'ig': 'Igbo', 'sah': 'Yakut', 'cdo': 'Eastern Min', 'gd': 'Scottish Gaelic', 'bug': 'Buginese', 'yi': 'Yiddish',
57
+ 'sd': 'Sindhi', 'ilo': 'Ilocano', 'am': 'Amharic', 'nap': 'Neapolitan', 'li': 'Limburgish', 'bcl': 'Central Bikol', 'fo': 'Faroese', 'gor': 'Gorontalo',
58
+ 'hsb': 'Upper Sorbian', 'map-bms': 'Banyumasan', 'mai': 'Maithili', 'shn': 'Shan', 'eml': 'Emilian-Romagnol', 'ace': 'Acehnese', 'zh-classical':
59
+ 'Classical Chinese', 'sa': 'Sanskrit', 'as': 'Assamese', 'wa': 'Walloon', 'ie': 'Interlingue', 'hyw': 'Western Armenian', 'lij': 'Ligurian', 'mhr':
60
+ 'Meadow Mari', 'zu': 'Zulu', 'sn': 'Shona', 'hif': 'Fiji Hindi', 'mrj': 'Hill Mari', 'bjn': 'Banjarese', 'mni': 'Meitei', 'km': 'Khmer', 'hak':
61
+ 'Hakka Chinese', 'roa-tara': 'Tarantino', 'pam': 'Kapampangan', 'sat': 'Santali', 'rue': 'Rusyn', 'nso': 'Northern Sotho', 'bh': 'Bihari (Bhojpuri)',
62
+ 'so': 'Somali', 'mi': 'Māori', 'se': 'Northern Sámi', 'myv': 'Erzya', 'vls': 'West Flemish', 'nds-nl': 'Dutch Low Saxon', 'dag': 'Dagbani', 'sc':
63
+ 'Sardinian', 'ary': 'Moroccan Arabic', 'co': 'Corsican', 'kw': 'Cornish', 'bo': 'Lhasa Tibetan', 'vep': 'Veps', 'glk': 'Gilaki', 'tk': 'Turkmen', 'kab':
64
+ 'Kabyle', 'gan': 'Gan Chinese', 'rw': 'Kinyarwanda', 'fiu-vro': 'Võro', 'ab': 'Abkhaz', 'gv': 'Manx', 'ug': 'Uyghur', 'nah': 'Nahuatl', 'zea': 'Zeelandic',
65
+ 'skr': 'Saraiki', 'frp': 'Franco-Provençal', 'udm': 'Udmurt', 'pcd': 'Picard', 'mt': 'Maltese', 'kv': 'Komi', 'csb': 'Kashubian', 'gn': 'Guarani', 'smn':
66
+ 'Inari Sámi', 'ay': 'Aymara', 'nrm': 'Norman', 'ks': 'Kashmiri', 'lez': 'Lezgian', 'lfn': 'Lingua Franca Nova', 'olo': 'Livvi-Karelian', 'mwl': 'Mirandese',
67
+ 'stq': 'Saterland Frisian', 'lo': 'Lao', 'ang': 'Old English', 'mdf': 'Moksha', 'fur': 'Friulian', 'rm': 'Romansh', 'lad': 'Judaeo-Spanish', 'kaa': 'Karakalpak',
68
+ 'gom': 'Konkani (Goan Konkani)', 'ext': 'Extremaduran', 'koi': 'Permyak', 'tyv': 'Tuvan', 'pap': 'Papiamento', 'av': 'Avar', 'dsb': 'Lower Sorbian', 'ln':
69
+ 'Lingala', 'dty': 'Doteli', 'tw': 'Twi', 'cbk-zam': 'Chavacano (Zamboanga)', 'dv': 'Maldivian', 'ksh': 'Ripuarian', 'za': 'Zhuang (Standard Zhuang)', 'gag':
70
+ 'Gagauz', 'bxr': 'Buryat (Russia Buriat)', 'pfl': 'Palatine German', 'lg': 'Luganda', 'szy': 'Sakizaya', 'pag': 'Pangasinan', 'blk': "Pa'O", 'pi': 'Pali',
71
+ 'tay': 'Atayal', 'haw': 'Hawaiian', 'awa': 'Awadhi', 'inh': 'Ingush', 'krc': 'Karachay-Balkar', 'xal': 'Kalmyk Oirat', 'pdc': 'Pennsylvania Dutch', 'to':
72
+ 'Tongan', 'atj': 'Atikamekw', 'tcy': 'Tulu', 'arc': 'Aramaic (Syriac)', 'mnw': 'Mon', 'jam': 'Jamaican Patois', 'shi': 'Shilha', 'kbp': 'Kabiye', 'wo':
73
+ 'Wolof', 'anp': 'Angika', 'kbd': 'Kabardian', 'nia': 'Nias', 'nov': 'Novial', 'om': 'Oromo', 'ki': 'Kikuyu', 'nqo': "N'Ko", 'bi': 'Bislama', 'xh': 'Xhosa',
74
+ 'tpi': 'Tok Pisin', 'tet': 'Tetum', 'ff': 'Fula', 'roa-rup': 'Aromanian', 'jbo': 'Lojban', 'fj': 'Fijian', 'kg': 'Kongo (Kituba)', 'lbe': 'Lak', 'ty': 'Tahitian',
75
+ 'guw': 'Gun', 'cu': 'Old Church Slavonic', 'trv': 'Seediq', 'ami': 'Amis', 'srn': 'Sranan Tongo', 'sm': 'Samoan', 'mad': 'Madurese', 'alt': 'Southern Altai',
76
+ 'ltg': 'Latgalian', 'gcr': 'French Guianese Creole', 'chr': 'Cherokee', 'tn': 'Tswana', 'ny': 'Chewa', 'st': 'Sotho', 'pih': 'Norfuk', 'rmy': 'Romani (Vlax Romani)',
77
+ 'got': 'Gothic', 'ee': 'Ewe', 'pcm': 'Nigerian Pidgin', 'bm': 'Bambara', 'ss': 'Swazi', 'ts': 'Tsonga', 've': 'Venda', 'kcg': 'Tyap', 'chy': 'Cheyenne', 'rn':
78
+ 'Kirundi', 'ch': 'Chamorro', 'gur': 'Frafra', 'ik': 'Iñupiaq', 'ady': 'Adyghe', 'pnt': 'Pontic Greek', 'guc': 'Wayuu', 'iu': 'Inuktitut', 'pwn': 'Paiwan', 'sg':
79
+ 'Sango', 'din': 'Dinka', 'ti': 'Tigrinya', 'kl': 'Greenlandic', 'dz': 'Dzongkha', 'cr': 'Cree', 'ak': 'Akan'}
80
+ return wikis_codes
81
+
82
+
83
+ def fetch_wiki_metadata(wiki, metric, submetric, timeout):
84
+ options = webdriver.FirefoxOptions()
85
+ options.add_argument("--headless")
86
+ profile = webdriver.FirefoxProfile()
87
+ profile.set_preference("browser.download.folderList", 2)
88
+ profile.set_preference("browser.download.manager.showWhenStarting", False)
89
+ profile.set_preference("browser.download.dir", f"{os.getcwd()}")
90
+ profile.set_preference("browser.helperApps.neverAsk.saveToDisk", "application/octet-stream")
91
+ driver = webdriver.Firefox(options=options, firefox_profile=profile, executable_path='geckodriver', service_log_path=os.devnull)
92
+
93
+ if metric == 'pages':
94
+ base_url = f'https://stats.wikimedia.org/#/{wiki}.wikipedia.org/content/pages-to-date/full|table|'
95
+
96
+ elif metric == 'edits':
97
+ base_url = f'https://stats.wikimedia.org/#/{wiki}.wikipedia.org/contributing/edits/full|table|'
98
+
99
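+ # Wikistats table view for the last month, split by editor type (anonymous,
+ # group bot, name bot, registered user) and filtered by page type (content vs. non-content).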
+ parameters = f'1-month|editor_type~anonymous*group-bot*name-bot*user+(page_type)~{submetric}|monthly'
100
+ request_url = "".join([base_url, parameters])
101
+
102
+ driver.implicitly_wait(3)
103
+ driver.get(request_url)
104
+ driver.page_source
105
+
106
+ sleep(timeout)
107
+
108
+ csvFilename = f"{wiki}--{metric}--{submetric}.csv"
109
+ csvFilename = csvFilename.replace(' ','-')
110
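+ # Click Wikistats' CSV export button; the browser saves 'undefined.csv', which is renamed below.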
+ driver.find_element_by_class_name("ui.icon.button.tooltipped.tooltipped-n").click()
111
+ sleep(3) ; os.rename("undefined.csv", csvFilename)
112
+
113
+ driver.close()
114
+ driver.quit()
115
+
116
+ print(f' [+] Metadata Exported to `{wiki}/{csvFilename}`.')
117
+
118
+ return csvFilename
119
+
120
+
121
+ wiki_codes = fetch_wikis_codes()
122
+ labels = []
123
+ for key, value in wiki_codes.items():
124
+ labels.append(f"{value} ({key})")
125
+
126
+ wikis = list(wiki_codes.keys())
127
+ metrics = ['pages', 'edits']
128
+ submetrics = ['content', 'non-content']
129
+
130
+ timeout = 3
131
+ counter = 1
132
+
133
+ for wiki in wikis:
134
+
135
+ print(f'{counter}## {wiki_codes[wiki]} Wikipedia Files:')
136
+ if not os.path.exists(f'{wiki}'): os.makedirs(f'{wiki}')
137
+ if not os.path.exists('all-metadata'): os.makedirs('all-metadata')
138
+
139
+ for metric in metrics:
140
+
141
+ for submetric in submetrics:
142
+
143
+ try:
144
+ csvFilename = fetch_wiki_metadata(wiki, metric, submetric, timeout)
145
+ dataframe = pd.read_csv(csvFilename).iloc[-1]
146
+
147
+ except selenium.common.exceptions.ElementClickInterceptedException:
148
+ dataframe = pd.read_csv(fetch_wiki_metadata(wiki, metric, submetric, timeout*2)).iloc[-1]
149
+ timeout *= 2
150
+
151
+ retrieval_date = pd.to_datetime(dataframe['timeRange.end']).strftime('%Y-%m-%d')
152
+
153
+ if metric == 'pages':
154
+ if submetric == 'content':
155
+ pages_content_bots = dataframe['total.group-bot']+dataframe['total.name-bot']
156
+ pages_content_humans = dataframe['total.user']+dataframe['total.anonymous']
157
+ elif submetric == 'non-content':
158
+ pages_non_content_bots = dataframe['total.group-bot']+dataframe['total.name-bot']
159
+ pages_non_content_humans = dataframe['total.user']+dataframe['total.anonymous']
160
+ else: print(f'Error: this submetric: {submetric} is not supported!')
161
+
162
+ elif metric == 'edits':
163
+ if submetric == 'content':
164
+ edits_content_bots = dataframe['total.group-bot']+dataframe['total.name-bot']
165
+ edits_content_humans = dataframe['total.user']+dataframe['total.anonymous']
166
+ elif submetric == 'non-content':
167
+ edits_non_content_bots = dataframe['total.group-bot']+dataframe['total.name-bot']
168
+ edits_non_content_humans = dataframe['total.user']+dataframe['total.anonymous']
169
+ else: print(f'Error: this submetric: {submetric} is not supported!')
170
+
171
+ else: print(f'Error: this metric: {metric} is not supported!')
172
+
173
+ os.system(f'mv {wiki}--{metric}--{submetric}.csv {wiki}/{wiki}--{metric}--{submetric}.csv')
174
+
175
+ selected_language = f'{wiki_codes[wiki]} ({wiki})'
176
+
177
+ metadata = {'Wiki' : [selected_language, selected_language, selected_language, selected_language,
178
+ selected_language, selected_language, selected_language,selected_language],
179
+
180
+ 'Metric' : ['Pages', 'Pages', 'Pages', 'Pages', 'Edits', 'Edits', 'Edits', 'Edits'],
181
+
182
+ 'Sub-Metric' : ['Articles', 'Articles', 'Non-Articles', 'Non-Articles',
183
+ 'Articles', 'Articles', 'Non-Articles', 'Non-Articles'],
184
+
185
+ 'Editors' : ['Bots', 'Humans', 'Bots', 'Humans', 'Bots', 'Humans', 'Bots', 'Humans'],
186
+
187
+ 'Values' : [pages_content_bots, pages_content_humans, pages_non_content_bots, pages_non_content_humans,
188
+ edits_content_bots, edits_content_humans, edits_non_content_bots, edits_non_content_humans]}
189
+
190
+ wiki_metadata = pd.DataFrame(metadata)
191
+ wiki_metadata['Retrieval-Date'] = retrieval_date
192
+ wiki_metadata.to_csv(f'{wiki_codes[wiki].replace(" ","-")}--Wikipedia--Metadata.csv', index=False)
193
+
194
+ os.system(f'mv {wiki} all-metadata/')
195
+ counter = counter + 1
196
+ sleep(1)
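
Each per-wiki CSV written by `update-metadata.py` holds exactly eight rows in a fixed order, which `report.py` later indexes by position. A small sketch of reading one of those files back and checking that layout (`Arabic--Wikipedia--Metadata.csv` is just one instance of the `<Language>--Wikipedia--Metadata.csv` naming pattern):

```python
# Sketch: read back one per-wiki CSV produced by update-metadata.py and check
# the fixed eight-row layout that report.py indexes with iloc[0]..iloc[7].
import pandas as pd

metadata = pd.read_csv("Arabic--Wikipedia--Metadata.csv")  # example output file
assert list(metadata.columns) == ["Wiki", "Metric", "Sub-Metric", "Editors", "Values", "Retrieval-Date"]
assert len(metadata) == 8
assert list(metadata["Editors"]) == ["Bots", "Humans"] * 4  # Pages/Edits x Articles/Non-Articles
```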