FatimahEmadEldin committed on
Commit
afc53bc
·
verified ·
1 Parent(s): cbefe5c

Create README.md

Files changed (1)
  1. README.md +170 -0
README.md ADDED
@@ -0,0 +1,170 @@
---
datasets:
- CAMeL-Lab/BAREC-Shared-Task-2025-doc
language:
- ar
base_model:
- aubmindlab/bert-base-arabertv2
- CAMeL-Lab/readability-arabertv2-d3tok-reg
---

# MorphoArabia at BAREC 2025 Shared Task: A Hybrid Architecture with Morphological Analysis for Arabic Readability Assessment

<p align="center">
  <img src="https://placehold.co/800x200/dbeafe/3b82f6?text=Barec-Readability-Assessment" alt="Barec Readability Assessment">
</p>

This repository contains the official models and results for **MorphoArabia**, the submission to the **[BAREC 2025 Shared Task](https://www.google.com/search?q=https://sites.google.com/view/barec-2025/home)** on Arabic Readability Assessment.

#### By: [Fatimah Mohamed Emad Elden](https://scholar.google.com/citations?user=CfX6eA8AAAAJ&hl=ar)

#### *Cairo University*

[![Paper](https://img.shields.io/badge/arXiv-25XX.XXXXX-b31b1b.svg)](https://arxiv.org/abs/25XX.XXXXX)
[![Code](https://img.shields.io/badge/GitHub-Code-blue)](https://github.com/astral-fate/barec-Arabic-Readability-Assessment)
[![HuggingFace](https://img.shields.io/badge/HuggingFace-Page-F9D371)](https://huggingface.co/collections/FatimahEmadEldin/barec-shared-task-2025-689195853f581b9a60f9bd6c)
[![License](https://img.shields.io/badge/License-MIT-lightgrey)](https://github.com/astral-fate/mentalqa2025/blob/main/LICENSE)

---

## Model Description

This project introduces a **morphologically-aware approach** to assessing the readability of Arabic text. The system is built around a fine-tuned regression model that operates on morphologically analyzed text. For the **Constrained** and **Open** tracks of the shared task, this core model is extended into a hybrid architecture that incorporates seven engineered lexical features.

A key element of the system is its morphological preprocessing pipeline, which uses the **CAMeL Tools d3tok analyzer**. This lets the model capture linguistic complexity that surface-level tokenization often misses. The best configuration reached a **Quadratic Weighted Kappa (QWK) score of 84.2** on the Strict-track sentence-level test set.

The model predicts a readability score on a **19-level scale**, from 1 (easiest) to 19 (hardest), for a given Arabic sentence or document.

-----

# Hybrid Arabic Readability Model (Constrained Track - Document Level)

This repository contains a fine-tuned hybrid model for **document-level** Arabic readability assessment. It was trained for the Constrained Track of the BAREC competition.

The model combines the textual understanding of **CAMeL-Lab/readability-arabertv2-d3tok-reg** with 7 additional lexical features to produce a regression-based readability score for full documents.

**NOTE:** This is a custom model architecture. You **must** use the `trust_remote_code=True` argument when loading it.

## How to Use

The model requires both the document text and a tensor containing 7 numerical features.

### Step 1: Installation
Install the necessary libraries:
```bash
pip install transformers torch numpy pandas arabert
```

### Step 2: Full Inference Example

This example shows how to preprocess a document, extract features, and get a readability score.

```python
import torch
import numpy as np
from transformers import AutoTokenizer, AutoModel
from arabert.preprocess import ArabertPreprocessor

# --- 1. Define the Feature Engineering Function ---
def get_lexical_features(text, lexicon):
    """Return the 7 lexical features for a whitespace-tokenized text."""
    words = text.split()
    if not words:
        return [0.0] * 7
    # Words missing from the lexicon get a default difficulty of 3.0.
    word_difficulties = [lexicon.get(word, 3.0) for word in words]
    features = [
        float(len(text)),                                # character count
        float(len(words)),                               # word count
        float(np.mean([len(w) for w in words])),         # mean word length
        float(np.mean(word_difficulties)),               # mean word difficulty
        float(np.max(word_difficulties)),                # max word difficulty
        float(np.sum(np.array(word_difficulties) > 4)),  # words with difficulty > 4
        float(len([w for w in words if w not in lexicon]) / len(words)),  # out-of-lexicon ratio
    ]
    return features

# --- 2. Initialize Models and Processors ---
repo_id = "FatimahEmadEldin/Constrained-Track-Document-Bassline-Readability-Arabertv2-d3tok-reg"
arabert_preprocessor = ArabertPreprocessor(model_name="aubmindlab/bert-large-arabertv2")
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModel.from_pretrained(repo_id, trust_remote_code=True)

# --- 3. Prepare Input Document and Lexicon ---
# For a real use case, load the full SAMER lexicon.
sample_lexicon = {'جملة': 2.5, 'عربية': 3.1, 'بسيطة': 1.8, 'النص': 2.8, 'طويل': 3.5}
document_text = "هذا مثال لجملة عربية بسيطة. هذا النص أطول قليلاً من المثال السابق."

# --- 4. Run the Full Pipeline ---
preprocessed_text = arabert_preprocessor.preprocess(document_text)
numerical_features_list = get_lexical_features(preprocessed_text, sample_lexicon)
numerical_features = torch.tensor([numerical_features_list], dtype=torch.float)

inputs = tokenizer(preprocessed_text, return_tensors="pt", padding=True, truncation=True, max_length=512)
inputs['extra_features'] = numerical_features  # The model expects 'extra_features'

# --- 5. Perform Inference ---
model.eval()
with torch.no_grad():
    logits = model(**inputs)[1]  # The model returns (loss, logits)

# --- 6. Process the Output ---
predicted_score = logits.item()
# Clamp the raw score to [0, 18], round, then shift to the 1-19 scale.
final_level = round(max(0, min(18, predicted_score))) + 1

print(f"Input Document: '{document_text}'")
print(f"Raw Regression Score: {predicted_score:.4f}")
print(f"Predicted Readability Level (1-19): {final_level}")
```
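
The example above uses a five-word stand-in dictionary. As a rough sketch of how a full lexicon might be loaded into the same `{word: difficulty}` shape, the helper below parses a tab-separated file. The function name `load_lexicon`, the column names `lemma` and `readability`, and the sample rows are assumptions made for illustration; they are not taken from the SAMER distribution, so adjust them to the file you actually have.

```python
import csv
import io


def load_lexicon(tsv_file, word_column="lemma", level_column="readability"):
    """Build a {word: difficulty} dict from a tab-separated file object.

    The column names are hypothetical defaults, not the real SAMER schema.
    Rows with a missing or non-numeric level are skipped.
    """
    reader = csv.DictReader(tsv_file, delimiter="\t")
    lexicon = {}
    for row in reader:
        try:
            lexicon[row[word_column]] = float(row[level_column])
        except (KeyError, ValueError, TypeError):
            continue
    return lexicon


# In-memory stand-in for a lexicon file (illustrative rows only).
sample_tsv = io.StringIO(
    "lemma\treadability\n"
    "جملة\t2.5\n"
    "بسيطة\t1.8\n"
    "bad_row\tn/a\n"
)
lexicon = load_lexicon(sample_tsv)
print(lexicon)
```

The resulting dictionary can be passed to `get_lexical_features` in place of `sample_lexicon`.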

## ⚙️ Training Procedure

The system employs two distinct architectures based on the track's constraints:

* **Strict Track**: This track uses a base regression model, `CAMeL-Lab/readability-arabertv2-d3tok-reg`, fine-tuned directly on the BAREC dataset.
* **Constrained and Open Tracks**: These tracks use a hybrid model that combines the Transformer's contextual representation with explicit numerical features. The final representation of a sentence is the concatenation of the Transformer's `[CLS]` token embedding with a 7-dimensional vector of engineered lexical features derived from the SAMER lexicon.
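
The concatenation step in the hybrid model can be sketched in PyTorch as follows. This is a minimal illustration, not the released architecture: the class name `HybridRegressionHead`, the single linear layer, the dropout rate, and the random stand-in tensors are all assumptions. The hidden size of 768 corresponds to a BERT-base encoder.

```python
import torch
from torch import nn


class HybridRegressionHead(nn.Module):
    """Illustrative head: [CLS] embedding + extra features -> one score."""

    def __init__(self, hidden_size=768, num_extra_features=7, dropout=0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        # Input width is the encoder width plus the 7 lexical features.
        self.regressor = nn.Linear(hidden_size + num_extra_features, 1)

    def forward(self, cls_embedding, extra_features):
        # Concatenate along the feature dimension: (batch, 768 + 7).
        combined = torch.cat([cls_embedding, extra_features], dim=-1)
        # One regression score per example: (batch,).
        return self.regressor(self.dropout(combined)).squeeze(-1)


head = HybridRegressionHead().eval()
cls_embedding = torch.randn(2, 768)  # stand-in for the encoder's [CLS] output
extra_features = torch.randn(2, 7)   # stand-in for the lexical feature vectors
with torch.no_grad():
    scores = head(cls_embedding, extra_features)
print(scores.shape)
```

Because the lexical features are on a different scale from the embedding, a real implementation might normalize them before concatenation; the source does not say whether this model does.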

A critical component of the system is its preprocessing pipeline, which uses the CAMeL Tools `d3tok` format. The `d3tok` analyzer performs a deep morphological analysis by disambiguating words in context and then segmenting them into their constituent morphemes.

### Frameworks

* PyTorch
* Hugging Face Transformers

-----

## 📊 Evaluation Results

The models were evaluated on the blind test set provided by the BAREC organizers. The primary evaluation metric is **Quadratic Weighted Kappa (QWK)**, which penalizes larger disagreements more severely.
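
To make the penalty concrete, here is a small dependency-free sketch of QWK for integer levels 1 to 19. It uses the standard definition with weights `(i - j)**2 / (N - 1)**2`; the function name and the toy label lists are illustrative, and this is not the shared task's official scoring script.

```python
def quadratic_weighted_kappa(y_true, y_pred, num_levels=19):
    """QWK for integer labels in 1..num_levels (standard definition)."""
    n = len(y_true)
    # Observed confusion matrix: rows are gold levels, columns are predictions.
    observed = [[0] * num_levels for _ in range(num_levels)]
    for gold, pred in zip(y_true, y_pred):
        observed[gold - 1][pred - 1] += 1
    hist_true = [sum(row) for row in observed]
    hist_pred = [sum(observed[i][j] for i in range(num_levels))
                 for j in range(num_levels)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(num_levels):
        for j in range(num_levels):
            # The penalty grows with the square of the distance between levels.
            weight = (i - j) ** 2 / (num_levels - 1) ** 2
            weighted_observed += weight * observed[i][j]
            # Expected count if gold and predicted labels were independent.
            weighted_expected += weight * hist_true[i] * hist_pred[j] / n
    if weighted_expected == 0:
        # Every label is the same single level; treated here as full agreement.
        return 1.0
    return 1.0 - weighted_observed / weighted_expected


gold = [1, 5, 10, 15, 19]
near_miss = [2, 5, 10, 15, 19]   # one prediction off by 1 level
far_miss = [19, 5, 10, 15, 19]   # one prediction off by 18 levels
print(quadratic_weighted_kappa(gold, gold))
print(quadratic_weighted_kappa(gold, near_miss))
print(quadratic_weighted_kappa(gold, far_miss))
```

Both noisy lists contain exactly one wrong prediction, yet the far miss lowers the score much more than the near miss, which is the behavior that makes QWK suitable for an ordinal 19-level scale.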

#### Dev and Test Set Scores (QWK × 100)

| Track | Task | Dev QWK | Test QWK |
| :--- | :--- | :---: | :---: |
| **Strict** | Sentence | 82.3 | **84.2** |
| | Document | 82.3\* | 79.9 |
| **Constrained** | Sentence | 81.0 | 82.9 |
| | Document | 83.5\* | 75.5 |
| **Open** | Sentence | 82.7 | 83.6 |
| | Document | 82.7\* | **79.2** |

\*Document-level dev scores are based on the performance of the sentence-level model on the validation set.

-----

## 📜 Citation

If you use this work, please cite the paper:

```
@inproceedings{eldin2025morphoarabia,
  title={{MorphoArabia at BAREC 2025 Shared Task: A Hybrid Architecture with Morphological Analysis for Arabic Readability Assessment}},
  author={Eldin, Fatimah Mohamed Emad},
  year={2025},
  booktitle={Proceedings of the BAREC 2025 Shared Task},
  eprint={25XX.XXXXX},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```