---
datasets:
- CAMeL-Lab/BAREC-Shared-Task-2025-doc
language:
- ar
base_model:
- aubmindlab/bert-base-arabertv2
- CAMeL-Lab/readability-arabertv2-d3tok-reg
---

# MorphoArabia at BAREC 2025 Shared Task: A Hybrid Architecture with Morphological Analysis for Arabic Readability Assessment

<p align="center">
<img src="https://placehold.co/800x200/dbeafe/3b82f6?text=Barec-Readability-Assessment" alt="Barec Readability Assessment">
</p>

This repository contains the official models and results for **MorphoArabia**, the submission to the **[BAREC 2025 Shared Task](https://sites.google.com/view/barec-2025/home)** on Arabic Readability Assessment.

#### By: [Fatimah Mohamed Emad Elden](https://scholar.google.com/citations?user=CfX6eA8AAAAJ&hl=ar)

#### *Cairo University*

[Paper (arXiv)](https://arxiv.org/abs/25XX.XXXXX)
[Code (GitHub)](https://github.com/astral-fate/barec-Arabic-Readability-Assessment)
[Models (Hugging Face collection)](https://huggingface.co/collections/FatimahEmadEldin/barec-shared-task-2025-689195853f581b9a60f9bd6c)
[License](https://github.com/astral-fate/mentalqa2025/blob/main/LICENSE)

---

## Model Description

This project introduces a **morphologically aware approach** to assessing the readability of Arabic text. The system is built around a fine-tuned regression model that operates on morphologically analyzed text. For the **Constrained** and **Open** tracks of the shared task, this core model is extended into a hybrid architecture that incorporates seven engineered lexical features.

A key element of the system is its deep morphological preprocessing pipeline, which uses the **CAMeL Tools `d3tok` analyzer**. This lets the model capture linguistic complexity that surface-level tokenization often misses. The approach reached a peak **Quadratic Weighted Kappa (QWK) score of 84.2** on the sentence-level test set of the Strict track.

The model predicts a readability level on a **19-level scale**, from 1 (easiest) to 19 (hardest), for a given Arabic sentence or document.

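
As a minimal sketch of how a raw regression output can be mapped onto the 19-level scale, the helper below clips the score and shifts it into the range 1 to 19. It assumes, as the inference example in this README does, that the regression head produces a value on a 0 to 18 scale. The name `score_to_level` is hypothetical and is not part of the released code.

```python
def score_to_level(raw_score: float, num_levels: int = 19) -> int:
    """Map a raw regression score (assumed 0..num_levels-1) to a level in 1..num_levels."""
    # Clip out-of-range predictions to the valid span before rounding.
    clipped = max(0.0, min(float(num_levels - 1), raw_score))
    # Note: Python's round() sends exact halves to the nearest even integer.
    return round(clipped) + 1


# Out-of-range scores are pulled back to the ends of the scale.
for raw in (-3.2, 7.4, 25.0):
    print(raw, "->", score_to_level(raw))
```
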

-----

# Hybrid Arabic Readability Model (Constrained Track - Document Level)

This repository contains a fine-tuned hybrid model for **document-level** Arabic readability assessment. It was trained for the Constrained Track of the BAREC 2025 Shared Task.

The model combines the contextual text representation of **CAMeL-Lab/readability-arabertv2-d3tok-reg** with 7 additional lexical features to produce a regression-based readability score for a full document.

**NOTE:** This is a custom model architecture. You **must** pass `trust_remote_code=True` when loading it.

## How to Use

The model requires both the document text and a tensor containing 7 numerical features.

### Step 1: Installation

Install the necessary libraries:

```bash
pip install transformers torch pandas arabert
```

### Step 2: Full Inference Example

This example shows how to preprocess a document, extract features, and get a readability score.

```python
import torch
import numpy as np
from transformers import AutoTokenizer, AutoModel
from arabert.preprocess import ArabertPreprocessor


# --- 1. Define the Feature Engineering Function ---
def get_lexical_features(text, lexicon):
    """Return the 7 lexical features the hybrid model expects."""
    words = text.split()
    if not words:
        return [0.0] * 7
    # Words missing from the lexicon get a default difficulty of 3.0.
    word_difficulties = [lexicon.get(word, 3.0) for word in words]
    features = [
        float(len(text)),                                # character count
        float(len(words)),                               # word count
        float(np.mean([len(w) for w in words])),         # mean word length
        float(np.mean(word_difficulties)),               # mean word difficulty
        float(np.max(word_difficulties)),                # max word difficulty
        float(np.sum(np.array(word_difficulties) > 4)),  # words with difficulty > 4
        float(len([w for w in words if w not in lexicon]) / len(words)),  # out-of-lexicon ratio
    ]
    return features


# --- 2. Initialize Models and Processors ---
repo_id = "FatimahEmadEldin/Constrained-Track-Document-Bassline-Readability-Arabertv2-d3tok-reg"
arabert_preprocessor = ArabertPreprocessor(model_name="aubmindlab/bert-large-arabertv2")
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModel.from_pretrained(repo_id, trust_remote_code=True)

# --- 3. Prepare Input Document and Lexicon ---
# For a real use case, load the full SAMER lexicon.
sample_lexicon = {'جملة': 2.5, 'عربية': 3.1, 'بسيطة': 1.8, 'النص': 2.8, 'طويل': 3.5}
document_text = "هذا مثال لجملة عربية بسيطة. هذا النص أطول قليلاً من المثال السابق."

# --- 4. Run the Full Pipeline ---
preprocessed_text = arabert_preprocessor.preprocess(document_text)
numerical_features_list = get_lexical_features(preprocessed_text, sample_lexicon)
# Shape (1, 7): a batch containing one document.
numerical_features = torch.tensor([numerical_features_list], dtype=torch.float)

inputs = tokenizer(preprocessed_text, return_tensors="pt", padding=True, truncation=True, max_length=512)
inputs['extra_features'] = numerical_features  # The model expects 'extra_features'

# --- 5. Perform Inference ---
model.eval()
with torch.no_grad():
    logits = model(**inputs)[1]  # The model returns (loss, logits)

# --- 6. Process the Output ---
predicted_score = logits.item()
# Clip the raw score to 0-18, round it, and shift it to the 1-19 scale.
final_level = round(max(0, min(18, predicted_score))) + 1

print(f"Input Document: '{document_text}'")
print(f"Raw Regression Score: {predicted_score:.4f}")
print(f"Predicted Readability Level (1-19): {final_level}")
```

## ⚙️ Training Procedure

The system employs two distinct architectures, depending on the track's constraints:

* **Strict Track**: Uses a base regression model, `CAMeL-Lab/readability-arabertv2-d3tok-reg`, fine-tuned directly on the BAREC dataset.
* **Constrained and Open Tracks**: Use a hybrid model that combines the contextual representation of the Transformer with explicit numerical features. The final representation for a sentence is created by concatenating the Transformer's `[CLS]` token embedding with a 7-dimensional vector of engineered lexical features derived from the SAMER lexicon.

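
The concatenation step used by the hybrid model can be sketched in PyTorch as follows. This is an illustration under stated assumptions, not the released architecture: the class name `HybridRegressionHead`, the single linear layer, and the hidden size of 768 (the usual size for BERT-base encoders) are all assumptions, and a random tensor stands in for the encoder's `[CLS]` embedding so that the sketch runs without downloading a model. The head actually shipped with the model is defined in the repository's remote code and may differ.

```python
import torch
import torch.nn as nn


class HybridRegressionHead(nn.Module):
    """Illustrative head: [CLS] embedding + lexical features -> one score."""

    def __init__(self, hidden_size: int = 768, num_features: int = 7):
        super().__init__()
        # One linear layer over the concatenated vector (an assumption).
        self.regressor = nn.Linear(hidden_size + num_features, 1)

    def forward(self, cls_embedding: torch.Tensor, extra_features: torch.Tensor) -> torch.Tensor:
        # Concatenate along the feature dimension: (batch, 768 + 7).
        combined = torch.cat([cls_embedding, extra_features], dim=-1)
        # Drop the trailing singleton dimension: (batch,).
        return self.regressor(combined).squeeze(-1)


torch.manual_seed(0)
head = HybridRegressionHead()
cls_embedding = torch.randn(2, 768)  # stand-in for the encoder's [CLS] output
extra_features = torch.randn(2, 7)   # stand-in for the 7 lexical features
scores = head(cls_embedding, extra_features)
print(tuple(scores.shape))
```
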
A critical component of the system is its preprocessing pipeline, which uses the CAMeL Tools `d3tok` format. The `d3tok` analyzer performs a deep morphological analysis: it disambiguates words in context and then segments them into their constituent morphemes.

### Frameworks

* PyTorch
* Hugging Face Transformers

-----

## 📊 Evaluation Results

The models were evaluated on the blind test set provided by the BAREC organizers. The primary evaluation metric is **Quadratic Weighted Kappa (QWK)**, which weights each disagreement by the squared distance between the predicted and gold levels, so larger errors are penalized more severely.

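
To make the metric concrete, here is a small pure-Python sketch of QWK for integer levels in the range 1 to 19. The function name `quadratic_weighted_kappa` and the toy labels are illustrative assumptions, and this is not the organizers' official scoring script. scikit-learn's `cohen_kappa_score` with `weights="quadratic"` computes the same kind of statistic. The scores reported in this README correspond to this value multiplied by 100.

```python
def quadratic_weighted_kappa(y_true, y_pred, num_levels=19):
    """QWK for integer labels in 1..num_levels."""
    n = num_levels
    # Observed confusion matrix: rows are gold levels, columns are predictions.
    observed = [[0] * n for _ in range(n)]
    for gold, pred in zip(y_true, y_pred):
        observed[gold - 1][pred - 1] += 1
    total = len(y_true)
    hist_true = [sum(row) for row in observed]
    hist_pred = [sum(observed[i][j] for i in range(n)) for j in range(n)]

    numerator = 0.0
    denominator = 0.0
    for i in range(n):
        for j in range(n):
            # The penalty grows with the squared distance between levels.
            weight = (i - j) ** 2 / (n - 1) ** 2
            expected = hist_true[i] * hist_pred[j] / total
            numerator += weight * observed[i][j]
            denominator += weight * expected
    if denominator == 0:
        raise ValueError("QWK is undefined when all labels fall in one level")
    return 1.0 - numerator / denominator


gold = [1, 5, 10, 19]
near_miss = [2, 5, 10, 19]   # one prediction off by a single level
far_miss = [19, 5, 10, 19]   # one prediction off by 18 levels
print(quadratic_weighted_kappa(gold, near_miss))
print(quadratic_weighted_kappa(gold, far_miss))
```
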
#### Final Test Set Scores (QWK)

All scores are QWK × 100.

| Track | Task | Dev (QWK) | Test (QWK) |
| :--- | :--- | :---: | :---: |
| **Strict** | Sentence | 82.3 | **84.2** |
| | Document | 82.3\* | 79.9 |
| **Constrained** | Sentence | 81.0 | 82.9 |
| | Document | 83.5\* | 75.5 |
| **Open** | Sentence | 82.7 | 83.6 |
| | Document | 82.7\* | **79.2** |

\*Document-level dev scores are based on the performance of the sentence-level model on the validation set.


-----

## 📜 Citation

If you use this work, please cite the paper:

```
@inproceedings{eldin2025morphoarabia,
  title={{MorphoArabia at BAREC 2025 Shared Task: A Hybrid Architecture with Morphological Analysis for Arabic Readability Assessment}},
  author={Eldin, Fatimah Mohamed Emad},
  year={2025},
  booktitle={Proceedings of the BAREC 2025 Shared Task},
  eprint={25XX.XXXXX},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```