---
datasets:
- CAMeL-Lab/BAREC-Shared-Task-2025-doc
language:
- ar
base_model:
- aubmindlab/bert-base-arabertv2
- CAMeL-Lab/readability-arabertv2-d3tok-reg
---
# MorphoArabia at BAREC 2025 Shared Task: A Hybrid Architecture with Morphological Analysis for Arabic Readability Assessment
This repository contains the official models and results for **MorphoArabia**, the submission to the **[BAREC 2025 Shared Task](https://sites.google.com/view/barec-2025/home)** on Arabic Readability Assessment.
#### By: [Fatimah Mohamed Emad Elden](https://scholar.google.com/citations?user=CfX6eA8AAAAJ&hl=ar)
#### *Cairo University*
[Paper (arXiv)](https://arxiv.org/abs/25XX.XXXXX)
[Code (GitHub)](https://github.com/astral-fate/barec-Arabic-Readability-Assessment)
[Models (Hugging Face collection)](https://huggingface.co/collections/FatimahEmadEldin/barec-shared-task-2025-689195853f581b9a60f9bd6c)
[License](https://github.com/astral-fate/mentalqa2025/blob/main/LICENSE)
---
## Model Description
This project introduces a **morphologically-aware approach** for assessing the readability of Arabic text. The system is built around a fine-tuned regression model designed to process morphologically analyzed text. For the **Constrained** and **Open** tracks of the shared task, this core model is extended into a hybrid architecture that incorporates seven engineered lexical features.
A key element of this system is its deep morphological preprocessing pipeline, which uses the **CAMeL Tools d3tok analyzer**. This lets the model capture linguistic complexity that surface-level tokenization often misses. The approach reached a peak **Quadratic Weighted Kappa (QWK) score of 84.2** on the sentence-level test set of the Strict track.
The model predicts a readability score on a **19-level scale**, from 1 (easiest) to 19 (hardest), for a given Arabic sentence or document.
-----
# Hybrid Arabic Readability Model (Constrained Track - Document Level)
This repository contains a fine-tuned hybrid model for **document-level** Arabic readability assessment. It was trained for the Constrained Track of the BAREC competition.
The model combines the textual understanding of **CAMeL-Lab/readability-arabertv2-d3tok-reg** with 7 additional lexical features to produce a regression-based readability score for full documents.
**NOTE:** This is a custom model architecture. You **must** use the `trust_remote_code=True` argument when loading it.
## How to Use
The model requires both the document text and a tensor containing 7 numerical features.
### Step 1: Installation
Install the necessary libraries:
```bash
pip install transformers torch numpy pandas arabert
```
### Step 2: Full Inference Example
This example shows how to preprocess a document, extract features, and get a readability score.
```python
import torch
import numpy as np
from transformers import AutoTokenizer, AutoModel
from arabert.preprocess import ArabertPreprocessor


# --- 1. Define the Feature Engineering Function ---
def get_lexical_features(text, lexicon):
    """Return the 7 lexical features for a text, given a word -> difficulty lexicon."""
    words = text.split()
    if not words:
        return [0.0] * 7
    # Words missing from the lexicon default to a difficulty of 3.0.
    word_difficulties = [lexicon.get(word, 3.0) for word in words]
    features = [
        float(len(text)),                                # character count
        float(len(words)),                               # word count
        float(np.mean([len(w) for w in words])),         # mean word length
        float(np.mean(word_difficulties)),               # mean word difficulty
        float(np.max(word_difficulties)),                # maximum word difficulty
        float(np.sum(np.array(word_difficulties) > 4)),  # count of words with difficulty > 4
        float(len([w for w in words if w not in lexicon]) / len(words)),  # out-of-lexicon ratio
    ]
    return features


# --- 2. Initialize Models and Processors ---
repo_id = "FatimahEmadEldin/Constrained-Track-Document-Bassline-Readability-Arabertv2-d3tok-reg"
arabert_preprocessor = ArabertPreprocessor(model_name="aubmindlab/bert-large-arabertv2")
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModel.from_pretrained(repo_id, trust_remote_code=True)

# --- 3. Prepare Input Document and Lexicon ---
# For a real use case, load the full SAMER lexicon.
sample_lexicon = {'جملة': 2.5, 'عربية': 3.1, 'بسيطة': 1.8, 'النص': 2.8, 'طويل': 3.5}
document_text = "هذا مثال لجملة عربية بسيطة. هذا النص أطول قليلاً من المثال السابق."

# --- 4. Run the Full Pipeline ---
preprocessed_text = arabert_preprocessor.preprocess(document_text)
numerical_features_list = get_lexical_features(preprocessed_text, sample_lexicon)
# Shape (1, 7): a batch containing one document.
numerical_features = torch.tensor([numerical_features_list], dtype=torch.float)

inputs = tokenizer(preprocessed_text, return_tensors="pt", padding=True, truncation=True, max_length=512)
inputs['extra_features'] = numerical_features  # The model expects 'extra_features'

# --- 5. Perform Inference ---
model.eval()
with torch.no_grad():
    logits = model(**inputs)[1]  # The model returns (loss, logits)

# --- 6. Process the Output ---
predicted_score = logits.item()
# Clamp the raw score to [0, 18], round, then shift to the 1-19 scale.
final_level = round(max(0, min(18, predicted_score))) + 1

print(f"Input Document: '{document_text}'")
print(f"Raw Regression Score: {predicted_score:.4f}")
print(f"Predicted Readability Level (1-19): {final_level}")
```
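The last step of the example maps the raw regression output to a 1-19 level. A minimal sketch of that mapping as a reusable helper is shown here, assuming (as in the example) that raw scores lie on a 0-18 scale. The name `score_to_level` is hypothetical and is not part of the released code.

```python
def score_to_level(raw_score: float, num_levels: int = 19) -> int:
    """Clamp a raw score to [0, num_levels - 1], round it, and shift to 1-based levels.

    Hypothetical helper that mirrors the clamp-round-shift step of the inference example.
    """
    clamped = max(0.0, min(float(num_levels - 1), raw_score))
    # Python's built-in round() rounds exact halves to the nearest even integer.
    return round(clamped) + 1


for raw in (-0.7, 6.4, 2.5, 25.0):
    print(f"{raw:>5} -> level {score_to_level(raw)}")
```

Because `round` uses round-half-to-even, a raw score of exactly 2.5 maps to level 3 rather than level 4. Out-of-range scores are pulled back to the nearest valid level.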
## ⚙️ Training Procedure
The system employs two distinct architectures based on the track's constraints:
* **Strict Track**: This track uses a base regression model, `CAMeL-Lab/readability-arabertv2-d3tok-reg`, fine-tuned directly on the BAREC dataset.
* **Constrained and Open Tracks**: These tracks utilize a hybrid model. This architecture combines the deep contextual understanding of the Transformer with explicit numerical features. The final representation for a sentence is created by concatenating the Transformer's `[CLS]` token embedding with a 7-dimensional vector of engineered lexical features derived from the SAMER lexicon.
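
The concatenation used by the hybrid model can be sketched as a small PyTorch module. This is an illustration under stated assumptions, not the released architecture: the class name `HybridRegressionHead`, the single linear layer, and the dropout rate are hypothetical, and random tensors stand in for the encoder output and the lexical features. The hidden size of 768 matches a BERT-base encoder.

```python
import torch
import torch.nn as nn


class HybridRegressionHead(nn.Module):
    """Illustrative head: joins a [CLS] embedding with extra features and regresses one score."""

    def __init__(self, hidden_size: int = 768, num_extra_features: int = 7, dropout: float = 0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        # Input width is the encoder hidden size plus the number of engineered features.
        self.regressor = nn.Linear(hidden_size + num_extra_features, 1)

    def forward(self, cls_embedding: torch.Tensor, extra_features: torch.Tensor) -> torch.Tensor:
        # (batch, hidden) and (batch, 7) -> (batch, hidden + 7)
        combined = torch.cat([cls_embedding, extra_features], dim=-1)
        # (batch, 1) -> (batch,)
        return self.regressor(self.dropout(combined)).squeeze(-1)


head = HybridRegressionHead()
head.eval()
cls_embedding = torch.randn(2, 768)  # stand-in for the encoder's [CLS] output
extra_features = torch.randn(2, 7)   # stand-in for the seven lexical features
with torch.no_grad():
    scores = head(cls_embedding, extra_features)
print(scores.shape)
```

In this sketch the lexical features are concatenated without scaling. Features such as character count have a much larger range than embedding values, so a real implementation may normalize them first.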
A critical component of the system is its preprocessing pipeline, which relies on the CAMeL Tools `d3tok` format. The `d3tok` analyzer performs deep morphological analysis: it disambiguates each word in context and then segments it into its constituent morphemes.
### Frameworks
* PyTorch
* Hugging Face Transformers
-----
### 📊 Evaluation Results
The models were evaluated on the blind test set provided by the BAREC organizers. The primary metric for evaluation is the **Quadratic Weighted Kappa (QWK)**, which penalizes larger disagreements more severely.
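
For reference, QWK can be computed directly from its definition. The following is a minimal pure-Python sketch, assuming integer labels on the 1-19 scale. The function name and the toy labels are illustrative only; this is not the official shared-task scorer.

```python
def quadratic_weighted_kappa(y_true, y_pred, num_levels=19):
    """Quadratic Weighted Kappa for integer labels in 1..num_levels.

    Assumes at least two distinct levels occur; otherwise the denominator is zero.
    """
    n = len(y_true)
    # Observed confusion matrix: rows are gold levels, columns are predicted levels.
    observed = [[0] * num_levels for _ in range(num_levels)]
    for t, p in zip(y_true, y_pred):
        observed[t - 1][p - 1] += 1
    true_hist = [sum(row) for row in observed]
    pred_hist = [sum(observed[i][j] for i in range(num_levels)) for j in range(num_levels)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(num_levels):
        for j in range(num_levels):
            # The penalty grows with the square of the distance between levels.
            weight = (i - j) ** 2 / (num_levels - 1) ** 2
            # Expected count if gold and predicted labels were independent.
            expected = true_hist[i] * pred_hist[j] / n
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * expected
    return 1.0 - weighted_observed / weighted_expected


gold = [1, 5, 10, 19]
near_miss = [1, 6, 10, 18]  # off by one level twice
far_miss = [19, 10, 5, 1]   # large disagreements
print(quadratic_weighted_kappa(gold, gold))
print(quadratic_weighted_kappa(gold, near_miss))
print(quadratic_weighted_kappa(gold, far_miss))
```

Near misses cost little while distant errors are penalized heavily, which is why QWK suits an ordinal 19-level scale. scikit-learn's `cohen_kappa_score` with `weights="quadratic"` should give the same value when it is passed the full list of levels via `labels`.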
#### Dev and Test Set Scores (QWK)
| Track | Task | Dev (QWK) | Test (QWK) |
| :--- | :--- | :---: | :---: |
| **Strict** | Sentence | 82.3 | **84.2** |
| | Document | 82.3\* | 79.9 |
| **Constrained** | Sentence | 81.0 | 82.9 |
| | Document | 83.5\* | 75.5 |
| **Open** | Sentence | 82.7 | 83.6 |
| | Document | 82.7\* | **79.2** |
\*Document-level dev scores are based on the performance of the sentence-level model on the validation set.
-----
## 📜 Citation
If you use this work, please cite the paper:
```bibtex
@inproceedings{eldin2025morphoarabia,
title={{MorphoArabia at BAREC 2025 Shared Task: A Hybrid Architecture with Morphological Analysis for Arabic Readability Assessment}},
author={Eldin, Fatimah Mohamed Emad},
year={2025},
booktitle={Proceedings of the BAREC 2025 Shared Task},
eprint={25XX.XXXXX},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```