---
datasets:
- CAMeL-Lab/BAREC-Shared-Task-2025-doc
language:
- ar
base_model:
- aubmindlab/bert-base-arabertv2
- CAMeL-Lab/readability-arabertv2-d3tok-reg
---

# MorphoArabia at BAREC 2025 Shared Task: A Hybrid Architecture with Morphological Analysis for Arabic Readability Assessment


This repository contains the official models and results for **MorphoArabia**, the submission to the **[BAREC 2025 Shared Task](https://www.google.com/search?q=https://sites.google.com/view/barec-2025/home)** on Arabic Readability Assessment.

#### By: [Fatimah Mohamed Emad Elden](https://scholar.google.com/citations?user=CfX6eA8AAAAJ&hl=ar)

#### *Cairo University*

[![Paper](https://img.shields.io/badge/arXiv-25XX.XXXXX-b31b1b.svg)](https://arxiv.org/abs/25XX.XXXXX)
[![Code](https://img.shields.io/badge/GitHub-Code-blue)](https://github.com/astral-fate/barec-Arabic-Readability-Assessment)
[![HuggingFace](https://img.shields.io/badge/HuggingFace-Page-F9D371)](https://huggingface.co/collections/FatimahEmadEldin/barec-shared-task-2025-689195853f581b9a60f9bd6c)
[![License](https://img.shields.io/badge/License-MIT-lightgrey)](https://github.com/astral-fate/mentalqa2025/blob/main/LICENSE)

---

## Model Description

This project introduces a **morphologically-aware approach** for assessing the readability of Arabic text. The system is built around a regression model fine-tuned on morphologically analyzed text. For the **Constrained** and **Open** tracks of the shared task, this core model is extended into a hybrid architecture that incorporates seven engineered lexical features.

A key element of the system is its deep morphological preprocessing pipeline, which uses the **CAMeL Tools d3tok analyzer**. This lets the model capture linguistic complexity that surface-level tokenization often misses. The approach achieved a peak **Quadratic Weighted Kappa (QWK) score of 84.2** on the sentence-level test set of the Strict track.

The model predicts a readability score on a **19-level scale**, from 1 (easiest) to 19 (hardest), for a given Arabic sentence or document.
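
The two ideas in the description above, the 19-level scale and the QWK metric, can be illustrated with a minimal sketch. It assumes that raw regression outputs are clipped to the range 0 to 18, rounded, and shifted by one (the same mapping used in the inference example in this README), and it computes QWK from a confusion matrix with quadratic weights. The helper names `to_level` and `quadratic_weighted_kappa` are hypothetical and the sample values are made up; the official shared-task scorer may differ in detail.

```python
import numpy as np

NUM_LEVELS = 19  # readability levels 1 (easiest) to 19 (hardest)


def to_level(raw_score: float) -> int:
    """Clip a raw regression output to [0, 18], round it, and shift to 1..19."""
    clipped = max(0.0, min(float(NUM_LEVELS - 1), raw_score))
    return int(round(clipped)) + 1


def quadratic_weighted_kappa(gold, pred, num_levels: int = NUM_LEVELS) -> float:
    """QWK between two sequences of integer levels in 1..num_levels.

    Returns NaN if every gold and predicted label is the same single level,
    because the expected disagreement is then zero.
    """
    gold = np.asarray(gold, dtype=int) - 1  # shift to 0-based indices
    pred = np.asarray(pred, dtype=int) - 1

    # Observed confusion matrix: rows are gold levels, columns are predictions.
    observed = np.zeros((num_levels, num_levels), dtype=float)
    for g, p in zip(gold, pred):
        observed[g, p] += 1.0

    # Quadratic weights: the penalty grows with the squared level distance.
    idx = np.arange(num_levels)
    weights = (idx[:, None] - idx[None, :]) ** 2 / (num_levels - 1) ** 2

    # Expected matrix under independence of the two marginal distributions.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()

    return 1.0 - (weights * observed).sum() / (weights * expected).sum()


if __name__ == "__main__":
    raw_scores = [0.2, 6.4, 11.7, 30.0]  # made-up regression outputs
    gold_levels = [1, 8, 13, 19]         # made-up gold labels
    pred_levels = [to_level(s) for s in raw_scores]
    print("Predicted levels:", pred_levels)
    print("QWK:", quadratic_weighted_kappa(gold_levels, pred_levels))
```

Because the weights are quadratic, a prediction that is off by four levels costs sixteen times as much as one that is off by a single level, which is why QWK suits an ordinal scale like this one.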

-----

# Hybrid Arabic Readability Model (Constrained Track - Document Level)

This repository contains a fine-tuned hybrid model for **document-level** Arabic readability assessment. It was trained for the Constrained Track of the BAREC competition.

The model combines the textual understanding of **CAMeL-Lab/readability-arabertv2-d3tok-reg** with 7 additional lexical features to produce a regression-based readability score for full documents.

**NOTE:** This is a custom model architecture. You **must** pass `trust_remote_code=True` when loading it.

## How to Use

The model requires both the document text and a tensor containing 7 numerical features.

### Step 1: Installation

Install the necessary libraries:

```bash
pip install transformers torch pandas arabert
```

### Step 2: Full Inference Example

This example shows how to preprocess a document, extract features, and get a readability score.

```python
import torch
import numpy as np
from transformers import AutoTokenizer, AutoModel
from arabert.preprocess import ArabertPreprocessor


# --- 1. Define the Feature Engineering Function ---
def get_lexical_features(text, lexicon):
    """Return the 7 lexical features for `text` as a list of floats."""
    words = text.split()
    if not words:
        return [0.0] * 7
    # Words missing from the lexicon get a default difficulty of 3.0.
    word_difficulties = [lexicon.get(word, 3.0) for word in words]
    features = [
        float(len(text)),                                # character count
        float(len(words)),                               # word count
        float(np.mean([len(w) for w in words])),         # mean word length
        float(np.mean(word_difficulties)),               # mean word difficulty
        float(np.max(word_difficulties)),                # max word difficulty
        float(np.sum(np.array(word_difficulties) > 4)),  # words with difficulty > 4
        float(len([w for w in words if w not in lexicon]) / len(words)),  # out-of-lexicon ratio
    ]
    return features


# --- 2. Initialize Models and Processors ---
repo_id = "FatimahEmadEldin/Constrained-Track-Document-Bassline-Readability-Arabertv2-d3tok-reg"
arabert_preprocessor = ArabertPreprocessor(model_name="aubmindlab/bert-large-arabertv2")
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModel.from_pretrained(repo_id, trust_remote_code=True)

# --- 3. Prepare Input Document and Lexicon ---
# For a real use case, load the full SAMER lexicon.
sample_lexicon = {'جملة': 2.5, 'عربية': 3.1, 'بسيطة': 1.8, 'النص': 2.8, 'طويل': 3.5}
document_text = "هذا مثال لجملة عربية بسيطة. هذا النص أطول قليلاً من المثال السابق."

# --- 4. Run the Full Pipeline ---
preprocessed_text = arabert_preprocessor.preprocess(document_text)
numerical_features_list = get_lexical_features(preprocessed_text, sample_lexicon)
numerical_features = torch.tensor([numerical_features_list], dtype=torch.float)

inputs = tokenizer(preprocessed_text, return_tensors="pt", padding=True, truncation=True, max_length=512)
inputs['extra_features'] = numerical_features  # The model expects 'extra_features'

# --- 5. Perform Inference ---
model.eval()
with torch.no_grad():
    logits = model(**inputs)[1]  # The model returns (loss, logits)

# --- 6. Process the Output ---
predicted_score = logits.item()
# Clip the raw score to [0, 18], round it, and shift to the 1-19 scale.
final_level = round(max(0, min(18, predicted_score))) + 1

print(f"Input Document: '{document_text}'")
print(f"Raw Regression Score: {predicted_score:.4f}")
print(f"Predicted Readability Level (1-19): {final_level}")
```

## ⚙️ Training Procedure

The system employs two distinct architectures, depending on the track's constraints:

* **Strict Track**: This track uses a base regression model, `CAMeL-Lab/readability-arabertv2-d3tok-reg`, fine-tuned directly on the BAREC dataset.
* **Constrained and Open Tracks**: These tracks use a hybrid model that combines the contextual understanding of the Transformer with explicit numerical features. The final representation for a sentence is created by concatenating the Transformer's `[CLS]` token embedding with a 7-dimensional vector of engineered lexical features derived from the SAMER lexicon.

A critical component of the system is its preprocessing pipeline, which leverages the CAMeL Tools `d3tok` format.
The `d3tok` analyzer performs a deep morphological analysis by disambiguating words in context and then segmenting them into their constituent morphemes.

### Frameworks

* PyTorch
* Hugging Face Transformers

-----

### 📊 Evaluation Results

The models were evaluated on the blind test set provided by the BAREC organizers. The primary evaluation metric is the **Quadratic Weighted Kappa (QWK)**, which penalizes larger disagreements more severely.

#### Final Test Set Scores (QWK)

| Track | Task | Dev (QWK) | Test (QWK × 100) |
| :--- | :--- | :---: | :---: |
| **Strict** | Sentence | 0.823 | **84.2** |
| | Document | 0.823\* | 79.9 |
| **Constrained** | Sentence | 0.810 | 82.9 |
| | Document | 0.835\* | 75.5 |
| **Open** | Sentence | 0.827 | 83.6 |
| | Document | 0.827\* | **79.2** |

\*Document-level dev scores are based on the performance of the sentence-level model on the validation set.

-----

## 📜 Citation

If you use this work, please cite the paper:

```
@inproceedings{eldin2025morphoarabia,
    title={{MorphoArabia at BAREC 2025 Shared Task: A Hybrid Architecture with Morphological Analysis for Arabic Readability Assessment}},
    author={Eldin, Fatimah Mohamed Emad},
    year={2025},
    booktitle={Proceedings of the BAREC 2025 Shared Task},
    eprint={25XX.XXXXX},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```