reranker-amharic-medium

This is a Cross Encoder model fine-tuned from rasyosef/roberta-medium-amharic using the sentence-transformers library. It computes a relevance score for each pair of texts, which can be used for text reranking and semantic search.

This model is part of the research presented in the paper "The Multilingual Curse at the Retrieval Layer: Evidence from Amharic".

Paper: The Multilingual Curse at the Retrieval Layer: Evidence from Amharic
Code: https://github.com/rasyosef/amharic-neural-ir

Model Details

Model Description

Model Type: Cross Encoder
Base model: rasyosef/roberta-medium-amharic
Maximum Sequence Length: 510 tokens
Number of Output Labels: 1 label
Language: Amharic (am)
License: MIT

Model Sources

Documentation: Sentence Transformers Documentation
Documentation: Cross Encoder Documentation
Repository: Sentence Transformers on GitHub

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import CrossEncoder

# Download from the 🤗 Hub
model = CrossEncoder("rasyosef/reranker-amharic-medium")

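# Get scores for pairs of texts.
# Approximate English gloss of the query: "The challenge facing Ethiopian coffee bound for the export market".
# The first passage is about Ethiopia's coffee export sector (relevant);
# the second is about a summit between Xi Jinping and Trump (not relevant).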
pairs = [
    ['ለውጭ ገበያ በሚቀርበው የኢትዮጵያ ቡና ላይ የተጋረጠው ፈተና', 'የኢትዮጵያ ዋነኛ የውጭ ምንዛሬ ምንጭ የሆነው ወደ ውጭ የሚላክ ቡና ዘርፍ በአሁኑ ጊዜ ከፍተኛ ውጥረት ውስጥ ገብቷል።'],
    ['ለውጭ ገበያ በሚቀርበው የኢትዮጵያ ቡና ላይ የተጋረጠው ፈተና', 'የቻይናው ፕሬዝዳንት ዚ ጂንፒንግ ከትራምፕ ጋር ባደረጉት ጉባኤ ትኩረታቸው በሁለቱ ሀገራት መካከል ለወራት ከተፈጠረ ውጥረት እና የንግድ ጦርነት በኋላ የተረገጋጋ ግንኙነትን ማስቀጠል ነበር።']
]
scores = model.predict(pairs)
print(scores.shape)
# (2,)

# Or rank different texts based on similarity to a single text
ranks = model.rank(
    'ለውጭ ገበያ በሚቀርበው የኢትዮጵያ ቡና ላይ የተጋረጠው ፈተና',
    [
        'የኢትዮጵያ ዋነኛ የውጭ ምንዛሬ ምንጭ የሆነው ወደ ውጭ የሚላክ ቡና ዘርፍ በአሁኑ ጊዜ ከፍተኛ ውጥረት ውስጥ ገብቷል።',
        'የቻይናው ፕሬዝዳንት ዚ ጂንፒንግ ከትራምፕ ጋር ባደረጉት ጉባኤ ትኩረታቸው በሁለቱ ሀገራት መካከል ለወራት ከተፈጠረ ውጥረት እና የንግድ ጦርነት በኋላ የተረገጋጋ ግንኙነትን ማስቀጠል ነበር።',
    ]
)
print(ranks)
# A list of dicts sorted by descending score, e.g.
# [{'corpus_id': ..., 'score': ...}, {'corpus_id': ..., 'score': ...}]
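
In a typical retrieve-then-rerank setup, a cross encoder rescores a short candidate list produced by a cheaper first-stage retriever. The sketch below is illustrative only: `rerank` and `toy_score` are hypothetical names, not part of sentence-transformers, and the scoring function is passed in as an argument so the example runs without downloading the model.

```python
from typing import Callable, List, Sequence, Tuple


def rerank(
    query: str,
    candidates: Sequence[str],
    score_fn: Callable[[List[Tuple[str, str]]], Sequence[float]],
    top_k: int = 10,
) -> List[Tuple[int, float]]:
    """Return (candidate_index, score) pairs sorted by descending score."""
    # Build one (query, passage) pair per candidate, as a cross encoder expects.
    pairs = [(query, passage) for passage in candidates]
    scores = score_fn(pairs)
    ranked = sorted(
        enumerate(float(s) for s in scores),
        key=lambda item: item[1],
        reverse=True,
    )
    return ranked[:top_k]


def toy_score(pairs):
    """Hypothetical stand-in scorer: fraction of query words found in the passage."""
    scores = []
    for query, passage in pairs:
        words = query.split()
        scores.append(sum(w in passage for w in words) / max(len(words), 1))
    return scores
```

With the real model, one would presumably pass `model.predict` as `score_fn` in place of `toy_score`, keeping the rest unchanged.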

Evaluation

Metrics

Cross Encoder Reranking

Dataset: amh-passage-retrieval-dev
Evaluated with CrossEncoderRerankingEvaluator with these parameters:
```
{
    "at_k": 10
}
```

Metric	Value
mrr@10	0.805
ndcg@10	0.835
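
For readers unfamiliar with the two metrics, the sketch below computes MRR@k and NDCG@k for a single query from a list of binary relevance labels given in ranked order. It follows the standard textbook definitions and is not taken from CrossEncoderRerankingEvaluator, whose exact implementation may differ in details; the function names are hypothetical. The reported values would correspond to averaging such per-query scores over the evaluation set.

```python
import math
from typing import Sequence


def mrr_at_k(relevance: Sequence[int], k: int = 10) -> float:
    """Reciprocal rank of the first relevant item within the top k, else 0."""
    for rank, rel in enumerate(relevance[:k], start=1):
        if rel > 0:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(relevance: Sequence[int], k: int = 10) -> float:
    """NDCG@k with the labels as gains and a log2 rank discount."""

    def dcg(rels):
        return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(rels, start=1))

    # The ideal ranking places all relevant items first.
    ideal = dcg(sorted(relevance, reverse=True)[:k])
    return dcg(relevance[:k]) / ideal if ideal > 0 else 0.0
```

For example, a ranking whose only relevant passage sits at position 2 gets an MRR@10 of 0.5.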

Training Details

Training Dataset

Amharic Passage Retrieval Dataset V2

Size: 491,752 training samples
Columns: query, passage, and label

Loss: BinaryCrossEntropyLoss with these parameters:

{
    "activation_fn": "torch.nn.modules.linear.Identity",
    "pos_weight": 7
}
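
To make the effect of `pos_weight` concrete, here is a minimal per-example sketch of a weighted binary cross-entropy on a raw logit. It assumes the loss follows the definition of PyTorch's `BCEWithLogitsLoss` with `pos_weight`, which is consistent with the identity activation listed above (the sigmoid is applied inside the loss). The function names are hypothetical and the code is pure Python for illustration, not the library's implementation.

```python
import math


def _softplus(x: float) -> float:
    """softplus(x) = log(1 + exp(x)), computed without overflow."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def weighted_bce_with_logits(logit: float, label: float, pos_weight: float = 7.0) -> float:
    """Binary cross-entropy on a raw logit, with the positive term scaled by pos_weight."""
    log_sig = -_softplus(-logit)           # log(sigmoid(logit))
    log_one_minus_sig = -_softplus(logit)  # log(1 - sigmoid(logit))
    return -(pos_weight * label * log_sig + (1.0 - label) * log_one_minus_sig)
```

Under this definition, a misclassified positive pair contributes seven times the loss of an equally misclassified negative pair, a common way to compensate when negatives outnumber positives.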

Training Hyperparameters

Non-Default Hyperparameters

eval_strategy: epoch
per_device_train_batch_size: 64
per_device_eval_batch_size: 64
learning_rate: 4e-05
num_train_epochs: 4
lr_scheduler_type: cosine
warmup_ratio: 0.05
fp16: True
dataloader_num_workers: 2
load_best_model_at_end: True
batch_sampler: no_duplicates
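
The hyperparameters above imply a specific number of optimizer steps and a learning-rate curve. The sketch below works through the arithmetic under stated assumptions: a single device, no gradient accumulation, a kept final partial batch, and a schedule of linear warmup followed by half-cosine decay to zero. The constant and function names are hypothetical; the actual trainer may differ in details.

```python
import math

TRAIN_SAMPLES = 491_752
BATCH_SIZE = 64
EPOCHS = 4
BASE_LR = 4e-5
WARMUP_RATIO = 0.05

# Assumed: one optimizer step per batch, last partial batch included.
steps_per_epoch = math.ceil(TRAIN_SAMPLES / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS
warmup_steps = math.ceil(total_steps * WARMUP_RATIO)


def lr_at(step: int) -> float:
    """Linear warmup to BASE_LR, then half-cosine decay to zero at total_steps."""
    if step < warmup_steps:
        return BASE_LR * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return BASE_LR * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


print(steps_per_epoch, total_steps, warmup_steps)
```

Under these assumptions the step counts come out to 7,684 per epoch and 30,736 in total, which agrees with the step column of the training logs.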

Training Logs

Epoch	Step	Training Loss	amh-passage-retrieval-dev_ndcg@10
1.0	7684	0.4048	0.8289
2.0	15368	0.2366	0.8546
3.0	23052	0.1588	0.8353
4.0	30736	0.1024	0.8551

The bold row denotes the saved checkpoint.

Framework Versions

Python: 3.11.13
Sentence Transformers: 4.1.0
Transformers: 4.52.4
PyTorch: 2.6.0+cu124
Accelerate: 1.7.0
Datasets: 3.6.0
Tokenizers: 0.21.1

Citation

@inproceedings{alemneh2026amharicir,
  title     = {The Multilingual Curse at the Retrieval Layer: Evidence from Amharic},
  author    = {Alemneh, Yosef Worku and Mekonnen, Kidist Amde and de Rijke, Maarten},
  booktitle = {Proceedings of the 1st Workshop on Multilinguality in the Era of Large Language Models (MeLLM), ACL 2026},
  year      = {2026},
}