Harrier-Arabic-Matryoshka-270m
A 270M-parameter Arabic sentence embedding model based on microsoft/harrier-oss-v1-270m, fine-tuned for Arabic semantic similarity with Matryoshka Representation Learning.
Matryoshka training was applied across the dimension ladder 640 → 512 → 256 → 128 → 64, so you can truncate the output embedding to any of these sizes with minimal quality loss. Smaller vectors are useful for faster retrieval and lighter indexes.
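To make the "lighter indexes" point concrete, the sketch below computes the raw float32 storage needed at each dimension in the ladder. The corpus size of one million vectors and the `index_size_mb` helper are illustrative assumptions, not figures from this model card, and a real vector index adds overhead for IDs and search structures.

```python
# Matryoshka dimension ladder listed in this model card.
MATRYOSHKA_DIMS = [640, 512, 256, 128, 64]

BYTES_PER_FLOAT32 = 4


def index_size_mb(num_vectors: int, dim: int) -> float:
    """Raw size in MB (10**6 bytes) of num_vectors float32 vectors of width dim."""
    return num_vectors * dim * BYTES_PER_FLOAT32 / 1e6


# Hypothetical corpus size, chosen only for illustration.
NUM_VECTORS = 1_000_000

for dim in MATRYOSHKA_DIMS:
    size = index_size_mb(NUM_VECTORS, dim)
    ratio = dim / MATRYOSHKA_DIMS[0]
    print(f"dim={dim:>3}  size={size:8.1f} MB  ({ratio:.0%} of full)")
```

Storage scales linearly with the dimension, so the 64-dim vectors take one tenth of the space of the full 640-dim vectors.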
Model details
| Field | Value |
|---|---|
| Base model | microsoft/harrier-oss-v1-270m |
| Parameters | ~270M |
| Full embedding dimension | 640 |
| Matryoshka dims | 640, 512, 256, 128, 64 |
| Max sequence length | 32,768 |
| Pooling | inherited from base |
| Language | Arabic (cross-lingual capabilities inherited from base) |
Evaluation
Spearman correlation on Arabic semantic textual similarity tasks (MTEB), full-dim embeddings:
| Task | Subset | Baseline (harrier-oss-v1-270m) | This model | Δ |
|---|---|---|---|---|
| STS17 | ar-ar | 0.7598 | 0.8135 | +0.054 |
| STS17 | en-ar | 0.7601 | 0.8145 | +0.054 |
| STS22.v2 | ar | 0.6510 | 0.6457 | -0.005 |
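For readers unfamiliar with the metric, Spearman correlation measures how well the ranking of the model's similarity scores agrees with the ranking of human judgments. The following is a minimal pure-Python sketch of that computation. The scores are made up for illustration and are not MTEB data, and MTEB's own implementation may differ in detail.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of tied values starting at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Made-up values for illustration only (not MTEB data).
gold_scores = [4.8, 1.2, 3.5, 0.4]        # human similarity judgments
model_cosines = [0.91, 0.35, 0.62, 0.40]  # model cosine similarity per sentence pair

print(spearman(gold_scores, model_cosines))
```

Only the ordering of the scores matters, so a model can score well on this metric even if its cosine values are not calibrated to the gold scale.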
Usage
Standard (full 640-dim embeddings)
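from sentence_transformers import SentenceTransformer
# Load the model; trust_remote_code=True allows custom code from the model repository to run.
model = SentenceTransformer(
    "Omartificial-Intelligence-Space/Harrier-Arabic-Matryoshka-270m",
    trust_remote_code=True,
)
sentences = [
    "تعلم اللغة العربية ممتع ومثير.",  # "Learning Arabic is fun and exciting."
    "دراسة العربية تجربة شيقة.",  # "Studying Arabic is an interesting experience."
    "القطط تحب اللعب في الحديقة.",  # "Cats love to play in the garden."
]
# normalize_embeddings=True returns unit-length (L2-normalized) vectors.
embeddings = model.encode(sentences, normalize_embeddings=True)
print(embeddings.shape)  # (3, 640)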
Truncated (Matryoshka) embeddings
Pick any dim from the ladder for smaller, faster vectors:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer(
    "Omartificial-Intelligence-Space/Harrier-Arabic-Matryoshka-270m",
    trust_remote_code=True,
    truncate_dim=256,  # one of: 640, 512, 256, 128, 64
)
embeddings = model.encode(["..."], normalize_embeddings=True)
print(embeddings.shape)  # (1, 256)
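If you want several sizes from a single encoding pass, another option is to encode once at the full 640 dimensions and slice afterwards. Slicing a unit-length vector leaves it shorter than unit length, so it should be re-normalized before computing cosine or dot-product scores. The sketch below shows this with NumPy. The `truncate_and_normalize` helper is hypothetical, and a random array stands in for the output of `model.encode(...)`. Whether the result matches loading with `truncate_dim` exactly is an assumption that may depend on the library version.

```python
import numpy as np


def truncate_and_normalize(embeddings: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components of each row, then rescale rows to unit L2 norm."""
    truncated = embeddings[:, :dim]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    # Guard against division by zero for all-zero rows.
    return truncated / np.clip(norms, 1e-12, None)


# Stand-in for the (n, 640) array returned by model.encode(...).
# This is random data, not real embeddings.
rng = np.random.default_rng(0)
full = rng.normal(size=(3, 640)).astype(np.float32)
full /= np.linalg.norm(full, axis=1, keepdims=True)

for dim in (512, 256, 128, 64):
    small = truncate_and_normalize(full, dim)
    print(dim, small.shape, np.linalg.norm(small, axis=1))
```

This keeps one stored copy of the full vectors and derives smaller ones on demand, at the cost of holding the 640-dim array in memory.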
Cosine similarity
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim
model = SentenceTransformer(
    "Omartificial-Intelligence-Space/Harrier-Arabic-Matryoshka-270m",
    trust_remote_code=True,
)
# Encode two paraphrases: "Learning Arabic is fun and exciting." / "Studying Arabic is an interesting experience."
a = model.encode("تعلم اللغة العربية ممتع ومثير.", normalize_embeddings=True)
b = model.encode("دراسة العربية تجربة شيقة.", normalize_embeddings=True)
# cos_sim returns a 1x1 tensor holding the cosine similarity of a and b.
print(cos_sim(a, b))
Intended use
- Arabic semantic textual similarity
- Arabic sentence/passage retrieval and re-ranking
- Cross-lingual retrieval against the base model's supported languages
- Clustering and deduplication of Arabic text
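As a rough illustration of the retrieval use case, the sketch below ranks corpus vectors against a query vector. It assumes all vectors are L2-normalized (as with `normalize_embeddings=True`), in which case the dot product equals cosine similarity. The `top_k` helper and the toy 2-dimensional vectors are hypothetical stand-ins for real model output.

```python
import numpy as np


def top_k(query: np.ndarray, corpus: np.ndarray, k: int = 3):
    """Return (indices, scores) of the k corpus rows most similar to the query.

    Assumes all vectors are L2-normalized, so the dot product equals cosine similarity.
    """
    scores = corpus @ query
    k = min(k, len(scores))
    best = np.argsort(-scores)[:k]  # indices sorted by descending score
    return best, scores[best]


# Toy unit vectors standing in for model.encode(..., normalize_embeddings=True) output.
corpus = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]])
query = np.array([0.8, 0.6])

indices, scores = top_k(query, corpus, k=2)
print(indices, scores)
```

A brute-force scan like this is fine for small corpora. For larger ones, an approximate nearest-neighbor index is the usual choice, and the same scoring can be reused for deduplication by flagging pairs whose similarity exceeds a chosen threshold.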
Citation
If you use this model, please cite the base model and Matryoshka Representation Learning:
@misc{harrieross,
title = {Harrier OSS v1},
author = {Microsoft},
url = {https://huggingface.co/microsoft/harrier-oss-v1-270m}
}
@inproceedings{kusupati2022matryoshka,
title = {Matryoshka Representation Learning},
author = {Kusupati, Aditya and others},
booktitle = {NeurIPS},
year = {2022}
}
License
This model inherits the license of its base model. See microsoft/harrier-oss-v1-270m for terms.