Arabic SPLADE — Phase 1

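An inference-free SPLADE model for Arabic: queries are encoded with a frozen SparseStaticEmbedding lookup (no transformer forward pass at query time), while documents are encoded with MLMTransformer + SpladePooling.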

Architecture

Asymmetric (Router: query=SparseStaticEmbedding, doc=MLMTransformer+SpladePooling)

Base model: aubmindlab/bert-base-arabertv2
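The two branches can be illustrated with a toy NumPy sketch. This is not the sentence-transformers implementation: the function names, the 6-entry vocabulary, and all numbers are made up for illustration. It assumes the usual SPLADE max pooling, max over tokens of log(1 + ReLU(logit)), on the document side and a per-token static weight lookup on the query side.

```python
import numpy as np


def splade_pool(logits: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Document side: max over tokens of log(1 + relu(logit)) per vocab entry.

    logits: (seq_len, vocab_size) MLM-head outputs.
    attention_mask: (seq_len,) with 1 for real tokens and 0 for padding.
    """
    activated = np.log1p(np.maximum(logits, 0.0))
    activated = activated * attention_mask[:, None]  # zero out padding rows
    return activated.max(axis=0)


def static_query_embedding(token_ids, token_weights: np.ndarray) -> np.ndarray:
    """Query side: no transformer pass, only a lookup of one weight per token."""
    vec = np.zeros_like(token_weights)
    ids = np.unique(np.asarray(token_ids, dtype=int))
    vec[ids] = token_weights[ids]
    return vec


# Hypothetical toy data: 3 token positions (the last is padding), vocab of 6.
logits = np.array([
    [2.0, -1.0, 0.0, 0.5, -3.0, 0.0],
    [0.0, 1.0, -2.0, 3.0, 0.0, 0.0],
    [9.0, 9.0, 9.0, 9.0, 9.0, 9.0],  # padding row, ignored via the mask
])
attention_mask = np.array([1.0, 1.0, 0.0])
token_weights = np.array([0.1, 0.7, 0.0, 1.2, 0.3, 0.9])  # made-up static weights

doc_vec = splade_pool(logits, attention_mask)
query_vec = static_query_embedding([3, 1, 3], token_weights)

# Relevance is the dot product of the two sparse vocabulary-sized vectors.
score = float(query_vec @ doc_vec)
print(score)
```

Only the document branch needs a transformer forward pass, so document vectors can be computed offline and query encoding reduces to tokenization plus a lookup.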

Training

  • Dataset: oddadmix/arabic-triplets-large (104K triplets, 92K unique passages)
  • Loss: SpladeLoss(SparseMultipleNegativesRankingLoss, q_reg=5e-5, d_reg=3e-5)
  • Batch size: 16 per GPU, gradient accumulation 4
  • Learning rate: 2e-5
  • Epochs: 1
  • Mixed precision (AMP): fp16
  • Sampler: NO_DUPLICATES
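The loss line above combines a ranking term with two sparsity penalties. A minimal NumPy sketch of that combination follows. It assumes the regularizer is the FLOPS penalty (sum over the vocabulary of the squared batch-mean activation) and simplifies the ranking term to in-batch negatives with dot-product scores. The function names are illustrative, not the library's API.

```python
import numpy as np


def flops_regularizer(embeddings: np.ndarray) -> float:
    """FLOPS penalty: sum over vocab entries of the squared batch-mean activation."""
    return float(np.sum(np.mean(embeddings, axis=0) ** 2))


def in_batch_ranking_loss(q: np.ndarray, d: np.ndarray) -> float:
    """Cross-entropy over dot-product scores; document i is the positive for query i."""
    scores = q @ d.T                                    # (batch, batch)
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))


def splade_loss(q: np.ndarray, d: np.ndarray,
                q_reg: float = 5e-5, d_reg: float = 3e-5) -> float:
    """Ranking loss plus weighted sparsity penalties on queries and documents."""
    return (in_batch_ranking_loss(q, d)
            + q_reg * flops_regularizer(q)
            + d_reg * flops_regularizer(d))


# Toy batch of two query/document pairs over a 2-entry vocabulary.
q = np.array([[1.0, 0.0], [0.0, 1.0]])
d = np.array([[1.0, 0.0], [0.0, 1.0]])
total = splade_loss(q, d)
print(total)
```

With weights of this magnitude the penalties stay small next to the ranking term. They mainly discourage vocabulary entries from being active across the whole batch.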

Evaluation on Arabic NanoBEIR (13 datasets)

Metric     Score
NDCG@10    0.2995
MRR@10     0.3584

For reference, BM25 scores 0.3824 NDCG@10 and 0.4483 MRR@10 on the same benchmark, so this Phase 1 model is still below the lexical baseline on both metrics.
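For readers unfamiliar with the two metrics, here is a small pure-Python sketch of how they are computed for a single query. It assumes binary relevance labels, and the document IDs are made up. The scores reported above come from the benchmark's own evaluator, not from this code.

```python
import math


def mrr_at_k(ranked_ids, relevant, k=10):
    """Reciprocal rank of the first relevant document within the top k, else 0."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked_ids, relevant, k=10):
    """Binary-relevance NDCG: discounted gain divided by the ideal ordering's gain."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical ranking for one query: relevant documents appear at ranks 2 and 4.
ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}

mrr = mrr_at_k(ranked, relevant)
ndcg = ndcg_at_k(ranked, relevant)
print(mrr, ndcg)
```

Per-query values are then averaged over all queries of a dataset, and the dataset scores are averaged in turn.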

Training Details

Base encoder: AraBERTv2 base (12-layer BERT, 64K-token vocabulary)

Hardware

  • 2× NVIDIA TITAN RTX (23.5 GB each)
  • DDP via torchrun
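Taken together with the batch settings listed under Training, the hardware implies the following per-step arithmetic. This is a rough sketch under two assumptions: both GPUs took part in training, and the dataset holds exactly 104,000 triplets (the card only gives the rounded "104K").

```python
import math

per_gpu_batch = 16       # from the Training section
num_gpus = 2             # from the Hardware section
grad_accum_steps = 4     # from the Training section
num_triplets = 104_000   # assumption: "104K" taken as exactly 104,000

# Samples contributing to each optimizer step across all GPUs.
effective_batch = per_gpu_batch * num_gpus * grad_accum_steps

# Optimizer steps needed for the single training epoch.
steps_per_epoch = math.ceil(num_triplets / effective_batch)

print(effective_batch, steps_per_epoch)
```

Gradient accumulation enlarges the batch per optimizer step but not the pool of in-batch negatives, which stays at the per-device batch size unless embeddings are gathered across devices.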

Usage

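from sentence_transformers.sparse_encoder import SparseEncoder

model = SparseEncoder("Abdelkareem/arabic-splade-asymmetric")

# The model is asymmetric, so send queries and documents through their own routes.
# Query: "What is the capital of Egypt?"
query_embeddings = model.encode_query(["ما هي عاصمة مصر؟"])
# Document: "Cairo is the capital of Egypt and its largest city."
document_embeddings = model.encode_document(["القاهرة هي عاصمة مصر وأكبر مدنها."])
print(query_embeddings.shape, document_embeddings.shape)

# Relevance score between the query and the document
print(model.similarity(query_embeddings, document_embeddings))

# Decode the top-weighted tokens of each embedding
for embeddings in (query_embeddings, document_embeddings):
    for decoded in model.decode(embeddings, top_k=10):
        print(decoded)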