ajan-embed-q

A Turkish-optimized sentence embedding model for retrieval / RAG, distilled from BAAI/bge-m3 into a multilingual-e5-base student over 500k Turkish web sentences. The student's 768-dim output is projected to 1024 dims to match the teacher's embedding space.

🔗 Code + recipe: github.com/AJANLAR-AI/ajanlar

Part of Ajanlar: open Turkish AI agent infra. This model is the retrieval engine under the agents, aimed at cases where multilingual models underperform on agglutinative Turkish.

Benchmark (MTEB, Turkish) — with ablation

Main score per task: NDCG@10 for retrieval, Spearman correlation for STS. multilingual-e5-base is the undistilled ablation (same base as this model).

Model                             Params   TurHistQuad (retrieval)   STS22.v2   STS17   avg
ajan-embed-q (this)               278M     0.465                     0.651      0.724   0.613
multilingual-e5-small             118M     0.433                     0.643      0.767   0.614
multilingual-e5-base (ablation)   278M     0.444                     0.651      0.777   0.624
multilingual-e5-large             560M     0.469                     0.675      0.810   0.652
BAAI/bge-m3 (teacher)             568M     0.478                     0.680      0.814   0.657

Honest reading: this is a retrieval-specialised model. On Turkish retrieval (TurHistQuad, 0.465) it beats e5-small (0.433) and e5-base (0.444) and comes within 0.004 of e5-large (0.469) at half the size, which is the RAG/agent use case it's built for. On STS it ties e5-base on STS22.v2 (0.651) but trails the whole e5 family on STS17 (0.724 vs 0.767 to 0.810), so on the three-task average it lands slightly below undistilled e5-base (0.613 vs 0.624). Use it for retrieval/RAG in Turkish, not as a general-purpose STS model.
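
For readers unfamiliar with the retrieval metric, here is a minimal sketch of how NDCG@10 can be computed for one query. It assumes linear gain and that the input list holds the relevance labels of all judged documents in ranked order; the function names are illustrative and this is not the MTEB implementation itself.

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k relevance labels (linear gain)."""
    # rank is 0-based, so the discount for the first result is log2(2) = 1.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, k=10):
    """NDCG@k: DCG of the system ranking divided by DCG of the ideal ranking."""
    ideal_dcg = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0  # no relevant documents for this query
    return dcg_at_k(ranked_relevances, k) / ideal_dcg

# One relevant document, retrieved at rank 2 instead of rank 1.
score = ndcg_at_k([0, 1, 0])
```

A reported task score is the mean of this per-query value over all queries in the test set.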

Usage

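from sentence_transformers import SentenceTransformer

# Load the model from the Hugging Face Hub.
m = SentenceTransformer("fredoline005/ajan-embed-q")

# Queries get the "query: " prefix, documents the "passage: " prefix.
# normalize_embeddings=True returns unit-length vectors, so dot product = cosine similarity.
emb = m.encode(["query: kargom ne zaman gelir?",
                "passage: Siparişler 1–3 iş günü içinde kargoya verilir."],
               normalize_embeddings=True)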

Use query: / passage: prefixes for retrieval (inherited from the e5 family).
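
To turn embeddings into a ranked result list, one option is to score each passage by its dot product with the query. The sketch below is pure Python and uses toy 2-dim vectors in place of real model output; `with_prefix` and `rank_passages` are hypothetical helper names, not part of sentence-transformers.

```python
def with_prefix(texts, kind):
    """Prepend the e5-style 'query: ' or 'passage: ' prefix to each text."""
    if kind not in ("query", "passage"):
        raise ValueError("kind must be 'query' or 'passage'")
    return [f"{kind}: {t}" for t in texts]

def rank_passages(query_vec, passage_vecs):
    """Return (indices sorted by descending score, raw scores).

    For unit-length vectors the dot product equals cosine similarity.
    """
    scores = [sum(q * p for q, p in zip(query_vec, vec)) for vec in passage_vecs]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order, scores

# Toy unit vectors standing in for model embeddings (assumption for illustration).
query_vec = [1.0, 0.0]
passage_vecs = [[0.0, 1.0], [1.0, 0.0], [0.6, 0.8]]
order, scores = rank_passages(query_vec, passage_vecs)

# With the real model, the inputs would come from something like:
#   m.encode(with_prefix(queries, "query"), normalize_embeddings=True)
#   m.encode(with_prefix(passages, "passage"), normalize_embeddings=True)
```

For large corpora a vector index would replace the linear scan, but the scoring rule stays the same.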

Training

  • Method: offline distillation. Teacher embeddings are cached over the corpus and the student is trained with an MSE loss to reproduce them; a Dense layer projects the student's 768-dim output to 1024 dims.
  • Teacher: BAAI/bge-m3 (MIT). Student: intfloat/multilingual-e5-base (MIT).
  • Data: 500k Turkish sentences (allenai/c4, tr).
  • Hardware: 1× RTX 4090, ~1 hour.
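
The distillation objective listed above can be illustrated with a toy, pure-Python sketch: a dense projection maps a student vector into the teacher's dimensionality, and the MSE against a cached teacher vector is minimised by gradient descent. This is not the training code from the repository. The dimensions (3 and 4 standing in for 768 and 1024), the learning rate, and all function names are assumptions, and only the projection is updated here, whereas the real recipe trains the student itself.

```python
def project(student_vec, weight, bias):
    """Dense layer: out[j] = sum_i student_vec[i] * weight[i][j] + bias[j]."""
    return [
        sum(x * weight[i][j] for i, x in enumerate(student_vec)) + bias[j]
        for j in range(len(bias))
    ]

def mse(pred, target):
    """Mean squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)

def sgd_step(student_vec, teacher_vec, weight, bias, lr=0.1):
    """One gradient step on the projection; returns the loss before the update."""
    pred = project(student_vec, weight, bias)
    n = len(teacher_vec)
    grad_out = [2.0 * (p - t) / n for p, t in zip(pred, teacher_vec)]  # dLoss/dpred
    for i, x in enumerate(student_vec):
        for j, g in enumerate(grad_out):
            weight[i][j] -= lr * x * g
    for j, g in enumerate(grad_out):
        bias[j] -= lr * g
    return mse(pred, teacher_vec)

# Toy data: one student embedding and its cached teacher embedding.
student_vec = [1.0, 0.5, -0.5]
teacher_vec = [0.2, -0.1, 0.4, 0.3]
weight = [[0.0] * 4 for _ in range(3)]
bias = [0.0] * 4
losses = [sgd_step(student_vec, teacher_vec, weight, bias) for _ in range(50)]
```

Caching the teacher vectors once means the teacher never has to run during training, which is presumably part of why the run fits on a single GPU.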

Limitations (honest)

  • Benchmarked on 2 Turkish tasks, no confidence intervals — margins are modest.
  • The win over e5-small is partly due to the larger base. Against the undistilled e5-base ablation, the retrieval gain is smaller (0.465 vs 0.444 on TurHistQuad) and STS17 drops (0.724 vs 0.777).
  • Distillation is bounded by the teacher; the native-Turkish-tokenizer edge is v1.
  • For a production retrain, pin a corpus snapshot and add PII and deduplication filtering.

License

Apache-2.0 (weights/recipe). Base + teacher are MIT.
