ajan-embed-q

A Turkish-optimized sentence embedding model for retrieval and RAG, distilled from BAAI/bge-m3 into a multilingual-e5-base student over 500k Turkish web sentences. The student's 768-dim output is projected to 1024 dims to match the teacher's embedding space.

🔗 Code + recipe: github.com/AJANLAR-AI/ajanlar

Part of Ajanlar, an open Turkish AI agent infrastructure project. This model is the retrieval engine under the agents, built for a setting where multilingual models underperform on agglutinative Turkish.

Benchmark (MTEB, Turkish) — with ablation

Each column reports the task's main score: NDCG@10 for retrieval, Spearman correlation for STS. multilingual-e5-base is the undistilled ablation (the same base as this model, without distillation).

Model	Params	TurHistQuad (retrieval)	STS22.v2	STS17	avg
ajan-embed-q (this)	278M	0.465	0.651	0.724	0.613
multilingual-e5-small	118M	0.433	0.643	0.767	0.614
multilingual-e5-base (ablation)	278M	0.444	0.651	0.777	0.624
multilingual-e5-large	560M	0.469	0.675	0.810	0.652
BAAI/bge-m3 (teacher)	568M	0.478	0.680	0.814	0.657

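Honest reading: this is a retrieval-specialised model. On Turkish retrieval (TurHistQuad) it scores 0.465, beating e5-small (0.433) and e5-base (0.444) and nearly matching e5-large (0.469) at half the parameter count; this is the RAG/agent use case it is built for. On STS it shows no gain over its own base: it ties e5-base on STS22.v2 (0.651) and trails every e5 model on STS17 (0.724 vs 0.767 to 0.810). As a result, on the 3-task average it lands roughly on par with, but slightly below, undistilled e5-base (0.613 vs 0.624). Use it for retrieval/RAG in Turkish, not as a general-purpose STS model.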
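The averages in the table can be re-derived from the per-task columns. The sketch below copies the published scores and recomputes the unweighted mean per model; the names `SCORES`, `mean_of_tasks`, and `retrieval_gap` are illustrative only. Because the inputs are already rounded to three decimals, a recomputed mean may differ from the listed one in the last digit.

```python
# Scores copied from the benchmark table:
# (TurHistQuad, STS22.v2, STS17, listed avg).
SCORES = {
    "ajan-embed-q": (0.465, 0.651, 0.724, 0.613),
    "multilingual-e5-small": (0.433, 0.643, 0.767, 0.614),
    "multilingual-e5-base": (0.444, 0.651, 0.777, 0.624),
    "multilingual-e5-large": (0.469, 0.675, 0.810, 0.652),
    "BAAI/bge-m3": (0.478, 0.680, 0.814, 0.657),
}


def mean_of_tasks(row):
    """Unweighted mean of the three task scores (the listed avg is ignored)."""
    return sum(row[:3]) / 3


def retrieval_gap(model, baseline):
    """Difference in TurHistQuad NDCG@10 between two models in SCORES."""
    return SCORES[model][0] - SCORES[baseline][0]


for name, row in SCORES.items():
    print(f"{name}: recomputed {mean_of_tasks(row):.4f}, listed {row[3]:.3f}")

gap = retrieval_gap("ajan-embed-q", "multilingual-e5-base")
print(f"retrieval gap vs undistilled base: {gap:.3f}")
```

Comparing against multilingual-e5-base rather than e5-small isolates the effect of distillation, since both models share the same base.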

Usage

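from sentence_transformers import SentenceTransformer

# Load the model from the Hugging Face Hub.
m = SentenceTransformer("fredoline005/ajan-embed-q")

# Prefix queries with "query: " and documents with "passage: ".
emb = m.encode(["query: kargom ne zaman gelir?",
                "passage: Siparişler 1–3 iş günü içinde kargoya verilir."],
               normalize_embeddings=True)  # unit-length vectors: dot product equals cosine similarity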

Use query: / passage: prefixes for retrieval (inherited from the e5 family).
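For retrieval, the prefixed embeddings are typically compared by dot product, which equals cosine similarity when `normalize_embeddings=True`. The following is a minimal sketch of that ranking step, not part of the released code: `add_prefix` and `search` are hypothetical helper names, and `model` is assumed to be any object exposing a `SentenceTransformer`-style `encode` method.

```python
def add_prefix(texts, kind):
    """Prepend the e5-style 'query: ' or 'passage: ' prefix to each text."""
    if kind not in ("query", "passage"):
        raise ValueError("kind must be 'query' or 'passage'")
    return [f"{kind}: {text}" for text in texts]


def search(model, query, passages):
    """Rank passages by dot product with the query embedding.

    Returns a list of (passage, score) pairs, best match first.
    """
    q_vec = model.encode(add_prefix([query], "query"),
                         normalize_embeddings=True)[0]
    p_vecs = model.encode(add_prefix(passages, "passage"),
                          normalize_embeddings=True)
    # With unit-length vectors the dot product is the cosine similarity.
    scores = [float(sum(q * p for q, p in zip(q_vec, vec))) for vec in p_vecs]
    return sorted(zip(passages, scores), key=lambda pair: pair[1], reverse=True)


# Usage (downloads the model on first call):
#   from sentence_transformers import SentenceTransformer
#   model = SentenceTransformer("fredoline005/ajan-embed-q")
#   for passage, score in search(model, "kargom ne zaman gelir?", my_passages):
#       print(f"{score:.3f}  {passage}")
```

The per-element loop keeps the sketch dependency-free; for a real corpus, a matrix product over the embedding arrays or a vector index would be the usual choice.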

Training

Method: offline distillation. Teacher embeddings are computed once and cached over the corpus; the student is then trained with an MSE loss to reproduce them. A Dense layer projects the student's 768-dim output to the teacher's 1024 dims.
Teacher: BAAI/bge-m3 (MIT). Student: intfloat/multilingual-e5-base (MIT).
Data: 500k Turkish sentences (allenai/c4, tr).
Hardware: 1× RTX 4090, ~1 hour.
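The training objective described above can be written out as follows. This is a sketch under stated assumptions rather than the exact loss from the recipe: $s_\theta(x_i) \in \mathbb{R}^{768}$ denotes the student embedding of sentence $x_i$, $P_\phi$ the Dense projection to 1024 dims, $t(x_i) \in \mathbb{R}^{1024}$ the cached teacher embedding, and $N$ the number of sentences averaged over. The per-dimension factor $1/1024$ is an assumption that corresponds to a mean-reduced MSE; pooling and normalization details are not specified here.

```latex
\mathcal{L}(\theta, \phi)
  = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{1024}
    \left\lVert P_{\phi}\!\left(s_{\theta}(x_i)\right) - t(x_i) \right\rVert_2^2
```

Because $t(x_i)$ is cached, the teacher is never run during training; only $\theta$ and $\phi$ receive gradients.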

Limitations (honest)

Benchmarked on 2 Turkish tasks, no confidence intervals — margins are modest.
The win over e5-small is partly due to the larger base: against the undistilled e5-base ablation, the retrieval gain shrinks from 0.032 to 0.021 NDCG@10.
Distillation is bounded by the teacher; the native-Turkish-tokenizer edge is v1.
For a production retrain, pin a corpus snapshot and add PII and deduplication filtering.
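As a rough illustration of the filtering step mentioned in the last item, the sketch below drops exact duplicates (after whitespace and case normalization) and sentences containing email-like or phone-like strings. Everything in it is hypothetical: the names `clean_corpus` and `normalize` and both regular expressions are assumptions made for this example, and the patterns are far from a complete PII filter.

```python
import re

# Illustrative patterns only; a production PII filter needs much more coverage.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"(?:\+?\d[\s().-]?){10,}")  # 10 or more digits, loosely separated


def normalize(sentence):
    """Collapse whitespace and case-fold so trivial variants compare equal."""
    return " ".join(sentence.split()).casefold()


def clean_corpus(sentences):
    """Drop sentences with email/phone-like strings, then exact duplicates.

    Keeps the first occurrence of each sentence and preserves input order.
    """
    seen = set()
    kept = []
    for sentence in sentences:
        if EMAIL_RE.search(sentence) or PHONE_RE.search(sentence):
            continue
        key = normalize(sentence)
        if not key or key in seen:
            continue
        seen.add(key)
        kept.append(sentence)
    return kept
```

Exact-match deduplication is the cheapest option; near-duplicate detection (for example with MinHash) would catch more web boilerplate at higher cost.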

License

Apache-2.0 (weights and recipe). The base and teacher models are MIT-licensed.

