devdata-search-harrier-270m-cmnrl

A bi-encoder embedding model for search over structured statistical metadata, part of the DevData Search family. It is fine-tuned from microsoft/harrier-oss-v1-270m on DevDataBench using schema-invariant fine-tuning: each record is serialized with its full schema, and for every training example the field order is permuted and fields are randomly dropped, so the encoder binds meaning to field labels rather than to serialization order. This is an embedding model that powers retrieval; it is not a hosted search service.

See the paper "Field Order Should Not Matter: Permutation-Invariant Fine-Tuning for Structured Metadata Retrieval".

Training

  • Base model: microsoft/harrier-oss-v1-270m
  • Loss: cmnrl
  • Field permutation: True; field dropout: 0.15
  • Max sequence length: 512
  • No query/document prefixes
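
The field permutation and field dropout settings above can be illustrated with a minimal sketch. It is not the training code: the helper name serialize_record, the "label: value | ..." separator (inferred from the usage example), the rule that at least one field is always kept, and the sample unit and periodicity fields are all assumptions made for illustration.

```python
import random

def serialize_record(record, rng, permute=True, dropout=0.15):
    """Serialize a metadata record as 'label: value | label: value | ...'.

    Hypothetical helper: the separator and the keep-at-least-one-field
    rule are assumptions, not taken from the actual training pipeline.
    """
    fields = list(record.items())
    if not fields:
        return ""
    # Field dropout: drop each field independently with probability `dropout`.
    kept = [f for f in fields if rng.random() >= dropout]
    if not kept:
        # Assumption: never emit an empty document; keep one random field.
        kept = [rng.choice(fields)]
    # Field-order permutation: shuffle so that position carries no signal.
    if permute:
        rng.shuffle(kept)
    return " | ".join(f"{label}: {value}" for label, value in kept)

# Illustrative record; the field names and values are made up.
record = {
    "name": "Active mobile-broadband subscriptions",
    "unit": "per 100 people",
    "periodicity": "annual",
}
rng = random.Random(0)
# Each call yields a different "view" of the same record.
views = [serialize_record(record, rng) for _ in range(3)]
```

Because every view keeps the field labels attached to their values, the only stable cue across views is the label itself, which is the property the fine-tuning is meant to exploit.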

Usage

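from sentence_transformers import SentenceTransformer

# Load the fine-tuned bi-encoder from the Hugging Face Hub.
model = SentenceTransformer("ai4data/devdata-search-harrier-270m-cmnrl")

# Queries are plain text; documents are serialized metadata records.
# No query/document prefixes are needed.
queries = ["mobile-broadband subscriptions per 100 people, reported annually"]
docs = ["name: Active mobile-broadband subscriptions | ..."]

# Encode each side independently: one embedding per input string.
q = model.encode(queries)
d = model.encode(docs)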

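To rank documents for a query, compute the cosine similarity between its embedding in q and every document embedding in d, then sort the documents by descending score.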
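
The ranking step can be sketched in plain Python as follows. The function names cosine and rank_documents are hypothetical, and the toy 2-D vectors stand in for real model.encode output, which works the same way row by row.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_documents(q, d):
    """For each query vector, return (doc_index, score) pairs, best first."""
    rankings = []
    for qv in q:
        scored = [(i, cosine(qv, dv)) for i, dv in enumerate(d)]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        rankings.append(scored)
    return rankings

# Toy embeddings standing in for q = model.encode(queries), d = model.encode(docs).
q_demo = [[1.0, 0.0]]
d_demo = [[0.0, 1.0], [2.0, 0.1], [1.0, 1.0]]
rankings = rank_documents(q_demo, d_demo)
# rankings[0] lists document indices for the first query, most similar first.
```

For larger collections a vectorized version is preferable; recent sentence-transformers releases (3.0 and later) also expose model.similarity(q, d), which returns the full query-by-document score matrix in one call.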

License

Apache-2.0. Derived from microsoft/harrier-oss-v1-270m; trained on public World Bank Data360 metadata.

Model size: 0.3B params (Safetensors); tensor type: BF16