metadata
license: mit
datasets:
- squad
- eli5
- sentence-transformers/embedding-training-data
language:
- da
MiniLM-L6-danish-reranker
This is a lightweight sentence-transformers cross-encoder (~22M parameters) for Danish NLP. It takes a pair of sentences as input and outputs a relevance score. The model can therefore be used for information retrieval: given a query and a set of candidate passages, rank the candidates by their relevance to the query.
The maximum sequence length is 512 tokens, counted over both passages combined.
The model was not pre-trained from scratch; it was adapted from the English cross-encoder/ms-marco-MiniLM-L-6-v2 model using a Danish tokenizer.
It was trained on ELI5 and SQuAD data machine-translated from English to Danish.
Usage with Transformers
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model = AutoModelForSequenceClassification.from_pretrained('KennethTM/MiniLM-L6-danish-reranker')
tokenizer = AutoTokenizer.from_pretrained('KennethTM/MiniLM-L6-danish-reranker')

# Each query is paired with the passage at the same index.
queries = ['Kører der cykler på vejen?', 'Kører der cykler på vejen?']
passages = ['En panda løber på vejen.', 'En mand kører hurtigt forbi på cykel.']
features = tokenizer(queries, passages, padding=True, truncation=True, return_tensors="pt")

# Disable dropout and gradient tracking for inference.
model.eval()
with torch.no_grad():
    scores = model(**features).logits
print(scores)
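The logits printed above are unbounded, so a bounded score may be easier to read or threshold. The following is a minimal sketch, assuming the model emits a single logit per pair (as its base model cross-encoder/ms-marco-MiniLM-L-6-v2 does), so that `scores.tolist()` would yield a nested list of shape (N, 1). The values in `raw_logits` are made-up placeholders, not real model output. The sigmoid is monotonic, so the ranking order is unchanged.

```python
import math


def sigmoid(x: float) -> float:
    """Map an unbounded logit to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


# Placeholder logits standing in for `scores.tolist()`; these are
# illustrative values only, NOT output produced by the model.
raw_logits = [[-4.2], [3.1]]

# Flatten the assumed (N, 1) shape to a plain list of N floats.
flat = [row[0] for row in raw_logits]

# Squash each logit; a higher logit still gives a higher value.
probs = [sigmoid(v) for v in flat]
print(probs)
```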
Usage with SentenceTransformers
Usage is simpler with SentenceTransformers installed. The model can then be loaded and used like this:
from sentence_transformers import CrossEncoder

model = CrossEncoder('KennethTM/MiniLM-L6-danish-reranker', max_length=512)

# Each tuple is a (query, passage) pair; predict returns one score per pair.
scores = model.predict([
    ('Kører der cykler på vejen?', 'En panda løber på vejen.'),
    ('Kører der cykler på vejen?', 'En mand kører hurtigt forbi på cykel.'),
])
print(scores)
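To turn the per-pair scores into a ranking, the candidates can be sorted by score in descending order. The sketch below shows one way to do this in plain Python. The helper `rank_candidates` is a hypothetical name introduced for illustration, and `placeholder_scores` holds made-up values standing in for the output of `model.predict`, not real model output.

```python
from typing import List, Sequence, Tuple


def rank_candidates(
    candidates: Sequence[str], scores: Sequence[float]
) -> List[Tuple[str, float]]:
    """Pair each candidate with its score and sort by score, highest first."""
    if len(candidates) != len(scores):
        raise ValueError("candidates and scores must have the same length")
    paired = zip(candidates, (float(s) for s in scores))
    return sorted(paired, key=lambda item: item[1], reverse=True)


candidates = [
    'En panda løber på vejen.',
    'En mand kører hurtigt forbi på cykel.',
]

# Illustrative values only; in practice these would come from
# model.predict on (query, candidate) pairs.
placeholder_scores = [0.02, 0.91]

ranked = rank_candidates(candidates, placeholder_scores)
for text, score in ranked:
    print(f"{score:.2f}  {text}")
```

Because the helper converts each score with `float`, it should also accept the NumPy array that `model.predict` returns by default.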