KennethTM's picture
Update README.md
62ba551 verified
|
raw
history blame
1.97 kB
metadata
license: mit
datasets:
  - squad
  - eli5
  - sentence-transformers/embedding-training-data
language:
  - da

MiniLM-L6-danish-reranker

This is a lightweight (~22 M parameters) sentence-transformers model for Danish NLP: It takes two sentences as input and outputs a relevance score. Therefore, the model can be used for information retrieval, e.g. given a query and candidate matches, rank the candidates by their relevance.

The maximum sequence length is 512 tokens (for both passages).

The model was not pre-trained from scratch but adapted from the English version of cross-encoder/ms-marco-MiniLM-L-6-v2 with a Danish tokenizer.

Trained on ELI5 and SQUAD data machine translated from English to Danish.

Usage with Transformers

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model = AutoModelForSequenceClassification.from_pretrained('KennethTM/MiniLM-L6-danish-reranker')
tokenizer = AutoTokenizer.from_pretrained('KennethTM/MiniLM-L6-danish-reranker')
features = tokenizer(['Kører der cykler på vejen?', 'Kører der cykler på vejen?'], ['En panda løber på vejen.', 'En mand kører hurtigt forbi på cykel.'],  padding=True, truncation=True, return_tensors="pt")

model.eval()
with torch.no_grad():
    scores = model(**features).logits
    print(scores)

Usage with SentenceTransformers

The usage becomes easier when you have SentenceTransformers installed. Then, you can use the pre-trained models like this:

from sentence_transformers import CrossEncoder
model = CrossEncoder('KennethTM/MiniLM-L6-danish-reranker', max_length=512)
scores = model.predict([('Kører der cykler på vejen?', 'En panda løber på vejen.'), ('Kører der cykler på vejen?', 'En mand kører hurtigt forbi på cykel.')])