---
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
license: mit
datasets:
- squad
- eli5
- sentence-transformers/embedding-training-data
- KennethTM/gooaq_pairs_danish
- sentence-transformers/gooaq
- KennethTM/squad_pairs_danish
- KennethTM/eli5_question_answer_danish
language:
- da
library_name: sentence-transformers
widget:
- source_sentence: 'Kører der cykler på vejen?'
  sentences:
  - 'I Danmark er cykler et almindeligt transportmiddel, og de har lige så stor ret til at bruge vejene som bilister. Cyklister skal dog følge færdselsreglerne og vise hensyn til andre trafikanter.'
  - 'Solen skinner, og himlen er blå. Der er ingen vind, og temperaturen er perfekt. Det er den perfekte dag til at tage en tur på landet og nyde den friske luft.'
---

# Note

*This is an updated version of [KennethTM/MiniLM-L6-danish-encoder](https://huggingface.co/KennethTM/MiniLM-L6-danish-encoder). It is trained on more data (the [GooAQ dataset](https://huggingface.co/datasets/sentence-transformers/gooaq) translated to [Danish](https://huggingface.co/datasets/KennethTM/gooaq_pairs_danish)) and is otherwise the same.*


# MiniLM-L6-danish-encoder 

This is a lightweight (~22 M parameters) [sentence-transformers](https://www.SBERT.net) model for Danish NLP. It maps sentences and paragraphs to a 384-dimensional dense vector space and can be used for tasks like clustering or semantic search.

The maximum sequence length is 512 tokens.

The model was not pre-trained from scratch but adapted from the English version of [sentence-transformers/all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) with a [Danish tokenizer](https://huggingface.co/KennethTM/bert-base-uncased-danish).

The model was trained on ELI5, SQuAD, and GooAQ data machine-translated from English to Danish.

# Usage (Sentence-Transformers)

Using this model is easy once you have [sentence-transformers](https://www.SBERT.net) installed:

```
pip install -U sentence-transformers
```
Then you can use the model like this:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

# Given a query
query = ['Kører der cykler på vejen?']

# And two passages
passage = ['I Danmark er cykler et almindeligt transportmiddel, og de har lige så stor ret til at bruge vejene som bilister. Cyklister skal dog følge færdselsreglerne og vise hensyn til andre trafikanter.', 
           'Solen skinner, og himlen er blå. Der er ingen vind, og temperaturen er perfekt. Det er den perfekte dag til at tage en tur på landet og nyde den friske luft.']

# Compute embeddings
model = SentenceTransformer("KennethTM/MiniLM-L6-danish-encoder-v2")
query_embeddings = model.encode(query)
passage_embeddings = model.encode(passage)

# To find most relevant passage for the query (closer to 1 means more similar)
cosine_scores = cos_sim(query_embeddings, passage_embeddings)
print(cosine_scores)
```
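
For readers unfamiliar with the score printed above, the following is a minimal sketch of what cosine similarity computes and how the best passage could be picked from the scores. It uses plain Python and made-up 3-dimensional vectors standing in for the 384-dimensional embeddings; the helper `cosine_similarity` and all values are illustrative assumptions, not model output.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors (hypothetical values, not real embeddings)
query_vec = [1.0, 0.0, 1.0]
passage_vecs = [
    [2.0, 0.0, 2.0],  # same direction as the query
    [0.0, 3.0, 0.0],  # orthogonal to the query
]

# One score per passage; the highest score marks the most relevant passage
scores = [cosine_similarity(query_vec, p) for p in passage_vecs]
best = max(range(len(scores)), key=scores.__getitem__)
print(scores, best)
```

Only the direction of the vectors matters, not their length, which is why the first toy passage scores close to 1 even though it is twice as long as the query vector.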
# Usage (HuggingFace Transformers)
Without [sentence-transformers](https://www.SBERT.net), you can use the model like this: first, pass your input through the transformer model, then apply the right pooling operation on top of the contextualized word embeddings.

```python
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

#Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0] #First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained("KennethTM/MiniLM-L6-danish-encoder-v2")
model = AutoModel.from_pretrained("KennethTM/MiniLM-L6-danish-encoder-v2")

# Given a query
query = ['Kører der cykler på vejen?']

# And two passages
passage = ['I Danmark er cykler et almindeligt transportmiddel, og de har lige så stor ret til at bruge vejene som bilister. Cyklister skal dog følge færdselsreglerne og vise hensyn til andre trafikanter.', 
           'Solen skinner, og himlen er blå. Der er ingen vind, og temperaturen er perfekt. Det er den perfekte dag til at tage en tur på landet og nyde den friske luft.']

# Tokenize sentences
query_encoded = tokenizer(query, padding=True, truncation=True, return_tensors='pt')
passage_encoded = tokenizer(passage, padding=True, truncation=True, return_tensors='pt')

# Compute embeddings
with torch.no_grad():
    query_features = model(**query_encoded)
    passage_features = model(**passage_encoded)

# Perform pooling
query_embeddings = mean_pooling(query_features, query_encoded['attention_mask'])
passage_embeddings = mean_pooling(passage_features, passage_encoded['attention_mask'])

# To find most relevant passage for the query (closer to 1 means more similar)
cosine_scores = F.cosine_similarity(query_embeddings, passage_embeddings)
print(cosine_scores)
```
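
To illustrate why `mean_pooling` takes the attention mask into account, here is a small sketch in plain Python for a single sentence. The helper `masked_mean` and the toy token vectors are assumptions made for illustration; the real function operates on batched PyTorch tensors.

```python
def masked_mean(token_embeddings, attention_mask):
    """Average token vectors, skipping positions where the mask is 0 (padding)."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            summed = [s + v for s, v in zip(summed, vec)]
    # Same guard as torch.clamp(..., min=1e-9): avoid dividing by zero
    count = max(sum(attention_mask), 1e-9)
    return [s / count for s in summed]

# Two real tokens followed by one padding token (hypothetical values)
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = masked_mean(tokens, mask)                        # ignores the padding vector
naive = [sum(col) / len(tokens) for col in zip(*tokens)]  # padding distorts the average
print(pooled, naive)
```

With the mask applied, the sentence embedding depends only on the real tokens, so the same sentence yields the same embedding regardless of how much padding its batch requires.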