---
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
license: mit
datasets:
- squad
- eli5
- sentence-transformers/embedding-training-data
language:
- da
library_name: sentence-transformers
---

*A new version trained on more data but otherwise identical is available: [KennethTM/MiniLM-L6-danish-encoder-v2](https://huggingface.co/KennethTM/MiniLM-L6-danish-encoder-v2)*

# MiniLM-L6-danish-encoder 

This is a lightweight (~22 million parameters) [sentence-transformers](https://www.SBERT.net) model for Danish NLP: it maps sentences and paragraphs to a 384-dimensional dense vector space and can be used for tasks like clustering or semantic search.

The maximum sequence length is 512 tokens.

The model was not pre-trained from scratch but adapted from the English version of [sentence-transformers/all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) with a [Danish tokenizer](https://huggingface.co/KennethTM/bert-base-uncased-danish).

The model was trained on ELI5 and SQuAD data machine-translated from English to Danish.

# Usage (Sentence-Transformers)

Using this model is easy when you have [sentence-transformers](https://www.SBERT.net) installed:

```
pip install -U sentence-transformers
```
Then you can use the model like this:

```python
from sentence_transformers import SentenceTransformer
sentences = ["Kører der cykler på vejen?", "En panda løber på vejen.", "En mand kører hurtigt forbi på cykel."]

model = SentenceTransformer('KennethTM/MiniLM-L6-danish-encoder')
embeddings = model.encode(sentences)
print(embeddings)
```
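The embeddings can be compared directly for semantic search. The sketch below ranks candidate sentences against a query by cosine similarity using `sentence_transformers.util.cos_sim`; the query and candidate sentences are only illustrative.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('KennethTM/MiniLM-L6-danish-encoder')

# Illustrative query and candidates (not from a benchmark)
query = "Kører der cykler på vejen?"
candidates = ["En panda løber på vejen.", "En mand kører hurtigt forbi på cykel."]

query_embedding = model.encode(query, convert_to_tensor=True)
candidate_embeddings = model.encode(candidates, convert_to_tensor=True)

# Cosine similarity between the query and each candidate sentence
scores = util.cos_sim(query_embedding, candidate_embeddings)[0]
for sentence, score in zip(candidates, scores):
    print(f"{score.item():.3f}  {sentence}")
```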
# Usage (HuggingFace Transformers)
Without [sentence-transformers](https://www.SBERT.net), you can use the model like this: first, pass your input through the transformer model, then apply the right pooling operation on top of the contextualized word embeddings.

```python
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

#Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0] #First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

# Sentences we want sentence embeddings for
sentences = ["Kører der cykler på vejen?", "En panda løber på vejen.", "En mand kører hurtigt forbi på cykel."]

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('KennethTM/MiniLM-L6-danish-encoder')
model = AutoModel.from_pretrained('KennethTM/MiniLM-L6-danish-encoder')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

# Normalize embeddings
sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)

print("Sentence embeddings:")
print(sentence_embeddings)
```
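Since the embeddings above are L2-normalized, cosine similarity reduces to a dot product. A minimal continuation of the snippet above (reusing its `sentence_embeddings` variable) could look like this:

```python
# Continuing from the snippet above: with L2-normalized embeddings,
# the pairwise cosine similarities are simply the dot products.
similarities = sentence_embeddings @ sentence_embeddings.T

print("Pairwise cosine similarities:")
print(similarities)
```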