---
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
license: mit
datasets:
- squad
- eli5
- sentence-transformers/embedding-training-data
language:
- da
library_name: sentence-transformers
---
*A new version is available: [KennethTM/MiniLM-L6-danish-encoder-v2](https://huggingface.co/KennethTM/MiniLM-L6-danish-encoder-v2), trained on more data and otherwise identical to this model.*
# MiniLM-L6-danish-encoder
This is a lightweight (~22 million parameters) [sentence-transformers](https://www.SBERT.net) model for Danish NLP: it maps sentences and paragraphs to a 384-dimensional dense vector space and can be used for tasks like clustering or semantic search.
The maximum sequence length is 512 tokens.
The model was not pre-trained from scratch but adapted from the English [sentence-transformers/all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) using a [Danish tokenizer](https://huggingface.co/KennethTM/bert-base-uncased-danish).
It was trained on ELI5 and SQuAD data machine-translated from English to Danish.
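These properties can be checked directly once the model is loaded with [sentence-transformers](https://www.SBERT.net); a minimal sketch (see the usage sections below for full examples):
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('KennethTM/MiniLM-L6-danish-encoder')
print(model.max_seq_length)                      # maximum sequence length, 512 tokens
print(model.get_sentence_embedding_dimension())  # embedding dimension, 384
```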
# Usage (Sentence-Transformers)
Using this model is straightforward once you have [sentence-transformers](https://www.SBERT.net) installed:
```
pip install -U sentence-transformers
```
Then you can use the model like this:
```python
from sentence_transformers import SentenceTransformer
sentences = ["Kører der cykler på vejen?", "En panda løber på vejen.", "En mand kører hurtigt forbi på cykel."]
model = SentenceTransformer('KennethTM/MiniLM-L6-danish-encoder')
embeddings = model.encode(sentences)
print(embeddings)
```
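For semantic search, embeddings can be ranked by cosine similarity, e.g. with `util.cos_sim` from sentence-transformers. A minimal sketch reusing the sentences above (the split into query and corpus is only illustrative):
```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('KennethTM/MiniLM-L6-danish-encoder')

# Illustrative query and corpus (same sentences as above)
query = "Kører der cykler på vejen?"
corpus = ["En panda løber på vejen.", "En mand kører hurtigt forbi på cykel."]

# Encode and rank the corpus sentences by cosine similarity to the query
query_embedding = model.encode(query, convert_to_tensor=True)
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
scores = util.cos_sim(query_embedding, corpus_embeddings)
print(scores)
```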
# Usage (HuggingFace Transformers)
Without [sentence-transformers](https://www.SBERT.net), you can use the model like this: first, pass your input through the transformer model, then apply the right pooling operation on top of the contextualized word embeddings.
```python
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F
# Mean pooling: take the attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
# Sentences we want sentence embeddings for
sentences = ["Kører der cykler på vejen?", "En panda løber på vejen.", "En mand kører hurtigt forbi på cykel."]
# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('KennethTM/MiniLM-L6-danish-encoder')
model = AutoModel.from_pretrained('KennethTM/MiniLM-L6-danish-encoder')
# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)
# Perform pooling
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
# Normalize embeddings
sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)
print("Sentence embeddings:")
print(sentence_embeddings)
```
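Since the embeddings are L2-normalized at the end of the snippet above, cosine similarities reduce to a plain dot product. A small follow-up sketch, continuing from the variables defined above:
```python
# Cosine similarity matrix between all sentence pairs (embeddings are unit-length)
similarity = sentence_embeddings @ sentence_embeddings.T
print(similarity)
```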