KennethTM committed 28185e8 (verified; parent: 53ed9c8) · Update README.md
---
license: mit
datasets:
- squad
- eli5
- sentence-transformers/embedding-training-data
- sentence-transformers/gooaq
- KennethTM/squad_pairs_danish
- KennethTM/eli5_question_answer_danish
- KennethTM/gooaq_pairs_danish
language:
- da
---

*This is an updated version of [KennethTM/MiniLM-L6-danish-reranker](https://huggingface.co/KennethTM/MiniLM-L6-danish-reranker). This version is simply trained on more data (the [GooAQ dataset](https://huggingface.co/datasets/sentence-transformers/gooaq) translated to [Danish](https://huggingface.co/KennethTM/gooaq_pairs_danish)) and is otherwise the same.*

# MiniLM-L6-danish-reranker-v2

This is a lightweight (~22M parameters) [sentence-transformers](https://www.SBERT.net) cross-encoder model for Danish NLP: it takes two texts as input and outputs a relevance score. The model can therefore be used for information retrieval, e.g. given a query and a set of candidate passages, rank the candidates by their relevance to the query.

The maximum sequence length is 512 tokens (query and passage combined).

The model was not pre-trained from scratch but adapted from the English [cross-encoder/ms-marco-MiniLM-L-6-v2](https://huggingface.co/cross-encoder/ms-marco-MiniLM-L-6-v2) model with a [Danish tokenizer](https://huggingface.co/KennethTM/bert-base-uncased-danish).

The model was trained on ELI5, SQuAD, and GooAQ data machine translated from English to Danish.

## Usage with Transformers

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model = AutoModelForSequenceClassification.from_pretrained('KennethTM/MiniLM-L6-danish-reranker-v2')
tokenizer = AutoTokenizer.from_pretrained('KennethTM/MiniLM-L6-danish-reranker-v2')

# Two examples where the first is a positive (relevant) pair and the second is a negative (irrelevant) pair
queries = ['Kører der cykler på vejen?',
           'Kører der cykler på vejen?']
passages = ['I Danmark er cykler et almindeligt transportmiddel, og de har lige så stor ret til at bruge vejene som bilister. Cyklister skal dog følge færdselsreglerne og vise hensyn til andre trafikanter.',
            'Solen skinner, og himlen er blå. Der er ingen vind, og temperaturen er perfekt. Det er den perfekte dag til at tage en tur på landet og nyde den friske luft.']

features = tokenizer(queries, passages, padding=True, truncation=True, return_tensors="pt")

model.eval()
with torch.no_grad():
    scores = model(**features).logits

# The scores are raw logits; these can be transformed into probabilities using the sigmoid function
print(scores)
print(torch.sigmoid(scores))
```

## Usage with SentenceTransformers

Usage becomes easier when you have [SentenceTransformers](https://www.sbert.net/) installed. Then, you can use the pre-trained model like this:

```python
from sentence_transformers import CrossEncoder
import numpy as np

sigmoid_numpy = lambda x: 1 / (1 + np.exp(-x))

model = CrossEncoder('KennethTM/MiniLM-L6-danish-reranker-v2', max_length=512)

# Score (query, passage) pairs, e.g. the queries and passages from the example above
scores = model.predict(list(zip(queries, passages)))

print(scores)
print(sigmoid_numpy(scores))
```
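
For the reranking use case described above, the scores are used to order candidate passages for a single query: score each (query, passage) pair and sort descending. A minimal sketch of that final ranking step, where the `scores` array is a made-up stand-in for the output of `model.predict` on four candidates:

```python
import numpy as np

# Hypothetical relevance logits for four candidate passages
# (stand-ins for the output of model.predict on (query, passage) pairs)
candidates = ["passage A", "passage B", "passage C", "passage D"]
scores = np.array([1.7, -0.3, 4.2, 0.9])

# Rank candidates from most to least relevant
order = np.argsort(scores)[::-1]
ranking = [(candidates[i], float(scores[i])) for i in order]

for passage, score in ranking:
    print(f"{score:+.1f}  {passage}")
```

The same ordering applies whether you sort the raw logits or their sigmoid-transformed probabilities, since the sigmoid is monotonic.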