---
pipeline_tag: sentence-similarity
license: apache-2.0
language:
- cs
- da
- de
- en
- es
- fi
- fr
- he
- hr
- hu
- id
- it
- nl
- 'no'
- pl
- pt
- ro
- ru
- sv
- tr
- vi
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- transformers
datasets:
- clips/mfaq
widget:
  source_sentence: "<Q>How many models can I host on HuggingFace?"
  sentences:
    - "<A>All plans come with unlimited private models and datasets."
    - "<A>AutoNLP is an automatic way to train and deploy state-of-the-art NLP models, seamlessly integrated with the Hugging Face ecosystem."
    - "<A>Based on how much training data and model variants are created, we send you a compute cost and payment link - as low as $10 per job."

---

# MFAQ

We present a multilingual FAQ retrieval model trained on the [MFAQ dataset](https://huggingface.co/datasets/clips/mfaq). Given a question, the model ranks candidate answers by relevance.

## Installation

```bash
pip install sentence-transformers transformers
```

## Usage
You can use MFAQ with sentence-transformers or directly with a HuggingFace model. 
In both cases, questions need to be prepended with `<Q>`, and answers with `<A>`.

#### Sentence Transformers
```python
from sentence_transformers import SentenceTransformer

question = "<Q>How many models can I host on HuggingFace?"
answer_1 = "<A>All plans come with unlimited private models and datasets."
answer_2 = "<A>AutoNLP is an automatic way to train and deploy state-of-the-art NLP models, seamlessly integrated with the Hugging Face ecosystem."
answer_3 = "<A>Based on how much training data and model variants are created, we send you a compute cost and payment link - as low as $10 per job."

model = SentenceTransformer('clips/mfaq')
embeddings = model.encode([question, answer_1, answer_2, answer_3])
print(embeddings)
```
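
The snippet above only prints raw embeddings. Since the model is meant to rank candidate answers for a question, here is a minimal follow-up sketch that scores and sorts the answers with `sentence_transformers.util.cos_sim`; the ranking step is our illustration, not part of the original example:

```python
from sentence_transformers import util

# Cosine similarity between the question embedding and each answer embedding
scores = util.cos_sim(embeddings[0], embeddings[1:])[0]

# Print the answers from most to least relevant
ranked = sorted(zip(scores.tolist(), [answer_1, answer_2, answer_3]), reverse=True)
for score, answer in ranked:
    print(f"{score:.3f}  {answer}")
```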

#### HuggingFace Transformers

```python
from transformers import AutoTokenizer, AutoModel
import torch

def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

question = "<Q>How many models can I host on HuggingFace?"
answer_1 = "<A>All plans come with unlimited private models and datasets."
answer_2 = "<A>AutoNLP is an automatic way to train and deploy state-of-the-art NLP models, seamlessly integrated with the Hugging Face ecosystem."
answer_3 = "<A>Based on how much training data and model variants are created, we send you a compute cost and payment link - as low as $10 per job."

tokenizer = AutoTokenizer.from_pretrained('clips/mfaq')
model = AutoModel.from_pretrained('clips/mfaq')

# Tokenize sentences
encoded_input = tokenizer([question, answer_1, answer_2, answer_3], padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, mean pooling.
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
```
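
As with the sentence-transformers variant, you would typically rank the answers by cosine similarity against the question. Here is a minimal sketch in plain PyTorch (our addition, assuming the embedding order from the snippet above):

```python
import torch.nn.functional as F

# L2-normalize so that dot products equal cosine similarities
normalized = F.normalize(sentence_embeddings, p=2, dim=1)
scores = normalized[1:] @ normalized[0]  # question vs. each answer

# Print the answers from most to least relevant
answers = [answer_1, answer_2, answer_3]
for idx in torch.argsort(scores, descending=True).tolist():
    print(f"{scores[idx].item():.3f}  {answers[idx]}")
```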

## Training
You can find the training script for the model [here](https://github.com/clips/mfaq).

## People
This model was developed by [Maxime De Bruyn](https://www.linkedin.com/in/maximedebruyn/), Ehsan Lotfi, Jeska Buhmann and Walter Daelemans.

## Citation information
```
@misc{debruyn2021mfaq,
      title={MFAQ: a Multilingual FAQ Dataset}, 
      author={Maxime De Bruyn and Ehsan Lotfi and Jeska Buhmann and Walter Daelemans},
      year={2021},
      eprint={2109.12870},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```