---
pipeline_tag: sentence-similarity
license: apache-2.0
language:
- cs
- da
- de
- en
- es
- fi
- fr
- he
- hr
- hu
- id
- it
- nl
- 'no'
- pl
- pt
- ro
- ru
- sv
- tr
- vi
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- transformers
datasets:
- clips/mfaq
widget:
- source_sentence: "How many models can I host on HuggingFace?"
  sentences:
  - "All plans come with unlimited private models and datasets."
  - "AutoNLP is an automatic way to train and deploy state-of-the-art NLP models, seamlessly integrated with the Hugging Face ecosystem."
  - "Based on how much training data and model variants are created, we send you a compute cost and payment link - as low as $10 per job."
---

# MFAQ

We present a multilingual FAQ retrieval model trained on the [MFAQ dataset](https://huggingface.co/datasets/clips/mfaq). It ranks candidate answers according to a given question.

## Installation
```
pip install sentence-transformers transformers
```

## Usage
You can use MFAQ with sentence-transformers or directly with a HuggingFace model.
In both cases, questions need to be prepended with `<Q>`, and answers with `<A>`. An illustrative sketch of ranking answers with the resulting embeddings appears near the end of this card.

#### Sentence Transformers
```python
from sentence_transformers import SentenceTransformer

question = "<Q>How many models can I host on HuggingFace?"
answer_1 = "<A>All plans come with unlimited private models and datasets."
answer_2 = "<A>AutoNLP is an automatic way to train and deploy state-of-the-art NLP models, seamlessly integrated with the Hugging Face ecosystem."
answer_3 = "<A>Based on how much training data and model variants are created, we send you a compute cost and payment link - as low as $10 per job."

model = SentenceTransformer('clips/mfaq')
embeddings = model.encode([question, answer_1, answer_2, answer_3])
print(embeddings)
```

#### HuggingFace Transformers
```python
from transformers import AutoTokenizer, AutoModel
import torch


def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


question = "<Q>How many models can I host on HuggingFace?"
answer_1 = "<A>All plans come with unlimited private models and datasets."
answer_2 = "<A>AutoNLP is an automatic way to train and deploy state-of-the-art NLP models, seamlessly integrated with the Hugging Face ecosystem."
answer_3 = "<A>Based on how much training data and model variants are created, we send you a compute cost and payment link - as low as $10 per job."

tokenizer = AutoTokenizer.from_pretrained('clips/mfaq')
model = AutoModel.from_pretrained('clips/mfaq')

# Tokenize sentences
encoded_input = tokenizer([question, answer_1, answer_2, answer_3], padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, mean pooling.
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
```

## Training
You can find the training script for the model [here](https://github.com/clips/mfaq).

## People
This model was developed by [Maxime De Bruyn](https://www.linkedin.com/in/maximedebruyn/), Ehsan Lotfi, Jeska Buhmann and Walter Daelemans.
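The usage examples above stop at computing embeddings. Since the model's purpose is ranking answers for a question, here is a minimal sketch of that last step, assuming cosine similarity as the scoring function (a common choice for sentence-transformers models; check the training setup if you need the exact scoring function). `util.cos_sim` is part of sentence-transformers; the `rank_answers` helper is ours, for illustration only, and not part of the model's API.

```python
from sentence_transformers import SentenceTransformer, util


def rank_answers(model, question, answers):
    # Embed the question and all candidate answers in one batch
    embeddings = model.encode([question] + answers)
    # Cosine similarity between the question (row 0) and each answer
    scores = util.cos_sim(embeddings[0], embeddings[1:])[0]
    # Sort candidates from most to least similar
    return sorted(zip(answers, scores.tolist()), key=lambda pair: pair[1], reverse=True)


model = SentenceTransformer('clips/mfaq')
question = "<Q>How many models can I host on HuggingFace?"
answers = [
    "<A>All plans come with unlimited private models and datasets.",
    "<A>AutoNLP is an automatic way to train and deploy state-of-the-art NLP models, seamlessly integrated with the Hugging Face ecosystem.",
    "<A>Based on how much training data and model variants are created, we send you a compute cost and payment link - as low as $10 per job.",
]
for answer, score in rank_answers(model, question, answers):
    print(f"{score:.3f}  {answer}")
```

The same scoring can be applied to the `sentence_embeddings` produced by the plain HuggingFace Transformers example; only the embedding step differs.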
## Citation information
```
@misc{debruyn2021mfaq,
  title={MFAQ: a Multilingual FAQ Dataset},
  author={Maxime De Bruyn and Ehsan Lotfi and Jeska Buhmann and Walter Daelemans},
  year={2021},
  eprint={2109.12870},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```