---
language: fr
license: mit
library_name: sentence-transformers
pipeline_tag: feature-extraction
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- transformers
datasets:
- stsb_multi_mt
metrics:
- pearsonr
base_model: almanach/camembert-base
model-index:
- name: sts-camembert-base
  results:
  - task:
      name: Sentence Similarity
      type: sentence-similarity
    dataset:
      name: STSb French
      type: stsb_multi_mt
      args: fr
    metrics:
    - name: Pearson Correlation - stsb_multi_mt fr
      type: pearsonr
      value: 0.837
---

## Description

This [sentence-transformers](https://www.SBERT.net) model was obtained by fine-tuning [`almanach/camembert-base`](https://huggingface.co/almanach/camembert-base) with the [sentence-transformers](https://www.SBERT.net) library. It encodes a sentence or paragraph (up to 512 tokens) into a 768-dimensional vector. The underlying [CamemBERT](https://arxiv.org/abs/1911.03894) model is a RoBERTa-style model that is state of the art for French.

## Usage with the `sentence-transformers` library

```
pip install -U sentence-transformers
```

```python
from sentence_transformers import SentenceTransformer

sentences = ["Ceci est un exemple", "deuxième exemple"]

model = SentenceTransformer("h4c5/sts-camembert-base")
embeddings = model.encode(sentences)
print(embeddings)
```

## Usage with the `transformers` library

```
pip install -U transformers
```

```python
from transformers import AutoTokenizer, AutoModel
import torch

sentences = ["Ceci est un exemple", "deuxième exemple"]

tokenizer = AutoTokenizer.from_pretrained("h4c5/sts-camembert-base")
model = AutoModel.from_pretrained("h4c5/sts-camembert-base")
model.eval()


# Mean pooling: average the token embeddings, weighted by the attention mask
def mean_pooling(model_output, attention_mask):
    # First element of model_output contains all token embeddings
    token_embeddings = model_output[0]
    input_mask_expanded = (
        attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    )
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(
        input_mask_expanded.sum(1), min=1e-9
    )


# Tokenize and compute the token embeddings
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    model_output = model(**encoded_input)

# Mean pooling over the token embeddings
sentence_embeddings = mean_pooling(model_output, encoded_input["attention_mask"])
print(sentence_embeddings)
```
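The `sentence-transformers` pipeline applies this same mean pooling internally (see the model architecture below), so the two approaches should produce matching embeddings. A minimal, self-contained sanity check, sketched under the assumption that a loose `atol` absorbs floating-point noise:

```python
import torch
from sentence_transformers import SentenceTransformer
from transformers import AutoTokenizer, AutoModel

sentences = ["Ceci est un exemple", "deuxième exemple"]

# Path 1: embeddings from the sentence-transformers pipeline
st_embeddings = torch.from_numpy(
    SentenceTransformer("h4c5/sts-camembert-base").encode(sentences)
)

# Path 2: transformers followed by manual mean pooling
tokenizer = AutoTokenizer.from_pretrained("h4c5/sts-camembert-base")
model = AutoModel.from_pretrained("h4c5/sts-camembert-base")
encoded = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    token_embeddings = model(**encoded)[0]
mask = encoded["attention_mask"].unsqueeze(-1).float()
manual_embeddings = (token_embeddings * mask).sum(1) / mask.sum(1).clamp(min=1e-9)

# The two embedding matrices should agree up to floating-point noise
print(torch.allclose(st_embeddings, manual_embeddings, atol=1e-5))
```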
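For the sentence-similarity use case, pairs of embeddings are typically scored with cosine similarity, for example via the `util.cos_sim` helper that ships with `sentence-transformers`:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("h4c5/sts-camembert-base")
embeddings = model.encode(["Ceci est un exemple", "deuxième exemple"])

# Cosine similarity between the two sentence embeddings
print(util.cos_sim(embeddings[0], embeddings[1]))
```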
## Evaluation

The model was evaluated on the [STSb fr](https://huggingface.co/datasets/stsb_multi_mt) dataset:

```python
from datasets import load_dataset
from sentence_transformers import InputExample, SentenceTransformer, evaluation

model = SentenceTransformer("h4c5/sts-camembert-base")


def dataset_to_input_examples(dataset):
    # Rescale the gold scores from [0, 5] to [0, 1], as expected by the evaluator
    return [
        InputExample(
            texts=[example["sentence1"], example["sentence2"]],
            label=example["similarity_score"] / 5.0,
        )
        for example in dataset
    ]


sts_test_dataset = load_dataset("stsb_multi_mt", name="fr", split="test")
sts_test_examples = dataset_to_input_examples(sts_test_dataset)

sts_test_evaluator = evaluation.EmbeddingSimilarityEvaluator.from_input_examples(
    sts_test_examples, name="sts-test"
)
sts_test_evaluator(model, ".")
```

### Results

Below are the evaluation results of the model on the [`stsb_multi_mt`](https://huggingface.co/datasets/stsb_multi_mt) dataset (`fr` subset, `test` split):

| Model | Pearson Correlation | Parameters |
| :---- | :-----------------: | ---------: |
| [`h4c5/sts-camembert-base`](https://huggingface.co/h4c5/sts-camembert-base) | **0.837** | 110M |
| [`Lajavaness/sentence-camembert-base`](https://huggingface.co/Lajavaness/sentence-camembert-base) | 0.835 | 110M |
| [`inokufu/flaubert-base-uncased-xnli-sts`](https://huggingface.co/inokufu/flaubert-base-uncased-xnli-sts) | 0.828 | 137M |
| [`h4c5/sts-distilcamembert-base`](https://huggingface.co/h4c5/sts-distilcamembert-base) | 0.817 | 68M |
| [`sentence-transformers/distiluse-base-multilingual-cased-v2`](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v2) | 0.786 | 135M |

## Training

The model was trained with the parameters:

**DataLoader**:

`torch.utils.data.dataloader.DataLoader` of length 180 with parameters:

```
{'batch_size': 32, 'sampler': 'torch.utils.data.sampler.RandomSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}
```

**Loss**:

`sentence_transformers.losses.CosineSimilarityLoss.CosineSimilarityLoss`

Parameters of the `fit()` method:

```
{
    "epochs": 10,
    "evaluation_steps": 1000,
    "evaluator": "sentence_transformers.evaluation.EmbeddingSimilarityEvaluator.EmbeddingSimilarityEvaluator",
    "max_grad_norm": 1,
    "optimizer_class": "<class 'torch.optim.adamw.AdamW'>",
    "optimizer_params": {
        "lr": 2e-05
    },
    "scheduler": "WarmupLinear",
    "steps_per_epoch": null,
    "warmup_steps": 500,
    "weight_decay": 0.01
}
```

## Full Model Architecture

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: CamembertModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
```

## Citing

```bibtex
@inproceedings{reimers-2019-sentence-bert,
  title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
  author = "Reimers, Nils and Gurevych, Iryna",
  booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
  month = "11",
  year = "2019",
  publisher = "Association for Computational Linguistics",
  url = "https://arxiv.org/abs/1908.10084",
}

@inproceedings{martin2020camembert,
  title = {CamemBERT: a Tasty French Language Model},
  author = {Martin, Louis and Muller, Benjamin and Su{\'a}rez, Pedro Javier Ortiz and Dupont, Yoann and Romary, Laurent and de la Clergerie, {\'E}ric Villemonte and Seddah, Djam{\'e} and Sagot, Beno{\^\i}t},
  booktitle = {Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},
  url = {https://arxiv.org/abs/1911.03894},
  year = {2020}
}
```