---
language: fr
license: mit
library_name: sentence-transformers
pipeline_tag: feature-extraction
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- transformers
datasets:
- stsb_multi_mt
metrics:
- pearsonr
base_model: cmarkea/distilcamembert-base
model-index:
- name: sts-distilcamembert-base
results:
- task:
name: Sentence Similarity
type: sentence-similarity
dataset:
name: STSb French
type: stsb_multi_mt
args: fr
metrics:
- name: Pearson Correlation - stsb_multi_mt fr
type: pearsonr
value: 0.8165
---
## Description
This [sentence-transformers](https://www.SBERT.net) model was obtained by fine-tuning
[`cmarkea/distilcamembert-base`](https://huggingface.co/cmarkea/distilcamembert-base) with the
[sentence-transformers](https://www.SBERT.net) library.
It encodes a sentence or paragraph (514 tokens maximum) into a 768-dimensional vector.
The underlying [DistilCamemBERT](https://huggingface.co/papers/2205.11111) model is a distillation of
[CamemBERT](https://arxiv.org/abs/1911.03894) that halves the number of parameters and improves inference time.
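As a quick sanity check, the embedding dimension and the maximum sequence length can be read directly from the loaded model (a minimal sketch; these are standard `sentence-transformers` attributes, not specific to this card):
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("h4c5/sts-distilcamembert-base")

print(model.get_sentence_embedding_dimension())  # dimension of the sentence vectors (768)
print(model.max_seq_length)  # maximum number of tokens taken into account per input
```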
## Usage with the `sentence-transformers` library
```
pip install -U sentence-transformers
```
```python
from sentence_transformers import SentenceTransformer
sentences = ["Ceci est un exemple", "deuxième exemple"]
model = SentenceTransformer('h4c5/sts-distilcamembert-base')
embeddings = model.encode(sentences)
print(embeddings)
```
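Since the model targets sentence similarity, the embeddings are typically compared with cosine similarity. A minimal sketch (the example sentences are illustrative):
```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("h4c5/sts-distilcamembert-base")

embeddings = model.encode(
    ["Ceci est un exemple", "deuxième exemple"], convert_to_tensor=True
)

# Cosine similarity between the two sentence embeddings
print(util.cos_sim(embeddings[0], embeddings[1]))
```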
## Usage with the `transformers` library
```
pip install -U transformers
```
```python
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("h4c5/sts-distilcamembert-base")
model = AutoModel.from_pretrained("h4c5/sts-distilcamembert-base")
model.eval()


# Mean pooling: average the token embeddings, ignoring padding tokens
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element contains all token embeddings
    input_mask_expanded = (
        attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    )
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(
        input_mask_expanded.sum(1), min=1e-9
    )


sentences = ["Ceci est un exemple", "deuxième exemple"]

# Tokenize and compute token embeddings
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    model_output = model(**encoded_input)

# Mean pooling over tokens to get one vector per sentence
sentence_embeddings = mean_pooling(model_output, encoded_input["attention_mask"])
print(sentence_embeddings)
```
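To compare the pooled vectors, cosine similarity can be computed directly in PyTorch. A short continuation of the snippet above (`sentence_embeddings` is the tensor produced there):
```python
import torch.nn.functional as F

# L2-normalize the pooled embeddings so that their dot product equals the cosine similarity
normalized = F.normalize(sentence_embeddings, p=2, dim=1)
print(normalized @ normalized.T)  # pairwise cosine-similarity matrix
```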
## Evaluation
The model was evaluated on the [STSb fr](https://huggingface.co/datasets/stsb_multi_mt) dataset:
```python
from datasets import load_dataset
from sentence_transformers import SentenceTransformer, InputExample, evaluation

model = SentenceTransformer("h4c5/sts-distilcamembert-base")


def dataset_to_input_examples(dataset):
    # Gold scores are in [0, 5]; rescale them to [0, 1]
    return [
        InputExample(
            texts=[example["sentence1"], example["sentence2"]],
            label=example["similarity_score"] / 5.0,
        )
        for example in dataset
    ]


sts_test_dataset = load_dataset("stsb_multi_mt", name="fr", split="test")
sts_test_examples = dataset_to_input_examples(sts_test_dataset)

sts_test_evaluator = evaluation.EmbeddingSimilarityEvaluator.from_input_examples(
    sts_test_examples, name="sts-test"
)
print(sts_test_evaluator(model, "."))
```
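The Pearson correlation reported below can also be reproduced without the evaluator, by correlating the cosine similarities with the gold scores. A minimal sketch, assuming `scipy` is installed and reusing the `model` and `sts_test_dataset` defined above:
```python
from scipy.stats import pearsonr
from sentence_transformers import util

embeddings1 = model.encode(
    [ex["sentence1"] for ex in sts_test_dataset], convert_to_tensor=True
)
embeddings2 = model.encode(
    [ex["sentence2"] for ex in sts_test_dataset], convert_to_tensor=True
)

# Cosine similarity of each sentence pair, compared against the gold similarity scores
cosine_scores = util.cos_sim(embeddings1, embeddings2).diagonal().cpu().numpy()
gold_scores = [ex["similarity_score"] for ex in sts_test_dataset]

print(pearsonr(cosine_scores, gold_scores))
```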
### Results
Below are the model's evaluation results on the [`stsb_multi_mt`](https://huggingface.co/datasets/stsb_multi_mt) dataset
(`fr` data, `test` split):
| Model | Pearson Correlation | Parameters |
| :--------------------------------------------------------------------------------------------------------------------------------------------------- | :-----------------: | ---------: |
| [`h4c5/sts-camembert-base`](https://huggingface.co/h4c5/sts-camembert-base) | **0.837** | 110M |
| [`Lajavaness/sentence-camembert-base`](https://huggingface.co/Lajavaness/sentence-camembert-base) | 0.835 | 110M |
| [`inokufu/flaubert-base-uncased-xnli-sts`](https://huggingface.co/inokufu/flaubert-base-uncased-xnli-sts) | 0.828 | 137M |
| [`h4c5/sts-distilcamembert-base`](https://huggingface.co/h4c5/sts-distilcamembert-base) | 0.817 | 68M |
| [`sentence-transformers/distiluse-base-multilingual-cased-v2`](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v2) | 0.786 | 135M |
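The comparison can be reproduced by running the evaluator from the previous section on each model (a hedged sketch; depending on the `sentence-transformers` version, the evaluator returns either its main score or a dictionary of metrics):
```python
from sentence_transformers import SentenceTransformer

model_names = [
    "h4c5/sts-camembert-base",
    "Lajavaness/sentence-camembert-base",
    "inokufu/flaubert-base-uncased-xnli-sts",
    "h4c5/sts-distilcamembert-base",
    "sentence-transformers/distiluse-base-multilingual-cased-v2",
]

# Reuses sts_test_evaluator built in the evaluation section above
for name in model_names:
    print(name, sts_test_evaluator(SentenceTransformer(name), "."))
```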
## Training
The model was trained with the following parameters:
**DataLoader**:
`torch.utils.data.dataloader.DataLoader` of length 180 with parameters:
```
{'batch_size': 32, 'sampler': 'torch.utils.data.sampler.RandomSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}
```
**Loss**:
`sentence_transformers.losses.CosineSimilarityLoss.CosineSimilarityLoss`
Parameters of the `fit()` method:
```
{
"epochs": 10,
"evaluation_steps": 1000,
"evaluator": "sentence_transformers.evaluation.EmbeddingSimilarityEvaluator.EmbeddingSimilarityEvaluator",
"max_grad_norm": 1,
"optimizer_class": "<class 'torch.optim.adamw.AdamW'>",
"optimizer_params": {
"lr": 2e-05
},
"scheduler": "WarmupLinear",
"steps_per_epoch": null,
"warmup_steps": 500,
"weight_decay": 0.01
}
```
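For reference, these hyperparameters roughly correspond to the following `fit()` call. This is a hedged reconstruction from the parameters above, not the original training script; the STSb French train split (5749 pairs) is consistent with the DataLoader length of 180 batches at batch size 32:
```python
from datasets import load_dataset
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Start from the base model; sentence-transformers adds a mean-pooling head automatically
model = SentenceTransformer("cmarkea/distilcamembert-base")

sts_train_dataset = load_dataset("stsb_multi_mt", name="fr", split="train")
train_examples = [
    InputExample(texts=[ex["sentence1"], ex["sentence2"]], label=ex["similarity_score"] / 5.0)
    for ex in sts_train_dataset
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)
train_loss = losses.CosineSimilarityLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    evaluator=sts_test_evaluator,  # the evaluator built in the evaluation section above
    epochs=10,
    evaluation_steps=1000,
    warmup_steps=500,
    optimizer_params={"lr": 2e-5},
    weight_decay=0.01,
    max_grad_norm=1,
)
```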
## Full Model Architecture
```
SentenceTransformer(
(0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: CamembertModel
(1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
```
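The same stack can be assembled explicitly from `sentence-transformers` modules (a minimal sketch reproducing the Transformer + mean-pooling architecture shown above):
```python
from sentence_transformers import SentenceTransformer, models

word_embedding_model = models.Transformer("cmarkea/distilcamembert-base", max_seq_length=512)
pooling_model = models.Pooling(
    word_embedding_model.get_word_embedding_dimension(),  # 768
    pooling_mode_mean_tokens=True,
)
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])
```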
## Citing
```
@inproceedings{reimers-2019-sentence-bert,
    title = {Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks},
    author = {Reimers, Nils and Gurevych, Iryna},
    booktitle = {Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing},
    month = {11},
    year = {2019},
    publisher = {Association for Computational Linguistics},
    url = {https://arxiv.org/abs/1908.10084},
}

@inproceedings{sanh2019distilbert,
    title = {DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter},
    author = {Sanh, Victor and Debut, Lysandre and Chaumond, Julien and Wolf, Thomas},
    booktitle = {NeurIPS EMC^2 Workshop},
    url = {https://arxiv.org/abs/1910.01108},
    year = {2019}
}

@inproceedings{martin2020camembert,
    title = {CamemBERT: a Tasty French Language Model},
    author = {Martin, Louis and Muller, Benjamin and Su{\'a}rez, Pedro Javier Ortiz and Dupont, Yoann and Romary, Laurent and de la Clergerie, {\'E}ric Villemonte and Seddah, Djam{\'e} and Sagot, Beno{\^\i}t},
    booktitle = {Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},
    url = {https://arxiv.org/abs/1911.03894},
    year = {2020}
}

@inproceedings{delestre:hal-03674695,
    title = {{DistilCamemBERT : une distillation du mod{\`e}le fran{\c c}ais CamemBERT}},
    author = {Delestre, Cyrile and Amar, Abibatou},
    booktitle = {{CAp (Conf{\'e}rence sur l'Apprentissage automatique)}},
    address = {Vannes, France},
    year = {2022},
    month = Jul,
    keywords = {NLP ; Transformers ; CamemBERT ; Distillation},
    url = {https://hal.archives-ouvertes.fr/hal-03674695},
    pdf = {https://hal.archives-ouvertes.fr/hal-03674695/file/cap2022.pdf},
    journal = {https://arxiv.org/abs/2205.11111},
    hal_id = {hal-03674695},
    hal_version = {v1},
}
```