Description

This sentence-transformers model was obtained by fine-tuning the cmarkea/distilcamembert-base model with the sentence-transformers library.

It encodes a sentence or a paragraph (up to 512 tokens, the max_seq_length given in the architecture below) into a 768-dimensional vector.

The underlying DistilCamemBERT model is a distillation of CamemBERT that halves the number of parameters and speeds up inference.

Usage with the sentence-transformers library

pip install -U sentence-transformers

from sentence_transformers import SentenceTransformer
sentences = ["Ceci est un exemple", "deuxième exemple"]

model = SentenceTransformer('h4c5/sts-distilcamembert-base')
embeddings = model.encode(sentences)
print(embeddings)
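
Since the model was fine-tuned for semantic textual similarity, embeddings are typically compared with cosine similarity. The following is a minimal, dependency-free sketch of that comparison. The helper `cosine_similarity` and the toy 3-dimensional vectors are illustrative assumptions standing in for the real 768-dimensional outputs of `model.encode`.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (hypothetical helper)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors standing in for real sentence embeddings
emb_a = [1.0, 2.0, 3.0]
emb_b = [2.0, 4.0, 6.0]   # same direction as emb_a, so similarity is close to 1
emb_c = [3.0, 0.0, -1.0]  # orthogonal to emb_a, so similarity is close to 0

print(cosine_similarity(emb_a, emb_b))
print(cosine_similarity(emb_a, emb_c))
```

With real embeddings, `sentence_transformers.util.cos_sim` performs the same computation on tensors.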

Usage with the transformers library

pip install -U transformers

from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("h4c5/sts-distilcamembert-base")
model = AutoModel.from_pretrained("h4c5/sts-distilcamembert-base")
model.eval()


# Mean Pooling
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[
        0
    ]  # First element of model_output contains all token embeddings
    input_mask_expanded = (
        attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    )
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(
        input_mask_expanded.sum(1), min=1e-9
    )

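# Tokenize the sentences and compute the token embeddings
sentences = ["Ceci est un exemple", "deuxième exemple"]
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():  # inference only, no gradients needed
    model_output = model(**encoded_input)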

# Mean pooling
sentence_embeddings = mean_pooling(model_output, encoded_input["attention_mask"])

print(sentence_embeddings)
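
To make the role of the attention mask concrete, here is a small pure-Python sketch of the same masked average on toy data. The function `mean_pool` and the 2-dimensional token vectors are assumptions made for illustration; the real code above operates on batched torch tensors.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose mask is 1 (illustrative, pure Python)."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            for i, value in enumerate(vector):
                summed[i] += value
    # Guard against division by zero, mirroring torch.clamp(..., min=1e-9)
    count = max(sum(attention_mask), 1e-9)
    return [s / count for s in summed]


# One sentence: two real tokens followed by one padding token
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = mean_pool(tokens, mask)
print(pooled)  # [2.0, 3.0]
```

The padding vector is ignored entirely, which is why padded batches yield the same sentence embedding as unpadded inputs.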

Evaluation

The model was evaluated on the STSb fr dataset:

from datasets import load_dataset
from sentence_transformers import InputExample, evaluation


def dataset_to_input_examples(dataset):
    return [
        InputExample(
            texts=[example["sentence1"], example["sentence2"]],
            label=example["similarity_score"] / 5.0,
        )
        for example in dataset
    ]


sts_test_dataset = load_dataset("stsb_multi_mt", name="fr", split="test")
sts_test_examples = dataset_to_input_examples(sts_test_dataset)

sts_test_evaluator = evaluation.EmbeddingSimilarityEvaluator.from_input_examples(
    sts_test_examples, name="sts-test"
)

# `model` must be the SentenceTransformer instance loaded in the first usage example
sts_test_evaluator(model, ".")
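
The evaluator compares the cosine similarity of each sentence pair with its gold score and reports correlation coefficients. As a rough illustration of the Pearson part of that metric, the sketch below computes it by hand; the helper `pearson` and all numbers are made up for the example and are not results of this model.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up gold scores on the 0-5 scale, normalized to [0, 1] as in the code above
gold = [score / 5.0 for score in [0.0, 2.5, 5.0, 4.0]]
# Made-up cosine similarities between the two sentences of each pair
predicted = [0.1, 0.5, 0.9, 0.8]

print(round(pearson(gold, predicted), 3))
```

Pearson correlation is unaffected by the division by 5.0; that rescaling mainly keeps the labels in a range comparable to cosine similarity.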

Results

On the stsb_multi_mt dataset (fr data, test split), the model reaches a self-reported Pearson correlation of 0.817.

Training

The model was trained with the parameters:

DataLoader:

torch.utils.data.dataloader.DataLoader of length 180 with parameters:

{'batch_size': 32, 'sampler': 'torch.utils.data.sampler.RandomSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}

Loss:

sentence_transformers.losses.CosineSimilarityLoss.CosineSimilarityLoss
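
As a rough sketch of what this loss measures, assuming its default configuration (mean squared error between the cosine similarity of the two sentence embeddings and the gold label), the computation looks like this. The helper names and the toy 2-dimensional vectors are illustrative assumptions, not the library's code.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def cosine_similarity_loss(pairs, labels):
    """Mean squared error between cosine(u, v) and the gold label (sketch)."""
    errors = [(cosine(u, v) - label) ** 2 for (u, v), label in zip(pairs, labels)]
    return sum(errors) / len(errors)


# Toy 2-dimensional "embeddings" and labels already scaled to [0, 1]
pairs = [([1.0, 0.0], [1.0, 0.0]), ([1.0, 0.0], [0.0, 1.0])]
labels = [1.0, 0.5]

print(cosine_similarity_loss(pairs, labels))  # 0.125
```

The first pair matches its label exactly, the second is off by 0.5, so the mean squared error is 0.25 / 2.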

Parameters of the fit() method:

{
    "epochs": 10,
    "evaluation_steps": 1000,
    "evaluator": "sentence_transformers.evaluation.EmbeddingSimilarityEvaluator.EmbeddingSimilarityEvaluator",
    "max_grad_norm": 1,
    "optimizer_class": "<class 'torch.optim.adamw.AdamW'>",
    "optimizer_params": {
        "lr": 2e-05
    },
    "scheduler": "WarmupLinear",
    "steps_per_epoch": null,
    "warmup_steps": 500,
    "weight_decay": 0.01
}
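
To see how these parameters shape the learning rate, the sketch below assumes that `WarmupLinear` is the usual linear warmup followed by linear decay to zero, and that the total number of steps is the DataLoader length times the number of epochs (180 x 10 = 1800). The function `warmup_linear_factor` is a hypothetical helper written for this illustration.

```python
def warmup_linear_factor(step, warmup_steps, total_steps):
    """Multiplier applied to the base learning rate at a given step (sketch)."""
    if step < warmup_steps:
        # Linear ramp from 0 to 1 during warmup
        return step / max(1, warmup_steps)
    # Linear decay from 1 to 0 over the remaining steps
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


base_lr = 2e-05
warmup_steps = 500
total_steps = 180 * 10  # assumed: DataLoader length x epochs

for step in (0, 250, 500, 1150, 1800):
    print(step, base_lr * warmup_linear_factor(step, warmup_steps, total_steps))
```

Under these assumptions the learning rate peaks at 2e-05 at step 500 and falls back to zero at step 1800.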

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: CamembertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)

Citing

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

@inproceedings{sanh2019distilbert,
    title={DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter},
    author={Sanh, Victor and Debut, Lysandre and Chaumond, Julien and Wolf, Thomas},
    booktitle={NeurIPS EMC^2 Workshop},
    journal={https://arxiv.org/abs/1910.01108},
    year={2019}
}

@inproceedings{martin2020camembert,
    title={CamemBERT: a Tasty French Language Model},
    author={Martin, Louis and Muller, Benjamin and Su{\'a}rez, Pedro Javier Ortiz and Dupont, Yoann and Romary, Laurent and de la Clergerie, {\'E}ric Villemonte and Seddah, Djam{\'e} and Sagot, Beno{\^\i}t},
    booktitle={Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},
    journal={https://arxiv.org/abs/1911.03894},
    year={2020}
}

@inproceedings{delestre:hal-03674695,
    TITLE = {{DistilCamemBERT : une distillation du mod{\`e}le fran{\c c}ais CamemBERT}},
    AUTHOR = {Delestre, Cyrile and Amar, Abibatou},
    URL = {https://hal.archives-ouvertes.fr/hal-03674695},
    BOOKTITLE = {{CAp (Conf{\'e}rence sur l'Apprentissage automatique)}},
    ADDRESS = {Vannes, France},
    YEAR = {2022},
    MONTH = Jul,
    KEYWORDS = {NLP ; Transformers ; CamemBERT ; Distillation},
    PDF = {https://hal.archives-ouvertes.fr/hal-03674695/file/cap2022.pdf},
    HAL_ID = {hal-03674695},
    HAL_VERSION = {v1},
    journal={https://arxiv.org/abs/2205.11111},
}

Model size: 68.1M params (Safetensors, tensor type F32)

Evaluation results (self-reported)

  • Pearson correlation on STSb French (stsb_multi_mt, fr): 0.817