---
language: fr
license: mit
tags:
- zero-shot-classification
- sentence-similarity
- nli
pipeline_tag: zero-shot-classification
widget:
- text: >-
Selon certains physiciens, un univers parallèle, miroir du nôtre ou
relevant de ce que l'on appelle la théorie des branes, autoriserait des
neutrons à sortir de notre Univers pour y entrer à nouveau. L'idée a été
testée une nouvelle fois avec le réacteur nucléaire de l'Institut
Laue-Langevin à Grenoble, plus précisément en utilisant le détecteur de
l'expérience Stereo initialement conçu pour chasser des particules de
matière noire potentielles, les neutrinos stériles.
candidate_labels: politique, science, sport, santé
hypothesis_template: Ce texte parle de {}.
datasets:
- flue
---
# DistilCamemBERT-NLI

We present DistilCamemBERT-NLI, a DistilCamemBERT model fine-tuned for the Natural Language Inference (NLI) task in French, also known as recognizing textual entailment (RTE). The model is built on the XNLI dataset, where the goal is to determine whether a premise entails, contradicts, or neither entails nor contradicts a hypothesis.

This model is comparable to BaptisteDoyen/camembert-base-xnli, which is based on CamemBERT. CamemBERT-based models become problematic at scale, for example in production: inference cost can turn into a real technological issue, especially in a cross-encoding setting such as this task, where every (premise, hypothesis) pair must pass through the model. To counteract this, we propose this model, which halves inference time at the same power consumption thanks to DistilCamemBERT.
## Dataset

The XNLI dataset from FLUE consists of 392,702 premise/hypothesis pairs for training and 5,010 pairs for testing. The goal is to predict textual entailment (does sentence A entail, contradict, or neither entail nor contradict sentence B?), a classification task in which, given two sentences, the model predicts one of three labels. Sentence A is called the premise and sentence B the hypothesis; the modeling objective is then:

$$P(premise = c \in \{contradiction,\ entailment,\ neutral\} \mid hypothesis)$$
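As a quick illustration (not part of the original card), the pairs can be loaded with the Hugging Face `datasets` library; the `"flue"` dataset name and `"XNLI"` configuration are taken from the FLUE distribution on the Hub and are the only assumptions here:

```python
# Minimal sketch, assuming the Hugging Face `datasets` library and the
# FLUE distribution on the Hub (dataset "flue", configuration "XNLI").
from datasets import load_dataset

xnli_fr = load_dataset("flue", "XNLI")
print(xnli_fr)              # expected: ~392,702 training pairs, 5,010 test pairs
print(xnli_fr["train"][0])  # one premise/hypothesis pair with its label
```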
## Evaluation results
class | precision (%) | f1-score (%) | support |
---|---|---|---|
global | 77.70 | 77.45 | 5,010 |
contradiction | 78.00 | 79.54 | 1,670 |
entailment | 82.90 | 78.87 | 1,670 |
neutral | 72.18 | 74.04 | 1,670 |
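To make the three classes above concrete, here is a minimal sketch (not from the original card) that queries the NLI head directly on a sentence pair; the example sentences are invented for illustration:

```python
# Minimal sketch: score a (premise, hypothesis) pair with the NLI head via
# the text-classification pipeline. The sentences are illustrative.
from transformers import pipeline

nli = pipeline(
    "text-classification",
    model="cmarkea/distilcamembert-base-nli",
    tokenizer="cmarkea/distilcamembert-base-nli"
)

print(nli({"text": "Le chat dort sur le canapé.",               # premise
           "text_pair": "Un animal est en train de dormir."}))  # hypothesis
# expected shape: [{"label": "entailment", "score": ...}]
```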
## Benchmark

We compare DistilCamemBERT-NLI against two other models for French. The first, BaptisteDoyen/camembert-base-xnli, is based on CamemBERT, the French RoBERTa model; the second, MoritzLaurer/mDeBERTa-v3-base-mnli-xnli, is based on mDeBERTa-v3, a multilingual model. Performance is compared with accuracy and MCC (Matthews Correlation Coefficient); mean inference time was measured on an AMD Ryzen 5 4500U @ 2.3 GHz with 6 cores (a timing sketch follows the table):
model | time (ms) | accuracy (%) | MCC (x100) |
---|---|---|---|
cmarkea/distilcamembert-base-nli | 51.35 | 77.45 | 66.24 |
BaptisteDoyen/camembert-base-xnli | 105.0 | 81.72 | 72.67 |
MoritzLaurer/mDeBERTa-v3-base-mnli-xnli | 299.18 | 83.43 | 75.15 |
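The exact benchmarking protocol is not given in the card; the sketch below shows one plausible way to measure a mean CPU inference time (a warm-up run, then averaging over repeated calls). The text, labels, and run count are illustrative assumptions:

```python
# Hedged sketch of a mean inference-time measurement on CPU; the protocol
# (warm-up + averaging over 50 calls) is an assumption, not the card's own.
import time
from transformers import pipeline

clf = pipeline(
    "zero-shot-classification",
    model="cmarkea/distilcamembert-base-nli",
    device=-1  # CPU
)
text = "Le gouvernement présente sa réforme des retraites."
labels = ["politique", "science", "sport", "santé"]

clf(text, candidate_labels=labels)  # warm-up
n_runs = 50
start = time.perf_counter()
for _ in range(n_runs):
    clf(text, candidate_labels=labels)
print(f"mean inference time: {(time.perf_counter() - start) / n_runs * 1000:.1f} ms")
```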
## Zero-shot classification

The main advantage of such a model is that it yields a zero-shot classifier, allowing text classification without any training. Each candidate label is turned into a hypothesis via a template, and the label whose hypothesis the input text most strongly entails is chosen. This task can be summarized by:

$$\hat{c} = \arg\max_{c \in \mathcal{C}} P(entailment \mid premise,\ hypothesis_c)$$
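A minimal sketch of this mechanism (illustrative, not the pipeline's actual source): build one hypothesis per label, score each (premise, hypothesis) pair with the NLI model, and normalize the entailment logits across labels:

```python
# Minimal sketch of zero-shot classification via NLI; the entailment-index
# lookup is guarded by model.config, and the sentences are invented.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "cmarkea/distilcamembert-base-nli"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

premise = "Le PSG a remporté le match d'hier soir."
labels = ["sport", "politique", "science"]
hypotheses = [f"Ce texte parle de {label}." for label in labels]

inputs = tokenizer([premise] * len(labels), hypotheses,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits

# keep each label's entailment logit, then normalize across labels
entailment_id = model.config.label2id.get("entailment", 2)  # fallback index is an assumption
scores = logits[:, entailment_id].softmax(dim=0)
print(dict(zip(labels, scores.tolist())))
```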
For this part, we use two datasets. The first one, allocine, is used to train sentiment-analysis models; it consists of two classes, "positif" and "négatif", describing movie reviews. Here we use "Ce commentaire est {}." as the hypothesis template and "positif" and "négatif" as the candidate labels (a usage sketch follows the table):
model | time (ms) | accuracy (%) | MCC (x100) |
---|---|---|---|
cmarkea/distilcamembert-base-nli | 195.54 | 80.59 | 63.71 |
BaptisteDoyen/camembert-base-xnli | 378.39 | 86.37 | 73.74 |
MoritzLaurer/mDeBERTa-v3-base-mnli-xnli | 520.58 | 84.97 | 70.05 |
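A minimal sketch of this protocol, assuming the allocine dataset on the Hub with its `review`/`label` columns; the single-example comparison stands in for a full accuracy loop:

```python
# Minimal sketch of the allocine zero-shot protocol described above;
# running the full test split is left out for brevity.
from datasets import load_dataset
from transformers import pipeline

reviews = load_dataset("allocine", split="test")
clf = pipeline("zero-shot-classification",
               model="cmarkea/distilcamembert-base-nli")

pred = clf(reviews[0]["review"],
           candidate_labels=["positif", "négatif"],
           hypothesis_template="Ce commentaire est {}.")
print(pred["labels"][0], "| gold label:", reviews[0]["label"])  # 1 = positive
```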
The second one, mlsum, is used to train summarization models. We use the article summaries to predict their topics; to that end, we aggregate sub-topics and keep only a few of them. In this case, the hypothesis template is "C'est un article traitant de {}." and the candidate labels are "économie", "politique", "sport", and "science" (a sketch follows the table):
model | time (ms) | accuracy (%) | MCC (x100) |
---|---|---|---|
cmarkea/distilcamembert-base-nli | 217.77 | 79.30 | 70.55 |
BaptisteDoyen/camembert-base-xnli | 448.27 | 70.7 | 64.10 |
MoritzLaurer/mDeBERTa-v3-base-mnli-xnli | 591.34 | 64.45 | 58.67 |
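A matching sketch for mlsum, assuming the `"fr"` configuration with its `summary` and `topic` columns; the card's sub-topic aggregation step is not reproduced here:

```python
# Minimal sketch of the mlsum topic experiment; the sub-topic aggregation
# is omitted, so the gold topic is shown raw for comparison.
from datasets import load_dataset
from transformers import pipeline

articles = load_dataset("mlsum", "fr", split="test")
clf = pipeline("zero-shot-classification",
               model="cmarkea/distilcamembert-base-nli")

pred = clf(articles[0]["summary"],
           candidate_labels=["économie", "politique", "sport", "science"],
           hypothesis_template="C'est un article traitant de {}.")
print(pred["labels"][0], "| raw gold topic:", articles[0]["topic"])
```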
## How to use DistilCamemBERT-NLI

```python
from transformers import pipeline

classifier = pipeline(
    task="zero-shot-classification",
    model="cmarkea/distilcamembert-base-nli",
    tokenizer="cmarkea/distilcamembert-base-nli"
)
result = classifier(
    sequences="Le style très cinéphile de Quentin Tarantino "
              "se reconnaît entre autres par sa narration postmoderne "
              "et non linéaire, ses dialogues travaillés souvent "
              "émaillés de références à la culture populaire, et ses "
              "scènes hautement esthétiques mais d'une violence "
              "extrême, inspirées de films d'exploitation, d'arts "
              "martiaux ou de western spaghetti.",
    candidate_labels="cinéma, technologie, littérature, politique",
    hypothesis_template="Ce texte parle de {}."
)

result
{"labels": ["cinéma",
            "littérature",
            "technologie",
            "politique"],
 "scores": [0.7164115309715271,
            0.12878799438476562,
            0.1092301607131958,
            0.0455702543258667]}
```
## Citation

```bibtex
@inproceedings{delestre:hal-03674695,
    TITLE = {{DistilCamemBERT : une distillation du mod{\`e}le fran{\c c}ais CamemBERT}},
    AUTHOR = {Delestre, Cyrile and Amar, Abibatou},
    URL = {https://hal.archives-ouvertes.fr/hal-03674695},
    BOOKTITLE = {{CAp (Conf{\'e}rence sur l'Apprentissage automatique)}},
    ADDRESS = {Vannes, France},
    YEAR = {2022},
    MONTH = Jul,
    KEYWORDS = {NLP ; Transformers ; CamemBERT ; Distillation},
    PDF = {https://hal.archives-ouvertes.fr/hal-03674695/file/cap2022.pdf},
    HAL_ID = {hal-03674695},
    HAL_VERSION = {v1},
}
```