---
language: fr
license: mit
tags:
- zero-shot-classification
- sentence-similarity
- nli
pipeline_tag: zero-shot-classification
widget:
- text: "Selon certains physiciens, un univers parallèle, miroir du nôtre ou relevant de ce que l'on appelle la théorie des branes, autoriserait des neutrons à sortir de notre Univers pour y entrer à nouveau. L'idée a été testée une nouvelle fois avec le réacteur nucléaire de l'Institut Laue-Langevin à Grenoble, plus précisément en utilisant le détecteur de l'expérience Stereo initialement conçu pour chasser des particules de matière noire potentielles, les neutrinos stériles."
  candidate_labels: "politique, science, sport, santé"
  hypothesis_template: "Ce texte parle de {}."
datasets:
- flue
---
|
|
|
DistilCamemBERT-NLI
===================
|
|
|
We present DistilCamemBERT-NLI, a version of [DistilCamemBERT](https://huggingface.co/cmarkea/distilcamembert-base) fine-tuned for the Natural Language Inference (NLI) task in French, also known as Recognizing Textual Entailment (RTE). The model is fine-tuned on the XNLI dataset, where the task is to determine whether a premise entails, contradicts, or is neutral with respect to a hypothesis.
|
|
|
This model is comparable to [BaptisteDoyen/camembert-base-xnli](https://huggingface.co/BaptisteDoyen/camembert-base-xnli), which is based on [CamemBERT](https://huggingface.co/camembert-base). The drawback of CamemBERT-based models appears at scale, for example in production: inference cost can become a technological issue, especially for cross-encoding tasks such as this one. To counteract this effect, we propose this model, which divides the inference time by two at the same power consumption, thanks to DistilCamemBERT.
|
|
|
Dataset
-------
|
|
|
The XNLI dataset from [FLUE](https://huggingface.co/datasets/flue) comprises 392,702 premise-hypothesis pairs for training and 5,010 pairs for testing. The goal is to predict textual entailment (does sentence A imply, contradict, or neither imply nor contradict sentence B?): a classification task over three labels. Sentence A is called the *premise* and sentence B the *hypothesis*; the model then estimates:
|
$$P(premise=c\in\{contradiction, entailment, neutral\}\vert hypothesis)$$
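
As an illustration, here is a minimal sketch of scoring a single premise-hypothesis pair with this model; the example sentences are ours, and the label names are read from the model configuration rather than assumed:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "cmarkea/distilcamembert-base-nli"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Illustrative premise/hypothesis pair (not from XNLI)
premise = "Le chat dort sur le canapé."
hypothesis = "Un animal se repose."

inputs = tokenizer(premise, hypothesis, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Probability of each class, with names taken from the model config
probs = logits.softmax(dim=-1).squeeze()
for idx, label in model.config.id2label.items():
    print(f"{label}: {probs[idx].item():.3f}")
```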
|
|
|
Evaluation results
------------------
|
|
|
| **class**          | **precision (%)** | **f1-score (%)** | **support** |
| :----------------: | :---------------: | :--------------: | :---------: |
| **global**         | 77.70             | 77.45            | 5,010       |
| **contradiction**  | 78.00             | 79.54            | 1,670       |
| **entailment**     | 82.90             | 78.87            | 1,670       |
| **neutral**        | 72.18             | 74.04            | 1,670       |
|
|
|
Benchmark
---------
|
|
|
We compare [DistilCamemBERT](https://huggingface.co/cmarkea/distilcamembert-base) to two other models for French. The first, [BaptisteDoyen/camembert-base-xnli](https://huggingface.co/BaptisteDoyen/camembert-base-xnli), is based on the aptly named [CamemBERT](https://huggingface.co/camembert-base), the French RoBERTa model; the second, [MoritzLaurer/mDeBERTa-v3-base-mnli-xnli](https://huggingface.co/MoritzLaurer/mDeBERTa-v3-base-mnli-xnli), is based on [mDeBERTa-v3](https://huggingface.co/microsoft/mdeberta-v3-base), a multilingual model. Performance is compared with accuracy and the [MCC (Matthews Correlation Coefficient)](https://en.wikipedia.org/wiki/Phi_coefficient). Mean inference time was measured on an **AMD Ryzen 5 4500U @ 2.3GHz with 6 cores**.
|
|
|
| **model** | **time (ms)** | **accuracy (%)** | **MCC (x100)** |
| :--------------: | :-----------: | :--------------: | :------------: |
| [cmarkea/distilcamembert-base-nli](https://huggingface.co/cmarkea/distilcamembert-base-nli) | **51.35** | 77.45 | 66.24 |
| [BaptisteDoyen/camembert-base-xnli](https://huggingface.co/BaptisteDoyen/camembert-base-xnli) | 105.0 | 81.72 | 72.67 |
| [MoritzLaurer/mDeBERTa-v3-base-mnli-xnli](https://huggingface.co/MoritzLaurer/mDeBERTa-v3-base-mnli-xnli) | 299.18 | **83.43** | **75.15** |
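
For reference, both metrics can be computed with scikit-learn; a minimal sketch, where the label arrays are illustrative placeholders rather than actual test-set predictions:

```python
from sklearn.metrics import accuracy_score, matthews_corrcoef

# Illustrative placeholders, not the actual test-set predictions
y_true = ["entailment", "neutral", "contradiction", "entailment"]
y_pred = ["entailment", "neutral", "neutral", "entailment"]

print(f"accuracy (%): {accuracy_score(y_true, y_pred) * 100:.2f}")
print(f"MCC (x100): {matthews_corrcoef(y_true, y_pred) * 100:.2f}")
```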
|
|
|
Zero-shot classification
------------------------
|
|
|
The main advantage of such a model is that it can serve as a zero-shot classifier, allowing text classification without training. This task can be summarized by:
|
$$P(hypothesis=i\in\mathcal{C}\vert premise)=\frac{e^{P(premise=entailment\vert hypothesis=i)}}{\sum_{j\in\mathcal{C}}e^{P(premise=entailment\vert hypothesis=j)}}$$
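
In practice, the `zero-shot-classification` pipeline implements this by applying a softmax over the entailment scores of each candidate label. Here is a minimal sketch of that computation; the premise, candidate labels, and hypothesis template are illustrative, and the entailment index is read from the model configuration:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "cmarkea/distilcamembert-base-nli"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

premise = "Le film sort en salles la semaine prochaine."
labels = ["cinéma", "politique", "sport"]
hypotheses = [f"Ce texte parle de {label}." for label in labels]

# Index of the entailment class, read from the model config
entail_id = {v.lower(): k for k, v in model.config.id2label.items()}["entailment"]

with torch.no_grad():
    inputs = tokenizer([premise] * len(labels), hypotheses,
                       return_tensors="pt", padding=True)
    entail_scores = model(**inputs).logits[:, entail_id]

# Softmax over candidate labels, as in the formula above
probs = entail_scores.softmax(dim=-1)
print(dict(zip(labels, probs.tolist())))
```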
|
|
|
For this part, we use two datasets. The first, [allocine](https://huggingface.co/datasets/allocine), is used to train sentiment analysis models; it comprises two classes, "positif" and "négatif", for movie reviews. Here we use "Ce commentaire est {}." as the hypothesis template, with "positif" and "négatif" as candidate labels.
|
|
|
| **model** | **time (ms)** | **accuracy (%)** | **MCC (x100)** |
| :--------------: | :-----------: | :--------------: | :------------: |
| [cmarkea/distilcamembert-base-nli](https://huggingface.co/cmarkea/distilcamembert-base-nli) | **195.54** | 80.59 | 63.71 |
| [BaptisteDoyen/camembert-base-xnli](https://huggingface.co/BaptisteDoyen/camembert-base-xnli) | 378.39 | **86.37** | **73.74** |
| [MoritzLaurer/mDeBERTa-v3-base-mnli-xnli](https://huggingface.co/MoritzLaurer/mDeBERTa-v3-base-mnli-xnli) | 520.58 | 84.97 | 70.05 |
|
|
|
The second, [mlsum](https://huggingface.co/datasets/mlsum), is used to train summarization models. For this purpose, we aggregate sub-topics and select a few of them, then use the article summaries to predict their topics. Here the hypothesis template is "C'est un article traitant de {}." and the candidate labels are "économie", "politique", "sport" and "science".
|
|
|
| **model** | **time (ms)** | **accuracy (%)** | **MCC (x100)** |
| :--------------: | :-----------: | :--------------: | :------------: |
| [cmarkea/distilcamembert-base-nli](https://huggingface.co/cmarkea/distilcamembert-base-nli) | **217.77** | **79.30** | **70.55** |
| [BaptisteDoyen/camembert-base-xnli](https://huggingface.co/BaptisteDoyen/camembert-base-xnli) | 448.27 | 70.7 | 64.10 |
| [MoritzLaurer/mDeBERTa-v3-base-mnli-xnli](https://huggingface.co/MoritzLaurer/mDeBERTa-v3-base-mnli-xnli) | 591.34 | 64.45 | 58.67 |
|
|
|
How to use DistilCamemBERT-NLI
------------------------------
|
```python
from transformers import pipeline

classifier = pipeline(
    task='zero-shot-classification',
    model="cmarkea/distilcamembert-base-nli",
    tokenizer="cmarkea/distilcamembert-base-nli"
)
result = classifier(
    sequences="Le style très cinéphile de Quentin Tarantino "
              "se reconnaît entre autres par sa narration postmoderne "
              "et non linéaire, ses dialogues travaillés souvent "
              "émaillés de références à la culture populaire, et ses "
              "scènes hautement esthétiques mais d'une violence "
              "extrême, inspirées de films d'exploitation, d'arts "
              "martiaux ou de western spaghetti.",
    candidate_labels="cinéma, technologie, littérature, politique",
    hypothesis_template="Ce texte parle de {}."
)

result
{"labels": ["cinéma",
            "littérature",
            "technologie",
            "politique"],
 "scores": [0.7164115309715271,
            0.12878799438476562,
            0.1092301607131958,
            0.0455702543258667]}
```
|
|
|
### Optimum + ONNX
|
|
|
```python
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer, pipeline

HUB_MODEL = "cmarkea/distilcamembert-base-nli"

tokenizer = AutoTokenizer.from_pretrained(HUB_MODEL)
# ONNX export of the model, executed with ONNX Runtime
model = ORTModelForSequenceClassification.from_pretrained(HUB_MODEL)
onnx_classifier = pipeline("zero-shot-classification", model=model, tokenizer=tokenizer)

# Quantized ONNX model
quantized_model = ORTModelForSequenceClassification.from_pretrained(
    HUB_MODEL, file_name="model_quantized.onnx"
)
```
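
The quantized model plugs into the same pipeline; a minimal sketch, reusing `pipeline`, `tokenizer`, and `quantized_model` from the block above:

```python
quantized_classifier = pipeline(
    "zero-shot-classification", model=quantized_model, tokenizer=tokenizer
)
```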
|
|
|
Citation
--------
|
```bibtex
@inproceedings{delestre:hal-03674695,
  TITLE = {{DistilCamemBERT : une distillation du mod{\`e}le fran{\c c}ais CamemBERT}},
  AUTHOR = {Delestre, Cyrile and Amar, Abibatou},
  URL = {https://hal.archives-ouvertes.fr/hal-03674695},
  BOOKTITLE = {{CAp (Conf{\'e}rence sur l'Apprentissage automatique)}},
  ADDRESS = {Vannes, France},
  YEAR = {2022},
  MONTH = Jul,
  KEYWORDS = {NLP ; Transformers ; CamemBERT ; Distillation},
  PDF = {https://hal.archives-ouvertes.fr/hal-03674695/file/cap2022.pdf},
  HAL_ID = {hal-03674695},
  HAL_VERSION = {v1},
}
```