	---
	language: fr
	license: mit
	tags:
	- zero-shot-classification
	- sentence-similarity
	- nli
	pipeline_tag: zero-shot-classification
	widget:
	- text: "Selon certains physiciens, un univers parallèle, miroir du nôtre ou relevant de ce que l'on appelle la théorie des branes, autoriserait des neutrons à sortir de notre Univers pour y entrer à nouveau. L'idée a été testée une nouvelle fois avec le réacteur nucléaire de l'Institut Laue-Langevin à Grenoble, plus précisément en utilisant le détecteur de l'expérience Stereo initialement conçu pour chasser des particules de matière noire potentielles, les neutrinos stériles."
	  candidate_labels: "politique, science, sport, santé"
	  hypothesis_template: "Ce texte parle de {}."
	datasets:
	- flue
	---

	DistilCamemBERT-NLI
	===================

	We present DistilCamemBERT-NLI, which is [DistilCamemBERT](https://huggingface.co/cmarkea/distilcamembert-base) fine-tuned for the Natural Language Inference (NLI) task in French, also known as Recognizing Textual Entailment (RTE). The model is trained on the XNLI dataset, where the task is to determine whether a premise entails, contradicts, or neither entails nor contradicts a hypothesis.

	This model is similar to [BaptisteDoyen/camembert-base-xnli](https://huggingface.co/BaptisteDoyen/camembert-base-xnli), which is based on the [CamemBERT](https://huggingface.co/camembert-base) model. The drawback of CamemBERT-based models shows up when scaling, for example in the production phase: inference cost can become a technological issue, especially for cross-encoding tasks such as this one. To counteract this effect, we propose this model, which, thanks to DistilCamemBERT, halves the inference time at the same power consumption.

	Dataset
	-------

	The XNLI dataset from [FLUE](https://huggingface.co/datasets/flue) comprises 392,702 premise-hypothesis pairs for training and 5,010 pairs for testing. The goal is to predict textual entailment (does sentence A entail, contradict, or neither entail nor contradict sentence B?), which is a classification task: given two sentences, predict one of three labels. Sentence A is called the premise and sentence B the hypothesis, so the modeling objective is:
	$$P(premise=c\in\{contradiction, entailment, neutral\}\vert hypothesis)$$
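
To make the three-way formulation concrete, here is a minimal sketch of an NLI example and of the prediction rule (pick the class with the highest probability). The `NLIExample` class, the `predict_label` helper, the label order, the sentence pair, and the probabilities are all assumptions made for illustration; none of them come from the model or the dataset.

```python
from dataclasses import dataclass

# The three classes named in the text; their order here is an assumption.
LABELS = ("contradiction", "entailment", "neutral")


@dataclass(frozen=True)
class NLIExample:
    premise: str      # sentence A
    hypothesis: str   # sentence B
    label: str        # one of LABELS

    def __post_init__(self):
        if self.label not in LABELS:
            raise ValueError(f"unknown label: {self.label!r}")


def predict_label(class_probabilities):
    """Return the class with the highest probability (argmax over LABELS)."""
    if len(class_probabilities) != len(LABELS):
        raise ValueError("expected one probability per class")
    best_index = max(range(len(LABELS)), key=lambda i: class_probabilities[i])
    return LABELS[best_index]


# Hand-written pair and made-up probabilities, for illustration only.
example = NLIExample(
    premise="Le chat dort sur le canapé.",
    hypothesis="Un animal se repose.",
    label="entailment",
)
prediction = predict_label([0.05, 0.85, 0.10])
print(prediction == example.label)
```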

	Evaluation results
	------------------

	| class | precision (%) | f1-score (%) | support |
	| :----------------: | :---------------: | :--------------: | :---------: |
	| global | 77.70 | 77.45 | 5,010 |
	| contradiction | 78.00 | 79.54 | 1,670 |
	| entailment | 82.90 | 78.87 | 1,670 |
	| neutral | 72.18 | 74.04 | 1,670 |
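
For readers who want to reproduce this kind of table on their own predictions, the following is a minimal sketch of per-class precision, recall, and F1 computed from gold and predicted labels. The toy labels are invented, and `per_class_scores` is a hypothetical helper. How the "global" row aggregates the per-class values is not stated in the source, so the sketch covers per-class values only.

```python
def per_class_scores(y_true, y_pred, labels):
    """Return {label: (precision, recall, f1)} computed from two label lists."""
    scores = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores[label] = (precision, recall, f1)
    return scores


# Toy data, invented for the example.
labels = ["contradiction", "entailment", "neutral"]
y_true = ["contradiction", "entailment", "neutral",
          "neutral", "entailment", "contradiction"]
y_pred = ["contradiction", "entailment", "entailment",
          "neutral", "entailment", "neutral"]

scores = per_class_scores(y_true, y_pred, labels)
for label, (p, r, f1) in scores.items():
    print(f"{label}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```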

	Benchmark
	---------

	We compare the [DistilCamemBERT](https://huggingface.co/cmarkea/distilcamembert-base) model to two other models for the French language. The first, [BaptisteDoyen/camembert-base-xnli](https://huggingface.co/BaptisteDoyen/camembert-base-xnli), is based on the aptly named [CamemBERT](https://huggingface.co/camembert-base), the French RoBERTa model. The second, [MoritzLaurer/mDeBERTa-v3-base-mnli-xnli](https://huggingface.co/MoritzLaurer/mDeBERTa-v3-base-mnli-xnli), is based on [mDeBERTav3](https://huggingface.co/microsoft/mdeberta-v3-base), a multilingual model. Performance is compared using accuracy and [MCC (Matthews Correlation Coefficient)](https://en.wikipedia.org/wiki/Phi_coefficient). Mean inference time was measured on an AMD Ryzen 5 4500U @ 2.3GHz with 6 cores.

	| model | time (ms) | accuracy (%) | MCC (x100) |
	| :--------------: | :-----------: | :--------------: | :------------: |
	| [cmarkea/distilcamembert-base-nli](https://huggingface.co/cmarkea/distilcamembert-base-nli) | 51.35 | 77.45 | 66.24 |
	| [BaptisteDoyen/camembert-base-xnli](https://huggingface.co/BaptisteDoyen/camembert-base-xnli) | 105.0 | 81.72 | 72.67 |
	| [MoritzLaurer/mDeBERTa-v3-base-mnli-xnli](https://huggingface.co/MoritzLaurer/mDeBERTa-v3-base-mnli-xnli) | 299.18 | 83.43 | 75.15 |
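
The two quality metrics in this table can be sketched in a few lines. The code below computes accuracy and the multiclass generalization of the Matthews Correlation Coefficient from label lists. It is an illustrative implementation under the assumption that the multiclass form of MCC applies to the three-class setting; `multiclass_mcc` is a hypothetical helper and the toy labels are invented.

```python
import math
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def multiclass_mcc(y_true, y_pred):
    """Multiclass MCC; returns 0.0 when the denominator vanishes."""
    n_samples = len(y_true)
    n_correct = sum(t == p for t, p in zip(y_true, y_pred))
    true_counts = Counter(y_true)   # how often each class truly occurs
    pred_counts = Counter(y_pred)   # how often each class is predicted
    classes = set(true_counts) | set(pred_counts)
    cov_tp = n_correct * n_samples - sum(
        true_counts[k] * pred_counts[k] for k in classes)
    cov_tt = n_samples ** 2 - sum(true_counts[k] ** 2 for k in classes)
    cov_pp = n_samples ** 2 - sum(pred_counts[k] ** 2 for k in classes)
    denominator = math.sqrt(cov_tt * cov_pp)
    return cov_tp / denominator if denominator else 0.0


# Toy data, invented for the example.
y_true = ["contradiction", "contradiction", "entailment",
          "entailment", "neutral", "neutral"]
y_pred = ["contradiction", "contradiction", "entailment",
          "neutral", "neutral", "neutral"]

acc = accuracy(y_true, y_pred)
mcc = multiclass_mcc(y_true, y_pred)
print(f"accuracy={100 * acc:.2f}%  MCC(x100)={100 * mcc:.2f}")
```

Unlike accuracy, MCC accounts for how predictions are distributed across classes, which is why both are reported side by side.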

	Zero-shot classification
	------------------------

	The main advantage of such a model is that it can serve as a zero-shot classifier, allowing text classification without any task-specific training. The task can be summarized as:
	$$P(hypothesis=i\in\mathcal{C}\vert premise)=\frac{e^{P(premise=entailment\vert hypothesis=i)}}{\sum_{j\in\mathcal{C}}e^{P(premise=entailment\vert hypothesis=j)}}$$
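
The normalization in this formula is a softmax over the entailment scores of the candidate labels. A minimal sketch follows; the scores are made-up numbers standing in for the NLI model's entailment output for each candidate hypothesis, and `zero_shot_distribution` is a hypothetical helper.

```python
import math


def zero_shot_distribution(entailment_scores):
    """Softmax over {label: entailment score}, one score per candidate label."""
    shift = max(entailment_scores.values())  # subtract the max for stability
    exps = {label: math.exp(score - shift)
            for label, score in entailment_scores.items()}
    total = sum(exps.values())
    return {label: value / total for label, value in exps.items()}


# Made-up entailment scores for the hypotheses "Ce texte parle de {label}."
scores = {"politique": -1.0, "science": 2.5, "sport": -0.5, "santé": 0.3}
distribution = zero_shot_distribution(scores)
best_label = max(distribution, key=distribution.get)
print(best_label, distribution)
```

Because the scores are normalized across the candidate set, the result is a distribution over labels: adding or removing a candidate changes every probability.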

	For this part, we use two datasets. The first is [allocine](https://huggingface.co/datasets/allocine), which is used to train sentiment analysis models. It comprises two classes of movie reviews: "positif" (positive) and "négatif" (negative). Here we use "Ce commentaire est {}." as the hypothesis template and "positif" and "négatif" as candidate labels.

	| model | time (ms) | accuracy (%) | MCC (x100) |
	| :--------------: | :-----------: | :--------------: | :------------: |
	| [cmarkea/distilcamembert-base-nli](https://huggingface.co/cmarkea/distilcamembert-base-nli) | 195.54 | 80.59 | 63.71 |
	| [BaptisteDoyen/camembert-base-xnli](https://huggingface.co/BaptisteDoyen/camembert-base-xnli) | 378.39 | 86.37 | 73.74 |
	| [MoritzLaurer/mDeBERTa-v3-base-mnli-xnli](https://huggingface.co/MoritzLaurer/mDeBERTa-v3-base-mnli-xnli) | 520.58 | 84.97 | 70.05 |
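
The template mechanism used in this benchmark can be sketched as follows: each candidate label is substituted into the hypothesis template, and each resulting hypothesis is paired with the review as the premise. The review text is invented, and `build_pairs` is a hypothetical helper, not part of any library.

```python
def build_pairs(premise, candidate_labels, hypothesis_template):
    """Return one (premise, hypothesis) pair per candidate label."""
    if "{}" not in hypothesis_template:
        raise ValueError("the template needs a '{}' placeholder")
    return [(premise, hypothesis_template.format(label))
            for label in candidate_labels]


review = "Un film magnifique, porté par des acteurs remarquables."  # invented
pairs = build_pairs(review, ["positif", "négatif"], "Ce commentaire est {}.")
for premise, hypothesis in pairs:
    print(hypothesis)
```

Since a cross-encoder scores each pair separately, inference cost grows with the number of candidate labels.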

	The second is [mlsum](https://huggingface.co/datasets/mlsum), which is used to train summarization models. For this purpose, we aggregate sub-topics and select a few of them. We use the summary part of each article to predict its topic. In this case, the hypothesis template is "C'est un article traitant de {}." and the candidate labels are "économie", "politique", "sport" and "science".

	| model | time (ms) | accuracy (%) | MCC (x100) |
	| :--------------: | :-----------: | :--------------: | :------------: |
	| [cmarkea/distilcamembert-base-nli](https://huggingface.co/cmarkea/distilcamembert-base-nli) | 217.77 | 79.30 | 70.55 |
	| [BaptisteDoyen/camembert-base-xnli](https://huggingface.co/BaptisteDoyen/camembert-base-xnli) | 448.27 | 70.7 | 64.10 |
	| [MoritzLaurer/mDeBERTa-v3-base-mnli-xnli](https://huggingface.co/MoritzLaurer/mDeBERTa-v3-base-mnli-xnli) | 591.34 | 64.45 | 58.67 |

	How to use DistilCamemBERT-NLI
	------------------------------
	```python
	from transformers import pipeline

	# Load the zero-shot classification pipeline with the fine-tuned NLI model.
	classifier = pipeline(
	    task="zero-shot-classification",
	    model="cmarkea/distilcamembert-base-nli",
	    tokenizer="cmarkea/distilcamembert-base-nli",
	)

	# Candidate labels are passed as a single comma-separated string.
	result = classifier(
	    sequences="Le style très cinéphile de Quentin Tarantino "
	    "se reconnaît entre autres par sa narration postmoderne "
	    "et non linéaire, ses dialogues travaillés souvent "
	    "émaillés de références à la culture populaire, et ses "
	    "scènes hautement esthétiques mais d'une violence "
	    "extrême, inspirées de films d'exploitation, d'arts "
	    "martiaux ou de western spaghetti.",
	    candidate_labels="cinéma, technologie, littérature, politique",
	    hypothesis_template="Ce texte parle de {}.",
	)

	print(result)
	# The returned dictionary includes the labels sorted by decreasing score:
	# {"labels": ["cinéma", "littérature", "technologie", "politique"],
	#  "scores": [0.7164115309715271, 0.12878799438476562,
	#             0.1092301607131958, 0.0455702543258667]}
	```

	### Optimum + ONNX

	```python
	from optimum.onnxruntime import ORTModelForSequenceClassification
	from transformers import AutoTokenizer, pipeline

	HUB_MODEL = "cmarkea/distilcamembert-base-nli"

	# Load the tokenizer and the ONNX Runtime model, then build the pipeline.
	tokenizer = AutoTokenizer.from_pretrained(HUB_MODEL)
	model = ORTModelForSequenceClassification.from_pretrained(HUB_MODEL)
	onnx_classifier = pipeline("zero-shot-classification", model=model, tokenizer=tokenizer)

	# Quantized ONNX model
	quantized_model = ORTModelForSequenceClassification.from_pretrained(
	    HUB_MODEL, file_name="model_quantized.onnx"
	)
	```

	Citation
	--------
	```bibtex
	@inproceedings{delestre:hal-03674695,
	  TITLE = {{DistilCamemBERT : une distillation du mod{\`e}le fran{\c c}ais CamemBERT}},
	  AUTHOR = {Delestre, Cyrile and Amar, Abibatou},
	  URL = {https://hal.archives-ouvertes.fr/hal-03674695},
	  BOOKTITLE = {{CAp (Conf{\'e}rence sur l'Apprentissage automatique)}},
	  ADDRESS = {Vannes, France},
	  YEAR = {2022},
	  MONTH = Jul,
	  KEYWORDS = {NLP ; Transformers ; CamemBERT ; Distillation},
	  PDF = {https://hal.archives-ouvertes.fr/hal-03674695/file/cap2022.pdf},
	  HAL_ID = {hal-03674695},
	  HAL_VERSION = {v1},
	}
	```