Jivanni committed
Commit a973561
0 Parent(s)
.gitattributes ADDED
@@ -0,0 +1,27 @@
+ *.7z filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.bin.* filter=lfs diff=lfs merge=lfs -text
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.gz filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.parquet filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.rar filter=lfs diff=lfs merge=lfs -text
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tgz filter=lfs diff=lfs merge=lfs -text
+ *.xz filter=lfs diff=lfs merge=lfs -text
+ *.zip filter=lfs diff=lfs merge=lfs -text
+ *.zstandard filter=lfs diff=lfs merge=lfs -text
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,88 @@
+ ---
+ language: italian
+ tags:
+ - text-classification
+ - pytorch
+ - tensorflow
+ datasets:
+ - multi_nli
+ license: mit
+ pipeline_tag: zero-shot-classification
+ widget:
+ - text: "La seconda guerra mondiale vide contrapporsi, tra il 1939 e il 1945, le cosiddette potenze dell'Asse e gli Alleati che, come già accaduto ai belligeranti della prima guerra mondiale, si combatterono su gran parte del pianeta; il conflitto ebbe inizio il 1º settembre 1939 con l'attacco della Germania nazista alla Polonia e terminò, nel teatro europeo, l'8 maggio 1945 con la resa tedesca e, in quello asiatico, il successivo 2 settembre con la resa dell'Impero giapponese dopo i bombardamenti atomici di Hiroshima e Nagasaki."
+   candidate_labels: "storia, geografia, moda, politica, macchine, cibo"
+   multi_class: true
+ ---
+
+ # XLM-roBERTa-large-it-mnli
+
+ ## Version 0.1
+ | | matched-it acc | mismatched-it acc |
+ | -------------------------------------------------------------------------------- |----------------|-------------------|
+ | XLM-roBERTa-large-it-mnli | 84.75 | 85.39 |
+
+ ## Model Description
+ This model takes [xlm-roberta-large](https://huggingface.co/xlm-roberta-large) and fine-tunes it on a subset of NLI data taken from an automatically translated version of the MNLI corpus. It is intended to be used for zero-shot text classification, such as with the Hugging Face [ZeroShotClassificationPipeline](https://huggingface.co/transformers/master/main_classes/pipelines.html#transformers.ZeroShotClassificationPipeline).
+ ## Intended Usage
+ This model is intended to be used for zero-shot text classification of Italian texts.
+ Since the base model was pre-trained on 100 different languages, the
+ model has shown some effectiveness in languages other than Italian as
+ well. See the full list of pre-trained languages in appendix A of the
+ [XLM-RoBERTa paper](https://arxiv.org/abs/1911.02116).
+ For English-only classification, it is recommended to use
+ [bart-large-mnli](https://huggingface.co/facebook/bart-large-mnli) or
+ [a distilled bart MNLI model](https://huggingface.co/models?filter=pipeline_tag%3Azero-shot-classification&search=valhalla).
+ #### With the zero-shot classification pipeline
+ The model can be loaded with the `zero-shot-classification` pipeline like so:
+ ```python
+ from transformers import pipeline
+ classifier = pipeline("zero-shot-classification",
+                       model="Jiva/xlm-roberta-large-it-mnli", device=0, use_fast=True)
+ ```
+ You can then classify in Italian or in any of the other languages the base model was pre-trained on. You can even pass the labels in one language and the sequence to
+ classify in another:
+ ```python
+ # we will classify the following Wikipedia entry about Sardinia
+ sequence_to_classify = "La Sardegna è una regione italiana a statuto speciale di 1 592 730 abitanti con capoluogo Cagliari, la cui denominazione bilingue utilizzata nella comunicazione ufficiale è Regione Autonoma della Sardegna / Regione Autònoma de Sardigna."
+ # we can specify candidate labels in Italian:
+ candidate_labels = ["geografia", "politica", "macchine", "cibo", "moda"]
+ classifier(sequence_to_classify, candidate_labels)
+ # {'labels': ['geografia', 'moda', 'politica', 'macchine', 'cibo'],
+ #  'scores': [0.5027586221694946, 0.19790762662887573, 0.1900099515914917, 0.10961027443408966, 0.07802766561508179]}
+ ```
+ The default hypothesis template is the English `This example is {}.`. If you are working strictly within one language, it
+ may be worthwhile to translate this to the language you are working with:
+ ```python
+ sequence_to_classify = "La Sardegna è una regione italiana a statuto speciale di 1 592 730 abitanti con capoluogo Cagliari, la cui denominazione bilingue utilizzata nella comunicazione ufficiale è Regione Autonoma della Sardegna / Regione Autònoma de Sardigna."
+ candidate_labels = ["geografia", "politica", "macchine", "cibo", "moda"]
+ hypothesis_template = "si parla di {}"
+ classifier(sequence_to_classify, candidate_labels, hypothesis_template=hypothesis_template)
+ # {'labels': ['geografia', 'moda', 'politica', 'macchine', 'cibo'],
+ #  'scores': [0.5027586221694946, 0.19790762662887573, 0.1900099515914917, 0.10961027443408966, 0.07802766561508179]}
+ ```
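+ The widget metadata above sets `multi_class: true`, meaning candidate labels are scored independently rather than normalised against each other. As a minimal sketch not taken from the original card, the same behaviour can be requested from the pipeline with the `multi_label` argument (called `multi_class` in older transformers releases):
+ ```python
+ # score each label on its own instead of softmaxing across all labels
+ classifier(sequence_to_classify, candidate_labels, multi_label=True)
+ # each score is now an independent probability in [0, 1]; scores no longer sum to 1
+ ```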
+ #### With manual PyTorch
+ ```python
+ # pose the sequence as an NLI premise and each label as a hypothesis
+ import torch
+ from transformers import AutoModelForSequenceClassification, AutoTokenizer
+
+ device = "cuda:0" if torch.cuda.is_available() else "cpu"
+ nli_model = AutoModelForSequenceClassification.from_pretrained('Jiva/xlm-roberta-large-it-mnli').to(device)
+ tokenizer = AutoTokenizer.from_pretrained('Jiva/xlm-roberta-large-it-mnli')
+
+ premise = sequence_to_classify           # the Italian text defined above
+ label = "geografia"                      # one of the candidate labels
+ hypothesis = f'si parla di {label}.'
+
+ # run through the model fine-tuned on MNLI
+ x = tokenizer.encode(premise, hypothesis, return_tensors='pt',
+                      truncation='only_first')
+ logits = nli_model(x.to(device))[0]
+
+ # we throw away "neutral" (index 0) and take the probability of
+ # "entailment" (index 1) as the probability of the label being true;
+ # the index mapping comes from this model's config.json
+ entail_contradiction_logits = logits[:, [2, 1]]
+ probs = entail_contradiction_logits.softmax(dim=1)
+ prob_label_is_true = probs[:, 1]
+ ```
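+ Label order differs between NLI checkpoints. This model's `config.json` maps 0 to neutral, 1 to entailment and 2 to contradiction, which is why the indices `[2, 1]` are used above. As a variant sketched here rather than taken from the original card, the indices can also be read from the model config instead of being hard-coded:
+ ```python
+ # look up the entailment/contradiction indices from the checkpoint's own config
+ id2label = {i: name.lower() for i, name in nli_model.config.id2label.items()}
+ label2id = {name: i for i, name in id2label.items()}
+ entail_idx, contra_idx = label2id["entailment"], label2id["contradiction"]
+
+ entail_contradiction_logits = logits[:, [contra_idx, entail_idx]]
+ probs = entail_contradiction_logits.softmax(dim=1)
+ prob_label_is_true = probs[:, 1]  # probability that `label` applies to the premise
+ ```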
+ ## Training
+
+ ### Version 0.1
+ The model has now been retrained on the full training set. Around 1000 sentence pairs were removed from the set because their translation was botched by the translation model.
+
+ ### Version 0.0
+ This model was pre-trained on a set of 100 languages, as described in
+ [the original paper](https://arxiv.org/abs/1911.02116). It was then fine-tuned on the NLI task using an Italian translation of the MNLI dataset (so far only 85% of the train set). The texts were translated with Helsinki-NLP/opus-mt-en-it, with a max output sequence length of 120. The model was trained for 1 epoch with a learning rate of 4e-6 and a batch size of 80; it currently scores 82% accuracy on the remaining 15% of the training set.
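+
+ As an illustration of the data preparation described above, the following is a rough sketch of translating the MNLI pairs with Helsinki-NLP/opus-mt-en-it and a max output length of 120. The actual preprocessing script is not part of this repository, and details such as the batch size are assumptions:
+ ```python
+ # hypothetical preprocessing sketch, not the script actually used for training
+ from datasets import load_dataset
+ from transformers import pipeline
+
+ translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-it", device=0)
+ mnli = load_dataset("multi_nli", split="train")
+
+ def translate_batch(batch):
+     batch["premise"] = [t["translation_text"] for t in translator(batch["premise"], max_length=120)]
+     batch["hypothesis"] = [t["translation_text"] for t in translator(batch["hypothesis"], max_length=120)]
+     return batch
+
+ mnli_it = mnli.map(translate_batch, batched=True, batch_size=32)
+ ```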
config.json ADDED
@@ -0,0 +1,39 @@
+ {
+   "_name_or_path": "xlm-roberta-large-it-mnli",
+   "architectures": [
+     "XLMRobertaForSequenceClassification"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "bos_token_id": 0,
+   "classifier_dropout": null,
+   "eos_token_id": 2,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 1024,
+   "id2label": {
+     "0": "neutral",
+     "1": "entailment",
+     "2": "contradiction"
+   },
+   "initializer_range": 0.02,
+   "intermediate_size": 4096,
+   "label2id": {
+     "contradiction": 2,
+     "entailment": 1,
+     "neutral": 0
+   },
+   "layer_norm_eps": 1e-05,
+   "max_position_embeddings": 514,
+   "model_type": "xlm-roberta",
+   "num_attention_heads": 16,
+   "num_hidden_layers": 24,
+   "output_past": true,
+   "pad_token_id": 1,
+   "position_embedding_type": "absolute",
+   "problem_type": "single_label_classification",
+   "torch_dtype": "float32",
+   "transformers_version": "4.13.0",
+   "type_vocab_size": 1,
+   "use_cache": true,
+   "vocab_size": 250002
+ }
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:4db27ff0ea43ef8cee78f2fd988af78b9da15c5020e2a43b418f5a90d80f9161
+ size 2239747529
sentencepiece.bpe.model ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:cfc8146abe2a0488e9e2a0c56de7952f7c11ab059eca145a0a727afce0db2865
+ size 5069051
special_tokens_map.json ADDED
@@ -0,0 +1 @@
+ {"bos_token": "<s>", "eos_token": "</s>", "unk_token": "<unk>", "sep_token": "</s>", "pad_token": "<pad>", "cls_token": "<s>", "mask_token": {"content": "<mask>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": false}}
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
tokenizer_config.json ADDED
@@ -0,0 +1 @@
+ {"bos_token": "<s>", "eos_token": "</s>", "sep_token": "</s>", "cls_token": "<s>", "unk_token": "<unk>", "pad_token": "<pad>", "mask_token": {"content": "<mask>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true, "__type": "AddedToken"}, "do_lower_case": false, "model_max_length": 512, "special_tokens_map_file": null, "name_or_path": "xlm-roberta-large", "tokenizer_class": "XLMRobertaTokenizer"}