ibaucells committed
Commit a304cd1
1 Parent(s): e305616

Initial commit

1_Pooling/config.json ADDED
@@ -0,0 +1,7 @@
+ {
+   "word_embedding_dimension": 768,
+   "pooling_mode_cls_token": false,
+   "pooling_mode_mean_tokens": true,
+   "pooling_mode_max_tokens": false,
+   "pooling_mode_mean_sqrt_len_tokens": false
+ }
README.md ADDED
@@ -0,0 +1,153 @@
+ ---
+ pipeline_tag: sentence-similarity
+ tags:
+ - sentence-transformers
+ - feature-extraction
+ - sentence-similarity
+ - transformers
+
+ ---
+
+ # ST-NLI-ca_paraphrase-multilingual-mpnet-base
+
+ This is a [sentence-transformers](https://www.SBERT.net) model: It maps sentences & paragraphs to a 768 dimensional dense vector space and can be used for tasks like clustering or semantic search.
+
+ It was developed by further training the multilingual fine-tuned model [paraphrase-multilingual-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2) on NLI data. Specifically, it was trained on two Catalan NLI datasets: [TE-ca](https://huggingface.co/datasets/projecte-aina/teca) and a professional translation of XNLI into Catalan. Training used the Multiple Negatives Ranking Loss with hard negatives, which relies on triplets composed of a premise, an entailed hypothesis, and a contradiction. Note that, given this format, neutral hypotheses from the NLI datasets were not used for training. As a form of data augmentation, the training set was expanded by duplicating each triplet with the premise and entailed hypothesis swapped, resulting in a total of 18,928 triplets.
+
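+ To make the triplet format and the swap-based augmentation concrete, here is a minimal sketch of how such training examples could be assembled with sentence-transformers; the sentences are invented placeholders, not items from the actual datasets:
+
+ ```python
+ from sentence_transformers import InputExample
+
+ # Hypothetical NLI triplets: (premise, entailed hypothesis, contradictory hypothesis)
+ nli_triplets = [
+     ("Avui és un bon dia.", "Fa bon dia.", "Fa un dia horrible."),
+ ]
+
+ train_examples = []
+ for premise, entailment, contradiction in nli_triplets:
+     # Original order: premise as anchor, entailed hypothesis as positive
+     train_examples.append(InputExample(texts=[premise, entailment, contradiction]))
+     # Augmentation: swap premise and entailed hypothesis, keeping the same hard negative
+     train_examples.append(InputExample(texts=[entailment, premise, contradiction]))
+ ```
+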
+ ## Usage (Sentence-Transformers)
+
+ Using this model becomes easy when you have [sentence-transformers](https://www.SBERT.net) installed:
+
+ ```
+ pip install -U sentence-transformers
+ ```
+
+ Then you can use the model like this:
+
+ ```python
+ from sentence_transformers import SentenceTransformer
+ sentences = ["This is an example sentence", "Each sentence is converted"]
+
+ model = SentenceTransformer('{MODEL_NAME}')
+ embeddings = model.encode(sentences)
+ print(embeddings)
+ ```
+
+ For instance, to sort a list of sentences by their similarity to a reference sentence, the following code can be used:
+
+ ```python
+ from sentence_transformers import util
+
+ # `model` is the SentenceTransformer loaded in the previous snippet
+ reference_sent = "Avui és un bon dia."
+ sentences = [
+     "M'agrada el dia que fa.",
+     "Tothom té un mal dia.",
+     "És dijous.",
+     "Fa un dia realment dolent",
+ ]
+
+ reference_sent_embedding = model.encode(reference_sent)
+ similarity_scores = []
+ for sentence in sentences:
+     sent_embedding = model.encode(sentence)
+     cosine_similarity = util.pytorch_cos_sim(reference_sent_embedding, sent_embedding)
+     similarity_scores.append((float(cosine_similarity[0][0]), sentence))
+
+ print(f"Sentences in order of similarity to '{reference_sent}' (from max to min):")
+ for i, (score, sent) in enumerate(sorted(similarity_scores, reverse=True)):
+     print(f"{i}) '{sent}': {score}")
+ ```
+
+
+ ## Usage (HuggingFace Transformers)
+ Without [sentence-transformers](https://www.SBERT.net), you can use the model like this: first, pass your input through the transformer model, then apply the right pooling operation on top of the contextualized word embeddings.
+
+ ```python
+ from transformers import AutoTokenizer, AutoModel
+ import torch
+
+
+ # Mean pooling - take the attention mask into account for correct averaging
+ def mean_pooling(model_output, attention_mask):
+     token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
+     input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
+     return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
+
+
+ # Sentences we want sentence embeddings for
+ sentences = ['This is an example sentence', 'Each sentence is converted']
+
+ # Load model from HuggingFace Hub
+ tokenizer = AutoTokenizer.from_pretrained('{MODEL_NAME}')
+ model = AutoModel.from_pretrained('{MODEL_NAME}')
+
+ # Tokenize sentences
+ encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
+
+ # Compute token embeddings
+ with torch.no_grad():
+     model_output = model(**encoded_input)
+
+ # Perform pooling. In this case, mean pooling.
+ sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
+
+ print("Sentence embeddings:")
+ print(sentence_embeddings)
+ ```
+
+
+
+ ## Evaluation Results
+
+ We evaluated the model on the test set of the Catalan Semantic Textual Similarity dataset, [STS-ca](https://huggingface.co/datasets/projecte-aina/sts-ca), based on the Pearson correlation between the similarity of the embeddings and the gold scores, and on two paraphrase identification tasks in Catalan: [Parafraseja](https://huggingface.co/datasets/projecte-aina/Parafraseja) and a professional translation of PAWS into Catalan.
+
+ | STS-ca (Pearson) | Parafraseja (accuracy) | PAWS-ca (accuracy) |
+ |------------------|------------------------|--------------------|
+ | 0.65             | 0.72                   | 0.65               |
+
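+ As a rough illustration of the embedding-similarity evaluation (not the exact evaluation script used for the card), the Pearson correlation on STS-ca could be computed along these lines; the dataset field names ("sentence1", "sentence2", "avg") are assumptions about the STS-ca schema:
+
+ ```python
+ from datasets import load_dataset
+ from scipy.stats import pearsonr
+ from sentence_transformers import SentenceTransformer, util
+
+ model = SentenceTransformer('{MODEL_NAME}')
+
+ # Assumed column names; adjust to the actual STS-ca schema if they differ
+ sts = load_dataset("projecte-aina/sts-ca", split="test")
+ emb1 = model.encode(sts["sentence1"], convert_to_tensor=True)
+ emb2 = model.encode(sts["sentence2"], convert_to_tensor=True)
+ # Cosine similarity of each sentence pair, taken from the diagonal of the full matrix
+ cosine_scores = util.cos_sim(emb1, emb2).diagonal().cpu().numpy()
+
+ pearson, _ = pearsonr(cosine_scores, sts["avg"])
+ print(f"STS-ca Pearson correlation: {pearson:.2f}")
+ ```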
+
+ ## Training
+ The model was trained with the following parameters:
+
+ **DataLoader**:
+
+ `sentence_transformers.datasets.NoDuplicatesDataLoader.NoDuplicatesDataLoader` of length 147 with parameters:
+ ```
+ {'batch_size': 128}
+ ```
+
+ **Loss**:
+
+ `sentence_transformers.losses.MultipleNegativesRankingLoss.MultipleNegativesRankingLoss` with parameters:
+ ```
+ {'scale': 20.0, 'similarity_fct': 'cos_sim'}
+ ```
+
+ Parameters of the fit() method:
+ ```
+ {
+     "epochs": 1,
+     "evaluation_steps": 14,
+     "evaluator": "sentence_transformers.evaluation.EmbeddingSimilarityEvaluator.EmbeddingSimilarityEvaluator",
+     "max_grad_norm": 1,
+     "optimizer_class": "<class 'torch.optim.adamw.AdamW'>",
+     "optimizer_params": {
+         "lr": 2e-05
+     },
+     "scheduler": "WarmupLinear",
+     "steps_per_epoch": null,
+     "warmup_steps": 15,
+     "weight_decay": 0.01
+ }
+ ```
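+
+ Putting these pieces together, a minimal sketch of the corresponding training call with the classic sentence-transformers fit() API could look as follows; the toy triplet is a placeholder for the 18,928 real triplets described above:
+
+ ```python
+ from sentence_transformers import SentenceTransformer, InputExample, losses
+ from sentence_transformers.datasets import NoDuplicatesDataLoader
+
+ # Toy triplet for illustration only; the actual training set holds 18,928 triplets
+ # built from TE-ca and the Catalan translation of XNLI.
+ train_examples = [
+     InputExample(texts=["Avui és un bon dia.", "Fa bon dia.", "Fa un dia horrible."]),
+ ]
+
+ model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-mpnet-base-v2")
+
+ # NoDuplicatesDataLoader keeps duplicate texts out of a batch, which matters because
+ # MultipleNegativesRankingLoss treats every other in-batch example as a negative.
+ # batch_size=128 matches the reported parameters; shrink it for a toy run this small,
+ # since the loader waits until it can fill a whole batch without duplicates.
+ train_dataloader = NoDuplicatesDataLoader(train_examples, batch_size=128)
+ train_loss = losses.MultipleNegativesRankingLoss(model, scale=20.0)
+
+ model.fit(
+     train_objectives=[(train_dataloader, train_loss)],
+     epochs=1,
+     warmup_steps=15,
+     optimizer_params={"lr": 2e-05},
+     weight_decay=0.01,
+     max_grad_norm=1,
+     scheduler="WarmupLinear",
+ )
+ ```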
+
+
+ ## Full Model Architecture
+ ```
+ SentenceTransformer(
+   (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: XLMRobertaModel
+   (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
+ )
+ ```
+
+ ## Citing & Authors
+
+ For further information, send an email to aina@bsc.es
config.json ADDED
@@ -0,0 +1,29 @@
+ {
+   "_name_or_path": "/gpfs/projects/bsc88/huggingface/models/paraphrase-multilingual-mpnet-base-v2/",
+   "architectures": [
+     "XLMRobertaModel"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "bos_token_id": 0,
+   "classifier_dropout": null,
+   "eos_token_id": 2,
+   "gradient_checkpointing": false,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 768,
+   "initializer_range": 0.02,
+   "intermediate_size": 3072,
+   "layer_norm_eps": 1e-05,
+   "max_position_embeddings": 514,
+   "model_type": "xlm-roberta",
+   "num_attention_heads": 12,
+   "num_hidden_layers": 12,
+   "output_past": true,
+   "pad_token_id": 1,
+   "position_embedding_type": "absolute",
+   "torch_dtype": "float32",
+   "transformers_version": "4.33.2",
+   "type_vocab_size": 1,
+   "use_cache": true,
+   "vocab_size": 250002
+ }
config_sentence_transformers.json ADDED
@@ -0,0 +1,7 @@
+ {
+   "__version__": {
+     "sentence_transformers": "2.0.0",
+     "transformers": "4.7.0",
+     "pytorch": "1.9.0+cu102"
+   }
+ }
modules.json ADDED
@@ -0,0 +1,14 @@
+ [
+   {
+     "idx": 0,
+     "name": "0",
+     "path": "",
+     "type": "sentence_transformers.models.Transformer"
+   },
+   {
+     "idx": 1,
+     "name": "1",
+     "path": "1_Pooling",
+     "type": "sentence_transformers.models.Pooling"
+   }
+ ]
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:5ccd79a44ad1889ba22714efdc6893a40a62708cc65c50f0e049d5862b448733
+ size 1112241321
sentence_bert_config.json ADDED
@@ -0,0 +1,4 @@
+ {
+   "max_seq_length": 128,
+   "do_lower_case": false
+ }
sentencepiece.bpe.model ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:cfc8146abe2a0488e9e2a0c56de7952f7c11ab059eca145a0a727afce0db2865
+ size 5069051