LLMXperts committed · Commit feb4435 · verified · 1 Parent(s): 7939728

Transferred from Omartificial-Intelligence-Space/Arabic-Triplet-Matryoshka-V2

1_Pooling/config.json ADDED
@@ -0,0 +1,10 @@
{
  "word_embedding_dimension": 768,
  "pooling_mode_cls_token": false,
  "pooling_mode_mean_tokens": true,
  "pooling_mode_max_tokens": false,
  "pooling_mode_mean_sqrt_len_tokens": false,
  "pooling_mode_weightedmean_tokens": false,
  "pooling_mode_lasttoken": false,
  "include_prompt": true
}
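
This pooling configuration selects plain mean pooling over the token embeddings (768 dimensions; CLS, max and last-token pooling are disabled). Below is a minimal sketch of the equivalent operation with plain `transformers`, shown for clarity only; the repo id comes from the README's usage example and the Arabic sentence is illustrative:

```python
import torch
from transformers import AutoModel, AutoTokenizer

repo_id = "Omartificial-Intelligence-Space/Arabic-Triplet-Matryoshka-V2"  # from the README
tokenizer = AutoTokenizer.from_pretrained(repo_id)
encoder = AutoModel.from_pretrained(repo_id)

def mean_pool(last_hidden_state, attention_mask):
    # pooling_mode_mean_tokens=true: average the token embeddings,
    # ignoring padding positions via the attention mask.
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1e-9)
    return summed / counts

batch = tokenizer(["جملة عربية للتجربة"], padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    output = encoder(**batch)
embedding = mean_pool(output.last_hidden_state, batch["attention_mask"])
print(embedding.shape)  # torch.Size([1, 768])
```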
README.md ADDED
@@ -0,0 +1,129 @@
---
base_model: aubmindlab/bert-base-arabertv02
datasets:
- akhooli/arabic-triplets-1m-curated-sims-len
language:
- ar
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- transformers.js
- transformers
- sentence-similarity
- feature-extraction
- dataset_size:75000
- loss:MatryoshkaLoss
- loss:MultipleNegativesRankingLoss
- mteb
license: apache-2.0
---

# Arabic Triplet Matryoshka V2 Model [ATM2]

![image/png](https://cdn-uploads.huggingface.co/production/uploads/628f7a71dd993507cfcbe587/FrLQzFUJ3grEUOdONWGME.png)

## Model Description

Arabic-Triplet-Matryoshka-V2-Model is a state-of-the-art Arabic language embedding model based on the [sentence-transformers](https://www.SBERT.net) framework. It is fine-tuned from [aubmindlab/bert-base-arabertv02](https://huggingface.co/aubmindlab/bert-base-arabertv02) and specifically designed to capture the rich semantic nuances of Arabic text.

This model maps sentences and paragraphs to a 768-dimensional dense vector space, enabling high-quality semantic text operations including:
- Semantic textual similarity
- Semantic search
- Paraphrase mining
- Text classification
- Clustering
- Information retrieval
- Question answering

## Key Features

- **State-of-the-Art Performance**: Achieves 0.85 on STS17 and 0.64 on STS22.v2, for an average score of 0.745, making it the leading Arabic embedding model currently available.
- **MatryoshkaLoss Training**: Uses nested embedding learning to create hierarchical embeddings at multiple resolutions (see the truncation sketch below).
- **Optimization**: Trained for 3 epochs with a final training loss of 0.718.
- **Full Arabic Language Support**: Designed specifically to handle the complexity and morphological richness of the Arabic language.
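
As a rough illustration of the nested (Matryoshka) property, the checkpoint can be loaded with a reduced output size. This is a sketch only; `truncate_dim` is a sentence-transformers option, and the 256-dimension choice is an example rather than a recommendation from the model card:

```python
from sentence_transformers import SentenceTransformer

# Keep only the first 256 embedding dimensions (illustrative choice).
model_256 = SentenceTransformer(
    "Omartificial-Intelligence-Space/Arabic-Triplet-Matryoshka-V2",
    truncate_dim=256,
)

embeddings = model_256.encode(["جملة أولى", "جملة ثانية"])
print(embeddings.shape)  # (2, 256) instead of (2, 768)
```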

## Training Details

The model was trained using a combination of two loss functions (a configuration sketch follows the parameter list below):
- **MatryoshkaLoss**: Enables the creation of nested embeddings at multiple resolutions, allowing for efficient and adaptable representations.
- **MultipleNegativesRankingLoss**: Enhances the model's ability to discriminate between semantically similar and dissimilar text pairs.

Training parameters:
- **Base model**: aubmindlab/bert-base-arabertv02
- **Dataset**: akhooli/arabic-triplets-1m-curated-sims-len (1M samples)
- **Epochs**: 3
- **Final Loss**: 0.718
- **Embedding Dimension**: 768
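
A minimal sketch of how these two losses are typically combined in sentence-transformers; the Matryoshka dimension list is an assumption for illustration, since the model card does not state the exact dimensions used:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss

model = SentenceTransformer("aubmindlab/bert-base-arabertv02")

# Ranks the true positive above in-batch negatives for each anchor.
base_loss = MultipleNegativesRankingLoss(model)

# Wraps the base loss and applies it at several nested embedding sizes.
# The dimension list below is illustrative, not taken from the model card.
loss = MatryoshkaLoss(model, base_loss, matryoshka_dims=[768, 512, 256, 128, 64])
```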

## Performance

The model demonstrates exceptional performance on standard Arabic semantic textual similarity benchmarks (an evaluation sketch follows):
- **STS17**: 0.85
- **STS22.v2**: 0.64
- **Average**: 0.745

This represents the current state of the art for Arabic embedding models, outperforming previous approaches by a significant margin.
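
A rough outline of how such scores can be checked with the `mteb` package; task names and options vary across `mteb` versions, so treat this as a sketch rather than the exact evaluation script behind the reported numbers:

```python
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Omartificial-Intelligence-Space/Arabic-Triplet-Matryoshka-V2")

# Task names are illustrative; consult the MTEB task list for the Arabic STS subsets.
evaluation = MTEB(tasks=["STS17", "STS22"])
results = evaluation.run(model, output_folder="mteb_results")
```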

## Use Cases

This model is particularly well-suited for:
- **Information Retrieval**: Enhancing search capabilities for Arabic content (see the retrieval sketch below).
- **Document Similarity**: Identifying similar documents or text passages.
- **Text Classification**: Powering classification systems for Arabic content.
- **Question Answering**: Supporting Arabic QA systems with improved semantic understanding.
- **Semantic Clustering**: Organizing Arabic text data based on meaning.
- **Cross-lingual Applications**: When combined with other language models for multilingual applications.
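
For retrieval, the embeddings can be plugged into `sentence_transformers.util.semantic_search`; a minimal sketch with a toy corpus and query (both illustrative, not from the model card):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("Omartificial-Intelligence-Space/Arabic-Triplet-Matryoshka-V2")

corpus = ["القاهرة هي عاصمة مصر", "الرياض هي عاصمة السعودية", "كرة القدم رياضة شعبية"]
query = "ما هي عاصمة مصر؟"

corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embedding = model.encode([query], convert_to_tensor=True)

# Returns the top-k most similar corpus entries for the query.
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)[0]
for hit in hits:
    print(corpus[hit["corpus_id"]], round(hit["score"], 3))
```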

## Usage Examples

```python
from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("Omartificial-Intelligence-Space/Arabic-Triplet-Matryoshka-V2")

# Run inference
sentences = [
    'SENTENCE 1',
    'SENTENCE 2',
    'SENTENCE 3',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
```

## Limitations

Despite its strong performance, users should be aware of the following limitations:
- The model may not perform optimally on highly technical or domain-specific Arabic text that was underrepresented in the training data.
- As with all embedding models, performance may vary across different Arabic dialects and regional variations.
- The model is optimized for semantic similarity tasks and may require fine-tuning for other specific applications.

## Ethical Considerations

This model is intended for research and applications that benefit Arabic language processing. Users should be mindful of potential biases that may exist in the training data and the resulting embeddings. We encourage responsible use of this technology and welcome feedback on ways to improve fairness and representation.

## Citation

If you use the Arabic Triplet Matryoshka V2 model in your research or applications, please cite it as follows:

```bibtex
@article{nacar2024enhancing,
  title={Enhancing Semantic Similarity Understanding in Arabic NLP with Nested Embedding Learning},
  author={Nacar, Omer and Koubaa, Anis},
  journal={arXiv preprint arXiv:2407.21139},
  year={2024}
}
```

## Acknowledgements

We would like to acknowledge [AraBERT](https://github.com/aub-mind/arabert) for the base model and [akhooli](https://huggingface.co/akhooli) for the valuable dataset that made this work possible.

config.json ADDED
@@ -0,0 +1,25 @@
{
  "_name_or_path": "aubmindlab/bert-base-arabertv02",
  "architectures": [
    "BertModel"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "torch_dtype": "float32",
  "transformers_version": "4.43.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 64000
}
config_sentence_transformers.json ADDED
@@ -0,0 +1,10 @@
{
  "__version__": {
    "sentence_transformers": "3.0.1",
    "transformers": "4.43.1",
    "pytorch": "2.2.2"
  },
  "prompts": {},
  "default_prompt_name": null,
  "similarity_fn_name": null
}
model.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:ee761022a6b8fc559f75ccb1ebb143e695acdf7d2263e41b6ba1535c9c131798
size 540795752
modules.json ADDED
@@ -0,0 +1,14 @@
[
  {
    "idx": 0,
    "name": "0",
    "path": "",
    "type": "sentence_transformers.models.Transformer"
  },
  {
    "idx": 1,
    "name": "1",
    "path": "1_Pooling",
    "type": "sentence_transformers.models.Pooling"
  }
]
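
modules.json declares the two-stage pipeline: module 0 is the Transformer encoder and module 1 is the Pooling head stored under 1_Pooling. A minimal sketch of assembling an equivalent pipeline by hand (the base-model id comes from config.json; this is not the repository's own build script):

```python
from sentence_transformers import SentenceTransformer, models

# Module 0: the Transformer encoder (base model id taken from config.json).
word_embedding = models.Transformer("aubmindlab/bert-base-arabertv02", max_seq_length=512)

# Module 1: mean pooling over token embeddings, as in 1_Pooling/config.json.
pooling = models.Pooling(
    word_embedding.get_word_embedding_dimension(),  # 768
    pooling_mode="mean",
)

model = SentenceTransformer(modules=[word_embedding, pooling])
```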
sentence_bert_config.json ADDED
@@ -0,0 +1,4 @@
{
  "max_seq_length": 512,
  "do_lower_case": false
}
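
This file caps the encoder's input length at 512 WordPiece tokens; a quick check after loading (repo id as in the README):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Omartificial-Intelligence-Space/Arabic-Triplet-Matryoshka-V2")
print(model.max_seq_length)  # 512; longer inputs are truncated before encoding
```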
special_tokens_map.json ADDED
@@ -0,0 +1,37 @@
{
  "cls_token": {
    "content": "[CLS]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "mask_token": {
    "content": "[MASK]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": {
    "content": "[PAD]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "sep_token": {
    "content": "[SEP]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "unk_token": {
    "content": "[UNK]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  }
}
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,86 @@
{
  "added_tokens_decoder": {
    "0": {
      "content": "[PAD]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "1": {
      "content": "[UNK]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "2": {
      "content": "[CLS]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "3": {
      "content": "[SEP]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "4": {
      "content": "[MASK]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "5": {
      "content": "[رابط]",
      "lstrip": false,
      "normalized": true,
      "rstrip": false,
      "single_word": true,
      "special": true
    },
    "6": {
      "content": "[بريد]",
      "lstrip": false,
      "normalized": true,
      "rstrip": false,
      "single_word": true,
      "special": true
    },
    "7": {
      "content": "[مستخدم]",
      "lstrip": false,
      "normalized": true,
      "rstrip": false,
      "single_word": true,
      "special": true
    }
  },
  "clean_up_tokenization_spaces": true,
  "cls_token": "[CLS]",
  "do_basic_tokenize": true,
  "do_lower_case": false,
  "mask_token": "[MASK]",
  "max_len": 512,
  "model_max_length": 512,
  "never_split": [
    "[بريد]",
    "[مستخدم]",
    "[رابط]"
  ],
  "pad_token": "[PAD]",
  "sep_token": "[SEP]",
  "strip_accents": null,
  "tokenize_chinese_chars": true,
  "tokenizer_class": "BertTokenizer",
  "unk_token": "[UNK]"
}
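
Besides the standard BERT special tokens, the tokenizer keeps AraBERT's placeholders for URLs ([رابط]), e-mail addresses ([بريد]) and user mentions ([مستخدم]) as single, never-split tokens. A quick illustrative check (the example string is not from the repository):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Omartificial-Intelligence-Space/Arabic-Triplet-Matryoshka-V2")

# The bracketed placeholders should come back as single tokens rather than being split.
print(tokenizer.tokenize("تواصل معي عبر [بريد] أو [رابط]"))
```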
vocab.txt ADDED
The diff for this file is too large to render. See raw diff