Upload 10 files

Files changed (10)

.gitattributes +1 -0
README.md +126 -36
config.json +6 -0
config_sentence_transformers.json +10 -0
model.safetensors +3 -0
modules.json +14 -0
sentence_bert_config.json +4 -0
special_tokens_map.json +7 -1
tokenizer.json +0 -0
tokenizer_config.json +57 -1

.gitattributes CHANGED

@@ -7,3 +7,4 @@
 *.ot filter=lfs diff=lfs merge=lfs -text
 *.onnx filter=lfs diff=lfs merge=lfs -text
 *.msgpack filter=lfs diff=lfs merge=lfs -text
+model.safetensors filter=lfs diff=lfs merge=lfs -text

README.md CHANGED

@@ -1,53 +1,143 @@
 ---
-language:
-- ru
 tags:
-- PyTorch
-- Transformers
 ---
-# BERT large model (uncased) for Sentence Embeddings in Russian language.
-The model is described [in this article](https://habr.com/ru/company/sberdevices/blog/527576/)
-For better quality, use mean token embeddings.
-## Usage (HuggingFace Models Repository)
-You can use the model directly from the model repository to compute sentence embeddings:
-```python
-from transformers import AutoTokenizer, AutoModel
-import torch
-#Mean Pooling - Take attention mask into account for correct averaging
-def mean_pooling(model_output, attention_mask):
-    token_embeddings = model_output[0] #First element of model_output contains all token embeddings
-    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
-    sum_embeddings = torch.sum(token_embeddings * input_mask_expanded, 1)
-    sum_mask = torch.clamp(input_mask_expanded.sum(1), min=1e-9)
-    return sum_embeddings / sum_mask
-#Sentences we want sentence embeddings for
-sentences = ['Привет! Как твои дела?',
-             'А правда, что 42 твое любимое число?']
-#Load AutoModel from huggingface model repository
-tokenizer = AutoTokenizer.from_pretrained("ai-forever/sbert_large_nlu_ru")
-model = AutoModel.from_pretrained("ai-forever/sbert_large_nlu_ru")
-#Tokenize sentences
-encoded_input = tokenizer(sentences, padding=True, truncation=True, max_length=24, return_tensors='pt')
-#Compute token embeddings
-with torch.no_grad():
-    model_output = model(**encoded_input)
-#Perform pooling. In this case, mean pooling
-sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
 ```
-# Authors
-+ [SberDevices](https://sberdevices.ru/) Team.
-+ Aleksandr Abramov: [HF profile](https://huggingface.co/Andrilko), [Github](https://github.com/Ab1992ao), [Kaggle Competitions Master](https://www.kaggle.com/andrilko);
-+ Denis Antykhov: [Github](https://github.com/gaphex);

 ---
+language: []
+library_name: sentence-transformers
 tags:
+- sentence-transformers
+- sentence-similarity
+- feature-extraction
+base_model: ai-forever/sbert_large_nlu_ru
+widget: []
+pipeline_tag: sentence-similarity
 ---
+# SentenceTransformer based on ai-forever/sbert_large_nlu_ru
+This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [ai-forever/sbert_large_nlu_ru](https://huggingface.co/ai-forever/sbert_large_nlu_ru). It maps sentences & paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
+## Model Details
+### Model Description
+- **Model Type:** Sentence Transformer
+- **Base model:** [ai-forever/sbert_large_nlu_ru](https://huggingface.co/ai-forever/sbert_large_nlu_ru) <!-- at revision 95c66a03e1cea189286bf8ba895999f1fd355d8c -->
+- **Maximum Sequence Length:** 512 tokens
+- **Output Dimensionality:** 1024 dimensions
+- **Similarity Function:** Cosine Similarity
+<!-- - **Training Dataset:** Unknown -->
+<!-- - **Language:** Unknown -->
+<!-- - **License:** Unknown -->
+### Model Sources
+- **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
+- **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
+- **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)
+### Full Model Architecture
+```
+SentenceTransformer(
+  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
+  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
+)
+```
+## Usage
+### Direct Usage (Sentence Transformers)
+First install the Sentence Transformers library:
+```bash
+pip install -U sentence-transformers
+```
+Then you can load this model and run inference.
+```python
+from sentence_transformers import SentenceTransformer
+# Download from the 🤗 Hub
+model = SentenceTransformer("sentence_transformers_model_id")
+# Run inference
+sentences = [
+    'The weather is lovely today.',
+    "It's so sunny outside!",
+    'He drove to the stadium.',
+]
+embeddings = model.encode(sentences)
+print(embeddings.shape)
+# [3, 1024]
+# Get the similarity scores for the embeddings
+similarities = model.similarity(embeddings, embeddings)
+print(similarities.shape)
+# [3, 3]
 ```
+<!--
+### Direct Usage (Transformers)
+<details><summary>Click to see the direct usage in Transformers</summary>
+</details>
+-->
+<!--
+### Downstream Usage (Sentence Transformers)
+You can finetune this model on your own dataset.
+<details><summary>Click to expand</summary>
+</details>
+-->
+<!--
+### Out-of-Scope Use
+*List how the model may foreseeably be misused and address what users ought not to do with the model.*
+-->
+<!--
+## Bias, Risks and Limitations
+*What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
+-->
+<!--
+### Recommendations
+*What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
+-->
+## Training Details
+### Framework Versions
+- Python: 3.9.6
+- Sentence Transformers: 3.0.0
+- Transformers: 4.41.2
+- PyTorch: 2.3.0
+- Accelerate:
+- Datasets: 2.19.2
+- Tokenizers: 0.19.1
+## Citation
+### BibTeX
+<!--
+## Glossary
+*Clearly define terms in order to be accessible across audiences.*
+-->
+<!--
+## Model Card Authors
+*Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
+-->
+<!--
+## Model Card Contact
+*Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
+-->
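
The `Pooling` module in the new README's architecture listing has `pooling_mode_mean_tokens` set to `True`, and the card names cosine similarity as the similarity function. A minimal sketch of those two steps in plain Python, assuming toy 3-dimensional token vectors in place of the model's 1024-dimensional outputs; the function names and data below are illustrative and not part of any library:

```python
import math

def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, skipping padded positions (mask == 0)."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            for i, value in enumerate(vector):
                total[i] += value
    count = max(sum(attention_mask), 1e-9)  # guard against an all-padding row
    return [value / count for value in total]

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy inputs (assumed data): the third token of sentence A is padding.
tokens_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [9.0, 9.0, 9.0]]
mask_a = [1, 1, 0]
tokens_b = [[1.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
mask_b = [1, 1]

sentence_a = mean_pool(tokens_a, mask_a)
sentence_b = mean_pool(tokens_b, mask_b)
score = cosine_similarity(sentence_a, sentence_b)
print(sentence_a, sentence_b, round(score, 6))
```

The padded token is excluded from the average, which is the same reason the removed README example multiplied by the expanded attention mask before summing.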

config.json CHANGED

@@ -1,8 +1,10 @@
 {
   "architectures": [
     "BertModel"
   ],
   "attention_probs_dropout_prob": 0.1,
   "directionality": "bidi",
   "gradient_checkpointing": false,
   "hidden_act": "gelu",
@@ -21,6 +23,10 @@
   "pooler_num_fc_layers": 3,
   "pooler_size_per_head": 128,
   "pooler_type": "first_token_transform",
   "type_vocab_size": 2,
   "vocab_size": 120138
 }

 {
+  "_name_or_path": "ai-forever/sbert_large_nlu_ru",
   "architectures": [
     "BertModel"
   ],
   "attention_probs_dropout_prob": 0.1,
+  "classifier_dropout": null,
   "directionality": "bidi",
   "gradient_checkpointing": false,
   "hidden_act": "gelu",
   "pooler_num_fc_layers": 3,
   "pooler_size_per_head": 128,
   "pooler_type": "first_token_transform",
+  "position_embedding_type": "absolute",
+  "torch_dtype": "float32",
+  "transformers_version": "4.41.2",
   "type_vocab_size": 2,
+  "use_cache": true,
   "vocab_size": 120138
 }

config_sentence_transformers.json ADDED

	@@ -0,0 +1,10 @@

+{
+  "__version__": {
+    "sentence_transformers": "3.0.0",
+    "transformers": "4.41.2",
+    "pytorch": "2.3.0"
+  },
+  "prompts": {},
+  "default_prompt_name": null,
+  "similarity_fn_name": null
+}

model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:cea5e5ebffd98391d7c119f2d35a50e546aad6aea7c883bb584754874d27f622
+size 1707679808
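
The three added lines are a Git LFS pointer, not the weights themselves: the repository stores only the spec version, a SHA-256 object ID, and the byte size. A small sketch of reading such a pointer, with the text copied from the hunk above; `parse_lfs_pointer` is a hypothetical helper, and the value count assumes every byte is float32 data (per `torch_dtype` in `config.json`), ignoring the safetensors header:

```python
POINTER_TEXT = """\
version https://git-lfs.github.com/spec/v1
oid sha256:cea5e5ebffd98391d7c119f2d35a50e546aad6aea7c883bb584754874d27f622
size 1707679808
"""

def parse_lfs_pointer(text):
    """Split each 'key value' line of a Git LFS pointer into a dict."""
    fields = {}
    for line in text.splitlines():
        if line.strip():
            key, _, value = line.partition(" ")
            fields[key] = value
    fields["size"] = int(fields["size"])  # size is given in bytes
    return fields

pointer = parse_lfs_pointer(POINTER_TEXT)
size_gib = pointer["size"] / 2**30
approx_values = pointer["size"] // 4  # 4 bytes per float32; header ignored

print(f"{size_gib:.2f} GiB, about {approx_values / 1e6:.0f}M float32 values")
```

The `model.safetensors` rule added to `.gitattributes` is what routes this file through LFS.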

modules.json ADDED

	@@ -0,0 +1,14 @@

+[
+  {
+    "idx": 0,
+    "name": "0",
+    "path": "",
+    "type": "sentence_transformers.models.Transformer"
+  },
+  {
+    "idx": 1,
+    "name": "1",
+    "path": "1_Pooling",
+    "type": "sentence_transformers.models.Pooling"
+  }
+]
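
`modules.json` lists the stages of the pipeline in order, matching the `(0)` and `(1)` entries of the architecture printed in the README. The sketch below only parses the file and prints that order; it does not import Sentence Transformers. Reading the empty `path` as "the repository root" is an assumption about the library's layout convention, and `pipeline_order` is a hypothetical helper:

```python
import json

MODULES_JSON = """
[
  {"idx": 0, "name": "0", "path": "",
   "type": "sentence_transformers.models.Transformer"},
  {"idx": 1, "name": "1", "path": "1_Pooling",
   "type": "sentence_transformers.models.Pooling"}
]
"""

def pipeline_order(modules_json):
    """Return (class name, folder) pairs sorted by the 'idx' field."""
    modules = sorted(json.loads(modules_json), key=lambda m: m["idx"])
    # An empty path is shown as "." (assumed to mean the repository root).
    return [(m["type"].rsplit(".", 1)[-1], m["path"] or ".") for m in modules]

stages = pipeline_order(MODULES_JSON)
for class_name, folder in stages:
    print(f"{class_name:<12} loaded from {folder}")
```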

sentence_bert_config.json ADDED

	@@ -0,0 +1,4 @@

+{
+  "max_seq_length": 512,
+  "do_lower_case": false
+}

special_tokens_map.json CHANGED

	@@ -1 +1,7 @@
-{"unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]"}

+{
+  "cls_token": "[CLS]",
+  "mask_token": "[MASK]",
+  "pad_token": "[PAD]",
+  "sep_token": "[SEP]",
+  "unk_token": "[UNK]"
+}

tokenizer.json ADDED

The diff for this file is too large to render. See raw diff

tokenizer_config.json CHANGED

	@@ -1 +1,57 @@
-{"do_lower_case": true, "unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]", "tokenize_chinese_chars": true, "strip_accents": null, "special_tokens_map_file": null, "name_or_path": "./cased_L-24_H-1024_A-16_pytorch", "do_basic_tokenize": true, "never_split": null}

+{
+  "added_tokens_decoder": {
+    "0": {
+      "content": "[PAD]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "100": {
+      "content": "[UNK]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "101": {
+      "content": "[CLS]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "102": {
+      "content": "[SEP]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "103": {
+      "content": "[MASK]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    }
+  },
+  "clean_up_tokenization_spaces": true,
+  "cls_token": "[CLS]",
+  "do_basic_tokenize": true,
+  "do_lower_case": true,
+  "mask_token": "[MASK]",
+  "model_max_length": 1000000000000000019884624838656,
+  "never_split": null,
+  "pad_token": "[PAD]",
+  "sep_token": "[SEP]",
+  "strip_accents": null,
+  "tokenize_chinese_chars": true,
+  "tokenizer_class": "BertTokenizer",
+  "unk_token": "[UNK]"
+}
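
Two details of the new `tokenizer_config.json` are easy to misread. `added_tokens_decoder` maps token IDs (as strings) to special tokens, and the very large `model_max_length` equals `int(1e30)`, which suggests a "no limit set" sentinel instead of a real limit. The sketch below checks both with values copied from this diff. Treating the `max_seq_length` of 512 from `sentence_bert_config.json` as the effective limit is an assumption about how the two files interact, and the variable names are illustrative:

```python
# ID -> token content, copied from added_tokens_decoder in this diff.
added_tokens_decoder = {
    "0": "[PAD]",
    "100": "[UNK]",
    "101": "[CLS]",
    "102": "[SEP]",
    "103": "[MASK]",
}

# Invert to the lookup most code needs: token -> integer ID.
token_to_id = {token: int(idx) for idx, token in added_tokens_decoder.items()}

MODEL_MAX_LENGTH = 1000000000000000019884624838656  # tokenizer_config.json
MAX_SEQ_LENGTH = 512                                # sentence_bert_config.json

# True when the tokenizer value is the float 1e30 converted to an integer.
is_sentinel = MODEL_MAX_LENGTH == int(1e30)

# Assumption: the smaller of the two values is what bounds the input length.
effective_limit = min(MODEL_MAX_LENGTH, MAX_SEQ_LENGTH)

print(token_to_id["[CLS]"], token_to_id["[SEP]"], is_sentinel, effective_limit)
```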