@@ -7,7 +7,7 @@ tags:
 ---
-# {MODEL_NAME}
 This is a [sentence-transformers](https://www.SBERT.net) model: it maps sentences and paragraphs to a 2048-dimensional dense vector space and can be used for tasks such as clustering or semantic search.
@@ -25,14 +25,55 @@ Then you can use the model like this:
 ```python
 from sentence_transformers import SentenceTransformer
-sentences = ["This is an example sentence", "Each sentence is converted"]
-model = SentenceTransformer('{MODEL_NAME}')
 embeddings = model.encode(sentences)
 print(embeddings)
 ```
 ## Evaluation Results
@@ -87,4 +128,25 @@ SentenceTransformer(
 ## Citing & Authors
-<!--- Describe where people can find more information -->

 ---
+# indo-sentence-bert-large
 This is a [sentence-transformers](https://www.SBERT.net) model: it maps sentences and paragraphs to a 2048-dimensional dense vector space and can be used for tasks such as clustering or semantic search.
 ```python
 from sentence_transformers import SentenceTransformer
+sentences = ["Ibukota Perancis adalah Paris",
+             "Menara Eiffel terletak di Paris, Perancis",
+             "Pizza adalah makanan khas Italia",
+             "Saya kuliah di Carnegie Mellon University"]
+model = SentenceTransformer('firqaaa/indo-sentence-bert-large')
 embeddings = model.encode(sentences)
 print(embeddings)
 ```
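For the semantic search use case mentioned above, embeddings are typically compared with cosine similarity. The following is a minimal, dependency-free sketch of that idea, not part of the model card itself: the helper names `cosine_similarity` and `rank_by_similarity` are hypothetical, and the 3-dimensional toy vectors are made-up stand-ins for the rows returned by `model.encode(sentences)`.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, corpus):
    """Return (index, score) pairs for the corpus vectors, best match first."""
    scores = [(i, cosine_similarity(query, vec)) for i, vec in enumerate(corpus)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy stand-ins for real embeddings (hypothetical values, not model output).
toy_corpus = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.0, 0.1, 0.9],
]
toy_query = [1.0, 0.0, 0.0]

ranking = rank_by_similarity(toy_query, toy_corpus)
print(ranking)
```

With real embeddings, each row of the `embeddings` array can be passed in place of the toy vectors. In practice the `sentence_transformers.util.cos_sim` helper is likely the more convenient choice, since it operates on whole batches at once.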
+## Usage (HuggingFace Transformers)
+Without [sentence-transformers](https://www.SBERT.net), you can use the model as follows: first, pass your input through the transformer model, then apply the right pooling operation on top of the contextualized word embeddings.
+```python
+from transformers import AutoTokenizer, AutoModel
+import torch
+# Mean pooling: take the attention mask into account for correct averaging
+def mean_pooling(model_output, attention_mask):
+    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
+    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
+    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
+# Sentences we want sentence embeddings for
+sentences = ["Ibukota Perancis adalah Paris",
+             "Menara Eiffel terletak di Paris, Perancis",
+             "Pizza adalah makanan khas Italia",
+             "Saya kuliah di Carnegie Mellon University"]
+# Load model from HuggingFace Hub
+tokenizer = AutoTokenizer.from_pretrained('firqaaa/indo-sentence-bert-large')
+model = AutoModel.from_pretrained('firqaaa/indo-sentence-bert-large')
+# Tokenize sentences
+encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
+# Compute token embeddings
+with torch.no_grad():
+    model_output = model(**encoded_input)
+# Perform pooling. In this case, mean pooling.
+sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
+print("Sentence embeddings:")
+print(sentence_embeddings)
+```
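To make the pooling arithmetic concrete, here is a small pure-Python sketch of masked mean pooling for a single sentence. It is an illustration only: the function name `masked_mean_pool` and the toy values are hypothetical, and plain lists stand in for the tensors used in the snippet above. The point it shows is that padded positions (mask value 0) contribute nothing to the average.

```python
def masked_mean_pool(token_embeddings, attention_mask):
    """Average the token vectors of one sentence, skipping positions where the mask is 0."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            for d in range(dim):
                summed[d] += vec[d]
    # Guard against division by zero, analogous to torch.clamp(..., min=1e-9).
    count = max(sum(attention_mask), 1e-9)
    return [s / count for s in summed]


# Two real tokens followed by one padding token (toy values, not model output).
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = masked_mean_pool(tokens, mask)
print(pooled)  # [2.0, 3.0]: the padding vector is ignored
```

Averaging over all three positions instead would let the padding vector dominate the result, which is why the attention mask is passed to the pooling step.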
 ## Evaluation Results
 ## Citing & Authors
+<!--- Describe where people can find more information -->
+```
+@inproceedings{reimers-2019-sentence-bert,
+  title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
+  author = "Reimers, Nils and Gurevych, Iryna",
+  booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
+  month = "11",
+  year = "2019",
+  publisher = "Association for Computational Linguistics",
+  url = "https://arxiv.org/abs/1908.10084",
+}
+```
+```
+@misc{arasyi-2024-indo-sentence-bert,
+  author = "Arasyi, Firqa",
+  title = "indo-sentence-bert: Sentence Transformer for Bahasa Indonesia with Multiple Negative Ranking Loss",
+  year = "2024",
+  month = "2",
+  publisher = "Huggingface",
+  journal = "Huggingface",
+  howpublished = "https://huggingface.co/firqaaa/indo-sentence-bert-large/",
+}
+```