|
--- |
|
pipeline_tag: sentence-similarity |
|
tags: |
|
- sentence-transformers |
|
- feature-extraction |
|
- sentence-similarity |
|
language: |
|
- id |
|
--- |
|
|
|
# indoSBERT-large |
|
|
|
This is a [sentence-transformers](https://www.SBERT.net) model: It maps sentences & paragraphs to a 256 dimensional dense vector space and can be used for tasks like clustering or semantic search. |
|
|
|
IndoSBERT is a modification of `https://huggingface.co/indobenchmark/indobert-large-p1` that has been fine-tuned using the siamese network scheme inspired by SBERT (Reimers et al., 2019). |
|
This model was fine-tuned with the STS Dataset (2012-2016) which was machine-translated into Indonesian languange. |
|
|
|
This model can provide meaningful semantic sentence embeddings for Indonesian sentences. |
|
|
|
<!--- Describe your model here --> |
|
|
|
## Usage (Sentence-Transformers) |
|
|
|
Using this model becomes easy when you have [sentence-transformers](https://www.SBERT.net) installed: |
|
|
|
``` |
|
pip install -U sentence-transformers |
|
``` |
|
|
|
Then you can use the model like this: |
|
|
|
```python |
|
from sentence_transformers import SentenceTransformer |
|
sentences = ["Komposer favorit saya adalah Joe Hisaishi", "Sapo tahu enak banget"] |
|
|
|
model = SentenceTransformer('denaya/indoSBERT-large') |
|
embeddings = model.encode(sentences) |
|
print(embeddings) |
|
``` |
|
|
|
|
|
|
|
## Training |
|
The model was trained with the parameters: |
|
|
|
**DataLoader**: |
|
|
|
`torch.utils.data.dataloader.DataLoader` of length 1291 with parameters: |
|
``` |
|
{'batch_size': 16, 'sampler': 'torch.utils.data.sampler.RandomSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'} |
|
``` |
|
|
|
**Loss**: |
|
|
|
`sentence_transformers.losses.CosineSimilarityLoss.CosineSimilarityLoss` |
|
|
|
Parameters of the fit()-Method: |
|
``` |
|
{ |
|
"epochs": 50, |
|
"evaluation_steps": 1, |
|
"evaluator": "sentence_transformers.evaluation.EmbeddingSimilarityEvaluator.EmbeddingSimilarityEvaluator", |
|
"max_grad_norm": 1, |
|
"optimizer_class": "<class 'torch.optim.adamw.AdamW'>", |
|
"optimizer_params": { |
|
"lr": 2e-05 |
|
}, |
|
"scheduler": "WarmupLinear", |
|
"steps_per_epoch": null, |
|
"warmup_steps": 100, |
|
"weight_decay": 0.01 |
|
} |
|
``` |
|
|
|
|
|
## Full Model Architecture |
|
``` |
|
SentenceTransformer( |
|
(0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel |
|
(1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False}) |
|
(2): Dense({'in_features': 1024, 'out_features': 256, 'bias': True, 'activation_function': 'torch.nn.modules.activation.Tanh'}) |
|
) |
|
``` |
|
|
|
## Citing & Authors |
|
``` |
|
@inproceedings{reimers-2019-sentence-bert, |
|
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks", |
|
author = "Reimers, Nils and Gurevych, Iryna", |
|
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing", |
|
month = "11", |
|
year = "2019", |
|
publisher = "Association for Computational Linguistics", |
|
url = "https://arxiv.org/abs/1908.10084", |
|
} |
|
|
|
@article{author = {Diana, Denaya}, |
|
title = {IndoSBERT: Indonesian SBERT for Semantic Textual Similarity tasks}, |
|
year = {2023}, |
|
url = {https://huggingface.co/denaya/indoSBERT-large} |
|
} |
|
``` |