---
license: mit
datasets:
- anti-ai/ViNLI-SimCSE-supervised_v2
- anti-ai/vi_news_wseg
language:
- vi
metrics:
- spearmanr
pipeline_tag: sentence-similarity
library_name: rage
---
# Introduction

## Installation 🔥

- We recommend `python 3.9` or higher, `torch 2.0.0` or higher, and `transformers 4.31.0` or higher.

- RagE is not yet available on PyPI; we plan to upload it in the future. For now, install it from source with the following commands:

```
git clone https://github.com/anti-aii/RagE.git
cd RagE
pip install -e .
```
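
The version recommendations above can be checked before installing. The following is a minimal sketch and not part of RagE: the helper names `parse_version`, `meets_minimum`, and `check_environment` are hypothetical, and the comparison only understands plain numeric versions such as `2.0.0` (a local suffix like `+cu118` or a trailing `dev0` is ignored).

```python
import re
import sys
from importlib import metadata

# Minimum versions taken from the recommendation above.
MINIMUMS = {"torch": "2.0.0", "transformers": "4.31.0"}


def parse_version(text):
    """Turn '2.1.0+cu118' into (2, 1, 0); stop at the first non-numeric part."""
    parts = []
    for piece in text.split("+")[0].split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Compare two version strings after padding them to the same length."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_environment():
    """Return a list of human-readable problems; an empty list means all good."""
    problems = []
    if sys.version_info < (3, 9):
        problems.append("python 3.9 or higher is recommended")
    for package, minimum in MINIMUMS.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            problems.append(f"{package} is not installed")
            continue
        if not meets_minimum(installed, minimum):
            problems.append(f"{package} {installed} is older than {minimum}")
    return problems


if __name__ == "__main__":
    for line in check_environment() or ["environment looks fine"]:
        print(line)
```

Running the script prints one line per problem it finds, or a single confirmation line if nothing is missing.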

## Quick start 🥮

- [1. Initialize the model](#initialize_model)
- [2. Load model from Hugging Face Hub](#download_hf)
- [3. List of pretrained models](#list_pretrained)

We provide detailed instructions for using our models for inference; see the [notebook](notebook).

### 1. Initialize the model

<a name= 'initialize_model'></a>

Let's initialize the `SentenceEmbedding` model:

```python
>>> import torch
>>> from pyvi import ViTokenizer
>>> from rage import SentenceEmbedding
>>> device= torch.device('cuda' if torch.cuda.is_available() else 'cpu')
>>> model= SentenceEmbedding(model_name= "vinai/phobert-base-v2", torch_dtype= torch.float32, aggregation_hidden_states= False, strategy_pooling= "dense_first")
>>> model.to(device)
SentenceEmbeddingConfig(model_base: {'model_type_base': 'RobertaModel', 'model_name': 'vinai/phobert-base-v2', 'type_backbone': 'mlm', 'required_grad_base_model': True, 'aggregation_hidden_states': False, 'concat_embeddings': False, 'dropout': 0.1, 'quantization_config': None}, pooling: {'strategy_pooling': 'dense_first'})
```

Then, we can show the number of parameters in the model.

```python
>>> model.summary_params()
trainable params: 135588864 || all params: 135588864 || trainable%: 100.0
>>> model.summary()
+---------------------------+-------------+------------------+
| Layer (type)              | Params      | Trainable params |
+---------------------------+-------------+------------------+
| model (RobertaModel)      | 134,998,272 | 134998272        |
| pooling (PoolingStrategy) | 590,592     | 590592           |
| drp1 (Dropout)            | 0           | 0                |
+---------------------------+-------------+------------------+
```

Now we can use the `SentenceEmbedding` model to encode input sentences. The output is a matrix of shape (batch, dim). We can also load weights that were previously trained and saved, as the first line of the example does.

```python
>>> model.load("best_sup_general_embedding_phobert2.pt", key= False)
>>> sentences= ["Tôi đang đi học", "Bạn tên là gì?",]
>>> # Word-segment the sentences first (PhoBERT expects word-segmented input)
>>> sentences= list(map(lambda x: ViTokenizer.tokenize(x), sentences))
>>> model.encode(sentences, batch_size= 1, normalize_embedding= "l2", return_tensors= "np", verbose= 1)
2/2 [==============================] - 0s 43ms/Sample
array([[ 0.00281098, -0.00829096, -0.01582766, ...,  0.00878178,
         0.01830498, -0.00459659],
       [ 0.00249859, -0.03076724,  0.00033016, ...,  0.01299141,
        -0.00984358, -0.00703243]], dtype=float32)
```
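
To make the `(batch, dim)` output shape and the `normalize_embedding= "l2"` option more concrete, here is a small sketch in plain Python. It does not call RagE. It assumes that "l2" means dividing each row by its Euclidean norm, which is the usual meaning but is not confirmed by this README, and the helper name `l2_normalize` is hypothetical.

```python
import math


def l2_normalize(matrix, eps=1e-12):
    """Scale every row of a (batch, dim) matrix to unit Euclidean length."""
    normalized = []
    for row in matrix:
        norm = math.sqrt(sum(value * value for value in row))
        # eps guards against division by zero for an all-zero row.
        normalized.append([value / max(norm, eps) for value in row])
    return normalized


# Toy "embeddings": batch of 2 sentences, dim of 2 (real models use far more).
embeddings = [[3.0, 4.0], [1.0, 0.0]]
unit = l2_normalize(embeddings)

print((len(unit), len(unit[0])))  # the (batch, dim) shape is unchanged
for row in unit:
    print(math.sqrt(sum(value * value for value in row)))  # each row has norm 1
```

After this kind of normalization, the dot product of two rows equals their cosine similarity, which is one common reason to normalize embeddings before comparing them.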

### 2. Load model from Hugging Face Hub

<a name= 'download_hf'> </a>

First, download a pretrained model.

```python
>>> model= SentenceEmbedding.from_pretrained('anti-ai/VieSemantic-base')
```

Then, we encode the input sentences (already word-segmented here) and compare their similarity.

```python
>>> sentences = ["Nó rất thú_vị", "Nó không thú_vị ."]
>>> output= model.encode(sentences, batch_size= 1, return_tensors= 'pt')
2/2 [==============================] - 0s 40ms/Sample
>>> torch.cosine_similarity(output[0].view(1, -1), output[1].view(1, -1)).cpu().tolist()
[0.5605039596557617]
```
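
The comparison above reduces to the cosine of the angle between two embedding vectors. As a rough illustration that runs without RagE or torch, the sketch below computes the same quantity in plain Python and uses it to rank candidates against a query. The vectors are made-up toy values, not real model outputs, and `cosine_similarity` and `rank_by_similarity` are hypothetical helper names.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, candidates):
    """Return candidate indices ordered from most to least similar."""
    scored = [(cosine_similarity(query, vec), idx)
              for idx, vec in enumerate(candidates)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored]


# Toy vectors standing in for sentence embeddings.
query = [1.0, 0.0]
candidates = [[0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]

print(rank_by_similarity(query, candidates))
```

With real embeddings, the same ranking step gives a simple semantic search: encode the query and the candidate sentences, then sort the candidates by their similarity to the query.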

### 3. List of pretrained models

<a name= 'list_pretrained'></a>

This list will be updated as we release notable models. Our models primarily target the Vietnamese language.

You can also access our datasets and pretrained models at https://huggingface.co/anti-ai.

| Model Name | Model Type | #params | Checkpoint |
| --- | --- | --- | --- |
| anti-ai/ViEmbedding-base | SentenceEmbedding | 135.5M | [model](https://huggingface.co/anti-ai/ViEmbedding-base) |
| anti-ai/BioViEmbedding-base-unsup | SentenceEmbedding | 135.5M | [model](https://huggingface.co/anti-ai/BioViEmbedding-base-unsup) |
| anti-ai/VieSemantic-base | SentenceEmbedding | 135.5M | [model](https://huggingface.co/anti-ai/VieSemantic-base) |

## Contacts

If you have any questions about this repo, please contact me at nduc0231@gmail.com.