---
license: mit
datasets:
- anti-ai/ViNLI-SimCSE-supervised_v2
- anti-ai/vi_news_wseg
language:
- vi
metrics:
- spearmanr
pipeline_tag: sentence-similarity
library_name: rage
---
# Introduction
## Installation 🔥
- We recommend `python 3.9` or higher, `torch 2.0.0` or higher, and `transformers 4.31.0` or higher.

- RagE is not yet available on PyPI, so for now it can only be installed from source; we plan to upload it to PyPI in the future. Install it with the following commands:
```
git clone https://github.com/anti-aii/RagE.git
cd RagE
pip install -e .
```
## Quick start 🥮
- [1. Initialize the model](#initialize_model)
- [2. Load model from Huggingface Hub](#download_hf)
- [3. List of pretrained models](#list_pretrained)

We provide detailed instructions for running inference with our models in the [notebook](notebook).
### 1. Initialize the model
<a name= 'initialize_model'></a>
Let's initialize the SentenceEmbedding model.

```python
>>> import torch 
>>> from pyvi import ViTokenizer
>>> from rage import SentenceEmbedding
>>> device= torch.device('cuda' if torch.cuda.is_available() else 'cpu')
>>> model= SentenceEmbedding(model_name= "vinai/phobert-base-v2", torch_dtype= torch.float32, aggregation_hidden_states= False, strategy_pooling= "dense_first")
>>> model.to(device)
SentenceEmbeddingConfig(model_base: {'model_type_base': 'RobertaModel', 'model_name': 'vinai/phobert-base-v2', 'type_backbone': 'mlm', 'required_grad_base_model': True, 'aggregation_hidden_states': False, 'concat_embeddings': False, 'dropout': 0.1, 'quantization_config': None}, pooling: {'strategy_pooling': 'dense_first'})
```
Then, we can display the number of parameters in the model.
```python 
>>> model.summary_params()
trainable params: 135588864 || all params: 135588864 || trainable%: 100.0
>>> model.summary()
+---------------------------+-------------+------------------+
|        Layer (type)       |    Params   | Trainable params |
+---------------------------+-------------+------------------+
|    model (RobertaModel)   | 134,998,272 |    134998272     |
| pooling (PoolingStrategy) |   590,592   |      590592      |
|       drp1 (Dropout)      |      0      |        0         |
+---------------------------+-------------+------------------+
```
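As a quick sanity check, the two summaries above agree with each other. The short sketch below only re-adds the numbers printed by `model.summary()`. It also shows that the pooling count equals the size of a single 768 × 768 fully connected layer with bias; that `dense_first` is implemented as such a layer is an assumption inferred from the number alone, not something confirmed by the RagE source.

```python
# Parameter counts copied from the model.summary() output above.
backbone_params = 134_998_272  # model (RobertaModel)
pooling_params = 590_592       # pooling (PoolingStrategy)

total = backbone_params + pooling_params
print(total)  # 135588864, the value reported by model.summary_params()

# Assumption: the pooling head is one dense layer mapping 768 -> 768 with bias.
hidden_size = 768
dense_params = hidden_size * hidden_size + hidden_size  # weights + bias
print(dense_params == pooling_params)  # True
```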
Now we can use the SentenceEmbedding model to encode input sentences. The output of the model is a matrix of shape (batch, dim). We can also load weights that we trained and saved previously.
``` python
>>> model.load("best_sup_general_embedding_phobert2.pt", key= False)
>>> sentences= ["Tôi đang đi học", "Bạn tên là gì?",]
>>> sentences= list(map(lambda x: ViTokenizer.tokenize(x), sentences))
>>> model.encode(sentences, batch_size= 1, normalize_embedding= "l2", return_tensors= "np", verbose= 1)
2/2 [==============================] - 0s 43ms/Sample
array([[ 0.00281098, -0.00829096, -0.01582766, ...,  0.00878178,
         0.01830498, -0.00459659],
       [ 0.00249859, -0.03076724,  0.00033016, ...,  0.01299141,
        -0.00984358, -0.00703243]], dtype=float32)
```
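The `normalize_embedding= "l2"` argument suggests that each output row is scaled to unit length. The following is a minimal NumPy sketch of that operation, not RagE's actual implementation: `l2_normalize` is a hypothetical helper, and the random matrix only stands in for the output of `model.encode` (the dimension 768 is an assumption).

```python
import numpy as np

def l2_normalize(embeddings: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Scale each row of a (batch, dim) matrix to unit L2 norm."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    # Guard against division by zero for all-zero rows.
    return embeddings / np.maximum(norms, eps)

# Stand-in for model.encode(...): 2 sentences, 768 dimensions (hypothetical values).
rng = np.random.default_rng(0)
raw = rng.normal(size=(2, 768)).astype(np.float32)

unit = l2_normalize(raw)
print(unit.shape)  # (2, 768)

# For unit-length rows, the dot product equals the cosine similarity.
cosine = float(unit[0] @ unit[1])
print(cosine)
```

Once the rows have unit length, comparing two sentences reduces to a plain dot product, which is convenient when embeddings are stored in a vector index.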
### 2. Load model from Huggingface Hub
<a name= 'download_hf'> </a>

First, download a pretrained model. 
```python
>>> model= SentenceEmbedding.from_pretrained('anti-ai/VieSemantic-base')
```
Then, we encode the input sentences and compare their similarity.
```python
>>> sentences = ["Nó rất thú_vị", "Nó không thú_vị ."]
>>> output= model.encode(sentences, batch_size= 1, return_tensors= 'pt')
2/2 [==============================] - 0s 40ms/Sample
>>> torch.cosine_similarity(output[0].view(1, -1), output[1].view(1, -1)).cpu().tolist()
[0.5605039596557617]
```
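The same comparison extends to more than two sentences, for example to rank a small corpus against a query. The sketch below uses NumPy with tiny hand-written vectors in place of real embeddings; `cosine_similarity_matrix` is a hypothetical helper and not part of RagE. With real data, the inputs could presumably come from `model.encode(..., return_tensors= "np")`.

```python
import numpy as np

def cosine_similarity_matrix(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Return the (len(a), len(b)) matrix of cosine similarities between rows."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

# Toy 3-dimensional "embeddings" (hypothetical values, for illustration only).
corpus = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
])
query = np.array([[2.0, 0.1, 0.0]])

scores = cosine_similarity_matrix(query, corpus)[0]
ranking = np.argsort(-scores)  # indices of corpus rows, most similar first
print(ranking.tolist())  # [0, 2, 1]
```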

### 3. List of pretrained models
<a name= 'list_pretrained'></a>
This list will be updated as we release notable models. Our models primarily target the Vietnamese language.
You can also find our datasets and pretrained models at https://huggingface.co/anti-ai.

| Model Name | Model Type | #params | Checkpoint |
| - | - | - | - |
| anti-ai/ViEmbedding-base | SentenceEmbedding | 135.5M |[model](https://huggingface.co/anti-ai/ViEmbedding-base) |
| anti-ai/BioViEmbedding-base-unsup | SentenceEmbedding | 135.5M |[model](https://huggingface.co/anti-ai/BioViEmbedding-base-unsup) | 
| anti-ai/VieSemantic-base | SentenceEmbedding | 135.5M |[model](https://huggingface.co/anti-ai/VieSemantic-base) |


## Contacts
If you have any questions about this repo, please contact me at nduc0231@gmail.com.