---
license: mit
language:
- vi
datasets:
- anti-ai/ViNLI-Zalo-supervised
- anti-ai/ViNLI-SimCSE-supervised_v2
- anti-ai/Vietnamese_MsMacro
metrics:
- repllabs/mean_reciprocal_rank
- spearmanr
pipeline_tag: sentence-similarity
library_name: rage
---

# Introduction

## Installation 🔥

- We recommend `python 3.9` or higher, `torch 2.0.0` or higher, `transformers 4.31.0` or higher.
- RagE can currently only be installed from source; we plan to publish it on PyPI in the future.

RagE can be installed from source with the following commands:

```
git clone https://github.com/anti-aii/RagE.git
cd RagE
pip install -e .
```

## Quick start 🥮

- [1. Initialize the model](#initialize_model)
- [2. Load model from Huggingface Hub](#download_hf)
- [3. List of pretrained models](#list_pretrained)

Detailed instructions for running inference with our models are in the [notebook](notebook).

### 1. Initialize the model

Let's initialize the `SentenceEmbedding` model.

```python
>>> import torch
>>> from pyvi import ViTokenizer
>>> from rage import SentenceEmbedding
>>> device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
>>> model = SentenceEmbedding(
...     model_name="vinai/phobert-base-v2",
...     torch_dtype=torch.float32,
...     aggregation_hidden_states=False,
...     strategy_pooling="dense_first",
... )
>>> model.to(device)
SentenceEmbeddingConfig(model_base: {'model_type_base': 'RobertaModel', 'model_name': 'vinai/phobert-base-v2', 'type_backbone': 'mlm', 'required_grad_base_model': True, 'aggregation_hidden_states': False, 'concat_embeddings': False, 'dropout': 0.1, 'quantization_config': None}, pooling: {'strategy_pooling': 'dense_first'})
```

Then, we can show the number of parameters in the model.
```python
>>> model.summary_params()
trainable params: 135588864 || all params: 135588864 || trainable%: 100.0
>>> model.summary()
+---------------------------+-------------+------------------+
| Layer (type)              | Params      | Trainable params |
+---------------------------+-------------+------------------+
| model (RobertaModel)      | 134,998,272 | 134998272        |
| pooling (PoolingStrategy) | 590,592     | 590592           |
| drp1 (Dropout)            | 0           | 0                |
+---------------------------+-------------+------------------+
```

Now we can use the `SentenceEmbedding` model to encode input sentences. The output is a matrix of shape `(batch, dim)`. We can also load weights that were previously trained and saved.

```python
>>> model.load("best_sup_general_embedding_phobert2.pt", key=False)
>>> # "I am going to school", "What is your name?"
>>> sentences = ["Tôi đang đi học", "Bạn tên là gì?"]
>>> # Segment Vietnamese words before encoding
>>> sentences = [ViTokenizer.tokenize(s) for s in sentences]
>>> model.encode(sentences, batch_size=1, normalize_embedding="l2", return_tensors="np", verbose=1)
2/2 [==============================] - 0s 43ms/Sample
array([[ 0.00281098, -0.00829096, -0.01582766, ...,  0.00878178,
         0.01830498, -0.00459659],
       [ 0.00249859, -0.03076724,  0.00033016, ...,  0.01299141,
        -0.00984358, -0.00703243]], dtype=float32)
```

### 2. Load model from Huggingface Hub

First, download a pretrained model.

```python
>>> model = SentenceEmbedding.from_pretrained('anti-ai/VieSemantic-base')
```

Then, we encode the input sentences and compare their similarity.

```python
>>> # "It is very interesting", "It is not interesting."
>>> sentences = ["Nó rất thú_vị", "Nó không thú_vị ."]
>>> output = model.encode(sentences, batch_size=1, return_tensors='pt')
2/2 [==============================] - 0s 40ms/Sample
>>> torch.cosine_similarity(output[0].view(1, -1), output[1].view(1, -1)).cpu().tolist()
[0.5605039596557617]
```

### 3. List of pretrained models

This list will be updated as we release notable models. Our models primarily target the Vietnamese language.
Additionally, you can access our datasets and pretrained models by visiting https://huggingface.co/anti-ai.

| Model Name | Model Type | #params | Checkpoint |
| - | - | - | - |
| anti-ai/ViEmbedding-base | SentenceEmbedding | 135.5M | [model](https://huggingface.co/anti-ai/ViEmbedding-base) |
| anti-ai/BioViEmbedding-base-unsup | SentenceEmbedding | 135.5M | [model](https://huggingface.co/anti-ai/BioViEmbedding-base-unsup) |
| anti-ai/VieSemantic-base | SentenceEmbedding | 135.5M | [model](https://huggingface.co/anti-ai/VieSemantic-base) |

## Contacts

If you have any questions about this repo, please contact me (nduc0231@gmail.com).
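
## Appendix: cosine similarity sketch

The quick start passes `normalize_embedding="l2"` to `encode` and compares two sentence vectors with `torch.cosine_similarity`. The sketch below illustrates the arithmetic behind those two steps on toy vectors, using only the Python standard library. The helper names `l2_normalize` and `cosine_similarity` are hypothetical and are not part of the RagE API; in practice you would apply the same idea to the rows of the `(batch, dim)` matrix returned by `encode`.

```python
import math
from typing import List, Sequence


def l2_normalize(vector: Sequence[float], eps: float = 1e-12) -> List[float]:
    """Scale a vector to unit L2 norm (hypothetical helper, not RagE API)."""
    norm = math.sqrt(sum(x * x for x in vector))
    scale = max(norm, eps)  # guard against division by zero
    return [x / scale for x in vector]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity computed as the dot product of unit vectors."""
    unit_a, unit_b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(unit_a, unit_b))


# Toy stand-ins for sentence embeddings.
first = [1.0, 2.0, 2.0]
second = [2.0, 4.0, 4.0]   # same direction as `first`, twice the length
third = [-2.0, 1.0, 0.0]   # orthogonal to `first`

same_direction = cosine_similarity(first, second)
orthogonal = cosine_similarity(first, third)

print(f"same direction: {same_direction:.3f}")
print(f"orthogonal:     {abs(orthogonal):.3f}")
```

Because both vectors are scaled to unit length first, the score depends only on direction: vectors pointing the same way score close to 1 and orthogonal vectors score close to 0, regardless of their original magnitudes.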