itdainb/PhoRanker

@@ -24,6 +24,29 @@ tags:
 	- `pip install pyvi`
 ## Usage with transformers
@@ -34,18 +57,10 @@ import torch
 model = AutoModelForSequenceClassification.from_pretrained('itdainb/vietnamese-cross-encoder')
 tokenizer = AutoTokenizer.from_pretrained('itdainb/vietnamese-cross-encoder')
-features = tokenizer(['How many people live in Berlin?', 'How many people live in Berlin?'],  padding=True, truncation=True, return_tensors="pt")
 model.eval()
 with torch.no_grad():
     scores = model(**features).logits
     print(scores)
-```
-## Usage with sentence-transformers
-```python
-from sentence_transformers import CrossEncoder
-model = CrossEncoder('itdainb/vietnamese-cross-encoder', max_length=256)
-scores = model.predict([('Query', 'Paragraph1'), ('Query', 'Paragraph2') , ('Query', 'Paragraph3')])
 ```

 	- `pip install pyvi`
+## Pre-processing
+```python
+from pyvi import ViTokenizer
+query = "UIT là gì?"
+sentences = [
+    "UIT là Trường Đại học Công nghệ Thông tin (ĐH CNTT), Đại học Quốc gia Thành phố Hồ Chí Minh (ĐHQG-HCM)",
+    "Mô hình rerank — còn được gọi là cross-encoder — là một loại mô hình mà, khi được cung cấp một cặp truy vấn và tài liệu, sẽ đưa ra một điểm tương đồng.",
+    "Việt Nam, quốc hiệu là Cộng hòa xã hội chủ nghĩa Việt Nam, là một quốc gia xã hội chủ nghĩa nằm ở cực Đông của bán đảo Đông Dương thuộc khu vực Đông Nam Á"
+]
+tokenized_query = ViTokenizer.tokenize(query)
+tokenized_sentences = [ViTokenizer.tokenize(sent) for sent in sentences]
+```
+## Usage with sentence-transformers
+```python
+from sentence_transformers import CrossEncoder
+model = CrossEncoder('itdainb/vietnamese-cross-encoder', max_length=256)
+scores = model.predict([(tokenized_query, sent) for sent in tokenized_sentences])
+```
 ## Usage with transformers
 model = AutoModelForSequenceClassification.from_pretrained('itdainb/vietnamese-cross-encoder')
 tokenizer = AutoTokenizer.from_pretrained('itdainb/vietnamese-cross-encoder')
+features = tokenizer([[tokenized_query, sent] for sent in tokenized_sentences],  padding=True, truncation=True, return_tensors="pt")
 model.eval()
 with torch.no_grad():
     scores = model(**features).logits
     print(scores)
 ```
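
Both usage snippets stop at printing raw scores, while a reranker is normally used to order candidate passages. A minimal sketch of that last step follows. The helper `rank_by_score` and the example values are hypothetical and not part of the model card. It assumes one score per (query, sentence) pair, in the same order as `sentences`.

```python
from typing import List, Sequence, Tuple


def rank_by_score(
    sentences: Sequence[str], scores: Sequence[float]
) -> List[Tuple[str, float]]:
    """Pair each sentence with its score and sort from most to least relevant.

    Hypothetical helper for illustration; not part of the model card.
    """
    if len(sentences) != len(scores):
        raise ValueError("sentences and scores must have the same length")
    # float() also accepts NumPy scalars, such as those returned by CrossEncoder.predict
    pairs = [(sent, float(score)) for sent, score in zip(sentences, scores)]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Made-up scores for illustration only; real values come from the model.
example_sentences = ["passage A", "passage B", "passage C"]
example_scores = [0.12, 0.97, 0.45]

for sentence, score in rank_by_score(example_sentences, example_scores):
    print(f"{score:.2f}\t{sentence}")
```

With the sentence-transformers snippet, the array returned by `model.predict` could be passed in directly. With the transformers snippet, the logits would likely need flattening first, for example `scores.squeeze(-1).tolist()`, assuming the model emits a single logit per pair. Passing the original `sentences` rather than `tokenized_sentences` keeps the displayed text free of the underscores added by word segmentation.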