qducnguyen committed on
Commit b54e8b2
1 Parent(s): e0f651c

Update README.md

Files changed (1)
  1. README.md +22 -3
README.md CHANGED
@@ -27,7 +27,7 @@ license: apache-2.0
  This is a [sentence-transformers](https://www.SBERT.net) model: It maps sentences & paragraphs to a 768 dimensional dense vector space and can be used for tasks like clustering or semantic search.
  We train the model on a merged training dataset that consists of:
  - MS MARCO (translated into Vietnamese)
- - Squadv2 (translated in Vietnamese)
+ - SQuAD v2 (translated into Vietnamese)
  - 80% of the training set from the Legal Text Retrieval Zalo 2021 challenge

  We use [phobert-base-v2](https://github.com/VinAIResearch/PhoBERT) as the pre-trained backbone.
@@ -57,7 +57,7 @@ Then you can use the model like this:
  ```python
  from sentence_transformers import SentenceTransformer

- # INPUT TEXT MUST BE ALREADY WORD-SEGMENTED!
+ # INPUT TEXT MUST BE ALREADY WORD-SEGMENTED!
  sentences = ["Cô ấy là một người vui_tính .", "Cô ấy cười nói suốt cả ngày ."]

  model = SentenceTransformer('bkai-foundation-models/vietnamese-bi-encoder')
@@ -65,6 +65,13 @@ embeddings = model.encode(sentences)
  print(embeddings)
  ```

+
+ ## Usage (HuggingFace Widget)
+ The widget uses a custom pipeline on top of the default pipeline, adding a word segmenter before PhobertTokenizer, so you do not need to segment words before using the API.
+
+ An example can be seen in the Hosted inference API.
+
+
  ## Usage (HuggingFace Transformers)

  Without [sentence-transformers](https://www.SBERT.net), you can use the model like this: First, you pass your input through the transformer model, then you have to apply the right pooling operation on top of the contextualized word embeddings.
@@ -148,4 +155,16 @@ SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: RobertaModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False})
  )
- ```
+ ```
+
+ ## Citing & Authors
+
+ ```bibtex
+ @inproceedings{phobert,
+ title = {{PhoBERT: Pre-trained language models for Vietnamese}},
+ author = {Dat Quoc Nguyen and Anh Tuan Nguyen},
+ booktitle = {Findings of the Association for Computational Linguistics: EMNLP 2020},
+ year = {2020},
+ pages = {1037--1042}
+ }
+ ```
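The comment kept in the second hunk ("INPUT TEXT MUST BE ALREADY WORD-SEGMENTED!") and the new widget section describe the same requirement: PhoBERT-based models expect multi-syllable Vietnamese words to be joined with underscores before tokenization, which the widget now does automatically and direct callers must do themselves. A minimal sketch of pre-segmenting raw text before encoding, assuming the third-party `pyvi` package as the segmenter (the commit itself does not name a specific segmentation tool):

```python
# Sketch only: pre-segmenting raw Vietnamese text before encoding.
# Assumes the third-party `pyvi` package (pip install pyvi) as the word
# segmenter; the model card does not mandate a specific tool.
from pyvi import ViTokenizer
from sentence_transformers import SentenceTransformer

raw_sentences = ["Cô ấy là một người vui tính.", "Cô ấy cười nói suốt cả ngày."]

# ViTokenizer.tokenize joins multi-syllable words with underscores,
# e.g. "vui tính" -> "vui_tính", matching the segmented examples in the diff.
segmented = [ViTokenizer.tokenize(s) for s in raw_sentences]

model = SentenceTransformer('bkai-foundation-models/vietnamese-bi-encoder')
embeddings = model.encode(segmented)
print(embeddings.shape)  # (2, 768): one 768-dimensional vector per sentence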
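```

The "Usage (HuggingFace Transformers)" section referenced in the third hunk passes input through the transformer and applies a pooling operation on top of the contextualized token embeddings; the Pooling config in the last hunk ('pooling_mode_mean_tokens': True) indicates that operation is mean pooling. The sketch below follows the standard sentence-transformers model-card template for that path; the actual block in README.md lies outside the hunks shown here, so treat this as an approximation:

```python
# Sketch of the plain-transformers path: mean pooling over token embeddings,
# masked by the attention mask, per the standard sentence-transformers
# model-card template. The exact README.md code is outside the diff hunks.
import torch
from transformers import AutoTokenizer, AutoModel

def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # first element: all token embeddings
    mask = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * mask, 1) / torch.clamp(mask.sum(1), min=1e-9)

# Input must already be word-segmented, as in the sentence-transformers example.
sentences = ["Cô ấy là một người vui_tính .", "Cô ấy cười nói suốt cả ngày ."]

tokenizer = AutoTokenizer.from_pretrained('bkai-foundation-models/vietnamese-bi-encoder')
model = AutoModel.from_pretrained('bkai-foundation-models/vietnamese-bi-encoder')

encoded = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
with torch.no_grad():
    model_output = model(**encoded)

sentence_embeddings = mean_pooling(model_output, encoded['attention_mask'])
print(sentence_embeddings.shape)  # torch.Size([2, 768])
```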