Teja-Gollapudi committed
Commit 249ba6c
1 Parent(s): dcdbfca

Update README.md

Files changed (1):
  README.md +1 -1
README.md CHANGED
@@ -24,7 +24,7 @@ license: "apache-2.0"
  #### Motivation
  Traditional BERT models struggle with VMware-specific words (Tanzu, vSphere, etc.), technical terms, and compound words. (<a href="https://medium.com/@rickbattle/weaknesses-of-wordpiece-tokenization-eb20e37fec99">Weaknesses of WordPiece Tokenization</a>)

- We have created our vBERT model to address the aforementioned issues. We have replaced the first 1k unused tokens of BERT's vocabulary with VMware-specific terms to create a modified vocabulary. We then pretrained the 'bert-base-uncased' model for an additional 78k steps (71k with MSL_128 and 7k with MSL_512; approximately 5 epochs) on VMware domain data.
+ We have pretrained our vBERT model to address the aforementioned issues using our [BERT Pretraining Library](https://medium.com/vmware-data-ml-blog/pretraining-a-custom-bert-model-6e37df97dfc4). We have replaced the first 1k unused tokens of BERT's vocabulary with VMware-specific terms to create a modified vocabulary. We then pretrained the 'bert-base-uncased' model for an additional 78k steps (71k with MSL_128 and 7k with MSL_512; approximately 5 epochs) on VMware domain data.

  #### Intended Use
  The model functions as a VMware-specific Language Model.
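
For readers who want to see what the vocabulary swap described in the diff looks like in practice, here is a minimal sketch using the Hugging Face `transformers` slow `BertTokenizer`, which reads `vocab.txt` directly (the fast tokenizer would load from `tokenizer.json` and ignore the edit). The domain term list and file paths are illustrative assumptions, not the actual terms or tooling used for vBERT.

```python
# Minimal sketch of the [unusedN] vocabulary-swap step (illustrative, not vBERT's exact code).
from transformers import BertTokenizer, BertForMaskedLM

domain_terms = ["tanzu", "vsphere", "vmotion", "vcenter"]  # hypothetical examples

# Save a local copy of the stock vocab, then overwrite [unusedN] slots in vocab.txt.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
tokenizer.save_pretrained("vbert-tokenizer")

vocab_path = "vbert-tokenizer/vocab.txt"
with open(vocab_path, encoding="utf-8") as f:
    tokens = f.read().splitlines()

unused_slots = [i for i, tok in enumerate(tokens) if tok.startswith("[unused")]
for slot, term in zip(unused_slots, domain_terms):
    tokens[slot] = term

with open(vocab_path, "w", encoding="utf-8") as f:
    f.write("\n".join(tokens) + "\n")

# Reload: each new term now maps onto a former [unusedN] id, so the
# bert-base-uncased checkpoint (embedding matrix included) is reused
# unchanged and can then be further pretrained on domain data.
tokenizer = BertTokenizer.from_pretrained("vbert-tokenizer")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("Deploying Tanzu on vSphere"))  # domain words stay whole
```

The two-phase step split in the card (71k steps at MSL 128, then 7k at MSL 512) mirrors the original BERT recipe of training mostly at a short maximum sequence length and finishing at 512 so the longer position embeddings are learned.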