---
language:
- "eng"
thumbnail: "URL to a thumbnail used in social sharing"
tags:
- "PyTorch"
- "tensorflow"
license: "apache-2.0"
---

# viniLM-2021-from-large

### Model Info:
<ul>
  <li>Authors: R&D AI Lab, VMware Inc.</li>
  <li>Model date: Jun 2022</li>
  <li>Model version: 2021-distilled-from-large</li>
  <li>Model type: Pretrained language model</li>
  <li>License: Apache 2.0</li>
</ul>

#### Motivation
Based on [MiniLMv2 distillation](https://arxiv.org/pdf/2012.15828.pdf), we have distilled vBERT-2021-large into a smaller, MiniLMv2-style model for faster inference without a significant loss of performance.

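For readers unfamiliar with the referenced paper, the snippet below is a rough, illustrative sketch of the MiniLMv2-style self-attention relation distillation objective. It is not the actual training code used for this model; the tensor shapes, the choice of teacher/student layers, and the number of relation heads are assumptions.

```python
import torch
import torch.nn.functional as F

def relation_logprobs(x: torch.Tensor, num_relation_heads: int) -> torch.Tensor:
    # x: queries, keys, or values from one transformer layer, [batch, seq, hidden].
    # Re-split the hidden dimension into relation heads and compute scaled
    # dot-product relations, returned as log-probabilities [batch, heads, seq, seq].
    b, s, h = x.shape
    d = h // num_relation_heads
    x = x.view(b, s, num_relation_heads, d).transpose(1, 2)
    return F.log_softmax(x @ x.transpose(-1, -2) / d ** 0.5, dim=-1)

def minilmv2_loss(teacher_qkv, student_qkv, num_relation_heads=12):
    # teacher_qkv / student_qkv: (queries, keys, values) taken from the chosen
    # teacher layer and from the student's last layer. The loss is the KL
    # divergence between teacher and student Q-Q, K-K, and V-V relation distributions.
    loss = 0.0
    for t, s in zip(teacher_qkv, student_qkv):
        t_rel = relation_logprobs(t, num_relation_heads)
        s_rel = relation_logprobs(s, num_relation_heads)
        loss = loss + F.kl_div(s_rel, t_rel, log_target=True, reduction="batchmean")
    return loss
```
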
#### Intended Use
The model functions as a VMware-specific language model.

#### How to Use
Here is how to use this model to get the features of a given text in PyTorch:

```python
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('VMware/vinilm-2021-from-large')
model = BertModel.from_pretrained('VMware/vinilm-2021-from-large')

text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
```

and in TensorFlow:

```python
from transformers import BertTokenizer, TFBertModel

tokenizer = BertTokenizer.from_pretrained('VMware/vinilm-2021-from-large')
model = TFBertModel.from_pretrained('VMware/vinilm-2021-from-large')

text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)
```

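Both snippets return the full sequence of hidden states. If a single fixed-size vector per text is needed (e.g., for similarity search), one common approach is to mean-pool the last hidden state over non-padding tokens. The following continues from the PyTorch example above and is an illustrative sketch, not an officially recommended recipe:

```python
# `output.last_hidden_state` has shape [batch, seq_len, hidden];
# the attention mask marks real (non-padding) tokens.
mask = encoded_input["attention_mask"].unsqueeze(-1).float()   # [batch, seq_len, 1]
summed = (output.last_hidden_state * mask).sum(dim=1)          # sum over real tokens
sentence_embedding = summed / mask.sum(dim=1).clamp(min=1e-9)  # [batch, hidden]
```
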
### Training

#### - Datasets
Publicly available VMware text data, such as VMware Docs, blogs, etc., was used to distill the teacher vBERT-2021-large model into the vinilm-2021-from-large model. The data was sourced in May 2021 (~320,000 documents).

#### - Preprocessing
<ul>
  <li>Decoding HTML</li>
  <li>Decoding Unicode</li>
  <li>Stripping repeated characters</li>
  <li>Splitting compound words</li>
  <li>Spelling correction</li>
</ul>

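Since the internal vNLP Preprocessor is not publicly available, the snippet below is only a rough, illustrative approximation of the listed cleanup steps using the Python standard library; it is not the preprocessing pipeline actually used for this model.

```python
import html
import re
import unicodedata

def rough_clean(text: str) -> str:
    # Illustrative approximation only: compound-word splitting and spelling
    # correction are omitted because they rely on domain-specific resources.
    text = html.unescape(text)                    # decode HTML entities (&amp; -> &)
    text = unicodedata.normalize("NFKC", text)    # normalize Unicode forms
    text = re.sub(r"(.)\1{3,}", r"\1\1\1", text)  # strip long runs of repeated characters
    text = re.sub(r"\s+", " ", text).strip()      # collapse whitespace
    return text

print(rough_clean("VMware&nbsp;vSphere!!!!!!!"))  # -> "VMware vSphere!!!"
```
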
#### - Model performance measures
We benchmarked vBERT on various VMware-specific NLP downstream tasks (IR, classification, etc.).
The model scored higher than the 'bert-base-uncased' model on all benchmarks.

### Limitations and bias
Since the model is distilled from a vBERT model, which is itself based on BERT, it may carry the same biases embedded in the original BERT model.

The data needs to be preprocessed using our internal vNLP Preprocessor (not available to the public) to maximize the model's performance.