LysandreJik committed on
Commit
3902973
1 Parent(s): d1c2c6c

README & Tokenizer

Files changed (4)
  1. README.md +146 -0
  2. tokenizer.json +0 -0
  3. tokenizer_config.json +3 -0
  4. vocab.txt +0 -0
README.md ADDED
<!---
# ##############################################################################################
#
# Copyright (c) 2021-, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# ##############################################################################################
-->

# How to run Megatron BERT using Transformers

## Prerequisites

In this guide, we run all the commands from a folder called `$MYDIR`, defined as follows (in `bash`):

```
export MYDIR=$HOME
```

Feel free to change the location at your convenience.

To run some of the commands below, you'll have to clone `Transformers`.

```
git clone https://github.com/huggingface/transformers.git $MYDIR/transformers
```

## Get the checkpoint from the NVIDIA GPU Cloud

You must create a directory called `nvidia/megatron-bert-cased-345m`.

```
mkdir -p $MYDIR/nvidia/megatron-bert-cased-345m
```

You can download the checkpoint from the NVIDIA GPU Cloud (NGC). For that, you have to [sign up](https://ngc.nvidia.com/signup) for and set up the NVIDIA GPU Cloud (NGC) Registry CLI. Further documentation for downloading models can be found in the [NGC documentation](https://docs.nvidia.com/dgx/ngc-registry-cli-user-guide/index.html#topic_6_4_1).

Alternatively, you can directly download the checkpoint using:

### BERT 345M cased

```
wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/megatron_bert_345m/versions/v0.1_cased/zip -O $MYDIR/nvidia/megatron-bert-cased-345m/checkpoint.zip
```

## Converting the checkpoint

In order to be loaded into `Transformers`, the checkpoint has to be converted. Run the following command for that purpose; it will create `config.json` and `pytorch_model.bin` in `$MYDIR/nvidia/megatron-bert-cased-345m`. You can move those files to different directories if needed.

### BERT 345M cased

```
python3 $MYDIR/transformers/src/transformers/models/megatron_bert/convert_megatron_bert_checkpoint.py $MYDIR/nvidia/megatron-bert-cased-345m/checkpoint.zip
```
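
As a quick sanity check (not part of the original instructions), you can confirm the conversion worked by loading the generated config with the Transformers API; this assumes `MYDIR` is exported as above:

```
import os

from transformers import AutoConfig

# Path to the converted checkpoint (see the conversion step above).
directory = os.path.join(os.environ['MYDIR'], 'nvidia/megatron-bert-cased-345m')

# Loading the config verifies that config.json was written by the conversion script.
config = AutoConfig.from_pretrained(directory)
print(config)
```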

## Masked LM

The following code shows how to use the Megatron BERT checkpoint and the Transformers API to perform a `Masked LM` task.

```
import os
import torch

from transformers import BertTokenizer, MegatronBertForMaskedLM

# The tokenizer. Megatron was trained with standard tokenizer(s).
tokenizer = BertTokenizer.from_pretrained('nvidia/megatron-bert-cased-345m')
# The path to the config/checkpoint (see the conversion step above).
directory = os.path.join(os.environ['MYDIR'], 'nvidia/megatron-bert-cased-345m')
# Load the model from $MYDIR/nvidia/megatron-bert-cased-345m.
model = MegatronBertForMaskedLM.from_pretrained(directory)

# Copy to the device and use FP16.
assert torch.cuda.is_available()
device = torch.device("cuda")
model.to(device)
model.eval()
model.half()

# Create inputs (from the BERT example page).
input = tokenizer("The capital of France is [MASK]", return_tensors="pt").to(device)
label = tokenizer("The capital of France is Paris", return_tensors="pt")["input_ids"].to(device)

# Run the model.
with torch.no_grad():
    output = model(**input, labels=label)
    print(output)
```
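
The `output` above contains raw logits. As a possible continuation of the snippet (it reuses `tokenizer`, `input`, and `output` defined there), you can decode the most likely token at the `[MASK]` position; for this prompt one would expect a completion such as `Paris`:

```
# Continuation of the Masked LM snippet above; reuses `tokenizer`, `input` and `output`.
# Find the position(s) of the [MASK] token in the input.
mask_positions = (input["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
# Pick the highest-scoring vocabulary entry at each masked position and decode it.
predicted_ids = output.logits[0, mask_positions].argmax(dim=-1)
print(tokenizer.decode(predicted_ids.tolist()))
```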

## Next sentence prediction

The following code shows how to use the Megatron BERT checkpoint and the Transformers API to perform next sentence prediction.

```
import os
import torch

from transformers import BertTokenizer, MegatronBertForNextSentencePrediction

# The tokenizer. Megatron was trained with standard tokenizer(s).
tokenizer = BertTokenizer.from_pretrained('nvidia/megatron-bert-cased-345m')
# The path to the config/checkpoint (see the conversion step above).
directory = os.path.join(os.environ['MYDIR'], 'nvidia/megatron-bert-cased-345m')
# Load the model from $MYDIR/nvidia/megatron-bert-cased-345m.
model = MegatronBertForNextSentencePrediction.from_pretrained(directory)

# Copy to the device and use FP16.
assert torch.cuda.is_available()
device = torch.device("cuda")
model.to(device)
model.eval()
model.half()

# Create inputs (from the BERT example page).
input = tokenizer('In Italy, pizza served in formal settings is presented unsliced.',
                  'The sky is blue due to the shorter wavelength of blue light.',
                  return_tensors='pt').to(device)
label = torch.LongTensor([1]).to(device)

# Run the model.
with torch.no_grad():
    output = model(**input, labels=label)
    print(output)
```
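
As a possible follow-up to the snippet above (it reuses `torch` and `output` from there), the two next-sentence logits can be turned into probabilities with a softmax. In the Transformers convention, index 0 means the second sentence follows the first and index 1 means it is a random sentence, which is why the unrelated pair above uses the label `1`:

```
# Continuation of the next sentence prediction snippet above; reuses `output`.
probabilities = torch.nn.functional.softmax(output.logits.float(), dim=-1)
# probabilities[0, 0]: the second sentence follows the first.
# probabilities[0, 1]: the second sentence is a random sentence.
print(probabilities)
```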

## Original code

The original code for Megatron can be found here: [https://github.com/NVIDIA/Megatron-LM](https://github.com/NVIDIA/Megatron-LM).
tokenizer.json ADDED
The diff for this file is too large to render.
tokenizer_config.json ADDED
{
  "do_lower_case": false
}
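
The `do_lower_case: false` setting matches this cased checkpoint. As a small illustrative check (not part of the original files), a tokenizer loaded from this repository should preserve the casing of its input:

```
from transformers import BertTokenizer

# Uses vocab.txt, tokenizer.json and tokenizer_config.json from this repository.
tokenizer = BertTokenizer.from_pretrained('nvidia/megatron-bert-cased-345m')
# With do_lower_case set to false, "France" keeps its capital letter.
print(tokenizer.tokenize("The capital of France"))
```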
vocab.txt ADDED
The diff for this file is too large to render.