Dimitre committed on
Commit
28fa5e4
1 Parent(s): 4e9b2c5

Adding model card

Files changed (1): README.md +99 -0
---
license: apache-2.0
library_name: tfhub
language: en
tags:
- text
- tokenizer
- preprocessor
- bert
- tensorflow
datasets:
- bookcorpus
- wikipedia
---

## Model name: bert_en_cased_preprocess
## Description adapted from [TFHub](https://tfhub.dev/tensorflow/bert_en_cased_preprocess/3)

# Overview

This SavedModel is a companion of [BERT models](https://tfhub.dev/google/collections/bert/1) to preprocess plain text inputs into the input format expected by BERT. **Check the model documentation** to find the correct preprocessing model for each particular BERT or other Transformer encoder model.

BERT and its preprocessing were originally published by

- Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova: ["BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding"](https://arxiv.org/abs/1810.04805), 2018.

This model uses a vocabulary for English extracted from Wikipedia and BooksCorpus (the same as in the models by the original BERT authors). Text inputs are normalized the "cased" way, meaning that the distinction between lower and upper case, as well as accent markers, is preserved.

This model has no trainable parameters and can be used in an input pipeline outside the training loop.

# Prerequisites

This SavedModel uses TensorFlow operations defined by the [TensorFlow Text](https://github.com/tensorflow/text) library. On [Google Colaboratory](https://colab.research.google.com/), it can be installed with

```
!pip install tensorflow_text
import tensorflow_text as text  # Registers the ops.
```

# Usage

This SavedModel implements the preprocessor API for [text embeddings with Transformer encoders](https://www.tensorflow.org/hub/common_saved_model_apis/text#transformer-encoders), which offers several ways to go from one or more batches of text segments (plain text encoded as UTF-8) to the inputs for the Transformer encoder model.

## Basic usage for single segments

Inputs with a single text segment can be mapped to encoder inputs like this:

### Using TF Hub and HF Hub

```
import tensorflow as tf
from huggingface_hub import snapshot_download
from tensorflow_hub import KerasLayer

model_path = snapshot_download(repo_id="Dimitre/bert_en_cased_preprocess")
preprocessor = KerasLayer(handle=model_path)
text_input = tf.keras.layers.Input(shape=(), dtype=tf.string)
encoder_inputs = preprocessor(text_input)
```

### Using [TF Hub fork](https://github.com/dimitreOliveira/hub)

```
import tensorflow as tf

# pull_from_hub is provided by the TF Hub fork linked above.
preprocessor = pull_from_hub(repo_id="Dimitre/bert_en_cased_preprocess")
text_input = tf.keras.layers.Input(shape=(), dtype=tf.string)
encoder_inputs = preprocessor(text_input)
```

The resulting encoder inputs have `seq_length=128`.
60
+
61
+ ## General usage
62
+ For pairs of input segments, to control the `seq_length`, or to modify tokenized sequences before packing them into encoder inputs, the preprocessor can be called like this:
63
+ ```
64
+ preprocessor = pull_from_hub(repo_id="Dimitre/bert_en_cased_preprocess")
65
+
66
+ # Step 1: tokenize batches of text inputs.
67
+ text_inputs = [tf.keras.layers.Input(shape=(), dtype=tf.string),
68
+ ...] # This SavedModel accepts up to 2 text inputs.
69
+ tokenize = hub.KerasLayer(preprocessor.tokenize)
70
+ tokenized_inputs = [tokenize(segment) for segment in text_inputs]
71
+
72
+ # Step 2 (optional): modify tokenized inputs.
73
+ pass
74
+
75
+ # Step 3: pack input sequences for the Transformer encoder.
76
+ seq_length = 128 # Your choice here.
77
+ bert_pack_inputs = hub.KerasLayer(
78
+ preprocessor.bert_pack_inputs,
79
+ arguments=dict(seq_length=seq_length)) # Optional argument.
80
+ encoder_inputs = bert_pack_inputs(tokenized_inputs)
81
+ ```

The call to `tokenize()` returns an int32 [RaggedTensor](https://www.tensorflow.org/guide/ragged_tensor) of shape `[batch_size, (words), (tokens_per_word)]`. Correspondingly, the call to `bert_pack_inputs()` accepts a RaggedTensor of shape `[batch_size, ...]` with rank 2 or 3.
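
The rank-2 vs. rank-3 distinction can be illustrated with plain Python lists standing in for the ragged dimensions (the token ids below are made up; on a real `RaggedTensor`, `merge_dims(1, 2)` performs the same flattening):

```python
# Toy stand-in for a rank-3 ragged shape [batch_size, (words), (tokens_per_word)]:
# 2 examples, each a list of words, each word a list of wordpiece ids (made up).
tokenized = [
    [[2023], [2003, 2017], [999]],  # example 0: 3 words, 4 tokens
    [[7592], [2088]],               # example 1: 2 words, 2 tokens
]

# Merging the two inner axes gives the rank-2 shape [batch_size, (tokens)]
# that bert_pack_inputs() equally accepts.
flat = [[tok for word in example for tok in word] for example in tokenized]
print(flat)  # [[2023, 2003, 2017, 999], [7592, 2088]]
```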

# Output details

The result of preprocessing is a batch of fixed-length input sequences for the Transformer encoder.

An input sequence starts with one start-of-sequence token, followed by the tokenized segments, each terminated by one end-of-segment token. Remaining positions up to `seq_length`, if any, are filled up with padding tokens. If an input sequence would exceed `seq_length`, the tokenized segments in it are truncated to prefixes of approximately equal sizes to fit exactly.
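
The truncation to approximately equal prefixes can be sketched in plain Python; this is an illustration of the strategy just described, not the exact TensorFlow Text implementation:

```python
def truncate_to_fit(segments, seq_length):
    """Trim tokenized segments to prefixes of approximately equal sizes so
    that all tokens plus one start-of-sequence token and one end-of-segment
    token per segment fit into seq_length."""
    budget = seq_length - 1 - len(segments)  # positions left for real tokens
    segments = [list(s) for s in segments]
    while sum(len(s) for s in segments) > budget:
        max(segments, key=len).pop()  # drop last token of the longest segment
    return segments

# Two made-up tokenized segments of 10 and 8 tokens, packed into seq_length=12.
print(truncate_to_fit([list(range(10)), list(range(8))], seq_length=12))
# → [[0, 1, 2, 3], [0, 1, 2, 3, 4]]
```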

The `encoder_inputs` are a dict of three int32 Tensors, all with shape `[batch_size, seq_length]`, whose elements represent the batch of input sequences as follows:

- `"input_word_ids"`: has the token ids of the input sequences.
- `"input_mask"`: has value 1 at the position of all input tokens present before padding and value 0 for the padding tokens.
- `"input_type_ids"`: has the index of the input segment that gave rise to the input token at the respective position. The first input segment (index 0) includes the start-of-sequence token and its end-of-segment token. The second segment (index 1, if present) includes its end-of-segment token. Padding tokens get index 0 again.
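
As a concrete illustration, here is a plain-Python walk-through of packing two short tokenized segments into these three sequences. The segment token ids are made up, and the ids 101, 102 and 0 for the start-of-sequence, end-of-segment and padding tokens follow BERT's conventional English vocabulary; real code should read them from the tokenizer's special-tokens dict instead of hard-coding them:

```python
CLS, SEP, PAD = 101, 102, 0  # conventional BERT English vocab ids (assumption)
seq_length = 12

seg_a = [7592, 2088]        # made-up token ids for the first segment
seg_b = [2129, 2024, 2017]  # made-up token ids for the second segment

# [CLS] seg_a [SEP] seg_b [SEP], then pad up to seq_length.
input_word_ids = [CLS] + seg_a + [SEP] + seg_b + [SEP]
input_mask = [1] * len(input_word_ids)
# Segment 0 covers [CLS], seg_a and its [SEP]; segment 1 covers seg_b and its [SEP].
input_type_ids = [0] * (len(seg_a) + 2) + [1] * (len(seg_b) + 1)

padding = seq_length - len(input_word_ids)
input_word_ids += [PAD] * padding
input_mask += [0] * padding
input_type_ids += [0] * padding  # padding tokens get index 0 again

print(input_word_ids)  # [101, 7592, 2088, 102, 2129, 2024, 2017, 102, 0, 0, 0, 0]
print(input_mask)      # [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
print(input_type_ids)  # [0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0]
```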

## Custom input packing and MLM support

The function

```
special_tokens_dict = preprocessor.tokenize.get_special_tokens_dict()
```

returns a dict of scalar int32 Tensors that report the tokenizer's `"vocab_size"` as well as the ids of certain special tokens: `"padding_id"`, `"start_of_sequence_id"` (a.k.a. `[CLS]`), `"end_of_segment_id"` (a.k.a. `[SEP]`) and `"mask_id"`. This allows users to replace `preprocessor.bert_pack_inputs()` with Python code such as `text.combine_segments()`, possibly `text.masked_language_model()`, and `text.pad_model_inputs()` from the [TensorFlow Text](https://github.com/tensorflow/text) library.
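
For example, a custom MLM pipeline could use `mask_id` from that dict to corrupt a packed sequence. The sketch below uses plain Python with illustrative special-token values only; real code reads them from `get_special_tokens_dict()`, and `text.masked_language_model()` additionally handles position sampling on real tensors:

```python
# Illustrative values only; obtain the real ones from get_special_tokens_dict().
special_tokens_dict = {"padding_id": 0, "start_of_sequence_id": 101,
                       "end_of_segment_id": 102, "mask_id": 103,
                       "vocab_size": 28996}

def mask_positions(input_word_ids, positions, mask_id):
    """Replace the tokens at the given positions with the mask token,
    returning the corrupted sequence and the original ids as MLM labels."""
    corrupted = list(input_word_ids)
    labels = [corrupted[p] for p in positions]
    for p in positions:
        corrupted[p] = mask_id
    return corrupted, labels

seq = [101, 7592, 2088, 102, 0, 0]  # a short packed sequence (made-up ids)
corrupted, labels = mask_positions(seq, [1, 2], special_tokens_dict["mask_id"])
print(corrupted)  # [101, 103, 103, 102, 0, 0]
print(labels)     # [7592, 2088]
```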