Description adapted from TFHub
This SavedModel is a companion of BERT models to preprocess plain text inputs into the input format expected by BERT. Check the model documentation to find the correct preprocessing model for each particular BERT or other Transformer encoder model.
BERT and its preprocessing were originally published by
- Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova: "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding", 2018.
This model uses a vocabulary for English extracted from Wikipedia and BooksCorpus (the same as in the models by the original BERT authors). Text inputs are normalized the "cased" way, meaning that the distinction between lower and upper case, as well as accent markers, is preserved.
This model has no trainable parameters and can be used in an input pipeline outside the training loop.
    !pip install tensorflow_text

    import tensorflow_text as text  # Registers the ops.
This SavedModel implements the preprocessor API for text embeddings with Transformer encoders, which offers several ways to go from one or more batches of text segments (plain text encoded as UTF-8) to the inputs for the Transformer encoder model.
Inputs with a single text segment can be mapped to encoder inputs like this:
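    import tensorflow as tf
    from huggingface_hub import snapshot_download
    from tensorflow_hub import KerasLayer

    # Download the SavedModel from the Hugging Face Hub and wrap it as a Keras layer.
    model_path = snapshot_download(repo_id="Dimitre/bert_en_cased_preprocess")
    preprocessor = KerasLayer(handle=model_path)

    text_input = tf.keras.layers.Input(shape=(), dtype=tf.string)
    encoder_inputs = preprocessor(text_input)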
Using TF Hub fork
    preprocessor = pull_from_hub(repo_id="Dimitre/bert_en_cased_preprocess")

    text_input = tf.keras.layers.Input(shape=(), dtype=tf.string)
    encoder_inputs = preprocessor(text_input)
The resulting encoder inputs have seq_length=128.
For pairs of input segments, to control the seq_length, or to modify tokenized sequences before packing them into encoder inputs, the preprocessor can be called like this:
    import tensorflow as tf
    import tensorflow_hub as hub

    preprocessor = pull_from_hub(repo_id="Dimitre/bert_en_cased_preprocess")

    # Step 1: tokenize batches of text inputs.
    text_inputs = [tf.keras.layers.Input(shape=(), dtype=tf.string),
                   ...]  # This SavedModel accepts up to 2 text inputs.
    tokenize = hub.KerasLayer(preprocessor.tokenize)
    tokenized_inputs = [tokenize(segment) for segment in text_inputs]

    # Step 2 (optional): modify tokenized inputs.
    pass

    # Step 3: pack input sequences for the Transformer encoder.
    seq_length = 128  # Your choice here.
    bert_pack_inputs = hub.KerasLayer(
        preprocessor.bert_pack_inputs,
        arguments=dict(seq_length=seq_length))  # Optional argument.
    encoder_inputs = bert_pack_inputs(tokenized_inputs)
The call to tokenize() returns an int32 RaggedTensor of shape [batch_size, (words), (tokens_per_word)]. Correspondingly, the call to bert_pack_inputs() accepts a RaggedTensor of shape [batch_size, ...] with rank 2 or 3.
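As an illustration of these shapes, the sketch below uses nested Python lists as a stand-in for a RaggedTensor. The token ids are made up, and flatten_words is a hypothetical helper rather than part of this SavedModel; it only shows how a rank-3 [batch_size, (words), (tokens_per_word)] structure relates to a rank-2 [batch_size, (tokens)] structure.

```python
from typing import List


def flatten_words(batch: List[List[List[int]]]) -> List[List[int]]:
    """Merge the (words) and (tokens_per_word) dimensions of each example."""
    return [[token for word in example for token in word] for example in batch]


# Hypothetical token ids: two examples with a ragged number of words,
# and a ragged number of wordpiece tokens per word.
tokenized = [
    [[7], [8, 9]],  # example 0: two words, the second split into two tokens
    [[5]],          # example 1: one word, one token
]

flat = flatten_words(tokenized)
print(flat)
```

Both forms carry the same token ids; the rank-3 form additionally keeps track of which tokens belong to the same word.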
The result of preprocessing is a batch of fixed-length input sequences for the Transformer encoder.
An input sequence starts with one start-of-sequence token, followed by the tokenized segments, each terminated by one end-of-segment token. Remaining positions up to seq_length, if any, are filled up with padding tokens. If an input sequence would exceed seq_length, the tokenized segments in it are truncated to prefixes of approximately equal sizes to fit exactly.
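The packing rule can be sketched in plain Python for a single example. This is a minimal illustration under stated assumptions, not the SavedModel's implementation: the ids CLS_ID, SEP_ID and PAD_ID are hypothetical placeholders, pack_inputs and truncate_segments are made-up helper names, and the truncation strategy (repeatedly shortening the longest segment) is just one simple way to obtain prefixes of approximately equal sizes. The function returns the packed token ids together with a padding mask and the segment index of each position.

```python
from typing import Dict, List

# Hypothetical special-token ids, for illustration only.
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0


def truncate_segments(segments: List[List[int]], budget: int) -> List[List[int]]:
    """Trim segments to prefixes whose total length is at most `budget`."""
    lengths = [len(segment) for segment in segments]
    while sum(lengths) > budget:
        # Shorten the currently longest segment by one token.
        longest = max(range(len(lengths)), key=lambda i: lengths[i])
        lengths[longest] -= 1
    return [segment[:n] for segment, n in zip(segments, lengths)]


def pack_inputs(segments: List[List[int]], seq_length: int) -> Dict[str, List[int]]:
    """Pack tokenized segments of one example into fixed-length inputs."""
    num_special = 1 + len(segments)  # one start token plus one end token per segment
    if seq_length < num_special:
        raise ValueError("seq_length is too small to hold the special tokens")
    segments = truncate_segments(segments, seq_length - num_special)

    word_ids = [CLS_ID]
    type_ids = [0]  # the start-of-sequence token belongs to segment 0
    for index, segment in enumerate(segments):
        word_ids += segment + [SEP_ID]
        type_ids += [index] * (len(segment) + 1)

    num_real = len(word_ids)
    padding = seq_length - num_real
    return {
        "input_word_ids": word_ids + [PAD_ID] * padding,
        "input_mask": [1] * num_real + [0] * padding,
        "input_type_ids": type_ids + [0] * padding,
    }


example = pack_inputs([[11, 12, 13], [21, 22]], seq_length=10)
for name, values in example.items():
    print(name, values)
```

With seq_length=10 the two segments fit without truncation, so the last two positions are padding; a smaller seq_length would trigger the trimming loop instead.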
The encoder_inputs are a dict of three int32 Tensors, all with shape [batch_size, seq_length], whose elements represent the batch of input sequences as follows:
- "input_word_ids": has the token ids of the input sequences.
- "input_mask": has value 1 at the position of all input tokens present before padding and value 0 for the padding tokens.
- "input_type_ids": has the index of the input segment that gave rise to the input token at the respective position. The first input segment (index 0) includes the start-of-sequence token and its end-of-segment token. The second segment (index 1, if present) includes its end-of-segment token. Padding tokens get index 0 again.
The call

    special_tokens_dict = preprocessor.tokenize.get_special_tokens_dict()

returns a dict of scalar int32 Tensors that report the tokenizer's "vocab_size" as well as the ids of certain special tokens: "start_of_sequence_id" (a.k.a. [CLS]), "end_of_segment_id" (a.k.a. [SEP]) and "mask_id". This allows users to replace preprocessor.bert_pack_inputs() with Python code such as text.pad_model_inputs() from the TensorFlow Text library.