# Overview

BERT (Bidirectional Encoder Representations from Transformers) provides dense vector representations for natural language by using a deep, pre-trained neural network with the Transformer architecture. It was originally published by

Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova: "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding", 2018.

This model uses the implementation of BERT from the TensorFlow Models repository on GitHub at tensorflow/models/official/legacy/bert. It uses L=12 hidden layers (i.e., Transformer blocks), a hidden size of H=768, and A=12 attention heads. For other model sizes, see the BERT collection.

The weights of this model are those released by the original BERT authors. This model has been pre-trained for English on Wikipedia and BooksCorpus. Text inputs have been normalized the "cased" way, meaning that the distinction between lower and upper case, as well as accent markers, has been preserved. For training, random input masking has been applied independently to word pieces (as in the original BERT paper).

All parameters in the module are trainable, and fine-tuning all parameters is the recommended practice.

# Usage

This SavedModel implements the encoder API for text embeddings with transformer encoders. It expects a dict with three int32 Tensors as input: input_word_ids, input_mask, and input_type_ids.

The separate preprocessor SavedModel at https://huggingface.co/Dimitre/bert_en_cased_preprocess transforms plain text inputs into this format, which its documentation describes in greater detail.
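To make the input format concrete, here is a minimal pure-Python sketch of the three inputs for one single-segment example. It is an illustration only, not the preprocessor's implementation: the helper name `pack_single_segment`, the special-token ids (101, 102, 0) and the word-piece ids are assumptions, and the layout (start marker, tokens, end marker, then padding) follows the usual BERT convention. In practice the preprocessor SavedModel produces these tensors.

```python
# Assumed special-token ids, for illustration only; the real values come
# from the vocabulary bundled with the preprocessor.
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0


def pack_single_segment(token_ids, seq_length):
    """Build the three encoder inputs for one single-segment example."""
    # Reserve two slots for the start and end markers, truncating if needed.
    body = list(token_ids[: seq_length - 2])
    word_ids = [CLS_ID] + body + [SEP_ID]
    num_real = len(word_ids)
    padding = seq_length - num_real
    return {
        # Token ids, padded on the right up to seq_length.
        "input_word_ids": word_ids + [PAD_ID] * padding,
        # 1 for real tokens, 0 for padding.
        "input_mask": [1] * num_real + [0] * padding,
        # All zeros, because there is only one segment.
        "input_type_ids": [0] * seq_length,
    }


# Hypothetical word-piece ids for a two-token input.
example = pack_single_segment([7592, 2088], seq_length=8)
for name, values in example.items():
    print(name, values)
```

Each value has length seq_length, which is why the encoder can consume a batch of them as fixed-shape int32 Tensors.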

## Basic usage

The simplest way to use this model in the Keras functional API is one of the following two options.

### Using TF Hub and HF Hub

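```python
import tensorflow as tf
import tensorflow_text  # noqa: F401  Registers the ops used by the preprocessor.
from huggingface_hub import snapshot_download
from tensorflow_hub import KerasLayer

# Download the preprocessor and turn raw strings into encoder inputs.
preprocessor_path = snapshot_download(repo_id="Dimitre/bert_en_cased_preprocess")
preprocessor = KerasLayer(handle=preprocessor_path)
text_input = tf.keras.layers.Input(shape=(), dtype=tf.string)
encoder_inputs = preprocessor(text_input)

# Download the encoder and load it as a trainable layer.
model_path = snapshot_download(repo_id="Dimitre/bert_en_cased_L-12_H-768_A-12")
encoder = KerasLayer(handle=model_path, trainable=True)
outputs = encoder(encoder_inputs)

pooled_output = outputs["pooled_output"]      # [batch_size, 768].
sequence_output = outputs["sequence_output"]  # [batch_size, seq_length, 768].
```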


### Using TF Hub fork

```python
import tensorflow as tf

# pull_from_hub is provided by the TF Hub fork; import it as that fork documents.
preprocessor = pull_from_hub(repo_id="Dimitre/bert_en_cased_preprocess")
text_input = tf.keras.layers.Input(shape=(), dtype=tf.string)
encoder_inputs = preprocessor(text_input)

encoder = pull_from_hub(repo_id="Dimitre/bert_en_cased_L-12_H-768_A-12", trainable=True)
outputs = encoder(encoder_inputs)

pooled_output = outputs["pooled_output"]      # [batch_size, 768].
sequence_output = outputs["sequence_output"]  # [batch_size, seq_length, 768].
```


The encoder's outputs are the pooled_output, which represents each input sequence as a whole, and the sequence_output, which represents each input token in context. Either of those can be used as input to further model building.

To print pooled_output for inspection, the following code can be used:

```python
embedding_model = tf.keras.Model(text_input, pooled_output)
sentences = tf.constant(["(your text here)"])  # Placeholder input batch.
print(embedding_model(sentences))
```


The preprocessor documentation explains how to input segment pairs and how to control seq_length.
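As a rough picture of what a packed segment pair looks like, the sketch below lays out two segments in the conventional BERT way: start marker, first segment, separator, second segment, separator, then padding, with input_type_ids set to 0 for the first segment and 1 for the second. The helper name `pack_segment_pair`, the special-token ids and the token ids are assumptions for illustration. The sketch raises an error instead of truncating, whereas the actual preprocessor handles truncation itself as its documentation describes.

```python
# Assumed special-token ids, for illustration only.
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0


def pack_segment_pair(first_ids, second_ids, seq_length):
    """Lay out two token-id lists as one padded encoder input."""
    first = [CLS_ID] + list(first_ids) + [SEP_ID]
    second = list(second_ids) + [SEP_ID]
    num_real = len(first) + len(second)
    if num_real > seq_length:
        # Truncation is deliberately left out of this sketch.
        raise ValueError("segments do not fit into seq_length")
    padding = seq_length - num_real
    return {
        "input_word_ids": first + second + [PAD_ID] * padding,
        "input_mask": [1] * num_real + [0] * padding,
        # 0 marks the first segment (and padding), 1 marks the second.
        "input_type_ids": [0] * len(first) + [1] * len(second) + [0] * padding,
    }


# Hypothetical token ids for a two-token and a one-token segment.
pair = pack_segment_pair([11, 12], [21], seq_length=8)
for name, values in pair.items():
    print(name, values)
```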

The intermediate activations of all L=12 Transformer blocks (hidden layers) are returned as a Python list: outputs["encoder_outputs"][i] is a Tensor of shape [batch_size, seq_length, 768] with the outputs of the i-th Transformer block, for 0 <= i < L. The last value of the list is equal to sequence_output.

The preprocessor can be run from inside a callable passed to tf.data.Dataset.map() while this encoder stays a part of a larger model that gets trained on that dataset. The Keras input objects for running on preprocessed inputs are

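```python
import tensorflow as tf

seq_length = 128  # Example value; must match the preprocessor's output length.
encoder_inputs = dict(
    input_word_ids=tf.keras.layers.Input(shape=(seq_length,), dtype=tf.int32),
    input_mask=tf.keras.layers.Input(shape=(seq_length,), dtype=tf.int32),
    input_type_ids=tf.keras.layers.Input(shape=(seq_length,), dtype=tf.int32),
)
```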


This SavedModel provides a trainable .mlm subobject with predictions for the Masked Language Model task it was originally trained with. This allows advanced users to continue MLM training for fine-tuning to a downstream task. It extends the encoder interface above with a zero-padded tensor of positions in the input sequence for which the input_word_ids have been randomly masked or altered. (See the preprocessor model page for how to get the id of the mask token and more.)

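```python
import tensorflow as tf
import tensorflow_hub as hub

# seq_length and num_predict must be set to match the preprocessed data.
mlm_inputs = dict(
    input_word_ids=tf.keras.layers.Input(shape=(seq_length,), dtype=tf.int32),
    input_mask=tf.keras.layers.Input(shape=(seq_length,), dtype=tf.int32),
    input_type_ids=tf.keras.layers.Input(shape=(seq_length,), dtype=tf.int32),
    masked_lm_positions=tf.keras.layers.Input(shape=(num_predict,), dtype=tf.int32),
)

# pull_from_hub is provided by the TF Hub fork; import it as that fork documents.
encoder = pull_from_hub(repo_id="Dimitre/bert_en_cased_L-12_H-768_A-12")
mlm = hub.KerasLayer(encoder.mlm, trainable=True)
mlm_outputs = mlm(mlm_inputs)
mlm_logits = mlm_outputs["mlm_logits"]  # [batch_size, num_predict, vocab_size]
# ...plus pooled_output, sequence_output and encoder_outputs as above.
```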
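The zero-padded tensor of masked positions can be prepared in many ways. The following pure-Python sketch shows one simplified possibility for a single example and is not the procedure used to pre-train this model: it always substitutes the mask token, whereas the original BERT recipe sometimes keeps or randomly replaces the selected token. The helper name `mask_for_mlm`, the mask id 103, the special-token ids and the 15% masking rate are assumptions; the real mask id should be taken from the preprocessor.

```python
import random

# Assumed ids, for illustration only; look up the real ones via the preprocessor.
MASK_ID = 103
SPECIAL_IDS = {0, 101, 102}  # Padding, start and separator markers (assumed).


def mask_for_mlm(input_word_ids, input_mask, num_predict, mask_prob=0.15, seed=0):
    """Mask some tokens and return their zero-padded positions."""
    rng = random.Random(seed)
    # Only real, non-special tokens are eligible for masking.
    candidates = [
        i
        for i, (token, keep) in enumerate(zip(input_word_ids, input_mask))
        if keep == 1 and token not in SPECIAL_IDS
    ]
    wanted = max(1, round(len(candidates) * mask_prob))
    num_to_mask = min(wanted, num_predict, len(candidates))
    positions = sorted(rng.sample(candidates, num_to_mask))

    # Work on a copy so the caller's list is left untouched.
    masked_ids = list(input_word_ids)
    for pos in positions:
        masked_ids[pos] = MASK_ID

    # Pad with zeros so the positions always have length num_predict.
    padded_positions = positions + [0] * (num_predict - len(positions))
    return {"input_word_ids": masked_ids, "masked_lm_positions": padded_positions}


# Hypothetical token ids: six ordinary tokens between the markers, two pads.
word_ids = [101, 5, 6, 7, 8, 9, 10, 102, 0, 0]
mask = [1] * 8 + [0] * 2
masked = mask_for_mlm(word_ids, mask, num_predict=3)
print(masked)
```

The original token ids at the masked positions serve as the training labels, so they need to be kept alongside the masked copy.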