File size: 5,567 Bytes

b057cdb
 
28fa5e4
 
 
 
 
 
 
 
 
 
 
b057cdb
28fa5e4

---
license: apache-2.0
library_name: tfhub
language: en
tags:
- text
- tokenizer
- preprocessor
- bert
- tensorflow
datasets:
- bookcorpus
- wikipedia
---

## Model name: bert_en_cased_preprocess
## Description adapted from [TFHub](https://tfhub.dev/tensorflow/bert_en_cased_preprocess/3)

# Overview

This SavedModel is a companion of [BERT models](https://tfhub.dev/google/collections/bert/1) to preprocess plain text inputs into the input format expected by BERT. **Check the model documentation** to find the correct preprocessing model for each particular BERT or other Transformer encoder model.

BERT and its preprocessing were originally published by
- Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova: ["BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding"](https://arxiv.org/abs/1810.04805), 2018.

This model uses a vocabulary for English extracted from the Wikipedia and BooksCorpus (same as in the models by the original BERT authors). Text inputs have been normalized the "cased" way, meaning that the distinction between lower and upper case as well as accent markers have been preserved.

This model has no trainable parameters and can be used in an input pipeline outside the training loop.

# Prerequisites

This SavedModel uses TensorFlow operations defined by the [TensorFlow Text](https://github.com/tensorflow/text) library. On [Google Colaboratory](https://colab.research.google.com/), it can be installed with
```
!pip install tensorflow_text
import tensorflow_text as text  # Registers the ops.
```

# Usage
This SavedModel implements the preprocessor API for [text embeddings with Transformer encoders](https://www.tensorflow.org/hub/common_saved_model_apis/text#transformer-encoders), which offers several ways to go from one or more batches of text segments (plain text encoded as UTF-8) to the inputs for the Transformer encoder model.

## Basic usage for single segments

Inputs with a single text segment can be mapped to encoder inputs like this:

### Using TF Hub and HF Hub
```
model_path = snapshot_download(repo_id="Dimitre/bert_en_cased_preprocess")
preprocessor =  KerasLayer(handle=model_path)
text_input = tf.keras.layers.Input(shape=(), dtype=tf.string)
encoder_inputs = preprocessor(text_input)
```

### Using [TF Hub fork](https://github.com/dimitreOliveira/hub)
```
preprocessor = pull_from_hub(repo_id="Dimitre/bert_en_cased_preprocess")
text_input = tf.keras.layers.Input(shape=(), dtype=tf.string)
encoder_inputs = preprocessor(text_input)
```
The resulting encoder inputs have `seq_length=128`.

## General usage
For pairs of input segments, to control the `seq_length`, or to modify tokenized sequences before packing them into encoder inputs, the preprocessor can be called like this:
```
preprocessor = pull_from_hub(repo_id="Dimitre/bert_en_cased_preprocess")

# Step 1: tokenize batches of text inputs.
text_inputs = [tf.keras.layers.Input(shape=(), dtype=tf.string),
               ...] # This SavedModel accepts up to 2 text inputs.
tokenize = hub.KerasLayer(preprocessor.tokenize)
tokenized_inputs = [tokenize(segment) for segment in text_inputs]

# Step 2 (optional): modify tokenized inputs.
pass

# Step 3: pack input sequences for the Transformer encoder.
seq_length = 128  # Your choice here.
bert_pack_inputs = hub.KerasLayer(
    preprocessor.bert_pack_inputs,
    arguments=dict(seq_length=seq_length))  # Optional argument.
encoder_inputs = bert_pack_inputs(tokenized_inputs)
```

The call to `tokenize()` returns an int32 [RaggedTensor](https://www.tensorflow.org/guide/ragged_tensor) of shape `[batch_size, (words), (tokens_per_word)]`. Correspondingly, the call to `bert_pack_inputs()` accepts a RaggedTensor of shape `[batch_size, ...]` with rank 2 or 3.

# Output details

The result of preprocessing is a batch of fixed-length input sequences for the Transformer encoder.

An input sequence starts with one start-of-sequence token, followed by the tokenized segments, each terminated by one end-of-segment token. Remaining positions up to `seq_length`, if any, are filled up with padding tokens. If an input sequence would exceed `seq_length`, the tokenized segments in it are truncated to prefixes of approximately equal sizes to fit exactly.

The `encoder_inputs` are a dict of three int32 Tensors, all with shape `[batch_size, seq_length]`, whose elements represent the batch of input sequences as follows:

- `"input_word_ids"`: has the token ids of the input sequences.
- `"input_mask"`: has value 1 at the position of all input tokens present before padding and value 0 for the padding tokens.
- `"input_type_ids"`: has the index of the input segment that gave rise to the input token at the respective position. The first input segment (index 0) includes the start-of-sequence token and its end-of-segment token. The second segment (index 1, if present) includes its end-of-segment token. Padding tokens get index 0 again.

## Custom input packing and MLM support
The function

```special_tokens_dict = preprocessor.tokenize.get_special_tokens_dict()```

returns a dict of scalar int32 Tensors that report the tokenizer's `"vocab_size"` as well as the ids of certain special tokens: `"padding_id"`, `"start_of_sequence_id"` (aka. [CLS]), `"end_of_segment_id"` (aka. [SEP]) and `"mask_id"`. This allows users to replace `preprocessor.bert_pack_inputs()` with Python code such as `text.combine_segments()`, possibly `text.masked_language_model()`, and `text.pad_model_inputs()` from the [TensorFlow Text](https://github.com/tensorflow/text) library.