I am planning to work on the **SST-2 (Stanford Sentiment Treebank)** dataset. <br>
https://nlp.stanford.edu/sentiment/index.html <br>
https://paperswithcode.com/dataset/sst <br>
https://paperswithcode.com/sota/sentiment-analysis-on-sst-2-binary

In this dataset, each phrase is labelled as either negative or positive. There is also an SST-5 dataset, in which each phrase is labelled as negative, somewhat negative, neutral, somewhat positive, or positive.
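To make the relation between the two label schemes concrete, here is a small sketch. It assumes the common description of the binary task as the five-class task with neutral phrases dropped and the remaining classes merged; the integer ids 0 to 4 and the helper name `sst5_to_binary` are illustrative assumptions, not part of either dataset's API.

```python
# Assumed SST-5 ids (illustrative): 0 = negative, 1 = somewhat negative,
# 2 = neutral, 3 = somewhat positive, 4 = positive.
def sst5_to_binary(label):
    """Collapse a five-class label to 0 (negative) or 1 (positive).

    Returns None for the neutral class, which has no binary counterpart.
    """
    if label in (0, 1):
        return 0
    if label in (3, 4):
        return 1
    return None


print([sst5_to_binary(label) for label in range(5)])
```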

In [None]:
!pip install datasets
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
import pandas as pd
from datasets import load_dataset
from transformers import DistilBertTokenizerFast, TFAutoModelForSequenceClassification

There are many ways to load the dataset, for example with tensorflow_datasets (https://www.tensorflow.org/datasets/api_docs/python/tfds/load), but I am planning to use the Hugging Face datasets package.

In [None]:
data_sst2 = load_dataset("glue", "sst2")




In [None]:
data_sst2

DatasetDict({
    train: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 67349
    })
    validation: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 872
    })
    test: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 1821
    })
})

In [None]:
data_sst2['train'][0]

{'sentence': 'hide new secretions from the parental units ',
 'label': 0,
 'idx': 0}

In [None]:
data_sst2['train'].features

{'sentence': Value(dtype='string', id=None),
 'label': ClassLabel(num_classes=2, names=['negative', 'positive'], id=None),
 'idx': Value(dtype='int32', id=None)}
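The `label` feature printed above stores integers, with the names kept in a separate list. A minimal plain-Python sketch of that mapping follows; the helper names `id_to_name` and `name_to_id` are hypothetical. As far as I know, the `ClassLabel` feature itself offers `int2str` and `str2int` methods for the same purpose.

```python
# Label names copied from the ClassLabel feature printed above.
LABEL_NAMES = ["negative", "positive"]


def id_to_name(label_id):
    """Map an integer label (0 or 1) to its human-readable name."""
    return LABEL_NAMES[label_id]


def name_to_id(name):
    """Map a label name back to its integer id."""
    return LABEL_NAMES.index(name)


# The first training example shown earlier has label 0.
print(id_to_name(0))  # negative
```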

Reference: https://huggingface.co/docs/transformers/index, 
https://github.com/huggingface/notebooks/blob/main/examples/text_classification-tf.ipynb

We need to preprocess our text. Tokenization and preprocessing generally depend on the model architecture you use.

Let's use a pretrained DistilBERT from the Hugging Face transformers library. We could also train our own tokenizer from scratch.

We could also use AutoTokenizer.from_pretrained instead of the class used below.

In [None]:
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')
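For comparison, a minimal sketch of the `AutoTokenizer` route. The wrapper name `load_auto_tokenizer` is hypothetical, and calling it downloads the tokenizer files, so it needs network access the first time.

```python
def load_auto_tokenizer(checkpoint="distilbert-base-uncased"):
    """Load the tokenizer that matches a checkpoint (hypothetical helper).

    AutoTokenizer chooses the concrete tokenizer class from the checkpoint,
    so the same call keeps working if the checkpoint name is swapped.
    """
    # Imported inside the function so defining it has no side effects.
    from transformers import AutoTokenizer

    return AutoTokenizer.from_pretrained(checkpoint)
```

Calling `load_auto_tokenizer()` should give a tokenizer that can be used in place of the one above.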

In [None]:
def preprocess(data):
    return tokenizer(data['sentence'], truncation=True)

We can use the map method of our dataset object to apply the above function to all data points in all splits.

Note that we pass batched=True to encode the texts in batches. This takes full advantage of the fast tokenizer we loaded earlier, which uses multi-threading to process the texts in a batch concurrently.
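To make the batched calling convention concrete, here is a toy sketch in plain Python. It is not how the datasets library implements map; `batched_map` and `toy_tokenize` are hypothetical stand-ins that only show that, in batched mode, the function receives a dict of lists rather than a single example.

```python
def toy_tokenize(batch):
    # In batched mode, batch["sentence"] is a list of strings.
    return {"n_words": [len(text.split()) for text in batch["sentence"]]}


def batched_map(rows, fn, batch_size=2):
    """Apply fn to chunks of rows and merge the new columns back in."""
    mapped = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        # Turn a list of row dicts into one dict of column lists.
        batch = {key: [row[key] for row in chunk] for key in chunk[0]}
        result = fn(batch)
        for i, row in enumerate(chunk):
            new_row = dict(row)
            for key, values in result.items():
                new_row[key] = values[i]
            mapped.append(new_row)
    return mapped


rows = [{"sentence": "a b c"}, {"sentence": "d e"}, {"sentence": "f"}]
mapped = batched_map(rows, toy_tokenize)
print([row["n_words"] for row in mapped])  # [3, 2, 1]
```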

In [None]:
dataset_enc = data_sst2.map(preprocess, batched=True)




In [None]:
dataset_enc["train"].features

{'sentence': Value(dtype='string', id=None),
 'label': ClassLabel(num_classes=2, names=['negative', 'positive'], id=None),
 'idx': Value(dtype='int32', id=None),
 'input_ids': Sequence(feature=Value(dtype='int32', id=None), length=-1, id=None),
 'attention_mask': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None)}

In [None]:
dataset_enc["train"][0]

{'sentence': 'hide new secretions from the parental units ',
 'label': 0,
 'idx': 0,
 'input_ids': [101, 5342, 2047, 3595, 8496, 2013, 1996, 18643, 3197, 102],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [None]:
dataset_enc["train"].features["label"]

ClassLabel(num_classes=2, names=['negative', 'positive'], id=None)

Convert the datasets to tf.data.Dataset objects so that Keras can consume them. The model is loaded first because its prepare_tf_dataset method does the conversion.

In [None]:
model = TFAutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)

Some layers from the model checkpoint at distilbert-base-uncased were not used when initializing TFDistilBertForSequenceClassification: ['vocab_transform', 'vocab_projector', 'vocab_layer_norm', 'activation_13']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier', 'dropout_39', 'pre_classifier']
You should probably TRAIN this model on a down-stream task to be able to use i

In [None]:
tf_train_dataset = model.prepare_tf_dataset(
    dataset_enc["train"],
    shuffle=True,
    batch_size=64,
    tokenizer=tokenizer
)

tf_validation_dataset = model.prepare_tf_dataset(
    dataset_enc["validation"],
    shuffle=False,
    batch_size=64,
    tokenizer=tokenizer,
)

# The GLUE test split has hidden labels (stored as -1), so use it for prediction only.
tf_test_dataset = model.prepare_tf_dataset(
    dataset_enc["test"],
    shuffle=False,
    batch_size=64,
    tokenizer=tokenizer,
)

You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
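The tokenizer is passed to prepare_tf_dataset so that the sequences in each batch can be padded to a common length. Below is a rough, library-free sketch of that padding idea. The helper `pad_batch` is hypothetical, and `pad_id=0` is an assumption (it should match the `[PAD]` id of the distilbert-base-uncased vocabulary).

```python
def pad_batch(sequences, pad_id=0):
    """Pad token-id lists to the longest one and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Token ids borrowed from the encoded example shown earlier.
batch = pad_batch([[101, 5342, 2047, 102], [101, 3197, 102]])
print(batch["input_ids"])       # [[101, 5342, 2047, 102], [101, 3197, 102, 0]]
print(batch["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```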


In [None]:
# We can now pass tf_train_dataset and tf_validation_dataset to model.fit
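A minimal sketch of that training step, under stated assumptions: the helper name `compile_and_fit` is hypothetical, and the learning rate, epoch count, and explicit sparse cross-entropy loss are illustrative choices rather than tuned settings.

```python
def compile_and_fit(model, train_ds, val_ds, epochs=3, learning_rate=2e-5):
    """Compile a Keras sequence classifier and train it (hypothetical helper)."""
    # Imported inside the function so defining it does not require TensorFlow.
    import tensorflow as tf

    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),
        # The classification head returns raw logits, hence from_logits=True.
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"],
    )
    # Returns the Keras History object with per-epoch loss and accuracy.
    return model.fit(train_ds, validation_data=val_ds, epochs=epochs)
```

With the objects built above, it would be called as `compile_and_fit(model, tf_train_dataset, tf_validation_dataset)`.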