
# Perceiver IO masked language model (IMDb)

This model is a Perceiver IO model fine-tuned with masked language modeling on the IMDb dataset. It is a training example of the perceiver-io library.

## Model description

The pretrained model is specified in Section 4 (Table 1) and Appendix F (Table 11) of the Perceiver IO paper (UTF-8 bytes tokenization, vocabulary size of 262, 201M parameters). The fine-tuned model has the same architecture as the pretrained model and cross-attends to the raw UTF-8 bytes of the input.
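As a quick sanity check, the vocabulary size and parameter count can be inspected with the standard 🤗 transformers auto-class API (a minimal sketch; the printed parameter count should be close to the 201M reported in the paper):

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer
from perceiver.model.text import mlm  # auto-class registration

model = AutoModelForMaskedLM.from_pretrained("krasserm/perceiver-io-mlm-imdb")
tokenizer = AutoTokenizer.from_pretrained("krasserm/perceiver-io-mlm-imdb")

print(tokenizer.vocab_size)                        # expected: 262
print(sum(p.numel() for p in model.parameters()))  # expected: roughly 201M
```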

## Model training

The model was trained with masked language modeling and whole word masking on the unsupervised split of the IMDb dataset. Input data are tokenized with a UTF-8 bytes tokenizer (vocabulary size = 262). Word masking is done dynamically at data loading time, i.e. each epoch sees a different set of masked words, as illustrated in the sketch below. Training was done with PyTorch Lightning and the resulting checkpoint was converted to this 🤗 model with a library-specific conversion utility (see the Checkpoint conversion section below).
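The following sketch is purely illustrative and is not the masking implementation used by perceiver-io; it only demonstrates what dynamic whole word masking means: masked words are re-sampled on every call, so each epoch sees a different masking pattern, and a masked word is replaced by one `[MASK]` token per UTF-8 byte, matching the byte-level tokenizer.

```python
import random

def mask_whole_words(words, mask_token="[MASK]", mask_prob=0.15):
    # Illustrative only: re-sampled on every call, so each epoch masks
    # a different set of words. A masked word is replaced by one mask
    # token per UTF-8 byte, matching the byte-level tokenizer.
    return [
        mask_token * len(word.encode("utf-8")) if random.random() < mask_prob else word
        for word in words
    ]

words = "I watched this film and it was awesome".split()
print(" ".join(mask_whole_words(words)))  # output differs on each call
```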

## Intended use and limitations

The fine-tuned model can be used for downstream tasks related to movie reviews, such as movie review sentiment analysis (example). Direct usage of the model is shown below.

## Usage examples

To use this model, you first need to install the perceiver-io library with the text extra:

```shell
pip install perceiver-io[text]
```
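Note that in some shells (e.g. zsh) the extras brackets must be quoted:

```shell
pip install 'perceiver-io[text]'
```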

Then the model can be used with PyTorch. Either use the model and tokenizer directly:

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer
from perceiver.model.text import mlm  # auto-class registration

repo_id = "krasserm/perceiver-io-mlm-imdb"

model = AutoModelForMaskedLM.from_pretrained(repo_id)
tokenizer = AutoTokenizer.from_pretrained(repo_id)

masked_text = "I watched this[MASK][MASK][MASK][MASK][MASK] and it was awesome."
encoding = tokenizer(masked_text, return_tensors="pt")

# get index of first and last mask token
_, mask_indices = torch.where(encoding.input_ids == tokenizer.mask_token_id)
mask_beg = mask_indices[0]
mask_end = mask_indices[-1]

outputs = model(**encoding)

# get predictions for the 5 [MASK] tokens (the [SEP] token at the end is excluded)
masked_token_predictions = outputs.logits[0, mask_beg : mask_end + 1].argmax(dim=-1)
print(tokenizer.decode(masked_token_predictions))
```

```
 film
```

Since the tokenizer operates on raw UTF-8 bytes, each [MASK] token stands for a single byte, so the five masks decode to the five bytes of " film", including the leading space.

or use a fill-mask pipeline:

```python
from transformers import pipeline
from perceiver.model.text import mlm  # auto-class registration

repo_id = "krasserm/perceiver-io-mlm-imdb"

masked_text = "I watched this[MASK][MASK][MASK][MASK][MASK] and it was awesome."

filler_pipeline = pipeline("fill-mask", model=repo_id)

# with multiple mask tokens the pipeline returns one list of candidate
# predictions per [MASK]; pred[0] is the top candidate for each position
masked_token_predictions = filler_pipeline(masked_text)
print("".join([pred[0]["token_str"] for pred in masked_token_predictions]))
```

```
 film
```

## Checkpoint conversion

The krasserm/perceiver-io-mlm-imdb model has been created from a training checkpoint with:

```python
from perceiver.model.text.mlm import convert_checkpoint

convert_checkpoint(
    save_dir="krasserm/perceiver-io-mlm-imdb",
    ckpt_url="https://martin-krasser.com/perceiver/logs-0.8.0/mlm/version_0/checkpoints/epoch=012-val_loss=1.165.ckpt",
    tokenizer_name="krasserm/perceiver-io-mlm",
    push_to_hub=True,
)
```