---
license: apache-2.0
inference: false
datasets:
- imdb
language:
- en
pipeline_tag: fill-mask
---

# Perceiver IO masked language model (IMDb)

This model is a [Perceiver IO masked language model](https://huggingface.co/krasserm/perceiver-io-mlm) fine-tuned with masked language modeling on the [IMDb](https://huggingface.co/datasets/imdb) dataset. It is a [training example](https://github.com/krasserm/perceiver-io/blob/main/docs/training-examples.md#masked-language-modeling) of the [perceiver-io](https://github.com/krasserm/perceiver-io) library.

## Model description

The [pretrained model](https://huggingface.co/krasserm/perceiver-io-mlm) is specified in Section 4 (Table 1) and Appendix F (Table 11) of the [Perceiver IO paper](https://arxiv.org/abs/2107.14795) (UTF-8 bytes tokenization, vocabulary size of 262, 201M parameters). The fine-tuned model has the same architecture as the pretrained model and cross-attends to the raw UTF-8 bytes of the input.

## Model training

The model was [trained](https://github.com/krasserm/perceiver-io/blob/main/docs/training-examples.md#masked-language-modeling) with masked language modeling and whole word masking on the *unsupervised* split of the IMDb dataset. Input data are tokenized with a UTF-8 bytes tokenizer (vocabulary size = 262). Word masking is done dynamically at data loading time, i.e. each epoch has a different set of masked words. Training was done with [PyTorch Lightning](https://www.pytorchlightning.ai/index.html) and the resulting checkpoint was converted to this 🤗 model with a library-specific [conversion utility](#checkpoint-conversion).

## Intended use and limitations

The fine-tuned model can be used for downstream tasks related to movie reviews, such as movie review sentiment analysis ([example](https://huggingface.co/krasserm/perceiver-io-txt-clf-imdb)). Direct usage of the model is shown [below](#usage-examples).

## Usage examples

To use this model you first need to [install](https://github.com/krasserm/perceiver-io/blob/main/README.md#installation) the `perceiver-io` library with extension `text`.

```shell
pip install perceiver-io[text]
```

The model can then be used with PyTorch. You can use the model and tokenizer directly:

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer
from perceiver.model.text import mlm  # auto-class registration

repo_id = "krasserm/perceiver-io-mlm-imdb"

model = AutoModelForMaskedLM.from_pretrained(repo_id)
tokenizer = AutoTokenizer.from_pretrained(repo_id)

masked_text = "I watched this[MASK][MASK][MASK][MASK][MASK] and it was awesome."
encoding = tokenizer(masked_text, return_tensors="pt")

# get index of first and last mask token
_, mask_indices = torch.where(encoding.input_ids == tokenizer.mask_token_id)
mask_beg = mask_indices[0]
mask_end = mask_indices[-1]

outputs = model(**encoding)

# get predictions for the 5 [MASK] tokens
masked_token_predictions = outputs.logits[0, mask_beg : mask_end + 1].argmax(dim=-1)
print(tokenizer.decode(masked_token_predictions))
```

```
film
```
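Because the model operates on raw UTF-8 bytes, each `[MASK]` token stands for exactly one byte, so longer completions require correspondingly more mask tokens; with five masks the model predicts a five-byte completion such as `" film"` (a leading space plus four letters). Below is a minimal sketch of this byte-level tokenization, assuming the tokenizer follows the standard 🤗 tokenizer API:

```python
from transformers import AutoTokenizer
from perceiver.model.text import mlm  # auto-class registration

tokenizer = AutoTokenizer.from_pretrained("krasserm/perceiver-io-mlm-imdb")

# Assumption: each UTF-8 byte of the input maps to one token id
# (vocabulary size 262 = 256 byte values + special tokens).
ids = tokenizer(" film", add_special_tokens=False).input_ids
print(len(ids))  # expected: 5, one id per byte of " film"
```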
Alternatively, you can use a `fill-mask` pipeline:

```python
from transformers import pipeline
from perceiver.model.text import mlm  # auto-class registration

repo_id = "krasserm/perceiver-io-mlm-imdb"

masked_text = "I watched this[MASK][MASK][MASK][MASK][MASK] and it was awesome."

filler_pipeline = pipeline("fill-mask", model=repo_id)
masked_token_predictions = filler_pipeline(masked_text)

print("".join([pred[0]["token_str"] for pred in masked_token_predictions]))
```

```
film
```

## Checkpoint conversion

The `krasserm/perceiver-io-mlm-imdb` model has been created from a training checkpoint with:

```python
from perceiver.model.text.mlm import convert_checkpoint

convert_checkpoint(
    save_dir="krasserm/perceiver-io-mlm-imdb",
    ckpt_url="https://martin-krasser.com/perceiver/logs-0.8.0/mlm/version_0/checkpoints/epoch=012-val_loss=1.165.ckpt",
    tokenizer_name="krasserm/perceiver-io-mlm",
    push_to_hub=True,
)
```
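If you want to inspect the converted model locally before pushing it to the hub, a hedged variant is sketched below; it assumes that `convert_checkpoint` writes a 🤗-compatible model to `save_dir` when `push_to_hub=False`, and the local directory name is hypothetical:

```python
from transformers import AutoModelForMaskedLM
from perceiver.model.text import mlm  # auto-class registration
from perceiver.model.text.mlm import convert_checkpoint

convert_checkpoint(
    save_dir="perceiver-io-mlm-imdb-local",  # hypothetical local directory
    ckpt_url="https://martin-krasser.com/perceiver/logs-0.8.0/mlm/version_0/checkpoints/epoch=012-val_loss=1.165.ckpt",
    tokenizer_name="krasserm/perceiver-io-mlm",
    push_to_hub=False,  # assumption: keep the converted model local
)

# load the converted model from the local directory for a quick sanity check
model = AutoModelForMaskedLM.from_pretrained("perceiver-io-mlm-imdb-local")
```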