---
license: apache-2.0
inference: false
datasets:
- imdb
language:
- en
pipeline_tag: fill-mask
---

# Perceiver IO masked language model (IMDb)

This model is a [Perceiver IO masked language model](https://huggingface.co/krasserm/perceiver-io-mlm) fine-tuned with masked language modeling on the [IMDb](https://huggingface.co/datasets/imdb) dataset. It is a [training example](https://github.com/krasserm/perceiver-io/blob/main/docs/training-examples.md#masked-language-modeling) of the [perceiver-io](https://github.com/krasserm/perceiver-io) library.

## Model description

The [pretrained model](https://huggingface.co/krasserm/perceiver-io-mlm) is specified in Section 4 (Table 1) and Appendix F (Table 11) of the [Perceiver IO paper](https://arxiv.org/abs/2107.14795) (UTF-8 bytes tokenization, vocabulary size of 262, 201M parameters). The fine-tuned model has the same architecture as the pretrained model and cross-attends to the raw UTF-8 bytes of the input.

## Model training

The model was [trained](https://github.com/krasserm/perceiver-io/blob/main/docs/training-examples.md#masked-language-modeling) with masked language modeling and whole word masking on the *unsupervised* split of the IMDb dataset. Input data are tokenized with a UTF-8 bytes tokenizer (vocabulary size = 262). Word masking is done dynamically at data loading time, i.e. each epoch has a different set of masked words. Training was done with [PyTorch Lightning](https://www.pytorchlightning.ai/index.html) and the resulting checkpoint was converted to this 🤗 model with a library-specific [conversion utility](#checkpoint-conversion).

## Intended use and limitations

The fine-tuned model can be used for downstream tasks related to movie reviews, such as movie review sentiment analysis ([example](https://huggingface.co/krasserm/perceiver-io-txt-clf-imdb)). Direct usage of the model is shown [below](#usage-examples).

## Usage examples

To use this model you first need to [install](https://github.com/krasserm/perceiver-io/blob/main/README.md#installation) the `perceiver-io` library with extension `text`.

```shell
pip install perceiver-io[text]
```

The model can then be used with PyTorch. You can use the model and tokenizer directly:

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer
from perceiver.model.text import mlm  # auto-class registration

repo_id = "krasserm/perceiver-io-mlm-imdb"

model = AutoModelForMaskedLM.from_pretrained(repo_id)
tokenizer = AutoTokenizer.from_pretrained(repo_id)

masked_text = "I watched this[MASK][MASK][MASK][MASK][MASK] and it was awesome."
encoding = tokenizer(masked_text, return_tensors="pt")

# get index of first and last mask token
_, mask_indices = torch.where(encoding.input_ids == tokenizer.mask_token_id)
mask_beg = mask_indices[0]
mask_end = mask_indices[-1]

outputs = model(**encoding)

# get predictions for the 5 [MASK] tokens
masked_token_predictions = outputs.logits[0, mask_beg : mask_end + 1].argmax(dim=-1)
print(tokenizer.decode(masked_token_predictions))
```

```
film
```
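Because the model operates on raw UTF-8 bytes, each `[MASK]` token stands for exactly one byte, so longer completions require correspondingly more mask tokens; with five masks the model predicts a five-byte completion such as `" film"` (a leading space plus four letters). Below is a minimal sketch of this byte-level tokenization, assuming the tokenizer follows the standard 🤗 tokenizer API:

```python
from transformers import AutoTokenizer
from perceiver.model.text import mlm  # auto-class registration

tokenizer = AutoTokenizer.from_pretrained("krasserm/perceiver-io-mlm-imdb")

# Assumption: each UTF-8 byte of the input maps to one token id
# (vocabulary size 262 = 256 byte values + special tokens).
ids = tokenizer(" film", add_special_tokens=False).input_ids
print(len(ids))  # expected: 5, one id per byte of " film"
```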
Alternatively, you can use a `fill-mask` pipeline:

```python
from transformers import pipeline
from perceiver.model.text import mlm  # auto-class registration

repo_id = "krasserm/perceiver-io-mlm-imdb"

masked_text = "I watched this[MASK][MASK][MASK][MASK][MASK] and it was awesome."

filler_pipeline = pipeline("fill-mask", model=repo_id)
masked_token_predictions = filler_pipeline(masked_text)

print("".join([pred[0]["token_str"] for pred in masked_token_predictions]))
```

```
film
```

## Checkpoint conversion

The `krasserm/perceiver-io-mlm-imdb` model has been created from a training checkpoint with:

```python
from perceiver.model.text.mlm import convert_checkpoint

convert_checkpoint(
    save_dir="krasserm/perceiver-io-mlm-imdb",
    ckpt_url="https://martin-krasser.com/perceiver/logs-0.8.0/mlm/version_0/checkpoints/epoch=012-val_loss=1.165.ckpt",
    tokenizer_name="krasserm/perceiver-io-mlm",
    push_to_hub=True,
)
```
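If you want to inspect the converted model locally before pushing it to the hub, a hedged variant is sketched below; it assumes that `convert_checkpoint` writes a 🤗-compatible model to `save_dir` when `push_to_hub=False`, and the local directory name is hypothetical:

```python
from transformers import AutoModelForMaskedLM
from perceiver.model.text import mlm  # auto-class registration
from perceiver.model.text.mlm import convert_checkpoint

convert_checkpoint(
    save_dir="perceiver-io-mlm-imdb-local",  # hypothetical local directory
    ckpt_url="https://martin-krasser.com/perceiver/logs-0.8.0/mlm/version_0/checkpoints/epoch=012-val_loss=1.165.ckpt",
    tokenizer_name="krasserm/perceiver-io-mlm",
    push_to_hub=False,  # assumption: keep the converted model local
)

# load the converted model from the local directory for a quick sanity check
model = AutoModelForMaskedLM.from_pretrained("perceiver-io-mlm-imdb-local")
```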