---
language:
  - en
base_model:
  - allenai/longformer-large-4096
---

This version of the longformer-large-4096 model was additionally pre-trained on the S2ORC corpus (Lo et al., 2020) by Wadden et al. (2022). S2ORC is a large corpus of 81.1M English-language academic papers from different disciplines. The model uses the weights of the Longformer large science checkpoint, which also served as the starting point for training the MultiVerS model (Wadden et al., 2022) on the task of scientific claim verification.

Note that the vocabulary size of this model (50275) differs from that of the original longformer-large-4096 (50265) because 10 new tokens were added:

<|par|>, </|title|>, </|sec|>, <|sec-title|>, <|sent|>, <|title|>, <|abs|>, <|sec|>, </|sec-title|>, </|abs|>.
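
The size difference can be checked with simple arithmetic. The following is a minimal sketch that only uses the numbers and token strings stated above; the variable names are illustrative and not taken from the MultiVerS code.

```python
# Vocabulary size of the original allenai/longformer-large-4096, as stated above.
BASE_VOCAB_SIZE = 50265

# The 10 structural marker tokens listed above.
NEW_TOKENS = [
    "<|par|>", "</|title|>", "</|sec|>", "<|sec-title|>", "<|sent|>",
    "<|title|>", "<|abs|>", "<|sec|>", "</|sec-title|>", "</|abs|>",
]

# All tokens are distinct, so each one adds exactly one vocabulary entry.
assert len(set(NEW_TOKENS)) == len(NEW_TOKENS) == 10

expected_vocab_size = BASE_VOCAB_SIZE + len(NEW_TOKENS)
print(expected_vocab_size)  # 50275
```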

The checkpoint weights were transferred and the model was saved based on this code from the MultiVerS repository. The versions transformers==4.2.2 and torch==1.7.1 correspond to the MultiVerS requirements.txt:

```python
import os
import pathlib
import subprocess

import torch
from transformers import LongformerModel

model = LongformerModel.from_pretrained(
    "allenai/longformer-large-4096", gradient_checkpointing=False
)

# Download the pre-trained checkpoint unless it is already present.
url = "https://scifact.s3.us-west-2.amazonaws.com/longchecker/latest/checkpoints/longformer_large_science.ckpt"
out_file = "checkpoints/longformer_large_science.ckpt"
cmd = ["wget", "-O", out_file, url]

if not pathlib.Path(out_file).exists():
    # `wget -O` does not create missing parent directories.
    os.makedirs("checkpoints", exist_ok=True)
    subprocess.run(cmd, check=True)

# Load onto the CPU so that no GPU is required for the conversion.
checkpoint_prefixed = torch.load(out_file, map_location="cpu")

# New checkpoint
new_state_dict = {}
# Add items from loaded checkpoint.
for k, v in checkpoint_prefixed.items():
    # Don't need the language model head.
    if "lm_head." in k:
        continue
    # Get rid of the first 8 characters, which say `roberta.`.
    new_key = k[8:]
    new_state_dict[new_key] = v

# Resize embeddings to the checkpoint's vocabulary size and load the state dict.
target_embed_size = new_state_dict["embeddings.word_embeddings.weight"].shape[0]
model.resize_token_embeddings(target_embed_size)
model.load_state_dict(new_state_dict)

model_dir = "checkpoints/longformer_large_science"
os.makedirs(model_dir, exist_ok=True)

model.save_pretrained(model_dir)
```
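
The key remapping performed in the loop above can be illustrated on a toy dictionary. This is a minimal sketch, not part of the MultiVerS code: the helper name `strip_prefix_and_drop_head` is hypothetical, the example keys are only illustrative, and strings stand in for tensors. Unlike the original loop, it strips the prefix only when the key actually starts with it.

```python
def strip_prefix_and_drop_head(state_dict, prefix="roberta.", head_marker="lm_head."):
    """Return a copy without LM-head entries and without the leading prefix."""
    remapped = {}
    for key, value in state_dict.items():
        # Skip everything that belongs to the language model head.
        if head_marker in key:
            continue
        # Strip the prefix only if it is present (an added safety check).
        new_key = key[len(prefix):] if key.startswith(prefix) else key
        remapped[new_key] = value
    return remapped


# Toy stand-in for a checkpoint (hypothetical keys, strings instead of tensors).
toy_checkpoint = {
    "roberta.embeddings.word_embeddings.weight": "E",
    "roberta.encoder.layer.0.attention.self.query.weight": "Q",
    "lm_head.dense.weight": "H",
}

remapped = strip_prefix_and_drop_head(toy_checkpoint)
print(sorted(remapped))
```

After remapping, the keys no longer carry the `roberta.` prefix, which is what allows `load_state_dict` on a bare `LongformerModel` to match them.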

The tokenizer was resized and saved following this code from the MultiVerS repository:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/longformer-large-4096")
ADDITIONAL_TOKENS = {
    "section_start": "<|sec|>",
    "section_end": "</|sec|>",
    "section_title_start": "<|sec-title|>",
    "section_title_end": "</|sec-title|>",
    "abstract_start": "<|abs|>",
    "abstract_end": "</|abs|>",
    "title_start": "<|title|>",
    "title_end": "</|title|>",
    "sentence_sep": "<|sent|>",
    "paragraph_sep": "<|par|>",
}
tokenizer.add_tokens(list(ADDITIONAL_TOKENS.values()))
tokenizer.save_pretrained("checkpoints/longformer_large_science")
```
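
Since the tokenizer and the embedding matrix were resized separately, it can be useful to confirm that they agree. The following is a minimal sketch under the assumption that both the model and the tokenizer were saved to `checkpoints/longformer_large_science` as shown above; the helper name `vocab_matches_embeddings` is hypothetical.

```python
def vocab_matches_embeddings(tokenizer, model):
    """Return True if the tokenizer length equals the number of input-embedding rows."""
    n_tokens = len(tokenizer)  # includes tokens added via add_tokens
    n_rows = model.get_input_embeddings().weight.shape[0]
    return n_tokens == n_rows


# Intended use (not executed here; requires transformers and the saved files):
#   from transformers import AutoTokenizer, LongformerModel
#   model_dir = "checkpoints/longformer_large_science"
#   tokenizer = AutoTokenizer.from_pretrained(model_dir)
#   model = LongformerModel.from_pretrained(model_dir)
#   assert vocab_matches_embeddings(tokenizer, model)
```

If the check fails, token ids produced by the tokenizer may fall outside the embedding matrix, so the mismatch is worth resolving before fine-tuning.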