---
language:
- en
base_model:
- allenai/longformer-large-4096
---
This version of the `longformer-large-4096` model was additionally pre-trained on the S2ORC corpus [(Lo et al., 2020)](https://arxiv.org/pdf/1911.02782) by [Wadden et al. (2022)](https://arxiv.org/pdf/2112.01640). S2ORC is a large corpus of 81.1M English-language academic papers from many disciplines. The model uses the weights of [the Longformer large science checkpoint](https://scifact.s3.us-west-2.amazonaws.com/longchecker/latest/checkpoints/longformer_large_science.ckpt), which was also used as the starting point for training the MultiVerS model [(Wadden et al., 2022)](https://arxiv.org/pdf/2112.01640) on the task of scientific claim verification.

Note that the vocabulary size of this model (50275) differs from the original `longformer-large-4096` (50265) since 10 new tokens were included:

`<|par|>, </|title|>, </|sec|>, <|sec-title|>, <|sent|>, <|title|>, <|abs|>, <|sec|>, </|sec-title|>, </|abs|>`.

Transferring the checkpoint weights and saving the model was done based on [this code](https://github.com/dwadden/multivers/blob/main/multivers/model.py#L145) from the MultiVerS repository. The versions `transformers==4.2.2` and `torch==1.7.1` correspond to the MultiVerS [requirements.txt](https://github.com/dwadden/multivers/blob/main/requirements.txt):
```python
import os
import pathlib
import subprocess

import torch
from transformers import LongformerModel

model = LongformerModel.from_pretrained(
    "allenai/longformer-large-4096", gradient_checkpointing=False
)

# Download the pre-trained checkpoint if it is not present already.
url = "https://scifact.s3.us-west-2.amazonaws.com/longchecker/latest/checkpoints/longformer_large_science.ckpt"
out_file = "checkpoints/longformer_large_science.ckpt"
cmd = ["wget", "-O", out_file, url]

pathlib.Path("checkpoints").mkdir(exist_ok=True)
if not pathlib.Path(out_file).exists():
    subprocess.run(cmd, check=True)

# Load the checkpoint on CPU; its keys are prefixed with `roberta.`.
checkpoint_prefixed = torch.load(out_file, map_location="cpu")

# New checkpoint
new_state_dict = {}
# Add items from loaded checkpoint.
for k, v in checkpoint_prefixed.items():
    # Don't need the language model head.
    if "lm_head." in k:
        continue
    # Get rid of the first 8 characters, which say `roberta.`.
    new_key = k[8:]
    new_state_dict[new_key] = v

# Resize embeddings and load state dict.
target_embed_size = new_state_dict["embeddings.word_embeddings.weight"].shape[0]
model.resize_token_embeddings(target_embed_size)
model.load_state_dict(new_state_dict)

model_dir = "checkpoints/longformer_large_science"
if not os.path.exists(model_dir):
    os.makedirs(model_dir)

model.save_pretrained(model_dir)
```
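A quick way to verify the conversion (a minimal sketch, not part of the original MultiVerS code) is to reload the saved model and check that the token-embedding matrix now has the expanded vocabulary size of 50275:
```python
from transformers import LongformerModel

model = LongformerModel.from_pretrained("checkpoints/longformer_large_science")

# The word-embedding matrix has one row per vocabulary entry.
vocab_rows = model.get_input_embeddings().weight.shape[0]
print(vocab_rows)  # 50275
assert vocab_rows == 50275
```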

The tokenizer was extended with the 10 new tokens and saved following [this code](https://github.com/dwadden/multivers/blob/main/multivers/data.py#L14) from the MultiVerS repository:
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/longformer-large-4096")
ADDITIONAL_TOKENS = {
    "section_start": "<|sec|>",
    "section_end": "</|sec|>",
    "section_title_start": "<|sec-title|>",
    "section_title_end": "</|sec-title|>",
    "abstract_start": "<|abs|>",
    "abstract_end": "</|abs|>",
    "title_start": "<|title|>",
    "title_end": "</|title|>",
    "sentence_sep": "<|sent|>",
    "paragraph_sep": "<|par|>",
}
tokenizer.add_tokens(list(ADDITIONAL_TOKENS.values()))
tokenizer.save_pretrained("checkpoints/longformer_large_science")
```
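
Once both the model and the tokenizer have been saved to `checkpoints/longformer_large_science`, they can be used together as a plain Longformer encoder. The snippet below is a minimal usage sketch; the markup in `text` only illustrates the added structural tokens and is not necessarily the exact input format expected by MultiVerS:
```python
import torch
from transformers import AutoTokenizer, LongformerModel

model_dir = "checkpoints/longformer_large_science"
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = LongformerModel.from_pretrained(model_dir)

# Each added marker is a single token in the extended vocabulary.
print(tokenizer.convert_tokens_to_ids("<|sent|>"))  # an id >= 50265
print(len(tokenizer))                               # 50275

text = "<|title|> An example title </|title|> <|abs|> First sentence. <|sent|> Second sentence. </|abs|>"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

print(outputs.last_hidden_state.shape)  # (1, sequence_length, 1024)
```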