---
tags:
- biology
- medical
---
# ProteinGLM 

## Introduction to ProteinGLM Family Models

ProteinGLM is the open-source release of our latest protein language models for protein understanding tasks (masked protein language models) and protein design (causal protein language models). The ProteinGLM family is developed by Tsinghua University. Alongside the INT4-quantized ProteinGLM-100B weights, we have released a set of smaller models: 1B, 3B, and 10B models trained with masked language modeling for protein understanding, and 1B, 3B, and 7B causal language models aimed at protein design.

### Out-of-Distribution Perplexity Evaluation

We evaluated the ProteinGLM (MLM and CLM) and ProteinGLM-INT4 (100B) models on two OOD test sets: one with sequence identity lower than 0.9 to the training set (<0.9 ID) and the other with sequence identity lower than 0.5 to the training set (<0.5 ID). Each OOD dataset comprises approximately 10,000 protein sequences. The MLM perplexity results are compared against ESM2-3B and ESM2-15B, and the CLM perplexity against ProGen2-xlarge (6.4B); lower is better:

| Model    | ESM2 (3B) | ESM2 (15B) | PGLM (1B) | PGLM (3B) | PGLM (10B) | PGLM-INT4 (100B) |
|:---------|:---------:|:----------:|:---------:|:---------:|:----------:|:----------------:|
| < 0.9 ID |    7.7    |    7.3     |    9.3    |    7.8    |    7.6     |     **6.8**      |
| < 0.5 ID |   11.5    |   11.0     |   13.5    |   11.9    |   11.6     |    **10.8**      |


| Model               | ProGen2-xlarge (6.4B) | PGLM (1B) | PGLM (3B) | PGLM (7B) |  PGLM-INT4 (100B) |
|:--------------------|:----------:|:----------:|:----------:|:--------------------:|:--------------------:|
| < 0.9 ID            |   9.7     | 9.8    |    9.3    |      8.9    |            **8.9**         | 
| < 0.5 ID            |   14.3    |  14.0  |    13.7  |    13.5   |           **13.5**          |
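
For reference, the sketch below shows how held-out CLM perplexity could be computed with standard Hugging Face APIs. The checkpoint choice, the toy sequences, and the assumption that the remote code returns logits in the usual (batch, seq_len, vocab) layout are illustrative; this is not the official evaluation pipeline.

```python
# Hedged sketch: next-residue perplexity for a causal checkpoint.
# Assumes the remote code exposes standard .logits; illustrative only.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Bo1015/proteinglm-1b-clm"  # any CLM checkpoint from the model list below
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True, torch_dtype=torch.half).to(device).eval()

sequences = ["MILMCQHFSGQFSKYFLAVSSDFCHFVFPIILVSHVNFKQ", "MLFVVLACLVAGSQA"]  # toy stand-ins for an OOD set
total_nll, total_tokens = 0.0, 0

with torch.inference_mode():
    for seq in sequences:
        enc = tokenizer(seq, return_tensors="pt").to(device)
        logits = model(**enc).logits[:, :-1]   # predictions for positions 1..n-1
        labels = enc["input_ids"][:, 1:]       # next-residue targets
        nll = F.cross_entropy(logits.reshape(-1, logits.size(-1)).float(),
                              labels.reshape(-1), reduction="sum")
        total_nll += nll.item()
        total_tokens += labels.numel()

print("perplexity:", torch.exp(torch.tensor(total_nll / total_tokens)).item())
```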


## Downstream Protein Understanding Tasks Evaluation
(TODO)

## Get Started
### Model List
You can manually download the required weights from the links below:

| Model                               | Download                                                              |
|-------------------------------------|-----------------------------------------------------------------------|
| ProteinGLM-1B-MLM                   | [🤗 Huggingface](https://huggingface.co/Bo1015/proteinglm-1b-mlm)      |
| ProteinGLM-3B-MLM                   | [🤗 Huggingface](https://huggingface.co/Bo1015/proteinglm-3b-mlm)      |
| ProteinGLM-10B-MLM                  | [🤗 Huggingface](https://huggingface.co/Bo1015/proteinglm-10b-mlm)     |
| ProteinGLM-1B-CLM                   | [🤗 Huggingface](https://huggingface.co/Bo1015/proteinglm-1b-clm)      |
| ProteinGLM-3B-CLM                   | [🤗 Huggingface](https://huggingface.co/Bo1015/proteinglm-3b-clm)      |
| ProteinGLM-7B-CLM                   | [🤗 Huggingface](https://huggingface.co/Bo1015/proteinglm-7b-clm)      |
| ProteinGLM-INT4 (100B) (MLM or CLM) | [🤗 Huggingface](https://huggingface.co/Bo1015/proteinglm-100b-int4)   |
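
If you prefer to fetch a checkpoint ahead of time rather than on first use, one option is `huggingface_hub.snapshot_download`; the repository choice and cache directory below are illustrative.

```python
# Optional: pre-download a checkpoint to a local cache (paths are illustrative).
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="Bo1015/proteinglm-10b-mlm",   # pick any model from the table above
    cache_dir="./proteinglm_checkpoints",  # illustrative cache location
)
print("Weights downloaded to:", local_dir)
```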

## How to use
### ProteinGLM-MLM: Masked Language Models for Protein Understanding Tasks
(The INT4-quantized ProteinGLM-100B requires approximately 50 GB of GPU memory. Inference can run on a single 80 GB A100/A800 GPU, or be sharded across multiple GPUs with at least 60 GB of combined memory.)
```python

# Obtain residue embeddings
from transformers import AutoModelForMaskedLM, AutoModelForSequenceClassification, AutoModelForTokenClassification, AutoTokenizer, AutoConfig
import torch

tokenizer = AutoTokenizer.from_pretrained("Bo1015/proteinglm-100b-int4", trust_remote_code=True, use_fast=True)
config = AutoConfig.from_pretrained("Bo1015/proteinglm-100b-int4", trust_remote_code=True, torch_dtype=torch.half)
config.is_causal = False
config.post_layer_norm = True  # whether to apply the final layer norm; for some downstream tasks, False works better
model = AutoModelForMaskedLM.from_pretrained("Bo1015/proteinglm-100b-int4", config=config, torch_dtype=torch.half, trust_remote_code=True)
if torch.cuda.is_available():
    model = model.cuda()

# # If you don't have a single GPU with 80 GB of memory, try dispatched (sharded) loading:
# from accelerate import load_checkpoint_and_dispatch, init_empty_weights
# with init_empty_weights():
#     model = AutoModelForMaskedLM.from_config(config, trust_remote_code=True)
#
# model = load_checkpoint_and_dispatch(
#     model, "<your model cached dir>", device_map="auto", no_split_module_classes=["xTrimoPGLMBlock"], strict=True, dtype=torch.half
# )

model.eval()

seq = 'MILMCQHFSGQFSKYFLAVSSDFCHFVFPIILVSHVNFKQMKRKGFALWNDRAVPFTQGIFTTVMILLQYLHGTG'
output = tokenizer(seq, add_special_tokens=True, return_tensors='pt')
with torch.inference_mode():
    inputs = {"input_ids": output["input_ids"].to(model.device), "attention_mask": output["attention_mask"].to(model.device)}
    output_embeddings = model(**inputs, output_hidden_states=True, return_last_hidden_state=True).hidden_states[:-1, 0]  # drop the <eos> token


# model for sequence-level tasks
model = AutoModelForSequenceClassification.from_pretrained("Bo1015/proteinglm-100b-int4", config=config, torch_dtype=torch.half, trust_remote_code=True)

# model for token-level tasks
model = AutoModelForTokenClassification.from_pretrained("Bo1015/proteinglm-100b-int4", config=config, torch_dtype=torch.half, trust_remote_code=True)

```
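
As a follow-up, the residue embeddings returned above can be pooled into a single per-sequence vector for quick downstream experiments. The sketch below assumes `output_embeddings` is a 2-D tensor of shape (sequence_length, hidden_size), as produced in the snippet; the pooling and the toy probe are illustrations, not part of the released API.

```python
# Hedged sketch: mean-pool residue embeddings into one sequence-level vector.
# Assumes output_embeddings has shape (sequence_length, hidden_size).
import torch.nn as nn

sequence_embedding = output_embeddings.float().mean(dim=0)  # (hidden_size,)

# Such a vector could feed a lightweight downstream probe, e.g. a toy 2-class head:
probe = nn.Linear(sequence_embedding.shape[-1], 2).to(sequence_embedding.device)
logits = probe(sequence_embedding)
print(sequence_embedding.shape, logits.shape)
```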

### ProteinGLM-CLM: Causal Language Models for Protein Design
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, AutoConfig
import torch

tokenizer = AutoTokenizer.from_pretrained("Bo1015/proteinglm-100b-int4", trust_remote_code=True, use_fast=True)
config = AutoConfig.from_pretrained("Bo1015/proteinglm-100b-int4", trust_remote_code=True, torch_dtype=torch.half)
config.is_causal = True
model = AutoModelForCausalLM.from_pretrained("Bo1015/proteinglm-100b-int4", config=config, torch_dtype=torch.half, trust_remote_code=True)
if torch.cuda.is_available():
    model = model.cuda()

# # If you don't have a single GPU with 80 GB of memory, try dispatched (sharded) loading:
# from accelerate import load_checkpoint_and_dispatch, init_empty_weights
# with init_empty_weights():
#     model = AutoModelForCausalLM.from_config(config, trust_remote_code=True)
#
# model = load_checkpoint_and_dispatch(
#     model, "<your model cached dir>", device_map="auto", no_split_module_classes=["xTrimoPGLMBlock"], strict=True, dtype=torch.half
# )
model.eval()

gen_kwargs = {'max_length': 256, 'top_p': 0.8, 'temperature':0.9, "num_beams": 1}
prompt=['', 'MLFVVL', 'LDL', 'VTQA']

for idx, each in enumerate(prompt):
    print(f"Begin generating idx: {idx} with prompt {each}")
    output = model.chat(tokenizer, each, **gen_kwargs)
    print(f"\nEnd generation with length: {len(output.split())} - seqs: {output}\n")
```
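
Generated strings can be sanity-checked before downstream use. The filter below simply verifies that each output contains only the 20 standard amino-acid letters and meets a minimum length; it is an illustrative post-processing step, not part of the model's API.

```python
# Hedged post-processing sketch: keep only generations that look like valid
# protein sequences (standard 20-letter alphabet, minimum length).
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")

def is_plausible(seq: str, min_len: int = 20) -> bool:
    seq = seq.replace(" ", "")  # strip any whitespace the decoder may emit
    return len(seq) >= min_len and set(seq) <= VALID_RESIDUES

# Re-generates from the same prompts purely for illustration.
generations = [model.chat(tokenizer, p, **gen_kwargs) for p in prompt]
kept = [g for g in generations if is_plausible(g)]
print(f"kept {len(kept)} / {len(generations)} generated sequences")
```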


## LICENSE

The model in this repository is open source under the [Creative Commons Attribution-NonCommercial 4.0 International License](./LICENSE).

## Citations

If you find our work useful, please consider citing the following papers:
```
@misc{chen2024xtrimopglm,
  title={xTrimoPGLM: unified 100B-scale pre-trained transformer for deciphering the language of protein},
  author={Chen, Bo and Cheng, Xingyi and Li, Pan and Geng, Yangli-ao and Gong, Jing and Li, Shen and Bei, Zhilei and Tan, Xu and Wang, Boyan and Zeng, Xin and others},
  year={2024},
  eprint={2401.06199},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  note={arXiv preprint arXiv:2401.06199}
}

@misc{cheng2024training,
  title={Training Compute-Optimal Protein Language Models},
  author={Cheng, Xingyi and Chen, Bo and Li, Pan and Gong, Jing and Tang, Jie and Song, Le},
  year={2024},
  note={bioRxiv, Cold Spring Harbor Laboratory, pages 2024--06}
}
```