Bo1015 committed on
Commit cfc4ed5 · verified · 1 Parent(s): 1f0f1fd

Update README.md

Files changed (1)
  1. README.md +17 -22
README.md CHANGED
@@ -2,29 +2,28 @@
tags:
- biology
---
- # xTrimoPGLM-1B-CLM
+ # ProteinGLM-1B-CLM

## Model Introduction

- **xTrimoPGLM-1B-CLM** is the open-source version of the latest generative protein language models designed to generate faithful and diverse protein sequences. The xTrimoPGLM family models are developed by BioMap and Tsinghua University. Along with this, we have released the int4 quantization xTrimoPGLM-100B weights and other xTrimo-series small models, which include: 1B, 3B, and 10B models trained with masked language modeling for protein understanding, and 1B, 3B, and 7B causal language models aimed at protein design.
+ **ProteinGLM-1B-CLM** is the open-source version of the latest generative protein language models designed to generate faithful and diverse protein sequences. The ProteinGLM family models are developed by Tsinghua University. Along with this, we have released the int4 quantization ProteinGLM weights and other small models, which include: 1B, 3B, and 10B models trained with masked language modeling for protein understanding, and 1B, 3B, and 7B causal language models aimed at protein design.

### Out-of-Distribution Perplexity Evaluation

- We evaluated the xTrimoPGLM-CLM (xTCLM) and xTrimoPGLM(100B) models on two OOD test sets, one with sequence identity lower than 0.9 with the training set (<0.9 ID) and the other with sequence identity lower than 0.5 with the training set (<0.5 ID). Each OOD dataset comprises approximately 10,000 protein sequences. The perplexity results, compared against ProGen2-xlarge (6.4B), are as follows (lower is better):
+ We evaluated the ProteinGLM-CLM (PGLM) and ProteinGLM-INT4(100B) models on two OOD test sets, one with sequence identity lower than 0.9 with the training set (<0.9 ID) and the other with sequence identity lower than 0.5 with the training set (<0.5 ID). Each OOD dataset comprises approximately 10,000 protein sequences. The perplexity results, compared against ProGen2-xlarge (6.4B), are as follows (lower is better):

- | Model | ProGen2-xlarge (6.4B) | xTCLM (1B) | xTCLM (3B) | xTCLM (7B) | xT (100B)-INT4 |
+ | Model | ProGen2-xlarge (6.4B) | PGLM (1B) | PGLM (3B) | PGLM (7B) | PGLM-INT4 (100B) |
|:--------------------|:----------:|:----------:|:----------:|:--------------------:|:--------------------:|
- | < 0.9 ID | 9.7 | 9.8 | 9.3 | 8.9 | **8.9** |
- | < 0.5 ID | 14.3 | 14.0 | 13.7 | 13.5 | **13.5** |
-
+ | < 0.9 ID | 9.7 | 9.8 | 9.3 | 8.9 | **8.7** |
+ | < 0.5 ID | 14.3 | 14.0 | 13.7 | 13.5 | **13.3** |

## How to use
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, AutoConfig
import torch

- tokenizer = AutoTokenizer.from_pretrained("biomap-research/xtrimopglm-1b-clm", trust_remote_code=True, use_fast=True)
- model = AutoModelForCausalLM.from_pretrained("biomap-research/xtrimopglm-1b-clm", trust_remote_code=True, torch_dtype=torch.bfloat16)
+ tokenizer = AutoTokenizer.from_pretrained("Bo1015/proteinglm-1b-clm", trust_remote_code=True, use_fast=True)
+ model = AutoModelForCausalLM.from_pretrained("Bo1015/proteinglm-1b-clm", trust_remote_code=True, torch_dtype=torch.bfloat16)
if torch.cuda.is_available():
    model = model.cuda()
model.eval()
@@ -38,9 +37,6 @@ for idx, each in enumerate(prompt):
    print(f"\nEnd generation with length: {len(output.split())} - seqs: {output}\n")
```

-
- For more inference or fine-tuning code, datasets, and requirements, please visit our [GitHub page](https://github.com/biomap-research/xTrimoPGLM).
-
## LICENSE

The code in this repository is open source under the [Creative Commons Attribution-NonCommercial 4.0 International License](./LICENSE).
@@ -48,21 +44,20 @@ The code in this repository is open source under the [Creative Commons Attributi
## Citations

If you find our work useful, please consider citing the following paper:
- ```latex
- @misc{chen2024xtrimopglm,
+ ```
+ @article{chen2024xtrimopglm,
title={xTrimoPGLM: unified 100B-scale pre-trained transformer for deciphering the language of protein},
author={Chen, Bo and Cheng, Xingyi and Li, Pan and Geng, Yangli-ao and Gong, Jing and Li, Shen and Bei, Zhilei and Tan, Xu and Wang, Boyan and Zeng, Xin and others},
- year={2024},
- eprint={2401.06199},
- archivePrefix={arXiv},
- primaryClass={cs.CL},
- note={arXiv preprint arXiv:2401.06199}
+ journal={arXiv preprint arXiv:2401.06199},
+ year={2024}
}

- @misc{cheng2024training,
+ @article{cheng2024training,
title={Training Compute-Optimal Protein Language Models},
author={Cheng, Xingyi and Chen, Bo and Li, Pan and Gong, Jing and Tang, Jie and Song, Le},
+ journal={bioRxiv},
+ pages={2024--06},
year={2024},
- note={bioRxiv, Cold Spring Harbor Laboratory, pages 2024--06}
+ publisher={Cold Spring Harbor Laboratory}
}
- ```
+ ```
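
The hunks above show only the lines of the "How to use" snippet that changed; the prompt loop and generation call are outside this diff. For orientation, here is a minimal end-to-end sketch built around the new `Bo1015/proteinglm-1b-clm` repo id. The `prompt` list, `gen_kwargs`, and the `generate`/`decode` calls are assumptions following standard `transformers` usage, not the README's exact code.

```python
# Hedged sketch: completes the partial snippet shown in the diff.
# The prompt list, gen_kwargs values, and decoding step are assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("Bo1015/proteinglm-1b-clm", trust_remote_code=True, use_fast=True)
model = AutoModelForCausalLM.from_pretrained("Bo1015/proteinglm-1b-clm", trust_remote_code=True, torch_dtype=torch.bfloat16)
if torch.cuda.is_available():
    model = model.cuda()
model.eval()

# Assumed sampling settings for protein generation; tune as needed.
gen_kwargs = {"max_new_tokens": 256, "do_sample": True, "top_p": 0.9, "temperature": 0.8}
# An empty string asks for unconditional generation; a prefix seeds the sequence.
prompt = ["", "MLFVVL"]

for idx, each in enumerate(prompt):
    inputs = tokenizer(each, return_tensors="pt").to(model.device)
    with torch.inference_mode():
        outputs = model.generate(**inputs, **gen_kwargs)
    output = tokenizer.decode(outputs[0], skip_special_tokens=True)
    print(f"\nEnd generation with length: {len(output.split())} - seqs: {output}\n")
```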
 
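The OOD evaluation in this diff reports perplexity on held-out protein sequences. As a rough illustration of how such per-sequence numbers are typically computed for a causal LM (a generic sketch under standard Hugging Face conventions, not the authors' evaluation script; averaging over the ~10,000 OOD sequences is omitted, and the model's remote-code forward is assumed to accept `labels` like a standard causal LM):

```python
# Hedged sketch: perplexity = exp(mean negative log-likelihood of next-token predictions).
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Bo1015/proteinglm-1b-clm", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("Bo1015/proteinglm-1b-clm", trust_remote_code=True, torch_dtype=torch.bfloat16)
if torch.cuda.is_available():
    model = model.cuda()
model.eval()

def sequence_perplexity(seq: str) -> float:
    enc = tokenizer(seq, return_tensors="pt").to(model.device)
    with torch.inference_mode():
        # Passing labels makes the model return the mean cross-entropy over shifted tokens.
        out = model(**enc, labels=enc["input_ids"])
    return math.exp(out.loss.item())

# Example amino-acid sequence chosen for illustration only.
print(sequence_perplexity("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQFEVVHSLAKWKR"))
```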