---
license: cc-by-nc-2.0
library_name: transformers
datasets:
- CCDS
- Ensembl
pipeline_tag: fill-mask
tags:
- protein language model
- biology
widget:
- text: >-
    ( Z [MASK] V L P Y G D E K L S P Y G D G G D V G Q I F s C B L Q D T N N F F G A
    g Q N K % O P K L G Q I G % S K % u u i e d d R i d D V L k n ( T D K @ p p
    ^ v
  example_title: Fill mask (E)
---

# cdsBERT
<img src="https://cdn-uploads.huggingface.co/production/uploads/62f2bd3bdb7cbd214b658c48/yA-f7tnvNNV52DK2QYNq_.png" width="350">

## Model description

[cdsBERT](https://doi.org/10.1101/2023.09.15.558027) is a protein language model (pLM) with a codon vocabulary, seeded with [ProtBERT](https://huggingface.co/Rostlab/prot_bert_bfd) and trained with a novel vocabulary extension pipeline called MELD. cdsBERT offers a highly biologically relevant latent space and excellent EC number prediction.
Specifically, this is the full-precision checkpoint after the MLM objective on 4 million CDS examples.

## How to use

```python
# Imports
import torch
from transformers import BertForMaskedLM, BertTokenizer, pipeline

model = BertForMaskedLM.from_pretrained('lhallee/cdsBERT') # load model
tokenizer = BertTokenizer.from_pretrained('lhallee/cdsBERT') # load tokenizer
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu') # select device
model.to(device) # move model to device
model.eval() # switch to evaluation mode

sequence = '( Z [MASK] V L P Y G D E K L S P Y G D G G D V G Q I F s C # L Q D T N N F F G A g Q N K % O P K L G Q I G % S K % u u i e d d R i d D V L k n ( T D K @ p p ^ v ]' # CCDS207.1|Hs110|chr1

# Create a fill-mask prediction pipeline
unmasker = pipeline('fill-mask', model=model, tokenizer=tokenizer)
# Predict the masked token
prediction = unmasker(sequence)
print(prediction)
```
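
The example above covers mask filling. If you also want per-sequence embeddings from this checkpoint, one reasonable (but unofficial) approach is to mean-pool the encoder's final hidden states; the sketch below assumes the `model`, `tokenizer`, `device`, and `sequence` objects defined in the block above, and the pooling choice is an assumption rather than a recommendation from the authors.

```python
# Minimal embedding sketch (assumption): mean-pool the final hidden states.
# Reuses model, tokenizer, device, and sequence from the example above.
with torch.no_grad():
    inputs = tokenizer(sequence, return_tensors='pt').to(device)
    outputs = model(**inputs, output_hidden_states=True)
    hidden = outputs.hidden_states[-1]             # (1, seq_len, hidden_size)
    mask = inputs['attention_mask'].unsqueeze(-1)  # ignore padding tokens
    embedding = (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # (1, hidden_size)
print(embedding.shape)
```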

## Intended use and limitations
cdsBERT serves as a general-purpose protein language model with a codon vocabulary. Fine-tuning with Hugging Face transformers classes such as BertForSequenceClassification enables downstream classification and regression tasks, and the base model can be used for feature extraction out of the box. This checkpoint, taken after MLM pretraining, can perform mask filling, while the cdsBERT+ checkpoint offers a more biochemically relevant latent space.
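
As a starting point for such fine-tuning, the snippet below is a minimal sketch, not part of the original card; the two-label setup is a placeholder, and the assumption is that the standard Hugging Face fine-tuning workflow applies to this checkpoint.

```python
# Hypothetical fine-tuning setup: attach a sequence classification head
# to the pretrained cdsBERT encoder (num_labels=2 is a placeholder).
from transformers import BertForSequenceClassification, BertTokenizer

classifier = BertForSequenceClassification.from_pretrained('lhallee/cdsBERT', num_labels=2)
tokenizer = BertTokenizer.from_pretrained('lhallee/cdsBERT')

# Space-separated codon-vocabulary sequences can then be tokenized and passed
# to the Hugging Face Trainer or a custom PyTorch training loop.
```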

## Our lab
The [Gleghorn lab](https://www.gleghornlab.com/) is an interdisciplinary research group at the University of Delaware that focuses on solving translational problems with our expertise in engineering, biology, and chemistry. We develop inexpensive and reliable tools to study organ development, maternal-fetal health, and drug delivery. Recently, we have begun exploring protein language models and strive to make protein design and annotation accessible.

## Please cite
```bibtex
@article{Hallee_cds_2023,
    author = {Logan Hallee and Nikolaos Rafailidis and Jason P. Gleghorn},
    title = {cdsBERT - Extending Protein Language Models with Codon Awareness},
    year = {2023},
    doi = {10.1101/2023.09.15.558027},
    publisher = {Cold Spring Harbor Laboratory},
    journal = {bioRxiv}
}
```