---
license: mit
datasets:
- drug-discovery/UR100P
language:
- en
tags:
- biology
---

## AMPLIFY

AMPLIFY is an efficient, state-of-the-art protein language model pre-trained using masked language modeling on UniRef100, OAS, and SCOP ([UR100P](https://huggingface.co/datasets/drug-discovery/UR100P)). AMPLIFY can generate residue and protein embeddings, suggest mutations, and differentiate disordered proteins from non-protein sequences, among other tasks. AMPLIFY is available in two sizes, 120M and 350M parameters. The `_base` models are the Stage 1 checkpoints, which were not extended beyond 512 residues. The model architecture and pre-training procedure are detailed below. For more details, please refer to the [accompanying paper](https://www.biorxiv.org/content/10.1101/2024.09.23.614603v1).

- [`AMPLIFY_350M`](https://huggingface.co/drug-discovery/AMPLIFY_350M)
- [`AMPLIFY_350M_base`](https://huggingface.co/drug-discovery/AMPLIFY_350M_base)
- [`AMPLIFY_120M`](https://huggingface.co/drug-discovery/AMPLIFY_120M)
- [`AMPLIFY_120M_base`](https://huggingface.co/drug-discovery/AMPLIFY_120M_base)

### Model Description

|                                | AMPLIFY 120M | AMPLIFY 350M |
| :----------------------------- | -----------: | -----------: |
| `hidden-size`                  |          640 |          960 |
| `num-hidden-layers`            |           24 |           32 |
| `num-attention-heads`          |           10 |           15 |
| `intermediate-size`            |         2560 |         3840 |
| `max-position-embeddings`      |         2048 |         2048 |
| `vocab-size`                   |           27 |           27 |
| `rope-theta`                   |        10000 |        10000 |
| `dropout-prob`                 |            0 |            0 |
| `embedding-init-range`         |         0.02 |         0.02 |
| `norm-eps`                     |      1.0e-05 |      1.0e-05 |
| `hidden-act`                   |       swiglu |       swiglu |
| `pre-activation-layer-norm`    |         true |         true |
| `layer-norm-after-embedding`   |        false |        false |
| `layer-norm-before-last-layer` |         true |         true |
| `rms-norm`                     |         true |         true |
| `ffn-bias`                     |        false |        false |
| `attn-bias`                    |        false |        false |

### Training Description

|                     |     Stage 1 |                      Stage 2 |
| :------------------ | ----------: | ---------------------------: |
| `dataset`           |      UR100P |                       UR100P |
| `max-steps`         |     1000000 | 25000 (120M) or 50000 (350M) |
| `max-length`        |         512 |                         2048 |
| `optimizer`         |       adamw |                        adamw |
| `lr`                |       0.001 |                        0.001 |
| `betas`             | (0.9, 0.95) |                  (0.9, 0.95) |
| `eps`               |     1.0e-08 |                      1.0e-08 |
| `weight-decay`      |        0.01 |                         0.01 |
| `scheduler`         | cosinedecay |                         none |
| `warmup-steps`      |       1,000 |                         none |
| `final-step`        |     900,000 |                         none |
| `gradient-clipping` |         1.0 |                          1.0 |
| `tf32`              |        true |                         true |
| `mixed-precision`   |        bf16 |                         bf16 |
| `padding`           |  max-length |                   max-length |
| `random-truncate`   |        true |                         true |
| `mask-probability`  |        0.15 |                         0.15 |
| `total-batch-size`  |        4096 |                         4096 |
| `deepspeed`         |        true |                         true |
| `zero-stage`        |           3 |                            3 |

## Get Started

```python
from datasets import load_dataset
from transformers import AutoModel, AutoTokenizer

# Load AMPLIFY and its tokenizer (the model code lives in the repository,
# hence trust_remote_code=True)
model = AutoModel.from_pretrained("drug-discovery/AMPLIFY_350M", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("drug-discovery/AMPLIFY_350M", trust_remote_code=True)

# Move the model to GPU (required due to Flash Attention)
model = model.to("cuda")

# Load the UniProt validation set (stored under the "test" split)
dataset = load_dataset("drug-discovery/UR100P", data_dir="UniProt", split="test")

for sample in dataset:
    # Protein
    print("Sample: ", sample["name"], sample["sequence"])

    # Tokenize the protein
    input_ids = tokenizer.encode(sample["sequence"], return_tensors="pt")
    print("Input: ", input_ids)

    # Move to the GPU and make a prediction
    input_ids = input_ids.to("cuda")
    output = model(input_ids)
    print("Output: ", output)

    # Only process the first protein
    break
```

## Citations

If you find the models useful in your research, we ask that you cite the paper:

```bibtex
@article{Fournier2024.09.23.614603,
  title        = {Protein Language Models: Is Scaling Necessary?},
  author       = {Fournier, Quentin and Vernon, Robert M. and van der Sloot, Almer and Schulz, Benjamin and Chandar, Sarath and Langmead, Christopher James},
  year         = {2024},
  journal      = {bioRxiv},
  publisher    = {Cold Spring Harbor Laboratory},
  doi          = {10.1101/2024.09.23.614603},
  url          = {https://www.biorxiv.org/content/early/2024/09/23/2024.09.23.614603},
  elocation-id = {2024.09.23.614603},
  eprint       = {https://www.biorxiv.org/content/early/2024/09/23/2024.09.23.614603.full.pdf}
}
```
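
## Worked Example: Reading the Configuration Tables

The model and training tables earlier in this card imply a few quantities that are not listed directly. The sketch below derives the per-head width and an upper bound on the number of token positions processed in each training stage. It is an illustration, not part of the AMPLIFY codebase: the names `MODELS`, `head_dim`, `max_positions`, and `summarize` are hypothetical, and it assumes that `total-batch-size` counts sequences per optimizer step and that the hidden size is split evenly across attention heads.

```python
# Values copied from the Model Description and Training Description tables.
MODELS = {
    "AMPLIFY_120M": {"hidden_size": 640, "num_attention_heads": 10, "stage2_steps": 25_000},
    "AMPLIFY_350M": {"hidden_size": 960, "num_attention_heads": 15, "stage2_steps": 50_000},
}
STAGE1_STEPS = 1_000_000
STAGE1_MAX_LENGTH = 512
STAGE2_MAX_LENGTH = 2048
# ASSUMPTION: total-batch-size is counted in sequences per optimizer step.
TOTAL_BATCH_SIZE = 4096


def head_dim(hidden_size: int, num_heads: int) -> int:
    """Per-head width, assuming the hidden size is split evenly across heads."""
    if hidden_size % num_heads != 0:
        raise ValueError("hidden size must be divisible by the number of heads")
    return hidden_size // num_heads


def max_positions(steps: int, batch_size: int, max_length: int) -> int:
    """Upper bound on token positions processed, with padding included."""
    return steps * batch_size * max_length


def summarize() -> dict:
    """Collect the derived quantities for each model size."""
    summary = {}
    for name, cfg in MODELS.items():
        summary[name] = {
            "head_dim": head_dim(cfg["hidden_size"], cfg["num_attention_heads"]),
            "stage1_positions": max_positions(
                STAGE1_STEPS, TOTAL_BATCH_SIZE, STAGE1_MAX_LENGTH
            ),
            "stage2_positions": max_positions(
                cfg["stage2_steps"], TOTAL_BATCH_SIZE, STAGE2_MAX_LENGTH
            ),
        }
    return summary


if __name__ == "__main__":
    for name, stats in summarize().items():
        print(name, stats)
```

Under these assumptions both model sizes use 64-wide attention heads. The position counts are upper bounds rather than counts of real residues, because sequences are padded to `max-length` and shorter proteins therefore contribute padding positions.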