nvidia/AMPLIFY_120M

@@ -14,67 +14,104 @@ tags:
 > library. Slight numerical differences may be observed between the original model and the optimized
 > model. For instructions on how to install TransformerEngine, please refer to the
 > [official documentation](https://github.com/NVIDIA/TransformerEngine?tab=readme-ov-file#installation).
->
-> The original xformers-based models are available at [chandar-lab/AMPLIFY](https://huggingface.co/chandar-lab/AMPLIFY_350M).
-## AMPLIFY
-AMPLIFY is an efficient, state-of-the-art protein language model pre-trained using masked language modeling on UniRef100, OAS, and SCOP ([UR100P](https://huggingface.co/datasets/chandar-lab/UR100P)). AMPLIFY can generate residue and protein embeddings, suggest mutations, differentiate disordered proteins from non-protein sequences, and much more. AMPLIFY is available in two sizes, 120M and 350M parameters, with the `_base` models not extended beyond 512 residues (Stage 1). The model architecture and pre-training procedure are detailed below. For more details, please refer to the [accompanying paper](https://www.biorxiv.org/content/10.1101/2024.09.23.614603v1).
-- [`AMPLIFY_350M`](https://huggingface.co/nvidia/AMPLIFY_350M)
-- [`AMPLIFY_350M_base`](https://huggingface.co/chandar-lab/AMPLIFY_350M_base)
-- [`AMPLIFY_120M`](https://huggingface.co/nvidia/AMPLIFY_120M)
-- [`AMPLIFY_120M_base`](https://huggingface.co/chandar-lab/AMPLIFY_120M_base)
-### Model Description
-|                                | AMPLIFY 120M | AMPLIFY 350M |
-| :----------------------------- | -----------: | -----------: |
-| `hidden-size`                  |          640 |          960 |
-| `num-hidden-layers`            |           24 |           32 |
-| `num-attention-heads`          |           10 |           15 |
-| `intermediate-size`            |         2560 |         3840 |
-| `max-position-embeddings`      |         2048 |         2048 |
-| `vocab-size`                   |           27 |           27 |
-| `rope-theta`                   |        10000 |        10000 |
-| `dropout-prob`                 |            0 |            0 |
-| `embedding-init-range`         |         0.02 |         0.02 |
-| `norm-eps`                     |      1.0e-05 |      1.0e-05 |
-| `hidden-act`                   |       swiglu |       swiglu |
-| `pre-activation-layer-norm`    |         true |         true |
-| `layer-norm-after-embedding`   |        false |        false |
-| `layer-norm-before-last-layer` |         true |         true |
-| `rms-norm`                     |         true |         true |
-| `ffn-bias`                     |        false |        false |
-| `attn-bias`                    |        false |        false |
-### Training Description
-|                     |     Stage 1 |                      Stage 2 |
-| :------------------ | ----------: | ---------------------------: |
-| `dataset`           |      UR100P |                       UR100P |
-| `max-steps`         |     1000000 | 25000 (120M) or 50000 (350M) |
-| `max-length`        |         512 |                         2048 |
-| `optimizer`         |       adamw |                        adamw |
-| `lr`                |       0.001 |                       0.0001 |
-| `betas`             | (0.9, 0.95) |                  (0.9, 0.95) |
-| `eps`               |     1.0e-08 |                      1.0e-08 |
-| `weight-decay`      |        0.01 |                         0.01 |
-| `scheduler`         | cosinedecay |                         none |
-| `warmup-steps`      |       1,000 |                         none |
-| `final-step`        |     900,000 |                         none |
-| `warmup-steps`      |       1,000 |                         none |
-| `gradient-clipping` |         1.0 |                          1.0 |
-| `tf32`              |        true |                         true |
-| `mixed-precision`   |        bf16 |                         bf16 |
-| `padding`           |  max-length |                   max-length |
-| `random-truncate`   |        true |                         true |
-| `mask-probability`  |        0.15 |                         0.15 |
-| `total-batch-size`  |        4096 |                         4096 |
-| `deepspeed`         |        true |                         true |
-| `zero-stage`        |           3 |                            3 |
-## Get Started
 ```python
 from transformers import AutoModel
@@ -82,8 +119,10 @@ from transformers import AutoTokenizer
 from datasets import load_dataset
 # Load AMPLIFY and tokenizer
-model = AutoModel.from_pretrained("nvidia/AMPLIFY_350M", trust_remote_code=True)
-tokenizer = AutoTokenizer.from_pretrained("nvidia/AMPLIFY_350M", trust_remote_code=True)
 # Move the model to GPU (required due to Flash Attention)
 model = model.to("cuda")
@@ -107,20 +146,164 @@ for sample in dataset:
     break
 ```
-## Citations
-If you find the models useful in your research, we ask that you cite the paper:
-```bibtex
-@article{Fournier2024.09.23.614603,
-	title        = {Protein Language Models: Is Scaling Necessary?},
-	author       = {Fournier, Quentin and Vernon, Robert M. and van der Sloot, Almer and Schulz, Benjamin and Chandar, Sarath and Langmead, Christopher James},
-	year         = {2024},
-	journal      = {bioRxiv},
-	publisher    = {Cold Spring Harbor Laboratory},
-	doi          = {10.1101/2024.09.23.614603},
-	url          = {https://www.biorxiv.org/content/early/2024/09/23/2024.09.23.614603},
-	elocation-id = {2024.09.23.614603},
-	eprint       = {https://www.biorxiv.org/content/early/2024/09/23/2024.09.23.614603.full.pdf}
-}
-```

 > library. Slight numerical differences may be observed between the original model and the optimized
 > model. For instructions on how to install TransformerEngine, please refer to the
 > [official documentation](https://github.com/NVIDIA/TransformerEngine?tab=readme-ov-file#installation).
+# AMPLIFY (TransformerEngine-Optimized) Overview
+## Description:
+AMPLIFY is an efficient, state-of-the-art protein language model (pLM). AMPLIFY can generate residue and protein
+embeddings, suggest mutations, and differentiate disordered proteins from non-protein sequences. AMPLIFY is available in two
+sizes, 120M and 350M parameters.
+This version of the AMPLIFY model is optimized with NVIDIA's
+[TransformerEngine](https://github.com/NVIDIA/TransformerEngine) library. It is based on the original AMPLIFY model from
+Chandar Research Lab (CRL), and (within numerical precision) has identical weights and outputs.
+This model is ready for commercial/non-commercial use.
+## Third-Party Community Consideration
+This model is not owned or developed by NVIDIA. This model has been developed and built to a third-party's requirements
+for this application and use case; see link to Non-NVIDIA [AMPLIFY Model
+Card](https://huggingface.co/chandar-lab/AMPLIFY_120M).
+### License/Terms of Use:
+AMPLIFY is provided under the [MIT license](https://github.com/chandar-lab/AMPLIFY/blob/main/LICENSE).
+### Deployment Geography:
+Global
+### Use Case:
+Protein design, mutation prediction, and function analysis.
+### Release Date:
+Hugging Face 06/12/2025 via [https://huggingface.co/nvidia/AMPLIFY_120M](https://huggingface.co/nvidia/AMPLIFY_120M)
+## References:
+- [Protein Language Models: Is Scaling
+  Necessary?](https://www.biorxiv.org/content/biorxiv/early/2024/09/23/2024.09.23.614603.full.pdf) - detailed
+  information on the model architecture and training data.
+## Model Architecture:
+**Architecture Type:** Transformer <br>
+**Network Architecture:** ESM-2 <br>
+**This model was developed based on:** [AMPLIFY](https://huggingface.co/chandar-lab/AMPLIFY_120M) <br>
+**Number of model parameters:** 1.2 x 10^8
+## Input:
+**Input Type:** Text (Protein Sequences) <br>
+**Input Format:** String <br>
+**Input Parameters:** One-Dimensional (1D) <br>
+**Other Properties Related to Input:** Protein sequence represented as a string of canonical amino acids. The maximum
+context length is 2048 residues.
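The input constraints above can be checked before tokenization. The following is a minimal sketch, not part of the model's API: `validate_sequence` and both constants are hypothetical names, the canonical alphabet is assumed to be the 20 standard one-letter amino acid codes, and the 2048 limit is applied to raw residues only (whether special tokens count toward the limit is not stated in this card).

```python
# Hypothetical pre-tokenization check; names and alphabet are assumptions.
CANONICAL_AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")
MAX_RESIDUES = 2048  # maximum context length stated in this card


def validate_sequence(sequence: str, max_residues: int = MAX_RESIDUES) -> str:
    """Return the upper-cased, stripped sequence, or raise ValueError."""
    cleaned = sequence.strip().upper()
    if not cleaned:
        raise ValueError("sequence is empty")
    # Collect every character that is not one of the 20 canonical codes.
    invalid = sorted(set(cleaned) - CANONICAL_AMINO_ACIDS)
    if invalid:
        raise ValueError(f"non-canonical residues: {''.join(invalid)}")
    if len(cleaned) > max_residues:
        raise ValueError(
            f"sequence has {len(cleaned)} residues; limit is {max_residues}"
        )
    return cleaned
```

Rejecting over-length input explicitly, rather than truncating it silently, keeps the caller aware that part of the protein would otherwise be dropped.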
+## Output:
+**Output Type:** Embeddings (Amino acid and sequence-level) <br>
+**Output Format:** Numeric vector <br>
+**Output Parameters:** One-Dimensional (1D) <br>
+**Other Properties Related to Output:** Numeric vector with floating-point values corresponding to an embedding for each
+amino acid in the input protein sequence.
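This card does not say how a sequence-level embedding is derived from the per-residue vectors. One common convention is masked mean pooling, sketched below in plain Python under that assumption. `mean_pool` is a hypothetical helper, and the mask is assumed to mark which positions (for example, non-padding tokens) should contribute.

```python
from typing import List, Sequence


def mean_pool(
    residue_embeddings: Sequence[Sequence[float]], mask: Sequence[int]
) -> List[float]:
    """Average the per-residue vectors whose mask entry is truthy."""
    if len(residue_embeddings) != len(mask):
        raise ValueError("embeddings and mask must have the same length")
    # Keep only the positions that the mask marks as real residues.
    kept = [vec for vec, keep in zip(residue_embeddings, mask) if keep]
    if not kept:
        raise ValueError("mask excludes every position")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


# Three positions with 2-dimensional embeddings; the last one is padding.
example = mean_pool([[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]], [1, 1, 0])
```

In practice the same reduction would usually be done on tensors, multiplying the hidden states by the attention mask and dividing by the mask sum.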
+Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware
+(e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times
+compared to CPU-only solutions.
+## Software Integration:
+**Runtime Engines:**
+- Hugging Face Transformers
+**Supported Hardware Microarchitecture Compatibility:**
+- NVIDIA Ampere
+- NVIDIA Blackwell
+- NVIDIA Hopper
+**Preferred Operating System(s):**
+- Linux
+The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific
+data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at
+both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure
+compliance with safety and ethical standards before deployment.
+## Model and checkpoint versions are noted below:
+- [AMPLIFY_350M](https://huggingface.co/nvidia/AMPLIFY_350M) <br>
+- [AMPLIFY_120M](https://huggingface.co/nvidia/AMPLIFY_120M) <br>
+**Get Started**
 ```python
 from transformers import AutoModel
 from transformers import AutoTokenizer
 from datasets import load_dataset
 # Load AMPLIFY and tokenizer
+model = AutoModel.from_pretrained("nvidia/AMPLIFY_120M", trust_remote_code=True)
+tokenizer = AutoTokenizer.from_pretrained(
+    "nvidia/AMPLIFY_120M", trust_remote_code=True
+)
 # Move the model to GPU (required due to Flash Attention)
 model = model.to("cuda")
     break
 ```
+## Training and Evaluation Datasets:
+## Training Datasets:
+**Link:** [UniRef100](https://www.uniprot.org/uniref?query=identity%3A1.0)
+**Data Modality:**
+- Text (Protein Sequences)
+**Text Training Data Size:**
+- 1 Billion to 10 Trillion Tokens
+**Data Collection Method:**
+- Human
+**Labeling Method:**
+- N/A
+**Properties (Quantity, Dataset Descriptions, Sensor(s)):** UniRef100 contains all records in the UniProt Knowledgebase
+and selected UniParc records. In UniRef100, identical sequences and subfragments are placed into a single cluster using
+the CD-HIT algorithm. The longest members of the cluster (seed sequences) are used to generate UniRef90. However, the
+longest sequence is not always the most informative. There is often more biologically relevant information and
+annotation (name, function, cross-references) available on other cluster members. All the proteins in each cluster are
+ranked to facilitate the selection of a biologically relevant representative for the cluster.
+**Link:** [Observed Antibody Space (OAS)](https://opig.stats.ox.ac.uk/webapps/oas/downloads_paired/)
+**Data Modality:**
+- Text (Protein Sequences)
+**Text Training Data Size:**
+- 1 Billion to 10 Trillion Tokens
+**Data Collection Method:**
+- Human
+**Labeling Method:**
+- Human
+**Properties:** The Observed Antibody Space (OAS) database is a project to collect and annotate immune repertoires for
+use in large-scale analysis. It currently contains over one billion sequences, from over 80 different studies. These
+repertoires cover diverse immune states, organisms (primarily human and mouse), and individuals.
+**Link:** [Structural Classification of Proteins (SCOP)](https://www.ebi.ac.uk/pdbe/scop/download)
+**Data Modality:**
+- Text (Protein Sequences)
+**Text Training Data Size:**
+- 1 Billion to 10 Trillion Tokens
+**Data Collection Method:**
+- Hybrid: Human, Automated
+**Labeling Method:**
+- Hybrid: Human, Automated
+**Properties:** The main levels of classification in SCOP are:
+- Class: Groups proteins based on their secondary structure content, such as all-alpha, all-beta, alpha/beta, and
+  alpha+beta.
+- Fold: Proteins within the same fold have the same major secondary structures arranged in the same way with the same
+  topological connections.
+- Superfamily: Groups protein domains with a probable common evolutionary ancestry based on shared structural and
+  functional features, even if sequence similarity is low.
+- Family: Groups closely related proteins with clear evidence of a common evolutionary origin, often detectable through
+  sequence comparison methods.
+- Protein: Groups similar sequences with the same function.
+- Species: Represents a distinct protein sequence.
+## Evaluation Datasets:
+**Link:** [Continuous Automated Model EvaluatiOn (CAMEO)](https://pmc.ncbi.nlm.nih.gov/articles/PMC8673552/)
+**Benchmark Score:** LR P@L of 17.8±14.1
+**Data Collection Method:**
+- Human
+**Labeling Method:**
+- N/A
+**Properties:** The data is collected by taking sequences of protein structures that are about to be released weekly by
+the Protein Data Bank (PDB). These sequences are sent as "blind targets" to participating protein structure prediction
+servers, which then return their predictions.
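The benchmark score above is reported as LR P@L, read here as long-range precision at L for contact prediction: of the L highest-scoring residue pairs (L being the sequence length) that are far apart in sequence, the fraction that are true contacts. The sketch below assumes the common convention of a minimum sequence separation of 24 residues. The function name, the threshold default, and the choice to divide by the number of pairs actually selected are illustrative assumptions, not details taken from this card.

```python
from typing import Sequence


def long_range_precision_at_l(
    scores: Sequence[Sequence[float]],
    contacts: Sequence[Sequence[bool]],
    min_separation: int = 24,
) -> float:
    """Precision of the top-L predicted long-range contacts."""
    length = len(scores)
    # Candidate pairs (i, j) with j - i >= min_separation, upper triangle only.
    pairs = [
        (scores[i][j], i, j)
        for i in range(length)
        for j in range(i + min_separation, length)
    ]
    if not pairs:
        raise ValueError("no residue pairs satisfy the separation threshold")
    # Rank by predicted score, highest first, and keep at most L pairs.
    pairs.sort(key=lambda item: item[0], reverse=True)
    top = pairs[:length]
    hits = sum(1 for _, i, j in top if contacts[i][j])
    return hits / len(top)
```

For a toy 4-residue example with `min_separation=2`, only the pairs (0, 2), (0, 3), and (1, 3) are candidates, so the precision is the fraction of those three that are true contacts.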
+**Link:** [CASP14 (Critical Assessment of Methods of Protein Structure
+Prediction)](https://pubmed.ncbi.nlm.nih.gov/34533838/)
+**Benchmark Score:** LR P@L of 12.4±11.3
+**Data Collection Method:**
+- Human
+**Labeling Method:**
+- N/A
+**Properties:** The data for CASP14 targets is collected from protein structures that are newly solved by experimental
+structural biologists. The CASP organizers receive the amino acid sequences of these proteins before their full,
+three-dimensional structures are publicly released in the Protein Data Bank (PDB). They then provide these sequences to
+participating research groups and servers, who must submit their predicted structures within a specific time frame.
+**Link:** [CASP15 (Critical Assessment of Methods of Protein Structure
+Prediction)](https://pubmed.ncbi.nlm.nih.gov/37920879/)
+**Benchmark Score:** LR P@L of 16.9±13.2
+**Data Collection Method:**
+- Human
+**Labeling Method:**
+- N/A
+**Properties:** The data for CASP15 targets is collected from protein structures that are newly solved by experimental
+structural biologists. The CASP organizers receive the amino acid sequences of these proteins before their full,
+three-dimensional structures are publicly released in the Protein Data Bank (PDB). They then provide these sequences to
+participating research groups and servers, who must submit their predicted structures within a specific time frame.
+## Inference:
+**Acceleration Engine:**
+- Hugging Face Transformers
+**Test Hardware:**
+- A100
+- H100
+- H200
+- GB200
+## Ethical Considerations:
+NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable
+development for a wide array of AI applications. When downloaded or used in accordance with our terms of service,
+developers should work with their internal model team to ensure this model meets requirements for the relevant industry
+and use case and addresses unforeseen product misuse.
+Users are responsible for ensuring the physical properties of model-generated molecules are appropriately evaluated and
+comply with applicable safety regulations and ethical standards.
+Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns
+[here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).

config.json CHANGED

@@ -32,7 +32,7 @@
   "padded_vocab_size": 32,
   "pre_activation_layer_norm": true,
   "rms_norm": true,
-  "transformers_version": "4.56.1",
   "unk_token_id": 1,
   "vocab_path": "conf/tokenizer/amplify_vocab.txt",
   "vocab_size": 27

   "padded_vocab_size": 32,
   "pre_activation_layer_norm": true,
   "rms_norm": true,
+  "transformers_version": "4.56.2",
   "unk_token_id": 1,
   "vocab_path": "conf/tokenizer/amplify_vocab.txt",
   "vocab_size": 27