### Model Description

This model card describes `protgpt2-distilled-tiny`, a distilled version of ProtGPT2. The model was produced by knowledge distillation (Hinton et al., 2015) from the larger teacher model to a smaller, more efficient student. Training combines a "soft" loss (knowledge-distillation loss against the teacher's softened output distribution) with a "hard" loss (cross-entropy against the ground-truth tokens), so that the student both generalizes like its teacher and retains direct prediction ability.

### Technical Details

**Distillation Parameters:**
- **Temperature (T):** 10
- **Alpha (α):** 0.1

**Model Architecture:**
- **Number of Layers:** 4
- **Number of Attention Heads:** 4
- **Embedding Size:** 512

**Dataset Used:**
- The model was distilled on a subset of the evaluation dataset provided by `nferruz/UR50_2021_04`.

**Loss Formulation:**

Let \( s \) and \( t \) denote the student and teacher logits, \( T \) the temperature, and \( y \) the one-hot ground-truth distribution.

- **Soft loss:** \( L_{\text{soft}} = \mathrm{KL}\left(\mathrm{softmax}\left(\frac{t}{T}\right) \,\middle\|\, \mathrm{softmax}\left(\frac{s}{T}\right)\right) \)
- **Hard loss:** \( L_{\text{hard}} = -\sum_{i} y_i \log \mathrm{softmax}(s)_i \)
- **Combined loss:** \( L = \alpha L_{\text{hard}} + (1 - \alpha) L_{\text{soft}} \)

A minimal PyTorch sketch of this combined objective is given at the end of this card.

### Performance

`protgpt2-distilled-tiny` runs up to six times faster at inference than the pretrained ProtGPT2 while maintaining comparable perplexity.

![Running time](https://images.mobilism.org/?di=Y7IS2NH7)

### Use Cases

The distilled model is useful wherever speed and efficiency matter and a modest accuracy trade-off is acceptable:

1. **Real-time applications:** Protein sequence analysis in settings such as clinical diagnostic tools or live biological data processing, where reduced inference time is critical.
2. **Mobile and edge devices:** The smaller footprint suits deployment in resource-constrained environments such as mobile devices or edge computing platforms, where compute and memory are limited.
3. **Ensemble models:** The lower per-model resource cost allows more models to run in parallel, so the distilled model can serve as a component in an ensemble whose aggregated predictions improve accuracy.

A minimal loading-and-generation sketch is also given at the end of this card.

### References

- Hinton, G., Vinyals, O., & Dean, J. (2015). Distilling the Knowledge in a Neural Network. arXiv:1503.02531.
- Ferruz, N., Schmidt, S., & Höcker, B. (2022). ProtGPT2 is a deep unsupervised language model for protein design. Nature Communications, 13, 4348. [PMC9329459](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9329459/)

This model card outlines the distilled model's creation, its underlying methodology, performance, and potential applications, to aid users in deciding how best to apply it in their own contexts.
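### Example: Distillation Loss (PyTorch)

The sketch below implements the combined objective exactly as written in the loss formulation above; it deliberately omits the optional \( T^2 \) gradient-rescaling factor sometimes added following Hinton et al. (2015). The function and variable names are illustrative, not part of any released training code.

```python
import torch
import torch.nn.functional as F


def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      labels: torch.Tensor,
                      T: float = 10.0,
                      alpha: float = 0.1) -> torch.Tensor:
    """Combined loss L = alpha * L_hard + (1 - alpha) * L_soft.

    student_logits / teacher_logits: (batch, seq_len, vocab)
    labels: (batch, seq_len) ground-truth token ids
    """
    vocab = student_logits.size(-1)

    # Soft loss: KL(softmax(t/T) || softmax(s/T)).
    # F.kl_div takes log-probabilities as input and probabilities as
    # target, which computes exactly KL(target || input).
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    )

    # Hard loss: cross-entropy of the student against the true tokens.
    hard = F.cross_entropy(
        student_logits.reshape(-1, vocab),
        labels.reshape(-1),
    )

    return alpha * hard + (1 - alpha) * soft
```

With the card's settings (T = 10, α = 0.1), the hard cross-entropy contributes 10% of the gradient signal and the softened teacher distribution the remaining 90%.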
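### Example: Loading and Generating Sequences

A minimal sketch for loading the distilled model with the Hugging Face `transformers` pipeline API. The repository id below is a placeholder (substitute the actual hub id under which `protgpt2-distilled-tiny` is published), and the sampling settings are carried over from the original ProtGPT2 card rather than tuned for the distilled model.

```python
from transformers import pipeline

# Placeholder repository id; substitute the actual hub id for
# protgpt2-distilled-tiny.
generator = pipeline("text-generation", model="protgpt2-distilled-tiny")

# Sampling settings (top_k=950, repetition_penalty=1.2) are assumptions
# carried over from the original ProtGPT2 card; treat them as a starting
# point, not tuned values for the distilled model.
sequences = generator(
    "<|endoftext|>",
    max_length=100,
    do_sample=True,
    top_k=950,
    repetition_penalty=1.2,
    num_return_sequences=5,
    eos_token_id=0,
)

for seq in sequences:
    print(seq["generated_text"])
```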