---
license: apache-2.0
base_model: TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T
datasets:
- cerebras/SlimPajama-627B
- bigcode/starcoderdata
- monsoon-nlp/greenbeing-proteins
language:
- en
---
|
|
|
# tinyllama-proteinpretrain-quinoa |
|
|
|
Full-model finetuning of TinyLlama-1.1B on the "research" split (quinoa
protein sequences) of the GreenBeing-Proteins dataset.
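A minimal generation sketch with 🤗 Transformers; the Hub repo id below is an assumption inferred from this card's title, so adjust it to the actual checkpoint location:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed Hub repo id (inferred from this card's title) -- adjust if needed.
repo_id = "monsoon-nlp/tinyllama-proteinpretrain-quinoa"

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)

# Prompt with the start of a protein sequence; the model continues it.
inputs = tokenizer("MAKTN", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30, do_sample=True, top_p=0.95)
text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(text)
```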
|
|
|
Notes:

- Pretraining only on protein sequences leads the model to generate only protein sequences, eventually collapsing into repeats such as VVVV or KKKK.

- This model may be replaced with one trained on mixed data (bio/chem text and protein sequences).

- This model might need "biotokens" to represent the amino acids instead of reusing the existing tokenizer.
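One way to picture the "biotokens" idea above is a per-residue vocabulary: each amino acid maps to exactly one token id, instead of a subword tokenizer splitting sequences into arbitrary multi-character pieces. A minimal sketch (the 20-letter vocabulary and id assignments are assumptions, not this model's actual scheme):

```python
# One-letter codes for the 20 standard amino acids (an assumption:
# the card does not specify the intended biotoken vocabulary).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Reserve id 0 for an unknown residue.
UNK_ID = 0
BIOTOKEN_IDS = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}
ID_TO_AA = {i: aa for aa, i in BIOTOKEN_IDS.items()}

def encode(seq: str) -> list[int]:
    """Map each residue to a single token id."""
    return [BIOTOKEN_IDS.get(aa, UNK_ID) for aa in seq.upper()]

def decode(ids: list[int]) -> str:
    """Inverse mapping; unknown ids become 'X'."""
    return "".join(ID_TO_AA.get(i, "X") for i in ids)

print(encode("MKT"))          # → [11, 9, 17], one id per residue
print(decode(encode("MKT")))  # → MKT
```

With one id per residue, the model's distribution is directly over amino acids, which also makes degenerate repeats like VVVV easy to detect and penalize at the token level.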
|
|
|
More details TBD |
|
|