---
license: apache-2.0
base_model: TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T
datasets:
- cerebras/SlimPajama-627B
- bigcode/starcoderdata
- monsoon-nlp/greenbeing-proteins
language:
- en
---

# tinyllama-proteinpretrain-quinoa

Full-model finetuning of TinyLlama-1.1B on the "research" split (quinoa protein sequences) of the GreenBeing-Proteins dataset.

Notes: pretraining only on sequences leads the model to generate only protein sequences, eventually degenerating into repeats such as VVVV or KKKK.

- This model may be replaced with mixed training (bio/chem text and protein sequences).
- This model might need "biotokens" to represent the amino acids instead of using the existing tokenizer.

More details TBD
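The "biotokens" idea mentioned above can be sketched as a minimal per-residue vocabulary: one token id per canonical amino acid, instead of letting the base tokenizer split sequences into arbitrary BPE pieces. Everything here (`AA_VOCAB`, `encode`, `decode`, the special-token layout) is hypothetical, not part of any released code:

```python
# Illustrative "biotoken" vocabulary: one id per canonical amino acid,
# plus a few special tokens. All names are assumptions for this sketch.

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 canonical residues

# Reserve low ids for special tokens, then one id per amino acid.
SPECIALS = {"<pad>": 0, "<bos>": 1, "<eos>": 2}
AA_VOCAB = {**SPECIALS, **{aa: i + len(SPECIALS) for i, aa in enumerate(AMINO_ACIDS)}}
ID_TO_TOKEN = {i: t for t, i in AA_VOCAB.items()}

def encode(seq: str) -> list[int]:
    """Map a protein sequence to token ids, exactly one id per residue."""
    return [AA_VOCAB["<bos>"]] + [AA_VOCAB[aa] for aa in seq] + [AA_VOCAB["<eos>"]]

def decode(ids: list[int]) -> str:
    """Map token ids back to a sequence, dropping special tokens."""
    return "".join(ID_TO_TOKEN[i] for i in ids if i not in SPECIALS.values())

print(encode("MKV"))          # [1, 13, 11, 20, 2]
print(decode(encode("MKV")))  # MKV
```

In practice these tokens could be added to the existing TinyLlama tokenizer (e.g. via `tokenizer.add_tokens(...)` with a matching embedding resize) rather than built from scratch; the sketch only shows the per-residue mapping itself.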