---
license: apache-2.0
base_model: TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T
datasets:
  - cerebras/SlimPajama-627B
  - bigcode/starcoderdata
  - monsoon-nlp/greenbeing-proteins
language:
  - en
---

# tinyllama-proteinpretrain-quinoa

Full-model finetuning of TinyLlama-1.1B on the "research" split (quinoa protein sequences) of the GreenBeing-Proteins dataset.

Notes: continued pretraining only on protein sequences leads the model to generate only protein sequences, eventually degenerating into repeats such as VVVV or KKKK.

  • This model may be replaced with mixed training (bio/chem text and protein).
  • This model might need "biotokens" to represent the amino acids instead of using the existing tokenizer.
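As a sanity check on the degeneration described above, here is a minimal sketch (the helper name is hypothetical, not part of this repo) for flagging generated sequences that collapse into single-residue runs like VVVV or KKKK:

```python
import re

# Hypothetical helper: flags degenerate amino-acid repeats (e.g. "VVVV",
# "KKKK") in a generated protein sequence.
def has_degenerate_repeat(sequence: str, min_run: int = 4) -> bool:
    """Return True if any single residue repeats min_run or more times in a row."""
    # (.)\1{min_run-1,} matches one character followed by itself min_run-1+ times.
    return re.search(r"(.)\1{" + str(min_run - 1) + r",}", sequence) is not None

print(has_degenerate_repeat("MKTAYIAKQR"))    # varied sequence
print(has_degenerate_repeat("MKTAVVVVVVKQ"))  # collapses into a V run
```

A check like this could be used to filter or early-stop sampling while experimenting with mixed training or biotokens.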

More details TBD