---
license: apache-2.0
base_model: TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T
datasets:
- cerebras/SlimPajama-627B
- bigcode/starcoderdata
- monsoon-nlp/greenbeing-proteins
language:
- en
---
# tinyllama-proteinpretrain-quinoa
Full-model finetuning of TinyLlama-1.1B on the "research" split (quinoa
protein sequences) of the GreenBeing-Proteins dataset.
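For reference, a minimal sketch of loading that split with the `datasets` library; the split name `research` comes from the description above, and the printed column layout is whatever the dataset actually provides:

```python
from datasets import load_dataset

# Load the "research" split (quinoa protein sequences) of GreenBeing-Proteins.
dataset = load_dataset("monsoon-nlp/greenbeing-proteins", split="research")

# Inspect the available columns and a sample row before building a training loop.
print(dataset.column_names)
print(dataset[0])
```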
Notes: pretraining only on sequences leads the model to generate only protein sequences, eventually repeating runs such as VVVV or KKKK (see the generation sketch after the list below).
- This model may be replaced with one trained on mixed data (bio/chem text and protein sequences).
- This model might need dedicated "biotokens" to represent the amino acids instead of the existing tokenizer (a rough sketch of that idea appears at the end of this card).
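As a rough illustration of the behavior described in the notes, a minimal generation sketch using the standard `transformers` API; the repo id `monsoon-nlp/tinyllama-proteinpretrain-quinoa` is inferred from the model name above, and the prompt is an arbitrary partial amino-acid sequence, not one from the dataset:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "monsoon-nlp/tinyllama-proteinpretrain-quinoa"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)

# An arbitrary partial amino-acid sequence as the prompt.
prompt = "MASNKLVLFA"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=60, do_sample=True, temperature=0.8)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# With sequence-only pretraining, continuations tend to be protein-like strings,
# eventually degenerating into repeated residues such as VVVV or KKKK.
```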
More details TBD
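On the "biotokens" idea mentioned above, a rough sketch of what dedicated amino-acid tokens could look like; the token format and the use of `add_tokens`/`resize_token_embeddings` are assumptions for illustration, not the planned approach:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical "biotoken" setup: one dedicated token per amino acid,
# rather than relying on the base model's subword tokenizer.
amino_acids = list("ACDEFGHIKLMNPQRSTVWY")
biotokens = [f"<aa_{aa}>" for aa in amino_acids]

base = "TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

num_added = tokenizer.add_tokens(biotokens)
model.resize_token_embeddings(len(tokenizer))  # grow the embedding matrix for the new tokens
print(f"Added {num_added} amino-acid tokens")
```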