
tinyllama-mixpretrain-quinoa-sciphi

TinyLlama model with continued pretraining / full-model finetuning on amino acid sequences and synthetic science textbooks.

The goal is to create a model that understands both amino acid sequences and natural-language descriptions or Q&A.
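
A minimal inference sketch; the prompt and generation settings below are illustrative only, not taken from the original card:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "monsoon-nlp/tinyllama-mixpretrain-quinoa-sciphi"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Illustrative prompt mixing a natural-language question with an amino acid fragment.
prompt = "Q: Describe this protein fragment: MAFTKQ"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```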

The training data was a shuffled mix of the following sources (see the sketch after the list):

  • 50% amino acid sequences / proteins from the GreenBeing research dataset (mostly quinoa)
  • 50% textbook content from the SciPhi training dataset
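
A minimal sketch of this 50/50 mix using the 🤗 Datasets `interleave_datasets` helper; the dataset IDs, column names, and seed below are placeholders rather than the exact values from this run:

```python
from datasets import load_dataset, interleave_datasets

# Placeholder dataset IDs and column names -- substitute the actual GreenBeing
# and SciPhi dataset names/columns used for this run.
proteins = load_dataset("monsoon-nlp/greenbeing-research", split="train")
textbooks = load_dataset("SciPhi/textbooks-are-all-you-need-lite", split="train")

# Normalize both sources to a single "text" column so they can be interleaved.
proteins = proteins.rename_column("sequence", "text").select_columns(["text"])
textbooks = textbooks.select_columns(["text"])

# 50/50 interleave so the model sees proteins and textbook prose in equal measure.
mixed = interleave_datasets([proteins, textbooks], probabilities=[0.5, 0.5], seed=42)
```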

Training procedure

Colab notebook: https://colab.research.google.com/drive/1dah43byt-T0HQC9eCigNbxSZ8aHu6s-W?usp=sharing

To fit on an L4 GPU, it was necessary to use max_length=400 and train_batch_size=1.

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 5e-05
  • train_batch_size: 1
  • eval_batch_size: 8
  • seed: 42
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • training_steps: 15000
  • mixed_precision_training: Native AMP
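
These hyperparameters map onto a standard `transformers` Trainer setup roughly as sketched below; the base checkpoint, output directory, and data-collator choice are assumptions, and `mixed` refers to the interleaved dataset from the earlier sketch:

```python
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

base = "TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T"  # assumed base checkpoint
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

def tokenize(batch):
    # max_length=400 keeps activations small enough to fit on an L4 GPU.
    return tokenizer(batch["text"], truncation=True, max_length=400)

tokenized = mixed.map(tokenize, batched=True, remove_columns=["text"])

# Trainer's default AdamW optimizer uses betas=(0.9, 0.999) and epsilon=1e-08,
# matching the optimizer settings listed above.
args = TrainingArguments(
    output_dir="tinyllama-mixpretrain-quinoa-sciphi",
    learning_rate=5e-5,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=8,
    seed=42,
    lr_scheduler_type="linear",
    max_steps=15_000,
    fp16=True,  # Native AMP mixed precision
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    tokenizer=tokenizer,
)
trainer.train()
```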

Framework versions

  • Transformers 4.38.2
  • Pytorch 2.2.1+cu121
  • Datasets 2.19.0
  • Tokenizers 0.15.2
