Help with training dataset output and validation .txt files

#33
by miraclewizard - opened

Hello! This is a great resource and I'm really looking forward to using it. I want to fine-tune on a relatively small dataset (<677 lines, so ~10k tokens). I followed the directions and created a training.txt file with <|endoftext|> as the header for each amino-acid sequence, and saved ~10% of it (70 lines) to a separate validation.txt. My command:

python run_clm.py --model_name_or_path nferruz/ProtGPT2 --train_file training.txt --validation_file validation.txt --tokenizer_name nferruz/ProtGPT2 --output_dir /home/grant/test
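
For reference, here is a minimal sketch of the format I followed, per the ProtGPT2 model card: FASTA-like 60-character lines, with each FASTA header replaced by <|endoftext|>. The sequences below are placeholders, not my actual data:

```
<|endoftext|>
MGEAMGLTQPAVSRAVARLEERVGIRIFNRTARAITLTDEGRRFYEAVAPLLAGIEMHGY
RVNVEGVAQLLELYARDILAEGRLVQLLPEWAD
<|endoftext|>
MTHQRIAALVDDAVLKGEGRPLSREELAERVGVSQSAVSQFIKGLEEKGYVRREPNPNDR
RSVLVRLTEKGREVFDEHAHLEHE
```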

The command runs without any errors, and as far as I can tell there is no error.txt output (yay!). However, the output directory contains only a README.md, reproduced below. The results field is clearly empty ([]), so I'm assuming the training dataset is too small (I've seen other posts where people used >500 sequences). Am I right in that assumption? Is this the end of the road?
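
In case it helps to diagnose, this is roughly how I'd sanity-check the saved checkpoint, following the generation example on the ProtGPT2 model card (the path is just my local --output_dir):

```python
from transformers import pipeline

# Sketch: load the fine-tuned checkpoint from my --output_dir and sample
# a few sequences, using the sampling settings suggested on the ProtGPT2
# model card.
protgpt2 = pipeline("text-generation", model="/home/grant/test")

sequences = protgpt2(
    "<|endoftext|>",
    max_length=100,
    do_sample=True,
    top_k=950,
    repetition_penalty=1.2,
    num_return_sequences=5,
    eos_token_id=0,
)
for s in sequences:
    print(s["generated_text"])
```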

Another way to answer this would be an example training.txt (and accompanying validation.txt) that I could download, so I know what a "good" validation run looks like.
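
And for anyone reproducing my setup, here is a sketch of one way to produce a ~90/10 split like mine (all_sequences.txt is a hypothetical combined input file; I'm assuming one <|endoftext|> header per record):

```python
import random

# Read a combined corpus in which every record starts with an
# <|endoftext|> header (the format described on the ProtGPT2 card).
with open("all_sequences.txt") as f:
    body = f.read()

# Splitting on the header drops it, so re-attach it to each record.
records = ["<|endoftext|>" + r for r in body.split("<|endoftext|>") if r.strip()]

random.seed(42)
random.shuffle(records)
n_val = max(1, len(records) // 10)  # hold out ~10% for validation

with open("validation.txt", "w") as f:
    f.writelines(records[:n_val])
with open("training.txt", "w") as f:
    f.writelines(records[n_val:])
```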

ANY help would be appreciated. THANKS!

My output (README.md):

---
license: apache-2.0
base_model: nferruz/ProtGPT2
tags:
- generated_from_trainer
model-index:
- name: test
  results: []
---

test

This model is a fine-tuned version of nferruz/ProtGPT2 on an unknown dataset.

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 5e-05
  • train_batch_size: 8
  • eval_batch_size: 8
  • seed: 42
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • num_epochs: 3.0

Framework versions

  • Transformers 4.35.0.dev0
  • Pytorch 2.1.0+cu118
  • Datasets 2.14.5
  • Tokenizers 0.14.1
