Finetuning ProtGPT2 on a set of user-defined sequences

#9 opened by SweetGAN

Hi,
Many thanks for sharing ProtGPT2.
I want to fine-tune ProtGPT2 on my own dataset. Based on your explanation here, I prepared the sequences by replacing each FASTA header with "<|endoftext|>". After making the training and validation sets, I ran this command:

    python run_clm.py --model_name_or_path nferruz/ProtGPT2 --train_file training.txt --validation_file validation.txt --tokenizer_name nferruz/ProtGPT2 --do_train --do_eval --output_dir output --learning_rate 1e-06
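For reference, after that substitution each entry in my training.txt looks roughly like this (the residues below are just placeholders, real entries are full-length sequences split over lines):

    <|endoftext|>
    MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKV
    KALPDAQFEVVHSLAKWKRQTLGQHDFSAGEGLYTHMKALRPDE
    <|endoftext|>
    ...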
My first question: just by doing this I got a CUDA out-of-memory error, and I had to trim my training set from 240 sequences down to 10. Could you please point me to documentation on how to resolve this issue?
Second, after fine-tuning with only 10 training and 4 validation sequences, there are several files in the output folder. Can you explain what merges.txt contains? It also includes the letters from "<|endoftext|>". As a next step, do I have to use pytorch_model.bin to generate new sequences (something like a zero-shot fashion)?
And finally, can you explain how to find the optimal learning rate? With only this command I got confused.
Thank you in advance for your help, and sorry if my questions are naive (first-time user).

Hi SweetGAN,

CUDA OOM errors are very common when you use large language models. What is your GPU, if I may ask?
Usually, during training we split the sequences into batches to avoid this; you then feed the largest batch size that fits on your card. What batch size did you use? If you used the command above, the default batch size is 8. You can decrease it to 4, or even 2 or 1, with the flag "--per_device_train_batch_size". This way you can still use all 240 sequences, as in the example below.
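For example, something like this (adjust the batch size to the largest value that fits on your GPU):

    python run_clm.py --model_name_or_path nferruz/ProtGPT2 --train_file training.txt --validation_file validation.txt --tokenizer_name nferruz/ProtGPT2 --do_train --do_eval --output_dir output --learning_rate 1e-06 --per_device_train_batch_size 2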

The output files you find after training correspond to the tokenizer and the model weights. The merges are part of the tokenization; there is an explanation of this file in this post: https://github.com/huggingface/transformers/issues/1083#issuecomment-524303077. The most important file in that folder is pytorch_model.bin, which contains the model weights. In practice, you don't need to do anything in particular to that folder: pytorch_model.bin is indeed the model, but the Hugging Face pipeline recognizes the folder and the files inside it and generates from it directly. See the example below to load the tokenizer and model from a local folder and generate:

    from transformers import AutoTokenizer, GPT2LMHeadModel
    import torch

    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

    # load the fine-tuned tokenizer and model from the local output folder
    tokenizer = AutoTokenizer.from_pretrained('/path/to/output')
    model = GPT2LMHeadModel.from_pretrained('/path/to/output').to(device)

    # start generation from the end-of-text token to sample sequences from scratch
    input_ids = tokenizer.encode("<|endoftext|>", return_tensors='pt').to(device)

    outputs = model.generate(
        input_ids,
        top_k=950,
        repetition_penalty=1.2,
        max_length=1024,
        eos_token_id=0,
        do_sample=True,
        num_return_sequences=100)

To optimize the learning rate, I would train with the command you posted above using several different learning rates. Then visualise the training curves: during training you get the training and validation losses in the output, so extract them from the command log and plot them externally (see the sketch below). Both the training and validation losses should decrease up to a point, after which the validation loss usually starts increasing again; pick the run (and checkpoint) with the lowest validation loss.
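A minimal sketch for the plotting, assuming run_clm.py wrote a trainer_state.json into the output folder (training-loss entries appear every logging step; eval_loss entries only appear if you evaluate during training, e.g. with --evaluation_strategy steps):

    import json
    import matplotlib.pyplot as plt

    # read the losses logged by the Trainer during fine-tuning
    with open('output/trainer_state.json') as f:
        history = json.load(f)['log_history']

    train = [(h['step'], h['loss']) for h in history if 'loss' in h]
    val = [(h['step'], h['eval_loss']) for h in history if 'eval_loss' in h]

    plt.plot(*zip(*train), label='training loss')
    if val:
        plt.plot(*zip(*val), label='validation loss')
    plt.xlabel('step')
    plt.ylabel('loss')
    plt.legend()
    plt.savefig('loss_curves.png')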

Thanks for reaching out and let me know if you still have questions!
Noelia

Dear Noelia,

Many thanks for your comments. I did as you said and it worked. I found the best model with the optimized learning rate.
For sequence generation from scratch (zero-shot fashion), I used the block of code below, and first I want to make sure it is correct:

    input_ids = tokenizer.encode("<|endoftext|>", return_tensors='pt').to(device)

    outputs = model.generate(
        input_ids,
        top_k=950,
        repetition_penalty=1.2,
        max_length=512,
        eos_token_id=0,
        do_sample=True,
        num_return_sequences=100)

    for seq in outputs:
        print(tokenizer.decode(seq, skip_special_tokens=True))

Second, I have a question about the calculatePerplexity function that you provided. The function takes a tokenizer as input but never uses it, and instead uses input_ids, which is never created inside the function. Can you please explain what I should use as input_ids? Is it the same as the training data?

Many thanks,
SweetGAN

Hi SweetGAN,

Great to hear the fine-tuning worked. The code that you typed looks good.
Regarding the perplexity function, you are right. Below is a corrected version that uses the tokenizer on a given sequence:

    import math
    import torch

    def calculatePerplexity(sequence, model, tokenizer):
        # tokenize the sequence and score it with the model, using the inputs as labels
        input_ids = torch.tensor(tokenizer.encode(sequence)).unsqueeze(0)
        input_ids = input_ids.to(device)
        with torch.no_grad():
            outputs = model(input_ids, labels=input_ids)
        loss, logits = outputs[:2]
        # perplexity is the exponential of the cross-entropy loss
        return math.exp(loss)



    sequence = 'LETHVLTCASAAYVARHGEPRHPRDLADGHRCVLIRDPVTGRPYGWEFHRGDEVVPFDATGRLTVN'
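For example, to score that sequence with the fine-tuned model and tokenizer loaded earlier:

    ppl = calculatePerplexity(sequence, model, tokenizer)
    print(ppl)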

Thanks for noticing. I'm updating the documentation.

Hi Noelia,
Thanks for updating the documentation.
