AI4PD/ZymCTRL · Example 2: Fine-tuning on a set of user-defined sequences

May 9, 2023

edited May 9, 2023

Hello!

I have a question on Example 2: Fine-tuning on a set of user-defined sequences.
You provided two parts of scripts:
[part 1] input="sequences.fasta" ==> output="2.7.3.13_processed.txt" (each line has len(tokenizer(value+padding)['input_ids']))
[part 2] input="2.7.3.13_processed.txt" ==> output=('./dataset/train2/.arrow') + ('./dataset/eval2/.arrow')

Since run_clm.py already contains preprocessing like that in [part 2],
I'm wondering whether we could fine-tune directly on "2.7.3.13_processed.txt" and skip [part 2], as follows?

$ python run_clm.py --tokenizer_name /path/to/ZymCTRL \
    --train_file 2.7.3.13_processed.txt \
    --validation_split_percentage 10 \
    --do_train --do_eval --output_dir output ...
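
For reference, here is a minimal sketch of the split that --validation_split_percentage implies. As far as I understand it, when no validation file is given, run_clm.py takes the first N% of the training file as the evaluation set and the rest as the training set. The helper name split_lines and the rounding rule are my own assumptions for illustration; the real script delegates this to the datasets library, whose rounding may differ.

```python
def split_lines(lines, validation_split_percentage=10):
    """Hypothetical helper: return (train, validation) lists.

    Mimics a "first N% is validation, the rest is train" split.
    The rounding used here is an assumption, not the library's rule.
    """
    if not 0 <= validation_split_percentage <= 100:
        raise ValueError("validation_split_percentage must be in [0, 100]")
    n_val = round(len(lines) * validation_split_percentage / 100)
    # Validation takes the leading slice; training takes everything after it.
    return lines[n_val:], lines[:n_val]


if __name__ == "__main__":
    # Stand-in for the lines of 2.7.3.13_processed.txt (placeholder data).
    example = [f"seq{i}" for i in range(20)]
    train, validation = split_lines(example, 10)
    print(len(train), len(validation))
```

One consequence, if this reading is right: the evaluation set is simply the top of the file, so shuffling the processed file beforehand matters if the sequences are ordered in any meaningful way.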

Thank you!

nferruz

AI for protein design org May 16, 2023

mmmh, yes, you're right! I think that would also be possible if I am not missing any critical step in the script.
Let me know how it goes!

ipark

May 16, 2023

Thanks for the reply.

Actually, I ran only [part 1] and then fine-tuned with run_clm.py (in lieu of your 5.run_clm-post.py).
It seems to be working well.

Thanks!