ritwikm

added readme in code

bb630ca over 2 years ago

myocr.py is responsible for scraping all the writings of Mahatma Gandhi.

data_preprocessing.py cleans the data and prepares a file that is ready to be fed into the GPT-2 fine-tuning pipeline. The code sets a threshold of 200: any paragraph with more than 200 token IDs is split in half, and the halves are split again (recursively) until every piece is within the threshold.
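
The recursive splitting rule can be sketched roughly as follows. This is a minimal illustration, not the code in data_preprocessing.py: the names split_paragraph and count_tokens are hypothetical, the split point (the middle word) is an assumption, and a whitespace word count stands in for the GPT-2 tokenizer so the example runs without extra dependencies.

```python
from typing import Callable, List

MAX_TOKENS = 200  # threshold stated in this README


def count_tokens(text: str) -> int:
    """Stand-in tokenizer (assumption): one token per whitespace-separated word.

    The real pipeline counts GPT-2 token IDs; pass a different counter to mimic that.
    """
    return len(text.split())


def split_paragraph(
    paragraph: str,
    max_tokens: int = MAX_TOKENS,
    counter: Callable[[str], int] = count_tokens,
) -> List[str]:
    """Split a paragraph in half, recursively, until each piece fits the threshold."""
    words = paragraph.split()
    # Stop when the piece fits, or when it cannot be halved any further.
    if counter(paragraph) <= max_tokens or len(words) < 2:
        return [paragraph]
    mid = len(words) // 2  # assumed split point: the middle word
    left = " ".join(words[:mid])
    right = " ".join(words[mid:])
    return (
        split_paragraph(left, max_tokens, counter)
        + split_paragraph(right, max_tokens, counter)
    )


if __name__ == "__main__":
    long_paragraph = " ".join(["word"] * 450)
    pieces = split_paragraph(long_paragraph)
    print([count_tokens(p) for p in pieces])  # [112, 113, 112, 113]
```

A 450-word paragraph is first halved into two 225-word pieces, which still exceed the threshold, so each is halved again. To count real GPT-2 token IDs instead, a counter built on the tokenizer used for fine-tuning could be passed in place of count_tokens.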