
myocr.py is responsible for scraping all the writings of Mahatma Gandhi.

data_preprocessing.py cleans the data and prepares a file that is ready to be fed into the GPT-2 fine-tuning pipeline. The code uses a threshold of 200: any paragraph with more than 200 token IDs is split in half, recursively, until every piece has at most 200 token IDs.
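
The recursive halving described above can be sketched roughly as follows. This is an illustration, not the code in data_preprocessing.py: the names `split_paragraph` and `count_tokens` are hypothetical, and a whitespace word count stands in for the GPT-2 tokenizer so that the example runs without extra dependencies.

```python
from typing import Callable, List

MAX_TOKENS = 200  # threshold stated in this README


def count_tokens(text: str) -> int:
    """Stand-in token counter (assumption): counts whitespace-separated words.

    The real pipeline counts GPT-2 token IDs; swap in a real tokenizer here.
    """
    return len(text.split())


def split_paragraph(
    text: str,
    max_tokens: int = MAX_TOKENS,
    counter: Callable[[str], int] = count_tokens,
) -> List[str]:
    """Split a paragraph in half, recursively, until each piece fits the threshold."""
    words = text.split()
    # Stop when the piece is short enough, or when it cannot be halved further.
    if counter(text) <= max_tokens or len(words) < 2:
        return [text]
    mid = len(words) // 2
    left = " ".join(words[:mid])
    right = " ".join(words[mid:])
    return (
        split_paragraph(left, max_tokens, counter)
        + split_paragraph(right, max_tokens, counter)
    )


if __name__ == "__main__":
    paragraph = " ".join(["word"] * 450)
    pieces = split_paragraph(paragraph)
    print([count_tokens(p) for p in pieces])
```

Halving at the word midpoint keeps the pieces roughly balanced. Splitting at the nearest sentence boundary instead would be a reasonable variation if cleaner breaks matter more than equal lengths.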