```myocr.py``` is responsible for scraping all the writings of Mahatma Gandhi. ```data_preprocessing.py``` cleans the data and prepares a file that is ready to be fed into the GPT-2 fine-tuning pipeline. In this code, the threshold is set to 200 tokens: any paragraph with more than 200 token IDs is split in half, recursively, until every piece is at or under the threshold.
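
The recursive halving can be sketched roughly as follows. This is a minimal illustration, not the actual code in ```data_preprocessing.py```: the names `split_paragraph`, `count_tokens`, and `whitespace_token_count` are hypothetical, the split point (the middle word) is an assumption, and a whitespace word count stands in for the real GPT-2 token count so that the snippet runs without any dependencies.

```python
from typing import Callable, List


def split_paragraph(
    text: str,
    count_tokens: Callable[[str], int],
    max_tokens: int = 200,
) -> List[str]:
    """Recursively halve `text` until each piece has at most `max_tokens` tokens.

    `count_tokens` is a hypothetical callable returning the number of
    token IDs for a string (an assumption, not the repository's API).
    """
    words = text.split()
    # Stop when the piece is within the threshold, or cannot be halved further.
    if count_tokens(text) <= max_tokens or len(words) < 2:
        return [text]
    mid = len(words) // 2  # assumed split point: the middle word
    left = " ".join(words[:mid])
    right = " ".join(words[mid:])
    return (
        split_paragraph(left, count_tokens, max_tokens)
        + split_paragraph(right, count_tokens, max_tokens)
    )


def whitespace_token_count(text: str) -> int:
    """Stand-in tokenizer for illustration: one token per whitespace-separated word."""
    return len(text.split())


if __name__ == "__main__":
    paragraph = " ".join(f"word{i}" for i in range(450))
    pieces = split_paragraph(paragraph, whitespace_token_count)
    print([whitespace_token_count(p) for p in pieces])
```

To count real GPT-2 tokens instead, one option would be to pass something like `lambda s: len(tokenizer.encode(s))`, where `tokenizer` is a GPT-2 tokenizer loaded from a library such as Hugging Face `transformers`. Whether the script uses that particular tokenizer is an assumption here.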