```myocr.py``` is responsible for scraping all the writings of Mahatma Gandhi.
```data_preprocessing.py``` cleans the data and prepares a file that is ready to be fed into the GPT-2 fine-tuning pipeline. The code uses a threshold of 200: any paragraph with more than 200 token IDs is split in half, recursively, until every piece is within the threshold.
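
The recursive halving described above can be sketched roughly as follows. This is a minimal illustration, not the actual contents of ```data_preprocessing.py```: the names `split_paragraph` and `whitespace_token_count` are hypothetical, the whitespace-based counter is a stand-in for counting GPT-2 token IDs, and splitting at the midpoint of the word list is an assumption about how "in half" is defined.

```python
from typing import Callable, List

MAX_TOKENS = 200  # threshold stated in this README


def whitespace_token_count(text: str) -> int:
    """Stand-in counter (assumption): counts whitespace-separated words.

    The real script presumably counts GPT-2 token IDs instead.
    """
    return len(text.split())


def split_paragraph(
    paragraph: str,
    count_tokens: Callable[[str], int] = whitespace_token_count,
    max_tokens: int = MAX_TOKENS,
) -> List[str]:
    """Recursively halve a paragraph until each piece has <= max_tokens tokens."""
    words = paragraph.split()
    # Base case: short enough, or a single word that cannot be halved further.
    if count_tokens(paragraph) <= max_tokens or len(words) < 2:
        return [paragraph]
    # Assumption: "half" means the midpoint of the word list.
    mid = len(words) // 2
    left = " ".join(words[:mid])
    right = " ".join(words[mid:])
    return (
        split_paragraph(left, count_tokens, max_tokens)
        + split_paragraph(right, count_tokens, max_tokens)
    )


if __name__ == "__main__":
    long_paragraph = " ".join(f"word{i}" for i in range(500))
    pieces = split_paragraph(long_paragraph)
    print(len(pieces), [whitespace_token_count(p) for p in pieces])
```

To count real token IDs, a different counter can be passed as `count_tokens`, for example one that returns the length of the ID list produced by a GPT-2 tokenizer.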