ritwikm committed on
Commit
bb630ca
1 Parent(s): 01de82f

added readme in code

Files changed (1)
  1. code/README.md +4 -0
code/README.md ADDED
@@ -0,0 +1,4 @@
 
 
 
 
 
+ ```myocr.py``` scrapes all the writings of Mahatma Gandhi.
+
+
+ ```data_preprocessing.py``` cleans the data and prepares a file that is ready to be fed into the GPT-2 fine-tuning pipeline. The code uses a threshold of 200: any paragraph with more than 200 token IDs is split in half, recursively, until no piece exceeds the threshold.
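
The recursive halving described for `data_preprocessing.py` could look roughly like the sketch below. It is not taken from the repository: the function name `split_recursively`, the constant `MAX_TOKENS`, and the choice to cut at the midpoint of the token-ID list are assumptions made for illustration, and the input is a plain list of integers rather than the output of a real GPT-2 tokenizer.

```python
from typing import List

MAX_TOKENS = 200  # threshold stated in the README


def split_recursively(token_ids: List[int], max_tokens: int = MAX_TOKENS) -> List[List[int]]:
    """Halve a list of token IDs until every piece has at most max_tokens entries.

    Hypothetical helper: the actual script may split on sentence or word
    boundaries instead of the raw midpoint used here.
    """
    if max_tokens < 1:
        raise ValueError("max_tokens must be at least 1")
    # Base case: the paragraph already fits under the threshold.
    if len(token_ids) <= max_tokens:
        return [token_ids]
    # Recursive case: cut at the midpoint and process each half independently.
    mid = len(token_ids) // 2
    left = split_recursively(token_ids[:mid], max_tokens)
    right = split_recursively(token_ids[mid:], max_tokens)
    return left + right


if __name__ == "__main__":
    # Stand-in for a tokenized paragraph of 450 tokens.
    chunks = split_recursively(list(range(450)))
    print([len(chunk) for chunk in chunks])  # [112, 113, 112, 113]
```

Splitting at the raw midpoint keeps the sketch short, but it can cut a sentence in two; a real preprocessing step may prefer to move the cut to the nearest sentence boundary.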