MinzaKhan committed 357aa51 (parent: 8a5ef3b)

Update README.md

Files changed (1): README.md (+11 -7)

README.md (excerpt, after this commit):

The limitations of this model are that it can only generate text in the style of [...]

I created my own dataset to train this model. I chose 14 novels written by H. G. Wells; most of them are science fiction. The dataset contains more than 1 million tokens.

The evaluation results are good: the model generates text in the style of H. G. Wells, and most of the generated text is science fiction.
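
The diff does not show this model's repository id, so the id below is a placeholder; this is only a sketch of how the finetuned model would be used for generation:

```python
from transformers import pipeline

# "MinzaKhan/gpt2-hgwells" is a placeholder: substitute this model's actual repo id.
generator = pipeline("text-generation", model="MinzaKhan/gpt2-hgwells")
print(generator("The Time Traveller stepped into", max_new_tokens=50)[0]["generated_text"])
```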

The texts included in the corpus are novels written by H. G. Wells. The novels in the corpus, with their token counts, are:

The Time Machine - 37677
 
[...]
The Red Room - 4618

The total number of tokens in the corpus is 1043588.

The corpus was created by downloading and combining 14 novels by the famous author H. G. Wells from Project Gutenberg. Most of these novels are science fiction, so the model has been trained to produce science-fiction text in the style of H. G. Wells. This model was created on 23 February 2023.

The corpus consists of the 14 novels downloaded from Project Gutenberg. The text added by Project Gutenberg at the beginning and end of each novel was removed, the remaining text of each novel was collapsed into a single line, and that line was split into 20 parts, yielding 20 lines per novel. The lines from all novels were combined and stored in a single text file; a sketch of these preprocessing steps is shown below. The text was tokenized with the GPT2Tokenizer from the transformers library, and the resulting file was used to finetune the model.
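
The README describes the preprocessing but does not include the script, so this is a minimal sketch of the steps above; the novels/ input folder, the hgwells.txt output name, and the exact Project Gutenberg marker patterns are assumptions:

```python
import re
from pathlib import Path

GUTENBERG_START = re.compile(r"\*\*\* START OF .*? \*\*\*")
GUTENBERG_END = re.compile(r"\*\*\* END OF .*? \*\*\*")

def strip_gutenberg_boilerplate(text: str) -> str:
    """Drop the license/header text Project Gutenberg adds around each novel."""
    start = GUTENBERG_START.search(text)
    end = GUTENBERG_END.search(text)
    return text[start.end():end.start()] if start and end else text

def novel_to_20_lines(path: Path) -> list[str]:
    body = strip_gutenberg_boilerplate(path.read_text(encoding="utf-8"))
    one_line = " ".join(body.split())                 # the whole novel as one line
    bounds = [round(i * len(one_line) / 20) for i in range(21)]
    return [one_line[bounds[i]:bounds[i + 1]] for i in range(20)]  # 20 roughly equal parts

all_lines = []
for novel in sorted(Path("novels").glob("*.txt")):    # the 14 downloaded novels
    all_lines.extend(novel_to_20_lines(novel))
Path("hgwells.txt").write_text("\n".join(all_lines), encoding="utf-8")
```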

The values of the hyperparameters used during finetuning are:

batch_size = 2

[...]

learning rate = 5e-4

warmup steps = 1e2
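
Only the hyperparameters listed above come from the README, so the following finetuning setup is a sketch under stated assumptions: the base gpt2 checkpoint, the number of epochs, and the use of the Trainer API are assumptions.

```python
from datasets import load_dataset
from transformers import (DataCollatorForLanguageModeling, GPT2LMHeadModel,
                          GPT2Tokenizer, Trainer, TrainingArguments)

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token      # GPT-2 ships without a pad token
model = GPT2LMHeadModel.from_pretrained("gpt2")

# One line per 1/20th of a novel, produced by the preprocessing sketch above.
dataset = load_dataset("text", data_files={"train": "hgwells.txt"})
tokenized = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=1024),
    batched=True, remove_columns=["text"],
)

args = TrainingArguments(
    output_dir="gpt2-hgwells",
    per_device_train_batch_size=2,   # batch_size = 2 (from the README)
    learning_rate=5e-4,              # learning rate = 5e-4 (from the README)
    warmup_steps=100,                # warmup steps = 1e2 (from the README)
    num_train_epochs=3,              # assumption: not stated in the README
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```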

Training Loss: 2.26

Training Perplexity: 9.57

Validation Loss: 3.84

Validation Perplexity: 46.43
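
As a sanity check, the reported perplexities match exp(loss) up to rounding of the loss values:

```python
import math

print(math.exp(2.26))  # ~9.58, vs. the reported training perplexity of 9.57
print(math.exp(3.84))  # ~46.53, vs. the reported validation perplexity of 46.43
```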

The corpus has been uploaded to the HuggingFace Hub and can be accessed at: https://huggingface.co/datasets/MinzaKhan/HGWells
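
A minimal sketch of loading the corpus with the datasets library (assuming the repository resolves to plain-text splits; inspect the returned object for the actual layout):

```python
from datasets import load_dataset

ds = load_dataset("MinzaKhan/HGWells")   # the dataset repo linked above
print(ds)                                # shows the available splits and columns
```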