license: mit
tags:
- generated_from_trainer
model-index:
- name: PROJECT_GUTENBERG_GOTHIC_FICTION_TEXT_GENERATION_gpt2
results: []
datasets:
- Dwaraka/Training_Dataset_of_Project_Gutebberg_Gothic_Fiction
- Dwaraka/Testing_Dataset_of_Project_Gutebberg_Gothic_Fiction
language:
- en
metrics:
- code_eval
library_name: transformers
pipeline_tag: text-generation
PROJECT_GUTENBERG_GOTHIC_FICTION_TEXT_GENERATION_gpt2
This model is a fine-tuned version of gpt2 trained on text from Project Gutenberg (https://www.gutenberg.org/), from which we picked the contents of 12 Gothic Fiction books, totalling 1051581 words and 6002980 characters, to perform text generation in the same genre.
It can be used to generate text in the vocabulary and writing style of GOTHIC FICTION.
Model description
GPT-2 is a transformer model trained by OpenAI on a very large corpus (billions of tokens) from the internet, using a self-supervised language-modeling objective. The corpus covers many kinds of text, including news articles, books, scientific papers, web pages, and online forum discussions, which exposed the model to a diverse range of data and let it learn patterns and relationships in language that are common across a wide range of contexts.
Because the language-modeling objective is simply to predict the next word given the preceding text, GPT-2 can learn from raw text without manual labels, which is what allows it to keep improving on such a large corpus.
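As a minimal sketch of that objective (not part of the original training code; the prompt is arbitrary), you can inspect the next-token distribution that the base gpt2 checkpoint predicts for a short prefix:

import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Score the next-token distribution for a short prompt.
inputs = tokenizer("The castle stood on a", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, seq_len, vocab_size)
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(next_token_probs, 5)
print([tokenizer.decode(i) for i in top.indices])  # five most likely continuations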
For the PROJECT_GUTENBERG_GOTHIC_FICTION_TEXT_GENERATION_gpt2 model, GPT-2 is further trained and fine-tuned on a corpus collected from the Project Gutenberg website (https://www.gutenberg.org/). The corpus consists of the following Gothic Fiction texts: The Modern Prometheus, The Lair of the White Worm by Bram Stoker, The Vampyre; A Tale, Nightmare Abbey by Thomas Love Peacock, The History of Caliph Vathek by William Beckford, The Lock and Key Library: Classic Mystery and Detective Stories: Old Time, Caleb Williams; Or, Things as They Are by William Godwin, The Private Memoirs and Confessions of a Justified Sinner, Confessions of an English Opium-Eater, The Mysteries of Udolpho, Wieland; Or, The Transformation: An American Tale by Charles Brockden Brown, and The Castle of Otranto.
Intended uses & limitations
Uses: This model can be used for creative writing, content generation, language modeling, and language learning.
Limitations: Since the model is trained only on Gothic Fiction, it may not perform well when generating text in a different genre.
Training and evaluation data
Training Data: The TRAINING_CORPUS is a collection of 12 books (The Modern Prometheus, The Lair of the White Worm by Bram Stoker, The Vampyre; A Tale, Nightmare Abbey by Thomas Love Peacock, The History of Caliph Vathek by William Beckford, The Lock and Key Library: Classic Mystery and Detective Stories: Old Time, Caleb Williams; Or, Things as They Are by William Godwin, The Private Memoirs and Confessions of a Justified Sinner, Confessions of an English Opium-Eater, The Mysteries of Udolpho, Wieland; Or, The Transformation: An American Tale by Charles Brockden Brown, and The Castle of Otranto), containing 1051581 tokens and 6002980 characters from Project Gutenberg (https://www.gutenberg.org/), all of the GOTHIC FICTION genre. This text is fed as input to the PROJECT_GUTENBERG_GOTHIC_FICTION_TEXT_GENERATION_gpt2 model to produce Gothic Fiction-style outputs. It can be accessed at https://huggingface.co/datasets/Dwaraka/Training_Dataset_of_Project_Gutebberg_Gothic_Fiction.
Evaluation Data: The TESTING_CORPUS is random text drawn from the selected corpus and is used to evaluate the model. It can be accessed at https://huggingface.co/datasets/Dwaraka/Testing_Dataset_of_Project_Gutebberg_Gothic_Fiction.
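Both corpora can be pulled from the Hub with the datasets library. A minimal sketch (the split and column layout of the files is an assumption; inspect the loaded datasets to confirm):

from datasets import load_dataset

# Load the training and testing corpora published on the Hugging Face Hub.
train_ds = load_dataset("Dwaraka/Training_Dataset_of_Project_Gutebberg_Gothic_Fiction")
test_ds = load_dataset("Dwaraka/Testing_Dataset_of_Project_Gutebberg_Gothic_Fiction")

print(train_ds)  # inspect the available splits and columns
print(test_ds)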
List of texts included in the corpus and their respective token counts:
1. The Modern Prometheus: 78094 tokens
2. The Lair of the White Worm by Bram Stoker: 58202 tokens
3. The Vampyre; A Tale: 15678 tokens
4. Nightmare Abbey by Thomas Love Peacock: 30386 tokens
5. The History of Caliph Vathek by William Beckford: 40047 tokens
6. The Lock and Key Library: Classic Mystery and Detective Stories: Old Time: 132481 tokens
7. Caleb Williams; Or, Things as They Are by William Godwin: 149178 tokens
8. The Private Memoirs and Confessions of a Justified Sinner: 87012 tokens
9. Confessions of an English Opium-Eater: 41723 tokens
10. The Mysteries of Udolpho: 293421 tokens
11. Wieland; Or, The Transformation: An American Tale by Charles Brockden Brown: 85697 tokens
12. The Castle of Otranto: 39662 tokens
A total of 1051581 tokens.
Training procedure
First, the training dataset is loaded with load_dataset() and tokenized with the pre-trained GPT-2 tokenizer, which converts the plain text into a format GPT-2 understands. To ensure all tokenized sequences have the same length, a padding token is set via tokenizer.pad_token = tokenizer.unk_token. The tokenized data is then collated into batches with a DataCollatorForLanguageModeling object, and the resulting dataset is passed to GPT-2 for fine-tuning, as sketched below.
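A minimal sketch of that pipeline using the Trainer API (the output directory, sequence length, and the "text" column name are assumptions for illustration, not values taken from the original training run; the full hyperparameters are listed in the next section):

from datasets import load_dataset
from transformers import (GPT2Tokenizer, GPT2LMHeadModel,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.unk_token  # GPT-2 has no pad token by default
model = GPT2LMHeadModel.from_pretrained("gpt2")

dataset = load_dataset("Dwaraka/Training_Dataset_of_Project_Gutebberg_Gothic_Fiction")

def tokenize(batch):
    # "text" is the assumed column name; adjust to the actual dataset schema.
    return tokenizer(batch["text"], truncation=True, max_length=512, padding="max_length")

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset["train"].column_names)

# mlm=False selects the causal (next-token) language-modeling objective used by GPT-2.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="gothic-gpt2", num_train_epochs=3,
                           per_device_train_batch_size=8, learning_rate=5e-5),
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()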
Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 5e-05
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 3
- gradient_accumulation_steps: 1
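The hyperparameters above map onto TrainingArguments roughly as follows (a sketch; the output directory is a placeholder):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="gothic-gpt2",          # placeholder path
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    seed=42,
    lr_scheduler_type="linear",
    num_train_epochs=3,
    gradient_accumulation_steps=1,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
)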
Training results
Validation Loss: 0.2594448818879969
Training Loss: 3.422200 (at step 40000)
How to Use:
The model can be used for text generation by loading it directly from the Hub:
!pip install transformers

from transformers import GPT2Tokenizer, GPT2LMHeadModel

# Load the fine-tuned model and its tokenizer from the Hugging Face Hub.
model_name = "Dwaraka/PROJECT_GUTENBERG_GOTHIC_FICTION_TEXT_GENERATION_gpt2"
model = GPT2LMHeadModel.from_pretrained(model_name)
tokenizer = GPT2Tokenizer.from_pretrained(model_name)

# Encode a prompt and sample a continuation of up to 50 tokens.
prompt = "Once upon a time, in a dark and spooky castle, there lived a "
input_ids = tokenizer.encode(prompt, return_tensors="pt")
output = model.generate(input_ids, max_length=50, do_sample=True)
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(generated_text)
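Alternatively, the same generation can be done with the text-generation pipeline, which wraps model loading, tokenization, and decoding in one call (a short sketch):

from transformers import pipeline

generator = pipeline("text-generation",
                     model="Dwaraka/PROJECT_GUTENBERG_GOTHIC_FICTION_TEXT_GENERATION_gpt2")
result = generator("Once upon a time, in a dark and spooky castle, there lived a ",
                   max_length=50, do_sample=True)
print(result[0]["generated_text"])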
GitHub Link:
This fine-tuned model is also available at: https://github.com/DwarakaVelasiri/Dwaraka-PROJECT_GUTENBERG_GOTHIC_FICTION_TEXT_GENERATION_gpt2
Framework versions
- Transformers 4.26.1
- Pytorch 1.13.1+cu116
- Datasets 2.10.0
- Tokenizers 0.13.2