
gpt2-finetuned-agatha-christie

This model is a fine-tuned version of gpt2 on a text dataset containing Agatha Christie's books. It achieves the following results on the evaluation set:

  • Loss: 3.0911

Model description

This is a GPT-2 model fine-tuned on a text corpus drawn from Agatha Christie's books. GPT-2 is a Transformer model pretrained on a very large corpus of English data in a self-supervised fashion.

Intended uses & limitations

The intended use of this model is to generate text in the style of Agatha Christie, the queen of crime.
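A minimal usage sketch with the Transformers text-generation pipeline is shown below; the model id is the repository name of this card, and the prompt and sampling parameters are illustrative choices, not settings from the original card:

```python
from transformers import pipeline

# Replace the model id with the full Hub repository path if it is namespaced.
generator = pipeline("text-generation", model="gpt2-finetuned-agatha-christie")

prompt = "Poirot examined the letter carefully and said,"
outputs = generator(
    prompt,
    max_new_tokens=60,
    do_sample=True,   # sample rather than greedy-decode, for more varied prose
    top_p=0.95,
    temperature=0.8,
)
print(outputs[0]["generated_text"])
```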

Although Ms. Christie has around 80 original works, not all of them could be selected because of copyright issues. Since the model was fine-tuned on a small dataset, the text it generates may sometimes be of limited quality.

Training and evaluation data

A custom text corpus was built for training and validation. Thirteen of Ms. Christie's original works that are available in the public domain were chosen. Raw texts were downloaded from https://www.gutenberg.org/ and other available sources between February 15th and 20th, 2023.

Data preprocessing (a sketch of the scripted steps follows this list):

  • Project Gutenberg header and footer texts are removed manually.
  • Text illustrations are identified and removed.
  • All newlines are stripped.
  • Special characters such as = and “” are removed.
  • Sentences longer than 200 characters are removed.
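The scripted parts of this cleaning could look roughly like the following sketch. Only the character set, the newline stripping, and the 200-character cutoff come from the list above; the function names and structure are assumptions (header/footer and illustration removal were done manually and are not shown):

```python
import re

def clean_text(raw: str) -> str:
    """Clean one book's raw text after manual header/footer removal."""
    # Strip all newlines by collapsing them into spaces.
    text = raw.replace("\n", " ")
    # Remove special characters such as '=' and curly quotes.
    text = re.sub(r"[=“”]", "", text)
    return text

def filter_sentences(sentences: list[str]) -> list[str]:
    # Drop sentences longer than 200 characters.
    return [s for s in sentences if len(s) <= 200]
```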

Here is the list of books after data cleaning and preprocessing:

| File Name | Word Count | Character Count | Book Type |
|---|---|---|---|
| The_Murder_on_the_Links.txt | 64470 | 383665 | Novel |
| The_Mysterious_Affair_at_Styles.txt | 56456 | 341202 | Novel |
| The_Secret_of_Chimneys.txt | 74431 | 455894 | Novel |
| And_Then_There_Were_None.txt | 52607 | 320398 | Novel |
| The_murder_of_Roger_Ackroyd.txt | 69485 | 416920 | Novel |
| Poirot_Investigates.txt | 52494 | 313466 | Novel |
| The_Big_Four.txt | 55230 | 319360 | Novel |
| The_Mystery_of_the_Blue_Train.txt | 71222 | 414922 | Novel |
| The_Secret_Adversary.txt | 10855 | 75138 | Novel |
| The_Man_in_the_Brown_Suit.txt | 10317 | 75261 | Novel |
| The_Hunters_Lodge_Case.txt | 4352 | 25602 | Short Story |
| The_Missing_Will.txt | 3257 | 19004 | Short Story |
| The_Plymouth_Express_Affair.txt | 4858 | 29493 | Short Story |
| Total | 659261 | 3928209 | |

Splitting training and evaluation data

The NLTK sentence tokenizer was used to split all the texts into sentences. scikit-learn's train_test_split method was then used to place a random 85% of the sentences in the training set and the remaining 15% in the evaluation set. In the training and evaluation files, each sentence is placed on a separate line.
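A minimal sketch of this split, assuming the cleaned books have been concatenated into a single file (the file names and the fixed seed here are illustrative):

```python
import nltk
from sklearn.model_selection import train_test_split

nltk.download("punkt")  # sentence tokenizer models

# Illustrative input: all cleaned books concatenated into one file.
corpus = open("corpus.txt", encoding="utf-8").read()
sentences = nltk.sent_tokenize(corpus)

# 85% training / 15% evaluation, shuffled with a fixed seed.
train_sents, eval_sents = train_test_split(sentences, test_size=0.15, random_state=42)

# One sentence per line, as described above.
with open("train.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(train_sents))
with open("eval.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(eval_sents))
```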

During training and validation, the default GPT-2 tokenizer is used.
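Loading that tokenizer could look like the sketch below. Note that GPT-2 defines no padding token, so reusing the end-of-text token for padding is a common workaround, not a detail taken from the original card:

```python
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
# GPT-2 has no padding token by default; reuse EOS as a common workaround.
tokenizer.pad_token = tokenizer.eos_token

encoded = tokenizer("Poirot smiled and shook his head.", truncation=True)
print(encoded["input_ids"])
```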

Training procedure

The Trainer class from the Transformers library is used to train the fine-tuned model; a minimal sketch follows the hyperparameter list below.

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 5e-05
  • train_batch_size: 32
  • eval_batch_size: 64
  • seed: 42
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • lr_scheduler_warmup_steps: 500
  • num_epochs: 6
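
Wiring these hyperparameters into the Trainer could look roughly like the following sketch. Only the hyperparameter values come from the list above; the dataset loading, data collator, output directory, and evaluation cadence are assumptions (the 50-step eval interval is inferred from the results table below):

```python
from datasets import load_dataset
from transformers import (DataCollatorForLanguageModeling, GPT2LMHeadModel,
                          GPT2TokenizerFast, Trainer, TrainingArguments)

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
model = GPT2LMHeadModel.from_pretrained("gpt2")

# One sentence per line, as produced by the split described above.
raw = load_dataset("text", data_files={"train": "train.txt", "eval": "eval.txt"})
tokenized = raw.map(lambda batch: tokenizer(batch["text"], truncation=True),
                    batched=True, remove_columns=["text"])

args = TrainingArguments(
    output_dir="gpt2-finetuned-agatha-christie",  # illustrative
    learning_rate=5e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=64,
    seed=42,
    lr_scheduler_type="linear",
    warmup_steps=500,
    num_train_epochs=6,
    evaluation_strategy="steps",
    eval_steps=50,  # assumed from the 50-step cadence in the results table
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["eval"],
    # Causal language modeling, so masked-LM labeling is disabled.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```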

Training results

| Training Loss | Epoch | Step | Validation Loss |
|---|---|---|---|
| 4.2824 | 0.26 | 50 | 3.8764 |
| 3.8824 | 0.51 | 100 | 3.5931 |
| 3.6336 | 0.77 | 150 | 3.4378 |
| 3.5056 | 1.03 | 200 | 3.3445 |
| 3.4038 | 1.28 | 250 | 3.2881 |
| 3.3502 | 1.54 | 300 | 3.2506 |
| 3.3135 | 1.79 | 350 | 3.2224 |
| 3.2839 | 2.05 | 400 | 3.2028 |
| 3.2193 | 2.31 | 450 | 3.1816 |
| 3.2066 | 2.56 | 500 | 3.1660 |
| 3.2043 | 2.82 | 550 | 3.1470 |
| 3.1619 | 3.08 | 600 | 3.1380 |
| 3.1092 | 3.33 | 650 | 3.1271 |
| 3.1073 | 3.59 | 700 | 3.1187 |
| 3.0990 | 3.85 | 750 | 3.1109 |
| 3.0695 | 4.10 | 800 | 3.1089 |
| 3.0281 | 4.36 | 850 | 3.1044 |
| 3.0322 | 4.62 | 900 | 3.1002 |
| 3.0358 | 4.87 | 950 | 3.0944 |
| 3.0126 | 5.13 | 1000 | 3.0958 |
| 2.9889 | 5.38 | 1050 | 3.0931 |
| 2.9874 | 5.64 | 1100 | 3.0917 |
| 2.9915 | 5.90 | 1150 | 3.0911 |

Framework versions

  • Transformers 4.26.1
  • Pytorch 1.13.1+cu116
  • Datasets 2.10.0
  • Tokenizers 0.13.2