license: mit
tags:
  - generated_from_trainer
model-index:
  - name: GPT2FinnedtunnedEwritersRAll
    results: []

GPT2FinnedtunnedEwritersRAll

This model is a fine-tuned version of gpt2, trained on the writings of W. E. Burghardt Du Bois, Frederick Douglass, Booker T. Washington and William Still on black emancipation and black civil rights.

Model description

The model is designed to be fine-tuned on writing by historical black writers who wrote on freedom and emancipation. This first version is GPT2 fine-tuned on the writings of W. E. Burghardt Du Bois, Frederick Douglass, Booker T. Washington and William Still.

Intended uses & limitations

The model can be used to complete sentences where a historical context advocating for black freedom and emancipation is required. It may not fit outside historical settings, because many of the issues discussed in the training texts may not apply today.
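As an illustration of the sentence-completion use described above, here is a minimal sketch built on the Transformers `pipeline` API. The repository id in `MODEL_ID` is an assumption (it is not stated in this card), and the helper name `complete_sentence` and the sampling settings are hypothetical choices for the example, not recommendations from the author.

```python
# Minimal sketch: complete a sentence with the fine-tuned model.
# ASSUMPTION: MODEL_ID is a guessed Hub repository id; replace it with the real one.
MODEL_ID = "armahlovis/GPT2FinnedtunnedEwritersRAll"


def complete_sentence(prompt, model_id=MODEL_ID, max_new_tokens=40):
    """Return one sampled continuation of `prompt` (hypothetical helper)."""
    # Imported here so the module can be loaded without transformers installed.
    from transformers import pipeline

    generator = pipeline("text-generation", model=model_id)
    outputs = generator(
        prompt,
        max_new_tokens=max_new_tokens,  # length of the continuation, in tokens
        do_sample=True,                 # sample instead of greedy decoding
        num_return_sequences=1,
    )
    # The pipeline returns a list of dicts with a "generated_text" key.
    return outputs[0]["generated_text"]
```

Calling `complete_sentence("The right to freedom")` downloads the model on first use and returns the prompt followed by the generated continuation.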

Training and evaluation data

The dataset contains more than 1 million word tokens drawn from historical black writers who wrote about black emancipation. The training texts are:

  • Collected Articles of Frederick Douglass (8,000 word tokens)
  • Three Addresses by Frederick Douglass (28K word tokens)
  • Why Is the Negro Lynched? by Frederick Douglass (15K word tokens)
  • My Bondage and My Freedom by Frederick Douglass (135K word tokens)
  • Narrative of the Life of Frederick Douglass (40K word tokens)
  • Darkwater by W. E. Burghardt Du Bois (67K word tokens)
  • The Gift of Black Folk by W. E. Burghardt Du Bois (77K word tokens)
  • John Brown by W. E. Burghardt Du Bois (101K word tokens)
  • The Negro Problem by W. E. Burghardt Du Bois (36K word tokens)
  • The Conservation of Races by W. E. Burghardt Du Bois (5K word tokens)
  • The Negro by W. E. Burghardt Du Bois (57K word tokens)
  • The Quest of the Silver Fleece by W. E. Burghardt Du Bois (109K word tokens)
  • The Suppression of the African Slave-Trade by W. E. Burghardt Du Bois (123K word tokens)
  • Up from Slavery: An Autobiography by Booker T. Washington (77K word tokens)

The evaluation data set consists of The Underground Railroad by William Still (400K word tokens).
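The per-title counts listed above can be added up to check the split sizes. The short sketch below uses only the figures quoted in this card, in thousands of word tokens; the dictionary names are illustrative. The training texts sum to about 878K word tokens, and adding the 400K evaluation text gives about 1.28 million in total.

```python
# Word-token counts in thousands, copied from the list in this card.
TRAIN_WORDS_K = {
    "Collected Articles of Frederick Douglass": 8,
    "Three Addresses": 28,
    "Why Is the Negro Lynched?": 15,
    "My Bondage and My Freedom": 135,
    "Narrative of the Life of Frederick Douglass": 40,
    "Darkwater": 67,
    "The Gift of Black Folk": 77,
    "John Brown": 101,
    "The Negro Problem": 36,
    "The Conservation of Races": 5,
    "The Negro": 57,
    "The Quest of the Silver Fleece": 109,
    "The Suppression of the African Slave-Trade": 123,
    "Up from Slavery": 77,
}
EVAL_WORDS_K = {"The Underground Railroad": 400}

train_total_k = sum(TRAIN_WORDS_K.values())
eval_total_k = sum(EVAL_WORDS_K.values())

print(f"training:   {train_total_k}K word tokens")
print(f"evaluation: {eval_total_k}K word tokens")
print(f"combined:   {train_total_k + eval_total_k}K word tokens")
```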

Training procedure

After the corpus was assembled, the text was preprocessed to remove the extra text and license information added by Project Gutenberg. The training corpus was kept below 1,000,000 word tokens and the number of epochs was set to 1 so that the model could be trained on the basic tier of Google Colab. The text was then tokenized with GPT2Tokenizer and used to fine-tune GPT2.
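The cleaning step can be sketched as follows. This is not the author's script: it assumes the usual Project Gutenberg marker lines (`*** START OF THE PROJECT GUTENBERG EBOOK ... ***` and the matching `*** END OF ...` line), and the function names `strip_gutenberg` and `cap_words` are hypothetical. Words are counted by whitespace splitting, which is only one possible reading of "word token".

```python
import re

# ASSUMPTION: the files carry the standard Project Gutenberg marker lines.
_START = re.compile(
    r"^\*\*\*\s*START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$",
    re.IGNORECASE | re.MULTILINE,
)
_END = re.compile(
    r"^\*\*\*\s*END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$",
    re.IGNORECASE | re.MULTILINE,
)


def strip_gutenberg(text):
    """Keep only the text between the START and END marker lines."""
    start = _START.search(text)
    end = _END.search(text)
    begin = start.end() if start else 0      # no marker: keep from the top
    stop = end.start() if end else len(text)  # no marker: keep to the bottom
    return text[begin:stop].strip()


def cap_words(text, max_words):
    """Truncate to at most `max_words` whitespace-separated words.

    Note: this also collapses runs of whitespace into single spaces.
    """
    return " ".join(text.split()[:max_words])
```

A corpus could then be built by applying `strip_gutenberg` to each book, joining the results, and passing the combined text through `cap_words` with the desired limit before tokenization.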

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 5e-05
  • train_batch_size: 8
  • eval_batch_size: 8
  • seed: 42
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • num_epochs: 1
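For readers who want to reproduce a similar run, the values above can be expressed as keyword arguments for `transformers.TrainingArguments`. This is a sketch rather than the original training script: mapping `train_batch_size` and `eval_batch_size` to the per-device arguments assumes a single device, and `output_dir` and the helper name are placeholders.

```python
# Hyperparameters from this card, keyed by TrainingArguments parameter names.
TRAINING_KWARGS = {
    "learning_rate": 5e-5,
    "per_device_train_batch_size": 8,  # ASSUMPTION: single device
    "per_device_eval_batch_size": 8,   # ASSUMPTION: single device
    "seed": 42,
    "adam_beta1": 0.9,
    "adam_beta2": 0.999,
    "adam_epsilon": 1e-8,
    "lr_scheduler_type": "linear",
    "num_train_epochs": 1,
}


def make_training_arguments(output_dir="gpt2-finetuned"):
    """Build TrainingArguments from the values above (hypothetical helper)."""
    # Imported here so the settings can be inspected without transformers.
    from transformers import TrainingArguments

    return TrainingArguments(output_dir=output_dir, **TRAINING_KWARGS)
```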

Training results

By the end of training, the loss had fallen from 4.339300 to 3.582800. Training for more epochs may reduce this value further.
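To make the loss figures easier to interpret, they can be converted to perplexity. The sketch below assumes the reported loss is the mean cross-entropy per token in nats, which is the usual convention for causal language model training but is not stated in this card. Under that assumption, the perplexity falls from roughly 77 to roughly 36.

```python
import math

START_LOSS = 4.3393  # loss at the start of training, from this card
END_LOSS = 3.5828    # loss at the end of training, from this card


def perplexity(loss):
    """Perplexity for a mean cross-entropy loss given in nats (assumed)."""
    return math.exp(loss)


relative_drop = (START_LOSS - END_LOSS) / START_LOSS

print(f"perplexity at start: {perplexity(START_LOSS):.1f}")
print(f"perplexity at end:   {perplexity(END_LOSS):.1f}")
print(f"relative loss drop:  {relative_drop:.1%}")
```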

Framework versions

  • Transformers 4.26.1
  • PyTorch 1.13.1+cu116
  • Datasets 2.9.0
  • Tokenizers 0.13.2