# SCRIPBOZO
This model is GPT2-Medium finetuned on chat logs from twitch.tv/MOONMOON.
## Data
The data consists of ~3.8GB of plaintext across 632 days of logs, ranging from 2021-01-01 to 2022-09-26, sourced from https://logs.ivr.fi/. The logs were cleaned by dropping:

- bots: messages from a (manually determined, non-exhaustive) list of bots
- links: messages matching the regex `r"[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)"`
- long messages: messages longer than 300 characters
- short messages: messages shorter than 2 characters
- caps spam: messages of more than 100 characters in which more than 80% of the characters are capital letters
- commands: messages starting with `!`
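The filters above can be sketched as a single predicate. The link regex is taken verbatim from the description; the bot list here is a hypothetical placeholder, since the actual list is described as manually determined and non-exhaustive.

```python
import re

# Link regex copied verbatim from the cleaning description.
LINK_RE = re.compile(
    r"[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b"
    r"([-a-zA-Z0-9()@:%_\+.~#?&//=]*)"
)

# Hypothetical bot list for illustration only; the real list is not published.
BOTS = {"nightbot", "streamelements", "moobot"}

def keep(username: str, message: str) -> bool:
    """Return True if a chat message survives all of the cleaning filters."""
    if username.lower() in BOTS:          # bots
        return False
    if LINK_RE.search(message):           # links
        return False
    if len(message) > 300:                # long messages
        return False
    if len(message) < 2:                  # short messages
        return False
    caps = sum(c.isupper() for c in message)
    if len(message) > 100 and caps / len(message) > 0.8:  # caps spam
        return False
    if message.startswith("!"):           # commands
        return False
    return True
```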
The data was batched into groups of up to 512 tokens, preferring to end on a newline (`\n`) rather than start another line and truncate it. The batches were then padded to 512 tokens using a pad token added to the model and tokenizer.
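The greedy packing described above can be sketched as follows. Whitespace splitting stands in for the real GPT-2 BPE tokenizer, and `<|pad|>` is a stand-in for whatever pad token was actually added; lines longer than the budget cannot occur here because overlong messages were already dropped during cleaning.

```python
def batch_lines(lines, max_tokens=512, pad_token="<|pad|>"):
    """Greedily pack lines into groups of up to max_tokens tokens,
    always breaking at a line boundary rather than truncating a line.
    Whitespace tokenization stands in for the GPT-2 tokenizer."""
    batches, current, current_len = [], [], 0
    for line in lines:
        n = len(line.split()) + 1  # +1 for the trailing newline token
        if current and current_len + n > max_tokens:
            batches.append(current)
            current, current_len = [], 0
        current.append(line)
        current_len += n
    if current:
        batches.append(current)
    # Pad each batch out to exactly max_tokens tokens.
    padded = []
    for b in batches:
        toks = [t for line in b for t in line.split() + ["\n"]]
        toks += [pad_token] * (max_tokens - len(toks))
        padded.append(toks)
    return padded
```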
10% of the data was set aside for validation.
## Training
Training was done on a system with a 6800XT (16GB of VRAM) and 32GB of RAM. The following hyperparameters were used:
- epochs: 1
- learning rate: 3e-4
- weight decay: 1e-4
- warmup ratio: 0.01
- optimizer: adamw_torch
- gradient accumulation steps: 1
- gradient checkpointing: true
- fp16: true
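For reference, these hyperparameters map directly onto a Hugging Face `TrainingArguments` configuration. This is a sketch of one plausible setup, not the exact training script; the `output_dir` is an arbitrary placeholder.

```python
from transformers import TrainingArguments

# Config fragment mirroring the hyperparameters listed above.
args = TrainingArguments(
    output_dir="./scripbozo",        # placeholder path
    num_train_epochs=1,
    learning_rate=3e-4,
    weight_decay=1e-4,
    warmup_ratio=0.01,
    optim="adamw_torch",
    gradient_accumulation_steps=1,
    gradient_checkpointing=True,     # trades compute for VRAM on the 16GB card
    fp16=True,
)
```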
## Evaluation
Evaluation was performed 10 times over the course of training, computing accuracy and perplexity on the held-out validation set.
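For context, these two metrics are standard for language models: perplexity is the exponential of the mean per-token cross-entropy loss, and accuracy here presumably means the fraction of next-token predictions that match the reference token. A minimal sketch:

```python
import math

def perplexity(mean_nll: float) -> float:
    """Perplexity = exp(mean per-token negative log-likelihood)."""
    return math.exp(mean_nll)

def token_accuracy(predictions, labels):
    """Fraction of positions where the argmax prediction equals the label."""
    correct = sum(p == l for p, l in zip(predictions, labels))
    return correct / len(labels)
```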