bigjoedata committed on
Commit • c58b005
1 Parent(s): 5e150da
Initial push
Browse files
- README.md +61 -0
- config.json +30 -0
- merges.txt +0 -0
- pytorch_model.bin +3 -0
- special_tokens_map.json +1 -0
- tokenizer_config.json +1 -0
- vocab.json +0 -0
README.md
ADDED
@@ -0,0 +1,61 @@
# 🎸 🥁 Rockbot 🎤 🎧

A [GPT-2](https://openai.com/blog/better-language-models/)-based lyrics generator fine-tuned on the writing styles of 16,000 songs by 270 artists across MANY genres (not just rock).

**Instructions:** Type in a fake song title, pick an artist, click "Generate".

Most language models are imprecise, and Rockbot is no exception. You may see NSFW lyrics unexpectedly; I have made no attempt to censor. Generated lyrics may be repetitive and/or incoherent at times, but hopefully you'll encounter something interesting or memorable.

Oh, and generation is resource-intensive and can be slow. I set governors on song length to keep generation time somewhat reasonable. You may adjust song length and other parameters on the left, or check out [GitHub](https://github.com/bigjoedata/rockbot) to spin up your own Rockbot.

Just have fun.

[Demo](https://share.streamlit.io/bigjoedata/rockbot/main/src/main.py) (adjust settings to increase speed)

[GitHub](https://github.com/bigjoedata/rockbot)

[GPT-2 124M version model page on Hugging Face](https://huggingface.co/bigjoedata/rockbot)

[DistilGPT2 version model page on Hugging Face](https://huggingface.co/bigjoedata/rockbot-distilgpt2/) This version is leaner, with the tradeoff that the lyrics are more simplistic.

🎹 🪘 🎷 🎺 🪗 🪕 🎻

## Background

With the shutdown of [Google Play Music](https://en.wikipedia.org/wiki/Google_Play_Music), I used Google's Takeout feature to gather the metadata for artists I've listened to over the past several years. I wanted to take advantage of this bounty to build something fun. I scraped the top 50 lyrics for artists I'd listened to at least once from [Genius](https://genius.com/), then fine-tuned [GPT-2's](https://openai.com/blog/better-language-models/) 124M parameter model using the [AITextGen](https://github.com/minimaxir/aitextgen) framework after considerable post-processing. For more on generation, see [here](https://huggingface.co/blog/how-to-generate).
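
A rough sketch of the scraping step with [LyricsGenius](https://lyricsgenius.readthedocs.io/en/master/); the API token and artist names are placeholders, and this is an assumption about the pipeline rather than the exact script used:

```python
import lyricsgenius

# Placeholder token; the real pipeline's credentials and artist list
# came from the Google Takeout metadata described above.
genius = lyricsgenius.Genius("GENIUS_API_TOKEN", remove_section_headers=True)

with open("lyrics.txt", "a", encoding="utf-8") as f:
    for name in ["Artist One", "Artist Two"]:  # hypothetical artist names
        artist = genius.search_artist(name, max_songs=50, sort="popularity")
        if artist is not None:
            for song in artist.songs:
                # Delimit songs with the eos token used later in training.
                f.write(song.lyrics + "\n<|endoftext|>\n")
```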

### Full Tech Stack

- [Google Play Music](https://en.wikipedia.org/wiki/Google_Play_Music) (R.I.P.)
- [Python](https://www.python.org/)
- [Streamlit](https://www.streamlit.io/)
- [GPT-2](https://openai.com/blog/better-language-models/)
- [AITextGen](https://github.com/minimaxir/aitextgen)
- [Pandas](https://pandas.pydata.org/)
- [LyricsGenius](https://lyricsgenius.readthedocs.io/en/master/)
- [Google Colab](https://colab.research.google.com/) (GPU-based training)
- [KNIME](https://www.knime.com/) (data cleaning)

## How to Use The Model

Please refer to [AITextGen](https://github.com/minimaxir/aitextgen) for much better documentation.

### Training Parameters Used

```python
from aitextgen import aitextgen

# Example instantiation added for completeness (an assumption):
# "124M" matches the GPT-2 size named in the Background section.
ai = aitextgen(tf_gpt2="124M")

ai.train("lyrics.txt",
         line_by_line=False,
         from_cache=False,
         num_steps=10000,
         generate_every=2000,
         save_every=2000,
         save_gdrive=False,
         learning_rate=1e-3,
         batch_size=3,
         eos_token="<|endoftext|>",
         # fp16=True
         )
```
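
The checkpoint pushed in this commit follows the standard Hugging Face layout (config.json, pytorch_model.bin, tokenizer files), so it should also load without aitextgen. A minimal sketch using `transformers`, assuming the `bigjoedata/rockbot` repo id from the links above:

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Repo id assumed from the model-page links above.
tokenizer = GPT2Tokenizer.from_pretrained("bigjoedata/rockbot")
model = GPT2LMHeadModel.from_pretrained("bigjoedata/rockbot")
model.eval()  # inference mode; no further fine-tuning here
```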

### To Use

Generate with a prompt (use Title Case):

```
Song Name
BY
Artist Name
```
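
For example, reusing the `ai` object from the training snippet above (the sampling parameters here are illustrative assumptions, not the demo app's exact settings):

```python
# Prompt follows the "Song Name / BY / Artist Name" format shown above.
prompt = "Song Name\nBY\nArtist Name\n"

ai.generate(n=1,
            prompt=prompt,
            max_length=256,
            temperature=1.0)
```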
config.json
ADDED
@@ -0,0 +1,30 @@
{
  "_name_or_path": "aitextgen/pytorch_model_355M.bin",
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "gradient_checkpointing": false,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 1024,
  "n_head": 16,
  "n_inner": null,
  "n_layer": 24,
  "n_positions": 1024,
  "n_vocab": 50257,
  "resid_pdrop": 0.1,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "use_cache": true,
  "vocab_size": 50257
}
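
`transformers` rebuilds these hyperparameters when the config is loaded from the Hub; a quick sanity check, with the repo id assumed from the README links:

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("bigjoedata/rockbot")
# 24 layers, 16 heads, 1024-dim embeddings, per the config.json above.
print(config.n_layer, config.n_head, config.n_embd)
```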
merges.txt
ADDED
The diff for this file is too large to render.
pytorch_model.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:fa7a3384dfbabc90d3f74175185ad17efe4de7dfe30090852bc153724b493e41
size 1444581811
special_tokens_map.json
ADDED
@@ -0,0 +1 @@
{"bos_token": {"content": "<|startoftext|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true}, "eos_token": {"content": "<|endoftext|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true}, "unk_token": {"content": "<|endoftext|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true}, "pad_token": "<|endoftext|>"}
tokenizer_config.json
ADDED
@@ -0,0 +1 @@
{"errors": "replace", "unk_token": {"content": "<|endoftext|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true, "__type": "AddedToken"}, "bos_token": {"content": "<|startoftext|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true, "__type": "AddedToken"}, "eos_token": {"content": "<|endoftext|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true, "__type": "AddedToken"}, "add_prefix_space": false, "pad_token": "<|endoftext|>"}
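
The special tokens configured above frame every song: `<|startoftext|>` as bos and `<|endoftext|>` as eos (also reused for unk and pad). A minimal sketch of inspecting them, with the repo id assumed from the README links:

```python
from transformers import GPT2Tokenizer

# Repo id assumed from the README links; the tokenizer files are in this commit.
tokenizer = GPT2Tokenizer.from_pretrained("bigjoedata/rockbot")

print(tokenizer.bos_token)     # <|startoftext|>, per special_tokens_map.json
print(tokenizer.eos_token)     # <|endoftext|>
print(tokenizer.eos_token_id)  # 50256, a natural stopping point for generation
```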
vocab.json
ADDED
The diff for this file is too large to render.