bigjoedata committed on
Commit
b2dcde1
•
1 Parent(s): 059af0e

Added artists trained longer

Files changed (5)
  1. README.md +51 -34
  2. config.json +2 -2
  3. merges.txt +0 -0
  4. pytorch_model.bin +2 -2
  5. vocab.json +0 -0
README.md CHANGED
@@ -1,44 +1,61 @@
1
- 🎹 🪘 🎷 🎺 🪗 🪕 🎻
2
- ## Rockbot Background
3
- Two of my passions are music and data! I realized I had a bounty of metadata from artists I've listened to over the past several years, and I decided to take advantage of it to build something fun. I scraped the top 50 lyrics for artists I'd listened to at least once from [Genius](https://genius.com/), added some other selected top artists, did a ton of post-processing, and trained a [GPT-2](https://openai.com/blog/better-language-models/) based model from scratch using the [AITextGen](https://github.com/minimaxir/aitextgen) framework. The UI / back end is built in [Streamlit](https://www.streamlit.io/). The vocabulary was built from scratch rather than fine-tuned off an existing model. I also fine-tuned a GPT-2 based model, available [here](https://huggingface.co/bigjoedata/rockbot), but this model weighs in at a fraction of the size.
4
 
5
- A demo is available [here](https://share.streamlit.io/bigjoedata/rockbot/main/src/main.py). Generation is resource intensive and can be slow in the demo. I set governors on song length to keep generation time somewhat reasonable. You may adjust song length and other parameters on the left or check out [Github](https://github.com/bigjoedata/rockbot) to spin up your own Rockbot.
6
 
7
- Data Prep Cleaning Notes:
8
- - Removed duplicate lyrics from each song
9
- - Deduped similar songs based on overall similarity to remove cover versions
10
- - Removed as much noise / junk as possible. There is still some.
11
- - Added tokens to delineate song
12
- - Used language to remove non-English versions of songs
13
- - Many others!
14
 
15
- ### Tech Stack and technical notes
16
 
17
- - [Python](https://www.python.org/).
18
- - [Streamlit](https://www.streamlit.io/).
19
- - [GPT-2](https://openai.com/blog/better-language-models/).
20
- - [AITextGen](https://github.com/minimaxir/aitextgen).
21
- - [LyricsGenius](https://lyricsgenius.readthedocs.io/en/master/) (retrieving lyrics for training).
22
- - [Knime](https://www.knime.com/) (data cleaning and post processing)
23
- - [GPT-2 generation](https://huggingface.co/blog/how-to-generate)
24
 
25
  ## How to Use The Model
26
- Please refer to [AITextGen](https://github.com/minimaxir/aitextgen) and [Huggingface](https://huggingface.co/) for much better documentation.
27
 
28
- Generate With Prompt (Use lower case for Song Name, First Line):
29
  Song Name
30
  BY
31
- Artist Name (Use unmodified from [Github](https://github.com/bigjoedata/rockbot/blob/main/theartists.parquet))
32
- Beginning of song
33
 
34
- ## Spin up your own with Docker
35
- Running your own is very easy. Visit my [Streamlit-Plus repository](https://github.com/bigjoedata/streamlit-plus) for more details on the image build
36
-
37
- - Install [Docker Compose](https://docs.docker.com/compose/install/)
38
- - Follow the following steps
39
- ```
40
- git clone https://github.com/bigjoedata/rockbot
41
- cd rockbot
42
- nano docker-compose.yml # Edit environmental variables for max song length and max songs to generate to match your computing power (higher is more resource intensive)
43
- docker-compose up -d # launch in daemon (background) mode
44
- ```
 
 
 
1
 
2
+ # 🎸 🥁 Rockbot 🎤 🎧
3
+ A [GPT-2](https://openai.com/blog/better-language-models/) based lyrics generator fine-tuned on the writing styles of 16000 songs by 270 artists across MANY genres (not just rock).
4
+
5
+ **Instructions:** Type in a fake song title, pick an artist, click "Generate".
6
+
7
+ Most language models are imprecise and Rockbot is no exception. You may see NSFW lyrics unexpectedly. I have made no attempts to censor. Generated lyrics may be repetitive and/or incoherent at times, but hopefully you'll encounter something interesting or memorable.
8
+
9
+ Oh, and generation is resource intensive and can be slow. I set governors on song length to keep generation time somewhat reasonable. You may adjust song length and other parameters on the left or check out [Github](https://github.com/bigjoedata/rockbot) to spin up your own Rockbot.
10
 
11
+ Just have fun.
12
 
13
+ [Demo](https://share.streamlit.io/bigjoedata/rockbot/main/src/main.py) Adjust settings to increase speed
14
+
15
+ [Github](https://github.com/bigjoedata/rockbot)
16
+
17
+ [GPT-2 124M version Model page on Hugging Face](https://huggingface.co/bigjoedata/rockbot)
18
+
19
+ [DistilGPT2 version Model page on Hugging Face](https://huggingface.co/bigjoedata/rockbot-distilgpt2/) This is leaner with the tradeoff being that the lyrics are more simplistic.
20
+
21
+ 🎹 🪘 🎷 🎺 🪗 🪕 🎻
22
+ ## Background
23
+ With the shutdown of [Google Play Music](https://en.wikipedia.org/wiki/Google_Play_Music), I used Google's Takeout function to gather the metadata from artists I've listened to over the past several years. I wanted to take advantage of this bounty to build something fun. I scraped the top 50 lyrics for artists I'd listened to at least once from [Genius](https://genius.com/), then, after considerable post-processing, fine-tuned [GPT-2's](https://openai.com/blog/better-language-models/) 124M parameter model using the [AITextGen](https://github.com/minimaxir/aitextgen) framework. For more on generation, see [here.](https://huggingface.co/blog/how-to-generate)
24
+
25
+ ### Full Tech Stack
26
+ [Google Play Music](https://en.wikipedia.org/wiki/Google_Play_Music) (R.I.P.).
27
+ [Python](https://www.python.org/).
28
+ [Streamlit](https://www.streamlit.io/).
29
+ [GPT-2](https://openai.com/blog/better-language-models/).
30
+ [AITextGen](https://github.com/minimaxir/aitextgen).
31
+ [Pandas](https://pandas.pydata.org/).
32
+ [LyricsGenius](https://lyricsgenius.readthedocs.io/en/master/).
33
+ [Google Colab](https://colab.research.google.com/) (GPU based Training).
34
+ [Knime](https://www.knime.com/) (data cleaning).
35
 
36
 
37
  ## How to Use The Model
38
+ Please refer to [AITextGen](https://github.com/minimaxir/aitextgen) for much better documentation.
39
+
40
+ ### Training Parameters Used
41
+
42
+ ai.train("lyrics.txt",
43
+ line_by_line=False,
44
+ from_cache=False,
45
+ num_steps=10000,
46
+ generate_every=2000,
47
+ save_every=2000,
48
+ save_gdrive=False,
49
+ learning_rate=1e-3,
50
+ batch_size=3,
51
+ eos_token="<|endoftext|>",
52
+ #fp16=True
53
+ )
54
+ ### To Use
55
+
56
 
57
+ Generate With Prompt (Use Title Case):
58
  Song Name
59
  BY
60
+ Artist Name
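The prompt layout above (song title, a literal `BY` line, then the artist) can be assembled programmatically. The sketch below is only an illustration and is not part of the repository: `build_prompt` is a hypothetical helper, and the example title and artist are made up. It assumes that only the song title is Title Cased, that the artist name is passed through unchanged, and that a trailing newline is wanted so generated lyrics start on their own line.

```python
import string


def build_prompt(song_title: str, artist: str) -> str:
    """Assemble a Rockbot-style prompt: title, literal BY line, artist.

    Hypothetical helper for illustration; not part of the rockbot repo.
    """
    # capwords capitalizes each whitespace-separated word and, unlike
    # str.title(), does not upper-case the letter after an apostrophe.
    title = string.capwords(song_title.strip())
    # Assumption: the artist name is used exactly as supplied.
    return f"{title}\nBY\n{artist.strip()}\n"


prompt = build_prompt("don't fear the data", "Example Artist")
print(prompt)
```

The resulting string could then be passed as the `prompt` argument when generating with AITextGen; see its documentation for the exact generation options.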
 
61
 
config.json CHANGED
@@ -23,7 +23,7 @@
23
  "summary_proj_to_labels": true,
24
  "summary_type": "cls_index",
25
  "summary_use_proj": true,
26
- "transformers_version": "4.2.2",
27
  "use_cache": true,
28
- "vocab_size": 50000
29
  }
23
  "summary_proj_to_labels": true,
24
  "summary_type": "cls_index",
25
  "summary_use_proj": true,
26
+ "transformers_version": "4.3.2",
27
  "use_cache": true,
28
+ "vocab_size": 75000
29
  }
merges.txt CHANGED
The diff for this file is too large to render. See raw diff
pytorch_model.bin CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:1eef82203f60b187801705744494677f03fac6087044defef197e62c28129dfe
3
- size 79137807
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:588e5186b28450aa011770f686a8d86652a7e429e15460e56b7f03cc14690ede
3
+ size 104737544
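The two changes in this commit line up arithmetically: `vocab_size` grows from 50000 to 75000, and the LFS pointer size grows from 79137807 to 104737544 bytes. As a rough sanity check, the sketch below divides the size increase by the number of new vocabulary rows to get the embedding width that would explain it. It assumes float32 weights and that essentially all of the growth comes from a single token-embedding matrix; neither is shown in this diff.

```python
# Values taken from the config.json and pytorch_model.bin hunks above.
OLD_VOCAB, NEW_VOCAB = 50_000, 75_000
OLD_SIZE, NEW_SIZE = 79_137_807, 104_737_544  # bytes

BYTES_PER_WEIGHT = 4  # assumption: weights are stored as float32

added_rows = NEW_VOCAB - OLD_VOCAB   # new vocabulary entries
added_bytes = NEW_SIZE - OLD_SIZE    # growth of the checkpoint file

# Assumption: the growth is one embedding matrix of shape
# (added_rows, width), so width = bytes / (rows * bytes_per_weight).
implied_width = added_bytes / (added_rows * BYTES_PER_WEIGHT)

print(added_bytes)
print(round(implied_width, 2))
```

Under those assumptions the implied width comes out very close to 256. The hidden size itself (`n_embd`) is not visible in the `config.json` hunk, so treat this as an estimate rather than a confirmed value.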
vocab.json CHANGED
The diff for this file is too large to render. See raw diff