Create README.md

#12
by ylacombe HF staff - opened
Files changed (1)
  1. README.md +188 -0
README.md ADDED
@@ -0,0 +1,188 @@
+ ---
+ language:
+ - en
+ - de
+ - es
+ - fr
+ - hi
+ - it
+ - ja
+ - ko
+ - pl
+ - pt
+ - ru
+ - tr
+ - zh
+ thumbnail: https://user-images.githubusercontent.com/5068315/230698495-cbb1ced9-c911-4c9a-941d-a1a4a1286ac6.png
+ library: "bark"
+ license: "cc-by-nc-4.0"
+ tags:
+ - bark
+ - audio
+ - text-to-speech
+ ---
+
+ # Bark
+
+ Bark is a transformer-based text-to-audio model created by [Suno](https://www.suno.ai).
+ Bark can generate highly realistic, multilingual speech as well as other audio, including music,
+ background noise and simple sound effects. The model can also produce nonverbal
+ communication such as laughing, sighing and crying. To support the research community,
+ we are providing access to pretrained model checkpoints ready for inference.
+
+ The original GitHub repo and model card can be found [here](https://github.com/suno-ai/bark).
+
+ This model is meant for research purposes only.
+ The model output is not censored and the authors do not endorse the opinions in the generated content.
+ Use at your own risk.
+
+ Two checkpoints are released:
+ - [**small** (this checkpoint)](https://huggingface.co/suno/bark-small)
+ - [large](https://huggingface.co/suno/bark)
+
+
+ ## Example
+
+ Try out Bark yourself!
+
+ * Bark Colab:
+
+ <a target="_blank" href="https://colab.research.google.com/drive/1eJfA2XUa-mXwdMy7DoYKVYHI1iTd9Vkt?usp=sharing">
+ <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
+ </a>
+
+ * Hugging Face Colab:
+
+ <a target="_blank" href="https://colab.research.google.com/drive/1dWWkZzvu7L9Bunq9zvD-W02RFUXoW-Pd?usp=sharing">
+ <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
+ </a>
+
+ * Hugging Face Demo:
+
+ <a target="_blank" href="https://huggingface.co/spaces/suno/bark">
+ <img src="https://huggingface.co/datasets/huggingface/badges/raw/main/open-in-hf-spaces-sm.svg" alt="Open in HuggingFace"/>
+ </a>
+
+
+ ## 🤗 Transformers Usage
+
+
+ You can run Bark locally with the 🤗 Transformers library from version 4.31.0 onwards.
+
+ 1. First install the 🤗 [Transformers library](https://github.com/huggingface/transformers) from main:
+
+ ```
+ pip install git+https://github.com/huggingface/transformers.git
+ ```
+
+ 2. Run the following Python code to generate speech samples:
+
+ ```python
+ from transformers import AutoProcessor, AutoModel
+
+ # load the processor and the model checkpoint from the Hub
+ processor = AutoProcessor.from_pretrained("suno/bark-small")
+ model = AutoModel.from_pretrained("suno/bark-small")
+
+ inputs = processor(
+     text=["Hello, my name is Suno. And, uh — and I like pizza. [laughs] But I also have other interests such as playing tic tac toe."],
+     return_tensors="pt",
+ )
+
+ # sample a waveform from the text prompt
+ speech_values = model.generate(**inputs, do_sample=True)
+ ```
+
+ 3. Listen to the speech samples either in a Jupyter notebook:
+
+ ```python
+ from IPython.display import Audio
+
+ sampling_rate = model.generation_config.sample_rate
+ Audio(speech_values.cpu().numpy().squeeze(), rate=sampling_rate)
+ ```
+
+ Or save them as a `.wav` file using a third-party library, e.g. `scipy`:
+
+ ```python
+ import scipy.io.wavfile
+
+ sampling_rate = model.generation_config.sample_rate
+ scipy.io.wavfile.write("bark_out.wav", rate=sampling_rate, data=speech_values.cpu().numpy().squeeze())
+ ```
+
+ For more details on using the Bark model for inference with the 🤗 Transformers library, refer to the [Bark docs](https://huggingface.co/docs/transformers/model_doc/bark).
+
+ ## Suno Usage
+
+ You can also run Bark locally through the original [Bark library](https://github.com/suno-ai/bark):
+
+ 1. First install the [`bark` library](https://github.com/suno-ai/bark).
+
+ 2. Run the following Python code:
+
+ ```python
+ from bark import SAMPLE_RATE, generate_audio, preload_models
+ from IPython.display import Audio
+
+ # download and load all models
+ preload_models()
+
+ # generate audio from text
+ text_prompt = """
+ Hello, my name is Suno. And, uh — and I like pizza. [laughs]
+ But I also have other interests such as playing tic tac toe.
+ """
+ speech_array = generate_audio(text_prompt)
+
+ # play audio in notebook
+ Audio(speech_array, rate=SAMPLE_RATE)
+ ```
+
+ [pizza.webm](https://user-images.githubusercontent.com/5068315/230490503-417e688d-5115-4eee-9550-b46a2b465ee3.webm)
+
+
+ To save `speech_array` as a WAV file:
+
+ ```python
+ from scipy.io.wavfile import write as write_wav
+
+ write_wav("/path/to/audio.wav", SAMPLE_RATE, speech_array)
+ ```
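
Where `scipy` is not available, the save step can also be sketched with Python's standard-library `wave` module. This is a minimal sketch and not part of the Bark library: `write_pcm16_wav` and `sine` are hypothetical helpers, the 24 kHz rate is an assumption meant to mirror `SAMPLE_RATE`, and the input is assumed to be mono float samples in [-1.0, 1.0]. A synthetic tone stands in for the generated audio so the example runs without downloading any model.

```python
import math
import struct
import wave

SAMPLE_RATE = 24_000  # assumption: stands in for bark.SAMPLE_RATE


def write_pcm16_wav(path, samples, sample_rate=SAMPLE_RATE):
    """Write mono float samples in [-1.0, 1.0] as a 16-bit PCM WAV file."""
    # Clip to the valid range, then scale to signed 16-bit integers.
    clipped = (max(-1.0, min(1.0, float(s))) for s in samples)
    frames = b"".join(struct.pack("<h", int(round(s * 32767))) for s in clipped)
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes per sample = 16-bit
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(frames)


def sine(freq_hz, seconds, sample_rate=SAMPLE_RATE):
    """Return a half-amplitude sine tone as a list of floats (placeholder audio)."""
    n = int(seconds * sample_rate)
    return [0.5 * math.sin(2 * math.pi * freq_hz * i / sample_rate) for i in range(n)]


if __name__ == "__main__":
    write_pcm16_wav("tone.wav", sine(440.0, 0.5))
```

With Bark installed, the analogous call would be `write_pcm16_wav("bark_out.wav", speech_array.tolist(), SAMPLE_RATE)`, assuming `speech_array` is a one-dimensional float array.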
+
+ ## Model Details
+
+
+ The following is additional information about the models released here.
+
+ Bark is a series of three transformer models that turn text into audio.
+
+ ### Text to semantic tokens
+ - Input: text, tokenized with the [BERT tokenizer from Hugging Face](https://huggingface.co/docs/transformers/model_doc/bert#transformers.BertTokenizer)
+ - Output: semantic tokens that encode the audio to be generated
+
+ ### Semantic to coarse tokens
+ - Input: semantic tokens
+ - Output: tokens from the first two codebooks of the [EnCodec codec](https://github.com/facebookresearch/encodec) from Facebook
+
+ ### Coarse to fine tokens
+ - Input: the first two codebooks from EnCodec
+ - Output: 8 codebooks from EnCodec
+
+ ### Architecture
+ | Model | Parameters | Attention | Output Vocab size |
+ |:-------------------------:|:----------:|------------|:-----------------:|
+ | Text to semantic tokens | 80/300 M | Causal | 10,000 |
+ | Semantic to coarse tokens | 80/300 M | Causal | 2x 1,024 |
+ | Coarse to fine tokens | 80/300 M | Non-causal | 6x 1,024 |
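
The stage descriptions and the architecture table can be read as a small data structure. The following is an illustrative sketch only: the `Stage` class and the `codebooks_after` helper are hypothetical names, not part of Bark or Transformers, and the numbers are copied from the table above. It checks that the two coarse codebooks plus the six fine ones add up to the 8 EnCodec codebooks listed as the final output.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    attention: str       # "causal" or "non-causal", as listed in the table
    new_codebooks: int   # EnCodec codebooks this stage predicts (0 = none)
    vocab_size: int      # output vocabulary size (per codebook where applicable)


# Values copied from the architecture table; the structure itself is illustrative.
STAGES = [
    Stage("text_to_semantic", "causal", 0, 10_000),
    Stage("semantic_to_coarse", "causal", 2, 1_024),
    Stage("coarse_to_fine", "non-causal", 6, 1_024),
]


def codebooks_after(stages):
    """Return the cumulative number of EnCodec codebooks available after each stage."""
    total = 0
    result = {}
    for stage in stages:
        total += stage.new_codebooks
        result[stage.name] = total
    return result


CODEBOOKS = codebooks_after(STAGES)

if __name__ == "__main__":
    for name, count in CODEBOOKS.items():
        print(f"{name}: {count} codebook(s)")
```

Only the final stage uses non-causal attention, which fits its role of filling in the remaining codebooks for a coarse sequence that already exists in full.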
+
+
+ ### Release date
+ April 2023
+
+ ## Broader Implications
+ We anticipate that this model's text-to-audio capabilities can be used to improve accessibility tools in a variety of languages.
+
+ While we hope that this release will enable users to express their creativity and build applications that are a force
+ for good, we acknowledge that any text-to-audio model has the potential for dual use. While it is not straightforward
+ to clone the voices of known people with Bark, it can still be used for nefarious purposes. To further reduce the chances of unintended use of Bark,
+ we also release a simple classifier to detect Bark-generated audio with high accuracy (see notebooks section of the main repository).