---
license: apache-2.0
language:
- es
---
# PL-BERT-wordpiece-es

## Overview

<details>
<summary>Click to expand</summary>

- **Model type:** Phoneme-level Language Model (PL-BERT)
- **Architecture:** ALBERT-base (12 layers, 768 hidden units, 12 attention heads)
- **Language:** Spanish
- **License:** Apache 2.0
- **Data:** Crowdsourced phonemized Spanish speech text

</details>

---

## Model description

**PL-BERT-wordpiece-es** is a phoneme-level masked language model trained on Spanish text with diverse regional accents. It is based on the [PL-BERT architecture](https://github.com/yl4579/PL-BERT), which learns phoneme representations via a BERT-style masked language modeling objective.

This model is designed to support **phoneme-based text-to-speech (TTS) systems**, including but not limited to [StyleTTS2](https://github.com/yl4579/StyleTTS2). Thanks to its Spanish-specific phoneme vocabulary and contextual embedding capabilities, it can serve as a phoneme encoder for any TTS architecture requiring phoneme-level features.
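
As a phoneme encoder, the model is used like any masked LM: feed phoneme IDs in, take the hidden states out. A minimal sketch, assuming the StyleTTS2 setup described in "How to use" below, and assuming (based on the reference PL-BERT loader) that the bundled `util.py` returns the hidden-state tensor directly:

```python
# Minimal sketch: contextual phoneme embeddings from PL-BERT.
# Setup follows "How to use" below; the IDs here are placeholder
# values, not real entries from token_maps.pkl.
import torch
from Utils.PLBERT_es.util import load_plbert

plbert = load_plbert("Utils/PLBERT_es")
phoneme_ids = torch.tensor([[5, 23, 88, 102, 7, 41]])  # (batch, seq_len)
with torch.no_grad():
    hidden = plbert(phoneme_ids)  # (1, seq_len, 768) with the bundled util.py
```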

Features of our PL-BERT:
- It is trained **exclusively on Spanish** phonemized text.
- It uses a reduced **phoneme vocabulary of 178 tokens**.
- It uses a WordPiece tokenizer.
- It includes a custom `token_maps.pkl` and an adapted `util.py` (see the snippet after this list).
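
The `token_maps.pkl` file is an ordinary pickled mapping; a quick way to inspect it (treating its exact entry layout as an assumption carried over from the reference PL-BERT implementation):

```python
# Sketch: inspect the phoneme-to-ID table shipped with the model.
import pickle

with open("token_maps.pkl", "rb") as f:
    token_maps = pickle.load(f)

print(len(token_maps))               # size of the mapping
print(list(token_maps.items())[:5])  # peek at a few entries
```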

---

## Intended uses and limitations

### Intended uses

- Integration into phoneme-based TTS pipelines such as StyleTTS2, Matxa-TTS, or custom diffusion-based synthesizers.
- Accent-aware synthesis and phoneme embedding extraction for Spanish.

### Limitations

- Not designed for general NLP tasks like classification or sentiment analysis.
- Only supports Spanish phoneme tokens.
- Some accents may be underrepresented in the training data.

---

## How to use (with StyleTTS2)

Here is an example of how to use this model within the StyleTTS2 framework:

1. Clone the StyleTTS2 repository: https://github.com/yl4579/StyleTTS2
2. Inside the `Utils` directory, create a new folder, for example `PLBERT_es`.
3. Copy the following files into that folder:
   - `config.yml` (training configuration)
   - `step_1000000.t7` (trained checkpoint)
   - `token_maps.pkl` (phoneme-to-ID mapping)
   - `util.py` (modified to fix position ID loading)
4. In your StyleTTS2 configuration file, update the `PLBERT_dir` entry to:

   `PLBERT_dir: Utils/PLBERT_es`

5. Update the import statement in your code to:

   `from Utils.PLBERT_es.util import load_plbert`

   (Steps 4 and 5 are shown together in the sketch after this list.)

6. Use `espeak-ng` with the language code `es-419` to phonemize your Spanish text files for training and validation (see the phonemization sketch at the end of this section).
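
Steps 4 and 5 amount to the following. A minimal sketch: that `load_plbert` reads `config.yml` and the checkpoint from the directory it is given is an assumption based on the reference PL-BERT loader.

```python
# Minimal sketch: load the Spanish PL-BERT from inside a StyleTTS2 checkout.
from Utils.PLBERT_es.util import load_plbert

plbert = load_plbert("Utils/PLBERT_es")  # same path as PLBERT_dir in step 4
plbert.eval()                            # inference mode for embedding extraction
```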

Note: Although this example uses StyleTTS2, the model is compatible with other TTS architectures that operate on phoneme sequences. You can use the contextualized phoneme embeddings from PL-BERT in any compatible synthesis system.
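
For step 6, one convenient front end for espeak-ng is the `phonemizer` package (an assumption; any wrapper that invokes espeak-ng with the `es-419` voice works):

```python
# Sketch: phonemize Spanish text with espeak-ng via the phonemizer package.
from phonemizer import phonemize

ipa = phonemize(
    "¿Cómo estás hoy?",
    language="es-419",           # Latin American Spanish, as recommended above
    backend="espeak",            # phonemizer's espeak backend drives espeak-ng
    preserve_punctuation=True,
    with_stress=True,
)
print(ipa)
```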

---

## Training

### Training data

The model was trained on a Spanish corpus phonemized using espeak-ng. It uses a consistent phoneme token set with boundary markers and masking tokens.

- Tokenizer: custom (split on whitespace)
- Phoneme masking strategy: word-level and phoneme-level masking and replacement (a sketch follows this list)
- Training steps: 1,000,000
- Precision: mixed (fp16)
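
The masking strategy can be pictured as follows. This is an illustrative sketch, not the released training code: the function and variable names are hypothetical, and only the probabilities come from the configuration below.

```python
# Illustrative sketch of word- and phoneme-level masking with replacement.
# Probabilities match the "Other parameters" section below.
import random

def mask_sequence(words, vocab, mask_token="M",
                  word_mask_p=0.15, phoneme_mask_p=0.1, replace_p=0.2):
    """words: list of words, each a list of phoneme tokens."""
    out = []
    for word in words:
        if random.random() < word_mask_p:
            out.append([mask_token] * len(word))      # mask the whole word
            continue
        masked = []
        for ph in word:
            r = random.random()
            if r < phoneme_mask_p:
                masked.append(mask_token)             # mask a single phoneme
            elif r < phoneme_mask_p + replace_p:
                masked.append(random.choice(vocab))   # replace with a random phoneme
            else:
                masked.append(ph)                     # keep unchanged
        out.append(masked)
    return out
```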

### Training configuration

Model parameters:

- Vocabulary size: 178
- Hidden size: 768
- Attention heads: 12
- Intermediate size: 2048
- Number of layers: 12
- Max position embeddings: 512
- Dropout: 0.1

Other parameters:

- Batch size: 8
- Max mel length: 512
- Word mask probability: 0.15
- Phoneme mask probability: 0.1
- Replacement probability: 0.2
- Token separator: space
- Token mask: M
- Word separator ID: 102
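
For reference, the model parameters above map onto a Hugging Face `AlbertConfig` roughly as follows; spelling the single `Dropout: 0.1` value out as the two `transformers` dropout probabilities is an assumption:

```python
# Sketch: the architecture above expressed as a transformers AlbertConfig.
from transformers import AlbertConfig

config = AlbertConfig(
    vocab_size=178,
    hidden_size=768,
    num_attention_heads=12,
    intermediate_size=2048,
    num_hidden_layers=12,
    max_position_embeddings=512,
    hidden_dropout_prob=0.1,           # "Dropout: 0.1" (assumed mapping)
    attention_probs_dropout_prob=0.1,  # same assumption
)
```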

---

## Evaluation

The model has not been benchmarked via perplexity or extrinsic evaluation, but it has been successfully integrated into TTS pipelines such as StyleTTS2, where it enables Spanish speech synthesis.

---

## Additional information

### Contact

For questions or feedback, please contact:
rodolfo.zevallos@bsc.es

### License

Distributed under the Apache License, Version 2.0: https://www.apache.org/licenses/LICENSE-2.0

### Citation

Citation coming soon. Please cite the Hugging Face model card once published.

### Disclaimer

<details>
<summary>Click to expand</summary>

This model is released for research and educational use. It may exhibit biases or limitations based on training data characteristics. Users are responsible for ensuring appropriate use in deployed systems and for complying with all applicable regulations.

</details>