# <a name="introduction"></a> XPhoneBERT: A Pre-trained Multilingual Model for Phoneme Representations for Text-to-Speech

XPhoneBERT is the first pre-trained multilingual model for phoneme representations for text-to-speech (TTS). XPhoneBERT has the same model architecture as BERT-base and is trained using the RoBERTa pre-training approach on 330M phoneme-level sentences from nearly 100 languages and locales. Experimental results show that employing XPhoneBERT as an input phoneme encoder significantly boosts the performance of a strong neural TTS model in terms of naturalness and prosody, and also helps produce fairly high-quality speech with limited training data.

The general architecture and experimental results of XPhoneBERT can be found in [our INTERSPEECH 2023 paper](https://www.doi.org/10.21437/Interspeech.2023-444):

    @inproceedings{xphonebert,
        title     = {{XPhoneBERT: A Pre-trained Multilingual Model for Phoneme Representations for Text-to-Speech}},
        author    = {Linh The Nguyen and Thinh Pham and Dat Quoc Nguyen},
        booktitle = {Proceedings of the 24th Annual Conference of the International Speech Communication Association (INTERSPEECH)},
        year      = {2023},
        pages     = {5506--5510}
    }

**Please CITE** our paper when XPhoneBERT is used to help produce published results or is incorporated into other software.

## <a name="transformers"></a> Using XPhoneBERT with `transformers`

### Installation <a name="install2"></a>

- Install `transformers` with pip: `pip install transformers`, or install `transformers` [from source](https://huggingface.co/docs/transformers/installation#installing-from-source).

- Install `text2phonemesequence`: `pip install text2phonemesequence` <br> Our [`text2phonemesequence`](https://github.com/thelinhbkhn2014/Text2PhonemeSequence) package converts text sequences into phoneme-level sequences and was employed to construct our multilingual phoneme-level pre-training data. We build `text2phonemesequence` by incorporating the [CharsiuG2P](https://github.com/lingjzhu/CharsiuG2P/tree/main) and [segments](https://pypi.org/project/segments/) toolkits, which perform text-to-phoneme conversion and phoneme segmentation, respectively.

- **Notes**

    - Initializing `text2phonemesequence` for each language requires its corresponding ISO 639-3 code. The ISO 639-3 codes of supported languages are available [here](https://github.com/VinAIResearch/XPhoneBERT/blob/main/LanguageISO639-3Codes.md).

    - `text2phonemesequence` takes a word-segmented sequence as input. Users might also perform text normalization on the word-segmented sequence before feeding it into `text2phonemesequence`. When creating our pre-training data, we perform word and sentence segmentation on all text documents in each language using the [spaCy](https://spacy.io) toolkit, except for Vietnamese, where we employ the [VnCoreNLP](https://github.com/vncorenlp/VnCoreNLP) toolkit. We also use the text normalization component from the [NVIDIA NeMo toolkit](https://github.com/NVIDIA/NeMo) for English, German, Spanish and Chinese, and the [Vinorm](https://github.com/v-nhandt21/Vinorm) text normalization package for Vietnamese.
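
To illustrate what "word-segmented" means here, the following is a minimal sketch that separates punctuation from words with a regular expression. It is only a simplistic stand-in for a real segmenter such as spaCy or VnCoreNLP: `naive_word_segment` is a hypothetical helper, not part of `text2phonemesequence`, and it would not handle contractions or languages written without spaces (e.g. Japanese or Chinese).

```python
import re


def naive_word_segment(text: str) -> str:
    """Separate punctuation from words and join all tokens with single spaces.

    Hypothetical helper for illustration only; a proper toolkit should be
    preferred for real data.
    """
    # \w+ matches a run of word characters; [^\w\s] matches one punctuation mark.
    tokens = re.findall(r"\w+|[^\w\s]", text)
    return " ".join(tokens)


segmented = naive_word_segment("That is, it is a testing text.")
print(segmented)  # That is , it is a testing text .
```

The resulting string has the same shape as the English input used in the example usage section, with every token, including punctuation, separated by a space.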

### <a name="models2"></a> Pre-trained model

Model | #params | Arch. | Max length | Pre-training data
---|---|---|---|---
`vinai/xphonebert-base` | 88M | base | 512 | 330M phoneme-level sentences from nearly 100 languages and locales

### Example usage <a name="usage2"></a>

```python
import torch
from transformers import AutoModel, AutoTokenizer
from text2phonemesequence import Text2PhonemeSequence

# Load XPhoneBERT model and its tokenizer
xphonebert = AutoModel.from_pretrained("vinai/xphonebert-base")
tokenizer = AutoTokenizer.from_pretrained("vinai/xphonebert-base")

# Load Text2PhonemeSequence (set is_cuda=False when no GPU is available)
# text2phone_model = Text2PhonemeSequence(language='eng-us', is_cuda=True)
text2phone_model = Text2PhonemeSequence(language='jpn', is_cuda=True)

# Input sequence that is already WORD-SEGMENTED (and text-normalized if applicable)
# sentence = "That is , it is a testing text ."
sentence = "これ は 、 テスト テキスト です ."

# Convert the word-segmented text into a phoneme-level sequence
input_phonemes = text2phone_model.infer_sentence(sentence)

# Tokenize the phoneme sequence into PyTorch tensors
input_ids = tokenizer(input_phonemes, return_tensors="pt")

# Extract features without tracking gradients
with torch.no_grad():
    features = xphonebert(**input_ids)
```
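
When several phoneme sequences are batched with padding, the padded positions also receive hidden states that a downstream TTS model would normally ignore. The following is a minimal sketch of zeroing them out, assuming the model output exposes `last_hidden_state` (as `transformers` encoder outputs do) and a hidden size of 768 as in BERT-base. `mask_padded_states` is a hypothetical helper, not part of XPhoneBERT or `transformers`, and random tensors stand in for real model outputs so the snippet runs without downloading the model.

```python
import torch


def mask_padded_states(last_hidden_state: torch.Tensor,
                       attention_mask: torch.Tensor):
    """Zero out hidden states at padded positions and return sequence lengths.

    Hypothetical helper for illustration only.
    last_hidden_state: (batch, seq_len, hidden) float tensor
    attention_mask:    (batch, seq_len) tensor with 1 for real tokens, 0 for padding
    """
    # Broadcast the mask over the hidden dimension.
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    lengths = attention_mask.sum(dim=1)
    return last_hidden_state * mask, lengths


# Dummy tensors standing in for features.last_hidden_state and
# input_ids["attention_mask"] (assumed shapes: batch=2, seq_len=5, hidden=768).
hidden = torch.randn(2, 5, 768)
attention_mask = torch.tensor([[1, 1, 1, 1, 1],
                               [1, 1, 1, 0, 0]])

states, lengths = mask_padded_states(hidden, attention_mask)
print(states.shape, lengths.tolist())
```

With real data, one would pass `features.last_hidden_state` and the `attention_mask` returned by the tokenizer when called with `padding=True`.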