Upload 11 files


Files changed (10)

README.md +160 -0
config.json +52 -0
config.original.json +52 -0
gpt2_model.safetensors +3 -0
gpt_config.py +143 -0
special_tokens_map.json +6 -0
tokenizer.json +0 -0
tokenizer.py +928 -0
tokenizer_config.json +192 -0
xtts2_gpt_modeling.py +460 -0

README.md ADDED

	@@ -0,0 +1,160 @@

+---
+license: apache-2.0
+base_model:
+- coqui/XTTS-v2
+---
+# Auralis 🌌
+## Model Details 🛠️
+**Model Name:** Auralis
+**Model Architecture:** Based on [Coqui XTTS-v2](https://huggingface.co/coqui/XTTS-v2)
+**License:**
+- license: Apache 2.0
+- base_model: XTTS-v2 Components [Coqui AI License](https://coqui.ai/cpml)
+**Language Support:** English, Spanish, French, German, Italian, Portuguese, Polish, Turkish, Russian, Dutch, Czech, Arabic, Chinese (Simplified), Hungarian, Korean, Japanese, Hindi
+**Developed by:** [AstraMind.ai](https://www.astramind.ai)
+**GitHub:** [AstraMind AI](https://github.com/astramind-ai/Auralis/tree/main)
+**Primary Use Case:** Text-to-Speech (TTS) generation for real-world applications, including books, dialogues, and multilingual tasks.
+---
+## Model Description 🚀
+Auralis transforms text into natural, high-quality speech with exceptional speed and scalability. It is powered by [Coqui XTTS-v2](https://huggingface.co/coqui/XTTS-v2) and optimized for both consumer-grade and high-performance GPUs. Auralis is designed to meet real-world needs like long-text processing, voice cloning, and concurrent request handling.
+### Key Features:
+- **Warp-Speed Processing:** Generate speech for an entire novel (e.g., Harry Potter) in ~10 minutes.
+- **Hardware Friendly:** Requires <10GB VRAM on a single NVIDIA RTX 3090.
+- **Scalable:** Handles multiple requests simultaneously.
+- **Streaming:** Seamlessly processes long texts in a streaming format.
+- **Custom Voices:** Enables voice cloning from short reference audio.
+---
+## Quick Start ⭐
+```python
+from auralis import TTS, TTSRequest
+# Initialize the model
+tts = TTS().from_pretrained("AstraMindAI/xtts2-gpt")
+# Create a TTS request
+request = TTSRequest(
+    text="Hello Earth! This is Auralis speaking.",
+    speaker_files=["reference.wav"]
+)
+# Generate speech
+output = tts.generate_speech(request)
+output.save("output.wav")
+```
+---
+## Ebook Generation 📚
+Auralis converts ebooks into audio at lightning speed. For a complete Python script, see [ebook_audio_generator.py](https://github.com/astramind-ai/Auralis/blob/main/examples/vocalize_a_ebook.py).
+```python
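+# Assumes AudioPreprocessingConfig is exported at the package top level
+from auralis import TTS, TTSRequest, AudioPreprocessingConfig
+# Same model as in Quick Start
+tts = TTS().from_pretrained("AstraMindAI/xtts2-gpt")
+def process_book(chapter_file: str, speaker_file: str):
+    # Read chapter
+    with open(chapter_file, 'r', encoding='utf-8') as f:
+        chapter = f.read()
+    # You can pass the whole book; Auralis takes care of splitting
+    request = TTSRequest(
+        text=chapter,
+        speaker_files=[speaker_file],
+        audio_config=AudioPreprocessingConfig(
+            enhance_speech=True,
+            normalize=True
+        )
+    )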
+    output = tts.generate_speech(request)
+    output.play()
+    output.save("chapter_output.wav")
+# Example usage
+process_book("chapter1.txt", "reference_voice.wav")
+```
+---
+## Intended Use 🌟
+Auralis is designed for:
+- **Content Creators:** Generate audiobooks, podcasts, or voiceovers.
+- **Developers:** Integrate TTS into applications via a simple Python API.
+- **Accessibility:** Provide audio versions of digital content for people with visual or reading difficulties.
+- **Multilingual Scenarios:** Convert text to speech in multiple supported languages.
+---
+## Performance 📊
+**Benchmarks on NVIDIA RTX 3090:**
+- Short phrases (<100 characters): ~1 second
+- Medium texts (<1,000 characters): ~5-10 seconds
+- Full books (~100,000 characters): ~10 minutes
+**Memory Usage:**
+- Base VRAM: ~4GB
+- Peak VRAM: ~10GB
+---
+## Model Features 🛸
+1. **Speed & Efficiency:**
+   - Smart batching for rapid processing of long texts.
+   - Memory-optimized for consumer GPUs.
+2. **Easy Integration:**
+   - Python API with support for synchronous and asynchronous workflows.
+   - Streaming mode for continuous playback during generation.
+3. **Audio Quality Enhancements:**
+   - Background noise reduction.
+   - Voice clarity and volume normalization.
+   - Customizable audio preprocessing.
+4. **Multilingual Support:**
+   - Automatic language detection.
+   - High-quality speech in the 17 supported languages.
+5. **Customization:**
+   - Voice cloning using short reference clips.
+   - Adjustable parameters for tone, pacing, and language.
+---
+## Limitations & Ethical Considerations ⚠️
+- **Voice Cloning Risks:** Auralis supports voice cloning, which may raise ethical concerns about misuse. Use responsibly and ensure proper consent.
+- **Accent Limitations:** While robust for many languages, accents and intonations may vary based on the input.
+---
+## Citation 📜
+If you use Auralis in your research or projects, please cite:
+```bibtex
+@misc{auralis2024,
+  author = {AstraMind AI},
+  title = {Auralis: High-Performance Text-to-Speech Engine},
+  year = {2024},
+  url = {https://huggingface.co/AstraMindAI/auralis}
+}
+```
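
Note on the README's Performance figures: the full-book benchmark implies a rough throughput, which can be made explicit. The sketch below is only back-of-the-envelope arithmetic on the numbers quoted in the README. It assumes generation time scales linearly with character count, which the README does not state, and the function names are illustrative rather than part of the Auralis API.

```python
# Back-of-the-envelope throughput from the README's RTX 3090 figures.
# Assumption: generation time scales linearly with character count.

BOOK_CHARS = 100_000      # "Full books (~100,000 characters)"
BOOK_SECONDS = 10 * 60    # "~10 minutes"


def chars_per_second(num_chars: int, seconds: float) -> float:
    """Average throughput for a run of `num_chars` that took `seconds`."""
    return num_chars / seconds


def estimate_seconds(num_chars: int, rate: float) -> float:
    """Estimated wall-clock time for `num_chars` at `rate` chars/second."""
    return num_chars / rate


book_rate = chars_per_second(BOOK_CHARS, BOOK_SECONDS)
print(f"implied throughput: {book_rate:.0f} chars/s")
print(f"400,000 characters: ~{estimate_seconds(400_000, book_rate) / 60:.0f} min")
```

By the same arithmetic, the short-phrase figure (<100 characters in ~1 second) corresponds to at most 100 chars/s, so the higher long-text rate presumably reflects the batching the README mentions.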

config.json ADDED

	@@ -0,0 +1,52 @@

+{
+  "model_type": "xtts_gpt",
+  "architectures": [
+    "XttsGPT"
+  ],
+  "vocab_size": 6681,
+  "hidden_size": 1024,
+  "num_hidden_layers": 30,
+  "num_attention_heads": 16,
+  "n_inner": 4096,
+  "number_text_tokens": 6681,
+  "num_audio_tokens": 1026,
+  "max_audio_tokens": 605,
+  "start_audio_token": 1024,
+  "stop_audio_token": 1025,
+  "max_text_tokens": 402,
+  "max_prompt_tokens": 70,
+  "activation_function": "gelu_new",
+  "attn_pdrop": 0.1,
+  "layer_norm_epsilon": 1e-05,
+  "initializer_range": 0.02,
+  "use_masking_gt_prompt_approach": true,
+  "use_perceiver_resampler": true,
+  "kv_cache": true,
+  "enable_redaction": false,
+  "reorder_and_upcast_attn": false,
+  "scale_attn_by_inverse_layer_idx": false,
+  "auto_map": {
+    "AutoConfig": "AstraMindAI/xtts2-gpt--gpt_config.XTTSGPTConfig",
+    "AutoModelForCausalLM": "AstraMindAI/xtts2-gpt--xtts2_gpt_modeling.XttsGPT",
+    "AutoTokenizer": "AstraMindAI/xtts2-gpt--tokenizer.XTTSTokenizerFast"
+  },
+  "languages": [
+    "en",
+    "es",
+    "fr",
+    "de",
+    "it",
+    "pt",
+    "pl",
+    "tr",
+    "ru",
+    "nl",
+    "cs",
+    "ar",
+    "zh-cn",
+    "hu",
+    "ko",
+    "ja",
+    "hi"
+  ]
+}
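
Note on config.json: a few derived quantities follow directly from the values above and can serve as a quick sanity check. The sketch below uses only the standard library and a hand-copied subset of the file. Reading ids 0..1023 as ordinary audio codes is an interpretation of the start/stop ids, not something the config states.

```python
import json

# Subset of the config.json values shown above.
CONFIG_JSON = """
{
  "hidden_size": 1024,
  "num_attention_heads": 16,
  "n_inner": 4096,
  "num_audio_tokens": 1026,
  "start_audio_token": 1024,
  "stop_audio_token": 1025
}
"""

cfg = json.loads(CONFIG_JSON)

# Per-head width: the hidden size must divide evenly across heads.
head_dim, remainder = divmod(cfg["hidden_size"], cfg["num_attention_heads"])
assert remainder == 0, "hidden_size must be a multiple of num_attention_heads"

# Feed-forward expansion factor relative to the hidden size.
mlp_ratio = cfg["n_inner"] // cfg["hidden_size"]

# The start/stop ids occupy the last two slots of the audio vocabulary,
# leaving the remaining ids for ordinary audio codes (interpretation only).
control_ids = {cfg["start_audio_token"], cfg["stop_audio_token"]}
num_plain_audio_codes = cfg["num_audio_tokens"] - len(control_ids)

print(head_dim, mlp_ratio, num_plain_audio_codes)
```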

config.original.json ADDED

	@@ -0,0 +1,52 @@

+{
+  "model_type": "xtts_gpt",
+  "architectures": [
+    "XttsGPT"
+  ],
+  "vocab_size": 6681,
+  "hidden_size": 1024,
+  "num_hidden_layers": 30,
+  "num_attention_heads": 16,
+  "n_inner": 4096,
+  "number_text_tokens": 6681,
+  "num_audio_tokens": 1026,
+  "max_audio_tokens": 605,
+  "start_audio_token": 1024,
+  "stop_audio_token": 1025,
+  "max_text_tokens": 402,
+  "max_prompt_tokens": 70,
+  "activation_function": "gelu_new",
+  "attn_pdrop": 0.1,
+  "layer_norm_epsilon": 1e-05,
+  "initializer_range": 0.02,
+  "use_masking_gt_prompt_approach": true,
+  "use_perceiver_resampler": true,
+  "kv_cache": true,
+  "enable_redaction": false,
+  "reorder_and_upcast_attn": false,
+  "scale_attn_by_inverse_layer_idx": false,
+  "auto_map": {
+    "AutoConfig": "AstraMindAI/xtts2-gpt--gpt_config.XTTSGPTConfig",
+    "AutoModelForCausalLM": "AstraMindAI/xtts2-gpt--xtts2_gpt_modeling.XttsGPT",
+    "AutoTokenizer": "AstraMindAI/xtts2-gpt--tokenizer.XTTSTokenizerFast"
+  },
+  "languages": [
+    "en",
+    "es",
+    "fr",
+    "de",
+    "it",
+    "pt",
+    "pl",
+    "tr",
+    "ru",
+    "nl",
+    "cs",
+    "ar",
+    "zh-cn",
+    "hu",
+    "ko",
+    "ja",
+    "hi"
+  ]
+}

gpt2_model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:104d92b2297c243b64d1417bd5cfda015faca0a670e9bc90088eed0e844f8e35
+size 1522497936
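
Note on the pointer file: the three lines above are a Git LFS pointer, not the weights themselves. The sketch below shows how such a pointer can be read to recover the hash algorithm, digest, and payload size. `parse_lfs_pointer` is a hypothetical helper written for this note, not part of any library.

```python
POINTER_TEXT = """\
version https://git-lfs.github.com/spec/v1
oid sha256:104d92b2297c243b64d1417bd5cfda015faca0a670e9bc90088eed0e844f8e35
size 1522497936
"""


def parse_lfs_pointer(text: str) -> dict:
    """Parse the 'key value' lines of a Git LFS pointer into a dict."""
    fields = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid has the form "<hash algorithm>:<hex digest>".
    hash_algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": hash_algo,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }


pointer = parse_lfs_pointer(POINTER_TEXT)
size_gib = pointer["size_bytes"] / 2**30
print(f"{pointer['hash_algo']} digest, {size_gib:.2f} GiB")
```

The size field shows the actual safetensors file is roughly 1.4 GiB. How that maps to a parameter count depends on the stored dtype, which the pointer does not reveal.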

gpt_config.py ADDED

	@@ -0,0 +1,143 @@

+from dataclasses import asdict, dataclass
+from typing import Dict, Optional, List
+from transformers.configuration_utils import PretrainedConfig
+from transformers.utils import logging
+logger = logging.get_logger(__name__)
+@dataclass
+class GPTAudioConfig:
+    """Configuration for GPT audio processing parameters"""
+    mel_channels: int = 80
+    sample_rate: int = 22050
+    output_sample_rate: int = 24000
+@dataclass
+class XTTSAudioConfig:
+    """Configuration for audio processing parameters"""
+    sample_rate: int = 22050
+    output_sample_rate: int = 24000
+    mel_channels: int = 80
+    hop_length: int = 256
+    win_length: int = 1024
+    n_fft: int = 1024
+    fmin: int = 0
+    fmax: int = 8000
+    power: float = 1.0
+    mel_norms_file: Optional[str] = None
+class XTTSGPTConfig(PretrainedConfig):
+    """Configuration class for the GPT component of XTTS."""
+    model_type = "xtts_gpt"
+    def __init__(
+            self,
+            # Model architecture
+            hidden_size: int = 1024,  # gpt_n_model_channels in original
+            n_inner: int = 4096,
+            num_hidden_layers: int = 30,  # gpt_layers in original
+            num_attention_heads: int = 16,  # gpt_n_heads in original
+            # Tokenizer settings
+            vocab_size: int = 6681,  # gpt_number_text_tokens in original
+            number_text_tokens: int = 6681,  # Explicit text token vocabulary size
+            start_text_token: Optional[int] = None,
+            stop_text_token: Optional[int] = None,
+            # Audio token settings
+            num_audio_tokens: int = 1026,  # gpt_num_audio_tokens in original
+            start_audio_token: int = 1024,  # gpt_start_audio_token in original
+            stop_audio_token: int = 1025,  # gpt_stop_audio_token in original
+            # Sequence length settings
+            max_audio_tokens: int = 605,  # gpt_max_audio_tokens in original
+            max_text_tokens: int = 402,  # gpt_max_text_tokens in original
+            max_prompt_tokens: int = 70,  # gpt_max_prompt_tokens in original
+            gpt_max_audio_tokens: int = 605,  # Used for generation
+            # Model behavior settings
+            use_masking_gt_prompt_approach: bool = True,  # gpt_use_masking_gt_prompt_approach in original
+            use_perceiver_resampler: bool = True,  # gpt_use_perceiver_resampler in original
+            kv_cache: bool = True,
+            enable_redaction: bool = False,
+            # GPT batch settings
+            gpt_batch_size: int = 1,
+            # Audio processing
+            audio_config: Optional[Dict] = None,
+            # Architecture specifics
+            layer_norm_epsilon: float = 1e-5,
+            initializer_range: float = 0.02,
+            add_cross_attention: bool = False,
+            scale_attn_by_inverse_layer_idx: bool = False,
+            reorder_and_upcast_attn: bool = False,
+            # Size settings for the decoder
+            decoder_input_dim: int = 1024,
+            architectures=["XttsGPT"],
+            auto_map={
+                "AutoConfig": "AstraMindAI/xtts2-gpt--gpt_config.XTTSGPTConfig",
+                "AutoModelForCausalLM": "AstraMindAI/xtts2-gpt--xtts2_gpt_modeling.XttsGPT",
+            },
+            activation_function: str = "gelu",
+            attn_pdrop: float = 0.1,
+            **kwargs
+    ):
+        super().__init__(**kwargs)
+        self.architectures = architectures
+        self.auto_map = auto_map
+        self.audio_config = GPTAudioConfig(
+            **audio_config if audio_config is not None else {}
+        )
+        self.activation_function = activation_function
+        self.attn_pdrop = attn_pdrop
+        self.hidden_size = hidden_size
+        self.n_inner = n_inner
+        self.num_hidden_layers = num_hidden_layers
+        self.num_attention_heads = num_attention_heads
+        self.vocab_size = vocab_size
+        self.number_text_tokens = number_text_tokens
+        self.start_text_token = start_text_token
+        self.stop_text_token = stop_text_token
+        self.num_audio_tokens = num_audio_tokens
+        self.start_audio_token = start_audio_token
+        self.stop_audio_token = stop_audio_token
+        self.max_audio_tokens = max_audio_tokens
+        self.max_text_tokens = max_text_tokens
+        self.max_prompt_tokens = max_prompt_tokens
+        self.gpt_max_audio_tokens = gpt_max_audio_tokens
+        self.use_masking_gt_prompt_approach = use_masking_gt_prompt_approach
+        self.use_perceiver_resampler = use_perceiver_resampler
+        self.kv_cache = kv_cache
+        self.enable_redaction = enable_redaction
+        self.gpt_batch_size = gpt_batch_size
+        self.layer_norm_epsilon = layer_norm_epsilon
+        self.initializer_range = initializer_range
+        self.add_cross_attention = add_cross_attention
+        self.scale_attn_by_inverse_layer_idx = scale_attn_by_inverse_layer_idx
+        self.reorder_and_upcast_attn = reorder_and_upcast_attn
+        self.decoder_input_dim = decoder_input_dim
+    def to_dict(self) -> Dict:
+        """Convert the config to a dictionary."""
+        output = super().to_dict()
+        output["audio_config"] = asdict(self.audio_config)
+        return output
+    @classmethod
+    def from_dict(cls, config_dict: Dict, *args, **kwargs) -> "XTTSGPTConfig":
+        """Create a config from a dictionary."""
+        return cls(**config_dict)
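
Note on gpt_config.py: `XTTSGPTConfig` stores `audio_config` as a dataclass in memory but writes it out as a plain dict in `to_dict`, so a save/load cycle depends on the constructor turning that dict back into a dataclass. The sketch below shows the same round-trip pattern without depending on `transformers`. `AudioSettings` and `MiniConfig` are hypothetical stand-ins for illustration, not classes from this commit.

```python
from dataclasses import asdict, dataclass
from typing import Dict, Optional


@dataclass
class AudioSettings:
    """Stand-in for GPTAudioConfig with the same three fields."""
    mel_channels: int = 80
    sample_rate: int = 22050
    output_sample_rate: int = 24000


class MiniConfig:
    """Hypothetical, transformers-free stand-in for XTTSGPTConfig."""

    def __init__(self, hidden_size: int = 1024, audio_config: Optional[Dict] = None):
        self.hidden_size = hidden_size
        # A dict (e.g. loaded from JSON) is promoted to a typed dataclass.
        self.audio_config = AudioSettings(**(audio_config or {}))

    def to_dict(self) -> Dict:
        # asdict() turns the nested dataclass back into JSON-friendly data.
        return {
            "hidden_size": self.hidden_size,
            "audio_config": asdict(self.audio_config),
        }

    @classmethod
    def from_dict(cls, config_dict: Dict) -> "MiniConfig":
        return cls(**config_dict)


original = MiniConfig(audio_config={"sample_rate": 16000})
restored = MiniConfig.from_dict(original.to_dict())
print(restored.audio_config)
```

Overridden fields survive the cycle and unspecified ones fall back to the dataclass defaults, which is the behaviour the real class appears to rely on.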

special_tokens_map.json ADDED

	@@ -0,0 +1,6 @@

+{
+  "bos_token": "[START]",
+  "eos_token": "[STOP]",
+  "pad_token": "[PAD]",
+  "unk_token": "[UNK]"
+}

tokenizer.json ADDED

The diff for this file is too large to render. See raw diff

tokenizer.py ADDED

	@@ -0,0 +1,928 @@

+import re
+from typing import List, Optional, Union, Dict, Any
+from functools import cached_property
+import pypinyin
+import torch
+from hangul_romanize import Transliter
+from hangul_romanize.rule import academic
+from num2words import num2words
+from spacy.lang.ar import Arabic
+from spacy.lang.en import English
+from spacy.lang.es import Spanish
+from spacy.lang.ja import Japanese
+from spacy.lang.zh import Chinese
+from transformers import PreTrainedTokenizerFast, BatchEncoding
+from transformers.tokenization_utils_base import TruncationStrategy, PaddingStrategy
+from tokenizers import Tokenizer
+from tokenizers.pre_tokenizers import WhitespaceSplit
+from tokenizers.processors import TemplateProcessing
+from auralis.models.xttsv2.components.tts.layers.xtts.zh_num2words import TextNorm as zh_num2words
+import cutlet
+def get_spacy_lang(lang):
+    if lang == "zh":
+        return Chinese()
+    elif lang == "ja":
+        return Japanese()
+    elif lang == "ar":
+        return Arabic()
+    elif lang == "es":
+        return Spanish()
+    else:
+        # For most languages, English does the job
+        return English()
+def find_best_split_point(text: str, target_pos: int, window_size: int = 30) -> int:
+    """
+    Find best split point near target position considering punctuation and language markers.
+    added for better sentence splitting in TTS.
+    """
+    # Define split markers by priority
+    markers = [
+        # Strong breaks (longest pause)
+        (r'[.!?؟။။။]+[\s]*', 1.0),  # Periods, exclamation, question (multi-script)
+        (r'[\n\r]+\s*[\n\r]+', 1.0),  # Multiple newlines
+        (r'[:|;；：；][\s]*', 0.9),  # Colons, semicolons (multi-script)
+        # Medium breaks
+        (r'[,，،、][\s]*', 0.8),  # Commas (multi-script)
+        (r'[)}\]）】』»›》\s]+', 0.7),  # Closing brackets/parentheses
+        (r'[-—−]+[\s]*', 0.7),  # Dashes
+        # Weak breaks
+        (r'\s+[&+=/\s]+\s+', 0.6),  # Special characters with spaces
+        (r'[\s]+', 0.5),  # Any whitespace as last resort
+    ]
+    # Calculate window boundaries
+    start = max(0, target_pos - window_size)
+    end = min(len(text), target_pos + window_size)
+    window = text[start:end]
+    best_pos = target_pos
+    best_score = 0
+    for pattern, priority in markers:
+        matches = list(re.finditer(pattern, window))
+        for match in matches:
+            # Calculate position score based on distance from target
+            pos = start + match.end()
+            distance = abs(pos - target_pos)
+            distance_score = 1 - (distance / (window_size * 2))
+            # Combine priority and position scores
+            score = priority * distance_score
+            if score > best_score:
+                best_score = score
+                best_pos = pos
+    return best_pos
+def split_sentence(text: str, lang: str, text_split_length: int = 250) -> List[str]:
+    """
+    Enhanced sentence splitting with language awareness and optimal breakpoints.
+    Args:
+        text: Input text to split
+        lang: Language code
+        text_split_length: Target length for splits
+    Returns:
+        List of text splits optimized for TTS
+    """
+    text = text.strip()
+    if len(text) <= text_split_length:
+        return [text]
+    nlp = get_spacy_lang(lang)
+    if "sentencizer" not in nlp.pipe_names:
+        nlp.add_pipe("sentencizer")
+    # Get base sentences using spaCy
+    doc = nlp(text)
+    sentences = list(doc.sents)
+    splits = []
+    current_split = []
+    current_length = 0
+    for sent in sentences:
+        sentence_text = str(sent).strip()
+        sentence_length = len(sentence_text)
+        # If sentence fits in current split
+        if current_length + sentence_length <= text_split_length:
+            current_split.append(sentence_text)
+            current_length += sentence_length + 1
+        # Handle long sentences
+        elif sentence_length > text_split_length:
+            # Add current split if exists
+            if current_split:
+                splits.append(" ".join(current_split))
+                current_split = []
+                current_length = 0
+            # Split long sentence at optimal points
+            remaining = sentence_text
+            while len(remaining) > text_split_length:
+                split_pos = find_best_split_point(
+                    remaining,
+                    text_split_length,
+                    window_size=30
+                )
+                # Add split and continue with remainder
+                splits.append(remaining[:split_pos].strip())
+                remaining = remaining[split_pos:].strip()
+            # Handle remaining text
+            if remaining:
+                current_split = [remaining]
+                current_length = len(remaining)
+        # Start new split
+        else:
+            splits.append(" ".join(current_split))
+            current_split = [sentence_text]
+            current_length = sentence_length
+    # Add final split if needed
+    if current_split:
+        splits.append(" ".join(current_split))
+    cleaned_sentences = [s[:-1]+' ' if s.endswith('.') else s for s in splits if s] # prevents annoying sounds in italian
+    # Clean up splits
+    return cleaned_sentences
+_whitespace_re = re.compile(r"\s+")
+# List of (regular expression, replacement) pairs for abbreviations:
+_abbreviations = {
+    "en": [
+        (re.compile("\\b%s\\." % x[0], re.IGNORECASE), x[1])
+        for x in [
+            ("mrs", "misess"),
+            ("mr", "mister"),
+            ("dr", "doctor"),
+            ("st", "saint"),
+            ("co", "company"),
+            ("jr", "junior"),
+            ("maj", "major"),
+            ("gen", "general"),
+            ("drs", "doctors"),
+            ("rev", "reverend"),
+            ("lt", "lieutenant"),
+            ("hon", "honorable"),
+            ("sgt", "sergeant"),
+            ("capt", "captain"),
+            ("esq", "esquire"),
+            ("ltd", "limited"),
+            ("col", "colonel"),
+            ("ft", "fort"),
+        ]
+    ],
+    "es": [
+        (re.compile("\\b%s\\." % x[0], re.IGNORECASE), x[1])
+        for x in [
+            ("sra", "señora"),
+            ("sr", "señor"),
+            ("dr", "doctor"),
+            ("dra", "doctora"),
+            ("st", "santo"),
+            ("co", "compañía"),
+            ("jr", "junior"),
+            ("ltd", "limitada"),
+        ]
+    ],
+    "fr": [
+        (re.compile("\\b%s\\." % x[0], re.IGNORECASE), x[1])
+        for x in [
+            ("mme", "madame"),
+            ("mr", "monsieur"),
+            ("dr", "docteur"),
+            ("st", "saint"),
+            ("co", "compagnie"),
+            ("jr", "junior"),
+            ("ltd", "limitée"),
+        ]
+    ],
+    "de": [
+        (re.compile("\\b%s\\." % x[0], re.IGNORECASE), x[1])
+        for x in [
+            ("fr", "frau"),
+            ("dr", "doktor"),
+            ("st", "sankt"),
+            ("co", "firma"),
+            ("jr", "junior"),
+        ]
+    ],
+    "pt": [
+        (re.compile("\\b%s\\." % x[0], re.IGNORECASE), x[1])
+        for x in [
+            ("sra", "senhora"),
+            ("sr", "senhor"),
+            ("dr", "doutor"),
+            ("dra", "doutora"),
+            ("st", "santo"),
+            ("co", "companhia"),
+            ("jr", "júnior"),
+            ("ltd", "limitada"),
+        ]
+    ],
+    "it": [
+        (re.compile("\\b%s\\." % x[0], re.IGNORECASE), x[1])
+        for x in [
+            # ("sig.ra", "signora"),
+            ("sig", "signore"),
+            ("dr", "dottore"),
+            ("st", "santo"),
+            ("co", "compagnia"),
+            ("jr", "junior"),
+            ("ltd", "limitata"),
+        ]
+    ],
+    "pl": [
+        (re.compile("\\b%s\\." % x[0], re.IGNORECASE), x[1])
+        for x in [
+            ("p", "pani"),
+            ("m", "pan"),
+            ("dr", "doktor"),
+            ("sw", "święty"),
+            ("jr", "junior"),
+        ]
+    ],
+    "ar": [
+        (re.compile("\\b%s\\." % x[0], re.IGNORECASE), x[1])
+        for x in [
+            # There are not many common abbreviations in Arabic as in English.
+        ]
+    ],
+    "zh": [
+        (re.compile("\\b%s\\." % x[0], re.IGNORECASE), x[1])
+        for x in [
+            # Chinese doesn't typically use abbreviations in the same way as Latin-based scripts.
+        ]
+    ],
+    "cs": [
+        (re.compile("\\b%s\\." % x[0], re.IGNORECASE), x[1])
+        for x in [
+            ("dr", "doktor"),  # doctor
+            ("ing", "inženýr"),  # engineer
+            ("p", "pan"),  # Could also map to pani for woman but no easy way to do it
+            # Other abbreviations would be specialized and not as common.
+        ]
+    ],
+    "ru": [
+        (re.compile("\\b%s\\b" % x[0], re.IGNORECASE), x[1])
+        for x in [
+            ("г-жа", "госпожа"),  # Mrs.
+            ("г-н", "господин"),  # Mr.
+            ("д-р", "доктор"),  # doctor
+            # Other abbreviations are less common or specialized.
+        ]
+    ],
+    "nl": [
+        (re.compile("\\b%s\\." % x[0], re.IGNORECASE), x[1])
+        for x in [
+            ("dhr", "de heer"),  # Mr.
+            ("mevr", "mevrouw"),  # Mrs.
+            ("dr", "dokter"),  # doctor
+            ("jhr", "jonkheer"),  # young lord or nobleman
+            # Dutch uses more abbreviations, but these are the most common ones.
+        ]
+    ],
+    "tr": [
+        (re.compile("\\b%s\\." % x[0], re.IGNORECASE), x[1])
+        for x in [
+            ("b", "bay"),  # Mr.
+            ("byk", "büyük"),  # büyük
+            ("dr", "doktor"),  # doctor
+            # Add other Turkish abbreviations here if needed.
+        ]
+    ],
+    "hu": [
+        (re.compile("\\b%s\\." % x[0], re.IGNORECASE), x[1])
+        for x in [
+            ("dr", "doktor"),  # doctor
+            ("b", "bácsi"),  # Mr.
+            ("nőv", "nővér"),  # nurse
+            # Add other Hungarian abbreviations here if needed.
+        ]
+    ],
+    "ko": [
+        (re.compile("\\b%s\\." % x[0], re.IGNORECASE), x[1])
+        for x in [
+            # Korean doesn't typically use abbreviations in the same way as Latin-based scripts.
+        ]
+    ],
+}
+def expand_abbreviations_multilingual(text, lang="en"):
+    if lang in _abbreviations:
+        for regex, replacement in _abbreviations[lang]:
+            text = re.sub(regex, replacement, text)
+    return text
+_symbols_multilingual = {
+    "en": [
+        (re.compile(r"%s" % re.escape(x[0]), re.IGNORECASE), x[1])
+        for x in [
+            ("&", " and "),
+            ("@", " at "),
+            ("%", " percent "),
+            ("#", " hash "),
+            ("$", " dollar "),
+            ("£", " pound "),
+            ("°", " degree "),
+        ]
+    ],
+    "es": [
+        (re.compile(r"%s" % re.escape(x[0]), re.IGNORECASE), x[1])
+        for x in [
+            ("&", " y "),
+            ("@", " arroba "),
+            ("%", " por ciento "),
+            ("#", " numeral "),
+            ("$", " dolar "),
+            ("£", " libra "),
+            ("°", " grados "),
+        ]
+    ],
+    "fr": [
+        (re.compile(r"%s" % re.escape(x[0]), re.IGNORECASE), x[1])
+        for x in [
+            ("&", " et "),
+            ("@", " arobase "),
+            ("%", " pour cent "),
+            ("#", " dièse "),
+            ("$", " dollar "),
+            ("£", " livre "),
+            ("°", " degrés "),
+        ]
+    ],
+    "de": [
+        (re.compile(r"%s" % re.escape(x[0]), re.IGNORECASE), x[1])
+        for x in [
+            ("&", " und "),
+            ("@", " at "),
+            ("%", " prozent "),
+            ("#", " raute "),
+            ("$", " dollar "),
+            ("£", " pfund "),
+            ("°", " grad "),
+        ]
+    ],
+    "pt": [
+        (re.compile(r"%s" % re.escape(x[0]), re.IGNORECASE), x[1])
+        for x in [
+            ("&", " e "),
+            ("@", " arroba "),
+            ("%", " por cento "),
+            ("#", " cardinal "),
+            ("$", " dólar "),
+            ("£", " libra "),
+            ("°", " graus "),
+        ]
+    ],
+    "it": [
+        (re.compile(r"%s" % re.escape(x[0]), re.IGNORECASE), x[1])
+        for x in [
+            ("&", " e "),
+            ("@", " chiocciola "),
+            ("%", " per cento "),
+            ("#", " cancelletto "),
+            ("$", " dollaro "),
+            ("£", " sterlina "),
+            ("°", " gradi "),
+        ]
+    ],
+    "pl": [
+        (re.compile(r"%s" % re.escape(x[0]), re.IGNORECASE), x[1])
+        for x in [
+            ("&", " i "),
+            ("@", " małpa "),
+            ("%", " procent "),
+            ("#", " krzyżyk "),
+            ("$", " dolar "),
+            ("£", " funt "),
+            ("°", " stopnie "),
+        ]
+    ],
+    "ar": [
+        # Arabic
+        (re.compile(r"%s" % re.escape(x[0]), re.IGNORECASE), x[1])
+        for x in [
+            ("&", " و "),
+            ("@", " على "),
+            ("%", " في المئة "),
+            ("#", " رقم "),
+            ("$", " دولار "),
+            ("£", " جنيه "),
+            ("°", " درجة "),
+        ]
+    ],
+    "zh": [
+        # Chinese
+        (re.compile(r"%s" % re.escape(x[0]), re.IGNORECASE), x[1])
+        for x in [
+            ("&", " 和 "),
+            ("@", " 在 "),
+            ("%", " 百分之 "),
+            ("#", " 号 "),
+            ("$", " 美元 "),
+            ("£", " 英镑 "),
+            ("°", " 度 "),
+        ]
+    ],
+    "cs": [
+        # Czech
+        (re.compile(r"%s" % re.escape(x[0]), re.IGNORECASE), x[1])
+        for x in [
+            ("&", " a "),
+            ("@", " na "),
+            ("%", " procento "),
+            ("#", " křížek "),
+            ("$", " dolar "),
+            ("£", " libra "),
+            ("°", " stupně "),
+        ]
+    ],
+    "ru": [
+        # Russian
+        (re.compile(r"%s" % re.escape(x[0]), re.IGNORECASE), x[1])
+        for x in [
+            ("&", " и "),
+            ("@", " собака "),
+            ("%", " процентов "),
+            ("#", " номер "),
+            ("$", " доллар "),
+            ("£", " фунт "),
+            ("°", " градус "),
+        ]
+    ],
+    "nl": [
+        # Dutch
+        (re.compile(r"%s" % re.escape(x[0]), re.IGNORECASE), x[1])
+        for x in [
+            ("&", " en "),
+            ("@", " bij "),
+            ("%", " procent "),
+            ("#", " hekje "),
+            ("$", " dollar "),
+            ("£", " pond "),
+            ("°", " graden "),
+        ]
+    ],
+    "tr": [
+        (re.compile(r"%s" % re.escape(x[0]), re.IGNORECASE), x[1])
+        for x in [
+            ("&", " ve "),
+            ("@", " at "),
+            ("%", " yüzde "),
+            ("#", " diyez "),
+            ("$", " dolar "),
+            ("£", " sterlin "),
+            ("°", " derece "),
+        ]
+    ],
+    "hu": [
+        (re.compile(r"%s" % re.escape(x[0]), re.IGNORECASE), x[1])
+        for x in [
+            ("&", " és "),
+            ("@", " kukac "),
+            ("%", " százalék "),
+            ("#", " kettőskereszt "),
+            ("$", " dollár "),
+            ("£", " font "),
+            ("°", " fok "),
+        ]
+    ],
+    "ko": [
+        # Korean
+        (re.compile(r"%s" % re.escape(x[0]), re.IGNORECASE), x[1])
+        for x in [
+            ("&", " 그리고 "),
+            ("@", " 에 "),
+            ("%", " 퍼센트 "),
+            ("#", " 번호 "),
+            ("$", " 달러 "),
+            ("£", " 파운드 "),
+            ("°", " 도 "),
+        ]
+    ],
+}
+def expand_symbols_multilingual(text, lang="en"):
+    if lang in _symbols_multilingual:
+        for regex, replacement in _symbols_multilingual[lang]:
+            text = re.sub(regex, replacement, text)
+            text = text.replace("  ", " ")  # Ensure there are no double spaces
+    return text.strip()
+_ordinal_re = {
+    "en": re.compile(r"([0-9]+)(st|nd|rd|th)"),
+    "es": re.compile(r"([0-9]+)(º|ª|er|o|a|os|as)"),
+    "fr": re.compile(r"([0-9]+)(º|ª|er|re|e|ème)"),
+    "de": re.compile(r"([0-9]+)(st|nd|rd|th|º|ª|\.(?=\s|$))"),
+    "pt": re.compile(r"([0-9]+)(º|ª|o|a|os|as)"),
+    "it": re.compile(r"([0-9]+)(º|°|ª|o|a|i|e)"),
+    "pl": re.compile(r"([0-9]+)(º|ª|st|nd|rd|th)"),
+    "ar": re.compile(r"([0-9]+)(ون|ين|ث|ر|ى)"),
+    "cs": re.compile(r"([0-9]+)\.(?=\s|$)"),  # In Czech, a dot is often used after the number to indicate ordinals.
+    "ru": re.compile(r"([0-9]+)(-й|-я|-е|-ое|-ье|-го)"),
+    "nl": re.compile(r"([0-9]+)(de|ste|e)"),
+    "tr": re.compile(r"([0-9]+)(\.|inci|nci|uncu|üncü|\.)"),
+    "hu": re.compile(r"([0-9]+)(\.|adik|edik|odik|edik|ödik|ödike|ik)"),
+    "ko": re.compile(r"([0-9]+)(번째|번|차|째)"),
+}
+_number_re = re.compile(r"[0-9]+")
+# noinspection Annotator
+_currency_re = {
+    "USD": re.compile(r"((\$[0-9\.\,]*[0-9]+)|([0-9\.\,]*[0-9]+\$))"),
+    "GBP": re.compile(r"((£[0-9\.\,]*[0-9]+)|([0-9\.\,]*[0-9]+£))"),
+    "EUR": re.compile(r"(([0-9\.\,]*[0-9]+€)|((€[0-9\.\,]*[0-9]+)))"),
+}
+_comma_number_re = re.compile(r"\b\d{1,3}(,\d{3})*(\.\d+)?\b")
+_dot_number_re = re.compile(r"\b\d{1,3}(\.\d{3})*(\,\d+)?\b")
+_decimal_number_re = re.compile(r"([0-9]+[.,][0-9]+)")
+def _remove_commas(m):
+    text = m.group(0)
+    if "," in text:
+        text = text.replace(",", "")
+    return text
+def _remove_dots(m):
+    text = m.group(0)
+    if "." in text:
+        text = text.replace(".", "")
+    return text
+def _expand_decimal_point(m, lang="en"):
+    amount = m.group(1).replace(",", ".")
+    return num2words(float(amount), lang=lang if lang != "cs" else "cz")
+def _expand_currency(m, lang="en", currency="USD"):
+    amount = float((re.sub(r"[^\d.]", "", m.group(0).replace(",", "."))))
+    full_amount = num2words(amount, to="currency", currency=currency, lang=lang if lang != "cs" else "cz")
+    and_equivalents = {
+        "en": ", ",
+        "es": " con ",
+        "fr": " et ",
+        "de": " und ",
+        "pt": " e ",
+        "it": " e ",
+        "pl": ", ",
+        "cs": ", ",
+        "ru": ", ",
+        "nl": ", ",
+        "ar": ", ",
+        "tr": ", ",
+        "hu": ", ",
+        "ko": ", ",
+    }
+    if amount.is_integer():
+        last_and = full_amount.rfind(and_equivalents.get(lang, ", "))
+        if last_and != -1:
+            full_amount = full_amount[:last_and]
+    return full_amount
+def _expand_ordinal(m, lang="en"):
+    return num2words(int(m.group(1)), ordinal=True, lang=lang if lang != "cs" else "cz")
+def _expand_number(m, lang="en"):
+    return num2words(int(m.group(0)), lang=lang if lang != "cs" else "cz")
+def expand_numbers_multilingual(text, lang="en"):
+    if lang == "zh":
+        text = zh_num2words()(text)
+    else:
+        if lang in ["en", "ru"]:
+            text = re.sub(_comma_number_re, _remove_commas, text)
+        else:
+            text = re.sub(_dot_number_re, _remove_dots, text)
+        try:
+            text = re.sub(_currency_re["GBP"], lambda m: _expand_currency(m, lang, "GBP"), text)
+            text = re.sub(_currency_re["USD"], lambda m: _expand_currency(m, lang, "USD"), text)
+            text = re.sub(_currency_re["EUR"], lambda m: _expand_currency(m, lang, "EUR"), text)
+        except Exception:
+            # Currency expansion is best-effort: if an amount cannot be converted, keep the text as it is
+            pass
+        if lang != "tr":
+            text = re.sub(_decimal_number_re, lambda m: _expand_decimal_point(m, lang), text)
+        if lang in _ordinal_re:
+            text = re.sub(_ordinal_re[lang], lambda m: _expand_ordinal(m, lang), text)
+        text = re.sub(_number_re, lambda m: _expand_number(m, lang), text)
+    return text
+def lowercase(text):
+    return text.lower()
+def collapse_whitespace(text):
+    return re.sub(_whitespace_re, " ", text)
+def multilingual_cleaners(text, lang):
+    text = text.replace('"', "")
+    if lang == "tr":
+        text = text.replace("İ", "i")
+        text = text.replace("Ö", "ö")
+        text = text.replace("Ü", "ü")
+    text = lowercase(text)
+    text = expand_numbers_multilingual(text, lang)
+    text = expand_abbreviations_multilingual(text, lang)
+    text = expand_symbols_multilingual(text, lang=lang)
+    text = collapse_whitespace(text)
+    return text
+def basic_cleaners(text):
+    """Basic pipeline that lowercases and collapses whitespace without transliteration."""
+    text = lowercase(text)
+    text = collapse_whitespace(text)
+    return text
+def chinese_transliterate(text):
+    return "".join(
+        [p[0] for p in pypinyin.pinyin(text, style=pypinyin.Style.TONE3, heteronym=False, neutral_tone_with_five=True)]
+    )
+def japanese_cleaners(text, katsu):
+    text = katsu.romaji(text)
+    text = lowercase(text)
+    return text
+def korean_transliterate(text, transliter):
+    return transliter.translit(text)
+# Fast Tokenizer Class
+class XTTSTokenizerFast(PreTrainedTokenizerFast):
+    """
+    Fast Tokenizer implementation for XTTS model using HuggingFace's PreTrainedTokenizerFast
+    """
+    def __init__(
+            self,
+            vocab_file: str = None,
+            tokenizer_object: Optional[Tokenizer] = None,
+            unk_token: str = "[UNK]",
+            pad_token: str = "[PAD]",
+            bos_token: str = "[START]",
+            eos_token: str = "[STOP]",
+            auto_map: dict = {"AutoTokenizer": ["AstraMindAI/xtts2-gpt--tokenizer.XTTSTokenizerFast", None]},
+            clean_up_tokenization_spaces: bool = True,
+            **kwargs
+    ):
+        if tokenizer_object is None and vocab_file is not None:
+            tokenizer_object = Tokenizer.from_file(vocab_file)
+        if tokenizer_object is not None:
+            # Configure the tokenizer
+            tokenizer_object.pre_tokenizer = WhitespaceSplit()
+            tokenizer_object.post_processor = TemplateProcessing(
+                single=f"{bos_token} $A {eos_token}",
+                special_tokens=[
+                    (bos_token, tokenizer_object.token_to_id(bos_token)),
+                    (eos_token, tokenizer_object.token_to_id(eos_token)),
+                ],
+            )
+        super().__init__(
+            tokenizer_object=tokenizer_object,
+            unk_token=unk_token,
+            pad_token=pad_token,
+            bos_token=bos_token,
+            eos_token=eos_token,
+            clean_up_tokenization_spaces=clean_up_tokenization_spaces,
+            **kwargs
+        )
+        # Character limits per language
+        self.char_limits = {
+            "en": 250, "de": 253, "fr": 273, "es": 239,
+            "it": 213, "pt": 203, "pl": 224, "zh": 82,
+            "ar": 166, "cs": 186, "ru": 182, "nl": 251,
+            "tr": 226, "ja": 71, "hu": 224, "ko": 95,
+        }
+        # Initialize language tools
+        self._katsu = None
+        self._korean_transliter = Transliter(academic)
+        # Ensure pad_token_id is set
+        if self.pad_token_id is None:
+            self.pad_token_id = self.backend_tokenizer.token_to_id(self.pad_token)
+    @cached_property
+    def katsu(self):
+        if self._katsu is None:
+            self._katsu = cutlet.Cutlet()
+        return self._katsu
+    def preprocess_text(self, text: str, lang: str) -> str:
+        """Apply text preprocessing for language"""
+        base_lang = lang.split("-")[0]  # remove region
+        if base_lang in {"ar", "cs", "de", "en", "es", "fr", "hu", "it",
+                         "nl", "pl", "pt", "ru", "tr", "zh", "ko"}:
+            text = multilingual_cleaners(text, base_lang)
+            if base_lang == "zh":
+                text = chinese_transliterate(text)
+            if base_lang == "ko":
+                text = korean_transliterate(text, self._korean_transliter)
+        elif base_lang == "ja":
+            text = japanese_cleaners(text, self.katsu)
+        else:
+            text = basic_cleaners(text)
+        return text
+    def batch_encode_with_split(self, texts: Union[str, List[str]], lang: Union[str, List[str]],
+                                **kwargs) -> torch.Tensor:
+        """
+        Split each text into chunks no longer than the per-language character limit, then encode
+        the chunks with the HuggingFace fast tokenizer. The splitting mirrors the original XTTS-v2 tokenizer.
+        """
+        # Convert single inputs to lists
+        if isinstance(texts, str):
+            texts = [texts]
+        if isinstance(lang, str):
+            lang = [lang]
+        # Ensure lang list matches texts list
+        if len(lang) == 1 and len(texts) > 1:
+            lang = lang * len(texts)
+        # Check if texts and lang have the same length
+        if len(texts) != len(lang):
+            raise ValueError(f"Number of texts ({len(texts)}) does not match number of languages ({len(lang)}).")
+        chunk_list = []
+        chunk_langs = []
+        # For each text, split into chunks based on character limit
+        for text, text_lang in zip(texts, lang):
+            # Get language character limit
+            base_lang = text_lang.split("-")[0]
+            char_limit = self.char_limits.get(base_lang, 250)
+            # Cleaning is deferred to _batch_encode_plus, so the text is only split here
+            chunks = split_sentence(text, base_lang, text_split_length=char_limit)
+            # Accumulate chunks across all texts, keeping one language entry per chunk
+            chunk_list.extend(chunks)
+            chunk_langs.extend([text_lang] * len(chunks))
+        # Ensure the tokenizer is a fast tokenizer
+        if not self.is_fast:
+            raise ValueError("The tokenizer must be a fast tokenizer.")
+        # Encode all chunks using the fast tokenizer
+        encoding: BatchEncoding = self(
+            chunk_list,
+            lang=chunk_langs,
+            add_special_tokens=False,
+            padding=False,
+            **kwargs
+        )
+        # 'input_ids' holds one token sequence per chunk; with padding=False the sequences may differ in length
+        return encoding['input_ids']
+    def _batch_encode_plus(
+            self,
+            batch_text_or_text_pairs,
+            add_special_tokens: bool = True,
+            padding_strategy=PaddingStrategy.DO_NOT_PAD,
+            truncation_strategy=TruncationStrategy.DO_NOT_TRUNCATE,
+            max_length: Optional[int] = None,
+            stride: int = 0,
+            is_split_into_words: bool = False,
+            pad_to_multiple_of: Optional[int] = None,
+            return_tensors: Optional[str] = None,
+            return_token_type_ids: Optional[bool] = None,
+            return_attention_mask: Optional[bool] = None,
+            return_overflowing_tokens: bool = False,
+            return_special_tokens_mask: bool = False,
+            return_offsets_mapping: bool = False,
+            return_length: bool = False,
+            verbose: bool = True,
+            **kwargs
+    ) -> Dict[str, Any]:
+        """
+        Override batch encoding to handle language-specific preprocessing
+        """
+        lang = kwargs.pop("lang", ["en"] * len(batch_text_or_text_pairs))
+        if isinstance(lang, str):
+            lang = [lang]
+        # Ensure lang list matches texts list
+        if len(lang) == 1 and len(batch_text_or_text_pairs) > 1:
+            lang = lang * len(batch_text_or_text_pairs)
+        # Check if batch_text_or_text_pairs and lang have the same length
+        if len(batch_text_or_text_pairs) != len(lang):
+            raise ValueError(f"Number of texts ({len(batch_text_or_text_pairs)}) does not match number of languages ({len(lang)}).")
+        # Preprocess each text in the batch with its corresponding language
+        processed_texts = []
+        for text, text_lang in zip(batch_text_or_text_pairs, lang):
+            if isinstance(text, str):
+                # Check length and preprocess
+                #self.check_input_length(text, text_lang)
+                processed_text = self.preprocess_text(text, text_lang)
+                # Format text with language tag and spaces
+                base_lang = text_lang.split("-")[0]
+                lang_code = "zh-cn" if base_lang == "zh" else base_lang
+                processed_text = f"[{lang_code}]{processed_text}"
+                processed_text = processed_text.replace(" ", "[SPACE]")
+                processed_texts.append(processed_text)
+            else:
+                processed_texts.append(text)
+        # Call the parent class's encoding method with processed texts
+        return super()._batch_encode_plus(
+            processed_texts,
+            add_special_tokens=add_special_tokens,
+            padding_strategy=padding_strategy,
+            truncation_strategy=truncation_strategy,
+            max_length=max_length,
+            stride=stride,
+            is_split_into_words=is_split_into_words,
+            pad_to_multiple_of=pad_to_multiple_of,
+            return_tensors=return_tensors,
+            return_token_type_ids=return_token_type_ids,
+            return_attention_mask=return_attention_mask,
+            return_overflowing_tokens=return_overflowing_tokens,
+            return_special_tokens_mask=return_special_tokens_mask,
+            return_offsets_mapping=return_offsets_mapping,
+            return_length=return_length,
+            verbose=verbose,
+            **kwargs
+        )
+    def __call__(
+            self,
+            text: Union[str, List[str]],
+            lang: Union[str, List[str]] = "en",
+            add_special_tokens: bool = True,
+            padding: Union[bool, str, PaddingStrategy] = False,
+            truncation: Union[bool, str, TruncationStrategy] = False,
+            max_length: Optional[int] = None,
+            stride: int = 0,
+            return_tensors: Optional[str] = None,
+            return_token_type_ids: Optional[bool] = None,
+            return_attention_mask: Optional[bool] = True,
+            **kwargs
+    ):
+        """
+        Main tokenization method
+        """
+        # Convert single string to list for batch processing
+        if isinstance(text, str):
+            text = [text]
+        if isinstance(lang, str):
+            lang = [lang]
+        # Ensure lang list matches texts list
+        if len(lang) == 1 and len(text) > 1:
+            lang = lang * len(text)
+        # Ensure text and lang lists have same length
+        if len(text) != len(lang):
+            raise ValueError(f"Number of texts ({len(text)}) does not match number of languages ({len(lang)}).")
+        # Convert padding strategy
+        if isinstance(padding, bool):
+            padding_strategy = PaddingStrategy.LONGEST if padding else PaddingStrategy.DO_NOT_PAD
+        else:
+            padding_strategy = PaddingStrategy(padding)
+        # Convert truncation strategy
+        if isinstance(truncation, bool):
+            truncation_strategy = TruncationStrategy.LONGEST_FIRST if truncation else TruncationStrategy.DO_NOT_TRUNCATE
+        else:
+            truncation_strategy = TruncationStrategy(truncation)
+        # Use the batch encoding method
+        encoded = self._batch_encode_plus(
+            text,
+            add_special_tokens=add_special_tokens,
+            padding_strategy=padding_strategy,
+            truncation_strategy=truncation_strategy,
+            max_length=max_length,
+            stride=stride,
+            return_tensors=return_tensors,
+            return_token_type_ids=return_token_type_ids,
+            return_attention_mask=return_attention_mask,
+            lang=lang,
+            **kwargs
+        )
+        return encoded
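
As a reading aid for the tokenizer above: `_batch_encode_plus` prefixes each cleaned text with a language tag and replaces spaces with the `[SPACE]` token before calling the parent encoder, and the public entry points repeat a single language across the whole batch. The sketch below reproduces only those two steps in isolation. The helper names `broadcast_langs` and `tag_text` are hypothetical (they are not part of this commit), and the language-specific cleaners are deliberately skipped.

```python
from typing import List, Tuple, Union


def broadcast_langs(texts: Union[str, List[str]],
                    lang: Union[str, List[str]]) -> Tuple[List[str], List[str]]:
    """Hypothetical helper: normalize inputs to lists and repeat a single language."""
    if isinstance(texts, str):
        texts = [texts]
    if isinstance(lang, str):
        lang = [lang]
    # A single language applies to every text in the batch
    if len(lang) == 1 and len(texts) > 1:
        lang = lang * len(texts)
    if len(texts) != len(lang):
        raise ValueError(
            f"Number of texts ({len(texts)}) does not match number of languages ({len(lang)})."
        )
    return texts, lang


def tag_text(text: str, lang: str) -> str:
    """Hypothetical helper: add the language tag and mark spaces, without any cleaning."""
    base_lang = lang.split("-")[0]  # drop the region suffix, e.g. "pt-br" -> "pt"
    lang_code = "zh-cn" if base_lang == "zh" else base_lang
    return f"[{lang_code}]{text}".replace(" ", "[SPACE]")


texts, langs = broadcast_langs(["hello world", "good morning"], "en")
tagged = [tag_text(t, l) for t, l in zip(texts, langs)]
print(tagged)
```

Keeping the broadcast in one helper would also avoid repeating the same length checks in `batch_encode_with_split`, `_batch_encode_plus`, and `__call__`, though that refactor is only a suggestion.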

tokenizer_config.json ADDED

	@@ -0,0 +1,192 @@

+{
+  "added_tokens_decoder": {
+    "0": {
+      "content": "[STOP]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "1": {
+      "content": "[UNK]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "2": {
+      "content": "[SPACE]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "259": {
+      "content": "[en]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "260": {
+      "content": "[de]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "261": {
+      "content": "[START]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "262": {
+      "content": "[fr]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "267": {
+      "content": "[ru]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "284": {
+      "content": "[es]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "285": {
+      "content": "[it]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "286": {
+      "content": "[pt]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "293": {
+      "content": "[cs]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "294": {
+      "content": "[pl]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "295": {
+      "content": "[tr]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "297": {
+      "content": "[nl]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "5022": {
+      "content": "[ar]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "5023": {
+      "content": "[zh-cn]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "5412": {
+      "content": "[ja]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "5753": {
+      "content": "[hu]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "6152": {
+      "content": "[ko]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "6680": {
+      "content": "[hi]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "6681": {
+      "content": "[PAD]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    }
+  },
+  "auto_map": {"AutoTokenizer": ["AstraMindAI/xtts2-gpt--tokenizer.XTTSTokenizerFast", null]},
+  "bos_token": "[START]",
+  "clean_up_tokenization_spaces": true,
+  "eos_token": "[STOP]",
+  "max_length": null,
+  "model_max_length": 1000000000000000019884624838656,
+  "pad_to_multiple_of": null,
+  "pad_token": "[PAD]",
+  "pad_token_type_id": 0,
+  "padding_side": "right",
+  "tokenizer_class": "XTTSTokenizerFast",
+  "unk_token": "[UNK]"
+}
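
The `added_tokens_decoder` table above maps token ids to special tokens, including one tag per supported language. A minimal sketch of inverting that table into a tag-to-id lookup follows. The inline JSON is a four-entry excerpt of the config above, and the variable names are illustrative only.

```python
import json

# Four entries copied from the added_tokens_decoder section above (other fields omitted)
config_excerpt = json.loads("""
{
  "added_tokens_decoder": {
    "0": {"content": "[STOP]"},
    "259": {"content": "[en]"},
    "5023": {"content": "[zh-cn]"},
    "6681": {"content": "[PAD]"}
  }
}
""")

# JSON object keys are strings, so convert each id to int while inverting the mapping
token_to_id = {
    entry["content"]: int(token_id)
    for token_id, entry in config_excerpt["added_tokens_decoder"].items()
}

print(token_to_id["[en]"], token_to_id["[zh-cn]"])
```

Note that Chinese is registered as `[zh-cn]`, which is why the tokenizer rewrites a `zh` language code to `zh-cn` before building the tag.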

xtts2_gpt_modeling.py ADDED

	@@ -0,0 +1,460 @@

+import functools
+import math
+from array import array
+import torch
+import torch.nn as nn
+from torch.nn import functional as F
+from typing import List, Optional, Union, Iterable, Tuple, Mapping
+from transformers import PretrainedConfig
+from vllm.attention import AttentionMetadata, Attention
+from vllm.config import CacheConfig, MultiModalConfig
+from vllm.distributed import get_pp_group, get_tensor_model_parallel_world_size
+from vllm.inputs import InputContext, INPUT_REGISTRY
+from vllm.model_executor.layers.activation import get_act_fn
+from vllm.model_executor.layers.linear import ColumnParallelLinear, QKVParallelLinear, RowParallelLinear
+from vllm.model_executor.layers.quantization import QuantizationConfig
+from vllm.model_executor.layers.sampler import Sampler, SamplerOutput
+from vllm.model_executor.layers.vocab_parallel_embedding import VocabParallelEmbedding
+from vllm.model_executor.model_loader.weight_utils import default_weight_loader
+from vllm.model_executor.sampling_metadata import SamplingMetadata
+from vllm.multimodal import MULTIMODAL_REGISTRY, MultiModalInputs
+from vllm.sequence import IntermediateTensors, SequenceData, VLLM_TOKEN_ID_ARRAY_TYPE
+from vllm.model_executor.models.interfaces import SupportsMultiModal, SupportsPP
+from TTS.tts.layers.xtts.latent_encoder import ConditioningEncoder # noqa
+from TTS.tts.layers.xtts.perceiver_encoder import PerceiverResampler # noqa
+from TTS.tts.layers.xtts.gpt import LearnedPositionEmbeddings
+# Constants for token calculation
+_AUDIO_PLACEHOLDER_TOKEN = 8192  # Using XTTS start_audio_token as placeholder
+_AUDIO_TOKENS_PER_SECOND = 6.25
+_CODE_STRIDE_LEN = 1024
+class GPT2Attention(nn.Module):
+    def __init__(
+            self,
+            config: PretrainedConfig,
+            cache_config: Optional[CacheConfig] = None,
+            quant_config: Optional[QuantizationConfig] = None,
+            prefix: str = "",
+    ):
+        super().__init__()
+        total_num_heads = config.num_attention_heads
+        self.hidden_size = config.hidden_size
+        tensor_model_parallel_world_size = get_tensor_model_parallel_world_size()
+        assert total_num_heads % tensor_model_parallel_world_size == 0
+        self.num_heads = total_num_heads // tensor_model_parallel_world_size
+        self.head_dim = self.hidden_size // total_num_heads
+        self.scale = self.head_dim**-0.5
+        self.c_attn = QKVParallelLinear(
+            self.hidden_size,
+            self.head_dim,
+            total_num_heads,
+            bias=True,
+            quant_config=quant_config,
+            prefix=f"{prefix}.c_attn",
+        )
+        self.c_proj = RowParallelLinear(
+            self.hidden_size,
+            self.hidden_size,
+            bias=True,
+            quant_config=quant_config,
+            prefix=f"{prefix}.c_proj",
+        )
+        self.attn = Attention(
+            self.num_heads,
+            self.head_dim,
+            scale=self.scale,
+            cache_config=cache_config,
+            quant_config=quant_config
+        )
+    def forward(
+            self,
+            hidden_states: torch.Tensor,
+            kv_cache: torch.Tensor,
+            attn_metadata: AttentionMetadata,
+    ) -> torch.Tensor:
+        qkv, _ = self.c_attn(hidden_states)
+        q, k, v = qkv.chunk(chunks=3, dim=-1)
+        attn_output = self.attn(q, k, v, kv_cache, attn_metadata)
+        attn_output, _ = self.c_proj(attn_output)
+        return attn_output
+class GPT2MLP(nn.Module):
+    def __init__(
+            self,
+            intermediate_size: int,
+            config: PretrainedConfig,
+            quant_config: Optional[QuantizationConfig] = None,
+            prefix: str = "",
+    ):
+        super().__init__()
+        hidden_size = config.hidden_size
+        self.c_fc = ColumnParallelLinear(
+            hidden_size,
+            intermediate_size,
+            bias=True,
+            quant_config=quant_config,
+            prefix=f"{prefix}.c_fc",
+        )
+        self.c_proj = RowParallelLinear(
+            intermediate_size,
+            hidden_size,
+            bias=True,
+            quant_config=quant_config,
+            prefix=f"{prefix}.c_proj",
+        )
+        self.act = get_act_fn(config.activation_function, quant_config, intermediate_size)
+    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
+        hidden_states, _ = self.c_fc(hidden_states)
+        hidden_states = self.act(hidden_states)
+        hidden_states, _ = self.c_proj(hidden_states)
+        return hidden_states
+class GPT2Block(nn.Module):
+    def __init__(
+            self,
+            config: PretrainedConfig,
+            cache_config: Optional[CacheConfig] = None,
+            quant_config: Optional[QuantizationConfig] = None,
+            prefix: str = "",
+    ):
+        super().__init__()
+        hidden_size = config.hidden_size
+        inner_dim = config.n_inner if config.n_inner is not None else 4 * hidden_size
+        self.ln_1 = nn.LayerNorm(hidden_size, eps=config.layer_norm_epsilon)
+        self.attn = GPT2Attention(
+            config,
+            cache_config,
+            quant_config,
+            prefix=f"{prefix}.attn"
+        )
+        self.ln_2 = nn.LayerNorm(hidden_size, eps=config.layer_norm_epsilon)
+        self.mlp = GPT2MLP(
+            inner_dim,
+            config,
+            quant_config,
+            prefix=f"{prefix}.mlp"
+        )
+    def forward(
+            self,
+            hidden_states: torch.Tensor,
+            kv_cache: torch.Tensor,
+            attn_metadata: AttentionMetadata,
+    ) -> torch.Tensor:
+        residual = hidden_states
+        hidden_states = self.ln_1(hidden_states)
+        attn_output = self.attn(
+            hidden_states=hidden_states,
+            kv_cache=kv_cache,
+            attn_metadata=attn_metadata,
+        )
+        hidden_states = attn_output + residual
+        residual = hidden_states
+        hidden_states = self.ln_2(hidden_states)
+        feed_forward_hidden_states = self.mlp(hidden_states)
+        hidden_states = residual + feed_forward_hidden_states
+        return hidden_states
+def get_xtts_max_audio_tokens(ctx: InputContext) -> int:
+    """Return the maximum number of audio tokens, currently a fixed constant."""
+    # Based on GPT config and XTTSv2 settings
+    return 608
+def dummy_seq_data_for_xtts(
+        ctx: InputContext,
+        seq_len: int,
+        audio_count: int,
+) -> SequenceData:
+    """Create dummy sequence data for XTTS profiling."""
+    # Calculate audio token space needed
+    audio_len_tokens = math.ceil(_AUDIO_TOKENS_PER_SECOND * 5)  # Assume 5s per chunk
+    audio_placeholder = array(
+        VLLM_TOKEN_ID_ARRAY_TYPE,
+        [_AUDIO_PLACEHOLDER_TOKEN]
+    ) * audio_len_tokens
+    # Add separator between chunks
+    audio_token_ids = (audio_placeholder + array(VLLM_TOKEN_ID_ARRAY_TYPE, [0])) * audio_count
+    # Fill remaining sequence with padding
+    other_token_ids = array(VLLM_TOKEN_ID_ARRAY_TYPE, [0]) * (seq_len - len(audio_token_ids))
+    return SequenceData(audio_token_ids + other_token_ids)
+def dummy_conditioning_for_xtts(
+        ctx: InputContext,
+        audio_count: int,
+) -> dict:
+    """Create dummy conditioning data for XTTS."""
+    return {
+        "audio": [(torch.zeros(80, 1024), 22050) for _ in range(audio_count)]
+    }
+def dummy_data_for_xtts(
+        ctx: InputContext,
+        seq_len: int,
+        mm_counts: Mapping[str, int],
+) -> Tuple[SequenceData, dict]:
+    """Create complete dummy data for XTTS profiling."""
+    audio_count = mm_counts["audio"]
+    seq_data = dummy_seq_data_for_xtts(ctx, seq_len, audio_count)
+    cond_data = dummy_conditioning_for_xtts(ctx, audio_count)
+    return (seq_data, cond_data)
+def input_mapper_for_xtts(ctx: InputContext, data: object) -> MultiModalInputs:
+    """Map input data to XTTS format."""
+    if not isinstance(data, list):
+        data = [data]
+    # Each item should be a tuple of (mel_spec, sample_rate)
+    for audio_input in data:
+        if not isinstance(audio_input, tuple):
+            raise NotImplementedError(f"Unsupported data type: {type(audio_input)}")
+    return MultiModalInputs({"cond_latents": data})
+@MULTIMODAL_REGISTRY.register_input_mapper("audio", input_mapper_for_xtts)
+@MULTIMODAL_REGISTRY.register_max_multimodal_tokens("audio", get_xtts_max_audio_tokens)
+@INPUT_REGISTRY.register_dummy_data(dummy_data_for_xtts)
+class XttsGPT(nn.Module, SupportsMultiModal, SupportsPP):
+    def __init__(
+            self,
+            config: PretrainedConfig,
+            multimodal_config: MultiModalConfig,
+            cache_config: Optional[CacheConfig] = None,
+            quant_config: Optional["QuantizationConfig"] = None,
+    ):
+        super().__init__()
+        self.config = config
+        self.quant_config = quant_config
+        # XTTS specific components
+        self.conditioning_encoder = ConditioningEncoder(
+            config.audio_config.mel_channels,
+            config.hidden_size,
+            num_attn_heads=config.num_attention_heads
+        )
+        if config.use_perceiver_resampler:
+            self.conditioning_perceiver = PerceiverResampler(
+                dim=config.hidden_size,
+                depth=2,
+                dim_context=config.hidden_size,
+                num_latents=32,
+                dim_head=64,
+                heads=8,
+                ff_mult=4,
+                use_flash_attn=False,
+            )
+        # Core GPT components following VLLM pattern
+        self.gpt = XttsGPT2Model(
+            config,
+            cache_config,
+            quant_config,
+            prefix="gpt"
+        )
+        # Prediction heads
+        self.text_head = ColumnParallelLinear(
+            config.hidden_size,
+            config.vocab_size,
+            bias=False,
+            quant_config=quant_config,
+            prefix="text_head"
+        )
+        self.mel_head = ColumnParallelLinear(
+            config.hidden_size,
+            config.num_audio_tokens,
+            bias=False,
+            quant_config=quant_config,
+            prefix="mel_head"
+        )
+        self.sampler = Sampler()
+    def get_style_emb(self, cond_input: torch.Tensor, return_latent: bool = False) -> torch.Tensor:
+        """Get conditioning embeddings from mel spectrograms."""
+        if not return_latent:
+            if cond_input.ndim == 4:
+                cond_input = cond_input.squeeze(1)
+            conds = self.conditioning_encoder(cond_input)
+            if hasattr(self, 'conditioning_perceiver'):
+                conds = self.conditioning_perceiver(
+                    conds.permute(0, 2, 1)
+                ).transpose(1, 2)
+        else:
+            conds = cond_input.unsqueeze(1)
+        return conds
+    def forward(self, input_ids: torch.Tensor, positions: torch.Tensor, kv_caches: List[torch.Tensor],
+            attn_metadata: AttentionMetadata, intermediate_tensors: Optional[IntermediateTensors] = None,
+            cond_latents: Optional[torch.Tensor] = None ) -> torch.Tensor:
+        """Forward pass following VLLM pattern."""
+        if cond_latents is not None:
+            # Combine conditioning with input embeddings
+            input_embeds = self.gpt.get_input_embeddings()(input_ids)
+            combined_embeds = torch.cat([cond_latents, input_embeds], dim=1)
+            hidden_states = self.gpt(
+                inputs_embeds=combined_embeds,
+                positions=positions,
+                kv_caches=kv_caches,
+                attn_metadata=attn_metadata,
+                intermediate_tensors=intermediate_tensors,
+            )
+        else:
+            hidden_states = self.gpt(
+                input_ids=input_ids,
+                positions=positions,
+                kv_caches=kv_caches,
+                attn_metadata=attn_metadata,
+                intermediate_tensors=intermediate_tensors,
+            )
+        return hidden_states
+    def compute_logits( # useless but kept for compatibility
+            self,
+            hidden_states: torch.Tensor,
+            sampling_metadata: SamplingMetadata,
+    ) -> torch.Tensor:
+        """Compute output logits."""
+        text_logits = self.text_head(hidden_states[sampling_metadata.selected_token_indices])
+        mel_logits = self.mel_head(hidden_states[sampling_metadata.selected_token_indices])
+        return torch.cat([text_logits, mel_logits], dim=1)
+    def sample(
+            self,
+            logits: torch.Tensor,
+            sampling_metadata: SamplingMetadata,
+    ) -> Optional[SamplerOutput]:
+        """Sample next tokens using VLLM sampler."""
+        return self.sampler(logits, sampling_metadata)
+    def load_weights(self, weights: Iterable[Tuple[str, torch.Tensor]]):
+        """Load weights following VLLM pattern."""
+        params_dict = dict(self.named_parameters(remove_duplicate=False))
+        for name, loaded_weight in weights:
+            if name not in params_dict:
+                continue
+            param = params_dict[name]
+            if "c_attn" in name or "c_proj" in name or "c_fc" in name:
+                if name.endswith(".weight"):
+                    loaded_weight = loaded_weight.t()
+            weight_loader = getattr(param, "weight_loader", default_weight_loader)
+            weight_loader(param, loaded_weight)
+class XttsGPT2Model(nn.Module):
+    """VLLM-style implementation of GPT2 core architecture."""
+    def __init__(
+            self,
+            config: PretrainedConfig,
+            cache_config: Optional[CacheConfig] = None,
+            quant_config: Optional[QuantizationConfig] = None,
+            prefix: str = "",
+    ):
+        super().__init__()
+        self.config = config
+        self.text_embedding = VocabParallelEmbedding(
+            config.number_text_tokens,
+            config.hidden_size
+        )
+        self.mel_embedding = VocabParallelEmbedding(
+            config.num_audio_tokens,
+            config.hidden_size
+        )
+        self.text_pos_embedding = (
+            LearnedPositionEmbeddings(
+                config.max_text_tokens + 2,
+                config.hidden_size
+            )
+            if config.max_audio_tokens != -1
+            else functools.partial(config.null_position_embeddings, dim=config.hidden_size)
+        )
+        self.mel_pos_embedding = (
+            LearnedPositionEmbeddings(
+                config.max_audio_tokens + 3,
+                config.hidden_size
+            )
+            if config.max_audio_tokens != -1
+            else functools.partial(config.null_position_embeddings, dim=config.hidden_size)
+        )
+        self.h = nn.ModuleList([
+            GPT2Block(
+                config,
+                cache_config,
+                quant_config,
+                prefix=f"{prefix}.h.{i}"
+            ) for i in range(config.num_hidden_layers)
+        ])
+        self.ln_f = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_epsilon)
+    def get_input_embeddings(self):
+        return self.text_embedding
+    def forward(
+            self,
+            input_ids: Optional[torch.Tensor] = None,
+            positions: Optional[torch.Tensor] = None,
+            inputs_embeds: Optional[torch.Tensor] = None,
+            kv_caches: Optional[List[torch.Tensor]] = None,
+            attn_metadata: Optional[AttentionMetadata] = None,
+            intermediate_tensors: Optional[IntermediateTensors] = None,
+    ) -> Union[torch.Tensor, IntermediateTensors]:
+        if get_pp_group().is_first_rank:
+            if inputs_embeds is None:
+                inputs_embeds = self.text_embedding(input_ids)
+            hidden_states = inputs_embeds
+            if positions is not None:
+                position_embeds = self.text_pos_embedding(positions)
+                hidden_states = hidden_states + position_embeds
+        else:
+            assert intermediate_tensors is not None
+            hidden_states = intermediate_tensors["hidden_states"]
+        for i, block in enumerate(self.h):
+            hidden_states = block(
+                hidden_states,
+                kv_caches[i],
+                attn_metadata
+            )
+        if not get_pp_group().is_last_rank:
+            return IntermediateTensors({"hidden_states": hidden_states})
+        hidden_states = self.ln_f(hidden_states)
+        return hidden_states