ABAO77 committed
Commit df380ff · 1 Parent(s): c6480d4

Add Wav2Vec2 model conversion to ONNX format and ONNX-based inference


- Implemented a Wav2Vec2ONNXConverter class for converting Wav2Vec2 models to ONNX format, covering model loading, conversion, and verification.
- Added a Wav2Vec2ONNXInference class for running inference with the converted ONNX model.
- Included methods for softmax calculation and for transcribing audio files.
- Added utility functions for creating compatible models and for exporting with fallback options across different ONNX opset versions.
- Introduced an optimization function for ONNX models.
- Created a helper function that converts numpy types to native Python types so results serialize cleanly (e.g., to JSON).
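
As a quick orientation, here is a minimal usage sketch based on how `speaking_controller.py` invokes the converter in this commit; the `Wav2Vec2ONNXInference` lines are commented out and hypothetical, since its exact interface is not shown in this diff.

```python
# Usage sketch only: the converter call mirrors speaking_controller.py;
# the Wav2Vec2ONNXInference lines are hypothetical (interface not shown here).
from src.model_convert.wav2vec2onnx import Wav2Vec2ONNXConverter

converter = Wav2Vec2ONNXConverter("facebook/wav2vec2-base-960h")
onnx_path = converter.convert_to_onnx(
    onnx_path="./wav2vec2_asr.onnx",
    input_length=160000,  # 10 seconds of 16 kHz audio
    opset_version=14,
)

# from src.model_convert.wav2vec2onnx import Wav2Vec2ONNXInference
# inference = Wav2Vec2ONNXInference(onnx_path)   # hypothetical constructor
# print(inference.transcribe("sample.wav"))      # hypothetical method name
```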

.gitignore CHANGED
@@ -19,4 +19,5 @@ data_test
19
  **.tiff
20
  **.webp
21
  **.svg
22
- .serena
 
 
19
  **.tiff
20
  **.webp
21
  **.svg
22
+ .serena
23
+ **.onnx
EVALUATION_CAROUSEL_UPDATES.md DELETED
File without changes
HYDRATION_ERROR_FIXES.md DELETED
File without changes
LESSON_PRACTICE_2_UPDATES.md DELETED
@@ -1,95 +0,0 @@
1
- # Lesson Practice 2 Agent Update - Summary of Changes
2
-
3
- ## Goal
4
- Adjust the `lesson_practice_2` agent so that:
5
- **Teaching Agent** becomes the default agent (instead of the Practice Agent)
6
- The learning experience feels natural and engaging
7
- **Responses are short and interactive** - not so long that users get discouraged
8
- Switching between teaching and practice mode is smooth
9
- Users feel comfortable and want to interact more
10
-
11
- ## Main Changes
12
-
13
- ### 1. Default agent (func.py)
14
- **Before**: `state["active_agent"] = "Practice Agent"`
15
- **After**: `state["active_agent"] = "Teaching Agent"`
16
- **Reason**: Start with teaching and guidance before practice
17
-
18
- ### 2. Teaching Agent Prompt (prompt.py)
19
- #### Key improvements:
20
- **Natural teaching philosophy**: Start from the learner's current level and build confidence gradually
21
- **Language flexibility**: Vietnamese when needed, English whenever possible
22
- **Engaging style**: Enthusiastic, patient, encouraging, with light humor
23
- **Concise responses**: 10-20 words maximum, one question, focused on interaction
24
- **Interactive teaching method**: One concept at a time, ask for input right away, avoid over-explaining
25
- **Quick error handling**: Correct briefly and encourage an immediate retry
26
- **Concrete examples**: Includes examples of good responses vs. ones to avoid
27
-
28
- ### 3. Practice Agent Prompt (prompt.py)
29
- #### Key improvements:
30
- **Natural conversation partner**: Focus on communication rather than perfection
31
- **Very short responses**: 1-2 sentences maximum, one good question
32
- **Partner style**: Genuinely interested, doesn't fill every silence
33
- **Encourage participation**: Leave room for the learner to share more
34
- **Example responses**: Examples of replies that are short but engaging
35
-
36
- ### 4. Switching logic (func.py)
37
- #### Teaching → Practice:
38
- The user shows understanding and confidence
39
- The user asks to practice conversation
40
- The user is ready to communicate in English
41
-
42
- #### Practice → Teaching:
43
- Detailed grammar explanation is needed
44
- Basic errors are repeated many times
45
- The user asks for more structured support
46
-
47
- ### 5. Flow routing (flow.py)
48
- Added fallback logic: default to the Teaching Agent when no active agent is set
49
-
50
- ## Benefits of the Changes
51
-
52
- ### Learner experience:
53
- 1. **Comfortable start**: The Teaching Agent creates a safe environment for learning
54
- 2. **High interaction**: Short responses, always with a question that invites participation
55
- 3. **No overwhelm**: Not too much information at once
56
- 4. **Language flexibility**: Vietnamese when needed, English whenever possible
57
- 5. **Natural transition**: When ready, the learner is encouraged to practice
58
- 6. **A real partner**: Practice mode feels like talking to a real friend, with short replies
59
-
60
- ### Educational effectiveness:
61
- 1. **Structured learning**: Teach first, practice later, in small steps
62
- 2. **High motivation**: A fun, pressure-free environment that always encourages participation
63
- 3. **Sustained attention**: Short responses keep the learner from getting tired
64
- 4. **Continuous interaction**: The learner always has a chance to respond
65
- 5. **Practical application**: Focus on real-world communication
66
- 6. **Confident communication**: Thorough preparation before practice
67
-
68
- ## How to Use
69
-
70
- 1. **Start**: The Teaching Agent greets the learner and begins teaching
71
- 2. **Learn**: Explanations and exercises with support and encouragement
72
- 3. **Ready**: Once the learner is confident, the Teaching Agent hands off to the Practice Agent
73
- 4. **Practice**: Natural conversation with the Practice Agent
74
- 5. **Support**: If help is needed, the Practice Agent switches back to the Teaching Agent
75
-
76
- ## Expected Results
77
- Learners feel comfortable and supported
78
- **They always want to keep interacting** because responses are short and easy to read
79
- The learning process is natural and pressure-free
80
- **No overwhelm** from too much information
81
- Smooth transitions between learning and practice
82
- High motivation and a desire to keep learning
83
- Confident, natural English communication
84
-
85
- ## Example Response Style
86
-
87
- ### Teaching Agent:
88
- ❌ **Avoid**: "That's excellent! You're really making great progress with past tense. Let me explain how irregular verbs work in English. There are many irregular verbs like 'go-went', 'see-saw', 'have-had'..."
89
-
90
- ✅ **Good**: "Good try! Use **went** instead. Can you try again?"
91
-
92
- ### Practice Agent:
93
- ❌ **Avoid**: "That sounds like a really interesting experience! I'd love to hear more about what happened next and how you felt about the whole situation. It must have been quite exciting for you!"
94
-
95
- ✅ **Good**: "Wow, sounds exciting! What happened next?"
requirements.txt CHANGED
@@ -17,4 +17,9 @@ deepgram-sdk
17
  whisper-openai
18
  nltk
19
  librosa
20
- eng-to-ipa
 
 
 
 
 
 
17
  whisper-openai
18
  nltk
19
  librosa
20
+ eng-to-ipa
21
+ onnxruntime
22
+ onnx
23
+ transformers
24
+ torch
25
+ optimum[onnxruntime]
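
A small sanity-check sketch for the newly added ONNX dependencies (assuming they installed cleanly); it only prints versions and the available execution providers.

```python
# Environment sanity check for the new ONNX-related dependencies.
import onnx
import onnxruntime
import torch
import transformers

print("onnxruntime providers:", onnxruntime.get_available_providers())
print("onnx", onnx.__version__, "| torch", torch.__version__, "| transformers", transformers.__version__)
```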
src/apis/__pycache__/create_app.cpython-311.pyc CHANGED
Binary files a/src/apis/__pycache__/create_app.cpython-311.pyc and b/src/apis/__pycache__/create_app.cpython-311.pyc differ
 
src/apis/controllers/speaking_controller.py ADDED
@@ -0,0 +1,1004 @@
1
+ from fastapi import FastAPI, UploadFile, File, Form, HTTPException, APIRouter
2
+ from fastapi.middleware.cors import CORSMiddleware
3
+ from pydantic import BaseModel
4
+ from typing import List, Dict, Optional
5
+ import tempfile
6
+ import os
7
+ import numpy as np
8
+ import librosa
9
+ import nltk
10
+ import eng_to_ipa as ipa
11
+ import torch
12
+ import re
13
+ from collections import defaultdict
14
+ import warnings
15
+ from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
16
+ from transformers import WhisperProcessor, WhisperForConditionalGeneration
17
+ from optimum.onnxruntime import ORTModelForSpeechSeq2Seq
18
+ from loguru import logger
19
+ import onnxruntime
20
+
21
+ warnings.filterwarnings("ignore")
22
+
23
+ # Download required NLTK data
24
+ try:
25
+ nltk.download("cmudict", quiet=True)
26
+ from nltk.corpus import cmudict
27
+ except:
28
+ print("Warning: NLTK data not available")
29
+
30
+
31
+ class WhisperASR:
32
+ """Whisper ASR for normal mode pronunciation assessment"""
33
+
34
+ def __init__(self, model_name: str = "openai/whisper-base.en"):
35
+ """
36
+ Initialize Whisper model for normal mode
37
+
38
+ Args:
39
+ model_name: HuggingFace model name for Whisper
40
+ """
41
+ print(f"Loading Whisper model: {model_name}")
42
+
43
+ try:
44
+ # Try ONNX first
45
+ self.processor = WhisperProcessor.from_pretrained(model_name)
46
+ self.model = ORTModelForSpeechSeq2Seq.from_pretrained(
47
+ model_name,
48
+ export=True,
49
+ provider="CPUExecutionProvider",
50
+ )
51
+ self.model_type = "ONNX"
52
+ print("Whisper ONNX model loaded successfully")
53
+ except:
54
+ # Fallback to PyTorch
55
+ self.processor = WhisperProcessor.from_pretrained(model_name)
56
+ self.model = WhisperForConditionalGeneration.from_pretrained(model_name)
57
+ self.model_type = "PyTorch"
58
+ print("Whisper PyTorch model loaded successfully")
59
+
60
+ self.model_name = model_name
61
+ self.sample_rate = 16000
62
+
63
+ def transcribe_to_text(self, audio_path: str) -> Dict:
64
+ """
65
+ Transcribe audio to text using Whisper
66
+ Returns transcript and confidence score
67
+ """
68
+ try:
69
+ # Load audio
70
+ audio, sr = librosa.load(audio_path, sr=self.sample_rate)
71
+
72
+ # Process audio
73
+ inputs = self.processor(audio, sampling_rate=16000, return_tensors="pt")
74
+
75
+ # Set language to English
76
+ forced_decoder_ids = self.processor.get_decoder_prompt_ids(
77
+ language="en", task="transcribe"
78
+ )
79
+
80
+ # Generate transcription
81
+ with torch.no_grad():
82
+ predicted_ids = self.model.generate(
83
+ inputs["input_features"],
84
+ forced_decoder_ids=forced_decoder_ids,
85
+ max_new_tokens=200,
86
+ do_sample=False,
87
+ )
88
+
89
+ # Decode to text
90
+ transcript = self.processor.batch_decode(
91
+ predicted_ids, skip_special_tokens=True
92
+ )[0]
93
+ transcript = transcript.strip().lower()
94
+
95
+ # Convert to phoneme representation for comparison
96
+ g2p = SimpleG2P()
97
+ phoneme_representation = g2p.get_reference_phoneme_string(transcript)
98
+
99
+ return {
100
+ "character_transcript": transcript,
101
+ "phoneme_representation": phoneme_representation,
102
+ "confidence_scores": [0.8]
103
+ * len(transcript.split()), # Simple confidence
104
+ }
105
+
106
+ except Exception as e:
107
+ logger.error(f"Whisper transcription error: {e}")
108
+ return {
109
+ "character_transcript": "",
110
+ "phoneme_representation": "",
111
+ "confidence_scores": [],
112
+ }
113
+
114
+ def get_model_info(self) -> Dict:
115
+ """Get information about the loaded Whisper model"""
116
+ return {
117
+ "model_name": self.model_name,
118
+ "model_type": self.model_type,
119
+ "sample_rate": self.sample_rate,
120
+ }
121
+
122
+
123
+ class Wav2Vec2CharacterASRONNX:
124
+ """Wav2Vec2 character-level ASR with ONNX runtime - no language model correction"""
125
+
126
+ def __init__(
127
+ self,
128
+ onnx_model_path: str = "./wav2vec2_asr.onnx",
129
+ processor_name: str = "facebook/wav2vec2-base-960h",
130
+ ):
131
+ """
132
+ Initialize Wav2Vec2 ONNX character-level model
133
+ Automatically creates ONNX model if it doesn't exist
134
+
135
+ Args:
136
+ onnx_model_path: Path to the ONNX model file
137
+ processor_name: HuggingFace model name for the processor
138
+ """
139
+ print(f"Loading Wav2Vec2 ONNX model from: {onnx_model_path}")
140
+ print(f"Loading processor: {processor_name}")
141
+
142
+ # Check if ONNX model exists, if not create it
143
+ if not os.path.exists(onnx_model_path):
144
+ print(f"ONNX model not found at {onnx_model_path}. Creating it...")
145
+ self._create_onnx_model(onnx_model_path, processor_name)
146
+
147
+ try:
148
+ # Load ONNX model
149
+ self.session = onnxruntime.InferenceSession(onnx_model_path)
150
+ self.input_name = self.session.get_inputs()[0].name
151
+ self.output_name = self.session.get_outputs()[0].name
152
+
153
+ # Load processor
154
+ self.processor = Wav2Vec2Processor.from_pretrained(processor_name)
155
+
156
+ print("ONNX Wav2Vec2 character model loaded successfully")
157
+ self.model_name = processor_name
158
+ self.onnx_path = onnx_model_path
159
+ self.sample_rate = 16000
160
+
161
+ except Exception as e:
162
+ print(f"Error loading ONNX model: {e}")
163
+ raise
164
+
165
+ def _create_onnx_model(self, onnx_model_path: str, processor_name: str):
166
+ """Create ONNX model if it doesn't exist"""
167
+ try:
168
+ # Import the converter from model_convert
169
+ from src.model_convert.wav2vec2onnx import Wav2Vec2ONNXConverter
170
+
171
+ print("Creating new ONNX model...")
172
+ converter = Wav2Vec2ONNXConverter(processor_name)
173
+ created_path = converter.convert_to_onnx(
174
+ onnx_path=onnx_model_path,
175
+ input_length=160000, # 10 seconds
176
+ opset_version=14,
177
+ )
178
+ print(f"✓ ONNX model created successfully at: {created_path}")
179
+
180
+ except ImportError as e:
181
+ print(f"Error importing Wav2Vec2ONNXConverter: {e}")
182
+ # Fallback: use the convert_to_onnx.py directly if wav2vec2onnx.py doesn't work
183
+ self._fallback_create_onnx_model(onnx_model_path, processor_name)
184
+
185
+ except Exception as e:
186
+ print(f"Error creating ONNX model: {e}")
187
+ # Try fallback method
188
+ self._fallback_create_onnx_model(onnx_model_path, processor_name)
189
+
190
+ def _fallback_create_onnx_model(self, onnx_model_path: str, processor_name: str):
191
+ """Fallback method to create ONNX model using basic torch.onnx.export"""
192
+ try:
193
+ print("Using fallback method to create ONNX model...")
194
+
195
+ # Load PyTorch model
196
+ model = Wav2Vec2ForCTC.from_pretrained(processor_name)
197
+ model.eval()
198
+
199
+ # Create dummy input
200
+ dummy_input = torch.randn(1, 160000, dtype=torch.float32)
201
+
202
+ # Export to ONNX
203
+ with torch.no_grad():
204
+ torch.onnx.export(
205
+ model,
206
+ dummy_input,
207
+ onnx_model_path,
208
+ input_names=["input_values"],
209
+ output_names=["logits"],
210
+ dynamic_axes={
211
+ "input_values": {0: "batch_size", 1: "sequence_length"},
212
+ "logits": {0: "batch_size", 1: "sequence_length"},
213
+ },
214
+ opset_version=14,
215
+ do_constant_folding=True,
216
+ verbose=False,
217
+ export_params=True,
218
+ )
219
+
220
+ print(f"✓ Fallback ONNX model created at: {onnx_model_path}")
221
+
222
+ except Exception as e:
223
+ print(f"Fallback method also failed: {e}")
224
+ raise Exception(f"Could not create ONNX model: {e}")
225
+
226
+ def transcribe_to_characters(self, audio_path: str) -> Dict:
227
+ """
228
+ Transcribe audio directly to characters using ONNX model (no language model correction)
229
+ Returns raw character sequence as produced by the model
230
+ """
231
+ try:
232
+ # Load audio
233
+ speech, sr = librosa.load(audio_path, sr=self.sample_rate)
234
+
235
+ # Prepare input for ONNX
236
+ input_values = self.processor(
237
+ speech, sampling_rate=self.sample_rate, return_tensors="np"
238
+ ).input_values
239
+
240
+ # Run ONNX inference
241
+ ort_inputs = {self.input_name: input_values}
242
+ ort_outputs = self.session.run([self.output_name], ort_inputs)
243
+ logits = ort_outputs[0]
244
+
245
+ # Get predictions
246
+ predicted_ids = np.argmax(logits, axis=-1)
247
+
248
+ # Decode to characters directly
249
+ character_transcript = self.processor.batch_decode(predicted_ids)[0]
250
+ logger.info(f"character_transcript {character_transcript}")
251
+
252
+ # Clean up character transcript
253
+ character_transcript = self._clean_character_transcript(
254
+ character_transcript
255
+ )
256
+
257
+ # Convert characters to phoneme-like representation
258
+ phoneme_like_transcript = self._characters_to_phoneme_representation(
259
+ character_transcript
260
+ )
261
+
262
+ # Calculate confidence scores
263
+ confidence_scores = self._calculate_confidence_scores(logits)
264
+
265
+ return {
266
+ "character_transcript": character_transcript,
267
+ "phoneme_representation": phoneme_like_transcript,
268
+ "raw_predicted_ids": predicted_ids[0].tolist(),
269
+ "confidence_scores": confidence_scores[:100], # Limit for JSON
270
+ }
271
+
272
+ except Exception as e:
273
+ print(f"Transcription error: {e}")
274
+ return {
275
+ "character_transcript": "",
276
+ "phoneme_representation": "",
277
+ "raw_predicted_ids": [],
278
+ "confidence_scores": [],
279
+ }
280
+
281
+ def _calculate_confidence_scores(self, logits: np.ndarray) -> List[float]:
282
+ """Calculate confidence scores from logits using numpy"""
283
+ # Apply softmax
284
+ exp_logits = np.exp(logits - np.max(logits, axis=-1, keepdims=True))
285
+ softmax_probs = exp_logits / np.sum(exp_logits, axis=-1, keepdims=True)
286
+
287
+ # Get max probabilities
288
+ max_probs = np.max(softmax_probs, axis=-1)[0]
289
+ return max_probs.tolist()
290
+
291
+ def _clean_character_transcript(self, transcript: str) -> str:
292
+ """Clean and standardize character transcript"""
293
+ # Remove extra spaces and special tokens
294
+ logger.info(f"Raw transcript before cleaning: {transcript}")
295
+ cleaned = re.sub(r"\s+", " ", transcript)
296
+ cleaned = cleaned.strip().lower()
297
+
298
+ return cleaned
299
+
300
+ def _characters_to_phoneme_representation(self, text: str) -> str:
301
+ """Convert character-based transcript to phoneme-like representation for comparison"""
302
+ # This is a simple character-to-phoneme mapping for pronunciation comparison
303
+ # The idea is to convert the raw character output to something comparable with reference phonemes
304
+
305
+ if not text:
306
+ return ""
307
+
308
+ words = text.split()
309
+ phoneme_words = []
310
+
311
+ # Use our G2P to convert transcript words to phonemes
312
+ g2p = SimpleG2P()
313
+
314
+ for word in words:
315
+ try:
316
+ word_data = g2p.text_to_phonemes(word)[0]
317
+ phoneme_words.extend(word_data["phonemes"])
318
+ except:
319
+ # Fallback: simple letter-to-sound mapping
320
+ phoneme_words.extend(self._simple_letter_to_phoneme(word))
321
+
322
+ return " ".join(phoneme_words)
323
+
324
+ def _simple_letter_to_phoneme(self, word: str) -> List[str]:
325
+ """Simple fallback letter-to-phoneme conversion"""
326
+ letter_to_phoneme = {
327
+ "a": "æ",
328
+ "b": "b",
329
+ "c": "k",
330
+ "d": "d",
331
+ "e": "ɛ",
332
+ "f": "f",
333
+ "g": "ɡ",
334
+ "h": "h",
335
+ "i": "ɪ",
336
+ "j": "dʒ",
337
+ "k": "k",
338
+ "l": "l",
339
+ "m": "m",
340
+ "n": "n",
341
+ "o": "ʌ",
342
+ "p": "p",
343
+ "q": "k",
344
+ "r": "r",
345
+ "s": "s",
346
+ "t": "t",
347
+ "u": "ʌ",
348
+ "v": "v",
349
+ "w": "w",
350
+ "x": "ks",
351
+ "y": "j",
352
+ "z": "z",
353
+ }
354
+
355
+ phonemes = []
356
+ for letter in word.lower():
357
+ if letter in letter_to_phoneme:
358
+ phonemes.append(letter_to_phoneme[letter])
359
+
360
+ return phonemes
361
+
362
+ def get_model_info(self) -> Dict:
363
+ """Get information about the loaded ONNX model"""
364
+ return {
365
+ "onnx_model_path": self.onnx_path,
366
+ "processor_name": self.model_name,
367
+ "input_name": self.input_name,
368
+ "output_name": self.output_name,
369
+ "sample_rate": self.sample_rate,
370
+ "session_providers": self.session.get_providers(),
371
+ }
372
+
373
+
374
+ class SimpleG2P:
375
+ """Simple Grapheme-to-Phoneme converter for reference text"""
376
+
377
+ def __init__(self):
378
+ try:
379
+ self.cmu_dict = cmudict.dict()
380
+ except:
381
+ self.cmu_dict = {}
382
+ print("Warning: CMU dictionary not available")
383
+
384
+ def text_to_phonemes(self, text: str) -> List[Dict]:
385
+ """Convert text to phoneme sequence"""
386
+ words = self._clean_text(text).split()
387
+ phoneme_sequence = []
388
+
389
+ for word in words:
390
+ word_phonemes = self._get_word_phonemes(word)
391
+ phoneme_sequence.append(
392
+ {
393
+ "word": word,
394
+ "phonemes": word_phonemes,
395
+ "ipa": self._get_ipa(word),
396
+ "phoneme_string": " ".join(word_phonemes),
397
+ }
398
+ )
399
+
400
+ return phoneme_sequence
401
+
402
+ def get_reference_phoneme_string(self, text: str) -> str:
403
+ """Get reference phoneme string for comparison"""
404
+ phoneme_sequence = self.text_to_phonemes(text)
405
+ all_phonemes = []
406
+
407
+ for word_data in phoneme_sequence:
408
+ all_phonemes.extend(word_data["phonemes"])
409
+
410
+ return " ".join(all_phonemes)
411
+
412
+ def _clean_text(self, text: str) -> str:
413
+ """Clean text for processing"""
414
+ text = re.sub(r"[^\w\s\']", " ", text)
415
+ text = re.sub(r"\s+", " ", text)
416
+ return text.lower().strip()
417
+
418
+ def _get_word_phonemes(self, word: str) -> List[str]:
419
+ """Get phonemes for a word"""
420
+ word_lower = word.lower()
421
+
422
+ if word_lower in self.cmu_dict:
423
+ # Remove stress markers and convert to Wav2Vec2 phoneme format
424
+ phonemes = self.cmu_dict[word_lower][0]
425
+ clean_phonemes = [re.sub(r"[0-9]", "", p) for p in phonemes]
426
+ return self._convert_to_wav2vec_format(clean_phonemes)
427
+ else:
428
+ return self._estimate_phonemes(word)
429
+
430
+ def _convert_to_wav2vec_format(self, cmu_phonemes: List[str]) -> List[str]:
431
+ """Convert CMU phonemes to Wav2Vec2 format"""
432
+ # Mapping from CMU to Wav2Vec2/eSpeak phonemes
433
+ cmu_to_espeak = {
434
+ "AA": "ɑ",
435
+ "AE": "æ",
436
+ "AH": "ʌ",
437
+ "AO": "ɔ",
438
+ "AW": "aʊ",
439
+ "AY": "aɪ",
440
+ "EH": "ɛ",
441
+ "ER": "ɝ",
442
+ "EY": "eɪ",
443
+ "IH": "ɪ",
444
+ "IY": "i",
445
+ "OW": "oʊ",
446
+ "OY": "ɔɪ",
447
+ "UH": "ʊ",
448
+ "UW": "u",
449
+ "B": "b",
450
+ "CH": "tʃ",
451
+ "D": "d",
452
+ "DH": "ð",
453
+ "F": "f",
454
+ "G": "ɡ",
455
+ "HH": "h",
456
+ "JH": "dʒ",
457
+ "K": "k",
458
+ "L": "l",
459
+ "M": "m",
460
+ "N": "n",
461
+ "NG": "ŋ",
462
+ "P": "p",
463
+ "R": "r",
464
+ "S": "s",
465
+ "SH": "ʃ",
466
+ "T": "t",
467
+ "TH": "θ",
468
+ "V": "v",
469
+ "W": "w",
470
+ "Y": "j",
471
+ "Z": "z",
472
+ "ZH": "ʒ",
473
+ }
474
+
475
+ converted = []
476
+ for phoneme in cmu_phonemes:
477
+ converted_phoneme = cmu_to_espeak.get(phoneme, phoneme.lower())
478
+ converted.append(converted_phoneme)
479
+
480
+ return converted
481
+
482
+ def _get_ipa(self, word: str) -> str:
483
+ """Get IPA transcription"""
484
+ try:
485
+ return ipa.convert(word)
486
+ except:
487
+ return f"/{word}/"
488
+
489
+ def _estimate_phonemes(self, word: str) -> List[str]:
490
+ """Estimate phonemes for unknown words"""
491
+ # Basic phoneme estimation with eSpeak-style output
492
+ phoneme_map = {
493
+ "ch": ["tʃ"],
494
+ "sh": ["ʃ"],
495
+ "th": ["θ"],
496
+ "ph": ["f"],
497
+ "ck": ["k"],
498
+ "ng": ["ŋ"],
499
+ "qu": ["k", "w"],
500
+ "a": ["æ"],
501
+ "e": ["ɛ"],
502
+ "i": ["ɪ"],
503
+ "o": ["ʌ"],
504
+ "u": ["ʌ"],
505
+ "b": ["b"],
506
+ "c": ["k"],
507
+ "d": ["d"],
508
+ "f": ["f"],
509
+ "g": ["ɡ"],
510
+ "h": ["h"],
511
+ "j": ["dʒ"],
512
+ "k": ["k"],
513
+ "l": ["l"],
514
+ "m": ["m"],
515
+ "n": ["n"],
516
+ "p": ["p"],
517
+ "r": ["r"],
518
+ "s": ["s"],
519
+ "t": ["t"],
520
+ "v": ["v"],
521
+ "w": ["w"],
522
+ "x": ["k", "s"],
523
+ "y": ["j"],
524
+ "z": ["z"],
525
+ }
526
+
527
+ word = word.lower()
528
+ phonemes = []
529
+ i = 0
530
+
531
+ while i < len(word):
532
+ # Check 2-letter combinations first
533
+ if i <= len(word) - 2:
534
+ two_char = word[i : i + 2]
535
+ if two_char in phoneme_map:
536
+ phonemes.extend(phoneme_map[two_char])
537
+ i += 2
538
+ continue
539
+
540
+ # Single character
541
+ char = word[i]
542
+ if char in phoneme_map:
543
+ phonemes.extend(phoneme_map[char])
544
+
545
+ i += 1
546
+
547
+ return phonemes
548
+
549
+
550
+ class PhonemeComparator:
551
+ """Compare reference and learner phoneme sequences"""
552
+
553
+ def __init__(self):
554
+ # Vietnamese speakers' common phoneme substitutions
555
+ self.substitution_patterns = {
556
+ "θ": ["f", "s", "t"], # TH → F, S, T
557
+ "ð": ["d", "z", "v"], # DH → D, Z, V
558
+ "v": ["w", "f"], # V → W, F
559
+ "r": ["l"], # R → L
560
+ "l": ["r"], # L → R
561
+ "z": ["s"], # Z → S
562
+ "ʒ": ["ʃ", "z"], # ZH → SH, Z
563
+ "ŋ": ["n"], # NG → N
564
+ }
565
+
566
+ # Difficulty levels for Vietnamese speakers
567
+ self.difficulty_map = {
568
+ "θ": 0.9, # th (think)
569
+ "ð": 0.9, # th (this)
570
+ "v": 0.8, # v
571
+ "z": 0.8, # z
572
+ "ʒ": 0.9, # zh (measure)
573
+ "r": 0.7, # r
574
+ "l": 0.6, # l
575
+ "w": 0.5, # w
576
+ "f": 0.4, # f
577
+ "s": 0.3, # s
578
+ "ʃ": 0.5, # sh
579
+ "tʃ": 0.4, # ch
580
+ "dʒ": 0.5, # j
581
+ "ŋ": 0.3, # ng
582
+ }
583
+
584
+ def compare_phoneme_sequences(
585
+ self, reference_phonemes: str, learner_phonemes: str
586
+ ) -> List[Dict]:
587
+ """Compare reference and learner phoneme sequences"""
588
+
589
+ # Split phoneme strings
590
+ ref_phones = reference_phonemes.split()
591
+ learner_phones = learner_phonemes.split()
592
+
593
+ print(f"Reference phonemes: {ref_phones}")
594
+ print(f"Learner phonemes: {learner_phones}")
595
+
596
+ # Simple alignment comparison
597
+ comparisons = []
598
+ max_len = max(len(ref_phones), len(learner_phones))
599
+
600
+ for i in range(max_len):
601
+ ref_phoneme = ref_phones[i] if i < len(ref_phones) else ""
602
+ learner_phoneme = learner_phones[i] if i < len(learner_phones) else ""
603
+
604
+ if ref_phoneme and learner_phoneme:
605
+ # Both present - check accuracy
606
+ if ref_phoneme == learner_phoneme:
607
+ status = "correct"
608
+ score = 1.0
609
+ elif self._is_acceptable_substitution(ref_phoneme, learner_phoneme):
610
+ status = "acceptable"
611
+ score = 0.7
612
+ else:
613
+ status = "wrong"
614
+ score = 0.2
615
+
616
+ elif ref_phoneme and not learner_phoneme:
617
+ # Missing phoneme
618
+ status = "missing"
619
+ score = 0.0
620
+
621
+ elif learner_phoneme and not ref_phoneme:
622
+ # Extra phoneme
623
+ status = "extra"
624
+ score = 0.0
625
+ else:
626
+ continue
627
+
628
+ comparison = {
629
+ "position": i,
630
+ "reference_phoneme": ref_phoneme,
631
+ "learner_phoneme": learner_phoneme,
632
+ "status": status,
633
+ "score": score,
634
+ "difficulty": self.difficulty_map.get(ref_phoneme, 0.3),
635
+ }
636
+
637
+ comparisons.append(comparison)
638
+
639
+ return comparisons
640
+
641
+ def _is_acceptable_substitution(self, reference: str, learner: str) -> bool:
642
+ """Check if learner phoneme is acceptable substitution for Vietnamese speakers"""
643
+ acceptable = self.substitution_patterns.get(reference, [])
644
+ return learner in acceptable
645
+
646
+
647
+ # =============================================================================
648
+ # WORD ANALYZER
649
+ # =============================================================================
650
+
651
+
652
+ class WordAnalyzer:
653
+ """Analyze word-level pronunciation accuracy using character-based ASR"""
654
+
655
+ def __init__(self):
656
+ self.g2p = SimpleG2P()
657
+ self.comparator = PhonemeComparator()
658
+
659
+ def analyze_words(self, reference_text: str, learner_phonemes: str) -> Dict:
660
+ """Analyze word-level pronunciation using phoneme representation from character ASR"""
661
+
662
+ # Get reference phonemes by word
663
+ reference_words = self.g2p.text_to_phonemes(reference_text)
664
+
665
+ # Get overall phoneme comparison
666
+ reference_phoneme_string = self.g2p.get_reference_phoneme_string(reference_text)
667
+ phoneme_comparisons = self.comparator.compare_phoneme_sequences(
668
+ reference_phoneme_string, learner_phonemes
669
+ )
670
+
671
+ # Map phonemes back to words
672
+ word_highlights = self._create_word_highlights(
673
+ reference_words, phoneme_comparisons
674
+ )
675
+
676
+ # Identify wrong words
677
+ wrong_words = self._identify_wrong_words(word_highlights, phoneme_comparisons)
678
+
679
+ return {
680
+ "word_highlights": word_highlights,
681
+ "phoneme_differences": phoneme_comparisons,
682
+ "wrong_words": wrong_words,
683
+ }
684
+
685
+ def _create_word_highlights(
686
+ self, reference_words: List[Dict], phoneme_comparisons: List[Dict]
687
+ ) -> List[Dict]:
688
+ """Create word highlighting data"""
689
+
690
+ word_highlights = []
691
+ phoneme_index = 0
692
+
693
+ for word_data in reference_words:
694
+ word = word_data["word"]
695
+ word_phonemes = word_data["phonemes"]
696
+ num_phonemes = len(word_phonemes)
697
+
698
+ # Get phoneme scores for this word
699
+ word_phoneme_scores = []
700
+ for j in range(num_phonemes):
701
+ if phoneme_index + j < len(phoneme_comparisons):
702
+ comparison = phoneme_comparisons[phoneme_index + j]
703
+ word_phoneme_scores.append(comparison["score"])
704
+
705
+ # Calculate word score
706
+ word_score = np.mean(word_phoneme_scores) if word_phoneme_scores else 0.0
707
+
708
+ # Create word highlight
709
+ highlight = {
710
+ "word": word,
711
+ "score": float(word_score),
712
+ "status": self._get_word_status(word_score),
713
+ "color": self._get_word_color(word_score),
714
+ "phonemes": word_phonemes,
715
+ "ipa": word_data["ipa"],
716
+ "phoneme_scores": word_phoneme_scores,
717
+ "phoneme_start_index": phoneme_index,
718
+ "phoneme_end_index": phoneme_index + num_phonemes - 1,
719
+ }
720
+
721
+ word_highlights.append(highlight)
722
+ phoneme_index += num_phonemes
723
+
724
+ return word_highlights
725
+
726
+ def _identify_wrong_words(
727
+ self, word_highlights: List[Dict], phoneme_comparisons: List[Dict]
728
+ ) -> List[Dict]:
729
+ """Identify words that were pronounced incorrectly"""
730
+
731
+ wrong_words = []
732
+
733
+ for word_highlight in word_highlights:
734
+ if word_highlight["score"] < 0.6: # Threshold for wrong pronunciation
735
+
736
+ # Find specific phoneme errors for this word
737
+ start_idx = word_highlight["phoneme_start_index"]
738
+ end_idx = word_highlight["phoneme_end_index"]
739
+
740
+ wrong_phonemes = []
741
+ missing_phonemes = []
742
+
743
+ for i in range(start_idx, min(end_idx + 1, len(phoneme_comparisons))):
744
+ comparison = phoneme_comparisons[i]
745
+
746
+ if comparison["status"] == "wrong":
747
+ wrong_phonemes.append(
748
+ {
749
+ "expected": comparison["reference_phoneme"],
750
+ "actual": comparison["learner_phoneme"],
751
+ "difficulty": comparison["difficulty"],
752
+ }
753
+ )
754
+ elif comparison["status"] == "missing":
755
+ missing_phonemes.append(
756
+ {
757
+ "phoneme": comparison["reference_phoneme"],
758
+ "difficulty": comparison["difficulty"],
759
+ }
760
+ )
761
+
762
+ wrong_word = {
763
+ "word": word_highlight["word"],
764
+ "score": word_highlight["score"],
765
+ "expected_phonemes": word_highlight["phonemes"],
766
+ "ipa": word_highlight["ipa"],
767
+ "wrong_phonemes": wrong_phonemes,
768
+ "missing_phonemes": missing_phonemes,
769
+ "tips": self._get_vietnamese_tips(wrong_phonemes, missing_phonemes),
770
+ }
771
+
772
+ wrong_words.append(wrong_word)
773
+
774
+ return wrong_words
775
+
776
+ def _get_word_status(self, score: float) -> str:
777
+ """Get word status from score"""
778
+ if score >= 0.8:
779
+ return "excellent"
780
+ elif score >= 0.6:
781
+ return "good"
782
+ elif score >= 0.4:
783
+ return "needs_practice"
784
+ else:
785
+ return "poor"
786
+
787
+ def _get_word_color(self, score: float) -> str:
788
+ """Get color for word highlighting"""
789
+ if score >= 0.8:
790
+ return "#22c55e" # Green
791
+ elif score >= 0.6:
792
+ return "#84cc16" # Light green
793
+ elif score >= 0.4:
794
+ return "#eab308" # Yellow
795
+ else:
796
+ return "#ef4444" # Red
797
+
798
+ def _get_vietnamese_tips(
799
+ self, wrong_phonemes: List[Dict], missing_phonemes: List[Dict]
800
+ ) -> List[str]:
801
+ """Get Vietnamese-specific pronunciation tips"""
802
+
803
+ tips = []
804
+
805
+ # Tips for specific Vietnamese pronunciation challenges
806
+ vietnamese_tips = {
807
+ "θ": "Đặt lưỡi giữa răng trên và dưới, thổi nhẹ (think, three)",
808
+ "ð": "Giống θ nhưng rung dây thanh âm (this, that)",
809
+ "v": "Chạm môi dưới vào răng trên, không dùng cả hai môi như tiếng Việt",
810
+ "r": "Cuộn lưỡi nhưng không chạm vào vòm miệng, không lăn lưỡi",
811
+ "l": "Đầu lưỡi chạm vào vòm miệng sau răng",
812
+ "z": "Giống âm 's' nhưng có rung dây thanh âm",
813
+ "ʒ": "Giống âm 'ʃ' (sh) nhưng có rung dây thanh âm",
814
+ "w": "Tròn môi như âm 'u', không dùng răng như âm 'v'",
815
+ }
816
+
817
+ # Add tips for wrong phonemes
818
+ for wrong in wrong_phonemes:
819
+ expected = wrong["expected"]
820
+ actual = wrong["actual"]
821
+
822
+ if expected in vietnamese_tips:
823
+ tips.append(f"Âm '{expected}': {vietnamese_tips[expected]}")
824
+ else:
825
+ tips.append(f"Luyện âm '{expected}' thay vì '{actual}'")
826
+
827
+ # Add tips for missing phonemes
828
+ for missing in missing_phonemes:
829
+ phoneme = missing["phoneme"]
830
+ if phoneme in vietnamese_tips:
831
+ tips.append(f"Thiếu âm '{phoneme}': {vietnamese_tips[phoneme]}")
832
+
833
+ return tips
834
+
835
+
836
+ class SimpleFeedbackGenerator:
837
+ """Generate simple, actionable feedback in Vietnamese"""
838
+
839
+ def generate_feedback(
840
+ self,
841
+ overall_score: float,
842
+ wrong_words: List[Dict],
843
+ phoneme_comparisons: List[Dict],
844
+ ) -> List[str]:
845
+ """Generate Vietnamese feedback"""
846
+
847
+ feedback = []
848
+
849
+ # Overall feedback in Vietnamese
850
+ if overall_score >= 0.8:
851
+ feedback.append("Phát âm rất tốt! Bạn đã làm xuất sắc.")
852
+ elif overall_score >= 0.6:
853
+ feedback.append("Phát âm khá tốt, còn một vài điểm cần cải thiện.")
854
+ elif overall_score >= 0.4:
855
+ feedback.append(
856
+ "Cần luyện tập thêm. Tập trung vào những từ được đánh dấu đỏ."
857
+ )
858
+ else:
859
+ feedback.append("Hãy luyện tập chậm và rõ ràng hơn.")
860
+
861
+ # Wrong words feedback
862
+ if wrong_words:
863
+ if len(wrong_words) <= 3:
864
+ word_names = [w["word"] for w in wrong_words]
865
+ feedback.append(f"Các từ cần luyện tập: {', '.join(word_names)}")
866
+ else:
867
+ feedback.append(
868
+ f"Có {len(wrong_words)} từ cần luyện tập. Tập trung vào từng từ một."
869
+ )
870
+
871
+ # Most problematic phonemes
872
+ problem_phonemes = defaultdict(int)
873
+ for comparison in phoneme_comparisons:
874
+ if comparison["status"] in ["wrong", "missing"]:
875
+ phoneme = comparison["reference_phoneme"]
876
+ problem_phonemes[phoneme] += 1
877
+
878
+ if problem_phonemes:
879
+ most_difficult = sorted(
880
+ problem_phonemes.items(), key=lambda x: x[1], reverse=True
881
+ )
882
+ top_problem = most_difficult[0][0]
883
+
884
+ phoneme_tips = {
885
+ "θ": "Lưỡi giữa răng, thổi nhẹ",
886
+ "ð": "Lưỡi giữa răng, rung dây thanh",
887
+ "v": "Môi dưới chạm răng trên",
888
+ "r": "Cuộn lưỡi, không chạm vòm miệng",
889
+ "l": "Lưỡi chạm vòm miệng",
890
+ "z": "Như 's' nhưng rung dây thanh",
891
+ }
892
+
893
+ if top_problem in phoneme_tips:
894
+ feedback.append(
895
+ f"Âm khó nhất '{top_problem}': {phoneme_tips[top_problem]}"
896
+ )
897
+
898
+ return feedback
899
+
900
+
901
+ class SimplePronunciationAssessor:
902
+ """Main pronunciation assessor supporting both normal (Whisper) and advanced (Wav2Vec2) modes"""
903
+
904
+ def __init__(self):
905
+ print("Initializing Simple Pronunciation Assessor...")
906
+ self.wav2vec2_asr = Wav2Vec2CharacterASRONNX() # Advanced mode
907
+ self.whisper_asr = WhisperASR() # Normal mode
908
+ self.word_analyzer = WordAnalyzer()
909
+ self.feedback_generator = SimpleFeedbackGenerator()
910
+ print("Initialization completed")
911
+
912
+ def assess_pronunciation(
913
+ self, audio_path: str, reference_text: str, mode: str = "normal"
914
+ ) -> Dict:
915
+ """
916
+ Main assessment function with mode selection
917
+
918
+ Args:
919
+ audio_path: Path to audio file
920
+ reference_text: Reference text to compare
921
+ mode: 'normal' (Whisper) or 'advanced' (Wav2Vec2)
922
+
923
+ Output: Word highlights + Phoneme differences + Wrong words
924
+ """
925
+
926
+ print(f"Starting pronunciation assessment in {mode} mode...")
927
+
928
+ # Step 1: Choose ASR model based on mode
929
+ if mode == "advanced":
930
+ print("Step 1: Using Wav2Vec2 character transcription...")
931
+ asr_result = self.wav2vec2_asr.transcribe_to_characters(audio_path)
932
+ model_info = f"Wav2Vec2-Character ({self.wav2vec2_asr.model_name})"
933
+ else: # normal mode
934
+ print("Step 1: Using Whisper transcription...")
935
+ asr_result = self.whisper_asr.transcribe_to_text(audio_path)
936
+ model_info = f"Whisper ({self.whisper_asr.model_name})"
937
+
938
+ character_transcript = asr_result["character_transcript"]
939
+ phoneme_representation = asr_result["phoneme_representation"]
940
+
941
+ print(f"Character transcript: {character_transcript}")
942
+ print(f"Phoneme representation: {phoneme_representation}")
943
+
944
+ # Step 2: Word analysis using phoneme representation
945
+ print("Step 2: Analyzing words...")
946
+ analysis_result = self.word_analyzer.analyze_words(
947
+ reference_text, phoneme_representation
948
+ )
949
+
950
+ # Step 3: Calculate overall score
951
+ phoneme_comparisons = analysis_result["phoneme_differences"]
952
+ overall_score = self._calculate_overall_score(phoneme_comparisons)
953
+
954
+ # Step 4: Generate feedback
955
+ print("Step 3: Generating feedback...")
956
+ feedback = self.feedback_generator.generate_feedback(
957
+ overall_score, analysis_result["wrong_words"], phoneme_comparisons
958
+ )
959
+
960
+ result = {
961
+ "transcript": character_transcript, # What user actually said
962
+ "transcript_phonemes": phoneme_representation,
963
+ "user_phonemes": phoneme_representation, # Alias for UI clarity
964
+ "character_transcript": character_transcript,
965
+ "overall_score": overall_score,
966
+ "word_highlights": analysis_result["word_highlights"],
967
+ "phoneme_differences": phoneme_comparisons,
968
+ "wrong_words": analysis_result["wrong_words"],
969
+ "feedback": feedback,
970
+ "processing_info": {
971
+ "model_used": model_info,
972
+ "mode": mode,
973
+ "character_based": mode == "advanced",
974
+ "language_model_correction": mode == "normal",
975
+ "raw_output": mode == "advanced",
976
+ },
977
+ }
978
+
979
+ print("Assessment completed successfully")
980
+ return result
981
+
982
+ def _calculate_overall_score(self, phoneme_comparisons: List[Dict]) -> float:
983
+ """Calculate overall pronunciation score"""
984
+ if not phoneme_comparisons:
985
+ return 0.0
986
+
987
+ total_score = sum(comparison["score"] for comparison in phoneme_comparisons)
988
+ return total_score / len(phoneme_comparisons)
989
+
990
+
991
+ def convert_numpy_types(obj):
992
+ """Convert numpy types to Python native types"""
993
+ if isinstance(obj, np.integer):
994
+ return int(obj)
995
+ elif isinstance(obj, np.floating):
996
+ return float(obj)
997
+ elif isinstance(obj, np.ndarray):
998
+ return obj.tolist()
999
+ elif isinstance(obj, dict):
1000
+ return {key: convert_numpy_types(value) for key, value in obj.items()}
1001
+ elif isinstance(obj, list):
1002
+ return [convert_numpy_types(item) for item in obj]
1003
+ else:
1004
+ return obj
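
For reference, a minimal sketch of exercising the new controller outside FastAPI, using only names defined in the file above; `sample.wav` is a placeholder audio path, not part of the repository.

```python
# Local smoke-test sketch for the new pronunciation assessor.
# "sample.wav" is a placeholder; librosa resamples the audio to 16 kHz on load.
from src.apis.controllers.speaking_controller import (
    SimplePronunciationAssessor,
    convert_numpy_types,
)

assessor = SimplePronunciationAssessor()
result = assessor.assess_pronunciation(
    audio_path="sample.wav",
    reference_text="hello world",
    mode="advanced",  # "normal" -> Whisper, "advanced" -> Wav2Vec2 ONNX
)

result = convert_numpy_types(result)  # make numpy values JSON-serializable
print(result["overall_score"])
print(result["wrong_words"])
```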
src/apis/routes/speaking_route.py CHANGED
@@ -1,38 +1,23 @@
1
- # PRONUNCIATION ASSESSMENT USING WAV2VEC2PHONEME
2
- # Input: Audio + Reference Text → Output: Word highlights + Phoneme diff + Wrong words
3
- # Uses Wav2Vec2Phoneme for accurate phoneme-level transcription without language model correction
4
-
5
- from fastapi import FastAPI, UploadFile, File, Form, HTTPException, APIRouter
6
- from fastapi.middleware.cors import CORSMiddleware
7
  from pydantic import BaseModel
8
- from typing import List, Dict, Optional
9
  import tempfile
10
- import os
11
  import numpy as np
12
- import librosa
13
- import nltk
14
- import eng_to_ipa as ipa
15
- import torch
16
  import re
17
- from collections import defaultdict
18
  import warnings
19
- from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC, Wav2Vec2PhonemeCTCTokenizer
 
 
 
 
 
 
20
 
21
  warnings.filterwarnings("ignore")
22
 
23
- # Download required NLTK data
24
- try:
25
- nltk.download("cmudict", quiet=True)
26
- from nltk.corpus import cmudict
27
- except:
28
- print("Warning: NLTK data not available")
29
-
30
- # =============================================================================
31
- # MODELS
32
- # =============================================================================
33
-
34
  router = APIRouter(prefix="/pronunciation", tags=["Pronunciation"])
35
 
 
36
  class PronunciationAssessmentResult(BaseModel):
37
  transcript: str # What the user actually said (character transcript)
38
  transcript_phonemes: str # User's phonemes
@@ -45,843 +30,145 @@ class PronunciationAssessmentResult(BaseModel):
45
  feedback: List[str]
46
  processing_info: Dict
47
 
48
- # =============================================================================
49
- # WAV2VEC2 PHONEME ASR
50
- # =============================================================================
51
-
52
- class Wav2Vec2CharacterASR:
53
- """Wav2Vec2 character-level ASR without language model correction"""
54
-
55
- def __init__(self, model_name: str = "facebook/wav2vec2-base-960h"):
56
- """
57
- Initialize Wav2Vec2 character-level model
58
- Available models:
59
- - facebook/wav2vec2-large-960h-lv60-self (character-level, no LM)
60
- - facebook/wav2vec2-base-960h (character-level, no LM)
61
- - facebook/wav2vec2-large-960h (character-level, no LM)
62
- """
63
- print(f"Loading Wav2Vec2 character model: {model_name}")
64
-
65
- try:
66
- self.processor = Wav2Vec2Processor.from_pretrained(model_name)
67
- self.model = Wav2Vec2ForCTC.from_pretrained(model_name)
68
- self.model.eval()
69
- print("Wav2Vec2 character model loaded successfully")
70
- self.model_name = model_name
71
- except Exception as e:
72
- print(f"Error loading model {model_name}: {e}")
73
- # Fallback to base model
74
- fallback_model = "facebook/wav2vec2-base-960h"
75
- print(f"Trying fallback model: {fallback_model}")
76
- try:
77
- self.processor = Wav2Vec2Processor.from_pretrained(fallback_model)
78
- self.model = Wav2Vec2ForCTC.from_pretrained(fallback_model)
79
- self.model.eval()
80
- self.model_name = fallback_model
81
- print("Fallback model loaded successfully")
82
- except Exception as e2:
83
- raise Exception(f"Failed to load both models. Original error: {e}, Fallback error: {e2}")
84
-
85
- self.sample_rate = 16000
86
-
87
- def transcribe_to_characters(self, audio_path: str) -> Dict:
88
- """
89
- Transcribe audio directly to characters (no language model correction)
90
- Returns raw character sequence as produced by the model
91
- """
92
- try:
93
- # Load audio
94
- speech, sr = librosa.load(audio_path, sr=self.sample_rate)
95
-
96
- # Prepare input
97
- input_values = self.processor(
98
- speech,
99
- sampling_rate=self.sample_rate,
100
- return_tensors="pt"
101
- ).input_values
102
-
103
- # Get model predictions (no language model involved)
104
- with torch.no_grad():
105
- logits = self.model(input_values).logits
106
- predicted_ids = torch.argmax(logits, dim=-1)
107
-
108
- # Decode to characters directly
109
- character_transcript = self.processor.batch_decode(predicted_ids)[0]
110
-
111
- # Clean up character transcript
112
- character_transcript = self._clean_character_transcript(character_transcript)
113
-
114
- # Convert characters to phoneme-like representation
115
- phoneme_like_transcript = self._characters_to_phoneme_representation(character_transcript)
116
-
117
- return {
118
- "character_transcript": character_transcript,
119
- "phoneme_representation": phoneme_like_transcript,
120
- "raw_predicted_ids": predicted_ids[0].tolist(),
121
- "confidence_scores": torch.softmax(logits, dim=-1).max(dim=-1)[0][0].tolist()[:100] # Limit for JSON
122
- }
123
-
124
- except Exception as e:
125
- print(f"Transcription error: {e}")
126
- return {
127
- "character_transcript": "",
128
- "phoneme_representation": "",
129
- "raw_predicted_ids": [],
130
- "confidence_scores": []
131
- }
132
-
133
- def _clean_character_transcript(self, transcript: str) -> str:
134
- """Clean and standardize character transcript"""
135
- # Remove extra spaces and special tokens
136
- cleaned = re.sub(r'\s+', ' ', transcript)
137
- cleaned = cleaned.strip().lower()
138
-
139
- return cleaned
140
-
141
- def _characters_to_phoneme_representation(self, text: str) -> str:
142
- """Convert character-based transcript to phoneme-like representation for comparison"""
143
- # This is a simple character-to-phoneme mapping for pronunciation comparison
144
- # The idea is to convert the raw character output to something comparable with reference phonemes
145
-
146
- if not text:
147
- return ""
148
-
149
- words = text.split()
150
- phoneme_words = []
151
-
152
- # Use our G2P to convert transcript words to phonemes
153
- g2p = SimpleG2P()
154
-
155
- for word in words:
156
- try:
157
- word_data = g2p.text_to_phonemes(word)[0]
158
- phoneme_words.extend(word_data["phonemes"])
159
- except:
160
- # Fallback: simple letter-to-sound mapping
161
- phoneme_words.extend(self._simple_letter_to_phoneme(word))
162
-
163
- return " ".join(phoneme_words)
164
-
165
- def _simple_letter_to_phoneme(self, word: str) -> List[str]:
166
- """Simple fallback letter-to-phoneme conversion"""
167
- letter_to_phoneme = {
168
- 'a': 'æ', 'b': 'b', 'c': 'k', 'd': 'd', 'e': 'ɛ',
169
- 'f': 'f', 'g': 'ɡ', 'h': 'h', 'i': 'ɪ', 'j': 'dʒ',
170
- 'k': 'k', 'l': 'l', 'm': 'm', 'n': 'n', 'o': 'ʌ',
171
- 'p': 'p', 'q': 'k', 'r': 'r', 's': 's', 't': 't',
172
- 'u': 'ʌ', 'v': 'v', 'w': 'w', 'x': 'ks', 'y': 'j', 'z': 'z'
173
- }
174
-
175
- phonemes = []
176
- for letter in word.lower():
177
- if letter in letter_to_phoneme:
178
- phonemes.append(letter_to_phoneme[letter])
179
-
180
- return phonemes
181
-
182
- # =============================================================================
183
- # SIMPLE G2P FOR REFERENCE
184
- # =============================================================================
185
-
186
- class SimpleG2P:
187
- """Simple Grapheme-to-Phoneme converter for reference text"""
188
-
189
- def __init__(self):
190
- try:
191
- self.cmu_dict = cmudict.dict()
192
- except:
193
- self.cmu_dict = {}
194
- print("Warning: CMU dictionary not available")
195
-
196
- def text_to_phonemes(self, text: str) -> List[Dict]:
197
- """Convert text to phoneme sequence"""
198
- words = self._clean_text(text).split()
199
- phoneme_sequence = []
200
-
201
- for word in words:
202
- word_phonemes = self._get_word_phonemes(word)
203
- phoneme_sequence.append({
204
- "word": word,
205
- "phonemes": word_phonemes,
206
- "ipa": self._get_ipa(word),
207
- "phoneme_string": " ".join(word_phonemes)
208
- })
209
-
210
- return phoneme_sequence
211
-
212
- def get_reference_phoneme_string(self, text: str) -> str:
213
- """Get reference phoneme string for comparison"""
214
- phoneme_sequence = self.text_to_phonemes(text)
215
- all_phonemes = []
216
-
217
- for word_data in phoneme_sequence:
218
- all_phonemes.extend(word_data["phonemes"])
219
-
220
- return " ".join(all_phonemes)
221
-
222
- def _clean_text(self, text: str) -> str:
223
- """Clean text for processing"""
224
- text = re.sub(r"[^\w\s\']", " ", text)
225
- text = re.sub(r"\s+", " ", text)
226
- return text.lower().strip()
227
-
228
- def _get_word_phonemes(self, word: str) -> List[str]:
229
- """Get phonemes for a word"""
230
- word_lower = word.lower()
231
-
232
- if word_lower in self.cmu_dict:
233
- # Remove stress markers and convert to Wav2Vec2 phoneme format
234
- phonemes = self.cmu_dict[word_lower][0]
235
- clean_phonemes = [re.sub(r"[0-9]", "", p) for p in phonemes]
236
- return self._convert_to_wav2vec_format(clean_phonemes)
237
- else:
238
- return self._estimate_phonemes(word)
239
-
240
- def _convert_to_wav2vec_format(self, cmu_phonemes: List[str]) -> List[str]:
241
- """Convert CMU phonemes to Wav2Vec2 format"""
242
- # Mapping from CMU to Wav2Vec2/eSpeak phonemes
243
- cmu_to_espeak = {
244
- "AA": "ɑ", "AE": "æ", "AH": "ʌ", "AO": "ɔ", "AW": "aʊ",
245
- "AY": "aɪ", "EH": "ɛ", "ER": "ɝ", "EY": "eɪ", "IH": "ɪ",
246
- "IY": "i", "OW": "oʊ", "OY": "ɔɪ", "UH": "ʊ", "UW": "u",
247
- "B": "b", "CH": "tʃ", "D": "d", "DH": "ð", "F": "f",
248
- "G": "ɡ", "HH": "h", "JH": "dʒ", "K": "k", "L": "l",
249
- "M": "m", "N": "n", "NG": "ŋ", "P": "p", "R": "r",
250
- "S": "s", "SH": "ʃ", "T": "t", "TH": "θ", "V": "v",
251
- "W": "w", "Y": "j", "Z": "z", "ZH": "ʒ"
252
- }
253
-
254
- converted = []
255
- for phoneme in cmu_phonemes:
256
- converted_phoneme = cmu_to_espeak.get(phoneme, phoneme.lower())
257
- converted.append(converted_phoneme)
258
-
259
- return converted
260
-
261
- def _get_ipa(self, word: str) -> str:
262
- """Get IPA transcription"""
263
- try:
264
- return ipa.convert(word)
265
- except:
266
- return f"/{word}/"
267
-
268
- def _estimate_phonemes(self, word: str) -> List[str]:
269
- """Estimate phonemes for unknown words"""
270
- # Basic phoneme estimation with eSpeak-style output
271
- phoneme_map = {
272
- "ch": ["tʃ"], "sh": ["ʃ"], "th": ["θ"], "ph": ["f"],
273
- "ck": ["k"], "ng": ["ŋ"], "qu": ["k", "w"],
274
- "a": ["æ"], "e": ["ɛ"], "i": ["ɪ"], "o": ["ʌ"], "u": ["ʌ"],
275
- "b": ["b"], "c": ["k"], "d": ["d"], "f": ["f"], "g": ["ɡ"],
276
- "h": ["h"], "j": ["dʒ"], "k": ["k"], "l": ["l"], "m": ["m"],
277
- "n": ["n"], "p": ["p"], "r": ["r"], "s": ["s"], "t": ["t"],
278
- "v": ["v"], "w": ["w"], "x": ["k", "s"], "y": ["j"], "z": ["z"]
279
- }
280
-
281
- word = word.lower()
282
- phonemes = []
283
- i = 0
284
-
285
- while i < len(word):
286
- # Check 2-letter combinations first
287
- if i <= len(word) - 2:
288
- two_char = word[i:i+2]
289
- if two_char in phoneme_map:
290
- phonemes.extend(phoneme_map[two_char])
291
- i += 2
292
- continue
293
-
294
- # Single character
295
- char = word[i]
296
- if char in phoneme_map:
297
- phonemes.extend(phoneme_map[char])
298
-
299
- i += 1
300
-
301
- return phonemes
302
-
303
- # =============================================================================
304
- # PHONEME COMPARATOR
305
- # =============================================================================
306
-
307
- class PhonemeComparator:
308
- """Compare reference and learner phoneme sequences"""
309
-
310
- def __init__(self):
311
- # Vietnamese speakers' common phoneme substitutions
312
- self.substitution_patterns = {
313
- "θ": ["f", "s", "t"], # TH → F, S, T
314
- "ð": ["d", "z", "v"], # DH → D, Z, V
315
- "v": ["w", "f"], # V → W, F
316
- "r": ["l"], # R → L
317
- "l": ["r"], # L → R
318
- "z": ["s"], # Z → S
319
- "ʒ": ["ʃ", "z"], # ZH → SH, Z
320
- "ŋ": ["n"], # NG → N
321
- }
322
-
323
- # Difficulty levels for Vietnamese speakers
324
- self.difficulty_map = {
325
- "θ": 0.9, # th (think)
326
- "ð": 0.9, # th (this)
327
- "v": 0.8, # v
328
- "z": 0.8, # z
329
- "ʒ": 0.9, # zh (measure)
330
- "r": 0.7, # r
331
- "l": 0.6, # l
332
- "w": 0.5, # w
333
- "f": 0.4, # f
334
- "s": 0.3, # s
335
- "ʃ": 0.5, # sh
336
- "tʃ": 0.4, # ch
337
- "dʒ": 0.5, # j
338
- "ŋ": 0.3, # ng
339
- }
340
-
341
- def compare_phoneme_sequences(self, reference_phonemes: str,
342
- learner_phonemes: str) -> List[Dict]:
343
- """Compare reference and learner phoneme sequences"""
344
-
345
- # Split phoneme strings
346
- ref_phones = reference_phonemes.split()
347
- learner_phones = learner_phonemes.split()
348
-
349
- print(f"Reference phonemes: {ref_phones}")
350
- print(f"Learner phonemes: {learner_phones}")
351
-
352
- # Simple alignment comparison
353
- comparisons = []
354
- max_len = max(len(ref_phones), len(learner_phones))
355
-
356
- for i in range(max_len):
357
- ref_phoneme = ref_phones[i] if i < len(ref_phones) else ""
358
- learner_phoneme = learner_phones[i] if i < len(learner_phones) else ""
359
-
360
- if ref_phoneme and learner_phoneme:
361
- # Both present - check accuracy
362
- if ref_phoneme == learner_phoneme:
363
- status = "correct"
364
- score = 1.0
365
- elif self._is_acceptable_substitution(ref_phoneme, learner_phoneme):
366
- status = "acceptable"
367
- score = 0.7
368
- else:
369
- status = "wrong"
370
- score = 0.2
371
-
372
- elif ref_phoneme and not learner_phoneme:
373
- # Missing phoneme
374
- status = "missing"
375
- score = 0.0
376
-
377
- elif learner_phoneme and not ref_phoneme:
378
- # Extra phoneme
379
- status = "extra"
380
- score = 0.0
381
- else:
382
- continue
383
-
384
- comparison = {
385
- "position": i,
386
- "reference_phoneme": ref_phoneme,
387
- "learner_phoneme": learner_phoneme,
388
- "status": status,
389
- "score": score,
390
- "difficulty": self.difficulty_map.get(ref_phoneme, 0.3)
391
- }
392
-
393
- comparisons.append(comparison)
394
-
395
- return comparisons
396
-
397
- def _is_acceptable_substitution(self, reference: str, learner: str) -> bool:
398
- """Check if learner phoneme is acceptable substitution for Vietnamese speakers"""
399
- acceptable = self.substitution_patterns.get(reference, [])
400
- return learner in acceptable
401
-
402
- # =============================================================================
403
- # WORD ANALYZER
404
- # =============================================================================
405
-
406
- class WordAnalyzer:
407
- """Analyze word-level pronunciation accuracy using character-based ASR"""
408
-
409
- def __init__(self):
410
- self.g2p = SimpleG2P()
411
- self.comparator = PhonemeComparator()
412
-
413
- def analyze_words(self, reference_text: str, learner_phonemes: str) -> Dict:
414
- """Analyze word-level pronunciation using phoneme representation from character ASR"""
415
-
416
- # Get reference phonemes by word
417
- reference_words = self.g2p.text_to_phonemes(reference_text)
418
-
419
- # Get overall phoneme comparison
420
- reference_phoneme_string = self.g2p.get_reference_phoneme_string(reference_text)
421
- phoneme_comparisons = self.comparator.compare_phoneme_sequences(
422
- reference_phoneme_string, learner_phonemes
423
- )
424
-
425
- # Map phonemes back to words
426
- word_highlights = self._create_word_highlights(reference_words, phoneme_comparisons)
427
-
428
- # Identify wrong words
429
- wrong_words = self._identify_wrong_words(word_highlights, phoneme_comparisons)
430
-
431
- return {
432
- "word_highlights": word_highlights,
433
- "phoneme_differences": phoneme_comparisons,
434
- "wrong_words": wrong_words
435
- }
436
-
437
- def _create_word_highlights(self, reference_words: List[Dict],
438
- phoneme_comparisons: List[Dict]) -> List[Dict]:
439
- """Create word highlighting data"""
440
-
441
- word_highlights = []
442
- phoneme_index = 0
443
-
444
- for word_data in reference_words:
445
- word = word_data["word"]
446
- word_phonemes = word_data["phonemes"]
447
- num_phonemes = len(word_phonemes)
448
-
449
- # Get phoneme scores for this word
450
- word_phoneme_scores = []
451
- for j in range(num_phonemes):
452
- if phoneme_index + j < len(phoneme_comparisons):
453
- comparison = phoneme_comparisons[phoneme_index + j]
454
- word_phoneme_scores.append(comparison["score"])
455
-
456
- # Calculate word score
457
- word_score = np.mean(word_phoneme_scores) if word_phoneme_scores else 0.0
458
-
459
- # Create word highlight
460
- highlight = {
461
- "word": word,
462
- "score": float(word_score),
463
- "status": self._get_word_status(word_score),
464
- "color": self._get_word_color(word_score),
465
- "phonemes": word_phonemes,
466
- "ipa": word_data["ipa"],
467
- "phoneme_scores": word_phoneme_scores,
468
- "phoneme_start_index": phoneme_index,
469
- "phoneme_end_index": phoneme_index + num_phonemes - 1
470
- }
471
-
472
- word_highlights.append(highlight)
473
- phoneme_index += num_phonemes
474
-
475
- return word_highlights
476
-
477
- def _identify_wrong_words(self, word_highlights: List[Dict],
478
- phoneme_comparisons: List[Dict]) -> List[Dict]:
479
- """Identify words that were pronounced incorrectly"""
480
-
481
- wrong_words = []
482
-
483
- for word_highlight in word_highlights:
484
- if word_highlight["score"] < 0.6: # Threshold for wrong pronunciation
485
-
486
- # Find specific phoneme errors for this word
487
- start_idx = word_highlight["phoneme_start_index"]
488
- end_idx = word_highlight["phoneme_end_index"]
489
-
490
- wrong_phonemes = []
491
- missing_phonemes = []
492
-
493
- for i in range(start_idx, min(end_idx + 1, len(phoneme_comparisons))):
494
- comparison = phoneme_comparisons[i]
495
-
496
- if comparison["status"] == "wrong":
497
- wrong_phonemes.append({
498
- "expected": comparison["reference_phoneme"],
499
- "actual": comparison["learner_phoneme"],
500
- "difficulty": comparison["difficulty"]
501
- })
502
- elif comparison["status"] == "missing":
503
- missing_phonemes.append({
504
- "phoneme": comparison["reference_phoneme"],
505
- "difficulty": comparison["difficulty"]
506
- })
507
-
508
- wrong_word = {
509
- "word": word_highlight["word"],
510
- "score": word_highlight["score"],
511
- "expected_phonemes": word_highlight["phonemes"],
512
- "ipa": word_highlight["ipa"],
513
- "wrong_phonemes": wrong_phonemes,
514
- "missing_phonemes": missing_phonemes,
515
- "tips": self._get_vietnamese_tips(wrong_phonemes, missing_phonemes)
516
- }
517
-
518
- wrong_words.append(wrong_word)
519
-
520
- return wrong_words
521
-
522
- def _get_word_status(self, score: float) -> str:
523
- """Get word status from score"""
524
- if score >= 0.8:
525
- return "excellent"
526
- elif score >= 0.6:
527
- return "good"
528
- elif score >= 0.4:
529
- return "needs_practice"
530
- else:
531
- return "poor"
532
-
533
- def _get_word_color(self, score: float) -> str:
534
- """Get color for word highlighting"""
535
- if score >= 0.8:
536
- return "#22c55e" # Green
537
- elif score >= 0.6:
538
- return "#84cc16" # Light green
539
- elif score >= 0.4:
540
- return "#eab308" # Yellow
541
- else:
542
- return "#ef4444" # Red
543
-
544
- def _get_vietnamese_tips(self, wrong_phonemes: List[Dict],
545
- missing_phonemes: List[Dict]) -> List[str]:
546
- """Get Vietnamese-specific pronunciation tips"""
547
-
548
- tips = []
549
-
550
- # Tips for specific Vietnamese pronunciation challenges
551
- vietnamese_tips = {
552
- "θ": "Đặt lưỡi giữa răng trên và dưới, thổi nhẹ (think, three)",
553
- "ð": "Giống θ nhưng rung dây thanh âm (this, that)",
554
- "v": "Chạm môi dưới vào răng trên, không dùng cả hai môi như tiếng Việt",
555
- "r": "Cuộn lưỡi nhưng không chạm vào vòm miệng, không lăn lưỡi",
556
- "l": "Đầu lưỡi chạm vào vòm miệng sau răng",
557
- "z": "Giống âm 's' nhưng có rung dây thanh âm",
558
- "ʒ": "Giống âm 'ʃ' (sh) nhưng có rung dây thanh âm",
559
- "w": "Tròn môi như âm 'u', không dùng răng như âm 'v'"
560
- }
561
-
562
- # Add tips for wrong phonemes
563
- for wrong in wrong_phonemes:
564
- expected = wrong["expected"]
565
- actual = wrong["actual"]
566
-
567
- if expected in vietnamese_tips:
568
- tips.append(f"Âm '{expected}': {vietnamese_tips[expected]}")
569
- else:
570
- tips.append(f"Luyện âm '{expected}' thay vì '{actual}'")
571
-
572
- # Add tips for missing phonemes
573
- for missing in missing_phonemes:
574
- phoneme = missing["phoneme"]
575
- if phoneme in vietnamese_tips:
576
- tips.append(f"Thiếu âm '{phoneme}': {vietnamese_tips[phoneme]}")
577
-
578
- return tips
579
-
580
- # =============================================================================
581
- # FEEDBACK GENERATOR
582
- # =============================================================================
583
-
584
- class SimpleFeedbackGenerator:
585
- """Generate simple, actionable feedback in Vietnamese"""
586
-
587
- def generate_feedback(self, overall_score: float, wrong_words: List[Dict],
588
- phoneme_comparisons: List[Dict]) -> List[str]:
589
- """Generate Vietnamese feedback"""
590
-
591
- feedback = []
592
-
593
- # Overall feedback in Vietnamese
594
- if overall_score >= 0.8:
595
- feedback.append("Phát âm rất tốt! Bạn đã làm xuất sắc.")
596
- elif overall_score >= 0.6:
597
- feedback.append("Phát âm khá tốt, còn một vài điểm cần cải thiện.")
598
- elif overall_score >= 0.4:
599
- feedback.append("Cần luyện tập thêm. Tập trung vào những từ được đánh dấu đỏ.")
600
- else:
601
- feedback.append("Hãy luyện tập chậm và rõ ràng hơn.")
602
-
603
- # Wrong words feedback
604
- if wrong_words:
605
- if len(wrong_words) <= 3:
606
- word_names = [w["word"] for w in wrong_words]
607
- feedback.append(f"Các từ cần luyện tập: {', '.join(word_names)}")
608
- else:
609
- feedback.append(f"Có {len(wrong_words)} từ cần luyện tập. Tập trung vào từng từ một.")
610
-
611
- # Most problematic phonemes
612
- problem_phonemes = defaultdict(int)
613
- for comparison in phoneme_comparisons:
614
- if comparison["status"] in ["wrong", "missing"]:
615
- phoneme = comparison["reference_phoneme"]
616
- problem_phonemes[phoneme] += 1
617
-
618
- if problem_phonemes:
619
- most_difficult = sorted(problem_phonemes.items(), key=lambda x: x[1], reverse=True)
620
- top_problem = most_difficult[0][0]
621
-
622
- phoneme_tips = {
623
- "θ": "Lưỡi giữa răng, thổi nhẹ",
624
- "ð": "Lưỡi giữa răng, rung dây thanh",
625
- "v": "Môi dưới chạm răng trên",
626
- "r": "Cuộn lưỡi, không chạm vòm miệng",
627
- "l": "Lưỡi chạm vòm miệng",
628
- "z": "Như 's' nhưng rung dây thanh"
629
- }
630
-
631
- if top_problem in phoneme_tips:
632
- feedback.append(f"Âm khó nhất '{top_problem}': {phoneme_tips[top_problem]}")
633
-
634
- return feedback
635
-
636
- # =============================================================================
637
- # MAIN PRONUNCIATION ASSESSOR
638
- # =============================================================================
639
-
640
- class SimplePronunciationAssessor:
641
- """Main pronunciation assessor using Wav2Vec2 character-level model"""
642
-
643
- def __init__(self):
644
- print("Initializing Simple Pronunciation Assessor...")
645
- self.asr = Wav2Vec2CharacterASR() # Updated to use character-based ASR
646
- self.word_analyzer = WordAnalyzer()
647
- self.feedback_generator = SimpleFeedbackGenerator()
648
- print("Initialization completed")
649
-
650
- def assess_pronunciation(self, audio_path: str, reference_text: str) -> Dict:
651
- """
652
- Main assessment function
653
- Input: Audio path + Reference text
654
- Output: Word highlights + Phoneme differences + Wrong words
655
- """
656
-
657
- print("Starting pronunciation assessment...")
658
-
659
- # Step 1: Wav2Vec2 character transcription (no language model)
660
- print("Step 1: Transcribing to characters...")
661
- asr_result = self.asr.transcribe_to_characters(audio_path)
662
- character_transcript = asr_result["character_transcript"]
663
- phoneme_representation = asr_result["phoneme_representation"]
664
-
665
- print(f"Character transcript: {character_transcript}")
666
- print(f"Phoneme representation: {phoneme_representation}")
667
-
668
- # Step 2: Word analysis using phoneme representation
669
- print("Step 2: Analyzing words...")
670
- analysis_result = self.word_analyzer.analyze_words(reference_text, phoneme_representation)
671
-
672
- # Step 3: Calculate overall score
673
- phoneme_comparisons = analysis_result["phoneme_differences"]
674
- overall_score = self._calculate_overall_score(phoneme_comparisons)
675
-
676
- # Step 4: Generate feedback
677
- print("Step 3: Generating feedback...")
678
- feedback = self.feedback_generator.generate_feedback(
679
- overall_score, analysis_result["wrong_words"], phoneme_comparisons
680
- )
681
-
682
- result = {
683
- "transcript": character_transcript, # What user actually said
684
- "transcript_phonemes": phoneme_representation,
685
- "user_phonemes": phoneme_representation, # Alias for UI clarity
686
- "character_transcript": character_transcript,
687
- "overall_score": overall_score,
688
- "word_highlights": analysis_result["word_highlights"],
689
- "phoneme_differences": phoneme_comparisons,
690
- "wrong_words": analysis_result["wrong_words"],
691
- "feedback": feedback,
692
- "processing_info": {
693
- "model_used": f"Wav2Vec2-Character ({self.asr.model_name})",
694
- "character_based": True,
695
- "language_model_correction": False,
696
- "raw_output": True
697
- }
698
- }
699
-
700
- print("Assessment completed successfully")
701
- return result
702
-
703
- def _calculate_overall_score(self, phoneme_comparisons: List[Dict]) -> float:
704
- """Calculate overall pronunciation score"""
705
- if not phoneme_comparisons:
706
- return 0.0
707
-
708
- total_score = sum(comparison["score"] for comparison in phoneme_comparisons)
709
- return total_score / len(phoneme_comparisons)
710
-
711
- # =============================================================================
712
- # API ENDPOINT
713
- # =============================================================================
714
 
715
- # Initialize assessor
716
  assessor = SimplePronunciationAssessor()
717
 
718
- def convert_numpy_types(obj):
719
- """Convert numpy types to Python native types"""
720
- if isinstance(obj, np.integer):
721
- return int(obj)
722
- elif isinstance(obj, np.floating):
723
- return float(obj)
724
- elif isinstance(obj, np.ndarray):
725
- return obj.tolist()
726
- elif isinstance(obj, dict):
727
- return {key: convert_numpy_types(value) for key, value in obj.items()}
728
- elif isinstance(obj, list):
729
- return [convert_numpy_types(item) for item in obj]
730
- else:
731
- return obj
732
 
733
  @router.post("/assess", response_model=PronunciationAssessmentResult)
734
  async def assess_pronunciation(
735
  audio: UploadFile = File(..., description="Audio file (.wav, .mp3, .m4a)"),
736
- reference_text: str = Form(..., description="Reference text to pronounce")
 
 
 
 
737
  ):
738
  """
739
- Pronunciation Assessment API using Wav2Vec2 Character-level Model
740
-
741
  Key Features:
742
- - Uses facebook/wav2vec2-large-960h-lv60-self for character transcription
743
- - NO language model correction (shows actual pronunciation errors)
 
744
  - Character-level accuracy converted to phoneme representation
745
  - Vietnamese-optimized feedback and tips
746
-
747
- Input: Audio file + Reference text
748
  Output: Word highlights + Phoneme differences + Wrong words
749
  """
750
-
751
  import time
 
752
  start_time = time.time()
753
-
 
 
 
 
 
 
754
  # Validate inputs
755
  if not reference_text.strip():
756
  raise HTTPException(status_code=400, detail="Reference text cannot be empty")
757
-
758
  if len(reference_text) > 500:
759
- raise HTTPException(status_code=400, detail="Reference text too long (max 500 characters)")
760
-
 
 
761
  # Check for valid English characters
762
  if not re.match(r"^[a-zA-Z\s\'\-\.!?,;:]+$", reference_text):
763
  raise HTTPException(
764
  status_code=400,
765
- detail="Text must contain only English letters, spaces, and basic punctuation"
766
  )
767
-
768
  try:
769
  # Save uploaded file temporarily
770
  file_extension = ".wav"
771
  if audio.filename and "." in audio.filename:
772
  file_extension = f".{audio.filename.split('.')[-1]}"
773
-
774
- with tempfile.NamedTemporaryFile(delete=False, suffix=file_extension) as tmp_file:
 
 
775
  content = await audio.read()
776
  tmp_file.write(content)
777
  tmp_file.flush()
778
-
779
- print(f"Processing audio file: {tmp_file.name}")
780
-
781
- # Run assessment using Wav2Vec2 Character model
782
- result = assessor.assess_pronunciation(tmp_file.name, reference_text)
783
-
784
-
785
  # Add processing time
786
  processing_time = time.time() - start_time
787
  result["processing_info"]["processing_time"] = processing_time
788
-
789
  # Convert numpy types for JSON serialization
790
  final_result = convert_numpy_types(result)
791
-
792
- print(f"Assessment completed in {processing_time:.2f} seconds")
793
-
 
 
794
  return PronunciationAssessmentResult(**final_result)
795
-
796
  except Exception as e:
797
- print(f"Assessment error: {str(e)}")
798
  import traceback
 
799
  traceback.print_exc()
800
  raise HTTPException(status_code=500, detail=f"Assessment failed: {str(e)}")
801
 
 
802
  # =============================================================================
803
  # UTILITY ENDPOINTS
804
  # =============================================================================
805
 
 
806
  @router.get("/phonemes/{word}")
807
  async def get_word_phonemes(word: str):
808
  """Get phoneme breakdown for a specific word"""
809
  try:
810
  g2p = SimpleG2P()
811
  phoneme_data = g2p.text_to_phonemes(word)[0]
812
-
813
  # Add difficulty analysis for Vietnamese speakers
814
  difficulty_scores = []
815
  comparator = PhonemeComparator()
816
-
817
  for phoneme in phoneme_data["phonemes"]:
818
  difficulty = comparator.difficulty_map.get(phoneme, 0.3)
819
  difficulty_scores.append(difficulty)
820
-
821
  avg_difficulty = float(np.mean(difficulty_scores)) if difficulty_scores else 0.3
822
-
823
  return {
824
  "word": word,
825
  "phonemes": phoneme_data["phonemes"],
826
  "phoneme_string": phoneme_data["phoneme_string"],
827
  "ipa": phoneme_data["ipa"],
828
  "difficulty_score": avg_difficulty,
829
- "difficulty_level": "hard" if avg_difficulty > 0.6 else "medium" if avg_difficulty > 0.4 else "easy",
 
 
 
 
830
  "challenging_phonemes": [
831
  {
832
  "phoneme": p,
833
  "difficulty": comparator.difficulty_map.get(p, 0.3),
834
- "vietnamese_tip": get_vietnamese_tip(p)
835
  }
836
  for p in phoneme_data["phonemes"]
837
  if comparator.difficulty_map.get(p, 0.3) > 0.6
838
- ]
839
- }
840
-
841
- except Exception as e:
842
- raise HTTPException(status_code=500, detail=f"Word analysis error: {str(e)}")
843
-
844
- @router.get("/health")
845
- async def health_check():
846
- """Health check endpoint"""
847
- try:
848
- model_info = {
849
- "status": "healthy",
850
- "model": assessor.asr.model_name,
851
- "character_based": True,
852
- "language_model_correction": False,
853
- "vietnamese_optimized": True
854
- }
855
- return model_info
856
- except Exception as e:
857
- return {
858
- "status": "error",
859
- "error": str(e)
860
  }
861
 
862
- @router.get("/test-model")
863
- async def test_model():
864
- """Test if Wav2Vec2 model is working"""
865
- try:
866
- # Test model info
867
- test_result = {
868
- "model_loaded": True,
869
- "model_name": assessor.asr.model_name,
870
- "processor_ready": True,
871
- "sample_rate": assessor.asr.sample_rate,
872
- "sample_characters": "this is a test",
873
- "sample_phonemes": "ðɪs ɪz ə tɛst"
874
- }
875
- return test_result
876
  except Exception as e:
877
- return {
878
- "model_loaded": False,
879
- "error": str(e)
880
- }
881
 
882
- # =============================================================================
883
- # HELPER FUNCTIONS
884
- # =============================================================================
885
 
886
  def get_vietnamese_tip(phoneme: str) -> str:
887
  """Get Vietnamese pronunciation tip for a phoneme"""
@@ -889,10 +176,10 @@ def get_vietnamese_tip(phoneme: str) -> str:
889
  "θ": "Đặt lưỡi giữa răng, thổi nhẹ",
890
  "ð": "Giống θ nhưng rung dây thanh âm",
891
  "v": "Môi dưới chạm răng trên",
892
- "r": "Cuộn lưỡi, không chạm vòm miệng",
893
  "l": "Lưỡi chạm vòm miệng sau răng",
894
  "z": "Như 's' nhưng rung dây thanh",
895
  "ʒ": "Như 'ʃ' nhưng rung dây thanh",
896
- "w": "Tròn môi như 'u'"
897
  }
898
  return tips.get(phoneme, f"Luyện âm {phoneme}")
 
1
+ from fastapi import UploadFile, File, Form, HTTPException, APIRouter
 
 
 
 
 
2
  from pydantic import BaseModel
3
+ from typing import List, Dict
4
  import tempfile
 
5
  import numpy as np
 
 
 
 
6
  import re
 
7
  import warnings
8
+ from loguru import logger
9
+ from src.apis.controllers.speaking_controller import (
10
+ SimpleG2P,
11
+ PhonemeComparator,
12
+ SimplePronunciationAssessor,
13
+ convert_numpy_types,
14
+ )
15
 
16
  warnings.filterwarnings("ignore")
17
 
18
  router = APIRouter(prefix="/pronunciation", tags=["Pronunciation"])
19
 
20
+
21
  class PronunciationAssessmentResult(BaseModel):
22
  transcript: str # What the user actually said (character transcript)
23
  transcript_phonemes: str # User's phonemes
 
30
  feedback: List[str]
31
  processing_info: Dict
32
 
33
 
 
34
  assessor = SimplePronunciationAssessor()
35
 
36
 
37
  @router.post("/assess", response_model=PronunciationAssessmentResult)
38
  async def assess_pronunciation(
39
  audio: UploadFile = File(..., description="Audio file (.wav, .mp3, .m4a)"),
40
+ reference_text: str = Form(..., description="Reference text to pronounce"),
41
+ mode: str = Form(
42
+ "normal",
43
+ description="Assessment mode: 'normal' (Whisper) or 'advanced' (Wav2Vec2)",
44
+ ),
45
  ):
46
  """
47
+ Pronunciation Assessment API with mode selection
48
+
49
  Key Features:
50
+ - Normal mode: Uses Whisper for more accurate transcription with a language model
51
+ - Advanced mode: Uses facebook/wav2vec2-large-960h-lv60-self for character transcription
52
+ - NO language model correction in advanced mode (shows actual pronunciation errors)
53
  - Character-level accuracy converted to phoneme representation
54
  - Vietnamese-optimized feedback and tips
55
+
56
+ Input: Audio file + Reference text + Mode
57
  Output: Word highlights + Phoneme differences + Wrong words
58
  """
59
+
60
  import time
61
+
62
  start_time = time.time()
63
+
64
+ # Validate mode
65
+ if mode not in ["normal", "advanced"]:
66
+ raise HTTPException(
67
+ status_code=400, detail="Mode must be 'normal' or 'advanced'"
68
+ )
69
+
70
  # Validate inputs
71
  if not reference_text.strip():
72
  raise HTTPException(status_code=400, detail="Reference text cannot be empty")
73
+
74
  if len(reference_text) > 500:
75
+ raise HTTPException(
76
+ status_code=400, detail="Reference text too long (max 500 characters)"
77
+ )
78
+
79
  # Check for valid English characters
80
  if not re.match(r"^[a-zA-Z\s\'\-\.!?,;:]+$", reference_text):
81
  raise HTTPException(
82
  status_code=400,
83
+ detail="Text must contain only English letters, spaces, and basic punctuation",
84
  )
85
+
86
  try:
87
  # Save uploaded file temporarily
88
  file_extension = ".wav"
89
  if audio.filename and "." in audio.filename:
90
  file_extension = f".{audio.filename.split('.')[-1]}"
91
+
92
+ with tempfile.NamedTemporaryFile(
93
+ delete=False, suffix=file_extension
94
+ ) as tmp_file:
95
  content = await audio.read()
96
  tmp_file.write(content)
97
  tmp_file.flush()
98
+
99
+ logger.info(f"Processing audio file: {tmp_file.name} with mode: {mode}")
100
+
101
+ # Run assessment using selected mode
102
+ result = assessor.assess_pronunciation(tmp_file.name, reference_text, mode)
103
+
 
104
  # Add processing time
105
  processing_time = time.time() - start_time
106
  result["processing_info"]["processing_time"] = processing_time
107
+
108
  # Convert numpy types for JSON serialization
109
  final_result = convert_numpy_types(result)
110
+
111
+ logger.info(
112
+ f"Assessment completed in {processing_time:.2f} seconds using {mode} mode"
113
+ )
114
+
115
  return PronunciationAssessmentResult(**final_result)
116
+
117
  except Exception as e:
118
+ logger.error(f"Assessment error: {str(e)}")
119
  import traceback
120
+
121
  traceback.print_exc()
122
  raise HTTPException(status_code=500, detail=f"Assessment failed: {str(e)}")
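For reference, a minimal client sketch for the updated endpoint. This is not part of the commit: the host, port and file name are placeholders, and the path assumes the router is mounted at the application root.

```python
# Hypothetical client for POST /pronunciation/assess (host/port/file are placeholders).
import requests

with open("sample.wav", "rb") as f:
    resp = requests.post(
        "http://localhost:8000/pronunciation/assess",
        files={"audio": ("sample.wav", f, "audio/wav")},
        data={"reference_text": "this is a test", "mode": "advanced"},  # or "normal"
    )
resp.raise_for_status()
payload = resp.json()
print(payload["transcript"])
print(payload["feedback"])
```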
123
 
124
+
125
  # =============================================================================
126
  # UTILITY ENDPOINTS
127
  # =============================================================================
128
 
129
+
130
  @router.get("/phonemes/{word}")
131
  async def get_word_phonemes(word: str):
132
  """Get phoneme breakdown for a specific word"""
133
  try:
134
  g2p = SimpleG2P()
135
  phoneme_data = g2p.text_to_phonemes(word)[0]
136
+
137
  # Add difficulty analysis for Vietnamese speakers
138
  difficulty_scores = []
139
  comparator = PhonemeComparator()
140
+
141
  for phoneme in phoneme_data["phonemes"]:
142
  difficulty = comparator.difficulty_map.get(phoneme, 0.3)
143
  difficulty_scores.append(difficulty)
144
+
145
  avg_difficulty = float(np.mean(difficulty_scores)) if difficulty_scores else 0.3
146
+
147
  return {
148
  "word": word,
149
  "phonemes": phoneme_data["phonemes"],
150
  "phoneme_string": phoneme_data["phoneme_string"],
151
  "ipa": phoneme_data["ipa"],
152
  "difficulty_score": avg_difficulty,
153
+ "difficulty_level": (
154
+ "hard"
155
+ if avg_difficulty > 0.6
156
+ else "medium" if avg_difficulty > 0.4 else "easy"
157
+ ),
158
  "challenging_phonemes": [
159
  {
160
  "phoneme": p,
161
  "difficulty": comparator.difficulty_map.get(p, 0.3),
162
+ "vietnamese_tip": get_vietnamese_tip(p),
163
  }
164
  for p in phoneme_data["phonemes"]
165
  if comparator.difficulty_map.get(p, 0.3) > 0.6
166
+ ],
 
167
  }
168
 
169
  except Exception as e:
170
+ raise HTTPException(status_code=500, detail=f"Word analysis error: {str(e)}")
 
 
 
171
 
 
 
 
172
 
173
  def get_vietnamese_tip(phoneme: str) -> str:
174
  """Get Vietnamese pronunciation tip for a phoneme"""
 
176
  "θ": "Đặt lưỡi giữa răng, thổi nhẹ",
177
  "ð": "Giống θ nhưng rung dây thanh âm",
178
  "v": "Môi dưới chạm răng trên",
179
+ "r": "Cuộn lưỡi, không chạm vòm miệng",
180
  "l": "Lưỡi chạm vòm miệng sau răng",
181
  "z": "Như 's' nhưng rung dây thanh",
182
  "ʒ": "Như 'ʃ' nhưng rung dây thanh",
183
+ "w": "Tròn môi như 'u'",
184
  }
185
  return tips.get(phoneme, f"Luyện âm {phoneme}")
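The word-level helper endpoint can be exercised the same way; a small sketch, again with an assumed host and mount point:

```python
# Hypothetical request to GET /pronunciation/phonemes/{word}; host/port are placeholders.
import requests

info = requests.get("http://localhost:8000/pronunciation/phonemes/three").json()
print(info["ipa"], info["difficulty_level"])
print(info["challenging_phonemes"])  # phonemes with difficulty > 0.6 plus Vietnamese tips
```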
src/model_convert/wav2vec2onnx.py ADDED
@@ -0,0 +1,373 @@
1
+ import torch
2
+ import onnx
3
+ import onnxruntime
4
+ import numpy as np
5
+ from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
6
+ from typing import Dict, Tuple
7
+ import librosa
8
+ import os
9
+
10
+ class Wav2Vec2ONNXConverter:
11
+ """Convert Wav2Vec2 model to ONNX format"""
12
+
13
+ def __init__(self, model_name: str = "facebook/wav2vec2-base-960h"):
14
+ """Initialize the converter with the specified model"""
15
+ print(f"Loading Wav2Vec2 model: {model_name}")
16
+ self.model_name = model_name
17
+ self.processor = Wav2Vec2Processor.from_pretrained(model_name)
18
+ self.model = Wav2Vec2ForCTC.from_pretrained(model_name)
19
+
20
+ # Disable flash attention and scaled_dot_product_attention for ONNX compatibility
21
+ if hasattr(self.model.config, 'use_flash_attention_2'):
22
+ self.model.config.use_flash_attention_2 = False
23
+
24
+ # Force model to use standard attention
25
+ if hasattr(self.model, 'wav2vec2') and hasattr(self.model.wav2vec2, 'encoder'):
26
+ for layer in self.model.wav2vec2.encoder.layers:
27
+ if hasattr(layer.attention, 'attention_dropout'):
28
+ # Ensure standard attention is used
29
+ layer.attention.attention_dropout = torch.nn.Dropout(layer.attention.attention_dropout.p)
30
+
31
+ self.model.eval()
32
+ self.sample_rate = 16000
33
+ print("Model loaded successfully")
34
+
35
+ def convert_to_onnx(self,
36
+ onnx_path: str = "wav2vec2_model.onnx",
37
+ input_length: int = 160000, # 10 seconds at 16kHz
38
+ opset_version: int = 14) -> str:
39
+ """
40
+ Convert the Wav2Vec2 model to ONNX format
41
+
42
+ Args:
43
+ onnx_path: Path to save the ONNX model
44
+ input_length: Length of input audio (samples)
45
+ opset_version: ONNX opset version
46
+
47
+ Returns:
48
+ Path to the saved ONNX model
49
+ """
50
+ print(f"Converting model to ONNX format...")
51
+
52
+ # Create dummy input
53
+ dummy_input = torch.randn(1, input_length, dtype=torch.float32)
54
+
55
+ # Input names and dynamic axes
56
+ input_names = ["input_values"]
57
+ output_names = ["logits"]
58
+
59
+ # Dynamic axes for variable length input
60
+ dynamic_axes = {
61
+ "input_values": {0: "batch_size", 1: "sequence_length"},
62
+ "logits": {0: "batch_size", 1: "sequence_length"}
63
+ }
64
+
65
+ try:
66
+ # Disable torch optimizations that may cause ONNX issues
67
+ with torch.no_grad():
68
+ # Set model to evaluation mode and disable dropout
69
+ self.model.eval()
70
+ for module in self.model.modules():
71
+ if isinstance(module, torch.nn.Dropout):
72
+ module.p = 0.0
73
+
74
+ # Export to ONNX
75
+ torch.onnx.export(
76
+ self.model,
77
+ dummy_input,
78
+ onnx_path,
79
+ input_names=input_names,
80
+ output_names=output_names,
81
+ dynamic_axes=dynamic_axes,
82
+ opset_version=opset_version,
83
+ do_constant_folding=True,
84
+ verbose=False,
85
+ export_params=True,
86
+ training=torch.onnx.TrainingMode.EVAL,
87
+ operator_export_type=torch.onnx.OperatorExportTypes.ONNX
88
+ )
89
+
90
+ print(f"Model successfully exported to: {onnx_path}")
91
+
92
+ # Verify the exported model
93
+ self._verify_onnx_model(onnx_path, dummy_input)
94
+
95
+ return onnx_path
96
+
97
+ except Exception as e:
98
+ print(f"Error during ONNX conversion: {e}")
99
+ raise
100
+
101
+ def _verify_onnx_model(self, onnx_path: str, test_input: torch.Tensor):
102
+ """Verify the exported ONNX model"""
103
+ print("Verifying ONNX model...")
104
+
105
+ try:
106
+ # Load and check ONNX model
107
+ onnx_model = onnx.load(onnx_path)
108
+ onnx.checker.check_model(onnx_model)
109
+ print("✓ ONNX model structure is valid")
110
+
111
+ # Test inference with ONNX Runtime
112
+ ort_session = onnxruntime.InferenceSession(onnx_path)
113
+
114
+ # Get model input/output info
115
+ input_name = ort_session.get_inputs()[0].name
116
+ output_name = ort_session.get_outputs()[0].name
117
+
118
+ print(f"✓ Input name: {input_name}")
119
+ print(f"✓ Output name: {output_name}")
120
+
121
+ # Run inference
122
+ ort_inputs = {input_name: test_input.numpy()}
123
+ ort_outputs = ort_session.run([output_name], ort_inputs)
124
+
125
+ # Compare with original PyTorch model
126
+ with torch.no_grad():
127
+ torch_output = self.model(test_input)
128
+ torch_logits = torch_output.logits
129
+
130
+ # Check output similarity
131
+ onnx_logits = ort_outputs[0]
132
+ max_diff = np.max(np.abs(torch_logits.numpy() - onnx_logits))
133
+
134
+ print(f"✓ Maximum difference between PyTorch and ONNX: {max_diff:.6f}")
135
+
136
+ if max_diff < 1e-4:
137
+ print("✓ ONNX model verification successful!")
138
+ else:
139
+ print("⚠ Warning: Large difference detected between models")
140
+
141
+ except Exception as e:
142
+ print(f"Error during verification: {e}")
143
+ raise
144
+
145
+ class Wav2Vec2ONNXInference:
146
+ """ONNX inference class for Wav2Vec2"""
147
+
148
+ def __init__(self, onnx_path: str, processor_name: str = "facebook/wav2vec2-base-960h"):
149
+ """Initialize ONNX inference"""
150
+ print(f"Loading ONNX model from: {onnx_path}")
151
+
152
+ # Load processor for tokenization
153
+ self.processor = Wav2Vec2Processor.from_pretrained(processor_name)
154
+
155
+ # Create ONNX Runtime session
156
+ self.session = onnxruntime.InferenceSession(onnx_path)
157
+ self.input_name = self.session.get_inputs()[0].name
158
+ self.output_name = self.session.get_outputs()[0].name
159
+ self.sample_rate = 16000
160
+
161
+ print("ONNX model loaded successfully")
162
+
163
+ def transcribe(self, audio_path: str) -> Dict:
164
+ """Transcribe audio using ONNX model"""
165
+ try:
166
+ # Load audio
167
+ speech, sr = librosa.load(audio_path, sr=self.sample_rate)
168
+
169
+ # Prepare input
170
+ input_values = self.processor(
171
+ speech,
172
+ sampling_rate=self.sample_rate,
173
+ return_tensors="np"
174
+ ).input_values
175
+
176
+ # Run ONNX inference
177
+ ort_inputs = {self.input_name: input_values}
178
+ ort_outputs = self.session.run([self.output_name], ort_inputs)
179
+ logits = ort_outputs[0]
180
+
181
+ # Decode predictions
182
+ predicted_ids = np.argmax(logits, axis=-1)
183
+ transcription = self.processor.batch_decode(predicted_ids)[0]
184
+
185
+ # Calculate confidence scores
186
+ confidence_scores = np.max(self._softmax(logits), axis=-1)[0]
187
+
188
+ return {
189
+ "transcription": transcription,
190
+ "confidence_scores": confidence_scores[:100].tolist(), # Limit for JSON
191
+ "predicted_ids": predicted_ids[0].tolist()
192
+ }
193
+
194
+ except Exception as e:
195
+ print(f"Transcription error: {e}")
196
+ return {
197
+ "transcription": "",
198
+ "confidence_scores": [],
199
+ "predicted_ids": []
200
+ }
201
+
202
+ def _softmax(self, x):
203
+ """Apply softmax to logits"""
204
+ exp_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
205
+ return exp_x / np.sum(exp_x, axis=-1, keepdims=True)
206
+
207
+ # Example usage and testing
208
+ def main():
209
+ """Example usage of the converter"""
210
+
211
+ # Method 1: Try standard conversion
212
+ try:
213
+ print("Method 1: Standard conversion...")
214
+ converter = Wav2Vec2ONNXConverter("facebook/wav2vec2-base-960h")
215
+ onnx_path = converter.convert_to_onnx(
216
+ onnx_path="wav2vec2_asr.onnx",
217
+ input_length=160000, # 10 seconds
218
+ opset_version=14 # Updated to version 14 for compatibility
219
+ )
220
+ print("✓ Standard conversion successful!")
221
+
222
+ except Exception as e:
223
+ print(f"✗ Standard conversion failed: {e}")
224
+ print("\nMethod 2: Trying fallback approach...")
225
+
226
+ try:
227
+ # Method 2: Use compatible model creation
228
+ model, processor = create_compatible_model("facebook/wav2vec2-base-960h")
229
+ onnx_path = export_with_fallback(
230
+ model,
231
+ processor,
232
+ "wav2vec2_asr_fallback.onnx",
233
+ input_length=160000
234
+ )
235
+ print("✓ Fallback conversion successful!")
236
+
237
+ except Exception as e2:
238
+ print(f"✗ All conversion methods failed: {e2}")
239
+ return
240
+
241
+ # Test ONNX inference
242
+ print("\nTesting ONNX inference...")
243
+ try:
244
+ onnx_inference = Wav2Vec2ONNXInference(onnx_path)
245
+ print("✓ ONNX model loaded successfully for inference")
246
+
247
+ # Create a test audio file (or use your own)
248
+ # result = onnx_inference.transcribe("test_audio.wav")
249
+ # print("Transcription:", result["transcription"])
250
+
251
+ except Exception as e:
252
+ print(f"✗ ONNX inference test failed: {e}")
253
+
254
+ print("Conversion process completed!")
255
+
256
+ # Additional utility functions
257
+ def create_compatible_model(model_name: str = "facebook/wav2vec2-base-960h"):
258
+ """Create a Wav2Vec2 model compatible with ONNX export"""
259
+ from transformers import Wav2Vec2Config
260
+
261
+ # Load config and modify for ONNX compatibility
262
+ config = Wav2Vec2Config.from_pretrained(model_name)
263
+
264
+ # Disable features that may cause ONNX issues
265
+ if hasattr(config, 'use_flash_attention_2'):
266
+ config.use_flash_attention_2 = False
267
+ if hasattr(config, 'torch_dtype'):
268
+ config.torch_dtype = torch.float32
269
+
270
+ # Load model with modified config
271
+ model = Wav2Vec2ForCTC.from_pretrained(model_name, config=config, torch_dtype=torch.float32)
272
+ processor = Wav2Vec2Processor.from_pretrained(model_name)
273
+
274
+ return model, processor
275
+
276
+ def export_with_fallback(model, processor, onnx_path: str, input_length: int = 160000):
277
+ """Export model with fallback options for different opset versions"""
278
+
279
+ dummy_input = torch.randn(1, input_length, dtype=torch.float32)
280
+ input_names = ["input_values"]
281
+ output_names = ["logits"]
282
+
283
+ dynamic_axes = {
284
+ "input_values": {0: "batch_size", 1: "sequence_length"},
285
+ "logits": {0: "batch_size", 1: "sequence_length"}
286
+ }
287
+
288
+ # Try different opset versions
289
+ opset_versions = [14, 13, 12, 11]
290
+
291
+ for opset_version in opset_versions:
292
+ try:
293
+ print(f"Trying ONNX export with opset version {opset_version}...")
294
+
295
+ with torch.no_grad():
296
+ model.eval()
297
+
298
+ # Disable all dropouts
299
+ for module in model.modules():
300
+ if isinstance(module, torch.nn.Dropout):
301
+ module.p = 0.0
302
+
303
+ torch.onnx.export(
304
+ model,
305
+ dummy_input,
306
+ onnx_path,
307
+ input_names=input_names,
308
+ output_names=output_names,
309
+ dynamic_axes=dynamic_axes,
310
+ opset_version=opset_version,
311
+ do_constant_folding=True,
312
+ verbose=False,
313
+ export_params=True,
314
+ training=torch.onnx.TrainingMode.EVAL
315
+ )
316
+
317
+ print(f"✓ Successfully exported with opset version {opset_version}")
318
+ return onnx_path
319
+
320
+ except Exception as e:
321
+ print(f"✗ Failed with opset {opset_version}: {str(e)[:100]}...")
322
+ continue
323
+
324
+ raise Exception("Failed to export with all attempted opset versions")
325
+ def optimize_onnx_model(onnx_path: str, optimized_path: str = None):
326
+ """Optimize ONNX model for inference"""
327
+ try:
328
+ from onnxruntime.transformers import optimizer
329
+
330
+ if optimized_path is None:
331
+ optimized_path = onnx_path.replace(".onnx", "_optimized.onnx")
332
+
333
+ # Optimize model
334
+ opt_model = optimizer.optimize_model(
335
+ onnx_path,
336
+ model_type="bert", # Similar architecture
337
+ num_heads=12,
338
+ hidden_size=768
339
+ )
340
+
341
+ opt_model.save_model_to_file(optimized_path)
342
+ print(f"Optimized model saved to: {optimized_path}")
343
+
344
+ return optimized_path
345
+
346
+ except ImportError:
347
+ print("ONNX Runtime tools not available for optimization")
348
+ return onnx_path
349
+ except Exception as e:
350
+ print(f"Optimization error: {e}")
351
+ return onnx_path
352
+
353
+ def compare_models(original_converter, onnx_inference, test_audio_path: str):
354
+ """Compare PyTorch and ONNX model outputs"""
355
+ print("Comparing PyTorch vs ONNX outputs...")
356
+
357
+ # PyTorch inference
358
+ torch_result = original_converter.transcribe_to_characters(test_audio_path)
359
+
360
+ # ONNX inference
361
+ onnx_result = onnx_inference.transcribe(test_audio_path)
362
+
363
+ print(f"PyTorch transcription: {torch_result['character_transcript']}")
364
+ print(f"ONNX transcription: {onnx_result['transcription']}")
365
+
366
+ # Compare similarity
367
+ if torch_result['character_transcript'] == onnx_result['transcription']:
368
+ print("✓ Transcriptions match exactly!")
369
+ else:
370
+ print("⚠ Transcriptions differ")
371
+
372
+ if __name__ == "__main__":
373
+ main()
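Beyond optimize_onnx_model above, a common follow-up for CPU deployment is dynamic int8 quantization. This is not part of the commit; a minimal sketch, assuming onnxruntime's quantization tooling is installed and the export produced wav2vec2_asr.onnx (file names are placeholders):

```python
# Optional post-export step (not in this commit): dynamic int8 quantization.
# Weights become int8, activations stay float32; file names are placeholders.
from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    model_input="wav2vec2_asr.onnx",
    model_output="wav2vec2_asr_int8.onnx",
    weight_type=QuantType.QInt8,
)
# The quantized file loads with the same Wav2Vec2ONNXInference class; re-running
# compare_models afterwards is worthwhile, since quantization can shift transcriptions slightly.
```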
src/utils/helper.py ADDED
@@ -0,0 +1,17 @@
1
+ import numpy as np
2
+
3
+
4
+ def convert_numpy_types(obj):
5
+ """Convert numpy types to Python native types"""
6
+ if isinstance(obj, np.integer):
7
+ return int(obj)
8
+ elif isinstance(obj, np.floating):
9
+ return float(obj)
10
+ elif isinstance(obj, np.ndarray):
11
+ return obj.tolist()
12
+ elif isinstance(obj, dict):
13
+ return {key: convert_numpy_types(value) for key, value in obj.items()}
14
+ elif isinstance(obj, list):
15
+ return [convert_numpy_types(item) for item in obj]
16
+ else:
17
+ return obj
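A quick illustration of why this helper is needed (numpy scalars and arrays are not JSON-serializable by default); the import path assumes the repo layout above:

```python
import json

import numpy as np

from src.utils.helper import convert_numpy_types

result = {"overall_score": np.float32(0.82), "phoneme_scores": np.array([0.9, 0.7])}
# json.dumps(result) raises TypeError; converting first makes it serializable.
print(json.dumps(convert_numpy_types(result)))
```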