Gemma 4 E2B — Khmer ASR LoRA Adapter

This repository contains a LoRA adapter for Gemma 4 E2B, fine-tuned with Unsloth on the DDD-Cambodia/khmer-speech-dataset to perform Automatic Speech Recognition (ASR) in Khmer.

The adapter targets both the decoder attention layers and the projection components of the Gemma 4 audio encoder. Loaded in 4-bit, the model runs in resource-constrained environments such as a free Google Colab T4 GPU.
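As a rough illustration of why a LoRA adapter stays small, the sketch below counts the trainable parameters LoRA adds to one linear layer and compares them with the full weight matrix. The layer dimensions and rank are hypothetical placeholders chosen for the arithmetic, not the values used for this adapter.

```python
def full_param_count(d_in: int, d_out: int) -> int:
    """Parameters in a dense weight matrix of shape (d_out, d_in)."""
    return d_in * d_out


def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in the two low-rank factors A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


# Hypothetical sizes for illustration only (not taken from this adapter).
d_in, d_out, rank = 2048, 2048, 16

full = full_param_count(d_in, d_out)
lora = lora_param_count(d_in, d_out, rank)
print(f"full: {full:,}  lora: {lora:,}  fraction trained: {lora / full:.4f}")
```

Only the low-rank factors are trained and stored, which is why the adapter file is much smaller than the base model.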

Key Highlights

  • VRAM Optimized: Loaded with Unsloth 4-bit quantization so that inference and evaluation fit within the 16 GB of VRAM on a T4.
  • Audio Pre-Casting: Audio is explicitly cast to float32 before processing, which avoids the casting errors NumPy's rfft (real FFT) can raise on other dtypes.
  • Greedy Decoding: Generation uses do_sample=False for fast, deterministic transcription.
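The float32 pre-cast can be sketched in isolation. The helper below is a hypothetical name (to_float32), not part of the repository. It assumes integer input is signed PCM, such as int16 samples from a WAV reader, and scales it into the [-1.0, 1.0) range that float audio normally uses. Floating-point input of any width is simply converted to float32.

```python
import numpy as np


def to_float32(audio) -> np.ndarray:
    """Return `audio` as a float32 array, scaling signed integer PCM to [-1.0, 1.0)."""
    arr = np.asarray(audio)
    if np.issubdtype(arr.dtype, np.signedinteger):
        # Assumption: signed PCM. Divide by 2**(bits-1), e.g. 32768 for int16.
        scale = float(np.iinfo(arr.dtype).max) + 1.0
        arr = arr.astype(np.float32) / scale
    # Floats (float16/float64) and already-scaled data are only converted in dtype.
    return np.ascontiguousarray(arr, dtype=np.float32)


pcm = np.array([0, 16384, -32768], dtype=np.int16)
print(to_float32(pcm))        # float32 samples scaled from int16
print(to_float32(pcm).dtype)  # float32
```

Audio loaded with librosa.load is already float32, so this kind of helper mainly matters when samples come from another reader.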

Optimal Inference Code

Below is the optimized script to load the model and run inference on any local audio file.

Prerequisites

Install the required dependencies by running the following cell in Colab or Jupyter (the !pip and %%capture lines are notebook syntax, not plain Python):

%%capture
# NOTE: Run this cell first, then restart the runtime before proceeding.
# (%%capture is a cell magic, so it must be the first line of the cell.)
import os, re
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    import torch
    v = re.match(r'[\d]{1,}\.[\d]{1,}', str(torch.__version__)).group(0)
    xformers = 'xformers==' + {
        '2.10': '0.0.34', '2.9': '0.0.33.post1', '2.8': '0.0.32.post2'
    }.get(v, "0.0.34")
    !pip install sentencepiece protobuf "datasets==4.3.0" \
                 "huggingface_hub>=0.34.0" hf_transfer
    !pip install --no-deps unsloth_zoo bitsandbytes accelerate \
                 {xformers} peft trl triton unsloth
    !pip install --no-deps --upgrade "torchao>=0.16.0"
!pip install --no-deps transformers==5.5.0 "tokenizers>=0.22.0,<=0.23.0"
!pip install torchcodec                    # Audio codec support
!pip install --no-deps --upgrade timm      # Required for Gemma 4 audio/vision
!pip install jiwer                         # WER / CER evaluation
!pip install librosa                       # Audio loading and resampling (imported by the inference script)
import torch; torch._dynamo.config.recompile_limit = 64

Python Inference Script

import torch
import numpy as np
import librosa
from unsloth import FastModel

# 1. Define Model and Repository Name
HF_REPO_NAME = "Sothay/gemm4-E2B-khmer-asr"
TARGET_SAMPLE_RATE = 16000

# 2. Load Model & Processor
print("⏳ Loading model and processor...")
model, processor = FastModel.from_pretrained(
    model_name     = HF_REPO_NAME,
    max_seq_length = 8192,
    dtype          = torch.float16,  # Force float16 for T4/GPU compatibility
    load_in_4bit   = True,           # Enable 4-bit quantization for minimal VRAM
)

# 3. Enable Unsloth's optimized inference kernels
FastModel.for_inference(model)
print("✅ Model ready for inference.")

# 4. Helper function to load and preprocess audio
def load_audio(file_path: str) -> np.ndarray:
    """Loads audio and resamples it to 16,000 Hz float32 array."""
    # librosa automatically returns float32 arrays
    audio, sr = librosa.load(file_path, sr=TARGET_SAMPLE_RATE)
    return audio

# 5. Define transcription function
def transcribe_khmer(audio_array: np.ndarray) -> str:
    """Runs inference on a processed float32 audio array."""
    # Guarantee float32 to prevent numpy.fft.rfft casting errors
    audio_array = np.asarray(audio_array, dtype=np.float32)
    
    # Structure system and user prompts matching the training template
    system_prompt = (
        "You are an expert Khmer speech recognition assistant. "
        "Transcribe the spoken audio accurately in Khmer script, "
        "without translation or explanation."
    )
    
    messages = [
        {
            "role": "system",
            "content": [{"type": "text", "text": system_prompt}],
        },
        {
            "role": "user",
            "content": [
                {"type": "audio", "audio": audio_array},
                {"type": "text",  "text": "Please transcribe this Khmer audio."},
            ],
        },
    ]
    
    # Tokenize input sequence
    inputs = processor.apply_chat_template(
        messages,
        add_generation_prompt = True,
        tokenize              = True,
        return_dict           = True,
        return_tensors        = "pt",
    ).to("cuda")

    # Generate transcript
    output_ids = model.generate(
        **inputs,
        max_new_tokens = 256,
        do_sample      = False,   # Greedy decoding: stable, fast & reproducible
    )
    
    # Decode and strip prompt tokens
    prompt_len = inputs["input_ids"].shape[1]
    transcript = processor.decode(
        output_ids[0][prompt_len:], 
        skip_special_tokens = True
    ).strip()
    
    return transcript

# ── Example Usage ──────────────────────────────────────────────────────────
# Replace with the path to your local Khmer audio file (wav, mp3, flac, etc.)
audio_path = "path_to_your_audio.wav" 

try:
    print(f"🔊 Processing audio: {audio_path}...")
    audio_data = load_audio(audio_path)
    
    print("Transcribing...")
    prediction = transcribe_khmer(audio_data)
    
    print("\nPredicted Transcript:")
    print(prediction)
except Exception as e:
    print(f"Error running inference: {e}")

Training Methodology & Metrics

This model was trained using the SFTTrainer from Hugging Face's trl library with the following memory-saving configuration to enable stable training on a single T4 (16 GB) GPU:

  • Dtype: torch.float16 (mixed precision with fp16=True and bf16=False, since the T4 lacks native bf16 compute support).
  • Optimizer: paged_adamw_8bit (8-bit optimizer states, roughly 75% smaller than 32-bit states).
  • Gradient Checkpointing: Enabled (use_reentrant=False), which recomputes activations during the backward pass instead of storing them, trading extra compute for lower memory use.
  • Batching: per_device_train_batch_size=2 with gradient_accumulation_steps=4 (effective batch size of 8).
  • Sequence Length: Capped at 4096 tokens.
  • Evaluation Metric: Character Error Rate (CER) was chosen over Word Error Rate (WER) because Khmer does not consistently separate words with whitespace.
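To make the metric choice concrete, here is a minimal standalone sketch of CER as Levenshtein distance over characters, divided by the reference length. The prerequisites install jiwer, which provides its own CER function. This sketch only shows the underlying computation and is not necessarily how the evaluation was run. It counts Unicode code points rather than grapheme clusters, which is a simplifying assumption for Khmer.

```python
import unicodedata


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in Unicode code points."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, ref_char in enumerate(ref, start=1):
        curr = [i]  # cost of deleting i reference characters
        for j, hyp_char in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_char != hyp_char)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: edit distance divided by reference length."""
    # NFC normalization so canonically equivalent strings compare equal.
    ref = unicodedata.normalize("NFC", reference)
    hyp = unicodedata.normalize("NFC", hypothesis)
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# A Khmer sentence written without spaces is a single "word" under whitespace
# splitting, so a word-level metric would score it as all-or-nothing.
sentence = "ខ្ញុំស្រឡាញ់ភាសាខ្មែរ"
print(len(sentence.split()))   # number of whitespace-separated tokens
print(cer("abcd", "abxd"))     # 0.25: one substitution over four characters
```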

Alternatively, loading and transcription can be wrapped in a single function that takes a file path:

💻 Usage: Transcribe a Local Audio File

To run inference on your own audio files, use the script below. It loads a local audio file in any format librosa can decode (MP3, WAV, FLAC, M4A, etc.), resamples it to 16 kHz, casts it to float32, and prints the Khmer transcription.

Setup

First, make sure you have the required dependencies:

pip install unsloth librosa soundfile numpy torch

Python Script

import torch
import numpy as np
import librosa
from unsloth import FastModel

# 1. Define Model Repository
HF_REPO_NAME = "Sothay/gemm4-E2B-khmer-asr"
TARGET_SAMPLE_RATE = 16000

# 2. Load Model & Processor
print("⏳ Loading model and processor...")
model, processor = FastModel.from_pretrained(
    model_name     = HF_REPO_NAME,
    max_seq_length = 4096,
    dtype          = torch.float16,  # Force float16 for T4 GPU compatibility
    load_in_4bit   = True,           # 4-bit quantization for low-VRAM environments
)

# Enable Unsloth's optimized inference mode
FastModel.for_inference(model)
print("✅ Model loaded successfully!")

# 3. Define Inference Function
def transcribe_audio_file(file_path: str) -> str:
    """Loads a local audio file, pre-processes it, and returns the transcription."""
    # Load and automatically resample to 16,000 Hz float32
    audio_array, sr = librosa.load(file_path, sr=TARGET_SAMPLE_RATE)
    
    # Ensure float32 to prevent numpy.fft.rfft casting issues
    audio_array = np.asarray(audio_array, dtype=np.float32)
    
    # Format instruction messages
    system_prompt = (
        "You are an expert Khmer speech recognition assistant. "
        "Transcribe the spoken audio accurately in Khmer script, "
        "without translation or explanation."
    )
    
    messages = [
        {
            "role": "system",
            "content": [{"type": "text", "text": system_prompt}],
        },
        {
            "role": "user",
            "content": [
                {"type": "audio", "audio": audio_array},
                {"type": "text",  "text": "Please transcribe this Khmer audio."},
            ],
        },
    ]
    
    # Tokenize input context
    inputs = processor.apply_chat_template(
        messages,
        add_generation_prompt = True,
        tokenize              = True,
        return_dict           = True,
        return_tensors        = "pt",
    ).to("cuda")

    # Generate output transcript using Greedy decoding
    output_ids = model.generate(
        **inputs,
        max_new_tokens = 256,
        do_sample      = False,  # Greedy decoding (deterministic ASR)
    )
    
    # Strip prompt tokens and decode prediction
    prompt_len = inputs["input_ids"].shape[1]
    transcript = processor.decode(
        output_ids[0][prompt_len:], 
        skip_special_tokens = True
    ).strip()
    
    return transcript

# ── Example Usage ──────────────────────────────────────────────────────────
# Simply change this path to point to your local audio file
my_audio_file = "my_voice_sample.wav" 

try:
    print(f"\n🔊 Reading audio file: {my_audio_file}")
    transcript = transcribe_audio_file(my_audio_file)
    print("\n🎯 Transcription:")
    print(transcript)
except Exception as e:
    print(f"❌ Error occurred: {e}")
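To transcribe several recordings in one run, the files can be collected first. The sketch below uses hypothetical names (AUDIO_EXTENSIONS, find_audio_files) that are not part of the repository. The extension list mirrors the formats named above, and whether a given file actually decodes depends on the audio backends available to librosa.

```python
from pathlib import Path

# Assumption: extensions taken from the formats mentioned in this README.
AUDIO_EXTENSIONS = {".wav", ".mp3", ".flac", ".m4a"}


def find_audio_files(directory: str) -> list[Path]:
    """Recursively collect audio files under `directory`, sorted by path."""
    root = Path(directory)
    return sorted(
        path
        for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() in AUDIO_EXTENSIONS
    )


if __name__ == "__main__":
    # "recordings" is a placeholder directory name.
    for audio_file in find_audio_files("recordings"):
        print(audio_file)
```

Each returned path can then be passed to transcribe_audio_file(str(path)) from the script above.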