How to use Sothay/gemm4-E2B-khmer-asr with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for Sothay/gemm4-E2B-khmer-asr to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for Sothay/gemm4-E2B-khmer-asr to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for Sothay/gemm4-E2B-khmer-asr to start chatting
Load model with FastModel
pip install unsloth

from unsloth import FastModel

model, tokenizer = FastModel.from_pretrained(
    model_name="Sothay/gemm4-E2B-khmer-asr",
    max_seq_length=2048,
)
Gemma 4 E2B — Khmer ASR LoRA Adapter
This repository contains a LoRA adapter for Gemma 4 E2B, fine-tuned with Unsloth on the DDD-Cambodia/khmer-speech-dataset to perform Automatic Speech Recognition (ASR) in the Khmer language.
The adapter targets both the decoder attention layers and the specialized projection components of the Gemma 4 audio encoder. It is intended to run in resource-constrained environments, such as a free Google Colab T4 GPU.
Key Highlights
- VRAM Optimized: Built using Unsloth 4-bit quantization, so inference and evaluation fit within the VRAM of a 16 GB T4.
- Audio Pre-Casting: Input audio is explicitly cast to float32 to avoid the cast exceptions that NumPy's Fourier transform (rfft) can otherwise raise during processing.
- Greedy Decoding: Optimized for fast, deterministic transcription.
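The float32 cast in the second highlight can be illustrated with a short sketch. It assumes the audio arrives as signed integer PCM (a common WAV layout, not something this model card specifies), and `to_float32_mono` is a hypothetical helper name. Since librosa already returns float32, the cast mainly matters for audio loaded by other means.

```python
import numpy as np

def to_float32_mono(samples) -> np.ndarray:
    """Return a 1-D float32 waveform (hypothetical helper, for illustration only)."""
    arr = np.asarray(samples)
    if np.issubdtype(arr.dtype, np.integer):
        # Assumption: integer input is signed PCM; scale it into [-1.0, 1.0).
        arr = arr.astype(np.float32) / float(np.iinfo(arr.dtype).max + 1)
    else:
        arr = arr.astype(np.float32)
    if arr.ndim == 2:
        # Assumption: shape is (num_samples, num_channels); average down to mono.
        arr = arr.mean(axis=1)
    return arr

pcm = np.array([0, 16384, -32768], dtype=np.int16)
wave = to_float32_mono(pcm)
print(wave.dtype, wave.tolist())  # float32 [0.0, 0.5, -1.0]
```

The result can be passed as the `audio` entry of the chat message in the inference scripts in this card, provided it is already sampled at 16 kHz.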
Optimal Inference Code
Below is the optimized script to load the model and run inference on any local audio file.
Prerequisites
Make sure you have the required dependencies installed:
%%capture
# NOTE: Run this cell first, then restart the runtime before proceeding.
# (%%capture is a cell magic and must be the first line of the cell.)
import os, re
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    import torch
    v = re.match(r'[\d]{1,}\.[\d]{1,}', str(torch.__version__)).group(0)
    xformers = 'xformers==' + {
        '2.10': '0.0.34', '2.9': '0.0.33.post1', '2.8': '0.0.32.post2'
    }.get(v, "0.0.34")
    !pip install sentencepiece protobuf "datasets==4.3.0" \
        "huggingface_hub>=0.34.0" hf_transfer
    !pip install --no-deps unsloth_zoo bitsandbytes accelerate \
        {xformers} peft trl triton unsloth
!pip install --no-deps --upgrade "torchao>=0.16.0"
!pip install --no-deps transformers==5.5.0 "tokenizers>=0.22.0,<=0.23.0"
!pip install torchcodec                    # Audio codec support
!pip install --no-deps --upgrade timm      # Required for Gemma 4 audio/vision
!pip install jiwer                         # WER / CER evaluation
import torch; torch._dynamo.config.recompile_limit = 64
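The Colab branch of the install cell picks an xformers release from the major.minor part of the installed torch version. The sketch below isolates that lookup so it can be checked without installing anything. The version table is copied from the cell above, while the function name `pick_xformers` and the sample version strings are illustrative assumptions.

```python
import re

# Version table copied from the install cell above.
XFORMERS_BY_TORCH = {"2.10": "0.0.34", "2.9": "0.0.33.post1", "2.8": "0.0.32.post2"}

def pick_xformers(torch_version: str, default: str = "0.0.34") -> str:
    """Map a torch version string to an xformers pip requirement (illustrative helper)."""
    match = re.match(r"\d+\.\d+", torch_version)
    major_minor = match.group(0) if match else ""
    # Unknown torch versions fall back to the default, as in the install cell.
    return "xformers==" + XFORMERS_BY_TORCH.get(major_minor, default)

print(pick_xformers("2.9.1+cu126"))  # xformers==0.0.33.post1
print(pick_xformers("2.10.0"))       # xformers==0.0.34
print(pick_xformers("2.7.0"))        # xformers==0.0.34 (fallback)
```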
Python Inference Script
import torch
import numpy as np
import librosa
from unsloth import FastModel
# 1. Define Model and Repository Name
HF_REPO_NAME = "Sothay/gemm4-E2B-khmer-asr"
TARGET_SAMPLE_RATE = 16000
# 2. Load Model & Processor
print("⏳ Loading model and processor...")
model, processor = FastModel.from_pretrained(
    model_name = HF_REPO_NAME,
    max_seq_length = 8192,
    dtype = torch.float16,  # Force float16 for T4/GPU compatibility
    load_in_4bit = True,    # Enable 4-bit quantization for minimal VRAM
)
# 3. Enable Unsloth's optimized inference kernels
FastModel.for_inference(model)
print("✅ Model ready for inference.")
# 4. Helper function to load and preprocess audio
def load_audio(file_path: str) -> np.ndarray:
    """Loads audio and resamples it to 16,000 Hz float32 array."""
    # librosa automatically returns float32 arrays
    audio, sr = librosa.load(file_path, sr=TARGET_SAMPLE_RATE)
    return audio
# 5. Define transcription function
def transcribe_khmer(audio_array: np.ndarray) -> str:
    """Runs inference on a processed float32 audio array."""
    # Guarantee float32 to prevent numpy.fft.rfft casting errors
    audio_array = np.asarray(audio_array, dtype=np.float32)

    # Structure system and user prompts matching the training template
    system_prompt = (
        "You are an expert Khmer speech recognition assistant. "
        "Transcribe the spoken audio accurately in Khmer script, "
        "without translation or explanation."
    )
    messages = [
        {
            "role": "system",
            "content": [{"type": "text", "text": system_prompt}],
        },
        {
            "role": "user",
            "content": [
                {"type": "audio", "audio": audio_array},
                {"type": "text", "text": "Please transcribe this Khmer audio."},
            ],
        },
    ]

    # Tokenize input sequence
    inputs = processor.apply_chat_template(
        messages,
        add_generation_prompt = True,
        tokenize = True,
        return_dict = True,
        return_tensors = "pt",
    ).to("cuda")

    # Generate transcript
    output_ids = model.generate(
        **inputs,
        max_new_tokens = 256,
        do_sample = False,  # Greedy decoding: stable, fast & reproducible
    )

    # Decode and strip prompt tokens
    prompt_len = inputs["input_ids"].shape[1]
    transcript = processor.decode(
        output_ids[0][prompt_len:],
        skip_special_tokens = True
    ).strip()
    return transcript
# ── Example Usage ──────────────────────────────────────────────────────────
# Replace with the path to your local Khmer audio file (wav, mp3, flac, etc.)
audio_path = "path_to_your_audio.wav"
try:
    print(f"🔊 Processing audio: {audio_path}...")
    audio_data = load_audio(audio_path)
    print("Transcribing...")
    prediction = transcribe_khmer(audio_data)
    print("\n Predicted Transcript:")
    print(prediction)
except Exception as e:
    print(f" Error running inference: {e}")
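The "decode and strip prompt tokens" step in the script relies on `generate` returning the prompt tokens followed by the newly generated ones, which is how decoder-only models in transformers behave. The toy sketch below shows the slicing with plain Python lists standing in for tensors. The token ids are made up and `strip_prompt` is a hypothetical helper name.

```python
def strip_prompt(output_ids: list[int], prompt_len: int) -> list[int]:
    """Keep only the tokens generated after the prompt."""
    return output_ids[prompt_len:]

# Made-up ids standing in for inputs["input_ids"][0] and output_ids[0].
prompt_ids = [2, 105, 2364, 107]
generated = prompt_ids + [901, 902, 1]

new_tokens = strip_prompt(generated, prompt_len=len(prompt_ids))
print(new_tokens)  # [901, 902, 1]
```

Without the slice, the decoded string would begin with the system and user prompt text rather than the transcript alone.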
Training Methodology & Metrics
This model was trained using the SFTTrainer from Hugging Face's trl library with the following memory-saving configuration to enable stable training on a single T4 (16 GB) GPU:
- Dtype: torch.float16 (mixed precision with fp16=True and bf16=False, since the T4 lacks native bf16 compute support).
- Optimizer: paged_adamw_8bit (8-bit optimizer states, roughly 75% smaller than 32-bit AdamW states).
- Gradient Checkpointing: Active (use_reentrant=False), recomputing activations in the backward pass instead of storing them.
- Batching: per_device_train_batch_size=2 with gradient_accumulation_steps=4 (effective batch size of 8).
- Sequence Length: Capped at 4096 tokens.
- Evaluation Metric: Character Error Rate (CER) was selected instead of Word Error Rate (WER) because Khmer does not consistently separate words with whitespace.
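To make the evaluation metric concrete, the sketch below computes CER as the character-level edit distance divided by the reference length. It is a standalone illustration, not the evaluation code used for this model (the prerequisites install jiwer for that purpose). It counts Unicode code points, so a Khmer cluster built from several code points counts as several characters; whether the reported metric uses the same counting is an assumption.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in Unicode code points."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)

print(cer("សួស្តី", "សួស្តី"))  # 0.0 (identical strings)
print(cer("abcd", "abed"))      # 0.25 (one substitution over four characters)
```

Because no word segmentation is needed, the metric is unaffected by where, or whether, spaces appear between Khmer words, apart from the spaces themselves counting as characters.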
💻 Usage: Transcribe a Local Audio File
As an alternative to the two-function script above, the code block below wraps loading and transcription in a single function. It loads a local audio file (MP3, WAV, FLAC, M4A, etc.), resamples it to 16 kHz, ensures it is in the correct format (float32), and prints the Khmer transcription.
Setup
First, make sure you have the required dependencies:
pip install unsloth librosa soundfile numpy torch
Python Script
import torch
import numpy as np
import librosa
from unsloth import FastModel
# 1. Define Model Repository
HF_REPO_NAME = "Sothay/gemm4-E2B-khmer-asr"
TARGET_SAMPLE_RATE = 16000
# 2. Load Model & Processor
print("⏳ Loading model and processor...")
model, processor = FastModel.from_pretrained(
    model_name = HF_REPO_NAME,
    max_seq_length = 4096,
    dtype = torch.float16,  # Force float16 for T4 GPU compatibility
    load_in_4bit = True,    # 4-bit quantization for low-VRAM environments
)
# Enable Unsloth's optimized inference mode
FastModel.for_inference(model)
print("✅ Model loaded successfully!")
# 3. Define Inference Functions
def transcribe_audio_file(file_path: str) -> str:
    """Loads a local audio file, pre-processes it, and returns the transcription."""
    # Load and automatically resample to 16,000 Hz float32
    audio_array, sr = librosa.load(file_path, sr=TARGET_SAMPLE_RATE)

    # Ensure float32 to prevent numpy.fft.rfft casting issues
    audio_array = np.asarray(audio_array, dtype=np.float32)

    # Format instruction messages
    system_prompt = (
        "You are an expert Khmer speech recognition assistant. "
        "Transcribe the spoken audio accurately in Khmer script, "
        "without translation or explanation."
    )
    messages = [
        {
            "role": "system",
            "content": [{"type": "text", "text": system_prompt}],
        },
        {
            "role": "user",
            "content": [
                {"type": "audio", "audio": audio_array},
                {"type": "text", "text": "Please transcribe this Khmer audio."},
            ],
        },
    ]

    # Tokenize input context
    inputs = processor.apply_chat_template(
        messages,
        add_generation_prompt = True,
        tokenize = True,
        return_dict = True,
        return_tensors = "pt",
    ).to("cuda")

    # Generate output transcript using Greedy decoding
    output_ids = model.generate(
        **inputs,
        max_new_tokens = 256,
        do_sample = False,  # Greedy decoding (deterministic ASR)
    )

    # Strip prompt tokens and decode prediction
    prompt_len = inputs["input_ids"].shape[1]
    transcript = processor.decode(
        output_ids[0][prompt_len:],
        skip_special_tokens = True
    ).strip()
    return transcript
# ── Example Usage ──────────────────────────────────────────────────────────
# Simply change this path to point to your local audio file
my_audio_file = "my_voice_sample.wav"
try:
    print(f"\n🔊 Reading audio file: {my_audio_file}")
    transcript = transcribe_audio_file(my_audio_file)
    print("\n🎯 Transcription:")
    print(transcript)
except Exception as e:
    print(f"❌ Error occurred: {e}")
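For more than one recording, the single-file call can be wrapped in a loop. The following is a minimal sketch, not part of the original model card: `transcribe_folder`, the extension set, and the folder name in the usage note are all hypothetical. The transcription function is passed in as a parameter so the helper stays independent of the model.

```python
from pathlib import Path
from typing import Callable

# Assumption: these are the extensions of interest; adjust to what librosa can decode locally.
AUDIO_EXTENSIONS = {".wav", ".mp3", ".flac", ".m4a"}

def transcribe_folder(folder: str, transcribe_fn: Callable[[str], str]) -> dict[str, str]:
    """Apply transcribe_fn to every audio file in folder, keyed by file name."""
    results = {}
    for path in sorted(Path(folder).iterdir()):
        if path.suffix.lower() not in AUDIO_EXTENSIONS:
            continue  # Skip non-audio files
        try:
            results[path.name] = transcribe_fn(str(path))
        except Exception as exc:
            # Record the failure and continue with the remaining files.
            results[path.name] = f"ERROR: {exc}"
    return results
```

With the model loaded as above, a call such as `transcribe_folder("recordings", transcribe_audio_file)` would return one transcript per file, with any per-file failure recorded as an `ERROR:` string instead of stopping the run.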