I'm releasing the speech version of Gemma-3!
Hi, thanks for your cool project!
I'm here to share a project I've been working on: a speech version of Gemma-3!
Here is the model: Gemma-3-MM
I've applied a Speech Adapter to the Gemma-3 model, inspired by the Phi-4-multimodal-instruct approach. The performance is pretty interesting: for example, a CER/WER of 4.47/8.49 on CoVoST2 ASR (English) and a BLEU score of 29.83 for zero-shot English-Korean AST.
Surprisingly, when no speech data is provided, the model functions exactly like the original, with no impact on performance.
I'd love to hear your thoughts on this! Any tips or suggestions for further improvements?
Also, I wanted to check on the Gemma 3 license. Given that this is an adaptation of the original model, are there any potential issues I should be aware of regarding the use and sharing of this speech version? I just want to make sure I'm staying within the guidelines.
Looking forward to your feedback!
Hello @junnei, this is cool.
I am looking into a similar direction too.
Do you mind sharing the steps you took, or any advice?
Hi @KatoStevenMubiru.
Thanks for your interest in my Gemma-3-MM speech project! Here's how I implemented it:
Implementation Steps
- I first analyzed Gemma-3's architecture and added an Audio Tower (a Conformer encoder) at the same level as the Vision Tower.
- Similar to the vision implementation, I designed an Audio Projector to connect the Audio Tower outputs to the Language Model (see the sketch after this list).
- I modified the tokenizer by adding audio tokens (following the same approach used for the vision tokens) and redesigned the processing pipeline to include an Audio Feature Extractor.
- To boost performance, I borrowed parameters from the speech encoder in Phi-4-multimodal-instruct, which also inspired the overall design.
- For training, I froze all original Gemma-3 parameters and made only the Audio Projector trainable, while also applying LoRA to the Language Model (a rough training sketch follows below). This approach lays the foundation for a Mixture-of-LoRAs setup.
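To make the wiring concrete, here is a minimal PyTorch sketch of how the Audio Tower and Audio Projector sit alongside the Vision Tower. All module names, dimensions, and the projector design are illustrative assumptions, not the actual Gemma-3-MM code:

```python
import torch
import torch.nn as nn

# Hypothetical dimensions, chosen only for illustration.
AUDIO_ENCODER_DIM = 1024   # Conformer encoder output size (assumed)
LM_HIDDEN_DIM = 2560       # Gemma-3 language model hidden size (assumed)


class AudioProjector(nn.Module):
    """Maps Audio Tower (Conformer) features into the LM embedding space,
    mirroring how the Vision Projector feeds Vision Tower outputs to the LM."""

    def __init__(self, audio_dim: int = AUDIO_ENCODER_DIM, lm_dim: int = LM_HIDDEN_DIM):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(audio_dim, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )

    def forward(self, audio_features: torch.Tensor) -> torch.Tensor:
        # audio_features: (batch, num_audio_frames, audio_dim)
        return self.proj(audio_features)  # -> (batch, num_audio_frames, lm_dim)


class GemmaWithSpeech(nn.Module):
    """Sketch: the Audio Tower sits alongside the Vision Tower; its projected
    features are spliced into the token embedding sequence at the positions
    of the new audio placeholder tokens."""

    def __init__(self, language_model, vision_tower, audio_tower):
        super().__init__()
        self.language_model = language_model     # original Gemma-3 LM (frozen)
        self.vision_tower = vision_tower         # original, untouched
        self.audio_tower = audio_tower           # Conformer encoder
        self.audio_projector = AudioProjector()  # the newly trained module

    def embed_audio(self, input_features: torch.Tensor) -> torch.Tensor:
        feats = self.audio_tower(input_features)  # Conformer features
        return self.audio_projector(feats)        # LM-space embeddings
```

Registering the audio placeholder tokens would follow the same pattern as the vision tokens, e.g. `tokenizer.add_special_tokens({"additional_special_tokens": ["<audio>"]})` followed by `resize_token_embeddings` on the language model (the `<audio>` token string here is hypothetical).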
I'm currently developing Python files to enable training on this model architecture, and I'll share updates here when they're ready.
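And here is a rough sketch of the training recipe (freeze everything, train only the Audio Projector, and attach LoRA adapters to the Language Model), assuming a PEFT-style workflow on the module layout above; the LoRA hyperparameters and target module names are placeholders, not the project's actual config:

```python
from peft import LoraConfig, get_peft_model


def prepare_for_training(model):
    """Freeze all original Gemma-3 weights, leave only the Audio Projector
    trainable, and attach LoRA adapters to the language model."""
    # 1) Freeze everything.
    for param in model.parameters():
        param.requires_grad = False

    # 2) Un-freeze only the Audio Projector.
    for param in model.audio_projector.parameters():
        param.requires_grad = True

    # 3) Attach LoRA adapters to the LM's attention projections
    #    (hyperparameters and module names are assumptions; check the actual model).
    lora_config = LoraConfig(
        r=16,
        lora_alpha=32,
        lora_dropout=0.05,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
        task_type="CAUSAL_LM",
    )
    model.language_model = get_peft_model(model.language_model, lora_config)
    return model
```

Since only the projector and the LoRA adapters receive gradients, the original Gemma-3 weights stay intact, which is why the model behaves exactly like the base model when no speech data is provided.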
Let me know if you have any specific questions about any part of the implementation!
Thank you so much,
Could you kindly make time for a short session with us?
We are trying to do similar work around Gemma 3.
I would be glad to have a Zoom or Google Meet call with you and some of my team members.
Hello @lgusm, how are you?
Hi @KatoStevenMubiru.
You can fine-tune on our model: Link
Let me know if there are any issues!