Instructions to use google/gemma-4-31B-it-assistant with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use google/gemma-4-31B-it-assistant with Transformers:
# Load model directly from transformers import AutoTokenizer, AutoModelForCausalLM tokenizer = AutoTokenizer.from_pretrained("google/gemma-4-31B-it-assistant") model = AutoModelForCausalLM.from_pretrained("google/gemma-4-31B-it-assistant") - Notebooks
- Google Colab
- Kaggle
AWQ Version of MTP
Hi there
i see lots of AWQ version of Gemma-4-31B-it , but for MTP version i can't find even single AWQ version
is there any issue on create AWQ version of this model ?
If you use vLLM it can quantize the model/kv_cache on the fly usingllm = LLM( model=CHECKPOINT, trust_remote_code=True, gpu_memory_utilization=0.90, max_model_len=MAX_MODEL_LEN, seed=0, disable_log_stats=True, enable_chunked_prefill=True, enable_prefix_caching=True, quantization="fp8", kv_cache_dtype="fp8", )
If you use vLLM it can quantize the model/kv_cache on the fly using
llm = LLM( model=CHECKPOINT, trust_remote_code=True, gpu_memory_utilization=0.90, max_model_len=MAX_MODEL_LEN, seed=0, disable_log_stats=True, enable_chunked_prefill=True, enable_prefix_caching=True, quantization="fp8", kv_cache_dtype="fp8", )
thanks @trieudemo11
but awq give us accuracy similar to fp8 on int4 , or i missed something, i also use vllm serve command to create openai api campatible endpoints
no update?
Depending on your task, you must test by yourself on your dataset.
For general tasks like I do, I found int4 quality is acceptable, but definitely not superior than fp8. For instruction-following tasks, int4 does well, but the answer is always kinda lazy unless I ask it to do more. For fp8, the output is more balanced between content and accuracy.
Depending on your task, you must test by yourself on your dataset.
For general tasks like I do, I found int4 quality is acceptable, but definitely not superior than fp8. For instruction-following tasks, int4 does well, but the answer is always kinda lazy unless I ask it to do more. For fp8, the output is more balanced between content and accuracy.
but there ire only fp8 and gguf version, there is no awq version