
Gemma-2-9B-It-SPPO-Iter3-Q8

Model Overview

UCLA-AGI/Gemma-2-9B-It-SPPO-Iter3 with weights quantized to FP8 using the dynamic activation scheme (per-token activation scales computed at runtime), ready for inference with vLLM >= 0.5.0.
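
For intuition, the sketch below shows what dynamic, per-token FP8 quantization of activations amounts to. This is an illustrative toy, not AutoFP8's actual code, and it assumes PyTorch >= 2.1 for the float8_e4m3fn dtype:

import torch

# Largest representable magnitude in the e4m3 FP8 format (448.0).
FP8_MAX = torch.finfo(torch.float8_e4m3fn).max

def quantize_per_token(x: torch.Tensor):
    # One scale per token (row): map each row's max |value| onto FP8_MAX.
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12) / FP8_MAX
    x_fp8 = (x / scale).to(torch.float8_e4m3fn)
    return x_fp8, scale  # dequantize later as x_fp8.float() * scale

x = torch.randn(4, 8)            # (tokens, hidden_dim)
x_fp8, scale = quantize_per_token(x)
print(x_fp8.dtype, scale.shape)  # torch.float8_e4m3fn torch.Size([4, 1])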

Usage and Creation

This checkpoint was produced with AutoFP8 using the following script:

from transformers import AutoTokenizer
from auto_fp8 import AutoFP8ForCausalLM, BaseQuantizeConfig

pretrained_model_dir = "UCLA-AGI/Gemma-2-9B-It-SPPO-Iter3"
quantized_model_dir = "/quantized/Gemma-2-9B-It-SPPO-Iter3_Q8"

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir)

# The dynamic activation scheme computes per-token activation scales at
# runtime, so no calibration dataset is needed.
quantize_config = BaseQuantizeConfig(quant_method="fp8", activation_scheme="dynamic")

model = AutoFP8ForCausalLM.from_pretrained(
    pretrained_model_dir, quantize_config=quantize_config
)

# Quantize the weights to FP8, then write the checkpoint and tokenizer.
# (The empty calibration list is fine here: dynamic scales need no samples.)
model.quantize([])
model.save_quantized(quantized_model_dir)
tokenizer.save_pretrained(quantized_model_dir)

How to run FP8 quantized models

vLLM has full support for FP8 models quantized with AutoFP8. Install a recent vLLM with: pip install "vllm>=0.5.0" (the quotes keep the shell from interpreting >=).

Then pass the quantized checkpoint directly to vLLM's entrypoints; vLLM detects the FP8 format from the quantization_config entry in the checkpoint's config.json.
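
If you want to inspect what vLLM keys off, the snippet below downloads the config and prints that entry. A small sketch, assuming huggingface_hub is installed (it ships alongside transformers):

import json
from huggingface_hub import hf_hub_download

# Fetch this repo's config.json and show the quantization metadata that
# AutoFP8's save_quantized wrote into it.
cfg_path = hf_hub_download("tranhoangnguyen03/Gemma-2-9B-It-SPPO-Iter3_Q8", "config.json")
with open(cfg_path) as f:
    cfg = json.load(f)
print(cfg.get("quantization_config"))
# Expected along the lines of: {"activation_scheme": "dynamic", "quant_method": "fp8"}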

from vllm import LLM

# No extra flags needed: vLLM picks up the FP8 format from the checkpoint.
model = LLM("tranhoangnguyen03/Gemma-2-9B-It-SPPO-Iter3_Q8")

outputs = model.generate("Once upon a time,")
print(outputs[0].outputs[0].text)  # generate() returns a list of RequestOutput
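
Since the base model is instruction-tuned, you will usually want the Gemma chat template rather than a raw prompt. A sketch, assuming transformers is installed alongside vLLM; the sampling values are illustrative:

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_id = "tranhoangnguyen03/Gemma-2-9B-It-SPPO-Iter3_Q8"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = LLM(model_id)

# Render a single-turn conversation with the tokenizer's chat template.
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Tell me a short story."}],
    tokenize=False,
    add_generation_prompt=True,
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = model.generate(prompt, params)
print(outputs[0].outputs[0].text)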

Benchmark Results

TBA
