How to use Shankara-A-S/g4e4-it-v0 with libraries and local apps.

Libraries

How to use Shankara-A-S/g4e4-it-v0 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="Shankara-A-S/g4e4-it-v0")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForMultimodalLM

processor = AutoProcessor.from_pretrained("Shankara-A-S/g4e4-it-v0")
model = AutoModelForMultimodalLM.from_pretrained("Shankara-A-S/g4e4-it-v0")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
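# Render the chat template and tokenize; the result holds input_ids plus the image tensors
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

# Generate, then decode only the newly generated tokens (the prompt is sliced off)
outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))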

Local Apps

vLLM

How to use Shankara-A-S/g4e4-it-v0 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "Shankara-A-S/g4e4-it-v0"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Shankara-A-S/g4e4-it-v0",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker

docker model run hf.co/Shankara-A-S/g4e4-it-v0

SGLang

How to use Shankara-A-S/g4e4-it-v0 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "Shankara-A-S/g4e4-it-v0" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Shankara-A-S/g4e4-it-v0",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "Shankara-A-S/g4e4-it-v0" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Shankara-A-S/g4e4-it-v0",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Gemma-4 E4B-it · runtime BitsAndBytes NF4 (Round 1 artifact — retrospective)

Team Godspeed AI's Round 1 submission to the Resilient AI Challenge 2026 (image-to-text). Preserved unchanged as a historical artifact and a documented negative result. Do not use this approach if your goal is inference energy efficiency — read on.

What this is

google/gemma-4-E4B-it with weights stored in bf16 and quantized to 4-bit NF4 at load time by vLLM's BitsAndBytes integration (load-format: bitsandbytes, quantization: bitsandbytes). No weights were modified; compression happens entirely at runtime.
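To make "quantized to 4-bit NF4" concrete, here is a minimal pure-Python sketch of blockwise NF4 quantization and the matching dequantization step. It is an illustration only, not the BitsAndBytes or vLLM code path: the function names are hypothetical, the 16 codebook values are reproduced from memory of the bitsandbytes source and should be checked against it, and real kernels pack two codes per byte and work on much larger blocks.

```python
# The 16 NF4 levels (assumption: reproduced from memory of bitsandbytes;
# verify against the library before relying on exact values).
NF4_LEVELS = [
    -1.0, -0.6961928009986877, -0.5250730514526367, -0.39491748809814453,
    -0.28444138169288635, -0.18477343022823334, -0.09105003625154495, 0.0,
    0.07958029955625534, 0.16093020141124725, 0.24611230194568634,
    0.33791524171829224, 0.44070982933044434, 0.5626170039176941,
    0.7229568362236023, 1.0,
]


def quantize_block(weights):
    """Map one block of floats to (absmax scale, list of 4-bit codes)."""
    absmax = max(abs(w) for w in weights) or 1.0  # avoid dividing by zero
    codes = []
    for w in weights:
        normalized = w / absmax  # now in [-1, 1]
        # The index of the nearest codebook level is the 4-bit code.
        code = min(range(16), key=lambda i: abs(NF4_LEVELS[i] - normalized))
        codes.append(code)
    return absmax, codes


def dequantize_block(absmax, codes):
    """Table lookup plus rescale; a runtime scheme repeats this per forward pass."""
    return [NF4_LEVELS[c] * absmax for c in codes]


# Hypothetical weights, chosen only to exercise the functions.
block = [0.02, -0.15, 0.31, -0.07, 0.0, 0.12, -0.31, 0.25]
scale, codes = quantize_block(block)
restored = dequantize_block(scale, codes)
max_error = max(abs(a - b) for a, b in zip(block, restored))
print("scale:", scale)
print("codes:", codes)
print(f"max abs reconstruction error: {max_error:.4f}")
```

The stored bf16 checkpoint never contains `codes`; in this artifact the equivalent of `quantize_block` runs when the server loads the model, and the equivalent of `dequantize_block` runs whenever a quantized layer is used.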

Official Round 1 results (organizer-measured, NVIDIA L4)

| Model | Energy (J) | Doc analysis | Image understanding | Mean recovery |
|---|---|---|---|---|
| BF16 base | 99.71 | 0.7608 | 0.68 | 100% |
| This artifact | 113.41 (+13.7%) | 0.7576 | 0.56 (−17.6%) | 91.45% |

The compressed model used 13.7% more energy than the uncompressed base while scoring 17.6% lower on image understanding.
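The percentage deltas in the results table can be reproduced from the raw numbers with a few lines of Python. This is a small arithmetic check, not the organizers' scoring code; `relative_change` is a helper name introduced here for illustration.

```python
def relative_change(new, base):
    """Percent change of `new` relative to `base`."""
    return (new - base) / base * 100.0


# Values copied from the Round 1 results table.
base = {"energy_j": 99.71, "doc_analysis": 0.7608, "image_understanding": 0.68}
artifact = {"energy_j": 113.41, "doc_analysis": 0.7576, "image_understanding": 0.56}

for metric in base:
    delta = relative_change(artifact[metric], base[metric])
    print(f"{metric}: {delta:+.1f}%")
```

The mean recovery figure is not recomputed here because the card does not state how the organizers define it, and the published task scores may be rounded.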

Why — the lesson this repo exists to teach

Runtime dequantization is an energy trap. BitsAndBytes (BnB) dequantizes 4-bit tiles to higher precision on every attention and MLP forward pass. The compute spent unpacking exceeds the bandwidth saved by the smaller weights. Stored-weight formats with fused int4 kernels (GPTQ-Marlin / AWQ-Marlin) do the matmul directly on packed weights and actually save energy (−52% in our Round 2 artifacts on identical hardware).
NF4 hurts multimodal composition. The vision tower stays bf16, but the LM layers that compose vision-token embeddings are NF4-quantized. Document OCR (mostly text decoding) survived, while the image-understanding score dropped 17.6%.

Our Round 2 artifacts fix both: g4e4-it-r2-awq-smoke-v0 (primary — AWQ-Marlin full decoder + response-economy chat template, ~4–5× less energy than this repo at higher recovery) and g4e4-it-r2-w4a16-mlpo-v0 (GPTQ-Marlin over MLP and attention-output projections — the conservative alternative).

Usage (reproduction only)

vllm serve Shankara-A-S/g4e4-it-v0 --config vllm_config.yaml

Tested on vLLM 0.20.2. Sampling: temperature=1.0, top_p=0.95, top_k=64 (also in generation_config.json).
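For reproduction runs driven from a script, the sampling settings above can be sent explicitly with each request. The sketch below builds an OpenAI-style chat payload using only the standard library. It assumes the server listens on vLLM's default `http://localhost:8000`; the `max_tokens` value is an arbitrary choice that the card does not specify, `top_k` is a vLLM extension to the OpenAI request schema, and both function names are hypothetical.

```python
import json
import urllib.request

MODEL_ID = "Shankara-A-S/g4e4-it-v0"


def build_chat_request(prompt, image_url, max_tokens=128):
    """Build a chat-completions payload carrying the card's sampling settings."""
    return {
        "model": MODEL_ID,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
        "temperature": 1.0,
        "top_p": 0.95,
        "top_k": 64,  # vLLM-specific field, not part of the OpenAI schema
        "max_tokens": max_tokens,  # assumption: not stated in the card
    }


def send_chat_request(payload, base_url="http://localhost:8000"):
    """POST the payload to a running server and return the parsed JSON reply."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


payload = build_chat_request(
    "Describe this image in one sentence.",
    "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg",
)
print(json.dumps(payload, indent=2))
```

With the server from the command above running, `send_chat_request(payload)` returns the completion as a dictionary. Building the payload separately from sending it keeps the sampling settings easy to inspect and log alongside energy measurements.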

License

Apache 2.0, inherited from google/gemma-4-E4B-it.

Safetensors

Model size: 8B params
Tensor type: BF16

Model tree for Shankara-A-S/g4e4-it-v0

Base model: google/gemma-4-E4B
Finetuned: google/gemma-4-E4B-it
Finetuned: this model