Text Generation
Transformers
Safetensors
English
granite
w8a8
int8
vllm
8-bit precision
compressed-tensors
Update README.md
README.md CHANGED
@@ -298,7 +298,7 @@ guidellm --model neuralmagic/granite-3.1-2b-base-quantized.w8a8 --target "http:/
 <th>Docstring Generation<br>prefill: 768 tokens<br>decode: 128 tokens</th>
 <th>Code Fixing<br>prefill: 1024 tokens<br>decode: 1024 tokens</th>
 <th>RAG<br>prefill: 1024 tokens<br>decode: 128 tokens</th>
-<th>
+<th>Instruction Following<br>prefill: 256 tokens<br>decode: 128 tokens</th>
 <th>Multi-turn Chat<br>prefill: 512 tokens<br>decode: 256 tokens</th>
 <th>Large Summarization<br>prefill: 4096 tokens<br>decode: 512 tokens</th>
 </tr>
@@ -326,7 +326,7 @@ guidellm --model neuralmagic/granite-3.1-2b-base-quantized.w8a8 --target "http:/
 <td>4.7</td>
 </tr>
 <tr>
-<td>granite-3.1-2b-base-quantized.
+<td>granite-3.1-2b-base-quantized.w4a16</td>
 <td>1.94</td>
 <td>5.4</td>
 <td>0.7</td>
@@ -360,7 +360,7 @@ guidellm --model neuralmagic/granite-3.1-2b-base-quantized.w8a8 --target "http:/
 <td>4.5</td>
 </tr>
 <tr>
-<td>granite-3.1-2b-base-quantized.
+<td>granite-3.1-2b-base-quantized.w4a16</td>
 <td>1.87</td>
 <td>5.1</td>
 <td>0.7</td>
@@ -417,7 +417,7 @@ guidellm --model neuralmagic/granite-3.1-2b-base-quantized.w8a8 --target "http:/
 <td>1.4</td>
 </tr>
 <tr>
-<td>granite-3.1-2b-base-quantized.
+<td>granite-3.1-2b-base-quantized.w4a16</td>
 <td>0.98</td>
 <td>2.8</td>
 <td>10.0</td>
@@ -451,7 +451,7 @@ guidellm --model neuralmagic/granite-3.1-2b-base-quantized.w8a8 --target "http:/
 <td>1.7</td>
 </tr>
 <tr>
-<td>granite-3.1-2b-base-quantized.
+<td>granite-3.1-2b-base-quantized.w4a16</td>
 <td>0.95</td>
 <td>3.7</td>
 <td>11.4</td>
@@ -462,4 +462,3 @@ guidellm --model neuralmagic/granite-3.1-2b-base-quantized.w8a8 --target "http:/
 <td>1.4</td>
 </tr>
 </table>
-