---
base_model: ibm-granite/granite-3.1-2b-instruct
tags:
- text-generation
- transformers
- safetensors
- english
- granite
- text-generation-inference
- trl
- grpo
- conversational
- inference-endpoints
- 4-bit precision
- bitsandbytes
license: apache-2.0
language:
- en
---

# Granite-3.1-2B-Reasoning-4bit (Quantized for Efficiency)

## Model Overview

This is a **4-bit quantized version** of **ruslanmv/granite-3.1-2b-Reasoning**, which is fine-tuned from **ibm-granite/granite-3.1-2b-instruct**. Quantization significantly reduces memory usage while maintaining strong reasoning capabilities.

- **Developed by:** [ruslanmv](https://huggingface.co/ruslanmv)
- **License:** Apache 2.0
- **Base Model:** [ibm-granite/granite-3.1-2b-instruct](https://huggingface.co/ibm-granite/granite-3.1-2b-instruct)
- **Fine-tuned for:** Logical reasoning, structured problem-solving, long-context tasks
- **Quantized with:** **bitsandbytes (4-bit precision)**
- **Supported Languages:** English
- **Tensor Type:** **BF16**
- **Parameter Size:** **2.53B params**

---

## Why Use the Quantized Version?

This **4-bit quantized model** is ideal for users who need **fast inference and reduced memory usage** while still benefiting from **Granite's advanced reasoning capabilities**.

✅ **2x faster training** compared to standard full-precision fine-tuning

✅ **Lower VRAM usage**, making it suitable for consumer GPUs

✅ **Optimized for inference**, making it more efficient to deploy

---

## Installation & Usage

To run the quantized model, install the required dependencies:

```bash
pip install torch torchvision torchaudio
pip install accelerate
pip install transformers
pip install bitsandbytes
```

### Running the Model

Use the following Python snippet to load the **4-bit quantized** model and generate text:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_path = "ruslanmv/granite-3.1-2b-Reasoning-4bit"

# 4-bit quantization settings (NF4 weights, BF16 compute) handled by bitsandbytes
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",                        # place layers on the available GPU(s)/CPU
    quantization_config=quantization_config,  # load weights in 4-bit precision
)
model.eval()

input_text = "Can you explain the difference between inductive and deductive reasoning?"
input_tokens = tokenizer(input_text, return_tensors="pt").to(model.device)

output = model.generate(**input_tokens, max_length=4000)
output_text = tokenizer.batch_decode(output, skip_special_tokens=True)

print(output_text[0])
```

---

## Intended Use

Granite-3.1-2B-Reasoning-4bit is designed for tasks that require structured **reasoning**, including:

- **Logical and analytical problem-solving**
- **Text-based reasoning tasks**
- **Mathematical and symbolic reasoning**
- **Advanced instruction-following**

This model is particularly useful for users who need a **lightweight, high-performance** version of **Granite-3.1-2B-Reasoning** without sacrificing much accuracy.

---

## License & Acknowledgments

This model is released under the **Apache 2.0** license. It is fine-tuned from IBM’s **Granite 3.1-2B-Instruct** model and **quantized using bitsandbytes** for efficiency.

Special thanks to the **IBM Granite Team** for developing the base model. For more details, visit the [IBM Granite Documentation](https://huggingface.co/ibm-granite).
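
---

## Example: Conversational Use

Because the model is also tagged for conversational use, the tokenizer's chat template can be used to format multi-turn prompts. The snippet below is a minimal sketch that reuses the `model` and `tokenizer` objects from the loading example above and assumes the tokenizer ships the standard Granite chat template; the system prompt, question, and generation settings are illustrative only, not prescribed by this model card.

```python
# Minimal chat-style sketch (assumes `model` and `tokenizer` from the loading example above).
messages = [
    {"role": "system", "content": "You are a careful reasoning assistant."},  # illustrative system prompt
    {"role": "user", "content": "A train leaves at 9:00 travelling at 60 km/h. How far has it gone by 10:30?"},
]

# apply_chat_template formats the conversation with the model's chat template
# and returns input IDs ready for generation.
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

output = model.generate(input_ids, max_new_tokens=256)

# Decode only the newly generated tokens (everything after the prompt).
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```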
---

### Citation

If you use this model in your research or applications, please cite:

```bibtex
@misc{ruslanmv2025granite,
  title={Fine-Tuning and Quantizing Granite-3.1 for Advanced Reasoning},
  author={Ruslan M.V.},
  year={2025},
  url={https://huggingface.co/ruslanmv/granite-3.1-2b-Reasoning-4bit}
}
```