library_name: transformers
tags:
- qwen
- code
- text-generation
- fine-tuned
Model Card for qwen2.5-coder-ft
This model is a fine-tuned and merged version of Qwen2.5-Coder-1.5B-Instruct, specialized in Python programming and precise code generation.
Model Details
Model Description
This model has been fine-tuned using Low-Rank Adaptation (LoRA) and subsequently merged into full 16-bit precision weights. It is optimized to act as a strict code assistant, delivering accurate programming solutions while minimizing conversational overhead.
- Developed by: Soulama Haicanama Ismael
- Model type: Causal Language Model (Transformer Architecture)
- Language(s) (NLP): English, Python
- License: Apache 2.0 (inherited from Qwen base model)
- Finetuned from model: Qwen/Qwen2.5-Coder-1.5B-Instruct
Model Sources
- Repository: SOULAMA/qwen2.5-coder-ft
Uses
Direct Use
This model is intended for direct code generation and for answering programming questions. It is designed to be used through a chat template, with a system prompt that restricts the output to Python code.
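Even with a strict system prompt, a small model may occasionally wrap its answer in a Markdown fence or add a sentence around it. The helper below is a minimal sketch of one way to isolate the code from a reply; `extract_code` is a hypothetical name, not part of this repository, and the assumption that replies may contain fenced blocks is ours.

```python
import re

# The fence marker is built programmatically so this example can itself
# be embedded in Markdown without nesting literal fences.
FENCE = "`" * 3

# Opening fence, optional language tag, newline, lazily captured body, closing fence.
_CODE_BLOCK = re.compile(
    FENCE + r"[a-zA-Z0-9_+-]*[ \t]*\n(.*?)" + FENCE,
    re.DOTALL,
)


def extract_code(response: str) -> str:
    """Return the first fenced code block, or the whole reply if none is found."""
    match = _CODE_BLOCK.search(response)
    if match:
        return match.group(1).strip()
    return response.strip()
```

If the reply contains no fence at all, the function returns the stripped reply unchanged, so it is safe to apply to every generation.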
Out-of-Scope Use
The model should not be used for general non-coding tasks (such as creative writing, general chat, or translation), because fine-tuning adapted its attention and MLP projection layers toward code structure and programming vocabulary.
Bias, Risks, and Limitations
Due to its 1.5B parameter size, the model can fall into repetition loops if stopping criteria are not explicitly configured during inference. Users must handle the stop token (<|im_end|>) explicitly in their generation script so that generation ends at the close of the assistant turn.
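As a second line of defense, decoded text can be cut at the first stop marker after generation. This is a minimal sketch of such a post-processing step; `truncate_at_stop` is a hypothetical helper, and it complements rather than replaces passing the stop token as `eos_token_id` to `generate`.

```python
STOP_TOKEN = "<|im_end|>"


def truncate_at_stop(text: str, stop: str = STOP_TOKEN) -> str:
    """Cut decoded text at the first stop marker, if one is present."""
    # str.partition returns the text before the marker (or the whole
    # string when the marker is absent), so no branching is needed.
    head, _, _ = text.partition(stop)
    return head.rstrip()
```

This only matters when the text is decoded with `skip_special_tokens=False`, or when a serving stack returns the marker verbatim.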
Recommendations
It is highly recommended to lower the generation temperature ($\le 0.2$) and to provide clear, standalone system instructions so that code output is as close to deterministic as possible.
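To see why a low temperature pushes sampling toward deterministic output, the sketch below applies temperature scaling to a toy set of logits. The logits are made-up illustration values, not taken from this model, and `softmax_with_temperature` is a hypothetical helper.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens
p_default = softmax_with_temperature(toy_logits, 1.0)
p_cold = softmax_with_temperature(toy_logits, 0.2)
print(p_default)
print(p_cold)
```

At temperature 0.2 almost all probability mass lands on the highest-scoring token, so repeated runs rarely diverge, although sampling is still not strictly deterministic.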
How to Get Started with the Model
Use the code below to get started with the model using proper generation boundaries:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "SOULAMA/qwen2.5-coder-ft"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.float16,
    device_map="auto",
)

question = "Write a Python function that takes two values c and d and returns c+d."

def build_prompt(question: str) -> str:
    # Hand-built ChatML prompt. The French system prompt means:
    # "You are a programming expert. Write only the Python code that solves the problem."
    return (
        "<|im_start|>system\n"
        "Tu es un expert en programmation. Écris uniquement le code Python qui résout le problème.\n"
        "<|im_end|>\n"
        "<|im_start|>user\n"
        f"{question}\n"
        "<|im_end|>\n"
        "<|im_start|>assistant\n"
    )

prompt = build_prompt(question)
# The prompt already ends with the assistant header, so the tokenizer
# call needs no add_generation_prompt argument.
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(
        **inputs,  # unpack input_ids and attention_mask
        max_new_tokens=256,
        do_sample=True,  # required for temperature to take effect
        temperature=0.1,
        repetition_penalty=1.2,
        pad_token_id=tokenizer.eos_token_id,
        eos_token_id=tokenizer.eos_token_id,  # expected to be <|im_end|> for Qwen chat tokenizers
    )

# Keep only the newly generated tokens, dropping the prompt.
new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))
Training Details
Training Data
The model was trained on a custom instruction dataset containing coding exercises, software engineering questions, and structured Python scripts.
Training Procedure
Preprocessing
Prompts were structured using the Qwen ChatML format, dividing each example into <|im_start|>system, <|im_start|>user, and <|im_start|>assistant segments so that training data stays consistent with the original instruct template.
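The segment layout described above can be sketched as a small rendering function. `render_chatml` is a hypothetical helper that follows the common ChatML convention of closing each message with `<|im_end|>` directly after its content; the exact whitespace used during fine-tuning is not documented in this card, and the getting-started snippet places a newline before each `<|im_end|>`.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts as a ChatML prompt string."""
    parts = []
    for message in messages:
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    if add_generation_prompt:
        # Open the assistant turn so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = render_chatml(
    [
        {"role": "system", "content": "Write only Python code."},
        {"role": "user", "content": "Add two numbers."},
    ]
)
print(prompt)
```

In practice, `tokenizer.apply_chat_template` from Transformers produces this kind of string from the template shipped with the tokenizer, which avoids hand-maintaining the format.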
Training Hyperparameters
- Training regime: PEFT (LoRA) followed by a full merge_and_unload() into float16 precision.
- Base model precision: 4-bit quantized base setup during training (BitsAndBytes).
- Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj.
Speeds, Sizes, Times
- Checkpoint size: ~3.09 GB (Full Safetensors model)
- Adaptation layer size: ~73.9 MB (LoRA Weights)
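The adapter size can be sanity-checked with a little arithmetic. The sketch below counts LoRA parameters for the seven target modules, using layer dimensions from the published Qwen2.5-1.5B configuration (hidden size 1536, intermediate size 8960, 28 layers, 2 key/value heads of dimension 128), which are assumed to be unchanged here. The LoRA rank and storage dtype are not stated in this card; rank 16 in float32 is only a guess that happens to reproduce the reported figure, and rank 32 in float16 would give the same size.

```python
# Dimensions assumed from the Qwen2.5-1.5B base configuration.
HIDDEN = 1536
INTERMEDIATE = 8960
NUM_LAYERS = 28
KV_DIM = 2 * 128  # 2 key/value heads x head dimension 128

# (in_features, out_features) of each targeted projection in one layer.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(rank: int) -> int:
    """LoRA adds A (rank x in) and B (out x rank) per module: rank * (in + out)."""
    per_layer = sum(rank * (i + o) for i, o in TARGET_SHAPES.values())
    return per_layer * NUM_LAYERS


def size_mb(n_params: int, bytes_per_param: int) -> float:
    """Raw tensor size in decimal megabytes, ignoring file-format overhead."""
    return n_params * bytes_per_param / 1e6


params = lora_param_count(16)  # rank 16 is an assumption, not a documented value
print(params, round(size_mb(params, 4), 1))  # → 18464768 73.9
```

The match with the reported ~73.9 MB is consistent with these assumptions but does not confirm them.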
Technical Specifications
Model Architecture and Objective
Based on the Qwen2.5-Coder dense architecture with Grouped-Query Attention (GQA) and RoPE (Rotary Position Embedding) optimized for dense source code token sequences.
Compute Infrastructure
Hardware
- GPU Type: 1 x NVIDIA Tesla T4 (via Google Colab Ecosystem)
Software
- Libraries: PyTorch, Transformers, PEFT, BitsAndBytes, TRL.
Model Card Authors
Soulama Haicanama Ismael
Model Card Contact
[More Information Needed]