+---
+license: apache-2.0
+library_name: transformers
+tags:
+  - gemma
+  - gemma4
+  - multimodal
+  - vision-language
+  - conversational
+  - transformers
+  - vllm
+  - sglang
+  - function-calling
+  - reasoning
+pipeline_tag: image-text-to-text
+base_model: google/gemma-4-31b-it
+---
+# JetLLMLite-4.0
+**JetLLMLite-4.0** is a multimodal instruction-tuned model published by **Jetlink** and built on [google/gemma-4-31b-it](https://huggingface.co/google/gemma-4-31b-it).
+It is intended for teams that want to manage deployment, access, and internal distribution from their own namespace while preserving compatibility with the original upstream model ecosystem.
+## Model Summary
+JetLLMLite-4.0 is a 31B dense multimodal model with:
+- **31B total parameters (dense architecture)**
+- **Instruction-tuned (IT) variant**
+- **262,144 tokens (256K) context length**
+- **Multimodal: text + image input, text output**
+- **Video understanding support (up to 60 seconds at 1 fps)**
+- **Built-in reasoning / thinking mode**
+- **Native function calling support**
+- **Support for 140+ languages**
+- Compatibility with **Transformers**, **vLLM**, **SGLang**, **llama.cpp**, **MLX**, **Ollama**
+## Intended Use
+This model is suitable for advanced workloads such as:
+- multimodal chat assistants
+- long-context document and PDF understanding
+- reasoning and step-by-step problem solving
+- agentic workflows with function calling
+- coding assistants and code generation
+- image, chart, and OCR tasks
+- multilingual enterprise assistants
+- research and benchmarking
+## Model Details
+### Architecture
+- **Model type:** Dense Causal Language Model with Vision Encoder
+- **Training stage:** Pre-training & Post-training (Instruction-tuned)
+- **Total parameters:** 31B
+- **Architecture style:** Dense (not MoE)
+- **Attention mechanism:** Hybrid — alternating local sliding-window (1024 tokens) and global full-context attention
+- **RoPE:** Dual config — standard RoPE for sliding layers, Proportional RoPE (p-RoPE) for global layers
+- **Per-Layer Embeddings (PLE):** Yes
+- **Shared KV Cache:** Yes (last N layers reuse KV states from earlier layers)
+- **Native context length:** 262,144 tokens (256K)
+- **Vision encoder:** Variable aspect ratio; configurable token budgets (70 / 140 / 280 / 560 / 1120 tokens)
+- **Thinking mode:** Configurable via `<|think|>` token in system prompt
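To make the hybrid attention entry concrete, here is a minimal pure-Python sketch of the two mask shapes involved: a causal sliding-window mask and a full causal mask. It is an illustration only. The function name `causal_mask`, the toy window of 3, and the choice to count the query position itself inside the window are assumptions, and the card does not say which layers use which pattern.

```python
def causal_mask(seq_len, window=None):
    """Return a seq_len x seq_len boolean mask; mask[q][k] is True when query q may attend to key k.

    window=None gives full (global) causal attention. An integer limits each
    query to the most recent `window` positions, itself included.
    """
    mask = []
    for q in range(seq_len):
        row = []
        for k in range(seq_len):
            visible = k <= q  # causal: never look at future positions
            if window is not None:
                visible = visible and (q - k) < window
            row.append(visible)
        mask.append(row)
    return mask


local_mask = causal_mask(6, window=3)  # toy window; the card lists 1024 tokens
global_mask = causal_mask(6)

# Keys visible to query position 5 under each pattern
print([k for k, ok in enumerate(local_mask[5]) if ok])
print([k for k, ok in enumerate(global_mask[5]) if ok])
```

In a sliding-window layer the number of visible keys stops growing once the sequence is longer than the window, which is why only the global layers see the full context.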
+### Ecosystem Compatibility
+- Hugging Face Transformers
+- vLLM
+- SGLang
+- llama.cpp
+- MLX
+- Ollama
+- mistral.rs
+- LM Studio
+## Hardware Requirements
+> JetLLMLite-4.0 is **single-GPU capable** without quantization (bfloat16), which makes it considerably easier to deploy than large MoE or 100B+ scale models.
+### Reference Hardware
+Approximate GPU memory requirements:
+- **Unquantized (bfloat16):** fits on a single 80GB NVIDIA H100/H200 GPU
+- **4-bit quantized:** runs on consumer GPUs with 24GB+ VRAM (e.g. RTX 3090, RTX 4090)
+- **Multi-GPU:** tensor parallelism supported via vLLM and SGLang for higher throughput
+> Note: requirements vary based on context length, batch size, and KV cache settings. The above are practical reference points, not universal minimums.
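The reference points above follow from simple arithmetic on the parameter count. The sketch below computes the memory taken by the weights alone, using the 31B figure from this card. The helper name `weight_memory_gb` is illustrative, and the result is a lower bound: activations, KV cache, and the scale metadata stored by real 4-bit formats all add to it.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Memory needed for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


PARAMS = 31e9  # 31B parameters, as stated in this card

bf16_gb = weight_memory_gb(PARAMS, 16)
int4_gb = weight_memory_gb(PARAMS, 4)

print(f"bfloat16 weights: {bf16_gb:.1f} GB")
print(f"4-bit weights:    {int4_gb:.1f} GB")
```

The bfloat16 figure leaves limited headroom on an 80GB GPU, which is why context length and batch size matter so much for the unquantized setup.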
+### Practical Guidance
+#### Single GPU deployment
+Unlike large-scale MoE models, JetLLMLite-4.0 can be served from a **single 80GB datacenter GPU** in bfloat16 without quantization, which suits single-node or cost-conscious deployments.
+For consumer-grade hardware, quantized variants (GGUF, GPTQ, AWQ) significantly reduce memory requirements, usually at a small cost in quality.
+#### Text-only deployment
+Use the `--language-model-only` flag in vLLM to skip vision encoder profiling and free additional KV cache memory when your workload is purely text-based.
+### Recommendation
+For most production teams:
+1. start with **vLLM** or **SGLang** for serving
+2. use a **single H100/H200** for unquantized bfloat16 deployment
+3. use **4-bit quantization** for consumer GPU or cost-optimized deployments
+4. disable vision if not needed via `--language-model-only`
+## Software Requirements
+Recommended environment:
+- Python 3.10+
+- Linux
+- CUDA-enabled GPU infrastructure
+- One of the following runtimes:
+  - Transformers (`>= 4.51.0` required for Gemma 4)
+  - vLLM
+  - SGLang
+  - llama.cpp
+Common dependencies:
+- `torch`
+- `transformers >= 4.51.0`
+- `torchvision`
+- `pillow`
+- `accelerate`
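A quick way to confirm that an environment meets the Transformers minimum listed above is to compare version numbers before loading the model. This is a minimal sketch using only the standard library. The helpers `parse_version`, `meets_minimum`, and `check_transformers` are illustrative names, and the parsing is deliberately simple: it reads leading numeric components only, so a pre-release such as `4.51.0rc1` is treated as older than `4.51.0`.

```python
from importlib import metadata


def parse_version(text):
    """Turn '4.51.0' (or '4.52.0.dev0') into a tuple of its leading integers."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def meets_minimum(installed, minimum="4.51.0"):
    """True when the installed version string is at least the minimum."""
    return parse_version(installed) >= parse_version(minimum)


def check_transformers(minimum="4.51.0"):
    """Report whether the installed transformers package satisfies the minimum."""
    try:
        installed = metadata.version("transformers")
    except metadata.PackageNotFoundError:
        return False
    return meets_minimum(installed, minimum)
```

Calling `check_transformers()` returns `False` both when the package is missing and when it is too old, so a caller can print a single upgrade hint for either case.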
+## Quickstart
+Install Transformers:
+    pip install "transformers>=4.51.0"
+### Basic text inference
+    from transformers import pipeline
+    import torch
+    pipe = pipeline(
+        "image-text-to-text",
+        model="Jetlink/JetLLMLite-4.0",
+        device="cuda",
+        torch_dtype=torch.bfloat16
+    )
+    messages = [
+        {"role": "user", "content": [{"type": "text", "text": "What is the capital of France?"}]}
+    ]
+    output = pipe(messages, max_new_tokens=200)
+    print(output[0]["generated_text"][-1]["content"])
+### Multimodal inference (image + text)
+    from transformers import AutoProcessor, AutoModelForImageTextToText
+    import torch
+    from PIL import Image
+    model_id = "Jetlink/JetLLMLite-4.0"
+    processor = AutoProcessor.from_pretrained(model_id)
+    model = AutoModelForImageTextToText.from_pretrained(
+        model_id,
+        torch_dtype=torch.bfloat16,
+        device_map="auto"
+    )
+    image = Image.open("image.jpg")
+    messages = [
+        {
+            "role": "user",
+            "content": [
+                {"type": "image", "image": image},
+                {"type": "text", "text": "Describe this image in detail."}
+            ]
+        }
+    ]
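+    inputs = processor.apply_chat_template(
+        messages,
+        add_generation_prompt=True,
+        tokenize=True,
+        return_dict=True,  # return a dict of tensors so it can be unpacked into generate()
+        return_tensors="pt"
+    ).to(model.device)
+    output = model.generate(**inputs, max_new_tokens=512)
+    # Decode only the newly generated tokens, not the echoed prompt
+    prompt_len = inputs["input_ids"].shape[-1]
+    print(processor.decode(output[0][prompt_len:], skip_special_tokens=True))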
+### Reasoning / Thinking mode
+Enable thinking mode by adding `<|think|>` to the system prompt:
+    messages = [
+        {"role": "system", "content": "<|think|>"},
+        {"role": "user", "content": [{"type": "text", "text": "Solve: If x² + 5x + 6 = 0, what are the values of x?"}]}
+    ]
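When thinking needs to be switched per request, the message list can be built by a small helper. The sketch below is one possible shape. `build_messages` and its parameters are illustrative names, and placing the token in front of any other system text is an assumption: the card only says the token goes in the system prompt.

```python
THINK_TOKEN = "<|think|>"  # control token named in this card


def build_messages(user_text, thinking=False, system_text=""):
    """Build a chat message list, adding the think token to the system prompt on request."""
    messages = []
    # Assumption: the token precedes any other system instructions
    system_content = (THINK_TOKEN + system_text) if thinking else system_text
    if system_content:
        messages.append({"role": "system", "content": system_content})
    messages.append(
        {"role": "user", "content": [{"type": "text", "text": user_text}]}
    )
    return messages


print(build_messages("Solve: x + 2 = 5", thinking=True))
```

With `thinking=False` and no system text, the helper emits only the user turn, which matches the "omit the token" way of disabling thinking.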
+## Serving Examples
+### vLLM
+    vllm serve Jetlink/JetLLMLite-4.0 \
+      --port 8000 \
+      --tensor-parallel-size 1 \
+      --max-model-len 32768 \
+      --dtype bfloat16
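Once the server is running, it can be queried through vLLM's OpenAI-compatible chat completions endpoint. The following is a minimal standard-library sketch. It assumes the server is reachable at `localhost` on the port used above, and `build_chat_payload` and `chat` are illustrative helper names rather than part of any library.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # assumes the server above runs locally


def build_chat_payload(prompt, max_tokens=200):
    """Request body for a single-turn chat completion."""
    return {
        "model": "Jetlink/JetLLMLite-4.0",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def chat(prompt, base_url=BASE_URL):
    """POST one chat completion request and return the assistant's reply text."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(build_chat_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Requires a running server:
# print(chat("What is the capital of France?"))
```

Any OpenAI-compatible client library can be pointed at the same base URL instead of using `urllib` directly.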
+### vLLM with Tool Use
+    vllm serve Jetlink/JetLLMLite-4.0 \
+      --port 8000 \
+      --tensor-parallel-size 1 \
+      --max-model-len 32768 \
+      --dtype bfloat16 \
+      --enable-auto-tool-choice \
+      --tool-call-parser gemma
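With tool calling enabled on the server, a client sends tool definitions in the OpenAI-style `tools` field and reads structured calls back from the response. The sketch below shows both halves without contacting a server. The `get_weather` tool is hypothetical, and the sample response is hand-written to show the expected shape; it is not real model output.

```python
import json


def build_tool_request(prompt):
    """Chat payload advertising one hypothetical tool, get_weather(city)."""
    return {
        "model": "Jetlink/JetLLMLite-4.0",
        "messages": [{"role": "user", "content": prompt}],
        "tools": [{
            "type": "function",
            "function": {
                "name": "get_weather",  # hypothetical tool
                "description": "Look up the current weather for a city.",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }],
        "tool_choice": "auto",
    }


def extract_tool_calls(response_body):
    """Return (name, arguments) pairs from an OpenAI-style chat completion body."""
    message = response_body["choices"][0]["message"]
    calls = []
    for call in message.get("tool_calls") or []:
        function = call["function"]
        # Arguments arrive as a JSON-encoded string
        calls.append((function["name"], json.loads(function["arguments"])))
    return calls


# Hand-written sample body, not real model output
sample = {"choices": [{"message": {"content": None, "tool_calls": [
    {"id": "call_0", "type": "function",
     "function": {"name": "get_weather", "arguments": "{\"city\": \"Paris\"}"}}
]}}]}
print(extract_tool_calls(sample))
```

Arguments should still be validated before the tool is executed, since the model may produce values that fail the declared schema.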
+### vLLM text-only mode
+    vllm serve Jetlink/JetLLMLite-4.0 \
+      --port 8000 \
+      --tensor-parallel-size 1 \
+      --max-model-len 32768 \
+      --dtype bfloat16 \
+      --language-model-only
+### SGLang
+    python -m sglang.launch_server \
+      --model-path Jetlink/JetLLMLite-4.0 \
+      --port 8000 \
+      --tp-size 1 \
+      --mem-fraction-static 0.85 \
+      --context-length 32768 \
+      --dtype bfloat16
+### SGLang with Tool Use
+    python -m sglang.launch_server \
+      --model-path Jetlink/JetLLMLite-4.0 \
+      --port 8000 \
+      --tp-size 1 \
+      --mem-fraction-static 0.85 \
+      --context-length 32768 \
+      --dtype bfloat16 \
+      --tool-call-parser gemma4
+### llama.cpp
+    llama-server \
+      -m JetLLMLite-4.0.Q4_K_M.gguf \
+      --port 8080 \
+      -ngl 99 \
+      -c 8192
+## Long Context Notes
+JetLLMLite-4.0 natively supports **262,144 tokens (256K)** of context.
+The hybrid attention mechanism (alternating sliding-window and global attention) with Proportional RoPE (p-RoPE) is designed to keep long-context processing efficient. For most practical deployments, setting `--max-model-len` to a lower value (e.g. 32768) is recommended to limit KV cache memory pressure.
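To see why a lower context limit relieves memory pressure, the KV cache can be estimated per sequence. The sketch below is a rough model only: global layers store keys and values for every token, while sliding-window layers store at most one window. The layer counts, head count, and head dimension are hypothetical placeholders chosen to make the arithmetic concrete; they are not the model's real configuration, and shared-KV layers and allocator overhead are ignored.

```python
def kv_cache_gb(tokens, global_layers, local_layers, kv_heads, head_dim,
                window=1024, bytes_per_value=2):
    """Rough KV cache size for one sequence, in decimal GB (1 GB = 1e9 bytes)."""
    per_token_per_layer = 2 * kv_heads * head_dim * bytes_per_value  # keys + values
    global_bytes = global_layers * tokens * per_token_per_layer
    local_bytes = local_layers * min(tokens, window) * per_token_per_layer
    return (global_bytes + local_bytes) / 1e9


# Hypothetical layer and head counts, NOT taken from the model config
HYPOTHETICAL = dict(global_layers=10, local_layers=50, kv_heads=8, head_dim=128)

for tokens in (32_768, 256 * 1024):
    size = kv_cache_gb(tokens, **HYPOTHETICAL)
    print(f"{tokens:>7} tokens: {size:.2f} GB per sequence")
```

Under these assumptions the sliding-window layers contribute a fixed amount, and the global layers account for nearly all of the growth with context length.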
+## Thinking Mode Notes
+JetLLMLite-4.0 supports configurable thinking mode inherited from the Gemma 4 architecture:
+- **Thinking enabled:** add `<|think|>` token to the system prompt
+- **Thinking disabled:** omit `<|think|>` from the system prompt
+When thinking is enabled, the model emits its internal reasoning as `<|channel>thought\n[reasoning]<channel|>` before the final answer. In multi-turn conversations, strip thought content from earlier assistant turns before sending the next user turn.
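Removing thought content from earlier turns can be done with a small filter over the chat history. The sketch below uses the delimiters exactly as written in this section. `strip_thoughts` and `clean_history` are illustrative names, and the code assumes assistant content is stored as a plain string with the delimiters still present, which may not hold if special tokens were skipped during decoding.

```python
import re

# Delimiters copied from the format described in this section
THOUGHT_BLOCK = re.compile(r"<\|channel>thought\n.*?<channel\|>", re.DOTALL)


def strip_thoughts(text):
    """Remove thought blocks from one assistant reply, leaving only the final answer."""
    return THOUGHT_BLOCK.sub("", text).strip()


def clean_history(messages):
    """Return a copy of the history with thought blocks removed from assistant turns."""
    cleaned = []
    for message in messages:
        if message["role"] == "assistant" and isinstance(message["content"], str):
            message = {**message, "content": strip_thoughts(message["content"])}
        cleaned.append(message)
    return cleaned


reply = "<|channel>thought\nFactor into (x + 2)(x + 3).<channel|>x = -2 or x = -3"
print(strip_thoughts(reply))
```

Running `clean_history` on the stored conversation before appending the next user message keeps earlier reasoning out of the prompt without mutating the original list.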
+## Strengths
+- single-GPU deployable without quantization (bfloat16 on an 80GB H100/H200)
+- strong multimodal capabilities (image, video, OCR, document parsing)
+- built-in reasoning / thinking mode
+- native function calling support
+- 256K token context window
+- 140+ language support
+- broad compatibility with inference frameworks
+- dense architecture — predictable and consistent performance
+## Limitations
+- requires at least one high-memory GPU for unquantized deployment
+- long context significantly increases KV cache memory pressure
+- video understanding limited to 60 seconds at 1 fps
+- multimodal usage adds memory overhead compared to text-only
+- deployment characteristics depend on framework and quantization settings
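For the video limit above, a client can decide in advance which frames to extract. This is a minimal sketch that picks timestamps at 1 frame per second and caps them at the first 60 seconds. `frame_timestamps` is an illustrative helper, and truncating longer clips (rather than sampling them more sparsely) is an assumption about how to stay within the limit, not documented behaviour.

```python
def frame_timestamps(duration_s, fps=1.0, max_seconds=60.0):
    """Timestamps in seconds to sample, capped at the first `max_seconds` of the clip."""
    usable = min(duration_s, max_seconds)
    count = int(usable * fps)
    return [i / fps for i in range(count)]


short_clip = frame_timestamps(2.5)
long_clip = frame_timestamps(90.0)

print(short_clip)
print(len(long_clip), long_clip[-1])
```

The timestamps can then be passed to whichever video decoding tool is in use to extract the corresponding frames.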
+## Out-of-Scope / Cautionary Use
+As with other frontier-scale multimodal models, outputs should be reviewed before use in:
+- medical decision-making
+- legal advice
+- safety-critical automation
+- high-stakes financial decisions
+- fully autonomous customer actions without guardrails
+Human review, policy controls, and tool-level validation are strongly recommended.
+## License
+This repository follows the same license as the upstream release.
+- **License:** Apache-2.0
+- See the upstream Google Gemma repository and included license text for the governing terms.
+If you redistribute, fine-tune, quantize, or otherwise modify this model, make sure your usage remains compliant with the upstream license and attribution requirements.
+## Attribution
+Original model and research release by **Google DeepMind**.
+Upstream model:
+- `google/gemma-4-31b-it`
+This repository is an organization-managed copy and is **not the original upstream source**.
+## Citation
+Please cite the original Gemma 4 release when using this model in research, evaluation, or production documentation.
+```bibtex
+@misc{gemma4,
+  title        = {Gemma 4 Technical Report},
+  author       = {Google DeepMind},
+  year         = {2026},
+  publisher    = {Google DeepMind},
+  howpublished = {\url{https://huggingface.co/google/gemma-4-31b-it}}
+}
+```
+---
+# JetLLMLite-4.0 (Türkçe)
+**JetLLMLite-4.0**, **Jetlink** tarafından yayınlanan multimodal bir instruction-tuned modeldir. Base model olarak [google/gemma-4-31b-it](https://huggingface.co/google/gemma-4-31b-it) kullanılmıştır.
+Bu depo; modeli kendi namespace'i altında yönetmek, erişimi kontrol etmek ve dağıtımı kolaylaştırmak isteyen ekipler için hazırlanmıştır.
+## Model Özeti
+JetLLMLite-4.0, aşağıdaki özelliklere sahip 31B parametreli dense bir multimodal modeldir:
+- **31B toplam parametre (dense mimari)**
+- **Instruction-tuned (IT) varyant**
+- **262.144 token (256K) bağlam uzunluğu**
+- **Multimodal: metin + görüntü girişi, metin çıkışı**
+- **Video anlama desteği (saniyede 1 kare, 60 saniyeye kadar)**
+- **Yerleşik reasoning / thinking modu**
+- **Native function calling desteği**
+- **140+ dil desteği**
+- **Transformers**, **vLLM**, **SGLang**, **llama.cpp**, **MLX**, **Ollama** ile uyumluluk
+## Kullanım Amacı
+Bu model aşağıdaki gelişmiş kullanım senaryoları için uygundur:
+- multimodal sohbet asistanları
+- uzun bağlamlı doküman ve PDF anlama
+- adım adım akıl yürütme ve problem çözme
+- function calling ile agentic workflow yapıları
+- kodlama asistanları ve kod üretimi
+- görüntü, grafik ve OCR görevleri
+- çok dilli kurumsal asistanlar
+- araştırma ve benchmark çalışmaları
+## Model Detayları
+### Mimari
+- **Model tipi:** Vision Encoder içeren Dense Causal Language Model
+- **Eğitim aşaması:** Pre-training ve Post-training (Instruction-tuned)
+- **Toplam parametre:** 31B
+- **Mimari stili:** Dense (MoE değil)
+- **Dikkat mekanizması:** Hibrit — local sliding-window (1024 token) ve global full-context attention
+- **RoPE:** Çift konfigürasyon — sliding katmanlar için standart RoPE, global katmanlar için Proportional RoPE (p-RoPE)
+- **Per-Layer Embeddings (PLE):** Evet
+- **Paylaşılan KV Cache:** Evet
+- **Yerel bağlam uzunluğu:** 262.144 token (256K)
+- **Vision encoder:** Değişken en-boy oranı; yapılandırılabilir token bütçeleri (70 / 140 / 280 / 560 / 1120 token)
+- **Thinking modu:** System prompt'a `<|think|>` token eklenerek etkinleştirilir
+### Ekosistem Uyumluluğu
+- Hugging Face Transformers
+- vLLM
+- SGLang
+- llama.cpp
+- MLX
+- Ollama
+- mistral.rs
+- LM Studio
+## Donanım Gereksinimleri
+> JetLLMLite-4.0, tam hassasiyetle (bfloat16) **tek GPU'da çalışabilen** bir modeldir. Bu özelliği, büyük ölçekli MoE veya 100B+ modellerine kıyasla çok daha erişilebilir kılar.
+### Referans Donanım
+Tahmini GPU bellek gereksinimleri (bfloat16 / tam hassasiyet):
+- **Quantize edilmemiş (bfloat16):** tek bir 80GB NVIDIA H100/H200 GPU'ya sığar
+- **4-bit quantize:** 24GB+ VRAM'li consumer GPU'larda çalışır (ör. RTX 3090, RTX 4090)
+- **Çoklu GPU:** daha yüksek throughput için vLLM ve SGLang üzerinden tensor parallelism desteklenir
+> Not: gereksinimler bağlam uzunluğu, batch size ve KV cache ayarlarına göre değişir. Yukarıdakiler pratik referans noktaları olup evrensel minimum değildir.
+### Pratik Rehber
+#### Tek GPU dağıtımı
+Büyük ölçekli MoE modellerinin aksine JetLLMLite-4.0, tam hassasiyetle **tek bir 80GB datacenter GPU'dan** servis edilebilir. Bu özellik, single-node veya maliyet odaklı dağıtım senaryoları için mükemmel bir seçenek sunar.
+Consumer GPU'lar için quantize varyantlar (GGUF, GPTQ, AWQ) minimal kalite kaybıyla bellek gereksinimini önemli ölçüde azaltır.
+#### Sadece metin kullanımı
+vLLM'de `--language-model-only` bayrağını kullanarak vision encoder profiling'i atlayabilir ve KV cache için ek bellek açabilirsiniz.
+### Öneri
+Çoğu production ekip için en mantıklı yaklaşım:
+1. serving için **vLLM** veya **SGLang** ile başlamak
+2. quantize edilmemiş bfloat16 dağıtım için **tek H100/H200** kullanmak
+3. consumer GPU veya maliyet optimize edilmiş dağıtımlar için **4-bit quantization** uygulamak
+4. vision gerekmiyorsa `--language-model-only` ile devre dışı bırakmak
+## Yazılım Gereksinimleri
+Önerilen ortam:
+- Python 3.10+
+- Linux
+- CUDA destekli GPU altyapısı
+- Şu runtime'lardan biri:
+  - Transformers (`>= 4.51.0` — Gemma 4 için zorunlu)
+  - vLLM
+  - SGLang
+  - llama.cpp
+Yaygın bağımlılıklar:
+- `torch`
+- `transformers >= 4.51.0`
+- `torchvision`
+- `pillow`
+- `accelerate`
+## Hızlı Başlangıç
+Transformers kurulumu:
+    pip install "transformers>=4.51.0"
+### Temel metin çıkarımı
+    from transformers import pipeline
+    import torch
+    pipe = pipeline(
+        "image-text-to-text",
+        model="Jetlink/JetLLMLite-4.0",
+        device="cuda",
+        torch_dtype=torch.bfloat16
+    )
+    messages = [
+        {"role": "user", "content": [{"type": "text", "text": "Fransa'nın başkenti neresidir?"}]}
+    ]
+    output = pipe(messages, max_new_tokens=200)
+    print(output[0]["generated_text"][-1]["content"])
+### Multimodal çıkarım (görüntü + metin)
+    from transformers import AutoProcessor, AutoModelForImageTextToText
+    import torch
+    from PIL import Image
+    model_id = "Jetlink/JetLLMLite-4.0"
+    processor = AutoProcessor.from_pretrained(model_id)
+    model = AutoModelForImageTextToText.from_pretrained(
+        model_id,
+        torch_dtype=torch.bfloat16,
+        device_map="auto"
+    )
+    image = Image.open("goruntu.jpg")
+    messages = [
+        {
+            "role": "user",
+            "content": [
+                {"type": "image", "image": image},
+                {"type": "text", "text": "Bu görseli detaylı olarak açıkla."}
+            ]
+        }
+    ]
+    inputs = processor.apply_chat_template(
+        messages,
+        add_generation_prompt=True,
+        tokenize=True,
+        return_dict=True,
+        return_tensors="pt"
+    ).to(model.device)
+    output = model.generate(**inputs, max_new_tokens=512)
+    prompt_len = inputs["input_ids"].shape[-1]
+    print(processor.decode(output[0][prompt_len:], skip_special_tokens=True))
+### Reasoning / Thinking modu
+Thinking modunu etkinleştirmek için system prompt'a `<|think|>` ekleyin:
+    messages = [
+        {"role": "system", "content": "<|think|>"},
+        {"role": "user", "content": [{"type": "text", "text": "x² + 5x + 6 = 0 denkleminin kökleri nelerdir?"}]}
+    ]
+## Serving Örnekleri
+### vLLM
+    vllm serve Jetlink/JetLLMLite-4.0 \
+      --port 8000 \
+      --tensor-parallel-size 1 \
+      --max-model-len 32768 \
+      --dtype bfloat16
+### vLLM Tool Use ile
+    vllm serve Jetlink/JetLLMLite-4.0 \
+      --port 8000 \
+      --tensor-parallel-size 1 \
+      --max-model-len 32768 \
+      --dtype bfloat16 \
+      --enable-auto-tool-choice \
+      --tool-call-parser gemma
+### vLLM sadece metin modu
+    vllm serve Jetlink/JetLLMLite-4.0 \
+      --port 8000 \
+      --tensor-parallel-size 1 \
+      --max-model-len 32768 \
+      --dtype bfloat16 \
+      --language-model-only
+### SGLang
+    python -m sglang.launch_server \
+      --model-path Jetlink/JetLLMLite-4.0 \
+      --port 8000 \
+      --tp-size 1 \
+      --mem-fraction-static 0.85 \
+      --context-length 32768 \
+      --dtype bfloat16
+### SGLang Tool Use ile
+    python -m sglang.launch_server \
+      --model-path Jetlink/JetLLMLite-4.0 \
+      --port 8000 \
+      --tp-size 1 \
+      --mem-fraction-static 0.85 \
+      --context-length 32768 \
+      --dtype bfloat16 \
+      --tool-call-parser gemma4
+### llama.cpp
+    llama-server \
+      -m JetLLMLite-4.0.Q4_K_M.gguf \
+      --port 8080 \
+      -ngl 99 \
+      -c 8192
+## Uzun Bağlam Notları
+JetLLMLite-4.0 yerel olarak **262.144 token (256K)** destekler.
+Hibrit dikkat mekanizması (alternatif sliding-window ve global attention) ve Proportional RoPE (p-RoPE) sayesinde verimli uzun bağlam işleme sağlanır. Çoğu production dağıtımında KV cache bellek baskısını yönetmek için `--max-model-len` değerini daha düşük tutmak (ör. 32768) önerilir.
+## Thinking Modu Notları
+JetLLMLite-4.0, Gemma 4 mimarisinden gelen yapılandırılabilir thinking modunu destekler:
+- **Thinking etkin:** system prompt'a `<|think|>` token ekleyin
+- **Thinking devre dışı:** `<|think|>` tokenını system prompt'tan çıkarın
+Thinking etkinleştirildiğinde model, nihai yanıttan önce `<|channel>thought\n[akıl yürütme]<channel|>` yapısıyla iç mantığını çıktılar. Çok turlu konuşmalarda önceki turların thought içeriği bir sonraki kullanıcı turuna dahil edilmemelidir.
+## Güçlü Yönler
+- tam hassasiyetle tek GPU'da dağıtılabilir (80GB H100/H200)
+- güçlü multimodal yetenekler (görüntü, video, OCR, doküman ayrıştırma)
+- yerleşik reasoning / thinking modu
+- native function calling desteği
+- 256K token bağlam penceresi
+- 140+ dil desteği
+- inference framework'leriyle geniş uyumluluk
+- dense mimari — öngörülebilir ve tutarlı performans
+## Sınırlamalar
+- quantize edilmemiş dağıtım için en az bir yüksek belleğe sahip GPU gerektirir
+- uzun bağlam KV cache bellek baskısını ciddi ölçüde artırır
+- video anlama, saniyede 1 kare hızında 60 saniyeyle sınırlıdır
+- multimodal kullanım metin çıkarımına kıyasla ek bellek maliyeti getirir
+- deployment karakteristiği framework ve quantization ayarlarına göre değişir
+## Kapsam Dışı / Dikkat Gerektiren Kullanımlar
+Diğer frontier-scale multimodal modellerde olduğu gibi, model çıktıları şu alanlarda insan denetimi olmadan kullanılmamalıdır:
+- tıbbi karar verme
+- hukuki tavsiye
+- güvenlik kritik otomasyon
+- yüksek riskli finansal kararlar
+- korumasız tam otonom müşteri aksiyonları
+İnsan incelemesi, politika kontrolleri ve tool seviyesinde doğrulama güçlü şekilde önerilir.
+## Lisans
+Bu depo, upstream sürümle aynı lisansı takip eder.
+- **Lisans:** Apache-2.0
+- Geçerli şartlar için upstream Google Gemma deposu ve lisans metni incelenmelidir.
+Modeli yeniden dağıtıyor, fine-tune ediyor, quantize ediyor veya başka şekilde değiştiriyorsan; kullanımının upstream lisans ve attribution gereklilikleriyle uyumlu olduğundan emin olmalısın.
+## Atıf
+Orijinal model ve araştırma yayını **Google DeepMind** ekibine aittir.
+Upstream model:
+- `google/gemma-4-31b-it`
+Bu depo, kurum tarafından yönetilen bir kopyadır ve **orijinal upstream kaynak değildir**.
+## Atıf / Citation
+Bu modeli araştırma, değerlendirme veya production dokümantasyonunda kullanıyorsan, lütfen orijinal Gemma 4 sürümüne atıf yap.
+```bibtex
+@misc{gemma4,
+  title        = {Gemma 4 Technical Report},
+  author       = {Google DeepMind},
+  year         = {2026},
+  publisher    = {Google DeepMind},
+  howpublished = {\url{https://huggingface.co/google/gemma-4-31b-it}}
+}
+```