aciklab/kubernetes-ai-4bit

@@ -88,80 +88,6 @@ response = tokenizer.decode(outputs[0], skip_special_tokens=True)
 print(response)
 ```
-### Advanced Usage with Pipeline
-```python
-from transformers import pipeline
-# Create text generation pipeline
-pipe = pipeline(
-    "text-generation",
-    model="aciklab/kubernetes-ai-4bit",
-    device_map="auto",
-    trust_remote_code=True
-)
-# Generate response
-messages = [
-    {"role": "system", "content": "Sen Kubernetes konusunda uzmanlaşmış bir yapay zeka asistanısın."},
-    {"role": "user", "content": "Pod ve Deployment arasındaki fark nedir?"}
-]
-response = pipe(
-    messages,
-    max_new_tokens=512,
-    temperature=1.0,
-    top_p=0.95,
-    do_sample=True
-)
-print(response[0]["generated_text"][-1]["content"])
-```
-### Streaming Responses
-```python
-from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer
-from threading import Thread
-model_name = "aciklab/kubernetes-ai-4bit"
-tokenizer = AutoTokenizer.from_pretrained(model_name)
-model = AutoModelForCausalLM.from_pretrained(
-    model_name,
-    device_map="auto",
-    trust_remote_code=True
-)
-# Prepare input
-prompt = "Kubernetes Service türlerini açıkla"
-messages = [
-    {"role": "system", "content": "Sen Kubernetes konusunda uzmanlaşmış bir yapay zeka asistanısın."},
-    {"role": "user", "content": prompt}
-]
-input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
-inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
-# Setup streamer
-streamer = TextIteratorStreamer(tokenizer, skip_special_tokens=True)
-generation_kwargs = dict(
-    **inputs,
-    max_new_tokens=512,
-    temperature=1.0,
-    streamer=streamer
-)
-# Generate in separate thread
-thread = Thread(target=model.generate, kwargs=generation_kwargs)
-thread.start()
-# Stream output
-for text in streamer:
-    print(text, end="", flush=True)
-thread.join()
-```
 ## Training Details
 This model is based on the [aciklab/kubernetes-ai](https://huggingface.co/aciklab/kubernetes-ai) LoRA adapters:
@@ -205,88 +131,20 @@ This model uses 4bit quantization with BitsAndBytes for optimal memory efficienc
 ## Hardware Requirements
 ### Minimum (GPU)
-- **GPU:** 8GB VRAM (e.g., RTX 3060, RTX 4060)
+- **GPU:** 8GB VRAM
 - **RAM:** 8GB system memory
 - **Storage:** 10GB free space
-- **Recommended:** CUDA-capable NVIDIA GPU
-### Minimum (CPU Only)
-- **CPU:** 8+ cores
-- **RAM:** 16GB system memory
-- **Storage:** 10GB free space
-- **Note:** CPU inference will be slower than GPU
 ### Recommended
-- **GPU:** 12GB+ VRAM (e.g., RTX 3080, RTX 4070, RTX 5070)
+- **GPU:** 12GB+ VRAM
 - **RAM:** 16GB system memory
 - **Storage:** 15GB free space
-- **CUDA:** 11.8 or higher
-## Performance Benchmarks
-| Hardware | Tokens/Second | Latency (512 tokens) |
-|----------|---------------|----------------------|
-| RTX 5070 12GB | ~45-55 | ~10-12 seconds |
-| RTX 4060 8GB | ~35-45 | ~12-15 seconds |
-| CPU (16 cores) | ~5-10 | ~60-100 seconds |
-*Benchmarks are approximate and may vary based on system configuration*
-## Inference Optimization Tips
-### For Maximum Speed
-```python
-# Use Flash Attention 2 (if available)
-model = AutoModelForCausalLM.from_pretrained(
-    model_name,
-    device_map="auto",
-    trust_remote_code=True,
-    attn_implementation="flash_attention_2"  # Requires flash-attn package
-)
-```
-### For Lower Memory Usage
-```python
-# Enable 8bit quantization instead of 4bit if needed
-from transformers import BitsAndBytesConfig
-quantization_config = BitsAndBytesConfig(
-    load_in_4bit=True,
-    bnb_4bit_compute_dtype=torch.float16,
-    bnb_4bit_use_double_quant=True,
-    bnb_4bit_quant_type="nf4"
-)
-model = AutoModelForCausalLM.from_pretrained(
-    model_name,
-    quantization_config=quantization_config,
-    device_map="auto"
-)
-```
-## Example Queries
-```python
-# Example 1: Creating a Deployment
-"Kubernetes'te 3 replikaya sahip bir nginx deployment nasıl oluştururum?"
-# Example 2: Service Explanation
-"ClusterIP, NodePort ve LoadBalancer service türleri arasındaki farklar nelerdir?"
-# Example 3: Troubleshooting
-"Pod'um CrashLoopBackOff durumunda, nasıl debug edebilirim?"
-# Example 4: Configuration
-"ConfigMap ve Secret arasındaki fark nedir ve ne zaman hangisini kullanmalıyım?"
-# Example 5: Best Practices
-"Production ortamında Kubernetes deployment için en iyi pratikler nelerdir?"
-```
 ## Limitations
-- **Language:** Optimized primarily for Turkish; English queries may work but with reduced quality
-- **Context Window:** 1024 tokens maximum sequence length
+- **Language:** Optimized primarily for Turkish and English.
 - **Domain:** Specialized for Kubernetes; may not perform well on general topics
 - **Quantization:** 4bit quantization may occasionally affect response quality on complex queries
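After this change, the model card still makes two memory-related claims. The model uses 4bit quantization with BitsAndBytes, and the GPU tiers are 8GB VRAM (minimum) and 12GB+ VRAM (recommended). The sketch below works through the arithmetic behind both. It rests on several assumptions that do not come from the model card:

- The parameter count is a hypothetical input, because the README does not state the base model's size.
- The weight estimate covers the quantized weights only. It excludes activations, the KV cache, the CUDA context, BitsAndBytes quantization constants, and any layers kept in 16-bit.
- "GB" is read as GiB with a 0.95 tolerance, because the total memory a driver reports is often slightly below a card's nominal size.

The helper names `weight_footprint_gib`, `vram_tier` and `detect_vram_gib` are illustrative.

```python
"""Back-of-the-envelope memory checks for a 4bit-quantized model.

Illustrative sketch only: the helper names and the tolerance are assumptions,
not part of the aciklab/kubernetes-ai-4bit model card.
"""
from typing import Optional

GIB = 1024 ** 3

# VRAM tiers from the "Hardware Requirements" section, read as GiB (assumption).
MIN_VRAM_GIB = 8.0
RECOMMENDED_VRAM_GIB = 12.0
# Reported capacity is usually a little under the nominal size (assumed tolerance).
TOLERANCE = 0.95


def weight_footprint_gib(num_params: int, bits_per_weight: int = 4) -> float:
    """Lower-bound size of the quantized weights alone, in GiB.

    Ignores activations, the KV cache, the CUDA context, per-block
    quantization constants and any layers kept in 16-bit precision.
    """
    if num_params <= 0 or bits_per_weight <= 0:
        raise ValueError("num_params and bits_per_weight must be positive")
    # bits -> bytes -> GiB
    return num_params * bits_per_weight / 8 / GIB


def vram_tier(total_vram_gib: float) -> str:
    """Map a GPU's total memory onto the README's two VRAM tiers."""
    if total_vram_gib >= RECOMMENDED_VRAM_GIB * TOLERANCE:
        return "recommended"
    if total_vram_gib >= MIN_VRAM_GIB * TOLERANCE:
        return "minimum"
    return "below minimum"


def detect_vram_gib(device: int = 0) -> Optional[float]:
    """Total memory of a CUDA device in GiB, or None if torch/CUDA is unavailable."""
    try:
        import torch  # optional dependency, only needed for detection
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_properties(device).total_memory / GIB


if __name__ == "__main__":
    # Hypothetical 7B-parameter model: the README does not give the real size.
    print(f"4-bit weights: ~{weight_footprint_gib(7_000_000_000):.2f} GiB")
    vram = detect_vram_gib()
    if vram is None:
        print("No CUDA device detected")
    else:
        print(f"GPU memory: {vram:.2f} GiB -> {vram_tier(vram)}")
```

For the hypothetical 7B-parameter case, 7e9 weights at 4 bits is 3.5e9 bytes, or about 3.26 GiB. Treat this as a lower bound on real usage rather than a prediction.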