Llama 3 8B - Thinking V2

This model is a specialized LoRA fine-tune of Meta-Llama-3-8B-Instruct designed to enforce Chain-of-Thought (CoT) reasoning before providing a final answer. By utilizing specialized <thinking> tags, the model pauses to break down logic puzzles, coding problems, and tricky phrasing (like riddles and state-tracking) before generating its response.

🧠 Model Details

Base Model: Meta Llama 3 8B Instruct
Fine-Tuning Method: LoRA (Low-Rank Adaptation) via Unsloth
Dataset: 475 hand-curated logic, math, and reasoning puzzles.
Epochs: 3
Primary Goal: To force "System 2" thinking, reducing hallucinations and impulsive errors on complex prompts.

🚀 How to Use

CRITICAL: To trigger the reasoning engine, your prompt must be formatted to anticipate the <thinking> tag. If you do not prompt the model correctly, it may bypass the reasoning phase and act like a standard Llama 3 model.

Inference Code (Python / Transformers)

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "eelixir/llama3-8b-thinking-v2"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

test_question = "A farmer has 17 sheep. All but 9 run away. How many sheep are left?"

# Note the intentional inclusion of <thinking>\n at the end to "prime" the reasoning!
prompt = f"<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n{test_question}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n<thinking>\n"

inputs = tokenizer([prompt], return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=1024, use_cache=True)

print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])

Downloads last month: -

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for eelixir/llama3-8b-thinking-v2

Base model

unsloth/llama-3-8b-Instruct-bnb-4bit

Adapter

(162)

this model