MIST-1-70B

MIST-1-70B is the mid-size model in the MIST model family by olaverse. It was built by merging four of the strongest Llama 3.1 70B models with DARE+TIES, and aims for structured, detailed, production-ready output.

MIST Model Family

Model         Params   Speed        Status
MIST-1-8B     8B       ~63 tok/s    ✅ Available
MIST-1-70B    70B      ~23 tok/s    ✅ Available
MIST-1-140B   140B     ~8 tok/s     ✅ Available

Key Strengths

  • 🧠 Strong Reasoning: DeepSeek R1 distillation at 70B scale
  • 🤝 Highly Helpful: built on Nemotron, ranked #1 on helpfulness benchmarks
  • 💻 Coding: clean, documented, production-ready code
  • 📐 Math: step-by-step, structured problem solving
  • 🌍 Multilingual: supports 8+ languages
  • 📚 Long Context: 128K token context window
  • 🔓 Unrestricted: follows instructions without excessive refusals

Merge Method

MIST-1-70B uses DARE+TIES:

  • DARE randomly drops redundant delta weights (each model's difference from the base model) and rescales the rest
  • TIES resolves the remaining weight conflicts using sign consensus
  • Result: a single model that combines capabilities from all 4 parents
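
The two steps above can be illustrated with a toy sketch on flat lists of floats. This is not the actual merge recipe: the card does not state the drop rate, the per-model weights, or the tooling used, so `drop_rate`, the function names, and the example values below are all hypothetical. A real merge applies the same idea per tensor across the full checkpoints.

```python
import random

def dare(delta, drop_rate, rng):
    """DARE: randomly zero a fraction of delta weights and rescale the survivors."""
    if not 0.0 <= drop_rate < 1.0:
        raise ValueError("drop_rate must be in [0, 1)")
    scale = 1.0 / (1.0 - drop_rate)  # keeps the expected value of each delta unchanged
    return [d * scale if rng.random() >= drop_rate else 0.0 for d in delta]

def ties_sign_merge(deltas):
    """TIES sign consensus: per weight, average only the deltas that agree
    with the sign of the summed deltas."""
    merged = []
    for column in zip(*deltas):
        total = sum(column)
        if total == 0:
            merged.append(0.0)  # no dominant direction, keep the base weight
            continue
        agreeing = [d for d in column if d != 0 and (d > 0) == (total > 0)]
        merged.append(sum(agreeing) / len(agreeing))
    return merged

def dare_ties_merge(base, finetuned, drop_rate=0.5, seed=0):
    """Merge several fine-tuned weight vectors into one (toy version)."""
    rng = random.Random(seed)
    deltas = [
        dare([w - b for w, b in zip(weights, base)], drop_rate, rng)
        for weights in finetuned
    ]
    return [b + m for b, m in zip(base, ties_sign_merge(deltas))]

if __name__ == "__main__":
    base = [0.0, 0.0, 0.0]
    parents = [[1.0, -1.0, 2.0], [3.0, 2.0, 0.0], [-1.0, 4.0, 0.0]]
    # drop_rate=0.0 disables DARE so only the sign consensus step is visible
    print(dare_ties_merge(base, parents, drop_rate=0.0))
```

With `drop_rate=0.0`, the second weight shows the consensus step: the deltas are -1, 2 and 4, the positive direction wins, and only 2 and 4 are averaged.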

Benchmark Results

Task          Time     Quality
Reasoning     10.5s    ✅ Correct step-by-step
Coding        11.3s    ✅ Clean with type hints
Math          11.3s    ✅ Structured with verification
General       11.3s    ✅ Accurate and detailed
Instruction   8.1s     ✅ Precise and formatted

Average throughput: ~23 tok/s

How to Use

bfloat16 — Full Precision (140GB VRAM)

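from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the weights in their stored dtype and spread them across the available GPUs
model = AutoModelForCausalLM.from_pretrained(
    "olaverse/MIST-1-70B",
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("olaverse/MIST-1-70B")

# Build the prompt with the built-in chat template
messages = [{"role": "user", "content": "Your question here"}]
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
# The Llama 3.1 template already emits <|begin_of_text|>, so do not add a second BOS
inputs = tokenizer(text, return_tensors="pt", add_special_tokens=False).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
# Decode only the newly generated tokens, not the echoed prompt
new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))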

4-bit Quantized (40GB VRAM)

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

# NF4 4-bit weights (requires the bitsandbytes package); computation runs in bfloat16
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4",
)
model = AutoModelForCausalLM.from_pretrained(
    "olaverse/MIST-1-70B",
    quantization_config=quantization_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("olaverse/MIST-1-70B")
# Prompting and generation then work the same way as in the bfloat16 example

Hardware Requirements

Precision     VRAM                          Model size
bfloat16      140GB (1x H200 or 2x H100)    132GB
4-bit (NF4)   40GB (1x A100/H100)           ~35GB
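
As a rough cross-check of these figures, the memory needed for the weights alone can be estimated as parameter count times bits per parameter. The sketch below assumes a nominal 70B parameters and ignores the KV cache, activations, and quantization metadata, so actual VRAM use will be somewhat higher. The function name is illustrative only.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate memory for the weights alone, in decimal gigabytes."""
    return num_params * bits_per_param / 8 / 1e9

NUM_PARAMS = 70e9  # assumption: nominal parameter count for a 70B model

if __name__ == "__main__":
    print(f"bfloat16: {weight_memory_gb(NUM_PARAMS, 16):.0f} GB")
    print(f"4-bit:    {weight_memory_gb(NUM_PARAMS, 4):.0f} GB")
```

The estimate gives about 140 GB for bfloat16 and about 35 GB for 4-bit, which is in line with the table. The headroom between 35 GB and the 40GB listed for 4-bit is plausibly what the runtime overhead consumes.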

Important Notes on Chat Template

This model was merged from both Llama 3.1 and ChatML-trained parents (Hermes-3, Nemotron, DeepSeek-R1-Distill). The merged tokenizer uses Llama 3.1 format, but the model can occasionally output <|im_end|> as plain text due to its mixed training heritage.

✅ Correct Usage

Always use the built-in chat template — never hardcode prompts manually:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("olaverse/MIST-1-70B")
model = AutoModelForCausalLM.from_pretrained(
    "olaverse/MIST-1-70B",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

messages = [
    {"role": "system", "content": "You are MIST, a helpful AI assistant."},
    {"role": "user", "content": "Your question here"}
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True
).to(model.device)

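# Sampling values follow the Recommended Generation Settings section
outputs = model.generate(
    **inputs,
    max_new_tokens=1024,
    do_sample=True,
    temperature=0.8,
    top_p=0.99,
    min_p=0.05,
    repetition_penalty=1.0,
    eos_token_id=[128009, 128001, 128008],  # <|eot_id|>, <|end_of_text|>, <|eom_id|>
    pad_token_id=128001,
)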

response = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[-1]:],
    skip_special_tokens=True
)
print(response)

⚠️ Common Mistake

Do NOT hardcode ChatML format like this:

# WRONG — this model is Llama 3.1, not ChatML
prompt = f"<|im_start|>user\n{question}<|im_end|>\n<|im_start|>assistant\n"

Forcing ChatML format on a Llama 3.1 tokenizer confuses the model and causes <|im_end|> to leak as plain text in responses.

Stop Tokens

Token             ID       Purpose
<|eot_id|>        128009   Primary stop (Llama 3.1 native)
<|end_of_text|>   128001   Secondary stop
<|eom_id|>        128008   End of message

Note: <|im_end|> is NOT in this model's vocabulary. If you see it in output, your prompt formatting is wrong — use apply_chat_template as shown above.
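
If a stray marker still shows up in a response, a small post-processing step can cut the text at that point. This is only a defensive fallback and does not replace correct prompt formatting. The helper name and the marker list are hypothetical and not part of the model or of transformers.

```python
LEAKED_MARKERS = ("<|im_end|>", "<|im_start|>")  # ChatML markers that may appear as plain text

def truncate_at_leaked_marker(text: str, markers=LEAKED_MARKERS) -> str:
    """Cut the response at the first leaked marker and strip trailing whitespace."""
    cut = len(text)
    for marker in markers:
        index = text.find(marker)
        if index != -1:
            cut = min(cut, index)
    return text[:cut].rstrip()

if __name__ == "__main__":
    raw = "Paris is the capital of France.<|im_end|>\nuser"
    print(truncate_at_leaked_marker(raw))
```

Apply it to the decoded `response` string, after `tokenizer.decode`.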

Recommended Generation Settings

Parameter            Value
temperature          0.8
top_p                0.99
min_p                0.05
repetition_penalty   1.0
max_new_tokens       1024
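
One way to keep these values in a single place is a plain dictionary that is unpacked into `model.generate`. This is a sketch: the constant and helper names are illustrative, and `do_sample=True` is added on the assumption that sampling must be enabled for temperature, top_p and min_p to take effect.

```python
RECOMMENDED_GENERATION_KWARGS = {
    "do_sample": True,          # assumption: needed for the sampling parameters to apply
    "temperature": 0.8,
    "top_p": 0.99,
    "min_p": 0.05,
    "repetition_penalty": 1.0,  # 1.0 means no repetition penalty
    "max_new_tokens": 1024,
}

def generation_kwargs(**overrides):
    """Return a copy of the recommended settings with optional per-call overrides."""
    return {**RECOMMENDED_GENERATION_KWARGS, **overrides}

# Usage, given `model` and `inputs` prepared as in the Correct Usage section:
# outputs = model.generate(**inputs, **generation_kwargs(max_new_tokens=256))
```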

License

Llama 3.1 Community License
