Instructions for using tocchitocchi/Qwen3-Swallow-32B-RL-v0.2-MLX-8bit with libraries and local apps.

Libraries

How to use tocchitocchi/Qwen3-Swallow-32B-RL-v0.2-MLX-8bit with MLX:

# Make sure mlx-lm is installed
# pip install --upgrade mlx-lm

# Generate text with mlx-lm
from mlx_lm import load, generate

model, tokenizer = load("tocchitocchi/Qwen3-Swallow-32B-RL-v0.2-MLX-8bit")

prompt = "Write a story about Einstein"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True
)

text = generate(model, tokenizer, prompt=prompt, verbose=True)

Local Apps

Pi

How to use tocchitocchi/Qwen3-Swallow-32B-RL-v0.2-MLX-8bit with Pi:

Start the MLX server

# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "tocchitocchi/Qwen3-Swallow-32B-RL-v0.2-MLX-8bit"

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "mlx-lm": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "tocchitocchi/Qwen3-Swallow-32B-RL-v0.2-MLX-8bit"
        }
      ]
    }
  }
}
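If ~/.pi/agent/models.json already exists, pasting the block above over it would drop any providers already defined there. The following is a minimal sketch of merging the entry instead. The helper names merge_provider and update_models_file are hypothetical, and it assumes Pi accepts several providers side by side under the "providers" key, which the snippet above suggests but does not confirm.

```python
import json
from pathlib import Path

# Provider entry copied from the configuration shown above.
MLX_PROVIDER = {
    "baseUrl": "http://localhost:8080/v1",
    "api": "openai-completions",
    "apiKey": "none",
    "models": [{"id": "tocchitocchi/Qwen3-Swallow-32B-RL-v0.2-MLX-8bit"}],
}


def merge_provider(config: dict, name: str, provider: dict) -> dict:
    """Return a copy of config with provider stored under providers[name]."""
    merged = dict(config)
    providers = dict(merged.get("providers", {}))
    providers[name] = provider  # replaces an existing entry with the same name
    merged["providers"] = providers
    return merged


def update_models_file(path: Path) -> None:
    """Hypothetical helper: read, merge, and rewrite the models file."""
    existing = json.loads(path.read_text()) if path.exists() else {}
    path.parent.mkdir(parents=True, exist_ok=True)
    updated = merge_provider(existing, "mlx-lm", MLX_PROVIDER)
    path.write_text(json.dumps(updated, indent=2) + "\n")


# Uncomment to apply the change to the real file:
# update_models_file(Path.home() / ".pi" / "agent" / "models.json")
```

The merge is shallow on purpose: it only replaces the "mlx-lm" entry and leaves every other provider untouched.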

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use tocchitocchi/Qwen3-Swallow-32B-RL-v0.2-MLX-8bit with Hermes Agent:

Start the MLX server

# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "tocchitocchi/Qwen3-Swallow-32B-RL-v0.2-MLX-8bit"

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default tocchitocchi/Qwen3-Swallow-32B-RL-v0.2-MLX-8bit

Run Hermes

hermes

MLX LM

How to use tocchitocchi/Qwen3-Swallow-32B-RL-v0.2-MLX-8bit with MLX LM:

Generate or start a chat session

# Install MLX LM
uv tool install mlx-lm
# Interactive chat REPL
mlx_lm.chat --model "tocchitocchi/Qwen3-Swallow-32B-RL-v0.2-MLX-8bit"

Run an OpenAI-compatible server

# Install MLX LM
uv tool install mlx-lm
# Start the server
mlx_lm.server --model "tocchitocchi/Qwen3-Swallow-32B-RL-v0.2-MLX-8bit"
# Call the OpenAI-compatible server with curl from a second terminal
# (mlx_lm.server listens on port 8080 unless --port is given)
curl -X POST "http://localhost:8080/v1/chat/completions" \
   -H "Content-Type: application/json" \
   --data '{
     "model": "tocchitocchi/Qwen3-Swallow-32B-RL-v0.2-MLX-8bit",
     "messages": [
       {"role": "user", "content": "Hello"}
     ]
   }'
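The server answers the request above with a JSON document. As a rough sketch, assuming the response follows the usual OpenAI chat-completions shape (a "choices" list whose entries carry a "message" object), the assistant text could be pulled out as shown here. The function name extract_reply and the sample payload are illustrative, not real server output.

```python
import json


def extract_reply(response_body: str) -> str:
    """Return the assistant text from a chat-completions JSON response."""
    data = json.loads(response_body)
    choices = data.get("choices") or []
    if not choices:
        raise ValueError("response contains no choices")
    return choices[0]["message"]["content"]


# Hand-written sample in the assumed response shape (not real model output).
sample = json.dumps(
    {
        "model": "tocchitocchi/Qwen3-Swallow-32B-RL-v0.2-MLX-8bit",
        "choices": [
            {
                "index": 0,
                "message": {"role": "assistant", "content": "Hi there!"},
                "finish_reason": "stop",
            }
        ],
    }
)

print(extract_reply(sample))
```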

Qwen3-Swallow-32B-RL-v0.2-MLX-8bit

This model is an MLX-format conversion of tokyotech-llm/Qwen3-Swallow-32B-RL-v0.2, quantized to 8 bits for use on Apple Silicon.

Model Details

Attribute	Value
Original Model	tokyotech-llm/Qwen3-Swallow-32B-RL-v0.2
Architecture	Dense Transformer
Parameters	32B
Quantization	8-bit
Model Size	~32 GB
Format	MLX (Apple Silicon optimized)
Converted with	mlx-lm v0.30.8
License	Apache 2.0
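The ~32 GB figure in the table is consistent with a back-of-the-envelope estimate of parameter count times bits per weight. The sketch below makes that arithmetic explicit. It is only an approximation: it ignores the per-group quantization metadata and any tensors kept at higher precision, and the function name is hypothetical.

```python
def estimate_weight_size_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    total_bytes = num_params * bits_per_weight / 8
    return total_bytes / 1e9


# 32B parameters at 8 bits per weight, as in the table above
print(estimate_weight_size_gb(32e9, 8))   # 32.0

# The same parameter count at 16 bits per weight, for comparison
print(estimate_weight_size_gb(32e9, 16))  # 64.0
```

The machine needs unified memory beyond this figure, since the context cache and the operating system also take space.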

About Qwen3-Swallow

Qwen3-Swallow is a bilingual Japanese-English large language model developed by the Swallow Project at the Institute of Science Tokyo (formerly Tokyo Institute of Technology) and AIST. Built upon Qwen3 through Continual Pre-Training (CPT), Supervised Fine-Tuning (SFT), and Reinforcement Learning (RL), it achieves strong performance on both Japanese and English tasks while maintaining capabilities in mathematics and coding.

For more details, see the original model card for tokyotech-llm/Qwen3-Swallow-32B-RL-v0.2.

Usage

Quick Start (Python)

from mlx_lm import load, generate

model, tokenizer = load("tocchitocchi/Qwen3-Swallow-32B-RL-v0.2-MLX-8bit")

messages = [{"role": "user", "content": "hello"}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True
)

response = generate(model, tokenizer, prompt=prompt, verbose=True, max_tokens=512)

Interactive Chat

mlx_lm.chat --model tocchitocchi/Qwen3-Swallow-32B-RL-v0.2-MLX-8bit

OpenAI-Compatible Server

mlx_lm.server --model tocchitocchi/Qwen3-Swallow-32B-RL-v0.2-MLX-8bit --port 8080

Then connect with any OpenAI-compatible client at http://localhost:8080/v1.
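As one possible client, the sketch below sends a chat request using only the Python standard library. It assumes the server was started on port 8080 as shown above and exposes the standard /chat/completions route. The helper names build_chat_request and chat, the max_tokens default, and the timeout value are illustrative choices.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080/v1"
MODEL_ID = "tocchitocchi/Qwen3-Swallow-32B-RL-v0.2-MLX-8bit"


def build_chat_request(
    user_text: str,
    base_url: str = BASE_URL,
    model: str = MODEL_ID,
    max_tokens: int = 256,
) -> urllib.request.Request:
    """Build a POST request for a single-turn chat completion."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(user_text: str, timeout: float = 600.0) -> dict:
    """Send the request and return the decoded JSON response."""
    request = build_chat_request(user_text)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Requires the server to be running:
# print(chat("Hello"))
```

The generous timeout is deliberate, because the first request may have to wait while the model loads.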

Acknowledgments

Original model by Swallow Project (Institute of Science Tokyo and AIST)
MLX framework by Apple Machine Learning Research
Conversion performed using mlx-lm

Safetensors

Model size: 33B params
Tensor type: BF16, U32
Format: MLX, 8-bit

Model tree for tocchitocchi/Qwen3-Swallow-32B-RL-v0.2-MLX-8bit

Base model: Qwen/Qwen3-32B
Finetuned: tokyotech-llm/Qwen3-Swallow-32B-CPT-v0.2
Finetuned: tokyotech-llm/Qwen3-Swallow-32B-SFT-v0.2
Finetuned: tokyotech-llm/Qwen3-Swallow-32B-RL-v0.2
Quantized (2): this model