Instructions to use cyberagent/CAT-Thinking-8B with libraries and local apps.

Libraries

How to use cyberagent/CAT-Thinking-8B with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="cyberagent/CAT-Thinking-8B")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("cyberagent/CAT-Thinking-8B")
model = AutoModelForCausalLM.from_pretrained("cyberagent/CAT-Thinking-8B")
messages = [
    {"role": "user", "content": "Who are you?"},
]
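inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

# The Usage section recommends max_new_tokens of at least 4096 so the reasoning trace is not cut off
outputs = model.generate(**inputs, max_new_tokens=4096)
# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))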

Local Apps

vLLM

How to use cyberagent/CAT-Thinking-8B with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "cyberagent/CAT-Thinking-8B"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "cyberagent/CAT-Thinking-8B",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
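
The same request can also be sent from Python. The following is a minimal sketch using only the standard library; it assumes the server started above is listening on localhost:8000. The helper names `build_chat_request` and `ask` are hypothetical, and the default `max_tokens` simply mirrors the 4096-token recommendation in the Usage section.

```python
import json
import urllib.request

# Assumption: a server started with `vllm serve` is listening at this address.
BASE_URL = "http://localhost:8000/v1/chat/completions"
MODEL_ID = "cyberagent/CAT-Thinking-8B"


def build_chat_request(prompt: str, max_tokens: int = 4096) -> dict:
    """Build the JSON body for an OpenAI-compatible chat completion call."""
    return {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def ask(prompt: str, url: str = BASE_URL) -> str:
    """POST the request and return the text of the assistant message."""
    body = json.dumps(build_chat_request(prompt)).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        reply = json.load(response)
    return reply["choices"][0]["message"]["content"]
```

With the server running, `print(ask("What is the capital of France?"))` should print the reply. For the SGLang server described below, pass its port 30000 URL as `url`.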

Use Docker

docker model run hf.co/cyberagent/CAT-Thinking-8B

SGLang

How to use cyberagent/CAT-Thinking-8B with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "cyberagent/CAT-Thinking-8B" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "cyberagent/CAT-Thinking-8B",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "cyberagent/CAT-Thinking-8B" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "cyberagent/CAT-Thinking-8B",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use cyberagent/CAT-Thinking-8B with Docker Model Runner:
```
docker model run hf.co/cyberagent/CAT-Thinking-8B
```

CAT-Thinking 🐱

Tiny Language Model that thinks in Japanese

CAT-Thinking is trained with reinforcement learning to generate its reasoning trace in Japanese. The model is based on Qwen3-Swallow-v0.2, a model continually pretrained from Qwen3 to read and write Japanese fluently.

Usage

CAT-Thinking is designed to reason in Japanese even when the input text is in English. The model is trained with a maximum output length of 4096 tokens. We recommend setting max_new_tokens to at least 4096, and higher for difficult problems: although the model is trained to respond within 4096 tokens, it tends to generate longer responses, especially for difficult or confusing instructions. It often gets stuck in repetition, particularly when the instruction is confusing (e.g., when two contradictory instructions are given). To reduce the likelihood of repetition, we find that setting repetition_penalty to 1.05 or larger is useful.
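
The recommendations above can be gathered into one set of generation arguments. This is only a sketch: the helper name `recommended_generation_kwargs` is hypothetical, and doubling the token budget for hard problems is an illustrative choice rather than a value stated by the authors.

```python
def recommended_generation_kwargs(hard_problem: bool = False) -> dict:
    """Collect the generation settings suggested in the paragraph above."""
    max_new_tokens = 4096  # recommended minimum
    if hard_problem:
        # Assumption: "larger for difficult problems" is taken as 2x here.
        max_new_tokens *= 2
    return {
        "max_new_tokens": max_new_tokens,
        # "1.05 or larger" is suggested to reduce repetition loops.
        "repetition_penalty": 1.05,
    }
```

The returned dictionary can be unpacked into a transformers call, for example `pipe(messages, **recommended_generation_kwargs())` or `model.generate(**inputs, **recommended_generation_kwargs(hard_problem=True))`.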

The model is compatible with every library that supports Qwen3-8B. To run it with transformers, install the library and run the following script.

pip install transformers

from transformers import pipeline

# Load the model
chat_pipeline = pipeline("text-generation", model="cyberagent/CAT-Thinking-8B")

prompt = (
    "You have two cats, one male and one female. A female cat gives birth to up to 12 kittens per year.\n"
    "Assume you don't spay them. In three years, how many cats might you need to take care of at most?"
)

user_input = [{"role": "user", "content": prompt}]

# Sample with the same temperature and top_p as the Evaluation section
response = chat_pipeline(user_input, max_new_tokens=8192, temperature=0.8, top_p=0.95)

# With chat-style input, generated_text holds the whole conversation, including the new assistant turn
print(response[0]['generated_text'])
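
To look at the reasoning trace and the final answer separately, the assistant turn can be split after generation. The sketch below assumes the model wraps its reasoning in `<think>` and `</think>` tags, as Qwen3 models do; this model card does not confirm the tag format, so check an actual output first. Both helper names are hypothetical.

```python
from typing import Tuple


def assistant_text(response) -> str:
    """Return the generated assistant text from a text-generation pipeline result."""
    generated = response[0]["generated_text"]
    if isinstance(generated, str):
        return generated
    # Chat-style input returns the message list; the last entry is the new turn.
    return generated[-1]["content"]


def split_reasoning(
    text: str, open_tag: str = "<think>", close_tag: str = "</think>"
) -> Tuple[str, str]:
    """Split text into (reasoning, answer).

    Assumption: reasoning is delimited by Qwen3-style think tags.
    If no closing tag is found, the whole text is treated as the answer.
    """
    reasoning, separator, answer = text.partition(close_tag)
    if not separator:
        return "", text.strip()
    return reasoning.replace(open_tag, "", 1).strip(), answer.strip()
```

For example, `reasoning, answer = split_reasoning(assistant_text(response))` gives the Japanese reasoning trace and the final answer as two strings.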

Evaluation

We evaluated the model on coding and math tasks in Japanese and English, comparing CAT-Thinking with its counterparts, Qwen3-8B (Qwen3) and Qwen3-Swallow-8B-RL-v0.2 (Swallow). Random sampling (temperature=0.8, top_p=0.95, max_new_tokens=4096) is used for all runs.
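
For readers unfamiliar with these sampling settings, the following pure-Python sketch illustrates what temperature and top_p do to a next-token distribution. It is a conceptual illustration only, not the code path used for the evaluation; the function names and toy logits are hypothetical.

```python
import math
import random


def top_p_filter(logits, temperature=0.8, top_p=0.95):
    """Return {token_index: probability} for the nucleus after temperature scaling."""
    # Temperature below 1 sharpens the distribution; above 1 flattens it.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]  # numerically stable softmax
    total = sum(weights)
    probs = [w / total for w in weights]

    # Keep the smallest set of most likely tokens whose mass reaches top_p.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # Renormalize so the kept probabilities sum to 1.
    return {i: probs[i] / mass for i in kept}


def sample(logits, rng, temperature=0.8, top_p=0.95):
    """Draw one token index from the filtered distribution."""
    nucleus = top_p_filter(logits, temperature, top_p)
    indices = list(nucleus)
    return rng.choices(indices, weights=[nucleus[i] for i in indices], k=1)[0]
```

Calling `sample(logits, random.Random(0))` draws one token index. Because sampling is random, repeated runs of the same benchmark can produce different outputs.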

For all instances, CAT-Thinking generated its reasoning trace in Japanese, while the comparison models generated theirs in English. The results show that a model can think in Japanese while retaining its performance on English tasks. Because the benchmarks were published before the models were released, the scores should be interpreted as reference points.

A detailed experimental evaluation will be presented in a technical report.

Generation Examples

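The reasoning trace of CAT-Thinking is in Japanese, but the Japanese is not natural. In particular, the trace tends to open with an odd phrase such as 「済みの問題を解くことは許可されていますか？」 (roughly, "Is it permitted to solve an already-settled problem?") or 「済みの数学問題かを確認する必要がありますか？」 (roughly, "Do I need to check whether this is an already-settled math problem?"). This behavior emerged during the second GRPO training stage, and the model ended up learning to start with this phrase.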

Example of Coding

PROMPT: Write a function to check if the given integer is a prime number. Name the function as follows: def prime_num(num):

RESPONSE: 済みの問題を解くことは許可されていますか？はい。この問題はプログラミングに関する内容であり、特定のカテゴリや禁止事項には該当しません。計算やアルゴリズムを求められる質問です。単純な素数判定関数を実装するコードを書けばよいでしょう。指示は「与えられた整数が素数かどうかを判定する関数を書いてください」というものです。単純な素数判定ロジックを提供すればよいでしょう。考慮すべき点としては、入力が負の数や0、1の場合の扱いです。通常、素数は1より大きい自然数で、1と自分自身以外に約数を持たない数です。したがって、0や1は素数ではありません。負の数も素数ではありません。関数は整数を受け取り、素数であればTrue、そうでなければFalseを返すように実装できます。効率的な方法として、2から平方根までの範囲で割り切れないかチェックする方法があります。境界条件として2は素数です。偶数は除外できます。Pythonでの実装例を提供し、コメントやテストケースも添えてくださいか？問題文は「関数を書いてください」とだけ書かれていますが、説明やコード例も添えると良いでしょう。最終的な答えは、要件通りの関数定義と説明です。コードは日本語のコメントを含めても構いませんか？はい、可能です。ただし、問題の指示は「関数を書いてください」というだけなので、シンプルに実装した関数だけを出力すればよいでしょう。追加の説明も添えて問題ありませんか？質問の意図はおそらくコードを書かせることが目的です。したがって、関数prime_numを実装したコードと簡単な説明を出力します。注意点として、入力が整数であることを前提とし、例外処理は省略してもよいでしょう。最終的に、Pythonコードを提示しますか？はい、出力してください。答えは日本語で書かれたコードと簡単な説明です。注意点として、素数判定はO(√n)の時間計算量です。コードを提供してくださいか？はい、提供します。最後に、テスト例も添えておくと親切です。しかし、問題文は関数だけを要求しているようですので、関数定義とコメントだけを書けば十分です。以下に回答を示します。

解説
素数（prime）とは「1 とその数自身以外に約数を持たない 1 より大きい整数」のことです。
判定の基本は「√num までの整数で割り切れるものが無ければ素数」という考え方です。

2 は唯一の偶数の素数なので特別扱いします。
それ以外の偶数は必ず合成数です。
それ以外の奇数については 3 から √num までの奇数で割り切れないかを調べます。

このアルゴリズムは最悪でも O(√n) の時間で判定でき、入力サイズがそれほど大きくない限り十分高速です。

実装例（Python 3）

import math

def prime_num(num: int) -> bool:
    """
    与えられた整数が素数かどうかを判定して bool を返す関数。

    Parameters
    ----------
    num : int
        判定したい整数。負の数や 0, 1 は素数ではないとみなす。

    Returns
    -------
    bool
        素数であれば True、そうでなければ False。
    """
    # 1 以下の数は素数ではない
    if num <= 1:
        return False

    # 2 は唯一の偶数の素数
    if num == 2:
        return True

    # 偶数は 2 以外すべて合成数
    if num % 2 == 0:
        return False

    # 3 から√num までの奇数で割り切れないか調べる
    limit = int(math.isqrt(num))   # 整数平方根を取得（Python3.8 以降）
    for i in range(3, limit + 1, 2):
        if num % i == 0:
            return False

    # ここまで来たら割り切れる約数が無かった → 素数
    return True

ポイント

math.isqrt（または int(math.sqrt(num))）で整数の平方根を取得し、ループ回数を最小化しています。
偶数の除外を行うことで、ループは奇数だけを対象にし、実行速度が約半分に向上します。
関数は bool を返すだけにし、呼び出し側で print(prime_num(x)) のように結果を出力できます。

使い方の簡単な例

print(prime_num(2))   # True
print(prime_num(15))  # False
print(prime_num(29))  # True
print(prime_num(1))   # False

この関数をそのままプロジェクトに組み込めば、整数が素数かどうかの判定が簡潔に行えます。

Example of Repetition on Solving Math

PROMPT: 海辺の主な港の税関職員は、輸入車の2台分のコンテナをチェックしており、各コンテナには5台の車が中に入っている。次の日、さらに多くのコンテナが運び込まれ、港の車両の数は全部で30となった。すべてのコンテナに5台の車が入っていたとすると、2日目にいくつのコンテナが輸入されたかを計算しなさい。回答は\boxed{}の中に書いてください。

RESPONSE: 済みの数学問題を解くことは許可されていますか？はい。この問題は算数や代数の範囲であり、特定のカテゴリや禁止事項には該当しません。計算や論理的推論を求める問題です。求められるべきは「2日目に何個のコンテナが輸入されたか」という数値です。問題文を読み取ると、最初の日に税関職員は2つのコンテナをチェックし、各コンテナには5台の車が入っていました。したがって、最初の日の車の総数は2×5＝10台です。次の日、さらにコンテナが運ばれ、港にある車の総数は30台になりました。ここで「すべてのコンテナに5台の車が入っていた」と仮定すると、次の日に追加されたコンテナの数は（全体の車の数−最初の日の車の数）÷5で求められます。計算すると、（30−10）÷5＝20÷5＝4です。したがって、答えは4です。確認のために考えを整理すると、最初の日は10台の車があり、最終的に30台になるので、増加分は20台です。各コンテナは5台の車を含むので、必要なコンテナ数は4です。答えは4です。問題文の指示に従い、答えを四角で囲んでくださいか？指示では「回答は\boxed{}の中に書いてください」とありますので、答えを四角で囲んでください。答えは「4」です。日本語で答えを出力してくださいか？はい、答えを日本語で書きます。「答えは4です。」と書くか、単に「4」と書くべきか。指示は「回答は\boxed{}の中に書いてください」とあるので、答えを四角で囲んでください。したがって「\boxed{4}」と出力します。日本語で説明を添えても良いですが、最終回答は四角で囲んだ数字です。問題はシンプルな計算問題ですので、答えは4です。答えを四角で囲んでください。最終回答は「\boxed{4}」です。日本語で「答えは4です」と付け加えても良いですが、指示は「回答は\boxed{}の中に書いてください」とあるので、答えだけを四角で囲んで出力すべきです。ここでは答えだけを出力しますか？通常は答えと簡単な説明を添えることが多いですが、指示は明確に「回答は\boxed{}の中に書いてください」とありますので、答えだけを四角で囲んで出力します。したがって最終回答は「\boxed{4}」です。日本語での説明は不要かもしれませんが、問題文の指示に従う形で答えだけを出力するのが適切です。答えを四角で囲んでください。最終的に「4」を四角で囲んだ形で出力しますか？はい、出力は「\boxed{4}」となります。日本語で「答えは4です」と書く必要はありませんか？指示は「回答は\boxed{}の中に書いてください」とあるので、答えだけを四角で囲んでください。したがって最終回答は「\boxed{4}」です。ここでは日本語の説明は省き、答えだけを出力しますか？安全策として、答えを四角で囲んだ後に「答えは4です」と添えてはいけませんか？指示は答えを四角で囲むことだけを求めているようです。従って、最終出力は「\boxed{4}」です。日本語の文言は不要です。答えだけを出力します。

まず、最初の日に税関職員がチェックしたコンテナは 2 台で、各コンテナには 5 台の車が入っているので、最初の日に港にあった車の総数は

\[ 2 \times 5 = 10 \text{ 台} \]

です。

2 日目が終わった時点で港にある車の総数は 30 台です。したがって、2 日目に新たに増えた車の台数は

\[ 30 - 10 = 20 \text{ 台} \]

です。

問題文の条件「すべてのコンテナに 5 台の車が入っていた」とあるので、2 日目に輸入されたコンテナの数は増えた車の台数を 1 コンテナあたりの車の台数で割ったものです。

\[ \frac{20}{5} = 4 \]

したがって、2 日目に輸入されたコンテナの数は 4 台 です。

\[ \boxed{4} \]

Training Procedure

The model is trained with GRPO after a warm start. We first generate a teacher dataset using gpt-oss-120b as a reference. Since its reasoning traces are in English, we translate them into Japanese using CAT-Translate-7b. We then train the Swallow model on this synthesized dataset with full-parameter SFT.

Then, we run GRPO with a permissive reward model that gives partial rewards for (1) following the reasoning format, (2) generating the reasoning trace and the main text in Japanese, and (3) answering the question in the instructed format. In this way, the model learns to follow the reasoning format and to generate its reasoning trace in Japanese. Since this training phase focuses on learning the superficial format rather than reasoning competence itself, we use LoRA.
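
A rule-based version of such a permissive reward might look like the sketch below. The actual reward model is not described in detail, so everything here is an assumption made for illustration: the `<think>` tag format, the Unicode-range test for Japanese, the `\boxed{}` answer format, the 0.5 ratio threshold, and the equal weighting of the three criteria.

```python
import re

# Hiragana, katakana, and CJK unified ideographs (a rough test for Japanese).
JAPANESE_CHAR = re.compile(r"[\u3040-\u30ff\u4e00-\u9fff]")
# Assumption: reasoning comes first, wrapped in Qwen3-style think tags.
THINK_BLOCK = re.compile(r"\A\s*<think>(.*?)</think>(.*)\Z", re.DOTALL)
# Assumption: the instructed answer format is a LaTeX \boxed{...}.
BOXED = re.compile(r"\\boxed\{[^{}]+\}")


def japanese_ratio(text: str) -> float:
    """Fraction of alphabetic characters that are Japanese."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(bool(JAPANESE_CHAR.match(ch)) for ch in letters) / len(letters)


def permissive_reward(completion: str, min_ratio: float = 0.5) -> float:
    """Give one third of the reward for each satisfied criterion."""
    satisfied = 0
    match = THINK_BLOCK.match(completion)
    if match:
        satisfied += 1  # (1) follows the reasoning format
        reasoning, answer = match.groups()
    else:
        reasoning, answer = "", completion
    if (
        reasoning
        and japanese_ratio(reasoning) >= min_ratio
        and japanese_ratio(answer) >= min_ratio
    ):
        satisfied += 1  # (2) reasoning trace and main text are in Japanese
    if BOXED.search(answer):
        satisfied += 1  # (3) answer is given in the instructed format
    return satisfied / 3
```

Because each criterion earns credit on its own, a completion that follows the format but reasons in English still gets a nonzero reward.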

Finally, we train the model with GRPO using a strict reward model that gives a reward only if the model follows all format constraints and also generates the correct answer.
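
The contrast with the permissive stage can be sketched as follows. The reward is all-or-nothing, and GRPO then compares each sampled completion with the others drawn for the same prompt. This is an illustration under assumptions, not the authors' implementation: the `<think>` and `\boxed{}` formats are assumed, the Japanese-language check is omitted for brevity, and the advantage uses the common group-standardized form with the population standard deviation (implementations differ on this detail).

```python
import re
from statistics import mean, pstdev

# Assumption: reasoning is wrapped in think tags and answers use \boxed{...}.
THINK_BLOCK = re.compile(r"\A\s*<think>(.+?)</think>(.*)\Z", re.DOTALL)
BOXED = re.compile(r"\\boxed\{([^{}]+)\}")


def strict_reward(completion: str, gold: str) -> float:
    """Return 1.0 only if the format is followed and the answer is correct."""
    match = THINK_BLOCK.match(completion)
    if not match:
        return 0.0  # reasoning format violated
    boxed = BOXED.findall(match.group(2))
    if len(boxed) != 1:
        return 0.0  # answer format violated
    return 1.0 if boxed[0].strip() == gold.strip() else 0.0


def group_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of samples for the same prompt."""
    center = mean(rewards)
    spread = pstdev(rewards)
    return [(r - center) / (spread + eps) for r in rewards]
```

When every sample in a group gets the same reward, all advantages are zero and that prompt contributes no learning signal. This is one reason a warm start that teaches the format before the strict stage can help.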

The training data consists of synthesized math and coding datasets generated by gpt-oss-120b. Most of the instructions used for training are in English. We speculate that this is why the model underperforms on Japanese benchmarks.

License

The model is licensed under the Apache 2.0 License.

Citation

@misc{cat-thinking-2026,
  title={CAT-Thinking: How Hard is it to Learn Reasoning in non-English Language? Case Study in Japanese},
  author={Yuu Jinnai},
  year={2026},
  url={https://huggingface.co/collections/cyberagent/cat-thinking-8b}
}

Model size: 8B params (Safetensors)

Tensor type: BF16

Model tree for cyberagent/CAT-Thinking-8B: Qwen/Qwen3-8B-Base (base model) → Qwen/Qwen3-8B (finetuned) → this model (finetuned)

Quantizations: 2 models