NekoMind1.5-Base

Introduction

NekoMind1.5 is the latest series of NekoMind large language models. It uses a Mixture-of-Experts (MoE) architecture to balance model capacity against inference cost: of its 1.35B total parameters, only about 300M are activated per token, so per-token compute stays low during inference.

This repo contains the base (pre-trained) NekoMind1.5 model, which has the following features:

Type: Causal Language Model
Training Stage: Pretraining
Architecture: Transformer decoder with RoPE, SwiGLU, RMSNorm, GQA, and Mixture-of-Experts
Number of Parameters: 1.35B (Total) / ~300M (Activated)
Number of Parameters (Non-Embedding): 1.32B
Number of Layers: 18
Number of Attention Heads (GQA): 8 for Q and 4 for KV
Head Dimension: 128
Context Length: 32,768 tokens
Number of Experts: 32 (Top-4 routing)
Shared Expert: Yes (with gating)
Vocabulary Size: 32,006
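
The non-embedding figure can be cross-checked with a little arithmetic. This is a rough sketch, assuming the hidden dimension of 1024 and the tied LM head shown in the architecture diagram, and using the rounded 1.35B total from the list above, so the result is only accurate to about two decimal places.

```python
# Values taken from the model card; TOTAL_PARAMS is the rounded figure.
VOCAB_SIZE = 32_006
HIDDEN_DIM = 1_024
TOTAL_PARAMS = 1.35e9

# With tied weights, the embedding matrix is counted once
# (it doubles as the LM head).
embedding_params = VOCAB_SIZE * HIDDEN_DIM
non_embedding_params = TOTAL_PARAMS - embedding_params

print(f"embedding:     {embedding_params:,}")                 # 32,774,144
print(f"non-embedding: {non_embedding_params / 1e9:.2f}B")    # 1.32B
```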

Key Design Choices

Mixture-of-Experts (MoE): 16 out of 18 layers use sparse MoE blocks with 32 experts and top-4 routing, enabling high model capacity with efficient inference.
Dense Layers: The first 2 layers (layers 0 and 1) use a standard dense MLP for stable early feature extraction.
Shared Expert with Gating: Each MoE layer also includes a shared expert, scaled by a sigmoid gate, that processes every token regardless of routing decisions.
Grouped Query Attention (GQA): Uses 8 query heads and 4 key-value heads to reduce KV-cache memory usage.
QK-Norm: Applies RMSNorm to query and key projections for training stability.
RoPE: Rotary Position Embedding with a base frequency of 1,000,000, chosen to support long contexts.
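
To make the GQA saving concrete, the KV-cache size at full context can be estimated from the figures above. This is a back-of-the-envelope sketch: the helper `kv_cache_bytes` is hypothetical, and it assumes a 16-bit cache (2 bytes per value), batch size 1, and no additional runtime overhead.

```python
NUM_LAYERS = 18
HEAD_DIM = 128
CONTEXT_LENGTH = 32_768
BYTES_PER_VALUE = 2  # assumption: 16-bit (e.g. BF16) cache entries


def kv_cache_bytes(num_kv_heads, seq_len=CONTEXT_LENGTH):
    """Estimate KV-cache bytes for one sequence of seq_len tokens."""
    # The leading 2 accounts for one key and one value tensor per layer.
    return 2 * NUM_LAYERS * num_kv_heads * HEAD_DIM * seq_len * BYTES_PER_VALUE


gqa = kv_cache_bytes(num_kv_heads=4)  # NekoMind1.5: 4 KV heads
mha = kv_cache_bytes(num_kv_heads=8)  # hypothetical: one KV head per query head

print(f"GQA (4 KV heads): {gqa / 2**30:.3f} GiB")  # 1.125 GiB
print(f"MHA (8 KV heads): {mha / 2**30:.3f} GiB")  # 2.250 GiB
```

Under these assumptions, sharing each key-value head between two query heads halves the cache at the full 32,768-token context.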

Architecture

The following diagram illustrates the overall architecture of NekoMind1.5:

┌─────────────────────────────────────────────────────────────┐
│                    NekoMind1.5-Base                          │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Input Tokens                                               │
│       │                                                     │
│       ▼                                                     │
│  ┌──────────┐                                               │
│  │Embedding │  (vocab: 32006, dim: 1024)                    │
│  └────┬─────┘                                               │
│       │                                                     │
│       ▼                                                     │
│  ╔══════════════════════════════════════════════════════╗    │
│  ║  Decoder Layer × 18                                  ║    │
│  ║                                                      ║    │
│  ║  ┌─────────────────────────────────────────────┐     ║    │
│  ║  │ RMSNorm                                     │     ║    │
│  ║  │     │                                       │     ║    │
│  ║  │     ▼                                       │     ║    │
│  ║  │ GQA Attention (8Q / 4KV, head_dim=128)      │     ║    │
│  ║  │ ├─ Q/K Projections → QK-Norm → RoPE        │     ║    │
│  ║  │ └─ Output Projection                       │     ║    │
│  ║  │     │                                       │     ║    │
│  ║  │     + (residual)                            │     ║    │
│  ║  │     │                                       │     ║    │
│  ║  │ RMSNorm                                     │     ║    │
│  ║  │     │                                       │     ║    │
│  ║  │     ▼                                       │     ║    │
│  ║  │ ┌───────────────────────────────────────┐   │     ║    │
│  ║  │ │ Layer 0-1: Dense MLP (SwiGLU)         │   │     ║    │
│  ║  │ │   gate_proj ─┐                        │   │     ║    │
│  ║  │ │   up_proj ───┼─→ SiLU(gate) * up      │   │     ║    │
│  ║  │ │              └─→ down_proj → output    │   │     ║    │
│  ║  │ ├───────────────────────────────────────┤   │     ║    │
│  ║  │ │ Layer 2-17: Sparse MoE Block          │   │     ║    │
│  ║  │ │                                       │   │     ║    │
│  ║  │ │  input ──┬──→ Router (TopK=4/32)      │   │     ║    │
│  ║  │ │          │       │                    │   │     ║    │
│  ║  │ │          │       ▼                    │   │     ║    │
│  ║  │ │          │    Expert × 32 (SwiGLU)    │   │     ║    │
│  ║  │ │          │       │ (weighted sum)     │   │     ║    │
│  ║  │ │          │       ▼                    │   │     ║    │
│  ║  │ │          └──→ Shared Expert (SwiGLU)  │   │     ║    │
│  ║  │ │                  │ × σ(gate)          │   │     ║    │
│  ║  │ │                  ▼                    │   │     ║    │
│  ║  │ │          expert_out + shared_out      │   │     ║    │
│  ║  │ └───────────────────────────────────────┘   │     ║    │
│  ║  │     │                                       │     ║    │
│  ║  │     + (residual)                            │     ║    │
│  ║  └─────────────────────────────────────────────┘     ║    │
│  ╚══════════════════════════════════════════════════════╝    │
│       │                                                     │
│       ▼                                                     │
│  ┌──────────┐                                               │
│  │ RMSNorm  │                                               │
│  └────┬─────┘                                               │
│       │                                                     │
│       ▼                                                     │
│  ┌──────────┐                                               │
│  │ LM Head  │  (tied with embedding weights)                │
│  └────┬─────┘                                               │
│       │                                                     │
│       ▼                                                     │
│  Output Logits (vocab: 32006)                               │
│                                                             │
└─────────────────────────────────────────────────────────────┘
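
The MoE path in the diagram can be sketched in plain Python. This is a minimal illustration, not the repository's modeling code: the function names `route` and `moe_output` are hypothetical, and the choice to softmax over all 32 experts and then renormalize the top-4 weights is an assumption, since the model card does not state how routing weights are normalized.

```python
import math

NUM_EXPERTS = 32  # from the model card
TOP_K = 4         # from the model card


def route(router_logits, top_k=TOP_K):
    """Return (indices, weights) of the top_k experts for one token.

    Assumption: softmax over all experts, then renormalize over the
    selected ones so the weights sum to 1.
    """
    peak = max(router_logits)
    exps = [math.exp(x - peak) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in top)
    return top, [probs[i] / norm for i in top]


def moe_output(expert_outputs, indices, weights, shared_output, shared_gate_logit):
    """Compute expert_out + sigmoid(gate) * shared_out for one token."""
    gate = 1.0 / (1.0 + math.exp(-shared_gate_logit))
    out = [gate * s for s in shared_output]
    for i, w in zip(indices, weights):
        for d in range(len(out)):
            out[d] += w * expert_outputs[i][d]
    return out


# Toy example: four experts get clearly higher router logits than the rest.
logits = [0.0] * NUM_EXPERTS
logits[3], logits[10], logits[7], logits[31] = 2.0, 1.5, 1.0, 0.5
indices, weights = route(logits)
print(indices)  # [3, 10, 7, 31]
```

Only the four selected experts would need to run for this token, which is why the activated parameter count is far smaller than the total.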

Requirements

Note: The NekoMind1.5 model code has not yet been merged into the main transformers library. You must enable trust_remote_code=True when loading the model to use the custom modeling code hosted in this repository.

transformers >= 4.51.0
torch >= 2.1.0

Install the required dependencies:

pip install "transformers>=4.51.0" "torch>=2.1.0" accelerate
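
To confirm that an existing environment meets these minimums, a small check along the following lines may help. It is a sketch using only the standard library; the helpers `parse` and `meets_minimum` are hypothetical and compare only the numeric major.minor.patch part of each version, ignoring pre-release and local suffixes.

```python
import re
from importlib.metadata import PackageNotFoundError, version

# Minimum versions from the Requirements section.
MINIMUMS = {"transformers": "4.51.0", "torch": "2.1.0"}


def parse(v):
    """Turn a version string such as '2.1.0+cu121' into (2, 1, 0)."""
    parts = []
    for piece in v.split("+")[0].split(".")[:3]:
        match = re.match(r"\d+", piece)
        parts.append(int(match.group()) if match else 0)
    parts += [0] * (3 - len(parts))  # pad '4.51' to (4, 51, 0)
    return tuple(parts)


def meets_minimum(installed, minimum):
    return parse(installed) >= parse(minimum)


for name, minimum in MINIMUMS.items():
    try:
        installed = version(name)
    except PackageNotFoundError:
        print(f"{name}: not installed (need >= {minimum})")
        continue
    status = "ok" if meets_minimum(installed, minimum) else f"too old, need >= {minimum}"
    print(f"{name} {installed}: {status}")
```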

Quickstart

The following snippet loads the model and tokenizer, then generates a continuation of a text prompt. As noted under Requirements, trust_remote_code=True is needed to load the custom model code from this repository.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "nekocyrene/NekoMind1.5-Base"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    trust_remote_code=True,
)

prompt = "The theory of relativity"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

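# Generate up to 512 new tokens as a continuation of the prompt
generated_ids = model.generate(
    **inputs,
    max_new_tokens=512,
)
# Drop the prompt tokens so that only the newly generated text is decoded
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs.input_ids, generated_ids)
]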

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)

Chat Usage

For chat-style interaction, format the conversation with apply_chat_template. The snippet reuses the model and tokenizer loaded in the Quickstart:

prompt = "Give me a short introduction to large language models."
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512,
)
# Drop the prompt tokens so that only the model's reply is decoded
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)

License

This model is released under the Apache 2.0 License.
