NekoMind1.5-Base

Introduction

NekoMind1.5 is the latest series of NekoMind large language models. It adopts a Mixture-of-Experts (MoE) architecture to achieve a strong balance between model capacity and inference efficiency. With 1.35B total parameters but only ~300M activated per token, NekoMind1.5 delivers competitive performance while maintaining low computational cost during inference.

This repo contains the base (pre-trained) NekoMind1.5 model, which has the following features:

  • Type: Causal Language Models
  • Training Stage: Pretraining
  • Architecture: Transformer decoder with RoPE, SwiGLU, RMSNorm, GQA, and Mixture-of-Experts
  • Number of Parameters: 1.35B (Total) / ~300M (Activated)
  • Number of Parameters (Non-Embedding): 1.32B
  • Number of Layers: 18
  • Number of Attention Heads (GQA): 8 for Q and 4 for KV
  • Head Dimension: 128
  • Context Length: 32,768 tokens
  • Number of Experts: 32 (Top-4 routing)
  • Shared Expert: Yes (with gating)
  • Vocabulary Size: 32,006
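The gap between the total and non-embedding parameter counts can be checked with a little arithmetic. The sketch below assumes the embedding width of 1024 and the tied LM head shown in the architecture diagram, and it uses the rounded 1.35B total, so the result is approximate.

```python
# Values taken from the feature list and the architecture diagram.
VOCAB_SIZE = 32_006
HIDDEN_DIM = 1_024        # embedding dim shown in the diagram
NUM_Q_HEADS = 8
HEAD_DIM = 128
TOTAL_PARAMS = 1.35e9     # rounded figure from the feature list

# The LM head is tied to the embedding, so the matrix is counted only once.
embedding_params = VOCAB_SIZE * HIDDEN_DIM
non_embedding_params = TOTAL_PARAMS - embedding_params

# Sanity check: the query heads together span the full hidden dimension.
assert NUM_Q_HEADS * HEAD_DIM == HIDDEN_DIM

print(f"embedding params:     {embedding_params:,}")
print(f"non-embedding params: {non_embedding_params / 1e9:.2f}B")
```

The embedding table accounts for roughly 33M parameters, which is consistent with the 1.32B non-embedding figure listed above.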

Key Design Choices

  • Mixture-of-Experts (MoE): 16 out of 18 layers use sparse MoE blocks with 32 experts and top-4 routing, enabling high model capacity with efficient inference.
  • Dense Layers: The first 2 layers (layers 0 and 1) use a standard dense MLP for stable early feature extraction.
  • Shared Expert with Gating: Each MoE layer includes a shared expert with a sigmoid gate, ensuring a baseline of knowledge is always available regardless of routing decisions.
  • Grouped Query Attention (GQA): Uses 8 query heads and 4 key-value heads to reduce KV-cache memory usage.
  • QK-Norm: Applies RMSNorm to query and key projections for training stability.
  • RoPE: Rotary Position Embedding with a base frequency of 1,000,000 for strong long-context extrapolation.
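To make the KV-cache saving from GQA concrete, the following back-of-the-envelope sketch compares the cache size for 4 key-value heads against a hypothetical multi-head baseline with 8. It assumes a 16-bit cache (2 bytes per element) and a single sequence at the full context length; the function name kv_cache_bytes is illustrative.

```python
# Values from the feature list above.
NUM_LAYERS = 18
NUM_KV_HEADS = 4       # GQA key-value heads
NUM_Q_HEADS = 8        # an MHA baseline would keep one KV head per query head
HEAD_DIM = 128
CONTEXT_LEN = 32_768
BYTES_PER_ELEM = 2     # assumption: 16-bit cache (BF16 or FP16)


def kv_cache_bytes(num_kv_heads: int, seq_len: int) -> int:
    """Bytes needed to cache keys and values for one sequence."""
    # The leading factor 2 covers the separate key and value tensors.
    per_token = 2 * NUM_LAYERS * num_kv_heads * HEAD_DIM * BYTES_PER_ELEM
    return per_token * seq_len


gqa = kv_cache_bytes(NUM_KV_HEADS, CONTEXT_LEN)
mha = kv_cache_bytes(NUM_Q_HEADS, CONTEXT_LEN)
print(f"GQA (4 KV heads): {gqa / 2**30:.3f} GiB")
print(f"MHA (8 KV heads): {mha / 2**30:.3f} GiB")
```

Under these assumptions, halving the number of key-value heads halves the cache, from about 2.25 GiB to about 1.125 GiB per full-length sequence.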

Architecture

The following diagram illustrates the overall architecture of NekoMind1.5:

┌─────────────────────────────────────────────────────────────┐
│                    NekoMind1.5-Base                          │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Input Tokens                                               │
│       │                                                     │
│       ▼                                                     │
│  ┌──────────┐                                               │
│  │Embedding │  (vocab: 32006, dim: 1024)                    │
│  └────┬─────┘                                               │
│       │                                                     │
│       ▼                                                     │
│  ╔══════════════════════════════════════════════════════╗    │
│  ║  Decoder Layer × 18                                  ║    │
│  ║                                                      ║    │
│  ║  ┌─────────────────────────────────────────────┐     ║    │
│  ║  │ RMSNorm                                     │     ║    │
│  ║  │     │                                       │     ║    │
│  ║  │     ▼                                       │     ║    │
│  ║  │ GQA Attention (8Q / 4KV, head_dim=128)      │     ║    │
│  ║  │ ├─ Q/K Projections → QK-Norm → RoPE        │     ║    │
│  ║  │ └─ Output Projection                       │     ║    │
│  ║  │     │                                       │     ║    │
│  ║  │     + (residual)                            │     ║    │
│  ║  │     │                                       │     ║    │
│  ║  │ RMSNorm                                     │     ║    │
│  ║  │     │                                       │     ║    │
│  ║  │     ▼                                       │     ║    │
│  ║  │ ┌───────────────────────────────────────┐   │     ║    │
│  ║  │ │ Layer 0-1: Dense MLP (SwiGLU)         │   │     ║    │
│  ║  │ │   gate_proj ─┐                        │   │     ║    │
│  ║  │ │   up_proj ───┼─→ SiLU(gate) * up      │   │     ║    │
│  ║  │ │              └─→ down_proj → output    │   │     ║    │
│  ║  │ ├───────────────────────────────────────┤   │     ║    │
│  ║  │ │ Layer 2-17: Sparse MoE Block          │   │     ║    │
│  ║  │ │                                       │   │     ║    │
│  ║  │ │  input ──┬──→ Router (TopK=4/32)      │   │     ║    │
│  ║  │ │          │       │                    │   │     ║    │
│  ║  │ │          │       ▼                    │   │     ║    │
│  ║  │ │          │    Expert × 32 (SwiGLU)    │   │     ║    │
│  ║  │ │          │       │ (weighted sum)     │   │     ║    │
│  ║  │ │          │       ▼                    │   │     ║    │
│  ║  │ │          └──→ Shared Expert (SwiGLU)  │   │     ║    │
│  ║  │ │                  │ × σ(gate)          │   │     ║    │
│  ║  │ │                  ▼                    │   │     ║    │
│  ║  │ │          expert_out + shared_out      │   │     ║    │
│  ║  │ └───────────────────────────────────────┘   │     ║    │
│  ║  │     │                                       │     ║    │
│  ║  │     + (residual)                            │     ║    │
│  ║  └─────────────────────────────────────────────┘     ║    │
│  ╚══════════════════════════════════════════════════════╝    │
│       │                                                     │
│       ▼                                                     │
│  ┌──────────┐                                               │
│  │ RMSNorm  │                                               │
│  └────┬─────┘                                               │
│       │                                                     │
│       ▼                                                     │
│  ┌──────────┐                                               │
│  │ LM Head  │  (tied with embedding weights)                │
│  └────┬─────┘                                               │
│       │                                                     │
│       ▼                                                     │
│  Output Logits (vocab: 32006)                               │
│                                                             │
└─────────────────────────────────────────────────────────────┘
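The data flow of the sparse MoE block in the diagram can be sketched in plain Python for a single token. This is an illustration under stated assumptions, not the repository's implementation: the router is assumed to apply a softmax over all 32 experts and renormalise the top-4 weights, the experts are toy scaling functions rather than SwiGLU MLPs, and the shared-expert gate logit is passed in directly instead of being computed from the token. All function names are hypothetical.

```python
import math
import random

NUM_EXPERTS = 32
TOP_K = 4


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def route(router_logits, top_k=TOP_K):
    """Return (expert_index, weight) pairs for the top-k experts.

    Assumption: softmax over all experts, then renormalise the kept weights.
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:top_k]
    kept = sum(probs[i] for i in top)
    return [(i, probs[i] / kept) for i in top]


def moe_block(x, router_logits, experts, shared_expert, shared_gate_logit):
    """Combine the routed experts and the gated shared expert for one token."""
    expert_out = [0.0] * len(x)
    for idx, weight in route(router_logits):
        y = experts[idx](x)
        expert_out = [acc + weight * v for acc, v in zip(expert_out, y)]
    gate = sigmoid(shared_gate_logit)
    shared_out = [gate * v for v in shared_expert(x)]
    # Matches the last step in the diagram: expert_out + shared_out.
    return [a + b for a, b in zip(expert_out, shared_out)]


def make_expert(scale):
    """Toy stand-in for an expert: scales its input by a constant."""
    return lambda x: [scale * v for v in x]


experts = [make_expert(i + 1) for i in range(NUM_EXPERTS)]
shared_expert = make_expert(0.5)

random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
token = [1.0, -2.0]
print(route(logits))
print(moe_block(token, logits, experts, shared_expert, shared_gate_logit=0.0))
```

Only 4 of the 32 expert functions are evaluated per token, which is where the gap between total and activated parameters comes from; the shared expert runs for every token regardless of the routing decision.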

Requirements

Note: The NekoMind1.5 model code has not yet been merged into the main transformers library. You must enable trust_remote_code=True when loading the model to use the custom modeling code hosted in this repository.

  • transformers >= 4.51.0
  • torch >= 2.1.0

Install the required dependencies:

pip install "transformers>=4.51.0" "torch>=2.1.0" accelerate
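To confirm that an environment satisfies the minimum versions listed above, a small check along these lines can help. The helpers version_tuple and meets_minimum are illustrative names, not part of any library, and the comparison deliberately ignores pre-release and local version tags.

```python
import re
from importlib.metadata import PackageNotFoundError, version

# Minimum versions listed in the Requirements section.
MINIMUMS = {"transformers": "4.51.0", "torch": "2.1.0"}


def version_tuple(text: str) -> tuple:
    """Leading numeric release components, e.g. '2.1.0+cu118' -> (2, 1, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", text)
    return tuple(int(part) for part in match.group().split(".")) if match else ()


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare release numbers, padding with zeros so '4.51' equals '4.51.0'."""
    a, b = version_tuple(installed), version_tuple(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


for package, minimum in MINIMUMS.items():
    try:
        installed = version(package)
    except PackageNotFoundError:
        print(f"{package}: not installed (need >= {minimum})")
        continue
    status = "ok" if meets_minimum(installed, minimum) else f"too old, need >= {minimum}"
    print(f"{package} {installed}: {status}")
```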

Quickstart

Here is a code snippet showing how to load the model and generate text. As noted under Requirements, pass trust_remote_code=True so that the custom modeling code in this repository is loaded.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "nekocyrene/NekoMind1.5-Base"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    trust_remote_code=True,
)

prompt = "The theory of relativity"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **inputs,
    max_new_tokens=512,
)
# Drop the prompt tokens so that only the newly generated continuation is decoded.
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
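The list comprehension in the snippet above removes the prompt from each generated sequence, since generate returns the prompt followed by the new tokens for decoder-only models. The toy example below shows the same slicing on plain Python lists; the token ids are made up for illustration.

```python
# Toy token ids standing in for tensors; the values are made up for illustration.
input_ids_batch = [[101, 7, 8]]               # one prompt of 3 tokens
generated_batch = [[101, 7, 8, 42, 43, 44]]   # prompt followed by 3 new tokens

# Same slicing as in the Quickstart: skip the first len(prompt) ids of each output.
new_tokens = [
    output_ids[len(input_ids):]
    for input_ids, output_ids in zip(input_ids_batch, generated_batch)
]
print(new_tokens)  # [[42, 43, 44]]
```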

Chat Usage

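For chat-style interaction, reuse the model and tokenizer loaded in the Quickstart and build the prompt with apply_chat_template (this requires the tokenizer to define a chat template):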

prompt = "Give me a short introduction to large language models."
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512,
)
# Drop the prompt tokens so that only the newly generated reply is decoded.
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)

License

This model is released under the Apache 2.0 License.
