NekoMind1.5-Base
Introduction
NekoMind1.5 is the latest series of NekoMind large language models. It adopts a Mixture-of-Experts (MoE) architecture to balance model capacity against inference cost. Of its 1.35B total parameters, only ~300M are activated per token, so NekoMind1.5 delivers competitive performance at low computational cost during inference.
This repo contains the base (pre-trained) NekoMind1.5 model, which has the following features:
- Type: Causal Language Model
- Training Stage: Pretraining
- Architecture: Transformer decoder with RoPE, SwiGLU, RMSNorm, GQA, and Mixture-of-Experts
- Number of Parameters: 1.35B (Total) / ~300M (Activated)
- Number of Parameters (Non-Embedding): 1.32B
- Number of Layers: 18
- Number of Attention Heads (GQA): 8 for Q and 4 for KV
- Head Dimension: 128
- Context Length: 32,768 tokens
- Number of Experts: 32 (Top-4 routing)
- Shared Expert: Yes (with gating)
- Vocabulary Size: 32,006
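The total and non-embedding parameter figures in the list above can be cross-checked with a little arithmetic. This is a minimal sketch, assuming an embedding width of 1024 (8 query heads × 128 head dimension, as also shown in the architecture diagram) and input/output embeddings that share one matrix; the variable names are illustrative and not taken from the model code.

```python
# Cross-check of the parameter figures listed above (illustrative names).
VOCAB_SIZE = 32_006
HIDDEN_SIZE = 1024        # assumed embedding width: 8 query heads * 128 head dim
TOTAL_PARAMS = 1.35e9     # rounded total from the feature list

# Tied input/output embeddings are counted once.
embedding_params = VOCAB_SIZE * HIDDEN_SIZE
non_embedding_params = TOTAL_PARAMS - embedding_params

print(f"embedding params:     {embedding_params:,}")                 # 32,774,144
print(f"non-embedding params: {non_embedding_params / 1e9:.2f}B")    # 1.32B
```

The result agrees with the listed 1.32B non-embedding parameters, up to the rounding of the 1.35B total.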
Key Design Choices
- Mixture-of-Experts (MoE): 16 out of 18 layers use sparse MoE blocks with 32 experts and top-4 routing, enabling high model capacity with efficient inference.
- Dense Layers: The first 2 layers (layers 0 and 1) use a standard dense MLP for stable early feature extraction.
- Shared Expert with Gating: Each MoE layer includes a shared expert with a sigmoid gate, ensuring a baseline of knowledge is always available regardless of routing decisions.
- Grouped Query Attention (GQA): Uses 8 query heads and 4 key-value heads to reduce KV-cache memory usage.
- QK-Norm: Applies RMSNorm to query and key projections for training stability.
- RoPE: Rotary Position Embedding with a base frequency of 1,000,000 for strong long-context extrapolation.
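To put a number on the KV-cache saving from GQA, the sketch below estimates the cache size for one full-length sequence. It uses the layer count, head dimension, and context length stated in this card; the 2-byte cache element (bf16/fp16) is an assumption, and `kv_cache_bytes` is a hypothetical helper, not part of any library.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence.

    bytes_per_value=2 assumes a bf16/fp16 cache (an assumption, not from the card).
    """
    # Factor 2: one tensor for keys and one for values in every layer.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


GIB = 1024 ** 3

# NekoMind1.5 as described: 18 layers, 4 KV heads, head_dim 128, 32,768 tokens.
gqa = kv_cache_bytes(num_layers=18, num_kv_heads=4, head_dim=128, seq_len=32_768)
# Hypothetical baseline with one KV head per query head (8 KV heads).
mha = kv_cache_bytes(num_layers=18, num_kv_heads=8, head_dim=128, seq_len=32_768)

print(f"GQA (4 KV heads): {gqa / GIB:.3f} GiB")  # 1.125 GiB
print(f"MHA (8 KV heads): {mha / GIB:.3f} GiB")  # 2.250 GiB
```

Under these assumptions, halving the number of KV heads halves the cache for a full 32,768-token sequence.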
Architecture
The following diagram illustrates the overall architecture of NekoMind1.5:
┌─────────────────────────────────────────────────────────────┐
│                      NekoMind1.5-Base                       │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│    Input Tokens                                             │
│         │                                                   │
│         ▼                                                   │
│    ┌──────────┐                                             │
│    │Embedding │ (vocab: 32006, dim: 1024)                   │
│    └────┬─────┘                                             │
│         │                                                   │
│         ▼                                                   │
│  ╔══════════════════════════════════════════════════════╗   │
│  ║                  Decoder Layer × 18                  ║   │
│  ║                                                      ║   │
│  ║  ┌─────────────────────────────────────────────┐     ║   │
│  ║  │  RMSNorm                                    │     ║   │
│  ║  │     │                                       │     ║   │
│  ║  │     ▼                                       │     ║   │
│  ║  │  GQA Attention (8Q / 4KV, head_dim=128)     │     ║   │
│  ║  │    ├─ Q/K Projections → QK-Norm → RoPE      │     ║   │
│  ║  │    └─ Output Projection                     │     ║   │
│  ║  │     │                                       │     ║   │
│  ║  │     + (residual)                            │     ║   │
│  ║  │     │                                       │     ║   │
│  ║  │  RMSNorm                                    │     ║   │
│  ║  │     │                                       │     ║   │
│  ║  │     ▼                                       │     ║   │
│  ║  │  ┌───────────────────────────────────────┐  │     ║   │
│  ║  │  │ Layer 0-1: Dense MLP (SwiGLU)         │  │     ║   │
│  ║  │  │   gate_proj ─┐                        │  │     ║   │
│  ║  │  │   up_proj ───┼─→ SiLU(gate) * up      │  │     ║   │
│  ║  │  │              └─→ down_proj → output   │  │     ║   │
│  ║  │  ├───────────────────────────────────────┤  │     ║   │
│  ║  │  │ Layer 2-17: Sparse MoE Block          │  │     ║   │
│  ║  │  │                                       │  │     ║   │
│  ║  │  │ input ──┬──→ Router (TopK=4/32)       │  │     ║   │
│  ║  │  │         │      │                      │  │     ║   │
│  ║  │  │         │      ▼                      │  │     ║   │
│  ║  │  │         │   Expert × 32 (SwiGLU)      │  │     ║   │
│  ║  │  │         │      │ (weighted sum)       │  │     ║   │
│  ║  │  │         │      ▼                      │  │     ║   │
│  ║  │  │         └──→ Shared Expert (SwiGLU)   │  │     ║   │
│  ║  │  │                │ × σ(gate)            │  │     ║   │
│  ║  │  │                ▼                      │  │     ║   │
│  ║  │  │     expert_out + shared_out           │  │     ║   │
│  ║  │  └───────────────────────────────────────┘  │     ║   │
│  ║  │     │                                       │     ║   │
│  ║  │     + (residual)                            │     ║   │
│  ║  └─────────────────────────────────────────────┘     ║   │
│  ╚══════════════════════════════════════════════════════╝   │
│         │                                                   │
│         ▼                                                   │
│    ┌──────────┐                                             │
│    │ RMSNorm  │                                             │
│    └────┬─────┘                                             │
│         │                                                   │
│         ▼                                                   │
│    ┌──────────┐                                             │
│    │ LM Head  │ (tied with embedding weights)               │
│    └────┬─────┘                                             │
│         │                                                   │
│         ▼                                                   │
│    Output Logits (vocab: 32006)                             │
│                                                             │
└─────────────────────────────────────────────────────────────┘
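The sparse MoE block in the diagram (top-4 routing over 32 experts, plus a sigmoid-gated shared expert) can be sketched in plain Python. This is an illustration of the data flow only, not the repository's implementation: the function names are hypothetical, and the choice to apply softmax over all experts and then renormalise over the selected four is an assumption, since the card does not say how the router weights are normalised.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k=4):
    """Return [(expert_index, weight), ...] for the k highest-scoring experts.

    Assumption: softmax over all experts, then renormalise over the top-k.
    The actual NekoMind1.5 router may normalise differently.
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def moe_output(routed_outputs, weights, shared_output, shared_gate_logit):
    """Weighted sum of the routed experts plus the sigmoid-gated shared expert."""
    gate = 1.0 / (1.0 + math.exp(-shared_gate_logit))
    dim = len(shared_output)
    expert_out = [
        sum(w * out[d] for w, out in zip(weights, routed_outputs))
        for d in range(dim)
    ]
    return [expert_out[d] + gate * shared_output[d] for d in range(dim)]


# Toy example: 32 router logits, 4 experts selected.
selected = route_top_k([0.1 * i for i in range(32)], k=4)
print([index for index, _ in selected])  # [31, 30, 29, 28]

# Two routed expert outputs (weights 0.5 each) and a shared expert gated at sigmoid(0) = 0.5.
combined = moe_output([[1.0, 2.0], [3.0, 4.0]], [0.5, 0.5], [10.0, 20.0], 0.0)
print(combined)  # [7.0, 13.0]
```

Because the shared expert bypasses the router, its contribution is present for every token and is scaled only by its own gate.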
Requirements
Note: The NekoMind1.5 model code has not yet been merged into the main transformers library. You must enable trust_remote_code=True when loading the model to use the custom modeling code hosted in this repository.
- transformers >= 4.51.0
- torch >= 2.1.0
Install the required dependencies:
pip install "transformers>=4.51.0" "torch>=2.1.0" accelerate
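After installing, it can help to confirm that the environment meets the minimum versions listed above. The following is a rough sketch using only the standard library; `parse_release` and `meets_minimum` are hypothetical helpers written for this example, and they compare only the leading numeric release segments (pre-release and local suffixes such as `.dev0` or `+cu121` are ignored).

```python
from importlib.metadata import PackageNotFoundError, version


def parse_release(v):
    """Turn '2.1.0+cu121' or '4.51.0.dev0' into a tuple of leading integers."""
    parts = []
    for piece in v.split("+")[0].split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Compare release tuples, padding the shorter one with zeros."""
    a, b = parse_release(installed), parse_release(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


# Minimum versions from the requirements list above.
MINIMUMS = {"transformers": "4.51.0", "torch": "2.1.0"}

for name, minimum in MINIMUMS.items():
    try:
        installed = version(name)
    except PackageNotFoundError:
        print(f"{name}: not installed (need >= {minimum})")
        continue
    status = "ok" if meets_minimum(installed, minimum) else f"too old, need >= {minimum}"
    print(f"{name} {installed}: {status}")
```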
Quickstart
The snippet below loads the model and tokenizer and generates a continuation of a plain-text prompt. Because the architecture is not yet part of the upstream transformers library, pass trust_remote_code=True so that the custom model code in this repository is used.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "nekocyrene/NekoMind1.5-Base"

# Load the model and tokenizer with the custom code hosted in this repository
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    trust_remote_code=True,
)

# Base model: continue a plain-text prompt
prompt = "The theory of relativity"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **inputs,
    max_new_tokens=512,
)
# Keep only the newly generated tokens (drop the prompt)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
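The list comprehension in the snippet above is needed because, for decoder-only models, `generate` returns the prompt tokens followed by the continuation. The toy example below shows the same slicing on plain Python lists; the token IDs are made up for illustration and do not come from the NekoMind1.5 tokenizer.

```python
# Toy token IDs standing in for tensors; the values are made up.
input_ids_batch = [[101, 7, 8, 9]]              # tokenized prompt
generated_batch = [[101, 7, 8, 9, 42, 43, 44]]  # prompt + continuation

# Slice off the first len(prompt) tokens of each output sequence.
new_tokens = [
    output_ids[len(input_ids):]
    for input_ids, output_ids in zip(input_ids_batch, generated_batch)
]
print(new_tokens)  # [[42, 43, 44]]
```

Decoding only `new_tokens` keeps the printed response from repeating the prompt.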
Chat Usage
For chat-style interaction, use apply_chat_template with the model and tokenizer loaded in the Quickstart:
prompt = "Give me a short introduction to large language models."
messages = [
    {"role": "user", "content": prompt}
]
# Render the conversation into a single prompt string
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512,
)
# Keep only the newly generated tokens (drop the prompt)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
License
This model is released under the Apache 2.0 License.