Instructions for using OsaurusAI/LFM2.5-8B-A1B-JANG_2L with libraries, notebooks, and local apps.
- Libraries
- MLX
How to use OsaurusAI/LFM2.5-8B-A1B-JANG_2L with MLX:
# Make sure mlx-lm is installed
# pip install --upgrade mlx-lm

# Generate text with mlx-lm
from mlx_lm import load, generate

model, tokenizer = load("OsaurusAI/LFM2.5-8B-A1B-JANG_2L")

prompt = "Write a story about Einstein"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True
)

text = generate(model, tokenizer, prompt=prompt, verbose=True)
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- LM Studio
- Pi
How to use OsaurusAI/LFM2.5-8B-A1B-JANG_2L with Pi:
Start the MLX server
# Install MLX LM:
uv tool install mlx-lm

# Start a local OpenAI-compatible server:
mlx_lm.server --model "OsaurusAI/LFM2.5-8B-A1B-JANG_2L"
Configure the model in Pi
# Install Pi:
npm install -g @mariozechner/pi-coding-agent

# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "mlx-lm": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        { "id": "OsaurusAI/LFM2.5-8B-A1B-JANG_2L" }
      ]
    }
  }
}
Run Pi
# Start Pi in your project directory:
pi
- Hermes Agent
How to use OsaurusAI/LFM2.5-8B-A1B-JANG_2L with Hermes Agent:
Start the MLX server
# Install MLX LM:
uv tool install mlx-lm

# Start a local OpenAI-compatible server:
mlx_lm.server --model "OsaurusAI/LFM2.5-8B-A1B-JANG_2L"
Configure Hermes
# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup

# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default OsaurusAI/LFM2.5-8B-A1B-JANG_2L
Run Hermes
hermes
- MLX LM
How to use OsaurusAI/LFM2.5-8B-A1B-JANG_2L with MLX LM:
Generate or start a chat session
# Install MLX LM
uv tool install mlx-lm

# Interactive chat REPL
mlx_lm.chat --model "OsaurusAI/LFM2.5-8B-A1B-JANG_2L"
Run an OpenAI-compatible server
# Install MLX LM
uv tool install mlx-lm

# Start the server (listens on port 8080 by default)
mlx_lm.server --model "OsaurusAI/LFM2.5-8B-A1B-JANG_2L"

# Call the OpenAI-compatible server with curl
curl -X POST "http://localhost:8080/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "OsaurusAI/LFM2.5-8B-A1B-JANG_2L",
    "messages": [
      {"role": "user", "content": "Hello"}
    ]
  }'
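The same request can be sent from Python. The following is a minimal sketch using only the standard library; it assumes `mlx_lm.server` is already running locally on port 8080 (the port used in the Pi and Hermes configurations above). The helper names `build_chat_payload` and `chat` and the `max_tokens` value are illustrative, not part of any library.

```python
import json
import urllib.request

# Assumption: mlx_lm.server is running locally on port 8080.
BASE_URL = "http://localhost:8080/v1"
MODEL_ID = "OsaurusAI/LFM2.5-8B-A1B-JANG_2L"


def build_chat_payload(user_text, model=MODEL_ID, max_tokens=128):
    """Build the JSON body for a /chat/completions request."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
        "max_tokens": max_tokens,
    }


def chat(user_text, base_url=BASE_URL):
    """POST one user message and return the assistant's reply text."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(build_chat_payload(user_text)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    # OpenAI-style responses put the reply under choices[0].message.content.
    return body["choices"][0]["message"]["content"]


# Example (requires the server to be running):
# print(chat("Hello"))
```

Keeping payload construction separate from the network call makes the request body easy to inspect without a running server.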
LFM2.5-8B-A1B-JANG_2L
JANG_2L conversion of LiquidAI/LFM2.5-8B-A1B, built for Apple Silicon inference through JANG-aware MLX/vMLX runtimes.
This bundle is not a plain MLX 2-bit quant. It uses JANG importance allocation over MLX affine quantized tensors, with higher precision reserved for runtime-sensitive tensors.
Format
- Format: JANG affine
- Profile: JANG_2L
- Quantization backend: mx.quantize
- Group size: 64
- Actual bits from jang_config.json: 2.37
- Bit widths used: 2, 6, 8
- Passthrough bit width: 16
- Local size before upload: 2.9G
- JANG runtime weight size metadata: 2.84 GB
- Source model: LiquidAI/LFM2.5-8B-A1B
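To make "affine quantization with group size 64" concrete, here is a pure-Python sketch of the general idea: each group of weights shares one scale and one bias, and every weight is stored as a small integer code. This illustrates the technique only; it is not the `mx.quantize` implementation or the JANG allocation logic, and the function names are hypothetical. The storage estimate assumes one 16-bit scale and one 16-bit bias per group, which is an assumption rather than something stated in this card.

```python
def quantize_group(weights, bits):
    """Affine-quantize one group: w is approximated by code * scale + bias."""
    levels = (1 << bits) - 1          # e.g. 3 for 2-bit codes (0..3)
    w_min, w_max = min(weights), max(weights)
    # Guard against a constant group, where max == min.
    scale = (w_max - w_min) / levels if w_max > w_min else 1.0
    codes = [round((w - w_min) / scale) for w in weights]
    return codes, scale, w_min


def dequantize_group(codes, scale, bias):
    """Reconstruct approximate weights from integer codes."""
    return [code * scale + bias for code in codes]


def effective_bits(bits, group_size=64, overhead_bits=32):
    """Bits per weight including per-group scale/bias overhead (assumed 16 + 16 bits)."""
    return bits + overhead_bits / group_size


# One group of 64 evenly spaced weights in [-0.5, 0.5].
weights = [i / 63 - 0.5 for i in range(64)]
codes, scale, bias = quantize_group(weights, bits=2)
restored = dequantize_group(codes, scale, bias)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

With these assumptions, the rounding error per weight is bounded by half the scale, which is why tensors that are more sensitive to error are candidates for the 6-bit and 8-bit widths listed above.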
Runtime capability stamp:
{
"reasoning_parser": "qwen3",
"tool_parser": "lfm2",
"think_in_template": false,
"supports_tools": true,
"supports_thinking": true,
"family": "lfm2_moe",
"modality": "text",
"cache_type": "hybrid"
}
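As a rough sketch of how a runtime might consume this stamp, the snippet below loads the JSON and derives a few settings from it. The `runtime_plan` function and its output keys are hypothetical, and the reading of `think_in_template` (true meaning the template itself pre-opens `<think>`) is an interpretation based on the chat template notes in this card, not a documented API.

```python
import json

STAMP = json.loads("""
{
  "reasoning_parser": "qwen3",
  "tool_parser": "lfm2",
  "think_in_template": false,
  "supports_tools": true,
  "supports_thinking": true,
  "family": "lfm2_moe",
  "modality": "text",
  "cache_type": "hybrid"
}
""")


def runtime_plan(stamp):
    """Derive illustrative runtime settings from a capability stamp."""
    thinking = stamp.get("supports_thinking", False)
    return {
        # Which parser to use for <think>...</think>, if reasoning is supported.
        "reasoning_parser": stamp.get("reasoning_parser") if thinking else None,
        # Which parser to use for tool calls, if tools are supported.
        "tool_parser": stamp.get("tool_parser") if stamp.get("supports_tools") else None,
        # Assumed meaning: the template already opened <think>, so output starts mid-reasoning.
        "assume_open_think": thinking and stamp.get("think_in_template", False),
        # Hybrid models need more than a plain KV cache.
        "needs_state_cache": stamp.get("cache_type") == "hybrid",
    }


plan = runtime_plan(STAMP)
```

For this bundle the plan selects the `qwen3` reasoning parser and the `lfm2` tool parser, does not assume an already-open `<think>` block, and flags the need for a state cache.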
Runtime
Use a JANG-aware MLX/vMLX runtime. The model has hybrid cache behavior: attention layers use KV cache, while LIV convolution layers use convolution/state cache.
Example with the local JANG tools runtime:
python -m jang_tools inference \
--model OsaurusAI/LFM2.5-8B-A1B-JANG_2L \
--prompt "What is 2+2? Answer briefly." \
--max-tokens 128 \
--temperature 0
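The hybrid cache behavior described above can be pictured with a toy model: an attention cache grows by one entry per token, while a convolution state cache keeps only a fixed-size window of recent inputs. This is a conceptual sketch, not the JANG or MLX cache implementation; the class names, the three-layer layout, and the kernel size of 3 are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class KVCache:
    """Attention-style cache: one key/value pair is kept per past token."""
    keys: list = field(default_factory=list)
    values: list = field(default_factory=list)

    def update(self, key, value):
        self.keys.append(key)
        self.values.append(value)

    def __len__(self):
        return len(self.keys)


@dataclass
class ConvStateCache:
    """Convolution-style cache: only the last (kernel_size - 1) inputs are kept."""
    kernel_size: int
    window: list = field(default_factory=list)

    def update(self, x):
        keep = self.kernel_size - 1
        self.window.append(x)
        # Drop inputs that have slid out of the convolution window.
        self.window = self.window[-keep:] if keep > 0 else []

    def __len__(self):
        return len(self.window)


def make_caches(layer_types, kernel_size=3):
    """Create one cache object per layer, chosen by layer type."""
    return [
        KVCache() if kind == "attention" else ConvStateCache(kernel_size)
        for kind in layer_types
    ]


# Hypothetical 3-layer layout, for illustration only.
caches = make_caches(["conv", "conv", "attention"])
for token in range(10):
    for cache in caches:
        if isinstance(cache, KVCache):
            cache.update(key=token, value=token)
        else:
            cache.update(token)

sizes = [len(cache) for cache in caches]
```

After ten tokens the attention cache holds ten entries while each convolution cache still holds two, which is why a runtime that assumes every layer has a growing KV cache will not handle this model correctly.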
Chat Template And Reasoning
The bundled chat_template.jinja uses Liquid's ChatML-like format:
- User and assistant turns use <|im_start|>/<|im_end|>.
- The generation prompt ends at <|im_start|>assistant\n; it does not pre-open <think>.
- Assistant reasoning may appear inside <think>...</think>.
- Tool calls use Liquid's Python-call list format inside <|tool_call_start|> and <|tool_call_end|>.
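As an illustration of the tool-call format in the last bullet, the sketch below extracts a Python-call list from between the two markers and parses it with the standard `ast` module. It assumes calls use keyword arguments with literal values; positional arguments are ignored and malformed input raises `SyntaxError` or `ValueError`. The `parse_tool_calls` helper and the `get_weather` function name are hypothetical examples, not part of the model's tooling.

```python
import ast
import re

TOOL_CALL_RE = re.compile(
    r"<\|tool_call_start\|>(.*?)<\|tool_call_end\|>", re.DOTALL
)


def parse_tool_calls(text):
    """Return a list of (function_name, keyword_arguments) tuples."""
    calls = []
    for block in TOOL_CALL_RE.findall(text):
        # Parse the block as a Python expression without executing it.
        tree = ast.parse(block.strip(), mode="eval")
        if not isinstance(tree.body, ast.List):
            continue
        for node in tree.body.elts:
            if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
                # literal_eval only accepts literals, so no code is run.
                kwargs = {
                    kw.arg: ast.literal_eval(kw.value) for kw in node.keywords
                }
                calls.append((node.func.id, kwargs))
    return calls


sample = (
    "Checking the forecast."
    '<|tool_call_start|>[get_weather(city="Paris", days=3)]<|tool_call_end|>'
)
parsed = parse_tool_calls(sample)
```

Parsing with `ast` rather than `eval` keeps model output from being executed as code.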
For this bundle, think_in_template=false is intentional. Runtime code should parse reasoning if the model emits it, but should not force a second reasoning prefix.
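A minimal sketch of that parsing behavior: split a completed response into reasoning and answer when a `<think>...</think>` block is present, and pass the text through unchanged otherwise. The `split_reasoning` helper is hypothetical and handles only a single, closed block; it is not the `qwen3` reasoning parser named in the capability stamp.

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text):
    """Return (reasoning, answer); reasoning is empty if no <think> block exists."""
    match = THINK_RE.search(text)
    if match is None:
        # The model emitted no reasoning: do not invent or prepend any.
        return "", text.strip()
    reasoning = match.group(1).strip()
    # The answer is everything outside the reasoning block.
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer


with_think = split_reasoning("<think>\nSimple addition.\n</think>\n2 + 2 = 4")
without_think = split_reasoning("2 + 2 = 4")
```

Both calls return the same answer text; only the first returns non-empty reasoning.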
Verification
Local smoke run on the converted bundle:
- Prompt: What is 2+2? Answer briefly.
- Result: output closed <think>...</think> and answered 2 + 2 = 4.
- Reported generation speed: 206.886 tok/s
- Load time: 1.946 s
- Peak RSS: 3887 MB
This is a smoke test, not a benchmark suite or accuracy evaluation.
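Korean (English translation)
This model is an Apple Silicon bundle of LiquidAI/LFM2.5-8B-A1B converted to the JANG_2L format. think_in_template=false is the intended setting; the runtime should parse any <think>...</think> the model generates, but must not force an additional reasoning prefix.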