code example
Guys, is there any inference code to run the model? Python code or something like that?
Hey Shadow,
I'm not sure you've seen the disclaimer on the model card:
"Work in progress. This model exists primarily to test layer pruning + QLoRA recovery for browser deployment. Model behavior may differ from the original 32L model. No guarantees — use at your own discretion."
That being said, I built this to play with speech-to-speech in the browser. The inference code (in Rust, sorry) is at https://github.com/idle-intelligence/sts-web and there's a demo link in there too.
If you're interested, I do have, somewhere, a version of NVIDIA's Python inference that will run this model. Let me know!
Hey ilnmtlbnm, thanks for sharing, this looks really interesting.
I don't think I'm seeing the model card on my end; not sure if I missed it.
Also, I’d definitely be interested in the Python inference version (the NVIDIA one you mentioned), if you don’t mind sharing it. That would be super helpful for me to experiment with.
Appreciate it
Hey Shadow, I appreciate your interest. I meant the model card here: https://huggingface.co/idle-intelligence/personaplex-24L-q4_k-webgpu
For the Python inference... I didn't think this through when answering initially (I did this a few weeks back already 😅).
This GGUF is custom (layer-pruned + LoRA-recovered + Q4_K with Moshi-style tensor names), so it can't be loaded with NVIDIA's PersonaPlex Python inference as-is.
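Since the file is a custom GGUF, one cheap sanity check before trying any loader is to read the fixed-size container header. The sketch below uses only the standard library and assumes the GGUF v2+ little-endian layout (4-byte magic, `uint32` version, then two `uint64` counts). The function name and the path in the usage comment are illustrative, not part of the repo. It only confirms the container is well formed; it says nothing about whether a given runtime understands the Moshi-style tensor names.

```python
import struct


def read_gguf_header(path):
    """Return (version, tensor_count, metadata_kv_count) from a GGUF file.

    Assumes the GGUF v2+ little-endian header layout:
    magic b"GGUF", uint32 version, uint64 tensor count, uint64 KV count.
    """
    with open(path, "rb") as f:
        header = f.read(24)
    if len(header) < 24 or header[:4] != b"GGUF":
        raise ValueError(f"{path} does not look like a GGUF file")
    version, tensor_count, kv_count = struct.unpack("<IQQ", header[4:24])
    if version < 2:
        raise ValueError("GGUF v1 uses a different header layout")
    return version, tensor_count, kv_count


# Usage (hypothetical local path to the downloaded model file):
# print(read_gguf_header("personaplex-24L-q4_k-webgpu/model.gguf"))
```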
I added a native Rust CLI that runs the model end-to-end (Vulkan on Linux, Metal on macOS, no Python required):
https://github.com/idle-intelligence/sts-web#native-cli-sts
huggingface-cli download idle-intelligence/personaplex-24L-q4_k-webgpu \
--local-dir personaplex-24L-q4_k-webgpu
cargo run --release --features "wgpu,cli" --bin sts -- \
--model-dir ./personaplex-24L-q4_k-webgpu \
--input question.wav \
--output response.wav \
--voice NATF2
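If you would rather drive the CLI from Python, the cargo invocation above can be assembled as an argument list and passed to `subprocess`. This is only a thin wrapper sketch: `build_sts_command` is a made-up helper name, the flags are copied from the command above, and it assumes `cargo` is on the PATH and that the command is run from a checkout of the sts-web repo.

```python
import shlex


def build_sts_command(model_dir, input_wav, output_wav, voice="NATF2"):
    """Build the argv list for the native `sts` CLI (flags as posted above)."""
    return [
        "cargo", "run", "--release",
        "--features", "wgpu,cli",
        "--bin", "sts", "--",
        "--model-dir", str(model_dir),
        "--input", str(input_wav),
        "--output", str(output_wav),
        "--voice", voice,
    ]


cmd = build_sts_command(
    "./personaplex-24L-q4_k-webgpu", "question.wav", "response.wav"
)
print(shlex.join(cmd))  # shell-quoted form, handy for logging

# To actually run it (the cwd path is a placeholder for your sts-web checkout):
# import subprocess
# subprocess.run(cmd, cwd="path/to/sts-web", check=True)
```

Passing a list instead of a shell string avoids quoting problems if the WAV paths contain spaces.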
Takes ~5 min for the first build (Burn + cubecl), then ~64 ms/frame on a 3080. Voices: NATF0..3, NATM0..3, VARF0..4, VARM0..4.
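To put those numbers in context, here is a small back-of-the-envelope sketch. It makes two assumptions that are not stated above: that the voice ranges are inclusive (so `NATF0..3` means four voices), and that one "frame" is an 80 ms Mimi codec frame (12.5 Hz), as in Moshi. If either is off for this build, the figures change accordingly.

```python
# Last index of each voice family, read from the list above (assumed inclusive).
VOICE_RANGES = {"NATF": 3, "NATM": 3, "VARF": 4, "VARM": 4}

voices = [
    f"{prefix}{i}"
    for prefix, last in VOICE_RANGES.items()
    for i in range(last + 1)
]

FRAME_MS = 80.0            # assumption: 12.5 Hz Mimi frame rate
MEASURED_MS_PER_FRAME = 64.0  # figure quoted above for a 3080

# Below 1.0 means the model keeps up with real-time audio.
real_time_factor = MEASURED_MS_PER_FRAME / FRAME_MS


def estimated_processing_seconds(audio_seconds):
    """Rough compute time for a clip, excluding model load and build time."""
    frames = audio_seconds * 1000.0 / FRAME_MS
    return frames * MEASURED_MS_PER_FRAME / 1000.0


print(len(voices), voices)
print(real_time_factor, estimated_processing_seconds(10.0))
```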