Instructions to use mlx-community/Nex-N2-mini-mlx-optiq-static-mixed-3_6bits with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- MLX
How to use mlx-community/Nex-N2-mini-mlx-optiq-static-mixed-3_6bits with MLX:
# Download the model from the Hub pip install huggingface_hub[hf_xet] huggingface-cli download --local-dir Nex-N2-mini-mlx-optiq-static-mixed-3_6bits mlx-community/Nex-N2-mini-mlx-optiq-static-mixed-3_6bits
- Notebooks
- Google Colab
- Kaggle
- Local Apps Settings
- LM Studio
Info on this Version
An mlx-optiq quantized version of nex-agi/Nex-N2-mini, only from 70.2 GB its size was reduced to... 17.8! For static version, there's optiq-mixed variation, 37 GB, but, i added nex-n2.txt where the same user prompt/system instruction are sent to both versions and... honestly, i see no sense in adding 37GB version here, for they are roughtly the same. While the reduction from 70 to 17 is usually expected to have, as an outcome, the death of 'brain' in a model, optiq quantization, albeit aggressiveness, makes local inference with low mem peak as accessible as it has never been! See for yourself. Smart quantization and Open Weights ethos are the future killers of mainstream big tech monoculture monstrosities!
Yes, its 'Need maybe' in reasoning may be slightly annoying, and yet, no parasitic words bleed into the real output for user, as well as the model is adherent to system instructions concerning formatting of the text which you can see from the text file (strict adherence to system prompt is perhaps among my most cherished virtues of qwen models and their derivations, which this model is as well, with ssm layers mutations, if config is to be believed)
Note on mlx-optiq version: I use old version, not 0.2.x they have now, not even 0.1.x, but 0.0.11 in venv. Why? Because monkey-patching --sensitivity flag with 'streaming' option each time i update the package would suck, but what REALLY, like, REALLY sucks, is the way the quantization takes place in the latest version: there is no actual RAM sparing, it can process 14/596 layers in some 14-16 hours and THEN OOM, and this is considered to be optimized for local devices (like mine M1 Pro 64 GB URAM/24 GPU cores)... It is NOT even suboptimal, unlike old, 0.0.11 version's streaming option: less than 5 minutes and you get it done as you see it in this repo. Just an information someone may find useful when trying to quantize with mlx-optiq. Aside from that, its advantages over my two otherwise fav quantizations JANG-Q and turboquant-MLX-full is that it requires no additional import or separate cli args/commands for terminal, just basic mlx_lm.* – i am pissed off by packages that eventually build on mlx_lm yet would require to import their shit to basic functions or call their functions because otherwise their quantized models won't generate, raising the errors 'name_of_method_package' missing! mlx-optiq builds on mlx_lm as well but SMARTLY, hiding all monkey patching into the pipeline so that it is noticeable only if you decide to walk through the .py files in the package library, but no visible actions come to sight otherwise, no latency added! a lesson for everyone who's in it
- Downloads last month
- 159
4-bit
Model tree for mlx-community/Nex-N2-mini-mlx-optiq-static-mixed-3_6bits
Base model
nex-agi/Nex-N2-mini