Alan Tseng



agentlans's activity
I'm not surprised, actually. The developers of ChatGPT and DeepSeek probably aren't trying to maximize Italian benchmarks. For specialized tasks, well-trained small models can beat much bigger ones. Certainly, the Italians know what kind of fresh, high-quality ingredients are best for training their model.
Also, not every hobby project needs to be commercialized. In many cases Mistral is fine.
Hi Nikki,
It sounds like you need business advice and not necessarily more AI. I think once you have a clearer idea of what you want to do, it should be easier to attract more people to your projects (whether they involve AI or not). If you approach this from a customer's point of view, based on what you know about AI, you can find all sorts of new possibilities.
Good luck!
An area of personal interest is finding ways to optimize the inference performance of LLMs when deployed in resource-constrained environments like commodity hardware, desktops, laptops, mobiles, edge devices, etc.
The method that I'm using to produce these experimental versions, for example eaddario/DeepSeek-R1-Distill-Llama-8B-GGUF, is explained in https://medium.com/@eaddario/squeezing-tensor-bits-the-quest-for-smaller-llms-86b23bd052ca
At a high level, it involves using a custom version of the llama-quantize tool to selectively quantize different tensors at different precision levels (a rough sketch of the idea follows at the end of this post). On average, a 10% or more reduction in model size with little loss of quality is possible.
There are two PRs open to merge these changes back into the core project, but until then the modified version is available on GitHub: https://github.com/EAddario/llama.cpp/tree/quantize
Would love to hear if you can achieve smaller sizes at higher quality!
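For illustration only, here is a rough Python sketch of the per-tensor selection idea. The actual implementation is the modified llama-quantize (C++) linked above; the tensor-name patterns and quant-type choices below are assumptions, not the rules the tool really uses.

```python
import re

# Hypothetical per-tensor quantization plan: map tensor-name patterns to quant
# types, quantizing more aggressively where quality is less sensitive and
# keeping higher precision elsewhere. Patterns and types are illustrative only.
QUANT_PLAN = [
    (r"token_embd\.weight",        "Q6_K"),  # keep embeddings at higher precision
    (r"output\.weight",            "Q6_K"),  # output tensor tends to be quality-sensitive
    (r"blk\.\d+\.attn_.*\.weight", "Q4_K"),  # attention projections
    (r"blk\.\d+\.ffn_.*\.weight",  "Q3_K"),  # feed-forward tensors, quantized harder
]
DEFAULT_TYPE = "Q4_K"

def pick_quant_type(tensor_name: str) -> str:
    """Return the quant type for the first pattern that matches the tensor name."""
    for pattern, qtype in QUANT_PLAN:
        if re.fullmatch(pattern, tensor_name):
            return qtype
    return DEFAULT_TYPE

# Example: decide types for a few (made-up) GGUF-style tensor names.
for name in ["token_embd.weight", "blk.0.attn_q.weight", "blk.0.ffn_down.weight"]:
    print(f"{name} -> {pick_quant_type(name)}")
```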
Yeah, I don't think EXL2 and GGUF are compatible at all. It's just that your layer-specific quantization method is conceptually similar to EXL2.
In the end, we're limited to what GGUF can support. And if you manage to build those optimizations in, then every little bit helps!
Note: I use both formats. ExLlamaV2 has faster inference than my custom-compiled llama.cpp, but it has a few drawbacks (less popular, slower loading, slower quantization, sometimes unstable, etc.).
Nice to know that GGUF can be optimized further! By the way, I came across another approach, ExLlamaV2's EXL2 format, which also uses selective quantization but with a wider range of bit-widths and mixing within layers (a toy bit-width calculation is below the link). What do you think?
https://github.com/turboderp-org/exllamav2?tab=readme-ov-file#exl2-quantization
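A toy calculation of why mixing bit-widths within a layer gives a fractional average bits per weight. EXL2 actually picks its mixes from calibration measurements to hit a target; the fractions and bit-widths here are made-up numbers for illustration.

```python
# Toy illustration of mixed-precision averaging: if different fractions of a
# layer's weights are stored at different bit-widths, the effective bits per
# weight is the weighted average. These numbers are invented, not EXL2's
# measurement-driven choices.
groups = [
    (0.10, 8),  # 10% of weights at 8 bits (most quantization-sensitive)
    (0.60, 4),  # 60% at 4 bits
    (0.30, 3),  # 30% at 3 bits
]

avg_bpw = sum(frac * bits for frac, bits in groups)
print(f"effective bits per weight: {avg_bpw:.2f}")  # 0.8 + 2.4 + 0.9 = 4.10
```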