Alan Tseng

agentlans

AI & ML interests

Small data, boring AI

Organizations

None yet

agentlans's activity

replied to giux78's post 9 days ago

I'm not surprised, actually. The developers of ChatGPT and DeepSeek probably aren't trying to maximize Italian benchmarks. For specialized tasks, well-trained small models can beat much bigger ones. Certainly, the Italians know what kind of fresh, high-quality ingredients are best for training their model.

Also, not every hobby project needs to be commercialized. In many cases Mistral is fine.

replied to Dragunflie-420's post 12 days ago

Hi Nikki,

It sounds like you need business advice and not necessarily more AI. I think once you have a clearer idea of what you want to do, it should be easier to attract more people to your projects (whether they involve AI or not). If you approach this from a customer's point of view, combined with what you know about AI, you can find all sorts of new possibilities.

Good luck!

reacted to eaddario's post with 👍 12 days ago
Squeezing Tensor Bits: the quest for smaller LLMs

An area of personal interest is finding ways to optimize the inference performance of LLMs when deployed in resource-constrained environments like commodity hardware, desktops, laptops, mobiles, edge devices, etc.

The method that I'm using to produce these experimental versions, for example eaddario/DeepSeek-R1-Distill-Llama-8B-GGUF, is explained in https://medium.com/@eaddario/squeezing-tensor-bits-the-quest-for-smaller-llms-86b23bd052ca

At a high level, it involves using a custom version of the llama-quantize tool to selectively quantize different tensors at different levels. On average, a reduction in model size of 10% or more is possible with little loss of quality.
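For a rough idea of what selective quantization looks like in practice, here is a minimal sketch that drives the stock llama-quantize CLI from Python. It only uses the mainline --token-embedding-type and --output-tensor-type flags (which control two tensor classes); the custom fork described above goes further and targets individual tensors. The file names below are placeholders, not the author's actual workflow.

```python
# Sketch: quantize most tensors to a low-bit base type while keeping the
# token-embedding and output tensors at higher precision.
# Assumes a built llama-quantize binary from llama.cpp in the current directory.
import subprocess

def quantize(src_gguf: str, dst_gguf: str, base_type: str = "Q4_K_M") -> None:
    """Quantize src_gguf to base_type, overriding two sensitive tensor classes."""
    cmd = [
        "./llama-quantize",
        "--token-embedding-type", "Q6_K",  # keep token embeddings at higher precision
        "--output-tensor-type", "Q6_K",    # keep the output head at higher precision
        src_gguf,
        dst_gguf,
        base_type,
    ]
    subprocess.run(cmd, check=True)

if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    quantize("model-F16.gguf", "model-Q4_K_M.gguf")
```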

There are two PRs to merge these changes back into the core project, but until then the modified version will be available on GitHub: https://github.com/EAddario/llama.cpp/tree/quantize

Would love to hear if you can achieve smaller sizes at higher quality!
replied to eaddario's post 12 days ago

Yeah, I don't think EXL2 and GGUF are compatible at all. It's just that your layer-specific quantization method is conceptually similar to EXL2.

In the end, we're limited to what GGUF can support. And if you manage to build those optimizations in, then every little bit helps!

Note: I use both formats. ExLlamaV2 has faster inference than my custom-compiled llama.cpp, but it has a few drawbacks (less popular, slower loading, slower quantization, sometimes unstable, etc.).

replied to eaddario's post 14 days ago