Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients
Paper • 2407.08296 • Published • 31
Note Efficient LLM implementation, code available, worth trying? Trains a LLaMA-7B model from scratch on a single NVIDIA RTX 4060 Ti with only 16 GB of memory.
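The idea named in the title — GaLore-style low-rank gradient projection, with the projection matrices themselves held in INT4 and refreshed lazily — can be sketched as follows. This is a minimal PyTorch illustration, not the paper's implementation: `LowRankProjector`, `int4_quantize`, the group size, rank, and the fixed `update_gap` are hypothetical choices, and the paper's layer-adaptive projection-refresh rule is simplified here to a fixed interval.

```python
import torch

def int4_quantize(x, group_size=64):
    # Symmetric per-group INT4 quantization (illustrative, not the paper's
    # exact scheme). Assumes x.numel() is divisible by group_size.
    flat = x.reshape(-1, group_size)
    scale = flat.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 7.0
    q = torch.clamp(torch.round(flat / scale), -8, 7)
    return q, scale

def int4_dequantize(q, scale, shape):
    return (q * scale).reshape(shape)

class LowRankProjector:
    """GaLore-style low-rank gradient projector whose projection matrix is
    stored INT4-quantized and recomputed every `update_gap` steps."""
    def __init__(self, rank=128, update_gap=200):
        self.rank, self.update_gap = rank, update_gap
        self.step, self.q, self.scale, self.shape = 0, None, None, None

    def project(self, grad):
        # grad: full-rank gradient of a 2-D weight, shape (m, n).
        if self.q is None or self.step % self.update_gap == 0:
            # Refresh the projection from the gradient's top-r left subspace.
            U, _, _ = torch.linalg.svd(grad.float(), full_matrices=False)
            P = U[:, : self.rank]
            self.shape = P.shape
            self.q, self.scale = int4_quantize(P)
        self.step += 1
        P = int4_dequantize(self.q, self.scale, self.shape)
        return P.T @ grad            # low-rank gradient, shape (r, n)

    def project_back(self, low_rank_update):
        # Map the rank-r optimizer update back to the full weight shape.
        P = int4_dequantize(self.q, self.scale, self.shape)
        return P @ low_rank_update   # shape (m, n)
```

In a training loop, each 2-D weight's gradient passes through `project`, the optimizer update is computed in the rank-`r` space, and `project_back` restores the full shape, so optimizer states scale with `r` rather than the full hidden dimension — which is what makes 16 GB feasible for a 7B model.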
Note In this paper, we introduce an easily deployable inference performance optimization solution aimed at accelerating LLMs on CPUs. We propose an effective way to reduce the KV cache size while ensuring precision, and a distributed inference optimization approach implemented on top of the oneAPI Collective Communications Library.
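The note does not spell out how the KV cache is shrunk; one common way to reduce its size while bounding precision loss is low-bit per-token quantization of keys and values, sketched below under that assumption. The helper names and the 8-bit setting are illustrative, not the paper's scheme.

```python
import torch

def quantize_kv(t, bits=8):
    # Per-token symmetric quantization of a (seq, heads, head_dim) tensor:
    # store int8 values plus one fp scale per token to bound the error.
    qmax = 2 ** (bits - 1) - 1
    scale = t.abs().amax(dim=(-2, -1), keepdim=True).clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(t / scale), -qmax - 1, qmax).to(torch.int8)
    return q, scale

def dequantize_kv(q, scale):
    return q.float() * scale

# Usage: quantize each new token's K/V before appending to the cache,
# and dequantize on the fly inside attention.
k = torch.randn(1, 16, 64)           # one new token: (1, 16 heads, 64 dim)
qk, s = quantize_kv(k)
k_restored = dequantize_kv(qk, s)
print((k - k_restored).abs().max())  # small per-token quantization error
```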