Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients
Paper • 2407.08296 • Published • 31
Note Efficient LLM implementation, code available, worth trying? Trains a LLaMA-7B model from scratch on a single NVIDIA RTX 4060 Ti with only 16 GB of memory.
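The idea named in the title — GaLore-style low-rank gradient projection, with the projection matrices themselves held in INT4 and refreshed lazily — can be sketched as follows. This is a minimal PyTorch illustration, not the paper's implementation: `LowRankProjector`, `int4_quantize`, the group size, rank, and the fixed `update_gap` are hypothetical choices, and the paper's layer-adaptive projection-refresh rule is simplified here to a fixed interval.

```python
import torch

def int4_quantize(x, group_size=64):
    # Symmetric per-group INT4 quantization (illustrative, not the paper's
    # exact scheme). Assumes x.numel() is divisible by group_size.
    flat = x.reshape(-1, group_size)
    scale = flat.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 7.0
    q = torch.clamp(torch.round(flat / scale), -8, 7)
    return q, scale

def int4_dequantize(q, scale, shape):
    return (q * scale).reshape(shape)

class LowRankProjector:
    """GaLore-style low-rank gradient projector whose projection matrix is
    stored INT4-quantized and recomputed every `update_gap` steps."""
    def __init__(self, rank=128, update_gap=200):
        self.rank, self.update_gap = rank, update_gap
        self.step, self.q, self.scale, self.shape = 0, None, None, None

    def project(self, grad):
        # grad: full-rank gradient of a 2-D weight, shape (m, n).
        if self.q is None or self.step % self.update_gap == 0:
            # Refresh the projection from the gradient's top-r left subspace.
            U, _, _ = torch.linalg.svd(grad.float(), full_matrices=False)
            P = U[:, : self.rank]
            self.shape = P.shape
            self.q, self.scale = int4_quantize(P)
        self.step += 1
        P = int4_dequantize(self.q, self.scale, self.shape)
        return P.T @ grad            # low-rank gradient, shape (r, n)

    def project_back(self, low_rank_update):
        # Map the rank-r optimizer update back to the full weight shape.
        P = int4_dequantize(self.q, self.scale, self.shape)
        return P @ low_rank_update   # shape (m, n)
```

In a training loop, each 2-D weight's gradient passes through `project`, the optimizer update is computed in the rank-`r` space, and `project_back` restores the full shape, so optimizer states scale with `r` rather than the full hidden dimension — which is what makes 16 GB feasible for a 7B model.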
Note In this paper, we introduce an easily deployable inference performance optimization solution aimed at accelerating LLMs on CPUs. We propose an effective way to reduce the KV cache size while ensuring precision, and a distributed inference optimization approach implemented on top of the oneAPI Collective Communications Library.
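The note does not spell out how the KV cache is shrunk; one common way to reduce its size while bounding precision loss is low-bit per-token quantization of keys and values, sketched below under that assumption. The helper names and the 8-bit setting are illustrative, not the paper's scheme.

```python
import torch

def quantize_kv(t, bits=8):
    # Per-token symmetric quantization of a (seq, heads, head_dim) tensor:
    # store int8 values plus one fp scale per token to bound the error.
    qmax = 2 ** (bits - 1) - 1
    scale = t.abs().amax(dim=(-2, -1), keepdim=True).clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(t / scale), -qmax - 1, qmax).to(torch.int8)
    return q, scale

def dequantize_kv(q, scale):
    return q.float() * scale

# Usage: quantize each new token's K/V before appending to the cache,
# and dequantize on the fly inside attention.
k = torch.randn(1, 16, 64)           # one new token: (1, 16 heads, 64 dim)
qk, s = quantize_kv(k)
k_restored = dequantize_kv(qk, s)
print((k - k_restored).abs().max())  # small per-token quantization error
```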