Renderlib-dev
1 follower · 101 following
AI & ML interests
None yet
Recent Activity
reacted to Banaxi-Tech's post about 12 hours ago
Today we are releasing BananaMind-KV1-8M-2Bit-Experimental, a KV-cache-aware trained model that stores its generation KV cache in 2-bit precision instead of the usual 16-bit precision.

Result: 5.33x smaller KV cache vs FP16, with 0.0916 mean KLD against a 16-bit KV cache reference on WikiText-2.

Model: https://huggingface.co/BananaMind/BananaMind-KV1-8M-2Bit-Experimental

The important part: this is not just post-training KV cache quantization. Instead we take the BitNet approach. KV1 is trained with a 2-bit-aware K/V path. Instead of training a normal model and quantizing the cache afterwards, the model learns during training to operate under the low-bit KV constraint, closer in spirit to the BitNet idea of training for the low-bit regime.

During generation, each K/V vector is quantized into 4 affine levels and packed into uint8 tensors, with four 2-bit values stored per byte.

WikiText-2 eval vs 16-bit KV cache reference:
Mean KLD: 0.0916 nats/token
Mean KLD: 0.1322 bits/token
Average KV cache shrink vs FP16: 5.33x
Evaluated positions: 372,675

If this actually gets used in models like Qwen or Gemma, then it may be possible to run 128K or even 256K Context on a Normal Machine!

Try it here: https://huggingface.co/BananaMind/BananaMind-KV1-8M-2Bit-Experimental
Code: https://github.com/Banaxi-Tech/kv1
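The storage scheme described in the post (4 affine levels per K/V vector, four 2-bit values per byte) can be illustrated with a minimal pure-Python sketch. This is not the KV1 implementation: the function names, the per-vector min/max choice of scale and offset, and the low-bits-first packing order are all assumptions made for illustration. The small `nats_to_bits` helper only shows how the two KLD figures in the post relate to each other.

```python
import math

LEVELS = 4  # 2 bits give 4 representable codes: 0, 1, 2, 3


def quantize_2bit(vec):
    """Map floats to codes 0..3 with a per-vector affine (scale, offset) pair.

    Assumption: scale and offset come from the vector's min and max.
    """
    lo, hi = min(vec), max(vec)
    scale = (hi - lo) / (LEVELS - 1) or 1.0  # fall back to 1.0 for constant vectors
    codes = [min(LEVELS - 1, max(0, round((x - lo) / scale))) for x in vec]
    return codes, scale, lo


def pack_codes(codes):
    """Pack four 2-bit codes into each byte (assumed order: lowest bits first)."""
    assert len(codes) % 4 == 0, "pad the vector to a multiple of 4 first"
    out = bytearray()
    for i in range(0, len(codes), 4):
        a, b, c, d = codes[i:i + 4]
        out.append(a | (b << 2) | (c << 4) | (d << 6))
    return bytes(out)


def unpack_codes(packed):
    """Inverse of pack_codes: recover four 2-bit codes from each byte."""
    codes = []
    for byte in packed:
        codes.extend((byte >> shift) & 0b11 for shift in (0, 2, 4, 6))
    return codes


def dequantize(codes, scale, offset):
    """Reconstruct approximate floats from codes and the affine parameters."""
    return [offset + c * scale for c in codes]


def nats_to_bits(nats):
    """Convert a KL divergence from nats to bits (divide by ln 2)."""
    return nats / math.log(2)


# Hypothetical 8-element vector standing in for one K or V vector.
vec = [-1.0, -0.2, 0.3, 1.0, 0.9, -0.7, 0.0, 0.5]
codes, scale, offset = quantize_2bit(vec)
packed = pack_codes(codes)            # 8 values -> 2 bytes
restored = dequantize(unpack_codes(packed), scale, offset)
max_error = max(abs(x - y) for x, y in zip(vec, restored))
kld_bits = nats_to_bits(0.0916)       # figure quoted in the post
```

In this sketch the reconstruction error of each element is bounded by half the step size (`scale / 2`), and converting the quoted 0.0916 nats/token gives roughly 0.1322 bits/token, matching the post. The raw 2-bit payload alone would be 8x smaller than FP16; the reported 5.33x equals 16/3, which would be consistent with about 3 effective bits per value once per-vector scale and offset are stored, although the post does not spell out that accounting.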
reacted to Banaxi-Tech's post about 12 hours ago
new activity about 12 hours ago in huggingface/InferenceSupport: bharatgenai/sooktam2
Organizations
Renderlib-dev's Spaces (2)
CODEVED (Running, 2 likes): Demo version of Vedika
TTS (Build error): Text to speech