nltpt

Enterprise
Activity Feed

AI & ML interests

None defined yet.

Recent Activity

nltpt's activity

Xenova 
posted an update about 20 hours ago
Xenova 
posted an update 15 days ago
Narsil 
posted an update 21 days ago
view post
Post
1012
Performance leap: TGI v3 is out. Processes 3x more tokens, 13x faster than vLLM on long prompts. Zero config !



3x more tokens.

By reducing our memory footprint, we’re able to ingest many more tokens and more dynamically than before. A single L4 (24GB) can handle 30k tokens on llama 3.1-8B, while vLLM gets barely 10k. A lot of work went into reducing the footprint of the runtime and its effect are best seen on smaller constrained environments.
13x faster

On long prompts (200k+ tokens) conversation replies take 27.5s in vLLM, while it takes only 2s in TGI. How so ? We keep the initial conversation around, so when a new reply comes in, we can answer almost instantly. The overhead of the lookup is ~5us. Thanks @Dani ël de Kok for the beast data structure.
Zero config

That’s it. Remove all the flags your are using and you’re likely to get the best performance. By evaluating the hardware and model, TGI carefully selects automatic values to give best performance. In production, we don’t have any flags anymore in our deployments. We kept all existing flags around, they may come in handy in niche scenarios.

Read more: https://huggingface.co/docs/text-generation-inference/conceptual/chunking
Xenova 
posted an update 25 days ago
view post
Post
2950
Introducing TTS WebGPU: The first ever text-to-speech web app built with WebGPU acceleration! 🔥 High-quality and natural speech generation that runs 100% locally in your browser, powered by OuteTTS and Transformers.js. 🤗 Try it out yourself!

Demo: webml-community/text-to-speech-webgpu
Source code: https://github.com/huggingface/transformers.js-examples/tree/main/text-to-speech-webgpu
Model: onnx-community/OuteTTS-0.2-500M (ONNX), OuteAI/OuteTTS-0.2-500M (PyTorch)
reach-vb 
posted an update 26 days ago
view post
Post
3416
VLMs are going through quite an open revolution AND on-device friendly sizes:

1. Google DeepMind w/ PaliGemma2 - 3B, 10B & 28B: google/paligemma-2-release-67500e1e1dbfdd4dee27ba48

2. OpenGVLabs w/ InternVL 2.5 - 1B, 2B, 4B, 8B, 26B, 38B & 78B: https://huggingface.co/collections/OpenGVLab/internvl-25-673e1019b66e2218f68d7c1c

3. Qwen w/ Qwen 2 VL - 2B, 7B & 72B: Qwen/qwen2-vl-66cee7455501d7126940800d

4. Microsoft w/ FlorenceVL - 3B & 8B: https://huggingface.co/jiuhai

5. Moondream2 w/ 0.5B: https://huggingface.co/vikhyatk/

What a time to be alive! 🔥

Create USE_POLICY.md

#8 opened 27 days ago by
reach-vb

Create LICENSE

#7 opened 27 days ago by
reach-vb