gg-hf

Recent Activity

gg-hf's activity

alvarobartt posted an update 16 days ago
🔥 Agents can do anything! @microsoft Research just announced the release of Magma 8B!

Magma is a new Visual Language Model (VLM) with 8B parameters for multi-modal agents, designed to handle complex interactions across virtual and real environments, and it's MIT licensed!

Magma comes with exciting new features such as:
- Set-of-Mark and Trace-of-Mark techniques for fine-tuning
- Use of a large amount of unlabeled video data to learn spatial-temporal grounding and planning
- Strong generalization and the ability to be fine-tuned for other agentic tasks
- SOTA results on multi-modal benchmarks spanning UI navigation, robotics manipulation, image/video understanding, and spatial understanding and reasoning
- Generation of goal-driven visual plans and actions for agentic use cases

Model: microsoft/Magma-8B
Technical Report: Magma: A Foundation Model for Multimodal AI Agents (2502.13130)

Xenova posted an update about 1 month ago
We did it. Kokoro TTS (v1.0) can now run 100% locally in your browser w/ WebGPU acceleration. Real-time text-to-speech without a server. ⚡️

Generate 10 seconds of speech in ~1 second for $0.

What will you build? 🔥
webml-community/kokoro-webgpu

The most difficult part was getting the model running in the first place, but the next steps are simple:
✂️ Implement sentence splitting, allowing for streamed responses
🌍 Multilingual support (only phonemization left)

Who wants to help?
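
The sentence-splitting step mentioned above could look roughly like the sketch below. It is not the code used in kokoro-webgpu: `splitSentences` and `streamSentences` are hypothetical helpers, and the regex is a naive assumption that will mis-split abbreviations such as "e.g.". The point is only that each sentence can be handed to the TTS model as soon as it is complete, so audio can start before the whole text has been synthesized.

```javascript
// Hypothetical helper (not part of kokoro-js): split text into sentences
// so that each one can be synthesized and played as soon as it is ready.
function splitSentences(text) {
  // A sentence is a run of non-terminator characters followed by one or
  // more terminators (. ! ?), or a trailing run with no terminator at all.
  const matches = text.match(/[^.!?]+[.!?]+|[^.!?]+$/g) ?? [];
  return matches.map((s) => s.trim()).filter((s) => s.length > 0);
}

// Generator form: yields sentences one at a time, which maps naturally
// onto a streaming loop.
function* streamSentences(text) {
  for (const sentence of splitSentences(text)) {
    yield sentence;
  }
}
```

In a streaming loop, each yielded sentence would presumably be passed to the model on its own (for example one `tts.generate(...)` call per sentence) and the resulting audio chunks queued for playback in order.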

ariG23498 posted an update about 2 months ago
Tried my hand at simplifying the derivations of Direct Preference Optimization.

I cover how one can reformulate RLHF into DPO. The idea of implicit reward modeling is chef's kiss.

Blog: https://huggingface.co/blog/ariG23498/rlhf-to-dpo
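
For readers who want the end result at a glance, the standard RLHF-to-DPO chain can be summarized as below. This follows the formulation in the original DPO paper (Rafailov et al., 2023), not necessarily the blog's notation: $\pi_\theta$ is the policy, $\pi_{\mathrm{ref}}$ the reference model, $\beta$ the KL weight, and $y_w$, $y_l$ the preferred and rejected responses.

```latex
% KL-regularized RLHF objective
\max_{\pi}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}\big[ r(x, y) \big]
  - \beta\, \mathrm{KL}\big( \pi(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big)

% Its closed-form optimum, with partition function Z(x)
\pi^{*}(y \mid x) = \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)\, \exp\!\Big( \tfrac{1}{\beta}\, r(x, y) \Big)

% Solving for the reward gives the implicit reward model
r(x, y) = \beta \log \frac{\pi^{*}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x)

% Plugging into the Bradley-Terry model, Z(x) cancels, leaving the DPO loss
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) =
  -\, \mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \Big[ \log \sigma\Big(
    \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
    - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
  \Big) \Big]
```

The "implicit reward" is the third line: the policy's log-ratio against the reference model plays the role of the reward, so no separate reward model needs to be trained.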

Xenova posted an update about 2 months ago
Introducing Kokoro.js, a new JavaScript library for running Kokoro TTS, an 82 million parameter text-to-speech model, 100% locally in the browser w/ WASM. Powered by 🤗 Transformers.js. WebGPU support coming soon!
👉 npm i kokoro-js 👈

Try it out yourself: webml-community/kokoro-web
Link to models/samples: onnx-community/Kokoro-82M-ONNX

You can get started in just a few lines of code!
import { KokoroTTS } from "kokoro-js";

const tts = await KokoroTTS.from_pretrained(
  "onnx-community/Kokoro-82M-ONNX",
  { dtype: "q8" }, // fp32, fp16, q8, q4, q4f16
);

const text = "Life is like a box of chocolates. You never know what you're gonna get.";
const audio = await tts.generate(text,
  { voice: "af_sky" }, // See `tts.list_voices()`
);
audio.save("audio.wav");

Huge kudos to the Kokoro TTS community, especially taylorchu for the ONNX exports and Hexgrad for the amazing project! None of this would be possible without you all! 🤗

The model is also extremely resilient to quantization. The smallest variant is only 86 MB in size (down from the original 326 MB), with no noticeable difference in audio quality! 🤯

ariG23498 posted an update about 2 months ago

Xenova posted an update 2 months ago
First project of 2025: Vision Transformer Explorer

I built a web app to interactively explore the self-attention maps produced by ViTs. This explains what the model is focusing on when making predictions, and provides insights into its inner workings! 🤯

Try it out yourself! 👇
webml-community/attention-visualization

Source code: https://github.com/huggingface/transformers.js-examples/tree/main/attention-visualization
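
As a rough illustration of what a self-attention map is, the sketch below computes scaled dot-product attention weights, softmax(QKᵀ/√d), for a handful of token vectors. It is a toy and not the explorer's code: `attentionWeights` and the sample vectors are made up, and a real ViT derives its queries and keys from learned projections of image patches, separately for each head and layer.

```javascript
// Toy scaled dot-product attention weights: softmax(Q K^T / sqrt(d)).
// `queries` and `keys` are arrays of equal-length number arrays.
function attentionWeights(queries, keys) {
  const d = keys[0].length;
  return queries.map((q) => {
    // Similarity of this query with every key, scaled by sqrt(d).
    const scores = keys.map(
      (k) => k.reduce((sum, kj, j) => sum + kj * q[j], 0) / Math.sqrt(d),
    );
    // Numerically stable softmax over the scores.
    const max = Math.max(...scores);
    const exps = scores.map((s) => Math.exp(s - max));
    const total = exps.reduce((a, b) => a + b, 0);
    return exps.map((e) => e / total);
  });
}

// Each row says how strongly one token (e.g. an image patch) attends to the others.
const tokens = [[1, 0], [0, 1], [1, 1]];
const attn = attentionWeights(tokens, tokens);
```

Each row of `attn` sums to 1; rendering one row as a heatmap over the patch grid is, in essence, what an attention-map visualization shows.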

akhaliq posted an update 3 months ago
Google drops Gemini 2.0 Flash Thinking

a new experimental model that unlocks stronger reasoning capabilities and shows its thoughts. The model plans (with thoughts visible), can solve complex problems with Flash speeds, and more

now available in anychat, try it out: akhaliq/anychat

Xenova posted an update 3 months ago
Introducing Moonshine Web: real-time speech recognition running 100% locally in your browser!
🚀 Faster and more accurate than Whisper
🔒 Privacy-focused (no data leaves your device)
⚡️ WebGPU accelerated (w/ WASM fallback)
🔥 Powered by ONNX Runtime Web and Transformers.js

Demo: webml-community/moonshine-web
Source code: https://github.com/huggingface/transformers.js-examples/tree/main/moonshine-web

Narsil posted an update 3 months ago
Performance leap: TGI v3 is out. Processes 3x more tokens, 13x faster than vLLM on long prompts. Zero config!

3x more tokens

By reducing our memory footprint, we're able to ingest many more tokens, and more dynamically than before. A single L4 (24GB) can handle 30k tokens on Llama 3.1-8B, while vLLM barely gets 10k. A lot of work went into reducing the footprint of the runtime, and its effects are best seen in smaller, constrained environments.

13x faster

On long prompts (200k+ tokens), conversation replies take 27.5s in vLLM but only 2s in TGI. How so? We keep the initial conversation around, so when a new reply comes in, we can answer almost instantly. The overhead of the lookup is ~5us. Thanks @Daniël de Kok for the beast data structure.
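
The idea of keeping the earlier conversation around can be sketched as a prefix cache: if a new prompt starts with tokens that were already processed, only the new suffix needs work. The class below is a deliberately naive illustration with hypothetical names (`PrefixCache`, `remember`, `longestPrefix`) and made-up token ids. It is not TGI's data structure, and a real server would cache per-token key/value tensors rather than just counting matched tokens.

```javascript
// Naive prefix cache: remembers token sequences that were already processed
// and reports how many leading tokens of a new prompt can be reused.
class PrefixCache {
  constructor() {
    this.root = new Map(); // token -> child node (a Map), i.e. a simple trie
  }

  // Record a fully processed token sequence.
  remember(tokens) {
    let node = this.root;
    for (const t of tokens) {
      if (!node.has(t)) node.set(t, new Map());
      node = node.get(t);
    }
  }

  // Length of the longest cached prefix of `tokens`.
  longestPrefix(tokens) {
    let node = this.root;
    let matched = 0;
    for (const t of tokens) {
      if (!node.has(t)) break;
      node = node.get(t);
      matched += 1;
    }
    return matched;
  }
}

const cache = new PrefixCache();
const conversation = [101, 7, 42, 9, 13]; // made-up token ids
cache.remember(conversation);

// A follow-up turn repeats the conversation and appends new tokens.
const followUp = [...conversation, 55, 8];
const reused = cache.longestPrefix(followUp);
const toCompute = followUp.length - reused;
```

Here five of the seven tokens are found in the cache, so only the two new ones would need a forward pass. The lookup cost grows with the prompt length, not with the number of cached conversations.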

Zero config

That's it. Remove all the flags you are using and you're likely to get the best performance. By evaluating the hardware and model, TGI carefully selects automatic values to give the best performance. In production, we don't have any flags anymore in our deployments. We kept all existing flags around; they may come in handy in niche scenarios.

Read more: https://huggingface.co/docs/text-generation-inference/conceptual/chunking

Xenova posted an update 3 months ago
Introducing TTS WebGPU: The first ever text-to-speech web app built with WebGPU acceleration! 🔥 High-quality and natural speech generation that runs 100% locally in your browser, powered by OuteTTS and Transformers.js. 🤗 Try it out yourself!

Demo: webml-community/text-to-speech-webgpu
Source code: https://github.com/huggingface/transformers.js-examples/tree/main/text-to-speech-webgpu
Model: onnx-community/OuteTTS-0.2-500M (ONNX), OuteAI/OuteTTS-0.2-500M (PyTorch)

reach-vb posted an update 3 months ago
VLMs are going through quite an open revolution AND on-device friendly sizes:

1. Google DeepMind w/ PaliGemma2 - 3B, 10B & 28B: google/paligemma-2-release-67500e1e1dbfdd4dee27ba48

2. OpenGVLab w/ InternVL 2.5 - 1B, 2B, 4B, 8B, 26B, 38B & 78B: https://huggingface.co/collections/OpenGVLab/internvl-25-673e1019b66e2218f68d7c1c

3. Qwen w/ Qwen 2 VL - 2B, 7B & 72B: Qwen/qwen2-vl-66cee7455501d7126940800d

4. Microsoft w/ FlorenceVL - 3B & 8B: https://huggingface.co/jiuhai

5. Moondream2 w/ 0.5B: https://huggingface.co/vikhyatk/

What a time to be alive! 🔥

ariG23498 posted an update 3 months ago

Xenova posted an update 4 months ago
We just released Transformers.js v3.1 and you're not going to believe what's now possible in the browser w/ WebGPU! 🤯 Let's take a look:
🔀 Janus from Deepseek for unified multimodal understanding and generation (Text-to-Image and Image-Text-to-Text)
👁️ Qwen2-VL from Qwen for dynamic-resolution image understanding
🔒 JinaCLIP from Jina AI for general-purpose multilingual multimodal embeddings
🌋 LLaVA-OneVision from ByteDance for Image-Text-to-Text generation
🤸‍♀️ ViTPose for pose estimation
📄 MGP-STR for optical character recognition (OCR)
📈 PatchTST & PatchTSMixer for time series forecasting

That's right, everything running 100% locally in your browser (no data sent to a server)! 🔥 Huge for privacy!

Check out the release notes for more information. 👇
https://github.com/huggingface/transformers.js/releases/tag/3.1.0

Demo link (+ source code): webml-community/Janus-1.3B-WebGPU