Jędrzej Grabala

jgitsolutions

AI & ML interests

Local-drive, human-overseen system of agents, LLMs, LangChain pipelines & other useful stuff on mid-to-low-end commercial hardware.

Recent Activity

liked a model 3 days ago
microsoft/Phi-3.5-MoE-instruct
liked a Space 4 days ago
Lightricks/LTX-Video-Playground
liked a Space 19 days ago
cutechicken/tankwar

Organizations

LangChain Agents Hub, LangChainDatasets, ZeroGPU Explorers, Dev Mode Explorers

jgitsolutions's activity

reacted to anakin87's post with 👀 about 1 month ago
Ok, you're finally convinced that synthetic data works... ⚗️

๐๐จ๐ฐ ๐ฒ๐จ๐ฎ ๐ฐ๐š๐ง๐ญ ๐ญ๐จ ๐ ๐ž๐ง๐ž๐ซ๐š๐ญ๐ž ๐š๐ง ๐ข๐ง๐ฌ๐ญ๐ซ๐ฎ๐œ๐ญ๐ข๐จ๐ง ๐๐š๐ญ๐š๐ฌ๐ž๐ญ ๐Ÿ๐จ๐ซ ๐Ÿ๐ข๐ง๐ž-๐ญ๐ฎ๐ง๐ข๐ง๐  ๐ข๐ง ๐š ๐ฅ๐š๐ง๐ ๐ฎ๐š๐ ๐ž ๐จ๐ญ๐ก๐ž๐ซ ๐ญ๐ก๐š๐ง ๐„๐ง๐ ๐ฅ๐ข๐ฌ๐ก.
But how do you get started?

I explore how to do this with Magpie in my new article
https://huggingface.co/blog/anakin87/multilingual-magpie

---

๐Ÿฆโ€โฌ› ๐–๐ก๐š๐ญ ๐ข๐ฌ ๐Œ๐š๐ ๐ฉ๐ข๐ž?

It's a recent technique for creating synthetic instruction datasets.

Magpie is based on a simple but ingenious idea 👇
if you prompt an instruction-tuned model with a pre-query template, you can make it generate a plausible user query/instruction

Here's an example:
model: Llama-3-8B-Instruct
pre-query template: "<|begin_of_text|><|start_header_id|>user<|end_header_id|>"
generated user instruction: "What are some of the responsibilities of a commercial pilot?"

You can then feed this instruction back into the same model to get the assistant response.

By repeating this process, it's possible to generate large synthetic datasets with relatively little effort.
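
Here's a minimal sketch of that loop with transformers, assuming access to meta-llama/Meta-Llama-3-8B-Instruct. It's not the authors' implementation, and the sampling settings are illustrative:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# The prompt stops exactly where a user turn would begin (Llama-3's chat
# format puts "\n\n" after the header), so sampling makes the model
# invent a plausible user instruction.
PRE_QUERY = "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"

def sample(prompt: str) -> str:
    # The template already contains the special tokens, so don't add more.
    inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)
    out = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=True,
        temperature=1.0,
        eos_token_id=tokenizer.convert_tokens_to_ids("<|eot_id|>"),
    )
    # Keep only the newly generated tokens.
    return tokenizer.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True).strip()

# Step 1: sample a synthetic user instruction from the bare template.
instruction = sample(PRE_QUERY)

# Step 2: feed the instruction back to the same model for the response.
response = sample(
    PRE_QUERY + instruction + "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
)

print({"instruction": instruction, "response": response})
```

Looping over this (and filtering low-quality pairs) is what builds the dataset.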

🪄 The authors demonstrate that using these datasets for Supervised Fine-Tuning (SFT) can yield strong performance, even competitive with the original instruct model.


🧗 Generating non-English data

Most Language Models are primarily trained on English texts, so they tend to produce data in English.

How can we overcome this?

Earlier approaches were complex or costly.

Then @mrm8488 found a simple solution: add the target language to the pre-query template.
For Spanish, the template becomes "<|begin_of_text|><|start_header_id|>user<|end_header_id|>spanish:".

This method works for Spanish and German!

โŒ Unfortunately, it does not work well for other languages (๐Ÿ‡ฎ๐Ÿ‡น, ๐Ÿ‡ณ๐Ÿ‡ฑ, ...)

reacted to yongchanghao's post with 👀🔥 about 1 month ago
We just released a paper (NeuZip) that losslessly compresses model weights in VRAM so you can run larger models. This should be particularly useful when VRAM is insufficient during training/inference. Specifically, we look inside each floating-point number and find that the exponents are highly compressible (as shown in the figure in the paper).

Read more about the work at NeuZip: Memory-Efficient Training and Inference with Dynamic Compression of Neural Networks (2410.20650)
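
A toy numpy illustration of that observation (not the NeuZip method itself): the exponent bits of a typical weight tensor cluster around a few values and so have low entropy, while the mantissa bits are nearly uniform and barely compress.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for trained weights: small, roughly Gaussian values.
weights = rng.normal(0.0, 0.02, size=1_000_000).astype(np.float16)

# float16 layout: 1 sign bit, 5 exponent bits, 10 mantissa bits.
bits = weights.view(np.uint16)
exponents = ((bits >> 10) & 0x1F).astype(np.int64)
mantissas = (bits & 0x3FF).astype(np.int64)

def entropy_bits(symbols: np.ndarray, n_symbols: int) -> float:
    """Shannon entropy in bits per symbol."""
    p = np.bincount(symbols, minlength=n_symbols) / symbols.size
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

print(f"exponent entropy: {entropy_bits(exponents, 32):.2f} of 5 bits")
print(f"mantissa entropy: {entropy_bits(mantissas, 1024):.2f} of 10 bits")
# A lossless entropy coder can therefore store the exponents in far
# fewer bits than their raw width, which is the gap NeuZip exploits.
```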
reacted to PLB's post with 🚀 about 1 month ago
⚠️ People selling AI chatbots for websites hate us.
Add an open source chat assistant on your website in 5 minutes: https://github.com/phospho-app/ai-chat-bubble

How does it work?
- You give it a URL
- The AI assistant crawls the website content and embeds it
- You add it to your frontend in one line of code
- People on your website can ask the assistant questions

Powered by BAAI/bge-small-en-v1.5 and Mistral AI
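
Under the hood this is retrieval-augmented generation. Here's a rough, self-contained sketch of the crawl → embed → retrieve → answer flow the post describes; it is not the project's actual code, and the single-page "crawl", naive chunking, and Mistral call are illustrative assumptions:

```python
import os
import requests
from sentence_transformers import SentenceTransformer, util

# 1. "Crawl" one page (a real crawler follows links and strips HTML).
html = requests.get("https://example.com").text
chunks = [html[i:i + 500] for i in range(0, len(html), 500)]

# 2. Embed the content with the model the post mentions.
embedder = SentenceTransformer("BAAI/bge-small-en-v1.5")
chunk_emb = embedder.encode(chunks, normalize_embeddings=True)

# 3. Retrieve the chunks most similar to the visitor's question.
question = "What is this website about?"
q_emb = embedder.encode([question], normalize_embeddings=True)
hits = util.semantic_search(q_emb, chunk_emb, top_k=3)[0]
context = "\n".join(chunks[h["corpus_id"]] for h in hits)

# 4. Ask a Mistral model to answer from the retrieved context.
resp = requests.post(
    "https://api.mistral.ai/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['MISTRAL_API_KEY']}"},
    json={
        "model": "mistral-small-latest",
        "messages": [
            {"role": "system",
             "content": f"Answer questions using this website content:\n{context}"},
            {"role": "user", "content": question},
        ],
    },
)
print(resp.json()["choices"][0]["message"]["content"])
```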
reacted to LukeNeumann's post with 👍 about 1 month ago
Hello Hugging Face community!

I wanted to introduce myself and my company @Overlaiapp. We are a collective of filmmakers, photographers, and AI engineers working on high-resolution (8K+) training data.

We plan to share a lot of our datasets with the community and are kicking things off with two curated datasets:

- Overlaiai/OregonCoastin4K

- Overlaiai/SubArcticPolarBear
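
(If you want to poke at these, here is a hedged sketch of loading one with the datasets library; the split name is an assumption, so check the dataset card first.)

```python
from datasets import load_dataset

# "train" is assumed; the repo may define different splits/columns.
ds = load_dataset("Overlaiai/OregonCoastin4K", split="train")
print(ds.column_names)  # inspect the clips and their structured metadata
```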


Overlai.ai Dataset Features

🎥 Oversampled: Every clip is captured in stunning 8K resolution, delivering rich detail ideal for fine-tuning on scenic landscapes and ocean dynamics.

📸 Variance: Includes close-up details, slow-motion footage of crashing waves, sweeping landscapes, and wildlife shots.

📋 Detailed Metadata: Every clip is paired with structured metadata, including creative descriptions, precise camera movements, lens information, field-of-view calculations, and shot settings, ensuring AI models can fully understand and replicate real-world cinematography with accuracy.

⚙️ Consistency: Re-thinking training data at the point of capture by "overshooting" a subject, enabling models to learn more nuanced relationships and views across scenes.

🌅 Light: Shot during early-morning and sunset light for optimal color contrast and dynamic range, maximizing visual quality for color- and lighting-sensitive tasks.

🔍 Curation: Curated specifically for machine learning, providing clean, high-quality data for next-generation model training.
reacted to BlinkDL's post with 🔥 about 1 month ago
RWKV-6-world-v3 (+3.1T tokens) is our best multilingual 7B model as of now: BlinkDL/rwkv-6-world

It's 100% RNN and attention-free. MMLU 54.2% (previous world-v2.1: 47.9%; note: without eval-boosting tricks such as annealing).

RWKV-7-world-v4 soon :)