somosnlp-hackathon-2022 (I Hackathon Somos NLP: PLN en Español)

🚨Exciting news for the Multilingual Synthetic Data Community!🚨

I’ve taken inspiration from the MAGPIE paper on Llama-3-8B-instruct and extended its capabilities. Here’s what’s new!

🗞 The MAGPIE paper showcased that if you use the instruction-tuned version (Llama-3-8B-instruct) to generate synthetic instructions and then fine-tune the base version (Llama-3-8B) on this dataset, you can improve even the it-tuned version

🤔 While reading a script by Sebastian Raschka, PhD, I wondered: Could these advancements be replicated in other languages? Specifically, could they benefit non-English datasets?

🎉 And the answer is YES! At least for Spanish. I've successfully adapted the techniques for Spanish, proving the model's flexibility and multilingual capabilities.

👩‍💻 To make this accessible, I created a basic script (heavily inspired by the Sebastian Raschka one) that allows you to generate similar datasets using ollama models (initially phi and llama3) automatically and upload it to the Hugging Face Hub!
[Script](https://gist.github.com/mrm8488/4650a5e3cc45523798a527a3446eb312)

🔍 Explore the datasets 📚 generated using our new script!

- [Llama-3-8B](https://huggingface.co/datasets/mrm8488/dataset_llama3_5000_samples_es_4231_filtered)
- [Phi-3-medium](https://huggingface.co/datasets/mrm8488/dataset_phi3-medium_5000_samples_es_3906_filtered)
- [Phi-3-mini](https://huggingface.co/datasets/mrm8488/dataset_phi3_5000_samples_es_3282_filtered)

Note: These datasets have basic filtering. Apply additional quality filters before using them to fine-tune large language models.

Inspiration and base script:
https://github.com/rasbt/LLMs-from-scratch/blob/main/ch07/05_dataset-generation/llama3-ollama.ipynb
https://www.linkedin.com/feed/update/urn:li:activity:7210982019751661568/

7 replies

·

mariagrandury

authored a paper 9 months ago

Spanish and LLM Benchmarks: is MMLU Lost in Translation?

Paper • 2406.17789 • Published May 28, 2024 • 1

osanseviero

updated 4 Spaces 9 months ago

26

mrm8488

posted an update 10 months ago

Post

6377

Working on a concept GPT-2 (small) that uses KANs instead of MLPs.
The ckpt and training code will be soon on the hub.

6 replies

·

pcuenq

posted an update 11 months ago

Post

6230

OpenELM in Core ML

Apple recently released a set of efficient LLMs in sizes varying between 270M and 3B parameters. Their quality, according to benchmarks, is similar to OLMo models of comparable size, but they required half the pre-training tokens because they use layer-wise scaling, where the number of attention heads increases in deeper layers.

I converted these models to Core ML, for use on Apple Silicon, using this script: https://gist.github.com/pcuenca/23cd08443460bc90854e2a6f0f575084. The converted models were uploaded to this community in the Hub for anyone that wants to integrate inside their apps: corenet-community/openelm-core-ml-6630c6b19268a5d878cfd194

The conversion was done with the following parameters:
- Precision: float32.
- Sequence length: fixed to 128.

With swift-transformers (https://github.com/huggingface/swift-transformers), I'm getting about 56 tok/s with the 270M on my M1 Max, and 6.5 with the largest 3B model. These speeds could be improved by converting to float16. However, there's some precision loss somewhere and generation doesn't work in float16 mode yet. I'm looking into this and will keep you posted! Or take a look at this issue if you'd like to help: https://github.com/huggingface/swift-transformers/issues/95

I'm also looking at optimizing inference using an experimental kv cache in swift-transformers. It's a bit tricky because the layers have varying number of attention heads, but I'm curious to see how much this feature can accelerate performance in this model family :)

Regarding the instruct fine-tuned models, I don't know the chat template that was used. The models use the Llama 2 tokenizer, but the Llama 2 chat template, or the default Alignment Handbook one that was used to train, are not recognized. Any ideas on this welcome!

4 replies

·

milmor

authored a paper 12 months ago

Efficient generative adversarial networks using linear additive-attention Transformers

Paper • 2401.09596 • Published Jan 17, 2024 • 1

mrm8488

posted an update about 1 year ago

Post

Hello world! 🔥

rockdrigoma

updated a model about 1 year ago

somosnlp-hackathon-2022/t5-small-spanish-nahuatl

Translation • Updated Jan 23, 2024 • 48 • 17

I Hackathon Somos NLP: PLN en Español

AI & ML interests

somosnlp-hackathon-2022's activity

1-800-SHARED-TASKS at RegNLP: Lexical Reranking of Semantic Retrieval (LeSeR) for Regulatory Question Answering

Maya: An Instruction Finetuned Multilingual Multimodal Model

INCLUDE: Evaluating Multilingual Language Understanding with Regional Knowledge

SeQwen at the Financial Misinformation Detection Challenge Task: Sequential Learning for Claim Verification and Explanation Generation in Financial Domains

1-800-SHARED-TASKS @ NLU of Devanagari Script Languages: Detection of Language, Hate Speech, and Targets using LLMs

M-RewardBench: Evaluating Reward Models in Multilingual Settings

Evaluating Large Language Models with Tests of Spanish as a Foreign Language: Pass or Fail?

Spanish to Nahuatl Translation

The #Somos600M Project: Generating NLP resources that represent the diversity of the languages from LATAM, the Caribbean, and Spain

Spanish and LLM Benchmarks: is MMLU Lost in Translation?

Audio Sentiment Classifier

Poem Generation Es

Sonnet Poetry Generator Spanish

Clasificador De Tesis

Efficient generative adversarial networks using linear additive-attention Transformers

somosnlp-hackathon-2022/t5-small-spanish-nahuatl

AI & ML interests

Team members 173

somosnlp-hackathon-2022's activity

Spanish to Nahuatl Translation

Audio Sentiment Classifier

Poem Generation Es

Sonnet Poetry Generator Spanish

Clasificador De Tesis