1st Somos NLP Hackathon: NLP in Spanish
non-profit
AI & ML interests
Open-source Spanish NLP hackathon focused on the UN Sustainable Development Goals. Organized by Somos NLP and sponsored by Platzi, Paperspace, and Hugging Face.
Recent Activity
somosnlp-hackathon-2022's activity
haritzpuerto posted an update 2 days ago
haritzpuerto posted an update 3 days ago
Post
I'm excited to announce that my internship paper at Parameter Lab was accepted to Findings of #NAACL2025 🎉
TLDR: Determining whether an LLM was trained on a single sentence may not be possible 😥, but it is possible for large enough amounts of tokens, such as long documents or collections of documents! 🤯
Scaling Up Membership Inference: When and How Attacks Succeed on Large Language Models (2411.00154)
🔗 https://github.com/parameterlab/mia-scaling
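The scaling intuition behind the paper can be illustrated with a toy simulation (the loss values and gap below are hypothetical numbers for illustration, not results from the paper): per-token membership signals are individually noisy, but averaging them over many tokens makes member and non-member text separable.

```python
import random

def document_score(token_losses):
    """Aggregate per-token losses into one membership score.

    Lower average loss suggests the model has seen the text before;
    averaging over more tokens shrinks the noise around that signal.
    """
    return sum(token_losses) / len(token_losses)

random.seed(0)
# Hypothetical per-token losses: member text gets a slightly lower mean.
member_doc = [random.gauss(2.0, 1.0) for _ in range(5000)]
nonmember_doc = [random.gauss(2.2, 1.0) for _ in range(5000)]

# At ~20 tokens (one sentence) the 0.2-nat gap is buried in noise,
# but at document scale the averaged scores separate cleanly.
gap = document_score(nonmember_doc) - document_score(member_doc)
```

The same averaging argument is why the attack becomes feasible for long documents or document collections even when it fails for single sentences.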
DrishtiSharma authored 5 papers about 2 months ago
- 1-800-SHARED-TASKS at RegNLP: Lexical Reranking of Semantic Retrieval (LeSeR) for Regulatory Question Answering (Paper • 2412.06009 • Published)
- Maya: An Instruction Finetuned Multilingual Multimodal Model (Paper • 2412.07112 • Published • 27)
- INCLUDE: Evaluating Multilingual Language Understanding with Regional Knowledge (Paper • 2411.19799 • Published • 11)
- SeQwen at the Financial Misinformation Detection Challenge Task: Sequential Learning for Claim Verification and Explanation Generation in Financial Domains (Paper • 2412.00549 • Published • 1)
- 1-800-SHARED-TASKS @ NLU of Devanagari Script Languages: Detection of Language, Hate Speech, and Targets using LLMs (Paper • 2411.06850 • Published • 4)
haritzpuerto authored a paper 3 months ago
DrishtiSharma authored a paper 3 months ago
mariagrandury authored a paper 4 months ago
rockdrigoma updated a Space 6 months ago
mariagrandury authored a paper 6 months ago
haritzpuerto authored a paper 7 months ago
Post
🚨Exciting news for the Multilingual Synthetic Data Community!🚨
I’ve taken inspiration from the MAGPIE paper on Llama-3-8B-instruct and extended its capabilities. Here’s what’s new!
🗞 The MAGPIE paper showed that if you use the instruction-tuned version (Llama-3-8B-instruct) to generate synthetic instructions and then fine-tune the base version (Llama-3-8B) on this dataset, you can improve even the instruction-tuned version.
🤔 While reading a script by Sebastian Raschka, PhD, I wondered: could these advancements be replicated in other languages? Specifically, could they benefit non-English datasets?
🎉 The answer is YES! At least for Spanish: I've successfully adapted the techniques, proving the approach's flexibility and multilingual capabilities.
👩‍💻 To make this accessible, I created a basic script (heavily inspired by Sebastian Raschka's) that automatically generates similar datasets with ollama models (initially phi and llama3) and uploads them to the Hugging Face Hub.
[Script](https://gist.github.com/mrm8488/4650a5e3cc45523798a527a3446eb312)
🔍 Explore the datasets 📚 generated using the new script:
- [Llama-3-8B](https://huggingface.co/datasets/mrm8488/dataset_llama3_5000_samples_es_4231_filtered)
- [Phi-3-medium](https://huggingface.co/datasets/mrm8488/dataset_phi3-medium_5000_samples_es_3906_filtered)
- [Phi-3-mini](https://huggingface.co/datasets/mrm8488/dataset_phi3_5000_samples_es_3282_filtered)
Note: these datasets have only basic filtering. Apply additional quality filters before using them to fine-tune large language models.
Inspiration and base script:
https://github.com/rasbt/LLMs-from-scratch/blob/main/ch07/05_dataset-generation/llama3-ollama.ipynb
https://www.linkedin.com/feed/update/urn:li:activity:7210982019751661568/
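The generate-filter-upload loop described above can be sketched roughly as follows. This is a minimal sketch, not the gist's exact code: it assumes a local ollama server on the default port, and the Spanish seed prompt, model name, and filter threshold are illustrative placeholders.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local ollama endpoint

def ollama_generate(model: str, prompt: str) -> str:
    """Send a single non-streaming generation request to a local ollama server."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Illustrative Spanish seed prompt asking the instruction-tuned model
# to invent a user instruction, MAGPIE-style.
SEED_PROMPT = "Escribe una instrucción que un usuario podría pedir a un asistente:"

def basic_filter(pairs, min_chars=20):
    """Drop trivially short instruction/response pairs (the 'basic filtering')."""
    return [
        p for p in pairs
        if len(p["instruction"]) >= min_chars and len(p["response"]) >= min_chars
    ]

def build_dataset(model="llama3", n=5):
    """Generate n instruction/response pairs, then apply the basic filter.

    The filtered list can afterwards be wrapped with
    datasets.Dataset.from_list(...) and pushed via .push_to_hub(...).
    """
    pairs = []
    for _ in range(n):
        instruction = ollama_generate(model, SEED_PROMPT)
        response = ollama_generate(model, instruction)
        pairs.append({"instruction": instruction, "response": response})
    return basic_filter(pairs)
```

As the post notes, this filtering is deliberately minimal; stricter quality filters should be applied before fine-tuning on the result.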
mariagrandury authored a paper 7 months ago
osanseviero updated 4 Spaces 8 months ago