Ali El Filali
alielfilali01's activity
TxT360 is a 15+ trillion token corpus that outperforms FineWeb on several metrics. Ablation studies were done up to 1T tokens.
Read blog here : LLM360/TxT360
Dataset : LLM360/TxT360
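If you want to poke at it without downloading 15T+ tokens, streaming works; a minimal sketch (the split name is an assumption, check the dataset card for the actual configs):

```python
# Minimal sketch: peek at TxT360 via streaming, so nothing huge hits disk.
# The split name ("train") is an assumption; check the dataset card.
from datasets import load_dataset

ds = load_dataset("LLM360/TxT360", split="train", streaming=True)
for example in ds.take(3):
    print(example)
```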
At least then, anyone collecting a group of datasets from an organization, or let's say from the whole Hub, could filter based on that tag and avoid contaminating their "training" data.
I wish I could repost this!
But a Hub feature itself hits different 🎶
Exactly, it's much more convenient to have it as a Hub feature 😊
An editor on the Hub for datasets would be even better! Imagine applying a map function to your dataset and saving the resulting dataset in a separate revision or a subset/split, without even needing to download and push back!!!
Maybe apply fees for larger datasets since they require more compute... like renting the server where the dataset resides anyway!?
Sometimes you just want to save something in your profile privately and work on it later without the hassle of "load_.../push_to_hub" in a code file.
I know this is super lazy 😅 But it is what it is...
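For reference, the roundtrip I'd love to skip looks roughly like this today (a sketch; the repo id and column name are hypothetical placeholders):

```python
# Today's workflow: pull the data down, transform locally, push it back.
# "my-org/my-dataset" and the "text" column are hypothetical placeholders.
from datasets import load_dataset

ds = load_dataset("my-org/my-dataset", split="train")

# Apply a map function locally...
ds = ds.map(lambda ex: {"text": ex["text"].strip()})

# ...then upload the result to a separate revision (the branch may need
# to be created first, depending on your datasets/hub versions).
ds.push_to_hub("my-org/my-dataset", revision="cleaned")
```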
tag : @victor
It is impressive how they chose to design this leaderboard and how it supports 4 languages (all spoken in Spain, of course).
Check it out from this link :
la-leaderboard/la-leaderboard
https://github.com/huggingface/transformers/issues/8771
🫡 to the man @stas
I've been stuck with that error almost the whole day!
"I found myself recently whipping up notebooks just to pull huggingface datasets locally, annotate or operate changes and update them again. This happened often enough that I made a cli tool out of it, which I've been using successfully for the last few months.
While huggingface uses open formats, I found the official toolchain relatively low-level and not adapted to quick operations such as what I am doing."
~ @omarkamali
Link : https://omarkama.li/blog/datapluck
Huge Congrats 🎉
Now it finally makes sense 😅
OMG I'm equally excited for this 😄
The HF ecosystem is getting bigger and better 🔥
OMG finally I will have some green squares 🥹
But I'm #data_rich 😎
meta-llama/Meta-Llama-3-70B-Instruct is still the king of the leaderboard 👑 with a 3.46-point lead over its successor CohereForAI/c4ai-command-r-plus, which took 2nd place 🥈 from its younger brother CohereForAI/c4ai-command-r-v01, which today lives on the 5th floor, just behind Ashmal/MBZUAI-oryx (3rd place 🥉, AFAIK an experimental model from MBZUAI) and https://huggingface.co/core42/jais-30b-chat-v3 (4th place) from Core42.
PS: I should consider a career in sports commentary 😂
Would you recommend me to BeIN Sports 😆?
And now meta-llama/Meta-Llama-3-70B-Instruct is the new hero of the leaderboard, beating CohereForAI/c4ai-command-r-v01 by 5.43 points 🥇
Almost 80 more models are still PENDING! So this might change very fast in the upcoming days.
Here's a quick update for our community that is waiting for new results. Some of you noticed that since the release yesterday, the finished evaluations tab has stayed at 14 models up until now (May 15th, 12 PM). For those concerned, rest assured: we had a minor memory issue in our cluster yesterday that we overlooked. The problem is now fixed, and 7 models are currently being evaluated in parallel, so expect to hit the 20 milestone today! 🚀
Check the discussion below for more info :
OALL/Open-Arabic-LLM-Leaderboard#3
Not really 😅
And that's just what a NOOB like myself had in mind; I'm sure there are better, more efficient ways to do it! So the question again: why haven't we yet? I feel I'm missing something... Right?
Almost 24 hours after the release of the Arabic cohort of DIBT-MPEP, we are at 100 prompts translated/corrected!
Shout out to the hero @seyf1elislam for contributing more than 60 prompts 🔥
How to Get Involved?
1. Visit our Argilla Space and start reviewing prompts.
https://2a2i-prompt-translation-for-arabic.hf.space/
2. Join our Discord channel in the Hugging Face Discord server to connect with the community and share your insights.
https://discord.com/channels/879548962464493619/1217179730869096448
Hi @smangrul, apparently I can't push the merged adapter to the Hub???
Because when I do so, it creates num_of_adapters_to_merge + 1 adapters (including the merged one), and when I want to load the merged adapter with model = PeftModel.from_pretrained(model, adapter) I get the error in image 2!
Your help is much appreciated, tnx 🤗
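For context, here is roughly what I'm doing (a sketch; the paths/names are placeholders, and whether `selected_adapters` is available depends on your PEFT version):

```python
# Sketch of the merge flow (paths and adapter names are placeholders).
from transformers import AutoModelForCausalLM
from peft import PeftModel

base_model = AutoModelForCausalLM.from_pretrained("my-org/base-model")

# Load the adapters to merge.
model = PeftModel.from_pretrained(base_model, "path/to/adapter_1", adapter_name="a1")
model.load_adapter("path/to/adapter_2", adapter_name="a2")

# Combine them into a new adapter.
model.add_weighted_adapter(
    adapters=["a1", "a2"],
    weights=[0.5, 0.5],
    adapter_name="merged",
    combination_type="linear",
)

# Plain save_pretrained writes ALL adapters (a1, a2 AND merged), which is
# where the num_of_adapters_to_merge + 1 folders come from; selecting only
# the merged one (if your PEFT version supports it) might avoid that:
model.save_pretrained("out_dir", selected_adapters=["merged"])
```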
Today we announce : 2A2I/Arabic-OpenHermes-2.5
Arabic-OpenHermes-2.5 is simply the translation of the original dataset released by @teknium a couple of months ago!
It may look like a simple task! In reality, it was quite a laborious job!
But thanks to @maghwa & @medmac01, this dataset managed to see the light today, and it will help create better / more aligned Arabic LLMs in the near future.
If you are interested in joining us and/or helping us, please leave a comment below or visit our HuggingFace Org Card for more details about how/what you can do.
More datasets to come, and more models are on the way 🔥
Thanks for your support 🤗
Got some input from @ybelkada about not needing a ref_model, because we can just swap out the LoRA adapters during training.
About this part 👆
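If I understood the trick correctly, it looks roughly like this in TRL (a sketch; the model/dataset ids are placeholders, and the exact DPOTrainer signature varies across TRL versions):

```python
# Sketch: DPO with a LoRA policy and no separate frozen reference model.
# TRL can disable the adapters on the fly to recover the reference logps.
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer
from peft import LoraConfig

model = AutoModelForCausalLM.from_pretrained("my-org/base-model")  # placeholder id
tokenizer = AutoTokenizer.from_pretrained("my-org/base-model")

peft_config = LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM")

trainer = DPOTrainer(
    model=model,
    ref_model=None,             # no frozen copy: adapters are toggled off instead
    args=TrainingArguments(output_dir="dpo-out"),
    train_dataset=dpo_dataset,  # (prompt, chosen, rejected) pairs, assumed prepared
    tokenizer=tokenizer,        # newer TRL versions call this processing_class
    peft_config=peft_config,
)
trainer.train()
```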
@Ali-C137 it should be fixed now. Thank you for your feedback!
Thank you so much 🤗
Hi!
I think for NEFTune it should be supported out of the box, as you just need to pass the correct argument, neftune_noise_alpha, in TrainingArguments, right?
Yes indeed (AFAIK), but I asked if Unsloth supports it as well by incorporating it into their code base (I assume they are based on PEFT & TRL as well!?)
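For reference, the vanilla version looks like this (a minimal sketch; the alpha value is arbitrary, and model/dataset are assumed already prepared):

```python
# Sketch: enabling NEFTune with plain transformers (alpha value is arbitrary).
from transformers import Trainer, TrainingArguments

args = TrainingArguments(
    output_dir="out",
    neftune_noise_alpha=5.0,  # injects noise into embeddings during training only
)
trainer = Trainer(model=model, args=args, train_dataset=train_ds)  # model/train_ds assumed
trainer.train()
```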
Arabic Aya is a carefully curated dataset, derived from the vast Aya collection by CohereForAI, tailored specifically for Arabic language processing. It consolidates texts across Modern Standard Arabic (MSA) and other dialects, simplifying access to high-quality data for researchers, developers, and linguists.
🌟 Why Arabic Aya?
- Time-saving: Jump straight into your projects with pre-filtered Arabic texts.
- Diverse applications: Perfect for language modeling, sentiment analysis, dialect identification, and more.
- Community-driven: Your contributions and feedback can help enrich this resource further.
🚀 Utilize Arabic Aya for your next NLP/LLM projects and be part of advancing Arabic language technologies. Let's collaborate to make Arabic AI research more accessible and robust!
Check it out here: 2A2I/Arabic_Aya
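Quick start (a sketch; the split/config names are assumptions, see the dataset card for the actual ones):

```python
# Sketch: load Arabic Aya and inspect a row. Split/config names may differ;
# check the dataset card for the actual ones.
from datasets import load_dataset

ds = load_dataset("2A2I/Arabic_Aya", split="train")
print(ds[0])
```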
Hi @julien-c, always about the viewer: this white view of sections within dark mode has been really annoying; do you think you guys can do something about it?
PS: I have been using this viewer for almost 6 hours now 🔥🤗
Is all-linear (most recent update of PEFT) supported in the target_modules arg? Also, what about NEFTune?
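To be concrete, this is what I mean (a sketch; assumes a PEFT version recent enough to accept the shortcut):

```python
# Sketch: the "all-linear" shortcut instead of listing modules by hand.
from peft import LoraConfig

config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules="all-linear",  # targets every linear layer (output head excluded)
    task_type="CAUSAL_LM",
)
```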
Amazing work 🤩
I wish we had a save button for posts here.
I'm also interested in knowing more about this:
"To prevent catastrophic forgetting, I used weight averaging between iterations."
Can you please elaborate!? Tnx 🤗
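My naive reading of it, just to check I got the idea (a guess, not necessarily what you did):

```python
# Guess at "weight averaging between iterations": interpolate consecutive
# checkpoints elementwise before continuing training.
import torch

def average_state_dicts(prev_sd, curr_sd, alpha=0.5):
    """Return alpha * previous + (1 - alpha) * current for every tensor."""
    return {k: alpha * prev_sd[k] + (1 - alpha) * curr_sd[k] for k in curr_sd}

# merged = average_state_dicts(prev_model.state_dict(), curr_model.state_dict())
# curr_model.load_state_dict(merged)
```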
Can't wait for the release soon 🔥
The idea itself was not that revolutionary though, because practically chess moves are just sequences, and better yet, they are expressed with letters and numbers that are familiar to LLMs. I remember back in July I had a discussion about the very same idea with some folks during a summer school.
I don't even wanna think about my email inbox 🤦🏻‍♂️😅
Today, we are thrilled to officially launch the "2A2I" Arabic Artificial Intelligence Initiative. This is a community-driven initiative founded on the philosophy of "Small team, Big work". Our goal is to elevate Arabic AI (LLMs, Diffusion Models, ASR, etc.) to the same level as English (and also Chinese 😉).
Naturally, our focus today is primarily on datasets. We aim to provide high-quality datasets, especially for LLMs this month, to support our future efforts. In line with this, we're excited to introduce the Arabic version of H4-no_robots, found here: 2A2I/H4_no_robots (and yes, we know it's not "no_robots" anymore 😅). Stay tuned for more exciting, high-quality datasets in the next couple of weeks (+4 million rows 🔥)
In parallel, we're also developing a model 💪 that we hope will set new high standards for Arabic LLMs 🔥 This model is planned for release in the coming months.
For more information, please visit our Organization card here : https://huggingface.co/2A2I
If you're interested in Arabic AI and want to help push the wheel as well, fill out this form and let us know your motivation and your exciting ideas 🔥
The form link : https://forms.gle/kZLVuynWFU2FyTm57
If you have any questions, feel free to reach out to us at the email address below.
Additionally, if you believe as we do in this mission and would like to help this community by contributing some compute resources or any other form of help you might think of, please contact us at the same email address below or reach out to me through LinkedIn 🔥
2A2I Contact Email : arabic.ai.initiative@gmail.com
My LinkedIn : https://www.linkedin.com/in/alielfilali01/
Can you elaborate more plz ?
(I've been asked to provide a report about the cost of fine-tuning each model, etc., so I decided to do the lazy job and build a tool for it; the prof can later choose whatever config he likes 😄)
🤔 But why is this important?
As LLMs continue to grow in size and complexity, understanding the computational and financial requirements is crucial for planning and managing AI projects. I believe this tool simplifies this process, giving you insights into potential expenses based on the number of parameters and tokens in your dataset.
🚀 Features:
- Input the number of parameters (in billions) and tokens (in trillions).
- Adjust for GPU utilization rates and overhead costs.
- Get an instant estimate of your training costs.
+ Choose your GPU (A100 80GB PCIe, A100 80GB SXM, V100, H100 SXM, H100 PCIe)
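For the curious, the gist is the classic ≈6·N·D FLOPs rule of thumb; a minimal sketch (the constants here are illustrative defaults, not the app's exact values):

```python
# Back-of-the-envelope training cost: FLOPs ≈ 6 * params * tokens.
# Peak TFLOPs, utilization and $/GPU-hour below are illustrative assumptions.
def training_cost_usd(n_params_b, n_tokens_t, peak_tflops=312.0,
                      utilization=0.4, price_per_gpu_hour=1.9):
    flops = 6 * (n_params_b * 1e9) * (n_tokens_t * 1e12)
    effective_flops = peak_tflops * 1e12 * utilization  # sustained FLOP/s per GPU
    gpu_hours = flops / effective_flops / 3600
    return gpu_hours * price_per_gpu_hour

# e.g. a 7B model trained on 1T tokens on A100-class GPUs:
print(f"${training_cost_usd(7, 1):,.0f}")
```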
🔜 Coming Soon:
Plans are in place to expand the calculator's capabilities to include fine-tuning costs for models using LoRA or QLoRA. You'll be able to input a model ID from the Hugging Face Hub, select your fine-tuning strategy, and specify quantization details if using QLoRA.
I believe this tool will be a valuable asset to the AI community, helping to plan and allocate resources more effectively 🤗.
Should you have any suggestions or feedback, please don't hesitate to contribute your thoughts in the comments below. Together, we can refine and enhance this resource for all.
🚀 Try it here: https://huggingface.co/spaces/Ali-C137/LLM-Training-Cost-Calculator
PS: All thanks to Gradio, Hugging Face, and the community ofc 🔥🙏
It would be super helpful if they released their dataset 🔥
Just created mine 🔥
I've been wanting to create HuggingAssist for so long, and you guys just made it a lot easier 🔥 tnx 🤗
Super excited to share with you my first chat assistant: HuggingAssist, meant to offer guidance with the large HuggingFace ecosystem. Chat with it from here: https://hf.co/chat/assistant/65bd0adc08560e58be454d86
It will be even more helpful once the RAG / Web features are available!
Looking forward to it 🔥
PS: tnx @Chunte for the cool Huggies
I don't know why I always thought it would be multilingual 🤦🏻‍♂️
Great job 🔥 the paper is a masterpiece 👏🏻 tnx for it