singhsidhukuldeep posted an update Jul 23
Meta Researchers: How many compute hours should we use to train Llama 3.1?
Mr. Zuck: Yes!

The good folks at @AIatMeta did not just release the models; they also published a detailed 92-page paper on their findings and the technical aspects of the models and their training process!

Generally, we just gobble up these weights and forget the compute infrastructure used to train these models.


Here are some interesting findings about the compute infrastructure behind the Llamas:

- Llama 1 and 2 models were trained on @Meta's AI Research SuperCluster. Llama 3 was migrated to Meta's production clusters!

- That's 16,000 H100 GPUs, each with a 700 W TDP and 80 GB of HBM3, arranged in Meta's Grand Teton AI server platform.

- What about storing checkpoints? They used Tectonic, a distributed file system, with capacity reaching 240 PB and peak throughput of 7 TB/s.

- Meta's mad lads saved each GPU's model state, ranging from 1 MB to 4 GB per GPU, for recovery and debugging (a minimal sketch of this kind of per-rank save follows this list).
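
To make that last bullet concrete, here is a minimal Python sketch of per-rank state saving, assuming a PyTorch training loop and a shared filesystem directory standing in for a Tectonic mount; the paths and helper name are illustrative, not Meta's actual tooling.

```python
import os
import torch

def save_rank_state(model, optimizer, step, rank, root="./checkpoints"):
    """Write this rank's model/optimizer state (1 MB to 4 GB per GPU in the
    paper's description) to shared storage so an interrupted job can be
    debugged or resumed."""
    os.makedirs(f"{root}/step_{step}", exist_ok=True)
    torch.save(
        {"step": step, "model": model.state_dict(), "optimizer": optimizer.state_dict()},
        f"{root}/step_{step}/rank_{rank}.pt",
    )

# Toy usage; in a real job `rank` would come from the launcher (e.g. the RANK env var).
model = torch.nn.Linear(8, 8)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
save_rank_state(model, optimizer, step=0, rank=int(os.environ.get("RANK", 0)))
```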


If this sounds big, well, they also document the humongous challenges that come with it:

- In the 54-day training period, there were 466 job interruptions.

- About 78% of unexpected interruptions were attributed to confirmed or suspected hardware issues, mostly GPUs!

- Saving all checkpoints is cool until you do it for a 300B+ parameter model. The bursty nature of checkpoint writes, essential for saving state during training, periodically saturated the storage fabric and hurt performance.

- With all this, effective training time, measured as the time spent on useful training over the elapsed time, was higher than 90% (a quick back-of-the-envelope check follows this list).
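
As a sanity check on that last figure, here is a small back-of-the-envelope calculation. Only the 54 days and 466 interruptions come from the post; the average recovery cost per interruption is an assumed, illustrative number.

```python
ELAPSED_HOURS = 54 * 24        # 1,296 hours of wall-clock training
INTERRUPTIONS = 466            # quoted above
AVG_RECOVERY_HOURS = 0.25      # assumed: restart + checkpoint reload per interruption

lost_hours = INTERRUPTIONS * AVG_RECOVERY_HOURS
effective_fraction = (ELAPSED_HOURS - lost_hours) / ELAPSED_HOURS
print(f"effective training time: {effective_fraction:.1%}")  # ~91.0% under these assumptions
```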

I think this is the stuff movies are made of!

Paper: https://ai.meta.com/research/publications/the-llama-3-herd-of-models/

@LeroyDyer replied:

Ahem, subterfuge, my friend: you do not need a paper to explain your pretraining!!!

So you can see the smoke screens going up!

What did they really put in the model, and why did they really want to train an open-source model with so much cash and time??? What is the point?

Here it's called a diversion!

The sub-models here are trained on guarded data (hence they cannot perform as well as unlocked models with task training).
They also released models of bad mathematical sizes, e.g. NeMo released with a 5120 hidden size instead of 4096, or 8192 as the next step. As we know, 5120 is a dead number in computing: it is not binary, it cannot be a clean collection of bytes, as it cannot match and must produce some .5 number!
Remember, floats do not exist in binary (so they are to be avoided), hence using numbers such as 1, 2, 4, 8, 16, 32, 64, 128. So if the layer count is not one of these values, we are already wrong!
If the hidden widths are not these sizes, we are not training well.
What is the point of a paper which does not highlight these facts?
I think that because they hired some random devs, the devs are not doing due diligence with testing and training; they are just wasting money!
Especially when there is actually no need to pass the 7B marker, and obviously it should be jumping to 16B next, or around 14.89B (in that ballpark), with a hidden size of 8192, etc.!

Hence you should play with the numbers of the model; then you will find how easy it is to train, and how the model is actually faster and better using binary-based values (a sketch of that kind of check follows below).
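
For illustration, here is a minimal sketch of the kind of dimension check this comment argues for: flagging model widths that are not powers of two. The sizes are taken from the comment itself; whether non-power-of-two widths actually hurt training is the commenter's claim, not something this snippet demonstrates.

```python
def is_power_of_two(n: int) -> bool:
    # A positive integer is a power of two iff it has exactly one bit set.
    return n > 0 and (n & (n - 1)) == 0

for hidden_size in (4096, 5120, 8192):
    verdict = "power of two" if is_power_of_two(hidden_size) else "NOT a power of two"
    print(f"hidden_size={hidden_size}: {verdict}")
# hidden_size=4096: power of two
# hidden_size=5120: NOT a power of two   <- the NeMo width the comment objects to
# hidden_size=8192: power of two
```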

So...
Llama did update the codebase this time, but still released a bad-maths model!

Obviously, they would never release a powerful model into the public's hands!

Especially Facebook, Microsoft, Google, and the other big companies, as they are already inside the governments of the world (controlled). Especially as OpenAI still has not released their code or model, but keeps adding layers to their existing model and spoon-feeding the public a commercial, guard-railed model. Now the independents will be the only valid source for TRAINED models!

Hence developing a FULL training strategy is key for the open-source developer, not these pretrained red herrings.


Hey @LeroyDyer
I absolutely agree with you.
I even posted about this: https://www.linkedin.com/feed/update/urn:li:activity:7221869245893070848/

I could not post it on Hugging Face due to the 1 post / 24 hr limit.