Ketan

Ketansomewhere

AI & ML interests

Diffusion Models

Recent Activity

liked a dataset 8 months ago
m1guelpf/nouns
reacted to singhsidhukuldeep's post with 🤯 8 months ago
Meta Researchers: How many compute hours should we use to train Llama 3.1? Mr. Zuck: Yes! 🤖💪

The good folks at @AIatMeta did not just release the models; they also published a detailed 92-page paper 📄 on their findings and the technical aspects of the models and their training process!

Generally, we just gobble up the weights and forget the compute infrastructure used to train these models. 🖥️🚀

Here are some interesting findings about the computing infrastructure behind the Llamas:
- Llama 1 and 2 were trained on @Meta's AI Research SuperCluster; Llama 3 was migrated to Meta's production clusters! 📊
- That's 16,000 H100 GPUs, each with a 700W TDP and 80GB of HBM3, arranged in Meta's Grand Teton AI server platform. 🖥️🔋
- What about storing checkpoints? They used Tectonic, a distributed file system, with capacities reaching 240 PB and a peak throughput of 7 TB/s. 💾📈
- Meta's mad lads saved each GPU's model state, ranging from 1 MB to 4 GB per GPU, for recovery and debugging. 🛠️🔍

If this sounds big, well, they also document the humongous challenges that come with it:
- Over the 54-day training period, there were 466 job interruptions. 🕒🔄
- About 78% of the unexpected interruptions were attributed to confirmed or suspected hardware issues, mostly GPUs! 💥🖥️
- Saving all checkpoints is cool until you do it for a 300B+ parameter model. The bursty nature of checkpoint writes, essential for state-saving during training, periodically saturated the storage fabric and hurt performance. 📉💾
- With all this, effective training time (the time spent on useful training over the elapsed time) was still higher than 90%. ⏱️📊

I think this is the stuff that movies can be made of! 🎬🌟

Paper: https://ai.meta.com/research/publications/the-llama-3-herd-of-models/
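The failure and checkpoint figures in the post invite a quick sanity check. A minimal back-of-envelope sketch, using only the numbers quoted above (per-GPU state is given as a 1 MB to 4 GB range, so the 4 GB end is treated as the worst case):

```python
# Back-of-envelope arithmetic from the figures quoted in the post.

interruptions = 466   # job interruptions over the whole run
training_days = 54    # length of the training period

# Average interruption rate: roughly one failure every ~3 hours.
per_day = interruptions / training_days
print(f"~{per_day:.1f} interruptions/day")  # → ~8.6 interruptions/day

# Checkpoint volume: up to 4 GB of saved state per GPU across 16,000 GPUs.
gpus = 16_000
max_state_gb = 4
worst_case_tb = gpus * max_state_gb / 1024
print(f"worst-case checkpoint sweep: ~{worst_case_tb:.1f} TB")  # → ~62.5 TB
```

At roughly one interruption every three hours, the bursty multi-terabyte checkpoint writes make the reported >90% effective training time all the more impressive.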

Organizations

None yet