Quentin Tardif

ntnq

AI & ML interests

None yet

Recent Activity

Organizations

ntnq's activity

Reacted to Taylor658's post with ๐Ÿ‘€ about 1 month ago
view post
Post
2199
The Mystery Bot ๐Ÿ•ต๏ธโ€โ™‚๏ธ saga I posted about from earlier this week has been solved...๐Ÿค—

Cohere for AI has just announced its open source Aya Expanse multilingual model. The Initial release supports 23 languages with more on the way soon.๐ŸŒŒ ๐ŸŒ

You can also try Aya Expanse via SMS on your mobile phone using the global WhatsApp number or one of the initial set of country specific numbers listed below.โฌ‡๏ธ

๐ŸŒWhatsApp - +14313028498
Germany - (+49) 1771786365
USA โ€“ +18332746219
United Kingdom โ€” (+44) 7418373332
Canada โ€“ (+1) 2044107115
Netherlands โ€“ (+31) 97006520757
Brazil โ€” (+55) 11950110169
Portugal โ€“ (+351) 923249773
Italy โ€“ (+39) 3399950813
Poland - (+48) 459050281
  • 1 reply
ยท
Reacted to m-ric's post with ๐Ÿ‘ about 1 month ago
view post
Post
3035
๐Ÿ“œ ๐Ž๐ฅ๐-๐ฌ๐œ๐ก๐จ๐จ๐ฅ ๐‘๐๐๐ฌ ๐œ๐š๐ง ๐š๐œ๐ญ๐ฎ๐š๐ฅ๐ฅ๐ฒ ๐ซ๐ข๐ฏ๐š๐ฅ ๐Ÿ๐š๐ง๐œ๐ฒ ๐ญ๐ซ๐š๐ง๐ฌ๐Ÿ๐จ๐ซ๐ฆ๐ž๐ซ๐ฌ!

Researchers from Mila and Borealis AI just have shown that simplified versions of good old Recurrent Neural Networks (RNNs) can match the performance of today's transformers.

They took a fresh look at LSTMs (from 1997!) and GRUs (from 2014). They stripped these models down to their bare essentials, creating "minLSTM" and "minGRU". The key changes:
โถ Removed dependencies on previous hidden states in the gates
โท Dropped the tanh that had been added to restrict output range in order to avoid vanishing gradients
โธ Ensured outputs are time-independent in scale (not sure I understood that well either, don't worry)

โšก๏ธ As a result, you can use a โ€œparallel scanโ€ algorithm to train these new, minimal RNNs, in parallel, taking 88% more memory but also making them 200x faster than their traditional counterparts for long sequences

๐Ÿ”ฅ The results are mind-blowing! Performance-wise, they go toe-to-toe with Transformers or Mamba.

And for Language Modeling, they need 2.5x fewer training steps than Transformers to reach the same performance! ๐Ÿš€

๐Ÿค” Why does this matter?

By showing there are simpler models with similar performance to transformers, this challenges the narrative that we need advanced architectures for better performance!

๐Ÿ’ฌย Franรงois Chollet wrote in a tweet about this paper:

โ€œThe fact that there are many recent architectures coming from different directions that roughly match Transformers is proof that architectures aren't fundamentally important in the curve-fitting paradigm (aka deep learning)โ€

โ€œCurve-fitting is about embedding a dataset on a curve. The critical factor is the dataset, not the specific hard-coded bells and whistles that constrain the curve's shape.โ€

Itโ€™s the Bitter lesson by Rich Sutton striking again: donโ€™t need fancy thinking architectures, just scale up your model and data!

Read the paper ๐Ÿ‘‰ย  Were RNNs All We Needed? (2410.01201)
  • 2 replies
ยท