Mohammed Hamdy

mmhamdy

AI & ML interests

NLP | Reinforcement Learning

mmhamdy's activity

posted an update 17 days ago
💡 Thinking Tokens For Language Models!

How much is 56 times 37? Can you answer that right away?

In a short paper, David Herel and Tomas Mikolov propose a simple method to improve the reasoning of language models when performing complex calculations.

📌 They note that, although language models are not that good with difficult calculations, humans also cannot perform these calculations immediately and require a considerable amount of time to come up with an answer.

Inspired by this, they introduce 💡Thinking Tokens💡

So what are those "thinking tokens"?! Nothing fancy: they are just special tokens '<T>' that you insert after each word in a sentence whenever a complex problem is encountered. That's it!

👉 The main idea is to "buy" the model "some time" to think about the problem, using the additional computation steps spent on these tokens before it answers. With this method they observed a slight improvement in perplexity.

👉 Before getting excited, note that they added these tokens manually and used an RNN language model. From the paper:

"As a proof of concept, we have added N ’thinking tokens’ (< T >) after each observed word in a dataset. Our vision is that this basic concept can be extended to a self-adjusting model, which will be able to decide itself if and how many ’thinking tokens’ will be used for a specific problem, where N could also vary throughout the sentence. This would allow us to reduce the computational time, which would not increase N times."
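To make the preprocessing concrete, here is a minimal sketch of inserting N thinking tokens after each word. It assumes plain whitespace tokenization, and the function name `add_thinking_tokens` is made up for illustration; it is not the authors' code.

```python
def add_thinking_tokens(sentence: str, n: int = 1, token: str = "<T>") -> str:
    """Insert `n` copies of `token` after every whitespace-separated word.

    Assumption: words are split on whitespace; the paper's exact
    tokenization is not described in the post.
    """
    out = []
    for word in sentence.split():
        out.append(word)
        out.extend([token] * n)  # the extra "thinking" steps for this word
    return " ".join(out)


if __name__ == "__main__":
    print(add_thinking_tokens("How much is 56 times 37?", n=2))
```

With N tokens per word, the sequence grows to (N + 1) times its original length, which is the cost the quoted passage hopes a self-adjusting model could avoid.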
posted an update about 1 month ago
⌚ Visiting the past with Time Machine GPT!

We are all familiar with the concept of a model suite: a series of variants of a certain model that differ mainly in size. For example, Llama-2 7B, Llama-2 13B, and Llama-2 70B.

But this is not always the case. Researchers from The University of Oxford, The Alan Turing Institute, and The University of Manchester introduced TimeMachineGPT (TiMaGPT), a suite of language models that were pretrained on data constrained by a certain period in time. Instead of various sizes of the model, you get the same model but trained on different data coming from different times.

Using a GPT-2 model architecture with 117 million parameters, they trained 12 different models on Wikipedia and WMT News from 2011 to 2022, one model per year: TiMaGPT-2011, TiMaGPT-2012, ..., TiMaGPT-2022.
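A rough sketch of what "data constrained by time" could look like in practice is below. The `Doc` type and `build_yearly_corpora` are hypothetical, and the sketch assumes each model sees everything dated up to and including its year; the post does not say whether the real training sets are cumulative or per-year slices.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Doc:
    year: int   # publication year of the document
    text: str


def build_yearly_corpora(
    docs: List[Doc], first_year: int = 2011, last_year: int = 2022
) -> Dict[str, List[str]]:
    """Map each model name to the texts it is allowed to see.

    Assumption: a model for year Y sees all documents with year <= Y.
    """
    corpora = {}
    for cutoff in range(first_year, last_year + 1):
        corpora[f"TiMaGPT-{cutoff}"] = [d.text for d in docs if d.year <= cutoff]
    return corpora


docs = [
    Doc(2010, "old article"),
    Doc(2015, "mid article"),
    Doc(2022, "recent article"),
]
corpora = build_yearly_corpora(docs)

if __name__ == "__main__":
    for name, texts in corpora.items():
        print(name, len(texts))
```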

🤔 But how could these models be useful?

They can be very useful. For example:

1️⃣ Most language models are static: they are trapped in the time bubble of their pretraining data, and their knowledge is limited by the cut-off date of their training dataset. To update their knowledge, Temporal Adaptation can be performed, which means further training on newer data. The TiMaGPT series of models can be used to study the limitations of Temporal Adaptation of language models.

2️⃣ A word's meaning can change not only with its context but also with its time of use, and a large body of research focuses on understanding how embeddings shift through time. TiMaGPT will be very helpful in studying this phenomenon.

3️⃣ One more use case, in the context of time-series forecasting and event prediction, is "backtesting": using historical data to evaluate new models for forecasting the future. Models like TiMaGPT (each living in its own time without any knowledge of the future/present) will be great for such a use case.
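For the backtesting case, the key step is picking a model that cannot have seen the event being predicted. A minimal sketch, assuming the TiMaGPT-YYYY naming used in this post (the actual repository names on the hub may differ) and a hypothetical helper `model_for_backtest`:

```python
MODEL_YEARS = range(2011, 2023)  # 2011..2022, one model per year


def model_for_backtest(event_year: int) -> str:
    """Return the newest model whose data year is strictly before the event.

    Assumption: a model labelled with year Y contains no data from after Y.
    """
    eligible = [y for y in MODEL_YEARS if y < event_year]
    if not eligible:
        raise ValueError(f"no model predates {event_year}")
    return f"TiMaGPT-{max(eligible)}"


if __name__ == "__main__":
    print(model_for_backtest(2016))
```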

🤗 All models and datasets are on the hub: https://huggingface.co/Ti-Ma
posted an update about 1 month ago
Prompting BERT!

Zero-shot learning ability is the hottest thing about causal LLMs. You don't need to finetune causal LLMs on each specific task. Instead, you can use prompting and get decent performance on unseen tasks.

Unfortunately, autoencoding LLMs - like our dear friend BERT 🙋‍♂️- lack this ability and you need a task-specific head for different tasks. But what if you could prompt all the BERTs in the world?!

🥁 Introducing Statement-Tuning 🥁

Now hold your horses! Don't go full-Llama on it yet. Using this finetuning approach, we can get zero-shot performance from encoders by turning a task into a set of yes/no questions. Binary classification all the way down!
For example, a single entailment problem will be decomposed into 3 yes/no questions.

This is still not super useful. But I like work that tries to make a little more space for encoders in the current autoregressive era!

Check the paper if interested: Enabling Natural Zero-Shot Prompting on Encoder Models via Statement-Tuning (2404.12897)