@mmhamdy on Hugging Face: "💡 Thinking Tokens For Language Models! How much is 56 times 37? Can you…"


mmhamdy

posted an update May 15, 2024


💡 Thinking Tokens For Language Models!

How much is 56 times 37? Can you answer that right away?

In a short paper, David Herel and Tomas Mikolov propose a simple method to improve the reasoning of language models when performing complex calculations.

📌 They note that, although language models are not that good with difficult calculations, humans also cannot perform these calculations immediately and require a considerable amount of time to come up with an answer.

Inspired by this, they introduce 💡Thinking Tokens💡

So what are those "thinking tokens"?! Nothing fancy, they are just special tokens '<T>' that you insert after each word in a sentence whenever a complex problem is encountered. That's it!

👉 The main idea is to "buy" the model "some time" to think about the problem with these additional computations before answering. Using this method, they observed a slight improvement in perplexity.

👉 Before getting excited, note that they added these tokens manually and used an RNN language model. From the paper:

"As a proof of concept, we have added N ’thinking tokens’ (< T >) after each observed word in a dataset. Our vision is that this basic concept can be extended to a self-adjusting model, which will be able to decide itself if and how many ’thinking tokens’ will be used for a specific problem, where N could also vary throughout the sentence. This would allow us to reduce the computational time, which would not increase N times."
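To make the proof-of-concept preprocessing concrete, here is a minimal sketch of inserting N thinking tokens after each word. The function name `add_thinking_tokens` and the whitespace-based word splitting are assumptions for illustration; the paper's actual tokenization and training code are not shown in this post.

```python
THINK = "<T>"  # the special token named in the post


def add_thinking_tokens(sentence: str, n: int = 1, token: str = THINK) -> str:
    """Insert `n` copies of `token` after every whitespace-separated word.

    Hypothetical helper: splitting on whitespace is an assumption, not the
    paper's stated preprocessing.
    """
    out = []
    for word in sentence.split():
        out.append(word)
        out.extend([token] * n)
    return " ".join(out)


if __name__ == "__main__":
    print(add_thinking_tokens("How much is 56 times 37 ?", n=2))
```

With a fixed N the sequence grows to (N + 1) times its original length, which is the computational cost the quoted passage hopes a self-adjusting model could reduce.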

mmhamdy

May 15, 2024

📎 Read the paper: https://arxiv.org/abs/2405.08644

LeroyDyer

May 15, 2024

I added this methodology to my models by changing the prompt and giving it a space for thoughts.
For that space I found datasets that contain the calculations as well as the answer. I put the step-by-step analysis inside the thoughts part of the prompt and the answer at the bottom of the response, with no explanation.
So the main prompt includes the phrase "think step by step", but if you do not give the model a space to think, then how can it?
Some of these theories are lovely, but they lack a practical implementation inside the existing framework, which leaves the field of AI science flooded with false trails.
Hence adjusting the prompt was the simple option.
Previously I had implemented an augmented response inside the model generation, using thinking heads as discussed in this paper and others. It was also a crystalAI model, and the other model was by the original Quiet thoughts models.
There was a small issue with the scripts, but after I overcame it I was able to create my own frankenmodel, which required remote code, or cloning from GitHub and hacking transformers before compiling it.
It worked, but the training procedure was now going to be special. MoE models also have somewhat this power internally. In our models we could also output the data generated by each head into the response or not, however you choose to configure the model.

The prompting method actually works, as it is similar to installing a new task. I also used DPO datasets, so the rejected answer became the thought and the output became the response, retaining the original thought it was replacing even if it did not have one (it is a sequence like any other, so it will be recallable even if you choose to frame it like this). Hence if you use the same prompt to think step by step and add the field for thoughts, you will get the full completion, activating the task. But if you do not include the thought prompt, does that mean the model does not use the methodology in a hidden way? Here the idea was not just to have some type of thought but to structure the thoughts into some form of order. So I decided to reframe as many datasets as possible in this way, even saving some of the verbose structured data into usable fields inside the thoughts area, like recalling a record from a dataset into the thoughts and retrieving the data from the record like a query, and framing the thought process for all entries in the dataset and embedding the task at 0.3.
So now the thoughts will be structured. If you use the prompt again, the model may generate these missing fields in memory before putting out the data. We have now given the model methodologies in the thoughts, plus examples of how to use those methodologies to solve similarly shaped problems or tasks, hence mathematics improving vastly.
We give the model addition and subtraction and simply put the calculation in the thoughts before giving a direct answer, since it is a simple problem that does not require thoughts. But we need to teach the model basic maths first before we can ask complex math questions. Then we can teach the model to use these rulesets to solve basic and complex problems, as right now it is just using replacement to substitute common patterns, and we want it to predict correct answers and not similar shapes. Hence we want it to hallucinate similar types of problems, but we need it to calculate based on the context and past methods.
Now, when we give the model many examples such as this, it will have true step-by-step thinking.
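As a rough sketch of the prompt layout described above, the snippet below builds a training example with a dedicated thoughts field and maps a DPO-style record onto it. The template wording, the section headers, and the `prompt` / `chosen` / `rejected` keys are assumptions for illustration; the comment does not show the actual template or dataset schema.

```python
# Hypothetical template: the section names and wording are illustrative only.
PROMPT_TEMPLATE = (
    "Think step by step.\n"
    "### Input:\n{question}\n\n"
    "### Thoughts:\n{thoughts}\n\n"
    "### Response:\n{answer}"
)


def build_example(question: str, thoughts: str, answer: str) -> str:
    """Format one training example with the working placed in the thoughts field."""
    return PROMPT_TEMPLATE.format(question=question, thoughts=thoughts, answer=answer)


def from_dpo_record(record: dict) -> str:
    """Map a DPO-style record onto the template.

    Assumes the keys 'prompt', 'chosen' and 'rejected'. Following the comment,
    the rejected text fills the thoughts field and the chosen text is the response.
    """
    return build_example(record["prompt"], record["rejected"], record["chosen"])


sample = build_example(
    "What is 56 times 37?",
    "56 * 37 = 56 * 30 + 56 * 7 = 1680 + 392 = 2072",
    "2072",
)

if __name__ == "__main__":
    print(sample)
```

Keeping the answer alone in the response section matches the description of placing the result at the bottom with no explanation, while the calculation lives in the thoughts field.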

A recent addition to this is tree of thoughts, i.e. we can also make the model generate sub-agents in its mind to perform the task and then produce the output!
agentgen uses something like LangChain to do this, so we can capture this data and frame it into a dataset of agent operations to place in the thoughts section, showing how it solved the problem using chains of agents internally. So we adjust our prompt to say: "Generate agents to solve this task. Develop the task first by passing the step-by-step instructions to each agent for its specific task to accomplish the goal of this task. Use the responses from each agent to construct the response for this task. Respond to this task in a formatted and factual fashion."
Now we can insert the blurb of the transaction from the output of agentgen, RAG, or LangChain.
In the future the model can access this prompt to solve tasks internally and produce an output using these internal agents, and again, by activating the thoughts you will be displaying the verbose output of the model. You can add "Show your thoughts" or "Hide your thoughts", so you can send the model the same problem with just input and output and without showing the thoughts (they should still be in the thoughts for training). In the response you should also copy the thoughts to the output, so when the prompt is not correctly installed with the thoughts field you can still access these things from chat mode.
You should also train the same data with no prompt, as simple input and output, first. Only to +1, not deep, just a few random batches, to help it converge later when you give it the ability to calculate the answers. When you test it again on simple input and output, the model will have jumped down automatically. Hence it learned internally. Very, very cool stuff.
So we can use these new tools like agentgen and LangChain temporarily to gather good data from models, as well as verbose RAG output, to train our models to think with that methodology, so we don't need LangChain at all as it will be internal. Hence we trained it to think (tasks with thought patterns). Data should not be based on opinion, which should only be added after the facts have been entered, and only to give the model sarcasm and chat abilities, so it should be framed as chat in some fake role.

ritwikm

Oct 14, 2024

The paper judges the effectiveness of this approach only through perplexity. The concept of perplexity is basically "how perplexed (surprised) your language model is when predicting a token". If a language model generates words at random, then perplexity will be very high. However, if the LM is confident about a small set of words to be generated, then perplexity will be low. So adding a predefined fixed token after each token will obviously make the LM more confident about the next token, and so perplexity will obviously be lower. Isn't that the case?
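The concern can be illustrated with a small sketch using made-up probabilities. It takes perplexity as the exponential of the mean negative log-probability and compares an average over every token with one over real words only. The numbers and the assumption that the model predicts '<T>' with probability 0.99 are invented for illustration; this thread does not say which tokens the paper includes in its perplexity.

```python
import math


def perplexity(probs):
    """Perplexity = exp of the mean negative log-probability of the tokens."""
    return math.exp(-sum(math.log(p) for p in probs) / len(probs))


# Made-up probabilities a model assigns to four real words.
word_probs = [0.1, 0.2, 0.05, 0.1]

# Assumption: a fixed '<T>' after each word is almost perfectly predictable.
think_prob = 0.99

# Interleave each word with one thinking token.
with_think = []
for p in word_probs:
    with_think.extend([p, think_prob])

ppl_words_only = perplexity(word_probs)  # scored on real words only
ppl_all_tokens = perplexity(with_think)  # scored on words and '<T>' tokens

if __name__ == "__main__":
    print(f"words only: {ppl_words_only:.2f}")
    print(f"all tokens: {ppl_all_tokens:.2f}")
```

In this toy setup the all-token figure drops even though no real word became easier to predict, so a fair comparison would score only the real words in both the baseline and the thinking-token model.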
