Jaward Sesay

AI & ML interests

I like to train large deep neural nets too 🧠🤖💥 | First Paper (AutoAgents: A Framework for Automatic Agent Generation) Accepted @ IJCAI 2024 | Role Model Karpathy

Posts

Super Exciting New Paper By Meta🤖🧠🚀

Discrete Flow Matching:
Introduces a new framework/algorithm for generating text/code without predicting auto-regressively, i.e. one token at a time as traditional GPT models do. Instead, it generates all parts of the text/code at once.

The algorithm does this by gradually transforming random noise (the source) into meaningful text (the data). It learns to move samples along a path between source and target using a "probability velocity" that describes how the token probabilities change over time. During generation, DFM starts from a random sample and iteratively updates it with this learned velocity, gradually turning it into a sample from the target distribution. This allows for non-autoregressive generation.
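
To make the idea concrete, here is a minimal, conceptual sketch of this kind of non-autoregressive sampling loop. It is not the paper's exact algorithm: the `denoiser` function is a stand-in for a trained transformer, and the vocabulary size, number of steps, and linear schedule are illustrative placeholders.

```python
# Conceptual sketch of discrete-flow-style sampling: start from random tokens
# and repeatedly nudge every position toward the model's predicted clean
# distribution, all positions in parallel (no left-to-right generation).
import torch

VOCAB_SIZE, SEQ_LEN, STEPS = 100, 16, 10

def denoiser(x_t, t):
    """Stand-in for a trained model that, given the current noisy sequence
    x_t and time t in [0, 1], predicts per-position logits over clean tokens."""
    return torch.randn(x_t.shape[0], x_t.shape[1], VOCAB_SIZE)  # placeholder

def sample(batch_size=2):
    # t = 0: pure noise (uniformly random tokens at every position).
    x = torch.randint(0, VOCAB_SIZE, (batch_size, SEQ_LEN))
    for step in range(STEPS):
        t = step / STEPS
        dt = 1.0 / STEPS
        # Model's current guess of the clean data distribution per position.
        probs = torch.softmax(denoiser(x, t), dim=-1)
        x1_guess = torch.distributions.Categorical(probs).sample()
        # Follow the learned "velocity": with a rate that grows as t -> 1,
        # move each position toward the predicted clean token.
        move = torch.rand(batch_size, SEQ_LEN) < dt / max(1.0 - t, dt)
        x = torch.where(move, x1_guess, x)
    return x

print(sample())
```

The key point is that every position gets updated in parallel at each step, rather than one token at a time.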

They scaled models up to 1.7B parameters, achieving impressive scores on the HumanEval and MBPP coding benchmarks and significantly closing the gap between autoregressive models and discrete flow models.

Though still in its infancy, the approach holds real promise, as leading research scientists argue that non-autoregressive methods can yield better reasoning.
LazyLLM - Unusual Collab (Apple & Meta) Yields Impactful Work

LLM inference typically consists of two stages: prefilling and decoding. In the prefilling stage, the model processes the entire input prompt, computing and caching key-value (KV) pairs for each token, which can be time-consuming for long prompts. This is followed by the decoding stage, where the model generates tokens sequentially, reusing the cached KVs.
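
For context, here is a toy sketch of those two stages (not any particular library's API; the shapes, weights, and `attention_kv` helper are placeholders). Prefill touches every prompt token at once, which is why long prompts are expensive; decode only appends one token's K/V per step.

```python
# Toy sketch of the two standard LLM inference stages with a KV cache.
import torch

D_MODEL = 8

def attention_kv(x, w_k, w_v):
    """Project token states to the key/value pairs that get cached."""
    return x @ w_k, x @ w_v

# --- Prefill: the whole prompt is processed at once. ---
prompt_states = torch.randn(1, 512, D_MODEL)            # 512 prompt tokens
w_k, w_v = torch.randn(D_MODEL, D_MODEL), torch.randn(D_MODEL, D_MODEL)
k_cache, v_cache = attention_kv(prompt_states, w_k, w_v)

# --- Decode: each new token computes only its own K/V and appends it. ---
for _ in range(4):
    new_state = torch.randn(1, 1, D_MODEL)               # freshly generated token
    k_new, v_new = attention_kv(new_state, w_k, w_v)
    k_cache = torch.cat([k_cache, k_new], dim=1)
    v_cache = torch.cat([v_cache, v_new], dim=1)
```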

LazyLLM introduces a dynamic token pruning technique. Instead of computing KVs for all tokens during prefilling, LazyLLM selectively processes only the most important tokens based on attention scores, deferring less important ones to later steps if needed. It uses progressive token pruning across transformer layers and introduces an Aux Cache to store hidden states of pruned tokens.
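
Below is a hedged sketch of what score-based pruning during prefill could look like. The importance score (attention received from the last prompt token), the `keep_ratio` value, and the dict-based aux cache are illustrative simplifications I chose, not the paper's exact implementation, which applies pruning progressively across layers.

```python
# Simplified sketch of dynamic token pruning during prefill:
# keep only the most-attended prompt tokens and defer the rest.
import torch

def prune_prefill(hidden, q_last, w_k, keep_ratio=0.5):
    """Keep the most important prompt tokens for the next layer; stash the
    rest in an 'aux cache' so they can be revived later if needed."""
    keys = hidden @ w_k                                       # (seq, d)
    scores = torch.softmax(q_last @ keys.T / keys.shape[-1] ** 0.5, dim=-1)
    k = max(1, int(keep_ratio * hidden.shape[0]))
    keep_idx = scores.topk(k).indices.sort().values           # preserve token order
    mask = torch.zeros(hidden.shape[0], dtype=torch.bool)
    mask[keep_idx] = True
    kept = hidden[mask]                  # flows on to the next transformer layer
    aux_cache = {int(i): hidden[i] for i in torch.where(~mask)[0]}  # deferred tokens
    return kept, aux_cache

hidden = torch.randn(512, 64)            # hidden states of 512 prompt tokens
q_last = torch.randn(64)                 # query of the last prompt token
w_k = torch.randn(64, 64)
kept, aux = prune_prefill(hidden, q_last, w_k)
print(kept.shape, len(aux))              # fewer tokens flow through later layers
```

The deferred hidden states can be pulled back from the aux cache at a later layer or decoding step if they turn out to matter.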

This approach significantly reduces the time-to-first-token (TTFT) and overall generation time while maintaining accuracy across various tasks. LazyLLM outperforms baseline techniques like random token dropping and static pruning, and can be easily integrated into existing LLMs without fine-tuning, offering a practical solution for accelerating LLM inference, especially in long context scenarios.

IN SIMPLE TERMS
When you prompt a large language model (LLM), it usually looks at every single word/subword (token) in your prompt before generating a response. This can be time-consuming, especially for prompts with very long texts. This paper introduces a new technique that solves this problem by being more selective. Instead of looking at every word right away, it only focuses on the most important words first. It decides which words are important based on how much attention the model gives them. If it needs other words later, it can go back and look at them then. This approach is like skimming a text for key information before reading it in detail.

Read More: https://arxiv.org/pdf/2407.14057
