LongNet: Scaling Transformers to 1,000,000,000 Tokens

Published on Jul 5, 2023
· Submitted by akhaliq on Jul 6, 2023
#1 Paper of the day


Scaling sequence length has become a critical demand in the era of large language models. However, existing methods struggle with either computational complexity or model expressivity, restricting the maximum sequence length. In this work, we introduce LongNet, a Transformer variant that can scale sequence length to more than 1 billion tokens, without sacrificing performance on shorter sequences. Specifically, we propose dilated attention, which expands the attentive field exponentially as the distance grows. LongNet has significant advantages: 1) it has linear computational complexity and a logarithmic dependency between any two tokens; 2) it can serve as a distributed trainer for extremely long sequences; 3) its dilated attention is a drop-in replacement for standard attention, and can be seamlessly integrated with existing Transformer-based optimizations. Experimental results demonstrate that LongNet yields strong performance on both long-sequence modeling and general language tasks. Our work opens up new possibilities for modeling very long sequences, e.g., treating a whole corpus or even the entire Internet as a sequence.
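The dilated attention described in the abstract can be sketched as follows. This is a toy single-head illustration of the idea, not the paper's implementation: within each segment of length `segment_len`, only every `dilation`-th token participates in attention, and mixing several (segment length, dilation) pairs lets short segments capture local detail while long, heavily dilated segments reach far tokens at the same cost.

```python
import torch

def dilated_attention(q, k, v, segment_len, dilation):
    """Toy single-head dilated attention: within each segment of length
    `segment_len`, only every `dilation`-th token attends and is attended to."""
    B, N, d = q.shape
    out = torch.zeros_like(q)
    for start in range(0, N, segment_len):
        # keep every `dilation`-th token of this segment
        idx = torch.arange(start, min(start + segment_len, N), dilation)
        qs, ks, vs = q[:, idx], k[:, idx], v[:, idx]
        attn = torch.softmax(qs @ ks.transpose(-2, -1) / d ** 0.5, dim=-1)
        out[:, idx] = attn @ vs
    return out

# mixing several (segment_len, dilation) pairs approximates full attention:
# the attentive field grows exponentially with distance while each pair
# costs only O(N) per sequence
x = torch.randn(1, 16, 8)
y = sum(dilated_attention(x, x, x, w, r) for w, r in [(4, 1), (8, 2), (16, 4)])
```

Each pair touches N/segment_len segments of segment_len/dilation tokens, which is how the overall complexity stays linear in N.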


They are literally taking all the tricks that vision ppl used on ViT re-pbublishing them. When are they going to publish something like Swin-LLM?


Good points. I think the next one will be Deformable Masked Attention

“Different from vanilla attention, both sizes of K and V are independent of the sequence length N, making the communication cost constant.”
This sentence is doubtful. Show the proof that K_i and V_i are independent of the sequence length N. These tensors' sizes are still related to the sequence length even after the dilation.
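For what it's worth, the paper's claim appears to concern the *sparsified* K_i and V_i held per segment: with the segment length w and dilation rate r fixed as hyperparameters, each segment keeps w/r rows of K and V no matter how large N grows; N only changes the *number* of segments (and thus devices). A toy count, with w and r as assumed hyperparameter values:

```python
def kv_rows_per_segment(seq_len, segment_len, dilation):
    """Rows of K/V each segment keeps after dilation (toy count)."""
    num_segments = seq_len // segment_len
    rows_per_segment = segment_len // dilation
    return num_segments, rows_per_segment

# growing N only grows the number of segments; rows per segment stay fixed
for N in (4096, 65536, 1 << 20):
    segs, rows = kv_rows_per_segment(N, segment_len=2048, dilation=4)
    print(N, segs, rows)  # rows is always 2048 // 4 = 512
```

So the per-device communication payload is constant, even though the total K and V across all devices still scale with N.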

we should have a dislike button too.. don't you think?

Is there any model available?

This has been happening for quite some time now: using NLP techniques in CV and CV techniques in NLP.
At the end of the day, it's the math.

Hey, I'm reviewing deep learning papers daily on Twitter in Hebrew via hashtag #. So far I've briefly reviewed about deep learning papers. You are invited to follow and comment.

This paper review can be found at:


Is this so bad, as long as they cite CV papers? It's... arguably... how science ought to work?

No matter the source of their inspiration (deepmind always does this..), we want it on huggingface ASAP !

please gibe me model

Is "LongNet: Scaling Transformers to 1,000,000,000 Tokens" something like this?

import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossBar(nn.Module):
    """Per-branch output projection with a learned scale."""
    def __init__(self, dim, heads):
        super().__init__()
        self.dim = dim
        self.heads = heads
        # note: projecting dim -> dim * heads (as originally written) does not
        # compose with the residual stream, so project dim -> dim instead
        self.crossbar_linear = nn.Linear(dim, dim)
        self.scale = nn.Parameter(torch.ones(1))

    def forward(self, x):
        # scaled GELU nonlinearity on the projected features
        return self.scale * F.gelu(self.crossbar_linear(x))

class DilatedMHAttention(nn.Module):
    def __init__(self, dim, num_heads=8, qkv_bias=False, dilation_rates=(1,)):
        super().__init__()
        assert dim % num_heads == 0
        self.dim = dim
        self.num_heads = num_heads
        self.q = nn.Linear(dim, dim, bias=qkv_bias)
        self.k = nn.Linear(dim, dim, bias=qkv_bias)
        self.v = nn.Linear(dim, dim, bias=qkv_bias)
        self.dilation_rates = list(dilation_rates)
        self.crossbars = nn.ModuleList([CrossBar(dim, num_heads) for _ in self.dilation_rates])

    def forward(self, x):
        B, N, _ = x.shape
        head_dim = self.dim // self.num_heads
        outputs = []
        for rate, crossbar in zip(self.dilation_rates, self.crossbars):
            # subsample the sequence at the given dilation rate
            xs = x[:, ::rate, :]
            q, k, v = (proj(xs).view(B, -1, self.num_heads, head_dim).transpose(1, 2)
                       for proj in (self.q, self.k, self.v))
            attn = torch.softmax(q @ k.transpose(-2, -1) / head_dim ** 0.5, dim=-1)
            out = (attn @ v).transpose(1, 2).reshape(B, -1, self.dim)
            # scatter attended tokens back to their original positions
            full = x.clone()
            full[:, ::rate, :] = crossbar(out)
            outputs.append(full)
        return sum(outputs) / len(outputs)

class FeedForward(nn.Module):
    def __init__(self, dim, hidden_dim, dropout=0.):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden_dim),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, dim),
        )

    def forward(self, x):
        return self.net(x)

class LongNet(nn.Module):
    def __init__(self, dim, depth, heads, mlp_dim, num_classes, dilation_rates=None):
        super().__init__()
        if dilation_rates is None:
            dilation_rates = [1] * depth
        self.blocks = nn.ModuleList()
        for i in range(depth):
            self.blocks.append(DilatedMHAttention(dim, heads, dilation_rates=[dilation_rates[i]]))
            self.blocks.append(FeedForward(dim, mlp_dim))
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, x):
        for block in self.blocks:
            x = block(x) + x  # residual connection around every block
        x = x.mean(dim=1)     # mean-pool over the sequence
        return self.classifier(x)

@Emil-Zakirov here is an implementation you can refer to

@unknownentity does it accept inputs_embeds like in huggingface?

The official Microsoft implementation is in the TorchScale repo (no pretrained checkpoints that I know of, you have to train it yourself).

Transforming AI: How LongNet Handles A Billion Tokens Effortlessly!

👉 Subscribe:
👉 Twitter:
👉 LMNT (Partner):

By Arxflix


Models citing this paper 1

Datasets citing this paper 0


Spaces citing this paper 0


Collections including this paper 6