arxiv:2305.07185

MEGABYTE: Predicting Million-byte Sequences with Multiscale Transformers

Published on May 12, 2023
· Featured in Daily Papers on May 15, 2023
Authors:
Lili Yu, Dániel Simig, Colin Flaherty, Armen Aghajanyan, Luke Zettlemoyer, Mike Lewis

Abstract

Autoregressive transformers are spectacular models for short sequences but scale poorly to long sequences such as high-resolution images, podcasts, code, or books. We propose Megabyte, a multi-scale decoder architecture that enables end-to-end differentiable modeling of sequences of over one million bytes. Megabyte segments sequences into patches and uses a local submodel within patches and a global model between patches. This enables sub-quadratic self-attention, much larger feedforward layers for the same compute, and improved parallelism during decoding -- unlocking better performance at reduced cost for both training and generation. Extensive experiments show that Megabyte allows byte-level models to perform competitively with subword models on long context language modeling, achieve state-of-the-art density estimation on ImageNet, and model audio from raw files. Together, these results establish the viability of tokenization-free autoregressive sequence modeling at scale.
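As a reading aid, here is a minimal sketch of the patch-based decomposition in PyTorch. Everything in it is assumed for illustration: the class name MegabyteSketch, the hyperparameters (patch_size, d_local, d_global, layer counts), and the use of plain nn.TransformerEncoder blocks with causal masks as stand-ins for the paper's global and local decoders; positional embeddings and other details from the paper are omitted. It is not the authors' implementation, only a sketch of the structure the abstract describes.

```python
import torch
import torch.nn as nn

def causal_mask(n):
    # Boolean mask: True above the diagonal = "may not attend here".
    return torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)

class MegabyteSketch(nn.Module):
    def __init__(self, vocab=256, patch_size=8, d_local=128, d_global=512,
                 n_heads=8, n_global_layers=4, n_local_layers=2):
        super().__init__()
        self.P, self.d_local = patch_size, d_local
        self.byte_embed = nn.Embedding(vocab, d_local)
        # Global model: one token per patch (concatenated byte embeddings),
        # shifted right so the output for patch k only depends on patches < k.
        self.patch_start = nn.Parameter(torch.zeros(patch_size * d_local))
        self.global_in = nn.Linear(patch_size * d_local, d_global)
        self.global_tf = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_global, n_heads, batch_first=True),
            n_global_layers)
        # Local model: runs over each patch independently (and in parallel),
        # conditioned on that patch's global context vector.
        self.byte_start = nn.Parameter(torch.zeros(d_local))
        self.global_to_local = nn.Linear(d_global, patch_size * d_local)
        self.local_tf = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_local, 4, batch_first=True),
            n_local_layers)
        self.head = nn.Linear(d_local, vocab)

    def forward(self, x):            # x: (B, T) byte ids, T a multiple of P
        B, T = x.shape
        P = self.P
        K = T // P                   # number of patches
        h = self.byte_embed(x).view(B, K, P, self.d_local)

        # Global model between patches (shift right by one patch).
        patches = h.reshape(B, K, P * self.d_local)
        start = self.patch_start.expand(B, 1, -1)
        g_in = self.global_in(torch.cat([start, patches[:, :-1]], dim=1))
        g_out = self.global_tf(g_in, mask=causal_mask(K))    # (B, K, d_global)

        # Local model within patches (shift right by one byte).
        ctx = self.global_to_local(g_out).view(B, K, P, self.d_local)
        first = self.byte_start.expand(B, K, 1, self.d_local)
        shifted = torch.cat([first, h[:, :, :-1]], dim=2)
        l_in = (shifted + ctx).reshape(B * K, P, self.d_local)
        l_out = self.local_tf(l_in, mask=causal_mask(P))
        return self.head(l_out).view(B, T, -1)               # next-byte logits

# Example: logits for two sequences of 64 bytes -> shape (2, 64, 256).
logits = MegabyteSketch()(torch.randint(0, 256, (2, 64)))
```

With this decomposition, expensive attention runs over T/P patch tokens globally plus P positions inside each patch, so the quadratic cost drops from O(T²) to roughly O((T/P)²) + O(T·P), and the local models for different patches can be evaluated in parallel -- the sub-quadratic self-attention and decoding parallelism the abstract refers to.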

Community

👍🤩

👍🦙

Reminds me pretty strongly of the Hierarchical Perceiver architecture.

It would be interesting to see tests with long-range dependencies, rare tokens, or complex patterns in the data.

I have read this paper and run the demo. It seems the model only performs well at the character level; I doubt its performance on downstream tasks.

This paper was cited by @karpathy in the Tokenizer Tutorial: https://youtu.be/zduSFxRajkE?si=qthqUglGA0qc8Z0F

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2305.07185 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2305.07185 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2305.07185 in a Space README.md to link it from this page.

Collections including this paper 7