arxiv:2406.19223

T-FREE: Tokenizer-Free Generative LLMs via Sparse Representations for Memory-Efficient Embeddings

Published on Jun 27
· Submitted by mbrack on Jun 28
Authors:
Abstract

Tokenizers are crucial for encoding information in Large Language Models, but their development has recently stagnated, and they contain inherent weaknesses. Major limitations include computational overhead, ineffective vocabulary use, and unnecessarily large embedding and head layers. Additionally, their performance is biased towards a reference corpus, leading to reduced effectiveness for underrepresented languages. To remedy these issues, we propose T-FREE, which directly embeds words through sparse activation patterns over character triplets, and does not require a reference corpus. T-FREE inherently exploits morphological similarities and allows for strong compression of embedding layers. In our exhaustive experimental evaluation, we achieve competitive downstream performance with a parameter reduction of more than 85% on these layers. Further, T-FREE shows significant improvements in cross-lingual transfer learning.
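
To make the core idea concrete, below is a minimal, illustrative sketch of trigram-based sparse embedding in Python. It is not the paper's implementation: the table size, embedding dimension, number of hash functions, boundary markers, and the hash itself are placeholder assumptions, and the paper's actual construction may differ.

```python
import hashlib
import numpy as np

# All sizes below are illustrative placeholders, not values from the paper.
NUM_SLOTS = 8192    # rows of the (compressed) embedding table
EMBED_DIM = 64      # embedding dimension
NUM_HASHES = 2      # how many table rows each trigram activates

rng = np.random.default_rng(0)
embedding_table = rng.normal(scale=0.02, size=(NUM_SLOTS, EMBED_DIM))

def char_trigrams(word: str) -> list[str]:
    """Overlapping character triplets, with '_' marking word boundaries."""
    padded = f"_{word}_"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def trigram_slots(trigram: str) -> list[int]:
    """Deterministically hash one trigram to a few rows of the embedding table."""
    return [
        int(hashlib.md5(f"{trigram}|{k}".encode()).hexdigest(), 16) % NUM_SLOTS
        for k in range(NUM_HASHES)
    ]

def embed_word(word: str) -> np.ndarray:
    """A word's embedding is the sum over its sparse set of activated rows."""
    vector = np.zeros(EMBED_DIM)
    for tri in char_trigrams(word.lower()):
        for slot in trigram_slots(tri):
            vector += embedding_table[slot]
    return vector

# Morphologically similar words share trigrams, so their activation patterns overlap.
a, b = embed_word("tokenize"), embed_word("tokenizer")
print(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

Because words are mapped to table rows by hashing rather than by a learned vocabulary, no reference corpus is needed and the table can be far smaller than a conventional subword vocabulary.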

Community

Paper author · Paper submitter

Hi Akhaliq,

I believe this latest paper of ours would be a great fit for the daily papers.
We propose a radical paradigm shift for LLM tokenizers, which have remained virtually unchanged for the last 5-6 years.
Our tokenizer-free approach addresses many of their inherent weaknesses while being more computationally efficient and achieving competitive downstream performance. While maintaining that performance, we are able to reduce the size of the embedding and LM-head layers by over 85%, potentially shaving off billions of parameters (rough numbers sketched below). The method also naturally enables multilingual LLMs and easy cross-lingual transfer, since the vocabulary is no longer optimized for a particular language.

Best,
Manuel
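
As a rough sense of scale for that reduction, here is a back-of-the-envelope calculation; the vocabulary size and hidden dimension below are illustrative assumptions, not numbers from the paper:

```python
# Illustrative numbers only; not taken from the paper.
vocab_size, hidden_dim = 256_000, 4_096        # a typical large multilingual vocabulary
baseline = 2 * vocab_size * hidden_dim         # untied embedding matrix + LM head
reduced = int(baseline * 0.15)                 # after an ~85% reduction of these layers
print(f"baseline: {baseline / 1e9:.2f}B, reduced: {reduced / 1e9:.2f}B, "
      f"saved: {(baseline - reduced) / 1e9:.2f}B parameters")
```

With these assumed sizes, the embedding and head layers alone account for roughly 2.1B parameters, so an 85% reduction saves on the order of 1.8B.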

Congrats @mbrack on this work! Are you planning to share models on the Hub (which can then be linked to this paper)?

See https://huggingface.co/docs/hub/models-uploading for details.

Paper author

Yes, we are currently scaling the method to 7B, and we plan to make that model available on the Hub.


