---
license: mit
datasets:
- wikipedia
language:
- en
---

# Bottleneck T5 ⏳

The Bottleneck T5 model powers many of my experiments and demos exploring interfaces for inspecting and editing text in latent space. This model is an autoencoder for text; it's able to encode text up to 512 tokens into an embedding, then reconstruct the original text from the embedding. The structure of the embedding space produced by this model also allows for semantic edits to text through vector arithmetic in latent space.

## Model Details

Using embeddings produced by this model, we can semantically interpolate between pieces of text and edit sentences using their latent attributes like length, tone, structure, or topic.

All Bottleneck T5 models are trained on a filtered subset of the English Wikipedia, and perform best at encoding and decoding encyclopedic and other similar kinds of text. Text that's heavily technical, conversational, or otherwise unconventional may be out of distribution for the model, and the model may not perform as well on such inputs.

Bottleneck T5 embeddings are always normalized to length 1; the encoder produces embeddings of length 1, and any inputs to the decoder will be normalized to length 1.

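
To make the unit-length constraint concrete, here is a minimal sketch of interpolating between two embeddings and renormalizing the result. It uses plain Python lists so it runs without the model; the helper names `normalize` and `interpolate` are hypothetical and not part of the model's API, and the three-dimensional vectors are toy stand-ins for real embeddings. With real embeddings, the same arithmetic would apply to the tensors the encoder produces.

```python
import math


def normalize(vec):
    """Scale a vector to unit (L2) length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def interpolate(a, b, t):
    """Linearly blend two embeddings, then renormalize to length 1.

    t=0.0 gives the direction of a, t=1.0 gives the direction of b.
    """
    mixed = [(1.0 - t) * x + t * y for x, y in zip(a, b)]
    return normalize(mixed)


# Toy 3-dimensional stand-ins for real embeddings (illustrative values only).
a = normalize([1.0, 0.0, 0.0])
b = normalize([0.0, 1.0, 0.0])
midpoint = interpolate(a, b, 0.5)
```

Since inputs to the decoder are normalized to length 1 anyway, the explicit renormalization mostly matters when comparing or storing interpolated vectors before decoding them.
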
- **Developed by:** [Linus Lee](https://thesephist.com/)
- **Model type:** T5-style encoder-decoder transformer with an attention pooled bottleneck and gated cross-attention
- **Language(s) (NLP):** English
- **License:** MIT
- **Finetuned from model:** LM-adapted T5 v1.1

## Using the model

The model is currently in a prototype state implemented on top of the T5 language model, so we need a small wrapper class around it to use it for embedding and generating text:

```py
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

class BottleneckT5Autoencoder:
    def __init__(self, model_path: str, device='cpu'):
        self.device = device
        self.tokenizer = AutoTokenizer.from_pretrained(model_path, model_max_length=512)
        # trust_remote_code=True lets transformers load the model code shipped with the checkpoint
        self.model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True).to(self.device)
        self.model.eval()

    @torch.no_grad()
    def embed(self, text: str) -> torch.FloatTensor:
        inputs = self.tokenizer(text, return_tensors='pt').to(self.device)
        # The decoder still needs input IDs, so pass the tokenized empty string
        decoder_inputs = self.tokenizer('', return_tensors='pt').to(self.device)
        return self.model(
            **inputs,
            decoder_input_ids=decoder_inputs['input_ids'],
            encode_only=True,
        )[0]

    @torch.no_grad()
    def generate_from_latent(self, latent: torch.FloatTensor, max_length=512, temperature=1.0) -> str:
        # Embed a dummy input, then set the difference between the target
        # latent and the dummy embedding as the model's perturbation vector
        dummy_text = '.'
        dummy = self.embed(dummy_text)
        perturb_vector = latent - dummy
        self.model.perturb_vector = perturb_vector
        input_ids = self.tokenizer(dummy_text, return_tensors='pt').to(self.device).input_ids
        # Sample a reconstruction with nucleus sampling
        output = self.model.generate(
            input_ids=input_ids,
            max_length=max_length,
            do_sample=True,
            temperature=temperature,
            top_p=0.9,
            num_return_sequences=1,
        )
        return self.tokenizer.decode(output[0], skip_special_tokens=True)
```

Then we can initialize this autoencoder class from a model checkpoint.

```py
device = 'cuda' if torch.cuda.is_available() else 'cpu'
autoencoder = BottleneckT5Autoencoder(model_path='thesephist/contra-bottleneck-t5-large-wikipedia', device=device)
```

Embed and un-embed text with `.embed(text: str)` and `.generate_from_latent(embedding: torch.FloatTensor)`.

```py
texts = [
    'The quick brown fox jumps over the lazy dog',
    'Hi there! My name is Linus, and I spend a lot of my time thinking about latent spaces of neural network models.',
    'Notion is a single space where you can think, write, and plan. Capture thoughts, manage projects, or even run an entire company — and do it exactly the way you want.',
]

for t in texts:
    embedding = autoencoder.embed(t)
    reconstruction = autoencoder.generate_from_latent(embedding)
    print(reconstruction)
```

produces the text:

```
The quick brown fox jumps over the lazy dog
I'm named after Linus, and I spend a lot of my time thinking about neural networks of latent space models.
Notion is a single place where you can think, plan, and spend time. Capture ideas, manage projects, and even do your own writing — or plan it exactly the way you want.
```

For more examples of how to use the model to compute interpolations and semantic edits with Contra, see [this Google Colab notebook](https://linus.zone/contra-colab).

## Training Details

Contra was initialized from the [language modeling-adapted T5 v1.1 checkpoint](https://huggingface.co/models?other=t5-lm-adapt) and trained as a denoising autoencoder on a subset of the English [Wikipedia](https://huggingface.co/datasets/wikipedia) dataset filtered for length. Training ran for a single epoch with 30% of tokens randomly masked, using the Adafactor optimizer.

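
The exact noising procedure behind the 30% masking isn't specified here, so the following is only a minimal sketch of one way to randomly mask a fraction of tokens. The `mask_tokens` helper, the `<mask>` placeholder, the per-token independent sampling, and the whitespace tokenization are all assumptions made for illustration, not the actual training code.

```python
import random


def mask_tokens(tokens, mask_rate=0.3, mask_token="<mask>", seed=None):
    """Return a copy of tokens where each token is independently
    replaced by mask_token with probability mask_rate."""
    rng = random.Random(seed)
    return [mask_token if rng.random() < mask_rate else tok for tok in tokens]


# Whitespace splitting stands in for a real tokenizer (assumption).
tokens = "The quick brown fox jumps over the lazy dog".split()
noisy = mask_tokens(tokens, mask_rate=0.3, seed=0)
# A denoising autoencoder would be trained to reconstruct `tokens` from `noisy`.
```

With independent sampling, the fraction of masked tokens in any one sequence only averages out to the mask rate; a scheme that masks an exact 30% of positions per sequence would be an equally plausible reading.
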
#### Model family and checkpoints

I recommend experimenting first with `thesephist/contra-bottleneck-t5-large-wikipedia`, which strikes a good balance between model size and output quality, but I've trained four variants ranging from 60M to 3B parameters:

- [thesephist/contra-bottleneck-t5-small-wikipedia](https://huggingface.co/thesephist/contra-bottleneck-t5-small-wikipedia): 60M params, 512 embedding dimensions
- [thesephist/contra-bottleneck-t5-base-wikipedia](https://huggingface.co/thesephist/contra-bottleneck-t5-base-wikipedia): 220M params, 768 embedding dimensions
- [thesephist/contra-bottleneck-t5-large-wikipedia](https://huggingface.co/thesephist/contra-bottleneck-t5-large-wikipedia): 770M params, 1024 embedding dimensions
- [thesephist/contra-bottleneck-t5-xl-wikipedia](https://huggingface.co/thesephist/contra-bottleneck-t5-xl-wikipedia): 3B params, 2048 embedding dimensions