license: mit
datasets:
- wikipedia
language:
- en
Bottleneck T5 ⏳
The Bottleneck T5 model powers many of my experiments and demos exploring interfaces for inspecting and editing text in latent space. This model is an autoencoder for text; it's able to encode text up to 512 tokens into an embedding, then reconstruct the original text from the embedding. The structure of the embedding space produced by this model also allows for semantic edits to text through vector arithmetic in latent space.
Model Details
Using embeddings produced by this model, we can semantically interpolate between pieces of text and edit sentences using their latent attributes like length, tone, structure, or topic.
All Bottleneck T5 models are trained on a filtered subset of the English Wikipedia, and perform best at encoding and decoding encyclopedic and other similar kinds of text. Text that's heavily technical, conversational, or otherwise unconventional may be out of distribution for the model, and the model may not perform as well on such inputs.
Bottleneck T5 embeddings are always normalized to unit length: the encoder produces embeddings with a norm of 1, and any input to the decoder is normalized to a norm of 1 before decoding.
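To make the vector arithmetic concrete, here is a minimal sketch of interpolating between two embeddings and shifting one along an attribute direction, renormalizing after each step. It uses plain Python lists and toy 3-dimensional vectors rather than real model outputs. The helper names (`normalize`, `interpolate`, `shift`) are hypothetical, and it assumes "length 1" means unit L2 norm.

```python
import math
from typing import List, Sequence

Vector = List[float]


def normalize(v: Sequence[float]) -> Vector:
    """Scale v to unit L2 norm (assumed meaning of 'length 1')."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def interpolate(a: Sequence[float], b: Sequence[float], t: float) -> Vector:
    """Linear blend of a and b at fraction t, projected back to unit length."""
    return normalize([(1.0 - t) * x + t * y for x, y in zip(a, b)])


def shift(v: Sequence[float], direction: Sequence[float], scale: float) -> Vector:
    """Move v along an attribute direction, then renormalize."""
    return normalize([x + scale * d for x, d in zip(v, direction)])


# Toy 3-dimensional stand-ins for real embeddings (hypothetical values).
a = normalize([1.0, 0.0, 0.0])
b = normalize([0.0, 1.0, 0.0])
midpoint = interpolate(a, b, 0.5)
edited = shift(a, direction=[0.0, 0.0, 1.0], scale=0.5)
```

The same arithmetic should carry over to the tensors the model produces, for example with `torch.nn.functional.normalize` in place of the list-based `normalize`. How an attribute direction is obtained is not covered here; one plausible approach is the difference between embeddings of texts with and without the attribute.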
- Developed by: Linus Lee
- Model type: T5-style encoder-decoder transformer with an attention pooled bottleneck and gated cross-attention
- Language(s) (NLP): English
- License: MIT
- Finetuned from model: LM-adapted T5 v1.1
Using the model
The model is currently in a prototype state implemented on top of the T5 language model, so we need a small wrapper class around it to use it for embedding and generating text:
import os
import torch
import torch.nn as nn
import torch.nn.functional as F
from tqdm import tqdm
from transformers import AutoTokenizer, AutoModelForCausalLM

class BottleneckT5Autoencoder:
    def __init__(self, model_path: str, device='cpu'):
        self.device = device
        self.tokenizer = AutoTokenizer.from_pretrained(model_path, model_max_length=512)
        # trust_remote_code=True loads the custom model code shipped with the checkpoint.
        self.model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True).to(self.device)
        self.model.eval()

    @torch.no_grad()
    def embed(self, text: str) -> torch.FloatTensor:
        inputs = self.tokenizer(text, return_tensors='pt').to(self.device)
        # The decoder input is an empty string; only the encoder output is used.
        decoder_inputs = self.tokenizer('', return_tensors='pt').to(self.device)
        return self.model(
            **inputs,
            decoder_input_ids=decoder_inputs['input_ids'],
            encode_only=True,
        )[0]

    @torch.no_grad()
    def generate_from_latent(self, latent: torch.FloatTensor, max_length=512, temperature=1.0) -> str:
        # Encode a dummy input, then set the offset from its embedding to the
        # target latent on the model before generating.
        dummy_text = '.'
        dummy = self.embed(dummy_text)
        perturb_vector = latent - dummy
        self.model.perturb_vector = perturb_vector
        input_ids = self.tokenizer(dummy_text, return_tensors='pt').to(self.device).input_ids
        # Nucleus sampling, so repeated calls can return different reconstructions.
        output = self.model.generate(
            input_ids=input_ids,
            max_length=max_length,
            do_sample=True,
            temperature=temperature,
            top_p=0.9,
            num_return_sequences=1,
        )
        return self.tokenizer.decode(output[0], skip_special_tokens=True)
Then we can initialize this autoencoder class from a model checkpoint.
device = 'cuda' if torch.cuda.is_available() else 'cpu'
autoencoder = BottleneckT5Autoencoder(model_path='thesephist/contra-bottleneck-t5-large-wikipedia', device=device)
Embed and un-embed text with .embed(text: str) and .generate_from_latent(embedding: torch.FloatTensor).
texts = [
    'The quick brown fox jumps over the lazy dog',
    'Hi there! My name is Linus, and I spend a lot of my time thinking about latent spaces of neural network models.',
    'Notion is a single space where you can think, write, and plan. Capture thoughts, manage projects, or even run an entire company — and do it exactly the way you want.',
]

for t in texts:
    embedding = autoencoder.embed(t)
    reconstruction = autoencoder.generate_from_latent(embedding)
    print(reconstruction)
produces the text:
The quick brown fox jumps over the lazy dog
I'm named after Linus, and I spend a lot of my time thinking about neural networks of latent space models.
Notion is a single place where you can think, plan, and spend time. Capture ideas, manage projects, and even do your own writing — or plan it exactly the way you want.
For more examples on how to use the model to compute interpolations and semantic edits with Contra, see this Google Colab notebook.
Training Details
Contra was initialized from the language modeling-adapted T5 v1.1 checkpoint and trained on a subset of the English Wikipedia dataset filtered for length, for a single epoch, as a denoising autoencoder with 30% of tokens randomly masked, using the Adafactor optimizer.
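As a rough illustration of the denoising setup, the sketch below masks about 30% of the tokens in an input at random, so that the model would be asked to reconstruct the original from the corrupted version. This is not the actual training code: the `mask_tokens` helper, the `<mask>` placeholder, and whitespace tokenization are all assumptions made for the example, and the real masking scheme may differ.

```python
import random
from typing import List, Optional


def mask_tokens(
    tokens: List[str],
    mask_rate: float = 0.3,
    mask_token: str = "<mask>",  # hypothetical placeholder, not a real T5 token
    seed: Optional[int] = None,
) -> List[str]:
    """Replace a random mask_rate fraction of tokens with mask_token."""
    rng = random.Random(seed)
    n_masked = round(len(tokens) * mask_rate)
    masked_positions = set(rng.sample(range(len(tokens)), n_masked))
    return [
        mask_token if i in masked_positions else tok
        for i, tok in enumerate(tokens)
    ]


# Whitespace splitting stands in for the real tokenizer.
tokens = "the quick brown fox jumps over the lazy dog again".split()
noisy = mask_tokens(tokens, seed=0)
# A training pair would then be (noisy input, original tokens as target).
```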
Model family and checkpoints
I recommend experimenting first with thesephist/contra-bottleneck-t5-large-wikipedia, which strikes a good balance between model size and output quality, but I've trained four variants ranging from 60M to 3B parameters:
- thesephist/contra-bottleneck-t5-small-wikipedia: 60M params, 512 embedding dimensions
- thesephist/contra-bottleneck-t5-base-wikipedia: 220M params, 768 embedding dimensions
- thesephist/contra-bottleneck-t5-large-wikipedia: 770M params, 1024 embedding dimensions
- thesephist/contra-bottleneck-t5-xl-wikipedia: 3B params, 2048 embedding dimensions
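Because each checkpoint has a different embedding size, a latent produced by one variant cannot be fed directly to another. A small lookup like the one below, with sizes copied from the list above, can catch such mismatches early. The names `EMBEDDING_DIMS` and `check_latent_size` are hypothetical helpers, not part of the model code.

```python
# Embedding sizes copied from the checkpoint list above.
EMBEDDING_DIMS = {
    "thesephist/contra-bottleneck-t5-small-wikipedia": 512,
    "thesephist/contra-bottleneck-t5-base-wikipedia": 768,
    "thesephist/contra-bottleneck-t5-large-wikipedia": 1024,
    "thesephist/contra-bottleneck-t5-xl-wikipedia": 2048,
}


def check_latent_size(model_path: str, latent_size: int) -> None:
    """Raise ValueError if latent_size does not match the checkpoint's embedding size."""
    expected = EMBEDDING_DIMS[model_path]
    if latent_size != expected:
        raise ValueError(
            f"{model_path} expects {expected}-dimensional latents, got {latent_size}"
        )


# Passes silently: the large checkpoint uses 1024-dimensional embeddings.
check_latent_size("thesephist/contra-bottleneck-t5-large-wikipedia", 1024)
```

With a real embedding tensor, the size to check would be its last dimension, for example `embedding.shape[-1]`.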