Model Card for Pixel-Linguist/Pixel-Linguist-v0

Official model checkpoint of Pixel-Linguist-v0 from paper "Pixel Sentence Representation Learning".

Model Summary

Pixel Linguist v0 is a sentence encoder that is trained to understand sentence and document-level semantics with only pixel-based textual signals. It is trained on 59 language pairs, and English unsupervised data, Wikipedia Spans, and NLI. It has strong zero-shot transferability to other existing langauges (even ancient ones) that you can think of.

Model Sources

Github Repo: https://github.com/gowitheflow-1998/Pixel-Linguist
Paper: https://arxiv.org/pdf/2402.08183.pdf

Downstream Use

Semantic Textual Similarity, Information Retrieval

Out-of-Scope Use

The model might not be optimal for further fine-tuning to do other tasks (such as classification), as it's trained to do representation tasks with similarity matching.

Training Data

All the training sets involved in our progressive training scheme that we created can be found in tags in meta data. Please refer to the paper for the exact process.

Inference

Encoding with our PixelLinguist class is very straightforward, just like using a SentenceTransformer class.

model_name = "Pixel-Linguist/Pixel-Linguist-v0"
model = PixelLinguist(model_name)

texts = ["I love you","I like you"]
embeddings = model.encode(texts)
print(outputs[0] @ outputs[1].T)  # just use dot product because the embeddings are normalized automatically in the model class.
#tensor(0.9217)

To use the PixelLinguist class: First install the package following our Github Repo. Then define our PixelLinguist Class as follow.

import torch
from PIL import Image
from pixel import (
    AutoConfig,
    PangoCairoTextRenderer,
    PIXELForSequenceClassification,
    PIXELForRepresentation,
    PoolingMode,
    get_attention_mask,
    get_transforms,
    glue_strip_spaces,
    resize_model_embeddings,
)
from tqdm import tqdm

class PixelLinguist:
    def __init__(self, model_name, batch_size = 16, max_seq_length = 64, 
                 device=None, pooling = "mean", keep_mlp = False):
        if device is not None:
            self.device = device
        else:
            self.device = "cuda:0" if torch.cuda.is_available() else "cpu"
        self.config = AutoConfig.from_pretrained(model_name, num_labels=0)
        self.batch_size = batch_size
        if keep_mlp == True:
            self.model = PIXELForSequenceClassification.from_pretrained(
                model_name,
                config=self.config,
                pooling_mode=PoolingMode.from_string(pooling),
                add_layer_norm=True
            ).to(self.device)
        else:
            self.model = PIXELForRepresentation.from_pretrained(
                model_name,
                config=self.config,
                pooling_mode=PoolingMode.from_string(pooling),
                add_layer_norm=True
            ).to(self.device)
        self.processor = PangoCairoTextRenderer.from_pretrained(model_name, rgb=False)
        self.processor.max_seq_length = max_seq_length
        resize_model_embeddings(self.model, self.processor.max_seq_length)
        self.transforms = get_transforms(do_resize=True, size=(self.processor.pixels_per_patch, self.processor.pixels_per_patch * self.processor.max_seq_length))

    def preprocess(self, texts):
        encodings = [self.processor(text=glue_strip_spaces(a)) for a in texts]
        pixel_values = torch.stack([self.transforms(Image.fromarray(e.pixel_values)) for e in encodings])
        attention_mask = torch.stack([get_attention_mask(e.num_text_patches, seq_length=self.processor.max_seq_length) for e in encodings])
        return {'pixel_values': pixel_values, 'attention_mask': attention_mask}

    def encode(self, texts, **kwargs):
        all_outputs = []
        for i in tqdm(range(0, len(texts), self.batch_size)):
            batch_texts = texts[i:i+batch_size]
            inputs = self.preprocess(batch_texts)
            inputs = {k: v.to(self.device) for k, v in inputs.items()}
            with torch.no_grad():
                outputs = self.model(**inputs).logits.detach().cpu()
            all_outputs.append(outputs)
        return torch.cat(all_outputs, dim=0)

Evaluation

For STS-benchmark evaluation (see Github repo):

python tools/evaluation_sts.py

For BEIR information retrieval evaluation (see Github repo):

python tools/evaluation_retrieval.py

BibTeX:

@article{xiao2024pixel,
  title={Pixel Sentence Representation Learning},
  author={Xiao, Chenghao and Huang, Zhuoxu and Chen, Danlu and Hudson, G Thomas and Li, Yizhi and Duan, Haoran and Lin, Chenghua and Fu, Jie and Han, Jungong and Moubayed, Noura Al},
  journal={arXiv preprint arXiv:2402.08183},
  year={2024}
}

Pixel-Linguist
/

Pixel-Linguist-v0