metadata
license: apache-2.0
language:
  - en
tags:
  - Phrase Representation
  - String Matching
  - Fuzzy Join
  - Entity Retrieval
  - transformers
  - sentence-transformers

🦪⚪ PEARL-small

Learning High-Quality and General-Purpose Phrase Representations.
Lihu Chen, Gaël Varoquaux, Fabian M. Suchanek. Accepted to Findings of EACL 2024.

PEARL-small is a lightweight string embedding model. It is designed for computing semantic similarity between strings and produces embeddings suited to string matching, entity retrieval, entity clustering, and fuzzy joins.
It differs from typical sentence embedders in that it incorporates phrase type information and morphological features, which helps it capture variations in strings. The model is a variant of E5-small fine-tuned on our constructed context-free dataset to yield better representations for phrases and strings.

🤗 PEARL-small 🤗 PEARL-base 📐 PEARL Benchmark 🏆 PEARL Leaderboard

| Model | Size | Avg | PPDB | PPDB filtered | Turney | BIRD | YAGO | UMLS | CoNLL | BC5CDR | AutoFJ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| FastText | - | 40.3 | 94.4 | 61.2 | 59.6 | 58.9 | 16.9 | 14.5 | 3.0 | 0.2 | 53.6 |
| Sentence-BERT | 110M | 50.1 | 94.6 | 66.8 | 50.4 | 62.6 | 21.6 | 23.6 | 25.5 | 48.4 | 57.2 |
| Phrase-BERT | 110M | 54.5 | 96.8 | 68.7 | 57.2 | 68.8 | 23.7 | 26.1 | 35.4 | 59.5 | 66.9 |
| E5-small | 34M | 57.0 | 96.0 | 56.8 | 55.9 | 63.1 | 43.3 | 42.0 | 27.6 | 53.7 | 74.8 |
| E5-base | 110M | 61.1 | 95.4 | 65.6 | 59.4 | 66.3 | 47.3 | 44.0 | 32.0 | 69.3 | 76.1 |
| PEARL-small | 34M | 62.5 | 97.0 | 70.2 | 57.9 | 68.1 | 48.1 | 44.5 | 42.4 | 59.3 | 75.2 |
| PEARL-base | 110M | 64.8 | 97.3 | 72.2 | 59.7 | 72.6 | 50.7 | 45.8 | 39.3 | 69.4 | 77.1 |

Cost comparison of FastText and PEARL. Estimated memory is computed from the number of parameters, assuming float16 weights (2 bytes per parameter). Inference speed is reported in ms per 512 samples. The FastText model used here is crawl-300d-2M-subword.bin.

| Model | Avg Score | Estimated Memory | Speed GPU | Speed CPU |
|---|---|---|---|---|
| FastText | 40.3 | 1200MB | - | 57ms |
| PEARL-small | 62.5 | 68MB | 42ms | 446ms |
| PEARL-base | 64.8 | 220MB | 89ms | 1394ms |
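
The PEARL memory figures can be reproduced with a quick back-of-the-envelope calculation: parameter count times two bytes per float16 weight. The snippet below is a minimal sketch under a few assumptions: the helper name `estimated_memory_mb` is made up for illustration, it uses the rounded parameter counts from the results table, it counts decimal megabytes, and it ignores activations and framework overhead.

```python
BYTES_PER_FLOAT16 = 2  # a float16 value occupies 2 bytes


def estimated_memory_mb(num_parameters: int,
                        bytes_per_param: int = BYTES_PER_FLOAT16) -> float:
    """Weights-only memory estimate in decimal megabytes (1 MB = 10**6 bytes)."""
    return num_parameters * bytes_per_param / 10**6


# Rounded parameter counts as listed in the results table (assumption: 34M and 110M exactly).
for name, params in [("PEARL-small", 34_000_000), ("PEARL-base", 110_000_000)]:
    print(f"{name}: {estimated_memory_mb(params):.0f}MB")
# PEARL-small: 68MB
# PEARL-base: 220MB
```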

Usage

Sentence Transformers

PEARL is integrated with the Sentence Transformers library (thanks to Tom Aarsen for the contribution) and can be used as follows:

from sentence_transformers import SentenceTransformer, util

query_texts = ["The New York Times"]
doc_texts = [ "NYTimes", "New York Post", "New York"]
input_texts = query_texts + doc_texts

model = SentenceTransformer("Lihuchen/pearl_small")
embeddings = model.encode(input_texts)
scores = util.cos_sim(embeddings[0], embeddings[1:]) * 100
print(scores.tolist())
# [[90.56318664550781, 79.65763854980469, 75.52056121826172]]
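
Similarity scores like the ones above can drive a simple fuzzy join, where each string in one list is paired with its closest match in another list if the score clears a threshold. The following is a rough sketch, not part of the PEARL codebase: `fuzzy_join`, `cosine`, and the threshold of 80 are hypothetical choices for illustration, and `encode` stands for any function that maps a list of strings to a list of vectors.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two vectors; returns 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def fuzzy_join(left: List[str],
               right: List[str],
               encode: Callable[[List[str]], Sequence[Vector]],
               threshold: float = 80.0) -> List[Tuple[str, Optional[str], float]]:
    """Pair each left string with its best right match, or None below the threshold.

    Scores are cosine similarities scaled by 100, as in the example above.
    """
    if not right:
        return [(name, None, 0.0) for name in left]
    left_vecs = encode(left)
    right_vecs = encode(right)
    matches = []
    for name, vec in zip(left, left_vecs):
        scores = [cosine(vec, r) * 100 for r in right_vecs]
        best = max(range(len(right)), key=scores.__getitem__)
        match = right[best] if scores[best] >= threshold else None
        matches.append((name, match, scores[best]))
    return matches
```

With the model loaded as above, `fuzzy_join(query_texts, doc_texts, model.encode)` should pair "The New York Times" with "NYTimes", since that is the only candidate whose printed score exceeds 80. The right threshold depends on the data and is worth tuning on a labeled sample.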

Transformers

You can also use PEARL directly with the transformers library. Below is an example of entity retrieval, reusing the code from E5.

import torch
import torch.nn.functional as F

from torch import Tensor
from transformers import AutoTokenizer, AutoModel


def average_pool(last_hidden_states: Tensor,
                 attention_mask: Tensor) -> Tensor:
    # Zero out padding positions, then average over the real tokens only
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]

def encode_text(model, tokenizer, input_texts):
    # Tokenize the input texts
    batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt')

    # Inference only, so skip gradient tracking
    with torch.no_grad():
        outputs = model(**batch_dict)
    embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

    return embeddings


query_texts = ["The New York Times"]
doc_texts = ["NYTimes", "New York Post", "New York"]
input_texts = query_texts + doc_texts

tokenizer = AutoTokenizer.from_pretrained('Lihuchen/pearl_small')
model = AutoModel.from_pretrained('Lihuchen/pearl_small')

# encode
embeddings = encode_text(model, tokenizer, input_texts)

# calculate similarity
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:1] @ embeddings[1:].T) * 100
print(scores.tolist())

# expected outputs
# [[90.56318664550781, 79.65763854980469, 75.52054595947266]]
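
To make the pooling step in `average_pool` concrete, here is a dependency-free sketch of the same masked mean on a toy batch. The function name `masked_mean` and the toy numbers are illustrative only; the real code operates on PyTorch tensors and does this in two vectorized lines.

```python
from typing import List


def masked_mean(hidden: List[List[List[float]]],
                mask: List[List[int]]) -> List[List[float]]:
    """Average token vectors per sequence, skipping padding.

    hidden: [batch][tokens][dim]; mask: [batch][tokens], 1 = real token, 0 = padding.
    Like the tensor version, this assumes every sequence has at least one real token.
    """
    pooled = []
    for token_vecs, token_mask in zip(hidden, mask):
        n_real = sum(token_mask)
        summed = [0.0] * len(token_vecs[0])
        for vec, keep in zip(token_vecs, token_mask):
            if keep:  # padding positions contribute nothing to the sum
                for i, value in enumerate(vec):
                    summed[i] += value
        pooled.append([s / n_real for s in summed])
    return pooled


# Two sequences of three token slots each; the first has one padding slot.
hidden = [
    [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]],
    [[2.0, 0.0], [4.0, 2.0], [6.0, 4.0]],
]
mask = [[1, 1, 0], [1, 1, 1]]
print(masked_mean(hidden, mask))
# [[2.0, 3.0], [4.0, 2.0]]
```

The padded slot `[9.0, 9.0]` has no effect on the first result, which is why padding strings to a common length does not change their embeddings.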

Training and Evaluation

Have a look at our code on GitHub.

Citation

If you find our work useful, please cite it:

@article{chen2024learning,
  title={Learning High-Quality and General-Purpose Phrase Representations},
  author={Chen, Lihu and Varoquaux, Ga{\"e}l and Suchanek, Fabian M},
  journal={arXiv preprint arXiv:2401.10407},
  year={2024}
}