---
license: apache-2.0
language:
- en
tags:
- Phrase Representation
- String Matching
- Fuzzy Join
- Entity Retrieval
- transformers
- sentence-transformers
---

## PEARL-small

[Learning High-Quality and General-Purpose Phrase Representations](https://arxiv.org/pdf/2401.10407.pdf).
[Lihu Chen](https://chenlihu.com), [Gaël Varoquaux](https://gael-varoquaux.info/), [Fabian M. Suchanek](https://suchanek.name/). Accepted by EACL Findings 2024.
PEARL-small is a lightweight string embedding model. It is designed for computing semantic similarity between strings, and its embeddings are suited to string matching, entity retrieval, entity clustering, and fuzzy joins.
It differs from typical sentence embedders because it incorporates phrase type information and morphological features, allowing it to better capture variations in strings. The model is a variant of [E5-small](https://huggingface.co/intfloat/e5-small-v2), fine-tuned on a context-free [dataset](https://zenodo.org/records/10676475) that we constructed, to yield better representations for phrases and strings.
🤗 [PEARL-small](https://huggingface.co/Lihuchen/pearl_small) 🤗 [PEARL-base](https://huggingface.co/Lihuchen/pearl_base)
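To make the fuzzy-join use case mentioned above concrete, here is a minimal sketch of matching each string in one list to its nearest neighbour in another by cosine similarity. The names `fuzzy_join`, `toy_embed`, and the `threshold` parameter are hypothetical and not part of PEARL. `toy_embed` is only a character-trigram stand-in so that the snippet runs offline; in practice one would presumably pass `SentenceTransformer("Lihuchen/pearl_small").encode` as the `embed` argument.

```python
import numpy as np


def fuzzy_join(left, right, embed, threshold=0.8):
    """Match each string in `left` to its most similar string in `right`.

    `embed` is any callable mapping a list of strings to a 2-D array of
    embeddings (a hypothetical interface chosen for this sketch).
    Returns a list of (left_string, matched_right_string_or_None, score).
    """
    a = np.asarray(embed(left), dtype=float)
    b = np.asarray(embed(right), dtype=float)
    # L2-normalize rows so that the dot product equals cosine similarity.
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    sims = a @ b.T
    best = sims.argmax(axis=1)
    matches = []
    for i, j in enumerate(best):
        score = float(sims[i, j])
        # Keep the best candidate only if it clears the similarity threshold.
        matches.append((left[i], right[j] if score >= threshold else None, score))
    return matches


def toy_embed(texts, dim=64):
    """Stand-in embedder (bucketed character-trigram counts), NOT PEARL."""
    out = np.zeros((len(texts), dim))
    for row, text in enumerate(texts):
        padded = f"  {text.lower()} "
        for k in range(len(padded) - 2):
            # Deterministic bucket index derived from the trigram's code points.
            out[row, sum(map(ord, padded[k:k + 3])) % dim] += 1.0
    return out


pairs = fuzzy_join(["New York Times"], ["Boston Globe", "new york times"], toy_embed)
print(pairs)
```

The threshold is an assumption that would need tuning per dataset; cosine scores from a real embedding model are distributed differently from those of the toy embedder.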
| Model | Size | Avg | PPDB | PPDB filtered | Turney | BIRD | YAGO | UMLS | CoNLL | BC5CDR | AutoFJ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| FastText | - | 40.3 | 94.4 | 61.2 | 59.6 | 58.9 | 16.9 | 14.5 | 3.0 | 0.2 | 53.6 |
| Sentence-BERT | 110M | 50.1 | 94.6 | 66.8 | 50.4 | 62.6 | 21.6 | 23.6 | 25.5 | 48.4 | 57.2 |
| Phrase-BERT | 110M | 54.5 | 96.8 | 68.7 | 57.2 | 68.8 | 23.7 | 26.1 | 35.4 | 59.5 | 66.9 |
| E5-small | 34M | 57.0 | 96.0 | 56.8 | 55.9 | 63.1 | 43.3 | 42.0 | 27.6 | 53.7 | 74.8 |
| E5-base | 110M | 61.1 | 95.4 | 65.6 | 59.4 | 66.3 | 47.3 | 44.0 | 32.0 | 69.3 | 76.1 |
| PEARL-small | 34M | 62.5 | 97.0 | 70.2 | 57.9 | 68.1 | 48.1 | 44.5 | 42.4 | 59.3 | 75.2 |
| PEARL-base | 110M | 64.8 | 97.3 | 72.2 | 59.7 | 72.6 | 50.7 | 45.8 | 39.3 | 69.4 | 77.1 |

Cost comparison of FastText and PEARL. Estimated memory is computed from the number of parameters, assuming float16 weights. The unit of inference speed is `*ms/512 samples`. The FastText model here is `crawl-300d-2M-subword.bin`.

| Model | Avg Score | Estimated Memory | Speed GPU | Speed CPU |
|-|-|-|-|-|
| FastText | 40.3 | 1200MB | - | 57ms |
| PEARL-small | 62.5 | 68MB | 42ms | 446ms |
| PEARL-base | 64.8 | 220MB | 89ms | 1394ms |

## Usage

### Sentence Transformers

PEARL is integrated with the Sentence Transformers library and can be used like so:

```python
from sentence_transformers import SentenceTransformer, util

query_texts = ["The New York Times"]
doc_texts = ["NYTimes", "New York Post", "New York"]
input_texts = query_texts + doc_texts

model = SentenceTransformer("Lihuchen/pearl_small")
embeddings = model.encode(input_texts)
scores = util.cos_sim(embeddings[0], embeddings[1:]) * 100
print(scores.tolist())
# [[90.56318664550781, 79.65763854980469, 75.52056121826172]]
```

### Transformers

PEARL can also be used directly with `transformers`. Below is an entity retrieval example that reuses the code from E5.
```python
import torch.nn.functional as F

from torch import Tensor
from transformers import AutoTokenizer, AutoModel


def average_pool(last_hidden_states: Tensor,
                 attention_mask: Tensor) -> Tensor:
    # Zero out padding positions, then average over the real tokens only
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]


def encode_text(model, input_texts):
    # Tokenize the input texts (uses the module-level `tokenizer` defined below)
    batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt')

    outputs = model(**batch_dict)
    embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

    return embeddings


query_texts = ["The New York Times"]
doc_texts = ["NYTimes", "New York Post", "New York"]
input_texts = query_texts + doc_texts

tokenizer = AutoTokenizer.from_pretrained('Lihuchen/pearl_small')
model = AutoModel.from_pretrained('Lihuchen/pearl_small')

# encode
embeddings = encode_text(model, input_texts)

# calculate similarity
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:1] @ embeddings[1:].T) * 100
print(scores.tolist())

# expected outputs
# [[90.56318664550781, 79.65763854980469, 75.52054595947266]]
```

## Training and Evaluation

Have a look at our code on [GitHub](https://github.com/tigerchen52/PEARL).

## Citation

If you find our work useful, please give us a citation:

```
@article{chen2024learning,
  title={Learning High-Quality and General-Purpose Phrase Representations},
  author={Chen, Lihu and Varoquaux, Ga{\"e}l and Suchanek, Fabian M},
  journal={arXiv preprint arXiv:2401.10407},
  year={2024}
}
```