
ember-v1

This model is trained on a large-scale corpus of relevance text pairs covering a wide range of domains, including finance, science, medicine, law, and others. During training we applied techniques from the RetroMAE and SetFit papers.

We also provide the model as an API service on our own platform; feel free to sign up at LLMRails.

Plans

  • A paper will be published soon
  • v2 is on its way with a 4k maximum sequence length

Usage

Use with API request:

curl --location 'https://api.llmrails.com/v1/embeddings' \
--header 'X-API-KEY: {token}' \
--header 'Content-Type: application/json' \
--data '{
   "input": ["This is an example sentence"],
   "model": "embedding-english-v1"
}'

The model name `embedding-english-v1` is the API alias for ember-v1.

API docs: https://docs.llmrails.com/embedding/embed-text
Langchain plugin: https://python.langchain.com/docs/integrations/text_embedding/llm_rails
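
For programmatic access, here is a minimal Python sketch of the same request using the `requests` library. The endpoint and payload mirror the curl example above; `YOUR_API_KEY` is a placeholder for your LLMRails key:

```python
import requests

# Same request as the curl example above; YOUR_API_KEY is a placeholder.
response = requests.post(
    "https://api.llmrails.com/v1/embeddings",
    headers={
        "X-API-KEY": "YOUR_API_KEY",
        "Content-Type": "application/json",
    },
    json={
        "input": ["This is an example sentence"],
        "model": "embedding-english-v1",  # API alias for ember-v1
    },
)
response.raise_for_status()
print(response.json())
```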

Use with transformers:

import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel

def average_pool(last_hidden_states: Tensor,
                 attention_mask: Tensor) -> Tensor:
    # Mean-pool the token embeddings, zeroing out padding positions first
    # so they do not contribute to the average.
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]

input_texts = [
    "This is an example sentence",
    "Each sentence is converted"
]

tokenizer = AutoTokenizer.from_pretrained("llmrails/ember-v1")
model = AutoModel.from_pretrained("llmrails/ember-v1")

# Tokenize the input texts
batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt')

outputs = model(**batch_dict)
embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

# (Optionally) normalize embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:1] @ embeddings[1:].T) * 100
print(scores.tolist())
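
Because the embeddings are L2-normalized, the matrix product above is the cosine similarity between the first sentence and the remaining ones, scaled by 100 for readability. For pure inference you may also want to run the forward pass under `torch.no_grad()` to avoid tracking gradients.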

Use with sentence-transformers:

from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

sentences = [
    "This is an example sentence",
    "Each sentence is converted"
]

model = SentenceTransformer('llmrails/ember-v1')
embeddings = model.encode(sentences)
print(cos_sim(embeddings[0], embeddings[1]))
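
Beyond pairwise similarity, the same model plugs into retrieval-style workflows. Below is a minimal sketch using `sentence_transformers.util.semantic_search`; the corpus and query texts are illustrative only:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("llmrails/ember-v1")

# Illustrative corpus and query; any English texts up to 512 tokens work.
corpus = [
    "The cat sits on the mat",
    "Stock markets fell sharply on Monday",
    "The patient was prescribed antibiotics",
]
query = "financial news"

corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embedding = model.encode(query, convert_to_tensor=True)

# Returns the top_k most similar corpus entries for each query.
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)
for hit in hits[0]:
    print(corpus[hit["corpus_id"]], hit["score"])
```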

Massive Text Embedding Benchmark (MTEB) Evaluation

Our model achieves performance competitive with the state of the art on the MTEB leaderboard:

| Model Name | Dimension | Sequence Length | Average (56 datasets) |
|---|---|---|---|
| bge-large-en-v1.5 | 1024 | 512 | 64.23 |
| bge-base-en-v1.5 | 768 | 512 | 63.55 |
| ember-v1 | 1024 | 512 | 63.54 |
| text-embedding-ada-002 | 1536 | 8191 | 60.99 |

Limitations

This model supports English texts only, and long inputs are truncated to a maximum of 512 tokens.
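
If you need to embed documents longer than 512 tokens, one common workaround (not part of this model card) is to split the text into chunks, embed each chunk, and average the results. A minimal sketch, assuming the sentence-transformers setup above; the whitespace chunking and the `embed_long_text` helper are illustrative:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("llmrails/ember-v1")

def embed_long_text(text: str, chunk_size: int = 256) -> np.ndarray:
    # Naive whitespace chunking; a production pipeline would chunk by tokens.
    words = text.split()
    chunks = [" ".join(words[i:i + chunk_size])
              for i in range(0, len(words), chunk_size)]
    chunk_embeddings = model.encode(chunks, normalize_embeddings=True)
    # Average the chunk embeddings and re-normalize to unit length.
    mean = chunk_embeddings.mean(axis=0)
    return mean / np.linalg.norm(mean)
```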