Linq-Embed-Mistral

Linq-Embed-Mistral is built on the foundations of the E5-mistral-7b-instruct and Mistral-7B-v0.1 models. We focus on improving text retrieval through advanced data refinement methods, including sophisticated data crafting, data filtering, and negative mining guided by teacher models, each highly tailored to its task, to improve the quality of LLM-generated synthetic data. These methods are applied both to existing benchmark datasets and to highly tailored synthetic datasets generated via LLMs. Our efforts primarily aim to create high-quality triplet datasets (query, positive example, negative example), significantly improving text retrieval performance; an illustrative triplet is sketched below.
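
For illustration only, a single triplet could be represented as follows. The field names and contents here are assumptions made for this sketch, not the released data format.

# Hypothetical (query, positive, negative) triplet, for illustration only.
# The field names below are assumptions, not the actual dataset schema.
triplet = {
    "query": "Who invented Hangul?",
    "positive": "Hangul was personally created and promulgated by Sejong the Great, "
                "the fourth king of the Joseon dynasty.",
    "negative": "Hangul Day is a national commemorative day in South Korea, "
                "observed on October 9.",
}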

Linq-Embed-Mistral performs well on the MTEB benchmarks (as of May 29, 2024). The model excels in retrieval tasks, ranking 1st among all models listed on the MTEB leaderboard with a retrieval score of 60.2. This outstanding performance underscores its capability to enhance search precision and reliability. The model achieves an average score of 68.2 across the 56 MTEB datasets, making it the highest-ranking publicly available model and third overall. (Please note that NV-Embed-v1 and voyage-large-2-instruct, ranked 1st and 2nd on the leaderboard as of May 29, reported their performance without releasing their models.)

This project is for research purposes only. Third-party datasets may be subject to additional terms and conditions under their associated licenses. Please refer to the specific papers and dataset licenses for more details.

For more details, refer to the Linq AI Research blog post (https://getlinq.com/blog/linq-embed-mistral/) and the accompanying report.

How to use

Here is an example of how to encode queries and passages from the Mr.TyDi training dataset, using either Sentence Transformers or Transformers directly.

Sentence Transformers

from sentence_transformers import SentenceTransformer

# Load the model
model = SentenceTransformer("Linq-AI-Research/Linq-Embed-Mistral")

# Each query must come with a one-sentence instruction that describes the task
task = 'Given a question, retrieve Wikipedia passages that answer the question'
prompt = f"Instruct: {task}\nQuery: "
queries = [
    "졜초의 μ›μžλ ₯ λ°œμ „μ†ŒλŠ” 무엇인가?",
    "Who invented Hangul?"
]
passages = [
    "ν˜„μž¬ μ‚¬μš©λ˜λŠ” ν•΅λΆ„μ—΄ 방식을 μ΄μš©ν•œ μ „λ ₯생산은 1948λ…„ 9μ›” λ―Έκ΅­ ν…Œλ„€μ‹œμ£Ό μ˜€ν¬λ¦¬μ§€μ— μ„€μΉ˜λœ X-10 ν‘μ—°μ›μžλ‘œμ—μ„œ μ „κ΅¬μ˜ λΆˆμ„ λ°νžˆλŠ” 데 μ‚¬μš©λ˜λ©΄μ„œ μ‹œμž‘λ˜μ—ˆλ‹€. 그리고 1954λ…„ 6월에 κ΅¬μ†Œλ ¨μ˜ μ˜€λΈŒλ‹ŒμŠ€ν¬μ— κ±΄μ„€λœ 흑연감속 λΉ„λ“±κ²½μˆ˜ μ••λ ₯κ΄€ν˜• μ›μžλ‘œλ₯Ό μ‚¬μš©ν•œ μ˜€λΈŒλ‹ŒμŠ€ν¬ μ›μžλ ₯ λ°œμ „μ†Œκ°€ μ‹œν—˜μ μœΌλ‘œ μ „λ ₯생산을 μ‹œμž‘ν•˜μ˜€κ³ , 졜초의 μƒμ—…μš© μ›μžλ ₯ μ—‰λ”μ΄λ‘œλ₯Ό μ‚¬μš©ν•œ 영ꡭ μ…€λΌν•„λ“œ μ›μžλ ₯ 단지에 μœ„μΉ˜ν•œ μ½œλ” 홀(Calder Hall) μ›μžλ ₯ λ°œμ „μ†Œλ‘œ, 1956λ…„ 10μ›” 17일 상업 μš΄μ „μ„ μ‹œμž‘ν•˜μ˜€λ‹€.",
    "Hangul was personally created and promulgated by the fourth king of the Joseon dynasty, Sejong the Great.[1][2] Sejong's scholarly institute, the Hall of Worthies, is often credited with the work, and at least one of its scholars was heavily involved in its creation, but it appears to have also been a personal project of Sejong."
]

# Encode the queries and passages. We only use the prompt for the queries
query_embeddings = model.encode(queries, prompt=prompt)
passage_embeddings = model.encode(passages)

# Compute the (cosine) similarity scores
scores = model.similarity(query_embeddings, passage_embeddings) * 100
print(scores.tolist())
# [[73.72908782958984, 30.122787475585938], [29.15508460998535, 79.25375366210938]]
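
As a small usage follow-up (a sketch, not part of the original card), the score matrix above can be used to pick the best-matching passage for each query:

# Rank passages per query using the similarity scores computed above.
# Assumes `queries` and `scores` from the previous snippet are in scope.
best = scores.argmax(dim=1)
for query, idx in zip(queries, best.tolist()):
    print(f"Best passage for {query!r}: index {idx}")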

Transformers

import torch
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel

def last_token_pool(last_hidden_states: Tensor,
                    attention_mask: Tensor) -> Tensor:
    # Pool each sequence by taking the hidden state of its last non-padding token.
    left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
    if left_padding:
        # With left padding, the last position is the last real token for every sequence.
        return last_hidden_states[:, -1]
    else:
        # With right padding, index each sequence at its own final token.
        sequence_lengths = attention_mask.sum(dim=1) - 1
        batch_size = last_hidden_states.shape[0]
        return last_hidden_states[torch.arange(batch_size, device=last_hidden_states.device), sequence_lengths]

def get_detailed_instruct(task_description: str, query: str) -> str:
    return f'Instruct: {task_description}\nQuery: {query}'

# Each query must come with a one-sentence instruction that describes the task
task = 'Given a question, retrieve Wikipedia passages that answer the question'
queries = [
    get_detailed_instruct(task, '졜초의 μ›μžλ ₯ λ°œμ „μ†ŒλŠ” 무엇인가?'),
    get_detailed_instruct(task, 'Who invented Hangul?')
]
# No need to add instruction for retrieval documents
passages = [
    "ν˜„μž¬ μ‚¬μš©λ˜λŠ” ν•΅λΆ„μ—΄ 방식을 μ΄μš©ν•œ μ „λ ₯생산은 1948λ…„ 9μ›” λ―Έκ΅­ ν…Œλ„€μ‹œμ£Ό μ˜€ν¬λ¦¬μ§€μ— μ„€μΉ˜λœ X-10 ν‘μ—°μ›μžλ‘œμ—μ„œ μ „κ΅¬μ˜ λΆˆμ„ λ°νžˆλŠ” 데 μ‚¬μš©λ˜λ©΄μ„œ μ‹œμž‘λ˜μ—ˆλ‹€. 그리고 1954λ…„ 6월에 κ΅¬μ†Œλ ¨μ˜ μ˜€λΈŒλ‹ŒμŠ€ν¬μ— κ±΄μ„€λœ 흑연감속 λΉ„λ“±κ²½μˆ˜ μ••λ ₯κ΄€ν˜• μ›μžλ‘œλ₯Ό μ‚¬μš©ν•œ μ˜€λΈŒλ‹ŒμŠ€ν¬ μ›μžλ ₯ λ°œμ „μ†Œκ°€ μ‹œν—˜μ μœΌλ‘œ μ „λ ₯생산을 μ‹œμž‘ν•˜μ˜€κ³ , 졜초의 μƒμ—…μš© μ›μžλ ₯ μ—‰λ”μ΄λ‘œλ₯Ό μ‚¬μš©ν•œ 영ꡭ μ…€λΌν•„λ“œ μ›μžλ ₯ 단지에 μœ„μΉ˜ν•œ μ½œλ” 홀(Calder Hall) μ›μžλ ₯ λ°œμ „μ†Œλ‘œ, 1956λ…„ 10μ›” 17일 상업 μš΄μ „μ„ μ‹œμž‘ν•˜μ˜€λ‹€.",
    "Hangul was personally created and promulgated by the fourth king of the Joseon dynasty, Sejong the Great.[1][2] Sejong's scholarly institute, the Hall of Worthies, is often credited with the work, and at least one of its scholars was heavily involved in its creation, but it appears to have also been a personal project of Sejong."
]

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('Linq-AI-Research/Linq-Embed-Mistral')
model = AutoModel.from_pretrained('Linq-AI-Research/Linq-Embed-Mistral')

max_length = 4096
input_texts = [*queries, *passages]
# Tokenize the input texts
batch_dict = tokenizer(input_texts, max_length=max_length, padding=True, truncation=True, return_tensors="pt")
outputs = model(**batch_dict)
embeddings = last_token_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

# Normalize embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:2] @ embeddings[2:].T) * 100
print(scores.tolist())
# [[73.72909545898438, 30.122783660888672], [29.155078887939453, 79.25374603271484]]
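
Since the model has roughly 7B parameters, you may want to load it in half precision on a GPU. The following is a minimal sketch assuming a CUDA device with enough memory is available; it reuses the tokenizer, batch_dict, and pooling function from the snippet above.

# Optional: load the weights in fp16 and run the forward pass on GPU.
# This sketch assumes a CUDA device is available; adapt to your hardware.
model = AutoModel.from_pretrained(
    "Linq-AI-Research/Linq-Embed-Mistral",
    torch_dtype=torch.float16,
).to("cuda")

batch_dict = {k: v.to("cuda") for k, v in batch_dict.items()}
with torch.no_grad():
    outputs = model(**batch_dict)
embeddings = last_token_pool(outputs.last_hidden_state, batch_dict["attention_mask"])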

MTEB Benchmark Evaluation

Check out unilm/e5 to reproduce the evaluation results on the BEIR and MTEB benchmarks. A minimal alternative using the open-source mteb package is sketched below.
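
This sketch assumes the mteb and sentence-transformers packages are installed; the single task chosen here is only a smoke test, not the exact setup behind the reported numbers.

from mteb import MTEB
from sentence_transformers import SentenceTransformer

# Load the model and run one retrieval task as a quick check.
# Note: the reported MTEB scores use task-specific query instructions,
# which this bare setup does not add automatically.
model = SentenceTransformer("Linq-AI-Research/Linq-Embed-Mistral")
evaluation = MTEB(tasks=["SciFact"])
evaluation.run(model, output_folder="results/Linq-Embed-Mistral")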

Evaluation Result

MTEB (as of May 29, 2024)

| Model Name | Retrieval (15 datasets) | Average (56 datasets) |
| --- | --- | --- |
| Linq-Embed-Mistral | 60.2 | 68.2 |
| NV-Embed-v1 | 59.4 | 69.3 |
| SFR-Embedding-Mistral | 59.0 | 67.6 |
| voyage-large-2-instruct | 58.3 | 68.3 |
| GritLM-7B | 57.4 | 66.8 |
| voyage-lite-02-instruct | 56.6 | 67.1 |
| gte-Qwen1.5-7B-instruct | 56.2 | 67.3 |
| e5-mistral-7b-instruct | 56.9 | 66.6 |
| google-gecko.text-embedding-preview-0409 | 55.7 | 66.3 |
| text-embedding-3-large | 55.4 | 64.6 |
| Cohere-embed-english-v3.0 | 55.0 | 64.5 |

Linq Research Team.

Citation

@misc{LinqAIResearch2024,
  title={Linq-Embed-Mistral: Elevating Text Retrieval with Improved GPT Data Through Task-Specific Control and Quality Refinement},
  author={Junseong Kim and Seolhwa Lee and Jihoon Kwon and Sangmo Gu and Yejin Kim and Minkyung Cho and Jy-yong Sohn and Chanyeol Choi},
  howpublished={Linq AI Research Blog},
  year={2024},
  url={https://getlinq.com/blog/linq-embed-mistral/}
}