Law tab & Google Gecko

#90
by Muennighoff - opened
Massive Text Embedding Benchmark org • edited 30 days ago

Looks like below:

Screenshot 2024-04-04 at 10.44.43 AM.png

Screenshot 2024-04-04 at 10.44.59 AM.png

(still need to remove SONAR & jina as it's a different dataset)

Relevant GitHub PRs:
https://github.com/embeddings-benchmark/mteb/pull/311

As detailed in the PR, I think mixing domain & language tabs is only temporary; once there is a significant number of both, we can split them up into separate tab lines. Maybe we can also let people select them, similar to the nice UI by @tomaarsen in https://huggingface.co/spaces/mteb/leaderboard/discussions/89

cc @Shuang59 @KennethEnevoldsen @tomaarsen

Muennighoff changed pull request status to open
Massive Text Embedding Benchmark org

"I think mixing domain & language tabs is only temporary; once there is a significant number of both, we can split them up into separate tab lines."

That works for me.
This PR looks solid to me.

  • Tom Aarsen
Massive Text Embedding Benchmark org

Also @Shuang59, could you share the instruction you used for e5-mistral-7b-instruct? 🙂 I'd like to try GritLM-7B on it with the same instruction, which should perform slightly better.

Massive Text Embedding Benchmark org • edited 29 days ago

Hi @Muennighoff, I used the same instruction as in the original code at this link:
https://huggingface.co/intfloat/e5-mistral-7b-instruct

def get_detailed_instruct(task_description: str, query: str) -> str:
    return f'Instruct: {task_description}\nQuery: {query}'

# Same task prompt as in the original e5-mistral-7b-instruct example code
task_prompt = 'Given a web search query, retrieve relevant passages that answer the query'
batch = [get_detailed_instruct(task_prompt, q) for q in batch]

if self.engine == 'intfloat/e5-mistral-7b-instruct':
    # Tokenize without padding, append the EOS token to each sequence, then pad the batch
    all_tokens = self.tokenizer(batch, max_length=self.max_token_len - 1, return_attention_mask=False, padding=False, truncation=True)
    all_tokens['input_ids'] = [input_ids + [self.tokenizer.eos_token_id] for input_ids in all_tokens['input_ids']]
    all_tokens = self.tokenizer.pad(all_tokens, padding=True, return_attention_mask=True, return_tensors='pt')
elif self.engine == 'Salesforce/SFR-Embedding-Mistral':
    all_tokens = self.tokenizer(batch, max_length=self.max_token_len, padding=True, truncation=True, return_tensors="pt")

outputs = self.model(**all_tokens)
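
For the GritLM-7B comparison, a minimal sketch of how the same task prompt could be passed in is below. It assumes the gritlm package and the <|user|>/<|embed|> embedding template from the GritLM README; the example query/passage strings are placeholders, not part of the evaluation here.

# Hypothetical sketch: embed queries with GritLM-7B using the same task prompt.
# Assumes `pip install gritlm` and the GritLM wrapper from the GritLM README.
from gritlm import GritLM

model = GritLM('GritLM/GritLM-7B', torch_dtype='auto')

def gritlm_instruction(instruction: str) -> str:
    # Embedding instruction template from the GritLM README
    return '<|user|>\n' + instruction + '\n<|embed|>\n' if instruction else '<|embed|>\n'

task_prompt = 'Given a web search query, retrieve relevant passages that answer the query'

# Queries are embedded with the instruction; documents are embedded without one
q_reps = model.encode(['example legal question'], instruction=gritlm_instruction(task_prompt))
d_reps = model.encode(['example passage'], instruction=gritlm_instruction(''))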

Muennighoff changed pull request status to merged
