Law tab & Google Gecko

#90
by Muennighoff - opened
Massive Text Embedding Benchmark org • edited 30 days ago

Looks like below:

Screenshot 2024-04-04 at 10.44.43 AM.png

Screenshot 2024-04-04 at 10.44.59 AM.png

(still need to remove SONAR & jina as it's a different dataset)

Relevant GitHub PRs:
https://github.com/embeddings-benchmark/mteb/pull/311

As detailed in the PR, I think mixing domain & language tabs is only temporary; once there is a significant number of both, we can split them up into separate tab lines. Maybe we can also let people select them, similar to the nice UI by @tomaarsen in https://huggingface.co/spaces/mteb/leaderboard/discussions/89

cc @Shuang59 @KennethEnevoldsen @tomaarsen

Muennighoff changed pull request status to open
Massive Text Embedding Benchmark org

"I think mixing domain & language tabs is only temporary; once there is a significant number of both, we can split them up into separate tab lines."

That works for me.
This PR looks solid to me.

  • Tom Aarsen
Massive Text Embedding Benchmark org

Also @Shuang59, could you share the instruction you used for e5-mistral-7b-instruct? 🙂 I'd like to try GritLM-7B on it with the same instruction, which should perform slightly better.

Massive Text Embedding Benchmark org • edited 29 days ago

Hi @Muennighoff, I used the same instruction as in the original code at this link:
https://huggingface.co/intfloat/e5-mistral-7b-instruct

def get_detailed_instruct(task_description: str, query: str) -> str:
    return f'Instruct: {task_description}\nQuery: {query}'

# Same task prompt as in the original e5-mistral-7b-instruct example code
task_prompt = 'Given a web search query, retrieve relevant passages that answer the query'
batch = [get_detailed_instruct(task_prompt, q) for q in batch]

if self.engine == 'intfloat/e5-mistral-7b-instruct':
    # Tokenize without padding, append the EOS token to each sequence, then pad the batch
    all_tokens = self.tokenizer(batch, max_length=self.max_token_len - 1, return_attention_mask=False, padding=False, truncation=True)
    all_tokens['input_ids'] = [input_ids + [self.tokenizer.eos_token_id] for input_ids in all_tokens['input_ids']]
    all_tokens = self.tokenizer.pad(all_tokens, padding=True, return_attention_mask=True, return_tensors='pt')
elif self.engine == 'Salesforce/SFR-Embedding-Mistral':
    all_tokens = self.tokenizer(batch, max_length=self.max_token_len, padding=True, truncation=True, return_tensors="pt")

outputs = self.model(**all_tokens)
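
For the GritLM-7B comparison, a minimal sketch of how the same task prompt could be passed in is below. It assumes the gritlm package and the <|user|>/<|embed|> embedding template from the GritLM README; the example query/passage strings are placeholders, not part of the evaluation here.

# Hypothetical sketch: embed queries with GritLM-7B using the same task prompt.
# Assumes `pip install gritlm` and the GritLM wrapper from the GritLM README.
from gritlm import GritLM

model = GritLM('GritLM/GritLM-7B', torch_dtype='auto')

def gritlm_instruction(instruction: str) -> str:
    # Embedding instruction template from the GritLM README
    return '<|user|>\n' + instruction + '\n<|embed|>\n' if instruction else '<|embed|>\n'

task_prompt = 'Given a web search query, retrieve relevant passages that answer the query'

# Queries are embedded with the instruction; documents are embedded without one
q_reps = model.encode(['example legal question'], instruction=gritlm_instruction(task_prompt))
d_reps = model.encode(['example passage'], instruction=gritlm_instruction(''))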

Muennighoff changed pull request status to merged
