Feature extraction suitability?

#52

by ivoras - opened Apr 15, 2024


ivoras

Apr 15, 2024

Does it make sense to use gemma-2b for feature extraction / generation of embeddings for vector similarity search?

I'm generating vectors with:

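import os
from transformers import pipeline

def dataset():
  # Stream the input texts (`data`, defined elsewhere) one at a time.
  for x in data:
    yield x

# The access token is passed via `token`; an unrecognized `access_token` kwarg is not used for authentication.
p = pipeline('feature-extraction', framework='pt', model='google/gemma-2b', device='cuda', token=os.environ['HF_TOKEN'])
for i, vec in enumerate(p(dataset())):
  save_vec(i, data[i], vec)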

But once the vectors are generated, searching for the vectors nearest to a query vector (by L2 distance) returns gibberish. This exact code works with other models, including BERT-based ones and phi2.

lkv

Google org Jul 23, 2024

Hi @ivoras, Gemma-2B is a large language model pre-trained by Google for text generation. It is not specifically designed for feature extraction or vector similarity search. While it can produce vectors, they may not be optimal for such tasks. For vector similarity search, it's recommended to use models explicitly trained for that purpose. Models such as the BERT-based ones and Phi2 that you mention may produce vectors better suited to similarity search. Thank you.
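
One thing that may be worth checking independently of the model choice: the feature-extraction pipeline returns one vector per input token (a nested list shaped roughly [1][num_tokens][hidden_size]), not a single vector per text. A common way to get one fixed-size vector is to average the token vectors and normalize the result before comparing by L2 distance. Below is a minimal sketch of that idea in plain Python. The helper names (mean_pool, l2_normalize, l2_distance, embed) are made up for illustration, and whether mean pooling gives useful embeddings for Gemma-2B specifically is an assumption, not something confirmed in this thread.

```python
import math

def mean_pool(token_vectors):
    """Average a list of per-token vectors into one fixed-size vector."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    dim = len(token_vectors[0])
    count = len(token_vectors)
    return [sum(tok[d] for tok in token_vectors) / count for d in range(dim)]

def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)

def l2_distance(a, b):
    """Euclidean distance between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def embed(pipeline_output):
    """Turn one pipeline result into a single unit-length vector.

    Assumes `pipeline_output` is a nested list indexed as
    [batch=1][token][hidden], which is how the feature-extraction
    pipeline returns results for a single input text.
    """
    return l2_normalize(mean_pool(pipeline_output[0]))
```

With something like this, the loop in the question would store embed(vec) instead of the raw vec, and the query text would go through the same function before the distance comparison.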
