FP32 vs FP16

Opened by GuillaumeGrosjean

First of all, thank you for your work!

The pre-trained base model you used, Alibaba-NLP/gte-multilingual-base, was trained in FP16, while your fine-tuned version is in FP32.

Was this for better performance?
Do you think it can be converted back to FP16 to reduce memory use and speed up inference?

Thank you, @GuillaumeGrosjean!
I’ve tested converting the model back to FP16 for inference, and there’s no significant difference in prediction time compared to the FP32 version. Memory usage is reduced, and performance remains stable. Therefore, using FP16 for inference is a practical approach to optimize resource usage without compromising performance.
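
In case it helps others, here is a minimal sketch of that conversion and export; the output path french-document-embedding-fp16 is just a placeholder name, and model.half() is the standard torch.nn.Module cast applied to the SentenceTransformer wrapper:

from sentence_transformers import SentenceTransformer

# Load the FP32 checkpoint, cast all weights to FP16, and save a local copy
model = SentenceTransformer("dangvantuan/french-document-embedding", trust_remote_code=True)
model.half()  # casts every parameter and buffer to torch.float16
model.save("french-document-embedding-fp16")  # placeholder output path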

Loading your model in FP16 saves a lot of computation time on my configuration (a T4 GPU):

import torch
from sentence_transformers import SentenceTransformer

# Default load: weights in FP32
model = SentenceTransformer('dangvantuan/french-document-embedding', trust_remote_code=True)
model.encode(["Bonjour ! "*100]*1000, show_progress_bar=True)
# Log: Batches: 100%|████████████████████████████████| 32/32 [00:16<00:00,  1.91it/s]

# Same model with the weights cast to FP16 at load time
model = SentenceTransformer('dangvantuan/french-document-embedding', trust_remote_code=True, model_kwargs={"torch_dtype": torch.float16})
model.encode(["Bonjour ! "*100]*1000, show_progress_bar=True)
# Log: Batches: 100%|████████████████████████████████| 32/32 [00:03<00:00,  8.43it/s]

Maybe this performance gain only applies to GPUs?

To clarify my question, I was wondering about the impact on metrics of converting an FP32-trained model to FP16. Do you think there will be a big drop in benchmark scores? In other words, do we trade accuracy for the memory and computation savings?

Hi @GuillaumeGrosjean
The performance using FP16 is nearly identical to FP32, with only a negligible difference (~2e-4). However, FP16 significantly reduces computation time and memory usage, especially on GPUs like the T4. It’s an optimal choice for improving efficiency without sacrificing accuracy.
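
For anyone who wants to verify this on their own data, here is a minimal sketch that encodes the same inputs at both precisions and compares the embeddings directly; the sample sentences are placeholders:

import torch
from sentence_transformers import SentenceTransformer

sentences = ["Bonjour, comment allez-vous ?", "Le chat dort sur le canapé."]  # placeholder inputs

# Encode the same inputs with the FP32 and FP16 versions of the model
model_fp32 = SentenceTransformer('dangvantuan/french-document-embedding', trust_remote_code=True)
emb_fp32 = model_fp32.encode(sentences, convert_to_tensor=True)

model_fp16 = SentenceTransformer('dangvantuan/french-document-embedding', trust_remote_code=True,
                                 model_kwargs={"torch_dtype": torch.float16})
emb_fp16 = model_fp16.encode(sentences, convert_to_tensor=True)

# Compare in FP32 so the comparison itself adds no rounding
cos = torch.nn.functional.cosine_similarity(emb_fp32.float(), emb_fp16.float())
print(cos)  # should be ~1.0 per sentence if the precisions agree
print((emb_fp32.float() - emb_fp16.float()).abs().max())  # elementwise gap, expected to be tiny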

@GuillaumeGrosjean: The scores on all MTEB benchmarks are the same!
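
For readers who want to reproduce such a check themselves, here is a minimal sketch using the mteb package; the task name "STS22" and the output folder are illustrative choices, not necessarily the benchmarks used for the numbers above:

import torch
from mteb import MTEB
from sentence_transformers import SentenceTransformer

# Load the model in FP16 and run a single MTEB task; swap in any task you care about
model = SentenceTransformer('dangvantuan/french-document-embedding', trust_remote_code=True,
                            model_kwargs={"torch_dtype": torch.float16})
evaluation = MTEB(tasks=["STS22"])  # illustrative task choice
results = evaluation.run(model, output_folder="results/fp16")  # placeholder output folder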
