FP32 vs FP16
First of all, thank you for your work!
The base model you used as the pre-trained model, Alibaba-NLP/gte-multilingual-base,
is trained in FP16, while your fine-tuned version is trained in FP32.
Is this for better performance?
Do you think we can convert it back to FP16 to reduce memory use and speed up inference?
Thank you,
@GuillaumeGrosjean !
I’ve tested converting the model back to FP16 for inference, and there’s no significant difference in prediction time compared to the FP32 version. Memory usage is reduced, and performance remains stable. Therefore, using FP16 for inference is a practical approach to optimize resource usage without compromising performance.
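If you want to verify the memory saving on your own hardware, here is a rough sketch of how the peak GPU memory of the two dtypes could be compared. It assumes a CUDA device; the helper function name, the test sentences, and the batch size are only illustrative, not the exact setup I used.

# Rough sketch: compare peak GPU memory between FP32 and FP16 loads.
# Assumes a CUDA device; helper name and test sentences are illustrative.
import torch
from sentence_transformers import SentenceTransformer

def peak_memory_mb(model_kwargs):
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    model = SentenceTransformer('dangvantuan/french-document-embedding',
                                trust_remote_code=True, model_kwargs=model_kwargs)
    model.encode(["Bonjour ! " * 100] * 100, show_progress_bar=False)
    peak = torch.cuda.max_memory_allocated() / 1024**2
    del model  # free the weights before the next measurement
    return peak

print("FP32 peak MB:", peak_memory_mb({}))  # default: FP32 weights
print("FP16 peak MB:", peak_memory_mb({"torch_dtype": torch.float16}))  # half precision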
Loading your model in FP16 saves a lot of computing time on my configuration (T4 GPU):
import torch
from sentence_transformers import SentenceTransformer

# Default load: weights in FP32
model = SentenceTransformer('dangvantuan/french-document-embedding', trust_remote_code=True)
model.encode(["Bonjour ! "*100]*1000, show_progress_bar=True)
# Log: Batches: 100%|████████████████████████████████| 32/32 [00:16<00:00, 1.91it/s]

# Same model loaded with FP16 weights
model = SentenceTransformer('dangvantuan/french-document-embedding', trust_remote_code=True, model_kwargs={"torch_dtype": torch.float16})
model.encode(["Bonjour ! "*100]*1000, show_progress_bar=True)
# Log: Batches: 100%|████████████████████████████████| 32/32 [00:03<00:00, 8.43it/s]
Maybe this performance gain only applies to GPUs?
To clarify my question, I was wondering about the impact of quantizing FP32-trained models to FP16 on metrics. Do you think there will be a large drop in benchmark scores? Do we trade accuracy for the memory and computation-time savings?
Hi
@GuillaumeGrosjean
The performance using FP16 is nearly identical to FP32, with only a negligible difference (~2e-4). However, FP16 significantly reduces computation time and memory usage, especially on GPUs like the T4. It’s an optimal choice for improving efficiency without sacrificing accuracy.
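For reference, here is a minimal sketch of how that kind of comparison can be reproduced on your own sentences. The example sentences and the normalization choice are illustrative assumptions, not the exact benchmark setup behind the ~2e-4 figure.

# Sketch: measure numerical drift between FP32 and FP16 embeddings.
# The sentences below are illustrative; any French text works.
import numpy as np
import torch
from sentence_transformers import SentenceTransformer

sentences = ["Bonjour, comment allez-vous ?", "Ceci est un document de test."]

model_fp32 = SentenceTransformer('dangvantuan/french-document-embedding', trust_remote_code=True)
emb_fp32 = model_fp32.encode(sentences, normalize_embeddings=True)

model_fp16 = SentenceTransformer('dangvantuan/french-document-embedding',
                                 trust_remote_code=True,
                                 model_kwargs={"torch_dtype": torch.float16})
emb_fp16 = model_fp16.encode(sentences, normalize_embeddings=True)

# Embeddings are L2-normalized, so the row-wise dot product is the cosine similarity.
cosine = np.sum(emb_fp32 * emb_fp16, axis=1)
print("cosine(FP32, FP16) per sentence:", cosine)
print("max abs difference:", np.abs(emb_fp32 - emb_fp16).max())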