Usage: Infinity #6
opened by michaelfeil
Added an example for usage with Infinity (https://github.com/michaelfeil/infinity). Ready for review.
Tested and works:
docker run --gpus all -v $PWD/data:/app/.cache -e HF_TOKEN=$HF_TOKEN -p "7999":"7997" michaelf34/infinity:0.0.68 v2 --model-id Salesforce/SFR-Embedding-2_R --revision "91762139d94ed4371a9fa31db5551272e0b83818" --dtype bfloat16 --batch-size 4 --device cuda --engine torch --port 7997 --no-bettertransformer
INFO: Started server process [1]
INFO: Waiting for application startup.
INFO     2024-11-13 00:05:11,159 infinity_emb INFO: Creating 1 engines: engines=['Salesforce/SFR-Embedding-2_R']   infinity_server.py:89
INFO     2024-11-13 00:05:11,163 infinity_emb INFO: Anonymized telemetry can be disabled via environment variable `DO_NOT_TRACK=1`.   telemetry.py:30
INFO     2024-11-13 00:05:11,171 infinity_emb INFO: model=`Salesforce/SFR-Embedding-2_R` selected, using engine=`torch` and device=`cuda`   select_model.py:64
INFO     2024-11-13 00:05:11,174 sentence_transformers.SentenceTransformer INFO: Load pretrained SentenceTransformer: Salesforce/SFR-Embedding-2_R   SentenceTransformer.py:216
INFO     2024-11-13 00:05:17,293 infinity_emb INFO: Getting timings for batch_size=4 and avg tokens per sentence=2   select_model.py:97
             0.63 ms tokenization
            24.57 ms inference
             0.10 ms post-processing
            25.29 ms total
         embeddings/sec: 158.14
INFO     2024-11-13 00:05:17,642 infinity_emb INFO: Getting timings for batch_size=4 and avg tokens per sentence=513   select_model.py:103
             2.35 ms tokenization
           163.52 ms inference
             0.27 ms post-processing
           166.15 ms total
         embeddings/sec: 24.07
INFO     2024-11-13 00:05:17,644 infinity_emb INFO: model warmed up, between 24.07-158.14 embeddings/sec at batch_size=4   select_model.py:104
INFO     2024-11-13 00:05:17,648 sentence_transformers.SentenceTransformer INFO: Load pretrained SentenceTransformer:
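
Once the container is up, the server can be queried over Infinity's OpenAI-compatible REST API. The sketch below is a minimal example, not taken from this PR: it assumes the default (empty) URL prefix, so the embeddings route is `/embeddings`, and it targets host port 7999 because of the `-p "7999":"7997"` mapping in the docker command above. Depending on the Infinity version or a custom `--url-prefix`, the route may instead be served under a prefix such as `/v1`.

```sh
# Minimal request against the running container (host port 7999 per the -p mapping above).
# The /embeddings route assumes the default URL prefix; adjust if your setup serves it under /v1.
curl -s http://localhost:7999/embeddings \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Salesforce/SFR-Embedding-2_R",
        "input": ["Which planet is known as the Red Planet?"]
      }'
```

The response should contain one embedding vector per string in `input`, in the same order as the request.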