infinity usage for reranking. Implements a Cohere-compatible API.
#10 opened by michaelfeil
docker run --gpus "0" -v $PWD/data:/app/.cache -p "7997":"7997" michaelf34/infinity:0.0.68 \
  v2 --model-id Alibaba-NLP/gte-multilingual-reranker-base --revision "main" \
  --dtype bfloat16 --batch-size 32 --device cuda --engine torch --port 7997
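As the startup log below notes, the anonymized telemetry can be disabled via the environment variable `DO_NOT_TRACK=1`. A minimal variant of the command above that passes it through Docker (only the `-e` flag is new; everything else is unchanged):

docker run --gpus "0" -e DO_NOT_TRACK=1 -v $PWD/data:/app/.cache -p "7997":"7997" michaelf34/infinity:0.0.68 \
  v2 --model-id Alibaba-NLP/gte-multilingual-reranker-base --revision "main" \
  --dtype bfloat16 --batch-size 32 --device cuda --engine torch --port 7997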
INFO: Started server process [1]
INFO: Waiting for application startup.
INFO 2024-11-12 23:54:16,142 infinity_emb INFO: Creating 1engines: engines=['Alibaba-NLP/gte-multilingual-reranker-base']    infinity_server.py:89
INFO 2024-11-12 23:54:16,147 infinity_emb INFO: Anonymized telemetry can be disabled via environment variable `DO_NOT_TRACK=1`.    telemetry.py:30
INFO 2024-11-12 23:54:16,154 infinity_emb INFO: model=`Alibaba-NLP/gte-multilingual-reranker-base` selected, using engine=`torch` and device=`cuda`    select_model.py:64
INFO 2024-11-12 23:54:20,710 infinity_emb INFO: Adding optimizations via Huggingface optimum.    acceleration.py:56
The class `optimum.bettertransformers.transformation.BetterTransformer` is deprecated and will be removed in a future release.
WARNING 2024-11-12 23:54:20,712 infinity_emb WARNING: BetterTransformer is not available for model: <class 'transformers_modules.Alibaba-NLP.new-impl.40ced75c3017eb27626c9d4ea981bde21a2662f4.modeling.NewForSequenceClassification'> Continue without bettertransformer modeling code.    acceleration.py:67
INFO 2024-11-12 23:54:20,948 infinity_emb INFO: Getting timings for batch_size=32 and avg tokens per sentence=2    select_model.py:97
3.51 ms tokenization
6.39 ms inference
0.01 ms post-processing
9.90 ms total
embeddings/sec: 3231.28
INFO 2024-11-12 23:54:21,914 infinity_emb INFO: Getting timings for batch_size=32 and avg tokens per sentence=1024    select_model.py:103
25.42 ms tokenization
429.65 ms inference
0.03 ms post-processing
455.10 ms total
embeddings/sec: 70.31
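For reference, the embeddings/sec figures are simply batch size divided by total latency: 32 / 0.00990 s ≈ 3231 at 2 tokens per sentence, and 32 / 0.4551 s ≈ 70.3 at 1024 tokens.

With the server up, the Cohere-style rerank route can be exercised over plain HTTP. A minimal sketch with curl, assuming the `/rerank` path and a Cohere-like request body (the query and documents here are made-up placeholders; the exact schema is served by the OpenAPI docs at http://localhost:7997/docs):

curl http://localhost:7997/rerank \
  -X POST \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Alibaba-NLP/gte-multilingual-reranker-base",
        "query": "What is deep learning?",
        "documents": [
          "Deep learning is a subset of machine learning.",
          "Paris is the capital of France."
        ]
      }'

The response lists a relevance score per input document, which is the shape a Cohere rerank client expects.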
thenlper changed pull request status to merged