Infinity usage for reranking. Implements a Cohere-compatible API.

#10
docker run --gpus "0" -v $PWD/data:/app/.cache -p "7997":"7997" michaelf34/infinity:0.0.68 v2 \
  --model-id Alibaba-NLP/gte-multilingual-reranker-base --revision "main" --dtype bfloat16 \
  --batch-size 32 --device cuda --engine torch --port 7997
INFO:     Started server process [1]
INFO:     Waiting for application startup.
INFO     2024-11-12 23:54:16,142 infinity_emb INFO: Creating 1 engines: engines=['Alibaba-NLP/gte-multilingual-reranker-base'] (infinity_server.py:89)
INFO     2024-11-12 23:54:16,147 infinity_emb INFO: Anonymized telemetry can be disabled via environment variable `DO_NOT_TRACK=1`. (telemetry.py:30)
INFO     2024-11-12 23:54:16,154 infinity_emb INFO: model=`Alibaba-NLP/gte-multilingual-reranker-base` selected, using engine=`torch` and device=`cuda` (select_model.py:64)
INFO     2024-11-12 23:54:20,710 infinity_emb INFO: Adding optimizations via Huggingface optimum. (acceleration.py:56)
The class `optimum.bettertransformers.transformation.BetterTransformer` is deprecated and will be removed in a future release.
WARNING  2024-11-12 23:54:20,712 infinity_emb WARNING: BetterTransformer is not available for model: <class 'transformers_modules.Alibaba-NLP.new-impl.40ced75c3017eb27626c9d4ea981bde21a2662f4.modeling.NewForSequenceClassification'> Continue without bettertransformer modeling code. (acceleration.py:67)
INFO     2024-11-12 23:54:20,948 infinity_emb INFO: Getting timings for batch_size=32 and avg tokens per sentence=2 (select_model.py:97)
         3.51     ms tokenization
         6.39     ms inference
         0.01     ms post-processing
         9.90     ms total
         embeddings/sec: 3231.28
INFO     2024-11-12 23:54:21,914 infinity_emb INFO: Getting timings for batch_size=32 and avg tokens per sentence=1024 (select_model.py:103)
         25.42    ms tokenization
         429.65   ms inference
         0.03     ms post-processing
         455.10   ms total
         embeddings/sec: 70.31
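Once the server is up, it can be queried like Cohere's rerank endpoint. A minimal client sketch, assuming the server listens on `localhost:7997` (matching the `docker run` above) with a `/rerank` route and Cohere-style request fields (`model`, `query`, `documents`); the example query and documents are made up for illustration:

```python
import json

def build_rerank_payload(query, documents, model):
    """Assemble a Cohere-style JSON body for a POST to /rerank."""
    return {"model": model, "query": query, "documents": documents}

payload = build_rerank_payload(
    query="What is the capital of France?",
    documents=["Paris is the capital of France.", "Berlin is in Germany."],
    model="Alibaba-NLP/gte-multilingual-reranker-base",
)

# With the container running, send it with any HTTP client, e.g.:
#   import requests
#   resp = requests.post("http://localhost:7997/rerank", json=payload)
#   resp.json()  # expected to contain per-document relevance scores
print(json.dumps(payload, indent=2))
```

The documents are scored in batches server-side (here `--batch-size 32`), which is what the timing lines above are benchmarking.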
thenlper changed pull request status to merged