Added an example for usage with [infinity](https://github.com/michaelfeil/infinity). Ready for review.

Tested and works:

```
docker run --gpus all -v $PWD/data:/app/.cache -e HF_TOKEN=$HF_TOKEN -p "7999":"7997" michaelf34/infinity:0.0.68 v2 --model-id Salesforce/SFR-Embedding-2_R --revision "91762139d94ed4371a9fa31db5551272e0b83818" --dtype bfloat16 --batch-size 4 --device cuda --engine torch --port 7997 --no-bettertransformer
INFO:     Started server process [1]
INFO:     Waiting for application startup.
INFO     2024-11-13 00:05:11,159 infinity_emb INFO:        infinity_server.py:89
         Creating 1engines:
         engines=['Salesforce/SFR-Embedding-2_R']
INFO     2024-11-13 00:05:11,163 infinity_emb INFO: Anonymized   telemetry.py:30
         telemetry can be disabled via environment variable
         `DO_NOT_TRACK=1`.
INFO     2024-11-13 00:05:11,171 infinity_emb INFO:           select_model.py:64
         model=`Salesforce/SFR-Embedding-2_R` selected, using
         engine=`torch` and device=`cuda`
INFO     2024-11-13 00:05:11,174                      SentenceTransformer.py:216
         sentence_transformers.SentenceTransformer
         INFO: Load pretrained SentenceTransformer:
         Salesforce/SFR-Embedding-2_R
INFO     2024-11-13 00:05:17,293 infinity_emb INFO: Getting   select_model.py:97
         timings for batch_size=4 and avg tokens per
         sentence=2
                 0.63     ms tokenization
                 24.57    ms inference
                 0.10     ms post-processing
                 25.29    ms total
         embeddings/sec: 158.14
INFO     2024-11-13 00:05:17,642 infinity_emb INFO: Getting  select_model.py:103
         timings for batch_size=4 and avg tokens per
         sentence=513
                 2.35     ms tokenization
                 163.52   ms inference
                 0.27     ms post-processing
                 166.15   ms total
         embeddings/sec: 24.07
INFO     2024-11-13 00:05:17,644 infinity_emb INFO: model    select_model.py:104
         warmed up, between 24.07-158.14 embeddings/sec at
         batch_size=4
INFO     2024-11-13 00:05:17,648                      SentenceTransformer.py:216
         sentence_transformers.SentenceTransformer
         INFO: Load pretrained SentenceTransformer:
```
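
To query the running server, here is a minimal client sketch. It assumes infinity's OpenAI-compatible `/embeddings` route and uses host port 7999, matching the `-p "7999":"7997"` mapping in the docker command above; the request shape is the standard `{"model": ..., "input": [...]}` body.

```python
import json
import urllib.request

def build_payload(texts):
    # OpenAI-style embeddings request body for the model served above.
    return {"model": "Salesforce/SFR-Embedding-2_R", "input": list(texts)}

def embed(texts, base_url="http://localhost:7999"):
    # POST to infinity's /embeddings endpoint and unpack the vectors.
    req = urllib.request.Request(
        f"{base_url}/embeddings",
        data=json.dumps(build_payload(texts)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return [item["embedding"] for item in json.load(resp)["data"]]
```

e.g. `embed(["Which team won the game?"])` returns a list with one embedding vector.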

@yliu279 Can you review this?
