bwang0911 committed on
Commit
9a8bec8
1 Parent(s): 778540d

Update README.md

Files changed (1)
  1. README.md +4 -17
README.md CHANGED
@@ -2668,23 +2668,10 @@ print(cos_sim(embeddings[0], embeddings[1]))
  If you only want to handle shorter sequences, such as 2k, pass the `max_length` parameter to the `encode` function:
 
  ```python
- embeddings = model.encode(['Very long ... document'], max_length=2048)
- ```
-
- For long sequences, it is recommended to run inference with Flash Attention, which lets you increase the batch size and throughput at long sequence lengths.
- We include an experimental Flash Attention implementation, shipped with the model.
- Install the following triton version:
- `pip install triton==2.0.0.dev20221202`.
- Now run the same code as above, but set the parameter `with_flash` to `True` when you load the model. You also have to use either `fp16` or `bf16`:
- ```python
- from transformers import AutoModel
- from numpy.linalg import norm
- import torch
-
- cos_sim = lambda a,b: (a @ b.T) / (norm(a)*norm(b))
- model = AutoModel.from_pretrained('jinaai/jina-embedding-s-en-v2', trust_remote_code=True, with_flash=True, torch_dtype=torch.float16).cuda()  # trust_remote_code is needed to use the encode method
- embeddings = model.encode(['How is the weather today?', 'What is the current weather like today?'])
- print(cos_sim(embeddings[0], embeddings[1]))
+ embeddings = model.encode(
+     ['Very long ... document'],
+     max_length=2048
+ )
  ```
 
  ## Fine-tuning
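
For reference, the retained snippet assumes a model has already been loaded as shown earlier in the README. Below is a minimal self-contained sketch, reusing the model ID and `trust_remote_code` loading call from the removed block; it is illustrative context only, not part of this commit:

```python
from transformers import AutoModel

# Model ID taken from the removed block above; assumed unchanged by this commit.
# trust_remote_code is needed to use the custom encode method.
model = AutoModel.from_pretrained(
    'jinaai/jina-embedding-s-en-v2',
    trust_remote_code=True,
)

# Truncate each input to 2048 tokens rather than the full supported length.
embeddings = model.encode(
    ['Very long ... document'],
    max_length=2048,
)
```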