---
library_name: Tevatron
---

# DSE-Phi3-Docmatix-V1.0

DSE-Phi3-Docmatix-V1.0 is a bi-encoder model designed to encode document screenshots into dense vectors for document retrieval. The Document Screenshot Embedding ([DSE](https://arxiv.org/abs/2406.11251)) approach captures documents in their original visual format, preserving all information such as text, images, and layout, thus avoiding tedious parsing and potential information loss.

The model, `Tevatron/dse-phi3-docmatix-v1.0`, is trained using the `Tevatron/docmatix-ir` dataset, a variant of `HuggingFaceM4/Docmatix` specifically adapted for training PDF retrievers with Vision Language Models in open-domain question answering scenarios. For more information on dataset filtering and hard negative mining, refer to the [docmatix-ir dataset page](https://huggingface.co/datasets/Tevatron/docmatix-ir/blob/main/README.md).

## How to Use the Model

### Load the Model and Processor

```python
import torch
from transformers import AutoProcessor, AutoModelForCausalLM, AutoConfig

processor = AutoProcessor.from_pretrained('microsoft/Phi-3-vision-128k-instruct', trust_remote_code=True)
config = AutoConfig.from_pretrained('microsoft/Phi-3-vision-128k-instruct', trust_remote_code=True, attn_implementation="flash_attention_2", torch_dtype=torch.bfloat16, use_cache=False)
# flash_attention_2 requires the flash-attn package and a supported GPU.
model = AutoModelForCausalLM.from_pretrained('Tevatron/dse-phi3-docmatix-v1.0', trust_remote_code=True, config=config, attn_implementation="flash_attention_2", torch_dtype=torch.bfloat16).to('cuda:0')

def get_embedding(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    # Pool the hidden state at each sequence's last non-padding position (the EOS token).
    sequence_lengths = attention_mask.sum(dim=1) - 1
    bs = last_hidden_state.shape[0]
    reps = last_hidden_state[torch.arange(bs, device=last_hidden_state.device), sequence_lengths]
    # L2-normalize so that dot products between embeddings are cosine similarities.
    reps = torch.nn.functional.normalize(reps, p=2, dim=-1)
    return reps
```
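`get_embedding` implements last-token (EOS) pooling: each sequence is represented by the hidden state at its final non-padding position, and because the vectors are L2-normalized, a plain dot product between two embeddings equals their cosine similarity.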

### Encode Text Query

```python
queries = ["query: Where can we find Llama?", "query: What is the LLaMA model?"]
query_inputs = processor(queries, return_tensors="pt", padding="longest", max_length=128, truncation=True).to('cuda:0')
output = model(**query_inputs, return_dict=True, output_hidden_states=True)
query_embeddings = get_embedding(output.hidden_states[-1], query_inputs["attention_mask"])
```
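Note the `query: ` prefix on each query string; keeping this prefix at inference time presumably matches the input format the model saw during training.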

### Encode Document Screenshot

```python
from PIL import Image

passage_image1 = Image.open("path/to/your/image1.png")
passage_image2 = Image.open("path/to/your/image2.png")
passage_images = [passage_image1, passage_image2]
# The <|image_1|> placeholder tells the Phi-3-vision processor where to splice each image into its prompt.
passage_prompts = ["<|image_1|>\nWhat is shown in this image?</s>", "<|image_1|>\nWhat is shown in this image?</s>"]

passage_inputs = processor(passage_prompts, images=passage_images, return_tensors="pt", padding="longest", max_length=4096, truncation=True).to('cuda:0')
output = model(**passage_inputs, return_dict=True, output_hidden_states=True)
doc_embeddings = get_embedding(output.hidden_states[-1], passage_inputs["attention_mask"])
```

### Compute Similarity

```python
from torch.nn.functional import cosine_similarity

similarities = cosine_similarity(query_embeddings, doc_embeddings)
print(similarities)
```
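`cosine_similarity` above compares query *i* only with document *i*. For retrieval you typically want every query scored against every document; since `get_embedding` already L2-normalizes, a plain matrix product yields the full cosine score matrix. A minimal sketch, reusing `query_embeddings` and `doc_embeddings` from above:

```python
# All-pairs scores: row i holds query i's cosine similarity to every document.
scores = query_embeddings @ doc_embeddings.T  # shape: [num_queries, num_docs]

# Rank documents per query, e.g. take the single best match.
best_scores, best_ids = scores.topk(k=1, dim=1)
print(best_ids.squeeze(1))  # index of the top-ranked document for each query
```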

### Encode Document Text

This DSE checkpoint was warmed up with `Tevatron/msmarco-passage-aug`, so the model can also effectively encode documents given as plain text input.

```python
passage_prompts = ["Llama is in Africa</s>", "LLaMA is an LLM released by Meta.</s>"]

passage_inputs = processor(passage_prompts, images=None, return_tensors="pt", padding="longest", max_length=4096, truncation=True).to('cuda:0')
output = model(**passage_inputs, return_dict=True, output_hidden_states=True)
doc_embeddings = get_embedding(output.hidden_states[-1], passage_inputs["attention_mask"])

similarities = cosine_similarity(query_embeddings, doc_embeddings)
print(similarities)
```
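For a corpus larger than a couple of passages, you would encode in mini-batches under `torch.no_grad()` to keep memory bounded. A minimal sketch, assuming a hypothetical list `corpus` of text passages formatted like the examples above (the batch size of 8 is arbitrary):

```python
import torch

corpus = ["passage one</s>", "passage two</s>"]  # hypothetical passages; format as above
all_embeddings = []
with torch.no_grad():  # no gradients are needed for encoding
    for start in range(0, len(corpus), 8):
        batch = corpus[start:start + 8]
        inputs = processor(batch, images=None, return_tensors="pt", padding="longest",
                           max_length=4096, truncation=True).to('cuda:0')
        out = model(**inputs, return_dict=True, output_hidden_states=True)
        all_embeddings.append(get_embedding(out.hidden_states[-1], inputs["attention_mask"]))
corpus_embeddings = torch.cat(all_embeddings, dim=0)

# Score all queries against the whole corpus and take the top match per query.
scores = query_embeddings @ corpus_embeddings.T
print(scores.topk(k=1, dim=1).indices)
```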