Bia / CORAL

Safetensors · qwen2_5_vl

Bia committed on
Commit e064cb5 · verified · 1 Parent(s): 85baf14

Update README.md

Files changed (1):
  1. README.md +21 -13
README.md CHANGED
@@ -33,11 +33,16 @@ We introduce CORAL, a multi-modal embedding model built upon Qwen2.5-3B-Instruct
CORAL is short for Contrastive Reconstruction for Multimodal Retrieval. The loss function of CORAL consists of three components: Contrastive Learning Loss, Vision Reconstruction Loss, and Masked Language Modeling Loss. During training, we reconstruct both the query and its corresponding positive sample.

<p align="center">
-  <img src="https://merit-2025.github.io/static/images/part3/method.png" alt="CORAL Structure" style="width: 100%; max-width: 500px;">
+  <img src="https://merit-2025.github.io/static/images/part3/method.png" alt="CORAL Overview" style="width: 100%; max-width: 600px;">
</p>

<p align="center"><b>Overview for CORAL</b></p>

+ <p align="center">
+   <img src="images/example.jpg" alt="Example" style="width: 100%; max-width: 600px;">
+ </p>
+
+ <p align="center"><b>Example Query and Ground Truth</b></p>

## Usage

@@ -46,6 +51,7 @@ CORAL is short for Contrastive Reconstruction for Multimodal Retrieval. The loss
We provide the checkpoint of CORAL on Huggingface. You can load the model using the following code:

```python
+ import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info

@@ -54,6 +60,7 @@ from qwen_vl_utils import process_vision_info
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Bia/CORAL", torch_dtype="auto", device_map="auto"
)
+
processor = AutoProcessor.from_pretrained("Bia/CORAL")

## Prepare Inputs
@@ -64,12 +71,12 @@ query = [
            {"type": "text", "text": "Find a product of backpack that have the same brand with <Product 1> \n "},
            {
                "type": "image",
-                "image": "images/product_1.jpg",
+                "image": "CORAL/images/product_1.png",
            },
            {"type": "text", "text": "\n Ransel MOSSDOOM Polyester dengan Ruang Komputer dan Penyimpanan Besar, Ukuran $30 \times 12 \times 38$ cm , Berat 0.32 kg. </Product 1> and the same fashion style with <Product 2> "},
            {
                "type": "image",
-                "image": "images/product_2.jpg",
+                "image": "CORAL/images/product_2.png",
            },
            {"type": "text", "text": "\n Elegant Pink Flats with Low Heel and Buckle Closure for Stylish Party Wear </Product 2> with a quilted texture and a chain strap."}
        ],
@@ -83,7 +90,7 @@ candidate = [
            {"type": "text", "text": "Represent the given product: "},
            {
                "type": "image",
-                "image": "images/product_3.jpg",
+                "image": "CORAL/images/product_3.png",
            },
            {"type": "text", "text": "\n MOSSDOOM Elegant Pink PU Leather Handbag with Chain Strap and Large Capacity, Compact Size $18 \times 9.5 \times 15 \mathrm{~cm}$."},
        ],
@@ -120,19 +127,20 @@ candidate_inputs = processor(


# Encode Embeddings
- query_outputs = model(**query_inputs, return_dict=True, output_hidden_states=True)
- query_embedding = query_outputs.hidden_states[-1][:,-1,:]
- query_embedding = torch.nn.functional.normalize(query_embedding, dim=-1)
- print(query_embedding.shape) # torch.Size([1, 2048])
+ with torch.inference_mode():
+     query_outputs = model(**query_inputs, return_dict=True, output_hidden_states=True)
+     query_embedding = query_outputs.hidden_states[-1][:,-1,:]
+     query_embedding = torch.nn.functional.normalize(query_embedding, dim=-1)
+     print(query_embedding.shape) # torch.Size([1, 2048])

- candidate_outputs = model(**inputs, return_dict=True, output_hidden_states=True)
- candidate_embedding = candidate_outputs.hidden_states[-1][:,-1,:]
- candidate_embedding = torch.nn.functional.normalize(candidate_embedding, dim=-1)
- print(candidate_embedding.shape) # torch.Size([1, 2048])
+     candidate_outputs = model(**candidate_inputs, return_dict=True, output_hidden_states=True)
+     candidate_embedding = candidate_outputs.hidden_states[-1][:,-1,:]
+     candidate_embedding = torch.nn.functional.normalize(candidate_embedding, dim=-1)
+     print(candidate_embedding.shape) # torch.Size([1, 2048])

# Compute Similarity
similarity = torch.matmul(query_embedding, candidate_embedding.T)
- print(similarity) # tensor([[0.7650]], device='cuda:0')
+ print(similarity) # tensor([[0.6992]], device='cuda:0', dtype=torch.bfloat16)
```

## Evaluation
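
The README prose in this diff describes CORAL's training objective (contrastive learning, vision reconstruction, and masked language modeling, with reconstruction applied to both the query and its positive sample), but the commit contains no training code. The sketch below is a minimal, hypothetical illustration of how such a three-part objective is commonly combined; the loss weights (`lambda_rec`, `lambda_mlm`), the temperature, the use of in-batch negatives, and the MSE/cross-entropy choices are assumptions, not taken from CORAL.

```python
import torch
import torch.nn.functional as F

def coral_style_loss(query_emb, pos_emb,          # L2-normalized embeddings, shape (B, d)
                     rec_pred, rec_target,        # predicted vs. original vision features
                     mlm_logits, mlm_labels,      # masked-token logits and labels (-100 = ignore)
                     temperature=0.05, lambda_rec=1.0, lambda_mlm=1.0):
    """Hypothetical combination of the three losses named in the README; not CORAL's actual code."""
    # Contrastive learning loss with assumed in-batch negatives: query i should match positive i.
    logits = query_emb @ pos_emb.T / temperature                    # (B, B) similarity matrix
    targets = torch.arange(query_emb.size(0), device=logits.device)
    contrastive = F.cross_entropy(logits, targets)

    # Vision reconstruction loss: regress the original visual features (MSE is an assumption).
    reconstruction = F.mse_loss(rec_pred, rec_target)

    # Masked language modeling loss over the masked text positions only.
    mlm = F.cross_entropy(mlm_logits.view(-1, mlm_logits.size(-1)),
                          mlm_labels.view(-1), ignore_index=-100)

    return contrastive + lambda_rec * reconstruction + lambda_mlm * mlm
```

Per the README's description, the reconstruction and MLM terms would be computed for both the query and its corresponding positive sample; the sketch shows a single side for brevity.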