---
language:
- en
license: cc-by-nc-4.0
---
## Introduction
We introduce MM-Embed, an extension of NV-Embed-v1 with multimodal retrieval capability.
MM-Embed achieves state-of-the-art results on the [UniIR benchmark](https://huggingface.co/TIGER-Lab/UniIR), with an average score of 52.7 compared to 48.9 (the best result reported in the [UniIR benchmark paper](https://eccv.ecva.net/virtual/2024/poster/863)).
Notably, MM-Embed also improves on the text retrieval accuracy of NV-Embed-v1, from 59.36 to 60.3 on the 15 retrieval tasks in the Massive Text Embedding Benchmark ([MTEB](https://arxiv.org/abs/2210.07316)).
MM-Embed introduces several new training strategies, including modality-aware hard negative mining, which improves multimodal retrieval accuracy on UniIR, and a continual text-to-text fine-tuning method, which further improves text-to-text retrieval accuracy while maintaining multimodal retrieval accuracy.

<!-- For more technical details, refer to our paper: [NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models](https://arxiv.org/pdf/2405.17428). -->

<!-- For more benchmark results (other than MTEB), please find the [AIR-Bench](https://huggingface.co/spaces/AIR-Bench/leaderboard) for QA (English only) and Long-Doc. -->

## Model Details
- Multimodal architecture: [llava-hf/llava-v1.6-mistral-7b-hf](https://huggingface.co/llava-hf/llava-v1.6-mistral-7b-hf)
- Text Embedding LLM: [nvidia/NV-Embed-v1](https://huggingface.co/nvidia/NV-Embed-v1)

## How to use

Here are two examples of how to encode queries and passages using Hugging Face Transformers. The required package versions are listed [here](https://huggingface.co/nvidia/MM-Embed#1-required-packages). Instructions for various retrieval scenarios are available [here](https://huggingface.co/nvidia/MM-Embed/blob/main/instructions.json).

### Usage of Multimodal Retrieval (HuggingFace Transformers)
```python
import requests
from PIL import Image
from transformers import AutoModel

# Each query must be accompanied by a corresponding instruction describing the task.
task_name_to_instruct = {"example": "Retrieve a Wikipedia paragraph that provides an answer to the given query about the image."}

img1_url = 'https://cdn.contexttravel.com/image/upload/w_1500,q_60/v1574869648/blog/Facts%20about%20the%20Eiffel%20Tower/eiffelhero.jpg'
img2_url = 'https://trumpwhitehouse.archives.gov/wp-content/uploads/2021/01/40508989563_514189250a_o-1500x720.jpg'

instruction = task_name_to_instruct['example']
queries = [
    {'txt': 'What country does this place belong to?', 'img': Image.open(requests.get(img1_url, stream=True).raw)}, 
    {'txt': 'What country does this place belong to?', 'img': Image.open(requests.get(img2_url, stream=True).raw)},
]

# No instruction needed for retrieval passages
passages = [
    {'txt': "France, officially the French Republic, is a country located primarily in Western Europe. Its overseas regions and territories include French Guiana in South America, Saint Pierre and Miquelon in the North Atlantic, the French West Indies, and many islands in Oceania and the Indian Ocean, giving it one of the largest discontiguous exclusive economic zones in the world. Metropolitan France shares borders with Belgium and Luxembourg to the north, Germany to the northeast, Switzerland to the east, Italy and Monaco to the southeast, Andorra and Spain to the south, and a maritime border with the United Kingdom to the northwest. Its metropolitan area extends from the Rhine to the Atlantic Ocean and from the Mediterranean Sea to the English Channel and the North Sea. Its eighteen integral regions (five of which are overseas) span a combined area of 643,801 km2 (248,573 sq mi) and have a total population of 68.4 million as of January 2024. France is a semi-presidential republic with its capital in Paris, the country's largest city and main cultural and commercial centre."},
    {'txt': "The United States of America (USA), commonly known as the United States (U.S.) or America, is a country primarily located in North America. It is a federal union of 50 states and a federal capital district, Washington, D.C. The 48 contiguous states border Canada to the north and Mexico to the south, with the states of Alaska to the northwest and the archipelagic Hawaii in the Pacific Ocean. The United States also asserts sovereignty over five major island territories and various uninhabited islands. The country has the world's third-largest land area, largest exclusive economic zone, and third-largest population, exceeding 334 million. Its three largest metropolitan areas are New York, Los Angeles, and Chicago, and its three most populous states are California, Texas, and Florida."},
]

# load model with tokenizer
model = AutoModel.from_pretrained('nvidia/MM-Embed', trust_remote_code=True)
model = model.cuda()

# get the embeddings; the output embeddings are normalized to unit length
max_length = 4096
query_embeddings = model.encode(queries, is_query=True, instruction=instruction, max_length=max_length)['hidden_states']
passage_embeddings = model.encode(passages, max_length=max_length)['hidden_states']

# compute relevance scores
scores = (query_embeddings @ passage_embeddings.T) * 100
print(scores.tolist())
#[[31.019872665405273, 12.753520965576172], [11.135049819946289, 22.12639617919922]]
```
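
The score matrix printed above has one row per query and one column per passage, so retrieval amounts to sorting each row. The snippet below is a minimal sketch of that step in plain Python. `rank_passages` is a hypothetical helper introduced here for illustration and is not part of the MM-Embed API; it operates on the nested list returned by `scores.tolist()`.

```python
# Hypothetical helper (not part of the MM-Embed API): rank passages per query
# from the nested list returned by `scores.tolist()`.
def rank_passages(scores, top_k=1):
    """Return, for each query, the indices of the top_k highest-scoring passages."""
    rankings = []
    for row in scores:
        # Sort passage indices by descending relevance score.
        order = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        rankings.append(order[:top_k])
    return rankings


# Scores copied from the example output above (rows = queries, columns = passages).
scores = [
    [31.019872665405273, 12.753520965576172],
    [11.135049819946289, 22.12639617919922],
]
print(rank_passages(scores))  # [[0], [1]]
```

Each query's highest-scoring passage lies on the diagonal: query 0 retrieves the France passage and query 1 retrieves the United States passage.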

### Usage of Text-to-Text Retrieval (HuggingFace Transformers)
```python
from transformers import AutoModel

# Each query must be accompanied by a corresponding instruction describing the task.
task_name_to_instruct = {"example": "Given a question, retrieve passages that answer the question"}

instruction = task_name_to_instruct['example']
queries = [
    {'txt': 'are judo throws allowed in wrestling?'}, 
    {'txt': 'how to become a radiology technician in michigan?'},
]

# No instruction needed for retrieval passages
passages = [
    {'txt': "Since you're reading this, you are probably someone from a judo background or someone who is just wondering how judo techniques can be applied under wrestling rules. So without further ado, let's get to the question. Are Judo throws allowed in wrestling? Yes, judo throws are allowed in freestyle and folkstyle wrestling. You only need to be careful to follow the slam rules when executing judo throws. In wrestling, a slam is lifting and returning an opponent to the mat with unnecessary force."},
    {'txt': "Below are the basic steps to becoming a radiologic technologist in Michigan:Earn a high school diploma. As with most careers in health care, a high school education is the first step to finding entry-level employment. Taking classes in math and science, such as anatomy, biology, chemistry, physiology, and physics, can help prepare students for their college studies and future careers.Earn an associate degree. Entry-level radiologic positions typically require at least an Associate of Applied Science. Before enrolling in one of these degree programs, students should make sure it has been properly accredited by the Joint Review Committee on Education in Radiologic Technology (JRCERT).Get licensed or certified in the state of Michigan."},
]

# load model with tokenizer
model = AutoModel.from_pretrained('nvidia/MM-Embed', trust_remote_code=True)
model = model.cuda()

# get the embeddings; the output embeddings are normalized to unit length
max_length = 4096
query_embeddings = model.encode(queries, is_query=True, instruction=instruction, max_length=max_length)['hidden_states']
passage_embeddings = model.encode(passages, max_length=max_length)['hidden_states']

# compute relevance scores
scores = (query_embeddings @ passage_embeddings.T) * 100
print(scores.tolist())
#[[80.78538513183594, 2.030935049057007], [3.7138314247131348, 83.22908782958984]]
```
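
Because the output embeddings are normalized, the matrix product in the examples above can be read as a cosine similarity scaled by 100. The following is a small sketch of that relationship in plain Python. The vectors are toy values chosen for illustration, not real MM-Embed embeddings, and the helper names are assumptions introduced here.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]


def cosine(a, b):
    """Cosine similarity computed directly from unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy 3-dimensional vectors standing in for real embeddings (illustrative only).
query = [3.0, 4.0, 0.0]
passage = [4.0, 3.0, 0.0]

q_unit, p_unit = l2_normalize(query), l2_normalize(passage)

# For unit-length vectors, the dot product equals the cosine similarity.
assert math.isclose(dot(q_unit, p_unit), cosine(query, passage))

# Same scaling as in the examples above.
score = dot(q_unit, p_unit) * 100
print(round(score, 2))  # 96.0
```

Under this reading, scores are bounded between -100 and 100, and the scaling factor changes only the magnitude of the scores, not the ranking of passages.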

## Correspondence to
Sheng-Chieh Lin (s269lin@uwaterloo.ca), Wei Ping (wping@nvidia.com)

## Citation
Coming soon.
<!-- If you find this code useful in your research, please consider citing:

```bibtex
@misc{lin2024nvmmembed,
      title={UNIVERSAL MULTIMODAL RETRIEVAL WITH MULTIMODAL LLMS}, 
      author={Sheng-Chieh Lin, Chankyu Lee, Mohammad Shoeybi, Jimmy Lin, Bryan Catanzaro, Wei Ping},
      year={2024},
      eprint={2405.17428},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
``` -->
## License
This model should not be used for any commercial purpose. Refer to the [license](https://spdx.org/licenses/CC-BY-NC-4.0) for the detailed terms.

For commercial use, we recommend the models of [NeMo Retriever Microservices (NIMs)](https://build.nvidia.com/explore/retrieval).


## Troubleshooting


#### 1. Required Packages

If you run into problems, try installing the following Python packages:
```bash
pip uninstall -y transformer-engine
pip install torch==2.2.0
pip install transformers==4.42.4
pip install flash-attn==2.2.0
pip install pillow
```
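
To see whether the current environment already matches these pins, a check along the following lines may help. It is a sketch using only the standard library. `find_mismatches` is a hypothetical helper, and ignoring local build suffixes such as `+cu121` is an assumption made here so that a CUDA build of `torch` still counts as a match.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions pinned in the install commands above.
PINNED = {"torch": "2.2.0", "transformers": "4.42.4", "flash-attn": "2.2.0"}


def find_mismatches(pinned):
    """Map each package that is missing or off-pin to (installed, expected)."""
    mismatches = {}
    for name, expected in pinned.items():
        try:
            # Assumption: drop local build suffixes, e.g. "2.2.0+cu121" -> "2.2.0".
            installed = version(name).split("+")[0]
        except PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != expected:
            mismatches[name] = (installed, expected)
    return mismatches


for name, (installed, expected) in find_mismatches(PINNED).items():
    print(f"{name}: installed={installed}, expected={expected}")
```

If nothing is printed, the installed versions match the pinned ones.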

#### 2. Access to model nvidia/MM-Embed is restricted. You must be authenticated to access it

Run `huggingface-cli login` and authenticate with your Hugging Face access [token](https://huggingface.co/settings/tokens).

## Model Architectures

**Network Architecture:** Decoder-Only Transformer 

### Input
**Input Type(s):** Text, Image <br>
**Input Format(s):** String, [Pillow Library-Supported Formats](https://pillow.readthedocs.io/en/stable/handbook/image-file-formats.html) <br>
**Input Dimensions:** One-Dimensional (1D), Two-Dimensional (2D) <br>
**Other Properties Related to Input:** Maximum Token Length = 32,768 Tokens <br>

### Output
**Output Type(s):** Embedding Vector <br>
**Output Format:** Numeric <br>
**Model Output:** 1D <br>
**Other Properties Related to Output:** None <br> 

## Software Integration
**Runtime Engine(s):** PyTorch <br>

**Supported Hardware Microarchitecture Compatibility:** NVIDIA Hopper <br>

**Supported Operating System(s):** Linux <br>

## Inference
**Engine:** PyTorch <br>
**Test Hardware:** H100 <br>

## Ethical Considerations
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications.  When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.    

Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).