
jamesagilesoda/Twindoc-Mistral-7B-Alpha-Backbone-Embedding

Model Details

How to use

Model, Tokenizer

from transformers import AutoTokenizer
from transformers import AutoModel


model_name_or_path = "jamesagilesoda/Twindoc-Mistral-7B-Alpha-Backbone-Embedding"

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name_or_path, trust_remote_code=True)

Data

def process_function_prompt(task, query):
    return f'Instruct: {task}\nQuery: {query}'

task = 'Given a web search query, retrieve relevant passages that answer the query'

queries = [
    process_function_prompt(task, 'What is the representative and unique ethnic food of South Korea that have no similar counterparts in other countries?'),
    process_function_prompt(task, 'what is the approximate population of Seoul in 2024?')
]

chunks = [
    "Kimchi (/ˈkΙͺmtΚƒiː/; Korean: κΉ€μΉ˜, romanized: gimchi, IPA: [kim.tΙ•Κ°i]) is a traditional Korean banchan consisting of salted and fermented vegetables, most commonly using napa cabbage or Korean radish. A wide selection of seasonings are used, including gochugaru (Korean chili powder), spring onions, garlic, ginger, and jeotgal (salted seafood), etc. Kimchi is also used in a variety of soups and stews. Kimchi is a staple food in Korean cuisine and is eaten as a side dish with almost every Korean meal.",
    "Seoul, officially Seoul Special City, is the capital of South Korea and the country's most extensive urban center. The broader Seoul Capital Area, encompassing Gyeonggi province and Incheon metropolitan city, emerged as the world's fourth largest metropolitan economy in 2014, trailing only Tokyo, New York City, and Los Angeles, hosting more than half of South Korea's population. Although Seoul's population peaked at slightly over 10 million, it has gradually decreased since 2014, standing at approximately 9.97 million residents as of 2020. Seoul is the seat of the South Korean government."
]

texts = queries + chunks

Embedding

inputs = tokenizer(texts, max_length=4096, padding=True, truncation=True, return_tensors="pt")
embeddings = model(**inputs)

Similarity

import torch.nn.functional as F
embeddings = F.normalize(embeddings, p=2, dim=1)  # scale each embedding (row) to unit L2 norm
scores = (embeddings[:2] @ embeddings[2:].T) * 100  # 2x2 matrix of query-to-chunk similarities, scaled by 100
scores = scores.tolist()

Results

[[66.47566223144531, 52.19638442993164],
 [41.07378005981445, 79.05313110351562]]
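
Each row of the score matrix corresponds to a query and each column to a chunk, so the highest value in a row identifies the best-matching chunk. A minimal follow-up sketch using the scores list from above (the variable name best_chunk_per_query is illustrative):

import torch

# Pick the index of the best-matching chunk for each query
best_chunk_per_query = torch.tensor(scores).argmax(dim=1)  # tensor([0, 1]) for the scores shown above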

Discussion

When dealing with multiple batches of data, where each batch is normalized independently using L2 Euclidean normalization, you might encounter situations where you want to apply a "global" normalization across all batches. This is especially relevant in contexts where the data from all batches should be considered as a single dataset for training or analysis, and you aim to maintain consistency in the normalization scale across batches.

To achieve global normalization across multiple batches, follow these steps:

1. Calculate the Global Norm

First, calculate the global norm. You can either combine the data from all batches into a single tensor, or accumulate statistics batch by batch in a way that is equivalent to combining the data. The global norm is the square root of the sum of squares of all elements across all batches. This can be computationally intensive for very large datasets, so optimizations or approximations may be necessary in practice.

Formula

If you have batches $b_1, b_2, \dots, b_n$, and each batch contains vectors $\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_m$, the global norm $N_{\text{global}}$ can be computed by aggregating the squared L2 norms of all vectors in all batches and then taking the square root:

$$N_{\text{global}} = \sqrt{\sum_{i=1}^{n}\sum_{j=1}^{m}\|\mathbf{x}_j^{(i)}\|_2^2}$$
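
As a concrete illustration of this formula, here is a minimal sketch that assumes the per-batch embeddings are kept as a list of PyTorch tensors; the names batch_embeddings, sum_of_squares, and global_norm are illustrative, not part of the model's API:

import torch

# Hypothetical per-batch embedding tensors, each of shape (batch_size, hidden_dim)
batch_embeddings = [torch.randn(4, 4096), torch.randn(4, 4096), torch.randn(2, 4096)]

# Aggregate the squared L2 norms of every vector in every batch, then take the square root
sum_of_squares = sum(batch.pow(2).sum() for batch in batch_embeddings)
global_norm = torch.sqrt(sum_of_squares)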

2. Apply the Global Norm

After calculating the global norm, normalize each vector in each batch by dividing by the global norm. This ensures that the normalization is consistent across all batches, reflecting the scale of the entire dataset rather than individual batches.

Normalization

For each vector $\mathbf{x}$ in any batch:

$$\mathbf{x}_{\text{global normalized}} = \frac{\mathbf{x}}{N_{\text{global}}}$$
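
Continuing the sketch above (batch_embeddings and global_norm are the assumed names from the previous snippet), applying the global norm is a single element-wise division:

# Divide every vector in every batch by the same scalar global norm
globally_normalized = [batch / global_norm for batch in batch_embeddings]

Note that unlike F.normalize(embeddings, p=2, dim=1), which rescales every vector to unit length independently, dividing by a single scalar preserves the relative magnitudes of the vectors across batches.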

3. Considerations for Large Datasets

For very large datasets, calculating the global norm directly by combining all data might not be feasible due to memory constraints. In such cases, you can calculate the sum of squares in a piecewise manner across batches and then take the square root of the aggregate sum to find the global norm. This method allows for a more memory-efficient calculation that can be done incrementally.
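
A memory-efficient sketch of this piecewise accumulation, assuming a hypothetical generator embed_batches() that yields one embedding tensor at a time:

import torch

def global_norm_streaming(batch_iter):
    # Accumulate the sum of squares one batch at a time, then take the square root
    sum_of_squares = torch.zeros(())
    for batch in batch_iter:
        sum_of_squares += batch.pow(2).sum()
    return torch.sqrt(sum_of_squares)

# global_norm = global_norm_streaming(embed_batches())  # embed_batches() is hypothetical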

4. Implementation

When implementing this approach, it's essential to carefully manage memory and computational resources, especially for very large datasets. Efficient data structures and parallel processing can help mitigate performance issues.

This global normalization approach ensures that the scale of features is consistent across the entire dataset.
