Model Card for Pharia-1-Embedding-4608-control

This model card provides an overview of Pharia-1-Embedding-4608-control, an embedding model developed by Aleph Alpha Research. Pharia-1-Embedding-4608-control has been built on top of Pharia-1-LLM-7B-control. For additional training details, including architecture, tokenization, tokenizer fertility, pre-training, instruction fine-tuning and resource usage, we refer to the model card of Pharia-1-LLM-7B-control.

Because it was trained with a diverse set of instructions, Pharia-1-Embedding-4608-control can deliver customized embeddings at runtime without further finetuning. Pharia-1-Embedding-4608-control was trained on carefully curated data in compliance with applicable EU and national regulations, including copyright and data privacy laws. Furthermore, it shows strong cross-lingual performance, so the instruction and the text to be embedded can be written in different languages. The finetuning itself was always performed using English instructions.

Model Overview

Developed by: Aleph Alpha Research
Model type/architecture: Embedding adapter on top of Pharia-1-LLM-7B-control trained with representational instruction-tuning (inspired by the approach of GritLM).
Language(s) (NLP): Trained on English, German, French, Spanish.
USP: The model exhibits superior quality in pure cross-lingual tasks for German, English, French and Spanish pairings (see evaluation below).

Model Description

Model	Embedding Size	Description
Pharia-1-Embedding-4608-control	4608	Pharia-1-Embedding-4608-control is an Embedding model optimized for German, French and Spanish and designed for customizable embeddings at runtime via instructions (prompts)

Model Access

We provide access to our models through the channels listed below.

On-premise installation: Our customers are supplied with our full LLM and Embedding model stack, including model weights and inference runtime. Contact us for options to deploy Pharia-1-Embedding-4608-control in any cloud or on-premise environment. We provide our customers with open access to our full model checkpoint including weights and code for commercial use.
Downloadable from Huggingface: An HF-adapted version of our model can be found in our Huggingface repo (https://huggingface.co/Aleph-Alpha/Pharia-1-Embedding-4608-control-hf) together with code snippets that make the model easy to use.

Please refer to the changelog for updates to the models served. We do not deprecate officially released versions of old model generations when we release newer versions, so users can continue to have access to available models.

No prompt data is stored when using our systems, which means that we do not collect PII (personally identifiable information) for any of our public API users as detailed in our Terms & Conditions. We do not log user inputs to the models. We do not train on user data.

Note: The same models are made available to users regardless of their geographic location and the input language, subject to sanction regimes, technology export regulations, and other restrictions that may apply. The same offering is provided to all countries within and external to the European Union if no legal restrictions apply.

Intended Use

Pharia-1-Embedding-4608-control is intended to be deployed as a component of AI systems or applications. Use-cases and the model's capabilities include but are not limited to: information retrieval, semantic search, re-ranking and clustering.

Out-of-Scope Use

Pharia-1-Embedding-4608-control is not to be used for illegal or unlawful actions of any kind and with any illegal or unlawful content. This includes in particular prohibited activities such as engaging in terrorism, violence, human trafficking, illegal distribution of materials to minors, sexual solicitation, any other criminal activities, harassment, discrimination, creating or promoting malicious code or activities risking death or harm, including those related to military or nuclear applications, and activities not in compliance with sanction regimes, technology export regulations, and other restrictions that may apply. The models are to be used following ethical standards. The utilization of our technology is always governed by, and may be limited in accordance with, our Terms of Use, the Open Aleph License, or any specific agreement we might have established with you.

For non-anonymous reports of usage policy violations, we also provide an appeals mechanism via our dedicated contact address violations@aleph-alpha.com.

Customers and partners can use our ticketing system for appeals, claims and feedback.

Use limitations

Beyond the risks & limitations stated in the original Pharia-1-LLM-7B-control, the following limitation applies:

Pharia-1-Embedding-4608-control has been optimized for embedding computation only. Therefore, we do not recommend using it for text generation.

How to Use

We provide two access pathways for our Pharia-1-Embedding-4608-control model. The first one leverages the HF ecosystem and can be found here: https://huggingface.co/Aleph-Alpha/Pharia-1-Embedding-4608-control-hf. The code snippet below demonstrates its use. As soon as the model class is invoked, the model will be loaded from the repo and is ready for use. The other access pathway is through our public scaling code base. In this version the model weights were not converted to HF format, and the repo https://huggingface.co/Aleph-Alpha/Pharia-1-Embedding-4608-control can be cloned as is. The model path has to be adjusted to the local path where the model was downloaded. The model cards in the corresponding repositories contain only the code snippet that applies to the specific repo.

Use with Huggingface

from torch.nn import CosineSimilarity
from transformers import AutoConfig, AutoModel
from transformers import PreTrainedTokenizerFast
MODEL_PATH = 'Aleph-Alpha/Pharia-1-Embedding-4608-control-hf'
config = AutoConfig.from_pretrained(MODEL_PATH, trust_remote_code=True)
tokenizer = PreTrainedTokenizerFast.from_pretrained(MODEL_PATH)
model = AutoModel.from_pretrained(MODEL_PATH, 
                                  trust_remote_code=True, 
                                  config=config,
                                  tokenizer=tokenizer).cuda()
query = "Which country is Galileo from?"
query_embeddings = model.encode_queries(query, convert_to_tensor=True)
print(f"Type of embeddings: {type(query_embeddings)}")
print(f"Shape of query embeddings: {query_embeddings.shape}")
# embed the documents:
document_1 =  "Galileo is a German television program series produced and broadcast on ProSieben television network. It is also sold to broadcasters in other countries (namely Russia and Poland). The first show was broadcast in 1998, and is now stored in the Arctic World Archive in Svalbard, Norway, after being transferred to special film created by Piql."
document_embeddings_1 = model.encode_corpus(document_1, convert_to_tensor=True)
document_2 = "Galileo di Vincenzo Bonaiuti de' Galilei (15 February 1564 - 8 January 1642), commonly referred to as Galileo Galilei or mononymously as Galileo, was an Italian (Florentine) astronomer, physicist and engineer, sometimes described as a polymath. He was born in the city of Pisa, then part of the Duchy of Florence and present-day Italy."
document_embeddings_2 = model.encode_corpus(document_2, convert_to_tensor=True)
# customized embeddings steering the query:
instruction = "Represent the question about TV shows to find a paragraph that answers it."
steered_query_embeddings = model.encode_queries(
                                            query, 
                                            instruction=instruction,
                                            convert_to_tensor=True
                                        )
# compute similarity between steered query and both documents
cossim = CosineSimilarity(dim=0, eps=1e-6)
sim1 = round(cossim(document_embeddings_1, steered_query_embeddings).item(), 3)
sim2 = round(cossim(document_embeddings_2, steered_query_embeddings).item(), 3)
print("Steered embedding causes higher similarity of query to TV show:")
print(f"Similarity query/TV show ({sim1}) > similarity query/Italian polymath: ({sim2})")

Disclaimer: For the official evaluation scores we used the Scaling compatible checkpoint available under Pharia-1-Embedding-4608-control (https://huggingface.co/Aleph-Alpha/Pharia-1-Embedding-4608-control)

Example for instruction embedding

Pharia-1-Embedding-4608-control is useful for any use-case that relates to estimating the similarity/relevance between text fragments. This is relevant for use-cases such as information retrieval, semantic search, re-ranking and clustering. We use the task of information retrieval as a guiding example where we assume the following query: “Which country is Galileo from?” and two documents:

Galileo is a German television program series produced and broadcast on ProSieben television network. It is also sold to broadcasters in other countries (namely Russia and Poland). The first show was broadcast in 1998, and is now stored in the Arctic World Archive in Svalbard, Norway, after being transferred to special film created by Piql.
Galileo di Vincenzo Bonaiuti de' Galilei (15 February 1564 - 8 January 1642), commonly referred to as Galileo Galilei or mononymously as Galileo, was an Italian (Florentine) astronomer, physicist and engineer, sometimes described as a polymath. He was born in the city of Pisa, then part of the Duchy of Florence and present-day Italy.
Source: Wikipedia

For our guiding example we assume the context of this use-case is a Question-Answer system for movies and TV shows.

Step 1:

Embed the Query

"input": "Which country is Galileo from?"

→ Embedding: [-0.6780134, 0.61449033, 0.102911085, ...]

Step 2:

Embed the Documents

"input": "Galileo is a German television program series ..."

→ Embedding: [-0.36119246, 0.7793595, -0.38735497, ...]

"input": "Galileo di Vincenzo Bonaiuti de' Galilei ..."

→ Embedding: [-0.25108248, 1.0496024, -0.20945309, ...]

Step 3:

Compare the similarity

A typical similarity measure between vectors is cosine similarity. Higher numbers indicate more similar vectors and, by extension, capture the concept of relevance. In a RAG application these scores determine the ranking during the retrieval step. In this example, we obtain the following cosine similarities:

Query vs. German TV show: ~0.661
Query vs. Italian polymath: ~0.757

This implies that the paragraph about the Italian polymath would be ranked higher than the paragraph about the German TV show, even though the latter is the one we’re interested in.
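
As a minimal sketch of this comparison step, the snippet below computes cosine similarity and ranks documents by it. The three-dimensional vectors and the names doc_a and doc_b are hypothetical toy values chosen for readability; they are not outputs of the model, whose embeddings have 4608 dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors, not real model embeddings.
query = [1.0, 0.0, 1.0]
documents = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [1.0, 0.0, 1.0],
}

# Score every document against the query, then sort from most to least similar.
scores = {name: cosine_similarity(query, vec) for name, vec in documents.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

In a retrieval setting the same ranking would simply be computed over the embeddings returned by the model instead of the toy vectors.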

Customized Embeddings

To further improve performance you can use instructions to steer the model. Instructions can help the model understand nuances of your specific data and ultimately lead to embeddings that are more useful for your use-case. In this case, we aim to get embeddings that would lead to ranking the paragraph about the German TV Show higher than the paragraph about the Italian polymath.

Step 1:

Embed the Query with an Instruction

"instruction": "Represent the question about TV shows to find a paragraph that answers it."

"input": "Which country is Galileo from?"

→ Embedding: [-0.6310919, 1.4309896, -0.85546875, ...]

Step 2:

Compare the similarity

We leave the embeddings of the documents untouched and now obtain the following cosine similarities:

Query vs. German TV show: ~0.632
Query vs. Italian polymath: ~0.512

These new cosine similarities imply that the ranking has indeed changed and the paragraph about the German TV show is now more relevant. This shows that instructions can help the model understand nuances in the data better and ultimately lead to embeddings that are more useful for your use-case.
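
The effect on ranking can be made explicit with a few lines of Python. The scores are the approximate cosine similarities reported in this example; the helper name rank is only illustrative.

```python
# Approximate cosine similarities reported in this example.
plain = {"German TV show": 0.661, "Italian polymath": 0.757}
steered = {"German TV show": 0.632, "Italian polymath": 0.512}


def rank(scores):
    """Return document labels sorted from most to least similar."""
    return sorted(scores, key=scores.get, reverse=True)


print("Without instruction:", rank(plain))
print("With instruction:   ", rank(steered))
```

Only the query embedding changes between the two runs, so the document embeddings can be computed once and reused.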

Tips on using the model

First try, and ideally evaluate, the model on your data without instructions to see whether out-of-the-box performance meets your expectations.
If you decide to use an instruction with the aim of further boosting performance, we suggest using this template as a guideline:
- Template: Represent the [X] to find a [Y] that [describe how the X and Y relate]
- Examples
  1. Represent the newspaper paragraph to find a newspaper paragraph with the same topic
  2. Represent the sentence to find another sentence with the same meaning
In cases where the two texts to compare are different in nature (e.g. query and document), also called “asymmetric”, we suggest first adding an instruction to query texts only. Again, try and ideally evaluate the model in this setting. Then, if your aim is to further boost performance, we suggest that you add instructions to document texts as well, where [X] and [Y] are flipped accordingly.
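
A small helper can keep instructions consistent with the template above. This is only a sketch: build_instruction is a hypothetical name, and the wording of the flipped document-side instruction is an illustrative assumption, not a recommended prompt.

```python
def build_instruction(x, y, relation):
    """Fill the template 'Represent the [X] to find a [Y] [relation]'."""
    return f"Represent the {x} to find a {y} {relation}"


# Query-side instruction, matching the one used in the example above.
query_instruction = build_instruction(
    "question about TV shows", "paragraph", "that answers it."
)

# Document-side instruction with [X] and [Y] flipped (illustrative wording).
document_instruction = build_instruction(
    "paragraph", "question about TV shows", "that it answers."
)

print(query_instruction)
print(document_instruction)
```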

Evaluation

Evaluations on cross-lingual capabilities

There are important use cases where one wants to retrieve documents on a topic, or answers to questions, that are written in a different language than the query. This increases recall and information retrieval coverage. To test cross-lingual capabilities we evaluated Pharia-1-Embedding-4608-control, GritLM, Nvidia-Embed-v2 and BGE-Multilingual-Gemma2 on the MLQA-V1 dataset (Facebook) for the German/English and English/Spanish language pairings. For German/French we used the CLSD-WMT19 dataset, which provides correct and adversarial translations of a sentence in the corresponding pair language. To check quality over a larger range of sample sizes, we computed accuracy for varying numbers of samples taken from the MLQA-V1 dataset. For the CLSD-WMT19 evaluation we used the full set of data (2900 samples available).
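
The exact accuracy metric is not spelled out here. One plausible reading, and purely an assumption of this sketch, is top-1 retrieval accuracy: a query counts as correct when its matching item in the other language receives the highest similarity among all candidates in the pool. Under that reading a larger pool contains more distractors, which would be consistent with accuracy decreasing as the number of samples grows. All similarity values below are hypothetical.

```python
def top1_accuracy(similarity):
    """similarity[i][j] scores query i against candidate j.

    The correct candidate for query i is assumed to sit at index i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        best = max(range(len(row)), key=row.__getitem__)
        hits += int(best == i)
    return hits / len(similarity)


# Hypothetical scores: 3 queries against a pool of 3 candidates.
small_pool = [
    [0.9, 0.2, 0.1],
    [0.3, 0.8, 0.2],
    [0.1, 0.4, 0.7],
]

# Same queries with one extra distractor candidate in the pool.
large_pool = [
    [0.9, 0.2, 0.1, 0.30],
    [0.3, 0.8, 0.2, 0.85],
    [0.1, 0.4, 0.7, 0.20],
]

print(top1_accuracy(small_pool))
print(top1_accuracy(large_pool))
```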

MLQA-V1 Ger/Eng cross-lingual accuracies for the considered models

# of samples	Pharia4608	GritLM	Nvidia-Embed-v2	BGE-Gemma2
1000	86.0%	82.5%	77.0%	87.0%
2000	79.5%	73.4%	69.4%	76.8%
4000	65.3%	59.2%	56.0%	62.7%
6000	54.3%	48.6%	45.6%	52.6%
10000	38.6%	32.8%	32.8%	39.4%

MLQA-V1 Eng/Esp cross-lingual accuracies for the considered models

# of samples	Pharia4608	GritLM	Nvidia-Embed-v2	BGE-Gemma2
1000	87.5%	82.0%	81.5%	87.0%
2000	78.5%	73.9%	70.7%	77.0%
4000	65.5%	59.3%	56.9%	64.2%
6000	55.3%	49.2%	46.2%	53.4%
10000	41.7%	35.5%	33.2%	40.0%

CLSD-WMT19 Ger/Fra (2900 samples) cross-lingual evaluation for the considered models

Model Name	accuracy
Pharia-1-Embedding-4608-control	95.1%
GritLM-7B	94.2%
Nvidia-Embed-v2	93.4%
BGE-Gemma2	95.4%

Evaluations on MTEB tasks

To evaluate our model's multilingual capabilities, we compare it against other source-available, high-performing embedding models listed on the MTEB leaderboard. For the following evaluations we compare the following models:

Nvidia-Embed-v2: The highest scoring model in the MTEB leaderboard at the time of release.
BGE-Multilingual-Gemma2: The highest scoring multilingual model in the MTEB leaderboard at the time of release.
GritLM: A generative representational instruction tuned language model.

Methodology for Multilingual Evaluations (European languages)

Context: MTEB is a collection of tasks across many task types (e.g. classification, retrieval etc.). Furthermore, tasks can have N subsets for different languages. Subsets themselves can also contain N languages, e.g. in translation-related tasks.
The base script comes from gritlm/evaluation/eval_mteb.py in the ContextualAI/gritlm repository (main branch) and includes Medi2-style instructions for many MTEB tasks. The instructions are all in English. All evaluations use Medi2-style instructions except for the “no instructions” case (see the ablation below). If a task does not have Medi2-style instructions, we skip the task.
German, Italian, Spanish, Portuguese and French were used as the European languages for the MTEB tests.
For our Multilingual Evaluations (European languages) we use the tasks from mteb/scripts/task_selection/europe_tasks.csv in the embeddings-benchmark/mteb repository (main branch) and then filter for tasks where at least one subset contains at least one of the European languages.
We skip BibleNLPBitextMining and FloresBitextMining because they don’t have ‘test’ splits, only a ‘train’ split, which we don’t want to use for evaluation (training data contamination is likely).
We evaluate subsets which contain at least one of the European languages. This is why there is also an “English” language column: subsets such as En ↔︎ De are included.
The tasks that remain are
- AmazonCounterfactualClassification
- BUCC.v2
- DiaBlaBitextMining
- MassiveScenarioClassification
- NTREXBitextMining
- STS17
For NTREXBitextMining the subsets are further filtered down to pairs in which both languages are European, instead of at least one.
- This gives 20 - 2 = 18 translation pair subsets between the 5 languages; 2 are subtracted because Italian ↔︎ German doesn’t exist.
- This is done because otherwise there are 250 translation pair subsets which are not as relevant (e.g. they contain Vietnamese ↔︎ Portuguese).
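
The count of 18 subsets can be reproduced with a short enumeration. The three-letter codes follow the language columns used in the results tables; treating each translation direction as its own subset is an assumption that matches the 20 - 2 arithmetic above.

```python
from itertools import permutations

# The five European languages used in the MTEB evaluations.
european = ["deu", "fra", "ita", "por", "spa"]

# Italian <-> German is reported as unavailable, in both directions.
missing = {("ita", "deu"), ("deu", "ita")}

# Ordered pairs: each translation direction counts as one subset.
all_pairs = list(permutations(european, 2))
pairs = [p for p in all_pairs if p not in missing]

print(len(all_pairs), len(pairs))
```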

We used the official scores reported in the MTEB Leaderboard where available. For some models and subsets we computed the scores ourselves with the official Huggingface checkpoints and the instructions referenced in the paper or model card.

Europe by task

Model Name	AmazonCounterfactualClassification	BUCC.v2	DiaBlaBitextMining	MassiveScenarioClassification	NTREXBitextMining	STS17	Average
Pharia-1-Embedding-4608-control	72.49	99.19	86.51	75.58	98.24	87.67	86.61
GritLM-7B	76.64	99.43	86.45	78.93	98.46	88.07	87.99
BGE-Multilingual-Gemma2	69.72	99.38	86.90	78.57	98.58	86.69	86.64
Nvidia-Embed-v2	70.72	99.14	73.22	75.21	96.65	87.36	83.72

Europe by language

Model Name	deu-Latn	eng-Latn	fra-Latn	por-Latn	ita-Latn	spa-Latn	Average
Pharia-1-Embedding-4608-control	92.53	90.21	93.80	95.37	94.24	94.56	93.45
GritLM-7B	93.46	90.57	94.24	96.20	94.97	94.74	94.03
BGE-Multilingual-Gemma2	93.07	92.17	94.91	94.64	96.28	94.94	94.35
Nvidia-Embed-v2	91.58	88.85	90.51	93.94	95.08	93.78	92.29

MTEB – English only

	Retrieval	Classification	STS	Summarization	PairClassification	Clustering	Reranking	Average
Nvidia-Embed-v2	62.65	90.37	84.31	30.7	88.67	58.46	60.65	72.31
BGE-Multilingual-Gemma2	59.24	88.08	83.88	31.2	85.84	54.65	59.72	69.88
GritLM-7B	57.36	78.65	83.35	30.39	87.29	50.61	60.48	66.58
Pharia-1-Embedding-4608-control	39.15	74.40	82.7	30.95	81.73	46.23	57.45	58.94

Ablation for “No Instruction” case

We ablate how performance changes when not using task-specific instructions for the embeddings.

Model Name	ArguAna	AskUbuntuDupQuestions	BIOSSES	Banking77Classification	EmotionClassification	MedrxivClusteringS2S	NFCorpus	STS17	STSBenchmark	SciFact	SummEval	TwitterSemEval2015	Average
Instruction	51.09	61.71	84.56	86.37	51.77	34.29	37.82	89.56	87.08	69.7	30.95	70.97	62.99
No Instruction	50.23	60.31	84.45	86.36	50.6	31.87	37.58	88.75	86.39	71.28	31.00	68.92	62.31
Relative Δ	-1.71%	-2.32%	-0.13%	-0.01%	-2.31%	-7.59%	-0.64%	-0.91%	-0.80%	2.22%	0.16%	-2.97%	-1.09%

We observe slightly reduced performance across most tasks when not using task-specific instructions with an average loss in performance of roughly 1%.
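
The "Relative Δ" row can be reproduced from the two rows above it. The tabulated values appear consistent with normalizing the difference by the no-instruction score; that normalization is inferred from the numbers and is an assumption of this sketch, as is the function name relative_delta.

```python
def relative_delta(instruction, no_instruction):
    """Change when dropping the instruction, in percent of the no-instruction score."""
    return (no_instruction - instruction) / no_instruction * 100


# (Instruction, No Instruction) scores taken from the table above.
scores = {
    "ArguAna": (51.09, 50.23),
    "SciFact": (69.7, 71.28),
    "Average": (62.99, 62.31),
}

for task, (with_instr, without_instr) in scores.items():
    print(f"{task}: {relative_delta(with_instr, without_instr):+.2f}%")
```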

Training Details

Model architecture


Number of layers	27
Number of attention heads	36
Head size	128
Number of Key-Value heads	4
Size hidden dimension	4608
MLP expansion factor	4
MLP type	Standard
Vocabulary size	128,000
Rotary base	1,000,000
Total parameter count	7,041,544,704
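
The figures in this table can be cross-checked against each other with a few lines of arithmetic. The variable names are illustrative. Reading the smaller number of key-value heads as each key-value head being shared by a group of attention heads is an interpretation, not something stated in the table.

```python
# Values taken from the architecture table above.
num_heads = 36
head_size = 128
num_kv_heads = 4
hidden_size = 4608

# Attention heads times head size should give the hidden dimension,
# which is also the embedding size of the model.
derived_hidden_size = num_heads * head_size

# Number of attention heads per key-value head (interpretation, see lead-in).
queries_per_kv_head = num_heads // num_kv_heads

print(derived_hidden_size == hidden_size, queries_per_kv_head)
```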

Training

Pharia-1-Embedding-4608-control is an adapter on top of Pharia-1-LLM-7B-control, trained with a context window of 2048 tokens. Pharia-1-Embedding-4608-control was trained with representational instruction-tuning (inspired by the approach of GritLM) and a contrastive learning approach. The final layer is an embedding head with weighted mean pooling. The training set consisted of a blend of open-source and proprietary datasets. Further postprocessing was used to optimize for downstream use and multilinguality.
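
To illustrate what weighted mean pooling does in general, the sketch below collapses a sequence of per-token vectors into one embedding. How the weights are chosen in this model is not stated here; the linearly increasing position weights, the two-dimensional token vectors and the function name are all assumptions made for illustration.

```python
def weighted_mean_pool(token_vectors, weights):
    """Weighted average of per-token vectors, one output value per dimension."""
    total = sum(weights)
    dim = len(token_vectors[0])
    return [
        sum(w * vec[d] for vec, w in zip(token_vectors, weights)) / total
        for d in range(dim)
    ]


# Hypothetical hidden states for a 3-token input (real ones have 4608 dimensions).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Assumed weighting: later tokens count more (weights 1, 2, 3).
weights = [1, 2, 3]

pooled = weighted_mean_pool(tokens, weights)
print(pooled)
```

With uniform weights the same function reduces to plain mean pooling.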

Tokenization

This embedding model reuses the tokenizer of the Pharia-1-LLM-7B-control model.