	---
	license: mit
	tags:
	- model_hub_mixin
	- pytorch_model_hub_mixin
	---

	# LUSIFER: Language Universal Space Integration for Enhanced Multilingual Embeddings with Large Language Models

	LUSIFER is a framework for bridging the gap between multilingual understanding and task-specific text embeddings without relying on explicit multilingual supervision. It does this by combining a multilingual encoder (providing a universal language foundation) with an LLM-based embedding model (optimized for embedding tasks), connected through a minimal set of trainable parameters. LUSIFER also introduces a two-stage training process, 1) Alignment Training and 2) Representation Fine-tuning, to optimize the model for zero-shot multilingual embeddings.

	## Installation
	To use LUSIFER, you can optionally create the conda environment from `environment.yaml`:
	```bash
	conda env create -f environment.yaml
	```

	After that, install the package from source:
	```bash
	git clone https://github.com/hieum98/lusifer.git
	cd lusifer
	pip install -e .
	```

	You also need to install Flash-Attention before running the code, because the model uses it as its attention implementation. Install it with the following commands:
	```bash
	pip install packaging
	pip install ninja
	pip install flash-attn --no-build-isolation
	```
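Before loading the model, it can help to confirm that the packages installed above are importable. The following is a minimal sketch and not part of the LUSIFER codebase: the helper name `is_installed` is hypothetical, and the check only looks up the top-level modules `torch`, `flash_attn` (the import name of the `flash-attn` package), and `lusifer` without importing them.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # Module names checked here are the ones this README relies on.
    for name in ("torch", "flash_attn", "lusifer"):
        status = "found" if is_installed(name) else "MISSING"
        print(f"{name}: {status}")
```

A module reported as found can still fail at import time (for example, a Flash-Attention build that does not match the installed CUDA or PyTorch version), so this is only a first check.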

	## Getting started
	LUSIFER provides a thorough set of tools for training, evaluating, and using the model. The following sections provide a brief overview of how to use the model for training, evaluation, and inference.

	### Preparing the model
	The LUSIFER model can be loaded with the `from_pretrained` method, given either a model name on the Hugging Face model hub or a path to the model weights. The following snippet loads the model from the Hugging Face model hub.

	```python
	from lusifer.models.lusifer import Lusifer

	model = Lusifer.from_pretrained("Hieuman/LUSIFER")
	```

	### Inference
	The model returns text embeddings for input given as a `str` or `List[str]`. It can also receive an instruction alongside the sentences.

	```python
	import torch
	from lusifer.models.lusifer import Lusifer

	model = Lusifer.from_pretrained("Hieuman/LUSIFER")

	model = model.to("cuda")

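	# Encoding queries. The instruction below is defined but not passed to `encode` in this
	# snippet; see the `encode` method in the repository for how to supply it.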
	instruction = "Given a web search query, retrieve relevant passages that answer the query:"
	queries = [
	"how much protein should a female eat",
	"summit define",
	]
	q_reps = model.encode(sentences=queries)

	# Encoding documents. Instructions are not required for documents
	documents = [
	"As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day. But, as you can see from this chart, you'll need to increase that if you're expecting or training for a marathon. Check out the chart below to see how much protein you should be eating each day.",
	"Definition of summit for English Language Learners. : 1 the highest point of a mountain : the top of a mountain. : 2 the highest level. : 3 a meeting or series of meetings between the leaders of two or more governments.",
	]
	d_reps = model.encode(sentences=documents)

	# Compute cosine similarity
	q_reps_norm = torch.nn.functional.normalize(torch.from_numpy(q_reps), p=2, dim=1)
	d_reps_norm = torch.nn.functional.normalize(torch.from_numpy(d_reps), p=2, dim=1)
	cos_sim = torch.mm(q_reps_norm, d_reps_norm.transpose(0, 1))

	print(cos_sim)
	```
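The similarity matrix above can be turned into a per-query ranking of documents. The sketch below is illustrative only and uses the standard library, so it runs without a GPU or the model: `cosine_similarity` and `rank_documents` are hypothetical helper names, and the toy vectors stand in for real embeddings. Because it only iterates over rows and values, it should also accept rows of the arrays returned by `model.encode`, assuming those are array-like sequences of floats.

```python
import math
from typing import List, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_documents(
    query_vec: Sequence[float],
    doc_vecs: Sequence[Sequence[float]],
    top_k: Optional[int] = None,
) -> List[Tuple[int, float]]:
    """Return (document index, score) pairs sorted by descending similarity."""
    scored = [(i, cosine_similarity(query_vec, d)) for i, d in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored if top_k is None else scored[:top_k]


# Toy 2-dimensional vectors standing in for real embeddings.
toy_query = [1.0, 0.0]
toy_docs = [[0.0, 1.0], [2.0, 0.0], [1.0, 1.0]]
print(rank_documents(toy_query, toy_docs, top_k=2))
```

For large document collections, the batched `torch.mm` approach shown above is the better fit; the pure-Python version is meant only to make the ranking step explicit.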

	## Training

	### Alignment Training
	To train the model in the alignment stage, run the following command:
	```bash
	python -m src.main \
	--config_file scripts/configs/aligment_training_reconstruction_and_completion.yaml \
	--nodes 1 \
	--devices 4
	```
	This runs alignment training on 4 GPUs with both the reconstruction and completion tasks, using the configuration in `scripts/configs/aligment_training_reconstruction_and_completion.yaml`. For more details, refer to that configuration file and the arguments in `lusifer/args.py`.

	We also provide a configuration file for alignment training with the reconstruction task only: `scripts/configs/alignment_training_reconstruction.yaml`. We suggest training with the reconstruction task alone first to stabilize training before adding the completion task.

	### Representation Fine-tuning
	To train the model in the representation fine-tuning stage, run the following command:
	```bash
	python -m src.main \
	--config_file scripts/configs/representation_fintuning_retrieval_data_only.yaml \
	--nodes 1 \
	--devices 4
	```

	We also provide a configuration file for representation fine-tuning with both retrieval and non-retrieval data: `scripts/configs/representation_finetuning_all.yaml`. We suggest training on the retrieval data alone first to stabilize training before adding the non-retrieval data.

	In short, we suggest the following training order: reconstruction task only -> reconstruction + completion tasks -> retrieval data only -> retrieval + non-retrieval data.
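The suggested order can be written down as four launch commands. The sketch below only builds and prints them; it does not start training. It is an assumption-laden illustration: `STAGE_CONFIGS` and `build_stage_commands` are hypothetical names, the config paths are copied exactly as written in this README, and how each stage picks up the previous stage's checkpoint is not described here, so that is presumably set inside the configuration files.

```python
import shlex
from typing import List

# Config paths as written in this README, in the suggested training order.
STAGE_CONFIGS = [
    "scripts/configs/alignment_training_reconstruction.yaml",
    "scripts/configs/aligment_training_reconstruction_and_completion.yaml",
    "scripts/configs/representation_fintuning_retrieval_data_only.yaml",
    "scripts/configs/representation_finetuning_all.yaml",
]


def build_stage_commands(nodes: int = 1, devices: int = 4) -> List[List[str]]:
    """Build one `python -m src.main` argument list per training stage."""
    return [
        [
            "python", "-m", "src.main",
            "--config_file", config,
            "--nodes", str(nodes),
            "--devices", str(devices),
        ]
        for config in STAGE_CONFIGS
    ]


# Print the commands so they can be reviewed or pasted into a shell.
for step, command in enumerate(build_stage_commands(), start=1):
    print(f"# Stage {step}")
    print(shlex.join(command))
```

Each printed command matches the form used in the sections above; running them one after another (for example with `subprocess.run(command, check=True)`) would stop at the first failing stage.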

	## Evaluation
	We propose a new benchmark for evaluating models on multilingual text embedding tasks. The benchmark includes 5 primary embedding tasks: Classification, Clustering, Reranking, Retrieval, and Semantic Textual Similarity (STS), across 123 diverse datasets spanning 14 languages.

	Evaluation on various datasets is supported through integration with the [`mteb`](https://github.com/embeddings-benchmark/mteb) library. To evaluate the model, run the following command:
	```bash
	python -m lusifer.eval.eval \
	--model_name_or_path Hieuman/LUSIFER \
	--is_lusifer
	```

	## Results
	We provide the results of LUSIFER on the multilingual text embedding benchmark in the following table. The results are reported as the average main metric across all tasks and datasets. Please refer to the paper for the full results.

	## Citation
	If you use LUSIFER in your research, please cite the following paper:
	```bibtex
	@misc{man2025lusiferlanguageuniversalspace,
	title={LUSIFER: Language Universal Space Integration for Enhanced Multilingual Embeddings with Large Language Models},
	author={Hieu Man and Nghia Trung Ngo and Viet Dac Lai and Ryan A. Rossi and Franck Dernoncourt and Thien Huu Nguyen},
	year={2025},
	eprint={2501.00874},
	archivePrefix={arXiv},
	primaryClass={cs.CL},
	url={https://arxiv.org/abs/2501.00874},
	}
	```

	## Bugs or questions?
	If you have any questions about the code, feel free to open an issue on the GitHub repository or send me an email at hieum@uoregon.edu.