Hieuman committed
Commit b7071bf · verified · 1 Parent(s): 2b850ea

Update README.md

Files changed (1)
  1. README.md +142 -3
README.md CHANGED
@@ -5,6 +5,145 @@ tags:
  - pytorch_model_hub_mixin
  ---
 
- This model has been pushed to the Hub using the [PytorchModelHubMixin](https://huggingface.co/docs/huggingface_hub/package_reference/mixins#huggingface_hub.PyTorchModelHubMixin) integration:
- - Library: https://github.com/hieum98/lusifer
- - Docs: [More Information Needed]
+ # *LUSIFER: Language Universal Space Integration for Enhanced Multilingual Embeddings with Large Language Models*
+
+ [![ArXiv](https://img.shields.io/badge/ArXiv-2025-fb1b1b.svg)](https://arxiv.org/abs/2501.00874)
+ [![HF Paper](https://img.shields.io/badge/HF%20Paper-2025-b31b1b.svg)](https://huggingface.co/papers/2501.00874)
+ [![HF Link](https://img.shields.io/badge/HF%20Model-LUSIFER-FFD21E.svg)](https://huggingface.co/Hieuman/LUSIFER)
+ [![License](https://img.shields.io/badge/License-MIT-FD21E.svg)](LICENSE)
+
+ LUSIFER is a framework for bridging the gap between multilingual understanding and task-specific text embeddings without relying on explicit multilingual supervision. It does this by combining a multilingual encoder (providing a universal language foundation) with an LLM-based embedding model (optimized for embedding tasks), connected through a minimal set of trainable parameters. LUSIFER also introduces a two-stage training process, 1) Alignment Training and 2) Representation Fine-tuning, to optimize the model for zero-shot multilingual embeddings.
+
+ <p align="center">
+ <img src="https://github.com/hieum98/lusifer/blob/main/asserts/Model_overview.png" width="85%" alt="LUSIFER_figure1"/>
+ </p>
+
+ ## Installation
+ To use LUSIFER, you can optionally create the conda environment from `environment.yaml`:
+ ```bash
+ conda env create -f environment.yaml
+ ```
+
+ Then install the package from source:
+ ```bash
+ pip install -e .
+ ```
+
+ You also need to install Flash-Attention before running the code, because the model uses it as its attention implementation:
+ ```bash
+ pip install packaging
+ pip install ninja
+ pip install flash-attn --no-build-isolation
+ ```
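Before loading the model, it can help to confirm that the installation steps above took effect. The following is a minimal sketch, not part of the LUSIFER package: the helper `has_module` is a hypothetical name, and the check only assumes that PyTorch and Flash-Attention are importable as `torch` and `flash_attn`.

```python
import importlib.util


def has_module(name):
    """Return True if a top-level module `name` can be imported here."""
    return importlib.util.find_spec(name) is not None


# Report whether the two packages the README relies on are installed.
for module in ("torch", "flash_attn"):
    status = "found" if has_module(module) else "MISSING"
    print(f"{module}: {status}")
```

If `flash_attn` is reported as missing, rerunning the `pip install flash-attn --no-build-isolation` step is the first thing to try.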
+
+ ## Getting started
+ LUSIFER provides tools for training, evaluating, and using the model. The following sections give a brief overview of each.
+
+ ### Preparing the model
+ Load the LUSIFER model with the `from_pretrained` method, passing either a model name on the Hugging Face model hub or a path to local model weights. The following snippet loads the model from the hub.
+
+ ```python
+ from lusifer.models.lusifer import Lusifer
+
+ model = Lusifer.from_pretrained("Hieuman/LUSIFER")
+ ```
+
+ ### Inference
+ The model returns text embeddings for input given as a `str` or `List[str]`. It can also take an instruction alongside the sentences.
+
+ ```python
+ import torch
+ from lusifer.models.lusifer import Lusifer
+
+ model = Lusifer.from_pretrained("Hieuman/LUSIFER")
+
+ model = model.to("cuda")
+
+ # Encoding queries using instructions
+ instruction = "Given a web search query, retrieve relevant passages that answer the query:"
+ queries = [
+     "how much protein should a female eat",
+     "summit define",
+ ]
+ q_reps = model.encode(sentences=queries)
+
+ # Encoding documents. Instructions are not required for documents
+ documents = [
+     "As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day. But, as you can see from this chart, you'll need to increase that if you're expecting or training for a marathon. Check out the chart below to see how much protein you should be eating each day.",
+     "Definition of summit for English Language Learners. : 1 the highest point of a mountain : the top of a mountain. : 2 the highest level. : 3 a meeting or series of meetings between the leaders of two or more governments.",
+ ]
+ d_reps = model.encode(sentences=documents)
+
+ # Compute cosine similarity
+ q_reps_norm = torch.nn.functional.normalize(torch.from_numpy(q_reps), p=2, dim=1)
+ d_reps_norm = torch.nn.functional.normalize(torch.from_numpy(d_reps), p=2, dim=1)
+ cos_sim = torch.mm(q_reps_norm, d_reps_norm.transpose(0, 1))
+
+ print(cos_sim)
+ ```
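The similarity step in the snippet above (L2-normalize each embedding, then take dot products) can be tried without a GPU or the model weights. This is a dependency-free sketch of that step only: the toy vectors are made-up stand-ins for `q_reps` and `d_reps`, not model output, and the helper names are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine_similarity_matrix(queries, documents):
    """Return a len(queries) x len(documents) matrix of cosine similarities."""
    q_norm = [l2_normalize(q) for q in queries]
    d_norm = [l2_normalize(d) for d in documents]
    return [[sum(a * b for a, b in zip(q, d)) for d in d_norm] for q in q_norm]


# Toy stand-ins for q_reps and d_reps (hypothetical values, not model output).
toy_q_reps = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
toy_d_reps = [[3.0, 0.0, 0.0], [0.0, 1.0, 1.0]]

cos_sim = cosine_similarity_matrix(toy_q_reps, toy_d_reps)
for row in cos_sim:
    print([round(v, 4) for v in row])
```

Row `i`, column `j` of the result is the similarity between query `i` and document `j`, which matches the layout of the `torch.mm` result in the README snippet.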
+
+ ## Training
+
+ ### Alignment Training
+ To train the model in the alignment stage, run the following command:
+ ```bash
+ python -m src.main \
+     --config_file scripts/configs/aligment_training_reconstruction_and_completion.yaml \
+     --nodes 1 \
+     --devices 4
+ ```
+ This runs alignment training on 4 GPUs with both the reconstruction and completion tasks, using the configuration in `scripts/configs/aligment_training_reconstruction_and_completion.yaml`. For details, see that configuration file and the arguments in `lusifer/args.py`.
+
+ We also provide a configuration for alignment training with the reconstruction task only in `scripts/configs/alignment_training_reconstruction.yaml`. We suggest training with the reconstruction task alone first to stabilize training before adding the completion task.
+
+ ### Representation Fine-tuning
+ To train the model in the representation fine-tuning stage, run the following command:
+ ```bash
+ python -m src.main \
+     --config_file scripts/configs/representation_fintuning_retrieval_data_only.yaml \
+     --nodes 1 \
+     --devices 4
+ ```
+
+ We also provide a configuration for representation fine-tuning with both retrieval and non-retrieval data in `scripts/configs/representation_finetuning_all.yaml`. We suggest training on the retrieval data alone first to stabilize training before adding the non-retrieval data.
+
+ In short, we suggest the following training order: reconstruction task only -> reconstruction + completion tasks -> retrieval data only -> retrieval + non-retrieval data.
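The suggested four-stage order can be written down as a list of commands. This is a sketch only: it reuses the config paths and flags exactly as they appear in this README, the helper `build_command` is a hypothetical name, and it just prints the commands. How a checkpoint is handed from one stage to the next is not described here and is assumed to be set inside each config file.

```python
import shlex

# Config paths copied from this README, in the suggested training order.
STAGE_CONFIGS = [
    ("reconstruction only",
     "scripts/configs/alignment_training_reconstruction.yaml"),
    ("reconstruction + completion",
     "scripts/configs/aligment_training_reconstruction_and_completion.yaml"),
    ("retrieval data only",
     "scripts/configs/representation_fintuning_retrieval_data_only.yaml"),
    ("retrieval + non-retrieval data",
     "scripts/configs/representation_finetuning_all.yaml"),
]


def build_command(config_file, nodes=1, devices=4):
    """Build the argv list for one training stage (same flags as the README)."""
    return [
        "python", "-m", "src.main",
        "--config_file", config_file,
        "--nodes", str(nodes),
        "--devices", str(devices),
    ]


commands = [build_command(path) for _, path in STAGE_CONFIGS]
for (name, _), cmd in zip(STAGE_CONFIGS, commands):
    print(f"# {name}")
    print(shlex.join(cmd))
```

Printing rather than executing keeps the sketch safe to run anywhere; each printed line can be pasted into a shell once the previous stage has finished.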
+
+ ## Evaluation
+ We propose a new benchmark for evaluating models on multilingual text embedding. The benchmark covers 5 primary embedding tasks: Classification, Clustering, Reranking, Retrieval, and Semantic Textual Similarity (STS), across 123 diverse datasets spanning 14 languages.
+
+ <p align="center">
+ <img src="https://github.com/hieum98/lusifer/blob/main/asserts/Benchmark.png" width="85%" alt="Benchmark"/>
+ </p>
+
+ We support evaluating the model on various datasets by integrating the [`mteb`](https://github.com/embeddings-benchmark/mteb) library. To evaluate the model, run the following command:
+ ```bash
+ python -m lusifer.eval.eval \
+     --model_name_or_path Hieuman/LUSIFER \
+     --is_lusifer
+ ```
+
+ ## Results
+ The figure below reports the results of LUSIFER on the multilingual text embedding benchmark, in terms of the average main metric across all tasks and datasets.
+
+ <p align="center">
+ <img src="https://github.com/hieum98/lusifer/blob/main/asserts/Results.png" width="85%" alt="results"/>
+ </p>
+
+ ## Citation
+ If you use LUSIFER in your research, please cite the following paper:
+ ```bibtex
+ @misc{man2025lusiferlanguageuniversalspace,
+       title={LUSIFER: Language Universal Space Integration for Enhanced Multilingual Embeddings with Large Language Models},
+       author={Hieu Man and Nghia Trung Ngo and Viet Dac Lai and Ryan A. Rossi and Franck Dernoncourt and Thien Huu Nguyen},
+       year={2025},
+       eprint={2501.00874},
+       archivePrefix={arXiv},
+       primaryClass={cs.CL},
+       url={https://arxiv.org/abs/2501.00874},
+ }
+ ```
+
+ ## Bugs or questions?
+ If you have any questions about the code, feel free to open an issue on the GitHub repository or send me an email at hieum@uoregon.edu.
+