sigridjineth committed · Commit c2dce2d (verified) · 1 Parent(s): d5dfd00

Update README.md

Files changed (1):
  1. README.md +73 -36
README.md CHANGED
@@ -4,11 +4,11 @@

 ## Overview

-**sigridjineth/ko-reranker-v1.1-preview** is an advanced Korean reranker fine-tuned to excel in understanding Korean text and delivering high-quality, context-aware relevance scores. This model builds upon [Alibaba-NLP/gte-multilingual-reranker-base](https://huggingface.co/Alibaba-NLP/gte-multilingual-reranker-base) and employs sophisticated techniques such as hard negative mining and teacher-student distillation to achieve robust performance.

 ## Training Data

-We leveraged [sigridjineth/korean_nli_dataset_reranker_v0](https://huggingface.co/datasets/sigridjineth/korean_nli_dataset_reranker_v0) as the core training resource. This dataset itself is composed of multiple publicly available datasets:

 - **kor_nli (train)**: [https://huggingface.co/datasets/kor_nli](https://huggingface.co/datasets/kor_nli)
 - **mnli_ko (train)**: [https://huggingface.co/datasets/kozistr/mnli_ko](https://huggingface.co/datasets/kozistr/mnli_ko)
@@ -16,66 +16,103 @@ We leveraged [sigridjineth/korean_nli_dataset_reranker_v0](https://huggingface.c
 - **mr_tydi_korean (train)**: [https://huggingface.co/datasets/castorini/mr-tydi](https://huggingface.co/datasets/castorini/mr-tydi)
 - **klue_nli (train)**: [https://huggingface.co/datasets/klue/klue](https://huggingface.co/datasets/klue/klue)

-These combined resources ensure coverage across a wide range of topics, styles, and complexities in Korean language data, providing the model with the necessary diversity to handle various linguistic nuances.

 ## Key Features

 - **Hard Negative Mining**:
-  Utilized [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3) to extract challenging negatives. This approach enriches the training set with more difficult contrasts, enhancing model robustness and refinement.

-- **Teacher Distillation**:
-  Employed [BAAI/bge-reranker-v2.5-gemma2-lightweight](https://huggingface.co/BAAI/bge-reranker-v2.5-gemma2-lightweight) as a teacher model. The student model learned from teacher-provided signals (positive and negative scores), accelerating convergence and boosting final performance.

 ## Intended Use

-This model is well-suited for:
-
-- **Search & Information Retrieval**: Improving document ranking for Korean-language search results.
-- **Question Answering (QA)**: Enhancing QA pipelines by reordering candidate answers for better relevance.
-- **Content Recommendation**: Refining results in recommendation engines that rely on textual signals.

 ## Limitations & Future Work

 - **Preview Release**:
-  As a preview, the model may not be fully optimized. Future releases aim to improve stability and generalization.

-- **Lack of Evaluation**:
-  There is a need to develop a specific benchmark to evaluate generalized Korean retrieval tasks for rerankers.

-## References

-For more on the methodologies and theories behind multilingual rerankers and text embeddings, we encourage reviewing the following references:

 ```
 @misc{zhang2024mgtegeneralizedlongcontexttext,
-  title={mGTE: Generalized Long-Context Text Representation and Reranking Models for Multilingual Text Retrieval},
-  author={Xin Zhang and Yanzhao Zhang and Dingkun Long and Wen Xie and Ziqi Dai and Jialong Tang and Huan Lin and Baosong Yang and Pengjun Xie and Fei Huang and Meishan Zhang and Wenjie Li and Min Zhang},
-  year={2024},
-  eprint={2407.19669},
-  archivePrefix={arXiv},
-  primaryClass={cs.CL},
-  url={https://arxiv.org/abs/2407.19669},
 }

 @misc{li2023making,
-  title={Making Large Language Models A Better Foundation For Dense Retrieval},
-  author={Chaofan Li and Zheng Liu and Shitao Xiao and Yingxia Shao},
-  year={2023},
-  eprint={2312.15503},
-  archivePrefix={arXiv},
-  primaryClass={cs.CL}
 }

 @misc{chen2024bge,
-  title={BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation},
-  author={Jianlv Chen and Shitao Xiao and Peitian Zhang and Kun Luo and Defu Lian and Zheng Liu},
-  year={2024},
-  eprint={2402.03216},
-  archivePrefix={arXiv},
-  primaryClass={cs.CL}
 }
 ```

 ## Contact & Feedback

-We welcome constructive feedback, suggestions, and contributions. For improvements or inquiries, please reach out via GitHub issues or join the Hugging Face Discussions. We’re committed to continuous iteration and making **sigridjineth/ko-reranker-v1.1-preview** the go-to solution for Korean text reranking.

 ## Overview

+**sigridjineth/ko-reranker-v1.1-preview** is an advanced Korean reranker fine-tuned to excel at understanding Korean text and delivering high-quality, context-aware relevance scores. Built on top of [Alibaba-NLP/gte-multilingual-reranker-base](https://huggingface.co/Alibaba-NLP/gte-multilingual-reranker-base), it leverages techniques such as hard negative mining and teacher-student distillation for enhanced performance.

 ## Training Data

+This model is trained on [sigridjineth/korean_nli_dataset_reranker_v0](https://huggingface.co/datasets/sigridjineth/korean_nli_dataset_reranker_v0), which aggregates several publicly available datasets, ensuring rich linguistic diversity:

 - **kor_nli (train)**: [https://huggingface.co/datasets/kor_nli](https://huggingface.co/datasets/kor_nli)
 - **mnli_ko (train)**: [https://huggingface.co/datasets/kozistr/mnli_ko](https://huggingface.co/datasets/kozistr/mnli_ko)
 - **mr_tydi_korean (train)**: [https://huggingface.co/datasets/castorini/mr-tydi](https://huggingface.co/datasets/castorini/mr-tydi)
 - **klue_nli (train)**: [https://huggingface.co/datasets/klue/klue](https://huggingface.co/datasets/klue/klue)

+These combined resources ensure coverage across a wide range of topics, styles, and complexities in Korean-language data, enabling the model to capture nuanced semantic differences.

 ## Key Features

 - **Hard Negative Mining**:
+  Integrated [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3) to mine challenging negatives. This approach sharpens the model’s ability to distinguish subtle contrasts, boosting robustness and improving ranking quality.

+- **Teacher-Student Distillation**:
+  Leveraged [BAAI/bge-reranker-v2.5-gemma2-lightweight](https://huggingface.co/BAAI/bge-reranker-v2.5-gemma2-lightweight) as the teacher model. The student reranker learned from teacher-provided positive/negative scores, accelerating convergence and achieving better final performance.
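
To make the two ideas above concrete, here is a minimal, self-contained sketch, not the actual training code. Embedding vectors and teacher scores are stubbed with toy numbers; in the real pipeline the embeddings come from BAAI/bge-m3 and the scores from the bge-reranker teacher, and the helper names (`mine_hard_negatives`, `distill_loss`) are illustrative.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def mine_hard_negatives(query_vec, passage_vecs, gold_idx, k=2):
    """Hard negative mining: rank the non-gold passages by embedding
    similarity to the query and keep the top-k; the passages most
    easily confused with the answer make the hardest negatives."""
    scored = [(cosine(query_vec, v), i)
              for i, v in enumerate(passage_vecs) if i != gold_idx]
    scored.sort(reverse=True)
    return [i for _, i in scored[:k]]

def distill_loss(student_scores, teacher_scores):
    """Distillation: KL divergence between the teacher's and student's
    softmax distributions over one query's candidate passages."""
    def softmax(xs):
        m = max(xs)
        exps = [math.exp(x - m) for x in xs]
        total = sum(exps)
        return [e / total for e in exps]
    p, q = softmax(teacher_scores), softmax(student_scores)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Toy data: passage 0 is the gold answer, passage 1 is a near-duplicate
# (hard negative), passage 2 is unrelated (easy negative).
query = [1.0, 0.0]
passages = [[0.9, 0.1], [0.8, 0.3], [0.0, 1.0]]
print(mine_hard_negatives(query, passages, gold_idx=0, k=1))  # [1]
print(distill_loss([2.0, 0.5, -1.0], [3.0, 0.2, -2.0]) > 0.0)  # True
```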

 ## Intended Use

+- **Search & Information Retrieval**: Improve document ranking for Korean-language search queries.
+- **Question Answering (QA)**: Enhance QA pipelines by reordering candidate answers for improved relevance.
+- **Content Recommendation**: Refine recommendation engines that rely on textual signals to deliver more accurate suggestions.
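
In all three settings the pattern is the same: score (query, candidate) pairs and reorder the candidates by score. A sketch of that loop, with a toy word-overlap scorer standing in for the actual model call (`rerank` and `toy_scorer` are hypothetical helper names, not part of this release):

```python
from typing import Callable, List, Tuple

def rerank(query: str,
           candidates: List[str],
           score_pairs: Callable[[List[Tuple[str, str]]], List[float]],
           top_k: int = 3) -> List[str]:
    """Score (query, candidate) pairs and return the candidates
    sorted by descending relevance score."""
    scores = score_pairs([(query, c) for c in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda x: -x[1])
    return [c for c, _ in ranked[:top_k]]

# Toy scorer: counts words shared with the query. A real pipeline
# would call the reranker model here instead.
def toy_scorer(pairs):
    return [len(set(q.split()) & set(d.split())) for q, d in pairs]

docs = ["seoul is the capital of korea",
        "quick sort is a sorting algorithm",
        "korea has four seasons"]
print(rerank("what is the capital of korea", docs, toy_scorer, top_k=2))
# ['seoul is the capital of korea', 'quick sort is a sorting algorithm']
```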
 
 
 ## Limitations & Future Work

 - **Preview Release**:
+  The model is still in the refinement phase. Expect future updates to improve stability, generalization, and performance.

+- **Need for Evaluation**:
+  Developing and standardizing benchmarks for generalized Korean retrieval tasks (especially for rerankers) will be an ongoing effort.
+
+## Usage (transformers>=4.36.0)
+
+```python
+import torch
+from transformers import AutoModelForSequenceClassification, AutoTokenizer
+
+model_name_or_path = "sigridjineth/ko-reranker-v1.1-preview"
+
+tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
+model = AutoModelForSequenceClassification.from_pretrained(
+    model_name_or_path,
+    trust_remote_code=True,
+    torch_dtype=torch.float16,
+)
+model.eval()
+
+# (query, passage) pairs; a higher score means higher relevance
+pairs = [
+    ["대한민국의 수도는 어디인가요?", "서울은 대한민국의 수도이다."],
+    ["대한민국의 수도는 어디인가요?", "퀵 정렬은 대표적인 정렬 알고리즘이다."],
+]
+
+with torch.no_grad():
+    inputs = tokenizer(pairs, padding=True, truncation=True, return_tensors="pt", max_length=512)
+    scores = model(**inputs, return_dict=True).logits.view(-1).float()
+    print(scores)
+```
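
The scores the snippet prints are raw logits. If a bounded relevance score is preferred, a sigmoid maps them to [0, 1]; this is a common convention for cross-encoder rerankers, not something the model itself requires:

```python
import math

def to_probability(logit: float) -> float:
    """Squash a raw reranker logit into a [0, 1] relevance score."""
    return 1.0 / (1.0 + math.exp(-logit))

print([round(to_probability(s), 3) for s in (2.1, -0.4, 0.3)])
# [0.891, 0.401, 0.574]
```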

+## Usage with Infinity
+
+[Infinity](https://github.com/michaelfeil/infinity) is an MIT-licensed inference REST API server that makes it easy to host and serve models. For instance:
+
+```bash
+docker run --gpus all -v $PWD/data:/app/.cache -p "7997":"7997" \
+  michaelf34/infinity:0.0.68 \
+  v2 --model-id sigridjineth/ko-reranker-v1.1-preview --revision "main" \
+  --dtype bfloat16 --batch-size 32 --device cuda --engine torch --port 7997
+```
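
Once the container is running, rerank requests go over plain HTTP. A sketch of such a call (the payload shape follows Infinity's rerank endpoint; the `model` value must match whatever `--model-id` you served, and the port must match your deployment):

```shell
# Score two candidate passages against a Korean query.
curl -s http://localhost:7997/rerank \
  -H "Content-Type: application/json" \
  -d '{
        "model": "sigridjineth/ko-reranker-v1.1-preview",
        "query": "대한민국의 수도는 어디인가요?",
        "documents": ["서울은 대한민국의 수도이다.", "퀵 정렬은 대표적인 정렬 알고리즘이다."]
      }'
```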
+
+## References

 ```
 @misc{zhang2024mgtegeneralizedlongcontexttext,
+  title={mGTE: Generalized Long-Context Text Representation and Reranking Models for Multilingual Text Retrieval},
+  author={Xin Zhang and Yanzhao Zhang and Dingkun Long and Wen Xie and Ziqi Dai and Jialong Tang and Huan Lin and Baosong Yang and Pengjun Xie and Fei Huang and Meishan Zhang and Wenjie Li and Min Zhang},
+  year={2024},
+  eprint={2407.19669},
+  archivePrefix={arXiv},
+  primaryClass={cs.CL},
+  url={https://arxiv.org/abs/2407.19669},
 }

 @misc{li2023making,
+  title={Making Large Language Models A Better Foundation For Dense Retrieval},
+  author={Chaofan Li and Zheng Liu and Shitao Xiao and Yingxia Shao},
+  year={2023},
+  eprint={2312.15503},
+  archivePrefix={arXiv},
+  primaryClass={cs.CL}
 }

 @misc{chen2024bge,
+  title={BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation},
+  author={Jianlv Chen and Shitao Xiao and Peitian Zhang and Kun Luo and Defu Lian and Zheng Liu},
+  year={2024},
+  eprint={2402.03216},
+  archivePrefix={arXiv},
+  primaryClass={cs.CL}
 }
 ```

 ## Contact & Feedback

+We welcome constructive feedback, suggestions, and contributions. For improvements or inquiries, please reach out via GitHub issues or join the Hugging Face Discussions. We’re committed to continuous iteration and making **sigridjineth/ko-reranker-v1.1-preview** your go-to solution for Korean text reranking.