---
license: apache-2.0
language:
- ko
base_model:
- Alibaba-NLP/gte-multilingual-reranker-base
datasets:
- sigridjineth/korean_nli_dataset_reranker_v0
tags:
- reranker
- korean
---
# Model Card: sigridjineth/ko-reranker-v1.1

- This model is fine-tuned from [Alibaba-NLP/gte-multilingual-reranker-base](https://huggingface.co/Alibaba-NLP/gte-multilingual-reranker-base). It is currently under development and may undergo further changes as we refine and improve its performance. Training took about 12 hours on 8x A100 GPUs.

## Training Data

This model is trained on 328K Korean triplets aggregated from several publicly available datasets, ensuring rich linguistic diversity:

- **kor_nli (train)**: [https://huggingface.co/datasets/kor_nli](https://huggingface.co/datasets/kor_nli)  
- **mnli_ko (train)**: [https://huggingface.co/datasets/kozistr/mnli_ko](https://huggingface.co/datasets/kozistr/mnli_ko)  
- **ko-wiki-reranking (train)**: [https://huggingface.co/datasets/upskyy/ko-wiki-reranking](https://huggingface.co/datasets/upskyy/ko-wiki-reranking)  
- **mr_tydi_korean (train)**: [https://huggingface.co/datasets/castorini/mr-tydi](https://huggingface.co/datasets/castorini/mr-tydi)  
- **klue_nli (train)**: [https://huggingface.co/datasets/klue/klue](https://huggingface.co/datasets/klue/klue)

These combined resources ensure coverage across a wide range of topics, styles, and complexities in Korean language data, enabling the model to capture nuanced semantic differences.
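
As a rough illustration of working with these sources, the sketch below loads one of the listed repositories with the Hugging Face `datasets` library and inspects its columns. This is not the exact preprocessing pipeline used for this model, and turning each source into (query, positive, negative) triplets depends on that dataset's schema; check each dataset card for column names and configurations.

```python
from datasets import load_dataset

# Load one of the source datasets listed above (train split). If a repository
# defines multiple configurations, a config name must also be passed.
wiki_rerank = load_dataset("upskyy/ko-wiki-reranking", split="train")

# Inspect the schema before mapping it into (query, positive, negative) triplets.
print(wiki_rerank.column_names)
print(wiki_rerank[0])
```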

## Key Features

- **Hard Negative Mining**:  
  Integrated [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3) to mine challenging negatives (see the mining sketch after this list). This approach sharpens the model's ability to distinguish subtle contrasts, boosting robustness and improving ranking quality.

- **Teacher-Student Distillation**:  
  Leveraged [BAAI/bge-reranker-v2.5-gemma2-lightweight](https://huggingface.co/BAAI/bge-reranker-v2.5-gemma2-lightweight) as a teacher model. The student reranker learned from teacher-provided positive/negative scores, accelerating convergence and achieving better final performance (see the distillation-loss sketch below).
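
The snippet below is a minimal sketch of dense hard-negative mining with [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3) using `sentence-transformers`: candidates are scored against the query, and the highest-scoring non-positive passages are kept as hard negatives. It illustrates the idea rather than reproducing the exact mining pipeline used for this model.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Dense encoder used for mining (bge-m3 can be loaded via sentence-transformers).
encoder = SentenceTransformer("BAAI/bge-m3")

query = "대한민국의 수도는 어디인가?"
positive = "대한민국의 수도는 서울이다."
candidates = [
    "서울은 대한민국의 수도이자 최대 도시이다.",
    "부산은 대한민국 제2의 도시이다.",
    "파이썬은 인터프리터 언어이다.",
]

q_emb = encoder.encode([query], normalize_embeddings=True)
c_emb = encoder.encode(candidates, normalize_embeddings=True)

# Embeddings are L2-normalized, so a dot product equals cosine similarity.
sims = (q_emb @ c_emb.T).ravel()

# Hard negatives: highest-scoring candidates that are not the labeled positive.
# In practice you would also filter out candidates that are actually relevant.
order = np.argsort(-sims)
hard_negatives = [candidates[i] for i in order if candidates[i] != positive][:2]
print(hard_negatives)
```

For the distillation step, a common choice is to make the student's score distribution over each query group (one positive plus its mined negatives) match the teacher's, e.g. with a KL-divergence loss. The sketch below assumes the teacher scores from bge-reranker-v2.5-gemma2-lightweight were computed offline; it is a generic formulation, not necessarily the exact loss used here.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_scores: torch.Tensor,
                      temperature: float = 1.0) -> torch.Tensor:
    """KL divergence between teacher and student score distributions.

    Both tensors have shape (batch, group_size); each row holds one query's
    scores against its positive and mined hard negatives.
    """
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_scores / temperature, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")

# Toy example: 2 queries, each scored against 1 positive + 3 hard negatives.
student = torch.randn(2, 4, requires_grad=True)
teacher = torch.tensor([[5.0, 1.0, 0.5, -1.0], [4.0, 2.0, -0.5, 0.0]])
loss = distillation_loss(student, teacher)
loss.backward()
print(loss.item())
```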

## Intended Use

- **Search & Information Retrieval**: Improve document ranking for Korean-language search queries (see the reranking sketch after this list).
- **Question Answering (QA)**: Enhance QA pipelines by reordering candidate answers for improved relevance.  
- **Content Recommendation**: Refine recommendation engines that rely on textual signals to deliver more accurate suggestions.
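
As a minimal illustration of the search/retrieval use case, the sketch below wraps the pairwise scoring from the Usage section into a hypothetical `rerank` helper that returns candidate passages sorted best-first. The helper name and structure are illustrative only and not part of this repository.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

def rerank(query, passages, model_name="sigridjineth/ko-reranker-v1.1-preview"):
    """Score (query, passage) pairs with the reranker and sort passages best-first."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name, trust_remote_code=True
    )
    model.eval()

    pairs = [[query, passage] for passage in passages]
    with torch.no_grad():
        inputs = tokenizer(pairs, padding=True, truncation=True,
                           return_tensors="pt", max_length=512)
        scores = model(**inputs, return_dict=True).logits.view(-1).float()

    return sorted(zip(passages, scores.tolist()), key=lambda x: x[1], reverse=True)

# Example: rerank retrieved candidates for a Korean query.
ranked = rerank("2024년 대한민국 대통령은?",
                ["대한민국 대통령은 윤석열이다", "quick sort로 코테 1등 먹어보자"])
print(ranked)
```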

## Limitations & Future Work

- **Preview Release**:  
  The model is still in the refinement phase. Expect future updates to improve stability, generalization, and performance.
  
- **Need for Evaluation**:  
  Developing and standardizing benchmarks for generalized Korean retrieval tasks (especially for rerankers) will be an ongoing effort.

## Evaluation
The [AutoRAG Benchmark](https://github.com/Marker-Inc-Korea/AutoRAG-example-korean-embedding-benchmark) serves as both the evaluation dataset and the toolkit for reporting these metrics.

### Model: `sigridjineth/ko-reranker-v1.1-preview`

| top_k | Execution Time | F1     | Recall | Precision | MAP    | MRR    | NDCG   | Is Best |
|-------|----------------|--------|--------|-----------|--------|--------|--------|---------|
| 1     | 0.0438         | 0.6754 | 0.6754 | 0.6754    | 0.6754 | 0.6754 | 0.6754 | True    |
| 3     | 0.0486         | 0.3684 | 0.7368 | 0.2456    | 0.7032 | 0.7032 | 0.7119 | False   |
| 5     | 0.0446         | 0.3684 | 0.7368 | 0.2456    | 0.7032 | 0.7032 | 0.7119 | False   |

---

### Model: `Alibaba-NLP/gte-multilingual-reranker-base`

| top_k | Execution Time | F1     | Recall | Precision | MAP    | MRR    | NDCG   | Is Best |
|-------|----------------|--------|--------|-----------|--------|--------|--------|---------|
| 1     | 0.0481         | 0.6316 | 0.6316 | 0.6316    | 0.6316 | 0.6316 | 0.6316 | True    |
| 3     | 0.0427         | 0.3596 | 0.7193 | 0.2398    | 0.6725 | 0.6725 | 0.6846 | False   |
| 5     | 0.0442         | 0.3596 | 0.7193 | 0.2398    | 0.6725 | 0.6725 | 0.6846 | False   |

---

### Model: `dragonkue/bge-reranker-v2-m3-ko`

| top_k | Execution Time | F1     | Recall | Precision | MAP    | MRR    | NDCG   | Is Best |
|-------|----------------|--------|--------|-----------|--------|--------|--------|---------|
| 1     | 0.0814         | 0.6930 | 0.6930 | 0.6930    | 0.6930 | 0.6930 | 0.6930 | True    |
| 3     | 0.0813         | 0.3596 | 0.7193 | 0.2398    | 0.7061 | 0.7061 | 0.7096 | False   |
| 5     | 0.0824         | 0.3596 | 0.7193 | 0.2398    | 0.7061 | 0.7061 | 0.7096 | False   |

```
Evaluation Results (k=1,3,5,10):
  Accuracy@1:  0.8070
  F1@1:        0.8070
  Recall@1:    0.8070
  Precision@1: 0.8070
  Accuracy@3:  0.9211
  F1@3:        0.4605
  Recall@3:    0.9211
  Precision@3: 0.3070
  Accuracy@5:  0.9474
  F1@5:        0.3158
  Recall@5:    0.9474
  Precision@5: 0.1895
  Accuracy@10:  0.9737
  F1@10:        0.1770
  Recall@10:    0.9737
  Precision@10: 0.0974

Total inference time (all queries): 142.64 sec
Average inference time (per query): 1.2512 sec
```

## Usage (transformers>=4.36.0)

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name_or_path = "sigridjineth/ko-reranker-v1.1-preview"

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name_or_path, 
    trust_remote_code=True,
    torch_dtype=torch.float16
)
model.eval()

pairs = [
    ["์ค‘๊ตญ์˜ ์ˆ˜๋„๋Š”","๋ฒ ์ด์ง•"], 
    ["2024๋…„ ๋Œ€ํ•œ๋ฏผ๊ตญ ๋Œ€ํ†ต๋ น์€?", "๋Œ€ํ•œ๋ฏผ๊ตญ ๋Œ€ํ†ต๋ น์€ ์œค์„์—ด์ด๋‹ค"], 
    ["ํŒŒ์ด์ฌ์—์„œ ํ€ต ์†ŒํŠธ๋ฅผ ๊ตฌํ˜„ํ•˜๊ธฐ","quick sort๋กœ ์ฝ”ํ…Œ 1๋“ฑ ๋จน์–ด๋ณด์ž"]
]

with torch.no_grad():
    inputs = tokenizer(pairs, padding=True, truncation=True, return_tensors='pt', max_length=512)
    scores = model(**inputs, return_dict=True).logits.view(-1).float()
    print(scores)
# Example output:
# tensor([1.2315, 0.5923, 0.3041])
```
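
The logits above are unbounded; if scores in [0, 1] are preferred (e.g. for thresholding), they can optionally be passed through a sigmoid. This is a generic post-processing step, not something mandated by the model:

```python
import torch

# Map the example logits above to [0, 1] relevance scores with a sigmoid.
scores = torch.tensor([1.2315, 0.5923, 0.3041])
print(torch.sigmoid(scores))
```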

## Usage with Infinity

[Infinity](https://github.com/michaelfeil/infinity) is an MIT-licensed REST API inference server that makes it easy to host and serve embedding and reranker models. For example:

```bash
docker run --gpus all -v $PWD/data:/app/.cache -p "7997":"7997" \
michaelf34/infinity:0.0.68 \
v2 --model-id Alibaba-NLP/gte-multilingual-reranker-base --revision "main" \
--dtype bfloat16 --batch-size 32 --device cuda --engine torch --port 7997
```
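
Once the container is up, candidates can be reranked over HTTP. The sketch below posts to Infinity's `/rerank` endpoint with Python `requests`; the request schema can vary between Infinity versions, so treat the field names as assumptions and confirm them against the server's OpenAPI docs at `http://localhost:7997/docs`.

```python
import requests

# Assumes the Infinity container started above is listening on port 7997.
payload = {
    "model": "Alibaba-NLP/gte-multilingual-reranker-base",  # must match --model-id
    "query": "2024년 대한민국 대통령은?",
    "documents": [
        "대한민국 대통령은 윤석열이다",
        "quick sort로 코테 1등 먹어보자",
    ],
}

response = requests.post("http://localhost:7997/rerank", json=payload, timeout=30)
response.raise_for_status()
print(response.json())  # relevance scores / ranking for each document
```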

## References

```
@misc{zhang2024mgtegeneralizedlongcontexttext,
  title={mGTE: Generalized Long-Context Text Representation and Reranking Models for Multilingual Text Retrieval}, 
  author={Xin Zhang and Yanzhao Zhang and Dingkun Long and Wen Xie and Ziqi Dai and Jialong Tang and Huan Lin and Baosong Yang and Pengjun Xie and Fei Huang and Meishan Zhang and Wenjie Li and Min Zhang},
  year={2024},
  eprint={2407.19669},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2407.19669}, 
}

@misc{li2023making,
  title={Making Large Language Models A Better Foundation For Dense Retrieval}, 
  author={Chaofan Li and Zheng Liu and Shitao Xiao and Yingxia Shao},
  year={2023},
  eprint={2312.15503},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}

@misc{chen2024bge,
  title={BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation}, 
  author={Jianlv Chen and Shitao Xiao and Peitian Zhang and Kun Luo and Defu Lian and Zheng Liu},
  year={2024},
  eprint={2402.03216},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```