---
language:
- ko
base_model:
- BAAI/bge-m3
---


# Model Card for Model ID

<!-- Provide a quick summary of what the model is/does. -->

BAAI/bge-m3 fine-tuned for Korean retrieval on the jaeyong2/Ko-emb-PreView dataset.

## Model Details


## Train

- H/W : Colab A100 40GB
- Data : jaeyong2/Ko-emb-PreView (18,000 training steps)

Fine-tuning uses FlagEmbedding's BGE-M3 fine-tuning entry point; a sketch of the expected `train.jsonl` format follows the command.

```bash
!torchrun --nproc_per_node 1 \
    -m FlagEmbedding.finetune.embedder.encoder_only.m3 \
    --output_dir "/content/drive/My Drive/bge_ko.pth" \
    --model_name_or_path BAAI/bge-m3 \
    --train_data ./train.jsonl \
    --learning_rate 1e-5 \
    --bf16 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --dataloader_drop_last True \
    --temperature 0.02 \
    --query_max_len 2048 \
    --passage_max_len 512 \
    --train_group_size 2 \
    --negatives_cross_device \
    --logging_steps 10 \
    --save_steps 1000 \
    --query_instruction_for_retrieval ""
```
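
For reference, `--train_data` expects a JSONL file in the layout used by FlagEmbedding's fine-tuning examples: one JSON object per line with a `query` string and `pos`/`neg` lists. The sketch below builds such a file from this dataset; the `train` split name and the field mapping (taken from the evaluation code further down) are assumptions, not an excerpt of the actual training file.

```python
# Hypothetical sketch: build train.jsonl from jaeyong2/Ko-emb-PreView in the
# {"query": ..., "pos": [...], "neg": [...]} layout used by FlagEmbedding's
# fine-tuning examples. Field names follow the evaluation code below; the
# "train" split name is an assumption.
import json

import datasets

dataset = datasets.load_dataset("jaeyong2/Ko-emb-PreView", split="train")

with open("train.jsonl", "w", encoding="utf-8") as f:
    for item in dataset:
        record = {
            "query": item["context"],     # passage text used as the query
            "pos": [item["Title"]],       # the matching title
            "neg": [item["Fake Title"]],  # the mismatched title
        }
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```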

## Evaluation

Each test item pairs a `context` with its true `Title` and a `Fake Title`. The context is embedded as the query, and a prediction counts as correct when the true title is closer in cosine distance than the fake one.

Code:
```python
import torch
import datasets
from sklearn.metrics import pairwise_distances
from tqdm import tqdm
from FlagEmbedding import BGEM3FlagModel

# Assumption: the fine-tuned checkpoint is loaded from the training
# output_dir above; swap in your own path if it differs.
fine_tuned_model = BGEM3FlagModel("/content/drive/My Drive/bge_ko.pth", use_fp16=True)


def get_embedding(text, model):
    with torch.no_grad():
        embedding = model.encode(text)["dense_vecs"]
    return embedding


dataset = datasets.load_dataset("jaeyong2/Ko-emb-PreView")
validation_dataset = dataset["test"].select(range(1000))


def evaluate(validation_dataset):
    correct_count = 0

    for item in tqdm(validation_dataset):
        query_embedding = get_embedding(item["context"], fine_tuned_model)
        document_embedding = get_embedding(item["Title"], fine_tuned_model)
        negative_embedding = get_embedding(item["Fake Title"], fine_tuned_model)

        # Cosine distance between the query and each candidate title.
        positive_distance = pairwise_distances(query_embedding.reshape(1, -1), document_embedding.reshape(1, -1), metric="cosine")
        negative_distance = pairwise_distances(query_embedding.reshape(1, -1), negative_embedding.reshape(1, -1), metric="cosine")

        # Correct when the true title is closer than the fake one.
        if positive_distance[0][0] < negative_distance[0][0]:
            correct_count += 1

    accuracy = correct_count / len(validation_dataset)
    return accuracy


results = evaluate(validation_dataset)
print(f"Validation Results: {results}")
```

Accuracy
- Alibaba-NLP/gte-multilingual-base : 0.971
- jaeyong2/gte-multilingual-base-Ko-embedding : 0.992
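
For reference, a minimal inference sketch, assuming the fine-tuned weights load the same way as in the evaluation code; the checkpoint path and the example sentences are illustrative:

```python
# Minimal retrieval sketch; the checkpoint path and the Korean example
# sentences are illustrative, not taken from the model card.
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel("/content/drive/My Drive/bge_ko.pth", use_fp16=True)

query = "한국의 수도는 어디인가요?"  # "What is the capital of Korea?"
passages = [
    "서울은 대한민국의 수도이다.",      # "Seoul is the capital of South Korea."
    "부산은 대한민국의 항구 도시이다.",  # "Busan is a port city in South Korea."
]

query_vec = model.encode(query)["dense_vecs"]        # shape (1024,)
passage_vecs = model.encode(passages)["dense_vecs"]  # shape (2, 1024)

# Dense vectors are L2-normalized by default, so the dot product acts as
# cosine similarity and matches the cosine-distance ranking used above.
scores = passage_vecs @ query_vec
print(scores)  # higher = more relevant
```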


### License
- Alibaba-NLP/gte-multilingual-base : https://choosealicense.com/licenses/apache-2.0/