## Train

H/W : Colab A100 40GB
Data : jaeyong2/Ko-emb-PreView
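Each example in the dataset pairs a long passage (`context`) with its real `Title` (positive) and a `Fake Title` (negative). The model is fine-tuned so that, in embedding space, the context sits closer to the real title than to the fake one, via a margin-based triplet loss: `max(‖a − p‖₂ − ‖a − n‖₂ + margin, 0)`.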
```
import datasets
import torch
from torch.optim import AdamW
from torch.utils.data import DataLoader
from tqdm import tqdm
from transformers import AutoTokenizer, AutoModel
from sentence_transformers.util import batch_to_device

# TripletLoss is assumed to be the standard margin-based triplet loss;
# torch.nn.TripletMarginLoss exposes the same (anchor, positive, negative) interface.
TripletLoss = torch.nn.TripletMarginLoss

model_name = "Alibaba-NLP/gte-multilingual-base"
dataset = datasets.load_dataset("jaeyong2/Ko-emb-PreView")
train_dataloader = DataLoader(dataset["train"], batch_size=8, shuffle=True)

tokenizer = AutoTokenizer.from_pretrained(model_name)
# gte-multilingual-base uses a custom architecture, so trust_remote_code is required
model = AutoModel.from_pretrained(model_name, trust_remote_code=True).to(torch.bfloat16)
triplet_loss = TripletLoss(margin=1.0)

optimizer = AdamW(model.parameters(), lr=5e-5)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

for epoch in range(3):  # epoch loop
    model.train()
    total_loss = 0
    count = 0
    for batch in tqdm(train_dataloader):
        optimizer.zero_grad()
        loss = None
        for index in range(len(batch["context"])):
            # Tokenize anchor (context), positive (Title), and negative (Fake Title)
            anchor_encodings = tokenizer([batch["context"][index]], truncation=True, padding="max_length", max_length=4096, return_tensors="pt")
            positive_encodings = tokenizer([batch["Title"][index]], truncation=True, padding="max_length", max_length=256, return_tensors="pt")
            negative_encodings = tokenizer([batch["Fake Title"][index]], truncation=True, padding="max_length", max_length=256, return_tensors="pt")

            anchor_encodings = batch_to_device(anchor_encodings, device)
            positive_encodings = batch_to_device(positive_encodings, device)
            negative_encodings = batch_to_device(negative_encodings, device)

            # Model output (embedding vectors): take the [CLS] token vector
            anchor_output = model(**anchor_encodings)[0][:, 0, :]
            positive_output = model(**positive_encodings)[0][:, 0, :]
            negative_output = model(**negative_encodings)[0][:, 0, :]

            # Compute the triplet loss and accumulate it over the batch
            if loss is None:
                loss = triplet_loss(anchor_output, positive_output, negative_output)
            else:
                loss += triplet_loss(anchor_output, positive_output, negative_output)

        loss /= len(batch["context"])
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
        count += 1
```
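After training, the embeddings can be used for retrieval-style scoring. Below is a minimal sketch (not part of the original script) that reuses `model`, `tokenizer`, `device`, and `batch_to_device` from the loop above; the `embed` helper and the sample strings are illustrative.

```
import torch
import torch.nn.functional as F

model.eval()

def embed(texts):
    # Same pooling as in training: the [CLS] vector of the last hidden state
    encodings = tokenizer(texts, truncation=True, padding=True, max_length=4096, return_tensors="pt")
    encodings = batch_to_device(encodings, device)
    with torch.no_grad():
        vectors = model(**encodings)[0][:, 0, :]
    return F.normalize(vectors.float(), p=2, dim=1)

query = embed(["Some Korean news passage"])                         # anchor-style input
candidates = embed(["Its real headline", "An unrelated headline"])  # title-style inputs
print(query @ candidates.T)  # cosine similarities after L2 normalization
```

After fine-tuning, the true title should receive the higher similarity score.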
## Evaluation