sutd-bge-large-aug98

A fine-tuned version of BAAI/bge-large-en-v1.5 for job-description-to-module retrieval in the SUTD Course Recommendation Chatbot (MLOps Group 9). This is the production embedding model used in the final Hybrid pipeline.

Given a job description, this model retrieves relevant SUTD elective modules. It is the dense retrieval backbone in the RAG and Hybrid pipelines.

Model Details

| Property | Value |
|---|---|
| Base model | BAAI/bge-large-en-v1.5 |
| Embedding dimension | 1024 |
| Max sequence length | 512 |
| Similarity function | Cosine |
| Loss | MultipleNegativesRankingLoss |
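
To make the loss concrete, here is a minimal NumPy sketch of an in-batch-negatives ranking loss of the kind MultipleNegativesRankingLoss computes. It assumes cosine similarity and a scale factor of 20 (the Sentence Transformers defaults; the card does not state whether they were changed), and the function name `mnr_loss` and the toy one-hot vectors are purely illustrative, not taken from the project code.

```python
import numpy as np

def mnr_loss(anchors: np.ndarray, positives: np.ndarray, scale: float = 20.0) -> float:
    """In-batch-negatives ranking loss over cosine similarities.

    anchors[i] and positives[i] form a matching pair; every positives[j]
    with j != i is treated as a negative for anchors[i].
    """
    # L2-normalize so that the dot product equals cosine similarity.
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = scale * (a @ p.T)  # shape: (batch, batch)
    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The correct "class" for row i is column i, i.e. the diagonal.
    return float(-np.mean(np.diag(log_probs)))

# Toy one-hot vectors standing in for 1024-dimensional embeddings (hypothetical).
jobs = np.eye(4)
aligned = mnr_loss(jobs, jobs)                # each job matches its own module
shuffled = mnr_loss(jobs, jobs[::-1].copy())  # every pair is mismatched
print(aligned, shuffled)
```

The aligned batch yields a loss near zero, while the mismatched batch is heavily penalized. In this formulation the negatives come only from the other pairs in the same forward batch, so gradient accumulation would not be expected to add negatives.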

Training Data

Trained on 98 (job description, relevant SUTD module) pairs spanning four pillars (ASD, EPD, ESD, ISTD/CSD):

  • 67 hand-annotated pairs (original dataset)
  • 31 augmented pairs (sourced from MyCareersFuture postings, labelled by Claude)

After train/validation splitting and hard-negative expansion by the Sentence Transformers trainer, this produces 831 training samples and 146 validation samples.

A version trained on the smaller 67-pair dataset is available at henreads/sutd-bge-large-ft67.

Training Setup

  • Hardware: Modal A10G (24 GB VRAM)
  • Epochs: up to 10 with early stopping (patience 4); converged at epoch 3
  • Effective batch size: 16 (per-device batch 4, gradient accumulation 4)
  • Learning rate: 2e-5
  • Tracking: Weights & Biases (sutd-mlops-bge-finetune)
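
The batch-size arithmetic and the patience rule above can be sketched in plain Python. This is an illustration only: the validation scores are hypothetical, the helper `run_with_early_stopping` is not from the project code, and the card does not say which metric was monitored or whether the best checkpoint was restored.

```python
def run_with_early_stopping(val_scores, patience=4, max_epochs=10):
    """Return (best_epoch, last_epoch_run) for per-epoch validation scores.

    Higher scores are assumed to be better. Training stops once `patience`
    consecutive epochs fail to beat the best score seen so far.
    """
    best_score = float("-inf")
    best_epoch = 0
    epochs_without_improvement = 0
    last_epoch = 0
    for epoch, score in enumerate(val_scores[:max_epochs], start=1):
        last_epoch = epoch
        if score > best_score:
            best_score, best_epoch = score, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    return best_epoch, last_epoch

# Effective batch size = per-device batch * gradient accumulation steps.
per_device_batch = 4
grad_accum_steps = 4
effective_batch = per_device_batch * grad_accum_steps

# Hypothetical validation curve that peaks at epoch 3.
hypothetical_scores = [0.70, 0.74, 0.77, 0.76, 0.76, 0.75, 0.75, 0.74, 0.74, 0.73]
best_epoch, last_epoch = run_with_early_stopping(hypothetical_scores)
print(effective_batch, best_epoch, last_epoch)
```

Under these assumptions, a best epoch of 3 with patience 4 means the run would halt after epoch 7 rather than using all 10 epochs.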

Evaluation

Evaluated on a 10-job held-out retrieval set (completely separate from training). This model achieves the best retrieval NDCG@10 among all embedding configurations.

| Model | NDCG@10 |
|---|---|
| BGE-large-en-v1.5 (base) | 0.679 |
| sutd-bge-large-ft67 | 0.747 |
| sutd-bge-large-aug98 (this model) | 0.770 |
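
For readers unfamiliar with the metric, the following is a minimal sketch of NDCG@10 for a single query, assuming binary relevance (a module is either relevant to the job or not). The card does not specify the relevance grading or the evaluation code, so the function names and the module IDs below are hypothetical.

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain: rank r contributes rel / log2(r + 1)."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances[:k], start=1))

def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """NDCG@k with binary relevance for one query."""
    gains = [1.0 if doc_id in relevant_ids else 0.0 for doc_id in ranked_ids]
    # The ideal ranking places every relevant item first.
    ideal = [1.0] * min(len(relevant_ids), k)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical module IDs: two relevant modules, retrieved at ranks 1 and 4.
ranked = ["A", "B", "C", "D"]
relevant = {"A", "D"}
score = ndcg_at_k(ranked, relevant)
perfect = ndcg_at_k(["A", "D", "B", "C"], relevant)
print(round(score, 4), perfect)
```

A table value such as 0.770 would then correspond to the mean of this per-query score over the 10 held-out jobs, if the evaluation averages per query as is conventional.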

When paired with the matched henreads/sutd-reranker-aug98, the Hybrid pipeline achieves a Gemini LLM-as-judge chat quality score of 4.06 / 5.0, exceeding the project success criterion of 4.0.

Usage

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("henreads/sutd-bge-large-aug98")

job_description = "Data Scientist at GovTech. Build ML models with Python..."
module_passage = "50.007 Machine Learning - Topics: supervised learning, neural networks..."

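# normalize_embeddings=True returns unit-length vectors, so the dot product is the cosine similarity.
embeddings = model.encode([job_description, module_passage], normalize_embeddings=True)
similarity = float(embeddings[0] @ embeddings[1])
print(similarity)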
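
To rank many modules for one job description rather than score a single pair, the same dot product can be applied to a matrix of module embeddings. The sketch below uses small hand-written unit vectors in place of real embeddings so that it runs without downloading the model; in practice the matrix would presumably come from `model.encode(module_passages, normalize_embeddings=True)`. The helper `top_k` is illustrative and not part of the project code.

```python
import numpy as np

def top_k(query_vec: np.ndarray, module_matrix: np.ndarray, k: int = 3):
    """Return (indices, scores) of the k rows most similar to query_vec.

    Assumes every vector is already L2-normalized, so the dot product
    equals cosine similarity.
    """
    scores = module_matrix @ query_vec
    order = np.argsort(-scores)[:k]  # indices sorted by descending score
    return order, scores[order]

# Toy 3-dimensional unit vectors standing in for 1024-dimensional embeddings.
modules = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.6, 0.8, 0.0],
])
query = np.array([0.8, 0.6, 0.0])

idx, scores = top_k(query, modules, k=2)
print(idx, scores)
```

For a catalogue of a few hundred modules, a brute-force matrix product like this is typically fast enough that no approximate nearest-neighbour index is needed.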

Project

Part of the SUTD Course Recommendation Chatbot (MLOps Group 9).
Code: github.com/henreads/sutd-mlops-group9
