sutd-bge-large-aug98
A fine-tuned version of BAAI/bge-large-en-v1.5 for job-description-to-module retrieval in the SUTD Course Recommendation Chatbot (MLOps Group 9). This is the production embedding model used in the final Hybrid pipeline.
Given a job description, the model produces an embedding that is matched against embeddings of SUTD elective modules to retrieve the most relevant ones. It serves as the dense retrieval backbone in both the RAG and Hybrid pipelines.
Model Details
| Property | Value |
|---|---|
| Base model | BAAI/bge-large-en-v1.5 |
| Embedding dimension | 1024 |
| Max sequence length | 512 |
| Similarity function | Cosine |
| Loss | MultipleNegativesRankingLoss |
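To make the loss in the table concrete, here is a minimal pure-Python sketch of the idea behind MultipleNegativesRankingLoss: each job description is scored against every module in the batch, and the other jobs' modules act as in-batch negatives. The toy 2-dimensional vectors and the `scale=20.0` value are assumptions for illustration; this is not the project's training code, and the real loss operates on batched tensors.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mnr_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the correct match for
    anchors[i] and every other positive in the batch is a negative."""
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, p) for p in positives]
        # Numerically stable log-sum-exp over the batch.
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(
            sum(math.exp(x - max_logit) for x in logits)
        )
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)

# Toy 2-d "embeddings" (hypothetical, for illustration only).
jobs = [[1.0, 0.0], [0.0, 1.0]]
aligned_modules = [[0.9, 0.1], [0.1, 0.9]]   # each job is close to its own module
swapped_modules = [[0.1, 0.9], [0.9, 0.1]]   # each job is close to the wrong module

loss_aligned = mnr_loss(jobs, aligned_modules)
loss_swapped = mnr_loss(jobs, swapped_modules)
print(loss_aligned < loss_swapped)
```

The loss is small when each job sits closest to its own module and large when the pairing is swapped, which is the behaviour fine-tuning optimises for.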
Training Data
Trained on 98 (job description, relevant SUTD module) pairs spanning four pillars (ASD, EPD, ESD, ISTD/CSD):
- 67 hand-annotated pairs (the original dataset)
- 31 augmented pairs, sourced from MyCareersFuture postings and labelled by Claude
After the train/validation split and hard-negative expansion by the Sentence Transformers trainer, the 98 pairs yield 831 training samples and 146 validation samples.
A version trained on the smaller 67-pair dataset is available at henreads/sutd-bge-large-ft67.
Training Setup
- Hardware: Modal A10G (24 GB VRAM)
- Epochs: up to 10 with early stopping (patience 4); converged at epoch 3
- Effective batch size: 16 (per-device batch 4, gradient accumulation 4)
- Learning rate: 2e-5
- Tracking: Weights & Biases (sutd-mlops-bge-finetune)
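The batch-size arithmetic and the early-stopping rule in the list above can be sketched as follows. This is an illustration under stated assumptions, not the project's trainer configuration: a single GPU is assumed, "converged at epoch 3" is read as "the best validation score occurred at epoch 3", and the validation scores are made up.

```python
def effective_batch_size(per_device_batch, grad_accum_steps, num_devices=1):
    """Number of samples contributing to each optimizer step."""
    return per_device_batch * grad_accum_steps * num_devices

def early_stopping_run(val_scores, patience, max_epochs):
    """Return (best_epoch, last_epoch_run), both 1-indexed.
    Higher scores are better; training stops once `patience`
    consecutive epochs pass without a new best score."""
    best_score = float("-inf")
    best_epoch = 0
    last_epoch = 0
    for epoch, score in enumerate(val_scores[:max_epochs], start=1):
        last_epoch = epoch
        if score > best_score:
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            break
    return best_epoch, last_epoch

# Per-device batch 4 with 4 accumulation steps on one GPU (assumed).
batch = effective_batch_size(per_device_batch=4, grad_accum_steps=4)

# Hypothetical validation scores that peak at epoch 3.
hypothetical_scores = [0.70, 0.74, 0.77, 0.76, 0.76, 0.75, 0.75, 0.74, 0.74, 0.73]
best_epoch, last_epoch = early_stopping_run(
    hypothetical_scores, patience=4, max_epochs=10
)
print(batch, best_epoch, last_epoch)
```

Under these assumptions, a run whose best epoch is 3 would stop after epoch 7, well short of the 10-epoch cap.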
Evaluation
Evaluated on a held-out retrieval set of 10 job descriptions that is completely separate from the training data. This model achieves the highest retrieval NDCG@10 among all embedding configurations:
| Model | NDCG@10 |
|---|---|
| BGE-large-en-v1.5 (base) | 0.679 |
| sutd-bge-large-ft67 | 0.747 |
| sutd-bge-large-aug98 (this model) | 0.770 |
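For readers unfamiliar with the metric, the following is a minimal sketch of NDCG@10 for a single query. It assumes binary relevance (a module is either relevant or not) and hypothetical module identifiers; the card does not say whether the project used binary or graded labels, and the reported figures are presumably averaged over the 10 held-out jobs.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k relevance values."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )

def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """Binary-relevance NDCG@k for one query."""
    gains = [1.0 if doc_id in relevant_ids else 0.0 for doc_id in ranked_ids]
    # The ideal ranking places every relevant item first.
    ideal = [1.0] * min(len(relevant_ids), k)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical retrieval result: relevant modules land at ranks 1 and 3.
ranked = ["module_a", "module_b", "module_c", "module_d"]
relevant = {"module_a", "module_c"}

score = ndcg_at_k(ranked, relevant, k=10)
print(round(score, 3))  # about 0.92
```

A ranking that places every relevant module first scores 1.0, so the values in the table can be read as how close each model comes to that ideal ordering.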
When paired with the matched henreads/sutd-reranker-aug98, the Hybrid pipeline achieves a Gemini LLM-as-judge chat quality score of 4.06 / 5.0, exceeding the project success criterion of 4.0.
Usage
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("henreads/sutd-bge-large-aug98")
job_description = "Data Scientist at GovTech. Build ML models with Python..."
module_passage = "50.007 Machine Learning - Topics: supervised learning, neural networks..."
embeddings = model.encode([job_description, module_passage], normalize_embeddings=True)
similarity = embeddings[0] @ embeddings[1]
print(similarity)
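The snippet above scores a single pair. For retrieval, the same dot product is taken against every module and the results are sorted. The sketch below shows that ranking step on toy unit-length vectors so it runs without downloading the model; the module names and vectors are hypothetical, and in practice the vectors would come from `model.encode(..., normalize_embeddings=True)`.

```python
def top_k_modules(query_vec, module_vecs, k=3):
    """Rank modules by dot product with the query vector.
    On unit-length vectors the dot product equals cosine similarity."""
    scored = [
        (name, sum(q * m for q, m in zip(query_vec, vec)))
        for name, vec in module_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy unit vectors standing in for real 1024-dimensional embeddings.
query = [1.0, 0.0]
modules = {
    "module_ml": [0.8, 0.6],
    "module_design": [0.0, 1.0],
    "module_stats": [0.6, 0.8],
}

ranking = top_k_modules(query, modules, k=2)
for name, score in ranking:
    print(name, round(score, 2))
```

For a catalogue of this size a plain sort is enough; a vector index only becomes worthwhile at much larger scales.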
Project
Part of the SUTD Course Recommendation Chatbot (MLOps Group 9).
Code: github.com/henreads/sutd-mlops-group9