JobBERT Job Chunk Classifier
A fine-tuned version of TechWolf/JobBERT-v3 that identifies the minimal set of sentences in a job posting needed to determine the job title.
Given a short passage from a job posting, the model predicts whether it is relevant (directly identifies or describes the role being hired for) or irrelevant (responsibilities, requirements, boilerplate, company marketing, benefits, equal opportunity statements, etc.). The goal is to extract only the sentences that are sufficient to infer the job title, not to summarise the full posting.
Usage
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

model = AutoModelForSequenceClassification.from_pretrained("AP678/jobbert-job-chunk-classifier")
tokenizer = AutoTokenizer.from_pretrained("AP678/jobbert-job-chunk-classifier")
model.eval()

chunks = [
    "We are looking for a Senior Data Engineer to join our platform team.",
    "We are an equal opportunity employer and value diversity at our company.",
]

enc = tokenizer(chunks, max_length=128, padding="max_length", truncation=True, return_tensors="pt")
with torch.no_grad():
    probs = torch.softmax(model(**enc).logits, dim=-1)[:, 1]  # P(relevant)

for chunk, p in zip(chunks, probs):
    print(f"[{'RELEVANT' if p >= 0.5 else 'irrelevant'}] ({p:.2f}) {chunk}")
```
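Once the per-chunk probabilities are in hand, recovering the minimal title-bearing snippet of a posting is a simple filter over the chunks. A minimal sketch (`select_relevant` is a hypothetical helper, not part of the model repository; the probabilities below are made up — in practice they come from the model as above):

```python
def select_relevant(chunks, probs, threshold=0.5):
    """Keep only chunks whose P(relevant) clears the threshold,
    preserving their original order in the posting."""
    return [c for c, p in zip(chunks, probs) if p >= threshold]

chunks = [
    "We are looking for a Senior Data Engineer to join our platform team.",
    "We are an equal opportunity employer and value diversity at our company.",
]
probs = [0.97, 0.03]  # illustrative values, not real model output
print(select_relevant(chunks, probs))
```

Joining the surviving chunks in order gives a compact passage you can feed to a downstream title extractor.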
Model Details
| Property | Value |
|---|---|
| Base model | TechWolf/JobBERT-v3 (XLM-RoBERTa base, 12 layers, hidden=768) |
| Task | Binary sequence classification |
| Max token length | 128 |
| Labels | 0 = irrelevant, 1 = relevant |
The pretrained pooler from JobBERT-v3 is replaced by a new classification head trained from scratch. The full model (including the JobBERT encoder) is fine-tuned end-to-end.
Training Data Format
Training expects Parquet files where each row is one job posting containing a list of labelled chunks:
```json
{
  "chunks": [
    {"chunk_text": "We are looking for a Senior Data Engineer...", "label": 1},
    {"chunk_text": "We are an equal opportunity employer...", "label": 0},
    {"chunk_text": "Some ambiguous sentence...", "label": -1}
  ]
}
```
| Field | Type | Description |
|---|---|---|
| `chunk_text` | str | The sentence or short passage to classify |
| `label` | int | 1 = relevant to job title, 0 = irrelevant, -1 = skip (excluded from training) |

Chunks labelled `-1` are ignored during training. Place all Parquet files in a single directory and pass it via `--parts-dir`.
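In training code, each posting-level row can be flattened into `(chunk_text, label)` pairs, dropping the `-1` chunks. A minimal sketch of that step, assuming the rows are already loaded (e.g. via `pandas.read_parquet` over the `--parts-dir` files); `flatten_rows` is a hypothetical helper:

```python
def flatten_rows(rows):
    """Flatten posting-level rows into (chunk_text, label) pairs,
    excluding chunks labelled -1 (skip)."""
    pairs = []
    for row in rows:
        for chunk in row["chunks"]:
            if chunk["label"] != -1:
                pairs.append((chunk["chunk_text"], chunk["label"]))
    return pairs

# One posting in the format shown above:
rows = [{"chunks": [
    {"chunk_text": "We are looking for a Senior Data Engineer...", "label": 1},
    {"chunk_text": "We are an equal opportunity employer...", "label": 0},
    {"chunk_text": "Some ambiguous sentence...", "label": -1},
]}]
print(flatten_rows(rows))  # the -1 chunk is dropped
```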
Training
- Dataset: ~529k labelled chunks extracted from job postings, labelled by Qwen as relevant or irrelevant to the job title
- Split: 70% train / 15% val / 15% test (~370k / 79k / 79k chunks)
- Class balance: ~20% relevant, ~80% irrelevant, compensated with class weights [0.62, 2.52]
- Optimizer: AdamW, lr=2e-5, cosine schedule, 10% warmup, weight decay=0.01
- Loss: CrossEntropyLoss with class weights
- Hardware: Apple M-series (MPS)
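The reported class weights [0.62, 2.52] are consistent with standard inverse-frequency ("balanced") weighting over the ~80/20 split. A minimal sketch of how such weights can be derived (this derivation is an assumption; the card does not state how the weights were computed, and the small mismatch at 2.5 vs 2.52 suggests the true class fractions were not exactly 80/20):

```python
def balanced_class_weights(class_fractions):
    """Inverse-frequency weights: w_c = 1 / (num_classes * fraction_c),
    so rarer classes get proportionally larger weights."""
    k = len(class_fractions)
    return [1.0 / (k * f) for f in class_fractions]

# ~80% irrelevant (class 0), ~20% relevant (class 1)
weights = balanced_class_weights([0.8, 0.2])
print(weights)  # [0.625, 2.5], close to the reported [0.62, 2.52]
```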
Results
| Epoch | Val F1 | Test F1 | Precision | Recall |
|---|---|---|---|---|
| 1 | 0.584 | 0.588 | 0.497 | 0.720 |
| 2 | 0.636 | 0.638 | 0.558 | 0.745 |
| 3 | 0.652 | 0.654 | 0.569 | 0.769 |
| 4 | 0.677 | 0.681 | 0.627 | 0.745 |
| 5 | 0.685 | 0.688 | 0.650 | 0.720 |
Intended Use
Extracting the minimal set of sentences from a job posting that are sufficient to identify the job title. Relevant chunks are those that name or directly describe the role, not responsibilities or requirements. Typical downstream uses:
- Job title extraction and normalisation
- Deduplication of job postings by role
- Reducing noise before feeding postings to an NER or classification model
Limitations
- Trained on English job postings only
- Chunk boundaries depend on the upstream sentence splitter
- A threshold of 0.5 works well in general; you may want to tune it lower if recall matters more than precision for your use case
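Tuning the decision threshold can be done with a simple sweep over held-out probabilities, e.g. taking the lowest threshold that still meets a precision floor (which maximises recall under that constraint). A minimal sketch with toy data; the helper names and the 0.05 grid are illustrative:

```python
def precision_recall_at(labels, probs, threshold):
    """Precision and recall for the positive (relevant) class at a threshold."""
    tp = sum(1 for y, p in zip(labels, probs) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(labels, probs) if p >= threshold and y == 0)
    fn = sum(1 for y, p in zip(labels, probs) if p < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def pick_threshold(labels, probs, min_precision=0.5):
    """Lowest threshold on a 0.05 grid whose precision meets the floor;
    falls back to the default 0.5 if the floor is never reached."""
    for t in [i / 20 for i in range(1, 20)]:  # 0.05 .. 0.95
        p, _ = precision_recall_at(labels, probs, t)
        if p >= min_precision:
            return t
    return 0.5

# Toy held-out labels and model probabilities:
labels = [1, 1, 1, 0, 0, 0, 0, 0]
probs = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1, 0.05]
print(pick_threshold(labels, probs, min_precision=0.6))
```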