JobBERT Job Chunk Classifier
A fine-tuned version of TechWolf/JobBERT-v3 that identifies the minimal set of sentences in a job posting needed to determine the job title.
Given a short passage from a job posting, the model predicts whether it is relevant (directly identifies or describes the role being hired for) or irrelevant (responsibilities, requirements, boilerplate, company marketing, benefits, equal opportunity statements, etc.). The goal is to extract only the sentences that are sufficient to infer the job title, not to summarise the full posting.
Usage
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

model = AutoModelForSequenceClassification.from_pretrained("AP678/jobbert-job-chunk-classifier")
tokenizer = AutoTokenizer.from_pretrained("AP678/jobbert-job-chunk-classifier")
model.eval()

chunks = [
    "We are looking for a Senior Data Engineer to join our platform team.",
    "We are an equal opportunity employer and value diversity at our company.",
]

enc = tokenizer(chunks, max_length=128, padding="max_length", truncation=True, return_tensors="pt")
with torch.no_grad():
    probs = torch.softmax(model(**enc).logits, dim=-1)[:, 1]  # P(relevant)

for chunk, p in zip(chunks, probs):
    print(f"[{'RELEVANT' if p >= 0.5 else 'irrelevant'}] ({p:.2f}) {chunk}")
```
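Once the per-chunk probabilities are in hand, recovering the minimal title-bearing snippet of a posting is a simple filter over the chunks. A minimal sketch (`select_relevant` is a hypothetical helper, not part of the model repository; the probabilities below are made up — in practice they come from the model as above):

```python
def select_relevant(chunks, probs, threshold=0.5):
    """Keep only chunks whose P(relevant) clears the threshold,
    preserving their original order in the posting."""
    return [c for c, p in zip(chunks, probs) if p >= threshold]

chunks = [
    "We are looking for a Senior Data Engineer to join our platform team.",
    "We are an equal opportunity employer and value diversity at our company.",
]
probs = [0.97, 0.03]  # illustrative values, not real model output
print(select_relevant(chunks, probs))
```

Joining the surviving chunks in order gives a compact passage you can feed to a downstream title extractor.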
Model Details
| Property | Value |
|---|---|
| Base model | TechWolf/JobBERT-v3 (XLM-RoBERTa base, 12 layers, hidden=768) |
| Task | Binary sequence classification |
| Max token length | 128 |
| Labels | 0 = irrelevant, 1 = relevant |
The pretrained pooler from JobBERT-v3 is replaced by a new classification head trained from scratch. The full model (including the JobBERT encoder) is fine-tuned end-to-end.
Training Data Format
Training expects Parquet files where each row is one job posting containing a list of labelled chunks:
```json
{
  "chunks": [
    {"chunk_text": "We are looking for a Senior Data Engineer...", "label": 1},
    {"chunk_text": "We are an equal opportunity employer...", "label": 0},
    {"chunk_text": "Some ambiguous sentence...", "label": -1}
  ]
}
```
| Field | Type | Description |
|---|---|---|
| `chunk_text` | str | The sentence or short passage to classify |
| `label` | int | 1 = relevant to job title, 0 = irrelevant, -1 = skip (excluded from training) |

Chunks labelled `-1` are ignored during training. Place all Parquet files in a single directory and pass it via `--parts-dir`.
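In training code, each posting-level row can be flattened into `(chunk_text, label)` pairs, dropping the `-1` chunks. A minimal sketch of that step, assuming the rows are already loaded (e.g. via `pandas.read_parquet` over the `--parts-dir` files); `flatten_rows` is a hypothetical helper:

```python
def flatten_rows(rows):
    """Flatten posting-level rows into (chunk_text, label) pairs,
    excluding chunks labelled -1 (skip)."""
    pairs = []
    for row in rows:
        for chunk in row["chunks"]:
            if chunk["label"] != -1:
                pairs.append((chunk["chunk_text"], chunk["label"]))
    return pairs

# One posting in the format shown above:
rows = [{"chunks": [
    {"chunk_text": "We are looking for a Senior Data Engineer...", "label": 1},
    {"chunk_text": "We are an equal opportunity employer...", "label": 0},
    {"chunk_text": "Some ambiguous sentence...", "label": -1},
]}]
print(flatten_rows(rows))  # the -1 chunk is dropped
```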
Training
- Dataset: ~529k labelled chunks extracted from job postings, labelled by Qwen as relevant or irrelevant to the job title
- Split: 70% train / 15% val / 15% test (~370k / 79k / 79k chunks)
- Class balance: ~20% relevant, ~80% irrelevant, compensated with class weights [0.62, 2.52]
- Optimizer: AdamW, lr=2e-5, cosine schedule, 10% warmup, weight decay=0.01
- Loss: CrossEntropyLoss with class weights
- Hardware: Apple M-series (MPS)
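The reported class weights [0.62, 2.52] are consistent with standard inverse-frequency ("balanced") weighting over the ~80/20 split. A minimal sketch of how such weights can be derived (this derivation is an assumption; the card does not state how the weights were computed, and the small mismatch at 2.5 vs 2.52 suggests the true class fractions were not exactly 80/20):

```python
def balanced_class_weights(class_fractions):
    """Inverse-frequency weights: w_c = 1 / (num_classes * fraction_c),
    so rarer classes get proportionally larger weights."""
    k = len(class_fractions)
    return [1.0 / (k * f) for f in class_fractions]

# ~80% irrelevant (class 0), ~20% relevant (class 1)
weights = balanced_class_weights([0.8, 0.2])
print(weights)  # [0.625, 2.5], close to the reported [0.62, 2.52]
```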
Results
| Epoch | Val F1 | Test F1 | Precision | Recall |
|---|---|---|---|---|
| 1 | 0.584 | 0.588 | 0.497 | 0.720 |
| 2 | 0.636 | 0.638 | 0.558 | 0.745 |
| 3 | 0.652 | 0.654 | 0.569 | 0.769 |
| 4 | 0.677 | 0.681 | 0.627 | 0.745 |
| 5 | 0.685 | 0.688 | 0.650 | 0.720 |
Intended Use
Extracting the minimal set of sentences from a job posting that are sufficient to identify the job title. Relevant chunks are those that name or directly describe the role, not responsibilities or requirements. Typical downstream uses:
- Job title extraction and normalisation
- Deduplication of job postings by role
- Reducing noise before feeding postings to an NER or classification model
Limitations
- Trained on English job postings only
- Chunk boundaries depend on the upstream sentence splitter
- A threshold of 0.5 works well in general; you may want to tune it lower if recall matters more than precision for your use case
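Tuning the decision threshold can be done with a simple sweep over held-out probabilities, e.g. taking the lowest threshold that still meets a precision floor (which maximises recall under that constraint). A minimal sketch with toy data; the helper names and the 0.05 grid are illustrative:

```python
def precision_recall_at(labels, probs, threshold):
    """Precision and recall for the positive (relevant) class at a threshold."""
    tp = sum(1 for y, p in zip(labels, probs) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(labels, probs) if p >= threshold and y == 0)
    fn = sum(1 for y, p in zip(labels, probs) if p < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def pick_threshold(labels, probs, min_precision=0.5):
    """Lowest threshold on a 0.05 grid whose precision meets the floor;
    falls back to the default 0.5 if the floor is never reached."""
    for t in [i / 20 for i in range(1, 20)]:  # 0.05 .. 0.95
        p, _ = precision_recall_at(labels, probs, t)
        if p >= min_precision:
            return t
    return 0.5

# Toy held-out labels and model probabilities:
labels = [1, 1, 1, 0, 0, 0, 0, 0]
probs = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1, 0.05]
print(pick_threshold(labels, probs, min_precision=0.6))
```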