JobBERT Job Chunk Classifier

A fine-tuned version of TechWolf/JobBERT-v3 that identifies the minimal set of sentences in a job posting needed to determine the job title.

Given a short passage from a job posting, the model predicts whether it is relevant (directly identifies or describes the role being hired for) or irrelevant (responsibilities, requirements, boilerplate, company marketing, benefits, equal opportunity statements, etc.). The goal is to extract only the sentences that are sufficient to infer the job title, not to summarise the full posting.

Usage

from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

model     = AutoModelForSequenceClassification.from_pretrained("AP678/jobbert-job-chunk-classifier")
tokenizer = AutoTokenizer.from_pretrained("AP678/jobbert-job-chunk-classifier")

model.eval()

chunks = [
    "We are looking for a Senior Data Engineer to join our platform team.",
    "We are an equal opportunity employer and value diversity at our company.",
]

enc   = tokenizer(chunks, max_length=128, padding="max_length", truncation=True, return_tensors="pt")
with torch.no_grad():
    probs = torch.softmax(model(**enc).logits, dim=-1)[:, 1]  # P(relevant)

for chunk, p in zip(chunks, probs):
    print(f"[{'RELEVANT' if p >= 0.5 else 'irrelevant'}] ({p:.2f})  {chunk}")

Model Details

Base model        TechWolf/JobBERT-v3 (XLM-RoBERTa base, 12 layers, hidden size 768)
Task              Binary sequence classification
Labels            0 = irrelevant, 1 = relevant
Max token length  128
Parameters        ~0.3B (float32)

The pretrained pooler from JobBERT-v3 is replaced by a new classification head trained from scratch. The full model (including the JobBERT encoder) is fine-tuned end-to-end.
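The replacement head can be pictured as a dropout plus a fresh linear layer over the encoder's [CLS] token. The sketch below shows that shape in plain PyTorch; the exact layout inside the released checkpoint may differ, and the hidden size of 768 comes from the table above.

```python
import torch
import torch.nn as nn

class ChunkClassificationHead(nn.Module):
    """Sketch of a classification head trained from scratch in place of
    the pretrained pooler (illustrative, not the exact released module)."""
    def __init__(self, hidden_size: int = 768, num_labels: int = 2, dropout: float = 0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Classify from the first-token ([CLS]) representation.
        cls = hidden_states[:, 0]
        return self.classifier(self.dropout(cls))

head = ChunkClassificationHead()
# Fake encoder output: (batch, seq_len, hidden)
logits = head(torch.randn(4, 128, 768))
print(logits.shape)  # torch.Size([4, 2])
```

During fine-tuning, gradients flow through both this head and the JobBERT encoder, since the full model is trained end-to-end.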

Training Data Format

Training expects Parquet files where each row is one job posting containing a list of labelled chunks:

{
    "chunks": [
        {"chunk_text": "We are looking for a Senior Data Engineer...", "label": 1},
        {"chunk_text": "We are an equal opportunity employer...",       "label": 0},
        {"chunk_text": "Some ambiguous sentence...",                    "label": -1},
    ]
}
Field       Type  Description
chunk_text  str   The sentence or short passage to classify
label       int   1 = relevant to job title, 0 = irrelevant, -1 = skip (excluded from training)

Chunks labelled -1 are ignored during training. Place all Parquet files in a single directory and pass it via --parts-dir.
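Flattening this row-per-posting layout into one row per chunk and dropping the -1 chunks can be done with pandas. The frame below is a hypothetical in-memory example in the expected layout; in practice you would load every Parquet file under the `--parts-dir` directory with `pd.read_parquet` first.

```python
import pandas as pd

# One row per posting, each holding a list of labelled chunks.
df = pd.DataFrame({
    "chunks": [[
        {"chunk_text": "We are looking for a Senior Data Engineer...", "label": 1},
        {"chunk_text": "We are an equal opportunity employer...", "label": 0},
        {"chunk_text": "Some ambiguous sentence...", "label": -1},
    ]]
})

# Explode to one row per chunk, expand the dicts into columns,
# then drop the label == -1 (skip) chunks.
flat = pd.DataFrame(df.explode("chunks")["chunks"].tolist())
train_chunks = flat[flat["label"] != -1]
print(len(train_chunks))  # 2
```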

Training

  • Dataset: ~529k chunks extracted from job postings, labelled by Qwen as relevant or irrelevant to the job title
  • Split: 70% train / 15% val / 15% test (~370k / 79k / 79k chunks)
  • Class balance: ~20% relevant, ~80% irrelevant, compensated with class weights [0.62, 2.52]
  • Optimizer: AdamW, lr=2e-5, cosine schedule, 10% warmup, weight decay=0.01
  • Loss: CrossEntropyLoss with class weights
  • Hardware: Apple M-series (MPS)
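The class weights above are consistent with simple inverse-frequency weighting, w_c = 1 / (num_classes × freq_c). The sketch below derives them from the stated ~80/20 split and plugs them into a weighted CrossEntropyLoss; this is a plausible reconstruction of the setup, not the exact training script.

```python
import torch
import torch.nn as nn

# Inverse-frequency weights from the ~80% irrelevant / ~20% relevant split;
# 1/(2*0.8) = 0.625 and 1/(2*0.2) = 2.5, close to the reported [0.62, 2.52].
freqs = torch.tensor([0.80, 0.20])
weights = 1.0 / (2 * freqs)

criterion = nn.CrossEntropyLoss(weight=weights)

# Toy batch just to show the weighted loss is used like the unweighted one.
logits = torch.randn(8, 2)
labels = torch.randint(0, 2, (8,))
loss = criterion(logits, labels)
```

With these weights, mistakes on the rare relevant class cost roughly four times as much as mistakes on the irrelevant class, counteracting the imbalance.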

Results

Epoch  Val F1  Test F1  Precision  Recall
1      0.584   0.588    0.497      0.720
2      0.636   0.638    0.558      0.745
3      0.652   0.654    0.569      0.769
4      0.677   0.681    0.627      0.745
5      0.685   0.688    0.650      0.720

Intended Use

Extracting the minimal set of sentences from a job posting that are sufficient to identify the job title. Relevant chunks are those that name or directly describe the role, not responsibilities or requirements. Typical downstream uses:

  • Job title extraction and normalisation
  • Deduplication of job postings by role
  • Reducing noise before feeding postings to an NER or classification model
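For any of these uses, the final step is the same: keep only the chunks scored above a threshold. A minimal sketch, with hypothetical scores standing in for the model's P(relevant) outputs from the usage snippet:

```python
def keep_title_chunks(chunks, probs, threshold=0.5):
    """Return only the chunks the classifier marks as title-relevant."""
    return [c for c, p in zip(chunks, probs) if p >= threshold]

chunks = ["We are hiring a Staff ML Engineer.", "Great benefits and free snacks."]
probs = [0.93, 0.04]  # hypothetical model outputs
filtered = keep_title_chunks(chunks, probs)
```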

Limitations

  • Trained on English job postings only
  • Chunk boundaries depend on the upstream sentence splitter
  • A threshold of 0.5 works well in general; you may want to tune it lower if recall matters more than precision for your use case
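Threshold tuning can be as simple as sweeping candidate values on a held-out validation split and keeping the one with the best F1 (or best recall at an acceptable precision). A self-contained sketch on toy scores and labels; real tuning would use your own validation data:

```python
# Toy validation scores and gold labels (1 = relevant).
scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
labels = [1,   1,   0,   1,   0,   0]

def f1_at(threshold):
    """F1 of the relevant class when predicting score >= threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

best = max([0.2, 0.35, 0.5, 0.7], key=f1_at)
```

On this toy data the best threshold lands below 0.5, illustrating how a lower cutoff can pay off when missed relevant chunks are costlier than false positives.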