Resume Section Classifier v1 (DistilBERT)

A DistilBERT-based text classification model that categorizes raw, messy resume text into 15 distinct sections, with 99.97% accuracy on its validation set.

This model builds on the concepts of gr8monk3ys/resume-section-classifier, but expands the label set, adds regional (African and European) as well as global context, and provides publicly hosted weights for immediate deployment.

It is tuned for the output of OCR and PDF extraction tools such as the Unstructured API.

🚀 Model Details

Base Architecture: distilbert-base-uncased
Task: Text Classification (Sequence Classification)
Language: English
Training Data: ~58,000 rows mixing organic and synthetically generated resume lines.
Accuracy: 99.97% (validation set, Epoch 4 checkpoint)

🎯 Intended Use & Key Features

This model is designed to act as the "routing engine" in resume parsing pipelines. When a PDF parser extracts unstructured blocks of text, this model categorizes those blocks so they can be routed to strict data schemas (for example, ones defined with Pydantic or Zod) for targeted entity extraction.
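The routing step described above can be sketched as follows. This is a minimal illustration, not the author's pipeline: the extractor functions, the `ROUTES` table, and `route_chunks` are hypothetical names, and the plain dictionaries stand in for whatever Pydantic or Zod schemas a real pipeline would use.

```python
from typing import Callable, Dict, Iterable, List

# Hypothetical extractor type: takes a chunk of text, returns extracted fields.
Extractor = Callable[[str], dict]


def extract_contact(text: str) -> dict:
    # Placeholder: a real pipeline would validate against a strict schema here.
    return {"section": "contact", "raw": text}


def extract_education(text: str) -> dict:
    return {"section": "education", "raw": text}


def extract_generic(text: str) -> dict:
    # Fallback for labels that have no dedicated extractor yet.
    return {"section": "unrouted", "raw": text}


# Label -> extractor table (only two of the 15 labels are wired up in this sketch).
ROUTES: Dict[str, Extractor] = {
    "contact": extract_contact,
    "education": extract_education,
}


def route_chunks(chunks: Iterable[str], classify: Callable[[str], str]) -> List[dict]:
    """Classify each chunk and hand it to the extractor registered for its label."""
    results = []
    for chunk in chunks:
        label = classify(chunk)
        extractor = ROUTES.get(label, extract_generic)
        results.append(extractor(chunk))
    return results


# With the transformers pipeline loaded as `classifier`, `classify` could be:
#   classify = lambda text: classifier(text)[0]["label"]
```

Passing the classifier in as a plain callable keeps the routing logic independent of the model, so it can be exercised with a stub before the real weights are loaded.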

Key Features:

Unstructured API Resilience: Trained heavily on "engineered noise" (random bullet points, markdown artifacts, pipe separators, missing newlines, and concatenated dates), so it keys on the semantic content of the text rather than on formatting artifacts.
Global & Regional Coverage: Recognizes standard global formats (BSc, AWS, PMP, GPA) as well as region-specific Nigerian formats (ND, HND, PGD, NYSC, SIWES, ICAN, CIBN).
Multi-line Block Handling: Classifies dense, multi-line blocks of text. For example, a 4-line project description block is labeled as projects rather than being confused with experience.

🏷️ Supported Labels (15)

The model predicts one of the following 15 classes:

Label	Description / Examples
`contact`	Emails, phone numbers, addresses, LinkedIn/GitHub URLs.
`summary`	Professional summaries, profiles, or executive overviews.
`objective`	Career objectives and personal statements.
`experience`	Work history, NYSC, SIWES, internships.
`education`	Degrees (BSc, HND, PhD), institutions, and grades.
`skills`	Technical skills, soft skills, programming languages.
`certifications`	Professional certs (AWS, ICAN, PMP), including "In View" status.
`projects`	Personal or professional projects and open-source contributions.
`awards`	Honors, scholarships, and Dean's Lists.
`hobbies`	Interests, passions, and extracurricular activities.
`languages`	Spoken languages and proficiency levels (e.g., Fluent, B2).
`volunteer`	Community service and pro-bono work.
`publications`	Research papers, articles, and academic journals.
`references`	Referees or "References available upon request" statements.
`additional_info`	Relocation willingness, visa status, notice periods.
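When the predictions feed a strict schema downstream, it can help to validate each label against the table above before routing. The sketch below is one way to do that; `SECTION_LABELS`, `Prediction`, and `parse_prediction` are hypothetical names, and the input shape assumes the `{'label': ..., 'score': ...}` dictionaries shown in the usage examples.

```python
from typing import NamedTuple

# The 15 labels from the table above.
SECTION_LABELS = frozenset({
    "contact", "summary", "objective", "experience", "education",
    "skills", "certifications", "projects", "awards", "hobbies",
    "languages", "volunteer", "publications", "references", "additional_info",
})


class Prediction(NamedTuple):
    label: str
    score: float


def parse_prediction(raw: dict) -> Prediction:
    """Validate one {'label': ..., 'score': ...} dict from the pipeline output."""
    label = raw["label"]
    if label not in SECTION_LABELS:
        # Guards against a mismatched checkpoint or a generic LABEL_n mapping.
        raise ValueError(f"unexpected label: {label!r}")
    return Prediction(label=label, score=float(raw["score"]))
```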

💻 How to Use

Load the model with the pipeline API from the Hugging Face transformers library.

from transformers import pipeline

# Load the classifier
classifier = pipeline("text-classification", model="amosify/resume-section-classifier-v1")

# Example 1: Messy Unstructured API chunk
chunk_1 = "• Professional Development\nGoogle Data Analytics Professional Certificate - 2023"
print(classifier(chunk_1))
# Output: [{'label': 'certifications', 'score': 0.9998}]

# Example 2: Multi-line project description
chunk_2 = "Interactive Search Engine (C#, Java, PHP)\n* Attracted 100+ GitHub stars\n* Deployed to Heroku with Docker"
print(classifier(chunk_2))
# Output: [{'label': 'projects', 'score': 0.9997}]

# Example 3: Regional Education
chunk_3 = "Higher National Diploma (HND) in Computer Science, Yaba College of Technology (Upper Credit)"
print(classifier(chunk_3))
# Output: [{'label': 'education', 'score': 0.9999}]
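For a whole resume, the chunks can be classified in one call instead of one at a time. The helper below is a sketch under a few assumptions: `classify_chunks` is a hypothetical name, the pipeline is expected to return one `{'label', 'score'}` dictionary per input (its default behavior for a list of strings), and `truncation` and `max_length` are forwarded to the tokenizer, with 256 mirroring the training max_length noted under Limitations.

```python
from typing import Callable, List, Sequence


def classify_chunks(classifier: Callable, chunks: Sequence[str], batch_size: int = 32) -> List[str]:
    """Classify many chunks in one call and return the labels in input order."""
    outputs = classifier(
        list(chunks),
        batch_size=batch_size,  # how many chunks go through the model at once
        truncation=True,        # cut over-long chunks instead of failing on them
        max_length=256,         # assumption: match the training max_length
    )
    return [out["label"] for out in outputs]


# Usage with the real model (requires transformers and a model download):
#   from transformers import pipeline
#   classifier = pipeline("text-classification", model="amosify/resume-section-classifier-v1")
#   labels = classify_chunks(classifier, ["Python, SQL, Docker", "BSc Computer Science"])
```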

📊 Training Procedure & Metrics

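The model was fine-tuned for 5 epochs on Kaggle using two NVIDIA T4 GPUs. Training used the load_best_model_at_end option (a standard Hugging Face TrainingArguments setting), so the final weights come from the checkpoint with the best validation loss rather than from the last epoch, which limits the effect of late-epoch overfitting.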

Training Hyperparameters

learning_rate: 2e-05
train_batch_size: 64
eval_batch_size: 64
optimizer: AdamW (betas=(0.9,0.999), epsilon=1e-08)
lr_scheduler_type: linear
num_epochs: 5
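For readers who want to reproduce a similar run, the hyperparameters above map onto Hugging Face `TrainingArguments` fields roughly as follows. This is a reconstruction, not the author's training script: the output path is hypothetical, the batch sizes are assumed to be per-device values, and the per-epoch evaluation and save strategy is inferred from the per-epoch results rather than stated.

```python
# Keyword arguments mirroring the hyperparameters listed above.
training_kwargs = {
    "output_dir": "resume-section-classifier",  # hypothetical path
    "learning_rate": 2e-5,
    "per_device_train_batch_size": 64,  # assumption: 64 is the per-device value
    "per_device_eval_batch_size": 64,   # assumption: 64 is the per-device value
    "num_train_epochs": 5,
    "lr_scheduler_type": "linear",
    "adam_beta1": 0.9,
    "adam_beta2": 0.999,
    "adam_epsilon": 1e-8,
    "eval_strategy": "epoch",  # named evaluation_strategy in older transformers releases
    "save_strategy": "epoch",  # must match eval_strategy for best-model loading
    "load_best_model_at_end": True,
    "metric_for_best_model": "eval_loss",  # assumption: selection by validation loss
    "greater_is_better": False,            # lower loss is better
}

# With transformers installed, these would be passed as:
#   from transformers import TrainingArguments
#   args = TrainingArguments(**training_kwargs)
```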

Training Results

Epoch	Step	Training Loss	Validation Loss	Accuracy
1.0	1004	0.0093	0.0054	0.9996
2.0	2008	0.0024	0.0042	0.9996
3.0	3012	0.0008	0.0032	0.9997
4.0	4016	0.0004	0.0027	0.9997
5.0	5020	0.0003	0.0030	0.9997

(Note: The Epoch 4 checkpoint was kept as the final model because it had the lowest validation loss, 0.0027.)
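The checkpoint selection can be checked directly against the table: picking the epoch with the minimum validation loss gives Epoch 4. The snippet only replays the numbers reported above; the variable names are illustrative.

```python
# Validation loss per epoch, copied from the Training Results table.
validation_loss_by_epoch = {1: 0.0054, 2: 0.0042, 3: 0.0032, 4: 0.0027, 5: 0.0030}

# Pick the epoch whose validation loss is lowest.
best_epoch = min(validation_loss_by_epoch, key=validation_loss_by_epoch.get)
print(best_epoch, validation_loss_by_epoch[best_epoch])  # prints: 4 0.0027
```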

⚠️ Limitations & Scope

Sequence Length Limitation: DistilBERT accepts at most 512 tokens, and this model was trained with a max_length of 256 tokens to optimize for speed. If you pass an entire 2-page resume as a single string, everything past the limit is either cut off or rejected, depending on your tokenizer settings, so most of the resume is never classified. Chunk your PDF first (e.g., using Unstructured) and pass the chunks to this model individually.
Not an NER Model: This is a sequence classifier, not a Named Entity Recognition (NER) model. It will tell you that a block of text belongs to the `education` section, but it will not extract the specific substring "Harvard University" from it. Route the classified text to an LLM or a strict extraction schema (e.g., Zod or Pydantic) for final data extraction.
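If your extractor hands back large blocks, the sequence length limit above can be respected with a simple pre-splitting step. The sketch below splits on blank lines and then breaks any oversized block into groups of lines. `split_into_chunks` and `approx_token_count` are hypothetical helpers, and the whitespace word count is only a rough stand-in that underestimates WordPiece token counts; a tokenizer-based counter would be more accurate.

```python
import re
from typing import Callable, List


def approx_token_count(text: str) -> int:
    """Crude stand-in for a real tokenizer: counts whitespace-separated words."""
    return len(text.split())


def split_into_chunks(text: str, max_tokens: int = 256,
                      count_tokens: Callable[[str], int] = approx_token_count) -> List[str]:
    """Split on blank lines, then break any oversized block into groups of lines."""
    chunks: List[str] = []
    for block in re.split(r"\n\s*\n", text.strip()):
        block = block.strip()
        if not block:
            continue
        if count_tokens(block) <= max_tokens:
            chunks.append(block)
            continue
        # Oversized block: greedily pack consecutive lines up to the budget.
        current: List[str] = []
        for line in block.splitlines():
            candidate = "\n".join(current + [line])
            if current and count_tokens(candidate) > max_tokens:
                chunks.append("\n".join(current))
                current = [line]
            else:
                current.append(line)
        if current:
            chunks.append("\n".join(current))
    # Note: a single line longer than max_tokens is kept whole and still
    # relies on truncation at inference time.
    return chunks


# A tokenizer-based counter could replace the default, for example:
#   from transformers import AutoTokenizer
#   tok = AutoTokenizer.from_pretrained("amosify/resume-section-classifier-v1")
#   count = lambda s: len(tok(s)["input_ids"])
```

Splitting on blank lines first keeps each chunk within one visual block of the resume, which matches how the model expects to see one section's text at a time.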


Model size: 67M params (Safetensors, tensor type F32)
