Resume Section Classifier v1 (DistilBERT)

A DistilBERT text classification model that categorizes raw, messy resume text into 15 distinct sections, reaching 99.97% accuracy on its validation set.

This model builds upon the original concepts of gr8monk3ys/resume-section-classifier, but expands the label set, introduces deep regional (African/European/Global) context, and provides publicly hosted weights for immediate deployment.

It is heavily optimized for parsing output generated by OCR and PDF extraction tools like the Unstructured API.

🚀 Model Details

  • Base Architecture: distilbert-base-uncased
  • Task: Text Classification (Sequence Classification)
  • Language: English
  • Training Data: ~58,000 rows of hybrid organic and synthetically generated resume lines.
  • Accuracy: 99.97% (validation accuracy of the selected Epoch 4 checkpoint)

🎯 Intended Use & Key Features

This model is designed to act as the "Routing Engine" in resume parsing pipelines. When a PDF parser extracts unstructured blocks of text, this model categorizes those blocks so they can be routed to strict data schemas (like Pydantic or Zod) for targeted entity extraction.
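The routing step described above can be sketched as follows. This is a minimal illustration, not the author's pipeline: `route_chunks`, `make_hf_classifier`, and the `Classifier` type are hypothetical names, and the classifier is injected as a plain callable so the routing logic can be tried without downloading the model.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List

# Hypothetical signature: takes one text chunk, returns a section label.
Classifier = Callable[[str], str]


def route_chunks(chunks: Iterable[str], classify: Classifier) -> Dict[str, List[str]]:
    """Group text chunks by predicted section label, preserving input order."""
    routed: Dict[str, List[str]] = defaultdict(list)
    for chunk in chunks:
        text = chunk.strip()
        if not text:  # skip empty blocks a PDF parser may emit
            continue
        routed[classify(text)].append(text)
    return dict(routed)


def make_hf_classifier(model_id: str = "amosify/resume-section-classifier-v1") -> Classifier:
    """Wrap the Hugging Face pipeline as a Classifier (downloads weights when called)."""
    from transformers import pipeline  # lazy import keeps route_chunks dependency-free

    clf = pipeline("text-classification", model=model_id)
    return lambda text: clf(text)[0]["label"]
```

Each list in the returned dictionary can then be handed to the extraction schema for that section, for example `routed.get("education", [])` to an education schema.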

Key Features:

  1. Unstructured API Resilience: Trained heavily on "engineered noise" (random bullet points, markdown artifacts, pipe separators, missing newlines, and concatenated dates). It learns the semantic meaning of the text, ignoring formatting garbage.
  2. Global & Regional Coverage: Excellent recognition of standard global formats (BSc, AWS, PMP, GPA) as well as highly specific regional/Nigerian formats (ND, HND, PGD, NYSC, SIWES, ICAN, CIBN).
  3. Multi-line Block Handling: Capable of classifying dense, multi-line blocks of text. For example, it identifies a 4-line project description block as projects rather than confusing it with experience.

🏷️ Supported Labels (15)

The model predicts one of the following 15 classes:

Label            Description / Examples
contact          Emails, phone numbers, addresses, LinkedIn/GitHub URLs.
summary          Professional summaries, profiles, or executive overviews.
objective        Career objectives and personal statements.
experience       Work history, NYSC, SIWES, internships.
education        Degrees (BSc, HND, PhD), institutions, and grades.
skills           Technical skills, soft skills, programming languages.
certifications   Professional certs (AWS, ICAN, PMP), including "In View" status.
projects         Personal or professional projects and open-source contributions.
awards           Honors, scholarships, and Dean's Lists.
hobbies          Interests, passions, and extracurricular activities.
languages        Spoken languages and proficiency levels (e.g., Fluent, B2).
volunteer        Community service and pro-bono work.
publications     Research papers, articles, and academic journals.
references       Referees or "References available upon request" statements.
additional_info  Relocation willingness, visa status, notice periods.

💻 How to Use

You can easily load this model in your pipeline using the Hugging Face transformers library.

from transformers import pipeline

# Load the classifier
classifier = pipeline("text-classification", model="amosify/resume-section-classifier-v1")

# Example 1: Messy Unstructured API chunk
chunk_1 = "• Professional Development\nGoogle Data Analytics Professional Certificate - 2023"
print(classifier(chunk_1))
# Output: [{'label': 'certifications', 'score': 0.9998}]

# Example 2: Multi-line project description
chunk_2 = "Interactive Search Engine (C#, Java, PHP)\n* Attracted 100+ GitHub stars\n* Deployed to Heroku with Docker"
print(classifier(chunk_2))
# Output: [{'label': 'projects', 'score': 0.9997}]

# Example 3: Regional Education
chunk_3 = "Higher National Diploma (HND) in Computer Science, Yaba College of Technology (Upper Credit)"
print(classifier(chunk_3))
# Output: [{'label': 'education', 'score': 0.9999}]
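When routing many chunks, it can help to look at the full score distribution and hold back low-confidence predictions instead of trusting the top label blindly. The sketch below assumes that calling the pipeline on a list of strings with `top_k=None` returns one list of `{'label', 'score'}` dictionaries per input; the `pick_label` helper, the 0.90 threshold, and the `"unknown"` fallback label are illustrative choices, not part of the model.

```python
from typing import Dict, Iterable, List, Tuple


def pick_label(scores: List[Dict], min_score: float = 0.90,
               fallback: str = "unknown") -> Tuple[str, float]:
    """Return (label, score) for the top class, or (fallback, score) if below min_score."""
    best = max(scores, key=lambda s: s["score"])
    if best["score"] < min_score:
        return fallback, best["score"]
    return best["label"], best["score"]


def classify_with_threshold(chunks: Iterable[str], min_score: float = 0.90):
    """Classify chunks in one batch and apply the confidence threshold to each."""
    from transformers import pipeline  # lazy import: only needed when the model is used

    clf = pipeline("text-classification", model="amosify/resume-section-classifier-v1")
    # Assumption: list input + top_k=None yields a list of per-label score lists.
    all_scores = clf(list(chunks), top_k=None)
    return [pick_label(scores, min_score) for scores in all_scores]
```

Chunks that come back as `"unknown"` could be logged for review or sent to a slower fallback such as an LLM.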

📊 Training Procedure & Metrics

The model was fine-tuned for 5 epochs on Kaggle using two NVIDIA T4 GPUs. Training used the load_best_model_at_end option, so the final weights come from the checkpoint with the lowest validation loss rather than from the last epoch, which limits overfitting.

Training Hyperparameters

  • learning_rate: 2e-05
  • train_batch_size: 64
  • eval_batch_size: 64
  • optimizer: AdamW (betas=(0.9,0.999), epsilon=1e-08)
  • lr_scheduler_type: linear
  • num_epochs: 5
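For readers who want to reproduce a similar run, the hyperparameters above map onto the transformers `TrainingArguments` roughly as shown below. This is a sketch under assumptions, not the author's training script: it assumes the listed batch sizes are per device, that evaluation and checkpointing ran once per epoch, and that the best checkpoint was selected by validation loss; `build_training_args` and the output directory name are hypothetical.

```python
# Values copied from the hyperparameter list above.
HYPERPARAMS = {
    "learning_rate": 2e-5,
    "per_device_train_batch_size": 64,  # assumption: train_batch_size is per device
    "per_device_eval_batch_size": 64,   # assumption: eval_batch_size is per device
    "num_train_epochs": 5,
    "lr_scheduler_type": "linear",
    "adam_beta1": 0.9,
    "adam_beta2": 0.999,
    "adam_epsilon": 1e-8,
}


def build_training_args(output_dir: str = "resume-clf-out"):
    """Build TrainingArguments mirroring the card's settings (assumed details noted inline)."""
    from transformers import TrainingArguments  # lazy import: heavy dependency

    return TrainingArguments(
        output_dir=output_dir,
        eval_strategy="epoch",              # named evaluation_strategy in older releases
        save_strategy="epoch",              # must match the eval strategy for best-model loading
        load_best_model_at_end=True,
        metric_for_best_model="eval_loss",  # assumption: selection by validation loss
        greater_is_better=False,
        **HYPERPARAMS,
    )
```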

Training Results

Epoch  Step  Training Loss  Validation Loss  Accuracy
1.0    1004  0.0093         0.0054           0.9996
2.0    2008  0.0024         0.0042           0.9996
3.0    3012  0.0008         0.0032           0.9997
4.0    4016  0.0004         0.0027           0.9997
5.0    5020  0.0003         0.0030           0.9997

(Note: With load_best_model_at_end enabled, the Epoch 4 weights were kept as the final model because they yielded the lowest validation loss, 0.0027.)
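The selection rule can be checked directly against the table: the chosen checkpoint is the row with the smallest validation loss, even though training loss keeps falling in Epoch 5. A small sketch, with `RESULTS` and `best_epoch` as illustrative names:

```python
# (epoch, training_loss, validation_loss, accuracy), copied from the results table.
RESULTS = [
    (1, 0.0093, 0.0054, 0.9996),
    (2, 0.0024, 0.0042, 0.9996),
    (3, 0.0008, 0.0032, 0.9997),
    (4, 0.0004, 0.0027, 0.9997),
    (5, 0.0003, 0.0030, 0.9997),
]


def best_epoch(results):
    """Return the epoch whose validation loss (third column) is lowest."""
    return min(results, key=lambda row: row[2])[0]


print(best_epoch(RESULTS))  # prints 4
```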

⚠️ Limitations & Scope

  • Sequence Length Limitation: DistilBERT has a hard limit of 512 tokens, and this model was trained with a max_length of 256 tokens to optimize for speed. If you pass an entire 2-page resume as a single string, everything past the length limit is truncated and ignored. Chunk your PDF first (e.g., using Unstructured) and pass the chunks to this model individually.
  • Not an NER Model: This is a Sequence Classifier, not a Named Entity Recognition (NER) model. It will confidently tell you that a block of text belongs to the "Education" section, but it will not extract the specific substring "Harvard University" out of it. You should route the classified text to an LLM or strict extraction schema (like Zod/Pydantic) for final data extraction.
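If no chunking tool is in place, the length limit can be respected with a simple splitter. The sketch below is one possible approach, not the method used to build the training data: it treats blank lines as block boundaries, keeps each block separate so it receives its own label, and splits only oversized blocks along line breaks. `chunk_resume`, `split_block`, and the whitespace-based token counter are hypothetical; the counter only approximates DistilBERT's WordPiece count, and `hf_token_counter` shows how the real tokenizer could be swapped in.

```python
from typing import Callable, List

TokenCounter = Callable[[str], int]


def whitespace_count(text: str) -> int:
    """Rough token estimate; WordPiece usually produces more tokens than this."""
    return len(text.split())


def split_block(block: str, max_tokens: int, count_tokens: TokenCounter) -> List[str]:
    """Split one oversized block into runs of consecutive lines within the budget."""
    pieces, current = [], []
    for line in block.splitlines():
        if current and count_tokens("\n".join(current + [line])) > max_tokens:
            pieces.append("\n".join(current))
            current = [line]
        else:
            current.append(line)
    if current:
        pieces.append("\n".join(current))
    return pieces  # a single line over budget is kept whole and left to truncation


def chunk_resume(text: str, max_tokens: int = 256,
                 count_tokens: TokenCounter = whitespace_count) -> List[str]:
    """Turn raw resume text into chunks that each fit the model's training length."""
    chunks: List[str] = []
    for block in text.split("\n\n"):
        block = block.strip()
        if not block:
            continue
        if count_tokens(block) <= max_tokens:
            chunks.append(block)
        else:
            chunks.extend(split_block(block, max_tokens, count_tokens))
    return chunks


def hf_token_counter(model_id: str = "amosify/resume-section-classifier-v1") -> TokenCounter:
    """Exact counter using the model's tokenizer (count includes special tokens)."""
    from transformers import AutoTokenizer  # lazy import: downloads tokenizer files

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    return lambda s: len(tokenizer(s)["input_ids"])
```

As a safety net, the pipeline call also appears to accept tokenizer arguments such as `truncation=True` and `max_length=256`; check this against your installed transformers version before relying on it.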
