# Resume NER: Pre- and Post-Processing Implementation Guide

This document explains the full inference pipeline from raw resume text to structured output, covering all pre-processing, model inference, and post-processing steps driven by `resume_config.json`.
## Pipeline Overview

```
Raw PDF/Text
      |
      v
[1. Pre-processing]    ← resume_config.json → pre_processing
      |
      v
[2. Tokenization]      ← distilbert-base-cased tokenizer
      |
      v
[3. NER Inference]     ← DistilBERT token classification (27 labels)
      |
      v
[4. Span Assembly]     ← BIO → character-offset spans
      |
      v
[5. Section Detection] ← rule-based gap-filling for SKILLS, CERTS, LANGUAGES
      |
      v
[6. Post-processing]   ← resume_config.json → post_processing
      |
      v
Structured JSON output
```
## 1. Pre-processing (`text_preprocess.py`)

Config section: `resume_config.json` → `pre_processing`

Normalizes raw PDF extraction artifacts before the model sees the text. All rules are config-driven.
Steps (in order):

1. **CRLF normalization** - convert `\r\n` and `\r` to `\n`
2. **Dash normalization** (`normalize_dashes: true`)
   - Replace em-dash (`—`) and en-dash (`–`) with hyphen (`-`)
   - Configured via the `dash_replacements` map
3. **Bullet normalization** (`normalize_bullets: true`)
   - Replace Unicode bullets (e.g. `•`, `◦`, `▪`, `●`, `‣`, `▸`) with `"- "`
   - Characters listed in `bullet_chars`, replacement in `bullet_replacement`
4. **Multi-space collapse** (`collapse_multi_spaces: true`) - reduce runs of 2+ spaces to a single space
5. **Label stripping** (`strip_labels: ["Phone:", "Email:"]`) - remove literal prefixes like "Phone:" or "Email:" that add noise
6. **Skill table expansion** (`expand_skill_tables: true`)
   - Detects two-column "Category: skill1, skill2" tables common in resumes and expands them into flat lists for better NER tagging
   - Recognizes categories from the `skill_table_categories` list
   - Limits: `table_prose_max_words: 15`, `table_continuation_max_chars: 60`
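The shape of these rules can be sketched in a few lines. This is an illustrative re-implementation, not the actual `text_preprocess.py` code; the `cfg` dict mirrors the `pre_processing` keys above:

```python
import re

def normalize_text(text: str, cfg: dict) -> str:
    """Minimal sketch of the first five pre-processing rules."""
    # 1. CRLF normalization
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # 2. Dash normalization via the dash_replacements map
    if cfg.get("normalize_dashes"):
        for dash, repl in cfg.get("dash_replacements", {"—": "-", "–": "-"}).items():
            text = text.replace(dash, repl)
    # 3. Bullet normalization: every char in bullet_chars becomes bullet_replacement
    if cfg.get("normalize_bullets"):
        for ch in cfg.get("bullet_chars", "•◦▪●"):
            text = text.replace(ch, cfg.get("bullet_replacement", "- "))
    # 4. Multi-space collapse: runs of 2+ spaces -> one space
    if cfg.get("collapse_multi_spaces"):
        text = re.sub(r" {2,}", " ", text)
    # 5. Label stripping: drop literal noise prefixes such as "Phone:"
    for label in cfg.get("strip_labels", []):
        text = text.replace(label, "")
    return text
```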
Usage:

```python
from training.text_preprocess import preprocess_resume_text

# Uses resume_config.json from the current directory
clean_text = preprocess_resume_text(raw_text)

# Or with an explicit config path:
from training.text_preprocess import ResumeTextPreprocessor

pp = ResumeTextPreprocessor("/path/to/model_dir")
clean_text = pp.preprocess(raw_text)
```
## 2. Tokenization & Chunking

Model max sequence length: 512 tokens (DistilBERT).

For resumes exceeding 512 tokens, section-aware chunking is used (`benchmark_structured.py` → `chunked_predicted_spans`):

- Split text at `\n\n` (paragraph) boundaries
- Greedily group consecutive sections into chunks that fit within 512 tokens
- Run inference on each chunk independently
- Map character offsets back to the original text
This preserves entity context within natural resume sections (Experience, Education, Skills).
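The greedy grouping step might look like the following sketch (assumes a HuggingFace tokenizer; the real logic lives in `chunked_predicted_spans`):

```python
def greedy_chunks(text: str, tokenizer, max_tokens: int = 512) -> list:
    """Group paragraph-separated sections into chunks under the token budget."""
    sections = text.split("\n\n")
    chunks, current = [], ""
    for section in sections:
        candidate = current + "\n\n" + section if current else section
        # add_special_tokens=True accounts for the [CLS]/[SEP] overhead
        n_tokens = len(tokenizer(candidate, add_special_tokens=True)["input_ids"])
        if n_tokens <= max_tokens:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = section  # an oversized single section still gets its own chunk
    if current:
        chunks.append(current)
    return chunks
```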
## 3. NER Inference

Model: `distilbert-base-cased` fine-tuned for token classification.
27 BIO labels:
| Entity | B-tag | I-tag | Description |
|---|---|---|---|
| NAME | 1 | 2 | Person's full name |
| EMAIL | 3 | 4 | Email address |
| PHONE | 5 | 6 | Phone number |
| LOCATION | 7 | 8 | City, state, country |
| COMPANY | 9 | 10 | Employer name |
| TITLE | 11 | 12 | Job title |
| DATE | 13 | 14 | Employment/education dates |
| DEGREE | 15 | 16 | Academic degree |
| INSTITUTION | 17 | 18 | School/university |
| FIELD | 19 | 20 | Field of study |
| SKILL | 21 | 22 | Technical/professional skill |
| CERT | 23 | 24 | Certification |
| LANGUAGE | 25 | 26 | Spoken language |
Tag 0 = O (outside any entity).
Subword alignment:

The tokenizer splits words into subword tokens. During training:

- First subword of a word: gets the word's BIO label
- Continuation subwords: `B-X` converts to `I-X`, other labels propagate
- Special tokens (`[CLS]`, `[SEP]`, `[PAD]`): label = -100 (ignored in the loss)
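This alignment is straightforward with a fast tokenizer's `word_ids()` output. A sketch (the `b_to_i` map, e.g. `{1: 2, 3: 4, ...}` per the table above, is an assumed helper):

```python
def align_labels(word_labels: list, word_ids: list, b_to_i: dict) -> list:
    """Map word-level BIO label ids onto subword tokens."""
    aligned, prev_wid = [], None
    for wid in word_ids:
        if wid is None:                      # [CLS], [SEP], [PAD]
            aligned.append(-100)
        elif wid != prev_wid:                # first subword keeps the word's label
            aligned.append(word_labels[wid])
        else:                                # continuation subword: B-X -> I-X
            label = word_labels[wid]
            aligned.append(b_to_i.get(label, label))
        prev_wid = wid
    return aligned
```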
## 4. Span Assembly
Convert BIO predictions back to character-offset spans:
```python
from dataclasses import dataclass

@dataclass
class Span:
    label: str    # Entity type (NAME, COMPANY, etc.)
    text: str     # Extracted text
    start: int    # Character offset start
    end: int      # Character offset end
    score: float  # Confidence (1.0 for argmax)
```
Rules:

- `B-X` starts a new span
- `I-X` continues the current span (including whitespace gaps between subwords)
- `O` or a different entity type closes the current span
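These rules amount to a small state machine over the token predictions. A sketch, assuming per-token `(bio_label, char_start, char_end)` tuples from the tokenizer's offset mapping and reusing the `Span` dataclass above:

```python
def decode_spans(tokens: list, text: str) -> list:
    """tokens: (bio_label, char_start, char_end) per subword token."""
    spans, current = [], None
    for label, start, end in tokens:
        if label.startswith("B-"):
            if current:
                spans.append(current)
            current = Span(label[2:], text[start:end], start, end, 1.0)
        elif label.startswith("I-") and current and label[2:] == current.label:
            # Extend through any whitespace gap between subwords
            current.end = end
            current.text = text[current.start:end]
        else:
            # "O", an orphan I-X, or a type mismatch closes the open span
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return spans
```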
## 5. Section Detection (`section_detector.py`)

Rule-based gap-filling that runs **after** NER. It catches entities the model missed by using section context:

- Detects section headers (SKILLS, CERTIFICATIONS, LANGUAGES, EDUCATION) by keyword matching
- Within detected sections, extracts untagged text as entities
- Especially useful for skills lists that the model only partially tags
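The header-matching idea, sketched (the keyword sets here are illustrative; the real lists live in `section_detector.py`):

```python
import re

# Hypothetical keyword map; the actual headers are defined in section_detector.py
SECTION_HEADERS = {
    "SKILL": {"skills", "technical skills", "core competencies"},
    "CERT": {"certifications", "certificates", "licenses"},
    "LANGUAGE": {"languages"},
}

def detect_sections(text: str) -> list:
    """Return (entity_label, body_start, body_end) for each recognized section."""
    headers = []
    for match in re.finditer(r"^(.+)$", text, flags=re.MULTILINE):
        line = match.group(1).strip().rstrip(":").lower()
        for label, keywords in SECTION_HEADERS.items():
            if line in keywords:
                headers.append((label, match.start(), match.end()))
    # A section body runs from its header to the next header (or end of text)
    sections = []
    for i, (label, _, header_end) in enumerate(headers):
        body_end = headers[i + 1][1] if i + 1 < len(headers) else len(text)
        sections.append((label, header_end, body_end))
    return sections
```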
## 6. Post-processing (`structured_postprocess.py`)

Config section: `resume_config.json` → `post_processing`
Transforms raw spans into clean structured JSON.
### 6.1 Span Merging

```json
"span_merge_max_gap": 3,
"span_merge_labels": ["TITLE", "COMPANY"]
```

Adjacent spans of the same type (TITLE or COMPANY) separated by at most 3 characters are merged. This handles cases where the model splits "Senior Software Engineer" into multiple spans.
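A sketch of the merge pass (reusing the `Span` dataclass from section 4; parameter names mirror the config keys):

```python
def merge_adjacent(spans: list, text: str,
                   merge_labels=("TITLE", "COMPANY"), max_gap: int = 3) -> list:
    """Merge same-label spans whose character gap is <= max_gap."""
    merged = []
    for span in sorted(spans, key=lambda s: s.start):
        prev = merged[-1] if merged else None
        if (prev is not None and span.label == prev.label
                and span.label in merge_labels
                and span.start - prev.end <= max_gap):
            prev.end = span.end
            prev.text = text[prev.start:prev.end]
        else:
            merged.append(span)
    return merged
```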
### 6.2 Entity Validation Rules

Each entity type has validation rules in `entity_rules`:

COMPANY:

- `min_length: 4` → reject spans shorter than 4 chars
- `gazetteer_bypass: true` → known companies from `companies.json` skip the length check
- `strip_trailing_state_code: true` → remove trailing US state codes ("Acme Inc. CA" → "Acme Inc.")

TITLE:

- `min_length: 2`
- `exceptions: ["VP", "PA", "RN", "MD", "DO", "QA"]` → short titles that are valid

SKILL:

- `min_length: 4`
- `uppercase_bypass: true` → short all-caps skills (AWS, GCP) pass
- `exceptions: ["Go", "R", "C", "C#", "F#", "D"]` → valid short skills
- `blocked_words` → language proficiency descriptors ("native", "fluent", "bilingual") are filtered out
- `aliases` → normalize variants ("nodejs" → "node.js", "cpp" → "c++")

EMAIL:

- `require: "@"` → must contain `@`
- `reject_patterns: ["//", "www."]` → filter URLs misclassified as emails
- `strip_prefixes: ["Esq.", "Dr.", ...]` → remove honorifics attached by OCR

DATE:

- `min_length: 3`
- `date_words` list validates month names
- `present_words: ["present", "current"]` → recognized as end-date markers
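As an example of how these rules compose, a sketch of SKILL validation (the rule keys mirror the config; the helper itself is illustrative):

```python
from typing import Optional

def validate_skill(text: str, rules: dict) -> Optional[str]:
    """Return the normalized skill, or None if a rule rejects it."""
    skill = text.strip()
    if skill.lower() in rules.get("blocked_words", []):
        return None  # proficiency descriptors like "fluent" are not skills
    too_short = len(skill) < rules.get("min_length", 4)
    allcaps_ok = rules.get("uppercase_bypass", False) and skill.isupper()
    exception_ok = skill in rules.get("exceptions", [])
    if too_short and not (allcaps_ok or exception_ok):
        return None  # too short and no bypass applies
    # aliases normalize spelling variants, e.g. "nodejs" -> "node.js"
    return rules.get("aliases", {}).get(skill.lower(), skill)
```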
### 6.3 Text Cleanup

```json
"space_collapse_pairs": [
  [" . ", "."],
  [" + + ", "++"],
  [" # ", "#"],
  [" ,", ","]
]
```

Fixes tokenizer-induced spacing artifacts in extracted text (e.g., "C + +" → "C++").
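Applying the pairs is a single substitution pass (sketch):

```python
def collapse_spacing(text: str, pairs: list) -> str:
    # Each [pattern, replacement] pair undoes one tokenizer spacing artifact
    for pattern, replacement in pairs:
        text = text.replace(pattern, replacement)
    return text
```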
### 6.4 Seniority Inference

Determines career level from title keywords and experience duration:

```json
"seniority_keywords": {
  "Executive": ["cto", "ceo", ...],
  "Senior": ["senior", "sr.", "lead", "director", ...],
  "Junior": ["junior", "intern", "trainee", ...]
}
```
Fallback by years of experience:
"seniority_by_years": { "Staff": 15, "Senior": 8, "Mid": 3, "Junior": 0 }
### 6.5 Country Detection

- Phone prefix matching (`phone_country_prefixes`)
- Location span matching against `city_country_map.json` (317 cities)
- US state code detection (`us_states` list)
- Country name aliases ("usa" → "United States")
### 6.6 Experience Years Calculation

- Parse start/end dates from DATE spans
- `max_experience_months: 600` → cap at 50 years
- `present_words` are treated as the current date
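The duration math, sketched (periods with a `None` end date stand in for `present_words` matches):

```python
from datetime import date
from typing import Optional

def experience_years(periods: list, max_months: int = 600) -> float:
    """periods: list of (start: date, end: Optional[date]) tuples."""
    today = date.today()
    months = 0
    for start, end in periods:
        stop = end or today  # a present_words match means "still employed"
        months += max(0, (stop.year - start.year) * 12 + (stop.month - start.month))
    return round(min(months, max_months) / 12, 1)  # 600 months = 50-year cap
```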
## Structured Output Format

```json
{
"personal": {
"name": "string",
"email": "string",
"phone": "string",
"location": "string"
},
"experience": [
{
"title": "string",
"company": "string",
"start_date": "string",
"end_date": "string"
}
],
"education": [
{
"degree": "string",
"field": "string",
"institution": "string"
}
],
"skills": ["string"],
"certifications": ["string"],
"seniority": "Executive|Principal|Staff|Senior|Mid|Junior",
"country": "string",
"experience_years": number
}
```
## Training Configuration

| Parameter | Value |
|---|---|
| Base model | `distilbert-base-cased` |
| Max sequence length | 512 |
| Epochs | 25 |
| Batch size | 8 |
| Learning rate | 3e-5 |
| Weight decay | 0.01 |
| Warmup steps | 20 |
| Metric for best model | `entity_f1` |
| Noise augmentation | 2x multiplier |
## Training Data Sources

| File | Records | Description |
|---|---|---|
| `ner_train.json` | ~3,647 | Synthetic + manual + DataTurks (with noise augmentation) |
| `kaggle_train.json` | ~7,449 | Kaggle resumes: 2,483 clean + 4,966 noise-augmented |
## Evaluation

| File | Records | Description |
|---|---|---|
| `ner_val.json` | 652 | Validation split |
| `gold/resume_resource_gold.json` | 93 | Hand-annotated gold standard |
## Quick Start: Running Inference

```python
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

from training.benchmark_structured import chunked_predicted_spans
from training.structured_postprocess import StructuredPostProcessor
from training.text_preprocess import ResumeTextPreprocessor

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("path/to/model")
model = AutoModelForTokenClassification.from_pretrained("path/to/model")
model.eval()

postprocessor = StructuredPostProcessor("path/to/model")

# Run the pipeline: pre-process, chunked inference, post-process
pp = ResumeTextPreprocessor("path/to/model")
clean_text = pp.preprocess(raw_resume_text)
_, spans = chunked_predicted_spans(clean_text, model, tokenizer)
result = postprocessor.build_structured_resume_from_spans(spans, clean_text)
```
## File Reference

| File | Role |
|---|---|
| `resume_config.json` | All pre/post-processing rules |
| `label_config.json` | Label ↔ ID mappings |
| `city_country_map.json` | City → country lookup |
| `training/data/companies.json` | Company name gazetteer |
| `training/data/titles.json` | Job title gazetteer |