_\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\_

_\\----------- **Resume Parser** ----------\\_

_\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\_

# Overview:

This project is a resume parsing tool built in Python. It uses the Mistral-Nemo-Instruct-2407 model for primary parsing. If the Mistral call fails or runs into problems, the system falls back to a custom-trained spaCy model so that parsing can continue. The tool is wrapped in a Flask API and has a user interface built with HTML and CSS.

# Installation Guide:

1. Create and Activate a Virtual Environment

   Create the environment:

   `python -m venv venv`

   Activate it on Linux/Mac:

   `source venv/bin/activate`

   or on Windows:

   `venv\Scripts\activate`

   NOTE: If the virtual environment (venv) already exists, skip the creation step and run only the activation command for your platform.

2. Install Required Libraries

   `pip install -r requirements.txt`

   Ensure that the following dependencies are included:

   - Flask
   - spaCy
   - huggingface_hub
   - PyMuPDF
   - python-docx
   - odfpy (for ODT files)
   - Tesseract-OCR (for image-based parsing; the OCR engine is a system program and is installed separately from the pip packages)

3. Set up Hugging Face Token

   - Add your Hugging Face token to the .env file as:

     `HF_TOKEN=<your_huggingface_token>`

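One way the token from step 3 could be read at startup is sketched below. The helper names `load_env_file` and `get_hf_token` are hypothetical and use only the standard library; the project may instead rely on a package such as python-dotenv, which this README does not confirm.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Parse KEY=VALUE lines from a .env-style file into a dict (hypothetical helper)."""
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def get_hf_token(path=".env"):
    """Return HF_TOKEN from the process environment, else from the .env file."""
    token = os.environ.get("HF_TOKEN") or load_env_file(path).get("HF_TOKEN")
    if not token:
        raise RuntimeError("HF_TOKEN is not set; add it to the .env file")
    return token
```

Checking the process environment first lets a deployment override the file without editing it.
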
# File Structure Overview:

    Mistral_With_Spacy/
    │
    ├── Spacy_Models/
    │   └── ner_model_05_3     # Pretrained spaCy model directory for resume parsing
    │
    ├── templates/
    │   ├── index.html         # UI for file upload
    │   └── result.html        # Display parsed results in structured JSON
    │
    ├── uploads/               # Directory for uploaded resume files
    │
    ├── utils/
    │   ├── mistral.py         # Code for calling Mistral API and handling responses
    │   ├── spacy.py           # spaCy fallback model for parsing resumes
    │   ├── error.py           # Error handling utilities
    │   └── fileTotext.py      # Functions to extract text from different file formats (PDF, DOCX, etc.)
    │
    ├── venv/                  # Virtual environment
    │
    ├── .env                   # Environment variables file (contains Hugging Face token)
    │
    ├── main.py                # Flask app handling API routes for uploading and processing resumes
    │
    └── requirements.txt       # Dependencies required for the project

# Program Overview:

# Mistral Integration (utils/mistral.py)

- Mistral API Calls: Uses the Mistral-Nemo-Instruct-2407 model, accessed through Hugging Face, to parse resumes.
- Personal and Professional Extraction: Two functions extract personal and professional information in structured JSON format.
- Fallback Mechanism: If Mistral fails, the spaCy NER model is used as a fallback.

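The JSON extraction and fallback behaviour described above could look roughly like the sketch below. It is an illustration, not the code in utils/mistral.py: `extract_json` and `parse_resume` are hypothetical names, and `primary` and `fallback` stand in for the real Mistral and spaCy calls.

```python
import json


def extract_json(raw_text):
    """Pull the outermost {...} object out of a model reply and decode it."""
    start = raw_text.find("{")
    end = raw_text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in model response")
    return json.loads(raw_text[start:end + 1])


def parse_resume(text, primary, fallback):
    """Try the primary (LLM) parser first; use the fallback parser on any failure.

    `primary` is a callable returning the raw model reply as a string.
    `fallback` is a callable returning a dict. Both are hypothetical stand-ins.
    """
    try:
        return {"source": "mistral", "data": extract_json(primary(text))}
    except Exception as exc:  # network error, bad status, malformed JSON, ...
        return {
            "source": "spacy",
            "data": fallback(text),
            "primary_error": str(exc),
        }
```

Recording which parser produced the result, and why the primary one failed, makes fallback runs easier to spot when reviewing output.
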
# spaCy Integration (utils/spacy.py)

- Custom Trained Model: Uses a spaCy model (ner_model_05_3) trained specifically for resume parsing.
- Named Entity Recognition: Extracts key information like Name, Email, Contact, Location, Skills, Experience, etc., from resumes.
- Validation: Includes validation for extracted emails and contacts.

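As an illustration of the validation step, the sketch below filters a list of extracted entities with simple regular-expression checks. The entity labels (`EMAIL`, `CONTACT`), the digit limits, and all function names are assumptions made for this example; the actual rules in utils/spacy.py may differ.

```python
import re

# Deliberately simple pattern: local part, "@", and a domain with at least one dot.
EMAIL_RE = re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)+$")


def is_valid_email(value):
    return bool(EMAIL_RE.match(value.strip()))


def normalize_contact(value, min_digits=7, max_digits=15):
    """Return the number as digits with an optional leading '+', or None if implausible."""
    stripped = value.strip()
    digits = re.sub(r"\D", "", stripped)
    if not (min_digits <= len(digits) <= max_digits):
        return None
    return ("+" if stripped.startswith("+") else "") + digits


def validate_entities(entities):
    """Drop invalid emails/contacts; `entities` is a list of (label, text) pairs."""
    cleaned = {}
    for label, text in entities:
        if label == "EMAIL":
            if is_valid_email(text):
                cleaned["EMAIL"] = text.strip()
        elif label == "CONTACT":
            number = normalize_contact(text)
            if number:
                cleaned["CONTACT"] = number
        else:
            # Other labels (Name, Skills, ...) are collected without checks.
            cleaned.setdefault(label, []).append(text.strip())
    return cleaned
```

In practice the pairs would come from the model's output, for example `[(ent.label_, ent.text) for ent in doc.ents]` in spaCy.
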
# File Conversion (utils/fileTotext.py)

- Text Extraction: Handles different resume formats (PDF, DOCX, ODT, RTF, and images such as PNG, JPG, and JPEG) and extracts text for further processing.
  - PDF Files: Uses PyMuPDF to extract text and, if necessary, Tesseract-OCR for image-based PDF content.
  - DOCX Files: Uses `python-docx` to extract structured text from Word documents.
  - ODT Files: Uses `odfpy` to extract text from ODT (OpenDocument) files.
  - RTF Files: Reads plain text from RTF files.
  - Images (PNG, JPG, JPEG): Uses Tesseract-OCR to extract text from image-based resumes.
- Hyperlink Extraction: Extracts hyperlinks from PDF files, capturing any embedded URLs during the parsing process.

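The format handling and hyperlink capture described above could be organised along the lines of this sketch. The mapping, the strategy labels, and the function names are hypothetical; only `fitz.open` and `page.get_links()` are PyMuPDF calls, and PyMuPDF is imported lazily so the other helpers run without it.

```python
from pathlib import Path

# Extension -> extraction strategy, following the formats listed above.
# Assumption: ".rtf" matches the RTF entry shown in the tree map of this README.
EXTRACTORS = {
    ".pdf": "pymupdf",        # text layer first, OCR for image-only content
    ".docx": "python-docx",
    ".odt": "odfpy",
    ".rtf": "plain-text",
    ".png": "tesseract",
    ".jpg": "tesseract",
    ".jpeg": "tesseract",
}


def pick_extractor(filename):
    """Return the strategy label for a file, or raise ValueError if unsupported."""
    suffix = Path(filename).suffix.lower()
    if suffix not in EXTRACTORS:
        raise ValueError(f"unsupported resume format: {suffix or 'no extension'}")
    return EXTRACTORS[suffix]


def collect_uris(links):
    """Return unique URIs, in first-seen order, from PyMuPDF-style link dicts."""
    seen = []
    for link in links:
        uri = link.get("uri")  # only external (URI) links carry this key
        if uri and uri not in seen:
            seen.append(uri)
    return seen


def extract_pdf_links(path):
    """Gather hyperlinks from every page of a PDF (requires PyMuPDF)."""
    import fitz  # PyMuPDF

    with fitz.open(path) as doc:
        return collect_uris(link for page in doc for link in page.get_links())
```

Keeping the extension table in one place means adding a format only requires a new entry and a matching extractor.
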
# Error Handling (utils/error.py)

- Handles API response errors and file format errors, and ensures that the app falls back smoothly instead of crashing.

# Flask API (main.py)

- Endpoints: /upload accepts resume uploads; the parsed results are displayed in JSON format on the results page.
- UI: Simple interface for uploading resumes and viewing the parsing results.

# Tree map of the program:

    main.py
    ├── Handles API side
    ├── File upload/remove
    ├── Process resumes
    └── Show result

    utils
    ├── fileTotext.py
    │   └── Converts files to text
    │       ├── PDF
    │       ├── DOCX
    │       ├── RTF
    │       ├── ODT
    │       ├── PNG
    │       ├── JPG
    │       └── JPEG
    ├── mistral.py
    │   ├── Mistral API Calls
    │   │   └── Uses Mistral-Nemo-Instruct-2407 model
    │   ├── Personal and Professional Extraction
    │   │   ├── Extracts personal information
    │   │   └── Extracts professional information
    │   └── Fallback Mechanism
    │       └── Uses spaCy NER model if Mistral fails
    └── spacy.py
        ├── Custom Trained Model
        │   └── Uses spaCy model (ner_model_05_3)
        ├── Named Entity Recognition
        │   └── Extracts key information (Name, Email, Contact, etc.)
        └── Validation
            └── Validates emails and contacts

# References:

- [Flask Documentation](https://flask.palletsprojects.com/)
- [spaCy Documentation](https://spacy.io/usage)
- [Hugging Face Hub API](https://huggingface.co/docs/huggingface_hub/index)
- [PyMuPDF (MuPDF) Documentation](https://pymupdf.readthedocs.io/en/latest/)
- [python-docx Documentation](https://python-docx.readthedocs.io/en/latest/)
- [Tesseract OCR Documentation](https://github.com/tesseract-ocr/tesseract)
- [Virtual Environments in Python](https://docs.python.org/3/tutorial/venv.html)