Objective
Normalizer is a tool that processes various file types to extract prompt-response pairs for fine-tuning LLMs on unstructured data.
The intended end state is a GUI where users of any technical level can drag and drop multiple files at once, click a 'create fine-tuning data' button, and receive a .csv containing a system prompt and a series of matched prompt-completion pairs.
Working Repository
https://github.com/anarchy-ai/normalizer
Input Formats
- Text: .txt, .md, .mdx
- Documents: .pdf, .doc, .docx
- Spreadsheets: .xlsx, .csv
- Code: .py, .js, .html, .java
- Images: .jpg, .png (will require OCR)
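As a rough illustration of how the extensions above might be routed to the right handler, here is a minimal sketch. The names `SUPPORTED_EXTENSIONS`, `classify_file`, and `needs_ocr` are hypothetical and are not taken from the repository; the category labels are likewise assumptions.

```python
from pathlib import Path

# Hypothetical mapping from the extensions listed above to a category label.
SUPPORTED_EXTENSIONS = {
    ".txt": "text", ".md": "text", ".mdx": "text",
    ".pdf": "document", ".doc": "document", ".docx": "document",
    ".xlsx": "spreadsheet", ".csv": "spreadsheet",
    ".py": "code", ".js": "code", ".html": "code", ".java": "code",
    ".jpg": "image", ".png": "image",
}


def classify_file(path):
    """Return the input category for a path, or None if the extension is unsupported."""
    # Lowercase the suffix so that "notes.MD" and "notes.md" are treated alike.
    return SUPPORTED_EXTENSIONS.get(Path(path).suffix.lower())


def needs_ocr(path):
    """Images are the only category listed as requiring OCR."""
    return classify_file(path) == "image"
```

Returning `None` for unknown extensions lets the caller decide whether to skip the file or report it to the user.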
Output Format
A .csv file with 3 columns:
| System Prompt | User Prompt | Response |
|---|---|---|
| Hi, how can I help you today? | | |
| | What do these lab results suggest? | These lab results suggest that the patient is healthy, as no anomalous data has been detected. |
| | What is the sentiment of the last 5 customers who came into support chat? | The last five customers have a neutral to positive sentiment. |
| | ... | ... |
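The output layout could be produced with Python's standard `csv` module, as in the sketch below. It assumes the system prompt occupies its own first row and that later rows leave that column empty, which is one reading of the table above. The function name `write_finetuning_csv` and its parameters are hypothetical, not part of the repository.

```python
import csv

# Column names taken from the table header above.
FIELDNAMES = ["System Prompt", "User Prompt", "Response"]


def write_finetuning_csv(stream, system_prompt, pairs):
    """Write a header, one system-prompt row, then one row per (prompt, response) pair."""
    writer = csv.DictWriter(stream, fieldnames=FIELDNAMES)
    writer.writeheader()
    # Assumption: the system prompt appears once, on a row of its own.
    writer.writerow({"System Prompt": system_prompt, "User Prompt": "", "Response": ""})
    for user_prompt, response in pairs:
        writer.writerow(
            {"System Prompt": "", "User Prompt": user_prompt, "Response": response}
        )
```

When writing to disk, open the file with `newline=""` (for example `open("output.csv", "w", newline="", encoding="utf-8")`) so the `csv` module controls line endings. Cells containing commas are quoted automatically.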
Required Libraries
pip install PyPDF2 python-docx pandas openpyxl pillow pytesseract beautifulsoup4 transformers datasets
Project Structure
normalizer/
│
├── src/
│   ├── __init__.py
│   ├── file_ingest.py
│   ├── prompt_extractor.py
│   └── main.py
│
├── app.py
├── requirements.txt
├── README.md
└── .gitignore