Objective

Normalizer is a tool that processes various file types to extract prompt-response pairs for fine-tuning LLMs on unstructured data.

The end state will be a GUI where users of any technical level can drag and drop multiple files at once into the application and click a 'create fine-tuning data' button. They will then receive a .csv file containing a system prompt and a series of matched prompt-completion pairs.

Working Repository

https://github.com/anarchy-ai/normalizer

Input Formats

  • Text: .txt, .md, .mdx
  • Documents: .pdf, .doc, .docx
  • Spreadsheets: .xlsx, .csv
  • Code: .py, .js, .html, .java
  • Images: .jpg, .png (will require OCR)
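
As a rough illustration of how the ingest step might route files by type, the sketch below maps each extension from the list above to its category. The names `INPUT_FORMATS`, `classify_file`, and `group_by_category` are hypothetical and are not taken from the repository; the actual `file_ingest.py` may be organized differently.

```python
from pathlib import Path

# Extension-to-category table copied from the "Input Formats" list.
# The dictionary name and shape are assumptions made for this sketch.
INPUT_FORMATS = {
    "text": {".txt", ".md", ".mdx"},
    "document": {".pdf", ".doc", ".docx"},
    "spreadsheet": {".xlsx", ".csv"},
    "code": {".py", ".js", ".html", ".java"},
    "image": {".jpg", ".png"},  # images would need OCR downstream
}


def classify_file(path):
    """Return the input category for a path, or None if it is unsupported."""
    suffix = Path(path).suffix.lower()  # case-insensitive match on the extension
    for category, extensions in INPUT_FORMATS.items():
        if suffix in extensions:
            return category
    return None


def group_by_category(paths):
    """Group a batch of dropped files by category, skipping unsupported ones."""
    groups = {}
    for path in paths:
        category = classify_file(path)
        if category is not None:
            groups.setdefault(category, []).append(path)
    return groups


batch = ["notes.MD", "labs.pdf", "scan.png", "archive.zip"]
grouped = group_by_category(batch)
```

Grouping a whole batch at once mirrors the drag-and-drop workflow described in the objective: each category can then be handed to its own reader, with images going through OCR first.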

Output Format

A .csv file with 3 columns:

| System Prompt | User Prompt | Response |
| --- | --- | --- |
| Hi, how can I help you today? | | |
| | What do these lab results suggest? | These lab results suggest that the patient is healthy, as no anomalous data has been detected. |
| | What is the sentiment of the last 5 customers who came into support chat? | The last five customers have a neutral to positive sentiment. |
| | ... | ... |
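
A minimal sketch of writing this output with Python's standard `csv` module follows. The function name `write_finetuning_csv` is hypothetical, and the row layout (system prompt alone on the first data row, followed by one row per prompt-response pair) is an assumption inferred from the example, not a confirmed specification.

```python
import csv
import io

# Column names taken from the output format described above.
HEADER = ["System Prompt", "User Prompt", "Response"]


def write_finetuning_csv(stream, system_prompt, pairs):
    """Write a header, one system-prompt row, then one row per pair.

    `stream` is any writable text file object. The layout is an assumption
    based on the example rows, not the project's confirmed format.
    """
    writer = csv.writer(stream)
    writer.writerow(HEADER)
    writer.writerow([system_prompt, "", ""])
    for user_prompt, response in pairs:
        writer.writerow(["", user_prompt, response])


# Write to an in-memory buffer, then read it back to check the round trip.
buffer = io.StringIO()
write_finetuning_csv(
    buffer,
    "Hi, how can I help you today?",
    [
        (
            "What do these lab results suggest?",
            "These lab results suggest that the patient is healthy, "
            "as no anomalous data has been detected.",
        ),
    ],
)
rows = list(csv.reader(io.StringIO(buffer.getvalue())))
```

Using `csv.writer` rather than joining strings by hand matters here because responses often contain commas, which the writer quotes automatically. When writing to a real file, open it with `newline=""` as the `csv` documentation recommends.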

Required Libraries

pip install PyPDF2 python-docx pandas openpyxl pillow pytesseract beautifulsoup4 transformers datasets

Project Structure

normalizer/
│
├── src/
│   ├── __init__.py
│   ├── file_ingest.py
│   ├── prompt_extractor.py
│   └── main.py
│
├── app.py
├── requirements.txt
├── README.md
└── .gitignore