# AskMyPDF

A comprehensive solution to query your PDFs using modern LLM-based techniques. AskMyPDF extracts and embeds PDF contents into a vector store, enabling natural-language queries and context-rich answers. By leveraging LangChain, FAISS, and HuggingFace embeddings, it provides flexible, fast semantic search over document chunks.

## Features

- **PDF Parsing & Splitting:**
  Automatically load PDF content and break it into chunks sized for all-MiniLM embeddings.

- **Semantic Embeddings & Vector Store:**
  Use `sentence-transformers/all-MiniLM-L6-v2` embeddings to represent text as vectors, stored in FAISS for efficient similarity search.

- **Few-Shot Prompting & Structured Answers:**
  Integrate few-shot examples to guide the model toward a specific output format, returning answers as structured JSON.

- **Chain Orchestration with LangChain:**
  Utilize LangChain's `LLMChain` and prompt templates for controlled, reproducible queries.

- **Token-Safe Implementation:**
  Custom token splitting and truncation ensure input fits within model token limits, avoiding errors.
## Installation

This project requires **Python 3.11**. We recommend using a virtual environment to keep dependencies isolated.

1. **Clone the repository**

   ```bash
   git clone https://github.com/yourusername/AskMyPDF.git
   cd AskMyPDF
   ```

2. **Set up a Python 3.11 environment** (optional but recommended)

   ```bash
   python3.11 -m venv venv
   source venv/bin/activate
   ```

3. **Install dependencies**

   ```bash
   pip install --upgrade pip
   pip install -r requirements.txt
   ```

4. **Run the app**

   ```bash
   gradio app.py
   ```

## Output

The system will:

- Parse and split the PDF into token-limited chunks.
- Embed the chunks using all-MiniLM embeddings.
- Store them in a FAISS index.
- Retrieve the top chunks relevant to your query.
- Use the language model to produce a final JSON-structured answer.
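As a rough illustration of the retrieval step, the sketch below brute-forces cosine similarity over a few hand-written 2-D vectors. In the actual app, all-MiniLM produces the vectors and FAISS performs the search; everything here is a dependency-free stand-in:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, chunk_vecs, k=2):
    """Indices of the k chunk vectors most similar to the query."""
    ranked = sorted(range(len(chunk_vecs)),
                    key=lambda i: cosine(query_vec, chunk_vecs[i]),
                    reverse=True)
    return ranked[:k]

# Hypothetical embeddings for three chunks and one query.
chunk_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
hits = top_k([1.0, 0.05], chunk_vecs, k=2)  # chunks 0 and 1 are closest
```

Only the retrieved chunks are then passed to the LLM, which keeps the prompt within token limits.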

## Implementation Details

- **Token-Based Splitting:**
  We tokenize the PDF text using Hugging Face's `AutoTokenizer` for the all-MiniLM model. By maintaining a `chunk_size` and `chunk_overlap`, and truncating at the embedding stage, we ensure the embedding model's maximum token length is respected.

- **Vector Store & Retrieval:**
  With FAISS indexing, similarity search is fast and scalable. Queries are answered by referencing only the relevant chunks, ensuring context-aware responses.

- **Few-Shot Prompting:**
  The prompt includes a few-shot example demonstrating how the model should respond with a JSON-formatted answer, guiding the LLM to produce consistent, machine-readable output.

- **Chain Invocation:**
  Instead of `chain.run()`, we use `chain.invoke({})`. This approach is more flexible and allows parameters to be passed in a structured manner if needed later.
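The token-based splitting described above boils down to a sliding window over token ids. The sketch below uses whitespace tokens as a stand-in for the all-MiniLM `AutoTokenizer` output; the windowing logic is identical for real token ids:

```python
def chunk_tokens(tokens, chunk_size, chunk_overlap):
    """Split a token list into overlapping windows of at most chunk_size."""
    if chunk_overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # last window already reaches the end of the text
    return chunks

text = "one two three four five six seven eight nine ten"
tokens = text.split()  # stand-in for tokenizer.encode(text)
# Each chunk shares its first token with the previous chunk's last token.
chunks = chunk_tokens(tokens, chunk_size=4, chunk_overlap=1)
```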
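The few-shot prompting idea can be sketched as plain string assembly; the template, field names, and example below are illustrative rather than the exact prompt AskMyPDF uses:

```python
import json

# One worked example showing the model the expected JSON shape.
FEW_SHOT = (
    'Context: "The warranty lasts 24 months."\n'
    'Question: How long is the warranty?\n'
    'Answer: {"answer": "24 months", "confidence": "high"}\n'
)

def build_prompt(context, question):
    """Assemble an instruction, the few-shot example, and the real query."""
    return (
        "Answer using only the context. Respond in JSON.\n\n"
        f"Example:\n{FEW_SHOT}\n"
        f'Context: "{context}"\nQuestion: {question}\nAnswer:'
    )

prompt = build_prompt("The device weighs 1.2 kg.", "What is the weight?")

# The model's reply is then parsed as JSON:
reply = '{"answer": "1.2 kg", "confidence": "high"}'  # simulated LLM output
parsed = json.loads(reply)
```

Because the example pins down the output shape, downstream code can rely on `json.loads` succeeding rather than scraping free-form text.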

## Improvements

- **Multi-File Support:**
  - Extend the script to handle multiple PDFs at once.
  - Aggregate or differentiate embeddings by metadata so queries can target specific documents or sections.

- **Model Agnosticism:**
  - Easily switch embeddings or language models.
  - Try different Sentence Transformers models or local LLMs such as LLaMA or Falcon.

- **User Interface:**
  - Add a simple command-line interface alongside the Gradio web UI (or a Streamlit variant) for a more flexible querying experience.

- **Caching & Persistence:**
  - Store FAISS indexes on disk for instant reloads without re-embedding.
  - Cache embeddings and query results to speed up repeated queries.

- **Advanced Prompt Engineering:**
  - Experiment with different few-shot examples, system messages, and instructions to improve answer quality and formatting.
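The caching idea above can start small: persist computed embeddings so repeated runs skip re-embedding (LangChain's FAISS wrapper also offers `save_local`/`load_local` for the index itself). This sketch caches a text-to-vector map with `pickle`; the embedding function and path are illustrative:

```python
import os
import pickle
import tempfile

def embed(texts):
    """Stand-in for the real embedding model: one fake dimension per text."""
    return {t: [float(len(t))] for t in texts}

def load_or_embed(texts, path):
    """Return cached vectors from disk if present, else embed and cache."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    vectors = embed(texts)
    with open(path, "wb") as f:
        pickle.dump(vectors, f)
    return vectors

cache_path = os.path.join(tempfile.mkdtemp(), "embeddings.pkl")
first = load_or_embed(["alpha", "beta"], cache_path)   # computes and writes
second = load_or_embed(["alpha", "beta"], cache_path)  # served from disk
```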

With AskMyPDF, harness the power of LLMs and embeddings to transform your PDFs into a fully interactive, queryable knowledge source.