---

# AskMyPDF
A comprehensive solution for querying your PDFs with modern LLM-based techniques. The tool extracts PDF content and embeds it into a vector store, enabling natural-language queries with context-rich answers. Built on LangChain, FAISS, and Hugging Face embeddings, it provides fast, flexible semantic search over document chunks.

## Features

- **PDF Parsing & Splitting:**
  Automatically loads PDF content and splits it into chunks sized for all-MiniLM embeddings.

- **Semantic Embeddings & Vector Store:**
  Uses `sentence-transformers/all-MiniLM-L6-v2` embeddings to represent text as vectors, with FAISS storage for efficient similarity search.

- **Few-Shot Prompting & Structured Answers:**
  Integrates few-shot examples to guide the model toward a specific output format, returning answers as structured JSON.

- **Chain Orchestration with LangChain:**
  Uses LangChain’s `LLMChain` and prompt templates for controlled, reproducible queries.

- **Token-Safe Implementation:**
  Custom token splitting and truncation keep inputs within model token limits, avoiding overflow errors.

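The few-shot, JSON-structured answering idea can be illustrated with a minimal, self-contained sketch. This is plain Python with no LLM call; the prompt wording and the `answer`/`source` field names are hypothetical stand-ins, not the project's actual template:

```python
import json

# A few-shot example demonstrating the desired JSON answer format
# (hypothetical; the project's real template may differ).
FEW_SHOT = (
    "Context: The invoice total is $120, due March 1.\n"
    "Question: When is the invoice due?\n"
    'Answer: {"answer": "March 1", "source": "invoice"}\n'
)

def build_prompt(context: str, question: str) -> str:
    """Assemble a few-shot prompt that nudges the LLM toward JSON output."""
    return (
        "Answer the question from the context. Respond ONLY with JSON "
        'of the form {"answer": ..., "source": ...}.\n\n'
        f"{FEW_SHOT}\n"
        f"Context: {context}\nQuestion: {question}\nAnswer: "
    )

def parse_answer(raw: str) -> dict:
    """Parse the model's reply into a machine-readable dict."""
    return json.loads(raw)

prompt = build_prompt("The warranty lasts 24 months.", "How long is the warranty?")
# A real LLM call would go here; we parse a representative reply instead.
reply = '{"answer": "24 months", "source": "context"}'
parsed = parse_answer(reply)
```

Because the few-shot example shows the exact JSON shape expected, the model's replies can be consumed directly with `json.loads` downstream.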
## Installation

This project requires **Python 3.11**. We recommend a virtual environment to keep dependencies isolated.

1. **Clone the repository**
   ```bash
   git clone https://github.com/yourusername/AskMyPDF.git
   cd AskMyPDF
   ```

2. **Set up a Python 3.11 environment** (optional but recommended)
   ```bash
   python3.11 -m venv venv
   source venv/bin/activate
   ```

3. **Install dependencies**
   ```bash
   pip install --upgrade pip
   pip install -r requirements.txt
   ```

4. **Run the app**
   ```bash
   gradio app.py
   ```

## Output

The system will:

- Parse and split the PDF into token-limited chunks.
- Embed the chunks using all-MiniLM embeddings.
- Store the embeddings in a FAISS index.
- Retrieve the chunks most relevant to your query.
- Use the language model to produce a final JSON-structured answer.

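The retrieval step above can be sketched in pure Python. This toy uses a bag-of-words "embedding" and cosine similarity as a stand-in for the real all-MiniLM vectors and FAISS index; the chunk texts are made up for illustration:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; the real project uses all-MiniLM vectors.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(chunks: list[str], query: str, k: int = 2) -> list[str]:
    # Rank chunks by similarity to the query, as a FAISS similarity search would.
    qv = embed(query)
    return sorted(chunks, key=lambda c: cosine(embed(c), qv), reverse=True)[:k]

chunks = [
    "The refund policy allows returns within 30 days.",
    "Shipping takes five business days.",
    "Refunds are issued to the original payment method.",
]
top = retrieve(chunks, "How do refunds work?", k=2)
```

Only the top-ranked chunks are passed to the LLM, which keeps the final prompt small and the answer grounded in relevant context.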
## Implementation Details

- **Token-Based Splitting:**
  The PDF text is tokenized with Hugging Face’s `AutoTokenizer` for the all-MiniLM model. Maintaining a `chunk_size` and `chunk_overlap`, plus truncation at the embedding stage, ensures the embedding model’s maximum token length is respected.
- **Vector Store & Retrieval:**
  FAISS indexing makes similarity search fast and scalable. Queries are answered by referencing only the relevant chunks, keeping responses context-aware.
- **Few-Shot Prompting:**
  The prompt includes a few-shot example demonstrating how the model should respond with a JSON-formatted answer, guiding the LLM toward consistent, machine-readable output.
- **Chain Invocation:**
  Instead of `chain.run()`, we use `chain.invoke({})`, which is more flexible and allows parameters to be passed in a structured way if needed later.

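The token-based splitting described above can be sketched as a sliding window with overlap. This is a minimal stand-in: a whitespace split plays the role of `AutoTokenizer`, and the chunk sizes are illustrative, not the project's actual settings:

```python
def tokenize(text: str) -> list[str]:
    # Stand-in for AutoTokenizer.tokenize(text) from Hugging Face transformers.
    return text.split()

def split_tokens(tokens: list[str], chunk_size: int = 8,
                 chunk_overlap: int = 2) -> list[list[str]]:
    """Slide a fixed-size window so adjacent chunks share `chunk_overlap` tokens."""
    step = chunk_size - chunk_overlap
    return [tokens[i:i + chunk_size]
            for i in range(0, max(len(tokens) - chunk_overlap, 1), step)]

text = "one two three four five six seven eight nine ten eleven twelve"
chunks = split_tokens(tokenize(text), chunk_size=6, chunk_overlap=2)
# Every chunk fits within the size limit, and neighbours overlap by two tokens,
# so sentences cut at a chunk boundary still appear intact in one chunk.
```

The overlap is what prevents information loss at chunk boundaries; truncation at the embedding stage then acts as a final safety net against over-length inputs.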
## Improvements

- **Multi-File Support:**
  - Extend the script to handle multiple PDFs at once.
  - Aggregate or differentiate embeddings by metadata so queries can target specific documents or sections.
- **Model Agnosticism:**
  - Easily switch embeddings or language models.
  - Try different Sentence Transformers models or local LLMs such as LLaMA or Falcon.
- **User Interface:**
  - Add a simple command-line interface or a web UI (e.g., Streamlit or Gradio) for a more user-friendly querying experience.
- **Caching & Persistence:**
  - Store FAISS indexes on disk for instant reloads without re-embedding.
  - Cache embeddings and query results to speed up repeated queries.
- **Advanced Prompt Engineering:**
  - Experiment with different few-shot examples, system messages, and instructions to improve answer quality and formatting.

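The caching idea can be sketched with a small on-disk embedding cache. LangChain's FAISS wrapper offers `save_local()`/`load_local()` for persisting whole indexes; the sketch below instead shows the per-chunk caching concept with plain JSON, and its `embed` function is a made-up stand-in for a real embedding call:

```python
import hashlib
import json
import os
import tempfile

def embed(text: str) -> list[float]:
    # Hypothetical stand-in for a real embedding model call.
    return [float(len(w)) for w in text.split()]

def cached_embed(text: str, cache_path: str) -> list[float]:
    """Return the embedding for `text`, computing it only on a cache miss."""
    key = hashlib.sha256(text.encode()).hexdigest()
    cache = {}
    if os.path.exists(cache_path):
        with open(cache_path) as f:
            cache = json.load(f)
    if key not in cache:
        cache[key] = embed(text)          # miss: compute and persist
        with open(cache_path, "w") as f:
            json.dump(cache, f)
    return cache[key]

path = os.path.join(tempfile.mkdtemp(), "emb_cache.json")
v1 = cached_embed("hello world", path)    # computes and stores
v2 = cached_embed("hello world", path)    # served from disk, no re-embedding
```

Keying the cache on a content hash means edits to a PDF invalidate only the chunks that actually changed.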
With AskMyPDF, harness the power of LLMs and embeddings to transform your PDFs into a fully interactive, queryable knowledge source.