document-qa-engine documentation
License: Apache 2.0 · PyPI:
pip install document-qa-engine
A Python library and Streamlit application for Question/Answering on scientific PDF documents using Retrieval-Augmented Generation (RAG). It uses GROBID for structured text extraction, ChromaDB for vector storage, and any OpenAI-compatible LLM for answering.
Overview
Most PDF Q/A tools feed raw extracted text to an LLM, which is noisy and loses document structure. document-qa-engine takes a different approach:
- **Structured extraction**: Sends the PDF to a GROBID server, which returns TEI-XML with separate sections (title, abstract, body paragraphs, figures, back matter) and precise bounding-box coordinates for every paragraph.
- **Smart chunking**: Paragraphs can be kept as-is or merged into larger chunks using token-aware merging, while preserving coordinate metadata.
- **Vector embeddings**: Each chunk is embedded (via a remote API or local model) and stored in an in-memory ChromaDB collection.
- **Retrieval + LLM answering**: User questions are embedded, the most similar chunks are retrieved, and an LLM generates an answer from that context.
- **PDF highlighting**: The Streamlit frontend highlights the exact PDF regions the LLM used, with a color gradient (orange = most relevant, blue = least relevant).
- **NER post-processing (optional)**: LLM responses are scanned for physical quantities (via grobid-quantities) and materials mentions (via grobid-superconductors), then annotated inline.
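The retrieval step can be illustrated with plain cosine similarity over embedded chunks. This is a toy sketch with made-up helper names and two-dimensional vectors; the library itself delegates storage and similarity search to ChromaDB:

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, chunk_vecs, k=2):
    # Rank chunk indices by similarity to the query and keep the best k.
    scored = [(cosine(query_vec, v), i) for i, v in enumerate(chunk_vecs)]
    scored.sort(reverse=True)
    return [i for _, i in scored[:k]]

chunks = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
print(top_k([1.0, 0.05], chunks, k=2))  # → [0, 1]
```

The retrieved chunk indices map back to the paragraph coordinates GROBID extracted, which is what enables the PDF highlighting described above.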
Installation
Option 1: PyPI (library only)
pip install document-qa-engine
Option 2: From source (full app)
git clone https://github.com/lfoppiano/document-qa.git
cd document-qa
pip install -r requirements.txt
Option 3: Docker
# Latest stable release
docker run -p 8501:8501 lfoppiano/document-insights-qa:latest
# Latest development build
docker run -p 8501:8501 lfoppiano/document-insights-qa:latest-develop
Prerequisites
You need access to:
| Service | Required? | Purpose |
|---|---|---|
| GROBID server | Yes | Parses PDFs into structured text |
| Embedding API | Yes | Converts text to vectors |
| LLM API (OpenAI-compatible) | Yes | Answers questions |
| grobid-quantities | Optional | NER for measurements |
| grobid-superconductors | Optional | NER for materials |
Configuration
All configuration is through environment variables. Create a .env file in the project root:
# ── LLM Endpoints ────────────────────────────────────────
# Each key in API_MODELS maps a model name to its base URL.
PHI_URL=http://localhost:1234/v1 # Phi-4-mini-instruct endpoint
QWEN_URL=http://localhost:1234/v1 # Qwen3-0.6B endpoint
API_KEY=your-llm-api-key # Auth key for LLM APIs
# ── Embedding Endpoint ───────────────────────────────────
EMBEDS_URL=http://127.0.0.1:1234/v1 # Embedding service URL
EMBEDS_API_KEY=your-embedding-api-key # Auth key for embedding API
# ── Defaults ─────────────────────────────────────────────
DEFAULT_MODEL=microsoft/Phi-4-mini-instruct
DEFAULT_EMBEDDING=intfloat/multilingual-e5-large-instruct-modal
# ── GROBID Services ──────────────────────────────────────
GROBID_URL=https://your-grobid-url
GROBID_QUANTITIES_URL=https://your-grobid-quantities-url/
GROBID_MATERIALS_URL=https://your-grobid-superconductors-url/
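The app reads these variables from the environment at startup. A minimal sketch of how the model-to-endpoint mapping might be assembled with `os.environ` (the `API_MODELS` name comes from the comment above; the Qwen model identifier and the fallback defaults here are assumptions):

```python
import os

# Map each model name to its base URL, falling back to a local endpoint
# when the corresponding variable is unset.
API_MODELS = {
    "microsoft/Phi-4-mini-instruct": os.environ.get("PHI_URL", "http://localhost:1234/v1"),
    "Qwen/Qwen3-0.6B": os.environ.get("QWEN_URL", "http://localhost:1234/v1"),
}
api_key = os.environ.get("API_KEY", "")
default_model = os.environ.get("DEFAULT_MODEL", "microsoft/Phi-4-mini-instruct")
print(API_MODELS[default_model])
```

In the actual app a `.env` file would typically be loaded first (e.g. with python-dotenv) so these variables are present in `os.environ`.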
Variable Reference
| Variable | Description |
|---|---|
| `PHI_URL` | Base URL for the Phi-4-mini-instruct vLLM server (OpenAI-compatible) |
| `QWEN_URL` | Base URL for the Qwen3-0.6B vLLM server (OpenAI-compatible) |
| `API_KEY` | Bearer token for authenticating with the LLM endpoints |
| `EMBEDS_URL` | Base URL for the embedding service (must expose /embeddings endpoint) |
| `EMBEDS_API_KEY` | Bearer token for authenticating with the embedding service |
| `DEFAULT_MODEL` | Model name pre-selected in the UI dropdown |
| `DEFAULT_EMBEDDING` | Embedding name pre-selected in the UI dropdown |
| `GROBID_URL` | Full URL to a running GROBID server |
| `GROBID_QUANTITIES_URL` | URL to a grobid-quantities server (for measurement NER) |
| `GROBID_MATERIALS_URL` | URL to a grobid-superconductors server (for materials NER) |
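Since a malformed URL in one of these variables only surfaces later as a `MissingSchema` error (see Troubleshooting), a startup check can fail fast. This is a hypothetical helper, not part of the library:

```python
REQUIRED = ["PHI_URL", "QWEN_URL", "EMBEDS_URL", "GROBID_URL"]

def check_urls(env):
    # Return the names of required URL variables that are missing
    # or lack an http(s) scheme.
    bad = []
    for name in REQUIRED:
        value = env.get(name, "")
        if not value.startswith(("http://", "https://")):
            bad.append(name)
    return bad

print(check_urls({"PHI_URL": "http://localhost:1234/v1"}))
# → ['QWEN_URL', 'EMBEDS_URL', 'GROBID_URL']
```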
Quick Start – Streamlit App
# 1. Set up environment
cp .env.example .env # Edit with your endpoints
# 2. Run the app
streamlit run streamlit_app.py
Then open http://localhost:8501, upload a PDF, and ask questions.
Quick Start – As a Python Library
from langchain_openai import ChatOpenAI
from document_qa.custom_embeddings import ModalEmbeddings
from document_qa.document_qa_engine import DocumentQAEngine, DataStorage
# 1. Set up the LLM
llm = ChatOpenAI(
model="microsoft/Phi-4-mini-instruct",
temperature=0.0,
base_url="http://localhost:1234/v1",
api_key="your-api-key"
)
# 2. Set up embeddings
embeddings = ModalEmbeddings(
url="http://localhost:1234/v1",
model_name="intfloat/multilingual-e5-large-instruct",
api_key="your-embedding-key"
)
# 3. Create the storage and engine
storage = DataStorage(embeddings)
engine = DocumentQAEngine(
llm=llm,
data_storage=storage,
grobid_url="https://lfoppiano-grobid.hf.space/"
)
# 4. Load a PDF (creates in-memory embeddings)
doc_id = engine.create_memory_embeddings(
pdf_path="path/to/paper.pdf",
chunk_size=500 # tokens per chunk (-1 = keep paragraphs)
)
# 5. Ask a question
_, answer, coordinates = engine.query_document(
query="What is the main contribution of this paper?",
doc_id=doc_id,
context_size=10 # number of chunks to use as context
)
print(answer)
# 6. Or just retrieve relevant passages (no LLM)
passages, coordinates = engine.query_storage(
query="What materials were studied?",
doc_id=doc_id,
context_size=5
)
for p in passages:
print(p)
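The `chunk_size` parameter above controls token-aware merging of GROBID paragraphs. A minimal sketch of the idea (`merge_paragraphs` is a hypothetical helper, and whitespace splitting stands in for the library's real tokenizer):

```python
def merge_paragraphs(paragraphs, chunk_size):
    # Greedily merge consecutive paragraphs until adding the next one
    # would exceed chunk_size tokens; -1 keeps paragraphs as-is.
    if chunk_size == -1:
        return list(paragraphs)
    chunks, current, count = [], [], 0
    for p in paragraphs:
        n = len(p.split())
        if current and count + n > chunk_size:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(p)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks

paras = ["one two three", "four five", "six seven eight nine"]
print(merge_paragraphs(paras, 5))
# → ['one two three four five', 'six seven eight nine']
```

Merging paragraph metadata (the bounding-box coordinates) alongside the text is what lets the app still highlight merged chunks in the PDF viewer.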
Streamlit App Features
Query Modes
| Mode | What It Does | When to Use |
|---|---|---|
| LLM Q/A | Retrieves context → sends to LLM → returns a natural language answer | Default – for asking questions |
| Embeddings | Returns the raw text passages most similar to your question | Debugging – to see what context the LLM would receive |
| Question Coefficient | Computes `min_similarity - mean_similarity` as a quality estimate | Experimental – to predict answer reliability |
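The question coefficient can be computed directly from a list of retrieval similarity scores. A sketch of the formula as stated (the library's exact implementation may differ):

```python
def question_coefficient(similarities):
    # Difference between the weakest retrieved match and the mean;
    # values near zero suggest uniformly relevant context, strongly
    # negative values suggest the tail of the context is off-topic.
    mean = sum(similarities) / len(similarities)
    return min(similarities) - mean

print(question_coefficient([0.9, 0.8, 0.7]))
```

For the example scores the result is about -0.1, i.e. the weakest passage sits 0.1 below the average similarity.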
Settings
| Setting | Default | Description |
|---|---|---|
| Chunk size | -1 (paragraphs) |
Token count per text chunk. -1 keeps GROBID paragraphs intact. |
| Context size | 10 (paragraphs) / 4 (chunks) |
Number of chunks sent to the LLM as context |
| Scroll to context | Off | Auto-scroll the PDF viewer to the most relevant passage |
| NER processing | Off | Run grobid-quantities + grobid-superconductors on LLM responses |
PDF Annotations
After each query, the PDF viewer highlights the passages used as context:
- Orange (warm) = most relevant passage
- Blue (cold) = least relevant passage
- Dotted border = the single most relevant passage
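The warm-to-cold gradient could be produced by interpolating between orange and blue by relevance rank. An illustrative sketch only, not the app's actual palette code:

```python
def rank_to_color(rank, total):
    # Linearly interpolate from orange (most relevant, rank 0)
    # to blue (least relevant, rank total-1) in RGB space.
    orange, blue = (255, 165, 0), (0, 0, 255)
    t = rank / (total - 1) if total > 1 else 0.0
    return tuple(round(o + t * (b - o)) for o, b in zip(orange, blue))

print(rank_to_color(0, 5))  # → (255, 165, 0), most relevant
print(rank_to_color(4, 5))  # → (0, 0, 255), least relevant
```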
Troubleshooting
SQLite version error
streamlit: Your system has an unsupported version of sqlite3.
Chroma requires sqlite3 >= 3.35.0.
Linux fix: See this StackOverflow answer. More info: Chroma troubleshooting docs.
"The information is not provided in the given context"
The LLM couldn't find the answer in the retrieved passages. Try:
- Increase context size – use the sidebar slider to retrieve more passages
- Decrease chunk size – smaller chunks may match more precisely
- Use Embeddings mode – switch to "Embeddings" query mode to see what passages are being retrieved and verify they contain the answer
MissingSchema error on embeddings
requests.exceptions.MissingSchema: Invalid URL
Ensure EMBEDS_URL in your .env starts with https:// or http://. Example:
EMBEDS_URL=https://your-modal-endpoint.modal.run/v1
GROBID connection errors
Make sure your GROBID server is running and accessible:
curl https://grobid.hf.space/api/isalive
If using a local GROBID instance:
docker run --rm -p 8070:8070 lfoppiano/grobid:0.8.0
# Then set GROBID_URL=http://localhost:8070
Embedding API returning empty results
- Verify the API is running: `curl {EMBEDS_URL}/embeddings`
- Check that `EMBEDS_API_KEY` matches the server's expected key
- Ensure the URL does not have a trailing `/embeddings` (the client appends it automatically)