Spaces: Sleeping

Geoffrey Kip committed · 7c4c603 · 1 Parent(s): c5b629d

Docs & Cleanup: Update README with Docker info and remove legacy code

Browse files:
- README.md +16 -2
- ct_agent_app.py +3 -5
- modules/cohort_tools.py +1 -3
README.md CHANGED

@@ -87,6 +87,7 @@ The agent is equipped with specialized tools to handle different types of requests.
 
 ## How It Works (RAG Pipeline)
 
+### Ingestion Pipeline
 1. **Ingestion**: `ingest_ct.py` fetches study data from ClinicalTrials.gov. It extracts rich text (including **Eligibility Criteria** and **Interventions**) and structured metadata. It uses **multiprocessing** for speed.
 2. **Embedding**: Text is converted into vector embeddings using `PubMedBERT` and stored in **LanceDB**.
 3. **Retrieval**:
@@ -94,10 +95,23 @@ The agent is equipped with specialized tools to handle different types of requests.
    * **Pre-Filtering**: Strict filters (Status, Year, Sponsor) reduce the search scope.
    * **Hybrid Search**: Parallel **Vector Search** (Semantic) and **BM25** (Keyword) combined via **LanceDB Native Hybrid Search**.
    * **Post-Filtering**: Additional metadata checks (Phase, Intervention) on retrieved candidates.
-   * **Re-Ranking**: Cross-Encoder re-scoring.
+   * **Re-Ranking**: Cross-Encoder re-scoring (Cached for performance).
 4. **Synthesis**: **Google Gemini** synthesizes the final answer.
 
-
+## Docker Deployment Structure
+
+The application is containerized for easy deployment to Hugging Face Spaces or any Docker-compatible environment.
+
+### Dockerfile Breakdown
+* **Base Image**: `python:3.10-slim` (Lightweight and secure).
+* **Dependencies**: Installs system tools (`build-essential`, `git`) and Python packages from `requirements.txt`.
+* **Port**: Exposes port `8501` for Streamlit.
+* **Entrypoint**: Runs `streamlit run ct_agent_app.py`.
+
+### Recent Updates
+* **RAG Optimization**: Implemented a **Cached Reranker** and reduced retrieval candidates (`TOP_K=200`) for 2-3x faster search performance.
+* **Enhanced Analytics**: Added support for grouping by **Country** and **State** in the Analytics Engine.
+* **Dynamic Configuration**: Improved API key handling for secure, multi-user sessions.
 
 ```mermaid
 graph TD
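The "Dockerfile Breakdown" bullets above map to roughly the following file. This is a hedged sketch reconstructed from the README bullets only; the actual Dockerfile is not shown in this commit and may differ:

```dockerfile
# Sketch assembled from the README's breakdown; not the repo's actual file.
FROM python:3.10-slim

# System tools some Python wheels need at build time
RUN apt-get update && apt-get install -y --no-install-recommends \
        build-essential git \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .

# Streamlit's default port, which Hugging Face Spaces expects to be exposed
EXPOSE 8501
ENTRYPOINT ["streamlit", "run", "ct_agent_app.py", \
            "--server.port=8501", "--server.address=0.0.0.0"]
```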
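The hybrid retrieval step in the README combines a semantic (vector) ranking with a keyword (BM25) ranking. LanceDB's native hybrid search handles the merge internally; as a rough, self-contained illustration of the underlying idea, reciprocal rank fusion (RRF) combines two ranked lists. All names below are hypothetical stand-ins, not code from this repo:

```python
# Hypothetical sketch of rank fusion, the idea behind hybrid search.
# The actual pipeline uses LanceDB's built-in hybrid query instead.

def reciprocal_rank_fusion(vector_hits, keyword_hits, k=60):
    """Merge two ranked lists of ids; higher fused score ranks first."""
    scores = {}
    for hits in (vector_hits, keyword_hits):
        for rank, doc_id in enumerate(hits):
            # Each list contributes 1 / (k + rank); k dampens rank-1 dominance.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# A study ranked well by both searches beats one found by only one of them.
vector_hits = ["NCT001", "NCT002", "NCT003"]
keyword_hits = ["NCT003", "NCT001", "NCT004"]
print(reciprocal_rank_fusion(vector_hits, keyword_hits))
# → ['NCT001', 'NCT003', 'NCT002', 'NCT004']
```

In the real pipeline this fused list is what the cross-encoder then re-scores.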
ct_agent_app.py CHANGED

@@ -102,7 +102,6 @@ llm = ChatGoogleGenerativeAI(model="gemini-2.5-flash", temperature=0, google_api
 index = load_index()
 
 
-# 3. Define Agent (Cached)
 # 3. Define Agent (Cached)
 @st.cache_resource
 def get_agent(api_key: str):
@@ -182,7 +181,7 @@ def generate_dashboard_analytics():
     }
 
     # Get values from session state
-    #
+    # Use .get() to avoid KeyErrors if the widget hasn't initialized yet
     g_by = st.session_state.get("dash_group_by", "Sponsor")
     p_filter = st.session_state.get("dash_phase", "")
     s_filter = st.session_state.get("dash_sponsor", "")
@@ -300,7 +299,7 @@ if page == "Chat Assistant":
 if page == "Analytics Dashboard":
     st.header("Global Analytics")
     st.write(
-        "Analyze trends across the entire clinical trial dataset"
+        "Analyze trends across the entire clinical trial dataset."
     )
 
     col1, col2 = st.columns([1, 3])
@@ -508,8 +507,7 @@ if page == "Raw Data":
     st.header("Raw Data Explorer")
     st.write("View and filter the underlying dataset.")
 
-    # Load a sample
-    # We load a sample (top 100) to avoid performance issues.
+    # Load a sample (top 100) to avoid performance issues.
    col_raw_1, col_raw_2 = st.columns([1, 1])
 
    with col_raw_1:
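The `.get()` comment added above is defensive: Streamlit's `st.session_state` is dict-like, and a widget's key only exists once that widget has rendered. A minimal sketch of the pattern, using a plain dict in place of `st.session_state` so it runs anywhere:

```python
# Plain dict standing in for st.session_state (which behaves like a dict).
session_state = {}

# Before the widgets render, their keys are absent; .get() supplies a
# default instead of raising KeyError as session_state["dash_group_by"] would.
g_by = session_state.get("dash_group_by", "Sponsor")
p_filter = session_state.get("dash_phase", "")
print(g_by)       # → Sponsor (the fallback)

session_state["dash_group_by"] = "Country"  # as if the user picked a value
g_by = session_state.get("dash_group_by", "Sponsor")
print(g_by)       # → Country (the stored value now wins)
```

The same two-argument `.get()` works unchanged on the real `st.session_state`.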
modules/cohort_tools.py CHANGED

@@ -27,8 +27,6 @@ def get_llm():
 
     return ChatGoogleGenerativeAI(model="gemini-2.5-flash", temperature=0, google_api_key=api_key)
 
-# Initialize LLM (Dynamic)
-# llm = ChatGoogleGenerativeAI(model="gemini-2.5-flash", temperature=0)
 
 EXTRACT_PROMPT = PromptTemplate(
     template="""
@@ -120,7 +118,7 @@ def get_cohort_sql(nct_id: str) -> str:
         str: A formatted string containing the Extracted Requirements (JSON) and the Generated SQL.
     """
     # 1. Fetch Study Details
-    #
+    # Reuse the existing tool logic to get the text
     study_text = get_study_details.invoke(nct_id)
 
     if "No study found" in study_text:
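The `get_cohort_sql` hunk shows a common tool-composition pattern: reuse an existing retrieval tool, then guard on its sentinel string before doing any expensive LLM work. A stripped-down sketch with hypothetical stand-ins (the repo's real functions call ClinicalTrials.gov and Gemini, and `get_study_details` there is a LangChain tool invoked via `.invoke()`):

```python
def get_study_details(nct_id):
    """Hypothetical stand-in for the repo's study-retrieval tool."""
    studies = {"NCT123": "Eligibility: adults 18-65 with type 2 diabetes."}
    return studies.get(nct_id, f"No study found for {nct_id}")

def get_cohort_sql(nct_id):
    # 1. Fetch study details by reusing the existing tool logic.
    study_text = get_study_details(nct_id)
    # 2. Bail out early on the sentinel, before any LLM call is made.
    if "No study found" in study_text:
        return study_text
    # 3. The real code extracts criteria with the LLM and generates SQL here.
    return f"-- SQL generated from: {study_text}"

print(get_cohort_sql("NCT999"))  # → No study found for NCT999
```

Checking for the sentinel before calling the LLM keeps bad IDs cheap and returns a readable error to the agent.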