Geoffrey Kip committed on
Commit 7c4c603 · 1 Parent(s): c5b629d

Docs & Cleanup: Update README with Docker info and remove legacy code

Files changed (3)
  1. README.md +16 -2
  2. ct_agent_app.py +3 -5
  3. modules/cohort_tools.py +1 -3
README.md CHANGED
@@ -87,6 +87,7 @@ The agent is equipped with specialized tools to handle different types of reques
 
 ## ⚙️ How It Works (RAG Pipeline)
 
+### 🏗️ Ingestion Pipeline
 1. **Ingestion**: `ingest_ct.py` fetches study data from ClinicalTrials.gov. It extracts rich text (including **Eligibility Criteria** and **Interventions**) and structured metadata. It uses **multiprocessing** for speed.
 2. **Embedding**: Text is converted into vector embeddings using `PubMedBERT` and stored in **LanceDB**.
 3. **Retrieval**:
@@ -94,10 +95,23 @@ The agent is equipped with specialized tools to handle different types of reques
    * **Pre-Filtering**: Strict filters (Status, Year, Sponsor) reduce the search scope.
    * **Hybrid Search**: Parallel **Vector Search** (Semantic) and **BM25** (Keyword) combined via **LanceDB Native Hybrid Search**.
    * **Post-Filtering**: Additional metadata checks (Phase, Intervention) on retrieved candidates.
-   * **Re-Ranking**: Cross-Encoder re-scoring.
+   * **Re-Ranking**: Cross-Encoder re-scoring (Cached for performance).
 4. **Synthesis**: **Google Gemini** synthesizes the final answer.
 
-### 🏗️ Ingestion Pipeline
+## 🐳 Docker Deployment Structure
+
+The application is containerized for easy deployment to Hugging Face Spaces or any Docker-compatible environment.
+
+### Dockerfile Breakdown
+* **Base Image**: `python:3.10-slim` (Lightweight and secure).
+* **Dependencies**: Installs system tools (`build-essential`, `git`) and Python packages from `requirements.txt`.
+* **Port**: Exposes port `8501` for Streamlit.
+* **Entrypoint**: Runs `streamlit run ct_agent_app.py`.
+
+### Recent Updates 🚀
+* **RAG Optimization**: Implemented a **Cached Reranker** and reduced retrieval candidates (`TOP_K=200`) for 2-3x faster search performance.
+* **Enhanced Analytics**: Added support for grouping by **Country** and **State** in the Analytics Engine.
+* **Dynamic Configuration**: Improved API key handling for secure, multi-user sessions.
 
 ```mermaid
 graph TD
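The hybrid-search step described in the README fuses a vector (semantic) ranking with a BM25 (keyword) ranking. LanceDB performs this fusion natively; purely as an illustrative sketch (the function name and NCT IDs below are hypothetical), reciprocal rank fusion is one common way such lists are combined:

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked result lists into one ranking.

    Each document scores 1 / (k + rank) per list it appears in;
    documents ranked highly in multiple lists rise to the top.
    """
    scores = {}
    for results in ranked_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["NCT01", "NCT02", "NCT03"]  # semantic-search order (illustrative)
bm25_hits = ["NCT03", "NCT01", "NCT04"]    # keyword-search order (illustrative)
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
```

In this sketch, `NCT01` wins because it ranks well in both lists, which is the intuition behind combining semantic and keyword retrieval before the cross-encoder re-ranking pass.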
ct_agent_app.py CHANGED
@@ -102,7 +102,6 @@ llm = ChatGoogleGenerativeAI(model="gemini-2.5-flash", temperature=0, google_api
 index = load_index()
 
 
-# 3. Define Agent (Cached)
 # 3. Define Agent (Cached)
 @st.cache_resource
 def get_agent(api_key: str):
@@ -182,7 +181,7 @@ def generate_dashboard_analytics():
     }
 
     # Get values from session state
-    # We use .get() to avoid KeyErrors if the widget hasn't initialized yet (though it should have)
+    # Use .get() to avoid KeyErrors if the widget hasn't initialized yet
     g_by = st.session_state.get("dash_group_by", "Sponsor")
     p_filter = st.session_state.get("dash_phase", "")
     s_filter = st.session_state.get("dash_sponsor", "")
@@ -300,7 +299,7 @@ if page == "Chat Assistant":
 if page == "Analytics Dashboard":
     st.header("📊 Global Analytics")
     st.write(
-        "Analyze trends across the entire clinical trial dataset (60,000+ studies)."
+        "Analyze trends across the entire clinical trial dataset."
     )
 
     col1, col2 = st.columns([1, 3])
@@ -508,8 +507,7 @@ if page == "Raw Data":
     st.header("📂 Raw Data Explorer")
     st.write("View and filter the underlying dataset.")
 
-    # Load a sample or full dataset? Full might be slow.
-    # We load a sample (top 100) to avoid performance issues.
+    # Load a sample (top 100) to avoid performance issues.
     col_raw_1, col_raw_2 = st.columns([1, 1])
 
     with col_raw_1:
modules/cohort_tools.py CHANGED
@@ -27,8 +27,6 @@ def get_llm():
 
     return ChatGoogleGenerativeAI(model="gemini-2.5-flash", temperature=0, google_api_key=api_key)
 
-# Initialize LLM (Dynamic)
-# llm = ChatGoogleGenerativeAI(model="gemini-2.5-flash", temperature=0)
 
 EXTRACT_PROMPT = PromptTemplate(
     template="""
@@ -120,7 +118,7 @@ def get_cohort_sql(nct_id: str) -> str:
         str: A formatted string containing the Extracted Requirements (JSON) and the Generated SQL.
     """
     # 1. Fetch Study Details
-    # We reuse the existing tool logic to get the text
+    # Reuse the existing tool logic to get the text
    study_text = get_study_details.invoke(nct_id)
 
     if "No study found" in study_text:
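As its docstring says, `get_cohort_sql` returns extracted eligibility requirements as JSON plus generated SQL; in the real tool an LLM performs both steps. As a minimal sketch of the second half only (the `requirements_to_sql` helper, the requirement keys, and the `patients` schema are all hypothetical), turning a small requirements object into SQL might look like:

```python
import json


def requirements_to_sql(requirements_json: str, table: str = "patients") -> str:
    """Turn extracted eligibility requirements (JSON) into a simple SQL query.

    Illustrative only: the real get_cohort_sql delegates extraction and
    SQL generation to the LLM rather than hand-written rules.
    """
    req = json.loads(requirements_json)
    clauses = []
    if "min_age" in req:
        clauses.append(f"age >= {int(req['min_age'])}")
    if "max_age" in req:
        clauses.append(f"age <= {int(req['max_age'])}")
    for cond in req.get("conditions", []):
        clauses.append(f"diagnosis = '{cond}'")
    where = " AND ".join(clauses) or "1 = 1"
    return f"SELECT patient_id FROM {table} WHERE {where};"


sql = requirements_to_sql('{"min_age": 18, "max_age": 65, "conditions": ["asthma"]}')
```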