Geoffrey Kip committed on
Commit 7c4c603 · 1 Parent(s): c5b629d

Docs & Cleanup: Update README with Docker info and remove legacy code

Files changed (3)
  1. README.md +16 -2
  2. ct_agent_app.py +3 -5
  3. modules/cohort_tools.py +1 -3
README.md CHANGED
@@ -87,6 +87,7 @@ The agent is equipped with specialized tools to handle different types of reques
 
 ## ⚙️ How It Works (RAG Pipeline)
 
+### 🏗️ Ingestion Pipeline
 1. **Ingestion**: `ingest_ct.py` fetches study data from ClinicalTrials.gov. It extracts rich text (including **Eligibility Criteria** and **Interventions**) and structured metadata. It uses **multiprocessing** for speed.
 2. **Embedding**: Text is converted into vector embeddings using `PubMedBERT` and stored in **LanceDB**.
 3. **Retrieval**:
@@ -94,10 +95,23 @@ The agent is equipped with specialized tools to handle different types of reques
    * **Pre-Filtering**: Strict filters (Status, Year, Sponsor) reduce the search scope.
    * **Hybrid Search**: Parallel **Vector Search** (Semantic) and **BM25** (Keyword) combined via **LanceDB Native Hybrid Search**.
    * **Post-Filtering**: Additional metadata checks (Phase, Intervention) on retrieved candidates.
-   * **Re-Ranking**: Cross-Encoder re-scoring.
+   * **Re-Ranking**: Cross-Encoder re-scoring (Cached for performance).
 4. **Synthesis**: **Google Gemini** synthesizes the final answer.
 
-### 🏗️ Ingestion Pipeline
+## 🐳 Docker Deployment Structure
+
+The application is containerized for easy deployment to Hugging Face Spaces or any Docker-compatible environment.
+
+### Dockerfile Breakdown
+* **Base Image**: `python:3.10-slim` (Lightweight and secure).
+* **Dependencies**: Installs system tools (`build-essential`, `git`) and Python packages from `requirements.txt`.
+* **Port**: Exposes port `8501` for Streamlit.
+* **Entrypoint**: Runs `streamlit run ct_agent_app.py`.
+
+### Recent Updates 🚀
+* **RAG Optimization**: Implemented a **Cached Reranker** and reduced retrieval candidates (`TOP_K=200`) for 2-3x faster search performance.
+* **Enhanced Analytics**: Added support for grouping by **Country** and **State** in the Analytics Engine.
+* **Dynamic Configuration**: Improved API key handling for secure, multi-user sessions.
 
 ```mermaid
 graph TD
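The hybrid-search step described in the README fuses a vector (semantic) ranking with a BM25 (keyword) ranking. LanceDB performs this fusion natively; purely as an illustrative sketch (the function name and NCT IDs below are hypothetical), reciprocal rank fusion is one common way such lists are combined:

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked result lists into one ranking.

    Each document scores 1 / (k + rank) per list it appears in;
    documents ranked highly in multiple lists rise to the top.
    """
    scores = {}
    for results in ranked_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["NCT01", "NCT02", "NCT03"]  # semantic-search order (illustrative)
bm25_hits = ["NCT03", "NCT01", "NCT04"]    # keyword-search order (illustrative)
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
```

In this sketch, `NCT01` wins because it ranks well in both lists, which is the intuition behind combining semantic and keyword retrieval before the cross-encoder re-ranking pass.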
ct_agent_app.py CHANGED
@@ -102,7 +102,6 @@ llm = ChatGoogleGenerativeAI(model="gemini-2.5-flash", temperature=0, google_api
 index = load_index()
 
 
-# 3. Define Agent (Cached)
 # 3. Define Agent (Cached)
 @st.cache_resource
 def get_agent(api_key: str):
@@ -182,7 +181,7 @@ def generate_dashboard_analytics():
     }
 
     # Get values from session state
-    # We use .get() to avoid KeyErrors if the widget hasn't initialized yet (though it should have)
+    # Use .get() to avoid KeyErrors if the widget hasn't initialized yet
     g_by = st.session_state.get("dash_group_by", "Sponsor")
     p_filter = st.session_state.get("dash_phase", "")
     s_filter = st.session_state.get("dash_sponsor", "")
@@ -300,7 +299,7 @@ if page == "Chat Assistant":
 if page == "Analytics Dashboard":
     st.header("📊 Global Analytics")
     st.write(
-        "Analyze trends across the entire clinical trial dataset (60,000+ studies)."
+        "Analyze trends across the entire clinical trial dataset."
     )
 
     col1, col2 = st.columns([1, 3])
@@ -508,8 +507,7 @@ if page == "Raw Data":
     st.header("📂 Raw Data Explorer")
     st.write("View and filter the underlying dataset.")
 
-    # Load a sample or full dataset? Full might be slow.
-    # We load a sample (top 100) to avoid performance issues.
+    # Load a sample (top 100) to avoid performance issues.
     col_raw_1, col_raw_2 = st.columns([1, 1])
 
     with col_raw_1:
modules/cohort_tools.py CHANGED
@@ -27,8 +27,6 @@ def get_llm():
 
     return ChatGoogleGenerativeAI(model="gemini-2.5-flash", temperature=0, google_api_key=api_key)
 
-# Initialize LLM (Dynamic)
-# llm = ChatGoogleGenerativeAI(model="gemini-2.5-flash", temperature=0)
 
 EXTRACT_PROMPT = PromptTemplate(
     template="""
@@ -120,7 +118,7 @@ def get_cohort_sql(nct_id: str) -> str:
         str: A formatted string containing the Extracted Requirements (JSON) and the Generated SQL.
     """
     # 1. Fetch Study Details
-    # We reuse the existing tool logic to get the text
+    # Reuse the existing tool logic to get the text
    study_text = get_study_details.invoke(nct_id)
 
     if "No study found" in study_text:
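As its docstring says, `get_cohort_sql` returns extracted eligibility requirements as JSON plus generated SQL; in the real tool an LLM performs both steps. As a minimal sketch of the second half only (the `requirements_to_sql` helper, the requirement keys, and the `patients` schema are all hypothetical), turning a small requirements object into SQL might look like:

```python
import json


def requirements_to_sql(requirements_json: str, table: str = "patients") -> str:
    """Turn extracted eligibility requirements (JSON) into a simple SQL query.

    Illustrative only: the real get_cohort_sql delegates extraction and
    SQL generation to the LLM rather than hand-written rules.
    """
    req = json.loads(requirements_json)
    clauses = []
    if "min_age" in req:
        clauses.append(f"age >= {int(req['min_age'])}")
    if "max_age" in req:
        clauses.append(f"age <= {int(req['max_age'])}")
    for cond in req.get("conditions", []):
        clauses.append(f"diagnosis = '{cond}'")
    where = " AND ".join(clauses) or "1 = 1"
    return f"SELECT patient_id FROM {table} WHERE {where};"


sql = requirements_to_sql('{"min_age": 18, "max_age": 65, "conditions": ["asthma"]}')
```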