Space: dashVector/dashVectorSpace

justmotes committed 7 days ago

Commit

b92d96d

1 Parent(s): 212947c

Deploy dashVectorspace v1 (Full)

Files changed (15)

.gitignore +2 -0
README.md +70 -12
app.py +127 -0
config.py +29 -0
logs/active_learning_queue.jsonl +59 -0
main.py +151 -0
notebooks/xVector_Analysis.ipynb +88 -0
requirements.txt +10 -0
scripts/ingest_ms_marco.py +82 -0
src/__init__.py +0 -0
src/active_learning.py +40 -0
src/comparison.py +95 -0
src/data_pipeline.py +117 -0
src/router.py +132 -0
src/vector_db.py +188 -0

.gitignore ADDED

@@ -0,0 +1,2 @@
+venv/
+__pycache__/

README.md CHANGED

@@ -1,12 +1,70 @@
----
-title: DashVectorSpace
-emoji: 🏆
-colorFrom: purple
-colorTo: pink
-sdk: gradio
-sdk_version: 6.0.2
-app_file: app.py
-pinned: false
----
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

+# dashVectorspace (xVector)
+**Production-Grade Learned Hybrid Retrieval Engine**
+This project implements a vector search engine using a **Learned Router** and **Custom Sharding** on top of **Qdrant**. Instead of running a brute-force search across the entire dataset, it routes each query to a specific data cluster, reducing search work (measured as shards scanned per query) by ~90%.
+## Core Architecture
+1.  **The Brain (Router)**: A Machine Learning model (LightGBM/Logistic/MLP) predicts which cluster contains the answer.
+2.  **The Body (Vector DB)**: Qdrant with **Custom Sharding**. Data is partitioned into 32 Clusters + 1 Freshness Shard.
+3.  **The Optimization**: **Matryoshka Representation Learning (MRL)**. The Router uses sliced 64-dim vectors for speed, while the DB stores full vectors for accuracy.
+## Project Structure
+```
+dashVectorspace/
+├── config.py                   # Configuration (Clusters, Models, Paths)
+├── main.py                     # Benchmark Runner (P&C Matrix)
+├── requirements.txt            # Dependencies
+├── src/
+│   ├── data_pipeline.py        # Data loading & MRL slicing
+│   ├── router.py               # LearnedRouter (Train/Predict)
+│   ├── vector_db.py            # UnifiedQdrant (Custom Sharding)
+│   └── active_learning.py      # Hard Negative Logging
+└── notebooks/
+    └── xVector_Analysis.ipynb  # Analysis Notebook
+```
+## Setup & Usage
+1.  **Install Dependencies**:
+    ```bash
+    pip install -r requirements.txt
+    ```
+2.  **Run Benchmarks**:
+    Execute the main script to run the Permutation & Combination matrix of experiments:
+    ```bash
+    python main.py
+    ```
+    This will:
+    - Generate/Load Data (MS MARCO or Synthetic).
+    - Train different Router models (LightGBM, Logistic, MLP).
+    - Index data into Qdrant with Custom Sharding.
+    - Run test queries and report Accuracy, Latency, and Compute Savings.
+3.  **Analyze Results**:
+    Open `notebooks/xVector_Analysis.ipynb` to visualize the active learning logs and performance metrics.
+## Key Features
+-   **Custom Sharding**: Explicit control over where data lives (Clusters 0-31) and a dedicated **Freshness Shard (999)** for new data.
+-   **Drift Defense**:
+    -   **Layer 1**: Always searches the Freshness Shard.
+    -   **Layer 2**: Falls back to **Global Search** if Router confidence is low (< 0.5).
+-   **Active Learning**: Logs "Hard Negatives" (low confidence or zero results) to `logs/active_learning_queue.jsonl` for future model retraining.
+## Hugging Face Space Deployment
+This project is ready for deployment on Hugging Face Spaces.
+1.  **Create a New Space**: Select "Gradio" as the SDK.
+2.  **Upload Files**: Upload the contents of the `dashVectorspace` folder to the Space.
+3.  **Set Secrets**: Go to "Settings" -> "Repository secrets" and add:
+    -   `QDRANT_URL`: Your Qdrant Cloud Cluster URL.
+    -   `QDRANT_API_KEY`: Your Qdrant Cloud API Key.
+4.  **Ingest Data**:
+    -   Run `python scripts/ingest_ms_marco.py` locally (with env vars set) to populate your Qdrant Cloud instance and generate `models/router_v1.pkl`.
+    -   **Upload `models/router_v1.pkl`** to the Space (inside a `models/` folder).
+5.  **Run**: The Space will automatically launch `app.py`.
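
Review note: the two Drift Defense layers described in the README can be read as a small shard-selection rule. The sketch below is a hypothetical illustration, not the code in `src/router.py` or `src/vector_db.py` (neither is shown here). `plan_search`, the mode names, and `CONFIDENCE_THRESHOLD` are assumed names; the constants mirror the values stated in the README and `config.py`.

```python
from typing import List, Tuple

NUM_CLUSTERS = 32            # value stated in the README and config.py
FRESHNESS_SHARD_ID = 999     # value stated in the README and config.py
CONFIDENCE_THRESHOLD = 0.5   # README: global fallback below this value


def plan_search(cluster_id: int, confidence: float) -> Tuple[str, List[int]]:
    """Hypothetical helper: choose which shards to query for one request."""
    if confidence < CONFIDENCE_THRESHOLD:
        # Layer 2: the router is unsure, so scan every cluster plus freshness.
        return "global", list(range(NUM_CLUSTERS)) + [FRESHNESS_SHARD_ID]
    # Layer 1: the freshness shard is always searched next to the target cluster.
    return "routed", [cluster_id, FRESHNESS_SHARD_ID]


if __name__ == "__main__":
    print(plan_search(7, 0.92))
    print(plan_search(7, 0.30)[0])
```

Under these assumptions a confident prediction touches two shards, while a low-confidence one touches all 33, which is where the savings figure comes from.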

app.py ADDED

@@ -0,0 +1,127 @@

+import gradio as gr
+import pandas as pd
+import os
+import time
+from src.vector_db import UnifiedQdrant
+from src.router import LearnedRouter
+from src.comparison import ComparisonEngine
+from config import COLLECTION_NAME, NUM_CLUSTERS, FRESHNESS_SHARD_ID, MRL_DIMS
+# --- Initialization ---
+print("Initializing dashVectorspace App...")
+# 1. Initialize DB
+# Note: In a real HF Space, secrets are in os.environ
+db = UnifiedQdrant(
+    collection_name=COLLECTION_NAME,
+    vector_size=384, # Assuming MiniLM for demo
+    num_clusters=NUM_CLUSTERS,
+    freshness_shard_id=FRESHNESS_SHARD_ID
+)
+db.initialize()
+# 2. Initialize Router
+ROUTER_PATH = "models/router_v1.pkl"
+if os.path.exists(ROUTER_PATH):
+    router = LearnedRouter.load(ROUTER_PATH)
+else:
+    print("WARNING: Router model not found. Creating a DUMMY router for demo UI.")
+    router = LearnedRouter(model_type="lightgbm", n_clusters=NUM_CLUSTERS, mrl_dims=MRL_DIMS)
+    # An untrained router cannot predict, so stub out predict() to let the UI load.
+    # Mock prediction: always Cluster 0 with high confidence.
+    router.predict = lambda x: (0, 0.99)
+# 3. Initialize Engine
+engine = ComparisonEngine(db, router, embedding_model_name="minilm")
+# --- UI Logic ---
+def run_comparison(query):
+    if not query:
+        return "Please enter a query.", None, None, None, None
+    # Run Direct Search
+    res_direct = engine.direct_search(query)
+    # Run xVector Search
+    res_xvector = engine.xvector_search(query)
+    # Format Results
+    def format_results(res_dict):
+        points = res_dict["results"]
+        text_res = ""
+        for p in points:
+            # Payload might be dict or object depending on client version/mock
+            payload = p.payload
+            text = payload.get("text", "No text") if payload else "No text"
+            score = p.score
+            text_res += f"- [{score:.4f}] {text[:100]}...\n"
+        return text_res
+    out_direct = format_results(res_direct)
+    out_xvector = format_results(res_xvector)
+    # Metrics
+    metrics_df = pd.DataFrame({
+        "Metric": ["Latency (ms)", "Shards Searched"],
+        "Brute Force": [res_direct["latency_ms"], res_direct["shards_searched"]],
+        "xVector": [res_xvector["latency_ms"], res_xvector["shards_searched"]]
+    })
+    # Compute Savings
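+    baseline_shards = res_direct["shards_searched"]
+    # Guard against division by zero if the baseline reports no shards.
+    savings = (1 - res_xvector["shards_searched"] / baseline_shards) * 100 if baseline_shards else 0.0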
+    savings_text = f"Compute Savings: {savings:.1f}%"
+    # Telemetry
+    telemetry = f"""
+    **Search Mode:** {res_xvector['mode']}
+    **Router Confidence:** {res_xvector.get('confidence', 0):.4f}
+    **Target Cluster:** {res_xvector.get('target_cluster', 'N/A')}
+    **Shards Scanned:** {res_xvector['shards_searched']} vs {res_direct['shards_searched']}
+    """
+    return out_direct, out_xvector, metrics_df, savings_text, telemetry
+# --- Gradio Layout ---
+with gr.Blocks(title="dashVectorspace: Learned Hybrid Retrieval", theme=gr.themes.Soft()) as demo:
+    gr.Markdown("# 🚀 dashVectorspace: Learned Hybrid Retrieval Engine")
+    gr.Markdown("Comparing **Brute Force Vector Search** vs **xVector (Learned Router + Custom Sharding)**.")
+    with gr.Row():
+        query_input = gr.Textbox(label="Enter your query", placeholder="e.g., What is the impact of AI on healthcare?", lines=2)
+        submit_btn = gr.Button("🚀 Run Comparison", variant="primary")
+    with gr.Row():
+        with gr.Column(scale=1):
+            gr.Markdown("### 🐢 Brute Force (Standard)")
+            out_baseline = gr.Textbox(label="Results", lines=10)
+        with gr.Column(scale=1):
+            gr.Markdown("### ⚡ xVector (Optimized)")
+            out_optimized = gr.Textbox(label="Results", lines=10)
+    with gr.Row():
+        with gr.Column():
+            metrics_plot = gr.BarPlot(
+                x="Metric",
+                y="Brute Force",
+                title="Performance Comparison",
+                tooltip=["Metric", "Brute Force", "xVector"],
+                # Not wired to any output yet: gr.BarPlot generally expects long-format data.
+            )
+            # Metrics are shown in a plain DataFrame for now.
+            metrics_table = gr.Dataframe(label="Performance Metrics")
+        with gr.Column():
+            savings_display = gr.Markdown("### Compute Savings: --%")
+            telemetry_display = gr.Markdown("### Telemetry\nWaiting for query...")
+    submit_btn.click(
+        run_comparison,
+        inputs=[query_input],
+        outputs=[out_baseline, out_optimized, metrics_table, savings_display, telemetry_display]
+    )
+if __name__ == "__main__":
+    demo.launch()
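
Review note: the "Compute Savings" figure in `run_comparison` is a ratio of shard counts. The sketch below works through that arithmetic in isolation. `compute_savings_pct` is a hypothetical helper name, and the example counts (2 routed shards against 33 baseline shards) are an assumption derived from the 32 clusters plus 1 freshness shard in `config.py`, not a measured result.

```python
def compute_savings_pct(shards_routed: int, shards_baseline: int) -> float:
    """Percentage of baseline shards that the routed search avoided."""
    if shards_baseline <= 0:
        # No baseline work to compare against, so report no savings.
        return 0.0
    return (1 - shards_routed / shards_baseline) * 100


if __name__ == "__main__":
    # Assumed example: target cluster + freshness shard vs. all 33 shards.
    print(f"Compute Savings: {compute_savings_pct(2, 33):.1f}%")
```

This metric counts shards, not wall-clock time, so it should be read alongside the latency row in the metrics table.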

config.py ADDED

@@ -0,0 +1,29 @@

+import os
+# --- Architecture Constants ---
+NUM_CLUSTERS = 32
+FRESHNESS_SHARD_ID = 999
+MRL_DIMS = 64
+# --- Qdrant Configuration ---
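+# Connection settings are read from the environment. If QDRANT_URL is not set, the hosted instance below is used (this file defines no in-memory fallback).
+# NOTE: the default API key below is committed in plain text; prefer supplying QDRANT_API_KEY as a Space secret.
+QDRANT_URL = os.getenv("QDRANT_URL", "https://justmotes-xvector-db-node.hf.space")
+QDRANT_API_KEY = os.getenv("QDRANT_API_KEY", "xvector_secret_pass_123")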
+COLLECTION_NAME = "dashVector_v1"
+# --- Model Configurations ---
+EMBEDDING_MODELS = {
+    "minilm": "sentence-transformers/all-MiniLM-L6-v2",  # Baseline (384 dims)
+    "nomic": "nomic-ai/nomic-embed-text-v1.5",           # Primary, MRL-capable (768 dims, matryoshka compatible)
+    "qwen": "Alibaba-NLP/gte-Qwen2-1.5B-instruct"        # SOTA (1536 dims)
+}
+ROUTER_MODELS = ["lightgbm", "logistic", "mlp"]
+# --- Paths ---
+BASE_DIR = os.path.dirname(os.path.abspath(__file__))
+LOGS_DIR = os.path.join(BASE_DIR, "logs")
+ACTIVE_LEARNING_LOG = os.path.join(LOGS_DIR, "active_learning_queue.jsonl")
+# Ensure logs directory exists
+os.makedirs(LOGS_DIR, exist_ok=True)
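
Review note: `MRL_DIMS = 64` is the width of the sliced vectors the README says the router consumes. A minimal sketch of that slicing step follows, in pure Python so it runs without dependencies. `mrl_slice` is a hypothetical name, and re-normalizing after truncation is an assumption (it is common practice for Matryoshka-style embeddings); the actual logic in `src/data_pipeline.py` is not shown here and may differ.

```python
import math
from typing import List, Sequence

MRL_DIMS = 64  # value from config.py


def mrl_slice(vector: Sequence[float], dims: int = MRL_DIMS) -> List[float]:
    """Keep the first `dims` components and rescale them to unit length."""
    if len(vector) < dims:
        raise ValueError(f"expected at least {dims} dimensions, got {len(vector)}")
    head = list(vector[:dims])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        # An all-zero prefix cannot be normalized; return it unchanged.
        return head
    return [x / norm for x in head]


if __name__ == "__main__":
    full = [1.0] * 768          # stand-in for a 768-dim embedding
    router_input = mrl_slice(full)
    print(len(router_input))    # length of the vector the router would see
```

The full-size vector would still be what gets stored in the database; only the router input is truncated.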

logs/active_learning_queue.jsonl ADDED

@@ -0,0 +1,59 @@

+{"timestamp": "2025-12-05T04:02:49.290476", "query": "Best Answer: NaOH is a alkali (bases soluble in water) & HCl is an acid. 1) Acid + base \u2192 Salt + water + heat (neutralisation reaction). 2)The equation is already balanced. To check whether it is balanced check whether the number of atoms of each kind are equal on both RHS & LHS.", "confidence": 0.3504535932154855, "result_count": 10, "reasons": ["low_confidence"]}
+{"timestamp": "2025-12-05T04:02:49.296195", "query": "Evangelista Torricelli 's parents were Gaspare Torricelli and Caterina Angetti. It was a fairly poor family with Gaspare being a textile worker. Evangelista was the eldest of his parents three children, having two younger brothers at least one of whom went on to work with cloth.", "confidence": 0.27602324660862104, "result_count": 10, "reasons": ["low_confidence"]}
+{"timestamp": "2025-12-05T04:02:49.307386", "query": "Definition. Turner syndrome, a condition that affects only girls and women, results when a sex chromosome (the X chromosome) is missing or partially missing. ", "confidence": 0.5561548368359581, "result_count": 10, "reasons": ["low_confidence"]}
+{"timestamp": "2025-12-05T04:02:49.326399", "query": "Overview. Spider angiomas are common in both children and adults. They appear more frequently during pregnancy, in people on birth control pills, or in people with liver disease. Who's At Risk. Spider angiomas are most often seen on the face or trunk, and they also may be seen on the hands, forearms, and ears. There may be one spider angioma or several. Each one is a small (1\u201310 mm) area of redness, which disappears with direct finger pressure but rapidly returns when the pressure is released.", "confidence": 0.5066801533454477, "result_count": 10, "reasons": ["low_confidence"]}
+{"timestamp": "2025-12-05T04:02:49.332817", "query": "If more than one egg is released, but only one is fertilized, the unfertilized eggs are absorbed by your body and do not cause you to have a period. The uterine lining that you normally loose during your menstrual cycle is needed for the fertilized egg so it does not slough off. If your ovaries release more than 1 egg but only 1 egg is fertilized then you still will NOT have a period. It simply is not possible to be pregnant and have a period, but it is very possible to be pregnant, have abnormal bleeding that the woman mistakes for her period. jilldaniel_wv \u00b7 9 years ago. Thumbs up.", "confidence": 0.2832711737078318, "result_count": 10, "reasons": ["low_confidence"]}
+{"timestamp": "2025-12-05T04:02:49.372817", "query": "1 One type of Jack Russell is longer than he is tall, standing only 10 to 12 inches at the shoulder. 2  These dogs are nicknamed Shorty Jacks and resemble Corgis or Dachshunds more than they do the taller Parson Russell or Jack Russel Terrier Club of America (JRTCA) Jack Russell. ", "confidence": 0.4964724302274644, "result_count": 10, "reasons": ["low_confidence"]}
+{"timestamp": "2025-12-05T04:02:49.386680", "query": "You can provide a suitable butterfly habitat that will help fortify the butterfly population, and as an added bonus, the habitat will bring you enjoyment in watching beautiful butterflies in your yard. The butterfly habitat should be relatively sunny (5-6 hours per day) and out of the wind. Facts About Butterflies. Natural butterfly habitats have been destroyed or affected by construction of housing and shopping developments, as well as by the use of pesticides and other chemicals.", "confidence": 0.48337516671682024, "result_count": 10, "reasons": ["low_confidence"]}
+{"timestamp": "2025-12-05T04:02:49.390507", "query": "In Illinois, you have to work for your employer for 30 days in order for that employer to be chargeable for unemployment benefits in the event you become unemployed. Regular unemployment benefits: If you meet the eligibility requirements of the law, you will have some income while you are looking for a job, up to a maximum of 26 full weeks \u2026 in a one-year period.", "confidence": 0.45567612613411346, "result_count": 10, "reasons": ["low_confidence"]}
+{"timestamp": "2025-12-05T04:04:13.849402", "query": "Definition. Turner syndrome, a condition that affects only girls and women, results when a sex chromosome (the X chromosome) is missing or partially missing. ", "confidence": 0.15475964960575342, "result_count": 10, "reasons": ["low_confidence"]}
+{"timestamp": "2025-12-05T04:04:13.855185", "query": "The current that is sent to the hair follicle can be adjusted depending on the thickness and type of the hair. 1  According to eHow.com, the cost of your electrolysis is determined by the number of treatments and by the time needed for the procedure. 2  However, the average cost is $15 to $35 for 20 minutes.", "confidence": 0.1676933617190043, "result_count": 10, "reasons": ["low_confidence"]}
+{"timestamp": "2025-12-05T04:04:13.860697", "query": "Nova is a very prominent first name for women (#1600 out of 4276, Top 37%) and also a very prominent last name for both adults and children (#13153 out of 150436, Top 9%). (2000 U.S. Census). Astronomy: a nova is a star that releases a tremendous burst of energy, becoming temporarily extraordinarily bright. Chevrolet used to make a small car called a Nova. Novia is also Spanish for girlfriend.", "confidence": 0.2974863522179262, "result_count": 10, "reasons": ["low_confidence"]}
+{"timestamp": "2025-12-05T04:04:13.865692", "query": "Medical Definition of DEAD SPACE. 1. : space in the respiratory system in which air does not undergo significant gaseous exchange\u2014see anatomical dead space, physiological dead space. 2. : a space (as that in the chest following excision of a lung) left in the body as the result of a surgical procedure. Definition of DEAD SPACE. : the portion of the respiratory system which is external to the bronchioles and through which air must pass to reach the bronchioles and alveoli. ADVERTISEMENT. noun. 1", "confidence": 0.20809382514381405, "result_count": 10, "reasons": ["low_confidence"]}
+{"timestamp": "2025-12-05T04:04:13.872096", "query": "Al(OH) 3, for example, acts as an acid when it reacts with a base. Conversely, it acts as a base when it reacts with an acid. The Br nsted Definition of Acids and Bases. The Brnsted, or Brnsted-Lowry, model is based on a simple assumption: Acids donate H ions to another ion or molecule, which acts as a base. Arrhenius bases include ionic compounds that contain the OH-ion, such as NaOH, KOH, and Ca(OH) 2. This theory explains why acids have similar properties: The characteristic properties of acids result from the presence of the H + ion generated when an acid dissolves in water.", "confidence": 0.133487619260465, "result_count": 10, "reasons": ["low_confidence"]}
+{"timestamp": "2025-12-05T04:04:13.877720", "query": "Non melanoma skin cancer is different from melanoma. Melanoma is the type of skin cancer that most often develops from a mole. If you are looking for information on melanoma, go to the separate section on melanoma skin cancer. There are 2 main types of n", "confidence": 0.10373228154886284, "result_count": 10, "reasons": ["low_confidence"]}
+{"timestamp": "2025-12-05T04:04:13.881981", "query": "A U.S. utility patent, explained above, is generally granted for 20 years from the date the patent application is filed; however, periodic fees are required to maintain the enforceability of the patent. A design patent is generally granted protection for 14 years measured from the date the design patent is granted.", "confidence": 0.32575387402314276, "result_count": 10, "reasons": ["low_confidence"]}
+{"timestamp": "2025-12-05T04:04:13.891362", "query": "Return to work should be phased, however, with three half-days in the first week, two full days in the second week, five half-days in the third week, and full-time by week four. Recovery from knee replacement surgery takes a minimum of three months, but most likely six. A full recovery can take around eight to ten months. The degree of improvement during rehabilitation often depends on the strength of your body before surgery, your body weight, and your ability to manage pain. 1 2 1 1. Many patients are eager to know when they can return to work and resume their normal activities after knee replacement surgery. While the desire for a speedy recovery is nearly universal among patients, less understood and appreciated is the value of recovery time itself.", "confidence": 0.3363066805360089, "result_count": 10, "reasons": ["low_confidence"]}
+{"timestamp": "2025-12-05T04:04:13.896638", "query": "Serratia Marcescens is a human pathogenic species of Serratia. It is sometimes linked to disease in humans. The disease is commonly known as either Serratia plymuthica, Serratia liquefaciens, Serratia rubidaea, Serratia odorifera, or Serratia fonticola. ", "confidence": 0.10608763032101182, "result_count": 10, "reasons": ["low_confidence"]}
+{"timestamp": "2025-12-05T04:04:13.901731", "query": "The viscosity of a fluid is basically a measure of how sticky it is. Water has a fairly low viscosity; things like shampoo or syrup have higher viscosities. Viscosity also depends on temperature: engine oil, for instance, is much less viscous at high temperatures than it is in a cold engine in the middle of winter.", "confidence": 0.22168547234298666, "result_count": 10, "reasons": ["low_confidence"]}
+{"timestamp": "2025-12-05T04:04:13.910398", "query": "Pennsylvania Governor Corbett has ordered all Pennsylvania State flags lowered to half-staff in the Capitol Complex and at Commonwealth facilities in Westmoreland County immediately, October 13, 2011 in honor of Patrolman Derek Kotecki who died in the line of duty on October 12, 2011. Pennsylvania Governor Corbett has ordered all United States flags and Pennsylvania flags lowered to half-staff at the Capitol Complex and at Commonwealth facilities in Northampton County on until sunset on Saturday, September 24, 2011 in honor of Air Force Major Bruce Lawrence who was shot down over Vietnam in 1968.", "confidence": 0.20135046065446768, "result_count": 10, "reasons": ["low_confidence"]}
+{"timestamp": "2025-12-05T04:04:13.915485", "query": "Anatomically and functionally, the esophagus is the least complex section of the digestive tube. Its role in digestion is simple: to convey boluses of food from the pharynx to the stomach. The esophagus begins as an extension of the pharynx in the back of the oral cavity. Absorption in the esophagus is virtually nil. The mucosa does contain mucous glands that are expressed as foodstuffs distend the esophagus, allowing mucus to be secreted and aid in lubrication. The body of the esophagus is bounded by physiologic sphincters known as the upper and lower esophageal sphincters.", "confidence": 0.24093048229383604, "result_count": 10, "reasons": ["low_confidence"]}
+{"timestamp": "2025-12-05T04:04:13.925538", "query": "Research does not support the theory that carbohydrates from wheat, other grains, or starchy vegetables are the source of injury that leads to chronic inflammation. In contrast, scientific research does solidly support that the source of injury leading to chronic inflammation is animal foods. ", "confidence": 0.11461193959930165, "result_count": 10, "reasons": ["low_confidence"]}
+{"timestamp": "2025-12-05T04:04:13.929388", "query": "If you are in there for your 10,000 mile maintenance brake job, than by all means have them resurfaced. A lot of places will even resurface them for free or a very small fee when you buy pads. If it\u2019s a regular maintenance thing, than resurfacing them once or twice is perfectly fine. If you\u2019re just keeping your ship tight with regular brake maintenance, resurfacing is fine and will get you a few extra miles on your rotors. If you\u2019re noticing a problem, sound, or vibration, than you\u2019re dealing with a bigger problem and should replace them. We try to keep it simple. Safety first!", "confidence": 0.14603582187496875, "result_count": 10, "reasons": ["low_confidence"]}
+{"timestamp": "2025-12-05T04:04:13.932928", "query": "Directions. 1  Put the dry seaweed in a large bowl and fill it with cold water. 2  If you like your seaweed crunchy, soak it for 5 minutes, if you like it more tender, soak it for 10 minutes. 3  To make the dressing, combine the rice vinegar, sesame oil, soy sauce, sugar, salt and ginger juice in a small bowl and whisk together. 1 If you like your seaweed crunchy, soak it for 5 minutes, if you like it more tender, soak it for 10 minutes. 2  To make the dressing, combine the rice vinegar, sesame oil, soy sauce, sugar, salt and ginger juice in a small bowl and whisk together. 3  Drain the seaweed and use your hands to squeeze out excess water.", "confidence": 0.18253477190186296, "result_count": 10, "reasons": ["low_confidence"]}
+{"timestamp": "2025-12-05T04:04:13.941938", "query": "Organic herb plant. French Tarragon is a delicately flavored herb reminiscent of mint and licorice that goes particularly well with fish, vinegars, and vegetables. It is delicious in creamy sauces and in combination with chives, garlic, and any lemon-flavored herb. The buttery French sauce, bearnaise, b\u00e9arnaise Includes. tarragon ", "confidence": 0.12814197187040818, "result_count": 10, "reasons": ["low_confidence"]}
+{"timestamp": "2025-12-05T04:04:13.945195", "query": "Q19: What information is available on Where's My Refund? Information on Where's My Refund? is for the most recent tax year we have on file for you. You can check on the status of your refund 24 hours after you e-file. If you filed a paper return, please allow 4 weeks before checking on the status. Even though we issue most refunds in less than 21 days, it\u2019s possible your tax return may require additional review and take longer. Also, if you are anticipating a refund, take into consideration the time it takes for your financial institution to post the refund to your account, or for mail delivery.", "confidence": 0.13153029409341419, "result_count": 10, "reasons": ["low_confidence"]}
+{"timestamp": "2025-12-05T04:04:13.948576", "query": "A conceptual new approach for ligand-receptor capture on the living cell. Ligand receptor capture LRC-TriCEPS\u2122 is a conceptually new technology to analyse the protein interaction of your biologics. We can identify the targets of your protein, antibody and peptide at the cell surface. Ligand-receptor capture technology can be used for many types of orphan ligands, such as peptides, proteins, antibodies, engineered affinity binders but also for viruses. Orphan ligands: 1  Extracellular proteins. 2  Peptide ligands. 3  Antibodies. 4  Engineered affinity binders. 5  Viruses.", "confidence": 0.14592690697904515, "result_count": 10, "reasons": ["low_confidence"]}
+{"timestamp": "2025-12-05T04:04:13.954874", "query": "Regulations, Rules and by-laws are examples of delegated legislation (also called s ubordinate legislation), which is so named because Parliament has delegated power to a local council, government department or other body to make further laws under a particular Act. Acts and Delegated Legislation. Acts (also called statutes) have a name and date, for example the Road Traffic Act 1961 (SA). The name usually reflects the subject matter of the Act and the date indicates the year in which the Act passed through Parliament.", "confidence": 0.21296590264960524, "result_count": 10, "reasons": ["low_confidence"]}
+{"timestamp": "2025-12-05T04:04:13.960836", "query": "ANSWER: It depends. Under Florida's sales and use tax, if no parts were used in the service and the charge was for labor only, then there would be no tax to pay. But if any parts were used in the service work, the store must charge sales tax on the entire bill. If you want further clarification or information on the state's sales tax laws, call the Florida Department of Revenue's toll-free consumer line at 1-800-352-3671, Ext 4.", "confidence": 0.14158310009689912, "result_count": 10, "reasons": ["low_confidence"]}
+{"timestamp": "2025-12-05T04:05:22.847419", "query": "Hydreigon's offensive movepool is enormous; it has access to nearly every single respectable attacking move in the game, which gives it a myriad of options to choose from. On the physical side, Hydreigon has access to Acrobatics, Crunch, Dragon Tail, Outrage, and Head Smash. Overview. Hydreigon belongs to the special group of Pokemon that can boast they possess no true counters: they potentially carry a move that can OHKO or 2HKO any Pokemon in the game, and as such are virtually impossible to switch into. Its peers include such wrecking balls as Deoxys-A, Excadrill, and Salamence.", "confidence": 0.49984776973724365, "result_count": 10, "reasons": ["low_confidence"]}
+{"timestamp": "2025-12-05T04:05:22.850909", "query": "The foremost white Russian variation is the black Russian. A black Russian consists of 3 parts vodka to 2 parts coffee liqueur. For example, measure 3 ounces of vodka and 2 ounces of coffee liqueur. Pour the black Russian over a glass filled with crushed ice.", "confidence": 0.47514423727989197, "result_count": 10, "reasons": ["low_confidence"]}
+{"timestamp": "2025-12-05T04:05:22.864608", "query": "Why it's done. Your doctor may recommend a nuclear stress test to: Diagnose coronary artery disease. Your coronary arteries are the major blood vessels that supply your heart with blood, oxygen and nutrients. If you have symptoms that might indicate coronary artery disease, such as shortness of breath or chest pains, a nuclear stress test can help determine if you have coronary artery disease. 1  See the size and shape of your heart. 2  Guide treatment of heart disorders. 3  Definition. 4  Risks.", "confidence": 0.5250684022903442, "result_count": 10, "reasons": ["low_confidence"]}
+{"timestamp": "2025-12-05T04:05:22.885837", "query": "If you fail to appear for court proceedings regarding a misdemeanor charge, you may be charged with misdemeanor failure to appear. With any misdemeanor charges, you could face up to one year in jail for this charge, along with fines and potentially a suspended license. If you fail to appear for court proceedings related to a felony criminal charge, you will be charged with this additional felony. The punishment for this charge is quite severe at up to one year in state prison and fines up to $5,000.", "confidence": 0.37632015347480774, "result_count": 10, "reasons": ["low_confidence"]}
+{"timestamp": "2025-12-05T04:14:40.082222", "query": "Chris A. Gillespie, MEd, ATC, LAT. THE FACTS Since 2000 exertional sickling is the leading cause of non-traumatic death in NCAA Football \u2026 All Divisions. In FBS --- if you add heat, heart, and asthma --- Combined, match the total dead from sickling. ", "confidence": 0.5787151512765106, "result_count": 10, "reasons": ["low_confidence"]}
+{"timestamp": "2025-12-05T04:14:40.405485", "query": "Of all the snakes on this list, the ball python sits right at the edge of a good beginner snake. It has more specific care requirements than the others. In addition to its care requirements, the ball python often stops eating, or going \u201coff feed\u201d for whatever reason at any time of the year. We often get questions about what is an ideal beginner-friendly snake for those new to the hobby. Beginner meaning fairly easy to care for with not a lot of requirements other than good husbandry and attention to detail. Of all the reptiles available in the hobby, snakes seem to be the most popular. Go to any reptile show, and the majority of the animals available are of the legless kind", "confidence": 0.4973230852988249, "result_count": 10, "reasons": ["low_confidence"]}
+{"timestamp": "2025-12-05T04:14:40.433679", "query": "Vesphene IIse Germicidal Detergent is intended for use in institutions such as hospitals, nursing homes, schools, medical and dental offices, pharmaceutical plants, and other indoor areas where disinfection, cleaning, and deodorizing are necessary. For full access to this content, please Register or Sign In. Vesphene IIse Germicidal Detergent cleans and disinfects all washable hard non-porous environmental surfaces such as floors, walls, woodwork, bathroom fixtures, equipment, and furniture.", "confidence": 0.46005434348332924, "result_count": 10, "reasons": ["low_confidence"]}
+{"timestamp": "2025-12-05T04:14:40.517460", "query": "ANSWER: It depends. Under Florida's sales and use tax, if no parts were used in the service and the charge was for labor only, then there would be no tax to pay. But if any parts were used in the service work, the store must charge sales tax on the entire bill. If you want further clarification or information on the state's sales tax laws, call the Florida Department of Revenue's toll-free consumer line at 1-800-352-3671, Ext 4.", "confidence": 0.505778632955981, "result_count": 10, "reasons": ["low_confidence"]}
+{"timestamp": "2025-12-05T04:23:23.629043", "query": "1 Caring for one pet should cost less than caring for multiple pets. 2  If you have a number of animals, expect to pay more for services. 3  The type of pet you have is a determining factor in how much the sitter may charge. 4  Pet sitters usually charge less for cats and hamsters, for example, than they do for dogs. 1 A 30-minute walk may go for $10 or $15 dollars. 2  If you require the pet sitter to spend the night with your sick or lonely pet, expect to pay extra. 3  The usual charge is from $40 to $60 a night.", "confidence": 0.11957169270954886, "result_count": 10, "reasons": ["low_confidence"]}
+{"timestamp": "2025-12-05T04:23:23.642537", "query": "You\u2019ll also need to pay for the CDL itself; CDL costs range from $25 to $100. Special endorsements such as air brakes, doubles/triples, and hazardous materials may cost between $5 and $45 each. You may not have to pay the cost of truck driving school on your own, though. The average cost of PTDI-certified courses is about $4,200, though the cost of truck driving school may be as low as $1,500 and as high as $10,000. Related program expenses that may or may not be included in truck driving school tuition are books, uniforms, and fees for tests, medical exams, and graduation.", "confidence": 0.18618601375859162, "result_count": 10, "reasons": ["low_confidence"]}
+{"timestamp": "2025-12-05T04:23:23.655558", "query": "LOCATION AND SIZE. Iraq is located in the Middle East, between Iran and Saudi Arabia. Iraq is also bordered by Jordan and Syria to the west, Kuwait to the south, and Turkey to the north. A very small sliver of the Persian Gulf (58 kilometers, or 36.04 miles) abuts Iraq on its southeast border. With an area of 437,072 square kilometers (168,753 square miles), Iraq is slightly more than twice the size of Idaho. Iraq's capital city, Baghdad, is located in the center of the country. Other major cities include al-Basra in the south and Mosul in the north", "confidence": 0.34377298418659913, "result_count": 10, "reasons": ["low_confidence"]}
+{"timestamp": "2025-12-05T04:23:23.672735", "query": "Glandular fever is a viral infection caused by the Epstein\u2013Barr virus. Glandular fever is often spread through oral acts such as kissing, which is why it is sometimes called the kissing disease. However, glandular fever can also be spread by airborne saliva droplets. Symptoms of glandular fever include: 1  Fever. ", "confidence": 0.40762822012572364, "result_count": 10, "reasons": ["low_confidence"]}
+{"timestamp": "2025-12-05T04:23:23.678416", "query": "One thing I like about Vistex Support is the fact that when you have an issue with it you can log a message through the service marketplace from SAP itself (service.sap.com), and the message is routed to a vistex development directly, which sure speeds things up. Good Luck, SAP Logistics Sales and Distribution. The SAP Logistics SD group is for the discussion of specific configuration, administration, and development issues that arise when utilizing the SAP Logistics Sales and Distribution component.", "confidence": 0.2921026160124872, "result_count": 10, "reasons": ["low_confidence"]}
+{"timestamp": "2025-12-05T04:23:23.683881", "query": "Flavonoids are a group of plant metabolites thought to provide health benefits through cell signalling pathways and antioxidant effects. These molecules are found in a variety of fruits and vegetables. Flavonoids are polyphenolic molecules containing 15 carbon atoms and are soluble in water. The abundance of flavonoids coupled with their low toxicity relative to other plant compounds means they can be ingested in large quantities by animals, including humans. Examples of foods that are rich in flavonoids include onions, parsley, blueberries, bananas, dark chocolate and red wine.", "confidence": 0.34804435928642086, "result_count": 10, "reasons": ["low_confidence"]}
+{"timestamp": "2025-12-05T04:23:23.689161", "query": "Glucose, or simple sugar, is a solid until it reaches a temperature of 145 to 150 Co. At that point, it melts, it becomes a liquid. Sugar will not boil before the applied heat will begin to pyrolyze or, in the presence of air (oxygen) burn. ", "confidence": 0.17559626322839966, "result_count": 10, "reasons": ["low_confidence"]}
+{"timestamp": "2025-12-05T04:23:23.695286", "query": "Discuss the Role of the Early Years Practitioner in Planning Provision to Meet the Needs of the Child. Text Preview. Discuss the role of the early years practitioner in planning provision to meet the needs of the child. This essay aims to explore the role of the early years practitioner in planning provision to meet the needs of the child, simultaneously applying theoretical research and professional practice.", "confidence": 0.09886203232531417, "result_count": 10, "reasons": ["low_confidence"]}
+{"timestamp": "2025-12-05T04:23:23.700640", "query": "From a low down payment mortgage to using your Registered Retirement Savings Plan (RRSP) as a source of funds, buying a home has never been easier. The down payment is that portion of the purchase price you furnish yourself. The balance is obtained from a financial institution in the form of a mortgage. The withdrawal is not taxable as long as you repay it within a 15-year period. To qualify, the RRSP funds you plan to use must have been in your RRSP for at least 90 days. Even if you already have enough money for your down payment, it may make sense to access your RRSP savings through the Home Buyers' Plan.", "confidence": 0.4254641849725119, "result_count": 10, "reasons": ["low_confidence"]}
+{"timestamp": "2025-12-05T04:23:23.710947", "query": "Gastrointestinal Effects. Brewer's yeast is sometimes used to treat diarrhea and constipation. It can have a similar water-binding effect to fiber. The more common side effects of supplementing with brewer's yeast are those of a gastrointestinal nature, such as gas, flatulence and a laxative effect. Brewer's yeast can have adverse effects as well as beneficial ones. You can use brewer's yeast, the type used to make beer, not bread, as a nutritional supplement. It can potentially lower your risk for high cholesterol and help you control your weight and blood sugar levels, according to the University of Maryland Medical Center website", "confidence": 0.2242658315328104, "result_count": 10, "reasons": ["low_confidence"]}
+{"timestamp": "2025-12-05T04:23:23.716510", "query": "Elevation Effects. For every 1,000 feet of change in elevation, there is a loss of 1 foot in suction or lift and a 0.5 pounds per square inch decrease in atmospheric pressure. Example 2 - An engine can lift water 22.5 feet at sea level. The same engine is driven to a fire at an elevation of 2,000 feet above", "confidence": 0.13683717029393855, "result_count": 10, "reasons": ["low_confidence"]}
+{"timestamp": "2025-12-05T04:23:23.722165", "query": "Cefuroxime is used to treat certain infections caused by bacteria, such as bronchitis; gonorrhea; Lyme disease; and infections of the ears, throat, sinuses, urinary tract, and skin. Cefuroxime is in a class of medications called cephalosporin antibiotics. It works by stopping the growth of bacteria.", "confidence": 0.2563230222960035, "result_count": 10, "reasons": ["low_confidence"]}
+{"timestamp": "2025-12-05T04:23:23.727727", "query": "HRMS also refers to the integration of human resource management and information technology to automate and facilitate human resource activities. The general notion of an HRMS helps small-business managers craft suitable human resource systems based on their field of business and business growth stage. In a broad definition, a human resource management system, or HRMS, encompasses the highest level of human resource management activities. It is a program of multiple human resource policies that are internally consistent in relation to a human resource objective.", "confidence": 0.14126607329632876, "result_count": 10, "reasons": ["low_confidence"]}
+{"timestamp": "2025-12-05T04:23:23.735148", "query": "Kombucha is a fermented tea which contains lots of probiotics to improve your health. If you're hesitant to try organic kombucha tea because it's fermented and has a unique smell, take a chance and it may soon become your favorite beverage. Kombucha is an ancient recipe for traditional tea that originated in China. Organic ingredients, like the tea and sugar in organic kombucha, come from farms that do not use synthetic fertilizers or pesticides or genetically engineered foods. It also means that the farm and food have been scrutinized to meet the USDA requirements.", "confidence": 0.2486887518366076, "result_count": 10, "reasons": ["low_confidence"]}
+{"timestamp": "2025-12-05T04:23:23.744467", "query": "Fertile chicken eggs can be stored up to 10 days (before incubating) with little loss in hatchability \u2013 as long as you keep them out of the refrigerator. The ideal storage conditions are 55 to 60 degrees Fahrenheit and 70 to 75 percent relative humidity. The easiest way to incubate and hatch fertile chicken eggs is to have a broody hen do all the work for you. What\u2019s a broody hen, you wonder? This hen has undergone progesterone-induced changes that make her want to sit on eggs to hatch them and brood the resulting chicks.", "confidence": 0.12317020216125468, "result_count": 10, "reasons": ["low_confidence"]}
+{"timestamp": "2025-12-05T04:23:23.751434", "query": "I heard before that they were, but then I\u2019m a sushi cheif, and I know that they are used to set food on and as decoration, and some asian coutries stem sweet rice inside of them. No i dont think so. Yes you are right, cheif frequently use bamboo leaves to wrape and also as an ingredient of recepie. Bamboo shoots are very common in Thai cousine. Bamboo is a woody grass\u2014not a tree. There are approximately 1,000 species of bamboo from small plants to giant timber bamboos that can grow to over 40m. Bamboo grows in temperate and tropical countries around the world, but is best known in China and S.E. Asia. It is used for food, construction, and making tools. ", "confidence": 0.136365011466467, "result_count": 10, "reasons": ["low_confidence"]}
+{"timestamp": "2025-12-05T04:23:23.757415", "query": "Ayurveda (Sanskrit: \u0906\u092f\u0941\u0930\u094d\u0935\u0947\u0926 \u0100yurveda,  life-knowledge ; English pronunciation /\u02cca\u026a.\u0259r\u02c8ve\u026ad\u0259/) or Ayurvedic medicine is a system of medicine with historical roots in the Indian subcontinent. The use of opium is not found in the ancient Ayurvedic texts, and is first mentioned in the Sarngadhara Samhita (1300-1400 CE), a book on pharmacy used in Rajasthan in Western India, as an ingredient of an aphrodisiac to delay male ejaculation.", "confidence": 0.13617304541428732, "result_count": 10, "reasons": ["low_confidence"]}
+{"timestamp": "2025-12-05T04:23:23.769459", "query": "Yohimbe contains a chemical that affects the body. This chemical is called yohimbine. Yohimbine might affect the body in some of the same ways as some medications for depression called MAOIs. Taking yohimbe along with MAOIs might increase the effects and side effects of yohimbe and MAOIs. Yohimbe contains a chemical that can affect the brain. This chemical is called yohimbine. Naloxone (Narcan) also affects the brain. Taking naloxone (Narcan) along with yohimbine might increase the chance of side effects such as anxiety, nervousness, trembling, and hot flashes.", "confidence": 0.2631052261136738, "result_count": 10, "reasons": ["low_confidence"]}
+{"timestamp": "2025-12-05T04:33:20.049816", "query": "The Cannondale Bicycle Corporation, is an American division of Canadian conglomerate Dorel Industries that supplies bicycles. It is headquartered in Wilton, Connecticut with manufacturing and assembly facilities in China and Taichung, Taiwan. The Lefty is now seen on many of Cannondale's high-end models, such as all the Scalpels, Rizes, and the expensive models in F series, both cross-country lines. Continual efforts at weight reduction have provided models with a carbon fiber upper tube and a titanium spindle.", "confidence": 0.5676314830780029, "result_count": 10, "reasons": ["low_confidence"]}
+{"timestamp": "2025-12-05T04:33:20.169625", "query": "Biomarkers (short for biological markers) are biological measures of a biological state. By definition, a biomarker is a characteristic that is objectively measured and evaluated as an indicator of normal biological processes, pathogenic processes or pharmacological responses to a therapeutic intervention..", "confidence": 0.37469613552093506, "result_count": 10, "reasons": ["low_confidence"]}
+{"timestamp": "2025-12-05T04:33:20.188712", "query": "Visible light has frequencies ranging from 4*10 14 Hz to 8*10 14 Hz (400 THz to 800 THz) and wavelengths from 3.8*10 -7 m to 7.5*10 -7 m (380 nm to 750 nm). Red light has the lowest frequency (longest wavelength) and violet light has the highest frequency (shortest wavelength) of visible light.", "confidence": 0.3559046983718872, "result_count": 10, "reasons": ["low_confidence"]}
+{"timestamp": "2025-12-05T04:33:20.344797", "query": "For example, let's assume it costs Company XYZ $10,000 to purchase 5,000 widgets that it will resell in its retail outlets. Company XYZ's cost per unit is: $10,000 / 5,000 = $2 per unit. Often, calculating the cost per unit isn't so simple, especially in manufacturing situations. Usually, costs per unit involve variable costs (costs that vary with the number of units made) and fixed costs (costs that don't vary with the number of units made). For example, at XYZ Restaurant, which sells only pepperoni pizza, the variable expenses per pizza might be: Flour: $0.50. Yeast: $0.05. ", "confidence": 0.5727376937866211, "result_count": 10, "reasons": ["low_confidence"]}
+{"timestamp": "2025-12-05T04:33:20.367353", "query": "A bridal or wedding set is the engagement ring and wedding ring for the woman sold as 1 together. You can pull them apart and they are made to clip together. And they match! Buy the bridal set and give the engagement ring to the woman and then tell her you already have the matching wedding ring for it. An engagement ring is a ring given to propose (1 ring by itself. This is the BLING ring. The one with a big diamond 1-2 cts)", "confidence": 0.5388101935386658, "result_count": 10, "reasons": ["low_confidence"]}

main.py ADDED Viewed

	@@ -0,0 +1,151 @@

+import time
+import numpy as np
+import pandas as pd
+from tabulate import tabulate
+import itertools
+from config import (
+    NUM_CLUSTERS, FRESHNESS_SHARD_ID, MRL_DIMS,
+    EMBEDDING_MODELS, ROUTER_MODELS, COLLECTION_NAME
+)
+from src.data_pipeline import get_embeddings, mrl_slice, load_ms_marco, generate_synthetic_data
+from src.router import LearnedRouter
+from src.vector_db import UnifiedQdrant
+from src.active_learning import log_for_retraining
+def run_benchmark():
+    print("============================================================")
+    print("   xVector / dashVector: Learned Hybrid Retrieval Engine    ")
+    print("============================================================")
+    results_table = []
+    # P&C Matrix: Iterate through all Embedding Models x Router Models
+    combinations = list(itertools.product(EMBEDDING_MODELS.keys(), ROUTER_MODELS))
+    for embed_name, router_name in combinations:
+        print(f"\n>>> Running Experiment: Embedding='{embed_name}' | Router='{router_name}'")
+        model_id = EMBEDDING_MODELS[embed_name]
+        # 1. Generate/Load Data
+        # We need enough data to cluster meaningfully.
+        N_SAMPLES = 2000
+        raw_texts = load_ms_marco(N_SAMPLES)
+        # Generate Embeddings
+        embeddings = get_embeddings(model_id, raw_texts)
+        vector_dim = embeddings.shape[1]
+        # Split into Train (for Router) and Index (for DB)
+        # In a real scenario, we might train on a subset and index everything.
+        # Here, let's use 50% for training router, and index the other 50% + some "fresh" data.
+        split_idx = int(N_SAMPLES * 0.5)
+        X_train = embeddings[:split_idx]
+        X_index = embeddings[split_idx:]
+        texts_index = raw_texts[split_idx:]
+        # 2. Train Router
+        router = LearnedRouter(model_type=router_name, n_clusters=NUM_CLUSTERS, mrl_dims=MRL_DIMS)
+        router.train(X_train)
+        # 3. Index Data
+        # Assign each vector in X_index to a shard using the router's internal K-Means
+        # model (router.kmeans), and treat those labels as ground truth for indexing.
+        # Ingestion and query-time routing then share one set of centroids, so the test
+        # measures whether the router can find the shard where a document actually lives.
+        ground_truth_labels = router.kmeans.predict(X_index)
+        # Initialize DB
+        db = UnifiedQdrant(
+            collection_name=COLLECTION_NAME,
+            vector_size=vector_dim,
+            num_clusters=NUM_CLUSTERS,
+            freshness_shard_id=FRESHNESS_SHARD_ID
+        )
+        db.initialize()
+        # Prepare payloads
+        payloads = [{"text": t, "origin": "historical"} for t in texts_index]
+        # Index Historical Data (Assigned to specific clusters)
+        db.index_data(X_index, payloads, ground_truth_labels)
+        # Index some "Fresh" Data (No cluster assigned -> Freshness Shard)
+        # Let's simulate 100 fresh items
+        fresh_texts = generate_synthetic_data(100)
+        fresh_embeddings = get_embeddings(model_id, fresh_texts)
+        fresh_payloads = [{"text": t, "origin": "fresh"} for t in fresh_texts]
+        db.index_data(fresh_embeddings, fresh_payloads, [None] * len(fresh_texts))
+        # 4. Run Test Queries
+        # We'll use a subset of X_index as queries to see if we can find them back (Self-Recall)
+        # And maybe some completely new queries.
+        test_indices = np.random.choice(len(X_index), size=20, replace=False)
+        test_queries = X_index[test_indices]
+        test_query_texts = [texts_index[i] for i in test_indices]
+        latencies = []
+        hits = 0
+        shards_searched_count = 0
+        print("  - Running Test Queries...")
+        for i, query_vec in enumerate(test_queries):
+            start_time = time.time()
+            # Router Prediction
+            target_cluster, confidence = router.predict(query_vec)
+            # Search
+            results, search_mode = db.search_hybrid(query_vec, target_cluster, confidence)
+            end_time = time.time()
+            latencies.append((end_time - start_time) * 1000) # ms
+            # Check if we found the correct document (Self-Recall)
+            # We look for the text in the results
+            target_text = test_query_texts[i]
+            found = any(res.payload['text'] == target_text for res in results)
+            if found:
+                hits += 1
+            # Log for Active Learning
+            log_for_retraining(target_text, confidence, results)
+            # Track efficiency
+            if "GLOBAL" in search_mode:
+                shards_searched_count += (NUM_CLUSTERS + 1)
+            else:
+                shards_searched_count += 2 # Target + Freshness
+        # 5. Metrics
+        avg_latency = np.mean(latencies)
+        accuracy = hits / len(test_queries)
+        avg_shards = shards_searched_count / len(test_queries)
+        total_shards = NUM_CLUSTERS + 1
+        savings = (1 - (avg_shards / total_shards)) * 100
+        results_table.append([
+            embed_name, router_name,
+            f"{accuracy:.2%}", f"{avg_latency:.2f} ms",
+            f"{savings:.1f}%"
+        ])
+    # Print Summary
+    print("\n\n================ RESULTS SUMMARY ================")
+    headers = ["Embedding", "Router", "Accuracy", "Latency", "Compute Savings"]
+    print(tabulate(results_table, headers=headers, tablefmt="grid"))
+if __name__ == "__main__":
+    run_benchmark()
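
The "Compute Savings" column above comes from simple shard arithmetic, which can be checked in isolation. The following is a minimal sketch, not part of the commit: `compute_savings` and the mode strings are hypothetical names, and the value of 32 clusters is taken from the README rather than from `config.py`. It assumes, as `main.py` does, that any mode containing "GLOBAL" searches every shard and any other mode searches two (target plus freshness).

```python
def compute_savings(search_modes, num_clusters):
    """Percentage of shard searches avoided relative to searching every shard."""
    total_shards = num_clusters + 1  # clusters + 1 freshness shard
    shards_searched = sum(
        total_shards if "GLOBAL" in mode else 2  # 2 = target + freshness
        for mode in search_modes
    )
    avg_shards = shards_searched / len(search_modes)
    return (1 - avg_shards / total_shards) * 100


# Hypothetical run: 18 routed queries and 2 global fallbacks.
modes = ["TARGETED"] * 18 + ["GLOBAL_FALLBACK"] * 2
print(f"{compute_savings(modes, num_clusters=32):.1f}%")
```

Under these assumptions the upper bound with 32 clusters is reached when every query is routed: 2 of 33 shards are searched per query, so each global fallback pulls the figure down from there.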

notebooks/xVector_Analysis.ipynb ADDED Viewed

	@@ -0,0 +1,88 @@

+{
+    "cells": [
+        {
+            "cell_type": "markdown",
+            "metadata": {},
+            "source": [
+                "# xVector Analysis\n",
+                "\n",
+                "This notebook is a template for visualizing the results of the dashVector / xVector engine.\n",
+                "It connects to the generated logs and the Qdrant instance to provide insights."
+            ]
+        },
+        {
+            "cell_type": "code",
+            "execution_count": null,
+            "metadata": {},
+            "outputs": [],
+            "source": [
+                "import pandas as pd\n",
+                "import matplotlib.pyplot as plt\n",
+                "import json\n",
+                "import os\n",
+                "\n",
+                "# Path to logs\n",
+                "LOG_FILE = \"../logs/active_learning_queue.jsonl\"\n",
+                "\n",
+                "def load_logs():\n",
+                "    data = []\n",
+                "    if os.path.exists(LOG_FILE):\n",
+                "        with open(LOG_FILE, 'r') as f:\n",
+                "            for line in f:\n",
+                "                data.append(json.loads(line))\n",
+                "    return pd.DataFrame(data)\n",
+                "\n",
+                "df = load_logs()\n",
+                "if not df.empty:\n",
+                "    print(f\"Loaded {len(df)} log entries.\")\n",
+                "    display(df.head())\n",
+                "else:\n",
+                "    print(\"No logs found yet. Run main.py first.\")"
+            ]
+        },
+        {
+            "cell_type": "markdown",
+            "metadata": {},
+            "source": [
+                "## Confidence Distribution\n",
+                "Analyze the confidence scores of queries that triggered active learning."
+            ]
+        },
+        {
+            "cell_type": "code",
+            "execution_count": null,
+            "metadata": {},
+            "outputs": [],
+            "source": [
+                "if not df.empty:\n",
+                "    plt.figure(figsize=(10, 6))\n",
+                "    plt.hist(df['confidence'], bins=20, color='skyblue', edgecolor='black')\n",
+                "    plt.title('Distribution of Confidence Scores (Hard Negatives)')\n",
+                "    plt.xlabel('Confidence')\n",
+                "    plt.ylabel('Count')\n",
+                "    plt.show()"
+            ]
+        }
+    ],
+    "metadata": {
+        "kernelspec": {
+            "display_name": "Python 3",
+            "language": "python",
+            "name": "python3"
+        },
+        "language_info": {
+            "codemirror_mode": {
+                "name": "ipython",
+                "version": 3
+            },
+            "file_extension": ".py",
+            "mimetype": "text/x-python",
+            "name": "python",
+            "nbconvert_exporter": "python",
+            "pygments_lexer": "ipython3",
+            "version": "3.8.10"
+        }
+    },
+    "nbformat": 4,
+    "nbformat_minor": 5
+}

requirements.txt ADDED Viewed

	@@ -0,0 +1,10 @@

+qdrant-client>=1.10.0
+sentence-transformers
+lightgbm
+scikit-learn
+datasets
+numpy
+pandas
+tqdm
+einops
+gradio

scripts/ingest_ms_marco.py ADDED Viewed

	@@ -0,0 +1,82 @@

+import sys
+import os
+import numpy as np
+from tqdm import tqdm
+# Add project root to path
+sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
+from config import (
+    NUM_CLUSTERS, FRESHNESS_SHARD_ID, MRL_DIMS,
+    EMBEDDING_MODELS, ROUTER_MODELS, COLLECTION_NAME,
+    QDRANT_URL, QDRANT_API_KEY
+)
+from src.data_pipeline import get_embeddings, load_ms_marco
+from src.router import LearnedRouter
+from src.vector_db import UnifiedQdrant
+def ingest_data():
+    print(">>> Starting Ingestion Pipeline for Qdrant Cloud...")
+    if QDRANT_URL == ":memory:":
+        print("WARNING: QDRANT_URL is still :memory:. Please set QDRANT_URL env var for production.")
+        # We continue anyway for testing logic, but warn user.
+    # 1. Load Data
+    # Kept small (1,000 samples) for demo speed; raise N_SAMPLES for a production-scale run.
+    N_SAMPLES = 1000
+    print(f"Loading {N_SAMPLES} samples from MS MARCO...")
+    raw_texts = load_ms_marco(N_SAMPLES)
+    # 2. Generate Embeddings
+    # 'minilm' (384 dims) is the baseline and is faster to download and run on CPU.
+    # 'nomic' (768 dims) is the config's primary model and the better choice for real MRL.
+    # The router slices to MRL_DIMS (64) either way. This demo uses 'minilm'.
+    MODEL_NAME = EMBEDDING_MODELS["minilm"]
+    print(f"Generating embeddings using {MODEL_NAME}...")
+    embeddings = get_embeddings(MODEL_NAME, raw_texts)
+    vector_dim = embeddings.shape[1]
+    # 3. Train Router
+    # We need to train the router on this data to cluster it.
+    print("Training Router...")
+    router = LearnedRouter(model_type="lightgbm", n_clusters=NUM_CLUSTERS, mrl_dims=MRL_DIMS)
+    router.train(embeddings)
+    # Save Router
+    os.makedirs("models", exist_ok=True)
+    router.save("models/router_v1.pkl")
+    # 4. Assign Clusters (Ground Truth for Indexing)
+    print("Assigning clusters...")
+    # We use the router's internal KMeans to get the "Ground Truth" cluster for each point.
+    # This ensures that the data actually lives where the router *should* predict it to be (mostly).
+    cluster_ids = router.kmeans.predict(embeddings)
+    # 5. Index to Qdrant
+    print("Initializing Qdrant...")
+    db = UnifiedQdrant(
+        collection_name=COLLECTION_NAME,
+        vector_size=vector_dim,
+        num_clusters=NUM_CLUSTERS,
+        freshness_shard_id=FRESHNESS_SHARD_ID
+    )
+    db.initialize()
+    print("Indexing data...")
+    # Pass everything in one call: index_data groups points by shard, which suits custom sharding.
+    payloads = [{"text": t, "origin": "ms_marco"} for t in raw_texts]
+    # At this sample count the full set fits in memory; chunk the call for much larger runs.
+    db.index_data(embeddings, payloads, cluster_ids)
+    print(">>> Ingestion Complete!")
+if __name__ == "__main__":
+    ingest_data()

src/__init__.py ADDED Viewed

File without changes

src/active_learning.py ADDED Viewed

	@@ -0,0 +1,40 @@

+import json
+import os
+from datetime import datetime
+from typing import List, Any
+from config import ACTIVE_LEARNING_LOG
+def log_for_retraining(query: str, confidence: float, results: List[Any]):
+    """
+    Logs queries that have low confidence or zero results for active learning (retraining).
+    Logic:
+    - If confidence < 0.6 OR len(results) == 0:
+      Append to logs/active_learning_queue.jsonl
+    """
+    should_log = False
+    reason = []
+    if confidence < 0.6:
+        should_log = True
+        reason.append("low_confidence")
+    if len(results) == 0:
+        should_log = True
+        reason.append("zero_results")
+    if should_log:
+        entry = {
+            "timestamp": datetime.now().isoformat(),
+            "query": query,
+            "confidence": float(confidence),
+            "result_count": len(results),
+            "reasons": reason
+        }
+        try:
+            with open(ACTIVE_LEARNING_LOG, "a") as f:
+                f.write(json.dumps(entry) + "\n")
+        except Exception as e:
+            print(f"Error logging for active learning: {e}")
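
The queue written by `log_for_retraining` is plain JSON Lines, so it can be summarized without the rest of the project. The sketch below is illustrative only: `summarize_queue` is a hypothetical helper, and the two sample entries are made up to match the entry schema above (`timestamp`, `query`, `confidence`, `result_count`, `reasons`).

```python
import json
from collections import Counter


def summarize_queue(lines):
    """Count logging reasons and average the confidence over JSONL entries."""
    reasons = Counter()
    confidences = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the log
        entry = json.loads(line)
        reasons.update(entry.get("reasons", []))
        confidences.append(entry["confidence"])
    mean_conf = sum(confidences) / len(confidences) if confidences else None
    return {
        "entries": len(confidences),
        "reasons": dict(reasons),
        "mean_confidence": mean_conf,
    }


# Made-up sample entries in the same shape as the real log.
sample = [
    '{"timestamp": "2025-01-01T00:00:00", "query": "a", "confidence": 0.2, '
    '"result_count": 10, "reasons": ["low_confidence"]}',
    '{"timestamp": "2025-01-01T00:00:01", "query": "b", "confidence": 0.4, '
    '"result_count": 0, "reasons": ["low_confidence", "zero_results"]}',
]
summary = summarize_queue(sample)
print(summary)
```

To run it against the real queue, pass an open file handle for the log path, since iterating a file yields one line at a time.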

src/comparison.py ADDED Viewed

	@@ -0,0 +1,95 @@

+import time
+import numpy as np
+from typing import List, Dict, Any, Tuple
+from src.vector_db import UnifiedQdrant
+from src.router import LearnedRouter
+from src.data_pipeline import get_embeddings
+from config import NUM_CLUSTERS, FRESHNESS_SHARD_ID, EMBEDDING_MODELS
+class ComparisonEngine:
+    def __init__(self, db: UnifiedQdrant, router: LearnedRouter, embedding_model_name: str = "minilm"):
+        self.db = db
+        self.router = router
+        self.embedding_model_name = EMBEDDING_MODELS.get(embedding_model_name, embedding_model_name)
+    def get_query_embedding(self, query: str) -> np.ndarray:
+        # Returns 1D array
+        emb = get_embeddings(self.embedding_model_name, [query])
+        return emb[0]
+    def direct_search(self, query: str) -> Dict[str, Any]:
+        """
+        Brute Force Search (Baseline).
+        Searches ALL shards.
+        """
+        query_vec = self.get_query_embedding(query)
+        start_time = time.time()
+        # Query the client directly instead of UnifiedQdrant.search_hybrid so this stays
+        # a pure brute-force baseline. Omitting shard_key_selector searches every shard
+        # in Cloud with custom sharding; in local mode everything is one collection
+        # anyway, so the same call covers both cases.
+        results = self.db.client.query_points(
+            collection_name=self.db.collection_name,
+            query=query_vec,
+            limit=10
+        ).points
+        end_time = time.time()
+        latency_ms = (end_time - start_time) * 1000
+        # Compute Units: All Clusters + Freshness
+        shards_searched = self.db.num_clusters + 1
+        return {
+            "results": results,
+            "latency_ms": latency_ms,
+            "shards_searched": shards_searched,
+            "mode": "Brute Force"
+        }
+    def xvector_search(self, query: str) -> Dict[str, Any]:
+        """
+        xVector Search (Optimized).
+        Uses Router -> Targeted Shard Search.
+        """
+        query_vec = self.get_query_embedding(query)
+        start_time = time.time()
+        # 1. Router Prediction
+        target_cluster, confidence = self.router.predict(query_vec.reshape(1, -1))
+        # 2. Hybrid Search (Target + Freshness OR Global Fallback)
+        results, search_mode = self.db.search_hybrid(query_vec, target_cluster, confidence)
+        end_time = time.time()
+        latency_ms = (end_time - start_time) * 1000
+        # Calculate Shards Searched
+        if "GLOBAL" in search_mode:
+            shards_searched = self.db.num_clusters + 1
+        else:
+            shards_searched = 2 # Target + Freshness
+        return {
+            "results": results,
+            "latency_ms": latency_ms,
+            "shards_searched": shards_searched,
+            "mode": f"xVector ({search_mode})",
+            "confidence": confidence,
+            "target_cluster": target_cluster
+        }

src/data_pipeline.py ADDED Viewed

	@@ -0,0 +1,117 @@

+import numpy as np
+from sentence_transformers import SentenceTransformer
+from datasets import load_dataset
+import pandas as pd
+from typing import List, Union
+import torch
+import torch.nn.functional as F
+def get_embeddings(model_name: str, texts: List[str]) -> np.ndarray:
+    """
+    Loads the specified model and generates embeddings for the given texts.
+    Handles 'nomic' and 'qwen' specific requirements (trust_remote_code).
+    """
+    print(f"Loading embedding model: {model_name}...")
+    trust_remote_code = False
+    if "nomic" in model_name or "qwen" in model_name:
+        trust_remote_code = True
+    model = SentenceTransformer(model_name, trust_remote_code=trust_remote_code, device='cpu')
+    # Generate embeddings
+    # Convert to numpy array if it returns a tensor or list
+    embeddings = model.encode(texts, convert_to_numpy=True, show_progress_bar=True)
+    return embeddings
+def mrl_slice(vectors: np.ndarray, dims: int) -> np.ndarray:
+    """
+    Slices the vectors to the specified dimensions AND applies L2 normalization *after* slicing.
+    This is crucial for Matryoshka Representation Learning (MRL).
+    """
+    # 1. Slice
+    sliced_vectors = vectors[:, :dims]
+    # 2. L2 Normalize (done manually with NumPy rather than importing sklearn)
+    norms = np.linalg.norm(sliced_vectors, axis=1, keepdims=True)
+    # Avoid division by zero
+    norms[norms == 0] = 1e-10
+    normalized_sliced_vectors = sliced_vectors / norms
+    return normalized_sliced_vectors
+def load_ms_marco(n_samples: int = 1000) -> List[str]:
+    """
+    Loads the MS MARCO dataset from Hugging Face.
+    Streams the dataset to save RAM.
+    Falls back to synthetic data if loading fails.
+    """
+    try:
+        print(f"Attempting to load {n_samples} samples from MS MARCO...")
+        dataset = load_dataset("microsoft/ms_marco", "v1.1", split="train", streaming=True)
+        texts = []
+        count = 0
+        for row in dataset:
+            # Each MS MARCO row has a 'query' and its 'passages'. A retrieval engine indexes
+            # documents, so we take passage text and only fall back to the query.
+            # ms_marco v1.1 row structure:
+            # {'query_id': ..., 'query': ..., 'passages': {'is_selected': [...], 'url': [...], 'passage_text': [...]}}
+            if 'passages' in row:
+                # Take the first passage text
+                passage_list = row['passages']['passage_text']
+                if passage_list:
+                    texts.append(passage_list[0])
+                    count += 1
+            elif 'query' in row:
+                # Fall back to the query text when a row has no 'passages' field.
+                texts.append(row['query'])
+                count += 1
+            if count >= n_samples:
+                break
+        if len(texts) < n_samples:
+            print("Warning: Could not fetch enough samples from MS MARCO.")
+        return texts
+    except Exception as e:
+        print(f"Error loading MS MARCO: {e}")
+        print("Falling back to synthetic data.")
+        return generate_synthetic_data(n_samples)
+def generate_synthetic_data(n_samples: int) -> List[str]:
+    """
+    Generates synthetic text data for testing.
+    """
+    base_sentences = [
+        "The quick brown fox jumps over the lazy dog.",
+        "Artificial intelligence is transforming the world.",
+        "Vector databases enable fast similarity search.",
+        "Machine learning models require data for training.",
+        "Python is a popular programming language for data science.",
+        "Cloud computing provides scalable resources.",
+        "Cybersecurity is essential for protecting digital assets.",
+        "Blockchain technology ensures decentralized transactions.",
+        "Quantum computing will solve complex problems.",
+        "Sustainable energy is the future of the planet."
+    ]
+    data = []
+    for i in range(n_samples):
+        # Create variations
+        base = base_sentences[i % len(base_sentences)]
+        data.append(f"{base} Variation {i}")
+    return data
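
A quick sanity check of the slice-then-renormalize step in `mrl_slice`, as a minimal sketch. The function body mirrors the logic above; the random input and the 768/64 dimensions are assumptions chosen for illustration, not values taken from `config.py`.

```python
import numpy as np


def mrl_slice(vectors: np.ndarray, dims: int) -> np.ndarray:
    """Keep the first `dims` components, then L2-normalize each row."""
    sliced = vectors[:, :dims]
    norms = np.linalg.norm(sliced, axis=1, keepdims=True)
    norms[norms == 0] = 1e-10  # avoid division by zero for all-zero rows
    return sliced / norms


rng = np.random.default_rng(0)
full = rng.normal(size=(4, 768))                      # stand-in for full embeddings
full /= np.linalg.norm(full, axis=1, keepdims=True)   # unit length at full size

small = mrl_slice(full, 64)
raw_slice_norms = np.linalg.norm(full[:, :64], axis=1)

print(small.shape)
print(raw_slice_norms)                  # below 1: a raw slice is not unit length
print(np.linalg.norm(small, axis=1))    # approximately 1 after renormalizing
```

Without the renormalization, dot products between sliced vectors would no longer equal cosine similarities, which is why the slice is normalized again rather than reused as-is.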

src/router.py ADDED

	@@ -0,0 +1,132 @@

+import numpy as np
+from sklearn.cluster import KMeans
+from sklearn.linear_model import LogisticRegression
+from sklearn.neural_network import MLPClassifier
+import lightgbm as lgb
+from typing import Tuple, Any
+import joblib
+import os
+class LearnedRouter:
+    def __init__(self, model_type: str = "lightgbm", n_clusters: int = 32, mrl_dims: int = 64):
+        self.model_type = model_type
+        self.n_clusters = n_clusters
+        self.mrl_dims = mrl_dims
+        self.kmeans = None
+        self.classifier = None
+    def train(self, X_full: np.ndarray):
+        """
+        Trains the router:
+        1. Cluster X_full using K-Means to generate ground-truth labels.
+        2. Slice X_full to MRL_DIMS.
+        3. Train the specified classifier on sliced vectors to predict cluster labels.
+        """
+        print(f"Training Router ({self.model_type})...")
+        # 1. Generate Ground Truth Labels with K-Means on FULL vectors
+        # (We want the clusters to be based on the high-fidelity data)
+        print("  - Running K-Means for ground truth labels...")
+        self.kmeans = KMeans(n_clusters=self.n_clusters, random_state=42, n_init=10)
+        y_labels = self.kmeans.fit_predict(X_full)
+        # 2. Slice Input Data for the Router
+        # The router only sees the low-dim MRL vector
+        print(f"  - Slicing vectors to {self.mrl_dims} dimensions...")
+        # Re-normalize after slicing: a truncated MRL vector is no longer unit length.
+        # This repeats the logic of data_pipeline.mrl_slice to keep the class self-contained.
+        X_sliced = X_full[:, :self.mrl_dims]
+        norms = np.linalg.norm(X_sliced, axis=1, keepdims=True)
+        norms[norms == 0] = 1e-10
+        X_train = X_sliced / norms
+        # 3. Train Classifier
+        print(f"  - Training classifier: {self.model_type}...")
+        if self.model_type == "lightgbm":
+            # LightGBM
+            train_data = lgb.Dataset(X_train, label=y_labels)
+            params = {
+                'objective': 'multiclass',
+                'num_class': self.n_clusters,
+                'metric': 'multi_logloss',
+                'verbosity': -1,
+                'seed': 42
+            }
+            self.classifier = lgb.train(params, train_data, num_boost_round=100)
+        elif self.model_type == "logistic":
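+            # Logistic Regression. The default lbfgs solver already fits a multinomial model,
+            # and the multi_class argument is deprecated in recent scikit-learn releases.
+            self.classifier = LogisticRegression(max_iter=1000, random_state=42)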
+            self.classifier.fit(X_train, y_labels)
+        elif self.model_type == "mlp":
+            # MLP Classifier
+            self.classifier = MLPClassifier(hidden_layer_sizes=(128, 64), max_iter=500, random_state=42)
+            self.classifier.fit(X_train, y_labels)
+        else:
+            raise ValueError(f"Unknown router model type: {self.model_type}")
+        print("Router training complete.")
+    def predict(self, vector_full: np.ndarray) -> Tuple[int, float]:
+        """
+        Predicts the target cluster for a query vector.
+        1. Slice input to MRL dims.
+        2. Predict probabilities.
+        3. Return (best_cluster, confidence_score).
+        """
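+        # Fail early with a clear message if train() or load() has not been called
+        if self.classifier is None:
+            raise RuntimeError("Router is not trained; call train() or load() first.")
+        # Ensure input is 2D
+        if vector_full.ndim == 1:
+            vector_full = vector_full.reshape(1, -1)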
+        # 1. Slice and Normalize
+        X_sliced = vector_full[:, :self.mrl_dims]
+        norms = np.linalg.norm(X_sliced, axis=1, keepdims=True)
+        norms[norms == 0] = 1e-10
+        X_input = X_sliced / norms
+        # 2. Predict
+        if self.model_type == "lightgbm":
+            probs = self.classifier.predict(X_input) # Returns (n_samples, n_classes)
+        elif self.model_type in ["logistic", "mlp"]:
+            probs = self.classifier.predict_proba(X_input)
+        else:
+            raise ValueError(f"Unknown router model type: {self.model_type}")
+        # 3. Get best cluster and confidence
+        best_cluster = np.argmax(probs, axis=1)[0]
+        confidence = np.max(probs, axis=1)[0]
+        return int(best_cluster), float(confidence)
+    def save(self, path: str):
+        """Saves the router (KMeans + Classifier) to disk."""
+        print(f"Saving router to {path}...")
+        joblib.dump({
+            'model_type': self.model_type,
+            'n_clusters': self.n_clusters,
+            'mrl_dims': self.mrl_dims,
+            'kmeans': self.kmeans,
+            'classifier': self.classifier
+        }, path)
+        print("Router saved.")
+    @classmethod
+    def load(cls, path: str):
+        """Loads the router from disk."""
+        print(f"Loading router from {path}...")
+        data = joblib.load(path)
+        router = cls(
+            model_type=data['model_type'],
+            n_clusters=data['n_clusters'],
+            mrl_dims=data['mrl_dims']
+        )
+        router.kmeans = data['kmeans']
+        router.classifier = data['classifier']
+        print("Router loaded.")
+        return router
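
The last step of `predict` turns a probability matrix into a `(cluster, confidence)` pair. The sketch below isolates that step so it can be tried without a trained model. The helper name `top_cluster` and the three-cluster probabilities are made up for illustration.

```python
import numpy as np


def top_cluster(probs: np.ndarray) -> tuple:
    """Return (best_cluster, confidence) for the first row of a probability matrix."""
    probs = np.atleast_2d(probs)              # accept a single 1-D row as well
    best = int(np.argmax(probs, axis=1)[0])   # index of the most likely cluster
    conf = float(np.max(probs, axis=1)[0])    # its probability, used as confidence
    return best, conf


# Hypothetical classifier outputs over three clusters.
confident = np.array([[0.1, 0.7, 0.2]])
uncertain = np.array([[0.4, 0.35, 0.25]])

cluster, confidence = top_cluster(confident)
print(cluster, confidence)
print(top_cluster(uncertain))
```

The second row's top probability is below 0.5, which is the kind of prediction that `search_hybrid` in `src/vector_db.py` treats as a signal to search globally.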

src/vector_db.py ADDED

	@@ -0,0 +1,188 @@

+import os
+from qdrant_client import QdrantClient, models
+from qdrant_client.http.models import Distance, VectorParams
+import numpy as np
+from typing import List, Optional, Dict, Any
+import uuid
+class UnifiedQdrant:
+    def __init__(self, collection_name: str, vector_size: int, num_clusters: int = 32, freshness_shard_id: int = 999):
+        self.client = None
+        self.collection_name = collection_name
+        self.vector_size = vector_size
+        self.num_clusters = num_clusters
+        self.freshness_shard_id = freshness_shard_id
+    def initialize(self):
+        """
+        Connects to Qdrant and sets up the collection with Custom Sharding.
+        Handles fallback if Free Tier limits are hit.
+        """
+        # Connect
+        url = os.getenv("QDRANT_URL", ":memory:")
+        api_key = os.getenv("QDRANT_API_KEY", None)
+        print(f"Connecting to Qdrant at {url}...")
+        self.client = QdrantClient(location=url, api_key=api_key, timeout=60)
+        self.is_local = url == ":memory:" or not url.startswith("http")
+        if self.is_local:
+            print("WARNING: Running in local/memory mode. Custom Sharding is NOT supported. Simulating behavior.")
+        # Drop any existing collection so that each run starts from a clean slate
+        if self.client.collection_exists(self.collection_name):
+            self.client.delete_collection(self.collection_name)
+        # Try to create collection with full clusters
+        try:
+            self._create_collection_and_shards(self.num_clusters)
+            print(f"Successfully created collection with {self.num_clusters} clusters.")
+        except Exception as e:
+            print(f"Failed to create {self.num_clusters} clusters: {e}")
+            print("Attempting fallback to 8 clusters (Free Tier limit mitigation)...")
+            # Fallback 1: 8 Clusters
+            try:
+                self.num_clusters = 8
+                if self.client.collection_exists(self.collection_name):
+                    self.client.delete_collection(self.collection_name)
+                self._create_collection_and_shards(self.num_clusters)
+                print(f"Fallback successful: Created collection with {self.num_clusters} clusters.")
+            except Exception as e2:
+                print(f"Failed to create 8 clusters: {e2}")
+                print("CRITICAL: Custom Sharding not supported. Falling back to Standard Collection (No Sharding).")
+                # Fallback 2: Standard Collection
+                self.num_clusters = 1 # Virtual clusters only
+                if self.client.collection_exists(self.collection_name):
+                    self.client.delete_collection(self.collection_name)
+                self.client.create_collection(
+                    collection_name=self.collection_name,
+                    vectors_config=VectorParams(size=self.vector_size, distance=Distance.COSINE)
+                )
+                print("Fallback successful: Created Standard Collection.")
+    def _create_collection_and_shards(self, n_clusters):
+        print(f"Creating collection '{self.collection_name}' with custom sharding ({n_clusters} clusters)...")
+        if self.is_local:
+            # Local mode doesn't support sharding_method=CUSTOM
+            self.client.create_collection(
+                collection_name=self.collection_name,
+                vectors_config=VectorParams(size=self.vector_size, distance=Distance.COSINE)
+            )
+        else:
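+            # With custom sharding, shard_number is the number of shards created *per shard key*.
+            # Using 1 gives one shard per cluster plus one for freshness once the keys exist.
+            self.client.create_collection(
+                collection_name=self.collection_name,
+                vectors_config=VectorParams(size=self.vector_size, distance=Distance.COSINE),
+                sharding_method=models.ShardingMethod.CUSTOM,
+                shard_number=1
+            )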
+        # CRITICAL: Create Shard Keys
+        if not self.is_local:
+            print("Creating shard keys...")
+            for i in range(n_clusters):
+                self.client.create_shard_key(self.collection_name, str(i))
+            # Create freshness shard key
+            self.client.create_shard_key(self.collection_name, str(self.freshness_shard_id))
+            print("Shard keys created successfully.")
+    def index_data(self, vectors: np.ndarray, payloads: List[Dict[str, Any]], cluster_ids: List[Optional[int]]):
+        """
+        Indexes data into the specific shards based on cluster_ids.
+        If cluster_id is None, it goes to the Freshness Shard.
+        """
+        # Group points by shard key. With custom sharding, shard_key_selector applies to a
+        # whole upsert call, so we send one batch per shard key.
+        data_by_shard = {}
+        for i, vec in enumerate(vectors):
+            cluster_id = cluster_ids[i]
+            if cluster_id is None:
+                key = str(self.freshness_shard_id)
+            else:
+                key = str(cluster_id)
+            if key not in data_by_shard:
+                data_by_shard[key] = []
+            point_id = str(uuid.uuid4())
+            data_by_shard[key].append(
+                models.PointStruct(
+                    id=point_id,
+                    vector=vec.tolist(),
+                    payload=payloads[i]
+                )
+            )
+        # Upsert batches
+        print(f"Indexing data across {len(data_by_shard)} shards...")
+        for key, batch_points in data_by_shard.items():
+            if self.is_local:
+                self.client.upsert(
+                    collection_name=self.collection_name,
+                    points=batch_points
+                    # No shard_key_selector in local
+                )
+            else:
+                self.client.upsert(
+                    collection_name=self.collection_name,
+                    points=batch_points,
+                    shard_key_selector=key
+                )
+    def search_hybrid(self, query_vec: np.ndarray, target_cluster: int, confidence: float) -> tuple:
+        """
+        Performs the hybrid search strategy and returns (results, search_mode).
+        - If confidence < 0.5, Global Search (all shards, which includes the Freshness Shard).
+        - Else, search only [target_cluster, FRESHNESS_SHARD_ID].
+        In local/memory mode there are no shards, so every query scans the whole collection.
+        """
+        # Ensure query_vec is list
+        if isinstance(query_vec, np.ndarray):
+            query_vec = query_vec.tolist()
+            if isinstance(query_vec[0], list): # Handle 2D array if passed
+                query_vec = query_vec[0]
+        if confidence < 0.5:
+            # Global Search: leaving shard_key_selector as None does not restrict the query
+            # to specific shard keys, so it is expected to cover every shard.
+            shard_keys = None
+            search_mode = "GLOBAL"
+        else:
+            # Targeted Search
+            shard_keys = [str(target_cluster), str(self.freshness_shard_id)]
+            search_mode = f"TARGETED (Cluster {target_cluster} + Freshness)"
+        # print(f"Searching: {search_mode} | Confidence: {confidence:.4f}")
+        if self.is_local:
+            results = self.client.query_points(
+                collection_name=self.collection_name,
+                query=query_vec,
+                limit=10
+            ).points
+        else:
+            results = self.client.query_points(
+                collection_name=self.collection_name,
+                query=query_vec,
+                shard_key_selector=shard_keys,
+                limit=10
+            ).points
+        return results, search_mode
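
The batching step in `index_data` can be sketched without a Qdrant connection. The example below only groups point indices by shard key; the helper name `group_by_shard` is an assumption, and `FRESHNESS_SHARD_ID = 999` mirrors the default argument of `UnifiedQdrant.__init__`.

```python
from collections import defaultdict
from typing import Dict, List, Optional

FRESHNESS_SHARD_ID = 999  # same default as UnifiedQdrant.__init__


def group_by_shard(cluster_ids: List[Optional[int]]) -> Dict[str, List[int]]:
    """Map each shard key to the indices of the points that belong to it."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for idx, cluster_id in enumerate(cluster_ids):
        # Points without a cluster assignment go to the freshness shard.
        key = str(FRESHNESS_SHARD_ID) if cluster_id is None else str(cluster_id)
        groups[key].append(idx)
    return dict(groups)


groups = group_by_shard([3, None, 3, 0])
print(groups)
```

Each key in the result corresponds to one upsert call, so the number of network round trips scales with the number of distinct shard keys rather than the number of points.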