---
license: mit
sdk: gradio
emoji: 🚀
colorFrom: gray
sdk_version: 5.34.0
---


# 🔥 Hybrid Search RAGtim Bot

A hybrid search system that combines semantic vector search with BM25 keyword matching to retrieve information from a markdown knowledge base.

## 🚀 Features

- **Hybrid Search**: Combines transformer-based semantic similarity with BM25 keyword ranking
- **Multiple Search Modes**: Vector search, BM25 search, and weighted fusion of the two
- **Real-time API**: RESTful endpoints for integration
- **Interactive UI**: Three interfaces - Chat, Advanced Search, and Statistics
- **Knowledge Base**: Comprehensive markdown-based knowledge system

## 🔧 Technology Stack

- **Embeddings**: sentence-transformers/all-MiniLM-L6-v2 (384-dim)
- **Search**: Custom BM25 implementation + Vector similarity
- **Framework**: Gradio 5.34.0 (the `sdk_version` set in the Space metadata above)
- **ML**: Transformers, PyTorch, NumPy
- **Deployment**: Hugging Face Spaces

## 📚 Knowledge Base Structure

The system processes markdown files from the `knowledge_base/` directory:
- `about.md` - Personal information and professional summary
- `research_details.md` - Research projects and methodologies
- `publications_detailed.md` - Publications with technical details
- `skills_expertise.md` - Technical skills and expertise
- `experience_detailed.md` - Professional experience
- `statistics.md` - Statistical methods and biostatistics
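
As a rough illustration of how the files in `knowledge_base/` could be turned into searchable units, the sketch below splits each markdown file into heading-delimited chunks. The helper name `load_chunks` and the heading-based chunking strategy are assumptions made for this example; the Space's actual loader may work differently.

```python
from pathlib import Path


def load_chunks(kb_dir="knowledge_base"):
    """Split every .md file in kb_dir into heading-delimited chunks.

    Hypothetical helper: the real system may chunk differently.
    """
    chunks = []
    for path in sorted(Path(kb_dir).glob("*.md")):
        current = []
        for line in path.read_text(encoding="utf-8").splitlines():
            # A markdown heading closes the previous chunk and starts a new one.
            if line.startswith("#") and current:
                chunks.append({"source": path.name, "text": "\n".join(current).strip()})
                current = []
            current.append(line)
        if current:
            chunks.append({"source": path.name, "text": "\n".join(current).strip()})
    # Drop chunks that contain only whitespace.
    return [c for c in chunks if c["text"]]
```

Each chunk keeps its source filename, so a search result can be traced back to the file it came from.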

## 🔍 Search Methods

### Hybrid Search (Recommended)
Combines semantic and keyword search with configurable weights:
- Default: 60% vector + 40% BM25
- A good default for most queries
- Balances meaning and exact term matching
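
The weighting above can be sketched as a weighted sum of per-document scores. Because cosine similarities and BM25 scores live on different scales, this example rescales both with min-max normalization first. That normalization step and the function names `min_max` and `hybrid_scores` are assumptions for illustration; the README does not say how the system normalizes or fuses scores.

```python
def min_max(scores):
    """Rescale scores to [0, 1] so the two methods are comparable."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # All scores equal: no ranking signal from this method.
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def hybrid_scores(vector_scores, bm25_scores, vector_weight=0.6, bm25_weight=0.4):
    """Weighted sum of normalized scores; both lists use the same document order."""
    v = min_max(vector_scores)
    b = min_max(bm25_scores)
    return [vector_weight * vs + bm25_weight * bs for vs, bs in zip(v, b)]


# Three documents scored by both methods (made-up numbers).
vector_scores = [0.82, 0.40, 0.10]
bm25_scores = [1.0, 5.0, 0.0]
combined = hybrid_scores(vector_scores, bm25_scores)
# Document indices ordered from best to worst combined score.
ranking = sorted(range(len(combined)), key=lambda i: combined[i], reverse=True)
```

In this toy input the second document has the strongest keyword match, but the first document still ranks highest because the vector score carries the larger weight.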

### Vector Search
Pure semantic similarity using transformer embeddings:
- Best for conceptual questions
- Finds semantically related content
- Language-agnostic similarity
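
Semantic ranking of this kind usually comes down to comparing a query embedding with each document embedding by cosine similarity. The sketch below uses plain Python and toy 3-dimensional vectors as stand-ins for the 384-dimensional model embeddings; `cosine_similarity` and `vector_search` are hypothetical names, not functions from this project.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)


def vector_search(query_vec, doc_vecs, top_k=5):
    """Return (doc_index, similarity) pairs, most similar first."""
    scored = [(i, cosine_similarity(query_vec, d)) for i, d in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy vectors standing in for real embeddings.
docs = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
results = vector_search([1.0, 0.0, 0.0], docs, top_k=2)
```

In practice the vectors would come from the embedding model, and a vectorized NumPy implementation would be faster than these Python loops.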

### BM25 Search  
Traditional keyword-based ranking:
- Excellent for specific terms
- TF-IDF-style scoring with term-frequency saturation and document length normalization
- Fast and interpretable
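
To make the scoring concrete, here is a minimal Okapi BM25 sketch using the `k1 = 1.5` and `b = 0.75` values listed under Configuration. The smoothed IDF variant, the whitespace tokenization, and the function name `bm25_scores` are assumptions for illustration; the project's custom implementation may differ in these details.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document against the tokenized query."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed IDF; the "+ 1" inside the log keeps it non-negative.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            # k1 caps the effect of repeated terms; b scales by relative length.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "hybrid search combines vector and keyword search".split(),
    "bm25 ranks documents by keyword overlap".split(),
    "transformers produce sentence embeddings".split(),
]
scores = bm25_scores("keyword search".split(), docs)
```

The first document matches both query terms and scores highest, the second matches only `keyword`, and the third matches neither and scores zero.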

## 🛠️ API Endpoints

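### Search API
- `GET /api/search` - runs a search; the API Integration example under Usage Examples passes `query`, `top_k`, and `search_type`
- `GET /api/stats`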

## 📊 Configuration

Key parameters in `config.py`:
- `BM25_K1 = 1.5` - Term frequency saturation
- `BM25_B = 0.75` - Document length normalization
- `DEFAULT_VECTOR_WEIGHT = 0.6` - Hybrid search weighting
- `DEFAULT_BM25_WEIGHT = 0.4` - Hybrid search weighting

## 🚀 Deployment

1. Clone to Hugging Face Spaces
2. Ensure all markdown files are in `knowledge_base/`
3. The system auto-initializes on startup
4. Access via the provided Space URL

## 💡 Usage Examples

**Chat Interface:**
- "What is Raktim's LLM research?"
- "Tell me about statistical methods"
- "Describe multimodal AI capabilities"

**Advanced Search:**
- Adjust vector/BM25 weights
- Compare search methods
- Fine-tune result count

**API Integration:**
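```python
import requests

# Replace the placeholder host with your own Space URL.
response = requests.get(
    "https://your-space.hf.space/api/search",
    params={
        "query": "machine learning research",
        "top_k": 5,
        "search_type": "hybrid"
    },
    timeout=30,  # avoid hanging if the Space is asleep or slow to respond
)
response.raise_for_status()  # raise on HTTP 4xx/5xx responses
print(response.text)  # the response format is not documented in this README
```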