Phase 2 Architecture Diagram
System Overview
┌──────────────────────────────────────────────────────────────────────────┐
│                           SCAMSHIELD AI SYSTEM                           │
│                                                                          │
│  ┌────────────────────────────────────────────────────────────────────┐  │
│  │                          USER INTERFACES                           │  │
│  │                                                                    │  │
│  │      ┌──────────────────┐              ┌──────────────────┐        │  │
│  │      │    Phase 1 UI    │              │    Phase 2 UI    │        │  │
│  │      │   (index.html)   │              │   (voice.html)   │        │  │
│  │      │                  │              │                  │        │  │
│  │      │    Text Chat     │              │  🎤 Voice Chat   │        │  │
│  │      │    Interface     │              │    Interface     │        │  │
│  │      └────────┬─────────┘              └────────┬─────────┘        │  │
│  └───────────────┼─────────────────────────────────┼──────────────────┘  │
│                  │                                 │                     │
│                  │ HTTP/JSON                       │ HTTP/FormData       │
│                  │                                 │                     │
└──────────────────┼─────────────────────────────────┼─────────────────────┘
                   │                                 │
                   ▼                                 ▼
┌──────────────────────────────────────────────────────────────────────────┐
│                                API LAYER                                 │
│                                                                          │
│     ┌────────────────────────┐        ┌────────────────────────┐         │
│     │   Phase 1 Endpoints    │        │   Phase 2 Endpoints    │         │
│     │     (endpoints.py)     │        │  (voice_endpoints.py)  │         │
│     │                        │        │                        │         │
│     │ POST /honeypot/engage  │        │ POST /voice/engage     │         │
│     │ GET  /honeypot/session │        │ GET  /voice/audio/:id  │         │
│     │ POST /honeypot/batch   │        │ GET  /voice/health     │         │
│     └────────────┬───────────┘        └────────────┬───────────┘         │
└──────────────────┼─────────────────────────────────┼─────────────────────┘
                   │                                 │
                   │                                 ▼
                   │                    ┌──────────────────────────┐
                   │                    │   Phase 2 Voice Layer    │
                   │                    │                          │
                   │                    │  ┌────────────────────┐  │
                   │                    │  │    Audio Upload    │  │
                   │                    │  │  (multipart/form)  │  │
                   │                    │  └─────────┬──────────┘  │
                   │                    │            │             │
                   │                    │            ▼             │
                   │                    │  ┌────────────────────┐  │
                   │                    │  │     ASR Engine     │  │
                   │                    │  │     (Whisper)      │  │
                   │                    │  │                    │  │
                   │                    │  │    Audio → Text    │  │
                   │                    │  └─────────┬──────────┘  │
                   │                    │            │             │
                   │                    └────────────┬─────────────┘
                   │                                 │
                   │                                 │ Transcribed Text
                   │                                 │
                   │◀────────────────────────────────┘
                   ▼
┌──────────────────────────────────────────────────────────────────────────┐
│                         PHASE 1 CORE (UNCHANGED)                         │
│                                                                          │
│  ┌────────────────────────────────────────────────────────────────────┐  │
│  │                          DETECTION LAYER                           │  │
│  │                                                                    │  │
│  │  ┌──────────────────┐    ┌──────────────────┐    ┌──────────────┐  │  │
│  │  │ Language Detector│    │  Scam Detector   │    │  Scam Type   │  │  │
│  │  │  (language.py)   │───▶│  (detector.py)   │───▶│  Classifier  │  │  │
│  │  │                  │    │                  │    │              │  │  │
│  │  │   Auto-detect    │    │   IndicBERT +    │    │  Financial   │  │  │
│  │  │   en/hi/gu/etc   │    │   Rules-based    │    │ Tech Support │  │  │
│  │  └──────────────────┘    └──────────────────┘    │ Impersonation│  │  │
│  │                                                  └──────────────┘  │  │
│  └──────────────────────────────────┬─────────────────────────────────┘  │
│                                     │                                    │
│                                     ▼                                    │
│  ┌────────────────────────────────────────────────────────────────────┐  │
│  │                           HONEYPOT LAYER                           │  │
│  │                                                                    │  │
│  │   ┌────────────────────────────────────────────────────────────┐   │  │
│  │   │                     LangGraph Workflow                     │   │  │
│  │   │                                                            │   │  │
│  │   │       ┌─────────┐      ┌─────────┐      ┌─────────┐        │   │  │
│  │   │       │  Plan   │─────▶│Generate │─────▶│ Extract │        │   │  │
│  │   │       │  Node   │      │  Node   │      │  Node   │        │   │  │
│  │   │       └────┬────┘      └────┬────┘      └────┬────┘        │   │  │
│  │   │            │                │                │             │   │  │
│  │   │            │                ▼                │             │   │  │
│  │   │            │          ┌──────────┐           │             │   │  │
│  │   │            │          │ Groq LLM │           │             │   │  │
│  │   │            │          │ (Llama)  │           │             │   │  │
│  │   │            │          └──────────┘           │             │   │  │
│  │   │            │                                 │             │   │  │
│  │   │            └────────────────┬────────────────┘             │   │  │
│  │   │                             │                              │   │  │
│  │   │                             ▼                              │   │  │
│  │   │                     ┌───────────────┐                      │   │  │
│  │   │                     │ State Manager │                      │   │  │
│  │   │                     │    (Redis)    │                      │   │  │
│  │   │                     └───────────────┘                      │   │  │
│  │   └────────────────────────────────────────────────────────────┘   │  │
│  │                                                                    │  │
│  │                         Output: Text Reply                         │  │
│  └──────────────────────────────────┬─────────────────────────────────┘  │
│                                     │                                    │
│                                     ▼                                    │
│  ┌────────────────────────────────────────────────────────────────────┐  │
│  │                          EXTRACTION LAYER                          │  │
│  │                                                                    │  │
│  │    ┌──────────────┐      ┌──────────────┐      ┌──────────────┐    │  │
│  │    │ UPI Extractor│      │ Bank Account │      │  Phone/URL   │    │  │
│  │    │   (Regex)    │      │  Extractor   │      │  Extractor   │    │  │
│  │    └──────────────┘      └──────────────┘      └──────────────┘    │  │
│  │                                                                    │  │
│  │                   Output: Extracted Intelligence                   │  │
│  └──────────────────────────────────┬─────────────────────────────────┘  │
└─────────────────────────────────────┼────────────────────────────────────┘
                                      │
                                      │ Text Reply
                                      │
                                      ▼
┌──────────────────────────────────────────────────────────────────────────┐
│                           PHASE 2 OUTPUT LAYER                           │
│                                                                          │
│  ┌────────────────────────────────────────────────────────────────────┐  │
│  │                         TTS Engine (gTTS)                          │  │
│  │                                                                    │  │
│  │       Text Reply ───▶ Text-to-Speech ───▶ Audio File (.mp3)        │  │
│  │                                                                    │  │
│  │               Languages: en, hi, gu, ta, te, bn, mr                │  │
│  └────────────────────────────────────────────────────────────────────┘  │
│                                                                          │
│                            Output: Audio URL                             │
└─────────────────────────────────────┬────────────────────────────────────┘
                                      │
                                      │ Audio URL
                                      │
                                      ▼
                                 ┌──────────┐
                                 │   User   │
                                 │  Hears   │
                                 │ AI Voice │
                                 └──────────┘
Data Flow: Voice Conversation
Step-by-Step Flow
1. USER SPEAKS
   │
   │  "Your account is blocked. Send OTP now!"
   │
   ▼
2. BROWSER CAPTURES AUDIO
   │
   │  MediaRecorder API → WebM/WAV blob
   │
   ▼
3. UPLOAD TO API
   │
   │  POST /api/v1/voice/engage
   │  FormData: audio_file, session_id, language
   │
   ▼
4. ASR (WHISPER)
   │
   │  Audio → Text Transcription
   │  Output: "Your account is blocked. Send OTP now!"
   │  Language: "en"
   │  Confidence: 0.95
   │
   ▼
5. PHASE 1 DETECTION (UNCHANGED)
   │
   │  Input: Transcribed text
   │  Scam Detector → is_scam: true, confidence: 0.92
   │  Type: "financial_fraud"
   │
   ▼
6. PHASE 1 HONEYPOT (UNCHANGED)
   │
   │  LangGraph Workflow:
   │    - Plan: Select "confused_elderly" persona
   │    - Generate: Groq LLM creates reply
   │    - Extract: Parse for UPI/bank/phone
   │
   │  Output: "Oh no! What should I do? I'm scared!"
   │
   ▼
7. TTS (gTTS)
   │
   │  Text → Speech Synthesis
   │  Input: "Oh no! What should I do? I'm scared!"
   │  Language: "en"
   │  Output: /tmp/reply_xyz.mp3
   │
   ▼
8. RESPONSE TO USER
   │
   │  JSON Response:
   │  {
   │    "ai_reply_text": "Oh no! What should I do? I'm scared!",
   │    "ai_reply_audio_url": "/api/v1/voice/audio/reply_xyz.mp3",
   │    "transcription": {...},
   │    "scam_detected": true,
   │    ...
   │  }
   │
   ▼
9. BROWSER PLAYS AUDIO
   │
   │  <audio controls src="/api/v1/voice/audio/reply_xyz.mp3">
   │
   ▼
10. USER HEARS AI VOICE
   │
   │  "Oh no! What should I do? I'm scared!"
   │
   └──▶ Loop continues for multi-turn conversation
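Steps 4 through 8 reduce to a short server-side pipeline. The sketch below is a minimal illustration, not the project's implementation: the stage functions (`transcribe`, `detect_scam`, `honeypot_reply`, `synthesize`) are hypothetical stand-ins for Whisper, the Phase 1 detector, the LangGraph workflow, and gTTS, and the stub values simply replay the example conversation. Only the response field names and the audio URL prefix come from the flow above.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Transcription:
    text: str
    language: str
    confidence: float


def handle_voice_turn(
    audio: bytes,
    transcribe: Callable[[bytes], Transcription],
    detect_scam: Callable[[str], Dict],
    honeypot_reply: Callable[[str, Dict], str],
    synthesize: Callable[[str, str], str],
) -> Dict:
    """Run one voice turn: ASR -> Phase 1 detection -> honeypot -> TTS."""
    transcription = transcribe(audio)                            # step 4
    detection = detect_scam(transcription.text)                  # step 5
    reply_text = honeypot_reply(transcription.text, detection)   # step 6
    audio_name = synthesize(reply_text, transcription.language)  # step 7
    return {                                                     # step 8
        "ai_reply_text": reply_text,
        "ai_reply_audio_url": f"/api/v1/voice/audio/{audio_name}",
        "transcription": {
            "text": transcription.text,
            "language": transcription.language,
            "confidence": transcription.confidence,
        },
        "scam_detected": detection["is_scam"],
    }


# Hypothetical stubs that replay the example conversation above.
def stub_transcribe(audio: bytes) -> Transcription:
    return Transcription("Your account is blocked. Send OTP now!", "en", 0.95)


def stub_detect(text: str) -> Dict:
    return {"is_scam": True, "confidence": 0.92, "type": "financial_fraud"}


def stub_reply(text: str, detection: Dict) -> str:
    return "Oh no! What should I do? I'm scared!"


def stub_synthesize(text: str, language: str) -> str:
    return "reply_xyz.mp3"


response = handle_voice_turn(
    b"fake-audio-bytes", stub_transcribe, stub_detect, stub_reply, stub_synthesize
)
```

Passing the stages in as callables keeps the voice wrapper ignorant of Phase 1 internals, which matches the intent that Phase 1 stays unchanged.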
Component Isolation
Phase 1 (Existing - Untouched)
┌─────────────────────────────────────┐
│         PHASE 1 COMPONENTS          │
│                                     │
│  ✅ app/models/detector.py          │
│  ✅ app/models/language.py          │
│  ✅ app/models/extractor.py         │
│  ✅ app/agent/honeypot.py           │
│  ✅ app/agent/personas.py           │
│  ✅ app/agent/strategies.py         │
│  ✅ app/api/endpoints.py            │
│  ✅ app/api/schemas.py              │
│  ✅ ui/index.html                   │
│  ✅ ui/app.js                       │
│                                     │
│      NO MODIFICATIONS REQUIRED      │
└─────────────────────────────────────┘
Phase 2 (New - Isolated)
┌─────────────────────────────────────┐
│         PHASE 2 COMPONENTS          │
│                                     │
│  🆕 app/voice/asr.py                │
│  🆕 app/voice/tts.py                │
│  🆕 app/voice/fraud_detector.py     │
│  🆕 app/api/voice_endpoints.py      │
│  🆕 app/api/voice_schemas.py        │
│  🆕 ui/voice.html                   │
│  🆕 ui/voice.js                     │
│  🆕 ui/voice.css                    │
│                                     │
│         COMPLETELY SEPARATE         │
└─────────────────────────────────────┘
Integration Points (Minimal)
┌─────────────────────────────────────┐
│         INTEGRATION POINTS          │
│                                     │
│  📝 app/main.py                     │
│     + Add voice router (conditional)│
│                                     │
│  📝 app/config.py                   │
│     + Add Phase 2 settings          │
│                                     │
│  📝 .env                            │
│     + Add PHASE_2_ENABLED=true      │
│                                     │
│           MINIMAL CHANGES           │
└─────────────────────────────────────┘
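The conditional router registration can be sketched without any framework. This is a minimal illustration under stated assumptions: the helper names (`phase2_enabled`, `routers_to_mount`), the accepted truthy spellings, and the default of "disabled when unset" are all hypothetical; only the `PHASE_2_ENABLED` variable and the two endpoint modules come from the lists above.

```python
import os
from typing import List, Mapping

# Assumption: these spellings count as "enabled"; anything else is disabled.
TRUTHY = {"1", "true", "yes", "on"}


def phase2_enabled(env: Mapping[str, str] = os.environ) -> bool:
    """Read the PHASE_2_ENABLED flag, defaulting to disabled when unset."""
    return env.get("PHASE_2_ENABLED", "false").strip().lower() in TRUTHY


def routers_to_mount(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the endpoint modules the application should register."""
    routers = ["app.api.endpoints"]  # Phase 1 is always mounted
    if phase2_enabled(env):
        # Phase 2 modules (and their heavier dependencies) are only
        # referenced when the flag is set.
        routers.append("app.api.voice_endpoints")
    return routers
```

In `app/main.py`, the equivalent check would wrap the import and registration of the voice router, so a deployment with the flag off never imports the Phase 2 dependencies.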
Optional: Voice Fraud Detection
┌─────────────────────────────────────────────────────────────┐
│              VOICE FRAUD DETECTION (OPTIONAL)               │
│                                                             │
│                         Audio Input                         │
│                              │                              │
│            ┌─────────────────┴─────────────────┐            │
│            │                                   │            │
│            ▼                                   ▼            │
│       ┌─────────┐                         ┌─────────┐       │
│       │   ASR   │                         │  Fraud  │       │
│       │(Whisper)│                         │Detector │       │
│       └────┬────┘                         │(Wav2Vec2│       │
│            │                              │resemb.) │       │
│            │                              └────┬────┘       │
│            │                                   │            │
│            │ Transcribed Text                  │ Fraud Score│
│            │                                   │            │
│            ▼                                   ▼            │
│       ┌──────────────────────────────────────────────┐      │
│       │              Combined Analysis               │      │
│       │                                              │      │
│       │  - Text content (scam detection)             │      │
│       │  - Voice authenticity (fraud detection)      │      │
│       │                                              │      │
│       │  Risk Score = f(scam_conf, fraud_conf)       │      │
│       └──────────────────────────────────────────────┘      │
│                                                             │
│                If VOICE_FRAUD_DETECTION=true                │
└─────────────────────────────────────────────────────────────┘
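The diagram leaves the combining function `f` unspecified. One plausible choice, shown here purely as an assumption, is a noisy-OR: the risk is the probability that at least one of the two signals is right, treating them as independent. The function name `combined_risk` is hypothetical, and the fallback to the text score alone models the case where `VOICE_FRAUD_DETECTION` is off.

```python
from typing import Optional


def combined_risk(scam_conf: float, fraud_conf: Optional[float] = None) -> float:
    """Combine text-scam and voice-fraud confidences into one risk score.

    Assumption: noisy-OR combination, 1 - (1 - scam)(1 - fraud). The source
    document only states Risk Score = f(scam_conf, fraud_conf).
    """
    for name, value in (("scam_conf", scam_conf), ("fraud_conf", fraud_conf)):
        if value is not None and not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    if fraud_conf is None:
        # Voice fraud detection disabled: fall back to the text score.
        return scam_conf
    return 1.0 - (1.0 - scam_conf) * (1.0 - fraud_conf)
```

A noisy-OR never scores lower than either input, so a confident text detection is not diluted by an uncertain voice score. A weighted average would be the alternative if the two signals should be able to offset each other.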
Database & State Management
┌───────────────────────────────────────────────────────────┐
│                     STATE PERSISTENCE                     │
│                                                           │
│   ┌──────────────────────┐     ┌──────────────────────┐   │
│   │        Redis         │     │      PostgreSQL      │   │
│   │   (Session State)    │     │ (Conversation Logs)  │   │
│   │                      │     │                      │   │
│   │ - Active sessions    │     │ - Full transcripts   │   │
│   │ - Turn count         │     │ - Extracted intel    │   │
│   │ - Extracted intel    │     │ - Audio metadata     │   │
│   │ - Persona state      │     │ - Timestamps         │   │
│   │ - TTL: 1 hour        │     │ - Permanent storage  │   │
│   └──────────────────────┘     └──────────────────────┘   │
│                                                           │
│               SAME AS PHASE 1 - NO CHANGES                │
└───────────────────────────────────────────────────────────┘
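The session fields and the one-hour TTL can be illustrated with an in-memory stand-in for Redis. This is a sketch only: `SessionState`, `InMemorySessionStore`, and the field types are hypothetical, and a real deployment would rely on Redis key expiry rather than the manual check shown here.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

SESSION_TTL_SECONDS = 3600  # "TTL: 1 hour" from the diagram


@dataclass
class SessionState:
    session_id: str
    turn_count: int = 0
    persona: Optional[str] = None
    extracted_intel: Dict[str, List[str]] = field(default_factory=dict)


class InMemorySessionStore:
    """Hypothetical stand-in for the Redis session store, with expiry."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock  # injectable so expiry can be tested
        self._items: Dict[str, Tuple[float, SessionState]] = {}

    def save(self, state: SessionState) -> None:
        # Each save resets the expiry, like re-setting a key with a TTL.
        expires_at = self._clock() + SESSION_TTL_SECONDS
        self._items[state.session_id] = (expires_at, state)

    def load(self, session_id: str) -> Optional[SessionState]:
        entry = self._items.get(session_id)
        if entry is None:
            return None
        expires_at, state = entry
        if self._clock() >= expires_at:
            del self._items[session_id]  # expired: drop and report missing
            return None
        return state
```

Because voice and text turns share this state, a conversation started in one UI could in principle continue in the other under the same `session_id`.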
Performance Considerations
┌───────────────────────────────────────────────────────────┐
│                     LATENCY BREAKDOWN                     │
│                                                           │
│  User Speaks (5s audio)                                   │
│       │                                                   │
│       ▼                                                   │
│  Upload to API (0.5s)                                     │
│       │                                                   │
│       ▼                                                   │
│  ASR Transcription (1.5s)  ◀── Whisper base model         │
│       │                                                   │
│       ▼                                                   │
│  Scam Detection (0.2s)     ◀── IndicBERT                  │
│       │                                                   │
│       ▼                                                   │
│  Honeypot LLM (1.0s)       ◀── Groq API                   │
│       │                                                   │
│       ▼                                                   │
│  TTS Synthesis (0.8s)      ◀── gTTS                       │
│       │                                                   │
│       ▼                                                   │
│  Download Audio (0.3s)                                    │
│       │                                                   │
│       ▼                                                   │
│  Total: ~4.3s (excluding the 5s recording)                │
│                                                           │
│  Target: <5s                                              │
│                                                           │
└───────────────────────────────────────────────────────────┘
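The total is the sum of the six processing stages; the 5 s of speech is not counted. A small sketch makes the arithmetic and the remaining headroom explicit. The dictionary keys and the `budget_report` helper are hypothetical names; the numbers are the estimates from the breakdown above.

```python
from typing import Dict

# Per-stage estimates in seconds, taken from the latency breakdown.
STAGE_LATENCY_S: Dict[str, float] = {
    "upload": 0.5,
    "asr": 1.5,
    "scam_detection": 0.2,
    "honeypot_llm": 1.0,
    "tts": 0.8,
    "download": 0.3,
}
TARGET_S = 5.0


def budget_report(stages: Dict[str, float], target_s: float) -> Dict:
    """Sum the stage latencies and compare them with the target."""
    total = sum(stages.values())
    return {
        "total_s": round(total, 2),
        "headroom_s": round(target_s - total, 2),
        "slowest_stage": max(stages, key=lambda name: stages[name]),
        "within_target": total < target_s,
    }


report = budget_report(STAGE_LATENCY_S, TARGET_S)
```

With these estimates the report shows 0.7 s of headroom and identifies ASR as the slowest stage, so transcription is the first place to look if the loop exceeds the target.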
Deployment Architecture
┌───────────────────────────────────────────────────────────┐
│                        DEPLOYMENT                         │
│                                                           │
│  ┌─────────────────────────────────────────────────────┐  │
│  │                  Docker Container                   │  │
│  │                                                     │  │
│  │  ┌───────────────────────────────────────────────┐  │  │
│  │  │              FastAPI Application              │  │  │
│  │  │                                               │  │  │
│  │  │ - Phase 1 Endpoints (always enabled)          │  │  │
│  │  │ - Phase 2 Endpoints (if PHASE_2_ENABLED=true) │  │  │
│  │  │                                               │  │  │
│  │  │ Dependencies:                                 │  │  │
│  │  │ - Base: transformers, langchain, groq         │  │  │
│  │  │ - Phase 2: whisper, gTTS, torchaudio          │  │  │
│  │  └───────────────────────────────────────────────┘  │  │
│  │                                                     │  │
│  │  ┌───────────────────────────────────────────────┐  │  │
│  │  │                  Model Cache                  │  │  │
│  │  │                                               │  │  │
│  │  │ - IndicBERT (~500MB)                          │  │  │
│  │  │ - Whisper base (~150MB) [Phase 2]             │  │  │
│  │  └───────────────────────────────────────────────┘  │  │
│  └─────────────────────────────────────────────────────┘  │
│                                                           │
│  External Services:                                       │
│  - Redis (session state)                                  │
│  - PostgreSQL (conversation logs)                         │
│  - Groq API (LLM)                                         │
└───────────────────────────────────────────────────────────┘
Security Architecture
┌───────────────────────────────────────────────────────────┐
│                      SECURITY LAYERS                      │
│                                                           │
│  1. API Authentication                                    │
│     └──▶ x-api-key header (both Phase 1 & 2)              │
│                                                           │
│  2. Input Validation                                      │
│     ├──▶ File size limits (<10MB)                         │
│     ├──▶ File type validation (audio/* only)              │
│     └──▶ Sanitize filenames                               │
│                                                           │
│  3. Rate Limiting                                         │
│     └──▶ Max requests per session                         │
│                                                           │
│  4. Data Privacy                                          │
│     ├──▶ Temporary audio storage                          │
│     ├──▶ Auto-delete after processing                     │
│     └──▶ No raw audio in database                         │
│                                                           │
│  5. Error Handling                                        │
│     └──▶ No sensitive info in error messages              │
└───────────────────────────────────────────────────────────┘
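The three input-validation checks can be sketched as one function. This is an illustration under assumptions: `validate_upload` and `sanitize_filename` are hypothetical names, "10MB" is read as 10 × 1024 × 1024 bytes, and the allowed filename characters are a conservative choice rather than a documented rule.

```python
import re

MAX_UPLOAD_BYTES = 10 * 1024 * 1024  # assumption: "<10MB" means under 10 MiB


def sanitize_filename(name: str) -> str:
    """Strip directory components and replace unsafe characters."""
    # Keep only the final path component, for both "/" and "\" separators.
    base = name.replace("\\", "/").rsplit("/", 1)[-1]
    # Replace anything outside a conservative whitelist, then drop leading dots.
    cleaned = re.sub(r"[^A-Za-z0-9._-]", "_", base).lstrip(".")
    return cleaned or "upload"


def validate_upload(filename: str, content_type: str, size_bytes: int) -> str:
    """Return a safe filename, or raise ValueError with a generic message."""
    if size_bytes <= 0:
        raise ValueError("empty upload")
    if size_bytes >= MAX_UPLOAD_BYTES:
        raise ValueError("file too large")
    if not content_type.lower().startswith("audio/"):
        raise ValueError("unsupported file type")
    return sanitize_filename(filename)
```

The error messages are deliberately generic and never echo the submitted filename or server paths, in line with layer 5. The declared content type is client-controlled, so a stricter check would also inspect the file's leading bytes.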
Key Takeaways
- Phase 2 wraps Phase 1: Voice is input/output layer only
- Zero modifications to Phase 1: Core honeypot unchanged
- Conditional loading: Phase 2 only loads if enabled
- Separate UI: Voice UI is independent of text UI
- Same state management: Reuses Redis/PostgreSQL
- Performance target: <5s for full voice loop
- Security first: Audio files temporary, validated, rate-limited
For detailed implementation steps, see: PHASE_2_VOICE_IMPLEMENTATION_PLAN.md
For quick start guide, see: PHASE_2_README.md
For progress tracking, see: PHASE_2_CHECKLIST.md