Phase 2: Voice Implementation - Quick Start Guide
What is Phase 2?
Phase 2 adds live two-way voice conversation to the ScamShield AI honeypot:
- You speak (as scammer) → AI transcribes → processes → AI speaks back
- Completely isolated from Phase 1 (text honeypot)
- Optional feature (enabled via PHASE_2_ENABLED=true)
Architecture
Voice Input (You) → ASR (Whisper) → Text
                                     ↓
                     Phase 1 Honeypot (Unchanged)
                                     ↓
Voice Output (AI) ← TTS (gTTS) ← Text Reply
Key Point: The Phase 1 text honeypot is not modified. Voice is just an input/output wrapper around it.
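The wrapper idea can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual code: `run_voice_turn`, `VoiceTurn`, and the three callables are hypothetical names, and the ASR, honeypot, and TTS steps are passed in as plain functions so the sketch runs without Whisper or gTTS installed.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class VoiceTurn:
    transcript: str     # text recovered from the caller's audio
    reply_text: str     # text reply produced by the Phase 1 honeypot
    reply_audio: bytes  # synthesized audio for that reply


def run_voice_turn(
    audio: bytes,
    transcribe: Callable[[bytes], str],  # ASR step (assumed: a Whisper wrapper)
    honeypot: Callable[[str], str],      # unchanged Phase 1 text entry point (assumed)
    synthesize: Callable[[str], bytes],  # TTS step (assumed: a gTTS wrapper)
) -> VoiceTurn:
    """Put speech-to-text in front of the text honeypot and text-to-speech behind it."""
    transcript = transcribe(audio)
    reply_text = honeypot(transcript)
    reply_audio = synthesize(reply_text)
    return VoiceTurn(transcript, reply_text, reply_audio)


if __name__ == "__main__":
    # Stand-in callables; a real setup would plug in the ASR and TTS modules.
    turn = run_voice_turn(
        b"fake-audio",
        transcribe=lambda _audio: "Your account is blocked. Send OTP immediately.",
        honeypot=lambda _text: "Oh no! What should I do? Can you help me?",
        synthesize=lambda text: text.encode("utf-8"),
    )
    print(turn.transcript)
    print(turn.reply_text)
```

Because the honeypot is only ever called with text, nothing in Phase 1 needs to know that the text came from audio.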
Quick Setup
1. Install Dependencies
# Install Phase 2 dependencies
pip install -r requirements-phase2.txt
# Note: PyAudio may need system packages
# Windows: pip install pipwin && pipwin install pyaudio
# Linux: sudo apt-get install portaudio19-dev
# Mac: brew install portaudio
2. Configure Environment
# Add to your .env file
PHASE_2_ENABLED=true
WHISPER_MODEL=base
TTS_ENGINE=gtts
VOICE_FRAUD_DETECTION=false
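As an illustration of how these four settings might be read on the Python side, here is a small sketch. `VoiceSettings` and `load_voice_settings` are hypothetical names, not part of the project. It assumes the .env values have already been loaded into the process environment, and the fallback values for WHISPER_MODEL and TTS_ENGINE are assumptions taken from the example above. PHASE_2_ENABLED falls back to false because Phase 2 is opt-in.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


def _as_bool(value: str) -> bool:
    """Interpret common truthy spellings; anything else counts as False."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class VoiceSettings:
    enabled: bool
    whisper_model: str
    tts_engine: str
    fraud_detection: bool


def load_voice_settings(env: Optional[Mapping[str, str]] = None) -> VoiceSettings:
    """Build settings from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    return VoiceSettings(
        enabled=_as_bool(env.get("PHASE_2_ENABLED", "false")),  # opt-in
        whisper_model=env.get("WHISPER_MODEL", "base"),         # assumed default
        tts_engine=env.get("TTS_ENGINE", "gtts"),               # assumed default
        fraud_detection=_as_bool(env.get("VOICE_FRAUD_DETECTION", "false")),
    )


if __name__ == "__main__":
    print(load_voice_settings())
```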
3. Start Server
# Start FastAPI server (same as Phase 1)
python -m uvicorn app.main:app --reload --host 0.0.0.0 --port 8000
4. Open Voice UI
Open in browser: http://localhost:8000/ui/voice.html
Testing the Voice Feature
Option 1: Record Live
- Click "Start Recording"
- Speak as a scammer (e.g., "Your account is blocked. Send OTP immediately.")
- Click "Stop Recording"
- Wait for AI to:
  - Transcribe your voice
  - Process through honeypot
  - Reply with voice
Option 2: Upload Audio File
- Click "Upload Audio File"
- Select a .wav, .mp3, or .m4a file
- AI processes and replies
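A check on the accepted upload formats could look like the sketch below. `is_supported_audio` is a hypothetical helper; whether the server validates by extension, by content type, or not at all is not stated in this guide.

```python
from pathlib import Path

# Formats listed in this guide for the upload option.
ALLOWED_EXTENSIONS = {".wav", ".mp3", ".m4a"}


def is_supported_audio(filename: str) -> bool:
    """Return True when the file name ends in one of the listed extensions."""
    return Path(filename).suffix.lower() in ALLOWED_EXTENSIONS


if __name__ == "__main__":
    for name in ("recording.wav", "CALL.MP3", "notes.txt"):
        print(name, is_supported_audio(name))
```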
API Endpoint
POST /api/v1/voice/engage
Request:
curl -X POST "http://localhost:8000/api/v1/voice/engage" \
-H "x-api-key: dev-key-12345" \
-F "audio_file=@recording.wav" \
-F "session_id=voice-test-001" \
-F "language=auto"
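The same request can be sent from Python using only the standard library. This is a sketch under assumptions: `build_multipart` and `engage` are hypothetical helper names, the form field names and header are copied from the curl command above, and the audio/wav content type is a guess at what the server expects for a .wav upload.

```python
import json
import os
import urllib.request
import uuid
from typing import Dict, Tuple


def build_multipart(
    fields: Dict[str, str],
    file_field: str,
    filename: str,
    file_bytes: bytes,
    file_type: str = "audio/wav",  # assumed content type
) -> Tuple[bytes, str]:
    """Encode text fields plus one file as a multipart/form-data body."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"'
            f"\r\n\r\n{value}\r\n".encode("utf-8")
        )
    parts.append(
        f'--{boundary}\r\nContent-Disposition: form-data; name="{file_field}"; '
        f'filename="{filename}"\r\nContent-Type: {file_type}\r\n\r\n'.encode("utf-8")
        + file_bytes
        + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def engage(audio_path: str, session_id: str, api_key: str,
           base_url: str = "http://localhost:8000", language: str = "auto") -> dict:
    """POST an audio file to the voice endpoint and return the decoded JSON reply."""
    with open(audio_path, "rb") as fh:
        audio = fh.read()
    body, content_type = build_multipart(
        {"session_id": session_id, "language": language},
        "audio_file", os.path.basename(audio_path), audio,
    )
    request = urllib.request.Request(
        f"{base_url}/api/v1/voice/engage",
        data=body,
        method="POST",
        headers={"x-api-key": api_key, "Content-Type": content_type},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With the server running, `engage("recording.wav", "voice-test-001", "dev-key-12345")` would mirror the curl call.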
Response:
{
  "session_id": "voice-test-001",
  "scam_detected": true,
  "scam_confidence": 0.92,
  "scam_type": "financial_fraud",
  "turn_count": 1,
  "ai_reply_text": "Oh no! What should I do? Can you help me?",
  "ai_reply_audio_url": "/api/v1/voice/audio/reply_xyz.mp3",
  "transcription": {
    "text": "Your account is blocked. Send OTP immediately.",
    "language": "en",
    "confidence": 0.95
  },
  "voice_fraud": null,
  "extracted_intelligence": {
    "upi_ids": [],
    "bank_accounts": [],
    "phone_numbers": [],
    "urls": []
  },
  "processing_time_ms": 3450
}
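A client will usually need only a few of these fields. The sketch below pulls them out of the response body; `VoiceEngageSummary` and `summarize_response` are hypothetical names, and the field names are taken from the sample response above rather than from a published schema.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class VoiceEngageSummary:
    session_id: str
    scam_detected: bool
    scam_confidence: float
    transcript: str
    reply_text: str
    reply_audio_url: Optional[str]


def summarize_response(body: str) -> VoiceEngageSummary:
    """Extract the fields a simple client needs from the JSON response text."""
    data = json.loads(body)
    return VoiceEngageSummary(
        session_id=data["session_id"],
        scam_detected=data["scam_detected"],
        scam_confidence=data["scam_confidence"],
        transcript=data["transcription"]["text"],
        reply_text=data["ai_reply_text"],
        reply_audio_url=data.get("ai_reply_audio_url"),  # tolerate a missing URL
    )


if __name__ == "__main__":
    sample = json.dumps({
        "session_id": "voice-test-001",
        "scam_detected": True,
        "scam_confidence": 0.92,
        "ai_reply_text": "Oh no! What should I do? Can you help me?",
        "ai_reply_audio_url": "/api/v1/voice/audio/reply_xyz.mp3",
        "transcription": {"text": "Your account is blocked. Send OTP immediately."},
    })
    print(summarize_response(sample))
```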
File Structure
app/
├── voice/                    # NEW: Phase 2 voice modules
│   ├── __init__.py
│   ├── asr.py                # Whisper ASR
│   ├── tts.py                # gTTS text-to-speech
│   └── fraud_detector.py     # Optional voice fraud detection
├── api/
│   ├── voice_endpoints.py    # NEW: Voice API endpoints
│   └── voice_schemas.py      # NEW: Voice API schemas
└── ... (Phase 1 unchanged)
ui/
├── voice.html                # NEW: Voice UI
├── voice.js                  # NEW: Voice UI logic
├── voice.css                 # NEW: Voice UI styles
└── ... (Phase 1 unchanged)
PHASE_2_VOICE_IMPLEMENTATION_PLAN.md  # Full implementation plan
requirements-phase2.txt               # Phase 2 dependencies
.env.phase2.example                   # Phase 2 config example
Impact on Phase 1
ZERO IMPACT:
- ✅ Phase 1 text honeypot unchanged
- ✅ All existing tests pass
- ✅ Existing API endpoints unchanged
- ✅ Existing UI unchanged
- ✅ Phase 2 is opt-in (disabled by default)
Performance
| Metric | Target | Notes |
|---|---|---|
| ASR Latency | <2s | Whisper base model |
| TTS Latency | <1s | gTTS |
| Total Loop | <5s | Voice in → Voice out |
| Transcription accuracy | >85% | Word-level, i.e. WER <15% |
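The transcription target is expressed through word error rate (WER): the number of word substitutions, deletions, and insertions needed to turn the ASR output into the reference transcript, divided by the number of reference words. A minimal sketch for measuring it on your own recordings follows. `word_error_rate` is a hypothetical helper, and it only lowercases and splits on whitespace, so punctuation differences count as errors.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance between the two texts, divided by reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One dropped word out of four reference words.
    print(word_error_rate("your account is blocked", "your account blocked"))
```

A word accuracy above 85% corresponds to this function returning less than 0.15.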
Troubleshooting
"Voice API unavailable"
- Check PHASE_2_ENABLED=true in .env
- Verify dependencies installed: pip list | grep whisper
- Check logs: tail -f logs/app.log
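On systems without grep, the dependency check can be done from Python instead. This sketch uses `importlib.util.find_spec` to test whether the `whisper` and `gtts` modules are importable in the current environment. `check_modules` is a hypothetical helper, and the module list is an assumption about what requirements-phase2.txt installs.

```python
import importlib.util
from typing import Dict, Iterable


def check_modules(names: Iterable[str] = ("whisper", "gtts")) -> Dict[str, bool]:
    """Map each top-level module name to whether it can be found for import."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for module, found in check_modules().items():
        print(f"{module}: {'installed' if found else 'MISSING'}")
```

Run it with the same interpreter that starts the server, since a package installed in a different virtual environment will show as missing.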
"Microphone access denied"
- Browser needs microphone permission
- Check browser settings → Privacy → Microphone
- Use HTTPS or localhost (required for getUserMedia)
"PyAudio installation failed"
# Windows
pip install pipwin
pipwin install pyaudio
# Linux
sudo apt-get install portaudio19-dev python3-pyaudio
pip install pyaudio
# Mac
brew install portaudio
pip install pyaudio
"Whisper model download slow"
- First run downloads model (~150MB for base)
- Models cached in ~/.cache/whisper/
- Use smaller model: WHISPER_MODEL=tiny
Advanced Features
Voice Fraud Detection (Optional)
Detect synthetic/deepfake voices:
# Enable in .env
VOICE_FRAUD_DETECTION=true
# Install additional dependency
pip install resemblyzer
Response includes:
"voice_fraud": {
  "is_synthetic": false,
  "confidence": 0.85,
  "risk_level": "low"
}
Custom TTS Voice
Future: Replace gTTS with IndicTTS for better Indic language support.
Streaming Audio
Future: Real-time audio streaming instead of record-then-send.
Testing Checklist
- Install Phase 2 dependencies
- Set PHASE_2_ENABLED=true
- Start server
- Open voice UI
- Record voice message
- Verify transcription
- Verify AI reply (text)
- Verify AI reply (audio)
- Check metadata (language, confidence)
- Verify Phase 1 tests still pass
Next Steps
- Review: Read PHASE_2_VOICE_IMPLEMENTATION_PLAN.md for full details
- Install: Run pip install -r requirements-phase2.txt
- Configure: Copy settings from .env.phase2.example to .env
- Test: Open ui/voice.html and try recording
- Deploy: Set PHASE_2_ENABLED=true in production
Support
- Full plan: PHASE_2_VOICE_IMPLEMENTATION_PLAN.md
- Issues: Check logs in logs/app.log
- Questions: Review implementation plan sections
Phase 2 Status: ✅ Planned, 🚧 Ready to Implement
Estimated Implementation Time: 17-21 hours
Priority: Optional (Phase 1 is complete and sufficient for competition)