Spaces:

colin730
/

SummarizerApp

Running

ming Claude commited on 13 days ago

Commit

45b6536

1 Parent(s): 75fe59b

Add V4 local server setup with MPS optimization for Android testing

- Optimize V4 model for Apple Silicon MPS GPU (4x faster than CPU)
- Fix MPS detection and BFloat16 incompatibility
- Add comprehensive local server management guide
- Add Android integration documentation with connection details
- Add startup script for easy server management

Performance improvements:
- CPU (before): 2+ minutes (timeout)
- MPS (after): 32 seconds for complete summary
- Inference speed: 2.7 tokens/second on M4 MacBook Pro

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

Files changed (4) hide show

ANDROID_V4_LOCAL_TESTING.md +406 -0
README_LOCAL_SETUP.md +605 -0
app/services/structured_summarizer.py +61 -29
start_v4_local.sh +138 -0

ANDROID_V4_LOCAL_TESTING.md ADDED Viewed

	@@ -0,0 +1,406 @@

+# Android V4 Local Testing Guide
+## Quick Start
+Your V4 API is running on your Mac and accessible to your Android app on the same WiFi network.
+### Connection Details
+- **Base URL**: `http://192.168.88.12:7860`
+- **V4 Endpoint**: `/api/v4/scrape-and-summarize/stream-ndjson` (recommended)
+- **Alternative Endpoint**: `/api/v4/scrape-and-summarize/stream`
+- **Model**: Qwen/Qwen2.5-3B-Instruct (high quality, ~6-7GB RAM)
+- **Network**: Both devices must be on the same WiFi network
+---
+## Android App Configuration
+### Update Your Base URL
+In your Android app's network configuration, change the base URL to:
+```kotlin
+// Development/Local Testing
+const val BASE_URL = "http://192.168.88.12:7860"
+// Production (HuggingFace Spaces)
+const val BASE_URL_PROD = "https://your-hf-space.hf.space"
+```
+### Network Security Config
+Add this to `res/xml/network_security_config.xml` to allow HTTP connections to your local server:
+```xml
+<?xml version="1.0" encoding="utf-8"?>
+<network-security-config>
+    <domain-config cleartextTrafficPermitted="true">
+        <domain includeSubdomains="true">192.168.88.12</domain>
+    </domain-config>
+</network-security-config>
+```
+Update your `AndroidManifest.xml`:
+```xml
+<application
+    android:networkSecurityConfig="@xml/network_security_config"
+    ...>
+```
+---
+## API Usage Examples
+### Endpoint 1: NDJSON Streaming (Recommended - 43% faster)
+**URL**: `http://192.168.88.12:7860/api/v4/scrape-and-summarize/stream-ndjson`
+**Request Body** (URL mode):
+```json
+{
+  "url": "https://example.com/article",
+  "style": "executive",
+  "max_tokens": 512
+}
+```
+**Request Body** (Text mode):
+```json
+{
+  "text": "Your article text here (minimum 50 characters)...",
+  "style": "executive",
+  "max_tokens": 512
+}
+```
+**Response Format** (NDJSON patches):
+```
+data: {"op":"replace","path":"/title","value":"Breaking News"}
+data: {"op":"replace","path":"/main_summary","value":"This is the summary..."}
+data: {"op":"add","path":"/key_points/0","value":"First key point"}
+data: {"op":"add","path":"/key_points/1","value":"Second key point"}
+data: {"op":"replace","path":"/category","value":"Technology"}
+data: {"op":"replace","path":"/sentiment","value":"neutral"}
+data: {"op":"replace","path":"/read_time_min","value":3}
+```
+**Final JSON Structure**:
+```json
+{
+  "title": "Breaking News",
+  "main_summary": "This is the summary...",
+  "key_points": [
+    "First key point",
+    "Second key point",
+    "Third key point"
+  ],
+  "category": "Technology",
+  "sentiment": "neutral",
+  "read_time_min": 3
+}
+```
+### Endpoint 2: Raw JSON Streaming
+**URL**: `http://192.168.88.12:7860/api/v4/scrape-and-summarize/stream`
+**Request/Response**: Same as above, but streams raw JSON tokens instead of NDJSON patches
+---
+## Summarization Styles
+Choose the style that best fits your use case:
+| Style | Description | Use Case |
+|-------|-------------|----------|
+| `executive` | Business-focused with key takeaways (default) | General articles, news |
+| `skimmer` | Quick facts and highlights | Fast reading, headlines |
+| `eli5` | "Explain Like I'm 5" - simple explanations | Complex topics, education |
+---
+## cURL Testing Commands
+### Test with URL (Web Scraping)
+```bash
+curl -X POST http://192.168.88.12:7860/api/v4/scrape-and-summarize/stream-ndjson \
+  -H "Content-Type: application/json" \
+  -d '{
+    "url": "https://www.bbc.com/news/technology",
+    "style": "executive",
+    "max_tokens": 512
+  }'
+```
+### Test with Direct Text
+```bash
+curl -X POST http://192.168.88.12:7860/api/v4/scrape-and-summarize/stream-ndjson \
+  -H "Content-Type: application/json" \
+  -d '{
+    "text": "Artificial intelligence is rapidly transforming the technology landscape. Companies are investing billions in AI research and development. Machine learning models are becoming more sophisticated and capable of handling complex tasks. From healthcare to finance, AI applications are revolutionizing industries and creating new opportunities for innovation.",
+    "style": "executive",
+    "max_tokens": 512
+  }'
+```
+### Test from Your Android Device
+```bash
+# If you have Termux or similar on Android:
+curl -X POST http://192.168.88.12:7860/api/v4/scrape-and-summarize/stream-ndjson \
+  -H "Content-Type: application/json" \
+  -d '{"text":"Test from Android","style":"executive"}'
+```
+---
+## Kotlin/Android Example
+### Using OkHttp + SSE
+```kotlin
+import okhttp3.*
+import okhttp3.sse.EventSource
+import okhttp3.sse.EventSourceListener
+import okhttp3.sse.EventSources
+class V4ApiClient {
+    private val client = OkHttpClient()
+    fun summarizeUrl(
+        url: String,
+        style: String = "executive",
+        maxTokens: Int = 512,
+        onPatch: (String) -> Unit,
+        onComplete: () -> Unit,
+        onError: (Throwable) -> Unit
+    ) {
+        val request = Request.Builder()
+            .url("http://192.168.88.12:7860/api/v4/scrape-and-summarize/stream-ndjson")
+            .post(
+                """
+                {
+                  "url": "$url",
+                  "style": "$style",
+                  "max_tokens": $maxTokens
+                }
+                """.trimIndent().toRequestBody("application/json".toMediaType())
+            )
+            .build()
+        val eventSourceListener = object : EventSourceListener() {
+            override fun onEvent(
+                eventSource: EventSource,
+                id: String?,
+                type: String?,
+                data: String
+            ) {
+                onPatch(data) // NDJSON patch
+            }
+            override fun onClosed(eventSource: EventSource) {
+                onComplete()
+            }
+            override fun onFailure(
+                eventSource: EventSource,
+                t: Throwable?,
+                response: Response?
+            ) {
+                onError(t ?: Exception("Unknown error"))
+            }
+        }
+        EventSources.createFactory(client)
+            .newEventSource(request, eventSourceListener)
+    }
+}
+// Usage:
+val apiClient = V4ApiClient()
+val summary = mutableMapOf<String, Any>()
+apiClient.summarizeUrl(
+    url = "https://example.com/article",
+    style = "executive",
+    onPatch = { patch ->
+        // Parse NDJSON patch and update summary object
+        val jsonPatch = JSONObject(patch)
+        val op = jsonPatch.getString("op")
+        val path = jsonPatch.getString("path")
+        val value = jsonPatch.get("value")
+        // Apply patch to summary map
+        applyPatch(summary, op, path, value)
+        // Update UI with partial results
+        updateUI(summary)
+    },
+    onComplete = {
+        Log.d("V4", "Summary complete: $summary")
+    },
+    onError = { error ->
+        Log.e("V4", "Error: ${error.message}")
+    }
+)
+```
+---
+## Performance Expectations
+### Qwen/Qwen2.5-3B-Instruct (Current Configuration)
+- **Memory**: ~6-7GB unified memory on Mac
+- **Inference Time**: 40-60 seconds per request
+- **Quality**: ⭐⭐⭐⭐ (high quality, coherent summaries)
+- **First Token**: ~1-2 seconds (fast UI feedback)
+- **Device**: CPU (MPS not detected in current run)
+### Optimization Tips
+1. **Use NDJSON endpoint** for 43% faster time-to-first-token
+2. **Keep max_tokens at 512** for complete summaries
+3. **Test with WiFi** (Bluetooth/USB tethering may be slower)
+4. **Monitor battery** on Android during long sessions
+---
+## Troubleshooting
+### Connection Refused
+**Problem**: `Failed to connect to /192.168.88.12:7860`
+**Solutions**:
+1. Check both devices are on same WiFi network
+2. Verify server is running: `lsof -i :7860`
+3. Check Mac's firewall settings (System Settings → Network → Firewall)
+4. Try pinging Mac from Android: `ping 192.168.88.12`
+### Empty or Incomplete Summaries
+**Problem**: Summary JSON is incomplete or empty
+**Solutions**:
+1. Increase `max_tokens` to 512 or higher
+2. Ensure input text is at least 50 characters
+3. Check server logs: `tail -f server.log`
+4. Try switching from URL mode to text mode
+### Slow Response
+**Problem**: Takes > 2 minutes to get results
+**Solutions**:
+1. V4 with 3B model is computationally intensive (40-60s normal)
+2. Consider switching to 1.5B model for faster responses (lower quality)
+3. Update `.env`: `V4_MODEL_ID=Qwen/Qwen2.5-1.5B-Instruct`
+4. Restart server after model change
+### SSRF Protection Blocking URLs
+**Problem**: "Invalid URL or SSRF protection triggered"
+**Solutions**:
+1. Don't use localhost/127.0.0.1 URLs
+2. Don't use private IP ranges (10.x, 192.168.x, 172.x)
+3. Use public URLs only
+4. For testing, use text mode instead of URL mode
+---
+## Server Management
+### Start Server
+```bash
+# Option 1: Using conda environment
+conda run -n summarizer python -m uvicorn app.main:app --host 0.0.0.0 --port 7860
+# Option 2: Using startup script (see below)
+./start_v4_local.sh
+```
+### Check Server Status
+```bash
+# Check if server is running
+lsof -i :7860
+# View real-time logs
+tail -f server.log
+# Check health endpoint
+curl http://localhost:7860/health
+```
+### Stop Server
+```bash
+# Find and kill the process
+pkill -f "uvicorn app.main:app"
+# Or kill by PID
+lsof -ti :7860 | xargs kill
+```
+---
+## API Documentation
+### Health Check
+```bash
+GET http://192.168.88.12:7860/health
+Response:
+{
+  "status": "ok",
+  "service": "summarizer",
+  "version": "4.0.0"
+}
+```
+### Available Endpoints
+- `GET /` - API documentation (Swagger UI)
+- `GET /health` - Health check
+- `POST /api/v1/*` - Ollama + Transformers (requires Ollama service)
+- `POST /api/v2/*` - HuggingFace streaming (distilbart)
+- `POST /api/v3/*` - Web scraping + V2 summarization
+- `POST /api/v4/*` - Structured JSON summarization (Qwen model)
+---
+## Security Notes
+1. **HTTP Only**: Local testing uses HTTP (not HTTPS)
+2. **No Authentication**: API is open on local network
+3. **Rate Limiting**: Not enabled by default for local testing
+4. **SSRF Protection**: Blocks localhost and private IPs in URL mode
+5. **Production**: Use HTTPS and authentication for production deployments
+---
+## Next Steps
+1. ✅ Configure your Android app's base URL to `http://192.168.88.12:7860`
+2. ✅ Add network security config for cleartext HTTP
+3. ✅ Test connection with cURL before Android testing
+4. ✅ Implement SSE parsing for NDJSON patches
+5. ✅ Add error handling for network failures
+6. ✅ Monitor performance and adjust `max_tokens` as needed
+---
+## Support
+- **Server Logs**: `/Users/ming/AndroidStudioProjects/SummerizerApp/server.log`
+- **Configuration**: `/Users/ming/AndroidStudioProjects/SummerizerApp/.env`
+- **Documentation**: See `V4_LOCAL_SETUP.md` and `V4_TESTING_LEARNINGS.md`

README_LOCAL_SETUP.md ADDED Viewed

	@@ -0,0 +1,605 @@

+# Local V4 Server Setup & Management Guide
+Complete guide for running and managing the V4 summarization server locally for Android app development and testing.
+---
+## Quick Start
+### Prerequisites
+- ✅ Conda environment `summarizer` activated
+- ✅ All dependencies installed (`requirements.txt`)
+- ✅ M4 MacBook Pro with MPS support
+- ✅ Both Mac and Android device on same WiFi network
+### Start Server (Fastest Method)
+```bash
+cd /Users/ming/AndroidStudioProjects/SummerizerApp
+./start_v4_local.sh
+```
+**Your Connection Details:**
+- **Mac IP**: `192.168.88.12`
+- **Base URL**: `http://192.168.88.12:7860`
+- **V4 Endpoint**: `/api/v4/scrape-and-summarize/stream-ndjson`
+---
+## Server Management Commands
+### Starting the Server
+#### Option 1: Using Startup Script (Recommended)
+```bash
+./start_v4_local.sh
+```
+**Features:**
+- Automatically detects and stops existing server
+- Shows your local IP address
+- Displays V4 configuration
+- Waits for model to load
+- Shows connection URL
+- Option to view real-time logs
+#### Option 2: Manual Start
+```bash
+# Foreground (blocks terminal)
+/opt/anaconda3/envs/summarizer/bin/python -m uvicorn app.main:app --host 0.0.0.0 --port 7860
+# Background (with logging to file)
+/opt/anaconda3/envs/summarizer/bin/python -m uvicorn app.main:app --host 0.0.0.0 --port 7860 > server.log 2>&1 &
+echo "Server PID: $!"
+```
+**Expected Startup Time**: 15-20 seconds
+- Model loading: ~10 seconds
+- V4 warmup: ~2-3 seconds
+- Other services: ~3-5 seconds
+---
+### Stopping the Server
+#### Option 1: Kill by Process Name (Recommended)
+```bash
+pkill -f "uvicorn app.main:app"
+```
+#### Option 2: Force Kill by Process Name
+```bash
+pkill -9 -f "uvicorn app.main:app" && echo "Server stopped"
+```
+#### Option 3: Kill by Port
+```bash
+# Find and kill process using port 7860
+lsof -ti :7860 | xargs kill
+# Force kill if needed
+lsof -ti :7860 | xargs kill -9
+```
+#### Option 4: Kill by PID
+```bash
+# If you know the PID (shown when server started)
+kill <PID>
+# Force kill
+kill -9 <PID>
+```
+---
+### Restarting the Server
+#### Quick Restart
+```bash
+pkill -f "uvicorn app.main:app" && sleep 2 && ./start_v4_local.sh
+```
+#### Manual Restart
+```bash
+# Stop
+pkill -f "uvicorn app.main:app"
+sleep 2
+# Start
+/opt/anaconda3/envs/summarizer/bin/python -m uvicorn app.main:app --host 0.0.0.0 --port 7860 > server.log 2>&1 &
+```
+---
+### Checking Server Status
+#### Check if Server is Running
+```bash
+# Check port 7860
+lsof -i :7860
+# Expected output if running:
+# COMMAND   PID  USER   FD   TYPE             DEVICE SIZE/OFF NODE NAME
+# Python  12345  ming    7u  IPv4 0x1234567890      0t0  TCP *:7860 (LISTEN)
+```
+#### Check Server Health
+```bash
+# Health endpoint
+curl http://localhost:7860/health
+# Expected response:
+# {"status":"ok","service":"summarizer","version":"4.0.0"}
+```
+#### Check Process Details
+```bash
+# Find Python process running uvicorn
+ps aux | grep "uvicorn app.main:app"
+```
+---
+## Viewing Logs
+### Real-Time Logs
+```bash
+# Follow logs as they happen
+tail -f server.log
+# Stop following: Ctrl+C
+```
+### Recent Logs
+```bash
+# Last 50 lines
+tail -50 server.log
+# Last 100 lines
+tail -100 server.log
+# Search for specific events
+tail -100 server.log | grep "V4"
+tail -100 server.log | grep "ERROR"
+```
+### Log File Location
+```
+/Users/ming/AndroidStudioProjects/SummerizerApp/server.log
+```
+---
+## Configuration Reference
+### Current .env Settings
+```bash
+# V4 Structured JSON API
+ENABLE_V4_STRUCTURED=true       # Enable V4 API
+ENABLE_V4_WARMUP=true           # Load model at startup (faster first request)
+# V4 Model Configuration
+V4_MODEL_ID=Qwen/Qwen2.5-3B-Instruct   # High-quality 3B model
+V4_MAX_TOKENS=512                       # Max tokens to generate
+V4_TEMPERATURE=0.2                      # Low temp for consistent output
+# V4 Performance (M4 MacBook Pro)
+V4_USE_FP16_FOR_SPEED=true      # Enable FP16 for MPS GPU (2-3x faster)
+V4_ENABLE_QUANTIZATION=false    # Quantization not needed with FP16
+# Server Configuration
+SERVER_HOST=0.0.0.0             # Listen on all interfaces
+SERVER_PORT=7860                # Standard port (required for HF Spaces)
+LOG_LEVEL=INFO                  # Logging verbosity
+# V3 Web Scraping (also enabled)
+ENABLE_V3_SCRAPING=true         # Enable URL scraping
+SCRAPING_TIMEOUT=10             # HTTP timeout (seconds)
+SCRAPING_CACHE_ENABLED=true     # Cache scraped content
+SCRAPING_CACHE_TTL=3600         # Cache for 1 hour
+```
+### Configuration Presets
+**Fast Inference (Current)**
+```bash
+V4_MODEL_ID=Qwen/Qwen2.5-3B-Instruct
+V4_USE_FP16_FOR_SPEED=true
+V4_MAX_TOKENS=384
+```
+**High Quality (Slower)**
+```bash
+V4_MODEL_ID=Qwen/Qwen2.5-3B-Instruct
+V4_USE_FP16_FOR_SPEED=true
+V4_MAX_TOKENS=512
+```
+**Fastest (Lower Quality)**
+```bash
+V4_MODEL_ID=Qwen/Qwen2.5-1.5B-Instruct
+V4_USE_FP16_FOR_SPEED=true
+V4_MAX_TOKENS=256
+```
+---
+## Testing Commands
+### Health Check
+```bash
+curl http://localhost:7860/health
+```
+**Expected Response:**
+```json
+{"status":"ok","service":"summarizer","version":"4.0.0"}
+```
+---
+### V4 Direct Text Test
+```bash
+curl -X POST http://localhost:7860/api/v4/scrape-and-summarize/stream-ndjson \
+  -H "Content-Type: application/json" \
+  -d '{
+    "text": "Artificial intelligence continues to reshape industries worldwide. Tech giants are investing billions in AI development.",
+    "style": "executive",
+    "max_tokens": 256
+  }'
+```
+**Expected Time**: ~30-40 seconds
+**Expected Output**: NDJSON streaming events with structured summary
+---
+### V4 URL Scraping Test
+```bash
+curl -X POST http://localhost:7860/api/v4/scrape-and-summarize/stream-ndjson \
+  -H "Content-Type: application/json" \
+  -d '{
+    "url": "https://en.wikipedia.org/wiki/Machine_learning",
+    "style": "executive",
+    "max_tokens": 512
+  }'
+```
+**Expected Time**: ~35-65 seconds (scrape + summarize)
+**Expected Output**: Metadata event + NDJSON streaming summary
+---
+### Test from Android Device (Same WiFi)
+```bash
+# Run this from your Android device terminal (Termux, etc.)
+curl -X POST http://192.168.88.12:7860/api/v4/scrape-and-summarize/stream-ndjson \
+  -H "Content-Type: application/json" \
+  -d '{"text":"Test from Android","style":"executive","max_tokens":256}'
+```
+---
+## Troubleshooting
+### Problem: Port Already in Use
+**Symptom**: `error while attempting to bind on address ('0.0.0.0', 7860): address already in use`
+**Solution:**
+```bash
+# Find what's using port 7860
+lsof -i :7860
+# Kill it
+lsof -ti :7860 | xargs kill -9
+# Start server again
+./start_v4_local.sh
+```
+---
+### Problem: Server Won't Start
+**Symptom**: Server exits immediately or crashes on startup
+**Check Logs:**
+```bash
+tail -50 server.log
+```
+**Common Causes:**
+1. **Missing loguru**: `pip install "loguru>=0.7.0"`
+2. **Wrong conda environment**: `conda activate summarizer`
+3. **Missing dependencies**: `pip install -r requirements.txt`
+4. **Port conflict**: See "Port Already in Use" above
+---
+### Problem: Model Loading Errors
+**Symptom**: `Failed to initialize V4 model` in logs
+**Solutions:**
+1. **Clear model cache:**
+```bash
+rm -rf /tmp/huggingface
+```
+2. **Check disk space:**
+```bash
+df -h /tmp
+# Need at least 10GB free
+```
+3. **Verify internet connection** (for first-time model download)
+---
+### Problem: Slow Performance
+**Expected Performance:**
+- Startup: 15-20 seconds
+- Inference: 30-40 seconds (short text)
+- Inference: 60-90 seconds (long text/URL)
+**If slower than expected:**
+1. **Check if MPS is being used:**
+```bash
+tail -50 server.log | grep "MPS\|Model device"
+# Should see: "Model device: mps:0"
+```
+2. **Check system load:**
+```bash
+top -l 1 | grep "CPU usage"
+# High CPU usage by other apps?
+```
+3. **Verify FP16 is enabled:**
+```bash
+grep "V4_USE_FP16_FOR_SPEED" .env
+# Should be: V4_USE_FP16_FOR_SPEED=true
+```
+---
+### Problem: Connection Refused from Android
+**Symptom**: Android app can't connect to `http://192.168.88.12:7860`
+**Checklist:**
+1. **Both devices on same WiFi?**
+```bash
+# On Mac, check network
+ifconfig | grep "inet " | grep -v "127.0.0.1"
+```
+2. **Mac firewall blocking port 7860?**
+   - Go to System Settings → Network → Firewall
+   - Allow incoming connections or disable firewall temporarily
+3. **Server actually running?**
+```bash
+lsof -i :7860
+curl http://localhost:7860/health
+```
+4. **Test from Mac first:**
+```bash
+curl http://192.168.88.12:7860/health
+# Should work from Mac's own IP
+```
+5. **Android network security config?**
+   - See `ANDROID_V4_LOCAL_TESTING.md` for cleartext HTTP setup
+---
+### Problem: Empty or Incomplete Summaries
+**Symptom**: Summary JSON missing fields or truncated
+**Solutions:**
+1. **Increase max_tokens:**
+```bash
+# In request, use:
+"max_tokens": 512  # instead of 256
+```
+2. **Check input text length:**
+```bash
+# Minimum 50 characters required
+# Maximum 50,000 characters for URL scraping
+```
+3. **Try different style:**
+```bash
+# Styles: "executive", "skimmer", "eli5"
+"style": "executive"  # Most reliable
+```
+---
+## Performance Guide
+### Expected Metrics
+| Metric | Value |
+|--------|-------|
+| **Startup Time** | 15-20 seconds |
+| **Model Load** | ~10 seconds |
+| **V4 Warmup** | ~2-3 seconds |
+| **Memory Usage** | ~6-7GB unified memory |
+| **Tokens/Second** | 2.7 tok/s (3B model on MPS) |
+| **Short Text** (500 chars) | ~30-40 seconds |
+| **Long Text** (5000 chars) | ~60-90 seconds |
+| **URL Scraping** | +2-5 seconds (first time) |
+| **URL Scraping** (cached) | +<10ms |
+### Hardware Requirements
+**Minimum:**
+- Apple Silicon Mac (M1/M2/M3/M4)
+- 8GB unified memory
+- 10GB free disk space
+**Recommended (Current Setup):**
+- M4 MacBook Pro
+- 24GB unified memory
+- MPS GPU support
+- Fast internet (for model downloads)
+### Network Requirements
+**For Scraping:**
+- Active internet connection
+- Firewall allows outbound HTTPS (443)
+**For Android Connection:**
+- Both devices on same WiFi network
+- Mac firewall allows incoming on port 7860
+---
+## API Endpoints Reference
+### Available Endpoints
+| Endpoint | Method | Purpose |
+|----------|--------|---------|
+| `/health` | GET | Health check |
+| `/docs` | GET | Interactive API documentation |
+| `/api/v1/*` | POST | Ollama + Transformers (requires Ollama) |
+| `/api/v2/*` | POST | HuggingFace streaming (distilbart) |
+| `/api/v3/*` | POST | Web scraping + V2 summarization |
+| `/api/v4/scrape-and-summarize/stream-ndjson` | POST | **Structured JSON summarization (RECOMMENDED)** |
+| `/api/v4/scrape-and-summarize/stream` | POST | Raw JSON streaming |
+### V4 Request Format
+```json
+{
+  "url": "https://example.com/article",    // URL mode
+  // OR
+  "text": "Your article text here...",     // Text mode
+  "style": "executive",                    // "executive", "skimmer", "eli5"
+  "max_tokens": 512                        // 128-2048 range
+}
+```
+### V4 Response Format (NDJSON)
+```
+data: {"type":"metadata","data":{...}}
+data: {"delta":{"op":"set","field":"title","value":"..."},...}
+data: {"delta":{"op":"set","field":"main_summary","value":"..."},...}
+data: {"delta":{"op":"append","field":"key_points","value":"..."},...}
+data: {"delta":{"op":"done"},"done":true,"latency_ms":38891.94}
+```
+---
+## Android Integration
+For complete Android integration guide, see:
+📱 **[ANDROID_V4_LOCAL_TESTING.md](./ANDROID_V4_LOCAL_TESTING.md)**
+**Quick Reference:**
+- Base URL: `http://192.168.88.12:7860`
+- Endpoint: `/api/v4/scrape-and-summarize/stream-ndjson`
+- Network security: Allow cleartext HTTP for `192.168.88.12`
+- Expected latency: 35-65 seconds per request
+---
+## Development Workflow
+### Typical Session
+1. **Start server**
+```bash
+./start_v4_local.sh
+```
+2. **Test locally**
+```bash
+curl http://localhost:7860/health
+```
+3. **Test from Android**
+   - Open your Android app
+   - Configure base URL: `http://192.168.88.12:7860`
+   - Test summarization
+4. **Monitor logs**
+```bash
+tail -f server.log
+```
+5. **Stop server when done**
+```bash
+pkill -f "uvicorn app.main:app"
+```
+---
+## Quick Command Reference
+```bash
+# START
+./start_v4_local.sh
+# STOP
+pkill -f "uvicorn app.main:app"
+# RESTART
+pkill -f "uvicorn app.main:app" && sleep 2 && ./start_v4_local.sh
+# STATUS
+lsof -i :7860
+curl http://localhost:7860/health
+# LOGS
+tail -f server.log
+tail -50 server.log | grep "ERROR"
+# TEST
+curl -X POST http://localhost:7860/api/v4/scrape-and-summarize/stream-ndjson \
+  -H "Content-Type: application/json" \
+  -d '{"text":"Test","style":"executive","max_tokens":256}'
+```
+---
+## Support & Documentation
+- **Android Integration**: [ANDROID_V4_LOCAL_TESTING.md](./ANDROID_V4_LOCAL_TESTING.md)
+- **V4 Testing Learnings**: [V4_TESTING_LEARNINGS.md](./V4_TESTING_LEARNINGS.md)
+- **V4 Local Setup**: [V4_LOCAL_SETUP.md](./V4_LOCAL_SETUP.md)
+- **Server Logs**: `server.log`
+- **Configuration**: `.env`
+---
+## Notes
+- Server must be running for Android app to connect
+- Both devices must be on same WiFi network
+- Mac IP address may change if you reconnect to WiFi
+- Model is cached in `/tmp/huggingface` (survives restarts)
+- Logs are appended to `server.log` (not rotated automatically)
+- V4 warmup happens on every server start (~2-3 seconds)
+---
+**Last Updated**: 2025-12-12
+**Server Version**: 4.0.0
+**Model**: Qwen/Qwen2.5-3B-Instruct
+**Device**: M4 MacBook Pro with MPS

app/services/structured_summarizer.py CHANGED Viewed

@@ -90,16 +90,21 @@ class StructuredSummarizer:
             # Decide device / quantization strategy
             use_cuda = torch.cuda.is_available()
             quantization_desc = "None"
             if use_cuda:
-                logger.info("CUDA is available. Using GPU for V4 model.")
             else:
-                logger.info("CUDA is NOT available. V4 model will run on CPU.")
             # ------------------------------------------------------------------
-            # Preferred path: 4-bit NF4 on GPU via bitsandbytes (memory efficient)
             # OR FP16 for speed (2-3x faster, uses more memory)
             # ------------------------------------------------------------------
             use_fp16_for_speed = getattr(settings, "v4_use_fp16_for_speed", False)
@@ -128,42 +133,69 @@ class StructuredSummarizer:
                 )
                 quantization_desc = "4-bit NF4 (bitsandbytes, GPU)"
-            elif use_cuda and use_fp16_for_speed:
-                # Use FP16 for 2-3x faster inference (uses ~2-3GB GPU memory)
                 logger.info(
-                    "Loading V4 model in FP16 for maximum speed (2-3x faster than 4-bit)..."
-                )
-                self.model = AutoModelForCausalLM.from_pretrained(
-                    settings.v4_model_id,
-                    dtype=torch.float16,
-                    device_map="auto",
-                    cache_dir=settings.hf_cache_dir,
-                    trust_remote_code=True,
                 )
                 quantization_desc = "FP16 (GPU, fast)"
             else:
                 # ------------------------------------------------------------------
                 # Fallback path:
-                #   - GPU without bitsandbytes  -> FP16
-                #   - CPU                        -> FP32 + optional dynamic INT8
                 # ------------------------------------------------------------------
-                base_dtype = torch.float16 if use_cuda else torch.float32
-                logger.info(
-                    "Loading V4 model without 4-bit bitsandbytes. "
-                    f"Base dtype: {base_dtype}"
-                )
-                self.model = AutoModelForCausalLM.from_pretrained(
-                    settings.v4_model_id,
-                    dtype=base_dtype,
-                    device_map="auto" if use_cuda else None,
-                    cache_dir=settings.hf_cache_dir,
-                    trust_remote_code=True,
-                )
-                # Optional dynamic INT8 quantization on CPU
-                if getattr(settings, "v4_enable_quantization", True) and not use_cuda:
                     try:
                         logger.info(
                             "Applying dynamic INT8 quantization to V4 model on CPU..."

             # Decide device / quantization strategy
             use_cuda = torch.cuda.is_available()
+            use_mps = torch.backends.mps.is_available() if hasattr(torch.backends, 'mps') else False
+            use_gpu = use_cuda or use_mps
             quantization_desc = "None"
             if use_cuda:
+                logger.info("CUDA is available. Using NVIDIA GPU for V4 model.")
+            elif use_mps:
+                logger.info("MPS (Metal Performance Shaders) is available. Using Apple Silicon GPU for V4 model.")
             else:
+                logger.info("No GPU available. V4 model will run on CPU.")
             # ------------------------------------------------------------------
+            # Preferred path: 4-bit NF4 on CUDA GPU via bitsandbytes (memory efficient)
             # OR FP16 for speed (2-3x faster, uses more memory)
+            # Note: bitsandbytes only works on CUDA, not MPS
             # ------------------------------------------------------------------
             use_fp16_for_speed = getattr(settings, "v4_use_fp16_for_speed", False)
                 )
                 quantization_desc = "4-bit NF4 (bitsandbytes, GPU)"
+            elif use_gpu and use_fp16_for_speed:
+                # Use FP16 for 2-3x faster inference
+                # Note: MPS doesn't support BFloat16, so we avoid device_map="auto" for MPS
                 logger.info(
+                    "Loading V4 model in FP16 for maximum speed (2-3x faster than FP32)..."
                 )
+                if use_mps:
+                    # MPS: Load without device_map, then manually move to MPS
+                    self.model = AutoModelForCausalLM.from_pretrained(
+                        settings.v4_model_id,
+                        torch_dtype=torch.float16,
+                        cache_dir=settings.hf_cache_dir,
+                        trust_remote_code=True,
+                    )
+                    self.model = self.model.to("mps")
+                else:
+                    # CUDA: Use device_map="auto" for multi-GPU support
+                    self.model = AutoModelForCausalLM.from_pretrained(
+                        settings.v4_model_id,
+                        torch_dtype=torch.float16,
+                        device_map="auto",
+                        cache_dir=settings.hf_cache_dir,
+                        trust_remote_code=True,
+                    )
                 quantization_desc = "FP16 (GPU, fast)"
             else:
                 # ------------------------------------------------------------------
                 # Fallback path:
+                #   - GPU (CUDA/MPS) without quantization/FP16  -> FP16
+                #   - CPU                                       -> FP32 + optional dynamic INT8
                 # ------------------------------------------------------------------
+                base_dtype = torch.float16 if use_gpu else torch.float32
+                if use_mps:
+                    # MPS fallback: Load without device_map, manually move to MPS
+                    logger.info(
+                        f"Loading V4 model for MPS with dtype={base_dtype}"
+                    )
+                    self.model = AutoModelForCausalLM.from_pretrained(
+                        settings.v4_model_id,
+                        torch_dtype=base_dtype,
+                        cache_dir=settings.hf_cache_dir,
+                        trust_remote_code=True,
+                    )
+                    self.model = self.model.to("mps")
+                else:
+                    # CUDA or CPU
+                    device_strategy = "auto" if use_cuda else None
+                    logger.info(
+                        f"Loading V4 model with device_map='{device_strategy}', dtype={base_dtype}"
+                    )
+                    self.model = AutoModelForCausalLM.from_pretrained(
+                        settings.v4_model_id,
+                        torch_dtype=base_dtype,
+                        device_map=device_strategy,
+                        cache_dir=settings.hf_cache_dir,
+                        trust_remote_code=True,
+                    )
+                # Optional dynamic INT8 quantization on CPU only (not supported on GPU)
+                if getattr(settings, "v4_enable_quantization", True) and not use_gpu:
                     try:
                         logger.info(
                             "Applying dynamic INT8 quantization to V4 model on CPU..."

start_v4_local.sh ADDED Viewed

	@@ -0,0 +1,138 @@

+#!/bin/bash
+# V4 Local Testing Server Startup Script
+# This script starts the FastAPI server with V4 enabled for Android app testing
+set -e
+# Colors for output
+GREEN='\033[0;32m'
+BLUE='\033[0;34m'
+YELLOW='\033[1;33m'
+RED='\033[0;31m'
+NC='\033[0m' # No Color
+echo -e "${BLUE}╔══════════════════════════════════════════════════════════╗${NC}"
+echo -e "${BLUE}║  V4 Local Testing Server                                 ║${NC}"
+echo -e "${BLUE}╚══════════════════════════════════════════════════════════╝${NC}"
+echo ""
+# Check if server is already running
+if lsof -Pi :7860 -sTCP:LISTEN -t >/dev/null 2>&1; then
+    echo -e "${YELLOW}⚠️  Server already running on port 7860${NC}"
+    echo -e "${YELLOW}   Stopping existing server...${NC}"
+    pkill -f "uvicorn app.main:app" || true
+    sleep 2
+fi
+# Get local IP address
+LOCAL_IP=$(ifconfig | grep "inet " | grep -v "127.0.0.1" | awk '{print $2}' | head -1)
+if [ -z "$LOCAL_IP" ]; then
+    LOCAL_IP="Unable to detect"
+    echo -e "${RED}⚠️  Could not detect local IP address${NC}"
+else
+    echo -e "${GREEN}✅ Local IP Address: ${LOCAL_IP}${NC}"
+fi
+# Check .env configuration
+if [ -f ".env" ]; then
+    echo -e "${GREEN}✅ Found .env configuration${NC}"
+    # Show V4 config
+    echo ""
+    echo -e "${BLUE}V4 Configuration:${NC}"
+    grep "^ENABLE_V4" .env || echo "  No V4 settings found"
+    grep "^V4_MODEL_ID" .env || echo "  No model configured"
+    grep "^V4_MAX_TOKENS" .env || echo "  Using default tokens"
+else
+    echo -e "${RED}❌ No .env file found!${NC}"
+    echo -e "${YELLOW}   Please create .env with V4 configuration${NC}"
+    exit 1
+fi
+echo ""
+echo -e "${BLUE}Starting server...${NC}"
+echo -e "${BLUE}This may take 30-90 seconds for V4 model warmup${NC}"
+echo ""
+# Start server in background and log to file
+/opt/anaconda3/envs/summarizer/bin/python -m uvicorn app.main:app \
+    --host 0.0.0.0 \
+    --port 7860 \
+    > server.log 2>&1 &
+SERVER_PID=$!
+echo -e "${GREEN}✅ Server started (PID: ${SERVER_PID})${NC}"
+# Wait for server to be ready
+echo -e "${YELLOW}⏳ Waiting for server to initialize...${NC}"
+TIMEOUT=120
+ELAPSED=0
+while [ $ELAPSED -lt $TIMEOUT ]; do
+    if lsof -Pi :7860 -sTCP:LISTEN -t >/dev/null 2>&1; then
+        echo -e "${GREEN}✅ Server is listening on port 7860${NC}"
+        break
+    fi
+    sleep 2
+    ELAPSED=$((ELAPSED + 2))
+    # Show progress every 10 seconds
+    if [ $((ELAPSED % 10)) -eq 0 ]; then
+        echo -e "${YELLOW}   Still loading... (${ELAPSED}s / ${TIMEOUT}s)${NC}"
+    fi
+done
+if [ $ELAPSED -ge $TIMEOUT ]; then
+    echo -e "${RED}❌ Server failed to start within ${TIMEOUT} seconds${NC}"
+    echo -e "${YELLOW}   Check server.log for errors${NC}"
+    exit 1
+fi
+# Wait a bit more for V4 warmup
+echo -e "${YELLOW}⏳ Waiting for V4 model warmup (may take 60-90s)...${NC}"
+sleep 15
+# Test health endpoint
+echo ""
+echo -e "${BLUE}Testing server health...${NC}"
+if curl -s http://localhost:7860/health > /dev/null 2>&1; then
+    echo -e "${GREEN}✅ Server is healthy and responding${NC}"
+else
+    echo -e "${YELLOW}⚠️  Health check failed, but server may still be warming up${NC}"
+fi
+echo ""
+echo -e "${GREEN}╔══════════════════════════════════════════════════════════╗${NC}"
+echo -e "${GREEN}║  Server Started Successfully!                            ║${NC}"
+echo -e "${GREEN}╚══════════════════════════════════════════════════════════╝${NC}"
+echo ""
+echo -e "${BLUE}Local Access:${NC}"
+echo -e "  http://localhost:7860"
+echo ""
+echo -e "${BLUE}Android App URL:${NC}"
+echo -e "  http://${LOCAL_IP}:7860"
+echo ""
+echo -e "${BLUE}V4 NDJSON Endpoint:${NC}"
+echo -e "  POST http://${LOCAL_IP}:7860/api/v4/scrape-and-summarize/stream-ndjson"
+echo ""
+echo -e "${BLUE}API Documentation:${NC}"
+echo -e "  http://localhost:7860/docs"
+echo ""
+echo -e "${BLUE}Server Logs:${NC}"
+echo -e "  tail -f server.log"
+echo ""
+echo -e "${BLUE}Stop Server:${NC}"
+echo -e "  pkill -f 'uvicorn app.main:app'"
+echo -e "  or: kill ${SERVER_PID}"
+echo ""
+echo -e "${YELLOW}📱 Update your Android app base URL to: http://${LOCAL_IP}:7860${NC}"
+echo -e "${YELLOW}📖 See ANDROID_V4_LOCAL_TESTING.md for complete setup guide${NC}"
+echo ""
+# Optionally tail logs
+read -p "Show real-time logs? (y/N): " -n 1 -r
+echo
+if [[ $REPLY =~ ^[Yy]$ ]]; then
+    echo -e "${BLUE}Showing server logs (Ctrl+C to stop)...${NC}"
+    tail -f server.log
+fi