Spaces:

Luigi
/

tiny-scribe

Running

App Files Files Community

tiny-scribe / docs /advanced-mode-implementation-plan.md

Luigi

fix: improve extraction success rate with Qwen3 models

061dfb7 about 1 month ago

preview code

raw

history blame contribute delete

48.2 kB

Advanced 2-Stage Meeting Summarization - Complete Implementation Plan

Project: Tiny Scribe - Advanced Mode
Date: 2026-02-04
Status: Ready for Implementation
Estimated Effort: 13-19 hours

Executive Summary
Design Decisions
Model Registries
UI Implementation
Model Management Infrastructure
Extraction Pipeline
Implementation Checklist
Testing Strategy
Implementation Priority
Risk Assessment

Executive Summary

This plan details the implementation of a 3-model Advanced Summarization Pipeline for Tiny Scribe, featuring:

✅ 3 independent model registries (Extraction, Embedding, Synthesis)
✅ User-configurable extraction context (2K-8K tokens, default 4K)
✅ Reasoning/thinking model support with independent toggles per stage
✅ Sequential model loading for memory efficiency
✅ Bilingual support (English + Traditional Chinese zh-TW)
✅ Fail-fast error handling with graceful UI feedback
✅ Complete independence from Standard mode

Architecture

Stage 1: EXTRACTION    → Parse transcript → Create windows → Extract JSON items
Stage 2: DEDUPLICATION → Compute embeddings → Remove semantic duplicates
Stage 3: SYNTHESIS     → Generate executive summary from deduplicated items

Key Metrics

Metric	Value
New Code	~1,800 lines
Modified Code	~60 lines
Total Models	33 unique (13 + 4 + 16)
Default Models	`qwen3_1.7b_q4`, `granite-107m`, `qwen3_1.7b_q4`
Memory Strategy	Sequential load/unload (safe for HF Spaces Free Tier)

Design Decisions

Q1: Extraction Model List Composition (REVISION)

Decision: Option A - 11 models (≤1.7B), excluding LFM2-Extract models

Rationale: 11 models excluding LFM2-Extract specialized models (removed after testing showed 85.7% failure rate due to hallucination and schema non-compliance. Replaced with Qwen3 models that support reasoning and better handle Chinese content.)

Q1a: Synthesis Model Selection (NEW)

Decision: Restrict to models ≤4GB (max 4B parameters)

Rationale: HF Spaces Free Tier only has 16GB RAM; 7B+ models will OOM. Remove ernie_21b, glm_4_7_flash_reap_30b, qwen3_30b_thinking_q1, qwen3_30b_instruct_q1

Q2: Independence from Standard Mode

Decision: Option B - Both Extraction AND Synthesis fully independent from AVAILABLE_MODELS

Rationale: Full independence prevents parameter cross-contamination; synthesis models have their own optimized temperatures (0.7-0.9) separate from Standard mode

Q3: Extraction n_ctx UI Control

Decision: Option A - Slider (2K-8K, step 1024, default 4K)

Rationale: Maximum flexibility for users to balance precision vs speed

Q4: Default Models

Decision:

Extraction: qwen3_1.7b_q4 (supports reasoning, better Chinese understanding)
Embedding: granite-107m (fastest, good enough)
Synthesis: qwen3_1.7b_q4 (larger than extraction, better quality)

Rationale: Balanced defaults optimized for quality and speed. Qwen3 1.7B chosen over LFM2-Extract based on empirical testing showing superior extraction success rate and schema compliance.

Q5: Model Key Naming

Decision: Keep same keys (no prefix like adv_synth_)

Rationale: Simpler, less duplication, clear role-based config resolution

Q6: Model Overlap Between Stages

Decision: Allow overlap with independent settings per role

Rationale: Same model can be extraction + synthesis with different parameters

Q7: Reasoning Checkbox UI Flow

Decision: Option B - Separate checkboxes for extraction and synthesis

Rationale: Independent control per stage, clearer user intent

Q8: Thinking Block Display

Decision: Option A - Reuse "MODEL THINKING PROCESS" field

Rationale: Consistent with Standard mode, no UI layout changes needed

Q9: Window Token Counting with User n_ctx

Decision: Option A - Strict adherence to user's slider value

Rationale: Respect user's explicit choice, they may want larger/smaller windows

Q10: Model Loading Error Handling

Decision: Option C - Graceful failure with UI error message

Rationale: Most user-friendly, allows retry with different model

Model Registries

1. EXTRACTION_MODELS (13 models - FINAL)

Location: /home/luigi/tiny-scribe/app.py

Features:

✅ Independent from AVAILABLE_MODELS
✅ User-adjustable n_ctx (2K-8K, default 4K)
✅ Extraction-optimized settings (temp 0.1-0.3)
✅ 2 hybrid models with reasoning toggle
✅ All models verified on HuggingFace

Complete Registry (LFM2-Extract models removed after testing):

EXTRACTION_MODELS = {
    "falcon_h1_100m": {
        "name": "Falcon-H1 100M",
        "repo_id": "mradermacher/Falcon-H1-Tiny-Multilingual-100M-Instruct-GGUF",
        "filename": "*Q8_0.gguf",
        "max_context": 32768,
        "default_n_ctx": 4096,
        "params_size": "100M",
        "supports_reasoning": False,
        "supports_toggle": False,
        "inference_settings": {
            "temperature": 0.2,
            "top_p": 0.9,
            "top_k": 30,
            "repeat_penalty": 1.0,
        },
    },
    "gemma3_270m": {
        "name": "Gemma-3 270M",
        "repo_id": "unsloth/gemma-3-270m-it-qat-GGUF",
        "filename": "*Q8_0.gguf",
        "max_context": 32768,
        "default_n_ctx": 4096,
        "params_size": "270M",
        "supports_reasoning": False,
        "supports_toggle": False,
        "inference_settings": {
            "temperature": 0.3,
            "top_p": 0.9,
            "top_k": 40,
            "repeat_penalty": 1.0,
        },
    },
    "ernie_300m": {
        "name": "ERNIE-4.5 0.3B (131K Context)",
        "repo_id": "unsloth/ERNIE-4.5-0.3B-PT-GGUF",
        "filename": "*Q8_0.gguf",
        "max_context": 131072,
        "default_n_ctx": 4096,
        "params_size": "300M",
        "supports_reasoning": False,
        "supports_toggle": False,
        "inference_settings": {
            "temperature": 0.2,
            "top_p": 0.9,
            "top_k": 30,
            "repeat_penalty": 1.0,
        },
    },
    "granite_350m": {
        "name": "Granite-4.0 350M",
        "repo_id": "unsloth/granite-4.0-h-350m-GGUF",
        "filename": "*Q8_0.gguf",
        "max_context": 32768,
        "default_n_ctx": 4096,
        "params_size": "350M",
        "supports_reasoning": False,
        "supports_toggle": False,
        "inference_settings": {
            "temperature": 0.1,
            "top_p": 0.95,
            "top_k": 30,
            "repeat_penalty": 1.0,
        },
    },
    "lfm2_350m": {
        "name": "LFM2 350M",
        "repo_id": "LiquidAI/LFM2-350M-GGUF",
        "filename": "*Q8_0.gguf",
        "max_context": 32768,
        "default_n_ctx": 4096,
        "params_size": "350M",
        "supports_reasoning": False,
        "supports_toggle": False,
        "inference_settings": {
            "temperature": 0.2,
            "top_p": 0.9,
            "top_k": 40,
            "repeat_penalty": 1.0,
        },
    },
    "bitcpm4_500m": {
        "name": "BitCPM4 0.5B (128K Context)",
        "repo_id": "openbmb/BitCPM4-0.5B-GGUF",
        "filename": "*q4_0.gguf",
        "max_context": 131072,
        "default_n_ctx": 4096,
        "params_size": "500M",
        "supports_reasoning": False,
        "supports_toggle": False,
        "inference_settings": {
            "temperature": 0.2,
            "top_p": 0.9,
            "top_k": 30,
            "repeat_penalty": 1.0,
        },
    },
    "hunyuan_500m": {
        "name": "Hunyuan 0.5B (256K Context)",
        "repo_id": "mradermacher/Hunyuan-0.5B-Instruct-GGUF",
        "filename": "*Q8_0.gguf",
        "max_context": 262144,
        "default_n_ctx": 4096,
        "params_size": "500M",
        "supports_reasoning": False,
        "supports_toggle": False,
        "inference_settings": {
            "temperature": 0.2,
            "top_p": 0.9,
            "top_k": 30,
            "repeat_penalty": 1.0,
        },
    },
    "qwen3_600m_q4": {
        "name": "Qwen3 0.6B Q4 (32K Context)",
        "repo_id": "unsloth/Qwen3-0.6B-GGUF",
        "filename": "*Q4_0.gguf",
        "max_context": 32768,
        "default_n_ctx": 4096,
        "params_size": "600M",
        "supports_reasoning": True,       # ← HYBRID MODEL
        "supports_toggle": True,          # ← User can toggle reasoning
        "inference_settings": {
            "temperature": 0.3,
            "top_p": 0.9,
            "top_k": 20,
            "repeat_penalty": 1.0,
        },
    },
    "granite_3_1_1b_q8": {
        "name": "Granite 3.1 1B-A400M Instruct (128K Context)",
        "repo_id": "bartowski/granite-3.1-1b-a400m-instruct-GGUF",
        "filename": "*Q8_0.gguf",
        "max_context": 131072,
        "default_n_ctx": 4096,
        "params_size": "1B",
        "supports_reasoning": False,
        "supports_toggle": False,
        "inference_settings": {
            "temperature": 0.3,
            "top_p": 0.9,
            "top_k": 30,
            "repeat_penalty": 1.0,
        },
    },
    "falcon_h1_1.5b_q4": {
        "name": "Falcon-H1 1.5B Q4",
        "repo_id": "unsloth/Falcon-H1-1.5B-Deep-Instruct-GGUF",
        "filename": "*Q4_K_M.gguf",
        "max_context": 32768,
        "default_n_ctx": 4096,
        "params_size": "1.5B",
        "supports_reasoning": False,
        "supports_toggle": False,
        "inference_settings": {
            "temperature": 0.2,
            "top_p": 0.9,
            "top_k": 30,
            "repeat_penalty": 1.0,
        },
    },
    "qwen3_1.7b_q4": {
        "name": "Qwen3 1.7B Q4 (32K Context)",
        "repo_id": "unsloth/Qwen3-1.7B-GGUF",
        "filename": "*Q4_0.gguf",
        "max_context": 32768,
        "default_n_ctx": 4096,
        "params_size": "1.7B",
        "supports_reasoning": True,       # ← HYBRID MODEL
        "supports_toggle": True,          # ← User can toggle reasoning
        "inference_settings": {
            "temperature": 0.3,
            "top_p": 0.9,
            "top_k": 20,
            "repeat_penalty": 1.0,
        },
    },
    "lfm2_extract_350m": {
        "name": "LFM2-Extract 350M (Specialized)",
        "repo_id": "LiquidAI/LFM2-350M-Extract-GGUF",
        "filename": "*Q8_0.gguf",
        "max_context": 32768,
        "default_n_ctx": 4096,
        "params_size": "350M",
        "supports_reasoning": False,
        "supports_toggle": False,
        "inference_settings": {
            "temperature": 0.0,           # ← Greedy decoding per Liquid AI docs
            "top_p": 1.0,
            "top_k": 0,
            "repeat_penalty": 1.0,
        },
    },
    "lfm2_extract_1.2b": {
        "name": "LFM2-Extract 1.2B (Specialized)",
        "repo_id": "LiquidAI/LFM2-1.2B-Extract-GGUF",
        "filename": "*Q8_0.gguf",
        "max_context": 32768,
        "default_n_ctx": 4096,
        "params_size": "1.2B",
        "supports_reasoning": False,
        "supports_toggle": False,
        "inference_settings": {
            "temperature": 0.0,           # ← Greedy decoding per Liquid AI docs
            "top_p": 1.0,
            "top_k": 0,
            "repeat_penalty": 1.0,
        },
    },
}

Hybrid Models (Reasoning Support):

qwen3_600m_q4 - 600M, user-toggleable reasoning
qwen3_1.7b_q4 - 1.7B, user-toggleable reasoning

2. SYNTHESIS_MODELS (16 models)

Location: /home/luigi/tiny-scribe/app.py

Features:

✅ Fully independent from AVAILABLE_MODELS (no shared references)
✅ Synthesis-optimized temperatures (0.7-0.9, higher than extraction)
✅ 3 hybrid + 5 thinking-only models with reasoning support
✅ Default: qwen3_1.7b_q4

Registry Definition:

# FULLY INDEPENDENT from AVAILABLE_MODELS (no shared references)
# Synthesis-optimized settings: higher temperatures (0.7-0.9) for creative summary generation
SYNTHESIS_MODELS = {
    "granite_3_1_1b_q8": {..., "temperature": 0.8},
    "falcon_h1_1.5b_q4": {..., "temperature": 0.7},
    "qwen3_1.7b_q4": {..., "temperature": 0.8},          # DEFAULT
    "granite_3_3_2b_q4": {..., "temperature": 0.8},
    "youtu_llm_2b_q8": {..., "temperature": 0.8},         # reasoning toggle
    "lfm2_2_6b_transcript": {..., "temperature": 0.7},
    "breeze_3b_q4": {..., "temperature": 0.7},
    "granite_3_1_3b_q4": {..., "temperature": 0.8},
    "qwen3_4b_thinking_q3": {..., "temperature": 0.8},    # thinking-only
    "granite4_tiny_q3": {..., "temperature": 0.8},
    "ernie_21b_pt_q1": {..., "temperature": 0.8},
    "ernie_21b_thinking_q1": {..., "temperature": 0.9},   # thinking-only
    "glm_4_7_flash_reap_30b": {..., "temperature": 0.8},  # thinking-only
    "glm_4_7_flash_30b_iq2": {..., "temperature": 0.7},
    "qwen3_30b_thinking_q1": {..., "temperature": 0.8},   # thinking-only
    "qwen3_30b_instruct_q1": {..., "temperature": 0.7},
}

Reasoning Models:

Hybrid (toggleable): qwen3_1.7b_q4, youtu_llm_2b_q8
Thinking-only: qwen3_4b_thinking_q3, ernie_21b_thinking_q1, glm_4_7_flash_reap_30b, qwen3_30b_thinking_q1

3. EMBEDDING_MODELS (4 models)

Location: /home/luigi/tiny-scribe/meeting_summarizer/extraction.py

Features:

✅ Dedicated embedding models (not in AVAILABLE_MODELS)
✅ Used exclusively for deduplication phase
✅ Range: 384-dim to 1024-dim
✅ Default: granite-107m

Registry:

EMBEDDING_MODELS = {
    "granite-107m": {
        "name": "Granite 107M Multilingual (384-dim)",
        "repo_id": "ibm-granite/granite-embedding-107m-multilingual",
        "filename": "*Q8_0.gguf",
        "embedding_dim": 384,
        "max_context": 2048,
        "description": "Fastest, multilingual, good for quick deduplication",
    },
    "granite-278m": {
        "name": "Granite 278M Multilingual (768-dim)",
        "repo_id": "ibm-granite/granite-embedding-278m-multilingual",
        "filename": "*Q8_0.gguf",
        "embedding_dim": 768,
        "max_context": 2048,
        "description": "Balanced speed/quality, multilingual",
    },
    "gemma-300m": {
        "name": "Embedding Gemma 300M (768-dim)",
        "repo_id": "unsloth/embeddinggemma-300m-GGUF",
        "filename": "*Q8_0.gguf",
        "embedding_dim": 768,
        "max_context": 2048,
        "description": "Google embedding model, strong semantics",
    },
    "qwen-600m": {
        "name": "Qwen3 Embedding 600M (1024-dim)",
        "repo_id": "Qwen/Qwen3-Embedding-0.6B-GGUF",
        "filename": "*Q8_0.gguf",
        "embedding_dim": 1024,
        "max_context": 2048,
        "description": "Highest quality, best for critical dedup",
    },
}

UI Implementation

Advanced Mode Controls (Option B: Separate Reasoning Checkboxes)

Location: /home/luigi/tiny-scribe/app.py, Gradio interface section

# ===== ADVANCED MODE CONTROLS =====
# Uses gr.TabItem inside gr.Tabs (not gr.Group with visibility toggle)
with gr.TabItem("🧠 Advanced Mode (3-Model Pipeline)"):
    
    # Model Selection Row
    with gr.Row():
        extraction_model = gr.Dropdown(
            choices=list(EXTRACTION_MODELS.keys()),
            value="qwen3_1.7b_q4",  # ⭐ DEFAULT
            label="🔍 Stage 1: Extraction Model (≤1.7B)",
            info="Extracts structured items (action_items, decisions, key_points, questions) from windows"
        )
        
        embedding_model = gr.Dropdown(
            choices=list(EMBEDDING_MODELS.keys()),
            value="granite-107m",  # ⭐ DEFAULT
            label="🧬 Stage 2: Embedding Model",
            info="Computes semantic embeddings for deduplication across categories"
        )
        
        synthesis_model = gr.Dropdown(
            choices=list(SYNTHESIS_MODELS.keys()),
            value="qwen3_1.7b_q4",  # ⭐ DEFAULT
            label="✨ Stage 3: Synthesis Model (1B-30B)",
            info="Generates final executive summary from deduplicated items"
        )
    
    # Extraction Parameters Row
    with gr.Row():
        extraction_n_ctx = gr.Slider(
            minimum=2048,
            maximum=8192,
            step=1024,
            value=4096,  # ⭐ DEFAULT 4K
            label="🪟 Extraction Context Window (n_ctx)",
            info="Smaller = more windows (higher precision), Larger = fewer windows (faster processing)"
        )
        
        overlap_turns = gr.Slider(
            minimum=1,
            maximum=5,
            step=1,
            value=2,
            label="🔄 Window Overlap (speaker turns)",
            info="Number of speaker turns shared between adjacent windows (reduces information loss)"
        )
    
    # Deduplication Parameters Row
    with gr.Row():
        similarity_threshold = gr.Slider(
            minimum=0.70,
            maximum=0.95,
            step=0.01,
            value=0.85,
            label="🎯 Deduplication Similarity Threshold",
            info="Items with cosine similarity above this are considered duplicates (higher = stricter)"
        )
    
    # SEPARATE REASONING CONTROLS (Q7: Option B)
    with gr.Row():
        enable_extraction_reasoning = gr.Checkbox(
            value=False,
            visible=False,  # Conditional visibility based on extraction model
            label="🧠 Enable Reasoning for Extraction",
            info="Use thinking process before JSON output (Qwen3 hybrid models only)"
        )
        
        enable_synthesis_reasoning = gr.Checkbox(
            value=True,
            visible=True,  # Conditional visibility based on synthesis model
            label="🧠 Enable Reasoning for Synthesis",
            info="Use thinking process for final summary generation"
        )
    
    # Output Settings Row
    with gr.Row():
        adv_output_language = gr.Radio(
            choices=["en", "zh-TW"],
            value="en",
            label="🌐 Output Language",
            info="Extraction auto-detects language from transcript, synthesis uses this setting"
        )
        
        adv_max_tokens = gr.Slider(
            minimum=512,
            maximum=4096,
            step=128,
            value=2048,
            label="📏 Max Synthesis Tokens",
            info="Maximum tokens for final executive summary"
        )
    
    # Logging Control
    enable_detailed_logging = gr.Checkbox(
        value=True,
        label="📝 Enable Detailed Trace Logging",
        info="Save JSONL trace file (embedded in download JSON) for debugging pipeline"
    )
    
    # Model Info Accordion
    with gr.Accordion("📋 Model Details & Settings", open=False):
        with gr.Row():
            with gr.Column():
                extraction_model_info = gr.Markdown("**Extraction Model**\n\nSelect a model to see details")
            with gr.Column():
                embedding_model_info = gr.Markdown("**Embedding Model**\n\nSelect a model to see details")
            with gr.Column():
                synthesis_model_info = gr.Markdown("**Synthesis Model**\n\nSelect a model to see details")

Conditional Reasoning Checkbox Visibility Logic

def update_extraction_reasoning_visibility(model_key):
    """Show/hide extraction reasoning checkbox based on model capabilities."""
    config = EXTRACTION_MODELS.get(model_key, {})
    supports_toggle = config.get("supports_toggle", False)
    
    if supports_toggle:
        # Hybrid model (qwen3_600m_q4, qwen3_1.7b_q4)
        return gr.update(
            visible=True,
            value=False,
            interactive=True,
            label="🧠 Enable Reasoning for Extraction"
        )
    elif config.get("supports_reasoning", False) and not supports_toggle:
        # Thinking-only model (none currently in extraction, but future-proof)
        return gr.update(
            visible=True,
            value=True,
            interactive=False,
            label="🧠 Reasoning Mode for Extraction (Always On)"
        )
    else:
        # Non-reasoning model
        return gr.update(visible=False, value=False)


def update_synthesis_reasoning_visibility(model_key):
    """Show/hide synthesis reasoning checkbox based on model capabilities."""
    # Reuse existing logic from Standard mode
    return update_reasoning_visibility(model_key)  # Existing function


# Wire up event handlers
extraction_model.change(
    fn=update_extraction_reasoning_visibility,
    inputs=[extraction_model],
    outputs=[enable_extraction_reasoning]
)

synthesis_model.change(
    fn=update_synthesis_reasoning_visibility,
    inputs=[synthesis_model],
    outputs=[enable_synthesis_reasoning]
)

Model Info Display Functions

def get_extraction_model_info(model_key):
    """Generate markdown info for extraction model."""
    config = EXTRACTION_MODELS.get(model_key, {})
    settings = config.get("inference_settings", {})
    
    reasoning_support = ""
    if config.get("supports_toggle"):
        reasoning_support = "\n**Reasoning:** Hybrid (user-toggleable)"
    elif config.get("supports_reasoning"):
        reasoning_support = "\n**Reasoning:** Thinking-only (always on)"
    
    return f"""**{config.get('name', 'Unknown')}**

**Size:** {config.get('params_size', 'N/A')}  
**Max Context:** {config.get('max_context', 0):,} tokens  
**Default n_ctx:** {config.get('default_n_ctx', 4096):,} tokens (user-adjustable via slider)  
**Repository:** `{config.get('repo_id', 'N/A')}`{reasoning_support}

**Extraction-Optimized Settings:**
- Temperature: {settings.get('temperature', 'N/A')} (deterministic for JSON)
- Top P: {settings.get('top_p', 'N/A')}
- Top K: {settings.get('top_k', 'N/A')}
- Repeat Penalty: {settings.get('repeat_penalty', 'N/A')}
"""


def get_embedding_model_info(model_key):
    """Generate markdown info for embedding model."""
    from meeting_summarizer.extraction import EMBEDDING_MODELS
    config = EMBEDDING_MODELS.get(model_key, {})
    
    return f"""**{config.get('name', 'Unknown')}**

**Embedding Dimension:** {config.get('embedding_dim', 'N/A')}  
**Context:** {config.get('max_context', 0):,} tokens  
**Repository:** `{config.get('repo_id', 'N/A')}`

**Description:** {config.get('description', 'N/A')}
"""


def get_synthesis_model_info(model_key):
    """Generate markdown info for synthesis model."""
    config = SYNTHESIS_MODELS.get(model_key, {})
    settings = config.get("inference_settings", {})
    
    reasoning_support = ""
    if config.get("supports_toggle"):
        reasoning_support = "\n**Reasoning:** Hybrid (user-toggleable)"
    elif config.get("supports_reasoning"):
        reasoning_support = "\n**Reasoning:** Thinking-only (always on)"
    
    return f"""**{config.get('name', 'Unknown')}**

**Max Context:** {config.get('max_context', 0):,} tokens  
**Repository:** `{config.get('repo_id', 'N/A')}`{reasoning_support}

**Synthesis-Optimized Settings:**
- Temperature: {settings.get('temperature', 'N/A')} (from Standard mode)
- Top P: {settings.get('top_p', 'N/A')}
- Top K: {settings.get('top_k', 'N/A')}
- Repeat Penalty: {settings.get('repeat_penalty', 'N/A')}
"""


# Wire up info update handlers
extraction_model.change(
    fn=get_extraction_model_info,
    inputs=[extraction_model],
    outputs=[extraction_model_info]
)

embedding_model.change(
    fn=get_embedding_model_info,
    inputs=[embedding_model],
    outputs=[embedding_model_info]
)

synthesis_model.change(
    fn=get_synthesis_model_info,
    inputs=[synthesis_model],
    outputs=[synthesis_model_info]
)

Model Management Infrastructure

Role-Aware Configuration Resolver

def get_model_config(model_key: str, model_role: str) -> Dict[str, Any]:
    """
    Get model configuration based on role.
    
    Ensures same model (e.g., qwen3_1.7b_q4) uses DIFFERENT settings
    for extraction vs synthesis.
    
    Args:
        model_key: Model identifier (e.g., "qwen3_1.7b_q4")
        model_role: "extraction" or "synthesis"
    
    Returns:
        Model configuration dict with role-specific settings
    
    Raises:
        ValueError: If model_key not available for specified role
    """
    if model_role == "extraction":
        if model_key not in EXTRACTION_MODELS:
            available = ", ".join(list(EXTRACTION_MODELS.keys())[:3]) + "..."
            raise ValueError(
                f"Model '{model_key}' not available for extraction role. "
                f"Available: {available}"
            )
        return EXTRACTION_MODELS[model_key]
    
    elif model_role == "synthesis":
        if model_key not in SYNTHESIS_MODELS:
            available = ", ".join(list(SYNTHESIS_MODELS.keys())[:3]) + "..."
            raise ValueError(
                f"Model '{model_key}' not available for synthesis role. "
                f"Available: {available}"
            )
        return SYNTHESIS_MODELS[model_key]
    
    else:
        raise ValueError(
            f"Unknown model role: '{model_role}'. "
            f"Must be 'extraction' or 'synthesis'"
        )

Role-Aware Model Loader (Q9: Option A - Respect user's n_ctx choice)

def load_model_for_role(
    model_key: str,
    model_role: str,
    n_threads: int = 2,
    user_n_ctx: Optional[int] = None  # For extraction, from slider
) -> Tuple[Llama, str]:
    """
    Load model with role-specific configuration.
    
    Args:
        model_key: Model identifier
        model_role: "extraction" or "synthesis"
        n_threads: CPU threads
        user_n_ctx: User-specified n_ctx (extraction only, from slider)
    
    Returns:
        (loaded_model, info_message)
    
    Raises:
        Exception: If model loading fails (Q10: Option C - fail gracefully)
    """
    try:
        config = get_model_config(model_key, model_role)
        
        # Calculate n_ctx (Q9: Option A - Strict adherence to user's choice)
        if model_role == "extraction" and user_n_ctx is not None:
            n_ctx = min(user_n_ctx, config["max_context"], MAX_USABLE_CTX)
        else:
            # Synthesis or default extraction
            n_ctx = min(config.get("max_context", 8192), MAX_USABLE_CTX)
        
        # Detect GPU support
        requested_ngl = int(os.environ.get("N_GPU_LAYERS", 0))
        n_gpu_layers = requested_ngl
        
        if requested_ngl != 0:
            try:
                from llama_cpp import llama_supports_gpu_offload
                gpu_available = llama_supports_gpu_offload()
                if not gpu_available:
                    logger.warning("GPU requested but not available. Using CPU.")
                    n_gpu_layers = 0
            except Exception as e:
                logger.warning(f"Could not detect GPU: {e}. Using CPU.")
                n_gpu_layers = 0
        
        # Load model
        logger.info(f"Loading {config['name']} for {model_role} role (n_ctx={n_ctx:,})")
        
        llm = Llama.from_pretrained(
            repo_id=config["repo_id"],
            filename=config["filename"],
            n_ctx=n_ctx,
            n_batch=min(2048, n_ctx),
            n_threads=n_threads,
            n_threads_batch=n_threads,
            n_gpu_layers=n_gpu_layers,
            verbose=False,
            seed=1337,
        )
        
        info_msg = (
            f"✅ Loaded: {config['name']} for {model_role} "
            f"(n_ctx={n_ctx:,}, threads={n_threads})"
        )
        logger.info(info_msg)
        
        return llm, info_msg
    
    except Exception as e:
        # Q10: Option C - Fail gracefully, let user select different model
        error_msg = (
            f"❌ Failed to load {model_key} for {model_role}: {str(e)}\n\n"
            f"Please select a different model and try again."
        )
        logger.error(error_msg, exc_info=True)
        raise Exception(error_msg)


def unload_model(llm: Llama, model_name: str = "model") -> None:
    """Explicitly unload model and trigger garbage collection."""
    if llm:
        logger.info(f"Unloading {model_name}")
        del llm
        gc.collect()
        time.sleep(0.5)  # Allow OS to reclaim memory

Extraction Pipeline

Extraction System Prompt Builder (Bilingual + Reasoning)

def build_extraction_system_prompt(
    output_language: str,
    supports_reasoning: bool,
    supports_toggle: bool,
    enable_reasoning: bool
) -> str:
    """
    Build extraction system prompt with optional reasoning mode.
    
    Args:
        output_language: "en" or "zh-TW" (auto-detected from transcript)
        supports_reasoning: Model has reasoning capability
        supports_toggle: User can toggle reasoning on/off
        enable_reasoning: User's choice (only applies if supports_toggle=True)
    
    Returns:
        System prompt string
    """
    # Determine reasoning mode
    if supports_toggle and enable_reasoning:
        # Hybrid model with reasoning enabled
        reasoning_instruction_en = """
Use your reasoning capabilities to analyze the content before extracting.

Your reasoning should:
1. Identify key decision points and action items
2. Distinguish explicit decisions from general discussion
3. Categorize information appropriately (action vs point vs question)

After reasoning, output ONLY valid JSON."""
        
        reasoning_instruction_zh = """
使用你的推理能力分析內容後再進行提取。

你的推理應該：
1. 識別關鍵決策點和行動項目
2. 區分明確決策與一般討論
3. 適當分類資訊（行動 vs 要點 vs 問題）

推理後，僅輸出 JSON。"""
    else:
        reasoning_instruction_en = ""
        reasoning_instruction_zh = ""
    
    # Build full prompt
    if output_language == "zh-TW":
        return f"""你是會議分析助手。從逐字稿中提取結構化資訊。
{reasoning_instruction_zh}

僅輸出有效的 JSON，使用此精確架構：
{{
  "action_items": ["包含負責人和截止日期的任務", ...],
  "decisions": ["包含理由的決策", ...],
  "key_points": ["重要討論要點", ...],
  "open_questions": ["未解決的問題或疑慮", ...]
}}

規則：
- 每個項目必須是完整、獨立的句子
- 在每個項目中包含上下文（誰、什麼、何時）
- 如果類別沒有項目，使用空陣列 []
- 僅輸出 JSON，無 markdown，無解釋"""
    
    else:  # English
        return f"""You are a meeting analysis assistant. Extract structured information from transcript.
{reasoning_instruction_en}

Output ONLY valid JSON with this exact schema:
{{
  "action_items": ["Task with owner and deadline", ...],
  "decisions": ["Decision made with rationale", ...],
  "key_points": ["Important discussion point", ...],
  "open_questions": ["Unresolved question or concern", ...]
}}

Rules:
- Each item must be a complete, standalone sentence
- Include context (who, what, when) in each item
- If a category has no items, use empty array []
- Output ONLY JSON, no markdown, no explanations"""

Extraction Streaming with Reasoning Parsing (Q8: Option A - Show in "MODEL THINKING PROCESS")

def stream_extract_from_window(
    extraction_llm: Llama,
    window: Window,
    window_id: int,
    total_windows: int,
    tracer: Tracer,
    tokenizer: NativeTokenizer,
    enable_reasoning: bool = False
) -> Generator[Tuple[str, str, Dict[str, List[str]], bool], None, None]:
    """
    Stream extraction from single window with live progress + optional reasoning.
    
    Yields:
        (ticker_text, thinking_text, partial_items, is_complete)
        - ticker_text: Progress ticker for UI
        - thinking_text: Reasoning/thinking blocks (if extraction model supports it)
        - partial_items: Current extracted items
        - is_complete: True on final yield
    """
    # Auto-detect language from window content
    has_cjk = bool(re.search(r'[\u4e00-\u9fff]', window.content))
    output_language = "zh-TW" if has_cjk else "en"
    
    # Build system prompt with reasoning support
    config = EXTRACTION_MODELS[window.model_key]  # Assuming we pass model_key in Window
    system_prompt = build_extraction_system_prompt(
        output_language=output_language,
        supports_reasoning=config.get("supports_reasoning", False),
        supports_toggle=config.get("supports_toggle", False),
        enable_reasoning=enable_reasoning
    )
    
    user_prompt = f"Transcript:\n\n{window.content}"
    
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt}
    ]
    
    # Stream extraction
    full_response = ""
    thinking_content = ""
    start_time = time.time()
    first_token_time = None
    token_count = 0
    
    try:
        stream = extraction_llm.create_chat_completion(
            messages=messages,
            max_tokens=1024,
            temperature=config["inference_settings"]["temperature"],
            top_p=config["inference_settings"]["top_p"],
            top_k=config["inference_settings"]["top_k"],
            repeat_penalty=config["inference_settings"]["repeat_penalty"],
            stream=True,
        )
        
        for chunk in stream:
            if 'choices' in chunk and len(chunk['choices']) > 0:
                delta = chunk['choices'][0].get('delta', {})
                content = delta.get('content', '')
                
                if content:
                    if first_token_time is None:
                        first_token_time = time.time()
                    
                    token_count += 1
                    full_response += content
                    
                    # Parse thinking blocks if reasoning enabled
                    if enable_reasoning and config.get("supports_reasoning"):
                        thinking, remaining = parse_thinking_blocks(full_response, streaming=True)
                        thinking_content = thinking or ""
                        json_text = remaining
                    else:
                        json_text = full_response
                    
                    # Try to parse JSON
                    partial_items = _try_parse_extraction_json(json_text)
                    
                    # Calculate progress metrics
                    elapsed = time.time() - start_time
                    tps = token_count / elapsed if elapsed > 0 else 0
                    remaining_tokens = 1024 - token_count
                    eta = int(remaining_tokens / tps) if tps > 0 else 0
                    
                    # Get item counts for ticker
                    items_count = {
                        "action_items": len(partial_items.get("action_items", [])),
                        "decisions": len(partial_items.get("decisions", [])),
                        "key_points": len(partial_items.get("key_points", [])),
                        "open_questions": len(partial_items.get("open_questions", []))
                    }
                    
                    # Get last extracted item as snippet
                    last_item = ""
                    for category in ["action_items", "decisions", "key_points", "open_questions"]:
                        if partial_items.get(category):
                            last_item = partial_items[category][-1]
                            break
                    
                    # Format progress ticker
                    input_tokens = tokenizer.count(window.content)
                    ticker = format_progress_ticker(
                        current_window=window_id,
                        total_windows=total_windows,
                        window_tokens=input_tokens,
                        max_tokens=4096,  # Reference max for percentage
                        items_found=items_count,
                        tokens_per_sec=tps,
                        eta_seconds=eta,
                        current_snippet=last_item
                    )
                    
                    # Q8: Option A - Show in "MODEL THINKING PROCESS" field
                    yield (ticker, thinking_content, partial_items, False)
        
        # Final parse
        if enable_reasoning and config.get("supports_reasoning"):
            thinking, remaining = parse_thinking_blocks(full_response)
            thinking_content = thinking or ""
            json_text = remaining
        else:
            json_text = full_response
        
        final_items = _try_parse_extraction_json(json_text)
        
        if not final_items:
            # JSON parsing failed - FAIL ENTIRE PIPELINE (strict mode)
            error_msg = f"Failed to parse JSON from window {window_id}. Response: {json_text[:200]}"
            tracer.log_extraction(
                window_id=window_id,
                extraction=None,
                llm_response=_sample_llm_response(full_response),
                error=error_msg
            )
            raise ValueError(error_msg)
        
        # Log successful extraction
        tracer.log_extraction(
            window_id=window_id,
            extraction=final_items,
            llm_response=_sample_llm_response(full_response),
            thinking=_sample_llm_response(thinking_content) if thinking_content else None,
            error=None
        )
        
        # Final ticker
        elapsed = time.time() - start_time
        tps = token_count / elapsed if elapsed > 0 else 0
        items_count = {k: len(v) for k, v in final_items.items()}
        
        ticker = format_progress_ticker(
            current_window=window_id,
            total_windows=total_windows,
            window_tokens=input_tokens,
            max_tokens=4096,
            items_found=items_count,
            tokens_per_sec=tps,
            eta_seconds=0,
            current_snippet="✅ Extraction complete"
        )
        
        yield (ticker, thinking_content, final_items, True)
    
    except Exception as e:
        # Log error and re-raise to fail entire pipeline
        tracer.log_extraction(
            window_id=window_id,
            extraction=None,
            llm_response=_sample_llm_response(full_response) if full_response else "",
            error=str(e)
        )
        raise

Implementation Checklist

Files to Create

/home/luigi/tiny-scribe/meeting_summarizer/extraction.py (~900 lines)
- NativeTokenizer class
- EmbeddingModel class + EMBEDDING_MODELS registry
- format_progress_ticker() function
- stream_extract_from_window() function (with reasoning support)
- deduplicate_items() function
- stream_synthesize_executive_summary() function

Files to Modify

/home/luigi/tiny-scribe/meeting_summarizer/__init__.py
- Remove filter_validated_items import/export
/home/luigi/tiny-scribe/meeting_summarizer/trace.py
- Add log_extraction() method
- Add log_deduplication() method
- Add log_synthesis() method
/home/luigi/tiny-scribe/app.py (~800 lines added/modified)
- Add EXTRACTION_MODELS registry (13 models)
- Add SYNTHESIS_MODELS reference
- Add get_model_config() function
- Add load_model_for_role() function
- Add unload_model() function
- Add build_extraction_system_prompt() function
- Add summarize_advanced() generator function
- Add Advanced mode UI controls
- Add reasoning visibility logic
- Add model info display functions
- Update download_summary_json() for trace embedding

Code Statistics

Metric	Count
New Lines	~1,800
Modified Lines	~60
Removed Lines	~2
New Functions	12
New Classes	2
UI Controls	11

Testing Strategy

Phase 1: Model Registry Validation

python -c "
from app import EXTRACTION_MODELS, SYNTHESIS_MODELS
from meeting_summarizer.extraction import EMBEDDING_MODELS

assert len(EXTRACTION_MODELS) == 13, 'Extraction models count mismatch'
assert len(EMBEDDING_MODELS) == 4, 'Embedding models count mismatch'
assert len(SYNTHESIS_MODELS) == 16, 'Synthesis models count mismatch'

# Verify independent settings
ext_qwen = EXTRACTION_MODELS['qwen3_1.7b_q4']['inference_settings']['temperature']
syn_qwen = SYNTHESIS_MODELS['qwen3_1.7b_q4']['inference_settings']['temperature']
assert ext_qwen == 0.3, f'Extraction temp wrong: {ext_qwen}'
assert syn_qwen == 0.8, f'Synthesis temp wrong: {syn_qwen}'

print('✅ All model registries validated!')
"

Phase 2: UI Control Validation

Manual Checks:

Select "Advanced" mode
Verify 3 dropdowns show correct counts (13, 4, 16)
Verify default models selected
Adjust extraction_n_ctx slider (2K → 8K)
Select qwen3_600m_q4 for extraction → reasoning checkbox appears
Select qwen3_1.7b_q4 for extraction → reasoning checkbox visible (Qwen3 supports reasoning)
Select qwen3_4b_thinking_q3 for synthesis → reasoning locked ON
Verify model info panels update on selection

Phase 3: Pipeline Test - min.txt (Quick)

Configuration:

Extraction: qwen3_1.7b_q4 (default)
Extraction n_ctx: 4096 (default)
Embedding: granite-107m (default)
Synthesis: qwen3_1.7b_q4 (default)
Similarity threshold: 0.85 (default)

Expected:

1 window created
~2-4 items extracted
0-1 duplicates removed
Final summary generated
Total time: ~30-60s
Download JSON contains trace

Phase 4: Pipeline Test - Reasoning Models

Configuration:

Extraction: qwen3_600m_q4
☑ Enable Reasoning for Extraction (test hybrid model)
Extraction n_ctx: 2048 (smaller windows)
Embedding: granite-278m (test balanced embedding)
Synthesis: qwen3_1.7b_q4
☑ Enable Reasoning for Synthesis

Expected:

More windows (~4-6 with 2K context)
"MODEL THINKING PROCESS" shows extraction thinking + ticker
~10-15 items extracted
~2-4 duplicates removed
Final summary with thinking blocks
Total time: ~2-3 min

Phase 5: Pipeline Test - full.txt (Production)

Configuration:

Extraction: qwen3_1.7b_q4 (high quality, reasoning enabled)
Extraction n_ctx: 4096 (default)
Embedding: qwen-600m (highest quality)
Synthesis: qwen3_4b_thinking_q3 (4B thinking model)
Output language: zh-TW (test Chinese)

Expected:

~3-5 windows (4K context)
~40-60 items extracted
~10-15 duplicates removed
Final summary in Traditional Chinese
Total time: ~5-8 min
Download JSON with embedded trace (~1-2MB)

Phase 6: Error Handling Test (Q10: Option C)

Scenarios:

Disconnect internet during model download
Manually corrupt model cache
Use invalid model repo_id in EXTRACTION_MODELS

Expected behavior:

Error message displayed in UI: "❌ Failed to load lfm2_extract_1.2b..."
Pipeline stops (doesn't try fallback)
User can select different model and retry
Trace file saved with error details

Implementation Priority

Suggested Implementation Sequence (13-19 hours total)

1. Model Registries (1-2 hours)

Add EXTRACTION_MODELS to app.py
Add SYNTHESIS_MODELS reference
Add EMBEDDING_MODELS to extraction.py
Validate with smoke test

2. Core Infrastructure (2-3 hours)

Implement get_model_config()
Implement load_model_for_role() with user_n_ctx support
Implement unload_model()
Implement build_extraction_system_prompt() with reasoning support
Update trace.py with 3 new logging methods
Update __init__.py

3. Extraction Module (3-4 hours)

Implement NativeTokenizer class
Implement EmbeddingModel class
Implement format_progress_ticker()
Implement stream_extract_from_window() with reasoning parsing
Implement deduplicate_items()
Implement stream_synthesize_executive_summary()

4. UI Integration (2-3 hours)

Add Advanced mode controls to Gradio interface
Implement reasoning checkbox visibility logic
Implement model info display functions
Wire up all event handlers
Test UI responsiveness

5. Pipeline Orchestration (3-4 hours)

Implement summarize_advanced() generator function
Sequential model loading/unloading logic
Error handling with graceful failures
Progress ticker updates
Trace embedding in download JSON

6. Testing & Validation (2-3 hours)

Run all test phases (min.txt → full.txt)
Validate reasoning models behavior
Test error handling scenarios
Performance optimization (if needed)

Risk Assessment

Risk	Probability	Impact	Mitigation
Memory overflow on HF Spaces Free Tier	Low	High	Sequential loading/unloading tested; add memory monitoring
Reasoning output breaks JSON parsing	Medium	Medium	Robust thinking block parsing with fallback; strict error handling
User n_ctx slider causes OOM	Low	Medium	Cap at MAX_USABLE_CTX (32K); show warning if user sets too high
Embedding models slow down pipeline	Medium	Low	Default to granite-107m (fastest); user can upgrade if needed
Trace file too large	Low	Low	Response sampling (400 chars) already implemented; compress if >5MB

Appendix: Model Comparison Tables

Extraction Models (11)

Model	Size	Context	Reasoning	Settings
falcon_h1_100m	100M	32K	No	temp=0.2
gemma3_270m	270M	32K	No	temp=0.3
ernie_300m	300M	131K	No	temp=0.2
granite_350m	350M	32K	No	temp=0.1
bitcpm4_500m	500M	128K	No	temp=0.2
hunyuan_500m	500M	256K	No	temp=0.2
qwen3_600m_q4	600M	32K	Hybrid	temp=0.3
granite_3_1_1b_q8	1B	128K	No	temp=0.3
falcon_h1_1.5b_q4	1.5B	32K	No	temp=0.2
qwen3_1.7b_q4	1.7B	32K	Hybrid	temp=0.3
lfm2_extract_1.2b	1.2B	32K	No	temp=0.2

Synthesis Models (16)

Model	Size	Context	Reasoning	Settings
granite_3_1_1b_q8	1B	128K	No	temp=0.7
falcon_h1_1.5b_q4	1.5B	32K	No	temp=0.1
qwen3_1.7b_q4	1.7B	32K	Hybrid	temp=0.8
granite_3_3_2b_q4	2B	128K	No	temp=0.8
youtu_llm_2b_q8	2B	128K	Hybrid	temp=0.8
lfm2_2_6b_transcript	2.6B	32K	No	temp=0.7
breeze_3b_q4	3B	32K	No	temp=0.7
granite_3_1_3b_q4	3B	128K	No	temp=0.8
qwen3_4b_thinking_q3	4B	256K	Thinking-only	temp=0.8
granite4_tiny_q3	7B	128K	No	temp=0.8
ernie_21b_pt_q1	21B	128K	No	temp=0.8
ernie_21b_thinking_q1	21B	128K	Thinking-only	temp=0.9
glm_4_7_flash_reap_30b	30B	128K	Thinking-only	temp=0.8
glm_4_7_flash_30b_iq2	30B	128K	No	temp=0.7
qwen3_30b_thinking_q1	30B	256K	Thinking-only	temp=0.8
qwen3_30b_instruct_q1	30B	256K	No	temp=0.7

Embedding Models (4)

Model	Size	Dimension	Speed	Quality
granite-107m	107M	384	Fastest	Good
granite-278m	278M	768	Balanced	Better
gemma-300m	300M	768	Fast	Good
qwen-600m	600M	1024	Slower	Best

Next Steps

Once approved, implementation will proceed in the order outlined in the Priority section. All code will be committed with descriptive messages referencing this plan document.

Ready for implementation approval.

Document Version: 1.1
Last Updated: 2026-02-05
Author: Claude (Anthropic)
Reviewer: Updated post-implementation to match actual code

Advanced 2-Stage Meeting Summarization - Complete Implementation Plan

Table of Contents

Executive Summary

Architecture

Key Metrics

Design Decisions

Q1: Extraction Model List Composition (REVISION)

Q1a: Synthesis Model Selection (NEW)

Q2: Independence from Standard Mode

Q3: Extraction n_ctx UI Control

Q4: Default Models

Q5: Model Key Naming

Q6: Model Overlap Between Stages

Q7: Reasoning Checkbox UI Flow

Q8: Thinking Block Display

Q9: Window Token Counting with User n_ctx

Q10: Model Loading Error Handling

Model Registries

1. EXTRACTION_MODELS (13 models - FINAL)

2. SYNTHESIS_MODELS (16 models)

3. EMBEDDING_MODELS (4 models)

UI Implementation

Advanced Mode Controls (Option B: Separate Reasoning Checkboxes)

Conditional Reasoning Checkbox Visibility Logic

Model Info Display Functions

Model Management Infrastructure

Role-Aware Configuration Resolver

Role-Aware Model Loader (Q9: Option A - Respect user's n_ctx choice)

Extraction Pipeline

Extraction System Prompt Builder (Bilingual + Reasoning)

Extraction Streaming with Reasoning Parsing (Q8: Option A - Show in "MODEL THINKING PROCESS")

Implementation Checklist

Files to Create

Files to Modify

Code Statistics

Testing Strategy

Phase 1: Model Registry Validation

Phase 2: UI Control Validation

Phase 3: Pipeline Test - min.txt (Quick)

Phase 4: Pipeline Test - Reasoning Models

Phase 5: Pipeline Test - full.txt (Production)

Phase 6: Error Handling Test (Q10: Option C)

Implementation Priority

Suggested Implementation Sequence (13-19 hours total)

Risk Assessment

Appendix: Model Comparison Tables

Extraction Models (11)

Synthesis Models (16)

Embedding Models (4)

Next Steps