🤖 Multi-Model API System
Table of Contents
- Overview
- Supported Providers
- Smart Model Router
- Model Ensemble
- Cost & Token Tracking
- Prompt Management
- Configuration
- API Reference
Overview
The Multi-Model API System provides a unified interface for interacting with multiple LLM providers (OpenAI, Anthropic, Google, Groq, etc.), enabling:
- Flexibility: Switch between models without code changes
- Optimization: Auto-route requests to the best model for each task
- Cost Control: Track spending and enforce budgets
- Reliability: Fallback to alternative models on failure
- Experimentation: A/B test prompts and models
Architecture
```text
┌─────────────────────────────────────────────────────────────┐
│                        Agent Request                        │
│                   "Extract product price"                   │
└──────────────────────────────┬──────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                     Smart Model Router                      │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Task Classifier:                                        │ │
│ │  • Reasoning       → GPT-4 / Claude                     │ │
│ │  • Fast extraction → Groq / Gemini Flash                │ │
│ │  • Long context    → Claude / GPT-4-32k                 │ │
│ │  • Cost-sensitive  → Gemini / Groq                      │ │
│ └─────────────────────────────────────────────────────────┘ │
└──────────────────────────────┬──────────────────────────────┘
                               │
       ┌────────────────┬──────┴─────────┬────────────────┐
       ▼                ▼                ▼                ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│    OpenAI    │ │  Anthropic   │ │    Google    │ │     Groq     │
│   Adapter    │ │   Adapter    │ │   Adapter    │ │   Adapter    │
└──────┬───────┘ └──────┬───────┘ └──────┬───────┘ └──────┬───────┘
       ▼                ▼                ▼                ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ gpt-4-turbo  │ │ claude-3.5   │ │ gemini-pro   │ │ llama-3-70b  │
│ gpt-4o-mini  │ │ claude-3     │ │ gemini-flash │ │ mixtral-8x7b │
└──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘
```
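Every provider sits behind an adapter that exposes the same call surface, so the router can treat providers interchangeably. A minimal sketch of that contract — all class and field names here are illustrative, not the system's actual interfaces:

```python
from dataclasses import dataclass
from typing import Protocol

@dataclass
class ModelResponse:
    """Hypothetical normalized response shared by all adapters."""
    text: str
    input_tokens: int
    output_tokens: int

class ProviderAdapter(Protocol):
    """Each adapter maps this common surface onto its provider's SDK."""
    def generate(self, model: str, prompt: str) -> ModelResponse: ...

class EchoAdapter:
    """Stand-in adapter used only to show the shape of the contract."""
    def generate(self, model: str, prompt: str) -> ModelResponse:
        return ModelResponse(
            text=f"[{model}] {prompt}",
            input_tokens=len(prompt.split()),
            output_tokens=3,
        )

# Registry keyed by provider name; the router picks an entry, then a model
ADAPTERS: dict[str, ProviderAdapter] = {"echo": EchoAdapter()}

def call(provider: str, model: str, prompt: str) -> ModelResponse:
    return ADAPTERS[provider].generate(model, prompt)
```

Because the response is normalized, cost tracking and routing logic never need provider-specific branches.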
Supported Providers
1. OpenAI
Models:
- `gpt-4-turbo` - Best reasoning, multimodal
- `gpt-4o` - Fast GPT-4 variant
- `gpt-4o-mini` - Cost-effective, fast
- `gpt-3.5-turbo` - Legacy, cheap
Capabilities:
- Function calling
- JSON mode
- Vision (gpt-4-turbo, gpt-4o)
- 128k context (gpt-4-turbo)
Configuration:
```jsonc
{
  "provider": "openai",
  "api_key": "sk-...",
  "organization": "org-...",  // Optional
  "models": {
    "default": "gpt-4o-mini",
    "reasoning": "gpt-4-turbo",
    "fast": "gpt-4o-mini"
  },
  "parameters": {
    "temperature": 0.7,
    "max_tokens": 4096,
    "timeout": 60
  }
}
```
2. Anthropic (Claude)
Models:
- `claude-3-opus-20240229` - Most capable
- `claude-3-sonnet-20240229` - Balanced
- `claude-3-haiku-20240307` - Fast and cheap
- `claude-3-5-sonnet-20240620` - Latest, best
Capabilities:
- 200k context window
- Strong reasoning
- Excellent instruction following
- Tool use (function calling)
Configuration:
```json
{
  "provider": "anthropic",
  "api_key": "sk-ant-...",
  "models": {
    "default": "claude-3-5-sonnet-20240620",
    "reasoning": "claude-3-opus-20240229",
    "fast": "claude-3-haiku-20240307"
  },
  "parameters": {
    "temperature": 0.7,
    "max_tokens": 4096,
    "timeout": 90
  }
}
```
3. Google (Gemini)
Models:
- `gemini-1.5-pro` - Best quality, 2M context
- `gemini-1.5-flash` - Fast, 1M context
- `gemini-1.0-pro` - Legacy
Capabilities:
- Massive context (1M-2M tokens)
- Multimodal (text, image, video, audio)
- Extremely cost-effective
- Function calling
Configuration:
```json
{
  "provider": "google",
  "api_key": "AIza...",
  "models": {
    "default": "gemini-1.5-flash",
    "reasoning": "gemini-1.5-pro",
    "fast": "gemini-1.5-flash"
  },
  "parameters": {
    "temperature": 0.7,
    "max_output_tokens": 8192,
    "timeout": 60
  }
}
```
4. Groq
Models:
- `llama-3.1-405b` - Largest Llama
- `llama-3.1-70b-versatile` - Balanced
- `llama-3.1-8b-instant` - Ultra-fast
- `mixtral-8x7b-32768` - Good reasoning
Capabilities:
- Extremely fast inference (500+ tokens/sec)
- Free tier available
- Open-source models
- JSON mode
Configuration:
```json
{
  "provider": "groq",
  "api_key": "gsk_...",
  "models": {
    "default": "llama-3.1-70b-versatile",
    "reasoning": "llama-3.1-405b",
    "fast": "llama-3.1-8b-instant"
  },
  "parameters": {
    "temperature": 0.7,
    "max_tokens": 8192,
    "timeout": 30
  }
}
```
5. Mistral AI
Models:
- `mistral-large-latest` - Best quality
- `mistral-medium-latest` - Balanced
- `mistral-small-latest` - Fast and cheap
- `mixtral-8x22b` - Open-source, strong
Configuration:
```json
{
  "provider": "mistral",
  "api_key": "...",
  "models": {
    "default": "mistral-medium-latest",
    "reasoning": "mistral-large-latest",
    "fast": "mistral-small-latest"
  }
}
```
6. Cohere
Models:
- `command-r-plus` - Best for RAG
- `command-r` - Balanced
- `command-light` - Fast
Specialization: RAG, embeddings, reranking
7. Perplexity
Models:
- `pplx-70b-online` - Web-connected
- `pplx-7b-online` - Fast, web-connected
Specialization: Real-time web search and citations
8. Together AI
Models: 50+ open-source models
- Llama variants
- Mistral variants
- Code models (CodeLlama, StarCoder)
Use Case: Access to latest open-source models
9. Custom / Self-Hosted
Supported:
- Ollama (local models)
- vLLM (self-hosted inference)
- LM Studio (local GUI)
- LocalAI (OpenAI-compatible local server)
Configuration:
```jsonc
{
  "provider": "custom",
  "base_url": "http://localhost:11434/v1",  // Ollama
  "api_key": "not-needed",
  "models": {
    "default": "llama3:70b",
    "fast": "llama3:8b"
  }
}
```
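Because Ollama, vLLM, and LocalAI all expose OpenAI-compatible `/v1` endpoints, the custom provider mostly amounts to changing the base URL. A sketch of assembling such a request — the helper name and returned shape are illustrative, not this system's API:

```python
def build_chat_request(base_url: str, model: str, prompt: str) -> dict:
    """Build an OpenAI-compatible chat-completions request for a local server.

    Works against any /v1-style endpoint (Ollama, vLLM, LocalAI). The dict
    maps directly onto a POST call with your HTTP client of choice.
    """
    return {
        "url": f"{base_url.rstrip('/')}/chat/completions",
        "json": {
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
        },
        # Local servers typically ignore the key but may require the header
        "headers": {"Authorization": "Bearer not-needed"},
    }

req = build_chat_request("http://localhost:11434/v1", "llama3:8b", "Hello")
```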
Smart Model Router
The Smart Model Router automatically selects the best model for each request based on task characteristics.
Routing Strategy
```python
class ModelRouter:
    def route(self, task: Task, context: Dict) -> ModelConfig:
        """Select the best model for this task."""
        # 1. Explicit user preference
        if context.get("preferred_model"):
            return self.get_model(context["preferred_model"])

        # 2. Task-based routing
        if task.type == "reasoning":
            return self.route_reasoning(task, context)
        elif task.type == "extraction":
            return self.route_extraction(task, context)
        elif task.type == "classification":
            return self.route_classification(task, context)

        # 3. Fallback to default
        return self.default_model

    def route_reasoning(self, task: Task, context: Dict) -> ModelConfig:
        """Route complex reasoning tasks."""
        # Long context? Use Claude or Gemini
        if context.get("input_tokens", 0) > 50_000:
            return self.get_model("claude-3-5-sonnet")  # 200k context

        # Need reliability? Use GPT-4 or Claude
        if task.importance == "high":
            return self.get_model("gpt-4-turbo")

        # Cost-sensitive? Use Gemini or Groq
        if context.get("budget_mode"):
            return self.get_model("gemini-1.5-flash")

        return self.get_model("claude-3-5-sonnet")  # Default for reasoning

    def route_extraction(self, task: Task, context: Dict) -> ModelConfig:
        """Route simple extraction tasks."""
        # Speed critical? Use Groq
        if context.get("latency_critical"):
            return self.get_model("llama-3.1-70b-versatile", provider="groq")

        # Cost-sensitive? Use Gemini Flash or Groq
        return self.get_model("gemini-1.5-flash")
```
Routing Rules
| Task Type | Input Size | Priority | Recommended Model | Reason |
|---|---|---|---|---|
| Reasoning | Any | High | `gpt-4-turbo` | Best quality |
| Reasoning | >50k tokens | Any | `claude-3-5-sonnet` | 200k context |
| Reasoning | Any | Budget | `gemini-1.5-flash` | Cheap, good quality |
| Extraction | <10k tokens | Speed | `groq/llama-3.1-70b` | 500+ tok/sec |
| Extraction | Any | Budget | `gpt-4o-mini` | $0.15/1M tokens |
| Classification | <5k tokens | Any | `groq/llama-3.1-8b` | Ultra-fast |
| Long Context | >100k tokens | Any | `gemini-1.5-pro` | 2M context |
| Vision | Images | Any | `gpt-4o` | Best multimodal |
| Web Search | Any | Any | `perplexity` | Web-connected |
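The table above can be approximated as a first-match rule scan. A trimmed sketch covering a few rows — the rule tuple layout and the `pick` helper are illustrative, not the shipped router:

```python
# First-match rule table: (task_type, max_input_tokens or None, priority or None, model)
RULES = [
    ("reasoning",      None,   "high",   "gpt-4-turbo"),
    ("reasoning",      None,   "budget", "gemini-1.5-flash"),
    ("extraction",     10_000, "speed",  "groq/llama-3.1-70b"),
    ("classification", 5_000,  None,     "groq/llama-3.1-8b"),
]

def pick(task_type: str, input_tokens: int, priority: str,
         default: str = "gpt-4o-mini") -> str:
    """Return the first rule whose constraints all match, else the default."""
    for rule_type, max_tokens, rule_priority, model in RULES:
        if rule_type != task_type:
            continue
        if max_tokens is not None and input_tokens >= max_tokens:
            continue
        if rule_priority is not None and rule_priority != priority:
            continue
        return model
    return default
```

First-match ordering keeps rule precedence explicit: more specific rows simply go earlier in the list.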
Configuration
```python
class RouterConfig(BaseModel):
    enabled: bool = True
    strategy: Literal["task_based", "cost_optimized", "speed_optimized", "quality_optimized"]

    # Task-based routing rules
    routing_rules: Dict[str, str] = {
        "reasoning_high_priority": "gpt-4-turbo",
        "reasoning_budget": "gemini-1.5-flash",
        "extraction_fast": "groq/llama-3.1-70b",
        "extraction_accurate": "claude-3-5-sonnet",
        "long_context": "gemini-1.5-pro",
        "vision": "gpt-4o"
    }

    # Fallback chain
    fallback_order: List[str] = [
        "claude-3-5-sonnet",
        "gpt-4o-mini",
        "gemini-1.5-flash",
        "groq/llama-3.1-70b"
    ]

    # Auto-retry on failure
    auto_retry: bool = True
    max_retries: int = 3
```
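Walking `fallback_order` until one model succeeds can be sketched as follows; the `call_with_fallback` helper and its signature are hypothetical, shown only to make the chain's semantics concrete:

```python
import asyncio

async def call_with_fallback(call, fallback_order, prompt):
    """Try each model in fallback_order until one succeeds.

    `call` is any async callable (model, prompt) -> str. A real
    implementation would narrow the except clause to transient errors
    (timeouts, rate limits) rather than catching everything.
    """
    last_error = None
    for model in fallback_order:
        try:
            return model, await call(model, prompt)
        except Exception as exc:
            last_error = exc
    raise RuntimeError(f"all models failed; last error: {last_error}")
```

The caller gets back both the answer and which model produced it, which matters for cost tracking.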
Model Ensemble
Model Ensemble runs multiple models in parallel and merges their outputs for higher quality or consensus.
Ensemble Strategies
1. Voting (Classification/Extraction)
Run 3+ models, take majority vote.
```python
import asyncio
from collections import Counter

class VotingEnsemble:
    async def predict(self, prompt: str, models: List[str]) -> Any:
        """Run multiple models concurrently and vote on the result."""
        tasks = [self.call_model(model, prompt) for model in models]
        results = await asyncio.gather(*tasks)

        # Count votes
        votes = Counter(results)
        winner, count = votes.most_common(1)[0]
        confidence = count / len(results)

        return {
            "result": winner,
            "confidence": confidence,
            "votes": dict(votes)
        }

# Example: Extract price with 3 models
ensemble = VotingEnsemble()
result = await ensemble.predict(
    prompt="Extract the product price: <html>...",
    models=["gpt-4o-mini", "gemini-1.5-flash", "groq/llama-3.1-70b"]
)
# Result: {"result": "$49.99", "confidence": 1.0, "votes": {"$49.99": 3}}
```
2. Ranking (Quality Assessment)
Run multiple models, rank outputs by quality.
```python
import asyncio

class RankingEnsemble:
    async def generate(self, prompt: str, models: List[str]) -> List[Dict]:
        """Generate with multiple models and rank the outputs by quality."""
        tasks = [self.call_model(model, prompt) for model in models]
        results = await asyncio.gather(*tasks)

        # Score each result
        scored_results = []
        for model, output in zip(models, results):
            score = self.quality_scorer.score(output, prompt)
            scored_results.append({
                "model": model,
                "output": output,
                "quality_score": score
            })

        # Sort by score, best first
        scored_results.sort(key=lambda x: x["quality_score"], reverse=True)
        return scored_results

# Example: Generate reasoning with ranking
ensemble = RankingEnsemble()
results = await ensemble.generate(
    prompt="Explain how to extract a price from HTML",
    models=["gpt-4-turbo", "claude-3-5-sonnet", "gemini-1.5-pro"]
)
best_result = results[0]  # Highest quality
```
3. Fusion (Merging Outputs)
Merge complementary outputs from multiple models.
```python
import asyncio
from collections import Counter

class FusionEnsemble:
    async def extract_structured(self, prompt: str, models: List[str]) -> Dict:
        """Extract structured data with multiple models and merge the fields."""
        tasks = [self.call_model(model, prompt) for model in models]
        results = await asyncio.gather(*tasks)

        # Merge fields, taking the majority value across models
        merged = {}
        for field in self.extract_fields(results):
            values = [r.get(field) for r in results if r.get(field)]
            if not values:
                continue
            # Use the most common value for each field
            counts = Counter(values)
            merged[field] = counts.most_common(1)[0][0]
        return merged

# Example: Extract product data with fusion
ensemble = FusionEnsemble()
product = await ensemble.extract_structured(
    prompt="Extract product details: <html>...",
    models=["gpt-4o-mini", "gemini-1.5-flash", "claude-3-haiku"]
)
# Merges {name: "...", price: "$X", rating: "Y"} from all models
```
4. Verification (Primary + Validator)
One model generates, another validates.
```python
class VerificationEnsemble:
    async def generate_and_verify(
        self,
        prompt: str,
        generator_model: str,
        validator_model: str
    ) -> Dict:
        """Generate with one model, verify with another."""
        # Generate
        output = await self.call_model(generator_model, prompt)

        # Verify
        verification_prompt = f"""
        Original task: {prompt}
        Generated output: {output}

        Is this output correct and complete? Explain any issues.
        """
        verification = await self.call_model(validator_model, verification_prompt)

        return {
            "output": output,
            "verification": verification,
            "confidence": self.parse_confidence(verification)
        }

# Example: Generate with Groq (fast), verify with Claude (accurate)
ensemble = VerificationEnsemble()
result = await ensemble.generate_and_verify(
    prompt="Extract all product prices from this catalog page",
    generator_model="groq/llama-3.1-70b",
    validator_model="claude-3-5-sonnet"
)
```
Ensemble Configuration
```python
class EnsembleConfig(BaseModel):
    enabled: bool = False  # Off by default (costs more)
    strategy: Literal["voting", "ranking", "fusion", "verification"]

    # Model selection
    models: List[str] = []  # If empty, the router selects

    # Voting settings
    min_agreement: float = 0.67  # Require 67% agreement

    # Ranking settings
    quality_metric: Literal["coherence", "accuracy", "completeness"]

    # Verification settings
    generator_model: Optional[str] = None
    validator_model: Optional[str] = None
```
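The `min_agreement` threshold gates voting results: a majority answer only counts if enough models agree. A small sketch of that check (returning `None` for a below-threshold vote is one possible policy, not necessarily the system's):

```python
from collections import Counter

def vote_with_threshold(results: list, min_agreement: float = 0.67):
    """Return (value, confidence), or (None, confidence) below the threshold."""
    winner, count = Counter(results).most_common(1)[0]
    confidence = count / len(results)
    if confidence < min_agreement:
        # Not enough agreement; caller may escalate to a stronger model
        return None, confidence
    return winner, confidence
```

Note that with three models and the default 0.67 threshold, a 2-of-3 split (confidence 0.666…) narrowly fails, so unanimity is effectively required; lowering the threshold to 0.66 would accept simple majorities.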
Cost & Token Tracking
Track spending and token usage across all models.
Cost Tracker
```python
class CostTracker:
    # Example rates per 1M tokens; verify against current provider price lists
    PRICING = {
        "gpt-4-turbo": {"input": 10.00, "output": 30.00},
        "gpt-4o": {"input": 5.00, "output": 15.00},
        "gpt-4o-mini": {"input": 0.15, "output": 0.60},
        "claude-3-opus": {"input": 15.00, "output": 75.00},
        "claude-3-5-sonnet": {"input": 3.00, "output": 15.00},
        "claude-3-haiku": {"input": 0.25, "output": 1.25},
        "gemini-1.5-pro": {"input": 3.50, "output": 10.50},
        "gemini-1.5-flash": {"input": 0.35, "output": 1.05},
        "groq/llama-3.1-70b": {"input": 0.59, "output": 0.79},
        "groq/llama-3.1-8b": {"input": 0.05, "output": 0.08},
    }

    def calculate_cost(
        self,
        model: str,
        input_tokens: int,
        output_tokens: int
    ) -> float:
        """Calculate the cost of a single request in USD."""
        pricing = self.PRICING.get(model, {"input": 0, "output": 0})
        return (
            (input_tokens / 1_000_000) * pricing["input"] +
            (output_tokens / 1_000_000) * pricing["output"]
        )

    def track_request(self, request: ModelRequest, response: ModelResponse):
        """Record usage and cost for a completed request."""
        cost = self.calculate_cost(
            model=request.model,
            input_tokens=response.usage.prompt_tokens,
            output_tokens=response.usage.completion_tokens
        )
        self.db.insert({
            "timestamp": datetime.now(),
            "model": request.model,
            "input_tokens": response.usage.prompt_tokens,
            "output_tokens": response.usage.completion_tokens,
            "total_tokens": response.usage.total_tokens,
            "cost_usd": cost,
            "latency_ms": response.latency_ms,
            "task_type": request.task_type,
            "success": response.success
        })
```
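A quick worked example against the `gpt-4o-mini` rates above confirms the arithmetic in `calculate_cost`:

```python
# gpt-4o-mini rates from the table above, in $ per 1M tokens
rate = {"input": 0.15, "output": 0.60}
input_tokens, output_tokens = 200_000, 50_000

cost = (
    (input_tokens / 1_000_000) * rate["input"]
    + (output_tokens / 1_000_000) * rate["output"]
)
# 0.2 * 0.15 + 0.05 * 0.60 = 0.03 + 0.03 = $0.06
```

Even a heavy extraction run (200k in, 50k out) costs six cents on the budget model, which is why the router defaults cost-sensitive work there.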
Budget Enforcement
```python
class BudgetEnforcer:
    def __init__(self, daily_budget_usd: float):
        self.daily_budget = daily_budget_usd
        self.cost_tracker = CostTracker()

    def check_budget(self) -> bool:
        """Check whether today's spend is still under budget."""
        today_cost = self.cost_tracker.get_today_cost()
        return today_cost < self.daily_budget

    async def call_with_budget(self, request: ModelRequest) -> ModelResponse:
        """Make the request, downgrading to the cheapest model when over budget."""
        if not self.check_budget():
            # Fall back to the cheapest model in the pricing table
            request.model = "groq/llama-3.1-8b"
            logger.warning(f"Budget exceeded, downgrading to {request.model}")

        response = await self.call_model(request)
        self.cost_tracker.track_request(request, response)
        return response
```
Token Usage Dashboard
UI Display:
```text
┌──────────────────────────────────────────────────────────────┐
│ Token Usage & Cost (Last 24h)                                │
├──────────────────────────────────────────────────────────────┤
│ Total Tokens: 1,234,567                                      │
│ Total Cost:   $12.34                                         │
│ Requests:     456                                            │
│ Avg Latency:  1.2s                                           │
│                                                              │
│ Cost by Model                                                │
│   ████████████████████ gpt-4-turbo         $6.50 (53%)       │
│   ██████████           claude-3-5-sonnet   $3.20 (26%)       │
│   █████                gemini-1.5-flash    $1.80 (15%)       │
│   ██                   groq/llama-3.1-70b  $0.84  (6%)       │
│                                                              │
│ Token Usage by Model                                         │
│   Model                Input    Output   Total    Cost       │
│   gpt-4-turbo          123K     45K      168K     $6.50      │
│   claude-3-5-sonnet    456K     89K      545K     $3.20      │
│   gemini-1.5-flash     890K     234K     1124K    $1.80      │
│                                                              │
│ Budget: $12.34 / $20.00 (62% used)                           │
│ [█████████████░░░░░░░░]                                      │
│                                                              │
│ ⚠ Budget 80% threshold: alert enabled                        │
└──────────────────────────────────────────────────────────────┘
```
Prompt Management
Manage, version, and A/B test prompts.
Prompt Templates
```python
class PromptTemplate(BaseModel):
    template_id: str
    name: str
    template: str
    variables: List[str]
    version: int
    created_at: datetime
    performance_score: Optional[float] = None

class PromptManager:
    def get_template(self, template_id: str, version: Optional[int] = None) -> PromptTemplate:
        """Get a prompt template by ID and version (latest if version is None)."""
        if version is None:
            return self.get_latest_version(template_id)
        return self.db.get(template_id, version)

    def render(self, template_id: str, variables: Dict, version: Optional[int] = None) -> str:
        """Render a template (optionally a specific version) with variables."""
        template = self.get_template(template_id, version)
        return template.template.format(**variables)

    def create_version(self, template_id: str, new_template: str) -> int:
        """Create a new version of a template."""
        current = self.get_template(template_id)
        new_version = current.version + 1
        self.db.insert(PromptTemplate(
            template_id=template_id,
            name=current.name,
            template=new_template,
            variables=current.variables,
            version=new_version,
            created_at=datetime.now()
        ))
        return new_version
```
Example Templates
```python
# Extraction prompt
EXTRACTION_PROMPT = """
You are a web scraping agent. Extract the following fields from the HTML:

Target fields: {target_fields}

HTML content:
{html_content}

Return a JSON object with the extracted values. If a field is not found, use null.

Example output format:
{{
    "field1": "value1",
    "field2": "value2"
}}
"""

# Reasoning prompt
REASONING_PROMPT = """
You are analyzing a web page to plan your next extraction action.

Current goal: {goal}
Page URL: {url}
Available actions: {actions}
Previous attempts: {history}

Think step by step:
1. What information is most important for the goal?
2. What patterns do you see in the HTML structure?
3. Which action is most likely to succeed?
4. What could go wrong?

Provide your reasoning and then choose an action.
"""

# Register templates
prompt_manager = PromptManager()
prompt_manager.register("extraction_v1", EXTRACTION_PROMPT, ["target_fields", "html_content"])
prompt_manager.register("reasoning_v1", REASONING_PROMPT, ["goal", "url", "actions", "history"])
```
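Rendering is plain `str.format` substitution, which is why the example output in `EXTRACTION_PROMPT` doubles its braces (`{{ }}`) so they survive as literal JSON braces. A trimmed stand-in template shows the mechanics without any model call:

```python
# Trimmed stand-in for EXTRACTION_PROMPT; {{ }} escapes to literal braces
TEMPLATE = (
    "Extract these fields: {target_fields}\n"
    "HTML:\n{html_content}\n"
    'Example output format: {{"field1": "value1"}}'
)

prompt = TEMPLATE.format(
    target_fields=["name", "price"],
    html_content="<span class='price'>$49.99</span>",
)
```

Any unescaped single brace in a template raises `KeyError` or `ValueError` at render time, so templates are cheap to validate in tests before they ever reach a model.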
A/B Testing
```python
import random

import numpy as np

class PromptABTest:
    def __init__(self, template_id: str, variants: List[int]):
        self.template_id = template_id
        self.variants = variants  # Version numbers
        self.results = {v: [] for v in variants}

    def get_variant(self) -> int:
        """Select a variant at random."""
        return random.choice(self.variants)

    def track_result(self, variant: int, success: bool, score: float):
        """Track performance of a variant."""
        self.results[variant].append({"success": success, "score": score})

    def get_winner(self) -> int:
        """Determine which variant performs best on average."""
        avg_scores = {
            v: np.mean([r["score"] for r in results])
            for v, results in self.results.items()
            if results
        }
        return max(avg_scores, key=avg_scores.get)

# Run A/B test
test = PromptABTest("extraction_v1", variants=[1, 2, 3])

for episode in episodes:
    variant = test.get_variant()
    prompt = prompt_manager.render("extraction_v1", variables, version=variant)
    result = await model.generate(prompt)
    test.track_result(variant, result.success, result.score)

winner = test.get_winner()
print(f"Best variant: v{winner}")
```
Configuration
Settings Panel
```python
class APISettings(BaseModel):
    # Provider configurations
    providers: Dict[str, ProviderConfig] = {}

    # Default model
    default_model: str = "gpt-4o-mini"

    # Smart routing
    router: RouterConfig = RouterConfig()

    # Ensemble
    ensemble: EnsembleConfig = EnsembleConfig()

    # Cost control
    daily_budget_usd: float = 20.00
    alert_threshold: float = 0.8  # Alert at 80% of budget

    # Rate limiting
    max_requests_per_minute: int = 60

    # Retry policy
    max_retries: int = 3
    retry_delay_seconds: int = 2

    # Prompt management
    prompt_templates: Dict[str, str] = {}
```
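The retry policy above can be applied as a thin wrapper around any model call. This sketch uses exponential backoff on the configured base delay; that is one common interpretation, since the config only specifies a base `retry_delay_seconds`, and a fixed delay would also satisfy it. The helper name is illustrative:

```python
import asyncio

async def with_retries(call, *, max_retries: int = 3, retry_delay_seconds: float = 2):
    """Retry an async call, doubling the delay after each failure."""
    for attempt in range(max_retries):
        try:
            return await call()
        except Exception:
            if attempt == max_retries - 1:
                raise  # Out of attempts; surface the last error
            await asyncio.sleep(retry_delay_seconds * (2 ** attempt))
```

With the defaults this waits 2s, then 4s, before the third and final attempt.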
UI Example:
```text
┌────────────────────────────────────────────────────────────┐
│ API Settings                                               │
├────────────────────────────────────────────────────────────┤
│ Model Providers:                                           │
│ ┌────────────────────────────────────────────────────────┐ │
│ │ [x] OpenAI                                             │ │
│ │     API Key: [sk-proj-••••••••••••••••]  [Test]        │ │
│ │     Default: [gpt-4o-mini ▼]                           │ │
│ │                                                        │ │
│ │ [x] Anthropic                                          │ │
│ │     API Key: [sk-ant-••••••••••••••••]   [Test]        │ │
│ │     Default: [claude-3-5-sonnet ▼]                     │ │
│ │                                                        │ │
│ │ [x] Google                                             │ │
│ │     API Key: [AIza••••••••••••••••••••]  [Test]        │ │
│ │     Default: [gemini-1.5-flash ▼]                      │ │
│ │                                                        │ │
│ │ [x] Groq                                               │ │
│ │     API Key: [gsk_••••••••••••••••••••]  [Test]        │ │
│ │     Default: [llama-3.1-70b-versatile ▼]               │ │
│ │                                                        │ │
│ │ [ ] Mistral  [Configure]                               │ │
│ │ [ ] Cohere   [Configure]                               │ │
│ │ [ ] Custom   [Configure]                               │ │
│ └────────────────────────────────────────────────────────┘ │
│                                                            │
│ Smart Routing:                                             │
│   [x] Enabled                                              │
│   Strategy: [Task-Based ▼]                                 │
│   Fallback: [claude → gpt-4o-mini → gemini → groq]         │
│                                                            │
│ Model Ensemble:                                            │
│   [ ] Enabled (increases cost)                             │
│   Strategy: [Voting ▼]                                     │
│   Models: [gpt-4o-mini, gemini-flash, groq/llama ▼]        │
│                                                            │
│ Cost Control:                                              │
│   Daily Budget: [$20.00]                                   │
│   Alert at: [80%] of budget                                │
│   Current Usage: $12.34 / $20.00 (62%)                     │
│                                                            │
│ [Save Settings]  [Reset to Defaults]                       │
└────────────────────────────────────────────────────────────┘
```
API Reference
Python Client
```python
from webscraper_env import MultiModelAPI

# Initialize with config
api = MultiModelAPI(settings=APISettings())

# Simple generation
response = await api.generate(
    prompt="Extract product price from: <html>...",
    model="gpt-4o-mini"  # Optional; the router chooses if omitted
)

# With routing
response = await api.generate(
    prompt="Complex reasoning task...",
    task_type="reasoning",  # Router selects the best model
    priority="high"
)

# With ensemble
response = await api.generate_ensemble(
    prompt="Extract all prices",
    strategy="voting",
    models=["gpt-4o-mini", "gemini-1.5-flash", "groq/llama-3.1-70b"]
)

# Streaming
async for chunk in api.generate_stream(prompt="...", model="claude-3-5-sonnet"):
    print(chunk.text, end="", flush=True)
```
Next: See mcp.md for MCP server integration.