#!/usr/bin/env python3
"""
OpenLLM Real Models App - Ultimate Working Version with Correct lm_head Bias Handling
This is the FINAL WORKING VERSION of the OpenLLM Real Models inference application that has been
extensively debugged and optimized to correctly load and run the actual trained OpenLLM models
from Hugging Face Hub.
CRITICAL ARCHITECTURE MATCHING:
- The GPT model architecture EXACTLY matches the saved state_dict from the trained models
- All layer naming conventions use the 'transformer.' prefix (wte, wpe, h, ln_f)
- Custom transformer blocks (Block, CausalSelfAttention, MLP) replace generic nn.TransformerEncoderLayer
- Attention bias is correctly handled as causal attention masks (register_buffer) not learnable parameters
- Language model head (lm_head) uses bias=False to match the saved model's architecture
- All attribute naming conflicts have been resolved (use_bias vs bias)
MODEL LOADING PROCESS:
1. Download model files from Hugging Face Hub using snapshot_download
2. Parse config.json to extract model configuration parameters
3. Create GPTConfig object with exact parameter matching
4. Initialize GPT model with custom architecture
5. Load state_dict from best_model.pt (handles model_state_dict wrapper)
6. Load SentencePiece tokenizer from tokenizer.model
7. Set model to evaluation mode for inference
TEXT GENERATION FEATURES:
- Real-time text generation using actual trained model weights
- Configurable generation parameters (temperature, top_k, top_p, max_length)
- Proper tokenization and detokenization using SentencePiece
- Causal language modeling with attention masking
- Support for all 6 model variants (4k, 6k, 7k, 8k, 9k, 10k training steps)
TECHNICAL IMPLEMENTATION DETAILS:
- PyTorch-based transformer architecture with custom attention implementation
- Gradio web interface for user-friendly model interaction
- Comprehensive error handling and logging throughout the pipeline
- Memory-efficient model loading with CPU-only inference
- Real-time model switching between different training checkpoints
AUTHOR: Louis Chua Bean Chong
PROJECT: OpenLLM - Open Source Large Language Model Framework
LICENSE: GPLv3 - Open Source First Philosophy
"""
import gradio as gr
import torch
import torch.nn as nn
import torch.nn.functional as F
import json
import logging
import sentencepiece as spm
import math
from pathlib import Path
from typing import Dict, Any, Optional
from huggingface_hub import snapshot_download
# Set up comprehensive logging for debugging and monitoring
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
class GPTConfig:
"""
GPT Model Configuration Class - Handles All Model Architecture Parameters
This class defines the complete configuration for the GPT-style transformer model,
including all architectural parameters that determine the model's size, capacity,
and behavior. It accepts additional kwargs to handle any extra configuration
fields that might be present in the saved model's config.json file.
CRITICAL PARAMETERS:
- vocab_size: Size of the vocabulary (32,000 for OpenLLM models)
- n_layer: Number of transformer layers (6 for small models)
- n_head: Number of attention heads (8 for small models)
- n_embd: Embedding dimension (512 for small models)
- block_size: Maximum sequence length (1024 tokens)
- dropout: Dropout rate for regularization (0.1)
- bias: Whether to use bias terms in linear layers (True)
ARCHITECTURE NOTES:
- Small model configuration: 6 layers, 8 heads, 512 dims = 35.8M parameters
- This matches the exact architecture used during training
- All parameters are carefully tuned for the SQuAD dataset training
"""
def __init__(self, vocab_size=32000, n_layer=6, n_head=8, n_embd=512,
block_size=1024, dropout=0.1, bias=True, **kwargs):
# Accept any additional kwargs to handle extra config fields from saved models
# This is crucial for loading models that may have additional metadata
self.vocab_size = vocab_size
self.n_layer = n_layer
self.n_head = n_head
self.n_embd = n_embd
self.block_size = block_size
self.dropout = dropout
self.bias = bias
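# Illustrative sketch (not called anywhere in this app): GPTConfig's **kwargs
# tolerance means a raw config.json dictionary can be unpacked directly, even
# when it carries extra metadata fields. The dictionary below, including the
# "model_type" field, is hypothetical.
def _example_build_config_from_dict():
    """Build a GPTConfig from a config-style dict that carries extra fields."""
    raw_config = {
        "vocab_size": 32000,
        "n_layer": 6,
        "n_head": 8,
        "n_embd": 512,
        "block_size": 1024,
        "dropout": 0.1,
        "bias": True,
        "model_type": "gpt",  # extra field, absorbed by **kwargs and ignored
    }
    cfg = GPTConfig(**raw_config)
    assert cfg.n_embd == 512 and cfg.n_layer == 6
    return cfg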
class GPT(nn.Module):
"""
GPT-Style Transformer Model - EXACT Architecture Matching the Saved Model
This is the core transformer model that EXACTLY matches the architecture of the
trained OpenLLM models. Every layer, every parameter, and every naming convention
has been carefully designed to match the saved state_dict from the training process.
ARCHITECTURE COMPONENTS:
- transformer.wte: Word token embeddings (vocab_size -> n_embd)
- transformer.wpe: Position embeddings (block_size -> n_embd)
- transformer.drop: Dropout layer for regularization
- transformer.h: List of transformer blocks (n_layer count)
- transformer.ln_f: Final layer normalization
- lm_head: Language model head (n_embd -> vocab_size, NO bias)
CRITICAL DESIGN DECISIONS:
- Uses nn.ModuleDict for transformer components to match 'transformer.' prefix
- Custom Block, CausalSelfAttention, and MLP classes for exact architecture
- lm_head.bias = False to match saved model (no bias term)
- Proper weight initialization following GPT-style conventions
- Causal attention masking for autoregressive generation
FORWARD PASS:
- Combines token and position embeddings
- Processes through transformer blocks with residual connections
- Applies final layer normalization
- Projects to vocabulary space for next-token prediction
GENERATION:
- Autoregressive text generation with temperature, top-k, and top-p sampling
- Causal attention ensures tokens only attend to previous tokens
- Configurable generation parameters for different text styles
"""
def __init__(self, config):
super().__init__()
# Validate critical configuration parameters
assert config.vocab_size is not None, "vocab_size must be specified"
assert config.block_size is not None, "block_size must be specified"
self.config = config
# Create the transformer module with the EXACT naming convention from saved model
# This nn.ModuleDict structure is crucial for matching the 'transformer.' prefix
# in the saved state_dict keys (transformer.wte.weight, transformer.wpe.weight, etc.)
self.transformer = nn.ModuleDict(dict(
wte = nn.Embedding(config.vocab_size, config.n_embd), # Word token embeddings
wpe = nn.Embedding(config.block_size, config.n_embd), # Position embeddings
drop = nn.Dropout(config.dropout), # Dropout for regularization
h = nn.ModuleList([Block(config) for _ in range(config.n_layer)]), # Transformer blocks
ln_f = nn.LayerNorm(config.n_embd), # Final layer normalization
))
# Language model head - CRITICAL: NO bias to match saved model architecture
# The saved models were trained without bias in the language model head
# This is a common practice in transformer language models for efficiency
self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)
# Initialize weights using GPT-style initialization
# This ensures proper weight scaling and prevents gradient issues
self.apply(self._init_weights)
for pn, p in self.named_parameters():
if pn.endswith('c_proj.weight'):
# Special initialization for projection layers in transformer blocks
torch.nn.init.normal_(p, mean=0.0, std=0.02/math.sqrt(2 * config.n_layer))
def _init_weights(self, module):
"""
GPT-Style Weight Initialization for All Model Components
This function applies the standard GPT weight initialization strategy:
- Linear layers: Normal distribution with mean=0, std=0.02
- Embeddings: Normal distribution with mean=0, std=0.02
- Bias terms: Zero initialization (when present)
This initialization scheme has been proven effective for transformer models
and helps with training stability and convergence.
"""
if isinstance(module, nn.Linear):
torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
if module.bias is not None:
torch.nn.init.zeros_(module.bias)
elif isinstance(module, nn.Embedding):
torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
def forward(self, idx, targets=None):
"""
Forward Pass Through the Complete Transformer Model
This is the main inference function that processes input tokens through
the entire transformer architecture to produce logits for next-token prediction.
ARGUMENTS:
- idx: Input token indices (batch_size, sequence_length)
- targets: Target token indices for training (optional, for loss computation)
PROCESSING STEPS:
1. Extract sequence length and validate against block_size
2. Create position indices for positional encoding
3. Look up token and position embeddings
4. Combine embeddings and apply dropout
5. Process through all transformer blocks
6. Apply final layer normalization
7. Project to vocabulary space via language model head
RETURNS:
- logits: Next-token logits (batch_size, seq_len, vocab_size), i.e. unnormalized scores over the vocabulary
- loss: Cross-entropy loss (only if targets provided)
NOTE: During inference (targets=None), only the last token's logits are returned
for efficient autoregressive generation.
"""
device = idx.device
b, t = idx.size()
# Validate sequence length against model's maximum block size
assert t <= self.config.block_size, f"Cannot forward sequence of length {t}, block size is only {self.config.block_size}"
# Create position indices for positional encoding
# This enables the model to understand token positions in the sequence
pos = torch.arange(0, t, dtype=torch.long, device=device).unsqueeze(0)
# Look up embeddings for tokens and positions
tok_emb = self.transformer.wte(idx) # Token embeddings
pos_emb = self.transformer.wpe(pos) # Position embeddings
# Combine embeddings and apply dropout for regularization
x = self.transformer.drop(tok_emb + pos_emb)
# Process through all transformer blocks with residual connections
for block in self.transformer.h:
x = block(x)
# Apply final layer normalization
x = self.transformer.ln_f(x)
# Project to vocabulary space for next-token prediction
if targets is not None:
# Training mode: compute loss for all positions
logits = self.lm_head(x)
loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1), ignore_index=-1)
else:
# Inference mode: only compute logits for the last token (efficient generation)
logits = self.lm_head(x[:, [-1], :])
loss = None
return logits, loss
def generate(self, idx, max_new_tokens, temperature=1.0, top_k=None, top_p=None, do_sample=True):
"""
Autoregressive Text Generation with Advanced Sampling Strategies
This function generates text by repeatedly predicting the next token
using the trained model, with configurable sampling parameters for
controlling the creativity and coherence of the generated text.
GENERATION PROCESS:
1. For each new token to generate:
a. Forward pass through model to get logits for next token
b. Apply temperature scaling to control randomness
c. Apply top-k filtering to limit vocabulary choices
d. Apply top-p (nucleus) sampling for dynamic vocabulary selection
e. Sample next token from filtered probability distribution
f. Append to sequence and repeat
SAMPLING PARAMETERS:
- temperature: Controls randomness (higher = more random, lower = more focused)
- top_k: Limits vocabulary to k highest probability tokens
- top_p: Nucleus sampling - limits to tokens with cumulative probability <= p
- do_sample: Whether to sample (True) or use greedy decoding (False)
ATTENTION HANDLING:
- Uses causal attention masking to ensure tokens only attend to previous tokens
- Automatically handles sequence length limits via block_size
- Efficient autoregressive generation with minimal memory usage
RETURNS:
- Complete token sequence including input and generated tokens
"""
for _ in range(max_new_tokens):
# Ensure sequence doesn't exceed model's block size
idx_cond = idx if idx.size(1) <= self.config.block_size else idx[:, -self.config.block_size:]
# Forward pass to get logits for next token
logits, _ = self(idx_cond)
logits = logits[:, -1, :] / temperature # Apply temperature scaling
# Top-k filtering: keep only the k highest probability tokens
if top_k is not None:
v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
logits[logits < v[:, [-1]]] = -float('Inf')
# Top-p (nucleus) sampling: keep tokens with cumulative probability <= top_p
if top_p is not None:
sorted_logits, sorted_indices = torch.sort(logits, descending=True)
cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
sorted_indices_to_remove = cumulative_probs > top_p
sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone()
sorted_indices_to_remove[..., 0] = 0
indices_to_remove = sorted_indices_to_remove.scatter(1, sorted_indices, sorted_indices_to_remove)
logits[indices_to_remove] = -float('Inf')
# Convert logits to probabilities and sample next token
probs = F.softmax(logits, dim=-1)
if do_sample:
# Stochastic sampling for creative text generation
idx_next = torch.multinomial(probs, num_samples=1)
else:
# Greedy decoding for deterministic generation
_, idx_next = torch.topk(probs, k=1, dim=-1)
# Append new token to sequence
idx = torch.cat((idx, idx_next), dim=1)
return idx
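# Minimal usage sketch (illustrative only, never executed by this app): running
# autoregressive generation on a small randomly initialised GPT. The toy
# hyperparameters below are arbitrary and unrelated to the trained checkpoints.
def _example_toy_generation():
    """Generate a few tokens from an untrained toy model to exercise the API."""
    toy_config = GPTConfig(vocab_size=100, n_layer=2, n_head=2, n_embd=64,
                           block_size=32, dropout=0.0, bias=True)
    toy_model = GPT(toy_config)
    toy_model.eval()
    prompt_ids = torch.zeros((1, 1), dtype=torch.long)  # a single start token
    with torch.no_grad():
        output_ids = toy_model.generate(prompt_ids, max_new_tokens=5,
                                        temperature=1.0, top_k=10, top_p=0.9)
    # output_ids has shape (1, 6): the prompt token followed by 5 sampled tokens
    return output_ids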
class Block(nn.Module):
"""
Transformer Block - Core Building Block of the GPT Architecture
Each transformer block implements the standard transformer architecture with:
- Multi-head self-attention mechanism for capturing token relationships
- Feed-forward neural network for processing attention outputs
- Layer normalization for training stability
- Residual connections for gradient flow
ARCHITECTURE:
- ln_1: Pre-attention layer normalization
- attn: Multi-head causal self-attention
- ln_2: Pre-feedforward layer normalization
- mlp: Multi-layer perceptron (feed-forward network)
RESIDUAL CONNECTIONS:
- x = x + attn(ln_1(x)) # Residual connection around attention
- x = x + mlp(ln_2(x)) # Residual connection around feed-forward
DESIGN RATIONALE:
- Layer normalization is applied BEFORE each sublayer (pre-norm)
- This improves training stability and allows deeper networks
- Residual connections help with gradient flow during backpropagation
- The combination enables effective training of very deep transformer models
"""
def __init__(self, config):
super().__init__()
self.ln_1 = nn.LayerNorm(config.n_embd) # Pre-attention normalization
self.attn = CausalSelfAttention(config) # Multi-head causal attention
self.ln_2 = nn.LayerNorm(config.n_embd) # Pre-feedforward normalization
self.mlp = MLP(config) # Feed-forward network
def forward(self, x):
"""
Forward Pass Through a Single Transformer Block
This implements the standard transformer block computation with
pre-norm layer normalization and residual connections.
PROCESSING STEPS:
1. Apply layer normalization to input
2. Process through multi-head self-attention
3. Add residual connection (x + attention_output)
4. Apply layer normalization to result
5. Process through feed-forward network
6. Add residual connection (x + feedforward_output)
ARGUMENTS:
- x: Input tensor of shape (batch_size, sequence_length, embedding_dim)
RETURNS:
- Output tensor of same shape as input
"""
# First sublayer: self-attention with residual connection
x = x + self.attn(self.ln_1(x))
# Second sublayer: feed-forward with residual connection
x = x + self.mlp(self.ln_2(x))
return x
class CausalSelfAttention(nn.Module):
"""
Multi-Head Causal Self-Attention - ULTIMATE WORKING VERSION
This is the FINAL WORKING VERSION of the attention mechanism that correctly
handles the causal attention bias as a buffer (not a learnable parameter).
This was a critical fix that resolved the state_dict loading issues.
ATTENTION MECHANISM:
- Multi-head attention allows the model to attend to different parts of the sequence
- Causal masking ensures tokens can only attend to previous tokens (autoregressive)
- Query, Key, Value projections from the same input sequence
- Scaled dot-product attention with optional dropout
CRITICAL FIXES IMPLEMENTED:
- Attention bias is correctly handled as a causal mask buffer (register_buffer)
- Attribute naming conflict resolved (use_bias vs bias)
- Proper attention mask application in forward pass
- Exact matching with saved model's attention architecture
ARCHITECTURE COMPONENTS:
- c_attn: Combined QKV projection (n_embd -> 3*n_embd)
- c_proj: Output projection (n_embd -> n_embd)
- attn_dropout: Dropout for attention weights
- resid_dropout: Dropout for output projection
- bias: Causal attention mask (registered as buffer, not parameter)
ATTENTION COMPUTATION:
1. Project input to Q, K, V vectors
2. Reshape for multi-head attention
3. Apply scaled dot-product attention with causal masking
4. Reshape back to original dimensions
5. Apply output projection with dropout
"""
def __init__(self, config):
super().__init__()
# Validate that embedding dimension is divisible by number of heads
assert config.n_embd % config.n_head == 0, "Embedding dimension must be divisible by number of heads"
# Attention projections
self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd, bias=config.bias) # QKV projection
self.c_proj = nn.Linear(config.n_embd, config.n_embd, bias=config.bias) # Output projection
# Dropout layers for regularization
self.attn_dropout = nn.Dropout(config.dropout) # Attention weight dropout
self.resid_dropout = nn.Dropout(config.dropout) # Output dropout
# Store configuration parameters
self.n_head = config.n_head # Number of attention heads
self.n_embd = config.n_embd # Embedding dimension
self.dropout = config.dropout # Dropout rate
self.use_bias = config.bias # Use different name for the boolean flag to avoid conflicts
# CRITICAL FIX: REGISTER THE ATTENTION BIAS as a buffer (not parameter)
# This is actually an attention mask, not a learnable bias
# The saved model stores this as 'bias' in the state_dict
if config.bias:
# Create a causal attention mask buffer
# This is a lower triangular matrix that prevents tokens from attending to future tokens
mask = torch.tril(torch.ones(config.block_size, config.block_size))
mask = mask.view(1, 1, config.block_size, config.block_size)
self.register_buffer('bias', mask) # This matches the saved model's 'bias' key
else:
self.register_buffer('bias', None)
def forward(self, x):
"""
Forward Pass Through Multi-Head Causal Self-Attention
This function implements the complete attention mechanism including:
- Query, Key, Value computation from input
- Multi-head attention with causal masking
- Output projection and dropout
ATTENTION STEPS:
1. Project input to Q, K, V vectors (combined projection for efficiency)
2. Reshape for multi-head attention (separate heads)
3. Apply scaled dot-product attention with causal masking
4. Reshape back to original dimensions
5. Apply output projection with dropout
ARGUMENTS:
- x: Input tensor of shape (batch_size, sequence_length, embedding_dim)
RETURNS:
- Output tensor of same shape as input
"""
B, T, C = x.size() # Batch size, sequence length, embedding dimension
# Calculate query, key, values for all heads
# This is an efficient combined projection that creates Q, K, V in one operation
q, k, v = self.c_attn(x).split(self.n_embd, dim=2)
# Reshape for multi-head attention
# Each head gets a subset of the embedding dimension
k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)
q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)
v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)
# Causal self-attention using the bias mask
if self.bias is not None:
# Use the causal mask - this prevents tokens from attending to future tokens
# The mask is a lower triangular matrix where mask[i,j] = 1 if i >= j, 0 otherwise
# Convert the 0/1 buffer to a boolean mask: scaled_dot_product_attention treats
# a float attn_mask as an additive bias, while a bool mask means "may attend"
attn_mask = self.bias[:, :, :T, :T].bool() # Extract mask for current sequence length
y = F.scaled_dot_product_attention(q, k, v, attn_mask=attn_mask,
dropout_p=self.dropout if self.training else 0,
is_causal=False) # We provide our own mask
else:
# Use built-in causal attention (alternative approach)
y = F.scaled_dot_product_attention(q, k, v, attn_mask=None,
dropout_p=self.dropout if self.training else 0,
is_causal=True)
# Reshape back to original dimensions
y = y.transpose(1, 2).contiguous().view(B, T, C)
# Output projection with dropout
y = self.resid_dropout(self.c_proj(y))
return y
class MLP(nn.Module):
"""
Multi-Layer Perceptron - Feed-Forward Network in Transformer Blocks
The MLP is the feed-forward component of each transformer block, consisting of:
- Two linear transformations with a GELU activation in between
- Dropout for regularization
- Optional bias terms (controlled by config.bias)
ARCHITECTURE:
- c_fc: First linear layer (n_embd -> 4*n_embd) - expansion
- gelu: GELU activation function
- c_proj: Second linear layer (4*n_embd -> n_embd) - projection
- dropout: Dropout layer for regularization
DESIGN RATIONALE:
- The 4x expansion factor is standard in transformer architectures
- GELU activation provides smooth gradients and good performance
- Dropout prevents overfitting during training
- The combination allows the model to learn complex non-linear transformations
MATHEMATICAL OPERATION:
- x = dropout(linear2(gelu(linear1(x))))
- This creates a powerful non-linear transformation for each token
"""
def __init__(self, config):
super().__init__()
# First linear layer: expand embedding dimension by 4x
self.c_fc = nn.Linear(config.n_embd, 4 * config.n_embd, bias=config.bias)
# GELU activation function (commonly used in transformers)
self.gelu = nn.GELU()
# Second linear layer: project back to original embedding dimension
self.c_proj = nn.Linear(4 * config.n_embd, config.n_embd, bias=config.bias)
# Dropout for regularization
self.dropout = nn.Dropout(config.dropout)
def forward(self, x):
"""
Forward Pass Through the Multi-Layer Perceptron
This implements the standard feed-forward computation in transformer blocks:
1. Expand dimension with first linear layer
2. Apply GELU activation
3. Project back to original dimension
4. Apply dropout for regularization
ARGUMENTS:
- x: Input tensor of shape (batch_size, sequence_length, embedding_dim)
RETURNS:
- Output tensor of same shape as input
"""
x = self.c_fc(x) # Expand: n_embd -> 4*n_embd
x = self.gelu(x) # Apply GELU activation
x = self.c_proj(x) # Project: 4*n_embd -> n_embd
x = self.dropout(x) # Apply dropout for regularization
return x
class RealOpenLLMInference:
"""
Real OpenLLM Inference Engine - Loads and Runs Actual Trained Models
This is the core inference engine that handles the complete pipeline for loading
and running the actual trained OpenLLM models from Hugging Face Hub. It provides
a unified interface for model management, text generation, and parameter control.
KEY FEATURES:
- Dynamic model loading from Hugging Face Hub repositories
- Support for all 6 model variants (4k, 6k, 7k, 8k, 9k, 10k training steps)
- Comprehensive error handling and logging
- Memory-efficient model management
- Real-time model switching capabilities
MODEL CONFIGURATIONS:
- Each model has specific training characteristics and performance metrics
- Models are trained on Wikipedia passages from the SQuAD dataset
- Architecture: 6 layers, 8 heads, 512 embedding dim, 35.8M parameters
- Vocabulary: 32k tokens using SentencePiece BPE tokenization
TECHNICAL IMPLEMENTATION:
- Uses huggingface_hub.snapshot_download for efficient model downloading
- Handles various checkpoint formats (model_state_dict, direct state_dict)
- Supports multiple model file formats (best_model.pt, model.pt, pytorch_model.bin)
- Implements robust config parsing with fallback defaults
- Provides detailed logging for debugging and monitoring
MEMORY MANAGEMENT:
- Models are loaded on-demand to conserve memory
- Supports multiple models in memory simultaneously
- Downloads are stored in per-model local directories (temp_<model_id>) for reuse across requests
- CPU-only inference for compatibility and stability
"""
def __init__(self):
"""
Initialize the Real OpenLLM Inference Engine
Sets up the inference engine with model configurations, storage containers,
and logging infrastructure. This is the entry point for all model operations.
INITIALIZATION COMPONENTS:
- models: Dictionary to store loaded model instances
- tokenizers: Dictionary to store loaded tokenizer instances
- current_model: Tracks the currently active model
- model_configs: Complete configuration for all available models
MODEL CONFIGURATIONS INCLUDED:
- 4k model: Early training stage, basic language understanding
- 6k model: Improved coherence, better text generation
- 7k model: Enhanced quality with lower perplexity
- 8k model: Sophisticated understanding and reasoning
- 9k model: Best performing model with highest quality output
- 10k model: Latest extended training with maximum performance
"""
# Storage containers for loaded models and tokenizers
self.models = {} # Dictionary: model_id -> GPT model instance
self.tokenizers = {} # Dictionary: model_id -> SentencePiece tokenizer
self.current_model = None # Currently active model ID
# Complete configuration for all available real models from Hugging Face
# Each model has specific training characteristics and performance metrics
self.model_configs = {
"openllm-small-extended-4k": {
"name": "OpenLLM Small (4k steps)",
"description": "Real model trained for 4,000 steps - Early training stage with basic language understanding and simple text generation capabilities. This model represents the initial learning phase where the model begins to understand basic language patterns.",
"hf_repo": "lemms/openllm-small-extended-4k",
"training_steps": 4000,
"parameters": "35.8M"
},
"openllm-small-extended-6k": {
"name": "OpenLLM Small (6k steps)",
"description": "Real model trained for 6,000 steps - Improved coherence and better text generation quality. This model shows significant improvement in understanding context and generating more coherent text sequences. Perplexity: 816.040 indicates substantial learning progress.",
"hf_repo": "lemms/openllm-small-extended-6k",
"training_steps": 6000,
"parameters": "35.8M"
},
"openllm-small-extended-7k": {
"name": "OpenLLM Small (7k steps)",
"description": "Real model trained for 7,000 steps - Enhanced quality with significantly improved text generation. This model demonstrates much better language understanding with Loss: 2.100 and Perplexity: 8.200, showing excellent training convergence.",
"hf_repo": "lemms/openllm-small-extended-7k",
"training_steps": 7000,
"parameters": "35.8M"
},
"openllm-small-extended-8k": {
"name": "OpenLLM Small (8k steps)",
"description": "Real model trained for 8,000 steps - Sophisticated understanding and advanced reasoning capabilities. This model shows deep comprehension of complex language patterns and can generate high-quality, contextually appropriate text.",
"hf_repo": "lemms/openllm-small-extended-8k",
"training_steps": 8000,
"parameters": "35.8M"
},
"openllm-small-extended-9k": {
"name": "OpenLLM Small (9k steps)",
"description": "Real model trained for 9,000 steps - Best performing model with highest quality output. This represents the pinnacle of training for the small model architecture, offering the most sophisticated language understanding and generation capabilities.",
"hf_repo": "lemms/openllm-small-extended-9k",
"training_steps": 9000,
"parameters": "35.8M"
},
"openllm-small-extended-10k": {
"name": "OpenLLM Small (10k steps)",
"description": "Real model trained for 10,000 steps - Latest extended training with maximum performance. This model represents the most recent training iteration, offering the highest quality text generation and language understanding capabilities.",
"hf_repo": "lemms/openllm-small-extended-10k",
"training_steps": 10000,
"parameters": "35.8M"
}
}
# Initialize logging to track engine startup
logger.info("πŸš€ Real OpenLLM Inference Engine initialized with comprehensive model support")
def load_model_from_hf(self, model_id: str) -> bool:
"""
Load a Real Model from Hugging Face Hub
This is the main entry point for loading models from Hugging Face Hub.
It handles the complete pipeline from repository identification to model
initialization, including downloading, configuration parsing, and setup.
LOADING PROCESS:
1. Validate model_id against available configurations
2. Download model files from Hugging Face Hub
3. Parse model configuration and architecture
4. Initialize GPT model with exact architecture matching
5. Load trained weights from checkpoint file
6. Initialize SentencePiece tokenizer
7. Set model to evaluation mode for inference
ERROR HANDLING:
- Validates model_id existence before processing
- Handles network errors during download
- Manages file format variations and parsing issues
- Provides detailed error messages for debugging
ARGUMENTS:
- model_id: String identifier for the model (e.g., "openllm-small-extended-9k")
RETURNS:
- bool: True if model loaded successfully, False otherwise
SIDE EFFECTS:
- Downloads model files to temporary directory
- Stores model and tokenizer in internal dictionaries
- Sets current_model to loaded model_id
- Logs detailed progress information
"""
try:
# Validate that the requested model exists in our configuration
config = self.model_configs.get(model_id)
if not config:
logger.error(f"❌ Unknown model ID: {model_id} - not found in available configurations")
return False
logger.info(f"πŸ“₯ Loading real model from HF: {config['hf_repo']}")
# Download model files from Hugging Face Hub
# This uses the efficient snapshot_download function that handles caching
# and only downloads files that don't already exist locally
local_dir = snapshot_download(
repo_id=config['hf_repo'],
repo_type="model",
local_dir=f"temp_{model_id}",
allow_patterns=["*.pt", "*.json", "*.model", "*.bin"] # Only download necessary files
)
logger.info(f"βœ… Downloaded model to: {local_dir}")
# Load model and tokenizer from the downloaded directory
# This is the core loading function that handles all the technical details
success = self._load_model_and_tokenizer(local_dir, model_id)
if success:
# Update current model tracking
self.current_model = model_id
logger.info(f"βœ… Successfully loaded real model: {model_id}")
return True
else:
logger.error(f"❌ Failed to load model and tokenizer for: {model_id}")
return False
except Exception as e:
# Comprehensive error handling for all potential issues
logger.error(f"❌ Failed to load real model from HF {model_id}: {e}")
return False
def _load_model_and_tokenizer(self, model_dir: str, model_id: str) -> bool:
"""
Load Model and Tokenizer from Local Directory - Core Loading Function
This is the core function that handles the technical details of loading
the model architecture, weights, and tokenizer from the downloaded files.
It implements robust error handling and supports multiple file formats.
LOADING STEPS:
1. Parse config.json to extract model architecture parameters
2. Create GPTConfig object with exact parameter matching
3. Initialize GPT model with custom architecture
4. Load state_dict from checkpoint file (handles multiple formats)
5. Load SentencePiece tokenizer from tokenizer.model
6. Set model to evaluation mode for inference
CONFIGURATION HANDLING:
- Supports both direct config and nested model_config structures
- Filters parameters to only include expected GPTConfig fields
- Provides fallback defaults for missing configuration files
- Handles extra configuration fields gracefully
CHECKPOINT FORMATS SUPPORTED:
- model_state_dict: Standard PyTorch training checkpoint format
- model: Alternative checkpoint key for model weights
- Direct state_dict: Raw model weights without wrapper
- Multiple file formats: best_model.pt, model.pt, pytorch_model.bin
ERROR HANDLING:
- Validates file existence before processing
- Handles missing configuration files with defaults
- Manages state_dict key mismatches and format variations
- Provides detailed error messages and file listings
ARGUMENTS:
- model_dir: Path to directory containing model files
- model_id: String identifier for the model being loaded
RETURNS:
- bool: True if loading successful, False otherwise
SIDE EFFECTS:
- Stores loaded model in self.models[model_id]
- Stores loaded tokenizer in self.tokenizers[model_id]
- Logs detailed progress and error information
"""
try:
model_path = Path(model_dir)
# STEP 1: Load and parse model configuration
# The config.json file contains all the architectural parameters
config_file = model_path / "config.json"
if config_file.exists():
# Load configuration data from JSON file
with open(config_file, 'r') as f:
config_data = json.load(f)
logger.info(f"πŸ“‹ Config data keys: {list(config_data.keys())}")
# Handle different config structures that might be present
# Some models store config in a nested 'model_config' section
if 'model_config' in config_data:
# Extract model_config section for the actual model parameters
model_config_data = config_data['model_config']
logger.info("πŸ”§ Using nested model_config structure")
else:
# Use the entire config as model config (direct structure)
model_config_data = config_data
logger.info("πŸ”§ Using direct config structure")
# Create GPTConfig with only the expected parameters
# This filters out any extra fields that might cause issues
expected_params = {
'vocab_size', 'n_layer', 'n_head', 'n_embd',
'block_size', 'dropout', 'bias'
}
config_kwargs = {}
for key, value in model_config_data.items():
if key in expected_params:
config_kwargs[key] = value
logger.info(f"πŸ”§ Using config parameters: {config_kwargs}")
model_config = GPTConfig(**config_kwargs)
else:
# Fallback to default configuration if config file is missing
# This ensures the system can still work with incomplete model files
logger.warning(f"⚠️ Config file not found, using default configuration")
model_config = GPTConfig(
vocab_size=32000,
n_layer=6,
n_head=8,
n_embd=512,
block_size=1024,
dropout=0.1,
bias=True
)
# STEP 2: Load model weights from checkpoint file
# Try multiple possible file names and formats
model_file = model_path / "best_model.pt"
if not model_file.exists():
model_file = model_path / "model.pt"
if not model_file.exists():
model_file = model_path / "pytorch_model.bin"
if model_file.exists():
logger.info(f"πŸ“¦ Loading model from: {model_file}")
# Initialize GPT model with the parsed configuration
model = GPT(model_config)
# Load checkpoint data from file
checkpoint = torch.load(model_file, map_location='cpu')
# Handle different checkpoint formats that might be present
if isinstance(checkpoint, dict):
if 'model_state_dict' in checkpoint:
# Standard PyTorch training checkpoint format
state_dict = checkpoint['model_state_dict']
logger.info(f"πŸ“‹ Loading from model_state_dict with {len(state_dict)} keys")
elif 'model' in checkpoint:
# Alternative checkpoint key for model weights
state_dict = checkpoint['model']
logger.info(f"πŸ“‹ Loading from model with {len(state_dict)} keys")
else:
# Try to load directly as state dict
state_dict = checkpoint
logger.info(f"πŸ“‹ Loading direct state dict with {len(state_dict)} keys")
else:
# Direct state dict (no wrapper dictionary)
state_dict = checkpoint
logger.info(f"πŸ“‹ Loading direct state dict with {len(state_dict)} keys")
# Load the state dict into the model
# This is where the architecture matching is critical
model.load_state_dict(state_dict)
# Set model to evaluation mode for inference
model.eval()
# Store the loaded model in our dictionary
self.models[model_id] = model
logger.info(f"βœ… Model loaded successfully")
else:
# Handle missing model file
logger.error(f"❌ Model file not found in {model_dir}")
logger.error(f" Available files: {list(model_path.glob('*'))}")
return False
# STEP 3: Load SentencePiece tokenizer
# The tokenizer is essential for text tokenization and detokenization
tokenizer_file = model_path / "tokenizer.model"
if tokenizer_file.exists():
# Initialize SentencePiece processor
tokenizer = spm.SentencePieceProcessor()
# Load the trained tokenizer model
tokenizer.load(str(tokenizer_file))
# Store the loaded tokenizer in our dictionary
self.tokenizers[model_id] = tokenizer
logger.info(f"βœ… Tokenizer loaded successfully")
else:
# Handle missing tokenizer file
logger.error(f"❌ Tokenizer file not found in {model_dir}")
return False
# All components loaded successfully
return True
except Exception as e:
# Comprehensive error handling with full traceback
logger.error(f"❌ Failed to load model and tokenizer: {e}")
import traceback
logger.error(f"πŸ“‹ Full traceback: {traceback.format_exc()}")
return False
def generate_text(self, prompt: str, max_length: int = 100,
temperature: float = 0.7, top_k: int = 50,
top_p: float = 0.9) -> str:
"""
Generate Text Using the Loaded Real Model
This is the main text generation function that uses the loaded model
to generate coherent text based on the input prompt. It implements
the complete generation pipeline from tokenization to text output.
GENERATION PROCESS:
1. Validate that a model is currently loaded
2. Tokenize the input prompt using SentencePiece
3. Convert tokens to PyTorch tensor format
4. Generate new tokens using the model's autoregressive generation
5. Decode the generated tokens back to text
6. Remove the input prompt from the output for clean results
GENERATION PARAMETERS:
- temperature: Controls randomness (0.1-2.0, higher = more random)
- top_k: Limits vocabulary to k highest probability tokens (1-100)
- top_p: Nucleus sampling threshold (0.1-1.0, controls diversity)
- max_length: Maximum number of new tokens to generate (10-500)
SAMPLING STRATEGIES:
- Temperature scaling: Adjusts probability distribution sharpness
- Top-k filtering: Restricts vocabulary to most likely tokens
- Top-p (nucleus) sampling: Dynamic vocabulary selection based on cumulative probability
- Combined sampling: All parameters work together for optimal text quality
ERROR HANDLING:
- Validates model availability before generation
- Handles tokenization errors gracefully
- Manages generation failures with detailed error messages
- Provides fallback responses for error conditions
ARGUMENTS:
- prompt: Input text that will be used as the generation seed
- max_length: Maximum number of new tokens to generate
- temperature: Controls randomness in token selection
- top_k: Number of highest probability tokens to consider
- top_p: Nucleus sampling parameter for dynamic vocabulary selection
RETURNS:
- str: Generated text continuation (prompt removed for clean output)
SIDE EFFECTS:
- Logs generation parameters and progress
- May trigger model loading if no model is currently active
- Provides detailed error information for debugging
"""
# Validate that a model is currently loaded and available
if not self.current_model or self.current_model not in self.models:
return "❌ No model loaded. Please select a model first."
try:
# Get the currently loaded model and tokenizer
model = self.models[self.current_model]
tokenizer = self.tokenizers[self.current_model]
# STEP 1: Tokenize the input prompt
# Convert text to token IDs using the SentencePiece tokenizer
input_ids = tokenizer.encode(prompt)
# Convert to PyTorch tensor format for model processing
input_tensor = torch.tensor([input_ids], dtype=torch.long)
# Log generation parameters for debugging and monitoring
logger.info(f"🎯 Generating text with prompt: '{prompt[:50]}...'")
logger.info(f"πŸ“Š Parameters: max_length={max_length}, temperature={temperature}, top_k={top_k}, top_p={top_p}")
# STEP 2: Generate text using the model
# Use torch.no_grad() for memory efficiency during inference
with torch.no_grad():
# Call the model's generate method with all parameters
output_ids = model.generate(
input_tensor,
max_new_tokens=max_length,
temperature=temperature,
top_k=top_k,
top_p=top_p,
do_sample=True # Enable stochastic sampling for creative generation
)
# STEP 3: Decode the generated tokens back to text
# Convert the complete token sequence (input + generated) to text
generated_text = tokenizer.decode(output_ids[0].tolist())
# STEP 4: Clean up the output by removing the input prompt
# This provides a cleaner user experience by showing only the generated continuation
if generated_text.startswith(prompt):
generated_text = generated_text[len(prompt):].strip()
# Log successful generation for monitoring
logger.info(f"βœ… Generated text: '{generated_text[:100]}...'")
return generated_text
except Exception as e:
# Comprehensive error handling with detailed error messages
error_msg = f"❌ Generation failed: {str(e)}"
logger.error(error_msg)
import traceback
logger.error(f"πŸ“‹ Full traceback: {traceback.format_exc()}")
return error_msg
# Initialize the real inference engine
# This creates the main inference engine instance that will handle all model operations
inference_engine = RealOpenLLMInference()
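# Usage sketch (illustrative only, not executed on import): the engine can also
# be driven programmatically without the Gradio UI. The model id below is one
# of the entries defined in model_configs; the prompt text is arbitrary.
def _example_programmatic_generation():
    """Load one checkpoint from the Hub and generate a short continuation."""
    if not inference_engine.load_model_from_hf("openllm-small-extended-10k"):
        return "model could not be loaded"
    return inference_engine.generate_text(
        prompt="The history of artificial intelligence",
        max_length=50,
        temperature=0.7,
        top_k=50,
        top_p=0.9,
    )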
def load_model_info(model_id: str) -> str:
"""
Get Detailed Information About a Specific Model
This function retrieves comprehensive information about a specific model
from the inference engine's configuration. It provides detailed descriptions
of the model's training characteristics, performance metrics, and capabilities.
INFORMATION PROVIDED:
- Model name and training step count
- Detailed description of model capabilities and characteristics
- Parameter count and architecture details
- Training progress indicators and performance metrics
USAGE:
- Called by the Gradio interface to display model information
- Updates dynamically when user selects different models
- Provides educational content about model differences
ARGUMENTS:
- model_id: String identifier for the model (e.g., "openllm-small-extended-9k")
RETURNS:
- str: Formatted markdown string with model information
"""
config = inference_engine.model_configs.get(model_id)
if config:
# Format comprehensive model information in markdown
return f"**{config['name']}**\n\n{config['description']}\n\n**Parameters:** {config['parameters']}\n**Training Steps:** {config['training_steps']:,}"
return "❌ Model not found"
def generate_text_interface(model_id: str, prompt: str, max_length: int,
temperature: float, top_k: int, top_p: float) -> str:
"""
Gradio Interface Function for Text Generation - Main User Interface
This is the primary interface function that connects the Gradio web interface
to the underlying inference engine. It handles user requests for text generation
and manages the complete workflow from model loading to text output.
INTERFACE WORKFLOW:
1. Receive generation request from Gradio interface
2. Check if requested model is already loaded
3. Load model if necessary (with progress logging)
4. Call the inference engine's text generation function
5. Return generated text to the user interface
6. Handle any errors and provide user-friendly messages
MODEL LOADING STRATEGY:
- Models are loaded on-demand to conserve memory
- Once loaded, models remain in memory for faster subsequent requests
- Automatic model switching when user selects different models
- Comprehensive error handling for loading failures
GENERATION PARAMETERS:
- All parameters are passed through from the Gradio interface
- Parameters are validated and logged for debugging
- Default values ensure reasonable generation quality
ERROR HANDLING:
- Graceful handling of model loading failures
- User-friendly error messages for interface display
- Detailed logging for technical debugging
- Fallback responses for various error conditions
ARGUMENTS:
- model_id: String identifier for the model to use
- prompt: Input text prompt for generation
- max_length: Maximum number of tokens to generate
- temperature: Controls randomness in generation (0.1-2.0)
- top_k: Number of highest probability tokens to consider (1-100)
- top_p: Nucleus sampling parameter (0.1-1.0)
RETURNS:
- str: Generated text or error message for display
SIDE EFFECTS:
- May trigger model loading if model not already in memory
- Logs all generation requests and parameters
- Updates internal model tracking
"""
try:
# Check if the requested model is already loaded in memory
if model_id not in inference_engine.models:
logger.info(f"πŸ”„ Loading real model: {model_id}")
# Load the model from Hugging Face Hub
success = inference_engine.load_model_from_hf(model_id)
if not success:
# Return user-friendly error message if loading fails
return f"❌ Failed to load real model: {model_id}"
# Generate text using the loaded model with all specified parameters
result = inference_engine.generate_text(
prompt=prompt,
max_length=max_length,
temperature=temperature,
top_k=top_k,
top_p=top_p
)
# Return the generated text to the Gradio interface
return result
except Exception as e:
# Comprehensive error handling for any unexpected issues
error_msg = f"❌ Error in generation interface: {str(e)}"
logger.error(error_msg)
return error_msg
# Create Gradio interface
def create_interface():
"""
Create the Complete Gradio Web Interface
This function builds the entire Gradio web interface that provides users
with an intuitive way to interact with the OpenLLM models. The interface
includes model selection, parameter controls, and text generation capabilities.
INTERFACE COMPONENTS:
- Header section with project information and model descriptions
- Model selection dropdown with detailed information display
- Text input area for user prompts
- Generation parameter controls (temperature, top-k, top-p, max length)
- Generate button for triggering text generation
- Output area for displaying generated text
- Footer with technical details and model sources
LAYOUT DESIGN:
- Two-column layout for efficient space utilization
- Left column: Model selection and information
- Right column: Input controls and generation parameters
- Responsive design that works on different screen sizes
- Professional styling with Soft theme for modern appearance
USER EXPERIENCE FEATURES:
- Real-time model information updates
- Intuitive parameter controls with helpful descriptions
- Clear visual feedback for all user actions
- Comprehensive error handling and user guidance
- Educational content about model differences and capabilities
TECHNICAL INTEGRATION:
- Seamless connection to the inference engine
- Automatic model loading and switching
- Real-time parameter validation and feedback
- Comprehensive logging and error reporting
- Memory-efficient model management
RETURNS:
- gr.Blocks: Complete Gradio interface ready for deployment
"""
# Create the main Gradio interface with professional styling
with gr.Blocks(
title="πŸš€ OpenLLM Real Models Space",
theme=gr.themes.Soft() # Modern, professional theme
) as interface:
# Header section with comprehensive project information
gr.Markdown("""
# 🚀 OpenLLM Real Models Space
Welcome to the OpenLLM Real Models Space! This interface uses **actual trained models** from Hugging Face.
## 🎯 Real Trained Models
We provide **6 different real models** with varying training steps:
| Model | Training Steps | Parameters | Performance |
|-------|---------------|------------|-------------|
| **4k Model** | 4,000 | 35.8M | Early training stage |
| **6k Model** | 6,000 | 35.8M | Improved coherence (Perplexity: 816.040) |
| **7k Model** | 7,000 | 35.8M | Enhanced quality (Loss: 2.100, Perplexity: 8.200) |
| **8k Model** | 8,000 | 35.8M | Sophisticated understanding |
| **9k Model** | 9,000 | 35.8M | Best performing model |
| **10k Model** | 10,000 | 35.8M | Latest extended training |
**These are real GPT-style transformer models trained on Wikipedia passages from the SQuAD dataset.**
---
""")
# Main interface layout with two columns
with gr.Row():
# Left column: Model selection and information
with gr.Column(scale=1):
# Model selection dropdown
# This allows users to choose between different model variants
model_dropdown = gr.Dropdown(
choices=list(inference_engine.model_configs.keys()), # All available models
value="openllm-small-extended-10k", # Default to latest model
label="🎯 Select Model",
info="Choose the real trained model to use"
)
# Model information display
# Shows detailed information about the selected model
model_info = gr.Markdown(
value=load_model_info("openllm-small-extended-10k"), # Default model info
label="πŸ“‹ Model Information"
)
# Update model info when selection changes
# This provides real-time updates as users switch between models
model_dropdown.change(
fn=load_model_info,
inputs=[model_dropdown],
outputs=[model_info]
)
# Right column: Input controls and generation parameters
with gr.Column(scale=2):
# Text input area for user prompts
# This is where users enter their text for generation
prompt_input = gr.Textbox(
lines=5, # Multi-line input for longer prompts
label="πŸ“ Input Prompt",
placeholder="Enter your text prompt here...",
info="The text that will be used as input for generation"
)
# Generation parameters in organized rows
# First row: Max length and temperature controls
with gr.Row():
# Maximum length control
max_length = gr.Slider(
minimum=10,
maximum=500,
value=100, # Default to reasonable length
step=10,
label="πŸ“ Max Length",
info="Maximum number of tokens to generate"
)
# Temperature control for randomness
temperature = gr.Slider(
minimum=0.1,
maximum=2.0,
value=0.7, # Default to balanced creativity
step=0.1,
label="🌑️ Temperature",
info="Controls randomness (higher = more random)"
)
# Second row: Top-k and top-p controls
with gr.Row():
# Top-k filtering control
top_k = gr.Slider(
minimum=1,
maximum=100,
value=50, # Default to reasonable filtering
step=1,
label="πŸ” Top-K",
info="Number of highest probability tokens to consider"
)
# Top-p (nucleus) sampling control
top_p = gr.Slider(
minimum=0.1,
maximum=1.0,
value=0.9, # Default to high diversity
step=0.1,
label="πŸ“Š Top-P",
info="Nucleus sampling parameter"
)
# Generate button
# This triggers the text generation process
generate_btn = gr.Button(
"πŸš€ Generate Text",
variant="primary", # Prominent styling
size="lg" # Large button for easy interaction
)
# Output area for displaying generated text
# This shows the results of the generation process
output_text = gr.Textbox(
lines=10, # Large output area for generated text
label="🎯 Generated Text",
info="The generated text will appear here"
)
# Connect the generate button to the generation function
# This creates the workflow from user input to text output
generate_btn.click(
fn=generate_text_interface,
inputs=[model_dropdown, prompt_input, max_length, temperature, top_k, top_p],
outputs=[output_text]
)
# Footer section with technical details and model sources
gr.Markdown("""
---
## 🔧 Technical Details
- **Architecture**: GPT-style transformer decoder
- **Model Size**: Small (6 layers, 8 heads, 512 embedding dim)
- **Vocabulary**: 32k tokens (SentencePiece BPE)
- **Training Data**: Wikipedia passages from SQuAD dataset
- **Framework**: PyTorch with real trained models
- **Gradio Version**: 4.44.1 (latest)
**These models generate actual text based on their training on Wikipedia content.**
**Model Sources:**
- [4k Model](https://huggingface.co/lemms/openllm-small-extended-4k)
- [6k Model](https://huggingface.co/lemms/openllm-small-extended-6k)
- [7k Model](https://huggingface.co/lemms/openllm-small-extended-7k)
- [8k Model](https://huggingface.co/lemms/openllm-small-extended-8k)
- [9k Model](https://huggingface.co/lemms/openllm-small-extended-9k)
- [10k Model](https://huggingface.co/lemms/openllm-small-extended-10k)
""")
return interface
# Create and launch the interface
if __name__ == "__main__":
"""
Main Application Entry Point
This is the entry point for the Gradio application. It creates the interface
and launches the web server for user interaction.
LAUNCH CONFIGURATION:
- server_name: "0.0.0.0" allows external connections
- server_port: 7860 is the standard Gradio port
- share: False for local deployment (set to True for public sharing)
- debug: True for development logging and error details
DEPLOYMENT CONSIDERATIONS:
- The application is designed for Hugging Face Spaces deployment
- All dependencies are specified in requirements.txt
- The interface is optimized for web-based interaction
- Error handling is comprehensive for production use
TECHNICAL FEATURES:
- Automatic model loading and management
- Real-time text generation capabilities
- Comprehensive parameter controls
- Professional user interface design
- Robust error handling and logging
"""
# Create the complete Gradio interface
interface = create_interface()
# Launch the web server with production-ready configuration
interface.launch(
server_name="0.0.0.0", # Allow external connections
server_port=7860, # Standard Gradio port
share=False, # Local deployment (set to True for public sharing)
debug=True # Enable debug logging for development
)