Tonic committed on
Commit
109031b
·
1 Parent(s): d784738

adds flash attention

README.md CHANGED
@@ -13,26 +13,26 @@ short_description: Smollm3 for French Understanding
13
 
14
  # 🤖 Petite Elle L'Aime 3 - Chat Interface
15
 
16
- A complete Gradio application for the [Petite Elle L'Aime 3](https://huggingface.co/Tonic/petite-elle-L-aime-3-sft) model, featuring the int4 quantized version for efficient CPU deployment.
17
 
18
  ## 🚀 Features
19
 
20
  - **Multilingual Support**: English, French, Italian, Portuguese, Chinese, Arabic
21
- - **Int4 Quantization**: Optimized for CPU deployment with ~50% memory reduction
22
  - **Interactive Chat Interface**: Real-time conversation with the model
23
  - **Customizable System Prompt**: Define the assistant's personality and behavior
24
  - **Thinking Mode**: Enable reasoning mode with thinking tags
25
  - **Responsive Design**: Modern UI following the reference layout
26
  - **Chat Template Integration**: Proper Jinja template formatting
27
- - **Automatic Model Download**: Downloads int4 model at build time
28
 
29
  ## 📋 Model Information
30
 
31
  - **Base Model**: SmolLM3-3B
32
  - **Parameters**: ~3B
33
  - **Context Length**: 128k
34
- - **Quantization**: int4 (CPU optimized)
35
- - **Memory Reduction**: ~50%
36
  - **Languages**: English, French, Italian, Portuguese, Chinese, Arabic
37
 
38
  ## 🛠️ Installation
@@ -94,8 +94,8 @@ The interface follows the reference layout with:
94
  ### Model Loading Strategy
95
  The application uses a smart loading strategy:
96
 
97
- 1. **Local Check**: First checks if int4 model files exist locally
98
- 2. **Local Loading**: If available, loads from `./int4` folder
99
  3. **Fallback Download**: If not available, downloads from Hugging Face
100
  4. **Tokenizer**: Always uses main repo for chat template and configuration
101
 
@@ -103,8 +103,8 @@ The application uses a smart loading strategy:
103
  For Hugging Face Spaces deployment:
104
 
105
  1. **Build Script**: `build.py` runs during Space build
106
- 2. **Model Download**: `download_model.py` downloads int4 model files
107
- 3. **Local Storage**: Model files stored in `./int4` directory
108
  4. **Fast Loading**: Subsequent runs use local files
109
 
110
  ### Chat Template Integration
@@ -115,9 +115,10 @@ The application uses the custom chat template from the model, which supports:
115
  - Proper conversation flow management
116
 
117
  ### Memory Optimization
118
- - Uses int4 quantization for reduced memory footprint
119
  - Automatic device detection (CUDA/CPU)
120
  - Efficient tokenization and generation
 
121
 
122
  ## πŸ“ Example Usage
123
 
@@ -156,8 +157,9 @@ The application uses the custom chat template from the model, which supports:
156
  - Check the console for detailed error messages
157
 
158
  3. **Performance Issues**:
159
- - The int4 model is optimized for CPU but may be slower than GPU versions
160
- - Consider using a machine with more RAM for better performance
 
161
 
162
  4. **System Prompt Issues**:
163
  - Ensure the system prompt is not too long (max 1000 characters)
 
13
 
14
  # 🤖 Petite Elle L'Aime 3 - Chat Interface
15
 
16
+ A complete Gradio application for the [Petite Elle L'Aime 3](https://huggingface.co/Tonic/petite-elle-L-aime-3-sft) model, featuring the full fine-tuned version for maximum performance and quality.
17
 
18
  ## 🚀 Features
19
 
20
  - **Multilingual Support**: English, French, Italian, Portuguese, Chinese, Arabic
21
+ - **Full Fine-Tuned Model**: Maximum performance and quality with full precision
22
  - **Interactive Chat Interface**: Real-time conversation with the model
23
  - **Customizable System Prompt**: Define the assistant's personality and behavior
24
  - **Thinking Mode**: Enable reasoning mode with thinking tags
25
  - **Responsive Design**: Modern UI following the reference layout
26
  - **Chat Template Integration**: Proper Jinja template formatting
27
+ - **Automatic Model Download**: Downloads full model at build time
28
 
29
  ## 📋 Model Information
30
 
31
  - **Base Model**: SmolLM3-3B
32
  - **Parameters**: ~3B
33
  - **Context Length**: 128k
34
+ - **Precision**: Full fine-tuned model (float16/float32)
35
+ - **Performance**: Maximum quality and accuracy
36
  - **Languages**: English, French, Italian, Portuguese, Chinese, Arabic
37
 
38
  ## 🛠️ Installation
 
94
  ### Model Loading Strategy
95
  The application uses a smart loading strategy:
96
 
97
+ 1. **Local Check**: First checks if full model files exist locally
98
+ 2. **Local Loading**: If available, loads from `./model` folder
99
  3. **Fallback Download**: If not available, downloads from Hugging Face
100
  4. **Tokenizer**: Always uses main repo for chat template and configuration
101
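For orientation, the strategy above boils down to something like the following sketch (the real logic lives in `app.py`; `load_model_smart` is an illustrative name, not the actual function):

```python
import os
from transformers import AutoModelForCausalLM, AutoTokenizer

MAIN_MODEL_ID = "Tonic/petite-elle-L-aime-3-sft"
LOCAL_MODEL_PATH = "./model"  # populated by download_model.py at build time

def load_model_smart():
    # Steps 1-2: prefer the locally downloaded copy if it exists
    source = LOCAL_MODEL_PATH if os.path.isdir(LOCAL_MODEL_PATH) else MAIN_MODEL_ID
    # Step 3: from_pretrained falls back to pulling from the Hugging Face Hub
    model = AutoModelForCausalLM.from_pretrained(source, trust_remote_code=True)
    # Step 4: the tokenizer (and chat template) always come from the main repo
    tokenizer = AutoTokenizer.from_pretrained(MAIN_MODEL_ID)
    return model, tokenizer
```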
 
 
103
  For Hugging Face Spaces deployment:
104
 
105
  1. **Build Script**: `build.py` runs during Space build
106
+ 2. **Model Download**: `download_model.py` downloads full model files
107
+ 3. **Local Storage**: Model files stored in `./model` directory
108
  4. **Fast Loading**: Subsequent runs use local files
109
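A minimal sketch of the build step, assuming `build.py` does little more than call the helper defined in `download_model.py` (the actual script may do more):

```python
# build.py (sketch) -- executed once during the Space build
from download_model import download_model

if __name__ == "__main__":
    # Populate ./model so subsequent app starts load from local files
    download_model()
```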
 
110
  ### Chat Template Integration
 
115
  - Proper conversation flow management
116
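As an illustration of the template in use, a hedged sketch with the standard `apply_chat_template` API; the `enable_thinking` flag mirrors the Thinking Mode toggle described above:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Tonic/petite-elle-L-aime-3-sft")
messages = [
    {"role": "system", "content": "Tu es TonicIA, un assistant francophone rigoureux et bienveillant."},
    {"role": "user", "content": "Bonjour, comment allez-vous?"},
]
# The custom Jinja template formats the turns; enable_thinking toggles thinking tags
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,
)
print(prompt)
```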
 
117
  ### Memory Optimization
118
+ - Uses full fine-tuned model for maximum quality
119
  - Automatic device detection (CUDA/CPU)
120
  - Efficient tokenization and generation
121
+ - Float16 precision on GPU for optimal performance
122
 
123
  ## πŸ“ Example Usage
124
 
 
157
  - Check the console for detailed error messages
158
 
159
  3. **Performance Issues**:
160
+ - The full model provides maximum quality but requires more memory
161
+ - GPU acceleration recommended for optimal performance
162
+ - Consider reducing model parameters if memory is limited
163
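If memory is the bottleneck, one option not shown in this commit is to cap per-device usage and let `accelerate` offload the remainder to CPU; the budgets below are placeholders to adjust for the actual hardware:

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Tonic/petite-elle-L-aime-3-sft",
    torch_dtype=torch.float16,
    device_map="auto",
    max_memory={0: "10GiB", "cpu": "24GiB"},  # placeholder limits
    trust_remote_code=True,
    low_cpu_mem_usage=True,
)
```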
 
164
  4. **System Prompt Issues**:
165
  - Ensure the system prompt is not too long (max 1000 characters)
app.py CHANGED
@@ -11,8 +11,11 @@ import sys
11
  import requests
12
  import accelerate
13
 
14
- # Set torch to use float32 for better compatibility with quantized models
15
- torch.set_default_dtype(torch.float32)
 
 
 
16
  logging.basicConfig(level=logging.INFO)
17
  logger = logging.getLogger(__name__)
18
  MAIN_MODEL_ID = "Tonic/petite-elle-L-aime-3-sft"
@@ -21,11 +24,11 @@ model = None
21
  tokenizer = None
22
  DEFAULT_SYSTEM_PROMPT = "Tu es TonicIA, un assistant francophone rigoureux et bienveillant."
23
  title = "# 🤖 Petite Elle L'Aime 3 - Chat Interface"
24
- description = "A fine-tuned version of SmolLM3-3B optimized for French conversations. This is the pre-quantized int4 version for efficient deployment."
25
  presentation1 = """
26
  ### 🎯 Features
27
  - **Multilingual Support**: English, French, Italian, Portuguese, Chinese, Arabic
28
- - **Pre-Quantized Int4**: Optimized for deployment with memory reduction
29
  - **Interactive Chat Interface**: Real-time conversation with the model
30
  - **Customizable System Prompt**: Define the assistant's personality and behavior
31
  - **Thinking Mode**: Enable reasoning mode with thinking tags
@@ -33,7 +36,7 @@ presentation1 = """
33
  """
34
  presentation2 = """### 🎯 Fonctionnalités
35
  * **Support multilingue** : Anglais, Français, Italien, Portugais, Chinois, Arabe
36
- * **Pré-quantifié Int4** : Optimisé pour un déploiement avec réduction de mémoire
37
  * **Interface de chat interactive** : Conversation en temps réel avec le modèle
38
  * **Invite système personnalisable** : Définissez la personnalité et le comportement de l'assistant
39
  * **Mode Réflexion** : Activez le mode raisonnement avec des balises de réflexion
@@ -97,34 +100,34 @@ def download_chat_template():
97
 
98
 
99
  def load_model():
100
- """Load the pre-quantized model and tokenizer"""
101
  global model, tokenizer
102
 
103
  try:
104
  logger.info(f"Loading tokenizer from {MAIN_MODEL_ID}")
105
- tokenizer = AutoTokenizer.from_pretrained(MAIN_MODEL_ID, subfolder="int4")
106
  chat_template = download_chat_template()
107
  if chat_template:
108
  tokenizer.chat_template = chat_template
109
  logger.info("Chat template downloaded and set successfully")
110
 
111
- logger.info(f"Loading pre-quantized int4 model from {MAIN_MODEL_ID}/int4")
112
 
113
- # Load the pre-quantized model without additional quantization config
114
  model_kwargs = {
115
  "device_map": "auto" if DEVICE == "cuda" else "cpu",
116
- "torch_dtype": torch.float32, # Use float32 for compatibility
117
  "trust_remote_code": True,
118
  "low_cpu_mem_usage": True,
119
  }
120
 
121
  logger.info(f"Model loading parameters: {model_kwargs}")
122
- model = AutoModelForCausalLM.from_pretrained(MAIN_MODEL_ID, subfolder="int4", **model_kwargs)
123
 
124
  if tokenizer.pad_token_id is None:
125
  tokenizer.pad_token_id = tokenizer.eos_token_id
126
 
127
- logger.info("Pre-quantized model loaded successfully")
128
  return True
129
 
130
  except Exception as e:
@@ -173,9 +176,9 @@ def create_prompt(system_message, user_message, enable_thinking=True, tools=None
173
  logger.error(f"Error creating prompt: {e}")
174
  return ""
175
 
176
- @spaces.GPU(duration=94)
177
  def generate_response(message, history, system_message, max_tokens, temperature, top_p, repetition_penalty, do_sample, enable_thinking=True, tools=None, use_xml_tools=True):
178
- """Generate response using the pre-quantized model with SmolLM3 features"""
179
  global model, tokenizer
180
 
181
  if model is None or tokenizer is None:
@@ -212,7 +215,7 @@ def generate_response(message, history, system_message, max_tokens, temperature,
212
  attention_mask=inputs['attention_mask'],
213
  pad_token_id=tokenizer.eos_token_id,
214
  eos_token_id=tokenizer.eos_token_id,
215
- cache_implementation="static" # Important for torchao quantized models
216
  )
217
  response = tokenizer.decode(output_ids[0], skip_special_tokens=True)
218
  assistant_response = response[len(full_prompt):].strip()
@@ -261,7 +264,7 @@ def bot(history, system_prompt, max_length, temperature, top_p, repetition_penal
261
  return history
262
 
263
  # Load model on startup
264
- logger.info("Starting model loading process with torchao quantization...")
265
  load_model()
266
 
267
  # Create Gradio interface
@@ -321,7 +324,7 @@ with gr.Blocks() as demo:
321
  step=0.01
322
  )
323
  repetition_penalty = gr.Slider(
324
- label="🔄 Répétition Penalty",
325
  minimum=1.0,
326
  maximum=2.0,
327
  value=1.1,
 
11
  import requests
12
  import accelerate
13
 
14
+ # Set torch to use float16 on GPU for better performance, float32 on CPU for compatibility
15
+ if torch.cuda.is_available():
16
+ torch.set_default_dtype(torch.float16)
17
+ else:
18
+ torch.set_default_dtype(torch.float32)
19
  logging.basicConfig(level=logging.INFO)
20
  logger = logging.getLogger(__name__)
21
  MAIN_MODEL_ID = "Tonic/petite-elle-L-aime-3-sft"
 
24
  tokenizer = None
25
  DEFAULT_SYSTEM_PROMPT = "Tu es TonicIA, un assistant francophone rigoureux et bienveillant."
26
  title = "# 🤖 Petite Elle L'Aime 3 - Chat Interface"
27
+ description = "A fine-tuned version of SmolLM3-3B optimized for French conversations. This is the full fine-tuned model for maximum performance and quality."
28
  presentation1 = """
29
  ### 🎯 Features
30
  - **Multilingual Support**: English, French, Italian, Portuguese, Chinese, Arabic
31
+ - **Full Fine-Tuned Model**: Maximum performance and quality with full precision
32
  - **Interactive Chat Interface**: Real-time conversation with the model
33
  - **Customizable System Prompt**: Define the assistant's personality and behavior
34
  - **Thinking Mode**: Enable reasoning mode with thinking tags
 
36
  """
37
  presentation2 = """### 🎯 Fonctionnalités
38
  * **Support multilingue** : Anglais, Français, Italien, Portugais, Chinois, Arabe
39
+ * **Modèle complet fine-tuné** : Performance et qualité maximales avec précision complète
40
  * **Interface de chat interactive** : Conversation en temps réel avec le modèle
41
  * **Invite système personnalisable** : Définissez la personnalité et le comportement de l'assistant
42
  * **Mode Réflexion** : Activez le mode raisonnement avec des balises de réflexion
 
100
 
101
 
102
  def load_model():
103
+ """Load the full fine-tuned model and tokenizer"""
104
  global model, tokenizer
105
 
106
  try:
107
  logger.info(f"Loading tokenizer from {MAIN_MODEL_ID}")
108
+ tokenizer = AutoTokenizer.from_pretrained(MAIN_MODEL_ID)
109
  chat_template = download_chat_template()
110
  if chat_template:
111
  tokenizer.chat_template = chat_template
112
  logger.info("Chat template downloaded and set successfully")
113
 
114
+ logger.info(f"Loading full fine-tuned model from {MAIN_MODEL_ID}")
115
 
116
+ # Load the full fine-tuned model with optimized settings
117
  model_kwargs = {
118
  "device_map": "auto" if DEVICE == "cuda" else "cpu",
119
+ "torch_dtype": torch.float16 if DEVICE == "cuda" else torch.float32, # Use float16 on GPU, float32 on CPU
120
  "trust_remote_code": True,
121
  "low_cpu_mem_usage": True,
122
  }
123
 
124
  logger.info(f"Model loading parameters: {model_kwargs}")
125
+ model = AutoModelForCausalLM.from_pretrained(MAIN_MODEL_ID, **model_kwargs)
126
 
127
  if tokenizer.pad_token_id is None:
128
  tokenizer.pad_token_id = tokenizer.eos_token_id
129
 
130
+ logger.info("Full fine-tuned model loaded successfully")
131
  return True
132
 
133
  except Exception as e:
 
176
  logger.error(f"Error creating prompt: {e}")
177
  return ""
178
 
179
+ @spaces.GPU()
180
  def generate_response(message, history, system_message, max_tokens, temperature, top_p, repetition_penalty, do_sample, enable_thinking=True, tools=None, use_xml_tools=True):
181
+ """Generate response using the full fine-tuned model with SmolLM3 features"""
182
  global model, tokenizer
183
 
184
  if model is None or tokenizer is None:
 
215
  attention_mask=inputs['attention_mask'],
216
  pad_token_id=tokenizer.eos_token_id,
217
  eos_token_id=tokenizer.eos_token_id,
218
+ cache_implementation="static"
219
  )
220
  response = tokenizer.decode(output_ids[0], skip_special_tokens=True)
221
  assistant_response = response[len(full_prompt):].strip()
 
264
  return history
265
 
266
  # Load model on startup
267
+ logger.info("Starting model loading process with full fine-tuned model...")
268
  load_model()
269
 
270
  # Create Gradio interface
 
324
  step=0.01
325
  )
326
  repetition_penalty = gr.Slider(
327
+ label="🔄 Pénalité de Répétition",
328
  minimum=1.0,
329
  maximum=2.0,
330
  value=1.1,
download_model.py CHANGED
@@ -1,6 +1,6 @@
1
  #!/usr/bin/env python3
2
  """
3
- Helper script to download the int4 model files at build time for Hugging Face Spaces
4
  """
5
 
6
  import os
@@ -15,12 +15,12 @@ logger = logging.getLogger(__name__)
15
 
16
  # Model configuration
17
  MAIN_MODEL_ID = "Tonic/petite-elle-L-aime-3-sft"
18
- LOCAL_MODEL_PATH = "./int4"
19
 
20
  def download_model():
21
- """Download the int4 model files to local directory"""
22
  try:
23
- logger.info(f"Downloading int4 model from {MAIN_MODEL_ID}/int4")
24
 
25
  # Create local directory if it doesn't exist
26
  os.makedirs(LOCAL_MODEL_PATH, exist_ok=True)
@@ -31,9 +31,9 @@ def download_model():
31
  # List all files in the repository
32
  all_files = list_repo_files(MAIN_MODEL_ID)
33
 
34
- # Filter files that are in the int4 subfolder
35
- int4_files = [f for f in all_files if f.startswith("int4/")]
36
- logger.info(f"Found {len(int4_files)} files in int4 subfolder")
37
 
38
  # Download each required file
39
  required_files = [
@@ -42,24 +42,24 @@ def download_model():
42
  "tokenizer.json",
43
  "tokenizer_config.json",
44
  "special_tokens_map.json",
45
- "generation_config.json"
 
46
  ]
47
 
48
  downloaded_count = 0
49
  for file_name in required_files:
50
- int4_file_path = f"int4/{file_name}"
51
- if int4_file_path in all_files:
52
  logger.info(f"Downloading {file_name}...")
53
  hf_hub_download(
54
  repo_id=MAIN_MODEL_ID,
55
- filename=int4_file_path,
56
  local_dir=LOCAL_MODEL_PATH,
57
  local_dir_use_symlinks=False
58
  )
59
  logger.info(f"Downloaded {file_name}")
60
  downloaded_count += 1
61
  else:
62
- logger.warning(f"File {file_name} not found in int4 subfolder")
63
 
64
  logger.info(f"Downloaded {downloaded_count} out of {len(required_files)} required files")
65
  logger.info(f"Model downloaded successfully to {LOCAL_MODEL_PATH}")
 
1
  #!/usr/bin/env python3
2
  """
3
+ Helper script to download the full fine-tuned model files at build time for Hugging Face Spaces
4
  """
5
 
6
  import os
 
15
 
16
  # Model configuration
17
  MAIN_MODEL_ID = "Tonic/petite-elle-L-aime-3-sft"
18
+ LOCAL_MODEL_PATH = "./model"
19
 
20
  def download_model():
21
+ """Download the full fine-tuned model files to local directory"""
22
  try:
23
+ logger.info(f"Downloading full fine-tuned model from {MAIN_MODEL_ID}")
24
 
25
  # Create local directory if it doesn't exist
26
  os.makedirs(LOCAL_MODEL_PATH, exist_ok=True)
 
31
  # List all files in the repository
32
  all_files = list_repo_files(MAIN_MODEL_ID)
33
 
34
+ # Skip the int4 subfolder; keep everything else in the repository
35
+ main_files = [f for f in all_files if not f.startswith("int4/")]
36
+ logger.info(f"Found {len(main_files)} files in main repository")
37
 
38
  # Download each required file
39
  required_files = [
 
42
  "tokenizer.json",
43
  "tokenizer_config.json",
44
  "special_tokens_map.json",
45
+ "generation_config.json",
46
+ "chat_template.jinja"
47
  ]
48
 
49
  downloaded_count = 0
50
  for file_name in required_files:
51
+ if file_name in all_files:
 
52
  logger.info(f"Downloading {file_name}...")
53
  hf_hub_download(
54
  repo_id=MAIN_MODEL_ID,
55
+ filename=file_name,
56
  local_dir=LOCAL_MODEL_PATH,
57
  local_dir_use_symlinks=False
58
  )
59
  logger.info(f"Downloaded {file_name}")
60
  downloaded_count += 1
61
  else:
62
+ logger.warning(f"File {file_name} not found in main repository")
63
 
64
  logger.info(f"Downloaded {downloaded_count} out of {len(required_files)} required files")
65
  logger.info(f"Model downloaded successfully to {LOCAL_MODEL_PATH}")
requirements.txt CHANGED
@@ -8,4 +8,5 @@ tokenizers>=0.21.2
8
  pyyaml>=6.0
9
  psutil>=5.9.0
10
  tqdm>=4.64.0
11
- requests>=2.31.0
 
 
8
  pyyaml>=6.0
9
  psutil>=5.9.0
10
  tqdm>=4.64.0
11
+ requests>=2.31.0
12
+ https://github.com/Dao-AILab/flash-attention/releases/download/v2.5.9.post1/flash_attn-2.5.9.post1+cu118torch1.12cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
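The wheel above installs FlashAttention 2, but the `app.py` diff does not show it being enabled; with `transformers` it would typically be requested at load time via `attn_implementation`, roughly as in this sketch (GPU only):

```python
import torch
from transformers import AutoModelForCausalLM

# Sketch: opt in to FlashAttention 2 kernels on a supported CUDA GPU
model = AutoModelForCausalLM.from_pretrained(
    "Tonic/petite-elle-L-aime-3-sft",
    torch_dtype=torch.float16,
    device_map="auto",
    attn_implementation="flash_attention_2",  # requires the flash-attn package
    trust_remote_code=True,
)
```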
test_float16_compatibility.py ADDED
@@ -0,0 +1,96 @@
1
+ #!/usr/bin/env python3
2
+ """
3
+ Test script for float16 compatibility with pre-quantized model
4
+ """
5
+
6
+ import torch
7
+ from transformers import AutoModelForCausalLM, AutoTokenizer
8
+ import logging
9
+
10
+ # Set up logging
11
+ logging.basicConfig(level=logging.INFO)
12
+ logger = logging.getLogger(__name__)
13
+
14
+ def test_float16_compatibility():
15
+ """Test float16 compatibility with pre-quantized model"""
16
+
17
+ model_id = "Tonic/petite-elle-L-aime-3-sft"
18
+ device = "cuda" if torch.cuda.is_available() else "cpu"
19
+
20
+ logger.info(f"Testing float16 compatibility on device: {device}")
21
+
22
+ # Test both float32 and float16
23
+ dtypes_to_test = []
24
+
25
+ if device == "cuda":
26
+ dtypes_to_test = [torch.float32, torch.float16]
27
+ else:
28
+ dtypes_to_test = [torch.float32] # Only test float32 on CPU
29
+
30
+ for dtype in dtypes_to_test:
31
+ logger.info(f"\nTesting with dtype: {dtype}")
32
+
33
+ try:
34
+ # Load tokenizer
35
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
36
+ if tokenizer.pad_token_id is None:
37
+ tokenizer.pad_token_id = tokenizer.eos_token_id
38
+
39
+ # Load model with specific dtype
40
+ model_kwargs = {
41
+ "device_map": "auto" if device == "cuda" else "cpu",
42
+ "torch_dtype": dtype,
43
+ "trust_remote_code": True,
44
+ "low_cpu_mem_usage": True,
45
+ }
46
+
47
+ logger.info(f"Loading model with {dtype}...")
48
+ model = AutoModelForCausalLM.from_pretrained(model_id, **model_kwargs)
49
+
50
+ # Test generation
51
+ test_prompt = "Bonjour, comment allez-vous?"
52
+ inputs = tokenizer(test_prompt, return_tensors="pt")
53
+
54
+ if device == "cuda":
55
+ inputs = {k: v.cuda() for k, v in inputs.items()}
56
+
57
+ logger.info("Generating response...")
58
+ with torch.no_grad():
59
+ output_ids = model.generate(
60
+ inputs['input_ids'],
61
+ max_new_tokens=50,
62
+ temperature=0.7,
63
+ top_p=0.95,
64
+ do_sample=True,
65
+ attention_mask=inputs['attention_mask'],
66
+ pad_token_id=tokenizer.eos_token_id,
67
+ eos_token_id=tokenizer.eos_token_id,
68
+ cache_implementation="static"
69
+ )
70
+
71
+ response = tokenizer.decode(output_ids[0], skip_special_tokens=True)
72
+ assistant_response = response[len(test_prompt):].strip()
73
+
74
+ logger.info(f"✅ {dtype} test successful!")
75
+ logger.info(f"Input: {test_prompt}")
76
+ logger.info(f"Output: {assistant_response}")
77
+
78
+ # Check memory usage
79
+ if device == "cuda":
80
+ memory_used = torch.cuda.memory_allocated() / 1024**3
81
+ logger.info(f"GPU Memory used: {memory_used:.2f} GB")
82
+
83
+ # Check model dtype
84
+ logger.info(f"Model dtype: {model.dtype}")
85
+
86
+ # Clean up
87
+ del model
88
+ torch.cuda.empty_cache() if device == "cuda" else None
89
+
90
+ except Exception as e:
91
+ logger.error(f"❌ {dtype} test failed: {e}")
92
+ import traceback
93
+ traceback.print_exc()
94
+
95
+ if __name__ == "__main__":
96
+ test_float16_compatibility()
test_full_model_loading.py ADDED
@@ -0,0 +1,100 @@
1
+ #!/usr/bin/env python3
2
+ """
3
+ Test script for full fine-tuned model loading and inference
4
+ """
5
+
6
+ import torch
7
+ from transformers import AutoModelForCausalLM, AutoTokenizer
8
+ import logging
9
+
10
+ # Set up logging
11
+ logging.basicConfig(level=logging.INFO)
12
+ logger = logging.getLogger(__name__)
13
+
14
+ def test_full_model_loading():
15
+ """Test the full fine-tuned model loading and generation"""
16
+
17
+ model_id = "Tonic/petite-elle-L-aime-3-sft"
18
+ device = "cuda" if torch.cuda.is_available() else "cpu"
19
+
20
+ logger.info(f"Testing full fine-tuned model on device: {device}")
21
+
22
+ try:
23
+ # Load tokenizer
24
+ logger.info("Loading tokenizer...")
25
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
26
+ if tokenizer.pad_token_id is None:
27
+ tokenizer.pad_token_id = tokenizer.eos_token_id
28
+
29
+ # Load full fine-tuned model
30
+ logger.info("Loading full fine-tuned model...")
31
+ model_kwargs = {
32
+ "device_map": "auto" if device == "cuda" else "cpu",
33
+ "torch_dtype": torch.float16 if device == "cuda" else torch.float32,
34
+ "trust_remote_code": True,
35
+ "low_cpu_mem_usage": True,
36
+ }
37
+
38
+ model = AutoModelForCausalLM.from_pretrained(model_id, **model_kwargs)
39
+
40
+ # Test generation
41
+ test_prompt = "Bonjour, comment allez-vous?"
42
+ inputs = tokenizer(test_prompt, return_tensors="pt")
43
+
44
+ if device == "cuda":
45
+ inputs = {k: v.cuda() for k, v in inputs.items()}
46
+
47
+ logger.info("Generating response...")
48
+ with torch.no_grad():
49
+ output_ids = model.generate(
50
+ inputs['input_ids'],
51
+ max_new_tokens=50,
52
+ temperature=0.7,
53
+ top_p=0.95,
54
+ do_sample=True,
55
+ attention_mask=inputs['attention_mask'],
56
+ pad_token_id=tokenizer.eos_token_id,
57
+ eos_token_id=tokenizer.eos_token_id,
58
+ )
59
+
60
+ response = tokenizer.decode(output_ids[0], skip_special_tokens=True)
61
+ assistant_response = response[len(test_prompt):].strip()
62
+
63
+ logger.info("✅ Full fine-tuned model test successful!")
64
+ logger.info(f"Input: {test_prompt}")
65
+ logger.info(f"Output: {assistant_response}")
66
+
67
+ # Check model precision status
68
+ logger.info("Checking model precision status...")
69
+ float16_layers = 0
70
+ float32_layers = 0
71
+ total_layers = 0
72
+ for name, module in model.named_modules():
73
+ if hasattr(module, 'weight'):
74
+ total_layers += 1
75
+ if module.weight.dtype == torch.float16:
76
+ float16_layers += 1
77
+ elif module.weight.dtype == torch.float32:
78
+ float32_layers += 1
79
+
80
+ logger.info(f"Float16 layers: {float16_layers}/{total_layers}")
81
+ logger.info(f"Float32 layers: {float32_layers}/{total_layers}")
82
+
83
+ # Clean up
84
+ del model
85
+ torch.cuda.empty_cache() if device == "cuda" else None
86
+
87
+ return True
88
+
89
+ except Exception as e:
90
+ logger.error(f"❌ Full fine-tuned model test failed: {e}")
91
+ import traceback
92
+ traceback.print_exc()
93
+ return False
94
+
95
+ if __name__ == "__main__":
96
+ success = test_full_model_loading()
97
+ if success:
98
+ print("✅ Full model loading test passed!")
99
+ else:
100
+ print("❌ Full model loading test failed!")
test_pre_quantized_model.py CHANGED
@@ -1,6 +1,6 @@
1
  #!/usr/bin/env python3
2
  """
3
- Test script for pre-quantized model inference
4
  """
5
 
6
  import torch
@@ -11,31 +11,31 @@ import logging
11
  logging.basicConfig(level=logging.INFO)
12
  logger = logging.getLogger(__name__)
13
 
14
- def test_pre_quantized_model():
15
- """Test the pre-quantized model loading and generation"""
16
 
17
  model_id = "Tonic/petite-elle-L-aime-3-sft"
18
  device = "cuda" if torch.cuda.is_available() else "cpu"
19
 
20
- logger.info(f"Testing pre-quantized model on device: {device}")
21
 
22
  try:
23
  # Load tokenizer
24
  logger.info("Loading tokenizer...")
25
- tokenizer = AutoTokenizer.from_pretrained(model_id, subfolder="int4")
26
  if tokenizer.pad_token_id is None:
27
  tokenizer.pad_token_id = tokenizer.eos_token_id
28
 
29
- # Load pre-quantized model
30
- logger.info("Loading pre-quantized model...")
31
  model_kwargs = {
32
  "device_map": "auto" if device == "cuda" else "cpu",
33
- "torch_dtype": torch.float32,
34
  "trust_remote_code": True,
35
  "low_cpu_mem_usage": True,
36
  }
37
 
38
- model = AutoModelForCausalLM.from_pretrained(model_id, subfolder="int4", **model_kwargs)
39
 
40
  # Test generation
41
  test_prompt = "Bonjour, comment allez-vous?"
@@ -55,37 +55,40 @@ def test_pre_quantized_model():
55
  attention_mask=inputs['attention_mask'],
56
  pad_token_id=tokenizer.eos_token_id,
57
  eos_token_id=tokenizer.eos_token_id,
58
- cache_implementation="static" # Important for quantized models
59
  )
60
 
61
  response = tokenizer.decode(output_ids[0], skip_special_tokens=True)
62
  assistant_response = response[len(test_prompt):].strip()
63
 
64
- logger.info("✅ Pre-quantized model test successful!")
65
  logger.info(f"Input: {test_prompt}")
66
  logger.info(f"Output: {assistant_response}")
67
 
68
- # Check model quantization status
69
- logger.info("Checking model quantization status...")
70
- quantized_layers = 0
 
71
  total_layers = 0
72
  for name, module in model.named_modules():
73
  if hasattr(module, 'weight'):
74
  total_layers += 1
75
- if module.weight.dtype != torch.float32:
76
- quantized_layers += 1
77
- logger.info(f"Quantized layer: {name} - {module.weight.dtype}")
 
 
78
 
79
- logger.info(f"Quantized layers: {quantized_layers}/{total_layers}")
 
80
 
81
  # Clean up
82
  del model
83
  torch.cuda.empty_cache() if device == "cuda" else None
84
 
85
  except Exception as e:
86
- logger.error(f"❌ Pre-quantized model test failed: {e}")
87
  import traceback
88
  traceback.print_exc()
89
 
90
  if __name__ == "__main__":
91
- test_pre_quantized_model()
 
1
  #!/usr/bin/env python3
2
  """
3
+ Test script for full fine-tuned model inference
4
  """
5
 
6
  import torch
 
11
  logging.basicConfig(level=logging.INFO)
12
  logger = logging.getLogger(__name__)
13
 
14
+ def test_full_fine_tuned_model():
15
+ """Test the full fine-tuned model loading and generation"""
16
 
17
  model_id = "Tonic/petite-elle-L-aime-3-sft"
18
  device = "cuda" if torch.cuda.is_available() else "cpu"
19
 
20
+ logger.info(f"Testing full fine-tuned model on device: {device}")
21
 
22
  try:
23
  # Load tokenizer
24
  logger.info("Loading tokenizer...")
25
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
26
  if tokenizer.pad_token_id is None:
27
  tokenizer.pad_token_id = tokenizer.eos_token_id
28
 
29
+ # Load full fine-tuned model
30
+ logger.info("Loading full fine-tuned model...")
31
  model_kwargs = {
32
  "device_map": "auto" if device == "cuda" else "cpu",
33
+ "torch_dtype": torch.float16 if device == "cuda" else torch.float32,
34
  "trust_remote_code": True,
35
  "low_cpu_mem_usage": True,
36
  }
37
 
38
+ model = AutoModelForCausalLM.from_pretrained(model_id, **model_kwargs)
39
 
40
  # Test generation
41
  test_prompt = "Bonjour, comment allez-vous?"
 
55
  attention_mask=inputs['attention_mask'],
56
  pad_token_id=tokenizer.eos_token_id,
57
  eos_token_id=tokenizer.eos_token_id,
 
58
  )
59
 
60
  response = tokenizer.decode(output_ids[0], skip_special_tokens=True)
61
  assistant_response = response[len(test_prompt):].strip()
62
 
63
+ logger.info("✅ Full fine-tuned model test successful!")
64
  logger.info(f"Input: {test_prompt}")
65
  logger.info(f"Output: {assistant_response}")
66
 
67
+ # Check model precision status
68
+ logger.info("Checking model precision status...")
69
+ float16_layers = 0
70
+ float32_layers = 0
71
  total_layers = 0
72
  for name, module in model.named_modules():
73
  if hasattr(module, 'weight'):
74
  total_layers += 1
75
+ if module.weight.dtype == torch.float16:
76
+ float16_layers += 1
77
+ elif module.weight.dtype == torch.float32:
78
+ float32_layers += 1
79
+ logger.info(f"Float32 layer: {name} - {module.weight.dtype}")
80
 
81
+ logger.info(f"Float16 layers: {float16_layers}/{total_layers}")
82
+ logger.info(f"Float32 layers: {float32_layers}/{total_layers}")
83
 
84
  # Clean up
85
  del model
86
  torch.cuda.empty_cache() if device == "cuda" else None
87
 
88
  except Exception as e:
89
+ logger.error(f"❌ Full fine-tuned model test failed: {e}")
90
  import traceback
91
  traceback.print_exc()
92
 
93
  if __name__ == "__main__":
94
+ test_full_fine_tuned_model()