Tonic committed on
Commit
109031b
·
1 Parent(s): d784738

adds flash attention

README.md CHANGED
@@ -13,26 +13,26 @@ short_description: Smollm3 for French Understanding
13
 
14
  # 🤖 Petite Elle L'Aime 3 - Chat Interface
15
 
16
- A complete Gradio application for the [Petite Elle L'Aime 3](https://huggingface.co/Tonic/petite-elle-L-aime-3-sft) model, featuring the int4 quantized version for efficient CPU deployment.
17
 
18
  ## 🚀 Features
19
 
20
  - **Multilingual Support**: English, French, Italian, Portuguese, Chinese, Arabic
21
- - **Int4 Quantization**: Optimized for CPU deployment with ~50% memory reduction
22
  - **Interactive Chat Interface**: Real-time conversation with the model
23
  - **Customizable System Prompt**: Define the assistant's personality and behavior
24
  - **Thinking Mode**: Enable reasoning mode with thinking tags
25
  - **Responsive Design**: Modern UI following the reference layout
26
  - **Chat Template Integration**: Proper Jinja template formatting
27
- - **Automatic Model Download**: Downloads int4 model at build time
28
 
29
  ## 📋 Model Information
30
 
31
  - **Base Model**: SmolLM3-3B
32
  - **Parameters**: ~3B
33
  - **Context Length**: 128k
34
- - **Quantization**: int4 (CPU optimized)
35
- - **Memory Reduction**: ~50%
36
  - **Languages**: English, French, Italian, Portuguese, Chinese, Arabic
37
 
38
  ## 🛠️ Installation
@@ -94,8 +94,8 @@ The interface follows the reference layout with:
94
  ### Model Loading Strategy
95
  The application uses a smart loading strategy:
96
 
97
- 1. **Local Check**: First checks if int4 model files exist locally
98
- 2. **Local Loading**: If available, loads from `./int4` folder
99
  3. **Fallback Download**: If not available, downloads from Hugging Face
100
  4. **Tokenizer**: Always uses main repo for chat template and configuration
101
 
@@ -103,8 +103,8 @@ The application uses a smart loading strategy:
103
  For Hugging Face Spaces deployment:
104
 
105
  1. **Build Script**: `build.py` runs during Space build
106
- 2. **Model Download**: `download_model.py` downloads int4 model files
107
- 3. **Local Storage**: Model files stored in `./int4` directory
108
  4. **Fast Loading**: Subsequent runs use local files
109
 
110
  ### Chat Template Integration
@@ -115,9 +115,10 @@ The application uses the custom chat template from the model, which supports:
115
  - Proper conversation flow management
116
 
117
  ### Memory Optimization
118
- - Uses int4 quantization for reduced memory footprint
119
  - Automatic device detection (CUDA/CPU)
120
  - Efficient tokenization and generation
 
121
 
122
  ## πŸ“ Example Usage
123
 
@@ -156,8 +157,9 @@ The application uses the custom chat template from the model, which supports:
156
  - Check the console for detailed error messages
157
 
158
  3. **Performance Issues**:
159
- - The int4 model is optimized for CPU but may be slower than GPU versions
160
- - Consider using a machine with more RAM for better performance
 
161
 
162
  4. **System Prompt Issues**:
163
  - Ensure the system prompt is not too long (max 1000 characters)
 
13
 
14
  # 🤖 Petite Elle L'Aime 3 - Chat Interface
15
 
16
+ A complete Gradio application for the [Petite Elle L'Aime 3](https://huggingface.co/Tonic/petite-elle-L-aime-3-sft) model, featuring the full fine-tuned version for maximum performance and quality.
17
 
18
  ## 🚀 Features
19
 
20
  - **Multilingual Support**: English, French, Italian, Portuguese, Chinese, Arabic
21
+ - **Full Fine-Tuned Model**: Maximum performance and quality with full precision
22
  - **Interactive Chat Interface**: Real-time conversation with the model
23
  - **Customizable System Prompt**: Define the assistant's personality and behavior
24
  - **Thinking Mode**: Enable reasoning mode with thinking tags
25
  - **Responsive Design**: Modern UI following the reference layout
26
  - **Chat Template Integration**: Proper Jinja template formatting
27
+ - **Automatic Model Download**: Downloads full model at build time
28
 
29
  ## 📋 Model Information
30
 
31
  - **Base Model**: SmolLM3-3B
32
  - **Parameters**: ~3B
33
  - **Context Length**: 128k
34
+ - **Precision**: Full fine-tuned model (float16/float32)
35
+ - **Performance**: Maximum quality and accuracy
36
  - **Languages**: English, French, Italian, Portuguese, Chinese, Arabic
37
 
38
  ## 🛠️ Installation
 
94
  ### Model Loading Strategy
95
  The application uses a smart loading strategy:
96
 
97
+ 1. **Local Check**: First checks if full model files exist locally
98
+ 2. **Local Loading**: If available, loads from `./model` folder
99
  3. **Fallback Download**: If not available, downloads from Hugging Face
100
  4. **Tokenizer**: Always uses main repo for chat template and configuration
101
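For orientation, the strategy above boils down to something like the following sketch (the real logic lives in `app.py`; `load_model_smart` is an illustrative name, not the actual function):

```python
import os
from transformers import AutoModelForCausalLM, AutoTokenizer

MAIN_MODEL_ID = "Tonic/petite-elle-L-aime-3-sft"
LOCAL_MODEL_PATH = "./model"  # populated by download_model.py at build time

def load_model_smart():
    # Steps 1-2: prefer the locally downloaded copy if it exists
    source = LOCAL_MODEL_PATH if os.path.isdir(LOCAL_MODEL_PATH) else MAIN_MODEL_ID
    # Step 3: from_pretrained falls back to pulling from the Hugging Face Hub
    model = AutoModelForCausalLM.from_pretrained(source, trust_remote_code=True)
    # Step 4: the tokenizer (and chat template) always come from the main repo
    tokenizer = AutoTokenizer.from_pretrained(MAIN_MODEL_ID)
    return model, tokenizer
```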
 
 
103
  For Hugging Face Spaces deployment:
104
 
105
  1. **Build Script**: `build.py` runs during Space build
106
+ 2. **Model Download**: `download_model.py` downloads full model files
107
+ 3. **Local Storage**: Model files stored in `./model` directory
108
  4. **Fast Loading**: Subsequent runs use local files
109
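A minimal sketch of the build step, assuming `build.py` does little more than call the helper defined in `download_model.py` (the actual script may do more):

```python
# build.py (sketch) -- executed once during the Space build
from download_model import download_model

if __name__ == "__main__":
    # Populate ./model so subsequent app starts load from local files
    download_model()
```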
 
110
  ### Chat Template Integration
 
115
  - Proper conversation flow management
116
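As an illustration of the template in use, a hedged sketch with the standard `apply_chat_template` API; the `enable_thinking` flag mirrors the Thinking Mode toggle described above:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Tonic/petite-elle-L-aime-3-sft")
messages = [
    {"role": "system", "content": "Tu es TonicIA, un assistant francophone rigoureux et bienveillant."},
    {"role": "user", "content": "Bonjour, comment allez-vous?"},
]
# The custom Jinja template formats the turns; enable_thinking toggles thinking tags
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,
)
print(prompt)
```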
 
117
  ### Memory Optimization
118
+ - Uses full fine-tuned model for maximum quality
119
  - Automatic device detection (CUDA/CPU)
120
  - Efficient tokenization and generation
121
+ - Float16 precision on GPU for optimal performance
122
 
123
  ## πŸ“ Example Usage
124
 
 
157
  - Check the console for detailed error messages
158
 
159
  3. **Performance Issues**:
160
+ - The full model provides maximum quality but requires more memory
161
+ - GPU acceleration recommended for optimal performance
162
+ - Consider reducing model parameters if memory is limited
163
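If memory is the bottleneck, one option not shown in this commit is to cap per-device usage and let `accelerate` offload the remainder to CPU; the budgets below are placeholders to adjust for the actual hardware:

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Tonic/petite-elle-L-aime-3-sft",
    torch_dtype=torch.float16,
    device_map="auto",
    max_memory={0: "10GiB", "cpu": "24GiB"},  # placeholder limits
    trust_remote_code=True,
    low_cpu_mem_usage=True,
)
```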
 
164
  4. **System Prompt Issues**:
165
  - Ensure the system prompt is not too long (max 1000 characters)
app.py CHANGED
@@ -11,8 +11,11 @@ import sys
11
  import requests
12
  import accelerate
13
 
14
- # Set torch to use float32 for better compatibility with quantized models
15
- torch.set_default_dtype(torch.float32)
 
 
 
16
  logging.basicConfig(level=logging.INFO)
17
  logger = logging.getLogger(__name__)
18
  MAIN_MODEL_ID = "Tonic/petite-elle-L-aime-3-sft"
@@ -21,11 +24,11 @@ model = None
21
  tokenizer = None
22
  DEFAULT_SYSTEM_PROMPT = "Tu es TonicIA, un assistant francophone rigoureux et bienveillant."
23
  title = "# 🤖 Petite Elle L'Aime 3 - Chat Interface"
24
- description = "A fine-tuned version of SmolLM3-3B optimized for French conversations. This is the pre-quantized int4 version for efficient deployment."
25
  presentation1 = """
26
  ### 🎯 Features
27
  - **Multilingual Support**: English, French, Italian, Portuguese, Chinese, Arabic
28
- - **Pre-Quantized Int4**: Optimized for deployment with memory reduction
29
  - **Interactive Chat Interface**: Real-time conversation with the model
30
  - **Customizable System Prompt**: Define the assistant's personality and behavior
31
  - **Thinking Mode**: Enable reasoning mode with thinking tags
@@ -33,7 +36,7 @@ presentation1 = """
33
  """
34
  presentation2 = """### 🎯 Fonctionnalités
35
  * **Support multilingue** : Anglais, Français, Italien, Portugais, Chinois, Arabe
36
- * **Pré-quantifié Int4** : Optimisé pour un déploiement avec réduction de mémoire
37
  * **Interface de chat interactive** : Conversation en temps réel avec le modèle
38
  * **Invite système personnalisable** : Définissez la personnalité et le comportement de l'assistant
39
  * **Mode Réflexion** : Activez le mode raisonnement avec des balises de réflexion
@@ -97,34 +100,34 @@ def download_chat_template():
97
 
98
 
99
  def load_model():
100
- """Load the pre-quantized model and tokenizer"""
101
  global model, tokenizer
102
 
103
  try:
104
  logger.info(f"Loading tokenizer from {MAIN_MODEL_ID}")
105
- tokenizer = AutoTokenizer.from_pretrained(MAIN_MODEL_ID, subfolder="int4")
106
  chat_template = download_chat_template()
107
  if chat_template:
108
  tokenizer.chat_template = chat_template
109
  logger.info("Chat template downloaded and set successfully")
110
 
111
- logger.info(f"Loading pre-quantized int4 model from {MAIN_MODEL_ID}/int4")
112
 
113
- # Load the pre-quantized model without additional quantization config
114
  model_kwargs = {
115
  "device_map": "auto" if DEVICE == "cuda" else "cpu",
116
- "torch_dtype": torch.float32, # Use float32 for compatibility
117
  "trust_remote_code": True,
118
  "low_cpu_mem_usage": True,
119
  }
120
 
121
  logger.info(f"Model loading parameters: {model_kwargs}")
122
- model = AutoModelForCausalLM.from_pretrained(MAIN_MODEL_ID, subfolder="int4", **model_kwargs)
123
 
124
  if tokenizer.pad_token_id is None:
125
  tokenizer.pad_token_id = tokenizer.eos_token_id
126
 
127
- logger.info("Pre-quantized model loaded successfully")
128
  return True
129
 
130
  except Exception as e:
@@ -173,9 +176,9 @@ def create_prompt(system_message, user_message, enable_thinking=True, tools=None
173
  logger.error(f"Error creating prompt: {e}")
174
  return ""
175
 
176
- @spaces.GPU(duration=94)
177
  def generate_response(message, history, system_message, max_tokens, temperature, top_p, repetition_penalty, do_sample, enable_thinking=True, tools=None, use_xml_tools=True):
178
- """Generate response using the pre-quantized model with SmolLM3 features"""
179
  global model, tokenizer
180
 
181
  if model is None or tokenizer is None:
@@ -212,7 +215,7 @@ def generate_response(message, history, system_message, max_tokens, temperature,
212
  attention_mask=inputs['attention_mask'],
213
  pad_token_id=tokenizer.eos_token_id,
214
  eos_token_id=tokenizer.eos_token_id,
215
- cache_implementation="static" # Important for torchao quantized models
216
  )
217
  response = tokenizer.decode(output_ids[0], skip_special_tokens=True)
218
  assistant_response = response[len(full_prompt):].strip()
@@ -261,7 +264,7 @@ def bot(history, system_prompt, max_length, temperature, top_p, repetition_penal
261
  return history
262
 
263
  # Load model on startup
264
- logger.info("Starting model loading process with torchao quantization...")
265
  load_model()
266
 
267
  # Create Gradio interface
@@ -321,7 +324,7 @@ with gr.Blocks() as demo:
321
  step=0.01
322
  )
323
  repetition_penalty = gr.Slider(
324
- label="🔄 Répétition Penalty",
325
  minimum=1.0,
326
  maximum=2.0,
327
  value=1.1,
 
11
  import requests
12
  import accelerate
13
 
14
+ # Set torch to use float16 on GPU for better performance, float32 on CPU for compatibility
15
+ if torch.cuda.is_available():
16
+ torch.set_default_dtype(torch.float16)
17
+ else:
18
+ torch.set_default_dtype(torch.float32)
19
  logging.basicConfig(level=logging.INFO)
20
  logger = logging.getLogger(__name__)
21
  MAIN_MODEL_ID = "Tonic/petite-elle-L-aime-3-sft"
 
24
  tokenizer = None
25
  DEFAULT_SYSTEM_PROMPT = "Tu es TonicIA, un assistant francophone rigoureux et bienveillant."
26
  title = "# 🤖 Petite Elle L'Aime 3 - Chat Interface"
27
+ description = "A fine-tuned version of SmolLM3-3B optimized for French conversations. This is the full fine-tuned model for maximum performance and quality."
28
  presentation1 = """
29
  ### 🎯 Features
30
  - **Multilingual Support**: English, French, Italian, Portuguese, Chinese, Arabic
31
+ - **Full Fine-Tuned Model**: Maximum performance and quality with full precision
32
  - **Interactive Chat Interface**: Real-time conversation with the model
33
  - **Customizable System Prompt**: Define the assistant's personality and behavior
34
  - **Thinking Mode**: Enable reasoning mode with thinking tags
 
36
  """
37
  presentation2 = """### 🎯 Fonctionnalités
38
  * **Support multilingue** : Anglais, Français, Italien, Portugais, Chinois, Arabe
39
+ * **Modèle complet fine-tuné** : Performance et qualité maximales avec précision complète
40
  * **Interface de chat interactive** : Conversation en temps réel avec le modèle
41
  * **Invite système personnalisable** : Définissez la personnalité et le comportement de l'assistant
42
  * **Mode Réflexion** : Activez le mode raisonnement avec des balises de réflexion
 
100
 
101
 
102
  def load_model():
103
+ """Load the full fine-tuned model and tokenizer"""
104
  global model, tokenizer
105
 
106
  try:
107
  logger.info(f"Loading tokenizer from {MAIN_MODEL_ID}")
108
+ tokenizer = AutoTokenizer.from_pretrained(MAIN_MODEL_ID)
109
  chat_template = download_chat_template()
110
  if chat_template:
111
  tokenizer.chat_template = chat_template
112
  logger.info("Chat template downloaded and set successfully")
113
 
114
+ logger.info(f"Loading full fine-tuned model from {MAIN_MODEL_ID}")
115
 
116
+ # Load the full fine-tuned model with optimized settings
117
  model_kwargs = {
118
  "device_map": "auto" if DEVICE == "cuda" else "cpu",
119
+ "torch_dtype": torch.float16 if DEVICE == "cuda" else torch.float32, # Use float16 on GPU, float32 on CPU
120
  "trust_remote_code": True,
121
  "low_cpu_mem_usage": True,
122
  }
123
 
124
  logger.info(f"Model loading parameters: {model_kwargs}")
125
+ model = AutoModelForCausalLM.from_pretrained(MAIN_MODEL_ID, **model_kwargs)
126
 
127
  if tokenizer.pad_token_id is None:
128
  tokenizer.pad_token_id = tokenizer.eos_token_id
129
 
130
+ logger.info("Full fine-tuned model loaded successfully")
131
  return True
132
 
133
  except Exception as e:
 
176
  logger.error(f"Error creating prompt: {e}")
177
  return ""
178
 
179
+ @spaces.GPU()
180
  def generate_response(message, history, system_message, max_tokens, temperature, top_p, repetition_penalty, do_sample, enable_thinking=True, tools=None, use_xml_tools=True):
181
+ """Generate response using the full fine-tuned model with SmolLM3 features"""
182
  global model, tokenizer
183
 
184
  if model is None or tokenizer is None:
 
215
  attention_mask=inputs['attention_mask'],
216
  pad_token_id=tokenizer.eos_token_id,
217
  eos_token_id=tokenizer.eos_token_id,
218
+ cache_implementation="static"
219
  )
220
  response = tokenizer.decode(output_ids[0], skip_special_tokens=True)
221
  assistant_response = response[len(full_prompt):].strip()
 
264
  return history
265
 
266
  # Load model on startup
267
+ logger.info("Starting model loading process with full fine-tuned model...")
268
  load_model()
269
 
270
  # Create Gradio interface
 
324
  step=0.01
325
  )
326
  repetition_penalty = gr.Slider(
327
+ label="🔄 Pénalité de Répétition",
328
  minimum=1.0,
329
  maximum=2.0,
330
  value=1.1,
download_model.py CHANGED
@@ -1,6 +1,6 @@
1
  #!/usr/bin/env python3
2
  """
3
- Helper script to download the int4 model files at build time for Hugging Face Spaces
4
  """
5
 
6
  import os
@@ -15,12 +15,12 @@ logger = logging.getLogger(__name__)
15
 
16
  # Model configuration
17
  MAIN_MODEL_ID = "Tonic/petite-elle-L-aime-3-sft"
18
- LOCAL_MODEL_PATH = "./int4"
19
 
20
  def download_model():
21
- """Download the int4 model files to local directory"""
22
  try:
23
- logger.info(f"Downloading int4 model from {MAIN_MODEL_ID}/int4")
24
 
25
  # Create local directory if it doesn't exist
26
  os.makedirs(LOCAL_MODEL_PATH, exist_ok=True)
@@ -31,9 +31,9 @@ def download_model():
31
  # List all files in the repository
32
  all_files = list_repo_files(MAIN_MODEL_ID)
33
 
34
- # Filter files that are in the int4 subfolder
35
- int4_files = [f for f in all_files if f.startswith("int4/")]
36
- logger.info(f"Found {len(int4_files)} files in int4 subfolder")
37
 
38
  # Download each required file
39
  required_files = [
@@ -42,24 +42,24 @@ def download_model():
42
  "tokenizer.json",
43
  "tokenizer_config.json",
44
  "special_tokens_map.json",
45
- "generation_config.json"
 
46
  ]
47
 
48
  downloaded_count = 0
49
  for file_name in required_files:
50
- int4_file_path = f"int4/{file_name}"
51
- if int4_file_path in all_files:
52
  logger.info(f"Downloading {file_name}...")
53
  hf_hub_download(
54
  repo_id=MAIN_MODEL_ID,
55
- filename=int4_file_path,
56
  local_dir=LOCAL_MODEL_PATH,
57
  local_dir_use_symlinks=False
58
  )
59
  logger.info(f"Downloaded {file_name}")
60
  downloaded_count += 1
61
  else:
62
- logger.warning(f"File {file_name} not found in int4 subfolder")
63
 
64
  logger.info(f"Downloaded {downloaded_count} out of {len(required_files)} required files")
65
  logger.info(f"Model downloaded successfully to {LOCAL_MODEL_PATH}")
 
1
  #!/usr/bin/env python3
2
  """
3
+ Helper script to download the full fine-tuned model files at build time for Hugging Face Spaces
4
  """
5
 
6
  import os
 
15
 
16
  # Model configuration
17
  MAIN_MODEL_ID = "Tonic/petite-elle-L-aime-3-sft"
18
+ LOCAL_MODEL_PATH = "./model"
19
 
20
  def download_model():
21
+ """Download the full fine-tuned model files to local directory"""
22
  try:
23
+ logger.info(f"Downloading full fine-tuned model from {MAIN_MODEL_ID}")
24
 
25
  # Create local directory if it doesn't exist
26
  os.makedirs(LOCAL_MODEL_PATH, exist_ok=True)
 
31
  # List all files in the repository
32
  all_files = list_repo_files(MAIN_MODEL_ID)
33
 
34
+ # Skip the int4 subfolder; keep everything else in the repository
35
+ main_files = [f for f in all_files if not f.startswith("int4/")]
36
+ logger.info(f"Found {len(main_files)} files in main repository")
37
 
38
  # Download each required file
39
  required_files = [
 
42
  "tokenizer.json",
43
  "tokenizer_config.json",
44
  "special_tokens_map.json",
45
+ "generation_config.json",
46
+ "chat_template.jinja"
47
  ]
48
 
49
  downloaded_count = 0
50
  for file_name in required_files:
51
+ if file_name in all_files:
 
52
  logger.info(f"Downloading {file_name}...")
53
  hf_hub_download(
54
  repo_id=MAIN_MODEL_ID,
55
+ filename=file_name,
56
  local_dir=LOCAL_MODEL_PATH,
57
  local_dir_use_symlinks=False
58
  )
59
  logger.info(f"Downloaded {file_name}")
60
  downloaded_count += 1
61
  else:
62
+ logger.warning(f"File {file_name} not found in main repository")
63
 
64
  logger.info(f"Downloaded {downloaded_count} out of {len(required_files)} required files")
65
  logger.info(f"Model downloaded successfully to {LOCAL_MODEL_PATH}")
requirements.txt CHANGED
@@ -8,4 +8,5 @@ tokenizers>=0.21.2
8
  pyyaml>=6.0
9
  psutil>=5.9.0
10
  tqdm>=4.64.0
11
- requests>=2.31.0
 
 
8
  pyyaml>=6.0
9
  psutil>=5.9.0
10
  tqdm>=4.64.0
11
+ requests>=2.31.0
12
+ https://github.com/Dao-AILab/flash-attention/releases/download/v2.5.9.post1/flash_attn-2.5.9.post1+cu118torch1.12cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
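The wheel above installs FlashAttention 2, but the `app.py` diff does not show it being enabled; with `transformers` it would typically be requested at load time via `attn_implementation`, roughly as in this sketch (GPU only):

```python
import torch
from transformers import AutoModelForCausalLM

# Sketch: opt in to FlashAttention 2 kernels on a supported CUDA GPU
model = AutoModelForCausalLM.from_pretrained(
    "Tonic/petite-elle-L-aime-3-sft",
    torch_dtype=torch.float16,
    device_map="auto",
    attn_implementation="flash_attention_2",  # requires the flash-attn package
    trust_remote_code=True,
)
```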
test_float16_compatibility.py ADDED
@@ -0,0 +1,96 @@
1
+ #!/usr/bin/env python3
2
+ """
3
+ Test script for float16 compatibility with pre-quantized model
4
+ """
5
+
6
+ import torch
7
+ from transformers import AutoModelForCausalLM, AutoTokenizer
8
+ import logging
9
+
10
+ # Set up logging
11
+ logging.basicConfig(level=logging.INFO)
12
+ logger = logging.getLogger(__name__)
13
+
14
+ def test_float16_compatibility():
15
+ """Test float16 compatibility with pre-quantized model"""
16
+
17
+ model_id = "Tonic/petite-elle-L-aime-3-sft"
18
+ device = "cuda" if torch.cuda.is_available() else "cpu"
19
+
20
+ logger.info(f"Testing float16 compatibility on device: {device}")
21
+
22
+ # Test both float32 and float16
23
+ dtypes_to_test = []
24
+
25
+ if device == "cuda":
26
+ dtypes_to_test = [torch.float32, torch.float16]
27
+ else:
28
+ dtypes_to_test = [torch.float32] # Only test float32 on CPU
29
+
30
+ for dtype in dtypes_to_test:
31
+ logger.info(f"\nTesting with dtype: {dtype}")
32
+
33
+ try:
34
+ # Load tokenizer
35
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
36
+ if tokenizer.pad_token_id is None:
37
+ tokenizer.pad_token_id = tokenizer.eos_token_id
38
+
39
+ # Load model with specific dtype
40
+ model_kwargs = {
41
+ "device_map": "auto" if device == "cuda" else "cpu",
42
+ "torch_dtype": dtype,
43
+ "trust_remote_code": True,
44
+ "low_cpu_mem_usage": True,
45
+ }
46
+
47
+ logger.info(f"Loading model with {dtype}...")
48
+ model = AutoModelForCausalLM.from_pretrained(model_id, **model_kwargs)
49
+
50
+ # Test generation
51
+ test_prompt = "Bonjour, comment allez-vous?"
52
+ inputs = tokenizer(test_prompt, return_tensors="pt")
53
+
54
+ if device == "cuda":
55
+ inputs = {k: v.cuda() for k, v in inputs.items()}
56
+
57
+ logger.info("Generating response...")
58
+ with torch.no_grad():
59
+ output_ids = model.generate(
60
+ inputs['input_ids'],
61
+ max_new_tokens=50,
62
+ temperature=0.7,
63
+ top_p=0.95,
64
+ do_sample=True,
65
+ attention_mask=inputs['attention_mask'],
66
+ pad_token_id=tokenizer.eos_token_id,
67
+ eos_token_id=tokenizer.eos_token_id,
68
+ cache_implementation="static"
69
+ )
70
+
71
+ response = tokenizer.decode(output_ids[0], skip_special_tokens=True)
72
+ assistant_response = response[len(test_prompt):].strip()
73
+
74
+ logger.info(f"✅ {dtype} test successful!")
75
+ logger.info(f"Input: {test_prompt}")
76
+ logger.info(f"Output: {assistant_response}")
77
+
78
+ # Check memory usage
79
+ if device == "cuda":
80
+ memory_used = torch.cuda.memory_allocated() / 1024**3
81
+ logger.info(f"GPU Memory used: {memory_used:.2f} GB")
82
+
83
+ # Check model dtype
84
+ logger.info(f"Model dtype: {model.dtype}")
85
+
86
+ # Clean up
87
+ del model
88
+ torch.cuda.empty_cache() if device == "cuda" else None
89
+
90
+ except Exception as e:
91
+ logger.error(f"❌ {dtype} test failed: {e}")
92
+ import traceback
93
+ traceback.print_exc()
94
+
95
+ if __name__ == "__main__":
96
+ test_float16_compatibility()
test_full_model_loading.py ADDED
@@ -0,0 +1,100 @@
1
+ #!/usr/bin/env python3
2
+ """
3
+ Test script for full fine-tuned model loading and inference
4
+ """
5
+
6
+ import torch
7
+ from transformers import AutoModelForCausalLM, AutoTokenizer
8
+ import logging
9
+
10
+ # Set up logging
11
+ logging.basicConfig(level=logging.INFO)
12
+ logger = logging.getLogger(__name__)
13
+
14
+ def test_full_model_loading():
15
+ """Test the full fine-tuned model loading and generation"""
16
+
17
+ model_id = "Tonic/petite-elle-L-aime-3-sft"
18
+ device = "cuda" if torch.cuda.is_available() else "cpu"
19
+
20
+ logger.info(f"Testing full fine-tuned model on device: {device}")
21
+
22
+ try:
23
+ # Load tokenizer
24
+ logger.info("Loading tokenizer...")
25
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
26
+ if tokenizer.pad_token_id is None:
27
+ tokenizer.pad_token_id = tokenizer.eos_token_id
28
+
29
+ # Load full fine-tuned model
30
+ logger.info("Loading full fine-tuned model...")
31
+ model_kwargs = {
32
+ "device_map": "auto" if device == "cuda" else "cpu",
33
+ "torch_dtype": torch.float16 if device == "cuda" else torch.float32,
34
+ "trust_remote_code": True,
35
+ "low_cpu_mem_usage": True,
36
+ }
37
+
38
+ model = AutoModelForCausalLM.from_pretrained(model_id, **model_kwargs)
39
+
40
+ # Test generation
41
+ test_prompt = "Bonjour, comment allez-vous?"
42
+ inputs = tokenizer(test_prompt, return_tensors="pt")
43
+
44
+ if device == "cuda":
45
+ inputs = {k: v.cuda() for k, v in inputs.items()}
46
+
47
+ logger.info("Generating response...")
48
+ with torch.no_grad():
49
+ output_ids = model.generate(
50
+ inputs['input_ids'],
51
+ max_new_tokens=50,
52
+ temperature=0.7,
53
+ top_p=0.95,
54
+ do_sample=True,
55
+ attention_mask=inputs['attention_mask'],
56
+ pad_token_id=tokenizer.eos_token_id,
57
+ eos_token_id=tokenizer.eos_token_id,
58
+ )
59
+
60
+ response = tokenizer.decode(output_ids[0], skip_special_tokens=True)
61
+ assistant_response = response[len(test_prompt):].strip()
62
+
63
+ logger.info("✅ Full fine-tuned model test successful!")
64
+ logger.info(f"Input: {test_prompt}")
65
+ logger.info(f"Output: {assistant_response}")
66
+
67
+ # Check model precision status
68
+ logger.info("Checking model precision status...")
69
+ float16_layers = 0
70
+ float32_layers = 0
71
+ total_layers = 0
72
+ for name, module in model.named_modules():
73
+ if hasattr(module, 'weight'):
74
+ total_layers += 1
75
+ if module.weight.dtype == torch.float16:
76
+ float16_layers += 1
77
+ elif module.weight.dtype == torch.float32:
78
+ float32_layers += 1
79
+
80
+ logger.info(f"Float16 layers: {float16_layers}/{total_layers}")
81
+ logger.info(f"Float32 layers: {float32_layers}/{total_layers}")
82
+
83
+ # Clean up
84
+ del model
85
+ torch.cuda.empty_cache() if device == "cuda" else None
86
+
87
+ return True
88
+
89
+ except Exception as e:
90
+ logger.error(f"❌ Full fine-tuned model test failed: {e}")
91
+ import traceback
92
+ traceback.print_exc()
93
+ return False
94
+
95
+ if __name__ == "__main__":
96
+ success = test_full_model_loading()
97
+ if success:
98
+ print("✅ Full model loading test passed!")
99
+ else:
100
+ print("❌ Full model loading test failed!")
test_pre_quantized_model.py CHANGED
@@ -1,6 +1,6 @@
1
  #!/usr/bin/env python3
2
  """
3
- Test script for pre-quantized model inference
4
  """
5
 
6
  import torch
@@ -11,31 +11,31 @@ import logging
11
  logging.basicConfig(level=logging.INFO)
12
  logger = logging.getLogger(__name__)
13
 
14
- def test_pre_quantized_model():
15
- """Test the pre-quantized model loading and generation"""
16
 
17
  model_id = "Tonic/petite-elle-L-aime-3-sft"
18
  device = "cuda" if torch.cuda.is_available() else "cpu"
19
 
20
- logger.info(f"Testing pre-quantized model on device: {device}")
21
 
22
  try:
23
  # Load tokenizer
24
  logger.info("Loading tokenizer...")
25
- tokenizer = AutoTokenizer.from_pretrained(model_id, subfolder="int4")
26
  if tokenizer.pad_token_id is None:
27
  tokenizer.pad_token_id = tokenizer.eos_token_id
28
 
29
- # Load pre-quantized model
30
- logger.info("Loading pre-quantized model...")
31
  model_kwargs = {
32
  "device_map": "auto" if device == "cuda" else "cpu",
33
- "torch_dtype": torch.float32,
34
  "trust_remote_code": True,
35
  "low_cpu_mem_usage": True,
36
  }
37
 
38
- model = AutoModelForCausalLM.from_pretrained(model_id, subfolder="int4", **model_kwargs)
39
 
40
  # Test generation
41
  test_prompt = "Bonjour, comment allez-vous?"
@@ -55,37 +55,40 @@ def test_pre_quantized_model():
55
  attention_mask=inputs['attention_mask'],
56
  pad_token_id=tokenizer.eos_token_id,
57
  eos_token_id=tokenizer.eos_token_id,
58
- cache_implementation="static" # Important for quantized models
59
  )
60
 
61
  response = tokenizer.decode(output_ids[0], skip_special_tokens=True)
62
  assistant_response = response[len(test_prompt):].strip()
63
 
64
- logger.info("✅ Pre-quantized model test successful!")
65
  logger.info(f"Input: {test_prompt}")
66
  logger.info(f"Output: {assistant_response}")
67
 
68
- # Check model quantization status
69
- logger.info("Checking model quantization status...")
70
- quantized_layers = 0
 
71
  total_layers = 0
72
  for name, module in model.named_modules():
73
  if hasattr(module, 'weight'):
74
  total_layers += 1
75
- if module.weight.dtype != torch.float32:
76
- quantized_layers += 1
77
- logger.info(f"Quantized layer: {name} - {module.weight.dtype}")
 
 
78
 
79
- logger.info(f"Quantized layers: {quantized_layers}/{total_layers}")
 
80
 
81
  # Clean up
82
  del model
83
  torch.cuda.empty_cache() if device == "cuda" else None
84
 
85
  except Exception as e:
86
- logger.error(f"❌ Pre-quantized model test failed: {e}")
87
  import traceback
88
  traceback.print_exc()
89
 
90
  if __name__ == "__main__":
91
- test_pre_quantized_model()
 
1
  #!/usr/bin/env python3
2
  """
3
+ Test script for full fine-tuned model inference
4
  """
5
 
6
  import torch
 
11
  logging.basicConfig(level=logging.INFO)
12
  logger = logging.getLogger(__name__)
13
 
14
+ def test_full_fine_tuned_model():
15
+ """Test the full fine-tuned model loading and generation"""
16
 
17
  model_id = "Tonic/petite-elle-L-aime-3-sft"
18
  device = "cuda" if torch.cuda.is_available() else "cpu"
19
 
20
+ logger.info(f"Testing full fine-tuned model on device: {device}")
21
 
22
  try:
23
  # Load tokenizer
24
  logger.info("Loading tokenizer...")
25
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
26
  if tokenizer.pad_token_id is None:
27
  tokenizer.pad_token_id = tokenizer.eos_token_id
28
 
29
+ # Load full fine-tuned model
30
+ logger.info("Loading full fine-tuned model...")
31
  model_kwargs = {
32
  "device_map": "auto" if device == "cuda" else "cpu",
33
+ "torch_dtype": torch.float16 if device == "cuda" else torch.float32,
34
  "trust_remote_code": True,
35
  "low_cpu_mem_usage": True,
36
  }
37
 
38
+ model = AutoModelForCausalLM.from_pretrained(model_id, **model_kwargs)
39
 
40
  # Test generation
41
  test_prompt = "Bonjour, comment allez-vous?"
 
55
  attention_mask=inputs['attention_mask'],
56
  pad_token_id=tokenizer.eos_token_id,
57
  eos_token_id=tokenizer.eos_token_id,
 
58
  )
59
 
60
  response = tokenizer.decode(output_ids[0], skip_special_tokens=True)
61
  assistant_response = response[len(test_prompt):].strip()
62
 
63
+ logger.info("✅ Full fine-tuned model test successful!")
64
  logger.info(f"Input: {test_prompt}")
65
  logger.info(f"Output: {assistant_response}")
66
 
67
+ # Check model precision status
68
+ logger.info("Checking model precision status...")
69
+ float16_layers = 0
70
+ float32_layers = 0
71
  total_layers = 0
72
  for name, module in model.named_modules():
73
  if hasattr(module, 'weight'):
74
  total_layers += 1
75
+ if module.weight.dtype == torch.float16:
76
+ float16_layers += 1
77
+ elif module.weight.dtype == torch.float32:
78
+ float32_layers += 1
79
+ logger.info(f"Float32 layer: {name} - {module.weight.dtype}")
80
 
81
+ logger.info(f"Float16 layers: {float16_layers}/{total_layers}")
82
+ logger.info(f"Float32 layers: {float32_layers}/{total_layers}")
83
 
84
  # Clean up
85
  del model
86
  torch.cuda.empty_cache() if device == "cuda" else None
87
 
88
  except Exception as e:
89
+ logger.error(f"❌ Full fine-tuned model test failed: {e}")
90
  import traceback
91
  traceback.print_exc()
92
 
93
  if __name__ == "__main__":
94
+ test_full_fine_tuned_model()