George-API committed
Commit ae57ea2 · verified · 1 Parent(s): 4a1fd53

Upload folder using huggingface_hub
DEPLOY_CHECKLIST.md CHANGED
@@ -1,107 +1,52 @@
1
- # Phi-4 Training Space Deployment Checklist
2
-
3
- ## Critical Configuration Review
4
-
5
- Before updating the Hugging Face Space, verify each of these items to prevent deployment issues:
6
-
7
- ### 1. Model Configuration
8
-
9
- - [ ] Confirmed model name in transformers_config.json: `unsloth/phi-4-unsloth-bnb-4bit`
10
- - [ ] BF16 precision enabled, FP16 disabled (`"bf16": true, "fp16": false`)
11
- - [ ] Chat template correctly set to `"phi"` in config
12
- - [ ] LoRA parameters properly configured:
13
- - [ ] `r`: 32
14
- - [ ] `lora_alpha`: 16
15
- - [ ] `target_modules`: All required attention modules included
16
- - [ ] Max sequence length matches dataset needs (default: 2048)
17
-
18
- ### 2. GPU & Memory Management
19
-
20
- - [ ] Per-device batch size set to 16 or lower
21
- - [ ] Gradient accumulation steps set to 3 or higher
22
- - [ ] Device mapping set to "auto" for multi-GPU
23
- - [ ] Max memory limit set to 85% of each GPU's capacity
24
- - [ ] `PYTORCH_CUDA_ALLOC_CONF` includes `"expandable_segments:True"`
25
- - [ ] Gradient checkpointing enabled (`"gradient_checkpointing": true`)
26
- - [ ] Dataloader workers reduced to 2 (from 4)
27
- - [ ] FSDP configuration enabled for multi-GPU setups
28
-
29
- ### 3. Dataset Handling
30
-
31
- - [ ] Dataset configuration correctly specified in dataset_config.json
32
- - [ ] Conversation structure preserved (id + conversations fields)
33
- - [ ] SimpleDataCollator configured to use apply_chat_template
34
- - [ ] No re-ordering or sorting of the dataset (preserves original order)
35
- - [ ] Sequential sampler used in dataloader (no shuffling)
36
- - [ ] Max sequence length of 2048 applied
37
- - [ ] Format validation for first few examples enabled
38
-
39
- ### 4. Dependency Management ✓
40
-
41
- - [ ] requirements.txt includes all necessary packages:
42
- - [ ] unsloth
43
- - [ ] peft
44
- - [ ] bitsandbytes
45
- - [ ] einops
46
- - [ ] sentencepiece
47
- - [ ] datasets
48
- - [ ] transformers
49
- - [ ] Optional packages marked as such (e.g., flash-attn)
50
- - [ ] Dependency version constraints avoid known conflicts
51
-
52
- ### 5. Error Handling & Logging ✓
53
-
54
- - [ ] Proper error catching for dataset loading
55
- - [ ] Fallback mechanisms for chat template application
56
- - [ ] Clear, concise log messages that work with HF Space interface
57
- - [ ] Memory usage tracking at key points (start, end, periodic)
58
- - [ ] Third-party loggers set to WARNING to reduce noise
59
- - [ ] Low-verbosity log format for better HF Space compatibility
60
-
61
- ### 6. Training Setup ✓
62
-
63
- - [ ] Number of epochs properly configured (default: 3)
64
- - [ ] Learning rate appropriate (default: 2e-5)
65
- - [ ] Warmup ratio set (default: 0.05)
66
- - [ ] Checkpointing frequency set to reasonable value (default: 100 steps)
67
- - [ ] Output directory correctly configured
68
- - [ ] HuggingFace Hub parameters set correctly if pushing models
69
-
70
- ### 7. Pre-Flight Verification ✓
71
-
72
- - [ ] No linting errors or indentation issues
73
- - [ ] Updated config values are consistent across files
74
- - [ ] Batch size × gradient accumulation × GPUs gives reasonable total batch
75
- - [ ] Verified that requirements.txt matches actual imports in code
76
- - [ ] Confirmed tokenizer settings match the model requirements
77
 
78
  ---
79
 
80
- ## Last-Minute Configuration Changes
81
-
82
- If you've made any configuration changes, record them here before deployment:
83
 
84
- | Date | Parameter Changed | Old Value | New Value | Reason | Reviewer |
85
- |------|-------------------|-----------|-----------|--------|----------|
86
- | | | | | | |
87
- | | | | | | |
 
88
 
89
  ---
90
 
91
- ## Deployment Notes
92
 
93
- **Current Space Hardware**: NVIDIA L4 GPUs (24GB VRAM each)
94
-
95
- **Expected Training Speed**: ~XXX examples/second with current configuration
96
-
97
- **Memory Requirements**: Peak usage expected to be ~20GB per GPU
98
-
99
- **Common Issues to Watch For**:
100
- - OOM errors on GPU 0: If seen, reduce batch size by 2 and increase grad accumulation by 1
101
- - Imbalanced GPU usage: Check device mapping and FSDP configuration
102
- - Slow training: Verify that all GPUs are being utilized efficiently
103
- - Log flooding: Reduce verbosity of component logs (transformers, datasets, etc.)
104
-
105
- ---
106
 
107
  *Last Updated: 2025-03-09*
 
1
+ # Phi-4 Training Critical Deployment Checklist
2
+
3
+ ## Essential Configuration Requirements
4
+
5
+ ### 1. Model Configuration
6
+ - [ ] Model name: `unsloth/phi-4-unsloth-bnb-4bit`
7
+ - [ ] BF16 precision enabled, FP16 disabled
8
+ - [ ] Appropriate sequence length (2048)
9
+ - [ ] LoRA parameters correctly configured (r: 32, alpha: 16)
10
+
11
+ ### 2. Hardware & Resource Management
12
+ - [ ] Per-device batch size ≤ 16
13
+ - [ ] Gradient accumulation steps ≥ 3
14
+ - [ ] Gradient checkpointing enabled
15
+ - [ ] Memory usage limits properly set (85% of GPU capacity)
16
+
17
+ ### 3. Critical Dataset Handling Rules
18
+ - [ ] **NO REORDERING of dataset entries** - original order must be preserved
19
+ - [ ] **NO COMBINING of separate entries** - each entry must remain distinct
20
+ - [ ] **SEQUENTIAL PROCESSING required** - entries must be processed one after another
21
+ - [ ] `sort_by_id` and `maintain_paper_order` flags properly set to preserve data sequence
22
+ - [ ] Sequential sampler used with no shuffling (`"shuffle": false`)
23
+ - [ ] Dataset sequential integrity verified with validation samples
24
+ - [ ] Conversation structure preserved (original format maintained)
25
+
26
+ ### 4. Essential Error Handling
27
+ - [ ] Clear error catching for dataset loading issues
28
+ - [ ] Memory tracking at key training points
29
+ - [ ] Low-verbosity logging for HF Space compatibility
30
+
31
+ ### 5. Training Core Requirements
32
+ - [ ] Appropriate learning rate (2e-5)
33
+ - [ ] Proper checkpointing frequency
34
+ - [ ] Hub settings correctly configured for model saving
 
 
35
 
36
  ---
37
 
38
+ ## Pre-Deployment Verification
 
 
39
 
40
+ | Requirement | Status | Notes |
41
+ |-------------|--------|-------|
42
+ | Data sequential integrity | | Confirm entries processed in order |
43
+ | GPU memory within limits | | Check peak memory doesn't exceed 20GB per GPU |
44
+ | Training batch verification | | Verify first few batches maintain proper order |
45
 
46
  ---
47
 
48
+ **Current Hardware**: 4× NVIDIA L4 GPUs (24GB VRAM each)
49
 
50
+ **CRITICAL REMINDER**: Data sequence preservation is the highest priority - any shuffling, reordering, or combining of entries will compromise model quality.
 
 
51
 
52
  *Last Updated: 2025-03-09*
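For reference, the "sequential sampler used with no shuffling" item in section 3 above reduces to the following pattern. This is a minimal sketch, assuming a Hugging Face `datasets.Dataset` and a collator such as the repo's `SimpleDataCollator`; the helper name and defaults are illustrative, not part of the repo.

```python
from torch.utils.data import DataLoader, SequentialSampler

def build_sequential_dataloader(train_dataset, data_collator,
                                per_device_batch_size=16, num_gpus=1):
    """Return a DataLoader that walks the dataset in its original order."""
    # Mirror the batch-size calculation used in run_transformers_training.py:
    # per-device batch size x number of GPUs (at least 1).
    batch_size = max(per_device_batch_size * max(1, num_gpus), 1)
    return DataLoader(
        train_dataset,
        batch_size=batch_size,
        sampler=SequentialSampler(train_dataset),  # yields 0, 1, 2, ... and never shuffles
        collate_fn=data_collator,
        drop_last=False,   # process every example
        num_workers=2,     # checklist: keep dataloader workers modest
        pin_memory=True,
    )
```

Because the sampler fixes the order, any accidental shuffling has to be caught at configuration time (`"shuffle": false`), which is what the checks added to run_transformers_training.py in this commit enforce.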
app.py CHANGED
@@ -109,18 +109,9 @@ def display_config():
109
  def start_training():
110
  """Start the training process."""
111
  try:
112
- # Run verification script first
113
- log_info("Running pre-training verification...")
114
- verify_cmd = "python verify_deployment.py"
115
- try:
116
- result = subprocess.run(verify_cmd, shell=True, check=True, capture_output=True, text=True)
117
- if "All critical checks passed!" not in result.stdout:
118
- log_info("Verification found issues. Please review:")
119
- log_info(result.stdout)
120
- return "Verification detected potential issues. Please review the logs before proceeding."
121
- except subprocess.CalledProcessError as e:
122
- log_info(f"Verification failed: {e.stderr}")
123
- return "Verification failed. Please check the logs for details."
124
 
125
  # Start training
126
  log_info("Starting training process...")
 
109
  def start_training():
110
  """Start the training process."""
111
  try:
112
+ # Log configuration check
113
+ log_info("Preparing to start training process...")
114
+ log_info("Using consolidated configuration from transformers_config.json")
 
 
 
 
 
 
 
 
 
115
 
116
  # Start training
117
  log_info("Starting training process...")
run_transformers_training.py CHANGED
@@ -8,6 +8,14 @@ import argparse
8
  import logging
9
  from datetime import datetime
10
  import time
 
 
11
 
12
  # Import Unsloth first, before other ML imports
13
  try:
@@ -19,7 +27,6 @@ except ImportError:
19
  logger = logging.getLogger(__name__)
20
  logger.warning("Unsloth not available. Please install with: pip install unsloth")
21
 
22
- import torch
23
  from datasets import load_dataset
24
  from transformers import (
25
  AutoModelForCausalLM,
@@ -46,6 +53,9 @@ logging.getLogger("accelerate").setLevel(logging.WARNING)
46
  logging.getLogger("torch").setLevel(logging.WARNING)
47
  logging.getLogger("bitsandbytes").setLevel(logging.WARNING)
48
 
 
 
 
49
  # Define a clean logging function for HF Space compatibility
50
  def log_info(message):
51
  """Log information in a format compatible with Hugging Face Spaces"""
@@ -336,6 +346,45 @@ def load_dataset_with_mapping(dataset_config):
336
  # Note: Explicitly NOT sorting the dataset to preserve original order
337
  logger.info("Preserving original dataset order (no sorting)")
338
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
339
  # Log examples without printing full content
340
  if "conversations" in dataset.column_names:
341
  sample_ids = [example['id'] for example in dataset.select(range(min(5, len(dataset))))]
@@ -532,37 +581,107 @@ class SimpleDataCollator:
532
 
533
  class LoggingCallback(TrainerCallback):
534
  def __init__(self):
 
 
535
  self.last_log_time = time.time()
536
- self.last_memory_log_time = time.time()
 
 
 
537
 
538
  def on_step_end(self, args, state, control, **kwargs):
539
  # Log every 50 steps or every 5 minutes, whichever comes first
540
  current_time = time.time()
541
 
542
- # Log loss every 50 steps or 5 minutes
543
- if (state.global_step % 50 == 0) or (current_time - self.last_log_time > 300):
544
- if state.log_history:
545
- loss = state.log_history[-1].get('loss', 'N/A')
546
- # Use simple formatting for better HF Space log compatibility
547
- log_info(f"Step {state.global_step}: Loss {loss}")
548
- else:
549
- log_info(f"Step {state.global_step}: No loss data available")
550
- self.last_log_time = current_time
 
 
551
 
552
- # Log memory usage every 15 minutes
553
- if current_time - self.last_memory_log_time > 900: # 15 minutes
554
- if torch.cuda.is_available():
555
- memory_info = []
556
- for i in range(torch.cuda.device_count()):
557
- allocated = torch.cuda.memory_allocated(i) / 1024**2
558
- reserved = torch.cuda.memory_reserved(i) / 1024**2
559
- memory_info.append(f"GPU {i}: {allocated:.1f}MB/{reserved:.1f}MB")
560
-
561
- # Log in compact format for better visibility
562
- log_info(f"Memory usage - {', '.join(memory_info)}")
563
- self.last_memory_log_time = current_time
 
 
565
  def on_train_begin(self, args, state, control, **kwargs):
 
 
566
  log_info("=== Training is starting ===")
567
 
568
  # Log important training parameters for visibility
@@ -571,9 +690,9 @@ class LoggingCallback(TrainerCallback):
571
  log_info(f"Epochs: {args.num_train_epochs}")
572
 
573
  # Log memory information in compact format
574
- if torch.cuda.is_available():
575
  memory_info = []
576
- for i in range(torch.cuda.device_count()):
577
  allocated = torch.cuda.memory_allocated(i) / 1024**2
578
  max_mem = torch.cuda.max_memory_allocated(i) / 1024**2
579
  memory_info.append(f"GPU {i}: {allocated:.1f}MB (max: {max_mem:.1f}MB)")
@@ -581,15 +700,18 @@ class LoggingCallback(TrainerCallback):
581
  log_info(f"Initial memory usage - {', '.join(memory_info)}")
582
 
583
  def on_train_end(self, args, state, control, **kwargs):
584
- log_info("=== Training completed ===")
585
- if torch.cuda.is_available():
586
- memory_info = []
587
- for i in range(torch.cuda.device_count()):
588
- allocated = torch.cuda.memory_allocated(i) / 1024**2
589
- max_mem = torch.cuda.max_memory_allocated(i) / 1024**2
590
- memory_info.append(f"GPU {i}: {allocated:.1f}MB (max: {max_mem:.1f}MB)")
 
591
 
592
- log_info(f"Final memory usage - {', '.join(memory_info)}")
 
 
593
 
594
  log_info(f"Total steps: {state.global_step}")
595
  log_info(f"Final loss: {state.log_history[-1].get('loss', 'N/A') if state.log_history else 'N/A'}")
@@ -627,6 +749,15 @@ def main():
627
  # Set up logging
628
  log_info("Starting Phi-4 fine-tuning process")
629
 
 
 
 
 
 
 
 
 
 
630
  # Parse arguments
631
  args = parse_args()
632
 
@@ -645,64 +776,66 @@ def main():
645
  else:
646
  log_info("Running in non-distributed mode (single process)")
647
 
648
- # Load all configurations
649
  try:
650
  configs = load_configs(args.config_dir)
651
 
652
- # Extract specific configs
653
  if not configs:
654
  logger.error("Failed to load configuration")
655
  return 1
 
656
 
657
  # Verify configuration sections exist
658
- if "transformers" not in configs:
659
  logger.error("transformers_config.json not found or invalid")
660
  return 1
661
 
662
- if "hardware" not in configs or not configs["hardware"]:
663
  logger.warning("Hardware configuration section not found in transformers_config.json. Using default hardware configuration.")
664
 
665
- if "dataset" not in configs or not configs["dataset"]:
666
  logger.error("Dataset configuration section not found in transformers_config.json")
667
  return 1
668
 
669
  # Validate model configuration
670
- model_config = configs["transformers"]
671
- if not model_config.get("model", {}).get("name") and not model_config.get("model_name_or_path") and not model_config.get("model_name"):
 
 
 
672
  logger.error("Model name not specified in configuration")
673
  logger.error("Please ensure 'name' is specified under 'model' in transformers_config.json")
674
  return 1
675
 
676
- model_name = model_config.get("model", {}).get("name") or model_config.get("model_name_or_path") or model_config.get("model_name")
677
  log_info(f"Using model: {model_name}")
678
  log_info("All configurations loaded successfully")
679
 
680
- # Extract specific configs
681
- model_config = configs["transformers"]
682
- hardware_config = configs.get("hardware", {})
683
- dataset_config = configs["dataset"]
684
-
685
  # Apply hardware-specific settings if available
686
  if hardware_config:
687
  # Get training optimizations from hardware config
688
  training_opts = hardware_config.get("training_optimizations", {})
689
 
690
  # Apply batch size and gradient accumulation settings
691
- if training_opts.get("per_device_batch_size") and model_config.get("training"):
692
  batch_size = training_opts.get("per_device_batch_size")
693
- model_config["training"]["per_device_train_batch_size"] = batch_size
694
  log_info(f"Applied hardware-optimized batch size: {batch_size}")
695
 
696
- if training_opts.get("gradient_accumulation_steps") and model_config.get("training"):
697
  grad_steps = training_opts.get("gradient_accumulation_steps")
698
- model_config["training"]["gradient_accumulation_steps"] = grad_steps
699
  log_info(f"Applied hardware-optimized gradient accumulation: {grad_steps}")
700
 
701
  # Apply memory optimizations
702
  memory_opts = training_opts.get("memory_optimizations", {})
703
- if memory_opts.get("use_gradient_checkpointing") is not None and model_config.get("training"):
704
  grad_ckpt = memory_opts.get("use_gradient_checkpointing")
705
- model_config["training"]["gradient_checkpointing"] = grad_ckpt
706
  log_info(f"Applied hardware-optimized gradient checkpointing: {grad_ckpt}")
707
 
708
  # Apply system settings
@@ -720,38 +853,27 @@ def main():
720
  return 1
721
 
722
  # Set random seed for reproducibility
723
- seed = model_config.get("seed", 42)
724
  set_seed(seed)
725
  log_info(f"Set random seed to {seed} for reproducibility")
726
 
727
- # Check CUDA and set environment variables for better memory management
728
- if torch.cuda.is_available():
729
- # Empty CUDA cache
730
  torch.cuda.empty_cache()
 
 
731
 
732
- # Set memory management env vars for better fragmentation handling
733
- os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128,expandable_segments:True"
734
-
735
- # Get memory fraction from hardware config
736
- cuda_memory_fraction = hardware_config.get("system_settings", {}).get("cuda_memory_fraction", 0.85)
737
-
738
- # Log initial memory information in a compact form
739
- gpu_info = []
740
- for i in range(torch.cuda.device_count()):
741
- name = torch.cuda.get_device_name(i)
742
- allocated = torch.cuda.memory_allocated(i) / 1024**3
743
- total = torch.cuda.get_device_properties(i).total_memory / 1024**3
744
- reserved_memory = total * cuda_memory_fraction
745
- gpu_info.append(f"GPU {i}: {name} ({allocated:.1f}GB/{reserved_memory:.1f}GB)")
746
-
747
- log_info(f"Hardware: {torch.cuda.device_count()} GPUs detected")
748
- log_info(f"GPU details: {', '.join(gpu_info)}")
749
- else:
750
- log_info("No GPU detected, using CPU (training will be very slow)")
751
 
752
  try:
753
  log_info("Loading model and tokenizer...")
754
- model, tokenizer = load_model_and_tokenizer(model_config)
755
  log_info("Model and tokenizer loaded successfully")
756
 
757
  # Load dataset with proper mapping
@@ -781,25 +903,21 @@ def main():
781
  log_info("Using FP16 precision from hardware config")
782
  else:
783
  # Fall back to transformers config
784
- use_bf16 = model_config.get("bf16", False) or model_config.get("torch_dtype", "") == "bfloat16"
785
- use_fp16 = model_config.get("fp16", False) and not use_bf16 # Only use fp16 if bf16 is not set
786
  log_info(f"Using precision: {'bf16' if use_bf16 else 'fp16' if use_fp16 else 'full precision'}")
787
 
788
  # Get per device batch size - from transformers config, but possibly overridden by hardware config
789
- per_device_batch_size = model_config.get("training", {}).get("per_device_train_batch_size", 16)
790
- gradient_accumulation_steps = model_config.get("training", {}).get("gradient_accumulation_steps", 3)
791
 
792
  # For multi-GPU setup, adjust for better balance
793
- if torch.cuda.device_count() > 1:
794
- log_info(f"Multi-GPU setup with {torch.cuda.device_count()} GPUs")
795
- log_info(f"Training config: {per_device_batch_size} samples/GPU × {gradient_accumulation_steps} accumulation steps")
796
-
797
- # Determine multi-GPU strategy from hardware config
798
- multi_gpu_strategy = hardware_config.get("training_optimizations", {}).get("multi_gpu_strategy", "data_parallel")
799
 
800
  # Set up FSDP for multi-GPU training if specified and in distributed mode
801
  fsdp_config = None
802
- if multi_gpu_strategy == "fsdp" and is_distributed and torch.cuda.device_count() > 1:
803
  try:
804
  from torch.distributed.fsdp import (
805
  FullyShardedDataParallel as FSDP,
@@ -845,33 +963,33 @@ def main():
845
  # Set up training arguments
846
  log_info("Setting up training arguments")
847
  training_args = TrainingArguments(
848
- output_dir=model_config.get("output_dir", "./results") or model_config.get("checkpointing", {}).get("output_dir", "./results"),
849
- num_train_epochs=model_config.get("training", {}).get("num_train_epochs", 3),
850
  per_device_train_batch_size=per_device_batch_size,
851
  gradient_accumulation_steps=gradient_accumulation_steps,
852
- learning_rate=model_config.get("training", {}).get("learning_rate", 2e-5),
853
- weight_decay=model_config.get("training", {}).get("weight_decay", 0.01),
854
- warmup_ratio=model_config.get("training", {}).get("warmup_ratio", 0.05),
855
- lr_scheduler_type=model_config.get("training", {}).get("lr_scheduler_type", "cosine"),
856
- logging_steps=model_config.get("training", {}).get("logging_steps", 10),
857
- save_strategy=model_config.get("checkpointing", {}).get("save_strategy", "steps"),
858
- save_steps=model_config.get("checkpointing", {}).get("save_steps", 100),
859
- save_total_limit=model_config.get("checkpointing", {}).get("save_total_limit", 3),
860
  fp16=use_fp16,
861
  bf16=use_bf16,
862
- max_grad_norm=model_config.get("training", {}).get("max_grad_norm", 1.0),
863
- push_to_hub=model_config.get("huggingface_hub", {}).get("push_to_hub", False),
864
- hub_model_id=model_config.get("huggingface_hub", {}).get("hub_model_id", None),
865
  hub_token=os.environ.get("HF_TOKEN", None),
866
  report_to="tensorboard",
867
  remove_unused_columns=False, # Keep all columns
868
- gradient_checkpointing=model_config.get("training", {}).get("gradient_checkpointing", True),
869
  dataloader_pin_memory=pin_memory,
870
- optim=model_config.get("training", {}).get("optim", "adamw_torch"),
871
  ddp_find_unused_parameters=False, # Improve distributed training efficiency
872
  dataloader_drop_last=False, # Process all examples
873
  dataloader_num_workers=dataloader_workers,
874
- no_cuda=False if torch.cuda.is_available() else True, # Use CUDA if available
875
  # Only add FSDP if we're in distributed mode with FSDP strategy
876
  fsdp=fsdp_config if is_distributed and multi_gpu_strategy == "fsdp" else None,
877
  )
@@ -894,11 +1012,27 @@ def main():
894
  """Custom dataloader that preserves original dataset order"""
895
  log_info("Creating sequential dataloader to maintain original dataset order")
896
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
897
  # Calculate batch size based on device availability
898
  if getattr(training_args, "no_cuda", False):
899
  batch_size = training_args.per_device_train_batch_size
900
  else:
901
- batch_size = max(training_args.per_device_train_batch_size * max(1, torch.cuda.device_count()), 1)
902
 
903
  log_info(f"Using sequential sampler with batch size {batch_size}")
904
 
@@ -920,12 +1054,12 @@ def main():
920
  log_info("=== Starting Training ===")
921
  try:
922
  # Empty cache again right before training
923
- if torch.cuda.is_available():
924
  torch.cuda.empty_cache()
925
  log_info("Cleared CUDA cache before training")
926
 
927
  # Display compact training info
928
- total_steps = int(len(dataset) / (per_device_batch_size * torch.cuda.device_count() * gradient_accumulation_steps) * training_args.num_train_epochs)
929
  log_info(f"Training plan: {len(dataset)} examples over {training_args.num_train_epochs} epochs ≈ {total_steps} steps")
930
 
931
  trainer.train()
@@ -937,8 +1071,8 @@ def main():
937
  log_info(f"Model saved to {training_args.output_dir}")
938
 
939
  # Push to hub if enabled
940
- if model_config.get("huggingface_hub", {}).get("push_to_hub", False):
941
- hub_id = model_config.get("huggingface_hub", {}).get("hub_model_id", "model")
942
  log_info(f"Pushing model to Hugging Face Hub as {hub_id}...")
943
  trainer.push_to_hub()
944
  log_info("Model successfully pushed to Hub")
@@ -947,9 +1081,9 @@ def main():
947
  except Exception as e:
948
  logger.error(f"Training failed with error: {str(e)}")
949
  # Log CUDA memory info if available in compact format
950
- if torch.cuda.is_available():
951
  memory_info = []
952
- for i in range(torch.cuda.device_count()):
953
  allocated = torch.cuda.memory_allocated(i) / 1024**2
954
  reserved = torch.cuda.memory_reserved(i) / 1024**2
955
  max_mem = torch.cuda.max_memory_allocated(i) / 1024**2
 
8
  import logging
9
  from datetime import datetime
10
  import time
11
+ import warnings
12
+ import torch
13
+ from importlib.util import find_spec
14
+
15
+ # Global variables for hardware detection
16
+ CUDA_AVAILABLE = torch.cuda.is_available()
17
+ NUM_GPUS = torch.cuda.device_count() if CUDA_AVAILABLE else 0
18
+ DEVICE_TYPE = "cuda" if CUDA_AVAILABLE else "cpu"
19
 
20
  # Import Unsloth first, before other ML imports
21
  try:
 
27
  logger = logging.getLogger(__name__)
28
  logger.warning("Unsloth not available. Please install with: pip install unsloth")
29
 
 
30
  from datasets import load_dataset
31
  from transformers import (
32
  AutoModelForCausalLM,
 
53
  logging.getLogger("torch").setLevel(logging.WARNING)
54
  logging.getLogger("bitsandbytes").setLevel(logging.WARNING)
55
 
56
+ # Check availability of libraries
57
+ peft_available = find_spec("peft") is not None
58
+
59
  # Define a clean logging function for HF Space compatibility
60
  def log_info(message):
61
  """Log information in a format compatible with Hugging Face Spaces"""
 
346
  # Note: Explicitly NOT sorting the dataset to preserve original order
347
  logger.info("Preserving original dataset order (no sorting)")
348
 
349
+ # Check data ordering requirements
350
+ processing_config = dataset_config.get("dataset", {}).get("processing", {})
351
+ data_loading_config = dataset_config.get("data_loading", {})
352
+
353
+ # Flag consolidation - we only need one flag to control sequence preservation
354
+ # Default to True to ensure safety
355
+ preserve_sequence = processing_config.get("preserve_entry_sequence", True)
356
+ shuffle_disabled = not data_loading_config.get("shuffle", False)
357
+
358
+ if not preserve_sequence:
359
+ logger.warning("CRITICAL: preserve_entry_sequence is set to False. This is NOT RECOMMENDED!")
360
+ logger.warning("Data sequence integrity is essential for proper model training.")
361
+
362
+ if not shuffle_disabled:
363
+ logger.error("CRITICAL: shuffle is enabled in the dataset config!")
364
+ logger.error("This will RANDOMIZE your dataset and break sequential order.")
365
+ logger.error("Please set shuffle: false in your data_loading configuration.")
366
+ # Actually enforce sequence preservation by raising an error
367
+ raise ValueError("Dataset shuffling is enabled but preserve_entry_sequence is required. " +
368
+ "Please disable shuffling in your configuration.")
369
+
370
+ # Verify the IDs are in sequential order if they're numeric
371
+ try:
372
+ if len(dataset) > 1 and all(isinstance(example.get('id', ''), (int, str)) for example in dataset.select(range(min(10, len(dataset))))):
373
+ sample_ids = [example['id'] for example in dataset.select(range(min(10, len(dataset))))]
374
+ logger.info(f"Verifying sequential integrity with first few IDs: {sample_ids}")
375
+
376
+ # Check if IDs are numeric and ordered
377
+ if all(isinstance(id, int) or id.isdigit() for id in sample_ids):
378
+ numeric_ids = [int(id) if isinstance(id, str) else id for id in sample_ids]
379
+ is_ordered = all(numeric_ids[i] <= numeric_ids[i+1] for i in range(len(numeric_ids)-1))
380
+ if not is_ordered:
381
+ logger.warning("WARNING: Sample IDs are not in sequential order.")
382
+ logger.warning("This may indicate that data sequence is not preserved.")
383
+ else:
384
+ logger.info("Sample IDs appear to be in sequential order.")
385
+ except Exception as e:
386
+ logger.warning(f"Could not verify sequential integrity: {e}")
387
+
388
  # Log examples without printing full content
389
  if "conversations" in dataset.column_names:
390
  sample_ids = [example['id'] for example in dataset.select(range(min(5, len(dataset))))]
 
581
 
582
  class LoggingCallback(TrainerCallback):
583
  def __init__(self):
584
+ super().__init__()
585
+ self.training_started = time.time()
586
  self.last_log_time = time.time()
587
+ self.last_step = 0
588
+ self.verify_sequence = None
589
+ self.sequence_samples = None
590
+ self.sample_indices = None
591
 
592
  def on_step_end(self, args, state, control, **kwargs):
593
  # Log every 50 steps or every 5 minutes, whichever comes first
594
  current_time = time.time()
595
 
596
+ # Perform actual sequence integrity verification if enabled
597
+ if self.verify_sequence is True and state.global_step % 100 == 0 and self.sequence_samples:
598
+ try:
599
+ # Get a batch of data without disturbing the training
600
+ batch = next(iter(trainer.get_train_dataloader()))
601
+ if 'input_ids' in batch and 'labels' in batch:
602
+ log_info("Verifying data sequence integrity...")
603
+
604
+ # Check if we can access some of our reference samples
605
+ current_indices = list(range(min(3, len(trainer.train_dataset))))
606
+ current_samples = [trainer.train_dataset[i] for i in current_indices]
607
+
608
+ # Compare current samples with our reference samples from training start
609
+ is_sequence_maintained = True
610
+ for i, (orig_idx, orig_sample) in enumerate(zip(self.sample_indices, self.sequence_samples)):
611
+ # Check if sample IDs still match our reference
612
+ if orig_idx < len(current_samples):
613
+ current_sample = current_samples[i]
614
+
615
+ # Compare IDs if available
616
+ if 'id' in orig_sample and 'id' in current_sample:
617
+ if orig_sample['id'] != current_sample['id']:
618
+ log_info(f"WARNING: Sequence integrity compromised! Sample {i} ID changed from {orig_sample['id']} to {current_sample['id']}")
619
+ is_sequence_maintained = False
620
+
621
+ # Compare input fingerprints
622
+ if 'conversations' in orig_sample and 'conversations' in current_sample:
623
+ orig_len = len(orig_sample['conversations'])
624
+ curr_len = len(current_sample['conversations'])
625
+ if orig_len != curr_len:
626
+ log_info(f"WARNING: Sequence integrity compromised! Sample {i} conversation length changed from {orig_len} to {curr_len}")
627
+ is_sequence_maintained = False
628
+
629
+ if is_sequence_maintained:
630
+ log_info("Data sequence integrity check: OK")
631
+ else:
632
+ log_info("CRITICAL WARNING: Data sequence integrity check FAILED!")
633
+ except Exception as e:
634
+ log_info(f"Warning: Couldn't verify sequence integrity: {e}")
635
 
636
+ time_interval = current_time - self.last_log_time
637
+ step_interval = state.global_step - self.last_step
638
+
639
+ if step_interval >= 50 or time_interval >= 300: # 5 minutes = 300 seconds
640
+ # Calculate throughput
641
+ examples_per_second = step_interval * args.per_device_train_batch_size * args.gradient_accumulation_steps / max(time_interval, 1e-6)
642
+
643
+ elapsed_total = time.strftime("%H:%M:%S", time.gmtime(current_time - self.training_started))
644
+
645
+ # Log progress
646
+ log_info(f"Step: {state.global_step}, Loss: {state.log_history[-1]['loss']:.4f}, "
647
+ f"Rate: {examples_per_second:.2f} examples/sec, Elapsed: {elapsed_total}")
648
 
649
+ # Report memory usage if CUDA is available
650
+ if CUDA_AVAILABLE:
651
+ log_info(f"GPU Memory: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GB allocated, "
652
+ f"{torch.cuda.max_memory_reserved() / 1024**3:.2f} GB reserved")
653
+
654
+ # Reset for next interval
655
+ self.last_log_time = current_time
656
+ self.last_step = state.global_step
657
+
658
  def on_train_begin(self, args, state, control, **kwargs):
659
+ log_info(f"=== Training started at {time.strftime('%Y-%m-%d %H:%M:%S')} ===")
660
+ log_info(f"Model parameters: {sum(p.numel() for p in model.parameters())/1e6:.2f}M")
661
+
662
+ # Set up sequence verification with actual sample capturing
663
+ try:
664
+ self.verify_sequence = dataset_config.get("validation", {}).get("verify_sequence_integrity", False)
665
+ if self.verify_sequence:
666
+ log_info("Sequence integrity verification enabled during training")
667
+
668
+ # Save actual samples for later verification
669
+ if trainer and trainer.train_dataset:
670
+ # Get some reference samples from the beginning of the dataset
671
+ self.sample_indices = list(range(min(5, len(trainer.train_dataset))))
672
+ self.sequence_samples = [trainer.train_dataset[i] for i in self.sample_indices]
673
+ log_info(f"Captured {len(self.sequence_samples)} reference samples for sequence integrity verification")
674
+
675
+ # Log sample IDs for debugging
676
+ if len(self.sequence_samples) > 0 and 'id' in self.sequence_samples[0]:
677
+ sample_ids = [s.get('id') for s in self.sequence_samples if 'id' in s]
678
+ log_info(f"Reference sample IDs: {sample_ids}")
679
+ else:
680
+ log_info("Warning: Could not capture reference samples - verification will be limited")
681
+ except Exception as e:
682
+ log_info(f"Warning: Could not set up sequence integrity verification: {e}")
683
+ self.verify_sequence = False
684
+
685
  log_info("=== Training is starting ===")
686
 
687
  # Log important training parameters for visibility
 
690
  log_info(f"Epochs: {args.num_train_epochs}")
691
 
692
  # Log memory information in compact format
693
+ if CUDA_AVAILABLE:
694
  memory_info = []
695
+ for i in range(NUM_GPUS):
696
  allocated = torch.cuda.memory_allocated(i) / 1024**2
697
  max_mem = torch.cuda.max_memory_allocated(i) / 1024**2
698
  memory_info.append(f"GPU {i}: {allocated:.1f}MB (max: {max_mem:.1f}MB)")
 
700
  log_info(f"Initial memory usage - {', '.join(memory_info)}")
701
 
702
  def on_train_end(self, args, state, control, **kwargs):
703
+ training_time = time.strftime("%H:%M:%S", time.gmtime(time.time() - self.training_started))
704
+ log_info(f"=== Training completed in {training_time} ===")
705
+
706
+ # Log final memory usage
707
+ if CUDA_AVAILABLE:
708
+ for i in range(NUM_GPUS):
709
+ max_mem = torch.cuda.max_memory_allocated(i) / 1024**3 # GB
710
+ log_info(f"GPU {i} max memory: {max_mem:.2f} GB")
711
 
712
+ # Clear GPU memory
713
+ torch.cuda.empty_cache()
714
+ log_info("GPU memory cleared")
715
 
716
  log_info(f"Total steps: {state.global_step}")
717
  log_info(f"Final loss: {state.log_history[-1].get('loss', 'N/A') if state.log_history else 'N/A'}")
 
749
  # Set up logging
750
  log_info("Starting Phi-4 fine-tuning process")
751
 
752
+ # Log hardware information
753
+ log_info(f"Hardware detection: CUDA {'available' if CUDA_AVAILABLE else 'not available'}")
754
+ if CUDA_AVAILABLE:
755
+ log_info(f"Found {NUM_GPUS} GPUs")
756
+ for i in range(NUM_GPUS):
757
+ log_info(f" GPU {i}: {torch.cuda.get_device_name(i)}")
758
+ else:
759
+ log_info("Running on CPU (training will be very slow)")
760
+
761
  # Parse arguments
762
  args = parse_args()
763
 
 
776
  else:
777
  log_info("Running in non-distributed mode (single process)")
778
 
779
+ # Load all configurations - do this once
780
  try:
781
  configs = load_configs(args.config_dir)
782
 
783
+ # Extract specific configs immediately after loading
784
  if not configs:
785
  logger.error("Failed to load configuration")
786
  return 1
787
+
788
+ # Store configurations in clear variables
789
+ transformers_config = configs.get("transformers", {})
790
+ hardware_config = configs.get("hardware", {})
791
+ dataset_config = configs.get("dataset", {})
792
 
793
  # Verify configuration sections exist
794
+ if not transformers_config:
795
  logger.error("transformers_config.json not found or invalid")
796
  return 1
797
 
798
+ if not hardware_config:
799
  logger.warning("Hardware configuration section not found in transformers_config.json. Using default hardware configuration.")
800
 
801
+ if not dataset_config:
802
  logger.error("Dataset configuration section not found in transformers_config.json")
803
  return 1
804
 
805
  # Validate model configuration
806
+ model_name = (transformers_config.get("model", {}).get("name") or
807
+ transformers_config.get("model_name_or_path") or
808
+ transformers_config.get("model_name"))
809
+
810
+ if not model_name:
811
  logger.error("Model name not specified in configuration")
812
  logger.error("Please ensure 'name' is specified under 'model' in transformers_config.json")
813
  return 1
814
 
 
815
  log_info(f"Using model: {model_name}")
816
  log_info("All configurations loaded successfully")
817
 
 
 
 
 
 
818
  # Apply hardware-specific settings if available
819
  if hardware_config:
820
  # Get training optimizations from hardware config
821
  training_opts = hardware_config.get("training_optimizations", {})
822
 
823
  # Apply batch size and gradient accumulation settings
824
+ if training_opts.get("per_device_batch_size") and transformers_config.get("training"):
825
  batch_size = training_opts.get("per_device_batch_size")
826
+ transformers_config["training"]["per_device_train_batch_size"] = batch_size
827
  log_info(f"Applied hardware-optimized batch size: {batch_size}")
828
 
829
+ if training_opts.get("gradient_accumulation_steps") and transformers_config.get("training"):
830
  grad_steps = training_opts.get("gradient_accumulation_steps")
831
+ transformers_config["training"]["gradient_accumulation_steps"] = grad_steps
832
  log_info(f"Applied hardware-optimized gradient accumulation: {grad_steps}")
833
 
834
  # Apply memory optimizations
835
  memory_opts = training_opts.get("memory_optimizations", {})
836
+ if memory_opts.get("use_gradient_checkpointing") is not None and transformers_config.get("training"):
837
  grad_ckpt = memory_opts.get("use_gradient_checkpointing")
838
+ transformers_config["training"]["gradient_checkpointing"] = grad_ckpt
839
  log_info(f"Applied hardware-optimized gradient checkpointing: {grad_ckpt}")
840
 
841
  # Apply system settings
 
853
  return 1
854
 
855
  # Set random seed for reproducibility
856
+ seed = transformers_config.get("seed", 42)
857
  set_seed(seed)
858
  log_info(f"Set random seed to {seed} for reproducibility")
859
 
860
+ # Empty CUDA cache to ensure clean state
861
+ if CUDA_AVAILABLE:
 
862
  torch.cuda.empty_cache()
863
+ log_info("Cleared CUDA cache")
864
+
865
+ # Setup environment variable for CUDA memory allocation
866
+ if CUDA_AVAILABLE:
867
+ system_settings = hardware_config.get("system_settings", {})
868
+ cuda_memory_fraction = system_settings.get("cuda_memory_fraction", 0.85)
869
 
870
+ if cuda_memory_fraction < 1.0:
871
+ os.environ["PYTORCH_CUDA_ALLOC_CONF"] = f"max_split_size_mb:128,expandable_segments:True"
872
+ log_info(f"Set CUDA memory allocation limit to expandable with max_split_size_mb:128")
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
873
 
874
  try:
875
  log_info("Loading model and tokenizer...")
876
+ model, tokenizer = load_model_and_tokenizer(transformers_config)
877
  log_info("Model and tokenizer loaded successfully")
878
 
879
  # Load dataset with proper mapping
 
903
  log_info("Using FP16 precision from hardware config")
904
  else:
905
  # Fall back to transformers config
906
+ use_bf16 = transformers_config.get("bf16", False) or transformers_config.get("torch_dtype", "") == "bfloat16"
907
+ use_fp16 = transformers_config.get("fp16", False) and not use_bf16 # Only use fp16 if bf16 is not set
908
  log_info(f"Using precision: {'bf16' if use_bf16 else 'fp16' if use_fp16 else 'full precision'}")
909
 
910
  # Get per device batch size - from transformers config, but possibly overridden by hardware config
911
+ per_device_batch_size = transformers_config.get("training", {}).get("per_device_train_batch_size", 16)
912
+ gradient_accumulation_steps = transformers_config.get("training", {}).get("gradient_accumulation_steps", 3)
913
 
914
  # For multi-GPU setup, adjust for better balance
915
+ if CUDA_AVAILABLE and NUM_GPUS > 1:
916
+ log_info(f"Multi-GPU setup: Adjusting for {NUM_GPUS} GPUs")
 
 
 
 
917
 
918
  # Set up FSDP for multi-GPU training if specified and in distributed mode
919
  fsdp_config = None
920
+ if multi_gpu_strategy == "fsdp" and is_distributed and NUM_GPUS > 1:
921
  try:
922
  from torch.distributed.fsdp import (
923
  FullyShardedDataParallel as FSDP,
 
963
  # Set up training arguments
964
  log_info("Setting up training arguments")
965
  training_args = TrainingArguments(
966
+ output_dir=transformers_config.get("output_dir", "./results") or transformers_config.get("checkpointing", {}).get("output_dir", "./results"),
967
+ num_train_epochs=transformers_config.get("training", {}).get("num_train_epochs", 3),
968
  per_device_train_batch_size=per_device_batch_size,
969
  gradient_accumulation_steps=gradient_accumulation_steps,
970
+ learning_rate=transformers_config.get("training", {}).get("learning_rate", 2e-5),
971
+ weight_decay=transformers_config.get("training", {}).get("weight_decay", 0.01),
972
+ warmup_ratio=transformers_config.get("training", {}).get("warmup_ratio", 0.05),
973
+ lr_scheduler_type=transformers_config.get("training", {}).get("lr_scheduler_type", "cosine"),
974
+ logging_steps=transformers_config.get("training", {}).get("logging_steps", 10),
975
+ save_strategy=transformers_config.get("checkpointing", {}).get("save_strategy", "steps"),
976
+ save_steps=transformers_config.get("checkpointing", {}).get("save_steps", 100),
977
+ save_total_limit=transformers_config.get("checkpointing", {}).get("save_total_limit", 3),
978
  fp16=use_fp16,
979
  bf16=use_bf16,
980
+ max_grad_norm=transformers_config.get("training", {}).get("max_grad_norm", 1.0),
981
+ push_to_hub=transformers_config.get("huggingface_hub", {}).get("push_to_hub", False),
982
+ hub_model_id=transformers_config.get("huggingface_hub", {}).get("hub_model_id", None),
983
  hub_token=os.environ.get("HF_TOKEN", None),
984
  report_to="tensorboard",
985
  remove_unused_columns=False, # Keep all columns
986
+ gradient_checkpointing=transformers_config.get("training", {}).get("gradient_checkpointing", True),
987
  dataloader_pin_memory=pin_memory,
988
+ optim=transformers_config.get("training", {}).get("optim", "adamw_torch"),
989
  ddp_find_unused_parameters=False, # Improve distributed training efficiency
990
  dataloader_drop_last=False, # Process all examples
991
  dataloader_num_workers=dataloader_workers,
992
+ no_cuda=False if CUDA_AVAILABLE else True, # Use CUDA if available
993
  # Only add FSDP if we're in distributed mode with FSDP strategy
994
  fsdp=fsdp_config if is_distributed and multi_gpu_strategy == "fsdp" else None,
995
  )
 
1012
  """Custom dataloader that preserves original dataset order"""
1013
  log_info("Creating sequential dataloader to maintain original dataset order")
1014
 
1015
+ # Verification of sequence preservation flags - consolidated
1016
+ data_loading_config = dataset_config.get("data_loading", {})
1017
+ sequential_processing = data_loading_config.get("sequential_processing", True)
1018
+ shuffle_disabled = not data_loading_config.get("shuffle", False)
1019
+
1020
+ if not sequential_processing:
1021
+ log_info("CRITICAL WARNING: sequential_processing flag is disabled! This may affect data order.")
1022
+ log_info("Data sequence integrity is essential - using sequential sampler regardless of flag.")
1023
+ # Force sequential processing regardless of flag
1024
+
1025
+ if not shuffle_disabled:
1026
+ log_info("CRITICAL ERROR: Shuffle is not disabled! This will randomize data entry order!")
1027
+ # Actually handle the error rather than just logging it
1028
+ raise ValueError("Dataset shuffling is enabled but sequential processing is required. " +
1029
+ "Please disable shuffling in your configuration.")
1030
+
1031
  # Calculate batch size based on device availability
1032
  if getattr(training_args, "no_cuda", False):
1033
  batch_size = training_args.per_device_train_batch_size
1034
  else:
1035
+ batch_size = max(training_args.per_device_train_batch_size * max(1, NUM_GPUS), 1)
1036
 
1037
  log_info(f"Using sequential sampler with batch size {batch_size}")
1038
 
 
1054
  log_info("=== Starting Training ===")
1055
  try:
1056
  # Empty cache again right before training
1057
+ if CUDA_AVAILABLE:
1058
  torch.cuda.empty_cache()
1059
  log_info("Cleared CUDA cache before training")
1060
 
1061
  # Display compact training info
1062
+ total_steps = int(len(dataset) / (per_device_batch_size * NUM_GPUS * gradient_accumulation_steps) * training_args.num_train_epochs)
1063
  log_info(f"Training plan: {len(dataset)} examples over {training_args.num_train_epochs} epochs ≈ {total_steps} steps")
1064
 
1065
  trainer.train()
 
1071
  log_info(f"Model saved to {training_args.output_dir}")
1072
 
1073
  # Push to hub if enabled
1074
+ if transformers_config.get("huggingface_hub", {}).get("push_to_hub", False):
1075
+ hub_id = transformers_config.get("huggingface_hub", {}).get("hub_model_id", "model")
1076
  log_info(f"Pushing model to Hugging Face Hub as {hub_id}...")
1077
  trainer.push_to_hub()
1078
  log_info("Model successfully pushed to Hub")
 
1081
  except Exception as e:
1082
  logger.error(f"Training failed with error: {str(e)}")
1083
  # Log CUDA memory info if available in compact format
1084
+ if CUDA_AVAILABLE:
1085
  memory_info = []
1086
+ for i in range(NUM_GPUS):
1087
  allocated = torch.cuda.memory_allocated(i) / 1024**2
1088
  reserved = torch.cuda.memory_reserved(i) / 1024**2
1089
  max_mem = torch.cuda.max_memory_allocated(i) / 1024**2
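The rewritten main() follows one pattern throughout: load the consolidated config once, pull out the transformers / hardware / dataset sections, then let the hardware section override the generic training block. A condensed sketch of that override step, assuming the same key names used in the diff; the function itself is illustrative, not part of the commit.

```python
def apply_hardware_overrides(transformers_config, hardware_config):
    """Let hardware-specific training optimizations override the generic training config."""
    training = transformers_config.get("training")
    opts = hardware_config.get("training_optimizations", {})
    if not training or not opts:
        return transformers_config

    if opts.get("per_device_batch_size"):
        training["per_device_train_batch_size"] = opts["per_device_batch_size"]
    if opts.get("gradient_accumulation_steps"):
        training["gradient_accumulation_steps"] = opts["gradient_accumulation_steps"]

    memory_opts = opts.get("memory_optimizations", {})
    if memory_opts.get("use_gradient_checkpointing") is not None:
        training["gradient_checkpointing"] = memory_opts["use_gradient_checkpointing"]

    return transformers_config
```

The overridden values then flow into TrainingArguments exactly as shown in the hunk above (per_device_train_batch_size, gradient_accumulation_steps, gradient_checkpointing).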
transformers_config.json CHANGED
@@ -139,6 +139,7 @@
139
  "processing": {
140
  "sort_by_id": true,
141
  "maintain_paper_order": true,
 
142
  "max_seq_length": 2048
143
  }
144
  },
@@ -159,6 +160,7 @@
159
  "data_loading": {
160
  "batch_size": 24,
161
  "shuffle": false,
 
162
  "drop_last": false,
163
  "num_workers": 4,
164
  "pin_memory": true,
@@ -167,6 +169,7 @@
167
  "validation": {
168
  "log_samples": 3,
169
  "log_interval": 50,
 
170
  "metrics": ["processed", "skipped", "avg_tokens", "unique_papers"]
171
  }
172
  }
 
139
  "processing": {
140
  "sort_by_id": true,
141
  "maintain_paper_order": true,
142
+ "preserve_entry_sequence": true,
143
  "max_seq_length": 2048
144
  }
145
  },
 
160
  "data_loading": {
161
  "batch_size": 24,
162
  "shuffle": false,
163
+ "sequential_processing": true,
164
  "drop_last": false,
165
  "num_workers": 4,
166
  "pin_memory": true,
 
169
  "validation": {
170
  "log_samples": 3,
171
  "log_interval": 50,
172
+ "verify_sequence_integrity": true,
173
  "metrics": ["processed", "skipped", "avg_tokens", "unique_papers"]
174
  }
175
  }
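The three flags added here only help if they agree with each other and with `"shuffle": false`. A small consistency check over the dataset section, written against the same lookups run_transformers_training.py performs (`dataset.processing`, `data_loading`, `validation`); the function is a sketch, not part of the commit.

```python
def check_sequence_flags(dataset_config):
    """Check that the sequence-preservation flags added in this commit are consistent."""
    processing = dataset_config.get("dataset", {}).get("processing", {})
    loading = dataset_config.get("data_loading", {})
    validation = dataset_config.get("validation", {})

    issues = []
    if not processing.get("preserve_entry_sequence", True):
        issues.append("processing.preserve_entry_sequence should stay true")
    if loading.get("shuffle", False):
        issues.append("data_loading.shuffle must be false to preserve entry order")
    if not loading.get("sequential_processing", True):
        issues.append("data_loading.sequential_processing should stay true")
    if not validation.get("verify_sequence_integrity", False):
        issues.append("validation.verify_sequence_integrity is disabled")
    return issues
```

Raising on a non-empty result mirrors the ValueError the training script now throws when shuffling is detected.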