---
license: apache-2.0
language:
- en
base_model:
- Qwen/Qwen2.5-0.5B-Instruct
pipeline_tag: text-generation
library_name: transformers
tags:
- text-generation-inference
- CoT
- Reasoner
- Qwen
---

# **FastThink-1.5B-Tiny**

# **Dataset Preparation**

This script loads, processes, and combines multiple datasets into a single, standardized format suitable for training conversational AI models. It uses the `datasets` library to load and manipulate the datasets, and chat-template utilities (here assumed to come from Unsloth's `chat_templates` module) to standardize the conversation format. The dataset-specific mapping helpers and `formatting_prompts_func` used in the example are sketched after the code block.

## Example

```python
from datasets import load_dataset, concatenate_datasets
from transformers import AutoTokenizer
# Assumed source of get_chat_template / standardize_sharegpt (Unsloth's chat-template helpers)
from unsloth.chat_templates import get_chat_template, standardize_sharegpt

# Load the base-model tokenizer
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

# Load the three source datasets
dataset1 = load_dataset("PowerInfer/LONGCOT-Refine-500K", split="train")
dataset2 = load_dataset("amphora/QwQ-LongCoT-130K", split="train")
dataset3 = load_dataset("AI-MO/NuminaMath-CoT", split="train")

# Map each dataset's columns onto a shared "conversations" column
dataset1 = dataset1.map(add_conversations_column, batched=False)
dataset2 = dataset2.map(add_conversations_column_prompt_qwq, batched=False)
dataset3 = dataset3.map(add_conversations_column_prompt_solution, batched=False)

# Combine all datasets
combined_dataset = concatenate_datasets([dataset1, dataset2, dataset3])

# Standardize using the ShareGPT format
combined_dataset = standardize_sharegpt(combined_dataset)

# Configure the tokenizer with the Qwen-2.5 chat template
tokenizer = get_chat_template(tokenizer, chat_template="qwen-2.5")

# Apply the formatting function to the combined dataset
combined_dataset = combined_dataset.map(formatting_prompts_func, batched=True)

# Print the first few examples to verify the output
print(combined_dataset[:5])
```
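The mapping helpers and `formatting_prompts_func` referenced above are not included in this card. The sketch below shows one minimal way they could be written, assuming the source datasets expose columns named `prompt`/`response`, `problem`/`qwq`, and `problem`/`solution` respectively; verify the actual column names against each dataset's schema on the Hub before reusing them.

```python
# Hypothetical helper definitions for the pipeline above.
# Column names (prompt/response, problem/qwq, problem/solution) are assumptions;
# check each dataset's schema before relying on them.

def add_conversations_column(example):
    # PowerInfer/LONGCOT-Refine-500K: map a prompt/response pair to ShareGPT-style turns
    return {"conversations": [
        {"from": "human", "value": example["prompt"]},
        {"from": "gpt", "value": example["response"]},
    ]}

def add_conversations_column_prompt_qwq(example):
    # amphora/QwQ-LongCoT-130K: pair the problem with the long chain-of-thought answer
    return {"conversations": [
        {"from": "human", "value": example["problem"]},
        {"from": "gpt", "value": example["qwq"]},
    ]}

def add_conversations_column_prompt_solution(example):
    # AI-MO/NuminaMath-CoT: pair the problem with its worked solution
    return {"conversations": [
        {"from": "human", "value": example["problem"]},
        {"from": "gpt", "value": example["solution"]},
    ]}

def formatting_prompts_func(examples):
    # Render each standardized conversation into a single training string.
    # `tokenizer` is the chat-template-configured tokenizer created in the example above.
    texts = [
        tokenizer.apply_chat_template(convo, tokenize=False, add_generation_prompt=False)
        for convo in examples["conversations"]
    ]
    return {"text": texts}
```

With these helpers, the final `combined_dataset` carries a `text` column of fully templated conversations, which is the form a supervised fine-tuning trainer would typically consume.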