---
license: apache-2.0
language:
- en
base_model:
- Qwen/Qwen2.5-0.5B-Instruct
pipeline_tag: text-generation
library_name: transformers
tags:
- text-generation-inference
- CoT
- Reasoner
- Qwen
---
# FastThink-1.5B-Tiny

## Dataset Preparation
This script loads, processes, and combines multiple datasets into a single, standardized format suitable for training conversational AI models. It uses the `datasets` library to load and manipulate the datasets, and the `chat_templates` utilities to standardize the conversation format.
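
For orientation, the goal of the preprocessing is that every example ends up with a `conversations` column holding a list of role/content turns. The record below is a hypothetical illustration of that shape, not an actual row from the training data:

```python
# Assumed shape of one standardized record after processing
example = {
    "conversations": [
        {"role": "user", "content": "What is 2 + 2?"},
        {"role": "assistant", "content": "2 + 2 = 4."},
    ]
}
```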
### Example
```python
from datasets import load_dataset, concatenate_datasets
# get_chat_template and standardize_sharegpt come from the unsloth chat_templates utilities
from unsloth.chat_templates import get_chat_template, standardize_sharegpt

# Load the three source datasets
dataset1 = load_dataset("PowerInfer/LONGCOT-Refine-500K", split="train")
dataset2 = load_dataset("amphora/QwQ-LongCoT-130K", split="train")
dataset3 = load_dataset("AI-MO/NuminaMath-CoT", split="train")

# Map each dataset's columns into a common "conversations" column
dataset1 = dataset1.map(add_conversations_column, batched=False)
dataset2 = dataset2.map(add_conversations_column_prompt_qwq, batched=False)
dataset3 = dataset3.map(add_conversations_column_prompt_solution, batched=False)

# Combine all datasets
combined_dataset = concatenate_datasets([dataset1, dataset2, dataset3])

# Standardize using the ShareGPT format
combined_dataset = standardize_sharegpt(combined_dataset)

# Initialize the tokenizer with the Qwen 2.5 chat template
tokenizer = get_chat_template(tokenizer, chat_template="qwen-2.5")

# Apply the formatting function to the combined dataset
combined_dataset = combined_dataset.map(formatting_prompts_func, batched=True)

# Inspect a sample of the combined dataset to verify the output
print(combined_dataset[:50000])
```
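
The example above references a `tokenizer` and several helper functions that are not defined in this card. The sketch below shows one plausible way they could be written: the tokenizer source, the dataset column names (`prompt`, `response`, `problem`, `qwq`, `solution`), and the helper bodies are all assumptions made for illustration, not the exact code used to train this model.

```python
from transformers import AutoTokenizer

# Assumption: the tokenizer is taken from the base model listed in the metadata above.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

# Hypothetical per-example mapping helpers. The column names are assumptions
# about each dataset's schema and may need adjusting.
def add_conversations_column(example):
    example["conversations"] = [
        {"role": "user", "content": example["prompt"]},
        {"role": "assistant", "content": example["response"]},
    ]
    return example

def add_conversations_column_prompt_qwq(example):
    example["conversations"] = [
        {"role": "user", "content": example["problem"]},
        {"role": "assistant", "content": example["qwq"]},
    ]
    return example

def add_conversations_column_prompt_solution(example):
    example["conversations"] = [
        {"role": "user", "content": example["problem"]},
        {"role": "assistant", "content": example["solution"]},
    ]
    return example

# Hypothetical batched formatter: renders each conversation with the chat
# template into a single "text" field for training.
def formatting_prompts_func(examples):
    texts = [
        tokenizer.apply_chat_template(convo, tokenize=False, add_generation_prompt=False)
        for convo in examples["conversations"]
    ]
    return {"text": texts}
```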