|
--- |
|
license: apache-2.0 |
|
language: |
|
- en |
|
base_model: |
|
- Qwen/Qwen2.5-0.5B-Instruct |
|
pipeline_tag: text-generation |
|
library_name: transformers |
|
tags: |
|
- text-generation-inference |
|
- CoT |
|
- Reasoner |
|
- Qwen |
|
--- |
|
# **FastThink-1.5B-Tiny** |
|
|
|
|
|
# **Dataset Preparation** |
|
|
|
This script is designed to load, process, and combine multiple datasets into a single, standardized format suitable for training conversational AI models. The script uses the `datasets` library to load and manipulate the datasets, and the `chat_templates` library to standardize the conversation format. |
|
|
|
## Example |
|
|
|
```python |
|
# Load the initial three datasets |
|
dataset1 = load_dataset("PowerInfer/LONGCOT-Refine-500K", split="train") |
|
dataset2 = load_dataset("amphora/QwQ-LongCoT-130K", split="train") |
|
dataset3 = load_dataset("AI-MO/NuminaMath-CoT", split="train") |
|
|
|
# Map conversation columns for all datasets |
|
dataset1 = dataset1.map(add_conversations_column, batched=False) |
|
dataset2 = dataset2.map(add_conversations_column_prompt_qwq, batched=False) |
|
dataset3 = dataset3.map(add_conversations_column_prompt_solution, batched=False) |
|
|
|
# Combine all datasets |
|
combined_dataset = concatenate_datasets([dataset1, dataset2, dataset3]) |
|
|
|
# Standardize using the ShareGPT format |
|
combined_dataset = standardize_sharegpt(combined_dataset) |
|
|
|
# Initialize the tokenizer with a specific chat template |
|
tokenizer = get_chat_template(tokenizer, chat_template="qwen-2.5") |
|
|
|
# Apply formatting function to the combined dataset |
|
combined_dataset = combined_dataset.map(formatting_prompts_func, batched=True) |
|
|
|
# Print the first few examples to verify the output |
|
print(combined_dataset[:50000]) |
|
``` |