File size: 1,626 Bytes
a8f3e59 e4652d5 a8f3e59 c6d4741 66b9fbb c6d4741 33bc1de |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 |
---
license: apache-2.0
language:
- en
base_model:
- Qwen/Qwen2.5-0.5B-Instruct
pipeline_tag: text-generation
library_name: transformers
tags:
- text-generation-inference
- CoT
- Reasoner
- Qwen
---
# **FastThink-1.5B-Tiny**
# **Dataset Preparation**
This script is designed to load, process, and combine multiple datasets into a single, standardized format suitable for training conversational AI models. The script uses the `datasets` library to load and manipulate the datasets, and the `chat_templates` library to standardize the conversation format.
## Example
```python
# Load the initial three datasets
dataset1 = load_dataset("PowerInfer/LONGCOT-Refine-500K", split="train")
dataset2 = load_dataset("amphora/QwQ-LongCoT-130K", split="train")
dataset3 = load_dataset("AI-MO/NuminaMath-CoT", split="train")
# Map conversation columns for all datasets
dataset1 = dataset1.map(add_conversations_column, batched=False)
dataset2 = dataset2.map(add_conversations_column_prompt_qwq, batched=False)
dataset3 = dataset3.map(add_conversations_column_prompt_solution, batched=False)
# Combine all datasets
combined_dataset = concatenate_datasets([dataset1, dataset2, dataset3])
# Standardize using the ShareGPT format
combined_dataset = standardize_sharegpt(combined_dataset)
# Initialize the tokenizer with a specific chat template
tokenizer = get_chat_template(tokenizer, chat_template="qwen-2.5")
# Apply formatting function to the combined dataset
combined_dataset = combined_dataset.map(formatting_prompts_func, batched=True)
# Print the first few examples to verify the output
print(combined_dataset[:50000])
``` |