File size: 1,626 Bytes
a8f3e59
 
 
 
 
e4652d5
a8f3e59
 
 
 
 
 
 
c6d4741
66b9fbb
c6d4741
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
33bc1de
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
---
license: apache-2.0
language:
- en
base_model:
- Qwen/Qwen2.5-0.5B-Instruct
pipeline_tag: text-generation
library_name: transformers
tags:
- text-generation-inference
- CoT
- Reasoner
- Qwen
---
# **FastThink-1.5B-Tiny** 


# **Dataset Preparation**

This script is designed to load, process, and combine multiple datasets into a single, standardized format suitable for training conversational AI models. The script uses the `datasets` library to load and manipulate the datasets, and the `chat_templates` library to standardize the conversation format.

## Example

```python
# Load the initial three datasets
dataset1 = load_dataset("PowerInfer/LONGCOT-Refine-500K", split="train")
dataset2 = load_dataset("amphora/QwQ-LongCoT-130K", split="train")
dataset3 = load_dataset("AI-MO/NuminaMath-CoT", split="train")

# Map conversation columns for all datasets
dataset1 = dataset1.map(add_conversations_column, batched=False)
dataset2 = dataset2.map(add_conversations_column_prompt_qwq, batched=False)
dataset3 = dataset3.map(add_conversations_column_prompt_solution, batched=False)

# Combine all datasets
combined_dataset = concatenate_datasets([dataset1, dataset2, dataset3])

# Standardize using the ShareGPT format
combined_dataset = standardize_sharegpt(combined_dataset)

# Initialize the tokenizer with a specific chat template
tokenizer = get_chat_template(tokenizer, chat_template="qwen-2.5")

# Apply formatting function to the combined dataset
combined_dataset = combined_dataset.map(formatting_prompts_func, batched=True)

# Print the first few examples to verify the output
print(combined_dataset[:50000])
```