FastThink-0.5B-Tiny / README.md
prithivMLmods's picture
Update README.md
01c5070 verified
|
raw
history blame
1.63 kB
metadata
license: apache-2.0
language:
  - en
base_model:
  - Qwen/Qwen2.5-0.5B-Instruct
pipeline_tag: text-generation
library_name: transformers
tags:
  - text-generation-inference
  - CoT
  - Reasoner
  - Qwen

FastThink-1.5B-Tiny

Dataset Preparation

This script is designed to load, process, and combine multiple datasets into a single, standardized format suitable for training conversational AI models. The script uses the datasets library to load and manipulate the datasets, and the chat_templates library to standardize the conversation format.

Example

# Load the initial three datasets
dataset1 = load_dataset("PowerInfer/LONGCOT-Refine-500K", split="train")
dataset2 = load_dataset("amphora/QwQ-LongCoT-130K", split="train")
dataset3 = load_dataset("AI-MO/NuminaMath-CoT", split="train")

# Map conversation columns for all datasets
dataset1 = dataset1.map(add_conversations_column, batched=False)
dataset2 = dataset2.map(add_conversations_column_prompt_qwq, batched=False)
dataset3 = dataset3.map(add_conversations_column_prompt_solution, batched=False)

# Combine all datasets
combined_dataset = concatenate_datasets([dataset1, dataset2, dataset3])

# Standardize using the ShareGPT format
combined_dataset = standardize_sharegpt(combined_dataset)

# Initialize the tokenizer with a specific chat template
tokenizer = get_chat_template(tokenizer, chat_template="qwen-2.5")

# Apply formatting function to the combined dataset
combined_dataset = combined_dataset.map(formatting_prompts_func, batched=True)

# Print the first few examples to verify the output
print(combined_dataset[:50000])