--- license: other library_name: transformers tags: - finetune - synthetic data - custom_code - qwen2 - COT datasets: - kaist-ai/CoT-Collection license_name: tongyi-qianwen-research license_link: https://huggingface.co/Qwen/Qwen1.5-1.8B-Chat/raw/main/LICENSE model-index: - name: Reyna-CoT-4B-v0.1 results: - task: type: text-generation name: Text Generation dataset: name: AI2 Reasoning Challenge (25-Shot) type: ai2_arc config: ARC-Challenge split: test args: num_few_shot: 25 metrics: - type: acc_norm value: 44.71 name: normalized accuracy source: url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=aloobun/Reyna-CoT-4B-v0.1 name: Open LLM Leaderboard - task: type: text-generation name: Text Generation dataset: name: HellaSwag (10-Shot) type: hellaswag split: validation args: num_few_shot: 10 metrics: - type: acc_norm value: 71.12 name: normalized accuracy source: url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=aloobun/Reyna-CoT-4B-v0.1 name: Open LLM Leaderboard - task: type: text-generation name: Text Generation dataset: name: MMLU (5-Shot) type: cais/mmlu config: all split: test args: num_few_shot: 5 metrics: - type: acc value: 55.9 name: accuracy source: url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=aloobun/Reyna-CoT-4B-v0.1 name: Open LLM Leaderboard - task: type: text-generation name: Text Generation dataset: name: TruthfulQA (0-shot) type: truthful_qa config: multiple_choice split: validation args: num_few_shot: 0 metrics: - type: mc2 value: 43.09 source: url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=aloobun/Reyna-CoT-4B-v0.1 name: Open LLM Leaderboard - task: type: text-generation name: Text Generation dataset: name: Winogrande (5-shot) type: winogrande config: winogrande_xl split: validation args: num_few_shot: 5 metrics: - type: acc value: 67.72 name: accuracy source: url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=aloobun/Reyna-CoT-4B-v0.1 name: Open LLM Leaderboard - task: type: text-generation name: Text Generation dataset: name: GSM8k (5-shot) type: gsm8k config: main split: test args: num_few_shot: 5 metrics: - type: acc value: 16.91 name: accuracy source: url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=aloobun/Reyna-CoT-4B-v0.1 name: Open LLM Leaderboard --- ![Reyna aloobun qwen4B](https://i.imgur.com/QfbOY6c.jpeg) - Finetuned [Qwen/Qwen1.5-4B](https://huggingface.co/Qwen/Qwen1.5-4B), on variety of CoT tasks including Reasoning, Closed Book Question Answering, Ethics, and more. - Datasets : Curated from - [kaist-ai/CoT-Collection](https://huggingface.co/datasets/kaist-ai/CoT-Collection), [euclaise/TinyCoT](https://huggingface.co/datasets/euclaise/TinyCoT) and a very small subset from [teknium/OpenHermes-2.5](https://huggingface.co/datasets/teknium/OpenHermes-2.5). - This marks the fourth model in this series. This experiment aims to improve Chain of Thought (CoT) capabilities on smaller language models. - I may rerun the finetuning experiment(with a more balanced dataset), using an iterative rationale-bootstrapping procedure inspired by euclaise/Memphis-CoT-3B. - Hyperparameter: adamw with eps of 1e-8, cosine decay with 20% warmup, lr=2e-5 ## Benchamrks: WIP ## Example: ``` from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer, StoppingCriteria import torch class MyStoppingCriteria(StoppingCriteria): def __init__(self, target_sequence, prompt): self.target_sequence = target_sequence self.prompt=prompt def __call__(self, input_ids, scores, **kwargs): generated_text = tokenizer.decode(input_ids[0]) generated_text = generated_text.replace(self.prompt,'') if self.target_sequence in generated_text: return True return False def __len__(self): return 1 def __iter__(self): yield self modelpath="aloobun/Reyna-CoT-4B-v0.1" model = AutoModelForCausalLM.from_pretrained( modelpath, torch_dtype=torch.bfloat16, device_map="cuda", trust_remote_code=True, ) tokenizer = AutoTokenizer.from_pretrained( modelpath, trust_remote_code=True, use_fast=False, ) prompt = "Avery opens a flower shop. She ties 8 bunches of flowers with 9 flowers in each bunch. How many bunches would she have if she put 12 flowers in each bunch instead?\n" encoded_input = tokenizer(prompt, return_tensors='pt') input_ids=encoded_input['input_ids'].cuda() streamer = TextStreamer(tokenizer=tokenizer, skip_prompt=True) op = model.generate( input_ids, streamer=streamer, pad_token_id=tokenizer.eos_token_id, do_sample=True, temperature=0.6, top_p=0.8, max_new_tokens=512, stopping_criteria=MyStoppingCriteria("<|endoftext|>", prompt) ) ``` ## Output: >She would have 8 x 9 = 72 flowers in total. >She would have 72 / 12 = 6 bunches of flowers with 12 flowers in each bunch. >Therefore, the answer is 6.<|endoftext|> # [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard) Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/details_aloobun__Reyna-CoT-4B-v0.1) | Metric |Value| |---------------------------------|----:| |Avg. |49.91| |AI2 Reasoning Challenge (25-Shot)|44.71| |HellaSwag (10-Shot) |71.12| |MMLU (5-Shot) |55.90| |TruthfulQA (0-shot) |43.09| |Winogrande (5-shot) |67.72| |GSM8k (5-shot) |16.91|