kenhktsui/nano-phi-115M-v0.1

@@ -8,7 +8,7 @@ inference:
     max_new_tokens: 64
     do_sample: true
     temperature: 0.1
-    repetition_penalty: 10.0
     no_repeat_ngram_size: 4
     eta_cutoff: 0.0006
     renormalize_logits: true
@@ -17,29 +17,35 @@ widget:
   example_title: El Microondas
 - text: Kennesaw State University is a public
   example_title: Kennesaw State University
-- text: Bungie Studios is an American video game developer. They are most famous for
-    developing the award winning Halo series of video games. They also made Destiny.
-    The studio was founded
   example_title: Bungie
 - text: The Mona Lisa is a world-renowned painting created by
   example_title: Mona Lisa
-- text: The Harry Potter series, written by J.K. Rowling, begins with the book titled
   example_title: Harry Potter Series
-- text: 'Question: I have cities, but no houses. I have mountains, but no trees. I
     have water, but no fish. What am I?
-    Answer:'
   example_title: Riddle
 - text: The process of photosynthesis involves the conversion of
   example_title: Photosynthesis
-- text: Jane went to the store to buy some groceries. She picked up apples, oranges,
     and a loaf of bread. When she got home, she realized she forgot
   example_title: Story Continuation
-- text: 'Problem 2: If a train leaves Station A at 9:00 AM and travels at 60 mph,
-    and another train leaves Station B at 10:00 AM and travels at 80 mph, when will
     they meet if the distance between the stations is 300 miles?
-    To determine'
   example_title: Math Problem
 - text: In the context of computer programming, an algorithm is
   example_title: Algorithm Definition
@@ -62,7 +68,8 @@ model-index:
       value: 21.93
       name: normalized accuracy
     source:
-      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=kenhktsui/nano-phi-115M-v0.1
       name: Open LLM Leaderboard
   - task:
       type: text-generation
@@ -78,7 +85,8 @@ model-index:
       value: 27.86
       name: normalized accuracy
     source:
-      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=kenhktsui/nano-phi-115M-v0.1
       name: Open LLM Leaderboard
   - task:
       type: text-generation
@@ -95,7 +103,8 @@ model-index:
       value: 25.34
       name: accuracy
     source:
-      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=kenhktsui/nano-phi-115M-v0.1
       name: Open LLM Leaderboard
   - task:
       type: text-generation
@@ -109,9 +118,10 @@ model-index:
         num_few_shot: 0
     metrics:
     - type: mc2
-      value: 46.0
     source:
-      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=kenhktsui/nano-phi-115M-v0.1
       name: Open LLM Leaderboard
   - task:
       type: text-generation
@@ -128,7 +138,8 @@ model-index:
       value: 50.83
       name: accuracy
     source:
-      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=kenhktsui/nano-phi-115M-v0.1
       name: Open LLM Leaderboard
   - task:
       type: text-generation
@@ -142,11 +153,19 @@ model-index:
         num_few_shot: 5
     metrics:
     - type: acc
-      value: 0.0
       name: accuracy
     source:
-      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=kenhktsui/nano-phi-115M-v0.1
       name: Open LLM Leaderboard
 ---
@@ -155,7 +174,7 @@ model-index:
 Inspired by [Phi2](https://huggingface.co/microsoft/phi-2), and open source small language model attempts like [smol_llama-101M-GQA](https://huggingface.co/BEE-spoke-data/smol_llama-101M-GQA).
 Pre-trained with training 7B token **from scratch**, with application of quality filter to datasets resulting in 0.26B token.
 The control is [kenhktsui/nano-phi-115M-control-v0.1](https://huggingface.co/kenhktsui/nano-phi-115M-control-v0.1), where full dataset (0.6B) is used.
-Not much degradation in performance despite only using **42%** of the data due to the effective quality filter.
 In fact, upon inspection, the 6000 steps chkpt achieves similar performance as this model, signaling underlying **effective training due to high quality data**.
 It just took 1d to train in Colab with a A100 40GB (**<USD$ 50**).
 It achieves quite competitive results in evaluation given its training token, and training data size.
@@ -569,5 +588,4 @@ Detailed results can be found [here](https://huggingface.co/datasets/open-llm-le
 |MMLU (5-Shot)                    |25.34|
 |TruthfulQA (0-shot)              |46.00|
 |Winogrande (5-shot)              |50.83|
-|GSM8k (5-shot)                   | 0.00|

     max_new_tokens: 64
     do_sample: true
     temperature: 0.1
+    repetition_penalty: 10
     no_repeat_ngram_size: 4
     eta_cutoff: 0.0006
     renormalize_logits: true
   example_title: El Microondas
 - text: Kennesaw State University is a public
   example_title: Kennesaw State University
+- text: >-
+    Bungie Studios is an American video game developer. They are most famous for
+    developing the award winning Halo series of video games. They also made
+    Destiny. The studio was founded
   example_title: Bungie
 - text: The Mona Lisa is a world-renowned painting created by
   example_title: Mona Lisa
+- text: >-
+    The Harry Potter series, written by J.K. Rowling, begins with the book
+    titled
   example_title: Harry Potter Series
+- text: >-
+    Question: I have cities, but no houses. I have mountains, but no trees. I
     have water, but no fish. What am I?
+    Answer:
   example_title: Riddle
 - text: The process of photosynthesis involves the conversion of
   example_title: Photosynthesis
+- text: >-
+    Jane went to the store to buy some groceries. She picked up apples, oranges,
     and a loaf of bread. When she got home, she realized she forgot
   example_title: Story Continuation
+- text: >-
+    Problem 2: If a train leaves Station A at 9:00 AM and travels at 60 mph, and
+    another train leaves Station B at 10:00 AM and travels at 80 mph, when will
     they meet if the distance between the stations is 300 miles?
+    To determine
   example_title: Math Problem
 - text: In the context of computer programming, an algorithm is
   example_title: Algorithm Definition
       value: 21.93
       name: normalized accuracy
     source:
+      url: >-
+        https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=kenhktsui/nano-phi-115M-v0.1
       name: Open LLM Leaderboard
   - task:
       type: text-generation
       value: 27.86
       name: normalized accuracy
     source:
+      url: >-
+        https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=kenhktsui/nano-phi-115M-v0.1
       name: Open LLM Leaderboard
   - task:
       type: text-generation
       value: 25.34
       name: accuracy
     source:
+      url: >-
+        https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=kenhktsui/nano-phi-115M-v0.1
       name: Open LLM Leaderboard
   - task:
       type: text-generation
         num_few_shot: 0
     metrics:
     - type: mc2
+      value: 46
     source:
+      url: >-
+        https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=kenhktsui/nano-phi-115M-v0.1
       name: Open LLM Leaderboard
   - task:
       type: text-generation
       value: 50.83
       name: accuracy
     source:
+      url: >-
+        https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=kenhktsui/nano-phi-115M-v0.1
       name: Open LLM Leaderboard
   - task:
       type: text-generation
         num_few_shot: 5
     metrics:
     - type: acc
+      value: 0
       name: accuracy
     source:
+      url: >-
+        https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=kenhktsui/nano-phi-115M-v0.1
       name: Open LLM Leaderboard
+datasets:
+- kenhktsui/minipile_quality_score_v1
+- kenhktsui/simple_wikipedia_LM_quality_score_v1
+- kenhktsui/refinedweb-3m_quality_score_v1
+- kenhktsui/TM-DATA_quality_score_v1
+- kenhktsui/openwebtext_quality_score_v1
 ---
 Inspired by [Phi2](https://huggingface.co/microsoft/phi-2), and open source small language model attempts like [smol_llama-101M-GQA](https://huggingface.co/BEE-spoke-data/smol_llama-101M-GQA).
 Pre-trained with training 7B token **from scratch**, with application of quality filter to datasets resulting in 0.26B token.
 The control is [kenhktsui/nano-phi-115M-control-v0.1](https://huggingface.co/kenhktsui/nano-phi-115M-control-v0.1), where full dataset (0.6B) is used.
+Not much degradation in performance despite only using **42%** of the data due to the effective quality filter ("quality_score_v1" > 0.5).
 In fact, upon inspection, the 6000 steps chkpt achieves similar performance as this model, signaling underlying **effective training due to high quality data**.
 It just took 1d to train in Colab with a A100 40GB (**<USD$ 50**).
 It achieves quite competitive results in evaluation given its training token, and training data size.
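
The quality filter mentioned above (`"quality_score_v1" > 0.5`) can be sketched roughly as follows. This is only an illustration, not the author's pipeline: the record layout (a `text` field next to a `quality_score_v1` field), the helper names, and the use of whitespace-separated words as a stand-in for tokens are all assumptions. With the rounded figures quoted in the card, 0.26B of 0.6B tokens is roughly 43%, in line with the stated 42%.

```python
# Minimal sketch of threshold-based quality filtering.
# The column name and the 0.5 cutoff come from the model card;
# everything else (record layout, helper names) is hypothetical.

QUALITY_COLUMN = "quality_score_v1"
THRESHOLD = 0.5


def filter_by_quality(records, column=QUALITY_COLUMN, threshold=THRESHOLD):
    """Keep only records whose quality score is strictly above the threshold."""
    return [r for r in records if r[column] > threshold]


def kept_fraction(records, kept, text_key="text"):
    """Share of (whitespace-split) words that survive the filter.

    Whitespace splitting is a crude proxy for real tokenization.
    """
    total = sum(len(r[text_key].split()) for r in records)
    retained = sum(len(r[text_key].split()) for r in kept)
    return retained / total if total else 0.0


# Toy data, made up for illustration only.
sample = [
    {"text": "a b c d", "quality_score_v1": 0.9},
    {"text": "e f", "quality_score_v1": 0.5},      # not strictly above 0.5, dropped
    {"text": "g h i j", "quality_score_v1": 0.2},
]

kept = filter_by_quality(sample)
fraction = kept_fraction(sample, kept)
```

The same predicate could presumably be applied to each of the listed `*_quality_score_v1` datasets, provided they expose the score under that column name.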
 |MMLU (5-Shot)                    |25.34|
 |TruthfulQA (0-shot)              |46.00|
 |Winogrande (5-shot)              |50.83|
+|GSM8k (5-shot)                   | 0.00|
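
The `inference` parameters in the card metadata correspond to keyword arguments of `generate` in the `transformers` library. The sketch below copies those values into a dictionary and wraps a generation call around them. The `generate_text` helper is hypothetical, and whether this checkpoint loads with the default `Auto*` classes (without extra options) is an assumption that has not been checked here.

```python
# Sketch: reuse the card's widget inference settings for local generation.
# Values are copied from the `inference` block of the card metadata.
GENERATION_KWARGS = {
    "max_new_tokens": 64,
    "do_sample": True,
    "temperature": 0.1,
    "repetition_penalty": 10.0,
    "no_repeat_ngram_size": 4,
    "eta_cutoff": 0.0006,
    "renormalize_logits": True,
}


def generate_text(prompt, model_id="kenhktsui/nano-phi-115M-v0.1"):
    """Hypothetical helper: load the model and sample a continuation."""
    # Imported lazily so the settings above can be inspected
    # without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, **GENERATION_KWARGS)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


# Example call (downloads the checkpoint on first use):
# print(generate_text("The Mona Lisa is a world-renowned painting created by"))
```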