---
license: llama2
---

# Toy LLaMA-39M

- This is a tiny LLaMA model pretrained with `seq_len=512` on [Recag/Rp_C4_55](https://huggingface.co/datasets/Recag/Rp_C4_55), a small subset of C4.
  - Model architecture
      ```json
      {
        "hidden_size": 512,
        "intermediate_size": 2048,
        "max_position_embeddings": 2048,
        "num_attention_heads": 8,
        "num_hidden_layers": 2,
        "num_key_value_heads": 8
      }
      ```
  - Load model and tokenizer:
      ```python
      from transformers import AutoTokenizer, AutoModelForCausalLM
      model = AutoModelForCausalLM.from_pretrained("Cheng98/llama-39m")
      tokenizer = AutoTokenizer.from_pretrained("Cheng98/llama-39m")
      ```
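  - Generate text (a minimal greedy-decoding smoke test; the prompt is arbitrary, and a 39M model will produce low-quality continuations):
      ```python
      from transformers import AutoModelForCausalLM, AutoTokenizer

      model = AutoModelForCausalLM.from_pretrained("Cheng98/llama-39m")
      tokenizer = AutoTokenizer.from_pretrained("Cheng98/llama-39m")

      # Greedy decoding of a short continuation
      inputs = tokenizer("The meaning of life is", return_tensors="pt")
      outputs = model.generate(**inputs, max_new_tokens=20, do_sample=False)
      print(tokenizer.decode(outputs[0], skip_special_tokens=True))
      ```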
  - Training script: [huggingface/transformers/examples/pytorch/language-modeling/run_clm.py](https://github.com/huggingface/transformers/blob/e9476832942a19cf99354776ef112babc83c139a/examples/pytorch/language-modeling/run_clm.py)
      ```python
      # "train" split is created from the last 95% of samples in the original "train" subset
      raw_datasets["train"] = load_dataset("Recag/Rp_C4_55", split="train[5%:]")
      ```


- Evaluation (`seq_len=512`):
   
  | Dataset        | Eval loss | Perplexity | Accuracy | block_size |
  |----------------|-----------|------------|----------|------------|
  | Recag/Rp_C4_55 | 3.63      | 37.78      | 0.3561   | 512        |
  | Wikitext2      | 4.58      | 97.48      | 0.2719   | 512        |
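
  - Perplexity is `exp(eval_loss)`; a quick consistency check against the losses reported in the detailed results:
      ```python
      import math

      # perplexity = exp(cross-entropy loss), using the eval losses below
      print(round(math.exp(3.6318140029907227), 2))  # Recag/Rp_C4_55 -> 37.78
      print(round(math.exp(4.579628944396973), 2))   # Wikitext2      -> 97.48
      ```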

  - Evaluation command (Wikitext2):
      ```bash
      python run_clm.py --model_name_or_path Cheng98/llama-39m \
        --dataset_name wikitext \
        --dataset_config_name wikitext-2-raw-v1 \
        --block_size 512 \
        --do_eval \
        --output_dir ./results
      ```
  
  - Evaluation on Recag/Rp_C4_55 (`seq_len=512`):
      ```python
      # "validation" split is created from the first 5% samples of original "train" subset
      raw_datasets["validation"] = load_dataset("Recag/Rp_C4_55", split="train[:5%]")
      ```
      Results:
      ```json
      {
        "eval_accuracy": 0.3561766818954313,
        "eval_loss": 3.6318140029907227,
        "eval_runtime": 190.8411,
        "eval_samples": 19413,
        "eval_samples_per_second": 101.723,
        "eval_steps_per_second": 1.593,
        "perplexity": 37.7812898658763
      }
      ```
      
  - Evaluation on Wikitext2 (`seq_len=512`):
      ```json
      {
        "eval_accuracy": 0.2718795201225219,
        "eval_loss": 4.579628944396973,
        "eval_runtime": 3.939,
        "eval_samples": 575,
        "eval_samples_per_second": 145.976,
        "eval_steps_per_second": 0.762,
        "perplexity": 97.47821765687856
      }
      ```
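
  - As a sanity check, the reported throughput matches `eval_samples / eval_runtime` for both runs:
      ```python
      # samples / runtime = samples_per_second, using the eval results above
      print(round(19413 / 190.8411, 3))  # Recag/Rp_C4_55 -> 101.723
      print(round(575 / 3.939, 3))       # Wikitext2      -> 145.976
      ```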