---
license: apache-2.0
datasets:
- Skylion007/openwebtext
language:
- en
base_model:
- EleutherAI/pythia-160m
library_name: transformers
---

# Parc-Pythia (Seed 0)

The Parc (Parallel Architecture) models are a set of autoregressive language models (of the Pythia, Mamba, and RWKV architectures) of roughly the same size, trained in parallel on the same data (2B tokens of OpenWebText) for the same number of steps, with six training runs per architecture, each based on a different random seed. The Parc models were designed to allow for more direct and fine-grained analysis of training dynamics across and within architectures.

## Model Details

### Model Description

- **Developed by:** James Michaelov
- **Model type:** Pythia (autoregressive transformer)
- **Language(s) (NLP):** English
- **License:** Apache 2.0

### Model Sources

#### Base Model

- **Repository:** [EleutherAI/pythia-160m](https://huggingface.co/EleutherAI/pythia-160m)
- **Paper:** [Biderman et al. (2023)](https://proceedings.mlr.press/v202/biderman23a.html)

## Model Use

The Parc models are intended for research use and are generally not suitable for deployment. They are pretrained on a subset of OpenWebText, which is not well documented, so it is possible that they are trained on (and may generate) harmful, offensive, or otherwise inappropriate text, especially as they are not fine-tuned in any way. For the same reason, there is no guarantee that they will generate accurate or truthful text. Rather than fine-tuning our models, we instead recommend fine-tuning the original Pythia, Mamba, and RWKV models, as they are trained on many times more data and are thus likely to have substantially better performance.

## How to Get Started with the Model

Example code for generation:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the final (step 4000) checkpoint of the seed-0 Pythia run
model = AutoModelForCausalLM.from_pretrained(
    "jmichaelov/parc-pythia-seed0",
    revision="checkpoint-4000"
)

tokenizer = AutoTokenizer.from_pretrained(
    "jmichaelov/parc-pythia-seed0",
    revision="checkpoint-4000"
)

# Generate a continuation of the prompt and decode it back to text
inputs = tokenizer("The Parc language models", return_tensors="pt")
tokens = model.generate(**inputs)
tokenizer.decode(tokens[0])
```
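
Because the Parc models are aimed at studying training dynamics, intermediate checkpoints can be loaded via the `revision` argument. The sketch below is illustrative only: the specific step values, and the assumption that intermediate checkpoints follow the same `checkpoint-<step>` naming, are not confirmed by this card, so check the repository's branch list for the revisions that actually exist.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "jmichaelov/parc-pythia-seed0"
text = "The Parc language models were trained on OpenWebText."

tokenizer = AutoTokenizer.from_pretrained(repo, revision="checkpoint-4000")
inputs = tokenizer(text, return_tensors="pt")

# Hypothetical intermediate steps; the revisions available on the Hub may differ.
for step in [1000, 2000, 4000]:
    model = AutoModelForCausalLM.from_pretrained(repo, revision=f"checkpoint-{step}")
    with torch.no_grad():
        # Passing labels=input_ids returns the mean next-token cross-entropy loss
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    print(f"step {step}: loss {loss.item():.3f}")
```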

## Training Details

### Training Data

* [OpenWebText](https://huggingface.co/datasets/Skylion007/openwebtext): An open replication of the WebText corpus (on which GPT-2 was trained).

### Training Procedure

* Context length: 1024 tokens
* Effective batch size: 512 sequences (batch size × gradient accumulation steps)
* Total training steps: 4000
* Total tokens: 4000 steps × 512 sequences × 1024 tokens = 2,097,152,000 (~2B)

#### Training Hyperparameters

* Warmup steps: 100
* Weight decay: 0.1
* Learning rate: 6e-4
* Learning rate scheduler: Cosine
* Precision: `float32`
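
For illustration, these settings map onto a Hugging Face `TrainingArguments` configuration roughly as follows. This is a sketch, not the authors' actual training script, and the 32 × 16 split of the 512-sequence effective batch size is an assumed placeholder.

```python
from transformers import TrainingArguments

# Sketch of the reported hyperparameters as TrainingArguments.
# The per-device batch size / gradient accumulation split is an assumption.
training_args = TrainingArguments(
    output_dir="parc-pythia-seed0",
    max_steps=4000,
    per_device_train_batch_size=32,
    gradient_accumulation_steps=16,  # 32 * 16 = 512-sequence effective batch
    learning_rate=6e-4,
    lr_scheduler_type="cosine",
    warmup_steps=100,
    weight_decay=0.1,
    seed=0,
    fp16=False,  # the card reports float32 precision
    bf16=False,
)
```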

## Evaluation

### Testing Data

* [ARC (Easy)](https://huggingface.co/datasets/allenai/ai2_arc)
* [BLiMP](https://huggingface.co/datasets/nyu-mll/blimp)
* [LAMBADA (OpenAI version)](https://huggingface.co/datasets/EleutherAI/lambada_openai)
* [SciQ](https://huggingface.co/datasets/allenai/sciq)
* [SWAG](https://huggingface.co/datasets/allenai/swag)

All evaluations were carried out using the [Language Model Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness).
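
A comparable evaluation can be run with the harness's Python API. The sketch below uses what are, to the best of my knowledge, the harness's standard task identifiers for these benchmarks; the harness version and evaluation settings used for the reported results are not specified on this card.

```python
import lm_eval

# Evaluate the seed-0 Pythia run on the listed benchmarks with the
# EleutherAI LM Evaluation Harness (task names may vary across versions).
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=jmichaelov/parc-pythia-seed0,revision=checkpoint-4000",
    tasks=["arc_easy", "blimp", "lambada_openai", "sciq", "swag"],
)
print(results["results"])
```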

### Metrics

* Accuracy: the standard metric for the benchmarks used.

### Results

![Results](https://huggingface.co/jmichaelov/parc-pythia-seed0/resolve/main/parc_eval_plot.png)

## Environmental Impact

- **Hardware Type:** GPUs: NVIDIA A100 80GB; CPUs: AMD EPYC (7713, 7643, or 7513)
- **Hours used:** ~42 hours (on average)
- **Infrastructure Details:** Massachusetts Green High-Performance Computing Center
- **Carbon Emitted:** ~0.8 kg (upper bound based on the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) from [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700) and the carbon efficiency of 0.0231 kg/kWh reported for the data center by [Sharma et al. (2017)](https://ieeexplore.ieee.org/document/7994556/)).
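
As a rough sanity check of the emissions figure (illustration only): the average power draw below is an assumed placeholder rather than a reported value, chosen to show how energy use and carbon intensity combine into an estimate of this order.

```python
# Back-of-the-envelope emissions estimate: energy (kWh) x intensity (kg CO2/kWh).
hours = 42           # average training time reported above
avg_power_kw = 0.8   # assumed average node power draw (GPU + CPU + overhead); not from the card
intensity_kg_per_kwh = 0.0231  # carbon efficiency reported for the data center

energy_kwh = hours * avg_power_kw                 # ~33.6 kWh
emissions_kg = energy_kwh * intensity_kg_per_kwh  # ~0.78 kg CO2
print(f"~{emissions_kg:.2f} kg CO2")
```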

## Citation

If you use any of the Parc models, please cite our forthcoming NeurIPS paper where we introduce them:

**BibTeX:**

```bibtex
@inproceedings{michaelov_language_2025,
  title = {Language {Model} {Behavioral} {Phases} are {Consistent} {Across} {Scale} and {Architecture}},
  author = {Michaelov, James A. and Levy, Roger P. and Bergen, Benjamin K.},
  booktitle = {Advances in {Neural} {Information} {Processing} {Systems}},
  volume = {38},
  year = {2025}
}
```

**APA:**

Michaelov, J. A., Levy, R. P., & Bergen, B. K. (2025). Language Model Behavioral Phases are Consistent Across Scale and Architecture. *Advances in Neural Information Processing Systems, 38*.