Muennighoff committed
Commit f1a1828
1 Parent(s): b07b3ae

Move to subsection

Files changed (1):
  1. README.md +25 -25
README.md CHANGED
@@ -2140,6 +2140,31 @@ Model may:
 <details>
 <summary>Click to expand</summary>
 
+ ## Metrics
+ *This section describes the different ways performance is calculated and why.*
+
+
+ Includes:
+
+ | Metric | Why chosen |
+ |--------------------|--------------------------------------------------------------------|
+ | [Perplexity](#perplexity) | Standard metric for quantifying model improvements during training |
+ | Cross Entropy [Loss](#loss) | Standard objective for language models. |
+
+ And multiple different metrics for specific tasks. _(More evaluation metrics forthcoming upon completion of evaluation protocol.)_
+
+ ## Factors
+ *This section lists some different aspects of what BLOOM models. Its focus is on those aspects that are likely to give rise to high variance in model behavior.*
+
+ - Language, such as English or Yoruba
+
+ - Domain, such as newswire or stories
+
+ - Demographic characteristics, such as gender or nationality
+
+ ## Results
+ *Results are based on the [Factors](#factors) and [Metrics](#metrics).*
+
 See this repository for JSON files: https://github.com/bigscience-workshop/evaluation-results
 
 | Task | Language | Metric | BLOOM-176B | OPT-175B* |
@@ -2291,30 +2316,6 @@ See this repository for JSON files: https://github.com/bigscience-workshop/evalu
 | humaneval | python | pass@10 | 0.322 | 0.0 |
 | humaneval | python | pass@100 | 0.555 | 0.003 |
 
- ## Metrics
- *This section describes the different ways performance is calculated and why.*
-
-
- Includes:
-
- | Metric | Why chosen |
- |--------------------|--------------------------------------------------------------------|
- | [Perplexity](#perplexity) | Standard metric for quantifying model improvements during training |
- | Cross Entropy [Loss](#loss) | Standard objective for language models. |
-
- And multiple different metrics for specific tasks. _(More evaluation metrics forthcoming upon completion of evaluation protocol.)_
-
- ## Factors
- *This section lists some different aspects of what BLOOM models. Its focus is on those aspects that are likely to give rise to high variance in model behavior.*
-
- - Language, such as English or Yoruba
-
- - Domain, such as newswire or stories
-
- - Demographic characteristics, such as gender or nationality
-
- ## Results
- *Results are based on the [Factors](#factors) and [Metrics](#metrics).*
 
 **Train-time Evaluation:**
 
@@ -2326,7 +2327,6 @@ As of 25.May.2022, 15:00 PST:
 
 - Perplexity: 8.9
 
- (More evaluation scores forthcoming.)
 
 </details>
 
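
For reference on the metrics moved in this commit: the reported perplexity is the exponential of the mean cross-entropy loss, which is why both appear in the Metrics table. A minimal sketch, assuming the `transformers` library and the small `bigscience/bloom-560m` checkpoint purely for illustration (the tables above refer to the 176B model):

```python
# Minimal sketch: perplexity = exp(mean cross-entropy loss).
# Assumes the `transformers` library; bigscience/bloom-560m is used only to keep
# the example small -- the evaluation above was run on BLOOM-176B.
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m")
model.eval()

text = "BLOOM is a multilingual open-access language model."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # Passing labels makes the model return the mean token-level cross-entropy loss.
    outputs = model(**inputs, labels=inputs["input_ids"])

cross_entropy = outputs.loss.item()
perplexity = math.exp(cross_entropy)
print(f"cross-entropy: {cross_entropy:.3f}  perplexity: {perplexity:.3f}")
```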
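
The humaneval rows report pass@k, the unbiased estimator from the Codex paper (Chen et al., 2021): given n generated samples per problem of which c pass the unit tests, pass@k = 1 - C(n-c, k) / C(n, k). A minimal sketch; the sample counts below are hypothetical and not taken from the table:

```python
# Unbiased pass@k estimator used for HumanEval (Chen et al., 2021).
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples passes, given c of n samples pass."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical counts for one problem: 200 samples generated, 31 pass the tests.
print(pass_at_k(n=200, c=31, k=1))   # ~0.155
print(pass_at_k(n=200, c=31, k=10))  # higher, since any of 10 tries may pass
```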