
---
license: apache-2.0
datasets:
- common-pile/comma-dataset
language:
- en
tags:
- openly-licensed
- llm
- pretraining
---

Model Card for Comma v0.1

Comma v0.1 is a 7 billion parameter model trained on 1 trillion tokens of openly licensed text collected as part of the Common Pile.

Model Details

Model Description

Comma v0.1 is a 7 billion parameter decoder-only transformer. It uses the same architecture as Llama 3. It was trained on 1 trillion tokens from the Common Pile, an 8TB collection of openly licensed text.

  • Developed by: r-three, EleutherAI, Vector, University of Toronto
  • Model type: Decoder-Only Transformer
  • Language(s) (NLP): English
  • License: Apache 2.0

Model Sources

Uses

Comma v0.1 can be used as a starting point for finetuning and post-training. Because it was trained on openly licensed text, it is less likely to create IP issues, but this is not a guarantee.

Direct Use

Evaluations in our paper show performance when using our final model directly. Additional post-training will most likely increase performance.
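A minimal sketch of direct use with the Hugging Face `transformers` library is shown here. The repository ID `common-pile/comma-v0.1-1t`, the helper name `generate_text`, and the generation settings are assumptions for illustration and are not confirmed by this card. Precision and device placement are left at the library defaults.

```python
# Assumed repository ID; check the model page for the exact name.
MODEL_ID = "common-pile/comma-v0.1-1t"


def generate_text(prompt: str, max_new_tokens: int = 64) -> str:
    """Greedy-decode a continuation of `prompt` with the base model."""
    # Imported lazily so the heavy dependencies load only when generation runs.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

    # Tokenize the prompt into PyTorch tensors and decode greedily.
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(
        **inputs, max_new_tokens=max_new_tokens, do_sample=False
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `generate_text("The Common Pile is")` downloads the weights on first use. Since the model has not been post-trained, it continues the prompt as plain text and should not be expected to follow instructions.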

Out-of-Scope Use

Comma v0.1 is only trained on openly licensed text. Therefore it will probably have reduced performance when asked about topics that only appear in copyrighted text.

Bias, Risks, and Limitations

Because it was trained on openly licensed text, Comma v0.1 is less likely to output IP-infringing text. However, due to issues like license laundering, this is not a guarantee. See our paper for a deeper discussion of these details.

Comma v0.1 is trained on many old books (pre-1929) and may therefore repeat societal biases common at the time.

Comma v0.1 includes no post-hoc guardrails that limit what it may generate.

Recommendations

Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.

Training Details

Training Data

Comma v0.1 was trained on common-pile/comma-data, a filtered and deduplicated dataset of 1 trillion tokens drawn from the Common Pile dataset (8TB of openly licensed text).

Training Procedure

Comma v0.1 was trained in two stages. First, it was trained for 965 billion tokens with a cosine learning rate schedule. Then it went through a second "cool-down" training phase on 35 billion tokens from high-quality sources. The final model is the average of 10 checkpoints from this cool-down phase.
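The two ingredients named above, a cosine learning rate schedule and checkpoint averaging, can be sketched in plain Python. This is an illustration under stated assumptions, not the actual training code: the peak learning rate, the absence of warmup, the step counts, and the uniform (unweighted) average are all hypothetical choices. The real values are in the lingua config.

```python
import math


def cosine_lr(step: int, total_steps: int, peak_lr: float, min_lr: float = 0.0) -> float:
    """Cosine decay from peak_lr at step 0 to min_lr at total_steps (no warmup)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def average_checkpoints(
    checkpoints: list[dict[str, list[float]]],
) -> dict[str, list[float]]:
    """Element-wise mean of parameters that share the same names and shapes."""
    count = len(checkpoints)
    averaged = {}
    for name in checkpoints[0]:
        # Group the i-th value of this parameter across all checkpoints.
        columns = zip(*(ckpt[name] for ckpt in checkpoints))
        averaged[name] = [sum(values) / count for values in columns]
    return averaged


# Toy usage: a hypothetical peak learning rate and two tiny "checkpoints"
# standing in for the 10 checkpoints of the cool-down phase.
print(cosine_lr(500, 1000, peak_lr=1e-3))
print(average_checkpoints([{"w": [1.0, 2.0]}, {"w": [3.0, 4.0]}]))
```

Averaging several checkpoints from the end of training is a cheap way to smooth out noise between individual checkpoints, since it needs no extra training steps.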

Training Hyperparameters

Hyperparameters can be found in our lingua config file.

Evaluation

Comma v0.1 7B outperforms models with similar computational budgets (7 billion parameters, 1 trillion tokens) that were trained on non-openly licensed text (LLaMA 1, MPT, RPJ-INCITE) on several common benchmarks (ARC-C, MMLU, BoolQ, SIQA, etc.) and does especially well on code-based tasks (HumanEval, MBPP). It tends to underperform on datasets like HellaSwag. Evaluations were done using OLMES. Note that there is still a large gap between Comma v0.1 and current state-of-the-art models like Qwen3, which was trained on 36 times as many tokens.
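For a rough sense of the compute budget mentioned above, a widely used rule of thumb approximates training compute for a dense transformer as about 6 FLOPs per parameter per token. That constant is an assumption brought in for illustration and is not taken from this card or the paper. The parameter and token counts are the rounded figures quoted above.

```python
PARAMS = 7_000_000_000          # 7 billion parameters
TOKENS = 1_000_000_000_000      # 1 trillion training tokens
FLOPS_PER_PARAM_TOKEN = 6       # rule-of-thumb constant (assumption)

# Approximate total training compute for Comma v0.1.
train_flops = FLOPS_PER_PARAM_TOKEN * PARAMS * TOKENS

# Token count implied by "36 times as many tokens" for Qwen3.
qwen3_tokens = 36 * TOKENS

print(f"Approximate training compute: {train_flops:.2e} FLOPs")
print(f"Token count implied by the 36x figure: {qwen3_tokens:.1e}")
```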

More evaluation results can be found in our paper.

Summary

Comma v0.1 is a 7B parameter model trained on openly licensed text. It is one of the first performant models trained only on openly licensed text.

Environmental Impact

Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).

  • Hardware Type: H100 Nvidia GPU
  • Hours used: [More Information Needed]
  • Cloud Provider: AWS
  • Compute Region: [More Information Needed]
  • Carbon Emitted: [More Information Needed]
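Once the missing fields are known, an estimate in the spirit of that calculator multiplies energy use by the carbon intensity of the electricity grid. The sketch below is a simplified assumption of that calculation, not the calculator's actual implementation. The function name and every input in the usage example other than the GPU count are hypothetical placeholders, so the printed figure is not an estimate for Comma v0.1.

```python
def estimate_co2_kg(
    num_gpus: int,
    gpu_power_watts: float,
    hours: float,
    carbon_intensity_kg_per_kwh: float,
) -> float:
    """Estimate emissions as energy used (kWh) times grid carbon intensity."""
    energy_kwh = num_gpus * gpu_power_watts / 1000.0 * hours
    return energy_kwh * carbon_intensity_kg_per_kwh


# Placeholder inputs only: the power draw, hours, and carbon intensity are
# NOT reported in this card and are made up to show the call shape.
placeholder_kg = estimate_co2_kg(
    num_gpus=64,                      # from the Hardware section
    gpu_power_watts=700.0,            # hypothetical per-GPU power draw
    hours=100.0,                      # hypothetical wall-clock hours
    carbon_intensity_kg_per_kwh=0.4,  # hypothetical grid intensity
)
print(f"Placeholder result (not a real estimate): {placeholder_kg:.0f} kg CO2eq")
```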

Technical Specifications

Model Architecture and Objective

Comma v0.1 uses the same architecture as Llama 3 and is trained using standard autoregressive next-token prediction.
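The next-token prediction objective can be illustrated with a toy loss computation in plain Python. This is a sketch of the standard cross-entropy formulation, with a hypothetical function name and a tiny made-up vocabulary. It is not the training code used for Comma v0.1.

```python
import math


def next_token_loss(logits: list[list[float]], token_ids: list[int]) -> float:
    """Mean cross-entropy of predicting token t+1 from the logits at position t."""
    total = 0.0
    # Position t predicts token t+1, so the last position has no target.
    for position_logits, target in zip(logits[:-1], token_ids[1:]):
        # Log-sum-exp with the max subtracted for numerical stability.
        max_logit = max(position_logits)
        log_norm = max_logit + math.log(
            sum(math.exp(x - max_logit) for x in position_logits)
        )
        # Negative log-probability of the correct next token.
        total += log_norm - position_logits[target]
    return total / (len(token_ids) - 1)


# Toy usage: uniform logits over a 4-token vocabulary, so every next token
# is predicted with probability 1/4.
uniform_logits = [[0.0, 0.0, 0.0, 0.0] for _ in range(3)]
print(next_token_loss(uniform_logits, [1, 2, 3]))
```

With uniform logits the loss equals the natural log of the vocabulary size, which is a handy sanity check for any implementation of this objective.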

Compute Infrastructure

Comma v0.1 was trained on the Hugging Face cluster.

Hardware

Comma v0.1 was trained using 64 NVIDIA H100 GPUs.

Software

Comma v0.1 was trained using lingua.

Citation