metadata

datasets:
  - tiiuae/falcon-refinedweb
  - togethercomputer/RedPajama-Data-1T
  - CarperAI/pilev2-dev
  - bigcode/starcoderdata
  - JeanKaddour/minipile
language:
  - en
tags:
  - causal-lm
license: cc-by-sa-4.0

`StableLM-Base-Alpha-3B-v2`

Model Description

StableLM-Base-Alpha-3B-v2 is a 3 billion parameter decoder-only language model pre-trained on diverse English datasets. This model is the successor to the first StableLM-Base-Alpha-3B model, addressing previous shortcomings through the use of improved data sources and mixture ratios.

Usage

Get started generating text with StableLM-Base-Alpha-3B-v2 by using the following code snippet:

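from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the tokenizer and model weights from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("stabilityai/stablelm-base-alpha-3b-v2")
model = AutoModelForCausalLM.from_pretrained(
  "stabilityai/stablelm-base-alpha-3b-v2",
  trust_remote_code=True,  # allows running the custom model code hosted in the model repository
  torch_dtype="auto",
)
model.cuda()  # requires a CUDA-capable GPU

# Tokenize the prompt and move the input tensors to the same device as the model.
inputs = tokenizer("The weather is always wonderful", return_tensors="pt").to("cuda")
# Sample up to 64 new tokens using temperature and nucleus (top-p) sampling.
tokens = model.generate(
  **inputs,
  max_new_tokens=64,
  temperature=0.75,
  top_p=0.95,
  do_sample=True,
)
print(tokenizer.decode(tokens[0], skip_special_tokens=True))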

Model Details

Developed by: Stability AI
Model type: StableLM-Base-Alpha-v2 models are auto-regressive language models based on the transformer decoder architecture.
Language(s): English
Library: GPT-NeoX
License: Model checkpoints are licensed under the Creative Commons license (CC BY-SA 4.0). Under this license, you must give credit to Stability AI, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests that Stability AI endorses you or your use.
Contact: For questions and comments about the model, please email lm@stability.ai

Model Architecture

Parameters	Hidden Size	Layers	Heads	Sequence Length
2,796,431,360	2560	32	32	4096

The model is a decoder-only transformer similar to StableLM-Base-Alpha (v1), with the following configuration:

Activation: SwiGLU (Shazeer, 2020)
Decoder Layer: Parallel Attention and MLP residuals with a single input LayerNorm (Wang & Komatsuzaki, 2021)
Position Embeddings: Rotary Position Embeddings (Su et al., 2021)
Bias: LayerNorm bias terms only
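
The configuration above can be illustrated with a minimal, dependency-free sketch of a parallel-residual decoder layer with a SwiGLU feed-forward branch. This is not the model's actual implementation: the function names, the plain-list vectors, and the placeholder attention callable are all assumptions made for illustration, and the LayerNorm here omits the learned scale and bias that the real model has.

```python
import math


def silu(z: float) -> float:
    """SiLU (swish) activation: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))


def matvec(weights, x):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def layer_norm(x, eps=1e-5):
    """LayerNorm without learned scale/bias (simplification for this sketch)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def swiglu_mlp(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward: down(silu(gate(x)) * up(x)), with no bias terms."""
    gate = [silu(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    return matvec(w_down, [g * u for g, u in zip(gate, up)])


def parallel_block(x, attn, mlp):
    """Parallel residual: a single LayerNorm output feeds both branches,
    and both branch outputs are added to the same residual stream."""
    h = layer_norm(x)
    a = attn(h)
    m = mlp(h)
    return [xi + ai + mi for xi, ai, mi in zip(x, a, m)]


# Toy usage: hidden size 2, intermediate size 1, and a placeholder
# "attention" that returns zeros (hypothetical stand-in for real attention).
w_gate, w_up, w_down = [[1.0, 0.0]], [[0.0, 1.0]], [[1.0], [1.0]]
out = parallel_block(
    [1.0, 3.0],
    attn=lambda h: [0.0] * len(h),
    mlp=lambda h: swiglu_mlp(h, w_gate, w_up, w_down),
)
print(out)
```

In a sequential layer the MLP would instead read the attention branch's output; the parallel form lets both branches share one normalized input.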

Training

StableLM-Base-Alpha-3B-v2 is pre-trained using a multi-stage context length extension schedule following similar work (Nijkamp et al., 2023): it is first pre-trained at a context length of 2048 for 1 trillion tokens, then fine-tuned at a context length of 4096 for another 100 billion tokens.
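
As a rough illustration of the schedule, the sketch below totals the token budget and estimates how many full-length sequences each stage corresponds to. It assumes every training sequence is packed to exactly the stage's context length, which the description above does not state; the variable and function names are hypothetical.

```python
# (context_length, tokens) per stage, taken from the schedule described above.
STAGES = [
    (2048, 1_000_000_000_000),  # stage 1: 1 trillion tokens
    (4096, 100_000_000_000),    # stage 2: 100 billion tokens
]


def full_sequences(stages):
    """Number of full-length sequences per stage, assuming perfect packing."""
    return [tokens // context_length for context_length, tokens in stages]


total_tokens = sum(tokens for _, tokens in STAGES)
sequence_counts = full_sequences(STAGES)

print(total_tokens)     # 1100000000000
print(sequence_counts)  # [488281250, 24414062]
```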

Training Dataset

The first pre-training stage relies on 1 trillion tokens sourced from a mix of the public Falcon RefinedWeb extract (Penedo et al., 2023), RedPajama-Data (Together Computer, 2023), The Pile (Gao et al., 2020), and internal datasets, with web text sampled at a rate of 71%.

In the second stage, we include the StarCoder (Li et al., 2023) dataset and downsample web text to 55% while increasing the sampling proportions of naturally long text examples in the aforementioned sources.
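
The web-text sampling rates can be sketched as a weighted draw over data sources. This is only an illustration: the per-source breakdown of the non-web data is not given here, so everything that is not web text is lumped into a single hypothetical "other" bucket, and the function names are assumptions rather than part of the training code.

```python
import random

# Web-text sampling rates (in percent) for each stage, as stated above.
WEB_RATE_PERCENT = {"stage_1": 71, "stage_2": 55}


def sample_sources(stage, n, seed=0):
    """Draw n source labels for a stage; 'other' is a hypothetical bucket
    covering every non-web source."""
    rng = random.Random(seed)
    web = WEB_RATE_PERCENT[stage]
    return rng.choices(["web", "other"], weights=[web, 100 - web], k=n)


def expected_web_tokens(stage, stage_tokens):
    """Expected number of web-text tokens in a stage (integer arithmetic)."""
    return stage_tokens * WEB_RATE_PERCENT[stage] // 100


draws = sample_sources("stage_1", 10_000)
print(draws.count("web") / len(draws))  # close to 0.71
print(expected_web_tokens("stage_1", 1_000_000_000_000))  # 710000000000
print(expected_web_tokens("stage_2", 100_000_000_000))    # 55000000000
```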

Training Procedure

The model is pre-trained on the dataset mixes mentioned above in mixed-precision (FP16), optimized with AdamW, and trained using the NeoX tokenizer with a vocabulary size of 50,257. We outline the complete hyperparameter choices in the project's GitHub repository (config).

Training Infrastructure

Hardware: StableLM-Base-Alpha-3B-v2 was trained on the Stability AI cluster, occupying 256 NVIDIA A100 40GB GPUs across AWS P4d instances. Training took approximately 8.45 days to complete across both stages.
Software: We use a fork of gpt-neox (EleutherAI, 2021), train under 2D parallelism (Data and Tensor Parallel) with ZeRO-1 (Rajbhandari et al., 2019), and rely on flash-attention as well as rotary embedding kernels from FlashAttention-2 (Dao et al., 2023).
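
For a sense of scale, the figures above imply a GPU-hour budget and an average training throughput. The sketch below is a back-of-the-envelope estimate only: it assumes the 8.45 days are continuous wall-clock time with all 256 GPUs in use throughout, and that the 1.1 trillion tokens from the two training stages were processed in that window. The constant and function names are hypothetical.

```python
NUM_GPUS = 256
TRAINING_DAYS = 8.45
# 1 trillion (stage 1) + 100 billion (stage 2) tokens, from the Training section.
TOTAL_TOKENS = 1_000_000_000_000 + 100_000_000_000


def gpu_hours(num_gpus, days):
    """Total GPU-hours, assuming every GPU is busy for the full duration."""
    return num_gpus * days * 24


def tokens_per_second(total_tokens, days):
    """Average cluster-wide throughput over the whole run."""
    return total_tokens / (days * 24 * 3600)


cluster_rate = tokens_per_second(TOTAL_TOKENS, TRAINING_DAYS)
print(round(gpu_hours(NUM_GPUS, TRAINING_DAYS), 1))  # about 51916.8 GPU-hours
print(round(cluster_rate))                           # about 1.5 million tokens/s
print(round(cluster_rate / NUM_GPUS))                # about 5.9 thousand tokens/s per GPU
```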

Use and Limitations

Intended Use

These models are intended to be used by all individuals as foundational models for application-specific fine-tuning without strict limitations on commercial use.

Limitations and bias

The pre-training dataset may have contained offensive or inappropriate content even after applying data cleansing filters, which can be reflected in the model-generated text. We recommend that users exercise caution when using these models in production systems. Do not use the models for any applications that may cause harm or distress to individuals or groups.

How to cite

@misc{StableLMAlphaV2Models, 
      url={https://huggingface.co/stabilityai/stablelm-base-alpha-3b-v2},
      title={StableLM Alpha v2 Models},
      author={Tow, Jonathan}
}