FLM-101B

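FLM-101B is an open-source, decoder-only language model with 101B parameters. It is trained with a model growth technique: the model learns quickly at a small scale in the early stage of training and is then progressively grown into a large model, which makes it possible to train a 100B-scale model at low cost (~$100K). FLM-101B supports both Chinese and English. Its training context window is 2048 tokens, and because it uses xPos rotary position embedding, the window can be extended well at inference time. To advance the development of 100B-scale LLM technology, FLM-101B is now fully open source.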

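Why use FLM-101B

An open-source 100B-scale Chinese-English bilingual model
The largest known language model trained with xPos
The largest known language model that successfully implements μp transfer and loss prediction
The largest known language model that successfully implements progressive learning with model growth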

Quick Start with FLM-101B

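import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# trust_remote_code=True is required because the model ships its own modeling code.
tokenizer = AutoTokenizer.from_pretrained("CofeAI/FLM-101B", trust_remote_code=True)
# device_map="auto" spreads the bfloat16 weights across the available devices.
model = AutoModelForCausalLM.from_pretrained("CofeAI/FLM-101B", torch_dtype=torch.bfloat16, low_cpu_mem_usage=True, device_map="auto", trust_remote_code=True)
# Prompt (Chinese): "A house without books is like a body without a soul;"
inputs = tokenizer('一幢没有书的房子，犹如一个没有灵魂的身体;', return_tensors='pt').to(model.device)
generated = model.generate(**inputs, max_new_tokens=64, repetition_penalty=1.1)
# Decode the first (and only) sequence in the batch.
print(tokenizer.decode(generated.cpu()[0], skip_special_tokens=True))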

Model Details

Model Description

Model type: Decoder-only language model
Languages: Chinese / English
License: apache-2.0

Model Size

Hyperparameter	Value
n_parameters	101B
n_layers	80
n_heads	80
d_model	10240
vocab size	100256
sequence length	2048
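
As a rough sanity check on the table above, the parameter count can be estimated from the listed hyperparameters. This is only a sketch under assumptions that the table does not state: a standard Transformer block with four d_model × d_model attention projections, a feed-forward layer with a 4× expansion (the hypothetical `ffn_multiplier` argument), a single shared embedding matrix, and no biases or normalization weights.

```python
# Hyperparameters copied from the model size table.
n_layers = 80
n_heads = 80
d_model = 10240
vocab_size = 100256


def estimate_parameters(n_layers, d_model, vocab_size, ffn_multiplier=4):
    """Approximate parameter count of a decoder-only Transformer.

    Assumes (not confirmed by the source) a standard block layout and
    ignores biases and normalization weights.
    """
    attention = 4 * d_model * d_model                      # Q, K, V, output projections
    feed_forward = 2 * ffn_multiplier * d_model * d_model  # up and down projections
    embedding = vocab_size * d_model                       # token embedding matrix
    return n_layers * (attention + feed_forward) + embedding


total = estimate_parameters(n_layers, d_model, vocab_size)
head_dim = d_model // n_heads

print(f"estimated parameters: {total / 1e9:.2f}B")
print(f"per-head dimension: {head_dim}")
```

Under these assumptions the estimate lands close to the 101B reported in the table, which suggests the listed hyperparameters are mutually consistent.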

Model Architecture

Extrapolatable Position Embedding (xPos)
Flash Attention (In Training)
Model Growth
Loss Prediction

Training

Training Hyperparameters

Hyperparameter	16b	51b	101b
Optimizer		AdamW
Precision		bfloat16
Weight decay		0.1
Gradient clipping		1.0
Learning rate	4e-4	3.4e-4	2e-4
Batch size(M tokens)	4.72	4.72	4.31
Warmup(M samples)	4.61	0.23	0.23
Time(day)	9.63	5.37	6.54
Tokens(B)	245.37	39.64	26.54
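
The per-stage figures in the table above can be combined to see how the token budget is distributed across the growth stages. The sketch below only does arithmetic on the table's values; the dictionary layout and variable names are illustrative, not part of the training code.

```python
# Tokens (in billions) and wall-clock days per growth stage, from the table.
stages = {
    "16b": {"tokens_b": 245.37, "days": 9.63},
    "51b": {"tokens_b": 39.64, "days": 5.37},
    "101b": {"tokens_b": 26.54, "days": 6.54},
}

# Total number of training tokens across all three stages.
total_tokens_b = sum(stage["tokens_b"] for stage in stages.values())

# Average throughput of each stage in billions of tokens per day.
rates = {name: stage["tokens_b"] / stage["days"] for name, stage in stages.items()}

print(f"total tokens: {total_tokens_b:.2f}B")
for name, rate in rates.items():
    share = stages[name]["tokens_b"] / total_tokens_b
    print(f"{name}: {rate:.2f}B tokens/day, {share:.1%} of all tokens")
```

Most tokens are consumed at the smallest scale, where throughput per day is highest, which is consistent with the growth strategy described in the introduction.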

Parallel Strategies

Params(billion)	TP Size	PP Size	DP Size	Number of GPUs	Batch Size	TFLOP/s per GPU	GPU Utilization
16	2	1	96	192	2304	162	51.90%
51	4	2	24	192	2304	160	51.30%
101	4	4	12	192	2160	165	52.88%
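
The table above can be cross-checked with a few lines of arithmetic: the product of the tensor, pipeline, and data parallel sizes should equal the number of GPUs, and the utilization column should be the achieved throughput divided by the hardware peak. The peak value of 312 TFLOP/s (dense bfloat16 on an A800/A100-class GPU) is an assumption here, since the source does not state which peak it used.

```python
# Assumed dense bfloat16 peak of one GPU in TFLOP/s (not stated in the source).
PEAK_TFLOPS = 312.0

# (params in billions, TP, PP, DP, GPUs, achieved TFLOP/s, reported utilization %)
configs = [
    (16, 2, 1, 96, 192, 162, 51.90),
    (51, 4, 2, 24, 192, 160, 51.30),
    (101, 4, 4, 12, 192, 165, 52.88),
]

results = []
for params_b, tp, pp, dp, n_gpus, tflops, reported in configs:
    world_size = tp * pp * dp                     # GPUs implied by 3D parallelism
    utilization = 100.0 * tflops / PEAK_TFLOPS    # percent of assumed peak
    results.append((params_b, world_size, utilization))
    print(
        f"{params_b}B: TP*PP*DP = {world_size} (table: {n_gpus}), "
        f"utilization = {utilization:.2f}% (table: {reported:.2f}%)"
    )
```

With this assumed peak, the computed values agree with the reported utilization to within rounding, and every configuration uses all 192 GPUs (24 nodes with 8 GPUs each).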

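Hardware

FLM-101B was trained on a cluster of 24 DGX-A800 GPU (8×80G) nodes, taking nearly 26 days in total. Following the model growth strategy, we trained and grew the 16B, 51B, and 101B models in sequence on this cluster.

Software

The training code of FLM-101B is a modified version of the Megatron-LM framework and will be open-sourced soon. The framework supports 3D parallelism and a distributed optimizer.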

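Bias, Risks, and Limitations

Although we have done our best to clean and filter the training corpus, the corpus is open in nature, so the model may still have learned from some unsafe examples. As a result, the model may generate unexpected text, including but not limited to discriminatory, biased, or abusive content. We remind users not to spread unsafe content that the model may generate. The developers of this project are not responsible for any consequences arising from the dissemination of harmful information.

FLM-101B has been trained on a relatively small number of tokens at this stage, so there is considerable room for improvement in its knowledge, especially domain-specific knowledge. In addition, inference has not yet been optimized, which leads to high resource usage and limited speed. To address this, we will soon support Flash Attention for inference. If you need improvements in these or other areas, please open an issue on GitHub and we will respond as soon as possible. Thank you!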

Citation

@article{flm-101b,
  author       = {Xiang Li and Yiqun Yao and Xin Jiang and Xuezhi Fang and Xuying Meng and
                  Siqi Fan and Peng Han and Jing Li and Li Du and Bowen Qin and Zheng Zhang and
                  Aixin Sun and Yequan Wang},
  title        = {FLM-101B: An Open LLM and How to Train It with \$100K Budget},
  year         = {2023}
}

Contact

tshwangyequan at gmail.com