weitianwen committed on
Commit
a0fbdea
1 Parent(s): d65489b

Update README.md

Files changed (1)
  1. README.md +9 -99
README.md CHANGED
@@ -37,110 +37,20 @@ license_link: >-
 </div>


- # Introduction
- **Skywork-13B-Base-XT**: These are intermediate checkpoints of the Skywork-13B-Base model, saved during the first stage of pre-training. At each checkpoint the model has been trained on X trillion tokens, where X is one of 0.5, 1, 1.5, 2, 2.5, or 3.

 If you are interested in more training and evaluation details, please refer to our [technical report](http://arxiv.org/abs/2310.19341), [Skymath](https://arxiv.org/abs/2310.16713) paper, and [SkyworkMM](https://github.com/will-singularity/Skywork-MM/blob/main/skywork_mm.pdf) paper.

- ## Training Data
- We have developed a data cleaning pipeline to filter low-quality, harmful, and sensitive content out of the raw text. Our Skywork-13B-Base model is trained on 3.2 trillion tokens of thoroughly cleaned, high-quality Chinese, English, and code data: English accounts for 52.2% of the dataset, Chinese for 39.6%, and code for 8%, so the model performs well in both Chinese and English while retaining solid coding ability.
-
- | | Category | Percentage |
- |-------------|------------------|------------|
- | **English** | Webpages | 39.8% |
- | | Books | 3.6% |
- | | Academic Papers | 3.0% |
- | | Encyclopedia | 0.5% |
- | | Miscellany | 2.9% |
- | **Chinese** | Webpages | 30.4% |
- | | Social Media | 5.5% |
- | | Encyclopedia | 0.8% |
- | | Miscellany | 3.1% |
- | **Other Lang.** | Encyclopedia | 2.4% |
- | **Code** | Github | 8.0% |
-
-
- ## Model Structure
- Compared with the Llama-2-13B model, Skywork-13B adopts a relatively thinner and deeper network structure with 52 layers, while the FFN dim and hidden dim are reduced to 12288 and 4608 respectively, keeping the total parameter count comparable to the original Llama-2-13B. In our preliminary experiments, this thinner and deeper structure generalized better under large-batch-size training. The detailed comparison between the Skywork-13B and Llama-2-13B models is as follows:
-
- | Model Structure | Llama2-13B | Skywork-13B |
- |----------------------|:----:|:-----------:|
- | Vocab. Size | 32,000 | 65,536 |
- | Hidden Dim. | 5,120 | 4,608 |
- | FFN Dim. | 13,696 | 12,288 |
- | Head Dim. | 128 | 128 |
- | Num. Heads | 40 | 36 |
- | Num. Layers | 40 | 52 |
- | Seq. Len. | 4,096 | 4,096 |
- | Positional Embedding | RoPE | RoPE |
-
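- As a rough sanity check on the claim that Skywork-13B keeps a parameter count comparable to Llama-2-13B, here is a minimal back-of-the-envelope sketch. It assumes a standard Llama-style decoder (Q/K/V/O attention projections, SwiGLU gate/up/down FFN, untied input and output embeddings); this is an approximation for illustration, not the official Skywork modeling code.
-
- ```python
- # Back-of-the-envelope parameter count for the Skywork-13B geometry in the table above.
- vocab, hidden, ffn, layers = 65536, 4608, 12288, 52
-
- embeddings = 2 * vocab * hidden    # token embedding matrix + LM head (assumed untied)
- attention  = 4 * hidden * hidden   # Q, K, V, O projections per layer
- ffn_block  = 3 * hidden * ffn      # gate, up and down projections per layer (SwiGLU)
-
- total = embeddings + layers * (attention + ffn_block)
- print(f"~{total / 1e9:.1f}B parameters")  # prints ~13.9B, comparable to Llama-2-13B
- ```
-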
- ## Tokenizer
- We use Byte-Pair Encoding (BPE) to tokenize the data, with a vocabulary size of 65536: 32000 Latin characters and subwords, 8000 Chinese characters and Unicode symbols, 25519 Chinese words, and 17 reserved symbols.
-
- | Category | Size |
- |---------------------------------|--------|
- | Latin based words & subwords | 32000 |
- | Chinese characters & Unicode symbols | 8000 |
- | Chinese words | 25519 |
- | Reserved symbols | 17 |
- | **Total** | **65536** |
-
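- As a quick check, here is a minimal sketch of loading the released tokenizer and inspecting its vocabulary size. The local path is illustrative and follows the quickstart layout below; the other checkpoints are expected to ship the same tokenizer.
-
- ```python
- from transformers import AutoTokenizer
-
- # Illustrative path; point it at whichever checkpoint you downloaded.
- tokenizer = AutoTokenizer.from_pretrained(
-     "Skywork-13B-Base-Intermediate/model_hubs/Skywork-13B-Base-0.5T/",
-     trust_remote_code=True,
- )
- print(tokenizer.vocab_size)                    # expected: 65536
- print(tokenizer.tokenize("陕西的省会是西安"))   # BPE tokenization of a Chinese sentence
- ```
-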
- # Quickstart
- We have open-sourced the model parameters, configuration files, tokenizer, and more on Hugging Face and ModelScope.
-
- ## Requirements
- - Python 3.8 and above
- - PyTorch 2.0 and above
- - CUDA 11.4 and above is recommended
-
- For the Skywork-13B-Base, Skywork-13B-Chat, and Skywork-13B-Math models, run the following script to install the Python dependencies:
-
- ```shell
- pip install -r requirements.txt
- ```
-
- ### Base Model Inference
- The following takes Skywork-13B-Base-0.5T as an example of base-model inference.
-
- ```python
- >>> from transformers import AutoModelForCausalLM, AutoTokenizer
- >>> import torch
-
- >>> # Load the tokenizer and the 0.5T-token intermediate checkpoint
- >>> tokenizer = AutoTokenizer.from_pretrained("Skywork-13B-Base-Intermediate/model_hubs/Skywork-13B-Base-0.5T/", trust_remote_code=True)
- >>> model = AutoModelForCausalLM.from_pretrained("Skywork-13B-Base-Intermediate/model_hubs/Skywork-13B-Base-0.5T/", device_map="auto", trust_remote_code=True).eval()
-
- >>> # Continue a Chinese prompt ("The capital of Shaanxi is Xi'an")
- >>> inputs = tokenizer('陕西的省会是西安', return_tensors='pt').to(model.device)
- >>> response = model.generate(inputs.input_ids, max_length=128)
- >>> print(tokenizer.decode(response.cpu()[0], skip_special_tokens=True))
- ```
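-
- If GPU memory is tight, the same checkpoint can also be loaded in half precision. This sketch continues the session above and uses the standard `torch_dtype` argument of `from_pretrained` (a generic transformers option, not anything Skywork-specific); bf16 needs an Ampere-or-newer GPU.
-
- ```python
- >>> model = AutoModelForCausalLM.from_pretrained(
- ...     "Skywork-13B-Base-Intermediate/model_hubs/Skywork-13B-Base-0.5T/",
- ...     device_map="auto",
- ...     torch_dtype=torch.bfloat16,   # roughly halves memory versus fp32
- ...     trust_remote_code=True,
- ... ).eval()
- ```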
-
+ # Introduction
+ In this repository we provide intermediate checkpoints of our Skywork-13B-Base model saved during Stage-1 pre-training (approximately 0~3T tokens), including the following (a minimal loading sketch follows the list):
+ - Skywork-13B-Base-0.5T
+ - Skywork-13B-Base-1T
+ - Skywork-13B-Base-1.5T
+ - Skywork-13B-Base-2T
+ - Skywork-13B-Base-2.5T
+ - Skywork-13B-Base-3T
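+
+ A minimal sketch of loading one of these checkpoints with `transformers`. The local directory below is an assumption following the `Skywork-13B-Base-Intermediate/model_hubs/<checkpoint>` layout used in the earlier quickstart; point it at wherever you downloaded the files.
+
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ # Hypothetical local directory; substitute any checkpoint from the list above.
+ ckpt = "Skywork-13B-Base-Intermediate/model_hubs/Skywork-13B-Base-2T/"
+
+ tokenizer = AutoTokenizer.from_pretrained(ckpt, trust_remote_code=True)
+ model = AutoModelForCausalLM.from_pretrained(ckpt, device_map="auto", trust_remote_code=True).eval()
+ ```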

 If you are interested in more training and evaluation details, please refer to our [technical report](http://arxiv.org/abs/2310.19341), [Skymath](https://arxiv.org/abs/2310.16713) paper, and [SkyworkMM](https://github.com/will-singularity/Skywork-MM/blob/main/skywork_mm.pdf) paper.

 # Declaration and License Agreement