# Miniature Chinese Llama2 Basic Model

[English](./readme_en.md) [简体中文](./readme.md)

This is an ultra-mini model with approximately 58M parameters, built on the Llama2 architecture. The uploaded version is pre-trained only and has not yet undergone SFT; a chat version fine-tuned with SFT will be released soon.

The goals of developing this ultra-mini model are:

1. To practice the full process of pre-training a basic large language model from scratch.
2. To provide a quickly deployable environment for developing large-parameter models, since loading large models is time-consuming and hinders rapid iterative development and debugging.
3. To enable quick parameter tuning and the reproduction of various optimization algorithms on consumer-grade graphics cards.

## Training Data

We collected 429 Chinese online fantasy novels and converted them to plain text. Lines with fewer than 10 characters or more than 4096 characters were removed; the result serves as the base data for pre-training. The cleaned txt file is 3.3 GB and contains 868M Chinese characters across 18M lines.

## Chinese Tokenizer

The tokenizer was also trained from scratch, without relying on any existing tokenizers.

Training parameters:

1. Maximum sentence length: 2657
2. Vocabulary size: 32000
3. Normalization rule: identity
4. Character coverage: 0.9995

|  | Llama2 | Baby Llama2 |
| --- | --- | --- |
| tokens | 32000 | 65534 |
| model_max_length | 4096 | 4096 |
| 白日依山尽,黄河入海流。欲穷千里目,更上一层楼。 | ['▁', '白', '日', '<0xE4>', '<0xBE>', '<0x9D>', '山', '<0xE5>', '<0xB0>', '<0xBD>', ',', '黄', '河', '入', '海', '流', '。', '<0xE6>', '<0xAC>', '<0xB2>', '<0xE7>', '<0xA9>', '<0xB7>', '千', '里', '目', ',', '更', '上', '一', '<0xE5>', '<0xB1>', '<0x82>', '<0xE6>', '<0xA5>', '<0xBC>', '。'] | ['▁白', '日', '依山', '尽', ',', '黄河', '入海', '流', '。', '欲', '穷', '千里', '目', ',', '更', '上一层', '楼', '。'] |
|  | [1, 29871, 30868, 30325, 231, 193, 160, 30329, 232, 179, 192, 30214, 31491, 30828, 30752, 30581, 31151, 30267, 233, 175, 181, 234, 172, 186, 31159, 30755, 30895, 30214, 31100, 30429, 30287, 232, 180, 133, 233, 168, 191, 30267] | [65534, 1764, 63106, 62484, 63203, 62793, 14729, 29082, 63130, 62795, 63920, 64266, 3271, 63038, 62793, 63007, 17116, 63636, 62795] |
| The primary use of LLaMA is research on large language models, including BERT, XLNet, and RoBERTa. | ['▁The', '▁primary', '▁use', '▁of', '▁L', 'La', 'MA', '▁is', '▁research', '▁on', '▁large', '▁language', '▁models', ',', '▁including', '▁B', 'ERT', ',', '▁X', 'L', 'Net', ',', '▁and', '▁Ro', 'BER', 'T', 'a', '.'] | ['▁T', 'h', 'e', '▁p', 'ri', 'm', 'ar', 'y', '▁', 'u', 'se', '▁o', 'f', '▁', '<0x4C>', '<0x4C>', 'a', 'M', 'A', '▁i', 's', '▁', 're', 'se', 'ar', 'ch', '▁o', 'n', '▁', 'l', 'ar', 'g', 'e', '▁', 'l', 'ang', 'ua', 'g', 'e', '▁m', 'od', 'e', 'ls', ',', '▁', 'in', 'c', 'lu', 'd', 'i', 'ng', '▁', '<0x42>', '<0x45>', '<0x52>', 'T', ',', '▁', 'X', '<0x4C>', '<0x4E>', 'e', 't', ',', '▁', 'an', 'd', '▁', '<0x52>', 'o', '<0x42>', '<0x45>', '<0x52>', 'T', 'a', '.'] |
|  | [1, 450, 7601, 671, 310, 365, 5661, 1529, 338, 5925, 373, 2919, 4086, 4733, 29892, 3704, 350, 20161, 29892, 1060, 29931, 6779, 29892, 322, 1528, 13635, 29911, 29874, 29889] | [65534, 14962, 63590, 64211, 27052, 16426, 63475, 13594, 64158, 62797, 63569, 11279, 13719, 65368, 62797, 81, 81, 63518, 64918, 64752, 24145, 63338, 62797, 44186, 11279, 13594, 9251, 13719, 63541, 62797, 64399, 13594, 64101, 64211, 62797, 64399, 37035, 36500, 64101, 64211, 2939, 11320, 64211, 53670, 62793, 62797, 18944, 63603, 14575, 64096, 63484, 1171, 62797, 71, 74, 87, 64760, 62793, 62797, 65257, 81, 83, 64211, 63073, 62793, 62797, 6604, 64096, 62797, 87, 63143, 71, 74, 87, 64760, 63518, 62801] |

The Llama2 tokenizer has 32,000 tokens and is optimized for English, while the Baby Llama2 tokenizer has 65,534 tokens and covers Chinese only. As the comparison shows, Baby Llama2 tokenizes Chinese text more compactly than standard Llama2, while its English tokenization is weaker than Llama2's.

## Full Training Corpus Processing

Before full training, the corpus is vectorized. Using the newly trained tokenizer, the txt files of the online novels are read line by line; each line is encoded, and an eos_token_id is appended to mark the end of the line. The resulting token ids are stored on disk as a two-dimensional np.uint16 array with shape [-1, max_sentence_length]. A minimal sketch of this step is shown below.
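The preprocessing script itself is not reproduced in this section, so the following is only a minimal sketch of the step described above, assuming a SentencePiece tokenizer file `tokenizer.model`, an input file `novels.txt`, and an output file `pretrain_data.bin` (all file names are assumptions):

```python
import numpy as np
import sentencepiece as spm

# File names below are placeholders; point them at the real corpus and tokenizer.
sp = spm.SentencePieceProcessor(model_file="tokenizer.model")
eos_id = sp.eos_id()  # assumes the tokenizer was trained with an EOS token

ids = []
with open("novels.txt", "r", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if not line:
            continue
        # Encode one line and mark its end with eos_token_id.
        ids.extend(sp.encode(line))
        ids.append(eos_id)

# Pack into a two-dimensional uint16 array (the vocabulary fits in 16 bits)
# and drop the ragged tail so the reshape is exact.
max_sentence_length = 2657  # from the tokenizer training parameters above
arr = np.array(ids, dtype=np.uint16)
arr = arr[: (len(arr) // max_sentence_length) * max_sentence_length]
arr = arr.reshape(-1, max_sentence_length)
arr.tofile("pretrain_data.bin")
print(f"saved {arr.shape[0]} rows of {max_sentence_length} tokens each")
```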
## Pre-training

Pre-training is done on a single RTX 3090. The model uses the Llama2 architecture with the following training parameters (a minimal data-loading sketch for this setup is given at the end of this README):

1. max_seq_len = 1024
2. dim = 768
3. n_heads = 12
4. n_layers = 12
5. n_kv_heads = 12

## Demonstration

[Huggingface Space For Baby Llama2](https://huggingface.co/spaces/wangqi777/wangqi777-chinese-baby-llama2)

## Citation

[llama2.c](https://github.com/karpathy/llama2.c)

[baby-llama2-chinese](https://github.com/DLLXW/baby-llama2-chinese)
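For reference, here is a minimal PyTorch sketch of how the packed uint16 corpus from the Full Training Corpus Processing section could be fed to a model with max_seq_len = 1024. The class name, file name, and batch size are assumptions for illustration, not code from this repository:

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset


class PackedCorpus(Dataset):
    """Reads the two-dimensional uint16 token array described above (name is hypothetical)."""

    def __init__(self, bin_path: str, row_length: int, max_seq_len: int = 1024):
        data = np.memmap(bin_path, dtype=np.uint16, mode="r")
        # Rows were written as fixed-length blocks of `row_length` tokens.
        self.rows = data.reshape(-1, row_length)
        self.max_seq_len = max_seq_len

    def __len__(self):
        return self.rows.shape[0]

    def __getitem__(self, idx):
        # Take max_seq_len + 1 tokens so inputs and next-token targets overlap by one position.
        chunk = self.rows[idx, : self.max_seq_len + 1].astype(np.int64)
        x = torch.from_numpy(chunk[:-1])  # model input
        y = torch.from_numpy(chunk[1:])   # next-token targets
        return x, y


# Example usage; the file name and batch size are assumptions.
dataset = PackedCorpus("pretrain_data.bin", row_length=2657, max_seq_len=1024)
loader = DataLoader(dataset, batch_size=8, shuffle=True)
x, y = next(iter(loader))
print(x.shape, y.shape)  # torch.Size([8, 1024]) torch.Size([8, 1024])
```

Each row of the packed array yields 1024 input tokens and their shifted next-token targets; an actual training run would additionally handle shuffling across epochs and, if needed, distributed sharding.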