---
license: gpl-3.0
language:
  - en
datasets:
  - HuggingFaceTB/cosmopedia-100k
  - pleisto/wikipedia-cn-20230720-filtered
pipeline_tag: text-generation
tags:
  - text-generation-inference
---

# NanoLM-365M-base


## Introduction

NanoLM-365M-base is based on Qwen2-0.5B, with the original tokenizer replaced by BilingualTokenizer-8K to reduce the parameter count. The total number of parameters drops from 0.5B to 365M.
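A back-of-the-envelope check makes the 0.5B → 365M reduction plausible. The sketch below assumes Qwen2-0.5B's published configuration (hidden size 896, vocabulary 151,936, ~494M total parameters with tied input/output embeddings) and an 8,192-entry vocabulary for BilingualTokenizer-8K; none of these figures are stated in this card.

```python
# Rough parameter accounting for the tokenizer swap (assumed figures, see above).
qwen2_total = 494_000_000   # Qwen2-0.5B total params (~494M, embeddings tied)
hidden_size = 896           # Qwen2-0.5B hidden dimension
old_vocab   = 151_936       # original Qwen2 tokenizer vocabulary
new_vocab   = 8_192         # assumed BilingualTokenizer-8K vocabulary

# Shrinking the (tied) embedding matrix removes (old - new) * hidden weights;
# the transformer backbone itself is unchanged.
saved = (old_vocab - new_vocab) * hidden_size
new_total = qwen2_total - saved
print(f"{new_total / 1e6:.0f} M")  # roughly 365 M
```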

## Details

To recover some performance and facilitate fine-tuning for downstream tasks, I chose to freeze the backbone parameters and only train the embedding part after replacing the tokenizer. Training was conducted for 40,000 steps on wikipedia-zh and cosmopedia-100k.
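The freezing scheme can be sketched in plain PyTorch. The stand-in module below is not the Qwen2 architecture; it only illustrates the pattern of disabling gradients everywhere except the token-embedding layer.

```python
import torch.nn as nn

# Minimal stand-in: an embedding plus a "backbone". The real model is
# Qwen2-0.5B; this sketch only demonstrates the freezing pattern.
class TinyLM(nn.Module):
    def __init__(self, vocab_size=8192, hidden=64):
        super().__init__()
        self.embed_tokens = nn.Embedding(vocab_size, hidden)
        self.backbone = nn.Sequential(
            nn.Linear(hidden, hidden),
            nn.Linear(hidden, hidden),
        )

model = TinyLM()

# Freeze everything, then re-enable gradients for the embedding only.
for name, param in model.named_parameters():
    param.requires_grad = name.startswith("embed_tokens")

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(trainable)  # only the embedding weights remain trainable
```

An optimizer built over `filter(lambda p: p.requires_grad, model.parameters())` will then update only the embedding, which is why fewer than 10M of the 365M parameters are trainable.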

|                             | Value                         |
| --------------------------- | ----------------------------- |
| Total Params                | 365 M                         |
| Trainable Params            | < 10 M                        |
| Trainable Parts             | `model.embed_tokens`          |
| Training Steps              | 40,000                        |
| Training Dataset            | wikipedia-zh, cosmopedia-100k |
| Optimizer                   | adamw_torch                   |
| Learning Rate               | 2e-4                          |
| LR Scheduler                | cosine                        |
| Weight Decay                | 0.1                           |
| Warm-up Ratio               | 0.03                          |
| Batch Size                  | 16                            |
| Gradient Accumulation Steps | 1                             |
| Seq Len                     | 4096                          |
| Dtype                       | bf16                          |
| Peak GPU Memory             | < 48 GB                       |
| Device                      | NVIDIA A100-SXM4-80GB         |
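The hyperparameters above map onto Hugging Face `transformers.TrainingArguments` keywords roughly as follows. This is a sketch: the exact training script for this run is not published, so the mapping is an assumption, though the keyword names themselves are standard `TrainingArguments` parameters.

```python
# Hypothetical TrainingArguments-style settings mirroring the table above.
# The actual script used for this run is not published; only the values
# in the table are taken from the card.
training_kwargs = dict(
    max_steps=40_000,
    optim="adamw_torch",
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    weight_decay=0.1,
    warmup_ratio=0.03,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=1,
    bf16=True,
)
```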

The specific training records are as follows:

*(figure: training result)*