---
license: apache-2.0
datasets:
  - QingyiSi/Alpaca-CoT
language:
  - zh
  - en
---

This is a beta release of a QLoRA adapter for Falcon-40b. Please read the instructions carefully before downloading the model.

Though Falcon is not specifically trained on a Chinese corpus, it exhibited strong Chinese language understanding in our experiments. Out of curiosity, we wanted to explore whether a small amount of Chinese instruction data could push it further and improve its generation ability.

The LoRA adapter is trained with the QLoRA code on a subset of bilingual instruction data from the Alpaca-CoT dataset for a mere 5k steps. The finetuned model is not as good at Chinese generation as carefully continue-trained-and-finetuned LLaMA models such as OpenBuddy and Ziya, yet it quickly adapts to the new language and generates surprisingly good results. We call for more research on applying Falcon-40b to the Chinese domain.
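For reference, below is a minimal sketch of how a QLoRA adapter like this one is typically loaded for inference with Hugging Face `transformers` and `peft`. The base model id `tiiuae/falcon-40b` is from the Falcon release; the adapter repo id and the function name are placeholders, not part of this release:

```python
# Hedged sketch: loading a QLoRA adapter on top of Falcon-40b with
# bitsandbytes 4-bit quantization (the usual QLoRA inference setup).
def load_qlora_model(
    base_id="tiiuae/falcon-40b",
    adapter_id="<this-repo-id>",  # placeholder: substitute this model's repo id
):
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
    from peft import PeftModel

    # NF4 4-bit quantization with bf16 compute, matching the QLoRA recipe
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )

    tokenizer = AutoTokenizer.from_pretrained(base_id)
    base = AutoModelForCausalLM.from_pretrained(
        base_id,
        quantization_config=bnb_config,
        device_map="auto",
        trust_remote_code=True,
    )
    # Attach the LoRA adapter weights on top of the quantized base model
    model = PeftModel.from_pretrained(base, adapter_id)
    return tokenizer, model
```

Loading the full 40B base model requires substantial GPU memory even in 4-bit (roughly 24 GB+ for the weights alone).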

## Evaluations

We evaluate on two Chinese language understanding benchmarks: C-Eval and the Gaokao subset of AGIEval.

- C-Eval made a breaking change on 2023/06/08, switching from few-shot to zero-shot evaluation.
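To make the few-shot setting concrete, here is a minimal sketch of assembling a 5-shot multiple-choice prompt in the style used by such benchmarks. The exact C-Eval prompt template may differ; the function names and the question/choices/answer format here are illustrative assumptions:

```python
# Hedged sketch of k-shot multiple-choice prompting (format is assumed,
# not the official C-Eval template).
def format_example(question, choices, answer=None):
    """Render one question with lettered choices; leave the answer blank
    for the question the model must complete."""
    lines = [question]
    lines += [f"{letter}. {choice}" for letter, choice in zip("ABCD", choices)]
    lines.append("Answer: " + (answer if answer is not None else ""))
    return "\n".join(lines)

def build_few_shot_prompt(exemplars, test_question, test_choices):
    """exemplars: list of (question, choices, answer) tuples (5 for 5-shot).
    The test question is appended last with an empty answer slot."""
    shots = [format_example(q, c, a) for q, c, a in exemplars]
    shots.append(format_example(test_question, test_choices))
    return "\n\n".join(shots)
```

Zero-shot evaluation is the same construction with an empty exemplar list.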

Result on the C-Eval test set with 5-shot and no CoT:

| Average | Avg (Hard) | STEM | Social Science | Humanities | Others |
|---------|-----------|------|----------------|------------|--------|
| 40.4    | 30.1      | 35.8 | 47.6           | 42.0       | 40.6   |

Result on the Gaokao subset of AGIEval with 0-shot:

| Average | GK-chinese | GK-English | GK-geography | GK-history | GK-biology | GK-chemistry | GK-physics | GK-mathqa | GK-mathcloze |
|---------|------------|------------|--------------|------------|------------|--------------|------------|-----------|--------------|
| 33.6    | 26.4       | 69.0       | 46.7         | 47.8       | 27.1       | 32.4         | 24.5       | 26.8      | 1.7          |