Llama-2-ko-DPO-13B

Under the revised criteria of the Open-AI-LLM leaderboard, the average score exceeded 50 percent for the first time. I am pretty proud of that, even though the score will soon fade into the background: I am simply testing a hypothesis rather than competing, and a lot of great models are coming out at the 7B scale. Since my day job is technical support, not R&D, I could not spend much time on it, so I only processed about 1,000 samples and tuned the model with DPO (Direct Preference Optimization) to reduce hallucination. The infrastructure was the same as before, an AWS g5.12xlarge instance, and no additional prompts were given.

I think the potential of the base LLM is enormous, seeing how much hallucination is reduced with very little data and effort. When I meet with customers, many of them have difficulty implementing GenAI features. But implementing them does not take much effort, since many code templates and APIs are already well made. It is a world where anyone who is willing to process data can quickly and easily create their own quality model.

Model Details

Datasets

  • 1,000 samples generated by myself
  • Sentences generated by Amazon Bedrock Claude-2 were adopted as chosen, and sentences generated by the Llama-2-13B model fine-tuned with SFT were adopted as rejected.
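
As a rough illustration of how such chosen/rejected pairs feed into DPO, here is a minimal sketch of one preference record and the standard per-pair DPO loss. The `PreferencePair` class, the placeholder strings, the example log-probabilities, and `beta=0.1` are assumptions made for illustration; they are not the actual data or training settings used for this model.

```python
import math
from dataclasses import dataclass


@dataclass
class PreferencePair:
    """One DPO training record (hypothetical layout, for illustration only)."""
    prompt: str
    chosen: str    # preferred answer, e.g. generated by Claude-2 on Amazon Bedrock
    rejected: str  # dispreferred answer, e.g. generated by the SFT-tuned Llama-2-13B


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Per-pair DPO loss: -log(sigmoid(beta * (chosen log-ratio - rejected log-ratio))).

    Each argument is the summed log-probability of a full response under
    either the policy being trained or the frozen reference model.
    beta=0.1 is an assumed value, not the one used for this model.
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(m)) equals softplus(-m); this form avoids overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Placeholder record; the real samples are not published in this card.
pair = PreferencePair(
    prompt="<question>",
    chosen="<answer from Claude-2>",
    rejected="<answer from the SFT model>",
)

# Made-up log-probabilities: the policy already prefers the chosen answer
# slightly more than the reference does, so the loss falls below log(2).
loss = dpo_loss(policy_chosen_logp=-40.0, policy_rejected_logp=-55.0,
                ref_chosen_logp=-42.0, ref_rejected_logp=-50.0)
print(f"DPO loss for one pair: {loss:.4f}")
```

When the policy and the reference assign identical log-probabilities, the margin is zero and the loss is log(2); training pushes the margin up so that the chosen answer becomes relatively more likely than the rejected one.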

Benchmark

| Model | Average | Ko-ARC | Ko-HellaSwag | Ko-MMLU | Ko-TruthfulQA | Ko-CommonGen V2 |
|---|---|---|---|---|---|---|
| daekeun-ml/Llama-2-ko-DPO-13B (Ours) | 51.03 | 47.53 | 58.28 | 43.59 | 51.91 | 53.84 |
| daekeun-ml/Llama-2-ko-instruct-13B | 49.52 | 46.5 | 56.9 | 43.76 | 42 | 58.44 |
| kyujinpy/Korean-OpenOrca-13B | 48.79 | 43.09 | 54.13 | 40.24 | 45.22 | 61.28 |
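
The Average column appears to be the unweighted mean of the five task scores. A quick check under that assumption, using only the numbers from the table above (the variable and function names are my own):

```python
# Task scores in table order:
# Ko-ARC, Ko-HellaSwag, Ko-MMLU, Ko-TruthfulQA, Ko-CommonGen V2
scores = {
    "daekeun-ml/Llama-2-ko-DPO-13B": [47.53, 58.28, 43.59, 51.91, 53.84],
    "daekeun-ml/Llama-2-ko-instruct-13B": [46.5, 56.9, 43.76, 42, 58.44],
    "kyujinpy/Korean-OpenOrca-13B": [43.09, 54.13, 40.24, 45.22, 61.28],
}


def average(task_scores):
    """Unweighted mean of the task scores, rounded to two decimals."""
    return round(sum(task_scores) / len(task_scores), 2)


averages = {name: average(vals) for name, vals in scores.items()}
for name, avg in averages.items():
    print(f"{name}: {avg}")
```

The computed means match the reported Average values for all three models, which is consistent with a simple unweighted average.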

License

  • Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International Public License, under LLAMA 2 COMMUNITY LICENSE AGREEMENT

This model was created as a personal experiment, unrelated to the organization I work for.

Model size: 13.2B params (Safetensors), tensor type: FP16
