tags:
- merge
language:
- vi
---

# Arcee-VyLinh

Arcee-VyLinh is a 3B-parameter instruction-following model optimized for Vietnamese language understanding and generation. Built through a training process that combines evolved hard questions with iterative Direct Preference Optimization (DPO), it achieves strong performance despite its compact size.

## Model Details

- **Architecture:** Based on Qwen2.5-3B
- **Parameters:** 3 billion
- **Context Length:** 4096 tokens (see the truncation sketch after this list)
- **Training Data:** Custom evolved dataset + ORPO-Mix-40K (Vietnamese)
- **Training Method:** Multi-stage process including EvolKit, proprietary merging, and iterative DPO
- **Input Format:** Supports both English and Vietnamese; optimized for Vietnamese
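
Inputs longer than the context window should be truncated at tokenization time. A minimal sketch using standard `transformers` tokenizer options; only the 4096-token limit is model-specific, and the sample text is purely illustrative:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("arcee-ai/Arcee-VyLinh")

long_document = "Xin chào! " * 5000  # deliberately longer than the context window

# Truncate at tokenization time so generation never starts from
# a prompt that exceeds the 4096-token window.
inputs = tokenizer(long_document, truncation=True, max_length=4096, return_tensors="pt")
print(inputs.input_ids.shape)  # at most (1, 4096)
```

In practice, reserve part of the window for generation, e.g. truncate prompts to 4096 minus your `max_new_tokens` budget.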

## Intended Use

- Vietnamese-language chat and instruction following
- Text generation and completion
- Question answering
- General language understanding tasks
- Content creation and summarization

## Performance and Limitations

### Strengths

- Exceptional performance on complex Vietnamese language tasks
- Efficient 3B-parameter architecture
- Strong instruction-following capabilities
- Competitive with larger models (4B-8B parameters)

### Benchmarks

Evaluated on the Vietnamese subset of m-ArenaHard (CohereForAI), with Claude 3.5 Sonnet as the judge (a sketch of this pairwise protocol follows the list):

- vs PhoGPT-4B-Chat: 95.4% win rate
- vs Vistral-7B-chat: 80.0% win rate
- vs Qwen2.5-7B-Instruct: 57.1% win rate
- vs Llama3.1-8B-Instruct: 61.8% win rate
- vs VinaLlama3.1-8B-Instruct: 78.4% win rate
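
The exact evaluation harness is not published in this card; the sketch below is a generic pairwise LLM-as-judge loop, where `judge_prefers_a` is a hypothetical callable standing in for a request to the judge model (Claude 3.5 Sonnet):

```python
def pairwise_win_rate(answers_a, answers_b, judge_prefers_a):
    """Fraction of prompts on which the judge prefers model A's answer.

    answers_a / answers_b: responses from the two models to the same prompts.
    judge_prefers_a: hypothetical callable (answer_a, answer_b) -> bool
    wrapping a request to the judge model.
    """
    wins = sum(judge_prefers_a(a, b) for a, b in zip(answers_a, answers_b))
    return wins / len(answers_a)
```

Real harnesses typically also randomize answer order to control for the judge's position bias and score ties as half a win.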

### Limitations

- Limited to a 4096-token context window
- Primarily focused on Vietnamese language understanding
- May not perform optimally in specialized technical domains

## Training Process

Our training pipeline consisted of several stages:

1. **Base Model Selection:** Started from Qwen2.5-3B
2. **Hard Question Evolution:** Generated 20K challenging questions using EvolKit
3. **Initial Training:** Created VyLinh-SFT through supervised fine-tuning
4. **Model Merging:** Applied a proprietary merging technique with Qwen2.5-3B-Instruct
5. **DPO Training:** 6 epochs of iterative DPO using ORPO-Mix-40K (the objective is sketched below)
6. **Final Merge:** Combined with Qwen2.5-3B-Instruct for optimal performance
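
The training code is not included with this card, but the DPO objective used in step 5 is standard (Rafailov et al., 2023): the policy is pushed to prefer the chosen response over the rejected one by a larger margin than a frozen reference model does. A minimal PyTorch sketch, assuming log-probabilities have already been summed over response tokens:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO loss over a batch of preference pairs.

    Each argument is a 1-D tensor of sequence log-probabilities under
    the trainable policy or the frozen reference model; beta scales
    the implicit KL penalty toward the reference.
    """
    policy_margin = policy_chosen_logps - policy_rejected_logps
    ref_margin = ref_chosen_logps - ref_rejected_logps
    # Maximize the log-sigmoid of how much more the policy prefers
    # the chosen answer than the reference model does.
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()
```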

## Usage Examples

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the model and tokenizer
model = AutoModelForCausalLM.from_pretrained("arcee-ai/Arcee-VyLinh").to(device)
tokenizer = AutoTokenizer.from_pretrained("arcee-ai/Arcee-VyLinh")

prompt = ""  # your prompt here
messages = [
    {"role": "system", "content": "Bạn là trợ lí hữu ích."},  # "You are a helpful assistant."
    {"role": "user", "content": prompt},
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(device)

generated_ids = model.generate(
    model_inputs.input_ids,
    max_new_tokens=1024,
    eos_token_id=tokenizer.eos_token_id,
    do_sample=True,  # sampling must be enabled for temperature to take effect
    temperature=0.25,
)
# Strip the prompt tokens so only the newly generated reply is decoded
generated_ids = [
    output_ids[len(input_ids):]
    for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```

Quantized version: [arcee-ai/Arcee-VyLinh-GGUF](https://huggingface.co/arcee-ai/Arcee-VyLinh-GGUF)
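
For CPU or low-memory setups, the GGUF build can be run locally; a minimal sketch using the llama-cpp-python bindings (the model filename below is illustrative, so check the GGUF repository for the actual files):

```python
from llama_cpp import Llama  # pip install llama-cpp-python

llm = Llama(model_path="arcee-vylinh-q4_k_m.gguf", n_ctx=4096)  # hypothetical filename

result = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "Bạn là trợ lí hữu ích."},  # "You are a helpful assistant."
        {"role": "user", "content": "Việt Nam có bao nhiêu tỉnh thành?"},  # "How many provinces does Vietnam have?"
    ],
    temperature=0.25,
    max_tokens=512,
)
print(result["choices"][0]["message"]["content"])
```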