mav23 committed
Commit c06a30f · verified · 1 Parent(s): 59b22e3

Upload folder using huggingface_hub

Files changed (3):
  1. .gitattributes +1 -0
  2. README.md +100 -0
  3. arcee-vylinh.Q4_0.gguf +3 -0
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+arcee-vylinh.Q4_0.gguf filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,100 @@
---
base_model:
- qnguyen3/VyLinh-3B
- Qwen/Qwen2.5-3B-Instruct
library_name: transformers
tags:
- mergekit
- merge
language:
- vi
---
**Quantized Version**: [arcee-ai/Arcee-VyLinh-GGUF](https://huggingface.co/arcee-ai/Arcee-VyLinh-GGUF)

# Arcee-VyLinh

Arcee-VyLinh is a 3B-parameter instruction-following model optimized specifically for Vietnamese language understanding and generation. Built through a training process combining evolved hard questions and iterative Direct Preference Optimization (DPO), it achieves strong performance despite its compact size.

## Model Details

- **Architecture:** Based on Qwen2.5-3B
- **Parameters:** 3 billion
- **Context Length:** 32K tokens
- **Training Data:** Custom evolved dataset + ORPO-Mix-40K (Vietnamese)
- **Training Method:** Multi-stage process including EvolKit, proprietary merging, and iterative DPO
- **Input Format:** Supports both English and Vietnamese, optimized for Vietnamese

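This upload ships the weights as `arcee-vylinh.Q4_0.gguf`. As a simplified illustration of the Q4_0 scheme (not the exact llama.cpp codec, which packs two 4-bit values per byte and stores the block scale as fp16): blocks of 32 weights share one scale, chosen so the largest-magnitude weight maps to -8.

```python
def quantize_q4_0_block(weights):
    """Simplified Q4_0: map a block of 32 floats to 4-bit integers in
    [0, 15] sharing one scale. The real GGUF codec additionally packs
    two quants per byte and stores the scale as fp16."""
    assert len(weights) == 32
    amax = max(weights, key=abs)           # weight with largest magnitude
    d = amax / -8 if amax != 0 else 1.0    # scale so that amax maps to -8
    return d, [max(0, min(15, int(w / d + 8.5))) for w in weights]

def dequantize_q4_0_block(d, quants):
    """Recover approximate weights: x ≈ d * (q - 8)."""
    return [d * (q - 8) for q in quants]

block = [0.1 * (i - 16) for i in range(32)]  # toy weights in [-1.6, 1.5]
d, quants = quantize_q4_0_block(block)
restored = dequantize_q4_0_block(d, quants)
max_err = max(abs(a - b) for a, b in zip(block, restored))
print(f"scale={d:.4f}, max reconstruction error={max_err:.4f}")
```

At 4.5 bits per weight for the quantized tensors (plus higher-precision embedding and output layers), this is what brings a 3B-parameter model down to roughly 2 GB on disk.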
## Intended Use

- Vietnamese language chat and instruction following
- Text generation and completion
- Question answering
- General language understanding tasks
- Content creation and summarization

## Performance and Limitations

### Strengths

- Exceptional performance on complex Vietnamese language tasks
- Efficient 3B-parameter architecture
- Strong instruction-following capabilities
- Competitive with larger models (4B-8B parameters)

### Benchmarks

Tested on the Vietnamese subset of m-ArenaHard (CohereForAI), with Claude 3.5 Sonnet as judge:

![image/png](https://cdn-uploads.huggingface.co/production/uploads/630430583926de1f7ec62c6b/m1bTn0vkiPKZ3uECC4b0L.png)

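Arena-style benchmarks like m-ArenaHard score models by pairwise judge verdicts. A minimal sketch of the win-rate arithmetic (the verdict list is hypothetical, not the actual evaluation data or harness; counting ties as half a win is a common convention):

```python
def win_rate(verdicts):
    """Compute a pairwise win rate from judge verdicts.
    Ties count as half a win, a common arena-eval convention."""
    score = sum(1.0 if v == "win" else 0.5 if v == "tie" else 0.0
                for v in verdicts)
    return score / len(verdicts)

# Hypothetical verdicts from a judge comparing two models' answers
verdicts = ["win", "win", "tie", "loss", "win"]
print(f"win rate: {win_rate(verdicts):.2f}")  # → win rate: 0.70
```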
### Limitations

- May still hallucinate on culture-specific content
- Primarily focused on Vietnamese language understanding
- May not perform optimally in specialized technical domains

## Training Process

Our training pipeline consisted of several innovative stages:

1. **Base Model Selection:** Started with Qwen2.5-3B
2. **Hard Question Evolution:** Generated 20K challenging questions using EvolKit
3. **Initial Training:** Created VyLinh-SFT through supervised fine-tuning
4. **Model Merging:** Proprietary merging technique with Qwen2.5-3B-Instruct
5. **DPO Training:** 6 epochs of iterative DPO using ORPO-Mix-40K
6. **Final Merge:** Combined with Qwen2.5-3B-Instruct for optimal performance

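The DPO stage above trains the policy to prefer chosen over rejected responses relative to a frozen reference model. A minimal sketch of the per-pair objective in plain Python (the log-probabilities and beta here are illustrative values, not the actual training configuration):

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * margin)), where margin is
    the policy's log-ratio advantage over the reference on chosen vs rejected."""
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Illustrative sequence log-probabilities (not real training values):
# before preference learning, policy matches the reference -> loss = log 2
loss_neutral = dpo_loss(-10.0, -10.0, -10.0, -10.0)
# policy now assigns more mass to the chosen response -> lower loss
loss_better = dpo_loss(-8.0, -12.0, -10.0, -10.0)
print(loss_neutral, loss_better)
```

The loss drops as the policy widens its log-probability margin between chosen and rejected responses relative to the reference, which is exactly what the iterative DPO epochs optimize.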
## Usage Examples

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model and tokenizer
device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModelForCausalLM.from_pretrained("arcee-ai/Arcee-VyLinh").to(device)
tokenizer = AutoTokenizer.from_pretrained("arcee-ai/Arcee-VyLinh")

prompt = "Một cộng một bằng mấy?"  # "What is one plus one?"
messages = [
    {"role": "system", "content": "Bạn là trợ lí hữu ích."},  # "You are a helpful assistant."
    {"role": "user", "content": prompt}
]
# Render the conversation with the model's chat template
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(device)

generated_ids = model.generate(
    model_inputs.input_ids,
    max_new_tokens=1024,
    eos_token_id=tokenizer.eos_token_id,
    do_sample=True,
    temperature=0.25,
)
# Keep only the newly generated tokens, stripping the prompt
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```
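For reference, Qwen2.5-derived models use the ChatML conversation format, so the `apply_chat_template` call above renders the messages roughly as follows. This is a manual sketch assuming the stock Qwen template; the tokenizer's bundled template is authoritative:

```python
def to_chatml(messages, add_generation_prompt=True):
    """Render messages in ChatML, the format used by Qwen2.5-family chat
    templates: each turn is wrapped in <|im_start|>role ... <|im_end|>."""
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages]
    if add_generation_prompt:
        # Open an assistant turn for the model to complete
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

messages = [
    {"role": "system", "content": "Bạn là trợ lí hữu ích."},
    {"role": "user", "content": "Một cộng một bằng mấy?"},
]
print(to_chatml(messages))
```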
arcee-vylinh.Q4_0.gguf ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:c83240beb25c80b1be4c74311d52e28da58f65390174778ec3eab08d056bb90e
+size 1997880320
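The file above is a Git LFS pointer, not the model weights themselves: it records the spec version, the SHA-256 of the actual blob, and its size in bytes (about 1.86 GiB here). A small sketch of parsing such a pointer:

```python
POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:c83240beb25c80b1be4c74311d52e28da58f65390174778ec3eab08d056bb90e
size 1997880320
"""

def parse_lfs_pointer(text):
    """Parse a Git LFS pointer file into its key/value fields."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {"version": fields["version"], "algo": algo,
            "digest": digest, "size": int(fields["size"])}

info = parse_lfs_pointer(POINTER)
print(f"{info['size'] / 2**30:.2f} GiB, {info['algo']} digest {info['digest'][:12]}...")
```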