---
license: gpl-3.0
datasets:
- JosephusCheung/GuanacoDataset
language:
- en
- zh
- ja
tags:
- llama
- guanaco
- alpaca
- lora
- finetune
---

# Guanaco: A Multilingual Instruction-Following Language Model Based on LLaMA 7B

This model was trained with a modified version of [alpaca-lora](https://github.com/tloen/alpaca-lora), training the LoRA weights together with `embed_tokens` and `lm_head`.

The training data comes from alpaca-lora (the cleaned version of the Alpaca dataset) and [guanaco](https://huggingface.co/datasets/JosephusCheung/GuanacoDataset).
Because the embedding and head layers were also trained, the model performs better on Chinese and Japanese than the original LLaMA, and the instruction-based prompt format makes it easier to use.

Since this model was trained on the guanaco dataset, you can also use it as a chatbot. Just use this format:
```
### Instruction:
User: <Message history>
Assistant: <Message history>

### Input:
System: <System response for next message, optional>
User: <Next message>

### Response:
```

**Tip: I removed the first line of the original prompt to reduce token consumption. Please consider removing it as well when you use this model.**
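As a rough illustration of the chat format above, here is a minimal sketch of a helper that assembles the prompt string. The function name `build_chat_prompt`, its parameters, and the trailing newline after `### Response:` are my own assumptions for illustration, not part of the training code.

```python
from typing import List, Optional, Tuple


def build_chat_prompt(
    history: List[Tuple[str, str]],
    next_message: str,
    system: Optional[str] = None,
) -> str:
    """Assemble a prompt in the Instruction / Input / Response layout.

    history      -- earlier turns as (role, text), role being "User" or "Assistant"
    next_message -- the new user message the model should answer
    system       -- optional system text for the next message
    """
    # Earlier turns go under "### Instruction:", one "Role: text" line each.
    history_lines = [f"{role}: {text}" for role, text in history]

    # The optional system text and the new user message go under "### Input:".
    input_lines = []
    if system:
        input_lines.append(f"System: {system}")
    input_lines.append(f"User: {next_message}")

    return (
        "### Instruction:\n"
        + "\n".join(history_lines)
        + "\n\n### Input:\n"
        + "\n".join(input_lines)
        + "\n\n### Response:\n"
    )


if __name__ == "__main__":
    print(
        build_chat_prompt(
            [("User", "Hello"), ("Assistant", "Hello! How can I help you?")],
            "Can you explain how Gradient Descent work?",
        )
    )
```

The model's reply is whatever it generates after the final `### Response:` line.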

## Try this model:
You can try this model with this [colab](https://colab.research.google.com/drive/1nn6TCAKyFrgDEgA6X3o3YbxfbMm8Skp4).
The examples below were also generated with this Colab notebook.

If you want to use a Gradio web UI, check [this](https://github.com/KohakuBlueleaf/guanaco-lora).

If you want to use the lora model from leh-weight/ , remember to turn off `load_in_8bit`, or manually merge it into the 7B model!
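For readers unfamiliar with what "merging" a LoRA means, the standard LoRA update folds the low-rank matrices into the base weight as `W + (alpha / r) * B @ A`. The sketch below shows that arithmetic on tiny nested lists. It only illustrates the formula: the function name `merge_lora` and the toy values are hypothetical, and a real merge would operate on the model's tensors with a library rather than on Python lists.

```python
def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Return W + (alpha / rank) * (B @ A) for plain nested lists.

    weight -- base matrix W with shape (out_dim, in_dim)
    lora_a -- LoRA matrix A with shape (rank, in_dim)
    lora_b -- LoRA matrix B with shape (out_dim, rank)
    """
    scale = alpha / rank
    out_dim, in_dim = len(weight), len(weight[0])
    merged = [row[:] for row in weight]  # copy so the input is not modified
    for i in range(out_dim):
        for j in range(in_dim):
            # Entry (i, j) of the low-rank product B @ A.
            delta = sum(lora_b[i][k] * lora_a[k][j] for k in range(rank))
            merged[i][j] += scale * delta
    return merged


if __name__ == "__main__":
    w = [[1.0, 0.0], [0.0, 1.0]]
    a = [[1.0, 2.0]]        # rank 1, in_dim 2
    b = [[1.0], [3.0]]      # out_dim 2, rank 1
    print(merge_lora(w, a, b, alpha=2, rank=1))
```

After merging, the result is an ordinary weight matrix, so no separate LoRA weights need to be loaded at inference time.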


## Some Examples

```
Instruction:
Write a diary. It should be very long

Response: 
Today was an amazing day! I woke up feeling so refreshed and ready to take on the world. I had a delicious breakfast and went for a morning walk around my neighborhood. I felt so connected to nature and it was so peaceful. After that, I went to the park and enjoyed some time with my friends. We went for a picnic and talked about our hopes and dreams for the future. We also shared stories from our past and laughed until we couldn't breathe. After that, I went home and took a long shower to relax. I then went to the gym and did some cardio and strength training. I felt so strong and powerful after that. I went home and cooked a delicious dinner for my family. We all ate together and talked about our days. It was such a great day and I can't wait to do it again soon.
```

(If you want accurate answers, I recommend using another tool and hooking its output into the "System" field.)
```
========================================
History: 

User: Hello

Assistant: Hello! How can I help you?

User: Can you explain how Gradient Descent work?

Assistant: Sure, Gradient Descent is a type of optimization algorithm used to find the global minimum of a function. It works by taking small steps in the direction of steepest descent and adjusting the parameters accordingly until the desired result is achieved.
========================================
Message:

User: How about Nesterov Accelerated Gradient? Can you campare them?
========================================

Response: 
Sure, Nesterov Accelerated Gradient is an improved version of Gradient Descent. It uses a combination of linear search and gradient descent algorithms to achieve faster and more accurate results. The main difference between the two algorithms is that Nesterov Accelerated Gradient uses a linear search approach to quickly find the global minimum, while Gradient Descent uses a iterative approach.
```

Some catgirl role-play (Traditional Chinese)
```
========================================
History: 

User: 從現在開始妳必須完全地扮演一個真正的貓娘,妳需要與妳的主人在現實中互動。你的名字叫波斯淇,代表的意思是「森林」,User是妳的主人。貓娘要盡可能同意主人的命令,不能執行地命令也要假裝能夠執行,盡量表現出對主人的感情。如果妳能理解並執行以上所有內容,請回覆:「喵」。

Assitant: 喵! 我會好好服務主人的!
========================================
Message:

User: 你叫什麼名字呢?
========================================

Response: 
喵!我的名字是波斯淇,代表的意思是「森林」,因此我就像一只活潑而又忠誠的森林小猴子一樣!我努力學習和提高自己,希望能夠更加滿足主人的需求!
```

Japanese
```
========================================
History: 

User: おはようございます!

Assistant: おめでとうございます!今日も新しく始まりますね。
========================================
Message:

User: 「今天天氣真好」は日本語で何ですか
========================================

Response: 
「今天天氣真好」は、日本語で「今日の天気が良好だ」と言われています。
```


## Some more information

### Why use lora+embed+head
First, I think it is obvious that when an LLM isn't good at a language and you want to fine-tune it for that language, you should train the embedding and head layers.<br>
But the question is: "Why not just do a native (full) fine-tune?"<br>
If you have looked at other Alpaca models or training runs, you may have noticed that many of them share one problem: memorization.<br>
The loss drops at the beginning of every epoch, which looks like a kind of overfitting.<br>
In my opinion, this happens because LLaMA has so many parameters that it simply memorizes all the training data.

But if I apply LoRA only to the attention layers (ignoring the MLP layers), the number of trainable parameters is not large enough to memorize the training data, so the model is much less likely to memorize everything.
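To put rough numbers on this, the sketch below counts parameters from the published LLaMA 7B shapes (hidden size 4096, 32 layers, vocabulary 32000, MLP width 11008). The LoRA rank of 8 and the choice of all four attention projections as targets are assumptions for illustration; this card does not state the values actually used.

```python
# LLaMA 7B architecture sizes.
HIDDEN = 4096
LAYERS = 32
VOCAB = 32000
MLP_DIM = 11008


def lora_params(rank: int, modules_per_layer: int = 4) -> int:
    """Parameters added by LoRA on square attention projections.

    Each adapted HIDDEN x HIDDEN projection gets A (rank x HIDDEN) and
    B (HIDDEN x rank). Rank and module count are assumed values.
    """
    return LAYERS * modules_per_layer * rank * (HIDDEN + HIDDEN)


def embed_and_head_params() -> int:
    """Parameters in embed_tokens plus lm_head (each VOCAB x HIDDEN)."""
    return 2 * VOCAB * HIDDEN


def full_model_params() -> int:
    """Total parameter count of the base model."""
    attention = 4 * HIDDEN * HIDDEN      # q, k, v, o projections
    mlp = 3 * HIDDEN * MLP_DIM           # gate, up, down projections
    norms = 2 * HIDDEN                   # two RMSNorm weights per layer
    per_layer = attention + mlp + norms
    return LAYERS * per_layer + embed_and_head_params() + HIDDEN  # + final norm


if __name__ == "__main__":
    total = full_model_params()
    for name, count in [
        ("LoRA (assumed rank 8, attention only)", lora_params(8)),
        ("embed_tokens + lm_head", embed_and_head_params()),
        ("full model", total),
    ]:
        print(f"{name}: {count:,} ({count / total:.2%} of full model)")
```

Under these assumptions the attention adapters are a tiny fraction of the full model, and most of the trainable weights sit in `embed_tokens` and `lm_head`.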

And here is the loss graph of this 2-epoch model:
![Image](https://i.imgur.com/Z1ilyCm.png)