---
license: apache-2.0
datasets:
- kaist-ai/CoT-Collection
metrics:
- accuracy
pipeline_tag: text-generation
---
# Model card for aiplanet/effi-13b

effi-13B is a 13-billion-parameter causal decoder-only model built by AI Planet. It is based on Llama-2-13b-chat-hf and fine-tuned on 1.8 million conversations from the CoT Collection dataset, available on Hugging Face Datasets. The model is made available under the Apache 2.0 license.

## Why use effi-13B-Instruct?
- This is a ready-to-use chat/instruct model based on Llama-2-13b-chat-hf that provides a rationale for the context it is given.
- Llama-2 was among the strongest openly available models at the time of release. Because effi-13b is itself an instruct model, it may not be ideal for further fine-tuning. If you are interested in building your own instruct/chat model, we recommend starting from **Llama-2-13b-chat-hf**.

You will need roughly **85-100 GB of memory to run inference with effi-13b comfortably**.
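As a rough orientation for memory planning, the sketch below estimates the memory taken by the weights alone at different precisions. It assumes a round figure of 13 billion parameters and does not attempt to reproduce the figure quoted above, since activations, the KV cache, and framework overhead come on top and depend on batch size and sequence length. `N_PARAMS` and `weight_memory_gb` are names made up for this illustration.

```python
# Rough weight-only memory estimate; the parameter count is an approximation.
N_PARAMS = 13e9  # assumed: about 13 billion parameters

# Storage cost per parameter for common precisions
BYTES_PER_PARAM = {
    "float32": 4.0,
    "float16/bfloat16": 2.0,
    "int8": 1.0,
    "4-bit (nf4)": 0.5,
}


def weight_memory_gb(n_params: float, bytes_per_param: float) -> float:
    """Return the memory needed for the weights alone, in GB (10**9 bytes)."""
    return n_params * bytes_per_param / 1e9


estimates = {
    name: weight_memory_gb(N_PARAMS, size) for name, size in BYTES_PER_PARAM.items()
}

for name, gb in estimates.items():
    print(f"{name:>18}: {gb:5.1f} GB")
```

Loading the model in half precision or with 4-bit quantization reduces the weight footprint accordingly, at some possible cost in output quality.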

## Model Details

### Model Description

This model has been fine-tuned on Chain of Thought datasets, which contain context from mixed sources along with corresponding rationales. The resulting fine-tuned large language model (LLM) shows an enhanced ability to solve novel tasks by providing its reasoning.

- **Developed by:** AI Planet
- **Model type:** Causal decoder-only
- **Language(s) (NLP):** English
- **License:** Apache 2.0
- **Finetuned from model:** Llama-2-13b-chat-hf



### Direct Use

effi-13b has been finetuned on a Chain of Thought dataset.

### Out-of-Scope Use

Production use without adequate assessment of risks and mitigation; any use cases which may be considered irresponsible or harmful.


## Bias, Risks, and Limitations

This model has been trained mostly on English data and will not generalize appropriately to other languages. Furthermore, because it is trained on large-scale corpora representative of the web, it will carry the stereotypes and biases commonly encountered online.

### Recommendations

We recommend that users of effi-13b develop guardrails and take appropriate precautions for any production use.

Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information is needed for further recommendations.

## How to Get Started with the Model

Use the code below to get started with the model.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_card = "aiplanet/effi-13b"

# Load the model weights and the matching tokenizer from the Hugging Face Hub
model = AutoModelForCausalLM.from_pretrained(model_card)
tokenizer = AutoTokenizer.from_pretrained(model_card)

# Build a text-generation pipeline around the loaded model
generate_text = pipeline(
    model=model, tokenizer=tokenizer,
    return_full_text=True,  # langchain expects the full text
    task='text-generation',
    # we pass model parameters here too
    temperature=0.4,  # 'randomness' of outputs; lower values are more deterministic
    max_new_tokens=512,  # max number of tokens to generate in the output
    repetition_penalty=1.1  # without this output begins repeating
)

# Example user prompt: a code snippet for the model to explain
prompt = """
```
Can you explain this code in detail?

def generate_stream(tokenizer, model, params, device,
                    context_len=2048, stream_interval=2):

    prompt = params["prompt"]
    l_prompt = len(prompt)
    temperature = float(params.get("temperature", 1.0))
    max_new_tokens = int(params.get("max_new_tokens", 256))
    stop_str = params.get("stop", None)

    input_ids = tokenizer(prompt).input_ids
    output_ids = list(input_ids)

    max_src_len = context_len - max_new_tokens - 8
    input_ids = input_ids[-max_src_len:]

    for i in range(max_new_tokens):
        if i == 0:
            out = model(
                torch.as_tensor([input_ids], device=device), use_cache=True)
            logits = out.logits
            past_key_values = out.past_key_values
        else:
            attention_mask = torch.ones(
                1, past_key_values[0][0].shape[-2] + 1, device=device)
            out = model(input_ids=torch.as_tensor([[token]], device=device),
                        use_cache=True,
                        attention_mask=attention_mask,
                        past_key_values=past_key_values)
            logits = out.logits
            past_key_values = out.past_key_values

        last_token_logits = logits[0][-1]

        if device == "mps":
            # Switch to CPU by avoiding some bugs in mps backend.
            last_token_logits = last_token_logits.float().to("cpu")

        if temperature < 1e-4:
            token = int(torch.argmax(last_token_logits))
        else:
            probs = torch.softmax(last_token_logits / temperature, dim=-1)
            token = int(torch.multinomial(probs, num_samples=1))

        output_ids.append(token)

        if token == tokenizer.eos_token_id:
            stopped = True
        else:
            stopped = False

        if i % stream_interval == 0 or i == max_new_tokens - 1 or stopped:
            output = tokenizer.decode(output_ids, skip_special_tokens=True)
            pos = output.rfind(stop_str, l_prompt)
            if pos != -1:
                output = output[:pos]
                stopped = True
            yield output

        if stopped:
            break

    del past_key_values
"""
#
system_message = "Given your chain of thought reasoning, provide a rationale for the context in the source."
prompt = f"[INST] <<SYS>>\n{system_message}\n<</SYS>>\n\n{prompt}. [/INST]" # replace the command here with something relevant to your task
#
result = generate_text(prompt)
print(result[0]['generated_text'].strip().split("[/INST]")[-1])

```
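The prompt template used above can be factored into small helpers so it is easier to reuse across requests. This is a minimal sketch that mirrors the `[INST] <<SYS>> ... <</SYS>> ... [/INST]` layout from the example; the function names `build_llama2_prompt` and `extract_answer` are hypothetical and not part of any library.

```python
def build_llama2_prompt(system_message: str, user_message: str) -> str:
    """Wrap a system message and a user message in the Llama-2 chat layout."""
    return f"[INST] <<SYS>>\n{system_message}\n<</SYS>>\n\n{user_message} [/INST]"


def extract_answer(generated_text: str) -> str:
    """Return only the text after the last [/INST] marker.

    Useful when the pipeline is configured with return_full_text=True,
    so the prompt is echoed back in front of the model's answer.
    """
    return generated_text.split("[/INST]")[-1].strip()


system_message = (
    "Given your chain of thought reasoning, "
    "provide a rationale for the context in the source."
)
example_prompt = build_llama2_prompt(system_message, "Why is the sky blue?")
print(example_prompt)
```

With the pipeline from the example, usage would look like `extract_answer(generate_text(example_prompt)[0]['generated_text'])`.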

## Training Details

### Training Data

effi-13b has been fine-tuned on the CoT Collection (https://huggingface.co/datasets/kaist-ai/CoT-Collection).
The data was tokenized with the **meta-llama/Llama-2-13b-chat-hf** tokenizer.


### Training Procedure 

The model was fine-tuned using PEFT and QLoRA (https://huggingface.co/blog/4bit-transformers-bitsandbytes).
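For readers unfamiliar with this setup, the sketch below shows how a QLoRA configuration is typically expressed with the `peft` and `transformers` libraries, using the values listed under Training Hyperparameters. It is an illustration of the general pattern rather than the authors' actual training script, which is not included in this card; the variable names are assumptions.

```python
import torch
from peft import LoraConfig
from transformers import BitsAndBytesConfig

# 4-bit quantization settings for the frozen base model (values from this card)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Low-rank adapter settings (values from this card)
lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

# Typical next steps (not executed here, since they download the base model):
#   model = AutoModelForCausalLM.from_pretrained(
#       "meta-llama/Llama-2-13b-chat-hf",
#       quantization_config=bnb_config,
#       device_map={"": 0},
#   )
#   model = peft.get_peft_model(model, lora_config)
```

Only the small adapter matrices are trained in this scheme; the quantized base weights stay frozen.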


#### Training Hyperparameters

- **Training regime:** the settings below are grouped into LoRA adapter, 4-bit quantization, and training and loading arguments.
- **LoRA configuration:**
  - `lora_alpha=32`
  - `lora_dropout=0.05`
  - `r=8`
  - `bias="none"`
  - `task_type="CAUSAL_LM"`
- **4-bit quantization (bitsandbytes):**
  - `load_in_4bit=True`
  - `bnb_4bit_quant_type="nf4"`
  - `bnb_4bit_use_double_quant=True`
  - `bnb_4bit_compute_dtype=torch.bfloat16`
- **Training and loading arguments:**
  - `num_train_epochs=1`
  - `fp16=False`
  - `bf16=False`
  - `per_device_train_batch_size=1`
  - `per_device_eval_batch_size=1`
  - `gradient_accumulation_steps=4`
  - `gradient_checkpointing=True`
  - `max_grad_norm=0.3`
  - `learning_rate=2e-4`
  - `weight_decay=0.001`
  - `optim="paged_adamw_32bit"`
  - `lr_scheduler_type="constant"`
  - `max_steps=500`
  - `warmup_ratio=0.03`
  - `group_by_length=True`
  - `save_steps=25`
  - `logging_steps=5`
  - `max_seq_length=2048`
  - `packing=False`
  - `device_map={"": 0}`
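To make the scale of this run easier to read off the hyperparameters, the short calculation below derives the effective batch size and an upper bound on the data seen. It assumes a single device (suggested by `device_map = {"": 0}`, but not stated explicitly) and relies on the Hugging Face `Trainer` behaviour that a positive `max_steps` takes precedence over `num_train_epochs`.

```python
# Values copied from the hyperparameter list above
per_device_train_batch_size = 1
gradient_accumulation_steps = 4
max_steps = 500
max_seq_length = 2048
save_steps = 25
logging_steps = 5

num_devices = 1  # assumption: single-GPU training

# Sequences contributing to each optimizer update
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)

# Upper bounds over the whole run (sequences may be shorter than max_seq_length)
sequences_seen = effective_batch_size * max_steps
max_tokens_seen = sequences_seen * max_seq_length

# Number of checkpoints and log entries produced during the run
num_checkpoints = max_steps // save_steps
num_log_entries = max_steps // logging_steps

print(f"effective batch size: {effective_batch_size}")
print(f"sequences seen:       {sequences_seen}")
print(f"max tokens seen:      {max_tokens_seen}")
print(f"checkpoints saved:    {num_checkpoints}")
print(f"log entries:          {num_log_entries}")
```

Under these assumptions, the run performs 500 optimizer updates over at most 2,000 training sequences.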

## Evaluation

Paper coming soon.

See the Open LLM Leaderboard (https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard) for early results.

## Citation

@article{effi-13b,
  title={{effi-13b}: an open large language model with state-of-the-art performance},
  author={aiplanet},
  year={2023}
}

## Model Card Contact

community@aiplanet.com