File size: 5,879 Bytes
65bfe7b
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
0c7f9a2
 
 
 
 
 
 
65bfe7b
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
---
license: apache-2.0
datasets:
- wikimedia/wikipedia
- oscar
language:
- it
pipeline_tag: text-generation
---
<img src="https://hoodie-creator.s3.eu-west-1.amazonaws.com/15be78c6-original.png" alt="zefiro" border="0" width="400px">



# Model Card for zefiro-base-7b-ITA
*Last Update: 20/02/2024*<br>


<!-- Provide a quick summary of what the model is/does. -->

Zefiro base is a continual pretrained model for the Italian language based on [Mistral-7b](https://huggingface.co/mistralai/Mistral-7B-v0.1) trained on
a subset of the Italian subdataset of Oscar and wikipedia dataset. 


## Model Details

Zefiro base is a continual pre-trained language model started from [mistralai/Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1) model to the italian language. 


## Model description

- **Model type:** A 7B parameter GPT-like continual pre-trained model from [mistralai/Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1).
- **Language(s) (NLP):** Primarily Italian
- **License:** Apache 2
- **Finetuned from model:** [mistralai/Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1)
- **Developed by:** [giux78](https://alessandroercolani.webflow.io/)
- **Funded by:** [Business Operating System](https://www.businessos.xyz)

## Code
Lost 

## Computation
It has been trained on a GPUs cluster of 4 H100s from [runpod](https://runpod.io/) 


## Evaluations:

| Model | Arc-c  | HellaS | MMUL | AVG |
| --- | --- | --- | --- | --- |
| Mixtral 7x8 | 52.8 | 75.1 | 70.9 | 66.26666667 |
| LLama2 70b | 49.4 | 70.9 | 65.1 | 61.8 |
| zefiro-dpo-7b | 52.69 | 67.09 | 50.8 | 56.86 |
| **zefiro-base-7b** | **51.07** | **63.47** | **52.97** | **55.83666667** |
| zefiro-sft-7b | 50.98 | 62.71 | 51.96 | 55.21666667 |
| LLama1 34B | 42.9 | 65.4 | 49.0 | 52.43333333 |


## Intended uses & limitations

Here's how you can run the model using Transformers from 🤗 :

```python
# Install transformers from source - only needed for versions <= v4.34
# pip install git+https://github.com/huggingface/transformers.git
# pip install accelerate
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mii-community/zefiro-7b-base-ITA"
model = AutoModelForCausalLM.from_pretrained(model_id)
model.to('cuda')
tokenizer = AutoTokenizer.from_pretrained(model_id, padding_side="left")


sys_prompt = "Sei un assistente disponibile, rispettoso e onesto. " \
         "Rispondi sempre nel modo piu' utile possibile, pur essendo sicuro. " \
         "Le risposte non devono includere contenuti dannosi, non etici, razzisti, sessisti, tossici, pericolosi o illegali. " \
         "Assicurati che le tue risposte siano socialmente imparziali e positive. " \
         "Se una domanda non ha senso o non e' coerente con i fatti, spiegane il motivo invece di rispondere in modo non corretto. " \
         "Se non conosci la risposta a una domanda, non condividere informazioni false."

messages = [{ 'content' : sys_prompt, 'role' : 'assistant'}, 
            {'content' : 'Crea una lista su cosa mangiare a pranzo ogni giorno della settimana a pranzo e cena', 'role' : 'user'}]


def generate_text(sys_prompt, user_prompt):
    messages = [{ 'content' : sys_prompt, 'role' : 'assistant'}, 
            {'content' : user_prompt, 'role' : 'user'}]
    prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    model_inputs = tokenizer([prompt], return_tensors="pt").to("cuda")
    generated_ids = model.generate(**model_inputs, max_new_tokens=1024)
    return tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]


generate_text(sys_prompt, 'cosa ne pensi della politica italiana?')
```

## Bias, Risks, and Limitations

<!-- This section is meant to convey both technical and sociotechnical limitations. -->

Zefiro-7b-base-ITA has not been aligned to human preferences for safety within the RLHF phase or deployed with in-the-loop filtering of responses like ChatGPT, so the model can produce problematic outputs (especially when prompted to do so). 
It is also unknown what the size and composition of the corpus was used to train the base model (`mistralai/Mistral-7B-v0.1`), however it is likely to have included a mix of Web data and technical sources like books and code. See the [Falcon 180B model card](https://huggingface.co/tiiuae/falcon-180B#training-data) for an example of this.



### Training Data
<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
We used a subset of the italian version of [Oscar](https://huggingface.co/datasets/oscar) and [Wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia) a as training data.

#### Summary
Zefiro-7b-beta-ITA-v0.1 is a continula pre-trained version of mistral-7b for the italian language. 

## Citation

```
@misc{tunstall2023zephyr,
      title={Zephyr: Direct Distillation of LM Alignment}, 
      author={Lewis Tunstall and Edward Beeching and Nathan Lambert and Nazneen Rajani and Kashif Rasul and Younes Belkada and Shengyi Huang and Leandro von Werra and Clémentine Fourrier and Nathan Habib and Nathan Sarrazin and Omar Sanseviero and Alexander M. Rush and Thomas Wolf},
      year={2023},
      eprint={2310.16944},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}

@misc{basile2023llamantino,
      title={LLaMAntino: LLaMA 2 Models for Effective Text Generation in Italian Language}, 
      author={Pierpaolo Basile and Elio Musacchio and Marco Polignano and Lucia Siciliani and Giuseppe Fiameni and Giovanni Semeraro},
      year={2023},
      eprint={2312.09993},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

```


## Model Card Authors 

[giux78](https://huggingface.co/giux78)

## Model Card Contact

**ale.ercolani@gmail.com