javi8979 committed
Commit 7659612 (verified)
Parent: f487813

Update README.md

Files changed (1):
1. README.md +74 -0
README.md CHANGED

# Salamandra Model Card

## How to use

> [!IMPORTANT]
> This version of Salamandra is tailored exclusively for translation tasks. It lacks chat capabilities and has not been trained with any chat instructions.

The instruction-following models use the commonly adopted ChatML template:

```
<|im_start|>system
{SYSTEM PROMPT}<|im_end|>
<|im_start|>user
{USER PROMPT}<|im_end|>
<|im_start|>assistant
{MODEL RESPONSE}<|im_end|>
<|im_start|>user
[...]
```

The easiest way to apply it is by using the tokenizer's built-in functions, as shown in the following snippet.

```python
from datetime import datetime
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Checkpoint on the Hugging Face Hub; replace with a local path if needed.
model_id = "BSC-LT/salamandraTA-7b-instruct"

source = 'Spanish'
target = 'Catalan'
sentence = "Pensando en ti y en este amor que parte mi universo en dos y que llega del olvido hasta mi propia voz y araña mi pasado sin pedir perdón"

text = f"Translate the following text from {source} into {target}.\n{source}: {sentence} \n{target}:"

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Stop generation at the ChatML end-of-turn token as well as the model's EOS.
stop_sequence = '<|im_end|>'
eos_tokens = [tokenizer.eos_token_id, tokenizer.convert_tokens_to_ids(stop_sequence)]

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16
)

# Wrap the translation instruction in a single user turn and render it with
# the model's chat template, which also consumes the current date.
message = [{"role": "user", "content": text}]
date_string = datetime.today().strftime('%Y-%m-%d')

prompt = tokenizer.apply_chat_template(
    message,
    tokenize=False,
    add_generation_prompt=True,
    date_string=date_string
)

inputs = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")
input_length = inputs.shape[1]
outputs = model.generate(input_ids=inputs.to(model.device),
                         max_new_tokens=400,
                         early_stopping=True,
                         eos_token_id=eos_tokens,
                         pad_token_id=tokenizer.eos_token_id,
                         num_beams=5)

# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0, input_length:], skip_special_tokens=True))
# Pensant en tu i en aquest amor que parteix el meu univers en dos i que arriba des de l'oblit fins a la meva pròpia veu i esgarrapa el meu passat sense demanar perdó
```
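
The loaded model and tokenizer can be reused across requests. The sketch below continues from the snippet above and wraps the same steps in a small helper; the `translate` function and the example sentences are illustrative, not part of the model card:

```python
# Illustrative helper: reuses model, tokenizer and eos_tokens defined above.
def translate(sentence, source, target):
    text = f"Translate the following text from {source} into {target}.\n{source}: {sentence} \n{target}:"
    prompt = tokenizer.apply_chat_template(
        [{"role": "user", "content": text}],
        tokenize=False,
        add_generation_prompt=True,
        date_string=datetime.today().strftime('%Y-%m-%d')
    )
    inputs = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")
    outputs = model.generate(input_ids=inputs.to(model.device),
                             max_new_tokens=400,
                             early_stopping=True,
                             eos_token_id=eos_tokens,
                             pad_token_id=tokenizer.eos_token_id,
                             num_beams=5)
    # Return only the generated continuation, without the prompt tokens.
    return tokenizer.decode(outputs[0, inputs.shape[1]:], skip_special_tokens=True)

for s in ["Hoy hace un día estupendo.", "Me gustaría reservar una mesa para dos."]:
    print(translate(s, 'Spanish', 'Catalan'))
```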

Using this template, each turn is preceded by a `<|im_start|>` delimiter and the role of the entity (either `user`, for content supplied by the user, or `assistant` for LLM responses), and finished with the `<|im_end|>` token.
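
To see this structure directly, you can render a short exchange without tokenizing it and print the result. A minimal sketch, again continuing from the snippet above; the turns are illustrative only:

```python
# Illustrative only: print the rendered ChatML string for a short exchange
# to inspect the <|im_start|>/<|im_end|> delimiters described above.
messages = [
    {"role": "user", "content": "Translate the following text from Spanish into Catalan.\nSpanish: Hola, ¿cómo estás?\nCatalan:"},
    {"role": "assistant", "content": "Hola, com estàs?"},
]
print(tokenizer.apply_chat_template(messages,
                                    tokenize=False,
                                    add_generation_prompt=True,
                                    date_string=datetime.today().strftime('%Y-%m-%d')))
```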

## Data