huseinzol05 committed
Commit 98adb7e
1 Parent(s): 7a3316a

Create README.md

Files changed (1):
  1. README.md +51 -0
README.md ADDED
---
language:
- ms
---

# Pretrain 5B 4096 context length Mistral on Malaysian text

README at https://github.com/mesolitica/malaya/tree/5.1/pretrained-model/mistral

- Dataset gathered at https://github.com/malaysia-ai/dedup-text-dataset/tree/main/pretrain-llm
- We trained on a Ray cluster of 5 nodes, each with 8x A100 80GB GPUs, https://github.com/malaysia-ai/jupyter-gpu/tree/main/ray (a minimal sketch of the launch pattern follows this list)
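
The actual multi-node launch scripts live in the jupyter-gpu repo linked above. Purely as an illustrative sketch of the pattern, not the real entrypoint (the `train_shard` task and worker layout here are hypothetical), a Ray job fanning out over 5 nodes x 8 GPUs could look like:

```python
import ray

# connect to the running Ray cluster; 'auto' assumes a reachable head node
ray.init(address='auto')

# one task per GPU: 5 nodes x 8x A100 = 40 workers (hypothetical layout)
@ray.remote(num_gpus=1)
def train_shard(rank: int):
    # each worker would run its slice of the distributed pretraining loop
    return rank

ray.get([train_shard.remote(rank) for rank in range(40)])
```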

WandB: https://wandb.ai/mesolitica/pretrain-mistral-5b?workspace=user-husein-mesolitica

WandB report: https://wandb.ai/mesolitica/pretrain-mistral-3b/reports/Pretrain-Larger-Malaysian-Mistral--Vmlldzo2MDkyOTgz

## how-to

```python
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch

# 4-bit NF4 quantization so the 5B model fits comfortably on a single GPU
# (requires the bitsandbytes package)
TORCH_DTYPE = 'bfloat16'
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=getattr(torch, TORCH_DTYPE),
)

tokenizer = AutoTokenizer.from_pretrained('mesolitica/malaysian-mistral-5B-4096')
model = AutoModelForCausalLM.from_pretrained(
    'mesolitica/malaysian-mistral-5B-4096',
    use_flash_attention_2=True,  # requires the flash-attn package
    quantization_config=nf4_config,
)

# the BOS token is written into the prompt itself, so skip add_special_tokens
prompt = '<s>nama saya'
inputs = tokenizer([prompt], return_tensors='pt', add_special_tokens=False).to('cuda')

generate_kwargs = dict(
    inputs,
    max_new_tokens=512,
    top_p=0.95,
    top_k=50,
    temperature=0.9,
    do_sample=True,
    num_beams=1,
    repetition_penalty=1.05,
)
r = model.generate(**generate_kwargs)
print(tokenizer.decode(r[0], skip_special_tokens=True))
```
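
To stream tokens to stdout as they are generated instead of waiting for the full sequence, `transformers` provides `TextStreamer`. A minimal sketch, reusing `model`, `tokenizer`, and `generate_kwargs` from the snippet above:

```python
from transformers import TextStreamer

# prints decoded text as tokens arrive; skip_prompt hides the input prompt
streamer = TextStreamer(tokenizer, skip_prompt=True)
r = model.generate(**generate_kwargs, streamer=streamer)
```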