aashay96 committed
Commit 55d0815
1 Parent(s): fc89b90

Added readme

Files changed (1)
  1. README.md +49 -22
README.md CHANGED
@@ -1,22 +1,49 @@
- ---
- license: bigscience-openrail-m
- datasets:
- - aashay96/indic_language_corpus
- language:
- - hi
- - ta
- - te
- - gu
- - pa
- - or
- - as
- - kn
- - mr
- library_name: transformers
- pipeline_tag: text-generation
- tags:
- - indic
- - text-generation-inference
- - peft
- - Bloom
- ---
+ # Indic Language Bloom Model Training
+
+ This repository contains the code and resources for fine-tuning the Hugging Face Bloom model on the Indic language corpus described below, using Low-Rank Adaptation (LoRA). The goal is a high-performance language model tailored specifically to Indic languages.
+
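+ As a rough illustration of the approach, the sketch below wires a Bloom checkpoint into PEFT's LoRA wrapper. The base checkpoint (`bigscience/bloom-560m`) and the hyperparameters are illustrative assumptions, not necessarily the configuration behind the released adapter.
+
+ ```python
+ from peft import LoraConfig, TaskType, get_peft_model
+ from transformers import AutoModelForCausalLM
+
+ # Base checkpoint and LoRA hyperparameters below are illustrative assumptions.
+ base_model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m")
+
+ lora_config = LoraConfig(
+     task_type=TaskType.CAUSAL_LM,
+     r=8,                                 # rank of the low-rank update matrices
+     lora_alpha=16,                       # scaling applied to the LoRA update
+     lora_dropout=0.05,
+     target_modules=["query_key_value"],  # Bloom's fused attention projection
+ )
+
+ model = get_peft_model(base_model, lora_config)
+ model.print_trainable_parameters()       # only the small LoRA matrices are trainable
+ ```
+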
+ ## Dataset
+
+ The training data is the Indic language corpus released by AI4Bharat, which I have uploaded to the Hugging Face Hub (a minimal streaming-loading sketch follows the link):
+
+ - [Processed Indic Language Corpus](https://huggingface.co/datasets/aashay96/indic_language_corpus/tree/main)
+
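+ Because the corpus is large, it can be consumed as a stream instead of being downloaded up front. A minimal sketch, assuming the corpus exposes a `train` split:
+
+ ```python
+ from datasets import load_dataset
+
+ # Stream the corpus so it never has to fit on disk or in memory at once.
+ # The split name and record layout are assumptions about the uploaded corpus.
+ stream = load_dataset("aashay96/indic_language_corpus", split="train", streaming=True)
+
+ for example in stream.take(3):  # peek at the first few records
+     print(example)
+ ```
+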
+ ## Progress
+
+ ### Completed
+
+ - [x] Low-Rank Adaptation fine-tuning of the Bloom model on streaming data (a sketch of the training loop follows this list)
+ - [x] A first checkpoint is available (training logs on [Weights & Biases](https://wandb.ai/indic-lm/huggingface/runs/7kq2m62v/))
+
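+ For reference, here is a compressed sketch of what the streaming LoRA fine-tuning loop can look like. It reuses the `model` and `stream` objects from the sketches above; the `"text"` column name, sequence length, batch size, and learning rate are placeholders, not the exact recipe behind the released checkpoint.
+
+ ```python
+ import torch
+ from torch.utils.data import DataLoader
+ from transformers import AutoTokenizer, DataCollatorForLanguageModeling
+
+ # `model` (PEFT-wrapped Bloom) and `stream` (streaming corpus) come from the sketches above.
+ tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")
+
+ tokenized = stream.map(
+     lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
+     batched=True,
+     remove_columns=["text"],  # assumes each record is {"text": "..."}
+ )
+
+ collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)  # causal-LM labels
+ loader = DataLoader(tokenized, batch_size=4, collate_fn=collator)
+ optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
+
+ model.train()
+ for step, batch in enumerate(loader):
+     batch = {k: v.to(model.device) for k, v in batch.items()}
+     loss = model(**batch).loss
+     loss.backward()
+     optimizer.step()
+     optimizer.zero_grad()
+     if step >= 100:  # tiny smoke test; a real run streams much further
+         break
+ ```
+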
+ ### To Do
+
+ - [ ] Benchmark current multilingual LLMs on IndicGLUE using [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness)
+ - [ ] Integrate DeepSpeed for better resource utilization
+ - [ ] Convert an existing instruction dataset (e.g. Dolly v2 and other GPT-distilled datasets) to Indic languages and train on it
+ - [ ] The model does not stop generating text on its own; investigate how to fix this
+ - [ ] Deploy an RLHF community app using [Cheese](https://github.com/CarperAI/cheese)
+
+ ## Using the Model
+
+ ```python
+ import torch
+ from peft import PeftModel, PeftConfig
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ peft_model_id = "aashay96/indic-BloomLM"
+ config = PeftConfig.from_pretrained(peft_model_id)
+
+ # Load the base Bloom model in 8-bit across available devices, plus its tokenizer
+ model = AutoModelForCausalLM.from_pretrained(
+     config.base_model_name_or_path, return_dict=True, load_in_8bit=True, device_map='auto'
+ )
+ tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)
+
+ # Load the LoRA adapter weights on top of the base model
+ model = PeftModel.from_pretrained(model, peft_model_id)
+
+ # Generate a continuation for a Hindi prompt ("How are you?")
+ batch = tokenizer("आप कैसे हैं", return_tensors='pt')
+
+ with torch.cuda.amp.autocast():
+     output_tokens = model.generate(**batch, max_new_tokens=10)
+
+ print('\n\n', tokenizer.decode(output_tokens[0], skip_special_tokens=True))
+ ```
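+
+ Note that `load_in_8bit=True` relies on the `bitsandbytes` library and a CUDA-capable GPU; on a CPU-only machine, drop that flag (and the `torch.cuda.amp.autocast()` context) and load the model in full precision instead.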