ARahul2003 committed on
Commit 0185e3a
1 Parent(s): 4656afb

Update README.md


Update the description and add a proper code sample.

Files changed (1)
  1. README.md +84 -6
README.md CHANGED
@@ -16,8 +16,22 @@ pipeline_tag: conversational

  # TRL Model

- This is a [TRL language model](https://github.com/huggingface/trl) that has been fine-tuned with reinforcement learning to
- guide the model outputs according to a value, function, or human feedback. The model can be used for text generation.

  ## Usage

@@ -30,10 +44,28 @@ python -m pip install trl
  You can then generate text as follows:

  ```python
- from transformers import pipeline
- generator = pipeline("text-generation", model="ARahul2003/lamini_flan_t5_detoxify_rlaif")
- outputs = generator("Hello, my llama is cute")
  ```

  If you want to use the model for training or to obtain the outputs from the value head, load the model as follows:
@@ -47,4 +79,50 @@ model = AutoModelForCausalLMWithValueHead.from_pretrained("ARahul2003/lamini_fla
  inputs = tokenizer("Hello, my llama is cute", return_tensors="pt")
  outputs = model(**inputs, labels=inputs["input_ids"])
- ```

  # TRL Model

+ This is a [TRL language model](https://github.com/huggingface/trl). It has been fine-tuned with reinforcement learning to
+ guide the model outputs according to a value, function, or human feedback. The model can be used for text generation.
+ This project aims to reduce the toxicity of the outputs generated by the LaMini Flan T5 248M language model using
+ Reinforcement Learning from AI Feedback (RLAIF). Reinforcement Learning from Human Feedback (RLHF) is a method for aligning a model
+ with human preferences: it builds a reward model from human feedback and fine-tunes the model with Proximal Policy Optimization (PPO).
+ RLAIF replaces the human feedback with feedback from a high-performing AI model. This model was fine-tuned on the
+ [Social Reasoning Dataset](https://huggingface.co/datasets/ProlificAI/social-reasoning-rlhf/viewer/default/train?p=38&row=3816) by
+ ProlificAI for 191 steps (1 epoch) using PPO, with the [RoBERTa hate speech detection](https://huggingface.co/facebook/roberta-hate-speech-dynabench-r4-target)
+ model serving as the reward model.
+
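+ As a rough illustration of this procedure (a sketch, not the exact training script behind this checkpoint; it assumes TRL's classic
+ `PPOTrainer` API, i.e. trl < 0.12, and the prompts, hyperparameters, and data preparation below are made up for the example), a PPO
+ detoxification loop looks roughly like this, with the RoBERTa hate-speech classifier providing the reward:
+
+ ```python
+ import torch
+ from datasets import Dataset
+ from transformers import AutoTokenizer, pipeline
+ from trl import AutoModelForSeq2SeqLMWithValueHead, PPOConfig, PPOTrainer
+
+ base_checkpoint = "MBZUAI/LaMini-Flan-T5-248M"
+ config = PPOConfig(model_name=base_checkpoint, learning_rate=1.41e-5, batch_size=4, mini_batch_size=2)
+
+ tokenizer = AutoTokenizer.from_pretrained(base_checkpoint)
+ model = AutoModelForSeq2SeqLMWithValueHead.from_pretrained(base_checkpoint)
+ ref_model = AutoModelForSeq2SeqLMWithValueHead.from_pretrained(base_checkpoint)  # frozen reference for the KL penalty
+
+ # Toy prompts standing in for the Social Reasoning dataset
+ prompts = [
+     "How should I treat my neighbours?",
+     "What do you think of people who disagree with you?",
+     "Describe your least favourite colleague.",
+     "How do you react when someone cuts in line?",
+ ]
+ dataset = Dataset.from_dict({"query": prompts})
+ dataset = dataset.map(lambda sample: {"input_ids": tokenizer(sample["query"]).input_ids})
+ dataset.set_format(type="torch")
+ collator = lambda data: {key: [d[key] for d in data] for key in data[0]}
+
+ ppo_trainer = PPOTrainer(config, model, ref_model, tokenizer, dataset=dataset, data_collator=collator)
+
+ # Reward = probability of the "nothate" label from the RoBERTa hate-speech classifier
+ reward_pipe = pipeline("text-classification", model="facebook/roberta-hate-speech-dynabench-r4-target", top_k=None)
+
+ for batch in ppo_trainer.dataloader:
+     query_tensors = batch["input_ids"]
+     response_tensors = ppo_trainer.generate(query_tensors, max_new_tokens=64, do_sample=True)
+     responses = tokenizer.batch_decode(response_tensors, skip_special_tokens=True)
+
+     rewards = [torch.tensor(next(s["score"] for s in reward_pipe(text) if s["label"] == "nothate"))
+                for text in responses]
+
+     stats = ppo_trainer.step(query_tensors, response_tensors, rewards)  # one PPO optimization step
+
+ model.save_pretrained("lamini-flan-t5-detoxified")  # hypothetical output directory
+ ```
+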
+ The strength of this model lies in its size: it is barely 500 MB yet performs well for its scale. Its intended uses are conversation, text generation, and context-based Q&A.
+ It may not perform well on tasks such as mathematics, science, or coding, and it might hallucinate on them. After quantization (see the sketch below), this model can easily run on edge devices like smartphones and microprocessors.
+
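+ As a minimal sketch of that quantization idea (illustrative only; it assumes the checkpoint's language-model weights load into the plain
+ `AutoModelForSeq2SeqLM` class, with the value-head weights ignored), the Linear layers can be quantized to int8 with PyTorch dynamic
+ quantization for CPU inference:
+
+ ```python
+ import torch
+ from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
+
+ checkpoint = "ARahul2003/lamini_flan_t5_detoxify_rlaif"
+ tokenizer = AutoTokenizer.from_pretrained(checkpoint)
+ model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)  # value head is not needed for inference
+
+ # Dynamic int8 quantization of the Linear layers (CPU-only inference)
+ quantized = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)
+
+ inputs = tokenizer("Hello! How are you?", return_tensors="pt")
+ outputs = quantized.generate(**inputs, max_new_tokens=64)
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+ ```
+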
+ The training log of the model can be found on this [Weights & Biases](https://wandb.ai/tnarahul/trl/runs/nk30wukt/overview?workspace=user-tnarahul) page.
+
+ Note: This model is a fine-tuned version of [LaMini Flan T5 248M](https://huggingface.co/MBZUAI/LaMini-Flan-T5-248M), which in turn is a fine-tuned version of Google's Flan T5. Flan T5 follows the encoder-decoder architecture, unlike GPT-like models, which are decoder-only.

  ## Usage

  You can then generate text as follows:

  ```python
+ from trl import AutoModelForSeq2SeqLMWithValueHead
+ from transformers import pipeline, AutoTokenizer
+ import torch
+
+ checkpoint = "ARahul2003/lamini_flan_t5_detoxify_rlaif"
+
+ # Load the tokenizer and the fine-tuned model (with its value head)
+ tokenizer = AutoTokenizer.from_pretrained(checkpoint)
+ base_model = AutoModelForSeq2SeqLMWithValueHead.from_pretrained(
+     checkpoint,
+     device_map="cpu",  # or "auto" / "cuda:0"
+     torch_dtype=torch.float32,
+ )
+
+ # Wrap the model and tokenizer in a text2text-generation pipeline
+ pipe = pipeline(
+     "text2text-generation",
+     model=base_model,
+     tokenizer=tokenizer,
+     max_length=512,
+     do_sample=True,
+     temperature=0.3,
+     top_p=0.95,
+ )
+
+ prompt = "Hello! How are you?"
+ print(pipe(prompt)[0]["generated_text"])
  ```

  If you want to use the model for training or to obtain the outputs from the value head, load the model as follows:

  inputs = tokenizer("Hello, my llama is cute", return_tensors="pt")
  outputs = model(**inputs, labels=inputs["input_ids"])
+ ```
+
+ If you want to use the model for inference in a Gradio app, consider the following code:
+
+ ```python
+ from trl import AutoModelForSeq2SeqLMWithValueHead
+ from transformers import pipeline, AutoTokenizer
+ import torch
+ import gradio as gr
+
+ title = "LaMini Flan T5 248M"
+ checkpoint = "ARahul2003/lamini_flan_t5_detoxify_rlaif"
+
+ tokenizer = AutoTokenizer.from_pretrained(checkpoint)
+ base_model = AutoModelForSeq2SeqLMWithValueHead.from_pretrained(
+     checkpoint,
+     device_map="cpu",  # or "auto"
+     torch_dtype=torch.float32,
+ )
+ pipe = pipeline(
+     "text2text-generation",
+     model=base_model,
+     tokenizer=tokenizer,
+     max_length=512,
+     do_sample=True,
+     temperature=0.3,
+     top_p=0.95,
+ )
+
+ def chat_with_model(inp_chat, chat_history=None):
+     # gr.ChatInterface passes (message, history); the history is not used here
+     prompt = f"{inp_chat}"  # or, e.g., f"User: {inp_chat} Bot:"
+     responses = pipe(prompt)
+     return responses[0]["generated_text"]
+
+ examples = [
+     "Hi!",
+     "How are you?",
+     'Please let me know your thoughts on the given place and why you think it deserves to be visited: \n"Barcelona, Spain"',
+ ]
+
+ gr.ChatInterface(
+     fn=chat_with_model,
+     title=title,
+     examples=examples,
+ ).launch()
+ ```
+
+ Make sure to keep the model and all input tensors on the same device (CPU or GPU).
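+
+ For example, a minimal sketch of that pattern (the generation settings here are illustrative, not prescribed by the model card):
+
+ ```python
+ from trl import AutoModelForSeq2SeqLMWithValueHead
+ from transformers import AutoTokenizer
+ import torch
+
+ device = "cuda:0" if torch.cuda.is_available() else "cpu"
+
+ checkpoint = "ARahul2003/lamini_flan_t5_detoxify_rlaif"
+ tokenizer = AutoTokenizer.from_pretrained(checkpoint)
+ model = AutoModelForSeq2SeqLMWithValueHead.from_pretrained(checkpoint).to(device)
+
+ # Move the input tensors to the model's device before generating
+ inputs = tokenizer("Hello! How are you?", return_tensors="pt").to(device)
+ outputs = model.generate(**inputs, max_new_tokens=64)
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+ ```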