PEFT
Safetensors
English
mistral
Generated from Trainer
Commit 3055bce (1 parent: fd69bd1), committed by nopperl

update model

Files changed (3)
  1. README.md +65 -11
  2. adapter_model.safetensors +1 -1
  3. ggml-adapter-model.bin +1 -1
README.md CHANGED
@@ -71,10 +71,10 @@ wandb_log_model:
 
  gradient_accumulation_steps: 8
  micro_batch_size: 1
- num_epochs: 2
+ num_epochs: 4
  optimizer: adamw_bnb_8bit
  lr_scheduler: cosine
- learning_rate: 0.000005
+ learning_rate: 0.00002
 
  train_on_inputs: false
  group_by_length: false
@@ -118,25 +118,79 @@ This is a LoRA for the [mistralai/Mistral-7B-Instruct-v0.2](https://huggingface.
 
  ## Model description
 
- Given text extracted from pages of a sustainability report, this model extracts the scope 1, 2 and 3 emissions in JSON format. The JSON object also contains the pages containing this information. For example, the [2022 sustainability report by the Bristol-Myers Squibb Company](https://www.bms.com/assets/bms/us/en-us/pdf/bmy-2022-esg-report.pdf) leads to the following output: `{"scope_1":202290,"scope_2":161907,"scope_3":1696100,"sources":[88,89]}`. For more information, refer to the [GitHub repo](https://github.com/nopperl/corporate_emission_reports).
+ Given text extracted from pages of a sustainability report, this model extracts the scope 1, 2 and 3 emissions in JSON format. The JSON object also contains the pages containing this information. For example, the [2022 sustainability report by the Bristol-Myers Squibb Company](https://www.bms.com/assets/bms/us/en-us/pdf/bmy-2022-esg-report.pdf) leads to the following output: `{"scope_1":202290,"scope_2":161907,"scope_3":1696100,"sources":[88,89]}`.
+
+ Reaches an emission value extraction accuracy of 65\% (up from the base model's 46\%) and a source citation accuracy of 69\% (base model: 52\%) on the [corporate-emission-reports](https://huggingface.co/datasets/nopperl/corporate-emission-reports) dataset. For more information, refer to the [GitHub repo](https://github.com/nopperl/corporate_emission_reports).
 
  ## Intended uses & limitations
 
- The model is intended to be used together with the [mistralai/Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2) model using the `inference.py` script from the [GitHub repo](https://github.com/nopperl/corporate_emission_reports). The script ensures that the prompt string and token ids exactly match the ones used for training.
+ The model is intended to be used together with the [mistralai/Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2) model using the `inference.py` script from the [accompanying Python package](https://github.com/nopperl/corporate_emission_reports). The script ensures that the prompt string and token ids exactly match the ones used for training.
+
+ ### Example usage
+
+ #### CLI
 
- Example usage:
+ Using [transformers](https://github.com/huggingface/transformers) as the inference engine:
 
- python inference.py --model mistral --lora emissions-extraction-lora/ggml-adapter-model.bin https://www.bms.com/assets/bms/us/en-us/pdf/bmy-2022-esg-report.pdf
+     python -m corporate_emission_reports.inference --model_path mistralai/Mistral-7B-Instruct-v0.2 --lora nopperl/emissions-extraction-lora --model_context_size 32768 --engine hf https://www.bms.com/assets/bms/us/en-us/pdf/bmy-2022-esg-report.pdf
 
  Compare to base model without LoRA:
 
- python inference.py --model mistral https://www.bms.com/assets/bms/us/en-us/pdf/bmy-2022-esg-report.pdf
+     python -m corporate_emission_reports.inference --model_path mistralai/Mistral-7B-Instruct-v0.2 --model_context_size 32768 --engine hf https://www.bms.com/assets/bms/us/en-us/pdf/bmy-2022-esg-report.pdf
 
- ## Training and evaluation data
+ Alternatively, it is possible to use [llama.cpp](https://github.com/ggerganov/llama.cpp) as the inference engine. In this case, follow the installation instructions in the [package readme](https://github.com/nopperl/corporate_emission_reports/blob/main/README.md). In particular, the model needs to be downloaded beforehand. Then:
+
+     python -m corporate_emission_reports.inference --model mistral --lora ./emissions-extraction-lora/ggml-adapter-model.bin https://www.bms.com/assets/bms/us/en-us/pdf/bmy-2022-esg-report.pdf
+
+ Compare to base model without LoRA:
+
+     python -m corporate_emission_reports.inference --model mistral https://www.bms.com/assets/bms/us/en-us/pdf/bmy-2022-esg-report.pdf
+
+ #### Programmatically
+
+ The package also provides a function for inference from Python code:
 
- Finetuned on the [sustainability-report-emissions-instruction-style](https://huggingface.co/datasets/nopperl/sustainability-report-emissions-instruction-style) dataset.
+     from corporate_emission_reports.inference import extract_emissions
+     document_path = "https://www.bms.com/assets/bms/us/en-us/pdf/bmy-2022-esg-report.pdf"
+     model_kwargs = {}  # optional arguments which are passed to the HF model
+     emissions = extract_emissions(document_path, "mistralai/Mistral-7B-Instruct-v0.2", lora="nopperl/emissions-extraction-lora", engine="hf", **model_kwargs)
+
+ It is also possible to use the adapter directly with [transformers](https://github.com/huggingface/transformers):
+
+ ```
+ from corporate_emission_reports.inference import construct_prompt
+ from peft import AutoPeftModelForCausalLM
+ from transformers import AutoTokenizer
+ document_path = "https://www.bms.com/assets/bms/us/en-us/pdf/bmy-2022-esg-report.pdf"
+ lora_path = "nopperl/emissions-extraction-lora"
+ tokenizer = AutoTokenizer.from_pretrained(lora_path)
+ prompt_text = construct_prompt(document_path, tokenizer)
+ model = AutoPeftModelForCausalLM.from_pretrained(lora_path)
+ prompt_tokenized = tokenizer.encode(prompt_text, return_tensors="pt").to(model.device)
+ outputs = model.generate(prompt_tokenized, max_new_tokens=120)
+ output = outputs[0][prompt_tokenized.shape[1]:]
+ ```
+
+ Additionally, it is possible to enforce valid JSON output and convert it into a Pydantic object using [lm-format-enforcer](https://github.com/noamgat/lm-format-enforcer):
+
+ ```
+ from corporate_emission_reports.pydantic_types import Emissions
+ from lmformatenforcer import JsonSchemaParser
+ from lmformatenforcer.integrations.transformers import build_transformers_prefix_allowed_tokens_fn
+ ...
+ parser = JsonSchemaParser(Emissions.model_json_schema())
+ prefix_function = build_transformers_prefix_allowed_tokens_fn(tokenizer, parser)
+ outputs = model.generate(prompt_tokenized, max_new_tokens=120, prefix_allowed_tokens_fn=prefix_function)
+ output = outputs[0][prompt_tokenized.shape[1]:]
+ if tokenizer.eos_token:
+     output = output[:-1]
+ output = tokenizer.decode(output)
+ emissions = Emissions.model_validate_json(output, strict=True)
+ ```
+
+ ## Training and evaluation data
 
- Reaches an emission value extraction accuracy of 57\% (up from 46\% of the base model) and a source citation accuracy of 68\% (base model: 52\%) on the [corporate-emission-reports](https://huggingface.co/datasets/nopperl/corporate-emission-reports) dataset.
+ Finetuned on the [sustainability-report-emissions-instruction-style](https://huggingface.co/datasets/nopperl/sustainability-report-emissions-instruction-style) dataset and evaluated on the [corporate-emission-reports](https://huggingface.co/datasets/nopperl/corporate-emission-reports) dataset.
 
  ## Training procedure
 
@@ -169,4 +223,4 @@ The following hyperparameters were used during training:
  - Transformers 4.37.1
  - Pytorch 2.0.1
  - Datasets 2.16.1
- - Tokenizers 0.15.0
+ - Tokenizers 0.15.0
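
For reference, the JSON output documented in the updated model description can be parsed with a small Pydantic model. The sketch below is illustrative only: `EmissionsExample` is a hypothetical stand-in whose fields are taken from the example output above, while the package's canonical schema is the `Emissions` type in `corporate_emission_reports.pydantic_types`.

```
from pydantic import BaseModel

# Hypothetical mirror of the example output shown in the model description;
# the canonical schema is corporate_emission_reports.pydantic_types.Emissions.
class EmissionsExample(BaseModel):
    scope_1: float
    scope_2: float
    scope_3: float
    sources: list[int]

# Example output documented above for the 2022 Bristol-Myers Squibb report.
example_output = '{"scope_1":202290,"scope_2":161907,"scope_3":1696100,"sources":[88,89]}'
emissions = EmissionsExample.model_validate_json(example_output)
print(emissions.scope_3)   # 1696100.0
print(emissions.sources)   # [88, 89]
```

The real schema may differ (for example, optional fields or different numeric types), so treat this only as a reading aid for the documented output format.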
adapter_model.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:001b46af325d9d3b08730c26b86389303e95e3a0690ffa851548afcf21d18cc4
+ oid sha256:c9cf7ae7c20b80e1a17041b5e0f8b12788db0bc46943fa01b4ebeb96f8059615
  size 167832688
ggml-adapter-model.bin CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:9e159542c3de0db7c7b35e23cd948ee97f7225609af8def1ec224c746ae7f28f
+ oid sha256:d915ae9cd2bd2f1909eea73bdba5a7b14ac5b423e5f32d5ab45f82c4ffbfccf8
  size 335572992