Alexandre-Numind committed on
Commit
831257c
1 Parent(s): 30f4cd9

Upload 2 files

Files changed (2)
  1. app.py +351 -0
  2. ml.py +34 -0
app.py ADDED
@@ -0,0 +1,351 @@
+ from transformers import AutoModelForCausalLM, AutoTokenizer, StoppingCriteria, StoppingCriteriaList
+ import torch
+ from itertools import cycle
+ import json
+ import gradio as gr
+ from urllib.parse import unquote
+ from ml import create_prompt, generate_answer_short
+
+
+ example1 = ("""We introduce Mistral 7B, a 7–billion-parameter language model engineered for
+ superior performance and efficiency. Mistral 7B outperforms the best open 13B
+ model (Llama 2) across all evaluated benchmarks, and the best released 34B
+ model (Llama 1) in reasoning, mathematics, and code generation. Our model
+ leverages grouped-query attention (GQA) for faster inference, coupled with sliding
+ window attention (SWA) to effectively handle sequences of arbitrary length with a
+ reduced inference cost. We also provide a model fine-tuned to follow instructions,
+ Mistral 7B – Instruct, that surpasses Llama 2 13B – chat model both on human and
+ automated benchmarks. Our models are released under the Apache 2.0 license.
+ Code: https://github.com/mistralai/mistral-src
+ Webpage: https://mistral.ai/news/announcing-mistral-7b/""","""{
+ "Model": {
+ "Name": "",
+ "Number of parameters": "",
+ "Number of token": "",
+ "Architecture": []
+ },
+ "Usage": {
+ "Use case": [],
+ "Licence": ""
+ }
+ }""")
+
+
+ example2 = ("""Identity security company IDfy said on Wednesday it has raised $27 million in a mix of primary and secondary fundraising from Elev8, KB Investment and Tenacity Ventures.
+ Mumbai-based IDfy makes products and solutions for authenticating entities, helping companies prevent fraud and verify other businesses. "Investment from Elev8 and Tenacity is a strong validation of our vision and capabilities. The fund will fuel our expansion plans and product development, enabling us to serve even more businesses and unlock opportunities for trustworthy people and businesses," said Ashok Hariharan, co-founder and chief executive officer of IDfy, in a statement.
+ Click here to follow our WhatsApp channel
+ IDfy (Baldor Technologies) was founded in 2011 by Hariharan and Vineet Jawa. It has products for diligence processes called know your customer and know your business, employee background verification, risk and fraud mitigation, and digital privacy.
+ The company said its artificial intelligence-driven solutions serve more than 1,500 clients in banking, financial services and insurance, e-commerce, gaming and other sectors. It works with companies in India, Southeast Asia and West Asia, having HDFC Bank, Zomato, Paytm, HUL and American Express as clients.
+ Navin Honagudi, managing partner at Elev8 Venture Partners, said: "We are thrilled to partner with IDfy as our first investment. The company's innovative technology, experienced leadership team, and strong market fit position it for remarkable growth. We are confident that IDfy will play a crucial role in shaping the future of risk management in India and beyond."
+ Elev8 Venture Partners is a $200 million growth-stage fund anchored by South Korea's KB Investment. Tenacity Ventures is a growth-stage investment fund with a focus on technology product businesses.
+ """, """{
+ "Funding": {
+ "New funding": "",
+ "Investor": []
+ },
+ "Company": {
+ "Name": "",
+ "Activity": "",
+ "Total valuation": ""
+ }
+ }""")
+
+ example3 = (""""Office of Management and Budget (OMB) memorandum M-12-12, as amended by memorandum
+ M-17-08, requires federal agencies to issue an annual report related to its conference-related expenditures
+ for the previous fiscal year. This document constitutes the SEC’s report for Fiscal Year (FY) 2018.
+ The SEC has put in place policies and procedures governing the approval and use of agency funds for
+ conference expenses, to ensure that such spending is legal, reasonable, and in furtherance of the agency’s
+ mission to protect investors, maintain fair, orderly, and efficient markets, and facilitate capital formation.
+ At a high level, the major steps in this process are as follows:
+ 1. All SEC division/office requests to spend money on hosting a conference must be approved by
+ the division/office head or his/her designee. Divisions and offices are required to use SEC
+ facilities for such events whenever possible, to minimize space rental and equipment costs. In
+ order to limit expenses for meals or refreshments, the SEC uses per diem rates established for
+ the federal government as the ceiling for any such costs, except when higher rates are
+ unavoidable or otherwise justified. The acquisition of any goods, services, or meeting space is
+ subject to the applicable policies and regulations which govern these areas.
+ 2. When a request for funds is necessary and has received approval from the division/office head,
+ it is reviewed by staff in the Office of Financial Management (OFM) to ensure the expenses
+ are permissible under the applicable polices and regulations. OFM has implemented an
+ automated system for the submission, review, and approval of all SEC conference requests
+ that enables OFM to monitor and control conference spending, as well as record actual
+ conference spending after a conference has been held.
+ 3. Each request must receive final approval from designated officials according to the total
+ projected cost. These designations comply with OMB Memorandum 12-12.
+ 4. The SEC is reporting conferences which meet thresholds defined in P.L.115-141 Division E,
+ Title VII, Sections 739 (a), (b), and (c), to the SEC’s Office of Inspector General via separate
+ correspondence.
+ For FY 2018, the SEC authorized 97 conferences (including training conferences) with costs totaling
+ $884,759.
+ 2
+ Conferences over $100,000:
+ In FY 2018, the SEC authorized two conferences costing greater than $100,000, which are described
+ below:
+ A. 2018 Chief Enforcement Conference (CEC), SEC Headquarters, Washington DC, September 25-26,
+ 2018
+ • Cost incurred1
+ : $165,194
+ • Number of attendees: 209 (207 SEC attendees and 2 non-SEC attendees)
+ The Enforcement Division (Enforcement) conducts investigations into potential violations of the
+ federal securities laws, litigates actions, negotiates settlements, and coordinates with the
+ Commission and other SEC divisions and offices regarding the national enforcement program.
+ Because Enforcement staff are located in Washington, DC and 11 regional offices, periodic
+ gatherings of Enforcement leaders help to ensure an efficient, well-coordinated national program.
+ The 2018 Chief Enforcement Conference (CEC) was held at SEC Headquarters in Washington,
+ D.C. on September 25 and 26, 2018. CEC served as a strategic planning and training session for
+ Enforcement’s senior managers and provided an important opportunity for attendees to discuss
+ relevant enforcement topics with the Chairman and participating Commissioners.
+ B. 2018 Leadership Conference, SEC Headquarters, Washington, DC, July 26-27, 2018
+ • Cost incurred1
+ : $219,658
+ • Number of attendees: 261 attendees (261 SEC employees)
+ The Office of Compliance Inspections and Examinations (OCIE) conducts the National
+ Examination Program and focuses on improving compliance with the federal securities laws,
+ preventing fraud, informing policy, and monitoring risk. Because examination program staff are
+ located in Washington, DC and 11 regional offices, periodic gatherings of examination program
+ leaders help to ensure an efficient, well-coordinated national program. On July 26 and 27, 2018,
+ OCIE held its leadership conference at SEC Headquarters in Washington DC, which focused on
+ initiatives to increase OCIE’s capabilities. The conference gathered SEC managers from across
+ the National Examination Program to collaborate on strategic planning and to provide training. It
+ included presentations and discussions on risk assessment tools and procedures, implementation
+ of new requirements, and increasing OCIE’s collaboration with other Commission offices and
+ divisions""","""{
+ "Number of conference": "",
+ "Total cost": "",
+ "Conferences over 100k": [
+ {
+ "Name": "",
+ "Organizer": "",
+ "Cost": "",
+ "Start date": "",
+ "End date": "",
+ "Location": ""
+ }
+ ]
+ }""")
+
+ example4 = ("""
+ Patient: Good evening doctor.
+
+ Doctor: Good evening. You look pale and your voice is out of tune.
+
+ Patient: Yes doctor. I’m running a temperature and have a sore throat.
+
+ Doctor: Lemme see.
+
+ (He touches the forehead to feel the temperature.)
+
+ Doctor: You’ve moderate fever.
+
+ (He then whips out a thermometer.)
+
+ Patient: This thermometer is very different from the one you used the last time. (Unlike the earlier one which was placed below the tongue, this one snapped around one of the fingers.)
+
+ Doctor: Yes, this is a new introduction by medical equipment companies. It’s much more convenient, as it doesn’t require cleaning after every use.
+
+ Patient: That’s awesome.
+
+ Doctor: Yes it is.
+
+ (He removes the thermometer and looks at the reading.)
+
+ Doctor: Not too high – 99.8.
+
+ (He then proceeds with measuring blood pressure.)
+
+ Doctor: Your blood pressure is fine.
+
+ (He then checks the throat.)
+
+ Doctor: It looks bit scruffy. Not good.
+
+ Patient: Yes, it has been quite bad.
+
+ Doctor: Do you get sweating and shivering?
+
+ Patient: Not sweating, but I feel somewhat cold when I sit under a fan.
+
+ Doctor: OK. You’ve few symptoms of malaria. I would suggest you undergo blood test. Nothing to worry about. In most cases, the test come out to be negative. It’s just precautionary, as there have been spurt in malaria cases in the last month or so.
+
+ (He then proceeds to write the prescription.)
+
+ Doctor: I’m prescribing three medicines and a syrup. The number of dots in front of each tells you how many times in the day you’ve to take them. For example, the two dots here mean you’ve to take the medicine twice in the day, once in the morning and once after dinner.
+
+ Doctor: Do you’ve any other questions?
+
+ Patient: No, doctor. Thank you.
+ ""","""{
+ "Doctor_Patient_Discussion": {
+ "Initial_Observation": {
+ "Symptoms": [],
+ "Initial_Assessment": ""
+ },
+ "Medical_Examination": {
+ "Temperature":"",
+ "Blood_Pressure":"",
+ "Doctor_Assessment": "",
+ "Diagnosis": ""
+ },
+ "Treatment_Plan": {
+ "Prescription": []
+ }
+ }
+ }""")
+
+ example5 = ("""HARVARD UNIVERSITY Extension School
+ Master of Liberal Arts, Information Management Systems May 2015
+  Dean’s List Academic Achievement Award recipient
+  Relevant coursework: Trends in Enterprise Information Systems, Principles of Finance, Data mining
+ and Forecast Management, Resource Planning and Allocation Management, Simulation for
+ Managerial Decision Making
+ RUTGERS, THE STATE UNIVERSITY OF NEW JERSEY
+ Bachelor of Arts in Computer Science with Mathematics minor May 2008
+ Professional Experience
+ STATE STREET CORPORATION
+ Principal –Simulated Technology
+ Boston, MA
+ December 2011 – July 2013
+  Led 8 cross functional, geographically dispersed teams to support quality for the reporting system
+  Improved process efficiency 75% by standardizing end to end project management workflow
+  Reduced application testing time 30% by automating shorter testing phases for off cycle projects
+  Conducted industry research on third-party testing tools and prepared recommendations for maximum
+ return on investment
+ FIDELITY INVESTMENTS
+ Associate – Interactive Technology
+ Boston, MA
+ January 2009 – November 2011
+  Initiated automated testing efforts that reduced post production defects by 40%
+  Implemented initiatives to reduce overall project time frames by involving quality team members
+ early in the Software Development Life Cycle iterations
+  Developed a systematic approach to organize and document the requirements of the to-be-system
+  Provided leadership to off-shore tech teams via training and analyzing business requirements
+ L.L. BEAN, INC.
+ IT Consultant
+ Freeport, ME
+ June 2008 – December 2009
+  Collaborated closely with the business teams to streamline production release strategy plans
+  Managed team of five test engineers to develop data driven framework that increased application
+ testing depth and breadth by 150%
+  Generated statistical analysis of quality and requirements traceability matrices to determine the linear
+ relationship of development time frames to defect identification and subsequent resolution
+  Led walkthroughs with project stakeholders to set expectations and milestones for the project team
+ Technical Expertise
+ MS Excel, PowerPoint, Relational Databases, Project Management, Quantitative Analysis, SQL, Java
+ Additional
+ Organized computer and English literacy workshops for underprivileged children in South Asia, 2013
+ Student Scholarship Recipient, National Conference on Race and Ethnicity, 2007-2008""","""{
+ "Name": "",
+ "Age": "",
+ "Educations": [
+ {
+ "School": "",
+ "Date": ""
+ }
+ ],
+ "Experiences": [
+ {
+ "Company": "",
+ "Date": ""
+ }
+ ]
+ }""")
+
+ example6 = (""""Libretto by Marius Petipa, based on the 1822 novella ``Trilby, ou Le Lutin d'Argail`` by Charles Nodier, first presented by the Ballet of the Moscow Imperial Bolshoi Theatre on January 25/February 6 (Julian/Gregorian calendar dates), 1870, in Moscow with Polina Karpakova as Trilby and Ludiia Geiten as Miranda and restaged by Petipa for the Imperial Ballet at the Imperial Bolshoi Kamenny Theatre on January 17–29, 1871 in St. Petersburg with Adèle Grantzow as Trilby and Lev Ivanov as Count Leopold.""","""{
+ "Name_act": "",
+ "Director": "",
+ "Location": [
+ {
+ "City": "",
+ "Venue": "",
+ "Date": "",
+ "Actor": [
+ {
+ "Name": "",
+ "Character_played": ""
+ }
+ ]
+ }
+ ]
+ }""")
+
+
+ def extract_leaves(item, path=None, leaves=None):
+     if leaves is None:
+         leaves = []
+     if path is None:
+         path = []
+
+     if isinstance(item, dict):
+         for key, value in item.items():
+             extract_leaves(value, path + [key], leaves)
+     elif isinstance(item, list):
+         for value in item:
+             extract_leaves(value, path, leaves)
+     else:
+         if item != '':
+             leaves.append((path, item))
+     return leaves
+
+ def highlight_words(input_text, json_output):
+     colors = cycle(["#90ee90", "#add8e6", "#ffb6c1", "#ffff99", "#ffa07a", "#20b2aa", "#87cefa", "#b0e0e6", "#dda0dd", "#ffdead"])
+     color_map = {}
+     highlighted_text = input_text
+
+     leaves = extract_leaves(json_output)
+     for path, value in leaves:
+         path_key = tuple(path)
+         if path_key not in color_map:
+             color_map[path_key] = next(colors)
+         color = color_map[path_key]
+         highlighted_text = highlighted_text.replace(value, f"<span style='background-color: {color};'>{unquote(value)}</span>")
+
+     return highlighted_text
+
+ # model = AutoModelForCausalLM.from_pretrained(
+ #     "numind/NuExtract-tinyv2",
+ # )
+
+ model = AutoModelForCausalLM.from_pretrained(
+     "numind/NuExtract",
+     trust_remote_code=True,
+     attn_implementation="flash_attention_2",
+     torch_dtype="auto"
+ )
+
+
+ tokenizer = AutoTokenizer.from_pretrained("numind/NuExtract")
+ tokenizer.eos = tokenizer("<|end-output|>")  # will save it directly on hf
+
+ model.to("cuda")
+ model.eval()
+
+
+ def get_prediction(text, template, example):
+     print(template)
+     prompt = create_prompt(text, template, [example, "", ""])
+     result = generate_answer_short(prompt, model, tokenizer)
+     print(result)
+     result = result.replace("\n", " ")
+     r = unquote(result)
+     r = json.dumps(json.loads(r), indent=4)
+     print(result)
+     dic_out = json.loads(r)
+     highlighted_input2 = highlight_words(text, dic_out)
+     return r, highlighted_input2
+
+
+ iface = gr.Interface(fn=get_prediction,
+                      inputs=[
+                          gr.Textbox(lines=2, placeholder="Enter Text here...", label="Text"),
+                          gr.Textbox(lines=2, placeholder="Enter Template input here...", label="Template"),
+                          gr.Textbox(lines=2, placeholder="Enter Example input here...", label="Example")],
+                      outputs=[gr.Textbox(label="Model Output"), gr.HTML(label="Model Output with Highlighted Words")],
+                      examples=[[example6[0], example6[1]],
+                                [example1[0], example1[1]],
+                                [example4[0], example4[1]],
+                                [example2[0], example2[1]],
+                                [example5[0], example5[1]],
+                                [example3[0], example3[1]]])
+
+
+ iface.launch(debug=True, share=True)
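
Note: app.py imports StoppingCriteria and StoppingCriteriaList and stores the <|end-output|> token ids on the tokenizer, but no stopping criterion is ever passed to generation. A minimal sketch of how that tag could be wired in, assuming it tokenizes the same way at generation time (illustrative only, not part of this commit):

from transformers import StoppingCriteria, StoppingCriteriaList

class StopOnTag(StoppingCriteria):
    # Stop once the most recently generated tokens spell out the <|end-output|> tag.
    def __init__(self, tokenizer, tag="<|end-output|>"):
        self.stop_ids = tokenizer(tag, add_special_tokens=False).input_ids

    def __call__(self, input_ids, scores, **kwargs):
        return input_ids[0, -len(self.stop_ids):].tolist() == self.stop_ids

# usage sketch:
# model.generate(**model_input, max_new_tokens=1500,
#                stopping_criteria=StoppingCriteriaList([StopOnTag(tokenizer)]))
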
ml.py ADDED
@@ -0,0 +1,34 @@
+ from transformers import AutoModelForCausalLM, AutoTokenizer, StoppingCriteria, StoppingCriteriaList
+ import torch
+ import json
+ import re
+ import numpy as np
+
+
+ def create_prompt(text, template, examples):
+     template = json.dumps(json.loads(template), indent=4)
+
+     prompt = "<|input|>\n### Template:\n"+template+"\n"
+
+     if examples[0]:
+         example1 = json.dumps(json.loads(examples[0]), indent=4)
+         prompt += "### Example:\n"+example1+"\n"
+     if examples[1]:
+         example2 = json.dumps(json.loads(examples[1]), indent=4)
+         prompt += "### Example:\n"+example2+"\n"
+     if examples[2]:
+         example3 = json.dumps(json.loads(examples[2]), indent=4)
+         prompt += "### Example:\n"+example3+"\n"
+
+     prompt += "### Text:\n"+text+'''\n<|output|>'''
+
+     return prompt
+
+
+ def generate_answer_short(prompt, model, tokenizer):
+     model_input = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=3000).to("cuda")
+     with torch.no_grad():
+         gen = tokenizer.decode(model.generate(**model_input, max_new_tokens=1500)[0], skip_special_tokens=True)
+     print(gen.split("<|output|>")[1].split("<|end-output|>")[0])
+     return gen.split("<|output|>")[1].split("<|end-output|>")[0]
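
Note: for reference, a rough sketch of the prompt that create_prompt builds when no in-context examples are supplied; the template and text values below are made up for illustration:

from ml import create_prompt

template = '{"Model": {"Name": "", "Licence": ""}}'
text = "We introduce Mistral 7B, released under the Apache 2.0 license."
print(create_prompt(text, template, ["", "", ""]))
# <|input|>
# ### Template:
# {
#     "Model": {
#         "Name": "",
#         "Licence": ""
#     }
# }
# ### Text:
# We introduce Mistral 7B, released under the Apache 2.0 license.
# <|output|>
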