Alexandre-Numind committed on
Commit
831257c
1 Parent(s): 30f4cd9

Upload 2 files

Files changed (2)
  1. app.py +351 -0
  2. ml.py +34 -0
app.py ADDED
@@ -0,0 +1,351 @@
+ from transformers import AutoModelForCausalLM, AutoTokenizer, StoppingCriteria, StoppingCriteriaList
+ import torch
+ from itertools import cycle
+ import json
+ import gradio as gr
+ from urllib.parse import unquote
+ from ml import create_prompt, generate_answer_short
+
+
+ example1 = ("""We introduce Mistral 7B, a 7–billion-parameter language model engineered for
+ superior performance and efficiency. Mistral 7B outperforms the best open 13B
+ model (Llama 2) across all evaluated benchmarks, and the best released 34B
+ model (Llama 1) in reasoning, mathematics, and code generation. Our model
+ leverages grouped-query attention (GQA) for faster inference, coupled with sliding
+ window attention (SWA) to effectively handle sequences of arbitrary length with a
+ reduced inference cost. We also provide a model fine-tuned to follow instructions,
+ Mistral 7B – Instruct, that surpasses Llama 2 13B – chat model both on human and
+ automated benchmarks. Our models are released under the Apache 2.0 license.
+ Code: https://github.com/mistralai/mistral-src
+ Webpage: https://mistral.ai/news/announcing-mistral-7b/""","""{
+ "Model": {
+ "Name": "",
+ "Number of parameters": "",
+ "Number of token": "",
+ "Architecture": []
+ },
+ "Usage": {
+ "Use case": [],
+ "Licence": ""
+ }
+ }""")
+
+
+ example2 = ("""Identity security company IDfy said on Wednesday it has raised $27 million in a mix of primary and secondary fundraising from Elev8, KB Investment and Tenacity Ventures.
+ Mumbai-based IDfy makes products and solutions for authenticating entities, helping companies prevent fraud and verify other businesses. "Investment from Elev8 and Tenacity is a strong validation of our vision and capabilities. The fund will fuel our expansion plans and product development, enabling us to serve even more businesses and unlock opportunities for trustworthy people and businesses," said Ashok Hariharan, co-founder and chief executive officer of IDfy, in a statement.
+ Click here to follow our WhatsApp channel
+ IDfy (Baldor Technologies) was founded in 2011 by Hariharan and Vineet Jawa. It has products for diligence processes called know your customer and know your business, employee background verification, risk and fraud mitigation, and digital privacy.
+ The company said its artificial intelligence-driven solutions serve more than 1,500 clients in banking, financial services and insurance, e-commerce, gaming and other sectors. It works with companies in India, Southeast Asia and West Asia, having HDFC Bank, Zomato, Paytm, HUL and American Express as clients.
+ Navin Honagudi, managing partner at Elev8 Venture Partners, said: "We are thrilled to partner with IDfy as our first investment. The company's innovative technology, experienced leadership team, and strong market fit position it for remarkable growth. We are confident that IDfy will play a crucial role in shaping the future of risk management in India and beyond."
+ Elev8 Venture Partners is a $200 million growth-stage fund anchored by South Korea's KB Investment. Tenacity Ventures is a growth-stage investment fund with a focus on technology product businesses.
+ """, """{
+ "Funding": {
+ "New funding": "",
+ "Investor": []
+ },
+ "Company": {
+ "Name": "",
+ "Activity": "",
+ "Total valuation": ""
+ }
+ }""")
+
+ example3 = (""""Office of Management and Budget (OMB) memorandum M-12-12, as amended by memorandum
+ M-17-08, requires federal agencies to issue an annual report related to its conference-related expenditures
+ for the previous fiscal year. This document constitutes the SEC’s report for Fiscal Year (FY) 2018.
+ The SEC has put in place policies and procedures governing the approval and use of agency funds for
+ conference expenses, to ensure that such spending is legal, reasonable, and in furtherance of the agency’s
+ mission to protect investors, maintain fair, orderly, and efficient markets, and facilitate capital formation.
+ At a high level, the major steps in this process are as follows:
+ 1. All SEC division/office requests to spend money on hosting a conference must be approved by
+ the division/office head or his/her designee. Divisions and offices are required to use SEC
+ facilities for such events whenever possible, to minimize space rental and equipment costs. In
+ order to limit expenses for meals or refreshments, the SEC uses per diem rates established for
+ the federal government as the ceiling for any such costs, except when higher rates are
+ unavoidable or otherwise justified. The acquisition of any goods, services, or meeting space is
+ subject to the applicable policies and regulations which govern these areas.
+ 2. When a request for funds is necessary and has received approval from the division/office head,
+ it is reviewed by staff in the Office of Financial Management (OFM) to ensure the expenses
+ are permissible under the applicable polices and regulations. OFM has implemented an
+ automated system for the submission, review, and approval of all SEC conference requests
+ that enables OFM to monitor and control conference spending, as well as record actual
+ conference spending after a conference has been held.
+ 3. Each request must receive final approval from designated officials according to the total
+ projected cost. These designations comply with OMB Memorandum 12-12.
+ 4. The SEC is reporting conferences which meet thresholds defined in P.L.115-141 Division E,
+ Title VII, Sections 739 (a), (b), and (c), to the SEC’s Office of Inspector General via separate
+ correspondence.
+ For FY 2018, the SEC authorized 97 conferences (including training conferences) with costs totaling
+ $884,759.
+ 2
+ Conferences over $100,000:
+ In FY 2018, the SEC authorized two conferences costing greater than $100,000, which are described
+ below:
+ A. 2018 Chief Enforcement Conference (CEC), SEC Headquarters, Washington DC, September 25-26,
+ 2018
+ • Cost incurred1
+ : $165,194
+ • Number of attendees: 209 (207 SEC attendees and 2 non-SEC attendees)
+ The Enforcement Division (Enforcement) conducts investigations into potential violations of the
+ federal securities laws, litigates actions, negotiates settlements, and coordinates with the
+ Commission and other SEC divisions and offices regarding the national enforcement program.
+ Because Enforcement staff are located in Washington, DC and 11 regional offices, periodic
+ gatherings of Enforcement leaders help to ensure an efficient, well-coordinated national program.
+ The 2018 Chief Enforcement Conference (CEC) was held at SEC Headquarters in Washington,
+ D.C. on September 25 and 26, 2018. CEC served as a strategic planning and training session for
+ Enforcement’s senior managers and provided an important opportunity for attendees to discuss
+ relevant enforcement topics with the Chairman and participating Commissioners.
+ B. 2018 Leadership Conference, SEC Headquarters, Washington, DC, July 26-27, 2018
+ • Cost incurred1
+ : $219,658
+ • Number of attendees: 261 attendees (261 SEC employees)
+ The Office of Compliance Inspections and Examinations (OCIE) conducts the National
+ Examination Program and focuses on improving compliance with the federal securities laws,
+ preventing fraud, informing policy, and monitoring risk. Because examination program staff are
+ located in Washington, DC and 11 regional offices, periodic gatherings of examination program
+ leaders help to ensure an efficient, well-coordinated national program. On July 26 and 27, 2018,
+ OCIE held its leadership conference at SEC Headquarters in Washington DC, which focused on
+ initiatives to increase OCIE’s capabilities. The conference gathered SEC managers from across
+ the National Examination Program to collaborate on strategic planning and to provide training. It
+ included presentations and discussions on risk assessment tools and procedures, implementation
+ of new requirements, and increasing OCIE’s collaboration with other Commission offices and
+ divisions""","""{
+ "Number of conference": "",
+ "Total cost": "",
+ "Conferences over 100k": [
+ {
+ "Name": "",
+ "Organizer": "",
+ "Cost": "",
+ "Start date": "",
+ "End date": "",
+ "Location": ""
+ }
+ ]
+ }""")
+
+ example4 = ("""
+ Patient: Good evening doctor.
+
+ Doctor: Good evening. You look pale and your voice is out of tune.
+
+ Patient: Yes doctor. I’m running a temperature and have a sore throat.
+
+ Doctor: Lemme see.
+
+ (He touches the forehead to feel the temperature.)
+
+ Doctor: You’ve moderate fever.
+
+ (He then whips out a thermometer.)
+
+ Patient: This thermometer is very different from the one you used the last time. (Unlike the earlier one which was placed below the tongue, this one snapped around one of the fingers.)
+
+ Doctor: Yes, this is a new introduction by medical equipment companies. It’s much more convenient, as it doesn’t require cleaning after every use.
+
+ Patient: That’s awesome.
+
+ Doctor: Yes it is.
+
+ (He removes the thermometer and looks at the reading.)
+
+ Doctor: Not too high – 99.8.
+
+ (He then proceeds with measuring blood pressure.)
+
+ Doctor: Your blood pressure is fine.
+
+ (He then checks the throat.)
+
+ Doctor: It looks bit scruffy. Not good.
+
+ Patient: Yes, it has been quite bad.
+
+ Doctor: Do you get sweating and shivering?
+
+ Patient: Not sweating, but I feel somewhat cold when I sit under a fan.
+
+ Doctor: OK. You’ve few symptoms of malaria. I would suggest you undergo blood test. Nothing to worry about. In most cases, the test come out to be negative. It’s just precautionary, as there have been spurt in malaria cases in the last month or so.
+
+ (He then proceeds to write the prescription.)
+
+ Doctor: I’m prescribing three medicines and a syrup. The number of dots in front of each tells you how many times in the day you’ve to take them. For example, the two dots here mean you’ve to take the medicine twice in the day, once in the morning and once after dinner.
+
+ Doctor: Do you’ve any other questions?
+
+ Patient: No, doctor. Thank you.
+ ""","""{
+ "Doctor_Patient_Discussion": {
+ "Initial_Observation": {
+ "Symptoms": [],
+ "Initial_Assessment": ""
+ },
+ "Medical_Examination": {
+ "Temperature":"",
+ "Blood_Pressure":"",
+ "Doctor_Assessment": "",
+ "Diagnosis": ""
+ },
+ "Treatment_Plan": {
+ "Prescription": []
+ }
+ }
+ }""")
+
+ example5 = ("""HARVARD UNIVERSITY Extension School
+ Master of Liberal Arts, Information Management Systems May 2015
+  Dean’s List Academic Achievement Award recipient
+  Relevant coursework: Trends in Enterprise Information Systems, Principles of Finance, Data mining
+ and Forecast Management, Resource Planning and Allocation Management, Simulation for
+ Managerial Decision Making
+ RUTGERS, THE STATE UNIVERSITY OF NEW JERSEY
+ Bachelor of Arts in Computer Science with Mathematics minor May 2008
+ Professional Experience
+ STATE STREET CORPORATION
+ Principal –Simulated Technology
+ Boston, MA
+ December 2011 – July 2013
+  Led 8 cross functional, geographically dispersed teams to support quality for the reporting system
+  Improved process efficiency 75% by standardizing end to end project management workflow
+  Reduced application testing time 30% by automating shorter testing phases for off cycle projects
+  Conducted industry research on third-party testing tools and prepared recommendations for maximum
+ return on investment
+ FIDELITY INVESTMENTS
+ Associate – Interactive Technology
+ Boston, MA
+ January 2009 – November 2011
+  Initiated automated testing efforts that reduced post production defects by 40%
+  Implemented initiatives to reduce overall project time frames by involving quality team members
+ early in the Software Development Life Cycle iterations
+  Developed a systematic approach to organize and document the requirements of the to-be-system
+  Provided leadership to off-shore tech teams via training and analyzing business requirements
+ L.L. BEAN, INC.
+ IT Consultant
+ Freeport, ME
+ June 2008 – December 2009
+  Collaborated closely with the business teams to streamline production release strategy plans
+  Managed team of five test engineers to develop data driven framework that increased application
+ testing depth and breadth by 150%
+  Generated statistical analysis of quality and requirements traceability matrices to determine the linear
+ relationship of development time frames to defect identification and subsequent resolution
+  Led walkthroughs with project stakeholders to set expectations and milestones for the project team
+ Technical Expertise
+ MS Excel, PowerPoint, Relational Databases, Project Management, Quantitative Analysis, SQL, Java
+ Additional
+ Organized computer and English literacy workshops for underprivileged children in South Asia, 2013
+ Student Scholarship Recipient, National Conference on Race and Ethnicity, 2007-2008""","""{
+ "Name": "",
+ "Age": "",
+ "Educations": [
+ {
+ "School": "",
+ "Date": ""
+ }
+ ],
+ "Experiences": [
+ {
+ "Company": "",
+ "Date": ""
+ }
+ ]
+ }""")
+
+ example6 = (""""Libretto by Marius Petipa, based on the 1822 novella ``Trilby, ou Le Lutin d'Argail`` by Charles Nodier, first presented by the Ballet of the Moscow Imperial Bolshoi Theatre on January 25/February 6 (Julian/Gregorian calendar dates), 1870, in Moscow with Polina Karpakova as Trilby and Ludiia Geiten as Miranda and restaged by Petipa for the Imperial Ballet at the Imperial Bolshoi Kamenny Theatre on January 17–29, 1871 in St. Petersburg with Adèle Grantzow as Trilby and Lev Ivanov as Count Leopold.""","""{
+ "Name_act": "",
+ "Director": "",
+ "Location": [
+ {
+ "City": "",
+ "Venue": "",
+ "Date": "",
+ "Actor": [
+ {
+ "Name": "",
+ "Character_played": ""
+ }
+ ]
+ }
+ ]
+ }""")
+
+
+ def extract_leaves(item, path=None, leaves=None):
+     if leaves is None:
+         leaves = []
+     if path is None:
+         path = []
+
+     if isinstance(item, dict):
+         for key, value in item.items():
+             extract_leaves(value, path + [key], leaves)
+     elif isinstance(item, list):
+         for value in item:
+             extract_leaves(value, path, leaves)
+     else:
+         if item != '':
+             leaves.append((path, item))
+     return leaves
+
+ def highlight_words(input_text, json_output):
+     colors = cycle(["#90ee90", "#add8e6", "#ffb6c1", "#ffff99", "#ffa07a", "#20b2aa", "#87cefa", "#b0e0e6", "#dda0dd", "#ffdead"])
+     color_map = {}
+     highlighted_text = input_text
+
+     leaves = extract_leaves(json_output)
+     for path, value in leaves:
+         path_key = tuple(path)
+         if path_key not in color_map:
+             color_map[path_key] = next(colors)
+         color = color_map[path_key]
+         highlighted_text = highlighted_text.replace(value, f"<span style='background-color: {color};'>{unquote(value)}</span>")
+
+     return highlighted_text
+
+ # model = AutoModelForCausalLM.from_pretrained(
+ #     "numind/NuExtract-tinyv2",
+ # )
+
+ model = AutoModelForCausalLM.from_pretrained(
+     "numind/NuExtract",
+     trust_remote_code=True,
+     attn_implementation="flash_attention_2",
+     torch_dtype="auto"
+ )
+
+
+ tokenizer = AutoTokenizer.from_pretrained("numind/NuExtract")
+ tokenizer.eos = tokenizer("<|end-output|>")  # will save it directly on hf
+
+ model.to("cuda")
+ model.eval()
+
+
+ def get_prediction(text, template, example):
+     print(template)
+     prompt = create_prompt(text, template, [example, "", ""])
+     result = generate_answer_short(prompt, model, tokenizer)
+     print(result)
+     result = result.replace("\n", " ")
+     r = unquote(result)
+     r = json.dumps(json.loads(r), indent=4)
+     print(result)
+     dic_out = json.loads(r)
+     highlighted_input2 = highlight_words(text, dic_out)
+     return r, highlighted_input2
+
+
+ iface = gr.Interface(fn=get_prediction,
+                      inputs=[
+                          gr.Textbox(lines=2, placeholder="Enter Text here...", label="Text"),
+                          gr.Textbox(lines=2, placeholder="Enter Template input here...", label="Template"),
+                          gr.Textbox(lines=2, placeholder="Enter Example input here...", label="Example")],
+                      outputs=[gr.Textbox(label="Model Output"), gr.HTML(label="Model Output with Highlighted Words")],
+                      examples=[[example6[0], example6[1]],
+                                [example1[0], example1[1]],
+                                [example4[0], example4[1]],
+                                [example2[0], example2[1]],
+                                [example5[0], example5[1]],
+                                [example3[0], example3[1]]])
+
+
+ iface.launch(debug=True, share=True)
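
Note: app.py imports StoppingCriteria and StoppingCriteriaList and stores the <|end-output|> token ids on the tokenizer, but no stopping criterion is ever passed to generation. A minimal sketch of how that tag could be wired in, assuming it tokenizes the same way at generation time (illustrative only, not part of this commit):

from transformers import StoppingCriteria, StoppingCriteriaList

class StopOnTag(StoppingCriteria):
    # Stop once the most recently generated tokens spell out the <|end-output|> tag.
    def __init__(self, tokenizer, tag="<|end-output|>"):
        self.stop_ids = tokenizer(tag, add_special_tokens=False).input_ids

    def __call__(self, input_ids, scores, **kwargs):
        return input_ids[0, -len(self.stop_ids):].tolist() == self.stop_ids

# usage sketch:
# model.generate(**model_input, max_new_tokens=1500,
#                stopping_criteria=StoppingCriteriaList([StopOnTag(tokenizer)]))
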
ml.py ADDED
@@ -0,0 +1,34 @@
+ from transformers import AutoModelForCausalLM, AutoTokenizer, StoppingCriteria, StoppingCriteriaList
+ import torch
+ import json
+ import re
+ import numpy as np
+
+
+ def create_prompt(text, template, examples):
+     template = json.dumps(json.loads(template), indent=4)
+
+     prompt = "<|input|>\n### Template:\n"+template+"\n"
+
+     if examples[0]:
+         example1 = json.dumps(json.loads(examples[0]), indent=4)
+         prompt += "### Example:\n"+example1+"\n"
+     if examples[1]:
+         example2 = json.dumps(json.loads(examples[1]), indent=4)
+         prompt += "### Example:\n"+example2+"\n"
+     if examples[2]:
+         example3 = json.dumps(json.loads(examples[2]), indent=4)
+         prompt += "### Example:\n"+example3+"\n"
+
+     prompt += "### Text:\n"+text+'''\n<|output|>'''
+
+     return prompt
+
+
+ def generate_answer_short(prompt, model, tokenizer):
+     model_input = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=3000).to("cuda")
+     with torch.no_grad():
+         gen = tokenizer.decode(model.generate(**model_input, max_new_tokens=1500)[0], skip_special_tokens=True)
+     print(gen.split("<|output|>")[1].split("<|end-output|>")[0])
+     return gen.split("<|output|>")[1].split("<|end-output|>")[0]
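
Note: for reference, a rough sketch of the prompt that create_prompt builds when no in-context examples are supplied; the template and text values below are made up for illustration:

from ml import create_prompt

template = '{"Model": {"Name": "", "Licence": ""}}'
text = "We introduce Mistral 7B, released under the Apache 2.0 license."
print(create_prompt(text, template, ["", "", ""]))
# <|input|>
# ### Template:
# {
#     "Model": {
#         "Name": "",
#         "Licence": ""
#     }
# }
# ### Text:
# We introduce Mistral 7B, released under the Apache 2.0 license.
# <|output|>
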