---
library_name: peft
license: mit
language:
- en
pipeline_tag: text-generation
---
 
## BioGPT-Natural-Products-RE-Extended-synt-v1.0

Natural products represent a large pool of bioactive compounds of high interest in drug discovery. However, relationships between natural products and their source organisms are sparsely distributed, and a growing part of the literature remains unannotated. This volume calls for a machine assistant to speed up the completion of existing resources. Framing the task as end-to-end Relation Extraction (RE), we propose a BioGPT model fine-tuned on synthetic data. See details of this procedure in the accompanying [article](url/rXiv).

The model is derived from [microsoft/BioGPT-Large](https://huggingface.co/microsoft/BioGPT-Large) and was trained on the synthetic dataset *Extended-synt*, available on Zenodo.

Dataset: [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.8422294.svg)](https://doi.org/10.5281/zenodo.8422294)
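
To fetch the dataset, you can download the archive directly from the Zenodo record. A minimal sketch (the record id comes from the DOI above; the file name is a placeholder, check the record page for the actual one):

```python
import requests

# Record id taken from the DOI 10.5281/zenodo.8422294.
RECORD_ID = "8422294"
# Hypothetical file name: look up the real one on the Zenodo record page.
FILENAME = "extended-synt.zip"

url = f"https://zenodo.org/records/{RECORD_ID}/files/{FILENAME}?download=1"
resp = requests.get(url, timeout=60)
resp.raise_for_status()
with open(FILENAME, "wb") as f:
    f.write(resp.content)
```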

You can use the model directly, as in the following example.

## Example

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

model_hf = "microsoft/BioGPT-Large"
lora_adapters = "mdelmas/BioGPT-Large-Natural-Products-RE-Extended-synt-v1.0"

# Load the base model and plug in the LoRA adapters using peft
model = AutoModelForCausalLM.from_pretrained(model_hf)
model = PeftModel.from_pretrained(model, lora_adapters)
model = model.merge_and_unload()
tokenizer = AutoTokenizer.from_pretrained(model_hf)

# Example from PubMed article 24048364
title_text = "Producers and important dietary sources of ochratoxin A and citrinin."
abstract_text = "Ochratoxin A (OTA) is a very important mycotoxin, and its research is focused right now on the new findings of OTA, like being a complete carcinogen, information about OTA producers and new exposure sources of OTA. Citrinin (CIT) is another important mycotoxin, too, and its research turns towards nephrotoxicity. Both additive and synergistic effects have been described in combination with OTA. OTA is produced in foodstuffs by Aspergillus Section Circumdati (Aspergillus ochraceus, A. westerdijkiae, A. steynii) and Aspergillus Section Nigri (Aspergillus carbonarius, A. foetidus, A. lacticoffeatus, A. niger, A. sclerotioniger, A. tubingensis), mostly in subtropical and tropical areas. OTA is produced in foodstuffs by Penicillium verrucosum and P. nordicum, notably in temperate and colder zones. CIT is produced in foodstuffs by Monascus species (Monascus purpureus, M. ruber) and Penicillium species (Penicillium citrinum, P. expansum, P. radicicola, P. verrucosum). OTA was frequently found in foodstuffs of both plant origin (e.g., cereal products, coffee, vegetable, liquorice, raisins, wine) and animal origin (e.g., pork/poultry). CIT was also found in foodstuffs of vegetable origin (e.g., cereals, pomaceous fruits, black olive, roasted nuts, spices), food supplements based on rice fermented with red microfungi Monascus purpureus and in foodstuffs of animal origin (e.g., cheese)."
text = title_text + " " + abstract_text

# Tokenization: the eos/bos pair separates the input text from the generated relations
input_text = text + tokenizer.eos_token + tokenizer.bos_token
input_tokens = tokenizer(input_text, return_tensors='pt')

# Decoding parameters (deterministic beam search)
EVAL_GENERATION_ARGS = {"max_length": 1024,
                        "do_sample": False,
                        "forced_eos_token_id": tokenizer.eos_token_id,
                        "num_beams": 3,
                        "early_stopping": "never",
                        "length_penalty": 1.5,
                        "temperature": 0}

# Generation
with torch.no_grad():
    beam_output = model.generate(**input_tokens, **EVAL_GENERATION_ARGS)
output = tokenizer.decode(beam_output[0][len(input_tokens["input_ids"][0]):], skip_special_tokens=True)

# Parse the generated relations and print them
rels = output.strip().split("; ")
for rel in rels:
    print("- " + rel)
```
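
Since `merge_and_unload()` folds the LoRA weights into the base model, you can optionally save the merged model once and reload it later without peft. A minimal sketch (the output directory name is arbitrary):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Arbitrary local path for the merged checkpoint.
merged_dir = "biogpt-npre-merged"
model.save_pretrained(merged_dir)
tokenizer.save_pretrained(merged_dir)

# Later runs can reload directly with transformers, no peft needed.
model = AutoModelForCausalLM.from_pretrained(merged_dir)
tokenizer = AutoTokenizer.from_pretrained(merged_dir)
```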

## Citation

If you find this model useful, please cite:

```latex

```