This model was trained 2x faster with [Unsloth](https://github.com/unslothai/unsloth) and Hugging Face's TRL library.
[<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20made%20with%20love.png" width="200"/>](https://github.com/unslothai/unsloth)

## 📄 Model Card: `aksw/Bike-name`

### 🧠 Model Overview

`Bike-name` is a fine-tuned medium-sized language model designed to **extract biochemical names from scientific text articles**. It is well suited to Information Retrieval systems based on Biochemical Knowledge Extraction.

---

### 🔍 Intended Use

* **Input**: text extracted from a biochemical scientific article (e.g., from a PDF file)
* **Output**: a **single list** containing the biochemical names found in the text
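For illustration, a hypothetical input/output pair (the article snippet and compound names below are invented examples, not actual model output):

```python
# Hypothetical excerpt of text extracted from a paper:
article_text = (
    "We isolated two flavonoids, quercetin and kaempferol, "
    "from the leaf extract and assessed their activity."
)

# Expected shape of the model's answer: one Python-style list of names.
expected_output = ["quercetin", "kaempferol"]
```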

---

### 🧩 Applications

* Question answering systems over biochemical datasets
* Biochemical knowledge graph exploration tools
* Extraction of biochemical names from scientific text articles

---

### ⚙️ Model Details

* **Base model**: Phi 4 14B (via Unsloth)
* **Training data**: scientific text articles
  * 418 unique names
  * 143 articles
* **Target benchmark**: NatUKE (https://github.com/AKSW/natuke)
* **Frameworks**: Unsloth, Hugging Face Transformers

---

### 📦 Installation

Make sure to install `unsloth`, `torch`, and the CUDA dependencies:

```bash
pip install unsloth torch
```

---

### 🧪 Example: Inference Code

```python
from unsloth import FastLanguageModel
import ast
import torch

class SPARQLQueryGenerator:
    """Wrapper around the fine-tuned model for compound-name extraction."""

    def __init__(self, model_name: str, max_seq_length: int = 32768, load_in_4bit: bool = True):
        self.model, self.tokenizer = FastLanguageModel.from_pretrained(
            model_name=model_name,
            max_seq_length=max_seq_length,
            load_in_4bit=load_in_4bit,
        )
        FastLanguageModel.for_inference(self.model)

    def build_prompt(self, article_text: str) -> list:
        return [
            {"role": "system", "content": (
                "You are a scientist trained in chemistry.\n"
                "You must extract information from scientific papers identifying relevant properties associated with each natural product discussed in the academic publication.\n"
                "For each paper, you have to analyze the content (text) to identify the *Compound name*. It can be more than one compound name.\n"
                "Your output should be a list with the names. Return only the list, without any additional information.\n"
            )},
            {"role": "user", "content": article_text},
        ]

    def generate_query(self, article_text: str, temperature: float = 0.01, max_new_tokens: int = 1024):
        start_tag = "<|im_start|>assistant<|im_sep|>"
        end_tag = "<|im_end|>"
        messages = self.build_prompt(article_text)
        inputs = self.tokenizer.apply_chat_template(
            messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
        ).to("cuda")
        outputs = self.model.generate(
            inputs, max_new_tokens=max_new_tokens, use_cache=True,
            temperature=temperature, min_p=0.1,
        )
        decoded = self.tokenizer.batch_decode(outputs)[0]
        # Keep only the assistant's reply and strip the chat-template tags.
        answer = decoded[decoded.find(start_tag):].replace(start_tag, "").replace(end_tag, "")
        try:
            # Safer than eval(): only accepts Python literals such as lists.
            result = ast.literal_eval(answer.strip())
        except (ValueError, SyntaxError):
            result = answer
            print("The output is not a list; one more preprocessing step is needed.")
        return result

# --- Using the model ---
if __name__ == "__main__":
    generator = SPARQLQueryGenerator(model_name="aksw/Bike-name")
    text = "Title, Abstract, Introduction, Background, Method, Results, Conclusion, References."
    list_names = generator.generate_query(text)
    print(list_names)
```
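When the model's reply is not a clean Python list literal, a light post-processing step can often still recover the names. A minimal sketch of one such step (the bracket-matching heuristic and the sample reply below are illustrative assumptions, not part of the model's documented behavior):

```python
import ast
import re

def recover_name_list(raw_output: str) -> list:
    """Best-effort recovery of a list of names from a model reply.

    Tries to parse the first [...] span as a Python literal; falls back
    to splitting on commas/newlines if no literal list is found.
    """
    match = re.search(r"\[.*?\]", raw_output, flags=re.DOTALL)
    if match:
        try:
            parsed = ast.literal_eval(match.group(0))
            if isinstance(parsed, list):
                return [str(item).strip() for item in parsed]
        except (ValueError, SyntaxError):
            pass
    # Fallback: treat the reply as a comma- or newline-separated listing.
    parts = re.split(r"[,\n]", raw_output)
    return [p.strip(" -*'\"") for p in parts if p.strip(" -*'\"")]

# Hypothetical malformed reply, used only for illustration:
print(recover_name_list("Sure! ['quercetin', 'kaempferol']"))
```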

---

### 🧪 Evaluation

The model was evaluated using Hits@k on the test sets of the NatUKE benchmark (do Carmo et al., 2023).
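Hits@k counts a test instance as a hit when the gold name appears among the model's top-k predictions. A minimal sketch of the metric (function names, the case-insensitive matching, and the toy data are my own assumptions, not NatUKE's reference implementation):

```python
def hits_at_k(gold: list, predictions: list, k: int) -> float:
    """Fraction of instances whose gold name appears in the top-k
    predicted names (case-insensitive comparison)."""
    hits = 0
    for gold_name, predicted_names in zip(gold, predictions):
        top_k = [name.lower() for name in predicted_names[:k]]
        if gold_name.lower() in top_k:
            hits += 1
    return hits / len(gold)

# Toy example with made-up names:
gold = ["quercetin", "lupeol"]
preds = [["quercetin", "rutin"], ["betulin", "taraxerol"]]
print(hits_at_k(gold, preds, k=2))  # 1 hit out of 2 instances -> 0.5
```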

---

Do Carmo, Paulo Viviurka, et al. "NatUKE: A Benchmark for Natural Product Knowledge Extraction from Academic Literature." 2023 IEEE 17th International Conference on Semantic Computing (ICSC). IEEE, 2023.

### 📚 Citation

If you use this model in your work, please cite it as:

```bibtex
@inproceedings{ref:doCarmo2025,
  title={Improving Natural Product Knowledge Extraction from Academic Literature with Enhanced PDF Text Extraction and Large Language Models},
  author={Viviurka do Carmo, Paulo and Silva G{\^o}lo, Marcos Paulo and Gwozdz, Jonas and Marx, Edgard and Marcondes Marcacini, Ricardo},
  booktitle={Proceedings of the 40th ACM/SIGAPP Symposium on Applied Computing},
  pages={980--987},
  year={2025}
}
```