Commit 9948694 (verified) by alecocc · 1 parent: fe2367a

Update README.md

  1. README.md +7 -13
README.md CHANGED
@@ -74,7 +74,7 @@ OpenBioNER outperforms all competing models, achieving the **highest average per
74
  | UniNER | 7B | 25.1 | 60.4 | 48.1 | 46.2 | 47.9 | **68.0** | 50.2 | **53.4** | 49.9 |
75
  | GLiNER_large-v1 | 459M | 33.3 | **61.9** | **57.1** | 47.9 | 43.1 | 66.4 | 51.9 | **53.4** | 51.9 |
76
  | OpenBioNER *(Ours)* | 110M | 35.2 | 58.5 | **57.1** | **49.1** | **48.0** | 60.4 | **63.9** | 50.9 | **52.9** |
77
- | OpenBioNER *(Ours)* - Zshot | 110M | 34.8 | 57.8 | 56.8 | 49.5 | 47.1 | 60.1 | 64.6 | 52.5 | 52.9 |
78
 
79
  > ⚠️ **Disclaimer**: Please note that running evaluations using the `zshot` library may lead to slightly different results on certain benchmarks compared to those reported in the paper (above). This discrepancy is due to differences in token alignment: `zshot` uses spaCy's character-based span matching, while our experiments use token-level alignment as handled by BERT-based NER pipelines. These differences can affect how entity spans are matched and evaluated, particularly in cases with subword tokenization or punctuation.
80
 
@@ -99,7 +99,6 @@ This is the description used as NEG class (e.g. not an entity) for all the datas
99
  | :------ | :------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
100
  | DISEASE | A disease is a medical condition that disrupts normal bodily functions or structures, affecting various organs or systems, and leading to symptoms like muscle weakness, fatigue, stiffness, or cognitive impairment. Diseases can impact muscles, the nervous system, heart, eyes, and more, and may be chronic or acute, such as diabetes, cardiovascular or neurological disorders, and cancer-related conditions like lymphoblastic leukemia or lymphoma. |
101
 
102
- ---
103
 
104
  ### AnatEM
105
 
@@ -107,7 +106,6 @@ This is the description used as NEG class (e.g. not an entity) for all the datas
107
  | :------ | :---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
108
  | ANATOMY | The anatomy refers to biological components at various scales, including cells, tissues, and organs. These entities can be identified by proper nouns referring to cell types (e.g., HeLa cells, neurospheres, NSCLC, SCC), body parts (e.g., serum, blood) or biological substances (e.g., vegetables, meats, cow milk) or tumors. |
109
 
110
- ---
111
 
112
  ### BC4CHEMD
113
 
@@ -115,7 +113,6 @@ This is the description used as NEG class (e.g. not an entity) for all the datas
115
  | :------- | :---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
116
  | CHEMICAL | Chemicals are substances that are composed of one or more elements, typically consisting of atoms bonded together by chemical bonds. They can be naturally occurring, such as vitamins or sterols, or synthesized, like alkylcarbazoles or tetrachlorodibenzo-p-dioxins (TCDD). Chemicals can also be modified or combined to form new compounds, such as esters or polymers. |
117
 
118
- ---
119
 
120
  ### BC2GM
121
 
@@ -123,7 +120,6 @@ This is the description used as NEG class (e.g. not an entity) for all the datas
123
  | :--- | :-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
124
  | GENE | A gene is a unit of heredity that carries information from one generation to the next and is composed of DNA sequences that encode the instructions for the development, growth, and function of an organism. It can be a segment of DNA that is passed from one generation to the next and is responsible for the transmission of traits from parents to offspring. A gene is often represented using a three-letter code (e.g., trios, ABL, DNA-PK). |
125
 
126
- ---
127
 
128
  ### BC5CDR
129
 
@@ -132,7 +128,6 @@ This is the description used as NEG class (e.g. not an entity) for all the datas
132
  | CHEMICAL | Chemicals are substances that are composed of atoms, either bonded together in a molecule or as a mixture of different substances. This includes medications (e.g., nitroarginine methyl ester, nifedipine, prednisolone, methyldopa), compounds (e.g., potassium, calcium, ammonium), and other substances that can have various effects on the body. |
133
  | DISEASE | Diseases are any medical condition that affects the normal functioning of the body, resulting in symptoms, discomfort, or potentially life-threatening complications. This includes chronic and acute disorders, conditions affecting specific bodily systems, cancer-related conditions, and complications arising from medical treatments or external factors. |
134
 
135
- ---
136
 
137
  ### JNLPBA
138
 
@@ -144,7 +139,6 @@ This is the description used as NEG class (e.g. not an entity) for all the datas
144
  | CELL\_LINE | A cell line is a population of cells derived from a single cell, cultured in vitro or in vivo. It can be normal or transformed, with genetic changes like mutations. Cell lines, such as B-cells or HeLa cells, are used in research to study cellular processes, model diseases, and develop treatments. |
145
  | RNA | RNA is a type of nucleic acid that plays a crucial role in the transmission of genetic information from DNA to proteins. It is a single-stranded molecule composed of nucleotides, and its primary function is to carry genetic information from the nucleus to the ribosomes, where it is translated into proteins. |
146
 
147
- ---
148
 
149
  ### JNLPBA-Rare
150
 
@@ -153,22 +147,22 @@ This is the description used as NEG class (e.g. not an entity) for all the datas
153
  | CELL\_LINE | A cell line is a population of cells derived from a single cell, cultured in vitro or in vivo. It can be normal or transformed, with genetic changes like mutations. Cell lines, such as B-cells or HeLa cells, are used in research to study cellular processes, model diseases, and develop treatments. |
154
  | RNA | RNA is a type of nucleic acid that plays a crucial role in the transmission of genetic information from DNA to proteins. It is a single-stranded molecule composed of nucleotides, and its primary function is to carry genetic information from the nucleus to the ribosomes, where it is translated into proteins. |
155
 
156
- ---
157
 
158
  ### MedMentions-Rare
159
 
160
  | TYPE | Description |
161
  | :--- | :---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
162
- | NEG | In this study, we fabricated prevascularized synthetic device ports to help mitigate this limitation. Thus, the optimum range of pore size for prevascularization of these membranes was estimated to be 75 - 100 μm. A total of 51 patients were included, 16 in group I and 35 in group II." |
163
- | Bacterium (T007) | A bacterium refers to a type of microorganism that can exist as a single cell and may cause infections or play a role in various biological processes. |
164
- | Body Substance (T031) | A body substance is any material produced by or found within the body, such as blood, serum, saliva, sweat, or gastric acid. |
165
- | Food (T168) | A food refers to any substance consumed to provide nutritional support for the body. This includes snacks, meat, dairy products, grains, and edible substances like carbohydrates, proteins, and fats. |
166
- | Body System (T022) | A body system consists of interconnected organs and tissues working together to carry out essential functions. Examples include the gastrointestinal tract, nervous system, hematological system, and endocrine system. |
167
  | Professional or Occupational Group (T097) | A professional refers to individuals who share the same profession, occupation, or role within a specific field. Examples include cardiologists, psychologists, assessors, hospice staff, and volunteers. |
168
 
169
  ---
170
 
171
 
 
172
  # 🧬 How to Write Effective Entity Type Descriptions
173
 
174
  Entity type descriptions are crucial for improving generalization in OpenBioNER. Well-written descriptions help models disambiguate types, handle rare classes, and align with real-world usage across diverse datasets.
 
74
  | UniNER | 7B | 25.1 | 60.4 | 48.1 | 46.2 | 47.9 | **68.0** | 50.2 | **53.4** | 49.9 |
75
  | GLiNER_large-v1 | 459M | 33.3 | **61.9** | **57.1** | 47.9 | 43.1 | 66.4 | 51.9 | **53.4** | 51.9 |
76
  | OpenBioNER *(Ours)* | 110M | 35.2 | 58.5 | **57.1** | **49.1** | **48.0** | 60.4 | **63.9** | 50.9 | **52.9** |
77
+ | OpenBioNER *(Ours)* - Zshot | 110M | 34.8 | 57.8 | 56.8 | 49.5 | 47.1 | 60.1 | 64.6 | 52.9 | 53.0 |
78
 
79
  > ⚠️ **Disclaimer**: Please note that running evaluations using the `zshot` library may lead to slightly different results on certain benchmarks compared to those reported in the paper (above). This discrepancy is due to differences in token alignment: `zshot` uses spaCy's character-based span matching, while our experiments use token-level alignment as handled by BERT-based NER pipelines. These differences can affect how entity spans are matched and evaluated, particularly in cases with subword tokenization or punctuation.
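
The alignment difference described in the disclaimer can be illustrated with a toy example. This is a minimal sketch in plain Python, not the `zshot` or OpenBioNER evaluation code: the whitespace tokenizer, the helper names, and the sample sentence are all assumptions made for illustration. It shows how the same predicted mention can yield different spans depending on whether it is kept as a character span or snapped to token boundaries.

```python
import re


def whitespace_tokens(text):
    """Return (start, end) character offsets of whitespace-delimited tokens (toy tokenizer)."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def snap_to_tokens(span, tokens):
    """Expand a character span so that it covers every token it overlaps (hypothetical helper)."""
    start, end = span
    covered = [(s, e) for s, e in tokens if s < end and e > start]
    if not covered:
        return None
    return (covered[0][0], covered[-1][1])


text = "IL-2-mediated activation of T cells"
predicted = (0, 4)  # character offsets of the mention "IL-2"

tokens = whitespace_tokens(text)
char_level = predicted                           # character-based span, kept as is
token_level = snap_to_tokens(predicted, tokens)  # span widened to whole tokens

print(text[char_level[0]:char_level[1]])
print(text[token_level[0]:token_level[1]])
```

Under exact-match scoring, the two spans would be compared against the gold annotation differently, which is one way small score gaps of the kind shown in the table could arise.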
80
 
 
99
  | :------ | :------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
100
  | DISEASE | A disease is a medical condition that disrupts normal bodily functions or structures, affecting various organs or systems, and leading to symptoms like muscle weakness, fatigue, stiffness, or cognitive impairment. Diseases can impact muscles, the nervous system, heart, eyes, and more, and may be chronic or acute, such as diabetes, cardiovascular or neurological disorders, and cancer-related conditions like lymphoblastic leukemia or lymphoma. |
101
 
 
102
 
103
  ### AnatEM
104
 
 
106
  | :------ | :---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
107
  | ANATOMY | The anatomy refers to biological components at various scales, including cells, tissues, and organs. These entities can be identified by proper nouns referring to cell types (e.g., HeLa cells, neurospheres, NSCLC, SCC), body parts (e.g., serum, blood) or biological substances (e.g., vegetables, meats, cow milk) or tumors. |
108
 
 
109
 
110
  ### BC4CHEMD
111
 
 
113
  | :------- | :---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
114
  | CHEMICAL | Chemicals are substances that are composed of one or more elements, typically consisting of atoms bonded together by chemical bonds. They can be naturally occurring, such as vitamins or sterols, or synthesized, like alkylcarbazoles or tetrachlorodibenzo-p-dioxins (TCDD). Chemicals can also be modified or combined to form new compounds, such as esters or polymers. |
115
 
 
116
 
117
  ### BC2GM
118
 
 
120
  | :--- | :-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
121
  | GENE | A gene is a unit of heredity that carries information from one generation to the next and is composed of DNA sequences that encode the instructions for the development, growth, and function of an organism. It can be a segment of DNA that is passed from one generation to the next and is responsible for the transmission of traits from parents to offspring. A gene is often represented using a three-letter code (e.g., trios, ABL, DNA-PK). |
122
 
 
123
 
124
  ### BC5CDR
125
 
 
128
  | CHEMICAL | Chemicals are substances that are composed of atoms, either bonded together in a molecule or as a mixture of different substances. This includes medications (e.g., nitroarginine methyl ester, nifedipine, prednisolone, methyldopa), compounds (e.g., potassium, calcium, ammonium), and other substances that can have various effects on the body. |
129
  | DISEASE | Diseases are any medical condition that affects the normal functioning of the body, resulting in symptoms, discomfort, or potentially life-threatening complications. This includes chronic and acute disorders, conditions affecting specific bodily systems, cancer-related conditions, and complications arising from medical treatments or external factors. |
130
 
 
131
 
132
  ### JNLPBA
133
 
 
139
  | CELL\_LINE | A cell line is a population of cells derived from a single cell, cultured in vitro or in vivo. It can be normal or transformed, with genetic changes like mutations. Cell lines, such as B-cells or HeLa cells, are used in research to study cellular processes, model diseases, and develop treatments. |
140
  | RNA | RNA is a type of nucleic acid that plays a crucial role in the transmission of genetic information from DNA to proteins. It is a single-stranded molecule composed of nucleotides, and its primary function is to carry genetic information from the nucleus to the ribosomes, where it is translated into proteins. |
141
 
 
142
 
143
  ### JNLPBA-Rare
144
 
 
147
  | CELL\_LINE | A cell line is a population of cells derived from a single cell, cultured in vitro or in vivo. It can be normal or transformed, with genetic changes like mutations. Cell lines, such as B-cells or HeLa cells, are used in research to study cellular processes, model diseases, and develop treatments. |
148
  | RNA | RNA is a type of nucleic acid that plays a crucial role in the transmission of genetic information from DNA to proteins. It is a single-stranded molecule composed of nucleotides, and its primary function is to carry genetic information from the nucleus to the ribosomes, where it is translated into proteins. |
149
 
 
150
 
151
  ### MedMentions-Rare
152
 
153
  | TYPE | Description |
154
  | :--- | :---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
155
+ | NEG | In this study, we fabricated prevascularized synthetic device ports to help mitigate this limitation. Thus, the optimum range of pore size for prevascularization of these membranes was estimated to be 75 - 100 μm. A total of 51 patients were included, 16 in group I and 35 in group II. |
156
+ | Bacterium (T007) | A bacterium refers to a type of microorganism that can exist as a single cell and may cause infections or play a role in various biological processes. Examples include species like Streptococcus pneumoniae and Streptomyces ahygroscopicus. |
157
+ | Body Substance (T031) | A body substance is any material produced by or found within the body, such as blood, serum, saliva, sweat, or gastric acid. Specific examples include serum cytokine levels for immune responses, blood lipids for metabolic studies, and hemolymph glucose for stress responses. |
158
+ | Food (T168) | A food refers to any substance consumed to provide nutritional support for the body. This includes a wide range of items such as snacks, meat, dairy products, grains like wheat, and edible substances like carbohydrates, proteins, and fats. |
159
+ | Body System (T022) | A body system consists of interconnected organs and tissues working together to carry out essential functions. Examples include the gastrointestinal tract for digestion, the nervous system for sensory and motor control, the hematological system for blood-related functions, and the endocrine system for hormone regulation. |
160
  | Professional or Occupational Group (T097) | A professional refers to individuals who share the same profession, occupation, or role within a specific field. Examples include cardiologists, psychologists, assessors, hospice staff, and volunteers. |
161
 
162
  ---
163
 
164
 
165
+
166
  # 🧬 How to Write Effective Entity Type Descriptions
167
 
168
  Entity type descriptions are crucial for improving generalization in OpenBioNER. Well-written descriptions help models disambiguate types, handle rare classes, and align with real-world usage across diverse datasets.
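
As a rough illustration, descriptions can be kept in a plain mapping from entity type to text and checked before use. The sketch below copies two MedMentions-Rare descriptions from the tables above, plus the earlier Bacterium wording for contrast. The `has_examples` helper and its cue list are hypothetical and only a heuristic; they are not part of OpenBioNER.

```python
# Descriptions copied from the MedMentions-Rare tables; keys are labels chosen for this sketch.
descriptions = {
    "Bacterium (T007), earlier wording": (
        "A bacterium refers to a type of microorganism that can exist as a single cell "
        "and may cause infections or play a role in various biological processes."
    ),
    "Bacterium (T007)": (
        "A bacterium refers to a type of microorganism that can exist as a single cell "
        "and may cause infections or play a role in various biological processes. "
        "Examples include species like Streptococcus pneumoniae and "
        "Streptomyces ahygroscopicus."
    ),
    "Food (T168)": (
        "A food refers to any substance consumed to provide nutritional support for the body. "
        "This includes a wide range of items such as snacks, meat, dairy products, "
        "grains like wheat, and edible substances like carbohydrates, proteins, and fats."
    ),
}

# Hypothetical cue phrases that suggest a description names concrete examples.
EXAMPLE_CUES = ("e.g.", "such as", "examples include")


def has_examples(description):
    """Return True if the description contains one of the example cue phrases."""
    lowered = description.lower()
    return any(cue in lowered for cue in EXAMPLE_CUES)


# Entity types whose description names no concrete examples.
missing = [etype for etype, text in descriptions.items() if not has_examples(text)]
print(missing)
```

A check like this only flags descriptions that lack example cues; whether a description actually helps the model still has to be judged on held-out data.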