# GLiREL: Generalist and Lightweight model for Zero-Shot Relation Extraction

GLiREL is a Relation Extraction model capable of classifying unseen relations given the entities within a text. It builds upon the excellent work by Urchade Zaratiana, Nadi Tomeh, Pierre Holat, and Thierry Charnois on the [GLiNER](https://github.com/urchade/GLiNER) library, which enables efficient zero-shot Named Entity Recognition.

* GLiNER paper: [GLiNER: Generalist Model for Named Entity Recognition using Bidirectional Transformer](https://arxiv.org/abs/2311.08526)

* Train a zero-shot model: <a href="https://colab.research.google.com/github/jackboyla/GLiREL/blob/main/train.ipynb" target="_blank">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
  </a>

<!-- <img src="demo.jpg" alt="Demo Image" width="50%"/> -->

---
## Installation

```bash
pip install glirel
```

## Usage
Once you've installed the GLiREL library, you can import the `GLiREL` class. Load a pretrained model with `GLiREL.from_pretrained` and predict relations with `predict_relations`.

```python
from glirel import GLiREL
import spacy

model = GLiREL.from_pretrained("jackboyla/glirel_beta")

nlp = spacy.load('en_core_web_sm')

text = 'Derren Nesbitt had a history of being cast in "Doctor Who", having played villainous warlord Tegana in the 1964 First Doctor serial "Marco Polo".'
doc = nlp(text)
tokens = [token.text for token in doc]

labels = ['country of origin', 'licensed to broadcast to', 'father', 'followed by', 'characters']

# Entity indices are inclusive; 'type' is not used -- it can be any string!
ner = [[26, 27, 'PERSON', 'Marco Polo'], [22, 23, 'Q2989412', 'First Doctor']]

relations = model.predict_relations(tokens, labels, threshold=0.0, ner=ner, top_k=1)

print('Number of relations:', len(relations))

sorted_data_desc = sorted(relations, key=lambda x: x['score'], reverse=True)
print("\nDescending Order by Score:")
for item in sorted_data_desc:
    print(f"{item['head_text']} --> {item['label']} --> {item['tail_text']} | score: {item['score']}")
```

### Expected Output

```
Number of relations: 2

Descending Order by Score:
{'head_pos': [26, 28], 'tail_pos': [22, 24], 'head_text': ['Marco', 'Polo'], 'tail_text': ['First', 'Doctor'], 'label': 'characters', 'score': 0.9923334121704102}
{'head_pos': [22, 24], 'tail_pos': [26, 28], 'head_text': ['First', 'Doctor'], 'tail_text': ['Marco', 'Polo'], 'label': 'characters', 'score': 0.9915636777877808}
```
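
The `ner` spans in this example are written by hand. If you already run spaCy's NER, you can derive them from `doc.ents`; note that GLiREL expects inclusive end indices, while spaCy's `ent.end` is exclusive (see the note under "Example training data" below). A minimal sketch:

```python
# Build GLiREL-style spans from spaCy entities.
# spaCy's `ent.end` is exclusive; GLiREL's end index is inclusive, hence `- 1`.
ner = [[ent.start, ent.end - 1, ent.label_, ent.text] for ent in doc.ents]
```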

## Constrain labels
In practice, we usually want to restrict which entity types can appear as the head and/or tail of each relation. GLiREL supports this out of the box; the label dictionary below is passed at inference time, as shown in the spaCy example in the next section:

```python
labels = {"glirel_labels": {
    'co-founder': {"allowed_head": ["PERSON"], "allowed_tail": ["ORG"]},
    'no relation': {},  # head and tail can be any entity type
    'country of origin': {"allowed_head": ["PERSON", "ORG"], "allowed_tail": ["LOC", "GPE"]},
    'parent': {"allowed_head": ["PERSON"], "allowed_tail": ["PERSON"]},
    'located in or next to body of water': {"allowed_head": ["LOC", "GPE", "FAC"], "allowed_tail": ["LOC", "GPE"]},
    'spouse': {"allowed_head": ["PERSON"], "allowed_tail": ["PERSON"]},
    'child': {"allowed_head": ["PERSON"], "allowed_tail": ["PERSON"]},
    'founder': {"allowed_head": ["PERSON"], "allowed_tail": ["ORG"]},
    'founded on date': {"allowed_head": ["ORG"], "allowed_tail": ["DATE"]},
    'headquartered in': {"allowed_head": ["ORG"], "allowed_tail": ["LOC", "GPE", "FAC"]},
    'acquired by': {"allowed_head": ["ORG"], "allowed_tail": ["ORG", "PERSON"]},
    'subsidiary of': {"allowed_head": ["ORG"], "allowed_tail": ["ORG", "PERSON"]},
    }
}
```

## Usage with spaCy

You can also load GLiREL into a regular spaCy NLP pipeline. Here's an example using an English pipeline.

```python
import spacy
import glirel

# Load a spaCy pipeline that includes an NER component
nlp = spacy.load('en_core_web_sm')

# Add the GLiREL component to the pipeline
nlp.add_pipe("glirel", after="ner")

# Now you can use the pipeline with the GLiREL component
text = "Apple Inc. was founded by Steve Jobs, Steve Wozniak, and Ronald Wayne in April 1976. The company is headquartered in Cupertino, California."

labels = {"glirel_labels": {
    'co-founder': {"allowed_head": ["PERSON"], "allowed_tail": ["ORG"]},
    'country of origin': {"allowed_head": ["PERSON", "ORG"], "allowed_tail": ["LOC", "GPE"]},
    'licensed to broadcast to': {"allowed_head": ["ORG"]},
    'no relation': {},
    'parent': {"allowed_head": ["PERSON"], "allowed_tail": ["PERSON"]},
    'followed by': {"allowed_head": ["PERSON", "ORG"], "allowed_tail": ["PERSON", "ORG"]},
    'located in or next to body of water': {"allowed_head": ["LOC", "GPE", "FAC"], "allowed_tail": ["LOC", "GPE"]},
    'spouse': {"allowed_head": ["PERSON"], "allowed_tail": ["PERSON"]},
    'child': {"allowed_head": ["PERSON"], "allowed_tail": ["PERSON"]},
    'founder': {"allowed_head": ["PERSON"], "allowed_tail": ["ORG"]},
    'headquartered in': {"allowed_head": ["ORG"], "allowed_tail": ["LOC", "GPE", "FAC"]},
    'acquired by': {"allowed_head": ["ORG"], "allowed_tail": ["ORG", "PERSON"]},
    'subsidiary of': {"allowed_head": ["ORG"], "allowed_tail": ["ORG", "PERSON"]},
    }
}

# Pass the labels alongside the text at inference time
docs = list(nlp.pipe([(text, labels)], as_tuples=True))
relations = docs[0][0]._.relations

print('Number of relations:', len(relations))

sorted_data_desc = sorted(relations, key=lambda x: x['score'], reverse=True)
print("\nDescending Order by Score:")
for item in sorted_data_desc:
    print(f"{item['head_text']} --> {item['label']} --> {item['tail_text']} | score: {item['score']}")
```

### Expected Output

```
Number of relations: 5

Descending Order by Score:
['Apple', 'Inc.'] --> headquartered in --> ['California'] | score: 0.9854260683059692
['Apple', 'Inc.'] --> headquartered in --> ['Cupertino'] | score: 0.9569844603538513
['Steve', 'Wozniak'] --> co-founder --> ['Apple', 'Inc.'] | score: 0.09025496244430542
['Steve', 'Jobs'] --> co-founder --> ['Apple', 'Inc.'] | score: 0.08805803954601288
['Ronald', 'Wayne'] --> co-founder --> ['Apple', 'Inc.'] | score: 0.07996643334627151
```
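
Scores can vary widely (note that the co-founder relations above score below 0.1), so you may want to filter the returned list post hoc. A small sketch, assuming the output format shown above and an arbitrary cutoff of 0.5:

```python
# Keep only relations above a chosen score cutoff.
# 0.5 is an example value -- tune it for your labels and data.
confident = [r for r in relations if r['score'] >= 0.5]
for r in confident:
    print(f"{' '.join(r['head_text'])} --> {r['label']} --> {' '.join(r['tail_text'])}")
```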


## To run experiments

* FewRel: ~56k examples
* WikiZSL: ~85k examples

```bash
# few_rel
cd data
python process_few_rel.py
cd ..
# adjust the config, then train
python train.py --config config_few_rel.yaml
```

```bash
# wiki_zsl
cd data
python process_wiki_zsl.py
cd ..
# adjust the config, then train
python train.py --config config_wiki_zsl.yaml
```

## Example training data

NOTE that entity indices are inclusive, i.e. `"Binsey"` is `[7, 7]`. This differs from spaCy, where the end index is exclusive (spaCy would set the indices to `[7, 8]`). As in the usage examples, the entity `type` field is not used by the model -- it can be any string.

JSONL file:
```json
{
  "ner": [
    [7, 7, "Q4914513", "Binsey"],
    [11, 12, "Q19686", "River Thames"]
  ],
  "relations": [
    {
      "head": {"mention": "Binsey", "position": [7, 7], "type": "LOC"},
      "tail": {"mention": "River Thames", "position": [11, 12], "type": "Q19686"},
      "relation_text": "located in or next to body of water"
    }
  ],
  "tokenized_text": ["The", "race", "took", "place", "between", "Godstow", "and", "Binsey", "along", "the", "Upper", "River", "Thames", "."]
}
{
  "ner": [
    [9, 10, "Q4386693", "Legislative Assembly"],
    [1, 3, "Q1848835", "Parliament of Victoria"]
  ],
  "relations": [
    {
      "head": {"mention": "Legislative Assembly", "position": [9, 10], "type": "Q4386693"},
      "tail": {"mention": "Parliament of Victoria", "position": [1, 3], "type": "Q1848835"},
      "relation_text": "part of"
    }
  ],
  "tokenized_text": ["The", "Parliament", "of", "Victoria", "consists", "of", "the", "lower", "house", "Legislative", "Assembly", ",", "the", "upper", "house", "Legislative", "Council", "and", "the", "Queen", "of", "Australia", "."]
}
```
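
To sanity-check a training file before a run, you can verify that every entity span is in range and that each mention matches its inclusive token indices. A hypothetical validation script (assuming one JSON object per line, as in standard JSONL; `train.jsonl` is a placeholder path):

```python
import json

def validate(path):
    """Check that every entity span matches its mention in the tokenized text."""
    with open(path) as f:
        for line_no, line in enumerate(f, 1):
            record = json.loads(line)
            tokens = record["tokenized_text"]
            for start, end, _type, mention in record["ner"]:
                assert 0 <= start <= end < len(tokens), f"line {line_no}: span out of range"
                # End indices are inclusive, so the span is tokens[start : end + 1].
                # A simple space-join matches the examples above; adjust if your
                # tokenization handles punctuation differently.
                span = " ".join(tokens[start : end + 1])
                assert span == mention, f"line {line_no}: {span!r} != {mention!r}"

validate("train.jsonl")
```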

## License

[GLiREL](https://github.com/jackboyla/GLiREL) by [Jack Boylan](https://github.com/jackboyla) is licensed under [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/?ref=chooser-v1).

<a href="https://creativecommons.org/licenses/by-nc-sa/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer">
  <img src="https://mirrors.creativecommons.org/presskit/icons/cc.svg?ref=chooser-v1" alt="CC Logo" style="height: 20px; margin-right: 5px; vertical-align: text-bottom;">
  <img src="https://mirrors.creativecommons.org/presskit/icons/by.svg?ref=chooser-v1" alt="BY Logo" style="height: 20px; margin-right: 5px; vertical-align: text-bottom;">
  <img src="https://mirrors.creativecommons.org/presskit/icons/nc.svg?ref=chooser-v1" alt="NC Logo" style="height: 20px; margin-right: 5px; vertical-align: text-bottom;">
  <img src="https://mirrors.creativecommons.org/presskit/icons/sa.svg?ref=chooser-v1" alt="SA Logo" style="height: 20px; margin-right: 5px; vertical-align: text-bottom;">
</a>