stefan-it committed on
Commit
96c10a4
·
verified ·
1 Parent(s): 66cfb7f

docs: add initial version

  1. README.md +154 -3
README.md CHANGED
@@ -1,3 +1,154 @@
- ---
- license: apache-2.0
- ---
+ ---
+ language:
+ - en
+ license: apache-2.0
+ base_model: Qwen/Qwen2.5-7B
+ tags:
+ - ner
+ - named-entity-recognition
+ - conll2003
+ - lora
+ - llm
+ datasets:
+ - conll2003
+ metrics:
+ - f1
+ - precision
+ - recall
+ ---
+
+ # Qwen2.5-7B · CoNLL-2003 English NER
+
+ This model is a LoRA fine-tune of [Qwen/Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B) for Named Entity Recognition (NER) on the original English [CoNLL-2003](https://aclanthology.org/W03-0419/) dataset. It was trained using a **bracketed inline output format** and an **Alpaca-style instruction-tuning setup**, reproducing the experimental configuration from [Zhan et al. (2026)](https://arxiv.org/abs/2601.17898).
+
+ Training code, evaluation scripts, and full configuration are available at:
+ **[https://github.com/stefan-it/llms-meet-ner](https://github.com/stefan-it/llms-meet-ner)**
+
+ ---
+
+ ## Bracketed Inline Format
+
+ Instead of producing a label sequence, the model rewrites the input sentence by wrapping each named entity in `[Entity Text | LABEL]` brackets. Plain (non-entity) tokens are left unchanged.
+
+ **Input:**
+ ```
+ Hussain , considered surplus to England 's one-day requirements , struck 158 , his first championship century of the season , as Essex reached 372 .
+ ```
+
+ **Output:**
+ ```
+ [Hussain | PER] , considered surplus to [England | LOC] 's one-day requirements , struck 158 , his first championship century of the season , as [Essex | ORG] reached 372 .
+ ```
+
+ Multi-token entities are supported naturally: the entire span appears inside a single bracket pair. Everything after the first newline in the model output is discarded to guard against hallucinated continuations.
+
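The bracketed format can be mapped back to token-level tags with a small parser. The following is a minimal sketch, not the repository's actual implementation: `parse_bracketed` and `ENTITY_RE` are hypothetical names, tokens are assumed to be whitespace-separated, labels are assumed to be upper-case, and the tags are emitted in IOB2 (`B-`/`I-`) form.

```python
import re

# Matches one "[Entity Text | LABEL]" annotation (assumption: upper-case labels,
# no nested brackets and no "|" inside the entity text).
ENTITY_RE = re.compile(r"\[([^\[\]|]+?)\s*\|\s*([A-Z]+)\]")


def parse_bracketed(output):
    """Convert bracketed inline output into (tokens, IOB2 tags)."""
    # Keep only the first line to drop hallucinated continuations.
    first_line = output.split("\n", 1)[0]
    tokens, tags = [], []
    pos = 0
    for match in ENTITY_RE.finditer(first_line):
        # Plain text before the entity: every token is tagged "O".
        for tok in first_line[pos:match.start()].split():
            tokens.append(tok)
            tags.append("O")
        label = match.group(2)
        # Entity text: first token gets B-, the remaining tokens get I-.
        for i, tok in enumerate(match.group(1).split()):
            tokens.append(tok)
            tags.append(("B-" if i == 0 else "I-") + label)
        pos = match.end()
    # Trailing plain text after the last entity.
    for tok in first_line[pos:].split():
        tokens.append(tok)
        tags.append("O")
    return tokens, tags


tokens, tags = parse_bracketed("[Hussain | PER] struck 158 for [Essex | ORG] .")
for tok, tag in zip(tokens, tags):
    print(tok, tag)
```

A real pipeline would additionally need to align the parsed tokens with the gold tokenization, since the model may drop or alter tokens.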
+ ---
+
+ ## Instruction Format
+
+ The model was fine-tuned in **Alpaca-style instruction format**. Each training example consists of a system instruction defining the task and label set, followed by the input sentence as the user turn and the bracketed inline annotation as the expected response.
+
+ The instruction used for CoNLL-2003 English:
+
+ ```
+ Your task is to identify all named entities in the input sentence and rewrite
+ the sentence by enclosing each entity using the format [Entity Text | LABEL].
+ Use only the label tags defined in the Label Set below.
+ Label Set:
+ ORG(organization): A collective entity such as a company, institution, brand,
+ political or governmental body, publication, or any organized group of people
+ acting as a unit.
+ PER(person): A named individual, including humans, animals, fictional
+ characters, and their aliases.
+ LOC(location): A geographical or spatial entity, including natural features,
+ built structures, regions, public or commercial places, assorted buildings,
+ and abstract or metaphorical places.
+ MISC(miscellaneous): Named entities that are not persons, organizations, or
+ locations, including derived adjectives, religions, ideologies,
+ nationalities, languages, events, programs, wars, titles of works, slogans,
+ eras, and types of objects.
+ Now process the input sentence:
+ ```
+
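To illustrate how one CoNLL sentence could become a training example in this setup, here is a hedged sketch. The helper names `to_bracketed` and `make_record` are hypothetical, and the `instruction` / `input` / `output` keys follow the common Alpaca record layout; the exact field mapping used in the linked repository may differ.

```python
def to_bracketed(tokens, tags):
    """Render tokens with IOB1 or IOB2 tags as bracketed inline text."""
    pieces = []
    span, span_label = [], None

    def flush():
        # Emit the currently open entity span, if any.
        nonlocal span, span_label
        if span:
            pieces.append(f"[{' '.join(span)} | {span_label}]")
            span, span_label = [], None

    for tok, tag in zip(tokens, tags):
        if tag == "O":
            flush()
            pieces.append(tok)
            continue
        prefix, label = tag.split("-", 1)
        # A new span starts on B-, or on I- when no span of that label is open (IOB1).
        if prefix == "B" or label != span_label:
            flush()
            span_label = label
        span.append(tok)
    flush()
    return " ".join(pieces)


def make_record(instruction, tokens, tags):
    """Build one Alpaca-style record (assumed keys: instruction, input, output)."""
    return {
        "instruction": instruction,
        "input": " ".join(tokens),
        "output": to_bracketed(tokens, tags),
    }


record = make_record(
    "Your task is to identify all named entities ...",  # abbreviated placeholder
    ["Hussain", ",", "struck", "158", "for", "Essex", "."],
    ["I-PER", "O", "O", "O", "O", "I-ORG", "O"],  # IOB1, as in the original dataset
)
print(record["output"])
```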
+ ---
+
+ ## Training
+
+ | Hyperparameter | Value |
+ |---|---|
+ | Base model | `Qwen/Qwen2.5-7B` |
+ | Fine-tuning method | LoRA (via LLaMA-Factory) |
+ | LoRA rank | 256 |
+ | LoRA alpha | 512 |
+ | LoRA target | all |
+ | Training dataset | CoNLL-2003 English train split |
+ | Epochs | 2 |
+ | Learning rate | 2.0e-5 |
+ | LR scheduler | cosine |
+ | Warmup ratio | 0.1 |
+ | Per-device batch size | 1 |
+ | Gradient accumulation steps | 8 |
+ | Effective batch size | 8 |
+ | Max sequence length | 2048 |
+ | Precision | bfloat16 |
+
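Two of the values in the table are derived from the others. The short sketch below recomputes them; the single-device setting is an assumption (it is the only device count consistent with an effective batch size of 8), and the scaling factor assumes the standard LoRA convention of `alpha / rank`.

```python
per_device_batch_size = 1
gradient_accumulation_steps = 8
num_devices = 1  # assumption: not stated in the table

# Number of examples contributing to each optimizer step.
effective_batch_size = per_device_batch_size * gradient_accumulation_steps * num_devices

lora_rank = 256
lora_alpha = 512
# Standard LoRA scaling applied to the low-rank update (assumes no rank-stabilized variant).
lora_scaling = lora_alpha / lora_rank

print(effective_batch_size, lora_scaling)
```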
+ ---
+
+ ## Evaluation Setup
+
+ Evaluation is performed in two complementary ways. Both start from the raw model output (bracketed inline predictions) and align it against **gold labels taken directly from the original CoNLL-2003 IOB1 dataset**, never from the converted training format, so that conversion artefacts cannot affect the scores.
+
+ - **[seqeval](https://github.com/chakki-works/seqeval)**: strict entity-level matching computed from token-level tag sequences. A predicted span is correct only if both its boundaries and its entity type match the gold span exactly.
+ - **[nervaluate](https://github.com/MantisAI/nervaluate)**: span-level evaluation reporting four scenarios (strict, exact, partial, ent_type), following the SemEval 2013 Task 9.1 metrics.
+
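The strict criterion can be sketched in a few lines of plain Python. This is an illustration of the matching rule, not the seqeval or nervaluate implementation; `extract_spans` and `strict_micro_prf` are hypothetical helper names. Because both tag sequences are first reduced to `(start, end, label)` spans, IOB1 gold labels and IOB2 predictions can be compared directly.

```python
def extract_spans(tags):
    """Return the set of (start, end_exclusive, label) spans in an IOB1/IOB2 sequence."""
    spans = set()
    start, label = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing span
        if tag == "O":
            prefix, current = "O", None
        else:
            prefix, current = tag.split("-", 1)
        # Close the open span when the entity ends, changes type, or a B- tag restarts it.
        if start is not None and (current != label or prefix == "B"):
            spans.add((start, i, label))
            start, label = None, None
        if current is not None and start is None:
            start, label = i, current
    return spans


def strict_micro_prf(gold_sentences, pred_sentences):
    """Micro-averaged precision, recall and F1 under exact boundary and type match."""
    true_pos = n_pred = n_gold = 0
    for gold, pred in zip(gold_sentences, pred_sentences):
        gold_spans, pred_spans = extract_spans(gold), extract_spans(pred)
        true_pos += len(gold_spans & pred_spans)
        n_gold += len(gold_spans)
        n_pred += len(pred_spans)
    precision = true_pos / n_pred if n_pred else 0.0
    recall = true_pos / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold = [["I-PER", "O", "I-LOC", "I-LOC"]]  # IOB1 gold labels
pred = [["B-PER", "O", "B-LOC", "O"]]      # IOB2 prediction with a truncated LOC span
print(strict_micro_prf(gold, pred))
```

In this toy example the truncated `LOC` span counts as an error under the strict scenario, whereas the partial scenario of nervaluate would give it partial credit.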
+ ---
+
+ ## Results
+
+ ### Development Set (eng.testa): 3,466 sentences, 51,578 tokens
+
+ #### seqeval
+
+ | Entity | Precision | Recall | F1 |
+ |--------|-----------|--------|----|
+ | LOC | 0.98 | 0.98 | 0.98 |
+ | MISC | 0.92 | 0.93 | 0.93 |
+ | ORG | 0.95 | 0.96 | 0.96 |
+ | PER | 0.98 | 0.99 | 0.99 |
+ | **micro avg** | **0.9660** | **0.9704** | **0.9682** |
+
+ #### nervaluate (aggregated)
+
+ | Scenario | Precision | Recall | F1 |
+ |----------|-----------|--------|----|
+ | strict | 0.9660 | 0.9704 | 0.9682 |
+ | exact | 0.9782 | 0.9827 | 0.9804 |
+ | partial | 0.9834 | 0.9879 | 0.9856 |
+ | ent_type | 0.9744 | 0.9788 | 0.9766 |
+
+ ---
+
+ ### Test Set (eng.testb): 3,684 sentences, 46,666 tokens
+
+ #### seqeval
+
+ | Entity | Precision | Recall | F1 |
+ |--------|-----------|--------|----|
+ | LOC | 0.95 | 0.94 | 0.95 |
+ | MISC | 0.82 | 0.84 | 0.83 |
+ | ORG | 0.92 | 0.95 | 0.93 |
+ | PER | 0.98 | 0.97 | 0.98 |
+ | **micro avg** | **0.9350** | **0.9396** | **0.9373** |
+
+ #### nervaluate (aggregated)
+
+ | Scenario | Precision | Recall | F1 |
+ |----------|-----------|--------|----|
+ | strict | 0.9350 | 0.9396 | 0.9373 |
+ | exact | 0.9639 | 0.9687 | 0.9663 |
+ | partial | 0.9716 | 0.9765 | 0.9740 |
+ | ent_type | 0.9436 | 0.9483 | 0.9460 |
+
+ The strict, micro-averaged test set F1 of **93.73** exactly matches the result reported by Zhan et al. for this model configuration.
+
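As a quick consistency check, the micro-averaged F1 values can be recomputed as the harmonic mean of the reported precision and recall. The sketch below uses only the rounded numbers from the tables above, so agreement is expected only up to rounding in the fourth decimal place.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) for the strict / micro-averaged rows.
reported = {
    "dev (eng.testa)": (0.9660, 0.9704, 0.9682),
    "test (eng.testb)": (0.9350, 0.9396, 0.9373),
}

for split, (precision, recall, f1) in reported.items():
    recomputed = f1_score(precision, recall)
    print(f"{split}: reported F1 = {f1:.4f}, recomputed F1 = {recomputed:.4f}")
```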
+ ---