sted97 committed on
Commit
73d4158
1 Parent(s): 934471a

Update README.md

Files changed (1)
  1. README.md +32 -36
README.md CHANGED
@@ -4,12 +4,13 @@ annotations_creators:
 language_creators:
 - machine-generated
 widget:
-- text: "My name is Wolfgang and I live in Berlin."
-- text: "George Washington went to Washington."
-- text: "Mi nombre es Sarah y vivo en Londres."
-- text: "Меня зовут Симона, и я живу в Риме."
+- text: My name is Wolfgang and I live in Berlin.
+- text: George Washington went to Washington.
+- text: Mi nombre es Sarah y vivo en Londres.
+- text: Меня зовут Симона, и я живу в Риме.
 tags:
 - named-entity-recognition
+- sequence-tagger-model
 datasets:
 - Babelscape/wikineural
 language:
@@ -32,18 +33,35 @@ task_categories:
 - structure-prediction
 task_ids:
 - named-entity-recognition
-
 ---

-## Model Description
+# WikiNEuRal: Combined Neural and Knowledge-based Silver Data Creation for Multilingual NER
+This is the model card for the EMNLP 2021 paper [WikiNEuRal: Combined Neural and Knowledge-based Silver Data Creation for Multilingual NER](https://aclanthology.org/2021.findings-emnlp.215/). In a nutshell, WikiNEuRal is a novel technique that combines a multilingual lexical knowledge base (i.e., BabelNet) with transformer-based architectures (i.e., BERT) to produce high-quality annotations for multilingual NER. We then fine-tuned a multilingual language model (mBERT) for 3 epochs on the resulting WikiNEuRal dataset. The system supports the 9 languages covered by WikiNEuRal (de, en, es, fr, it, nl, pl, pt, ru), and it was trained on all 9 languages jointly. **If you use the model, please cite this work in your paper**:

-- **Summary:** mBERT model fine-tuned for 3 epochs on the recently-introduced WikiNEuRal dataset for Multilingual NER. The system supports the 9 languages covered by WikiNEuRal (de, en, es, fr, it, nl, pl, pt, ru), and it was trained on all 9 languages jointly. For a stronger baseline system (mBERT + Bi-LSTM + CRF) look at the official repository.
-- **Official Repository:** [https://github.com/Babelscape/wikineural](https://github.com/Babelscape/wikineural)
-- **Paper:** [https://aclanthology.org/wikineural](https://aclanthology.org/2021.findings-emnlp.215/)
+```bibtex
+@inproceedings{tedeschi-etal-2021-wikineural-combined,
+    title = "{W}iki{NE}u{R}al: {C}ombined Neural and Knowledge-based Silver Data Creation for Multilingual {NER}",
+    author = "Tedeschi, Simone and
+      Maiorca, Valentino and
+      Campolungo, Niccol{\`o} and
+      Cecconi, Francesco and
+      Navigli, Roberto",
+    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2021",
+    month = nov,
+    year = "2021",
+    address = "Punta Cana, Dominican Republic",
+    publisher = "Association for Computational Linguistics",
+    url = "https://aclanthology.org/2021.findings-emnlp.215",
+    pages = "2521--2533",
+    abstract = "Multilingual Named Entity Recognition (NER) is a key intermediate task which is needed in many areas of NLP. In this paper, we address the well-known issue of data scarcity in NER, especially relevant when moving to a multilingual scenario, and go beyond current approaches to the creation of multilingual silver data for the task. We exploit the texts of Wikipedia and introduce a new methodology based on the effective combination of knowledge-based approaches and neural models, together with a novel domain adaptation technique, to produce high-quality training corpora for NER. We evaluate our datasets extensively on standard benchmarks for NER, yielding substantial improvements up to 6 span-based F1-score points over previous state-of-the-art systems for data creation.",
+}
+```
+
+The original repository for the paper can be found at [https://github.com/Babelscape/wikineural](https://github.com/Babelscape/wikineural).

-#### How to use
+## How to use

-You can use this model with Transformers *pipeline* for NER. **Please consider citing our work if you use this model.**
+You can use this model with the Transformers *pipeline* for NER.

 ```python
 from transformers import AutoTokenizer, AutoModelForTokenClassification
@@ -59,32 +77,10 @@ ner_results = nlp(example)
 print(ner_results)
 ```

-#### Limitations and bias
+## Limitations and bias

-This model is trained on WikiNEuRal, a state-of-the-art dataset for Multilingual NER automatically derived from Wikipedia. Therefore, it may not generalize well on all textual genres (e.g. news). On the other hand, models trained only on news articles (e.g. only on CoNLL03) have been proven to obtain much lower scores on encyclopedic articles. To obtain a more robust system, we encourage to train a system on the combination of WikiNEuRal + CoNLL.
+This model is trained on WikiNEuRal, a state-of-the-art dataset for Multilingual NER automatically derived from Wikipedia. Therefore, it might not generalize well to all textual genres (e.g. news). On the other hand, models trained only on news articles (e.g. only on CoNLL03) have been shown to obtain much lower scores on encyclopedic articles. To obtain more robust systems, we encourage you to train a system on the combination of WikiNEuRal with other datasets (e.g. WikiNEuRal + CoNLL).

 ## Licensing Information

-Contents of this repository are restricted to only non-commercial research purposes under the [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0)](https://creativecommons.org/licenses/by-nc-sa/4.0/). Copyright of the dataset contents and models belongs to the original copyright holders.
-
-## Citation Information
-
-```bibtex
-@inproceedings{tedeschi-etal-2021-wikineural-combined,
-    title = "{W}iki{NE}u{R}al: {C}ombined Neural and Knowledge-based Silver Data Creation for Multilingual {NER}",
-    author = "Tedeschi, Simone and
-      Maiorca, Valentino and
-      Campolungo, Niccol{\`o} and
-      Cecconi, Francesco and
-      Navigli, Roberto",
-    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2021",
-    month = nov,
-    year = "2021",
-    address = "Punta Cana, Dominican Republic",
-    publisher = "Association for Computational Linguistics",
-    url = "https://aclanthology.org/2021.findings-emnlp.215",
-    pages = "2521--2533",
-    abstract = "Multilingual Named Entity Recognition (NER) is a key intermediate task which is needed in many areas of NLP. In this paper, we address the well-known issue of data scarcity in NER, especially relevant when moving to a multilingual scenario, and go beyond current approaches to the creation of multilingual silver data for the task. We exploit the texts of Wikipedia and introduce a new methodology based on the effective combination of knowledge-based approaches and neural models, together with a novel domain adaptation technique, to produce high-quality training corpora for NER. We evaluate our datasets extensively on standard benchmarks for NER, yielding substantial improvements up to 6 span-based F1-score points over previous state-of-the-art systems for data creation.",
-}
-```
-
+Contents of this repository are restricted to only non-commercial research purposes under the [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0)](https://creativecommons.org/licenses/by-nc-sa/4.0/). Copyright of the dataset contents and models belongs to the original copyright holders.
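The "How to use" snippet in the updated card is split across the diff hunks above, so the model-loading lines are not visible here. Below is a minimal, self-contained sketch of the pipeline usage the card describes; the checkpoint id `Babelscape/wikineural-multilingual-ner` is an assumption, since the diff itself never names the model repository.

```python
# Minimal NER pipeline sketch based on the card's "How to use" section.
# The checkpoint id below is an assumption (the diff does not show it);
# substitute the id of this model repository.
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

model_id = "Babelscape/wikineural-multilingual-ner"  # assumed repository id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(model_id)

# Group sub-word predictions into whole entity spans.
nlp = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

example = "My name is Wolfgang and I live in Berlin."
print(nlp(example))
```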
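The updated Limitations section recommends training on WikiNEuRal combined with other corpora such as CoNLL. The following is a rough sketch of that combination using the `datasets` library; the WikiNEuRal split name and the label alignment are assumptions to verify against the dataset card.

```python
# Sketch only: build a combined WikiNEuRal + CoNLL03 training set with the
# Hugging Face `datasets` library, as suggested in the Limitations section.
# The WikiNEuRal split name ("train_en") is an assumption; check the dataset
# card, and make sure both corpora use the same ner_tags label mapping.
from datasets import load_dataset, concatenate_datasets

wikineural_en = load_dataset("Babelscape/wikineural", split="train_en")  # assumed split name
conll = load_dataset("conll2003", split="train")

# Keep only the columns a token-classification trainer needs.
keep = ["tokens", "ner_tags"]
wikineural_en = wikineural_en.remove_columns([c for c in wikineural_en.column_names if c not in keep])
conll = conll.remove_columns([c for c in conll.column_names if c not in keep])

# Align feature types (ClassLabel names/order) before mixing the two corpora.
wikineural_en = wikineural_en.cast(conll.features)
combined_train = concatenate_datasets([wikineural_en, conll]).shuffle(seed=42)
print(combined_train)
```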