novakat committed on
Commit
b6f96af
2 Parent(s): 995e40c 3195d9d

Merge branch 'main' of https://huggingface.co/novakat/nerkor-cars-onpp-hubert into main

  1. README.md +58 -1
README.md CHANGED
@@ -23,4 +23,61 @@ inference:
23
 
24
  ## Limitations
25
 
26
- - max_seq_length = 448
23
 
24
  ## Limitations
25
 
26
+ - max_seq_length = 448
27
+
28
+ ## Training data
29
+
30
+ The underlying corpus, [NerKor+CARS-ONPP](https://github.com/novakat/NYTK-NerKor-Cars-OntoNotesPP), was derived from [NYTK-NerKor](https://github.com/nytud/NYTK-NerKor), a Hungarian gold standard named entity annotated corpus containing about 1 million tokens.
31
+ It also includes 12k tokens of additional text (individual sentences) about motor vehicles (cars, buses, motorcycles) from the news archive of [hvg.hu](https://hvg.hu).
32
+ While the annotation in NYTK-NerKor followed the CoNLL-2002 labelling standard with just four NE categories (`PER`, `LOC`, `MISC`, `ORG`), this version of the corpus features over 30 entity types, including all entity types used in the OntoNotes 5.0 English NER annotation.
33
+ The new annotation elaborates on subtypes of the `LOC` and `MISC` entity types, and includes annotation for non-names such as times and dates, quantities, languages, and nationalities or religious or political groups. It also adds further entity subtypes not present in the OntoNotes 5.0 annotation (see below).
34
+
35
+ ## Tags derived from the OntoNotes 5.0 annotation
36
+
37
+ Names are annotated according to the following set of types:
38
+
39
+ | | |
40
+ |---|---------|
41
+ |`PER` | = PERSON People, including fictional |
42
+ |`FAC` | = FACILITY Buildings, airports, highways, bridges, etc. |
43
+ |`ORG` | = ORGANIZATION Companies, agencies, institutions, etc. |
44
+ |`GPE` | Geopolitical entities: countries, cities, states |
45
+ |`LOC` | = LOCATION Non-GPE locations, mountain ranges, bodies of water |
46
+ |`PROD` | = PRODUCT Vehicles, weapons, foods, etc. (Not services) |
47
+ |`EVENT` | Named hurricanes, battles, wars, sports events, etc. |
48
+ |`WORK_OF_ART` | Titles of books, songs, etc. |
49
+ |`LAW` | Named documents made into laws |
50
+
51
+ The following are also annotated in a style similar to names:
52
+
53
+ | | |
54
+ |---|---------|
55
+ | `NORP` | Nationalities or religious or political groups |
56
+ | `LANGUAGE` | Any named language |
57
+ | `DATE` | Absolute or relative dates or periods |
58
+ | `TIME` | Times smaller than a day |
59
+ | `PERCENT` | Percentage (including "%") |
60
+ | `MONEY` | Monetary values, including unit |
61
+ | `QUANTITY` | Measurements, as of weight or distance |
62
+ | `ORDINAL` | "first", "second" |
63
+ | `CARDINAL` | Numerals that do not fall under another type |
64
+
65
+ ## Additional tags (not in OntoNotes 5.0)
66
+ Further subtypes of names of type `MISC`:
67
+
68
+ | | |
69
+ |-|-|
70
+ |`AWARD`| Awards and prizes |
71
+ | `CAR` | Cars and trucks |
72
+ |`MEDIA`| Media outlets, TV channels, news portals|
73
+ |`SMEDIA`| Social media platforms|
74
+ |`PROJ`| Projects and initiatives |
75
+ |`MISC`| Unresolved subtypes of MISC entities |
76
+ |`MISC-ORG`| Organization-like unresolved subtypes of MISC entities |
77
+
78
+ Further non-name entities:
79
+
80
+ | | |
81
+ |-|-|
82
+ |`DUR`| Time duration |
83
+ |`ID`| Identifier |
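
The `max_seq_length = 448` limitation listed in the README means that longer inputs have to be split before they are passed to the model. The following is a minimal sketch of one way to do that, not the author's method. It assumes the limit is counted in tokenizer tokens and that two positions are reserved for special tokens. The helper name `split_into_windows` and the `reserved` parameter are hypothetical.

```python
from typing import List

MAX_SEQ_LENGTH = 448  # value taken from the Limitations section of the README


def split_into_windows(
    tokens: List[str],
    max_len: int = MAX_SEQ_LENGTH,
    reserved: int = 2,  # assumption: room for two special tokens per window
) -> List[List[str]]:
    """Split a token list into consecutive windows that fit the length limit."""
    budget = max_len - reserved
    if budget <= 0:
        raise ValueError("max_len must be larger than reserved")
    # Step through the input in strides of `budget` tokens.
    return [tokens[i:i + budget] for i in range(0, len(tokens), budget)]


if __name__ == "__main__":
    windows = split_into_windows(["tok"] * 1000)
    print([len(w) for w in windows])
```

A fixed-size split like this can cut an entity in half at a window boundary, so splitting at sentence boundaries first and only then applying the length check is likely the safer choice in practice.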