novakat committed on
Commit
15f562c
1 Parent(s): 7524e49

Update README.md

Files changed (1)
  1. README.md +52 -1
README.md CHANGED
@@ -23,4 +23,55 @@ inference:
 
  ## Limitations
 
- - max_seq_length = 448
+ - max_seq_length = 448
+
+ The underlying corpus [NerKor+CARS-ONPP](https://github.com/novakat/NYTK-NerKor-Cars-OntoNotesPP) was derived from [NYTK-NerKor](https://github.com/nytud/NYTK-NerKor), a Hungarian gold-standard corpus annotated for named entities, containing about 1 million tokens.
+ It adds about 12k tokens of text (individual sentences) about motor vehicles (cars, buses, motorcycles) from the news archive of [hvg.hu](https://hvg.hu).
+ While the annotation in NYTK-NerKor followed the CoNLL-2002 labelling standard with just four NE categories (`PER`, `LOC`, `MISC`, `ORG`), this version of the corpus features over 30 entity types, including all entity types used in the OntoNotes 5.0 English NER annotation.
+ The new annotation elaborates on subtypes of the `LOC` and `MISC` entity types, and also covers non-names such as times and dates, quantities, languages, and nationalities or religious or political groups.
+
+ ## Tags derived from the OntoNotes 5.0 annotation
+
+ Names are annotated according to the following set of types:
+ | Tag | Description |
+ |---|---------|
+ | `PER` | PERSON: People, including fictional |
+ | `FAC` | FACILITY: Buildings, airports, highways, bridges, etc. |
+ | `ORG` | ORGANIZATION: Companies, agencies, institutions, etc. |
+ | `GPE` | Geopolitical entities: countries, cities, states |
+ | `LOC` | LOCATION: Non-GPE locations, mountain ranges, bodies of water |
+ | `PROD` | PRODUCT: Vehicles, weapons, foods, etc. (not services) |
+ | `EVENT` | Named hurricanes, battles, wars, sports events, etc. |
+ | `WORK_OF_ART` | Titles of books, songs, etc. |
+ | `LAW` | Named documents made into laws |
+
+ The following are also annotated in a style similar to names:
+ | Tag | Description |
+ |---|---------|
+ | `NORP` | Nationalities or religious or political groups |
+ | `LANGUAGE` | Any named language |
+ | `DATE` | Absolute or relative dates or periods |
+ | `TIME` | Times smaller than a day |
+ | `PERCENT` | Percentage (including "%") |
+ | `MONEY` | Monetary values, including unit |
+ | `QUANTITY` | Measurements, as of weight or distance |
+ | `ORDINAL` | "first", "second" |
+ | `CARDINAL` | Numerals that do not fall under another type |
+
+ ## Additional tags (not in OntoNotes 5.0)
+ Further subtypes of names of type `MISC`:
+ | Tag | Description |
+ |-|-|
+ | `AWARD` | Awards and prizes |
+ | `CAR` | Cars and trucks |
+ | `MEDIA` | Media outlets, TV channels, news portals |
+ | `SMEDIA` | Social media platforms |
+ | `PROJ` | Projects and initiatives |
+ | `MISC` | Unresolved subtypes of MISC entities |
+ | `MISC-ORG` | Organization-like unresolved subtypes of MISC entities |
+
+ Further non-name entities:
+ | Tag | Description |
+ |-|-|
+ | `DUR` | Time duration |
+ | `ID` | Identifier |
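
The tag tables added in this change can be transcribed into a small lookup, which may help when filtering or post-processing model output. The Python sketch below is illustrative only: the group names (`ontonotes_name`, `ontonotes_non_name`, `misc_subtype`, `extra_non_name`) and the helpers `tag_group` and `collapse_misc_subtype` are hypothetical and are not part of the corpus or the model. The README mentions over 30 entity types while the tables list 27, so any tag that is not listed is reported as `unknown` rather than guessed.

```python
# Tag inventory transcribed from the README tables in this commit.
# Group names and helper functions are illustrative assumptions.
ONTONOTES_NAME_TAGS = frozenset(
    {"PER", "FAC", "ORG", "GPE", "LOC", "PROD", "EVENT", "WORK_OF_ART", "LAW"}
)
ONTONOTES_NON_NAME_TAGS = frozenset(
    {"NORP", "LANGUAGE", "DATE", "TIME", "PERCENT", "MONEY",
     "QUANTITY", "ORDINAL", "CARDINAL"}
)
MISC_SUBTYPE_TAGS = frozenset(
    {"AWARD", "CAR", "MEDIA", "SMEDIA", "PROJ", "MISC", "MISC-ORG"}
)
EXTRA_NON_NAME_TAGS = frozenset({"DUR", "ID"})

TAG_GROUPS = {
    "ontonotes_name": ONTONOTES_NAME_TAGS,
    "ontonotes_non_name": ONTONOTES_NON_NAME_TAGS,
    "misc_subtype": MISC_SUBTYPE_TAGS,
    "extra_non_name": EXTRA_NON_NAME_TAGS,
}


def tag_group(tag: str) -> str:
    """Return the table a tag is listed in, or 'unknown' if it is not listed."""
    for group, tags in TAG_GROUPS.items():
        if tag in tags:
            return group
    return "unknown"


def collapse_misc_subtype(tag: str) -> str:
    """Map the additional MISC subtypes back to the coarse MISC label.

    Tags outside the MISC subtype table are returned unchanged.
    """
    return "MISC" if tag in MISC_SUBTYPE_TAGS else tag


if __name__ == "__main__":
    for tag in ("CAR", "GPE", "DUR", "XYZ"):
        print(tag, tag_group(tag), collapse_misc_subtype(tag))
```

Only the `MISC` subtypes are collapsed here because the README states that grouping explicitly; a similar mapping for `LOC` subtypes would need the corpus documentation to confirm which tags belong to it.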