added model usage and preprocessing details
README.md
CHANGED
@@ -13,14 +13,19 @@ tags:
This work is a joint effort of the Slovak National Competence Center for High-Performance Computing and Nettle, s.r.o., a Slovakia-based start-up focusing on natural language processing, chatbots and voicebots.

## Model usage
The model recognizes the following entities: STREET, HOUSENUMBER, MUNICIPALITY, POSTALNUMBER.
It uses the BIO annotation scheme, so together with the O label it has 9 labels in total.
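
The label count follows from the BIO scheme: each of the 4 entity types gets a B- (beginning) and an I- (inside) label, plus the single O label, giving 4 × 2 + 1 = 9. A minimal sketch of that arithmetic is shown below. The label strings (e.g. `B-STREET`) and their order are assumptions made for illustration; the model's own configuration may name or order them differently.

```python
# Entity types named in the README.
ENTITY_TYPES = ["STREET", "HOUSENUMBER", "MUNICIPALITY", "POSTALNUMBER"]


def build_bio_labels(entity_types):
    """Return the O label followed by a B-/I- pair for each entity type.

    The "B-<TYPE>" / "I-<TYPE>" spelling is an assumed convention,
    not taken from the model's configuration.
    """
    labels = ["O"]
    for entity in entity_types:
        labels.append(f"B-{entity}")
        labels.append(f"I-{entity}")
    return labels


LABELS = build_bio_labels(ENTITY_TYPES)
print(len(LABELS))  # 4 entity types * 2 prefixes + 1 "O" label
```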

It is intended to be used only on Slovak addresses.

The primary use is to annotate input from speech-to-text transcriptions, so it handles natural speech hesitations (e.g. "Ďalej", "no", "uh").

Both the house number and the cadastral registration number are labelled as HOUSENUMBER.
Names of parts of a municipality are also labelled as MUNICIPALITY.
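
To turn token-level BIO tags into address fields, consecutive B-/I- tags of the same type can be merged into one entity while O tokens (such as hesitations) are skipped. The sketch below is one possible way to do this. The function name `group_bio_entities` is hypothetical, the tags in the example are written by hand rather than produced by the model, and word-level tokens are assumed.

```python
def group_bio_entities(tokens, tags):
    """Merge token-level BIO tags into (entity_type, text) pairs."""
    entities = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        prefix, _, entity_type = tag.partition("-")
        if prefix == "B" or (prefix == "I" and entity_type != current_type):
            # Start a new entity; an I- tag without a matching B- is tolerated.
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = entity_type, [token]
        elif prefix == "I":
            # Continue the entity that is currently open.
            current_tokens.append(token)
        else:
            # "O" token (e.g. a hesitation): close any open entity.
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        entities.append((current_type, " ".join(current_tokens)))
    return entities


# Hand-written tags for illustration only, not actual model output.
tokens = ["no", "Mätová", "ulica", "97/25C", "84401"]
tags = ["O", "B-STREET", "I-STREET", "B-HOUSENUMBER", "B-POSTALNUMBER"]
entities = group_bio_entities(tokens, tags)
print(entities)
```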

### Preprocessing and input format
The input is preprocessed so that it does not contain any commas.
The input can be lower case or upper case, and may even contain errors where a proper noun starts with a lowercase letter.
The house number can have two parts separated by a slash and can end with a letter from A to F (e.g. "Mätová ulica 97/25C").
The postal number always consists of 5 digits, but can be split into two parts of 3 and 2 digits respectively (e.g. "923 12", "84401").
The street can contain shortened parts (e.g. "Ulica J. Matúšku").
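
The format rules above can be sketched as a small preprocessing and validation helper. This is an illustration under stated assumptions, not the authors' actual preprocessing code: the function names are hypothetical, commas are assumed to be replaced by spaces rather than simply deleted, house number parts are assumed to be plain digits, and the trailing letter is matched case-insensitively because the input may be lower case.

```python
import re

# One or two digit groups separated by a slash, optional trailing letter A-F.
# Case-insensitive matching is an assumption (the input may be lower case).
HOUSE_NUMBER_RE = re.compile(r"[0-9]+(?:/[0-9]+)?[A-F]?", re.IGNORECASE)

# Five digits, optionally split 3 + 2 by a single space.
POSTAL_NUMBER_RE = re.compile(r"[0-9]{3} ?[0-9]{2}")


def remove_commas(text):
    """Replace commas with spaces and collapse repeated whitespace."""
    return " ".join(text.replace(",", " ").split())


def is_house_number(value):
    """Check a candidate string against the assumed house number pattern."""
    return HOUSE_NUMBER_RE.fullmatch(value) is not None


def is_postal_number(value):
    """Check a candidate string against the 5-digit postal number pattern."""
    return POSTAL_NUMBER_RE.fullmatch(value) is not None


cleaned = remove_commas("Mätová ulica 97/25C, 84401")
print(cleaned)
print(is_house_number("97/25C"), is_postal_number("923 12"))
```

Such checks could also serve as a sanity filter on the entities the model returns, for example to flag a POSTALNUMBER prediction that does not contain exactly 5 digits.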