mspisiak committed on
Commit
a4ea365
1 Parent(s): 9231597

added model usage and preprocessing details

  1. README.md +6 -1
README.md CHANGED
@@ -13,14 +13,19 @@ tags:
  This work is a joint effort of the Slovak National Competence Center for High-Performance Computing and Nettle, s.r.o., a Slovak-based start-up focusing on natural language processing, chatbots and voicebots.

  ## Model usage
- The model recognizes the following entities: STREET, HOUSENUMBER, MUNICIPALITY, POSTALNUMBER. It uses the BIO annotation scheme and therefore, together with the O label, has 9 labels in total.
+ The model recognizes the following entities: STREET, HOUSENUMBER, MUNICIPALITY, POSTALNUMBER.
+ It uses the BIO annotation scheme and therefore, together with the O label, has 9 labels in total.

  It is intended to be used only on SLOVAK addresses.

  The primary use is to annotate input from speech-to-text transcriptions; it therefore handles natural speech hesitations (e.g. "Ďalej", "no", "uh").

+ Both the house number and the cadastral registration number are labelled as HOUSENUMBER.
+ Names of parts of a municipality are also labelled as MUNICIPALITY.
+
  ### Preprocessing and input format
  The input is preprocessed so that it doesn't contain any commas!
+ The input can be in lowercase or uppercase, and may even contain capitalization errors such as a proper noun starting with a lowercase letter.
  The house number can have two parts separated by a slash and it can contain a letter from A to F at the end (e.g. "Mätová ulica 97/25C").
  The postal number is always composed of 5 digits, but can be split into two parts of 3 and 2 digits, respectively (e.g. "923 12", "84401").
  The street can contain shortened parts (e.g. "Ulica J. Matúšku").
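
The label set and the input conventions described in this README can be sketched in Python as follows. This is only an illustrative sketch, not the authors' preprocessing code: the helper names (`strip_commas`, `looks_like_house_number`, `looks_like_postal_number`), the `B-`/`I-` label spelling and ordering, and the choice to replace commas with spaces are all assumptions. The README only states that the input contains no commas and that there are 9 labels.

```python
import re

# The four entity types named in the README.
ENTITIES = ["STREET", "HOUSENUMBER", "MUNICIPALITY", "POSTALNUMBER"]

# BIO scheme: a B- and an I- tag per entity, plus the single O tag.
# Assumption: the exact label strings and their order are illustrative;
# the model's own configuration is the authority on the real names.
LABELS = ["O"] + [f"{prefix}-{entity}" for entity in ENTITIES for prefix in ("B", "I")]

# House number: digits, an optional "/digits" second part, and an optional
# trailing letter A-F. Lowercase a-f is accepted as an assumption, because
# the README says the input may be lowercase.
HOUSE_NUMBER_RE = re.compile(r"\d+(?:/\d+)?[A-Fa-f]?")

# Postal number: 5 digits, optionally written as 3 digits, a space, 2 digits.
POSTAL_NUMBER_RE = re.compile(r"\d{3} ?\d{2}")


def strip_commas(text: str) -> str:
    """Replace commas with spaces, then collapse runs of whitespace."""
    return " ".join(text.replace(",", " ").split())


def looks_like_house_number(token: str) -> bool:
    """Check a token against the house number format described in the README."""
    return HOUSE_NUMBER_RE.fullmatch(token) is not None


def looks_like_postal_number(text: str) -> bool:
    """Check a string against the postal number format described in the README."""
    return POSTAL_NUMBER_RE.fullmatch(text) is not None
```

Under these assumptions, `strip_commas` would be applied to a transcription before it is passed to the model, while the two format checks are only a convenient way to sanity-check examples against the formats described in the README; the model itself does the actual labelling.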