---
license: mit
language:
- sk
pipeline_tag: token-classification
tags:
- SlovakBERT
---

# SlovakBERT address NER

A [SlovakBERT](https://huggingface.co/gerulata/slovakbert)-based model for named entity recognition of Slovak addresses.

This work is a joint effort of the Slovak National Competence Center for High-Performance Computing and Nettle, s.r.o., a Slovak-based start-up focusing on natural language processing, chatbots and voicebots.

## Model usage

The model recognizes the following entities: STREET, HOUSENUMBER, MUNICIPALITY, POSTALNUMBER. It uses the BIO annotation scheme, so each entity has a B- and an I- label, and together with the O label there are 9 labels in total.

It is intended to be used only on SLOVAK addresses. The primary use is to annotate input from speech-to-text transcriptions, so the model handles natural speech hesitations (e.g. "Ďalej", "no", "uh").

Both the house number and the cadastral registration number are labelled as HOUSENUMBER. Names of municipality parts are also labelled as MUNICIPALITY.

### Preprocessing and input format

- The input is expected to be preprocessed so that it contains no commas.
- The input may be lowercase or uppercase, and may even contain casing errors such as a proper noun starting with a lowercase letter.
- The house number can have two parts separated by a slash, and it can end with a letter from A to F (e.g. "Mätová ulica 97/25C").
- The postal number is always composed of 5 digits, but it can be split into two parts of 3 and 2 digits respectively (e.g. "923 12", "84401").
- The street can contain shortened parts (e.g. "Ulica J. Matúšku").
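The comma removal described above can be sketched as a small helper. This is only an illustration, not the preprocessing used by the authors: the function name `preprocess_address` is hypothetical, and replacing each comma with a space (rather than deleting it) is an assumption made so that neighbouring tokens do not merge. The label strings in `LABELS` are likewise an assumed naming for the 9 BIO labels; the names stored in the model's configuration may differ.

```python
import re

def preprocess_address(text: str) -> str:
    """Hypothetical helper: drop commas and collapse repeated whitespace."""
    # Assumption: a comma becomes a space so "25C,Pezinok" does not fuse into one token.
    without_commas = text.replace(",", " ")
    return re.sub(r"\s+", " ", without_commas).strip()

# The four entity types listed in this model card.
ENTITIES = ["STREET", "HOUSENUMBER", "MUNICIPALITY", "POSTALNUMBER"]

# Assumed label names: one B- and one I- label per entity, plus "O" (4 * 2 + 1 = 9).
LABELS = ["O"] + [f"{prefix}-{entity}" for entity in ENTITIES for prefix in ("B", "I")]

if __name__ == "__main__":
    print(preprocess_address("Mätová ulica 97/25C, 923 12, Pezinok"))
    print(len(LABELS))
```

The cleaned string can then be passed to the model in place of the raw transcription.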
### Code example

```
from transformers import pipeline

# Load a token-classification (NER) pipeline with the address model.
ner_pipeline = pipeline(task='ner', model='nettle-ai/slovakbert-address-ner')

# Example input with speech hesitations ("uhm", "no") and no commas.
input_sentence = "Žiškova uhm 21 85510 no Pezinok"
classifications = ner_pipeline(input_sentence)
```

## Acknowledgement

The research results were obtained with the support of the Slovak National Competence Centre for HPC, the EuroCC 2 project and the Slovak National Supercomputing Centre under grant agreement 101101903-EuroCC 2-DIGITAL-EUROHPC-JU-2022-NCC-01.

## Framework Versions

- Transformers 4.26.0
- PyTorch 1.13.1
- Tokenizers 0.13.2