# Model description
- Morphosyntactic analyzer: Stanza
- Tagset: NKJP
- Embedding vectors: Fasttext (wiki)
- Dataset: NLPrePL-NKJP-fair-by-name (https://huggingface.co/datasets/ipipan/nlprepl)
# How to use
## Clone
```
git clone git@hf.co:ipipan/nlpre_stanza_nkjp_fasttext_nkjp-by-name
```
## Load model
```
import os

import stanza

lang = 'pl'
model_name = 'nlpre_stanza_nkjp_fasttext_nkjp-by-name'
prefix = 'nkjpbyname_nkjp'

config = {
    # Comma-separated list of processors to use
    'processors': 'tokenize,mwt,pos,lemma',
    # Language code for the language to build the Pipeline in
    'lang': lang,
    # Processor-specific arguments are set with keys "{processor_name}_{argument_name}".
    # Model paths are only needed for models stored outside of stanza_resources.
    'tokenize_model_path': os.path.join(model_name, f'{lang}_{prefix}_tokenizer.pt'),
    'mwt_model_path': os.path.join(model_name, f'{lang}_{prefix}_mwt_expander.pt'),
    'pos_model_path': os.path.join(model_name, f'{lang}_{prefix}_tagger.pt'),
    'pos_pretrain_path': os.path.join(model_name, f'{lang}_{prefix}.pretrain.pt'),
    'lemma_model_path': os.path.join(model_name, f'{lang}_{prefix}_lemmatizer.pt'),
    # Use pretokenized text as input and disable tokenization
    'tokenize_pretokenized': True,
}

model = stanza.Pipeline(**config)
```
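Because the pipeline is built with `tokenize_pretokenized': True`, it expects input that is already segmented: either text with one sentence per line and tokens separated by spaces, or a list of token lists. A minimal sketch of preparing such input (the Polish sentences and the `to_pretokenized` helper are illustrative, not part of the model):

```python
def to_pretokenized(sentences):
    """Turn whitespace-tokenized sentences into the nested-list format
    (one inner list of token strings per sentence) that Stanza accepts
    when tokenize_pretokenized is enabled."""
    return [sentence.split() for sentence in sentences]

doc_input = to_pretokenized([
    "Ala ma kota .",
    "Kot ma Alę .",
])
# doc_input == [['Ala', 'ma', 'kota', '.'], ['Kot', 'ma', 'Alę', '.']]

# With the model files in place, the pipeline can then be run as:
# doc = model(doc_input)
# for sentence in doc.sentences:
#     for word in sentence.words:
#         print(word.text, word.lemma, word.upos)
```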