alvarobartt committed
Commit 929d13f
1 Parent(s): 39e53a6

Update README.md

Files changed (1): README.md (+2, -8)
README.md CHANGED
@@ -43,14 +43,14 @@ model-index:
 
  # SpanMarker with PlanTL-GOB-ES/roberta-base-bne on conll2002
 
- This is a [SpanMarker](https://github.com/tomaarsen/SpanMarkerNER) model trained on the [conll2002](https://huggingface.co/datasets/conll2002) dataset that can be used for Named Entity Recognition. This SpanMarker model uses [PlanTL-GOB-ES/roberta-base-bne](https://huggingface.co/models/PlanTL-GOB-ES/roberta-base-bne) as the underlying encoder.
+ This is a [SpanMarker](https://github.com/tomaarsen/SpanMarkerNER) model trained on the [conll2002](https://huggingface.co/datasets/conll2002) dataset that can be used for Named Entity Recognition. This SpanMarker model uses [PlanTL-GOB-ES/roberta-base-bne](https://huggingface.co/PlanTL-GOB-ES/roberta-base-bne) as the underlying encoder.
 
  ## Model Details
 
  ### Model Description
 
  - **Model Type:** SpanMarker
- - **Encoder:** [PlanTL-GOB-ES/roberta-base-bne](https://huggingface.co/models/PlanTL-GOB-ES/roberta-base-bne)
+ - **Encoder:** [PlanTL-GOB-ES/roberta-base-bne](https://huggingface.co/PlanTL-GOB-ES/roberta-base-bne)
  - **Maximum Sequence Length:** 256 tokens
  - **Maximum Entity Length:** 8 words
  - **Training Dataset:** [conll2002](https://huggingface.co/datasets/conll2002)
@@ -90,12 +90,6 @@ entities = model.predict("George Washington estuvo en Washington.")
  *List how the model may foreseeably be misused and address what users ought not to do with the model.*
  -->
 
- ### ⚠️ Tokenizer Warning
-
- The [PlanTL-GOB-ES/roberta-base-bne](https://huggingface.co/models/PlanTL-GOB-ES/roberta-base-bne) tokenizer distinguishes between punctuation directly attached to a word and punctuation separated from a word by a space. For example, `Paris.` and `Paris .` are tokenized into different tokens. During training, this model is only exposed to the latter style, i.e. all words are separated by a space. Consequently, the model may perform worse when the inference text is in the former style.
-
- In short, it is recommended to preprocess your inference text such that all words and punctuation are separated by a space. One approach is to use the [spaCy integration](https://tomaarsen.github.io/SpanMarkerNER/notebooks/spacy_integration.html) which automatically separates all words and punctuation. Alternatively, some potential approaches to convert regular text into this format are NLTK [`word_tokenize`](https://www.nltk.org/api/nltk.tokenize.word_tokenize.html) or spaCy [`Doc`](https://spacy.io/api/doc#iter) and joining the resulting words with a space.
-
  <!--
  ## Bias, Risks and Limitations
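
The hunk header of the second change quotes the model card's usage snippet (`entities = model.predict(...)`). For context, a minimal usage sketch with the `span_marker` library is shown below; the repository ID is a placeholder, since the actual Hub ID of this model is not visible on this page.

```python
# Minimal sketch of using a SpanMarker NER model (hypothetical repo ID).
from span_marker import SpanMarkerModel

# Load the trained SpanMarker model from the Hugging Face Hub.
model = SpanMarkerModel.from_pretrained("your-username/span-marker-roberta-base-bne-conll2002")

# Predict named entities in a Spanish sentence, as in the model card example.
entities = model.predict("George Washington estuvo en Washington.")
print(entities)  # list of detected entity spans with their labels and scores
```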
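
The removed "⚠️ Tokenizer Warning" section recommends separating all words and punctuation with spaces before inference, for example via NLTK's `word_tokenize`. A small sketch of that preprocessing step, assuming NLTK is installed and its `punkt` tokenizer data is available:

```python
# Sketch of the preprocessing recommended by the removed tokenizer warning:
# split words and punctuation apart, then re-join them with single spaces.
import nltk
from nltk.tokenize import word_tokenize

nltk.download("punkt")  # newer NLTK releases may need "punkt_tab" instead

text = "George Washington estuvo en Washington."
preprocessed = " ".join(word_tokenize(text, language="spanish"))
print(preprocessed)  # "George Washington estuvo en Washington ."
```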