tomaarsen/span-marker-roberta-large-fewnerd-fine-super

Token Classification

named-entity-recognition

generated_from_span_marker_trainer


tomaarsen (HF staff) committed on Sep 26, 2023

Commit 9ac9a3e • 1 Parent(s): cb20c77

Add examples, remove warning

Files changed (1)

README.md +6 -7


@@ -20,9 +20,13 @@ widget:
 - text: Amelia Earhart flew her single engine Lockheed Vega 5B across the Atlantic
     to Paris.
   example_title: Amelia Earhart
-- text: Leonardo di ser Piero da Vinci painted the Mona Lisa based on Italian noblewoman
+- text: Leonardo da Vinci painted the Mona Lisa based on Italian noblewoman
     Lisa del Giocondo.
   example_title: Leonardo da Vinci
+- text: Most of the Steven Seagal movie ``Under Siege`` (co-starring Tommy Lee Jones)
+    was filmed aboard the Battleship USS Alabama, which is docked on Mobile Bay at
+    Battleship Memorial Park and open to the public.
+  example_title: Under Siege
 base_model: roberta-large
 model-index:
 - name: SpanMarker w. roberta-large on finegrained, supervised FewNERD by Tom Aarsen
@@ -150,7 +154,7 @@ from span_marker import SpanMarkerModel
 # Download from the 🤗 Hub
 model = SpanMarkerModel.from_pretrained("tomaarsen/span-marker-roberta-large-fewnerd-fine-super")
 # Run inference
-entities = model.predict("Most of the Steven Seagal movie `` Under Siege `` ( co-starring Tommy Lee Jones ) was filmed on the , which is docked on Mobile Bay at Battleship Memorial Park and open to the public .")
+entities = model.predict("Most of the Steven Seagal movie ``Under Siege`` (co-starring Tommy Lee Jones) was filmed aboard the Battleship USS Alabama, which is docked on Mobile Bay at Battleship Memorial Park and open to the public.")
 ```
 ### Downstream Use
@@ -178,11 +182,6 @@ trainer.save_model("tomaarsen/span-marker-roberta-large-fewnerd-fine-super-finet
 ```
 </details>
-### ⚠️ Tokenizer Warning
-The [roberta-large](https://huggingface.co/roberta-large) tokenizer distinguishes between punctuation directly attached to a word and punctuation separated from a word by a space. For example, `Paris.` and `Paris .` are tokenized into different tokens. During training, this model is only exposed to the latter style, i.e. all words are separated by a space. Consequently, the model may perform worse when the inference text is in the former style.
-In short, it is recommended to preprocess your inference text such that all words and punctuation are separated by a space. Some potential approaches to convert regular text into this format are NLTK [`word_tokenize`](https://www.nltk.org/api/nltk.tokenize.word_tokenize.html) or spaCy [`Doc`](https://spacy.io/api/doc#iter) and join the resulting words with a space.
 ## Training Details
 ### Training Set Metrics
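
Note on the removed warning: the deleted "Tokenizer Warning" section recommended converting inference text so that words and punctuation are separated by a space (`Paris.` becomes `Paris .`), using NLTK `word_tokenize` or a spaCy `Doc`. For readers who want to see what that format looks like, here is a minimal sketch. It assumes a plain regular expression is an acceptable stand-in for those tokenizers. The helper name `space_separate` is hypothetical and not part of SpanMarker, and the regex splits every punctuation character individually (so hyphenated words and double backticks are broken apart), which NLTK or spaCy may handle differently.

```python
import re

# Hypothetical helper, not part of SpanMarker: a rough regex stand-in for
# NLTK `word_tokenize` or iterating over a spaCy `Doc`.
# Matches either a run of word characters or one non-space punctuation character.
_TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def space_separate(text: str) -> str:
    """Return `text` with words and punctuation marks joined by single spaces."""
    return " ".join(_TOKEN_PATTERN.findall(text))


if __name__ == "__main__":
    sentence = (
        "Amelia Earhart flew her single engine Lockheed Vega 5B "
        "across the Atlantic to Paris."
    )
    # The trailing period is detached from "Paris" in the printed result.
    print(space_separate(sentence))
```

The result of `space_separate(...)` could then be passed to `model.predict(...)` in place of the raw string. Whether this step is still useful depends on the model version, since this commit drops the recommendation from the README.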