tomaarsen HF staff committed on
Commit
9ac9a3e
1 Parent(s): cb20c77

Add examples, remove warning

Files changed (1)
  1. README.md +6 -7
README.md CHANGED
@@ -20,9 +20,13 @@ widget:
  - text: Amelia Earhart flew her single engine Lockheed Vega 5B across the Atlantic
  to Paris.
  example_title: Amelia Earhart
- - text: Leonardo di ser Piero da Vinci painted the Mona Lisa based on Italian noblewoman
+ - text: Leonardo da Vinci painted the Mona Lisa based on Italian noblewoman
  Lisa del Giocondo.
  example_title: Leonardo da Vinci
+ - text: Most of the Steven Seagal movie ``Under Siege`` (co-starring Tommy Lee Jones)
+ was filmed aboard the Battleship USS Alabama, which is docked on Mobile Bay at
+ Battleship Memorial Park and open to the public.
+ example_title: Under Siege
  base_model: roberta-large
  model-index:
  - name: SpanMarker w. roberta-large on finegrained, supervised FewNERD by Tom Aarsen
@@ -150,7 +154,7 @@ from span_marker import SpanMarkerModel
  # Download from the 🤗 Hub
  model = SpanMarkerModel.from_pretrained("tomaarsen/span-marker-roberta-large-fewnerd-fine-super")
  # Run inference
- entities = model.predict("Most of the Steven Seagal movie `` Under Siege `` ( co-starring Tommy Lee Jones ) was filmed on the , which is docked on Mobile Bay at Battleship Memorial Park and open to the public .")
+ entities = model.predict("Most of the Steven Seagal movie ``Under Siege`` (co-starring Tommy Lee Jones) was filmed aboard the Battleship USS Alabama, which is docked on Mobile Bay at Battleship Memorial Park and open to the public.")
  ```

  ### Downstream Use
@@ -178,11 +182,6 @@ trainer.save_model("tomaarsen/span-marker-roberta-large-fewnerd-fine-super-finet
  ```
  </details>

- ### ⚠️ Tokenizer Warning
- The [roberta-large](https://huggingface.co/roberta-large) tokenizer distinguishes between punctuation directly attached to a word and punctuation separated from a word by a space. For example, `Paris.` and `Paris .` are tokenized into different tokens. During training, this model is only exposed to the latter style, i.e. all words are separated by a space. Consequently, the model may perform worse when the inference text is in the former style.
-
- In short, it is recommended to preprocess your inference text such that all words and punctuation are separated by a space. Some potential approaches to convert regular text into this format are NLTK [`word_tokenize`](https://www.nltk.org/api/nltk.tokenize.word_tokenize.html) or spaCy [`Doc`](https://spacy.io/api/doc#iter) and join the resulting words with a space.
-
  ## Training Details

  ### Training Set Metrics
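
For context on the removed Tokenizer Warning: it recommended preprocessing inference text so that all words and punctuation are separated by a space, and suggested NLTK `word_tokenize` or spaCy for that. The following is a minimal sketch of the idea, using a plain regular expression as a crude stand-in for those tokenizers. The `space_separate` helper and its pattern are hypothetical and not part of SpanMarker, and real tokenizers handle many more cases (contractions, abbreviations, URLs).

```python
import re

# Hypothetical pattern, an assumption for illustration only:
# keep `` and '' quote markers whole, keep hyphenated words together,
# and treat every other non-space, non-word character as its own token.
_TOKEN_PATTERN = re.compile(r"``|''|\w+(?:-\w+)*|[^\w\s]")


def space_separate(text: str) -> str:
    """Return text with words and punctuation separated by single spaces."""
    return " ".join(_TOKEN_PATTERN.findall(text))


if __name__ == "__main__":
    # `Paris.` (punctuation attached) becomes `Paris .` (punctuation separated)
    print(space_separate("Amelia Earhart flew across the Atlantic to Paris."))
```

The result of `space_separate(...)` could then be passed to `model.predict(...)` in place of the raw text. Whether that is still useful after this commit is not stated in the diff.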