	---
	license: apache-2.0
	library_name: span-marker
	tags:
	- span-marker
	- token-classification
	- pos
	- part-of-speech
	pipeline_tag: token-classification
	---

	# SpanMarker for Verb Identification

	This is a [SpanMarker](https://github.com/tomaarsen/SpanMarkerNER) model that can be used for identifying verbs in text.
	In particular, this SpanMarker model uses [xlm-roberta-large](https://huggingface.co/xlm-roberta-large) as the underlying encoder.
	See [span_marker_verbs_train.ipynb](span_marker_verbs_train.ipynb) for the training script used to create this model.

	Note that this model is an experiment on the feasibility of SpanMarker as a POS tagger. For general use I would recommend spaCy or NLTK instead, as they are more computationally efficient.

	## Usage

	To use this model for inference, first install the `span_marker` library:

	```bash
	pip install span_marker
	```

	You can then run inference with this model like so:

	```python
	from span_marker import SpanMarkerModel

	# Download from the 🤗 Hub
	model = SpanMarkerModel.from_pretrained("tomaarsen/span-marker-xlm-roberta-large-verbs")
	# Run inference
	entities = model.predict("Amelia Earhart flew her single engine Lockheed Vega 5B across the Atlantic to Paris.")
	```

	See the [SpanMarker](https://github.com/tomaarsen/SpanMarkerNER) repository for documentation and additional information on this library.
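
As a minimal sketch of how the predictions might be inspected: `model.predict` returns a list of dictionaries, one per predicted span. The helper below assumes each dictionary carries `span`, `label` and `score` keys. The `summarize` name is hypothetical, and the sample entry (including the `VERB` label and the score) is made up for illustration rather than copied from real model output.

```python
def summarize(entities):
    """Return one 'span (label, score)' string per predicted entity."""
    return [f"{e['span']} ({e['label']}, {e['score']:.2f})" for e in entities]


# Hypothetical stand-in for the output of model.predict(...).
# The label name and score are assumptions, not real model output.
example = [
    {"span": "flew", "label": "VERB", "score": 0.99,
     "char_start_index": 15, "char_end_index": 19},
]

for line in summarize(example):
    print(line)
```

With a loaded model, the same helper could be called as `summarize(model.predict(...))`.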

	### Performance

	This model achieves the following results on the evaluation set:
	- Loss: 0.0152
	- Overall Precision: 0.9845
	- Overall Recall: 0.9849
	- Overall F1: 0.9847
	- Overall Accuracy: 0.9962
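
As a quick sanity check, F1 is the harmonic mean of precision and recall, so the reported value can be reproduced from the two numbers above. The `f1_score` helper is a hypothetical name used only for this sketch.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values taken from the evaluation results listed above.
f1 = f1_score(0.9845, 0.9849)
print(round(f1, 4))  # 0.9847
```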

	## Training procedure

	### Training hyperparameters

	The following hyperparameters were used during training:
	- learning_rate: 1e-05
	- train_batch_size: 4
	- eval_batch_size: 4
	- seed: 42
	- gradient_accumulation_steps: 2
	- total_train_batch_size: 8
	- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
	- lr_scheduler_type: linear
	- lr_scheduler_warmup_ratio: 0.1
	- num_epochs: 3
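
To make the interplay of these values concrete, here is a rough sketch of how the effective batch size and a linear schedule with warmup can be computed. It assumes a single device and a schedule that ramps linearly from 0 to the peak learning rate and then decays linearly to 0. The function names and the `total_steps` value of 1000 are hypothetical, since the exact number of optimizer steps is not listed above.

```python
import math

LEARNING_RATE = 1e-05
TRAIN_BATCH_SIZE = 4
GRADIENT_ACCUMULATION_STEPS = 2
WARMUP_RATIO = 0.1


def effective_batch_size(per_device, accumulation, num_devices=1):
    """Samples contributing to one optimizer step (single device assumed)."""
    return per_device * accumulation * num_devices


def linear_lr(step, total_steps, base_lr=LEARNING_RATE, warmup_ratio=WARMUP_RATIO):
    """Learning rate at `step`: linear warmup, then linear decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, (total_steps - step) / remaining)


print(effective_batch_size(TRAIN_BATCH_SIZE, GRADIENT_ACCUMULATION_STEPS))  # 8

# total_steps=1000 is a made-up value used only for illustration.
for step in (0, 50, 100, 550, 1000):
    print(step, linear_lr(step, total_steps=1000))
```

The first result matches the `total_train_batch_size` of 8 listed above.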

	### Training results

	| Training Loss | Epoch | Step | Validation Loss | Overall Precision | Overall Recall | Overall F1 | Overall Accuracy |
	|:-------------:|:-----:|:----:|:---------------:|:-----------------:|:--------------:|:----------:|:----------------:|
	| 0.036         | 0.61  | 1000 | 0.0151          | 0.9911            | 0.9733         | 0.9821     | 0.9956           |
	| 0.0126        | 1.22  | 2000 | 0.0131          | 0.9856            | 0.9864         | 0.9860     | 0.9965           |
	| 0.0175        | 1.83  | 3000 | 0.0154          | 0.9735            | 0.9894         | 0.9814     | 0.9953           |
	| 0.0115        | 2.45  | 4000 | 0.0172          | 0.9821            | 0.9871         | 0.9845     | 0.9962           |


	### Limitations

	Warning: This model works best when punctuation is separated from the preceding word by a space, for example:
	```python
	# ✅
	model.predict("He plays J. Robert Oppenheimer , an American theoretical physicist .")
	# ❌
	model.predict("He plays J. Robert Oppenheimer, an American theoretical physicist.")

	# You can also supply a list of words directly: ✅
	model.predict(["He", "plays", "J.", "Robert", "Oppenheimer", ",", "an", "American", "theoretical", "physicist", "."])
	```
	Similar splitting may also help in some other languages, for example rewriting `"l'ocean Atlantique"` as `"l' ocean Atlantique"`.
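
If the input text is not already pre-tokenized, a small heuristic can produce a word list in the shape shown above. The following is only a sketch, not part of SpanMarker: `separate_punctuation` is a hypothetical helper that detaches trailing `, ; : ! ?` from any word and detaches a period only from the final word, so that initials such as `J.` stay intact. A proper tokenizer (e.g. from spaCy or NLTK) will handle many more cases.

```python
import re

# Trailing punctuation that is detached from any word (periods excluded).
_TRAILING = re.compile(r"^(.*?)([,;:!?]+)$")


def separate_punctuation(text: str) -> list[str]:
    """Split text on whitespace and detach trailing punctuation heuristically."""
    tokens = text.split()
    words = []
    for i, token in enumerate(tokens):
        is_last = i == len(tokens) - 1
        match = _TRAILING.match(token)
        if match and match.group(1):
            # e.g. "Oppenheimer," -> "Oppenheimer", ","
            words.extend([match.group(1), match.group(2)])
        elif is_last and len(token) > 1 and token.endswith("."):
            # Only the sentence-final period is detached.
            words.extend([token[:-1], "."])
        else:
            words.append(token)
    return words


words = separate_punctuation(
    "He plays J. Robert Oppenheimer, an American theoretical physicist."
)
print(words)
```

The resulting list can then be passed to `model.predict(words)`, as in the pre-tokenized example above.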

	### Framework versions

	- Transformers 4.30.2
	- PyTorch 2.0.1+cu118
	- Datasets 2.13.1
	- Tokenizers 0.13.3
	- SpanMarker 1.2.3