Metrics per language
Hello! This is looking great; your F1 seems higher than my version's 🎉
I computed some per-language metrics for you. Here's the script:
```python
from datasets import load_dataset
from transformers import TrainingArguments
from span_marker import SpanMarkerModel, Trainer


def main() -> None:
    model_name = "lxyuan/span-marker-bert-base-multilingual-cased-multinerd"
    model = SpanMarkerModel.from_pretrained(model_name).cuda()

    # Prepare the 🤗 transformers training arguments (only used for evaluation here)
    args = TrainingArguments(
        output_dir="results",
        per_device_eval_batch_size=32,
        bf16=True,
        dataloader_num_workers=2,
        report_to="none",
    )

    # Initialize the trainer with our model and the evaluation arguments
    trainer = Trainer(
        model=model,
        args=args,
    )

    dataset = "Babelscape/multinerd"
    languages = ["de", "en", "es", "fr", "it", "nl", "pl", "pt", "ru", "zh"]
    test_dataset = load_dataset(dataset, split="test")
    for lang in languages:
        # Select only the test samples for this language
        split_test_dataset = test_dataset.filter(lambda sample: sample["lang"] == lang)

        # Compute & save the metrics on this language's test set
        metrics = trainer.evaluate(split_test_dataset, metric_key_prefix=f"test_{lang}")
        trainer.save_metrics(f"test_{lang}", metrics)


if __name__ == "__main__":
    main()
```
This resulted in some files, including an `all_results.json` file. When formatted nicely, that becomes:
| Language | Precision | Recall | F1 |
|----------|-----------|--------|-------|
| all | 92.42 | 92.81 | 92.61 |
| de | 95.03 | 95.07 | 95.05 |
| en | 95.00 | 95.40 | 95.20 |
| es | 92.05 | 91.37 | 91.71 |
| fr | 92.37 | 91.41 | 91.89 |
| it | 91.45 | 93.15 | 92.29 |
| nl | 93.85 | 92.98 | 93.41 |
| pl | 93.13 | 92.66 | 92.89 |
| pt | 93.60 | 92.50 | 93.05 |
| ru | 93.25 | 93.32 | 93.29 |
| zh | 89.47 | 88.40 | 88.93 |
For reference, your model performs better in this test for all languages except French, Italian and Russian. Notably, your model is much better (~2 F1) on Chinese. Here's the markdown version so you can copy paste it into your README if you want:
| **Language** | **Precision** | **Recall** | **F1** |
|--------------|---------------|------------|------------|
| **all** | 92.42 | 92.81 | **92.61** |
| **de** | 95.03 | 95.07 | **95.05** |
| **en** | 95.00 | 95.40 | **95.20** |
| **es** | 92.05 | 91.37 | **91.71** |
| **fr** | 92.37 | 91.41 | **91.89** |
| **it** | 91.45 | 93.15 | **92.29** |
| **nl** | 93.85 | 92.98 | **93.41** |
| **pl** | 93.13 | 92.66 | **92.89** |
| **pt** | 93.60 | 92.50 | **93.05** |
| **ru** | 93.25 | 93.32 | **93.29** |
| **zh** | 89.47 | 88.40 | **88.93** |
Perhaps in a future iteration of SpanMarker, I can automatically generate F1 scores per entity class. I think it would be valuable to see which per-class performances make up the overall 92.61 F1. After all, the 92.61 alone tells you nothing about how well the model detects e.g. foods.
I'll add a link to this model at the bottom of my mBERT model card!
- Tom Aarsen
Unrelated, but I wonder if an uncased encoder (e.g. bert-base-multilingual-uncased) would perform better for entities that are often not capitalized, like foods.
Hi Tom,
Thanks for the evaluation script and the results in markdown format.
I will include them in my model card later.
> Unrelated, but I wonder if an uncased encoder (e.g. bert-base-multilingual-uncased) would perform better for entities that are often not capitalized, like foods.
Interesting observation! Happy to run another experiment using `bert-base-multilingual-uncased` and compare the results. I will definitely ping you again when I complete the training.
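For context, the idea is simply to swap the encoder in the standard SpanMarker training setup, roughly along the lines of the sketch below. The hyperparameters are placeholders, and it assumes the dataset's `ner_tags` feature exposes its label names (otherwise the B-/I- label list has to be defined manually):

```python
from datasets import load_dataset
from transformers import TrainingArguments
from span_marker import SpanMarkerModel, Trainer

# Load MultiNERD; this assumes ner_tags is a Sequence of ClassLabel so the
# label names can be read from the dataset features, and that a validation
# split exists.
dataset = load_dataset("Babelscape/multinerd")
labels = dataset["train"].features["ner_tags"].feature.names

# Same SpanMarker setup as before, but with the uncased encoder
model = SpanMarkerModel.from_pretrained(
    "bert-base-multilingual-uncased",
    labels=labels,
    model_max_length=256,
)

# Placeholder hyperparameters, not necessarily the ones used for the final model
args = TrainingArguments(
    output_dir="results-uncased",
    learning_rate=5e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    num_train_epochs=1,
    bf16=True,
    report_to="none",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
)
trainer.train()
```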
That would be awesome!
I'm running some tests now regarding per-entity class metrics. Here are the results:
| Language | PER | ORG | LOC | ANIM | BIO | CEL | DIS | EVE | FOOD | INST | MEDIA | PLANT | MYTH | TIME | VEHI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| en | 99.49 (10530) | 98.31 (6616) | 99.47 (24046) | 76.26 (3208) | 77.78 (16) | 79.07 (82) | 79.15 (1514) | 97.32 (704) | 68.02 (1132) | 78.58 (24) | 97.93 (916) | 69.69 (1788) | 86.15 (64) | 85.61 (578) | 86.67 (64) |
| zh | 79.71 (4174) | 61.66 (1926) | 78.79 (3850) | 96.56 (7918) | 71.56 (110) | 83.40 (92) | 78.95 (40) | 84.66 (1696) | 81.93 (738) | 83.82 (502) | 93.17 (23902) | 86.93 (1682) | 85.40 (780) | 76.12 (152) | 75.61 (98) |
The first value is the F1, and the second value is the number of entities that were used to calculate that F1. I mention it because some entity classes only have e.g. 24 entities in the test set. I think the differences are quite fascinating. For example, persons, organizations, and locations perform much better in English, while the model seems super good at animals in Chinese for some reason.
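In case anyone wants to compute these per-class numbers before it is built into SpanMarker: seqeval's `classification_report` gives per-entity-type precision, recall, F1, and support from BIO tag sequences. A minimal sketch, where the gold and predicted sequences are only placeholders (in practice they would come from the evaluation loop):

```python
from seqeval.metrics import classification_report

# Gold and predicted labels in BIO format, one list of tags per sentence.
# These two sentences are placeholders purely to show the report format.
y_true = [
    ["B-PER", "I-PER", "O", "B-FOOD", "O"],
    ["B-ANIM", "O", "B-LOC", "I-LOC", "O"],
]
y_pred = [
    ["B-PER", "I-PER", "O", "O", "O"],
    ["B-ANIM", "O", "B-LOC", "I-LOC", "O"],
]

# Prints precision, recall, F1 and support per entity class (PER, FOOD, ANIM, LOC)
print(classification_report(y_true, y_pred, digits=4))
```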
Intuitively, I would look at the number of unique animal entities in the Chinese test split and see if we find something interesting.
Additionally, I quickly glanced through the training and test splits on the Hugging Face dataset hub. I noticed there are some duplicates, and a common pattern is that animal entity samples begin with a "REDIRECT#" prefix (a quick check for this is sketched after the examples below). For instance, if you select the train split and go to the last page, you will see:
[ "R", "E", "D", "I", "R", "E", "C", "T", "#", "玳", "瑁" ] [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 7, 8 ] "zh"
[ "R", "E", "D", "I", "R", "E", "C", "T", "#", "玳", "瑁" ] [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 7, 8 ] "zh"
[ "R", "E", "D", "I", "R", "E", "C", "T", "#", "玳", "瑁" ] [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 7, 8 ] "zh"
[ "R", "E", "D", "I", "R", "E", "C", "T", "#", "玳", "瑁" ] [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 7, 8 ] "zh"
Ooh, you're right. That might unfairly skew the F1 of the Chinese model.
I see some other quirks here too, like lowercase "redirect" and some cases of None. I can notify the dataset author with this information.
Well spotted
> Unrelated, but I wonder if an uncased encoder (e.g. bert-base-multilingual-uncased) would perform better for entities that are often not capitalized, like foods.

> Interesting observation! Happy to run another experiment using `bert-base-multilingual-uncased` and compare the results. I will definitely ping you again when I complete the training.
The fine-tuned version of `bert-base-multilingual-uncased` on the Babelscape/multinerd dataset is ready, and we got some interesting findings as well.
Link: https://huggingface.co/lxyuan/span-marker-bert-base-multilingual-uncased-multinerd
CC: @tomaarsen
Very interesting behaviour indeed. I do tend to see that a cased model outperforms the uncased variant slightly, although users tend to prefer uncased versions as they work on both lowercase and uppercase text. I tried out your model for a bit and it seems quite strong, although I do agree that none of the models do great on food or plants. I think this might be caused by the dataset?
In my tests, lowercase text works very well with your uncased model, but if I use that same text with the cased model, it doesn't even find any entities. (For reference, both models work perfectly on capitalized text.)
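For anyone wanting to reproduce that comparison, something along these lines works. This is only a sketch: it assumes the standard `SpanMarkerModel.predict` API and uses an arbitrary lowercase example sentence:

```python
from span_marker import SpanMarkerModel

# An arbitrary, fully lowercase example sentence (placeholder text)
text = "i had some delicious jiaozi with soy sauce in amsterdam yesterday."

for model_name in [
    "lxyuan/span-marker-bert-base-multilingual-uncased-multinerd",
    "lxyuan/span-marker-bert-base-multilingual-cased-multinerd",
]:
    model = SpanMarkerModel.from_pretrained(model_name)
    # predict() returns the entities found in the given text
    print(model_name)
    print(model.predict(text))
```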
At this point I have the following recommendations:
- Cross-reference the models at the top of the model cards.
- Add examples for the Hosted Inference API widget. We want to show people that the model is awesome, but people won't really be able to come up with difficult examples quickly. The examples help with that.
I can make PRs for these.
> although I do agree that none of the models do great on food or plants. I think this might be caused by the dataset?
Agreed. The dataset (i.e., the type and quality of the sentences) will have a huge impact on model performance for the different entity classes.
Thanks for submitting all the PRs and your training scripts.
Indeed. I've noticed the same with my models trained on data from arXiv papers: they don't work as well on informal text.
Is it okay if I share this model on my LinkedIn over the next few days? It's totally okay if you'd rather that I don't :)
Sure, go ahead! Happy to be mentioned on LinkedIn.