---
title: segmentation_scores
datasets:
  - transformersegmentation/CHILDES_EnglishNA
tags:
  - evaluate
  - metric
language:
  - en
description: 'Metric for word segmentation scores'
sdk: gradio
sdk_version: 3.19.1
app_file: app.py
pinned: false
---

# Metric Card for Segmentation Scores

## Metric Description

There are several standard metrics for evaluating word segmentation performance. Given a segmented text, we can evaluate against a gold standard according to the placement of the boundaries, the set of word tokens produced, and the set of word types produced. For each of these, we can compute precision, recall and F-score. In the literature, token and type scores are also referred to as word and lexicon scores, respectively.

For example, if our gold segmentation is "the dog is on the boat", we have 5 word boundaries (7 if you include the edge boundaries), 6 word tokens and 5 word types. If a model predicted the segmentation "thedog is on the boat", this would differ from the gold segmentation in terms of 1 boundary (1 boundary missing), 3 word tokens ("the" and "dog" missing, "thedog" added) and 2 word types ("dog" missing and "thedog" added). For this example, we'd have a boundary precision of 1.0 (no incorrect boundaries), a boundary recall of 0.8 (4 boundaries hit out of 5) and a boundary F-score of 0.89 (harmonic mean of precision and recall). The full list of scores would be:

| Score | Value |
|-------|-------|
| Boundary Precision | 1.0 |
| Boundary Recall | 0.8 |
| Boundary F-Score | 0.89 |
| Token Precision | 0.8 |
| Token Recall | 0.67 |
| Token F-Score | 0.73 |
| Type Precision | 0.8 |
| Type Recall | 0.8 |
| Type F-Score | 0.8 |
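
The scores in the table above can be reproduced by hand. The sketch below is a minimal illustration of the counting involved, not the module's implementation; it treats boundaries and tokens as character offsets in the concatenated utterance.

```python
# Minimal sketch of how the example scores are computed (not the module's implementation).

def precision_recall_f(n_correct, n_predicted, n_gold):
    precision = n_correct / n_predicted
    recall = n_correct / n_gold
    fscore = 2 * precision * recall / (precision + recall)
    return precision, recall, fscore

gold = "the dog is on the boat".split()
pred = "thedog is on the boat".split()

# Boundary scores (excluding edges): compare the character offsets where words end.
def boundary_positions(words):
    positions, offset = set(), 0
    for word in words[:-1]:
        offset += len(word)
        positions.add(offset)
    return positions

gold_b, pred_b = boundary_positions(gold), boundary_positions(pred)
print(precision_recall_f(len(gold_b & pred_b), len(pred_b), len(gold_b)))
# (1.0, 0.8, 0.888...)

# Token scores: a predicted token is correct if it spans exactly the same characters as a gold token.
def token_spans(words):
    spans, offset = set(), 0
    for word in words:
        spans.add((offset, offset + len(word)))
        offset += len(word)
    return spans

gold_t, pred_t = token_spans(gold), token_spans(pred)
print(precision_recall_f(len(gold_t & pred_t), len(pred_t), len(gold_t)))
# (0.8, 0.666..., 0.727...)

# Type scores: compare the sets of distinct word forms (the lexicons).
gold_types, pred_types = set(gold), set(pred)
print(precision_recall_f(len(gold_types & pred_types), len(pred_types), len(gold_types)))
# (0.8, 0.8, 0.8)
```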

Generally, type scores < token scores < boundary scores. The module computes two sets of boundary scores: `boundary_all`, which includes the edge boundaries of each utterance, and `boundary_noedge`, which excludes them. If multiple sentences are provided, the measures are computed over all of them (in particular, the lexicon used for the type scores is built over all sentences, rather than per sentence).

## How to Use

At minimum, this metric requires `predictions` and `references` as inputs.

```python
>>> import evaluate
>>> segmentation_scores = evaluate.load("transformersegmentation/segmentation_scores")
>>> results = segmentation_scores.compute(references=["w ɛ ɹ WORD_BOUNDARY ɪ z WORD_BOUNDARY ð ɪ s WORD_BOUNDARY", "l ɪ ɾ əl WORD_BOUNDARY aɪ z WORD_BOUNDARY"], predictions=["w ɛ ɹ WORD_BOUNDARY ɪ z WORD_BOUNDARY ð ɪ s WORD_BOUNDARY", "l ɪ ɾ əl WORD_BOUNDARY aɪ z WORD_BOUNDARY"])
>>> print(results)
{'type_fscore': 1.0, 'type_precision': 1.0, 'type_recall': 1.0, 'token_fscore': 1.0, 'token_precision': 1.0, 'token_recall': 1.0, 'boundary_all_fscore': 1.0, 'boundary_all_precision': 1.0, 'boundary_all_recall': 1.0, 'boundary_noedge_fscore': 1.0, 'boundary_noedge_precision': 1.0, 'boundary_noedge_recall': 1.0}
```
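
A call where the prediction differs from the reference can be built from the example in the Metric Description, encoded in the expected input format (here with orthographic characters rather than phonemes). The values indicated in the comment follow the worked example above; the exact dictionary output is not reproduced here.

```python
>>> gold = "t h e WORD_BOUNDARY d o g WORD_BOUNDARY i s WORD_BOUNDARY o n WORD_BOUNDARY t h e WORD_BOUNDARY b o a t WORD_BOUNDARY"
>>> predicted = "t h e d o g WORD_BOUNDARY i s WORD_BOUNDARY o n WORD_BOUNDARY t h e WORD_BOUNDARY b o a t WORD_BOUNDARY"
>>> results = segmentation_scores.compute(references=[gold], predictions=[predicted])
>>> print(results)
# Expected (see the table above): type P/R/F = 0.8/0.8/0.8, token P/R/F ≈ 0.8/0.67/0.73,
# boundary_noedge P/R/F ≈ 1.0/0.8/0.89.
```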

### Inputs

- `predictions` (list of `str`): Predicted segmentations, with characters separated by spaces and word boundaries marked with `WORD_BOUNDARY` (a formatting helper is sketched below).
- `references` (list of `str`): Ground truth segmentations, with characters separated by spaces and word boundaries marked with `WORD_BOUNDARY`.
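
If your data is stored as plain whitespace-separated words, a small helper along the following lines (hypothetical, not part of the module) can produce the expected format:

```python
def to_segmentation_format(sentence):
    """Hypothetical helper: convert a whitespace-segmented sentence into the
    space-separated character format with WORD_BOUNDARY markers."""
    return " ".join(" ".join(word) + " WORD_BOUNDARY" for word in sentence.split())

print(to_segmentation_format("the dog is on the boat"))
# t h e WORD_BOUNDARY d o g WORD_BOUNDARY i s WORD_BOUNDARY o n WORD_BOUNDARY t h e WORD_BOUNDARY b o a t WORD_BOUNDARY
```

For phonemic input like the IPA example above, units should be split at the phoneme level rather than the character level (e.g. "aɪ" kept as one unit).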

### Output Values

All scores have a minimum possible value of 0 and a maximum possible value of 1.0. A higher score is better. F-scores are the harmonic mean of precision and recall.
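
Concretely, each F-score is computed from the corresponding precision $P$ and recall $R$ as:

$$F = \frac{2 \cdot P \cdot R}{P + R}$$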

- `boundary_all_precision` (float): Boundary precision score, including edge boundaries.
- `boundary_all_recall` (float): Boundary recall score, including edge boundaries.
- `boundary_all_fscore` (float): Boundary F-score, including edge boundaries.
- `boundary_noedge_precision` (float): Boundary precision score, excluding edge boundaries.
- `boundary_noedge_recall` (float): Boundary recall score, excluding edge boundaries.
- `boundary_noedge_fscore` (float): Boundary F-score, excluding edge boundaries.
- `token_precision` (float): Token/word precision score.
- `token_recall` (float): Token/word recall score.
- `token_fscore` (float): Token/word F-score.
- `type_precision` (float): Type/lexicon precision score.
- `type_recall` (float): Type/lexicon recall score.
- `type_fscore` (float): Type/lexicon F-score.