---
title: segmentation_scores
datasets:
- "transformersegmentation/CHILDES_EnglishNA"
tags:
- evaluate
- metric
language:
- en
description: " metric for word segmentation scores "
sdk: gradio
sdk_version: 3.19.1
app_file: app.py
pinned: false
---

# Metric Card for Segmentation Scores

## Metric Description

There are several standard metrics for evaluating word segmentation performance. Given a segmented text, we can evaluate it against a gold standard according to the placement of the *boundaries*, the set of word *tokens* produced, and the set of word *types* produced. For each of these, we can compute *precision*, *recall* and *F-score*. In the literature, token and type scores are also referred to as *word* and *lexicon* scores, respectively.

For example, if our gold segmentation is "the dog is on the boat", we have 5 word boundaries (7 if you include the edge boundaries), 6 word tokens and 5 word types. If a model predicted the segmentation "thedog is on the boat", this would differ from the gold segmentation in terms of 1 boundary (1 boundary missing), 3 word tokens ("the" and "dog" missing, "thedog" added) and 2 word types ("dog" missing and "thedog" added). For this example, we'd have a *boundary precision* of 1.0 (no incorrect boundaries), a *boundary recall* of 0.8 (4 boundaries hit out of 5) and a *boundary F-score* of 0.89 (harmonic mean of precision and recall). The full list of scores would be:

| Score              | Value |
|--------------------|-------|
| Boundary Precision | 1.0   |
| Boundary Recall    | 0.8   |
| Boundary F-Score   | 0.89  |
| Token Precision    | 0.8   |
| Token Recall       | 0.67  |
| Token F-Score      | 0.73  |
| Type Precision     | 0.8   |
| Type Recall        | 0.8   |
| Type F-Score       | 0.8   |

Generally, type scores < token scores < boundary scores.

This module also computes boundary scores that include the edge boundaries, labeled *boundary_all*, while the boundary scores that exclude the edge boundaries are labeled *boundary_noedge*.
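
The worked example above can be reproduced with a few lines of plain Python. This is only an illustrative sketch, not this module's implementation: the helper names (`word_spans`, `prf`, `example_scores`) are hypothetical, it handles a single sentence, it scores boundaries without the edges, and it assumes the gold and predicted strings contain the same characters once spaces are removed.

```python
def word_spans(words):
    """Return the (start, end) character span of each word, ignoring spaces."""
    spans, pos = [], 0
    for word in words:
        spans.append((pos, pos + len(word)))
        pos += len(word)
    return spans


def prf(n_correct, n_predicted, n_gold):
    """Precision, recall and F-score (harmonic mean) from raw counts."""
    precision = n_correct / n_predicted if n_predicted else 0.0
    recall = n_correct / n_gold if n_gold else 0.0
    total = precision + recall
    fscore = 2 * precision * recall / total if total else 0.0
    return precision, recall, fscore


def example_scores(gold, predicted):
    """Boundary (no edges), token and type scores for one sentence."""
    gold_words, pred_words = gold.split(), predicted.split()
    gold_spans, pred_spans = word_spans(gold_words), word_spans(pred_words)

    # Internal boundaries: the end offset of every word except the last.
    gold_bounds = {end for _, end in gold_spans[:-1]}
    pred_bounds = {end for _, end in pred_spans[:-1]}

    # A predicted token is correct when its span matches a gold span exactly.
    correct_tokens = set(gold_spans) & set(pred_spans)

    # Types are the distinct word forms in each segmentation.
    gold_types, pred_types = set(gold_words), set(pred_words)

    return {
        "boundary": prf(len(gold_bounds & pred_bounds), len(pred_bounds), len(gold_bounds)),
        "token": prf(len(correct_tokens), len(pred_spans), len(gold_spans)),
        "type": prf(len(gold_types & pred_types), len(pred_types), len(gold_types)),
    }


scores = example_scores("the dog is on the boat", "thedog is on the boat")
for name, (p, r, f) in scores.items():
    print(f"{name}: precision={p:.2f} recall={r:.2f} fscore={f:.2f}")
```

Rounded to two decimals, the printed values agree with the table above.
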
If multiple sentences are provided, the measures are computed over all of them (the lexicon is computed over all sentences, rather than per-sentence).

## How to Use

At minimum, this metric requires predictions and references as inputs.

```python
>>> import evaluate
>>> segmentation_scores = evaluate.load("transformersegmentation/segmentation_scores")
>>> results = segmentation_scores.compute(
...     references=[
...         "w ɛ ɹ WORD_BOUNDARY ɪ z WORD_BOUNDARY ð ɪ s WORD_BOUNDARY",
...         "l ɪ ɾ əl WORD_BOUNDARY aɪ z WORD_BOUNDARY",
...     ],
...     predictions=[
...         "w ɛ ɹ WORD_BOUNDARY ɪ z WORD_BOUNDARY ð ɪ s WORD_BOUNDARY",
...         "l ɪ ɾ əl WORD_BOUNDARY aɪ z WORD_BOUNDARY",
...     ],
... )
>>> print(results)
{'type_fscore': 1.0, 'type_precision': 1.0, 'type_recall': 1.0, 'token_fscore': 1.0, 'token_precision': 1.0, 'token_recall': 1.0, 'boundary_all_fscore': 1.0, 'boundary_all_precision': 1.0, 'boundary_all_recall': 1.0, 'boundary_noedge_fscore': 1.0, 'boundary_noedge_precision': 1.0, 'boundary_noedge_recall': 1.0}
```

### Inputs

- **predictions** (`list` of `str`): Predicted segmentations, with characters separated by spaces and word boundaries marked with "WORD_BOUNDARY".
- **references** (`list` of `str`): Ground truth segmentations, with characters separated by spaces and word boundaries marked with "WORD_BOUNDARY".

### Output Values

All scores range from a minimum of 0.0 to a maximum of 1.0. A higher score is better. F-scores are the harmonic mean of precision and recall.

- **boundary_all_precision** (`float`): Boundary precision score, including edge boundaries.
- **boundary_all_recall** (`float`): Boundary recall score, including edge boundaries.
- **boundary_all_fscore** (`float`): Boundary F-score, including edge boundaries.
- **boundary_noedge_precision** (`float`): Boundary precision score, excluding edge boundaries.
- **boundary_noedge_recall** (`float`): Boundary recall score, excluding edge boundaries.
- **boundary_noedge_fscore** (`float`): Boundary F-score, excluding edge boundaries.
- **token_precision** (`float`): Token/Word precision score.
- **token_recall** (`float`): Token/Word recall score.
- **token_fscore** (`float`): Token/Word F-score.
- **type_precision** (`float`): Type/Lexicon precision score.
- **type_recall** (`float`): Type/Lexicon recall score.
- **type_fscore** (`float`): Type/Lexicon F-score.
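
To build strings in the format described under Inputs, a small helper along the following lines may be convenient. It is a sketch rather than part of this module: `to_metric_format` is a hypothetical name, and it assumes each word is already available as a list of symbols (which may be longer than one character, as with "əl" in the example above) and that every word, including the last, is followed by "WORD_BOUNDARY", as in the usage example.

```python
WORD_BOUNDARY = "WORD_BOUNDARY"


def to_metric_format(words):
    """Join words (each a list of symbols) into a space-separated string,
    placing a WORD_BOUNDARY marker after every word."""
    parts = []
    for symbols in words:
        parts.extend(symbols)
        parts.append(WORD_BOUNDARY)
    return " ".join(parts)


utterance = [["w", "ɛ", "ɹ"], ["ɪ", "z"], ["ð", "ɪ", "s"]]
print(to_metric_format(utterance))
# w ɛ ɹ WORD_BOUNDARY ɪ z WORD_BOUNDARY ð ɪ s WORD_BOUNDARY
```

The resulting strings can be collected into lists and passed as `predictions` or `references`.
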