---
title: phone_errors
tags:
- evaluate
- metric
description: >-
  Error rates in terms of distance between articulatory phonological features
  can help understand differences between strings in the International Phonetic
  Alphabet (IPA) in a linguistically motivated way. This is useful when
  evaluating speech recognition or orthographic to IPA conversion tasks.
sdk: gradio
sdk_version: 3.50.2
app_file: app.py
pinned: false
---

# Metric Card for Phone Errors

## Metric Description
Error rates in terms of distance between articulatory phonological features can help understand differences between strings in the International Phonetic Alphabet (IPA) in a linguistically motivated way. 
This is useful when evaluating speech recognition or orthographic to IPA conversion tasks. The scores are Levenshtein distances between strings where the smallest unit of comparison is the phone or its articulatory phonological features, rather than the Unicode character.
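
For intuition, here is a minimal sketch (not this metric's implementation) of how counting phones rather than Unicode characters changes an edit distance, assuming panphon's `FeatureTable.ipa_segs` for segmentation:
```python
import panphon

ft = panphon.FeatureTable()

# "spʰin" is five Unicode characters but only four phones,
# because the aspirated stop [pʰ] counts as a single segment.
print(ft.ipa_segs("spʰin"))  # expected: ['s', 'pʰ', 'i', 'n']

def levenshtein(a, b):
    """Plain edit distance over two sequences, here lists of phones."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = curr
    return prev[-1]

pred, ref = ft.ipa_segs("spin"), ft.ipa_segs("spʰin")
# PER-style normalization: 1 phone substitution / 4 reference phones = 0.25
print(levenshtein(pred, ref) / len(ref))
```
Counting Unicode characters instead would give 1 insertion / 5 reference characters = 0.2 for the same pair.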

## How to Use

```python
import evaluate
phone_errors = evaluate.load("ginic/phone_errors")
phone_errors.compute(predictions=["bob", "ði"], references=["pop", "ðə"])
```

### Inputs
- **predictions** (`list` of `str`): Transcriptions to score.
- **references** (`list` of `str`): Reference strings serving as ground truth.
- **feature_model** (`str`): Which panphon.distance.Distance feature parsing model to use; choose from `"strict"`, `"permissive"`, or `"segment"`. Defaults to `"segment"`.
- **is_normalize_pfer** (`bool`): Set to `True` to normalize PFER by the largest number of phones in the prediction-reference pair. Defaults to `False`. When this is used, PFER no longer obeys the triangle inequality (see the example below).
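
Both optional parameters can be passed in one call; the output is omitted here because the exact values depend on the panphon feature model:
```python
phone_errors.compute(
    predictions=["bob", "ði"],
    references=["pop", "ðə"],
    feature_model="strict",   # panphon's strict feature parsing instead of the default "segment"
    is_normalize_pfer=True,   # normalize PFER by the larger phone count in each pair
)
```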


### Output Values
The computation returns a dictionary with the following keys and values:
 - **phone_error_rates** (`list` of `float`): Phone error rate (PER) gives the edit distance in terms of phones, rather than Unicode characters (a single phone can consist of multiple characters), for each prediction-reference pair. It is normalized by the number of phones in the reference string. The result will have the same length as the input prediction and reference lists, so each score can be matched back to its input pair (see the example after this list).
 - **mean_phone_error_rate** (`float`): Overall mean of PER.
 - **phone_feature_error_rates** (`list` of `float`): Phone feature error rate (PFER) is the Levenshtein distance between strings where the distance between individual phones is computed as the Hamming distance between their phonetic features, reported for each prediction-reference pair. By default it is a metric that obeys the triangle inequality, but it can also be normalized by the number of phones (see `is_normalize_pfer` above).
 - **mean_phone_feature_error_rate** (`float`): Overall mean of PFER.
 - **feature_error_rates** (`list` of `float`): Feature error rate (FER) is the edit distance in terms of articulatory features normalized by the number of phones in the reference, computed for each prediction-reference pair. 
 - **mean_feature_error_rate** (`float`): Overall mean of FER.
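
Because the per-pair lists are aligned with the inputs, individual scores can be matched back to the string pairs they were computed for:
```python
predictions = ["bob", "ði"]
references = ["pop", "ðə"]
results = phone_errors.compute(predictions=predictions, references=references)

# Report each pair's scores alongside the overall mean.
for pred, ref, per, pfer, fer in zip(
    predictions,
    references,
    results["phone_error_rates"],
    results["phone_feature_error_rates"],
    results["feature_error_rates"],
):
    print(f"{pred} vs. {ref}: PER={per:.3f} PFER={pfer:.3f} FER={fer:.3f}")
print("Mean PER:", results["mean_phone_error_rate"])
```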


#### Values from Popular Papers
[Universal Automatic Phonetic Transcription into the International Phonetic Alphabet (Taguchi et al.)](https://www.isca-archive.org/interspeech_2023/taguchi23_interspeech.html) reported an overall PER of 0.21 and PFER of 0.057 on supervised phonetic transcription of in-domain languages, and a PER of 0.632 and PFER of 0.213 on zero-shot phonetic transcription of languages not seen in the training data. On the zero-shot languages they also reported inter-annotator agreement between human annotators of PER 0.533 and PFER 0.196.

### Examples

The simplest use case computes phone error rates between pairs of IPA strings:
```python
>>> phone_errors.compute(predictions=["bob", "ði", "spin"], references=["pop", "ðə", "spʰin"])
{'phone_error_rates': [0.6666666666666666, 0.5, 0.25], 'mean_phone_error_rate': 0.47222222222222215, 
 'phone_feature_error_rates': [0.08333333333333333, 0.125, 0.041666666666666664], 'mean_phone_feature_error_rate': 0.08333333333333333, 
  'feature_error_rates': [0.027777777777777776, 0.0625, 0.30208333333333337], 'mean_feature_error_rate': 0.13078703703703706}
```

Normalize the phone feature error rate with `is_normalize_pfer=True`:
```python
>>> phone_errors.compute(predictions=["bob", "ði"], references=["pop", "ðə"], is_normalize_pfer=True)
{'phone_error_rates': [0.6666666666666666, 0.5], 'mean_phone_error_rate': 0.5833333333333333, 
 'phone_feature_error_rates': [0.027777777777777776, 0.0625], 'mean_phone_feature_error_rate': 0.04513888888888889, 
 'feature_error_rates': [0.027777777777777776, 0.0625], 'mean_feature_error_rate': 0.04513888888888889}
```

Error rates may be greater than 1.0 if the reference string is shorter than the prediction string:
```python
>>> phone_errors.compute(predictions=["bob"], references=["po"])
{'phone_error_rates': [1.0], 'mean_phone_error_rate': 1.0, 
 'phone_feature_error_rates': [1.0416666666666667], 'mean_phone_feature_error_rate': 1.0416666666666667, 
 'feature_error_rates': [0.020833333333333332], 'mean_feature_error_rate': 0.020833333333333332}
```

Empty reference strings will cause a `ValueError`; you should handle them separately:
```python
>>> phone_errors.compute(predictions=["bob"], references=[""])
Traceback (most recent call last):
  ...
    raise ValueError("one or more references are empty strings")
ValueError: one or more references are empty strings
```

## Limitations and Bias
- Phone error rate and feature error rate can be greater than 1.0 if the reference string is shorter than the prediction string.
- Since these are error rates, not edit distances, the reference strings cannot be empty; filter out pairs with empty references before calling `compute`, as sketched below.
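
One way to handle empty references, assuming you simply want to drop those pairs before scoring (a sketch, not part of the metric's API, and the helper name is made up):
```python
def compute_skipping_empty(predictions, references):
    """Drop pairs whose reference is empty before scoring (hypothetical helper)."""
    kept = [(p, r) for p, r in zip(predictions, references) if r.strip()]
    if not kept:
        return None  # nothing left to score
    preds, refs = zip(*kept)
    return phone_errors.compute(predictions=list(preds), references=list(refs))
```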

## Citation
```bibtex
@inproceedings{Mortensen-et-al:2016,
  author    = {David R. Mortensen and
               Patrick Littell and
               Akash Bharadwaj and
               Kartik Goyal and
               Chris Dyer and
               Lori S. Levin},
  title     = {PanPhon: {A} Resource for Mapping {IPA} Segments to Articulatory Feature Vectors},
  booktitle = {Proceedings of {COLING} 2016, the 26th International Conference on Computational Linguistics: Technical Papers},
  pages     = {3475--3484},
  publisher = {{ACL}},
  year      = {2016}
}
```

## Further References
- PER and PFER are used as evaluation metrics in [Universal Automatic Phonetic Transcription into the International Phonetic Alphabet (Taguchi et al.)](https://www.isca-archive.org/interspeech_2023/taguchi23_interspeech.html) 
- Pierce Darragh's blog post [Introduction to Phonology, Part 3: Phonetic Features](https://pdarragh.github.io/blog/2018/04/26/intro-to-phonology-pt-3/) gives an overview of phonetic features for speech sounds.
- [panphon Github repository](https://github.com/dmort27/panphon)