arxiv:2307.03322

BiPhone: Modeling Inter Language Phonetic Influences in Text

Published on Jul 6, 2023
· Featured in Daily Papers on Jul 10, 2023

Abstract

A large number of people are forced to use the Web in a language they have low literacy in due to technology asymmetries. Written text in the second language (L2) from such users often contains a large number of errors that are influenced by their native language (L1). We propose a method to mine phoneme confusions (sounds in L2 that an L1 speaker is likely to conflate) for pairs of L1 and L2. These confusions are then plugged into a generative model (Bi-Phone) for synthetically producing corrupted L2 text. Through human evaluations, we show that Bi-Phone generates plausible corruptions that differ across L1s and also have widespread coverage on the Web. We also corrupt the popular language understanding benchmark SuperGLUE with our technique (FunGLUE for Phonetically Noised GLUE) and show that SoTA language understanding models perform poorly. We also introduce a new phoneme prediction pre-training task which helps byte models recover performance close to that on SuperGLUE. Finally, we release the FunGLUE benchmark to promote further research in phonetically robust language models. To the best of our knowledge, FunGLUE is the first benchmark to introduce L1-L2 interactions in text.
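As a rough illustration of the corruption idea described in the abstract, here is a minimal, self-contained Python sketch. The confusion table, the probabilities, and the phoneme-to-spelling map below are made-up placeholders, not the mined confusions or the released Bi-Phone model; they only show the general shape of sampling phoneme-level substitutions and mapping the result back to a plausible spelling.

```python
import random

# Hypothetical L1->L2 phoneme confusion table (probabilities are invented).
# Each entry: L2 phoneme -> list of (confusable phoneme, probability).
CONFUSIONS = {
    "IY": [("IH", 0.4)],   # e.g. "seat" produced as "sit"
    "V":  [("W", 0.3)],
    "TH": [("T", 0.5)],
}

def corrupt_phonemes(phonemes, confusions=CONFUSIONS):
    """Replace each phoneme with a confusable one, with its listed probability."""
    out = []
    for p in phonemes:
        for alt, prob in confusions.get(p, []):
            if random.random() < prob:
                p = alt
                break
        out.append(p)
    return out

# Toy grapheme step: map the (possibly corrupted) phoneme sequence back
# to a spelling with a hard-coded lookup, purely for illustration.
TOY_SPELLING = {"S": "s", "IY": "ea", "IH": "i", "T": "t", "TH": "th", "V": "v", "W": "w"}

def phonemes_to_spelling(phonemes):
    return "".join(TOY_SPELLING.get(p, p.lower()) for p in phonemes)

if __name__ == "__main__":
    word = ["S", "IY", "T"]            # "seat"
    noisy = corrupt_phonemes(word)     # e.g. ["S", "IH", "T"]
    print(phonemes_to_spelling(noisy)) # may print "sit"
```

In the paper the confusions are mined from data for each L1-L2 pair; here everything is hard-coded for brevity.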

Community

Very interesting work, thanks! How does the BERT-large mask-fill approach work for spelling correction (5.2)? I don't understand the assumption that the predicted word will be the corrected form of the misspelled word; is there some extra mechanism that guides it?

Paper author

Hi Andy, thanks for your interest in our work.

Your observation is correct: the BERT-large mask-fill approach relies only on the context to "correct" misspelled words and does not use the misspelled word itself.

This is a naive baseline suggested by one of our reviewers at ACL.

@Abhirut got it, thank you for clarifying!
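For readers curious what the context-only baseline discussed above looks like in practice, here is a minimal sketch using the Hugging Face fill-mask pipeline with bert-large-uncased. The example sentence is invented and this is not the paper's evaluation code; it only shows that the prediction comes entirely from context, with the misspelled word itself discarded.

```python
from transformers import pipeline

# Context-only "correction": replace the misspelled token with [MASK]
# and let BERT-large predict a word purely from the surrounding context.
fill = pipeline("fill-mask", model="bert-large-uncased")

# Hypothetical noisy sentence: "She bought some vegitables for dinner."
# The misspelled word is masked out, so its spelling never reaches the model.
sentence = "She bought some [MASK] for dinner."

for pred in fill(sentence, top_k=3):
    print(pred["token_str"], round(pred["score"], 3))
```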
