
BORT

BORT is a pretrained LLM designed to accept a mixture of English phonemes (in IPA) and orthography, built with clinical language evaluation tasks in mind. From the paper:

Robert Gale, Alexandra C. Salem, Gerasimos Fergadiotis, and Steven Bedrick. 2023. Mixed Orthographic/Phonemic Language Modeling: Beyond Orthographically Restricted Transformers (BORT). In Proceedings of the 8th Workshop on Representation Learning for NLP (RepL4NLP-2023), pages TBD, Online. Association for Computational Linguistics. [paper] [poster]

Acknowledgements

This work was supported by the National Institute on Deafness and Other Communication Disorders of the National Institutes of Health under award 5R01DC015999 (Principal Investigators: Bedrick & Fergadiotis). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

Limitations

The models presented here were trained with the basic inventory of English phonemes found in CMUDict. However, a more fine-grained phonetic analysis would require a pronunciation dictionary with more narrowly defined entries. Additionally, while this paper focused on models trained with English-only resources (pre-trained BART-BASE, English Wikipedia text, CMUDict, and the English AphasiaBank), the techniques should be applicable to non-English language models as well. Finally, from a clinical standpoint, the model we describe in this paper assumes the existence of transcribed input (from either a manual or automated source, discussed in detail in §2.1 of the paper); in its current form, this represents a limitation to its clinical implementation, though not to its use in research settings with archival or newly transcribed datasets.

Ethics Statement

Our use of the AphasiaBank data was governed by the TalkBank consortium's data use agreement, and the underlying recordings were collected and shared with approval of the contributing sites' institutional review boards. Limitations exist regarding accents and dialect, which in turn would affect the scenarios in which a system based on our model could (and should) be used. It should also be noted that these models and any derived technology are not meant to be tools to diagnose medical conditions, a task best left to qualified clinicians.

Wikipedia Dataset Used in Pre-Training

The BPE-tokenized version of the dataset, including metadata used in word transforms.

  • Dataset (upload ETA ≤ ACL 2023)

Usage

Downloading BORT

from transformers import AutoTokenizer, BartForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("palat/bort")
model = BartForConditionalGeneration.from_pretrained("palat/bort")

The above loads the default variant, bort-pr-sp-noisy. Each variant from the paper can be retrieved by specifying the variant argument, like so:

BartForConditionalGeneration.from_pretrained("palat/bort", variant="bort-sp")

The following variants are available, pre-trained on the specified proportion of each task:

Variant            Pronunciation   Spelling   Noise
bort-pr            10%             –          –
bort-sp            –               10%        –
bort-pr-sp         10%             10%        –
bort-pr-noisy      10%             –          5%
bort-sp-noisy      –               10%        5%
bort-pr-sp-noisy   10%             10%        5%
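
For example, to load one of the non-default checkpoints from the table (this sketch assumes the tokenizer is shared across variants, since only one tokenizer is published under "palat/bort"):

from transformers import AutoTokenizer, BartForConditionalGeneration

# The variant argument selects the checkpoint weights; the tokenizer is assumed to be shared.
tokenizer = AutoTokenizer.from_pretrained("palat/bort")
model = BartForConditionalGeneration.from_pretrained("palat/bort", variant="bort-pr-noisy")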

Basic usage

BORT is intended to be fine-tuned to a specific task, but for a basic demonstration of what distinguishes it from other LLMs, consider the following example. The pre-trained model has no issue translating "Long ·aɪlən·d" to "Long Island", or "Long ·b·iʧ" to "Long Beach". The next two texts demonstrate the effect of context: while "lɔŋ ·aɪlən·d" still translates to "Long Island", "lɔŋ ·b·iʧ" bumps up against a homophone, and the model produces "long beech". (Note: the bullet character · is used to prevent the BPE tokenizer from combining phonemes.)

from transformers import AutoTokenizer, BartForConditionalGeneration

# Examples of mixed orthography and IPA phonemes:
in_texts = [
    "Due to its coastal location, Long ·aɪlən·d winter temperatures are milder than most of the state.",
    "Due to its coastal location, Long ·b·iʧ winter temperatures are milder than most of the state.",
    "Due to its coastal location, lɔŋ ·aɪlən·d winter temperatures are milder than most of the state.",
    "Due to its coastal location, lɔŋ ·b·iʧ winter temperatures are milder than most of the state.",
]

# Set up model and tokenizer:
tokenizer = AutoTokenizer.from_pretrained("palat/bort")
model = BartForConditionalGeneration.from_pretrained("palat/bort")

# Run generative inference for the batch of examples:
inputs = tokenizer(in_texts, return_tensors="pt", padding=True)
summary_ids = model.generate(inputs["input_ids"], attention_mask=inputs["attention_mask"], num_beams=2, min_length=0, max_length=2048)
decoded = tokenizer.batch_decode(summary_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)

# Print the translated text:
for in_text, out_text in zip(in_texts, decoded):
    print(f"In:   \t{in_text}")
    print(f"Out:  \t{out_text}")
    print()

Full output for the above example:

In:   	Due to its coastal location, lɔŋ ·aɪlən·d winter temperatures are milder than most of the state.
Out:  	Due to its coastal location, Long Island winter temperatures are milder than most of the state.

In:   	Due to its coastal location, lɔŋ ·b·iʧ winter temperatures are milder than most of the state.
Out:  	Due to its coastal location, long beech winter temperatures are milder than most of the state.

In:   	Due to its coastal location, Long ·b·iʧ winter temperatures are milder than most of the state.
Out:  	Due to its coastal location, Long Beach winter temperatures are milder than most of the state.

In:   	Due to its coastal location, lɔŋfɝd winter temperatures are milder than most of the state.
Out:  	Due to its coastal location, Longford winter temperatures are milder than most of the state.
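
A side note on the · separator mentioned above: an informal way to see its effect (not part of the paper's pipeline) is to compare how the BPE tokenizer segments a phonemic span with and without it:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("palat/bort")

# With the separator, BPE is kept from merging the phonemes; without it,
# the same span may be segmented into unrelated subwords.
print(tokenizer.tokenize("lɔŋ ·aɪlən·d"))
print(tokenizer.tokenize("lɔŋ aɪlənd"))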
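
As noted above, BORT is meant to be fine-tuned to a specific task. Below is a minimal sketch of a single seq2seq training step, using one (mixed phonemic input, orthographic target) pair taken from the example output; the optimizer, learning rate, and single-pair "dataset" are placeholders, not the training setup from the paper.

import torch
from transformers import AutoTokenizer, BartForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("palat/bort")
model = BartForConditionalGeneration.from_pretrained("palat/bort")

# One illustrative pair: mixed phonemic/orthographic source -> fully orthographic target.
src = "Due to its coastal location, lɔŋ ·aɪlən·d winter temperatures are milder than most of the state."
tgt = "Due to its coastal location, Long Island winter temperatures are milder than most of the state."

inputs = tokenizer(src, return_tensors="pt")
labels = tokenizer(tgt, return_tensors="pt")["input_ids"]
# (In a real batch, padded label positions should be set to -100 so they are ignored by the loss.)

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)
model.train()

# Single gradient step on the seq2seq cross-entropy loss.
loss = model(input_ids=inputs["input_ids"],
             attention_mask=inputs["attention_mask"],
             labels=labels).loss
loss.backward()
optimizer.step()
optimizer.zero_grad()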