
About

This model takes a single word as input and splits it into syllables. I built it by pre-training a T5 model on a syllables dataset that I scraped from the internet. It uses a custom tokenizer that is effectively character-based. It seems to work okay in my limited tests, but the output may be unpredictable when the input contains multiple words, numbers, or non-English characters. It can, however, handle things such as trailing punctuation.
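Since the output may be unpredictable for multiple words, numbers, or non-English characters, it can help to screen inputs before calling the model. The following is only a rough sketch: `looks_like_single_word` is a hypothetical helper, not part of this model or of `transformers`, and the rule it applies (ASCII letters followed by optional trailing punctuation) is my assumption about what counts as a "safe" input.

```python
import re

# Assumed rule: one run of ASCII letters, optionally followed by trailing
# punctuation such as "," or "." (the model card says this is handled).
_SINGLE_WORD = re.compile(r"[A-Za-z]+[.,;:!?]*")

def looks_like_single_word(text: str) -> bool:
    """Return True if `text` is one English-alphabet word with optional trailing punctuation."""
    return _SINGLE_WORD.fullmatch(text) is not None

print(looks_like_single_word("syllabizer"))  # a plain word
print(looks_like_single_word("two words"))   # contains a space
```

Inputs that fail this check can still be passed to the model, but the results should be treated with more caution.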

Calling the Model

from transformers import AutoTokenizer, T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained('imjeffhi/syllabizer')
tokenizer = AutoTokenizer.from_pretrained('imjeffhi/syllabizer')

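def generate_output(word):
    # Tokenize a single word and return PyTorch tensors.
    tokens = tokenizer(word, return_tensors='pt')
    # Greedy decoding; early_stopping only affects beam search, so it is omitted.
    output = model.generate(**tokens, do_sample=False, max_length=30)[0]
    # Decode the generated ids back to text, dropping special tokens.
    return tokenizer.decode(output, skip_special_tokens=True)

syllables = generate_output('syllabizer')
print(syllables)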

The model returns the syllables as a single string, separated by spaces:

syl la biz er
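Because the output is a space-separated string, it can be post-processed with plain string operations. A minimal sketch follows; the helper names `split_syllables` and `hyphenate` are made up for illustration and are not part of the model or of `transformers`.

```python
from typing import List

def split_syllables(generated: str) -> List[str]:
    """Turn the model's space-separated output into a list of syllables."""
    return generated.split()

def hyphenate(generated: str, sep: str = "-") -> str:
    """Join the syllables with a separator instead of spaces."""
    return sep.join(generated.split())

parts = split_syllables("syl la biz er")
print(parts)                       # list of syllables
print(len(parts))                  # syllable count
print(hyphenate("syl la biz er"))  # hyphenated form
```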

Using pipelines to syllabize sentences

To syllabize an entire sentence or paragraph, split it into words and run them through a pipeline as one batch. The output can then be converted into a list of syllables for each word:

from transformers import pipeline

syllabizer_pipe = pipeline('text2text-generation', model='imjeffhi/syllabizer', tokenizer='imjeffhi/syllabizer')

sentence = "A unit of spoken language consisting of a single uninterrupted sound formed by a vowel, diphthong, or syllabic consonant alone, or by any of these sounds preceded, followed, or surrounded by one or more consonants."
# The model expects one word per input, so split the sentence on whitespace.
words = sentence.split()
# Run all words as a single batch with greedy decoding.
output = syllabizer_pipe(words, batch_size=len(words), do_sample=False, max_length=30)

# Map each original word to its list of syllables.
[{word: gen_text['generated_text'].split(" ")} for word, gen_text in zip(words, output)]

This outputs the following:

[{'A': ['a']},
 {'unit': ['u', 'nit']},
 {'of': ['of']},
 {'spoken': ['spok', 'en']},
 {'language': ['lan', 'guage']},
 {'consisting': ['con', 'sis', 'ting']},
 {'of': ['of']},
 {'a': ['a']},
 {'single': ['sing', 'le']},
 {'uninterrupted': ['un', 'in', 'ter', 'rupt', 'ed']},
 {'sound': ['sound']},
 {'formed': ['formed']},
 {'by': ['by']},
 {'a': ['a']},
 {'vowel,': ['vow', 'el']},
 {'diphthong,': ['diph', 'thong']},
 {'or': ['or']},
 {'syllabic': ['syl', 'la', 'bic']},
 {'consonant': ['con', 'so', 'nant']},
 {'alone,': ['a', 'lone']},
 {'or': ['or']},
 {'by': ['by']},
 {'any': ['an', 'y']},
 {'of': ['of']},
 {'these': ['these']},
 {'sounds': ['sounds']},
 {'preceded,': ['pre', 'ced', 'ed']},
 {'followed,': ['fol', 'lowed']},
 {'or': ['or']},
 {'surrounded': ['sur', 'round', 'ed']},
 {'by': ['by']},
 {'one': ['one']},
 {'or': ['or']},
 {'more': ['more']},
 {'consonants.': ['con', 'so', 'nants']}]
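In the output above, the keys keep the original casing and trailing punctuation (`'vowel,'`, `'A'`), while the syllables come back lowercased and without punctuation. One way to sanity-check a result is to confirm that the syllables join back into the cleaned-up word. This is only a sketch: `syllables_match_word` is a hypothetical helper, and lowercasing plus stripping surrounding punctuation is an assumption based on the sample output, not documented model behavior.

```python
import string
from typing import List

def syllables_match_word(word: str, syllables: List[str]) -> bool:
    """Check that the syllables concatenate back to the word.

    Assumes the model lowercases its output and drops surrounding
    punctuation, as the sample output suggests.
    """
    expected = word.strip(string.punctuation).lower()
    return "".join(syllables).lower() == expected

print(syllables_match_word("preceded,", ["pre", "ced", "ed"]))
print(syllables_match_word("A", ["a"]))
```

A failed check does not say what the right split is; it only flags words where the model dropped, added, or changed characters.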