About
This model takes a word as input and splits it into syllables. I built it by pre-training a T5 model on a syllable dataset I scraped from the internet. It uses a custom tokenizer that is effectively character-based. It works reasonably well in my limited tests, but the output may be unpredictable when the input contains multiple words, numbers, or non-English characters. It can, however, handle trailing punctuation.
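Since the model is only expected to behave well on single English words (optionally with trailing punctuation), it may help to screen inputs before calling it. The following is a minimal sketch of such a check. The function name `looks_supported` and the regular expression are assumptions made for illustration; they are not part of the model or its tokenizer, and the exact set of characters the model tolerates is not documented here.

```python
import re

# Assumption: a "supported" input is one run of ASCII letters, optionally
# followed by trailing punctuation. This pattern is illustrative only.
_SINGLE_WORD = re.compile(r"[A-Za-z]+[.,;:!?]*")

def looks_supported(text: str) -> bool:
    """Return True if text is a single ASCII-letter word (hypothetical check)."""
    return _SINGLE_WORD.fullmatch(text) is not None

print(looks_supported("syllabizer"))  # single word
print(looks_supported("vowel,"))      # trailing punctuation
print(looks_supported("two words"))   # multiple words
print(looks_supported("abc123"))      # digits
```

Inputs that fail the check can still be passed to the model; the check only flags cases where the output may be unpredictable.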
Calling the Model
from transformers import AutoTokenizer, T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained('imjeffhi/syllabizer')
tokenizer = AutoTokenizer.from_pretrained('imjeffhi/syllabizer')

def generate_output(word):
    # Tokenize the word and return PyTorch tensors
    tokens = tokenizer(word, return_tensors='pt')
    # Deterministic decoding (no sampling), output capped at 30 tokens
    output = model.generate(**tokens, do_sample=False, max_length=30, early_stopping=True)[0]
    return tokenizer.decode(output, skip_special_tokens=True)

syllables = generate_output('syllabizer')
The model returns the syllables as a single string, separated by spaces:
syl la biz er
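Because the output is a plain space-separated string, it can be turned into a list, counted, or hyphenated with ordinary string methods. A minimal sketch follows; the helper names `split_syllables` and `hyphenate` are hypothetical and not part of the model or of `transformers`.

```python
def split_syllables(spaced: str) -> list:
    """Turn the space-separated model output into a list of syllables."""
    return spaced.split()

def hyphenate(spaced: str, sep: str = "-") -> str:
    """Join the syllables with a separator instead of spaces."""
    return sep.join(split_syllables(spaced))

example = "syl la biz er"  # the output shown above for 'syllabizer'
print(split_syllables(example))       # ['syl', 'la', 'biz', 'er']
print(len(split_syllables(example)))  # 4
print(hyphenate(example))             # syl-la-biz-er
```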
Using pipelines to syllabize sentences
You can syllabize an entire sentence or paragraph, and convert the output into a list of syllables per word, with the following code:
from transformers import pipeline

syllabizer_pipe = pipeline('text2text-generation', model='imjeffhi/syllabizer', tokenizer='imjeffhi/syllabizer')

sentence = "A unit of spoken language consisting of a single uninterrupted sound formed by a vowel, diphthong, or syllabic consonant alone, or by any of these sounds preceded, followed, or surrounded by one or more consonants."
# The model expects one word per input, so split the sentence on spaces
words = sentence.split(" ")
# Run all words through the pipeline as a single batch
output = syllabizer_pipe(words, batch_size=len(words), do_sample=False, max_length=30, early_stopping=True)
# Map each original word to its list of syllables
[{words[i]: gen_text['generated_text'].split(" ")} for i, gen_text in enumerate(output)]
This outputs the following:
[{'A': ['a']},
{'unit': ['u', 'nit']},
{'of': ['of']},
{'spoken': ['spok', 'en']},
{'language': ['lan', 'guage']},
{'consisting': ['con', 'sis', 'ting']},
{'of': ['of']},
{'a': ['a']},
{'single': ['sing', 'le']},
{'uninterrupted': ['un', 'in', 'ter', 'rupt', 'ed']},
{'sound': ['sound']},
{'formed': ['formed']},
{'by': ['by']},
{'a': ['a']},
{'vowel,': ['vow', 'el']},
{'diphthong,': ['diph', 'thong']},
{'or': ['or']},
{'syllabic': ['syl', 'la', 'bic']},
{'consonant': ['con', 'so', 'nant']},
{'alone,': ['a', 'lone']},
{'or': ['or']},
{'by': ['by']},
{'any': ['an', 'y']},
{'of': ['of']},
{'these': ['these']},
{'sounds': ['sounds']},
{'preceded,': ['pre', 'ced', 'ed']},
{'followed,': ['fol', 'lowed']},
{'or': ['or']},
{'surrounded': ['sur', 'round', 'ed']},
{'by': ['by']},
{'one': ['one']},
{'or': ['or']},
{'more': ['more']},
{'consonants.': ['con', 'so', 'nants']}]
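The list of single-entry dictionaries above can be post-processed without calling the model again. The sketch below counts syllables per word and rebuilds a hyphenated sentence from a short excerpt of that output. The function names `syllable_counts` and `hyphenated` are hypothetical helpers introduced for illustration.

```python
# Excerpt of the pipeline output shown above
results = [
    {'A': ['a']},
    {'unit': ['u', 'nit']},
    {'of': ['of']},
    {'spoken': ['spok', 'en']},
    {'language': ['lan', 'guage']},
]

def syllable_counts(entries):
    """Return (word, number of syllables) pairs in the original order."""
    return [(word, len(syls)) for entry in entries for word, syls in entry.items()]

def hyphenated(entries, sep="-"):
    """Rebuild the text with syllables joined by sep and words by spaces."""
    return " ".join(sep.join(syls) for entry in entries for syls in entry.values())

print(syllable_counts(results))
print(sum(n for _, n in syllable_counts(results)))  # 8
print(hyphenated(results))  # a u-nit of spok-en lan-guage
```

The rebuilt text follows the model's output rather than the input, so, as in the results above, it is lowercase and does not carry the trailing punctuation.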