## About

This model takes a word as input and splits it into syllables. I built it by pre-training a T5 model on a syllable dataset I scraped from the internet, using a custom tokenizer that is effectively character-based. It seems to work okay in my limited tests, but the output may be unpredictable when the input contains multiple words, numbers, or non-English characters. It can, however, handle things such as trailing punctuation.

## Calling the Model

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained('imjeffhi/syllabizer')
tokenizer = AutoTokenizer.from_pretrained('imjeffhi/syllabizer')

def generate_output(word):
    # Tokenize a single word and decode without sampling
    tokens = tokenizer(word, return_tensors='pt')
    output = model.generate(**tokens, do_sample=False, max_length=30, early_stopping=True)[0]
    return tokenizer.decode(output, skip_special_tokens=True)

syllables = generate_output('syllabizer')
```

The model returns the syllables separated by spaces. See the output below.

```
syl la biz er
```

## Using pipelines to syllabize sentences

You can syllabize an entire sentence or paragraph, and convert the output into a list of syllables per word, with the following code:

```python
from transformers import pipeline

syllabizer_pipe = pipeline('text2text-generation', model='imjeffhi/syllabizer', tokenizer='imjeffhi/syllabizer')
sentence = "A unit of spoken language consisting of a single uninterrupted sound formed by a vowel, diphthong, or syllabic consonant alone, or by any of these sounds preceded, followed, or surrounded by one or more consonants."
words = sentence.split(" ")
# Run every word through the model in a single batch
output = syllabizer_pipe(words, batch_size=len(words), do_sample=False, max_length=30, early_stopping=True)
# Map each original word to its list of syllables
[{words[i]: gen_text['generated_text'].split(" ")} for i, gen_text in enumerate(output)]
```

This outputs the following:

```
[{'A': ['a']}, {'unit': ['u', 'nit']}, {'of': ['of']}, {'spoken': ['spok', 'en']}, {'language': ['lan', 'guage']}, {'consisting': ['con', 'sis', 'ting']}, {'of': ['of']}, {'a': ['a']}, {'single': ['sing', 'le']}, {'uninterrupted': ['un', 'in', 'ter', 'rupt', 'ed']}, {'sound': ['sound']}, {'formed': ['formed']}, {'by': ['by']}, {'a': ['a']}, {'vowel,': ['vow', 'el']}, {'diphthong,': ['diph', 'thong']}, {'or': ['or']}, {'syllabic': ['syl', 'la', 'bic']}, {'consonant': ['con', 'so', 'nant']}, {'alone,': ['a', 'lone']}, {'or': ['or']}, {'by': ['by']}, {'any': ['an', 'y']}, {'of': ['of']}, {'these': ['these']}, {'sounds': ['sounds']}, {'preceded,': ['pre', 'ced', 'ed']}, {'followed,': ['fol', 'lowed']}, {'or': ['or']}, {'surrounded': ['sur', 'round', 'ed']}, {'by': ['by']}, {'one': ['one']}, {'or': ['or']}, {'more': ['more']}, {'consonants.': ['con', 'so', 'nants']}]
```
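
Because the model returns each word as space-separated syllables, the results can be post-processed with plain Python. The following is a minimal sketch and not part of the model or the `transformers` API: `hyphenate` and `count_syllables` are hypothetical helper names, and the `output` list is a hand-written stand-in that mimics the `generated_text` dictionaries returned by the pipeline call above, so the snippet runs without downloading the model.

```python
def hyphenate(generated_text, sep="-"):
    """Join the model's space-separated syllables with a separator."""
    return sep.join(generated_text.split())


def count_syllables(generated_text):
    """Count syllables by counting the space-separated chunks."""
    return len(generated_text.split())


# Assumed stand-in for real pipeline output (same dictionary shape)
output = [
    {"generated_text": "syl la biz er"},
    {"generated_text": "con so nant"},
]

for item in output:
    text = item["generated_text"]
    print(hyphenate(text), count_syllables(text))
```

With a real pipeline result, replace the stand-in `output` with the list returned by `syllabizer_pipe(...)`.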