Generate short summary of transcript in 5 words

#6
by praful-soni

While the current model adeptly generates titles from articles, I require a model for summarizing transcripts in 5-10 words.
An LLM, though effective, is resource-intensive and takes 1-2 minutes per generation.
My attempts to fine-tune BERT, BART, and T5 models yielded suboptimal results. I have 1-hour meeting transcripts for training but struggle to achieve better accuracy. I am seeking guidance on optimizing the model for better performance.

Have you looked at how many tokens your transcripts have on average?
For longer documents, most models' context length is simply too short (most BERT models have a maximum context length of 512 tokens; if your transcripts are longer than that, you might have to look for models with larger context lengths).

You can check the maximum number of tokens for a given model with the following code:

```
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(your_model_name)
print(tokenizer.model_max_length)
```
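For a standard BERT checkpoint this prints 512, matching the limit mentioned above. Be aware that some tokenizers print a very large placeholder number here instead when the checkpoint does not define a real limit.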

Also, I believe that encoder-only models like BERT are not made for the task of text generation, since their architecture outputs only a single prediction per input token rather than generating new text, which is rarely enough to create a meaningful title. I suggest you take a look at Text2Text Generation models and fine-tune one with a context size that is large enough to fit your documents, as in the sketch below.
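As a minimal sketch of that direction, assuming a long-context encoder-decoder such as google/long-t5-tglobal-base (the model choice and the length values are my assumptions, and the base checkpoint would still need fine-tuning on your transcripts before the titles are meaningful):

```
from transformers import pipeline

# LongT5 accepts inputs far beyond BERT's 512-token limit (up to ~16k tokens)
summarizer = pipeline("summarization", model="google/long-t5-tglobal-base")

transcript = "..."  # your full meeting transcript as a single string

# max_length / min_length are measured in tokens; keep them small for a 5-10 word title
result = summarizer(transcript, max_length=16, min_length=4)
print(result[0]["summary_text"])
```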

If you don't know how to calculate the number of tokens your documents take up, you can do it like this:
```
from transformers import AutoTokenizer
from datasets import Dataset
import torch

tokenizer = AutoTokenizer.from_pretrained(your_model_name)
data = Dataset.from_dict(your_data)  # if you do not know how to create a dataset from your data, look at the datasets section in the tutorials

def tokenize_input(inputs):
    # no padding or truncation, so we see each document's true length
    return tokenizer(inputs[your_text_column_name], padding=False, truncation=False)

tokenized_data = data.map(tokenize_input, batched=True, remove_columns=data.column_names)
token_lengths = torch.tensor([len(ids) for ids in tokenized_data['input_ids']], dtype=torch.float)
print(f'Average number of tokens: {token_lengths.mean()}\nStandard deviation: {token_lengths.std()}\nLongest document token count: {token_lengths.max()}')
```
Please note that the number of tokens per document may differ between tokenizers.
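To see this in practice, here is a small illustration (the two model names are arbitrary examples):

```
from transformers import AutoTokenizer

text = "Your transcript text goes here."
for name in ["bert-base-uncased", "t5-small"]:
    tok = AutoTokenizer.from_pretrained(name)
    # the same text yields a different token count under each tokenizer
    print(name, len(tok(text)["input_ids"]))
```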
