Generate short summary of transcript in 5 words

#6
by praful-soni

While the current model adeptly generates titles from articles, I require a model for summarizing transcripts in 5-10 words.
An LLM, though effective, is resource-intensive and takes 1-2 minutes per generation.
My attempts to fine-tune BERT, BART, and T5 models yielded suboptimal results. I have 1-hour meeting transcripts for training but struggle to achieve better accuracy. I am seeking guidance on optimizing the model for better performance.

Have you looked at how many tokens your transcripts have on average?
For longer documents, most models' context length is simply too short (most BERT models have a maximum context length of 512 tokens; if your transcripts are longer than that, you might have to look for models with larger context lengths).

You can check the maximum number of tokens for a given model with the following code:

```
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(your_model_name)
print(tokenizer.model_max_length)
```
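For a standard BERT checkpoint this prints 512, matching the limit mentioned above. Be aware that some tokenizers print a very large placeholder number here instead when the checkpoint does not define a real limit.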

Also, I believe that encoder-only models like BERT are not made for the task of text generation, since their architecture outputs only a single prediction per input token rather than generating new text, which is rarely enough to create a meaningful title. I suggest you take a look at Text2Text Generation models and fine-tune one with a context size that is large enough to fit your documents, as in the sketch below.
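As a minimal sketch of that direction, assuming a long-context encoder-decoder such as google/long-t5-tglobal-base (the model choice and the length values are my assumptions, and the base checkpoint would still need fine-tuning on your transcripts before the titles are meaningful):

```
from transformers import pipeline

# LongT5 accepts inputs far beyond BERT's 512-token limit (up to ~16k tokens)
summarizer = pipeline("summarization", model="google/long-t5-tglobal-base")

transcript = "..."  # your full meeting transcript as a single string

# max_length / min_length are measured in tokens; keep them small for a 5-10 word title
result = summarizer(transcript, max_length=16, min_length=4)
print(result[0]["summary_text"])
```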

If you don't know how to calculate the number of tokens your documents take up, you can do it like this:
```
from transformers import AutoTokenizer
from datasets import Dataset
import torch

tokenizer = AutoTokenizer.from_pretrained(your_model_name)
data = Dataset.from_dict(your_data)  # if you do not know how to create a dataset from your data, look at the datasets section in the tutorials

def tokenize_input(inputs):
    # no padding or truncation, so we see each document's true length
    return tokenizer(inputs[your_text_column_name], padding=False, truncation=False)

tokenized_data = data.map(tokenize_input, batched=True, remove_columns=data.column_names)
token_lengths = torch.tensor([len(ids) for ids in tokenized_data['input_ids']], dtype=torch.float)
print(f'Average number of tokens: {token_lengths.mean()}\nStandard deviation: {token_lengths.std()}\nLongest document token count: {token_lengths.max()}')
```
Please note that the number of tokens per document may differ between tokenizers.
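To see this in practice, here is a small illustration (the two model names are arbitrary examples):

```
from transformers import AutoTokenizer

text = "Your transcript text goes here."
for name in ["bert-base-uncased", "t5-small"]:
    tok = AutoTokenizer.from_pretrained(name)
    # the same text yields a different token count under each tokenizer
    print(name, len(tok(text)["input_ids"]))
```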
