I have the same problem, and the captions also have incorrect grammar. An image of chunks of ice floating in the sea got:
"arafed ice floess are floating in the water near the shore"
I have no clue what 'arafed' or 'floess' are.
Googling 'arafed' led me to this dataset: https://huggingface.co/datasets/multimodalart/facesyntheticsspigacaptioned
I assume they might have trained on it for some weird reason?