LayoutLMv2: Unexpected visual backbone behavior

#4
by de-Rodrigo - opened

Dear Community, I have been studying LayoutLMv2 for the last couple of months. I recently implemented a saliency-map technique to visualize the most significant visual features for token classification, so I could better understand where in the image the model is looking.
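For reference, here is a minimal sketch of the kind of gradient-based saliency I mean (my actual implementation may differ). It backpropagates the token-classification logits to the image tensor that LayoutLMv2's visual backbone consumes. The checkpoint name "my-finetuned-layoutlmv2" and the file "document.png" are placeholders, and running LayoutLMv2 requires detectron2 to be installed:

```python
import torch
from PIL import Image
from transformers import LayoutLMv2Processor, LayoutLMv2ForTokenClassification

processor = LayoutLMv2Processor.from_pretrained("microsoft/layoutlmv2-base-uncased")
# Placeholder: load whatever token-classification checkpoint you fine-tuned.
model = LayoutLMv2ForTokenClassification.from_pretrained("my-finetuned-layoutlmv2")
model.eval()

image = Image.open("document.png").convert("RGB")  # placeholder document image
encoding = processor(image, return_tensors="pt")   # runs OCR, returns ids/bbox/image

# Track gradients on the pixels that feed the visual backbone.
pixels = encoding["image"].float().requires_grad_(True)
encoding["image"] = pixels

logits = model(**encoding).logits  # (batch, seq_len, num_labels)
# Backpropagate each token's winning class score to the image pixels.
logits.max(dim=-1).values.sum().backward()

# Per-pixel saliency: max absolute gradient across the RGB channels.
saliency = pixels.grad.abs().max(dim=1).values  # (batch, 224, 224)
```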

I have observed surprising behavior in these experiments: LayoutLMv2 relies almost entirely on the model's textual part, while its visual part (the visual backbone) contributes nearly nothing. I could even train LayoutLMv2 on blacked-out images (properly labeled, so the textual part could still work) and obtain the same results as with conventional training. This suggests that the visual backbone is not working correctly (or that I am doing something wrong); a sketch of the blackout ablation follows.
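In case it helps to reproduce, this is roughly what the blackout ablation looks like, assuming batches are keyed the way LayoutLMv2Processor produces them ("image", "input_ids", "bbox", etc.); the function name is mine, not from any library:

```python
import torch

def black_out_images(batch):
    # Zero only the pixels; input_ids, bbox, attention_mask, and labels stay
    # intact, so the textual branch still sees a fully labeled example.
    batch["image"] = torch.zeros_like(batch["image"])
    return batch
```

Applying this to every training batch (e.g., inside the collate function) removes all visual signal, yet in my runs the token-classification metrics matched those of the conventionally trained model.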

Could I share my evidence with you?
Teamwork makes the dream work.
