lm_head.weight missing from checkpoint and inference returns garbage on the bundled example

#2
by jishnu-jp - opened

I'm trying to use TRASER on traffic-CCTV footage. I'm running into an issue I can't get past on my own, and I'd really appreciate any guidance. If I'm missing a required step please point me to it. I may very well be holding it wrong.

I'm following inference.py as documented, on your bundled example:

python inference.py \
--model_path . \
--video_path example/2401075277.mp4 \
--mask_path example/2401075277_rle.json \
--out_dir ./output

The decoded output is 31 tokens and looks like this:

{"cri child",away playingful's " " " "{"cri", "{"{"{"]} "attributes playingfulfulroom",]},{"]}

A few observations from my debugging (I'm not sure whether any of this is expected behaviour):

  1. In model.safetensors.index.json I see 888 tensors in total, including model.embed_tokens.weight and every perceiver_resampler.* / second_perceiver_resampler.* tensor, but lm_head.weight is not listed in either shard.

  2. In config.json, the outer Qwen2_5_VLConfig doesn't set tie_word_embeddings. The inner text_config has it as true, but my understanding is that from_pretrained reads the outer one (which would default to false here).

  3. After loading, model.lm_head.weight.abs().sum() comes out as 0.0 while model.model.language_model.embed_tokens.weight.abs().sum() is around 5.45M, which makes me think the LM head was never populated from the checkpoint (an exact 0.0 suggests all-zero weights rather than trained or tied values).
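In case it helps anyone reproduce observation 1, here is a minimal sketch of how the index can be inspected. It assumes the usual sharded-checkpoint layout, where the index JSON has a top-level "weight_map" from tensor name to shard file. The helper names (summarize_index, summarize_index_file) are my own and not part of the repo.

```python
import json
from pathlib import Path


def summarize_index(index: dict) -> dict:
    """Summarize a safetensors index dict (assumed to carry a 'weight_map')."""
    weight_map = index.get("weight_map", {})
    return {
        # Number of tensor names listed in the index.
        "total": len(weight_map),
        # True only if the exact key 'lm_head.weight' is present.
        "has_lm_head": "lm_head.weight" in weight_map,
        # Any embedding tensors, whatever prefix they are stored under.
        "embed_names": sorted(n for n in weight_map if n.endswith("embed_tokens.weight")),
        # Distinct shard files referenced by the index.
        "shards": sorted(set(weight_map.values())),
    }


def summarize_index_file(path) -> dict:
    """Load the index JSON from disk and summarize it."""
    with open(path, "r", encoding="utf-8") as f:
        return summarize_index(json.load(f))


if __name__ == "__main__":
    index_path = Path("model.safetensors.index.json")  # run from the model directory
    if index_path.exists():
        print(summarize_index_file(index_path))
```

On my copy, this kind of check is what shows the embedding tensor present and lm_head.weight absent.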

I tried two workarounds in case I was misreading the situation:

  • Adding "tie_word_embeddings": true to the outer config and re-loading.
  • Manually doing model.lm_head.weight = model.model.language_model.embed_tokens.weight after from_pretrained.
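For reference, the first workaround can be sketched as a small config patch. This is only an illustration of what I did by hand: with_outer_tie_flag is a hypothetical helper, and whether from_pretrained actually honours the outer flag for this model class is an assumption that may depend on the transformers version.

```python
import copy


def with_outer_tie_flag(config: dict) -> dict:
    """Return a copy of config with a top-level tie_word_embeddings entry.

    If the outer flag is already set it is left alone; otherwise it is
    copied from text_config (falling back to False if that is missing too).
    """
    patched = copy.deepcopy(config)  # never mutate the caller's dict
    if "tie_word_embeddings" not in patched:
        inner = patched.get("text_config", {})
        patched["tie_word_embeddings"] = bool(inner.get("tie_word_embeddings", False))
    return patched


# Usage sketch (paths are assumptions; back up config.json first):
#   import json
#   with open("config.json") as f:
#       cfg = json.load(f)
#   with open("config.json", "w") as f:
#       json.dump(with_outer_tie_flag(cfg), f, indent=2)
```

The second workaround is the one-line assignment quoted above, applied to the loaded model after from_pretrained.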

With both applied, the output does change. The perceivers seem to surface real visual concepts (tokens like "motorcycle", "scooter", "rider", "cycl" appear in the prefix), but the rest of the output collapses into long runs of the digit 0 (≈96% of a 4,200-character output on a different clip).
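The percentage above is just the share of the output taken up by its most frequent character. A rough sketch of that measurement, with dominant_char_share being my own throwaway helper rather than anything from the repo:

```python
from collections import Counter


def dominant_char_share(text: str):
    """Return (most frequent character, its fraction of the text).

    Returns (None, 0.0) for empty input.
    """
    if not text:
        return None, 0.0
    char, count = Counter(text).most_common(1)[0]
    return char, count / len(text)


# Example on synthetic text (not the real model output):
sample = "rider scooter " + "0" * 200
char, share = dominant_char_share(sample)
print(f"dominant character {char!r} covers {share:.1%} of the output")
```

A share this high for a single character is what I mean by the output collapsing.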

My read is that the visual side is working but the LM head isn't really tied. That would be expected if the head was trained separately and simply wasn't included in the upload, but I'm honestly not sure.

I'd be very grateful for any pointers.
Thanks again for the work.

Hey, thanks for your interest! Could you please share the version of your transformers package? We are using 4.54.0, and I guess module names may vary across versions, so some weights could fail to load. As for the tie_word_embeddings setting, I think the default value is True in transformers.
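If it helps, something like the following should print the installed versions. It is only a sketch: installed_version is an illustrative helper, and listing torch and safetensors alongside transformers is my assumption about what else might be relevant.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str):
    """Return the installed version string of a package, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


for name in ("transformers", "torch", "safetensors"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```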
