Normalize Nemotron Parse images in the processor for vLLM compatibility

#6

Summary

Move Nemotron Parse image normalization into the Hugging Face processor so the checkpoint matches vLLM’s RADIO input contract while preserving Transformers generation output.

Background context

vLLM has changed how Nemotron Parse image preprocessing is handled across recent PRs:

  • Initial Nemotron Parse support landed in vLLM PR #30864.
  • vLLM PR #37260 moved the custom Nemotron Parse processor into vllm/transformers_utils/processors/nemotron_parse.py. At this point, CLIP image normalization was still implemented in vLLM’s local processor.
  • vLLM PR #37456 removed the local Nemotron Parse processor and its custom get_hf_processor() override, so vLLM now relies on the processor loaded from the HF checkpoint.
  • vLLM’s RADIO implementation still skips input_conditioner.* weights because it expects image normalization to happen before tensors reach the model.
  • As a result, vLLM's Nemotron Parse processing now goes through the generic HF processor path.
  • vLLM PR #38748 addresses a Transformers v5 image_size tuple/scalar compatibility issue, but does not restore the removed image normalization behavior.

This PR makes the HF checkpoint’s processor perform the normalization that vLLM already expects, while updating the HF model path to avoid applying the RADIO input conditioner a second time. This keeps the final generated output unchanged in Transformers while restoring correct behavior in recent vLLM releases.

Changes

  • Add CLIP mean/std normalization support to NemotronParseImageProcessor.
  • Set do_normalize=true, image_mean, and image_std in preprocessor_config.json.
  • Add encoder.processor_normalizes=true to signal that RADIO preprocessing is handled by the processor.
  • Call RADIO’s existing make_preprocessor_external() in the HF model path to avoid double normalization in Transformers.
  • Cast normalized pixels to the encoder dtype after externalizing RADIO preprocessing.
  • Return BatchFeature from the combined processor output.
  • Add a vLLM golden generation test.
  • Add .dockerignore to avoid copying local model weights into Docker builds.
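
The interaction between the first four changes can be sketched with a toy example. This is not the RADIO or NemotronParseImageProcessor code: ToyEncoder and normalize are hypothetical stand-ins, and the mean/std values are the commonly published OpenAI CLIP statistics, assumed here for illustration (the checkpoint's preprocessor_config.json is the authoritative source). The sketch only shows why normalization has to happen exactly once, and how a flag like processor_normalizes lets the model path skip its built-in step.

```python
import numpy as np

# Assumption: the commonly published OpenAI CLIP statistics. The values that
# actually ship with the checkpoint live in preprocessor_config.json.
CLIP_MEAN = np.array([0.48145466, 0.4578275, 0.40821073], dtype=np.float32)
CLIP_STD = np.array([0.26862954, 0.26130258, 0.27577711], dtype=np.float32)


def normalize(pixels: np.ndarray) -> np.ndarray:
    """Apply per-channel (x - mean) / std to a (C, H, W) array."""
    return (pixels - CLIP_MEAN[:, None, None]) / CLIP_STD[:, None, None]


class ToyEncoder:
    """Hypothetical stand-in for an encoder with a built-in input conditioner."""

    def __init__(self, processor_normalizes: bool):
        self.processor_normalizes = processor_normalizes

    def forward(self, pixels: np.ndarray) -> np.ndarray:
        # Normalize inside the model only when the processor has not done so.
        if not self.processor_normalizes:
            pixels = normalize(pixels)
        return pixels.mean(axis=(1, 2))  # placeholder "features", one per channel


rng = np.random.default_rng(0)
raw = rng.random((3, 4, 4), dtype=np.float32)  # raw pixels in [0, 1)

# Old contract: raw pixels in, the model normalizes internally.
old_path = ToyEncoder(processor_normalizes=False).forward(raw)
# New contract: the processor normalizes, the model skips its conditioner.
new_path = ToyEncoder(processor_normalizes=True).forward(normalize(raw))
# The failure mode being avoided: normalized pixels fed to a model that
# normalizes again.
double = ToyEncoder(processor_normalizes=False).forward(normalize(raw))
```

In this sketch old_path and new_path agree, while double does not, which mirrors the goal of keeping Transformers output unchanged while moving the normalization step into the processor.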

Compatibility

For normal Transformers usage with AutoProcessor + AutoModel, final generation output is unchanged.

The intentional contract change is that:

processor(...).pixel_values

now returns CLIP-normalized tensors rather than raw [0, 1] tensors.

Direct callers that bypass AutoProcessor and pass pixel_values manually should pass normalized tensors for this checkpoint.
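
For callers in that situation, a minimal sketch of the conversion might look like the following. The helper names to_normalized and to_raw are hypothetical, the input is assumed to be an (N, C, H, W) batch of raw [0, 1] values, and the constants are again the commonly published OpenAI CLIP statistics rather than values read from this checkpoint, so check preprocessor_config.json before relying on them.

```python
import numpy as np

# Assumption: commonly published OpenAI CLIP statistics; confirm against the
# image_mean / image_std entries in the checkpoint's preprocessor_config.json.
CLIP_MEAN = np.array([0.48145466, 0.4578275, 0.40821073], dtype=np.float32)
CLIP_STD = np.array([0.26862954, 0.26130258, 0.27577711], dtype=np.float32)


def to_normalized(pixel_values: np.ndarray, dtype=np.float32) -> np.ndarray:
    """Convert a raw [0, 1] batch of shape (N, C, H, W) to normalized values.

    The cast happens after normalization, so the arithmetic runs in float32.
    """
    mean = CLIP_MEAN.reshape(1, 3, 1, 1)
    std = CLIP_STD.reshape(1, 3, 1, 1)
    return ((pixel_values - mean) / std).astype(dtype)


def to_raw(normalized: np.ndarray) -> np.ndarray:
    """Inverse of to_normalized, for callers that still need [0, 1] values."""
    mean = CLIP_MEAN.reshape(1, 3, 1, 1)
    std = CLIP_STD.reshape(1, 3, 1, 1)
    return normalized.astype(np.float32) * std + mean


raw_batch = np.full((2, 3, 8, 8), 0.5, dtype=np.float32)  # mid-gray images
normalized = to_normalized(raw_batch)
```

The same arithmetic applies to framework tensors; only the reshape and cast calls change.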

Validation

Transformers golden tests:

  • transformers==4.51.3: pass
  • transformers==4.57.6: pass
  • transformers==5.5.4: pass
  • transformers==5.7.0: pass

vLLM golden tests:

  • vllm==0.20.0: pass
  • vllm==0.19.1: pass
  • vllm==0.17.0: pass
  • vllm==0.14.1: pass

Known vLLM failures unrelated to this normalization issue:

  • vllm==0.18.0 and vllm==0.18.1 fail during weight loading with a decoder shard-id error before generation.
emelryan changed pull request status to open

Looks good to me. Validated on OmniDocBench: the change does not affect results for Transformers, and it reduces the discrepancy between Transformers and vLLM.

emelryan changed pull request status to merged
