Check roberta-base ...
--------------------------Checking logits match--------------------------
Flax logits shape: (2, 64, 50265), PyTorch logits shape: torch.Size([2, 64, 50265])
✅ Difference between Flax and PyTorch is 0.00013017654418945312 (< 0.01)
--------------------------Checking losses match--------------------------
Flax loss: 14.801228523254395, PyTorch loss: 14.801219940185547
✅ Difference between Flax and PyTorch is 8.58306884765625e-06 (< 0.01)
--------------------------Checking gradients match--------------------------
✅ All grads pass
--------------------------Checking rel gradients match--------------------------
❌ Layer ('roberta', 'encoder', 'layer', '0', 'attention', 'self', 'key', 'bias') has PT grad norm 6.889232651019483e-08 and flax grad norm 5.7956174970286156e-08.
...
=========================================
Check bert-base-cased ...
--------------------------Checking logits match--------------------------
Flax logits shape: (2, 64, 28996), PyTorch logits shape: torch.Size([2, 64, 28996])
✅ Difference between Flax and PyTorch is 5.4836273193359375e-05 (< 0.01)
--------------------------Checking losses match--------------------------
Flax loss: 13.967159271240234, PyTorch loss: 13.967162132263184
✅ Difference between Flax and PyTorch is 2.86102294921875e-06 (< 0.01)
--------------------------Checking gradients match--------------------------
✅ All grads pass
--------------------------Checking rel gradients match--------------------------
❌ Layer ('bert', 'encoder', 'layer', '0', 'attention', 'self', 'key', 'bias') has PT grad norm 8.025740783068613e-08 and flax grad norm 8.381563532111613e-08.
...
=========================================
Check t5-small ...
--------------------------Checking logits match--------------------------
Flax logits shape: (2, 64, 32128), PyTorch logits shape: torch.Size([2, 64, 32128])
✅ Difference between Flax and PyTorch is 7.62939453125e-05 (< 0.01)
--------------------------Checking losses match--------------------------
Flax loss: 20.534835815429688, PyTorch loss: 20.534835815429688
✅ Difference between Flax and PyTorch is 0.0 (< 0.01)
--------------------------Checking gradients match--------------------------
✅ All grads pass
--------------------------Checking rel gradients match--------------------------
✅ All rel grads pass
=========================================
Check facebook/bart-large ...
--------------------------Checking logits match--------------------------
Flax logits shape: (2, 64, 50265), PyTorch logits shape: torch.Size([2, 64, 50265])
✅ Difference between Flax and PyTorch is 0.0004191398620605469 (< 0.01)
--------------------------Checking losses match--------------------------
Flax loss: 13.993148803710938, PyTorch loss: 13.993138313293457
✅ Difference between Flax and PyTorch is 1.049041748046875e-05 (< 0.01)
--------------------------Checking gradients match--------------------------
❌ Layer ('model', 'decoder', 'layers', '0', 'fc1', 'kernel') has PT grad norm 11.655710220336914 and flax grad norm 11.6015625.
❌ Layer ('model', 'decoder', 'layers', '0', 'fc2', 'kernel') has PT grad norm 7.740886211395264 and flax grad norm 7.71484375.
❌ Layer ('model', 'decoder', 'layers', '10', 'self_attn', 'v_proj', 'kernel') has PT grad norm 6.97633171081543 and flax grad norm 6.96484375.
...
--------------------------Checking rel gradients match--------------------------
❌ Layer ('final_logits_bias',) has PT grad norm 0.0 and flax grad norm 0.0.
❌ Layer ('model', 'decoder', 'layers', '0', 'encoder_attn', 'k_proj', 'bias') has PT grad norm 8.274865592738934e-08 and flax grad norm 0.0.
❌ Layer ('model', 'decoder', 'layers', '0', 'self_attn', 'k_proj', 'bias') has PT grad norm 2.2391466458770992e-08 and flax grad norm 0.0.
❌ Layer ('model', 'encoder', 'layers', '0', 'self_attn', 'k_proj', 'bias') has PT grad norm 8.3155640595578e-08 and flax grad norm 0.0.
...
=========================================
Check facebook/bart-large-cnn ...
--------------------------Checking logits match--------------------------
Flax logits shape: (2, 64, 50264), PyTorch logits shape: torch.Size([2, 64, 50264])
✅ Difference between Flax and PyTorch is 0.0003502368927001953 (< 0.01)
--------------------------Checking losses match--------------------------
Flax loss: 13.418181419372559, PyTorch loss: 13.418176651000977
✅ Difference between Flax and PyTorch is 4.76837158203125e-06 (< 0.01)
--------------------------Checking gradients match--------------------------
✅ All grads pass
--------------------------Checking rel gradients match--------------------------
❌ Layer ('model', 'decoder', 'layers', '0', 'encoder_attn', 'k_proj', 'bias') has PT grad norm 3.5387660091146245e-07 and flax grad norm 4.874667069998395e-07.
❌ Layer ('model', 'decoder', 'layers', '0', 'self_attn', 'k_proj', 'bias') has PT grad norm 6.254911966152576e-08 and flax grad norm 6.927437112835833e-08.
❌ Layer ('model', 'encoder', 'layers', '0', 'self_attn', 'k_proj', 'bias') has PT grad norm 5.864935914701164e-08 and flax grad norm 6.345069891722233e-08.
...
=========================================
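
The difference between the "gradients" and "rel gradients" checks above can be illustrated with a small sketch. This is not the script that produced the log: the helper names `abs_check` and `rel_check`, the relative tolerance of 0.01, and the use of NumPy are all assumptions made for illustration. Only the 0.01 absolute threshold appears in the output. Under those assumptions, it shows how a pair of gradient norms near 1e-08 can pass an absolute comparison and still fail a relative one, and how a 0.0 versus 0.0 pair could be reported as a failure if the relative difference evaluates to NaN.

```python
import numpy as np

ABS_TOL = 0.01  # threshold printed in the log ("< 0.01")
REL_TOL = 0.01  # hypothetical: the log does not state the relative threshold


def abs_check(pt_norm, flax_norm, tol=ABS_TOL):
    """Pass when the two gradient norms differ by less than tol."""
    return bool(abs(pt_norm - flax_norm) < tol)


def rel_check(pt_norm, flax_norm, tol=REL_TOL):
    """Pass when the difference, scaled by the PyTorch norm, is below tol."""
    pt = np.float64(pt_norm)
    with np.errstate(divide="ignore", invalid="ignore"):
        rel = np.abs(pt - flax_norm) / np.abs(pt)
    # NaN (from 0/0) compares False against tol, so it counts as a failure.
    return bool(rel < tol)


# Norms taken from the roberta-base key-bias line in the log.
pt_norm, flax_norm = 6.889232651019483e-08, 5.7956174970286156e-08
print("abs check passes:", abs_check(pt_norm, flax_norm))
print("rel check passes:", rel_check(pt_norm, flax_norm))
print("rel check on 0.0 vs 0.0 passes:", rel_check(0.0, 0.0))
```

A check built this way is sensitive to floating-point noise whenever both norms are close to zero, which may be worth keeping in mind when reading the key-bias failures in the log.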