Check roberta-base ...
--------------------------Checking logits match--------------------------
Flax logits shape: (2, 64, 50265), PyTorch logits shape: torch.Size([2, 64, 50265])
✅ Difference between Flax and PyTorch is 0.00013017654418945312 (< 0.01)
--------------------------Checking losses match--------------------------
Flax loss: 14.801228523254395, PyTorch loss: 14.801219940185547
✅ Difference between Flax and PyTorch is 8.58306884765625e-06 (< 0.01)
--------------------------Checking gradients match--------------------------
✅ All grads pass
--------------------------Checking rel gradients match--------------------------
❌ Layer ('roberta', 'encoder', 'layer', '0', 'attention', 'self', 'key', 'bias') has PT grad norm 6.889232651019483e-08 and flax grad norm 5.7956174970286156e-08.
...
=========================================
Check bert-base-cased ...
--------------------------Checking logits match--------------------------
Flax logits shape: (2, 64, 28996), PyTorch logits shape: torch.Size([2, 64, 28996])
✅ Difference between Flax and PyTorch is 5.4836273193359375e-05 (< 0.01)
--------------------------Checking losses match--------------------------
Flax loss: 13.967159271240234, PyTorch loss: 13.967162132263184
✅ Difference between Flax and PyTorch is 2.86102294921875e-06 (< 0.01)
--------------------------Checking gradients match--------------------------
✅ All grads pass
--------------------------Checking rel gradients match--------------------------
❌ Layer ('bert', 'encoder', 'layer', '0', 'attention', 'self', 'key', 'bias') has PT grad norm 8.025740783068613e-08 and flax grad norm 8.381563532111613e-08.
...
=========================================
Check t5-small ...
--------------------------Checking logits match--------------------------
Flax logits shape: (2, 64, 32128), PyTorch logits shape: torch.Size([2, 64, 32128])
✅ Difference between Flax and PyTorch is 7.62939453125e-05 (< 0.01)
--------------------------Checking losses match--------------------------
Flax loss: 20.534835815429688, PyTorch loss: 20.534835815429688
✅ Difference between Flax and PyTorch is 0.0 (< 0.01)
--------------------------Checking gradients match--------------------------
✅ All grads pass
--------------------------Checking rel gradients match--------------------------
✅ All rel grads pass
=========================================
Check facebook/bart-large ...
--------------------------Checking logits match--------------------------
Flax logits shape: (2, 64, 50265), PyTorch logits shape: torch.Size([2, 64, 50265])
✅ Difference between Flax and PyTorch is 0.0004191398620605469 (< 0.01)
--------------------------Checking losses match--------------------------
Flax loss: 13.993148803710938, PyTorch loss: 13.993138313293457
✅ Difference between Flax and PyTorch is 1.049041748046875e-05 (< 0.01)
--------------------------Checking gradients match--------------------------
❌ Layer ('model', 'decoder', 'layers', '0', 'fc1', 'kernel') has PT grad norm 11.655710220336914 and flax grad norm 11.6015625.
❌ Layer ('model', 'decoder', 'layers', '0', 'fc2', 'kernel') has PT grad norm 7.740886211395264 and flax grad norm 7.71484375.
❌ Layer ('model', 'decoder', 'layers', '10', 'self_attn', 'v_proj', 'kernel') has PT grad norm 6.97633171081543 and flax grad norm 6.96484375.
...
--------------------------Checking rel gradients match--------------------------
❌ Layer ('final_logits_bias',) has PT grad norm 0.0 and flax grad norm 0.0.
❌ Layer ('model', 'decoder', 'layers', '0', 'encoder_attn', 'k_proj', 'bias') has PT grad norm 8.274865592738934e-08 and flax grad norm 0.0.
❌ Layer ('model', 'decoder', 'layers', '0', 'self_attn', 'k_proj', 'bias') has PT grad norm 2.2391466458770992e-08 and flax grad norm 0.0.
❌ Layer ('model', 'encoder', 'layers', '0', 'self_attn', 'k_proj', 'bias') has PT grad norm 8.3155640595578e-08 and flax grad norm 0.0.
...
=========================================
Check facebook/bart-large-cnn ...
--------------------------Checking logits match--------------------------
Flax logits shape: (2, 64, 50264), PyTorch logits shape: torch.Size([2, 64, 50264])
✅ Difference between Flax and PyTorch is 0.0003502368927001953 (< 0.01)
--------------------------Checking losses match--------------------------
Flax loss: 13.418181419372559, PyTorch loss: 13.418176651000977
✅ Difference between Flax and PyTorch is 4.76837158203125e-06 (< 0.01)
--------------------------Checking gradients match--------------------------
✅ All grads pass
--------------------------Checking rel gradients match--------------------------
❌ Layer ('model', 'decoder', 'layers', '0', 'encoder_attn', 'k_proj', 'bias') has PT grad norm 3.5387660091146245e-07 and flax grad norm 4.874667069998395e-07.
❌ Layer ('model', 'decoder', 'layers', '0', 'self_attn', 'k_proj', 'bias') has PT grad norm 6.254911966152576e-08 and flax grad norm 6.927437112835833e-08.
❌ Layer ('model', 'encoder', 'layers', '0', 'self_attn', 'k_proj', 'bias') has PT grad norm 5.864935914701164e-08 and flax grad norm 6.345069891722233e-08.
...
=========================================