
Check roberta-base ...

--------------------------Checking logits match--------------------------

Flax logits shape: (2, 64, 50265), PyTorch logits shape: torch.Size([2, 64, 50265])

✅ Difference between Flax and PyTorch is 0.00013017654418945312 (< 0.01)

--------------------------Checking losses match--------------------------

Flax loss: 14.801228523254395, PyTorch loss: 14.801219940185547

✅ Difference between Flax and PyTorch is 8.58306884765625e-06 (< 0.01)

--------------------------Checking gradients match--------------------------

✅ All grads pass

--------------------------Checking rel gradients match--------------------------

❌ Layer ('roberta', 'encoder', 'layer', '0', 'attention', 'self', 'key', 'bias') has PT grad norm 6.889232651019483e-08 and flax grad norm 5.7956174970286156e-08.

...

=========================================
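
Every per-model block in this log appears to follow the same recipe: run one batch through the Flax and the PyTorch versions of a checkpoint and compare the outputs against a 0.01 tolerance. The sketch below shows one way the roberta-base logits check could be reproduced; the random (2, 64) batch, the masked-LM head, and the use of the maximum absolute difference are assumptions, not the exact script behind this log.

```python
import numpy as np
import torch
from transformers import FlaxRobertaForMaskedLM, RobertaForMaskedLM

model_name = "roberta-base"
pt_model = RobertaForMaskedLM.from_pretrained(model_name)
fx_model = FlaxRobertaForMaskedLM.from_pretrained(model_name)

# Same random token ids for both frameworks, shaped like the (2, 64) batch in the log.
rng = np.random.default_rng(0)
input_ids = rng.integers(0, pt_model.config.vocab_size, size=(2, 64))

with torch.no_grad():
    pt_logits = pt_model(torch.tensor(input_ids)).logits.numpy()
fx_logits = np.asarray(fx_model(input_ids).logits)

# Maximum absolute element-wise difference, checked against the 0.01 threshold.
diff = np.abs(fx_logits - pt_logits).max()
print(f"Difference between Flax and PyTorch is {diff} (< 0.01: {diff < 1e-2})")
```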

Check bert-base-cased ...

--------------------------Checking logits match--------------------------

Flax logits shape: (2, 64, 28996), PyTorch logits shape: torch.Size([2, 64, 28996])

✅ Difference between Flax and PyTorch is 5.4836273193359375e-05 (< 0.01)

--------------------------Checking losses match--------------------------

Flax loss: 13.967159271240234, PyTorch loss: 13.967162132263184

✅ Difference between Flax and PyTorch is 2.86102294921875e-06 (< 0.01)

--------------------------Checking gradients match--------------------------

✅ All grads pass

--------------------------Checking rel gradients match--------------------------

❌ Layer ('bert', 'encoder', 'layer', '0', 'attention', 'self', 'key', 'bias') has PT grad norm 8.025740783068613e-08 and flax grad norm 8.381563532111613e-08.

...

=========================================

Check t5-small ...

--------------------------Checking logits match--------------------------

Flax logits shape: (2, 64, 32128), PyTorch logits shape: torch.Size([2, 64, 32128])

✅ Difference between Flax and PyTorch is 7.62939453125e-05 (< 0.01)

--------------------------Checking losses match--------------------------

Flax loss: 20.534835815429688, PyTorch loss: 20.534835815429688

✅ Difference between Flax and PyTorch is 0.0 (< 0.01)

--------------------------Checking gradients match--------------------------

✅ All grads pass

--------------------------Checking rel gradients match--------------------------

✅ All rel grads pass

=========================================
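
The tuple-style layer names in the gradient sections, e.g. ('roberta', 'encoder', 'layer', '0', ...), look like flattened Flax parameter paths. Below is a minimal sketch of how per-layer gradient norms keyed by such paths could be produced, assuming `grads` is a Flax gradient pytree (a nested dict of arrays, e.g. the output of jax.grad over the model's loss); this is illustrative, not the script's actual code.

```python
import jax.numpy as jnp
from flax.traverse_util import flatten_dict

def per_layer_grad_norms(grads):
    """Map each flattened parameter path to the L2 norm of its gradient leaf."""
    return {path: float(jnp.linalg.norm(leaf)) for path, leaf in flatten_dict(grads).items()}

# Toy example (hypothetical pytree, not an actual model's gradients):
toy_grads = {"encoder": {"layer": {"0": {"kernel": jnp.ones((2, 2))}}}}
print(per_layer_grad_norms(toy_grads))  # {('encoder', 'layer', '0', 'kernel'): 2.0}
```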

Check facebook/bart-large ...

--------------------------Checking logits match--------------------------

Flax logits shape: (2, 64, 50265), PyTorch logits shape: torch.Size([2, 64, 50265])

✅ Difference between Flax and PyTorch is 0.0004191398620605469 (< 0.01)

--------------------------Checking losses match--------------------------

Flax loss: 13.993148803710938, PyTorch loss: 13.993138313293457

✅ Difference between Flax and PyTorch is 1.049041748046875e-05 (< 0.01)

--------------------------Checking gradients match--------------------------

❌ Layer ('model', 'decoder', 'layers', '0', 'fc1', 'kernel') has PT grad norm 11.655710220336914 and flax grad norm 11.6015625.

❌ Layer ('model', 'decoder', 'layers', '0', 'fc2', 'kernel') has PT grad norm 7.740886211395264 and flax grad norm 7.71484375.

❌ Layer ('model', 'decoder', 'layers', '10', 'self_attn', 'v_proj', 'kernel') has PT grad norm 6.97633171081543 and flax grad norm 6.96484375.

...

--------------------------Checking rel gradients match--------------------------

❌ Layer ('final_logits_bias',) has PT grad norm 0.0 and flax grad norm 0.0.

❌ Layer ('model', 'decoder', 'layers', '0', 'encoder_attn', 'k_proj', 'bias') has PT grad norm 8.274865592738934e-08 and flax grad norm 0.0.

❌ Layer ('model', 'decoder', 'layers', '0', 'self_attn', 'k_proj', 'bias') has PT grad norm 2.2391466458770992e-08 and flax grad norm 0.0.

❌ Layer ('model', 'encoder', 'layers', '0', 'self_attn', 'k_proj', 'bias') has PT grad norm 8.3155640595578e-08 and flax grad norm 0.0.

...

=========================================
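
For facebook/bart-large several layers fail the absolute gradient-norm check while the relative check instead flags near-zero biases. The sketch below shows the kind of comparison that could produce that split; the 0.01 gradient tolerance and the helper itself are assumptions, not the script's actual logic.

```python
def grad_norms_match(pt_norm: float, fx_norm: float, tol: float = 1e-2):
    """Return (abs_ok, rel_ok) for one layer's PyTorch vs. Flax gradient norms."""
    abs_diff = abs(pt_norm - fx_norm)
    # The relative difference blows up (or is undefined) for near-zero norms, which
    # would explain why layers like ('final_logits_bias',) or the k_proj biases are
    # the ones flagged in the "rel gradients" sections.
    denom = max(pt_norm, fx_norm)
    rel_diff = abs_diff / denom if denom > 0 else float("inf")
    return abs_diff < tol, rel_diff < tol

# The decoder fc1 kernel reported above: a ~0.05 absolute gap, small relative to ~11.6.
print(grad_norms_match(11.655710220336914, 11.6015625))  # (False, True)
```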

Check facebook/bart-large-cnn ...

--------------------------Checking logits match--------------------------

Flax logits shape: (2, 64, 50264), PyTorch logits shape: torch.Size([2, 64, 50264])

✅ Difference between Flax and PyTorch is 0.0003502368927001953 (< 0.01)

--------------------------Checking losses match--------------------------

Flax loss: 13.418181419372559, PyTorch loss: 13.418176651000977

✅ Difference between Flax and PyTorch is 4.76837158203125e-06 (< 0.01)

--------------------------Checking gradients match--------------------------

✅ All grads pass

--------------------------Checking rel gradients match--------------------------

❌ Layer ('model', 'decoder', 'layers', '0', 'encoder_attn', 'k_proj', 'bias') has PT grad norm 3.5387660091146245e-07 and flax grad norm 4.874667069998395e-07.

❌ Layer ('model', 'decoder', 'layers', '0', 'self_attn', 'k_proj', 'bias') has PT grad norm 6.254911966152576e-08 and flax grad norm 6.927437112835833e-08.

❌ Layer ('model', 'encoder', 'layers', '0', 'self_attn', 'k_proj', 'bias') has PT grad norm 5.864935914701164e-08 and flax grad norm 6.345069891722233e-08.

...

=========================================