Check roberta-base ...
--------------------------Checking logits match--------------------------
Flax logits shape: (2, 64, 50265), PyTorch logits shape: torch.Size([2, 64, 50265])
✅ Difference between Flax and PyTorch is 0.00013017654418945312 (< 0.01)
--------------------------Checking losses match--------------------------
Flax loss: 14.801228523254395, PyTorch loss: 14.801219940185547
✅ Difference between Flax and PyTorch is 8.58306884765625e-06 (< 0.01)
--------------------------Checking gradients match--------------------------
✅ All grads pass
--------------------------Checking rel gradients match--------------------------
❌ Layer ('roberta', 'encoder', 'layer', '0', 'attention', 'self', 'key', 'bias') has PT grad norm 6.889232651019483e-08 and flax grad norm 5.7956174970286156e-08.
...
=========================================
Check bert-base-cased ...
--------------------------Checking logits match--------------------------
Flax logits shape: (2, 64, 28996), PyTorch logits shape: torch.Size([2, 64, 28996])
✅ Difference between Flax and PyTorch is 5.4836273193359375e-05 (< 0.01)
--------------------------Checking losses match--------------------------
Flax loss: 13.967159271240234, PyTorch loss: 13.967162132263184
✅ Difference between Flax and PyTorch is 2.86102294921875e-06 (< 0.01)
--------------------------Checking gradients match--------------------------
✅ All grads pass
--------------------------Checking rel gradients match--------------------------
❌ Layer ('bert', 'encoder', 'layer', '0', 'attention', 'self', 'key', 'bias') has PT grad norm 8.025740783068613e-08 and flax grad norm 8.381563532111613e-08.
...
=========================================
Check t5-small ...
--------------------------Checking logits match--------------------------
Flax logits shape: (2, 64, 32128), PyTorch logits shape: torch.Size([2, 64, 32128])
✅ Difference between Flax and PyTorch is 7.62939453125e-05 (< 0.01)
--------------------------Checking losses match--------------------------
Flax loss: 20.534835815429688, PyTorch loss: 20.534835815429688
✅ Difference between Flax and PyTorch is 0.0 (< 0.01)
--------------------------Checking gradients match--------------------------
✅ All grads pass
--------------------------Checking rel gradients match--------------------------
✅ All rel grads pass
=========================================
Check facebook/bart-large ...
--------------------------Checking logits match--------------------------
Flax logits shape: (2, 64, 50265), PyTorch logits shape: torch.Size([2, 64, 50265])
✅ Difference between Flax and PyTorch is 0.0004191398620605469 (< 0.01)
--------------------------Checking losses match--------------------------
Flax loss: 13.993148803710938, PyTorch loss: 13.993138313293457
✅ Difference between Flax and PyTorch is 1.049041748046875e-05 (< 0.01)
--------------------------Checking gradients match--------------------------
❌ Layer ('model', 'decoder', 'layers', '0', 'fc1', 'kernel') has PT grad norm 11.655710220336914 and flax grad norm 11.6015625.
❌ Layer ('model', 'decoder', 'layers', '0', 'fc2', 'kernel') has PT grad norm 7.740886211395264 and flax grad norm 7.71484375.
❌ Layer ('model', 'decoder', 'layers', '10', 'self_attn', 'v_proj', 'kernel') has PT grad norm 6.97633171081543 and flax grad norm 6.96484375.
...
--------------------------Checking rel gradients match--------------------------
❌ Layer ('final_logits_bias',) has PT grad norm 0.0 and flax grad norm 0.0.
❌ Layer ('model', 'decoder', 'layers', '0', 'encoder_attn', 'k_proj', 'bias') has PT grad norm 8.274865592738934e-08 and flax grad norm 0.0.
❌ Layer ('model', 'decoder', 'layers', '0', 'self_attn', 'k_proj', 'bias') has PT grad norm 2.2391466458770992e-08 and flax grad norm 0.0.
❌ Layer ('model', 'encoder', 'layers', '0', 'self_attn', 'k_proj', 'bias') has PT grad norm 8.3155640595578e-08 and flax grad norm 0.0.
...
=========================================
Check facebook/bart-large-cnn ...
--------------------------Checking logits match--------------------------
Flax logits shape: (2, 64, 50264), PyTorch logits shape: torch.Size([2, 64, 50264])
✅ Difference between Flax and PyTorch is 0.0003502368927001953 (< 0.01)
--------------------------Checking losses match--------------------------
Flax loss: 13.418181419372559, PyTorch loss: 13.418176651000977
✅ Difference between Flax and PyTorch is 4.76837158203125e-06 (< 0.01)
--------------------------Checking gradients match--------------------------
✅ All grads pass
--------------------------Checking rel gradients match--------------------------
❌ Layer ('model', 'decoder', 'layers', '0', 'encoder_attn', 'k_proj', 'bias') has PT grad norm 3.5387660091146245e-07 and flax grad norm 4.874667069998395e-07.
❌ Layer ('model', 'decoder', 'layers', '0', 'self_attn', 'k_proj', 'bias') has PT grad norm 6.254911966152576e-08 and flax grad norm 6.927437112835833e-08.
❌ Layer ('model', 'encoder', 'layers', '0', 'self_attn', 'k_proj', 'bias') has PT grad norm 5.864935914701164e-08 and flax grad norm 6.345069891722233e-08.
...
=========================================
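
For reference, a comparison like the one logged above can be reproduced with a short script along the following lines. This is a minimal sketch, not the exact script that produced this log: the model id, the 0.01 tolerance, the random inputs, and the naive PT<->Flax parameter-name mapping are illustrative assumptions.

# Minimal sketch of a Flax-vs-PyTorch equivalence check (assumed setup:
# bert-base-cased, random int inputs of shape (2, 64), 0.01 tolerance).
import jax
import jax.numpy as jnp
import numpy as np
import torch
from flax.traverse_util import flatten_dict
from transformers import BertForMaskedLM, FlaxBertForMaskedLM

model_id = "bert-base-cased"
pt_model = BertForMaskedLM.from_pretrained(model_id)  # eval mode by default
fx_model = FlaxBertForMaskedLM.from_pretrained(model_id, from_pt=True)  # same weights

rng = np.random.default_rng(0)
input_ids = rng.integers(0, pt_model.config.vocab_size, size=(2, 64))
labels = rng.integers(0, pt_model.config.vocab_size, size=(2, 64))

# -------- logits --------
pt_out = pt_model(input_ids=torch.tensor(input_ids), labels=torch.tensor(labels))
fx_logits = fx_model(jnp.asarray(input_ids)).logits
diff = np.max(np.abs(np.asarray(fx_logits) - pt_out.logits.detach().numpy()))
print(f"logits max abs diff: {diff} ({'OK' if diff < 0.01 else 'FAIL'})")

# -------- loss: Flax heads return logits only, so build the mean NLL by hand
# (matches PT CrossEntropyLoss here because no label is masked with -100) ----
def fx_loss_fn(params):
    logits = fx_model(jnp.asarray(input_ids), params=params).logits
    log_probs = jax.nn.log_softmax(logits, axis=-1)
    tgt = jnp.asarray(labels)[..., None]
    return -jnp.take_along_axis(log_probs, tgt, axis=-1).mean()

fx_loss = fx_loss_fn(fx_model.params)
print(f"loss abs diff: {abs(float(fx_loss) - pt_out.loss.item())}")

# -------- per-layer gradient norms --------
pt_out.loss.backward()
fx_grads = flatten_dict(jax.grad(fx_loss_fn)(fx_model.params))
pt_grads = {tuple(n.split(".")): p.grad for n, p in pt_model.named_parameters()}
for key, grad in fx_grads.items():
    # Flax stores Linear weights as 'kernel' (transposed vs. PT 'weight');
    # Frobenius norms are transpose-invariant, so compare them directly.
    # Names this naive mapping misses (e.g. LayerNorm scale) are skipped.
    pt_key = tuple("weight" if part == "kernel" else part for part in key)
    if pt_key in pt_grads and pt_grads[pt_key] is not None:
        print(key, float(jnp.linalg.norm(grad)), pt_grads[pt_key].norm().item())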