after_fix_pretrained_log.txt
Check roberta-base ...
--------------------------Checking logits match--------------------------
Flax logits shape: (2, 64, 50265), PyTorch logits shape: torch.Size([2, 64, 50265])
βœ… Difference between Flax and PyTorch is 0.00013017654418945312 (< 0.01)
--------------------------Checking losses match--------------------------
Flax loss: 14.801228523254395, PyTorch loss: 14.801219940185547
βœ… Difference between Flax and PyTorch is 8.58306884765625e-06 (< 0.01)
--------------------------Checking gradients match--------------------------
βœ… All grads pass
--------------------------Checking rel gradients match--------------------------
❌ Layer ('roberta', 'encoder', 'layer', '0', 'attention', 'self', 'key', 'bias') has PT grad norm 6.889232651019483e-08 and flax grad norm 5.7956174970286156e-08.
...
=========================================
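
Every block in this log follows the same four-step recipe: run one batch through the Flax and the PyTorch version of a checkpoint and compare logits, losses, absolute gradient norms, and relative gradient norms. A minimal sketch of the logits and loss steps, assuming random token ids (the (2, 64) batch shape and the 0.01 threshold are read off the output above; the real script may feed other inputs):

import numpy as np
import torch
import jax
import jax.numpy as jnp
from transformers import FlaxRobertaForMaskedLM, RobertaForMaskedLM

pt_model = RobertaForMaskedLM.from_pretrained("roberta-base")
fx_model = FlaxRobertaForMaskedLM.from_pretrained("roberta-base", from_pt=True)

# (2, 64) batch of random token ids; labels = input_ids is an assumption
input_ids = np.random.randint(0, pt_model.config.vocab_size, size=(2, 64))
labels = input_ids

with torch.no_grad():
    pt_out = pt_model(torch.tensor(input_ids), labels=torch.tensor(labels))
fx_logits = fx_model(input_ids).logits

logit_diff = np.max(np.abs(pt_out.logits.numpy() - np.asarray(fx_logits)))
print(f"{'βœ…' if logit_diff < 0.01 else '❌'} Difference between Flax and PyTorch is {logit_diff} (< 0.01)")

# the Flax masked-LM head returns no loss, so compute by hand the same mean
# token cross-entropy that PyTorch's labels= path uses
log_probs = jax.nn.log_softmax(fx_logits, axis=-1)
fx_loss = -jnp.mean(jnp.take_along_axis(log_probs, labels[..., None], axis=-1))
print(f"Flax loss: {fx_loss}, PyTorch loss: {pt_out.loss.item()}")
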
Check bert-base-cased ...
--------------------------Checking logits match--------------------------
Flax logits shape: (2, 64, 28996), PyTorch logits shape: torch.Size([2, 64, 28996])
βœ… Difference between Flax and PyTorch is 5.4836273193359375e-05 (< 0.01)
--------------------------Checking losses match--------------------------
Flax loss: 13.967159271240234, PyTorch loss: 13.967162132263184
βœ… Difference between Flax and PyTorch is 2.86102294921875e-06 (< 0.01)
--------------------------Checking gradients match--------------------------
βœ… All grads pass
--------------------------Checking rel gradients match--------------------------
❌ Layer ('bert', 'encoder', 'layer', '0', 'attention', 'self', 'key', 'bias') has PT grad norm 8.025740783068613e-08 and flax grad norm 8.381563532111613e-08.
...
=========================================
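
The gradient check backpropagates the same loss in both frameworks and compares per-parameter gradient norms. A sketch continuing from the snippet above; the pass criterion is an assumption, and the dot-joined key lookup only matches PyTorch names for bias parameters (kernels and embeddings are named and laid out differently, which a real script maps explicitly):

from flax.traverse_util import flatten_dict

def fx_loss_fn(params):
    logits = fx_model(input_ids, params=params).logits
    log_probs = jax.nn.log_softmax(logits, axis=-1)
    return -jnp.mean(jnp.take_along_axis(log_probs, labels[..., None], axis=-1))

fx_grads = flatten_dict(jax.grad(fx_loss_fn)(fx_model.params))

pt_model.zero_grad()
pt_model(torch.tensor(input_ids), labels=torch.tensor(labels)).loss.backward()
pt_grads = {name: p.grad for name, p in pt_model.named_parameters() if p.grad is not None}

for key, fx_grad in fx_grads.items():
    pt_name = ".".join(key)  # lines up 1:1 only for biases
    if pt_name not in pt_grads:
        continue
    pt_norm = pt_grads[pt_name].norm().item()
    fx_norm = jnp.linalg.norm(fx_grad).item()
    if abs(pt_norm - fx_norm) >= 0.01:  # threshold is an assumption
        print(f"❌ Layer {key} has PT grad norm {pt_norm} and flax grad norm {fx_norm}.")
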
Check t5-small ...
--------------------------Checking logits match--------------------------
Flax logits shape: (2, 64, 32128), PyTorch logits shape: torch.Size([2, 64, 32128])
βœ… Difference between Flax and PyTorch is 7.62939453125e-05 (< 0.01)
--------------------------Checking losses match--------------------------
Flax loss: 20.534835815429688, PyTorch loss: 20.534835815429688
βœ… Difference between Flax and PyTorch is 0.0 (< 0.01)
--------------------------Checking gradients match--------------------------
βœ… All grads pass
--------------------------Checking rel gradients match--------------------------
βœ… All rel grads pass
=========================================
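
The relative check presumably normalizes the gap by the size of the norms, which explains how vanishingly small gradients (the self.key / k_proj biases flagged above) can fail it while sailing through the absolute check. The exact formula and threshold are not visible in this log; one plausible formulation:

def rel_gap(pt_norm, fx_norm):
    # a 0-vs-0 pair has no well-defined relative gap, which would also
    # explain the ('final_logits_bias',) failure further down
    denom = max(pt_norm, fx_norm)
    return abs(pt_norm - fx_norm) / denom if denom > 0 else float("inf")

# same per-layer loop as above, with the criterion swapped in:
#     if rel_gap(pt_norm, fx_norm) >= 0.01:  # threshold is an assumption
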
Check facebook/bart-large ...
--------------------------Checking logits match--------------------------
Flax logits shape: (2, 64, 50265), PyTorch logits shape: torch.Size([2, 64, 50265])
βœ… Difference between Flax and PyTorch is 0.0004191398620605469 (< 0.01)
--------------------------Checking losses match--------------------------
Flax loss: 13.993148803710938, PyTorch loss: 13.993138313293457
βœ… Difference between Flax and PyTorch is 1.049041748046875e-05 (< 0.01)
--------------------------Checking gradients match--------------------------
❌ Layer ('model', 'decoder', 'layers', '0', 'fc1', 'kernel') has PT grad norm 11.655710220336914 and flax grad norm 11.6015625.
❌ Layer ('model', 'decoder', 'layers', '0', 'fc2', 'kernel') has PT grad norm 7.740886211395264 and flax grad norm 7.71484375.
❌ Layer ('model', 'decoder', 'layers', '10', 'self_attn', 'v_proj', 'kernel') has PT grad norm 6.97633171081543 and flax grad norm 6.96484375.
...
--------------------------Checking rel gradients match--------------------------
❌ Layer ('final_logits_bias',) has PT grad norm 0.0 and flax grad norm 0.0.
❌ Layer ('model', 'decoder', 'layers', '0', 'encoder_attn', 'k_proj', 'bias') has PT grad norm 8.274865592738934e-08 and flax grad norm 0.0.
❌ Layer ('model', 'decoder', 'layers', '0', 'self_attn', 'k_proj', 'bias') has PT grad norm 2.2391466458770992e-08 and flax grad norm 0.0.
❌ Layer ('model', 'encoder', 'layers', '0', 'self_attn', 'k_proj', 'bias') has PT grad norm 8.3155640595578e-08 and flax grad norm 0.0.
...
=========================================
Check facebook/bart-large-cnn ...
--------------------------Checking logits match--------------------------
Flax logits shape: (2, 64, 50264), PyTorch logits shape: torch.Size([2, 64, 50264])
βœ… Difference between Flax and PyTorch is 0.0003502368927001953 (< 0.01)
--------------------------Checking losses match--------------------------
Flax loss: 13.418181419372559, PyTorch loss: 13.418176651000977
βœ… Difference between Flax and PyTorch is 4.76837158203125e-06 (< 0.01)
--------------------------Checking gradients match--------------------------
βœ… All grads pass
--------------------------Checking rel gradients match--------------------------
❌ Layer ('model', 'decoder', 'layers', '0', 'encoder_attn', 'k_proj', 'bias') has PT grad norm 3.5387660091146245e-07 and flax grad norm 4.874667069998395e-07.
❌ Layer ('model', 'decoder', 'layers', '0', 'self_attn', 'k_proj', 'bias') has PT grad norm 6.254911966152576e-08 and flax grad norm 6.927437112835833e-08.
❌ Layer ('model', 'encoder', 'layers', '0', 'self_attn', 'k_proj', 'bias') has PT grad norm 5.864935914701164e-08 and flax grad norm 6.345069891722233e-08.
...
=========================================
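
The seq2seq checkpoints (t5-small and the two bart models) follow the same recipe but also need decoder inputs. A hedged sketch for facebook/bart-large: with labels set, the PyTorch model derives decoder_input_ids by shifting the labels right, and the Flax model does the same with input_ids when decoder_input_ids is omitted, so feeding labels = input_ids keeps the two forward passes aligned:

import numpy as np
import torch
from transformers import BartForConditionalGeneration, FlaxBartForConditionalGeneration

pt_model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")
fx_model = FlaxBartForConditionalGeneration.from_pretrained("facebook/bart-large", from_pt=True)

input_ids = np.random.randint(0, pt_model.config.vocab_size, size=(2, 64))

with torch.no_grad():
    # labels= makes the PyTorch model build its own shifted decoder_input_ids
    pt_logits = pt_model(torch.tensor(input_ids), labels=torch.tensor(input_ids)).logits
# no decoder_input_ids -> the Flax model shifts input_ids right itself
fx_logits = fx_model(input_ids).logits

diff = np.max(np.abs(pt_logits.numpy() - np.asarray(fx_logits)))
print(f"{'βœ…' if diff < 0.01 else '❌'} Difference between Flax and PyTorch is {diff} (< 0.01)")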