finetuning issues

#9
by winglian - opened

Seeing some cases where it won't cleanly fine-tune. For example:

  1. qLoRA on a single GPU results in a loss of ~16
  2. qLoRA on multi-GPU with no DeepSpeed and no FSDP results in the error below

On the flip side, qLoRA on multi-GPU with DeepSpeed ZeRO-2 works great, with a train loss of ~1.

File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/trainer.py", line 3059, in compute_loss
outputs = model(**inputs)
File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1515, in forward
inputs, kwargs = self._pre_forward(*inputs, **kwargs)
File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1409, in _pre_forward
if torch.is_grad_enabled() and self.reducer._rebuild_buckets():
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel, and by making sure all forward function outputs participate in calculating loss.
If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's forward function. Please include the loss function and the structure of the return value of forward of your module when reporting this issue (e.g. list, dict, iterable).
Parameter indices which did not receive grad for rank 1: 24 25 26 27 28 29 30 31 32 33 34 35 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 ...
In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error
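
For reference, the error itself points at the likely cause: with qLoRA most of the base model is frozen, so under plain DDP a bunch of parameters never produce grads and the reducer's bucket rebuild blows up. A minimal sketch of the two obvious workarounds via HF `TrainingArguments` (paths, batch sizes, etc. below are placeholders, not the actual config used here):

```python
# Sketch only: placeholder hyperparameters, not the exact run above.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="qlora-out",              # placeholder output path
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    bf16=True,
    # Workaround 1: stay on plain DDP. The frozen 4-bit base weights never
    # receive grads, which is what trips reducer._rebuild_buckets(); this flag
    # is exactly what the RuntimeError suggests passing to DDP.
    ddp_find_unused_parameters=True,
    # Workaround 2: hand sharding to DeepSpeed ZeRO-2 instead of DDP, which is
    # the combination that trains cleanly here (loss ~1).
    # deepspeed="deepspeed_zero2.json",  # placeholder config path
)
```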

Been working on a fine-tune in Colab (A100) with Transformers/Unsloth and having some decent results. I was able to load it in 4-bit and cram it into the 40 GB. It starts rough but is still learning.
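
For context, the loading side is roughly the sketch below; the model name, sequence length, and LoRA settings are placeholders rather than the exact notebook:

```python
# Rough sketch of the Unsloth 4-bit setup (placeholder model and hyperparameters,
# not the actual Colab notebook).
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="base-model-here",   # placeholder
    max_seq_length=2048,
    load_in_4bit=True,              # 4-bit quantization is what keeps it inside 40 GB
)

model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing=True,
)
```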

Training has been progressing slowly but fine so far:

 [3134/5850 3:03:46 < 2:39:22, 0.28 it/s, Epoch 5.35/10]
Step	Training Loss
1	10.109800
2	9.924600
3	9.919700
4	9.919100
5	9.917400
6	9.895900
7	9.891700
8	9.893500
9	9.917200
10	9.918800
11	10.056100
12	9.916200
13	9.911200
14	9.884300
15	9.909800
16	9.883800
17	9.883800
18	9.878300
19	9.904400
20	9.976400
21	10.061600
22	10.063300
23	9.876200
24	9.890900
25	9.873100
26	9.893700

85	9.790100
86	9.840400
87	9.809500
88	9.860000
89	9.807000
90	9.948200
91	9.779500
92	9.781800
93	9.802700
94	9.827700
95	9.798000
96	9.825900
97	9.966000
98	9.773000
99	9.775400

271	9.531600
272	9.555500
273	9.559900
274	9.524000
275	9.889300
276	9.553700
277	9.534400
278	9.566800
279	9.518700
280	9.510600
281	9.528800
282	9.545800
283	9.693700
284	9.507500
285	9.511300
286	9.500100
287	9.505000
288	9.540600
289	9.651200
290	9.570700

597	9.143200
598	9.271700
599	9.316200
600	9.105400
601	9.265900
602	9.103200
603	9.268700
604	9.099700
605	9.046600
606	9.046700

1596	7.850300
1597	8.045900
1598	7.740900
1599	7.737900
1600	7.733500
1601	7.734300
1602	7.729800
1603	8.040600
1604	7.817500
1605	7.920400
1606	7.730800
1607	7.816300
1608	7.820600
1609	7.823400
1610	7.730100
1611	7.726700
1612	7.833400
1613	7.899600
1614	7.806200
1615	7.722700
1616	8.222000
1617	7.807400
1618	7.717900
1619	7.718900

2118	7.301500
2119	7.185200
2120	7.183700
2121	7.411800
2122	7.282500
2123	7.285600
2124	7.292500
2125	7.178600
2126	7.295200
2127	7.184200
2128	7.288200
2129	7.423800
2130	7.277300
2131	7.174300
2132	7.288300
2133	8.224200
2134	7.276500
2135	7.173400

3108	6.350500
3109	6.619800
3110	6.344700
3111	6.505300
3112	6.343300
3113	6.502800
3114	6.342800
3115	6.502300
3116	6.489400
3117	6.487600
3118	7.165100
3119	6.495300
3120	6.479400
3121	6.483700
3122	6.913700
3123	6.338300
3124	6.474800
3125	6.341200
3126	6.493200
3127	6.657600
3128	6.340400
3129	6.493700
3130	7.083600
3131	6.327500
3132	6.326300
/usr/local/lib/python3.10/dist-packages/peft/utils/save_and_load.py:139: UserWarning: Setting `save_embedding_layers` to `True` as embedding layers found in `target_modules`.
  warnings.warn("Setting `save_embedding_layers` to `True` as embedding layers found in `target_modules`.")
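For anyone puzzled by that UserWarning: it fires when the LoRA target_modules include embedding layers, so PEFT saves those embeddings along with the adapter. Roughly what triggers it (module names here are typical placeholders, not necessarily the exact config):

```python
# Sketch: a PEFT LoRA config whose target_modules include embedding layers,
# which is what makes saving the adapter emit the save_embedding_layers warning.
from peft import LoraConfig

config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj", "embed_tokens", "lm_head"],  # embeddings included
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
# With embed_tokens / lm_head in target_modules, PEFT flips save_embedding_layers
# to True so the (larger) adapter checkpoint can be reloaded correctly.
```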

@Severian oof, that loss though. I've got it working now in axolotl with a normal loss.

winglian changed discussion status to closed
