Introduction
0:00
- I see the danger of this concentration of power through proprietary AI systems
0:06
as a much bigger danger than everything else. What works against this
0:11
is people who think that for reasons of security, we should keep AI systems under lock and key
0:18
because it's too dangerous to put it in the hands of everybody. That would lead to a very bad future
0:25
in which all of our information diet is controlled by a small number of companies
0:30
through proprietary systems. - I believe that people are fundamentally good and so if AI, especially open source AI
0:38
can make them smarter, it just empowers the goodness in humans.
0:44
- So I share that feeling. Okay? I think people are fundamentally good. (laughing)
0:50
And in fact a lot of doomers are doomers because they don't think that people are fundamentally good.
0:57
- The following is a conversation with Yann LeCun, his third time on this podcast.
1:02
He is the chief AI scientist at Meta, professor at NYU, Turing Award winner
1:08
and one of the seminal figures in the history of artificial intelligence. He and Meta AI
1:15
have been big proponents of open sourcing AI development, and have been walking the walk
1:21
by open sourcing many of their biggest models, including LLaMA 2 and eventually LLaMA 3.
1:28
Also, Yann has been an outspoken critic of those people in the AI community
1:34
who warn about the looming danger and existential threat of AGI.
1:39
He believes that AGI will be created one day, but it will be good.
1:45
It will not escape human control nor will it dominate and kill all humans.
1:52
At this moment of rapid AI development, this happens to be somewhat a controversial position.
1:58
And so it's been fun seeing Yann get into a lot of intense and fascinating discussions online
2:05
as we do in this very conversation. This is the Lex Fridman podcast. To support it,
2:11
please check out our sponsors in the description. And now, dear friends, here's Yann LeCun.
Limits of LLMs
2:18
You've had some strong statements, technical statements about the future of artificial intelligence recently,
2:25
throughout your career actually but recently as well. You've said that autoregressive LLMs
2:31
are not the way we're going to make progress towards superhuman intelligence.
2:38
These are the large language models like GPT-4, like LLaMA 2 and 3 soon and so on.
2:44
How do they work and why are they not going to take us all the way? - For a number of reasons. The first is that there is a number of characteristics
2:51
of intelligent behavior. For example, the capacity to understand the world,
2:58
understand the physical world, the ability to remember and retrieve things,
3:06
persistent memory, the ability to reason and the ability to plan.
3:12
Those are four essential characteristics of intelligent systems or entities,
3:18
humans, animals. LLMs can do none of those, or they can only do them in a very primitive way.
3:26
And they don't really understand the physical world, they don't really have persistent memory, they can't really reason
3:32
and they certainly can't plan. And so if you expect the system to become intelligent
3:38
just without having the possibility of doing those things, you're making a mistake.
3:45
That is not to say that autoregressive LLMs are not useful, they're certainly useful.
3:53
That they're not interesting, that we can't build a whole ecosystem of applications around them,
4:00
of course we can. But as a path towards human level intelligence,
4:05
they're missing essential components. And then there is another tidbit or fact
4:11
that I think is very interesting; those LLMs are trained on enormous amounts of text.
4:16
Basically the entirety of all publicly available text on the internet, right? That's typically on the order of 10 to the 13 tokens.
4:26
Each token is typically two bytes. So that's two times 10 to the 13 bytes of training data.
4:31
It would take you or me 170,000 years to just read through this at eight hours a day. (laughs)
4:37
So it seems like an enormous amount of knowledge, right? That those systems can accumulate.
4:46
But then you realize it's really not that much data. If you talk to developmental psychologists,
4:51
and they tell you a 4-year-old has been awake for 16,000 hours in his or her life,
5:00
and the amount of information that has reached the visual cortex of that child
5:06
in four years is about 10 to the 15 bytes.
5:12
And you can compute this by estimating that the optic nerve carries about 20 megabytes per second, roughly.
5:19
And so 10 to the 15 bytes for a 4-year-old versus two times 10 to the 13 bytes
5:25
for 170,000 years worth of reading. What that tells you is that through sensory input,
5:33
we see a lot more information than we do through language. And that despite our intuition,
5:40
most of what we learn and most of our knowledge is through our observation and interaction
5:46
with the real world, not through language. Everything that we learn in the first few years of life,
5:51
and certainly everything that animals learn has nothing to do with language.
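A rough back-of-the-envelope version of the arithmetic above, written out in Python. All of the inputs are the conversation's approximate figures, and the ~5.5 tokens-per-second reading speed is an assumed rate chosen only to make the 170,000-year estimate concrete:

text_tokens = 1e13                  # ~10^13 tokens of public text
bytes_per_token = 2
text_bytes = text_tokens * bytes_per_token      # ~2e13 bytes of LLM training data

reading_speed = 5.5                 # tokens per second (assumed reading rate)
seconds_per_year = 365 * 8 * 3600   # reading eight hours a day
reading_years = text_tokens / (reading_speed * seconds_per_year)

awake_hours = 16_000                # a 4-year-old's waking hours
optic_nerve_rate = 20e6             # ~20 megabytes per second reaching the visual cortex
visual_bytes = awake_hours * 3600 * optic_nerve_rate    # ~1.2e15 bytes

print(f"text corpus: {text_bytes:.1e} bytes, ~{reading_years:,.0f} years to read")
print(f"visual input by age 4: {visual_bytes:.1e} bytes, "
      f"~{visual_bytes / text_bytes:.0f}x the text corpus")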
5:57
- So it would be good to maybe push against some of the intuition behind what you're saying. So it is true there's several orders of magnitude
6:05
more data coming into the human mind, much faster,
6:10
and the human mind is able to learn very quickly from that, filter the data very quickly. Somebody might argue
6:16
with your comparison between sensory data versus language, that language is already very compressed.
6:23
It already contains a lot more information than the bytes it takes to store them, if you compare it to visual data.
6:29
So there's a lot of wisdom in language. There's words and the way we stitch them together, it already contains a lot of information.
6:36
So is it possible that language alone already has enough wisdom and knowledge in there
6:47
to be able to, from that language construct a world model and understanding of the world,
6:52
an understanding of the physical world that you're saying LLMs lack? - So it's a big debate among philosophers
7:00
and also cognitive scientists, like whether intelligence needs to be grounded in reality.
7:05
I'm clearly in the camp that yes, intelligence cannot appear without some grounding in some reality.
7:14
It doesn't need to be physical reality, it could be simulated but the environment is just much richer
7:20
than what you can express in language. Language is a very approximate representation of percepts
7:27
and of mental models, right? I mean, there's a lot of tasks that we accomplish where we manipulate a mental model of the situation at hand,
7:38
and that has nothing to do with language. Everything that's physical, mechanical, whatever,
7:43
when we build something, when we accomplish a task, a moderate task of grabbing something, et cetera,
7:50
we plan our action sequences, and we do this by essentially imagining the result
7:55
of the outcome of a sequence of actions that we might imagine.
8:01
And that requires mental models that don't have much to do with language. And that's, I would argue,
8:07
most of our knowledge is derived from that interaction with the physical world.
8:13
So a lot of my colleagues who are more interested in things like computer vision
8:19
are really on that camp that AI needs to be embodied, essentially.
8:25
And then other people coming from the NLP side or maybe some other motivation
8:32
don't necessarily agree with that. And philosophers are split as well.
8:38
And the complexity of the world is hard to imagine.
8:44
It's hard to represent all the complexities
8:51
that we take completely for granted in the real world that we don't even imagine require intelligence, right? This is the old Moravec's paradox
8:57
from the pioneer of robotics, Hans Moravec, who said, how is it that with computers,
9:03
it seems to be easy to do high level complex tasks like playing chess and solving integrals
9:08
and doing things like that, whereas the thing we take for granted that we do every day,
9:13
like, I don't know, learning to drive a car or grabbing an object,
9:18
we can't do with computers. (laughs) And we have LLMs that can pass the bar exam,
9:28
so they must be smart. But then they can't learn to drive in 20 hours like any 17-year-old.
9:35
They can't learn to clear out the dinner table and fill up the dishwasher like any 10-year-old can learn in one shot.
9:42
Why is that? Like what are we missing? What type of learning or reasoning architecture or whatever are we missing
9:52
that basically prevents us from having level five self-driving cars
9:58
and domestic robots? - Can a large language model construct a world model
10:05
that does know how to drive and does know how to fill a dishwasher, but just doesn't know how to deal with visual data at this time?
10:12
So it can operate in a space of concepts. - So yeah, that's what a lot of people are working on.
10:19
So the answer, the short answer is no. And the more complex answer is you can use all kind of tricks
10:26
to get an LLM to basically digest visual representations
10:35
of images or video or audio for that matter.
10:42
And a classical way of doing this is you train a vision system in some way,
10:49
and we have a number of ways to train vision systems, either supervised, unsupervised, self-supervised, all kinds of different ways.
10:56
That will turn any image into a high level representation.
11:02
Basically, a list of tokens that are really similar to the kind of tokens that a typical LLM takes as an input.
11:10
And then you just feed that to the LLM in addition to the text,
11:17
and you just expect the LLM during training to kind of be able to use those representations
11:25
to help make decisions. I mean, there's been work along those lines for quite a long time.
11:31
And now you see those systems, right? I mean, there are LLMs that have some vision extension.
11:36
But they're basically hacks in the sense that those things are not like trained to handle,
11:41
to really understand the world. They're not trained with video, for example. They don't really understand intuitive physics,
11:48
at least not at the moment. - So you don't think there's something special to you about intuitive physics,
11:54
about sort of common sense reasoning about the physical space, about physical reality? That to you is a giant leap
12:00
that LLMs are just not able to do? - We're not gonna be able to do this with the type of LLMs that we are working with today.
12:07
And there's a number of reasons for this, but the main reason is the way LLMs are trained is that you take a piece of text,
12:16
you remove some of the words in that text, you mask them, you replace them by blank markers,
12:22
and you train a gigantic neural net to predict the words that are missing. And if you build this neural net in a particular way
12:30
so that it can only look at words that are to the left of the one it's trying to predict,
12:36
then what you have is a system that basically is trying to predict the next word in a text, right? So then you can feed it a text, a prompt,
12:43
and you can ask it to predict the next word. It can never predict the next word exactly. And so what it's gonna do
12:49
is produce a probability distribution of all the possible words in the dictionary.
12:54
In fact, it doesn't predict words, it predicts tokens that are kind of subword units. And so it's easy to handle the uncertainty
13:01
in the prediction there because there's only a finite number of possible words in the dictionary,
13:07
and you can just compute a distribution over them. Then what the system does is that it picks a word from that distribution.
13:16
Of course, there's a higher chance of picking words that have a higher probability within that distribution. So you sample from that distribution
13:22
to actually produce a word, and then you shift that word into the input.
13:28
And so that allows the system now to predict the second word, right? And once you do this, you shift it into the input, et cetera.
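A minimal sketch of the loop being described: score every token in a finite vocabulary, sample one from the resulting distribution, shift it into the input, and repeat. The toy vocabulary and the random next_token_logits function stand in for a trained model; they are placeholders, not any real LLM's interface.

import math
import random

VOCAB = ["<eos>", "the", "cat", "sat", "on", "mat"]   # toy vocabulary

def next_token_logits(context):
    # Placeholder: a real model would score each token given the context so far.
    return [random.uniform(-1.0, 1.0) for _ in VOCAB]

def sample(logits):
    # Turn scores into a probability distribution over the finite vocabulary,
    # then pick one token from it (higher-probability tokens are more likely).
    probs = [math.exp(l) for l in logits]
    total = sum(probs)
    return random.choices(range(len(VOCAB)), weights=[p / total for p in probs], k=1)[0]

def generate(prompt, max_new_tokens=10):
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tok = VOCAB[sample(next_token_logits(tokens))]
        if tok == "<eos>":
            break
        tokens.append(tok)          # shift the sampled token into the input and repeat
    return tokens

print(generate(["the"]))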
13:35
That's called autoregressive prediction, which is why those LLMs should be called autoregressive LLMs,
13:43
but we just call them LLMs. And there is a difference between this kind of process
13:50
and a process by which before producing a word, when you talk. When you and I talk,
Bilingualism and thinking
13:56
you and I are bilinguals. We think about what we're gonna say, and it's relatively independent
14:01
of the language in which we're gonna say. When we talk about like, I don't know, let's say a mathematical concept or something.
14:09
The kind of thinking that we're doing and the answer that we're planning to produce
14:14
is not linked to whether we're gonna say it in French or Russian or English.
14:19
- Chomsky just rolled his eyes, but I understand. So you're saying that there's a bigger abstraction
14:24
that goes before language- - [Yann] Yeah. - And maps onto language.
14:30
- Right. It's certainly true for a lot of thinking that we do. - Is that obvious that we don't?
14:35
Like you're saying your thinking is the same in French as it is in English? - Yeah, pretty much.
14:42
- Pretty much or is this... Like how flexible are you, like if there's a probability distribution?
14:48
(both laugh) - Well, it depends what kind of thinking, right? If it's like producing puns,
14:53
I get much better in French than English about that (laughs) or much worse- - Is there an abstract representation of puns?
15:00
Like is your humor an abstract... Like when you tweet and your tweets are sometimes a little bit spicy,
15:06
is there an abstract representation in your brain of a tweet before it maps onto English?
15:11
- There is an abstract representation of imagining the reaction of a reader to that text.
15:18
- Oh, you start with laughter and then figure out how to make that happen? - Figure out like a reaction you wanna cause
15:25
and then figure out how to say it so that it causes that reaction. But that's like really close to language.
15:30
But think about like a mathematical concept or imagining something you want to build out of wood
15:38
or something like this, right? The kind of thinking you're doing has absolutely nothing to do with language, really.
15:43
Like it's not like you have necessarily like an internal monologue in any particular language. You're imagining mental models of the thing, right?
15:51
I mean, if I ask you to like imagine what this water bottle will look like if I rotate it 90 degrees,
15:59
that has nothing to do with language. And so clearly
16:04
there is a more abstract level of representation in which we do most of our thinking
16:11
and we plan what we're gonna say if the output is uttered words
16:19
as opposed to an output being muscle actions, right?
16:26
We plan our answer before we produce it. And LLMs don't do that, they just produce one word after the other,
16:33
instinctively if you want. It's a bit like the subconscious actions where you don't...
16:41
Like you're distracted. You're doing something, you're completely concentrated and someone comes to you and asks you a question.
16:47
And you kind of answer the question. You don't have time to think about the answer, but the answer is easy so you don't need to pay attention
16:54
and you sort of respond automatically. That's kind of what an LLM does, right? It doesn't think about its answer, really.
17:01
It retrieves it because it's accumulated a lot of knowledge, so it can retrieve some things, but it's going to just spit out one token after the other
17:10
without planning the answer. - But you're making it sound just one token after the other,
17:17
one token at a time generation is bound to be simplistic.
17:25
But if the world model is sufficiently sophisticated, that one token at a time,
17:31
the most likely thing it generates as a sequence of tokens is going to be a deeply profound thing.
17:39
- Okay. But then that assumes that those systems actually possess an internal world model.
17:44
- So it really goes to the... I think the fundamental question is can you build a really complete world model?
Video prediction
17:54
Not complete, but one that has a deep understanding of the world. - Yeah.
17:59
So can you build this first of all by prediction? - [Lex] Right.
18:04
- And the answer is probably yes. Can you build it by predicting words?
18:10
And the answer is most probably no, because language is very poor in terms of...
18:17
Or weak or low bandwidth if you want, there's just not enough information there. So building world models means observing the world
18:27
and understanding why the world is evolving the way it is.
18:33
And then the extra component of a world model
18:38
is something that can predict how the world is going to evolve as a consequence of an action you might take, right?
18:45
So one model really is, here is my idea of the state of the world at time T, here is an action I might take.
18:51
What is the predicted state of the world at time T plus one? Now, that state of the world
18:57
does not need to represent everything about the world, it just needs to represent enough that's relevant for this planning of the action,
19:06
but not necessarily all the details. Now, here is the problem. You're not going to be able to do this
19:11
with generative models. So a generative model that's trained on video,
19:16
and we've tried to do this for 10 years. You take a video, show a system a piece of video
19:22
and then ask it to predict the remainder of the video. Basically predict what's gonna happen.
19:27
- One frame at a time. Do the same thing as sort of the autoregressive LLMs do,
19:33
but for video. - Right. Either one frame at a time or a group of frames at a time. But yeah, a large video model, if you want. (laughing)
19:43
The idea of doing this has been floating around for a long time. And at FAIR,
19:49
some colleagues and I have been trying to do this for about 10 years.
19:54
And you can't really do the same trick as with LLMs, because LLMs, as I said,
20:02
you can't predict exactly which word is gonna follow a sequence of words, but you can predict the distribution of the words.
20:09
Now, if you go to video, what you would have to do is predict the distribution of all possible frames in a video.
20:16
And we don't really know how to do that properly. We do not know how to represent distributions
20:21
over high dimensional continuous spaces in ways that are useful.
20:27
And there lies the main issue. And the reason we can't do this
20:33
is because the world is incredibly more complicated and richer
20:38
in terms of information than text. Text is discrete. Video is high dimensional and continuous.
20:45
A lot of details in this. So if I take a video of this room,
20:50
and the video is a camera panning around,
20:56
there is no way I can predict everything that's gonna be in the room as I pan around, the system cannot predict what's gonna be in the room
21:02
as the camera is panning. Maybe it's gonna predict, this is a room where there's a light and there is a wall
21:08
and things like that. It can't predict what the painting of the wall looks like or what the texture of the couch looks like.
21:14
Certainly not the texture of the carpet. So there's no way it can predict all those details.
21:19
So the way to handle this, or one way to possibly handle this,
21:24
which we've been working on for a long time, is to have a model that has what's called a latent variable. And the latent variable is fed to a neural net,
21:33
and it's supposed to represent all the information about the world that you don't perceive yet. And that you need to augment the system
21:43
for the prediction to do a good job at predicting pixels, including the fine texture of the carpet and the couch
21:53
and the painting on the wall. That has been a complete failure, essentially.
22:00
And we've tried lots of things. We tried just straight neural nets, we tried GANs, we tried VAEs,
22:08
all kinds of regularized auto encoders, we tried many things.
22:13
We also tried those kind of methods to learn good representations of images or video
22:20
that could then be used as input for example, an image classification system.
22:27
And that also has basically failed. Like all the systems that attempt to predict missing parts
22:33
of an image or a video from a corrupted version of it, basically.
22:40
So, right, take an image or a video, corrupt it or transform it in some way, and then try to reconstruct the complete video or image
22:47
from the corrupted version. And then hope that internally, the system will develop good representations of images
22:54
that you can use for object recognition, segmentation, whatever it is. That has been essentially a complete failure.
23:02
And it works really well for text. That's the principle that is used for LLMs, right? - So where's the failure exactly?
23:08
Is it that it is very difficult to form a good representation of an image,
23:14
like a good embedding of all the important information in the image?
23:19
Is it in terms of the consistency of image to image to image to image that forms the video?
23:26
If we do a highlight reel of all the ways you failed. What's that look like? - Okay.
23:31
So the reason this doesn't work is... First of all, I have to tell you exactly what doesn't work
23:37
because there is something else that does work. So the thing that does not work is training the system to learn representations of images
23:47
by training it to reconstruct a good image from a corrupted version of it.
23:53
Okay. That's what doesn't work. And we have a whole slew of techniques for this that are variants of denoising autoencoders.
24:02
Something called MAE, developed by some of my colleagues at FAIR, masked autoencoder. So it's basically like the LLMs or things like this
24:11
where you train the system by corrupting text, except you corrupt images. You remove patches from it and you train a gigantic neural network to reconstruct.
24:19
The features you get are not good. And you know they're not good because if you now train the same architecture,
24:25
but you train it supervised with labeled data, with textual descriptions of images, et cetera,
24:34
you do get good representations. And the performance on recognition tasks is much better
24:39
than if you do this self-supervised pre-training. - So the architecture is good. - The architecture is good.
24:45
The architecture of the encoder is good. Okay? But the fact that you train the system to reconstruct images
24:51
does not lead it to produce good generic features of images. - [Lex] When you train it in a self-supervised way.
24:58
- Self supervised by reconstruction. - [Lex] Yeah, by reconstruction. - Okay, so what's the alternative? (both laugh)
25:04
The alternative is joint embedding. - What is joint embedding? What are these architectures that you're so excited about?
JEPA (Joint-Embedding Predictive Architecture)
25:11
- Okay, so now instead of training a system to encode the image and then training it to reconstruct the full image
25:17
from a corrupted version, you take the full image, you take the corrupted or transformed version,
25:25
you run them both through encoders, which in general are identical but not necessarily.
25:31
And then you train a predictor on top of those encoders
25:37
to predict the representation of the full input
25:42
from the representation of the corrupted one. Okay?
25:47
So joint embedding, because you're taking the full input and the corrupted version or transformed version,
25:54
run them both through encoders so you get a joint embedding. And then you're saying can I predict the representation of the full one
26:02
from the representation of the corrupted one? Okay? And I call this a JEPA,
26:07
so that means joint embedding predictive architecture because there's joint embedding and there is this predictor that predicts the representation
26:13
of the good guy from the bad guy. And the big question is how do you train something like this?
26:20
And until five years ago or six years ago, we didn't have particularly good answers
26:26
for how you train those things, except for one called contrastive learning.
26:34
And the idea of contrastive learning is you take a pair of images that are, again, an image and a corrupted version
26:42
or degraded version somehow or transformed version of the original one. And you train the predicted representation
26:49
to be the same as that. If you only do this, this system collapses. It basically completely ignores the input
26:55
and produces representations that are constant. So the contrastive methods avoid this.
27:02
And those things have been around since the early '90s, I had a paper on this in 1993.
27:08
The idea is that you also show pairs of images that you know are different
27:14
and then you push away the representations from each other. So you say not only do representations of things
27:20
that we know are the same, should be the same or should be similar, but representation of things that we know are different
27:25
should be different. And that prevents the collapse, but it has some limitation. And there's a whole bunch of techniques
27:31
that have appeared over the last six, seven years that can revive this type of method.
27:38
Some of them from FAIR, some of them from Google and other places.
27:44
But there are limitations to those contrastive methods. What has changed in the last three, four years
27:51
is now we have methods that are non-contrastive. So they don't require those negative contrastive samples
27:59
of images that we know are different. You train them only with images
28:04
that are different versions or different views of the same thing. And you rely on some other tweaks
28:10
to prevent the system from collapsing. And we have half a dozen different methods for this now.
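A toy version of that contrastive recipe, as a sketch: pairs known to show the same thing are pulled together in embedding space, and pairs known to be different are pushed apart beyond a margin. This is a generic contrastive loss in the spirit of what is described, not the exact 1993 method or any specific FAIR or Google system.

import torch
import torch.nn.functional as F

def contrastive_loss(z_a, z_b, same, margin=1.0):
    # z_a, z_b: batches of embeddings; same = 1 for pairs of views of the same image.
    d = F.pairwise_distance(z_a, z_b)                # distance in embedding space
    pull = same * d.pow(2)                           # matching pairs: make representations similar
    push = (1.0 - same) * F.relu(margin - d).pow(2)  # non-matching pairs: push beyond the margin
    return (pull + push).mean()

# Toy usage with random embeddings standing in for encoder outputs.
z_a, z_b = torch.randn(8, 128), torch.randn(8, 128)
same = torch.randint(0, 2, (8,)).float()
print(contrastive_loss(z_a, z_b, same).item())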
JEPA vs LLMs
28:16
- So what is the fundamental difference between joint embedding architectures and LLMs?
28:22
So can JEPA take us to AGI? Whether we should say that you don't like the term AGI
28:31
and we'll probably argue, I think every single time I've talked to you we've argued about the G in AGI.
28:36
- [Yann] Yes. - I get it, I get it, I get it. (laughing) Well we'll probably continue to argue about it.
28:42
It's great. Because you're like French,
28:48
and ami is I guess friend in French- - [Yann] Yes. - And AMI stands for advanced machine intelligence-
28:55
- [Yann] Right. - But either way, can JEPA take us to that, towards that advanced machine intelligence?
29:02
- Well, so it's a first step. Okay? So first of all, what's the difference with generative architectures like LLMs?
29:10
So LLMs or vision systems that are trained by reconstruction
29:17
generate the inputs, right? They generate the original input
29:22
that is non-corrupted, non-transformed, right? So you have to predict all the pixels.
29:29
And there is a huge amount of resources spent in the system to actually predict all those pixels, all the details.
29:36
In a JEPA, you're not trying to predict all the pixels, you're only trying to predict
29:42
an abstract representation of the inputs, right? And that's much easier in many ways.
29:49
So what the JEPA system when it's being trained is trying to do, is extract as much information as possible from the input,
29:56
but yet only extract information that is relatively easily predictable.
30:01
Okay. So there's a lot of things in the world that we cannot predict. Like for example, if you have a self driving car
30:07
driving down the street or road. There may be trees around the road.
30:13
And it could be a windy day, so the leaves on the tree are kind of moving in kind of semi chaotic random ways
30:19
that you can't predict and you don't care, you don't want to predict. So what you want is your encoder
30:25
to basically eliminate all those details. It'll tell you there's moving leaves, but it's not gonna keep the details of exactly what's going on.
30:32
And so when you do the prediction in representation space, you're not going to have to predict
30:37
every single pixel of every leaf. And that not only is a lot simpler,
30:43
but also it allows the system to essentially learn an abstract representation of the world
30:49
where what can be modeled and predicted is preserved
30:54
and the rest is viewed as noise and eliminated by the encoder. So it kind of lifts the level of abstraction
31:00
of the representation. If you think about this, this is something we do absolutely all the time. Whenever we describe a phenomenon,
31:07
we describe it at a particular level of abstraction. And we don't always describe every natural phenomenon
31:13
in terms of quantum field theory, right? That would be impossible, right? So we have multiple levels of abstraction
31:19
to describe what happens in the world. Starting from quantum field theory to like atomic theory and molecules in chemistry,
31:27
materials, all the way up to kind of concrete objects in the real world
31:33
and things like that. So we can't just only model everything at the lowest level.
31:40
And that's what the idea of JEPA is really about. Learn abstract representation in a self supervised manner.
31:49
And you can do it hierarchically as well. So that I think is an essential component
31:54
of an intelligent system. And in language, we can get away without doing this because language is already to some level abstract
32:02
and already has eliminated a lot of information that is not predictable. And so we can get away without doing the joint embedding,
32:11
without lifting the abstraction level and by directly predicting words.
32:16
- So joint embedding. It's still generative, but it's generative in this abstract representation space.
32:23
- [Yann] Yeah. - And you're saying language, we were lazy with language 'cause we already got the abstract representation for free
32:30
and now we have to zoom out, actually think about generally intelligent systems, we have to deal with the full mess
32:37
of physical of reality, of reality. And you do have to do this step
32:42
of jumping from the full, rich, detailed reality
32:50
to an abstract representation of that reality based on what you can then reason
32:56
and all that kind of stuff. - Right. And the thing is those self supervised algorithms that learn by prediction,
33:03
even in representation space, they learn more concepts
33:09
if the input data you feed them is more redundant. The more redundancy there is in the data, the more they're able to capture
33:15
some internal structure of it. And so there, there is way more redundancy in the structure
33:22
in perceptual inputs, sensory input like vision, than there is in text,
33:28
which is not nearly as redundant. This is back to the question you were asking a few minutes ago. Language might represent more information really
33:35
because it's already compressed, you're right about that. But that means it's also less redundant. And so self supervised only will not work as well.
33:43
- Is it possible to join the self supervised training on visual data
33:49
and self supervised training on language data? There is a huge amount of knowledge
33:56
even though you talk down about those 10 to the 13 tokens. Those 10 to the 13 tokens
34:01
represent the entirety, a large fraction of what us humans have figured out.
34:09
Both the shit talk on Reddit and the contents of all the books and the articles and the full spectrum of human intellectual creation.
34:18
So is it possible to join those two together? - Well, eventually, yes,
34:23
but I think if we do this too early, we run the risk of being tempted to cheat.
34:30
And in fact, that's what people are doing at the moment with vision-language models. We're basically cheating. We are using language as a crutch
34:38
to help the deficiencies of our vision systems to kind of learn good representations from images and video.
34:46
And the problem with this is that we might improve our vision language system a bit,
34:53
I mean our language models by feeding them images. But we're not gonna get to the level
34:59
of even the intelligence or level of understanding of the world of a cat or a dog which doesn't have language.
35:07
They don't have language and they understand the world much better than any LLM. They can plan really complex actions
35:15
and sort of imagine the result of a bunch of actions. How do we get machines to learn that
35:20
before we combine that with language? Obviously, if we combine this with language, this is gonna be a winner,
35:28
but before that we have to focus on like how do we get systems to learn how the world works? - So this kind of joint embedding predictive architecture,
35:37
for you, that's gonna be able to learn something like common sense, something like what a cat uses
35:43
to predict how to mess with its owner most optimally by knocking over a thing.
35:49
- That's the hope. In fact, the techniques we're using are non-contrastive.
35:54
So not only is the architecture non-generative, the learning procedures we're using are non-contrastive.
36:00
We have two sets of techniques. One set is based on distillation and there's a number of methods that use this principle.
36:10
One by DeepMind called BYOL. A couple by FAIR, one called VICReg and another one called I-JEPA.
36:20
And VICReg, I should say, is not a distillation method actually, but I-JEPA and BYOL certainly are.
36:25
And there's another one also called DINO or Dino, also produced at FAIR.
36:31
And the idea of those things is that you take the full input, let's say an image. You run it through an encoder,
36:38
produces a representation. And then you corrupt that input or transform it, run it through essentially what amounts to the same encoder
36:46
with some minor differences. And then train a predictor. Sometimes a predictor is very simple,
36:51
sometimes it doesn't exist. But train a predictor to predict a representation of the first uncorrupted input from the corrupted input.
37:02
But you only train the second branch. You only train the part of the network
37:07
that is fed with the corrupted input. The other network, you don't train.
37:12
But since they share the same weights, when you modify the first one, it also modifies the second one.
37:18
And with various tricks, you can prevent the system from collapsing, the type of collapse I was explaining before
37:24
where the system basically ignores the input. So that works very well.
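A minimal sketch of that training scheme: the same encoder processes both the full input and a corrupted version, a predictor maps the corrupted branch's representation toward the full branch's representation, and gradients only flow through the corrupted branch. This follows the recipe as described, in the spirit of BYOL or I-JEPA; the tiny networks and the pixel-zeroing corruption are illustrative stand-ins, not the actual published architectures.

import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(nn.Flatten(), nn.Linear(32 * 32, 256), nn.ReLU(), nn.Linear(256, 128))
predictor = nn.Sequential(nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, 128))
opt = torch.optim.Adam(list(encoder.parameters()) + list(predictor.parameters()), lr=1e-3)

def corrupt(x, p=0.5):
    # Stand-in corruption: zero out random pixels (block masking in the real systems).
    return x * (torch.rand_like(x) > p).float()

x = torch.randn(16, 1, 32, 32)              # a toy batch of images
with torch.no_grad():
    target = encoder(x)                     # full input -> target representation, no gradient
pred = predictor(encoder(corrupt(x)))       # corrupted input -> predicted representation
loss = F.mse_loss(pred, target)             # predict the target in representation space
opt.zero_grad()
loss.backward()                             # only the corrupted branch receives gradients
opt.step()
print(loss.item())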
DINO and I-JEPA
37:31
The two techniques we've developed at FAIR, DINO and I-JEPA work really well for that.
37:39
- So what kind of data are we talking about here? - So there's several scenarios. One scenario is you take an image,
37:47
you corrupt it by changing the cropping, for example,
37:52
changing the size a little bit, maybe changing the orientation, blurring it, changing the colors,
37:58
doing all kinds of horrible things to it- - But basic horrible things. - Basic horrible things that sort of degrade the quality a little bit
38:04
and change the framing, crop the image.
38:09
And in some cases, in the case of I-JEPA, you don't need to do any of this, you just mask some parts of it, right?
38:16
You just basically remove some regions like a big block, essentially.
38:21
And then run through the encoders and train the entire system, encoder and predictor,
38:27
to predict the representation of the good one from the representation of the corrupted one.
38:33
So that's the I-JEPA. It doesn't need to know that it's an image, for example, because the only thing it needs to know
38:39
is how to do this masking. Whereas with DINO, you need to know it's an image because you need to do things
38:45
like geometry transformation and blurring and things like that that are really image specific.
V-JEPA
38:51
A more recent version of this that we have is called V-JEPA. So it's basically the same idea as I-JEPA
38:56
except it's applied to video. So now you take a whole video and you mask a whole chunk of it.
39:02
And what we mask is actually kind of a temporal tube. So like a whole segment of each frame in the video
39:07
over the entire video. - And that tube is like statically positioned throughout the frames?
39:14
It's literally just a straight tube? - Throughout the tube, yeah. Typically it's 16 frames or something, and we mask the same region over the entire 16 frames.
39:22
It's a different one for every video, obviously. And then again, train that system
39:28
so as to predict the representation of the full video from the partially masked video.
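A sketch of the temporal-tube masking just described: one spatial block is chosen per clip and blanked out in every frame. The clip size, block size, and the zero fill value here are illustrative choices, not the actual V-JEPA configuration.

import torch

def tube_mask(video, block=(8, 8)):
    # video: (frames, channels, height, width)
    T, C, H, W = video.shape
    bh, bw = block
    top = torch.randint(0, H - bh + 1, (1,)).item()    # one random location per clip...
    left = torch.randint(0, W - bw + 1, (1,)).item()
    masked = video.clone()
    masked[:, :, top:top + bh, left:left + bw] = 0.0   # ...removed from every frame
    return masked

clip = torch.randn(16, 3, 32, 32)    # a 16-frame toy clip
print(tube_mask(clip).shape)         # same shape, with the same region blanked in all frames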
39:34
And that works really well. It's the first system that we have that learns good representations of video
39:39
so that when you feed those representations to a supervised classifier head,
39:44
it can tell you what action is taking place in the video with pretty good accuracy.
39:51
So it's the first time we get something of that quality. - So that's a good test
39:57
that a good representation is formed. That means there's something to this. - Yeah. We also have preliminary results
40:03
that seem to indicate that the representation allows our system to tell
40:09
whether the video is physically possible or completely impossible because some object disappeared
40:15
or an object suddenly jumped from one location to another or changed shape or something.
40:21
- So it's able to capture some physics based constraints
40:27
about the reality represented in the video? - [Yann] Yeah. - About the appearance and the disappearance of objects?
40:32
- Yeah. That's really new. - Okay, but can this actually
40:40
get us to this kind of world model that understands enough about the world
40:46
to be able to drive a car? - Possibly. And this is gonna take a while
40:51
before we get to that point. And there are systems already, robotic systems,
40:56
that are based on this idea. What you need for this
41:02
is a slightly modified version of this where imagine that you have a video,
41:11
a complete video, and what you're doing to this video is that you are either translating it in time
41:17
towards the future. So you'll only see the beginning of the video, but you don't see the latter part of it that is in the original one.
41:24
Or you just mask the second half of the video, for example. And then you train this I-JEPA system
41:30
of the type I described, to predict the representation of the full video from the shifted one.
41:36
But you also feed the predictor with an action. For example, the wheel is turned
41:42
10 degrees to the right or something, right? So if it's a dash cam in a car
41:49
and you know the angle of the wheel, you should be able to predict to some extent what's going to happen to what you see.
41:57
You're not gonna be able to predict all the details of objects that appear in the view, obviously,
42:02
but at an abstract representation level, you can probably predict what's gonna happen.
42:08
So now what you have is an internal model that says, here is my idea of the state of the world at time T,
42:15
here is an action I'm taking, here is a prediction of the state of the world at time T plus one,
42:20
T plus delta T, T plus two seconds, whatever it is. If you have a model of this type,
42:26
you can use it for planning. So now you can do what LLMs cannot do,
42:31
which is planning what you're gonna do so that you arrive at a particular outcome
42:37
or satisfy a particular objective, right? So you can have a number of objectives, right?
42:44
I can predict that if I have an object like this, right?
42:50
And I open my hand, it's gonna fall, right? And if I push it with a particular force on the table,
42:57
it's gonna move. If I push the table itself, it's probably not gonna move with the same force.
43:03
So we have this internal model of the world in our mind,
43:09
which allows us to plan sequences of actions to arrive at a particular goal. And so now if you have this world model,
43:18
we can imagine a sequence of actions, predict what the outcome of the sequence of action is going to be,
43:25
measure to what extent the final state satisfies a particular objective
43:30
like moving the bottle to the left of the table. And then plan a sequence of actions
43:38
that will minimize this objective at runtime. We're not talking about learning, we're talking about inference time, right?
43:44
So this is planning, really. And in optimal control, this is a very classical thing. It's called model predictive control.
43:50
You have a model of the system you want to control that can predict the sequence of states
43:55
corresponding to a sequence of commands. And you are planning a sequence of commands
44:02
so that according to your world model, the end state of the system will satisfy any objectives that you fix.
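Here is a minimal sketch of that planning loop, assuming a toy linear world model and a quadratic cost; in the setting described above these would operate on learned abstract representations, so everything below is a placeholder meant only to show the inference-time optimization of an action sequence.

```python
# Model predictive control with a learned world model: optimize the actions, not the model.
import torch

state_dim, action_dim, horizon = 8, 2, 10
world_model = torch.nn.Linear(state_dim + action_dim, state_dim)   # s_{t+1} = f(s_t, a_t)
goal = torch.zeros(state_dim)

def cost(state):
    return ((state - goal) ** 2).sum()           # how far the final state is from the objective

def plan(s0, steps=100, lr=0.1):
    actions = torch.zeros(horizon, action_dim, requires_grad=True)
    opt = torch.optim.Adam([actions], lr=lr)
    for _ in range(steps):                       # inference-time optimization, no learning
        s = s0
        for t in range(horizon):
            s = world_model(torch.cat([s, actions[t]]))   # roll the world model forward
        loss = cost(s)
        opt.zero_grad(); loss.backward(); opt.step()
    return actions.detach()

best_actions = plan(torch.randn(state_dim))
```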
44:10
This is the way rocket trajectories have been planned
44:16
since computers have been around. So since the early '60s, essentially. - So yes, for a model predictive control,
44:21
but you also often talk about hierarchical planning. - [Yann] Yeah.
Hierarchical planning
44:26
- Can hierarchical planning emerge from this somehow? - Well, so no. You will have to build a specific architecture
44:32
to allow for hierarchical planning. So hierarchical planning is absolutely necessary if you want to plan complex actions.
44:40
If I wanna go, let's say, from New York to Paris, this is the example I use all the time. And I'm sitting in my office at NYU.
44:48
My objective that I need to minimize is my distance to Paris. At a high level, a very abstract representation of my location,
44:57
I would have to decompose this into two sub-goals. First one is go to the airport, second one is catch a plane to Paris.
45:04
Okay. So my sub-goal is now going to the airport. My objective function is my distance to the airport.
45:12
How do I go to the airport? Well, I have to go in the street and hail a taxi,
45:18
which you can do in New York. Okay, now I have another sub-goal. Go down to the street.
45:24
Well, that means going to the elevator, going down the elevator, walk out to the street.
45:30
How do I go to the elevator? I have to stand up from my chair,
45:36
open the door of my office, go to the elevator, push the button. How do I get up from my chair?
45:42
Like you can imagine going down all the way down to basically what amounts
45:47
to millisecond by millisecond muscle control. Okay? And obviously you're not going to plan your entire trip
45:55
from New York to Paris in terms of millisecond by millisecond muscle control.
46:00
First, that would be incredibly expensive, but it will also be completely impossible because you don't know all the conditions
46:06
of what's gonna happen. How long it's gonna take to catch a taxi
46:11
or to go to the airport with traffic. I mean, you would have to know exactly
46:16
the condition of everything to be able to do this planning, and you don't have the information. So you have to do this hierarchical planning
46:23
so that you can start acting and then sort of re-planning as you go. And nobody really knows how to do this in AI.
46:33
Nobody knows how to train a system to learn the appropriate multiple levels of representation so that hierarchical planning works.
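Purely as an illustration of the control flow of the New York-to-Paris example, here is a hypothetical sketch of recursive goal decomposition; the `decompose` function is the unsolved part, which is exactly the open problem being described.

```python
# Hypothetical hierarchical planning skeleton; the hard part is left as a stub.
def decompose(goal, level):
    """Return sub-goals for `goal` at the next, more concrete level (unsolved)."""
    raise NotImplementedError("learning these multi-level representations is the open problem")

def hierarchical_plan(goal, level=0, lowest_level=4):
    if level == lowest_level:
        return [goal]                            # e.g. low-level motor commands
    plan = []
    for sub_goal in decompose(goal, level):      # "go to the airport", "hail a taxi", ...
        plan += hierarchical_plan(sub_goal, level + 1, lowest_level)
        # in practice you would act and re-plan as you go, since conditions change
    return plan
```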
46:41
- Does something like that already emerge? So like can you use an LLM,
46:46
state-of-the-art LLM, to get you from New York to Paris by doing exactly the kind of detailed
46:54
set of questions that you just did? Which is can you give me a list of 10 steps I need to do
47:01
to get from New York to Paris? And then for each of those steps, can you give me a list of 10 steps
47:07
how I make that step happen? And for each of those steps, can you give me a list of 10 steps
47:12
to make each one of those, until you're moving your individual muscles? Maybe not.
47:17
Whatever you can actually act upon using your own mind. - Right. So there's a lot of questions
47:23
that are also implied by this, right? So the first thing is LLMs will be able to answer some of those questions
47:28
down to some level of abstraction. Under the condition that they've been trained
47:34
with similar scenarios in their training set. - They would be able to answer all of those questions.
47:40
But some of them may be hallucinated, meaning non-factual. - Yeah, true.
47:45
I mean they'll probably produce some answer. Except they're not gonna be able to really kind of produce millisecond by millisecond muscle control
47:50
of how you stand up from your chair, right? But down to some level of abstraction where you can describe things by words,
47:57
they might be able to give you a plan, but only under the condition that they've been trained to produce those kind of plans, right?
48:04
They're not gonna be able to plan for situations they never encountered before.
48:09
They basically are going to have to regurgitate the template that they've been trained on. - But where, just for the example of New York to Paris,
48:15
is it gonna start getting into trouble? Like at which layer of abstraction do you think you'll start?
48:22
Because like I can imagine almost every single part of that, an LLM will be able to answer somewhat accurately,
48:27
especially when you're talking about New York and Paris, major cities. - So I mean certainly an LLM
48:33
would be able to solve that problem if you fine tune it for it. - [Lex] Sure. - And so I can't say that an LLM cannot do this,
48:42
it can do this if you train it for it, there's no question, down to a certain level
48:47
where things can be formulated in terms of words. But like if you wanna go down to like how do you climb down the stairs
48:54
or just stand up from your chair in terms of words, like you can't do it.
49:00
That's one of the reasons you need experience of the physical world,
49:06
which is much higher bandwidth than what you can express in words, in human language. - So everything we've been talking about
49:12
on the joint embedding space, is it possible that that's what we need for like the interaction with physical reality
49:18
on the robotics front? And then just the LLMs are the thing that sits on top of it
49:24
for the bigger reasoning about like the fact that I need to book a plane ticket
49:30
and I need to know how to go to the websites and so on. - Sure. And a lot of plans that people know about
49:37
that are relatively high level are actually learned. Most people don't invent the plans by themselves.
49:50
We have some ability to do this, of course, obviously, but most plans that people use
49:57
are plans that they've been trained on. Like they've seen other people use those plans or they've been told how to do things, right?
50:04
You can't just invent them. Like, take a person who's never heard of airplanes
50:09
and tell them like, how do you go from New York to Paris? They're probably not going to be able to kind of deconstruct the whole plan
50:16
unless they've seen examples of that before. So certainly LLMs are gonna be able to do this.
50:21
But then how you link this from the low level of actions,
50:28
that needs to be done with things like JEPA, that basically lift the abstraction level
50:33
of the representation without attempting to reconstruct every detail of the situation. That's what we need JEPAs for.
Autoregressive LLMs
50:40
- I would love to sort of linger on your skepticism around autoregressive LLMs.
50:48
So one way I would like to test that skepticism is everything you say makes a lot of sense,
50:57
but if I apply everything you said today and in general to like, I don't know,
51:03
10 years ago, maybe a little bit less. No, let's say three years ago. I wouldn't be able to predict the success of LLMs.
51:12
So does it make sense to you that autoregressive LLMs are able to be so damn good?
51:20
- [Yann] Yes. - Can you explain your intuition? Because if I were to take your wisdom and intuition
51:29
at face value, I would say there's no way autoregressive LLMs one token at a time,
51:34
would be able to do the kind of things they're doing. - No, there's one thing that autoregressive LLMs or that LLMs in general, not just the autoregressive ones,
51:42
but including the BERT style bidirectional ones, are exploiting, and it's self supervised learning.
51:49
And I've been a very, very strong advocate of self supervised learning for many years. So those things are an incredibly impressive demonstration
51:58
that self supervised learning actually works. The idea that started...
52:04
It didn't start with BERT, but it was really kind of a good demonstration with this. So the idea that you take a piece of text, you corrupt it,
52:14
and then you train some gigantic neural net to reconstruct the parts that are missing. That has been an enormous...
52:23
Produced an enormous amount of benefits. It allowed us to create systems that understand language,
52:31
systems that can translate hundreds of languages in any direction,
52:36
systems that are multilingual. It's a single system that can be trained to understand hundreds of languages
52:43
and translate in any direction and produce summaries
52:48
and then answer questions and produce text. And then there's a special case of it,
52:54
which is the autoregressive trick where you constrain the system to not elaborate a representation of the text
53:02
from looking at the entire text, but only predicting a word from the words that have come before.
53:08
Right? And you do this by constraining the architecture of the network. And that's what you can build an autoregressive LLM from.
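A small sketch of that architectural constraint, with illustrative shapes: a causal (lower-triangular) attention mask that prevents each position from attending to later tokens.

```python
# Causal masking: position t can only mix information from tokens <= t.
import torch

seq_len = 6
scores = torch.randn(seq_len, seq_len)                       # raw attention scores
causal_mask = torch.tril(torch.ones(seq_len, seq_len)).bool()
scores = scores.masked_fill(~causal_mask, float("-inf"))     # block attention to the future
attn = torch.softmax(scores, dim=-1)                         # each row attends only to the past
```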
53:15
So there was a surprise many years ago with what's called decoder only LLM.
53:20
So systems of this type that are just trying to produce words from the previous one.
53:28
And the fact that when you scale them up, they tend to really kind of understand more about language.
53:36
When you train them on lots of data, you make them really big. That was kind of a surprise. And that surprise occurred quite a while back.
53:42
Like with work from Google, Meta, OpenAI, et cetera,
53:50
going back to the GPT kind of generative pre-trained transformers.
53:56
- You mean like GPT-2? Like there's a certain place where you start to realize scaling might actually keep giving us an emergent benefit.
54:06
- Yeah, I mean there were work from various places, but if you want to kind of place it in the GPT timeline,
54:16
that would be around GPT-2, yeah. - Well, 'cause you said it, you're so charismatic and you said so many words,
54:23
but self supervised learning, yes. But again, the same intuition you're applying
54:28
to saying that autoregressive LLMs cannot have a deep understanding of the world,
54:35
if we just apply that same intuition, does it make sense to you
54:40
that they're able to form enough of a representation in the world to be damn convincing,
54:45
essentially passing the original Turing test with flying colors.
54:50
- Well, we're fooled by their fluency, right? We just assume that if a system is fluent
54:56
in manipulating language, then it has all the characteristics of human intelligence. But that impression is false.
55:04
We're really fooled by it. - Well, what do you think Alan Turing would say? Without understanding anything,
55:10
just hanging out with it- - Alan Turing would decide that a Turing test is a really bad test. (Lex chuckles)
55:15
Okay. This is what the AI community has decided many years ago that the Turing test was a really bad test of intelligence.
55:22
- What would Hans Moravec say about the large language models? - Hans Moravec would say the Moravec's paradox still applies.
55:30
- [Lex] Okay. - Okay? Okay, we can pass- - You don't think he would be really impressed. - No, of course everybody would be impressed.
55:35
(laughs) But it is not a question of being impressed or not, it is a question of knowing
55:41
what the limit of those systems can do. Again, they are impressive. They can do a lot of useful things.
55:47
There's a whole industry that is being built around them. They're gonna make progress, but there is a lot of things they cannot do.
55:53
And we have to realize what they cannot do and then figure out how we get there.
55:59
And I'm not saying this... I'm saying this from basically 10 years of research
56:07
on the idea of self supervised learning, actually that's going back more than 10 years,
56:13
but the idea of self supervised learning. So basically capturing the internal structure of a piece of a set of inputs
56:21
without training the system for any particular task, right? Learning representations. The conference I co-founded 14 years ago
56:28
is called International Conference on Learning Representations, that's the entire issue that deep learning is dealing with.
56:34
Right? And it's been my obsession for almost 40 years now. So learning representation is really the thing.
56:42
For the longest time we could only do this with supervised learning. And then we started working on
56:47
what we used to call unsupervised learning and sort of revived the idea of unsupervised learning
56:55
in the early 2000s with Yoshua Bengio and Jeff Hinton. Then discovered that supervised learning
57:00
actually works pretty well if you can collect enough data. And so the whole idea of unsupervised self supervision
57:07
took a backseat for a bit and then I kind of tried to revive it in a big way,
57:16
starting in 2014 basically when we started FAIR, and really pushing for like finding new methods
57:24
to do self supervised learning, both for text and for images and for video and audio.
57:29
And some of that work has been incredibly successful. I mean, the reason why we have multilingual translation system,
57:37
things to do content moderation on Meta, for example, on Facebook, that are multilingual,
57:42
that understand whether a piece of text is hate speech or not or something, is due to that progress using self supervised learning for NLP,
57:50
combining this with transformer architectures and blah blah blah. But that's the big success of self supervised learning.
57:55
We had similar success in speech recognition, a system called Wav2Vec, which is also a joint embedding architecture by the way,
58:02
trained with contrastive learning. And that system also can produce speech recognition systems that are multilingual
58:10
with mostly unlabeled data and only need a few minutes of labeled data to actually do speech recognition.
58:16
That's amazing. We have systems now based on those combination of ideas
58:22
that can do real time translation of hundreds of languages into each other, speech to speech.
58:28
- Speech to speech, even including, which is fascinating, languages that don't have written forms-
58:33
- That's right. - They're spoken only. - That's right. We don't go through text, it goes directly from speech to speech using an internal representation
58:40
of kinda speech units that are discrete. But it's called Textless NLP. We used to call it this way.
58:45
But yeah. I mean incredible success there. And then for 10 years we tried to apply this idea
58:53
to learning representations of images by training a system to predict videos, learning intuitive physics
58:58
by training a system to predict what's gonna happen in the video. And tried and tried and failed and failed
59:05
with generative models, with models that predict pixels. We could not get them to learn
59:10
good representations of images, we could not get them to learn good representations of videos.
59:16
And we tried many times, we published lots of papers on it. They kind of sort of worked, but not really great.
59:23
It started working when we abandoned this idea of predicting every pixel and basically just did the joint embedding and prediction
59:30
in representation space. That works. So there's ample evidence
59:36
that we're not gonna be able to learn good representations of the real world
59:42
using generative models. So I'm telling people, everybody's talking about generative AI. If you're really interested in human level AI,
59:48
abandon the idea of generative AI. (Lex laughs) - Okay. But you really think it's possible
59:54
to get far with joint embedding representation? So like there's common sense reasoning
1:00:01
and then there's high level reasoning. Like I feel like those are two...
1:00:08
The kind of reasoning that LLMs are able to do. Okay, let me not use the word reasoning,
1:00:13
but the kind of stuff that LLMs are able to do seems fundamentally different than the common sense reasoning we use
1:00:19
to navigate the world. - [Yann] Yeah. - It seems like we're gonna need both- - Sure. - Would you be able to get,
1:00:25
with the joint embedding which is a JEPA type of approach, looking at video, would you be able to learn,
1:00:32
let's see, well, how to get from New York to Paris, or how to understand the state of politics in the world?
1:00:42
(both laugh) Right? These are things where various humans generate a lot of language and opinions on,
1:00:49
in the space of language, but don't visually represent that in any clearly compressible way.
1:00:56
- Right. Well, there's a lot of situations that might be difficult for a purely language based system to know.
1:01:04
Like, okay, you can probably learn from reading texts, the entirety of the publicly available text in the world
1:01:11
that I cannot get from New York to Paris by snapping my fingers. That's not gonna work, right? - [Lex] Yes.
1:01:18
- But there's probably sort of more complex scenarios of this type which an LLM may never have encountered
1:01:25
and may not be able to determine whether it's possible or not. So that link from the low level to the high level...
1:01:35
The thing is that the high level that language expresses is based on the common experience of the low level,
1:01:43
which LLMs currently do not have. When we talk to each other, we know we have a common experience of the world.
1:01:50
Like a lot of it is similar. And LLMs don't have that.
1:01:59
- But see, there it's present. You and I have a common experience of the world in terms of the physics of how gravity works
1:02:05
and stuff like this. And that common knowledge of the world,
1:02:11
I feel like is there in the language. We don't explicitly express it,
1:02:17
but if you have a huge amount of text, you're going to get this stuff that's between the lines.
1:02:24
In order to form a consistent world model, you're going to have to understand how gravity works,
1:02:31
even if you don't have an explicit explanation of gravity. So even though, in the case of gravity,
1:02:37
there is explicit explanation. There's gravity in Wikipedia. But like the stuff that we think of
1:02:44
as common sense reasoning, I feel like to generate language correctly, you're going to have to figure that out.
1:02:51
Now, you could say as you have, there's not enough text- - Well, I agree. - Sorry. Okay, yeah.
1:02:56
(laughs) You don't think so? - No, I agree with what you just said, which is that to be able to do high level common sense...
1:03:03
To have high level common sense, you need to have the low level common sense to build on top of. - [Lex] Yeah.
1:03:09
But that's not there. - That's not there in LLMs. LLMs are purely trained from text. So then the other statement you made,
1:03:15
I would not agree with the fact that implicit in all languages in the world
1:03:20
is the underlying reality. There's a lot about underlying reality which is not expressed in language.
1:03:26
- Is that obvious to you? - Yeah, totally. - So like all the conversations we have...
1:03:34
Okay, there's the dark web, meaning whatever, the private conversations like DMs and stuff like this,
1:03:41
which is much, much larger probably than what's available, what LLMs are trained on.
1:03:46
- You don't need to communicate the stuff that is common. - But the humor, all of it. No, you do.
1:03:52
You don't need to, but it comes through. Like if I accidentally knock this over,
1:03:58
you'll probably make fun of me. And in the content of you making fun of me will be an explanation of the fact that cups fall
1:04:07
and then gravity works in this way. And then you'll have some very vague information
1:04:12
about what kind of things explode when they hit the ground. And then maybe you'll make a joke about entropy
1:04:19
or something like this and we will never be able to reconstruct this again. Like, okay, you'll make a little joke like this
1:04:24
and there'll be trillions of other jokes. And from the jokes, you can piece together the fact that gravity works
1:04:30
and mugs can break and all this kind of stuff, you don't need to see... It'll be very inefficient.
1:04:36
It's easier for like to not knock the thing over. (laughing) - [Yann] Yeah.
1:04:42
- But I feel like it would be there if you have enough of that data. - I just think that most of the information of this type
1:04:50
that we have accumulated when we were babies is just not present in text,
1:04:58
in any description, essentially. And the sensory data is a much richer source for getting that kind of understanding.
1:05:04
I mean, that's the 16,000 hours of wake time of a 4-year-old. And 10 to the 15 bytes, going through vision.
1:05:12
Just vision, right? There is a similar bandwidth of touch
1:05:17
and a little less through audio. And then text doesn't... Language doesn't come in until like a year in life.
1:05:26
And by the time you are nine years old, you've learned about gravity, you know about inertia,
1:05:31
you know about stability, you know about the distinction between animate and inanimate objects.
1:05:38
By 18 months, you know about like why people want to do things and you help them if they can't.
1:05:45
I mean there's a lot of things that you learn mostly by observation, really not even through interaction.
1:05:52
In the first few months of life, babies don't really have any influence on the world. They can only observe, right?
1:05:58
And you accumulate like a gigantic amount of knowledge just from that. So that's what we're missing from current AI systems.
AI hallucination
1:06:07
- I think in one of your slides you have this nice plot that is one of the ways you show that LLMs are limited.
1:06:13
I wonder if you could talk about hallucinations from your perspectives. Why hallucinations happen from large language models,
1:06:23
and to what degree is that a fundamental flaw of large language models.
1:06:29
- Right. So because of the autoregressive prediction, every time an LLM produces a token or a word,
1:06:37
there is some level of probability for that word to take you out of the set of reasonable answers.
1:06:45
And if you assume, which is a very strong assumption, that such errors
1:06:53
are independent across the sequence of tokens being produced.
1:06:59
What that means is that every time you produce a token, the probability that you stay within the set of correct answers decreases
1:07:06
and it decreases exponentially. - So there's a strong, like you said, assumption there that if there's a non-zero probability of making a mistake,
1:07:14
which there appears to be, then there's going to be a kind of drift. - Yeah.
1:07:19
And that drift is exponential. It's like errors accumulate, right? So the probability that an answer would be nonsensical
1:07:27
increases exponentially with the number of tokens. - Is that obvious to you by the way?
1:07:33
Well, so mathematically speaking maybe, but like isn't there a kind of gravitational pull
1:07:39
towards the truth? Because on average, hopefully, the truth is well represented in the training set.
1:07:48
- No, it's basically a struggle against the curse of dimensionality.
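A back-of-the-envelope illustration of that independence assumption, with a made-up per-token error rate: if each token independently has probability e of leaving the set of reasonable answers, the chance of staying on track after n tokens is (1 - e)^n.

```python
# Exponential drift under the independence assumption (error rate is illustrative).
e = 0.01                     # assumed per-token probability of leaving the set of good answers
for n in (10, 100, 500, 1000):
    print(n, (1 - e) ** n)   # roughly 0.90, 0.37, 0.0066, 0.00004
```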
1:07:55
So the way you can correct for this is that you fine tune the system by having it produce answers
1:08:01
for all kinds of questions that people might come up with. And people are people,
1:08:06
so a lot of the questions that they have are very similar to each other. So you can probably cover, you know, 80% or whatever of questions that people will ask
1:08:16
by collecting data.
1:08:22
And then you fine tune the system to produce good answers for all of those things. And it's probably gonna be able to learn that
1:08:27
because it's got a lot of capacity to learn. But then there is the enormous set of prompts
1:08:36
that you have not covered during training. And that set is enormous. Like within the set of all possible prompts,
1:08:43
the proportion of prompts that have been used for training is absolutely tiny.
1:08:49
It's a tiny, tiny, tiny subset of all possible prompts. And so the system will behave properly
1:08:56
on the prompts that it's been either trained, pre-trained or fine tuned on.
1:09:01
But then there is an entire space of things that it cannot possibly have been trained on
1:09:06
because it's just the number is gigantic. So whatever training the system
1:09:14
has been subject to produce appropriate answers, you can break it by finding out a prompt
1:09:20
that will be outside of the set of prompts it's been trained on
1:09:25
or things that are similar, and then it will just spew complete nonsense. - When you say prompt,
1:09:31
do you mean that exact prompt or do you mean a prompt that's like, in many parts very different than...
1:09:38
Is it that easy to ask a question or to say a thing that hasn't been said before
1:09:45
on the internet? - I mean, people have come up with things where like you put essentially
1:09:51
a random sequence of characters in a prompt and that's enough to kind of throw the system into a mode
1:09:57
where it's gonna answer something completely different than it would have answered without this.
1:10:03
So that's a way to jailbreak the system, basically. Go outside of its conditioning, right?
1:10:09
- So that's a very clear demonstration of it. But of course, that goes outside
1:10:16
of what it's designed to do, right? If you actually stitch together reasonably grammatical sentences,
1:10:22
is it that easy to break it? - Yeah. Some people have done things like
1:10:29
you write a sentence in English or you ask a question in English and it produces a perfectly fine answer.
1:10:36
And then you just substitute a few words by the same word in another language,
1:10:42
and all of a sudden the answer is complete nonsense. - Yeah. So I guess what I'm saying is like, which fraction of prompts that humans are likely to generate
1:10:52
are going to break the system? - So the problem is that there is a long tail.
1:10:57
- [Lex] Yes. - This is an issue that a lot of people have realized in social networks and stuff like that,
1:11:04
which is there's a very, very long tail of things that people will ask. And you can fine tune the system
1:11:09
for the 80% or whatever of the things that most people will ask.
1:11:16
And then this long tail is so large that you're not gonna be able to fine tune the system for all the conditions.
1:11:21
And in the end, the system ends up being kind of a giant lookup table, right? (laughing) Essentially. Which is not really what you want.
1:11:27
You want systems that can reason, certainly that can plan. So the type of reasoning that takes place in LLM
Reasoning in AI
1:11:33
is very, very primitive. And the reason you can tell it's primitive is because the amount of computation
1:11:39
that is spent per token produced is constant. So if you ask a question
1:11:45
and that question has an answer in a given number of tokens, the amount of computation devoted to computing that answer
1:11:52
can be exactly estimated. It's the size of the prediction network
1:11:59
with its 36 layers or 92 layers or whatever it is, multiplied by number of tokens.
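As a rough illustration of that constant-compute point, using the common approximation that a decoder-only forward pass costs about two FLOPs per parameter per generated token (the parameter count here is a made-up example): the total depends only on the number of tokens produced, not on how hard the question is.

```python
# Compute per answer is roughly fixed by model size and answer length.
params = 7e9                                  # e.g. a 7B-parameter model (illustrative)
for answer_tokens in (10, 100, 1000):
    flops = 2 * params * answer_tokens        # same formula whether the question is easy or hard
    print(answer_tokens, f"{flops:.1e} FLOPs")
```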
1:12:05
That's it. And so essentially, it doesn't matter if the question being asked
1:12:12
is simple to answer, complicated to answer, impossible to answer
1:12:17
because it's undecidable or something. The amount of computation the system will be able to devote to the answer is constant
1:12:25
or is proportional to the number of tokens produced in the answer, right? This is not the way we work,
1:12:30
the way we reason is that when we are faced with a complex problem
1:12:37
or a complex question, we spend more time trying to solve it and answer it, right?
1:12:42
Because it's more difficult. - There's a prediction element, there's an iterative element where you're like adjusting your understanding of a thing
1:12:52
by going over and over and over. There's a hierarchical element, and so on. Does this mean it's a fundamental flaw of LLMs-
1:12:59
- [Yann] Yeah. - Or does it mean that... (laughs) There's more part to that question? (laughs)
1:13:04
Now you're just behaving like an LLM. (laughs) Immediately answering. No, that it's just the low level world model
1:13:14
on top of which we can then build some of these kinds of mechanisms, like you said, persistent long-term memory or reasoning,
1:13:23
so on. But we need that world model that comes from language.
1:13:29
Maybe it is not so difficult to build this kind of reasoning system on top of a well constructed world model.
1:13:36
- Okay. Whether it's difficult or not, the near future will say, because a lot of people are working on reasoning
1:13:43
and planning abilities for dialogue systems. I mean, even if we restrict ourselves to language,
1:13:52
just having the ability to plan your answer before you answer, in terms that are not necessarily linked
1:13:59
with the language you're gonna use to produce the answer. Right? So this idea of this mental model that allows you to plan what you're gonna say
1:14:06
before you say it. That is very important.
1:14:11
I think there's going to be a lot of systems over the next few years that are going to have this capability,
1:14:17
but the blueprint of those systems will be extremely different from autoregressive LLMs.
1:14:23
So it's the same difference as the difference between
1:14:29
what psychology has called system one and system two in humans, right? So system one is the type of task that you can accomplish
1:14:35
without like deliberately, consciously thinking about how you do them. You just do them.
1:14:42
You've done them enough that you can just do it subconsciously, right? Without thinking about them. If you're an experienced driver,
1:14:48
you can drive without really thinking about it and you can talk to someone at the same time or listen to the radio, right?
1:14:55
If you are a very experienced chess player, you can play against a non-experienced chess player
1:15:01
without really thinking either, you just recognize the pattern and you play, right? That's system one.
1:15:08
So all the things that you do instinctively without really having to deliberately plan and think about it.
1:15:13
And then there is other tasks where you need to plan. So if you are a not too experienced chess player
1:15:19
or you are experienced but you play against another experienced chess player, you think about all kinds of options, right?
1:15:24
You think about it for a while, right? And you're much better if you have time to think about it
1:15:30
than you are if you play blitz with limited time.
1:15:35
And so this type of deliberate planning,
1:15:40
which uses your internal world model, that's system two, this is what LLMs currently cannot do.
1:15:46
How do we get them to do this, right? How do we build a system that can do this kind of planning or reasoning
1:15:55
that devotes more resources to complex problems than to simple problems.
1:16:00
And it's not going to be autoregressive prediction of tokens, it's going to be more something akin to inference
1:16:08
of latent variables in what used to be called probabilistic models
1:16:14
or graphical models and things of that type. So basically the principle is like this.
1:16:21
The prompt is like observed variables.
1:16:26
And what the model does is that it's basically a measure of...
1:16:33
It can measure to what extent an answer is a good answer for a prompt. Okay?
1:16:38
So think of it as some gigantic neural net, but it's got only one output. And that output is a scalar number,
1:16:45
which is let's say zero if the answer is a good answer for the question, and a large number
1:16:51
if the answer is not a good answer for the question. Imagine you had this model. If you had such a model,
1:16:56
you could use it to produce good answers. The way you would do it is produce the prompt
1:17:02
and then search through the space of possible answers for one that minimizes that number.
1:17:10
That's called an energy based model. - But that energy based model would need the model constructed by the LLM.
1:17:18
- Well, so really what you need to do would be to not search over possible strings of text
1:17:24
that minimize that energy. But what you would do is do this in abstract representation space.
1:17:31
So in sort of the space of abstract thoughts, you would elaborate a thought, right?
1:17:37
Using this process of minimizing the output of your model.
1:17:42
Okay? Which is just a scalar. It's an optimization process, right? So now the way the system produces its answer
1:17:48
is through optimization by minimizing an objective function basically, right?
1:17:56
And this is, we're talking about inference, we're not talking about training, right? The system has been trained already.
1:18:01
So now we have an abstract representation of the thought of the answer, representation of the answer.
1:18:06
We feed that to basically an autoregressive decoder, which can be very simple, that turns this into a text that expresses this thought.
1:18:15
Okay? So that in my opinion is the blueprint of future dialogue systems.
1:18:21
They will think about their answer, plan their answer by optimization before turning it into text.
1:18:27
And that is Turing complete. - Can you explain exactly what the optimization problem there is?
1:18:34
Like what's the objective function? Just linger on it. You kind of briefly described it,
1:18:40
but over what space are you optimizing? - The space of representations-
1:18:45
- So the abstract representation. - That's right. So you have an abstract representation inside the system.
1:18:51
You have a prompt. The prompt goes through an encoder, produces a representation, perhaps goes through a predictor that predicts a representation of the answer,
1:18:58
of the proper answer. But that representation may not be a good answer
1:19:03
because there might be some complicated reasoning you need to do, right? So then you have another process
1:19:11
that takes the representation of the answers and modifies it so as to minimize a cost function
1:19:20
that measures to what extent the answer is a good answer for the question. Now we sort of ignore the fact for...
1:19:27
I mean, the issue for a moment of how you train that system to measure whether an answer is a good answer for a question.
1:19:35
- But suppose such a system could be created, what's the process? This kind of search like process.
1:19:42
- It's an optimization process. You can do this if the entire system is differentiable,
1:19:47
that scalar output is the result of running through some neural net, running the answer,
1:19:54
the representation of the answer through some neural net. Then by gradient descent, by back propagating gradients,
1:20:00
you can figure out like how to modify the representation of the answers so as to minimize that. - So that's still a gradient based.
1:20:06
- It's gradient based inference. So now you have a representation of the answer in abstract space.
1:20:12
Now you can turn it into text, right? And the cool thing about this
1:20:17
is that the representation now can be optimized through gradient descent, but also is independent of the language
1:20:24
in which you're going to express the answer. - Right. So you're operating in the space of abstract representations.
1:20:30
I mean this goes back to the joint embedding. - [Yann] Right. - That it's better to work in the space of...
1:20:36
I don't know. Or to romanticize the notion like space of concepts versus the space of concrete sensory information.
1:20:45
- Right. - Okay. But can this do something like reasoning, which is what we're talking about?
1:20:51
- Well, not really, only in a very simple way. I mean basically you can think of those things as doing
1:20:57
the kind of optimization I was talking about, except they're optimizing the discrete space which is the space of possible sequences of tokens.
1:21:05
And they do this optimization in a horribly inefficient way, which is generate a lot of hypothesis
1:21:11
and then select the best ones. And that's incredibly wasteful
1:21:16
in terms of computation, 'cause you basically have to run your LLM for like every possible generated sequence.
1:21:25
And it's incredibly wasteful. So it's much better to do an optimization
1:21:31
in continuous space where you can do gradient descent as opposed to like generate tons of things and then select the best,
1:21:38
you just iteratively refine your answer to go towards the best, right? That's much more efficient.
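A minimal sketch of that inference-by-optimization idea, assuming a toy energy network and decoder: the latent answer representation is refined by gradient descent at inference time and only then turned into text.

```python
# Gradient-based inference in a continuous representation space (all modules are toy placeholders).
import torch

d = 64
energy_net = torch.nn.Sequential(torch.nn.Linear(2 * d, 128), torch.nn.ReLU(),
                                 torch.nn.Linear(128, 1))   # scalar: how bad is this answer for this prompt
decoder = torch.nn.Linear(d, 1000)                          # stand-in for a text decoder

def infer(prompt_repr, steps=200, lr=0.05):
    z = torch.zeros(d, requires_grad=True)                  # abstract representation of the answer
    opt = torch.optim.SGD([z], lr=lr)
    for _ in range(steps):                                  # inference-time optimization, not training
        energy = energy_net(torch.cat([prompt_repr, z])).squeeze()
        opt.zero_grad(); energy.backward(); opt.step()      # iteratively refine z to lower the energy
    return decoder(z.detach())                              # turn the optimized "thought" into tokens

logits = infer(torch.randn(d))
```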
1:21:44
But you can only do this in continuous spaces with differentiable functions. - You're talking about the reasoning,
1:21:50
like ability to think deeply or to reason deeply. How do you know what is an answer
1:21:58
that's better or worse based on deep reasoning?
1:22:04
- Right. So then we're asking the question, of conceptually, how do you train an energy based model? Right?
1:22:10
So an energy based model is a function with a scalar output, just a number. You give it two inputs, X and Y,
1:22:18
and it tells you whether Y is compatible with X or not. X you observe, let's say it's a prompt, an image, a video, whatever.
1:22:24
And Y is a proposal for an answer, a continuation of video, whatever.
1:22:30
And it tells you whether Y is compatible with X. And the way it tells you that Y is compatible with X
1:22:37
is that the output of that function would be zero if Y is compatible with X, it would be a positive number, non-zero
1:22:44
if Y is not compatible with X. Okay. How do you train a system like this? At a completely general level,
1:22:51
is you show it pairs of X and Ys that are compatible, a question and the corresponding answer.
1:22:58
And you train the parameters of the big neural net inside to produce zero.
1:23:04
Okay. Now that doesn't completely work because the system might decide, well, I'm just gonna say zero for everything.
1:23:11
So now you have to have a process to make sure that for a wrong Y, the energy will be larger than zero.
1:23:18
And there you have two options, one is contrastive methods. So contrastive method is you show an X and a bad Y,
1:23:26
and you tell the system, well, give a high energy to this. Like push up the energy, right? Change the weights in the neural net that compute the energy
1:23:33
so that it goes up. So those are contrastive methods. The problem with this is if the space of Y is large,
1:23:41
the number of such contrastive samples you're gonna have to show is gigantic.
1:23:48
But people do this. They do this when you train a system with RLHF,
1:23:53
basically what you're training is what's called a reward model, which is basically an objective function
1:24:00
that tells you whether an answer is good or bad. And that's basically exactly what this is.
1:24:06
So we already do this to some extent. We're just not using it for inference, we're just using it for training.
1:24:14
There is another set of methods which are non-contrastive, and I prefer those. And those non-contrastive method basically say,
1:24:22
okay, the energy function needs to have low energy on pairs of XYs that are compatible
1:24:29
that come from your training set. How do you make sure that the energy is gonna be higher everywhere else?
1:24:37
And the way you do this is by having a regularizer, a criterion,
1:24:43
a term in your cost function that basically minimizes the volume of space
1:24:49
that can take low energy. And the precise way to do this, there's all kinds of different specific ways to do this
1:24:55
depending on the architecture, but that's the basic principle. So that if you push down the energy function
1:25:00
for particular regions in the XY space, it will automatically go up in other places
1:25:06
because there's only a limited volume of space that can take low energy. Okay?
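A hedged sketch of the contrastive recipe just described, with toy shapes: push the energy of compatible pairs toward zero and push mismatched pairs above a margin. The regularized alternative Yann prefers would instead add a term (for example a VICReg-style variance/covariance criterion) that limits the volume of low-energy space, rather than mining bad Ys.

```python
# Contrastive training of an energy function E(x, y).
import torch
import torch.nn.functional as F

d = 32
energy_net = torch.nn.Sequential(torch.nn.Linear(2 * d, 64), torch.nn.ReLU(),
                                 torch.nn.Linear(64, 1), torch.nn.Softplus())  # non-negative energy
opt = torch.optim.Adam(energy_net.parameters(), lr=1e-3)

def contrastive_step(x, y_good, y_bad, margin=1.0):
    e_good = energy_net(torch.cat([x, y_good], dim=-1))     # compatible pair: push energy down
    e_bad = energy_net(torch.cat([x, y_bad], dim=-1))       # mismatched pair: push energy up
    loss = e_good.mean() + F.relu(margin - e_bad).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

x, y_good, y_bad = torch.randn(16, d), torch.randn(16, d), torch.randn(16, d)
contrastive_step(x, y_good, y_bad)
```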
1:25:11
By the construction of the system or by the regularizing function. - We've been talking very generally,
1:25:18
but what is a good X and a good Y? What is a good representation of X and Y?
1:25:25
Because we've been talking about language. And if you just take language directly, that presumably is not good,
1:25:32
so there has to be some kind of abstract representation of ideas. - Yeah. I mean you can do this with language directly
1:25:39
by just, you know, X is a text and Y is the continuation of that text. - [Lex] Yes.
1:25:45
- Or X is a question, Y is the answer. - But you're saying that's not gonna take it. I mean, that's going to do what LLMs are doing.
1:25:52
- Well, no. It depends on how the internal structure of the system is built. If the internal structure of the system
1:25:59
is built in such a way that inside of the system there is a latent variable, let's called it Z,
1:26:04
that you can manipulate so as to minimize the output energy,
1:26:12
then that Z can be viewed as representation of a good answer that you can translate into a Y that is a good answer.
1:26:21
- So this kind of system could be trained in a very similar way? - Very similar way. But you have to have this way of preventing collapse,
1:26:26
of ensuring that there is high energy for things you don't train it on.
1:26:33
And currently it's very implicit in LLMs. It is done in a way
1:26:39
that people don't realize it's being done, but it is being done. It's due to the fact that when you give a high probability to a word,
1:26:49
automatically you give low probability to other words because you only have a finite amount of probability to go around. (laughing)
1:26:55
Right? They have to sum to one. So when you minimize the cross entropy or whatever, when you train your LLM to predict the next word,
1:27:05
you are increasing the probability your system will give to the correct word, but you're also decreasing the probability
1:27:10
it will give to the incorrect words. Now, indirectly, that gives a low probability to...
1:27:17
A high probability to sequences of words that are good and low probability to sequences of words that are bad, but it's very indirect.
1:27:23
It's not obvious why this actually works at all, because you're not doing it on a joint probability
1:27:30
of all the symbols in a sequence, you're just doing it kind of,
1:27:35
sort of factorizing that probability in terms of conditional probabilities over successive tokens.
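A tiny numerical illustration of that normalization effect: because softmax probabilities sum to one, raising the logit of the correct next token necessarily lowers the probability of every other token.

```python
# Pushing one logit up implicitly pushes every other token's probability down.
import torch

logits = torch.zeros(5)
print(torch.softmax(logits, dim=0))      # uniform: 0.2 each
logits[2] += 3.0                         # training increases the correct token's logit
print(torch.softmax(logits, dim=0))      # token 2 is now ~0.83, the others drop to ~0.04
```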
1:27:41
- So how do you do this for visual data? - So we've been doing this with all JEPA architectures, basically the-
1:27:47
- [Lex] The joint embedding? - I-JEPA. So there, the compatibility between two things
1:27:52
is here's an image or a video, here is a corrupted, shifted or transformed version
1:27:58
of that image or video or masked. Okay? And then the energy of the system
1:28:04
is the prediction error of the representation.
1:28:11
The predicted representation of the good thing versus the actual representation of the good thing, right?
1:28:17
So you run the corrupted image to the system, predict the representation of the good input uncorrupted,
1:28:24
and then compute the prediction error. That's the energy of the system. So this system will tell you,
1:28:30
this is a good image and this is a corrupted version.
1:28:36
It will give you zero energy if those two things are effectively, one of them is a corrupted version of the other,
1:28:43
give you a high energy if the two images are completely different. - And hopefully that whole process
1:28:48
gives you a really nice compressed representation of reality, of visual reality.
1:28:54
- And we know it does because then we use those representations as input to a classification system or something, and it works- - And then
1:29:00
that classification system works really nicely. Okay. Well, so to summarize, you recommend in a spicy way that only Yann LeCun can,
Reinforcement learning
1:29:10
you recommend that we abandon generative models in favor of joint embedding architectures? - [Yann] Yes.
1:29:16
- Abandon autoregressive generation. - [Yann] Yes. - Abandon... (laughs) This feels like court testimony.
1:29:21
Abandon probabilistic models in favor of energy based models, as we talked about. Abandon contrastive methods
1:29:27
in favor of regularized methods. And let me ask you about this;
1:29:33
you've been for a while, a critic of reinforcement learning. - [Yann] Yes. - So the last recommendation is that we abandon RL
1:29:41
in favor of model predictive control, as you were talking about. And only use RL
1:29:46
when planning doesn't yield the predicted outcome. And we use RL in that case
1:29:52
to adjust the world model or the critic. - [Yann] Yes. - So you've mentioned RLHF,
1:30:00
reinforcement learning with human feedback. Why do you still hate reinforcement learning?
1:30:05
- [Yann] I don't hate reinforcement learning, and I think it's- - So it's all love? - I think it should not be abandoned completely,
1:30:12
but I think its use should be minimized because it's incredibly inefficient in terms of samples.
1:30:18
And so the proper way to train a system is to first have it learn
1:30:24
good representations of the world and world models from mostly observation,
1:30:29
maybe a little bit of interactions. - And then steer it based on that. If the representation is good, then the adjustments should be minimal.
1:30:36
- Yeah. Now there's two things. If you've learned the world model, you can use the world model to plan a sequence of actions
1:30:42
to arrive at a particular objective. You don't need RL, unless the way you measure whether you succeed
1:30:50
might be inexact. Your idea of whether you were gonna fall from your bike
1:30:58
might be wrong, or whether the person you're fighting with MMA was gonna do something and they do something else. (laughing)
1:31:05
So there's two ways you can be wrong. Either your objective function
1:31:12
does not reflect the actual objective function you want to optimize, or your world model is inaccurate, right?
1:31:19
So the prediction you were making about what was gonna happen in the world is inaccurate.
1:31:25
So if you want to adjust your world model while you are operating the world
1:31:30
or your objective function, that is basically in the realm of RL. This is what RL deals with to some extent, right?
1:31:38
So adjust your world model. And the way to adjust your world model, even in advance,
1:31:44
is to explore parts of the space with your world model, where you know that your world model is inaccurate.
1:31:50
That's called curiosity basically, or play, right? When you play, you kind of explore part of the state space
1:31:58
that you don't want to do for real
1:32:04
because it might be dangerous, but you can adjust your world model
1:32:09
without killing yourself basically. (laughs) So that's what you want to use RL for.
1:32:14
When it comes time to learning a particular task, you already have all the good representations,
1:32:20
you already have your world model, but you need to adjust it for the situation at hand. That's when you use RL.
1:32:26
- Why do you think RLHF works so well? This reinforcement learning with human feedback,
1:32:32
why did it have such a transformational effect on large language models that came before?
1:32:38
- So what's had the transformational effect is human feedback. There is many ways to use it
1:32:43
and some of it is just purely supervised, actually, it's not really reinforcement learning. - So it's the HF. (laughing)
1:32:49
- It's the HF. And then there is various ways to use human feedback, right? So you can ask humans to rate answers,
1:32:56
multiple answers that are produced by a world model. And then what you do is you train an objective function
1:33:05
to predict that rating. And then you can use that objective function
1:33:11
to predict whether an answer is good, and you can back propagate really through this to fine tune your system
1:33:16
so that it only produces highly rated answers. Okay?
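A minimal sketch of such a reward model, assuming answer representations from some placeholder encoder and the standard pairwise preference loss used in RLHF-style training.

```python
# Reward model trained from human preference comparisons (Bradley-Terry style loss).
import torch
import torch.nn.functional as F

d = 128
reward_model = torch.nn.Linear(d, 1)              # scores how good an answer is
opt = torch.optim.Adam(reward_model.parameters(), lr=1e-4)

def preference_step(preferred_repr, rejected_repr):
    r_pref = reward_model(preferred_repr)
    r_rej = reward_model(rejected_repr)
    loss = -F.logsigmoid(r_pref - r_rej).mean()   # push the preferred answer's score above the rejected one's
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

preference_step(torch.randn(16, d), torch.randn(16, d))
# The trained scalar reward can then be used to fine-tune the generator,
# or, as suggested here, as a planning objective at inference time.
```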
1:33:22
So that's one way. So that's like in RL, that means training what's called a reward model, right?
1:33:29
So something that, basically your small neural net that estimates to what extent an answer is good, right?
1:33:35
It's very similar to the objective I was talking about earlier for planning, except now it's not used for planning,
1:33:41
it's used for fine tuning your system. I think it would be much more efficient to use it for planning,
1:33:46
but currently it's used to fine tune the parameters of the system.
1:33:52
Now, there's several ways to do this. Some of them are supervised. You just ask a human person,
1:33:59
like what is a good answer for this, right? Then you just type the answer.
1:34:05
I mean, there's lots of ways that those systems are being adjusted.
Woke AI
1:34:10
- Now, a lot of people have been very critical of the recently released Google's Gemini 1.5
1:34:19
for essentially, in my words, I could say super woke. Woke in the negative connotation of that word.
1:34:26
There is some almost hilariously absurd things that it does, like it modifies history,
1:34:32
like generating images of a black George Washington or perhaps more seriously
1:34:40
something that you commented on Twitter, which is refusing to comment on or generate images of,
1:34:49
or even descriptions of Tiananmen Square or the tank men,
1:34:55
one of the most sort of legendary protest images in history.
1:35:01
And of course, these images are highly censored by the Chinese government.
1:35:06
And therefore everybody started asking questions of what is the process of designing these LLMs?
1:35:14
What is the role of censorship in these, and all that kind of stuff. So you commented on Twitter
1:35:22
saying that open source is the answer. (laughs) - Yeah. - Essentially. So can you explain?
1:35:29
- I actually made that comment on just about every social network I can. (Lex laughs) And I've made that point multiple times in various forums.
1:35:40
Here's my point of view on this. People can complain that AI systems are biased,
1:35:47
and they generally are biased by the distribution of the training data that they've been trained on
1:35:55
that reflects biases in society. And that is potentially offensive to some people
1:36:05
or potentially not. And some techniques to de-bias
1:36:10
then become offensive to some people because of historical incorrectness and things like that.
1:36:23
And so you can ask the question. You can ask two questions. The first question is,
1:36:28
is it possible to produce an AI system that is not biased? And the answer is absolutely not.
1:36:33
And it's not because of technological challenges, although there are technological challenges to that.
1:36:41
It's because bias is in the eye of the beholder.
1:36:46
Different people may have different ideas about what constitutes bias for a lot of things.
1:36:52
I mean there are facts that are indisputable, but there are a lot of opinions or things
1:36:59
that can be expressed in different ways. And so you cannot have an unbiased system, that's just an impossibility.
1:37:08
And so what's the answer to this? And the answer is the same answer that we found
1:37:16
in liberal democracy about the press. The press needs to be free and diverse.
1:37:25
We have free speech for a good reason. It's because we don't want all of our information
1:37:31
to come from a unique source, 'cause that's opposite to the whole idea of democracy
1:37:41
and progressive ideas and even science, right? In science, people have to argue for different opinions.
1:37:48
And science makes progress when people disagree and they come up with an answer and a consensus forms, right?
1:37:54
And it's true in all democracies around the world. So there is a future which is already happening
1:38:03
where every single one of our interaction with the digital world will be mediated by AI systems,
1:38:10
AI assistance, right? We're gonna have smart glasses. You can already buy them from Meta, (laughing)
1:38:16
the Ray-Ban Meta. Where you can talk to them and they are connected with an LLM and you can get answers on any question you have.
1:38:25
Or you can be looking at a monument and there is a camera in the system, in the glasses,
1:38:31
you can ask it like what can you tell me about this building or this monument? You can be looking at a menu in a foreign language
1:38:39
and the thing will translate it for you. We can do real time translation if we speak different languages.
1:38:44
So a lot of our interactions with the digital world are going to be mediated by those systems
1:38:49
in the near future. Increasingly, the search engines that we're gonna use
1:38:56
are not gonna be search engines, they're gonna be dialogue systems that we just ask a question,
1:39:04
and it will answer and then point you to the perhaps appropriate reference for it.
1:39:09
But here is the thing, we cannot afford those systems to come from a handful of companies on the west coast of the US
1:39:17
because those systems will constitute the repository of all human knowledge. And we cannot have that be controlled
1:39:25
by a small number of people, right? It has to be diverse for the same reason the press has to be diverse.
1:39:32
So how do we get a diverse set of AI assistants? It's very expensive and difficult
1:39:38
to train a base model, right? A base LLM at the moment. In the future might be something different,
1:39:43
but at the moment that's an LLM. So only a few companies can do this properly.
1:39:50
And if some of those subsystems are open source,
1:39:56
anybody can use them, anybody can fine tune them. If we put in place some systems
1:40:01
that allow any group of people, whether they are individual citizens,
1:40:10
groups of citizens, government organizations, NGOs, companies, whatever,
1:40:18
to take those open source systems, AI systems,
1:40:23
and fine tune them for their own purpose on their own data, there we're gonna have a very large diversity
1:40:29
of different AI systems that are specialized for all of those things, right? So I'll tell you,
1:40:35
I talked to the French government quite a bit and the French government will not accept
1:40:41
that the digital diet of all their citizens be controlled by three companies
1:40:46
on the west coast of the US. That's just not acceptable. It's a danger to democracy. Regardless of how well intentioned
1:40:52
those companies are, right? And it's also a danger to local culture,
1:41:00
to values, to language, right? I was talking with the founder of Infosys in India.
1:41:13
He's funding a project to fine tune LLaMA 2, the open source model produced by Meta.
1:41:19
So that LLaMA 2 speaks all 22 official languages in India. It's very important for people in India.
1:41:26
I was talking to a former colleague of mine, Moustapha Cisse, who used to be a scientist at FAIR, and then moved back to Africa
1:41:32
and created a research lab for Google in Africa and now has a new startup Kera.
1:41:37
And what he's trying to do is basically have an LLM that speaks the local languages in Senegal so that people can have access to medical information,
1:41:46
'cause they don't have access to doctors, it's a very small number of doctors per capita in Senegal.
1:41:52
I mean, you can't have any of this unless you have open source platforms.
1:41:58
So with open source platforms, you can have AI systems that are not only diverse in terms of political opinions or things of that type,
1:42:05
but in terms of language, culture, value systems,
1:42:11
political opinions, technical abilities in various domains.
1:42:18
And you can have an industry, an ecosystem of companies that fine tune those open source systems
1:42:24
for vertical applications in industry, right? You have, I don't know, a publisher has thousands of books
1:42:30
and they want to build a system that allows a customer to just ask a question about the content of any of their books.
1:42:37
You need to train on their proprietary data, right? You have a company, we have one within Meta it's called Meta Mate.
1:42:44
And it's basically an LLM that can answer any question about internal stuff about the company.
1:42:52
Very useful. A lot of companies want this, right? A lot of companies want this not just for their employees,
1:42:57
but also for their customers, to take care of their customers. So the only way you're gonna have an AI industry,
1:43:04
the only way you're gonna have AI systems that are not uniquely biased, is if you have open source platforms
1:43:10
on top of which any group can build specialized systems.
1:43:16
So the inevitable direction of history
1:43:22
is that the vast majority of AI systems will be built on top of open source platforms.
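To make concrete what it can look like for a group to take an open source base model and fine tune it for their own purpose on their own data, here is a minimal sketch assuming the Hugging Face transformers, peft, and datasets libraries; the corpus file and hyperparameters are placeholders, and parameter-efficient LoRA fine-tuning is just one common approach among several:

```python
# Minimal sketch: parameter-efficient (LoRA) fine-tuning of an open-weight base model
# on a group's own text data. The corpus path and hyperparameters are placeholders.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base_model = "meta-llama/Llama-2-7b-hf"          # open-weight base model (access is gated)
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token        # LLaMA tokenizers ship without a pad token

model = AutoModelForCausalLM.from_pretrained(base_model)
# LoRA adapters: train a small number of extra weights instead of the full model.
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"))

# Local corpus, e.g. text in an under-served language or a company's own documents.
dataset = load_dataset("text", data_files={"train": "my_corpus.txt"})["train"]
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetuned-assistant",
                           per_device_train_batch_size=2,
                           num_train_epochs=1,
                           learning_rate=2e-4),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
model.save_pretrained("finetuned-assistant")     # small adapter weights others can reuse
```

The point of the sketch is that the expensive base model is reused as-is; only a small set of adapter weights is trained on the group's own data, which is what makes a long tail of specialized assistants plausible.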
1:43:28
- So that's a beautiful vision. So meaning like a company like Meta or Google or so on,
1:43:37
should take only minimal fine tuning steps after the building, the foundation, pre-trained model.
1:43:44
As few steps as possible. - Basically. (Lex sighs) - Can Meta afford to do that?
Open source
1:43:51
- No. - So I don't know if you know this, but companies are supposed to make money somehow. And open source is like giving away...
1:44:00
I don't know, Mark made a video, Mark Zuckerberg. A very sexy video talking about 350,000 Nvidia H100s.
1:44:12
The math of that is, just for the GPUs, that's a hundred billion,
1:44:19
plus the infrastructure for training everything. So I'm no business guy,
1:44:26
but how do you make money on that? So the vision you paint is a really powerful one, but how is it possible to make money?
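As an aside on the figure just quoted: a back-of-envelope check, assuming a per-unit price on the order of $30,000 for an H100 (an assumption; real pricing varies), puts the GPUs alone at roughly ten billion dollars, so a figure like a hundred billion presumably folds in data centers, power, networking, and years of operation:

```python
# Back-of-envelope check of the GPU figure; the unit price is an assumption.
num_gpus = 350_000
assumed_price_per_gpu = 30_000                          # USD, rough list-price assumption
gpu_capex = num_gpus * assumed_price_per_gpu
print(f"GPUs alone: ~${gpu_capex / 1e9:.1f} billion")   # ~$10.5 billion
```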
1:44:32
- Okay. So you have several business models, right? The business model that Meta is built around
1:44:40
is you offer a service, and the financing of that service
1:44:48
is either through ads or through business customers. So for example, if you have an LLM
1:44:54
that can help a mom-and-pop pizza place
1:45:00
by talking to their customers through WhatsApp, and so the customers can just order a pizza
1:45:05
and the system will just ask them, like what topping do you want or what size, blah blah, blah.
1:45:12
The business will pay for that. Okay? That's a model.
1:45:19
And otherwise, if it's a system that is on the more kind of classical services, it can be ad supported or there's several models.
1:45:28
But the point is, if you have a big enough potential customer base
1:45:34
and you need to build that system anyway for them,
1:45:40
it doesn't hurt you to actually distribute it to open source. - Again, I'm no business guy,
1:45:45
but if you release the open source model, then other people can do the same kind of task
1:45:51
and compete on it. Basically provide fine tuned models for businesses,
1:45:57
is the bet that Meta is making... By the way, I'm a huge fan of all this. But is the bet that Meta is making
1:46:03
is like, "we'll do a better job of it?" - Well, no. The bet is more, we already have a huge user base and customer base.
1:46:13
- [Lex] Ah, right. - Right? So it's gonna be useful to them. Whatever we offer them is gonna be useful and there is a way to derive revenue from this.
1:46:21
- [Lex] Sure. - And it doesn't hurt that we provide that system or the base model, right?
1:46:29
The foundation model in open source for others to build applications on top of it too.
1:46:35
If those applications turn out to be useful for our customers, we can just buy it for them.
1:46:42
It could be that they will improve the platform. In fact, we see this already. I mean there is literally millions of downloads of LLaMA 2
1:46:50
and thousands of people who have provided ideas about how to make it better. So this clearly accelerates progress
1:46:58
to make the system available to sort of a wide community of people.
1:47:05
And there is literally thousands of businesses who are building applications with it.
1:47:15
Meta's ability to derive revenue from this technology is not impaired by the distribution
1:47:24
of base models in open source. - The fundamental criticism that Gemini is getting is that, as you pointed out on the west coast...
AI and ideology
1:47:31
Just to clarify, we're currently on the east coast, where I would suppose Meta AI headquarters would be.
1:47:38
(laughs) So strong words about the west coast. But I guess the issue that happens is,
1:47:46
I think it's fair to say that most tech people have a political affiliation with the left wing.
1:47:53
They lean left. And so the problem that people are criticizing Gemini with is that in that de-biasing process that you mentioned,
1:48:02
that their ideological lean becomes obvious.
1:48:09
Is this something that could be escaped? You're saying open source is the only way?
1:48:16
- [Yann] Yeah. - Have you witnessed this kind of ideological lean that makes engineering difficult?
1:48:22
- No, I don't think it has to do... I don't think the issue has to do with the political leaning of the people designing those systems.
1:48:29
It has to do with the acceptability or political leanings of their customer base or audience, right?
1:48:38
So a big company cannot afford to offend too many people.
1:48:43
So they're going to make sure that whatever product they put out is "safe,"
1:48:49
whatever that means. And it's very possible to overdo it.
1:48:55
And it's also very possible to... It's impossible to do it properly for everyone. You're not going to satisfy everyone.
1:49:02
So that's what I said before, you cannot have a system that is unbiased and is perceived as unbiased by everyone.
1:49:08
It's gonna be, you push it in one way, one set of people are gonna see it as biased.
1:49:14
And then you push it the other way and another set of people is gonna see it as biased. And then in addition to this,
1:49:19
there's the issue of if you push the system perhaps a little too far in one direction, it's gonna be non-factual, right?
1:49:25
You're gonna have black Nazi soldiers in-
1:49:30
- Yeah. So we should mention image generation of black Nazi soldiers,
1:49:36
which is not factually accurate. - Right. And can be offensive for some people as well, right?
1:49:44
So it's gonna be impossible to kind of produce systems that are unbiased for everyone. So the only solution that I see is diversity.
1:49:53
- And diversity in full meaning of that word, diversity in every possible way. - [Yann] Yeah.
Marc Andreessen
1:49:58
- Marc Andreessen just tweeted today,
1:50:04
let me do a TL;DR. The conclusion is only startups and open source can avoid the issue that he's highlighting with big tech.
1:50:13
He's asking, can big tech actually field generative AI products? One, ever escalating demands from internal activists,
1:50:20
employee mobs, crazed executives, broken boards, pressure groups, extremist regulators, government agencies, the press,
1:50:28
in quotes "experts," and everything corrupting the output.
1:50:34
Two, constant risk of generating a bad answer or drawing a bad picture or rendering a bad video.
1:50:41
Who knows what it's going to say or do at any moment? Three, legal exposure, product liability, slander,
1:50:48
election law, many other things and so on. Anything that makes Congress mad.
1:50:54
Four, continuous attempts to tighten grip on acceptable output, degrade the model, like how good it actually is
1:51:01
in terms of usable and pleasant to use and effective and all that kind of stuff.
1:51:06
And five, publicity of bad text, images, video actually puts those examples into the training data
1:51:13
for the next version. And so on. So he just highlights how difficult this is.
1:51:18
From all kinds of people being unhappy. He just said you can't create a system that makes everybody happy.
1:51:24
- [Yann] Yes. - So if you're going to do the fine tuning yourself and keep a close source,
1:51:31
essentially the problem there is then trying to minimize the number of people who are going to be unhappy. - [Yann] Yeah.
1:51:38
- And you're saying like the only... That that's almost impossible to do, right? And the better way is to do open source.
1:51:45
- Basically, yeah. I mean Marc is right about a number of things that he lists
1:51:51
that indeed scare large companies.
1:51:57
Certainly, congressional investigations is one of them. Legal liability.
1:52:04
Making things that get people to hurt themselves or hurt others. Like big companies are really careful
1:52:12
about not producing things of this type,
1:52:18
because they have... They don't want to hurt anyone, first of all. And then second, they wanna preserve their business.
1:52:23
So it's essentially impossible for systems like this that can inevitably formulate political opinions
1:52:30
and opinions about various things that may be political or not, but that people may disagree about.
1:52:36
About, you know, moral issues and things about like questions about religion
1:52:43
and things like that, right? Or cultural issues that people from different communities would disagree with in the first place.
1:52:50
So there's only kind of a relatively small number of things that people will sort of agree on,
1:52:55
basic principles. But beyond that, if you want those systems to be useful,
1:53:01
they will necessarily have to offend a number of people,
1:53:07
inevitably. - And so open source is just better- - [Yann] Diversity is better, right?
1:53:12
- And open source enables diversity. - That's right. Open source enables diversity. - This can be a fascinating world
1:53:19
where if it's true that the open source world, if Meta leads the way and creates this kind of open source foundation model world,
1:53:27
there's going to be, like governments will have a fine tuned model. (laughing) - [Yann] Yeah.
1:53:33
- And then potentially, people that vote left and right
1:53:39
will have their own model and preference to be able to choose. And it will potentially divide us even more but that's on us humans.
1:53:46
We get to figure out... Basically the technology enables humans to human more effectively.
1:53:53
And all the difficult ethical questions that humans raise we'll just leave it up to us to figure that out.
1:54:02
- Yeah, I mean there are some limits to what... The same way there are limits to free speech, there has to be some limit to the kind of stuff
1:54:08
that those systems might be authorized to produce,
1:54:15
some guardrails. So I mean, that's one thing I've been interested in, which is in the type of architecture
1:54:20
that we were discussing before, where the output of the system
1:54:26
is a result of an inference to satisfy an objective. That objective can include guardrails.
1:54:32
And we can put guardrails in open source systems. I mean, if we eventually have systems
1:54:39
that are built with this blueprint, we can put guardrails in those systems that guarantee
1:54:44
that there is sort of a minimum set of guardrails that make the system non-dangerous and non-toxic, et cetera.
1:54:50
Basic things that everybody would agree on. And then the fine tuning that people will add
1:54:57
or the additional guardrails that people will add will kind of cater to their community, whatever it is.
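A toy sketch of the idea that the output of the system is the result of an inference to satisfy an objective, and that the objective can include guardrails: the costs, the random candidate generator, and the extra community-specific term below are hypothetical stand-ins, not an actual architecture:

```python
# Toy sketch of objective-driven inference with guardrail terms in the objective.
# All costs and the candidate generator are hypothetical and schematic.
import random

def task_cost(action_sequence, goal):
    """How far the predicted outcome of these actions is from the goal (toy version)."""
    return abs(sum(action_sequence) - goal)

def guardrail_cost(action_sequence, limit=5.0):
    """Heavily penalize sequences that violate a built-in constraint (toy version)."""
    return sum(1000.0 for a in action_sequence if abs(a) > limit)

def community_cost(action_sequence):
    """Slot for additional, fine-tuned guardrails layered on top of the basic ones."""
    return 0.0

def plan(goal, horizon=4, candidates=2000):
    """Inference = search for the action sequence minimizing the total objective."""
    best, best_cost = None, float("inf")
    for _ in range(candidates):
        seq = [random.uniform(-10, 10) for _ in range(horizon)]
        cost = task_cost(seq, goal) + guardrail_cost(seq) + community_cost(seq)
        if cost < best_cost:
            best, best_cost = seq, cost
    return best

print(plan(goal=12.0))   # actions are chosen to satisfy the goal *and* the guardrails
```

The basic guardrail term is baked into the objective every candidate plan is scored against, while the community_cost slot is where the additional, community-specific guardrails described here would go.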
1:55:04
- And yeah, the fine tuning would be more about the gray areas of what is hate speech, what is dangerous and all that kind of stuff.
1:55:11
I mean, you've- - [Yann] Or different value systems. - Different value systems. But still even with the objectives
1:55:16
of how to build a bio weapon, for example, I think something you've commented on, or at least there's a paper
1:55:23
where a collection of researchers is trying to understand the social impacts of these LLMs.
1:55:29
And I guess one threshold that's nice is like does the LLM make it any easier than a search would,
1:55:38
like a Google search would? - Right. So the increasing number of studies on this
1:55:44
seems to point to the fact that it doesn't help. So having an LLM doesn't help you
1:55:52
design or build a bio weapon or a chemical weapon
1:55:57
if you already have access to a search engine and a library. And so the sort of increased information you get
1:56:04
or the ease with which you get it doesn't really help you. That's the first thing. The second thing is,
1:56:10
it's one thing to have a list of instructions of how to make a chemical weapon, for example, a bio weapon.
1:56:17
It's another thing to actually build it. And it's much harder than you might think, and an LLM will not help you with that.
1:56:25
In fact, nobody in the world, not even like countries use bio weapons because most of the time they have no idea
1:56:31
how to protect their own populations against it. So it's too dangerous actually to kind of ever use.
1:56:39
And it's in fact banned by international treaties. Chemical weapons is different.
1:56:45
It's also banned by treaties, but it's the same problem.
1:56:50
It's difficult to use in a way that doesn't turn against the perpetrators.
1:56:56
But we could ask Elon Musk. Like I can give you a very precise list of instructions of how you build a rocket engine.
1:57:04
And even if you have a team of 50 engineers that are really experienced building it, you're still gonna have to blow up a dozen of them
1:57:10
before you get one that works. And it's the same with chemical weapons or bio weapons
1:57:18
or things like this. It requires expertise in the real world that the LLM is not gonna help you with.
1:57:25
- And it requires even the common sense expertise that we've been talking about, which is how to take language based instructions
1:57:34
and materialize them in the physical world requires a lot of knowledge that's not in the instructions.
1:57:41
- Yeah, exactly. A lot of biologists have posted on this actually in response to those things saying like do you realize how hard it is
1:57:47
to actually do the lab work? Like this is not trivial. - Yeah.
1:57:52
And that's where Hans Moravec comes to light once again. Just to linger on LLaMA.
Llama 3
1:57:59
Mark announced that LLaMA 3 is coming out eventually, I don't think there's a release date, but what are you most excited about?
1:58:06
First of all, LLaMA 2 that's already out there, and maybe the future LLaMA 3, 4, 5, 6, 10,
1:58:12
just the future of the open source under Meta? - Well, a number of things.
1:58:18
So there's gonna be like various versions of LLaMA that are improvements of previous LLaMAs.
1:58:26
Bigger, better, multimodal, things like that. And then in future generations,
1:58:32
systems that are capable of planning, that really understand how the world works, maybe are trained from video so they have some world model.
1:58:39
Maybe capable of the type of reasoning and planning I was talking about earlier. Like how long is that gonna take?
1:58:45
Like when is the research that is going in that direction going to sort of feed into the product line, if you want,
1:58:52
of LLaMA? I don't know, I can't tell you. And there's a few breakthroughs that we have to basically go through
1:58:59
before we can get there. But you'll be able to monitor our progress because we publish our research, right?
1:59:07
So last week we published the V-JEPA work, which is sort of a first step
1:59:13
towards training systems from video. And then the next step is gonna be world models
1:59:18
based on kind of this type of idea, training from video. There's similar work at DeepMind also taking place,
1:59:28
and also at UC Berkeley on world models and video.
1:59:33
A lot of people are working on this. I think a lot of good ideas are appearing. My bet is that those systems are gonna be JEPA-like,
1:59:41
they're not gonna be generative models. And we'll see what the future will tell.
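For a schematic picture of what "JEPA-like, not generative" means here, a toy sketch: predict the representation of the hidden part of a clip from the representation of the visible context, with the loss computed in representation space rather than pixel space. The network shapes, tensors, and training details below are invented stand-ins, not the published V-JEPA recipe:

```python
# Schematic joint-embedding predictive sketch: predict target *representations*,
# not pixels. Shapes and architectures are toy stand-ins.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, dim_in=3 * 16 * 16, dim_out=128):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(dim_in, 256), nn.ReLU(),
                                 nn.Linear(256, dim_out))
    def forward(self, x):
        return self.net(x)

context_encoder = Encoder()
target_encoder = Encoder()           # often a momentum/EMA copy in practice
predictor = nn.Linear(128, 128)      # predicts target embedding from context embedding

context_patches = torch.randn(8, 3, 16, 16)   # visible part of the clip (toy tensors)
target_patches = torch.randn(8, 3, 16, 16)    # masked / future part of the clip

pred = predictor(context_encoder(context_patches))
with torch.no_grad():                          # no gradient through the target branch
    target = target_encoder(target_patches)

loss = ((pred - target) ** 2).mean()           # loss lives in representation space
loss.backward()
print(float(loss))
```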
1:59:49
There's really good work at... A gentleman called Danijar Hafner, who is now at DeepMind,
1:59:56
who's worked on kind of models of this type that learn representations and then use them for planning or learning tasks
2:00:02
by reinforcement learning. And a lot of work at Berkeley by Pieter Abbeel, Sergey Levine,
2:00:11
a bunch of other people of that type that I'm collaborating with, actually, in the context of some grants with my NYU hat.
2:00:20
And then collaborations also through Meta, 'cause the lab at Berkeley is associated with Meta in some way, with FAIR.
2:00:28
So I think it's very exciting. I think I'm super excited about...
2:00:34
I haven't been that excited about like the direction of machine learning and AI since 10 years ago when FAIR was started,
2:00:41
and before that, 30 years ago, when we were working on, sorry 35, on convolutional nets and the early days of neural nets.
2:00:51
So I'm super excited because I see a path towards
2:00:57
potentially human level intelligence with systems that can understand the world,
2:01:04
remember, plan, reason. There is some set of ideas to make progress there
2:01:09
that might have a chance of working. And I'm really excited about this. What I like is that
2:01:18
somehow we get onto like a good direction and perhaps succeed before my brain turns to a white sauce
2:01:24
or before I need to retire. (laughs) - Yeah. Yeah.
2:01:30
Are you also excited by... Is it beautiful to you just the amount of GPUs involved,
2:01:38
sort of the whole training process on this much compute? Just zooming out,
2:01:43
just looking at earth and humans together have built these computing devices
2:01:49
and are able to train this one brain that we then open source.
2:01:56
(laughs) Like giving birth to this open source brain trained on this gigantic compute system.
2:02:04
There's just the details of how to train on that, how to build the infrastructure and the hardware,
2:02:10
the cooling, all of this kind of stuff. Or is most of your excitement still in the theory aspect of it?
2:02:16
Meaning like the software. - Well, I used to be a hardware guy many years ago. (laughs) - Yes, yes, that's right.
2:02:22
- Decades ago. - Hardware has improved a little bit. Changed a little bit, yeah.
2:02:27
- I mean, certainly scale is necessary but not sufficient. - [Lex] Absolutely.
2:02:33
- So we certainly need computation. I mean, we're still far in terms of compute power from what we would need
2:02:39
to match the compute power of the human brain. This may occur in the next couple decades,
2:02:45
but we're still some ways away. And certainly in terms of power efficiency, we're really far.
2:02:51
So a lot of progress to make in hardware. And right now a lot of the progress is not...
2:03:00
I mean, there's a bit coming from Silicon technology, but a lot of it coming from architectural innovation
2:03:06
and quite a bit coming from like more efficient ways of implementing the architectures that have become popular.
2:03:13
Basically a combination of transformers and ConvNets, right? And so there's still some ways to go
2:03:22
until we are going to saturate. We're gonna have to come up
2:03:28
with like new principles, new fabrication technology, new basic components,
2:03:35
perhaps based on sort of different principles than those classical digital CMOS.
2:03:41
- Interesting. So you think in order to build AmI, ami,
2:03:48
we potentially might need some hardware innovation too? - Well, if we wanna make it ubiquitous,
2:03:55
yeah, certainly. Because we're gonna have to reduce the power consumption.
2:04:02
A GPU today, right? Is half a kilowatt to a kilowatt. Human brain is about 25 watts.
2:04:09
And the GPU is way below the compute power of the human brain. You need something like a hundred thousand or a million to match it.
2:04:16
So we are off by a huge factor.
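Putting the numbers just quoted side by side (GPU draw, brain power, and the quoted count of GPUs needed to match the brain's compute; the 700 W midpoint is an assumption):

```python
# Rough power comparison using the figures quoted in the conversation.
gpu_power_w = 700                 # assumed midpoint of the quoted 0.5-1 kW per GPU
brain_power_w = 25                # quoted estimate for the human brain
gpus_to_match_brain = 100_000     # low end of the quoted 1e5-1e6 range

cluster_power_w = gpu_power_w * gpus_to_match_brain
print(f"Cluster power: ~{cluster_power_w / 1e6:.0f} MW")            # ~70 MW
print(f"Efficiency gap: ~{cluster_power_w / brain_power_w:,.0f}x")  # millions of times
```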
AGI
2:04:21
- You often say that AGI is not coming soon. Meaning like not this year, not the next few years,
2:04:30
potentially farther away. What's your basic intuition behind that?
2:04:35
- So first of all, it's not to be an event, right? The idea somehow which is popularized by science fiction in Hollywood
2:04:42
that somehow somebody is gonna discover the secret, the secret to AGI or human level AI or AmI,
2:04:50
whatever you wanna call it, and then turn on a machine and then we have AGI. That's just not going to happen.
2:04:57
It's not going to be an event. It's gonna be gradual progress.
2:05:03
Are we gonna have systems that can learn from video how the world works and learn good representations?
2:05:09
Yeah. Before we get them to the scale and performance that we observe in humans, it's gonna take quite a while.
2:05:15
It's not gonna happen in one day. Are we gonna get systems that can have large amounts of associative memory
2:05:24
so they can remember stuff? Yeah. But same, it's not gonna happen tomorrow. I mean, there is some basic techniques
2:05:30
that need to be developed. We have a lot of them, but like to get this to work together with a full system
2:05:36
is another story. Are we gonna have systems that can reason and plan, perhaps along the lines of objective driven AI architectures
2:05:43
that I described before? Yeah, but like before we get this to work properly, it's gonna take a while.
2:05:49
And before we get all those things to work together. And then on top of this, have systems that can learn like hierarchical planning,
2:05:55
hierarchical representations, systems that can be configured for a lot of different situations at hand
2:06:00
the way the human brain can. All of this is gonna take at least a decade,
2:06:07
probably much more, because there are a lot of problems that we're not seeing right now
2:06:12
that we have not encountered. And so we don't know if there is an easy solution within this framework.
2:06:21
It's not just around the corner. I mean, I've been hearing people for the last 12, 15 years
2:06:27
claiming that AGI is just around the corner and being systematically wrong.
2:06:32
And I knew they were wrong when they were saying it. I called it bullshit. (laughs) - Why do you think people have been calling...
2:06:38
First of all, I mean, from the beginning of, from the birth of the term artificial intelligence, there has been an eternal optimism
2:06:46
that's perhaps unlike other technologies. Is it Moravec's paradox?
2:06:51
Is it the explanation for why people are so optimistic about AGI?
2:06:56
- I don't think it's just Moravec's paradox. Moravec's paradox is a consequence of realizing that the world is not as easy as we think.
2:07:03
So first of all, intelligence is not a linear thing that you can measure with a scalar,
2:07:10
with a single number. Can you say that humans are smarter than orangutans?
2:07:18
In some ways, yes, but in some ways orangutans are smarter than humans in a lot of domains
2:07:23
that allows them to survive in the forest, (laughing) for example. - So IQ is a very limited measure of intelligence.
2:07:30
True intelligence is bigger than what IQ, for example, measures. - Well, IQ can measure approximately something for humans,
2:07:39
but because humans kind of come in relatively kind of uniform form, right?
2:07:48
- [Lex] Yeah. - But it only measures one type of ability that may be relevant for some tasks, but not others.
2:07:58
But then if you are talking about other intelligent entities for which the basic things that are easy to them
2:08:07
is very different, then it doesn't mean anything. So intelligence is a collection of skills
2:08:18
and an ability to acquire new skills efficiently. Right?
2:08:23
And the collection of skills that a particular intelligent entity possess
2:08:29
or is capable of learning quickly is different from the collection of skills of another one.
2:08:35
And because it's a multidimensional thing, the set of skills is a high dimensional space, you can't measure.
2:08:40
You cannot compare two things as to whether one is more intelligent than the other. It's multidimensional.
AI doomers
2:08:48
- So you push back against what are called AI doomers a lot.
2:08:55
Can you explain their perspective and why you think they're wrong? - Okay.
2:09:00
So AI doomers imagine all kinds of catastrophe scenarios of how AI could escape our control
2:09:07
and basically kill us all. (laughs) And that relies on a whole bunch of assumptions
2:09:14
that are mostly false. So the first assumption is that the emergence of super intelligence
2:09:20
could be an event. That at some point we're going to figure out the secret and we'll turn on a machine that is super intelligent.
2:09:28
And because we'd never done it before, it's gonna take over the world and kill us all. That is false.
2:09:33
It's not gonna be an event. We're gonna have systems that are like as smart as a cat,
2:09:39
have all the characteristics of human level intelligence,
2:09:44
but their level of intelligence would be like a cat or a parrot maybe or something.
2:09:50
And then we're gonna walk our way up to kind of make those things more intelligent. And as we make them more intelligent,
2:09:56
we're also gonna put some guardrails in them and learn how to kind of put some guardrails so they behave properly. And we're not gonna do this with just one...
2:10:03
It's not gonna be one effort, but it's gonna be lots of different people doing this. And some of them are gonna succeed
2:10:09
at making intelligent systems that are controllable and safe and have the right guardrails.
2:10:14
And if some other goes rogue, then we can use the good ones to go against the rogue ones. (laughs)
2:10:20
So it's gonna be smart AI police against your rogue AI. So it's not gonna be like we're gonna be exposed
2:10:27
to like a single rogue AI that's gonna kill us all. That's just not happening. Now, there is another fallacy,
2:10:33
which is the fact that because the system is intelligent, it necessarily wants to take over.
2:10:40
And there is several arguments that make people scared of this, which I think are completely false as well.
2:10:48
So one of them is in nature,
2:10:53
it seems to be that the more intelligent species are the ones that end up dominating the other. And even extinguishing the others
2:11:03
sometimes by design, sometimes just by mistake.
2:11:09
And so there is sort of a thinking by which you say, well, if AI systems
2:11:15
are more intelligent than us, surely they're going to eliminate us, if not by design,
2:11:21
simply because they don't care about us. And that's just preposterous for a number of reasons.
2:11:27
First reason is they're not going to be a species. They're not gonna be a species that competes with us.
2:11:33
They're not gonna have the desire to dominate because the desire to dominate is something that has to be hardwired
2:11:38
into an intelligent system. It is hardwired in humans,
2:11:44
it is hardwired in baboons, in chimpanzees, in wolves, not in orangutans.
2:11:51
This desire to dominate or submit
2:11:56
or attain status in other ways is specific to social species.
2:12:03
Non-social species like orangutans don't have it. Right? And they are as smart as we are, almost.
2:12:09
Right? - And to you, there's not significant incentive for humans to encode that into the AI systems.
2:12:15
And to the degree they do, there'll be other AIs that sort of punish them for it.
2:12:22
Out-compete them over- - Well, there's all kinds of incentive to make AI systems submissive to humans. Right? - [Lex] Right.
2:12:27
- I mean, this is the way we're gonna build them, right? And so then people say, oh, but look at LLMs.
2:12:32
LLMs are not controllable. And they're right, LLMs are not controllable. But objective driven AI,
2:12:37
so systems that derive their answers by optimization of an objective
2:12:43
means they have to optimize this objective, and that objective can include guardrails. One guardrail is obey humans.
2:12:52
Another guardrail is don't obey humans if it's hurting other humans- - I've heard that before somewhere, I don't remember-
2:12:59
- [Yann] Yes. (Lex laughs) Maybe in a book. (laughs) - Yeah. But speaking of that book, could there be unintended consequences also
2:13:08
from all of this? - No, of course. So this is not a simple problem, right? I mean designing those guardrails
2:13:14
so that the system behaves properly is not gonna be a simple issue
2:13:20
for which there is a silver bullet, for which you have a mathematical proof that the system can be safe. It's gonna be very progressive,
2:13:27
iterative design system where we put those guardrails in such a way that the system behaves properly.
2:13:32
And sometimes they're going to do something that was unexpected because the guardrail wasn't right,
2:13:38
and we're gonna correct them so that they do it right. The idea somehow that we can't get it slightly wrong,
2:13:44
because if we get it slightly wrong we all die, is ridiculous.
2:13:49
We're just gonna go progressively. The analogy I've used many times is turbojet design.
2:14:00
How did we figure out how to make turbojets so unbelievably reliable, right?
2:14:06
I mean, those are like incredibly complex pieces of hardware that run at really high temperatures
2:14:12
for 20 hours at a time sometimes. And we can fly halfway around the world
2:14:20
on a two engine jet liner at near the speed of sound.
2:14:27
Like how incredible is this? It is just unbelievable. And did we do this
2:14:33
because we invented like a general principle of how to make turbojets safe? No, it took decades
2:14:39
to kind of fine tune the design of those systems so that they were safe. Is there a separate group
2:14:46
within General Electric or Snecma or whatever that is specialized in turbojet safety?
2:14:54
No. The design is all about safety. Because a better turbojet is also a safer turbojet,
2:15:01
a more reliable one. It's the same for AI. Like do you need specific provisions to make AI safe?
2:15:08
No, you need to make better AI systems and they will be safe because they are designed to be more useful
2:15:14
and more controllable. - So let's imagine a system, AI system that's able to be incredibly convincing
2:15:23
and can convince you of anything. I can at least imagine such a system.
2:15:29
And I can see such a system be weapon-like, because it can control people's minds,
2:15:35
we're pretty gullible. We want to believe a thing. And you can have an AI system that controls it
2:15:40
and you could see governments using that as a weapon. So do you think if you imagine such a system,
2:15:48
there's any parallel to something like nuclear weapons? - [Yann] No.
2:15:54
- So why is that technology different? So you're saying there's going to be gradual development?
2:16:01
- [Yann] Yeah. - I mean it might be rapid, but they'll be iterative. And then we'll be able to kind of respond and so on.
2:16:09
- So that AI system designed by Vladimir Putin or whatever, or his minions (laughing)
2:16:16
is gonna be like trying to talk to every American
2:16:21
to convince them to vote for- - [Lex] Whoever. - Whoever pleases Putin or whatever.
2:16:31
Or rile people up against each other
2:16:36
as they've been trying to do. They're not gonna be talking to you, they're gonna be talking to your AI assistant
2:16:44
which is going to be as smart as theirs, right? Because as I said, in the future,
2:16:51
every single one of your interaction with the digital world will be mediated by your AI assistant. So the first thing you're gonna ask is, is this a scam?
2:16:58
Like is this thing like telling me the truth? Like it's not even going to be able to get to you because it's only going to talk to your AI assistant,
2:17:05
and your AI is not even going to... It's gonna be like a spam filter, right?
2:17:10
You're not even seeing the email, the spam email, right? It's automatically put in a folder that you never see.
2:17:16
It's gonna be the same thing. That AI system that tries to convince you of something, it's gonna be talking to an AI system
2:17:22
which is gonna be at least as smart as it. And is gonna say, this is spam. (laughs)
2:17:29
It's not even going to bring it to your attention. - So to you it's very difficult for any one AI system
2:17:34
to take such a big leap ahead to where it can convince even the other AI systems?
2:17:40
So like there's always going to be this kind of race where nobody's way ahead?
2:17:46
- That's the history of the world. History of the world is whenever there is a progress someplace,
2:17:51
there is a countermeasure. And it's a cat and mouse game.
2:17:57
- Mostly yes, but this is why nuclear weapons are so interesting because that was such a powerful weapon
2:18:05
that it mattered who got it first. That you could imagine Hitler, Stalin, Mao
2:18:16
getting the weapon first and that having a different kind of impact on the world
2:18:21
than the United States getting the weapon first. To you, nuclear weapons is like...
2:18:27
You don't imagine a breakthrough discovery and then Manhattan project like effort for AI?
2:18:35
- No. As I said, it's not going to be an event. It's gonna be continuous progress.
2:18:41
And whenever one breakthrough occurs, it's gonna be widely disseminated really quickly.
2:18:48
Probably first within industry. I mean, this is not a domain where government or military organizations
2:18:55
are particularly innovative, and they're in fact way behind. And so this is gonna come from industry.
2:19:02
And this kind of information disseminates extremely quickly. We've seen this over the last few years, right?
2:19:08
Where you have a new... Like even take AlphaGo. This was reproduced within three months
2:19:14
even without like particularly detailed information, right? - Yeah. This is an industry that's not good at secrecy.
2:19:20
(laughs) - But even if there is, just the fact that you know that something is possible
2:19:26
makes you like realize that it's worth investing the time to actually do it. You may be the second person to do it but you'll do it.
2:19:36
Same for all the innovations of self-supervised learning, transformers,
2:19:43
decoder only architectures, LLMs. I mean those things, you don't need to know exactly the details of how they work
2:19:49
to know that it's possible because it's deployed and then it's getting reproduced. And then people who work for those companies move.
2:20:00
They go from one company to another. And the information disseminates. What makes the success of the US tech industry
2:20:09
and Silicon Valley in particular, is exactly that, is because information circulates really, really quickly and disseminates very quickly.
2:20:17
And so the whole region sort of is ahead because of that circulation of information.
2:20:24
- Maybe just to linger on the psychology of AI doomers. You give in the classic Yann LeCun way,
2:20:31
a pretty good example of just when a new technology comes to be, you say engineer says,
2:20:38
"I invented this new thing, I call it a ballpen."
2:20:44
And then the TwitterSphere responds, "OMG people could write horrible things with it like misinformation, propaganda, hate speech.
2:20:51
Ban it now!" Then writing doomers come in, akin to the AI doomers,
2:20:57
"imagine if everyone can get a ballpen. This could destroy society. There should be a law
2:21:03
against using ballpen to write hate speech, regulate ballpens now." And then the pencil industry mogul says,
2:21:09
"yeah, ballpens are very dangerous, unlike pencil writing which is erasable,
2:21:15
ballpen writing stays forever. Government should require a license for a pen manufacturer."
2:21:22
I mean, this does seem to be part of human psychology
2:21:28
when it comes up against new technology.
2:21:33
What deep insights can you speak to about this? - Well, there is a natural fear of new technology
2:21:43
and the impact it can have on society. And people have kind of instinctive reaction
2:21:48
to the world they know being threatened by major transformations
2:21:55
that are either cultural phenomena or technological revolutions.
2:22:02
And they fear for their culture, they fear for their job, they fear for the future of their children
2:22:10
and their way of life, right? So any change is feared.
2:22:17
And you see this along history, like any technological revolution or cultural phenomenon
2:22:24
was always accompanied by groups or reaction in the media
2:22:31
that basically attributed all the problems,
2:22:36
the current problems of society to that particular change, right? Electricity was going to kill everyone at some point.
2:22:45
The train was going to be a horrible thing because you can't breathe past 50 kilometers an hour.
2:22:52
And so there's a wonderful website called the Pessimists Archive, right? Which has all those newspaper clips (laughing)
2:22:59
of all the horrible things people imagined would arrive because of either technological innovation
2:23:06
or a cultural phenomenon.
2:23:13
Wonderful examples of jazz or comic books
2:23:18
being blamed for unemployment or young people not wanting to work anymore
2:23:25
and things like that, right? And that has existed for centuries.
2:23:34
And it's knee jerk reactions. The question is do we embrace change or do we resist it?
2:23:45
And what are the real dangers as opposed to the imagined ones?
2:23:51
- So people worry about... I think one thing they worry about with big tech, something we've been talking about over and over
2:23:58
but I think worth mentioning again, they worry about how powerful AI will be
2:24:05
and they worry about it being in the hands of one centralized power of just a handful of central control.
2:24:13
And so that's the skepticism with big tech. These companies can make a huge amount of money
2:24:18
and control this technology. And by so doing,
2:24:24
take advantage, abuse the little guy in society. - Well, that's exactly why we need open source platforms.
2:24:31
- Yeah. I just wanted to... (laughs) Nail the point home more and more. - [Yann] Yes.
Joscha Bach
2:24:38
- So let me ask you on your... Like I said, you do get a little bit
2:24:44
flavorful on the internet. Joscha Bach tweeted something that you LOL'd at
2:24:50
in reference to HAL 9000. Quote, "I appreciate your argument and I fully understand your frustration,
2:24:57
but whether the pod bay doors should be opened or closed is a complex and nuanced issue."
2:25:03
So you're at the head of Meta AI. This is something that really worries me,
2:25:12
that our AI overlords will speak down to us with corporate speak of this nature
2:25:20
and you sort of resist that with your way of being. Is this something you can just comment on
2:25:27
sort of working at a big company, how you can avoid the over fearing, I suppose,
2:25:37
that through caution creates harm? - Yeah. Again, I think the answer to this is open source platforms
2:25:45
and then enabling a widely diverse set of people to build AI assistants
2:25:53
that represent the diversity of cultures, opinions, languages, and value systems across the world.
2:25:59
So that you're not bound to just be brainwashed
2:26:04
by a particular way of thinking because of a single AI entity.
2:26:10
So I mean, I think it's a really, really important question for society. And the problem I'm seeing,
2:26:20
which is why I've been so vocal and sometimes a little sardonic about it- - Never stop.
2:26:26
Never stop, Yann. (both laugh) We love it. - Is because I see the danger
2:26:31
of this concentration of power through proprietary AI systems as a much bigger danger than everything else.
2:26:39
That if we really want diversity of opinion AI systems
2:26:46
that in the future that we'll all be interacting through AI systems,
2:26:52
we need those to be diverse for the preservation of a diversity of ideas
2:26:58
and creeds and political opinions and whatever,
2:27:06
and the preservation of democracy. And what works against this
2:27:12
is people who think that for reasons of security, we should keep AI systems under lock and key
2:27:19
because it's too dangerous to put it in the hands of everybody because it could be used by terrorists or something.
2:27:28
That would lead to potentially a very bad future
2:27:36
in which all of our information diet is controlled by a small number of companies through proprietary systems.
2:27:44
- So you trust humans with this technology to build systems that are on the whole good for humanity?
2:27:53
- Isn't that what democracy and free speech is all about? - I think so. - Do you trust institutions to do the right thing?
2:28:00
Do you trust people to do the right thing? And yeah, there's bad people who are gonna do bad things,
2:28:05
but they're not going to have superior technology to the good people. So then it's gonna be my good AI against your bad AI, right?
2:28:12
I mean it's the examples that we were just talking about of maybe some rogue country will build some AI system
2:28:21
that's gonna try to convince everybody to go into a civil war or something
2:28:27
or elect a favorable ruler.
2:28:32
But then they will have to go past our AI systems, right? (laughs) - An AI system with a strong Russian accent
2:28:38
will be trying to convince our- - And doesn't put any articles in their sentences. (both laugh)
2:28:45
- Well, it'll be at the very least, absurdly comedic. Okay. So since we talked about sort of the physical reality,
Humanoid robots
2:28:55
I'd love to ask your vision of the future with robots in this physical reality.
2:29:00
So many of the kinds of intelligence you've been speaking about would empower robots
2:29:06
to be more effective collaborators with us humans. So since Tesla's Optimus team
2:29:14
has been showing us some progress in humanoid robots, I think it really reinvigorated the whole industry
2:29:20
that I think Boston Dynamics has been leading for a very, very long time. So now there's all kinds of companies,
2:29:25
Figure AI, obviously Boston Dynamics- - [Yann] Unitree. - Unitree.
2:29:31
But there's like a lot of them. It's great. It's great. I mean I love it.
2:29:37
So do you think there'll be millions of humanoid robots walking around soon?
2:29:44
- Not soon, but it's gonna happen. Like the next decade I think is gonna be really interesting in robots.
2:29:49
Like the emergence of the robotics industry has been in the waiting for 10, 20 years,
2:29:57
without really emerging other than for like kind of pre-programmed behavior and stuff like that.
2:30:03
And the main issue is again, the Moravec's paradox.
2:30:08
Like how do we get the systems to understand how the world works and kind of plan actions? And so we can do it for really specialized tasks.
2:30:17
And the way Boston Dynamics goes about it is basically with a lot of handcrafted dynamical models
2:30:25
and careful planning in advance, which is very classical robotics with a lot of innovation,
2:30:32
a little bit of perception, but it's still not... Like they can't build a domestic robot, right?
2:30:40
And we're still some distance away from completely autonomous level five driving.
2:30:47
And we're certainly very far away from having level five autonomous driving
2:30:53
by a system that can train itself by driving 20 hours, like any 17-year-old.
2:31:00
So until we have, again, world models,
2:31:08
systems that can train themselves to understand how the world works, we're not gonna have significant progress in robotics.
2:31:16
So a lot of the people working on robotic hardware at the moment are betting or banking
2:31:23
on the fact that AI is gonna make sufficient progress towards that. - And they're hoping to discover a product in it too-
2:31:31
- [Yann] Yeah. - Before you have a really strong world model, there'll be an almost strong world model.
2:31:38
And people are trying to find a product in a clumsy robot, I suppose.
2:31:43
Like not a perfectly efficient robot. So there's the factory setting where humanoid robots can help automate some aspects of the factory.
2:31:51
I think that's a crazy difficult task 'cause of all the safety required and all this kind of stuff, I think in the home is more interesting.
2:31:57
But then you start to think... I think you mentioned loading the dishwasher, right?
2:32:03
- [Yann] Yeah. - Like I suppose that's one of the main problems you're working on. - I mean there's cleaning up. (laughs)
2:32:10
- [Lex] Yeah. - Cleaning the house, clearing up the table after a meal,
2:32:17
washing the dishes, all those tasks, cooking. I mean all the tasks that in principle could be automated
2:32:24
but are actually incredibly sophisticated, really complicated. - But even just basic navigation
2:32:29
around a space full of uncertainty. - That sort of works. Like you can sort of do this now.
2:32:35
Navigation is fine. - Well, navigation in a way that's compelling to us humans
2:32:40
is a different thing. - Yeah. It's not gonna be necessarily... I mean we have demos actually
2:32:46
'cause there is a so-called embodied AI group at FAIR
2:32:52
and they've been not building their own robots but using commercial robots.
2:32:57
And you can tell the robot dog like go to the fridge and they can actually open the fridge
2:33:03
and they can probably pick up a can in the fridge and stuff like that and bring it to you.
2:33:08
So it can navigate, it can grab objects as long as it's been trained to recognize them,
2:33:14
which vision systems work pretty well nowadays. But it's not like a completely general robot
2:33:23
that would be sophisticated enough to do things like clearing up the dinner table.
2:33:28
(laughs) - Yeah, to me that's an exciting future of getting humanoid robots.
2:33:35
Robots in general in the home more and more because it gets humans to really directly interact with AI systems
2:33:40
in the physical space. And in so doing it allows us to philosophically, psychologically explore
2:33:46
our relationships with robots. It can be really, really interesting. So I hope you make progress on the whole JEPA thing soon.
2:33:54
(laughs) - Well, I mean, I hope things can work as planned.
2:34:00
I mean, again, we've been like kinda working on this idea of self supervised learning from video for 10 years.
2:34:07
And only made significant progress in the last two or three. - And actually you've mentioned
2:34:13
that there's a lot of interesting breakthroughs that can happen without having access to a lot of compute. So if you're interested in doing a PhD
2:34:20
in this kind of stuff, there's a lot of possibilities still to do innovative work.
2:34:25
So like what advice would you give to a undergrad that's looking to go to grad school and do a PhD?
2:34:32
- So basically, I've listed them already. This idea of how do you train a world model by observation?
2:34:39
And you don't have to train necessarily on gigantic data sets.
2:34:45
I mean, it could turn out to be necessary to actually train on large data sets to have emergent properties like we have with LLMs.
2:34:51
But I think there is a lot of good ideas that can be done without necessarily scaling up. Then there is how do you do planning
2:34:58
with a learn world model? If the world the system evolves in is not the physical world,
2:35:03
but is the world of let's say the internet or some sort of world
2:35:09
where an action consists in doing a search in a search engine or interrogating a database,
2:35:14
or running a simulation or calling a calculator or solving a differential equation,
2:35:22
how do you get a system to actually plan a sequence of actions to give the solution to a problem?
2:35:28
And so the question of planning is not just a question of planning physical actions,
2:35:35
it could be planning actions to use tools for a dialogue system or for any kind of intelligence system.
2:35:43
And there's some work on this but not a huge amount. Some work at FAIR,
2:35:48
one called Toolformer, which was a couple years ago and some more recent work on planning,
2:35:55
but I don't think we have like a good solution for any of that.
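As a toy illustration of planning over non-physical actions, the sketch below searches over short sequences of tool calls (a fake lookup and a fake calculator) until one produces an answer that passes a check; the tools, query, and check are invented placeholders, not Toolformer or any other published system:

```python
# Toy sketch of planning a sequence of tool calls to solve a problem.
# The tools and the success check are stand-ins.
import itertools

def lookup(text):
    facts = {"speed of light in km/s": "299792"}      # toy "database" action
    return facts.get(text, text)

def calculator(text):
    try:
        return str(eval(text, {"__builtins__": {}}))  # toy "calculator" action
    except Exception:
        return text                                   # leave state unchanged on failure

TOOLS = {"lookup": lookup, "calc": calculator}

def plan_and_run(query, is_answer, max_steps=3):
    """Return the first tool sequence whose final state satisfies is_answer."""
    for length in range(1, max_steps + 1):
        for plan in itertools.product(TOOLS, repeat=length):
            state = query
            for name in plan:
                state = TOOLS[name](state)
            if is_answer(state):
                return plan, state
    return None, None

plan, answer = plan_and_run("speed of light in km/s", lambda s: s.isdigit())
print(plan, answer)   # e.g. ('lookup',) 299792
```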
2:36:00
Then there is the question of hierarchical planning. So the example I mentioned
2:36:05
of planning a trip from New York to Paris, that's hierarchical,
2:36:11
but almost every action that we take involves hierarchical planning in some sense.
2:36:17
And we really have absolutely no idea how to do this. Like there's zero demonstration
2:36:22
of hierarchical planning in AI,
2:36:28
where the various levels of representations that are necessary have been learned.
2:36:36
We can do like two-level hierarchical planning when we design the two levels. So for example, you have like a dog-like legged robot, right?
2:36:44
You want it to go from the living room to the kitchen. You can plan a path that avoids the obstacle.
2:36:51
And then you can send this to a lower level planner that figures out how to move the legs
2:36:56
to kind of follow that trajectory, right? So that works, but that two-level planning is designed by hand, right?
2:37:05
We specify what the proper levels of abstraction, the representation at each level of abstraction have to be.
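A toy version of that hand-designed two-level setup: a high-level planner that produces a waypoint path on a grid, and a separate low-level controller stub that tracks each segment. Both levels and the interface between them are fixed by hand, which is exactly the limitation being pointed out:

```python
# Toy hand-designed two-level planner: high-level waypoints + low-level tracking.
from collections import deque

def high_level_plan(grid, start, goal):
    """BFS over room coordinates: returns a waypoint path avoiding obstacles ('#')."""
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        r, c = path[-1]
        if (r, c) == goal:
            return path
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if (0 <= nr < len(grid) and 0 <= nc < len(grid[0])
                    and grid[nr][nc] != "#" and (nr, nc) not in seen):
                seen.add((nr, nc))
                queue.append(path + [(nr, nc)])
    return None

def low_level_control(waypoint_from, waypoint_to):
    """Stand-in for the leg/motor controller tracking one waypoint-to-waypoint segment."""
    dr, dc = waypoint_to[0] - waypoint_from[0], waypoint_to[1] - waypoint_from[1]
    return f"step dr={dr}, dc={dc}"

grid = ["....", ".##.", "....", ".#.."]
path = high_level_plan(grid, start=(0, 0), goal=(3, 3))
commands = [low_level_control(a, b) for a, b in zip(path, path[1:])]
print(path)
print(commands)
```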
2:37:13
How do you learn this? How do you learn that hierarchical representation of action plans, right?
2:37:20
With ConvNets and deep learning, we can train the system to learn hierarchical representations of percepts.
2:37:27
What is the equivalent when what you're trying to represent are action plans? - For action plans. Yeah.
2:37:32
So you want basically a robot dog or humanoid robot that turns on and travels from New York to Paris
2:37:38
all by itself. - [Yann] For example. - All right. It might have some trouble at the TSA but-
2:37:47
- No, but even doing something fairly simple like a household task. - [Lex] Sure. - Like cooking or something.
2:37:53
- Yeah. There's a lot involved. It's a super complex task. Once again, we take it for granted.
Hope for the future
2:38:00
What hope do you have for the future of humanity?
2:38:05
We're talking about so many exciting technologies, so many exciting possibilities. What gives you hope when you look out
2:38:12
over the next 10, 20, 50, 100 years? If you look at social media, there's wars going on, there's division, there's hatred,
2:38:21
all this kind of stuff that's also part of humanity. But amidst all that, what gives you hope?
2:38:29
- I love that question.
2:38:34
We can make humanity smarter with AI. Okay?
2:38:40
I mean AI basically will amplify human intelligence. It's as if every one of us
2:38:47
will have a staff of smart AI assistants. They might be smarter than us.
2:38:54
They'll do our bidding, perhaps execute a task
2:39:01
in ways that are much better than we could do ourselves because they'd be smarter than us.
2:39:07
And so it's like everyone would be the boss of a staff of super smart virtual people.
2:39:15
So we shouldn't feel threatened by this any more than we should feel threatened by being the manager of a group of people,
2:39:22
some of whom are more intelligent than us. I certainly have a lot of experience with this.
2:39:29
(laughs) Of having people working with me who are smarter than me.
2:39:34
That's actually a wonderful thing. So having machines that are smarter than us,
2:39:40
that assist us in all of our tasks, our daily lives, whether it's professional or personal,
2:39:45
I think would be an absolutely wonderful thing. Because intelligence is the commodity
2:39:52
that is most in demand. I mean, all the mistakes that humanity makes is because of lack of intelligence, really,
2:39:59
or lack of knowledge, which is related. So making people smarter can only be better.
2:40:07
I mean, for the same reason that public education is a good thing
2:40:12
and books are a good thing, and the internet is also a good thing, intrinsically. And even social networks are a good thing
2:40:19
if you run them properly. (laughs) It's difficult, but you can.
2:40:25
Because it helps the communication
2:40:31
of information and knowledge and the transmission of knowledge. So AI is gonna make humanity smarter.
2:40:37
And the analogy I've been using is the fact that perhaps an equivalent event
2:40:44
in the history of humanity to what might be provided by the generalization of AI assistants
2:40:53
is the invention of the printing press. It made everybody smarter. The fact that people could have access to books.
2:41:03
Books were a lot cheaper than they were before. And so a lot more people had an incentive to learn to read,
2:41:10
which wasn't the case before. And people became smarter.
2:41:17
It enabled the enlightenment, right? There wouldn't be an enlightenment without the printing press.
2:41:24
It enabled philosophy, rationalism,
2:41:30
escape from religious doctrine, democracy, science.
2:41:41
And certainly without this there wouldn't have been the American Revolution or the French Revolution.
2:41:47
And so we'll still be under feudal regimes perhaps.
2:41:53
And so it completely transformed the world because people became smarter
2:41:59
and kinda learned about things. Now, it also created 200 years
2:42:05
of essentially religious conflicts in Europe, right? Because the first thing that people read was the Bible
2:42:12
and realized that perhaps there was a different interpretation of the Bible than what the priests were telling them.
2:42:19
And so that created the Protestant movement and created a rift. And in fact, the Catholic church
2:42:25
didn't like the idea of the printing press but they had no choice. And so it had some bad effects and some good effects.
2:42:32
I don't think anyone today would say that the invention of the printing press had an overall negative effect
2:42:38
despite the fact that it created 200 years of religious conflicts in Europe.
2:42:44
Now compare this, and I was very proud of myself to come up with this analogy,
2:42:51
but realized someone else came with the same idea before me. Compare this with what happened in the Ottoman Empire.
2:42:58
The Ottoman Empire banned the printing press for 200 years.
2:43:07
And it didn't ban it for all languages, only for Arabic. You could actually print books
2:43:13
in Latin or Hebrew or whatever in the Ottoman Empire, just not in Arabic.
2:43:20
And I thought it was because the rulers just wanted to preserve
2:43:27
the control over the population and the dogma, religious dogma and everything.
2:43:33
But after talking with the UAE Minister of AI,
2:43:38
Omar Al Olama, he told me no, there was another reason.
2:43:45
And the other reason was that it was to preserve the corporation of calligraphers, right?
2:43:53
There's like an art form which is writing those beautiful Arabic poems
2:44:02
or whatever religious text in this thing. And it was very powerful corporation of scribes basically
2:44:07
that kinda ran a big chunk of the empire. And they couldn't put them out of business.
2:44:14
So they banned the printing press in part to protect that business.
2:44:21
Now, what's the analogy for AI today? Like who are we protecting by banning AI? Like who are the people who are asking that AI be regulated
2:44:28
to protect their jobs? And of course, it's a real question
2:44:35
of what is gonna be the effect of technological transformation like AI
2:44:41
on the job market and the labor market? And there are economists
2:44:46
who are much more expert at this than I am, but when I talk to them, they tell us we're not gonna run out of jobs.
2:44:54
This is not gonna cause mass unemployment. This is just gonna be gradual shift
2:45:01
of different professions. The professions that are gonna be hot 10 or 15 years from now,
2:45:06
we have no idea today what they're gonna be. The same way if we go back 20 years in the past,
2:45:12
like who could have thought 20 years ago that like the hottest job,
2:45:17
even like 5, 10 years ago was mobile app developer? Like smartphones weren't invented.
2:45:23
- Most of the jobs of the future might be in the Metaverse. (laughs) - Well, it could be. Yeah.
2:45:29
- But the point is you can't possibly predict. But you're right. I mean, you've made a lot of strong points.
2:45:36
And I believe that people are fundamentally good, and so if AI, especially open source AI
2:45:42
can make them smarter, it just empowers the goodness in humans.
2:45:48
- So I share that feeling. Okay? I think people are fundamentally good. (laughing)
2:45:54
And in fact a lot of doomers are doomers because they don't think that people are fundamentally good.
2:46:00
And they either don't trust people or they don't trust the institution to do the right thing
2:46:07
so that people behave properly. - Well, I think both you and I believe in humanity,
2:46:13
and I think I speak for a lot of people in saying thank you for pushing the open source movement,
2:46:20
pushing to making both research and AI open source, making it available to people,
2:46:25
and also the models themselves, making that open source also. So thank you for that. And thank you for speaking your mind
2:46:32
in such colorful and beautiful ways on the internet. I hope you never stop. You're one of the most fun people I know
2:46:37
and get to be a fan of. So Yann, thank you for speaking to me once again, and thank you for being you.
2:46:43
- Thank you Lex. - Thanks for listening to this conversation with Yann LeCun. To support this podcast,
2:46:49
please check out our sponsors in the description. And now let me leave you with some words from Arthur C. Clarke,
2:46:56
"the only way to discover the limits of the possible is to go beyond them into the impossible."
2:47:04
Thank you for listening and hope to see you next time.