charlesfrye committed on
Commit
a08f3cd
·
1 Parent(s): 1b5101a

adds documents

documents/lecture-01.md ADDED
@@ -0,0 +1,563 @@
1
+ ---
2
+ description: Introduction to planning, developing, and shipping ML-powered products.
3
+ ---
4
+
5
+ # Lecture 1: Course Vision and When to Use ML
6
+
7
+ <div align="center">
8
+ <iframe width="720" height="405" src="https://www.youtube-nocookie.com/embed/-Iob-FW5jVM?list=PL1T8fO7ArWleMMI8KPJ_5D5XSlovTW_Ur" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
9
+ </div>
10
+
11
+ Lecture by [Josh Tobin](https://twitter.com/josh_tobin_).
12
+ Notes by [James Le](https://twitter.com/le_james94) and [Vishnu Rachakonda](https://www.linkedin.com/in/vrachakonda/).<br />
13
+ Published August 8, 2022.
14
+ [Download slides](https://drive.google.com/file/d/18EVuJpnJ9z5Pz7oRYcgax_IzRVhbuAMC/view?usp=sharing).
15
+
16
+ ## 1 - Course Vision
17
+
18
+ ### History of FSDL
19
+
20
+ **Full Stack Deep Learning (FSDL) is the course and community for people
21
+ who are building products that are powered by machine learning (ML).**
22
+ It's an exciting time to talk about ML-powered products because ML is
23
+ rapidly becoming a mainstream technology - as you can see in startup
24
+ funding, job postings, and continued investments of large companies.
25
+
26
+ FSDL was originally started in 2018 when the most exciting ML-powered
27
+ products were built by the biggest companies. However, the broader
28
+ narrative in the field was that very few companies could get value out
29
+ of this technology.
30
+
31
+ Now in 2022, there's a proliferation of powerful products that are
32
+ powered by ML. The narrative has shifted as well: There's
33
+ standardization that has emerged around the tech stack - with
34
+ transformers and NLP starting to seep their way into more use cases, as
35
+ well as practices around how to apply ML technologies in the world. One
36
+ of the biggest changes in the field in the past four years has been the
37
+ emergence of the term **MLOps**.
38
+
39
+ In addition to the
40
+ field being more mature and research continuing to progress, a big
41
+ reason for this rapid change is that **the training of models is starting to become
42
+ commoditized**.
43
+
44
+ - With tools like [Hugging Face](https://huggingface.co), you can deploy a state-of-the-art NLP
45
+ or CV model in one or two lines of code.
46
+
47
+ - AutoML is starting to work for a lot of applications.
48
+
49
+ - Companies like [OpenAI](https://openai.com/api/) are starting to provide models as a service where you
50
+ don't even have to download open-source packages to use them. You
51
+ can make a network call to get predictions from a state-of-the-art
52
+ model.
53
+
54
+ - Many libraries are starting to standardize around frameworks like [Keras](https://keras.io/) and [PyTorch
55
+ Lightning](https://www.pytorchlightning.ai/).
56
+
57
+ ### AI Progress
58
+
59
+ The history of ML is characterized by stratospheric rises and meteoric falls of the public
60
+ perception of the technology. These were driven by a few different AI
61
+ winters that happened over the history of the field - where the
62
+ technology didn't live up to its hype. If you project forward a few
63
+ years, what will happen to ML?
64
+
65
+ ![](./media/image6.png)
66
+
67
+
68
+ *Source: [5 Things You Should Know About
69
+ AI](https://www.cambridgewireless.co.uk/media/uploads/resources/AI%20Group/AIMobility-11.05.17-Cambridge_Consultants-Monty_Barlow.pdf)
70
+ (Cambridge Consultants, May 2017)*
71
+
72
+ Here are the major categories of possible outcomes and our guess about their likelihoods:
73
+
74
+ 1. A true AI winter, where people
75
+ become skeptical about AI as a technology.
76
+ We think this is less likely.
77
+
78
+ 2. A slightly more likely outcome is that the overall luster of the
79
+ technology starts to wear off, but specific applications are
80
+ getting a ton of value out of it.
81
+
82
+ 3. The upside outcome for the field is that AI continues to accelerate
83
+ rapidly and becomes pervasive and incredibly effective.
84
+
85
+ Our conjecture is that: **The way we, as a field, avoid an AI winter is
86
+ by translating research progress into real-world products.** That's how
87
+ we avoid repeating what has happened in the past.
88
+
89
+ ### ML-Powered Products Require a Different Process
90
+
91
+ Building ML-powered products requires a fundamentally different process
92
+ in many ways than developing ML models in an academic setting.
93
+
94
+ ![](./media/image7.png)
95
+
96
+
97
+ In academia, you build **"flat-earth" ML** - selecting a problem,
98
+ collecting data, cleaning and labeling the data, iterating on model
99
+ development until you have a model that performs well on the dataset
100
+ collected, evaluating that model, and writing a report at the end.
101
+
102
+ ![](./media/image5.png)
103
+
104
+
105
+ But ML-powered products require **an outer loop** where after you deploy
106
+ the model into production, you measure how that model performs when it
107
+ interacts with real users. Then, you use real-world data to
108
+ improve your model, setting up a data flywheel that enables
109
+ continual improvement.
110
+
111
+ ### This Course
112
+
113
+ ![](./media/image2.png)
114
+
115
+
116
+ This class is about the unique aspects you need to know beyond training
117
+ models to build great ML-powered products. Here are some concrete goals
118
+ for us:
119
+
120
+ 1. Teaching you **generalist skills** and an understanding of the
121
+ **components of ML-powered products** (and ML projects more
122
+ generally).
123
+
124
+ 2. Teaching you **enough MLOps to get things done**.
125
+
126
+ 3. Sharing **best practices** and **explaining the motivation** behind them.
127
+
128
+ 4. Learning things that might **help you with job interviews** for ML engineering roles.
129
+
130
+ 5. **Forming a community** to learn together and from each other.
131
+
132
+ We do NOT try to:
133
+
134
+ 1. Teach you ML or software engineering from scratch.
135
+
136
+ 2. Cover the whole breadth of deep learning techniques.
137
+
138
+ 3. Make you an expert in any single aspect of ML.
139
+
140
+ 4. Do research in deep learning.
141
+
142
+ 5. Cover the full spectrum of MLOps.
143
+
144
+ If you feel rusty on your pre-requisites but want to get started with
145
+ FSDL, here are our recommendations to get up to speed with the
146
+ fundamentals:
147
+
148
+ - Andrew Ng's [Machine Learning Coursera
149
+ course](https://www.coursera.org/collections/machine-learning)
150
+
151
+ - Google's [crash course on Machine
152
+ Learning](https://developers.google.com/machine-learning/crash-course)
153
+
154
+ - MIT's [The Missing
155
+ Semester](https://missing.csail.mit.edu/) on software
156
+ engineering
157
+
158
+ ### ML-Powered Products vs MLOps
159
+
160
+ MLOps, as a discipline, has emerged in just the last few years. It is
161
+ about practices for deploying, maintaining, and operating ML systems
162
+ that generate ML models in production. A lot of MLOps is about:
163
+
164
+ - How do we put together an infrastructure that allows us to build
165
+ models in a repeatable and governable way?
166
+
167
+ - How can we run ML systems in a potentially high-scale production
168
+ setting?
169
+
170
+ - How can we collaborate on these systems as a team?
171
+
172
+ ![](./media/image1.png)
173
+
174
+
175
+ ML-powered product building is a distinct but overlapping discipline. A lot of
176
+ what it takes to build a great ML-powered product goes beyond the
177
+ infrastructure side of ML systems. It focuses on how to fit ML into the
178
+ context of the product or the application that you're building.
179
+
180
+ Other topics in the scope of the ML product discipline include:
181
+
182
+ - How do you understand how your users are interacting with your
183
+ model?
184
+
185
+ - How do you build a team or an organization that can work together
186
+ effectively on ML systems?
187
+
188
+ - How do you do product management in the context of ML?
189
+
190
+ - What are the best practices for designing products that use ML as
191
+ part of them?
192
+
193
+ This class focuses on teaching you end-to-end what it takes to get a
194
+ product out in the world that uses ML and will cover aspects of MLOps
195
+ that are most critical in order to do that.
196
+
197
+ ### Chapter Summary
198
+
199
+ 1. **ML-powered products are going mainstream** thanks to the
200
+ democratization of modeling.
201
+
202
+ 2. However, building **great ML-powered products requires a different
203
+ process** from building models.
204
+
205
+ 3. Full Stack Deep Learning is **here to help**!
206
+
207
+ ## 2 - When To Use Machine Learning
208
+
209
+ ### When to Use ML At All
210
+
211
+ **ML projects have a higher failure rate than software projects in
212
+ general**. One reason that's worth acknowledging is that for many
213
+ applications, ML is fundamentally still research. Therefore, we
214
+ shouldn't aim for 100% success.
215
+
216
+ Additionally, many ML projects are
217
+ doomed to fail even before they are undertaken due to a variety of
218
+ reasons:
219
+
220
+ 1. They are technically infeasible or poorly scoped.
221
+
222
+ 2. They never make the leap to a production environment.
223
+
224
+ 3. The broader organization is not on the same page about what
225
+ counts as success for them.
226
+
227
+ 4. They solve the problem that you set out to solve but do not solve a
228
+ big enough problem to be worth their complexity.
229
+
230
+ The bar for your ML projects should be that **their value must outweigh
231
+ not just the cost of developing them but also the additional complexity
232
+ that these ML systems introduce to your software** (as introduced in the
233
+ classic paper "[The High-Interest Credit Card of Technical
234
+ Debt](https://research.google/pubs/pub43146/)").
235
+
236
+ In brief,
237
+ ML systems erode the boundaries between other systems, rely on expensive
238
+ data dependencies, are commonly plagued by system design anti-patterns,
239
+ and are subject to the instability of the external world.
240
+
241
+ Before starting an ML project, ask yourself:
242
+
243
+ 1. **Are you ready to use ML?** More specifically, do you have a
244
+ product? Are you collecting data and storing it in a sane way? Do
245
+ you have the right people?
246
+
247
+ 2. **Do you really need ML to solve this problem?** More specifically,
248
+ do you need to solve the problem at all? Have you tried using
249
+ rules or simple statistics to solve the problem?
250
+
251
+ 3. **Is it ethical to use ML to solve this problem?** We have a
252
+ [whole lecture about ethics](../lecture-9-ethics/)!
253
+
254
+ ### How to Pick Problems to Solve with ML
255
+
256
+ Just like any other project prioritization, you want to look for use
257
+ cases that have **high impact** and **low cost**:
258
+
259
+ 1. **High-impact problems** are likely to be those that address friction in
260
+ your product, complex parts of your pipeline, places where cheap
261
+ prediction is valuable, and generally what other people in your
262
+ industry are doing.
263
+
264
+ 2. **Low-cost projects** are those with available data, where bad
265
+ predictions are not too harmful.
266
+
267
+ ![](./media/image11.png)
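
The impact-versus-cost prioritization above can be sketched as a small ranking script. This is only an illustration under stated assumptions: the `Candidate` class, the 1-to-5 scoring scale, the sort key, and the example projects are all hypothetical and do not come from the lecture.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    impact: int  # 1 (low) to 5 (high), a subjective estimate
    cost: int    # 1 (cheap) to 5 (expensive), a subjective estimate


def prioritize(candidates):
    """Sort candidates so that high-impact, low-cost projects come first."""
    # Assumed scoring rule: rank by (cost - impact), break ties by higher impact.
    return sorted(candidates, key=lambda c: (c.cost - c.impact, -c.impact))


# Hypothetical candidates; the names and scores are made up for illustration.
candidates = [
    Candidate("fully autonomous support agent", impact=5, cost=5),
    Candidate("search ranking tweak", impact=3, cost=1),
    Candidate("churn prediction dashboard", impact=2, cost=3),
]

for c in prioritize(candidates):
    print(f"{c.name}: impact={c.impact}, cost={c.cost}")
```

In practice the scores are rough estimates agreed on with stakeholders, so the ranking is better treated as a conversation starter than as a decision rule.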
268
+
269
+
270
+ #### High-Impact Projects
271
+
272
+ Here are some heuristics that you can use to find high-impact ML
273
+ projects:
274
+
275
+ 1. **Find problems that ML takes from economically infeasible to feasible**.
276
+ A good resource here is the book "[Prediction Machines:
277
+ The Simple Economics of
278
+ AI](https://www.amazon.com/Prediction-Machines-Economics-Artificial-Intelligence/dp/1633695670)."
279
+ The book's central thesis is that AI reduces the cost of
280
+ prediction, which is central to decision-making. Therefore, look
281
+ for projects where making prediction cheaper will have a huge impact.
282
+
283
+ 2. **Think about what your product needs**.
284
+ [This article from the ML team at Spotify](https://spotify.design/article/three-principles-for-designing-ml-powered-products)
285
+ talks about the three principles for designing Discover Weekly,
286
+ one of Spotify's most powerful and popular ML-powered features.
287
+
288
+ 3. **Think about the types of problems that ML is particularly good at**.
289
+ One common class of problem that is overlooked is
290
+ ["Software 2.0"](https://karpathy.medium.com/software-2-0-a64152b37c35),
291
+ as coined by Andrej Karpathy. Essentially, if you have a part of your
292
+ system that is complex and manually defined, then that's
293
+ potentially a good candidate to be automated with ML.
294
+
295
+ 4. **Look at what other people in the industry are doing**.
296
+ Generally, you can read papers and blog posts from both Big Tech and top
297
+ earlier-stage companies.
298
+
299
+ #### Low-Cost Projects
300
+
301
+ ![](./media/image12.png)
302
+
303
+
304
+ There are three main drivers for how much a project will cost:
305
+
306
+ 1. **Data availability**: How hard is it to acquire data? How expensive
307
+ is data labeling? How much data will be needed? How stable is the
308
+ data? What data security requirements do you have?
309
+
310
+ 2. **Accuracy requirement**: How costly are wrong predictions? How
311
+ frequently does the system need to be right to be useful? What are
312
+ the ethical implications of your model making wrong predictions?
313
+ It is noteworthy that **ML project costs tend to scale
314
+ super-linearly in the accuracy requirement**.
315
+
316
+ 3. **Problem difficulty**: Is the problem well-defined enough to be
317
+ solved with ML? Is there good published work on similar problems?
318
+ How much compute does it take to solve the problem? **Generally,
319
+ it's hard to reason about what's feasible in ML**.
320
+
321
+ #### What's Hard in ML?
322
+
323
+ ![](./media/image8.png)
324
+
325
+
326
+ Here are the three types of hard problems:
327
+
328
+ 1. **Output is complex**: The model predictions are ambiguous or in a
329
+ high-dimensional structure.
330
+
331
+ 2. **Reliability is required**: ML systems tend to fail in unexpected
332
+ ways, so anywhere you need high precision or high robustness is
333
+ going to be more difficult to solve with ML.
334
+
335
+ 3. **Generalization is required**: These problems tend to be more in
336
+ the research domain. They may involve out-of-distribution data
337
+ or tasks such as reasoning, planning, or understanding
338
+ causality.
339
+
340
+ #### ML Feasibility Assessment
341
+
342
+ This is a quick checklist you can use to assess the feasibility of your
343
+ ML projects:
344
+
345
+ 1. Make sure that you actually need ML.
346
+
347
+ 2. Put in the work upfront to define success criteria with all of the
348
+ stakeholders.
349
+
350
+ 3. Consider the ethics of using ML.
351
+
352
+ 4. Do a literature review.
353
+
354
+ 5. Try to rapidly build a labeled benchmark dataset.
355
+
356
+ 6. Build a "minimum" viable model using manual rules or simple
357
+ heuristics.
358
+
359
+ 7. Answer this question again: "Are you sure that you need ML at all?"
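
Steps 5 and 6 of the checklist can be sketched in a few lines: a tiny hand-labeled benchmark plus a rule-based "minimum viable model" to measure against. Everything here is an assumption made for illustration, including the spam-detection task, the keyword list, and the four benchmark messages.

```python
def rule_based_baseline(text):
    """Flag a message as spam if it contains any hand-picked keyword."""
    keywords = ("free", "winner", "click here")  # hypothetical rules
    lowered = text.lower()
    return any(keyword in lowered for keyword in keywords)


def accuracy(predict, benchmark):
    """Fraction of (text, label) pairs that `predict` gets right."""
    correct = sum(predict(text) == label for text, label in benchmark)
    return correct / len(benchmark)


# A hypothetical hand-labeled benchmark: (message, is_spam).
benchmark = [
    ("You are a WINNER, click here to claim your prize", True),
    ("Free tickets for the first 100 callers", True),
    ("Can we move our meeting to 3pm?", False),
    ("Lunch is free in the cafeteria today", False),
]

baseline_accuracy = accuracy(rule_based_baseline, benchmark)
print(f"Rule-based baseline accuracy: {baseline_accuracy:.2f}")
```

If a baseline like this already meets the success criteria, that is a strong hint for step 7 that ML may not be needed; if it falls short, its score becomes the number any ML model has to beat.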
360
+
361
+ ### Not All ML Projects Should Be Planned The Same Way
362
+
363
+ Not all ML projects have the same characteristics; therefore, they
364
+ shouldn't be planned the same way. Understanding different archetypes of
365
+ ML projects can help you select the right approach.
366
+
367
+ #### ML Product Archetypes
368
+
369
+ The three archetypes offered here are defined by how they interact with
370
+ real-world use cases:
371
+
372
+ 1. **Software 2.0 use cases**: Broadly speaking, this means taking
373
+ something that software or a product does in an automated fashion
374
+ today and augmenting its automation with machine learning. An
375
+ example of this would be improving code completion in the IDE
376
+ (like [GitHub
377
+ Copilot](https://github.com/features/copilot)).
378
+
379
+ 2. **Human-in-the-loop systems:** Machine learning can be applied for
380
+ tasks where automation is not currently deployed - but where
381
+ humans could have their judgment or efficiency augmented. Simply
382
+ put, helping humans do their jobs better by complementing them
383
+ with ML-based tools. An example of this would be turning sketches
384
+ into slides, a process that will usually involve humans approving the
385
+ output of a machine learning model that made the slides.
386
+
387
+ 3. **Autonomous systems:** Systems that apply machine learning to
388
+ augment existing or implement new processes without human input.
389
+ An example of this would be full self-driving, where there is no
390
+ opportunity for a driver to intervene in the functioning of the
391
+ car.
392
+
393
+ For each archetype, some key considerations inform how you should go
394
+ about planning projects.
395
+
396
+ ![](./media/image10.png)
397
+
398
+
399
+ 1. In the case of Software 2.0 projects, you should focus more on
400
+ understanding **how impactful the performance of the new model
401
+ is**. Is the model truly much better? How can the performance
402
+ continue to increase across iterations?
403
+
404
+ 2. In the case of human-in-the-loop systems, consider more **the
405
+ context of the human user and what their needs might be**. How
406
+ good does the system actually have to be to improve the life of a
407
+ human reviewing its output? In some cases, a model that does even
408
+ 10% better with accuracy (nominally a small increase) might have
409
+ outsize impacts on human users in the loop.
410
+
411
+ 3. For autonomous systems, focus heavily on **the failure rate and its
412
+ consequences**. When there is no opportunity for human
413
+ intervention, as is the case with autonomous systems, failures
414
+ need to be carefully monitored to ensure outsize harm doesn't
415
+ occur. Self-driving cars are an excellent example of an autonomous
416
+ system where failure rates are carefully monitored.
417
+
418
+ #### Data Flywheels
419
+
420
+ As you build a Software 2.0 project, strongly consider the concept of
421
+ the **data flywheel**. For certain ML projects, as you improve your
422
+ model, your product will get better and more users will engage with the
423
+ product, thereby generating more data for the model to get even better.
424
+ It's a classic virtuous cycle and truly the gold standard for ML
425
+ projects.
426
+
427
+ ![](./media/image4.png)
428
+
429
+
430
+ As you consider implementing data flywheels, make sure you know the answers
431
+ to these three questions:
432
+
433
+ 1. **Do you have a data loop?** To build a data flywheel, you crucially
434
+ need to be able to get labeled data from users in a scalable
435
+ fashion. This helps increase access to high-quality data and
436
+ define a data loop.
437
+
438
+ 2. **Can you turn more data into a better model?** This somewhat falls
439
+ onto you as the modeling expert, but it may also not be the case
440
+ that more data leads to significantly better performance. Make
441
+ sure you can actually translate data scale into better model
442
+ performance.
443
+
444
+ 3. **Does better model performance lead to better product use?** You
445
+ need to verify that improvements with models are actually tied to
446
+ users enjoying the product more and benefiting from it!
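
One rough way to probe the second question is to train on increasing amounts of data and look at how much each increase buys. The sketch below assumes you have already measured validation accuracy at several dataset sizes; the numbers, the `min_gain` threshold, and both helper functions are hypothetical.

```python
def gains_per_step(scores):
    """Return the score improvement between consecutive dataset sizes."""
    return [round(later - earlier, 4) for earlier, later in zip(scores, scores[1:])]


def still_improving(scores, min_gain=0.01):
    """True if the most recent increase in data improved the score by at least min_gain."""
    return gains_per_step(scores)[-1] >= min_gain


# Hypothetical validation accuracy after training on 1k, 2k, 4k, and 8k examples.
dataset_sizes = [1_000, 2_000, 4_000, 8_000]
validation_accuracy = [0.70, 0.78, 0.82, 0.825]

for size, gain in zip(dataset_sizes[1:], gains_per_step(validation_accuracy)):
    print(f"up to {size} examples: +{gain}")
print("More data still helps:", still_improving(validation_accuracy))
```

With these made-up numbers the gains shrink at each doubling, which would suggest that collecting more of the same data is unlikely to drive the flywheel on its own.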
447
+
448
+ #### Impact and Feasibility of ML Product Archetypes
449
+
450
+ Let's revisit our impact vs. feasibility matrix. Our three product
451
+ archetypes differ across the spectrum.
452
+
453
+ ![](./media/image9.png)
454
+
455
+
456
+ This is a pretty intuitive evaluation you can apply to all your ML
457
+ projects: **If it's harder to build (like autonomous systems), it's
458
+ likely to have a greater impact**! There are ways, however, to change
459
+ this matrix in the context of specific projects.
460
+
461
+ 1. For **Software 2.0**, data flywheels can magnify impact by allowing
462
+ models to get much better and increase customer delight over time.
463
+
464
+ 2. For **human-in-the-loop systems**, you can increase feasibility by
465
+ leveraging good product design. Thoughtful design can help reduce
466
+ expectations and accuracy requirements. Alternatively, a "good
467
+ enough" mindset that prioritizes incremental delivery over time
468
+ can make such systems more feasible.
469
+
470
+ 3. For **autonomous systems**, leveraging humans in the loop can make
471
+ development more feasible by adding guardrails and reducing the
472
+ potential impact of failures.
473
+
474
+ ### Just Get Started!
475
+
476
+ With all this discussion about archetypes and impact matrices, don't
477
+ forget the most important component of engineering: **actually
478
+ building**! Dive in and get started. Start solving problems and iterate
479
+ on solutions.
480
+
481
+ One common area practitioners trip up in is **tool fetishization.** As
482
+ MLOps and production ML have flourished, so too has the number of tools
483
+ and platforms that address various aspects of the ML process. You don't
484
+ need to be perfect with your tooling before driving value from machine
485
+ learning. Just because Google and Uber are doing things in a very
486
+ structured, at-scale way doesn't mean you need to as well!
487
+
488
+ In this course, we will primarily focus on how to set things up the
489
+ right way to do machine learning in production without overcomplicating
490
+ it. This is an ML products-focused class, not an MLOps class! Check out
491
+ this talk by Jacopo Tagliabue describing [MLOps at Reasonable
492
+ Scale](https://www.youtube.com/watch?v=Ndxpo4PeEms) for a
493
+ great exposition of this mindset.
494
+
495
+ ### Chapter Summary
496
+
497
+ 1. ML adds complexity. Consider whether you really need it.
498
+
499
+ 2. Make sure what you're working on is high impact, or else it might
500
+ get killed.
501
+
502
+ ## 3 - Lifecycle
503
+
504
+ ML adds complexity to projects and isn't always a value driver. Once you
505
+ know, however, that it's the right addition to your project, what does
506
+ the actual lifecycle look like? What steps do we embark upon as we
507
+ execute?
508
+
509
+ In this course, the common running example we use is of **a pose
510
+ estimation problem**. We'll use this as a case study to demonstrate the
511
+ lifecycle and illustrate various points about ML-powered products.
512
+
513
+ ![](./media/image13.png)
514
+
515
+
516
+ Here's a graphic that visualizes the lifecycle of ML projects:
517
+
518
+ ![](./media/image3.png)
519
+
520
+
521
+ It provides a very helpful structure. Watch from 48:00 to 54:00 to dive
522
+ deeper into how this lifecycle occurs in the context of a real machine
523
+ learning problem around pose estimation that Josh worked on at OpenAI.
524
+
525
+ Let's comment on some specific nuances:
526
+
527
+ - **Machine learning projects tend to be very iterative**. Each of
528
+ these phases can feed back into any of the phases that go before
529
+ it, as you learn more about the problem that you're working on.
530
+
531
+ - For example, you might realize that "Actually, it's way too
532
+ hard for us to get data in order to solve this problem!" or
533
+ "It's really difficult for us to label the pose of these
534
+ objects in 3D space".
535
+
536
+ - A solution might actually be to go back a step in the lifecycle
537
+ and set up the problem differently. For example, what if it
538
+ were cheaper to annotate per pixel?
539
+
540
+ - This could repeat itself multiple times as you progress through
541
+ a project. It's a normal and expected part of the machine
542
+ learning product development process.
543
+
544
+ - In addition to iteration during execution, there's also
545
+ cross-project "platform" work that matters! **Hiring and
546
+ infrastructure development are crucial to the long-term health of
547
+ your project**.
548
+
549
+ - Going through this lifecycle and winning each step is what we'll
550
+ cover in this class!
551
+
552
+ ## Lecture Summary
553
+
554
+ In summary, here's what we covered in this lecture:
555
+
556
+ 1. ML is NOT a cure-all. It's a complex technology that needs to be
557
+ used thoughtfully.
558
+
559
+ 2. You DON'T need a perfect setup to get going. Start building and
560
+ iterate!
561
+
562
+ 3. The lifecycle of machine learning is purposefully iterative and
563
+ circuitous. We'll learn how to master this process together!
documents/lecture-01.srt ADDED
@@ -0,0 +1,352 @@
1
+ 1
2
+ 00:00:00,320 --> 00:00:33,120
3
+ hey everyone welcome to the 2022 edition of full stack deep learning i'm josh tobin one of the instructors and i'm really excited about this version of the class because we've made a bunch of improvements and i think it comes at a really interesting time to be talking about some of these topics let's dive in today we're going to cover a few things first we'll talk about why this course exists and what you might hope to take away from it then we'll talk about the first question you should ask when you're starting a new ml project which is should we be using ml for this at all and then we'll talk through the high level overview of what the life cycle of a typical ml project might look like which will also give you a conceptual
4
+
5
+ 2
6
+ 00:00:31,840 --> 00:01:08,880
7
+ outline for some of the things we'll talk about in this class really what is full stack deep learning about we aim to be the course and community for people who are building products that are powered by machine learning and i think it's a really exciting time to be talking about ml-powered products because machine learning is rapidly becoming a mainstream technology and you can see this in startup funding in job postings as well as in the continued investment of large companies in this technology i think it's particularly interesting to think about how this has changed since 2018 when we started teaching the class in 2018 a lot of the most exciting ml powered products were built by the biggest companies you had self-driving
8
+
9
+ 3
10
+ 00:01:06,880 --> 00:01:44,079
11
+ cars that were starting to show promise you had systems like translation from big companies like google that were really starting to hit the market in a way that was actually effective but the broader narrative in the field was that very few companies were able to get value out of this technology and even on the research side right now gpt-3 is becoming a mainstream technology but in 2018 gpt-1 was one of the state-of-the-art examples of language models and you know if you look at what it actually took to build a system like this it was the code and the standardization around it was still not there like these technologies were still hard to apply now on the other hand there's a much wider range of really
12
+
13
+ 4
14
+ 00:01:42,240 --> 00:02:20,000
15
+ powerful products that are powered by machine learning dall-e 2 is i think a great example image generation technology more on the consumer side tiktok is a really powerful example but it's not just massive companies now that are able to build machine learning powered products descript is an application that we at full stack deep learning use all the time in fact we'll probably use it to edit this video that i'm recording right now and startups are also building things like email generation so there's a proliferation of machine learning powered products and the narrative has shifted i think a little bit as well which is that before these technologies were really hard to apply but now there's standardization
16
+
17
+ 5
18
+ 00:02:17,840 --> 00:02:55,519
19
+ that's emerging both around the technology stack transformers and nlp starting to seep their way into more and more use cases as well as the practices around how to actually apply these technologies in the world one of the biggest changes in the field in the past four years has been the emergence of this term called mlops which we'll talk a lot about in this class and so if you ask yourself why like why is this changed so rapidly i think in addition to the field just maturing and research continuing to progress i think one of the biggest reasons is that the training of models is starting to become commoditized we showed a couple of slides ago how complicated code for gpt-1 was now using something like
20
+
21
+ 6
22
+ 00:02:53,840 --> 00:03:28,480
23
+ hugging face you can deploy a state-of-the-art nlp model or computer vision model in one or two lines of code on top of that automl is starting to actually work for a lot of applications i think four years ago we were pretty skeptical about it now i think it's a really good starting point for a lot of problems that you might want to solve and companies are starting to provide models really as a service where you don't even have to download open source package to use it you can just make a network call and you can have predictions from a state-of-the-art model and on the software side a lot of frameworks are starting to standardize around things like keras and pytorch lightning so a lot of the like spaghetti
24
+
25
+ 7
26
+ 00:03:27,200 --> 00:04:07,680
27
+ code that you had to write to build these systems just isn't necessary anymore and so i think if you project forward a few years what's going to happen in ml i think the history of ml is characterized by rise and fall of the public perception of the technology these were driven by a few different ai winters that happened over the history of the field where the technology didn't live up to its hype live up to its promise and people became skeptical about it what's going to happen in the future of the field i think what a lot of people think is that this time is different we have real applications of machine learning that are generating a lot of value in the world and so the prospect of a true ai winter where
28
+
29
+ 8
30
+ 00:04:05,680 --> 00:04:46,560
31
+ people become skeptical about ai as a technology maybe feels less likely but it's still possible a slightly more likely outcome is that the overall luster of the technology starts to wear off but certain applications are getting a ton of value out of this technology and then i think you know the upside outcome for the field is that ai continues to accelerate really rapidly and it becomes pervasive and incredibly effective and i think that's also what a lot of people believe to happen and so what i would conjecture is that the way that we as a field avoid an ai winter is by not just making progress in research but also making sure that that progress is translated to actual real world products that's how we avoid repeating
32
+
33
+ 9
34
+ 00:04:45,199 --> 00:05:20,639
35
+ what's happened in the past that's caused the field to lose some of its luster but the challenge that presents is that building ml powered products requires a fundamentally different process in many ways than building the types of ml systems you create in academia the sort of process that you might use to develop a model in an academic setting i would call flat earth machine learning flat earth ml this is a process that will probably be familiar to many people you start by selecting a problem you collect some data to use to solve the problem you clean and label that data you iterate on developing a model until you have a model that performs well on the data set that you collected and then you evaluate that
36
+
37
+ 10
38
+ 00:05:19,120 --> 00:05:56,639
39
+ model and if the model performs well according to your metrics then you write a report produce a jupyter notebook a paper or some slides and then you're done but in the real world the challenge is that if you deploy that model in production it's not going to perform well in the real world for long necessarily right and so ml powered products require this outer loop where you deploy the model into production you measure how that model is performing when it interacts with real users you use the real world data to build a data flywheel and then you continue this as part of an outer loop some people believe that the earth isn't round just because you can't see the outer loop in the ml system doesn't mean it's not
40
+
41
+ 11
42
+ 00:05:54,160 --> 00:06:32,080
43
+ there and so this course is really about how to do this process of building ml-powered products and so what we won't cover as much is the theory and the math and the sort of computer science behind deep learning and machine learning more broadly there are many great courses that you can check out to learn those materials we also will talk a little bit about training models and some of the practical aspects of that but this isn't meant to be your first course in training machine learning models again there's many great classes for that as well but what this class is about is the unique aspects that you need to know beyond just training models to build great ml powered products so our goals in the class are to teach you
44
+
45
+ 12
46
+ 00:06:30,080 --> 00:07:09,120
47
+ a generalist skill set that you can use to build an ml-powered product and an understanding of how the different pieces of ml-powered products fit together we will also teach you a little bit about this concept of ml ops but this is not an ml ops class our goal is to teach you enough ml ops to get things done but not to cover the full depth of ml ops as a topic we'll also share certain best practices from what we've seen to work in the real world and try to explain some of the motivations behind them and if you're on the job market or if you're thinking about transitioning into a role in machine learning we also aim to teach you some things that might help you with ml engineering job interviews and then lastly in practice i think what we've
48
+
49
+ 13
50
+ 00:07:07,199 --> 00:07:42,560
51
+ found to be maybe the most powerful part of this is forming a community that you can use to learn from your peers about what works in the real world and what doesn't we as instructors have solved many problems with ml but there's a very good chance that we haven't solved one that's like the one that you're working on but in the broader full stack deep learning community i would bet that there probably is someone who's worked on something similar and so we hope that this can be a place where folks come together to learn from each other as well as just learning from us now there are some things that we are explicitly not trying to do with this class we're not trying to teach you machine learning or software engineering from scratch if
52
+
53
+ 14
54
+ 00:07:40,800 --> 00:08:15,919
55
+ you are coming at this class and you know you have an academic background in ml but you've never written production code before or you're a software engineer but you've never taken an ml class before you can follow along with this class but i would highly recommend taking these prerequisites before you dive into the material here because you'll i think get a lot more out of the class once you've learned the fundamentals of each of these fields we're also not aiming to cover the full breadth of deep learning techniques or machine learning techniques more broadly we'll talk about a lot of the techniques that are used in practice but the chances are that we won't talk about the specific model that you use for your use
56
+
57
+ 15
58
+ 00:08:14,240 --> 00:08:52,800
59
+ case it's not the goal here we're also not trying to make you an expert in any single aspect of machine learning we have a project and a set of labs that are associated with this course that will allow you to spend some time working on a particular application of machine learning but there isn't a focus on becoming an expert in computer vision or nlp or any other single branch of machine learning and we're also not aiming to help you do research in deep learning or any other ml field and similarly ml ops is this broad topic that involves everything from infrastructure and tooling to organizational practices and we're not aiming to be comprehensive here the goal of this class is to show you end to end
60
+
61
+ 16
62
+ 00:08:51,279 --> 00:09:28,640
63
+ what it takes to build an ml-powered product and give you pointers to the different pieces of the field that you'll potentially need to go deeper on to solve the particular problem that you're working on so if you are feeling rusty on your prerequisites but want to get started with the class anyway here are some recommendations for classes on ml and software engineering that i'd recommend checking out if you want to remind yourself of some of the fundamentals i mentioned this distinction between ml-powered products and ml ops a little bit and so i wanted to dive into that a little bit more ml ops is this discipline that's emerged in the last couple of years really that is about practices for deploying and
64
+
65
+ 17
66
+ 00:09:26,880 --> 00:10:06,240
67
+ maintaining and operating machine learning models and the systems that generate these machine learning models in production and so a lot of ml ops is about how do we put together the infrastructure that will allow us to build models in a repeatable and governable way how we're able to do this at scale how we're able to collaborate on these systems as a team and how we're able to really run these machine learning systems in a potentially high scale production setting super important topic if your goal is to make ml work in the real world and there's a lot of overlap with what we're covering in this class but we see ml-powered products as kind of a distinct but overlapping discipline because a lot of what it
68
+
69
+ 18
70
+ 00:10:04,800 --> 00:10:45,120
71
+ takes to build a great ml powered product goes beyond the infrastructure side and the sort of repeatability and automation side of machine learning systems and it also focuses on how to fit machine learning into the context of product or the application that you're building so other topics that are in scope of this ml-powered product discipline are things like how do you understand how your users are interacting with your model and what type of model they need how do you build a team or an organization that can work together effectively on machine learning systems how do you do product management in the context of ml what are some of the best practices for designing products that use ml as part of them things like data labeling capturing
72
+
73
+ 19
74
+ 00:10:42,640 --> 00:11:24,240
75
+ feedback from users etc and so this class is really focused on teaching you end to end what it takes to get a product out in the world that uses ml and we'll cover the aspects of ml ops that are most critical to understand in order to do that a little bit about us as instructors i'm josh tobin i'm co-founder and ceo of a machine learning infrastructure startup called gantry previously i was a research scientist at openai and did my machine learning phd at berkeley and charles and sergey are my wonderful co-instructors who you'll be hearing from in the coming weeks on the history of full stack deep learning so we started out as a boot camp in 2018 sergey and i as well as my grad school advisor and our close collaborator pieter
76
+
77
+ 20
78
+ 00:11:22,240 --> 00:12:00,720
79
+ abbeel had this collective realization that a lot of what we had been discovering about making ml work in the real world wasn't really well covered in other courses and we didn't really know if other people would be interested in this topic so we put it together as a one-time weekend long boot camp we started to get good feedback on that and so it grew from there and we put the class online for the first time in 2021 and here we are so the way that this class was developed was a lot of this is from our personal experience our study and reading of materials in the field we also did a bunch of interviews with practitioners from this list of companies and at this point like a much longer list as well so we're constantly
80
+
81
+ 21
82
+ 00:11:58,880 --> 00:12:32,880
83
+ out there talking to folks who are doing this who are building ml powered products and trying to fold their perspectives into what we teach in this class some logistics before we dive into the rest of the material for today first is if you're part of the synchronous cohort all of the communication for that cohort is going to happen on discord so if you're not on discord already then please reach out to us instructors and we'll make sure to get you on that if you're not on discord or if you're not checking it regularly there's a high likelihood that you're going to miss some of the value of the synchronous course we will have a course project again for folks who are participating in the synchronous option which we'll share
84
+
85
+ 22
86
+ 00:12:31,040 --> 00:13:11,839
87
+ more details about on discord in the coming weeks and there's also i think one of the most valuable parts of this class is the labs which have undergone like a big revamp this time around i want to talk a little bit more about what we're covering there so the problem that we're going to be working on the labs is creating an application that allows you to take a picture of a handwritten page of text and then transcribe that into some actual text and so imagine that you have this web application where you can take a picture of your handwriting and then at the end you get the text that comes out of it and so the way this is going to work is we're going to build a web backend that allows you to send web requests decodes those images and sends
88
+
89
+ 23
90
+ 00:13:09,279 --> 00:13:47,920
91
+ them to a prediction model an ocr model that we'll develop that will transcribe those into the text itself and those models are going to be generated by a model training system that we'll also show you how to build in the class and the architecture that we'll use will look something like this we'll use state-of-the-art tools that we think balance being able to really build a system like this in a principled way without adding too much complexity to what you're doing all right so just to summarize this section machine learning powered products are going mainstream and in large part this is due to the fact that it's just much much easier to build machine learning models today than it was even four or five years ago and
92
+
93
+ 24
94
+ 00:13:46,079 --> 00:14:21,839
95
+ so i think the challenge ahead is given that we're able to create these models pretty easily how do we actually use the models to build great products and that's a lot of what we'll talk about in this class and i think the sort of fundamental challenge is that there's not only different tools that you need in order to build great products but also different processes and mindsets as well and that's what we're really aiming to do here in fsdl so looking forward to covering some of this material and hopefully helping create the next generation of ml powered products the next topic i want to dive into is when to use machine learning at all like what problems is this technology useful for solving and so the key points that we're
96
+
97
+ 25
98
+ 00:14:20,320 --> 00:14:55,920
99
+ going to cover here are the first is that machine learning introduces a lot of complexity and so you really shouldn't do it before you're ready to do it and you should think about exhausting your other options before you introduce this to your stack on the flip side that doesn't mean that you need a perfect infrastructure to get started and then we'll talk a little bit about what types of projects tend to lend themselves to being good applications of machine learning and we'll talk about how to know whether projects are feasible and whether they'll have an impact on your organization but to start out with when should you use machine learning at all so i think the first thing that's really critical to know
100
+
101
+ 26
102
+ 00:14:54,079 --> 00:15:32,000
103
+ here is that machine learning projects have a higher failure rate than software products in general the statistic that you'll see most often floated around in blog posts or vendor pitches is that 87 percent this very precise number of machine learning projects fail i think it's also worth noting that 73 percent of all statistics are made up on the spot so and this one in particular i think is a little bit questionable whether this is actually a valid statistic or not anecdotally i would say that from what i've seen it's probably more like 25 percent it's still a very high number still a very high failure rate but maybe not the 90-ish percent that people are quoting the question you might ask is why is that the case right why is there such a
104
+
105
+ 27
106
+ 00:15:30,639 --> 00:16:12,639
107
+ high failure rate for machine learning projects you know one reason that's worth acknowledging is that for a lot of applications machine learning is fundamentally still research so a 100 percent success rate probably shouldn't be the target that we're aiming for but i do think that many machine learning projects are doomed to fail maybe even before they are undertaken and i think there's a few reasons that this can happen so oftentimes machine learning projects are technically infeasible or they're just scoped poorly and there's just too much of a lift to even get the first version of the model developed and that leads to projects failing because they just take too long to see any value another common failure mode that's becoming less and less
108
+
109
+ 28
110
+ 00:16:10,079 --> 00:16:50,720
111
+ common is that a team that's really effective at developing a model may not be the right team to deploy that model into production and so there's this friction after the model is developed where you know the model maybe looks promising in a jupyter notebook but it never makes the leap to prod so hopefully you'll take things away from this class that will help you avoid being in this category another really common issue that i've seen is when you as a broader organization are not all on the same page about what we would consider to be successful here and so i've seen a lot of machine learning projects fail because you have a model that you think works pretty well and you actually know how to deploy into
112
+
113
+ 29
114
+ 00:16:49,120 --> 00:17:27,120
115
+ production but the rest of the organization can't get comfortable with the fact that this is actually going to be running and serving predictions to users so how do we know when we're ready to deploy and then maybe the most frustrating of all these failure modes is when you actually have your model work well and it solves the problem that you set out to solve but it doesn't solve a big enough problem and so the organization decides hey this isn't worth the additional complexity that it's going to take to make this part of our stack you know i think this is a point i want to double click on which is that really i think the bar for your machine learning project should be that the value of the project must outweigh not just the cost of
116
+
117
+ 30
118
+ 00:17:25,679 --> 00:18:07,440
119
+ developing it but the additional complexity that machine learning systems introduce into your software and machine learning introduces a lot of complexity to your software so this is kind of a quick summary of a classic paper that i would recommend reading which is the high interest credit card of technical debt paper the thesis of this paper is that machine learning as a technology tends to introduce technical debt at a much higher rate than most other software and the reasons that the authors point to are one an erosion of boundary between systems so machine learning systems often have the property for example that the predictions that they make influence other systems that they interact with if you recommend a user a particular type
120
+
121
+ 31
122
+ 00:18:05,120 --> 00:18:44,799
123
+ of content that changes their behavior and so that makes it hard to isolate machine learning as a component in your system it also relies on expensive data dependencies so if your machine learning system relies on a feature that's generated by another part of your system then those types of dependencies the authors found can be very expensive to maintain it's also very common for machine learning systems to be developed with design anti-patterns somewhat avoidable but in practice very common and the systems are subject to the instability of the external world if your user's behavior changes that can dramatically affect the performance of your machine learning models in a way that doesn't typically happen with
124
+
125
+ 32
126
+ 00:18:42,240 --> 00:19:21,200
127
+ traditional software so the upshot is before you start a new ml project you should ask yourself are we ready to use ml at all do we really need this technology to solve this problem and is it actually ethical to use ml to solve this problem to know if you're ready to use ml some of the questions you might ask are do we have a product at all do we have something that we can use to collect the data to know whether this is actually working are we already collecting that data and storing it in the same way if you're not currently doing data collection then it's going to be difficult to build your first ml system and do we have the team that will allow us to do this knowing whether you need ml to solve a problem i think the first question that
128
+
129
+ 33
130
+ 00:19:19,760 --> 00:19:58,880
131
+ you should ask yourself is do we need to solve this problem at all or are we just inventing a reason to use ml because we're excited about the technology have we tried using rules or simple statistics to solve the problem with some exceptions i think usually the first version of a system that you deploy that will eventually use ml should be a simple rule based or statistics-based system because a lot of times you can get 80 percent of the benefit of your complex ml system with some simple rules now there's some exceptions to this if the system is an nlp system or a computer vision system where rules just typically don't perform very well but as a general rule i think if you haven't at least thought about whether you can use
132
+
133
+ 34
134
+ 00:19:57,360 --> 00:20:33,760
135
+ a rule-based system to achieve the same outcome then maybe you're not ready to use ml yet and lastly is it ethical i won't dive into the details here because we'll have a whole lecture on this later in the course next thing i want to talk about is if we feel like we're ready to use ml in our organization how do we know if the problem that we're working on is a good fit to solving it with machine learning the sort of tl;dr here is you want to look for like any other project prioritization you want to look for use cases that have high impact and low cost and so we'll talk about different heuristics that you can use to determine whether this application of machine learning is likely to be high impact and low cost and so we'll talk about
136
+
137
+ 35
138
+ 00:20:32,559 --> 00:21:10,240
139
+ heuristics like friction in your products complex parts of your pipeline places where it's valuable to reduce the cost of prediction and looking at what other people in your industry are doing which is a very underrated technique for picking problems to work on and then we'll also talk about some heuristics for assessing whether a machine learning project is going to be feasible from a cost perspective the overall prioritization framework that we're going to look at here is projects that you want to select are ones that are feasible so they're low cost and they're high impact let's start with the high impact side of things so what are some mental models you can use to find high impact ml projects and these are some of the ones that we'll
140
+
141
+ 36
142
+ 00:21:08,000 --> 00:21:49,520
143
+ cover so starting with a book called the economics of ai and so the question this book asks is what problems does machine learning make economically feasible to solve that were maybe not feasible to solve in the past and so the sort of core observation in this book is that really at a fundamental level what ai does is it reduces the cost of prediction before maybe you needed a person and that person would take five minutes to create a prediction it's very expensive it's very operationally complex ai can do that in a fraction of a second for the cost of essentially running your machine or running your gpu cheap prediction means that there's going to be predictions that are happening in more places even in problems where it was too
144
+
145
+ 37
146
+ 00:21:47,200 --> 00:22:27,360
147
+ expensive to do before and so the upshot of this mental model for project selection is think about projects where cheap prediction will have a huge business impact like where would you hire a bunch of people to make predictions that it isn't feasible to do now um the next mental model i want to talk about for selecting high impact projects is just thinking about what is your product need and so i really like this article called three principles for designing ml-powered products from spotify and in this article they talked about the principles that they used to develop the discover weekly feature which i think is like one of the most powerful features of spotify and you know really the way they thought about it is this reduces
148
+
149
+ 38
150
+ 00:22:25,840 --> 00:23:06,080
151
+ friction for our users reduces the friction of chasing everything down yourself and just brings you everything in a neat little package and so this is something that really makes their product a lot better and so that's another kind of easy way to come up with ideas for machine learning projects another angle to think about is what are types of problems that machine learning is particularly good at and one exploration of this mental model is an article called software 2.0 from andrej karpathy which is also definitely worth a read and the kind of main thesis of this article is that machine learning is really useful when you can take a really complex part of your existing software system so a really messy stack of
152
+
153
+ 39
154
+ 00:23:04,400 --> 00:23:46,799
155
+ handwritten rules and replace that with machine learning replace that with gradient descent and so if you have a part of your system that is complex manually defined rules then that's potentially a really good candidate for automating with ml and then lastly i think it's worth just looking at what other people in your industry are doing with ml and there's a bunch of different resources that you can look at to try to figure out what other success stories with ml are i really like this article covering the spectrum of use cases of ml at netflix there are various industry reports this is a summary of one from algorithmia which kind of covers the spectrum of what people are using ml to do and more generally i think looking at
156
+
157
+ 40
158
+ 00:23:44,960 --> 00:24:21,919
159
+ papers from the biggest technology companies tends to be a good source of what those companies are trying to build with ml and how they're doing it as well as earlier stage tech companies that are still pretty ml forward and those companies i think are more likely to write these insights in blog posts than they are in papers and so here's a list that i didn't compile but i think is really valuable of case studies of using machine learning in the real world that are worth going through if you're looking for inspiration of what are types of problems you can solve and how might you solve them okay so coming back to our prioritization framework we talked about some mental models for what ml projects might be high impact
160
+
161
+ 41
162
+ 00:24:20,960 --> 00:24:57,760
163
+ and the next thing that we're going to talk about is how to assess the cost of a machine learning project that you're considering so the way i like to think about the cost of machine learning powered projects is there's three main drivers for how much a project is gonna cost the first and most important is data availability so how easy is it for you to get the data that you're gonna need to solve this problem the second most important is the accuracy requirement that you have for the problem that you're solving and then also important is the sort of intrinsic difficulty of the problem that you're trying to solve so let's start by talking about data availability the kind of key questions that you might ask here to assess
164
+
165
+ 42
166
+ 00:24:56,080 --> 00:25:36,000
167
+ whether data availability is going to be a bottleneck for your project is do we have this data already and if not how hard is it and how expensive is it going to be to acquire how expensive is it not just to acquire but also to label if your labelers are really expensive then getting enough data to solve the problem really well might be difficult how much data will we need in total this can be difficult to assess a priori but if you have some way of guessing whether it's 5,000 or 10,000 or 100,000 data points this is an important input and then how stable is the data so if you're working on a problem where you don't really expect the underlying data to change that much over time then the project is going to be a lot more feasible than if
168
+
169
+ 43
170
+ 00:25:33,919 --> 00:26:15,200
171
+ the data that you need changes on a day-to-day basis so data availability is probably the most important cost driver for a lot of machine learning powered projects because data just tends to be expensive and this is slightly less true outside of the deep learning realm it's particularly true in deep learning where you often require manual labeling but it also is true in a lot of other ml applications where data collection is expensive and lastly on data availability is what data security requirements do you have if you're able to collect data from your users and use that to retrain your model then that bodes well for the overall cost of the project if on the other hand you're not even able to look at the data that your
172
+
173
+ 44
174
+ 00:26:12,960 --> 00:26:53,760
175
+ users are generating then that's just going to make the project more expensive because it's going to be harder to debug and harder to build a data flywheel moving on to the accuracy requirement the kinds of questions you might ask here are how expensive is it when you make a wrong prediction on one extreme you might have something like a self-driving car where a wrong prediction is extremely expensive because the prospect of that is really terrible on the other extreme is something like let's say potentially a recommender system where if a user sees a bad recommendation once it's probably not really going to be that bad maybe it affects their user experience over time and maybe causes them to churn but certainly not
176
+
177
+ 45
178
+ 00:26:52,080 --> 00:27:30,320
179
+ as bad as a wrong prediction in a self-driving car you also need to ask yourself how frequently does the system actually need to be right to be useful i like to think of systems like dall-e 2 which is an image generation system as like a positive example of this where if you're just using dall-e 2 as a creative supplement you can generate thousands and thousands of images and select the one that you like best for your use case so the system doesn't need to be right more than like once every n times in order to actually get value from it as a user on the other hand if the system needs to be 100 percent reliable like never ever make a wrong prediction in order for it to be useful then it's just going to be more expensive to build
180
+
181
+ 46
182
+ 00:27:28,640 --> 00:28:06,480
183
+ these systems and then what are the ethical implications of your model making wrong predictions is like an important question to consider as well and then lastly on the problem difficulty questions to ask yourself are is this problem well defined enough to solve with ml are other people working on similar things doesn't necessarily need to be the exact same problem but if it's a sort of a brand new problem that no one's ever solved with ml before that's going to introduce a lot of technical risk another thing that's worth looking at if you're looking at other work on similar problems is how much compute did it actually take them to solve this problem and it's worth looking at that both on the training side as well as on
184
+
185
+ 47
186
+ 00:28:04,559 --> 00:28:42,240
187
+ the inference side because if it's feasible to train your model but it takes five seconds to make a prediction then for some applications that will be good enough and for some it won't and then i think like maybe the weakest heuristic here but still potentially a useful one is can a human do this problem at all if a human can solve the problem then that's a decent indication that a machine learning system might be able to solve it as well but not a perfect indication as we'll come back to so i want to double click on this accuracy requirement why is this such an important driver of the cost of machine learning projects the fundamental reason is that in my observation the project cost tends to scale like super linearly
188
+
189
+ 48
190
+ 00:28:40,000 --> 00:29:25,120
191
+ in your accuracy requirement so as a very rough rule of thumb every time that you add an additional nine to your required accuracy so moving from 99.9 percent to 99.99 percent accuracy might lead to something like a 10x increase in your project costs because you might expect to need at least 10 times as much data if not more in order to actually solve the problem to that degree of accuracy required but also you might need a bunch of additional infrastructure monitoring support in order to ensure that the model is actually performing that accurately next thing i'm going to double click on is the problem difficulty so how do we know which problems are difficult for machine learning systems to solve the first point i want to make here is this is
192
+
193
+ 49
194
+ 00:29:23,039 --> 00:30:08,960
195
+ like i think like a classically hard problem to really answer confidently and so i really like this comic for two reasons the first is because it gets at this core property of machine learning systems which is that it's not always intuitive which problems will be easy for a computer to solve and which ones will be hard for a computer to solve in 2010 doing gis lookup was super easy and detecting whether a photo was a bird was like a research team in five years level of difficulty so not super intuitive as someone maybe outside of the field the second reason i like this comic is because it also points to the sort of second challenge in assessing feasibility in ml which is that this field just moves so fast that if you're not keeping up with
+
+ 50
+ 00:30:07,279 --> 00:30:50,320
+ what's going on in the state of the art then your understanding of what's feasible will be stale very quickly building an application to detect whether a photo is of a bird is no longer a research team and five years problem it's like an api call and 15 minutes type problem so take everything i say here with a grain of salt because the feasibility of ml projects is notoriously difficult to predict another example here is in the late 90s the new york times when they were talking about sort of ai systems beating humans at chess predicted that it might be a hundred years before a computer beats humans at go or even longer and you know less than 20 years later machine learning systems from deepmind beat the best humans in
+
+ 51
+ 00:30:48,399 --> 00:31:22,000
+ the world at go these predictions are notoriously difficult to make but that being said i think it's still worth talking about and so one heuristic that you'll hear for what's feasible to do with machine learning is this heuristic from andrew ng which is that anything that a normal person can do in less than one second we can automate with ai i think this is actually not a great heuristic for what's feasible to do with ai but you'll hear it a lot so i wanted to talk about it anyway there's some examples of where this is true right so recognizing the content of images understanding speech potentially translating speech maybe grasping objects with a robot and things like that are things that you could point to
+
+ 52
+ 00:31:20,320 --> 00:31:57,600
+ as evidence for andrew's statement being correct but i think there's some really obvious counter examples as well machine learning systems are still no good at things that a lot of people are really good at like understanding human humor or sarcasm complex in-hand manipulation of objects generalizing to brand new scenarios that they've never seen before this is a heuristic that you'll see it's not one that i would recommend using seriously to assess whether your project is feasible or not there's a few things that we can say are definitely still hard in machine learning i kept a couple of things in these slides that we talked about being really difficult in machine learning when we started teaching the class in
+
+ 53
+ 00:31:55,120 --> 00:32:32,640
+ 2018 that i think i would no longer consider to be super difficult anymore unsupervised learning being one of them but reinforcement learning problems still tend to be not very feasible to solve for real world use cases although there are some use cases where with tons of data and compute reinforcement learning can be used to solve real world problems within the context of supervised learning there are also still problems that are hard so things like question answering a lot of progress over the last few years still these systems aren't perfect text summarization video prediction building 3d models another example of one that i think i used to say is really difficult but with nerf and all the sort
+
+ 54
+ 00:32:30,640 --> 00:33:07,679
+ of derivatives of that i think is more feasible than ever real world speech recognition so outside of the context of a clean data set in a noisy room can we recognize what people are saying resisting adversarial examples doing math although there's been a lot of progress on this problem as well over the last few months solving word problems or bongard problems this is an example by the way of a bongard problem it's a visual analogy type problem so this is kind of a laundry list of some things that are still difficult even in supervised learning and so can we reason about this what types of problems are still difficult to do so i think one type is where not the input to the model itself but the prediction that the model is
+
+ 55
+ 00:33:06,000 --> 00:33:47,760
+ making the output of the model where that is like a complex or high dimensional structure or where it's ambiguous right so for example 3d reconstruction the 3d model that you're outputting is very high dimensional and so that makes it difficult to do for ml video prediction not only high dimensional but also ambiguous just because you know what happened in the video for the last five seconds there's still maybe infinite possibilities for what the video might look like going forward so it's ambiguous and it's high dimensional which makes it very difficult to do with ml dialog systems again very ambiguous very open-ended very difficult to do with ml and uh open-ended recommender systems so a second category of problems that are
+
+ 56
+ 00:33:46,080 --> 00:34:26,720
+ still difficult to do with ml are problems where you really need the system to be reliable machine learning systems tend to fail in all kinds of unexpected and hard to reason about ways so anywhere where you need really high precision or robustness is gonna be more difficult to solve using machine learning so failing safely out of distribution for example is still a difficult problem in ml robustness to adversarial attacks is still a difficult problem in ml and even things that are easier to do with low precision like estimating the position and rotation of an object in 3d space can be very difficult to do if you have a high precision requirement the last category of problems i'll point to here is problems where you need the
+
+ 57
+ 00:34:24,560 --> 00:35:04,720
+ system to be able to generalize well to data that it's never seen before this can be data that's out of distribution it can be where your system needs to do something that looks like reasoning or planning or understanding of causality these problems tend to be more in the research domain today i would say one example is in the self-driving car world dealing with edge cases very difficult challenge in that field but also control problems in self-driving cars you know those stacks are incorporating more and more ml into them whereas the computer vision and perception part of self-driving cars adopted machine learning pretty early the control piece was using more traditional methods for much longer and then places where you have a small
+
+ 58
+ 00:35:02,800 --> 00:35:40,400
+ amount of data again like if you're considering machine learning broadly small data is often possible but especially in the context of deep learning small data still presents a lot of challenges summing this up like how should you try to assess whether your machine learning project is feasible or not first question you should ask is do we really need to solve this problem with ml at all i would recommend putting in the work up front to define what is the success criteria that we need and doing this with everyone that needs to sign off on the project in the end not just the ml team let's avoid being an ml team that works on problems in isolation and then has those projects killed because no one actually really needed to solve
+
+ 59
+ 00:35:38,000 --> 00:36:13,440
+ this problem or because the value of the solution is not worth the complexity that it adds to your product then you should consider the ethics of using ml to solve this problem and we'll talk more about this towards the end of the course in the ethics lecture then it's worth doing a literature review to make sure that there are examples of people working on similar problems trying to rapidly build a benchmark data set that's labeled so you can start to get some sense of whether your model's performing well or not then and only then building a minimum viable model so this is potentially even just manual rules or simple linear regression deploying this into production if it's feasible to do so or at least running
+
+ 60
+ 00:36:11,520 --> 00:36:50,640
+ this on your existing problem so you have a baseline and then lastly it's worth just restating making sure that once you've built this minimum viable model that may not even use ml just really asking yourself the question of whether this is good enough for now or whether it's worth putting in the additional effort to turn this into a complex ml system the next point i want to make here is that not all ml projects really have the same characteristics and so you shouldn't think about planning all ml projects in the same way i want to talk about some archetypes of different types of ml projects and the implications that they have for the feasibility of the projects and how you might run the projects
+
+ 61
+ 00:36:49,040 --> 00:37:28,560
+ effectively and so the three archetypes i want to talk to are defined by how they interact with real world users and so the first archetype is software 2.0 use cases and so i would define this as taking something that software does today so an existing part of your product that you have let's say and doing it better more accurately or more efficiently with ml it's taking a part of your product that's already automated or already partially automated and adding more automation or more efficient automation using machine learning then the next archetype is human in the loop systems and so this is where you take something that is not currently automated in your system but it's something that humans are doing or
+
+ 62
+ 00:37:26,720 --> 00:38:07,839
+ humans could be doing and helping them do that job better more efficiently or more accurately by supplementing their judgment with ml based tools preventing them from needing to do the job on every single data point by giving them suggestions of what they can do so they can shortcut their process in a lot of places human in the loop systems are about making the humans that are ultimately making the decisions more efficient or more effective and then lastly autonomous systems and so these are systems where you take something that humans do today or maybe is just not being done at all today and fully automate it with ml to the point where you actually don't need humans to do the judgment piece of it at all and so some
+
+ 63
+ 00:38:05,440 --> 00:38:47,200
+ examples of software 2.0 are if you have an ide that has code completion can we do better code completion by using ml can we take a recommendation system that is initially using some simple rules and make it more customized can we take our video game ai that's using this rule-based system and make it much better by using machine learning some examples of human in the loop systems would be building a product to turn hand-drawn sketches into slides you still have a human on the other end that's evaluating the quality of those sketches before they go in front of a customer or stakeholder so it's a human in the loop system but it's potentially saving a lot of time for that human email auto completion so if you use
+
+ 64
+ 00:38:45,359 --> 00:39:23,119
+ gmail you've seen these email suggestions where it'll suggest sort of short responses to the email that you got i get to decide whether that email actually goes out to the world so it's not an automation system it's a human in the loop system or helping a radiologist do their job faster and then examples of autonomous systems are things like full self-driving right maybe there's not even a steering wheel in the car i can't interrupt the autonomous system and take over control of the car even if i wanted to or maybe it's not designed for me to do that very often fully automated customer support so if i go on a company's website and i interact with their customer support without even having the option of talking to an agent
+
+ 65
+ 00:39:21,280 --> 00:39:56,720
+ or with them making it very difficult to talk to an agent that's an autonomous system or for example like fully automating website design to the point where people who are not design experts can just click a button and get a website designed for them and so i think some of the key questions that you need to ask before embarking on these projects are a little bit different depending on which archetype your project falls into so if you're working on a software 2.0 project then i think some of the questions you should be concerned about are how do you know that your models are actually improving performance over the baseline that you already have how confident are you that the type of performance improvement that
+
+ 66
+ 00:39:54,960 --> 00:40:32,560
+ you might be able to get from ml is actually going to generate value for your business if it's just one percent better is that really worth the cost then do these performance improvements lead to what's called a data flywheel which i'll talk a little bit more about with human in the loop systems you might ask a different set of questions before you embark on the project like how good does the system actually need to be to be useful if the system you know is able to automate 10 percent of the work of the human that is ultimately making the decisions or producing the end product is that useful to them or does that just slow them down how can you collect enough data to make it that good is it possible to actually build a data set
+
+ 67
+ 00:40:30,720 --> 00:41:08,400
+ that is able to get you to that useful threshold for your system and for autonomous systems the types of questions you might ask are what is an acceptable failure rate for this system how many nines in your performance threshold do you need in order for this not to cause harm in the world and how can you guarantee like how can you be really confident that it won't exceed that failure rate and so this is something that in autonomous vehicles for example teams put a ton of effort into building the simulation and testing systems that they need to be confident that they won't exceed the very very low failure rate that's acceptable for those systems i want to double click on this data
+
+ 68
+ 00:41:06,160 --> 00:41:49,040
+ flywheel concept for software 2.0 we talked about can we build a data flywheel that leads to better and better performance of the system and the way to think about a data flywheel is it's this virtuous cycle where as your model gets better you are able to use that better model to make a better product which allows you to acquire more users and as you have more users those users generate more data which you can use to build a better model and this creates this virtuous cycle and so the connections between each of these steps are also important in order for more users to allow you to collect more data you need to have a data loop where you need to have a way of automatically collecting data and deciding what data points to
+
+ 69
+ 00:41:46,960 --> 00:42:23,839
+ label from your users or at least processes for doing these in order for more data to lead to a better model that's kind of on you as an ml practitioner right like you need to be able to translate more data more granular data more labels into a model that performs better for your users and then in order for the better model to lead to more users you need to be sure that better predictions are actually making your product better another point that i want to make on these project archetypes is i would sort of characterize them as having different trade-offs on this feasibility versus impact two by two that we talked about earlier software 2.0 projects since they're just taking something that you
+
+ 70
+ 00:42:22,480 --> 00:43:00,000
+ already know you can automate and automating it better tend to be more feasible but since you already have an answer to the question that they're also answering they also tend to be lower impact on the other extreme autonomous systems tend to be very difficult to build because the accuracy requirements in general are quite high but the impact can be quite high as well because you're replacing something that literally doesn't exist and human in the loop systems tend to be somewhere in between where you can really like you can use this paradigm of machine learning products to build things that couldn't exist before but the impact is not quite as high because you still need people in the loop that are helping use their judgment
+
+ 71
+ 00:42:57,599 --> 00:43:40,400
+ to complement the machine learning model there's ways that you can move these types of projects on the feasibility impact matrix to make them more likely to succeed so if you're working on a software 2.0 project you can make these projects have potentially higher impacts by implementing a data loop that allows you to build continual improvement data flywheel that we talked about before and potentially allows you to use the data that you're collecting from users interacting with this system to automate more tasks in the future so for example in the code completion ide example that we gave before you can you know if you're building something like github copilot then think about all the things that the data that you're collecting
+
+ 72
+ 00:43:38,560 --> 00:44:20,240
+ from that could be useful for building in the future you can make human in the loop systems more feasible through good product design and we'll talk a little bit more about this in a future lecture but there's design paradigms in the product itself that can reduce the accuracy requirement for these types of systems and another way to make these projects more feasible is by adopting sort of a different mindset which is let's just make the system good enough and ship it into the real world so we can start the process of you know seeing how real users interact with it and using the feedback that we get from our humans in the loop to make the model better and then lastly autonomous systems can be made more feasible by adding guard rails
+
+ 73
+ 00:44:18,240 --> 00:44:59,119
+ or in some cases adding humans in the loop and so this is you can think of this as the approach to autonomous vehicles where you have safety drivers in the loop early on in the project or where you introduce teleoperation so that a human can take control of the system if it looks like something is going wrong i think another point that is really important here is despite all this talk about what's feasible to do with ml the complexity that ml introduces in your system i don't mean by any of this to say that you should do necessarily a huge amount of planning before you dive into using ml at all just make sure that the project that you're working on is the right project and then just dive in and get started and in particular i think a
+
+ 74
+ 00:44:57,359 --> 00:45:34,960
+ failure mode that i'm seeing crop up more and more over the past couple of years that you should avoid is falling into the trap of tool fetishization so one of the great things that's happened in ml over the past couple of years is the rise of this mlops discipline and alongside of that has been a proliferation of different tools that are available on the market to help with different parts of the ml process and one thing that i've noticed that this has caused for a lot of folks is this sort of general feeling that you really need to have perfect tools before you get started you don't need perfect tools to get started and you also don't need a perfect model and in particular just because google or uber is doing
+
+ 75
+ 00:45:33,599 --> 00:46:12,400
+ something like just because they have you know a feature store as part of their stack or they serve models in a particular way doesn't mean that you need to have that as well and so a lot of what we'll try to do in this class is talk about what's the middle ground between doing things in the right way from a production perspective but not introducing too much complexity early on into your project so that's one of the reasons why fsdl is a class about building ml powered products in a practical way and not an mlops class that's focused on what is the state of the art in the best possible infrastructure that you can use and um a talk and blog posts and associated set of things on this concept that i really
+
+ 76
+ 00:46:09,520 --> 00:46:54,960
+ like is this mlops at reasonable scale push by some of the folks from coveo and the sort of central thesis of mlops at reasonable scale is you're not google you probably have a finite compute budget not an entire cloud you probably have a limited number of folks on your team you probably have not an infinite budget to spend on this and you probably have a limited amount of data as well and so those differences between what you have and what uber has or what google has have implications for what the right stack is for the problems that you're solving and so it's worth thinking about these cases separately and so if you're interested in what one company did and recommends for an ml stack that isn't designed to
+
+ 77
+ 00:46:52,000 --> 00:47:31,200
+ scale to becoming uber scale then i recommend checking out this talk to summarize what we've covered so far machine learning is an incredibly powerful technology but it does add a lot of complexity and so before you embark on a machine learning project you should make sure that you're thinking carefully about whether you really need ml to solve the problem that you're solving and whether the problem is actually worth solving at all given the complexity that this adds and so let's avoid being ml teams that have their projects get killed because we're working on things that don't really matter to the business that we're a part of all right and the last topic i want to cover today is once you've sort of made this decision to embark on an ml
+
+ 78
+ 00:47:29,520 --> 00:48:07,599
+ project what are the different steps that you're going to go through in order to actually execute on that project and this will also give you an outline for some of the other things you can expect from the class so the running case study that we'll use here is a modified version of a problem that i worked on when i was at openai which is pose estimation our goal is to build a system that runs on a robot that takes the camera feed from that robot and uses it to estimate the position in 3d space and the orientation the rotation of each of the objects in the scene so that we can use those for downstream tasks and in particular so we can use them to feed into a separate model which will be used to tell the robot how it
+
+ 79
+ 00:48:06,000 --> 00:48:40,000
+ actually can grasp the different objects in the scene machine learning projects start like any other project in a planning and project setup phase and so the types of activities we'd be doing in this phase when we're working on this pose estimation project are things like deciding to work on pose estimation at all determining how much this is going to cost what resources we need to allocate to it considering the ethical implications and things like this right a lot of what we've been talking about so far in this lecture once we plan the project then we'll move into a data collection and labeling phase and so for pose estimation what this might look like is collecting the corpus of objects that
+
+ 80
+ 00:48:38,640 --> 00:49:18,559
+ we're going to train our model on setting up our sensors like our cameras to capture our information about those objects actually capturing those objects and somehow figuring out how to annotate these images that we're capturing with ground truth like the pose of the objects in those images one point i want to make about the life cycle of ml projects is that this is not like a straightforward path machine learning projects tend to be very iterative and each of these phases can feed back into any of the phases before as you learn more about the problem that you're working on so for example you might realize that actually it's way too hard for us to get data in order to solve this problem or it's really difficult for us to label
+
+ 81
+ 00:49:16,079 --> 00:49:54,559
+ the pose of these objects in 3d space but what we can do is it's actually much cheaper for us to annotate like per pixel segmentation so can we reformulate the problem in a way that allows us to use what we've learned about data collection and labeling to plan a better project once you have some data to work on then you enter the sort of training and debugging phase and so what we might do here is we might implement a baseline for our model not using like a complex neural network but just using some opencv functions and then once we have that working we might find a state-of-the-art model and reproduce it debug our implementation and iterate on our model run some hyperparameter sweeps until it performs well
+
+ 82
+ 00:49:52,720 --> 00:50:29,599
+ on our task this can feed back into the data collection and labeling phase because we might realize that you know we actually need more data in order to solve this problem or we might also realize that there's something flawed in the process that we've been using to label the data that we're using data labeling process might need to be revisited but we can also loop all the way back to the project planning phase because we might realize that actually this task is a lot harder than we thought or the requirements that we specified at the planning phase trade off with each other so we need to revisit which are most important so for example like maybe we thought that we had an accuracy requirement of estimating the pose of these objects to
+
+ 83
+ 00:50:26,960 --> 00:51:06,720
+ one tenth of one centimeter and we also had a latency requirement for inference in our models of 1/100th of a second to run on robotic hardware and we might realize that hey you know we can get this really really tight accuracy requirement or we can have really fast inference but it's very difficult to do both so is it possible to relax one of those assumptions once you've trained a model that works pretty well offline for your task then your goal is going to be to deploy that model test it in the real world and then use that information to figure out where to go next for the purpose of this project that might look like piloting the grasping system in the lab so before we roll it out to actual users can we
+
+ 84
+ 00:51:04,880 --> 00:51:42,319
+ test it in a realistic scenario and we might also do things like writing tests to prevent regressions and evaluate for bias in the model and then eventually rolling this out into production and monitoring it and continually improving it from there and so we can feed back here into the training and debugging stage because oftentimes what we'll find is that the model that worked really well for our offline data set once it gets into the real world it doesn't actually work as well as we thought whether that's because the accuracy requirement that we had for the model was wrong like we actually needed it to be more accurate than we thought or maybe the metric that we're looking at the accuracy is not actually the metric
+
+ 85
+ 00:51:39,760 --> 00:52:17,920
+ that really matters for success at the downstream task that we're trying to solve so that could cause us to revisit the training phase we also could loop back to the data collection and labeling phase because a common problem that we might find in the real world is that there's some mismatch between the training data that we collected and the data that we actually saw when we went out and tested this we could use what we learned from that to go collect more data or mine for hard cases like mine for the failure cases that we found in production and then finally as i alluded to before we could loop all the way back to the project planning phase because we realized that the metric that we picked doesn't really drive the downstream
+
+ 86
+ 00:52:15,920 --> 00:52:51,599
+ behavior that we desired just because the grasp model is accurate doesn't mean that the robot will actually be able to successfully grasp the object so we might need to use a different metric to really solve this task or we might realize that the performance in the real world isn't that great and so we maybe need to add additional requirements to our model as well maybe it just needs to be faster in order to run on a real robot so these are kind of like what i think of as the activities that you do in any particular machine learning project that you undertake but there's also some sort of cross project things that you need in order to be successful which we'll talk about in the class as well you need to be able to work on
+
+ 87
+ 00:52:49,920 --> 00:53:25,040
+ these problems together as a team and you need to have the right infrastructure and tooling to make these processes more repeatable and these are topics that we'll cover as well so this is like a broad conceptual outline of the different topics that we'll talk about in this class and so to wrap up for today what we covered is machine learning is a complex technology and so you should use it because you need it or because you think it'll generate a lot of value but it's not a cure-all it doesn't solve every problem it won't automate every single thing that you wanted to automate so let's pick projects that are going to be valuable but in spite of this you don't need a perfect setup to get started and let's
+
+ 88
+ 00:53:23,440 --> 00:53:34,760
+ spend the rest of this course walking through the project lifecycle and learning about each of these stages and how we can use them to build great ml powered products
+
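A rough numeric illustration of the rule of thumb from the accuracy discussion in this lecture, where each additional nine of required accuracy might mean something like a 10x increase in project cost. This is only a sketch under stated assumptions: the function names `count_nines` and `relative_cost` are hypothetical, the 10x factor is the lecture's very rough estimate rather than a measured constant, and extending the rule smoothly to fractional nines is an assumption made here for convenience.

```python
import math


def count_nines(accuracy: float) -> float:
    """Return the number of nines in an accuracy given as a fraction.

    For example, 0.999 corresponds to roughly 3 nines.
    """
    if not 0.0 <= accuracy < 1.0:
        raise ValueError("accuracy must be in the range [0, 1)")
    return -math.log10(1.0 - accuracy)


def relative_cost(current: float, required: float,
                  factor_per_nine: float = 10.0) -> float:
    """Estimate the cost multiplier for moving from `current` to `required` accuracy.

    Assumes (as a rough rule of thumb, not a law) that every extra nine
    multiplies project cost by `factor_per_nine`.
    """
    extra_nines = count_nines(required) - count_nines(current)
    return factor_per_nine ** extra_nines


if __name__ == "__main__":
    baseline = 0.99
    for target in (0.99, 0.999, 0.9999):
        multiplier = relative_cost(baseline, target)
        print(f"{baseline:.2%} -> {target:.2%}: roughly {multiplier:,.0f}x the cost")
```

The value of a sketch like this is mostly in making the superlinear scaling visible when negotiating requirements: relaxing a target by one nine may cut the estimated cost by an order of magnitude, which is worth knowing before the planning phase closes.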
documents/lecture-02.md ADDED
@@ -0,0 +1,563 @@
+ ---
+ description: Software engineering, Deep learning frameworks, Distributed training, GPUs, and Experiment Management.
+ ---
+
+ # Lecture 2: Development Infrastructure & Tooling
+
+ <div align="center">
+ <iframe width="720" height="405" src="https://www.youtube-nocookie.com/embed/BPYOsDCZbno?list=PL1T8fO7ArWleMMI8KPJ_5D5XSlovTW_Ur" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
+ </div>
+
+ Lecture by [Sergey Karayev](https://twitter.com/sergeykarayev).
+ Notes by [James Le](https://twitter.com/le_james94) and [Vishnu Rachakonda](https://www.linkedin.com/in/vrachakonda/).<br />
+ Published August 15, 2022.
+ [Download slides](https://drive.google.com/open?id=16pEG5GesO4_UAWiD5jrIReMGzoyn165M).
+
16
+ ## 1 - Introduction
17
+
18
+ The **dream** of ML development is that given a project spec and some
19
+ sample data, you get a continually improving prediction system deployed
20
+ at scale.
21
+
22
+ The **reality** is starkly different:
23
+
24
+ - You have to collect, aggregate, process, clean, label, and version
25
+ the data.
26
+
27
+ - You have to find a model architecture and its pre-trained
28
+ weights and then write and debug the model code.
29
+
30
+ - You run training experiments and review the results, which will be
31
+ fed back into the process of trying out new architectures and
32
+ debugging more code.
33
+
34
+ - You can now deploy the model.
35
+
36
+ - After model deployment, you have to monitor model predictions and
37
+ close the data flywheel loop. Basically, your users generate fresh
38
+ data for you, which needs to be added to the training set.
39
+
40
+ ![](./media/image3.png)
41
+
42
+
43
+ This reality has roughly three components: data, development, and
44
+ deployment. The tooling infrastructure landscape for them is large, so
45
+ we'll have three lectures to cover it all. **This lecture focuses on the
46
+ development component**.
47
+
48
+ ## 2 - Software Engineering
49
+
50
+ ![](./media/image7.png)
51
+
52
+
53
+ ### Language
54
+
55
+ For your choice of **programming language**, Python is the clear winner
56
+ in scientific and data computing because of all the libraries that have
57
+ been developed. There have been some contenders like Julia and C/C++,
58
+ but Python has really won out.
59
+
60
+ ### Editors
61
+
62
+ To write Python code, you need an **editor**. You have many options,
63
+ such as Vim, Emacs, Jupyter Notebook/Lab, VS Code, PyCharm, etc.
64
+
65
+ - We recommend [VS Code](https://code.visualstudio.com/)
66
+ because of its nice features, such as built-in Git version control,
67
+ documentation peeking, opening remote projects, and linters and type
68
+ hints to catch bugs.
69
+
70
+ - Many practitioners develop in [Jupyter
71
+ Notebooks](https://jupyter.org/), which are great as
72
+ the "first draft" of a data science project. You need to put in
73
+ very little thought before you start coding and seeing immediate
74
+ output. However, notebooks have a variety of problems: a primitive
75
+ editor, out-of-order execution artifacts, and difficulty with
76
+ versioning and testing. A counterpoint to these problems is the
77
+ [nbdev package](https://nbdev.fast.ai/) that lets
78
+ you write and test code all in one notebook environment.
79
+
80
+ - We recommend you use **VS Code with built-in support for
81
+ notebooks** - where you can write code in modules imported into
82
+ notebooks. It also enables awesome debugging.
83
+
84
+ If you want to build something more interactive,
85
+ [Streamlit](https://streamlit.io/) is an excellent choice.
86
+ It lets you decorate Python code, get interactive applets, and publish
87
+ them on the web to share with the world.
88
+
89
+ ![](./media/image10.png)
90
+
91
+
92
+ For setting up the Python environment, we recommend you see [how we did
93
+ it in the
94
+ lab.](https://github.com/full-stack-deep-learning/conda-piptools)
95
+
96
+ ## 3 - Deep Learning Frameworks
97
+
98
+ ![](./media/image15.png)
99
+
100
+
101
+ Deep learning does not take a lot of code with a matrix math library like
102
+ NumPy. But when you have to run your code on CUDA for GPU-powered
103
+ deep learning, you want to consider deep learning frameworks, as you
104
+ might be writing unusual layer types, optimizers, data interfaces, etc.
105
+
106
+ ### Frameworks
107
+
108
+ There are various frameworks, such as PyTorch, TensorFlow, and Jax. They
109
+ are all similar in that you first define your model by running Python
110
+ code and then collect an optimized execution graph for different
111
+ deployment patterns (CPU, GPU, TPU, mobile).
112
+
113
+ 1. We prefer PyTorch because [it is absolutely
114
+ dominant](https://www.assemblyai.com/blog/pytorch-vs-tensorflow-in-2022/)
115
+ by measures such as the number of models, the number of papers,
116
+ and the number of competition winners. For instance, about [77%
117
+ of 2021 ML competition winners used
118
+ PyTorch](https://blog.mlcontests.com/p/winning-at-competitive-ml-in-2022?s=w).
119
+
120
+ 2. With TensorFlow, you have TensorFlow.js (that lets you run deep
121
+ learning models in your browser) and Keras (an unmatched developer
122
+ experience for easy model development).
123
+
124
+ 3. Jax is a meta-framework for deep learning.
125
+
126
+ ![](./media/image12.png)
127
+
128
+
129
+ [PyTorch](https://pytorch.org/) has excellent developer
130
+ experience and is production-ready and even faster with TorchScript.
131
+ There is a great distributed training ecosystem. There are libraries for
132
+ vision, audio, etc. There are also mobile deployment targets.
133
+
134
+ [PyTorch Lightning](https://www.pytorchlightning.ai/)
135
+ provides a nice structure for organizing your training code, optimizer
136
+ code, evaluation code, data loaders, etc. With that structure, you can
137
+ run your code on any hardware. There are nice features such as a
138
+ performance and bottleneck profiler, model checkpointing, 16-bit
139
+ precision, and distributed training libraries.
140
+
141
+ Another possibility is [FastAI
142
+ software](https://www.fast.ai/), which is developed
143
+ alongside the fast.ai course. It provides many advanced tricks such as
144
+ data augmentations, better initializations, learning rate schedulers,
145
+ etc. It has a modular structure with low-level API, mid-level API,
146
+ high-level API, and specific applications. The main problem with FastAI
147
+ is that its code style is quite different from mainstream Python.
148
+
149
+ At FSDL, we prefer PyTorch because of its strong ecosystem, but
150
+ [TensorFlow](https://www.tensorflow.org/) is still
151
+ perfectly good. If you have a specific reason to prefer it, you are
152
+ still going to have a good time.
153
+
154
+ [Jax](https://github.com/google/jax) is a more recent
155
+ project from Google that is not specific to deep learning. It provides
156
+ general vectorization, auto-differentiation, and compilation to GPU/TPU
157
+ code. For deep learning, there are separate frameworks like
158
+ [Flax](https://github.com/google/flax) and
159
+ [Haiku](https://github.com/deepmind/dm-haiku). You should
160
+ only use Jax for a specific need.
161
+
162
+ ### Meta-Frameworks and Model Zoos
163
+
164
+ Most of the time, you will start with at least a model architecture that
165
+ someone has developed or published. You will use a specific architecture
166
+ (trained on specific data with pre-trained weights) on a model hub.
167
+
168
+ - [ONNX](https://onnx.ai/) is an open standard for
169
+ saving deep learning models and lets you convert from one type of
170
+ format to another. It can work well but can also run into some
171
+ edge cases.
172
+
173
+ - [HuggingFace](https://huggingface.co/) has become an
174
+ absolutely stellar repository of models. It started with NLP tasks
175
+ but has since expanded into all kinds of tasks (audio
176
+ classification, image classification, object detection, etc.).
177
+ There are 60,000 pre-trained models for all these tasks. There is
178
+ a Transformers library that works with PyTorch, TensorFlow, and
179
+ Jax. There are 7,500 datasets uploaded by people. There's also a
180
+ community aspect to it with a Q&A forum.
181
+
182
+ - [TIMM](https://github.com/rwightman/pytorch-image-models)
183
+ is a collection of state-of-the-art computer vision models and
184
+ related code that looks cool.
185
+
186
+ ## 4 - Distributed Training
187
+
188
+ ![](./media/image9.png)
189
+
190
+
191
+ Let's say we have multiple machines represented by little squares above
192
+ (with multiple GPUs in each machine). You are sending batches of data to
193
+ be processed by a model with parameters. The data batch can fit on a
194
+ single GPU or not. The model parameters can fit on a single GPU or not.
195
+
196
+ The best case is that both your data batch and model parameters fit on a
197
+ single GPU. That's called **trivial parallelism**. You can either launch
198
+ more independent experiments on other GPUs/machines or increase the
199
+ batch size until it no longer fits on one GPU.
200
+
201
+ ### Data Parallelism
202
+
203
+ If your model still fits on a single GPU, but your data no longer does,
204
+ you have to try out **data parallelism** - which lets you distribute a
205
+ single batch of data across GPUs and average gradients that are computed
206
+ by the model across GPUs. A lot of model development work is cross-GPU,
207
+ so you want to ensure that GPUs have fast interconnects.
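The gradient-averaging step described above can be illustrated with a toy simulation. This is a minimal sketch in pure Python, not how PyTorch implements it: the one-parameter model, the function names (`mse_gradient`, `data_parallel_gradient`), and the sample data are all assumptions made for illustration, and the "workers" are simulated on one machine. It shows that averaging per-shard gradients reproduces the full-batch gradient when the shards are equally sized.

```python
# Toy simulation of data parallelism in pure Python (no GPUs involved).
# Hypothetical model: y_hat = w * x, loss = mean squared error over the batch.

def mse_gradient(w, batch):
    """Gradient of mean((w*x - y)^2) with respect to w over one batch."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def data_parallel_gradient(w, batch, num_workers):
    """Split the batch into equal shards, compute one gradient per
    simulated worker, then average them (the all-reduce step)."""
    assert len(batch) % num_workers == 0, "sketch assumes equal shard sizes"
    shard_size = len(batch) // num_workers
    shards = [batch[i * shard_size:(i + 1) * shard_size]
              for i in range(num_workers)]
    local_grads = [mse_gradient(w, shard) for shard in shards]
    return sum(local_grads) / num_workers


# Made-up (x, y) pairs, purely for illustration.
batch = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]
w = 0.5

full = mse_gradient(w, batch)
averaged = data_parallel_gradient(w, batch, num_workers=2)
print(full, averaged)
```

In a real setup, each shard's gradient is computed on a different GPU and the averaging is a communication step between devices, which is why interconnect speed matters.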
208
+
209
+ If you are using a server card, expect [a linear
210
+ speedup](https://lambdalabs.com/blog/best-gpu-2022-sofar/)
211
+ in training time. If you are using a consumer card, expect [a sublinear
212
+ speedup](https://lambdalabs.com/blog/titan-v-deep-learning-benchmarks/)
213
+ instead.
214
+
215
+ Data parallelism is implemented in PyTorch with the robust
216
+ [DistributedDataParallel
217
+ library](https://pytorch.org/tutorials/intermediate/ddp_tutorial.html).
218
+ [Horovod](https://github.com/horovod/horovod) is another
219
+ 3rd-party library option. PyTorch Lightning makes it dead simple to use
220
+ either of these two libraries - where [speedup seems to be the
221
+ same](https://www.reddit.com/r/MachineLearning/comments/hmgr9g/d_pytorch_distributeddataparallel_and_horovod/).
222
+
223
+ A more advanced scenario is that you can't even fit your model on a
224
+ single GPU. You have to spread the model over multiple GPUs. There are
225
+ three solutions to this.
226
+
227
+ ### Sharded Data-Parallelism
228
+
229
+ Sharded data parallelism starts with the question: What exactly takes up
230
+ GPU memory?
231
+
232
+ - The **model parameters** include the floats that make up our model
233
+ layers.
234
+
235
+ - The **gradients** are needed to do back-propagation.
236
+
237
+ - The **optimizer states** include statistics about the gradients.
238
+
239
+ - Finally, you have to send a **batch of data** for model development.
240
+
241
+ ![](./media/image5.png)
242
+
243
+ Sharding is a concept from databases where if you have one source of
244
+ data, you actually break it into shards of data that live across your
245
+ distributed system. Microsoft implemented an approach called
246
+ [ZeRO](https://arxiv.org/pdf/1910.02054.pdf) that shards
247
+ the optimizer states, the gradients, and the model parameters. **This
248
+ results in an order-of-magnitude reduction in memory use, which
249
+ means your batch size can be 10x bigger.** You should [watch the video
250
+ in this
251
+ article](https://www.microsoft.com/en-us/research/blog/zero-deepspeed-new-system-optimizations-enable-training-models-with-over-100-billion-parameters/)
252
+ to see how model parameters are passed around GPUs as computation
253
+ proceeds.
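To get a feel for the memory savings, here is a rough back-of-the-envelope sketch. It assumes mixed-precision training with Adam, where each parameter costs about 2 bytes for fp16 weights, 2 bytes for fp16 gradients, and 12 bytes of optimizer state (an fp32 copy of the weights plus momentum and variance). The function name `zero_memory_gb` and the example sizes are illustrative assumptions; activations and the data batch are deliberately left out.

```python
def zero_memory_gb(num_params, num_gpus, stage, optimizer_bytes=12):
    """Approximate per-GPU memory for model states, in GB (1 GB = 1e9 bytes).

    stage 0: nothing sharded
    stage 1: optimizer states sharded
    stage 2: optimizer states and gradients sharded
    stage 3: optimizer states, gradients, and parameters sharded
    """
    params = 2.0 * num_params             # fp16 weights
    grads = 2.0 * num_params              # fp16 gradients
    opt = optimizer_bytes * num_params    # fp32 weights + Adam momentum + variance
    if stage >= 1:
        opt /= num_gpus
    if stage >= 2:
        grads /= num_gpus
    if stage >= 3:
        params /= num_gpus
    return (params + grads + opt) / 1e9


# Hypothetical example: a 7.5B-parameter model on 64 GPUs.
for stage in range(4):
    print(f"stage {stage}: {zero_memory_gb(7.5e9, 64, stage):.1f} GB per GPU")
```

Under these assumptions, most of the baseline memory goes to optimizer states rather than to the weights themselves, which is why sharding them first gives the largest single saving.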
254
+
255
+ Sharded data-parallelism is implemented by Microsoft's
256
+ [DeepSpeed](https://github.com/microsoft/DeepSpeed)
257
+ library and Facebook's
258
+ [FairScale](https://github.com/facebookresearch/fairscale)
259
+ library, as well as natively by PyTorch. In PyTorch, it's called
260
+ [Fully-Sharded
261
+ DataParallel](https://pytorch.org/blog/introducing-pytorch-fully-sharded-data-parallel-api/).
262
+ With PyTorch Lightning, you can try it for a massive memory reduction
263
+ without changing the model code.
264
+
265
+ This same ZeRO principle can also be applied to a single GPU. You can
266
+ train a 13B-parameter model on a single V100 (32GB) GPU. FairScale
267
+ implements this (called
268
+ [CPU-offloading](https://fairscale.readthedocs.io/en/stable/deep_dive/offload.html)).
269
+
270
+ ### Pipelined Model-Parallelism
271
+
272
+ **Model parallelism means that you can put different layers of your model on
273
+ different GPUs**. It is trivial to implement natively but results in only one
274
+ GPU being active at a time. Libraries like DeepSpeed and FairScale make
275
+ it better by pipelining computation so that the GPUs are more fully utilized.
276
+ You need to tune the amount of pipelining of the batch to match exactly
277
+ how you split up the model across the GPUs.
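Why pipelining helps can be seen with a little arithmetic. The sketch below assumes a simple GPipe-style schedule in which the batch is cut into micro-batches and every stage takes the same time per micro-batch; the function name `pipeline_utilization` is an illustrative assumption, not an API from DeepSpeed or FairScale.

```python
def pipeline_utilization(num_stages, num_microbatches):
    """Fraction of time each GPU is busy in a simple pipelined schedule.

    With equal-cost stages, one pass takes (num_microbatches + num_stages - 1)
    time slots, and each stage is busy for num_microbatches of them.
    """
    total_slots = num_microbatches + num_stages - 1
    return num_microbatches / total_slots


# Hypothetical example: a model split across 4 GPUs.
for m in (1, 4, 16, 64):
    print(f"{m:>2} micro-batches: {pipeline_utilization(4, m):.3f} utilization")
```

With a single micro-batch, this reduces to the unpipelined case where only one of the four GPUs is active at a time. Splitting the batch more finely shrinks the idle "bubble", which is the tuning trade-off mentioned above.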
278
+
279
+ ### Tensor-Parallelism
280
+
281
+ Tensor parallelism is another approach, which observes that there is
282
+ nothing special about matrix multiplication that requires the whole
283
+ matrix to be on one GPU. **You can distribute the matrix over multiple
284
+ GPUs**. NVIDIA published [the Megatron-LM
285
+ repo](https://github.com/NVIDIA/Megatron-LM), which does
286
+ this for the Transformer model.
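The core idea can be sketched without any GPU at all. The example below simulates splitting a weight matrix by columns across "devices", computing each slice of the output separately, and concatenating the slices. It is only an illustration under simplifying assumptions: the function names are hypothetical, plain Python lists stand in for tensors, and only a column split is shown.

```python
def matmul(a, b):
    """Plain matrix product of two matrices stored as lists of rows."""
    return [[sum(a_ik * b_kj for a_ik, b_kj in zip(row, col))
             for col in zip(*b)]
            for row in a]


def split_columns(matrix, num_parts):
    """Split a matrix into num_parts blocks of contiguous columns."""
    n_cols = len(matrix[0])
    assert n_cols % num_parts == 0, "sketch assumes columns divide evenly"
    width = n_cols // num_parts
    return [[row[p * width:(p + 1) * width] for row in matrix]
            for p in range(num_parts)]


def tensor_parallel_matmul(x, w, num_devices):
    """Each simulated device holds one column block of w and computes its
    slice of the output; the slices are then concatenated row by row."""
    partial_outputs = [matmul(x, w_block)
                       for w_block in split_columns(w, num_devices)]
    return [sum((part[i] for part in partial_outputs), [])
            for i in range(len(x))]


# Made-up inputs, purely for illustration.
x = [[1.0, 2.0], [3.0, 4.0]]
w = [[1.0, 0.0, 2.0, 1.0],
     [0.0, 1.0, 1.0, 3.0]]

print(tensor_parallel_matmul(x, w, num_devices=2))
print(matmul(x, w))
```

No single "device" ever holds the full weight matrix, yet the concatenated result matches the ordinary product.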
287
+
288
+ You can actually use all three of the techniques mentioned above if you
289
+ really want to scale a huge model (like a GPT-3 sized language model).
290
+ Read [this article on the technology behind BLOOM
291
+ training](https://huggingface.co/blog/bloom-megatron-deepspeed)
292
+ for a taste.
293
+
294
+ ![](./media/image6.png)
295
+
296
+
297
+ In conclusion:
298
+
299
+ - If your model and data fit on one GPU, that's awesome.
300
+
301
+ - If they do not, and you want to speed up training, try
302
+ DistributedDataParallel.
303
+
304
+ - If the model still doesn't fit, try ZeRO-3 or Fully-Sharded Data
305
+ Parallel.
306
+
307
+ For more resources to speed up model training, look at [this list
308
+ compiled by DeepSpeed](https://www.deepspeed.ai/training/),
309
+ [MosaicML](https://www.mosaicml.com), and
310
+ [FFCV](https://ffcv.io).
311
+
312
+ ## 5 - Compute
313
+
314
+ ![](./media/image14.png)
315
+
316
+
317
+ **Compute** is the next essential ingredient to developing machine
318
+ learning models and products.
319
+
320
+ The compute-intensiveness of models has grown tremendously over the last
321
+ ten years, as the below charts from
322
+ [OpenAI](https://openai.com/blog/ai-and-compute/) and
323
+ [HuggingFace](https://huggingface.co/blog/large-language-models)
324
+ show.
325
+
326
+ ![](./media/image1.png)
327
+
328
+
329
+ Recent developments, including models like
330
+ [GPT-3](https://openai.com/blog/gpt-3-apps/), have
331
+ accelerated this trend. These models are extremely large and require a
332
+ large amount of compute (measured in petaflop/s-days) to train.
333
+
334
+ ### GPUs
335
+
336
+ **To effectively train deep learning models**, **GPUs are required.**
337
+ NVIDIA has been the superior choice among GPU vendors, though Google has
338
+ introduced TPUs (Tensor Processing Units) that are effective but are
339
+ only available via Google Cloud. There are three primary considerations
340
+ when choosing GPUs:
341
+
342
+ 1. How much data fits on the GPU?
343
+
344
+ 2. How fast can the GPU crunch through data? To evaluate this, is your
345
+ data 16-bit or 32-bit? The latter is more resource intensive.
346
+
347
+ 3. How fast can you communicate between the CPU and the GPU and between
348
+ GPUs?
349
+
350
+ Looking at recent NVIDIA GPUs, it becomes clear that a new
351
+ high-performing architecture is introduced every few years. These chips
352
+ also differ in licensing: some are licensed for personal use and others
353
+ for corporate use; businesses should only use **server**
354
+ **cards**.
355
+
356
+ ![](./media/image8.png)
357
+
358
+
359
+ Two key factors in evaluating GPUs are **RAM** and **Tensor TFlops**.
360
+ The more RAM, the larger the models and data batches the GPU can hold.
361
+ Tensor TFlops measure the throughput of the tensor cores, special units that NVIDIA
362
+ includes specifically for deep learning operations and that accelerate
363
+ mixed-precision operations. **A tip**: leveraging 16-bit training can
364
+ effectively double your RAM capacity!
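The arithmetic behind this tip is simple: a 16-bit value takes half the bytes of a 32-bit one. The sketch below is a simplification with assumed names (`BYTES_PER_VALUE`, `max_values_in_memory`) and an assumed 32 GB card. In practice, mixed-precision training often keeps some 32-bit copies, such as optimizer states, so the real gain can be smaller than 2x.

```python
# Bytes needed to store one number in each precision.
BYTES_PER_VALUE = {"fp32": 4, "fp16": 2}


def max_values_in_memory(memory_gb, dtype):
    """How many numbers of the given dtype fit in memory_gb (1 GB = 1e9 bytes)."""
    return int(memory_gb * 1e9) // BYTES_PER_VALUE[dtype]


# Hypothetical example: a GPU with 32 GB of RAM.
fp32_capacity = max_values_in_memory(32, "fp32")
fp16_capacity = max_values_in_memory(32, "fp16")
print(fp32_capacity, fp16_capacity, fp16_capacity / fp32_capacity)
```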
365
+
366
+ While these theoretical benchmarks are useful, how do GPUs perform
367
+ practically? Lambda Labs offers [the best benchmarks
368
+ here](https://lambdalabs.com/gpu-benchmarks). Their results
369
+ show that the most recent server-grade NVIDIA GPU (A100) is more than
370
+ 2.5 times faster than the classic V100 GPU. RTX chips also outperform
371
+ the V100. [AIME is another source of GPU
372
+ benchmarks](https://www.aime.info/en/blog/deep-learning-gpu-benchmarks-2021/).
373
+
374
+ Cloud services such as Microsoft Azure, Google Cloud Platform, and
375
+ Amazon Web Services are the default place to buy access to GPUs. Startup
376
+ cloud providers like
377
+ [Paperspace](https://www.paperspace.com/),
378
+ [CoreWeave](https://www.coreweave.com/), and [Lambda
379
+ Labs](https://lambdalabs.com/) also offer such services.
380
+
381
+ ### TPUs
382
+
383
+ Let's briefly discuss TPUs. There are four generations of TPUs, and the
384
+ most recent v4 is the fastest possible accelerator for deep learning. V4
385
+ TPUs are not generally available yet, but **TPUs generally excel at
386
+ scaling to larger model sizes**. The charts below compare TPUs to
387
+ the fastest A100 NVIDIA chip.
388
+
389
+ ![](./media/image11.png)
390
+
391
+
392
+ It can be overwhelming to compare the cost of cloud access to GPUs, so
393
+ [we made a tool that solves this
394
+ problem](https://fullstackdeeplearning.com/cloud-gpus/)!
395
+ Feel free to contribute to [our repository of Cloud GPU cost
396
+ metrics](https://github.com/full-stack-deep-learning/website/).
397
+ The tool has all kinds of nifty features like enabling filters for only
398
+ the most recent chip models, etc.
399
+
400
+ If we [combine the cost metrics with performance
401
+ metrics](https://github.com/full-stack-deep-learning/website/blob/main/docs/cloud-gpus/benchmark-analysis.ipynb),
402
+ we find that **the most expensive per hour chips are not the most
403
+ expensive per experiment!** Case in point: running the same Transformers
404
+ experiment on 4 V100s costs \$1750 over 72 hours, whereas the same
405
+ experiment on 4 A100s costs \$250 over only 8 hours. Think carefully
406
+ about cost and performance based on the model you're trying to train.
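The comparison above can be worked through in a few lines. This sketch uses only the totals quoted in the text (\$1750 over 72 hours on 4 V100s, \$250 over 8 hours on 4 A100s); the per-GPU-hour rates it prints are derived from those totals, not taken from any provider's price list, and the function name `cost_summary` is an illustrative assumption.

```python
def cost_summary(total_cost, hours, num_gpus):
    """Return (implied dollars per GPU-hour, total cost) for one experiment."""
    return total_cost / (hours * num_gpus), total_cost


v100_rate, v100_total = cost_summary(total_cost=1750, hours=72, num_gpus=4)
a100_rate, a100_total = cost_summary(total_cost=250, hours=8, num_gpus=4)

print(f"V100: ${v100_rate:.2f} per GPU-hour, ${v100_total} per experiment")
print(f"A100: ${a100_rate:.2f} per GPU-hour, ${a100_total} per experiment")
print(f"A100 run is {v100_total / a100_total:.0f}x cheaper overall")
```

The A100 comes out more expensive per GPU-hour but far cheaper per experiment, because the run finishes 9x sooner.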
407
+
408
+ Some helpful heuristics here are:
409
+
410
+ 1. Use the most expensive per-hour GPU in the least expensive cloud.
411
+
412
+ 2. Startups (e.g., Paperspace) tend to be cheaper than major cloud
413
+ providers.
414
+
415
+ ### On-Prem vs. Cloud
416
+
417
+ For **on-prem** use cases, you can build your own machine pretty easily or opt
418
+ for a pre-built computer from a company like NVIDIA. You can build a
419
+ good, quiet PC with 128 GB RAM and 2 RTX 3090s for about \$7000 and set
420
+ it up in a day. Going beyond this can start to get far more expensive
421
+ and complicated. Lambda Labs offers a \$60,000 machine with 8 A100s
422
+ (super fast!). Tim Dettmers offers a great (slightly outdated)
423
+ perspective on building a machine
424
+ [here](https://timdettmers.com/2020/09/07/which-gpu-for-deep-learning/).
425
+
426
+ Some tips on on-prem vs. cloud use:
427
+
428
+ - It can be useful to have your own GPU machine to shift your mindset
429
+ from minimizing cost to maximizing utility.
430
+
431
+ - To truly scale out experiments, you should probably just use the
432
+ most expensive machines in the least expensive cloud.
433
+
434
+ - TPUs are worth experimenting with for large-scale training, given
435
+ their performance.
436
+
437
+ - Lambda Labs is a sponsor, and we highly encourage looking at them
438
+ for on-prem and cloud GPU use!
439
+
440
+ ## 6 - Resource Management
441
+
442
+ ![](./media/image2.png)
443
+
444
+
445
+ Now that we've talked about raw compute, let's talk about options for
446
+ **how to manage our compute resources**. Let's say we want to manage a
447
+ set of experiments. Broadly speaking, we'll need hardware in the form of
448
+ GPUs, software requirements (e.g., PyTorch version), and data to train
449
+ on.
450
+
451
+ ### Solutions
452
+
453
+ Leveraging best practices for specifying dependencies (e.g., Poetry,
454
+ conda, pip-tools) makes the process of spinning up such experiments
455
+ quick and easy on a single machine.
456
+
457
+ If, however, you have a cluster of machines to run experiments on,
458
+ [SLURM](https://slurm.schedmd.com/documentation.html) is
459
+ the tried and true solution for workload management that is still widely
460
+ used.
461
+
462
+ For more portability, [Docker](https://www.docker.com/) is
463
+ a way to package up an entire dependency stack into a lighter-than-a-VM
464
+ package. [Kubernetes](https://kubernetes.io/) is the most
465
+ popular way to run many Docker containers on top of a cluster. The OSS
466
+ [Kubeflow](https://www.kubeflow.org/) project helps manage
467
+ ML projects that rely on Kubernetes.
468
+
469
+ These projects are useful, but they may not be the easiest or best
470
+ choice. They're great if you already have a cluster up and running, but
471
+ **how do you actually set up a cluster or compute platform?**
472
+
473
+ *Before proceeding, FSDL prefers open source and/or transparently priced
474
+ products. We discuss tools that fall into these categories, not SaaS
475
+ with opaque pricing.*
476
+
477
+ ### Tools
478
+
479
+ For practitioners all in on AWS, [AWS
480
+ Sagemaker](https://aws.amazon.com/sagemaker/) offers a
481
+ convenient end-to-end solution for building machine learning models,
482
+ from labeling data to deploying models. Sagemaker has a ton of
483
+ AWS-specific configuration, which can be a turnoff, but it brings a lot
484
+ of easy-to-use old school algorithms for training and allows you to BYO
485
+ algorithms as well. They're also increasing support for PyTorch, though
486
+ it comes at a markup of about 15-20%.
487
+
488
+ [Anyscale](https://www.anyscale.com/) is a company created
489
+ by the makers of the Berkeley OSS project
490
+ [Ray](https://github.com/ray-project/ray). Anyscale
491
+ recently launched [Ray
492
+ Train](https://docs.ray.io/en/latest/train/train.html),
493
+ which they claim is faster than Sagemaker with a similar value
494
+ proposition. Anyscale makes it really easy to provision a compute
495
+ cluster, but it's considerably more expensive than alternatives.
496
+
497
+ [Grid.ai](https://www.grid.ai/) is created by the PyTorch
498
+ Lightning creators. Grid lets you easily specify which compute parameters
499
+ to use with "grid run" followed by the types of compute and
500
+ options you want. You can use their instances or AWS under the hood.
501
+ Grid has an uncertain future, as its future compatibility with Lightning
502
+ (given their rebrand) has not been clarified.
503
+
504
+ There are several non-ML options for spinning up compute too! Writing
505
+ your own scripts, using various libraries, or even Kubernetes are all
506
+ options. This route is harder.
507
+
508
+ [Determined.AI](https://determined.ai/) is an OSS solution
509
+ for managing on-prem and cloud clusters. They offer cluster management,
510
+ distributed training, and more. It's pretty easy to use and is in active
511
+ development.
512
+
513
+ With all this said, **there is still room to improve the
514
+ experience of launching training on many cloud providers**.
515
+
516
+ ## 7 - Experiment and Model Management
517
+
518
+ ![](./media/image4.png)
519
+
520
+
521
+ In contrast to compute, **experiment management is quite close to being
522
+ solved**. Experiment management refers to tools and processes that help
523
+ us keep track of code, model parameters, and data sets that are iterated
524
+ on during the model development lifecycle. Such tools are essential to
525
+ effective model development. There are several solutions here:
526
+
527
+ - [TensorBoard](https://www.tensorflow.org/tensorboard):
528
+ A non-exclusive Google solution effective at one-off experiment
529
+ tracking. It is difficult to manage many experiments with it.
530
+
531
+ - [MLflow](https://mlflow.org/): A non-exclusive
532
+ Databricks project that includes model packaging and more, in
533
+ addition to experiment management. It must be self-hosted.
534
+
535
+ - [Weights and Biases](https://wandb.ai/site): An
536
+ easy-to-use solution that is free for personal and academic projects! Logging
537
+ starts simply with an "experiment config" command.
538
+
539
+ - Other options include [Neptune
540
+ AI](https://neptune.ai/), [Comet
541
+ ML](https://www.comet.ml/), and [Determined
542
+ AI](https://determined.ai/), all of which have solid
543
+ experiment tracking options.
544
+
545
+ Many of these platforms also offer **intelligent hyperparameter
546
+ optimization**, which allows us to control the cost of searching for the
547
+ right parameters for a model. For example, Weights and Biases has a
548
+ product called [Sweeps](https://wandb.ai/site/sweeps) that
549
+ helps with hyperparameter optimization. It's best to have it as part of
550
+ your regular ML training tool; there's no need for a dedicated tool.
551
+
552
+ ## 8 - "All-In-One"
553
+
554
+ ![](./media/image13.png)
555
+
556
+
557
+ There are machine learning infrastructure solutions that offer
558
+ everything: training, experiment tracking, scaling out, deployment,
559
+ etc. These "all-in-one" platforms simplify things but don't come cheap!
560
+ Examples include [Gradient by
561
+ Paperspace](https://www.paperspace.com/gradient), [Domino
562
+ Data Lab](https://www.dominodatalab.com/), [AWS
563
+ Sagemaker](https://aws.amazon.com/sagemaker/), etc.
documents/lecture-02.srt ADDED
@@ -0,0 +1,256 @@
1
+ 1
2
+ 00:00:00,399 --> 00:00:49,360
3
+ hi everyone welcome to week two of full stack deep learning 2022. today we have a lecture on development infrastructure and tooling my name is sergey and i have my assistant mishka right here so just diving right in the dream of machine learning development is that you provide a project spec identify birds maybe some sample data here's what the birds look like here's what i want to see and then you get a continually improving prediction system and it's deployed at scale but the reality is that it's not just some sample data you really have to find the data aggregate it process it clean it label it then you have to find the model architecture potentially the pre-trained weights then you still have to look at the model code
4
+
5
+ 2
6
+ 00:00:46,480 --> 00:01:32,640
7
+ probably edit it debug it run training experiments review the results that's going to feed back into maybe trying a new architecture debugging some more code and then when that's done you can actually deploy the model and then after you deploy it you have to monitor the predictions and then you close the data flywheel loop basically your user is generating fresh data for you that you then have to add to your data set so this reality has roughly kind of three components and we divide it into data in red development in yellow and deployment in green and there are a lot of tools like the infrastructure landscape is pretty large so we have three lectures to cover all of it and today we're going to concentrate on
8
+
9
+ 3
10
+ 00:01:30,240 --> 00:02:16,239
11
+ the development part the middle part which is probably what you're familiar with from previous courses most of what you do is model development we actually want to start even a little bit before that and talk about software engineering you know it starts with maybe the programming language and for machine learning it's pretty clear it has to be python and the reason is because of all the libraries that have been developed for it it's just the winner in scientific and data computing there have been some contenders so julia is actually the ju in jupyter notebooks to write python code you need an editor you can be old school and use vim or emacs a lot of people just write in jupyter notebooks or jupyter lab which
12
+
13
+ 4
14
+ 00:02:14,000 --> 00:03:06,400
15
+ also gives you a code editor window vs code is a very popular text editor python specific code editor pycharm is really good as well at fsdl we recommend vs code it has a lot of nice stuff you know in addition to the nice editing features it has built-in git version control so you can see your commit you can actually stage line by line you can look at documentation as you write your code you can open projects remotely so like the window i'm showing here is actually on a remote machine that i've sshed into you can lint code as you write and if you haven't seen linters before it's basically this idea that if there are code style rules that you want to follow like a certain number of spaces
16
+
17
+ 5
18
+ 00:03:04,959 --> 00:03:49,280
19
+ for indentation whatever you decide you want to do you should just codify it so that you don't ever have to think about it or manually put that in your tools just do it for you and you run something that just looks at your code all the time you can do a little bit of static analysis so for example there's two commas in a row it's not going to run in this file or potentially you're using a variable that never got defined and in addition python now has type hints so you can actually say you know this variable is supposed to be an integer and then if you use it as an argument to a function that expects a float a static type checker can catch it and tell you about it before you actually run it so we set
20
+
21
+ 6
22
+ 00:03:46,799 --> 00:04:35,040
23
+ that all up in the lab by the way and you will see how that works it's a very nice part of the lab a lot of people develop in jupyter notebooks and they're really fundamental to data science and i think for good reason i think it's a great kind of first draft of a project you just open up this notebook and you start coding there's very little thought that you have to put in before you start coding and start seeing immediate output so that kind of like fast feedback cycle that's really great and jeremy howard is a great practitioner so if you watch the fast ai course videos you'll see him use them to their full extent they do have problems though for example the editor that you use in the notebook is pretty primitive right
24
+
25
+ 7
26
+ 00:04:32,960 --> 00:05:23,039
27
+ there's no refactoring support there's no maybe peeking at the documentation there's no copilot which i have now got used to in vs code there's out of order execution artifacts so if you've run the cells in a different order you might not get the same result as if you ran them all in line it's hard to version them you either strip out the output of each cell in which case you lose some of the benefit because sometimes you want to save the artifact that you produced in the notebook or the file is pretty large and keeps changing and it's hard to test because it's just not very amenable to like the unit testing frameworks and best practices that people have built up counterpoint to everything i just said
28
+
29
+ 8
30
+ 00:05:20,400 --> 00:06:10,560
31
+ is that you can kind of fix all of that and that's what jeremy howard is trying to do with nbdev which is this package that lets you write documentation your code and tests for the code all in a notebook the full stack deep learning recommendation is go ahead and use notebooks actually use the vs code built-in notebook support so i actually don't i'm not in the browser ever i'm just in my vs code but i'm coding in a notebook style but also i usually write code in a module that then gets imported into a notebook and with this live reload extension it's quite nice because when you change code in the module and rerun the notebook it gets the updated code and also you have nice things like you
32
+
33
+ 9
34
+ 00:06:08,960 --> 00:06:55,520
35
+ have a terminal, you can look at files, and so on. By the way, it enables really awesome debugging: if you want to debug some code, you can put a breakpoint here (on the right you see the little red dot), and then I'm about to launch the cell with the Debug Cell command, and it'll drop me into the debugger at that breakpoint. This is just really nice: without leaving the editor, I'm able to do a lot. Notebooks are great, but sometimes you want something a little more interactive, maybe something you can share with the world. Streamlit has come along and lets you just decorate Python code: you write a Python script, you decorate it with widgets, data loaders, and so on, and you get interactive applets
36
+
37
+ 10
38
+ 00:06:53,599 --> 00:07:45,120
39
+ where, let's say, a variable can be controlled by a slider, and everything just gets rerun very efficiently. When you're happy with your applet, you can publish it to the web and share that Streamlit address with your audience. It's really quite great. Setting up the Python environment can actually be pretty tricky. For deep learning you usually have a GPU, the GPU needs CUDA libraries, Python has a version, and each of the requirements you use, like PyTorch or NumPy, has its own specific version. Also, some requirements are for production, like torch, but some are only for development: for example, black is a code styling tool and mypy is a static analysis tool. It'd be nice to
40
+
41
+ 11
42
+ 00:07:41,440 --> 00:08:34,080
43
+ just separate the two. We can achieve all of these things by specifying the Python and CUDA versions in an environment.yaml file and using conda to install the Python and CUDA versions we specified. All the other requirements we specify with very minimal constraints: we say something like torch version greater than 1.7, or maybe no constraint at all, like NumPy at any version. Then we use a tool called pip-tools that analyzes the constraints we gave, and the constraints the packages might have on each other, finds a mutually compatible version of all the requirements, and locks it, so that when you come back to the project you have exactly the versions of everything you used.
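The difference between loose constraints and a locked set of versions can be sketched in a few lines of Python. This is only a toy check, not how pip-tools resolves dependencies: the helper names, the `>=` reading of "greater than 1.7", and the pinned version numbers are all assumptions made up for illustration.

```python
# Toy illustration of "loose constraints in, exact pins out".
# All version numbers below are hypothetical, not taken from the labs.

def parse_version(text):
    """Turn '1.13.1' into (1, 13, 1) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def satisfies(pinned, constraint):
    """Check one pinned version against a constraint like '>=1.7' or '' (any version)."""
    if not constraint:
        return True
    if constraint.startswith(">="):
        return parse_version(pinned) >= parse_version(constraint[2:])
    if constraint.startswith("=="):
        return parse_version(pinned) == parse_version(constraint[2:])
    raise ValueError(f"unsupported constraint: {constraint}")


# What a person writes by hand: minimal constraints.
loose = {"torch": ">=1.7", "numpy": ""}

# What a resolver would write out: one exact version per package.
locked = {"torch": "1.13.1", "numpy": "1.23.5"}

all_ok = all(satisfies(locked[name], rule) for name, rule in loose.items())
print(all_ok)
```

The point of the lock is that the second dictionary, not the first, is what gets installed, so the environment is reproducible even though the hand-written constraints stay minimal.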
44
+
45
+ 12
46
+ 00:08:32,640 --> 00:09:25,040
47
+ And we can also just use a Makefile to simplify this. We do this in the lab, so you'll see it there. On that note, please go through labs one through three; they're already out. They start with an overview of what the labs are going to be about, then PyTorch Lightning and PyTorch, and then we go through CNNs and transformers, and we see a lot of the structure that I've been talking about. That is it for software engineering. The next thing I want to talk about is deep learning frameworks and distributed training. So why do we need frameworks? Well, deep learning is actually not a lot of code if you have a matrix math library like NumPy. The fast.ai course does this pretty brilliantly: they basically have you
48
+
49
+ 13
50
+ 00:09:23,120 --> 00:10:12,560
51
+ build your own deep learning library, and you see how very little code it is. But when you have to deploy onto CUDA for GPU-powered deep learning, and when you consider that you might be writing unusual layers for which you have to figure out the differentiation yourself, that can get to be a lot to maintain. Then there are all the layer types that have been published in the literature, like convolutional layers, and all the different optimizers. So there's just a lot of code, and for that you really need a framework. So which framework should you use? Well, I think Josh answered this pretty concisely about a year ago. He said JAX is for researchers, PyTorch
52
+
53
+ 14
54
+ 00:10:10,480 --> 00:11:00,880
55
+ is for engineers, and TensorFlow is for boomers. So PyTorch is the Full Stack Deep Learning choice. But seriously, PyTorch, TensorFlow, and JAX are all similar: you define a deep learning model by writing and running Python code, and what you get is an optimized execution graph that can target CPUs, GPUs, TPUs, and mobile deployments. The reason you might prefer PyTorch is that it is basically absolutely dominant. If you look at the number of trained models shared on Hugging Face, which is the largest model zoo (we'll talk about it in a few minutes), there are models that are available in both PyTorch and TensorFlow, there are some models
56
+
57
+ 15
58
+ 00:10:59,600 --> 00:11:50,240
59
+ in JAX, there are some models for TensorFlow only, and there are a lot of models that are just for PyTorch. If you track paper submissions to academic conferences, 75-plus percent of the implementations of these research papers are in PyTorch. My face is blocking the stat, but something like 75 percent of machine learning competition winners used PyTorch in 2022. Now, TensorFlow is kind of cool. TensorFlow.js in particular lets you run deep learning models in your browser, and PyTorch doesn't have that. And Keras as a development experience is, I think, pretty unmatched for easily stacking layers together and training the model. Then there's JAX, which you might have heard about. With JAX, the main thing is you
60
+
61
+ 16
62
+ 00:11:48,800 --> 00:12:37,200
63
+ need a meta-framework for deep learning, which we'll talk about in a second. But PyTorch is the pick: it has an excellent dev experience. People used to say it's maybe a little slow, but it really is production-ready as is, and you can make it even faster by compiling your model with TorchScript. There's a great distributed training ecosystem, there are libraries for vision, audio, 3D data, and so on, and there are mobile deployment targets. With PyTorch Lightning, which is what we use in the labs, you get a nice structure for where to put your actual model code, your optimizer code, your training code, and your evaluation code, and for what the data loaders should look like. Then what you get, if you just
64
+
65
+ 17
66
+ 00:12:34,959 --> 00:13:26,800
67
+ structure your code as PyTorch Lightning expects it, is that you can run your code on CPU, on GPU, or on any number of GPUs or TPUs with just a few characters changed in your code. There's a performance profiler, there's model checkpointing, there's 16-bit precision, there are distributed training libraries; it's all very nice to use. Another possibility is the fastai software, which is developed alongside the fast.ai course. It provides a lot of advanced tricks like data augmentations, better weight initializations, and learning rate schedulers. It has a modular structure with data blocks and learners, and then vision, text, and tabular applications. The main problem with it that I see is
68
+
69
+ 18
70
+ 00:13:24,399 --> 00:14:20,560
71
+ that the code style is quite different, and in general it's a bit different from mainstream PyTorch. It can be very powerful if you go all in on it, but at FSDL we recommend PyTorch Lightning. TensorFlow is not just for boomers. FSDL prefers PyTorch because we think it's a stronger ecosystem, but TensorFlow is still perfectly good, and if you have a specific reason to prefer it, such as that's what your employer uses, you're going to have a good time. It still makes sense; it's not bad. JAX is a more recent project from Google which is really not specific to deep learning: it's about general vectorization of all kinds of code and also automatic differentiation of all kinds of code, including your physics simulations,
72
+
73
+ 19
74
+ 00:14:19,040 --> 00:15:03,440
75
+ stuff like that. Whatever you can express in JAX gets compiled to GPU or TPU code and runs super fast. For deep learning there are separate frameworks like Flax or Haiku. Here at FSDL we say use it if you have a specific need: maybe you're doing research on something unusual, or potentially you're working at Google and you're not allowed to use PyTorch; that could be a pretty good reason to use JAX. There's also this notion of meta-frameworks and model zoos that I want to cover. The idea of model zoos is that, sure, you can start with blank PyTorch, but most of the time you're going to start with at least a model architecture that someone has developed and published,
76
+
77
+ 20
78
+ 00:15:02,320 --> 00:15:49,519
79
+ and a lot of the time you're actually going to start with a pre-trained model, meaning someone trained the architecture on specific data, got weights that they then saved and uploaded to a hub, and you can download them and start not from scratch but from a pre-trained model. ONNX is built on the idea that deep learning models are all about the same: we know what an MLP type of layer is, we know what a CNN type of layer is, and it doesn't matter if it's written in PyTorch or TensorFlow or Caffe. Whatever it's written in, we should be able to port it between the different code bases, because the real thing that we care about is the weights, and the weights are just numbers. So ONNX is a format that lets you
80
+
81
+ 21
82
+ 00:15:47,920 --> 00:16:39,279
83
+ convert from PyTorch to TensorFlow and vice versa. It can work super well; it can also not work super well, and you can run into some edge cases. If it's something you need to do, it's definitely worth a try, but it's not necessarily going to work for all types of models. Hugging Face has become an absolutely stellar repository of models. It started with NLP but has since expanded to all kinds of tasks: audio classification, image classification, object detection. There are sixty thousand pre-trained models for all these tasks, there is a specific library, Transformers, that works with PyTorch, TensorFlow, and JAX, and there are also 7,500 datasets that people have uploaded. There's a lot more to it that's worth checking out: you can host your model for
84
+
85
+ 22
86
+ 00:16:36,720 --> 00:17:31,679
87
+ inference, and there are community aspects to it, so it's a great resource. Another great resource, specifically for vision, is called timm: state-of-the-art computer vision models can be found in timm; just search for timm on GitHub. Next up, let's talk about distributed training. The scenario is that we have multiple machines, represented by the little squares here, with multiple GPUs on each machine, and you are sending batches of data to be processed by a model that has parameters. The data batch can fit on a single GPU or potentially not, and the model parameters can fit on a single GPU or potentially not. So let's say the best case, the easiest case, is that your batch of data fits on a single GPU and
88
+
89
+ 23
90
+ 00:17:30,320 --> 00:18:21,280
91
+ your model parameters fit on a single GPU. That's really called trivial parallelism: you can launch independent experiments on other GPUs, so maybe do a hyperparameter search, or potentially you increase your batch size until it can no longer fit on one GPU, and then you have to figure something else out. What you then have to figure out is: okay, my model still fits on a single GPU, but my data no longer fits on a single GPU, so now I have to do something different. That different thing is usually data parallelism. It lets you distribute a single batch of data across GPUs and then average the gradients computed by the model across all the GPUs. So it's the same model on each GPU but
92
+
93
+ 24
94
+ 00:18:18,880 --> 00:19:10,960
95
+ different batches of data. Because a lot of this work is cross-GPU, we have to make sure that the GPUs have a fast interconnect. A GPU is usually connected to the computer through a PCIe interface, so if there's no other connection, all the data has to flow through the PCIe bus all the time. It's possible that there is a faster interconnect, like NVLink, between the GPUs, and then the data can leave the PCIe bus alone and go straight across the fast interconnect. As for the speedup you can expect: if you are using server cards like A100s, A6000s, or V100s, it's basically a linear speedup for data parallelism, which is really cool. If you're using consumer cards like 2080s or 3080s (we'll talk about them a
96
+
97
+ 25
98
+ 00:19:08,720 --> 00:19:59,919
99
+ little further down), then unfortunately it's going to be a sublinear speedup: maybe if you have four GPUs it'll be like a 3x speedup, and if you have eight GPUs maybe a 5x speedup. That's due to the fact that the consumer cards don't have as fast an interconnect. Data parallelism is implemented in PyTorch in the DistributedDataParallel library. There's also a third-party library called Horovod, and you can use either one super simply using PyTorch Lightning. You basically say what your strategy is: if you don't say anything, then it's single GPU, but if your strategy is DDP then it uses PyTorch DistributedDataParallel, and if you use the Horovod strategy then it uses Horovod. It seems like the speedup is basically
100
+
101
+ 26
102
+ 00:19:58,160 --> 00:20:48,640
103
+ about the same. There's no real reason to use Horovod over DistributedDataParallel, but it might make things easier for a specific case that you might have, so it's good to know about. The first thing to try, though, is just DistributedDataParallel. Now we come to a more advanced scenario, which is that we can't even fit our model: it's so large, with billions of parameters, that it doesn't fit on a single GPU. So we have to spread the model, not just the data, over multiple GPUs, and there are three solutions to this. Sharded data parallelism starts with the question: what exactly is taking up the GPU memory? Okay, we have the model parameters, the floats that make up our actual
104
+
105
+ 27
106
+ 00:20:47,360 --> 00:21:40,400
107
+ layers. We have the gradients; we need to know about the gradients because that's what we average to do our backprop. But we also have optimizer states, and that's actually a lot of data. The Adam optimizer, which is probably the most often used optimizer today, has to keep statistics about the gradients, basically. In addition, if you're doing float16 training, then your model parameters and gradients might be float16, but the optimizer will keep a copy of them as float32 as well, so it can be a lot more data. And then, of course, you also send your batch of data. All of this has to fit on a GPU, but does it actually have to fit on every GPU? That is the question. The baseline that we have is: yeah, let's send all of this stuff to each GPU,
108
+
109
+ 28
110
+ 00:21:37,840 --> 00:22:33,440
111
+ and that might take up about 120 gigabytes in this example, which is from the paper called "ZeRO: Memory Optimizations Toward Training Trillion Parameter Models." Okay, so what if we shard the optimizer states? Sharding is a concept from databases: if you have one source of data, you break it up into shards such that, across your distributed system, each node only sees a single shard of the data. So the first thing we can try is to shard the optimizer states: each GPU doesn't have to hold all of the optimizer state, it just has to hold its little shard of it. We can do the same for gradients, and that's called ZeRO-2. And then, pretty crazily, we can also do it for the
112
+
113
+ 29
114
+ 00:22:31,520 --> 00:23:19,840
115
+ model parameters themselves, and that's called ZeRO-3. That can result in a pretty insane order-of-magnitude reduction in memory use, which means that your batch size can be 10 times bigger. I recommend watching the helpful video that I have linked, but you literally pass the model parameters around between the GPUs as the computation proceeds. Here we see four GPUs and four chunks of data entering the GPUs. What happens is that GPU zero has the model parameters for the first part of the model, and it communicates these parameters to the other three GPUs, and then they do their computation. Once they are done with that computation, the other GPUs can delete the parameters for those first
116
+
117
+ 30
118
+ 00:23:18,559 --> 00:24:05,440
119
+ four layers. Then GPU one has the parameters for the next four layers, and it broadcasts them to the other three GPUs, which are now able to do the next four layers of computation. That's just the forward pass; then you do the same with gradients and optimizer states in the backward pass. This is a lot to implement. Thankfully we don't have to do it: it's implemented by the DeepSpeed library from Microsoft and the FairScale library from Facebook, and recently it was also implemented natively in PyTorch, where it's called Fully Sharded Data Parallel instead of ZeRO-3. With PyTorch Lightning you can try sharded DDP with just a tiny change. Try it and see if you get a massive memory
120
+
121
+ 31
122
+ 00:24:04,400 --> 00:24:54,880
123
+ reduction, which can correspond to a speedup in your training. The ZeRO-3 principle is that the GPU only needs the model parameters required for the computation it's doing at this moment, and the same principle can be applied to just a single GPU. You can get 13 billion parameters onto the GPU: you can train a 13-billion-parameter model on a single V100, which doesn't even fit it natively. FairScale also implements this and calls it CPU offloading. There are a couple more solutions. Model parallelism: take your model, which let's say has three layers, and you have three GPUs; you can put each layer on its own GPU. In PyTorch you can implement it very trivially, but the
124
+
125
+ 32
126
+ 00:24:52,960 --> 00:25:41,840
127
+ problem is that only one GPU will be active at a given time. The trick here, once again implemented by libraries like DeepSpeed and FairScale, is to pipeline this computation so that the GPUs are mostly fully utilized, although you need to tune the amount of pipelining to the batch size and decide exactly how you're going to split the model across the GPUs. So this isn't as much of a fire-and-forget solution as sharded data parallel. Another solution is tensor parallelism, which is basically the observation that there's nothing special about a matrix multiplication that requires the whole matrix to be on one GPU: you can distribute the matrix over GPUs. Megatron-LM is a repository from
128
+
129
+ 33
130
+ 00:25:39,279 --> 00:26:34,960
131
+ NVIDIA which did this for the transformer model and is widely used. You can actually use all of these together if you really need to scale, and the kind of model that really needs to scale is a GPT-3-sized language model such as BLOOM, which recently finished training. They used ZeRO data parallelism, tensor parallelism, and pipeline parallelism, in addition to some other stuff, and they called it 3D parallelism. But they also write that since they started their endeavor, ZeRO stage 3 performance has dramatically improved, and if they were to start over again today, maybe they would just do sharded data parallel and that would be enough. So in conclusion: if your model and data fit on one GPU, that's awesome.
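The memory accounting behind the ZeRO stages discussed above can be sketched as simple arithmetic. The sketch assumes the accounting used in the ZeRO paper for mixed-precision Adam (2 bytes per parameter for float16 weights, 2 for float16 gradients, and 12 for the float32 optimizer state) and that paper's illustrative setup of 7.5 billion parameters on 64 GPUs; neither number is stated in the lecture itself. The function name is hypothetical, and activations and the data batch are deliberately left out.

```python
# Per-GPU memory for model state under the ZeRO sharding stages.
# Assumes mixed-precision Adam: fp16 params (2 bytes), fp16 grads (2 bytes),
# and fp32 optimizer state (12 bytes) per parameter. Activations are ignored.

def zero_memory_gb(num_params, num_gpus, stage):
    """Return per-GPU model-state memory in GB for ZeRO stage 0 (baseline) to 3."""
    params = 2 * num_params
    grads = 2 * num_params
    optimizer = 12 * num_params
    if stage >= 1:  # stage 1 shards the optimizer states
        optimizer /= num_gpus
    if stage >= 2:  # stage 2 also shards the gradients
        grads /= num_gpus
    if stage >= 3:  # stage 3 also shards the parameters themselves
        params /= num_gpus
    return (params + grads + optimizer) / 1e9


for stage in range(4):
    print(stage, round(zero_memory_gb(7.5e9, 64, stage), 1))
```

Under these assumptions the baseline is 120 GB per GPU and full sharding brings it under 2 GB, which is why stage 3 alone can sometimes replace a more elaborate combination of parallelism schemes.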
132
+
133
+ 34
134
+ 00:26:32,799 --> 00:27:21,919
135
+ If they don't, or you want to speed up training, then you can distribute over GPUs with DistributedDataParallel. If the model still doesn't fit, you should try ZeRO-3 or Fully Sharded Data Parallel. There are other ways to speed things up: there's 16-bit training, there are special fast kernels for different types of layers like transformers, and you can maybe try sparse attention instead of normal dense attention. So there are other things that libraries like DeepSpeed and FairScale implement that you can try. And there are even more tricks: for example, for NLP there's the position encoding step, where you can use something called ALiBi, which scales to basically all lengths of sequences.
136
+
137
+ 35
138
+ 00:27:20,480 --> 00:28:09,600
139
+ So you can actually train on shorter sequences and use a trick called sequence length warmup, where you train on shorter sequences first and then increase the length, and because you're using ALiBi it should not mess up your position encoding. For vision, you can similarly use a size warmup by progressively increasing the size of the image. You can also use special optimizers. These tricks are implemented by a library called MosaicML Composer; they report some pretty cool speedups, and it's pretty easy to use. They also have a cool web tool, and I'm a fan of these things, that lets you see the efficient frontier for training models, time versus cost. It's kind of fun to play around with this MosaicML explorer.
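The sequence length warmup idea described above can be sketched as a schedule that maps a training step to the sequence length used at that step. This is a hand-rolled illustration, not Composer's API: the function name, the linear ramp, the rounding to a multiple of 8, and the example lengths are all assumptions.

```python
# Sketch of a linear sequence-length warmup schedule (hypothetical helper).

def warmup_length(step, warmup_steps, start_len, full_len, multiple=8):
    """Sequence length to train on at `step`: linear ramp, then constant."""
    if step >= warmup_steps:
        return full_len
    fraction = step / warmup_steps
    length = start_len + fraction * (full_len - start_len)
    # Round down to a multiple that is friendly to the hardware,
    # but never go below the starting length.
    return max(start_len, int(length) // multiple * multiple)


for step in (0, 250, 500, 1000, 5000):
    print(step, warmup_length(step, warmup_steps=1000, start_len=64, full_len=512))
```

In a training loop, each batch would be truncated to the returned length before the forward pass; the same shape of schedule applies to image resolution for the vision case.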
140
+
141
+ 36
142
+ 00:28:08,000 --> 00:29:01,039
143
+ There are also some research libraries like FFCV, which try to optimize the data flow. There are some simple tricks you can do that maybe speed it up a lot. These things will probably find their way into mainstream PyTorch eventually, but it's worth giving this a try, especially if, again, you're training vision models. The next thing we're going to talk about is the compute that we need for deep learning. I'm sure you've seen plots like this one from OpenAI. This is up through 2019, showing on a log scale just how many times the compute needs of the top-performing models have grown. And this goes even further into 2022 with large language models like GPT-3: they're just incredibly large and required an incredible number of
144
+
145
+ 37
146
+ 00:28:58,720 --> 00:29:53,279
147
+ petaflops to train. Basically, NVIDIA is the only choice for deep learning GPUs, and recently Google TPUs have been made available in the GCP cloud, and they're also very nice. The three main factors that we need to think about when it comes to GPUs are: how much data can you transfer to the GPU, how fast can you crunch through that data (which actually depends on whether the data is 32-bit or 16-bit), and how fast can you communicate between the CPU and the GPU and between GPUs. We can look at some landmark NVIDIA GPUs. The first thing we might notice is that there's basically a new architecture every year or every couple of years. It went from Kepler, with the K80 and K40 cards in 2014,
148
+
149
+ 38
150
+ 00:29:51,520 --> 00:30:44,480
151
+ up through Ampere from 2020 on. Some cards are for server use and some cards are for consumer use; if you're doing stuff for business, you're only supposed to use the server cards. The RAM that the GPU has allows you to fit a large model and a meaningful batch of data on the GPU, so the more RAM the better. Then the TFLOPS are kind of how much data you can crunch through in a unit of time. I also have a column for tensor TFLOPS: these come from special tensor cores that NVIDIA specifically intends for deep learning operations, which are mixed-precision float32 and float16. These are much higher than just straight 32-bit teraflops. If you use 16-bit training, you effectively double, or so, your RAM capacity. We looked at the teraflops; these are
152
+
153
+ 39
154
+ 00:30:42,720 --> 00:31:46,960
155
+ theoretical numbers, but how do the cards actually benchmark? Lambda Labs is probably the best source of benchmark data, and here they show how the different GPUs compare relative to a single V100. One thing we might notice is that the A100, which is the most recent server-grade GPU, is over 2.5 times faster than the V100. You'll notice there are a couple of different A100s: PCIe versus SXM4 refers to how fast you can get the data onto the GPU, and 40 gig versus 80 gig refers to how much data can fit on the GPU. Also, recently there are the RTX A4000, A5000, A6000, and so on cards, and the A40, and these are all better than the V100. Another source of benchmarks is AIME. They show you the time for a ResNet-50 model to go through
156
+
157
+ 40
158
+ 00:31:44,240 --> 00:32:42,720
159
+ 1.4 million images in ImageNet. The configuration of four A100s versus four V100s is three times faster in float32 and only one and a half times faster in float16. There's a lot more you can notice, but that's what I wanted to highlight. We could buy some of these GPUs, or we could use them in the cloud. Amazon Web Services, Google Cloud Platform, and Microsoft Azure are the heavyweight cloud providers. Google Cloud Platform is special out of the three because it also has TPUs. The startup cloud providers are Lambda Labs, Paperspace, CoreWeave, DataCrunch, Jarvis Labs, and others. Briefly, about TPUs: there are four generations of them. The TPU v4s are the most recent ones, and they're
160
+
161
+ 41
162
+ 00:32:40,480 --> 00:33:36,960
163
+ just the fastest possible accelerator for deep learning. This graphic shows speedups over the A100, which is the fastest NVIDIA accelerator. But the v4s are not quite in general availability yet. The v3s are still super fast, and they excel at scaling: if you have to train such a large model that you use multiple nodes and all the cores in the TPU, then this can be quite fast. Each TPU has 128 gigs of RAM. So there are a lot of different clouds, and it's a little bit overwhelming to actually compare prices, so we built a tool for cloud GPU comparison. We have AWS, GCP, Azure, Lambda Labs, Paperspace, Jarvis Labs, and DataCrunch, and we solicit pull requests, so if you know another one, like CoreWeave,
164
+
165
+ 42
166
+ 00:33:35,519 --> 00:34:34,639
167
+ make a pull request to this CSV file. What you can do is filter. For example, I want to see only the latest generation of GPUs, I want to see only four- or eight-GPU machines, and then I want to see only the A100s, so let's select only the A100s. That narrows it down. Furthermore, maybe I only want to use the 80 gig versions, so that narrows it down further. Then we can sort by per-GPU price or by total price, and we can see the properties of the machines: we know the GPU RAM, but how many virtual CPUs and how much machine RAM do these different providers supply to us? Now let's combine this cost
168
+
169
+ 43
170
+ 00:34:33,679 --> 00:35:33,119
171
+ data with benchmark data. What we find is that something that's expensive per hour is not necessarily expensive per experiment. Using Lambda Labs benchmarking data, if you use the 4x V100 machine, which is the cheapest per hour, and you run an experiment using a transformer model that takes 72 hours, it'll cost $1,750 to run. But if you use the 8x A100 machine, it will only take eight hours to run, and it'll actually only cost $250. There's a similar story if you use a convnet instead of a transformer model: less dramatic, but we still find that the 8x A100 machine is both the fastest and the cheapest. That's a little counterintuitive, so I was looking for more benchmarks. Here is MosaicML, which I mentioned
172
+
173
+ 44
174
+ 00:35:30,960 --> 00:36:30,000
175
+ earlier. They're benchmarking ResNet-50, and this is on AWS. What they find is that the 8x A100 machine is one and a half times faster and 15 percent cheaper than the 8x V100. That is a convnet experiment, and here's a transformer experiment with a GPT-2 model: the 8x A100 machine is twice as fast and 25 percent cheaper than the 8x V100 machine, and it's actually three times faster and 30 percent cheaper than the 8x T4 machine, which is a Turing-generation GPU. A good heuristic is to use the most expensive per-hour GPU, which is probably going to be a 4x or 8x A100, in the least expensive cloud. From playing with that cloud GPU table, you can convince yourself that the startups are much cheaper than the big boys. Here I'm filtering by A100.
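The per-hour versus per-experiment comparison above comes down to one multiplication, which a short sketch makes explicit. The run times and total costs are the figures quoted in the lecture for the transformer experiment; the hourly prices are not quoted and are only implied by dividing one by the other.

```python
# Cost per experiment = hourly price x hours to finish.
# Hours and total costs are the lecture's figures; hourly prices are derived.

runs = {
    "4x V100": {"hours": 72, "total_cost": 1750},
    "8x A100": {"hours": 8, "total_cost": 250},
}


def hourly_price(run):
    """Implied price per hour for a run."""
    return run["total_cost"] / run["hours"]


for name, run in runs.items():
    print(f"{name}: ${hourly_price(run):.2f}/hour, ${run['total_cost']} per experiment")

# How much faster the A100 machine finishes, and how much more it costs per hour.
speedup = runs["4x V100"]["hours"] / runs["8x A100"]["hours"]
price_ratio = hourly_price(runs["8x A100"]) / hourly_price(runs["4x V100"])

# The faster machine is cheaper per experiment whenever speedup > price_ratio.
cost_ratio = price_ratio / speedup
print(f"speedup {speedup:.1f}x, price ratio {price_ratio:.2f}x, cost ratio {cost_ratio:.2f}x")
```

The heuristic follows directly: a machine that costs somewhat more per hour but finishes several times sooner ends up cheaper per experiment.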
176
+
177
+ 45
178
+ 00:36:26,960 --> 00:37:22,560
179
+ And the per-GPU cost on Lambda Labs is only $1.10 per hour, while on GCP, Azure, and AWS it's at least $3.67. But what if you don't want to use the cloud? There are two options: you could build your own machine, which I would say is easy, or you could buy pre-built, which is definitely even easier. Lambda Labs builds them, NVIDIA builds them, and PC builders like Supermicro and others build them. You can build a pretty quiet PC with a lot of RAM and, let's say, two 3090s or 2080 Tis, and that would maybe be five to eight thousand dollars. It'd take you a day to build it and set it up, and maybe it's a rite of passage for deep learning practitioners.
180
+
181
+ 46
182
+ 00:37:20,480 --> 00:38:15,680
183
+ Now, if you want to go beyond four 2000-series cards like 2080s, or two 3000-series cards like 3090s, that can be painful, just because they consume a lot of power and they get hot. So pre-built can be better. Here's a $12,000 machine with two A5000s, which each have 24 gigs of RAM; it's going to be incredibly fast. Or maybe you want eight GPUs. Now, this one is going to be loud, so you're going to have to put it in some kind of special facility like a colo, and actually Lambda Labs can store it in their colo for you. It'd be maybe sixty thousand dollars for eight A6000s, which is a really, really fast server. Lambda Labs also provides actionable advice for selecting specific GPUs. There is a well-known article from Tim Dettmers
184
+
185
+ 47
186
+ 00:38:13,119 --> 00:38:58,320
187
+ that is now slightly out of date, because it has no Ampere cards, but it's still good. He talks about more than just GPUs: also about what CPU and RAM to get. The recommendation I want to give is that I think it's useful to have your own GPU machine, just to shift your mindset from minimizing the cost of running in the cloud to maximizing the utility of something you've already paid for, maximizing how much use you get out of it. But to scale out experiments, you probably need to enter the cloud, and you should use the most expensive machines in the least expensive cloud. TPUs are worth experimenting with if you're doing large-scale training. Lambda Labs is a sponsor of the Full Stack Deep
188
+
189
+ 48
190
+ 00:38:56,800 --> 00:39:45,520
191
+ Learning projects that our students are doing this year. It's actually an excellent choice both for buying a machine for yourself and as the least expensive cloud for A100s. Now that we've talked about compute, we can talk about how to manage it. What we want to do is launch an experiment or a set of experiments. Each experiment is going to need a machine or machines with a GPU or GPUs. It's going to need some kind of setup, like a Python version, a CUDA version, NVIDIA drivers, and Python requirements like a specific version of PyTorch. And then it needs a source of data. We could do this manually, we could use a workload manager like Slurm, we could use Docker and Kubernetes, or we could use some software specialized
192
+
193
+ 49
194
+ 00:39:43,920 --> 00:40:35,520
195
+ for machine learning. If you follow best practices for specifying dependencies, like the conda and pip-tools setup that we covered earlier, then all you have to do is log into the machine, activate your environment, and launch the experiment, saying how many GPUs it needs. If, however, you have a cluster of machines, then you need something more advanced, which is probably going to be Slurm, an old-school solution to workload management that's still widely used. This is actually a job from the BigScience effort to train the GPT-3-sized language model. They have 24 nodes with 64 CPUs and 8 GPUs on each node, and Slurm is the way they launched it on their cluster. Docker is a way to package up an entire
196
+
197
+ 50
198
+ 00:40:34,079 --> 00:41:22,560
199
+ dependency stack in something that's lighter than a full-on virtual machine. NVIDIA Docker is also something you'll have to install, which lets you use GPUs. We'll actually use this in the lab, so we'll talk more about it later. Kubernetes has emerged as the most popular way to run many Docker containers on top of a cluster, and Kubeflow specifically is a project for machine learning. Both of these are Google-originated open source projects, but they're not controlled by Google anymore. With Kubeflow you can spawn and manage Jupyter notebooks and you can manage multi-step workflows; it interfaces with PyTorch and TensorFlow, and you can run it on top of Google Cloud Platform, AWS, Azure, or on your own
200
+
201
+ 51
202
+ 00:41:20,800 --> 00:42:10,400
203
+ cluster and it can be useful but it's a lot so it could be the right choice for you but we think it probably won't be slurm and kubeflow they make sense if you already have a cluster up and running but how do you even get a cluster up and running in the first place and before we proceed i try not to mention software as a service that doesn't show pricing i find that you know when you go to the website and it says call us or whatever contact us for a demo that's not the right fit for the fsdl community we like to use open source ideally but if it's not open source then at least something that's transparently priced aws sagemaker is a solution you've probably heard about if you've used amazon web services
204
+
205
+ 52
206
+ 00:42:08,160 --> 00:42:57,040
207
+ and it's really a set of solutions it's everything from labeling data to launching notebooks to training to deploying your models and even to monitoring them and notebooks are a central paradigm they call it sagemaker studio and sagemaker could totally make sense to adopt if you're already using aws for everything if you're not already using aws for everything it's not such a silver bullet that it's worth adopting necessarily but if you are it's definitely worth a look so for training specifically they have some basically pre-built algorithms and they're quite old-school but you can also connect any other algorithm yourself it's a little more complicated and right away you have to configure a
208
+
209
+ 53
210
+ 00:42:55,119 --> 00:43:48,880
211
+ lot of iam roles and security groups and stuff like that it might be overwhelming if all you're trying to do is train a machine learning model that said they do have increasing support for pytorch now notice if you're using sagemaker to launch your python training you're going to be paying about a 15 to 20 percent markup so there's special sagemaker instances that correspond to normal aws gpu instances but it's more expensive they do have support for using spot instances and so that could make it worth it anyscale is a company from the makers of ray which is an open source project from berkeley and recently they released ray train which they claim is faster than sagemaker so the same idea basically lets you
212
+
213
+ 54
214
+ 00:43:46,800 --> 00:44:40,560
215
+ scale out your training to many nodes with many gpus but does it faster and it has better spot instance support where if a spot instance gets killed during training it recovers from it intelligently and anyscale is a software as a service that makes it you know really simple to provision compute with one line of code you can launch a cluster of any size that ease of use comes at a significant markup to amazon web services grid.ai is from the makers of pytorch lightning and the tagline is seamlessly train hundreds of machine learning models on the cloud with zero code changes if you have some kind of main.py method that's going to run your training and that can run on your laptop or on
216
+
217
+ 55
218
+ 00:44:38,400 --> 00:45:27,760
219
+ some local machine you can just scale it out to a grid of instances by prefacing it with grid run and then just saying what kind of instance type how many gpus should i use spot instances and so on and you can also you can use their instances or you can use aws under the hood and then it shows you all the experiments you're running and so on now i'm not totally sure about the long term plans for grid.ai because the makers of pytorch lightning are also rebranding as lightning.ai which has its own pricing so i'm just not totally sure but if it sticks around it looks like a really cool solution there's also non-machine learning specific solutions like you don't need sagemaker to provision compute on aws
220
+
221
+ 56
222
+ 00:45:25,440 --> 00:46:11,200
223
+ you could just do it in a number of ways that people have been doing you know provisioning aws instances and then uniting them into a cluster you can write your own scripts you can use kubernetes you can use some libraries for spot instances but there's nothing you know we can really recommend that's super easy to use determined ai is a machine learning specific open source solution that lets you manage a cluster either on prem or in the cloud it's cluster management distributed training experiment tracking hyper parameter search a lot of extra stuff it was a startup also from berkeley it got acquired by hpe but it's still in active development it's really easy to use you just install determined get a
224
+
225
+ 57
226
+ 00:46:09,839 --> 00:46:59,920
227
+ cluster up and running you can also launch it on aws or gcp that said i feel like a truly simple solution to launching training on many cloud instances still doesn't exist so this is an area where i think there's room for a better solution and that cannot be said about experiment management and model management because i think there's great solutions there so what experiment management refers to is you know as we run machine learning experiments we can lose track of which code parameters data set generated which model when we run multiple experiments that's even more difficult we need to like start making a spreadsheet of all the experiments we ran and the results and so on tensorboard is a solution from google
228
+
229
+ 58
230
+ 00:46:58,079 --> 00:47:48,640
231
+ that's not exclusive to tensorflow it gives you this nice set of pages that lets you track your loss and see where your model saved and it's a great solution for single experiments it does get unwieldy to manage many experiments as you get into dozens of experiments mlflow tracking is a solution that is open source it's from databricks but it's not exclusive to databricks it's not only for experiment management it's also for model packaging and stuff like that but they do have a robust solution for experiment management you do have to host it yourself weights and biases is a really popular super easy to use solution that is free for public projects and paid for private projects they show you
232
+
233
+ 59
234
+ 00:47:46,240 --> 00:48:33,119
235
+ all the experiments you've ever run slice and dice them however you want for each experiment they record what you log like your loss but also stuff about your system like how utilized your gpu is which is pretty important to track and you basically just initialize it with your experiment config and then you log anything you want including images and we're actually going to see this in lab 4 which is this week they also have some other stuff like you can host reports and tables is a recent product that lets you slice and dice your data and predictions in really cool ways determined ai also has an experiment tracking solution which is also perfectly good and there's other solutions too like neptune and comet
236
+
237
+ 60
238
+ 00:48:30,160 --> 00:49:22,240
239
+ and a number of others really often we actually want to programmatically launch experiments by doing something that's called hyper parameter optimization so maybe we want to search over learning rates so as we launch our training we don't want to commit to a specific learning rate we basically want to search over learning rates from you know point zero zero zero one to point one it'd be even more awesome if like this was done intelligently where if multiple runs are proceeding in parallel the ones that aren't going as well as others get stopped early and we get to search over more of the potential hyperparameter space weights and biases has a solution to this that's very pragmatic and easy to
240
+
241
+ 61
242
+ 00:49:19,599 --> 00:50:10,720
243
+ use it's called sweeps the way this works is you basically add a yaml file to your project that specifies the parameters you want to search over and how you want to do the search so here on the right you'll see we're using this hyperband algorithm which is a state-of-the-art hyper-parameter optimization algorithm and then you launch an agent on whatever machines you control the agent will poll the sweep server for a set of parameters run an experiment report results poll the server for more parameters and keep doing that and there's other solutions this is pretty table stakes kind of thing so sagemaker has hyperparameter search determined ai has hyperparameter search i think of it as just it's a part of your training harness so
244
+
245
+ 62
246
+ 00:50:09,680 --> 00:51:02,480
247
+ if you're already using weights and biases just use sweeps from weights and biases if you're already using determined just use hyperparameter search from determined it's not worth using some specialized software for this and lastly there are all-in-one solutions that cover everything from data to development to deployment a single system for everything for development usually a notebook interface scaling a training experiment to many machines provisioning the compute for you tracking experiments versioning models but also deploying models and monitoring performance managing data really all-in-one sagemaker is the you know the prototypical solution here but there's some other ones like gradient from paperspace so look at
248
+
249
+ 63
250
+ 00:51:00,960 --> 00:51:48,240
251
+ these features notebooks experiments data sets models and inference or domino data lab you can provision compute you can track the experiments you can deploy a model via a rest api you can monitor the predictions that the api makes and you can publish little data applets kind of like streamlit you can also monitor spend and you see all the projects in one place domino's meant more for kind of non-deep learning machine learning but i just wanted to show it because it's a nice set of the all-in-one functionality so these all-in-one solutions could be good but before deciding we want to go in on one of them let's wait to learn more about data management and deployment in the weeks ahead
252
+
253
+ 64
254
+ 00:51:46,480 --> 00:51:52,960
255
+ and that is it for development infrastructure and tooling thank you
256
+
documents/lecture-03.md ADDED
@@ -0,0 +1,597 @@
1
+ ---
2
+ description: Principles for testing software, tools for testing Python code, practices for debugging models and testing ML
3
+ ---
4
+
5
+ # Lecture 3: Troubleshooting & Testing
6
+
7
+ <div align="center">
8
+ <iframe width="720" height="405" src="https://www.youtube-nocookie.com/embed/RLemHNAO5Lw?list=PL1T8fO7ArWleMMI8KPJ_5D5XSlovTW_Ur" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
9
+ </div>
10
+
11
+ Lecture by [Charles Frye](https://twitter.com/charles_irl).<br />
12
+ Notes by [James Le](https://twitter.com/le_james94) and [Vishnu Rachakonda](https://www.linkedin.com/in/vrachakonda/).<br />
13
+ Published August 22, 2022.
14
+ [Download slides](https://fsdl.me/2022-lecture-03-slides).
15
+
16
+ ## 1 - Testing Software
17
+
18
+ 1. The general approach is that tests will help us ship faster with
19
+ fewer bugs, but they won't catch all of our bugs.
20
+
21
+ 2. That means we will use testing tools but won't try to achieve 100%
22
+ coverage.
23
+
24
+ 3. Similarly, we will use linting tools to improve the development
25
+ experience but leave escape valves rather than pedantically
26
+ following our style guides.
27
+
28
+ 4. Finally, we'll discuss tools for automating these workflows.
29
+
30
+ ### 1.1 - Tests Help Us Ship Faster. They Don't Catch All Bugs
31
+
32
+ ![](./media/image1.png)
33
+
34
+ **Tests are code we write that is designed to fail intelligibly when
35
+ our other code has bugs**. These tests can help catch some bugs before
36
+ they are merged into the main product, but they can't catch all bugs.
37
+ The main reason is that test suites are not certificates of correctness.
38
+ In some formal systems, tests can be proof of code correctness. But we
39
+ are writing in Python (a loosey-goosey language), so all bets are off
40
+ in terms of code correctness.
41
+
42
+ [Nelson Elhage](https://twitter.com/nelhage?lang=en)
43
+ framed test suites more like classifiers. The classification problem is:
44
+ does this commit have a bug, or is it okay? The classifier output is
45
+ whether the tests pass or fail. We can then **treat test suites as a
46
+ "prediction" of whether there is a bug**, which suggests a different way
47
+ of designing our test suites.
48
+
49
+ When designing classifiers, we need to trade off detection and false
50
+ alarms. **If we try to catch all possible bugs, we can inadvertently
51
+ introduce false alarms**. The classic signature of a false alarm is a
52
+ failed test - followed by a commit that fixes the test rather than the
53
+ code.
54
+
55
+ To avoid introducing too many false alarms, it's useful to ask yourself
56
+ two questions before adding a test:
57
+
58
+ 1. Which real bugs will this test catch?
59
+
60
+ 2. Which false alarms will this test raise?
61
+
62
+ If you can think of more examples for the second question than the first
63
+ one, maybe you should reconsider whether you need this test.
64
+
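The two questions above can be made concrete by scoring a test suite the way we would score a classifier. The following is only an illustrative sketch: the commit history is made up, and `Commit` and `suite_rates` are hypothetical names, not part of any tool mentioned in the lecture.

```python
from dataclasses import dataclass


@dataclass
class Commit:
    has_bug: bool       # ground truth, usually only known in hindsight
    tests_failed: bool  # the test suite's "prediction"


def suite_rates(commits):
    """Return (detection_rate, false_alarm_rate) for a test suite."""
    buggy = [c for c in commits if c.has_bug]
    clean = [c for c in commits if not c.has_bug]
    detection = sum(c.tests_failed for c in buggy) / len(buggy)
    false_alarm = sum(c.tests_failed for c in clean) / len(clean)
    return detection, false_alarm


# Made-up history: 4 buggy commits (3 caught), 6 clean commits (3 flagged anyway).
history = (
    [Commit(True, True)] * 3
    + [Commit(True, False)]
    + [Commit(False, True)] * 3
    + [Commit(False, False)] * 3
)
detection_rate, false_alarm_rate = suite_rates(history)
print(detection_rate, false_alarm_rate)
```

A suite like this one catches most bugs but cries wolf on half of the clean commits, which is the situation where removing or loosening tests may be worth considering.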
65
+ One caveat: **in some settings, correctness is important**.
66
+ Examples include medical diagnostics/intervention, self-driving
67
+ vehicles, and banking/finance. A pattern immediately arises here: If you
68
+ are operating in a high-stakes situation where errors have consequences
69
+ for people's lives and livelihoods, even if it's not regulated yet, it
70
+ might be regulated soon. These are examples of **low-feasibility,
71
+ high-impact ML projects** discussed in the first lecture.
72
+
73
+ ![](./media/image19.png)
74
+
75
+
76
+ ### 1.2 - Use Testing Tools, But Don't Chase Coverage
77
+
78
+ - *[Pytest](https://docs.pytest.org/)* is the standard
79
+ tool for testing Python code. It has a Pythonic implementation and
80
+ powerful features such as creating separate suites, sharing
81
+ resources across tests, and running parametrized variations of
82
+ tests.
83
+
84
+ - Pure text docs can't be checked for correctness automatically, so
85
+ they are hard to maintain or trust. Python has a nice module,
86
+ [*doctest*](https://docs.python.org/3/library/doctest.html),
87
+ for checking code in the documentation and preventing rot.
88
+
89
+ - Notebooks help connect rich media (charts, images, and web pages)
90
+ with code execution. A cheap and dirty solution to test notebooks
91
+ is adding some *asserts* and using *nbformat* (with an executor such as *nbclient*) to run the
92
+ notebooks.
93
+
94
+ ![](./media/image17.png)
95
+
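As a small sketch of the docstring-checking idea from the list above, the example below embeds usage examples in a docstring and runs them with the standard-library `doctest` module. The function `normalize` is a made-up example, and the explicit finder/runner setup is just one way to run the checks from inside a script.

```python
import doctest


def normalize(values):
    """Scale a list of numbers so that they sum to 1.

    >>> normalize([1, 1, 2])
    [0.25, 0.25, 0.5]
    >>> normalize([])
    []
    """
    total = sum(values)
    return [v / total for v in values] if values else []


def test_normalize_sums_to_one():
    # A plain function with a bare assert; pytest collects functions named test_*.
    assert abs(sum(normalize([3, 5, 8])) - 1.0) < 1e-9


# Collect and run the examples embedded in the docstring of `normalize`.
finder = doctest.DocTestFinder()
runner = doctest.DocTestRunner(verbose=False)
for docstring_test in finder.find(normalize, "normalize", module=False,
                                  globs={"normalize": normalize}):
    runner.run(docstring_test)
results = runner.summarize(verbose=False)

test_normalize_sums_to_one()
print(results)
```

If the documentation drifts from the code, the doctest run reports a failure instead of letting the stale example sit in the docs.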
96
+
97
+ Once you start adding different types of tests and your codebase grows,
98
+ you will want coverage tools for recording which code is checked or
99
+ "covered" by tests. Typically, this is done in lines of code, but some
100
+ tools can be more fine-grained. We recommend
101
+ [Codecov](https://about.codecov.io/), which generates nice
102
+ visualizations you can use to drill down and get a high-level overview
103
+ of the current state of your testing. Codecov helps you understand your
104
+ tests and can be incorporated into your testing. You can say you want to
105
+ reject commits not only where tests fail, but also where test coverage
106
+ goes down below a certain threshold.
107
+
108
+ However, we recommend against that. Personal experience, interviews, and
109
+ published research suggest that only a small fraction of the tests you
110
+ write will generate most of your value. **The right tactic,
111
+ engineering-wise, is to expend the limited engineering effort we have on
112
+ the highest-impact tests and ensure that those are super high quality**.
113
+ If you set a coverage target, you will instead write tests in order to
114
+ meet that coverage target (regardless of their quality). You end up
115
+ spending more effort to write tests and deal with their low quality.
116
+
117
+ ![](./media/image16.png)
118
+
119
+
120
+ ### 1.3 - Use Linting Tools, But Leave Escape Valves
121
+
122
+ **Clean code is of uniform and standard style**.
123
+
124
+ 1. Uniform style helps avoid spending engineering time on arguments
125
+ over style in pull requests and code review. It also helps improve
126
+ the utility of our version control by cutting down on noisy
127
+ components of diffs and reducing their size. Both benefits make it
128
+ easier for humans to visually parse the diffs in our version
129
+ control system and make it easier to build automation around them.
130
+
131
+ 2. Standard style makes it easier to accept contributions for an
132
+ open-source repository and onboard new team members for a
133
+ closed-source system.
134
+
135
+ ![](./media/image18.png)
136
+
137
+
138
+ One aspect of consistent style is consistent code formatting (with
139
+ things like whitespace). The standard tool for that in Python is
140
+ [the *black* Python
141
+ formatter](https://github.com/psf/black). It's a very
142
+ opinionated tool with a fairly narrow scope in terms of style. It
143
+ focuses on things that can be fully automated and can be nicely
144
+ integrated into your editor and automated workflows.
145
+
146
+ For non-automatable aspects of style (like missing docstrings), we
147
+ recommend [*flake8*](https://flake8.pycqa.org/). It comes
148
+ with many extensions and plugins that check for things such as docstring
149
+ completeness, type hinting, security, and common bugs.
150
+
151
+ ML codebases often have both Python code and shell scripts in them.
152
+ Shell scripts are powerful, but they also have a lot of sharp edges.
153
+ *[shellcheck](https://www.shellcheck.net/)* knows all the
154
+ weird behaviors of bash that often cause errors and issues that aren't
155
+ immediately obvious. It also provides explanations for why it's raising
156
+ a warning or an error. It's very fast to run and can be easily
157
+ incorporated into your editor.
158
+
159
+ ![](./media/image6.png)
160
+
161
+
162
+ One caveat to this is: **pedantic enforcement of style is obnoxious.**
163
+ To avoid frustration with code style and linting, we recommend:
164
+
165
+ 1. Filtering rules down to the minimal style that achieves the goals we
166
+ set out (sticking with standards, avoiding arguments, keeping
167
+ version control history clean, etc.)
168
+
169
+ 2. Having an "opt-in" application of rules and gradually growing
170
+ coverage over time - which is especially important for existing
171
+ codebases (which may have thousands of lines of code that need
172
+ to be fixed).
173
+
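In practice, "escape valves" often take the form of inline comments that switch a tool off for a specific span. The sketch below assumes *black* and *flake8* are the tools in use; the variable names and the URL are made up for illustration.

```python
# `# fmt: off` / `# fmt: on` ask black to leave the enclosed lines untouched,
# which preserves the hand-aligned columns below.
# fmt: off
TRANSITIONS = {
    "idle":    ["running"],
    "running": ["idle", "failed"],
    "failed":  [],
}
# fmt: on

# `# noqa: E501` silences flake8's line-length check for this one line only.
EXAMPLE_URL = "https://example.com/docs/style-guide/why-we-allow-long-urls-in-source-files"  # noqa: E501

print(TRANSITIONS["running"], len(EXAMPLE_URL))
```

Keeping the opt-out local to a line or block makes each exception visible in code review, rather than disabling a rule for the whole project.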
174
+ ### 1.4 - Always Be Automating
175
+
176
+ **To make the best use of testing and linting practices, you want to
177
+ automate these tasks and connect them to your cloud version control system
178
+ (VCS)**. Connecting to the VCS state reduces friction when trying to
179
+ reproduce or understand errors. Furthermore, running things outside of
180
+ developer environments means that you can run tests automatically in
181
+ parallel to other development work.
182
+
183
+ Popular, open-source repositories are the best place to learn about
184
+ automation best practices. For instance, the PyTorch GitHub repository has
185
+ tons of automated workflows built into the repo - such as workflows that
186
+ automatically run on every push and pull request.
187
+
188
+ ![](./media/image15.png)
189
+
190
+
191
+ The tool that PyTorch uses (and that we recommend) is [GitHub
192
+ Actions](https://docs.github.com/en/actions), which ties
193
+ automation directly to VCS. It is powerful, flexible, performant, and
194
+ easy to use. It has great documentation, is configured with a YAML file,
195
+ and is embraced by the open-source community. There are other options
196
+ such as [pre-commit.ci](https://pre-commit.ci/),
197
+ [CircleCI](https://circleci.com/), and
198
+ [Jenkins](https://www.jenkins.io/); but GitHub Actions
199
+ seems to have won the hearts and minds in the open-source community in
200
+ the last few years.
201
+
202
+ To keep your version control history as clean as possible, you want to
203
+ be able to run tests and linters locally before committing. We recommend
204
+ *[pre-commit](https://github.com/pre-commit/pre-commit)*
205
+ to enforce hygiene checks. You can use it to run formatting, linting,
206
+ etc. on every commit and keep the total runtime to a few seconds.
207
+ *pre-commit* is easy to run locally and easy to automate with GitHub
208
+ Actions.
209
+
210
+ **Automation to ensure the quality and integrity of our software is a
211
+ productivity enhancer.** That's broader than just CI/CD. Automation
212
+ helps you avoid context switching, surfaces issues early, is a force
213
+ multiplier for small teams, and is better documented by default.
214
+
215
+ One caveat is that: **automation requires really knowing your tools.**
216
+ Knowing Docker well enough to use it is not the same as knowing Docker
217
+ well enough to automate it. Bad automation, like bad tests, takes more
218
+ time than it saves. Organizationally, that makes automation a good task
219
+ for senior engineers who have knowledge of these tools, have ownership
220
+ over code, and can make these decisions around automation.
221
+
222
+ ### Summary
223
+
224
+ 1. Automate tasks with GitHub Actions to reduce friction.
225
+
226
+ 2. Use the standard Python toolkit for testing and cleaning your
227
+ projects.
228
+
229
+ 3. Choose testing and linting practices with the 80/20 principle,
230
+ shipping velocity, and usability/developer experience in mind.
231
+
232
+ ## 2 - Testing ML Systems
233
+
234
+ 1. Testing ML is hard, but not impossible.
235
+
236
+ 2. We should stick with the low-hanging fruit to start.
237
+
238
+ 3. Test your code in production, but don't release bad code.
239
+
240
+ ### 2.1 - Testing ML Is Hard, But Not Impossible
241
+
242
+ Software engineering is where many testing practices have been
243
+ developed. In software engineering, we compile source code into
244
+ programs. In machine learning, training compiles data into a model.
245
+ These components are harder to test:
246
+
247
+ 1. Data is heavier and more inscrutable than source code.
248
+
249
+ 2. Training is more complex and less well-defined.
250
+
251
+ 3. Models have worse tools for debugging and inspection than compiled
252
+ programs.
253
+
254
+ In this section, we will focus primarily on "smoke" tests. These tests
255
+ are easy to implement and still effective. They are among the 20% of
256
+ tests that get us 80% of the value.
257
+
258
+ ### 2.2 - Use Expectation Testing on Data
259
+
260
+ **We test our data by checking basic properties**. We express our
261
+ expectations about the data, which might be things like there are no
262
+ nulls in this column or the completion date is after the start date.
263
+ With expectation testing, you will start small with only a few
264
+ properties and grow them slowly. You only want to test things that are
265
+ worth raising alarms and sending notifications to others.
266
+
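The two example expectations above can be sketched in plain Python. This is not the *great_expectations* API; the rows, column names, and helper functions are hypothetical and only show the shape of the idea.

```python
from datetime import date

# Toy records standing in for a real table; the column names are made up.
rows = [
    {"task_id": 1, "start_date": date(2022, 8, 1), "completion_date": date(2022, 8, 3)},
    {"task_id": 2, "start_date": date(2022, 8, 2), "completion_date": date(2022, 8, 6)},
    {"task_id": None, "start_date": date(2022, 8, 5), "completion_date": date(2022, 8, 4)},
]


def expect_no_nulls(rows, column):
    """Return the indices of rows where `column` is missing."""
    return [i for i, row in enumerate(rows) if row.get(column) is None]


def expect_completion_after_start(rows):
    """Return the indices of rows not completed after their start date."""
    return [i for i, row in enumerate(rows) if row["completion_date"] <= row["start_date"]]


failures = {
    "task_id is never null": expect_no_nulls(rows, "task_id"),
    "completion_date is after start_date": expect_completion_after_start(rows),
}
for expectation, bad_rows in failures.items():
    status = "ok" if not bad_rows else f"violated by rows {bad_rows}"
    print(f"{expectation}: {status}")
```

Each expectation returns the offending rows rather than a bare pass/fail, so an alert can point directly at the data that needs a look.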
267
+ ![](./media/image14.png)
268
+
269
+
270
+ We recommend
271
+ [*[great_expectations]*](https://greatexpectations.io/) for
272
+ data testing. It automatically generates documentation and quality
273
+ reports for your data, in addition to built-in logging and alerting
274
+ designed for expectation testing. To get started, check out [this
275
+ MadeWithML tutorial on
276
+ great_expectations](https://github.com/GokuMohandas/testing-ml).
277
+
278
+ ![](./media/image13.png)
279
+
280
+ To move forward, you want to stay as close to the data as possible:
281
+
282
+ 1. A common pattern is that there's a benchmark dataset with
283
+ annotations (in academia) or an external annotation team (in the
284
+ industry). A lot of the detailed information about that data can
285
+ be extracted by simply looking at it.
286
+
287
+ 2. One way for data to get internalized into the organization is that
288
+ at the start of the project, model developers annotate data ad-hoc
289
+ (especially if you don't have the budget for an external
290
+ annotation team).
291
+
292
+ 3. However, if the model developers at the start of the project move on
293
+ and more developers get onboarded, that knowledge is diluted. A
294
+ better solution is an internal annotation team that has a regular
295
+ information flow with the model developers.
296
+
297
+ 4. The best practice ([recommended by Shreya
298
+ Shankar](https://twitter.com/sh_reya/status/1521903046392877056))
299
+ is **to have a regular on-call rotation where model developers
300
+ annotate data themselves**. Ideally, these are fresh data so that
301
+ all members of the team who are developing models know about the
302
+ data and build intuition/expertise in the data.
303
+
304
+ ### 2.3 - Use Memorization Testing on Training
305
+
306
+ **Memorization is the simplest form of learning**. Deep neural networks
307
+ are very good at memorizing data, so checking whether your model can
308
+ memorize a very small fraction of the full data set is a great smoke
309
+ test for training. If a model can't memorize, then something is clearly
310
+ very wrong!
311
+
312
+ Only really gross issues with training will show up with this test. For
313
+ example, your gradients may not be calculated correctly, you have a
314
+ numerical issue, or your labels have been shuffled; serious issues like
315
+ these. Subtle bugs in your model or your data are not going to show up.
316
+ A way to catch smaller bugs is to track the length of run time in your
317
+ test. It's a good way to detect if smaller issues are making it
318
+ harder for your model to learn. If the number of epochs it takes to
319
+ reach an expected performance suddenly goes up, it may be due to a
320
+ training bug. PyTorch Lightning has an "*overfit_batches*" feature that
321
+ can help with this.
322
+
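A memorization test can be sketched without any deep learning framework. The example below is a toy stand-in, assuming a one-feature logistic model and a four-point batch; it does not use PyTorch Lightning, and all names in it are hypothetical. It checks that training drives the loss below a threshold within a step budget and reports how many steps that took.

```python
import math

# A tiny, fixed batch that any working training loop should be able to memorize.
XS = [-2.0, -1.0, 1.0, 2.0]
YS = [0.0, 0.0, 1.0, 1.0]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def batch_loss(w, b):
    """Mean binary cross-entropy of the model sigmoid(w * x + b) on the batch."""
    total = 0.0
    for x, y in zip(XS, YS):
        p = sigmoid(w * x + b)
        total -= y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return total / len(XS)


def train(steps, lr=0.5):
    """Plain gradient descent from zero weights; returns the loss after each step."""
    w, b, losses = 0.0, 0.0, []
    for _ in range(steps):
        grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in zip(XS, YS)) / len(XS)
        grad_b = sum(sigmoid(w * x + b) - y for x, y in zip(XS, YS)) / len(XS)
        w, b = w - lr * grad_w, b - lr * grad_b
        losses.append(batch_loss(w, b))
    return losses


def check_memorizes_tiny_batch(max_steps=200, target_loss=0.1):
    losses = train(max_steps)
    # Fails loudly if the loop cannot drive the loss down on four points.
    assert losses[-1] < target_loss, f"final loss {losses[-1]:.3f}"
    # Report the first step at which the target was reached.
    return next(i + 1 for i, loss in enumerate(losses) if loss < target_loss)


steps_needed = check_memorizes_tiny_batch()
print(f"memorized the batch in {steps_needed} steps")
```

Logging `steps_needed` over time gives a cheap signal for the subtler slowdowns described above: if the number creeps up after a code change, something may be making learning harder.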
323
+ **Make sure to tune memorization tests to run quickly, so you can
324
+ regularly run them**. If they are under 10 minutes or some short
325
+ threshold, they can be run on every PR or code change to better catch
326
+ breaking changes. A couple of ideas for speeding up these tests are
327
+ below:
328
+
329
+ ![](./media/image3.png)
330
+
331
+ Overall, these ideas lead to memorization tests that implement model
332
+ training on different time scales and allow you to mock out scenarios.
333
+
334
+ A solid, if expensive, idea for testing training is to **rerun old
335
+ training jobs with new code**. It's not something that can be run
336
+ frequently, but doing so can yield lessons about what unexpected changes
337
+ might have happened in your training pipeline. The main drawback is the
338
+ potential expense of running these tests. CI platforms like
339
+ [CircleCI](https://circleci.com/) charge a great deal for
340
+ GPUs, while others like GitHub Actions don't offer access to the
341
+ relevant machines easily.
342
+
343
+ The best option for testing training is to **regularly run training with
344
+ new data that's coming in from production**. This is still expensive,
345
+ but it is directly related to improvements in model development, not
346
+ just testing for breakages. Setting this up requires **a data flywheel**
347
+ similar to what we talked about in Lecture 1. Further tooling needed to
348
+ achieve this will be discussed down the line.
349
+
350
+ ### 2.4 - Adapt Regression Testing for Models
351
+
352
+ **Models are effectively functions**. They have inputs and produce
353
+ outputs like any other function in code. So, why not test them like
354
+ functions with regression testing? For specific inputs, we can check to
355
+ see whether the model consistently returns the same outputs. This is
356
+ best done with simpler models like classification models. It's harder to
357
+ maintain such tests with more complex models. However, even in a more
358
+ complex model scenario, regression testing can be useful for comparing
359
+ changes from training to production.
360
+
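The function-style regression test described above might look like the sketch below. The `model` here is a hypothetical keyword-counting stand-in for a trained classifier, and the recorded outputs in `GOLDEN_CASES` are made up to match it.

```python
import math


def model(text):
    """Stand-in for a trained classifier: returns P(positive) for a string.

    Hypothetical; in practice this would load and call the real model.
    """
    positive_words = {"great", "good", "love"}
    negative_words = {"bad", "awful", "hate"}
    words = text.lower().split()
    score = sum(w in positive_words for w in words) - sum(w in negative_words for w in words)
    return 1.0 / (1.0 + math.exp(-score))


# Inputs paired with outputs recorded from a previous, trusted model version.
GOLDEN_CASES = [
    ("I love this great product", 0.8808),
    ("awful and bad", 0.1192),
    ("it arrived on tuesday", 0.5),
]


def regressions(model_fn, cases, tolerance=1e-3):
    """Return the cases whose output drifted from the recorded value."""
    return [
        (text, expected, model_fn(text))
        for text, expected in cases
        if abs(model_fn(text) - expected) > tolerance
    ]


drifted = regressions(model, GOLDEN_CASES)
print(drifted)
```

The tolerance is a design choice: exact equality tends to raise false alarms on harmless numerical differences, while a loose tolerance can hide real changes in behavior.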
361
+ ![](./media/image11.png)
362
+
363
+
364
+ A more sophisticated approach to testing for ML models is to **use loss
365
+ values and model metrics to build documented test suites out of your
366
+ data**. Consider this similar to [the test-driven
367
+ development](https://en.wikipedia.org/wiki/Test-driven_development)
368
+ (TDD) code writing paradigm. The test that is written before your code
369
+ in TDD is akin to your model's loss performance; both represent the gap
370
+ between where your code needs to be and where it is. Over time, as we
371
+ improve the loss metric, our model is getting closer to passing "the
372
+ test" we've imposed on it. The gradient descent we use to improve the
373
+ model can be considered a TDD approach to machine learning models!
374
+
375
+ ![](./media/image9.png)
376
+
377
+
378
+ While gradient descent is somewhat like TDD, it's not *exactly* the same
379
+ because simply reviewing metrics doesn't tell us how to resolve model
380
+ failures (the way traditional software tests do).
381
+
382
+ To fill in this gap, **start by [looking at the data points that have
383
+ the highest loss](https://arxiv.org/abs/1912.05283)**. Flag
384
+ them for a test suite composed of "hard" examples. Doing this provides
385
+ two advantages: it helps find where the model can be improved, and it
386
+ can also help find errors in the data itself (e.g., poor labels).
387
+
388
+ As you examine these failures, you can aggregate types of failures into
389
+ named suites. For example, in a self-driving car use case, you could have
390
+ a "night time" suite and a "reflection" suite. **Building these test
391
+ suites can be considered the machine learning version of regression
392
+ testing**, where you take bugs that you've observed in production and
393
+ add them to your test suite to make sure that they don't come up again.
394
+
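The workflow of flagging high-loss examples and grouping them into named suites can be sketched as follows. The example ids, loss values, and tags are made up; in practice the tags would come from someone reviewing the flagged examples.

```python
from collections import defaultdict

# Per-example losses from an evaluation run; ids, losses, and tags are made up.
examples = [
    {"id": "img_001", "loss": 0.05, "tags": []},
    {"id": "img_002", "loss": 2.31, "tags": ["night time"]},
    {"id": "img_003", "loss": 1.87, "tags": ["reflection"]},
    {"id": "img_004", "loss": 0.12, "tags": ["night time"]},
    {"id": "img_005", "loss": 3.02, "tags": ["night time", "reflection"]},
    {"id": "img_006", "loss": 0.09, "tags": []},
]


def hardest(examples, k):
    """Return the k examples with the highest loss, worst first."""
    return sorted(examples, key=lambda ex: ex["loss"], reverse=True)[:k]


def build_suites(hard_examples):
    """Group hard examples into named suites by the tags reviewers assigned."""
    suites = defaultdict(list)
    for ex in hard_examples:
        for tag in ex["tags"] or ["untagged"]:
            suites[tag].append(ex["id"])
    return dict(suites)


suites = build_suites(hardest(examples, k=3))
print(suites)
```

Each named suite can then be evaluated separately on every new model, so a drop on "night time" examples is visible even when the aggregate metric improves.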
395
+ ![](./media/image8.png)
396
+
397
+ The method can be quite manual, but there are some options for speeding
398
+ it up. Partnering with the annotation team at your company can help make
399
+ developing these tests a lot faster. Another approach is to use a method
400
+ called [Domino](https://arxiv.org/abs/2203.14960) that
401
+ uses foundation models to find errors. Additionally, for testing NLP
402
+ models, use the
403
+ [CheckList](https://arxiv.org/abs/2005.04118) approach.
404
+
405
+ ### 2.5 - Test in Production, But Don't YOLO
406
+
407
+ It's crucial to test in true production settings. This is especially
408
+ true for machine learning models, because data is an important component
409
+ of both the production and the development environments. It's difficult
410
+ to ensure that both are very close to one another.
411
+
412
+ **The best way to solve the training and production difference is to
413
+ test in production**.
414
+
415
+ Testing in production isn't sufficient on its own. Rather, testing in
416
+ production allows us to develop tooling and infrastructure that let
417
+ us quickly resolve production errors (which are often quite
418
+ expensive). It reduces pressure on other kinds of testing, but does not
419
+ replace them.
420
+
421
+ ![](./media/image7.png)
422
+
423
+
424
+ We will cover in detail the tooling needed for production monitoring and
425
+ continual learning of ML systems in a future lecture.
426
+
427
+ ### 2.6 - ML Test Score
428
+
429
+ So far, we have discussed writing "smoke" tests for ML: expectation
430
+ tests for data, memorization tests for training, and regression tests
431
+ for models.
432
+
433
+ **As your code base and team mature, adopt a more full-fledged approach
434
+ to testing ML systems like the approach identified in the [ML Test
435
+ Score](https://research.google/pubs/pub46555/) paper**. The
436
+ ML Test Score is a rubric that evolved out of machine learning efforts
437
+ at Google. It's a strict rubric for ML test quality that covers data,
438
+ models, training, infrastructure, and production monitoring. It overlaps
439
+ with, but goes beyond some of the recommendations we've offered.
440
+
441
+ ![](./media/image2.png)
442
+
443
+ It's rather expensive, but worth it for high stakes use cases that need
444
+ to be really well-engineered! To be really clear, this rubric is
445
+ *really* strict. Even our Text Recognizer system we've designed so far
446
+ misses a few categories. Use the ML Test Score as inspiration to develop
447
+ the right testing approach that works for your team's resources and
448
+ needs.
449
+
450
+ ![](./media/image5.png)
451
+
452
+ ## 3 - Troubleshooting Models
453
+
454
+ **Tests help us figure out something is wrong, but troubleshooting is
455
+ required to actually fix broken ML systems**. Models often require the
456
+ most troubleshooting, and in this section we'll cover a three-step
457
+ approach to troubleshooting them.
458
+
459
+ 1. "Make it run" by avoiding common errors.
460
+
461
+ 2. "Make it fast" by profiling and removing bottlenecks.
462
+
463
+ 3. "Make it right" by scaling model/data and sticking with proven
464
+ architectures.
465
+
466
+ ### 3.1 - Make It Run
467
+
468
+ This is the easiest step for models; only a small portion of bugs cause
469
+ the kind of loud failures that prevent a model from running at all.
470
+ Watch out for these bugs in advance and save yourself the trouble of
471
+ models that don't run.
472
+
473
+ The first type of bugs that prevent models from running at all are
474
+ **shape errors.** When the shapes of the tensors don't match the
475
+ operations run on them, models can't be trained or run. Prevent these
476
+ errors by keeping notes on the expected size of tensors, annotating the
477
+ sizes in the code, and even stepping through your model code with a debugger
478
+ to check tensor size as you go.
479
+
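Annotating sizes in code can be as simple as a small assertion helper. The sketch below is illustrative only: `check_shape` is a hypothetical helper, not part of PyTorch, and it relies solely on the object exposing a `.shape` attribute (as PyTorch tensors and NumPy arrays do). A `SimpleNamespace` stands in for a tensor so the example runs without any ML library installed.

```python
from types import SimpleNamespace


def check_shape(tensor, expected, name="tensor"):
    """Raise a readable error if tensor.shape does not match `expected`.

    `expected` is a tuple of ints; use None for a dimension that may vary
    (for example, the batch dimension).
    """
    actual = tuple(tensor.shape)
    ok = len(actual) == len(expected) and all(
        e is None or e == a for a, e in zip(actual, expected)
    )
    if not ok:
        raise ValueError(f"{name}: expected shape {expected}, got {actual}")
    return tensor


# Stand-in for a batch of 32 single-channel 28x28 images.
images = SimpleNamespace(shape=(32, 1, 28, 28))
check_shape(images, (None, 1, 28, 28), name="images")  # passes

try:
    check_shape(images, (None, 3, 28, 28), name="images")  # wrong channel count
except ValueError as err:
    print(err)
```

Failing early with the tensor's name and both shapes is usually easier to debug than a matmul error raised several layers deeper.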
480
+ ![](./media/image10.png)
481
+
482
+
483
+ The second type of bugs is **out-of-memory errors**. These occur when
484
+ you try to push a tensor that is too large to fit onto a GPU. PyTorch
485
+ Lightning has good tools to prevent this. Make sure you're using the
486
+ lowest precision your training can tolerate; a good default is 16-bit
487
+ precision. Another common reason for this is trying to run a model on
488
+ too much data or too large a batch size. Use the autoscale batch size
489
+ feature in PyTorch Lightning to pick the right size batch. You can use
490
+ gradient accumulation if these batch sizes get too small. If neither of
491
+ these options works, you can look into manual techniques like tensor
492
+ parallelism and gradient checkpointing.
493
+
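To see why gradient accumulation lets you shrink the per-step batch without changing the update, here is a small sketch in plain Python. It uses a toy one-parameter model (`y = w * x` with mean squared error), chosen only for illustration, and assumes the micro-batches are all the same size:

```python
def mse_grad(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w over one batch."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_grad(w, xs, ys, micro_batch_size):
    """Same gradient, computed one micro-batch at a time.

    Assumes len(xs) is divisible by micro_batch_size, so every micro-batch
    carries equal weight in the average.
    """
    n_micro = len(xs) // micro_batch_size
    grad = 0.0
    for i in range(n_micro):
        lo, hi = i * micro_batch_size, (i + 1) * micro_batch_size
        # Scale each micro-batch gradient so the sum equals the full-batch mean.
        grad += mse_grad(w, xs[lo:hi], ys[lo:hi]) / n_micro
    return grad


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

full = mse_grad(0.5, xs, ys)
accumulated = accumulated_grad(0.5, xs, ys, micro_batch_size=2)
print(full, accumulated)
```

Only one micro-batch has to fit in memory at a time, at the cost of more forward and backward passes per optimizer step.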
494
+ **Numerical errors** also cause machine learning failures. This is when
495
+ NaNs or infinite values show up in tensors. These issues most commonly
496
+ appear first in the gradient and then cascade through the model. PyTorch
497
+ Lightning has a good tool for tracking and logging gradient norms. A
498
+ good way to check whether these issues are caused by limited precision is
499
+ to switch to 64-bit floats and see if that causes these issues to
500
+ go away. Normalization layers tend to cause these issues, generally
501
+ speaking. So watch out for how you do normalization!
502
+
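The idea behind gradient tracking can be sketched without any framework: compute an overall gradient norm to log, and report which parameters contain non-finite values. The parameter names and values below are invented, and the two helpers are hypothetical rather than PyTorch Lightning APIs:

```python
import math


def grad_norm(grads):
    """L2 norm over all gradient values in {param_name: [values]}."""
    return math.sqrt(sum(g * g for values in grads.values() for g in values))


def non_finite_params(grads):
    """Names of parameters whose gradients contain NaN or +/-inf."""
    return sorted(
        name
        for name, values in grads.items()
        if not all(math.isfinite(g) for g in values)
    )


# Hypothetical gradients after one backward pass.
healthy = {"layer1.weight": [0.3, -0.4]}
broken = {"layer1.weight": [0.3, -0.4], "norm.weight": [float("nan"), 1.0]}

print("norm:", grad_norm(healthy))
print("non-finite:", non_finite_params(broken))
```

Logging the norm every step makes it easier to see whether a NaN appeared suddenly or was preceded by a steadily exploding gradient.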
503
+ ### 3.2 - Make It Fast
504
+
505
+ ![](./media/image4.png)
506
+
507
+ Once you can run a model, you'll want it to run fast. This can be tricky
508
+ because the performance of DNN training code is very counterintuitive.
509
+ For example, transformers can actually spend more time in the MLP layer
510
+ than the attention layer. Similarly, trivial components like loading
511
+ data can soak up performance.
512
+
513
+ To solve these issues, the primary solution is to **roll up your sleeves
514
+ and profile your code**. You can often find pretty easy Python changes
515
+ that yield big results. Read these two tutorials by
516
+ [Charles](https://wandb.ai/wandb/trace/reports/A-Public-Dissection-of-a-PyTorch-Training-Step--Vmlldzo5MDE3NjU?galleryTag=&utm_source=fully_connected&utm_medium=blog&utm_campaign=using+the+pytorch+profiler+with+w%26b)
517
+ and [Horace](https://horace.io/brrr_intro.html) for more
518
+ details.
519
+
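As a first, CPU-side pass, Python's built-in `cProfile` is often enough to spot a slow data loader. The sketch below uses made-up stand-ins (`load_batch`, `train_step`) for a real training loop. It does not measure GPU kernels; for those, the tutorials linked above cover the PyTorch profiler.

```python
import cProfile
import io
import pstats
import time


def load_batch():
    """Stand-in for a slow data loader (hypothetical)."""
    time.sleep(0.02)
    return list(range(1000))


def train_step(batch):
    """Stand-in for the forward and backward pass (hypothetical)."""
    return sum(x * x for x in batch)


def train(n_steps=5):
    for _ in range(n_steps):
        train_step(load_batch())


profiler = cProfile.Profile()
profiler.enable()
train()
profiler.disable()

# Sort by cumulative time so the most expensive call paths appear first.
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(10)
print(report.getvalue())
```

In this toy setup the sleep inside `load_batch` is the deliberate bottleneck, which mirrors the point above that "trivial" components like data loading can soak up performance.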
520
+ ### 3.3 - Make It Right
521
+
522
+ After you make it run fast, make the model right. Unlike traditional
523
+ software, machine learning models are never truly perfect. Production
524
+ performance is never perfect. As such, it might be more appropriate to
525
+ say "make it as right as needed".
526
+
527
+ Knowing this, making the model run and run fast allows us to make the
528
+ model right through applying **scale.** To achieve performance benefits,
529
+ scaling a model or its data is generally a fruitful and achievable
530
+ route. It's a lot easier to scale a fast model. [Research from OpenAI
531
+ and other institutions](https://arxiv.org/abs/2001.08361)
532
+ is showing that benefits from scale can be rigorously measured and
533
+ predicted across compute budget, dataset size, and parameter count.
534
+
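The linked paper models loss as a power law in quantities like parameter count. As a rough illustration of what "measured and predicted" can mean, the sketch below fits `loss = a * size^(-alpha)` by least squares in log-log space and extrapolates to a larger size. The data points are synthetic, generated from a known power law, and are not results from the paper:

```python
import math


def fit_power_law(sizes, losses):
    """Least-squares fit of log(loss) = log(a) - alpha * log(size).

    Returns (a, alpha).
    """
    xs = [math.log(n) for n in sizes]
    ys = [math.log(loss) for loss in losses]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / sum(
        (x - x_mean) ** 2 for x in xs
    )
    intercept = y_mean - slope * x_mean
    return math.exp(intercept), -slope


# Synthetic data generated from loss = 10 * size^-0.1 (not real measurements).
sizes = [1e6, 1e7, 1e8, 1e9]
losses = [10 * n ** -0.1 for n in sizes]

a, alpha = fit_power_law(sizes, losses)
predicted = a * 1e10 ** -alpha  # extrapolate one order of magnitude further
print(f"a={a:.3f}, alpha={alpha:.3f}, predicted loss at 1e10: {predicted:.3f}")
```

Real training runs are noisier than this, so an extrapolation like this is best treated as a planning estimate rather than a guarantee.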
535
+ ![](./media/image12.png)
536
+
537
+ If you can't afford to scale yourself, consider finetuning a model
538
+ trained at scale for your task.
539
+
540
+ So far, all of the advice given has been model and task-agnostic.
541
+ Anything more detailed has to be specific to the model and the relevant
542
+ task. Stick close to working architectures and hyperparameters from
543
+ places like HuggingFace, and try not to reinvent the wheel!
544
+
545
+ ## 4 - Resources
546
+
547
+ Here are some helpful resources that discuss this topic.
548
+
549
+ ### Tweeters
550
+
551
+ 1. [Julia Evans](https://twitter.com/b0rk)
552
+
553
+ 2. [Charity Majors](https://twitter.com/mipsytipsy)
554
+
555
+ 3. [Nelson Elhage](https://twitter.com/nelhage)
556
+
557
+ 4. [kipply](https://twitter.com/kipperrii)
558
+
559
+ 5. [Horace He](https://twitter.com/cHHillee)
560
+
561
+ 6. [Andrej Karpathy](https://twitter.com/karpathy)
562
+
563
+ 7. [Chip Huyen](https://twitter.com/chipro)
564
+
565
+ 8. [Jeremy Howard](https://twitter.com/jeremyphoward)
566
+
567
+ 9. [Ross Wightman](https://twitter.com/wightmanr)
568
+
569
+ ### Templates
570
+
571
+ 1. [Lightning Hydra
572
+ Template](https://github.com/ashleve/lightning-hydra-template)
573
+
574
+ 2. [NN Template](https://github.com/grok-ai/nn-template)
575
+
576
+ 3. [Generic Deep Learning Project
577
+ Template](https://github.com/sudomaze/deep-learning-project-template)
578
+
579
+ ### Texts
580
+
581
+ 1. [Reliable ML Systems
582
+ talk](https://www.usenix.org/conference/opml20/presentation/papasian)
583
+
584
+ 2. ["ML Test Score"
585
+ paper](https://research.google/pubs/pub46555/)
586
+
587
+ 3. ["Attack of the Cosmic
588
+ Rays!"](https://blogs.oracle.com/linux/post/attack-of-the-cosmic-rays)
589
+
590
+ 4. ["Computers can be
591
+ understood"](https://blog.nelhage.com/post/computers-can-be-understood/)
592
+
593
+ 5. ["Systems that defy detailed
594
+ understanding"](https://blog.nelhage.com/post/systems-that-defy-understanding/)
595
+
596
+ 6. [Testing section from MadeWithML course on
597
+ MLOps](https://madewithml.com/courses/mlops/testing/)
documents/lecture-03.srt ADDED
@@ -0,0 +1,244 @@
1
+ 1
2
+ 00:00:00,080 --> 00:00:42,480
3
+ hey folks welcome to the third lecture of full stack deep learning 2022 i'm charles frye today i'll be talking about troubleshooting and testing a high level outline of what we're going to cover today we'll talk about testing software in general and the sort of standard tools and practices that you can use to de-risk shipping software quickly then we'll move on to special considerations for testing machine learning systems the specific techniques and approaches that work best there and then lastly we'll go through what you do when your models are failing their tests how you troubleshoot your model first let's cover concepts for testing software so the general approach that we're gonna take is that tests are gonna help us ship faster with
4
+
5
+ 2
6
+ 00:00:40,800 --> 00:01:26,159
7
+ fewer bugs but they aren't gonna catch all of our bugs and that means though we're going to use testing tools we aren't going to try and achieve 100 percent coverage similarly we're going to use linting tools to try and improve the development experience but leave escape valves rather than pedantically just following our style guides lastly we'll talk about tools for automating these workflows so first why are we testing at all tests can help us ship faster even if they aren't catching all bugs before they go into production so as a reminder what even are tests tests are code we write that's designed to fail in an intelligible way when our other code has bugs so for example this little test for the text recognizer from the full stack
8
+
9
+ 3
10
+ 00:01:24,320 --> 00:02:06,240
11
+ deep learning code base checks whether the output of the text recognizer on a particular input is the same as what was expected and raises an error if it's not and these kinds of tests can help catch some bugs before they're merged into main or shipped into production but they can't catch all bugs and one reason why is that test suites in the tools we're using are not certificates of correctness in some formal systems tests like those can actually be used to prove that code is correct but we aren't working in one of those systems like agda the theorem proving language or idris 2 we're writing in python and so really it's a loosey-goosey language and all bets are off in terms of code correctness so if test suites aren't
12
+
13
+ 4
14
+ 00:02:05,439 --> 00:02:49,599
15
+ like certificates of correctness then what are they like i like this framing from nelson elhage who's at anthropic ai who says that we should think of test suites as being more like classifiers and so to bring our intuition from working with classification algorithms and machine learning so the classification problem is does this commit have a bug or is it okay and what our classifier outputs is that the test passed or the test failed so our tests are our classifier of code and so you should think of that as a prediction of whether there's a bug this kind of frame shift suggests a different way of designing our test suites when we design classifiers we know that we need to trade off detection and false alarms
16
+
17
+ 5
18
+ 00:02:47,519 --> 00:03:34,159
19
+ lots of people are thinking about detection when they're designing their test suites they're trying to make sure that they will catch all of the possible bugs but in doing so we can inadvertently introduce false alarms so the classic signature of a false alarm is a failed test that's followed by a commit that fixes the test rather than the code so that's an example from the full stack deep learning code base so in order to avoid introducing too many false alarms it's useful to ask yourself two questions before adding a test so the first question is which real bugs will this test catch what are some actual ways that the world might change around this code or that somebody might introduce a change to some part of
20
+
21
+ 6
22
+ 00:03:32,159 --> 00:04:16,160
23
+ the code base that this test will catch once you've listed a couple of those then ask yourself what are some false alarms that this test might raise what are some ways that the world around the test or the code could change that's still valid in good code but now this test will fail and if you can think of more examples for the latter case than the former then maybe you should reconsider whether you really need this test one caveat to this in some settings it actually is really important that you have a super high degree of confidence in the correctness of your code so this screenshot is from a deep learning diagnostic tool for cardiac ultrasounds by caption health that i worked on in an internship in that project we had a ton
24
+
25
+ 7
26
+ 00:04:14,159 --> 00:05:00,240
27
+ of concern about the correctness of the model the confidence people had in the model and regulators expected to see that kind of information so there's other cases where this level of correctness is needed self-driving cars is one example you also see this in banking and finance there are a couple of patterns that immediately arise here one is the presence of regulators uh and more generally high stakes if you're operating in a high-stakes situation where errors have consequences for people's lives and livelihoods even if it's not regulated yet it might be regulated soon and in particular these are also all examples of those autonomous systems that class of low feasibility high impact machine learning project that we talked about in the
28
+
29
+ 8
30
+ 00:04:58,320 --> 00:05:44,240
31
+ first lecture this is one of the reasons for their low feasibility is because correctness becomes really important for these kinds of autonomous systems so what does this mindset mean for how we approach testing and quality assurance for our code it means that we're going to use testing tools but we don't want to aim for complete coverage of our code so in terms of tools pytest is the standard tool for testing python code it is a very pythonic implementation and interface and it has also a ton of powerful features like marks for creating separate suites of tests sharing resources across tests and running tests in a variety of parameterized variations in addition to writing the kinds of separate test suites that are standard in lots of
32
+
33
+ 9
34
+ 00:05:42,080 --> 00:06:29,440
35
+ languages in python there's a nice built-in tool called doctest for testing the code inside of our docstrings and this helps make sure that our docstrings don't get out of sync with our code which builds trust in the content of those docstrings and makes them easier to maintain doctests are really nice but there are some limits they're framed around code snippets that could be run in a terminal and so they can only display what can be easily displayed in a terminal notebooks on the other hand can display things like rich media charts images and web pages also interleaved with code execution and text so for example with our data processing code we have some notebooks that have charts and images in them that explain choices in
36
+
37
+ 10
38
+ 00:06:27,520 --> 00:07:08,960
39
+ that data processing trouble is notebooks are hard to test we use a cheap and dirty solution we make sure that our notebooks run end to end then we add some assert statements and we use nbformat to run the notebooks and flag when they fail so once you start adding lots of different types of tests and as your code base grows you're going to want to have tooling for recording what kind of code is actually being checked or covered by the tests typically this is done in terms of lines of code but some tools can be a little bit finer grained the tool that we recommend for this is called codecov it generates a lot of really nice visualizations that you can use to drill down or get a high level overview of the current state of
40
+
41
+ 11
42
+ 00:07:07,360 --> 00:07:53,039
43
+ your testing this is a great tool for helping you understand your testing and its state it can be incorporated into your testing effectively saying i'm going to reject commits not only where tests fail but also where test coverage goes down below some value or by a certain amount but we actually recommend against that personal experience interviews and even some published research suggests that only a small fraction of the tests that you write are going to generate the majority of your value and so the right tactic engineering wise is to expend the limited engineering effort that we have on the highest impact tests and making sure those are super high quality but if you set a coverage target then you're
44
+
45
+ 12
46
+ 00:07:51,280 --> 00:08:36,560
47
+ instead going to write tests in order to meet that coverage target regardless of their quality so you end up spending more effort both to write the tests and then to maintain and deal with their low quality in addition to checking that our code is correct we're going to also want to check that our code is clean with linting tools but with the caveat that we always want to make sure that there are escape valves from these tools when we say the code is clean what we mean is that it's of a uniform style and of a standard style so uniform style helps avoid spending engineering time on arguments over style in pull requests and code review it also helps improve the utility of our version control system by cutting down on
48
+
49
+ 13
50
+ 00:08:33,360 --> 00:09:20,959
51
+ unnecessary noisy components of diffs and reducing their size both of these things will make it easier for humans to visually parse the diffs in our version control system and make it easier to build automation around them and then we also generally want to adopt a standard style in whatever community it is that we are writing our code if you're an open source repository this is going to make it easier to accept contributions and even if you're working on a closed source team if your new team members are familiar with this style that's standard in the community they'll be faster to onboard one aspect of consistent style is consistent formatting of code with things like white space the standard tool for that in python is the black
52
+
53
+ 14
54
+ 00:09:18,320 --> 00:10:04,399
55
+ python formatter it's a very opinionated tool but it has a fairly narrow scope in terms of style it focuses on things that can be fully automated so you can see it not only detects deviations from style but also implements the fix so that's really nice integrate it into your editor integrate it into automated workflows and avoid engineers having to implement these things themselves for non-automatable aspects of style the tool we recommend is flake8 non-automatable aspects of style are things like missing docstrings we don't have good enough automation tools to reliably generate docstrings for code automatically so these are going to require engineers to intervene in order to fix them one of the best things about flake8 is that
56
+
57
+ 15
58
+ 00:10:02,000 --> 00:10:47,680
59
+ it comes with tons of extensions and plugins so you can check things like docstring style and completeness like type hinting and even for security issues and common bugs all via flake8 extensions so those cover your python code ml code bases often have both python code and shell scripts in them shell scripts are really powerful but they also have a lot of sharp edges so shellcheck knows about all these kinds of weird behaviors of bash that often cause errors and issues that aren't immediately obvious and it also provides explanations for why it's raising a warning or an error it's a very fast to run tool so you can incorporate it into your editor and because it includes explanations you can often resolve the
60
+
61
+ 16
62
+ 00:10:45,279 --> 00:11:33,600
63
+ issue without having to go to google or stack overflow and switch contexts out of your editing environment so these tools are great and a uniform style is important but really pedantically enforcing style can be self-defeating so i searched for the word placate on github and found over a hundred thousand commits mentioning placating these kinds of automated style enforcement tools and all these commits sort of drip frustration from engineers who are spending time on this that they wish that they were not so to avoid frustration with code style and linting we recommend filtering your rules down to the minimal style that achieves the goals that we set of sticking with standards and of avoiding arguments and
64
+
65
+ 17
66
+ 00:11:30,640 --> 00:12:12,880
67
+ of keeping version control history clean another suggestion is to have an opt-in rather than an opt-out application of rules so by default many of these rules may not be applied to all files in the code base but you can opt in and add a particular rule to a particular file and then you can sort of grow this coverage over time and avoid these kinds of frustrations this is especially important for applying these kinds of style recommendations to existing code bases which may have thousands of lines of code that need to be fixed in order to make best use of these testing and linting practices you're going to want to embrace automation as much as possible in your development workflows for the things we talked about already
68
+
69
+ 18
70
+ 00:12:11,760 --> 00:12:58,240
71
+ with testing and linting you're going to want to automate these and connect them to your cloud version control system and run these tasks in the cloud or otherwise outside of development environments so connecting to version control state reduces friction when trying to reproduce or understand errors and running things outside of developer environments means that you can run these tests in parallel to other development work so you can kick off tests that might take 10 or 20 minutes and spend that time responding to slack messages or moving on to other work one of the best places to learn about best practices for automation are popular open source repos so i checked out pytorch's github repository and found
72
+
73
+ 19
74
+ 00:12:55,040 --> 00:13:42,800
75
+ that they had tons and tons of automated workflows built into the repository they also followed what i think are some really nice practices like they had some workflows that are automatically running on every push and pull and these are mostly code related tasks that run for less than 10 minutes so that's things like linting and maybe some of the quicker tests other tasks that aren't directly code related but maybe do things like check dependencies and any code-related tasks that take more than 10 minutes to run are run on a schedule so we can see that for example closing stale pull requests is done on a schedule because it's not code related pytorch also runs a periodic suite of tests that takes hours to run you don't
76
+
77
+ 20
78
+ 00:13:40,639 --> 00:14:23,920
79
+ want to run that every time that you push or pull so the tool that they use and that we recommend is github actions this ties your automation entirely directly to your version control system and that has tons of benefits also github actions is really powerful it's really flexible there's a generous free tier it's performant and on top of all this it's really easy to use it's got really great documentation configuring github actions is done just using a yaml file and because of all these features it's been embraced by the open source community which has contributed lots and lots of github actions that maybe already automate the workflow that you're interested in that's why we recommend github actions there are other
80
+
81
+ 21
82
+ 00:14:21,360 --> 00:15:07,120
83
+ options precommit.ci circleci and jenkins all great choices all automation tools that i've seen work but github actions seems to have won hearts and minds in the open source community in the last couple years so that makes sure that these tests and lints are being run in code before it's shipped or before it's merged into main but part of our goal was to keep our version control history as clean as possible so we want to be able to run these locally as well and before committing and so for that we recommend a tool called pre-commit which can run all kinds of different tools and automations automatically before commits so it's extremely flexible and can run lots of stuff you will want to keep the total run time to just a few seconds or
84
+
85
+ 22
86
+ 00:15:05,440 --> 00:15:46,160
87
+ you'll discourage engineers from committing which can lead to work getting lost pre-commit is super easy to run locally in part because it separates out the environment for these linting tools from the rest of the development environment which avoids a bunch of really annoying tooling and system administration headaches they're also super easy to automate with github actions automation to ensure the quality and integrity of our software is a huge productivity enhancer that's broader than just ci/cd which is how you might hear tools like github actions referred to automation helps you avoid context switching if a task is being run fully automatically then you don't have to switch context and remember the
88
+
89
+ 23
90
+ 00:15:44,160 --> 00:16:26,880
91
+ command line arguments that you need in order to run your tool it surfaces issues more quickly than if these things were being run manually it's a huge force multiplier for small teams that can't just throw engineer hours at problems and it's better documented than manual processes the script or artifact that you're using to automate a process serves as documentation for how a process is done if somebody wants to do it manually the one caveat is that fully embracing automation requires really knowing your tools well knowing docker well enough to use it is not the same as knowing docker well enough to automate it and bad automation like bad tests can take more time away than it saves so organizationally that actually makes
92
+
93
+ 24
94
+ 00:16:24,720 --> 00:17:08,559
95
+ automation a really good task for senior engineers who have knowledge of these tools have ownership over code and can make these kinds of decisions around automation perhaps with junior engineer mentees to actually write the implementations so in summary automate tasks with github actions to reduce friction in development and move more quickly use the standard python toolkit for testing and cleaning your projects and choose in that toolkit the testing and linting practices with the 80/20 principle for tests with shipping velocity and with usability and developer experience in mind now that we've covered general ideas for testing software systems let's talk about the specifics that we need for testing machine learning systems the key point
96
+
97
+ 25
98
+ 00:17:06,880 --> 00:17:53,039
99
+ in this section is that testing ml is difficult but if we adopt ml-specific coding practices and focus on low hanging fruit to start then we can test our ml code and then additionally testing machine learning means testing in production but testing in production doesn't mean that you can just release bad code and let god sort it out so why is testing machine learning hard so software engineering is where a lot of testing practices have been developed and in software engineering we compile source code into programs so we write source code and a compiler turns that into a program that can take inputs and return outputs in machine learning training compiles in a sense data into a model and all of these components are
100
+
101
+ 26
102
+ 00:17:51,520 --> 00:18:46,240
103
+ harder to test in the machine learning case than in the software engineering case data is heavier and more inscrutable than source code training is more complex less well-defined and less mature than compilation and models have worse tools for debugging and inspection than compiled programs so this means that ml is the dark souls of software testing it's a notoriously difficult video game but just because something is difficult doesn't mean that it's impossible in the latest souls game elden ring a player named let me solo her defeated one of the hardest bosses in the game wearing nothing but a jar on their head if testing machine learning code is the dark souls of software testing then with practice and with the
104
+
105
+ 27
106
+ 00:18:44,080 --> 00:19:28,400
107
+ right techniques you can become the let me solo her of software testing and so in our recommendations in this section we're going to focus mostly on what are sometimes called smoke tests which let you know when something is on fire and help you resolve that issue so these tests are easy to implement but they are still very effective so they're among the 20 percent of tests that get us 80 percent of the value for data the kind of smoke testing we recommend is expectation testing so we test our data by checking basic properties we express our expectations about the data which might be things like there are no nulls in this column the completion date is after the start date and so with these you're going to want to start small checking
108
+
109
+ 28
110
+ 00:19:26,160 --> 00:20:13,679
111
+ only a few properties and grow them slowly and only test things that are worth raising alarms over worth sending people notifications worth bringing people in to try and resolve them so you might be tempted to say oh these are human heights they should be between four and eight feet but actually there are people between the heights of two and ten feet so loosening these expectations to avoid false positives is an important way to make them more useful so you can even say that height should be not negative and less than 30 feet and that will catch somebody maybe accidentally entering a height in inches but it doesn't express strong expectations about the statistical distribution of heights you could try and build something for expectation
112
+
113
+ 29
114
+ 00:20:11,200 --> 00:20:52,480
115
+ testing with a tool like pytest but there's enough specifics and there's good enough tools that it's worth reaching for something else so the tool we recommend is great expectations in part because great expectations automatically generates documentation for your data and quality reports and has built-in logging and alerting designed for expectation testing so we are going to go through this in the lab we'll go through a lot of the other tools that we've talked about in the lab this week so if you want to check out great expectations we recommend the made with ml tutorial on great expectations by goku mohandas loose expectation testing is a great start for testing your data pipeline what do you
116
+
117
+ 30
118
+ 00:20:50,720 --> 00:21:33,600
119
+ do as you move forward from that the number one recommendation that i have is to stay as close to your data as possible so from top to bottom we have data annotation setups going from furthest away from the model development team to closest one common pattern is that there's some benchmark data set with annotations that you're using which is super common in academia or there's an external annotation team which is very common in industry and in that case a lot of the detailed information about the data that you can learn by looking at it and using it yourself isn't going to be internalized into the organization so one way that that sometimes does get internalized is that at the start of the project some
120
+
121
+ 31
122
+ 00:21:31,280 --> 00:22:16,080
123
+ data will get annotated ad hoc by model developers especially if you're not using some external benchmark data set or you don't yet have budget for an external annotation team and that's an improvement but if the model developers who were around at the start of the project move on and as more developers get onboarded that knowledge is diluted better than that is an internal annotation team that has regular information flow whether that's stand-ups and syncs or exchange of documentation that information flows to the model developers but probably the best practice and one that i saw recommended by shreya shankar on twitter is to have a regular on-call rotation where model developers annotate data themselves ideally fresh data so that
124
+
125
+ 32
126
+ 00:22:13,919 --> 00:22:59,840
127
+ all members of the team who are developing models know about the data and develop intuition and expertise in the data for testing our training code we're going to use memorization testing so memorization is the simplest form of learning deep neural networks are very good at memorizing data and so checking whether your model can memorize a very small fraction of the full data set is a great smoke test for training and if a model can't memorize then something is clearly very wrong only really gross issues with training are going to show up with this test so your gradients aren't being calculated correctly you have a numerical issue your labels have been shuffled and subtle bugs in your model or your data
128
+
129
+ 33
130
+ 00:22:57,760 --> 00:23:38,240
131
+ are not going to show up in this but you can improve the coverage of this test by including the run time in the test because regressions there can reveal bugs that just checking whether you can eventually memorize a small data set wouldn't reveal so if you're including the wall time that can catch performance regressions but also if you're including the number of steps or epochs required to hit some criterion value of the loss then you can catch some of these small issues that make learning harder but not impossible there's a nice feature of pytorch lightning overfit_batches that can quickly implement this memorization test and if you design them correctly you can incorporate these tests into end-to-end model deployment testing to
132
+
133
+ 34
134
+ 00:23:36,880 --> 00:24:14,880
135
+ check to make sure that the data that the model memorized in training is also something it can correctly respond to in production with these memorization tests you're going to want to tune them to run quickly so that you can run them as often as possible if you can get them to under 10 minutes you might run them on every pull request or on every push so this is something that we worked on in updating the course for 2022 so the simplest way to speed these jobs up is to simply buy faster machines but if you're already on the fastest machines possible or you don't have budget then you start by reducing the size of the data set that the model is memorizing down to the batch size that you want to use in training once you reduce the
136
+
137
+ 35
138
+ 00:24:13,120 --> 00:24:52,559
139
+ batch size below what's in training you're starting to step further and further away from the training process that you're trying to test and so going down this list we're getting further and further from the thing that we're actually testing but allowing our tests to run more quickly the next step that can really speed up a memorization test is to turn off regularization which is meant to reduce overfitting and memorization is a form of overfitting so that means turning off dropout turning off augmentation you can also reduce the model size without reducing the architecture so reduce the number of layers reduce the width of layers while keeping all of those components in place and if that's not enough you can remove
140
+
141
+ 36
142
+ 00:24:50,799 --> 00:25:31,679
143
+ some of the most expensive components and in the end you should end up with a tier of memorization tests which are more or less close to how you actually train models in production that you can run on different time scales one recommendation you'll see moving forward and trying to move past just smoke testing by checking for memorization is to rerun old training jobs with new code so this is never something that you're going to be able to run on every push probably not nightly either if you're looking at training jobs that run for multiple days and the fact that this takes a long time to run is one of the reasons why it's going to be really expensive no matter how you do it for example if you use if you're gonna be
144
+
145
+ 37
146
+ 00:25:29,360 --> 00:26:10,720
147
+ doing this with circleci you'll need gpu runners to execute your training jobs but those are only available in the enterprise level plan which is twenty four thousand dollars a year at a bare minimum i've seen some very large bills for running gpus in circleci github actions on the other hand does not have gpu runners available so you'll need to host them yourself though it is on the roadmap to add gpu runners to github actions and that means that you're probably going to maybe double your training spend maybe you're adding an extra machine to rerun your training jobs or maybe you're adding to your cloud budget to pay for more cloud machines to run these jobs and all this expenditure here is only on testing code
148
+
149
+ 38
150
+ 00:26:09,120 --> 00:26:52,159
151
+ it doesn't have any connection to the actual models that we're trying to ship the best thing to do is to test training by regularly running training with new data that's coming in from production this is still going to be expensive because you're going to be running more training than you were previously but now that training spend is going to model development not code testing so it's easier to justify having this set up requires the data flywheel that we talked about in lecture one and that requires production monitoring tooling and all kinds of other things that we'll talk about in the monitoring and continual learning lecture lastly for testing our models we're going to adapt regression testing at a very base level
152
+
153
+ 39
154
+ 00:26:49,440 --> 00:27:31,279
155
+ models are effectively functions so we can test them like functions they take in inputs they produce outputs we can write down what those outputs should be for specific inputs and then test them this is easiest for classification and other tasks with simple output if you have really complex output like for example our text recognizer that returns text then these tests can often become really flaky and hard to maintain but even for those kinds of outputs you can use these tests to check for differences between how the model is behaving in training and how it's behaving in production the better approach which is still relatively straightforward is to use the values of the loss and your metrics to help build documented regression test
156
+
157
+ 40
158
+ 00:27:29,120 --> 00:28:10,880
159
+ suites out of your data out of the data you're using in training and the data you see in production the framing that i like to bring to this comes from test driven development so test driven development is a paradigm for testing that says first you write the test and that test fails because you haven't written any of the code that it's testing and then you write code until you pass the test this is straightforward to incorporate into testing our models because in some sense we're already doing it think of the loss as like a fuzzy test signal rather than simply failing or not failing the loss tells us how badly a test was failed so how badly did we miss the expected output on this particular input and so
160
+
161
+ 41
162
+ 00:28:09,360 --> 00:28:51,919
163
+ just like in test driven development that's a test that's written before that code writing process and during training our model is changing and it changes in order to do better on the tests that we're providing and the model stops changing once it passes the test so in some sense gradient descent is already test driven development and maybe that is an explanation for the karpathy quote that gradient descent writes better code than me but just because gradient descent is test-driven development doesn't mean that we're done testing our models because what's missing here is that the loss and other metrics are telling us that we're failing but they aren't giving us actionable insights or a way to resolve that failure the simplest and
164
+
165
+ 42
166
+ 00:28:49,679 --> 00:29:37,600
167
+ most generic example is to find data points with the highest loss in your validation and test set or coming from production and put them in a suite labeled hard but note that the problem isn't always going to be with the model searching for high loss examples does reveal issues about what your model is learning but it also reveals issues in your data like bad labels so this doesn't just test models it also tests data and then we want to aggregate individual failures that we observe into named suites of specific types of failure so this is an example from a self-driving car task of detecting pedestrians cases where pedestrians were not detected it's much easier to incorporate this into your workflows if you already have a
168
+
169
+ 43
170
+ 00:29:36,240 --> 00:30:13,840
171
+ connection between your model development team and your annotation team reviewing these examples here we can see that what seems to be the same type of failure is occurring more than once in two examples there's a pedestrian who's not visible because they're covered by shadows in two examples there are reflections off of the windshield that are making it harder to see the pedestrian and then some of the examples come from night scenes so we can collect these up create a data set with that label and treat these as test suites to drive model development decisions and so this is kind of like the machine learning version of a type of testing called regression testing where you take bugs that you've observed
172
+
173
+ 44
174
+ 00:30:12,159 --> 00:30:56,320
175
+ in production and add them to your test suite to make sure that they don't come up again so the process that i described is very manual but there's some hope that this process might be automated in the near future a recent paper described a method called domino that uses much much larger cross-modal embedding models so foundation models to understand what kinds of errors a smaller model like a specific model designed just to detect birds or pedestrians what kinds of mistakes is it making on images as your models get more mature and you understand their behavior and the data that they're operating on better you can start to test more features of your models so for more ways to test models with an emphasis on nlp see the
176
+
177
+ 45
178
+ 00:30:54,480 --> 00:31:38,399
179
+ checklist paper that talks about different ways to do behavioral testing of models in addition to testing data training and models in our development environment we're also going to want to test in production and the reason why is that production environments differ from development environments this is something that is true for complex software systems outside of machine learning so charity majors of honeycomb has been a big proponent of testing in production on these grounds and this is especially true for machine learning models because data is an important component of both the production and the development environments and it's very difficult to ensure that those two things are close to each other and so
180
+
181
+ 46
182
+ 00:31:36,080 --> 00:32:18,720
183
+ the solution in this case is to run our tests in production but testing in production doesn't mean only testing in production testing in production means monitoring production for errors and fixing them quickly as chip huyen the author of designing machine learning systems pointed out this means building infrastructure and tooling so that errors in production are quickly fixed doing this safely and effectively and ergonomically requires tooling to monitor production and a lot of that tooling is fairly new especially tooling that can handle the particular type of production monitoring that we need in machine learning we'll cover it along with monitoring and continual learning in that lecture so in summary we
184
+
185
+ 47
186
+ 00:32:16,720 --> 00:33:03,440
187
+ recommend focusing on some of the low-hanging fruit when testing ml and sticking to tests that can alert you to when the system is on fire so that means expectation tests of simple properties of data memorization tests for training and data-based regression tests for models but what about as your code base and your team matures one really nice rubric for organizing the testing of a really mature ml code base is the ml test score so the ml test score came out of google research and it's this really strict rubric for ml test quality so it includes tests for data models training infrastructure and production monitoring and it overlaps with but goes beyond some of the recommendations that we've
188
+
189
+ 48
190
+ 00:33:00,799 --> 00:33:48,559
191
+ given already maintaining and automating all of these tests is really expensive but it can be worth it for a really high stakes or large scale machine learning system so we didn't use the ml test score to design the text recognizer code base but we can check what we implemented against it some of the recommendations in the machine learning test score didn't end up being relevant for our model for example some of the data tests are organized around tabular data for traditional machine learning rather than for deep learning but there's still lots of really great suggestions in the ml test score so you might be surprised to see we're only hitting a few of these criteria in each category but that's a function of how
192
+
193
+ 49
194
+ 00:33:46,240 --> 00:34:33,040
195
+ strict this testing rubric is so they also provide some data on how teams doing ml at google did on this rubric and if we compare ourselves to that standard the text recognizer is about in the middle which is not so bad for a team not working with google scale resources tests alert us to the presence of bugs but in order to resolve them we'll need to do some troubleshooting and one of the components of the machine learning pipeline that's going to need the most troubleshooting and which is going to require very specialized approaches is troubleshooting models so the key idea in this section is to take a three-step approach to troubleshooting your model first make it run by avoiding the common kinds of errors that can
196
+
197
+ 50
198
+ 00:34:30,159 --> 00:35:18,240
199
+ cause crashes shape issues out of memory errors and numerical problems then make your model fast by profiling it and removing any bottlenecks then lastly make the model right improve its performance on test metrics by scaling out the model and the data and sticking with proven architectures first how do we make a model run luckily this step is actually relatively easy in that only a small portion of bugs in machine learning cause the kind of loud failure that we're tackling here so there's shape errors out of memory errors and numerical errors shape errors occur when the shapes of tensors don't match the shapes expected by the operations applied to them so while you're writing your pytorch code it's a good idea to
200
+
201
+ 51
202
+ 00:35:15,200 --> 00:35:58,640
203
+ keep notes on what you expect the shapes of your tensors to be to annotate those in the code as we do in the full stack deep learning code base and to even step through this code in a debugger checking the shapes as you go another one of the most common errors in deep learning is out of memory when you try and push a tensor to the gpu that's too large to fit on it something of a rite of passage for deep learning engineering luckily pytorch lightning has a bunch of really nice tools built in for this first make sure you're using the lowest precision that your training can tolerate a good default is half precision floats or 16-bit floats a common culprit is that you're trying to run your model on too much data at once
204
+
205
+ 52
206
+ 00:35:57,040 --> 00:36:40,400
207
+ on too large of a batch so you can use the auto scale batch size feature in pytorch lightning to pick a batch size that uses as much gpu memory as you have but no more and if that batch size is too small to get you stable gradients that can be used for training you can use gradient accumulation across batches also easily within lightning to get the same gradients that you would have gotten if you calculated on a much larger batch if none of those work and you're already operating on gpus with the maximum amount of ram then you'll have to look into manual techniques like tensor parallelism and gradient checkpointing another cause of crashes for machine learning models is numerical errors when tensors end up with nans or
208
+
209
+ 53
210
+ 00:36:38,240 --> 00:37:23,119
211
+ infinite values in them most commonly these numerical issues appear first in the gradient the gradients explode or shrink to zero and then the values of parameters or activations become infinite or nan so you can observe some of these gradient spikes occurring in some of the experiments for the dall-e mini project that have been publicly posted pytorch lightning comes with a nice tool for tracking gradient norms and logging them so that you can see them and correlate them with the appearance of nans and infinities and crashes in your training a nice debugging step to check what the cause might be whether the cause is due to precision issues or due to a more fundamental numerical issue is to switch to double precision floats the default
212
+
213
+ 54
214
+ 00:37:20,640 --> 00:38:02,960
215
+ floating point size in python 64-bit floats and see if that causes these issues to go away if it doesn't then that means that there's some kind of issue with your numerical code and you'll want to find a numerically stable implementation to base your work off of or apply error analysis techniques and one of the most common causes of these kinds of numerical errors are the normalization layers like batch norm and layer norm that's what's involved in these gradient spikes in dall-e mini so you'll want to make sure to check carefully that you're using normalization in the way that's been found to work for the types of data and architectures that you're using once your model can actually run end to end and calculate gradients correctly the
216
+
217
+ 55
218
+ 00:38:00,000 --> 00:38:47,040
219
+ next step is to make it go fast this could be tricky because the performance of deep neural network training code is very counter-intuitive for example with typical hyper-parameter choices transformer layers spend more time on the plain old mlp component than they do on the attention component and as we saw in lecture two for popular optimizers just keeping track of the optimizer state actually uses more gpu memory than any of the other things you might expect would take up that memory like model parameters or data and then furthermore without careful parallelization what seem like fairly trivial components like loading data can end up dwarfing what would seem like the actual performance bottlenecks like the
220
+
221
+ 56
222
+ 00:38:45,200 --> 00:39:31,440
223
+ forwards and backwards passes and parameter updates the only solution here is to kind of roll up your sleeves and get your hands dirty and actually profile your code so we'll see this in the lab but the good news is that you can often find relatively low hanging fruit to speed up training like making changes just in the regular python code and not in any component of the model and lastly once you've got a model that can run and that runs quickly it's time to make the model correct by reducing its loss on test or production data the normal recommendation for software engineering is make it run make it right make it fast so why is make it right last in this case and the reason why is that machine learning models are
224
+
225
+ 57
226
+ 00:39:29,359 --> 00:40:12,480
227
+ always wrong production performance is never perfect and if we think of non-zero loss as a partial test failure for our models then our tests are always at least partially failing so it's never really possible to truly make it right and then the other reason that we want to put performance first is you can kind of solve all your problems with model correctness with scale so if your model is overfitting to the training data and your production loss is way higher then you can scale up your data if your model is underfitting and you can't get the training loss to go down as much as you'd like then scale up your model if you have distribution shift which means that your training and validation loss are both low but your
228
+
229
+ 58
230
+ 00:40:09,920 --> 00:40:57,680
231
+ production or test loss is really high then just scale up both folks at openai and elsewhere have done work demonstrating that the performance benefits from scale can be very rigorously measured and predicted across compute budget data set size and parameter count generating these kinds of scaling law charts is an important component of openai's workflows for deciding how to build models and how to run training but scaling costs money so what do you do if you can't afford the level of scale required to reach the performance that you want in that case you're going to want to fine-tune or make use of a model trained at scale for your tasks this is something we'll talk about in the building on foundation models lecture all the other advice
232
+
233
+ 59
234
+ 00:40:55,200 --> 00:41:40,800
235
+ around addressing overfitting addressing underfitting resolving distribution shift is going to be model and task specific and it's going to be hard to know what is going to work without trying it so this is just a selection of some of the advice i've seen given or been given about improving model performance and they're mutually exclusive in many cases because they're so tied to the particular task and model and data that they're being applied to so the easiest way to resolve this is to stick as close as possible to working architectures and hyper parameter choices that you can get from places like the hugging face hub or papers with code and in fact this is really how these hyperparameter choices and architectures arise it's via a slow
236
+
237
+ 60
238
+ 00:41:38,560 --> 00:42:22,480
239
+ evolutionary process of people building on techniques and hyperparameter choices that work rather than people designing things entirely from scratch so that brings us to the end of the troubleshooting and testing lecture we covered the general approach to testing software both tools and practices that you can use to ship more safely more quickly then we covered the specific things that you need in order to test ml systems data sets training procedures and models both the most basic tests that you should implement at the beginning and then how to grow those into more sophisticated more robust tests and then lastly we considered the workflows and techniques that you need to troubleshoot model performance so
240
+
241
+ 61
242
+ 00:42:20,160 --> 00:42:42,359
243
+ we'll see more on all these topics in the lab for this week if you'd like to learn more about any of these topics check out the slides online for a list of recommended twitter follows project templates and medium to long form text resources to learn more about troubleshooting and testing that's all for this lecture thanks for listening and happy testing
244
+
documents/lecture-04.md ADDED
@@ -0,0 +1,421 @@
1
+ ---
2
+ description: Sourcing, storing, exploring, processing, labeling, and versioning data for deep learning.
3
+ ---
4
+
5
+ # Lecture 4: Data Management
6
+
7
+ <div align="center">
8
+ <iframe width="720" height="405" src="https://www.youtube-nocookie.com/embed/Jlm4oqW41vY?list=PL1T8fO7ArWleMMI8KPJ_5D5XSlovTW_Ur" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
9
+ </div>
10
+
11
+ Lecture by [Sergey Karayev](https://sergeykarayev.com).<br />
12
+ Notes by [James Le](https://twitter.com/le_james94) and [Vishnu Rachakonda](https://www.linkedin.com/in/vrachakonda/).<br />
13
+ Published August 29, 2022.
14
+ [Download slides](https://fsdl.me/2022-lecture-04-slides).
15
+
16
+ ## 1 - Introduction
17
+
18
+ One thing people don't quite get as they enter the field of ML is how
19
+ much of it deals with data - putting together datasets, exploring the
20
+ data, wrangling the data, etc. The key points of this lecture are:
21
+
22
+ 1. Spend 10x as much time exploring the data as you would like to.
23
+
24
+ 2. Fixing, adding, and augmenting the data is usually the best way to
25
+ improve performance.
26
+
27
+ 3. Keep it all simple!
28
+
29
+ ## 2 - Data Sources
30
+
31
+ ![](./media/image9.png)
32
+
33
+ There are many possibilities for the sources of data. You might have
34
+ images, text files, logs, or database records. In deep learning, you
35
+ need to get that data into a local filesystem disk next to a GPU. **How
36
+ you send data from the sources to training is different for each
37
+ project**.
38
+
39
+ - With images, you can simply download them from S3.
40
+
41
+ - With text files, you need to process them in some distributed way,
42
+ analyze the data, select a subset, and put that on a local
43
+ machine.
44
+
45
+ - With logs and database records, you can use a data lake to aggregate
46
+ and process the data.
47
+
48
+ ![](./media/image2.png)
49
+
50
+
51
+ The basics will be the same - a filesystem, object storage, and
52
+ databases.
53
+
54
+ ### Filesystem
55
+
56
+ The **filesystem** is a fundamental abstraction. Its fundamental unit is
57
+ a file - which can be text or binary, is not versioned, and is easily
58
+ overwritten. The filesystem is usually on a disk connected to your
59
+ machine - physically connected on-prem, attached in the cloud, or even
60
+ distributed.
61
+
62
+ The first thing to know about disks is that their speed and bandwidth
63
+ vary widely - from hard disks to solid-state disks. There are two orders of
64
+ magnitude of difference between the slowest (SATA hard disk) and the fastest
65
+ (NVMe SSD) disks. Below are some latency numbers you should know, with
66
+ the human-scale numbers in parentheses:
67
+
68
+ ![](./media/image12.png)
69
+
70
+
71
+ What format should the data be stored in on the local disk?
72
+
73
+ - If you work with binary data like images and audio, just use the
74
+ standard formats like JPEG or MP3 that it comes in.
75
+
76
+ - If you work with metadata (like labels), tabular data, or text data,
77
+ then compressed JSON or text files are just fine. Alternatively,
78
+ Parquet is a table format that is fast, compact, and widely used.
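
As a minimal sketch of the "compressed JSON" option above, the snippet below writes label metadata as gzip-compressed JSON lines and reads it back, using only the Python standard library. The record fields and the file name are hypothetical, chosen purely for illustration.

```python
import gzip
import json
import tempfile
from pathlib import Path

# Hypothetical label records; the field names are illustrative only.
labels = [
    {"image": "cat_001.jpg", "label": "cat"},
    {"image": "dog_042.jpg", "label": "dog"},
]


def write_jsonl_gz(records, path):
    """Write one JSON object per line into a gzip-compressed text file."""
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


def read_jsonl_gz(path):
    """Read the records back, one JSON object per line."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return [json.loads(line) for line in f]


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "labels.jsonl.gz"
    write_jsonl_gz(labels, path)
    restored = read_jsonl_gz(path)
```

One record per line keeps the file streamable, so a data loader can read it without parsing the whole file at once.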
79
+
80
+ ### Object Storage
81
+
82
+ The **object storage** is an API over the filesystem. Its fundamental
83
+ unit is an object, usually in a binary format (an image, a sound file, a
84
+ text file, etc.). We can build versioning or redundancy into the object
85
+ storage service. It is not as fast as the local filesystem, but it can be
86
+ fast enough within the cloud.
87
+
88
+ ### Databases
89
+
90
+ **Databases** provide persistent, fast, and scalable storage and retrieval
91
+ of structured data. A helpful mental model for this is: all the
92
+ data that the databases hold is actually in the computer's RAM, but the
93
+ database software ensures that if the computer gets turned off,
94
+ everything is safely persisted to disk. If too much data is in the RAM,
95
+ it scales out to disk in a performant way.
96
+
97
+ You should not store binary data in the database; store the object-store
98
+ URLs instead. [Postgres](https://www.postgresql.org/) is
99
+ the right choice most of the time. It is an open-source database that
100
+ supports unstructured JSON and queries over that JSON.
101
+ [SQLite](https://www.sqlite.org/) is also perfectly good
102
+ for small projects.
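
A minimal sketch of the "store URLs, not binary data" advice, using the `sqlite3` module from the Python standard library. The table layout, bucket name, and labels are assumptions made up for this example; the image bytes themselves would live in object storage.

```python
import sqlite3

# In-memory database for illustration; a real project would use a file or Postgres.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE photos (id INTEGER PRIMARY KEY, url TEXT NOT NULL, label TEXT)"
)

# Only the object-store URL and metadata go in the database (hypothetical bucket).
rows = [
    ("s3://example-bucket/photos/0001.jpg", "cat"),
    ("s3://example-bucket/photos/0002.jpg", "dog"),
    ("s3://example-bucket/photos/0003.jpg", "cat"),
]
conn.executemany("INSERT INTO photos (url, label) VALUES (?, ?)", rows)
conn.commit()

# Query the metadata, then fetch the matching objects from storage separately.
cat_urls = [
    url
    for (url,) in conn.execute(
        "SELECT url FROM photos WHERE label = ? ORDER BY id", ("cat",)
    )
]
```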
103
+
104
+ Most coding projects that deal with collections of objects that
105
+ reference each other will eventually implement a crappy database. Using
106
+ a database from the beginning will likely save you time. In fact, most
107
+ MLOps tools are databases at their core (e.g.,
108
+ [W&B](https://wandb.ai/site) is a database of experiments,
109
+ [HuggingFace Hub](https://huggingface.co/models) is a
110
+ database of models, and [Label
111
+ Studio](https://labelstud.io/) is a database of labels).
112
+
113
+ ![](./media/image11.png)
114
+
115
+
116
+ **Data warehouses** are stores for online analytical processing (OLAP),
117
+ as opposed to databases being the data stores for online transaction
118
+ processing (OLTP). You get data into the data warehouse through a
119
+ process called **ETL (Extract-Transform-Load)**: Given a number of data
120
+ sources, you extract the data, transform it into a uniform schema, and
121
+ load it into the data warehouse. From the warehouse, you can run
122
+ business intelligence queries. A key difference between OLAP and OLTP
123
+ stores is that the former are column-oriented, while the latter are row-oriented.
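
To make the row-oriented versus column-oriented distinction concrete, here is a toy sketch in plain Python. It is only an illustration of the two layouts, not how any real database stores data, and the order records are invented for the example.

```python
# Row-oriented layout (OLTP style): each record is kept together,
# which suits looking up or updating one whole record at a time.
rows = [
    {"order_id": 1, "customer": "ana", "amount": 30.0},
    {"order_id": 2, "customer": "bo", "amount": 12.5},
    {"order_id": 3, "customer": "ana", "amount": 7.5},
]

# Column-oriented layout (OLAP style): each column is kept together,
# which suits aggregating one field across many records.
columns = {key: [row[key] for row in rows] for key in rows[0]}


def fetch_order(order_id):
    """Transactional access pattern: retrieve one complete record."""
    return next(row for row in rows if row["order_id"] == order_id)


# Analytical access pattern: scan a single column and ignore the others.
total_amount = sum(columns["amount"])
second_order = fetch_order(2)
```

The analytical query touches only the `amount` column, which is why column stores can skip most of the data for queries of this shape.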
124
+
125
+ ![](./media/image13.png)
126
+
127
+
128
+ **Data lakes** are unstructured aggregations of data from multiple
129
+ sources. The main difference between them and data warehouses is that
130
+ data lakes use an ELT (Extract-Load-Transform) process: dumping all the
131
+ data in and transforming it for specific needs later.
132
+
133
+ **The big trend is unifying both data lake and data warehouse, so that
134
+ structured data and unstructured data can live together**. The two big
135
+ platforms for this are
136
+ [Snowflake](https://www.snowflake.com/) and
137
+ [Databricks](https://www.databricks.com/). If you are
138
+ really into this stuff, "[Designing Data-Intensive
139
+ Applications](https://dataintensive.net/)" is a great book
140
+ that walks through it from first principles.
141
+
142
+ ## 3 - Data Exploration
143
+
144
+ ![](./media/image4.png)
145
+
146
+ To explore the data, you must speak its language, mostly SQL and,
147
+ increasingly, DataFrames. **SQL** is the standard interface for
148
+ structured data, which has existed for decades. **Pandas** is the main
149
+ DataFrame library in the Python ecosystem that lets you do SQL-like things. Our
150
+ advice is to become fluent in both to interact with both transactional
151
+ databases and analytical warehouses and lakes.
152
+
153
+ [Pandas](https://pandas.pydata.org/) is the workhorse of
154
+ Python data science. You can try [Dask
155
+ DataFrame](https://examples.dask.org/dataframe.html) to
156
+ parallelize Pandas operations over cores and
157
+ [RAPIDS](https://rapids.ai/) to do Pandas operations on
158
+ GPUs.
159
+
160
+ ## 4 - Data Processing
161
+
162
+ ![](./media/image8.png)
163
+
164
+ When talking about data processing, it's useful to have a motivating
165
+ example. Let's say we have to train a photo popularity predictor every
166
+ night. For each photo, the training data must include:
167
+
168
+ 1. Metadata (such as posting time, title, and location) that sits in
169
+ the database.
170
+
171
+ 2. Some features of the user (such as how many times they logged in
172
+ today) that need to be computed from logs.
173
+
174
+ 3. Outputs of photo classifiers (such as content and style), which
175
+ require running those classifiers.
176
+
177
+ Our ultimate task is to train the photo predictor model, but we need to
178
+ export data from the database, compute features from the logs, and run classifiers to
179
+ output their predictions. As a result, we have **task dependencies**.
180
+ Some tasks can't start until others are finished, so finishing a task
181
+ should kick off the tasks that depend on it.
182
+
183
+ Ideally, dependencies are not only files but also programs and
184
+ databases. We should be able to spread this work over many machines and
185
+ execute many dependency graphs all at once.
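As a rough illustration of task dependencies, the photo popularity example can be written as a small dependency graph and executed in a valid order. This sketch uses Python's standard-library `graphlib` rather than any workflow manager; the task names and the `run_task` placeholder are assumptions made for the example.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks that must finish before it can start.
# Task names are illustrative, taken from the photo popularity example.
dag = {
    "export_metadata": set(),
    "compute_user_features": set(),
    "run_photo_classifiers": set(),
    "train_popularity_model": {
        "export_metadata",
        "compute_user_features",
        "run_photo_classifiers",
    },
}


def run_task(name):
    """Placeholder for real work (a SQL export, a log job, model inference)."""
    return f"{name}: done"


def run_dag(graph):
    """Run tasks in an order that respects dependencies; return that order."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises CycleError if the graph is not acyclic
    order = []
    while sorter.is_active():
        ready = sorter.get_ready()  # tasks whose dependencies are all finished
        for task in ready:          # a real scheduler would hand these to workers
            run_task(task)
            order.append(task)
            sorter.done(task)       # finishing a task unblocks its dependents
    return order


execution_order = run_dag(dag)
```

The three upstream tasks have no dependencies on each other, so a scheduler is free to run them in parallel on different machines; only the training task has to wait.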
186
+
187
+ ![](./media/image7.png)
188
+
189
+
190
+ - [Airflow](https://airflow.apache.org/) is a standard
191
+ scheduler for Python, where it's possible to specify the DAG
192
+ (directed acyclic graph) of tasks using Python code. The operators
193
+ in that graph can be SQL operations or Python functions.
194
+
195
+ - To distribute these jobs, the workflow manager has a queue for the
196
+ tasks and manages the workers that pull from it. It will restart
197
+ jobs if they fail and ping you when the jobs are finished.
198
+
199
+ - [Prefect](https://www.prefect.io/) and
200
+ [Dagster](https://dagster.io/) are contenders to
201
+ improve and replace Airflow in the long run.
202
+
203
+ The primary advice here is not to **over-engineer things**. You can get
204
+ machines with many CPU cores and a lot of RAM nowadays, and UNIX itself
205
+ has powerful, highly optimized tools for parallelism and streaming.
206
+
207
+ ## 5 - Feature Store
208
+
209
+ ![](./media/image3.png)
210
+
211
+ Let's say your data processing generates artifacts you need for
212
+ training. How do you make sure that, in production, the trained model
213
+ sees data processed the same way as it was during training?
214
+ How do you avoid recomputation during retraining?
215
+
216
+ **Feature stores** are a solution to this (that you may not need!).
217
+
218
+ - The first mention of feature stores came from [this Uber blog post
219
+ describing their ML platform,
220
+ Michelangelo](https://eng.uber.com/michelangelo-machine-learning-platform/).
221
+ They had an offline training process and an online prediction
222
+ process, so they built an internal feature store for both
223
+ processes to be in sync.
224
+
225
+ - [Tecton](https://www.tecton.ai/) is the leading SaaS
226
+ feature store solution.
227
+
228
+ - [Feast](https://feast.dev/) is a common open-source
229
+ option.
230
+
231
+ - [Featureform](https://www.featureform.com/) is a
232
+ relatively new option.
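The two questions a feature store answers (train/serve consistency and avoiding recomputation) can be sketched in a few lines. This is a toy illustration, not the API of Tecton, Feast, or Featureform; the class, the feature names, and the in-memory dictionary are all hypothetical.

```python
from datetime import date


def compute_features(user_logins, photo_title):
    """Single source of truth for feature logic, shared by training and serving."""
    return {
        "logins_today": len(user_logins),
        "title_length": len(photo_title),
    }


class InMemoryFeatureStore:
    """Toy store keyed by (entity_id, day); a real one would be a database."""

    def __init__(self):
        self._rows = {}
        self.compute_calls = 0

    def get_or_compute(self, entity_id, day, user_logins, photo_title):
        key = (entity_id, day)
        if key not in self._rows:  # compute once, reuse on retraining
            self._rows[key] = compute_features(user_logins, photo_title)
            self.compute_calls += 1
        return self._rows[key]


store = InMemoryFeatureStore()
day = date(2022, 8, 8)

# Offline: the nightly training job materializes features.
train_row = store.get_or_compute("photo-1", day, ["08:00", "12:30"], "Sunset")
# Online: the prediction service reads the same row instead of re-deriving it.
serve_row = store.get_or_compute("photo-1", day, ["08:00", "12:30"], "Sunset")
```

If a shared function and a small table like this cover your needs, that is a sign you may not need a dedicated feature store yet.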
233
+
234
+ ## 6 - Datasets
235
+
236
+ ![](./media/image1.png)
237
+
238
+ What about datasets specifically made for machine learning?
239
+
240
+ [HuggingFace
241
+ Datasets](https://huggingface.co/docs/datasets) is a great
242
+ source of machine learning-ready data. There are 8000+ datasets covering
243
+ a wide variety of tasks, like computer vision, NLP, etc. The Github-Code
244
+ dataset on HuggingFace is a good example of how these datasets are
245
+ well-suited for ML applications. Github-Code can be streamed, is in the
246
+ modern Apache Parquet format, and doesn't require you to download 1TB+
247
+ of data in order to properly work with it. Another sample dataset is
248
+ RedCaps, which consists of 12M image-text pairs from Reddit.
249
+
250
+ ![](./media/image15.png)
251
+
252
+
253
+ Another interesting dataset solution for machine learning is
254
+ [Activeloop](https://www.activeloop.ai/). This tool is
255
+ particularly well equipped to work with data and explore samples without
256
+ needing to download it.
257
+
258
+ ## 7 - Data Labeling
259
+
260
+ ![](./media/image10.png)
261
+
262
+ ### No Labeling Required
263
+
264
+ The first thing to talk about when it comes to labeling data
265
+ is...**maybe we don't have to label data?** There are a couple of
266
+ options here we will cover.
267
+
268
+ **Self-supervised learning** is a very important idea that allows you to
269
+ avoid painstakingly labeling all of your data. You can use parts of your
270
+ data to label other parts of your data. This is very common in NLP right
271
+ now. This is further covered in the foundation model lecture. The long
272
+ and short of it is that models can have elements of their data masked
273
+ (e.g., the end of a sentence can be omitted), and models can use earlier
274
+ parts of the data to predict the masked parts (e.g., I can learn from
275
+ the beginning of the sentence and predict the end). This can even be
276
+ used across modalities (e.g., computer vision *and* text), as [OpenAI
277
+ CLIP](https://github.com/openai/CLIP) demonstrates.
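A minimal sketch of how unlabeled text turns into labeled training pairs, assuming simple whitespace tokenization and a hypothetical `[MASK]` token (real models use subword tokenizers and their own special tokens):

```python
def next_word_examples(sentence):
    """Turn one unlabeled sentence into (context, target) training pairs."""
    words = sentence.split()
    return [(" ".join(words[:i]), words[i]) for i in range(1, len(words))]


def masked_word_example(sentence, position, mask_token="[MASK]"):
    """Hide the word at `position`; the hidden word becomes the label."""
    words = sentence.split()
    target = words[position]
    masked = words[:position] + [mask_token] + words[position + 1:]
    return " ".join(masked), target


pairs = next_word_examples("the cat sat on the mat")
masked_input, label = masked_word_example("the cat sat on the mat", 2)
```

No human labeled anything here: every target comes from the data itself, which is what makes the approach scale.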
278
+
279
+ ![](./media/image14.png)
280
+
281
+
282
+ **Data augmentation** is an almost compulsory technique to adopt,
283
+ especially for vision tasks. Frameworks like
284
+ [torchvision](https://github.com/pytorch/vision) help with
285
+ this. In data augmentation, samples are modified (e.g., brightened)
286
+ without actually changing their core "meaning." Interestingly,
287
+ augmentation can actually replace labels.
288
+ [SimCLR](https://ai.googleblog.com/2020/04/advancing-self-supervised-and-semi.html)
289
+ is a method that demonstrates this: its learning objective is to
290
+ maximize agreement between augmented views of the same image and
291
+ minimize agreement between different images.
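The "maximize agreement, minimize agreement" objective can be sketched as a contrastive loss over pairs of views. The code below is a simplified pure-Python version of the normalized temperature-scaled cross-entropy (NT-Xent) loss, not the authors' implementation; the 2-D embeddings and the temperature of 0.5 are made-up values for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nt_xent_loss(views_a, views_b, temperature=0.5):
    """views_a[i] and views_b[i] embed two augmentations of the same image i."""
    z = views_a + views_b
    n = len(views_a)
    total = 0.0
    for i in range(2 * n):
        positive = (i + n) % (2 * n)  # index of the other view of the same image
        logits = [cosine(z[i], z[k]) / temperature for k in range(2 * n) if k != i]
        pos_logit = cosine(z[i], z[positive]) / temperature
        log_denominator = math.log(sum(math.exp(value) for value in logits))
        total += log_denominator - pos_logit  # -log softmax of the positive pair
    return total / (2 * n)


# Two images, two views each (made-up 2-D embeddings).
aligned = nt_xent_loss([[1.0, 0.0], [0.0, 1.0]], [[0.9, 0.1], [0.1, 0.9]])
mixed_up = nt_xent_loss([[1.0, 0.0], [0.0, 1.0]], [[0.1, 0.9], [0.9, 0.1]])
```

The loss is lower when views of the same image sit close together (`aligned`) than when they are closer to views of the other image (`mixed_up`), so minimizing it pushes the model toward the desired embedding without any labels.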
292
+
293
+ For other forms of data, there are a couple of augmentation tricks that
294
+ can be applied. You can delete some cells in tabular data to simulate
295
+ missing data. In text, there aren't established techniques, but ideas
296
+ include changing the order of words or deleting words. In speech, you
297
+ could change the speed, insert pauses, etc.
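The tabular and text tricks above are easy to prototype. Below is a small sketch with the standard library only; the function names, the drop probability of 0.3, and the sample row are assumptions for illustration, and whether these transformations preserve the label depends on your task.

```python
import random


def drop_cells(row, drop_prob, rng):
    """Replace each cell with None with probability drop_prob (simulated missing data)."""
    return {key: (None if rng.random() < drop_prob else value) for key, value in row.items()}


def delete_words(text, drop_prob, rng):
    """Randomly delete words; keep at least one so the sample is never empty."""
    words = text.split()
    kept = [w for w in words if rng.random() >= drop_prob]
    return " ".join(kept) if kept else rng.choice(words)


def swap_adjacent_words(text, rng):
    """Swap one randomly chosen pair of neighboring words."""
    words = text.split()
    if len(words) < 2:
        return text
    i = rng.randrange(len(words) - 1)
    words[i], words[i + 1] = words[i + 1], words[i]
    return " ".join(words)


rng = random.Random(0)  # fixed seed so the augmentations are reproducible
row = {"age": 31, "city": "Berlin", "logins": 4}
sentence = "the quick brown fox jumps"

augmented_row = drop_cells(row, drop_prob=0.3, rng=rng)
shorter = delete_words(sentence, drop_prob=0.3, rng=rng)
swapped = swap_adjacent_words(sentence, rng)
```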
298
+
299
+ **Synthetic data** is an underrated idea. You can synthesize data based
300
+ on your knowledge of the label. For example, you can [create
301
+ receipts](https://github.com/amoffat/metabrite-receipt-tests)
302
+ if your need is to learn how to recognize receipts from images. This can
303
+ get very sophisticated and deep, so tread carefully.
304
+
305
+ You can also get creative and ask your users to label data for you.
306
+ Google Photos, as any user of the app knows, regularly gets users to
307
+ label whether the people in different photos are the same or different.
308
+
309
+ ![](./media/image16.png)
310
+
311
+
312
+ This is an example of the data flywheel: users improve the data, which
313
+ improves the model, which in turn makes their product experience
314
+ better.
315
+
316
+ ### Labeling Solutions
317
+
318
+ These are all great options for avoiding labeling data. However,
319
+ **you'll usually have to label some data to get started.**
320
+
321
+ Labeling has standard annotation features, like bounding boxes, that
322
+ help capture information properly. Training annotators properly is more
323
+ important than the particular kind of annotation. Standardizing how
324
+ annotators approach a complex, subjective task is crucial. Labeling
325
+ guidelines can help capture the exact right label from an annotator.
326
+ Quality assurance is key to ensuring annotation and labeling are
327
+ happening properly.
328
+
329
+ There are a few options for sourcing labor for annotations:
330
+
331
+ 1. Full-service data labeling vendors offer end-to-end labeling
332
+ solutions.
333
+
334
+ 2. You can hire and train annotators yourself.
335
+
336
+ 3. You can crowdsource annotation on a platform like Mechanical Turk.
337
+
338
+ **Full-service companies offer a great solution that abstracts the need
339
+ to build software, manage labor, and perform quality checks**. It makes
340
+ sense to use one. Before settling on one, make sure to dedicate time to
341
+ vet several. Additionally, label some gold standard data yourself to
342
+ understand the data and to evaluate contenders. Take calls with
343
+ several contenders, ask for work samples on your data, and compare them
344
+ to your own labeling performance.
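Comparing vendor work samples against your own gold standard labels can be as simple as an agreement rate. The sketch below uses made-up image IDs, labels, and vendor names; for real evaluations you may want per-class breakdowns or a chance-corrected metric as well.

```python
def agreement_rate(gold, candidate):
    """Fraction of gold-standard items where the candidate label matches."""
    shared = [item for item in gold if item in candidate]
    if not shared:
        raise ValueError("no overlapping items to compare")
    matches = sum(gold[item] == candidate[item] for item in shared)
    return matches / len(shared)


# Labels you produced yourself (hypothetical).
gold_labels = {"img1": "cat", "img2": "dog", "img3": "cat", "img4": "bird"}

# Work samples returned by two hypothetical vendors on the same images.
vendor_samples = {
    "vendor_a": {"img1": "cat", "img2": "dog", "img3": "dog", "img4": "bird"},
    "vendor_b": {"img1": "cat", "img2": "cat", "img3": "dog", "img4": "bird"},
}

scores = {name: agreement_rate(gold_labels, labels) for name, labels in vendor_samples.items()}
best_vendor = max(scores, key=scores.get)
```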
345
+
346
+ - [Scale AI](https://scale.com/) is the dominant data
347
+ labeling solution. It offers an API that allows you to spin up
348
+ tasks.
349
+
350
+ - Additional contenders include
351
+ [Labelbox](https://labelbox.com/) and
352
+ [Supervisely](https://supervise.ly/).
353
+
354
+ - [Label Studio](https://labelstud.io/) is an open-source
355
+ solution for performing annotation yourself, with a companion
356
+ enterprise version. It has a great set of features that allow you
357
+ to design your interface and even plug in models for active
358
+ learning!
359
+
360
+ - [Diffgram](https://diffgram.com/) is a competitor to
361
+ Label Studio.
362
+
363
+ - Recent offerings, like
364
+ [Aquarium](https://www.aquariumlearning.com/) and
365
+ [Scale Nucleus](https://scale.com/nucleus), have
366
+ started to help concentrate labeling efforts on parts of the
367
+ dataset that are most troublesome for models.
368
+
369
+ - [Snorkel](https://snorkel.ai/) is a dataset management
370
+ and labeling platform that uses weak supervision, which is a
371
+ similar concept. You can leverage composable rules (e.g., all
372
+ sentences that have the term "amazing" are positive sentiments)
373
+ that allow you to label data much faster than if you were to
374
+ treat every data point the same.
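The composable-rules idea behind weak supervision can be sketched with plain functions that either vote for a label or abstain. This is not Snorkel's API: the rule names are hypothetical, and majority voting is a simplification (production systems typically weigh rules with a learned model rather than counting votes).

```python
from collections import Counter

POSITIVE, NEGATIVE, ABSTAIN = 1, 0, -1


def lf_amazing(text):
    """Rule from the notes: sentences containing "amazing" are positive."""
    return POSITIVE if "amazing" in text.lower() else ABSTAIN


def lf_terrible(text):
    """Hypothetical rule: "terrible" signals negative sentiment."""
    return NEGATIVE if "terrible" in text.lower() else ABSTAIN


def lf_not_good(text):
    """Hypothetical rule: "not good" signals negative sentiment."""
    return NEGATIVE if "not good" in text.lower() else ABSTAIN


LABELING_FUNCTIONS = [lf_amazing, lf_terrible, lf_not_good]


def weak_label(text):
    """Combine rule votes by majority; abstain when no rule fires or votes tie."""
    votes = [lf(text) for lf in LABELING_FUNCTIONS]
    counts = Counter(v for v in votes if v != ABSTAIN)
    if not counts:
        return ABSTAIN
    ranked = counts.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]


labels = [weak_label(t) for t in [
    "An amazing camera for the price",
    "Terrible battery, not good at all",
    "It arrived on Tuesday",
    "Amazing screen but terrible speakers",
]]
```

Examples where the rules abstain or disagree are good candidates for manual labeling, since that is where the cheap rules run out.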
375
+
376
+ In conclusion, try to avoid labeling using techniques like
377
+ self-supervised learning. If you can't, use labeling software and
378
+ eventually outsource the work to the right vendor. If you can't afford
379
+ vendors, consider hiring part-time annotators rather than crowdsourcing
380
+ the work, to ensure quality.
381
+
382
+ ## 8 - Data Versioning
383
+
384
+ ![](./media/image6.png)
385
+
386
+ Data versioning comes with a spectrum of approaches:
387
+
388
+ 1. Level 0 is bad. In this case, data just lives on some file system
389
+ and is not versioned. Models are part code, part data, so if the
390
+ data is unversioned, the models are effectively unversioned too.
391
+ As a consequence, you will be unable to get back to a previous
392
+ level of performance if need be.
393
+
394
+ 2. You can prevent this with Level 1, where you snapshot your
395
+ data each time you train. This somewhat works but is far from
396
+ ideal.
397
+
398
+ 3. In Level 2, data is versioned like code, as a commingled asset with
399
+ versioned code. You can use a system like
400
+ [git-lfs](https://git-lfs.github.com/) that allows
401
+ you to store large data assets alongside code. This works really
402
+ well!
403
+
404
+ 4. Level 3 involves specialized solutions for working with large data
405
+ files, but this may not be needed unless you have a very specific
406
+ need (i.e., uniquely large or compliance-heavy files).
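One way to see why versioning data matters is to compute a version ID from the data's content and record it next to each trained model. The sketch below is an illustration of that idea with the standard library, not how git-lfs or DVC work internally; the function name, the 12-character ID, and the sample file are assumptions.

```python
import hashlib
import tempfile
from pathlib import Path


def dataset_version(data_dir):
    """Hash every file's relative path and bytes into one short version ID."""
    digest = hashlib.sha256()
    for path in sorted(Path(data_dir).rglob("*")):
        if path.is_file():
            digest.update(path.relative_to(data_dir).as_posix().encode())
            digest.update(path.read_bytes())
    return digest.hexdigest()[:12]


with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp)
    (data_dir / "labels.csv").write_text("img1,cat\nimg2,dog\n")
    version_before = dataset_version(data_dir)
    unchanged_version = dataset_version(data_dir)

    # Fixing one label changes the bytes, and therefore the version ID.
    (data_dir / "labels.csv").write_text("img1,cat\nimg2,cat\n")
    version_after = dataset_version(data_dir)
```

Storing the ID alongside the code commit for each training run tells you exactly which data produced which model, which is the property Level 0 lacks.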
407
+
408
+ ![](./media/image5.png)
409
+
410
+ [DVC](https://dvc.org/) is a great tool for this. DVC
411
+ helps upload your data asset to a remote storage location every time you
412
+ commit changes to the data file or trigger a commit; it functions like a
413
+ fancier git-lfs. It adds features like lineage for data and model
414
+ artifacts, allowing you to recreate pipelines.
415
+
416
+ Several techniques are associated with privacy-controlled data, like
417
+ [federated
418
+ learning](https://blog.ml.cmu.edu/2019/11/12/federated-learning-challenges-methods-and-future-directions/),
419
+ differential privacy, and learning on encrypted data. These techniques
420
+ are still in research, so they aren't quite ready for an FSDL
421
+ recommendation.
documents/lecture-04.srt ADDED
@@ -0,0 +1,160 @@
1
+ 1
2
+ 00:00:00,080 --> 00:00:45,920
3
+ hey everyone welcome to week four of full stack deep learning my name is sergey i have my assistant mishka right here there she is and today we're going to be talking about data management one of the things that people don't quite get as they enter the field of machine learning is just how much of it is actually just dealing with data putting together data sets looking at data munching data it's like half of the problem and it's more than half of the job for a lot of people but at the same time it's not something that people want to do the key points of this presentation are going to be that you should do a lot of it you should spend about 10 times as much time exploring the data as you would
4
+
5
+ 2
6
+ 00:00:44,879 --> 00:01:36,240
7
+ like to and let it really just flow through you and usually the best way to improve performance of your model is going to be fixing your data set adding to the data set or maybe augmenting your data as you train and the last key point is keep it all simple you might be overwhelmed especially if you haven't been exposed to a lot of this stuff before there's a lot of words and terminology in different companies you don't have to do any of it and in fact you might benefit if you keep it as simple as possible that said we're going to be talking about this area of the mlops landscape and we'll start with the sources of data there's many possibilities for the sources of data you might have images you might have text files you might have
8
+
9
+ 3
10
+ 00:01:34,880 --> 00:02:22,319
11
+ maybe logs database records but in deep learning you're going to have to get that data onto some kind of local file system disk right next to a gpu so you can send data and train and how exactly you're going to do that is different for every project different for every company so maybe you're training on images and you simply download the images that's all it's going to be from s3 or maybe you have a bunch of text that you need to process in some distributed way then analyze the data select the subset of it put that on the local machine or maybe you have a nice process with a data lake that ingests logs and database records and then from that you can aggregate and process it so that's always going to be different
12
+
13
+ 4
14
+ 00:02:20,480 --> 00:03:13,440
15
+ but the basics are always going to be the same and they concern the file system object storage and databases so the file system is the fundamental abstraction and the fundamental unit of it is a file which can be a text file or a binary file it's not versioned and it can be easily overwritten or deleted and usually this is the file system is on a disk that's connected to your machine may be physically connected or maybe attached in the cloud or maybe it's even the distributed file system although that's less common now and we'll be talking about directly connected disks the first thing to know about disks is that the speed of them and the bandwidth of them is a quite quite a range from hard disks which are
16
+
17
+ 5
18
+ 00:03:11,200 --> 00:04:05,120
19
+ usually spinning magnetic disks to solid-state disks which can be connected through the sata protocol or the nvme protocol and there's two orders of magnitude difference between the slowest which is like sata spinning disks and the fastest which are nvme solid state disks and making these slides i realized okay i'm showing you that but there's also some other latency numbers you should know so there's a famous document that you might have seen on the internet originally credited to jeff dean who i think credited peter norvig from google but i added human scale numbers in parens so here's how it's going to go so if you access the l1 l2 maybe even l3 cache of the cpu it's a very limited store of data but
20
+
21
+ 6
22
+ 00:04:03,599 --> 00:04:59,280
23
+ it's incredibly fast it only takes a nanosecond to access and in human scale you might think of it as taking a second and then accessing ram is the next fastest thing and it's about 100 times slower but it's still incredibly fast and then that's just kind of finding something in ram but reading a whole megabyte sequentially from ram is now 250 microseconds which if the cache access took a second now it's taken two and a half days to read a megabyte from ram and if you're reading a megabyte from a sata connected ssd drive now you're talking about weeks so it's one and a half weeks and if you're reading one megabyte of data from a spinning disk now we're talking about months and finally if you're sending a packet
24
+
25
+ 7
26
+ 00:04:57,120 --> 00:05:51,120
27
+ of data from california across the ocean to europe and then back we're talking about years on a human scale and 150 milliseconds on the absolute scale and if gpu timing info i'd love to include it here so please just send it over to full stack so what format should data be stored on the local disk if it's binary data like images or audio just use the standard formats like jpegs or mp3 that it comes in they're already compressed you can't really do better than that for the metadata like labels or tabular data or text data compressed json or text files are just fine or parquet is a table format that's fast it's compressed by default as it's written and read that's compact and it's very widely used now let's talk about
28
+
29
+ 8
30
+ 00:05:48,960 --> 00:06:46,000
31
+ object storage i think of it as an api over the file system where the fundamental unit is now an object and it's usually binary so it's maybe an image or a sound file but it could also be a text we can build in versioning or redundancy into the object storage service so instead of a file that can easily be overwritten and isn't versioned we can say that an object whenever i update it it's actually just updating the version of it s3 is the most common example and it's not as fast as local file system but it's fast enough especially if you're staying within the cloud databases are persistent fast and scalable storage and retrieval of structured data systems the mental model that i like to use is
32
+
33
+ 9
34
+ 00:06:45,039 --> 00:07:38,960
35
+ that all the data that the database holds is actually in the ram of the computer but the database software ensures that if the computer gets turned off everything is safely persisted to disk and if it actually is too much data for ram it scales out to disk but still in a very performant way do not store binary data in the database you should store the object store urls to the binary data in the database instead postgres is the right choice it's an open source database and most of the time it's what you should use for example it supports unstructured json and queries over that unstructured json but sqlite is perfectly good for small projects it's a self-contained binary every language has an interface to it
36
+
37
+ 10
38
+ 00:07:35,919 --> 00:08:24,879
39
+ even your browser has it and i want to stress that you should probably be using a database most coding projects like anything that deals with collections of objects that reference each other like maybe you're dealing with snippets of text that come from documents and documents of authors and maybe authors have companies or something like that this is very common and that code base will probably implement some kind of database and you can save yourself time and gain performance if you just use the database from the beginning and many mlops tools specifically are at their core databases like weights and biases is a database of experiments hugging face model hub is a database of models label studio which we'll talk about is a
40
+
41
+ 11
42
+ 00:08:22,960 --> 00:09:22,320
43
+ database of labels plus obviously user interfaces for generating the labels and uploading the models and stuff like that but coming from an academic background i think it's important to fully appreciate databases data warehouses are stores for online analytical processing as opposed to databases which are data stores for online transaction processing and the difference i'll cover in a second but the way you get data into data warehouses is another acronym called etl extract transform load so maybe you have a number of data sources here it's like files database oltp database and some sources in the cloud you'll extract data transform it into a uniform schema and then load it into the data warehouse and then from the
44
+
45
+ 12
46
+ 00:09:20,160 --> 00:10:15,839
47
+ warehouse we can run business intelligence queries we know that it's archived and so what's the difference between olaps and oltps like why are they different software platforms instead of just using postgres for everything so the difference is olaps for analytical processing are usually column oriented which lets you do queries what's the mean length of the text of comments over the last 30 days and it lets them be more compact because if you're storing the column you can compress that whole column in storage and oltps are usually row oriented and those are for queries select all the comments for this given user data lakes are unstructured aggregation of data from multiple sources so the main difference to data
48
+
49
+ 13
50
+ 00:10:12,399 --> 00:11:10,720
51
+ warehouses is that instead of extract transform load it's extract load into the lake and then transform later and the trend is unifying both so both unstructured and structured data should be able to live together the big two platforms for this are snowflake and databricks and if you're interested in this stuff this is a really great book that walks through the stuff from first principles that i think you will enjoy now that we have our data stored if we would like to explore it we have to speak the language of data and the language of data is mostly sql and increasingly it's also data frames sql is the standard interface for structured data it's existed for decades it's not going away it's worth
52
+
53
+ 14
54
+ 00:11:08,480 --> 00:12:01,200
55
+ being able to at least read and it's well worth being able to write and for python pandas is the main data frame solution which basically lets you do sql-like things but in code without actually writing sql our advice is to become fluent in both this is how you interact with both transactional databases and analytical warehouses and lakes pandas is really the workhorse of python data science i'm sure you've seen it i just wanted to give you some tips if pandas is slow on something it's worth trying dask data frames which have the same interface but they parallelize operations over many cores and even over multiple machines if you set that up and something else that's worth trying if you have gpus available is rapids and
56
+
57
+ 15
58
+ 00:11:59,600 --> 00:12:55,040
59
+ nvidia rapids lets you do a subset of what pandas can do but on gpus so significantly faster for a lot of types of data so talking about data processing it's useful to have a motivational example so let's say we have to train a photo popularity predictor every night and for each photo training data must include maybe metadata about the photos such as the posting time the title that the user gave the location was taken maybe some features of the user and then maybe outputs of classifiers of the photo for content maybe style so the metadata is going to be in the database the features we might have to compute from logs and the photo classifications we're going to need to run those classifiers so we have dependencies our ultimate
60
+
61
+ 16
62
+ 00:12:52,959 --> 00:13:50,800
63
+ task is to train the photopredictor model but to do we need to output data from database compute stuff from logs and run classifiers to output their predictions what we'd like is to define what we have to do and as things finish they should kick off their dependencies and everything should ideally not only have not only be files but programs and databases we should be able to spread this work over many machines and we're not the only ones running this job or this isn't the only job that's running on these machines how do we actually schedule multiple such jobs airflow is a pretty standard solution for python where it's possible to specify the acyclical graph of tasks using python code and the operators in that graph can be
64
+
65
+ 17
66
+ 00:13:48,320 --> 00:14:52,320
67
+ sql operations or actually python functions and other plugins for airflow and to distribute these jobs the workflow manager has a queue has workers that report to it will restart jobs if they fail and will ping you when the jobs are done prefect is another solution that's been to improve over airflow it's more modern and dagster is another contender for the airflow replacement the main piece of advice here is don't over engineer this you can get machines with many cpu cores and a ton of ram nowadays and unix itself has powerful parallelism streaming tools that are highly optimized and this is a little bit of a contrived example from a decade ago but hadoop was all the rage in 2014 it was a distributed data processing
68
+
69
+ 18
70
+ 00:14:51,120 --> 00:15:43,279
71
+ framework and so to run some kind of job that just aggregated a bunch of text files and computed some statistics over them the author spanned set up a hadoop job and it took 26 minutes to run but just writing a simple unix command that reads all the files greps for the string sorts it and gives you the unique things was only 70 seconds and part of the reason is that this is all actually happening in parallel so it's making use of your cores pretty efficiently and you can make even more efficient use of them with the parallel command or here it's an argument to xargs and that's not to say that you should do everything just in unix but it is to say that just because the solution exists doesn't mean that it's right for you it
72
+
73
+ 19
74
+ 00:15:41,680 --> 00:16:39,120
75
+ might be the case that you can just run your stuff in a single python script on your 32 core pc feature stores you might have heard about the situation that they deal with is all the data processing we're doing is generating artifacts that we'll need for training time so how do we ensure that in production the model that was trained sees data where the same processing took place as happened during training time and also when we retrain how do we avoid recomputing things that we don't need to recompute so feature stores are a solution to this that you may not need the first mention i saw of feature stores was in this blog post from uber describing their machine learning platform michelangelo
76
+
77
+ 20
78
+ 00:16:36,560 --> 00:17:43,520
79
+ and so they had offline training process and an online prediction process and they had feature stores for both that had to be in sync tecton is probably the leading saas solution to a feature store for open source solutions feast is a common one and i recently came across featureform that looks pretty good as well so if this is something you need check it out if it's not something you need don't feel like you have to use it in summary binary data like images sound files maybe compressed text should be stored as objects metadata about the data like labels or user activity with object should be stored in the database don't be afraid of sql but also know if you're using data frames there are accelerated solutions to them
80
+
81
+ 21
82
+ 00:17:41,200 --> 00:18:35,200
83
+ if dealing with stuff like logs and other sources of data that are disparate it's worth setting up a data lake to aggregate all of it in one place you should have a repeatable process to aggregate the data you need for training which might involve stuff like airflow and depending on the expense and complexity of processing a feature store could be useful at training time the data that you need should be copied over to a file system on a really fast local drive and then you should optimize gpu transfer so what about specifically data sets for machine learning training hugging face data sets is a great hub of data there's over 8 000 data sets for vision nlp speech etc so i wanted to take a look at a few
84
+
85
+ 22
86
+ 00:18:33,360 --> 00:19:28,320
87
+ example data sets here's one called github code it's over a terabyte of text 115 million code files the hugging face library the datasets library allows you to stream it so you don't have to download the terabyte of data in order to see some examples of it and the underlying format of the data is parquet tables so there's thousands of parquet tables each about half a gig that you can download piece by piece another example data set is called red caps pretty recently released 12 million image text pairs from reddit the images don't come with the data you need to download the images yourself make sure as you download it's multithreaded they give you example code and the underlying format then of the
88
+
89
+ 23
90
+ 00:19:26,080 --> 00:20:21,600
91
+ database are the images you download plus json files that have the labels or the text that came with the images so the real foundational format of the data is just the json files and there's just urls in those files to the objects that you can then download here's another example data set common voice from wikipedia 14 000 hours of speech in 87 languages the format is mp3 files plus text files with the transcription of what the person's saying there's another interesting data set solution called activeloop where you can also explore data stream data to your local machine and even transform data without saving it locally it has a pretty cool viewer of the data so here's looking at microsoft
92
+
93
+ 24
94
+ 00:20:18,159 --> 00:21:14,159
95
+ coco computer vision data set and in order to get it onto your local machine it's a simple hub.load the next thing we should talk about is labeling and the first thing to talk about when it comes to labeling is maybe we don't have to label data self-supervised learning is a very important idea that you can use parts of your data to label other parts of your data so in natural language this is super common right now and we'll talk more about this in the foundational models lecture but given a sentence i can mask the last part of the sentence and to use the first part of the sentence to predict how it's going to end but i can also mask the middle of the sentence and use the whole sentence to predict the middle or i can even mask
96
+
97
+ 25
98
+ 00:21:12,640 --> 00:22:04,640
99
+ the beginning of the sentence and use the completion of the sentence to predict the beginning in vision you can extract patches and then predict the relationship of the patches to each other and you can even do it across modalities so openai clip which we'll talk about in a couple of weeks is trained in this contrastive way where a number of images and the number of text captions are given to the model and the learning objective is to minimize the distance between the image and the text that it came with and to maximize the distance between the image and the other texts the and when i say between the image and the text the embedding of the image and the embedding of the texts and this led to great results this is
100
+
101
+ 26
102
+ 00:22:02,640 --> 00:22:56,960
103
+ one of the best vision models for all kinds of tasks right now data augmentation is something that must be done for training vision models there's frameworks including torchvision that provide you functions to do this it's changing the brightness of the data the contrast cropping it skewing it flipping it all kinds of transformations that basically don't change the meaning of the image but change the pixels of the image this is usually done in parallel to gpu training on the cpu and interestingly the augmentation can actually replace labels so there's a paper called simclr where the learning objective is to extract different views of an image and maximize the agreement or the similarity of the
104
+
105
+ 27
106
+ 00:22:55,679 --> 00:23:50,000
107
+ embeddings of the views of the same image and minimize the agreement between the views of the different images so without labels and just with data augmentation and a clever learning objective they were able to learn a model that performs very well for even supervised tasks for non-vision data augmentation if you're dealing with tabular data you could delete some of the table cells to simulate what it would be like to have missing data for text i'm not aware of like really well established techniques but you could maybe delete words replace words with synonyms change the order of things and for speech you could change the speed of the file you could insert pauses you could remove some stuff you
108
+
109
+ 28
110
+ 00:23:47,039 --> 00:24:36,000
111
+ can add audio effects like echo you can strip out certain frequency bands synthetic data is also something where the labels would basically be given to you for free because you use the label to generate the data so you know the label and it's still somewhat of an underrated idea that's often worth starting with we certainly do this in the lab but it can get really deep right so you can even use 3d rendering engines to generate very realistic vision data where you know exactly the label of everything in the image and this was done for receipts in this project that i link here you can also ask your users if you have users to label data for you i love how google photos does this they always ask me is this the same or different person
112
+
113
+ 29
114
+ 00:24:34,480 --> 00:25:29,919
115
+ and this is sometimes called the data flywheel right where i'm incentivized to answer because it helps me experience the product but it helps google train their models as well because i'm constantly generating data but usually you might have to label some data as well and data labeling always has some standard set of features there's bounding boxes key points or part of speech tagging for text there's classes there's captions what's important is training the annotators so whoever will be doing the annotation make sure that they have a complete rulebook of how they should be doing it because there's reasonable ways to interpret the task so here's some examples like if i'm only seeing the head of the fox should i label only
116
+
117
+ 30
118
+ 00:25:27,360 --> 00:26:21,360
119
+ the head or should i label the inferred location of the entire fox behind the rock it's unclear and quality assurance is something that's going to be key to annotation efforts because different people are just differently able to adhere to the rules where do you get people to annotate you can work with full-service data labeling companies you can hire your own annotators probably part-time and maybe promote the most able ones to quality control or you could potentially crowdsource this was popular in the past with mechanical turk the full service companies provide you the software stack the labor to do it and quality assurance and it probably makes sense to use them so how do you pick one you should at
120
+
121
+ 31
122
+ 00:26:18,480 --> 00:27:12,880
123
+ first label some data yourself to make sure that you understand the task and you have a gold standard that you can evaluate companies on then you should probably take calls with several of the companies or just try them out if they let you try it out online get a work sample and then look at how the work sample agrees with your own gold standard and then see how the price of the annotation compares scale dot ai is probably the dominant data labeling solution today and they take an api approach to this where you create tasks for them and then receive results and there are many other annotation solutions like labelbox and supervisely and there's just a million more label studio is an open source solution that you can run yourself
124
+
125
+ 32
126
+ 00:27:11,440 --> 00:28:05,600
127
+ there's an enterprise edition for managed hosting but there's an open source edition that you can just run in a docker container on your own machine we're going to use it in the lab and it has a lot of different interfaces for text images you can create your own interfaces you can even plug in models and do active learning for annotation diffgram is something i've come across but i haven't used it personally they claim to be better than label studio and it looks pretty good an interesting feature that i've seen some software offerings have is evaluate your current model on your data and then explore how it performed such that you can easily select subsets of data for further labeling or potentially
128
+
129
+ 33
130
+ 00:28:03,360 --> 00:28:56,320
131
+ find mistakes in your labeling and just understand how your model is performing on the data aquarium learning and scale nucleus are both solutions to this that you can check out snorkel you might have heard about and it's using this idea of weak supervision where if you have a lot of data to label some of it is probably really easy to label if you're labeling sentiment of text and if they're using the word wonderful then it's probably positive so if you can create a rule that says if the text contains the word wonderful just apply the positive label to it and you create a number of these labeling functions and then the software intelligently composes them and it could be a really fast way to
132
+
133
+ 34
134
+ 00:28:54,159 --> 00:29:44,640
135
+ go through a bunch of data there's the open source project of snorkel and there's the commercial platform and i recently came across rubrix which is a very similar idea that's fully open source so in conclusion for labeling first think about how you can do self-supervised learning and avoid labeling if you need to label which you probably will need to do use labeling software and really get to know your data by labeling it yourself for a while after you've done that you can write out detailed rules and then outsource to a full service company otherwise if you don't want to outsource or you can't afford it you should probably hire some part-time contractors and not try to crowdsource because crowdsourcing is a lot of
136
+
137
+ 35
138
+ 00:29:42,480 --> 00:30:35,120
139
+ quality assurance overhead it's a lot better to just find a good person who you can trust to do the job and just have them label lastly in today's lecture we can talk about versioning i like to think of data versioning as a spectrum where the level zero is unversioned and level three is a specialized data versioning solution so level zero is bad okay where you have data that just lives on the file system or is on s3 or in a database and it's not versioned so you train a model you deploy the model and the problem is when you deploy the model what you're deploying is partly the code but partly the data that generated the weights right and if the data is not versioned then your model is in effect not
140
+
141
+ 36
142
+ 00:30:33,039 --> 00:31:21,679
143
+ versioned and so what will probably happen is that your performance will degrade at some point and you won't be able to get back to a previous level of high performance so you can solve this with level one each time you train you just take a snapshot of your data and you store it somewhere so this kind of works because you'll be able to get back to that performance by retraining but it'd be nicer if i could just version the data as easily as code not through some separate process and that's where we arrive at level two where we just version data exactly in the same way as we version code so let's say we're having a data set of audio files and text transcriptions so we're going to upload the audio files
144
+
145
+ 37
146
+ 00:31:19,679 --> 00:32:12,559
147
+ to s3 that's probably where they were to begin with and the labels for the files we can just store in a parquet file or a json file where it's going to be the s3 url and the transcription of it now even this metadata file can get pretty big it's a lot of text but you can use git lfs which stands for large file storage and we can just add them and the git add will version the data file exactly the same as you version your code file and this can totally work you do not definitely need to go to level three level three would be using a specialized solution for versioning data and this usually helps you store large files directly and it could totally make sense but just don't assume that you need it right away if you can get away with just
148
+
149
+ 38
150
+ 00:32:11,039 --> 00:33:08,240
151
+ git lfs that would be the fsdl recommendation if it's starting to break then the leading solution for level three versioning is dvc and there's a table comparing the different versioning solutions like pachyderm but this table is biased towards dvc because it's by a solution that's github for dvc called dagshub and the way dvc works is you set it up you add your data file and then the most basic thing it does is it can upload to s3 or google cloud storage or whatever some other network storage whatever you set up every time you commit it'll upload your data somewhere and it'll make sure it's versioned so it's like a replacement for git lfs but you can go further and you can also record the lineage of
152
+
153
+ 39
154
+ 00:33:06,000 --> 00:34:05,519
155
+ the data so how exactly was this data generated how does this model artifact get generated so you can use dvc run to mark that and then use dvc to recreate the pipelines the last thing i want to say is we get a lot of questions at fsdl about privacy sensitive data and this is still a research area there's no kind of off-the-shelf solution we can really recommend federated learning is a research area that refers to training a global model from data on local devices without the model training process having access to the local data so there's a federated server that has the model and it sends what to do to local models and then it syncs back the models and differential privacy is another term
156
+
157
+ 40
158
+ 00:34:02,640 --> 00:34:52,200
159
+ this is for aggregating data such that even though you have the data it's aggregated in such a way that you can't identify the individual points so it should be safe to train on sensitive data because you won't actually be able to understand the individual points of it and another topic that is in the same vein is learning on encrypted data so can i have data that's encrypted that i can't decrypt but can i still do machine learning on it in a way that generates useful models and these three things are all research areas and i'm not aware of like really good off-the-shelf solutions for them unfortunately that concludes our lecture on data management thank you
160
+
documents/lecture-05.md ADDED
@@ -0,0 +1,788 @@
1
+ ---
2
+ description: How to turn an ML model into an ML-powered product
3
+ ---
4
+
5
+ # Lecture 5: Deployment
6
+
7
+ <div align="center">
8
+ <iframe width="720" height="405" src="https://www.youtube-nocookie.com/embed/W3hKjXg7fXM?list=PL1T8fO7ArWleMMI8KPJ_5D5XSlovTW_Ur" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
9
+ </div>
10
+
11
+ Lecture by [Josh Tobin](https://twitter.com/josh_tobin_).<br />
12
+ Notes by [James Le](https://twitter.com/le_james94) and [Vishnu Rachakonda](https://www.linkedin.com/in/vrachakonda/).<br />
13
+ Published September 5, 2022.
14
+ [Download slides](https://fsdl.me/2022-lecture-05-slides).
15
+
16
+ ## Introduction
17
+
18
+ ![](./media/image21.png)
19
+
20
+ Deploying models is a critical part of making your models good, to begin
21
+ with. When you only evaluate the model offline, it's easy to miss the
22
+ more subtle flaws that the model has, where it doesn't actually solve
23
+ the problem that your users need it to solve. Oftentimes, when we deploy
24
+ a model for the first time, only then do we really see whether that
25
+ model is actually doing a good job or not. Unfortunately, for many data
26
+ scientists and ML engineers, model deployment is an afterthought
27
+ relative to other techniques we have covered.
28
+
29
+ Much like other parts of the ML lifecycle, we'll focus on deploying a
30
+ minimum viable model as early as possible, which entails **keeping it
31
+ simple and adding complexity later**. Here is the process that this
32
+ lecture covers:
33
+
34
+ - Build a prototype
35
+
36
+ - Separate your model and UI
37
+
38
+ - Learn the tricks to scale
39
+
40
+ - Consider moving your model to the edge when you really need to go
41
+ fast
42
+
43
+ ## 1 - Build a Prototype To Interact With
44
+
45
+ There are many great tools for building model prototypes.
46
+ [HuggingFace](https://huggingface.co/) has some tools
47
+ built into its playground. They have also recently acquired a startup
48
+ called [Gradio](https://gradio.app/), which makes it easy
49
+ to wrap a small UI around the model.
50
+ [Streamlit](https://streamlit.io/) is another good option
51
+ with a bit more flexibility.
52
+
53
+ ![](./media/image19.png)
54
+
55
+
56
+ Here are some best practices for prototype deployment:
57
+
58
+ 1. **Have a basic UI**: The goal at this stage is to play around with
59
+ the model and collect feedback from other folks. Gradio and
60
+ Streamlit are your friends here - often as easy as adding a couple
61
+ of lines of code to create a simple interface for the model.
62
+
63
+ 2. **Put it behind a web URL**: A URL is easier to share. Furthermore,
64
+ you will start thinking about the tradeoffs you'll be making when
65
+ dealing with more complex deployment schemes. There are cloud
66
+ versions of [Streamlit](https://streamlit.io/cloud)
67
+ and [HuggingFace](https://huggingface.co/) for this.
68
+
69
+ 3. **Do not stress it too much**: You should not take more than a day
70
+ to build a prototype.
71
+
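To give a rough sense of how little code a basic UI can take, here is a minimal sketch that assumes the `gradio` package is installed. The `predict` function is a hypothetical stand-in for a real model and is not something from the lecture.

```python
def predict(text: str) -> dict[str, float]:
    """Hypothetical stand-in for a real model: a keyword-based sentiment score."""
    positive_words = {"good", "great", "wonderful"}
    words = text.lower().split()
    hits = sum(word in positive_words for word in words)
    score = hits / len(words) if words else 0.0
    return {"positive": score, "negative": 1.0 - score}


def build_demo():
    """Wrap `predict` in a small web UI (assumes `pip install gradio`)."""
    import gradio as gr  # imported lazily so `predict` works without Gradio

    # A text box as input, a label widget showing per-class confidences as output.
    return gr.Interface(fn=predict, inputs="text", outputs="label")
```

Calling `build_demo().launch()` starts a local web server for the interface, and `launch(share=True)` asks Gradio for a temporary public URL, which covers the "put it behind a web URL" step for quick feedback.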
72
+ A model prototype won't be your end solution to deploy. Firstly, a
73
+ prototype has limited frontend flexibility, so eventually, you want to
74
+ be able to build a fully custom UI for the model. Secondly, a prototype
75
+ does not scale to many concurrent requests. Once you start having users,
76
+ you'll hit the scaling limits quickly.
77
+
78
+ ![](./media/image18.png)
79
+
80
+
81
+ Above is an abstract diagram of how your application might look. The
82
+ **client** is your user's device that interacts with your application.
83
+ This device can be a browser, a vehicle, or a mobile phone. This device
84
+ calls over a network to a **server**. The server talks to a **database**
85
+ (where data is stored), used to power the application.
86
+
87
+ ![](./media/image6.png)
88
+
89
+
90
+ There are different ways of structuring your application to fit an ML
91
+ model inside. The prototype approach mentioned in the beginning fits
92
+ into the **model-in-service** approach - where your hosted web server
93
+ has a packaged version of the model sitting inside it. This pattern has
94
+ pros and cons.
95
+
96
+ The biggest pro is that if you are doing something complex, you get to
97
+ reuse your existing infrastructure. It does not require you as a model
98
+ developer to set up new things from scratch.
99
+
100
+ However, there are a number of pronounced cons:
101
+
102
+ 1. **Your web server may be written in a different language**, so
103
+ getting your model into that language can be difficult.
104
+
105
+ 2. **Models may change more frequently than server code** (especially
106
+ early in the lifecycle of building your model). If you have a
107
+ well-established application and a nascent model, you do not want
108
+ to redeploy the entire application every time that you make an
109
+ update to the model (sometimes multiple updates per day).
110
+
111
+ 3. If you have a large model to run inference on, you'll have to load
112
+ that model on your web server. **Large models can eat into the
113
+ resources for your web server**. That might affect the user
114
+ experience for people using that web server, even if they are not
115
+ interacting with the model.
116
+
117
+ 4. **Server hardware is generally not optimized for ML workloads**. In
118
+ particular, you rarely will have a GPU on these devices.
119
+
120
+ 5. **Your model and application may have different scaling
121
+ properties**, so you might want to be able to scale them
122
+ differently.
123
+
124
+ ## 2 - Separate Your Model From Your UI
125
+
126
+ ### 2.1 - Batch Prediction
127
+
128
+ ![](./media/image8.png)
129
+
130
+
131
+ The first pattern to pull your model from your UI is called **batch
132
+ prediction**. You get new data in and run your model on each data point.
133
+ Then, you save the results of each model inference into a database. This
134
+ can work well under some circumstances. For example, if there are not a
135
+ lot of potential inputs to the model, you can re-run your model at some
135
+ frequency (every hour, every day, or every week). You can then serve
136
+ reasonably fresh predictions to your users straight from
138
+ your database. Examples of these problems include the early stages of
139
+ building recommender systems and internal-facing tools like marketing
140
+ automation.
141
+
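The batch prediction pattern can be sketched with nothing more than a database table. This is a minimal illustration using an in-memory SQLite database; the model, the table schema, and the user records are all hypothetical.

```python
import sqlite3
from datetime import datetime, timezone


def model_predict(features: dict) -> float:
    """Hypothetical model: scores a user by how many items they viewed."""
    return min(1.0, features["items_viewed"] / 10)


def run_batch_job(conn: sqlite3.Connection, inputs: dict[str, dict]) -> None:
    """Score every known input and overwrite the stored predictions."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS predictions "
        "(user_id TEXT PRIMARY KEY, score REAL, updated_at TEXT)"
    )
    now = datetime.now(timezone.utc).isoformat()
    rows = [(user_id, model_predict(f), now) for user_id, f in inputs.items()]
    conn.executemany("INSERT OR REPLACE INTO predictions VALUES (?, ?, ?)", rows)
    conn.commit()


def lookup_prediction(conn: sqlite3.Connection, user_id: str):
    """What the application does at request time: a plain database read."""
    row = conn.execute(
        "SELECT score FROM predictions WHERE user_id = ?", (user_id,)
    ).fetchone()
    return row[0] if row else None


conn = sqlite3.connect(":memory:")
run_batch_job(conn, {"alice": {"items_viewed": 4}, "bob": {"items_viewed": 25}})
print(lookup_prediction(conn, "alice"))  # 0.4
```

Note that the application never calls the model directly. A user who was not part of the last batch run simply has no prediction (`None` here), which is one face of the limitations discussed in this section.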
142
+ To run models on a schedule, you can leverage the data processing and
143
+ workflow tools mentioned in our previous lecture on data management. You
144
+ need to re-run data processing, load the model, run predictions, and
145
+ store those predictions in your database. This is exactly a **Directed
146
+ Acyclic Graph workflow of data operations** that tools like
147
+ [Dagster](https://dagster.io/),
148
+ [Airflow](https://airflow.apache.org/), or
149
+ [Prefect](https://www.prefect.io/) are designed to solve.
150
+ It's worth noting that there are also tools like
151
+ [Metaflow](https://metaflow.org/) that are designed more
152
+ for ML or data science use cases that might be potentially even an
153
+ easier way to get started.
154
+
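To make the DAG idea concrete, here is a small sketch using Python's standard-library `graphlib` rather than any of the tools named above. The step names mirror the four steps in the paragraph; the task bodies are hypothetical placeholders. Real workflow tools add scheduling, retries, and logging on top of this ordering logic.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps that must finish before it starts.
dag = {
    "process_data": set(),
    "load_model": set(),
    "run_predictions": {"process_data", "load_model"},
    "store_predictions": {"run_predictions"},
}

# Hypothetical task bodies; each receives the outputs of its upstream steps.
tasks = {
    "process_data": lambda up: [1.0, 2.0, 3.0],
    "load_model": lambda up: (lambda x: 2 * x),
    "run_predictions": lambda up: [up["load_model"](x) for x in up["process_data"]],
    "store_predictions": lambda up: {"rows_written": len(up["run_predictions"])},
}


def run_pipeline(dag: dict, tasks: dict) -> dict:
    """Run each task once, only after all of its dependencies have run."""
    results = {}
    for name in TopologicalSorter(dag).static_order():
        upstream = {dep: results[dep] for dep in dag[name]}
        results[name] = tasks[name](upstream)
    return results


results = run_pipeline(dag, tasks)
print(results["store_predictions"])  # {'rows_written': 3}
```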
155
+ Let's visit the pros and cons of this batch prediction pattern. Starting
156
+ with the pros:
157
+
158
+ 1. Batch prediction is **simple to implement** since it reuses existing
159
+ batch processing tools that you may already be using for training
160
+ your model.
161
+
162
+ 2. It **scales very easily** because databases have been engineered for
163
+ decades for such a purpose.
164
+
165
+ 3. Even though it looks like a simple pattern, it has been **used in
166
+ production by large-scale production systems for years**. This is
167
+ a tried-and-true pattern you can run and be confident that it'll
168
+ work well.
169
+
170
+ 4. It is **fast to retrieve the prediction** since the database is
171
+ designed for the end application to interact with.
172
+
173
+ Switching to the cons:
174
+
175
+ 1. Batch prediction **doesn't scale to complex input types**. For
176
+ instance, if the universe of inputs is too large to enumerate
177
+ every single time you need to update your predictions, this won't
178
+ work.
179
+
180
+ 2. **Users won't be getting the most up-to-date predictions from your
181
+ model**. If the features that go into your model change every
182
+ hour, minute, or subsecond, but you only run your batch prediction
183
+ job every day, the predictions your users see might be slightly
184
+ stale.
185
+
186
+ 3. **Models frequently become "stale."** If your batch jobs fail for
187
+ some reason, it can be hard to detect these problems.
188
+
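One simple way to make a silently failing batch job visible is to record when predictions were last refreshed and flag them once they are older than expected. The sketch below assumes you store such a timestamp somewhere; the function name and the 24-hour threshold are illustrative choices.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def is_stale(
    last_success: datetime,
    max_age: timedelta,
    now: Optional[datetime] = None,
) -> bool:
    """Return True if the last successful batch run is older than `max_age`."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now - last_success > max_age


# Hypothetical timestamps: the job last succeeded two days before "now".
now = datetime(2022, 9, 5, 12, 0, tzinfo=timezone.utc)
last_success = datetime(2022, 9, 3, 12, 0, tzinfo=timezone.utc)
print(is_stale(last_success, timedelta(hours=24), now=now))  # True
```

A check like this could run on its own schedule and raise an alert, so that stale predictions are noticed before users notice them.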
189
+ ### 2.2 - Model-as-Service
190
+
191
+ The second pattern is called **model-as-service**: we run the model
192
+ online as its own service. The service is going to interact with the
193
+ backend or the client itself by making requests to the model service and
194
+ receiving responses back.
195
+
196
+ ![](./media/image16.png)
197
+
198
+
199
+ The pros of this pattern are:
200
+
201
+ 1. **Dependability** - model bugs are less likely to crash the web
202
+ application.
203
+
204
+ 2. **Scalability** - you can choose optimal hardware for the model and
205
+ scale it appropriately.
206
+
207
+ 3. **Flexibility** - you can easily reuse a model across multiple
208
+ applications.
209
+
210
+ The cons of this pattern are:
211
+
212
+ 1. Since this is a separate service, you add a network call when your
213
+ server or client interacts with the model. That can **add
214
+ latency** to your application.
215
+
216
+ 2. It also **adds infrastructural complexity** because you are on the
217
+ hook for hosting and managing a separate service.
218
+
219
+ Even with these cons, **the model-as-service pattern is still a sweet
220
+ spot for most ML-powered products** since you really need to be able to
221
+ scale independently of the application in most complex use cases. We'll
222
+ walk through the basic components of building your model service -
223
+ including REST APIs, dependency management, performance optimization,
224
+ horizontal scaling, rollout, and managed options.
225
+
226
+ #### REST APIs
227
+
228
+ **REST APIs** serve predictions in response to canonically-formatted
228
+ HTTP requests. There are alternative protocols to interact with a
229
+ service that you host on your infrastructure, such as
230
+ [gRPC](https://grpc.io/) (used in TensorFlow Serving) and
232
+ [GraphQL](https://graphql.org/) (common in web development
233
+ but not terribly relevant to model services).
234
+
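As a rough sketch of what a REST model service involves, the example below uses only Python's standard library. The model, the `/predict` path, and the `instances`/`predictions` payload keys are illustrative assumptions, not a standard.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def model_predict(instance: dict) -> float:
    """Hypothetical model: sums the numeric features of one instance."""
    return float(sum(instance["features"]))


def handle_predict(raw_body: bytes) -> bytes:
    """Decode a JSON request, run the model on each instance, encode the reply."""
    payload = json.loads(raw_body)
    predictions = [model_predict(x) for x in payload["instances"]]
    return json.dumps({"predictions": predictions}).encode()


class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/predict":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        body = handle_predict(self.rfile.read(length))
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8000) -> None:
    """Block forever, answering POST /predict requests on localhost."""
    HTTPServer(("127.0.0.1", port), PredictHandler).serve_forever()
```

With `serve()` running, a request such as `curl -X POST localhost:8000/predict -d '{"instances": [{"features": [1, 2, 3]}]}'` should return `{"predictions": [6.0]}`. A production service would typically use a web framework and add input validation, but the request/response shape is the same idea.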
235
+ ![](./media/image3.png)
236
+
237
+
238
+ Unfortunately, there is currently no standard for formatting requests
239
+ and responses for REST API calls.
240
+
241
+ 1. [Google Cloud](https://cloud.google.com/) expects a
242
+ batch of inputs structured as a list called "instances" (with keys
243
+ and values).
244
+
245
+ 2. [Azure](https://azure.microsoft.com/en-us/) expects a
246
+ list of things called "data", where the data structure itself
247
+ depends on what your model architecture is.
248
+
249
+ 3. [AWS SageMaker](https://aws.amazon.com/sagemaker/)
250
+ expects instances that are formatted differently than they are in
251
+ Google Cloud.
252
+
253
+ Our aspiration for the future is to move toward **a standard interface
254
+ for making REST API calls for ML services**. Since the types of data
255
+ that you might send to these services are constrained, we should be able
256
+ to develop a standard as an industry.
257
+
258
+ #### Dependency Management
259
+
260
+ Model predictions depend on **code**, **model weights**, and
261
+ **dependencies**. In order for your model to make a correct prediction,
262
+ all of these dependencies need to be present on your web server.
263
+ Unfortunately, dependencies are a notorious cause of trouble as it is
264
+ hard to ensure consistency between your development environment and your
265
+ server. It is also hard to update since even changing a TensorFlow
266
+ version can change your model.
267
+
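One lightweight guard against environment drift is to record package versions at training time and compare them when the server starts. The sketch below uses the standard-library `importlib.metadata`; the helper names and the pinned versions are hypothetical.

```python
from importlib import metadata


def snapshot_versions(packages: list[str]) -> dict:
    """Record the installed version of each package (None if not installed)."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def find_mismatches(expected: dict, actual: dict) -> dict:
    """Return {package: (expected, actual)} for every version that differs."""
    return {
        name: (version, actual.get(name))
        for name, version in expected.items()
        if actual.get(name) != version
    }


# Hypothetical pins: what training used versus what the server has.
training_env = {"tensorflow": "2.9.1", "numpy": "1.23.1"}
serving_env = {"tensorflow": "2.10.0", "numpy": "1.23.1"}
print(find_mismatches(training_env, serving_env))
# {'tensorflow': ('2.9.1', '2.10.0')}
```

In practice `serving_env` would come from `snapshot_versions(list(training_env))` on the server, and a non-empty result would be a reason to refuse to start or at least to log a warning.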
268
+ At a high level, there are two strategies for managing dependencies:
269
+
270
+ 1. **Constrain the dependencies for your model** by saving your model
271
+ in an agnostic format that can be run anywhere.
272
+
273
+ 2. **Use containers** to constrain the entire inference program.
274
+
275
+ ![](./media/image11.png)
276
+
277
+
278
+ ##### Constraining Model Dependencies
279
+
280
+ The primary way to constrain the dependencies of just your model is
281
+ through a library called [ONNX](https://onnx.ai/) - the
282
+ Open Neural Network Exchange. The goal of ONNX is to be **an
283
+ interoperability standard for ML models**. The promise is that you can
284
+ define a neural network in any language and run it consistently
285
+ anywhere. The reality is that since the underlying libraries used to
286
+ build these models change quickly, there are often bugs in the
287
+ translation layer, which creates even more problems to solve for you.
288
+ Additionally, ONNX doesn't deal with non-library code such as feature
289
+ transformations.
290
+
291
+ ##### Containers
292
+
293
+ To understand how to manage dependencies with containers, we need to
294
+ understand [the differences between Docker and Virtual
295
+ Machines](https://medium.freecodecamp.org/a-beginner-friendly-introduction-to-containers-vms-and-docker-79a9e3e119b),
296
+ how Docker images are built via Dockerfiles and constructed via layers,
297
+ the ecosystem around Docker, and specific wrappers around Docker that
298
+ you can use for ML.
299
+
300
+ ![](./media/image10.png)
301
+
302
+
303
+ In a **virtual machine**, you package up the entire operating system
304
+ (OS) as well as the libraries and applications that are built on top of
305
+ that OS. A virtual machine tends to be very heavyweight because the OS
306
+ itself has a lot of code and is expensive to run. A **container** such
307
+ as Docker removes that need by packaging the libraries and applications
308
+ together. A Docker engine that runs on top of your OS knows how to
309
+ virtualize the OS and run the libraries/applications.
310
+
311
+ By virtue of being **lightweight**, Docker is used differently than how
312
+ Virtual Machines were used. A common pattern is to spin up [a new
313
+ Docker container](https://www.docker.com/what-container)
314
+ for every discrete task. For example, a web application might have four
315
+ containers: a web server, a database, a job queue, and a worker. These
316
+ containers are run together as part of an orchestration system.
317
+
318
+ ![](./media/image15.png)
319
+
320
+
321
+ Docker containers are created from images, which are built from
322
+ [Dockerfiles](https://docs.docker.com/engine/reference/builder/).
323
+ Each Dockerfile runs a sequence of steps to define the environment
324
+ where you will run your code. Docker also allows you to build, store,
325
+ and pull Docker containers from a Docker Hub that is hosted on some
326
+ other servers or your cloud. You can experiment with a code environment
327
+ that is on your local machine but will be identical to the environment
328
+ you deploy on your server.
329
+
330
+ Docker is separated into [three different
331
+ components](https://docs.docker.com/engine/docker-overview):
332
+
333
+ 1. The **client** runs on your laptop, where you build an
334
+ image from a Dockerfile that you define locally by issuing some
335
+ commands.
336
+
337
+ 2. These commands are executed by a **Docker Host**, which can run on
338
+ either your laptop or your server (with more storage or more
339
+ performance).
340
+
341
+ 3. That Docker Host talks to a **registry** - which is where all the
342
+ containers you might want to access are stored.
343
+
344
+ ![](./media/image1.png)
345
+
346
+
347
+ With this separation of concerns, you are not limited by the amount of
348
+ compute and storage you have on your laptop to build, pull, and run
349
+ Docker images. You are also not limited by what you have access to on
350
+ your Docker Host to decide which images to run.
351
+
352
+ In fact, there is a powerful ecosystem of Docker images that are
353
+ available on different public Docker Hubs. You can easily find these
354
+ images, modify them, and contribute them back to the Hubs. It's easy to
355
+ store private images in the same place as well. Because of this
356
+ community and the lightweight nature of Docker, it has become
357
+ [incredibly popular in recent
358
+ years](https://www.docker.com/what-container#/package_software)
359
+ and is ubiquitous at this point.
360
+
361
+ There is a bit of a learning curve to Docker. For ML, there are a few
362
+ open-source packages designed to simplify this:
363
+ [Cog](https://github.com/replicate/cog),
364
+ [BentoML](https://github.com/bentoml/BentoML), and
365
+ [Truss](https://github.com/trussworks). They are built by
366
+ different model hosting providers and are designed to work well with
367
+ each provider's hosting service, but they can also simply package your model and all of
368
+ its dependencies in a standard Docker container format.
369
+
370
+ ![](./media/image12.png)
371
+
372
+ These packages have **two primary components**: The first one is a
373
+ standard way of defining your prediction service. The second one is a
374
+ YAML file that defines the other dependencies and package versions that
375
+ will go into the Docker container running on your laptop or remotely.
376
+
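The first component, a standard way of defining a prediction service, often takes the shape of a class with a one-time setup step and a per-request predict step. The sketch below is a plain-Python illustration of that shape only; the class and method names are hypothetical and are not the actual API of Cog, BentoML, or Truss.

```python
class Predictor:
    """Hypothetical prediction service: load once, predict many times."""

    def setup(self) -> None:
        # A real service would load model weights from disk here; the
        # "weights" in this sketch are a hard-coded vocabulary.
        self.positive_words = {"good", "great", "wonderful"}

    def predict(self, text: str) -> str:
        words = set(text.lower().split())
        return "positive" if words & self.positive_words else "negative"


predictor = Predictor()
predictor.setup()
print(predictor.predict("what a wonderful demo"))  # positive
```

Separating setup from prediction lets the packaging tool load the model once when the container starts and then reuse it for every request. The second component, the YAML file, differs between tools, so it is not sketched here.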
377
+ If you want to have the advantages of using Docker for making your ML
378
+ models reproducible but do not want to go through the learning curve of
379
+ learning Docker, it's worth checking out these three libraries.
380
+
381
+ #### Performance Optimization
382
+
383
+ !!! info "What about performance _monitoring_?"
384
+ In this section, we focus on ways to improve the performance of your
385
+ models, but we spend less time on how exactly that performance is monitored,
386
+ which is a challenge in its own right.
387
+
388
+ Luckily, one of the
389
+ [student projects](../project-showcase/) for the 2022 cohort,
390
+ [Full Stack Stable Diffusion](../project-showcase/#full-stack-stable-diffusion),
391
+ took up that challenge and combined
392
+ [NVIDIA's Triton Inference Server](https://developer.nvidia.com/nvidia-triton-inference-server),
393
+ the [Prometheus monitoring tool](https://en.wikipedia.org/wiki/Prometheus_(software)),
394
+ and
395
+ the [Grafana analytics dashboarding tool](https://en.wikipedia.org/wiki/Grafana)
396
+ to monitor a robust, scalable, and observable deployment of Stable Diffusion models.
397
+
398
+ Check out the repo on GitHub
399
+ [here](https://github.com/okanlv/fsdl-full-stack-stable-diffusion-2022)
400
+ if you want to see a worked example of a fully-monitored DL-powered application.
401
+
402
+ To make model inference on your machine more efficient, we need to
403
+ discuss GPUs, concurrency, model distillation, quantization, caching,
404
+ batching, sharing the GPU, and libraries that automate these tasks for
405
+ you.
406
+
407
+ ##### GPU or no GPU?
408
+
409
+ There are some advantages to hosting your model on a GPU:
410
+
411
+ 1. It's probably the same hardware you train your model on, to begin
412
+ with. That can eliminate any lost-in-translation issues.
413
+
414
+ 2. As your model gets big and your techniques get advanced, your
415
+ traffic gets large. GPUs provide high throughput to deal with
416
+ that.
417
+
418
+ However, GPUs introduce a lot of complexity:
419
+
420
+ 1. They are more complex to set up.
421
+
422
+ 2. They are more expensive.
423
+
424
+ As a result, **just because your model is trained on a GPU does not mean
425
+ that you need to actually host it on a GPU in order for it to work**. In
426
+ the early version of your model, hosting it on a CPU should suffice. In
427
+ fact, it's possible to get high throughput from CPU inference at a low
428
+ cost by using some other techniques.
429
+
430
+ ##### Concurrency
431
+
432
+ With **concurrency**, multiple copies of the model run in parallel on
433
+ different CPUs or cores on a single host machine. To do this, you need
434
+ to be careful about thread tuning. There's [a great Roblox
435
+ presentation](https://www.youtube.com/watch?v=Nw77sEAn_Js)
436
+ on how they scaled BERT to serve a billion daily requests, just using
437
+ CPUs.
438
+
439
+ ##### Model Distillation
440
+
441
+ With **model distillation**, once you have a large model that you've
442
+ trained, you can train a smaller model that imitates the behavior of
443
+ your larger one. This entails taking the knowledge that your larger
444
+ model learned and compressing that knowledge into a much smaller model
445
+ that you may not have trained to the same degree of performance from
446
+ scratch. There are several model distillation techniques pointed out in
447
+ [this blog
448
+ post](https://heartbeat.comet.ml/research-guide-model-distillation-techniques-for-deep-learning-4a100801c0eb).
449
+ They can be finicky to do by yourself and are infrequently used in
450
+ practice. An exception is distilled versions of popular models (such as
451
+ [DistilBERT](https://huggingface.co/docs/transformers/model_doc/distilbert)).
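To make "imitating the larger model" concrete, here is a minimal sketch of one common distillation objective: the student is penalized for diverging from the teacher's temperature-softened output distribution. This is an illustrative assumption, not necessarily the formulation used in the linked blog post or by DistilBERT; the function names and the default temperature are hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over logits divided by a temperature (higher = softer)."""
    scaled = [x / temperature for x in logits]
    largest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)  # teacher's soft targets
    q = softmax(student_logits, temperature)  # student's current beliefs
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

teacher = [4.0, 1.0, 0.5]
matching_student = [4.0, 1.0, 0.5]
poor_student = [0.5, 1.0, 4.0]

loss_match = distillation_loss(matching_student, teacher)
loss_poor = distillation_loss(poor_student, teacher)
```

In a real training loop this term is usually combined with the ordinary loss on the true labels, and the gradient flows only through the student.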
452
+
453
+ ##### Quantization
454
+
455
+ With **quantization**, you execute some or potentially all of the
456
+ operations in your model using a lower-fidelity representation of the
457
+ numbers involved in the math. These representations can be 16-bit
458
+ floating point numbers or 8-bit integers. This introduces some tradeoffs
459
+ with accuracy, but it's often worth making these tradeoffs because the
460
+ accuracy you lose is limited relative to the performance you gain.
461
+
462
+ The recommended path is to use built-in quantization methods in
463
+ [PyTorch](https://pytorch.org/blog/introduction-to-quantization-on-pytorch/)
464
+ and TensorFlow. More specifically, [HuggingFace
465
+ Optimum](https://huggingface.co/docs/optimum) is a good
466
+ choice if you have already been using HuggingFace's pre-trained models.
467
+ You can also run **quantization-aware training**, which often results in
468
+ higher accuracy.
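As a rough illustration of what a "lower-fidelity representation" means, the sketch below maps a list of floats to 8-bit integer codes with one shared scale and then maps them back. It is a toy in pure Python under simplifying assumptions (symmetric range, one scale per tensor), not how PyTorch, TensorFlow, or Optimum implement quantization; the helper names are hypothetical.

```python
def quantize_int8(values):
    """Map floats to integer codes in [-127, 127] using one shared scale."""
    max_abs = max(abs(v) for v in values) or 1.0  # avoid a zero scale
    scale = max_abs / 127.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale

def dequantize(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]

weights = [0.5, -1.27, 0.03, 1.0]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)

# The rounding error per value is at most half of one quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Each weight now needs one byte instead of four, at the cost of an error bounded by half the scale. That bounded error is the accuracy tradeoff described above.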
469
+
470
+ ![](./media/image5.png)
471
+
472
+
473
+ ##### Caching
474
+
475
+ With **caching**, you realize that for some ML models, some inputs are
476
+ more common than others. Instead of always calling the model every time
477
+ a user makes a request, let's store the common requests in a cache.
478
+ Then, let's check that cache before running an expensive operation.
479
+ Caching techniques can get fancy, but the basic way of doing this is to
480
+ use the [functools library in
481
+ Python](https://docs.python.org/3/library/functools.html).
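The basic approach can be sketched with `functools.lru_cache`. Here `run_model` is a hypothetical stand-in for an expensive model call, and the call counter exists only to show when the model actually runs.

```python
from functools import lru_cache

CALLS = {"model": 0}  # counts how often the expensive model actually runs

def run_model(text: str) -> str:
    """Hypothetical stand-in for an expensive model call."""
    CALLS["model"] += 1
    return text.upper()

@lru_cache(maxsize=1024)  # keep up to 1024 recently used inputs
def cached_predict(text: str) -> str:
    return run_model(text)

cached_predict("hello")
cached_predict("hello")  # served from the cache; run_model is not called again
cached_predict("world")
```

Note that `lru_cache` requires hashable arguments, so inputs such as arrays or dicts would need to be converted to a hashable key first.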
482
+
483
+ ![](./media/image2.png)
484
+
485
+
486
+ ##### Batching
487
+
488
+ With **batching**, you take advantage of the fact that ML models often
489
+ achieve a higher throughput when doing prediction in parallel,
490
+ especially on a GPU. To accomplish this, you need to gather requests
491
+ until you have a batch, run the predictions together, and return them to your
492
+ users. You want to tune the batch size to balance the
493
+ latency-throughput tradeoff. You also need to have a way to shortcut the
494
+ process if latency becomes too long. Batching is complicated to
495
+ implement, so you probably do not want to implement this yourself.
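To show why the latency shortcut matters, here is a deliberately simplified, single-threaded sketch of the gathering step: collect requests until the batch is full or a time budget runs out, then run one batched prediction. The function names, batch size, and wait time are illustrative assumptions; production servers handle this concurrently and far more robustly.

```python
import queue
import time

def gather_batch(requests, max_batch_size=8, max_wait_s=0.1):
    """Collect up to max_batch_size items, waiting at most max_wait_s in total."""
    batch = []
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # latency budget used up: ship a partial batch
        try:
            batch.append(requests.get(timeout=remaining))
        except queue.Empty:
            break  # nothing else arrived in time
    return batch

def predict_batch(inputs):
    """Hypothetical stand-in for one batched forward pass."""
    return [x * 2 for x in inputs]

pending = queue.Queue()
for value in range(10):
    pending.put(value)

first = predict_batch(gather_batch(pending))   # a full batch of 8
second = predict_batch(gather_batch(pending))  # a partial batch after the timeout
```

A larger `max_batch_size` raises throughput, while a smaller `max_wait_s` caps how long any single request can sit waiting for the batch to fill.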
496
+
497
+ ##### Sharing the GPU
498
+
499
+ Your model may not take up all of the GPU memory with your inference
500
+ batch size. **Why don't you run multiple models on the same GPU?** This
501
+ is a place where you want to use a model serving solution that supports
502
+ GPU sharing out of the box.
503
+
504
+ ##### Libraries
505
+
506
+ There are offerings from TensorFlow, PyTorch, and third-party tools from
507
+ NVIDIA and Anyscale. NVIDIA's choice is probably the most powerful but
508
+ can be difficult to get started with. Starting with Anyscale's [Ray
509
+ Serve](https://docs.ray.io/en/latest/serve/index.html) may
510
+ be an easier way to get started.
511
+
512
+ ![](./media/image20.png)
513
+
514
+
515
+ #### Horizontal Scaling
516
+
517
+ If you're going to scale up to a large number of users interacting with
518
+ your model, it's not going to be enough to get the most efficiency out
519
+ of one server. At some point, you'll need to scale horizontally to have
520
+ traffic going to multiple copies of your model running on different
521
+ servers. This is called **horizontal scaling**. This technique involves
522
+ taking traffic that would usually go to a single machine and splitting it
523
+ across multiple machines.
524
+
525
+ Each machine has a copy of the service, and a tool called a load
526
+ balancer distributes traffic to each machine. In practice, there are two
527
+ ways to do this: with either **container orchestration** (e.g.
528
+ Kubernetes) or **serverless** (e.g. AWS Lambda).
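The load balancer's job can be sketched in a few lines. This toy round-robin version assumes every replica is healthy and equally fast; real load balancers also handle health checks, retries, and uneven load. All names here are hypothetical.

```python
import itertools

class RoundRobinBalancer:
    """Send each incoming request to the next replica in turn."""

    def __init__(self, replicas):
        self._cycle = itertools.cycle(replicas)

    def route(self, request):
        replica = next(self._cycle)
        return replica(request)

def make_replica(name):
    """Hypothetical replica: a callable wrapping one copy of the model service."""
    def handle(request):
        return f"{name} handled {request}"
    return handle

balancer = RoundRobinBalancer([make_replica("server-a"), make_replica("server-b")])
responses = [balancer.route(i) for i in range(4)]
```

With two replicas, consecutive requests alternate between `server-a` and `server-b`, so each machine sees roughly half of the traffic.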
529
+
530
+ ##### Container Orchestration
531
+
532
+ In container orchestration, we use
533
+ [Kubernetes](https://kubernetes.io/) to help manage
534
+ containerized applications (in Docker containers, for example) and run
535
+ them across machines.
536
+
537
+ ![](./media/image14.png)
538
+
539
+
540
+ Kubernetes is quite interesting, but it's probably overkill to learn
541
+ too much about it if your only goal is to deploy machine learning
542
+ models. There are a number of frameworks that make it easier to deploy
543
+ ML models with Kubernetes, including
544
+ [Kubeflow](https://www.kubeflow.org/),
545
+ [Seldon](https://www.seldon.io/), etc.
546
+
547
+ ##### Serverless
548
+
549
+ If Kubernetes isn't the path for you (e.g. you don't want to have to
550
+ worry about infrastructure at all), serverless is another option for
551
+ deploying models. In this paradigm, app code and dependencies are
552
+ packaged into .zip files or Docker containers with a single entry point
553
+ function (e.g. *model.predict()*) that will
554
+ be run repeatedly. This package is then deployed to a service like [AWS
555
+ Lambda](https://aws.amazon.com/lambda/), which almost
556
+ totally manages the infrastructure required to run the code based on the
557
+ input. Scaling to thousands of requests and across multiple machines is
558
+ taken care of by these services. In return, you pay for the compute time
559
+ that you consume.
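A single entry point function might look like the sketch below, written in the style of an AWS Lambda Python handler, which receives an `event` and a `context`. The model loader and the shape of the event (a JSON string under `"body"`, as an HTTP front end might deliver it) are assumptions made for illustration.

```python
import json

def load_model():
    """Hypothetical loader; a real one would read weights shipped in the package."""
    return lambda text: {"length": len(text)}

# Loaded at module scope so the cost is paid once per container start,
# not on every request.
MODEL = load_model()

def handler(event, context):
    """Entry point: parse the request, run the model, return a response."""
    body = json.loads(event["body"])
    prediction = MODEL(body["text"])
    return {"statusCode": 200, "body": json.dumps(prediction)}

# Local smoke test with a hand-built event; no cloud account needed.
response = handler({"body": json.dumps({"text": "hello"})}, None)
```

Keeping the model load outside the handler means warm invocations reuse the loaded model, though the first request after a period of inactivity still pays the loading cost.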
560
+
561
+ Since model services tend to run discretely and not continuously (like a
562
+ web server), serverless is a great fit for machine learning deployment.
563
+
564
+ ![](./media/image7.png)
565
+
566
+
567
+ **Start with serverless!** It's well worth the time saved in managing
568
+ infrastructure and dealing with associated challenges. There are still
569
+ some problems you should be aware of though.
570
+
571
+ 1. First, the size of the actual deployment package that can be sent to
572
+ a serverless service tends to be limited, which makes large models
573
+ impossible to run.
574
+
575
+ 2. Second, there is also a cold start problem. If there is no traffic
576
+ being sent to the service in question, the service will "wind
577
+ down" to zero compute use, at which point it takes time to start
578
+ again. This lag in starting up upon the first request to the
579
+ serverless service is known as the "cold start" time. This can
580
+ take seconds or even minutes.
581
+
582
+ 3. Third, it can be hard to actually build solid software engineering
583
+ concepts, like pipelines, with serverless. Pipelines enable rapid
584
+ iteration, while serverless offerings often do not have the tools
585
+ to support rapid, automated changes to code of the kind pipelines
586
+ are designed to do.
587
+
588
+ 4. Fourth, state management and deployment tooling are related
589
+ challenges here.
590
+
591
+ 5. Finally, most serverless functions are CPU only and have limited
592
+ execution time. If you need GPUs for inference, serverless might
593
+ not be for you quite yet. There are, however, new offerings like
594
+ [Banana](https://www.banana.dev/) and
595
+ [Pipeline](https://www.pipeline.ai/) that are
596
+ seeking to solve this problem of serverless GPU inference!
597
+
598
+ #### Model Rollouts
599
+
600
+ If serving is how you turn a model into something that can respond to
601
+ requests, rollouts are how you manage and update these services. To be
602
+ able to make updates effectively, you should be able to do the
603
+ following:
604
+
605
+ 1. **Roll out gradually**: You may want to incrementally send traffic
606
+ to a new model rather than the entirety.
607
+
608
+ 2. **Roll back instantly**: You may want to immediately pull back a
609
+ model that is performing poorly.
610
+
611
+ 3. **Split traffic between versions**: You may want to test differences
612
+ between models and therefore send some traffic to each.
613
+
614
+ 4. **Deploy pipelines of models**: Finally, you may want to have entire
615
+ pipeline flows that ensure the delivery of a model.
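Gradual rollout, instant rollback, and traffic splitting can all be viewed as one routing decision with an adjustable share. The sketch below is a hypothetical illustration, not a description of any particular platform: it hashes a user ID into a stable bucket so that the same user consistently sees the same model version.

```python
import hashlib

def choose_version(user_id, new_model_share):
    """Route a stable fraction of users (0.0 to 1.0) to the new model."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in [0, 100)
    return "new" if bucket < new_model_share * 100 else "old"

users = [f"user-{i}" for i in range(200)]

# Gradual rollout: raise the share step by step.
at_ten_percent = [u for u in users if choose_version(u, 0.1) == "new"]
at_half = [u for u in users if choose_version(u, 0.5) == "new"]

# Instant rollback: set the share back to zero.
after_rollback = [u for u in users if choose_version(u, 0.0) == "new"]
```

Because buckets are stable, users who already see the new model keep seeing it as the share grows, which keeps comparisons between versions clean.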
616
+
617
+ Building these capabilities is a reasonably challenging infrastructure
618
+ problem that is beyond the scope of this course. In short, managed
619
+ services are a good option for this that we'll now discuss!
620
+
621
+ #### Managed Options
622
+
623
+ All of the major cloud providers offer their managed service options for
624
+ model deployment. There are a number of startups offering solutions as
625
+ well, like BentoML or Banana.
626
+
627
+ ![](./media/image9.png)
628
+
629
+ The most popular managed service is [AWS
630
+ Sagemaker](https://aws.amazon.com/sagemaker/). Working with
631
+ Sagemaker is easier if your model is already in a common format like a
632
+ Hugging Face class or a scikit-learn model. Sagemaker has convenient
633
+ wrappers for such scenarios. Sagemaker once had a reputation for being a
634
+ difficult service to work with, but this is much less the case for the
635
+ clear-cut use case of model inference. Sagemaker, however, does have
636
+ real drawbacks around ease of use for custom models and around cost. In
637
+ fact, Sagemaker instances tend to be 50-100% more expensive than EC2.
638
+
639
+ ### 2.3 - Takeaways
640
+
641
+ To summarize this section, remember the following:
642
+
643
+ 1. You *probably* don't need GPU inference, which is hard to access and
644
+ maintain. Scaling CPUs horizontally or using serverless can
645
+ compensate.
646
+
647
+ 2. Serverless is probably the way to go!
648
+
649
+ 3. Sagemaker is a great way to get started for the AWS user, but it can
650
+ get quite expensive.
651
+
652
+ 4. Don't try to do your own GPU inference; use existing tools like
653
+ TFServing or Triton to save time.
654
+
655
+ 5. Watch out for new startups focused on GPU inference.
656
+
657
+ ## 3 - Move to the Edge?
658
+
659
+ Let's now consider the case of moving models out of a web service and all
660
+ the way to the "edge", or wholly on-device. Some reasons you may need to
661
+ consider this include a lack of reliable internet access for users or
662
+ strict data security requirements.
663
+
664
+ If such hard and fast requirements aren't in place, you'll need to take
665
+ into account the tradeoff between accuracy and latency and how this can
666
+ affect the end-user experience. Put simply, **if you have exhausted all
667
+ options to reduce model prediction time (a component of latency),
668
+ consider edge deployment**.
669
+
670
+ ![](./media/image4.png)
671
+
672
+
673
+ Edge deployment adds considerable complexity, so it should be considered
674
+ carefully before being selected as an option. In edge prediction, model
675
+ weights are directly loaded on our client device after being sent via a
676
+ server (shown above), and the model is loaded and interacted with
677
+ directly on the device.
678
+
679
+ This approach has compelling pros and cons:
680
+
681
+ 1. Some pros to particularly call out are the latency advantages that
682
+ come without the need for a network and the ability to scale for
683
+ "free," or the simple fact that you don't need to worry about the
684
+ challenges of running a web service if all inference is done
685
+ locally.
686
+
687
+ 2. Some specific cons to call out are the often limited hardware and
688
+ software resources available to run machine learning models on
689
+ edge, as well as the challenge of updating models since users
690
+ control this process more than you do as the model author.
691
+
692
+ ### 3.1 - Frameworks
693
+
694
+ Picking the right framework to do edge deployment depends both on how
695
+ you train your model and what the target device you want to deploy it on
696
+ is.
697
+
698
+ - [TensorRT](https://developer.nvidia.com/tensorrt): If
699
+ you're deploying to NVIDIA hardware, this is the choice to go with.
700
+
701
+ - [MLKit](https://developers.google.com/ml-kit) and
702
+ [CoreML](https://developer.apple.com/documentation/coreml)**:**
703
+ For phone-based deployment on either Android **or** iPhone, go
704
+ with MLKit for the former and CoreML for the latter.
705
+
706
+ - [PyTorch Mobile](https://pytorch.org/mobile)**:** For
707
+ compatibility with both iOS and Android, use PyTorch Mobile.
708
+
709
+ - [TFLite](https://www.tensorflow.org/lite): A great
710
+ choice for using TensorFlow in a variety of settings, not just on
711
+ a phone or a common device.
712
+
713
+ - [TensorFlow JS](https://www.tensorflow.org/js)**:**
714
+ The preferred framework for deploying machine learning in the
715
+ browser.
716
+
717
+ - [Apache TVM](https://tvm.apache.org/): A library-agnostic,
718
+ target-device-agnostic option. This is the choice for
719
+ anyone trying to deploy to as diverse a number of settings as
720
+ possible.
721
+
722
+ Keep paying attention to this space! There are a lot of projects and startups like
723
+ [MLIR](https://mlir.llvm.org/),
724
+ [OctoML](https://octoml.ai/),
725
+ [TinyML](https://www.tinyml.org/), and
726
+ [Modular](https://www.modular.com/) that are aiming to
727
+ solve some of these problems.
728
+
729
+ ### 3.2 - Efficiency
730
+
731
+ No software can help run edge-deployed models that are simply too large;
732
+ **model efficiency** is important for edge deployment! We previously
733
+ discussed quantization and distillation as options for model efficiency.
734
+ However, there are also network architectures specifically designed to
735
+ work better in edge settings like
736
+ [MobileNets](https://medium.com/@yu4u/why-mobilenet-and-its-variants-e-g-shufflenet-are-fast-1c7048b9618d).
737
+ MobileNets replace the more expensive computations typical of server-run
738
+ models with simpler computations and often achieve acceptable
739
+ performance.
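One way to see where the savings come from is to count multiplications. The sketch below assumes that the "simpler computations" are depthwise separable convolutions, the building block the MobileNet architecture is known for, and compares them with a standard convolution of the same shape. The layer sizes are arbitrary illustrative values.

```python
def standard_conv_mults(kernel, c_in, c_out, out_h, out_w):
    """Multiplications for a standard kernel x kernel convolution."""
    return kernel * kernel * c_in * c_out * out_h * out_w

def depthwise_separable_mults(kernel, c_in, c_out, out_h, out_w):
    """Multiplications for a depthwise conv followed by a 1x1 pointwise conv."""
    depthwise = kernel * kernel * c_in * out_h * out_w  # one filter per input channel
    pointwise = c_in * c_out * out_h * out_w            # 1x1 conv mixes the channels
    return depthwise + pointwise

# Example layer: 3x3 kernel, 64 -> 128 channels, 56x56 output map.
standard = standard_conv_mults(3, 64, 128, 56, 56)
separable = depthwise_separable_mults(3, 64, 128, 56, 56)
ratio = standard / separable
```

For this layer the separable version needs roughly 8x fewer multiplications, which is the kind of saving that makes on-device inference feasible.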
740
+
741
+ ![](./media/image17.png)
742
+
743
+
744
+ MobileNets are a great tool for model deployments and are a great case
745
+ study in model efficiency. Another similarly great case study is
746
+ [DistilBERT](https://medium.com/huggingface/distilbert-8cf3380435b5).
747
+
748
+ ![](./media/image13.png)
749
+
750
+ ### 3.3 - Mindsets
751
+
752
+ As we wrap up this lecture, keep in mind the following mindsets as you
753
+ consider edge deployment:
754
+
755
+ 1. **Start with the edge requirement, not the architecture choice**.
756
+ It's easy to pick a high-performing model architecture, only to
757
+ then find it impossible to run on the edge device. Avoid this
758
+ scenario at all costs! Tricks like quantization can account for up
759
+ to 10x improvement, but not much more.
760
+
761
+ 2. **Once you have a model that works on the edge, you can iterate
762
+ locally without too much additional re-deployment.** In this case,
763
+ make sure to add metrics around the model size and edge
764
+ performance to your experiment tracking.
765
+
766
+ 3. **Treat tuning the model as an additional risk and test
767
+ accordingly.** With the immaturity of edge deployment frameworks,
768
+ it's crucial to be especially careful when testing your model on
769
+ the exact hardware you'll be deploying on.
770
+
771
+ 4. **Make sure to have fallbacks!** Models are finicky and prone to
772
+ unpredictable behavior. On the edge, it's especially important
773
+ to have easily available fallback options for models that aren't
774
+ working.
775
+
776
+ ### 3.4 - Conclusion
777
+
778
+ To summarize this section:
779
+
780
+ 1. Web deployment is easier, so use edge deployment only if you need
781
+ to.
782
+
783
+ 2. Choose your framework to match the available hardware and
784
+ corresponding mobile frameworks, or try Apache TVM to be more
785
+ flexible.
786
+
787
+ 3. Start considering hardware constraints at the beginning of the
788
+ project and choose architectures accordingly.
documents/lecture-05.srt ADDED
@@ -0,0 +1,396 @@
1
+ 1
2
+ 00:00:00,240 --> 00:00:32,000
3
+ hey everybody welcome back this week we're going to talk about deploying models into production so we're talking about this part of the life cycle and why do we spend a whole week on this maybe the answer is obvious right which is if you want to build a machine learning powered product you need some way of getting your model into production but i think there's a more subtle reason as well which is that i think of deploying models as a really critical part of making your models good to begin with the reason for that is when you only evaluate your model offline it's really easy to miss some of the more subtle flaws that model has where it doesn't actually solve the problem that your users needed to solve
4
+
5
+ 2
6
+ 00:00:30,320 --> 00:01:07,040
7
+ oftentimes when we deploy a model for the first time only then do we really see whether that model is actually doing a good job or not but unfortunately for a lot of data scientists and ml engineers model deployment is kind of an afterthought relative to some of the other techniques that you've learned and so the goal of this lecture is to cover different ways of deploying models into production and we're not going to be able to go in depth in all of them because it's it's a broad and deep topic worthy probably of a course itself and i'm not personally an expert in it but what we will do is we'll cover like a couple of happy paths that will take you to getting your first model in production for most use cases and then
8
+
9
+ 3
10
+ 00:01:05,680 --> 00:01:41,119
11
+ we'll give you a tour of some of the other techniques that you might need to learn about if you want to do something that is outside of that normal 80 percent so to summarize it's really important to get your model into production because only there do you see if it actually works if it actually solves the task that you set out to solve the technique that we're going to emphasize that you use for this is much like what we use in other parts of the life cycle and it's focused on like getting an mvp out early deploy early deploy a minimum viable model as early as possible and deploy often we're also going to emphasize keeping it simple and adding complexity later and so we'll walk through the following process starting with building
12
+
13
+ 4
14
+ 00:01:39,280 --> 00:02:10,959
15
+ a prototype then we'll talk about how to separate your model from your ui which is sort of one of the first things that you'll need to do to make a more complex ui or to scale then we'll talk about some of the tricks that you need to do in order to scale your model up to serve many users and then finally we'll talk about more advanced techniques that you might use when you need your model to be really fast which often means moving it from a web server to the edge so the first thing that we'll talk about is how to build the first prototype of your production model and the goal here is just something that you can play around with yourself and share with your friends luckily unlike when we first taught this class there's many great
16
+
17
+ 5
18
+ 00:02:08,879 --> 00:02:46,160
19
+ tools for building prototypes of models hugging face has some tools built into their playground they've also recently acquired a company called gradio which we'll be using in the lab for the course which makes it very easy to wrap a small user interface around the model and then streamlit is also a great tool for doing this streamlit gives you a little bit more flexibility than something like gradio or hugging face spaces at the cost of just needing to put a little bit more thought into how to pull all the pieces together in your ui but it's still very easy to use a few best practices to think about when you're deploying the prototype model first i would encourage you to have a basic ui for the model not
20
+
21
+ 6
22
+ 00:02:44,720 --> 00:03:20,480
23
+ just to have an api and the reason for that is you know the goal at this stage is to play around with the model get feedback on the model both yourself and also from your friends or your co-workers or whoever else you're talking with about this project gradio and streamlit are really your friends here with gradio it's often as easy as adding a couple of lines of code to create a simple interface for a model streamlit is a little bit more ambitious in that it's a tool that allows you to build pretty complex uis just using python so it'll be familiar interfaces for you if you're a python developer but will require a little bit more thought about how you want to structure things but still very easy next best practice
24
+
25
+ 7
26
+ 00:03:18,800 --> 00:03:51,040
27
+ is don't just run this on your laptop it's actually worth at this stage putting it behind a web url why is that important one it's easier to share right so part of the goal here is to collect feedback from other folks but it also starts to get you thinking about some of the trade-offs that you'll be making when you do a more complex deployment how much latency does this model actually have luckily there are cloud versions of both streamlit and hugging face which make this very easy so there's at this point in time not a lot of excuse not to just put this behind a simple url so you can share with people and then the last tip here is just don't stress too much at this stage again this is a prototype this is
28
+
29
+ 8
30
+ 00:03:49,360 --> 00:04:20,880
31
+ something that should take you not more than like maybe a day if you're doing it for the first time but if you're building many of these models maybe it even just takes you a couple hours we've talked about this first step which is building a prototype and next i want to talk about why is this not going to work like why is this not going to be the end solution that you use to deploy your model so where will this fail the first big thing is with any of these tools that we discussed you're going to have limited flexibility in terms of how you build the user interface for your model and streamlit gives you more flexibility there than gradio but still relatively limited flexibility and so eventually you're gonna want to be able to build a
32
+
33
+ 9
34
+ 00:04:19,040 --> 00:04:53,520
35
+ fully custom ui for the model and then secondly these systems tend not to scale very well to many concurrent requests so if it's just you or you and a couple friends playing around with the model that's probably fine but once you start to have users you'll hit the scaling limits of these pretty quickly and this is a good segue to talk about at a high level different ways you can structure your machine learning powered application in particular where the model fits into that application so we'll start with an abstract diagram of how your application might look there's a few different components to this on the left we have a client and the client is essentially your user and that's the device that they're using to interact with the
36
+
37
+ 10
38
+ 00:04:52,160 --> 00:05:28,320
39
+ application that you built so it could be a browser it could be a vehicle whatever that device is that they're interacting with then that device will make calls over a network to a server that server is typically if you're building a web app where most of your code is running that server will talk to a database where there's data stored that's used for powering the application and there's different ways of structuring this application to fit a machine learning model inside the prototype approach that we just described mostly fits into this model in service approach where the web server that you're hosting actually just has a packaged version of the model sitting inside of it when you write a streamlit script or a gradio script part of that
40
+
41
+ 11
42
+ 00:05:26,560 --> 00:06:02,000
43
+ script will be to load the model and so that script will be building your ui as well as running the model at the same time so this pattern like all patterns has pros and cons the biggest pro i think is one it's really easy if you're using one of these prototype development tools but two even if you are doing something a little bit more complicated like you're reusing your web infrastructure for the app that your company is building you get to reuse a lot of existing infrastructure so it doesn't require you as a model developer to set up a lot of new things just to try your model out and that's really great but there are a number of pretty pronounced cons to this as well the first is that your web server in many
44
+
45
+ 12
46
+ 00:05:59,840 --> 00:06:34,080
47
+ cases like once you get beyond this streamlit and gradio type example might be written in a different language than your model like it might be written in ruby or in javascript or something like that getting your model into that language can be difficult the second reason is that oftentimes especially in early in the life cycle of building your model your model might be changing more frequently than your server code so if you have a relatively well established application but a model that you're still building you might not want to have to redeploy the entire application every single time that you make an update to the model which might be every day or even multiple times a day the third con of this approach is that it
48
+
49
+ 13
50
+ 00:06:31,360 --> 00:07:06,240
51
+ doesn't scale very well with model size so if you have a really large model that you're trying to run inference on you'll have to load that on your web server and so that's going to start to eat into the resources of that web server and might affect the user experience for people using that web server even if they're not interacting with the model or that's not the primary thing that they're doing in that web application because all of the resources from that web server are being directed to making this model run the fourth reason is that server hardware the hardware that you're probably running your web application or your mobile application on is generally not optimized very well for machine learning workloads and so in particular
52
+
53
+ 14
54
+ 00:07:04,240 --> 00:07:38,720
55
+ you're very rarely going to have a gpu on these devices that may or may not be a deal breaker which we'll come back to later in the lecture and the last con is that your model itself and the application that's part of might have very different scaling properties and you might want to be able to scale them differently so for example if you're running a very lightweight ui then it might not take a lot of resources or a lot of thought to scale it to many users but if your model itself is really complicated or very large you might need to get into some of the advanced techniques in this lecture and host these models on gpus to get them to scale you don't want to necessarily have to bring all of that complexity to your
56
+
57
+ 15
58
+ 00:07:37,120 --> 00:08:13,840
59
+ web server it's important when there's different scaling properties to be able to separate these concerns as part of the application that you're building so that brings us to the second step which is pulling your model out of the ui and there's a couple of different ways that we can do this and we'll talk about two different patterns here the first is to pull your model out of the ui and have it interact directly with the database this is called batch prediction so how does this work periodically you will get new data in and you'll run your model on each of those data points then you'll save the results of that model inference into a database this can work really well in some circumstances so for example if there's just not a lot of
60
+
61
+ 16
62
+ 00:08:11,599 --> 00:08:50,240
63
+ potential inputs to the model if you have one prediction per user or one prediction per customer or something along those lines then you can rerun your model on some frequency like every hour or every day or every week and you can have reasonably fresh predictions to return to those users just stored in your database so examples of types of problems where this can work well are you know in the early stages of building out a recommender system in some cases for doing more internal facing use cases like marketing automation if for example you want to give each of your marketing leads a score that tells your marketing your sales team how much effort to put into closing those leads then you'll have this finite universe of leads that
64
+
65
+ 17
66
+ 00:08:48,800 --> 00:09:23,680
67
+ needs a prediction for the model so you can just run a model prediction on every single possible lead store that in a database and then let your users interact with it from there how can you actually do this how do you actually run the model on the schedule the data processing and workflow tools that we talked about in the previous lecture also work really well here what you'll need to do is you'll need to re-run your data pre-processing you'll then need to load the model run the predictions and store the predictions in the database that you're using for your application and so this is exactly a directed acyclic graph a workflow of data operations that tools like dagster airflow or prefect are designed to solve
68
+
69
+ 18
70
+ 00:09:22,240 --> 00:09:58,720
71
+ it's worth noting here that there's also tools like metaflow that are designed more for a machine learning or data science use case that might be potentially even an easier way to get started so what are the pros and cons of this pattern of running your model offline and putting the predictions in a database the biggest pro is that this is just really easy to implement right it's reusing these existing batch processing tools that you may already be using for training your model and it doesn't require you to host any type of new web server to get those predictions to users you can just put the predictions in the database that your product is already using it also scales very easily because databases themselves are designed and
72
+
73
+ 19
74
+ 00:09:57,040 --> 00:10:35,120
75
+ have been engineered for decades to scale really easily it's also you know it seems like a simple pattern but it's used in production by very large scale production systems by large companies and it has been for years often times this is for things like recommender systems this is a tried and true pattern that you can run and become pretty confident that it'll work well and then it's also relatively low latency because the database itself is designed for the end application to interact with so latency was a concern that the database designers were able to solve for us there's also some very pronounced cons to this approach and the most important one is that it just doesn't work for every type of model if you have complex
76
+
77
+ 20
78
+ 00:10:33,360 --> 00:11:08,880
79
+ inputs to your model if the universe of inputs is too large to enumerate every single time you need to update your predictions then this just isn't going to work second con is that your users are not going to be getting the most up-to-date predictions from your model if the features going into your model let's say change every hour or every minute or sub-second but you only run this batch prediction job every day then the predictions your users see might be slightly stale think about this in the context of a recommender system if you're only running the predictions of the recommender system every day then those recommendations that you serve to your users won't take into account all of the context that those users have
80
+
81
+ 21
82
+ 00:11:07,040 --> 00:11:42,720
83
+ provided you in between those predictions so the movies that they watch today the tv shows that they watch today those won't be taken into account in at least the machine learning part of their recommendations but there's you know there's other algorithmic ways to make sure that you don't do things like show users the same movie twice and the final con here is that models frequently can become stale so if your batch job fails for some reason there's a timeout in one of your data pre-processing steps and the new predictions don't get dumped into the database these types of things can make this problem of not getting up-to-date predictions worse and worse and they can be very hard to detect although there's tools for data quality
84
+
85
+ 22
86
+ 00:11:41,200 --> 00:12:17,920
87
+ that can really help detect them the next pattern that we're going to talk about is rather than running the model offline and putting the predictions in a database instead let's run the model online as its own service the service is going to interact with the backend or the client itself by making requests to this model service sending hey what is the prediction for this particular input and receiving responses back the model says that the prediction for this input is this particular value the pros of this approach are it's dependable if you have a bug in your model if your model is running directly in the web server then that can crash your entire application but hosting this as an independent service in your application
88
+
89
+ 23
90
+ 00:12:16,160 --> 00:12:51,680
91
+ means that's less likely second it's more scalable so you can choose what is the best hardware what is the best infrastructure setup for the model itself and scale that as you need to without needing to worry about how that affects the rest of your application third it's really flexible if you stand up a model service for a particular model you can reuse that service in other applications or other parts of your application very easily the cons are that since this is a separate service you add a network call when your server or your client interacts with the model it has to make a request and receive a response over the network so that can add some latency to your application it adds infrastructure complexity relative to
92
+
93
+ 24
94
+ 00:12:50,560 --> 00:13:29,680
95
+ the other techniques that we've talked about before because now you're on the hook for hosting and managing a separate service just to host your model this is i think really the challenge for a lot of ml teams is that hey i'm good at training models i'm not sure how to run a web service however i do think this is the sweet spot for most ml powered products because the cons of the other approaches are just too great you really need to be able to scale models independently of the application itself in most complex use cases and for a lot of interesting uses of ml we don't have a finite universe of inputs to the model that we can just enumerate every day we really need to be able to have our users send us whatever requests that they want
96
+
97
+ 25
98
+ 00:13:27,440 --> 00:13:59,680
99
+ to get and receive a customized response back in this next section we'll talk through the basics of how to build your model service there's a few components to this we will talk about rest apis which are the language that your service will use to interact with the rest of your application we'll talk about dependency management so how to deal with these pesky versions of pytorch or tensorflow that you might need to be upgrading and we'll talk about performance optimization so how to make this run fast and scale well and then we'll talk about rollout so how to get the next version of your model into production once you're ready to deploy it and then finally once we've covered sort of the technical
100
+
101
+ 26
102
+ 00:13:58,560 --> 00:14:40,399
103
+ considerations that you'll need to think about we'll talk about managed options that solve a lot of these technical problems for you first let's talk about rest apis what are rest apis rest apis serve predictions in response to canonically formatted http requests there's other alternative protocols to rest for interacting with a service that you host on your infrastructure probably the most common one that you'll see in ml is grpc which is used in a lot of google products like tensorflow serving graphql is another really commonly used protocol in web development that is not terribly relevant for building model services so what does a rest api look like you may have seen examples of this before but when you are sending data to
104
+
105
+ 27
106
+ 00:14:37,199 --> 00:15:20,000
107
+ a web url that's formatted as a json blob oftentimes this is a rest request this is an example of what it might look like to interact with the rest api in this example we are sending some data to this url which is where the rest api is hosted api.fullstackdeeplearning.com and we're using the post method which is one of the parts of the rest standard that tells the server how it's going to interact with the data that we're sending and then we're sending this json blob of data that represents the inputs to the model that we want to receive a prediction from so one question you might ask is there any standard for how to format the inputs that we send to the model and unfortunately there isn't really any standard yet here are a few
108
+
109
+ 28
110
+ 00:15:16,959 --> 00:16:01,279
111
+ examples from rest apis for model services hosted in the major clouds and we'll see some differences here between how they expect the inputs to the model to be formatted for example in google cloud they expect a batch of inputs that is structured as a list of what they call instances each of which has values and a key in azure they expect a list of things called data where the data structure itself depends on what your model architecture is and in sagemaker they also expect instances but these instances are formatted differently than they are in google cloud so one thing i would love to see in the future is moving toward a standard interface for making rest api calls for machine learning services since the types of
112
+
113
+ 29
114
+ 00:16:00,079 --> 00:16:35,839
115
+ data that you might send to these services is pretty constrained we should be able to develop a standard as an industry the next topic we'll cover is dependency management model predictions depend not only on the weights of the model that you're running the prediction on but also on the code that's used to turn those weights into the prediction including things like pre-processing and the dependencies the specific library versions that you need in order to run the function that you called and in order for your model to make a correct prediction all of these dependencies need to be present on your web server unfortunately dependencies are a notorious cause of trouble in web applications in general and in
116
+
117
+ 30
118
+ 00:16:34,000 --> 00:17:10,799
119
+ particular in machine learning web services the reason for that is a few things one they're very hard to make consistent between your development environment and your server how do you make sure that the server is running the exact same version of tensorflow pytorch scikit-learn numpy whatever other libraries you depend on as your jupyter notebook was when you train those models the second is that they're hard to update if you update dependencies in one environment you need to update them in all environments and in machine learning in particular since a lot of these libraries are moving so quickly small changes in something like a tensorflow version can change the behavior of your model so it's important to be like
120
+
121
+ 31
122
+ 00:17:09,439 --> 00:17:47,360
123
+ particularly careful about these versions in ml at a high level there's two strategies that we'll cover for managing dependencies the first is to constrain the dependencies for just your model to save your model in a format that is agnostic that can be run anywhere and then the second is to wrap your entire inference program your entire predict function for your model into what's called a container so let's talk about how to constrain the dependencies of just your model the primary way that people do this today is through this library called onnx the open neural network exchange and the goal of onnx is to be an interoperability standard for machine learning models what they want you to be able to do is to define a neural network
124
+
125
+ 32
126
+ 00:17:45,280 --> 00:18:24,960
127
+ in any language and run it consistently anywhere no matter what inference framework you're using hardware you're using etc that's the promise the reality is that since the underlying libraries used to build these models are currently changing so quickly there's often bugs in this translation layer and in many cases this can create more problems than it actually solves for you and the other sort of open problem here is this doesn't really deal with non-library code in many cases in ml things like feature transformations image transformations you might do as part of your tensorflow or your pytorch graph but you might also just do as a python function that wraps those things and these open neural network standards like
128
+
129
+ 33
130
+ 00:18:22,960 --> 00:18:57,440
131
+ onnx don't really have a great story for how to handle pre-processing that brings us to a second strategy for managing dependencies which is containers how can you manage dependencies with containers like docker so we'll cover a few things here we'll talk about the differences between docker and general virtual machines which you might have covered in a computer science class we'll talk about how docker images are built via docker files and constructed via layers we'll talk a little bit about the ecosystem around docker and then we'll talk about specific wrappers around docker that you can use for machine learning the first thing to know about docker is how it differs from virtual machines which is an older technique for
132
+
133
+ 34
134
+ 00:18:55,600 --> 00:19:33,440
135
+ packaging up dependencies in a virtual machine you essentially package up the entire operating system as well as all the libraries and applications that are built on top of that operating system so it tends to be very heavy weight because the operating system is itself just a lot of code and expensive to run the improvement that docker made is by removing the need to package up the operating system alongside the application instead you have the libraries and applications packaged up together in something called a container and then you have a docker engine that runs on top of the operating system on your laptop or on your server that knows how to virtualize the os and run your bins and libraries and
136
+
137
+ 35
138
+ 00:19:32,080 --> 00:20:08,559
139
+ applications on top of it so we just learned that docker is much more lightweight than the typical virtual machine and by virtue of being lightweight it is used very differently than vms were used in particular a common pattern is to spin up a new docker container for every single discrete task that's part of your application so for example if you're building a web application you wouldn't just have a single docker container like you might if you were using a virtual machine instead you might have four you might have one for the web server itself one for the database one for job queue and one for your worker since each one of these parts of your application serves a different function it has different library dependencies and maybe
140
+
141
+ 36
142
+ 00:20:07,200 --> 00:20:43,039
143
+ in the future you might need to scale it differently each one of them goes into its own container and those containers are run together as part of an orchestration system which we'll talk about in a second how do you actually create a docker container docker containers are created from docker files this is what a docker file looks like it runs a sequence of steps to define the environment that you're going to run your code in so in this case it is importing another container that has some pre-packaged dependencies for running python 2.7 hopefully you're not running python 2.7 but if you were you could build a docker container that uses it using this from command at the top and then doing other things like adding
144
+
145
+ 37
146
+ 00:20:41,440 --> 00:21:21,120
147
+ data from your local machine pip installing packages exposing ports and running your actual application you can build these docker containers on your laptop and store them there if you want to when you're doing development but one of the really powerful things about docker is it also allows you to build store and pull docker containers from a docker hub that's hosted on some other server on docker servers or on your cloud provider for example the way that you would run a docker container typically is by using this docker run command so what that will do is in this case it will find this container on the right called gordon slash getting started part two and it'll try to run that container but if you're connected
148
+
149
+ 38
150
+ 00:21:19,360 --> 00:21:59,280
151
+ to a docker hub and you don't have that docker image locally then what it'll do is it'll automatically pull it from the docker hub that you're connected to the server that your docker engine is connected to it'll download that docker container and it will run it on your local machine so you can experiment with that code environment that's going to be identical to the one that you deploy on your server and in a little bit more detail docker is separated into three different components the first is the client this is what you'll be running on your laptop to build an image from a docker file that you define locally to pull an image that you want to run some code in on your laptop to run a command inside of an image those commands are
152
+
153
+ 39
154
+ 00:21:56,799 --> 00:22:35,440
155
+ actually executed by a docker host which is often run on your laptop but it doesn't have to be it can also be run on a server if you want more storage or more performance and then that docker host talks to a registry which is where all of the containers that you might want to access are stored this separation of concerns is one of the things that makes docker really powerful because you're not limited by the amount of compute and storage you have on your laptop to build pull and run docker images and you're not limited by what you have access to on your docker host to decide which images to run in fact there's a really powerful ecosystem of docker images that are available on different public docker hubs you can
156
+
157
+ 40
158
+ 00:22:33,360 --> 00:23:07,600
159
+ easily find these images modify them and contribute them back and have the full power of all the people on the internet that are building docker files and docker images there might just be one that already solves your use case out of the box it's easy to store private images in the same place as well so because of this community and lightweight nature of docker it's become incredibly popular in recent years and is pretty much ubiquitous at this point so if you're thinking about packaging dependencies for deployment this is probably the tool that you're going to want to use docker is not as hard to get started with as it sounds you'll need to read some documentation and play around with docker files a little bit to get a
160
+
161
+ 41
162
+ 00:23:05,679 --> 00:23:40,240
163
+ feel for how they work and how they fit together you oftentimes won't need to build your own docker image at all because of docker hubs and you can just pull one that already works for your use case when you're getting started that being said there is a bit of a learning curve to docker isn't there some way that we can simplify this if we're working on machine learning and there's a number of different open source packages that are designed to do exactly that one is called cog another is called bentoml and a third is called truss and these are all built by different model hosting providers that are designed to work well with their model hosting service but also just package your model and all of its dependencies in a
164
+
165
+ 42
166
+ 00:23:38,720 --> 00:24:14,960
167
+ standard docker container format so you could run it anywhere that you want to and the way that these systems tend to work is there's two components the first is there's a standard way of defining your prediction service so your like model.predict function how do you wrap that in a way that this service understands so in cog it's this base predictor class that you see on the bottom left in truss it's dependent on the model library that you're using like you see on the right hand side that's the first thing is how do you actually package up this model.predict function and then the second thing is a yaml file which sort of defines the other dependencies and package versions that are going to go into this docker
168
+
169
+ 43
170
+ 00:24:13,360 --> 00:24:47,520
171
+ container that will be run on your laptop or remotely and so this is sort of a simplified version of the steps that you would put into your docker build command but at the end of the day it packages up in the standard format so you can deploy it anywhere so if you want to have some of the advantages of using docker for making your machine learning models reproducible and deploying them but you don't want to actually go through the learning curve of learning docker or you just want something that's a little bit more automated for machine learning use cases then it's worth checking out these three libraries the next topic we'll discuss is performance optimization so how do we make models go brrr how do we make them
172
+
173
+ 44
174
+ 00:24:45,919 --> 00:25:22,559
175
+ go fast and there's a few questions that we'll need to answer here first is should we use a gpu to do inference or not we'll talk about concurrency model distillation quantization caching batching sharing the gpu and then finally libraries that automate a lot of these things for you so the spirit of this is going to be sort of a whirlwind tour through some of the major techniques of making your models go faster and we'll try to give you pointers where you can go to learn more about each of these topics the first question you might ask is should you host your model on a gpu or on a cpu there's some advantages to hosting your model on a gpu the first is that it's probably the same hardware that you train your model on to begin with so
176
+
177
+ 45
178
+ 00:25:20,640 --> 00:25:59,919
179
+ that can eliminate some lost in translation type moments the second big pro is that as your model gets really big and as your techniques get relatively advanced your traffic gets very large this is usually how you can get the sort of maximum throughput like the most number of users that are simultaneously hitting your model is by hosting the model on a gpu but gpus introduce a lot of complexity as well they're more complex to set up because they're not as well-trodden a path for hosting web services as cpus are and they're often almost always actually more expensive so i think one point that's worth emphasizing here since it's a common misconception i see all the time is just because your model was trained on a gpu does not mean that you
180
+
181
+ 46
182
+ 00:25:57,760 --> 00:26:36,080
183
+ need to actually host it on a gpu in order for it to work so consider very carefully whether you really need a gpu at all or whether you're better off especially for an early version of your model just hosting it on a cpu in fact it's possible to get very high throughput just from cpu inference at relatively low cost by using some other techniques and so one of the main ones here is concurrency concurrency means on a single host machine not just having a single copy of the model running but having multiple copies of the model running in parallel on different cpus or different cpu cores how can you actually do this the main technique that you need to be careful about here is thread tuning so making sure that in torch it
184
+
185
+ 47
186
+ 00:26:34,400 --> 00:27:10,320
187
+ knows which threads you need to use in order to actually run the model otherwise the different torch models are going to be competing for threads on your machine there's a great blog post from roblox about how they scaled up bert to serve a billion daily requests just using cpus and they found this to be much easier and much more cost effective than using gpus cpus can be very effective for scaling up to high throughput as well you don't necessarily need gpus to do that the next technique that we'll cover is model distillation what is model distillation model distillation means once you have your model that you've trained maybe a very large or very expensive model that does very well at the task that you want to
188
+
189
+ 48
190
+ 00:27:08,000 --> 00:27:44,399
191
+ solve you can train a smaller model that tries to imitate the behavior of your larger one and so this generally is a way of taking the knowledge that your larger model learned and compressing that knowledge into a much smaller model that maybe you couldn't have trained to the same degree of performance from scratch but once you have that larger model it's able to imitate it so how does this work i'll just point you to this blog post that covers several techniques for how you can do this it's worth noting that this can be tricky to do on your own and is i would say relatively infrequently done in practice in production a big exception to that is oftentimes there are distilled versions of popular models distilbert is a
192
+
193
+ 49
194
+ 00:27:42,399 --> 00:28:21,120
195
+ great example of this that are pre-trained for you that you can use for very limited performance trade-off the next technique that we're going to cover is quantization what is it this means that rather than taking all of the matrix multiplication math that you do when you make a prediction with your model and doing that all in the sort of full precision 64 or 32-bit floating point numbers that your model weights might be stored in instead you execute some of those operations or potentially all of them in a lower fidelity representation of the numbers that you're doing the math with and so these can be 16-bit floating point numbers or even in some cases 8-bit integers this introduces some trade-offs with accuracy
196
+
197
+ 50
198
+ 00:28:19,279 --> 00:28:55,600
199
+ but oftentimes this is a trade-off that's worth making because the accuracy you lose is pretty limited relative to the performance that you gain how can you do this the recommended path is to use the built-in methods in pytorch and hugging face and tensorflow lite rather than trying to roll this on your own and it's also worth starting to think about this even when you're training your model because techniques called quantization aware training can result in higher accuracy with quantized models than just naively training your model and then running quantization after the fact i want to call out one tool in particular for doing this which is the relatively new optimum library from hugging face which just makes this very
200
+
201
+ 51
202
+ 00:28:53,840 --> 00:29:31,840
203
+ easy and so if you're already using hugging face models there's little downside to trying this out next we'll talk about caching what is caching for some machine learning models if you look at the patterns of the inputs that users are requesting that model to make predictions on there's some inputs that are much more common than others so rather than asking the model to make those predictions from scratch every single time users make those requests first let's store the common requests in a cache and then let's check that cache before we actually run this expensive operation of running a forward pass on our neural network how can you do this there's a huge depth of techniques that you can use for intelligent caching but
204
+
205
+ 52
206
+ 00:29:29,760 --> 00:30:07,679
207
+ there's also a very basic way to do this using the functools library in python and so this looks like it's just adding a wrapper to your model.predict code that will essentially check the cache to see if this input is stored there and return the sort of cached prediction if it's there otherwise run the function itself and this is also one of the techniques used in the roblox blog post that i highlighted before for scaling this up to a billion requests per day a pretty important part of their approach so for some use cases you can get a lot of lift just by simple caching the next technique that we'll talk about is batching so what is the idea behind batching well typically when you run inference on a machine learning model
208
+
209
+ 53
210
+ 00:30:05,919 --> 00:30:46,240
211
+ unlike in training you are running it with batch size equals one so you have one request come in from a user and then you respond with the prediction for that request and the fact that we are running a prediction on a single request is part of why generally speaking gpus are not necessarily that much more efficient than cpus for running inference what batching does is it takes advantage of the fact that gpus can achieve much higher throughput much higher number of concurrent predictions when they do that prediction in parallel on a batch of inputs rather than on a single input at a time how does this work you have individual predictions coming in from users i want a prediction for this input i want a prediction for this input so
212
+
213
+ 54
214
+ 00:30:44,799 --> 00:31:23,360
215
+ you'll need to gather these inputs together until you have a batch of a sufficient size and then you'll run a prediction on that batch and then split the batch into the predictions that correspond to the individual requests and return those to the individual users so there's a couple of pretty tricky things here one is you'll need to tune this batch size in order to trade off between getting the most throughput from your model which generally requires a larger batch size and reducing the inference latency for your users because if you need to wait too long in order to gather enough predictions to fit into that batch then your users are gonna pay the cost of that they're gonna be the ones waiting for that response to come
216
+
217
+ 55
218
+ 00:31:21,840 --> 00:31:57,440
219
+ back so you need to tune the batch size to trade off between those two considerations you'll also need some way to shortcut this process if latency becomes too long so let's say that you have a lull in traffic and normally it takes you a tenth of a second to gather your 128 inputs that you're going to put into a batch but now all of a sudden it's taking a full second to get all those inputs that can be a really bad user experience if they just have to wait for other users to make predictions in order to see their response back so you'll want some way of shortcutting this process of gathering all these data points together if the latency is becoming too long for your user experience so hopefully it's clear from
220
+
221
+ 56
222
+ 00:31:55,600 --> 00:32:32,559
223
+ this that this is pretty complicated to implement and it's probably not something that you want to implement on your own but luckily it's built into a lot of the libraries for doing model hosting on gpus which we'll talk about in a little bit the next technique that we'll talk about is sharing the gpu between models what does this mean your model may not necessarily fully utilize your gpu for inference and this might be because your batch size is too small or because there's too much other delay in the system when you're waiting for requests so why not just have multiple models if you have multiple model services running on the same gpu how can you do this this is generally pretty hard and so this is also a place where
224
+
225
+ 57
226
+ 00:32:30,799 --> 00:33:08,080
227
+ you'll want to run an out-of-the-box model serving solution that solves this problem for you so we talked about how in gpu inference if you want to make that work well there's a number of things like sharing the gpu between models and intelligently batching the inputs to the models to trade off between latency and throughput that you probably don't want to implement yourself luckily there's a number of libraries that will solve some of these gpu hosting problems for you there's offerings from tensorflow which is pretty well baked into a lot of google cloud's products and pytorch as well as third-party tools from nvidia and from anyscale with ray nvidia's is probably the most powerful and is the one that i often see from companies that are trying
228
+
229
+ 58
230
+ 00:33:06,399 --> 00:33:43,200
231
+ to do very high throughput model serving but can also often be difficult to get started with starting with ray serve or the one that's specific to your neural net library is maybe an easier way to get started if you want to experiment with this all right we've talked about how to make your model go faster and how to optimize the performance of the model on a single server but if you're going to scale up to a large number of users interacting with your model it's not going to be enough to get the most efficiency out of one server at some point you'll need to scale horizontally to have traffic going to multiple copies of your model running on different servers so what is horizontal scaling if you have too much traffic for a single
232
+
233
+ 59
234
+ 00:33:41,360 --> 00:34:18,079
235
+ machine you're going to take that stream of traffic that's coming in and you're going to split it among multiple machines how can you actually achieve this each machine that you're running your model on will have its own separate copy of your service and then you'll route traffic between these different copies using a tool called a load balancer in practice there's two common methods of doing this one is container orchestration which is a sort of set of techniques and technologies kubernetes being the most popular for managing a large number of different containers that are running as part of one application on your infrastructure and then a second common method especially in machine learning is serverless so
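To make the load balancer idea concrete, here is a minimal sketch of round-robin routing, assuming each copy of the model service can be represented as a plain Python callable. The names `RoundRobinBalancer` and `make_replica` are hypothetical and only for illustration; a real load balancer also handles health checks, retries, and network transport.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hands each incoming request to the next replica in turn."""

    def __init__(self, replicas):
        if not replicas:
            raise ValueError("need at least one replica")
        # cycle() repeats the replica list forever, in order.
        self._replicas = cycle(replicas)

    def route(self, request):
        replica = next(self._replicas)
        return replica(request)


def make_replica(name):
    # Hypothetical stand-in for one copy of the model service on one machine.
    def serve(request):
        return f"{name} handled {request}"

    return serve


balancer = RoundRobinBalancer([make_replica("server-a"), make_replica("server-b")])
responses = [balancer.route(f"req-{i}") for i in range(4)]
```

With two replicas, consecutive requests alternate between them, so each machine sees roughly half of the traffic.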
236
+
237
+ 60
238
+ 00:34:16,800 --> 00:34:52,320
239
+ we'll talk about each of these let's start with container orchestration when we talked about docker we talked about how docker is different than typical deployment and typical virtual machines because rather than running a separate copy of the operating system for every virtual machine or program that you want to run instead you run docker on your server and then docker is able to manage these lightweight virtual machines that run each of the parts of your application that you want to run so when you deploy docker typically what you'll do is you'll run a docker host on a server and then you'll have a bunch of containers that the docker host is responsible for managing and running on that server but when you want to scale
240
+
241
+ 61
242
+ 00:34:50,960 --> 00:35:28,240
243
+ out horizontally so when you want to have multiple copies of your application running on different servers then you'll need a different tool in order to coordinate between all of these different machines and docker images the most common one is called kubernetes kubernetes works very closely together with docker to build and run containerized distributed applications kubernetes helps you remove the sort of constraint that all of the containers are running on the same machine kubernetes itself is a super interesting topic that is worth reading about if you're interested in distributed computing and infrastructure and scaling things up but for machine learning deployment if your only goal is to deploy ml models it's probably overkill
244
+
245
+ 62
246
+ 00:35:26,720 --> 00:36:01,520
247
+ to learn a ton about kubernetes there's a number of frameworks that are built on top of kubernetes that make it easier to use for deploying models the most commonly used ones in practice tend to be kubeflow serving and seldon but even if you use one of these libraries on top of kubernetes for container orchestration you're still going to be responsible for doing a lot of the infrastructure management yourself and serverless functions are an alternative that remove a lot of the need for infrastructure management and are very well suited for machine learning models the way these work is you package up your app code and your dependencies into a docker container and that docker container needs to have a single entry
248
+
249
+ 63
250
+ 00:36:00,079 --> 00:36:36,079
251
+ point function like one function that you're going to run over and over again in that container so for example in machine learning this is most often going to be your model.predict function then you deploy that container to a service like aws lambda or the equivalents in google or azure clouds and that service is responsible for running that predict function inside of that container for you over and over and over again and takes care of everything else scaling load balancing all these other considerations that if you're horizontally scaling a server would be your problem to solve on top of that there's a different pricing model so if you're running a web server then you control that whole web server and so you
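As a rough illustration of the single entry point idea, here is a sketch of a handler in the `handler(event, context)` shape that AWS Lambda uses for Python functions. Everything else is an assumption for illustration: `DummyModel` is a hypothetical stand-in for a trained model, and the event is assumed to carry a JSON string under a `"body"` key, which depends on how the function is invoked.

```python
import json


class DummyModel:
    """Hypothetical stand-in for a trained model with a predict method."""

    def predict(self, features):
        # Toy "prediction": the sum of each feature row.
        return [sum(row) for row in features]


# Created once per container, outside the handler, so that warm
# invocations reuse the loaded model instead of reloading it.
MODEL = DummyModel()


def handler(event, context):
    # Assumed event shape: {"body": '{"features": [[...], ...]}'}
    features = json.loads(event["body"])["features"]
    predictions = MODEL.predict(features)
    return {"statusCode": 200, "body": json.dumps({"predictions": predictions})}
```

Loading the model at module level rather than inside the handler is a common choice because the serverless platform runs the handler over and over in the same container.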
252
+
253
+ 64
254
+ 00:36:34,400 --> 00:37:08,560
255
+ pay for all the time that it's running 24 hours a day but with serverless you only pay for the time that these servers are actually being used to run your model you know if your model is only serving predictions or serving most of its predictions eight hours a day let's say then you're not paying for the other 16 hours where it's not serving any predictions because of all these things serverless tends to be very well suited to building model services especially if you are not an infrastructure expert and you want a quick way to get started so we recommend this as a starting point for once you get past your prototype application so the genius idea here is your servers can't actually go down if you don't have any we're doing
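The pricing difference can be put into a small worked example. This is a sketch under loud assumptions: the hourly rate is made up, and it pretends that always-on and serverless compute cost the same per hour, which real providers do not guarantee. Costs are kept in integer cents to avoid floating-point noise.

```python
HOURS_PER_DAY = 24


def always_on_cost_cents(rate_cents_per_hour, days):
    """A dedicated server is billed for every hour, busy or idle."""
    return rate_cents_per_hour * HOURS_PER_DAY * days


def serverless_cost_cents(rate_cents_per_hour, busy_hours_per_day, days):
    """Serverless is billed only for the hours spent serving predictions."""
    return rate_cents_per_hour * busy_hours_per_day * days


# Hypothetical numbers: 10 cents per hour, 30 days, 8 busy hours per day.
dedicated = always_on_cost_cents(10, 30)
serverless = serverless_cost_cents(10, 8, 30)
savings_fraction = 1 - serverless / dedicated
```

Under these assumptions the eight-hours-a-day service pays for one third of the hours, matching the 16 idle hours described in the lecture.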
256
+
257
+ 65
258
+ 00:37:06,960 --> 00:37:41,280
259
+ serverless serverless is not without its cons one of the bigger challenges that has gotten easier recently but is still often a challenge in practice is that the packages that you can deploy with these serverless applications tend to be limited in size so if you have an absolutely massive model you might run into those limits there's also a cold start problem what this means is serverless is designed to scale all the way down to zero so if you're not receiving any traffic if you're not receiving any requests for your model then you're not going to pay which is one of the big advantages of serverless but the problem is when you get that first request after the serverless function has been cold for a while it
260
+
261
+ 66
262
+ 00:37:39,520 --> 00:38:15,440
263
+ takes a while to start up it can be seconds or even minutes to get that first prediction back once you've gotten that first prediction back it's faster to get subsequent predictions back but it's still worth being aware of this limitation another practical challenge is that many of these serverless services are not well designed for building pipelines of models so if you have a complicated chaining of logic to produce your prediction then it might be difficult to implement that in a serverless context there's little or no state management available in serverless functions so for example if caching is really important for your application it can be difficult to build that caching
264
+
265
+ 67
266
+ 00:38:13,920 --> 00:38:50,320
267
+ in if you're deploying your model in serverless and there's often limited deployment tooling as well so rolling out new versions of the serverless function there's often not all the tooling that you'd want to make that really easy and then finally these serverless functions today are cpu only and they have limited execution time of you know a few seconds or a few minutes so if you truly need gpus for inference then serverless is not going to be your answer but i don't think that limitation is going to be true forever in fact i think we might be pretty close to serverless gpus there's already a couple of startups that are claiming to offer serverless gpu for inference and so if you want to do inference on gpus but you
268
+
269
+ 68
270
+ 00:38:48,560 --> 00:39:24,880
271
+ don't want to manage gpu machines yourself i would recommend checking out these two options from these two young startups the next topic that we'll cover in building a model service is rollouts so what do you need to think about in terms of rolling out new models if serving is how you turn your machine learning model into something that can respond to requests that lives on a web server that anyone or anyone that you want to can send a request to and get a prediction back then rollouts are how you manage and update these services so if you have a new version of a model or if you want to split traffic between two different versions to run an a b test how do you actually do that from an infrastructure perspective you probably
272
+
273
+ 69
274
+ 00:39:23,440 --> 00:39:56,960
275
+ want to have the ability to do a few different things so one is to roll out new versions gradually what that means is when you have version n plus one of your model and you want to replace version n with it it's sometimes helpful to be able to rather than just instantly switching over all the traffic to n plus one instead start by sending one percent of your traffic to n plus one and then ten percent and then 50 and then once you're confident that it's working well then switch all of your traffic over to it so you'll want to be able to roll out new versions gradually on the flip side you'll want to be able to roll back to an old version instantly so if you detect a problem with the new version of the model that you deployed hey on this
276
+
277
+ 70
278
+ 00:39:55,200 --> 00:40:28,720
279
+ 10 percent of traffic that i'm sending to the new model users are not responding well to it or it's sending a bunch of errors you'll want to be able to instantly revert to sending all of your traffic to the older version of the model you want to be able to split traffic between versions sort of a prerequisite for doing these things as well as running an a b test you also want to be able to deploy pipelines of models or deploy models in a way such that they can shadow the prediction traffic they can look at the same inputs as your main model and produce predictions that don't get sent back to users so that you can test whether the predictions look reasonable before you start to show them to users this is just kind of like a
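Gradual rollout, traffic splitting, and instant rollback can be sketched with a small router. This is an illustrative sketch, not how any particular serving platform implements it: the `Router` class and its method names are hypothetical, models are plain callables, and requests are bucketed by a CRC32 hash of a request id so that the same id always lands on the same version.

```python
import zlib


class Router:
    """Splits traffic between an old and a new model version."""

    def __init__(self, old_model, new_model):
        self.old_model = old_model
        self.new_model = new_model
        self.new_percent = 0  # start with all traffic on the old version

    def set_new_percent(self, percent):
        if not 0 <= percent <= 100:
            raise ValueError("percent must be between 0 and 100")
        self.new_percent = percent

    def rollback(self):
        # Instant rollback: send everything to the old version again.
        self.new_percent = 0

    def predict(self, request_id, x):
        # Deterministic bucket in [0, 100) derived from the request id.
        bucket = zlib.crc32(request_id.encode("utf-8")) % 100
        model = self.new_model if bucket < self.new_percent else self.old_model
        return model(x)


router = Router(old_model=lambda x: "v1", new_model=lambda x: "v2")
ids = [f"user-{i}" for i in range(200)]
```

A rollout would step `set_new_percent` through values such as 1, 10, 50, and 100 while watching error rates, and call `rollback` as soon as something looks wrong.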
280
+
281
+ 71
282
+ 00:40:26,800 --> 00:41:06,720
283
+ quick flavor of some of the things that you might want to solve for in a way of doing model rollouts this is a challenging infrastructure problem so it's beyond the scope of this lecture in this class really if you're using a managed option which we'll come to in a bit or you have infrastructure that's provided for you by your team it may take care of this for you already but if not then looking into a managed option might be a good idea so managed options take care of a lot of the scaling and rollout challenges that you'd otherwise face if you host models yourself even on something like aws lambda there's a few different categories of options here the cloud providers all provide their own sort of managed options as well as in
284
+
285
+ 72
286
+ 00:41:04,560 --> 00:41:42,079
287
+ most of the end-to-end ml platforms so if you're already using one of these cloud providers or end-to-end ml platforms pretty heavily it's worth checking out their offering to see if that works for you and there's also a number of startups that have offerings here so there's a couple that are i would say more focused on developer experience like bento ml and cortex so if you find sagemaker really difficult to use or you just hate the developer experience for it it might be worth checking one of those out cortex recently was acquired by databricks so it might also start to be incorporated more into their offerings then there's startups that have offerings that have good ease of use but are also really focused on performance
288
+
289
+ 73
290
+ 00:41:39,760 --> 00:42:17,440
291
+ banana is a sort of popular upcoming example of that to give you a feel of what these managed options look like i want to double click on sagemaker which is probably the most popular managed offering the happy path in sagemaker is if your model is already in a digestible format a hugging face model or a scikit-learn model or something like that and in those cases deploying to sagemaker is pretty easy so you will instead of using like kind of a base hugging face class you'll instead use this sagemaker wrapper for the hugging face class and then call fit like you normally would that can also be run on the cloud and then to deploy it you just will call the dot deploy method of this hugging face wrapper and you'll specify
292
+
293
+ 74
294
+ 00:42:15,920 --> 00:42:52,319
295
+ how many instances you want this to run on as well as how beefy you need the hardware to be to run it then you can just call predictor.predict using some input data and it'll run that prediction on the cloud for you in order to return your response back you know i would say in the past sagemaker had a reputation for being difficult to use if you're just doing inference i don't think that reputation is that warranted i think it's actually like pretty easy to use and in many cases is a very good choice for deploying models because it has a lot of easy wrappers to prevent you from needing to build your own docker containers or things like that and it offers options for both deploying models to a dedicated web server like you see
296
+
297
+ 75
298
+ 00:42:50,960 --> 00:43:29,119
299
+ in this example as well as to a serverless instance the main trade-offs with using sagemaker are one if you want to do something more complicated than a standard hugging face or scikit-learn model you'll again still need to deploy a container and the interface for deploying a container is maybe not as user friendly or straightforward as you might like it to be interestingly as of yesterday it was quite a bit more expensive for deploying models to dedicated instances than raw ec2 but maybe not so much more expensive than serverless if you're going to go serverless anyway and you're willing to pay 20 percent overhead to have something that is a better experience for deploying most machine learning models then sagemaker is worth checking out if
300
+
301
+ 76
302
+ 00:43:26,960 --> 00:44:07,839
303
+ you're already on amazon takeaways from building a model service first you probably don't need to do gpu inference and if you're doing cpu inference then oftentimes scaling horizontally to more servers or even just using serverless which is the simplest option is enough serverless is probably the recommended option to go with if you can get away with cpus and it's especially helpful if your traffic is spiky so if you have more users in the morning or if you only send your model predictions at night or if your traffic is low volume where you wouldn't max out a full beefy web server anyway sagemaker is increasingly a perfectly good way to get started if you're on aws it can get expensive once you've gotten to the
304
+
305
+ 77
306
+ 00:44:06,400 --> 00:44:42,720
307
+ point where that cost really starts to matter then you can consider other options if you do decide to go down the route of doing gpu inference then don't try to roll your own gpu inference instead it's worth investing in using a tool like tensorflow serving or triton because these will end up saving you time and leading to better performance in the end and lastly i think it's worth keeping an eye on the startups in this space for on-demand gpu inference because i think that could change the equation of whether gpu inference is really worth it for machine learning models the next topic that we'll cover is moving your model out of a web server entirely and pushing it to the edge so pushing it to where your users are when
308
+
309
+ 78
310
+ 00:44:41,520 --> 00:45:15,520
311
+ should you actually start thinking about this sometimes it's just obvious let's say that your users have no reliable internet connection they're driving a self-driving car in the desert or if you have very strict data security or privacy requirements if you're building on an apple device and you can't send the data that you need to make the predictions back to a web server otherwise if you don't have those strict requirements the trade-off that you'll need to consider is both the accuracy of your model and the latency of your user receiving a response from that model affect the thing that we ultimately care about which is building a good end user experience latency has a couple of different components to it one
312
+
313
+ 79
314
+ 00:45:13,599 --> 00:45:50,240
315
+ component to it is the amount of time it takes the model to make the prediction itself but the other component is the network round trip so how long it takes for the user's request to get to your model service and how long it takes for the prediction to get back to the client device that your user is running on and so if you have exhausted your options for reducing the amount of time that it takes for the model to make a prediction or if your requirements are just so strict that there's no way for you to get within your latency sla by just reducing the amount of time it takes for the model to make a prediction then it's worth considering moving to the edge even if you have you know reliable internet connection and don't have very
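The two latency components add up, which gives a simple way to check whether server-side deployment can ever meet a latency target. The sketch below uses hypothetical millisecond figures and hypothetical function names; it only encodes the reasoning in the passage, namely that the edge becomes worth considering when the best achievable model time plus the network round trip still exceeds the SLA.

```python
def total_latency_ms(model_ms, network_round_trip_ms):
    """End-to-end latency seen by the user for a server-side prediction."""
    return model_ms + network_round_trip_ms


def edge_worth_considering(best_model_ms, network_round_trip_ms, sla_ms):
    """True if even the fastest achievable model misses the SLA over the network."""
    return total_latency_ms(best_model_ms, network_round_trip_ms) > sla_ms


# Hypothetical numbers: a 20 ms model behind a 100 ms round trip, 50 ms SLA.
slow_network_case = edge_worth_considering(20, 100, 50)
# Same model with a 20 ms round trip fits within the 50 ms SLA.
fast_network_case = edge_worth_considering(20, 20, 50)
```

On the edge the network term drops out, so only the on-device model time has to fit within the target.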
316
+
317
+ 80
318
+ 00:45:48,640 --> 00:46:23,119
319
+ strict data security and privacy requirements but it's worth noting that moving to the edge adds a lot of complexity that isn't present in web development so think carefully about whether you really need this this is the model that we're considering in edge prediction where the model itself is running on the client device as opposed to running on the server or in its own service the way this works is you'll send the weights to the client device and then the client will load the model and interact with it directly there's a number of pros and cons to this approach the biggest pro is that this is the lowest latency way that you can build machine learning powered products and latency is often a pretty important
320
+
321
+ 81
322
+ 00:46:21,440 --> 00:46:56,240
323
+ driver of user experience it doesn't require an internet connection so if you're building robots or other types of devices that you want to run ml on this can be a very good option it's great with data security because the data that's needed to make the prediction never needs to leave the user's device and in some sense you get scale for free right because rather than needing to think about hey how do i scale up my web service to serve the needs of all my users each of those users will bring their own hardware that will be used to run the model's predictions so you don't need to think as much about how to scale up and down the resources you need for running model inference there's some pretty pronounced cons to this approach
324
+
325
+ 82
326
+ 00:46:54,640 --> 00:47:32,079
327
+ as well first of all on these edge devices you generally have very limited hardware resources available so if you're used to running every single one of your model predictions on a beefy modern gpu machine you're going to be in for a bit of a shock when it comes to trying to get your model to work on the devices that you need it to work on the tools that you use to do this to make models run on limited hardware are less full featured and in many cases harder to use and more error and bug prone than the neural network libraries that you might be used to working with in tensorflow and pytorch since you need to send updated model weights to the device it can be very difficult to update models in web deployment you have
328
+
329
+ 83
330
+ 00:47:30,480 --> 00:48:05,520
331
+ full control over what version of the model is deployed and so if there's a bug you can roll out a fix very quickly but on the edge you need to think a lot more carefully about your strategy for updating the version of the model that your users are running on their devices because they may not always be able to get the latest model and then lastly when things do go wrong so if your model is making errors or mistakes it can be very difficult to detect those errors and fix them and debug them because you don't have the raw data that's going through your models available to you as a model developer since it's all on the device of your user next we're gonna give a lightning tour of the different frameworks that you can use for doing
332
+
333
+ 84
334
+ 00:48:04,000 --> 00:48:40,960
335
+ edge deployment and the right framework to pick depends both on how you train your model and what the target device you want to deploy it on is so we're not going to aim to go particularly deep on any of these options but really just to give you sort of a broad picture of what are the options you can consider as you're making this decision so we'll split this up mostly by what device you're deploying to so simplest answer is if you're deploying to an nvidia device then the right answer is probably tensorrt so whether that's like a gpu like the one you train your model on or one of nvidia's devices that's more specially designed to deploy on the edge tensorrt tends to be a go-to option there if instead
336
+
337
+ 85
338
+ 00:48:38,720 --> 00:49:23,359
339
+ you're deploying not to an nvidia device but to a phone then both android and apple have libraries for deploying neural networks on their particular os's which are good options if you know that you're only going to be deploying to an apple device or to an android device but if you're using pytorch and you want to be able to deploy both on ios and on android then you can look into pytorch mobile which compiles pytorch down into something that can be run on either of those operating systems similarly tensorflow lite aims to make tensorflow work on different mobile os's as well as other edge devices that are neither mobile devices nor nvidia devices if you're deploying not to an nvidia device not to a phone and not to
340
+
341
+ 86
342
+ 00:49:21,839 --> 00:49:58,800
343
+ some other edge device that you might consider but deploying to the browser for reasons of performance or scalability or data privacy then tensorflow.js is probably the main example to look at here i'm not aware of a good option for deploying pytorch to the browser and then lastly you know you might be thinking why is there such a large universe of options like i need to follow this complicated decision tree to pick something that depends on the way i train my model the target device i'm deploying it to there aren't even good ways of filling in some of the cells in that graph like how do you run a pytorch model on an edge device that is not a phone for example it's maybe not super clear in that case it might be
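The decision tree described in this part of the lecture can be written down as a small lookup function. This is only a summary of the spoken recommendations, not an official compatibility matrix: the function name and the string labels for targets and libraries are hypothetical, and it returns `None` where the lecture says no good option is known.

```python
def suggest_edge_framework(target, library=None):
    """Map a target device and training library to the framework named in the lecture.

    target: "nvidia", "phone", "browser", or "other" (hypothetical labels)
    library: "pytorch", "tensorflow", or None if it does not matter
    """
    if target == "nvidia":
        return "TensorRT"
    if target == "browser":
        # The lecture names TensorFlow.js and no good PyTorch option.
        return None if library == "pytorch" else "TensorFlow.js"
    if target == "phone":
        if library == "pytorch":
            return "PyTorch Mobile"
        if library == "tensorflow":
            return "TensorFlow Lite"
        return "platform-native library (Apple or Android)"
    # Other edge devices: TensorFlow Lite for TensorFlow models,
    # otherwise a library-agnostic compiler such as Apache TVM.
    if library == "tensorflow":
        return "TensorFlow Lite"
    return "Apache TVM"
```

For example, a PyTorch model headed for an edge device that is not a phone falls through to Apache TVM, which is the case the lecture calls out as unclear.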
344
+
345
+ 87
346
+ 00:49:56,720 --> 00:50:33,680
347
+ worth looking into this library called apache tvm apache tvm aims to be a library agnostic and target agnostic tool for compiling your model down into something that can run anywhere the idea is build your model anywhere run it anywhere apache tvm has some adoption but is i would say at this point still pretty far from being a standard in the industry but it's an option that's worth looking into if you need to make your models work on many different types of devices and then lastly i would say pay attention to this space i think this is another sort of pretty active area for development for machine learning startups in particular there's a startup around apache tvm called octoml which is worth looking into and there's a new
348
+
349
+ 88
350
+ 00:50:32,000 --> 00:51:11,839
351
+ startup that's built by the developers of a lower level library called mlir called modular which is also aiming to solve potentially some of the problems around edge deployment as well as tinyml which is a project out of google we talked about the frameworks that you can use to actually run your model on the edge but those are only going to go so far if your model is way too huge to actually put it on the edge at all and so we need ways of creating more efficient models in a previous section we talked about quantization and distillation both of those techniques are pretty helpful for designing these types of models but there's also model architectures that are specifically designed to work well on mobile or edge
352
+
353
+ 89
354
+ 00:51:09,760 --> 00:51:48,559
355
+ devices and the operative example here is mobilenets the idea of mobilenets is to take some of the expensive operations in a typical convnet like convolutional layers with larger filter sizes and replace them with cheaper operations like one by one convolutions and so it's worth checking out this mobilenet paper if you want to learn a little bit more about how mobilenets work and maybe draw inspiration for how to design a mobile-friendly architecture for your problem mobilenets in particular are a very good tool for a mobile deployment they tend to not have a huge trade-off in terms of accuracy relative to larger models but they are much much smaller and easier to fit on edge devices another case study
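To see why swapping in cheaper operations helps, here is a small parameter-count comparison. It goes slightly beyond what is said in the lecture: it assumes the depthwise separable factorization described in the MobileNet paper, where a standard convolution is replaced by a per-channel (depthwise) convolution followed by a one by one (pointwise) convolution. The layer sizes are hypothetical and biases are ignored.

```python
def standard_conv_params(kernel, c_in, c_out):
    """Weights in a standard kernel x kernel convolution (no bias)."""
    return kernel * kernel * c_in * c_out


def depthwise_separable_params(kernel, c_in, c_out):
    """Weights in a depthwise conv followed by a 1x1 pointwise conv (no bias)."""
    depthwise = kernel * kernel * c_in  # one kernel x kernel filter per input channel
    pointwise = c_in * c_out            # 1x1 convolution that mixes channels
    return depthwise + pointwise


# Hypothetical layer: 3x3 kernel, 64 input channels, 128 output channels.
standard = standard_conv_params(3, 64, 128)
separable = depthwise_separable_params(3, 64, 128)
reduction = standard / separable
```

For this layer the factorized version needs roughly one eighth of the weights, which is the kind of saving that makes these architectures easier to fit on edge devices.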
356
+
357
+ 90
358
+ 00:51:46,720 --> 00:52:23,200
359
+ that i recommend checking out is looking into distilbert distilbert is an example of model distillation that works really well to get a smaller version of bert that removes some of the more expensive operations and uses model distillation to have a model that's not much less performant than bert but takes up much less space and runs faster so to wrap up our discussion on edge deployment i want to talk a little bit about some of the sort of key mindsets for edge deployment that i've learned from talking to a bunch of practitioners who have a lot more experience than i do in deploying machine learning models on the edge the first is there's a temptation i think to find the perfect model architecture first and then figuring out how to make
360
+
361
+ 91
362
+ 00:52:21,119 --> 00:52:59,599
363
+ it work on your device and oftentimes if you're deploying on a web server you can make this work because you always have the option to scale up horizontally and so if you have a huge model it might be expensive to run but you can still make it work but on the edge practitioners believe that the best thing to do is to choose your architecture with your target hardware in mind so you should not be considering architectures that have no way of working on your device and kind of a rule of thumb is you might be able to make up for a factor of let's say an order of magnitude 2 to 10x in terms of inference time or model size through some combination of distillation quantization and other tricks but usually you're not going to get much
364
+
365
+ 92
366
+ 00:52:58,000 --> 00:53:33,920
367
+ more than a 10x improvement so if your model is 100 times too large or too slow to run in your target context then you probably shouldn't even consider that architecture the next mindset is once you have one version of the model that works on your edge device you can iterate locally without needing to necessarily test all the changes that you make on that device which is really helpful because deploying and testing on the edge itself is tricky and potentially expensive but you can iterate locally once the version that you're iterating on does work as long as you only gradually add to the size of the model or the latency of the model and one thing that practitioners recommended doing that is i think a step
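The rule of thumb can be turned into a quick feasibility check. This is a sketch with a hypothetical function name; the 10x ceiling is the speaker's rough estimate for what distillation, quantization, and similar tricks can recover, not a hard limit.

```python
def worth_considering(required, budget, max_gain=10.0):
    """Return True if the architecture is within reach of the device budget.

    required: what the model currently needs (for example ms or MB)
    budget:   what the target device allows, in the same units
    max_gain: largest improvement you expect from optimization tricks
    """
    if budget <= 0:
        raise ValueError("budget must be positive")
    return required / budget <= max_gain


# Hypothetical numbers: a 500 ms model against a 100 ms budget is 5x too slow.
within_reach = worth_considering(500, 100)
# A model that is 100x too large is out of reach under the rule of thumb.
out_of_reach = worth_considering(10_000, 100)
```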
368
+
369
+ 93
370
+ 00:53:32,960 --> 00:54:09,599
371
+ that's worth taking if you're going to do this is to add metrics or add tests for model size and latency so that if you're iterating locally and you get a little bit carried away and you double the size of your model or triple the size of your model you'll at least have a test that reminds you like hey you probably need to double check to make sure that this model is actually going to run on the device that we need it to run on another mindset that i learned from practitioners of edge deployment is to treat tuning the model for your device as an additional risk in the model deployment life cycle and test it accordingly so for example always test your models on production hardware before actually deploying them to
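A test for model size and latency might look like the following sketch. All names and budgets are hypothetical: the model is a toy list of weights, size is estimated as four bytes per weight, and latency is measured on whatever machine runs the test, so it is a local guardrail rather than a replacement for testing on the target device.

```python
import time


def model_size_bytes(weights, bytes_per_weight=4):
    """Rough size estimate, assuming 32-bit weights by default."""
    return len(weights) * bytes_per_weight


def median_latency_s(predict, example, runs=20):
    """Median wall-clock time of a single prediction over several runs."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        predict(example)
        timings.append(time.perf_counter() - start)
    timings.sort()
    return timings[len(timings) // 2]


def check_budgets(weights, predict, example, max_bytes, max_latency_s):
    """Return a list of budget violations; an empty list means the model fits."""
    problems = []
    if model_size_bytes(weights) > max_bytes:
        problems.append("model size exceeds budget")
    if median_latency_s(predict, example) > max_latency_s:
        problems.append("latency exceeds budget")
    return problems


# Toy model: 1000 weights, prediction is a weighted sum of a scalar input.
toy_weights = [0.0] * 1000


def toy_predict(x):
    return sum(w * x for w in toy_weights)
```

Running `check_budgets` in a test suite means that doubling or tripling the model by accident shows up as a failing check instead of a surprise on the device.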
372
+
373
+ 94
374
+ 00:54:07,839 --> 00:54:45,359
375
+ production hardware now this may seem obvious but it's not the easiest thing to do in practice and so some folks that are newer to edge deployment will skip this step the reason why this is important is because since these edge deployment libraries are immature there can often be minor differences in the way that the neural network works on your edge device versus how it works on your training device or on your laptop so it's important to run the prediction function of your model on that edge device on some benchmark data set to test both the latency as well as the accuracy of the model on that particular hardware before you deploy it otherwise the differences in how your model works on that hardware versus how it works in
376
+
377
+ 95
378
+ 00:54:43,280 --> 00:55:24,240
379
+ your development environment can lead to unforeseen errors or unforeseen degradations in accuracy of your deployed model then lastly since machine learning models in general can be really finicky it's a good idea to build fallback mechanisms into the application in case the model fails or you accidentally roll out a bad version of the model or the model is running too slow to solve the task for your user and these fallback mechanisms can look like earlier versions of your model much simpler or smaller models that you know are going to be reliable and run in the amount of time you need them to run in or even just like rule-based functions where if your model is taking too long to make a prediction or is erroring out
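A fallback mechanism for failures and slow predictions can be sketched with the standard library. This is one possible shape, not a prescribed design: `predict_with_fallback`, `rule_based`, `broken_model`, and `slow_model` are hypothetical names, and the timeout is enforced by waiting on a background thread, which means a slow primary call keeps running in the background after the fallback has answered.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

_POOL = ThreadPoolExecutor(max_workers=2)


def predict_with_fallback(primary, fallback, x, timeout_s=0.1):
    """Use the primary model, but answer with the fallback on error or timeout."""
    future = _POOL.submit(primary, x)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        # Primary model is too slow for this request.
        return fallback(x)
    except Exception:
        # Primary model raised an error.
        return fallback(x)


def rule_based(x):
    # Hypothetical simple rule that always returns a safe default.
    return 0


def broken_model(x):
    raise RuntimeError("bad model version")


def slow_model(x):
    time.sleep(0.3)
    return x
```

The fallback here is a rule-based function, but the same wrapper works with an earlier model version or a smaller model in its place.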
380
+
381
+ 96
382
+ 00:55:22,799 --> 00:55:58,000
383
+ or something you still have something that is going to return a response to your end user so to wrap up our discussion of edge deployment first thing to remind you of is web deployment is truly much easier than edge deployment so only use edge deployment if you really need to second you'll need to choose a framework to do edge deployment and the way that you'll do this is by matching the library that you use to build your neural network and the available hardware picking the corresponding edge deployment framework that matches those two constraints if you want to be more flexible like if you want your model to be able to work on multiple devices it's worth considering something like apache tvm third start considering the
384
+
385
+ 97
386
+ 00:55:56,480 --> 00:56:29,440
387
+ additional constraints that you'll get from edge deployment at the beginning of your project don't wait until you've invested three months into building the perfect model to think about whether that model is actually going to be able to run on the edge instead make sure that those constraints for your edge deployment are taken into consideration from day one and choose your architectures and your training methodologies accordingly to wrap up our discussion of deploying machine learning models deploying models is a necessary step of building a machine learning powered product but it's also a really useful one for making your models better because only in real life do you get to see how your model actually works on the
388
+
389
+ 98
390
+ 00:56:27,839 --> 00:57:03,040
391
+ task that we really care about so the mindsets that we encourage you to have here are deploy early and deploy often so you can start collecting that feedback from the real world as quickly as possible keep it simple and add complexity only as you need to because deployment can be a rabbit hole and there's a lot of complexity to deal with here so make sure that you really need that complexity so start by building a prototype then once you need to start to scale it up then separate your model from your ui by either doing batch predictions or building a model service then once the like sort of naive way that you've deployed your model stops scaling then you can either learn the tricks to scale or use a managed
392
+
393
+ 99
394
+ 00:57:00,559 --> 00:57:29,839
395
+ service or a cloud provider option to handle a lot of that scaling for you and then lastly if you really need to be able to operate your model on a device that doesn't have consistent access to the internet if you have very hard data security requirements or if you really really really want to go fast then consider moving your model to the edge but be aware that's going to add a lot of complexity and force you to deal with some less mature tools when you want to do that that wraps up our lecture on deployment and we'll see you next week
396
+
documents/lecture-06.md ADDED
@@ -0,0 +1,809 @@
1
+ ---
2
+ description: How to continuously improve models in production
3
+ ---
4
+
5
+ # Lecture 6: Continual Learning
6
+
7
+ <div align="center">
8
+ <iframe width="720" height="405" src="https://www.youtube-nocookie.com/embed/nra0Tt3a-Oc?list=PL1T8fO7ArWleMMI8KPJ_5D5XSlovTW_Ur" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
9
+ </div>
10
+
11
+ Lecture by [Josh Tobin](https://twitter.com/josh_tobin_).
12
+ Notes by [James Le](https://twitter.com/le_james94) and [Vishnu Rachakonda](https://www.linkedin.com/in/vrachakonda/).<br />
13
+ Published September 12, 2022.
14
+ [Download slides](https://fsdl.me/2022-lecture-06-slides).
15
+
16
+ ## 1 - Overview
17
+
18
+ The core justification for continual learning is that, unlike in
19
+ academia, we never deal with static data distributions in the real
20
+ world. The implication is that: **if you want to use ML in production
21
+ and build ML-powered products, you need to think about your goal of
22
+ building a continual learning system, not just a static model**.
23
+
24
+ Recalling the data flywheel that we've described in this class before:
25
+ as you get more users, those users bring more data. You can use the data
26
+ to make a better model. A better model helps you attract even more users
27
+ and build a better model over time. Andrej Karpathy described the most
28
+ optimistic version of it as "[Operation
29
+ Vacation](https://www.youtube.com/watch?v=hx7BXih7zx8)" -
30
+ if we make our continual learning system good enough, it'll get better
31
+ on its own over time, and ML engineers can just go on vacation.
32
+
33
+ ![](./media/image6.png)
34
+
35
+ The reality is quite different. Initially, we gather, clean, and label
36
+ some data. We train a model on that data. Then we evaluate the model and
37
+ loop back to training the model to improve it based on our evaluations.
38
+ Finally, we get a minimum viable model and deploy it into production.
39
+
40
+ ![](./media/image1.png)
41
+
42
+ The problem begins after we deploy the model: we generally don't have a
43
+ great way of measuring how our models are actually performing in
44
+ production. Often, we just spot-check some predictions to see if they
45
+ are doing what they are supposed to do. If it seems to work, then it's
46
+ great. We move on to work on other things.
47
+
48
+ ![](./media/image8.png)
49
+
50
+ Unfortunately, the ML engineer is usually not the first to discover the
51
+ problem. Some business user or product manager gets
52
+ complaints from users about a dipping metric, which leads to an
53
+ investigation. This already costs the company money because the product
54
+ and business teams must investigate the problem.
55
+
56
+ ![](./media/image12.png)
57
+
58
+ Eventually, they point back to the ML engineer and the model they are
59
+ responsible for. At this point, we are stuck doing ad-hoc analyses
60
+ because we don't know what caused the model failure. Eventually, we can
61
+ run a bunch of SQL queries and paste together some Jupyter notebooks to
62
+ figure out what the problem is. If we are lucky, we can run an A/B test.
63
+ If the test looks good, we'll deploy it into production. Then, we are
64
+ back to where we started - **not getting ongoing feedback about how the
65
+ model is doing in production**.
66
+
67
+ The upshot is that **continual learning is the least well-understood
68
+ part of the production ML lifecycle**. Very few companies are doing this
69
+ in production today. This lecture focuses on how to improve different
70
+ steps of the continual learning process, pointers to learn about each
71
+ step, and recommendations for doing it pragmatically and adopting it
72
+ gradually.
73
+
74
+ ## 2 - How to Think About Continual Learning
75
+
76
+ Our opinionated view about continual learning is **training a sequence
77
+ of models that can adapt to a continuous stream of data that comes into
78
+ production.** You can think about continual learning as an outer loop in
79
+ your training process. On one end of the loop is your application, which
80
+ consists of a model and some other code. Users interact with that
81
+ application by submitting requests, getting predictions back, and
82
+ submitting feedback about how well the model did at providing that
83
+ prediction.
84
+
85
+ The continual learning loop starts with **logging**, which is how we get
86
+ all the data into the loop. Then we have **data curation**, **triggers**
87
+ for the retraining process, **dataset formation** to pick the data to
88
+ retrain on, the **training** process itself, and **offline testing** to
89
+ validate whether the retrained model is good enough to go into
90
+ production. After the model is deployed, we have **online testing**, and
91
+ that brings the next version of the model into production, where we can
92
+ start the loop all over.
93
+
94
+ Each of these stages passes the output to the next step. Output is
95
+ defined by a set of rules. These rules combine to form our **retraining
96
+ strategy**. Let's discuss what the retraining strategy looks like for
97
+ each stage:
98
+
99
+ ![](./media/image7.png)
100
+
101
+
102
+ At the **logging** stage, the key question answered by the retraining
103
+ strategy is **what data should we store**? At the end of this stage, we
104
+ have an "infinite stream" of potentially unlabeled data that comes from
105
+ production and can be used for downstream analysis.
106
+
107
+ ![](./media/image3.png)
108
+
109
+
110
+ At the **curation** stage, the key rules we need to define are **what
111
+ data from that infinite stream will we prioritize for labeling and
112
+ potential retraining?** At the end of this stage, we have a reservoir of
113
+ candidate training points that have labels and are fully ready to be fed
114
+ back into a training process.
115
+
116
+ ![](./media/image5.png)
117
+
118
+
119
+ At the **retraining trigger** stage, the key question is **when should
120
+ we retrain?** The output of this stage is a signal to kick off a
121
+ retraining job.
122
+
123
+ ![](./media/image2.png)
124
+
125
+
126
+ At the **dataset formation** stage, the key rules we need to define are
127
+ **from this entire reservoir of data, what specific subset of that data
128
+ are we using to train on for this particular training job?** The output
129
+ of this stage is a view into that reservoir of training data that
130
+ specifies the exact data points to be used for the training job.
131
+
132
+ ![](./media/image22.png)
133
+
134
+
135
+ At the **offline testing** stage, the key rule we need to define is
136
+ **what "good enough" looks like for all stakeholders.** The output of
137
+ this stage is equivalent to a "pull request" report card for your model
138
+ with a clear sign-off process. Once you are signed off, the new model
139
+ will roll out into production.
140
+
141
+ ![](./media/image21.png)
142
+
143
+
144
+ Finally, at the **deployment and online testing** stage, the key rule to
145
+ define is **how do we know if this deployment was successful?** The
146
+ output of this stage is a signal to roll this model out fully to all of
147
+ your users.
148
+
149
+ In an idealized world, from an ML engineer's perspective, once the first
150
+ version of the model is deployed, our job is not to retrain the model
151
+ directly. Instead, we want to sit on top of the retraining
152
+ strategy and try to improve that strategy over time. Rather than
153
+ training models daily, we look at metrics about how well the strategy is
154
+ working and how well it's solving the task of improving our model over
155
+ time in response to changes in the world. We provide input by
156
+ tuning the strategy to do a better job of solving that task.
157
+
158
+ For most ML engineers, our jobs don't feel like that at a high level.
159
+ **Our retraining strategy is just retraining models whenever we feel
160
+ like it**. We can get good results from ad-hoc retraining, but when you
161
+ start getting consistent results and no one is actively working on the
162
+ model day to day anymore, then it's worth starting to add some
163
+ automation. Alternatively, if you find yourself needing to retrain the
164
+ model more than once a week (or even more frequently than that) to deal
165
+ with changing results in the real world, then it's worth investing in
166
+ automation just to save yourself.
167
+
168
+ ## 3 - Periodic Retraining
169
+
170
+ The first baseline retraining strategy that you should consider after
171
+ you move on from ad-hoc is just **periodic retraining**:
172
+
173
+ 1. At the logging stage, we simply log everything.
174
+
175
+ 2. At the curation stage, we sample uniformly at random from the data
176
+ that we've logged up until we get the maximum number of data
177
+ points that we are able to handle. Then we label them using some
178
+ automated tools.
179
+
180
+ 3. Our retraining trigger will just be periodic.
181
+
182
+ 4. We train once a week, but we do it on the last month's data, for
183
+ example.
184
+
185
+ 5. Then we compute the test set accuracy after each training, set a
186
+ threshold on that, or more likely manually review the results each
187
+ time, and spot-check some of the predictions.
188
+
189
+ 6. When we deploy the model, we do spot evaluations of that deployed
190
+ model on a few individual predictions to make sure things look
191
+ healthy.
192
+
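The six steps above can be sketched as one small configuration object. This is a minimal illustration, not a reference implementation: the class name `PeriodicRetrainingStrategy`, its fields, the default values, and the `timestamp` key on each logged record are all hypothetical.

```python
import random
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class PeriodicRetrainingStrategy:
    # All defaults are illustrative assumptions, not recommendations.
    max_labeled_points: int = 1000   # curation: labeling budget
    retrain_every_days: int = 7      # trigger: train once a week
    window_days: int = 30            # dataset formation: last month's data
    min_test_accuracy: float = 0.9   # offline testing: accuracy threshold

    def curate(self, logged, seed=0):
        """Sample uniformly at random, up to the labeling budget."""
        k = min(self.max_labeled_points, len(logged))
        return random.Random(seed).sample(logged, k)

    def should_retrain(self, last_trained, now):
        """Periodic trigger: fire once the retraining interval has elapsed."""
        return now - last_trained >= timedelta(days=self.retrain_every_days)

    def form_dataset(self, labeled, now):
        """Keep only points whose timestamp falls inside the rolling window."""
        cutoff = now - timedelta(days=self.window_days)
        return [p for p in labeled if p["timestamp"] >= cutoff]

    def passes_offline_test(self, test_accuracy):
        """Offline gate: compare test set accuracy against the threshold."""
        return test_accuracy >= self.min_test_accuracy


strategy = PeriodicRetrainingStrategy(max_labeled_points=3)
now = datetime(2022, 9, 12)
# Six logged points, spaced ten days apart going back in time.
logged = [{"id": i, "timestamp": now - timedelta(days=10 * i)} for i in range(6)]
curated = strategy.curate(logged)
dataset = strategy.form_dataset(logged, now)
```

Keeping every rule in one place makes the strategy itself something you can review and tune, which is the point of treating retraining as a strategy rather than a one-off job.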
193
+ ![](./media/image17.png)
194
+
195
+
196
+ Periodic retraining won't work in every circumstance. There are several
197
+ failure modes:
198
+
199
+ 1. The first category is that you have more data than you can log or
200
+ label. If you have a **high volume** of data, you might need to be
201
+ more careful about what data to sample and enrich, particularly if
202
+ that data comes from **a long-tail distribution** - where you have
203
+ edge cases that your model needs to perform well on, but those
204
+ edge cases might not be caught by just doing standard uniform
205
+ sampling. Or if that data is expensive to label like in a
206
+ **human-in-the-loop** scenario - where you need custom labeling
207
+ rules or labeling is a part of the product. In either of those
208
+ cases, you need to be more careful about what subset of your data
209
+ you log and enrich to be used down the road.
210
+
211
+ 2. The second category has to do with **managing the cost of
212
+ retraining**. If your model is expensive to retrain, retraining it
213
+ periodically is not going to be the most cost-efficient way to go,
214
+ especially if you do it on a rolling window of data every single
215
+ time. You will leave a lot of performance on the table by not
216
+ retraining more frequently. You can partially solve this by
217
+ increasing the retraining frequency, but this will increase the
218
+ costs even further.
219
+
220
+ 3. The final failure mode is situations where you have **a high cost of
221
+ bad predictions**. Every time you retrain your model, it
222
+ introduces risk, which comes from the fact that the data you're
223
+ training the model on might be bad in some way. It might be
224
+ corrupted, might have been attacked by an adversary, or might not
225
+ be representative anymore of all the cases that your model needs
226
+ to perform well on. The more frequently you retrain and the more
227
+ sensitive you are to model failures, the more thoughtful you need
228
+ to be about careful model evaluation such that you are not unduly
229
+ taking on too much risk from frequent retraining.
230
+
231
+ ## 4 - Iterating On Your Retraining Strategy
232
+
233
+ The main takeaway from this section is that **we will use monitoring and
234
+ observability to determine what changes we want to make to our
235
+ retraining strategy**.
236
+
237
+ 1. We'll do that by monitoring just the metrics that actually
238
+ matter and using all other metrics for debugging.
239
+
240
+ 2. When we debug an issue with our model, that will lead to potentially
241
+ retraining our model. But more broadly than that, we can think of
242
+ it as a change to the retraining strategy - changing our
243
+ retraining triggers, our offline tests, our sampling strategies,
244
+ the metrics for observability, etc.
245
+
246
+ 3. As we get more confident in our monitoring, we can introduce more
247
+ automation to our system.
248
+
249
+ There are no real standards or best practices on model monitoring yet.
250
+ The main principles we'll follow are: (1) We'll focus on monitoring what
251
+ matters and what breaks empirically; and (2) We'll compute other signals
252
+ too but use them for observability and debugging.
253
+
254
+ ![](./media/image13.png)
255
+
256
+
257
+ What does it mean to monitor a model in production? We think about it
258
+ as: You have some metric to assess the model quality (e.g., accuracy) and
259
+ a time series of how that metric changes over time. The question you try
260
+ to answer is: **Is this bad or okay?** Do you need to pay attention to
261
+ this degradation or not?
262
+
263
+ The questions we'll need to answer are:
264
+
265
+ 1. What metrics should we be looking at when we are monitoring?
266
+
267
+ 2. How can we tell if those metrics are bad and warrant an
268
+ intervention?
269
+
270
+ 3. What are the tools that help us with this process?
271
+
272
+ ### What Metrics to Monitor
273
+
274
+ Choosing the right metric to monitor is probably the most important part
275
+ of this process. Below you can find different types of metrics ranked in
276
+ order of how valuable they are.
277
+
278
+ ![](./media/image11.png)
279
+
280
+
281
+ #### Outcomes and Feedback From Users
282
+
283
+ The most valuable one to look at is **outcome data or feedback from your
284
+ users**. Unfortunately, there are no one-size-fits-all ways to do this
285
+ because it depends a lot on the specifics of the product you are
286
+ building. This is more of a product management question of how to design
287
+ your product in a way that you can capture feedback from your users as
288
+ part of the product experience.
289
+
290
+ #### Model Performance Metrics
291
+
292
+ The next most valuable signal to look at is **model performance
293
+ metrics**. These are offline metrics such as accuracy. This is less
294
+ useful than user feedback because of loss mismatch. A common experience
295
+ many ML practitioners have is that improving model performance leads to
296
+ the same or worse outcome. Still, there's very little excuse for not
297
+ tracking them: you can label at least some production data each day by
298
+ setting up an on-call rotation or throwing a labeling party. These
299
+ practices will give you some sense of how your model performance trends
300
+ over time.
301
+
302
+ ![](./media/image10.png)
303
+
304
+
305
+ #### Proxy Metrics
306
+
307
+ The next best thing to look at is **proxy metrics**, which are
308
+ correlated with bad model performance. These are mostly domain-specific.
309
+ For example, if you are building text generation with a language model,
310
+ two examples would be repetitive and toxic outputs. If you are building
311
+ a recommendation system, an example would be the share of personalized
312
+ responses. **Edge cases** can be good proxy metrics. If there are
313
+ certain problems you know that you have with your model, if those
314
+ increase in prevalence, that might mean your model is not doing very
315
+ well.
316
+
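As one possible proxy metric for the text generation example, the sketch below measures how repetitive each output is and reports the share of outputs above a threshold. The function names, the trigram choice, and the 0.3 threshold are assumptions made for illustration.

```python
def repeated_ngram_fraction(text, n=3):
    """Fraction of n-grams in `text` that repeat an earlier n-gram."""
    tokens = text.lower().split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)


def share_flagged(outputs, threshold=0.3, n=3):
    """Proxy metric: share of outputs whose repetition exceeds `threshold`."""
    if not outputs:
        return 0.0
    flagged = sum(repeated_ngram_fraction(o, n) > threshold for o in outputs)
    return flagged / len(outputs)


outputs = [
    "the cat sat on the mat",
    "i am fine i am fine i am fine i am fine",
]
rate = share_flagged(outputs)
```

Tracked over time, a rising `rate` needs no labels to compute, which is what makes this kind of signal cheap compared with model performance metrics.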
317
+ There's an academic direction that aims at being able to take any metric
318
+ you care about and approximate it on previously unseen data (i.e., how well
319
+ do we think our model is doing on this new data?). This would make these
320
+ proxy metrics a lot more practically useful. There are a number of
321
+ different approaches here: from training an auxiliary model to predict
322
+ how well your main model might do on this offline data, to using
323
+ heuristics and human-in-the-loop methods.
324
+
325
+ ![](./media/image20.png)
326
+
327
+
328
+ An unfortunate result from this literature is that it's not possible to
329
+ have a single method you use in all circumstances to approximate how
330
+ your model is doing on out-of-distribution data. Let's say you are
331
+ looking at the input data to predict how the model will perform on those
332
+ input points. Then the label distribution changes. As a result, you
333
+ won't be able to take into account that change in your approximate
334
+ metric.
335
+
336
+ #### Data Quality
337
+
338
+ The next signal to look at is **data quality.** [Data quality
339
+ testing](https://lakefs.io/data-quality-testing/) is a set
340
+ of rules you apply to measure the quality of your data. This deals with
341
+ questions such as: How well does a piece of information reflect reality?
342
+ Does it fulfill your expectations of what's comprehensive? Is your
343
+ information available when you need it? Some common examples include
344
+ checking whether the data has the right schema, the data is in the
345
+ expected range, and the number of records is not anomalous.
346
+
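The three common checks named above (schema, expected range, record count) can be sketched in a few lines. The schema, ranges, and row-count bounds below are hypothetical placeholders. In practice you would derive them from your own data or use an off-the-shelf data quality tool.

```python
EXPECTED_SCHEMA = {"user_id": int, "age": int, "country": str}  # hypothetical
EXPECTED_RANGES = {"age": (0, 120)}                             # hypothetical
EXPECTED_ROW_COUNT = (2, 1000)                                  # hypothetical


def check_batch(rows):
    """Return a list of human-readable data quality violations for a batch."""
    problems = []
    low, high = EXPECTED_ROW_COUNT
    if not low <= len(rows) <= high:
        problems.append(f"row count {len(rows)} outside [{low}, {high}]")
    for i, row in enumerate(rows):
        # Schema check: every expected field is present with the right type.
        for field, expected_type in EXPECTED_SCHEMA.items():
            if field not in row:
                problems.append(f"row {i}: missing field '{field}'")
            elif not isinstance(row[field], expected_type):
                problems.append(f"row {i}: '{field}' is not {expected_type.__name__}")
        # Range check: numeric values fall inside the expected interval.
        for field, (lo, hi) in EXPECTED_RANGES.items():
            value = row.get(field)
            if isinstance(value, (int, float)) and not lo <= value <= hi:
                problems.append(f"row {i}: '{field}'={value} outside [{lo}, {hi}]")
    return problems


good = [
    {"user_id": 1, "age": 34, "country": "US"},
    {"user_id": 2, "age": 51, "country": "DE"},
]
bad = [
    {"user_id": 3, "age": 240, "country": "FR"},
    {"user_id": "4", "age": 20},
]
```

Running `check_batch` before each retraining job is one cheap way to stop corrupted data from reaching the model.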
347
+ ![](./media/image19.png)
348
+
349
+ This is useful because data problems tend to be the most common issue
350
+ with ML models in practice. In [a Google
351
+ report](https://www.usenix.org/conference/opml20/presentation/papasian)
352
+ which covered 15 years of different pipeline outages with a particular
353
+ ML model, most of the outages that happened with that model were
354
+ distributed systems problems, commonly data problems.
355
+
356
+ #### Distribution Drift
357
+
358
+ ##### Why Measure Distribution Drift?
359
+
360
+ Your model's performance is only guaranteed on **data sampled from the
361
+ same distribution** as it was trained on. This can have a huge impact in
362
+ practice. Recent examples include changes in model behavior during the
363
+ pandemic and a bug in a retraining pipeline that caused recommendations
364
+ not to be updated for new users, leading to millions of dollars in
365
+ lost revenue.
366
+
367
+ ##### Types of Distribution Drift
368
+
369
+ Distribution drift manifests itself in different ways in the wild:
370
+
371
+ 1. **Instantaneous drift** happens when a model is deployed in a new
372
+ domain, a bug is introduced in the pre-processing pipeline, or a
373
+ big external shift like COVID occurs.
374
+
375
+ 2. **Gradual drift** happens when users' preferences change or new
376
+ concepts get introduced to the corpus over time.
377
+
378
+ 3. **Periodic drift** happens when users' preferences are seasonal or
379
+ people in different time zones use your model differently.
380
+
381
+ 4. **Temporary drift** happens when a malicious user attacks your
382
+ model, a new user tries your product and churns, or someone uses
383
+ your product in an unintended way.
384
+
385
+ ##### How to Measure It?
386
+
387
+ How can you tell if your distribution has drifted?
388
+
389
+ 1. You first **select a window of "good" data to serve as a
390
+ reference**. To select that reference, you can use a fixed window
391
+ of production data you believe to be healthy. [Some
392
+ papers](https://arxiv.org/abs/1908.04240) advocate
393
+ for using a sliding window of production data. In practice, most
394
+ of the time you probably should use your validation data as the
395
+ reference.
396
+
397
+ 2. Once you have that reference data, you **select a new window of
398
+ production data to measure your distribution distance on**. This
399
+ is not a super principled approach and tends to be
400
+ problem-dependent. A pragmatic solution is to pick one or several
401
+ window sizes with a reasonable amount of data and slide them.
402
+
403
+ 3. Finally, once you have your reference window and production window,
404
+ you **compare the windows using a distribution distance metric**.
405
+
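The three steps above can be sketched for a single categorical feature. This is a minimal illustration under stated assumptions: the helper names are hypothetical, the reference window is a fixed list standing in for validation data, and the distance is the largest difference in category probability.

```python
from collections import Counter


def category_probs(values, categories):
    """Empirical probability of each category in `values`."""
    counts = Counter(values)
    return [counts[c] / len(values) for c in categories]


def max_prob_diff(reference, window):
    """Largest absolute difference in category probability (infinity norm)."""
    categories = sorted(set(reference) | set(window))
    p = category_probs(reference, categories)
    q = category_probs(window, categories)
    return max(abs(a - b) for a, b in zip(p, q))


def sliding_drift(reference, production, window_size, step, distance=max_prob_diff):
    """Slide a window over production data and score each one against the reference."""
    scores = []
    for start in range(0, len(production) - window_size + 1, step):
        window = production[start:start + window_size]
        scores.append((start, distance(reference, window)))
    return scores


reference = ["cat", "dog"] * 50                   # stand-in for validation data
production = ["cat", "dog"] * 50 + ["dog"] * 100  # drifts to all "dog" halfway through
scores = sliding_drift(reference, production, window_size=100, step=50)
```

Here the score climbs from 0.0 to 0.25 to 0.5 as the window slides into the drifted region. The window size and step remain problem-dependent choices, as noted above.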
406
+ ##### What Metrics To Use?
407
+
408
+ Let's start by considering the one-dimensional case, where you have a
409
+ particular feature that is one-dimensional and can compute a density of
410
+ that feature on your reference/production windows. You want some metric
411
+ that approximates the distance between these two distributions.
412
+
413
+ ![](./media/image9.png)
414
+
415
+
416
+ There are a few options here:
417
+
418
+ 1. The commonly recommended ones are the KL divergence and the KS test.
419
+ But they are actually bad choices.
420
+
421
+ 2. Sometimes-better options would be (1) infinity norm or 1-norm of the
422
+ difference between the probabilities of each category, and (2)
423
+ Earth-mover's distance (a more statistically principled approach).
424
+
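To make the sometimes-better options concrete, here is a sketch of the infinity norm, the 1-norm, and a 1-D Earth-mover's distance over binned data. The function names and bin edges are assumptions for illustration. The Earth-mover's computation uses the standard 1-D identity that the distance equals the area between the two cumulative distributions.

```python
def histogram_probs(values, edges):
    """Bin 1-D values into len(edges) - 1 bins and normalize to probabilities.

    Assumes at least one value falls inside [edges[0], edges[-1]].
    """
    counts = [0] * (len(edges) - 1)
    for v in values:
        for b in range(len(counts)):
            is_last = b == len(counts) - 1
            if edges[b] <= v < edges[b + 1] or (is_last and v == edges[-1]):
                counts[b] += 1
                break
    total = sum(counts)
    return [c / total for c in counts]


def l_inf(p, q):
    """Infinity norm of the difference between bin probabilities."""
    return max(abs(a - b) for a, b in zip(p, q))


def l_1(p, q):
    """1-norm of the difference between bin probabilities."""
    return sum(abs(a - b) for a, b in zip(p, q))


def earth_movers(p, q, bin_width=1.0):
    """1-D Earth-mover's distance: area between the two cumulative distributions."""
    cdf_gap, total = 0.0, 0.0
    for a, b in zip(p, q):
        cdf_gap += a - b
        total += abs(cdf_gap) * bin_width
    return total


edges = [0, 1, 2, 3, 4]
reference = histogram_probs([0.5] * 10, edges)  # all mass in the first bin
near = histogram_probs([1.5] * 10, edges)       # mass moved one bin over
far = histogram_probs([3.5] * 10, edges)        # mass moved three bins over
```

In this toy case the 1-norm and infinity norm score `near` and `far` identically, while the Earth-mover's distance is three times larger for `far`, because it accounts for how far the probability mass moved.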
425
+ Check out [this Gantry blog
426
+ post](https://gantry.io/blog/youre-probably-monitoring-your-models-wrong/)
427
+ to learn more about why the commonly recommended metrics are not so good
428
+ and the other ones are better.
429
+
430
+ ##### Dealing with High-Dimensional Data
431
+
432
+ In the real world for most models, we have potentially many input
433
+ features or even unstructured data that is very high-dimensional. How do
434
+ we deal with detecting distribution drift in those cases?
435
+
436
+ 1. You can measure **drift on all of the features independently**: If
437
+ you have a lot of features, you will hit [the multiple hypothesis
438
+ testing
439
+ problem](https://multithreaded.stitchfix.com/blog/2015/10/15/multiple-hypothesis-testing/).
440
+ Furthermore, this doesn't capture cross-correlation.
441
+
442
+ 2. You can measure **drift on only the important features**: Generally
443
+ speaking, it's a lot more useful to measure drift on the outputs
444
+ of the model than the inputs. You can also [rank the importance
445
+ of your input
446
+ features](https://christophm.github.io/interpretable-ml-book/feature-importance.html)
447
+ and measure drift on the most important ones.
448
+
449
+ 3. You can look at **metrics that natively compute or approximate the
450
+ distribution distance between high-dimensional distributions**:
451
+ The two that are worth checking out are [maximum mean
452
+ discrepancy](https://jmlr.csail.mit.edu/papers/v13/gretton12a.html)
453
+ and [approximate Earth-mover's
454
+ distance](https://arxiv.org/abs/1904.05877). The
455
+ caveat here is that they are pretty hard to interpret.
456
+
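A quick calculation shows why testing every feature independently runs into the multiple hypothesis testing problem. The sketch assumes the per-feature tests are independent, which real features rarely are, and it uses the Bonferroni correction as one common remedy. The correction is our addition for illustration and is not something the lecture prescribes.

```python
def prob_any_false_alarm(num_features, alpha=0.05):
    """Chance of at least one false alarm across independent per-feature tests."""
    return 1.0 - (1.0 - alpha) ** num_features


def bonferroni_alpha(num_features, alpha=0.05):
    """Per-feature significance level that keeps the overall rate at or below alpha."""
    return alpha / num_features


uncorrected = prob_any_false_alarm(100)
corrected = prob_any_false_alarm(100, alpha=bonferroni_alpha(100))
```

With 100 features tested at the 0.05 level, some false alarm is almost certain even when nothing has drifted. Tightening the per-feature level brings the overall rate back under 0.05, at the cost of missing smaller real shifts.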
457
+ ![](./media/image14.png)
458
+
459
+ A more principled way to measure distribution drift for high-dimensional
460
+ inputs to the model is to use **projections**. The idea of a projection
461
+ is that:
462
+
463
+ 1. You first take some high-dimensional input to the model and run that
464
+ through a function.
465
+
466
+ 2. Each data point your model makes a prediction on gets tagged by this
467
+ projection function. The goal of this projection function is to
468
+ reduce the dimensionality of that input.
469
+
470
+ 3. Once you've reduced the dimensionality, you can do drift detection
471
+ on that lower-dimensional representation of the high-dimensional
472
+ data.
473
+
474
+ This approach works for any kind of data, no matter what the
475
+ dimensionality is or what the data type is. It's also highly flexible.
476
+ There are different types of projections that can be useful:
477
+ **analytical projections** (e.g., mean pixel value, length of sentence,
478
+ or any other function), **random projections** (e.g., linear), and
479
+ **statistical projections** (e.g., autoencoder or other density models,
480
+ T-SNE).
481
+
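A minimal sketch of the first two projection types follows. The function names are hypothetical, and the drift score (difference in projected means) is a deliberately crude stand-in. In practice you would apply a proper distribution distance to the projected values.

```python
import random


def sentence_length(text):
    """Analytical projection: map a sentence to its number of whitespace tokens."""
    return float(len(text.split()))


def make_random_projection(dim, seed=0):
    """Random linear projection: dot product with a fixed Gaussian direction."""
    rng = random.Random(seed)
    direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    return lambda x: sum(w * v for w, v in zip(direction, x))


def mean_shift(reference, production, projection):
    """Toy drift score: difference in the mean of the projected values."""
    ref = [projection(x) for x in reference]
    prod = [projection(x) for x in production]
    return abs(sum(ref) / len(ref) - sum(prod) / len(prod))


reference = ["how are you", "what time is it"]
production = ["please summarize the following long document for me right now", "hi"]
length_drift = mean_shift(reference, production, sentence_length)

project = make_random_projection(dim=3)
embedding_drift = mean_shift([[0.0, 0.0, 0.0]], [[1.0, 2.0, 3.0]], project)
```

Because a projection is just a function from an input to a number, the same drift-detection code works for text, images, or embeddings once a suitable projection is chosen.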
482
+ ##### Cons of Looking at Distribution Drift
483
+
484
+ ![](./media/image18.png)
485
+
486
+ **Models are designed to be robust to some degree of distribution
487
+ drift**. The figure on the left above shows a toy example to demonstrate
488
+ this point. We have a classifier that's trained to predict two classes.
489
+ We've induced a synthetic distribution shift to shift the red points on
490
+ the top left to bottom. These two distributions are extremely different,
491
+ but the model performs equally well on the training data and the
492
+ production data. In other words, knowing the distribution shift doesn't
493
+ tell you how the model has reacted to that shift.
494
+
495
+ The figure on the right is a research project that used data generated
496
+ from a physics simulator to solve problems on real-world robots. The
497
+ training data was highly out of distribution (low-fidelity, random
498
+ images). However, by training on this set of training data, the model
499
+ was able to generalize to real-world scenarios on the test data.
500
+
501
+ Beyond the theoretical limitations of measuring distribution drift, this
502
+ is just hard to do in practice. You have to choose the window size correctly. You
503
+ have to keep all this data around. You need to choose metrics. You need
504
+ to define projections to make your data lower-dimensional.
505
+
506
+ #### System Metrics
507
+
508
+ The last thing to consider looking at is your standard **system
509
+ metrics** such as CPU utilization, GPU memory usage, etc. These don't
510
+ tell you anything about how your model is actually performing, but they
511
+ can tell you when something is going wrong.
512
+
513
+ #### Practical Recommendations
514
+
515
+ We also want to look at how hard it is to compute the aforementioned
516
+ signals in practice. As seen below, the Y-axis shows the **value** of
517
+ each signal and the X-axis shows the **feasibility** of measuring each
518
+ signal.
519
+
520
+ 1. Measuring outcomes or feedback has pretty wide variability in terms
521
+ of how feasible it is to do, as it depends on how your product is
522
+ set up.
523
+
524
+ 2. Measuring model performance tends to be the least feasible thing to
525
+ do because it involves collecting some labels.
526
+
527
+ 3. Proxy metrics are easier to compute because they don't involve
528
+ labels.
529
+
530
+ 4. System metrics and data quality metrics are highly feasible because
531
+ you have off-the-shelf tools for them.
532
+
533
+ ![](./media/image15.png)
534
+
535
+
536
+ Here are our practical recommendations:
537
+
538
+ 1. Basic data quality checks are zero-regret, especially if you are
539
+ retraining your model.
540
+
541
+ 2. Get some way to measure feedback, model performance, or proxy
542
+ metrics, even if it's hacky or not scalable.
543
+
544
+ 3. If your model produces low-dimensional outputs, monitoring those for
545
+ distribution shifts is also a good idea.
546
+
547
+ 4. As you evolve your system, practice the **observability** mindset.
548
+
549
+ While you can think of monitoring as measuring the known unknowns (e.g.,
550
+ setting alerts on a few key metrics), [observability is measuring
551
+ unknown
552
+ unknowns](https://www.honeycomb.io/blog/observability-a-manifesto/)
553
+ (e.g., having the power to ask arbitrary questions about your system
554
+ when it breaks). An observability mindset has two implications:
555
+
556
+ 1. You should keep around the context or raw data that makes up the
557
+ metrics that you are computing since you want to be able to drill
558
+ all the way down to potentially the data points themselves that
559
+ make up the degraded metric.
560
+
561
+ 2. You can go crazy with measurement by defining a lot of different
562
+ metrics. You shouldn't necessarily set alerts on each of those
563
+ since you don't want too many alerts. Drift is a great example
564
+ since it is useful for debugging but less so for monitoring.
565
+
566
+ Finally, it's important to **go beyond aggregate metrics**. If your
567
+ model is 99% accurate in aggregate but only 50% accurate for your most
568
+ important user, is it still "good"? The way to deal with this is by
569
+ flagging important subgroups or cohorts of data and alerting on
570
+ important metrics across them. Some examples are categories you don't
571
+ want to be biased against, "important" categories of users, and
572
+ categories you might expect to perform differently on (languages,
573
+ regions, etc.).
574
+
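The aggregate-versus-cohort gap described above is easy to reproduce. In this sketch the record layout (`label`, `prediction`, and a cohort field such as `language`) and the 0.9 alert threshold are assumptions for illustration.

```python
from collections import defaultdict


def accuracy_by_cohort(records, cohort_key):
    """Accuracy per cohort; each record has 'label', 'prediction', and cohort fields."""
    hits, totals = defaultdict(int), defaultdict(int)
    for r in records:
        cohort = r[cohort_key]
        totals[cohort] += 1
        hits[cohort] += int(r["label"] == r["prediction"])
    return {c: hits[c] / totals[c] for c in totals}


def cohorts_to_alert(records, cohort_key, min_accuracy):
    """Names of cohorts whose accuracy falls below `min_accuracy`."""
    per_cohort = accuracy_by_cohort(records, cohort_key)
    return sorted(c for c, acc in per_cohort.items() if acc < min_accuracy)


records = (
    [{"language": "en", "label": 1, "prediction": 1}] * 98
    + [{"language": "fr", "label": 1, "prediction": 1}] * 1
    + [{"language": "fr", "label": 1, "prediction": 0}] * 1
)
overall = sum(r["label"] == r["prediction"] for r in records) / len(records)
per_language = accuracy_by_cohort(records, "language")
alerts = cohorts_to_alert(records, "language", min_accuracy=0.9)
```

The aggregate accuracy here is 0.99, yet the `fr` cohort sits at 0.5 and is the only one flagged. An alert on the aggregate alone would never fire.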
575
+ ### How To Tell If Those Metrics are "Bad"
576
+
577
+ We don't recommend statistical tests (e.g., the KS test) because they
578
+ return a p-value for the hypothesis that the two data distributions are
579
+ the same. When you have a lot of data, you will get very small p-values
580
+ even for small shifts. This is not what we actually care about since models
581
+ are robust to a small amount of distribution shift.
582
+
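The sample-size effect can be seen with a hand-rolled two-sample KS test. This is a sketch under stated assumptions: the data are evenly spaced values rather than real samples, the shift of 0.02 is arbitrary, and the p-value uses the standard asymptotic Kolmogorov approximation rather than an exact computation.

```python
import math


def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    gap = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        gap = max(gap, abs(i / len(a) - j / len(b)))
    return gap


def ks_p_value(a, b, terms=100):
    """Asymptotic p-value from the Kolmogorov distribution (an approximation)."""
    n, m = len(a), len(b)
    lam = math.sqrt(n * m / (n + m)) * ks_statistic(a, b)
    if lam < 0.3:
        # The series converges slowly here, and the p-value is ~1 anyway.
        return 1.0
    series = sum(
        (-1) ** (k - 1) * math.exp(-2 * k * k * lam * lam)
        for k in range(1, terms + 1)
    )
    return min(1.0, max(0.0, 2 * series))


def shifted_pair(n, shift=0.02):
    """Evenly spaced reference values on [0, 1] and a copy shifted by `shift`."""
    reference = [(i + 0.5) / n for i in range(n)]
    return reference, [x + shift for x in reference]


p_small = ks_p_value(*shifted_pair(100))
p_large = ks_p_value(*shifted_pair(100_000))
```

The shift is identical in both cases, but the p-value goes from roughly 1 with 100 points to vanishingly small with 100,000 points. The test is reporting on sample size as much as on drift.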
583
+ Better options than statistical tests include fixed rules, specific
584
+ ranges, predicted ranges, and unsupervised detection of new patterns.
585
+ [This article on dynamic data
586
+ testing](https://blog.anomalo.com/dynamic-data-testing-f831435dba90?gi=fb4db0e2ecb4)
587
+ has the details.
588
+
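Two of these options can be sketched directly. The function names and thresholds are hypothetical. The "predicted range" here is a simple mean plus or minus a few standard deviations of recent history, which is a stand-in for the more capable forecasting approaches the linked article describes.

```python
from statistics import mean, stdev


def violates_fixed_rule(value, minimum):
    """Fixed rule: alert whenever the metric drops below a hard floor."""
    return value < minimum


def violates_predicted_range(history, value, num_std=3.0):
    """Predicted range: alert when the value leaves mean +/- num_std * stdev."""
    center, spread = mean(history), stdev(history)
    return abs(value - center) > num_std * spread


daily_accuracy = [0.91, 0.92, 0.90, 0.91, 0.92, 0.90, 0.91]
today = 0.85
fixed_alert = violates_fixed_rule(today, minimum=0.80)
range_alert = violates_predicted_range(daily_accuracy, today)
```

In this example the fixed floor of 0.80 stays silent, while the range learned from recent history flags the drop to 0.85. The two rules answer different questions and can be used together.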
589
+ ![](./media/image16.png)
590
+
591
+ ### Tools for Monitoring
592
+
593
+ The first category is **system monitoring** tools, a mature category
594
+ with many companies in it
595
+ ([Datadog](https://www.datadoghq.com/),
596
+ [Honeycomb](https://www.honeycomb.io/), [New
597
+ Relic](https://newrelic.com/), [Amazon
598
+ CloudWatch](https://aws.amazon.com/cloudwatch/), etc.).
599
+ They help you detect problems with any software system, not just ML
600
+ models. They provide functionality for setting alarms when things go
601
+ wrong. Most cloud providers have decent monitoring solutions, but if you
602
+ want something better, you can look at monitoring-specific tools to
603
+ monitor anything.
604
+
605
+ This raises the question of whether we should just use these system
606
+ monitoring tools to monitor ML metrics as well. [This blog
607
+ post](https://www.shreya-shankar.com/rethinking-ml-monitoring-3/)
608
+ explains that it's feasible but highly painful due to many technical
609
+ reasons. Thus, it's better to use ML-specific tools.
610
+
611
+ Two popular open-source monitoring tools are
612
+ [EvidentlyAI](https://github.com/evidentlyai) and
613
+ [whylogs](https://github.com/whylabs/whylogs).
614
+
615
+ - Both are similar in that you provide them with samples of data and
616
+ they produce a nice report that tells you where the distribution
617
+ shifts are.
618
+
619
+ - The big limitation of both is that they don't solve the data
620
+ infrastructure and the scale problem. You still need to be able to
621
+ get all that data into a place where you can analyze it with these
622
+ tools.
623
+
624
+ - The main difference between them is that whylogs is more focused on
625
+ gathering data from the edge by aggregating the data into
626
+ statistical profiles at inference time. You don't need to
627
+ transport all the data from your inference devices back to your
628
+ cloud.
629
+
630
+ ![](./media/image4.png)
631
+
632
+ Lastly, there are a bunch of different SaaS vendors for ML monitoring
633
+ and observability: [Gantry](https://gantry.io/),
634
+ [Aporia](https://www.aporia.com/),
635
+ [Superwise](https://superwise.ai/),
636
+ [Arize](https://arize.com/),
637
+ [Fiddler](https://www.fiddler.ai/),
638
+ [Arthur](https://arthur.ai/), etc.
639
+
640
+
641
+ ## 5 - Retraining Strategy
642
+
643
+ We’ve talked about monitoring and observability, which allow you to identify issues with your continual learning system. Now, we’ll talk about how we will fix the various stages of the continual learning process based on what we learn from monitoring and observability.
644
+
645
+
646
+ ### Logging
647
+
648
+ The first stage of the continual learning loop is **logging**. As a reminder, the goal of logging is to get data from your model to a place where you can analyze it. The key question to answer here is: “**what data should I actually log?**”
649
+
650
+ For most of us, the best answer is just to log all of the data. Storage is cheap, and it's better to have data than not to have it. There are, however, some situations where you can't do that. If you have so much traffic going through your model that logging all of it is too expensive, if you have data privacy concerns, or if you're running your model at the edge, you simply may not be able to log all your data.
651
+
652
+ In these situations, there are two approaches that you can take. The first approach is **profiling**. With profiling, rather than sending all the data back to your cloud and then using that to monitor, you instead compute **statistical profiles** of your data on the edge that describe the data distribution that you're seeing. This is great from a data security perspective because it doesn't require you to send all the data back home. It also minimizes your storage cost. Finally, because every data point contributes to the profile, you don't miss things that happen in the tails, which is an issue for the next approach. Profiling is best suited to security-critical applications. Computing statistical profiles is a pretty interesting topic in computer science and data summarization that is worth checking out if you’re interested in this approach.
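To make profiling concrete, here is a minimal sketch of a running profile for one numeric feature. It uses Welford's online algorithm for the mean and variance; the class name and the choice of statistics are assumptions for illustration, and real profiling libraries track richer summaries.

```python
import math
from dataclasses import dataclass


@dataclass
class FeatureProfile:
    """Running summary of one numeric feature; stores no raw values."""

    count: int = 0
    mean: float = 0.0
    m2: float = 0.0              # sum of squared deviations from the mean
    minimum: float = math.inf
    maximum: float = -math.inf

    def update(self, x: float) -> None:
        # Welford's online algorithm: one pass, constant memory.
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)
        # Min and max retain some information about the tails.
        self.minimum = min(self.minimum, x)
        self.maximum = max(self.maximum, x)

    @property
    def variance(self) -> float:
        """Population variance of everything seen so far."""
        return self.m2 / self.count if self.count else 0.0
```

Only the handful of numbers in the profile would need to leave the device, which is what keeps both the privacy exposure and the storage cost low.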
653
+
654
+ ![alt_text](./media/image22.png "image_tooltip")
655
+
656
+
657
+ The other approach is **sampling**. With sampling, you'll just take certain data points and send those back to your monitoring and logging system. The advantage of sampling is that it has minimal impact on your inference resources. You don't have to actually spend the computational budget to compute profiles. You also get to have access to the raw data for debugging and retraining, albeit a smaller amount. This is the approach we recommend for any other kind of application.
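One simple way to sample from a stream of unknown length is reservoir sampling, sketched below. The lecture does not name a specific sampling algorithm, so treat this as one assumed option; the function name and `seed` parameter are illustrative.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, seed: int = 0) -> List[T]:
    """Keep a uniform random sample of up to k items from a stream."""
    rng = random.Random(seed)
    sample: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)       # fill the reservoir first
        else:
            j = rng.randint(0, i)     # inclusive on both ends
            if j < k:
                sample[j] = item      # replace with probability k / (i + 1)
    return sample
```

The per-request cost is a single random draw, which fits the point above about sampling having minimal impact on inference resources.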
658
+
659
+
660
+ ### Curation
661
+
662
+ The next step in the continual learning loop is **curation**. The goal of curation is to take the infinite stream of production data, which is potentially unlabeled, and turn it into a finite reservoir of enriched data suitable for training. Here, we must answer, “**what data should be enriched?**”
663
+
664
+ You could **sample and enrich data randomly**, but that may not prove helpful to your model. In particular, you may miss rare classes or events. A better approach can be to perform **stratified subsampling**, wherein you sample specific proportions of individuals from various subpopulations (e.g. race). The most advanced strategy for picking data to enrich is to **curate data points** that are somehow interesting for the purpose of improving your model.
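Stratified subsampling can be sketched as follows, assuming you can assign each record to a group and choose a quota per group. The `group_of` function and the `quotas` mapping are hypothetical names introduced for this example.

```python
import random
from collections import defaultdict
from typing import Callable, Dict, Hashable, List


def stratified_sample(
    records: List[dict],
    group_of: Callable[[dict], Hashable],
    quotas: Dict[Hashable, int],
    seed: int = 0,
) -> List[dict]:
    """Sample up to quotas[group] records from each group."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for record in records:
        groups[group_of(record)].append(record)
    chosen: List[dict] = []
    for group, quota in quotas.items():
        members = groups.get(group, [])
        # Take the whole group when it is smaller than its quota.
        chosen.extend(rng.sample(members, min(quota, len(members))))
    return chosen
```

With a quota set for a rare class, that class is represented in the enriched data even when uniform random sampling would usually skip it.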
665
+
666
+ There are a few different ways of doing this: **user-driven curation loops** via feedback loops, **manual curation** via error analysis, and **automatic curation** via active learning.
667
+
668
+ User-driven curation is a great approach that is easy to implement, assuming you have a clear way of gathering user feedback. If your user churns, clicks thumbs down, or performs some other similar activity on the model’s output, you have an easy way of understanding data that could be enriched for future training jobs.
669
+
670
+ ![alt_text](./media/image23.png "image_tooltip")
671
+
672
+ If you don't have user feedback, or if you need even more ways of gathering interesting data from your system, the second most effective way is by doing **manual error analysis**. In this approach, we look at the errors that our model is making, reason about the different types of failure modes that we're seeing, and try to write functions or rules that help capture these error modes. We'll use those functions to gather more data that might represent those error cases. Some examples of these function-based approaches are **similarity-based curation**, which uses nearest neighbors, and **projection-based curation**, wherein we train a new function or model to recognize key data points.
673
+
674
+ The last way to curate data is to do so automatically using a class of algorithms called **[active learning](https://lilianweng.github.io/posts/2022-02-20-active-learning/)**. Given a large amount of unlabeled data, active learning tries to determine which data points would improve model performance the most if you were to label them next and train on them. These algorithms define **a sampling strategy** through **a scoring function**: they score all of your unlabeled examples, rank them, and mark the data points with the highest scores for future labeling.
675
+
676
+ There are a number of different scoring function approaches that are shown below.
677
+
678
+
679
+
680
+ 1. **Most uncertain**: sample low-confidence and high-entropy predictions or predictions that an ensemble disagrees on.
681
+ 2. **Highest predicted loss**: train a separate model that predicts loss on unlabeled points, then sample the highest predicted loss.
682
+ 3. **Most different from labels**: train a model to distinguish labeled and unlabeled data, then sample the easiest to distinguish.
683
+ 4. **Most representative**: choose points such that no data is too far away from anything we sampled.
684
+ 5. **Big impact on training**: choose points such that the expected gradient is large or points where the model changes its mind the most about its prediction during training.
685
+
686
+ Uncertainty scoring tends to be the most commonly used method since it is simple and easy to implement.
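A minimal sketch of uncertainty scoring, assuming the model exposes a probability distribution over classes for each unlabeled example. Here entropy is the scoring function; the function names and the `budget` parameter are illustrative.

```python
import math
from typing import Dict, List, Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def most_uncertain(predictions: Dict[str, Sequence[float]], budget: int) -> List[str]:
    """Return the ids of the `budget` examples with the highest entropy."""
    ranked = sorted(predictions, key=lambda k: entropy(predictions[k]), reverse=True)
    return ranked[:budget]
```

The returned ids are the examples you would send for labeling first.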
687
+
688
+ You might have noticed that there's a lot of similarity between some of the ways that we do data curation and the way that we do monitoring. That's no coincidence--**monitoring and data curation are two sides of the same coin!** They're both interested in solving the problem of finding data points where the model may not be performing well or where we're uncertain about how the model is performing on those data points.
689
+
690
+ ![alt_text](./media/image24.png "image_tooltip")
691
+
692
+ Some examples of people practically applying data curation are OpenAI’s DALL-E 2, which uses [active learning and manual curation](https://openai.com/blog/dall-e-2-pre-training-mitigations/), Tesla, which uses [feedback loops and manual curation](https://www.youtube.com/watch?v=hx7BXih7zx8), and Cruise, which uses feedback loops.
693
+
694
+ Some tools that help with data curation are [Scale Nucleus](https://scale.com/nucleus), [Aquarium](https://www.aquariumlearning.com/), and [Gantry](https://gantry.io/).
695
+
696
+ To summarize then, here is our final set of recommendations for applying data curation.
697
+
698
+
699
+
700
+ 1. Random sampling is a fine starting point. If you want to avoid bias or have rare classes, do stratified sampling instead.
701
+ 2. If you have a feedback loop, then user-driven curation is a no-brainer. If not, confidence-based active learning is easy to implement.
702
+ 3. As your model performance increases, you’ll have to look harder for challenging training points. Manual techniques are unavoidable and should be embraced. Know your data!
703
+
704
+
705
+ ### Retraining Triggers
706
+
707
+ After we've curated our infinite stream of unlabeled data down to a reservoir of labeled data that's ready to train on, the next thing we need to decide is “**what trigger are we going to use to retrain?**”
708
+
709
+ The main takeaway here is that moving to automated retraining is **not** always necessary. In many cases, just manually retraining is good enough. It can save you time and lead to better model performance. It's worth understanding when it makes sense to actually make the harder move to automated retraining.
710
+
711
+ The main prerequisite for moving to automated retraining is being able to reproduce model performance when retraining in a fairly automated fashion. If you're able to do that and you are not really working on the model actively, it's probably worth implementing some automated retraining. As a rule of thumb, if you’re retraining the model more than once a month, automated retraining may make sense.
712
+
713
+ When it's time to move to automated retraining, the main recommendation is to just keep it simple and **retrain periodically**, e.g. once a week. The main question though is, how do you pick the right training schedule? The recommendation here is to:
714
+
715
+
716
+
717
+ 1. Apply measurement to figure out a reasonable retraining schedule.
718
+ 2. Plot your model performance and degradation over time.
719
+ 3. Compare how retraining the model at various intervals would have resulted in improvements to its performance.
720
+
721
+ As seen below, the area between the curves represents the opportunity cost, so always remember to balance the upside of retraining with the operational costs of retraining.
722
+
723
+ ![alt_text](./media/image25.png "image_tooltip")
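The comparison of retraining intervals can be sketched with a measured decay curve. This sketch makes a strong simplifying assumption, namely that every retrain resets the model to the day-0 accuracy and that decay then follows the same curve; the function names are hypothetical.

```python
from typing import Dict, Sequence, Tuple


def average_accuracy(
    decay_curve: Sequence[float], interval_days: int, horizon_days: int
) -> float:
    """Mean accuracy over the horizon when retraining every interval_days.

    decay_curve[d] is the measured accuracy d days after a retrain.
    """
    total = 0.0
    for day in range(horizon_days):
        age = day % interval_days                 # days since the last retrain
        total += decay_curve[min(age, len(decay_curve) - 1)]
    return total / horizon_days


def compare_schedules(
    decay_curve: Sequence[float], intervals: Sequence[int], horizon_days: int
) -> Dict[int, Tuple[float, int]]:
    """Map each interval to (average accuracy, number of retrains needed)."""
    return {
        interval: (
            average_accuracy(decay_curve, interval, horizon_days),
            -(-horizon_days // interval),          # ceiling division
        )
        for interval in intervals
    }
```

Putting the average accuracy next to the number of retrains for each interval is one way to weigh the upside of retraining against its operational cost.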
724
+
725
+ This is a great area for future academic research! More specifically, we can look at ways to automate determining the optimal retraining strategy based on performance decay, sensitivity to performance, operational costs, and retraining costs.
726
+
727
+ An additional option for retraining, rather than time-based intervals, is **performance triggers** (e.g. retrain when the model accuracy dips below 90%). This lets you react more quickly to unexpected changes and is more cost-optimal, but it requires very good instrumentation to process these signals and adds operational complexity.
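A performance trigger might look like the sketch below, which assumes labeled outcomes arrive one at a time and fires when accuracy over a rolling window drops below a threshold. The class name, window size, and threshold are illustrative assumptions.

```python
from collections import deque


class PerformanceTrigger:
    """Fire a retraining signal when rolling accuracy falls below a threshold."""

    def __init__(self, threshold: float = 0.90, window: int = 100) -> None:
        self.threshold = threshold
        self.outcomes = deque(maxlen=window)   # keeps only the latest outcomes

    def record(self, correct: bool) -> bool:
        """Record one labeled outcome; return True when retraining should fire."""
        self.outcomes.append(correct)
        if len(self.outcomes) < self.outcomes.maxlen:
            return False                       # not enough evidence yet
        accuracy = sum(self.outcomes) / len(self.outcomes)
        return accuracy < self.threshold
```

Even this small version depends on having timely labels or feedback for production predictions, which is where the instrumentation burden comes from.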
728
+
729
+ An idea that probably won't be relevant but is worth thinking about is **online learning**. In this paradigm, you train on every single data point as it comes in. It's not very commonly used in practice.
730
+
731
+ A version of this idea that is used fairly frequently in practice is **online adaptation**. This method operates not at the level of retraining the whole model itself but rather on the level of adapting the policy that sits on top of the model. What is a policy, you ask? A policy is the set of rules that takes the raw prediction that the model made, like the score or the raw output of the model, and turns it into the output the user sees. In online adaptation, we use algorithms like multi-armed bandits to tune these policies. If your data changes very frequently, it is worth looking into this method.
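As a rough sketch of online adaptation, the epsilon-greedy bandit below picks among candidate policies, for example different decision thresholds applied to the model's score, and learns from a reward signal such as user feedback. Epsilon-greedy is just one simple bandit algorithm chosen for illustration, and all names here are hypothetical.

```python
import random
from typing import List, Sequence


class EpsilonGreedyPolicyTuner:
    """Choose among candidate policies (arms) using observed rewards."""

    def __init__(self, arms: Sequence[float], epsilon: float = 0.1, seed: int = 0) -> None:
        self.arms: List[float] = list(arms)    # e.g. candidate thresholds
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.pulls = [0] * len(self.arms)
        self.mean_reward = [0.0] * len(self.arms)

    def choose(self) -> int:
        """Return the index of the arm to use for the next request."""
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.arms))    # explore
        return max(range(len(self.arms)), key=lambda i: self.mean_reward[i])  # exploit

    def update(self, arm: int, reward: float) -> None:
        """Fold one observed reward into the running mean for that arm."""
        self.pulls[arm] += 1
        self.mean_reward[arm] += (reward - self.mean_reward[arm]) / self.pulls[arm]
```

The model's weights never change here; only the choice of policy on top of its raw output adapts.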
732
+
733
+
734
+ ### Dataset Formation
735
+
736
+ Imagine we've fired off a trigger to start a new training job. The next question we need to answer is, among all of the labeled data in our reservoir of data, **what specific data points should we train on for this particular new training job?**
737
+
738
+ We have four options here. Most of the time in deep learning, we'll just use the first option and **train on all the data that we have available** to us. Remember to keep your data version controlled and your curation rules consistent.
739
+
740
+ ![alt_text](./media/image26.png "image_tooltip")
741
+
742
+ If you have too much data to do that, you can use recency as a heuristic for a second option and **train on only a sliding window of the most recent data** (if recency is important) or **sample a smaller portion** (if recency isn’t). In the latter case, compare the aggregate statistics between the old and new windows to ensure there aren’t any bugs. It’s also important in both cases to compare the old and new datasets as they may not be related in straightforward ways.
743
+
744
+ ![alt_text](./media/image27.png "image_tooltip")
745
+
746
+ A useful third option is **online batch selection**, which can be used when recency doesn’t quite matter. In this method, we leverage label-aware selection functions to choose which items in mini-batches to train on.
747
+
748
+ ![alt_text](./media/image28.png "image_tooltip")
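One possible label-aware selection function is to keep only the highest-loss examples in each mini-batch. The sketch below assumes that choice; the lecture does not commit to a particular selection function, and the example type and squared-error loss are stand-ins.

```python
from typing import Callable, List, Tuple

Example = Tuple[float, int]  # (predicted probability, true label), for illustration


def squared_error(example: Example) -> float:
    """A stand-in loss that needs the label, which makes it label-aware."""
    predicted, label = example
    return (predicted - label) ** 2


def select_from_batch(
    batch: List[Example], loss_fn: Callable[[Example], float], keep: int
) -> List[Example]:
    """Keep the `keep` examples with the highest loss in this mini-batch."""
    return sorted(batch, key=loss_fn, reverse=True)[:keep]
```

In a training loop, the gradient step would then be taken on the selected subset rather than on the full mini-batch.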
749
+
750
+ A more difficult fourth option that isn’t quite recommended is **continual fine-tuning**. Rather than retraining from scratch every single time, you train your existing model on just new data. The primary reason to do this is that it's much more cost-effective. The paper below shares some findings from GrubHub, where they found a 45x cost improvement from this technique relative to sliding windows.
751
+
752
+ ![alt_text](./media/image29.png "image_tooltip")
753
+
754
+ The big challenge here is that unless you're very careful, it's easy for the model to forget what it learned in the past. The upshot is that you need to have mature evaluation practices to be very careful that your model is performing well on all the types of data that it needs to perform well on.
755
+
756
+
757
+ ### Offline Testing
758
+
759
+ After the previous steps, we now have a new candidate model that we think is ready to go into production. The next step is to test that model. The goal of this stage is to produce a report that our team can sign off on that answers the question of whether this new model is good enough or whether it's better than the old model. The key question here is, “**what should go into that report?**”
760
+
761
+ This is a place where there's not a whole lot of standardization, but the recommendation we have here is to compare your current model with the previous version of the model on all of the metrics that you care about, all of the subsets of data that you've flagged as important, and all the edge cases you’ve defined. Remember to adjust the comparison to account for any sampling bias.
762
+
763
+ Below is a sample comparison report. Note how the validation set is broken out into concrete subgroups. Note also how there are specific validation sets assigned to common error cases.
764
+
765
+ ![alt_text](./media/image30.png "image_tooltip")
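A comparison report along these lines can be sketched as a per-slice accuracy table for the old and new models. The slice names, the `label` key, and the use of accuracy as the only metric are assumptions made for this example.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def accuracy(labels: Sequence[int], preds: Sequence[int]) -> float:
    """Fraction of predictions that match their labels."""
    return sum(l == p for l, p in zip(labels, preds)) / len(labels)


def comparison_report(
    examples: List[dict],
    old_preds: Sequence[int],
    new_preds: Sequence[int],
    slices: Dict[str, Callable[[dict], bool]],
) -> Dict[str, Tuple[float, float, int]]:
    """Map each slice name to (old accuracy, new accuracy, slice size)."""
    report = {}
    for name, belongs in slices.items():
        idx = [i for i, ex in enumerate(examples) if belongs(ex)]
        if not idx:
            continue                           # skip slices with no examples
        labels = [examples[i]["label"] for i in idx]
        report[name] = (
            accuracy(labels, [old_preds[i] for i in idx]),
            accuracy(labels, [new_preds[i] for i in idx]),
            len(idx),
        )
    return report
```

Reporting the slice size next to each pair of numbers helps reviewers judge how much weight a small subgroup's result deserves.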
766
+
767
+ In continual learning, evaluation sets are dynamically refined just as much as training sets are. Here are some guidelines for how to manage evaluation sets in a continual learning system:
768
+
769
+
770
+
771
+ 1. As you curate new data, add some of it to your evaluation sets. For example, if you change how you do sampling, add that newly sampled data to your evaluation set. Or if you encounter a new edge case, create a test case for it.
772
+ 2. Corollary 1: you should version control your evaluation sets as well.
773
+ 3. Corollary 2: if your data changes quickly, always hold out the most recent data for evaluation.
774
+
775
+ Once you have the testing basics in place, a more advanced option that you can look into here is **expectation testing**. Expectation tests work by taking pairs of examples where you know the relationship between the two and checking that the model's predictions on the pair respect that relationship. These tests help a lot with understanding the generalizability of models.
776
+
777
+ ![alt_text](./media/image31.png "image_tooltip")
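An expectation test can be sketched as a list of input pairs, each with a known relationship that the two predictions must satisfy. The helper name and the house-price style example in the usage note are hypothetical.

```python
from typing import Any, Callable, List, Tuple

# (first input, second input, relationship the two predictions must satisfy)
Pair = Tuple[Any, Any, Callable[[float, float], bool]]


def expectation_failures(
    model: Callable[[Any], float], pairs: List[Pair]
) -> List[Tuple[Any, Any]]:
    """Return the input pairs whose predictions violate the expected relationship."""
    failures = []
    for first, second, expected in pairs:
        if not expected(model(first), model(second)):
            failures.append((first, second))
    return failures
```

For instance, a price model could be given two otherwise identical listings where the second has one more bedroom, with the expectation that the second prediction is not lower than the first.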
778
+
779
+ Just like how data curation is highly analogous to monitoring, so is offline testing. We want to observe our metrics, not just in aggregate but also across all of our important subsets of data and across all of our edge cases. One difference between these two is that **you will have different metrics available in offline testing and online testing**. For example, you’re much more likely to have labels offline. Online, you’re much more likely to have feedback. We look forward to more research that can predict online metrics from offline ones.
780
+
781
+
782
+ ### Online Testing
783
+
784
+ Much of this we covered in the last lecture, so we’ll keep it brief! Use shadow mode and A/B tests, roll out models gradually, and roll back models if you see issues during rollout.
785
+
786
+
787
+ ## 6 - The Continual Improvement Workflow
788
+
789
+ To tie it all together, we’ll conclude with an example. Monitoring and continual learning are two sides of the same coin. We should be using the signals that we monitor to very directly change our retraining strategy. This section describes the future state that comes as a result of investing in the steps laid out previously.
790
+
791
+ Start with a place to store and version your strategy. The components of your continual learning strategy should include the following:
792
+
793
+
794
+
795
+ * Inputs, predictions, user feedback, and labels.
796
+ * Metric definitions for monitoring, observability, and offline testing.
797
+ * Projection definitions for monitoring and manual data curation.
798
+ * Subgroups and cohorts of interest for monitoring and offline testing.
799
+ * Data curation logic.
800
+ * Datasets for training and evaluation.
801
+ * Model comparison reports.
802
+
803
+ Walk through this example to understand how changes to the retraining strategy occur as issues surface in our machine learning system.
804
+
805
+ ![alt_text](./media/image32.png "image_tooltip")
806
+
807
+ ## 7 - Takeaways
808
+
809
+ To summarize, continual learning is a nascent, poorly understood topic that is worth continuing to pay attention to. Watch this space! In this lecture, we focused on all the steps and techniques that allow you to use retraining effectively. As MLEs, leverage monitoring to strategically improve your model. Always start simple, and get better!
documents/lecture-06.srt ADDED
@@ -0,0 +1,440 @@
1
+ 1
2
+ 00:00:00,080 --> 00:00:38,960
3
+ hi everybody welcome back to full stack deep learning this week we're going to talk about continual learning which is in my opinion one of the most exciting topics that we cover in this class continual learning describes the process of iterating on your models once they're in production so using your production data to retrain your models for two purposes first to adapt your model to any changes in the real world that happen after you train your model and second to use data from the real world to just improve your model in general so let's dive in the sort of core justification for continual learning is that unlike in academia in the real world we never deal with static data distributions and so the implication of
4
+
5
+ 2
6
+ 00:00:36,719 --> 00:01:14,640
7
+ that is if you want to use ml in production if you want to build a good machine learning powered product you need to think about your goal as building a continual learning system not just building a static model so i think how we all hope this would work is the data flywheel that we've described in this class before so as you get more users those users bring more data you can use the data to make a better model that better model helps you attract even more users and build a better model over time and the most automated version of this the most optimistic version of it was described by andrej karpathy as operation vacation if we make our continual learning system good enough then it'll just get better on its own
8
+
9
+ 3
10
+ 00:01:12,880 --> 00:01:47,920
11
+ over time and we as machine learning engineers can just go on vacation and when we come back the model will be better but the reality of this is actually quite different i think it starts out okay so we gather some data we clean and label that data we train a model on the data then we evaluate the model we loop back to training the model to make it better based on the evaluations that we made and finally we get to the point where we're done we have a minimum viable model and we're ready to ship it into production and so we deploy it the problem begins after we deploy it which is that we generally don't really have a great way of measuring how our models are actually performing in production so often what
12
+
13
+ 4
14
+ 00:01:46,159 --> 00:02:21,440
15
+ we'll do is we'll just spot check some predictions to see if it looks like it's doing what it's supposed to be doing and if it seems to be working then that's great we probably move on and work on some other project that is until the first problem pops up and now unfortunately i as a machine learning engineer am probably not the one that discovers that problem to begin with it's probably you know some business user or some pm that realizes that hey we're getting complaints from a user or we're having a metric that's dipped and this leads to an investigation this is already costing the company money because the product and the business team are having to investigate this problem eventually they are able to
16
+
17
+ 5
18
+ 00:02:19,440 --> 00:02:53,280
19
+ point this back to me and to the model that i am responsible for and at this point you know i'm kind of stuck doing some ad hoc analyses because i don't really know what the cause of the failure of the model is maybe i haven't even looked at this model for a few weeks or a few months you know maybe eventually i'm able to like run a bunch of sql queries you know paste together some jupyter notebooks and figure out what i think the problem is so i'll retrain the model i'll redeploy it and if we're lucky we can run an a b test and if that a b test looks good then we'll deploy it into production and we're sort of back where we started not getting ongoing feedback about how the model is really doing in production the
20
+
21
+ 6
22
+ 00:02:51,519 --> 00:03:29,040
23
+ upshot of all this is that continual learning is really the least well understood part of the production machine learning lifecycle and very few companies are actually doing this well in production today and so this lecture in some ways is going to feel a little bit different than some of the other lectures a big part of the focus of this lecture is going to be being opinionated about how we think you should think about the structure of continual learning problems this is you know some of what we say here will be sort of well understood industry best practices and some of it will be sort of our view on what we think this should look like i'm going to throw a lot of information at you about each of the different steps of the
24
+
25
+ 7
26
+ 00:03:27,120 --> 00:04:06,000
27
+ continual learning process how to think about improving how you do these parts once you have your first model in production and like always we'll provide some recommendations for how to do this pragmatically and how to adopt it gradually so first i want to give sort of an opinionated take on how i think you should think about continual learning so i'll define continual learning as training a sequence of models that is able to adapt to a continuous stream of data that's coming in in production you can think about continual learning as an outer loop on your training process on one end of the loop is your application which consists of a model as well as some other code users interact with that application by submitting
28
+
29
+ 8
30
+ 00:04:03,840 --> 00:04:41,440
31
+ requests getting predictions back and then submitting feedback about how well the model did at providing that prediction the continual learning loop starts with logging which is how we get all the data into the loop then we have data curation triggers for doing the retraining process data set formation to pick the data to actually retrain on and we have the training process itself then we have offline testing which is how we validate whether the retrained model is good enough to go into production after it's deployed we have online testing and then that brings the next version of the model into production where we can start the loop all over again each of these stages passes an output to the next step and the way that output is defined is by
32
+
33
+ 9
34
+ 00:04:39,040 --> 00:05:17,360
35
+ using a set of rules and all these rules together roll up into something called a retraining strategy next we'll talk about what the retraining strategy defines for each stage and what the output looks like so the logging stage the key question that's answered by the retraining strategy is what data should we actually store and at the end of this we have an infinite stream of potentially unlabeled data that's coming from production and is able to be used for downstream analysis at the curation stage the key rules that we need to define are what data from that infinite stream are we going to prioritize for labeling and potential retraining and at the end of the stage we'll have a reservoir of a finite number of
36
+
37
+ 10
38
+ 00:05:15,919 --> 00:05:52,880
39
+ candidate training points that have labels and are fully ready to be fed back into a training process at the retraining trigger stage the key question to answer is when should we actually retrain how do we know when it's time to hit the retrain button and the output of the stage is a signal to kick off a retraining job at the data set formation stage the key rules we need to define are from among this entire reservoir of data what specific subset of that data are we actually going to train on for this particular training job you can think of the output of this as a view into that reservoir of training data that specifies the exact data points that are going to go into this training job at the offline testing
40
+
41
+ 11
42
+ 00:05:50,960 --> 00:06:28,800
43
+ stage the key rules that we need to define are what does good enough look like for all of our stakeholders how are we going to agree that this model is ready to be deployed and the output of the stage looks something like the equivalent of a pull request a report card for your model that has a clear sign-off process that once you're signed off the new model will roll out into prod and then finally at the deployment online testing stage the key rules that we need to define are how do we actually know if this deployment was successful and the output of the stage will be the signal to actually roll this model out fully to all of your users in an idealized world the way i think we should think of our role as machine
44
+
45
+ 12
46
+ 00:06:26,720 --> 00:07:07,360
47
+ learning engineers once we've deployed the first version of the model is not to retrain the model directly but it's to sit on top of the retraining strategy and babysit that strategy and try to improve the strategy itself over time so rather than training models day-to-day we're looking at metrics about how well the strategy is working how well it's solving the task of improving our model over time in response to changes to the world and the input that we provide is by tuning the strategy by changing the rules that make up the strategy to help the strategy do a better job of solving that task that's a description of the goal state of our role as an ml engineer in the real world today for most of us our job doesn't really feel like this at
48
+
49
+ 13
50
+ 00:07:05,520 --> 00:07:39,759
51
+ a high level because for most of us our retraining strategy is just retraining models whenever we feel like it and that's not actually as bad as it seems you can get really good results from ad hoc retraining but when you start to be able to get really consistent results when you retrain models and you're not really working on the model day-to-day anymore then it's worth starting to add some automation alternatively if you find yourself needing to retrain the model more than you know once a week or even more frequently than that to deal with changing results in the real world then it's also worth investing in automation just to save yourself time the first baseline retraining strategy that you should consider after you move
52
+
53
+ 14
54
+ 00:07:38,240 --> 00:08:18,479
55
+ on from ad hoc is just periodic retraining and this is what you'll end up doing in most cases in the near term so let's describe this periodic retraining strategy so at the logging stage we'll simply log everything at curation we'll sample uniformly at random from the data that we've logged up until we get the max number of data points that we're able to handle we're able to label or we're able to train on and then we'll label them using some automated tool our retraining trigger will just be periodic so we'll train once a week but we'll do it on the last month's data for example and then we will compute the test set accuracy after each training set a threshold on that or more likely manually review the results each time and spot check some of
56
+
57
+ 15
58
+ 00:08:16,400 --> 00:08:51,600
59
+ the predictions and then when we deploy the model we'll do spot evaluations of that deployed model on a few individual predictions just to make sure things look healthy and we'll move on this baseline looks something like what most companies do for automated retraining in the real world retraining periodically is a pretty good baseline and in fact it's what i would suggest doing when you're ready to start doing automated retraining but it's not going to work in every circumstance so let's talk about some of the failure modes the first category of failure modes has to do with when you have more data than you're able to log or able to label if you have a high volume of data you might need to be more careful about what data you sample
60
+
61
+ 16
62
+ 00:08:48,640 --> 00:09:30,160
63
+ and enrich particularly if either that data comes from a long tail distribution where you have edge cases that your model needs to perform well on but those edge cases might not be caught by just doing standard uniform random sampling or if that data is expensive to label like in a human in the loop scenario where you need custom labeling rules or labeling is part of the product in either of those cases long tail distribution or human in the loop setup you probably need to be more careful about what subset of your data that you log and enrich to be used down the road second category of where this might fail has to do with managing the cost of retraining if your model is really expensive to retrain then
64
+
65
+ 17
66
+ 00:09:28,720 --> 00:10:01,600
67
+ retraining it periodically is probably not going to be the most cost efficient way to go especially if you do it on a rolling window of data every single time let's say that you retrain your model every week but your data actually changes a lot every single day you're going to be leaving a lot of performance on the table by not retraining more frequently you could increase the frequency and retrain say every few hours but this is going to increase costs even further the final failure mode is situations where you have a high cost of bad predictions one thing that you should think about is every single time you retrain your model it introduces risk that risk comes from the fact that the data that you're training
68
+
69
+ 18
70
+ 00:10:00,320 --> 00:10:38,480
71
+ the model on might be bad in some way it might be corrupted it might have been attacked by an adversary or it might just not be representative anymore of all the cases that your model needs to perform well on so the more frequently you retrain and the more sensitive you are to failures of the model the more thoughtful you need to be about how do we make sure that we're carefully evaluating this model such that we're not taking on too much risk from retraining frequently when you're ready to move on from periodic retraining it's time to start iterating on your strategy and this is the part of the lecture where we're going to cover a grab bag of tools that you can use to help figure out how to iterate on your strategy and what
72
+
73
+ 19
74
+ 00:10:36,959 --> 00:11:10,320
75
+ changes to the strategy to make you don't need to be familiar in depth with every single one of these tools but i'm hoping to give you a bunch of pointers here that you can use when it's time to start thinking about how to make your model better so the main takeaway from this section is going to be we're going to use monitoring and observability as a way of determining what changes we want to make to our retraining strategy and we're going to do that by monitoring just the metrics that actually matter the most important ones for us to care about and then using all of the other metrics and information for debugging when we debug an issue with our model that's going to lead to potentially retraining our model but more broadly
76
+
77
+ 20
78
+ 00:11:09,040 --> 00:11:45,760
79
+ than that we can think of it as a change to the retraining strategy like changing our retraining triggers changing our offline tests our sampling strategies the metrics we use for observability etc and then lastly another principle for iterating on your strategy is as you get more confident in your monitoring as you get more confident that you'll be able to catch issues with your model if they occur then you can start to introduce more automation into your system so do things manually at first and then as you get more confident in your monitoring start to automate them let's talk about how to monitor and debug models in production so that we can figure out how to improve our retraining strategy the tldr here is like many parts of this
80
+
81
+ 21
82
+ 00:11:43,839 --> 00:12:18,160
83
+ lecture there's no real standards or best practices here yet and there's also a lot of bad advice out there the main principles that we're gonna follow here are we're gonna focus on monitoring things that really matter and also things that tend to break empirically and we're going to also compute all the other signals that you might have heard of data drift all these other sorts of things but we're primarily going to use those for debugging and observability what does it mean to monitor a model in production the way i think about it is you have some metric that you're using to assess the quality of your model like your accuracy let's say then you have a time series of how that metric changes over time and the question that you're
84
+
85
+ 22
86
+ 00:12:16,240 --> 00:12:51,839
87
+ trying to answer is is this bad or is this okay do i need to pay attention to this degradation or do i not need to pay attention so the questions that we'll need to answer are what metrics should we be looking at when we're doing monitoring how can we tell if those metrics are bad and it warrants an intervention and then lastly we'll talk about some of the tools that are out there to help you with this process choosing the right metric to monitor is probably the most important part of this process and here are the different types of metrics or signals that you can look at ranked in order of how valuable they are if you're able to get them the most valuable thing that you can look at is outcome data or feedback from your users
88
+
89
+ 23
90
+ 00:12:50,160 --> 00:13:26,079
91
+ if you're able to get access to this signal then this is by far the most important thing to look at unfortunately there's no one-size-fits-all way to do this because it just depends a lot on the specifics of the product that you're building for example if you're building a recommender system then you might measure feedback based on did the user click on the recommendation or not but if you're building a self-driving car that's not really a useful or even feasible signal to gather so you might instead gather data on whether the user intervened and grabbed the wheel to take over autopilot from the car and this is really more of a product design or product management question of how can you actually design your product in such
92
+
93
+ 24
94
+ 00:13:24,480 --> 00:13:59,279
95
+ a way that you're able to capture feedback from your users as part of that product experience and so we'll come back and talk a little bit more about this in the ml product management lecture the next most valuable signal to look at if you can get it is model performance metrics these are your offline model metrics things like accuracy the reason why this is less useful than user feedback is because of loss mismatch so i think a common experience that many ml practitioners have is you spend let's say a month trying to make your accuracy one or two percentage points better and then you deploy the new version of the model and it turns out that your users don't care they react just the same way they did
96
+
97
+ 25
98
+ 00:13:57,360 --> 00:14:35,120
99
+ before or even worse to that new theoretically better version of the model there's often very little excuse for not doing this at least to some degree you can just label some production data each day it doesn't have to be a ton you can do this by setting up an on-call rotation or just throwing a labeling party each day where you spend 30 minutes with your teammates you know labeling 10 or 20 data points each even just like that small amount will start to give you some sense of how your model's performance is trending over time if you're not able to measure your actual model performance metrics then the next best thing to look at are proxy metrics proxy metrics are metrics that are just correlated with bad model performance
100
+
101
+ 26
102
+ 00:14:33,199 --> 00:15:07,279
103
+ these are mostly domain specific so for example if you're building text generation with a language model then two examples here would be repetitive outputs and toxic outputs if you're building a recommender system then an example would be the share of personalized responses if you're seeing fewer personalized responses then that's probably an indication that your model is doing something bad if you're looking for ideas for proxy metrics edge cases can be good proxy metrics if there's certain problems that you know that you have with your model if those increase in prevalence then that might mean that your model's not doing very well that's the practical side of proxy metrics today they're very domain specific
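As an illustration of the "repetitive outputs" proxy metric mentioned above, here is a minimal sketch, not something shown in the lecture. The function names, the whitespace tokenization, the n-gram size, and the 0.3 threshold are all assumptions chosen for the example; a real system would pick these per product.

```python
from collections import Counter


def repeated_ngram_fraction(text: str, n: int = 3) -> float:
    """Fraction of n-grams in `text` that repeat an earlier n-gram (hypothetical helper)."""
    tokens = text.lower().split()  # assumption: whitespace tokenization is good enough
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    # Every occurrence beyond the first counts as a repeat.
    repeats = sum(count - 1 for count in counts.values())
    return repeats / len(ngrams)


def share_of_repetitive_outputs(outputs, n: int = 3, threshold: float = 0.3) -> float:
    """Share of model outputs whose repeated n-gram fraction exceeds `threshold`."""
    if not outputs:
        return 0.0
    flagged = sum(repeated_ngram_fraction(text, n) > threshold for text in outputs)
    return flagged / len(outputs)
```

Tracking `share_of_repetitive_outputs` per day over production generations would give a label-free time series that can be watched alongside the other signals.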
104
+
105
+ 27
106
+ 00:15:06,000 --> 00:15:44,720
107
+ either you're going to have good proxy metrics or you're not but i don't think it has to be that way there's an academic direction i'm really excited about that is aimed at being able to take any metric that you care about like your accuracy and approximate it on previously unseen data so how well do we think our model is doing on this new data which would make these proxy metrics a lot more practically useful there's a number of different approaches here ranging from training an auxiliary model to predict how well your main model might do on this offline data to heuristics to human in the loop methods and so it's worth checking these out if you're interested in seeing how people might do this in two or three years one unfortunate
108
+
109
+ 28
110
+ 00:15:43,519 --> 00:16:21,440
111
+ result from this literature though that's worth pointing out is that it's probably not going to be possible to have a single method that you use in all circumstances to approximate how your model is doing on out of distribution data so the way to think about that is let's say that you're looking at the input data to predict how the model is going to perform on those input points and then the label distribution changes if you're only looking at the input points then how would you be able to take into account that label distribution change in your approximate metric but there's more theoretical grounding for this result as well all right back to our more pragmatic scheduled programming the next signal that you can look at is data
112
+
113
+ 29
114
+ 00:16:19,519 --> 00:16:59,680
115
+ quality and data quality testing is just a set of rules that you can apply to measure the quality of your data this is dealing with questions like how well does the data reflect reality um how comprehensive is it and how consistent is it over time some common examples of data quality testing include checking whether the data has the right schema whether the values in each of the columns are in the range that you'd expect that you have enough columns that you don't have too much missing data simple rules like that the reason why this is useful is because data problems tend to be the most common issue with machine learning models in practice so this is a report from google where they covered 15 years of different pipeline outages
116
+
117
+ 30
118
+ 00:16:57,600 --> 00:17:37,840
119
+ with a particular machine learning model and their main finding was that most of the outages that happened with that model did not really have a lot to do with ml at all they were often distributed systems problems or also really commonly there were data problems one example that they give is a common type of failure where a data pipeline lost the permissions to read the data source that it depended on and so was starting to fail so these types of data issues are often what will cause models to fail spectacularly in production the next most helpful signal to look at is distribution drift even though distribution drift is a less useful signal than say user feedback it's still really important to be able to measure whether your data
120
+
121
+ 31
122
+ 00:17:35,840 --> 00:18:17,679
123
+ distributions change so why is that well your model's performance is only guaranteed if the data that it's evaluated on is sampled from the same distribution as it was trained on and this can have a huge impact in practice recent examples include total change in model behavior during the pandemic as words like corona took on new meaning or bugs in retraining pipelines that cause millions of dollars of losses for companies because they led to changing data distributions distribution drift manifests itself in different ways in the wild there's a few different types that you might see so you might have an instantaneous drift like when a model is deployed in a new domain or a bug is introduced in a pre-processing
124
+
125
+ 32
126
+ 00:18:15,039 --> 00:18:53,120
127
+ pipeline or some big external shift like covid you could have a gradual drift like if user preferences change over time or new concepts keep getting added to your corpus you could have periodic drifts like if your user preferences are seasonal or you could have a temporary drift like if a malicious user attacks your model and each of these different types of drifts might need to be detected in slightly different ways so how do you tell if your distribution has drifted the approach we're going to take here is we're going to first select a window of good data that's going to serve as a reference going forward how do you select that reference well you can use a fixed window of production data that you believe to be healthy so
128
+
129
+ 33
130
+ 00:18:51,200 --> 00:19:30,080
131
+ if you think that your model was really healthy at the beginning of the month you can use that as your reference window some papers advocate for sliding this window of production data to use as your reference but in practice most of the time what most people do is they'll use something like their validation data as their reference once you have that reference data then you'll select your new window of production data to measure your distribution distance on there isn't really a super principled approach for how to select the window of data to measure drift on and it tends to be pretty problem specific so a pragmatic solution that a lot of people use is they'll just pick one window size or even they'll just pick a few window
132
+
133
+ 34
134
+ 00:19:27,679 --> 00:20:08,160
135
+ sizes with some reasonable amount of data so that it's not too noisy and then they'll just slide those windows and lastly once you have your reference window and your production window then you'll compare these two windows using a distribution distance metric so what metrics should you use let's start by considering the one-dimensional case where you have a particular feature that is one-dimensional and you are able to compute a density of that feature on your reference window and your production window then the way to think about this problem is you're going to have some metric that approximates the distance between these two distributions there's a few options here the ones that are commonly recommended are the kl divergence and
136
+
137
+ 35
138
+ 00:20:06,160 --> 00:20:39,679
139
+ the ks test unfortunately those are commonly recommended but they're also bad choices sometimes better options would be things like using the infinity norm or the one norm which are what google advocates for using or the earth mover's distance which is a bit more of a statistically principled approach and i'm not going to go into details of these metrics here but check out the blog post at the bottom if you want to learn more about why the commonly recommended ones are not so good and the other ones are better so that's the one-dimensional case if you just have a single input feature that you're trying to measure distribution distance on but in the real world for most models we have potentially many input features or
140
+
141
+ 36
142
+ 00:20:37,840 --> 00:21:13,919
143
+ even unstructured data that is very high dimensional so how do we deal with detecting distribution drift in those cases one thing you could consider doing is just measuring drifts on all of the features independently the problem that you'll run into there is if you have a lot of features you're going to hit the multiple hypothesis testing problem and secondly this doesn't capture cross correlation so if you have two features and the distributions of each of those features stay the same but the correlation between the features changed then that wouldn't be captured using this type of system another common thing to do would be to measure drift only on the most important features one heuristic here is that generally
144
+
145
+ 37
146
+ 00:21:11,919 --> 00:21:49,039
147
+ speaking it's a lot more useful to measure drift on the outputs of the model than the inputs the reason for that is because inputs change all the time your model tends to be robust to some degree of distribution shift of the inputs but if the outputs change then that might be more indicative that there's a problem and also outputs for most machine learning models tend to be lower dimensional so it's a little bit easier to monitor you can also rank the importance of your input features and measure drift on the most important ones you can do this just heuristically using the ones that you think are important or you can compute some notion of feature importance and use that to rank the features that you want to monitor lastly
148
+
149
+ 38
150
+ 00:21:47,440 --> 00:22:27,679
151
+ there are metrics that you can look at that natively compute or approximate the distribution distance between high dimensional distributions and the two that are most worth checking out there are the maximum mean discrepancy and the approximate earth mover's distance the caveat here is that these are pretty hard to interpret so if you have a maximum mean discrepancy alert that's triggered that doesn't really tell you much about where to look for the potential failure that caused that distribution drift a more principled way in my opinion to measure distribution drift for high dimensional inputs to the model is to use projections the idea of a projection is you take some high dimensional input to the model or output
152
+
153
+ 39
154
+ 00:22:25,440 --> 00:23:08,240
155
+ an image or text or just a really large feature vector and then you run that through a function so each data point that your model makes a prediction on gets tagged by this projection function and the goal of the projection function is to reduce the dimensionality of that input then once you've reduced the dimensionality you can do your drift detection on that lower dimensional representation of the high dimensional data and the great thing about this approach is that it works for any kind of data whether it's images or text or anything else no matter what the dimensionality is or what the data type is and it's highly flexible there's many different types of projections that can be useful you can define analytical projections that are
156
+
157
+ 40
158
+ 00:23:05,039 --> 00:23:46,880
159
+ just functions of your input data and so these are things like looking at the mean pixel value of an image or the length of a sentence that's an input to the model or any other function that you can think of analytical projections are highly customizable they're highly interpretable and can often detect problems in practice if you don't want to use your domain knowledge to craft projections by writing analytical functions then you can also just do generic projections like random projections or statistical projections like running each of your inputs through an auto encoder something like that this is my recommendation for detecting drift for high dimensional and unstructured data and it's worth also just taking note of
160
+
161
+ 41
162
+ 00:23:44,799 --> 00:24:24,720
163
+ this concept of projections because we're going to see this concept pop up in a few other places as we discuss other aspects of continual learning distribution drift is an important signal to look at when you're monitoring your models and in fact it's what a lot of people think of when they think about model monitoring so why do we rank it so low on the list let's talk about the cons of looking at distribution drift i think the big one is that models are designed to be robust to some degree of distribution drift the figure on the left shows sort of a toy example to demonstrate this point which is we have a classifier that's trained to predict two classes and we've induced a synthetic distribution shift just
164
+
165
+ 42
166
+ 00:24:22,880 --> 00:24:59,039
167
+ shifting these points from the red ones on the top left to the bottom ones on the bottom right these two distributions are extremely different the marginal distributions in the chart on the bottom and the chart on the right-hand side have very large distance between the distributions but the model performs actually equally well on the training data as it does on the production data because the shift is just shifted directly along the classifier boundary so that's kind of a toy example that demonstrates that you know distribution shift is not really the thing that we care about when we're monitoring our models because just knowing that the distribution has changed doesn't tell us how the model has reacted to that
168
+
169
+ 43
170
+ 00:24:57,440 --> 00:25:35,520
171
+ distribution change and then another example that's worth illustrating is some of my research when i was in grad school was using data that was generated from a physics simulator to solve problems on real world robots and the data that we used was highly out of distribution for the test case that we cared about the data looked like these kind of very low fidelity random images like on the left and we found that by training on a huge variety of these low fidelity random images our model was able to actually generalize to a real world scenario like the one on the right so huge distribution shifts intuitively between the data the model was trained on and the data it was evaluated on but it was able to perform well on both
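The drift recipe described in this part of the lecture (pick a reference window, pick a production window, project high dimensional inputs down to a scalar, then compare the two windows with a distance such as the infinity norm or the one norm) can be sketched roughly as follows. This is an illustrative sketch only: the function names, the shared-bin histogram estimate of the densities, the bin count, and the sentence-length projection are assumptions for the example, not the lecture's or Google's actual implementation.

```python
import numpy as np


def histogram_distances(reference, production, bins: int = 10) -> dict:
    """Compare two 1-D windows using L-infinity and L1 distance between histograms."""
    reference = np.asarray(reference, dtype=float)
    production = np.asarray(production, dtype=float)
    # Use shared bin edges so the two histograms are directly comparable.
    lo = min(reference.min(), production.min())
    hi = max(reference.max(), production.max())
    if lo == hi:  # all values identical; widen the range so binning still works
        hi = lo + 1.0
    edges = np.linspace(lo, hi, bins + 1)
    ref_counts, _ = np.histogram(reference, bins=edges)
    prod_counts, _ = np.histogram(production, bins=edges)
    p = ref_counts / ref_counts.sum()
    q = prod_counts / prod_counts.sum()
    diff = np.abs(p - q)
    return {"l_inf": float(diff.max()), "l1": float(diff.sum())}


def sentence_length(text: str) -> int:
    """Example analytical projection: number of whitespace-separated tokens."""
    return len(text.split())


def projected_drift(reference_texts, production_texts, projection=sentence_length, bins: int = 10) -> dict:
    """Project each raw input to a scalar, then measure drift on the projected values."""
    ref = [projection(text) for text in reference_texts]
    prod = [projection(text) for text in production_texts]
    return histogram_distances(ref, prod, bins=bins)
```

In keeping with the point made above, a large distance from a sketch like this says only that the inputs changed, not that the model got worse, so it is better suited to debugging than to alerting.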
172
+
173
+ 44
174
+ 00:25:33,760 --> 00:26:15,039
175
+ beyond the theoretical limitations of measuring distribution drift this can also just be hard to do in practice you have to pick window sizes correctly you have to keep all this data around you need to choose metrics you need to define projections to make your data lower dimensional so it's not a super reliable signal to look at and so that's why we advocate for looking at ones that are more correlated with the thing that actually matters the last thing you should consider looking at is your standard system metrics like cpu utilization or how much gpu memory your model is taking up things like that so those don't really tell you anything about how your model is actually performing but they can tell you when something is going
176
+
177
+ 45
178
+ 00:26:12,880 --> 00:26:49,840
179
+ wrong okay so this is a ranking of all the different types of metrics or signals that you could look at if you're able to compute them but to give you a more concrete recommendation here we also have to talk about how hard it is to compute these different signals in practice we'll put the sort of value of each of these types of signals on the y-axis and on the x-axis we'll talk about the feasibility like how easy is it to actually measure these things measuring outcomes or feedback has pretty wide variability in terms of how feasible it is to do depends a lot on how your product is set up and the type of problem that you're working on measuring model performance tends to be the least feasible thing to do because
180
+
181
+ 46
182
+ 00:26:47,360 --> 00:27:26,799
183
+ it does involve collecting some labels and so things like proxy metrics are a little bit easier to compute because they don't involve labels whereas system metrics and data quality metrics are highly feasible because there's you know great off-the-shelf libraries and tools that you can use for them and they don't involve doing anything sort of special from a machine learning perspective so the practical recommendation here is getting basic data quality checks is effectively zero regret especially if you are in the phase where you're retraining your model pretty frequently because data quality issues are one of the most common causes of bad model performance in practice and they're very easy to implement the next
184
+
185
+ 47
186
+ 00:27:24,000 --> 00:28:02,000
187
+ recommendation is get some way of measuring feedback or model performance or if you really can't do either of those things then a proxy metric even if that way of measuring model performance is hacky or not scalable this is the most important signal to look at and is really the only thing that will be able to reliably tell you if your model is doing what it's supposed to be doing or not doing what it's supposed to be doing and then if your model is producing low dimensional outputs like if you're doing binary classification or something like that then monitoring the output distribution the score distribution also tends to be pretty useful and pretty easy to do and then lastly as you evolve your system like once you have these
188
+
189
+ 48
190
+ 00:28:00,000 --> 00:28:45,120
191
+ basics in place and you're iterating on your model and you're trying to get more confident about evaluation i would encourage you to adopt a mindset about metrics that you compute that's borrowed from the concept of observability so what is the observability mindset we can think about monitoring as measuring the known unknowns so if there's four or five or ten metrics that we know that we care about accuracy latency user feedback the monitoring approach would be to measure each of those signals we might set alerts on even just a few of those key metrics on the other hand observability is about measuring the unknown unknowns it's about having the power to be able to ask arbitrary questions about your system when it
192
+
193
+ 49
194
+ 00:28:42,640 --> 00:29:19,520
195
+ breaks for example how does my accuracy break out across all of the different regions that i've been considering what is my distribution drift for each of my features not signals that you would necessarily set alerts on because you don't have any reason to believe that these signals are things that are going to cause problems in the future but when you're in the mode of debugging being able to look at these things is really helpful and if you choose to adopt the observability mindset which i would highly encourage you to do especially in machine learning because it's just very very critical to be able to answer arbitrary questions to debug what's going on with your model then there's a few implications first
196
+
197
+ 50
198
+ 00:29:17,919 --> 00:29:55,760
199
+ you should really keep around the context or the raw data that makes up the metrics that you're computing because you're gonna want to be able to drill all the way down to potentially the data points themselves that make up the metric that has degraded it's also as a side note helpful to keep around the raw data to begin with for things like retraining the second implication is that you can kind of go crazy with measurement you can define lots of different metrics on anything that you can think of that might potentially go wrong in the future but you shouldn't necessarily set alerts on each of those or at least not very precise alerts because you don't want to have the problem of getting too
200
+
201
+ 51
202
+ 00:29:54,399 --> 00:30:31,120
203
+ many alerts you want to be able to use these signals for the purpose of debugging when something is going wrong drift is a great example of this it's very useful for debugging because let's say that your accuracy was lower yesterday than it was the rest of the month well one way that you might debug that is by trying to see if there's any input fields or projections that look different that distinguish yesterday from the rest of the month those might be indicators of what is going wrong with your model and the last piece of advice i have on model monitoring and observability is it's very important to go beyond aggregate metrics let's say that your model is 99 percent accurate and let's say that's really good but for one
204
+
205
+ 52
206
+ 00:30:29,120 --> 00:31:06,799
207
+ particular user who happens to be your most important user it's only 50 percent accurate can we really still consider that model to be good and so the way to deal with this is by flagging important subgroups or cohorts of data and being able to slice and dice performance along those cohorts and potentially even set alerts on those cohorts some examples of this are categories of users that you don't want your model to be biased against or categories of users that are particularly important for your business or just ones where you might expect your model to perform differently on them like if you're rolled out in a bunch of different regions or a bunch of different languages it might be helpful to look at how your performance breaks
208
+
209
+ 53
210
+ 00:31:04,799 --> 00:31:44,399
211
+ out across those regions or languages all right that was a deep dive in different metrics that you can look at for the purpose of monitoring the next question that we'll talk about is how to tell if those metrics are good or bad there's a few different options for doing this that you'll see recommended one that i don't recommend and i alluded to this a little bit before is two sample statistical tests like a ks test the reason why i don't recommend this is because if you think about what these two sample tests are actually doing they're trying to return a p-value for the likelihood that this data and this data are not coming from the same distribution and when you have a lot of data that just means that even really tiny shifts
212
+
213
+ 54
214
+ 00:31:42,399 --> 00:32:21,679
215
+ in the distribution will get very very small p values because even if the distributions are only a tiny bit different if you have a ton of samples you'll be able to very confidently say that those are different distributions but that's not actually what we care about since models are robust to small amounts of distribution shift better options than statistical tests include the following you can have fixed rules like there should never be any null values in this column you can have specific ranges so your accuracy should always be between 90 and 95 percent there can be predicted ranges so the accuracy is within what an off-the-shelf anomaly detector thinks is reasonable or there's also unsupervised detection of
216
+
217
+ 55
218
+ 00:32:19,600 --> 00:32:55,200
219
+ just new patterns in this signal and the most commonly used ones in practice are the first two fixed rules and specified ranges but predicted ranges via anomaly detection can also be really useful especially if there's some seasonality in your data the last topic i want to cover on model monitoring is the different tools that are available for monitoring your models the first category is system monitoring tools so this is a pretty mature category with a bunch of different companies in it and these are tools that help you detect problems with any software system not just machine learning models and they provide functionality for setting alarms when things go wrong and most of the cloud providers have pretty decent
220
+
221
+ 56
222
+ 00:32:53,679 --> 00:33:31,440
223
+ solutions here but if you want something better you can look at one of the observability or monitoring specific tools like honeycomb or datadog you can monitor pretty much anything in these systems and so it kind of raises the question of whether we should just use systems like this for monitoring machine learning metrics as well there's a great blog post on exactly this topic that i recommend reading if you're interested in learning about why this is a feasible but pretty painful thing to do and so maybe it's better to use something that's ml specific here in terms of ml specific tools there's some open source tools the two most popular ones are evidently ai and whylogs and these are both similar in that you provide them
224
+
225
+ 57
226
+ 00:33:29,120 --> 00:34:07,200
227
+ with samples of data and they produce a nice report that tells you where there are distribution shifts how have your model metrics changed etc the big limitation of these tools is that they don't solve the data infrastructure and the scale problem for you you still need to be able to get all that data into a place where you can analyze it with these tools and in practice that ends up being one of the hardest parts about this problem the main difference between these tools is that whylogs is a little bit more focused on gathering data from the edge and the way they do that is by aggregating the data into statistical profiles at inference time itself so you don't need to transport all the data from your inference devices back to your
228
+
229
+ 58
230
+ 00:34:05,600 --> 00:34:44,960
231
+ cloud which in some cases can be very helpful and lastly there's a bunch of different saas vendors for ml monitoring and observability my startup gantry has some functionality around this and there's a bunch of other options as well all right so we've talked about model monitoring and observability and the goal of monitoring and observability in the context of continual learning is to give you the signals that you need to figure out what's going wrong with your continual learning system and how you can change the strategy in order to influence that outcome next we're going to talk about for each of the stages in the continual learning loop what are the different ways that you might be able to go beyond the basics and
232
+
233
+ 59
234
+ 00:34:43,520 --> 00:35:19,040
235
+ use what we learned from monitoring and observability to improve those stages the first stage of the continual learning loop is logging as a reminder the goal of logging is to get data from your model to a place where you can analyze it and the key question to answer is what data should i actually log for most of us the best answer is just to log all of your data storage is cheap and it's better to have data than not have it but there's some situations where you can't do that for example if you have just too much traffic going through your model to the point where it's too expensive to log all of it um if you have data privacy concerns if you're not actually allowed to look at your users data or if you're running
236
+
237
+ 60
238
+ 00:35:16,880 --> 00:35:56,400
239
+ your model at the edge and it's too expensive to get all that data back because you don't have enough network bandwidth if you can't log all of your data there's two things that you can do the first is profiling the idea of profiling is that rather than sending all the data back to your cloud and then using that to do monitoring or observability or retraining instead you can compute statistical profiles of your data on the edge that describe the data distribution that you're seeing so the nice thing about this is it's great from a data security perspective because it doesn't require you to send all the data back home it minimizes your storage cost and lastly you don't miss things that happen in the tails which is an issue for the
240
+
241
+ 61
242
+ 00:35:55,040 --> 00:36:30,560
243
+ next approach that we'll describe the place to use this really is primarily for security critical applications the other approach is sampling in sampling you'll just take certain data points and send those back home the advantage of sampling is that it has minimal impact on your inference resources so you don't have to actually spend the computational budget to compute profiles and you get to have access to the raw data for debugging and retraining and so this is what we recommend doing for pretty much every other application i should describe in a little bit more detail how statistical profiles work because it's kind of interesting let's say that you have a stream of data that's coming in from two classes cat and dog and you
244
+
245
+ 62
246
+ 00:36:28,800 --> 00:37:13,200
247
+ want to be able to estimate what is the distribution of cat and dog over time without looking at all of the raw data so for example maybe in the past you saw three examples of a dog and two examples of a cat a statistical profile that you can store that summarizes this data is just a histogram so the histogram says we saw three examples of a dog and two of a cat and over time as more and more examples stream in rather than actually storing those data we can just increment the histogram and keep track of how many total examples of each category that we've seen over time and so like a neat fact of statistics is that for a lot of the statistics that you might be interested in looking at quantiles means accuracy other statistics you can
248
+
249
+ 63
250
+ 00:37:10,720 --> 00:37:49,280
251
+ compute you can approximate those statistics pretty accurately by using statistical profiles called sketches that have minimal size so if you're interested in going on a tangent and learning more about an interesting topic in computer science that's one i'd recommend checking out next step in the continual learning loop is curation to remind you the goal of curation is to take your infinite stream of production data which is potentially unlabeled and turn this into a finite reservoir of data that has all the enrichments that it needs like labels to train your model on the key question that we need to answer here is similar to the one that we need to answer when we're sampling data at log time which is what data
252
+
253
+ 64
254
+ 00:37:47,280 --> 00:38:25,760
255
+ should we select for enrichment the most basic strategy for doing this is just sampling data randomly but especially as your model gets better most of the data that you see in production might not actually be that helpful for improving your model and if you do this you could miss rare classes or events like if you have an event that happens you know one time in every 10 000 examples in production but you are trying to improve your model on it then you might not sample any examples of that at all if you just sample randomly a way to improve on random sampling is to do what's called stratified sampling the idea here is to sample specific proportions of data points from various subpopulations so common ways that you might stratify for
256
+
257
+ 65
258
+ 00:38:23,200 --> 00:39:05,520
259
+ sampling in ml could be sampling to get a balance among classes or sampling to get a balance among categories that you don't want your model to be biased against like gender lastly the most advanced and interesting strategy for picking data to enrich is to curate data points that are somehow interesting for the purpose of improving your model and there's a few different ways of doing this that we'll cover the first is to have this notion of interesting data be driven by your users which will come from user feedback and feedback loops the second is to determine what is interesting data yourself by defining error cases or edge cases and then the third is to let an algorithm define this for you and this is a category of techniques known as
260
+
261
+ 66
262
+ 00:39:04,079 --> 00:39:40,800
263
+ active learning if you already have a feedback loop or a way of gathering feedback from your users in your machine learning system which you really should if you can then this is probably the easiest and potentially also the most effective way to pick interesting data for the purpose of curation and the way this works is you'll pick data based on signals that come from your users that they didn't like your prediction so this could be the user churned after interacting with your model it could be that they filed a support ticket about a particular prediction the model made it could be that they you know click the thumbs down button that you put in your products that they changed the label that your model produced for them or
264
+
265
+ 67
266
+ 00:39:38,880 --> 00:40:19,599
267
+ that they intervened with an automatic system like they grab the wheel of their autopilot system if you don't have user feedback or if you need even more ways of gathering interesting data from your system then probably the second most effective way of doing this is by doing manual error analysis the way this works is we will look at the errors that our model is making we will reason about the different types of failure modes that we're seeing we'll try to write functions or rules that help capture these error modes and then we'll use those functions to gather more data that might represent those error cases two sub-categories of how to do this one is what i would call similarity-based curation and the way this works is if
268
+
269
+ 68
270
+ 00:40:17,359 --> 00:40:57,119
271
+ you have some data that represents your errors or data that you think might be an error then you can pick an individual data point or a handful of data points and run a nearest neighbor similarity search algorithm to find the data points in your stream that are the closest to the one that your model is maybe making a mistake on the second way of doing this which is potentially more powerful but a little bit harder to do is called projection based curation the way this works is rather than just picking an example and grabbing the nearest neighbors of that example instead we are going to find an error case like the one on the bottom right where there's a person crossing the street with a bicycle and then we're gonna write a
272
+
273
+ 69
274
+ 00:40:54,400 --> 00:41:35,599
275
+ function that attempts to detect that error case and this could just be training a simple neural network or it could be just writing some heuristics the advantage of doing similarity-based curation is that it's really easy and fast right like you just have to click on a few examples and you'll be able to get things that are similar to those examples this is beginning to be widely used in practice thanks to the explosion of vector search databases on the market it's relatively easy to do this and what this is particularly good for is events that are rare they don't occur very often in your data set but they're pretty easy to detect like if you had a problem with your self-driving car where you have llamas crossing the road a
276
+
277
+ 70
278
+ 00:41:33,760 --> 00:42:10,480
279
+ similarity search-based algorithm would probably do a reasonably good job of detecting other llamas in your training set on the other hand projection-based curation requires some domain knowledge because it requires you to think a little bit more about what is the particular error case that you're seeing here and write a function to detect it but it's good for more subtle error modes where a similarity search algorithm might be too coarse-grained it might find examples that look similar on the surface to the one that you are detecting but don't actually cause your model to fail the last way to curate data is to do so automatically using a class of algorithms called active learning the way active learning works
280
+
281
+ 71
282
+ 00:42:08,240 --> 00:42:42,640
283
+ is given a large amount of unlabeled data what we're going to try to do is determine which data points would improve model performance the most if you were to label those data points next and train on them and the way that these algorithms work is by defining a sampling strategy or a query strategy and then you rank all of your unlabeled examples using a scoring function that defines that strategy and take the ones with the highest scores and send them off to be labeled i'll give you a quick tour of some of the different types of scoring functions that are out there and if you want to learn more about this then i'd recommend the blog post linked on the bottom you have scoring functions that sample data points that the model
284
+
285
+ 72
286
+ 00:42:40,560 --> 00:43:16,800
287
+ is very unconfident about you have scoring functions that are defined by trying to predict what is the error that the model would make on this data point if we had a label for it you have scoring functions that are designed to detect data that doesn't look anything like the data that you've already trained on so can we distinguish these data points from our training data if so maybe those are the ones that we should sample and label we have scoring functions that are designed to take a huge data set of points and boil it down to the small number of data points that are most representative of that distribution lastly there's scoring functions that are designed to detect data points that if we train on them we
288
+
289
+ 73
290
+ 00:43:15,119 --> 00:43:49,839
291
+ think would have a big impact on training so where they would have a large expected gradient or would tend to cause the model to change its mind so that's just a quick tour of different types of scoring functions that you might implement uncertainty based scoring tends to be the one that i see the most in practice largely because it's very simple to implement and tends to produce pretty decent results but it's worth diving a little bit deeper into this if you do decide to go down this route if you're paying close attention you might have noticed that there's a lot of similarity between some of the ways that we do data curation the way that we pick interesting data points and the way that we do monitoring i
292
+
293
+ 74
294
+ 00:43:47,359 --> 00:44:25,839
295
+ think that's no coincidence monitoring and data curation are two sides of the same coin they're both interested in solving the problem of finding data points where the model may not be performing well or where we're uncertain about how the model is performing on those data points so for example user driven curation is kind of another side of the same coin of monitoring user feedback metrics both of these things look at the same metrics stratified sampling is a lot like doing subgroup or cohort analysis making sure that we're getting enough data points from subgroups that are important or making sure that our metrics are not degrading on those subgroups projections are used in both data curation and monitoring to
296
+
297
+ 75
298
+ 00:44:23,599 --> 00:45:05,200
299
+ take high dimensional data and break them down into distributions that we think are interesting for some purpose and then in active learning some of the techniques also have mirrors in monitoring like predicting the loss on an unlabeled data point or using the model's uncertainty on that data point next let's talk about some case studies of how data curation is done in practice the first one is a blog post on how openai trained dall-e 2 to detect malicious inputs to the model there's two techniques that they used here they used active learning using uncertainty sampling to reduce the false positives for the model and then they did a manual curation actually they did it in kind of an automated way but they did
300
+
301
+ 76
302
+ 00:45:02,319 --> 00:45:41,599
303
+ similarity search to find similar examples to the ones that the model was not performing well on the next example from tesla this is a talk i love from andrej karpathy about how they build a data flywheel at tesla and they use two techniques here one is feedback loops so gathering information about when users intervene with the autopilot and then the second is manual curation via projections for edge case detection and so this is super cool because they actually have infrastructure that allows ml engineers when they discover a new edge case to write an edge case detector function and then actually deploy that on the fleet that edge case detector not only helps them curate data but it also helps them decide which data to sample
304
+
305
+ 77
306
+ 00:45:40,000 --> 00:46:17,280
307
+ which is really powerful the last case study i want to talk about is from cruise they also have this concept of building a continual learning machine and the main way they do that is through feedback loops that's kind of a quick tour of what some people actually use to build these data curation systems in practice there's a few tools that are emerging to help with data curation scale nucleus and aquarium are relatively similar tools that are focused on computer vision and they're especially good at nearest neighbor based sampling at my startup gantry we're also working on some tools to help with this across a wide variety of different applications concrete recommendations on data curation random sampling is probably a fine starting
308
+
309
+ 78
310
+ 00:46:14,400 --> 00:46:51,359
311
+ point for most use cases but if you have a need to avoid bias or if there's rare classes in your data set you probably should start even with stratified sampling or at the very least introduce that pretty soon after you start sampling if you have a feedback loop as part of your machine learning system and i hope you're taking away from this how helpful it is to have these feedback loops then user-driven curation is kind of a no-brainer this is definitely something that you should be doing and is probably going to be the thing that is the most effective in the early days of improving your model if you don't have a feedback loop then using confidence-based active learning is a next best bet because it's pretty easy
312
+
313
+ 79
314
+ 00:46:49,599 --> 00:47:25,680
315
+ to implement and works okay in practice and then finally as your model performance increases you're gonna have to look harder and harder for these challenging training points at the end of the day if you want to squeeze the maximum performance out of your model there's no avoiding manually looking at your data and trying to find interesting failure modes there's no substitute for knowing your data after we've curated our infinite stream of unlabeled data down to a reservoir of labeled data that's ready to potentially train on the next thing that we'll need to decide is what trigger are we going to use to retrain and the main takeaway here is that moving to automated retraining is not always necessary in many cases just
316
+
317
+ 80
318
+ 00:47:23,839 --> 00:47:59,520
319
+ manually retraining is good enough but it can save you time and lead to better model performance so it's worth understanding when it makes sense to actually make that move the main prerequisite for moving to automated retraining is just being able to reproduce model performance when retraining in a fairly automated fashion so if you're able to do that and you are not really working on this model very actively anymore then it's probably worth implementing some automated retraining if you just find yourself retraining this model super frequently then it'll probably save you time to implement this earlier when it's time to move to automated training the main recommendation is just keep it simple and retrain periodically like once a
320
+
321
+ 81
322
+ 00:47:57,280 --> 00:48:32,319
323
+ week rerun training on that schedule the main question though is how do you pick that training schedule so what i recommend doing here is doing a little bit of like measurement to figure out what is a reasonable retraining schedule you can plot your model performance over time and then compare to how the model would have performed if you had retrained on different frequencies you can just make basic assumptions here like if you retrain you'll be able to reach the same level of accuracy and what you're going to be doing here is looking at these different retraining schedules and looking at the area between these curves like on the chart on the top right the area between these two curves is your opportunity cost in
324
+
325
+ 82
326
+ 00:48:30,880 --> 00:49:09,920
327
+ terms of like how much model performance you're leaving on the table by not retraining more frequently and then once you have a number of these different opportunity costs for different retraining frequencies you can plot those opportunity costs and then you can sort of run the ad hoc exercise of trying to balance you know where is the right trade-off point for us between the performance gain that we get from retraining more frequently and the cost of that retraining which is both the cost of running the retraining itself as well as the operational cost that we'll introduce by needing to evaluate that model more frequently that's what i'd recommend doing in practice a request i have for research is i think it'd be great i think it's
328
+
329
+ 83
330
+ 00:49:08,240 --> 00:49:43,839
331
+ very feasible to have a technique that would automatically determine the optimal retraining strategy based on how performance tends to decay how sensitive you are to that performance decay your operational costs and your retraining costs so i think you know eventually we won't need to do the manual data analysis every single time to determine this retraining frequency if you're more advanced then the other thing you can consider doing is retraining based on performance triggers this looks like setting triggers on metrics like accuracy and only retraining when that accuracy dips below a predefined threshold some big advantages to doing this are you can react a lot more quickly to unexpected changes that
332
+
333
+ 84
334
+ 00:49:42,160 --> 00:50:16,240
335
+ happen in between your normal training schedule it's more cost optimal because you can skip a retraining if it wouldn't actually improve your model's performance but the big cons here are that since you don't know in advance when you're gonna be retraining you need to have good instrumentation and measurement in place to make sure that when you do retrain you're doing it for the right reasons and that the new model is actually doing well these techniques i think also don't have a lot of good theoretical justification and so if you are the type of person that wants to understand you know why theoretically this should work really well i don't think you're going to find that today and probably the most important con is
336
+
337
+ 85
338
+ 00:50:14,720 --> 00:50:50,800
339
+ that this adds a lot of operational complexity because instead of just knowing like hey at 8 am i know my retraining is going live and so i can check in on that instead this retraining could happen at any time so your whole system needs to be able to handle that and that just introduces a lot of new infrastructure that you'll need to build and then lastly an idea that probably won't be relevant to most of you but is worth thinking about because i think it could be really powerful in the future is online learning where you train on every single data point as it comes in it's not very commonly used in practice but one sort of relaxation of this idea that is used fairly frequently in practice is online adaptation the way
340
+
341
+ 86
342
+ 00:50:48,160 --> 00:51:28,319
343
+ online adaptation works is it operates not at the level of retraining the whole model itself but it operates on the level of adapting the policy that sits on top of the model what is a policy a policy is the set of rules that takes the raw prediction that the model made like the score or the raw output of the model and then turns that into the actual thing that the user sees so like a classification threshold is an example of a policy or if you have many different versions of your model that you're ensembling what are the weights of those ensembles or even which version of the model is this particular request going to be routed to in online adaptation rather than retraining the model on each new data point as it comes
344
+
345
+ 87
346
+ 00:51:26,000 --> 00:52:07,359
347
+ in instead we use an algorithm like multi-armed bandits to tune the weights of this policy online as more data comes in so if your data changes really frequently in practice or you have a hard time training your model frequently enough to adapt to it then online adaptation is definitely worth looking into next we've fired off a trigger to start a training job and the next question we need to answer is among all of the labeled data in our reservoir of data which specific data points should we train on for this particular training job most of the time in deep learning we'll just train on all the data that we have available to us but if you have too much data to do that then depending on whether recency of data is
348
+
349
+ 88
350
+ 00:52:05,680 --> 00:52:43,599
351
+ an important signal to determine whether that data is useful you'll either slide a window to make sure that you're looking at the most recent data therefore in many cases the most useful data or we'll use techniques like sampling or online batch selection if not and a more advanced technique to be aware of that is hard to execute in practice today is continual fine-tuning we'll talk about that as well so the first option is just to train on all available data so you have a data set that you'll keep track of that your last model was trained on then over time between your last training and your next training you'll have a bunch of new data come in you'll curate some of that data then you'll just take all that data
352
+
353
+ 89
354
+ 00:52:42,160 --> 00:53:20,240
355
+ you'll add it to the data set and you'll train the new model on the combined data set so the keys here are you need to keep this data version controlled so that you know which data was added to each training iteration and it's also important if you want to be able to evaluate the model properly to keep track of the rules that you use to curate that new data so if you're sampling in a way that's not uniform from your distribution you should keep track of the rules that you use to sample so that you can determine where that data actually came from second option is to bias your sampling toward more recent data by using a sliding window the way this works is at each point when you train your model you look
356
+
357
+ 90
358
+ 00:53:17,119 --> 00:53:55,760
359
+ backward and you gather a window of data that leads up to the current moment and then at your next training you slide that window forward and so there might be a lot of overlap potentially between these two data sets but you have all the new data or like a lot of the new data and you get rid of the oldest data in order to form the new data set couple key things to do here are it's really helpful to look at the different statistics between the old and new data sets to catch bugs like if you have a large change in a particular distribution of one of the columns that might be indicative of a new bug that's been introduced and one challenge that you'll find here is just comparing the old and the new versions of the models
360
+
361
+ 91
362
+ 00:53:52,960 --> 00:54:30,960
363
+ since they are not trained on data that is related in a very straightforward way if you're working in a setting where you need to sample data you can't train on all of your data but there isn't any reason to believe that recent data is much better than older data then you can sample data from your reservoir using a variety of techniques the most promising of which is called online batch selection normally if we were doing stochastic gradient descent then what we do is we would sample mini batches on every single training iteration until we run out of data or until we run out of compute budget in online batch selection instead what we do is before each training step we sample a larger batch like much larger than the mini batch
364
+
365
+ 92
366
+ 00:54:29,359 --> 00:55:08,480
367
+ that we ultimately want to train on we rank each of the items in that larger batch according to a label aware selection function and then we take the top n items according to that function and train on those the paper on the right describes a label aware selection function called the reducible holdout loss selection function that performs pretty well on some relatively large data sets and so if you're going to look into this technique this is probably where i would start the last option that we'll discuss which is not recommended to do today is continual fine-tuning the way this works is rather than retraining from scratch every single time instead just train your existing model on just new data the reason why you might
368
+
369
+ 93
370
+ 00:55:06,880 --> 00:55:41,839
371
+ want to do this primarily is because it's much more cost effective the paper on the right shares some findings from grubhub where they found a 45x cost improvement by doing this technique relative to sliding windows but the big challenge here is that unless you're very careful it's easy for the model to forget what it learned in the past so the upshot is that you need to have pretty mature evaluation to be able to be very careful that your model is performing well on all the types of data that it needs to perform well on before it's worth implementing something like this so now we've triggered a retraining we have selected the data points that are going to go into the training job we've trained our model you know run our
372
+
373
+ 94
374
+ 00:55:39,839 --> 00:56:13,839
375
+ hyperparameter sweeps if we want to and we have a new candidate model that we think is ready to go into production the next step is to test that model the goal of this stage is to produce a report that our team can sign off on that answers the question of whether this new model is good enough or whether it's better than the old model and the key question here is what should go into that report again this is a place where there's not a whole lot of standardization but the recommendation we have here is to compare your current model with the previous version of the model on all the following all the metrics that you care about all of the slices or subsets of data that you've flagged as important all of the edge
376
+
377
+ 95
378
+ 00:56:11,520 --> 00:56:47,760
379
+ cases that you've defined and in a way that's adjusted to account for any sampling bias that you might have introduced by your curation strategy an example of what such a report could look like is the following across the top we have all of our metrics in this case accuracy precision and recall and then all on the left are all of the data sets and slices that we're looking at so the things to notice here are we have our main validation set which is like what most people use for evaluating models but rather than just looking at those numbers in the aggregate we also break it out across a couple of different categories in this case the age of the user and the age of the account that belongs to that user and
380
+
381
+ 96
382
+ 00:56:45,680 --> 00:57:25,200
383
+ then below the main validation set we also have more specific validation sets that correspond to particular error cases that we know have given our model trouble or a previous version of our model trouble in the past these could be like just particular edge cases that you've found in the past like maybe your model handles examples of poor grammar very poorly or it doesn't know what some gen z slang terms mean like these are examples of failure modes you've found for your model in the past that get rolled into data sets to test the next version of your model in continual learning just like how training sets are dynamic and change over time evaluation sets are dynamic as well as you curate new data you should add some of it to
384
+
385
+ 97
386
+ 00:57:23,440 --> 00:57:58,400
387
+ your training sets but also add some of it to your evaluation sets for example if you change how you do sampling you might want to add some of that newly sampled data to your eval set as well to make sure that your eval set represents that new sampling strategy or if you discover a new edge case instead of only adding that edge case to the training set it's worth holding out some examples of that edge case as a particular unit test to be part of that offline evaluation suite two corollaries to note of the fact that evaluation sets are dynamic the first is that you should also version control your evaluation sets just like you do your training sets the second is that if your data is evolving really quickly then part of the
388
+
389
+ 98
390
+ 00:57:56,559 --> 00:58:36,079
391
+ data that you hold out should always be the most recent data the data from you know the past day or the past hour or whatever it is to make sure that your model is generalizing well to new data once you have the basics in place a more advanced thing that you can look into here that i think is pretty promising is the idea of expectation tests the way that expectation tests work is you take pairs of examples where you know the relationship so let's say that you're doing sentiment analysis and you have a sentence that says my brother is good if you make the positive word in that sentence more positive and instead say my brother is great then you would expect your sentiment classifier to become even more positive about that sentence these types
392
+
393
+ 99
394
+ 00:58:33,200 --> 00:59:16,400
395
+ of tests have been explored in nlp as well as recommendation systems and they're really good for testing whether your model generalizes in predictable ways and so they give you more granular information than just aggregate performance metrics about how your model does on previously unseen data one observation to make here is that just like how data curation is highly analogous to monitoring so is offline testing just like in monitoring we want to observe our metrics not just in aggregate but also across all of our important subsets of data and across all of our edge cases one difference between these two is that you will in general have different metrics available in offline testing and online testing for
396
+
397
+ 100
398
+ 00:59:13,440 --> 00:59:51,520
399
+ example you are much more likely to have labels available offline in fact you always have labels available offline because that is uh how you're going to train your model but online you're much more likely to have feedback and so even though these two ideas are highly analogous and should share a lot of metrics and definitions of subsets and things like that one point of friction that will occur between online monitoring and offline testing is that the metrics are a little bit different so one direction for research that i think would be really exciting to see more of is using offline metrics like accuracy to predict online metrics like user engagement and then lastly once we've tested our candidate model offline
400
+
401
+ 101
402
+ 00:59:49,359 --> 01:00:28,319
403
+ it's time to deploy it and evaluate it online so we talked about this last time so i don't want to reiterate too much but as a reminder if you have the infrastructural capability to do so then you should do things like first running your model in shadow mode before you um actually roll it out to real users then running an a/b test to make sure that users are responding to it better than they did the old model then once you have a successful a/b test rolling it out to all of your users but doing so gradually and then finally if you see issues during that rollout just to roll it back to the old version of the model and try to figure out what went wrong so we talked about the different stages of continual learning from
404
+
405
+ 102
406
+ 01:00:25,280 --> 01:01:07,440
407
+ logging data to curating it to triggering retraining testing the model and rolling out to production and we also talked about monitoring and observability which is about giving you a set of rules that you can use to tell whether your retraining strategy needs to change and we observed that in a bunch of different places the fundamental elements that you study in monitoring like projections and user feedback and model uncertainty are also useful for different parts of the continual learning process and that's no coincidence i see monitoring and continual learning as two sides of the same coin we should be using the signals that we monitor to very directly change our retraining strategy so the last thing i want to do is just try to
408
+
409
+ 103
410
+ 01:01:05,839 --> 01:01:42,160
411
+ make this a little bit more concrete by walking through an example of a workflow that you might have from detecting an issue in your model to altering the strategy this section describes more of a future state until you've invested pretty heavily in infrastructure it's going to be hard to make it feel as seamless as this in practice but i wanted to mention it anyway because i think it provides like a nice end state for what we should aspire to in our continual learning workflows the thing you would need to have in place before you're able to actually execute what i'm going to describe next is a place to store and version all of the elements of your strategy which include metric definitions for both online and offline
412
+
413
+ 104
414
+ 01:01:40,319 --> 01:02:20,240
415
+ testing performance thresholds for those metrics definitions of any of the projections that you want to use for monitoring and also for data curation subgroups or cohorts that you think are particularly important to break out your metrics along the logic that defines how you do data curation whether it's sampling rules or anything else and then finally the specific data sets that you use for each different run of your training or evaluation our example continual improvement loop starts with an alert and in this case that alert might be our user feedback got worse today and so our job is now to figure out what's going on so the next thing we'll use is some of our observability tools to investigate what's going on here and we
416
+
417
+ 105
418
+ 01:02:18,720 --> 01:02:56,079
419
+ might you know run some subgroup analyses and look at some raw data and figure out that the problem is really mostly isolated to new users the next thing that we might do is do error analysis so look at those new users and the data points that they're sending us and try to reason about why those data points are performing worse and what we might discover is something like our model was trained assuming that people were going to write emails but now users are submitting a bunch of text that has things that aren't normally found in emails like emojis and that's causing our model problems so here's where we might make the first change to our retraining strategy we could define new users as a cohort of interest because we
420
+
421
+ 106
422
+ 01:02:54,400 --> 01:03:30,160
423
+ never want performance to decline on new users again without getting an alert about that then we could define a new projection that helps us detect data that has emojis and add that projection to our observability metrics so that anytime in the future if we want as part of an investigation to see how our performance differs between users that are submitting emojis and ones that are not we can always do that without needing to rewrite the projection next we might search our reservoir for historical examples that contain emojis so that we can use them to make our model better and then adjust our strategy by adding that subset of data as a new test case so now whenever we test the model going forward we'll
424
+
425
+ 107
426
+ 01:03:27,839 --> 01:04:03,920
427
+ always see how it performs on data with emojis in addition to adding emoji examples as a test case we would also curate them and add them back into our training set and do a retraining then once we have the new model that's trained we'll get this new model comparison report which will include also the new cohort that we defined as part of this process and the new emoji edge case data set that we defined and then finally if we're doing manual deployment we can just deploy that model and that completes the continual improvement loop so to wrap up what do i want you to take away from this continual learning is a complicated rapidly evolving and poorly understood topic so this is an area to pay attention to if you're interested in
428
+
429
+ 108
430
+ 01:04:02,640 --> 01:04:39,599
431
+ seeing how the cutting edge of production machine learning is evolving and the main takeaway from this lecture is we broke down the concept of a retraining strategy which consists of a number of different pieces definitions of metrics subgroups of interest projections that help you break down and analyze high dimensional data performance thresholds for your metrics logic for curating new data sets and the specific data sets that you're going to use for retraining and evaluation at a high level the way that we can think about our role as machine learning engineers once we've deployed the first version of the model is to use rules that we define as part of our observability and monitoring suite to iterate on the strategy for many of you
432
+
433
+ 109
434
+ 01:04:37,920 --> 01:05:13,280
435
+ in the near term this won't feel that different from just using that data to retrain the model however you'd like to but i think thinking of this as a strategy that you can tune at a higher level is a productive way of understanding it as you move towards more and more automated retraining lastly just like every other aspect of the ml life cycle that we talked about in this course our main recommendation here is to start simple and add complexity later in the context of continual learning what that means is it's okay to retrain your models manually to start as you get more advanced you might want to automate retraining and you also might want to think more intelligently about how you sample data to make sure that you're
436
+
437
+ 110
438
+ 01:05:11,119 --> 01:05:19,200
439
+ getting the data that is most useful for improving your model going forward that's all for this week see you next time
440
+
documents/lecture-07.md ADDED
@@ -0,0 +1,285 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ ---
2
+ description: Building on Transformers, GPT-3, CLIP, StableDiffusion, and other Large Models.
3
+ ---
4
+
5
+ # Lecture 7: Foundation Models
6
+
7
+ <div align="center">
8
+ <iframe width="720" height="405" src="https://www.youtube-nocookie.com/embed/Rm11UeGwGgk?list=PL1T8fO7ArWleMMI8KPJ_5D5XSlovTW_Ur" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
9
+ </div>
10
+
11
+ Lecture by [Sergey Karayev](https://twitter.com/sergeykarayev).
12
+ Notes by [James Le](https://twitter.com/le_james94) and [Vishnu Rachakonda](https://www.linkedin.com/in/vrachakonda/).<br />
13
+ Published September 19, 2022.
14
+ [Download slides](https://fsdl.me/2022-lecture-07-slides).
15
+
16
+ Foundation models are very large models trained on very large datasets that
17
+ can be used for multiple downstream tasks.
18
+
19
+ We’ll talk about fine-tuning, Transformers, large language models, prompt engineering, other applications of large models, and vision and text-based models like CLIP and image generation.
20
+
21
+ ![alt_text](media/image-1.png "image_tooltip")
22
+
23
+ ## 1 - Fine-Tuning
24
+
25
+ Traditional ML trains a large model on a lot of data, which takes a long time. But if you have only a small amount of data, you can use **transfer learning** to benefit from training that was already done on a lot of data. You take the model that has been pre-trained, add a few layers, and unfreeze some of the weights.
26
+
27
+ We have been doing this in computer vision since 2014. Usually, you train a model on ImageNet, keep most of the layers, and replace the top three or so layers with newly learned weights. Model Zoos are full of these models like AlexNet, ResNet, etc. in both TensorFlow and PyTorch.
28
+
29
+ In NLP, pre-training was initially limited to the first step: word embeddings. The input to a language model is words. One way to encode a word as a vector is **one-hot encoding**, but those vectors are as large as the vocabulary. Instead, you can learn an embedding matrix that maps each word into a dense, real-valued vector space, which brings the dimension down to the order of a thousand. Maybe those dimensions correspond to some semantic notion.
30
+
31
+ ![alt_text](media/image-2.png "image_tooltip")
32
+
33
+
34
+ [Word2Vec](https://jalammar.github.io/illustrated-word2vec/) trained a model like this in 2013. It looked at which words frequently co-occur together. The learning objective was to maximize cosine similarity between their embeddings. It could do cool demos of vector math on these embeddings. For example, when you embed the words “king,” “man,” and “woman,” you can do vector math to get a vector that is close to the word “queen” in this embedding space.
35
+
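The "king - man + woman" demo can be sketched in a few lines. This is only a toy illustration: the two-dimensional vectors below are hand-made for the example (they are not real Word2Vec embeddings), and the helper names `cosine_similarity` and `analogy` are our own.

```python
import math

# Hand-made toy embeddings (illustrative only, not real Word2Vec vectors).
# Dimension 0 loosely encodes "royalty"; dimension 1 loosely encodes "gender".
TOY_EMBEDDINGS = {
    "king": [0.9, 0.8],
    "queen": [0.9, -0.8],
    "man": [0.1, 0.8],
    "woman": [0.1, -0.8],
    "apple": [-0.7, 0.0],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(a, b, c, embeddings):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(embeddings[a], embeddings[b], embeddings[c])]
    candidates = [w for w in embeddings if w not in {a, b, c}]
    return max(candidates, key=lambda w: cosine_similarity(target, embeddings[w]))


print(analogy("king", "man", "woman", TOY_EMBEDDINGS))
```

With real embeddings the nearest neighbor is only approximately "queen", and the input words are usually excluded from the search, as they are here.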
36
+ It’s useful to see more context to embed words correctly because words can play different roles in the sentence (depending on their context). If you do this, you’ll improve accuracy on all downstream tasks. In 2018, a number of models such as ELMo and ULMFiT [published pre-trained LSTM-based models that set state-of-the-art results on most NLP tasks](https://ruder.io/nlp-imagenet/).
37
+
38
+ But if you look at the model zoos today, you won’t see any LSTMs. You’ll only see Transformers everywhere. What are they?
39
+
40
+
41
+ ## 2 - Transformers
42
+
43
+ Transformers come from a 2017 paper called “[Attention Is All You Need](https://arxiv.org/abs/1706.03762),” which introduced a groundbreaking architecture that set state-of-the-art results on translation first and on a bunch of other NLP tasks later.
44
+
45
+ ![alt_text](media/image-3.png "image_tooltip")
46
+
47
+
48
+ It has a decoder and an encoder. For simplicity, let’s take a look at the encoder. The interesting components here are self-attention, positional encoding, and layer normalization.
49
+
50
+
51
+ ### Self-Attention
52
+
53
+ ![alt_text](media/image-4.png "image_tooltip")
54
+
55
+
56
+ Basic self-attention works as follows: Given an input sequence of t vectors x, we produce an output sequence of t vectors. Each output vector is a weighted sum of the input sequence. The weight here is just a dot product of the input vectors. All we have to do is normalize the weights (with a softmax) so that they sum to 1. We can represent it visually, as seen below. The input is a sentence in English, while the output is a translation in French.
57
+
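As a rough sketch of the basic (weight-free) self-attention described above, here is a plain-Python version. The function names are our own, and the softmax normalization is the standard choice rather than something specific to these notes.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def basic_self_attention(xs):
    """Map t input vectors to t output vectors, with no learned parameters."""
    outputs = []
    for x_i in xs:
        # Raw weight between x_i and every input vector: a dot product.
        raw = [sum(a * b for a, b in zip(x_i, x_j)) for x_j in xs]
        weights = softmax(raw)
        # Output i is the weighted sum of all input vectors.
        y_i = [sum(w * x_j[d] for w, x_j in zip(weights, xs)) for d in range(len(x_i))]
        outputs.append(y_i)
    return outputs


xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for y in basic_self_attention(xs):
    print(y)
```

Note that shuffling the input vectors only shuffles the outputs in the same way, which is why some notion of position has to be added separately.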
58
+ ![alt_text](media/image-5.png "image_tooltip")
59
+
60
+
61
+ So far, there are no learned weights and no sequence order. Let’s learn some weights!
+
+ * If we look at the input vectors, we use them in three ways: as **queries**, compared to the other input vectors to set the weights for their own output; as **keys**, compared to the other input vectors to set the weights for those vectors’ outputs; and as **values**, summed up (with those weights) to produce each output vector.
62
+ * We can process each input vector with three different matrices to fulfill these roles of query, key, and value. We will have three weight matrices, and everything else remains the same. If we learn these matrices, we learn attention.
63
+ * It’s called **multi-head attention** because we learn different sets of weight matrices simultaneously, but we implement them as just a single matrix.
64
+
65
+ So far, we have learned the query, key, and value. Now we need to introduce some notion of order to the sequence by encoding each vector with its position. This is called **positional encoding**.
66
+
67
+
68
+ ### Positional Encoding
69
+
70
+ ![alt_text](media/image-6.png "image_tooltip")
71
+
72
+
73
+ Let’s say we have an input sequence of words
74
+
75
+ * The first step is to embed the words into a dense, real-valued word embedding. This part can be learned.
76
+ * However, there is no order to that embedding. Thus, we will add another embedding that only encodes the position.
77
+ * In brief, the first embedding encodes only the content, while the second embedding encodes only the position. If you add them, you now have information about both the content and the position.
78
+
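The "content plus position" idea can be sketched as follows. We assume the fixed sinusoidal encoding from the original Transformer paper; the notes do not say which position embedding is used, and learned position embeddings are also common. The function names are hypothetical.

```python
import math


def sinusoidal_position(pos, dim):
    """Fixed position vector: sines on even indices, cosines on odd indices."""
    vec = []
    for i in range(dim):
        angle = pos / (10000 ** ((2 * (i // 2)) / dim))
        vec.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return vec


def add_positions(word_vectors):
    """Add a position vector to each content vector, element by element."""
    dim = len(word_vectors[0])
    return [
        [w + p for w, p in zip(vec, sinusoidal_position(pos, dim))]
        for pos, vec in enumerate(word_vectors)
    ]


# The same word embedding at two different positions yields two different vectors.
same_word_twice = [[0.5, 0.5, 0.5, 0.5], [0.5, 0.5, 0.5, 0.5]]
for vec in add_positions(same_word_twice):
    print(vec)
```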
79
+
80
+ ### Layer Normalization
81
+
82
+ ![alt_text](media/image-7.png "image_tooltip")
83
+
84
+
85
+ Neural network layers work best when the input vectors have uniform mean and standard deviation in each dimension. As activations flow through the network, the means and standard deviations get blown out by the weight matrices. [Layer normalization](https://arxiv.org/pdf/1803.08494.pdf) is a hack to re-normalize every activation to where we want them between each layer.
86
+
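A minimal sketch of the re-normalization step, assuming the usual formulation in which each activation vector is normalized across its own features. The learned gain and bias that real implementations add afterwards are left out, and `eps` is a small constant we add to avoid division by zero.

```python
import math


def layer_norm(vector, eps=1e-5):
    """Rescale one activation vector to roughly zero mean and unit variance."""
    mean = sum(vector) / len(vector)
    variance = sum((v - mean) ** 2 for v in vector) / len(vector)
    return [(v - mean) / math.sqrt(variance + eps) for v in vector]


print(layer_norm([1.0, 2.0, 3.0, 4.0]))
```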
87
+ That’s it! All the amazing results you’ll see from now on are just increasingly large Transformers with dozens of layers, dozens of heads within each layer, large embedding dimensions, etc. The fundamentals are the same. It’s just the Transformer model.
88
+
89
+ [Anthropic](https://www.anthropic.com/) has been publishing great work lately to investigate why Transformers work so well. Check out these publications:
90
+
91
+ 1. [A Mathematical Framework for Transformer Circuits](https://transformer-circuits.pub/2021/framework/index.html)
92
+ 2. [In-Context Learning and Induction Heads](https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html)
93
+ 3. [Toy Models of Superposition](https://transformer-circuits.pub/2022/toy_model/index.html)
94
+
95
+
96
+ ## 3 - Large Language Models
97
+
98
+
99
+ ### Models
100
+
101
+ GPT and [GPT-2](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf) came out in 2018 and 2019, respectively. The name means “generative pre-trained Transformers.” They are decoder-only models and use masked self-attention. This means that, at any point in the output sequence, you can only attend to the input sequence vectors that came before that point in the sequence.
102
+
103
+ ![alt_text](media/image-8.png "image_tooltip")
104
+
105
+
106
+ These models were trained on 8 million web pages. The largest model has 1.5 billion parameters. The task that GPT-2 was trained on is predicting the next word in all of this text on the web. They found that it works increasingly well with an increasing number of parameters.
107
+
108
+ ![alt_text](media/image-9.png "image_tooltip")
109
+
110
+
111
+ [BERT](https://arxiv.org/abs/1810.04805) (Bidirectional Encoder Representations from Transformers) came out around the same time. It is encoder-only and does not do attention masking. It has 110 million parameters. During training, BERT masks out random words in a sequence and has to predict whatever the masked word is.
112
+
113
+ ![alt_text](media/image-10.png "image_tooltip")
114
+
115
+
116
+ [T5 (Text-to-Text Transfer Transformer)](https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html) came out in 2020. The input and output are both text strings, so you can specify in the input the task that the model is supposed to be doing. T5 has an encoder-decoder architecture. It was trained on the C4 dataset (Colossal Clean Crawled Corpus), which is 100x larger than Wikipedia. It has around 10 billion parameters. You can download [the open-sourced model](https://github.com/google-research/text-to-text-transfer-transformer#released-model-checkpoints) and run it on your machine.
117
+
118
+ [GPT-3](https://openai.com/blog/gpt-3-apps/) was one of the state-of-the-art models in 2020. It was 100x larger than GPT/GPT-2 with 175 billion parameters. Because of its size, GPT-3 exhibits unprecedented capabilities of few-shot and zero-shot learning. As seen in the graph below, the more examples you give the model, the better its performance is. The larger the model is, the better its performance is. The trend suggests that an even larger model would do even better.
119
+
120
+ ![alt_text](media/image-11.png "image_tooltip")
121
+
122
+
123
+ OpenAI also released [Instruct-GPT](https://openai.com/blog/instruction-following/) earlier this year. It had humans rank different GPT-3 outputs and used reinforcement learning to fine-tune the model. Instruct-GPT was much better at following instructions. OpenAI has put this model, titled ‘text-davinci-002,’ in their API. It is unclear how big the model is. It could be ~10x smaller than GPT-3.
124
+
125
+ ![alt_text](media/image-12.png "image_tooltip")
126
+
127
+
128
+ DeepMind released [RETRO (Retrieval-Enhanced Transformers)](https://arxiv.org/pdf/2112.04426.pdf) in 2021. Instead of learning language and memorizing facts in the model’s parameters, why don’t we just learn the language in parameters and retrieve facts from a large database of internal text? To implement RETRO, they encode a bunch of sentences with BERT and store them in a huge database with more than 1 trillion tokens. At inference time, they fetch matching sentences and attend to them. This is a powerful idea because RETRO is connected to an always updated database of facts.
129
+
130
+ ![alt_text](media/image-13.png "image_tooltip")
131
+
132
+
133
+ DeepMind released another model called [Chinchilla](https://gpt3demo.com/apps/chinchilla-deepmind) in 2022, which observed the scaling laws of language models. They [trained over 400 language models](https://arxiv.org/pdf/2203.15556.pdf) from 70 million to 16 billion parameters on 5 billion to 500 billion tokens. They then derived formulas for optimal model and training set size, given a fixed compute budget. They found that most large language models are “undertrained,” meaning they haven’t seen enough data.
134
+
135
+ ![alt_text](media/image-14.png "image_tooltip")
136
+
137
+
138
+ To prove this, they trained a large model called [Gopher](https://gpt3demo.com/apps/deepmind-gopher) with 280 billion parameters and 300 billion tokens. With Chinchilla, they reduced the number of parameters to 70 billion and used more than four times as much data (1.4 trillion tokens). Chinchilla not only matched Gopher’s performance but actually exceeded it. Check out [this LessWrong post](https://www.lesswrong.com/posts/6Fpvch8RR29qLEWNH/chinchilla-s-wild-implications) if you want to read about people’s opinions on it.
139
+
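To make the comparison concrete, here is some back-of-the-envelope arithmetic on the numbers quoted above. The `6 * parameters * tokens` estimate of training compute is a commonly used rule of thumb that we are assuming here; it is not taken from these notes, and the function names are our own.

```python
def tokens_per_parameter(params, tokens):
    """How many training tokens the model saw per parameter."""
    return tokens / params


def approx_training_flops(params, tokens):
    """Rule-of-thumb estimate (an assumption): about 6 FLOPs per parameter per token."""
    return 6 * params * tokens


# Parameter and token counts as quoted in the notes above.
MODELS = {
    "Gopher": {"params": 280e9, "tokens": 300e9},
    "Chinchilla": {"params": 70e9, "tokens": 1.4e12},
}

for name, m in MODELS.items():
    ratio = tokens_per_parameter(m["params"], m["tokens"])
    flops = approx_training_flops(m["params"], m["tokens"])
    print(f"{name}: {ratio:.1f} tokens per parameter, ~{flops:.2e} training FLOPs")
```

Under this estimate the two models use a similar amount of compute, but Chinchilla spends it on roughly 20 tokens per parameter while Gopher sees only about one.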
140
+
141
+ ### Vendors
142
+
143
+ OpenAI offers four model sizes: Ada, Babbage, Curie, and Davinci, which correspond to 350M, 1.3B, 6.7B, and 175B parameters. [Each has a different price](https://openai.com/api/pricing/) and different capabilities. Most of the impressive GPT-3 results on the Internet came from Davinci. You can also fine-tune models for an extra cost. The quota you get when you sign up is pretty small, but you can raise it over time. You have to apply for review before going into production.
144
+
145
+ There are some alternatives to OpenAI:
146
+
147
+ 1. [Cohere AI](https://cohere.ai/) has similar models for [similar prices](https://cohere.ai/pricing).
148
+ 2. [AI21](https://www.ai21.com/) also has some large models.
149
+ 3. There are also open-source large language models, such as [Eleuther GPT-NeoX](https://www.eleuther.ai/projects/gpt-neox/) (20B parameters), [Facebook OPT-175B](https://ai.facebook.com/blog/democratizing-access-to-large-scale-language-models-with-opt-175b/) (175B parameters), and [BLOOM from BigScience](https://bigscience.huggingface.co/blog/bloom) (176B parameters). If you want to use one of these open-source models but do not want to be responsible for deploying it, you can use [HuggingFace’s inference API](https://huggingface.co/inference-api).
150
+
151
+
152
+ ## 4 - Prompt Engineering
153
+
154
+ GPT-3 and other large language models are mostly alien technologies. It’s unclear how they exactly work. People are finding out how they work by playing with them. We will cover some notable examples below. Note that if you play around with them long enough, you are likely to discover something new.
155
+
156
+ GPT-3 is surprisingly bad at reversing words due to **tokenization**: It doesn’t see letters and words as humans do. Instead, it sees “tokens,” which are chunks of characters. Furthermore, it gets confused with long-ish sequences. Finally, it has trouble merging characters. For it to work, you have to teach GPT-3 the algorithm to use to get around its limitations. Take a look at [this example from Peter Welinder](https://twitter.com/npew/status/1525900849888866307).
157
+
158
+ ![alt_text](media/image-15.jpg "image_tooltip")
159
+
160
+
161
+ Another surprising prompt engineering trick is “Let’s Think Step By Step.” This comes from a paper called “[Large Language Models are Zero-Shot Reasoners](https://arxiv.org/pdf/2205.11916.pdf).” Simply adding “Let’s Think Step By Step” into the prompt increases the accuracy of GPT-3 on one math problem dataset from 17% to 78% and another math problem dataset from 10% to 40%.
162
+
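In code, the trick amounts to appending one sentence to the prompt before sending it to the model. The `Q:`/`A:` template and the `build_prompt` helper below are our own illustrative assumptions, and no model is actually called.

```python
def build_prompt(question, step_by_step=True):
    """Format a question, optionally nudging the model to reason before answering."""
    prompt = f"Q: {question}\nA:"
    if step_by_step:
        prompt += " Let's think step by step."
    return prompt


question = "A juggler has 16 balls. Half of them are golf balls. How many golf balls are there?"
print(build_prompt(question))
```

The returned string would then be passed to whatever completion API you use.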
163
+ ![alt_text](media/image-16.png "image_tooltip")
164
+
165
+
166
+ Another unintuitive thing is that the context length of GPT is long. You can give it a **long instruction** and it can return the desired output. [This example](https://twitter.com/goodside/status/1557381916109701120) shows how GPT can output a CSV file and write the Python code as stated. You can also use **formatting tricks** to reduce cost, as you can do multiple tasks per call. Take a look at [this example](https://twitter.com/goodside/status/1561569870822653952) for inspiration.
167
+
168
+ We have to be careful since our models might get pwned or possessed. User input in the prompt may instruct the model to do something naughty. Such input can even reveal your prompt via [prompt injection attacks](https://simonwillison.net/2022/Sep/12/prompt-injection/) and [possess your AI](https://twitter.com/goodside/status/1564112369806151680). This actually works in GPT-3-powered production apps.
169
+
170
+ ![alt_text](media/image-17.png "image_tooltip")
171
+
172
+
173
+ Further work is needed before putting GPT-3-powered apps into production. There are some tools for prompt engineering such as [PromptSource](https://github.com/bigscience-workshop/promptsource) and [OpenPrompt](https://github.com/thunlp/OpenPrompt), but we definitely need better tools.
174
+
175
+
176
+ ## 5 - Other Applications
177
+
178
+
179
+ ### Code
180
+
181
+ ![alt_text](media/image-18.png "image_tooltip")
182
+
183
+
184
+ One notable application of large foundation models is **code generation**. With a 40-billion-parameter Transformer model pre-trained on all the GitHub code it could find, [DeepMind AlphaCode](https://www.deepmind.com/blog/competitive-programming-with-alphacode) was able to achieve an above-average score in Codeforces competitions. To do this, they used a model to generate a large set of potential solutions and another model to winnow down the options by actually executing them.
185
+
186
+ The general idea to highlight from this is **filtering the outputs of a model**. You can have a separate model that does filtering, or you can have some kind of verification + validation process. This can really significantly boost accuracy. OpenAI demonstrates impressive results on [different math word problems](https://openai.com/blog/grade-school-math/), as seen below.
187
+
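A toy sketch of the generate-then-filter idea: run every candidate program against a few known test cases and keep only the ones that pass. The hard-coded `CANDIDATES` strings stand in for samples from a code model, and the `solve` entry-point name is our own assumption. Calling `exec` on untrusted model output is unsafe outside a sandbox, so treat this purely as an illustration.

```python
def passes_tests(source, test_cases, func_name="solve"):
    """Execute candidate source code and check it against (args, expected) pairs."""
    namespace = {}
    try:
        exec(source, namespace)  # unsafe for untrusted code; sandbox this in practice
        func = namespace[func_name]
        return all(func(*args) == expected for args, expected in test_cases)
    except Exception:
        # Syntax errors, missing functions, and runtime errors all count as failures.
        return False


def filter_candidates(candidates, test_cases):
    """Keep only the candidate programs that pass every test case."""
    return [src for src in candidates if passes_tests(src, test_cases)]


# Stand-ins for model samples for the task "add two numbers".
CANDIDATES = [
    "def solve(a, b):\n    return a - b\n",  # wrong logic
    "def solve(a, b):\n    return a + b\n",  # correct
    "def solve(a, b)\n    return a + b\n",   # syntax error
]
TESTS = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]

survivors = filter_candidates(CANDIDATES, TESTS)
print(f"{len(survivors)} of {len(CANDIDATES)} candidates passed")
```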
188
+ ![alt_text](media/image-19.png "image_tooltip")
189
+
190
+
191
+ Code generation has moved into products of late, like [GitHub Copilot](https://github.blog/2022-06-21-github-copilot-is-generally-available-to-all-developers/). We highly recommend trying it out! Another option for a similar tool is [Replit’s new tool for coding](https://blog.replit.com/ai).
192
+
193
+ We’re just getting started with the applications of foundation models to the programming workflow. In fact, things are about to start getting really wild. [A recent paper](https://arxiv.org/pdf/2207.14502.pdf) showed that a large language model that generated its own synthetic puzzles to learn to code could improve significantly. **Models are teaching themselves to get better!**
194
+
195
+ ![alt_text](media/image-20.png "image_tooltip")
196
+
197
+
198
+ Playing around with systems like GPT-3 and their ability to generate code can feel quite remarkable! Check out some fun experiments Sergey ran ([here](https://twitter.com/sergeykarayev/status/1569377881440276481) and [here](https://twitter.com/sergeykarayev/status/1570848080941154304)).
199
+
200
+ ![alt_text](media/image-21.jpg "image_tooltip")
201
+
202
+ ### Semantic Search
203
+
204
+ **Semantic search** is another interesting application area. If you have texts like words, sentences, paragraphs, or even whole documents, you can embed that text with large language models to get vectors. If you have queries in sentences or paragraphs, you can also embed them in the same way. With this function, you can generate embeddings and easily find semantic overlap by examining the cosine similarity between embedding vectors.
205
+
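The ranking step can be sketched with brute-force cosine similarity. The three-dimensional vectors below are made up for illustration; in practice they would come from an embedding model, and the `search` helper is our own.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search(query_vec, doc_vecs, top_k=2):
    """Return the names of the top_k documents most similar to the query."""
    ranked = sorted(
        doc_vecs,
        key=lambda name: cosine_similarity(query_vec, doc_vecs[name]),
        reverse=True,
    )
    return ranked[:top_k]


# Made-up embeddings standing in for the output of a language model.
DOC_VECS = {
    "refund policy": [0.9, 0.1, 0.0],
    "shipping times": [0.1, 0.9, 0.1],
    "company history": [0.0, 0.1, 0.9],
}
QUERY_VEC = [0.8, 0.3, 0.0]  # e.g. the embedding of "how do I get my money back?"

print(search(QUERY_VEC, DOC_VECS))
```

This exhaustive comparison is fine for a handful of documents but scales linearly with the size of the collection.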
206
+ ![alt_text](media/image-22.png "image_tooltip")
207
+
208
+
209
+ Implementing this semantic search is hard. Computations on large, dense vectors with float data types are intensive. Companies like Google and Facebook that use this approach have developed libraries like [FAISS](https://towardsdatascience.com/using-faiss-to-search-in-multidimensional-spaces-ccc80fcbf949) and [ScaNN](https://cloud.google.com/blog/topics/developers-practitioners/find-anything-blazingly-fast-googles-vector-search-technology) to solve the challenges of implementing semantic search.
210
+
211
+ Some open-source options for this include [Haystack from DeepSet](https://www.deepset.ai/haystack) and [Jina.AI](https://github.com/jina-ai/jina). Other vendor options include [Pinecone](https://www.pinecone.io/), [Weaviate](https://weaviate.io/), [Milvus](https://milvus.io/), [Qdrant](https://qdrant.tech/), [Google Vector AI Matching Engine](https://cloud.google.com/vertex-ai/docs/matching-engine/overview), etc.
212
+
213
+
214
+ ### Going Cross-Modal
215
+
216
+ Newer models are bridging the gap between data modalities (e.g. using both vision and text). One such model is [the Flamingo model](https://storage.googleapis.com/deepmind-media/DeepMind.com/Blog/tackling-multiple-tasks-with-a-single-visual-language-model/flamingo.pdf), which uses a special model component called a **perceiver resampler** (an attention module that translates images into fixed-length sequences of tokens).
217
+
218
+ ![alt_text](media/image-23.png "image_tooltip")
219
+
220
+
221
+ Another paper about [Socratic Models](https://socraticmodels.github.io/) was recently published. The authors combined several large pre-trained models (a vision model, a language model, and an audio model) that are able to interface with each other using language prompts to perform new tasks.
222
+
223
+ Finally, the concept of “Foundation Models” came from the paper “[On the Opportunities and Risks of Foundation Models](https://arxiv.org/abs/2108.07258)” by researchers at Stanford Institute for Human-Centered AI. We think “Large Language Models” or “Large Neural Networks” might be more useful terms.
224
+
225
+
226
+ ## 6 - CLIP and Image Generation
227
+
228
+ Now, let's talk about some of the most exciting applications of this kind of model: in vision!
229
+
230
+ In a 2021 OpenAI paper called “[Learning Transferable Visual Models From Natural Language Supervision](https://arxiv.org/abs/2103.00020)”, **CLIP (Contrastive Language–Image Pre-training)** was introduced. In this paper, the authors encode text via Transformers, encode images via ResNets or Vision Transformers, and apply contrastive training to train the model. Contrastive training matches correct image and text pairs using cosine similarity. The code for this is tremendously simple!
231
+
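For intuition, here is a plain-Python sketch of a symmetric contrastive loss over a batch of matched image/text embeddings. It follows the general recipe (cosine similarities as logits, cross-entropy in both directions), but it is our own simplification: the fixed `temperature` value is an assumption, and there are no encoders or gradients here.

```python
import math


def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy(logits, target_index):
    """Negative log-probability of the target under a softmax over the logits."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum_exp - logits[target_index]


def clip_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; image i and text i are the correct pair."""
    images = [normalize(v) for v in image_embs]
    texts = [normalize(v) for v in text_embs]
    n = len(images)
    # logits[i][j] = scaled cosine similarity between image i and text j.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    # Each image should pick out its own text (rows) and vice versa (columns).
    loss_images = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    loss_texts = sum(cross_entropy([logits[i][j] for i in range(n)], j) for j in range(n)) / n
    return (loss_images + loss_texts) / 2


image_embs = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[1.0, 0.0], [0.0, 1.0]]
print(clip_loss(image_embs, aligned_texts))        # near zero: pairs match
print(clip_loss(image_embs, aligned_texts[::-1]))  # large: pairs are swapped
```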
232
+ ![alt_text](media/image-24.png "image_tooltip")
233
+
234
+
235
+ With this powerful trained model, you can map images and text using embeddings, even on unseen data. There are two ways of doing this. One is to use a “linear probe” by training a simple logistic regression model on top of the features CLIP outputs after performing inference. Otherwise, you can use a “zero-shot” technique that encodes all the text labels and compares them to the encoded image. Zero-shot tends to be better, but not always.
236
+
237
+ Since OpenAI CLIP was released in an open-source format, there have been many attempts to improve it, including [the OpenCLIP model](https://github.com/mlfoundations/open_clip), which actually outperforms CLIP.
238
+
239
+ To clarify, CLIP doesn’t go directly from image to text or vice versa. It uses embeddings. This embedding space, however, is super helpful for actually performing searches across modalities. This goes back to our section on vector search. There are so many cool projects that have come out of these efforts! (like [this](https://rom1504.github.io/clip-retrieval/) and [this](https://github.com/haltakov/natural-language-image-search))
240
+
241
+ To help develop mental models for these operations, consider how to actually perform **image captioning** (image -> text) and image generation (text -> image). There are two great examples of this written in [the ClipCap paper](https://arxiv.org/pdf/2111.09734.pdf). At a high level, image captioning is performed through training a separate model to mediate between a frozen CLIP, which generates a series of word embeddings, and a frozen GPT-2, which takes these word embeddings and generates texts.
242
+
243
+ The intermediate model is a Transformer model that gets better at modeling images and captions.
244
+
245
+ ![alt_text](media/image-25.png "image_tooltip")
246
+
247
+
248
+ In **image generation**, the most well-known approach is taken by [DALL-E 2 or unCLIP](https://cdn.openai.com/papers/dall-e-2.pdf). In this method, two additional components are added to a CLIP system: a prior that maps from text embeddings to image embeddings and a decoder that maps from an image embedding to an image. The prior exists because many different text captions can accurately describe the same image.
249
+
250
+ ![alt_text](media/image-26.png "image_tooltip")
251
+
252
+
253
+ In DALL-E 2’s case, they use an approach for the prior called **a diffusion model**. [Diffusion models](https://towardsdatascience.com/diffusion-models-made-easy-8414298ce4da) are trained to denoise data effectively through training on incrementally noisy data.
254
+
255
+ ![alt_text](media/image-27.png "image_tooltip")
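To make "incrementally noisy data" concrete, here is a minimal sketch of the forward (noising) half of a DDPM-style diffusion process. The linear schedule and its default values are assumptions chosen for illustration and are not claimed to be DALL-E 2's settings; the function names are hypothetical.

```python
import numpy as np


def make_alpha_bars(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal fraction kept after each noising step."""
    betas = np.linspace(beta_start, beta_end, num_steps)  # per-step noise
    return np.cumprod(1.0 - betas)


def add_noise(x0, t, alpha_bars, rng):
    """Jump straight from clean data x0 to its noisy version at step t."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return x_t, eps
```

During training, a model is shown `x_t` and `t` and asked to recover the clean signal (or, equivalently, the noise `eps`). Generation then runs the learned denoiser step by step, starting from pure noise.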
256
+
257
+
258
+ In DALL-E 2, the diffusion method is applied to the **prior** model, which trains its denoising approach on a sequence of encoded text, CLIP text embedding, the diffusion timestep, and the noised CLIP image embedding, all so it can predict the un-noised CLIP image embedding. In doing so, it helps us bridge the gap between the raw text caption given to the model, which can be infinitely complicated and “noisy”, and the CLIP image embedding space.
259
+
260
+ ![alt_text](media/image-28.png "image_tooltip")
261
+
262
+
263
+ The **decoder** helps us go from the prior’s output of an image embedding to an image. This is a much simpler approach for us to understand. We apply a U-Net structure to a diffusion training process that is able to ultimately “de-noise” the input image embedding and output an image.
264
+
265
+ ![alt_text](media/image-29.png "image_tooltip")
266
+
267
+
268
+ The results of this model are incredible! You can even generate images and merge images using CLIP embeddings. There are all kinds of funky ways of playing with the embeddings to create various image outputs.
269
+
270
+ ![alt_text](media/image-30.png "image_tooltip")
271
+
272
+
273
+ Other models of interest are Parti and StableDiffusion.
274
+
275
+ * Google published [Parti](https://parti.research.google/) soon after DALL-E 2. Parti uses a VQGAN model instead of a diffusion model, where the image is represented as a sequence of high-dimensional tokens.
276
+ * [StableDiffusion](https://stability.ai/blog/stable-diffusion-public-release) has been released publicly, so definitely [check it out](https://github.com/CompVis/latent-diffusion)! It uses a “latent diffusion” model, which diffuses the image in a low-dimensional latent space and decodes the image back into a pixel space.
277
+
278
+ ![alt_text](media/image-31.png "image_tooltip")
279
+
280
+
281
+ There has been an absolute explosion of these applications. Check out these examples on [image-to-image](https://twitter.com/DiffusionPics/status/1568219366097039361/), [video generation](https://twitter.com/jakedowns/status/1568343105212129280), and [photoshop plugins](https://www.reddit.com/r/StableDiffusion/comments/wyduk1/). The sky is the limit.
282
+
283
+ Prompting these models is interesting and can get pretty involved. Someday this may even be tool- and code-based. You can learn from other people on [Lexica](https://lexica.art/) and [promptoMANIA](https://promptomania.com/).
284
+
285
+ It’s truly a remarkable time to be involved with AI models as they scale to new heights.
documents/lecture-08.md ADDED
@@ -0,0 +1,713 @@
1
+ ---
2
+ description: Building ML-powered products and the teams who create them
3
+ ---
4
+
5
+ # Lecture 8: ML Teams and Project Management
6
+
7
+ <div align="center">
8
+ <iframe width="720" height="405" src="https://www.youtube-nocookie.com/embed/a54xH6nT4Sw?list=PL1T8fO7ArWleMMI8KPJ_5D5XSlovTW_Ur" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
9
+ </div>
10
+
11
+ Lecture by [Josh Tobin](https://twitter.com/josh_tobin_).
12
+ Notes by [James Le](https://twitter.com/le_james94) and [Vishnu Rachakonda](https://www.linkedin.com/in/vrachakonda/).<br />
13
+ Published September 26, 2022.
14
+ [Download slides](https://fsdl.me/2022-lecture-08-slides).
15
+
16
+ ## 0 - Why is this hard?
17
+
18
+ Building any product is hard:
19
+
20
+ - You have to hire great people.
21
+
22
+ - You have to manage and develop those people.
23
+
24
+ - You have to manage your team's output and make sure your vectors are
25
+ aligned.
26
+
27
+ - You have to make good long-term technical choices and manage
28
+ technical debt.
29
+
30
+ - You have to manage expectations from leadership.
31
+
32
+ - You have to define and communicate requirements with stakeholders.
33
+
34
+ Machine Learning (ML) adds complexity to that process:
35
+
36
+ - ML talent is expensive and scarce.
37
+
38
+ - ML teams have a diverse set of roles.
39
+
40
+ - Projects have unclear timelines and high uncertainty.
41
+
42
+ - The field is moving fast, and ML is the "[high-interest credit card
43
+ of technical
44
+ debt](https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf)."
45
+
46
+ - Leadership often doesn't understand ML.
47
+
48
+ - ML products fail in ways that are hard for laypeople to understand.
49
+
50
+ In this lecture, we'll talk about:
51
+
52
+ 1. ML-related **roles** and their required skills.
53
+
54
+ 2. How to **hire** ML engineers (and how to get hired).
55
+
56
+ 3. How ML teams are **organized** and fit into the broader
57
+ organization.
58
+
59
+ 4. How to **manage** an ML team and ML products.
60
+
61
+ 5. **Design** considerations for ML products.
62
+
63
+ ## 1 - Roles
64
+
65
+ ### Common Roles
66
+
67
+ Let's look at the most common ML roles and the skills they require:
68
+
69
+ - The **ML Product Manager** works with the ML team, other business
70
+ functions, the end-users, and the data owners. This person designs
71
+ docs, creates wireframes, and develops a plan to prioritize and
72
+ execute ML projects.
73
+
74
+ - The **MLOps/ML Platform Engineer** builds the infrastructure to make
75
+ models easier and more scalable to deploy. This person handles the
76
+ ML infrastructure that runs the deployed ML product using
77
+ platforms like AWS, GCP, Kafka, and other ML tooling vendors.
78
+
79
+ - The **ML Engineer** trains and deploys prediction models. This
80
+ person uses tools like TensorFlow and Docker to work with
81
+ prediction systems running on real data in production.
82
+
83
+ - The **ML Researcher** trains prediction models, often those that are
84
+ forward-looking or not production-critical. This person uses
85
+ libraries like TensorFlow and PyTorch on notebook environments to
86
+ build models and reports describing their experiments.
87
+
88
+ - The **Data Scientist** is a blanket term used to describe all of the
89
+ roles above. In some organizations, this role entails answering
90
+ business questions via analytics. This person can work with
91
+ wide-ranging tools from SQL and Excel to Pandas and Scikit-Learn.
92
+
93
+ ![](./media/image9.png)
94
+
95
+ ### Skills Required
96
+
97
+ What skills are needed for these roles? The chart below gives a visual
98
+ summary: the horizontal axis is the level of ML expertise, and the size
99
+ of the bubble is the level of communication and technical writing skill
100
+ (the bigger, the better).
101
+
102
+ ![](./media/image4.png)
103
+
104
+ - **MLOps** is primarily a software engineering role; people in it often
105
+ come from a standard software engineering pipeline.
106
+
107
+ - The **ML Engineer** requires a rare mix of ML and Software
108
+ Engineering skills. This person is either an engineer with
109
+ significant self-teaching OR a science/engineering Ph.D. who works
110
+ as a traditional software engineer after graduate school.
111
+
112
+ - The **ML Researcher** is an ML expert who usually has an MS or Ph.D.
113
+ degree in Computer Science or Statistics or finishes an industrial
114
+ fellowship program.
115
+
116
+ - The **ML Product Manager** is just like a traditional Product
117
+ Manager but with a deep knowledge of the ML development process
118
+ and mindset.
119
+
120
+ - The **Data Scientist** role draws on a wide range of backgrounds,
121
+ from undergraduate degrees to Ph.D.s.
122
+
123
+ There is an important distinction between a task ML engineer and a
124
+ platform ML engineer, coined by Shreya Shankar in [this blog
125
+ post](https://www.shreya-shankar.com/phd-year-one/):
126
+
127
+ 1. **Task ML engineers** are responsible for maintaining specific ML
128
+ pipelines. They only focus on ensuring that these ML models are
129
+ healthy and updated frequently. They are often overburdened.
130
+
131
+ 2. **Platform ML engineers** help task ML engineers automate tedious
132
+ parts of their jobs. They are called MLOps/ML Platform engineers
133
+ in our parlance.
134
+
135
+ ## 2 - Hiring
136
+
137
+ ### The AI Talent Gap
138
+
139
+ In 2018 (when we started FSDL), the AI talent gap was the main story.
140
+ So few people understood this technology that the biggest blocker for
141
+ organizations was that they couldn't find people who were good
142
+ at ML.
143
+
144
+ In 2022, the AI talent gap persists. But it tends to be less of a
145
+ blocker than it used to be because we have had four years of folks
146
+ switching careers into ML and software engineers emerging from
147
+ undergraduate with at least a couple of ML classes under their belts.
148
+
149
+ The gap now tends to be in folks who not only understand the
150
+ underlying technology but also have experience in seeing how ML fails
151
+ and how to make ML successful when it's deployed. That's the reality of
152
+ how difficult it is to hire ML folks today, especially those with
153
+ **production experience**.
154
+
155
+ ### Sourcing
156
+
157
+ Because of this shallow talent pool and the skyrocketing demand, hiring
158
+ for ML positions is pretty hard. Typical ML roles come in the following
159
+ structure:
160
+
161
+ - ML Adjacent roles: ML product manager, DevOps, Data Engineer
162
+
163
+ - Core ML Roles: ML Engineer, ML Research/ML Scientist
164
+
165
+ - Business analytics roles: Data Scientist
166
+
167
+ For ML-adjacent roles, traditional ML knowledge is less important, as
168
+ demonstrated interest, conversational understanding, and experience can
169
+ help these professionals play an impactful role on ML teams. Let's focus
170
+ on how to hire for **the core ML roles**.
171
+
172
+ ![](./media/image6.png)
173
+
174
+
175
+ While there's no perfect way to **hire ML engineers**, there's
176
+ definitely a wrong way to hire them, with extensive job descriptions
177
+ that demand only the best qualifications (seen above). Certainly, there
178
+ are many good examples of this bad practice floating around.
179
+
180
+ - Rather than this unrealistic process, consider hiring for software
181
+ engineering skills, an interest in ML, and a desire to learn. You
182
+ can always train people in the art and science of ML, especially
183
+ when they come with strong software engineering fundamentals.
184
+
185
+ - Another option is to consider adding junior talent, as many recent
186
+ grads come out with good ML knowledge nowadays.
187
+
188
+ - Finally, and most importantly, be more specific about what you need
189
+ the position and professional to do. It's impossible to find one
190
+ person that can do everything from full-fledged DevOps to
191
+ algorithm development.
192
+
193
+ To **hire ML researchers**, here are our tips:
194
+
195
+ - Evaluate the quality of publications, over the quantity, with an eye
196
+ toward the originality of the ideas, the execution, etc.
197
+
198
+ - Prioritize researchers that focus on important problems instead of
199
+ trendy problems.
200
+
201
+ - Experience outside academia is also a positive, as these researchers
202
+ may be able to transition to industry more effectively.
203
+
204
+ - Finally, keep an open mind about research talent and consider
205
+ talented people without PhDs or from adjacent fields like physics,
206
+ statistics, etc.
207
+
208
+ To find quality candidates for these roles, here are some ideas for
209
+ sourcing:
210
+
211
+ - Use standard sources like LinkedIn, recruiters, on-campus
212
+ recruiting, etc.
213
+
214
+ - Monitor arXiv and top conferences and flag the first authors of
215
+ papers you like.
216
+
217
+ - Look for good implementations of papers you like.
218
+
219
+ - Attend ML research conferences (NeurIPS, ICML, ICLR).
220
+
221
+ ![](./media/image7.png)
222
+
223
+ As you seek to recruit, stay on top of what professionals want and make
224
+ an effort to position your company accordingly. ML practitioners want to
225
+ be empowered to do great work with interesting data. Building a culture
226
+ of learning and impact can help recruit the best talent to your team.
227
+ Additionally, sell sell sell! Talent needs to know how good your team is
228
+ and how meaningful the mission can be.
229
+
230
+ ### Interviewing
231
+
232
+ As you interview candidates for ML roles, try to **validate your
233
+ hypotheses of their strengths while testing a minimum bar on weaker
234
+ aspects**. For example, ensure ML researchers can think creatively about
235
+ new ML problems while ensuring they meet a baseline for code quality.
236
+ It's essential to test ML knowledge and software engineering skills for
237
+ all industry professionals, though the relative strengths can vary.
238
+
239
+ The actual ML interview process is much less well-defined than software
240
+ engineering interviews, though it is modeled on them. Some helpful
241
+ inclusions are projects or exercises that test the ability to work with
242
+ ML-specific code, like take-home ML projects. Chip Huyen's
243
+ "[Introduction to ML Interviews
244
+ Book](https://huyenchip.com/ml-interviews-book/)" is a
245
+ great resource.
246
+
247
+ ### Finding A Job
248
+
249
+ To find an ML job, you can take a look at the following sources:
250
+
251
+ - Standard sources such as LinkedIn, recruiters, on-campus recruiting,
252
+ etc.
253
+
254
+ - ML research conferences (NeurIPS, ICLR, ICML).
255
+
256
+ - Apply directly (remember, there's a talent gap!).
257
+
258
+ Standing out for competitive roles can be tricky! Here are some tips (in
259
+ increasing order of impressiveness) that you can apply to differentiate
260
+ yourself:
261
+
262
+ 1. Exhibit ML interest (e.g., conference attendance, online course
263
+ certificates, etc.).
264
+
265
+ 2. Build software engineering skills (e.g., at a well-known software
266
+ company).
267
+
268
+ 3. Show you have a broad knowledge of ML (e.g., write blog posts
269
+ synthesizing a research area).
270
+
271
+ 4. Demonstrate ability to get ML projects done (e.g., create side
272
+ projects, re-implement papers).
273
+
274
+ 5. Prove you can think creatively in ML (e.g., win Kaggle competitions,
275
+ publish papers).
276
+
277
+ ## 3 - Organizations
278
+
279
+ ### Organization Archetypes
280
+
281
+ There is not yet a consensus on the right way to structure an ML
282
+ team. Still, a few best practices are contingent upon different
283
+ organization archetypes and their ML maturity level. First, let's see
284
+ what the different ML organization archetypes are.
285
+
286
+ **Archetype 1 - Nascent and Ad-Hoc ML**
287
+
288
+ - These are organizations where no one is doing ML, or ML is done on
289
+ an ad-hoc basis. Obviously, there is little ML expertise in-house.
290
+
291
+ - They are either small-to-medium businesses or less
292
+ technology-forward large companies in industries like education or
293
+ logistics.
294
+
295
+ - There is often low-hanging fruit for ML.
296
+
297
+ - But there is little support for ML projects, and it's challenging to
298
+ hire and retain good talent.
299
+
300
+ **Archetype 2 - ML R&D**
301
+
302
+ - These are organizations in which ML efforts are centered in the R&D
303
+ arm of the organization. They often hire ML researchers and
304
+ doctorate students with experience publishing papers.
305
+
306
+ - They are larger companies in sectors such as oil and gas,
307
+ manufacturing, or telecommunications.
308
+
309
+ - They can hire experienced researchers and work on long-term business
310
+ priorities to get big wins.
311
+
312
+ - However, it is very difficult to get quality data. This type of
313
+ research work rarely translates into actual business value, so
314
+ the amount of investment usually remains small.
315
+
316
+ **Archetype 3 - ML Embedded Into Business and Product Teams**
317
+
318
+ - These are organizations where certain product teams or business
319
+ units have ML expertise alongside their software or analytics
320
+ talent. These ML individuals report up to the team's
321
+ engineering/tech lead.
322
+
323
+ - They are either software companies or financial services companies.
324
+
325
+ - ML improvements are likely to lead to business value. Furthermore,
326
+ there is a tight feedback cycle between idea iteration and product
327
+ improvement.
328
+
329
+ - Unfortunately, it is still very hard to hire and develop top talent,
330
+ and access to data and compute resources can lag. There are also
331
+ potential conflicts between ML project cycles and engineering
332
+ management, so long-term ML projects can be hard to justify.
333
+
334
+ **Archetype 4 - Independent ML Function**
335
+
336
+ - These are organizations in which the ML division reports directly to
337
+ senior leadership. The ML Product Managers work with Researchers
338
+ and Engineers to build ML into client-facing products. They can
339
+ sometimes publish long-term research.
340
+
341
+ - They are often large financial services companies.
342
+
343
+ - Talent density allows them to hire and train top practitioners.
344
+ Senior leaders can marshal data and compute resources. This gives
345
+ the organization the ability to invest in tooling, practices, and culture
346
+ around ML development.
347
+
348
+ - A disadvantage is that model handoffs to different business lines
349
+ can be challenging since users need to buy in to the benefits of ML
350
+ and be educated on how to use the models. Also, feedback cycles can be slow.
351
+
352
+ **Archetype 5 - ML-First Organizations**
353
+
354
+ - These are organizations in which the CEO invests in ML, and there
355
+ are experts across the business focusing on quick wins. The ML
356
+ division works on challenging and long-term projects.
357
+
358
+ - They are large tech companies and ML-focused startups.
359
+
360
+ - They have the best data access (data thinking permeates the
361
+ organization), the most attractive recruiting funnel (challenging
362
+ ML problems tend to attract top talent), and the easiest
363
+ deployment procedure (product teams understand ML well enough).
364
+
365
+ - This type of organization archetype is hard to implement in practice
366
+ since it is culturally difficult to embed ML thinking everywhere.
367
+
368
+ ### Team Structure Design Choices
369
+
370
+ Depending on the above archetype that your organization resembles, you
371
+ can make the appropriate design choices, which broadly speaking follow
372
+ these three categories:
373
+
374
+ 1. **Software Engineering vs. Research**: To what extent is the ML team
375
+ responsible for building or integrating with software? How
376
+ important are Software Engineering skills on the team?
377
+
378
+ 2. **Data Ownership**: How much control does the ML team have over data
379
+ collection, warehousing, labeling, and pipelining?
380
+
381
+ 3. **Model Ownership**: Is the ML team responsible for deploying models
382
+ into production? Who maintains the deployed models?
383
+
384
+ Below are our design suggestions:
385
+
386
+ If your organization focuses on **ML R&D**:
387
+
388
+ - Research is most definitely prioritized over Software Engineering
389
+ skills. Because of this, there would potentially be a lack of
390
+ collaboration between these two groups.
391
+
392
+ - ML team has no control over the data and typically will not have
393
+ data engineers to support them.
394
+
395
+ - ML models are rarely deployed into production.
396
+
397
+ If your organization has **ML embedded into the product**:
398
+
399
+ - Software Engineering skills will be prioritized over Research
400
+ skills. Often, the researchers would need strong engineering
401
+ skills since everyone would be expected to productionize their
402
+ models.
403
+
404
+ - ML teams generally do not own data production and data management.
405
+ They will need to work with data engineers to build data
406
+ pipelines.
407
+
408
+ - ML engineers totally own the models that they deploy into
409
+ production.
410
+
411
+ If your organization has **an independent ML division**:
412
+
413
+ - Each team has a potent mix of engineering and research skills;
414
+ therefore, they work closely together within teams.
415
+
416
+ - ML team has a voice in data governance discussions, as well as a
417
+ robust data engineering function.
418
+
419
+ - ML team hands off models to users but is still responsible for
420
+ maintaining them.
421
+
422
+ If your organization is **ML-First**:
423
+
424
+ - Different teams are more or less research-oriented, but in general,
425
+ research teams collaborate closely with engineering teams.
426
+
427
+ - ML team often owns the company-wide data infrastructure.
428
+
429
+ - ML team hands the models to users, who are responsible for operating
430
+ and maintaining them.
431
+
432
+ The picture below neatly sums up these suggestions:
433
+
434
+ ![](./media/image12.png)
435
+
436
+ ## 4 - Managing
437
+
438
+ ### Managing ML Teams Is Challenging
439
+
440
+ The process of actually managing an ML team is quite challenging for
441
+ four reasons:
442
+
443
+ 1. **Engineering Estimation:** It's hard to know how easy or hard an ML
444
+ project is in advance. As you explore the data and experiment with
445
+ different models, there is enormous scope for new learnings about
446
+ the problem that materially impact the timeline. Furthermore,
447
+ knowing what methods will work is often impossible. This makes it
448
+ hard to say upfront how long or how much work may go into an ML
449
+ project.
450
+
451
+ 2. **Nonlinear Progress:** As the chart below from a [blog
452
+ post](https://medium.com/@l2k/why-are-machine-learning-projects-so-hard-to-manage-8e9b9cf49641)
453
+ by Lukas Biewald (CEO of [Weights and
454
+ Biases](https://wandb.ai/site)) shows, progress on ML
455
+ projects is unpredictable over time, even when the effort expended
456
+ grows considerably. It's very common for projects to stall for
457
+ extended periods of time.
458
+
459
+ ![](./media/image1.png)
460
+
461
+ 3. **Cultural gaps:** The cultures of engineering and research
462
+ professionals are very different. Research tends to favor novel,
463
+ creative ideas, while engineering prefers tried and true methods
464
+ that work. As a result, ML teams often experience a clash of
465
+ cultures, which can turn toxic if not appropriately managed. A
466
+ core challenge of running ML teams is addressing the cultural
467
+ barriers between ML and software engineering so that teams can
468
+ harmoniously experiment and deliver ML products.
469
+
470
+ 4. **Leadership Deficits**: It's common to see a lack of detailed
471
+ understanding of ML at senior levels of management in many
472
+ companies. As a result, expressing feasibility and setting the
473
+ right expectations for ML projects, especially high-priority ones,
474
+ can be hard.
475
+
476
+ ### How To Manage ML Teams Better
477
+
478
+ Managing ML teams is hardly a solved problem, but you can take steps to
479
+ improve the process.
480
+
481
+ **Plan probabilistically**
482
+
483
+ Many engineering projects are managed in a waterfall fashion, with the
484
+ sequential tasks defined up front clearly. Instead of forcing this
485
+ method of engineering management on difficult ML projects, try assigning
486
+ a likelihood of success to different tasks to better capture the
487
+ experimental process inherent to ML engineering. As these tasks progress
488
+ or stall, rapidly re-evaluate your task ordering to better match what is
489
+ working. Having this sense of both (1) **how likely a task is to
490
+ succeed** and (2) **how important it is** makes project planning
491
+ considerably more realistic.
492
+
493
+ ![](./media/image10.png)
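One way to make this kind of planning concrete is to score each task on both axes and re-rank whenever your estimates change. The sketch below is only an illustration: the `Task` class, the example numbers, and the choice to combine the two axes by multiplying them are all assumptions, not a method prescribed in the lecture.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    p_success: float   # estimated likelihood the task works, in [0, 1]
    importance: float  # estimated payoff if it works (arbitrary units)

    @property
    def expected_value(self) -> float:
        # Assumption: combine the two axes by simple multiplication.
        return self.p_success * self.importance


def prioritize(tasks):
    """Order tasks from highest to lowest expected value."""
    return sorted(tasks, key=lambda t: t.expected_value, reverse=True)


backlog = [
    Task("tune hyperparameters", p_success=0.7, importance=2),
    Task("new architecture", p_success=0.3, importance=8),
    Task("label more data", p_success=0.9, importance=3),
]
for task in prioritize(backlog):
    print(f"{task.name}: {task.expected_value:.1f}")
```

As tasks progress or stall, update `p_success` and re-run the ranking rather than sticking to the original ordering.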
494
+
495
+
496
+ **Have a portfolio of approaches**
497
+
498
+ Embrace multiple ideas and approaches to solve crucial research
499
+ challenges that gate production ML. Don't make your plan dependent on
500
+ one approach working!
501
+
502
+ **Measure inputs, not results**
503
+
504
+ As you work through several approaches in your portfolio, do not overly
505
+ emphasize whose ideas ultimately work as a reflection of contribution
506
+ quality. This can negatively impact team members' creativity, as they
507
+ focus more on trying to find only what they currently think could work,
508
+ rather than experimenting in a high-quality fashion (which is ultimately
509
+ what leads to ML success).
510
+
511
+ **Have researchers and engineers work together**
512
+
513
+ The collaboration between engineering and research is essential for
514
+ quality ML products to get into production. Emphasize collaboration
515
+ across the groups and professionals!
516
+
517
+ **Get quick wins**
518
+
519
+ Taking this approach makes it more likely that your ML project will
520
+ succeed in the long term. It allows you to demonstrate progress to your
521
+ leadership more effectively and clearly.
522
+
523
+ **Educate leadership on uncertainty**
524
+
525
+ This can be hard, as leadership is ultimately accountable for addressing
526
+ blind spots and understanding timeline risk. There are things you can
527
+ do, however, to help improve leadership's knowledge about ML timelines.
528
+
529
+ - Avoid building hype around narrow progress metrics material only to
530
+ the ML team (e.g., "*We improved F1 score by 0.2 and have achieved
531
+ awesome performance!*").
532
+
533
+ - Instead, be realistic, communicate risk, and emphasize real product
534
+ impact (e.g., "Our model improvements should increase the number
535
+ of conversions by 10%, though we must continue to validate its
536
+ performance on additional demographic factors.").
537
+
538
+ - Sharing resources like [this a16z primer](https://a16z.com/2016/06/10/ai-deep-learning-machines/),
539
+ [this class from Prof. Pieter
540
+ Abbeel](https://executive.berkeley.edu/programs/artificial-intelligence),
541
+ and [Google's People + AI
542
+ guidebook](https://pair.withgoogle.com/guidebook) can
543
+ increase your company leadership's awareness of ML.
544
+
545
+ ### ML PMs are well-positioned to educate the organization
546
+
547
+ There are two types of ML product managers.
548
+
549
+ 1. **Task PMs**: These are the more common form of ML PM. They are
550
+ generally specialized into a specific product area (e.g. trust and
551
+ safety) and have a strong understanding of the particular use
552
+ case.
553
+
554
+ 2. **Platform PMs**: These are a newer form of PMs. They have a broader
555
+ mandate to ensure that the ML team (generally centralized in this
556
+ context) is as high-leverage as possible. They manage workflow and priorities
557
+ for this centralized team. To support this, they tend to have a
558
+ broad understanding of ML themselves. These PMs are critical for
559
+ educating the rest of the company about ML and ensuring that teams
560
+ trust the output of models.
561
+
562
+ Both types of PMs are crucial for ML success. Platform PMs tend to have
563
+ a particularly powerful role to play in pushing an organization's
564
+ adoption of machine learning and making it successful.
565
+
566
+ ### What is "Agile" for ML?
567
+
568
+ In the ML context, there are two frameworks that play a role similar to
569
+ Agile in software development. They are shown below:
570
+
571
+ ![](./media/image2.png)
572
+
573
+
574
+ They are both structured, data-science native approaches to project
575
+ management. You can use them to provide standardization for project
576
+ stages, roles, and artifacts.
577
+
578
+ **TDSP** tends to be more structured and is a strong alternative to the
579
+ Agile methodology. **CRISP-DM** is somewhat higher level and does not
580
+ provide as structured a project management workflow. If you genuinely
581
+ have a large-scale coordination problem, you can try these frameworks,
582
+ but don't otherwise. They can slow you down since they are more oriented
583
+ around "traditional" data science and not machine learning.
584
+
585
+ ## 5 - Design
586
+
587
+ Let's talk about how to actually design machine learning products now.
588
+ The biggest challenge with designing such products often isn't
589
+ implementing them; it's **bridging the gap between users' inflated
590
+ expectations and the reality**.
591
+
592
+ Users often expect extremely sophisticated systems capable of solving
593
+ many more problems than they actually can.
594
+
595
+ ![](./media/image11.png)
596
+
597
+ In reality, machine learning systems are more like dogs that are trained
598
+ to do a special task; weird little guys with a penchant for distraction
599
+ and an inability to do much more than they are explicitly told.
600
+
601
+ ![](./media/image13.png)
602
+
603
+ All this leads to a big gap between what can be done and what users
604
+ expect!
605
+
606
+ ### The Keys to Good ML Product Design
607
+
608
+ In practice, **good ML product design bridges users' expectations and
609
+ reality**. If you can help users understand the benefits and limitations
610
+ of the model, they tend to be more satisfied. Furthermore, always have
611
+ backup plans for model failures! Over-automating systems tends to be a
612
+ recipe for unhappy users. Finally, building in feedback loops can really
613
+ increase satisfaction over time.
614
+
615
+ There are a couple of ways to **explain the benefits and limitations** of
616
+ an ML system to users.
617
+
618
+ - Focus on the problems it solves, not the fact that the system is
619
+ "AI-powered".
620
+
621
+ - If you make the system feel "human-like" (unconstrained input,
622
+ human-like responses), expect users to treat it as human-like.
623
+
624
+ - Furthermore, seek to include guardrails or prescriptive interfaces
625
+ over open-ended, human-like experiences. A good example of the
626
+ former approach is [Amazon
627
+ Alexa](https://alexa.amazon.com/), which has specific
628
+ prompts that its ML system responds to.
629
+
630
+ ![](./media/image5.png)
631
+
632
+
633
+ **Handling failures** is a key part of keeping users of ML systems happy.
634
+ There's nothing worse than a "smart" system that conks out when you do
635
+ something slightly unexpected. Having built-in solutions to solve for
636
+ automation issues is extremely important. One approach is letting users
637
+ be involved to correct improper responses. Another is to focus on the
638
+ notion of "model confidence" and only offer responses when the threshold
639
+ is met. A good example of a failure-handling approach is how Facebook
640
+ recommends photo tags for users, but doesn't go so far as to auto-assign them.
641
+
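One way to picture the "model confidence" idea is the sketch below. It is a minimal illustration under stated assumptions, not how Facebook or any real product implements tagging: the `Suggestion` type, the `suggest_tag` and `resolve_tag` helpers, and the 0.8 threshold are all hypothetical names and values chosen for this example.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical threshold; in practice it would be tuned on held-out data.
CONFIDENCE_THRESHOLD = 0.8


@dataclass
class Suggestion:
    label: str         # the model's predicted tag
    confidence: float  # the model's score for that tag, in [0, 1]


def suggest_tag(prediction: Suggestion,
                threshold: float = CONFIDENCE_THRESHOLD) -> Optional[str]:
    """Return a tag to suggest, or None to stay silent when confidence is low."""
    if prediction.confidence >= threshold:
        return prediction.label
    return None


def resolve_tag(suggestion: Optional[str],
                accepted: bool,
                correction: Optional[str] = None) -> Optional[str]:
    """Keep the user in the loop: nothing is applied without their say-so."""
    if correction is not None:
        # A user-supplied correction always wins over the model.
        return correction
    if suggestion is not None and accepted:
        return suggestion
    return None
```

The design choice here is that the model only ever proposes: a low-confidence prediction produces no suggestion at all, and even a high-confidence one is applied only after the user accepts or corrects it.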
642
+ ### Types of User Feedback
643
+
644
+ How can you collect feedback from users in a way that avoids these
645
+ issues? There are different types of user feedback, which differ in how
646
+ they help with model improvement.
647
+
648
+ ![](./media/image3.png)
649
+
650
+
651
+ Let's go across this chart.
652
+
653
+ 1. The simplest form of feedback is **indirect implicit feedback**. For
654
+ example, did the user churn from the product? That tells you
655
+ immediately how the user felt about the system without them giving
656
+ a clear signal themselves.
657
+
658
+ 2. Another form is **direct implicit feedback**, which involves the
659
+ user "taking the next step". For example, in an automated user
660
+ onboarding flow, did the user click through into ensuing steps?
661
+ This is trickier to implement, but can be useful for future
662
+ training iterations.
663
+
664
+ 3. The next type of feedback is **binary explicit feedback**, wherein
665
+ users are specifically asked (e.g. via thumbs up/down buttons) how
666
+ they feel about the model performance.
667
+
668
+ 4. You can make this more sophisticated and add **categorical explicit
669
+ feedback**, which allows users to sort their feedback into various
670
+ types.
671
+
672
+ 5. To really get a sense of how users feel, consider offering **free
673
+ text feedback**. This is tricky to use for model training and can
674
+ be involved for users, but it's very useful to highlight the
675
+ highest friction predictions.
676
+
677
+ 6. The gold standard, of course, is **model corrections**; they are
678
+ free labels!
679
+
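The six feedback types above could be logged with a structure along these lines. This is only a sketch: the `FeedbackType` enum, the `FeedbackEvent` record, and the `corrections_to_labels` helper are hypothetical names invented for illustration, and a real system would likely store more context with each event.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional, Tuple


class FeedbackType(Enum):
    INDIRECT_IMPLICIT = "indirect_implicit"        # e.g. the user churned
    DIRECT_IMPLICIT = "direct_implicit"            # e.g. the user took the next step
    BINARY_EXPLICIT = "binary_explicit"            # e.g. thumbs up/down
    CATEGORICAL_EXPLICIT = "categorical_explicit"  # e.g. a chosen feedback category
    FREE_TEXT = "free_text"                        # e.g. a written comment
    MODEL_CORRECTION = "model_correction"          # the user fixed the output


@dataclass
class FeedbackEvent:
    kind: FeedbackType
    model_input: str
    model_output: str
    # Thumbs value, category, comment, or the corrected output, depending on kind.
    payload: Optional[str] = None


def corrections_to_labels(events: List[FeedbackEvent]) -> List[Tuple[str, str]]:
    """Turn model corrections into (input, corrected_output) training pairs."""
    return [
        (e.model_input, e.payload)
        for e in events
        if e.kind is FeedbackType.MODEL_CORRECTION and e.payload is not None
    ]
```

Only the correction events convert directly into labeled examples, which is why they are described as free labels; the other types need more interpretation before they can be used for training.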
680
+ Whenever building explicit feedback into ML systems, avoid relying on
681
+ users' altruism and be clear about why they should engage in the
682
+ feedback. Instead, build positive feedback loops by allowing users to
683
+ experience the benefits of their feedback quickly.
684
+
685
+ **Great ML product experiences are designed from scratch**. ML is a very
686
+ specific technology with clear advantages and drawbacks. Design needs to
687
+ be thoughtfully executed around these products. It's especially
688
+ important to allow users to interact safely with ML products that may
689
+ fail in unexpected ways. Always try to find ways to build in feedback
690
+ loops to make the ML product better over time.
691
+
692
+ There are tons of resources that can help you get started with this
693
+ emerging field.
694
+
695
+ - [Google's People + AI
696
+ Guidebook](https://pair.withgoogle.com/guidebook)
697
+
698
+ - [Guidelines for Human-AI
699
+ Interaction](https://dl.acm.org/doi/abs/10.1145/3290605.3300233)
700
+
701
+ - [Agency Plus Automation: Designing AI into Interactive
702
+ Systems](http://idl.cs.washington.edu/files/2019-AgencyPlusAutomation-PNAS.pdf)
703
+
704
+ - [Designing Collaborative
705
+ AI](https://medium.com/@Ben_Reinhardt/designing-collaborative-ai-5c1e8dbc8810)
706
+
707
+ In conclusion, we talked through a number of adjacent considerations to
708
+ building ML systems and products. In short, you ship the team as much as
709
+ you do the code; be thoughtful about how you hire, manage, and structure
710
+ ML teams as much as ML products!
711
+
712
+ ![](./media/image8.png)
713
+
documents/lecture-08.srt ADDED
@@ -0,0 +1,416 @@
1
+ 1
2
+ 00:00:00,179 --> 00:00:40,500
3
+ hey everybody welcome back this week we're going to talk about something a little bit different than we do most weeks most weeks we talk about specific technical aspects of building machine learning powered products but this week we're going to focus on some of the organizational things that you need to do in order to work together on ml-powered products as part of an interdisciplinary team so the reality of building ml-powered products is that building any product well is really difficult you have to figure out how to hire great people you need to be able to manage those people and get the best out of them you need to make sure that your team is all working together towards a shared goal you need to make good
4
+
5
+ 2
6
+ 00:00:38,399 --> 00:01:20,280
7
+ long-term technical choices manage technical debt over time you need to make sure that you're managing expectations not just of your own team but also of leadership of your organization and you need to be able to make sure that you're working well within the confines of the requirements of the rest of the org that you're understanding those requirements well and communicating back your progress to the rest of the organization against those requirements but machine learning adds even more additional complexity to this machine learning talent tends to be very scarce and expensive to attract machine learning teams are not just a single role but today they tend to be pretty interdisciplinary which makes
8
+
9
+ 3
10
+ 00:01:18,659 --> 00:02:00,600
11
+ managing them an even bigger challenge machine learning projects often have unclear timelines and there's a high degree of uncertainty to those timelines machine learning itself is moving super fast and machine learning as we've covered before you can think of as like the high interest credit card of technical debt so keeping up with making good long-term decisions and not incurring too much technical debt is especially difficult in ml unlike traditional software ml is so new that in most organizations leadership tends not to be that well educated in it they might not understand some of the core differences between ML and other technology that you're working with machine learning products tend to fail in ways that are really hard for Lay
12
+
13
+ 4
14
+ 00:01:58,799 --> 00:02:36,360
15
+ people to understand and so that makes it very difficult to help the rest of the stakeholders in your organization understand what they could really expect from the technology that you're building and what is realistic for us to achieve so throughout the rest of this lecture we're going to kind of touch on some of these themes and cover different aspects of this problem of working together to build ml-powered products as an organization so here are the pieces that we're going to cover we're going to talk about different roles that are involved in building ml products we're going to talk about some of the unique aspects involved in hiring ml talent we're going to talk about organization of teams and how the ml team tends to
16
+
17
+ 5
18
+ 00:02:34,739 --> 00:03:16,140
19
+ fit into the rest of the org and some of the pros and cons of different ways of setting that up we'll talk about managing ml teams and ml product management and then lastly we'll talk about some of the design considerations for how to design a product that is well suited to having a good ml model that backs it so let's dive in and talk about roles the most common ml roles that you might hear of are things like ml product manager ml Ops or ml platform or ml infra teams machine learning Engineers machine learning researchers or ml scientists data scientists so there's a bunch of different roles here and one kind of obvious question is what's the difference between all these different things so let's break down the job
20
+
21
+ 6
22
+ 00:03:14,400 --> 00:03:57,900
23
+ function that each of these roles plays within the context of building the ml product starting with the ml product manager their goal is to work with the ml team the business team the users and any other stakeholders to prioritize projects and make sure that they're being executed well to meet the requirements of the rest of the organization so what they produce is things like design docs wireframes and work plans and they're using tools like jira like notion to help sort of organize the work of the rest of the team the ml Ops or ml platform team are focused on building the infrastructure needed in order to make models easier to deploy more scalable or generally reduce the workload of individual contributors
24
+
25
+ 7
26
+ 00:03:56,159 --> 00:04:38,160
27
+ who are working on different ml models the output of what they build is some infrastructure some shared tools that can be used across the ml teams in your company and they're working with tools like AWS like Kafka or other data infrastructure tools and potentially working with ML infrastructure vendors as well to sort of bring best in breed from traditional data and software tools and this new category of ml vendors that are providing like mlops tools together to create this sort of best solution for the specific problems that your company is trying to solve then we have the ml engineer the ml engineer is kind of a catch-all role and the way that I like to think of their responsibilities is they're the person who is responsible
28
+
29
+ 8
30
+ 00:04:36,000 --> 00:05:15,660
31
+ for training and deploying and maintaining the prediction model that powers the ml-powered product they're not just the person who is you know solely training the model and then handing it off to someone else but they're also responsible for deploying it and then maintaining it once it's in production and so they need to know Technologies like tensorflow for training models but also like Docker for packaging models and making sure that they run on production infrastructure the next role is the ml researcher so this is a role that exists in some organizations that the responsibility stops after the model has been trained and so oftentimes these models are either handed off to some other team to productionize or these
32
+
33
+ 9
34
+ 00:05:13,800 --> 00:05:54,660
35
+ folks are focused on building models that are not yet production critical or forward-looking maybe they're prototyping some use cases that might be useful down the line for the organization and their work product is a trained model and oftentimes it's a report or a code repo that describes what this model does how to use it and how to reproduce their results so they're working with ML training tools and also prototyping tools like jupyter notebooks to produce a version of a model that just needs to work once to sort of show that the thing that they're trying to do is possible and then lastly we get the data scientist data scientist is kind of a catch-all term for potentially any of the things above in some organizations data science is quite
36
+
37
+ 10
38
+ 00:05:53,220 --> 00:06:31,319
39
+ distinct from what we've been thinking of as a machine learning role in this class and these are folks in some organizations that are responsible for answering business questions using analytics so in some organizations a data scientist is you know the same as an ml researcher or an ml engineer and other organizations data science is a distinct function that is responsible for answering business questions using data the ml work is the responsibility of an ml team so the next thing we'll talk about is what are the different skills that you actually need to be successful in these roles we're going to plot this on a two by two on the x-axis is the amount of skill that you need in machine learning like how much ml do you
40
+
41
+ 11
42
+ 00:06:28,979 --> 00:07:07,500
43
+ really need to know on the y-axis is the software engineering skill needed and then the size of the bubble is a requirement on communication or technical writing how good do you have to be at communicating your ideas to other people so starting with ML Ops or ml platform teams this is really primarily a software engineering role and oftentimes where these folks will come into the organization is through their you know traditional software engineering or data engineering hiring pipeline or even moving over from a data engineering role in another part of the organization another common pattern for how organizations find ml Ops or ml platform Engineers is they Source them from mles at their organization it's
44
+
45
+ 12
46
+ 00:07:05,580 --> 00:07:43,740
47
+ oftentimes like an ml engineer who used to just work on one model and then got frustrated by the lack of tooling so decided to move into more of a platform role the ml engineer since this is someone who is required to understand the models deeply and also be able to productionize them this tends to be a rare mix of ml skills and software engineering skills so there's sort of two paths that I typically see for folks becoming ml Engineers oftentimes these are software Engineers who have a pretty significant amount of self-teaching or on the other hand maybe they are someone who's trained in machine learning traditionally like they have a science or engineering PhD but then they switch careers into software engineering after
48
+
49
+ 13
50
+ 00:07:41,460 --> 00:08:25,220
51
+ grad school or after undergrad and then later decided to fuse those two skill sets ml researchers these are your ml experts so this is kind of the only role on this list that I would say it's pretty typical still to see a graduate degree or another path to these roles are these industrial Fellowship programs like Google brain residency that are explicitly designed to train people without a PhD in this distinct skill of research since data science is kind of like a catch-all term for a bunch of different roles in different organizations it also admits a variety of different backgrounds and oftentimes these are undergrads who went to a data science specific program or they're science phds who are making the
52
+
53
+ 14
54
+ 00:08:23,039 --> 00:09:05,220
55
+ transition into industry and then lastly ml pms oftentimes these folks come from a traditional product management background but they do need to have a deep understanding of the specifics of the ml development process and that can come from having you know worked closely with ML teams for a long time having just a really strong independent interest in ml or oftentimes what I see is folks who are you know former data scientists or ml Engineers who make the switch into PM can be really effective at pming ml projects because they have a deep understanding of the technology one other distinction that I think is worth covering when talking about the variety of different roles in ml organizations is the distinction between a task ml
56
+
57
+ 15
58
+ 00:09:02,880 --> 00:09:44,100
59
+ engineer and a platform ml engineer this is a distinction that was coined by Shreya Shankar in a blog post that's linked below and the distinction is that some ml Engineers are really responsible for like one ml pipeline or maybe a handful of ml pipelines that they're assigned to and so they're the ones that are day in and day out responsible for making sure that this model is healthy making sure that it's being updated frequently and that any failures are sort of being accounted for these folks are often like very overburdened this can be a very sort of expansive role because they have to be training models and deploying them and understanding where they break since ml engineers are often spread so thin some ml Engineers end up
60
+
61
+ 16
62
+ 00:09:41,100 --> 00:10:22,260
63
+ taking on a role that looks more like a ml platform team or ml Ops Team where they work across teams to help ml Engineers automate tedious parts of their jobs and so in our parlance this is called an ml platform engineer or ml Ops engineer but you'll also hear this referred to as an ml engineer or a platform ml engineer so we've talked a little bit about what are some of the different roles in the process of building ml-powered products now let's talk about hiring so how to think about hiring ml Specialists and if you are an ml specialist looking for a job how to think about making yourself more attractive as a job candidate so a few different things that we'll cover here the first is the AI Talent gap which is
64
+
65
+ 17
66
+ 00:10:20,339 --> 00:10:59,519
67
+ sort of the reality of ml hiring these days and we'll talk about how to source for ML Engineers if you're hiring folks we'll talk about interviewing and then lastly we'll talk about finding a job four years ago when we started teaching full stack deep learning the AI Talent Gap was the main story in many cases for what teams found difficult about building with ML there was just so few people that understood this technology that the biggest thing blocking a lot of organizations was just they couldn't find people who are good at machine learning four years later the AI Talent Gap persists and there's still you know news stories every few months that are being written about how difficult it is for companies to find ml
68
+
69
+ 18
70
+ 00:10:57,120 --> 00:11:38,279
71
+ talent but my observation day to day in the field is that it tends to be less of a blocker than it used to be because you know we've had four years of folks switching careers into ML and four years of you know software Engineers emerging from undergrad with at least a couple of ml classes in many cases under their belts so there's more and more people now that are capable of doing ml but there's still a gap and in particular that Gap tends to be in folks that understand more than just the underlying technology but also have experience in seeing how it fails and how to make it successful when it's deployed so that's the reality of how difficult it is to hire machine learning folks today especially those who have
72
+
73
+ 19
74
+ 00:11:36,360 --> 00:12:14,459
75
+ production experience so if you are hiring ml folks how should you think about finding people if you're hiring ml product managers or ml platform or ml Ops Engineers the main skill set that you need to look for is still the sort of core underlying skill set for those roles so product management or data engineering or platform Engineering in general but it is critical to find folks who have experience at least interacting with teams that are building production ml systems because I think one sort of failure mode that I've seen relatively frequently especially for ML platform teams is if you just bring in folks with pure software engineering background a lot of times it's difficult for them to understand the user requirements well
76
+
77
+ 20
78
+ 00:12:12,660 --> 00:12:52,920
79
+ enough in order to engineer things that actually solve the user's problems users here being the task mles who are you know the ones who are going to be using the infrastructure that said we'll focus for the rest of the section mostly on these two roles ml engineer and ml scientist so there's a right and a wrong way to hire ml engineers and the wrong way oftentimes looks maybe something like this so you see a job description for the Unicorn machine learning engineer the duties for this person are they need to keep up with state of the art they need to implement new models from scratch as they come out they need a deep understanding of the underlying mathematics and ability to invent new models for new tasks as it arises they
80
+
81
+ 21
82
+ 00:12:51,480 --> 00:13:30,060
83
+ need to also be able to build tooling and infrastructure for the ml team because ml teams need tooling to do their jobs they need to be able to build data pipelines as well because without data ml is nothing they need to deploy these models and monitor them in production because without deploying models you're not actually solving a problem so in order to fulfill all these duties you need these requirements as this unicorn mle role you of course need a PhD you need at least four years of tensorflow experience four years as a software engineer you need to have publications in NeurIPS or other top ml conferences experience building large-scale distributed systems and so when you add all this up hopefully it's becoming
84
+
85
+ 22
86
+ 00:13:28,500 --> 00:14:09,720
87
+ clear why this is the wrong way to hire ml Engineers there's just not really very many people that fit this description today if any and so the implication is the right way to hire ml Engineers is to be very very specific about what you actually need from these folks and in most cases the right answer is to primarily hire for software engineering skills not ml skills you do need folks that have at least a background in ml and a desire to learn ML and you can teach people how to do ml if they have a strong interest in it they know the basics and they're really strong in the software engineering side another approach instead of hiring for software engineering skills and training people on the ml side is to go
88
+
89
+ 23
90
+ 00:14:07,380 --> 00:14:46,680
91
+ more Junior most undergrads in computer science these days graduate with ML experience and so these are folks that have traditional computer science training and some theoretical ml understanding so they have sort of the seeds of being good at both ML and software engineering but maybe not a lot of experience in either one and then the third way that you can do this more effectively is to be more specific about what you really really need for not the ml engineering function in general but for this particular role right so not every ml engineer needs to be a devops expert to be successful not every ml engineer needs to be able to implement new papers from scratch to be successful either for many of the mles that
92
+
93
+ 24
94
+ 00:14:45,240 --> 00:15:25,440
95
+ you're hiring what they really need to do is something along the lines of taking a model that is you know pretty established as something that works well pulling it off the shelf or training it using a pretty robust library and then being able to deploy that model into production so focus on hiring people who have those skills not these aspirational skills that you don't actually really need for your company next let's talk about a couple things I've found to be important for hiring ml researchers the first is a lot of folks when they're hiring ml researchers they look first at the number of publications they have in top conferences I think it's really critical to focus entirely on the quality of publications
96
+
97
+ 25
98
+ 00:15:23,160 --> 00:16:01,019
99
+ not the quantity and this unfortunately requires a little bit of judgment about what high quality research looks like but hopefully there's someone on your team that can provide that judgment it's more interesting to me to find machine learning researchers who have you know one or two publications that you think are really creative or very applicable to the field that you're working in or have really really strong promising results than to find someone who's you know published 20 papers but each of them are just sort of an incremental improvement to the state of the art if you're working in the context of a company where you're trying to build a product and you're hiring researchers then I think another really important
100
+
101
+ 26
102
+ 00:15:59,339 --> 00:16:34,079
103
+ thing to filter for is looking for researchers who have an eye for working on problems that really matter a lot of researchers maybe through no fault of their own just because of the incentives in academia focus on problems that are trendy if everyone else is publishing about reinforcement learning then they'll publish about reinforcement learning if everyone else is publishing about generative models then they'll make incremental improvements to generative models to get them a publication but what you really want to look for I think is folks that have an independent sense of what problems are important to work on because in the context of your company no one's going to be telling these folks like hey this
104
+
105
+ 27
106
+ 00:16:33,300 --> 00:17:08,400
107
+ is what everyone's going to be publishing about this year oftentimes experience outside of academia can be a good proxy for this but it's not really necessary it's just sort of one signal to look at if you already have a research team established then it's worth considering hiring talented people from adjacent fields hiring from physics or statistics or math at OpenAI they did this to really strong effect they would look for sort of folks that were really technically talented but didn't have a lot of ml expertise and they would train them in ml this works a lot better if you do have experienced researchers who can provide mentorship and guidance for folks I probably wouldn't hire like a first
108
+
109
+ 28
110
+ 00:17:06,780 --> 00:17:47,280
111
+ researcher that doesn't have ml experience and then it's also worth remembering that especially these days you really don't need a PhD to do ml research many undergrads have a lot of experience doing ml research and graduates of some of these industrial Fellowship programs like Google's or Facebook's or OpenAI's have learned the basics of how to do research regardless of whether they have a PhD so that's how to think about evaluating candidates for ML engineering or ml research roles the next thing I want to talk about is how to actually find those candidates so your standard sources like LinkedIn or recruiters or on campus recruiting all work but another thing that can be really effective if you want to go
112
+
113
+ 29
114
+ 00:17:44,280 --> 00:18:26,580
115
+ deeper is every time there's a new dump of papers on arXiv or every year at NeurIPS and other top conferences just keep an eye on what you think are the most exciting papers and flag mostly the first authors of those papers because those are the ones that tend to be doing most of the work and are generally more recruitable because they tend to be more Junior in their careers Beyond looking at papers you can also do something similar for good re-implementations of papers that you like so if you are you know looking at some hot new paper and a week later there's a re-implementation of that paper that has high quality code and hits the main results then chances are whoever wrote that implementation is probably pretty good and so they could
116
+
117
+ 30
118
+ 00:18:24,960 --> 00:19:01,919
119
+ be worth recruiting you can do a lot of this in person now that ml research conferences are back in person or you can just reach out to folks that you are interested in talking to over the Internet since there's a talent shortage in ml it's not enough just to know how to find good ml candidates and evaluate them you also need to know how to think about attracting them to your company I want to talk a little bit about from what I've seen what a lot of ml practitioners are interested in the roles they take and then talk about ways that you can make your company Stand Out along those axes so one thing a lot of ml practitioners want is to work with Cutting Edge tools and techniques to be working with latest state of the art
120
+
121
+ 31
122
+ 00:19:00,419 --> 00:19:36,780
123
+ research another thing is to build knowledge in an exciting field to like a more exciting branch of ml or application of ml working with excellent people probably pretty consistent across many technical domains but certainly true in ml working on interesting data sets this is kind of one unique thing in ml since the work that you can do is constrained in many cases by the data sets that you have access to being able to offer unique data sets can be pretty powerful probably again true in general but I've noticed for a lot of ml folks in particular it's important for them to feel like they're doing work that really matters so how do you stand out on these axes you can work on Research oriented projects even if the sort of mandate of
124
+
125
+ 32
126
+ 00:19:35,580 --> 00:20:12,780
127
+ your team is primarily to help your company doing some research work that you can publicize and that you could point to as being indicative of working on The Cutting Edge open source libraries things like that can really help attract top candidates if you want to emphasize the ability of folks to sort of build skills and knowledge in an exciting field you can build a team culture around learning so you can host reading groups in your company you can organize learning days which is something that we did at OpenAI where we would dedicate back then a day per week just to be focused on learning new things but you can do it less frequently than that professional development budgets conference budgets things like
128
+
129
+ 33
130
+ 00:20:11,340 --> 00:20:50,820
131
+ this that you can emphasize and this is probably especially valuable if your strategy is to hire more Junior folks or more software engineering oriented folks and train them up in machine learning emphasize how much they'll be able to learn about ml at your company one sort of hack to being able to hire good ml people is to have other good ml people on the team this is maybe easier said than done but one really high profile hire can help attract many many other people in the field and if you don't have the luxury of having someone high profile on your team you can help your existing team become more high profile by helping them publish blogs and papers so that other people start to know how talented your team actually is when you're attracting
132
+
133
+ 34
134
+ 00:20:48,240 --> 00:21:27,419
135
+ ml candidates you can focus on sort of emphasizing the uniqueness of your data set in recruiting materials so if you have you know the best data set for a particular subset of the legal field or the medical field emphasize how interesting that is to work with how much data you have and how unique it is that you have it and then lastly you know just like any other type of recruiting selling the mission of the company and the potential for ML to have an impact on that mission can be really effective next let's talk about ml interviews what I would recommend testing for if you are on the interviewer side of an ml interview is to try to hire for strengths and meet a minimum bar for everything else and this can help you avoid falling into the Trap
136
+
137
+ 35
138
+ 00:21:24,780 --> 00:22:01,380
139
+ of looking for unicorn mles so some things that you can test are you want to validate your hypotheses of candidate strengths so if it's a researcher you want to make sure that they can think creatively about new ml problems and one way you can do this is to probe how thoughtful they were about previous projects if they're Engineers if they're mles then you want to make sure that they're great generalist software Engineers since that's sort of the core skill set in ml engineering and then you want to make sure they meet a minimum bar on weaker areas so for researchers I would advocate for only hiring researchers in Industry contexts who have at least the very basics in place about software engineering knowledge and
140
+
141
+ 36
142
+ 00:21:59,220 --> 00:22:33,299
143
+ the ability to write like decent code if not you know really high quality production ready code because in context of working with a team other people are going to need to use their code and it's not something that everyone learns how to do when they're in grad school for ML for software Engineers you want to make sure that they at least meet a minimum bar on machine learning knowledge and this is really testing for like are they passionate about this field that they have put in the requisite effort to learn the basics of ml that's a good indication that they're going to learn ml quickly on the job if you're hiring them mostly for their software engineering skills so what do ml interviews actually consist of so this
144
+
145
+ 37
146
+ 00:22:31,320 --> 00:23:10,380
147
+ is today much less well defined than your software engineering interviews some common types of Assessments that I've seen are your normal sort of background and culture fit interviews whiteboard coding interviews similar to what you'd see in software engineering pair coding like in software engineering but some more ml specific ones include pair debugging where you and an interviewer will sit down and run some ml code and try to find Hey where's the bug in this code oftentimes this is ml specific code and the goal is to test for how well is this person able to find bugs in ml code since bugs tend to be where we spend most of our time in machine learning math puzzles are often common especially involving things like linear algebra
148
+
149
+ 38
150
+ 00:23:08,340 --> 00:23:46,080
151
+ take-home projects other types of Assessments include applied ml questions so typically this will have the flavor of hey here's a problem that we're trying to solve with ML let's talk through the sort of high level pieces of how we'd solve it what type of algorithm we'd use what type of system we need to build to support it another Common Assessment is probing the past projects that you've listed on your resume or listed as part of the interview process asking you about things you tried what worked what didn't work and trying to assess what role you played in that project and how thoroughly you thought through the different alternative paths that you could have considered and then lastly ml Theory questions are also pretty common
152
+
153
+ 39
154
+ 00:23:43,919 --> 00:24:20,880
155
+ in these interview type assessments that's sort of the universe of things that you might consider interviewing for if you're trying to hire ml folks or that you might expect to find on an ml interview if you are on the other side and trying to interview for one of these jobs and the last thing I'll say on interviews is there's a great book from Chip Huyen the introduction to machine learning interviews book which is available for free online which is especially useful I think if you're preparing to interview for machine learning roles speaking of which what else should you be doing if your goal is to find a new job in machine learning the first question I typically hear is like where should I even look for ML jobs
156
+
157
+ 40
158
+ 00:24:19,080 --> 00:24:51,059
159
+ your standard sources like LinkedIn and recruiters all work ml research conferences can also be a fantastic place just go up and talk to the folks that are standing around the booths at those conferences they tend to be you know looking for candidates and you can also just apply directly and this is sort of something that people tell you not to do for most roles but remember there's a talent Gap in machine learning so this can actually be more effective than you might think when you're applying what's the best way to think about how to stand out for these roles so I think like sort of a baseline thing is for many companies they really want to see that you're expressing some sort of interest in ml you've been
160
+
161
+ 41
162
+ 00:24:49,260 --> 00:25:29,460
163
+ attending conferences you've been taking online courses you've been doing something to sort of get your foot in the door for getting into the field better than that is being able to demonstrate that you have some software engineering skills again for many ml organizations hiring for software engineering is in many ways more important than hiring for ML skills if you can show that you have a broad knowledge of ml so writing blog posts that synthesize a particular research area or articulating a particular algorithm in a way that is new or creative or compelling can be a great way to stand out but even better than that is demonstrating an ability to you know ship ml projects and the best way to do this I think if you are not
164
+
165
+ 42
166
+ 00:25:27,600 --> 00:26:04,980
167
+ working in ml full-time right now is through side projects these can be ideas of whatever you want to work on they can be paper re-implementations they can be your project for this course and then probably if you really want to stand out maybe the most impressive thing that you can do is to prove that you can think creatively in ml right think Beyond just reproducing things that other people have done but be able to you know win kaggle competitions or publish papers and so this is definitely not necessary to get a job in ml but this will sort of put your resume at the top of the stack so we've talked about some of the different roles that are involved in building ml products and how to think about hiring for those roles or being
168
+
169
+ 43
170
+ 00:26:03,419 --> 00:26:43,320
171
+ hired for those roles the next thing that we're going to talk about is how machine learning teams fit into the context of the rest of the organization since we're still in the relatively early days of adopting this technology there's no real consensus yet in terms of the best way to structure an ml team but what we'll cover today is a taxonomy of some of the best practices for different maturity levels of organizations and how they think about structuring their ml teams and so we'll think about this as scaling a mountain from least mature ml team to most mature so the bottom of the mountain is the nascent or ad hoc ml archetype so what this looks like is you know your company has just started thinking about ml no
172
+
173
+ 44
174
+ 00:26:41,940 --> 00:27:20,460
175
+ one's really doing it yet or maybe there's a little of it being done on an ad hoc basis by the analytics team or one of the product teams and most small or medium businesses are at most in this category but some of the less technology-forward larger organizations still fall in this category as well so the great thing about being at this stage is that there's a ton of low hanging fruit often for ML to come in and help solve but the disadvantage if you're going to go in and work in an organization at this stage is that there's often little support available for ML projects you probably won't have any infrastructure that you can rely on and it can be difficult to hire and retain good talent plus leadership in the company may not really be bought
176
+
177
+ 45
178
+ 00:27:18,419 --> 00:27:59,940
179
+ into how useful ml could be so that's some of the things to think about if you're going to go take a role in one of these organizations once the company has decided hey this ml thing is something exciting something that we should invest in typically they'll move up to an ml R&D stage so what this looks like is they'll have a specific team or specific like subset of their R&D organization that's focused on machine learning they'll typically hire researchers or phds and these folks will be focused on building prototypes internally or potentially doing external facing research so some of the larger oil and gas companies manufacturing companies telecom companies were in this stage even just a few years ago although
180
+
181
+ 46
182
+ 00:27:58,260 --> 00:28:36,000
183
+ they've in many cases moved on from it now if you're going to go work in one of these organizations one of the big advantages is you can get away with being less experienced on the research side since the ml team isn't really going to be on the hook today for any sort of meaningful business outcomes another big Advantage is that these teams can work on long-term business priorities and they can focus on trying to get to what would be really big wins for the organization but the disadvantage to be aware of if you're thinking about joining a team at this stage or building a team at this stage is that oftentimes since the ml team is sort of siloed off into an R&D part of the organization or a separate team from
184
+
185
+ 47
186
+ 00:28:34,080 --> 00:29:11,580
187
+ the different product initiatives it can be difficult for them to get the data that they need to solve the problems that they need to solve it's just not a priority in many cases for other parts of the business to give them the data and then probably the biggest disadvantage of this stage is that you know it doesn't usually work it doesn't usually translate to business value for the organization and so oftentimes ml teams kind of get stuck at this stage where they don't invest very much in ml and ml is kind of siloed and so they don't see strong results and they can't really justify doubling down the next evolution of ml organizations oftentimes is embedding machine learning directly into business and product teams so what
188
+
189
+ 48
190
+ 00:29:09,900 --> 00:29:52,140
191
+ this looks like is you'll have some product teams within the organization that have a handful of ml people side by side with their software or analytics teams and these ml teams will report up into the sort of engineering or Tech organizations directly instead of being in their own sort of reporting arm a lot of tech companies when they start adopting ml sort of pretty quickly get to this category because they're pretty agile software organizations and pretty Tech forward organizations anyway and a lot of financial services companies tend towards this model as well the big sort of overwhelming advantage of this organizational model is that when these ml teams ship stuff successfully it almost always is able to translate
192
+
193
+ 49
194
+ 00:29:50,159 --> 00:30:27,419
195
+ pretty directly to business value since the people that are doing ml sit side by side with the folks that are you know building the product or building the feature that the ml is going to be part of and this gives them a really tight feedback cycle between new ideas that they have for how to make the ml better how to make the product better with ml into actual results as part of the products the disadvantages of building ml this way are oftentimes it can be hard to hire and develop really really great ml people because great ml people often want to work with other great ml people it can also be difficult to get these ml folks access to the resources that they need to be really successful so that's the infrastructure they need
196
+
197
+ 50
198
+ 00:30:25,620 --> 00:31:03,360
199
+ the data they need or the compute they need because they don't have sort of a central team that reports high up in the organization to ask for help and one other disadvantage of this model is that oftentimes this is where you see conflicts between the way that ml projects are run the sort of iterative process that is high risk and the way that the software teams that these ml folks are a part of are organized sometimes you'll see conflict between folks getting frustrated with the ml folks on their team for not shipping quickly or not being able to sort of commit to a timeline that they promised the next ml organization archetype we'll cover is the independent machine learning function what this looks like is you'll
200
+
201
+ 51
202
+ 00:31:01,799 --> 00:31:42,960
203
+ have a machine learning division of the company that reports up to senior leadership so they report to the CEO or the CTO or something along those lines this is what distinguishes it from the ml R&D archetype where the ml team is often you know reporting to someone more Junior in the organization often functioning as sort of a smaller bet this is the organization making a big bet on investing in machine learning so oftentimes this is also the archetype where you'll start to see ML PMs or platform ML PMs that work with researchers and ml engineers and some of these other roles in order to deliver like a cross-functional product the big advantage of this model is access to resources so since you have a centralized ml team you can often hire
204
+
205
+ 52
206
+ 00:31:40,679 --> 00:32:18,960
207
+ really really talented people and build a talent density in the organization and you can also train people more easily since you have more ml people sitting in a room together or in a zoom room together in some cases since you report to senior leadership you can also often like marshal more resources in terms of data from the rest of the organization or budget for compute than you can in other archetypes and it makes it a lot easier when you have a centralized organization to invest in things like tooling and infrastructure and culture and best practices around developing ml in your organization the big disadvantage of this model is that it leads to handoffs and that can add friction to the process that you as an
208
+
209
+ 53
210
+ 00:32:16,980 --> 00:32:54,720
211
+ ml team need to run in order to actually get your models into production and the last ml organization archetype the end state the goal if you're trying to build ml the right way in your organization is to be an ml first organization so what this looks like is you have buy-in up and down the organization that ml is something that you as a company want to invest in you have an ml division that works on the most challenging long-term projects and invests in sort of centralized data and centralized infrastructure but you also have expertise in ml in every line of business that focuses on quick wins and working with the central ml division to sort of translate the ideas they have the implementations they make into
212
+
213
+ 54
214
+ 00:32:52,320 --> 00:33:30,480
215
+ actual outcomes for the products that the company is building so you'll see this in the biggest tech companies like the Googles and Facebooks of the world as well as startups that were founded with ML as a core guiding principle for how they want to build the products and these days more and more you're starting to see other tech companies who began investing in ml four or five years ago start to become closer to this archetype there's mostly advantages to this model you have great access to data it's relatively easy to recruit and most importantly it's probably easiest in this archetype out of all of them to get value out of ml because the product teams that you're working with understand machine learning and really
216
+
217
+ 55
218
+ 00:33:28,799 --> 00:34:07,740
219
+ the only disadvantage of this model is that it's difficult and expensive and it takes a long time for organizations that weren't born with this mindset to adopt it because you have to recruit a lot of really good ml people and you need to culturally embed ml thinking into your organization the next thing that we'll talk about is some of the design choices you need to make if you're building an ml team we'll talk about how those depend on the archetype of the organization that you fit into the first question is software engineering versus research so to what extent is the ml team responsible for building software versus just training models the second question is data ownership so is the ml team also responsible for creating and publishing data
220
+
221
+ 56
222
+ 00:34:06,240 --> 00:34:43,379
223
+ or do they just consume that from other teams and the last thing is model ownership the ml team are they the ones that are going to productionize models or is that the responsibility of some other team in the ml R&D archetype typically you'll prioritize research over software engineering skills and the ml team won't really have any ownership over the data or oftentimes even the skill sets to build data pipelines themselves and similarly they won't be responsible for deploying models either and in particular models will rarely make it into production so that won't really be a huge issue embedded ml teams typically they'll prioritize software engineering skills over research skills and all researchers if they even have
224
+
225
+ 57
226
+ 00:34:42,060 --> 00:35:21,359
227
+ researchers will need to have strong software engineering skills because everyone's expected to deploy the ml team still generally doesn't own data because they are working with data Engineers from the rest of the organization to build data pipelines but since the expectation in these types of organizations is that everyone deploys typically ml Engineers will own maintenance of the models that they deploy in the ml function archetype typically the requirement will be that you'll need to have a team that has a strong mix of software engineering research and data skills so the team size here starts to become larger a minimum might be something like one data engineer one ml engineer potentially a platform engineer or a devops engineer
228
+
229
+ 58
230
+ 00:35:18,839 --> 00:35:57,420
231
+ and potentially a PM but these teams are often working with a bunch of other functions so they can in many cases get much larger than that and you know in many cases in these organizations you'll have both software engineers and researchers working closely together within the context of a single team usually at this stage ml teams will start to have a voice in data governance discussions and they'll probably also have some strong internal data engineering functions as well and then since the ml team is centralized at this stage they'll hand off models to a user but in many cases they'll still be responsible for maintaining them although that line is blurry in a lot of organizations that run this model finally in ml first organizations
232
+
233
+ 59
234
+ 00:35:55,320 --> 00:36:32,700
235
+ there's no real standardization around whether teams are research oriented or not but research teams do tend to work pretty closely with software engineering teams to get things done in some cases the ml team is actually the one that owns company-wide data infrastructure because ml is such a central bet for the company that it makes sense for the ml team to make some of the sort of main decisions about how data will be organized then finally if the ml team is the one that actually built the model they'll typically hand it off to a user who since they have the basic ml skills and knowledge to do this they'll actually be the one to maintain the model and here's all this on one slide if you want to look at it all together
236
+
237
+ 60
238
+ 00:36:31,140 --> 00:37:06,720
239
+ all right so we've talked about machine learning teams and organizations and how these come together and the next thing that we're going to talk about is team management and product management for machine learning so the first thing to know about product management and team management for ML is that it tends to be really challenging there's a few reasons for this the first is that it's hard to tell in advance how easy or hard something is going to be so this is an example from a blog post by Lukas Biewald where they ran a kaggle competition and in the first week of that kaggle competition they saw a huge increase in the accuracy of the best performing model they went from 35% to 70% accuracy within one week and they were thinking
240
+
241
+ 61
242
+ 00:37:04,859 --> 00:37:42,359
243
+ this is great like we're gonna hit 95% accuracy and this contest is going to be a huge success but then if you zoom out and look at the entire course of the project over three months it turns out that most of that accuracy gain came in the first week and the improvements thereafter were just marginal and that's not because of a lack of effort the number of participating teams was still growing really rapidly over the course of that time so the upshot is it's really hard to tell in advance how easy or hard something is in ml and looking at signals like how quickly are we able to make progress on this project can be very misleading a related challenge is that progress on ML projects tends to be very non-linear so
244
+
245
+ 62
246
+ 00:37:40,260 --> 00:38:17,460
247
+ it's very common for projects to stall for weeks or longer because the ideas that you're trying just don't work or because you hit some sort of unforeseen snag with not having the right data or something like that that causes you to really get stuck and on top of that in the earliest stages of doing the project it can be very difficult to plan and to tell how long the project is going to take because it's unclear what approach will actually work for training a model that's good enough to solve the problem and the upshot of all this is that estimating the timeline for a project when you're in the project planning phase can be very difficult in other words production ml is still somewhere between research and Engineering another
248
+
249
+ 63
250
+ 00:38:14,700 --> 00:38:54,540
251
+ challenge for managing ml teams is that there's cultural gaps that exist between research and Engineering organizations these folks tend to come from different backgrounds they have different training they have different values goals and norms for example oftentimes you know stereotypically researchers care about novelty and about how exciting the approach is that they took to solve a problem whereas you know again stereotypically oftentimes software Engineers care about did we make the thing work and in more toxic cultures these two sides often can clash and even if they don't Clash directly they might not really value each other as much as they should because both sides are often necessary to build the thing that you
252
+
253
+ 64
254
+ 00:38:53,099 --> 00:39:28,740
255
+ want to build to make matters worse when you're managing a team as part of an organization you're not just responsible for making sure the team does what they're supposed to do but you'll also have to manage up to help leadership understand your progress and what the Outlook is for the thing that you're building since ml is such a new technology many leaders and organizations even in good technology organizations don't really understand it so next I want to talk about some of the ways that you can manage machine learning projects better and the first approach that I'll talk about is doing project planning probabilistically so oftentimes when we think about project planning for software projects we think
256
+
257
+ 65
258
+ 00:39:26,520 --> 00:40:07,740
259
+ about it as sort of a waterfall right where you have a set of tasks and you have a set of time estimates for those tasks and a set of dependencies for those tasks and you can plan these out one after another so if task G depends on tasks D and F then task G will happen once those are done if task D depends on C which depends on task A you'll start task D after A and C are done etc but in machine learning this can lead to frustration and badly estimated timelines because each of these projects has a higher chance of failure than it does in a typical software project what we ended up doing at OpenAI was doing project planning probabilistically so rather than assuming that like a particular task is going to take a
260
+
261
+ 66
262
+ 00:40:06,060 --> 00:40:45,540
263
+ certain amount of time instead we assign probabilities to the likelihood of completion of each of these tasks and potentially pursue alternate tasks that allow us to unlock the same dependency in parallel so in this example you know maybe task B and task C are both alternative approaches to unlocking task D so we might do both of them at the same time and so if we realize all of a sudden that task C is not going to work and task B is taking longer than we expected then we can adjust the timeline appropriately and then we can start planning the next wave of tasks once we know how we're going to solve the prerequisite tasks that we needed a corollary of doing machine learning project planning probabilistically is
264
+
265
+ 67
266
+ 00:40:43,560 --> 00:41:20,700
267
+ that you shouldn't have any critical path projects that are fundamentally research research projects have a very very high rate of failure rather than just saying like this is how we're going to solve this problem instead you should be willing to try a variety of approaches to solve that problem that doesn't necessarily mean that you need to do them all in parallel but many good machine learning organizations do so one way to think about this is you know if you know that you need to build like a model that's never been built in your organization before you can have like a friendly competition of ideas if you have a culture that's built around working together as a team to get to the right answer and not just rewarding the
268
+
269
+ 68
270
+ 00:41:19,140 --> 00:41:55,079
271
+ one person who solves the problem correctly another corollary to this idea that many machine learning ideas can and will fail is that when you're doing Performance Management it's important not to get hung up on just who is the person whose ideas worked in the long term it's important for people to do things that work like over the course of you know many many months or years if nothing that you try works then that's maybe an indication that you're not trying the right things you're not executing effectively but on any given project like on a timeline of weeks or a quarter then the success measure that you should be looking at is how well you executed on the project not whether the project happened to be one
272
+
273
+ 69
274
+ 00:41:53,579 --> 00:42:29,820
275
+ of the ones that worked one failure mode that I've seen in organizations that hire both researchers and Engineers is implicitly valuing one side more than the other so thinking engineering is more important than research which can lead to things getting stuck on the ml side because the ml side is not getting the resources or attention that they deserve or thinking that research is more important than engineering which can lead to creating ml innovations that are not actually useful so oftentimes the way around this is to have engineers and researchers work very closely together in fact like sometimes uncomfortably close together like working together on the same code base for the same project and understanding
276
+
277
+ 70
278
+ 00:42:28,320 --> 00:43:05,700
279
+ that these folks bring different skill sets to the table another key to success I've seen is trying to get quick wins so rather than trying to build a perfect model and then deploy it trying to ship something quickly to demonstrate that this thing can work and then iterate on it over time and then the last thing that you need to do if you're in a position of being the product manager or the engineering manager for an ml team is to put more emphasis than you might think that you need on educating the rest of your organization on how ml Works diving into that a bit more if your organization is relatively new to adopting ml I'd be willing to bet that a lot of people in the organization don't understand one or more of these things
280
+
281
+ 71
282
+ 00:43:03,359 --> 00:43:43,260
283
+ for us as like ml practitioners it can be really natural to think about where ml can and can't be used but for a lot of technologists or Business Leaders that are new to ml the uses of ml that are practical can be kind of counter-intuitive and so they might have ideas for ML projects that aren't feasible and they might miss ideas for ML projects that are pretty easy that don't fit their mental model of what ml can do another common point of friction in dealing with the rest of the organization is convincing the rest of the organization that the ml that you built actually works Business Leaders and folks from product teams typically the same metrics that convince us as ml practitioners that this model is useful
284
+
285
+ 72
286
+ 00:43:41,460 --> 00:44:18,780
287
+ won't convince them like just looking at an F1 score or an accuracy score doesn't really tell them what they need to know about whether this model is really solving the task that it needs to solve for the business outcome that they're aiming for and one particular way that this presents itself pretty frequently is in Business Leaders and other stakeholders not really sort of wrapping their heads around the fact that ml is inherently probabilistic and that means that it will fail in production and so a lot of times where ml efforts get hung up is in the same stakeholders potentially that champion the project to begin with not really being able to get comfortable with the fact that once the model is out in the world it's you know
288
+
289
+ 73
290
+ 00:44:17,339 --> 00:44:57,780
291
+ the users are going to start to see failures that it makes in almost all cases and the last common failure mode in working with the rest of the organization is the rest of the organization treating ml projects like other software projects and not realizing that they need to be managed differently than other software projects too and one particular way that I've seen this become a problem is when leadership gets frustrated at the ml team because they're not able to really accurately convey how long projects are going to take to complete so educating leadership and other stakeholders on the probabilistic nature of ml projects is important to maintaining your sanity as an ml team if you want to share some resources with your execs that they can
292
+
293
+ 74
294
+ 00:44:55,500 --> 00:45:37,079
295
+ use to learn more about how these projects play out in the practice of real organizations I would recommend Pieter Abbeel's AI strategy class from the business school at UC Berkeley and Google's People + AI Guidebook which we'll be referring to a lot more in the rest of the lecture as well the last thing I'll say on educating the rest of the organization on ml is that ML PMs I think play like one of the most critical roles in doing this effectively to illustrate this I'm going to make an analogy to the two types of ml engineers and describe two prototypical types of ML PMs that I see in different organizations so on one hand we have our task ML PMs these are like a PM that's responsible for a specific product or
296
+
297
+ 75
298
+ 00:45:35,280 --> 00:46:16,680
299
+ specific product feature that heavily uses ml these folks will need to have a pretty specialized knowledge of ML and how it applies to the particular domain that they're working on so for example they might be the PM for the trust and safety product for your team or particular recommendation product for your team and these are probably the more common type of mlpms in Industry today but an emerging type of mlpm is the platform mlpm platform mlpms tend to start to make sense when you have a centralized ml team and that centralized ml team needs to play some role in educating the rest of the organization in terms of like what are productive uses of ml in all the products that the organization is building because these
300
+
301
+ 76
302
+ 00:46:14,640 --> 00:46:55,260
303
+ folks are responsible for managing the workflow in and out of the ml team so helping filter out projects that aren't really high priority for the business or aren't good uses of ml helping proactively find projects that might have a big impact on the product or the company by spending a lot of time with PMS from the rest of the organization and communicating those priorities to the ml team and outward to the rest of the organization this requires a broad knowledge of ml because a lot of what this role entails is trying to really understand where ml can and should and shouldn't be applied in the context of all the things the organization is doing and one of the other critical roles that platform ML
304
+
305
+ 77
306
+ 00:46:52,619 --> 00:47:30,720
307
+ PMs could play is spreading ml knowledge and culture throughout the rest of the organization not just going to PMs and business stakeholders from the other product functions and Gathering requirements from them but also helping educate them on what's possible to do with ML and helping them come up with ideas to use ml in their areas of responsibility that they find exciting so that they can over time really start to build their own intuition about what types of things they should be considering ml to be used for and then another really critical role that these platform ML PMs can play is mitigating the risks of you know we've built a model but we can't convince the rest of the organization to actually use it by being really crisp
308
+
309
+ 78
310
+ 00:47:28,920 --> 00:48:07,440
311
+ about what are the requirements that we actually need this model to fulfill and then proactively communicating with the other folks that need to be bought in about the model's performance to help them understand all the things that they'll need to understand about it to really trust its performance so platform mlpms are I think a newer trend in ml organizations but I think one that can have a big impact on the success of ml organizations when you're in this phase starting to build a centralized ml team or transition from a centralized ml team to becoming an ml first organization one question I get a lot about ml product management is what's the equivalent of agile or any of these established development
312
+
313
+ 79
314
+ 00:48:05,280 --> 00:48:46,980
315
+ methodologies for software in ml is there something like that that we can just take off the shelf and apply and deliver successful ml products and the answer is there's a couple of emerging ml project management methodologies the first is CRISP-DM which is actually an older methodology but it was originally focused on data mining and has been subsequently applied to data science and ML and the second is the Team Data Science Process TDSP from Microsoft what these two things have in common is that they describe the stages of ml projects as sort of a loop where you start by trying to understand the problem that you're trying to solve acquiring data building a model evaluating it and then finally deploying it so the main reason
316
+
317
+ 80
318
+ 00:48:45,180 --> 00:49:24,960
319
+ to use one of these methodologies would be if you really want standardization for what you call the different stages of the project life cycle if you're choosing between these two TDSP tends to be a little bit more structured it provides like sort of a more granular list of roles responsibilities templates that you can use to actually execute on this process CRISP-DM is a bit higher level so if you need an actual like granular project management framework then I would start by trying TDSP but I'll say more generally it's reasonable to use these if you truly have a large scale coordination problem if you're trying to get a large ml team working together successfully for the first time but I would otherwise recommend skipping these
320
+
321
+ 81
322
+ 00:49:23,280 --> 00:50:03,660
323
+ because they're more focused on traditional data mining or data science processes and they'll probably slow you down so I would sort of exercise caution before implementing one of these methodologies in full the last thing I want to talk about is designing products that lend themselves well to being powered by machine learning so I think the fundamental challenge in doing this is a gap between what users expect when they're handed an AI powered product and what they actually get and so what users tend to think when they're given an AI powered product is you know their mental model is often human intelligence but better and in silicon so they think it um has this knowledge of the world that it has achieved by
324
+
325
+ 82
326
+ 00:50:02,099 --> 00:50:41,099
327
+ reading the whole internet oftentimes they think that this product knows me better than I know myself because it has all the data about me from every interaction I've ever had with software they think that AI powered products learn from their mistakes and that they generalize to new problems right because it's intelligence it's able to learn from new examples to solve new tasks but I think a better mental model for what you actually get with an ml powered product is a dog that you train to solve a puzzle right so it's amazing that it can solve the puzzle and it's able to solve surprisingly hard puzzles but at the end of the day it's just a dog solving a puzzle and in particular dogs are weird little guys right they
328
+
329
+ 83
330
+ 00:50:39,000 --> 00:51:18,180
331
+ tend to fail in strange and unexpected ways that you know we as people with like human intelligence might not expect they also get distracted easily right like if you take them outside they might not be able to solve the same problem that they're able to solve inside they don't generalize outside of a narrow domain the stereotype is that you can't teach an old dog new tricks and in ml it's often hard to adapt general knowledge to new tasks or new contexts dogs are great at learning tricks but they can't do it if you don't give them treats and similarly machine learning systems don't tend to learn well without feedback or rewards in place to help understand where they're performing well and where they're not
332
+
333
+ 84
334
+ 00:51:16,020 --> 00:51:54,240
335
+ performing well and lastly both dogs learning tricks and machine learning systems might misbehave if you leave them unattended the implication is that there's a big gap between users' mental model for machine learning products and what they actually get from machine learning products so the upshot is that the goal of good ml product design is to bridge the user's expectation with reality and there's a few components to that the first is helping users understand what they're actually getting from the model and also its limitations the second is that since failures are inevitable we need to be able to handle those failures gracefully which means not over relying on automation and being able to fall back in many cases
336
+
337
+ 85
338
+ 00:51:52,260 --> 00:52:37,140
339
+ to a human in the loop and then the final goal of ml product design is to build in feedback loops that help us use data from our users to actually improve the system one of the best practices for ML product design is explaining the benefits and limitations of the system to users one way that you can do that is since users tend to have misconceptions about what AI can and can't do focus on what problem the product is actually solving for the user not on the fact that it's AI powered and similarly the more open-ended and human feeling you make the product experience like allowing users to enter any information that they want to or ask questions in whatever natural language that they want to the more they're going to treat it as
340
+
341
+ 86
342
+ 00:52:34,980 --> 00:53:15,240
343
+ human-like and expose some of the failure modes that the system still has so one example of this was when Amazon Alexa was first released one of the sort of controversial decisions that they made was they limited it to a very specific set of prompts that you could say to it rather than having it be an open-ended language or dialogue system and that allowed them to really focus on training users to interact with the system in a way that it was likely to be able to understand and then finally the reality is that your model has limitations and so you should explain those limitations to users and consider actually just baking those limitations into the model as guardrails so not letting your users provide input to your
344
+
345
+ 87
346
+ 00:53:13,680 --> 00:53:53,880
347
+ model that you know the model is not going to perform well on so that could be as simple as you know if your NLP system was designed to perform well on English text then detecting if users input text in some other language and you know either warning them or not allowing them to input text in a language where your model is not going to perform well the next best practice for ML product design is to not over rely on automation and instead try to design where possible for a human in the loop automation is great but failed automation can be worse than no automation at all so it's worth thinking about even if you know what the right answer is for your users how can you add low friction ways to let users confirm the model's
348
+
349
+ 88
350
+ 00:53:52,140 --> 00:54:28,920
351
+ predictions so that they don't have a terrible experience when the model does something wrong and they have no way to fix it one example of this was back when Facebook had an auto tagging feature of you know recognizing your face in pictures and suggesting who the person was they didn't just assign the tag to the face even though they almost always knew exactly who that person was because it'd be a really bad experience if all of a sudden you were tagged in some picture of someone else instead they just added like a simple yes no that lets you confirm that they in fact got the prediction that this is your face correctly in order to mitigate the effect of when the model inevitably does make some bad predictions there's a
352
+
353
+ 89
354
+ 00:54:27,720 --> 00:55:06,300
355
+ couple of patterns that can help there the first is it's a really good idea to always bake in some way of letting users take control of the system like in a self-driving car to be able to grab the wheel and steer the car back on track if it makes a mistake and another pattern for mitigating the cost of bad predictions is looking at how confident the model is in its response and maybe being prudent about only showing responses to users that are pretty high confidence potentially falling back to a rules-based system or just telling the user that you don't have a good answer to that question the third best practice for ML product design is building in feedback loops with your users so let's talk about some of the different types
356
+
357
+ 90
358
+ 00:55:04,859 --> 00:55:42,000
359
+ of feedback that you might collect from your users on the x-axis is how easy it is to use the feedback that you get in order to actually directly make your model better on the y-axis is how much friction does it add to your users to collect this feedback so roughly speaking you could think about like above this line on the middle of the chart is implicit feedback that you collect from your users without needing to change their behavior and on the right side of the chart are signals that you can train on directly without needing to have some human intervention the type of feedback that introduces the least friction to your user is just collecting indirect implicit feedback on how well the prediction is working for
360
+
361
+ 91
362
+ 00:55:40,079 --> 00:56:17,520
363
+ them so these are signals about user behavior that tend to be a proxy for model performance like did the user churn or not these tend to be super easy to collect because they're often instrumented in your product already and they're really useful because they correspond to important outcomes for our products the challenge in using these is that it's often very difficult to tell whether the model is the cause because these are high level sort of business outcomes that may depend on many other things other than just your model's prediction so to get more directly useful signals from your users you can consider collecting direct implicit feedback where you collect signals from the products that measure how useful
364
+
365
+ 92
366
+ 00:56:15,240 --> 00:56:51,240
367
+ this prediction is to the user directly rather than indirectly for example if you're giving the user a recommendation you can measure whether they clicked on the recommendation or if you're suggesting an email for them to send did they send that email or did they copy the suggestion so they can use it in some other application oftentimes these take the form of did the user take the next step in whatever process that they're running did they take the prediction you gave them and use it downstream for whatever tasks they're trying to do the great thing about this type of feedback is that you can often train on it directly because it gives you a signal about you know which predictions the model made that were actually good
368
+
369
+ 93
370
+ 00:56:48,900 --> 00:57:27,119
371
+ at solving the task for the user but the challenge is that not every setup of your product lends itself to collecting this type of feedback so you may need to redesign your products in order to collect feedback like this next we'll move on to explicit types of user feedback explicit feedback is where you ask your user directly to provide feedback on the model's performance and the lowest friction way to do this for users tends to be to give them some sort of binary feedback mechanism which can be like a thumbs up or thumbs down button in your product this is pretty easy for users because it just requires them to like click one button and it can be a decent training signal there's some research on using signals like this in
372
+
373
+ 94
374
+ 00:57:24,660 --> 00:58:03,960
375
+ order to guide the learning process of models to be more aligned with users' preferences if you want a little bit more signal than just was this prediction good or bad you can also ask users to help you categorize the feedback that they're giving they could for example like flag certain predictions as incorrect or offensive or irrelevant or not useful to me you can even set this up as a second step in the process after binary feedback so users will still give you binary feedback even if they don't want to spend the time to categorize that feedback and these signals can be really useful for debugging but it's difficult to set things up in such a way that you can train on them directly another way you can get more granular feedback on the model's
376
+
377
+ 95
378
+ 00:58:02,220 --> 00:58:38,040
379
+ predictions is to have like some sort of free text input where users can tell you what they thought about a prediction this often manifests itself in support tickets or support requests for your model this requires a lot of work on the part of your users and it can be very difficult to use as a model developer because you have to parse through this like unstructured feedback about your model's predictions yet it tends to be quite useful sometimes in practice because since it's high friction to actually provide this kind of feedback the feedback that users do provide can be very high signal it can highlight in some cases like the highest friction predictions since users are willing to put in the time to complain about them
380
+
381
+ 96
382
+ 00:58:36,180 --> 00:59:19,200
383
+ and then finally the gold standard for user feedback if it's possible to do in the context of your products and your user experience is to have users correct the predictions that your model actually makes so if you can get users to label stuff for you directly then that's great then you're in a really good spot here and so one way to think about like where this can actually be feasible is if the thing that you're making a prediction for is useful to the user downstream within the same product experience that you're building not is this useful for them to copy and use in a different app but is it useful for them to use within my app so one example of this is in a product called Gradescope which Sergey built there is a model that
384
+
385
+ 97
386
+ 00:59:16,020 --> 00:59:59,700
387
+ when students submit their exams it tries to match the handwritten name on the exam with the name of the student in the student registry now if the model doesn't really know who that student is if it's low confidence or if it gets the prediction wrong then the instructor can go in and re-categorize that to be the correct name that's really useful to them because they need to have the exam categorized to the correct student anyway but it's also a very direct supervisory signal for the model so it's the best of both worlds whenever you're thinking about building explicit feedback into your products it's always worth keeping in mind that you know users are not always as altruistic as we might hope that they would be and so you
388
+
389
+ 98
390
+ 00:59:57,720 --> 01:00:35,460
391
+ should also think about like how is it going to be worthwhile for users to actually spend the time to give us feedback on this the sort of most foolproof way of doing this is as we described before to gather feedback as part of an existing user workflow but if that's not possible if the goal of users providing the feedback is to make the model better then one way you can encourage them to do that is to make it explicit how the feedback will make their user experience better and generally speaking like the more explicit you can be here and the shorter the time interval is between when they give the feedback and when they actually see the product get better the more of a sort of positive feedback loop this
392
+
393
+ 99
394
+ 01:00:33,900 --> 01:01:14,040
395
+ creates the more likely it is that they're actually going to do it a good example here is to acknowledge user feedback and adjust automatically so if your user provided you feedback saying hey I really like running up hills then a sort of good response to that feedback might be great here's another hill that you can run up in 1.2 kilometers they see the results of that feedback immediately and it's very clear how it's being used to make the product experience better less good is the example to the right of that where the response to the feedback just says thank you for your feedback because I as a user when I give that feedback there's no way for me to know whether that feedback is actually making the product
396
+
397
+ 100
398
+ 01:01:12,000 --> 01:01:50,520
399
+ experience better so it discourages me from giving more feedback in the future the main takeaway on product design for machine learning is that great ml powered products and product experiences are not just you know take an existing product that works well and bolt ml on top of it they're actually designed from scratch with machine learning and the particularities of machine learning in mind and some reasons for that include that unlike what your users might think machine learning is not superhuman intelligence encoded in silicon and so your product experience needs to help users understand that in the context of the particular problem that you are solving for them it also needs to help them interact safely with
400
+
401
+ 101
402
+ 01:01:48,420 --> 01:02:25,079
403
+ this model that has failure modes via human in the loop and guard rails around the experience with interacting with that model and finally great ml products are powered by great feedback loops right because the perfect version of the model doesn't exist and certainly it doesn't exist in the first version of the model that you deployed and so one important thing to think about when you're designing your product is how can you help your users make the product experience better by collecting the right feedback from them this is a pretty young and underexplored topic and so here's a bunch of resources that I would recommend checking out if you want to learn more about this many of the examples that we used in the previous
404
+
405
+ 102
406
+ 01:02:23,579 --> 01:02:59,880
407
+ slides are pulled from these resources and in particular the resource from Google in the top bullet point is really good if you want to understand the basics of this field so to wrap up this lecture we talked about a bunch of different topics related to how to build machine learning products as a team and the first is machine learning roles and the sort of takeaway here is that there's many different skills involved in production machine learning production ml is inherently interdisciplinary so there's an opportunity for lots of different skill sets to help contribute when you're building machine learning teams since there's a scarcity of talent especially talent that is good at both software engineering and machine learning it's
408
+
409
+ 103
410
+ 01:02:58,140 --> 01:03:32,339
411
+ important to be specific about what you really need for these roles but paradoxically as an outsider it can be difficult to break into the field and the sort of main recommendation that we had for how to get around that is by using projects to build awareness of your thinking about machine learning the next thing that we talk about is how machine learning teams fit into the broader organization we covered a bunch of different archetypes for how that can work and we looked at how machine learning teams are becoming more Standalone and more interdisciplinary in how they function next we talk about managing ml teams and managing ml products managing ml teams is hard and there's no Silver Bullet here but one
412
+
413
+ 104
414
+ 01:03:30,900 --> 01:04:01,400
415
+ sort of concrete thing that we looked at is probabilistic project planning as a way to help alleviate some of the challenges of understanding how long it's going to take to finish machine learning projects and then finally we talked about product design in the context of machine learning and the main takeaway there is that today's machine learning systems are not AGI right they're limited in many ways and so it's important to make sure that your users understand that and that you can use the interaction that you build with your users to help mitigate those limitations so that's all for today and we'll see you next week
416
+
documents/lecture-09.md ADDED
@@ -0,0 +1,825 @@
1
+ ---
2
+ description: Building ML for good while building good ML
3
+ ---
4
+
5
+ # Lecture 9: Ethics
6
+
7
+ <div align="center">
8
+ <iframe width="720" height="405" src="https://www.youtube-nocookie.com/embed/7FQpbYTqjAA?list=PL1T8fO7ArWleMMI8KPJ_5D5XSlovTW_Ur" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
9
+ </div>
10
+
11
+ Lecture by [Charles Frye](https://twitter.com/charles_irl).
12
+ Notes by [James Le](https://twitter.com/le_james94) and [Vishnu Rachakonda](https://www.linkedin.com/in/vrachakonda/).<br />
13
+ Published October 03, 2022.
14
+ [Download slides](https://fsdl.me/2022-lecture-09-slides).
15
+
16
+ In this final lecture of FSDL 2022, we'll talk about ethics. After going
17
+ through the context of what we mean by ethics, we'll go through three
18
+ different areas where ethics come up:
19
+
20
+ 1. **Tech Ethics**: ethics that anybody who works in the tech industry
21
+ broadly needs to think about.
22
+
23
+ 2. **ML Ethics**: what ethics has specifically meant for the ML
24
+ industry.
25
+
26
+ 3. **AI Ethics**: what ethics might mean in the future where true AGI
27
+ exists.
28
+
29
+ ## 1 - Overview and Context
30
+
31
+ All ethics lectures are wrong, but some are useful. They are more useful
32
+ if we admit and state what our assumptions or biases are. We'll also
33
+ talk about three general themes that come up often when ethical concerns
34
+ are raised in tech/ML: alignment, trade-offs, and humility.
35
+
36
+ ![](./media/image17.png)
37
+
38
+ In this lecture, we'll approach ethics on the basis of **concrete
39
+ cases** - specific instances where people have raised concerns. We'll
40
+ talk about **cases where people have taken actions that have led to
41
+ claims and counter-claims of ethical or unethical behavior** - such as
42
+ the use of automated weapons, the use of ML systems to make decisions
43
+ like sentencing and bail, and the use of ML algorithms to generate art.
44
+ In each case when criticism has been raised, part of that criticism has
45
+ been that the technology is unethical.
46
+
47
+ Approaching ethics in this way allows us to answer the question of "What
48
+ is ethics?" by way of Ludwig Wittgenstein's quote: "*The meaning of a
49
+ word is its use in the language*." We'll focus on times when people have
50
+ used the word "ethics" to describe what they like or dislike about a
51
+ specific technology.
52
+
53
+ If you want to try it out for yourself, you should check out the game
54
+ "[Something Something Soup
55
+ Something](https://soup.gua-le-ni.com/)." In this browser
56
+ game, you are presented with a bunch of dishes and have to decide
57
+ whether they are soup or not soup, as well as whether they can be served
58
+ to somebody who ordered soup. By playing a game like this, you'll
59
+ discover (1) how difficult it is to come up with a concrete definition
60
+ of soup and (2) how poorly your working definition of soup fits with any
61
+ given soup theory.
62
+
63
+ Because of this case-based approach, we won't be talking about ethical
64
+ schools or "trolley" problems. Rather than considering [these
65
+ hypothetical
66
+ scenarios](https://www.currentaffairs.org/2017/11/the-trolley-problem-will-tell-you-nothing-useful-about-morality),
67
+ we'll talk about concrete and specific examples from the past decade of
68
+ work in our field and adjacent fields.
69
+
70
+ ![](./media/image19.png)
71
+
72
+ If you want another point of view that emphasizes the trolley problems,
73
+ you should check out [Sergey's lecture from the last edition of the
74
+ course from
75
+ 2021](https://fullstackdeeplearning.com/spring2021/lecture-9/).
76
+ It presented similar ideas from a different perspective and came to the
77
+ same conclusions on some points and different conclusions on others.
78
+
79
+ A useful theme from that lecture that we should all have in mind when we
80
+ ponder ethical dilemmas is "What Is Water?" - which comes from [a
81
+ famous commencement speech by David Foster
82
+ Wallace](https://www.youtube.com/watch?v=PhhC_N6Bm_s). If
83
+ we aren't thoughtful and paying attention, things that are very
84
+ important can become background, assumptions, and invisible to us.
85
+
86
+ The approach of **relying on prominent cases risks replicating social
87
+ biases**. Some ethical claims are amplified and travel more because
88
+ people (who are involved) have more resources and are better connected.
89
+ Using these forms of case-based reasoning (where you explain your
90
+ beliefs in concrete detail) can **hide the principles that are actually
91
+ in operation**, making them disappear like water.
92
+
93
+ But in the end, **so much of ethics is deeply personal** that we can't
94
+ expect to have a perfect approach. We can just do the best we can and
95
+ hopefully become better every day.
96
+
97
+ ## 2 - Themes
98
+
99
+ We'll see three themes repeatedly coming up throughout this lecture:
100
+
101
+ 1. **Alignment**: a conflict between what we want and what we get.
102
+
103
+ 2. **Trade-Offs**: a conflict between what we want and what others
104
+ want.
105
+
106
+ 3. **Humility**: a response when we don't know what we want or how to
107
+ get it.
108
+
109
+ ### Alignment
110
+
111
+ The problem of **alignment** (where what we want and what we get differ)
112
+ comes up over and over again. A primary driver of this is called the
113
+ **proxy problem** - in which we often optimize or maximize some proxies
114
+ for the thing that we really care about. If the alignment (or loosely
115
+ the correlation between that proxy and the thing we care about) is poor
116
+ enough, then by trying to maximize that proxy, we can end up hurting the
117
+ thing we originally cared about.
118
+
119
+ ![](./media/image16.png)
120
+
121
+ There was [a recent
122
+ paper](https://arxiv.org/abs/2102.03896) that did a
123
+ mathematical analysis of this idea. You can see these kinds of proxy
124
+ problems everywhere once you look for them.
125
+
126
+ - On the top right, we have a train and validation loss chart from one
127
+ of the training runs for the FSDL text recognizer. The thing we
128
+ can optimize is the training loss. That's what we can use to
129
+ calculate gradients and improve the parameters of our network. But
130
+ the thing we really care about is the performance of the network
131
+ on data points that it has not seen (like the validation set, the
132
+ test set, or data in production). If we optimize our training loss
133
+ too much, we can actually cause our validation loss to go up.
134
+
135
+ - Similarly, there was [an interesting
136
+ paper](https://openreview.net/forum?id=qrGKGZZvH0)
137
+ suggesting that increasing your accuracy on classification tasks
138
+ can actually result in a decrease in the utility of your
139
+ embeddings in downstream tasks.
140
+
141
+ - You can find these proxy problems outside of ML as well. [This
142
+ thread](https://skeptics.stackexchange.com/questions/22375/did-a-soviet-nail-factory-produce-useless-nails-to-improve-metrics)
143
+ reveals an example where a factory that was making chemical
144
+ machines chose not to start producing a machine that was cheaper
145
+ and better, because the factory's output was measured in weight.
146
+ So the thing that the planners
147
+ actually cared about, economic efficiency and output, was not
148
+ optimized because it was too difficult to measure.
149
+
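The first bullet can be sketched in a few lines of Python. This is only an illustration: the loss values are made up (they are not taken from the FSDL text recognizer run) and `best_epoch` is a hypothetical helper. The training loss plays the role of the proxy we can optimize, and the validation loss stands in for what we actually care about.

```python
# Hypothetical per-epoch losses, chosen only to illustrate the pattern.
train_loss = [2.30, 1.10, 0.62, 0.35, 0.18, 0.09, 0.04]
val_loss = [2.35, 1.25, 0.90, 0.81, 0.86, 0.97, 1.12]


def best_epoch(losses):
    """Return the index of the epoch with the lowest loss."""
    return min(range(len(losses)), key=lambda epoch: losses[epoch])


proxy_choice = best_epoch(train_loss)   # what optimization keeps pushing on
target_choice = best_epoch(val_loss)    # closer to what we actually care about

print(f"lowest training loss at epoch {proxy_choice}")
print(f"lowest validation loss at epoch {target_choice}")

# How much validation performance is lost by trusting the proxy alone.
gap = val_loss[proxy_choice] - val_loss[target_choice]
print(f"validation loss given up by following the proxy: {gap:.2f}")
```

With these numbers the two criteria disagree: the training loss keeps falling to the last epoch while the validation loss bottoms out earlier, which is the usual argument for checkpointing on a held-out metric rather than on the training objective.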
150
+ One reason why these kinds of proxy problems arise so frequently is due
151
+ to issues of information. **The information that we are able to measure
152
+ is not the information that we want**. At a higher level, we often don't
153
+ know what it is that we truly need. We may want the validation loss,
154
+ but what we need is the loss in production or really the value our users
155
+ will derive from this model.
156
+
157
+ ### Trade-Offs
158
+
159
+ Even when we know what we want or what we need, we are likely to run
160
+ into the second problem - **the tradeoff between stakeholders**. It is
161
+ sometimes said that the need to negotiate tradeoffs is one of the
162
+ reasons why engineers do not like thinking about some of these problems
163
+ around ethics. That's not quite right because we do accept tradeoffs as
164
+ a key component of engineering.
165
+
166
+ - In [this O'Reilly book on the fundamentals of software
167
+ architecture](https://www.oreilly.com/library/view/fundamentals-of-software/9781492043447/),
168
+ the first thing they state at the beginning is that **everything
169
+ in software architecture is a tradeoff.**
170
+
171
+ - [This satirical O'Reilly
172
+ book](https://www.reddit.com/r/orlybooks/comments/50meb5/it_depends/)
173
+ says that every programming question has the answer: "It depends."
174
+
175
+ ![](./media/image20.png)
176
+
177
+
178
+ The famous chart above compares the different convolutional networks on
179
+ the basis of their accuracy and the number of operations to run them.
180
+ Thinking about these tradeoffs between speed and correctness is exactly
181
+ the thing we have to do all the time in our job as engineers.
182
+
183
+ We can select the **Pareto Front** for the metrics we care about. One way
184
+ to remember what a Pareto front is comes from [this definition of a data scientist
185
+ from Josh
186
+ Wills](https://twitter.com/josh_wills/status/198093512149958656?lang=en):
187
+ "Person who is better at statistics than any software engineer and
188
+ better at software engineering than any statistician." The Pareto Front
189
+ in the chart above includes the models that are more accurate than those
190
+ with fewer FLOPs and use fewer FLOPs than those that are more accurate.
191
+
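As a minimal sketch of the idea, the snippet below filters a list of models down to its Pareto front over accuracy (higher is better) and operation count (lower is better). The model names and numbers are hypothetical and are not read off the chart above; `Model`, `dominates`, and `pareto_front` are illustrative names, not part of any library.

```python
from typing import List, NamedTuple


class Model(NamedTuple):
    name: str
    accuracy: float  # higher is better
    gflops: float    # lower is better


def dominates(a: Model, b: Model) -> bool:
    """True if `a` is at least as good as `b` on both metrics and strictly better on one."""
    at_least_as_good = a.accuracy >= b.accuracy and a.gflops <= b.gflops
    strictly_better = a.accuracy > b.accuracy or a.gflops < b.gflops
    return at_least_as_good and strictly_better


def pareto_front(models: List[Model]) -> List[Model]:
    """Keep every model that no other model dominates."""
    return [m for m in models if not any(dominates(other, m) for other in models)]


# Hypothetical numbers, for illustration only.
candidates = [
    Model("small", accuracy=60.0, gflops=1.0),
    Model("medium", accuracy=70.0, gflops=4.0),
    Model("wasteful", accuracy=65.0, gflops=8.0),  # dominated by "medium"
    Model("large", accuracy=78.0, gflops=12.0),
]

front = pareto_front(candidates)
print([m.name for m in front])
```

Picking a point on the front is still a judgment call. The code only removes options that are worse on every axis we chose to measure.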
192
+ A reason why engineers may dislike thinking about these problems is that
193
+ **it's hard to identify and quantify these tradeoffs**. These are indeed
194
+ proxy problems. Even further, once measured, where on that front do we
195
+ fall? As engineers, we may develop expertise in knowing whether we want
196
+ high accuracy or low latency, but we are not as comfortable deciding how
197
+ many current orphans we want to trade for what amount of future health.
198
+ This raises questions both in terms of measurement and decision-making
199
+ that are outside of our expertise.
200
+
201
+ ### Humility
202
+
203
+ The appropriate response is **humility** because most engineers do not
204
+ explicitly train in these skills. Many engineers and managers in tech,
205
+ in fact, constitutionally prefer optimizing single metrics that are not
206
+ proxies. Therefore, when encountering a different kind of problem, it's
207
+ important to bring a humble mindset, ask for help from experts, and
208
+ recognize that the help you get might not immediately look like what
209
+ you are used to.
210
+
211
+ Additionally, when intervening due to an ethical concern, it's important
212
+ to remember this humility. It's easy to think that when you are on the
213
+ good side, this humility is not necessary. But even trying to be helpful
214
+ is a delicate and dangerous undertaking. We want to make sure that as we
215
+ resolve ethical concerns, we come up with solutions that are not just
216
+ parts of the problem.
217
+
218
+ ### User Orientation Undergirds Each Theme
219
+
220
+ We can resolve all of these via **user orientation**.
221
+
222
+ 1. By getting feedback from users, we maintain **alignment** between
223
+ our system and our users.
224
+
225
+ 2. When making **tradeoffs**, we should resolve them in consultation
226
+ with users.
227
+
228
+ 3. **Humility** means we actually listen to our users because we
229
+ recognize we don't have the answers to all the questions.
230
+
231
+ ## 3 - Tech Ethics
232
+
233
+ The tech industry can't afford to ignore ethics as public trust in tech
234
+ declines. We need to learn from other nearby industries that have done a
235
+ better job on professional ethics. We'll also touch on some contemporary
236
+ topics.
237
+
238
+ ### Tech Industry's Ethical Crisis
239
+
240
+ Throughout the past decade, the tech industry has been plagued by
241
+ scandal - whether that's how tech companies interface with national
242
+ governments at the largest scale or how tech systems are being used or
243
+ manipulated by people who create disinformation or fake social media
244
+ accounts that hack the YouTube recommendation system.
245
+
246
+ As a result, distrust in tech companies has risen markedly in the last
247
+ ten years. [This Public Affairs Pulse
248
+ survey](https://pac.org/public-affairs-pulse-survey-2021)
249
+ shows that in 2013, the tech industry was one of the industries with
250
+ less distrust on average. By 2021, it was rubbing elbows with
251
+ famously more distrusted industries such as energy and pharmaceuticals.
252
+
253
+ ![](./media/image10.png)
254
+
255
+ Politicians care quite a bit about public opinion polls. In the last few
256
+ years, the fraction of people who believe that large tech companies
257
+ should be more regulated has gone up a substantial amount. [Comparing
258
+ it to 10 years ago, it's astronomically
259
+ higher](https://news.gallup.com/poll/329666/views-big-tech-worsen-public-wants-regulation.aspx).
260
+ So there will be a substantial impact on the tech industry due to this
261
+ loss of public trust.
262
+
263
+ We can learn from nearby fields: from the culture of professional ethics
264
+ in engineering in Canada (symbolized by [the Iron
265
+ Ring](https://en.wikipedia.org/wiki/Iron_Ring)) to ethical
266
+ standards for human subjects research ([Nuremberg
267
+ Code](https://en.wikipedia.org/wiki/Nuremberg_Code), [1973
268
+ National Research
269
+ Act](https://en.wikipedia.org/wiki/National_Research_Act)).
270
+ We are at the point where we need a professional code of ethics for
271
+ software. Hopefully, many codes of ethics developed in different
272
+ communities can compete with each other and merge into something that
273
+ most of us can agree on. That can be incorporated into our education for
274
+ new members of our field.
275
+
276
+ Let's talk about two particular ethical concerns that arise in tech in
277
+ general: carbon emissions and dark/user-hostile design patterns.
278
+
279
+ ### Tracking Carbon Emissions
280
+
281
+ Because carbon emissions scale with cost, you only need to worry about
282
+ them when the costs of what you are working on are very large. Then you
283
+ won't be alone in making these decisions and can take the time
284
+ to make these choices deliberately and thoughtfully.
285
+
286
+ Anthropogenic climate change from carbon emissions raises ethical
287
+ concerns - tradeoffs between the present and future generations. The
288
+ other view is that this is an issue that arises from a classic alignment
289
+ problem: many organizations are trying to maximize their profit, which
290
+ is based on prices for goods that don't include externalities (such as
291
+ environmental damage caused by carbon emissions, leading to increased
292
+ temperatures and climate change).
293
+
294
+ ![](./media/image8.png)
295
+
296
+ The primary dimension along which we have to worry about carbon
297
+ emissions is in **compute jobs that require power**. That power can
298
+ result in carbon emissions. [This
299
+ paper](https://aclanthology.org/P19-1355/) walks through
300
+ how much carbon dioxide was emitted using typical US-based cloud
301
+ infrastructure.
302
+
303
+ - The top headline shows that training a large Transformer model with
304
+ neural architecture search produces as much carbon dioxide as five
305
+ cars create during their lifetimes.
306
+
307
+ - It's important to remember that power is not free. On US-based cloud
308
+ infrastructure, \$10 of cloud spend is roughly equal, in emissions, to \$1 of air
309
+ travel. That's based on something like the numbers
310
+ in the chart for air travel across the US from New York to
311
+ San Francisco.
312
+
313
+ - Just changing cloud regions can actually reduce your emissions quite
314
+ a bit. There's [a factor of
315
+ 50x](https://www.youtube.com/watch?v=ftWlj4FBHTg)
316
+ between regions with the most and least carbon-intensive power
317
+ generation.
318
+
319
+ The interest in this problem has led to new tools.
320
+ [Codecarbon.io](https://codecarbon.io/) allows you to
321
+ track power consumption and reduce carbon emissions from your computing.
322
+ [ML CO2 Impact](https://mlco2.github.io/impact/) is
323
+ oriented directly towards machine learning.
324
+
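The arithmetic behind this kind of estimate is roughly energy used times the carbon intensity of the grid supplying it. The sketch below is a back-of-the-envelope version, not the method used by either tool above. The power draw, data center overhead (PUE), and grid intensities are all assumed, illustrative values, with the two regions chosen to differ by the 50x factor mentioned earlier.

```python
def estimate_emissions_kg(
    power_draw_kw: float,
    hours: float,
    pue: float,
    grid_intensity_kg_per_kwh: float,
) -> float:
    """Estimate CO2 emissions as energy used times grid carbon intensity.

    `pue` (power usage effectiveness) scales hardware energy up to
    account for data center overhead such as cooling.
    """
    energy_kwh = power_draw_kw * hours * pue
    return energy_kwh * grid_intensity_kg_per_kwh


# All values below are assumptions for illustration, not measurements.
job = {"power_draw_kw": 2.4, "hours": 100.0, "pue": 1.5}
high_carbon_region = 0.8    # kg CO2 per kWh (hypothetical)
low_carbon_region = 0.016   # kg CO2 per kWh (hypothetical, 50x lower)

high = estimate_emissions_kg(grid_intensity_kg_per_kwh=high_carbon_region, **job)
low = estimate_emissions_kg(grid_intensity_kg_per_kwh=low_carbon_region, **job)
print(f"high-carbon region: {high:.1f} kg, low-carbon region: {low:.1f} kg")
```

Because the estimate is linear in grid intensity, moving the same job to a cleaner region cuts emissions by the ratio of the two intensities without touching the model or the code.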
325
+ ### Deceptive Design and Dark Patterns
326
+
327
+ The other ethical concern in tech is **deceptive design**. An
328
+ unfortunate amount of deception is tolerated in some areas of software.
329
+ As seen below, on the left is a nearly complete history of the way
330
+ Google displays ads in its search engine results. It started off very
331
+ clearly colored and separated out with bright colors from the rest of
332
+ the results. Then about ten years ago, that colored background was
333
+ removed and replaced with a tiny little colored snippet that said "Ad."
334
+ Now, as of 2020, that small bit is no longer even colored. It is just
335
+ bolded. This makes it difficult for users to know which content is being
336
+ served to them because somebody paid for it (versus content served up
337
+ organically).
338
+
339
+ ![](./media/image15.png)
340
+
341
+ A number of **dark patterns** of deceptive design have emerged over the
342
+ last ten years. You can read about them on the website called
343
+ [deceptive.design](https://www.deceptive.design/). There's
344
+ also a Twitter account called
345
+ [\@darkpatterns](https://twitter.com/darkpatterns) that
346
+ shares examples found in the wild.
347
+
348
+ A practice in the tech industry that's on very shaky ethical/legal
349
+ ground is **growth hacking**. This entails a set of techniques for
350
+ achieving rapid growth in user base or revenue for a product and has all
351
+ the connotations you might expect from the name - with examples
352
+ including LinkedIn and Hotmail.
353
+
354
+ ![](./media/image14.png)
355
+
356
+ **ML can actually make this problem worse if we optimize short-term
357
+ metrics**. These growth hacks and deceptive designs can often drive user
358
+ and revenue growth in the short term but worsen user experience and draw
359
+ down on goodwill towards the brand in a way that can erode the long-term
360
+ value of customers. When we incorporate ML into the design of our
361
+ products with A/B testing, we have to watch out to make sure that the
362
+ metrics that we are optimizing do not encourage this kind of deception.
363
+
364
+ These arise inside another alignment problem. One broadly-accepted
365
+ justification for the private ownership of the means of production is
366
+ that private enterprise delivers broad social value aligned by price
367
+ signals and market focus. But these private enterprises optimize metrics
368
+ that are, at best, a proxy for social value. There's the possibility of
369
+ an alignment problem where **companies pursuing and maximizing their
370
+ market capitalization can lead to net negative production of value**. If
371
+ you spend time at the intersection of funding, leadership, and
372
+ technology, you will encounter it.
373
+
374
+ ![](./media/image12.png)
375
+
376
+
377
+ In the short term, you can **push for longer-term thinking within your
378
+ organization** to allow for better alignment between metrics and goals
379
+ and between goals and utility. You can also learn to recognize
380
+ user-hostile designs and **advocate for user-centered design instead**.
381
+
382
+ To wrap up this section on tech ethics:
383
+
384
+ 1. The tech industry should learn from other disciplines if it wants to
385
+ avoid a trust crisis.
386
+
387
+ 2. We can start by educating ourselves about common deceptive or
388
+ user-hostile practices in our industry.
389
+
390
+ ## 4 - ML Ethics
391
+
392
+ The ethical concerns raised about ML have gone beyond just the ethical
393
+ questions about other kinds of technology. We'll talk about common
394
+ ethical questions in ML and lessons learned from Medical ML.
395
+
396
+ ### Why Not Just Tech Ethics?
397
+
398
+ ML touches human lives more intimately than other technologies. Many ML
399
+ methods, especially deep neural networks, make human-legible data into
400
+ computer-legible data. Humans are more sensitive to errors and have more
401
+ opinions about visual and text data than they do about the type of data
402
+ manipulated by computers. As a result, there are more stakeholders with
403
+ more concerns that need to be traded off in ML applications.
404
+
405
+ Broadly speaking, ML involves being wrong pretty much all the time. Our
406
+ models are statistical and include "randomness." Randomness is almost
407
+ always an admission of ignorance. As we admit a certain degree of
408
+ ignorance in our models, our models will be wrong and misunderstand
409
+ situations that they are put into. It can be upsetting and even harmful
410
+ to be misunderstood by our models.
411
+
412
+ Against this backdrop of greater interest or higher stakes, a number of
413
+ common types of ethical concerns have coalesced in the last couple of
414
+ years. There are somewhat established camps of answers to these
415
+ questions, so you should at least know where you stand on the four core
416
+ questions:
417
+
418
+ 1. Is the model "fair"?
419
+
420
+ 2. Is the system accountable?
421
+
422
+ 3. Who owns the data?
423
+
424
+ 4. Should the system be built at all?
425
+
426
+ ### Common Ethical Questions in ML
427
+
428
+ #### Is The Model "Fair"?
429
+
430
+ The classic case on this comes from criminal justice with [the COMPAS
431
+ system](https://en.wikipedia.org/wiki/COMPAS_(software))
432
+ for predicting whether a defendant will be arrested again before trial.
433
+ If they are arrested again, that suggests they committed a crime during
434
+ that time. This assesses a certain degree of risk for additional harm
435
+ while the justice system decides what to do about a previous arrest and
436
+ potential crime.
437
+
438
+ The operationalization here was a 10-point re-arrest probability based
439
+ on past data about this person, and they set a goal from the very
440
+ beginning to be less biased than human judges. They operationalized that
441
+ by calibrating these arrest probabilities across subgroups. Racial bias
442
+ is a primary concern in the US criminal justice system, so they took
443
+ care to make sure that these probabilities of re-arrest were calibrated
444
+ for all racial groups.
445
+
446
+ ![](./media/image2.png)
447
+
448
+
449
+ The system was deployed and used all around the US. It's proprietary and
450
+ difficult to analyze. But using the Freedom of Information Act and
451
+ coalescing together a bunch of records, [people at ProPublica were able
452
+ to run their own analysis of this
453
+ algorithm](https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing).
454
+ They determined that the model was not more or less wrong for one racial
455
+ group or another. However, it tended to have more false positives for Black
456
+ defendants and more false negatives for White defendants. So despite the
457
+ creators of COMPAS taking into account bias from the beginning, they
458
+ still ended up with an algorithm with this undesirable property of being
459
+ more likely to falsely accuse Black defendants than White defendants.
460
+
461
+ It turns out, after some quick algebra, that some form of
462
+ race-based bias is inevitable in this setting, as indicated [in this
463
+ paper](https://arxiv.org/abs/1610.07524). There are a large
464
+ number of fairness definitions that are mutually incompatible. [This
465
+ tutorial by Arvind
466
+ Narayanan](https://www.youtube.com/watch?v=jIXIuYdnyyk&ab_channel=ArvindNarayanan)
467
+ is an excellent overview of them.
468
+
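The "quick algebra" can be sketched in a few lines. For any binary classifier, the false positive rate is pinned down by the group's base rate, the positive predictive value (PPV), and the false negative rate (FNR): FPR = p / (1 - p) × (1 - PPV) / PPV × (1 - FNR). So if two groups have different base rates but the same PPV and FNR, their false positive rates must differ. The counts below are hypothetical and are not COMPAS data; `ConfusionCounts` and `implied_fpr` are illustrative names.

```python
from dataclasses import dataclass


@dataclass
class ConfusionCounts:
    tp: int  # predicted positive, actually positive
    fp: int  # predicted positive, actually negative
    fn: int  # predicted negative, actually positive
    tn: int  # predicted negative, actually negative

    @property
    def base_rate(self) -> float:
        return (self.tp + self.fn) / (self.tp + self.fp + self.fn + self.tn)

    @property
    def fpr(self) -> float:
        return self.fp / (self.fp + self.tn)

    @property
    def fnr(self) -> float:
        return self.fn / (self.fn + self.tp)

    @property
    def ppv(self) -> float:
        return self.tp / (self.tp + self.fp)


def implied_fpr(base_rate: float, ppv: float, fnr: float) -> float:
    """False positive rate forced by the base rate, PPV, and FNR."""
    return base_rate / (1 - base_rate) * (1 - ppv) / ppv * (1 - fnr)


# Hypothetical groups of 1000 people each, with base rates of 0.5 and 0.2.
group_a = ConfusionCounts(tp=300, fp=200, fn=200, tn=300)
group_b = ConfusionCounts(tp=120, fp=80, fn=80, tn=720)

for name, g in [("A", group_a), ("B", group_b)]:
    print(name, f"PPV={g.ppv:.2f} FNR={g.fnr:.2f} FPR={g.fpr:.2f}")
```

Both hypothetical groups get the same PPV and the same FNR, yet the group with the higher base rate ends up with a much higher false positive rate. Equalizing one metric across groups moves the disparity into another.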
469
+ It is noteworthy that **the impact of "unfairness" is not fixed**. The
470
+ story is often presented as "no matter what, the journalists would have
471
+ found something to complain about." But note that equalizing false
472
+ positive rates and positive predictive value across groups would lead to
473
+ a higher false negative rate for Black defendants relative to White
474
+ defendants. In the context of American politics, that's not going to
475
+ lead to complaints from the same people.
476
+
477
+ ![](./media/image6.png)
478
+
479
+
480
+ This is a story about the necessity of confronting the tradeoffs that
481
+ will inevitably come up. Researchers at Google made [a nice little
482
+ tool](https://research.google.com/bigpicture/attacking-discrimination-in-ml/)
483
+ where you can think through and make these tradeoffs for yourself. It's
484
+ helpful for building intuition on these fairness metrics and what it
485
+ means to pick one over the other.
486
+
487
+ Events in this controversy kicked off a flurry of research on fairness.
488
+ [The Fairness, Accountability, and Transparency
489
+ conference](https://facctconference.org/) has been held for
490
+ several years. There has been a ton of work on both **algorithmic-level
491
+ approaches** to measuring and incorporating fairness metrics into
492
+ training and **qualitative work** on designing systems that are more
493
+ transparent and accountable.
494
+
495
+ In the case of COMPAS, **re-arrest is not the same as recidivism**.
496
+ Being rearrested requires that a police officer believes you committed a
497
+ crime. Police officers are subject to their own biases and patterns of
498
+ policing, which result in a far higher fraction of crimes being caught
499
+ for some groups than for others. Our real goal, in terms of fairness and
500
+ criminal justice, might be around reducing those kinds of unfair impacts
501
+ rather than relying on past rearrest data that have these issues.
502
+
503
+ #### Representation Matters for Model Fairness
504
+
505
+ ![](./media/image18.png)
506
+
507
+ Unfortunately, it is easy to make ML-powered tech that fails for
508
+ minoritized groups. For example, off-the-shelf computer vision tools
509
+ often fail on darker skin (as illustrated in [this talk by Joy
510
+ Buolamwini](https://www.ted.com/talks/joy_buolamwini_how_i_m_fighting_bias_in_algorithms)).
511
+ This is not a new issue in technology, just a more salient one with ML.
512
+
513
+ There has been a good amount of progress on this in the last five years.
514
+ An example is [Google's Model
515
+ Cards](https://modelcards.withgoogle.com/about) which show
516
+ how well a model will perform on human subgroups of interest.
517
+ HuggingFace has good integrations for creating these kinds of model
518
+ cards.
519
+
520
+ When you invite people for talks or hire people to join your
521
+ organizations, you should work to reduce the bias of that discovery
522
+ process by diversifying your network. Some good resources include
523
+ [Black in AI](https://blackinai.github.io/#/), [Diversify
524
+ Tech Job Board](https://www.diversifytech.co/job-board/),
525
+ [Women in Data Science](https://www.widsconference.org/),
526
+ and the [You Belong in AI
527
+ podcast](https://anchor.fm/ucla-acm-ai). You can make
528
+ professional connections via them to improve the representation of
529
+ minoritized groups in the engineering, design, and product management
530
+ process.
531
+
532
+ #### Is The System Accountable?
533
+
534
+ At a broader level than fairness, we should expect "accountability" from
535
+ ML systems. Some societies and states, including the EU, consider "[the
536
+ right to an explanation](https://arxiv.org/abs/1606.08813)"
537
+ in the face of important judgments to be a part of human rights.
538
+
539
+ In the GDPR, there is [a section that enshrines
540
+ accountability](https://www.consumerfinance.gov/rules-policy/regulations/1002/interp-9/#9-b-1-Interp-1).
541
+ This isn't quite a totally new requirement; credit denials in the US
542
+ have been required to be explained since 1974. People have a right to
543
+ know what goes into decisions made about them, and why!
544
+
545
+ If you want to impose this "accountability" on a deep neural network and
546
+ understand its selections, there are a number of methods that use the
547
+ input-output gradient to explain the model. You can see a list of
548
+ several methods in order of increasing performance below (from [this
549
+ paper](https://arxiv.org/abs/1810.03292)). These approaches
550
+ don't quite have strong theoretical underpinnings or a holistic
551
+ explanation, and are not that robust as a result. A lot of these methods
552
+ act primarily as edge detectors. The paper shows how even randomizing
553
+ layers in a model does not materially change the interpretability output
554
+ of some of these methods, such as Guided GradCAM.
555
+
556
+ ![](./media/image11.png)
557
+
558
+
559
+ As a result, introspecting DNNs effectively requires reverse engineering
560
+ the system to really understand what is going on, an approach advanced by
561
+ efforts like [Distill](https://distil.pub/) and
562
+ [Transformer Circuits](https://transformer-circuits.pub/).
563
+
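To make the input-output gradient idea concrete, here is a minimal sketch on a toy model. It is not any of the methods from the paper above: it approximates the gradient of the output with respect to each input feature by finite differences, whereas real saliency methods get it from automatic differentiation on a trained network. The toy weights and the names `toy_model` and `input_gradient` are assumptions made for illustration.

```python
import math
from typing import Callable, List


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def toy_model(x: List[float]) -> float:
    """A tiny logistic model with hypothetical, hand-picked weights."""
    weights = [2.0, -1.0, 0.0]
    bias = 0.0
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def input_gradient(
    f: Callable[[List[float]], float], x: List[float], eps: float = 1e-5
) -> List[float]:
    """Central finite-difference estimate of d f(x) / d x[i] for each input."""
    grads = []
    for i in range(len(x)):
        up, down = list(x), list(x)
        up[i] += eps
        down[i] -= eps
        grads.append((f(up) - f(down)) / (2 * eps))
    return grads


x = [0.5, 0.5, 0.5]
saliency = input_gradient(toy_model, x)
print(saliency)
```

The sign and size of each entry say how the output would move if that input were nudged. Here the third feature gets zero saliency because the toy model ignores it. The gradient is a local statement about one input, which is part of why these explanations can be fragile.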
564
+ Due to these technical challenges, machine learning systems are prone to
565
+ unaccountability that impacts most those least able to understand and
566
+ influence their outputs. Books such as [Automating
567
+ Inequality](https://www.amazon.com/Automating-Inequality-High-Tech-Profile-Police/dp/1250074312)
568
+ describe the impacts of these systems. In such a context, you should
569
+ seek to question the purpose of the model, involve those impacted by the
570
+ decisions (either through direct human inputs or through other means),
571
+ and ensure that equal attention is paid to benefits and harms of
572
+ automation.
573
+
574
+ #### Who Owns The Data?
575
+
576
+ **Humans justifiably feel ownership of the data they create**, which is
577
+ subsequently used to train machine learning models. Large datasets used
578
+ to train models like GPT-3 are created by mining this data without the
579
+ explicit involvement of those who create the data. Many people are not
580
+ aware that this is both possible and legal. As technology has changed,
581
+ what can be done with data has changed.
582
+
583
+ [You can even check whether your data has been used to train
584
+ models](https://haveibeentrained.com/). Some of these images
585
+ are potentially [obtained
586
+ illegally](https://arstechnica.com/information-technology/2022/09/artist-finds-private-medical-record-photos-in-popular-ai-training-data-set/),
587
+ as a result of sensitive data being posted openly without the recorded
588
+ consent of the originator.
589
+
590
+ ![](./media/image5.png)
591
+
592
+
593
+ Each of these controversies around image generation and illegal data has
594
+ opened up a new frontier in **data governance**. Focus will be placed on
595
+ ensuring new ML systems are sensitive to personal and professional
596
+ concerns of those who generate the data ML systems are trained on.
597
+ [Emad Mostaque](https://uk.linkedin.com/in/emostaque), CEO
598
+ of [Stability AI](https://stability.ai/), has gone so far
599
+ as to offer future opt-out mechanisms for systems similar to Stable
600
+ Diffusion.
601
+
602
+ Here are some practical tips: [Dataset
603
+ cards](https://huggingface.co/docs/datasets/dataset_card)
604
+ can be helpful in providing documentation in a similar fashion to model
605
+ cards. There are also ethics checklists, like [the deon ethics
606
+ checklist](https://deon.drivendata.org/examples/), that
607
+ help design proper systems. Deon also has a helpful list of failure
608
+ cases.
609
+
610
+ #### Should This Be Built At All?
611
+
612
+ The undercurrent behind this conversation is the justifiable question of
613
+ whether some of these systems should be built at all, let alone in an
614
+ ethical way.
615
+
616
+ **ML-powered weaponry** is the canonical example here, which is already
617
+ in use. The definition of these systems is blurry, as both systems old
618
+ and new have had various autonomous capacities. This is difficult to get
619
+ a sense of due to the secrecy associated with weapon systems.
620
+
621
+ Some have argued that "autonomous weapons" have existed for hundreds of
622
+ years, but even this does not mean that they are ethical. Mines are good
623
+ examples of these systems. Movements like [the Campaign Against Killer
624
+ Robots](https://www.stopkillerrobots.org/about-us/) are
625
+ trying to prevent the cycle we entered with mines - where we invented
626
+ them, then realized the incredible harm, and are now trying to ban
627
+ them. Why invent these at all?
628
+
629
+ Let's wrap up this entire section with some closing questions that you
630
+ should always have a thoughtful answer to as you build a machine
631
+ learning system.
632
+
633
+ 1. **Is the model "fair"?** Fairness is possible, but requires
634
+ trade-offs.
635
+
636
+ 2. **Is the system accountable?** Accountability is easier than
637
+ interpretability.
638
+
639
+ 3. **Who owns the data?** Answer this upfront. Changes are on the way.
640
+
641
+ 4. **Should the system be built at all?** Repeatedly ask this and use
642
+ it to narrow scope.
643
+
644
+ ### What Can We Learn from Medical ML
645
+
646
+ *Note: The FSDL team would like to thank [Dr. Amir Ashraf
647
+ Ganjouei](https://scholar.google.com/citations?user=pwLadpcAAAAJ)
648
+ for his feedback on this section.*
649
+
650
+ Interestingly, medicine can teach us a lot about how to apply machine
651
+ learning in a responsible way. Fundamentally, there is a mismatch
652
+ between how medicine works and how machine learning systems are built
653
+ today.
654
+
655
+ Let's start with a startling fact: **the machine learning response to
656
+ COVID-19 was an abject failure**. In contrast, the biomedical response
657
+ was a major triumph. For example, the vaccines were developed with
658
+ tremendous speed and precision.
659
+
660
+ ![](./media/image9.png)
661
+
662
+ Machine learning did not acquit itself well with the COVID-19 problem.
663
+ Two reviews ([Roberts et al.,
664
+ 2021](https://www.nature.com/articles/s42256-021-00307-0)
665
+ and [Wynants et al.,
666
+ 2020-2022](https://www.bmj.com/content/369/bmj.m1328))
667
+ found that nearly all machine learning models were insufficiently
668
+ documented, had little to no external validation, and did not follow
669
+ model development best practices. A full 25% of the papers used a
670
+ dataset incorrect for the task, which simply highlighted the difference
671
+ between children and adults, not pneumonia and COVID.
672
+
673
+ Medicine has a strong culture of ethics that professionals are
674
+ integrated into from the point they start training. Medical
675
+ professionals take the Hippocratic oath of practicing two things: either
676
+ help or do not harm the patient. In contrast, the foremost belief
677
+ associated with software development tends to be the infamous "Move fast
678
+ and break things." While this approach works for harmless software like
679
+ web apps, **it has serious implications for medicine and other more
680
+ critical sectors**. Consider the example of a retinal implant that was
681
+ simply deprecated by developers, leaving hundreds without sight, as described [in
682
+ this Statnews
683
+ article](https://www.statnews.com/2022/08/10/implant-recipients-shouldnt-be-left-in-the-dark-when-device-company-moves-on/).
684
+
685
+ ![](./media/image4.png)
686
+
687
+ **Researchers are drawing inspiration from medicine to develop similar
688
+ standards for ML**.
689
+
690
+ - For example, clinical trial standards have been extended to ML.
691
+ These standards were developed through extensive surveys,
692
+ conferences, and consensus building (detailed in
693
+ [these](https://www.nature.com/articles/s41591-020-1037-7)
694
+ [papers](https://www.nature.com/articles/s41591-020-1034-x)).
695
+
696
+ - Progress is being made in understanding how this problem presents.
697
+ [A recent
698
+ study](https://jamanetwork.com/journals/jamanetworkopen/fullarticle/2796833)
699
+ found that while clinical activities are generally performed at a
700
+ high compliance level, statistical and data issues tend to suffer
701
+ low compliance.
702
+
703
+ - New approaches are developing [entire "auditing"
704
+ procedures](https://www.thelancet.com/journals/landig/article/PIIS2589-7500(22)00003-6/fulltext)
705
+ that exquisitely identify the activities required to effectively
706
+ develop models.
707
+
708
+ Like medicine, machine learning is intimately intertwined with people's
709
+ lives. The most important question to ask is "Should this system be
710
+ built at all?". Always ask yourselves this and understand the
711
+ implications!
712
+
713
+ ## 5 - AI Ethics
714
+
715
+ AI ethics is a frontier in both the technology and the ethics worlds.
716
+ False claims and hype are the most pressing concerns, but other risks
717
+ could present themselves soon.
718
+
719
+ ### AI Snake Oils
720
+
721
+ **False claims outpace the performance of AI**. This poses a serious
722
+ threat to adoption and satisfaction with AI systems long term.
723
+
724
+ - For example, if you call something "AutoPilot", people might truly
725
+ assume it is fully autonomous, as happened in the case below of a
726
+ Tesla user. This goes back to our discussion about how AI systems
727
+ are more like funky dogs than truly human intelligent systems.
728
+
729
+ - Another example of this is [IBM's Watson
730
+ system](https://www.ibm.com/ibm/history/ibm100/us/en/icons/watson/),
731
+ which went from tackling the future of healthcare to being sold
732
+ off for parts.
733
+
734
+ ![](./media/image13.png)
735
+
736
+ These false claims tend to be amplified in the media. But this isn't
737
+ confined to traditional media. Even Geoff Hinton, a godfather of modern
738
+ machine learning, has been [a little too aggressive in his forecasts
739
+ for AI
740
+ performance](https://www.youtube.com/watch?v=2HMPRXstSvQ)!
741
+
742
+ You can call this **"AI Snake Oil"** as Arvind Narayanan does in [his
743
+ Substack](https://aisnakeoil.substack.com/) and
744
+ [talk](https://www.cs.princeton.edu/~arvindn/talks/MIT-STS-AI-snakeoil.pdf).
745
+
746
+ Let's separate out where true progress has been made versus where
747
+ progress is likely to be overstated. On some level, AI perception has
748
+ seen tremendous progress, AI judgment has seen moderate progress, and AI
749
+ prediction of social outcomes has seen not nearly as much progress.
750
+
751
+ ![](./media/image3.png)
752
+
753
+ ### Frontiers: AI Rights and X-Risk
754
+
755
+ There's an obvious argument that, should artificial sentient beings exist,
756
+ tremendous ethical implications would be raised. Few people believe that
757
+ we are truly on the precipice of sentient beings, but there is
758
+ disagreement on how close we are.
759
+
760
+ ![](./media/image1.png)
761
+
762
+ There's a different set of concerns around how to regard self-improving
763
+ intelligent beings, for which there is already evidence. Large Language
764
+ Models have been shown to be able to improve themselves in a range of
765
+ studies
766
+ ([here](https://openreview.net/forum?id=92gvk82DE-) and
767
+ [here](https://arxiv.org/abs/2207.14502v1)).
768
+
769
+ Failing to pursue this technology would lead to [a huge opportunity
770
+ cost](https://nickbostrom.com/astronomical/waste) (as
771
+ argued by Nick Bostrom)! There truly is a great opportunity in having
772
+ such systems help us solve major problems and lead better lives. The key,
773
+ though, is that such technology should be developed in the **safest way
774
+ possible,** not the fastest way.
775
+
776
+ [The paperclip
777
+ problem](https://www.lesswrong.com/tag/paperclip-maximizer)
778
+ shows how the potential for misalignment between AI systems and humans
779
+ could dramatically reduce human utility and even compromise our
780
+ interests. Imagine a system designed to manufacture paperclips... could
781
+ actually develop the intelligence to alter elements of society to favor
782
+ paperclips?! This thought experiment illustrates how self-learning
783
+ systems could truly change our world for the worse in a misaligned way.
784
+
785
+ These ideas around existential risk are most associated with [the
786
+ Effective Altruism community](https://www.eaglobal.org/).
787
+ Check out resources like [Giving What We
788
+ Can](https://www.givingwhatwecan.org/donate/organizations)
789
+ and [80,000 Hours](https://80000hours.org/) if you're
790
+ interested!
791
+
792
+ ## 6 - What Is To Be Done?
793
+
794
+ This course can't end on as dour a note as existential risk. What can be
795
+ done to mitigate these consequences and participate in developing truly
796
+ ethical AI?
797
+
798
+ 1. The first step is **to educate yourself on the topic**. There are
799
+ many great books that give lengthy, useful treatment to this
800
+ topic. We recommend [Automating
801
+ Inequality](https://www.amazon.com/Automating-Inequality-High-Tech-Profile-Police/dp/1250074312),
802
+ [Weapons of Math
803
+ Destruction](https://www.amazon.com/Weapons-Math-Destruction-Increases-Inequality/dp/0553418815),
804
+ and [The Alignment
805
+ Problem](https://www.amazon.com/Alignment-Problem-Machine-Learning-Values/dp/0393635821).
806
+
807
+ 2. After reading this, **consider how to prioritize your actions**.
808
+ What do you want to impact? When do you want to do that? Place
809
+ them in this two-by-two to get a sense of where their importance
810
+ is.
811
+
812
+ ![](./media/image7.png)
813
+
814
+ **Ethics cannot be purely negative**. We do good, and we want to
815
+ *prevent* bad! Focus on the good you can do and be mindful of the harm
816
+ you can prevent.
817
+
818
+ Leading organizations like
819
+ [DeepMind](https://www.deepmind.com/about/operating-principles)
820
+ and [OpenAI](https://openai.com/charter/) are leading from
821
+ the front. Fundamentally, building ML well aligns with building ML for
822
+ good. All the leading organizations emphasize effective *and*
823
+ responsible best practices for building ML-powered products. Keep all
824
+ this in mind as you make the world a better place with your AI-powered
825
+ products!
documents/lecture-09.srt ADDED
@@ -0,0 +1,488 @@
1
+ 1
2
+ 00:00:00,539 --> 00:00:45,660
3
+ hey everyone welcome to the ninth and final lecture of Full Stack Deep Learning 2022. today we'll be talking about ethics after going through a little bit of context of what it is that we mean by ethics what I mean by ethics when I talk about it we'll go through three different areas where ethics comes up both broad tech ethics ethics that anybody who works in the tech industry broadly needs to think about and care about what ethics has meant specifically for the machine learning industry what's happened in the last couple of years as ethical concerns have come to the forefront and then finally what ethics might mean in a future where true artificial general intelligence exists so first let's do a little bit of
4
+
5
+ 2
6
+ 00:00:42,899 --> 00:01:33,000
7
+ context setting even more so than other topics all lectures on ethics are wrong but some of them are useful and they're more useful if we admit and state what our assumptions or biases or approaches are before we dive into the material and then I'll also talk about three kind of general themes that I see coming up again and again when ethical concerns are raised in tech and in machine learning themes of alignment themes of trade-off and the critical theme of humility so in this lecture I'm going to approach ethics on the basis of concrete cases specific instances where people have raised concerns so we'll talk about cases where people have taken actions that have led to claims and counterclaims of ethical or unethical behavior
8
+
9
+ 3
10
+ 00:01:29,040 --> 00:02:13,680
11
+ the use of automated weapons the use of machine learning systems for making decisions like sentencing and bail and the use of machine learning algorithms to generate art in each case when criticism has been raised part of the criticism has been that the technology or its impact is unethical so approaching ethics in this way allows me to give my favorite answer to the question of what is ethics which is to quote one of my favorite philosophers Ludwig Wittgenstein and say that the meaning of a word is its use in the language so we'll be focusing on times when people have used the word ethics to describe what they like or dislike about some piece of technology and this approach to definition is an interesting
12
+
13
+ 4
14
+ 00:02:11,940 --> 00:02:51,959
15
+ one if you want to try it out for yourself you should check out the game Something Something Soup Something which is a browser game at the link in the bottom left of this slide in which you're presented with a bunch of dishes and you have to decide whether they are soup or not soup whether they can be served to somebody who ordered soup and by playing a game like this you can discover both how difficult it is to really put your finger on a concrete definition of soup and how poorly maybe your working definition of soup fits with any given soup theory because of this sort of case-based approach we won't be talking about ethical schools and we won't be doing any trolley problems so this article here from Current Affairs asks
16
+
17
+ 5
18
+ 00:02:50,400 --> 00:03:34,800
19
+ you to consider this particular example of an ethical dilemma where an asteroid containing all of the universe's top doctors who are working on a cure for all possible illnesses is hurtling towards the planet of orphans and you can destroy the asteroid and save the orphans but if you do so the hope for a cure for all diseases will be lost forever and the question posed by the authors of this article is is this hypothetical useful at all for illuminating any moral truths so rather than considering these hypothetical scenarios about trolley cars going down rails and fat men standing on bridges we'll talk about concrete specific examples from the last 10 years of work in our field and adjacent fields but
20
+
21
+ 6
22
+ 00:03:32,580 --> 00:04:11,040
23
+ this isn't the only way of talking about or thinking about ethics it's the way that I think about it it's the way that I prefer to talk about it but it's not the only one and it might not be the one that works for you so if you want another point of view and one that really emphasizes and loves trolley problems then you should check out Sergey's lecture from the last edition of the course from 2021 it's a really delightful talk and presents some similar ideas from a very different perspective coming to some of the same conclusions and some different conclusions a useful theme from that lecture that I think we should all have in mind when we're pondering ethical dilemmas and the related questions that they bring up is the
24
+
25
+ 7
26
+ 00:04:09,060 --> 00:04:52,080
27
+ theme of what is water from last year's lecture so this is a famous little story from a commencement speech by David Foster Wallace where an older fish swimming by two younger fish says morning boys how's the water and after he swims away one of the younger fish turns to the other and says wait what the hell is water the idea is that if we aren't thoughtful if we aren't paying attention some things that are very important can become background can become assumption and can become invisible and so when I shared these slides with Sergey he challenged me to answer this question for myself about how we were approaching ethics this time around and I'll say that this approach of relying on prominent cases risks replicating a lot of social biases
28
+
29
+ 8
30
+ 00:04:50,520 --> 00:05:33,060
31
+ some people's ethical claims are amplified and some fall on unhearing ears some stories travel more because the people involved have more resources and are better connected and using these forms of case-based reasoning where you explain your response or your beliefs in terms of these concrete specifics can end up hiding the principles that are actually in operation maybe you don't even realize that that's how you're making the decision maybe some of the true ethical principles that you're operating under can disappear like water to these fish so I don't claim that the approach I'm taking here is perfect but in the end so much of ethics is deeply personal that we can't expect to have a perfect approach we can just do the best
32
+
33
+ 9
34
+ 00:05:30,479 --> 00:06:18,120
35
+ we can and hopefully better every day so we're gonna see three themes repeatedly come up throughout this talk two different forms of conflict that give rise to ethical disputes one when there is conflict between what we want and what we get and another when there is conflict between what we want and what others want and then finally a theme of maybe an appropriate response a response of humility when we don't know what we want or how to get it the problem of alignment where what we want and what we get differ will come up over and over again and one of the primary drivers of this is what you might call the proxy problem which is in the end we are often optimizing or maximizing some proxy of the thing that we really care about and
36
+
37
+ 10
38
+ 00:06:15,720 --> 00:06:52,080
39
+ if the alignment or loosely the correlation between that proxy and the thing that we actually care about is poor enough then by trying to maximize that proxy we can end up hurting the thing that we originally cared about there is a nice paper that came out just very recently doing a mathematical analysis of this idea that's actually been around for quite some time you can see these kinds of proxy problems everywhere once you're looking for them on the top right I have a train and validation loss chart from one of the training runs for the Full Stack Deep Learning text recognizer the thing that we can actually optimize is the training loss that's what we can use to calculate gradients and improve the
40
+
41
+ 11
42
+ 00:06:50,639 --> 00:07:35,400
43
+ parameters of our network but the thing that we really care about is the performance of the network on data points it hasn't seen like the validation set or the test set or data in production if we optimize our training loss too much then we can actually cause our validation loss to go up similarly there was an interesting paper that suggested that increasing your accuracy on classification tasks can actually result in a decrease in the utility of your embeddings in downstream tasks and you can find these proxy problems outside of machine learning as well there's a famous story involving a Soviet factory and nails that turned out to be false but in looking up a reference for it I was able to find an actual example where a factory that was
44
+
45
+ 12
46
+ 00:07:33,300 --> 00:08:18,120
47
+ making chemical machines rather than creating a machine that was cheaper and better chose not to adopt producing that machine because their output was measured in weight so the thing that the planners actually cared about economic efficiency and output was not what was being optimized for because it was too difficult to measure and one reason why these kinds of proxy problems arise so frequently is due to issues of information the information that we're able to measure is not the information that we want so the training loss is the information that we have but the information that we want is the validation loss but then at a higher level we often don't even know what it is that we truly need so we may want the
48
+
49
+ 13
50
+ 00:08:15,360 --> 00:09:03,660
51
+ validation loss but what we need is the loss in production or really the value our users will derive from this model but even when we do know what it is that we want or what it is that we need we're likely to run into the second kind of problem the problem of a trade-off between stakeholders going back to our hypothetical example with the asteroid of doctors hurtling towards the planet of orphans what makes this challenging is the need to determine a trade-off between the wants and needs of the people on the asteroid the wants and needs of the orphans on the planet and the wants and needs of future people who cannot be reached for comment and to weigh in on this concern it's sometimes said that this need to
52
+
53
+ 14
54
+ 00:09:02,040 --> 00:09:40,920
55
+ negotiate trade-offs is one of the reasons why engineers don't like thinking about some of these problems around ethics I don't think that's quite right because we do accept trade-offs as a key component of engineering there's this nice O'Reilly book on the fundamentals of software architecture the first thing that they state at the very beginning is that everything in software architecture is a trade-off and even this satirical O RLY book says that every programming question has the answer it depends so we're comfortable negotiating trade-offs take for example this famous chart comparing the different convolutional networks on the basis of their accuracy and the number of operations that it takes to run them
56
+
57
+ 15
58
+ 00:09:38,580 --> 00:10:22,080
59
+ thinking about these kinds of trade-offs between speed and correctness is exactly the sort of thing that we have to do all the time in our job as engineers and one part of it that is maybe easier is at least selecting what's called the Pareto front for the metrics that we care about my favorite way of remembering what a Pareto front is is this definition of a data scientist from Josh Wills which is a data scientist who's better at stats than any software engineer and better at software engineering than any statistician so this Pareto front that I've drawn here is the models that are more accurate than anybody who takes fewer FLOPs and use fewer FLOPs than anybody who is more accurate so I think rather than fundamentally being about
60
+
61
+ 16
62
+ 00:10:20,100 --> 00:11:02,640
63
+ trade-offs one of the reasons why engineers maybe dislike thinking about these problems is that it's really hard to identify the axes for a chart like the one that I just showed it's very hard to quantify these things and if we do quantify things like the utility or the rights of people involved in a problem we know that those quantifications are far away from what they truly want to measure there's a proxy problem in fact but even further once measured where on that front do we fall as engineers we maybe develop an expertise in knowing whether we want high accuracy or low latency or computational load but we are not as comfortable deciding how many current orphans we want to trade for what amount
64
+
65
+ 17
66
+ 00:11:01,260 --> 00:11:45,120
67
+ of future health so this raises questions both in terms of measurement and in terms of decision making that are outside of our expertise so the appropriate response here is humility because we don't explicitly train these skills the way that we do many of the other skills that are critical for our job and many folks engineers and managers in technology seem to kind of deep in their bones prefer optimizing single metrics making a number go up so there's no trade-offs to think about and those metrics are not proxies they're the exact same thing that you care about my goal within this company my objective for this quarter my North Star is user growth or lines of code and by God I'll make that go up so when we
68
+
69
+ 18
70
+ 00:11:43,380 --> 00:12:28,800
71
+ encounter a different kind of problem it's important to bring a humble mindset a student mindset to the problems to ask for help to look for experts and to recognize that the help that you get and the experts that you find might not be immediately obviously what you want or what you're used to additionally one form of this that we'll see repeatedly is that when attempting to intervene because of an ethical concern it's important to remember this same humility it's easy to think when you are on the good side that this humility is not necessary but even trying to be helpful is a delicate and dangerous undertaking one of my favorite quotes from The Systems Bible so we want to make sure as we resolve the ethical concerns that
72
+
73
+ 19
74
+ 00:12:26,220 --> 00:13:06,000
75
+ people raise about our technology that we come up with solutions that are not just part of the problem so the way that I resolve all of these is through user orientation by getting feedback from users we maintain alignment between ourselves and the system that we're creating and the users that it's meant to serve and then when it's time to make trade-offs we should resolve them in consultation with users and in my opinion we should tilt the scales in their favor and away from the favor of other stakeholders including within our own organization and then humility is one of the reasons why we actually listen to users at all because we are humble enough to recognize that we don't have the answers to all of these
76
+
77
+ 20
78
+ 00:13:03,720 --> 00:13:46,260
79
+ questions all right with our context and our themes under our belt let's dive into some concrete cases and responses we'll start by considering ethics in the broader world of technology that machine learning finds itself in so the key thing that I want folks to take away from this section is that the tech industry cannot afford to ignore ethics as public trust in tech declines we need to learn from other nearby industries that have done a better job on professional ethics and then we'll talk about some contemporary topics some that I find particularly interesting and important throughout the past decade the technology industry has been plagued by scandal whether that's how technology companies interface with national
80
+
81
+ 21
82
+ 00:13:43,680 --> 00:14:30,839
83
+ governments at the largest scale over to how technological systems are being used or manipulated by people creating disinformation or fake social media accounts or targeting children with automatically generated content that hacks the YouTube recommendation system and the effect of this has been that distrust in tech companies has risen markedly in the last 10 years so this is from the Public Affairs Pulse survey just last year the tech industry went from being in 2013 one of the industries that the fewest people felt was less trustworthy than average to rubbing elbows with famously much distrusted industries like energy and pharmaceuticals and the tech industry doesn't have to win elections so we
84
+
85
+ 22
86
+ 00:14:29,220 --> 00:15:16,079
87
+ don't have to care about public polling as much as politicians but politicians care quite a bit about those public opinion polls and just in the last few years the fraction of people who believe that the large tech companies should be more regulated has gone up a substantial amount and comparing it to 10 years ago it's astronomically higher so there will be substantial impacts on our industry due to this loss of public trust so as machine learning engineers and researchers we can learn from nearby fields so I'll talk about two of them one a nice little bit about the culture of professional ethics in engineering in Canada and then a little bit about ethical standards for human subjects research so one of the worst
88
+
89
+ 23
90
+ 00:15:12,779 --> 00:15:57,899
91
+ construction disasters in modern history was the collapse of the Quebec Bridge in 1907. 75 people who were working on the bridge at the time were killed and a parliamentary inquiry placed the blame pretty much entirely on two engineers in response there was the development of some additional rituals that many Canadian engineers take part in when they finish their education that are meant to impress upon them the weight of their responsibility so one component of this is a large iron ring which literally impresses that weight upon people and then another is an oath that people take a non-legally binding oath that includes saying that I will not henceforward suffer or pass or be privy to the passing of bad workmanship or
92
+
93
+ 24
94
+ 00:15:56,100 --> 00:16:37,800
95
+ faulty material I think software would look quite a bit different if software engineers took an oath like this and took it seriously one other piece I wanted to point out is that it includes within it some built-in humility asking pardon ahead of time for the assured failures lots of machine learning is still in the research stage and so some people may say that oh well that's important for the people who are building stuff but I'm working on R&D for fundamental technology so I don't have to worry about that but research is also subject to regulation so this is something I was required to learn because I did my PhD in a neuroscience department that was funded by the National Institutes of Health which
96
+
97
+ 25
98
+ 00:16:34,500 --> 00:17:18,360
99
+ mandates training in ethics and in the ethical conduct of research so these regulations for human subjects research date back to the 1940s when there were medical experiments on unwilling human subjects by totalitarian regimes this is still pretty much the cornerstone for laws on human subjects research around the world through the Helsinki Declaration which gets regularly updated in the US the touchstone bit of regulation on this the 1974 National Research Act requires among other things informed consent from people who are participating in research and there were two major revelations in the late 60s and early 70s that led to this legislation not dissimilar to the scandals that have plagued the technology industry recently one was the
100
+
101
+ 26
102
+ 00:17:16,740 --> 00:18:04,860
103
+ infliction of hepatitis on mentally disabled children in New York in order to test hepatitis treatments and the other was the non-treatment of syphilis in Black men at Tuskegee in order to study the progression of the disease in both cases these subjects did not provide informed consent and seemed to be selected for being unable to advocate for themselves or to get legal redress for the harms they were suffering and so if we are running experiments and those experiments involve humans involve our users we are expected to adhere to the same principles and one of the famous instances of mismatch between the culture in our industry and the culture of human subjects research was when some researchers at Facebook studied
104
+
105
+ 27
106
+ 00:18:02,760 --> 00:18:49,440
107
+ emotional contagion by altering people's news feeds either adding more negative content or adding more positive content and they found a modest but robust effect that introducing more positive content caused people to post more positively when people found out about this they were very upset the authors noted that Facebook's data use policy includes that the user's data and interactions can be used for this but most people who were Facebook users and the editorial board of PNAS where this was published did not see it that way so put together I think we are at the point where we need a professional code of ethics for software hopefully many codes of ethics developed in different communities that can bubble up compete
108
+
109
+ 28
110
+ 00:18:48,000 --> 00:19:35,700
111
+ with each other and merge to finally form something that most of us or all of us can agree on and that is incorporated into our education and acculturation of new members into our field and into more aspects of how we build to close out this section I wanted to talk about some particular ethical concerns that arise in tech in general first around carbon emissions and then second around dark patterns and user-hostile designs the good news with carbon emissions is that because they scale with cost it's only something that you need to worry about when the costs of what you're building what you're working on are very large at which time you both won't be alone in making these decisions and you can move a bit more deliberately and make these
112
+
113
+ 29
114
+ 00:19:32,760 --> 00:20:19,440
115
+ choices more thoughtfully so first what are the ethical concerns with carbon emissions anthropogenic climate change driven by CO2 emissions raises a classic trade-off which was dramatized in this episode of Harvey Birdman Attorney at Law in which George Jetson travels back from the future to sue the present for melting the ice caps and destroying his civilization so unfortunately we don't have future generations present now to advocate for themselves the other view is that this is an issue that arises from a classic alignment problem which is many organizations are trying to maximize their profit that raw profit is based off of prices for goods that don't include externalities like the environmental damage caused by carbon
116
+
117
+ 30
118
+ 00:20:16,740 --> 00:21:06,960
119
+ dioxide emissions leading to increased temperatures and climatic change so the primary dimension along which we have to worry about carbon emissions is in compute jobs that require power that power has to be generated somehow and that can result in the emission of carbon and so there was a nice paper linked in this slide that walked through how much carbon dioxide was emitted using typical US-based cloud infrastructure and the top headline from this paper was that training a large Transformer model with neural architecture search produces as much carbon dioxide as five cars create during their lifetime so that sounds like quite a bit of carbon dioxide and it is in fact but it's important to remember that power is not free and so
120
+
121
+ 31
122
+ 00:21:05,160 --> 00:21:49,200
123
+ there is a metric that we're quite used to tracking that is at least correlated with our carbon emissions our compute spend and if you look for the cost it runs between one and three million dollars to run the neural architecture search that emitted five cars' worth of CO2 and one to three million dollars is actually a bit more than it would cost to buy five cars and provide their fuel so the number that I like to use is that for US-based cloud infrastructure like the us-west-1 that many of us find ourselves in ten dollars of cloud spend is roughly equal to one dollar worth of air travel costs so that's on the basis of something like the numbers in the chart indicating air travel across the United States from New York to San
124
+
125
+ 32
126
+ 00:21:47,220 --> 00:22:28,799
127
+ Francisco I've been taking care to always say US-based cloud infrastructure because just changing cloud regions can actually reduce your emissions quite a bit there's actually a factor of nearly 50x between some of the cloud regions that have the most carbon intensive power generation like ap-southeast-2 and the regions that have the least carbon intensive power like ca-central-1 that chart comes from a nice talk from Hugging Face that you can find on YouTube part of their course that talks a little bit more about that paper and about managing carbon emissions interest in this problem has led to some nice new tooling one codecarbon.io allows you to track power consumption and therefore
128
+
129
+ 33
130
+ 00:22:26,820 --> 00:23:12,419
131
+ CO2 emissions just like you would any of your other metrics and then there's also this ML CO2 Impact tool that's oriented a little bit more directly towards machine learning the other ethical concern in tech that I wanted to bring up is deceptive design and how to recognize it an unfortunate amount of deception is tolerated in some areas of software the example on the left comes from an article by Narayanan et al. that shows a fake countdown timer that claims that an offer will only be available for an hour but when it hits zero nothing the offer is still there there's also a possibly apocryphal example on the right here you may have seen these numbers next to products when online shopping saying that some number of people are currently
132
+
133
+ 34
134
+ 00:23:10,799 --> 00:23:57,059
135
+ looking at this product this little snippet of JavaScript here produces a random number to put in that spot so that example on the right may not be real but because of real examples like the one on the left it strikes a chord with a lot of developers and engineers there's a kind of slippery slope here that goes from being unclear or maybe not maximally upfront about something that is a source of friction or a negative user experience in your product and then in trying to remove that friction or sand that edge down you slowly find yourself being effectively deceptive to your users on the left is a nearly complete history of the way Google displays ads in its search engine results it started off very clearly
136
+
137
+ 35
138
+ 00:23:54,960 --> 00:24:38,039
139
+ colored and separated out with a bright color from the rest of the results and then about 10 years ago that colored background was removed and replaced with just a tiny little colored snippet that said ad and now as of 2020 that small bit there is no longer even colored it's just bolded and so this makes it difficult for users to know which content is being served to them because somebody paid for them to see it versus being served up organically so a number of patterns of deceptive design also known as dark patterns have emerged over the last 10 years you can read about them on this website deceptive.design there's also a Twitter account @darkpatterns where you can share examples that you find in the wild so some
140
+
141
+ 36
142
+ 00:24:36,360 --> 00:25:19,980
143
+ examples that you might be familiar with are the roach motel named after a kind of insect trap where you can get into a situation very easily but then it's very hard to get out of it if you've ever attempted to cancel a gym membership or delete your Amazon account then you may have found yourself a roach in a motel another example is trick questions where forms intentionally make it difficult to choose the option that most users want for example using negation in a non-standard way like check this box to not receive emails from our service one practice in our industry that's on very shaky ethical and legal ground is growth hacking which is a set of techniques for achieving really rapid growth in user
144
+
145
+ 37
146
+ 00:25:17,820 --> 00:26:00,659
147
+ base or revenue for a product and has all the connotations you might expect from the name hack LinkedIn was famously very spammy when it first got started I'd like to add you to my professional network on LinkedIn became something of a meme and this was in part because LinkedIn made it very easy to unintentionally send LinkedIn invitations to every person you'd ever emailed they ended up actually having to pay out in a class action lawsuit because they were sending multiple follow-up emails when a user only clicked to send an invitation once and the structure of their emails made it seem like they were being sent by the user rather than LinkedIn and the use of these growth hacks goes back to the very inception of email Hotmail marketed itself
148
+
149
+ 38
150
+ 00:25:58,200 --> 00:26:43,860
151
+ in part by tacking on a signature to the bottom of every email that said PS I love you get your free email at Hotmail so this seemed like it was being sent by the actual user I grabbed a snippet from a top 10 growth hacks article that said that the personal sounding nature of the message and the fact that it came from a friend made this a very effective growth hack but it's fundamentally deceptive to add this to messages in such a way that it seems personal and to not tell users that this change is being made to the emails that they're sending so machine learning can actually make this problem worse if we are optimizing short-term metrics these growth hacks and deceptive designs can often drive user and revenue
152
+
153
+ 39
154
+ 00:26:41,460 --> 00:27:25,679
155
+ growth in the short term but they do that by worsening user experience and drawing down on goodwill towards the brand in a way that can erode the long-term value of customers when we incorporate machine learning into the design of our products with A/B testing we have to watch out to make sure that the metrics that we're optimizing don't encourage this kind of deception so consider these two examples on the right the top example is a very straightforwardly implemented and direct and easy to understand form for users to indicate whether they want to receive emails from the company and from its affiliates in example B the wording of the first message has been changed so that it indicates that the first checkbox
156
+
157
+ 40
158
+ 00:27:23,520 --> 00:28:05,760
159
+ should be checked to not receive emails while the second one should not be ticked in order to not receive emails and if you're A/B testing these two designs against each other and your metric is the number of people who sign up to receive emails then it's highly likely that the system is going to select example B so it's important to take care in setting up A/B tests such that either they're tracking longer term metrics or things that correlate with them and that the variant generation system that generates all the different possible designs can't generate any designs that we would be unhappy with as we would hopefully be unhappy with the deceptive design in example B and I think it's also important to call out that this
160
+
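The A/B testing failure mode described in this cue can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Variant` fields, the complaint-rate guardrail, the design-review flag, and every number below are hypothetical and do not come from the lecture. The sketch shows that a selector that only looks at the short-term sign-up rate picks the deceptive form, while a selector with a guardrail metric and a review gate does not.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    """One design in an A/B test (all fields are hypothetical)."""
    name: str
    signup_rate: float          # short-term metric being optimized
    complaint_rate: float       # stand-in for a longer-term guardrail metric
    passes_design_review: bool  # did a human approve this design as non-deceptive?


def pick_naive(variants):
    """Select purely on the short-term metric."""
    return max(variants, key=lambda v: v.signup_rate)


def pick_guarded(variants, max_complaint_rate=0.05):
    """Select on the short-term metric, but only among variants that
    pass design review and stay under the guardrail threshold."""
    eligible = [
        v for v in variants
        if v.passes_design_review and v.complaint_rate <= max_complaint_rate
    ]
    if not eligible:
        raise ValueError("no variant satisfies the guardrails")
    return max(eligible, key=lambda v: v.signup_rate)


# Made-up numbers: B "wins" on sign-ups only because its wording is confusing.
example_a = Variant("A", signup_rate=0.20, complaint_rate=0.01, passes_design_review=True)
example_b = Variant("B", signup_rate=0.45, complaint_rate=0.12, passes_design_review=False)

naive_choice = pick_naive([example_a, example_b])
guarded_choice = pick_guarded([example_a, example_b])
print(naive_choice.name, guarded_choice.name)
```

The review gate corresponds to constraining what the variant generator may produce, and the guardrail metric corresponds to tracking something that correlates with longer-term value.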
161
+ 41
162
+ 00:28:03,840 --> 00:28:46,679
163
+ problem arises inside of another alignment problem we were considering the case where the long-term value of customers and the company's interests were being harmed by these deceptive designs but unfortunately that's not always going to be the case the private Enterprises that build most technology these days are able to deliver Broad Social value to make the world a better place as they say but the way that they do that is generally by optimizing metrics that are at best a very weak proxy for that value that they're delivering like their market capitalization and so there's the possibility of an alignment problem where companies pursuing and maximizing their own profit and success can lead to net negative production of value and
164
+
165
+ 42
166
+ 00:28:44,880 --> 00:29:23,700
167
+ this misalignment is something that if you spend time at the intersection of capital and funding leadership and technology development you will encounter it so it's important to consider these questions ahead of time and come to your own position whether that's treating this as the price of doing business or the way the world works seeking ways to improve this alignment or considering different ways to build technology but on the shorter term you can push for longer term thinking within your organization to allow for better alignment between the metrics that you're measuring and the goals that you're setting and between the goals that you're setting and what is overall good for our industry and for
168
+
169
+ 43
170
+ 00:29:21,960 --> 00:30:06,480
171
+ the broader world and you can also learn to recognize these user hostile design patterns call them out when you see them and you can advocate for a More user-centered Design instead so to wrap up our section on ethics for Building Technology broadly we as an industry should learn from other disciplines if we want to avoid a trust crisis or if we want to avoid the crisis getting any worse and we can start by educating ourselves about the common user hostile practices in our industry and how to avoid them now that we've covered the kinds of ethical concerns and conflicts that come up when Building Technology in general let's talk about concerns that are specific to machine learning just in the past couple of years there have been
172
+
173
+ 44
174
+ 00:30:04,200 --> 00:30:42,480
175
+ more and more ethical concerns raised about the uses of machine learning and this has gone beyond just the ethical questions that can get raised about other kinds of technology so we'll talk about some of the common ethical questions that have been raised repeatedly over the last couple of years and then we'll close out by talking about what we can learn from a particular sub-discipline of machine learning medical machine learning so the fundamental reason I think that ethics is different for machine learning and maybe more Salient is that machine learning touches human lives more intimately than a lot of other kinds of technology so many machine learning methods especially deep learning methods make human legible data into computer
176
+
177
+ 45
178
+ 00:30:40,320 --> 00:31:22,860
179
+ legible data so we're working on things like computer vision on processing natural language and humans are more sensitive to errors in and have more opinions about this kind of data about images like this puppy than they do about the other kinds of data manipulated by computers like abstract syntax trees so because of this there are more stakeholders with more concerns that need to be traded off in machine learning applications and then more broadly machine learning involves being wrong pretty much all the time there's the famous statement that all models are wrong though some are useful and I think the first part applies at least particularly strongly to machine learning our models are statistical and
180
+
181
+ 46
182
+ 00:31:20,760 --> 00:32:01,320
183
+ include in them Randomness the way that we frame our problems the way that we frame our optimization in terms of cross entropies or divergences and Randomness is almost always an admission of ignorance even the quintessential examples of Randomness like random number generation in computers and the flipping of a coin are things that we know in fact are not random truly they are in fact predictable and if we knew the right things and had the right laws of physics and the right computational power then we could predict how a coin would land we could control it we could predict what the next number to come out of a random number generator would be whether it's pseudorandom or based on some kind of Hardware Randomness and so
184
+
185
+ 47
186
+ 00:31:59,760 --> 00:32:41,520
187
+ we're admitting a certain degree of ignorance in our models and that means our models are going to be wrong and they're going to misunderstand situations that they are put into and it can be very upsetting and even harmful to be misunderstood by a machine learning model so against this backdrop of greater interest or higher stakes a number of common types of ethical concern have coalesced in the last couple of years and there are somewhat established camps of answers to these questions and you should at least know where it is you stand on these core questions so four really important questions that you should be able to answer about anything that you build with machine learning are is the model fair and what does that mean in
188
+
189
+ 48
190
+ 00:32:39,840 --> 00:33:20,039
191
+ this situation is the system that you're building accountable who owns the data involved in this system and finally and perhaps most importantly and undergirding all of these questions is should this system be built at all so first is the model we're building fair the classic case on this comes from criminal justice from the COMPAS system for predicting before trial whether a defendant will be arrested again so if they're arrested again that suggests they committed a crime during that time and so this is assessing a certain degree of risk for additional harm while the justice system is deciding what to do about a previous arrest and potential crime so the operationalization here was a 10-point rearrest probability based off of past
192
+
193
+ 49
194
+ 00:33:17,640 --> 00:34:07,799
195
+ data about this person and they set a goal from the very beginning to be less biased than human judges so they operationalized that by calibrating these arrest probabilities and making sure that if say a person received a 2 on this scale they had a 20% chance of being arrested again and then critically that those probabilities were calibrated across subgroups so racial bias is one of the primary concerns around bias in criminal justice in the United States and so they took care to make sure that these probabilities of rearrest were calibrated for all racial groups the system was deployed and it is actually used all around the United States it's proprietary so it's difficult to analyze but using the Freedom of Information Act
196
+
197
+ 50
198
+ 00:34:05,519 --> 00:34:53,820
199
+ and by collating together a bunch of records some people at ProPublica were able to run their own analysis of this algorithm and they determined that though this calibration that COMPAS claimed for arrest probabilities was there so the model was not more or less wrong for one racial group or another the way that the model tended to fail was different across racial groups so the model had more false positives for black defendants so saying that somebody was higher risk but then them not going on to reoffend and had more false negatives for white defendants so labeling them as low risk and then them going on to reoffend so despite Northpointe the creators of COMPAS taking into account bias from the beginning
200
+
201
+ 51
202
+ 00:34:51,599 --> 00:35:31,260
203
+ they ended up with an algorithm with this undesirable property of being more likely to effectively falsely accuse defendants who were black than defendants who were white this report touched off a ton of controversy and back and forth between ProPublica the creator of the article and Northpointe the creators of COMPAS and also a bunch of research and it turned out that some quick algebra revealed that some form of race-based bias is inevitable in this setting so the things that we care about when we're building a binary classifier are relatively simple you can write down all of these metrics directly so we care about things like the false positive rate which means we've imprisoned somebody with no need the false negative
204
+
205
+ 52
206
+ 00:35:29,760 --> 00:36:16,380
207
+ rate which means we missed an opportunity to prevent a situation that led to an arrest and then we also care about the positive predictive value which is this rearrest probability that COMPAS was calibrated on so because all of these metrics are related to each other and related to the joint probability distribution of our model's labels and the actual ground truth if the probability of rearrest differs across groups then we have to have that some of these numbers are different across groups and that is a form of racial bias so the basic way that this argument works just involves rearranging these numbers and saying that if the numbers on the left side of this equation are different for group one and group two then it can't possibly be the
208
+
209
+ 53
210
+ 00:36:14,280 --> 00:36:54,900
211
+ case that all three of the numbers on the right hand side are the same for group one and group two and I'm presenting this here as though it only impacts these specific binary classification metrics but there are in fact a very large number of definitions of fairness which are mutually incompatible so there's a really incredible tutorial by Arvind Narayanan who was also the first author on the dark patterns work on a bunch of these fairness definitions what they mean and why they're incommensurate so I can highly recommend that lecture so returning to our concrete case if the prevalences differ across groups then one of the things that we're concerned with the false positive rate the false negative
212
+
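The equation referred to in this cue is not reproduced in the transcript. One standard way to write the relationship, assuming the usual confusion-matrix definitions (the symbol names here are my own choice: $p$ for the prevalence, i.e. the base rate of rearrest in a group, $\mathrm{FPR}$ and $\mathrm{FNR}$ for the false positive and false negative rates, and $\mathrm{PPV}$ for the positive predictive value), is:

```latex
% With N people in a group:
%   TP = p (1 - FNR) N,   FP = (1 - p) FPR N,   PPV = TP / (TP + FP).
% Taking the ratio FP / TP = (1 - PPV) / PPV and rearranging gives
\[
  \frac{p}{1 - p}
  \;=\;
  \frac{\mathrm{FPR}}{1 - \mathrm{FNR}}
  \cdot
  \frac{\mathrm{PPV}}{1 - \mathrm{PPV}}
\]
% The left side depends only on the group's prevalence; the right side
% depends only on the three error metrics.
```

If the left side differs between two groups, the right side must differ too, so FPR, FNR, and PPV cannot all be equal across those groups.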
213
+ 54
214
+ 00:36:53,160 --> 00:37:35,339
215
+ rate or the positive predictive value will not be equal and that's something that people can point to and say that's unfair in the middle that positive predictive value was equalized across groups in COMPAS that was what they really wanted to make sure was equal across groups and because the probability of rearrest was larger for black defendants then either the false positive rate had to be bigger or the false negative rate had to be bigger for that group and there's an analysis in this Chouldechova 2017 paper that suggests that the usual way that this will work is that there will be a higher false positive rate for the group with a larger prevalence so the fact that there will be some form of unfairness that we
216
+
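To make the trade-off concrete, here is a small Python sketch with made-up confusion matrices for two groups. The counts are invented for illustration and are not COMPAS data. Both groups are given the same positive predictive value and the same false negative rate, but different prevalence, and the false positive rate is then forced to differ, matching the identity FPR = p/(1-p) * (1-PPV)/PPV * (1-FNR).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Confusion-matrix counts for one group (hypothetical numbers)."""
    tp: int  # predicted high risk, rearrested
    fp: int  # predicted high risk, not rearrested
    fn: int  # predicted low risk, rearrested
    tn: int  # predicted low risk, not rearrested

    @property
    def prevalence(self):
        return (self.tp + self.fn) / (self.tp + self.fp + self.fn + self.tn)

    @property
    def fpr(self):
        return self.fp / (self.fp + self.tn)

    @property
    def fnr(self):
        return self.fn / (self.fn + self.tp)

    @property
    def ppv(self):
        return self.tp / (self.tp + self.fp)


def implied_fpr(prevalence, ppv, fnr):
    """False positive rate forced by prevalence, PPV, and FNR."""
    return prevalence / (1 - prevalence) * (1 - ppv) / ppv * (1 - fnr)


# Group A has prevalence 0.5, group B has prevalence 0.3.
# Both have PPV = 0.6 and FNR = 0.4 by construction.
group_a = Confusion(tp=30, fp=20, fn=20, tn=30)
group_b = Confusion(tp=18, fp=12, fn=12, tn=58)

for name, g in [("A", group_a), ("B", group_b)]:
    print(name, g.prevalence, g.ppv, g.fnr, g.fpr,
          implied_fpr(g.prevalence, g.ppv, g.fnr))
```

With these counts the higher-prevalence group ends up with the higher false positive rate, which is the pattern the cue above describes.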
217
+ 55
218
+ 00:37:33,420 --> 00:38:14,280
219
+ can't just say oh well all these metrics are the same across all groups and so everything has to be fair that fact is fixed but the impact of the unfairness of models is not fixed the story is often presented as oh well no matter what the journalists would have found something to complain about there's always critics and so you know you don't need to worry about fairness that much but I think it's important to note that the particular kind of unfairness that came about from this model from focusing on this positive predictive value led to a higher false positive rate more unnecessary imprisonment for black defendants if the false positive rate and the positive predictive value were equalized across groups that would have
220
+
221
+ 56
222
+ 00:38:12,420 --> 00:38:54,119
223
+ led to a higher false negative rate for black defendants relative to white defendants and in the context of American politics and concerns about racial inequity in the criminal justice system bias against white defendants is not going to lead to complaints from the same people and has a different relationship to the historical operation of the American justice system and so far from this being a story about the hopelessness of thinking about or caring about fairness this is a story about the necessity of confronting the trade-offs that are inevitably going to come up so some researchers at Google made a nice little tool where you can try thinking through and making these trade-offs for yourself it's a loan decision rather
224
+
225
+ 57
226
+ 00:38:51,900 --> 00:39:37,740
227
+ than a criminal justice decision but it has a lot of the same properties you have a binary classifier you have different possible goals that you might set either maximizing the profit of the loaning entity or providing equal opportunity to the two groups and it's very helpful for building intuition on these fairness metrics and what it means to pick one over the other and these events in this controversy kicked off a real flurry of research on fairness and there's now been several years of this Fairness Accountability and Transparency conference FAccT there's tons of work on both algorithmic level approaches to try and measure these fairness metrics incorporate them into training and also more qualitative work on designing
228
+
229
+ 58
230
+ 00:39:36,180 --> 00:40:20,940
231
+ systems that are more transparent and accountable so the COMPAS example is really important for dramatizing these issues of fairness but I think it's very critical for this case and for many others to step back and ask whether this model should be built at all so this algorithm for scoring risk is proprietary and uninterpretable it doesn't give answers for why a person is higher risk or not and because it is closed source there's no way to examine it it achieves an accuracy of about 65% which is quite high given that the marginal probability of reoffense is much lower than 50% but it's important to compare the baselines here pulling together a bunch of non-experts like you would on a jury has an accuracy of about
232
+
233
+ 59
234
+ 00:40:17,640 --> 00:41:01,140
235
+ 65 percent and creating a simple scoring system on the basis of how old the person is and how many prior arrests they have also has an accuracy of around 65% and it's much easier to feel comfortable with the system that says if you've been arrested twice then you have a higher risk of being arrested again and so you'll be imprisoned before trial than a system that just says oh well we ran the numbers and it looks like you have a high chance of committing a crime but even framing this problem in terms of who is likely to be rearrested is already potentially a mistake so a slightly different example of predicting failure to appear in court was tweeted out by Moritz Hardt who's one of the main researchers in this area choosing
236
+
237
+ 60
238
+ 00:40:59,520 --> 00:41:37,320
239
+ to try to predict who will fail to appear in court treating this as something that is then a fact of the universe that this person is likely to fail to appear in court and then intervening on this and punishing them for that fact it's important to recognize why people fail to appear in court in general often it's because they don't have child care to cover for the care of their dependents while they're in court they don't have transportation their work schedule is inflexible or the court appointment schedule is inflexible or unreasonable it'd be better to implement steps to mitigate these issues and reduce the number of people who are likely to fail to appear in court for example by making it possible to join
240
+
241
+ 61
242
+ 00:41:35,579 --> 00:42:20,940
243
+ court remotely that's a far better approach for all involved than simply getting really really good at predicting who currently fails to appear in court so it's important to remember that the things that we're measuring the things that we're predicting are not the be-all end-all in themselves the things that we care about are things like an effective and fair justice system and this comes up perhaps most acutely in the case of COMPAS when we recognize that rearrest is not the same as recidivism it's not the same thing as committing more crimes being rearrested requires that a police officer believes that you committed a crime police officers are subject to their own biases and patterns of policing result in a far higher fraction
244
+
245
+ 62
246
+ 00:42:18,480 --> 00:43:04,440
247
+ of crimes being caught for some groups than for others and so our real goal in terms of fairness and criminal justice might be around reducing those kinds of unfair impacts and using past rearrest data that we know has these issues to determine who is treated more harshly by the criminal justice system is likely to exacerbate these issues there's also a notion of model fairness that is broader than just models that make decisions about human beings so even if you're designing a model that works on text or works on images you should consider which kinds of people your model works well for and in general representation both on engineering and management teams and in data sets really matters for this kind of model fairness so it's
248
+
249
+ 63
250
+ 00:43:02,640 --> 00:43:46,680
251
+ unfortunately still very easy to make machine learning powered technology that fails for minoritized groups so for example off-the-shelf computer vision tools will often fail on darker skin so this is an example by Joy Buolamwini from MIT on how a computer vision based project that she was working on ran into difficulties because the face detection algorithm could not detect her face even though it could detect the faces of some of her friends with lighter skin and in fact she found that just putting on a white mask was enough to get the computer vision model to detect her face so this is unfortunately not a new issue in technology it's just a more salient one with machine learning so one example is that hand soap
252
+
253
+ 64
254
+ 00:43:43,920 --> 00:44:32,640
255
+ dispensers that use infrared to determine when to dispense soap will often work better for lighter skin than darker skin and issues around lighting and vision and skin tone go back to the foundation of photography let alone computer vision the design of film of cameras and printing processes was oriented around primarily making lighter skin photograph well as in these so-called Shirley cards that were used by Kodak for calibration these resulted in much worse experiences for people with darker skin using these cameras there has been a good amount of work on this and progress since four or five years ago one example of the kind of tool that can help with this are these model cards this particular format
256
+
257
+ 65
258
+ 00:44:30,660 --> 00:45:14,760
259
+ for talking about what a model can and cannot do that was published by a number of researchers including Margaret Mitchell and Timnit Gebru it includes explicitly considering things like on which human subgroups of interest many of them minoritized identities how well does the model perform Hugging Face has good integrations for creating these kinds of model cards I think it's important to note that just solving these things by changing the data around or by calculating demographic information is not really an adequate response if the CEO of Kodak or their partner had been photographed poorly by those cameras then there's no chance that that issue would have been allowed to stay for decades so when you're
260
+
261
+ 66
262
+ 00:45:12,359 --> 00:45:57,839
263
+ looking at inviting people for talks hiring people or joining organizations you should try to make sure that you have worked to reduce the bias of that discovery process by diversifying your network and your input sources the Diversify Tech job board is a really wonderful source for candidates and then there are also professional organizations inside of the ML world Black in AI and Women in Data Science being two of the larger and more successful ones these are great places to get started to make the kinds of professional connections that can improve the representation of these minoritized groups in the engineering and design and product management process where these kinds of issues should be solved a lot of progress has
264
+
265
+ 67
266
+ 00:45:55,680 --> 00:46:44,040
267
+ been made but these problems are still pretty difficult to solve an unbiased face detector might not be so challenging but unbiased image generation is still really difficult for example if you make an image generation model from internet scraped data without any safeguards in place then if you ask it to generate a picture of a CEO it will generate the stereotypical CEO a six foot or taller white man and this applies across a wide set of jobs and situations people can find themselves in and this led to a lot of criticism of early text-to-image generation models like DALL-E and the solution that OpenAI opted for was to edit prompts that people put in if you did not fully specify what kind of person should be generated then
268
+
269
+ 68
270
+ 00:46:41,579 --> 00:47:21,300
271
+ race and gender words would be added to the prompt with weights based on the world's population so people discovered this somewhat embarrassingly by writing prompts like a person holding a sign that says or pixel art of a person holding a text sign that says and then seeing that the appended words were then printed out by the model suffice it to say that this change did not make very many people very happy and indicates that more work needs to be done to de-bias image generation models at a broader level than just fairness we can also ask whether the system we're building is accountable to the people it's serving or acting upon and this is important because some people can consider explanation and accountability
272
+
273
+ 69
274
+ 00:47:19,560 --> 00:48:00,060
275
+ in the face of important judgments to be human rights this is the right to an explanation in the European Union's General Data Protection Regulation GDPR there is a subsection that mentions the right to obtain an explanation of a decision reached after automated assessment and the right to challenge that decision the legal status here is a little bit unclear there's a nice arXiv paper that talks about this a bit about what the right to an explanation might mean but what's more important for our purposes is just to know that there is an increasing chorus of people claiming that this is indeed a human right and it's not an entirely new concept and it's not even really technology or automation specific as far
276
+
277
+ 70
278
+ 00:47:57,420 --> 00:48:40,800
279
+ back as 1974 it has been the law in the United States that if you deny credit to a person you must disclose the principal reasons for denying that credit application and in fact I found this interesting it's expected that you provide no more than four reasons why you denied them credit but the general idea that somebody has a right to know why something happened to them in certain cases is enshrined in some laws so what are we supposed to do if we use a deep neural network to decide whether somebody should be advanced credit or not so there are some off-the-shelf methods for introspecting deep neural networks that are all based off of input output gradients how would changing the pixels of this input image change the
280
+
281
+ 71
282
+ 00:48:38,579 --> 00:49:18,359
283
+ class probabilities in the output so this captures a kind of local contribution but as you can see from the small image there it doesn't produce a very compelling map and there's no reason to think that just changing one pixel a tiny bit should really change the model's output that much one improvement to that called SmoothGrad is to add noise to the input and then average results kind of getting a sense for what the gradients look like in a general area around the input there isn't great theory on why that should give better explanations but people tend to find these explanations better and you can see in the SmoothGrad image on the left there that you can pick out the picture of a bird it seems like that is
284
+
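The SmoothGrad idea described in this cue (add noise to the input, take the gradient at each noisy copy, average) can be sketched with NumPy. This is a toy illustration, not the published implementation: the model is a hand-written weighted sum of ReLUs with an analytic gradient, and the names `toy_grad`, `smoothgrad`, `sigma`, and `n_samples` are assumptions made for the example. A real use would get gradients from an autodiff framework.

```python
import numpy as np

# Hypothetical toy model: f(x) = sum_i W[i] * relu(x[i]).
W = np.array([2.0, 3.0, 4.0])


def toy_grad(x):
    """Analytic input gradient of the toy model: W[i] where x[i] > 0, else 0."""
    return W * (x > 0)


def smoothgrad(x, grad_fn, sigma=0.1, n_samples=50, seed=0):
    """Average the input gradient over Gaussian-perturbed copies of x."""
    rng = np.random.default_rng(seed)
    noise = rng.normal(0.0, sigma, size=(n_samples,) + x.shape)
    grads = np.array([grad_fn(x + eps) for eps in noise])
    return grads.mean(axis=0)


x = np.array([1.0, -1.0, 0.0])
plain = toy_grad(x)                 # raw gradient at exactly x
smooth = smoothgrad(x, toy_grad)    # gradient averaged over a neighborhood of x
print(plain, smooth)
```

The third input sits exactly on the ReLU kink, so the raw gradient there is an arbitrary 0, while the averaged gradient lands strictly between 0 and the weight. That is the sense in which SmoothGrad summarizes the gradient in a region around the input rather than at a single point.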
285
+ 72
286
+ 00:49:16,440 --> 00:50:04,560
287
+ giving a better explanation or an explanation that we like better for why this network is identifying that as a picture of a bird there's a bunch of kind of hacky methods like specific tricks you need when you're using the ReLU activation there's some methods that are better for classification like Grad-CAM one that is more popular integrated gradients takes the integral of the gradient along a path from some baseline to the final image and this method has a nice interpretation in terms of cooperative game theory something called a Shapley value that quantifies how much a particular collection of players in a game contributed to the final reward and adding noise to integrated gradients tends to produce really clean
288
+
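Integrated gradients, as described in this cue, integrates the gradient along a straight path from a baseline to the input. A minimal NumPy sketch follows, assuming a toy logistic model with an analytic gradient and a midpoint Riemann sum for the integral; the names `toy_model`, `toy_grad`, and `steps` are illustrative choices, not part of any library.

```python
import numpy as np

# Hypothetical toy model: f(x) = sigmoid(W . x).
W = np.array([1.5, -2.0, 0.5])


def toy_model(x):
    return 1.0 / (1.0 + np.exp(-(W @ x)))


def toy_grad(x):
    """Analytic input gradient of the logistic model."""
    s = toy_model(x)
    return s * (1.0 - s) * W


def integrated_gradients(x, baseline, grad_fn, steps=200):
    """Approximate (x - baseline) * integral_0^1 grad f(baseline + a (x - baseline)) da
    with a midpoint Riemann sum over `steps` points on the path."""
    alphas = (np.arange(steps) + 0.5) / steps
    grads = np.array([grad_fn(baseline + a * (x - baseline)) for a in alphas])
    return (x - baseline) * grads.mean(axis=0)


x = np.array([1.0, 0.5, -1.0])
baseline = np.zeros_like(x)
attributions = integrated_gradients(x, baseline, toy_grad)

# Completeness: attributions should sum to f(x) - f(baseline),
# up to the error of the numerical integration.
gap = attributions.sum() - (toy_model(x) - toy_model(baseline))
print(attributions, gap)
```

The completeness check is the property that gives the method its game-theoretic reading: the total change in output between baseline and input is split among the input features.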
289
+ 73
290
+ 00:50:02,460 --> 00:50:45,480
291
+ explanations that people like but unfortunately these methods are generally not very robust their outputs tend to correlate pretty strongly in the case of images with just an edge detector there's built-in biases to convolutional networks and the architectures that we use that tend to emphasize certain features of images what this particular chart shows from this arXiv paper by Julius Adebayo Moritz Hardt and others is that even as we randomize layers in the network going from left to right we are randomizing starting at the top of the network and then randomizing more layers going down even for popular methods like integrated gradients with smoothing or guided backpropagation we can effectively randomize
292
+
293
+ 74
294
+ 00:50:43,800 --> 00:51:27,140
295
+ a really large fraction of the network without changing the gross features of the explanation and resulting in an explanation that people would still accept and believe even though this network is now producing random output so in general introspecting deep neural networks and figuring out what's going on inside them requires something that looks a lot more like a reverse engineering process that's still very much a research problem there's some great work on Distill on reverse engineering primarily vision networks and then some great work from Anthropic recently on transformer circuits that's reverse engineering large language models and Chris Olah is the researcher who's done the most work here but it still is the sort of thing that
296
+
297
+ 75
298
+ 00:51:24,300 --> 00:52:06,839
299
+ even getting a loose qualitative sense for how neural networks work and what they are doing in response to inputs is still the type of thing that takes a research team several years so Building A system that can explain why it took a particular decision is maybe not currently possible with deep neural networks but that doesn't mean that the systems that we build with them have to be unaccountable if somebody dislikes the decision that they get and the explanation that we give is well the neural network said you shouldn't get a loan and they challenge that it might be time to bring in a human in the loop to make that decision and building that in to the system so that it's an expected mode of operation and is considered an
300
+
301
+ 76
302
+ 00:52:03,900 --> 00:52:50,099
303
+ important part of the feedback and the operation of the system is key to building an accountable system so this book Automating Inequality by Virginia Eubanks talks a little bit about the ways in which technical systems as they're built today are very prone to this unaccountability where the people who are most impacted by these systems some of the most critical stakeholders for these systems for example recipients of government assistance are unable to have their voices and their needs heard and taken into account in the operation of a system so this is perhaps the point at which you should ask when building a system with machine learning whether this should be built at all and in particular to ask who benefits and who
304
+
305
+ 77
306
+ 00:52:47,579 --> 00:53:28,500
307
+ is harmed by automating this task in addition to concerns around the behavior of models increasing concern has been pointed towards data and in particular who owns and who has rights to the data involved in the creation of machine learning systems it's important to remember that the training data that we use for our machine learning algorithms is almost always generated by humans and they generally feel some ownership over that data and we end up behaving a little bit like this comic on the right where they hand us some data that they made and then we say oh this is ours now I made this and in particular the large data sets used to train the really large models that are pushing the frontiers of what is possible with machine learning
308
+
309
+ 78
310
+ 00:53:26,220 --> 00:54:09,359
311
+ are produced by crawling the Internet by searching over all the images all the text posted on the internet and pulling large fractions of it down and many people are not aware that this is possible let alone legal and so to some extent any consent that they gave to their data being used was not informed and then additionally as technology has changed in the last decade and machine learning has gotten better what can be done with data has changed somebody uploading their art a decade ago certainly did not have on their radar the idea that they were giving consent to that art being used to create an algorithm that can mimic its style and you can in fact check whether an image of interest to you has been used to
312
+
313
+ 79
314
+ 00:54:06,240 --> 00:54:53,520
315
+ train one of the large text-to-image models specifically this haveibeentrained.com website will search through the LAION data set that is used to train the Stable Diffusion model for images that you upload so you can look to see if any pictures of you were incorporated into the data set and this goes further than just pictures that people might rather not have used in this way to actual data that has somehow been obtained illegally there's an Ars Technica article a particular artist who was interested in this found that some of their medical photos which they did not consent to have uploaded to the internet somehow found their way into the LAION data set and so cleaning large web scraped data sets from this
316
+
317
+ 80
318
+ 00:54:51,540 --> 00:55:37,800
319
+ kind of illegally obtained data is definitely going to be important as more attention is paid to these models as they are productized and monetized and more on people's radar even for data that is obtained legally saying well technically you did agree to this does not generally satisfy people remember the Facebook emotion research study technically some reading of the Facebook user data policy did support the way that they were running their experiment but many users disagreed many artists feel that creating an art generation tool that threatens their livelihoods and copies art down to the point of even faking watermarks and logos on images when told to recreate the style of an artist is an unethical use of that data
320
+
321
+ 81
322
+ 00:55:35,700 --> 00:56:23,400
323
+ and it certainly is the case that creating a sort of parrot that can mimic somebody is something that a lot of people find concerning dealing with these issues around data governance is likely to be a new frontier Emad Mostaque of Stable Diffusion has said that he's partnering with people to create mechanisms for artists to opt in or opt out of being included in training data sets for future versions of Stable Diffusion I found that noteworthy because Mostaque has been very vocal in his defense of image generation technology and of what it can be used for but even he is interested in adjusting the way data is used there's also been work from tech forward artists like Holly Herndon who was involved in the creation of Have I Been Trained
324
+
325
+ 82
326
+ 00:56:20,520 --> 00:57:05,460
327
+ around trying to incorporate AI systems into art in a way that empowers artists and compensates them rather than immiserating them just as we can create cards for models we can also create cards for data sets that describe how they were curated what the sources were and any other potential issues with the data and perhaps in the future even how to opt out of or be removed from a data set so this is an example from Hugging Face as with model cards there's lots of good examples of data set cards on Hugging Face there's also a nice checklist the Deon ethics checklist that is mostly focused around data ethics but covers a lot of other ground they also have this nice list of examples for each question in their checklist of cases
328
+
329
+ 83
330
+ 00:57:02,760 --> 00:57:47,339
331
+ where people have run into ethical or legal trouble by building an ml project that didn't satisfy a particular checklist item running underneath all of this has been this final most important question of whether this system should be built at all one particular use case that very frequently elicits this question is building ml-powered weaponry ml-powered weaponry is already here it's already starting to be deployed in the world there are some remote controlled weapons that use computer vision for targeting deployed by the Israeli military in the West Bank using this Smart Shooter technology that's designed to in principle take normal weapons and add computer vision based targeting to them to make them into smart weapons
332
+
333
+ 84
334
+ 00:57:44,819 --> 00:58:29,220
335
+ right now this deployed system shown on the left uses only sponge tipped bullets which are designed to be less lethal but they can still cause serious injury and according to the deployers is in the pilot stage so it's a little unclear to what extent autonomous weaponry is already here and being used because the definition is a little bit blurry so for example the Harop drone shown in the top left is a loitering munition a type of drone that can fly around hold its position for a while and then automatically destroy any radar system that locks onto it this type of drone was used in the Nagorno-Karabakh war between Armenia and Azerbaijan in 2021 but there's also older autonomous weapon systems the Phalanx CIWS is designed
336
+
337
+ 85
338
+ 00:58:27,660 --> 00:59:10,740
339
+ to automatically fire at targets moving towards naval vessels at very very high velocities so these are velocities that are usually only achieved by rocket munitions not by manned craft and that system's been used since at least the first Gulf War in 1991 there was an analysis in 2017 by The Economist to try and look for how many systems with automated targeting there were and in particular how many of them could engage with targets without involving humans at all so that would be the last section of human out of the loop systems but given the general level of secrecy in some cases and hype in others around military technology it can be difficult to get a very clear sense and the blurriness of this definition has led
340
+
341
+ 86
342
+ 00:59:08,940 --> 00:59:50,339
343
+ some to say that autonomous weapons are actually at least 100 years old for example anti-personnel mines that were used starting in the 30s and in World War II attempt to detect whether a person has come close to them and then explode and in some sense that is an autonomous weapon and if we broaden our definition that far then maybe lots of different kinds of traps are some form of autonomous weapon but just because these weapons already exist and maybe even have been around for a century does not mean that designing ml-powered weapons is ethical anti-personnel mines in fact are the subject of a Mine Ban Treaty that a very large number of countries have signed unfortunately not some of the countries with the largest
344
+
345
+ 87
346
+ 00:59:47,400 --> 01:00:26,880
347
+ militaries in the world but that at least suggests that for one type of autonomous weapon that has caused a tremendous amount of collateral damage there's interest in banning them and so perhaps rather than building these autonomous weapons so we can then ban them it would be better if we just didn't build them at all so the Campaign to Stop Killer Robots is a group to look into if this is something that's interesting to you it brings us to the end of our tour of the four common questions that people raise around the ethics of building an ml system I've provided some of my answers to these questions and some of the common answers to these questions but you should have thoughtful answers to these for the
348
+
349
+ 88
350
+ 01:00:25,619 --> 01:01:07,619
351
+ individual projects that you work on first is the model fair I think it's generally possible but it requires trade-offs is the system accountable I think it's pretty challenging to make interpretable deep Learning Systems where interpretability allows an explanation for why a decision was made but making a system that's accountable where answers can be changed in response to user feedback or perhaps user lawsuit is possible you'll definitely want to answer the question of who owns the data up front and be on the lookout for changes especially to these large-scale internet scraped data sets and then lastly should this be built at all you'll want to ask this repeatedly throughout the life cycle of the technology I wanted to close this
352
+
353
+ 89
354
+ 01:01:04,920 --> 01:01:53,460
355
+ section by talking about just how much the machine learning world can learn from medicine and from applications of machine learning to medical problems this is a field I've had a chance to work in and I've seen some of the best work on building with ML responsibly come from this field and fundamentally it's because of a mismatch between machine learning and medicine that impedance mismatch has led to a ton of learning so first we'll talk about the Fiasco that was machine learning and the covid-19 pandemic then briefly consider why medicine would have this big of a mismatch with machine learning and what the benefits of examining it closer might be and then lastly we'll talk about some concrete research on auditing
356
+
357
+ 90
358
+ 01:01:50,819 --> 01:02:34,680
359
+ and frameworks for building with ML that have come out of medicine first something that should be scary and embarrassing for people in machine learning medical researchers found that almost all machine learning research on covid-19 was effectively useless this is in the context of a biomedical response to covid-19 that was an absolute triumph in the first year vaccinations prevented some tens of millions of deaths these vaccines were designed based on novel technologies like lipid nanoparticles for delivering mRNA and even more traditional techniques like small molecule therapeutics for example Paxlovid the quality of research that was done was extremely high so on the right we have an inferred 3D structure
360
+
361
+ 91
362
+ 01:02:32,099 --> 01:03:24,900
363
+ for a coronavirus protein in complex with the primary effective molecule in Paxlovid allowing for a mechanistic understanding of how this drug was working at the atomic level and at this crucial time machine learning did not really acquit itself well so there were two reviews one in the BMJ and one in Nature that reviewed a large set of prediction models for covid-19 either prognosis or diagnosis primarily prognosis in the case of the Wynants et al. paper in the BMJ or diagnosis on the basis of chest x-rays and CT scans and both of these reviews found that almost all of the papers were insufficiently documented did not follow best practices for developing models and did not have sufficient external validation testing
364
+
365
+ 92
366
+ 01:03:21,059 --> 01:04:06,780
367
+ on external data to justify any wider use of these models even though many of them were provided as software or APIs ready to be used in a clinical setting so the depth of the errors here is really very sobering a full quarter of the papers analyzed in the Roberts et al. review used a pneumonia data set as a control group so the idea was we don't want our model just to detect whether people are sick or not just having covid patients and healthy patients might cause models that detect all pneumonias as covid so let's incorporate this pneumonia data set but they failed to mention and perhaps failed to notice that the pneumonia data set was all children all pediatric patients so the models that they were
368
+
369
+ 93
370
+ 01:04:04,799 --> 01:04:52,079
371
+ training were very likely just detecting children versus adults because that would give them perfect performance on pneumonia versus covid on that data set so it's a pretty egregious error of modeling and data set construction alongside a bunch of other more subtle errors around proper validation and reporting of models and methods so I think one reason for the substantial difference in responses here is that medicine both in practice and in research has a very strong professional culture of ethics that equips it to handle very very serious and difficult problems at least in the United States medical doctors still take the Hippocratic Oath parts of which date back all the way to Hippocrates one of the founding fathers of Greek medicine
372
+
373
+ 94
374
+ 01:04:49,559 --> 01:05:42,839
375
+ and one of the core precepts of that oath is to do no harm meanwhile one of the core precepts of the contemporary tech industry represented here by this ml generated Greek bust of Mark Zuckerberg is to move fast and break things with the implication that breaking things is not so bad and while that's probably the right approach for building lots of kinds of web applications and other software when this culture gets applied to things like medicine the results can be really ugly one particularly striking example of this was when a retinal implant that was used to restore sight to some blind people was deprecated by the vendor and so stopped working and there was no recourse for these patients because there is no other organization capable
376
+
377
+ 95
378
+ 01:05:40,020 --> 01:06:23,460
379
+ of maintaining these devices the news here is not all bad for machine learning there are researchers who are working at the intersection of medicine and machine learning and developing and proposing solutions to some of these issues that I think might have broad applicability on building responsibly with machine learning first the clinical trial standards that are used for other medical devices and for pharmaceuticals have been extended to machine learning the SPIRIT standard for designing clinical trials and the CONSORT standard for reporting results of clinical trials these have both been extended to include ml with SPIRIT-AI and CONSORT-AI two links at the bottom of this slide for the details on the contents of both of
380
+
381
+ 96
382
+ 01:06:21,900 --> 01:07:07,380
383
+ those standards one thing I wanted to highlight here was the process by which these standards were created and which is reported in those research articles which included an international survey with over a hundred participants and then a conference with 30 participants to come up with a final checklist and then a pilot use of it to determine how well it worked so the standard for producing standards in medicine is also quite high and something we could very much learn from in machine learning so because of that work and because people have pointed out these concerns progress is being made on doing better work in machine learning for medicine this recent paper in the Journal of the American Medical Association does a
384
+
385
+ 97
386
+ 01:07:05,400 --> 01:07:52,680
387
+ review of clinical trials involving machine learning and finds that for many of the components of these clinical trial standards compliance and quality is very high incorporating clinical context stating very clearly how the method will contribute to clinical care but there are definitely some places with poor compliance for example interestingly enough very few trials reported how low quality data was handled how data was assessed for quality and how cases of poor quality data should be handled I think that's also something that the broader machine learning world could do a better job on and then also analysis of errors that models made which also shows up in medical research and clinical trials as analysis of adverse events this kind of
388
+
389
+ 98
390
+ 01:07:50,460 --> 01:08:34,199
391
+ error analysis was not commonly done and this is something that in talking about testing and troubleshooting and in talking about model monitoring and continual learning we've tried to emphasize the importance of this kind of error analysis for building with ML there's also this really gorgeous pair of papers by Lauren Oakden-Rayner and others in The Lancet that both developed and applied this algorithmic auditing framework for medical ml so this is something that is probably easier to incorporate into other ml workflows than is a full-on clinical trial approach but still has some of the same rigor incorporates checklists and tasks and defined artifacts that highlight what the problems are and what needs to be
392
+
393
+ 99
394
+ 01:08:31,980 --> 01:09:11,940
395
+ tracked and shared while building a machine Learning System one particular component that I wanted to highlight and is here indicated in blue is that there's a big emphasis on failure modes and error analysis and what they call adversarial testing which is coming up with different kinds of inputs to put into the model to see how it performs so sort of like a behavioral check on the model these are all things that we've emphasized as part of how to build a model well there's lots of other components of this audit that the broader ml Community would do well to incorporate into their work there's a ton of really great work being done a lot of these papers are just within the last three or six months so I think it's
396
+
397
+ 100
398
+ 01:09:10,620 --> 01:09:51,779
399
+ a pretty good idea to keep your finger on the pulse here so to speak in medical ml the Stanford Institute for AI and medicine has a regular panel that gets posted on YouTube they also share a lot of great other kinds of content via Twitter and then a lot of the researchers who did some of the work that I shared Lauren Oakden-Rayner Benjamin Kann are also active on Twitter along with other folks who've done great work that I didn't get time to talk about like Judy Gichoya and Matt Lungren closing out this section like medicine machine learning can be very intimately intertwined with people's lives and so ethics is really really salient perhaps the most important ethical question to ask ourselves over
400
+
401
+ 101
402
+ 01:09:48,540 --> 01:10:33,360
403
+ and over again is should this system be built at all what are the implications of building the system of automating this task or this work and it seems clear that if we don't regulate ourselves we will end up being regulated and so we should learn from older Industries like medicine rather than just assuming we can disrupt our way through so as our final section I want to talk about the ethics of artificial intelligence this is clearly a frontier both for the field of Ethics trying to think through these problems and for the technology communities that are building this I think that right now false claims and hype around artificial intelligence are the most pressing concern but we shouldn't sleep on some of the major
404
+
405
+ 102
406
+ 01:10:31,560 --> 01:11:18,540
407
+ ethical issues that are potentially oncoming with AI so right now claims and hyperbole and hype around artificial intelligence are outpacing capabilities even though those capabilities are also growing fast and this risks a kind of blowback so one way to summarize this is say that if you call something autopilot people are going to treat it like autopilot and then be upset or worse when that's not the case so famously there is an incident where somebody who believed that Tesla's lane and braking assistance system Autopilot was really full self-driving was killed in a car crash and this gap between what people expect out of ml systems and what they actually get is something that Josh talked about in the project management
408
+
409
+ 103
410
+ 01:11:16,800 --> 01:12:02,219
411
+ lecture so this is something that we're already having to incorporate into our engineering and our product design that people are overselling the capacities of ml systems in a way that gives users a bad idea of what is possible and this problem is very widespread even large and mature organizations like IBM can create products like Watson which was a capable question answering system and then sell it as artificial intelligence and try to revolutionize or disrupt fields like medicine and then end up falling far short of these extremely lofty goals they've set themselves and along the way they get at least in the beginning journalistic coverage with pictures of robot hands reaching out to grab balls of light or
412
+
413
+ 104
414
+ 01:12:00,300 --> 01:12:44,400
415
+ brains inside computers or computers inside brains so not only do companies oversell what their technology can do but these overstatements are repeated or amplified by traditional and social media and this problem even extends to academia there is a now infamous case where Geoff Hinton in 2017 said that radiologists at that point were like Wile E. Coyote already over the edge of the cliff and haven't realized that there's no ground underneath them and that people should stop training radiologists now because within five years AKA now deep learning is going to be better than radiologists some of the work in the intersection of medicine and ml that I presented was done by people who were in their radiology training at
416
+
417
+ 105
418
+ 01:12:42,659 --> 01:13:27,840
419
+ the time around the time this statement was made and were lucky that they continued training as radiologists while also gaining ml expertise so that they could do the slow hard work of bringing deep learning and machine learning into radiology this overall problem of overselling artificial intelligence you could call AI snake oil so that's the name of an upcoming book and a new Substack by Arvind Narayanan are now very good friend and so this refers not just to people overselling the capabilities of large language models or predicting that we'll have artificial intelligence by Christmas but people who use this general aura of hype and excitement around artificial intelligence to sell shoddy technology an example from this really
420
+
421
+ 106
422
+ 01:13:25,440 --> 01:14:12,960
423
+ great set of slides linked here the tool Elevate that claims to be able to assess personality and job suitability from a 30 second video including identifying whether the person in the video is a change agent or not so the call here is to separate out the actual places where there's been rapid improvement in what's possible with machine learning for example computer perception identifying the contents of images face recognition Narayanan here even includes medical diagnosis from scans from places where there's not been as much progress and so the split that he proposes that I think is helpful is that most things that involve some form of human judgment like determining whether something is hate speech or what grade an essay should
424
+
425
+ 107
426
+ 01:14:10,080 --> 01:14:51,179
427
+ receive these are on the borderline most forms of prediction especially around what he calls social outcomes so things like policing jobs child development these are places where there has not been substantial progress and where the risk of somebody essentially riding the coattails of GPT-3 with some technique that doesn't perform any better than linear regression is at its highest so we don't have artificial intelligence yet but if we do synthesize intelligent agents a lot of thorny ethical questions are going to immediately arise so it's probably a good idea as a field and as individuals for us to think a little bit about these ahead of time so there's broad agreement that creating sentient intelligent beings would have ethical
428
+
429
+ 108
430
+ 01:14:49,020 --> 01:15:38,580
431
+ implications just this past summer Google engineer Blake Lemoine became convinced that a large language model built by Google LaMDA was in fact conscious and almost everyone agrees that that's not the case for these large language models but there's pretty big disagreement on how far away we are and perhaps most importantly this concern did cause a pretty big reaction both inside the field and in the popular press in my view it's a bit unfortunate that this conversation was started so early because it's so easy to dismiss this claim if it happens too many more times we might end up inured to these kinds of conversations in a boy who cried AI type situation there's also a different set of concerns around what
432
+
433
+ 109
434
+ 01:15:36,719 --> 01:16:19,679
435
+ might happen with the creation of a self-improving artificial intelligence so there's already some hints in this direction for one the latest Nvidia GPU architecture Hopper incorporates a very large number of AI-designed circuits pictured here on the left the quality of the AI-designed circuits is superior this is also something that's been reported by the folks working on TPUs at Google there's also cases in which large language models can be used to build better models for example large language models can teach themselves to program better and large language models can also use large language models at least as well as humans this suggests the possibility of virtuous cycles in machine learning capabilities and
436
+
437
+ 110
438
+ 01:16:17,820 --> 01:17:03,300
439
+ machine intelligence and failing to pursue this kind of very powerful technology comes with a very substantial opportunity cost this is something that's argued by the philosopher Nick Bostrom in a famous paper called Astronomical Waste that points out just given the size of the universe the amount of resources and the amount of time it will be around there's a huge cost in terms of potential good potential lives worth living that we leave on the table if we do not develop the necessary technology quickly but the primary lesson that's drawn in this paper is actually not that technology should be developed as quickly as possible but rather that it should be developed as safely as possible which is to say that the probability that this
440
+
441
+ 111
442
+ 01:17:00,960 --> 01:17:48,540
443
+ imagined galaxy or universe spanning utopia comes into being that probability should be maximized and so this concern around safety originating in the work of Bostrom and others has become a central concern for people thinking about the ethical implications of artificial intelligence and so the concerns around self-improving intelligent systems that could end up being more intelligent than humans are nicely summarized in the parable of the paperclip maximizer also from Bostrom at least popularized in the book Superintelligence so the idea here is a classic example of this proxy problem in alignment so we design an artificial intelligence system for building paper clips so it's designed to make sure that the paper clip producing
444
+
445
+ 112
446
+ 01:17:46,260 --> 01:18:27,900
447
+ component of our economy runs as effectively as possible produces as many paper clips as it can and we incorporate self-improvement into it so that it becomes smarter and more capable over time at first it improves human utility as it introduces better industrial processes for paper clips but as it becomes more intelligent perhaps it finds a way to manipulate the legal system and manipulate politics to introduce a more favorable tax code for paperclip related industries and that starts to hurt overall human utility even as the number of paper clips created and the capacity of the paperclip maximizer increases and of course at the point when we have mandatory national service in the paperclip mines or that all matter in
448
+
449
+ 113
450
+ 01:18:26,400 --> 01:19:14,280
451
+ the universe is converted to paper clips we've pretty clearly decreased human utility as this paperclip maximizer has maximized its objective and increased its own capacity so this still feels fairly far away and a lot of the speculations feel a lot more like science fiction than science fact but the stakes here are high enough that it is certainly worth having some people thinking about and working on it and many of the techniques can be applied to controlled and responsible deployment of less capable ml systems as a small aside these ideas around existential risk and super intelligences are often associated with the effective altruism Community which is concerned with the best ways to do the most good both with what you do
452
+
453
+ 114
454
+ 01:19:11,880 --> 01:19:54,480
455
+ with your career one of the focuses is the 80,000 Hours organization and also through charitable donations as a way to by donating to the highest impact charities and non-profits have the largest positive impact on the world so there's a lot of very interesting ideas coming out of this community and it's particularly appealing to a lot of folks who work in technology and especially in machine learning so it's worth checking out so that brings us to the end of our planned agenda here after giving some context around what our approach to ethics in this lecture would look like we talked about ethical concerns in three different fields first past and immediate concerns around the ethical development of technology then up and
456
+
457
+ 115
458
+ 01:19:52,260 --> 01:20:42,540
459
+ coming and near future concerns around building ethically with machine learning and then finally a taste of the ethical concerns we might face in a future where machine learning gives way to artificial intelligence with a reminder that we should make sure not to oversell our progress on that front so I got to the end of these slides and realized that this was the end of the course and felt that I couldn't leave it on a dour and sad note of unusable medical algorithms and existential risk from superintelligences so I wanted to close out with a bit of a more positive note on the things that we can do so I think the first and most obvious step is education a lot of these ideas around ethics are unfamiliar to people with a technical
460
+
461
+ 116
462
+ 01:20:40,560 --> 01:21:27,179
463
+ background there's a lot of great longer form content that captures a lot of these ideas and can help you build your own knowledge of the history and context and eventually your own opinions on these topics I can highly recommend each of these books The Alignment Problem is a great place to get started it focuses pretty tightly on ML ethics and AI ethics it covers a lot of recent research and is very easily digestible for an ml audience you might also want to consider some of these books around more tech ethics like Weapons of Math Destruction by Cathy O'Neil and Automating Inequality by Virginia Eubanks from there you can prioritize things that you want to act on make your own two by two around things that have
464
+
465
+ 117
466
+ 01:21:24,420 --> 01:22:10,980
467
+ impact now and can have very high impact for me I think that's things around deceptive design and dark patterns and around AI snake oil then there's also places where acting in the future might be very important and high impact for me I think that's things around ml weaponry behind my head is existential risk from superintelligences super high impact but something that we can't act on right now and then all the things in between you can create your own two by two on these and then search around for organizations communities and people working on these problems to align yourself with and by way of a final goodbye as we're ending this class I want to call out that a lot of the discussion of ethics in this lecture was
468
+
469
+ 118
470
+ 01:22:09,420 --> 01:22:53,040
471
+ very negative because of the framing around cases where people raised ethical concerns but ethics is not and cannot be purely negative about avoiding doing bad things the work that we do in building technology with machine learning can do good in the world not just avoid doing harm we can reduce suffering so this diagram here from a neuroscience from a brain machine interface paper from 2012 is what got me into the field of machine learning in the first place it shows a tetraplegic woman who has learned to control a robot arm using only her thoughts by means of an electrode attached to her head and while the technical achievements in this paper were certainly very impressive the thing that made the strongest impression on me
472
+
473
+ 119
474
+ 01:22:50,460 --> 01:23:33,000
475
+ reading this paper in college was the smile on the woman's face in the final panel if you've experienced this kind of limited mobility either yourself or in someone close to you then you know that the joy even from something as simple as being able to feed yourself is very real we can also do good by increasing joy not just reducing suffering despite the concerns that we talked about with text to image models they're clearly being used to create beauty and joy or as Ted Underwood a digital humanities scholar put it to explore a dimension of human culture that was accidentally created across the last five thousand years of captioning that's beautiful and it's something we should hold on to that's not to say that this happens
476
+
477
+ 120
478
+ 01:23:30,000 --> 01:24:17,040
479
+ automatically by building technology the world automatically becomes better but leading organizations in our field are making proactive statements on this OpenAI around long-term safety and broad distribution of the benefits of machine learning and artificial intelligence research DeepMind stating which technologies they won't pursue and making a clear statement of a goal to broadly benefit humanity the final bit of really great news that I have is that the tools for building ml well that you've learned throughout this class align very well with building ml for good so we saw it with the medical machine learning around failure analysis and we can also see it in the principles for responsible
480
+
481
+ 121
482
+ 01:24:15,060 --> 01:25:03,199
483
+ development from these leading organizations DeepMind mentioning accountability to people and gathering feedback Google AI mentioning it as well and if you look closely at Google AI's list of recommended practices for responsible AI use multiple metrics to assess training and monitoring understand limitations use tests directly examine raw data monitor and update your system after deployment these are exactly the same principles that we've been emphasizing in this course around building ml powered products the right way these techniques will also help you build machine learning that does what's right and so on that note I want to thank you for your time and your interest in this course and I wish you the best of luck as you
484
+
485
+ 122
486
+ 01:24:59,940 --> 01:25:03,199
487
+ go out to build with ML
488
+