Spaces:
Runtime error
Runtime error
charlesfrye
commited on
Commit
·
a08f3cd
1
Parent(s):
1b5101a
adds documents
Browse files- documents/lecture-01.md +563 -0
- documents/lecture-01.srt +352 -0
- documents/lecture-02.md +563 -0
- documents/lecture-02.srt +256 -0
- documents/lecture-03.md +597 -0
- documents/lecture-03.srt +244 -0
- documents/lecture-04.md +421 -0
- documents/lecture-04.srt +160 -0
- documents/lecture-05.md +788 -0
- documents/lecture-05.srt +396 -0
- documents/lecture-06.md +809 -0
- documents/lecture-06.srt +440 -0
- documents/lecture-07.md +285 -0
- documents/lecture-08.md +713 -0
- documents/lecture-08.srt +416 -0
- documents/lecture-09.md +825 -0
- documents/lecture-09.srt +488 -0
documents/lecture-01.md
ADDED
@@ -0,0 +1,563 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
---
|
2 |
+
description: Introduction to planning, developing, and shipping ML-powered products.
|
3 |
+
---
|
4 |
+
|
5 |
+
# Lecture 1: Course Vision and When to Use ML
|
6 |
+
|
7 |
+
<div align="center">
|
8 |
+
<iframe width="720" height="405" src="https://www.youtube-nocookie.com/embed/-Iob-FW5jVM?list=PL1T8fO7ArWleMMI8KPJ_5D5XSlovTW_Ur" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
|
9 |
+
</div>
|
10 |
+
|
11 |
+
Lecture by [Josh Tobin](https://twitter.com/josh_tobin_).
|
12 |
+
Notes by [James Le](https://twitter.com/le_james94) and [Vishnu Rachakonda](https://www.linkedin.com/in/vrachakonda/).<br />
|
13 |
+
Published August 8, 2022.
|
14 |
+
[Download slides](https://drive.google.com/file/d/18EVuJpnJ9z5Pz7oRYcgax_IzRVhbuAMC/view?usp=sharing).
|
15 |
+
|
16 |
+
## 1 - Course Vision
|
17 |
+
|
18 |
+
### History of FSDL
|
19 |
+
|
20 |
+
**Full Stack Deep Learning (FSDL) is the course and community for people
|
21 |
+
who are building products that are powered by machine learning (ML).**
|
22 |
+
It's an exciting time to talk about ML-powered products because ML is
|
23 |
+
rapidly becoming a mainstream technology - as you can see in startup
|
24 |
+
funding, job postings, and continued investments of large companies.
|
25 |
+
|
26 |
+
FSDL was originally started in 2018 when the most exciting ML-powered
|
27 |
+
products were built by the biggest companies. However, the broader
|
28 |
+
narrative in the field was that very few companies could get value out
|
29 |
+
of this technology.
|
30 |
+
|
31 |
+
Now in 2022, there's a proliferation of powerful products that are
|
32 |
+
powered by ML. The narrative has shifted as well: There's
|
33 |
+
standardization that has emerged around the tech stack - with
|
34 |
+
transformers and NLP starting to seep their way into more use cases, as
|
35 |
+
well as practices around how to apply ML technologies in the world. One
|
36 |
+
of the biggest changes in the field in the past four years has been the
|
37 |
+
emergence of the term **MLOps**.
|
38 |
+
|
39 |
+
In addition to the
|
40 |
+
field being more mature and research continuing to progress, a big
|
41 |
+
reason for this rapid change is that **the training of models is starting to become
|
42 |
+
commoditized**.
|
43 |
+
|
44 |
+
- With tools like [HuggingFace](https://huggingface.co), you can deploy a state-of-the-art NLP
|
45 |
+
or CV model in one or two lines of code.
|
46 |
+
|
47 |
+
- AutoML is starting to work for a lot of applications.
|
48 |
+
|
49 |
+
- Companies like [OpenAI](https://openai.com/api/) are starting to provide models as a service where you
|
50 |
+
don't even have to download open-source packages to use them. You
|
51 |
+
can make a network call to get predictions from a state-of-the-art
|
52 |
+
model.
|
53 |
+
|
54 |
+
- Many libraries are starting to standardize around frameworks like [Keras](https://keras.io/) and [PyTorch
|
55 |
+
Lightning](https://www.pytorchlightning.ai/).
|
56 |
+
|
57 |
+
### AI Progress
|
58 |
+
|
59 |
+
The history of ML is characterized by stratospheric rises and meteoric falls of the public
|
60 |
+
perception of the technology. These were driven by a few different AI
|
61 |
+
winters that happened over the history of the field - where the
|
62 |
+
technology didn't live up to its hype. If you project forward a few
|
63 |
+
years, what will happen to ML?
|
64 |
+
|
65 |
+
![](./media/image6.png)
|
66 |
+
|
67 |
+
|
68 |
+
*Source: [5 Things You Should Know About
|
69 |
+
AI](https://www.cambridgewireless.co.uk/media/uploads/resources/AI%20Group/AIMobility-11.05.17-Cambridge_Consultants-Monty_Barlow.pdf)
|
70 |
+
(Cambridge Consultants, May 2017)*
|
71 |
+
|
72 |
+
Here are the major categories of possible outcomes and our guess about their likelihoods:
|
73 |
+
|
74 |
+
1. A true AI winter, where people
|
75 |
+
become skeptical about AI as a technology.
|
76 |
+
We think this is less likely.
|
77 |
+
|
78 |
+
2. A slightly more likely outcome is that the overall luster of the
|
79 |
+
technology starts to wear off, but specific applications are
|
80 |
+
getting a ton of value out of it.
|
81 |
+
|
82 |
+
3. The upside outcome for the field is that AI continues to accelerate
|
83 |
+
rapidly and becomes pervasive and incredibly effective.
|
84 |
+
|
85 |
+
Our conjecture is that: **The way we, as a field, avoid an AI winter is
|
86 |
+
by translating research progress into real-world products.** That's how
|
87 |
+
we avoid repeating what has happened in the past.
|
88 |
+
|
89 |
+
### ML-Powered Products Require a Different Process
|
90 |
+
|
91 |
+
Building ML-powered products requires a fundamentally different process
|
92 |
+
in many ways than developing ML models in an academic setting.
|
93 |
+
|
94 |
+
![](./media/image7.png)
|
95 |
+
|
96 |
+
|
97 |
+
In academia, you build **"flat-earth" ML** - selecting a problem,
|
98 |
+
collecting data, cleaning and labeling the data, iterating on model
|
99 |
+
development until you have a model that performs well on the dataset
|
100 |
+
collected, evaluating that model, and writing a report at the end.
|
101 |
+
|
102 |
+
![](./media/image5.png)
|
103 |
+
|
104 |
+
|
105 |
+
But ML-powered products require **an outer loop** where after you deploy
|
106 |
+
the model into production, you measure how that model performs when it
|
107 |
+
interacts with real users. Then, you use real-world data to
|
108 |
+
improve your model, setting up a data flywheel that enables
|
109 |
+
continual improvement.
|
110 |
+
|
111 |
+
### This Course
|
112 |
+
|
113 |
+
![](./media/image2.png)
|
114 |
+
|
115 |
+
|
116 |
+
This class is about the unique aspects you need to know beyond training
|
117 |
+
models to build great ML-powered products. Here are some concrete goals
|
118 |
+
for us:
|
119 |
+
|
120 |
+
1. Teaching you **generalist skills** and an understanding of the
|
121 |
+
**components of ML-powered products** (and ML projects more
|
122 |
+
generally).
|
123 |
+
|
124 |
+
2. Teaching you **enough MLOps to get things done**.
|
125 |
+
|
126 |
+
3. Sharing **best practices** and **explaining the motivation** behind them.
|
127 |
+
|
128 |
+
4. Learning things that might **help you with job interviews** for ML engineering roles.
|
129 |
+
|
130 |
+
5. **Forming a community** to learn together and from each other.
|
131 |
+
|
132 |
+
We do NOT try to:
|
133 |
+
|
134 |
+
1. Teach you ML or software engineering from scratch.
|
135 |
+
|
136 |
+
2. Cover the whole breadth of deep learning techniques.
|
137 |
+
|
138 |
+
3. Make you an expert in any single aspect of ML.
|
139 |
+
|
140 |
+
4. Do research in deep learning.
|
141 |
+
|
142 |
+
5. Cover the full spectrum of MLOps.
|
143 |
+
|
144 |
+
If you feel rusty on your pre-requisites but want to get started with
|
145 |
+
FSDL, here are our recommendations to get up to speed with the
|
146 |
+
fundamentals:
|
147 |
+
|
148 |
+
- Andrew Ng's [Machine Learning Coursera
|
149 |
+
course](https://www.coursera.org/collections/machine-learning)
|
150 |
+
|
151 |
+
- Google's [crash course on Machine
|
152 |
+
Learning](https://developers.google.com/machine-learning/crash-course)
|
153 |
+
|
154 |
+
- MIT's [The Missing
|
155 |
+
Semester](https://missing.csail.mit.edu/) on software
|
156 |
+
engineering
|
157 |
+
|
158 |
+
### ML-Powered Products vs MLOps
|
159 |
+
|
160 |
+
MLOps, as a discipline, has emerged in just the last few years. It is
|
161 |
+
about practices for deploying, maintaining, and operating ML systems
|
162 |
+
that generate ML models in production. A lot of MLOps is about:
|
163 |
+
|
164 |
+
- How do we put together an infrastructure that allows us to build
|
165 |
+
models in a repeatable and governable way?
|
166 |
+
|
167 |
+
- How can we run ML systems in a potentially high-scale production
|
168 |
+
setting?
|
169 |
+
|
170 |
+
- How can we collaborate on these systems as a team?
|
171 |
+
|
172 |
+
![](./media/image1.png)
|
173 |
+
|
174 |
+
|
175 |
+
ML-powered product building is a distinct but overlapping discipline. A lot of
|
176 |
+
what it takes to build a great ML-powered product goes beyond the
|
177 |
+
infrastructure side of ML systems. It focuses on how to fit ML into the
|
178 |
+
context of the product or the application that you're building.
|
179 |
+
|
180 |
+
Other topics in the scope of the ML product discipline include:
|
181 |
+
|
182 |
+
- How do you understand how your users are interacting with your
|
183 |
+
model?
|
184 |
+
|
185 |
+
- How do you build a team or an organization that can work together
|
186 |
+
effectively on ML systems?
|
187 |
+
|
188 |
+
- How do you do product management in the context of ML?
|
189 |
+
|
190 |
+
- What are the best practices for designing products that use ML as
|
191 |
+
part of them?
|
192 |
+
|
193 |
+
This class focuses on teaching you end-to-end what it takes to get a
|
194 |
+
product out in the world that uses ML and will cover aspects of MLOps
|
195 |
+
that are most critical in order to do that.
|
196 |
+
|
197 |
+
### Chapter Summary
|
198 |
+
|
199 |
+
1. **ML-powered products are going mainstream** thanks to the
|
200 |
+
democratization of modeling.
|
201 |
+
|
202 |
+
2. However, building **great ML-powered products requires a different
|
203 |
+
process** from building models.
|
204 |
+
|
205 |
+
3. Full-Stack Deep Learning is **here to help**!
|
206 |
+
|
207 |
+
## 2 - When To Use Machine Learning
|
208 |
+
|
209 |
+
### When to Use ML At All
|
210 |
+
|
211 |
+
**ML projects have a higher failure rate than software projects in
|
212 |
+
general**. One reason that's worth acknowledging is that for many
|
213 |
+
applications, ML is fundamentally still research. Therefore, we
|
214 |
+
shouldn't aim for 100% success.
|
215 |
+
|
216 |
+
Additionally, many ML projects are
|
217 |
+
doomed to fail even before they are undertaken due to a variety of
|
218 |
+
reasons:
|
219 |
+
|
220 |
+
1. They are technically infeasible or poorly scoped.
|
221 |
+
|
222 |
+
2. They never make the leap to a production environment.
|
223 |
+
|
224 |
+
3. The broader organization is not all on the same page about what
|
225 |
+
would be considered success criteria for them.
|
226 |
+
|
227 |
+
4. They solve the problem that you set out to solve but do not solve a
|
228 |
+
big enough problem to be worth their complexity.
|
229 |
+
|
230 |
+
The bar for your ML projects should be that **their value must outweigh
|
231 |
+
not just the cost of developing them but also the additional complexity
|
232 |
+
that these ML systems introduce to your software** (as introduced in the
|
233 |
+
classic paper "[The High-Interest Credit Card of Technical
|
234 |
+
Debt](https://research.google/pubs/pub43146/)").
|
235 |
+
|
236 |
+
In brief,
|
237 |
+
ML systems erode the boundaries between other systems, rely on expensive
|
238 |
+
data dependencies, are commonly plagued by system design anti-patterns,
|
239 |
+
and are subject to the instability of the external world.
|
240 |
+
|
241 |
+
Before starting an ML project, ask yourself:
|
242 |
+
|
243 |
+
1. **Are you ready to use ML?** More specifically, do you have a
|
244 |
+
product? Are you collecting data and storing it in a sane way? Do
|
245 |
+
you have the right people?
|
246 |
+
|
247 |
+
2. **Do you really need ML to solve this problem?** More specifically,
|
248 |
+
do you need to solve the problem at all? Have you tried using
|
249 |
+
rules or simple statistics to solve the problem?
|
250 |
+
|
251 |
+
3. **Is it ethical to use ML to solve this problem?** We have a
|
252 |
+
[whole lecture about ethics](../lecture-9-ethics/)!
|
253 |
+
|
254 |
+
### How to Pick Problems to Solve with ML
|
255 |
+
|
256 |
+
Just like any other project prioritization, you want to look for use
|
257 |
+
cases that have **high impact** and **low cost**:
|
258 |
+
|
259 |
+
1. **High-impact problems** are likely to be those that address friction in
|
260 |
+
your product, complex parts of your pipeline, places where cheap
|
261 |
+
prediction is valuable, and generally what other people in your
|
262 |
+
industry are doing.
|
263 |
+
|
264 |
+
2. **Low-cost projects** are those with available data, where bad
|
265 |
+
predictions are not too harmful.
|
266 |
+
|
267 |
+
![](./media/image11.png)
|
268 |
+
|
269 |
+
|
270 |
+
#### High-Impact Projects
|
271 |
+
|
272 |
+
Here are some heuristics that you can use to find high-impact ML
|
273 |
+
projects:
|
274 |
+
|
275 |
+
1. **Find problems that ML takes from economically infeasible to feasible**.
|
276 |
+
A good resource here is the book "[Prediction Machines:
|
277 |
+
The Simple Economics of
|
278 |
+
AI](https://www.amazon.com/Prediction-Machines-Economics-Artificial-Intelligence/dp/1633695670)."
|
279 |
+
The book's central thesis is that AI reduces the cost of
|
280 |
+
prediction, which is central to decision-making. Therefore, look
|
281 |
+
for projects where making prediction cheaper will have a huge impact.
|
282 |
+
|
283 |
+
2. **Think about what your product needs**.
|
284 |
+
[This article from the ML team at Spotify](https://spotify.design/article/three-principles-for-designing-ml-powered-products)
|
285 |
+
talks about the three principles for designing Discover Weekly,
|
286 |
+
one of Spotify's most powerful and popular ML-powered features.
|
287 |
+
|
288 |
+
3. **Think about the types of problems that ML is particularly good at**.
|
289 |
+
One common class of problem that is overlooked is
|
290 |
+
["Software 2.0"](https://karpathy.medium.com/software-2-0-a64152b37c35),
|
291 |
+
as coined by Andrej Kaparthy. Essentially, if you have a part of your
|
292 |
+
system that is complex and manually defined, then that's
|
293 |
+
potentially a good candidate to be automated with ML.
|
294 |
+
|
295 |
+
4. **Look at what other people in the industry are doing**.
|
296 |
+
Generally, you can read papers and blog posts from both Big Tech and top
|
297 |
+
earlier-stage companies.
|
298 |
+
|
299 |
+
#### Low-Cost Projects
|
300 |
+
|
301 |
+
![](./media/image12.png)
|
302 |
+
|
303 |
+
|
304 |
+
There are three main drivers for how much a project will cost:
|
305 |
+
|
306 |
+
1. **Data availability**: How hard is it to acquire data? How expensive
|
307 |
+
is data labeling? How much data will be needed? How stable is the
|
308 |
+
data? What data security requirements do you have?
|
309 |
+
|
310 |
+
2. **Accuracy requirement**: How costly are wrong predictions? How
|
311 |
+
frequently does the system need to be right to be useful? What are
|
312 |
+
the ethical implications of your model making wrong predictions?
|
313 |
+
It is noteworthy that **ML project costs tend to scale
|
314 |
+
super-linearly in the accuracy requirement**.
|
315 |
+
|
316 |
+
3. **Problem difficulty**: Is the problem well-defined enough to be
|
317 |
+
solved with ML? Is there good published work on similar problems?
|
318 |
+
How much compute does it take to solve the problem? **Generally,
|
319 |
+
it's hard to reason about what's feasible in ML**.
|
320 |
+
|
321 |
+
#### What's Hard in ML?
|
322 |
+
|
323 |
+
![](./media/image8.png)
|
324 |
+
|
325 |
+
|
326 |
+
Here are the three types of hard problems:
|
327 |
+
|
328 |
+
1. **Output is complex**: The model predictions are ambiguous or in a
|
329 |
+
high-dimensional structure.
|
330 |
+
|
331 |
+
2. **Reliability is required**: ML systems tend to fail in unexpected
|
332 |
+
ways, so anywhere you need high precision or high robustness is
|
333 |
+
going to be more difficult to solve with ML.
|
334 |
+
|
335 |
+
3. **Generalization is required**: These problems tend to be more in
|
336 |
+
the research domain. They can deal with out-of-distribution data
|
337 |
+
or do tasks such as reasoning, planning, or understanding
|
338 |
+
causality.
|
339 |
+
|
340 |
+
#### ML Feasibility Assessment
|
341 |
+
|
342 |
+
This is a quick checklist you can use to assess the feasibility of your
|
343 |
+
ML projects:
|
344 |
+
|
345 |
+
1. Make sure that you actually need ML.
|
346 |
+
|
347 |
+
2. Put in the work upfront to define success criteria with all of the
|
348 |
+
stakeholders.
|
349 |
+
|
350 |
+
3. Consider the ethics of using ML.
|
351 |
+
|
352 |
+
4. Do a literature review.
|
353 |
+
|
354 |
+
5. Try to rapidly build a labeled benchmark dataset.
|
355 |
+
|
356 |
+
6. Build a "minimum" viable model using manual rules or simple
|
357 |
+
heuristics.
|
358 |
+
|
359 |
+
7. Answer this question again: "Are you sure that you need ML at all?"
|
360 |
+
|
361 |
+
### Not All ML Projects Should Be Planned The Same Way
|
362 |
+
|
363 |
+
Not all ML projects have the same characteristics; therefore, they
|
364 |
+
shouldn't be planned the same way. Understanding different archetypes of
|
365 |
+
ML projects can help select the right approach.
|
366 |
+
|
367 |
+
#### ML Product Archetypes
|
368 |
+
|
369 |
+
The three archetypes offered here are defined by how they interact with
|
370 |
+
real-world use cases:
|
371 |
+
|
372 |
+
1. **Software 2.0 use cases**: Broadly speaking, this means taking
|
373 |
+
something that software or a product does in an automated fashion
|
374 |
+
today and augmenting its automation with machine learning. An
|
375 |
+
example of this would be improving code completion in the IDE
|
376 |
+
(like [Github
|
377 |
+
Copilot](https://github.com/features/copilot)).
|
378 |
+
|
379 |
+
2. **Human-in-the-loop systems:** Machine learning can be applied for
|
380 |
+
tasks where automation is not currently deployed - but where
|
381 |
+
humans could have their judgment or efficiency augmented. Simply
|
382 |
+
put, helping humans do their jobs better by complementing them
|
383 |
+
with ML-based tools. An example of this would be turning sketches
|
384 |
+
into slides, a process will usually involve humans approving the
|
385 |
+
output of a machine learning model that made the slides.
|
386 |
+
|
387 |
+
3. **Autonomous systems:** Systems that apply machine learning to
|
388 |
+
augment existing or implement new processes without human input.
|
389 |
+
An example of this would be full self-driving, where there is no
|
390 |
+
opportunity for a driver to intervene in the functioning of the
|
391 |
+
car.
|
392 |
+
|
393 |
+
For each archetype, some key considerations inform how you should go
|
394 |
+
about planning projects.
|
395 |
+
|
396 |
+
![](./media/image10.png)
|
397 |
+
|
398 |
+
|
399 |
+
1. In the case of Software 2.0 projects, you should focus more on
|
400 |
+
understanding **how impactful the performance of the new model
|
401 |
+
is**. Is the model truly much better? How can the performance
|
402 |
+
continue to increase across iterations?
|
403 |
+
|
404 |
+
2. In the case of human-in-the-loop systems, consider more **the
|
405 |
+
context of the human user and what their needs might be**. How
|
406 |
+
good does the system actually have to be to improve the life of a
|
407 |
+
human reviewing its output? In some cases, a model that does even
|
408 |
+
10% better with accuracy (nominally a small increase) might have
|
409 |
+
outsize impacts on human users in the loop.
|
410 |
+
|
411 |
+
3. For autonomous systems, focus heavily on t**he failure rate and its
|
412 |
+
consequences**. When there is no opportunity for human
|
413 |
+
intervention, as is the case with autonomous systems, failures
|
414 |
+
need to be carefully monitored to ensure outsize harm doesn't
|
415 |
+
occur. Self-driving cars are an excellent example of an autonomous
|
416 |
+
system where failure rates are carefully monitored.
|
417 |
+
|
418 |
+
#### Data Flywheels
|
419 |
+
|
420 |
+
As you build a software 2.0 project, strongly consider the concept of
|
421 |
+
the **data flywheel**. For certain ML projects, as you improve your
|
422 |
+
model, your product will get better and more users will engage with the
|
423 |
+
product, thereby generating more data for the model to get even better.
|
424 |
+
It's a classic virtuous cycle and truly the gold standard for ML
|
425 |
+
projects.
|
426 |
+
|
427 |
+
![](./media/image4.png)
|
428 |
+
|
429 |
+
|
430 |
+
As you consider implementing data flywheels, remember to know the answer
|
431 |
+
to these three questions:
|
432 |
+
|
433 |
+
1. **Do you have a data loop?** To build a data flywheel, you crucially
|
434 |
+
need to be able to get labeled data from users in a scalable
|
435 |
+
fashion. This helps increase access to high-quality data and
|
436 |
+
define a data loop.
|
437 |
+
|
438 |
+
2. **Can you turn more data into a better model?** This somewhat falls
|
439 |
+
onto you as the modeling expert, but it may also not be the case
|
440 |
+
that more data leads to significantly better performance. Make
|
441 |
+
sure you can actually translate data scale into better model
|
442 |
+
performance.
|
443 |
+
|
444 |
+
3. **Does better model performance lead to better product use?** You
|
445 |
+
need to verify that improvements with models are actually tied to
|
446 |
+
users enjoying the product more and benefiting from it!
|
447 |
+
|
448 |
+
#### Impact and Feasibility of ML Product Archetypes
|
449 |
+
|
450 |
+
Let's visit our impact vs. feasibility matrix. Our three product
|
451 |
+
archetypes differ across the spectrum.
|
452 |
+
|
453 |
+
![](./media/image9.png)
|
454 |
+
|
455 |
+
|
456 |
+
This is a pretty intuitive evaluation you can apply to all your ML
|
457 |
+
projects: **If it's harder to build (like autonomous systems), it's
|
458 |
+
likely to have a greater impact**! There are ways, however, to change
|
459 |
+
this matrix in the context of specific projects.
|
460 |
+
|
461 |
+
1. For **Software 2.0**, data flywheels can magnify impact by allowing
|
462 |
+
models to get much better and increase customer delight over time.
|
463 |
+
|
464 |
+
2. For **human-in-the-loop systems**, you can increase feasibility by
|
465 |
+
leveraging good product design. Thoughtful design can help reduce
|
466 |
+
expectations and accuracy requirements. Alternatively, a "good
|
467 |
+
enough" mindset that prioritizes incremental delivery over time
|
468 |
+
can make such systems more feasible.
|
469 |
+
|
470 |
+
3. For **autonomous systems**, leveraging humans in the loop can make
|
471 |
+
development more feasible by adding guardrails and reducing the
|
472 |
+
potential impact of failures.
|
473 |
+
|
474 |
+
### Just Get Started!
|
475 |
+
|
476 |
+
With all this discussion about archetypes and impact matrices, don't
|
477 |
+
forget the most important component of engineering: **actually
|
478 |
+
building**! Dive in and get started. Start solving problems and iterate
|
479 |
+
on solutions.
|
480 |
+
|
481 |
+
One common area practitioners trip up in is **tool fetishization.** As
|
482 |
+
MLOps and production ML have flourished, so too has the number of tools
|
483 |
+
and platforms that address various aspects of the ML process. You don't
|
484 |
+
need to be perfect with your tooling before driving value from machine
|
485 |
+
learning. Just because Google and Uber are doing things in a very
|
486 |
+
structured, at-scale way doesn't mean you need to as well!
|
487 |
+
|
488 |
+
In this course, we will primarily focus on how to set things up the
|
489 |
+
right way to do machine learning in production without overcomplicating
|
490 |
+
it. This is an ML products-focused class, not an MLOps class! Check out
|
491 |
+
this talk by Jacopo Tagliabue describing [MLOps at Reasonable
|
492 |
+
Scale](https://www.youtube.com/watch?v=Ndxpo4PeEms) for a
|
493 |
+
great exposition of this mindset.
|
494 |
+
|
495 |
+
### Chapter Summary
|
496 |
+
|
497 |
+
1. ML adds complexity. Consider whether you really need it.
|
498 |
+
|
499 |
+
2. Make sure what you're working on is high impact, or else it might
|
500 |
+
get killed.
|
501 |
+
|
502 |
+
## 3 - Lifecycle
|
503 |
+
|
504 |
+
ML adds complexity to projects and isn't always a value driver. Once you
|
505 |
+
know, however, that it's the right addition to your project, what does
|
506 |
+
the actual lifecycle look like? What steps do we embark upon as we
|
507 |
+
execute?
|
508 |
+
|
509 |
+
In this course, the common running example we use is of **a pose
|
510 |
+
estimation problem**. We'll use this as a case study to demonstrate the
|
511 |
+
lifecycle and illustrate various points about ML-powered products.
|
512 |
+
|
513 |
+
![](./media/image13.png)
|
514 |
+
|
515 |
+
|
516 |
+
Here's a graphic that visualizes the lifecycle of ML projects:
|
517 |
+
|
518 |
+
![](./media/image3.png)
|
519 |
+
|
520 |
+
|
521 |
+
It provides a very helpful structure. Watch from 48:00 to 54:00 to dive
|
522 |
+
deeper into how this lifecycle occurs in the context of a real machine
|
523 |
+
learning problem around pose estimation that Josh worked on at OpenAI.
|
524 |
+
|
525 |
+
Let's comment on some specific nuances:
|
526 |
+
|
527 |
+
- **Machine learning projects tend to be very iterative**. Each of
|
528 |
+
these phases can feed back into any of the phases that go before
|
529 |
+
it, as you learn more about the problem that you're working on.
|
530 |
+
|
531 |
+
- For example, you might realize that "Actually, it's way too
|
532 |
+
hard for us to get data in order to solve this problem!" or
|
533 |
+
"It's really difficult for us to label the pose of these
|
534 |
+
objects in 3D space".
|
535 |
+
|
536 |
+
- A solution might actually be to go back a step in the lifecycle
|
537 |
+
and set up the problem differently. For example, what if it
|
538 |
+
were cheaper to annotate per pixel?
|
539 |
+
|
540 |
+
- This could repeat itself multiple times as you progress through
|
541 |
+
a project. It's a normal and expected part of the machine
|
542 |
+
learning product development process.
|
543 |
+
|
544 |
+
- In addition to iteration during execution, there's also
|
545 |
+
cross-project "platform" work that matters! **Hiring and
|
546 |
+
infrastructure development are crucial to the long-term health of
|
547 |
+
your project**.
|
548 |
+
|
549 |
+
- Going through this lifecycle and winning each step is what we'll
|
550 |
+
cover in this class!
|
551 |
+
|
552 |
+
## Lecture Summary
|
553 |
+
|
554 |
+
In summary, here's what we covered in this lecture:
|
555 |
+
|
556 |
+
1. ML is NOT a cure-all. It's a complex technology that needs to be
|
557 |
+
used thoughtfully.
|
558 |
+
|
559 |
+
2. You DON'T need a perfect setup to get going. Start building and
|
560 |
+
iterate!
|
561 |
+
|
562 |
+
3. The lifecycle of machine learning is purposefully iterative and
|
563 |
+
circuitous. We'll learn how to master this process together!
|
documents/lecture-01.srt
ADDED
@@ -0,0 +1,352 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
1
|
2 |
+
00:00:00,320 --> 00:00:33,120
|
3 |
+
hey everyone welcome to the 2022 edition of full stack deep learning i'm josh tobin one of the instructors and i'm really excited about this version of the class because we've made a bunch of improvements and i think it comes at a really interesting time to be talking about some of these topics let's dive in today we're going to cover a few things first we'll talk about why this course exists and what you might hope to take away from it then we'll talk about the first question you should ask when you're starting a new ml project which is should we be using ml for this at all and then we'll talk through the high level overview of what the life cycle of a typical ml project might look like which will also give you a conceptual
|
4 |
+
|
5 |
+
2
|
6 |
+
00:00:31,840 --> 00:01:08,880
|
7 |
+
outline for some of the things we'll talk about in this class really what is full stack deep learning about we aim to be the course and community for people who are building products that are powered by machine learning and i think it's a really exciting time to be talking about mlpowered products because machine learning is rapidly becoming a mainstream technology and you can see this in startup funding in job postings as well as in the continued investment of large companies in this technology i think it's particularly interesting to think about how this has changed since 2018 when we started teaching the class in 2018 a lot of the most exciting ml powered products were built by the biggest companies you had self-driving
|
8 |
+
|
9 |
+
3
|
10 |
+
00:01:06,880 --> 00:01:44,079
|
11 |
+
cars that were starting to show promise you had systems like translation from big companies like google that were really starting to hit the market in a way that was actually effective but the broader narrative in the field was that very few companies were able to get value out of this technology and even on the research side right now gpt3 is becoming a mainstream technology but in 2018 gpt-1 was one of the state-of-the-art examples of language models and you know if you look at what it actually took to build a system like this it was the code and the standardization around it was still not there like these technologies were still hard to apply now on the other hand there's a much wider range of really
|
12 |
+
|
13 |
+
4
|
14 |
+
00:01:42,240 --> 00:02:20,000
|
15 |
+
powerful products that are powered by machine learning dolly 2 is i think a great example image generation technology more on the consumer side tick tock is a really powerful example but it's not just massive companies now that are able to build machine learning powered products dscript is an application that we at full stack deep learning use all the time in fact we'll probably use it to edit this video that i'm recording right now and startups are also building things like email generation so there's a proliferation of machine learning powered products and the narrative has shifted i think a little bit as well which is that before these technologies were really hard to apply but now there's standardization
|
16 |
+
|
17 |
+
5
|
18 |
+
00:02:17,840 --> 00:02:55,519
|
19 |
+
that's emerging both around the technology stack transformers and nlp starting to seep their way into more and more use cases as well as the practices around how to actually apply these technologies in the world one of the biggest changes in the field in the past four years has been the emergence of this term called ml ops which we'll talk a lot about in this class and so if you ask yourself why like why is this changed so rapidly i think in addition to the field just maturing and research continuing to progress i think one of the biggest reasons is that the training of models is starting to become commoditized we showed a couple of slides ago how complicated code for gpt-1 was now using something like
|
20 |
+
|
21 |
+
6
|
22 |
+
00:02:53,840 --> 00:03:28,480
|
23 |
+
hugging face you can deploy a state-of-the-art nlp model or computer vision model in one or two lines of code on top of that automl is starting to actually work for a lot of applications i think four years ago we were pretty skeptical about it now i think it's a really good starting point for a lot of problems that you might want to solve and companies are starting to provide models really as a service where you don't even have to download open source package to use it you can just make a network call and you can have predictions from a state-of-the-art model and on the software side a lot of frameworks are starting to standardize around things like keras and pytorch lightning so a lot of the like spaghetti
|
24 |
+
|
25 |
+
7
|
26 |
+
00:03:27,200 --> 00:04:07,680
|
27 |
+
code that you had to write to build these systems just isn't necessary anymore and so i think if you project forward a few years what's going to happen in ml i think the history of the ml is characterized by rise and fall of the public perception of the technology these were driven by a few different ai winters that happened over the history of the field where the technology didn't live up to its height live up to its promise and people became skeptical about it what's going to happen in the future of the field i think what a lot of people think is that this time is different we have real applications of machine learning that are generating a lot of value in the world and so the prospect of a true ai winter where
|
28 |
+
|
29 |
+
8
|
30 |
+
00:04:05,680 --> 00:04:46,560
|
31 |
+
people become skeptical about ai as a technology maybe feels less likely but it's still possible a slightly more likely outcome is that the overall luster of the technology starts to wear off but certain applications are getting a ton of value out of this technology and then i think you know the upside outcome for the field is that ai continues to accelerate really rapidly and it becomes pervasive and incredibly effective and i think that's also what a lot of people believe to happen and so what i would conjecture is that the way that we as a field avoid an ai winter is by not just making progress in research but also making sure that that progress is translated to actual real world products that's how we avoid repeating
|
32 |
+
|
33 |
+
9
|
34 |
+
00:04:45,199 --> 00:05:20,639
|
35 |
+
what's happened in the past that's caused the field to lose some of its luster but the challenge that presents is that building ml powered products requires a fundamentally different process in many ways than building the types of ml systems you create in academia the sort of process that you might use to develop a model in an academic setting i would call flat earth machine learning flat earth ml this is a process that will probably be familiar to many people you start by selecting a problem you collect some data to use to solve the problem you clean and label that data you iterate on developing a model until you have a model that performs well on the data set that you collected and then you evaluate that
|
36 |
+
|
37 |
+
10
|
38 |
+
00:05:19,120 --> 00:05:56,639
|
39 |
+
model and if the model performs well according to your metrics then you write a report produce a jupiter notebook a paper or some slides and then you're done but in the real world the challenge is that if you deploy that model in production it's not going to perform well in the real world for long necessarily right and so ml powered products require this outer loop where you deploy the model into production you measure how that model is performing when it interacts with real users you use the real world data to build a data flywheel and then you continue this as part of an outer loop some people believe that the earth isn't round just because you can't see the outer loop in the ml system doesn't mean it's not
|
40 |
+
|
41 |
+
11
|
42 |
+
00:05:54,160 --> 00:06:32,080
|
43 |
+
there and so this course is really about how to do this process of building ml-powered products and so what we won't cover as much is the theory and the math and the sort of computer science behind deep learning or and machine learning more broadly there are many great courses that you can check out to learn those materials we also will talk a little bit about training models and some of the practical aspects of that but this isn't meant to be your first course in training machine learning models again there's many great classes for that as well but what this class is about is the unique aspects that you need to know beyond just training models to build great ml powered products so our goals in the class are to teach you
|
44 |
+
|
45 |
+
12
|
46 |
+
00:06:30,080 --> 00:07:09,120
|
47 |
+
a generalist skill set that you can use to build an ammo powered product and an understanding of how the different pieces of ml power products fit together we will also teach you a little bit about this concept of ml ops but this is not an ml ops class our goal is to teach you enough ml ops to get things done but not to cover the full depth of ml ops as a topic we'll also share certain best practices from what we've seen to work in the real world and try to explain some of the motivations behind them and if you're on the job market or if you're thinking about transitioning into a role in machine learning we also aim to teach you some things that might help you with ml engineering job interviews and then lastly in practice i think what we've
|
48 |
+
|
49 |
+
13
|
50 |
+
00:07:07,199 --> 00:07:42,560
|
51 |
+
found to be maybe the most powerful part of this is forming a community that you can use to learn from your peers about what works in the real world and what doesn't we as instructors have solved many problems with ml but there's a very good chance that we haven't solved one that's like the one that you're working on but in the broader full stack deep learning community i would bet that there probably is someone who's worked on something similar and so we hope that this can be a place where folks come together to learn from each other as well as just learning from us now there are some things that we are explicitly not trying to do with this class we're not trying to teach you machine learning or software engineering from scratch if
|
52 |
+
|
53 |
+
14
|
54 |
+
00:07:40,800 --> 00:08:15,919
|
55 |
+
you are coming at this class and you know you have an academic background in ml but you've never written production code before or you're a software engineer but i've never taken an ml class before you can follow along with this class but i would highly recommend taking these prerequisites before you dive into the material here because you'll i think get a lot more out of the class once you've learned the fundamentals of each of these fields we're also not aiming to cover the full breadth of deep learning techniques or machine learning techniques more broadly we'll talk about a lot of the techniques that are used in practice but the chances are that we won't talk about the specific model that you use for your use
|
56 |
+
|
57 |
+
15
|
58 |
+
00:08:14,240 --> 00:08:52,800
|
59 |
+
case it's not the goal here we're also not trying to make you an expert in any single aspect of machine learning we have a project and a set of labs that are associated with this course that will allow you to spend some time working on a particular application of machine learning but there there isn't a focus on becoming an expert in computer vision or nlp or any other single branch of machine learning and we're also not aiming to help you do research in deep learning or any other ml field and similarly ml ops is this broad topic that involves everything from infrastructure and tooling to organizational practices and we're not aiming to be comprehensive here the goal of this class is to show you end to end
|
60 |
+
|
61 |
+
16
|
62 |
+
00:08:51,279 --> 00:09:28,640
|
63 |
+
what it takes to build an ml-powered product and give you pointers to the different pieces of the field that you'll potentially need to go deeper on to solve the particular problem that you're working on so if you are feeling rusty on your prerequisites but want to get started with the class anyway here are some recommendations for classes on ml and software engineering that i'd recommend checking out if you want to remind yourself of some of the fundamentals i mentioned this distinction between ml power products and ml ops a little bit and so i wanted to dive into that a little bit more ml ops is this discipline that's emerged in the last couple of years really that is about practices for deploying and
|
64 |
+
|
65 |
+
17
|
66 |
+
00:09:26,880 --> 00:10:06,240
|
67 |
+
maintaining and operating machine learning models and the systems that generate these machine learning models in production and so a lot of ml ops is about how do we put together the infrastructure that will allow us to build models in a repeatable and governable way how we're able to do this at scale how we're able to collaborate on these systems as a team and how we're able to really run these machine learning systems in a potentially high scale production setting super important topic if your goal is to make ml work in the real world and there's a lot of overlap with what we're covering in this class but we see mlpowered products as kind of a distinct but overlapping discipline because a lot of what it
|
68 |
+
|
69 |
+
18
|
70 |
+
00:10:04,800 --> 00:10:45,120
|
71 |
+
takes to build a great ml powered product goes beyond the infrastructure side and the sort of repeatability and automation side of machine learning systems and it also focuses on how to fit machine learning into the context of product or the application that you're building so other topics that are in scope of this mlpowered product discipline are things like how do you understand how your users are interacting with your model and what type of model they need how do you build a team or an organization that can work together effectively on machine learning systems how do you do product management in the context of ml what are some of the best practices for designing products that use ml as part of them things like data labeling capturing
|
72 |
+
|
73 |
+
19
|
74 |
+
00:10:42,640 --> 00:11:24,240
|
75 |
+
feedback from users etc and so this class is really focused on teaching you end to end what it takes to get a product out in the world that uses ml and we'll cover the aspects of mlaps that are most critical to understand in order to do that a little bit about us as instructors i'm josh tobin i'm co-founder and ceo of machine learning infrastructure startup called gantry previously i was a research scientist at openai and did my machine learning phd at berkeley and charles and sergey are my wonderful co-instructors who you'll be hearing from in the coming weeks on the history of full stack deep learning so we started out as a boot camp in 2018 sergey and i as well as my grad school advisor and our close collaborator peter
|
76 |
+
|
77 |
+
20
|
78 |
+
00:11:22,240 --> 00:12:00,720
|
79 |
+
abeel had this collective realization that a lot of what we had been discovering about making ml work in the real world wasn't really well covered in other courses and we didn't really know if other people would be interested in this topic so we put it together as a one-time weekend long boot camp we got started to get good feedback on that and so it grew from there and we put the class online for the first time in 2021 and here we are so the way that this class was developed was a lot of this is from our personal experience our study and reading of materials in the field we also did a bunch of interviews with practitioners from this list of companies and at this point like a much longer list as well so we're constantly
|
80 |
+
|
81 |
+
21
|
82 |
+
00:11:58,880 --> 00:12:32,880
|
83 |
+
out there talking to folks who are doing this who are building ml powered products and trying to fold their perspectives into what we teach in this class some logistics before we dive into the rest of the material for today first is if you're part of the synchronous cohort all of the communication for that cohort is going to happen on discord so if you're not on discord already then please reach out to us instructors and we'll make sure to get you on that if you're not on discord or if you're not checking it regularly there's a high likelihood that you're going to miss some of the value of the synchronous course we will have a course project again for folks who are participating in the synchronous option which we'll share
|
84 |
+
|
85 |
+
22
|
86 |
+
00:12:31,040 --> 00:13:11,839
|
87 |
+
more details about on discord in the coming weeks and there's also i think one of the most valuable parts of this class is the labs which have undergone like a big revamp this time around i want to talk a little bit more about what we're covering there so the problem that we're going to be working on the labs is creating an application that allows you to take a picture of a handwritten page of text and then transcribe that into some actual text and so imagine that you have this web application where you can take a picture of your handwriting and then at the end you get the text that comes out of it and so the way this is going to work is we're going to build a web backend that allows you to send web requests decodes those images and sends
|
88 |
+
|
89 |
+
23
|
90 |
+
00:13:09,279 --> 00:13:47,920
|
91 |
+
them to a prediction model an ocr model that will develop that will transcribe those into the text itself and those models are going to be generated by a model training system that will also show you how to build in the class and the architecture that we'll use will look something like this we'll use state-of-the-art tools that we think balance being able to really build a system like this in a principled way without adding too much complexity to what you're doing all right so just to summarize this section machine learning powered products are going mainstream and in large part this is due to the fact that it's just much much easier to build machine learning models today than it was even four or five years ago and
|
92 |
+
|
93 |
+
24
|
94 |
+
00:13:46,079 --> 00:14:21,839
|
95 |
+
so i think the challenge ahead is given that we're able to create these models pretty easily how do we actually use the models to build great products and that's a lot of what we'll talk about in this class and i think the sort of fundamental challenge is that there's not only different tools that you need in order to build great products but also different processes and mindsets as well and that's what we're really aiming to do here in fsdl so looking forward to covering some of this material and hopefully helping create the next generation of ml powered products the next topic i want to dive into is when to use machine learning at all like what problems is this technology useful for solving and so the key points that we're
|
96 |
+
|
97 |
+
25
|
98 |
+
00:14:20,320 --> 00:14:55,920
|
99 |
+
going to cover here are the first is that machine learning introduces a lot of complexity and so you really shouldn't do it before you're ready to do it and you should think about exhausting your other options before you introduce this to your stack on the flip side that doesn't mean that you need to a perfect infrastructure to get started and then we'll talk a little bit about what types of projects tend to lend themselves to being good applications of machine learning and we'll talk about how to know whether projects are feasible and whether they'll have an impact on your organization but to start out with when should you use machine learning at all so i think the first thing that's really critical to know
|
100 |
+
|
101 |
+
26
|
102 |
+
00:14:54,079 --> 00:15:32,000
|
103 |
+
here is that machine learning projects have a higher failure rate than software products in general the statistic that you'll see most often floated around in blog posts or vendor pitches is that 87 percent this very precise number of machine learning projects fail i think it's also worth noting that 73 of all statistics are made up on the spot so and this one in particular i think is a little bit questionable whether this is actually a valid statistic or not anecdotally i would say that from what i've seen it's probably more like 25 it's still a very high number still a very high failure rate but maybe not the 90-ish percent that people are quoting the question you might ask is why is that the case right why is there such a
|
104 |
+
|
105 |
+
27
|
106 |
+
00:15:30,639 --> 00:16:12,639
|
107 |
+
high failure rate for machine learning projects you know one reason that's worth acknowledging is that for a lot of applications machine learning is fundamentally still research so 100 success rate probably shouldn't be the target that we're aiming for but i do think that many machine learning projects are doomed to fail maybe even before they are undertaken and i think there's a few reasons that this can happen so oftentimes machine learning projects are technically infeasible or they're just scoped poorly and there's just too much of a lift to even get the first version of the model developed and that leads to projects failing because they just take too long to see any value another common failure mode that's becoming less and less
|
108 |
+
|
109 |
+
28
|
110 |
+
00:16:10,079 --> 00:16:50,720
|
111 |
+
common is that a team that's really effective at developing a model may not be the right team to deploy that model into production and so there's this friction after the model is developed where you know the model maybe looks promising in a jupiter notebook but it never makes the leap to prod so hopefully you'll take things away from this class that will help you avoid being in this category another really common issue that i've seen is when you as a broader organization are not all on the same page about what we would consider to be successful here and so i've seen a lot of machine learning projects fail because you have a model that you think works pretty well and you actually know how to deploy into
|
112 |
+
|
113 |
+
29
|
114 |
+
00:16:49,120 --> 00:17:27,120
|
115 |
+
production but the rest of the organization can't get comfortable with the fact that this is actually going to be running and serving predictions to users so how do we know when we're ready to deploy and then maybe the most frustrating of all these failure modes is when you actually have your model work well and it solves the problem that you set out to solve but it doesn't solve a big enough problem and so the organization decides hey this isn't worth the additional complexity that it's going to take to make this part of our stack you know i think this is a point i want to double click on which is that really i think the bar for your machine learning project should be that the value of the project must outweigh not just the cost of
|
116 |
+
|
117 |
+
30
|
118 |
+
00:17:25,679 --> 00:18:07,440
|
119 |
+
developing it but the additional complexity that machine learning systems introduce into your software and machine learning introduces a lot of complexity to your software so this is kind of a quick summary of a classic paper that i would recommend reading which is the high interest credit card of technical debt paper the thesis of this paper is that machine learning as a technology tends to introduce technical debt at a much higher rate than most other software and the reasons that the authors point to are one an erosion of boundary between systems so machine learning systems often have the property for example that the predictions that they make influence other systems that they interact with if you recommend a user a particular type
|
120 |
+
|
121 |
+
31
|
122 |
+
00:18:05,120 --> 00:18:44,799
|
123 |
+
of content that changes their behavior and so that makes it hard to isolate machine learning as a component in your system it also relies on expensive data dependencies so if your machine learning system relies on a feature that's generated by another part of your system then those types of dependencies the authors found can be very expensive to maintain it's also very common for machine learning systems to be developed with design anti-patterns somewhat avoidable but in practice very common and the systems are subject to the instability of the external world if your user's behavior changes that can dramatically affect the performance of your machine learning models in a way that doesn't typically happen with
|
124 |
+
|
125 |
+
32
|
126 |
+
00:18:42,240 --> 00:19:21,200
|
127 |
+
traditional software so the upshot is before you start a new ml project you should ask yourself are we ready to use ml at all do we really need this technology to solve this problem and is it actually ethical to use ml to solve this problem to know if you're ready to use ml some of the questions you might ask are do we have a product at all do we have something that we can use to collect the data to know whether this is actually working are we already collecting that data and storing it in the same way if you're not currently doing data collection then it's going to be difficult to build your first ml system and do we have the team that will allow us to do this knowing whether you need ml to solve a problem i think the first question that
|
128 |
+
|
129 |
+
33
|
130 |
+
00:19:19,760 --> 00:19:58,880
|
131 |
+
you should ask yourself is do we need to solve this problem at all or are we just inventing a reason to use ml because we're excited about the technology have we tried using rules or simple statistics to solve the problem with some exceptions i think usually the first version of a system that you deploy that will eventually use ml should be a simple rule based or statistics-based system because a lot of times you can get 80 of the benefit of your complex ml system with some simple rules now there's some exceptions to this if the system is an nlp system or a computer vision system where rules just typically don't perform very well but as a general rule i think if you haven't at least thought about whether you can use
|
132 |
+
|
133 |
+
34
|
134 |
+
00:19:57,360 --> 00:20:33,760
|
135 |
+
a rule-based system to achieve the same outcome then maybe you're not ready to use ml yet and lastly is it ethical i won't dive into the details here because we'll have a whole lecture on this later in the course next thing i want to talk about is if we feel like we're ready to use ml in our organization how do we know if the problem that we're working on is a good fit to solving it with machine learning the sort of tl dr here is you want to look for like any other project prioritization you want to look for use cases that have high impact and low cost and so we'll talk about different heuristics that you can use to determine whether this application of machine learning is likely to be high impact and low cost and so we'll talk about
|
136 |
+
|
137 |
+
35
|
138 |
+
00:20:32,559 --> 00:21:10,240
|
139 |
+
heuristics like friction in your products complex parts of your pipeline places where it's valuable to reduce the cost of prediction and looking at what other people in your industry are doing which is a very underrated technique for picking problems to work on and then we'll also talk about some heuristics for for assessing whether a machine learning project is going to be feasible from a cost perspective overall prioritization framework that we're going to look at here is projects that you want to select are ones that are feasible so they're low cost and they're high impact let's start with the high impact side of things so what are some mental models you can use to find high impact ml projects and these are some of the ones that we'll
|
140 |
+
|
141 |
+
36
|
142 |
+
00:21:08,000 --> 00:21:49,520
|
143 |
+
cover so starting with a book called the economics of ai and so the question this book asks is what problems does machine learning make economically feasible to solve that were maybe not feasible to solve in the past and so the sort of core observation in this book is that really at a fundamental level what ai does is it reduces the cost of prediction before maybe you needed a person and that person would take five minutes to create a prediction it's very expensive it's very operationally complex ai can do that in a fraction of a second for the cost of essentially running your machine or running your gpu cheap prediction means that there's going to be predictions that are happening in more places even in problems whereas too
|
144 |
+
|
145 |
+
37
|
146 |
+
00:21:47,200 --> 00:22:27,360
|
147 |
+
expensive to do before and so the upshot of this mental model for project selection is think about projects where cheap prediction will have a huge business impact like where would you hire a bunch of people to make predictions that it isn't feasible to do now um the next mental model i want to talk about for selecting high impact projects is just thinking about what is your product need and so i really like this article called three principles for designing ml-powered products from spotify and in this article they talked about the principles that they used to develop the discover weekly feature which i think is like one of the most powerful features of spotify and you know really the way they thought about it is this reduces
|
148 |
+
|
149 |
+
38
|
150 |
+
00:22:25,840 --> 00:23:06,080
|
151 |
+
friction for our users reduces the friction of chasing everything down yourself and just brings you everything in a neat little package and so this is something that really makes their product a lot better and so that's another kind of easy way to come up with ideas for machine learning projects another angle to think about is what are types of problems that machine learning is particularly good at and one exploration of this mental model is an article called software 2.0 from andre carpathi which is also definitely worth a read and the kind of main thesis of this article is that machine learning is really useful when you can take a really complex part of your existing software system so a really messy stack of
|
152 |
+
|
153 |
+
39
|
154 |
+
00:23:04,400 --> 00:23:46,799
|
155 |
+
handwritten rules and replace that with machine learning replace that with gradient descent and so if you have a part of your system that is complex manually defined rules then that's potentially a really good candidate for automating with ml and then lastly i think it's worth just looking at what other people in your industry are doing with ml and there's a bunch of different resources that you can look at to try to figure out what other success stories with ml are i really like this article covering the spectrum of use cases of ml at netflix there are various industry reports this is a summary of one from algorithmia which kind of covers the spectrum of what people are using ml to do and more generally i think looking at
|
156 |
+
|
157 |
+
40
|
158 |
+
00:23:44,960 --> 00:24:21,919
|
159 |
+
papers from the biggest technology companies tends to be a good source of what those companies are trying to build with ml and how they're doing it as well as earlier stage tech companies that are still pretty ml forward and those companies i think are more likely to write these insights in blog posts than they are in papers and so here's a list that i didn't compile but i think is really valuable of case studies of using machine learning in the real world that are worth going through if you're looking for inspiration of what are types of problems you can solve and how might you solve them okay so coming back to our prioritization framework we talked about some mental models for what ml projects might be high impact
|
160 |
+
|
161 |
+
41
|
162 |
+
00:24:20,960 --> 00:24:57,760
|
163 |
+
and the next thing that we're going to talk about is how to assess the cost of a machine learning project that you're considering so the way i like to think about the cost of machine learning power projects is there's three main drivers for how much a project is gonna cost the first and most important is data availability so how easy is it for you to get the data that you're gonna need to solve this problem the second most important is the accuracy requirement that you have for the problem that you're solving and then also important is the sort of intrinsic difficulty of the problem that you're trying to solve so let's start by talking about data availability the kind of key questions that you might ask here to assess
|
164 |
+
|
165 |
+
42
|
166 |
+
00:24:56,080 --> 00:25:36,000
|
167 |
+
whether data availability is going to be a bottleneck for your project is do we have this data already and if not how hard is it and how expensive is it going to be to acquire how expensive is it not just to acquire but also to label if your labelers are really expensive then getting enough data to solve the problem really well might be difficult how much data will we need in total this can be difficult to assess a priori but if you have some way of guessing whether it's 5 000 or 10 000 or 100 000 data points this is an important input and then how stable is the data so if you're working on a problem where you don't really expect the underlying data to change that much over time then the project is going to be a lot more feasible than if
|
168 |
+
|
169 |
+
43
|
170 |
+
00:25:33,919 --> 00:26:15,200
|
171 |
+
the data that you need changes on a day-to-day basis so data availability is probably the most important cost driver for a lot of machine learning powered projects because data just tends to be expensive and this is slightly less true outside of the deep learning realm it's particularly true in deep learning where you often require manual labeling but it also is true in a lot of other ml applications where data collection is expensive and lastly on data bill availability is what data security requirements do you have if you're able to collect data from your users and use that to retrain your model then that bodes well for the overall cost of the project if on the other hand you're not even able to look at the data that your
|
172 |
+
|
173 |
+
44
|
174 |
+
00:26:12,960 --> 00:26:53,760
|
175 |
+
users are generating then that's just going to make the project more expensive because it's going to be harder to debug and harder to build a data flywheel moving on to the accuracy requirement the kinds of questions you might ask here are how expensive is it when you make a wrong prediction on one extreme you might have something like a self-driving car where a wrong prediction is extremely expensive because the prospect of that is really terrible on the other extreme is something like let's say potentially a recommender system where if a user sees a bad recommendation once it's probably not really going to be that bad maybe it affects their user experience over time and maybe and causes them to churn but certainly not
|
176 |
+
|
177 |
+
45
|
178 |
+
00:26:52,080 --> 00:27:30,320
|
179 |
+
as bad as a wrong prediction in a self-driving car you also need to ask yourself how frequently does the system actually need to be right to be useful i like to think of systems like dolly 2 which is an image generation system as like a positive example of this where you can if you're just using dolly 2 as a creative supplement you can generate thousands and thousands of images and select the one that you like best for your use case so the system doesn't need to be right more than like once every n times in order to actually get value from it as a user on the other hand if the system needs to be 100 reliable like never ever make a wrong prediction in order for it to be useful then it's just going to be more expensive to build
|
180 |
+
|
181 |
+
46
|
182 |
+
00:27:28,640 --> 00:28:06,480
|
183 |
+
these systems and then what are the ethical implications of your model making wrong predictions is like an important question to consider as well and then lastly on the problem difficulty questions to ask yourself are is this problem well defined enough to solve with ml are other people working on similar things doesn't necessarily need to be the exact same problem but if it's a sort of a brand new problem that no one's ever solved with mlv4 that's going to introduce a lot of technical risk another thing that's worth looking at if you're looking at other work on similar problems is how much compute did it actually take them to solve this problem and it's worth looking at that both on the training side as well as on
|
184 |
+
|
185 |
+
47
|
186 |
+
00:28:04,559 --> 00:28:42,240
|
187 |
+
the inference side because if it's feasible to train your model but it takes five seconds to make a prediction then for some applications that will be good enough and some for some it won't and then i think like maybe the weakest heuristic here but still potentially a useful one is can a human do this problem at all if a human can solve the problem then that's a decent indication that a machine learning system might be able to solve it as well but not a perfect indication as we'll come back to so i want to double click on this accuracy requirement why is this such an important driver of the cost of machine learning projects the fundamental reason is that in my observation the project cost tends to scale like super linearly
|
188 |
+
|
189 |
+
48
|
190 |
+
00:28:40,000 --> 00:29:25,120
|
191 |
+
in your accuracy requirement so as a very rough rule of thumb every time that you add an additional nine to your required accuracy so moving from 99.9 to 99.99 accuracy might lead to something like a 10x increase in your project costs because you might expect to need at least 10 times as much data if not more in order to actually solve the problem to that degree of accuracy required but also you might need a bunch of additional infrastructure monitoring support in order to ensure that the model is actually performing that accurately next thing i'm going to double click on is the problem difficulty so how do we know which problems are difficult for machine learning systems to solve the first point i want to make here is this is
|
192 |
+
|
193 |
+
49
|
194 |
+
00:29:23,039 --> 00:30:08,960
|
195 |
+
like i think like a classically hard problem to really answer confidently and so i really like this comic for two reasons the first is because it gets at this core property of machine learning systems which is that it's not always intuitive which problems will be easy for a computer to solve and which ones will be hard for a computer to solve in 2010 doing gis lookup was super easy and detecting whether a photo was a bird was like a research team in five years level of difficulty so not super intuitive as someone maybe outside of the field the second reason i like this comic is because it also points to the sort of second challenge in assessing feasibility in ml which is that this field just moves so fast that if you're not keeping up with
|
196 |
+
|
197 |
+
50
|
198 |
+
00:30:07,279 --> 00:30:50,320
|
199 |
+
what's going on in the state of the art then your understanding of what's feasible will be stale very quickly building an application to detect whether a photo is of a bird is no longer a research team in five years problem it's like a api call and 15 minutes type problem so take everything i say here with the grain of salt because the feasibility of ml projects is notoriously difficult to predict another example here is in the late 90s the new york times when they were talking about sort of ai systems beating humans at chess predicted that it might be a hundred years before a computer beats human echo or even longer and you know less than 20 years later machine learning systems from deep mind beat the best humans in
|
200 |
+
|
201 |
+
51
|
202 |
+
00:30:48,399 --> 00:31:22,000
|
203 |
+
the world that go these predictions are notoriously difficult to make but that being said i think it's still worth talking about and so one heuristic that you'll hear for what's feasible to do with machine learning is this heuristic from andrew ing which is that anything that a normal person can do in less than one second we can automate with ai i think this is actually not a great heuristic for what's feasible to do with ai but you'll hear it a lot so i wanted to talk about it anyway there's some examples of where this is true right so recognizing the content of images understanding speech potentially translating speech maybe grasping objects with a robot and things like that are things that you could point to
|
204 |
+
|
205 |
+
52
|
206 |
+
00:31:20,320 --> 00:31:57,600
|
207 |
+
as evidence for andrew's statement being correct but i think there's some really obvious counter examples as well machine learning systems are still no good at things that a lot of people are really good at like understanding human humor or sarcasm complex in-hand manipulation of objects generalizing to brand new scenarios that they've never seen before this is a heuristic that you'll see it's not one that i would recommend using seriously to assess whether your project is feasible or not there's a few things that we can say are definitely still hard in machine learning i kept a couple of things in these slides that we talked about being really difficult in machine learning when we started teaching the class in
|
208 |
+
|
209 |
+
53
|
210 |
+
00:31:55,120 --> 00:32:32,640
|
211 |
+
2018 that i think i would no longer consider to be super difficult anymore unsupervised learning being one of them but reinforcement learning problems still tend to be not very feasible to solve for real world use cases although there are some use cases where with tons of data and compute reinforcement learning can be used to solve real world problems within the context of supervised learning there are also still problems that are hard so things like question answering a lot of progress over the last few years still these systems aren't perfect text summarization video prediction building 3d models another example of one that i think i would use to say is really difficult but with nerf and all the sort
|
212 |
+
|
213 |
+
54
|
214 |
+
00:32:30,640 --> 00:33:07,679
|
215 |
+
of derivatives of that i think is more feasible than ever real world speech recognition so outside of the context of a clean data set in a noisy room can we recognize what people are saying resisting adversarial examples doing math although there's been a lot of progress on this problem as well over the last few months solving world war problems or bond guard problems this is an example by the way of a bomb card problem it's a visual analogy type problem so this is kind of a laundry list of some things that are still difficult even in supervised learning and so can we reason about this what types of problems are still difficult to do so i think one type is where not the input to the model itself but the prediction that the model is
|
216 |
+
|
217 |
+
55
|
218 |
+
00:33:06,000 --> 00:33:47,760
|
219 |
+
making the output of the model where that is like a complex or high dimensional structure or where it's ambiguous right so for example 3d reconstruction the 3d model that you're outputting is very high dimensional and so that makes it difficult to do for ml video prediction not only high dimensional but also ambiguous just because you know what happened in the video for the last five seconds there's still maybe infinite possibilities for what the video might look like going forward so it's ambiguous and it's high dimensional which makes it very difficult to do with ml dialog systems again very ambiguous very open-ended very difficult to do with ml and uh open-ended recommender systems so a second category of problems that are
|
220 |
+
|
221 |
+
56
|
222 |
+
00:33:46,080 --> 00:34:26,720
|
223 |
+
still difficult to do with ml are problems where you really need the system to be reliable machine learning systems tend to fail in all kinds of unexpected and hard to reason about ways so anywhere where you need really high precision or robustness is gonna be more difficult to solve using machine learning so failing safely out of distribution for example is still a difficult problem in ml robustness to adversarial attacks is still a difficult problem in ml and even things that are easier to do with low precision like estimating the position and rotation of an object in 3d space can be very difficult to do if you have a high precision requirement the last category of problems i'll point to here is problems where you need the
|
224 |
+
|
225 |
+
57
|
226 |
+
00:34:24,560 --> 00:35:04,720
|
227 |
+
system to be able to generalize well to data that it's never seen before this can be data that's out of distribution it can be where your system needs to do something that looks like reasoning or planning or understanding of causality these problems tend to be more in the research domain today i would say one example is in the self-driving car world dealing with edge cases very difficult challenge in that field but also control problems in self-driving cars you know those stacks are incorporating more and more ml into them whereas the computer vision and perception part of self-driving cars adopted machine learning pretty early the control piece was using more traditional methods for much longer and then places where you have a small
|
228 |
+
|
229 |
+
58
|
230 |
+
00:35:02,800 --> 00:35:40,400
|
231 |
+
amount of data again like if you're considering machine learning broadly small data is often possible but especially in the context of deep learning small data still presents a lot of challenges summing this up like how should you try to assess whether your machine learning project is feasible or not first question you should ask is do we really need to solve this problem with ml at all i would recommend putting in the work up front to define what is the success criteria that we need and doing this with everyone that needs to sign up on the project in the end not just the ml team let's avoid being an ml team that works on problems in isolation and then has those projects killed because no one actually really needed to solve
|
232 |
+
|
233 |
+
59
|
234 |
+
00:35:38,000 --> 00:36:13,440
|
235 |
+
this problem or because the value of the solution is not worth the complexity that it adds to your product then you should consider the ethics of using ml to solve this problem and we'll talk more about this towards the end of the course in the ethics lecture then it's worth doing a literature review to make sure that there are examples of people working on similar problems trying to rapidly build a benchmark data set that's labeled so you can start to get some sense of whether your model's performing well or not then and only then building a minimum viable model so this is potentially even just manual rules or simple linear regression deploying this into production if it's feasible to do so or at least running
|
236 |
+
|
237 |
+
60
|
238 |
+
00:36:11,520 --> 00:36:50,640
|
239 |
+
this on your existing problem so you have a baseline and then lastly it's worth just restating making sure that you once you've built this minimum viable model that may not even use ml just really asking yourself the question of whether this is good enough for now or whether it's worth putting in the additional effort to turn this into a complex ml system the next point i want to make here is that not all ml projects really have the same characteristics and so should be and so you shouldn't think about planning all ml projects in the same way i want to talk about some archetypes of different types of ml projects and the implications that they have for the feasibility of the projects and how you might run the projects
|
240 |
+
|
241 |
+
61
|
242 |
+
00:36:49,040 --> 00:37:28,560
|
243 |
+
effectively and so the three archetypes i want to talk to are defined by how they interact with real world users and so the first archetype is software 2.0 use cases and so i would define this as taking something that software does today so an existing part of your product that you have let's say and doing it better more accurately or more efficiently with ml it's taking a part of your product that's already automated or already partially automated and adding more automation or more efficient automation using machine learning then the next archetype is human in the loop systems and so this is where you take something that is not currently automated in your system but it's something that humans are doing or
|
244 |
+
|
245 |
+
62
|
246 |
+
00:37:26,720 --> 00:38:07,839
|
247 |
+
humans could be doing and helping them do that job better more efficiently or more accurate accurately by supplementing their judgment with ml based tools preventing them from needing to do the job on every single data point by giving them suggestions of what they can do so they can shortcut their process in a lot of places human loop systems are about making the humans that are ultimately making the decisions more efficient or more effective and then lastly autonomous systems and so these are systems where you take something that humans do today or maybe is just not being done at all today and fully automated with ml to the point where you actually don't need humans to do the judgment piece of it at all and so some
|
248 |
+
|
249 |
+
63
|
250 |
+
00:38:05,440 --> 00:38:47,200
|
251 |
+
examples of software 2.0 are if you have an ide that has code completion can we do better code completion by using ml can we take a recommendation system that is initially using some simple rules and making it more customized can we take our video game ai that's using this rule-based system and make it much better by using machine learning some examples of human and loop systems would be building a product to turn hand-drawn sketches into slides you still have a human on the other end that's evaluating the quality of those sketches before they go in front of a customer or stakeholder so it's a human in the loop system but it's potentially saving a lot of time for that human email auto completion so if you use
|
252 |
+
|
253 |
+
64
|
254 |
+
00:38:45,359 --> 00:39:23,119
|
255 |
+
gmail you've seen these email suggestions where it'll suggest sort of short responses to the email that you got i get to decide whether that email actually goes out to the world so it's not an automation system it's a human in the loop system or helping a radiologist do their job faster and then examples of autonomous systems are things like full self-driving right maybe there's not even a steering wheel in the car i can't interrupt the autonomous system and take over control of the car even if i wanted to or maybe it's not designed for me to do that very often fully automated customer support so if i go on a company's website and i interact with their customer support without even having the option of talking to an agent
|
256 |
+
|
257 |
+
65
|
258 |
+
00:39:21,280 --> 00:39:56,720
|
259 |
+
or with them making it very difficult to talk to an agent that's an autonomous system or for example like fully automating website design so that to the point where people who are not design experts can just click a button and get a website designed for them and so i think some of the key questions that you need to ask before embarking on these projects are a little bit different depending on which archetype your project falls into so if you're working on a software 2.0 project then i think some of the questions you should be concerned about are how do you know that your models are actually performing improving performance over the baseline that you already have how confident are you that the type of performance improvement that
|
260 |
+
|
261 |
+
66
|
262 |
+
00:39:54,960 --> 00:40:32,560
|
263 |
+
you might be able to get from ml is actually going to generate value for your business if it's just one percent better is that really worth the cost then do these performance improvements lead to what's called a data flywheel which i'll talk a little bit more about with human in the loop systems you might ask a different set of questions before you embark on the project like how good does the system actually need to be useful if the system you know is able to automate 10 of the work of the human that is ultimately making the decisions or producing the end product is that useful to them or does that just slow it slow them down how can you collect enough data to make it that good is it possible to actually build a data set
|
264 |
+
|
265 |
+
67
|
266 |
+
00:40:30,720 --> 00:41:08,400
|
267 |
+
that is able to get you to that useful threshold for your system and for autonomous systems the types of questions you might ask are what is an acceptable failure rate for this system how many nines in your performance threshold do you need in order for this sort of not to cause harm in the world and how can you guarantee like how can you be really confident that one it won't exceed that failure rate and so this is something that in autonomous vehicles for example teams put a ton of effort into building the simulation and testing systems that they need to be confident that they won't exceed the failure rate that's except the very very low failure rate that's acceptable for those systems i want to double click on this data
|
268 |
+
|
269 |
+
68
|
270 |
+
00:41:06,160 --> 00:41:49,040
|
271 |
+
flywheel concept for software 2.0 we talked about can we build a data flywheel that lead to better and better performance of the system and the way to think about a data flywheel is it's this virtuous cycle where as your model gets better you are able to use a better that better model to make a better product which allows you to acquire more users and as you have more users those users generate more data which you can use to build a better model and this creates this virtuous cycle and so the connections between each of these steps are also important in order for more users to allow you to collect more data you need to have a data loop where you need to have a way of automatically collecting data and deciding what data points to
|
272 |
+
|
273 |
+
69
|
274 |
+
00:41:46,960 --> 00:42:23,839
|
275 |
+
label from your users or at least processes for doing these in order for more data to lead to a better model that's that's kind of on you as an ml practitioner right like you need to be able to translate more data more granular data more labels into a model that performs better for your users and then in order for the better model to lead to better users you need to be sure that better predictions are actually making your product better another point that i want to make on these project archetypes is i would sort of characterize them as having different trade-offs on this feasibility versus impact two by two that we talked about earlier software 2.0 projects since they're just taking something that you
|
276 |
+
|
277 |
+
70
|
278 |
+
00:42:22,480 --> 00:43:00,000
|
279 |
+
already know you can automate and automating it better tend to be more feasible but since you already have an answer to the question that they're also answering they also tend to be lower impact on the other extreme autonomous systems tend to be very difficult to build because the accuracy requirements in general are quite high but the impact can be quite high as well because you're replacing something that literally doesn't exist and human in the loop systems tend to be somewhere in between where you can really like you can use this paradigm of machine learning products to build things that couldn't exist before but the impact is not quite as high because you still need people in the loop that are helping use their judgment
|
280 |
+
|
281 |
+
71
|
282 |
+
00:42:57,599 --> 00:43:40,400
|
283 |
+
to complement the machine learning model there's ways that you can move these types of projects on the feasibility impact matrix to make them more likely to succeed so if you're working on a software 2.0 project you can make these projects have potentially higher impacts by implementing a data loop that allows you to build continual improvement data flywheel that we talked about before and potentially allows you to use the data that you're collecting from users interacting with this system to automate more tasks in the future so for example in the code completion ide example that we gave before you can you know if you're building something like github copilot then think about all the things that the data that you're collecting
|
284 |
+
|
285 |
+
72
|
286 |
+
00:43:38,560 --> 00:44:20,240
|
287 |
+
from that could be useful for building in the future you can make human in the loop systems more feasible through good product design and we'll talk a little bit more about this in a future lecture but there's design paradigms in the product itself that can reduce the accuracy requirement for these types of systems and another way to make these projects more feasible is by adopting sort of a different mindset which is let's just make the system good enough and ship it into the real world so we can start the process of you know seeing how how real users interact with it and using the feedback that we get from our humans in the loop to make the model better and then lastly autonomous systems can be made more feasible by adding guard rails
|
288 |
+
|
289 |
+
73
|
290 |
+
00:44:18,240 --> 00:44:59,119
|
291 |
+
or in some cases adding humans in the loop and so this is you can think of this as the approach to autonomous vehicles where you have safety drivers in the loop early on in the project or where you introduce tele operations so that a human can take control of the system if it looks like something is going wrong i think another point that is really important here is despite all this talk about what's feasible to do with ml the complexity that ml introduce is in your system i don't mean by any of this to say that you should do necessarily a huge amount of planning before you dive into using ml at all just make sure that the project that you're working on is the right project and then just dive in and get started and in particular i think a
|
292 |
+
|
293 |
+
74
|
294 |
+
00:44:57,359 --> 00:45:34,960
|
295 |
+
failure mode that i'm seeing crop up more and more over the past couple of years that you should avoid is falling into the trap of tool fetishization so one of the great things that's happened in ml over the past couple of years is the rise of this ml ops discipline and alongside of that has been proliferation of different tools that are available on the market to help with different parts of the ml process and one thing that i've noticed that this has caused for a lot of folks is this sort of general feeling that you really need to have perfect tools before you get started you don't need perfect tools to get started and you also don't need a perfect model and in particular just because google or uber is doing
|
296 |
+
|
297 |
+
75
|
298 |
+
00:45:33,599 --> 00:46:12,400
|
299 |
+
something like just because they have you know a feature store as part of their stack or they serve models in a particular way doesn't mean that you need to have that as well and so a lot of what we'll try to do in this class is talk about what's the middle ground be between doing things in the right way from a production perspective but not introducing too much complexity early on into your project so that's one of the reasons why fsdl is a class about building ml powered products in a practical way and not in mlaps class that's focused on what is the state of the art in the best possible infrastructure that you can use and um a talk and blog posts and associated set of things on this concept that i really
|
300 |
+
|
301 |
+
76
|
302 |
+
00:46:09,520 --> 00:46:54,960
|
303 |
+
like is this ml offset reasonable scale push by some of the folks from kovio and the sort of central thesis of ml offs at reasonable scale is you're not google you probably have a finite compute budget not entire cloud you probably have a limited number of folks on your team you probably have not an infinite budget to spend on this and you probably have a limited amount of data as well and so those differences between what you have and what uber has or what google has have implications for what the right stack is for the problems that you're solving and so it's worth thinking about these cases separately and so if you're interested in what one company did and recommends for an ml stack that isn't designed to
|
304 |
+
|
305 |
+
77
|
306 |
+
00:46:52,000 --> 00:47:31,200
|
307 |
+
scale to becoming uber scale then i recommend checking out this talk to summarize what we've covered so far machine learning is an incredibly powerful technology but it does add a lot of complexity and so before you embark on a machine learning project you should make sure that you're thinking carefully about whether you really need ml to solve the problem that you're solving and whether the problem is actually worth solving at all given the complexity that this adds and so let's avoid being ml teams that have their projects get killed because we're working on things that don't really matter to the business that we're a part of all right and the last topic i want to cover today is once you've sort of made this decision to embark on an ml
|
308 |
+
|
309 |
+
78
|
310 |
+
00:47:29,520 --> 00:48:07,599
|
311 |
+
project what are the different steps that you're going to go through in order to actually execute on that project and this will also give you an outline for some of the other things you can expect from the class so the running case study that we'll use here is a modified version of a problem that i worked on when i was at open ai which is pose estimation our goal is to build a system that runs on a robot that takes the camera feed from that robot and uses it to estimate the position in 3d space and the orientation the rotation of each of the objects in the scene so that we can use those for downstream tasks and in particular so we can use them to feed into a separate model which will be used to tell the robot how it
|
312 |
+
|
313 |
+
79
|
314 |
+
00:48:06,000 --> 00:48:40,000
|
315 |
+
actually can grasp the different objects in the scene machine learning projects start like any other project in a planning and project setup phase and so what the types of activities we'd be doing in this phase when we're working on this pose estimation project are things like deciding to work on post-estimation at all determining whether how much this is going to cost what resources we need to allocate to it considering the ethical implications and things like this right a lot of what we've been talking about so far in this lecture once we plan the project then we'll move into a data collection and labeling phase and so for pose estimation what this might look like is collecting the corpus of objects that
|
316 |
+
|
317 |
+
80
|
318 |
+
00:48:38,640 --> 00:49:18,559
|
319 |
+
we're going to train our model on setting up our sensors like our cameras to capture our information about those objects actually capturing those objects and somehow figuring out how to annotate these images that we're capturing with ground truth like the pose of the of the objects in those images one point i want to make about the life cycle of mbl projects is that this is not like a straightforward path machine learning projects tend to be very iterative and each of these phases can feed back into any of the phases before as you learn more about the problem that you're working on so for example you might realize that actually it's way too hard for us to get data in order to solve this problem or it's really difficult for us to label
|
320 |
+
|
321 |
+
81
|
322 |
+
00:49:16,079 --> 00:49:54,559
|
323 |
+
the pose of these objects in 3d space but what we can do is it's actually much cheaper for us to annotate like per pixel segmentation so can we reformulate the problem in a way that allows us to to use what we've learned about data collection and labeling to plan a better project once you have some data to work on then you enter the sort of training and debugging phase and so what we might do here is we might implement a baseline for our model not using like a complex neural network but just using some opencv functions and then once we have that working we might find a state-of-the-art model and reproduce it debug our implementation and iterate on our model run some hyper parameter sweeps until it performs well
|
324 |
+
|
325 |
+
82
|
326 |
+
00:49:52,720 --> 00:50:29,599
|
327 |
+
on our task this can feed back into the data collection and labeling phase because we might realize that you know we actually need more data in order to solve this problem or we might also realize that there's something flawed in the process that we've been using to label the data that we're using data labeling process might need to be revisited but we can also loop all the way back to the project planning phase because we might realize that actually this task is a lot harder than we thought or the requirements that we specified at the planning phase trade off with each other so we need to revisit which are most important so for example like maybe we thought that we had an accuracy requirement of estimating the pose of these objects to
|
328 |
+
|
329 |
+
83
|
330 |
+
00:50:26,960 --> 00:51:06,720
|
331 |
+
one tenth of one centimeter and we also had an a latency requirement for inference in our models of 1 100th of a second to run on robotic hardware and we might realize that hey you know we can get this really really tight accuracy requirement or we can have really fast inference but it's very difficult to do both so is it possible to relax one of those assumptions once you've trained a model that works pretty well offline for your task then your goal is going to be to deploy that model test it in the real world and then use that information to figure out where to go next for the purpose of this project that might look like piloting the grasping system in the lab so before we roll it out to actual users can we
|
332 |
+
|
333 |
+
84
|
334 |
+
00:51:04,880 --> 00:51:42,319
|
335 |
+
test it in a realistic scenario and we might also do things like writing tests to prevent regressions and evaluate for bias in the model and then eventually rolling this out into production and monitoring it and continually improving it from there and so we can feed back here into the training and debugging stage because oftentimes what we'll find is that the model that worked really well for our offline data set once it gets into the real world it doesn't actually work as well as we thought whether that's because the accuracy requirement that we had for the model was wrong like we actually needed it to be more accurate than we thought or maybe the metric that we're looking at the accuracy is not actually the metric
|
336 |
+
|
337 |
+
85
|
338 |
+
00:51:39,760 --> 00:52:17,920
|
339 |
+
that really matters for success at the downstream task that we're trying to solve because that could cause us to revisit the training phase we also could loop back to the data collection and labeling phase because common problem that we might find in the real world is that there's some mismatch between the training data that we collected and the data that we actually saw when we went out and tested this we could use what we learned from that to go collect more data or mine for hard cases like mine for the failure cases that we found in production and then finally as i alluded to before we could loop all the way back to the project planning phase because we realized that the metric that we picked doesn't really drive the downstream
|
340 |
+
|
341 |
+
86
|
342 |
+
00:52:15,920 --> 00:52:51,599
|
343 |
+
behavior that we desired just because the grasp model is accurate doesn't mean that the robot will actually be able to successfully grasp the object so we might need to use a different metric to really solve this task or we might realize that the performance in the real world isn't that great and so we maybe need to add additional requirements to our model as well maybe it just needs to be faster to in order to run on a real robot so these are kind of like what i think of as the activities that you do in any particular machine learning project that you undertake but there's also some sort of cross project things that you need in order to be successful which we'll talk about in the class as well you need to be able to work on
|
344 |
+
|
345 |
+
87
|
346 |
+
00:52:49,920 --> 00:53:25,040
|
347 |
+
these problems together as a team and you need to have the right infrastructure and tooling to make these processes more repeatable and these are topics that we'll cover as well so this is like a broad conceptual outline of the different topics that we'll talk about in this class and so to wrap up for today what we covered is machine learning is a complex technology and so you should use it because you need it or because you think it'll generate a lot of value but it's not a cure-all it doesn't solve every problem it won't automate every single thing that you wanted to automate so let's pick projects that are going to be valuable but in spite of this you don't need a perfect setup to get started and let's
|
348 |
+
|
349 |
+
88
|
350 |
+
00:53:23,440 --> 00:53:34,760
|
351 |
+
spend the rest of this course walking through the project lifecycle and learning about each of these stages and how we can how we can use them to build great ml powered products
|
352 |
+
|
documents/lecture-02.md
ADDED
@@ -0,0 +1,563 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
---
|
2 |
+
description: Software engineering, Deep learning frameworks, Distributed training, GPUs, and Experiment Management.
|
3 |
+
---
|
4 |
+
|
5 |
+
# Lecture 2: Development Infrastructure & Tooling
|
6 |
+
|
7 |
+
<div align="center">
|
8 |
+
<iframe width="720" height="405" src="https://www.youtube-nocookie.com/embed/BPYOsDCZbno?list=PL1T8fO7ArWleMMI8KPJ_5D5XSlovTW_Ur" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
|
9 |
+
</div>
|
10 |
+
|
11 |
+
Lecture by [Sergey Karayev](https://twitter.com/sergeykarayev).
|
12 |
+
Notes by [James Le](https://twitter.com/le_james94) and [Vishnu Rachakonda](https://www.linkedin.com/in/vrachakonda/).<br />
|
13 |
+
Published August 15, 2022.
|
14 |
+
[Download slides](https://drive.google.com/open?id=16pEG5GesO4_UAWiD5jrIReMGzoyn165M).
|
15 |
+
|
16 |
+
## 1 - Introduction
|
17 |
+
|
18 |
+
The **dream** of ML development is that given a project spec and some
|
19 |
+
sample data, you get a continually improving prediction system deployed
|
20 |
+
at scale.
|
21 |
+
|
22 |
+
The **reality** is starkly different:
|
23 |
+
|
24 |
+
- You have to collect, aggregate, process, clean, label, and version
|
25 |
+
the data.
|
26 |
+
|
27 |
+
- You have to find the model architecture and their pre-trained
|
28 |
+
weights and then write and debug the model code.
|
29 |
+
|
30 |
+
- You run training experiments and review the results, which will be
|
31 |
+
fed back into the process of trying out new architectures and
|
32 |
+
debugging more code.
|
33 |
+
|
34 |
+
- You can now deploy the model.
|
35 |
+
|
36 |
+
- After model deployment, you have to monitor model predictions and
|
37 |
+
close the data flywheel loop. Basically, your users generate fresh
|
38 |
+
data for you, which needs to be added to the training set.
|
39 |
+
|
40 |
+
![](./media/image3.png)
|
41 |
+
|
42 |
+
|
43 |
+
This reality has roughly three components: data, development, and
|
44 |
+
deployment. The tooling infrastructure landscape for them is large, so
|
45 |
+
we'll have three lectures to cover it all. **This lecture focuses on the
|
46 |
+
development component**.
|
47 |
+
|
48 |
+
## 2 - Software Engineering
|
49 |
+
|
50 |
+
![](./media/image7.png)
|
51 |
+
|
52 |
+
|
53 |
+
### Language
|
54 |
+
|
55 |
+
For your choice of **programming language**, Python is the clear winner
|
56 |
+
in scientific and data computing because of all the libraries that have
|
57 |
+
been developed. There have been some contenders like Julia and C/C++,
|
58 |
+
but Python has really won out.
|
59 |
+
|
60 |
+
### Editors
|
61 |
+
|
62 |
+
To write Python code, you need an **editor**. You have many options,
|
63 |
+
such as Vim, Emacs, Jupyter Notebook/Lab, VS Code, PyCharm, etc.
|
64 |
+
|
65 |
+
- We recommend [VS Code](https://code.visualstudio.com/)
|
66 |
+
because of its nice features such as built-in git version control,
|
67 |
+
documentation peeking, remote projects opening, linters and type
|
68 |
+
hints to catch bugs, etc.
|
69 |
+
|
70 |
+
- Many practitioners develop in [Jupyter
|
71 |
+
Notebooks](https://jupyter.org/), which is great as
|
72 |
+
the "first draft" of a data science project. You have to put in
|
73 |
+
little thought before you start coding and seeing the immediate
|
74 |
+
output. However, notebooks have a variety of problems: primitive
|
75 |
+
editor, out-of-order execution artifacts, and challenges to
|
76 |
+
version and test them. A counterpoint to these problems is the
|
77 |
+
[nbdev package](https://nbdev.fast.ai/) that lets
|
78 |
+
you write and test code all in one notebook environment.
|
79 |
+
|
80 |
+
- We recommend you use **VS Code with built-in support for
|
81 |
+
notebooks** - where you can write code in modules imported into
|
82 |
+
notebooks. It also enables awesome debugging.
|
83 |
+
|
84 |
+
If you want to build something more interactive,
|
85 |
+
[Streamlit](https://streamlit.io/) is an excellent choice.
|
86 |
+
It lets you decorate Python code, get interactive applets, and publish
|
87 |
+
them on the web to share with the world.
|
88 |
+
|
89 |
+
![](./media/image10.png)
|
90 |
+
|
91 |
+
|
92 |
+
For setting up the Python environment, we recommend you see [how we did
|
93 |
+
it in the
|
94 |
+
lab.](https://github.com/full-stack-deep-learning/conda-piptools)
|
95 |
+
|
96 |
+
## 3 - Deep Learning Frameworks
|
97 |
+
|
98 |
+
![](./media/image15.png)
|
99 |
+
|
100 |
+
|
101 |
+
Deep learning is not a lot of code with a matrix math library like
|
102 |
+
Numpy. But when you have to deploy your code onto CUDA for GPU-powered
|
103 |
+
deep learning, you want to consider deep learning frameworks as you
|
104 |
+
might be writing weird layer types, optimizers, data interfaces, etc.
|
105 |
+
|
106 |
+
### Frameworks
|
107 |
+
|
108 |
+
There are various frameworks, such as PyTorch, TensorFlow, and Jax. They
|
109 |
+
are all similar in that you first define your model by running Python
|
110 |
+
code and then collect an optimized execution graph for different
|
111 |
+
deployment patterns (CPU, GPU, TPU, mobile).
|
112 |
+
|
113 |
+
1. We prefer PyTorch because [it is absolutely
|
114 |
+
dominant](https://www.assemblyai.com/blog/pytorch-vs-tensorflow-in-2022/)
|
115 |
+
by measures such as the number of models, the number of papers,
|
116 |
+
and the number of competition winners. For instance, about [77%
|
117 |
+
of 2021 ML competition winners used
|
118 |
+
PyTorch](https://blog.mlcontests.com/p/winning-at-competitive-ml-in-2022?s=w).
|
119 |
+
|
120 |
+
2. With TensorFlow, you have TensorFlow.js (that lets you run deep
|
121 |
+
learning models in your browser) and Keras (an unmatched developer
|
122 |
+
experience for easy model development).
|
123 |
+
|
124 |
+
3. Jax is a meta-framework for deep learning.
|
125 |
+
|
126 |
+
![](./media/image12.png)
|
127 |
+
|
128 |
+
|
129 |
+
[PyTorch](https://pytorch.org/) has excellent developer
|
130 |
+
experience and is production-ready and even faster with TorchScript.
|
131 |
+
There is a great distributed training ecosystem. There are libraries for
|
132 |
+
vision, audio, etc. There are also mobile deployment targets.
|
133 |
+
|
134 |
+
[PyTorch Lightning](https://www.pytorchlightning.ai/)
|
135 |
+
provides a nice structure for organizing your training code, optimizer
|
136 |
+
code, evaluation code, data loaders, etc. With that structure, you can
|
137 |
+
run your code on any hardware. There are nice features such as
|
138 |
+
performance and bottleneck profiler, model checkpointing, 16-bit
|
139 |
+
precision, and distributed training libraries.
|
140 |
+
|
141 |
+
Another possibility is [FastAI
|
142 |
+
software](https://www.fast.ai/), which is developed
|
143 |
+
alongside the fast.ai course. It provides many advanced tricks such as
|
144 |
+
data augmentations, better initializations, learning rate schedulers,
|
145 |
+
etc. It has a modular structure with low-level API, mid-level API,
|
146 |
+
high-level API, and specific applications. The main problem with FastAI
|
147 |
+
is that its code style is quite different from mainstream Python.
|
148 |
+
|
149 |
+
At FSDL, we prefer PyTorch because of its strong ecosystem, but
|
150 |
+
[TensorFlow](https://www.tensorflow.org/) is still
|
151 |
+
perfectly good. If you have a specific reason to prefer it, you are
|
152 |
+
still going to have a good time.
|
153 |
+
|
154 |
+
[Jax](https://github.com/google/jax) is a more recent
|
155 |
+
project from Google that is not specific to deep learning. It provides
|
156 |
+
general vectorization, auto-differentiation, and compilation to GPU/TPU
|
157 |
+
code. For deep learning, there are separate frameworks like
|
158 |
+
[Flax](https://github.com/google/flax) and
|
159 |
+
[Haiku](https://github.com/deepmind/dm-haiku). You should
|
160 |
+
only use Jax for a specific need.
|
161 |
+
|
162 |
+
### Meta-Frameworks and Model Zoos
|
163 |
+
|
164 |
+
Most of the time, you will start with at least a model architecture that
|
165 |
+
someone has developed or published. You will use a specific architecture
|
166 |
+
(trained on specific data with pre-trained weights) on a model hub.
|
167 |
+
|
168 |
+
- [ONNX](https://onnx.ai/) is an open standard for
|
169 |
+
saving deep learning models and lets you convert from one type of
|
170 |
+
format to another. It can work well but can also run into some
|
171 |
+
edge cases.
|
172 |
+
|
173 |
+
- [HuggingFace](https://huggingface.co/) has become an
|
174 |
+
absolutely stellar repository of models. It started with NLP tasks
|
175 |
+
but has then expanded into all kinds of tasks (audio
|
176 |
+
classification, image classification, object detection, etc.).
|
177 |
+
There are 60,000 pre-trained models for all these tasks. There is
|
178 |
+
a Transformers library that works with PyTorch, TensorFlow, and
|
179 |
+
Jax. There are 7,500 datasets uploaded by people. There's also a
|
180 |
+
community aspect to it with a Q&A forum.
|
181 |
+
|
182 |
+
- [TIMM](https://github.com/rwightman/pytorch-image-models)
|
183 |
+
is a collection of state-of-the-art computer vision models and
|
184 |
+
related code that looks cool.
|
185 |
+
|
186 |
+
## 4 - Distributed Training
|
187 |
+
|
188 |
+
![](./media/image9.png)
|
189 |
+
|
190 |
+
|
191 |
+
Let's say we have multiple machines represented by little squares above
|
192 |
+
(with multiple GPUs in each machine). You are sending batches of data to
|
193 |
+
be processed by a model with parameters. The data batch can fit on a
|
194 |
+
single GPU or not. The model parameters can fit on a single GPU or not.
|
195 |
+
|
196 |
+
The best case is that both your data batch and model parameters fit on a
|
197 |
+
single GPU. That's called **trivial parallelism**. You can either launch
|
198 |
+
more independent experiments on other GPUs/machines or increase the
|
199 |
+
batch size until it no longer fits on one GPU.
|
200 |
+
|
201 |
+
### Data Parallelism
|
202 |
+
|
203 |
+
If your model still fits on a single GPU, but your data no longer does,
|
204 |
+
you have to try out **data parallelism** - which lets you distribute a
|
205 |
+
single batch of data across GPUs and average gradients that are computed
|
206 |
+
by the model across GPUs. A lot of model development work is cross-GPU,
|
207 |
+
so you want to ensure that GPUs have fast interconnects.
|
208 |
+
|
209 |
+
If you are using a server card, expect [a linear
|
210 |
+
speedup](https://lambdalabs.com/blog/best-gpu-2022-sofar/)
|
211 |
+
in training time. If you are using a consumer card, expect [a sublinear
|
212 |
+
speedup](https://lambdalabs.com/blog/titan-v-deep-learning-benchmarks/)
|
213 |
+
instead.
|
214 |
+
|
215 |
+
Data parallelism is implemented in PyTorch with the robust
|
216 |
+
[DistributedDataParallel
|
217 |
+
library](https://pytorch.org/tutorials/intermediate/ddp_tutorial.html).
|
218 |
+
[Horovod](https://github.com/horovod/horovod) is another
|
219 |
+
3rd-party library option. PyTorch Lightning makes it dead simple to use
|
220 |
+
either of these two libraries - where [speedup seems to be the
|
221 |
+
same](https://www.reddit.com/r/MachineLearning/comments/hmgr9g/d_pytorch_distributeddataparallel_and_horovod/).
|
222 |
+
|
223 |
+
A more advanced scenario is that you can't even fit your model on a
|
224 |
+
single GPU. You have to spread the model over multiple GPUs. There are
|
225 |
+
three solutions to this.
|
226 |
+
|
227 |
+
### Sharded Data-Parallelism
|
228 |
+
|
229 |
+
Sharded data parallelism starts with the question: What exactly takes up
|
230 |
+
GPU memory?
|
231 |
+
|
232 |
+
- The **model parameters** include the floats that make up our model
|
233 |
+
layers.
|
234 |
+
|
235 |
+
- The **gradients** are needed to do back-propagation.
|
236 |
+
|
237 |
+
- The **optimizer states** include statistics about the gradients
|
238 |
+
|
239 |
+
- Finally, you have to send a **batch of data** for model development.
|
240 |
+
|
241 |
+
![](./media/image5.png)
|
242 |
+
|
243 |
+
Sharding is a concept from databases where if you have one source of
|
244 |
+
data, you actually break it into shards of data that live across your
|
245 |
+
distributed system. Microsoft implemented an approach called
|
246 |
+
[ZeRO](https://arxiv.org/pdf/1910.02054.pdf) that shards
|
247 |
+
the optimizer states, the gradients, and the model parameters. **This
|
248 |
+
results in an insane order of magnitude reduction in memory use, which
|
249 |
+
means your batch size can be 10x bigger.** You should [watch the video
|
250 |
+
in this
|
251 |
+
article](https://www.microsoft.com/en-us/research/blog/zero-deepspeed-new-system-optimizations-enable-training-models-with-over-100-billion-parameters/)
|
252 |
+
to see how model parameters are passed around GPUs as computation
|
253 |
+
proceeds.
|
254 |
+
|
255 |
+
Sharded data-parallelism is implemented by Microsoft's
|
256 |
+
[DeepSpeed](https://github.com/microsoft/DeepSpeed)
|
257 |
+
library and Facebook's
|
258 |
+
[FairScale](https://github.com/facebookresearch/fairscale)
|
259 |
+
library, as well as natively by PyTorch. In PyTorch, it's called
|
260 |
+
[Fully-Sharded
|
261 |
+
DataParallel](https://pytorch.org/blog/introducing-pytorch-fully-sharded-data-parallel-api/).
|
262 |
+
With PyTorch Lightning, you can try it for a massive memory reduction
|
263 |
+
without changing the model code.
|
264 |
+
|
265 |
+
This same ZeRO principle can also be applied to a single GPU. You can
|
266 |
+
train a 13B-parameter model on a single V100 (32GB) GPU. Fairscale
|
267 |
+
implements this (called
|
268 |
+
[CPU-offloading](https://fairscale.readthedocs.io/en/stable/deep_dive/offload.html)).
|
269 |
+
|
270 |
+
### Pipelined Model-Parallelism
|
271 |
+
|
272 |
+
**Model parallelism means that you can put each layer of your model on
|
273 |
+
each GPU**. It is trivial to implement natively but results in only one
|
274 |
+
GPU being active at a time. Libraries like DeepSpeed and FairScale make
|
275 |
+
it better by pipelining computation so that the GPUs are fully utilized.
|
276 |
+
You need to tune the amount of pipelining on the batch size to the exact
|
277 |
+
degree of how you will split up the model on the GPU.
|
278 |
+
|
279 |
+
### Tensor-Parallelism
|
280 |
+
|
281 |
+
Tensor parallelism is another approach, which observes that there is
|
282 |
+
nothing special about matrix multiplication that requires the whole
|
283 |
+
matrix to be on one GPU. **You can distribute the matrix over multiple
|
284 |
+
GPUs**. NVIDIA published [the Megatron-LM
|
285 |
+
repo](https://github.com/NVIDIA/Megatron-LM), which does
|
286 |
+
this for the Transformer model.
|
287 |
+
|
288 |
+
You can actually use all of the three techniques mentioned above if you
|
289 |
+
really want to scale a huge model (like a GPT-3 sized language model).
|
290 |
+
Read [this article on the technology behind BLOOM
|
291 |
+
training](https://huggingface.co/blog/bloom-megatron-deepspeed)
|
292 |
+
for a taste.
|
293 |
+
|
294 |
+
![](./media/image6.png)
|
295 |
+
|
296 |
+
|
297 |
+
In conclusion:
|
298 |
+
|
299 |
+
- If your model and data fit on one GPU, that's awesome.
|
300 |
+
|
301 |
+
- If they do not, and you want to speed up training, try
|
302 |
+
DistributedDataParallel.
|
303 |
+
|
304 |
+
- If the model still doesn't fit, try ZeRO-3 or Full-Sharded Data
|
305 |
+
Parallel.
|
306 |
+
|
307 |
+
For more resources to speed up model training, look at [this list
|
308 |
+
compiled by DeepSpeed](https://www.deepspeed.ai/training/),
|
309 |
+
[MosaicML](https://www.mosaicml.com), and
|
310 |
+
[FFCV](https://ffcv.io).
|
311 |
+
|
312 |
+
## 5 - Compute
|
313 |
+
|
314 |
+
![](./media/image14.png)
|
315 |
+
|
316 |
+
|
317 |
+
**Compute** is the next essential ingredient to developing machine
|
318 |
+
learning models and products.
|
319 |
+
|
320 |
+
The compute-intensiveness of models has grown tremendously over the last
|
321 |
+
ten years, as the below charts from
|
322 |
+
[OpenAI](https://openai.com/blog/ai-and-compute/) and
|
323 |
+
[HuggingFace](https://huggingface.co/blog/large-language-models)
|
324 |
+
show.
|
325 |
+
|
326 |
+
![](./media/image1.png)
|
327 |
+
|
328 |
+
|
329 |
+
Recent developments, including models like
|
330 |
+
[GPT-3](https://openai.com/blog/gpt-3-apps/), have
|
331 |
+
accelerated this trend. These models are extremely large and require a
|
332 |
+
large number of petaflops to train.
|
333 |
+
|
334 |
+
### GPUs
|
335 |
+
|
336 |
+
**To effectively train deep learning models**, **GPUs are required.**
|
337 |
+
NVIDIA has been the superior choice for GPU vendors, though Google has
|
338 |
+
introduced TPUs (Tensor Processing Units) that are effective but are
|
339 |
+
only available via Google Cloud. There are three primary considerations
|
340 |
+
when choosing GPUs:
|
341 |
+
|
342 |
+
1. How much data fits on the GPU?
|
343 |
+
|
344 |
+
2. How fast can the GPU crunch through data? To evaluate this, is your
|
345 |
+
data 16-bit or 32-bit? The latter is more resource intensive.
|
346 |
+
|
347 |
+
3. How fast can you communicate between the CPU and the GPU and between
|
348 |
+
GPUs?
|
349 |
+
|
350 |
+
Looking at recent NVIDIA GPUs, it becomes clear that a new
|
351 |
+
high-performing architecture is introduced every few years. There's a
|
352 |
+
difference between these chips, which are licensed for personal use as
|
353 |
+
opposed to corporate use; businesses should only use **server**
|
354 |
+
**cards**.
|
355 |
+
|
356 |
+
![](./media/image8.png)
|
357 |
+
|
358 |
+
|
359 |
+
Two key factors in evaluating GPUs are **RAM** and **Tensor TFlops**.
|
360 |
+
The more RAM, the better the GPU contains large models and datasets.
|
361 |
+
Tensor TFlops are special tensor cores that NVIDIA includes specifically
|
362 |
+
for deep learning operations and can handle more intensive
|
363 |
+
mixed-precision operations. **A tip**: leveraging 16-bit training can
|
364 |
+
effectively double your RAM capacity!
|
365 |
+
|
366 |
+
While these theoretical benchmarks are useful, how do GPUs perform
|
367 |
+
practically? Lambda Labs offers [the best benchmarks
|
368 |
+
here](https://lambdalabs.com/gpu-benchmarks). Their results
|
369 |
+
show that the most recent server-grade NVIDIA GPU (A100) is more than
|
370 |
+
2.5 times faster than the classic V100 GPU. RTX chips also outperform
|
371 |
+
the V100. [AIME is also another source of GPU
|
372 |
+
benchmarks](https://www.aime.info/en/blog/deep-learning-gpu-benchmarks-2021/).
|
373 |
+
|
374 |
+
Cloud services such as Microsoft Azure, Google Cloud Platform, and
|
375 |
+
Amazon Web Services are the default place to buy access to GPUs. Startup
|
376 |
+
cloud providers like
|
377 |
+
[Paperspace](https://www.paperspace.com/),
|
378 |
+
[CoreWeave](https://www.coreweave.com/), and [Lambda
|
379 |
+
Labs](https://lambdalabs.com/) also offer such services.
|
380 |
+
|
381 |
+
### TPUs
|
382 |
+
|
383 |
+
Let's briefly discuss TPUs. There are four generations of TPUs, and the
|
384 |
+
most recent v4 is the fastest possible accelerator for deep learning. V4
|
385 |
+
TPUs are not generally available yet, but **TPUs generally excel at
|
386 |
+
scaling to larger and model sizes**. The below charts compare TPUs to
|
387 |
+
the fastest A100 NVIDIA chip.
|
388 |
+
|
389 |
+
![](./media/image11.png)
|
390 |
+
|
391 |
+
|
392 |
+
It can be overwhelming to compare the cost of cloud access to GPUs, so
|
393 |
+
[we made a tool that solves this
|
394 |
+
problem](https://fullstackdeeplearning.com/cloud-gpus/)!
|
395 |
+
Feel free to contribute to [our repository of Cloud GPU cost
|
396 |
+
metrics](https://github.com/full-stack-deep-learning/website/).
|
397 |
+
The tool has all kinds of nifty features like enabling filters for only
|
398 |
+
the most recent chip models, etc.
|
399 |
+
|
400 |
+
If we [combine the cost metrics with performance
|
401 |
+
metrics](https://github.com/full-stack-deep-learning/website/blob/main/docs/cloud-gpus/benchmark-analysis.ipynb),
|
402 |
+
we find that **the most expensive per hour chips are not the most
|
403 |
+
expensive per experiment!** Case in point: running the same Transformers
|
404 |
+
experiment on 4 V100s costs \$1750 over 72 hours, whereas the same
|
405 |
+
experiment on 4 A100s costs \$250 over only 8 hours. Think carefully
|
406 |
+
about cost and performance based on the model you're trying to train.
|
407 |
+
|
408 |
+
Some helpful heuristics here are:
|
409 |
+
|
410 |
+
1. Use the most expensive per-hour GPU in the least expensive cloud.
|
411 |
+
|
412 |
+
2. Startups (e.g., Paperspace) tend to be cheaper than major cloud
|
413 |
+
providers.
|
414 |
+
|
415 |
+
### On-Prem vs. Cloud
|
416 |
+
|
417 |
+
For **on-prem** use cases, you can build your own pretty easily or opt
|
418 |
+
for a pre-built computer from a company like NVIDIA. You can build a
|
419 |
+
good, quiet PC with 128 GB RAM and 2 RTX 3909s for about \$7000 and set
|
420 |
+
it up in a day. Going beyond this can start to get far more expensive
|
421 |
+
and complicated. Lambda Labs offers a \$60,000 machine with 8 A100s
|
422 |
+
(super fast!). Tim Dettmers offers a great (slightly outdated)
|
423 |
+
perspective on building a machine
|
424 |
+
[here](https://timdettmers.com/2020/09/07/which-gpu-for-deep-learning/).
|
425 |
+
|
426 |
+
Some tips on on-prem vs. cloud use:
|
427 |
+
|
428 |
+
- It can be useful to have your own GPU machine to shift your mindset
|
429 |
+
from minimizing cost to maximizing utility.
|
430 |
+
|
431 |
+
- To truly scale-out experiments, you should probably just use the
|
432 |
+
most expensive machines in the least expensive cloud.
|
433 |
+
|
434 |
+
- TPUs are worth experimenting with for large-scale training, given
|
435 |
+
their performance.
|
436 |
+
|
437 |
+
- Lambda Labs is a sponsor, and we highly encourage looking at them
|
438 |
+
for on-prem and cloud GPU use!
|
439 |
+
|
440 |
+
## 6 - Resource Management
|
441 |
+
|
442 |
+
![](./media/image2.png)
|
443 |
+
|
444 |
+
|
445 |
+
Now that we've talked about raw compute, let's talk about options for
|
446 |
+
**how to manage our compute resources**. Let's say we want to manage a
|
447 |
+
set of experiments. Broadly speaking, we'll need hardware in the form of
|
448 |
+
GPUs, software requirements (e.g., PyTorch version), and data to train
|
449 |
+
on.
|
450 |
+
|
451 |
+
### Solutions
|
452 |
+
|
453 |
+
Leveraging best practices for specifying dependencies (e.g., Poetry,
|
454 |
+
conda, pip-tools) makes the process of spinning up such experiments
|
455 |
+
quick and easy on a single machine.
|
456 |
+
|
457 |
+
If, however, you have a cluster of machines to run experiments on,
|
458 |
+
[SLURM](https://slurm.schedmd.com/documentation.html) is
|
459 |
+
the tried and true solution for workload management that is still widely
|
460 |
+
used.
|
461 |
+
|
462 |
+
For more portability, [Docker](https://www.docker.com/) is
|
463 |
+
a way to package up an entire dependency stack into a lighter-than-a-VM
|
464 |
+
package. [Kubernetes](https://kubernetes.io/) is the most
|
465 |
+
popular way to run many Docker containers on top of a cluster. The OSS
|
466 |
+
[Kubeflow](https://www.kubeflow.org/) project helps manage
|
467 |
+
ML projects that rely on Kubernetes.
|
468 |
+
|
469 |
+
These projects are useful, but they may not be the easiest or best
|
470 |
+
choice. They're great if you already have a cluster up and running, but
|
471 |
+
**how do you actually set up a cluster or compute platform?**
|
472 |
+
|
473 |
+
*Before proceeding, FSDL prefers open source and/or transparently priced
|
474 |
+
products. We discuss tools that fall into these categories, not SaaS
|
475 |
+
with opaque pricing.*
|
476 |
+
|
477 |
+
### Tools
|
478 |
+
|
479 |
+
For practitioners all in on AWS, [AWS
|
480 |
+
Sagemaker](https://aws.amazon.com/sagemaker/) offers a
|
481 |
+
convenient end-to-end solution for building machine learning models,
|
482 |
+
from labeling data to deploying models. Sagemaker has a ton of
|
483 |
+
AWS-specific configuration, which can be a turnoff, but it brings a lot
|
484 |
+
of easy-to-use old school algorithms for training and allows you to BYO
|
485 |
+
algorithms as well. They're also increasing support for PyTorch, though
|
486 |
+
the markup for PyTorch is about 15-20% more expensive.
|
487 |
+
|
488 |
+
[Anyscale](https://www.anyscale.com/) is a company created
|
489 |
+
by the makers of the Berkeley OSS project
|
490 |
+
[Ray](https://github.com/ray-project/ray). Anyscale
|
491 |
+
recently launched [Ray
|
492 |
+
Train](https://docs.ray.io/en/latest/train/train.html),
|
493 |
+
which they claim is faster than Sagemaker with a similar value
|
494 |
+
proposition. Anyscale makes it really easy to provision a compute
|
495 |
+
cluster, but it's considerably more expensive than alternatives.
|
496 |
+
|
497 |
+
[Grid.ai](https://www.grid.ai/) is created by the PyTorch
|
498 |
+
Lightning creators. Grid allows you to specify what compute parameters
|
499 |
+
to use easily with "grid run" followed by the types of compute and
|
500 |
+
options you want. You can use their instances or AWS under the hood.
|
501 |
+
Grid has an uncertain future, as its future compatibility with Lightning
|
502 |
+
(given their rebrand) has not been clarified.
|
503 |
+
|
504 |
+
There are several non-ML options for spinning up compute too! Writing
|
505 |
+
your own scripts, using various libraries, or even Kubernetes are all
|
506 |
+
options. This route is harder.
|
507 |
+
|
508 |
+
[Determined.AI](https://determined.ai/) is an OSS solution
|
509 |
+
for managing on-prem and cloud clusters. They offer cluster management,
|
510 |
+
distributed training, and more. It's pretty easy to use and is in active
|
511 |
+
development.
|
512 |
+
|
513 |
+
With all this said, **there is still room to improve the ease of
|
514 |
+
experience for launching training on many cloud providers**.
|
515 |
+
|
516 |
+
## 7 - Experiment and Model Management
|
517 |
+
|
518 |
+
![](./media/image4.png)
|
519 |
+
|
520 |
+
|
521 |
+
In contrast to compute, **experiment management is quite close to being
|
522 |
+
solved**. Experiment management refers to tools and processes that help
|
523 |
+
us keep track of code, model parameters, and data sets that are iterated
|
524 |
+
on during the model development lifecycle. Such tools are essential to
|
525 |
+
effective model development. There are several solutions here:
|
526 |
+
|
527 |
+
- [TensorBoard](https://www.tensorflow.org/tensorboard):
|
528 |
+
A non-exclusive Google solution effective at one-off experiment
|
529 |
+
tracking. It is difficult to manage many experiments.
|
530 |
+
|
531 |
+
- [MLflow](https://mlflow.org/): A non-exclusive
|
532 |
+
Databricks project that includes model packaging and more, in
|
533 |
+
addition to experiment management. It must be self-hosted.
|
534 |
+
|
535 |
+
- [Weights and Biases](https://wandb.ai/site): An
|
536 |
+
easy-to-use solution that is free for personal and academic projects! Logging
|
537 |
+
starts simply with an "experiment config" command.
|
538 |
+
|
539 |
+
- Other options include [Neptune
|
540 |
+
AI](https://neptune.ai/), [Comet
|
541 |
+
ML](https://www.comet.ml/), and [Determined
|
542 |
+
AI](https://determined.ai/), all of which have solid
|
543 |
+
experiment tracking options.
|
544 |
+
|
545 |
+
Many of these platforms also offer **intelligent hyperparameter
|
546 |
+
optimization**, which allows us to control the cost of searching for the
|
547 |
+
right parameters for a model. For example, Weights and Biases has a
|
548 |
+
product called [Sweeps](https://wandb.ai/site/sweeps) that
|
549 |
+
helps with hyperparameter optimization. It's best to have it as part of
|
550 |
+
your regular ML training tool; there's no need for a dedicated tool.
|
551 |
+
|
552 |
+
## 8 - "All-In-One"
|
553 |
+
|
554 |
+
![](./media/image13.png)
|
555 |
+
|
556 |
+
|
557 |
+
There are machine learning infrastructure solutions that offer
|
558 |
+
everything\--training, experiment tracking, scaling out, deployment,
|
559 |
+
etc. These "all-in-one" platforms simplify things but don't come cheap!
|
560 |
+
Examples include [Gradient by
|
561 |
+
Paperspace](https://www.paperspace.com/gradient), [Domino
|
562 |
+
Data Lab](https://www.dominodatalab.com/), [AWS
|
563 |
+
Sagemaker](https://aws.amazon.com/sagemaker/), etc.
|
documents/lecture-02.srt
ADDED
@@ -0,0 +1,256 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
1
|
2 |
+
00:00:00,399 --> 00:00:49,360
|
3 |
+
hi everyone welcome to week two of full stack deep learning 2022. today we have a lecture on development infrastructure and tooling my name is sergey and i have my assistant mishka right here so just diving right in the dream of machine learning development is that you provide a project spec identify birds maybe some sample data here's what the birds look like here's what i want to see and then you get a continually improving prediction system and it's deployed at scale but the reality is that it's not just some sample data you really have to find the data aggregated process it clean it label it then you have to find the model architecture potentially the pre-trained weights then you still have to look at the model code
|
4 |
+
|
5 |
+
2
|
6 |
+
00:00:46,480 --> 00:01:32,640
|
7 |
+
probably edit it debug it run training experiments review the results that's going to feed back into maybe trying a new architecture debugging some more code and then when that's done you can actually deploy the model and then after you deploy it you have to monitor the predictions and then you close the data flywheel loop basically your user is generating fresh data for you that that you then have to add to your data set so this reality has roughly kind of three components and we divided into data and read this development in yellow and deployment in green and there are a lot of tools like the infrastructure landscape is pretty large so we have three lectures to cover all of it and today we're going to concentrate on
|
8 |
+
|
9 |
+
3
|
10 |
+
00:01:30,240 --> 00:02:16,239
|
11 |
+
the development part the middle part which is probably what you're familiar with from previous courses most of what you do is model development we actually want to start even a little bit before that and talk about software engineering you know it starts with maybe the programming language and for machine learning it's pretty clear it has to be python and the reason is because of all the libraries that have been developed for it it's just the winner in scientific and data computing there have been some contenders so julia is actually the the ju in jupiter jupiter notebooks to write python code you need an editor you can be old school and use vim or emacs a lot of people just write in jupyter notebooks or jupyter lab which
|
12 |
+
|
13 |
+
4
|
14 |
+
00:02:14,000 --> 00:03:06,400
|
15 |
+
also gives you a code editor window vs code is a very popular text editor python specific code editor pycharm is is really good as well at fsdl we recommend vs code it has a lot of nice stuff it hasn't built you know in addition to the nice editing features it has built-in git version control so you can see your commit you can actually stage line by line you can look at documentation as you write your code you can open projects remotely so like the window i'm showing here is actually on a remote machine that i've sshed into you can lend code as you write and if you haven't seen linters before it's basically this idea that if there are code style rules that you want to follow like a certain number of spaces
|
16 |
+
|
17 |
+
5
|
18 |
+
00:03:04,959 --> 00:03:49,280
|
19 |
+
for indentation whatever you decide you want to do gotta you should just codify it so that you don't ever have to think about it or manually put that in your tools just do it for you and you've run something that just looks at your code all the time you can do a little bit of static analysis so for example there's two commas in a row it's not going to run in this file or potentially you're using a variable that never got defined and in addition python now has type hints so you can actually say you know this variable is supposed to be an integer and then if you use it as an argument to a function that expects expect to float a static type checker can catch and tell you about it before you actually run it so we set
|
20 |
+
|
21 |
+
6
|
22 |
+
00:03:46,799 --> 00:04:35,040
|
23 |
+
that all up in the lab by the way and you will see how that works it's a very nice part of the lab a lot of people develop in jupiter notebooks and they're really fundamental to data science and i think for good reason i think it's a great kind of first draft of a project you just open up this notebook and you start coding there's very little thought that you have to put in before you start coding and start seeing immediate output so that kind of like fast feedback cycle that's really great and jeremy howard is a great practitioner so if you watch the fast ai course videos you'll see him use them to their full extent they do have problems though for example the editor that you use in the notebook is pretty primitive right
|
24 |
+
|
25 |
+
7
|
26 |
+
00:04:32,960 --> 00:05:23,039
|
27 |
+
there's no refactoring support there's no maybe peaking of the documentation there's no copilot which i have now got used to in vs code there's out of order execution artifact so if you've run the cells in a different order you might not get the same result as if you ran them all in line it's hard to version them you either strip out the output of each cell in which case you lose some of the benefit because sometimes you want to save the artifact that you produced in the notebook or the file is pretty large and keeps changing and it's hard to test because it's just not very amenable to like the unit testing frameworks and and and best practices that people have built up counterpoint to everything i just said
|
28 |
+
|
29 |
+
8
|
30 |
+
00:05:20,400 --> 00:06:10,560
|
31 |
+
is that you can kind of fix all of that and that's what jeremy howard is trying to do with nbdev which is this package that lets you write documentation your code and test for the code all in a notebook the full site deep learning recommendation is go ahead and use notebooks actually use the vs code built-in notebook support so i actually don't i'm not in the browser ever i'm just in in my vs code but i'm coding in a notebook style but also i usually write code in a module that then gets imported into a notebook and with this live reload extension it's quite nice because when you change code in the module and rerun the notebook that it gets the updated code and also you have nice things like you
|
32 |
+
|
33 |
+
9
|
34 |
+
00:06:08,960 --> 00:06:55,520
|
35 |
+
have a terminal you can look at files and so on and by the way it enables really awesome debugging so if you want to debug some code you can put a breakpoint here on the right you see the little red dot and then i'm about to launch the cell with the debug cell command and it'll drop me in into the debugger at that break point and so this is just really nice without leaving the editor i'm able to to do a lot notebooks are great sometimes you want something a little more interactive maybe something you can share with the world and streamlit has come along and let you just decorate python codes you write a python script you decorate it with widgets and data loaders and stuff and you can get interactive applets
|
36 |
+
|
37 |
+
10
|
38 |
+
00:06:53,599 --> 00:07:45,120
|
39 |
+
where people can let's say a variable can be controlled by a slider and everything just gets rerun very efficiently and then when you're happy with your applet you can publish it to the web and just share that streamlet address with your audience it's really quite great for setting up the python environment it can actually be pretty tricky so for deep learning usually you have a gpu and the gpu needs cuda libraries and python has a version and then each of the requirements that you use like pytorch or numpy have their own specific version also some requirements are for production like torch but some are only for development for example black is a code styling tool where my pi is a static analysis tool and it'd be nice to
|
40 |
+
|
41 |
+
11
|
42 |
+
00:07:41,440 --> 00:08:34,080
|
43 |
+
just separate the two so we can achieve all these desired things by specifying python and cuda versions in environment.yaml file and use conda to install the python and the cuda version that we specified but then all the other requirements we specify in with basically just very minimal constraints so we say like torch version greater than 1.7 or maybe no constraints like numpy any version and then we use this tool called pip tools that will analyze the constraints we gave and the constraints they might have for each other and find a mutually compatible version of all the requirements and then locks it so that when you come back to the project you have exactly the versions of everything you used
|
44 |
+
|
45 |
+
12
|
46 |
+
00:08:32,640 --> 00:09:25,040
|
47 |
+
and we can also just use a make file to simplify this now we do this in lab so you'll see this in lab and on that note please go through labs one through three they're already out and starts with an overview of what the labs are going to be about then pi torch lightning and pytorch and then we go through cnns transformers and we see a lot of the structure that i've been talking about so that is it for software engineering and the next thing i want to talk about are specifically deep learning frameworks and distributed training so why do we need frameworks well deep learning is actually not a lot of code if you have a matrix math library like numpy now fast.ai course does this pretty brilliantly they they basically have you
|
48 |
+
|
49 |
+
13
|
50 |
+
00:09:23,120 --> 00:10:12,560
|
51 |
+
build your own deep learning library and and you see how very little code it is but when you have to deploy stuff onto cuda for gpu power deep learning and when you have to consider that you might be writing weird layers that have to you have to figure out the differentiation of the layers that you write that can get to be just a lot to maintain and so and then also there's all the layer types that have been published in the literature like the convolutional layers there's all the different optimizers so there's just a lot of code and for that you really need a framework so which framework should you use right well i think josh answered this you know pretty concisely about a year ago and you said jax is for researchers pi
|
52 |
+
|
53 |
+
14
|
54 |
+
00:10:10,480 --> 00:11:00,880
|
55 |
+
torches for engineers and tensorflows for boomers so pytorch is the full stack deep learning choice but seriously though you know both pytorch and tensorflow and jaxx they all are similar you define a deep learning model by running python code writing and running python code and then what you get is an optimized execution graph that can target cpus gpus tpus mobile deployments now the reason you might prefer pytorch is because it just basically is absolutely dominant right so if you look at the number of models trained models that are shared on hugging face which is like the largest model zoo we'll talk about it in a few minutes you know there's models that are both pi torch and tensorflow there's some models
|
56 |
+
|
57 |
+
15
|
58 |
+
00:10:59,600 --> 00:11:50,240
|
59 |
+
on jacks there's some models for tensorflow only there's a lot of models that are just for pi torch if you track paper submissions to academic conferences it's about 75 plus percent pi torch implementations of these research papers and my face is blocking the stat but it's something like 75 percent of machine learning competition winners used pytorch in 2022 now tensorflow is kind of cool tensorflow.js in particular lets you run deep learning models in your browser and pytorch doesn't have that and then keras as a development experience is i think pretty unmatched for just stacking together layers easily training the model and then there's jax which you might have heard about so jack's you know the main thing is you
|
60 |
+
|
61 |
+
16
|
62 |
+
00:11:48,800 --> 00:12:37,200
|
63 |
+
need a meta framework for deep learning we'll talk about in a second but pytorch that's the pick excellent dev experience it's people used to say well maybe it's a little slow but it really is production ready even as is but you can make it even faster by compiling your model with a torch script there's a great distributed training ecosystem there's libraries for vision audio 3d data you know etc there's mobile deployment targets and with pytorch lightning which is what we use in labs have a nice structure for how to kind of where do you put your actual model code where you put your optimizer code where do you put your training code your evaluation code how should the data loaders look like and and then what you get is if you just
|
64 |
+
|
65 |
+
17
|
66 |
+
00:12:34,959 --> 00:13:26,800
|
67 |
+
kind of structure your code as pytorch lightning expects it you can run your code on cpu or gpu or any number of gpus or tpus with just you know a few characters change in your code there's a performance profiler there's model checkpointing there's 16-bit precision there's distributed training libraries it's just all very nice to use now another possibility is fast ai software which is developed alongside the fastai cores and it provides a lot of advanced tricks like data augmentations better weight initializations learning grade schedulers it has this kind of modular structure where there's data blocks and learners and then even vision text tabular applications the main problem with it that i see is
|
68 |
+
|
69 |
+
18
|
70 |
+
00:13:24,399 --> 00:14:20,560
|
71 |
+
the code style is quite different and in general it's it's a bit different than than mainstream pie torch it can be very powerful if you go in on it at fsdl we recommend pytorch lightning tensorflow is not just for boomers right fsdl prefers pi torch because we think it's a stronger ecosystem but tensorflow is still perfectly good and if you have a specific reason to prefer it such as that's what your employer uses you're gonna have a good time it still makes sense it's not bad jax is a recent a more recent project from google which is really not specific to deep learning it's about just general vectorization of all kinds of code and also auto differentiation of all kinds of code including your physics simulations
|
72 |
+
|
73 |
+
19
|
74 |
+
00:14:19,040 --> 00:15:03,440
|
75 |
+
stuff like that and then whatever you can express in jax gets compiled to gpu or tpu code and super fast for deep learning there are separate frameworks like flax or haiku and you know here at fsdl we say use it if you have a specific need maybe you're doing research on something kind of weird that's fine or you know potentially you're working at google you're not allowed to use pytorch that could make it a pretty good reason to use jacks there's also this notion of meta frameworks and model zoos that i want to cover so model zooz is the idea that sure you can just start with blank pi torch but most of the time you're going to start with at least a model architecture that someone's developed and published
|
76 |
+
|
77 |
+
20
|
78 |
+
00:15:02,320 --> 00:15:49,519
|
79 |
+
and a lot of the time you're going to start with actually a pre-trained model meaning someone trained the architecture on specific data they got weights that they then saved and uploaded to a hub and you can download and actually start not from scratch but from a pre-trained model onyx is this idea that deep learning models are all about the same right like we know what an mlp type of layer is we know what a cnn type of layer is and it doesn't matter if it's written in pytorch or tensorflow or cafe whatever it's written in we should be able to actually port it between the different code bases because the real thing that we're that we care about are the weights and the weights are just numbers right so onyx is this format that lets you
|
80 |
+
|
81 |
+
21
|
82 |
+
00:15:47,920 --> 00:16:39,279
|
83 |
+
convert from pytorch to tensorflow and vice versa and it can work super well it can also not work super well you can run into some edge cases so if it's something that you need to do then definitely worth a try but it's not necessarily going to work for all types of models hugging face has become an absolutely stellar repository of models starting with nlp but have since expanded to all kinds of tasks audio classification image classification object detection there's sixty thousand pre-trained models for all these tasks there is a specific library of transformers that works with pytorch tensorflow jacks also 7.5 000 data sets that people have uploaded there's also a lot more to it it's worth checking out you can host your model for
|
84 |
+
|
85 |
+
22
|
86 |
+
00:16:36,720 --> 00:17:31,679
|
87 |
+
inference and there's there's community aspects to it so it's a great resource another great resource specifically for vision is called tim state of the art computer vision models can be found on tim just search tim github next up let's talk about distributed training so the scenarios are we have multiple machines represented by little squares here with multiple gpus on each machine and you are sending batches of data to be processed by a model that has parameters right and the data batch can fit on a single gpu or potentially not fit on a single gpu and the model parameters can fit in a single gpu or potentially not fit in a single gpu so let's say the best case the easiest case is your batch of data fits on a single gpu
|
88 |
+
|
89 |
+
23
|
90 |
+
00:17:30,320 --> 00:18:21,280
|
91 |
+
your model parameters fit on a single gpu and that's really called trivial parallelism you can launch independent experiments on other gpus so maybe do a hyper parameter search or potentially you increase your batch size until it can no longer fit on one gpu and then you have to figure something else out and but then yeah what you have to then figure out is okay well my model still fits on a single gpu but my data no longer fits on a single gpu so now i have to go and do something different and what that different thing is usually is data parallelism it lets you distribute a single batch of data across gpus and then average gradients that are computed by the model across all the gpus so it's the same model on each gpu but
|
92 |
+
|
93 |
+
24
|
94 |
+
00:18:18,880 --> 00:19:10,960
|
95 |
+
different batches of data because a lot of this work is cross gpu we have to make sure that the gpus have fast interconnect right so gpu is connected usually through a pci interface to the computer but it and so if there's no other connection then all the data has to flow through the pci bus all the time it's possible that there is a faster interconnect like nv link between the gpus and then the data can leave the pci bus alone and just go straight across the the fast interconnect and the speed up you can expect is if you are using server cards like a100s a6000s you know v100s it's basically a linear speed up for data parallelism which is really cool if you're using consumer cards like 2080s or 3080s we'll talk about it a
|
96 |
+
|
97 |
+
25
|
98 |
+
00:19:08,720 --> 00:19:59,919
|
99 |
+
little further down then unfortunately it's going to be a sublinear speed up so maybe if you have four gpus it'll be like a 3x speed up if you have a gpus maybe a 5x speed up and that's due to the the fact that the consumer cards don't have as fast of an interconnect so data parallelism is implemented in pi torch in the distributed data parallel library there's also a third-party library called horovod and you can use either one super simply using pytorch lightning you basically say what's your strategy if you don't say anything then it's single gpu but if your strategy is ddp then it uses the python distributed data parallel if you use strategy horovod then it uses horivon it seems like the speedup's basically
|
100 |
+
|
101 |
+
26
|
102 |
+
00:19:58,160 --> 00:20:48,640
|
103 |
+
about the same there's no real reason to use horowat over distributed data parallel but it might make it easier for a specific case that you might have so it's good to know about but the first thing to try is just distributed data parallel now we come to a more advanced scenario which is now we can't even fit our model our model is so large it has billions of parameters it doesn't actually fit on a single gpu so we have to spread the model not just the data over multiple gpus and there's three solutions to this so sharded data parallelism starts with the question what exactly is in the gpu memory what is taking up the gpu memory so okay we have the model parameters the floats that make up our actual
|
104 |
+
|
105 |
+
27
|
106 |
+
00:20:47,360 --> 00:21:40,400
|
107 |
+
layers we have the gradients we need to know about the gradients because that's what we average to do our backdrop but we also have optimizer states and that's actually a lot of data for the atom optimizer that's probably the most often used optimizer today it has to be statistics about the gradients basically and in addition if you're doing kind of float 16 training then your model parameters gradients might be float 16 but the optimizer will keep a copy of them as float32 as well so it can be a lot more data and then plus of course you send your batch of data so all of this has to fit on a gpu but does it actually have to fit on every gpu is the question so the baseline that we have is yeah let's send all of this stuff to each gpu
|
108 |
+
|
109 |
+
28
|
110 |
+
00:21:37,840 --> 00:22:33,440
|
111 |
+
and that might take up like 129 gigabytes of data in this in this example this is from the paper called zero optimization storage training trillion parameter models okay so what if we shard the optimizer states sharding is a concept from databases where if you have one source of data you actually break it up into shards of data such that across your distributed system each part of your each node only sees a shard a single shard of the data so here the first thing we can try is we can shard the optimizer states each gpu doesn't have to have all the optimizer state it just has to have its little shard of it we can do the same for gradients and that's called zero two and then pretty crazily we can also do it for the
|
112 |
+
|
113 |
+
29
|
114 |
+
00:22:31,520 --> 00:23:19,840
|
115 |
+
model parameters themselves and that's called zero three and that can result in a pretty insane order of magnitude reduction in memory use which means that your batch size can be 10 times bigger i recommend watching this helpful video that i have linked but you literally pass around the model params between the gpus as computation is proceeding so here we see four gpus four chunks of data entering the gpus and what happened is gpu zero had the model parameters for that first part of the model and it communicated these parameters to the other three gpus and then they did their computation and once they were complete with that computation the other gpus can actually delete the parameters for those first
|
116 |
+
|
117 |
+
30
|
118 |
+
00:23:18,559 --> 00:24:05,440
|
119 |
+
four layers and then gpu one has the parameters for the next four layers and it broadcasts them to the other three gpus who are now able to do the next four layers of computation and that's just in the forward pass then you do the same with gradients and optimizer states in the backward pass this is a lot to implement thankfully we don't have to do it it's implemented by the deep speed library from microsoft and the fair scale library from facebook and recently actually also implemented natively by pytorch so in pytorch it's called fully sharded data parallel instead of zero three and with pytorch lightning you can actually try sharded ddp with just a tiny bit of a change try it see if you see a massive memory
|
120 |
+
|
121 |
+
31
|
122 |
+
00:24:04,400 --> 00:24:54,880
|
123 |
+
reduction that can correspond to a speed up in your training now the same idea the zero three principle right is that the gpu only needs the model frames it needs in the moment for the computation it's doing at this moment the same principle can be applied to just a single gpu you can get a 13 billion parameters onto the gpu and you can train a 13 billion parameter model on a single v100 which doesn't even fit it natively and fair scale also implements this and calls it cpu offloading there's a couple more solutions model parallelism take your model your model let's say has three layers and you have three gpus you can put each layer on a gpu right and in pytorch you can just implement it very trivially but the
|
124 |
+
|
125 |
+
32
|
126 |
+
00:24:52,960 --> 00:25:41,840
|
127 |
+
problem is that only one gpu will be active at a given time so the trick here is that and once again implemented by libraries like deep speed and fair scale they make it better so they pipeline this kind of computation so that gpus are mostly fully utilized although you need to tune the amount of pipelining on the batch size and exactly how you're going to split up the model into the gpus so this isn't as much of fire and forget solution like like sharded data parallel and another solution is tensor parallelism which basically is observing that there's nothing special about a matrix multiplication that requires the whole matrix to be on one gpu you can distribute the matrix over gpus so megatron lm is a repository from
|
128 |
+
|
129 |
+
33
|
130 |
+
00:25:39,279 --> 00:26:34,960
|
131 |
+
nvidia which did this for the transformer model and is widely used so you can actually use all of these if you really need to scale and the model that really needs to scale is a gpt3 three-sized language model such as bloom which recently finished training so they used zero data parallelism tensor parallelism pipeline parallelism in addition to some other stuff and they called it 3d parallelism but they also write that since they started their endeavor the the zero stage three performance has dramatically improved and if they were to start over again today maybe they would just do sharded data parallel and that would just be enough so in conclusion you know if your model and data fits on one gpu that's awesome
|
132 |
+
|
133 |
+
34
|
134 |
+
00:26:32,799 --> 00:27:21,919
|
135 |
+
if it doesn't or you want to speed up training then you can distribute over gpus with distributed data parallel if the model still doesn't fit you should try zero three or fully shared data parallel there's other ways to speed up there's 16 bit training there's maybe some special you know fast kernels for different types of layers like transformers you can maybe try sparse attention instead of normal dense attention so there's other things that these libraries like deep speed and fair skill implement that you can try and there's even more tricks that you could try for example for nlp there's this position encoding step you can use something called alibi which scales to basically all length of sequences
|
136 |
+
|
137 |
+
35
|
138 |
+
00:27:20,480 --> 00:28:09,600
|
139 |
+
so you can actually train on shorter sequences and use this trick called sequence length warm up where you train on shorter sequences and then you increase the size and because you're using alibi it should not mess up your position and then for vision you can also use a size warm up by progressively increasing the size of the image you can use special optimizers and these tricks are implemented by a library called mosaic ml composer and they report some pretty cool speed ups and it's pretty easy to implement and they also have a cool web tool i'm a fan of these things that basically lets you see the efficient frontier for training models time versus cost kind of fun to play around with this mosaic ml explorer
|
140 |
+
|
141 |
+
36
|
142 |
+
00:28:08,000 --> 00:29:01,039
|
143 |
+
there's also some research libraries like ffcv which actually try to optimize the data flow there are some simple tricks you can maybe do that speed it up a lot these things will probably find their way into mainstream pie torch eventually but it's worth giving this a try especially if again you're training on vision models the next thing we're going to talk about is compute that we need for deep learning i'm sure you've seen plots like this from open ai this is up through 2019 showing on a log scale just how many times the compute needs for the top performing models have grown and this goes even further into 2022 with the large language models like gpt3 they're just incredibly large and required an incredible amount of
|
144 |
+
|
145 |
+
37
|
146 |
+
00:28:58,720 --> 00:29:53,279
|
147 |
+
pedoflops to train so basically nvidia is the only choice for deep learning gpus and recently google tpus have been made available in the gcp cloud and they're also very nice and the three main factors that we need to think about when it comes to gpus are how much data can you transfer to the gpu then how fast can you crunch through that data and that actually depends on whether the data is 32-bit or 16-bit and then how fast can you communicate between the cpu and the gpu and between gpus we can look at some landmark nvidia gpus so the first thing we might notice is that there's a basically a new architecture every year every couple of years it went from kepler with the k80 and k40 cards in 2014
|
148 |
+
|
149 |
+
38
|
150 |
+
00:29:51,520 --> 00:30:44,480
|
151 |
+
up through ampere from 2020 on some cards are for server use some cards are for consumer use if you're doing stuff for business you're only supposed to use the server cards the ram that the gpu has allows you to fit a large model and a meaningful batch of data on the gpu so the more ram the better these are this is like kind of how much data can you crunch through in a unit time and there's also i have a column for tensor t flops are special tensor cores that nvidia specifically intends for deep learning operations which are mixed precision float32 and float16 these are much higher than just straight 32-bit teraflops if you use 16-bit training you effectively double or so your rain capacity we looked at the teraflops these are
|
152 |
+
|
153 |
+
39
|
154 |
+
00:30:42,720 --> 00:31:46,960
|
155 |
+
theoretical numbers but how do they actually benchmark lame the labs is probably the best source of benchmark data and here they show relative to the v100 single gpu how do the different gpus compare so one thing we might notice is the a100 which is the most recent gpu that's the server grade is over 2.5 faster than v100 you'll notice there's a couple of different a100s the pcie versus sxm4 refers to how fast you can get the data onto the gpu and the 40 gig versus 80 gig refers to how much data can fit on the gpu also recently there's rtx a 4000 5000 6000 and so on cards and the a40 and these are all better than the v100 another source of benchmarks is aime they show you time for resnet50 model to go through
|
156 |
+
|
157 |
+
40
|
158 |
+
00:31:44,240 --> 00:32:42,720
|
159 |
+
1.4 images in imagenet the configuration of four a100s versus four v100s is three times faster in in flow 32 and only one and a half times faster in float 16. there's a lot more stuff you can notice but that's what i wanted to highlight and we could buy some of these gpus we could also use them in the cloud so amazon web services google cloud platform microsoft azure are all the heavyweight cloud providers google cloud platform out of the three is special because it also has tpus and the startup cloud providers are lame the labs paper space core weave data crunch jarvis and others so briefly about tpus so there's four versions of them four generations the tpu v4 are the most recent ones and they're
|
160 |
+
|
161 |
+
41
|
162 |
+
00:32:40,480 --> 00:33:36,960
|
163 |
+
just the fastest possible accelerator for deep learning this graphic shows speed ups over a100 which is the fastest nvidia accelerator but the v4s are not quite in general availability yet the v3s are still super fast and they excel at scaling so if you use if you have to train such a large model that you use multiple nodes multiple and all the cores in the tpu then this can be quite fast each tpu has 128 gigs of ram so there's a lot of different clouds and it's a little bit overwhelming to actually compare prices so we built a tool for cloud comparison cloud gpu comparison so we have aws gcp azure lambda labs paper space jarvis labs data crunch and we solicit pull requests so if you know another one like core weave
|
164 |
+
|
165 |
+
42
|
166 |
+
00:33:35,519 --> 00:34:34,639
|
167 |
+
make a pull request to this csv file and then what you can do is you can filter so for example i want to see only the latest generation gpus i want to see only four or eight gpu machines and then maybe particularly i actually don't even want to see the i want to see only the a100s so let's only select the a100s so that narrows it down right so if we want to use that that narrows it down and furthermore maybe i only want to use the 80 gig versions so that narrows it down further and then we can sort by per gpu price or the total price and we can see the properties of the machines right so we know the gpu ram but how many virtual cpus and how much machine ram do these different providers supply to us now let's combine this cost
|
168 |
+
|
169 |
+
43
|
170 |
+
00:34:33,679 --> 00:35:33,119
|
171 |
+
data with benchmark data and what we find is that something that's expensive per hour is not necessarily expensive per experiment using lambda labs benchmarking data if you use the forex v100 machine which is the cheapest per hour and you run an experiment using a transformers model that takes 72 hours it'll cost 1750 to run but if you use the 8x a100 machine it will only take eight hours to run and it'll actually only cost 250 and there's a similar story if you use confnet instead of transformer models less dramatic but still we find that the 8 by a100 machine is both the fastest and the cheapest so that's a little counter-intuitive so i was looking for more benchmarks so here is mosaic ml which i mentioned
|
172 |
+
|
173 |
+
44
|
174 |
+
00:35:30,960 --> 00:36:30,000
|
175 |
+
earlier they're benchmarking the resnet 50 and this is on aws what they find is the 8x a100 machine is one and a half times faster and 15 cheaper than 8x v100 so this is a confident experiment and here's a transformer experiment ept2 model so the 8x a100 machine is twice as fast and 25 cheaper than the adax v100 machine and it's actually three times faster and 30 cheaper than the 8x t4 machine which is a touring generation gpu a good heuristic is use the most expensive per hour gpu which is probably going to be a 4x or 8x a100 in the least expensive cloud and from playing with that cloud gpu table you can convince yourself that the startups are much cheaper than the big boys so here i'm filtering by a100
|
176 |
+
|
177 |
+
45
|
178 |
+
00:36:26,960 --> 00:37:22,560
|
179 |
+
and the per gpu cost on lambda labs is only one dollar and 10 cents per hour and on gcp azure and and aws it's at least you know 3.67 cents but what if you don't want to use the cloud there's two options you could build your own which is i would say easy or you can buy pre-built which is definitely even easier lambda labs builds them and nvidia builds them and then just pc builders like super micro and stuff like that build them you can build a pretty quiet pc with with a lot of ram and let's say you know two 390s or 2080 ti's or something that would maybe be five to eight thousand dollars it take you a day to build it and set it up maybe it's a rite of passage for deep learning practitioners
|
180 |
+
|
181 |
+
46
|
182 |
+
00:37:20,480 --> 00:38:15,680
|
183 |
+
now if you want to go beyond four or 2000 series like 20 80s or two 3000 series like 30 90s that can be painful just because there's a lot of power that they consume and they get hot so pre-built can be better here's a 12 000 machine with two a5000s which each have 24 gigs ram it's going to be incredibly fast or maybe you want 8 gpus now this one is going to be loud you're going to have to put it in some kind of special facility like a colo and actually lame the labs can can stored in their colo for you it'd be maybe sixty thousand dollars for eight a six thousands which is a really really fast server lame the labs also provides actionable advice for selecting specific gpus there is a well known article from tim detmers
|
184 |
+
|
185 |
+
47
|
186 |
+
00:38:13,119 --> 00:38:58,320
|
187 |
+
that is now slightly out of date because there's no ampere cards but it's still good he talks about more than just gpus but also about what cpu to get the ram the recommendations that that i want to give is i think it's it's useful to have your own gpu machine just to shift your mindset from minimizing cost of running in the cloud to maximizing utility of having something that you already paid for and just maximizing how much use you get out of it but to scale out experiments you probably need to enter the cloud and you should use the most expensive machines in the least expensive cloud tpus are worth experimenting with if you're doing large scale training lameda labs is a sponsor of the full-stack deep
|
188 |
+
|
189 |
+
48
|
190 |
+
00:38:56,800 --> 00:39:45,520
|
191 |
+
learning projects that our students are doing this year it's actually an excellent choice for both buying a machine for yourself and it's the least expensive cloud for a100s now that we've talked about compute we can talk about how to manage it so we want to do is we want to launch an experiment or a set of experiments each experiment is going to need a machinery machines with gpu or gpus in the machine it's going to need some kind of setup like a python version cuda version nvidia drivers python requirements like a specific version of pytorch and then it needs a source of data so we could do this manually we could use a workload manager like slurm we could use docker and kubernetes or we could use some software specialized
|
192 |
+
|
193 |
+
49
|
194 |
+
00:39:43,920 --> 00:40:35,520
|
195 |
+
for machine learning if you follow best practices for specifying dependencies like content pip tools that we covered earlier then all you have to do is log into the machine launch an experiment right activate your environment launch the experiment say how many gpus it needs if you however have a cluster of machines then you need to do some more advanced which is probably going to be slurm which is an old-school solution to workload management that's still that's still widely used this is actually a job from the big science effort to train the gpt3 size language model so they have 24 nodes with 64 cpus and 8 gpus on each node slurm is the way that they launched it on their cluster docker is a way to package up an entire
|
196 |
+
|
197 |
+
50
|
198 |
+
00:40:34,079 --> 00:41:22,560
|
199 |
+
dependency stack in in something that's lighter than a full-on virtual machine nvidia docker is also something you'll have to install which let's use gpus and we'll actually use this in lab so we'll talk more about it later kubernetes has kind of emerged as as the best way the most popular way to run many docker containers on top of a cluster cube flow specifically is a project for machine learning both of these are google originated open source projects but they're not controlled by google anymore so with kubeflow you can spawn and manage jupiter notebooks you can manage multi-step workflows it interfaces with pytorch and tensorflow and you can run it on top of google cloud platform or aws or azure or on your own
|
200 |
+
|
201 |
+
51
|
202 |
+
00:41:20,800 --> 00:42:10,400
|
203 |
+
cluster and it can be useful but it's a lot so it could be the right choice for you but we think it probably won't be slarm and kubeflow they make sense if you already have a cluster up and running but how do you even get a cluster up and running in the first place and before we proceed i try not to mention software as a service that doesn't show pricing i find that you know when you go to the website and it says call us or whatever contact us for a demo that's not the right fit for the fsdl community we like to use open source ideally but if it's not open source then at least something that's transparently priced aws sagemaker is a solution you've probably heard about if you've used amazon web services
|
204 |
+
|
205 |
+
52
|
206 |
+
00:42:08,160 --> 00:42:57,040
|
207 |
+
and it's really a set of solutions it's everything from labeling data to launching notebooks to training to deploying your models and even to monitoring them and notebooks are a central paradigm they call it sagemaker studio and sagemaker could totally make sense to adopt if you're already using aws for everything if you're not already using aws for everything it's not such a silver bullet that it's worth adopting necessarily but if you are it's definitely worth a look so for training specifically they have some basically pre-built algorithms and they're quite they're quite old-school but you can also connect any other algorithm yourself it's a little more it's a little more complicated and right away you have to configure a
|
208 |
+
|
209 |
+
53
|
210 |
+
00:42:55,119 --> 00:43:48,880
|
211 |
+
lot of i am you know roles and and security groups and stuff like that it might be overwhelming if all you're trying to do is train a machine learning model that said they do have increasing support for pytorch now notice if you're using sagemaker to launch your python training you're going to be paying about a 15 to 20 markup so there's special sagemaker instances that correspond to normal aws gpu instances but it's more expensive they do have support for using spot instances and so that could make it worth it any scale is a company from the makers of ray which is an open source project from berkeley and recently they released ray train which they claim is faster than sagemaker so the same idea basically lets you
|
212 |
+
|
213 |
+
54
|
214 |
+
00:43:46,800 --> 00:44:40,560
|
215 |
+
scale out your training to many nodes with many gpus but does it faster and it has better spot instance support where if a spot instance gets killed during training it recovers from it intelligently and any scale any scale a software is a service that makes it you know really simple to provision compute with one line of code you can launch a cluster of any size that ease of use comes at a significant markup to amazon web services grid ai is makers of py torch lightning and the the tagline is seamlessly trained hundreds of machine learning models on the cloud with zero code changes if you have some kind of main dot pi method that's going to run your training and that can run on your laptop or on on
|
216 |
+
|
217 |
+
55
|
218 |
+
00:44:38,400 --> 00:45:27,760
|
219 |
+
some local machine you can just scale it out to a grid of instances by prefacing it with grid run and then just saying what kind of instance type how many gpus should i use spot instances and so on and you can also you can use their instances or you can use aws under the hood and then it shows you all the experiments you're running and so on now i'm not totally sure about the long term plans for grid.ai because the makers of python's lightning are also rebranding as lightning.i which has its own pricing so i'm i'm just not totally sure but it's if it sticks around it looks like a really cool solution there's also non-machine learning specific solutions like you don't need sagemaker to provision compute on aws
|
220 |
+
|
221 |
+
56
|
222 |
+
00:45:25,440 --> 00:46:11,200
|
223 |
+
you could just do it in a number of ways that people have been doing you know provisioning aws instances and then uniting them into a cluster you can write your own scripts you can use kubernetes you can use some libraries for spot instances but there's nothing you know we can really recommend that's super easy to use determined ai is a machine learning specific open source solution that lets you manage a cluster either on prem or in the cloud it's cluster management distributed training experiment tracking hyper parameter search a lot of extra stuff it was a startup also from berkeley it got acquired by hp but it's still an active development it's really easy to use you just install determined get a
|
224 |
+
|
225 |
+
57
|
226 |
+
00:46:09,839 --> 00:46:59,920
|
227 |
+
cluster up and running you can also launch it on aws or gcp that said i feel like a truly simple solution to launching training on many cloud instances still doesn't exist so this is an area where i think there's room for a better solution and that cannot be said about experiment management and model management because i think there's great solutions there so what experiment management refers to is you know as we run machine learning experiments we we can lose track of which code parameters data set generated which model when we run multiple experiments that's even more difficult we need to like start making a spreadsheet of all the experiments we ran and the results and so on tensorboard is a solution from google
|
228 |
+
|
229 |
+
58
|
230 |
+
00:46:58,079 --> 00:47:48,640
|
231 |
+
that's not exclusive to tensorflow it gives you this nice set of pages that lets you track your loss and see where your model saved and it's a great solution for single experiments it does get unwieldy to manage many experiments as you get into dozens of experiments ml flow tracking is a solution that is open source it's from data bricks but it's not exclusive to data breaks it's not only for experiment management it's also for model packaging and stuff like that but they do have a robust solution for experiment management you do have to host it yourself weights and biases is a really popular super easy to use solution that is free for public projects and paid for private projects they show you
|
232 |
+
|
233 |
+
59
|
234 |
+
00:47:46,240 --> 00:48:33,119
|
235 |
+
all the experiments you've ever run slice them dice however you want for each experiment they record would you log like your loss but also stuff about your system like how utilized your gpu is which is pretty important to track and you basically just initialize it with your experiment config and then you log anything you want including images and we're actually going to see this in lab 4 which is this week they also have some other stuff like you can host reports and tables is a recent product that lets you slice and dice your data and predictions in really cool ways determine.ai also has an experiment tracking solution which is also perfectly good and there's other solutions too like neptune and comet
|
236 |
+
|
237 |
+
60
|
238 |
+
00:48:30,160 --> 00:49:22,240
|
239 |
+
and a number of others really often we actually want to programmatically launch experiments by doing something that's called hyper parameter optimization so maybe we want to search over learning rates so as we launch our training we don't want to commit to a specific learning rate we basically want to search over learning rates from you know point zero zero zero one to point one it'd be even more awesome if like this was done intelligently where if multiple runs are proceeding in in parallel the ones that aren't going as well as others get stopped early and we get to search over more of the potential hyperparameter space weights and biases has a solution to this that's very pragmatic and easy to
|
240 |
+
|
241 |
+
61
|
242 |
+
00:49:19,599 --> 00:50:10,720
|
243 |
+
use it's called sweeps the way this works is you basically add a yaml file to your project that specifies the parameters you want to search over and how you want to do the search so here on the right you'll see we're using this hyperband algorithm which is a state-of-the-art hyper-parameter optimization algorithm and then you launch agent on whatever machines you control the agent will pull the sweep server for a set of parameters run an experiment report results poll the server for more parameters and keep doing that and there's other solutions this is pretty table stakes kind of thing so sagemaker has hyperparameter search determined ai has hyperparameter search i think of it as just it's a part of your training harness so
|
244 |
+
|
245 |
+
62
|
246 |
+
00:50:09,680 --> 00:51:02,480
|
247 |
+
if you're already using weights and biases just use sweeps from weights and biases if you're already using determine just use hyperparameter search from determined it's not worth using some specialized software for this and lastly there are all-in-one solutions that cover everything from data to development to deployment a single system for everything for development usually a notebook interface scaling a training experiment to many machines provisioning the compute for you tracking experiments versioning models but also deploying models and monitoring performance managing data of really all-in-one each maker is the you know the prototypical solution here but there's some other ones like gradients from paper space so look at
|
248 |
+
|
249 |
+
63
|
250 |
+
00:51:00,960 --> 00:51:48,240
|
251 |
+
look at these features notebooks experiments data sets models and inference or domino data labs you can provision compute you can track the experiments you can deploy a model via a rest api you can monitor the predictions that the the api makes and you can publish little data applets kind of like streamlit you can also monitor spend and you see all the projects in one place domino's meant more for kind of non-deep learning machine learning but i just wanted to show it because it's a nice set of the all-in-one functionality so these all-in-one solutions could be good but before deciding we want to go in on one of them let's wait to learn more about data management and deployment in the weeks ahead
|
252 |
+
|
253 |
+
64
|
254 |
+
00:51:46,480 --> 00:51:52,960
|
255 |
+
and that is it for development infrastructure and tooling thank you
|
256 |
+
|
documents/lecture-03.md
ADDED
@@ -0,0 +1,597 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
---
|
2 |
+
description: Principles for testing software, tools for testing Python code, practices for debugging models and testing ML
|
3 |
+
---
|
4 |
+
|
5 |
+
# Lecture 3: Troubleshooting & Testing
|
6 |
+
|
7 |
+
<div align="center">
|
8 |
+
<iframe width="720" height="405" src="https://www.youtube-nocookie.com/embed/RLemHNAO5Lw?list=PL1T8fO7ArWleMMI8KPJ_5D5XSlovTW_Ur" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
|
9 |
+
</div>
|
10 |
+
|
11 |
+
Lecture by [Charles Frye](https://twitter.com/charles_irl).<br />
|
12 |
+
Notes by [James Le](https://twitter.com/le_james94) and [Vishnu Rachakonda](https://www.linkedin.com/in/vrachakonda/).<br />
|
13 |
+
Published August 22, 2022.
|
14 |
+
[Download slides](https://fsdl.me/2022-lecture-03-slides).
|
15 |
+
|
16 |
+
## 1 - Testing Software
|
17 |
+
|
18 |
+
1. The general approach is that tests will help us ship faster with
|
19 |
+
fewer bugs, but they won't catch all of our bugs.
|
20 |
+
|
21 |
+
2. That means we will use testing tools but won't try to achieve 100%
|
22 |
+
coverage.
|
23 |
+
|
24 |
+
3. Similarly, we will use linting tools to improve the development
|
25 |
+
experience but leave escape valves rather than pedantically
|
26 |
+
following our style guides.
|
27 |
+
|
28 |
+
4. Finally, we'll discuss tools for automating these workflows.
|
29 |
+
|
30 |
+
### 1.1 - Tests Help Us Ship Faster. They Don't Catch All Bugs
|
31 |
+
|
32 |
+
![](./media/image1.png)
|
33 |
+
|
34 |
+
**Tests are code we write that are designed to fail intelligibly when
|
35 |
+
our other code has bugs**. These tests can help catch some bugs before
|
36 |
+
they are merged into the main product, but they can't catch all bugs.
|
37 |
+
The main reason is that test suites are not certificates of correctness.
|
38 |
+
In some formal systems, tests can be proof of code correctness. But we
|
39 |
+
are writing in Python (a loosely goosey language), so all bets are off
|
40 |
+
in terms of code correctness.
|
41 |
+
|
42 |
+
[Nelson Elhage](https://twitter.com/nelhage?lang=en)
|
43 |
+
framed test suites more like classifiers. The classification problem is:
|
44 |
+
does this commit have a bug, or is it okay? The classifier output is
|
45 |
+
whether the tests pass or fail. We can then **treat test suites as a
|
46 |
+
"prediction" of whether there is a bug**, which suggests a different way
|
47 |
+
of designing our test suites.
|
48 |
+
|
49 |
+
When designing classifiers, we need to trade off detection and false
|
50 |
+
alarms. **If we try to catch all possible bugs, we can inadvertently
|
51 |
+
introduce false alarms**. The classic signature of a false alarm is a
|
52 |
+
failed test - followed by a commit that fixes the test rather than the
|
53 |
+
code.
|
54 |
+
|
55 |
+
To avoid introducing too many false alarms, it's useful to ask yourself
|
56 |
+
two questions before adding a test:
|
57 |
+
|
58 |
+
1. Which real bugs will this test catch?
|
59 |
+
|
60 |
+
2. Which false alarms will this test raise?
|
61 |
+
|
62 |
+
If you can think of more examples for the second question than the first
|
63 |
+
one, maybe you should reconsider whether you need this test.
|
64 |
+
|
65 |
+
One caveat is that: **in some settings, correctness is important**.
|
66 |
+
Examples include medical diagnostics/intervention, self-driving
|
67 |
+
vehicles, and banking/finance. A pattern immediately arises here: If you
|
68 |
+
are operating in a high-stakes situation where errors have consequences
|
69 |
+
for people's lives and livelihoods, even if it's not regulated yet, it
|
70 |
+
might be regulated soon. These are examples of **low-feasibility,
|
71 |
+
high-impact ML projects** discussed in the first lecture.
|
72 |
+
|
73 |
+
![](./media/image19.png)
|
74 |
+
|
75 |
+
|
76 |
+
### 1.2 - Use Testing Tools, But Don't Chase Coverage
|
77 |
+
|
78 |
+
- *[Pytest](https://docs.pytest.org/)* is the standard
|
79 |
+
tool for testing Python code. It has a Pythonic implementation and
|
80 |
+
powerful features such as creating separate suites, sharing
|
81 |
+
resources across tests, and running parametrized variations of
|
82 |
+
tests.
|
83 |
+
|
84 |
+
- Pure text docs can't be checked for correctness automatically, so
|
85 |
+
they are hard to maintain or trust. Python has a nice module,
|
86 |
+
[*[doctests]*](https://docs.python.org/3/library/doctest.html),
|
87 |
+
for checking code in the documentation and preventing rot.
|
88 |
+
|
89 |
+
- Notebooks help connect rich media (charts, images, and web pages)
|
90 |
+
with code execution. A cheap and dirty solution to test notebooks
|
91 |
+
is adding some *asserts* and using *nbformat* to run the
|
92 |
+
notebooks.
|
93 |
+
|
94 |
+
![](./media/image17.png)
|
95 |
+
|
96 |
+
|
97 |
+
Once you start adding different types of tests and your codebase grows,
|
98 |
+
you will want coverage tools for recording which code is checked or
|
99 |
+
"covered" by tests. Typically, this is done in lines of code, but some
|
100 |
+
tools can be more fine-grained. We recommend
|
101 |
+
[Codecov](https://about.codecov.io/), which generates nice
|
102 |
+
visualizations you can use to drill down and get a high-level overview
|
103 |
+
of the current state of your testing. Codecov helps you understand your
|
104 |
+
tests and can be incorporated into your testing. You can say you want to
|
105 |
+
reject commits not only where tests fail, but also where test coverage
|
106 |
+
goes down below a certain threshold.
|
107 |
+
|
108 |
+
However, we recommend against that. Personal experience, interviews, and
|
109 |
+
published research suggest that only a small fraction of the tests you
|
110 |
+
write will generate most of your value. **The right tactic,
|
111 |
+
engineering-wise, is to expand the limited engineering effort we have on
|
112 |
+
the highest-impact tests and ensure that those are super high quality**.
|
113 |
+
If you set a coverage target, you will instead write tests in order to
|
114 |
+
meet that coverage target (regardless of their quality). You end up
|
115 |
+
spending more effort to write tests and deal with their low quality.
|
116 |
+
|
117 |
+
![](./media/image16.png)
|
118 |
+
|
119 |
+
|
120 |
+
### 1.3 - Use Linting Tools, But Leave Escape Valves
|
121 |
+
|
122 |
+
**Clean code is of uniform and standard style**.
|
123 |
+
|
124 |
+
1. Uniform style helps avoid spending engineering time on arguments
|
125 |
+
over style in pull requests and code review. It also helps improve
|
126 |
+
the utility of our version control by cutting down on noisy
|
127 |
+
components of diffs and reducing their size. Both benefits make it
|
128 |
+
easier for humans to visually parse the diffs in our version
|
129 |
+
control system and make it easier to build automation around them.
|
130 |
+
|
131 |
+
2. Standard style makes it easier to accept contributions for an
|
132 |
+
open-source repository and onboard new team members for a
|
133 |
+
closed-source system.
|
134 |
+
|
135 |
+
![](./media/image18.png)
|
136 |
+
|
137 |
+
|
138 |
+
One aspect of consistent style is consistent code formatting (with
|
139 |
+
things like whitespace). The standard tool for that in Python is
|
140 |
+
[the] *[black]* [Python
|
141 |
+
formatter](https://github.com/psf/black). It's a very
|
142 |
+
opinionated tool with a fairly narrow scope in terms of style. It
|
143 |
+
focuses on things that can be fully automated and can be nicely
|
144 |
+
integrated into your editor and automated workflows.
|
145 |
+
|
146 |
+
For non-automatable aspects of style (like missing docstrings), we
|
147 |
+
recommend [*[flake8]*](https://flake8.pycqa.org/). It comes
|
148 |
+
with many extensions and plugins such as docstring completeness, type
|
149 |
+
hinting, security, and common bugs.
|
150 |
+
|
151 |
+
ML codebases often have both Python code and shell scripts in them.
|
152 |
+
Shell scripts are powerful, but they also have a lot of sharp edges.
|
153 |
+
*[shellcheck](https://www.shellcheck.net/)* knows all the
|
154 |
+
weird behaviors of bash that often cause errors and issues that aren't
|
155 |
+
immediately obvious. It also provides explanations for why it's raising
|
156 |
+
a warning or an error. It's very fast to run and can be easily
|
157 |
+
incorporated into your editor.
|
158 |
+
|
159 |
+
![](./media/image6.png)
|
160 |
+
|
161 |
+
|
162 |
+
One caveat to this is: **pedantic enforcement of style is obnoxious.**
|
163 |
+
To avoid frustration with code style and linting, we recommend:
|
164 |
+
|
165 |
+
1. Filtering rules down to the minimal style that achieves the goals we
|
166 |
+
set out (sticking with standards, avoiding arguments, keeping
|
167 |
+
version control history clean, etc.)
|
168 |
+
|
169 |
+
2. Having an "opt-in" application of rules and gradually growing
|
170 |
+
coverage over time - which is especially important for existing
|
171 |
+
codebases (which may have thousands of lines of code that we need
|
172 |
+
to be fixed).
|
173 |
+
|
174 |
+
### 1.4 - Always Be Automating
|
175 |
+
|
176 |
+
**To make the best use of testing and linting practices, you want to
|
177 |
+
automate these tasks and connect to your cloud version control system
|
178 |
+
(VCS)**. Connecting to the VCS state reduces friction when trying to
|
179 |
+
reproduce or understand errors. Furthermore, running things outside of
|
180 |
+
developer environments means that you can run tests automatically in
|
181 |
+
parallel to other development work.
|
182 |
+
|
183 |
+
Popular, open-source repositories are the best place to learn about
|
184 |
+
automation best practices. For instance, the PyTorch Github library has
|
185 |
+
tons of automated workflows built into the repo - such as workflows that
|
186 |
+
automatically run on every push and pull.
|
187 |
+
|
188 |
+
![](./media/image15.png)
|
189 |
+
|
190 |
+
|
191 |
+
The tool that PyTorch uses (and that we recommend) is [GitHub
|
192 |
+
Actions](https://docs.github.com/en/actions), which ties
|
193 |
+
automation directly to VCS. It is powerful, flexible, performant, and
|
194 |
+
easy to use. It gets great documentation, can be used with a YAML file,
|
195 |
+
and is embraced by the open-source community. There are other options
|
196 |
+
such as [pre-commit.ci](https://pre-commit.ci/),
|
197 |
+
[CircleCI](https://circleci.com/), and
|
198 |
+
[Jenkins](https://www.jenkins.io/); but GitHub Actions
|
199 |
+
seems to have won the hearts and minds in the open-source community in
|
200 |
+
the last few years.
|
201 |
+
|
202 |
+
To keep your version control history as clean as possible, you want to
|
203 |
+
be able to run tests and linters locally before committing. We recommend
|
204 |
+
*[pre-commit](https://github.com/pre-commit/pre-commit)*
|
205 |
+
to enforce hygiene checks. You can use it to run formatting, linting,
|
206 |
+
etc. on every commit and keep the total runtime to a few seconds.
|
207 |
+
*pre-commit* is easy to run locally and easy to automate with GitHub
|
208 |
+
Actions.
|
209 |
+
|
210 |
+
**Automation to ensure the quality and integrity of our software is a
|
211 |
+
productivity enhancer.** That's broader than just CI/CD. Automation
|
212 |
+
helps you avoid context switching, surfaces issues early, is a force
|
213 |
+
multiplier for small teams, and is better documented by default.
|
214 |
+
|
215 |
+
One caveat is that: **automation requires really knowing your tools.**
|
216 |
+
Knowing Docker well enough to use it is not the same as knowing Docker
|
217 |
+
well enough to automate it. Bad automation, like bad tests, takes more
|
218 |
+
time than it saves. Organizationally, that makes automation a good task
|
219 |
+
for senior engineers who have knowledge of these tools, have ownership
|
220 |
+
over code, and can make these decisions around automation.
|
221 |
+
|
222 |
+
### Summary
|
223 |
+
|
224 |
+
1. Automate tasks with GitHub Actions to reduce friction.
|
225 |
+
|
226 |
+
2. Use the standard Python toolkit for testing and cleaning your
|
227 |
+
projects.
|
228 |
+
|
229 |
+
3. Choose testing and linting practices with the 80/20 principle,
|
230 |
+
shipping velocity, and usability/developer experience in mind.
|
231 |
+
|
232 |
+
## 2 - Testing ML Systems
|
233 |
+
|
234 |
+
1. Testing ML is hard, but not impossible.
|
235 |
+
|
236 |
+
2. We should stick with the low-hanging fruit to start.
|
237 |
+
|
238 |
+
3. Test your code in production, but don't release bad code.
|
239 |
+
|
240 |
+
### 2.1 - Testing ML Is Hard, But Not Impossible
|
241 |
+
|
242 |
+
Software engineering is where many testing practices have been
|
243 |
+
developed. In software engineering, we compile source code into
|
244 |
+
programs. In machine learning, training compiles data into a model.
|
245 |
+
These components are harder to test:
|
246 |
+
|
247 |
+
1. Data is heavier and more inscrutable than source code.
|
248 |
+
|
249 |
+
2. Training is more complex and less well-defined.
|
250 |
+
|
251 |
+
3. Models have worse tools for debugging and inspection than compiled
|
252 |
+
programs.
|
253 |
+
|
254 |
+
In this section, we will focus primarily on "smoke" tests. These tests
|
255 |
+
are easy to implement and still effective. They are among the 20% of
|
256 |
+
tests that get us 80% of the value.
|
257 |
+
|
258 |
+
### 2.2 - Use Expectation Testing on Data
|
259 |
+
|
260 |
+
**We test our data by checking basic properties**. We express our
|
261 |
+
expectations about the data, which might be things like there are no
|
262 |
+
nulls in this column or the completion date is after the start date.
|
263 |
+
With expectation testing, you will start small with only a few
|
264 |
+
properties and grow them slowly. You only want to test things that are
|
265 |
+
worth raising alarms and sending notifications to others.
|
266 |
+
|
267 |
+
![](./media/image14.png)
|
268 |
+
|
269 |
+
|
270 |
+
We recommend
|
271 |
+
[*[great_expectations]*](https://greatexpectations.io/) for
|
272 |
+
data testing. It automatically generates documentation and quality
|
273 |
+
reports for your data, in addition to built-in logging and alerting
|
274 |
+
designed for expectation testing. To get started, check out [this
|
275 |
+
MadeWithML tutorial on
|
276 |
+
great_expectations](https://github.com/GokuMohandas/testing-ml).
|
277 |
+
|
278 |
+
![](./media/image13.png)
|
279 |
+
|
280 |
+
To move forward, you want to stay as close to the data as possible:
|
281 |
+
|
282 |
+
1. A common pattern is that there's a benchmark dataset with
|
283 |
+
annotations (in academia) or an external annotation team (in the
|
284 |
+
industry). A lot of the detailed information about that data can
|
285 |
+
be extracted by simply looking at it.
|
286 |
+
|
287 |
+
2. One way for data to get internalized into the organization is that
|
288 |
+
at the start of the project, model developers annotate data ad-hoc
|
289 |
+
(especially if you don't have the budget for an external
|
290 |
+
annotation team).
|
291 |
+
|
292 |
+
3. However, if the model developers at the start of the project move on
|
293 |
+
and more developers get onboarded, that knowledge is diluted. A
|
294 |
+
better solution is an internal annotation team that has a regular
|
295 |
+
information flow with the model developers is a better solution.
|
296 |
+
|
297 |
+
4. The best practice ([recommended by Shreya
|
298 |
+
Shankar](https://twitter.com/sh_reya/status/1521903046392877056))
|
299 |
+
is t**o have a regular on-call rotation where model developers
|
300 |
+
annotate data themselves**. Ideally, these are fresh data so that
|
301 |
+
all members of the team who are developing models know about the
|
302 |
+
data and build intuition/expertise in the data.
|
303 |
+
|
304 |
+
### 2.3 - Use Memorization Testing on Training
|
305 |
+
|
306 |
+
**Memorization is the simplest form of learning**. Deep neural networks
|
307 |
+
are very good at memorizing data, so checking whether your model can
|
308 |
+
memorize a very small fraction of the full data set is a great smoke
|
309 |
+
test for training. If a model can\'t memorize, then something is clearly
|
310 |
+
very wrong!
|
311 |
+
|
312 |
+
Only really gross issues with training will show up with this test. For
|
313 |
+
example, your gradients may not be calculated correctly, you have a
|
314 |
+
numerical issue, or your labels have been shuffled; serious issues like
|
315 |
+
these. Subtle bugs in your model or your data are not going to show up.
|
316 |
+
A way to catch smaller bugs is to include the length of run time in your
|
317 |
+
test coverage. It's a good way to detect if smaller issues are making it
|
318 |
+
harder for your model to learn. If the number of epochs it takes to
|
319 |
+
reach an expected performance suddenly goes up, it may be due to a
|
320 |
+
training bug. PyTorch Lightning has an "*overfit_batches*" feature that
|
321 |
+
can help with this.
|
322 |
+
|
323 |
+
**Make sure to tune memorization tests to run quickly, so you can
|
324 |
+
regularly run them**. If they are under 10 minutes or some short
|
325 |
+
threshold, they can be run every PR or code change to better catch
|
326 |
+
breaking changes. A couple of ideas for speeding up these tests are
|
327 |
+
below:
|
328 |
+
|
329 |
+
![](./media/image3.png)
|
330 |
+
|
331 |
+
Overall, these ideas lead to memorization tests that implement model
|
332 |
+
training on different time scale and allow you to mock out scenarios.
|
333 |
+
|
334 |
+
A solid, if expensive idea for testing training is to **rerun old
|
335 |
+
training jobs with new code**. It's not something that can be run
|
336 |
+
frequently, but doing so can yield lessons about what unexpected changes
|
337 |
+
might have happened in your training pipeline. The main drawback is the
|
338 |
+
potential expense of running these tests. CI platforms like
|
339 |
+
[CircleCI](https://circleci.com/) charge a great deal for
|
340 |
+
GPUs, while others like Github Actions don't offer access to the
|
341 |
+
relevant machines easily.
|
342 |
+
|
343 |
+
The best option for testing training is to **regularly run training with
|
344 |
+
new data that's coming in from production**. This is still expensive,
|
345 |
+
but it is directly related to improvements in model development, not
|
346 |
+
just testing for breakages. Setting this up requires **a data flywheel**
|
347 |
+
similar to what we talked about in Lecture 1. Further tooling needed to
|
348 |
+
achieve will be discussed down the line.
|
349 |
+
|
350 |
+
### 2.4 - Adapt Regression Testing for Models
|
351 |
+
|
352 |
+
**Models are effectively functions**. They have inputs and produce
|
353 |
+
outputs like any other function in code. So, why not test them like
|
354 |
+
functions with regression testing? For specific inputs, we can check to
|
355 |
+
see whether the model consistently returns the same outputs. This is
|
356 |
+
best done with simpler models like classification models. It's harder to
|
357 |
+
maintain such tests with more complex models. However, even in a more
|
358 |
+
complex model scenario, regression testing can be useful for comparing
|
359 |
+
changes from training to production.
|
360 |
+
|
361 |
+
![](./media/image11.png)
|
362 |
+
|
363 |
+
|
364 |
+
A more sophisticated approach to testing for ML models is to **use loss
|
365 |
+
values and model metrics to build documented test suites out of your
|
366 |
+
data**. Consider this similar to [the test-driven
|
367 |
+
development](https://en.wikipedia.org/wiki/Test-driven_development)
|
368 |
+
(TDD) code writing paradigm. The test that is written before your code
|
369 |
+
in TDD is akin to your model's loss performance; both represent the gap
|
370 |
+
between where your code needs to be and where it is. Over time, as we
|
371 |
+
improve the loss metric, our model is getting closer to passing "the
|
372 |
+
test" we've imposed on it. The gradient descent we use to improve the
|
373 |
+
model can be considered a TDD approach to machine learning models!
|
374 |
+
|
375 |
+
![](./media/image9.png)
|
376 |
+
|
377 |
+
|
378 |
+
While gradient descent is somewhat like TDD, it's not *exactly* the same
|
379 |
+
because simply reviewing metrics doesn't tell us how to resolve model
|
380 |
+
failures (the way traditional software tests do).
|
381 |
+
|
382 |
+
To fill in this gap, **start by [looking at the data points that have
|
383 |
+
the highest loss](https://arxiv.org/abs/1912.05283)**. Flag
|
384 |
+
them for a test suite composed of "hard" examples. Doing this provides
|
385 |
+
two advantages: it helps find where the model can be improved, and it
|
386 |
+
can also help find errors in the data itself (i.e. poor labels).
|
387 |
+
|
388 |
+
As you examine these failures, you can aggregate types of failures into
|
389 |
+
named suites. For example in a self-driving car use case, you could have
|
390 |
+
a "night time" suite and a "reflection" suite. **Building these test
|
391 |
+
suites can be considered the machine learning version of regression
|
392 |
+
testing**, where you take bugs that you\'ve observed in production and
|
393 |
+
add them to your test suite to make sure that they don\'t come up again.
|
394 |
+
|
395 |
+
![](./media/image8.png)
|
396 |
+
|
397 |
+
The method can be quite manual, but there are some options for speeding
|
398 |
+
it up. Partnering with the annotation team at your company can help make
|
399 |
+
developing these tests a lot faster. Another approach is to use a method
|
400 |
+
called [Domino](https://arxiv.org/abs/2203.14960) that
|
401 |
+
uses foundation models to find errors. Additionally, for testing NLP
|
402 |
+
models, use the
|
403 |
+
[CheckList](https://arxiv.org/abs/2005.04118) approach.
|
404 |
+
|
405 |
+
### 2.5 - Test in Production, But Don't YOLO
|
406 |
+
|
407 |
+
It's crucial to test in true production settings. This is especially
|
408 |
+
true for machine learning models, because data is an important component
|
409 |
+
of both the production and the development environments. It's difficult
|
410 |
+
to ensure that both are very close to one another.
|
411 |
+
|
412 |
+
**The best way to solve the training and production difference is to
|
413 |
+
test in production**.
|
414 |
+
|
415 |
+
Testing in production isn't sufficient on its own. Rather, testing in
|
416 |
+
production allows us to develop tooling and infrastructure that allows
|
417 |
+
us to resolve production errors quickly (which are often quite
|
418 |
+
expensive). It reduces pressure on other kinds of testing, but does not
|
419 |
+
replace them.
|
420 |
+
|
421 |
+
![](./media/image7.png)
|
422 |
+
|
423 |
+
|
424 |
+
We will cover in detail the tooling needed for production monitoring and
|
425 |
+
continual learning of ML systems in a future lecture.
|
426 |
+
|
427 |
+
### 2.6 - ML Test Score
|
428 |
+
|
429 |
+
So far, we have discussed writing "smoke" tests for ML: expectation
|
430 |
+
tests for data, memorization tests for training, and regression tests
|
431 |
+
for models.
|
432 |
+
|
433 |
+
**As your code base and team mature, adopt a more full-fledged approach
|
434 |
+
to testing ML systems like the approach identified in the [ML Test
|
435 |
+
Score](https://research.google/pubs/pub46555/) paper**. The
|
436 |
+
ML Test Score is a rubric that evolved out of machine learning efforts
|
437 |
+
at Google. It's a strict rubric for ML test quality that covers data,
|
438 |
+
models, training, infrastructure, and production monitoring. It overlaps
|
439 |
+
with, but goes beyond some of the recommendations we've offered.
|
440 |
+
|
441 |
+
![](./media/image2.png)
|
442 |
+
|
443 |
+
It's rather expensive, but worth it for high stakes use cases that need
|
444 |
+
to be really well-engineered! To be really clear, this rubric is
|
445 |
+
*really* strict. Even our Text Recognizer system we've designed so far
|
446 |
+
misses a few categories. Use the ML Test Score as inspiration to develop
|
447 |
+
the right testing approach that works for your team's resources and
|
448 |
+
needs.
|
449 |
+
|
450 |
+
![](./media/image5.png)
|
451 |
+
|
452 |
+
## 3 - Troubleshooting Models
|
453 |
+
|
454 |
+
**Tests help us figure out something is wrong, but troubleshooting is
|
455 |
+
required to actually fix broken ML systems**. Models often require the
|
456 |
+
most troubleshooting, and in this section we'll cover a three step
|
457 |
+
approach to troubleshooting them.
|
458 |
+
|
459 |
+
1. "Make it run" by avoiding common errors.
|
460 |
+
|
461 |
+
2. "Make it fast" by profiling and removing bottlenecks.
|
462 |
+
|
463 |
+
3. "Make it right" by scaling model/data and sticking with proven
|
464 |
+
architectures.
|
465 |
+
|
466 |
+
### 3.1 - Make It Run
|
467 |
+
|
468 |
+
This is the easiest step for models; only a small portion of bugs cause
|
469 |
+
the kind of loud failures that prevent a model from running at all.
|
470 |
+
Watch out for these bugs in advance and save yourself the trouble of
|
471 |
+
models that don't run.
|
472 |
+
|
473 |
+
The first type of bugs that prevent models from running at all are
|
474 |
+
**shape errors.** When the shape of the tensors don't match for the
|
475 |
+
operations run on them, models can't be trained or run. Prevent these
|
476 |
+
errors by keeping notes on the expected size of tensors, annotate the
|
477 |
+
sizes in the code, and even step through your model code with a debugger
|
478 |
+
to check tensor size as you go.
|
479 |
+
|
480 |
+
![](./media/image10.png)
|
481 |
+
|
482 |
+
|
483 |
+
The second type of bugs is out of **memory errors**. This occurs when
|
484 |
+
you try to push a tensor to a GPU that is too large to fit. PyTorch
|
485 |
+
Lightning has good tools to prevent this. Make sure you're using the
|
486 |
+
lowest precision your training can tolerate; a good default is 16 bit
|
487 |
+
precision. Another common reason for this is trying to run a model on
|
488 |
+
too much data or too large a batch size. Use the autoscale batch size
|
489 |
+
feature in PyTorch Lightning to pick the right size batch. You can use
|
490 |
+
gradient accumulation if these batch sizes get too small. If neither of
|
491 |
+
these options work, you can look into manual techniques like tensor
|
492 |
+
parallelism and gradient checkpoints.
|
493 |
+
|
494 |
+
**Numerical errors** also cause machine learning failures. This is when
|
495 |
+
NaNs or infinite values show up in tensors. These issues most commonly
|
496 |
+
appear first in the gradient and then cascade through the model. PyTorch
|
497 |
+
Lightning has a good tool for tracking and logging gradient norms. A
|
498 |
+
good tip to check whether these issues are caused by precision issues is
|
499 |
+
to switch to Python 64 bit floats and see if that causes these issues to
|
500 |
+
go away. Normalization layers tend to cause these issues, generally
|
501 |
+
speaking. So watch out for how you do normalization!
|
502 |
+
|
503 |
+
### 3.2 - Make It Fast
|
504 |
+
|
505 |
+
![](./media/image4.png)
|
506 |
+
|
507 |
+
Once you can run a model, you'll want it to run fast. This can be tricky
|
508 |
+
because the performance of DNN training code is very counterintuitive.
|
509 |
+
For example, transformers can actually spend more time in the MLP layer
|
510 |
+
than the attention layer. Similarly, trivial components like loading
|
511 |
+
data can soak up performance.
|
512 |
+
|
513 |
+
To solve these issues, the primary solution is to **roll up your sleeves
|
514 |
+
and profile your code**. You can often find pretty easy Python changes
|
515 |
+
that yield big results. Read these two tutorials by
|
516 |
+
[Charles](https://wandb.ai/wandb/trace/reports/A-Public-Dissection-of-a-PyTorch-Training-Step--Vmlldzo5MDE3NjU?galleryTag=&utm_source=fully_connected&utm_medium=blog&utm_campaign=using+the+pytorch+profiler+with+w%26b)
|
517 |
+
and [Horace](https://horace.io/brrr_intro.html) for more
|
518 |
+
details.
|
519 |
+
|
520 |
+
### 3.3 - Make It Right
|
521 |
+
|
522 |
+
After you make it run fast, make the model right. Unlike traditional
|
523 |
+
software, machine learning models never are truly perfect. Production
|
524 |
+
performance is never perfect. As such, it might be more appropriate to
|
525 |
+
say "make it as right as needed".
|
526 |
+
|
527 |
+
Knowing this, making the model run and run fast allows us to make the
|
528 |
+
model right through applying **scale.** To achieve performance benefits,
|
529 |
+
scaling a model or its data are generally fruitful and achievable
|
530 |
+
routes. It's a lot easier to scale a fast model. [Research from OpenAI
|
531 |
+
and other institutions](https://arxiv.org/abs/2001.08361)
|
532 |
+
is showing that benefits from scale can be rigorously measured and
|
533 |
+
predicted across compute budget, dataset size, and parameter count.
|
534 |
+
|
535 |
+
![](./media/image12.png)
|
536 |
+
|
537 |
+
If you can't afford to scale yourself, consider finetuning a model
|
538 |
+
trained at scale for your task.
|
539 |
+
|
540 |
+
So far, all of the advice given has been model and task-agnostic.
|
541 |
+
Anything more detailed has to be specific to the model and the relevant
|
542 |
+
task. Stick close to working architectures and hyperparameters from
|
543 |
+
places like HuggingFace, and try not to reinvent the wheel!
|
544 |
+
|
545 |
+
## 4 - Resources
|
546 |
+
|
547 |
+
Here are some helpful resources that discuss this topic.
|
548 |
+
|
549 |
+
### Tweeters
|
550 |
+
|
551 |
+
1. [Julia Evans](https://twitter.com/b0rk)
|
552 |
+
|
553 |
+
2. [Charity Majors](https://twitter.com/mipsytipsy)
|
554 |
+
|
555 |
+
3. [Nelson Elhage](https://twitter.com/nelhage)
|
556 |
+
|
557 |
+
4. [kipply](https://twitter.com/kipperrii)
|
558 |
+
|
559 |
+
5. [Horace He](https://twitter.com/cHHillee)
|
560 |
+
|
561 |
+
6. [Andrej Karpathy](https://twitter.com/karpathy)
|
562 |
+
|
563 |
+
7. [Chip Huyen](https://twitter.com/chipro)
|
564 |
+
|
565 |
+
8. [Jeremy Howard](https://twitter.com/jeremyphoward)
|
566 |
+
|
567 |
+
9. [Ross Wightman](https://twitter.com/wightmanr)
|
568 |
+
|
569 |
+
### Templates
|
570 |
+
|
571 |
+
1. [Lightning Hydra
|
572 |
+
Template](https://github.com/ashleve/lightning-hydra-template)
|
573 |
+
|
574 |
+
2. [NN Template](https://github.com/grok-ai/nn-template)
|
575 |
+
|
576 |
+
3. [Generic Deep Learning Project
|
577 |
+
Template](https://github.com/sudomaze/deep-learning-project-template)
|
578 |
+
|
579 |
+
### Texts
|
580 |
+
|
581 |
+
1. [Reliable ML Systems
|
582 |
+
talk](https://www.usenix.org/conference/opml20/presentation/papasian)
|
583 |
+
|
584 |
+
2. ["ML Test Score"
|
585 |
+
paper](https://research.google/pubs/pub46555/)
|
586 |
+
|
587 |
+
3. ["Attack of the Cosmic
|
588 |
+
Rays!"](https://blogs.oracle.com/linux/post/attack-of-the-cosmic-rays)
|
589 |
+
|
590 |
+
4. ["Computers can be
|
591 |
+
understood"](https://blog.nelhage.com/post/computers-can-be-understood/)
|
592 |
+
|
593 |
+
5. ["Systems that defy detailed
|
594 |
+
understanding"](https://blog.nelhage.com/post/systems-that-defy-understanding/)
|
595 |
+
|
596 |
+
6. [Testing section from MadeWithML course on
|
597 |
+
MLOps](https://madewithml.com/courses/mlops/testing/)
|
documents/lecture-03.srt
ADDED
@@ -0,0 +1,244 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
1
|
2 |
+
00:00:00,080 --> 00:00:42,480
|
3 |
+
hey folks welcome to the third lecture of full stack deep learning 2022 i'm charles frye today i'll be talking about troubleshooting and testing a high level outline of what we're going to cover today we'll talk about testing software in general and the sort of standard tools and practices that you can use to de-risk shipping software quickly then we'll move on to special considerations for testing machine learning systems the specific techniques and approaches that work best there and then lastly we'll go through what you do when your models are failing their tests how you troubleshoot your model first let's cover concepts for testing software so the general approach that we're gonna take is that tests are gonna help us ship faster with
|
4 |
+
|
5 |
+
2
|
6 |
+
00:00:40,800 --> 00:01:26,159
|
7 |
+
fewer bugs but they aren't gonna catch all of our bugs and that means though we're going to use testing tools we aren't going to try and achieve 100 coverage similarly we're going to use linting tools to try and improve the development experience but leave escape valves rather than pedantically just following our style guides lastly we'll talk about tools for automating these workflows so first why are we testing it all tests can help us ship faster even if they aren't catching all bugs before they go into production so as a reminder what even our tests tests are code we write that's designed to fail in an intelligible way when our other code has bugs so for example this little test for the text recognizer from the full stack
|
8 |
+
|
9 |
+
3
|
10 |
+
00:01:24,320 --> 00:02:06,240
|
11 |
+
deep learning code base checks whether the output of the text recognizer on a particular input is the same as what was expected and raises an error if it's not and these kinds of tests can help catch some bugs before they're merged into main or shipped into production but they can't catch all bugs and one reason why is that test suites in the tools we're using are not certificates of correctness in some formal systems tests like those can actually be used to prove that code is correct but we aren't working in one of those systems like agda the fear improving language or idris ii we're writing in python and so really it's a loosey-goosey language and all bets are off in terms of code correctness so if test suites aren't
|
12 |
+
|
13 |
+
4
|
14 |
+
00:02:05,439 --> 00:02:49,599
|
15 |
+
like certificates of correctness then what are they like i like this framing from nelson el haga who's at anthropic ai who says that we should think of test suites as being more like classifiers and so to bring our intuition from working with classification algorithms and machine learning so the classification problem is does this commit have a bug or is it okay and what are classifier outputs is that the test pass or the test failed so our tests are our classifier of code and so you should think of that as a prediction of whether there's a bug this kind of frame shift suggests a different way of designing our test suites when we design classifiers we know that we need to trade off detection and false alarms
|
16 |
+
|
17 |
+
5
|
18 |
+
00:02:47,519 --> 00:03:34,159
|
19 |
+
lots of people are thinking about detection when they're designing their test suites they're trying to make sure that they will catch all of the possible bugs but in doing so we can inadvertently introduce false alarms so the classic signature of a false alarm is a failed test that's followed by a commit that fixes the test rather than the code so that's an example from the full stack deep learning code base so in order to avoid introducing too many false alarms it's useful to ask yourself two questions before adding a test so the first question is which real bugs will this test catch what are some actual ways that the world might change around this code or that somebody might introduce a change to to some part of
|
20 |
+
|
21 |
+
6
|
22 |
+
00:03:32,159 --> 00:04:16,160
|
23 |
+
the code base that this test will catch once you've listed a couple of those then ask yourself what are some false alarms that this test might raise what are some ways that the world around the test or the code could change that's still valid in good code but now this test will fail and if you can think of more examples for the latter case than the former then maybe you should reconsider whether you really need this test one caveat to this in some settings it actually is really important that you have a super high degree of confidence in the correctness of your code so this screenshot is from a deep learning diagnostic tool for cardiac ultrasounds by caption health that i worked on in an internship in that project we had a ton
|
24 |
+
|
25 |
+
7
|
26 |
+
00:04:14,159 --> 00:05:00,240
|
27 |
+
of concern about the correctness of the model the confidence people had in the model and regulators expected to see that kind of information so there's other cases where this level of correctness is needed self-driving cars is one example you also see this in banking and finance there are a couple of patterns that immediately arise here one is the presence of regulators uh and more generally high stakes if you're operating in a high-stakes situation where errors have consequences for people's lives and livelihoods even if it's not regulated yet it might be regulated soon and in particular these are also all examples of those autonomous systems that class of low feasibility high impact machine learning project that we talked about in the
|
28 |
+
|
29 |
+
8
|
30 |
+
00:04:58,320 --> 00:05:44,240
|
31 |
+
first lecture this is one of the reasons for their low feasibility is because correctness becomes really important for these kinds of autonomous systems so what does this mindset mean for how we approach testing and quality assurance for our code it means that we're going to use testing tools but we don't want to aim for complete coverage of our code so in terms of tools pi test is the standard tool for testing python code it is a very pythonic implementation and interface and it has also a ton of powerful features like marks for creating separate suites of tests sharing resources across tests and running tests in a variety of parameterized variations in addition to writing the kinds of separate test suites that are standard in lots of
|
32 |
+
|
33 |
+
9
|
34 |
+
00:05:42,080 --> 00:06:29,440
|
35 |
+
languages in python there's a nice built-in tool called doctest for testing the code inside of our docstrings and this helps make sure that our docs strings don't get out of sync with our code which builds trust in the content of those docs strings and makes them easier to maintain doc tests are really nice but there are some limits they're framed around code snippets that could be run in a terminal and so they can only display what can be easily displayed in a terminal notebooks on the other hand can display things like rich media charts images and web pages also interleaved with code execution and text so for example with our data processing code we have some notebooks that that have charts and images in them that explain choices in
|
36 |
+
|
37 |
+
10
|
38 |
+
00:06:27,520 --> 00:07:08,960
|
39 |
+
that data processing trouble is notebooks are hard to test we use a cheap and dirty solution we make sure that our notebooks run and to end then we add some assert statements and we use nb format to run the notebooks and flag when they sail so once you start adding lots of different types of tests and as your code base grows you're going to want to have tooling for recording what kind of code is actually being checked or covered by the tests typically this is done in terms of lines of code but some tools can be a little bit more finer grained the tool that we recommend for this is called codecov it generates a lot of really nice visualizations that you can use to drill down or get a high level overview of the current state of
|
40 |
+
|
41 |
+
11
|
42 |
+
00:07:07,360 --> 00:07:53,039
|
43 |
+
your testing this is a great tool for helping you understand your testing and its state it can be incorporated into your testing effectively saying i'm going to reject commits not only where tests fail but also where test coverage goes down below some value or by a certain amount but we actually recommend against that personal experience interviews and even some published research suggests that only a small fraction of the tests that you write are going to generate the majority of your value and so the right tactic engineering wise is to expend the limited engineering effort that we have on the highest impact tests and making sure those are super high quality but if you set a coverage target then you're
|
44 |
+
|
45 |
+
12
|
46 |
+
00:07:51,280 --> 00:08:36,560
|
47 |
+
instead going to write tests in order to meet that coverage target regardless of their quality so you end up spending more effort both to write the tests and then to maintain and deal with their low quality in addition to checking that our code is correct we're going to also want to check that our code is clean with linting tools but with the caveat that we always want to make sure that there are escape valves from these tools when we say the code is clean what we mean is that it's of a uniform style and of a standard style so uniform style helps avoid spending engineering time on arguments over style in pull requests and code review it also helps improve the utility of our version control system by cutting down on
|
48 |
+
|
49 |
+
13
|
50 |
+
00:08:33,360 --> 00:09:20,959
|
51 |
+
unnecessary noisy components of dips and reducing their size both of these things will make it easier for humans to visually parse the dips in our version control system and make it easier to build automation around them and then we also generally want to adopt a standard style in whatever community it is that we are writing our code if you're an open source repository this is going to make it easier to accept contributions and even if you're working on a closed source team if your new team members are familiar with this style that's standard in the community they'll be faster to onboard one aspect of consistent style is consistent formatting of code with things like white space the standard tool for that in python is the black
|
52 |
+
|
53 |
+
14
|
54 |
+
00:09:18,320 --> 00:10:04,399
|
55 |
+
python formatter it's a very opinionated tool but it has a fairly narrow scope in terms of style it focuses on things that can be fully automated so you can see it not only detects deviations from style but also implements the fix so that's really nice integrated into your editor integrate it into automated workflows and avoid engineers having to implement these things themselves for non-automatable aspects of style the tool we recommend is flake 8. non-automatable aspects of style are things like missing doc strings we don't have good enough automation tools to reliably generate doc strings for code automatically so these are going to require engineers to intervene in order to fix them one of the best things about flakegate is that
|
56 |
+
|
57 |
+
15
|
58 |
+
00:10:02,000 --> 00:10:47,680
|
59 |
+
it comes with tons of extensions and plugins so you can check things like doctrine style and completeness like type hinting and even for security issues and common bugs all via flake8 extensions so those cover your python code ml code bases often have both python code and shell scripts in them shell scripts are really powerful but they also have a lot of sharp edges so shell check knows about all these kinds of weird behaviors of bash that often cause errors and issues that aren't immediately obvious and it also provides explanations for why it's raising a warning or an error it's a very fast to run tool so you can incorporate it into your editor and because it includes explanations you can often resolve the
|
60 |
+
|
61 |
+
16
|
62 |
+
00:10:45,279 --> 00:11:33,600
|
63 |
+
issue without having to go to google or stack overflow and switch contexts out of your editing environment so these tools are great and a uniform style is important but really pedantically enforcing style can be self-defeating so i searched for the word slaykate on github and found over a hundred thousand commits mentioning placating these kinds of automated style enforcement tools and all these commits sort of drip frustration from engineers who are spending time on this that they wish that they were not so to avoid frustration with code style and linting we recommend filtering your rules down to the minimal style that achieves the goals that we set of sticking with standards and of avoiding arguments and
|
64 |
+
|
65 |
+
17
|
66 |
+
00:11:30,640 --> 00:12:12,880
|
67 |
+
of keeping version control history clean another suggestion is to have an opt-in rather than an opt-out application of rules so by default many of these rules may not be applied to all files in the code base but you can opt in and add a particular rule to a particular file and then you can sort of grow this coverage over time and avoid these kinds of frustrations this is especially important for applying these kinds of style recommendations to existing code bases which may have thousands of lines of code that need to be fixed in order to make best use of these testing and linting practices you're going to want to embrace automation as much as possible in your development workflows for the things we talked about already
|
68 |
+
|
69 |
+
18
|
70 |
+
00:12:11,760 --> 00:12:58,240
|
71 |
+
with testing and linting you're going to want to automate these and connect them to your cloud version control system and run these tasks in the cloud or otherwise outside of development environments so connecting diversion control state reduces friction when trying to reproduce or understand errors and running things outside of developer environments means that you can run these tests in parallel to other development work so you can kick off tests that might take 10 or 20 minutes and spend that time responding to slack messages or moving on to other work one of the best places to learn about best practices for automation are popular open source repos so i checked out pytorch's github repository and found
|
72 |
+
|
73 |
+
19
|
74 |
+
00:12:55,040 --> 00:13:42,800
|
75 |
+
that they had tons and tons of automated workflows built into the repository they also followed what i think are some really nice practices like they had some workflows that are automatically running on every push and pull and these are mostly code related tasks that run for less than 10 minutes so that's things like linting and maybe some of the quicker tests other tasks that aren't directly code related but maybe do things like check dependencies and any code-related tasks that take more than 10 minutes to run are run on a schedule so we can see that for example closing stale pull requests is done on a schedule because it's not code related pytorch also runs a periodic suite of tests that takes hours to run you don't
|
76 |
+
|
77 |
+
20
|
78 |
+
00:13:40,639 --> 00:14:23,920
|
79 |
+
want to run that every time that you push or pull so the tool that they use and that we recommend is github actions this ties your automation entirely directly to your version control system and that has tons of benefits also github actions is really powerful it's really flexible there's a generous free tier it's performant and on top of all this it's really easy to use it's got really great documentation configuring github actions is done just using a yaml file and because of all these features it's been embraced by the open source community which has contributed lots and lots of github actions that maybe already automate the workflow that you're interested in that's why we recommend github actions there are other
|
80 |
+
|
81 |
+
21
|
82 |
+
00:14:21,360 --> 00:15:07,120
|
83 |
+
options precommit.ci circleci and jenkins all great choices all automation tools that i've seen work but github actions seems to have won hearts and minds in the open source community in the last couple years so that makes sure that these tests and lints are being run in code before it's shipped or before it's merged into main but part of our goal was to keep our version control history as clean as possible so we want to be able to run these locally as well and before committing and so for that we recommend a tool called pre-commit which can run all kinds of different tools and automations automatically before commits so it's extremely flexible and can run lots of stuff you will want to keep the total run time to just a few seconds or
|
84 |
+
|
85 |
+
22
|
86 |
+
00:15:05,440 --> 00:15:46,160
|
87 |
+
you'll discourage engineers from committing which can lead to work getting lost pre-commit super easy to run locally in part because it separates out the environment for these linting tools from the rest of the development environment which avoids a bunch of really annoying tooling and system administration headaches they're also super easy to automate with github actions automation to ensure the quality and integrity of our software is a huge productivity enhancer that's broader than just ci cd which is how you might which is how you might hear tools like github actions referred to automation helps you avoid context switching if a task is being run fully automatically then you don't have to switch context and remember the
|
88 |
+
|
89 |
+
23
|
90 |
+
00:15:44,160 --> 00:16:26,880
|
91 |
+
command line arguments that you need in order to run your tool it services issues more quickly than if these things were being run manually it's a huge force multiplier for small teams that can't just throw engineer hours at problems and it's better documented than manual processes the script or artifact that you're using to automate a process serves as documentation for how a process is done if somebody wants to do it manually the one caveat is that fully embracing automation requires really knowing your tools well knowing docker well enough to use it is not the same as knowing docker well enough to automate it and bad automation like bad tests can take more time away than it saves so organizationally that actually makes
|
92 |
+
|
93 |
+
24
|
94 |
+
00:16:24,720 --> 00:17:08,559
|
95 |
+
automation a really good task for senior engineers who have knowledge of these tools have ownership over code and can make these kinds of decisions around automation perhaps with junior engineer mentees to actually write the implementations so in summary automate tasks with github actions to reduce friction in development and move more quickly use the standard python tool cat for testing and cleaning your projects and choose in that toolkit the testing and linting practices with the 80 20 principle for tests with shipping velocity and with usability and developer experience in mind now that we've covered general ideas for testing software systems let's talk about the specifics that we need for testing machine learning systems the key point
|
96 |
+
|
97 |
+
25
|
98 |
+
00:17:06,880 --> 00:17:53,039
|
99 |
+
in this section is that testing email is difficult but if we adapt ml specific coding practices and focus on low hanging fruit to start then we can test our ml code and then additionally testing machine learning means testing in production but testing in production doesn't mean that you can just release bad code and let god sort it out so why is testing machine learning hard so software engineering is where a lot of testing practices have been developed and in software engineering we compile source code into programs so we write source code and a compiler turns that into a program that can take inputs and return outputs in machine learning training compiles in a sense data into a model and all of these components are
|
100 |
+
|
101 |
+
26
|
102 |
+
00:17:51,520 --> 00:18:46,240
|
103 |
+
harder to test in the machine learning case than in the software engineering case data is heavier and more inscrutable than source code training is more complex less well-defined and less mature than compilation and models have worse tools for debugging and inspection than compiled programs so this means that ml is the dark souls of software testing it's a notoriously difficult video game but just because something is difficult doesn't mean that it's impossible in the latest souls game elden ring a player named let me solo her defeated one of the hardest bosses in the game wearing nothing but a jar on their head if testing machine learning code is the dark souls of software testing then with practice and with the
|
104 |
+
|
105 |
+
27
|
106 |
+
00:18:44,080 --> 00:19:28,400
|
107 |
+
right techniques you can become the let me solo her of software testing and so in our recommendations in this section we're going to focus mostly on what are sometimes called smoke tests which let you know when something is on fire and help you resolve that issue so these tests are easy to implement but they are still very effective so they're among the 20 percent of tests that get us 80 of the value for data the kind of smoke testing we recommend is expectation testing so we test our data by checking basic properties we express our expectations about the data which might be things like there are no nulls in this column the completion date is after the start date and so with these you're going to want to start small checking
|
108 |
+
|
109 |
+
28
|
110 |
+
00:19:26,160 --> 00:20:13,679
|
111 |
+
only a few properties and grow them slowly and only test things that are worth raising alarms over worth sending people notifications worth bringing people in to try and resolve them so you might be tempted to say oh these are human heights they should be between four and eight feet but actually there are people between the heights of two and ten feet so loosening these expectations to avoid false positives is an important way to make them more useful so you can even say that i should be not negative and less than 30 feet and that will catch somebody maybe accidentally entering a height in inches but it doesn't express strong expectations about the statistical distribution of heights you could try and build something for expectation
|
112 |
+
|
113 |
+
29
|
114 |
+
00:20:11,200 --> 00:20:52,480
|
115 |
+
testing with a tool like pie test but there's enough specifics and there's good enough tools that it's worth reaching for something else so the tool we recommend is great expectations in part because great expectation automatically generates documentation for your data and quality reports and has built-in logging and learning designed for expectation testing so we are going to go through this in the lab we'll go through a lot of the other tools that we've talked about in the lab this week so if you want to check out great expectations we recommend the made with ml tutorial on great expectations by gogumontis loose expectation testing is a really uh is a great start for testing your data pipeline what do you
|
116 |
+
|
117 |
+
30
|
118 |
+
00:20:50,720 --> 00:21:33,600
|
119 |
+
do as you move forward from that the number one recommendation that i have is to stay as close to your data as possible so from top to bottom we have data annotation setups going from furthest away from the model development team to closest one common pattern is that there's some benchmark data set with annotations that you're using uh which is super common in academia or there's an external annotation team which is very common in industry and in that case a lot of the detailed information about the data that you can learn by looking at it and using it yourself are going to be internalized into the organization so one way that that sometimes does get internalized is that at the start of the project some
|
120 |
+
|
121 |
+
31
|
122 |
+
00:21:31,280 --> 00:22:16,080
|
123 |
+
data will get annotated ad hoc by model developers especially if you're not using some external benchmark data set or you don't yet have budget for an external annotation team and that's an improvement but if the model developers who around at the start of the project move on and as more developers get onboarded that knowledge is diluted better than that is an internal annotation team that has regular information flow whether that's stand-ups and syncs or exchange of documentation that information flows to the model developers but probably the best practice and one that i saw recommended by shreya shankar on twitter is to have a regular on-call rotation where model developers annotate data themselves ideally fresh data so that
|
124 |
+
|
125 |
+
32
|
126 |
+
00:22:13,919 --> 00:22:59,840
|
127 |
+
all members of the team who are developing models know about the data and develop intuition and expertise in the data for testing our training code we're going to use memorization testing so memorization is the simplest form of learning steep neural networks are very good at memorizing data and so checking whether your model can memorize a very small fraction of the full data set is a great smoke test for trading and if a model can't memorize then something is clearly very wrong only really gross issues with training are going to show up with this test so your gradients aren't being calculated correctly you have a numerical issue your labels have been shuffled and subtle bugs in your model or your data
|
128 |
+
|
129 |
+
33
|
130 |
+
00:22:57,760 --> 00:23:38,240
|
131 |
+
are not going to show up in this but you can improve the coverage of this test by including the run time in the test because regressions there can reveal bugs that just checking whether you can eventually memorize a small data set wouldn't reveal so if you're including the wall time that can catch performance regressions but also if you're including the number of steps or epochs required to hit some criterion value of the loss then you can catch some of these small issues that make learning harder but not impossible there's a nice feature of pytorch lighting overfit batches that can quickly implement this memorization test and if you design them correctly you can incorporate these tests into end-to-end model deployment testing to
|
132 |
+
|
133 |
+
34
|
134 |
+
00:23:36,880 --> 00:24:14,880
|
135 |
+
check to make sure that the data that the model memorized in training is also something it can correctly respond to in production with these memorization tests you're going to want to tune them to run quickly so that you can run them as often as possible if you can get them to under 10 minutes you might run them on every pull request or on every push so this is something that we worked on in updating the course for 2022 so the simplest way to speed these jobs up is to simply buy faster machines but if you're already on the fastest machines possible or you don't have budget then you start by reducing the size of the data set that the model is memorizing down to the batch size that you want to use in training once you reduce the
|
136 |
+
|
137 |
+
35
|
138 |
+
00:24:13,120 --> 00:24:52,559
|
139 |
+
batch size below what's in training you're starting to step further and further away from the training process that you're trying to test and so going down this list we're getting further and further from the thing that we're actually testing but allowing our tests to run more quickly the next step that can really speed up a memorization test is to turn off regular regularization which is meant to reduce overfitting and memorization is a form of overfitting so that means turning off dropout turning off augmentation you can also reduce the model size without reducing the architecture so reduce the number of layers reduce the width of layers while keeping all of those components in place and if that's not enough you can remove
|
140 |
+
|
141 |
+
36
|
142 |
+
00:24:50,799 --> 00:25:31,679
|
143 |
+
some of the most expensive components and in the end you should end up with a tier of memorization tests which are more or less close to how you actually train models in production that you can run on different time scales one recommendation you'll see moving forward and trying to move past just smoke testing by checking for memorization is to rerun old training jobs with new code so this is never something that you're going to be able to run on every push probably not nightly either if you're looking at training jobs that run for multiple days and the fact that this takes a long time to run is one of the reasons why it's going to be really expensive no matter how you do it for example if you use if you're gonna be
|
144 |
+
|
145 |
+
37
|
146 |
+
00:25:29,360 --> 00:26:10,720
|
147 |
+
doing this with circle ci uh you'll need gpu runners to execute your training jobs but those are only available in the enterprise level plan which is twenty four thousand dollars a year at a bare minimum i've seen some very large bills for running gpus in circleci github actions on the other hand does not have gpu runners available so you'll need to host them yourself though it is on the roadmap to add gpu runners to github actions and that means that you're probably going to maybe double your training spend maybe you're adding an extra machine to rerun your training jobs or maybe you're adding to your cloud budget to pay for more cloud machines to run these jobs and all this expenditure here is only on testing code
|
148 |
+
|
149 |
+
38
|
150 |
+
00:26:09,120 --> 00:26:52,159
|
151 |
+
it doesn't have any connection to the actual models that we're trying to ship the best thing to do is to test training by regularly running training with new data that's coming in from production this is still going to be expensive because you're going to be running more training than you were previously but now that training spend is going to model development not code testing so it's easier to justify having this set up requires the data flywheel that we talked about in lecture one and that requires production monitoring tooling and all kinds of other things that we'll talk about in the monitoring and continual learning lecture lastly for testing our models we're going to adapt regression testing at a very base level
|
152 |
+
|
153 |
+
39
|
154 |
+
00:26:49,440 --> 00:27:31,279
|
155 |
+
models are effectively functions so we can test them like functions they take in inputs they produce outputs we can write down what those outputs should be for specific inputs and then test them this is easiest for classification and other tasks with simple output if you have really complex output like for example our text recognizer that returns tests then these tests can often become really flaky and hard to maintain but even for those kinds of outputs you can use these tests to check for differences between how the model is behaving in training and how it's behaving in production the better approach is still relatively straight forward is to use the values of the loss and your metrics to help build documented regression test
|
156 |
+
|
157 |
+
40
|
158 |
+
00:27:29,120 --> 00:28:10,880
|
159 |
+
suites out of your data out of the data you're using in training and the data you see in production the framing that i like to bring to this comes from test driven development so test driven development is a paradigm for testing that says first you write the test and that test fails because you haven't written any of the code that it's testing and then you write code until you pass the test this is straightforward or incorporate into testing our models because in some sense we're already doing it think of the loss as like a fuzzy test signal rather than simply failing or not failing the loss tells us how badly a test was failed so how badly did we miss the expected output on this particular input and so
|
160 |
+
|
161 |
+
41
|
162 |
+
00:28:09,360 --> 00:28:51,919
|
163 |
+
just like in test driven development that's a test that's written before that code writing process and during training our model is changing and it changes in order to do better on the tests that we're providing and the model stops changing once it passes the test so in some sense gradient descent is already test driven development and maybe that is an explanation for the carpathi quote that gradient descent writes better code than me but just because gradient scent is test-driven development doesn't mean that we're done testing our models because what's missing here is that the loss and other metrics are telling us that we're failing but they aren't giving us actionable insights or a way to resolve that failure the simplest and
|
164 |
+
|
165 |
+
42
|
166 |
+
00:28:49,679 --> 00:29:37,600
|
167 |
+
most generic example is to find data points with the highest loss in your validation and test set or coming from production and put them in a suite labeled hard but note that the problem isn't always going to be with the model searching for high loss examples does reveal issues about what your model is learning but it also reveals issues in your data like bad labels so this doesn't just test models it also tests data and then we want to aggregate individual failures that we observe into named suites of specific types of failure so this is an example from a self-driving car task of detecting pedestrians cases where pedestrians were not detected it's much easier to incorporate this into your workflows if you already have a
|
168 |
+
|
169 |
+
43
|
170 |
+
00:29:36,240 --> 00:30:13,840
|
171 |
+
connection between your model development team and your annotation team reviewing these examples here we can see that what seems to be the same type of failure is occurring more than once in two examples there's a pedestrian who's not visible because they're covered by shadows in two examples there are reflections off of the windshield that are making it harder to see the pedestrian and then some of the examples come from night scenes so we can collect these up create a data set with that label and treat these as test suites to drive model development decisions and so this is kind of like the machine learning version of a type of testing called regression testing where you take bugs that you've observed
|
172 |
+
|
173 |
+
44
|
174 |
+
00:30:12,159 --> 00:30:56,320
|
175 |
+
in production and add them to your test suite to make sure that they don't come up again so the process that i described is very manual but there's some hope that this process might be automated in the near future a recent paper described a method called domino that uses much much larger cross-modal embedding models so foundation models to understand what kinds of errors a smaller model like a specific model designed just to detect birds or pedestrians what kinds of mistakes is it making on images as your models get more mature and you understand their behavior and the data that they're operating on better you can start to test more features of your models so for more ways to test models with an emphasis on nlp see the
|
176 |
+
|
177 |
+
45
|
178 |
+
00:30:54,480 --> 00:31:38,399
|
179 |
+
checklist paper that talks about different ways to do behavioral testing of models in addition to testing data training and models in our development environment we're also going to want to test in production and the reason why is that production environments differ from development environments this is something that is true for complex software systems outside of machine learning so charity majors of honeycomb has been a big proponent of testing and production on these grounds and this is especially true for machine learning models because data is an important component of both the production and the development environments and it's very difficult to ensure that those two things are close to each other and so
|
180 |
+
|
181 |
+
46
|
182 |
+
00:31:36,080 --> 00:32:18,720
|
183 |
+
the solution in this case is to run our tests in production but testing in production doesn't mean only testing in production testing in production means monitoring production for errors and fixing them quickly as chip win the author of designing machine learning systems pointed out this means building infrastructure and tooling so that errors in production are quickly fixed doing this safely and effectively and ergonomically requires tooling to monitor production and a lot of that tooling is fairly new especially tooling that can handle the particular type of production monitoring that we need in machine learning we'll cover it along with monitoring and continual learning in that lecture so in summary we
|
184 |
+
|
185 |
+
47
|
186 |
+
00:32:16,720 --> 00:33:03,440
|
187 |
+
recommend focusing on some of the low-hanging fruit when testing ml and sticking to tests that can alert you alert you to when the system is on fire so that means expectation tests of simple properties of data memorization tests for training and data-based regression tests for models but what about as your code base and your team matures one really nice rubric for organizing the testing of a really mature ml code base is the ml test score so the ml test score came out of google research and it's this really strict rubric for ml test quality so it includes tests for data models training infrastructure and production monitoring and it overlaps with but goes beyond some of the recommendations that we've
|
188 |
+
|
189 |
+
48
|
190 |
+
00:33:00,799 --> 00:33:48,559
|
191 |
+
given already maintaining and automating all of these tests is really expensive but it can be worth it for a really high stakes or large scale machine learning system so we didn't use the ml test score to design the text recognizer code base but we can check what we implemented against it some of the recommendations in the machine learning test score didn't end up being relevant for our model for example some of the data tests are organized around tabular data for traditional machine learning rather than for deep learning but there's still lots of really great suggestions in the ml test score so you might be surprised to see we're only hitting a few of these criteria in each category but that's a function of how
|
192 |
+
|
193 |
+
49
|
194 |
+
00:33:46,240 --> 00:34:33,040
|
195 |
+
strict this testing rubric is so they also provide some data on how teams doing ml at google did on this rubric and if we compare ourselves to that standard the text recognizer is about in the middle which is not so bad for a team not working with google scale resources tests alert us to the presence of bugs but in order to resolve them we'll need to do some troubleshooting and one of the components of the machine learning pipeline that's going to need the most troubleshooting and which is going to require very specialized approaches is troubleshooting models so the key idea in this section is to take a three-step approach to troubleshooting your model first make it run by avoiding the common kinds of errors that can
|
196 |
+
|
197 |
+
50
|
198 |
+
00:34:30,159 --> 00:35:18,240
|
199 |
+
cause crashes shape issues out of memory errors and numerical problems then make your model fast by profiling it and removing any bottlenecks then lastly make the model write improve its performance on test metrics by scaling out the model and the data and sticking with proven architectures first how do we make a model run luckily this step is actually relatively easy in that only a small portion of bugs in machine learning cause the kind of loud failure that we're tackling here so there's shape errors out of memory errors and numerical errors shape errors occur when the shapes of tensors don't match the shapes expected by the operations applied to them so while you're writing your pytorch code it's a good idea to
|
200 |
+
|
201 |
+
51
|
202 |
+
00:35:15,200 --> 00:35:58,640
|
203 |
+
keep notes on what you expect the shapes of your tensors to be to annotate those in the code as we do in the full stack deep learning code base and to even step through this code in a debugger checking the shapes as you go another one of the most common errors in deep learning is out of memory when you when you try and push a tensor to the gpu that's too large to fit on it something of a right of passage for deep learning engineering luckily pytorch lightning has a bunch of really nice tools built into this first make sure you're using the lowest precision that your training can tolerate a good default is half precision floats or 16-bit floats a common culprit is that you're trying to run your model on too much data at once
|
204 |
+
|
205 |
+
52
|
206 |
+
00:35:57,040 --> 00:36:40,400
|
207 |
+
on too large of a batch so you can use the auto scale batch size feature in pi torch lightning to pick a batch size that uses as much gpu memory as you have but no more and if that batch size is too small to get you stable gradients that can be used for training you can use gradient accumulation across batches also easily within lightning to get the same gradients that you would have gotten if you calculated on a much larger batch if none of those work and you're already operating on gpus with the maximum amount of ram then you'll have to look into manual techniques like tensor parallelism and gradient checkpointing another cause of crashes for machine learning models is numerical errors when tensors end up with nands or
|
208 |
+
|
209 |
+
53
|
210 |
+
00:36:38,240 --> 00:37:23,119
|
211 |
+
infinite values in them most commonly these numerical issues appear first in the gradient the gradients explode or shrink to zero and then the values of parameters or activations become infinite or nan so you can observe some of these gradient spikes occurring in some of the experiments for the dolly mini project that have been publicly posted pi torch lightning comes with a nice tool for tracking gradient norms and logging them so that you can see them and correlate them with the appearance of nance and infinities and crashes in your training a nice debugging step to check what the cause might be whether the cause is due to precision issues or due to a more fundamental numerical issue is to switch to double precision floats the default
|
212 |
+
|
213 |
+
54
|
214 |
+
00:37:20,640 --> 00:38:02,960
|
215 |
+
floating point size in python 64-bit floats and see if that causes these issues to go away if it doesn't then that means that there's some kind of issue with your numerical code and you'll want to find a numerically stable implementation to base your work off of or apply error analysis techniques and one of the most common causes of these kinds of numerical errors are the normalization layers like batch norm and layer norm that's what's involved in these gradient spikes in dolly mini so you'll want to make sure to check carefully that you're using normalization in the way that's been found to work for the types of data and architectures that you're using once your battle can actually run end to end and calculate gradients correctly the
|
216 |
+
|
217 |
+
55
|
218 |
+
00:38:00,000 --> 00:38:47,040
|
219 |
+
next step is to make it go fast this could be tricky because the performance of deep neural network training code is very counter-intuitive for example with typical hyper-parameter choices transformer layers spend more time on the plain old mlp component than they do on the intention component and as we saw in lecture two for popular optimizers just keeping track of the optimizer state actually uses more gpu memory than any of the other things you might expect would take up that memory like model parameters or data and then furthermore without careful parallelization what seem like fairly trivial components like loading data can end up dwarfing what would seem like the actual performance bottlenecks like the
|
220 |
+
|
221 |
+
56
|
222 |
+
00:38:45,200 --> 00:39:31,440
|
223 |
+
forwards and backwards passes and parameter updates the only solution here is to kind of roll up your sleeves and get your hands dirty and actually profile your code so we'll see this in the lab but the good news is that you can often find relatively low hanging fruit to speed up training like making changes just in the regular python code and not in not any component of the model and lastly once you've got a model that can run and that runs quickly it's time to make the model correct by reducing its loss on tester production data the normal recommendation for software engineering is make it run make it right make it fast so why is make it right last in this case and the reason why is that machine learning models are
|
224 |
+
|
225 |
+
57
|
226 |
+
00:39:29,359 --> 00:40:12,480
|
227 |
+
always wrong production performance is never perfect and if we think of non-zero loss as a partial test failure for our models then our tests are always at least partially failing so it's never really possible to truly make it right and then the other reason that we want to put performance first is they can kind of solve all your problems with model correctness with scale so if your model is over fitting to the training data and your production loss is way higher then you can scale up your data if your model is underfitting and you're you can't get the training loss to go down as much as you'd like then scale up your model if you're have distribution shift which means that your training and validation loss are both low but your
|
228 |
+
|
229 |
+
58
|
230 |
+
00:40:09,920 --> 00:40:57,680
|
231 |
+
production or test loss is really high then just scale up both folks at openai and elsewhere have done work demonstrating that the performance benefits from scale can be very rigorously measured and predicted across compute budget data set size and parameter count generating these kinds of scaling law charts is an important component of openai's workflows for deciding how to build models and how to run training but scaling costs money so what do you do if you can't afford the level of scale required to reach the performance that you want in that case you're going to want to fine-tune or make use of a model trained at scale for your tasks this is something we'll talk about in the building on foundation models lecture all the other advice
|
232 |
+
|
233 |
+
59
|
234 |
+
00:40:55,200 --> 00:41:40,800
|
235 |
+
around addressing overfitting addressing underfitting resolving distribution shift is going to be model and task specific and it's going to be hard to know what is going to work without trying it so this is just a selection of some of the advice i've seen given or been given about improving model performance and they're mutually exclusive in many cases because they're so tied to the particular task and model and data that they're being applied to so the easiest way to resolve this is to stick as close as possible to working architectures and hyper parameter choices that you can get from places like the hugging face hub or papers with code and in fact this is really how these hyperparameter choices and architectures arise it's via a slow
|
236 |
+
|
237 |
+
60
|
238 |
+
00:41:38,560 --> 00:42:22,480
|
239 |
+
evolutionary process of people building on techniques and hyperparameter choices that work rather than people designing things entirely from scratch so that brings us to the end of the troubleshooting and testing lecture we covered the general approach to testing software both tools and practices that you can use to ship more safely more quickly then we covered the specific things that you need in order to test ml systems data sets training procedures and models both the most basic tests that you should implement at the beginning and then how to grow those into more sophisticated more robust tests and then lastly we considered the workflows and techniques that you need to troubleshoot model performance so
|
240 |
+
|
241 |
+
61
|
242 |
+
00:42:20,160 --> 00:42:42,359
|
243 |
+
we'll see more on all these topics in the lab for this week if you'd like to learn more about any of these topics check out the slides online for a list of recommended twitter follows project templates and medium to long form text resources to learn more about troubleshooting and testing that's all for this lecture thanks for listening and happy testing
|
244 |
+
|
documents/lecture-04.md
ADDED
@@ -0,0 +1,421 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
---
|
2 |
+
description: Sourcing, storing, exploring, processing, labeling, and versioning data for deep learning.
|
3 |
+
---
|
4 |
+
|
5 |
+
# Lecture 4: Data Management
|
6 |
+
|
7 |
+
<div align="center">
|
8 |
+
<iframe width="720" height="405" src="https://www.youtube-nocookie.com/embed/Jlm4oqW41vY?list=PL1T8fO7ArWleMMI8KPJ_5D5XSlovTW_Ur" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
|
9 |
+
</div>
|
10 |
+
|
11 |
+
Lecture by [Sergey Karayev](https://sergeykarayev.com).<br />
|
12 |
+
Notes by [James Le](https://twitter.com/le_james94) and [Vishnu Rachakonda](https://www.linkedin.com/in/vrachakonda/).<br />
|
13 |
+
Published August 29, 2022.
|
14 |
+
[Download slides](https://fsdl.me/2022-lecture-04-slides).
|
15 |
+
|
16 |
+
## 1 - Introduction
|
17 |
+
|
18 |
+
One thing people don't quite get as they enter the field of ML is how
|
19 |
+
much of it deals with data - putting together datasets, exploring the
|
20 |
+
data, wrangling the data, etc. The key points of this lecture are:
|
21 |
+
|
22 |
+
1. Spend 10x as much time exploring the data as you would like to.
|
23 |
+
|
24 |
+
2. Fixing, adding, and augmenting the data is usually the best way to
|
25 |
+
improve performance.
|
26 |
+
|
27 |
+
3. Keep it all simple!
|
28 |
+
|
29 |
+
## 2 - Data Sources
|
30 |
+
|
31 |
+
![](./media/image9.png)
|
32 |
+
|
33 |
+
There are many possibilities for the sources of data. You might have
|
34 |
+
images, text files, logs, or database records. In deep learning, you
|
35 |
+
need to get that data into a local filesystem disk next to a GPU. **How
|
36 |
+
you send data from the sources to training is different for each
|
37 |
+
project**.
|
38 |
+
|
39 |
+
- With images, you can simply download them from S3.
|
40 |
+
|
41 |
+
- With text files, you need to process them in some distributed way,
|
42 |
+
analyze the data, select a subset, and put that on a local
|
43 |
+
machine.
|
44 |
+
|
45 |
+
- With logs and database records, you can use a data lake to aggregate
|
46 |
+
and process the data.
|
47 |
+
|
48 |
+
![](./media/image2.png)
|
49 |
+
|
50 |
+
|
51 |
+
The basics will be the same - a filesystem, object storage, and
|
52 |
+
databases.
|
53 |
+
|
54 |
+
### Filesystem
|
55 |
+
|
56 |
+
The **filesystem** is a fundamental abstraction. Its fundamental unit is
|
57 |
+
a file - which can be text or binary, is not versioned, and is easily
|
58 |
+
overwritten. The filesystem is usually on a disk connected to your
|
59 |
+
machine - physically connected on-prem, attached in the cloud, or even
|
60 |
+
distributed.
|
61 |
+
|
62 |
+
The first thing to know about discs is that their speed and bandwidth
|
63 |
+
range - from hard discs to solid-state discs. There are two orders of
|
64 |
+
magnitude differences between the slowest (SATA SSD) and the fastest
|
65 |
+
(NVMe SSD) discs. Below are some latency numbers you should know, with
|
66 |
+
the human-scale numbers in parentheses:
|
67 |
+
|
68 |
+
![](./media/image12.png)
|
69 |
+
|
70 |
+
|
71 |
+
What formats should the data be stored on the local disc?
|
72 |
+
|
73 |
+
- If you work with binary data like images and audio, just use the
|
74 |
+
standard formats like JPEG or MP3 that it comes in.
|
75 |
+
|
76 |
+
- If you work with metadata (like labels), tabular data, or text data,
|
77 |
+
then compressed JSON or text files are just fine. Alternatively,
|
78 |
+
Parquet is a table format that is fast, compact, and widely used.
|
79 |
+
|
80 |
+
### Object Storage
|
81 |
+
|
82 |
+
The **object storage** is an API over the filesystem. Its fundamental
|
83 |
+
unit is an object, usually in a binary format (an image, a sound file, a
|
84 |
+
text file, etc.). We can build versioning or redundancy into the object
|
85 |
+
storage service. It is not as fast as the local filesystem, but it can b
|
86 |
+
fast enough within the cloud.
|
87 |
+
|
88 |
+
### Databases
|
89 |
+
|
90 |
+
**Databases** are persistent, fast, and scalable storage and retrieval
|
91 |
+
of structured data systems. A helpful mental model for this is: all the
|
92 |
+
data that the databases hold is actually in the computer\'s RAM, but the
|
93 |
+
database software ensures that if the computer gets turned off,
|
94 |
+
everything is safely persisted to disk. If too much data is in the RAM,
|
95 |
+
it scales out to disk in a performant way.
|
96 |
+
|
97 |
+
You should not store binary data in the database but the object-store
|
98 |
+
URLs instead. [Postgres](https://www.postgresql.org/) is
|
99 |
+
the right choice most of the time. It is an open-source database that
|
100 |
+
supports unstructured JSON and queries over that JSON.
|
101 |
+
[SQLite](https://www.sqlite.org/) is also perfectly good
|
102 |
+
for small projects.
|
103 |
+
|
104 |
+
Most coding projects that deal with collections of objects that
|
105 |
+
reference each other will eventually implement a crappy database. Using
|
106 |
+
a database from the beginning with likely save you time. In fact, most
|
107 |
+
MLOps tools are databases at their core (e.g.,
|
108 |
+
[W&B](https://wandb.ai/site) is a database of experiments,
|
109 |
+
[HuggingFace Hub](https://huggingface.co/models) is a
|
110 |
+
database of models, and [Label
|
111 |
+
Studio](https://labelstud.io/) is a database of labels).
|
112 |
+
|
113 |
+
![](./media/image11.png)
|
114 |
+
|
115 |
+
|
116 |
+
**Data warehouses** are stores for online analytical processing (OLAP),
|
117 |
+
as opposed to databases being the data stores for online transaction
|
118 |
+
processing (OLTP). You get data into the data warehouse through a
|
119 |
+
process called **ETL (Extract-Transform-Load)**: Given a number of data
|
120 |
+
sources, you extract the data, transform it into a uniform schema, and
|
121 |
+
load it into the data warehouse. From the warehouse, you can run
|
122 |
+
business intelligence queries. The difference between OLAP and OLTP is
|
123 |
+
that: OLAPs are column-oriented, while OLTPs are row-oriented.
|
124 |
+
|
125 |
+
![](./media/image13.png)
|
126 |
+
|
127 |
+
|
128 |
+
**Data lakes** are unstructured aggregations of data from multiple
|
129 |
+
sources. The main difference between them and data warehouses is that
|
130 |
+
data lakes use ELT (Extract-Load-Transform) process: dumping all the
|
131 |
+
data in and transforming them for specific needs later.
|
132 |
+
|
133 |
+
**The big trend is unifying both data lake and data warehouse, so that
|
134 |
+
structured data and unstructured data can live together**. The two big
|
135 |
+
platforms for this are
|
136 |
+
[Snowflake](https://www.snowflake.com/) and
|
137 |
+
[Databricks](https://www.databricks.com/). If you are
|
138 |
+
really into this stuff, "[Designing Data-Intensive
|
139 |
+
Applications](https://dataintensive.net/)" is a great book
|
140 |
+
that walks through it from first principles.
|
141 |
+
|
142 |
+
## 3 - Data Exploration
|
143 |
+
|
144 |
+
![](./media/image4.png)
|
145 |
+
|
146 |
+
To explore the data, you must speak its language, mostly SQL and,
|
147 |
+
increasingly, DataFrame. **SQL** is the standard interface for
|
148 |
+
structured data, which has existed for decades. **Pandas** is the main
|
149 |
+
DataFrame in the Python ecosystem that lets you do SQL-like things. Our
|
150 |
+
advice is to become fluent in both to interact with both transactional
|
151 |
+
databases and analytical warehouses and lakes.
|
152 |
+
|
153 |
+
[Pandas](https://pandas.pydata.org/) is the workhorse of
|
154 |
+
Python data science. You can try [DASK
|
155 |
+
DataFrame](https://examples.dask.org/dataframe.html) to
|
156 |
+
parallelize Pandas operations over cores and
|
157 |
+
[RAPIDS](https://rapids.ai/) to do Pandas operations on
|
158 |
+
GPUs.
|
159 |
+
|
160 |
+
## 4 - Data Processing
|
161 |
+
|
162 |
+
![](./media/image8.png)
|
163 |
+
|
164 |
+
Talking about data processing, it's useful to have a motivational
|
165 |
+
example. Let's say we have to train a photo popularity predictor every
|
166 |
+
night. For each photo, the training data must include:
|
167 |
+
|
168 |
+
1. Metadata (such as posting time, title, and location) that sits in
|
169 |
+
the database.
|
170 |
+
|
171 |
+
2. Some features of the user (such as how many times they logged in
|
172 |
+
today) that are needed to be computed from logs.
|
173 |
+
|
174 |
+
3. Outputs of photo classifiers (such as content and style) that are
|
175 |
+
needed to run the classifiers.
|
176 |
+
|
177 |
+
Our ultimate task is to train the photo predictor model, but we need to
|
178 |
+
output data from the database, compute the logs, and run classifiers to
|
179 |
+
output their predictions. As a result, we have **task dependencies**.
|
180 |
+
Some tasks can't start until others are finished, so finishing a task
|
181 |
+
should kick off its dependencies.
|
182 |
+
|
183 |
+
Ideally, dependencies are not always files but also programs and
|
184 |
+
databases. We should be able to spread this work over many machines and
|
185 |
+
execute many dependency graphs all at once.
|
186 |
+
|
187 |
+
![](./media/image7.png)
|
188 |
+
|
189 |
+
|
190 |
+
- [Airflow](https://airflow.apache.org/) is a standard
|
191 |
+
scheduler for Python, where it's possible to specify the DAG
|
192 |
+
(directed acyclic graph) of tasks using Python code. The operator
|
193 |
+
in that graph can be SQL operations or Python functions.
|
194 |
+
|
195 |
+
- To distribute these jobs, the workflow manager has a queue for the
|
196 |
+
tasks and manages the workers that pull from them. It will restart
|
197 |
+
jobs if they fail and ping you when the jobs are finished.
|
198 |
+
|
199 |
+
- [Prefect](https://www.prefect.io/) and
|
200 |
+
[Dagster](https://dagster.io/) are contenders to
|
201 |
+
improve and replace Airflow in the long run.
|
202 |
+
|
203 |
+
The primary advice here is not to **over-engineer things**. You can get
|
204 |
+
machines with many CPU cores and a lot of RAM nowadays. For example,
|
205 |
+
UNIX has powerful parallelism, streaming, and highly optimized tools.
|
206 |
+
|
207 |
+
## 5 - Feature Store
|
208 |
+
|
209 |
+
![](./media/image3.png)
|
210 |
+
|
211 |
+
Let's say your data processing generates artifacts you need for
|
212 |
+
training. How do you make sure that, in production, the trained model
|
213 |
+
sees the same processing taking place (which happened during training)?
|
214 |
+
How do you avoid recomputation during retraining?
|
215 |
+
|
216 |
+
**Feature stores** are a solution to this (that you may not need!).
|
217 |
+
|
218 |
+
- The first mention of feature stores came from [this Uber blog post
|
219 |
+
describing their ML platform,
|
220 |
+
Michelangelo](https://eng.uber.com/michelangelo-machine-learning-platform/).
|
221 |
+
They had an offline training process and an online prediction
|
222 |
+
process, so they built an internal feature store for both
|
223 |
+
processes to be in sync.
|
224 |
+
|
225 |
+
- [Tecton](https://www.tecton.ai/) is the leading SaaS
|
226 |
+
solution to feature store.
|
227 |
+
|
228 |
+
- [Feast](https://feast.dev/) is a common open-source
|
229 |
+
option.
|
230 |
+
|
231 |
+
- [Featureform](https://www.featureform.com/) is a
|
232 |
+
relatively new option.
|
233 |
+
|
234 |
+
## 6 - Datasets
|
235 |
+
|
236 |
+
![](./media/image1.png)
|
237 |
+
|
238 |
+
What about datasets specifically made for machine learning?
|
239 |
+
|
240 |
+
[HuggingFace
|
241 |
+
Datasets](https://huggingface.co/docs/datasets) is a great
|
242 |
+
source of machine learning-ready data. There are 8000+ datasets covering
|
243 |
+
a wide variety of tasks, like computer vision, NLP, etc. The Github-Code
|
244 |
+
dataset on HuggingFace is a good example of how these datasets are
|
245 |
+
well-suited for ML applications. Github-Code can be streamed, is in the
|
246 |
+
modern Apache Parquet format, and doesn't require you to download 1TB+
|
247 |
+
of data in order to properly work with it. Another sample dataset is
|
248 |
+
RedCaps, which consists of 12M image-text pairs from Reddit.
|
249 |
+
|
250 |
+
![](./media/image15.png)
|
251 |
+
|
252 |
+
|
253 |
+
Another interesting dataset solution for machine learning is
|
254 |
+
[Activeloop](https://www.activeloop.ai/). This tool is
|
255 |
+
particularly well equipped to work with data and explore samples without
|
256 |
+
needing to download it.
|
257 |
+
|
258 |
+
## 7 - Data Labeling
|
259 |
+
|
260 |
+
![](./media/image10.png)
|
261 |
+
|
262 |
+
### No Labeling Required
|
263 |
+
|
264 |
+
The first thing to talk about when it comes to labeling data
|
265 |
+
is...**maybe we don\'t have to label data?** There are a couple of
|
266 |
+
options here we will cover.
|
267 |
+
|
268 |
+
**Self-supervised learning** is a very important idea that allows you to
|
269 |
+
avoid painstakingly labeling all of your data. You can use parts of your
|
270 |
+
data to label other parts of your data. This is very common in NLP right
|
271 |
+
now. This is further covered in the foundation model lecture. The long
|
272 |
+
and short of it is that models can have elements of their data masked
|
273 |
+
(e.g., the end of a sentence can be omitted), and models can use earlier
|
274 |
+
parts of the data to predict the masked parts (e.g., I can learn from
|
275 |
+
the beginning of the sentence and predict the end). This can even be
|
276 |
+
used across modalities (e.g., computer vision *and* text), as [OpenAI
|
277 |
+
CLIP](https://github.com/openai/CLIP) demonstrates.
|
278 |
+
|
279 |
+
![](./media/image14.png)
|
280 |
+
|
281 |
+
|
282 |
+
**Image data augmentation** is an almost compulsory technique to adopt,
|
283 |
+
especially for vision tasks. Frameworks like
|
284 |
+
[torchvision](https://github.com/pytorch/vision) help with
|
285 |
+
this. In data augmentation, samples are modified (e.g., brightened)
|
286 |
+
without actually changing their core "meaning." Interestingly,
|
287 |
+
augmentation can actually replace labels.
|
288 |
+
[SimCLR](https://ai.googleblog.com/2020/04/advancing-self-supervised-and-semi.html)
|
289 |
+
is a model that demonstrates this - where its learning objective is to
|
290 |
+
maximize agreement between augmented views of the same image and
|
291 |
+
minimize agreement between different images.
|
292 |
+
|
293 |
+
For other forms of data, there are a couple of augmentation tricks that
|
294 |
+
can be applied. You can delete some cells in tabular data to simulate
|
295 |
+
missing data. In text, there aren't established techniques, but ideas
|
296 |
+
include changing the order of words or deleting words. In speech, you
|
297 |
+
could change the speed, insert pauses, etc.
|
298 |
+
|
299 |
+
**Synthetic data** is an underrated idea. You can synthesize data based
|
300 |
+
on your knowledge of the label. For example, you can [create
|
301 |
+
receipts](https://github.com/amoffat/metabrite-receipt-tests)
|
302 |
+
if your need is to learn how to recognize receipts from images. This can
|
303 |
+
get very sophisticated and deep, so tread carefully.
|
304 |
+
|
305 |
+
You can also get creative and ask your users to label data for you.
|
306 |
+
Google Photos, as any user of the app knows, regularly gets users to
|
307 |
+
label images about where people in photos are the same or different.
|
308 |
+
|
309 |
+
![](./media/image16.png)
|
310 |
+
|
311 |
+
|
312 |
+
This is an example of the data flywheel. Improving the data allows the
|
313 |
+
user to improve the model, which in turn makes their product experience
|
314 |
+
better.
|
315 |
+
|
316 |
+
### Labeling Solutions
|
317 |
+
|
318 |
+
These are all great options for avoiding labeling data. However,
|
319 |
+
**you'll usually have to label some data to get started.**
|
320 |
+
|
321 |
+
Labeling has standard annotation features, like bounding boxes, that
|
322 |
+
help capture information properly. Training annotators properly is more
|
323 |
+
important than the particular kind of annotation. Standardizing how
|
324 |
+
annotators approach a complex, opinable task is crucial. Labeling
|
325 |
+
guidelines can help capture the exact right label from an annotator.
|
326 |
+
Quality assurance is key to ensuring annotation and labeling are
|
327 |
+
happening properly.
|
328 |
+
|
329 |
+
There are a few options for sourcing labor for annotations:
|
330 |
+
|
331 |
+
1. Full-service data labeling vendors offer end-to-end labeling
|
332 |
+
solutions.
|
333 |
+
|
334 |
+
2. You can hire and train annotators yourself.
|
335 |
+
|
336 |
+
3. You can crowdsource annotation on a platform like Mechanical Turk.
|
337 |
+
|
338 |
+
**Full-service companies offer a great solution that abstracts the need
|
339 |
+
to build software, manage labor, and perform quality checks**. It makes
|
340 |
+
sense to use one. Before settling on one, make sure to dedicate time to
|
341 |
+
vet several. Additionally, label some gold standard data yourself to
|
342 |
+
understand the data yourself and to evaluate contenders. Take calls with
|
343 |
+
several contenders, ask for work samples on your data, and compare them
|
344 |
+
to your own labeling performance.
|
345 |
+
|
346 |
+
- [Scale AI](https://scale.com/) is the dominant data
|
347 |
+
labeling solution. It offers an API that allows you to spin up
|
348 |
+
tasks.
|
349 |
+
|
350 |
+
- Additional contenders include
|
351 |
+
[Labelbox](https://labelbox.com/) and
|
352 |
+
[Supervisely](https://supervise.ly/).
|
353 |
+
|
354 |
+
- [LabelStudio](https://labelstud.io/) is an open-source
|
355 |
+
solution for performing annotation yourself, with a companion
|
356 |
+
enterprise version. It has a great set of features that allow you
|
357 |
+
to design your interface and even plug-in models for active
|
358 |
+
learning!
|
359 |
+
|
360 |
+
- [Diffgram](https://diffgram.com/) is a competitor to
|
361 |
+
Label Studio.
|
362 |
+
|
363 |
+
- Recent offerings, like
|
364 |
+
[Aquarium](https://www.aquariumlearning.com/) and
|
365 |
+
[Scale Nucleus](https://scale.com/nucleus), have
|
366 |
+
started to help concentrate labeling efforts on parts of the
|
367 |
+
dataset that are most troublesome for models.
|
368 |
+
|
369 |
+
- [Snorkel](https://snorkel.ai/) is a dataset management
|
370 |
+
and labeling platform that uses weak supervision, which is a
|
371 |
+
similar concept. You can leverage composable rules (e.g., all
|
372 |
+
sentences that have the term "amazing" are positive sentiments)
|
373 |
+
that allow you to quickly label data faster than if you were to
|
374 |
+
treat every data point the same.
|
375 |
+
|
376 |
+
In conclusion, try to avoid labeling using techniques like
|
377 |
+
self-supervised learning. If you can't, use labeling software and
|
378 |
+
eventually outsource the work to the right vendor. If you can't afford
|
379 |
+
vendors, consider hiring part-time work rather than crowdsourcing the
|
380 |
+
work to ensure quality.
|
381 |
+
|
382 |
+
## 8 - Data Versioning
|
383 |
+
|
384 |
+
![](./media/image6.png)
|
385 |
+
|
386 |
+
Data versioning comes with a spectrum of approaches:
|
387 |
+
|
388 |
+
1. Level 0 is bad. In this case, data just lives on some file system.
|
389 |
+
In these cases, the issue arises because the models are
|
390 |
+
unversioned since their data is unversioned. Models are part code,
|
391 |
+
part data. This will lead to the consequence of being unable to
|
392 |
+
get back to a previous level of performance if need be.
|
393 |
+
|
394 |
+
2. You can prevent this event with Level 1, where you snapshot your
|
395 |
+
data each time you train. This somewhat works but is far from
|
396 |
+
ideal.
|
397 |
+
|
398 |
+
3. In Level 2, data is versioned like code, as a commingled asset with
|
399 |
+
versioned code. You can use a system like
|
400 |
+
[git-lfs](https://git-lfs.github.com/) that allows
|
401 |
+
you to store large data assets alongside code. This works really
|
402 |
+
well!
|
403 |
+
|
404 |
+
4. Level 3 involves specialized solutions for working with large data
|
405 |
+
files, but this may not be needed unless you have a very specific
|
406 |
+
need (i.e., uniquely large or compliance-heavy files).
|
407 |
+
|
408 |
+
![](./media/image5.png)
|
409 |
+
|
410 |
+
[DVC](https://dvc.org/) is a great tool for this. DVC
|
411 |
+
helps upload your data asset to a remote storage location every time you
|
412 |
+
commit changes to the data file or trigger a commit; it functions like a
|
413 |
+
fancier git-lfs. It adds features like lineage for data and model
|
414 |
+
artifacts, allowing you to recreate pipelines.
|
415 |
+
|
416 |
+
Several techniques are associated with privacy-controlled data, like
|
417 |
+
[federated
|
418 |
+
learning](https://blog.ml.cmu.edu/2019/11/12/federated-learning-challenges-methods-and-future-directions/),
|
419 |
+
differential privacy, and learning on encrypted data. These techniques
|
420 |
+
are still in research, so they aren't quite ready for an FSDL
|
421 |
+
recommendation.
|
documents/lecture-04.srt
ADDED
@@ -0,0 +1,160 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
1
|
2 |
+
00:00:00,080 --> 00:00:45,920
|
3 |
+
hey everyone welcome to week four of full stack deep learning my name is sergey i have my assistant mishka right here there she is and today we're going to be talking about data management one of the things that people don't quite get as they enter the field of machine learning is just how much of it is actually just dealing with data putting together data sets looking at data munching data it's like half of the problem and it's more than half of the job for a lot of people but at the same time it's not something that people want to do the key points of this presentation are going to be that you should do a lot of it you should spend about 10 times as much time exploring the data as you would
|
4 |
+
|
5 |
+
2
|
6 |
+
00:00:44,879 --> 00:01:36,240
|
7 |
+
like to and let it really just flow through you and usually the best way to improve performance of your model is going to be fixing your data set adding to the data set or maybe augmenting your data as you train and the last key point is keep it all simple you might be overwhelmed especially if you haven't been exposed to a lot of this stuff before there's a lot of words and terminology in different companies you don't have to do any of it and in fact you might benefit if you keep it as simple as possible that said we're going to be talking about this area of the ammo ops landscape and we'll start with the sources of data there's many possibilities for the sources of data you might have images you might have text files you might have
|
8 |
+
|
9 |
+
3
|
10 |
+
00:01:34,880 --> 00:02:22,319
|
11 |
+
maybe logs database records but in deep learning you're going to have to get that data onto some kind of local file system disk right next to a gpu so you can send data and train and how exactly you're going to do that is different for every project different for every company so maybe you're training on images and you simply download the images that's all it's going to be from s3 or maybe you have a bunch of text that you need to process in some distributed way then analyze the data select the subset of it put that on the local machine or maybe you have a nice process with a data lake that ingests logs and database records and then from that you can aggregate and process it so that's always going to be different
|
12 |
+
|
13 |
+
4
|
14 |
+
00:02:20,480 --> 00:03:13,440
|
15 |
+
but the basics are always going to be the same and they concern the file system object storage and databases so the file system is the fundamental abstraction and the fundamental unit of it is a file which can be a text file or a binary file it's not versioned and it can be easily overwritten or deleted and usually this is the file system is on a disk that's connected to your machine may be physically connected or maybe attached in the cloud or maybe it's even the distributed file system although that's less common now and we'll be talking about directly connected disks the first thing to know about disks is that the speed of them and the bandwidth of them is a quite quite a range from hard disks which are
|
16 |
+
|
17 |
+
5
|
18 |
+
00:03:11,200 --> 00:04:05,120
|
19 |
+
usually spinning magnetic disks to solid-state disks which can be connected through the sata protocol or the nvme protocol and there's two orders of magnitude difference between the slowest which is like sata spinning disks and the fastest which are nvme solid state disks and making these slides i realized okay i'm showing you that but there's also some other latency numbers you should know so there's a famous document that you might have seen on the internet originally credited to jeff dean who i think credited peter norvig from google but i added human scale numbers in parens so here's how it's going to go so if you access the l1 l2 maybe even l3 cache of the cpu it's a very limited store of data but
|
20 |
+
|
21 |
+
6
|
22 |
+
00:04:03,599 --> 00:04:59,280
|
23 |
+
it's incredibly fast it only takes a name a second to access and in human scale you might think of it as taking a second and then accessing ram is the next fastest thing and it's about 100 times slower but it's still incredibly fast and then that's just kind of finding something in ram but reading a whole megabyte sequentially from ram is now 250 microseconds which if the cache access took a second now it's taken two and a half days to read a megabyte from ram and if you're reading a megabyte from a sata connected ssd drive now you're talking about weeks so it's one and a half weeks and if you're reading a one one megabit of data from a spinning disk now we're talking about months and finally if you're sending a packet
|
24 |
+
|
25 |
+
7
|
26 |
+
00:04:57,120 --> 00:05:51,120
|
27 |
+
of data from california across the ocean to europe and then back we're talking about years on a human scale in a 150 millisecond on the absolute scale and if gpu timing info i'd love to include it here so please just send it over to full stack so what format should data be stored on the local disk if it's binary data like images or audio just use the standard formats like jpegs or mp3 that it comes in they're already compressed you can't really do better than that for the metadata like labels or tabular data or text data compress json or text files just fine or parquet is a table format that's fast it's compressed by default as it's written and read that's compact and it's very widely used now let's talk about
|
28 |
+
|
29 |
+
8
|
30 |
+
00:05:48,960 --> 00:06:46,000
|
31 |
+
object storage i think of it as an api over the file system where the fundamental unit is now an object and it's usually binary so it's maybe an image or a sound file but it could also be a text we can build in versioning or redundancy into the object storage service so instead of a file that can easily be overridden and isn't versioned we can say that an object whenever i update it it's actually just updating the version of it s3 is the fundame is the most common example and it's not as fast as local file system but it's fast enough especially if you're staying within the cloud databases are persistent fast and scalable storage and retrieval of structured data systems the metal model that i like to use is
|
32 |
+
|
33 |
+
9
|
34 |
+
00:06:45,039 --> 00:07:38,960
|
35 |
+
that all the data that the database holds is actually in the ram of the computer but the database software ensures that if the computer gets turned off everything is safely persisted to disk and if it actually is too much data for ram it scales out to disk but still in a very performant way do not store binary data in the database you should store the object store urls to the binary data in the database instead postgres is the right choice it's an open source database and most of the time it's what you should use for example it supports unstructured json and queries over that unstructured json but sqlite is perfectly good for small projects it's a self-contained binary every language has an interface to it
|
36 |
+
|
37 |
+
10
|
38 |
+
00:07:35,919 --> 00:08:24,879
|
39 |
+
even your browser has it and i want to stress that you should probably be using a database most coding projects like anything that deals with collections of objects that reference each other like maybe you're dealing with snippets of text that come from documents and documents of authors and maybe authors have companies or something like that this is very common and that code base will probably implement some kind of database and you can save yourself time and gain performance if you just use the database from the beginning and many mo ops tools specifically are at their core databases like weights and biases is a database of experiments hugging phase model hub is a database of models label studio which we'll talk about is a
|
40 |
+
|
41 |
+
11
|
42 |
+
00:08:22,960 --> 00:09:22,320
|
43 |
+
database of labels plus obviously user interfaces for generating the labels and uploading the models and stuff like that but coming from an academic background i think it's important to fully appreciate databases data warehouses are stores for online analytical processing as opposed to databases which are data stores for online transaction processing and the difference i'll cover in a second but the way you get data into data warehouses is another acronym called etl extract transform load so maybe you have a number of data sources here it's like files database otp database and some sources in the cloud you'll extract data transform it into a uniform schema and then load it into the data warehouse and then from the
|
44 |
+
|
45 |
+
12
|
46 |
+
00:09:20,160 --> 00:10:15,839
|
47 |
+
warehouse we can run business intelligence queries we know that it's archived and so what's the difference between olaps and otps like why are they different software platforms instead of just using postgres for everything so the difference is all laps for analytical processing are usually column oriented which lets you do queries what's the mean length of the text of comments over the last 30 days and it lets them be more compact because if you're storing the column you can compress that whole column in storage and oltps are usually row oriented and those are for queries select all the comments for this given user data lakes are unstructured aggregation of data from multiple sources so the main difference to data
|
48 |
+
|
49 |
+
13
|
50 |
+
00:10:12,399 --> 00:11:10,720
|
51 |
+
warehouses is that instead of extract transform load its extract load into the lake and then transform later and the trend is unifying both so both unstructured and structured data should be able to live together the big two platforms for this our snowflake and databricks and if you're interested in this stuff this is a really great book that walks through the stuff from first principles that i think you will enjoy now that we have our data stored if we would like to explore it we have to speak the language of data and the language of data is mostly sql and increasingly it's also data frames sql is the standard interface for structured data it's existed for decades it's not going away it's worth
|
52 |
+
|
53 |
+
14
|
54 |
+
00:11:08,480 --> 00:12:01,200
|
55 |
+
being able to at least read and it's well worth being able to write and for python pandas is the main data frame solution which basically lets you do sql-like things but in code without actually writing sql our advice is to become fluent in both this is how you interact with both transactional databases and analytical warehouses and lakes pandas is really the workhorse of python data science i'm sure you've seen it i just wanted to give you some tips if pandas are slow on something it's worth trying das data frames have the same interface but they paralyze operations over many cores and even over multiple machines if you set that up and something else that's worth trying if you have gpus available is rapids and
|
56 |
+
|
57 |
+
15
|
58 |
+
00:11:59,600 --> 00:12:55,040
|
59 |
+
video rapids lets you do a subset of what pandas can do but on gpus so significantly faster for a lot of types of data so talking about data processing it's useful to have a motivational example so let's say we have to train a photo popularity predictor every night and for each photo training data must include maybe metadata about the photos such as the posting time the title that the user gave the location was taken maybe some features of the user and then maybe outputs of classifiers of the photo for content maybe style so the metadata is going to be in the database the features we might have to compute from logs and the photo classifications we're going to need to run those classifiers so we have dependencies our ultimate
|
60 |
+
|
61 |
+
16
|
62 |
+
00:12:52,959 --> 00:13:50,800
|
63 |
+
task is to train the photopredictor model but to do we need to output data from database compute stuff from logs and run classifiers to output their predictions what we'd like is to define what we have to do and as things finish they should kick off their dependencies and everything should ideally not only have not only be files but programs and databases we should be able to spread this work over many machines and we're not the only ones running this job or this isn't the only job that's running on these machines how do we actually schedule multiple such jobs airflow is a pretty standard solution for python where it's possible to specify the acyclical graph of tasks using python code and the operators in that graph can be
|
64 |
+
|
65 |
+
17
|
66 |
+
00:13:48,320 --> 00:14:52,320
|
67 |
+
sql operations or actually python functions and other plugins for airflow and to distribute these jobs the workflow manager has a queue has workers that report to it will restart jobs if they fail and will ping you when the jobs are done prefect is another is another solution that's been to improve over air flow it's more modern and dagster is another contender for the airflow replacement the main piece of advice here is don't over engineer this you can get machines with many cpu cores and a ton of ram nowadays and unix itself has powerful parallelism streaming tools that are highly optimized and this is a little bit of a contrived example from a decade ago but hadoop was all the rage in 2014 it was a distributed data processing
|
68 |
+
|
69 |
+
18
|
70 |
+
00:14:51,120 --> 00:15:43,279
|
71 |
+
framework and so to run some kind of job that just aggregated a bunch of text files and computed some statistics over them the author spanned set up a hadoop job and it took 26 minutes to run but just writing a simple unix command that reads all the files grabs for the string sorts it and gives you the unique things was only 70 seconds and part of the reason is that this is all actually happening in parallel so it's making use of your cores pretty efficiently and you can make even more efficient use of them with the parallel command or here it's an argument to x-args and that's not to say that you should do everything just in unix but it is to say that just because the solution exists doesn't mean that it's right for you it
|
72 |
+
|
73 |
+
19
|
74 |
+
00:15:41,680 --> 00:16:39,120
|
75 |
+
might be the case that you can just run your stuff in a single python script on your 32 core pc feature stores you might have heard about the situation that they deal with is all the data processing we we're doing is generating artifacts that we'll need for training time so how do we ensure that in production the model that was trained sees data where the same processing took place as it as as happened during training time and also when we retrain how do we avoid recomputing things that we don't need to recompute so feature store is our solution to this that you may not need the first mention i saw feature stores were was in this blog post from uber describing their machine learning platform michelangelo
|
76 |
+
|
77 |
+
20
|
78 |
+
00:16:36,560 --> 00:17:43,520
|
79 |
+
and so they had offline training process and an online prediction process and they had feature stores for both that had to be in sync tecton is probably the leading sas solution to a feature storage for open source solutions feast is a common one and i recently came across feature form that looks pretty good as well so this is something you need check it out if it's not something you need don't feel like you have to use it in summary binary data like images sound files maybe compressed text store is object metadata about the data like labels or user activity with object should be stored in the database don't be afraid of sql but also know if you're using data frames there are accelerated solutions to them
|
80 |
+
|
81 |
+
21
|
82 |
+
00:17:41,200 --> 00:18:35,200
|
83 |
+
if dealing with stuff like logs and other sources of data that are disparate it's worth setting up a data lake to aggregate all of it in one place you should have a repeatable process to aggregate the data you need for training which might involve stuff like airflow and depending on the expense and complexity of processing a feature store could be useful at training time the data that you need should be copied over to a file system on a really fast local drive and then you should optimize gpu transfer so what about specifically data sets for machine learning training hugging phase data sets is a great hub of data there's over 8 000 data sets revision nlp speech etc so i wanted to take a look at a few
|
84 |
+
|
85 |
+
22
|
86 |
+
00:18:33,360 --> 00:19:28,320
|
87 |
+
example data sets here's one called github code it's over a terabyte of text 115 million code files the hugging face library the datasets library allows you to stream it so you don't have to download the terabyte of data in order to see some examples of it and the underlying format of the data is parquet tables so there's thousands of parquet tables each about half a gig that you can download piece by piece another example data set is called red caps pretty recently released 12 million image text pairs from reddit the images don't come with the data you need to download the images yourself make sure as you download it's multithreaded they give you example code and the underlying format then of the
|
88 |
+
|
89 |
+
23
|
90 |
+
00:19:26,080 --> 00:20:21,600
|
91 |
+
database are the images you download plus json files that have the labels or the text that came with the images so the real foundational format of the data is just the json files and there's just urls in those files to the objects that you can then download here's another example data set common voice from wikipedia 14 000 hours of speech in 87 languages the format is mp3 files plus text files with the transcription of what the person's saying there's another interesting data set solution called active loop where you can also explore data stream data to your local machine and even transform data without saving it locally it look it has a pretty cool viewer of the data so here's looking at microsoft
|
92 |
+
|
93 |
+
24
|
94 |
+
00:20:18,159 --> 00:21:14,159
|
95 |
+
coco computer vision data set and in order to get it onto your local machine it's a simple hub.load the next thing we should talk about is labeling and the first thing to talk about when it comes to labeling is maybe we don't have to label data self-supervised learning is a very important idea that you can use parts of your data to label other parts of your data so in natural language this is super common right now and we'll talk more about this in the foundational models lecture but given a sentence i can mask the last part of the sentence and to use the first part of the sentence to predict how it's going to end but i can also mask the middle of the sentence and use the whole sentence to predict the middle or i can even mask
|
96 |
+
|
97 |
+
25
|
98 |
+
00:21:12,640 --> 00:22:04,640
|
99 |
+
the beginning of the sentence and use the completion of the sentence to predict the beginning in vision you can extract patches and then predict the relationship of the patches to each other and you can even do it across modalities so openai clip which we'll talk about in a couple of weeks is trained in this contrastive way where a number of images and the number of text captions are given to the model and the learning objective is to minimize the distance between the image and the text that it came with and to maximize the distance between the image and the other texts the and when i say between the image and the text the embedding of the image and the embedding of the texts and this led to great results this is
|
100 |
+
|
101 |
+
26
|
102 |
+
00:22:02,640 --> 00:22:56,960
|
103 |
+
one of the best vision models for all kinds of tasks right now data augmentation is something that must be done for training vision models there's frameworks that provide including torch vision that provide you functions to do this it's changing the brightness of the data the contrast cropping it skewing it flipping it all kinds of transformations that basically don't change the meaning of the image but change the pixels of the image this is usually done in parallel to gpu training on the cpu and interestingly the augmentation can actually replace labels so there's a paper called simclear where the learning objective is to extract different views of an image and maximize the agreement or the similarity of the
|
104 |
+
|
105 |
+
27
|
106 |
+
00:22:55,679 --> 00:23:50,000
|
107 |
+
embeddings of the views of the same image and minimize the agreement between the views of the different images so without labels and just with data augmentation and a clever learning objective they were able to learn a model that performs very well for even supervised tasks for non-vision data augmentation if you're dealing with tabular data you could delete some of the table cells to simulate what it would be like to have missing data for text i'm not aware of like really well established techniques but you could maybe delete words replace words with synonyms change the order of things and for speech you could change the speed of the file you could insert pauses you could remove some stuff you
|
108 |
+
|
109 |
+
28
|
110 |
+
00:23:47,039 --> 00:24:36,000
|
111 |
+
can add audio effects like echo you can strip out certain frequency bands synthetic data is also something where the labels would basically be given to you for free because you use the label to generate the data so you know the label and it's still somewhat of an underrated idea that's often worth starting with we certainly do this in the lab but it can get really deep right so you can even use 3d rendering engines to generate very realistic vision data where you know exactly the label of everything in the image and this was done for receipts in this project that i link here you can also ask your users if you have users to label data for you i love how google photos does this they always ask me is this the same or different person
|
112 |
+
|
113 |
+
29
|
114 |
+
00:24:34,480 --> 00:25:29,919
|
115 |
+
and this is sometimes called the data flywheel right where i'm incentivized to answer because it helps me experience the product but it helps google train their models as well because i'm constantly generating data but usually you might have to label some data as well and data labeling always has some standard set of features there's bounding boxes key points or part of speech tagging for text there's classes there's captions what's important is training the annotators so whoever will be doing the annotation make sure that they have a complete rulebook of how they should be doing it because there's reasonable ways to interpret the task so here's some examples like if i'm only seeing the head of the fox should i label only
|
116 |
+
|
117 |
+
30
|
118 |
+
00:25:27,360 --> 00:26:21,360
|
119 |
+
the head or should i label the inferred location of the entire fox behind the rock it's unclear and quality assurance is something that's going to be key to annotation efforts because different people are just differently able to uh adhere to the rules where do you get people to annotate you can work with full-service data labeling companies you can hire your own annotators probably part-time and maybe promote the most the most able ones to quality control or you could potentially crowdsource this was popular in the past with mechanical turk the full service companies provide you the software stack the labor to do it and quality assurance and it probably makes sense to use them so how do you pick one you should at
|
120 |
+
|
121 |
+
31
|
122 |
+
00:26:18,480 --> 00:27:12,880
|
123 |
+
first label some data yourself to make sure that you understand the task and you have a gold standard that you can evaluate companies on then you should probably take calls with several of the companies or just try them out if they let you try it out online get a work sample and then look at how the work sample agrees with your own gold standard and then see how the price of the annotation compares scale dot ai is probably the dominant data labeling solution today and they take an api approach to this where it's you create tasks for them and then receive results and there are many other annotations like label box supervisedly and there's just a million more label studio is an open source solution that you can run yourself
|
124 |
+
|
125 |
+
32
|
126 |
+
00:27:11,440 --> 00:28:05,600
|
127 |
+
there's an enterprise edition for managed hosting but there's an open source edition that you can just run in the docker container on your own machine we're going to use it in the lab and it has a lot of different interfaces for text images you can create your own interfaces you can even plug in models and do active learning for annotation diff gram is something i've come across but i haven't used it personally they claim to be better than label studio and it looks pretty good an interesting feature that that i've seen some software offerings have is evaluate your current model on your data and then explore how it performed such that you can easily select subsets of data for further labeling or potentially
|
128 |
+
|
129 |
+
33
|
130 |
+
00:28:03,360 --> 00:28:56,320
|
131 |
+
find mistakes in your labeling and just understand how your model is performing on the data there's aquarium learning and scale nucleus are both solutions to this that you can check out snorkel you might have heard about and it's using this idea of weak supervision where if you have a lot of data to label some of it is probably really easy to label if you're labeling sentiment of text and if they're using the word wonderful then it's probably positive so if you can create a rule that says if the text contains the word wonderful just apply the positive label to it and you create a number of these labeling functions and then the software intelligently composes them and it could be a really fast way to to
|
132 |
+
|
133 |
+
34
|
134 |
+
00:28:54,159 --> 00:29:44,640
|
135 |
+
go through a bunch of data there's the open source project of snorkel and there's the commercial platform and i recently came across rubrics which is a very similar idea that's fully open source so in conclusion for labeling first think about how you can do self-supervised learning and avoid labeling if you need to label which you probably will need to do use labeling software and really get to know your data by labeling it yourself for a while after you've done that you can write out detailed rules and then outsource to a full service company otherwise if you don't want to outsource you can't afford it you should probably hire some part-time contractors and not try to crowdsource because crowdsourcing is a lot of
|
136 |
+
|
137 |
+
35
|
138 |
+
00:29:42,480 --> 00:30:35,120
|
139 |
+
quality assurance overhead it's a lot better to just find a good person who can trust to do the job and just have them label lastly in today's lecture we can talk about versioning i like to think of data versioning as a spectrum where the level zero is unversioned and level three is specialized data versioning solution so label level one level zero is bad okay where you have data that just lives on the file system or is on s3 or in a database and it's not version so you train a model you deploy the model and the problem is when you deploy the model what you're deploying is partly the code but partly the data that generated the weights right and if the data is not versioned then your model is in effect not
|
140 |
+
|
141 |
+
36
|
142 |
+
00:30:33,039 --> 00:31:21,679
|
143 |
+
versioned and so what will probably happen is that your performance will degrade at some point and you won't be able to get back to a previous level of high performance so you can solve this with level one each time you train you just take a snapshot of your data and you store it somewhere so this kind of works because you'll be able to get back to that performance by retraining but it'd be nicer if i could just version the data as easily as code not through some separate process and that's where we arrive at level two where we just we version data exactly in the same way as reversion code so let's say we're having a data set of audio files and text transcriptions so we're going to upload the audio files
|
144 |
+
|
145 |
+
37
|
146 |
+
00:31:19,679 --> 00:32:12,559
|
147 |
+
to s3 that's probably where they were to begin with and the labels for the files we can just store in a parquet file or a json file where it's going to be the s3 url and the transcription of it now even this metadata file can get pretty big it's a lot of text but you can use git lfs which stands for large file storage and we can just add them and the git add will version the data file exactly the same as your version your code file and this can totally work you do not need to definitely go to level three would be using a specialized solution for versioning data and this usually helps you store large files directly and it could totally make sense but just don't assume that you need it right away if you can get away with just
|
148 |
+
|
149 |
+
38
|
150 |
+
00:32:11,039 --> 00:33:08,240
|
151 |
+
get lfs that would be the fstl recommendation if it's starting to break then the leading solution for level three versioning is dvc and there's a table comparing the different versioning solutions like pachyderm but this table is biased towards dvc because it's by a solution that's github for dbc called dags hub and the way dvc works is you set it up you add your data file and then the most basic thing it does is it can upload to s3 or google cloud storage or whatever some other network storage whatever you set up every time you commit it'll upload your data somewhere and it'll make sure it's versioned so it's like a replacement for git lfs but you can go further and you can also record the lineage of
|
152 |
+
|
153 |
+
39
|
154 |
+
00:33:06,000 --> 00:34:05,519
|
155 |
+
the data so how exactly was this data generated how does this model artifact get generated so you can use dvc run to mark that and then use dvc to recreate the pipelines the last thing i want to say is we get a lot of questions at fstl about privacy sensitive data and this is still a research area there's no kind of off-the-shelf solution we can really recommend federated learning is a research area that refers to training a global model from data on local devices without the the model training process having access to the local data so it's there's a federated server that has the model and it sends what to do to local models and then it syncs back the models and differential privacy is another term
|
156 |
+
|
157 |
+
40
|
158 |
+
00:34:02,640 --> 00:34:52,200
|
159 |
+
this is for aggregating data such that even though you have the data it's aggregated in such a way that you can't identify the individual points so it should be safe to train on sensitive data because you won't actually be able to understand the individual points of it and another topic that is in the same vein is learning on encrypted data so can i have data that's encrypted that i can't decrypt but can i still do machine learning on it in a way that generates useful models and these three things are all research areas and i'm not aware of like really good off-the-shelf solutions for them unfortunately that concludes our lecture on data management thank you
|
160 |
+
|
documents/lecture-05.md
ADDED
@@ -0,0 +1,788 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
---
|
2 |
+
description: How to turn an ML model into an ML-powered product
|
3 |
+
---
|
4 |
+
|
5 |
+
# Lecture 5: Deployment
|
6 |
+
|
7 |
+
<div align="center">
|
8 |
+
<iframe width="720" height="405" src="https://www.youtube-nocookie.com/embed/W3hKjXg7fXM?list=PL1T8fO7ArWleMMI8KPJ_5D5XSlovTW_Ur" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
|
9 |
+
</div>
|
10 |
+
|
11 |
+
Lecture by [Josh Tobin](https://twitter.com/josh_tobin_).<br />
|
12 |
+
Notes by [James Le](https://twitter.com/le_james94) and [Vishnu Rachakonda](https://www.linkedin.com/in/vrachakonda/).<br />
|
13 |
+
Published September 5, 2022.
|
14 |
+
[Download slides](https://fsdl.me/2022-lecture-05-slides).
|
15 |
+
|
16 |
+
## Introduction
|
17 |
+
|
18 |
+
![](./media/image21.png)
|
19 |
+
|
20 |
+
Deploying models is a critical part of making your models good, to begin
|
21 |
+
with. When you only evaluate the model offline, it's easy to miss the
|
22 |
+
more subtle flaws that the model has, where it doesn't actually solve
|
23 |
+
the problem that your users need it to solve. Oftentimes, when we deploy
|
24 |
+
a model for the first time, only then do we really see whether that
|
25 |
+
model is actually doing a good job or not. Unfortunately, for many data
|
26 |
+
scientists and ML engineers, model deployment is an afterthought
|
27 |
+
relative to other techniques we have covered.
|
28 |
+
|
29 |
+
Much like other parts of the ML lifecycle, we'll focus on deploying a
|
30 |
+
minimum viable model as early as possible, which entails **keeping it
|
31 |
+
simple and adding complexity later**. Here is the process that this
|
32 |
+
lecture covers:
|
33 |
+
|
34 |
+
- Build a prototype
|
35 |
+
|
36 |
+
- Separate your model and UI
|
37 |
+
|
38 |
+
- Learn the tricks to scale
|
39 |
+
|
40 |
+
- Consider moving your model to the edge when you really need to go
|
41 |
+
fast
|
42 |
+
|
43 |
+
## 1 - Build a Prototype To Interact With
|
44 |
+
|
45 |
+
There are many great tools for building model prototypes.
|
46 |
+
[HuggingFace](https://huggingface.co/) has some tools
|
47 |
+
built into its playground. They have also recently acquired a startup
|
48 |
+
called [Gradio](https://gradio.app/), which makes it easy
|
49 |
+
to wrap a small UI around the model.
|
50 |
+
[Streamlit](https://streamlit.io/) is another good option
|
51 |
+
with a bit more flexibility.
|
52 |
+
|
53 |
+
![](./media/image19.png)
|
54 |
+
|
55 |
+
|
56 |
+
Here are some best practices for prototype deployment:
|
57 |
+
|
58 |
+
1. **Have a basic UI**: The goal at this stage is to play around with
|
59 |
+
the model and collect feedback from other folks. Gradio and
|
60 |
+
Streamlit are your friends here - often as easy as adding a couple
|
61 |
+
of lines of code to create a simple interface for the model.
|
62 |
+
|
63 |
+
2. **Put it behind a web URL**: An URL is easier to share. Furthermore,
|
64 |
+
you will start thinking about the tradeoffs you'll be making when
|
65 |
+
dealing with more complex deployment schemes. There are cloud
|
66 |
+
versions of [Streamlit](https://streamlit.io/cloud)
|
67 |
+
and [HuggingFace](https://huggingface.co/) for this.
|
68 |
+
|
69 |
+
3. **Do not stress it too much**: You should not take more than a day
|
70 |
+
to build a prototype.
|
71 |
+
|
72 |
+
A model prototype won't be your end solution to deploy. Firstly, a
|
73 |
+
prototype has limited frontend flexibility, so eventually, you want to
|
74 |
+
be able to build a fully custom UI for the model. Secondly, a prototype
|
75 |
+
does not scale to many concurrent requests. Once you start having users,
|
76 |
+
you'll hit the scaling limits quickly.
|
77 |
+
|
78 |
+
![](./media/image18.png)
|
79 |
+
|
80 |
+
|
81 |
+
Above is an abstract diagram of how your application might look. The
|
82 |
+
**client** is your user's device that interacts with your application.
|
83 |
+
This device can be a browser, a vehicle, or a mobile phone. This device
|
84 |
+
calls over a network to a **server**. The server talks to a **database**
|
85 |
+
(where data is stored), used to power the application.
|
86 |
+
|
87 |
+
![](./media/image6.png)
|
88 |
+
|
89 |
+
|
90 |
+
There are different ways of structuring your application to fit an ML
|
91 |
+
model inside. The prototype approach mentioned in the beginning fits
|
92 |
+
into the **model-in-service** approach - where your hosted web server
|
93 |
+
has a packaged version of the model sitting inside it. This pattern has
|
94 |
+
pros and cons.
|
95 |
+
|
96 |
+
The biggest pro is that if you are doing something complex, you get to
|
97 |
+
reuse your existing infrastructure. It does not require you as a model
|
98 |
+
developer to set up new things from scratch.
|
99 |
+
|
100 |
+
However, there is a number of pronounced cons:
|
101 |
+
|
102 |
+
1. **Your web server may be written in a different language**, so
|
103 |
+
getting your model into that language can be difficult.
|
104 |
+
|
105 |
+
2. **Models may change more frequently than server code** (especially
|
106 |
+
early in the lifecycle of building your model). If you have a
|
107 |
+
well-established application and a nascent model, you do not want
|
108 |
+
to redeploy the entire application every time that you make an
|
109 |
+
update to the model (sometimes multiple updates per day).
|
110 |
+
|
111 |
+
3. If you have a large model to run inference on, you'll have to load
|
112 |
+
that model on your web server. **Large models can eat into the
|
113 |
+
resources for your web server**. That might affect the user
|
114 |
+
experience for people using that web server, even if they are not
|
115 |
+
interacting with the model.
|
116 |
+
|
117 |
+
4. **Server hardware is generally not optimized for ML workloads**. In
|
118 |
+
particular, you rarely will have a GPU on these devices.
|
119 |
+
|
120 |
+
5. **Your model and application may have different scaling
|
121 |
+
properties**, so you might want to be able to scale them
|
122 |
+
differently.
|
123 |
+
|
124 |
+
## 2 - Separate Your Model From Your UI
|
125 |
+
|
126 |
+
### 2.1 - Batch Prediction
|
127 |
+
|
128 |
+
![](./media/image8.png)
|
129 |
+
|
130 |
+
|
131 |
+
The first pattern to pull your model from your UI is called **batch
|
132 |
+
prediction**. You get new data in and run your model on each data point.
|
133 |
+
Then, you save the results of each model inference into a database. This
|
134 |
+
can work well under some circumstances. For example, if there are not a
|
135 |
+
lot of potential inputs to the model, you can re-run your model on some
|
136 |
+
frequency (every hour, every day, or every week). You can have
|
137 |
+
reasonably fresh predictions to return to those users that are stored in
|
138 |
+
your database. Examples of these problems include the early stages of
|
139 |
+
building recommender systems and internal-facing tools like marketing
|
140 |
+
automation.
|
141 |
+
|
142 |
+
To run models on a schedule, you can leverage the data processing and
|
143 |
+
workflow tools mentioned in our previous lecture on data management. You
|
144 |
+
need to re-run data processing, load the model, run predictions, and
|
145 |
+
store those predictions in your database. This is exactly a **Directed
|
146 |
+
Acyclic Graph workflow of data operations** that tools like
|
147 |
+
[Dagster](https://dagster.io/),
|
148 |
+
[Airflow](https://airflow.apache.org/), or
|
149 |
+
[Prefect](https://www.prefect.io/) are designed to solve.
|
150 |
+
It's worth noting that there are also tools like
|
151 |
+
[Metaflow](https://metaflow.org/) that are designed more
|
152 |
+
for ML or data science use cases that might be potentially even an
|
153 |
+
easier way to get started.
|
154 |
+
|
155 |
+
Let's visit the pros and cons of this batch prediction pattern. Starting
|
156 |
+
with the pros:
|
157 |
+
|
158 |
+
1. Batch prediction is **simple to implement** since it reuses existing
|
159 |
+
batch processing tools that you may already be using for training
|
160 |
+
your model.
|
161 |
+
|
162 |
+
2. It **scales very easily** because databases have been engineered for
|
163 |
+
decades for such a purpose.
|
164 |
+
|
165 |
+
3. Even though it looks like a simple pattern, it has been **used in
|
166 |
+
production by large-scale production systems for years**. This is
|
167 |
+
a tried-and-true pattern you can run and be confident that it'll
|
168 |
+
work well.
|
169 |
+
|
170 |
+
4. It is **fast to retrieve the prediction** since the database is
|
171 |
+
designed for the end application to interact with.
|
172 |
+
|
173 |
+
Switching to the cons:
|
174 |
+
|
175 |
+
1. Batch prediction **doesn't scale to complex input types**. For
|
176 |
+
instance, if the universe of inputs is too large to enumerate
|
177 |
+
every single time you need to update your predictions, this won't
|
178 |
+
work.
|
179 |
+
|
180 |
+
2. **Users won't be getting the most up-to-date predictions from your
|
181 |
+
model**. If the feature that goes into your model changes every
|
182 |
+
hour, minute, or subsecond, but you only run your batch prediction
|
183 |
+
job every day, the predictions your users see might be slightly
|
184 |
+
stale.
|
185 |
+
|
186 |
+
3. **Models frequently become "stale."** If your batch jobs fail for
|
187 |
+
some reason, it can be hard to detect these problems.
|
188 |
+
|
189 |
+
### 2.2 - Model-as-Service
|
190 |
+
|
191 |
+
The second pattern is called **model-as-service**: we run the model
|
192 |
+
online as its own service. The service is going to interact with the
|
193 |
+
backend or the client itself by making requests to the model service and
|
194 |
+
receiving responses back.
|
195 |
+
|
196 |
+
![](./media/image16.png)
|
197 |
+
|
198 |
+
|
199 |
+
The pros of this pattern are:
|
200 |
+
|
201 |
+
1. **Dependability** - model bugs are less likely to crash the web
|
202 |
+
application.
|
203 |
+
|
204 |
+
2. **Scalability** - you can choose optimal hardware for the model and
|
205 |
+
scale it appropriately.
|
206 |
+
|
207 |
+
3. **Flexibility** - you can easily reuse a model across multiple
|
208 |
+
applications.
|
209 |
+
|
210 |
+
The cons of this pattern are:
|
211 |
+
|
212 |
+
1. Since this is a separate service, you add a network call when your
|
213 |
+
server or client interacts with the model. That can **add
|
214 |
+
latency** to your application.
|
215 |
+
|
216 |
+
2. It also **adds infrastructural complexity** because you are on the
|
217 |
+
hook for hosting and managing a separate service.
|
218 |
+
|
219 |
+
Even with these cons, **the model-as-service pattern is still a sweet
|
220 |
+
spot for most ML-powered products** since you really need to be able to
|
221 |
+
scale independently of the application in most complex use cases. We'll
|
222 |
+
walk through the basic components of building your model service -
|
223 |
+
including REST APIs, dependency management, performance optimization,
|
224 |
+
horizontal scaling, rollout, and managed options.
|
225 |
+
|
226 |
+
#### REST APIs
|
227 |
+
|
228 |
+
**Rest APIs** serve predictions in response to canonically-formatted
|
229 |
+
HTTP requests. There are other alternative protocols to interact with a
|
230 |
+
service that you host on your infrastructures, such as
|
231 |
+
[GRPC](https://grpc.io/) (used in TensorFlow Serving) and
|
232 |
+
[GraphQL](https://graphql.org/) (common in web development
|
233 |
+
but not terribly relevant to model services).
|
234 |
+
|
235 |
+
![](./media/image3.png)
|
236 |
+
|
237 |
+
|
238 |
+
Unfortunately, there is currently no standard for formatting requests
|
239 |
+
and responses for REST API calls.
|
240 |
+
|
241 |
+
1. [Google Cloud](https://cloud.google.com/) expects a
|
242 |
+
batch of inputs structured as a list called "instances" (with keys
|
243 |
+
and values).
|
244 |
+
|
245 |
+
2. [Azure](https://azure.microsoft.com/en-us/) expects a
|
246 |
+
list of things called "data", where the data structure itself
|
247 |
+
depends on what your model architecture is.
|
248 |
+
|
249 |
+
3. [AWS Sagemaker](https://aws.amazon.com/sagemaker/)
|
250 |
+
expects instances that are formatted differently than they are in
|
251 |
+
Google Cloud.
|
252 |
+
|
253 |
+
Our aspiration for the future is to move toward **a standard interface
|
254 |
+
for making REST API calls for ML services**. Since the types of data
|
255 |
+
that you might send to these services are constrained, we should be able
|
256 |
+
to develop a standard as an industry.
|
257 |
+
|
258 |
+
#### Dependency Management
|
259 |
+
|
260 |
+
Model predictions depend on **code**, **model weights**, and
|
261 |
+
**dependencies**. In order for your model to make a correct prediction,
|
262 |
+
all of these dependencies need to be present on your web server.
|
263 |
+
Unfortunately, dependencies are a notorious cause of trouble as it is
|
264 |
+
hard to ensure consistency between your development environment and your
|
265 |
+
server. It is also hard to update since even changing a TensorFlow
|
266 |
+
version can change your model.
|
267 |
+
|
268 |
+
At a high level, there are two strategies for managing dependencies:
|
269 |
+
|
270 |
+
1. **Constrain the dependencies for your model** by saving your model
|
271 |
+
in an agnostic format that can be run anywhere.
|
272 |
+
|
273 |
+
2. **Use containers** to constrain the entire inference program.
|
274 |
+
|
275 |
+
![](./media/image11.png)
|
276 |
+
|
277 |
+
|
278 |
+
##### Constraining Model Dependencies
|
279 |
+
|
280 |
+
The primary way to constrain the dependencies of just your model is
|
281 |
+
through a library called [ONNX](https://onnx.ai/) - the
|
282 |
+
Open Neural Network Exchange. The goal of ONNX is to be **an
|
283 |
+
interoperability standard for ML models**. The promise is that you can
|
284 |
+
define a neural network in any language and run it consistently
|
285 |
+
anywhere. The reality is that since the underlying libraries used to
|
286 |
+
build these models change quickly, there are often bugs in the
|
287 |
+
translation layer, which creates even more problems to solve for you.
|
288 |
+
Additionally, ONNX doesn't deal with non-library code such as feature
|
289 |
+
transformations.
|
290 |
+
|
291 |
+
##### Containers
|
292 |
+
|
293 |
+
To understand how to manage dependencies with containers, we need to
|
294 |
+
understand [the differences between Docker and Virtual
|
295 |
+
Machines](https://medium.freecodecamp.org/a-beginner-friendly-introduction-to-containers-vms-and-docker-79a9e3e119b),
|
296 |
+
how Docker images are built via Docker files and constructed via layers,
|
297 |
+
the ecosystem around Docker, and specific wrappers around Docker that
|
298 |
+
you can use for ML.
|
299 |
+
|
300 |
+
![](./media/image10.png)
|
301 |
+
|
302 |
+
|
303 |
+
In a **virtual machine**, you package up the entire operating system
|
304 |
+
(OS) as well as the libraries and applications that are built on top of
|
305 |
+
that OS. A virtual machine tends to be very heavyweight because the OS
|
306 |
+
itself has a lot of code and is expensive to run. A **container** such
|
307 |
+
as Docker removes that need by packaging the libraries and applications
|
308 |
+
together. A Docker engine that runs on top of your OS knows how to
|
309 |
+
virtualize the OS and run the libraries/applications.
|
310 |
+
|
311 |
+
By virtue of being **lightweight**, Docker is used differently than how
|
312 |
+
Virtual Machines were used. A common pattern is to spin up [a new
|
313 |
+
Docker container](https://www.docker.com/what-container)
|
314 |
+
for every discrete task. For example, a web application might have four
|
315 |
+
containers: a web server, a database, a job queue, and a worker. These
|
316 |
+
containers are run together as part of an orchestration system.
|
317 |
+
|
318 |
+
![](./media/image15.png)
|
319 |
+
|
320 |
+
|
321 |
+
Docker containers are created from [Docker
|
322 |
+
files](https://docs.docker.com/engine/reference/builder/).
|
323 |
+
Each Docker file runs a sequence of steps to define the environment
|
324 |
+
where you will run your code. Docker also allows you to build, store,
|
325 |
+
and pull Docker containers from a Docker Hub that is hosted on some
|
326 |
+
other servers or your cloud. You can experiment with a code environment
|
327 |
+
that is on your local machine but will be identical to the environment
|
328 |
+
you deploy on your server.
|
329 |
+
|
330 |
+
Docker is separated into [three different
|
331 |
+
components](https://docs.docker.com/engine/docker-overview):
|
332 |
+
|
333 |
+
1. The **client** is where you'll be running on your laptop to build an
|
334 |
+
image from a Dockerfile that you define locally using some
|
335 |
+
commands.
|
336 |
+
|
337 |
+
2. These commands are executed by a **Docker Host**, which can run on
|
338 |
+
either your laptop or your server (with more storage or more
|
339 |
+
performance).
|
340 |
+
|
341 |
+
3. That Docker Host talks to a **registry** - which is where all the
|
342 |
+
containers you might want to access are stored.
|
343 |
+
|
344 |
+
![](./media/image1.png)
|
345 |
+
|
346 |
+
|
347 |
+
With this separation of concerns, you are not limited by the amount of
|
348 |
+
compute and storage you have on your laptop to build, pull, and run
|
349 |
+
Docker images. You are also not limited by what you have access to on
|
350 |
+
your Docker Host to decide which images to run.
|
351 |
+
|
352 |
+
In fact, there is a powerful ecosystem of Docker images that are
|
353 |
+
available on different public Docker Hubs. You can easily find these
|
354 |
+
images, modify them, and contribute them back to the Hubs. It's easy to
|
355 |
+
store private images in the same place as well. Because of this
|
356 |
+
community and the lightweight nature of Docker, it has become
|
357 |
+
[incredibly popular in recent
|
358 |
+
years](https://www.docker.com/what-container#/package_software)
|
359 |
+
and is ubiquitous at this point.
|
360 |
+
|
361 |
+
There is a bit of a learning curve to Docker. For ML, there are a few
|
362 |
+
open-source packages designed to simplify this:
|
363 |
+
[Cog](https://github.com/replicate/cog),
|
364 |
+
[BentoML](https://github.com/bentoml/BentoML), and
|
365 |
+
[Truss](https://github.com/trussworks). They are built by
|
366 |
+
different model hosting providers that are designed to work well with
|
367 |
+
their model hosting service but also just package your model and all of
|
368 |
+
its dependencies in a standard Docker container format.
|
369 |
+
|
370 |
+
![](./media/image12.png)
|
371 |
+
|
372 |
+
These packages have **two primary components**: The first one is a
|
373 |
+
standard way of defining your prediction service. The second one is a
|
374 |
+
YAML file that defines the other dependencies and package versions that
|
375 |
+
will go into the Docker container running on your laptop or remotely.
|
376 |
+
|
377 |
+
If you want to have the advantages of using Docker for making your ML
|
378 |
+
models reproducible but do not want to go through the learning curve of
|
379 |
+
learning Docker, it's worth checking out these three libraries.
|
380 |
+
|
381 |
+
#### Performance Optimization
|
382 |
+
|
383 |
+
!!! info "What about performance _monitoring_?"
|
384 |
+
In this section, we focus on ways to improve the performance of your
|
385 |
+
models, but we spend less time on how exactly that performance is monitored,
|
386 |
+
which is a challenge in its own right.
|
387 |
+
|
388 |
+
Luckily, one of the
|
389 |
+
[student projects](../project-showcase/) for the 2022 cohort,
|
390 |
+
[Full Stack Stable Diffusion](../project-showcase/#full-stack-stable-diffusion),
|
391 |
+
took up that challenge and combined
|
392 |
+
[NVIDIA's Triton Inference Server](https://developer.nvidia.com/nvidia-triton-inference-server),
|
393 |
+
the [Prometheus monitoring tool](https://en.wikipedia.org/wiki/Prometheus_(software)),
|
394 |
+
and
|
395 |
+
the [Grafana analytics dashboarding tool](https://en.wikipedia.org/wiki/Grafana)
|
396 |
+
to monitor a robust, scalable, and observable deployment of Stable Diffusion models.
|
397 |
+
|
398 |
+
Check out the repo on GitHub
|
399 |
+
[here](https://github.com/okanlv/fsdl-full-stack-stable-diffusion-2022)
|
400 |
+
if you want to see a worked example of a fully-monitored DL-powered application.
|
401 |
+
|
402 |
+
To make model inference on your machine more efficient, we need to
|
403 |
+
discuss GPU, concurrency, model distillation, quantization, caching,
|
404 |
+
batching, sharing the GPU, and libraries that automate these tasks for
|
405 |
+
you.
|
406 |
+
|
407 |
+
##### GPU or no GPU?
|
408 |
+
|
409 |
+
There are some advantages to hosting your model on a GPU:
|
410 |
+
|
411 |
+
1. It's probably the same hardware you train your model on, to begin
|
412 |
+
with. That can eliminate any lost-in-translation issues.
|
413 |
+
|
414 |
+
2. As your model gets big and your techniques get advanced, your
|
415 |
+
traffic gets large. GPUs provide high throughput to deal with
|
416 |
+
that.
|
417 |
+
|
418 |
+
However, GPUs introduce a lot of complexity:
|
419 |
+
|
420 |
+
1. They are more complex to set up.
|
421 |
+
|
422 |
+
2. They are more expensive.
|
423 |
+
|
424 |
+
As a result, **just because your model is trained on a GPU does not mean
|
425 |
+
that you need to actually host it on a GPU in order for it to work**. In
|
426 |
+
the early version of your model, hosting it on a CPU should suffice. In
|
427 |
+
fact, it's possible to get high throughput from CPU inference at a low
|
428 |
+
cost by using some other techniques.
|
429 |
+
|
430 |
+
##### Concurrency
|
431 |
+
|
432 |
+
With **concurrency**, multiple copies of the model run in parallel on
|
433 |
+
different CPUs or cores on a single host machine. To do this, you need
|
434 |
+
to be careful about thread tuning. There's [a great Roblox
|
435 |
+
presentation](https://www.youtube.com/watch?v=Nw77sEAn_Js)
|
436 |
+
on how they scaled BERT to serve a billion daily requests, just using
|
437 |
+
CPUs.
|
438 |
+
|
439 |
+
##### Model Distillation
|
440 |
+
|
441 |
+
With **model distillation**, once you have a large model that you've
|
442 |
+
trained, you can train a smaller model that imitates the behavior of
|
443 |
+
your larger one. This entails taking the knowledge that your larger
|
444 |
+
model learned and compressing that knowledge into a much smaller model
|
445 |
+
that you may not have trained to the same degree of performance from
|
446 |
+
scratch. There are several model distillation techniques pointed out in
|
447 |
+
[this blog
|
448 |
+
post](https://heartbeat.comet.ml/research-guide-model-distillation-techniques-for-deep-learning-4a100801c0eb).
|
449 |
+
They can be finicky to do by yourself and are infrequently used in
|
450 |
+
practice. An exception is distilled versions of popular models (such as
|
451 |
+
[DistilBERT](https://huggingface.co/docs/transformers/model_doc/distilbert)).
|
452 |
+
|
453 |
+
##### Quantization
|
454 |
+
|
455 |
+
With **quantization**, you execute some or potentially all of the
|
456 |
+
operations in your model in a lower fidelity representation of the
|
457 |
+
numbers that you are doing the math. These representations can be 16-bit
|
458 |
+
floating point numbers or 8-bit integers. This introduces some tradeoffs
|
459 |
+
with accuracy, but it's worth making these tradeoffs because the
|
460 |
+
accuracy you lose is limited relative to the performance you gain.
|
461 |
+
|
462 |
+
The recommended path is to use built-in quantization methods in
|
463 |
+
[PyTorch](https://pytorch.org/blog/introduction-to-quantization-on-pytorch/)
|
464 |
+
and TensorFlow. More specifically, [HuggingFace
|
465 |
+
Optimum](https://huggingface.co/docs/optimum) is a good
|
466 |
+
choice if you have already been using HuggingFace's pre-trained models.
|
467 |
+
You can also run **quantization-aware training**, which often results in
|
468 |
+
higher accuracy.
|
469 |
+
|
470 |
+
![](./media/image5.png)
|
471 |
+
|
472 |
+
|
473 |
+
##### Caching
|
474 |
+
|
475 |
+
With **caching**, you realize that for some ML models, some inputs are
|
476 |
+
more common than others. Instead of always calling the model every time
|
477 |
+
a user makes a request, let's store the common requests in a cache.
|
478 |
+
Then, let's check that cache before running an expensive operation.
|
479 |
+
Caching techniques can get fancy, but the basic way of doing this is to
|
480 |
+
use [functools library in
|
481 |
+
Python](https://docs.python.org/3/library/functools.html).
|
482 |
+
|
483 |
+
![](./media/image2.png)
|
484 |
+
|
485 |
+
|
486 |
+
##### Batching
|
487 |
+
|
488 |
+
With **batching**, you take advantage of the fact that ML models often
|
489 |
+
achieve a higher throughput when doing prediction in parallel,
|
490 |
+
especially in a GPU. To accomplish this, you need to gather predictions
|
491 |
+
until you have a batch, run those predictions, and return them to your
|
492 |
+
user. You want to tune the batch size that deals optimally with the
|
493 |
+
latency-throughput tradeoff. You also need to have a way to shortcut the
|
494 |
+
process if latency becomes too long. Batching is complicated to
|
495 |
+
implement, so you probably do not want to implement this yourself.
|
496 |
+
|
497 |
+
##### Sharing the GPU
|
498 |
+
|
499 |
+
Your model may not take up all of the GPU memory with your inference
|
500 |
+
batch size. **Why don't you run multiple models on the same GPU?** This
|
501 |
+
is a place where you want to use a model serving solution that supports
|
502 |
+
GPU sharing out of the box.
|
503 |
+
|
504 |
+
##### Libraries
|
505 |
+
|
506 |
+
There are offerings from TensorFlow, PyTorch, and third-party tools from
|
507 |
+
NVIDIA and Anyscale. NVIDIA's choice is probably the most powerful but
|
508 |
+
can be difficult to get started with. Starting with Anyscale's [Ray
|
509 |
+
Serve](https://docs.ray.io/en/latest/serve/index.html) may
|
510 |
+
be an easier way to get started.
|
511 |
+
|
512 |
+
![](./media/image20.png)
|
513 |
+
|
514 |
+
|
515 |
+
#### Horizontal Scaling
|
516 |
+
|
517 |
+
If you're going to scale up to a large number of users interacting with
|
518 |
+
your model, it's not going to be enough to get the most efficiency out
|
519 |
+
of one server. At some point, you'll need to scale horizontally to have
|
520 |
+
traffic going to multiple copies of your model running on different
|
521 |
+
servers. This is called **horizontal scaling**. This technique involves
|
522 |
+
taking traffic that would usually go to a single machine and splits
|
523 |
+
across multiple machines.
|
524 |
+
|
525 |
+
Each machine has a copy of the service, and a tool called a load
|
526 |
+
balancer distributes traffic to each machine. In practice, there are two
|
527 |
+
ways to do this: with either **container orchestration** (e.g.
|
528 |
+
Kubernetes) or **serverless** (e.g. AWS Lambda).
|
529 |
+
|
530 |
+
##### Container Orchestration
|
531 |
+
|
532 |
+
In container orchestration, we use
|
533 |
+
[Kubernetes](https://kubernetes.io/) to help manage
|
534 |
+
containerized applications (in Docker containers, for example) and run
|
535 |
+
them across machines.
|
536 |
+
|
537 |
+
![](./media/image14.png)
|
538 |
+
|
539 |
+
|
540 |
+
Kubernetes is quite interesting, but it's probably overkilled to learn
|
541 |
+
too much about it if your only goal is to deploy machine learning
|
542 |
+
models. There are a number of frameworks that make it easiest to deploy
|
543 |
+
ML models with Kubernetes, including
|
544 |
+
[Kubeflow](https://www.kubeflow.org/),
|
545 |
+
[Seldon](https://www.seldon.io/), etc.
|
546 |
+
|
547 |
+
##### Serverless
|
548 |
+
|
549 |
+
If Kubernetes isn't the path for you (e.g. you don't want to have to
|
550 |
+
worry about infrastructure at all), serverless is another option for
|
551 |
+
deploying models. In this paradigm, app code and dependencies are
|
552 |
+
packaged into .zip files or Docker containers with a single entry point
|
553 |
+
function, which is a single function (e.g. *model.predict()*) that will
|
554 |
+
be run repeatedly. This package is then deployed to a service like [AWS
|
555 |
+
Lambda](https://aws.amazon.com/lambda/), which almost
|
556 |
+
totally manages the infrastructure required to run the code based on the
|
557 |
+
input. Scaling to thousands of requests and across multiple machines is
|
558 |
+
taken care of by these services. In return, you pay for the compute time
|
559 |
+
that you consume.
|
560 |
+
|
561 |
+
Since model services tend to run discretely and not continuously (like a
|
562 |
+
web server), serverless is a great fit for machine learning deployment.
|
563 |
+
|
564 |
+
![](./media/image7.png)
|
565 |
+
|
566 |
+
|
567 |
+
**Start with serverless!** It's well worth the time saved in managing
|
568 |
+
infrastructure and dealing with associated challenges. There are still
|
569 |
+
some problems you should be aware of though.
|
570 |
+
|
571 |
+
1. First, the size of the actual deployment package that can be sent to
|
572 |
+
a serverless service tends to be limited, which makes large models
|
573 |
+
impossible to run.
|
574 |
+
|
575 |
+
2. Second, there is also a cold start problem. If there is no traffic
|
576 |
+
being sent to the service in question, the service will "wind
|
577 |
+
down" to zero compute use, at which point it takes time to start
|
578 |
+
again. This lag in starting up upon the first request to the
|
579 |
+
serverless service is known as the "cold start" time. This can
|
580 |
+
take seconds or even minutes.
|
581 |
+
|
582 |
+
3. Third, it can be hard to actually build solid software engineering
|
583 |
+
concepts, like pipelines, with serverless. Pipelines enable rapid
|
584 |
+
iteration, while serverless offerings often do not have the tools
|
585 |
+
to support rapid, automated changes to code of the kind pipelines
|
586 |
+
are designed to do.
|
587 |
+
|
588 |
+
4. Fourth, state management and deployment tooling are related
|
589 |
+
challenges here.
|
590 |
+
|
591 |
+
5. Finally, most serverless functions are CPU only and have limited
|
592 |
+
execution time. If you need GPUs for inference, serverless might
|
593 |
+
not be for you quite yet. There are, however, new offerings like
|
594 |
+
[Banana](https://www.banana.dev/) and
|
595 |
+
[Pipeline](https://www.pipeline.ai/) that are
|
596 |
+
seeking to solve this problem of serverless GPU inference!
|
597 |
+
|
598 |
+
#### Model Rollouts
|
599 |
+
|
600 |
+
If serving is how you turn a model into something that can respond to
|
601 |
+
requests, rollouts are how you manage and update these services. To be
|
602 |
+
able to make updates effectively, you should be able to do the
|
603 |
+
following:
|
604 |
+
|
605 |
+
1. **Roll out gradually**: You may want to incrementally send traffic
|
606 |
+
to a new model rather than the entirety.
|
607 |
+
|
608 |
+
2. **Roll back instantly**: You may want to immediately pull back a
|
609 |
+
model that is performing poorly.
|
610 |
+
|
611 |
+
3. **Split traffic between versions**: You may want to test differences
|
612 |
+
between models and therefore send some traffic to each.
|
613 |
+
|
614 |
+
4. **Deploy pipelines of models**: Finally, you may want to have entire
|
615 |
+
pipeline flows that ensure the delivery of a model
|
616 |
+
|
617 |
+
Building these capabilities in a reasonably challenging infrastructure
|
618 |
+
problem that is beyond the scope of this course. In short, managed
|
619 |
+
services are a good option for this that we'll now discuss!
|
620 |
+
|
621 |
+
#### Managed Options
|
622 |
+
|
623 |
+
All of the major cloud providers offer their managed service options for
|
624 |
+
model deployment. There are a number of startups offering solutions as
|
625 |
+
well, like BentoML or Banana.
|
626 |
+
|
627 |
+
![](./media/image9.png)
|
628 |
+
|
629 |
+
The most popular managed service is [AWS
|
630 |
+
Sagemaker](https://aws.amazon.com/sagemaker/). Working with
|
631 |
+
Sagemaker is easier if your model is already in a common format like a
|
632 |
+
Huggingface class or a SciKit-Learn model. Sagemaker has convenient
|
633 |
+
wrappers for such scenarios. Sagemaker once had a reputation for being a
|
634 |
+
difficult service to work with, but this is much less the case for the
|
635 |
+
clear-cut use case of model inference. Sagemaker, however, does have
|
636 |
+
real drawbacks around ease of use for custom models and around cost. In
|
637 |
+
fact, Sagemaker instances tend to be 50-100% more expensive than EC2.
|
638 |
+
|
639 |
+
### 2.3 - Takeaways
|
640 |
+
|
641 |
+
To summarize this section, remember the following:
|
642 |
+
|
643 |
+
1. You *probably* don't need GPU inference, which is hard to access and
|
644 |
+
maintain. Scaling CPUs horizontally or using serverless can
|
645 |
+
compensate.
|
646 |
+
|
647 |
+
2. Serverless is probably the way to go!
|
648 |
+
|
649 |
+
3. Sagemaker is a great way to get started for the AWS user, but it can
|
650 |
+
get quite expensive.
|
651 |
+
|
652 |
+
4. Don't try to do your own GPU inference; use existing tools like
|
653 |
+
TFServing or Triton to save time.
|
654 |
+
|
655 |
+
5. Watch out for new startups focused on GPU inference.
|
656 |
+
|
657 |
+
## 3 - Move to the Edge?
|
658 |
+
|
659 |
+
Let's now consider the case of moving models out of web service and all
|
660 |
+
the way to the "edge", or wholly on-device. Some reasons you may need to
|
661 |
+
consider this include a lack of reliable internet access for users or
|
662 |
+
strict data security requirements.
|
663 |
+
|
664 |
+
If such hard and fast requirements aren't in place, you'll need to take
|
665 |
+
into account the tradeoff between accuracy and latency and how this can
|
666 |
+
affect the end-user experience. Put simply, **if you have exhausted all
|
667 |
+
options to reduce model prediction time (a component of latency),
|
668 |
+
consider edge deployment**.
|
669 |
+
|
670 |
+
![](./media/image4.png)
|
671 |
+
|
672 |
+
|
673 |
+
Edge deployment adds considerable complexity, so it should be considered
|
674 |
+
carefully before being selected as an option. In edge prediction, model
|
675 |
+
weights are directly loaded on our client device after being sent via a
|
676 |
+
server (shown above), and the model is loaded and interacted with
|
677 |
+
directly on the device.
|
678 |
+
|
679 |
+
This approach has compelling pros and cons:
|
680 |
+
|
681 |
+
1. Some pros to particularly call out are the latency advantages that
|
682 |
+
come without the need for a network and the ability to scale for
|
683 |
+
"free," or the simple fact that you don't need to worry about the
|
684 |
+
challenges of running a web service if all inference is done
|
685 |
+
locally.
|
686 |
+
|
687 |
+
2. Some specific cons to call out are the often limited hardware and
|
688 |
+
software resources available to run machine learning models on
|
689 |
+
edge, as well as the challenge of updating models since users
|
690 |
+
control this process more than you do as the model author.
|
691 |
+
|
692 |
+
### 3.1 - Frameworks
|
693 |
+
|
694 |
+
Picking the right framework to do edge deployment depends both on how
|
695 |
+
you train your model and what the target device you want to deploy it on
|
696 |
+
is.
|
697 |
+
|
698 |
+
- [TensorRT](https://developer.nvidia.com/tensorrt): If
|
699 |
+
you're deploying to NVIDIA, this is the choice to go with.
|
700 |
+
|
701 |
+
- [MLKit](https://developers.google.com/ml-kit) and
|
702 |
+
[CoreML](https://developer.apple.com/documentation/coreml)**:**
|
703 |
+
For phone-based deployment on either Android **or** iPhone, go
|
704 |
+
with MLKit for the former and CoreML for the latter.
|
705 |
+
|
706 |
+
- [PyTorch Mobile](https://pytorch.org/mobile)**:** For
|
707 |
+
compatibility with both iOS and Android, use PyTorch Mobile.
|
708 |
+
|
709 |
+
- [TFLite](https://www.tensorflow.org/lite): A great
|
710 |
+
choice for using TensorFlow in a variety of settings, not just on
|
711 |
+
a phone or a common device.
|
712 |
+
|
713 |
+
- [TensorFlow JS](https://www.tensorflow.org/js)**:**
|
714 |
+
The preferred framework for deploying machine learning in the
|
715 |
+
browser.
|
716 |
+
|
717 |
+
- [Apache TVM](https://tvm.apache.org/): A library
|
718 |
+
agnostic, target device agnostic option. This is the choice for
|
719 |
+
anyone trying to deploy to as diverse a number of settings as
|
720 |
+
possible.
|
721 |
+
|
722 |
+
Keep paying attention to this space! There are a lot of startups like
|
723 |
+
[MLIR](https://mlir.llvm.org/),
|
724 |
+
[OctoML](https://octoml.ai/),
|
725 |
+
[TinyML](https://www.tinyml.org/), and
|
726 |
+
[Modular](https://www.modular.com/) that are aiming to
|
727 |
+
solve some of these problems.
|
728 |
+
|
729 |
+
### 3.2 - Efficiency
|
730 |
+
|
731 |
+
No software can help run edge-deployed models that are simply too large;
|
732 |
+
**model efficiency** is important for edge deployment! We previously
|
733 |
+
discussed quantization and distillation as options for model efficiency.
|
734 |
+
However, there are also network architectures specifically designed to
|
735 |
+
work better in edge settings like
|
736 |
+
[MobileNets](https://medium.com/@yu4u/why-mobilenet-and-its-variants-e-g-shufflenet-are-fast-1c7048b9618d).
|
737 |
+
MobileNets replace the more expensive computations typical of server-run
|
738 |
+
models with simpler computations and achieve acceptable performance
|
739 |
+
oftentimes.
|
740 |
+
|
741 |
+
![](./media/image17.png)
|
742 |
+
|
743 |
+
|
744 |
+
MobileNets are a great tool for model deployments and are a great case
|
745 |
+
study in model efficiency. Another similarly great case study is
|
746 |
+
[DistillBERT](https://medium.com/huggingface/distilbert-8cf3380435b5).
|
747 |
+
|
748 |
+
![](./media/image13.png)
|
749 |
+
|
750 |
+
### 3.3 - Mindsets
|
751 |
+
|
752 |
+
As we wrap up this lecture, keep in mind the following mindsets as you
|
753 |
+
consider edge deployment:
|
754 |
+
|
755 |
+
1. **Start with the edge requirement, not the architecture choice**.
|
756 |
+
It's easy to pick a high-performing model architecture, only to
|
757 |
+
then find it impossible to run on the edge device. Avoid this
|
758 |
+
scenario at all costs! Tricks like quantization can account for up
|
759 |
+
to 10x improvement, but not much more.
|
760 |
+
|
761 |
+
2. **Once you have a model that works on the edge, you can iterate
|
762 |
+
locally without too much additional re-deployment.** In this case,
|
763 |
+
make sure to add metrics around the model size and edge
|
764 |
+
performance to your experiment tracking.
|
765 |
+
|
766 |
+
3. **Treat tuning the model as an additional risk and test
|
767 |
+
accordingly.** With the immaturity of edge deployment frameworks,
|
768 |
+
it's crucial to be especially careful when testing your model on
|
769 |
+
the exact hardware you'll be deploying on.
|
770 |
+
|
771 |
+
4. **Make sure to have fallbacks!** Models are finicky and prone to
|
772 |
+
unpredictable behavior. In edge cases, it's especially important
|
773 |
+
to have easily available fallback options for models that aren't
|
774 |
+
working.
|
775 |
+
|
776 |
+
### 3.4 - Conclusion
|
777 |
+
|
778 |
+
To summarize this section:
|
779 |
+
|
780 |
+
1. Web deployment is easier, so use edge deployment only if you need
|
781 |
+
to.
|
782 |
+
|
783 |
+
2. Choose your framework to match the available hardware and
|
784 |
+
corresponding mobile frameworks, or try Apache TVM to be more
|
785 |
+
flexible.
|
786 |
+
|
787 |
+
3. Start considering hardware constraints at the beginning of the
|
788 |
+
project and choose architectures accordingly.
|
documents/lecture-05.srt
ADDED
@@ -0,0 +1,396 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
1
|
2 |
+
00:00:00,240 --> 00:00:32,000
|
3 |
+
hey everybody welcome back this week we're going to talk about deploying models into production so we're talking about this part of the life cycle and why do we spend a whole week on this maybe the answer is obvious right which is if you want to build a machine learning powered product you need some way of getting your model into production but i think there's a more subtle reason as well which is that i think of deploying models as a really critical part of making your models good to begin with the reason for that is when you only evaluate your model offline it's really easy to miss some of the more subtle flaws that model has where it doesn't actually solve the problem that your users needed to solve
|
4 |
+
|
5 |
+
2
|
6 |
+
00:00:30,320 --> 00:01:07,040
|
7 |
+
oftentimes when we deploy a model for the first time only then do we really see whether that model is actually doing a good job or not but unfortunately for a lot of data scientists and ml engineers model deployment is kind of an afterthought relative to some of the other techniques that you've learned and so the goal of this lecture is to cover different ways of deploying models into production and we're not going to be able to go in depth in all of them because it's it's a broad and deep topic worthy probably of a course itself and i'm not personally an expert in it but what we will do is we'll cover like a couple of happy paths that will take you to getting your first model in production for most use cases and then
|
8 |
+
|
9 |
+
3
|
10 |
+
00:01:05,680 --> 00:01:41,119
|
11 |
+
we'll give you a tour of some of the other techniques that you might need to learn about if you want to do something that is outside of that normal 80 so to summarize it's really important to get your model into production because only there do you see if it actually works if it actually solves the task that you set out to solve the technique that we're going to emphasize that you use for this is much like what we use in other parts of the life cycle and it's focused on like getting an mvp out early deploy early deploy a minimum viable model as early as possible and deploy often we're also going to emphasize keeping it simple and adding to bluxy later and so we'll start we'll walk through this the following process starting with building
|
12 |
+
|
13 |
+
4
|
14 |
+
00:01:39,280 --> 00:02:10,959
|
15 |
+
a prototype then we'll talk about how to separate your model in your ui which is sort of one of the first things that you'll need to do to make a more complex ui or to scale then we'll talk about some of the tricks that you need to do in order to scale your model up to serve many users and then finally we'll talk about more advanced techniques that you might use when you need your model to be really fast which often means moving it from a web server to the edge so the first thing that we'll talk about is how to build the first prototype of your production model and the goal here is just something that you can play around with yourself and share with your friends luckily unlike when we first taught this class there's many great
|
16 |
+
|
17 |
+
5
|
18 |
+
00:02:08,879 --> 00:02:46,160
|
19 |
+
tools for building prototypes of models hugging face has some tools built into their playground they've also recently acquired a company called gradio which we'll be using in the lab for the course which makes it very easy to wrap a small user interface around the model and then streamlit is also a great tool for doing this streamlight gives you a little bit more flexibility than something like radio or hugging face spaces at the cost of just needing to put a little bit more thought into how to pull all the pieces together in your ui but it's still very easy to use a few best practices to think about when you're deploying the prototype model first i would encourage you to have a basic ui for the model not
|
20 |
+
|
21 |
+
6
|
22 |
+
00:02:44,720 --> 00:03:20,480
|
23 |
+
just to have an api and the reason for that is you know the goal at this stage is to play around with the model get feedback on the model both yourself and also from your friends or your co-workers or whoever else you're talking with this project about gradio and streamlight are really your friends here gradio really it's often as easy as adding a couple of lines of code to create a simple interface for a model streamlit is a little bit more ambitious in that it's a tool that allows you to build pretty complex uis just using python so it'll be familiar interfaces for you if you're a python developer but will require a little bit more thought about how you want to structure things but still very easy next best practice
|
24 |
+
|
25 |
+
7
|
26 |
+
00:03:18,800 --> 00:03:51,040
|
27 |
+
is don't just run this on your laptop it's actually worth at this stage putting it behind a web url why is that important one it's easier to share right so part of the goal here is to collect feedback from other folks but it also starts to get you thinking about some of the trade-offs that you'll be making when you do a more complex deployment how much latency does this model actually have luckily there are cloud versions of both streamlit and hub and face which are which make this very easy so there's at this point in time not a lot of excuse not to just put this behind a simple url so you can share with people and then the last tip here is just don't stress too much at this stage again this is a prototype this is
|
28 |
+
|
29 |
+
8
|
30 |
+
00:03:49,360 --> 00:04:20,880
|
31 |
+
something that should take you not more than like maybe a day if you're doing it for the first time but if you're building many of these models maybe it even just takes you a couple hours we've talked about this first step which is buildings prototype and next i want to talk about why is this not going to work like why is this not going to be the end solution that you use to deploy your model so where will this fail the first big thing is with any of these tools that we discussed you're going to have limited flexibility in terms of how you build the user interface for your model and extremely gives you more flexibility there than gradio but still relatively limited flexibility and so eventually you're gonna want to be able to build a
|
32 |
+
|
33 |
+
9
|
34 |
+
00:04:19,040 --> 00:04:53,520
|
35 |
+
fully custom ui for the model and then secondly these systems tend not to scale very well to many concurrent requests so if it's just you or you and a couple friends playing around the model that's probably fine but once you start to have users you'll hit the scaling limits of these pretty quickly and this is a good segue to talk about at a high level different ways you can structure your machine learning power application in particular where the model fits into that application so we'll start with an abstract diagram of how your application might look there's a few different components to this on the left we have a client and the client is essentially your user and that's the device that they're using to interact with the
|
36 |
+
|
37 |
+
10
|
38 |
+
00:04:52,160 --> 00:05:28,320
|
39 |
+
application that you built so it could be a browser it could be a vehicle whatever that device is that they're interacting with then that device will make calls over a network to a server that server is typically if you're building a web app where most of your code is running that server will talk to a database where there's data stored that's used for powering the application and there's different ways of structuring this application to fit a machine learning model inside the prototype approach that we just described mostly fits into this model in service approach where the web server that you're hosting actually just has a packaged version of the model sitting inside of it when you write a streamled script for a gradioscript part of that
|
40 |
+
|
41 |
+
11
|
42 |
+
00:05:26,560 --> 00:06:02,000
|
43 |
+
script will be to load the model and so that script will be building your ui as well as running the model at the same time so this pattern like all patterns has pros and cons the biggest pro i think is one it's really easy if you're using one of these prototype development tools but two even if you are doing something a little bit more complicated like you're reusing your web infrastructure for the app that your company is building you get to reuse a lot of existing infrastructure so it doesn't require you as a model developer to set up a lot of new things just to try your model out and that's really great but there are a number of pretty pronounced cons to this as well the first is that your web server in many
|
44 |
+
|
45 |
+
12
|
46 |
+
00:05:59,840 --> 00:06:34,080
|
47 |
+
cases like once you get beyond this streamlight and gradio type example might be written in a different language than your model like it might be written in in ruby or in javascript or something like that getting your model into that language can be difficult the second reason is that oftentimes especially in early in the life cycle of building your model your model might be changing more frequently than your server code so if you have a relatively well established application but a model that you're still building you might not want to have to redeploy the entire application every single time that you make an update to the model which might be every day or even multiple times a day the third con of this approach is that it
|
48 |
+
|
49 |
+
13
|
50 |
+
00:06:31,360 --> 00:07:06,240
|
51 |
+
doesn't scale very well with model size so if you have a really large model that you're trying to run inference on you'll have to load that on your web server and so that's going to start to eat into the resources of that web server and might affect the user experience for people using that web server even if they're not interacting with the model or that's not the primary thing that they're doing in that web application because all of the resources from that web server are being directed to making this model run the fourth reason is that server hardware the hardware that you're probably running your web application or your mobile application on is generally not optimized very well for machine learning workloads and so in particular
|
52 |
+
|
53 |
+
14
|
54 |
+
00:07:04,240 --> 00:07:38,720
|
55 |
+
you're very rarely going to have a gpu on these devices that may or may not be a deal breaker which we'll come back to later in the lecture and the last con is that your model itself and the application that's part of might have very different scaling properties and you might want to be able to scale them differently so for example if you're running a very lightweight ui then it might not take a lot of resources or a lot of thought to scale it to many users but if your model itself is really complicated or very large you might need to get into some of the advanced techniques in this lecture and host these models on gpus to get them to scale you don't want to necessarily have to bring all of that complexity to your
|
56 |
+
|
57 |
+
15
|
58 |
+
00:07:37,120 --> 00:08:13,840
|
59 |
+
web server it's important when there's different scaling properties to be able to separate these concerns as part of the application that you're building so that brings us to the second step which is pulling your model out of the ui and there's a couple of different ways that we can do this and we'll talk about two different patterns here the first is to pull your model out of the ui and have it interact directly with the database this is called batch prediction so how does this work periodically you will get new data in and you'll run your model on each of those data points then you'll save the results of that model inference into a database this can work really well in some circumstances so for example if there's just not a lot of
|
60 |
+
|
61 |
+
16
|
62 |
+
00:08:11,599 --> 00:08:50,240
|
63 |
+
potential inputs to the model if you have one prediction per user or one prediction per customer or something along those lines then you can rerun your model on some frequency like every hour or every day or every week and you can have reasonably fresh predictions to return to those users just stored in your database so examples of types of problems where this can work well are you know in the early stages of building out a recommender system in some cases for doing more internal facing use cases like marketing automation if for example you want to give each of your marketing leads a score that tells your marketing your sales team how much effort to put into closing those leads then you'll have this finite universe of leads that
|
64 |
+
|
65 |
+
17
|
66 |
+
00:08:48,800 --> 00:09:23,680
|
67 |
+
needs a prediction for the model so you can just run a model prediction on every single possible lead store that in a database and then let your users interact with it from there how can you actually do this how do you actually run the model on the schedule the data processing and workflow tools that we talked about in the previous lecture also work really well here what you'll need to do is you'll need to re-run your data pre-processing you'll then need to load the model run the predictions and store the predictions in the database that you're using for your application and so this is exactly a directed acyclic graph a workflow of data operations that tools like dagster airflow or prefix are designed to solve
|
68 |
+
|
69 |
+
18
|
70 |
+
00:09:22,240 --> 00:09:58,720
|
71 |
+
it's worth noting here that there's also tools like metaflow that are designed more for a machine learning or data science use case that might be potentially even an easier way to get started so what are the pros and cons of this pattern of running your model offline and putting the predictions in a database the biggest pro is that this is just really easy to implement right it's reusing these existing batch processing tools that you may already be using for trading your model and it doesn't require you to host any type of new web server to get those predictions to users you can just put the predictions in the database that your product is already using it also scales very easily because databases themselves are designed and
|
72 |
+
|
73 |
+
19
|
74 |
+
00:09:57,040 --> 00:10:35,120
|
75 |
+
have been engineered for decades to scale really easily it's also you know it seems like a simple pattern but it's used in production by very large scale production systems by large companies and it has been for years often times this is for things like recommender systems this is a tried and true pattern that you can run and become pretty confident that it'll work well and then it's also relatively low latency because the database itself is designed for the end application to interact with so latency was a concern that the database designers were able to solve for us there's also some very pronounced cons to this approach and the most important one is that it just doesn't work for every type of model if you have complex
|
76 |
+
|
77 |
+
20
|
78 |
+
00:10:33,360 --> 00:11:08,880
|
79 |
+
inputs to your model if the universe of inputs is too large to enumerate every single time you need to update your predictions then this just isn't going to work second con is that your users are not going to be getting the most up-to-date predictions from your model if the features going into your model let's say change every hour or every minute or some second but you only run this batch prediction job every day then the predictions your users see might be slightly stale think about this in the context of a recommender system if you're only running the predictions of the recommender system every day then those recommendations that you serve to your users won't take into account all of the contacts that those users have
|
80 |
+
|
81 |
+
21
|
82 |
+
00:11:07,040 --> 00:11:42,720
|
83 |
+
provided you in between those predictions so the movies that they watch today the tv shows that they watch today those won't be taken into account in at least the machine learning part of their recommendations but there's you know there's other algorithmic ways to make sure that you don't do things like show users the same movie twice and the final con here is that models frequently can become stale so if your batch job fails for some reason there's a timeout in one of your data pre-processing steps and the new predictions don't get dumped into the database these types of things can make this problem of not getting up-to-date predictions worse and worse and they can be very hard to detect although there's tools for data quality
|
84 |
+
|
85 |
+
22
|
86 |
+
00:11:41,200 --> 00:12:17,920
|
87 |
+
that can really help detect them the next pattern that we're going to talk about is rather than running the model offline and putting the predictions in a database instead let's run the model online as its own service the service is going to interact with the backend or the client itself by making requests to this model service sending hey what is the prediction for this particular input and receiving responses back the model says that the prediction for this input is this particular value the pros of this approach are it's dependable if you have a bug in your model if your model is running directly in the web server then that can crash your entire application but hosting this as an independent service in your application
|
88 |
+
|
89 |
+
23
|
90 |
+
00:12:16,160 --> 00:12:51,680
|
91 |
+
means that's less likely second it's more scalable so you can choose what is the best hardware what is the best infrastructure setup for the model itself and scale that as you need to without needing to worry about how that affects the rest of your application third it's really flexible if you stand up a model service for a particular model you can reuse that service in other applications or other parts of your application very easily concierge since this is a separate service you add a network call when your server or your client interacts with the model it has to make a request and receive a response over the network so that can add some latency to your application it adds infrastructure complexity relative to
|
92 |
+
|
93 |
+
24
|
94 |
+
00:12:50,560 --> 00:13:29,680
|
95 |
+
the other techniques that we've talked about before because now you're on the hook for hosting and managing a separate service just to host your model this is i think really the challenge for a lot of ml teams is that hey i'm good at training models i'm not sure how to run a web service however i do think this is the sweet spot for most ml powered products because the cons of the other approaches are just too great you really need to be able to scale models independently of the application itself in most complex use cases and for a lot of interesting uses of ml we don't have a finite universe of inputs to the model that we can just enumerate every day we really need to be able to have our users send us whatever requests that they want
|
96 |
+
|
97 |
+
25
|
98 |
+
00:13:27,440 --> 00:13:59,680
|
99 |
+
to get and receive a customized response back in this next section we'll talk through the basics of how to build your model service there's a few components to this we will talk about rest apis which are the language that your service will use to interact with the rest of your application we'll talk about dependency management so how to deal with these pesky versions of pi torch or tensorflow that you might need to be upgrading and we'll talk about performance optimization so how to make this run fast and scale well and then we'll talk about rollout so how to get the next version of your model into production once you're ready to deploy it and then finally we'll once we've covered sort of the technical
|
100 |
+
|
101 |
+
26
|
102 |
+
00:13:58,560 --> 00:14:40,399
|
103 |
+
considerations that you'll need to think about we'll talk about managed options that solve a lot of these technical problems for you first let's talk about rest apis what are rest apis rest apis serve predictions in response to canonically formatted http requests there's other alternative protocols to rest for interacting with a service that you host on your infrastructure probably the most common one that you'll see in ml is grpc which is used in a lot of google products like tensorflow serving graphql is another really commonly used protocol in web development that is not terribly relevant for building model services so what does a rest api look like you may have seen examples of this before but when you are sending data to
|
104 |
+
|
105 |
+
27
|
106 |
+
00:14:37,199 --> 00:15:20,000
|
107 |
+
a web url that's formatted as json blog oftentimes this is a rest request this is an example of what it might look like to interact with the rest api in this example we are sending some data to this url which is where the rest api is hosted api.fullstackdeeplearning.com and we're using the post method which is one of the parts of the rest standard that tells the server how it's going to interact with the data that we're sending and then we're sending this json blob of data that represents the inputs to the model that we want to receive a prediction from so one question you might ask is there any standard for how to format the inputs that we send to the model and unfortunately there isn't really any standard yet here are a few
|
108 |
+
|
109 |
+
28
|
110 |
+
00:15:16,959 --> 00:16:01,279
|
111 |
+
examples from rest apis for model services hosted in the major clouds and we'll see some differences here between how they expect the inputs to the model to be formatted for example in google cloud they expect a batch of inputs that is structured as a list of what they call instances each of which has values and a key in azure they expect a list of things called data where the data structure itself depends on what your model architecture is and in sagemaker they also expect instances but these instances are formatted differently than they are in google cloud so one thing i would love to see in the future is moving toward a standard interface for making rest api calls for machine learning services since the types of
|
112 |
+
|
113 |
+
29
|
114 |
+
00:16:00,079 --> 00:16:35,839
|
115 |
+
data that you might send to these services is pretty constrained we should be able to develop a standard as an industry the next topic we'll cover is dependency management model predictions depend not only on the weights of the model that you're running the prediction on but also on the code that's used to turn those weights into the prediction including things like pre-processing and the dependencies the specific library versions that you need in order to run the function that you called and in order for your model to make a correct prediction all of these dependencies need to be present on your web server unfortunately dependencies are a notorious cause of trouble in web applications in general and in
|
116 |
+
|
117 |
+
30
|
118 |
+
00:16:34,000 --> 00:17:10,799
|
119 |
+
particular in machine learning web services the reason for that is a few things one they're very hard to make consistent between your development environment and your server how do you make sure that the server is running the exact same version of tensorflow pytorch scikit-learn numpy whatever other libraries you depend on as your jupyter notebook was when you train those models the second is that they're hard to update if you update dependencies in one environment you need to update them in all environments and in machine learning in particular since a lot of these libraries are moving so quickly small changes in something like a tensorflow version can change the behavior of your model so it's important to be like
|
120 |
+
|
121 |
+
31
|
122 |
+
00:17:09,439 --> 00:17:47,360
|
123 |
+
particularly careful about these versions in ml at a high level there's two strategies that will cover for managing dependencies the first is to constrain the dependencies for just your model to save your model in a format that is agnostic that can be run anywhere and then the second is to wrap your entire inference program your entire predict function for your model into what's called a container so let's talk about how to constrain the dependencies of just your model the primary way that people do this today is through this library called onyx the open neural network exchange and the goal of onyx is to be an interoperability standard for machine learning models what they want you to be able to do is to define a neural network
|
124 |
+
|
125 |
+
32
|
126 |
+
00:17:45,280 --> 00:18:24,960
|
127 |
+
in any language and run it consistently anywhere no matter what inference framework you're using hardware you're using etc that's the promise the reality is that since the underlying libraries used to build these models are currently changing so quickly there's often bugs in this translation layer and in many cases this can create more problems than it actually solves for you and the other sort of open problem here is this doesn't really deal with non-library code in many cases in ml things like feature transformations image transformations you might do as part of your tensorflow or your pi torch graph but you might also just do as a python function that wraps those things and these open neural network standards like
|
128 |
+
|
129 |
+
33
|
130 |
+
00:18:22,960 --> 00:18:57,440
|
131 |
+
onyx don't really have a great story for how to handle pre-processing that brings us to a second strategy for managing dependencies which is containers how can you manage dependencies with containers like docker so we'll cover a few things here we'll talk about the differences between docker and general virtual machines which you might have covered in a computer science class we'll talk about how docker images are built via docker files and constructed via layers we'll talk a little bit about the ecosystem around docker and then we'll talk about specific wrappers around docker that you can use for machine learning the first thing to know about docker is how it differs from virtual machines which is an older technique for
|
132 |
+
|
133 |
+
34
|
134 |
+
00:18:55,600 --> 00:19:33,440
|
135 |
+
packaging up dependencies in a virtual machine you essentially package up the entire operating system as well as all the libraries and applications that are built on top of that operating system so it tends to be very heavy weight because the operating system is itself just a lot of code and expensive to run the improvement that docker made is by removing the need to package up the operating system alongside the application instead you have the libraries and applications packaged up together in something called a container and then you have a docker engine that runs on top of your the operating system on your laptop or on your server that knows how to to virtualize the os and run your bins and libraries and
|
136 |
+
|
137 |
+
35
|
138 |
+
00:19:32,080 --> 00:20:08,559
|
139 |
+
applications on top of it so we just learned that docker is much more lightweight than the typical virtual machine and by virtue of being lightweight it is used very differently than vms were used in particular a common pattern is to spin up a new docker container for every single discrete task that's part of your application so for example if you're building a web application you wouldn't just have a single docker container like you might if you were using a virtual machine instead you might have four you might have one for the web server itself one for the database one for job queue and one for your worker since each one of these parts of your application serves a different function it has different library dependencies and maybe
|
140 |
+
|
141 |
+
36
|
142 |
+
00:20:07,200 --> 00:20:43,039
|
143 |
+
in the future you might need to scale it differently each one of them goes into its own container and those containers are are run together as part of an orchestration system which we'll talk about in a second how do you actually create a docker container docker containers are created from docker files this is what a docker file looks like it runs a sequence of steps to define the environment that you're going to run your code in so in this case it is importing another container that has some pre-packaged dependencies for running python 2.7 hopefully you're not running python 2.7 but if you were you could build a docker container that uses it using this from command at the top and then doing other things like adding
|
144 |
+
|
145 |
+
37
|
146 |
+
00:20:41,440 --> 00:21:21,120
|
147 |
+
data from your local machine hip installing packages exposing ports and running your actual application you can build these docker containers on your laptop and store them there if you want to when you're doing development but one of the really powerful things about docker is it also allows you to build store and pull docker containers from a docker hub that's hosted on some other server on docker servers or on your cloud provider for example the way that you would run a docker container typically is by using this docker run command so what that will do is in this case it will find this container on the right called gordon slash getting started part two and it'll try to run that container but if you're connected
|
148 |
+
|
149 |
+
38
|
150 |
+
00:21:19,360 --> 00:21:59,280
|
151 |
+
to a docker hub and you don't have that docker image locally then what it'll do is it'll automatically pull it from the docker hub that you're connected to the server that your docker engine is connected to it'll download that docker container and it will run it on your local machine so you can experiment with that code environment that's going to be identical to the one that you deploy on your server and in a little bit more detail docker is separated into three different components the first is the client this is what you'll be running on your laptop to build an image from a docker file that you define locally to pull an image that you want to run some code in on your laptop to run a command inside of an image those commands are
|
152 |
+
|
153 |
+
39
|
154 |
+
00:21:56,799 --> 00:22:35,440
|
155 |
+
actually executed by a docker host which is often run on your laptop but it doesn't have to be it can also be run on a server if you want more storage or more performance and then that docker host talks to a registry which is where all of the containers that you might want to access are stored this separation of concerns is one of the things that makes docker really powerful because you're not limited by the amount of compute and storage you have on your laptop to build pull and run docker images and you're not limited by what you have access to on your docker host to decide which images to run in fact there's a really powerful ecosystem of docker images that are available on different public docker hubs you can
|
156 |
+
|
157 |
+
40
|
158 |
+
00:22:33,360 --> 00:23:07,600
|
159 |
+
easily find these images modify them and contribute them back and have the full power of all the people on the internet that are building docker files and docker images there might just be one that already solves your use case out of the box it's easy to store private images in the same place as well so because of this community and lightweight nature of docker it's become incredibly popular in recent years and is pretty much ubiquitous at this point so if you're thinking about packaging dependencies for deployment this is probably the tool that you're going to want to use docker is not as hard to get started with as it sounds you'll need to read some documentation and play around with docker files a little bit to get a
|
160 |
+
|
161 |
+
41
|
162 |
+
00:23:05,679 --> 00:23:40,240
|
163 |
+
feel for how they work and how they fit together you oftentimes won't need to build your own docker image at all because of docker hubs and you can just pull one that already works for your use case when you're getting started that being said there is a bit of a learning curve to docker isn't there some way that we can simplify this if we're working on machine learning and there's a number of different open source packages that are designed to do exactly that one is called cog another is called bento ml and a third is called truss and these are all built by different model hosting providers that are designed to work well with their model hosting service but also just package your model and all of its dependencies in a
|
164 |
+
|
165 |
+
42
|
166 |
+
00:23:38,720 --> 00:24:14,960
|
167 |
+
standard docker container format so you could run it anywhere that you want to and the way that these systems tend to work is there's two components the first is there's a standard way of defining your prediction service so your like model.predict function how do you wrap that in a way that this service understands so in cog it's this base predictor class that you see on the bottom left in truss it's dependent on the model library that you're using like you see on the right hand side that's the first thing is how do you actually package up this model.predict function and then the second thing is a yaml file which sort of defines the other dependencies and package versions that are going to go into this docker
|
168 |
+
|
169 |
+
43
|
170 |
+
00:24:13,360 --> 00:24:47,520
|
171 |
+
container that will be run on your laptop or remotely and so this this sort of a simplified version of the steps that you would put into your docker build command but at the end of the day it packages up in the standard format so you can deploy it anywhere so if you want to have some of the advantages of using docker for making your machine learning models reproducible and deploying them but you don't want to actually go through the learning curve of learning docker or you just want something that's a little bit more automated for machine learning use cases then it's worth checking out these three libraries the next topic we'll discuss is performance optimization so how do we make models go bur how do we make them
|
172 |
+
|
173 |
+
44
|
174 |
+
00:24:45,919 --> 00:25:22,559
|
175 |
+
go fast and there's a few questions that we'll need to answer here first is should we use a gpu to do inference or not we'll talk about concurrency model distillation quantization caching batching sharing the gpu and then finally libraries that automate a lot of these things for you so the spirit of this is going to be sort of a whirlwind tour through some of the major techniques of making your models go faster and we'll try to give you pointers where you can go to learn more about each of these topics the first question you might ask is should you host your model on a gpu or on a cpu there's some advantages to hosting your model on a gpu the first is that it's probably the same hardware that you train your model on to begin with so
|
176 |
+
|
177 |
+
45
|
178 |
+
00:25:20,640 --> 00:25:59,919
|
179 |
+
that can eliminate some loss and translation type moments the second big con is that as your model gets really big and as your techniques get relatively advanced your traffic gets very large this is usually how you can get the sort of maximum throughput like the most number of users that are simultaneously hitting your model is by hosting the model on a gpu but gpus introduce a lot of complexity as well they're more complex to set up because they're not as well trodden the path for hosting web services as cpus are and they're often almost always actually more expensive so i think one point that's worth emphasizing here since it's a common misconception i see all the time is just because your model was trained on a gpu does not mean that you
|
180 |
+
|
181 |
+
46
|
182 |
+
00:25:57,760 --> 00:26:36,080
|
183 |
+
need to actually host it on a gpu in order for it to work so consider very carefully whether you really need a gpu at all or whether you're better off especially for an early version of your model just hosting it on a cpu in fact it's possible to get very high throughput just from cpu inference at relatively low cost by using some other techniques and so one of the main ones here is concurrency concurrency means on a single host machine not just having a single copy of the model running but having multiple copies of the model running in parallel on different cpus or different cpu cores how can you actually do this the main technique that you need to be careful about here is thread tuning so making sure that in torch it
|
184 |
+
|
185 |
+
47
|
186 |
+
00:26:34,400 --> 00:27:10,320
|
187 |
+
knows which threads you need to use in order to actually run the model otherwise the different torch models are going to be competing for threads on your machine there's a great blog post from roblox about how they scaled up bert to serve a billion daily requests just using cpus and they found this to be much easier and much more cost effective than using gpus cpus can be very effective for scaling up to high throughput as well you don't necessarily need gpus to do that the next technique that we'll cover is model distillation what is model distillation model distillation means once you have your model that you've trained maybe a very large or very expensive model that does very well at the task that you want to
|
188 |
+
|
189 |
+
48
|
190 |
+
00:27:08,000 --> 00:27:44,399
|
191 |
+
solve you can train a smaller model that tries to imitate the behavior of your larger one and so this generally is a way of taking the knowledge that your larger model learned and compressing that knowledge into a much smaller model that maybe you couldn't have trained to the same degree of performance from scratch but once you have that larger model it's able to imitate it so how does this work i'll just point you to this blog post that covers several techniques for how you can do this it's worth noting that this can be tricky to do on your own and is i would say relatively infrequently done in practice in production a big exception to that is oftentimes there are distilled versions of popular models the stilbert is a
|
192 |
+
|
193 |
+
49
|
194 |
+
00:27:42,399 --> 00:28:21,120
|
195 |
+
great example of this that are pre-trained for you that you can use for very limited performance trade-off the next technique that we're going to cover is quantization what is it this means that rather than taking all of the matrix multiplication math that you do when you make a prediction with your model and doing that all in the sort of full precision 64 or 32-bit floating point numbers that your model weights might be stored in instead you execute some of those operations or potentially all of them in a lower fidelity representation of the numbers that you're doing the math with and so these can be 16-bit floating point numbers or even in some cases 8-bit integers this introduces some trade-offs with accuracy
|
196 |
+
|
197 |
+
50
|
198 |
+
00:28:19,279 --> 00:28:55,600
|
199 |
+
but oftentimes this is a trade-off that's worth making because the accuracy you lose is pretty limited relative to the performance that you gain how can you do this the recommended path is to use the built-in methods in pytorch and hugging face and tensorflow lite rather than trying to roll this on your own and it's also worth starting to think about this even when you're training your model because techniques called quantization aware training can result in higher accuracy with quantized models than just naively training your model and then running quantization after the fact i want to call out one tool in particular for doing this which is relatively new optimum library from uh hugging face which just makes this very
|
200 |
+
|
201 |
+
51
|
202 |
+
00:28:53,840 --> 00:29:31,840
|
203 |
+
easy and so if you're already using hugging face models there's a little downside to trying this out next we'll talk about caching what is caching for some machine learning models if you look at the patterns of the inputs that users are requesting that model to make predictions on there's some inputs that are much more common than others so rather than asking the model to make those predictions from scratch every single time users make those requests first let's store the common requests in a cache and then let's check that cache before we actually run this expensive operation of running a forward pass on our neural network how can you do this there's a huge depth of techniques that you can use for intelligent caching but
|
204 |
+
|
205 |
+
52
|
206 |
+
00:29:29,760 --> 00:30:07,679
|
207 |
+
there's also a very basic way to do this using func tools library in python and so this looks like it's just adding a wrapper to your model.predict code that will essentially check the cache to see if this input is stored there and return the sort of cached prediction if it's there otherwise run the function itself and this is also one of the techniques used in the roblox blog post that i highlighted before for scaling this up to a billion requests per day the pretty important part of their approach so for some use cases you can get a lot of lift just by simple caching the next technique that we'll talk about is batching so what is the idea behind batching well typically when you run inference on a machine learning model
|
208 |
+
|
209 |
+
53
|
210 |
+
00:30:05,919 --> 00:30:46,240
|
211 |
+
unlike in training you are running it with bat shy as equals one so you have one request come in from a user and then you respond with the prediction for that request and the fact that we are running a prediction on a single request is part of why generally speaking gpus are not necessarily that much more efficient than cpus for running inference what batching does is it takes advantage of the fact that gpus can achieve much higher throughput much higher number of concurrent predictions when they do that prediction in parallel on a batch of inputs rather than on a single input at a time how does this work you have individual predictions coming in from users i want a prediction for this input i want a prediction for this input so
|
212 |
+
|
213 |
+
54
|
214 |
+
00:30:44,799 --> 00:31:23,360
|
215 |
+
you'll need to gather these inputs together until you have a batch of a sufficient size and then you'll run a prediction on that batch and then split the batch into the predictions that correspond to the individual requests and return those to the individual users so there's a couple of pretty tricky things here one is you'll need to tune this batch size in order to trade off between getting the most throughput from your model which generally requires a larger batch size and reducing the inference latency for your users because if you need to wait too long in order to gather enough predictions to fit into that batch then your users are gonna pay the cost of that they're gonna be the ones waiting for that response to come
|
216 |
+
|
217 |
+
55
|
218 |
+
00:31:21,840 --> 00:31:57,440
|
219 |
+
back so you need to tune the batch size to trade off between those two considerations you'll also need some way to shortcut this process if latency becomes too long so let's say that you have a lull in traffic and normally it takes you a tenth of a second to gather your 128 inputs that you're going to put into a bash but now all of a sudden it's taking a full second to get all those inputs that can be a really bad user experience if they just have to wait for other users to make predictions in order to see their response back so you'll want some way of shortcutting this process of gathering all these data points together if the latency is becoming too long for your user experience so hopefully it's clear from
|
220 |
+
|
221 |
+
56
|
222 |
+
00:31:55,600 --> 00:32:32,559
|
223 |
+
this that this is pretty complicated to implement and it's probably not something that you want to implement on your own but luckily it's built into a lot of the libraries for doing model hosting on gpus which we'll talk about in a little bit the next technique that we'll talk about is sharing the gpu between models what does this mean your model may not necessarily fully utilize your gpu for inference and this might be because your batch size is too small or because there's too much other delay in the system when you're waiting for requests so why not just have multiple models if you have multiple model services running on the same view how can you do this this is generally pretty hard and so this is also a place where
|
224 |
+
|
225 |
+
57
|
226 |
+
00:32:30,799 --> 00:33:08,080
|
227 |
+
you'll want to run an out-of-the-box model serving solution that solves this problem for you so we talked about how in gpu inference if you want to make that work well there's a number of things like sharing the gpu between models and intelligently batching the inputs to the models to trade off between latency and throughput that you probably don't want to implement yourself luckily there's a number of libraries that will solve some of these gpu hosting problems for you there's offerings from tensorflow which is pretty well baked into a lot of google cloud's products and pytorch as well as third-party tools from nvidia and any scale and ray nvidia's is probably the most powerful and is the one that i often see from companies that are trying
|
228 |
+
|
229 |
+
58
|
230 |
+
00:33:06,399 --> 00:33:43,200
|
231 |
+
to do very high throughput model serving but can also often be difficult to get started with starting with ray serve or the one that's specific to your neural net library is maybe an easier way to get started if you want to experiment with this all right we've talked about how to make your model go faster and how to optimize the performance of the model on a single server but if you're going to scale up to a large number of users interacting with your model it's not going to be enough to get the most efficiency out of one server at some point you'll need to scale horizontally to have traffic going to multiple copies of your model running on different servers so what is horizontal scaling if you have too much traffic for a single
|
232 |
+
|
233 |
+
59
|
234 |
+
00:33:41,360 --> 00:34:18,079
|
235 |
+
machine you're going to take that stream of traffic that's coming in and you're going to split it among multiple machines how can you actually achieve this each machine that you're running your model on will have its own separate copy of your service and then you'll route traffic between these different copies using a tool called a load balancer in practice there's two common methods of doing this one is container orchestration which is a sort of set of techniques and technologies kubernetes being the most popular for managing a large number of different containers that are running as part of one application on your infrastructure and then a second common method especially in machine learning is serverless so
|
236 |
+
|
237 |
+
60
|
238 |
+
00:34:16,800 --> 00:34:52,320
|
239 |
+
we'll talk about each of these let's start with container orchestration when we talked about docker we talked about how docker is different than typical deployment and typical virtual machines because rather than running a separate copy of the operating system for every virtual machine or program that you want to run instead you run docker on your server and then docker is able to manage these lightweight virtual machines that run each of the parts of your application that you want to run so when you deploy docker typically what you'll do is you'll run a docker host on a server and then you'll have a bunch of containers that the docker host is responsible for managing and running on that server but when you want to scale
|
240 |
+
|
241 |
+
61
|
242 |
+
00:34:50,960 --> 00:35:28,240
|
243 |
+
out horizontally so when you want to have multiple copies of your application running on different servers then you'll need a different tool in order to coordinate between all of these different machines and docker images the most common one is called kubernetes kubernetes works together with very closely with docker to build and run containerized distributed applications kubernetes helps you remove the sort of constraint that all of the containers are running on the same machine kubernetes itself is a super interesting topic that is worth reading about if you're interested in distributed computing and infrastructure and scaling things up but for machine learning deployment if your only goal is to deploy ml models it's probably overkill
|
244 |
+
|
245 |
+
62
|
246 |
+
00:35:26,720 --> 00:36:01,520
|
247 |
+
to learn a ton about kubernetes there's a number of frameworks that are built on top of kubernetes that make it easier to use for deploying models the most commonly used ones in practice tend to be kubeflow serving and selden but even if you use one of these libraries on top of kubernetes for container orchestration you're still going to be responsible for doing a lot of the infrastructure management yourself and serverless functions are an alternative that remove a lot of the need for infrastructure management and are very well suited for machine learning models the way these work is you package up your app code and your dependencies into a docker container and that docker container needs to have a single entry
|
248 |
+
|
249 |
+
63
|
250 |
+
00:36:00,079 --> 00:36:36,079
|
251 |
+
point function like one function that you're going to run over and over again in that container so for example in machine learning this is most often going to be your model.predict function then you deploy that container to a service service like aws lambda or the equivalence in google or azure clouds and that service is responsible for running that predict function inside of that container for you over and over and over again and takes care of everything else scaling load balancing all these other considerations that if you're horizontally scaling a server would be your problem to solve on top of that there's a different pricing model so if you're running a web server then you control that whole web server and so you
|
252 |
+
|
253 |
+
64
|
254 |
+
00:36:34,400 --> 00:37:08,560
|
255 |
+
pay for all the time that it's running 24 hours a day but with serverless you only pay for the time that these servers are actually being used to run your model you know if your model is only serving predictions or serving most of its predictions eight hours a day let's say then you're not paying for the other 16 hours where it's not serving any predictions because of all these things serverless tends to be very well suited to building model services especially if you are not an infrastructure expert and you want a quick way to get started so we recommend this as a starting point for once you get past your prototype application so the genius idea here is your servers can't actually go down if you don't have any we're doing
|
256 |
+
|
257 |
+
65
|
258 |
+
00:37:06,960 --> 00:37:41,280
|
259 |
+
serverless serverless is not without its cons one of the bigger challenges that has gotten easier recently but is still often a challenge in practice is that the packages that you can deploy with these serverless applications tend to be limited in size so if you have an absolutely massive model you might run into those limits there's also a cold start problem what this means is serverless is designed to scale all the way down to zero so if you're not receiving any traffic if you're not receiving any requests for your model then you're not going to pay which is one of the big advantages of serverless but the problem is when you get that first request after the serverless function has been cold for a while it
|
260 |
+
|
261 |
+
66
|
262 |
+
00:37:39,520 --> 00:38:15,440
|
263 |
+
takes a while to start up it can be seconds or even minutes to get that first prediction back once you've gotten that first prediction back it's faster to get subsequent predictions back but it's still worth being aware of this limitation another challenge practical challenge is that many of these server these serverless services are not well designed for building pipelines and models so if you have a complicated chaining of logic to produce your prediction then it might be difficult to implement that in a server-less context there's little little or no state management available in serverless functions so for example if caching is really important for your application it can be difficult to build that caching
|
264 |
+
|
265 |
+
67
|
266 |
+
00:38:13,920 --> 00:38:50,320
|
267 |
+
in if you're deploying your model in serverless and there's often limited deployment tooling as well so rolling out new versions of the serverless function there's often not all the tooling that you'd want to make that really easy and then finally these serverless functions today are cpu only and they have limited execution time of you know a few seconds or a few minutes so if you truly need gpus for imprints then serverless is not going to be your answer but i don't think that limitation is going to be true forever in fact i think we might be pretty close to serverless gpus there's already a couple of startups that are claiming to offer serverless gpu for inference and so if you want to do inference on gpus but you
|
268 |
+
|
269 |
+
68
|
270 |
+
00:38:48,560 --> 00:39:24,880
|
271 |
+
don't want to manage gpu machines yourself i would recommend checking out these two options from these two young startups the next topic that we'll cover in building a model service is rollouts so what do you need to think about in terms of rolling out new models if serving is how you turn your machine learning model into something that can respond to requests that lives on a web server that anyone or anyone that you want to can send a request to and get a prediction back then rollouts are how you manage and update these services so if you have a new version of a model or if you want to split traffic between two different versions to run an a b test how do you actually do that from an infrastructure perspective you probably
|
272 |
+
|
273 |
+
69
|
274 |
+
00:39:23,440 --> 00:39:56,960
|
275 |
+
want to have the ability to do a few different things so one is to roll out new versions gradually what that means is when you have version n plus one of your model and you want to replace version n with it it's sometimes helpful to be able to rather than just instantly switching over all the traffic to n plus one instead start by sending one percent of your traffic to n plus one and then ten percent and then 50 and then once you're confident that it's working well then switch all of your traffic over to it so you'll want to be able to roll out new versions gradually on the flip side you'll want to be able to roll back to an old version instantly so if you detect a problem with the new version of the model that you deployed hey on this
|
276 |
+
|
277 |
+
70
|
278 |
+
00:39:55,200 --> 00:40:28,720
|
279 |
+
10 of traffic that i'm sending to the new model users are not responding well to it or it's sending a bunch of errors you'll want to be able to instantly revert to sending all of your traffic to the older version of the model you want to be able to split traffic between versions a sort of a prerequisite for doing these things as well as running an av test you also want to be able to deploy pipelines of models or deploy models in a way such that they can shadow the prediction traffic they can look at the same inputs as your main model and produce predictions that don't get sent back to users so that you can test whether the predictions look reasonable before you start to show them to users this is just kind of like a
|
280 |
+
|
281 |
+
71
|
282 |
+
00:40:26,800 --> 00:41:06,720
|
283 |
+
quick flavor of some of the things that you might want to solve for in a way of doing model rollouts this is a challenging infrastructure problem so it's beyond the scope of this lecture in this class really if you're using a managed option which we'll come to in a bit or you have infrastructure that's provided for you by your team it may take care of this for you already but if not then looking into a managed option might be a good idea so manage options take care of a lot of the scaling and roll out challenges that you'd otherwise face if you host models yourself even on something like aws lambda there's a few different categories of options here the cloud providers all provide their own sort of managed options as well as in
|
284 |
+
|
285 |
+
72
|
286 |
+
00:41:04,560 --> 00:41:42,079
|
287 |
+
most of the end-to-end ml platforms so if you're already using one of these cloud providers or end-to-end ml platforms pretty heavily it's worth checking out their offering to see if that works for you and there's also a number of startups that have offerings here so there's a couple that are i would say more focused on developer experience like bento ml and cortex so if you find sagemaker really difficult to use or you just hate the developer experience for it it might be worth checking one of those out cortex recently was acquired by databricks so it might also start to be incorporated more into their offerings then there's startups that are have offerings that are more have good ease of use but are also really focused on performance
|
288 |
+
|
289 |
+
73
|
290 |
+
00:41:39,760 --> 00:42:17,440
|
291 |
+
banana is a sort of popular upcoming example of that to give you a feel of what these manage options look like i want to double click on sagemaker which is probably the most popular managed offering the happy path in sagemaker is if your model is already in a digestible format a hugging face model or a scikit-learn model or something like that and in those cases deploying the sagemaker is pretty easy so you will instead of using like kind of a base hugging face class you'll instead use this sagemaker wrapper for the hogging face class and then call fit like you normally would that can also be run on the cloud and then to deploy it you just will call the dot deploy method of this hugging face wrapper and you'll specify
|
292 |
+
|
293 |
+
74
|
294 |
+
00:42:15,920 --> 00:42:52,319
|
295 |
+
how many instances you want this to run on as well as how beefy you need the hardware to be to run it then you can just call predictor.predicts using some input data and it'll run that prediction on the cloud for you in order to return your response back you know i would say in the past sagemaker had a reputation for being difficult to use if you're just doing inference i don't think that reputation is that warranted i think it's actually like pretty easy to use and in many cases is a very good choice for deploying models because it has a lot of easy wrappers to prevent you from needing to build your own docker containers or things like that and it offers options for both deploying model to a dedicated web server like you see
|
296 |
+
|
297 |
+
75
|
298 |
+
00:42:50,960 --> 00:43:29,119
|
299 |
+
in this example as well as to a serverless instance the main trade-offs with using sagemaker are one is you want to do something more complicated than standard huggy face or psychic learn model you'll again still need to deploy a container and the interface for deploying a container is maybe not as user friendly or straightforward as you might like it to be interestingly as of yesterday it was quite a bit more expensive for employing models to dedicated instances than raw ec2 but maybe not so much more expensive than serverless if you're going to go serverless anyway and you're willing to pay 20 overhead to have something that is a better experience for deploying most machine learning models then sagemaker is worth checking out if
|
300 |
+
|
301 |
+
76
|
302 |
+
00:43:26,960 --> 00:44:07,839
|
303 |
+
you're already on amazon take aways from building a model service first you probably don't need to do gpu inference and if you're doing cpu inference then oftentimes scaling horizontally to more servers or even just using serverless is the simplest option is often times enough serverless is probably the recommended option to go with if you can get away with cpus and it's especially helpful if your traffic is spiky so if you have more users in the morning or if you only send your model predictions at night or if your traffic is low volume where you wouldn't max out a full beefy web server anyway sagemaker is increasingly a perfectly good way to get started if you're on aws can get expensive once you've gotten to the
|
304 |
+
|
305 |
+
77
|
306 |
+
00:44:06,400 --> 00:44:42,720
|
307 |
+
point where that cost really starts to matter then you can consider other options if you do decide to go down the route of doing gpu inference then don't try to roll your own gpu inference instead it's worth investing in using a tool like tensorflow serving or triton because these will end up saving you time and leading to better performance in the end and lastly i think it's worth keeping an eye on the startups in this space for on-demand gpu inference because i think that could change the equation of whether gpu inference is really worth it for machine learning models the next topic that we'll cover is moving your model out of a web server entirely and pushing it to the edge so pushing it to where your users are when
|
308 |
+
|
309 |
+
78
|
310 |
+
00:44:41,520 --> 00:45:15,520
|
311 |
+
should you actually start thinking about this sometimes it's just obvious let's say that you uh your users have no reliable internet connection they're driving a self-driving car in the desert or if you have very strict data security or privacy requirements if you're building on an apple device and you can't send the data that you need you need to make the predictions back to a web server otherwise if you don't have those strict requirements the trade-off that you'll need to consider is both the accuracy of your model and the latency of your user receiving a response from that model affect the thing that we ultimately care about which is building a good end user experience latency has a couple of different components to it one
|
312 |
+
|
313 |
+
79
|
314 |
+
00:45:13,599 --> 00:45:50,240
|
315 |
+
component to it is the amount of time it takes the model to make the prediction itself but the other component is the network round trip so how long it takes for the user's request to get to your model service and how long it takes for the prediction to get back to the client device that your user is running on and so if you have exhausted your options for reducing the amount of time that it takes for them all to make a prediction or if your requirements are just so strict that there's no way for you to get within your latency sla by just reducing the amount of time it takes for the model to make prediction then it's worth considering moving to the edge even if you have you know reliable internet connection and don't have very
|
316 |
+
|
317 |
+
80
|
318 |
+
00:45:48,640 --> 00:46:23,119
|
319 |
+
strict data security and privacy requirements but it's worth noting that moving to the edge adds a lot of complexity that isn't present in web development so think carefully about whether you really need this this is the model that we're considering in edge prediction where the model itself is running on the client device as opposed to running on the server or in its own service the way this works is you'll send the waste to the client device and then the client will load the model and interact with it directly there's a number of pros and cons to this approach the biggest pro is that this is the lowest latency way that you can build machine learning powered products and latency is often a pretty important
|
320 |
+
|
321 |
+
81
|
322 |
+
00:46:21,440 --> 00:46:56,240
|
323 |
+
driver of user experience it doesn't require an internet connection so if you're building robots or other types of devices that you want to run ml on this can be a very good option it's great with data security because the data that needs to make the prediction never needs to leave the user's device and in some sense you get scale for free right because rather than needing to think about hey how do i scale up my web service to serve the needs of all my users each of those users will bring their own hardware that will be used to run the model's predictions so you don't need to think as much about how to scale up and down the resources you need for running model inference there's some pretty pronounced cons to this approach
|
324 |
+
|
325 |
+
82
|
326 |
+
00:46:54,640 --> 00:47:32,079
|
327 |
+
as well first of all on these edge devices you generally have very limited hardware resources available so if you're used to running every single one of your model predictions on beefy modern agpu machine you're going to be in for a bit of a shock when it comes to trying to get your model to work on the devices that you needed to work on the tools that you use to do this to make models run on limited hardware are less full featured and in many cases harder to use and more error in bug prone than the neural network libraries that you might be used to working with in tensorflow and pi torch since you need to send updated model weights to the device it can be very difficult to update models in web deployment you have
|
328 |
+
|
329 |
+
83
|
330 |
+
00:47:30,480 --> 00:48:05,520
|
331 |
+
full control over what version of the model is deployed and so there's a bug you can roll out a fix very quickly but on the edge you need to think a lot more carefully about your strategy for updating the version of the model that your users are running on their devices because they may not always be able to get the latest model and then lastly when things do go wrong so if your if your model has is making errors or mistakes it can be very difficult to detect those errors and fix them and debug them because you don't have the raw data that's going through your models available to you as a model developer since it's all on the device of your user next we're gonna give a lightning tour of the different frameworks that you can use for doing
|
332 |
+
|
333 |
+
84
|
334 |
+
00:48:04,000 --> 00:48:40,960
|
335 |
+
edge deployment and the right framework to pick depends both on how you train your model and what the target device you want to deploy it on is so we're not going to aim to go particularly deep on any of these options but really just to give you sort of a broad picture of what are the options you can consider as you're making this decision so we'll split this up mostly by what device you're deploying to so simplest answer is if you're deploying to an nvidia device then the right answer is probably tensor rt so whether that's like a gpu like the one you train your model on or one of the nvidia's devices that's more specially designed to deploy on the edge tensorrt tends to be a go-to option there if instead
|
336 |
+
|
337 |
+
85
|
338 |
+
00:48:38,720 --> 00:49:23,359
|
339 |
+
you're deploying not to an nvidia device but to a phone then both android and apple have libraries for deploying neural networks on their particular os's which are good options if you know that you're only going to be deploying to an apple device or to an android device but if you're using pytorch and you want to be able to deploy both on ios and on android then you can look into pytorch mobile which compiles pi torch down into something that can be run on either of those operating systems similarly tensorflow lite aims to make tensorflow work on different mobile os's as well as well as other edge devices that are neither mobile devices nor nvidia devices if you're deploying not to a nvidia device not to a phone and not to
|
340 |
+
|
341 |
+
86
|
342 |
+
00:49:21,839 --> 00:49:58,800
|
343 |
+
some other edge device that you might consider but deploying to the browser for reasons of performance or scalability or data privacy then tensorflow.js is probably the main example to look at here i'm not aware of a good option for deploying pytorch to the browser and then lastly you know you might be thinking why is there such a large universe of options like i need to follow this complicated decision tree to pick something that depends on the way i train my model the target device i'm deploying it to there aren't even good ways of filling in some of the cells in that graph like how do you run a pi torch model on an edge device that is not a phone for example it's maybe not super clear in that case it might be
|
344 |
+
|
345 |
+
87
|
346 |
+
00:49:56,720 --> 00:50:33,680
|
347 |
+
worth looking into this library called apache tvm apache tvm aims to be a library agnostic and target agnostic tool for compiling your model down into something that can run anywhere the idea is build your model anywhere run it anywhere patrick tvm has some adoption but is i would say at this point still pretty far from being a standard in the industry but it's an option that's worth looking into if you need to make your models work on many different types of devices and then lastly i would say pay attention to this space i think this is another sort of pretty active area for development for machine learning startups in particular there's a startup around patchy tvm called octoml which is worth looking into and there's a new
|
348 |
+
|
349 |
+
88
|
350 |
+
00:50:32,000 --> 00:51:11,839
|
351 |
+
startup that's built by the developers of lower level library called mlir called modular which is also aiming to solve potentially some of the problems around edge deployment as well as tinyml which is a project out of google we talked about the frameworks that you can use to actually run your model on the edge but those are only going to go so far if your model is way too huge to actually put it on the edge at all and so we need ways of creating more efficient models in a previous section we talked about quantization and distillation both of those techniques are pretty helpful for designing these types of models but there's also model architectures that are specifically designed to work well on mobile or edge
|
352 |
+
|
353 |
+
89
|
354 |
+
00:51:09,760 --> 00:51:48,559
|
355 |
+
devices and the operative example here is mobile nets the idea of mobile nets is to take some of the expensive operations in a typical comp net like convolutional layers with larger filter sizes and replace them with cheaper operations like one by one convolutions and so it's worth checking out this mobilenet paper if you want to learn a little bit more about how mobile networks and maybe draw inspiration for how to design a mobile-friendly architecture for your problem mobile desks in particular are a very good tool for a mobile deployment they tend to not have a huge trade-off in terms of accuracy relative to larger models but they are much much smaller and easier to fit on edge devices another case study
|
356 |
+
|
357 |
+
90
|
358 |
+
00:51:46,720 --> 00:52:23,200
|
359 |
+
that i recommend checking out is looking into distilbert distilbert is an example of model distillation that works really well to get a smaller version of bert that removes some of the more expensive operations and uses model distillation to have a model that's not much less performant than bert but takes up much less space and runs faster so to wrap up our discussion on edge deployment i want to talk a little bit about some of the sort of key mindsets for edge deployment that i've learned from talking to a bunch of practitioners who have a lot more experience than i do in deploying machine learning models on the edge the first is there's a temptation i think to finding the perfect model architecture first and then figuring out how to make
|
360 |
+
|
361 |
+
91
|
362 |
+
00:52:21,119 --> 00:52:59,599
|
363 |
+
it work on your device and oftentimes if you're pulling on a web server you can make this work because you always have the option to scale up horizontally and so if you have a huge model it might be expensive to run but you can still make it work but on the edge practitioners believe that the best thing to do is to choose your architecture with your target hardware in mind so you should not be considering architectures that have no way of working on your device and kind of a rule of thumb is you might be able to make up for a factor of let's say an order of magnitude 2 to 10x in terms of inference time or model size through some combination of distillation quantization and other tricks but usually you're not going to get much
|
364 |
+
|
365 |
+
92
|
366 |
+
00:52:58,000 --> 00:53:33,920
|
367 |
+
more than a 10x improvement so if your model is 100 times too large or too slow to run in your target context then you probably shouldn't even consider that architecture the next mindset is once you have one version of the model that works on your edge device you can iterate locally without needing to necessarily test all the changes that you make on that device which is really helpful because deploying and testing on the edge itself is tricky and potentially expensive but you can iterate locally once the version that you're iterating on does work as long as you only gradually add to the size of the model or the latency of the model and one thing that practitioners recommended doing that is i think a step
|
368 |
+
|
369 |
+
93
|
370 |
+
00:53:32,960 --> 00:54:09,599
|
371 |
+
that's worth taking if you're going to do this is to add metrics or add tests for model size and latency so that if you're iterating locally and you get a little bit carried away and you double the size of your model or triple the size of your model you'll at least have a test that reminds you like hey you probably need to double check to make sure that this model is actually going to run on the device that we needed to run on another mindset that i learned from practitioners of edge supplement is to treat tuning the model for your device as an additional risk in the model deployment life cycle and test it accordingly so for example always test your models on production hardware before actually deploying them to
|
372 |
+
|
373 |
+
94
|
374 |
+
00:54:07,839 --> 00:54:45,359
|
375 |
+
production hardware now this may seem obvious but it's not the easiest thing to do in practice and so some folks that are newer to edge deployment will skip this step the reason why this is important is because since these edge deployment libraries are immature there can often be minor differences in the way that the neural network works on your edge device versus how it works on your training device or on your laptop so it's important to run the prediction function of your model on that edge device on some benchmark data set to test both the latency as well as the accuracy of the model on that particular hardware before you deploy it otherwise the differences in how your model works on that hardware versus how it works in
|
376 |
+
|
377 |
+
95
|
378 |
+
00:54:43,280 --> 00:55:24,240
|
379 |
+
your development environment can lead to unforeseen errors or unforeseen degradations and accuracy of your deployed model then lastly since machinery models in general can be really finicky it's a good idea to build fallback mechanisms into the application in case the model fails or you accidentally roll out a bad version of the model or the model is running too slow to solve the task for your user and these fallback mechanisms can look like earlier versions of your model much simpler or smaller models that you know are going to be reliable and run in the amount of time you need them to run in or even just like rule-based functions where if your model is taking too long to make a prediction or is erroring out
|
380 |
+
|
381 |
+
96
|
382 |
+
00:55:22,799 --> 00:55:58,000
|
383 |
+
or something you still have something that is going to return a response to your end user so to wrap up our discussion of edge deployment first thing to remind you of is web deployment is truly much easier than edge fluid so only use edge deployment if you really need to second you'll need to choose a framework to do edge deployment and the way that you'll do this is by matching the library that you use to build your neural network and the available hardware picking the corresponding edge deployment framework that matches those two constraints if you want to be more flexible like if you want your model to be able to work on multiple devices it's worth considering something like apache tvm third start considering the
|
384 |
+
|
385 |
+
97
|
386 |
+
00:55:56,480 --> 00:56:29,440
|
387 |
+
additional constraints that you'll get from edge deployment at the beginning of your project don't wait until you've invested three months into building the perfect model to think about whether that model is actually going to be able to run on the edge instead make sure that those constraints for your edge deployment are taken into consideration from day one and choose your architectures and your training methodologies accordingly to wrap up our discussion of deploying machine learning models fully models is a necessary step of building a machine learning power product but it's also a really useful one for making your models better because only in real life do you get to see how your model actually works on the
|
388 |
+
|
389 |
+
98
|
390 |
+
00:56:27,839 --> 00:57:03,040
|
391 |
+
task that we really care about so the mindsets that we encourage you to have here are deploy early and deploy often so you can start collecting that feedback from the real world as quickly as possible keep it simple and add complexity only as you need to because this deployment is a can be a rabbit hole and there's a lot of complexity to deal with here so make sure that you really need that complexity so start by building a prototype then once you need to start to scale it up then separate your model from your ui by either doing bath predictions or building a model service then once the like sort of naive way that you've deployed your model stops scaling then you can either learn the tricks to scale or use a managed
|
392 |
+
|
393 |
+
99
|
394 |
+
00:57:00,559 --> 00:57:29,839
|
395 |
+
service or a cloud provider option to handle a lot of that scaling for you and then lastly if you really need to be able to operate your model on a device that doesn't have consistent access to the internet if you have very hard data security requirements or if you really really really want to go fast then consider moving your model to the edge but be aware that's going to add a lot of complexity and force you to deal with some less mature tools when you want to do that that wraps up our lecture on deployment and we'll see you next week
|
396 |
+
|
documents/lecture-06.md
ADDED
@@ -0,0 +1,809 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
---
|
2 |
+
description: How to continuously improve models in production
|
3 |
+
---
|
4 |
+
|
5 |
+
# Lecture 6: Continual Learning
|
6 |
+
|
7 |
+
<div align="center">
|
8 |
+
<iframe width="720" height="405" src="https://www.youtube-nocookie.com/embed/nra0Tt3a-Oc?list=PL1T8fO7ArWleMMI8KPJ_5D5XSlovTW_Ur" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
|
9 |
+
</div>
|
10 |
+
|
11 |
+
Lecture by [Josh Tobin](https://twitter.com/josh_tobin_).
|
12 |
+
Notes by [James Le](https://twitter.com/le_james94) and [Vishnu Rachakonda](https://www.linkedin.com/in/vrachakonda/).<br />
|
13 |
+
Published September 12, 2022.
|
14 |
+
[Download slides](https://fsdl.me/2022-lecture-06-slides).
|
15 |
+
|
16 |
+
## 1 - Overview
|
17 |
+
|
18 |
+
The core justification for continual learning is that, unlike in
|
19 |
+
academia, we never deal with static data distributions in the real
|
20 |
+
world. The implication is that: **if you want to use ML in production
|
21 |
+
and build ML-powered products, you need to think about your goal of
|
22 |
+
building a continual learning system, not just a static model**.
|
23 |
+
|
24 |
+
Recalling the data flywheel that we've described in this class before:
|
25 |
+
as you get more users, those users bring more data. You can use the data
|
26 |
+
to make a better model. A better model helps you attract even more users
|
27 |
+
and build a better model over time. Andrej Karpathy described the most
|
28 |
+
optimistic version of it as "[Operation
|
29 |
+
Vacation](https://www.youtube.com/watch?v=hx7BXih7zx8)" -
|
30 |
+
if we make our continual learning system good enough, it'll get better
|
31 |
+
on its own over time, and ML engineers can just go on vacation.
|
32 |
+
|
33 |
+
![](./media/image6.png)
|
34 |
+
|
35 |
+
The reality is quite different. Initially, we gather, clean, and label
|
36 |
+
some data. We train a model on that data. Then we evaluate the model and
|
37 |
+
loop back to training the model to improve it based on our evaluations.
|
38 |
+
Finally, we get a minimum viable model and deploy it into production.
|
39 |
+
|
40 |
+
![](./media/image1.png)
|
41 |
+
|
42 |
+
The problem begins after we deploy the model: we generally don't have a
|
43 |
+
great way of measuring how our models are actually performing in
|
44 |
+
production. Often, we just spot-check some predictions to see if they
|
45 |
+
are doing what they are supposed to do. If it seems to work, then it's
|
46 |
+
great. We move on to work on other things.
|
47 |
+
|
48 |
+
![](./media/image8.png)
|
49 |
+
|
50 |
+
Unfortunately, the ML engineer is probably not the one who discovers the
|
51 |
+
problems, to begin with. Some business user or product manager gets
|
52 |
+
complaints from users about a dipping metric, which leads to an
|
53 |
+
investigation. This already costs the company money because the product
|
54 |
+
and business teams must investigate the problem.
|
55 |
+
|
56 |
+
![](./media/image12.png)
|
57 |
+
|
58 |
+
Eventually, they point back to the ML engineer and the model he is
|
59 |
+
responsible for. At this point, we are stuck on doing ad-hoc analyses
|
60 |
+
because we don't know what caused the model failure. Eventually, we can
|
61 |
+
run a bunch of SQL queries and paste together some Jupyter notebooks to
|
62 |
+
figure out what the problem is. If we are lucky, we can run an A/B test.
|
63 |
+
If the test looks good, we'll deploy it into production. Then, we are
|
64 |
+
back to where we started - **not getting ongoing feedback about how the
|
65 |
+
model is doing in production**.
|
66 |
+
|
67 |
+
The upshot is that **continual learning is the least well-understood
|
68 |
+
part of the production ML lifecycle**. Very few companies are doing this
|
69 |
+
in production today. This lecture focuses on how to improve different
|
70 |
+
steps of the continual learning process, pointers to learn about each
|
71 |
+
step, and recommendations for doing it pragmatically and adopting it
|
72 |
+
gradually.
|
73 |
+
|
74 |
+
## 2 - How to Think About Continual Learning
|
75 |
+
|
76 |
+
Our opinionated view about continual learning is **training a sequence
|
77 |
+
of models that can adapt to a continuous stream of data that comes into
|
78 |
+
production.** You can think about continual learning as an outer loop in
|
79 |
+
your training process. On one end of the loop is your application, which
|
80 |
+
consists of a model and some other code that users interact with that
|
81 |
+
application by submitting requests, getting predictions back, and
|
82 |
+
submitting feedback about how well the model did at providing that
|
83 |
+
prediction.
|
84 |
+
|
85 |
+
The continual learning loop starts with **logging**, which is how we get
|
86 |
+
all the data into the loop. Then we have **data curation**, **triggers**
|
87 |
+
for the retraining process, **dataset formation** to pick the data to
|
88 |
+
retrain on, the **training** process itself, and **offline testing** to
|
89 |
+
validate whether the retrained model is good enough to go into
|
90 |
+
production. After the model is deployed, we have **online testing**, and
|
91 |
+
that brings the next version of the model into production, where we can
|
92 |
+
start the loop all over.
|
93 |
+
|
94 |
+
Each of these stages passes the output to the next step. Output is
|
95 |
+
defined by a set of rules. These rules combine to form our **retraining
|
96 |
+
strategy**. Let's discuss what the retraining strategy looks like for
|
97 |
+
each stage:
|
98 |
+
|
99 |
+
![](./media/image7.png)
|
100 |
+
|
101 |
+
|
102 |
+
At the **logging** stage, the key question answered by the retraining
|
103 |
+
strategy is **what data should we store**? At the end of this stage, we
|
104 |
+
have an "infinite stream" of potentially unlabeled data coming from
|
105 |
+
production and can be used for downstream analysis.
|
106 |
+
|
107 |
+
![](./media/image3.png)
|
108 |
+
|
109 |
+
|
110 |
+
At the **curation** stage, the key rules we need to define are **what
|
111 |
+
data from that infinite stream will we prioritize for labeling and
|
112 |
+
potential retraining?** At the end of this stage, we have a reservoir of
|
113 |
+
candidate training points that have labels and are fully ready to be fed
|
114 |
+
back into a training process.
|
115 |
+
|
116 |
+
![](./media/image5.png)
|
117 |
+
|
118 |
+
|
119 |
+
At the **retraining trigger** stage, the key question is **when should
|
120 |
+
we retrain?** The output of this stage is a signal to kick off a
|
121 |
+
retraining job.
|
122 |
+
|
123 |
+
![](./media/image2.png)
|
124 |
+
|
125 |
+
|
126 |
+
At the **dataset formation** stage, the key rules we need to define are
|
127 |
+
**from this entire reservoir of data, what specific subset of that data
|
128 |
+
are we using to train on for this particular training job?** The output
|
129 |
+
of this stage is a view into that reservoir or training data that
|
130 |
+
specifies the exact data points to be used for the training job.
|
131 |
+
|
132 |
+
![](./media/image22.png)
|
133 |
+
|
134 |
+
|
135 |
+
At the **offline testing** stage, the key rule we need to define is
|
136 |
+
**what "good enough" looks like for all stakeholders.** The output of
|
137 |
+
this stage is equivalent to a "pull request" report card for your model
|
138 |
+
with a clear sign-off process. Once you are signed off, the new model
|
139 |
+
will roll out into production.
|
140 |
+
|
141 |
+
![](./media/image21.png)
|
142 |
+
|
143 |
+
|
144 |
+
Finally, at the **deployment and online testing** stage, the key rule to
|
145 |
+
define is **how do we know if this deployment was successful?** The
|
146 |
+
output of this stage is a signal to roll this model out fully to all of
|
147 |
+
your users.
|
148 |
+
|
149 |
+
In an idealized world, from an ML engineer's perspective, once the model
|
150 |
+
is deployed, the first version of the model is to not retrain the model
|
151 |
+
directly. Instead, we want the model to sit on top of the retraining
|
152 |
+
strategy and try to improve that strategy over time. Rather than
|
153 |
+
training models daily, we look at metrics about how well the strategy is
|
154 |
+
working and how well it's solving the task of improving our model over
|
155 |
+
time in response to changes in the world. The input that we provide is
|
156 |
+
by tuning the strategy to do a better job of solving that task.
|
157 |
+
|
158 |
+
For most ML engineers, our jobs don't feel like that at a high level.
|
159 |
+
**Our retraining strategy is just retraining models whenever we feel
|
160 |
+
like it**. We can get good results from ad-hoc retraining, but when you
|
161 |
+
start getting consistent results and no one is actively working on the
|
162 |
+
model day to day anymore, then it's worth starting to add some
|
163 |
+
automation. Alternatively, if you find yourself needing to retrain the
|
164 |
+
model more than once a week (or even more frequently than that) to deal
|
165 |
+
with changing results in the real world, then it's worth investing in
|
166 |
+
automation just to save yourself.
|
167 |
+
|
168 |
+
## 3 - Periodic Retraining
|
169 |
+
|
170 |
+
The first baseline retraining strategy that you should consider after
|
171 |
+
you move on from ad-hoc is just **periodic retraining**:
|
172 |
+
|
173 |
+
1. At the logging stage, we simply log everything.
|
174 |
+
|
175 |
+
2. At the curation stage, we sample uniformly at random from the data
|
176 |
+
that we've logged up until we get the maximum number of data
|
177 |
+
points that we are able to handle. Then we label them using some
|
178 |
+
automated tools.
|
179 |
+
|
180 |
+
3. Our retraining trigger will just be periodic.
|
181 |
+
|
182 |
+
4. We train once a week, but we do it on the last month's data, for
|
183 |
+
example.
|
184 |
+
|
185 |
+
5. Then we compute the test set accuracy after each training, set a
|
186 |
+
threshold on that, or more likely manual review the results each
|
187 |
+
time, and spot-check some of the predictions.
|
188 |
+
|
189 |
+
6. When we deploy the model, we do spot evaluations of that deployed
|
190 |
+
model on a few individual predictions to make sure things look
|
191 |
+
healthy.
|
192 |
+
|
193 |
+
![](./media/image17.png)
|
194 |
+
|
195 |
+
|
196 |
+
Periodic retraining won't work in every circumstance. There are several
|
197 |
+
failure modes:
|
198 |
+
|
199 |
+
1. The first category is that you have more data than you can log or
|
200 |
+
label. If you have a **high volume** of data, you might need to be
|
201 |
+
more careful about what data to sample and enrich, particularly if
|
202 |
+
that data comes from **a long-tail distribution** - where you have
|
203 |
+
edge cases that your model needs to perform well on, but those
|
204 |
+
edge cases might not be caught by just doing standard uniform
|
205 |
+
sampling. Or if that data is expensive to label like in a
|
206 |
+
**human-in-the-loop** scenario - where you need custom labeling
|
207 |
+
rules or labeling is a part of the product. In either of those
|
208 |
+
cases, you need to be more careful about what subset of your data
|
209 |
+
you log and enrich to be used down the road.
|
210 |
+
|
211 |
+
2. The second category has to do with **managing the cost of
|
212 |
+
retraining**. If your model is expensive to retrain, retraining it
|
213 |
+
periodically is not going to be the most cost-efficient way to go,
|
214 |
+
especially if you do it on a rolling window of data every single
|
215 |
+
time. You will leave a lot of performance on the table by not
|
216 |
+
retraining more frequently. You can partially solve this by
|
217 |
+
increasing the retraining frequency, but this will increase the
|
218 |
+
costs even further.
|
219 |
+
|
220 |
+
3. The final failure mode is situations where you have **a high cost of
|
221 |
+
bad predictions**. Every time you retrain your model, it
|
222 |
+
introduces risk, which comes from the fact that the data you're
|
223 |
+
training the model on might be bad in some way. It might be
|
224 |
+
corrupted, might have been attacked by an adversary, or might not
|
225 |
+
be representative anymore of all the cases that your model needs
|
226 |
+
to perform well on. The more frequently you retrain and the more
|
227 |
+
sensitive you are to model failures, the more thoughtful you need
|
228 |
+
to be about careful model evaluation such that you are not unduly
|
229 |
+
taking on too much risk from frequent retraining.
|
230 |
+
|
231 |
+
## 4 - Iterating On Your Retraining Strategy
|
232 |
+
|
233 |
+
The main takeaway from this section is that **we will use monitoring and
|
234 |
+
observability to determine what changes we want to make to our
|
235 |
+
retraining strategy**.
|
236 |
+
|
237 |
+
1. We'll do that by monitoring just the metrics that actually that
|
238 |
+
matter and using all other metrics for debugging.
|
239 |
+
|
240 |
+
2. When we debug an issue with our model, that will lead to potentially
|
241 |
+
retraining our model. But more broadly than that, we can think of
|
242 |
+
it as a change to the retraining strategy - changing our
|
243 |
+
retraining triggers, our offline tests, our sampling strategies,
|
244 |
+
the metrics for observability, etc.
|
245 |
+
|
246 |
+
3. As we get more confident in our monitoring, we can introduce more
|
247 |
+
automation to our system.
|
248 |
+
|
249 |
+
There are no real standards or best practices on model monitoring yet.
|
250 |
+
The main principles we'll follow are: (1) We'll focus on monitoring what
|
251 |
+
matters and what breaks empirically; and (2) We'll compute other signals
|
252 |
+
too but use them for observability and debugging.
|
253 |
+
|
254 |
+
![](./media/image13.png)
|
255 |
+
|
256 |
+
|
257 |
+
What does it mean to monitor a model in production? We think about it
|
258 |
+
as: You have some metric to assess the model quality (i.e, accuracy) and
|
259 |
+
a time series of how that metric changes over time. The question you try
|
260 |
+
to answer is: **Is this bad or okay?** Do you need to pay attention to
|
261 |
+
this degradation or not?
|
262 |
+
|
263 |
+
The questions we'll need to answer are:
|
264 |
+
|
265 |
+
1. What metrics should we be looking at when we are monitoring?
|
266 |
+
|
267 |
+
2. How can we tell if those metrics are bad and warrant an
|
268 |
+
intervention?
|
269 |
+
|
270 |
+
3. What are the tools that help us with this process?
|
271 |
+
|
272 |
+
### What Metrics to Monitor
|
273 |
+
|
274 |
+
Choosing the right metric to monitor is probably the most important part
|
275 |
+
of this process. Below you can find different types of metrics ranked in
|
276 |
+
order of how valuable they are.
|
277 |
+
|
278 |
+
![](./media/image11.png)
|
279 |
+
|
280 |
+
|
281 |
+
#### Outcomes and Feedback From Users
|
282 |
+
|
283 |
+
The most valuable one to look at is **outcome data or feedback from your
|
284 |
+
users**. Unfortunately, there are no one-size-fits-all ways to do this
|
285 |
+
because it depends a lot on the specifics of the product you are
|
286 |
+
building. This is more of a product management question of how to design
|
287 |
+
your product in a way that you can capture feedback from your users as
|
288 |
+
part of the product experience.
|
289 |
+
|
290 |
+
#### Model Performance Metrics
|
291 |
+
|
292 |
+
The next most valuable signal to look at is **model performance
|
293 |
+
metrics**. These are offline metrics such as accuracy. This is less
|
294 |
+
useful than user feedback because of loss mismatch. A common experience
|
295 |
+
many ML practitioners have is that improving model performance leads to
|
296 |
+
the same or worse outcome. There's very little excuse for not doing
|
297 |
+
this. To some degree, you can label some production data each day by
|
298 |
+
setting up an on-call rotation or throwing a labeling party. These
|
299 |
+
practices will give you some sense of how your model performance trends
|
300 |
+
over time.
|
301 |
+
|
302 |
+
![](./media/image10.png)
|
303 |
+
|
304 |
+
|
305 |
+
#### Proxy Metrics
|
306 |
+
|
307 |
+
The next best thing to look at is **proxy metrics**, which are
|
308 |
+
correlated with bad model performance. These are mostly domain-specific.
|
309 |
+
For example, if you are building text generation with a language model,
|
310 |
+
two examples would be repetitive and toxic outputs. If you are building
|
311 |
+
a recommendation system, an example would be the share of personalized
|
312 |
+
responses. **Edge cases** can be good proxy metrics. If there are
|
313 |
+
certain problems you know that you have with your model, if those
|
314 |
+
increase in prevalence, that might mean your model is not doing very
|
315 |
+
well.
|
316 |
+
|
317 |
+
There's an academic direction that aims at being able to take any metric
|
318 |
+
you care about and approximate it on previously unseen data. How well do
|
319 |
+
we think our model is doing on this new data? Which would make these
|
320 |
+
proxy metrics a lot more practically useful? There are a number of
|
321 |
+
different approaches here: from training an auxiliary model to predict
|
322 |
+
how well your main model might do on this offline data, to using
|
323 |
+
heuristics and human-in-the-loop methods.
|
324 |
+
|
325 |
+
![](./media/image20.png)
|
326 |
+
|
327 |
+
|
328 |
+
An unfortunate result from this literature is that it's not possible to
|
329 |
+
have a single method you use in all circumstances to approximate how
|
330 |
+
your model is doing on out-of-distribution data. Let's say you are
|
331 |
+
looking at the input data to predict how the model will perform on those
|
332 |
+
input points. Then the label distribution changes. As a result, you
|
333 |
+
won't be able to take into account that change in your approximate
|
334 |
+
metric.
|
335 |
+
|
336 |
+
#### Data Quality
|
337 |
+
|
338 |
+
The next signal to look at is **data quality.** [Data quality
|
339 |
+
testing](https://lakefs.io/data-quality-testing/) is a set
|
340 |
+
of rules you apply to measure the quality of your data. This deals with
|
341 |
+
questions such as: How well does a piece of information reflect reality?
|
342 |
+
Does it fulfill your expectations of what's comprehensive? Is your
|
343 |
+
information available when you need it? Some common examples include
|
344 |
+
checking whether the data has the right schema, the data is in the
|
345 |
+
expected range, and the number of records is not anomalous.
|
346 |
+
|
347 |
+
![](./media/image19.png)
|
348 |
+
|
349 |
+
This is useful because data problems tend to be the most common issue
|
350 |
+
with ML models in practice. In [a Google
|
351 |
+
report](https://www.usenix.org/conference/opml20/presentation/papasian)
|
352 |
+
which covered 15 years of different pipeline outages with a particular
|
353 |
+
ML model, most of the outages that happened with that model were
|
354 |
+
distributed systems problems, commonly data problems.
|
355 |
+
|
356 |
+
#### Distribution Drift
|
357 |
+
|
358 |
+
##### Why Measure Distribution Drift?
|
359 |
+
|
360 |
+
Your model's performance is only guaranteed on **data sampled from the
|
361 |
+
same distribution** as it was trained on. This can have a huge impact in
|
362 |
+
practice. A recent example includes changes in model behavior during the
|
363 |
+
pandemic. A bug in the retraining pipeline caused the recommendations
|
364 |
+
not to be updated for new users, leading to millions of dollars in
|
365 |
+
revenue lost.
|
366 |
+
|
367 |
+
##### Types of Distribution Drift
|
368 |
+
|
369 |
+
Distribution drift manifests itself in different ways in the wild:
|
370 |
+
|
371 |
+
1. **Instantaneous drift** happens when a model is deployed in a new
|
372 |
+
domain, a bug is introduced in the pre-processing pipeline, or a
|
373 |
+
big external shift like COVID occurs.
|
374 |
+
|
375 |
+
2. **Gradual drift** happens when users\' preferences change or new
|
376 |
+
concepts get introduced to the corpus over time.
|
377 |
+
|
378 |
+
3. **Periodic drift** happens when users' preferences are seasonal or
|
379 |
+
people in different time zones use your model differently.
|
380 |
+
|
381 |
+
4. **Temporary drift** happens when a malicious user attacks your
|
382 |
+
model, a new user tries your product and churns, or someone uses
|
383 |
+
your product in an unintended way.
|
384 |
+
|
385 |
+
##### How to Measure It?
|
386 |
+
|
387 |
+
How to tell if your distribution is drifted?
|
388 |
+
|
389 |
+
1. Your first **select a window of "good" data to serve as a
|
390 |
+
reference**. To select that reference, you can use a fixed window
|
391 |
+
of production data you believe to be healthy. [Some
|
392 |
+
papers](https://arxiv.org/abs/1908.04240) advocate
|
393 |
+
for using a sliding window of production data. In practice, most
|
394 |
+
of the time you probably should use your validation data as the
|
395 |
+
reference.
|
396 |
+
|
397 |
+
2. Once you have that reference data, you **select a new window of
|
398 |
+
production data to measure your distribution distance on**. This
|
399 |
+
is not a super principled approach and tends to be
|
400 |
+
problem-dependent. A pragmatic solution is to pick one or several
|
401 |
+
window sizes with a reasonable amount of data and slide them.
|
402 |
+
|
403 |
+
3. Finally, once you have your reference window and production window,
|
404 |
+
you **compare the windows using a distribution distance metric**.
|
405 |
+
|
406 |
+
##### What Metrics To Use?
|
407 |
+
|
408 |
+
Let's start by considering the one-dimensional case, where you have a
|
409 |
+
particular feature that is one-dimensional and can compute a density of
|
410 |
+
that feature on your reference/production windows. You want some metric
|
411 |
+
that approximates the distance between these two distributions.
|
412 |
+
|
413 |
+
![](./media/image9.png)
|
414 |
+
|
415 |
+
|
416 |
+
There are a few options here:
|
417 |
+
|
418 |
+
1. The commonly recommended ones are the KL divergence and the KS test.
|
419 |
+
But they are actually bad choices.
|
420 |
+
|
421 |
+
2. Sometimes-better options would be (1) infinity norm or 1-norm of the
|
422 |
+
diff between probabilities for each category, and (2)
|
423 |
+
Earth-mover's distance (a more statistically principled approach).
|
424 |
+
|
425 |
+
Check out [this Gantry blog
|
426 |
+
post](https://gantry.io/blog/youre-probably-monitoring-your-models-wrong/)
|
427 |
+
to learn more about why the commonly recommended metrics are not so good
|
428 |
+
and the other ones are better.
|
429 |
+
|
430 |
+
##### Dealing with High-Dimensional Data
|
431 |
+
|
432 |
+
In the real world for most models, we have potentially many input
|
433 |
+
features or even unstructured data that is very high-dimensional. How do
|
434 |
+
we deal with detecting distribution drift in those cases?
|
435 |
+
|
436 |
+
1. You can measure **drift on all of the features independently**: If
|
437 |
+
you have a lot of features, you will hit [the multiple hypothesis
|
438 |
+
testing
|
439 |
+
problem](https://multithreaded.stitchfix.com/blog/2015/10/15/multiple-hypothesis-testing/).
|
440 |
+
Furthermore, this doesn't capture cross-correlation.
|
441 |
+
|
442 |
+
2. You can measure **drift on only the important features**: Generally
|
443 |
+
speaking, it's a lot more useful to measure drift on the outputs
|
444 |
+
of the model than the inputs. You can also [rank the importance
|
445 |
+
of your input
|
446 |
+
features](https://christophm.github.io/interpretable-ml-book/feature-importance.html)
|
447 |
+
and measure drift on the most important ones.
|
448 |
+
|
449 |
+
3. You can look at **metrics that natively compute or approximate the
|
450 |
+
distribution distance between high-dimensional distributions**:
|
451 |
+
The two that are worth checking out are [maximum mean
|
452 |
+
discrepancy](https://jmlr.csail.mit.edu/papers/v13/gretton12a.html)
|
453 |
+
and [approximate Earth-mover's
|
454 |
+
distance](https://arxiv.org/abs/1904.05877). The
|
455 |
+
caveat here is that they are pretty hard to interpret.
|
456 |
+
|
457 |
+
![](./media/image14.png)
|
458 |
+
|
459 |
+
A more principled way to measure distribution drift for high-dimensional
|
460 |
+
inputs to the model is to use **projections**. The idea of a projection
|
461 |
+
is that:
|
462 |
+
|
463 |
+
1. You first take some high-dimensional input to the model and run that
|
464 |
+
through a function.
|
465 |
+
|
466 |
+
2. Each data point your model makes a prediction on gets tagged by this
|
467 |
+
projection function. The goal of this projection function is to
|
468 |
+
reduce the dimensionality of that input.
|
469 |
+
|
470 |
+
3. Once you've reduced the dimensionality, you can do drift detection
|
471 |
+
on that lower-dimensional representation of the high-dimensional
|
472 |
+
data.
|
473 |
+
|
474 |
+
This approach works for any kind of data, no matter what the
|
475 |
+
dimensionality is or what the data type is. It's also highly flexible.
|
476 |
+
There are different types of projections that can be useful:
|
477 |
+
**analytical projections** (e.g., mean pixel value, length of sentence,
|
478 |
+
or any other function), **random projections** (e.g., linear), and
|
479 |
+
**statistical projections** (e.g., autoencoder or other density models,
|
480 |
+
T-SNE).
|
481 |
+
|
482 |
+
##### Cons of Looking at Distribution Drift
|
483 |
+
|
484 |
+
![](./media/image18.png)
|
485 |
+
|
486 |
+
**Models are designed to be robust to some degree of distribution
|
487 |
+
drift**. The figure on the left above shows a toy example to demonstrate
|
488 |
+
this point. We have a classifier that's trained to predict two classes.
|
489 |
+
We've induced a synthetic distribution shift to shift the red points on
|
490 |
+
the top left to bottom. These two distributions are extremely different,
|
491 |
+
but the model performs equally well on the training data and the
|
492 |
+
production data. In other words, knowing the distribution shift doesn't
|
493 |
+
tell you how the model has reacted to that shift.
|
494 |
+
|
495 |
+
The figure on the right is a research project that used data generated
|
496 |
+
from a physics simulator to solve problems on real-world robots. The
|
497 |
+
training data was highly out of distribution (low-fidelity, random
|
498 |
+
images). However, by training on this set of training data, the model
|
499 |
+
was able to generalize to real-world scenarios on the test data.
|
500 |
+
|
501 |
+
Beyond the theoretical limitations of measuring distribution drift, this
|
502 |
+
is just hard to do in practice. You have to window size correctly. You
|
503 |
+
have to keep all this data around. You need to choose metrics. You need
|
504 |
+
to define projections to make your data lower-dimensional.
|
505 |
+
|
506 |
+
#### System Metrics
|
507 |
+
|
508 |
+
The last thing to consider looking at is your standard **system
|
509 |
+
metrics** such as CPU utilization, GPU memory usage, etc. These don't
|
510 |
+
tell you anything about how your model is actually performing, but they
|
511 |
+
can tell you when something is going wrong.
|
512 |
+
|
513 |
+
#### Practical Recommendations
|
514 |
+
|
515 |
+
We also want to look at how hard it is to compute the aforementioned
|
516 |
+
stages in practice. As seen below, the Y-axis shows the **value** of
|
517 |
+
each signal and the X-axis shows the **feasibility** of measuring each
|
518 |
+
signal.
|
519 |
+
|
520 |
+
1. Measuring outcomes or feedback has pretty wide variability in terms
|
521 |
+
of how feasible it is to do, as it depends on how your product is
|
522 |
+
set up.
|
523 |
+
|
524 |
+
2. Measuring model performance tends to be the least feasible thing to
|
525 |
+
do because it involves collecting some labels.
|
526 |
+
|
527 |
+
3. Proxy metrics are easier to compute because they don't involve
|
528 |
+
labels.
|
529 |
+
|
530 |
+
4. System metrics and data quality metrics are highly feasible because
|
531 |
+
you have off-the-shelf tools for them.
|
532 |
+
|
533 |
+
![](./media/image15.png)
|
534 |
+
|
535 |
+
|
536 |
+
Here are our practical recommendations:
|
537 |
+
|
538 |
+
1. Basic data quality checks are zero-regret, especially if you are
|
539 |
+
retraining your model.
|
540 |
+
|
541 |
+
2. Get some way to measure feedback, model performance, or proxy
|
542 |
+
metrics, even if it's hacky or not scalable.
|
543 |
+
|
544 |
+
3. If your model produces low-dimensional outputs, monitoring those for
|
545 |
+
distribution shifts is also a good idea.
|
546 |
+
|
547 |
+
4. As you evolve your system, practice the **observability** mindset.
|
548 |
+
|
549 |
+
While you can think of monitoring as measuring the known unknowns (e.g.,
|
550 |
+
setting alerts on a few key metrics), [observability is measuring
|
551 |
+
unknown
|
552 |
+
unknowns](https://www.honeycomb.io/blog/observability-a-manifesto/)
|
553 |
+
(e.g., having the power to ask arbitrary questions about your system
|
554 |
+
when it breaks). An observability mindset means two implications:
|
555 |
+
|
556 |
+
1. You should keep around the context or raw data that makes up the
|
557 |
+
metrics that you are computing since you want to be able to drill
|
558 |
+
all the way down to potentially the data points themselves that
|
559 |
+
make up the degraded metric.
|
560 |
+
|
561 |
+
2. You can go crazy with measurement by defining a lot of different
|
562 |
+
metrics. You shouldn't necessarily set alerts on each of those
|
563 |
+
since you don't want too many alerts. Drift is a great example
|
564 |
+
since it is useful for debugging but less so for monitoring.
|
565 |
+
|
566 |
+
Finally, it's important to **go beyond aggregate metrics**. If your
|
567 |
+
model is 99% accurate in aggregate but only 50% accurate for your most
|
568 |
+
important user, is it still "good"? The way to deal with this is by
|
569 |
+
flagging important subgroups or cohorts of data and alerting on
|
570 |
+
important metrics across them. Some examples are categories you don't
|
571 |
+
want to be biased against, "important" categories of users, and
|
572 |
+
categories you might expect to perform differently on (languages,
|
573 |
+
regions, etc.).
|
574 |
+
|
575 |
+
### How To Tell If Those Metrics are "Bad"
|
576 |
+
|
577 |
+
We don't recommend statistical tests (e.g., KS-Test) because they try to
|
578 |
+
return a p-value for the likelihood that the data distributions are not
|
579 |
+
the same. When you have a lot of data, you will get very small p-values
|
580 |
+
for small shifts. This is not what we actually care about since models
|
581 |
+
are robust to a small number of distribution shifts.
|
582 |
+
|
583 |
+
Better options than statistical tests include fixed rules, specific
|
584 |
+
ranges, predicted ranges, and unsupervised detection of new patterns.
|
585 |
+
[This article on dynamic data
|
586 |
+
testing](https://blog.anomalo.com/dynamic-data-testing-f831435dba90?gi=fb4db0e2ecb4)
|
587 |
+
has the details.
|
588 |
+
|
589 |
+
![](./media/image16.png)
|
590 |
+
|
591 |
+
### Tools for Monitoring
|
592 |
+
|
593 |
+
The first category is **system monitoring** tools, a premature category
|
594 |
+
with different companies in it
|
595 |
+
([Datadog](https://www.datadoghq.com/),
|
596 |
+
[Honeycomb](https://www.honeycomb.io/), [New
|
597 |
+
Relic](https://newrelic.com/), [Amazon
|
598 |
+
CloudWatch](https://aws.amazon.com/cloudwatch/), etc.).
|
599 |
+
They help you detect problems with any software system, not just ML
|
600 |
+
models. They provide functionality for setting alarms when things go
|
601 |
+
wrong. Most cloud providers have decent monitoring solutions, but if you
|
602 |
+
want something better, you can look at monitoring-specific tools to
|
603 |
+
monitor anything.
|
604 |
+
|
605 |
+
This raises the question of whether we should just use these system
|
606 |
+
monitoring tools to monitor ML metrics as well. [This blog
|
607 |
+
post](https://www.shreya-shankar.com/rethinking-ml-monitoring-3/)
|
608 |
+
explains that it's feasible but highly painful due to many technical
|
609 |
+
reasons. Thus, it's better to use ML-specific tools.
|
610 |
+
|
611 |
+
Two popular open-source monitoring tools are
|
612 |
+
[EvidentlyAI](https://github.com/evidentlyai) and
|
613 |
+
[whylogs](https://github.com/whylabs/whylogs).
|
614 |
+
|
615 |
+
- Both are similar in that you provide them with samples of data and
|
616 |
+
they produce a nice report that tells you where their distribution
|
617 |
+
shifts are.
|
618 |
+
|
619 |
+
- The big limitation of both is that they don't solve the data
|
620 |
+
infrastructure and the scale problem. You still need to be able to
|
621 |
+
get all that data into a place where you can analyze it with these
|
622 |
+
tools.
|
623 |
+
|
624 |
+
- The main difference between them is that whylogs is more focused on
|
625 |
+
gathering data from the edge by aggregating the data into
|
626 |
+
statistical profiles at inference time. You don't need to
|
627 |
+
transport all the data from your inference devices back to your
|
628 |
+
cloud.
|
629 |
+
|
630 |
+
![](./media/image4.png)
|
631 |
+
|
632 |
+
Lastly, there are a bunch of different SaaS vendors for ML monitoring
|
633 |
+
and observability: [Gantry](https://gantry.io/),
|
634 |
+
[Aporia](https://www.aporia.com/),
|
635 |
+
[Superwise](https://superwise.ai/),
|
636 |
+
[Arize](https://arize.com/),
|
637 |
+
[Fiddler](https://www.fiddler.ai/),
|
638 |
+
[Arthur](https://arthur.ai/), etc.
|
639 |
+
|
640 |
+
|
641 |
+
## 5 - Retraining Strategy
|
642 |
+
|
643 |
+
We’ve talked about monitoring and observability, which allow you to identify issues with your continual learning system. Now, we’ll talk about how we will fix the various stages of the continual learning process based on what we learn from monitoring and observability.
|
644 |
+
|
645 |
+
|
646 |
+
### Logging
|
647 |
+
|
648 |
+
The first stage of the continual learning loop is **logging**. As a reminder, the goal of logging is to get data from your model to a place where you can analyze it. The key question to answer here is: “**what data should I actually log?**”
|
649 |
+
|
650 |
+
For most of us, the best answer is just to log all of the data. Storage is cheap. It's better to have data than not to have it. There are, however, some situations where you can't do that. For example, if you have too much traffic going through your model to the point where it's too expensive to log all of it, or if you have data privacy concerns, or if you're running your model at the edge, you simply may not be able to able to log all your data.
|
651 |
+
|
652 |
+
In these situations, there are two approaches that you can take. The first approach is **profiling**. With profiling, rather than sending all the data back to your cloud and then using that to monitor, you instead compute **statistical profiles** of your data on the edge that describe the data distribution that you're seeing. This is great from a data security perspective because it doesn't require you to send all the data back home. It minimizes your storage cost. Finally, you don't miss things that happen in the tails, which is an issue for the next approach. That'll describe the place to use. This approach is best used for security-critical applications. Computing statistical profiles is a pretty interesting topic in computer science and data summarization that is worth checking out if you’re interested in this approach.
|
653 |
+
|
654 |
+
![alt_text](./media/image22.png "image_tooltip")
|
655 |
+
|
656 |
+
|
657 |
+
The other approach is **sampling**. With sampling, you'll just take certain data points and send those back to your monitoring and logging system. The advantage of sampling is that it has minimal impact on your inference resources. You don't have to actually spend the computational budget to compute profiles. You also get to have access to the raw data for debugging and retraining, albeit a smaller amount. This is the approach we recommend for any other kind of application.
|
658 |
+
|
659 |
+
|
660 |
+
### Curation
|
661 |
+
|
662 |
+
The next step in the continual learning loop is **curation**. The goal of curation is to take the infinite stream of production data, which is potentially unlabeled, and turn it into a finite reservoir of enriched data suitable for training. Here, we must answer, “**what data should be enriched?**”
|
663 |
+
|
664 |
+
You could **sample and enrich data randomly**, but that may not prove helpful to your model. Importantly, you miss rare classes or events. A better approach can be to perform **stratified subsampling**, wherein you sample specific proportions of individuals from various subpopulations (e.g. race). The most advanced strategy for picking data to enrich is to **curate data points** that are somehow interesting for the purpose of improving your model.
|
665 |
+
|
666 |
+
There are a few different ways of doing this: **user-driven curation loops** via feedback loops, **manual curation** via error analysis, and **automatic curation** via active learning.
|
667 |
+
|
668 |
+
User-driven curation is a great approach that is easy to implement, assuming you have a clear way of gathering user feedback. If your user churns, clicks thumbs down, or performs some other similar activity on the model’s output, you have an easy way of understanding data that could be enriched for future training jobs.
|
669 |
+
|
670 |
+
![alt_text](./media/image23.png "image_tooltip")
|
671 |
+
|
672 |
+
If you don't have user feedback, or if you need even more ways of gathering interesting data from your system, the second most effective way is by doing **manual error analysis**. In this approach, we look at the errors that our model is making, reason about the different types of failure modes that we're seeing, and try to write functions or rules that help capture these error modes. We'll use those functions to gather more data that might represent those error cases. Some examples of these function-based approaches are **similarity-based curation**, which uses nearest neighbors, and **projection-based curation**, wherein we train a new function or model to recognize key data points.
|
673 |
+
|
674 |
+
The last way to curate data is to do so automatically using a class of algorithms called **[active learning](https://lilianweng.github.io/posts/2022-02-20-active-learning/)**. The way active learning works is that, given a large amount of unlabeled data, we will try to determine which data points would improve model performance the most (if you were to label those data points next and train on them). These algorithms define **a sampling strategy**, rank all of your unlabeled examples using **a scoring function** that defines the sampling strategy, and mark the data points with the highest scores for future labeling.
|
675 |
+
|
676 |
+
There are a number of different scoring function approaches that are shown below.
|
677 |
+
|
678 |
+
|
679 |
+
|
680 |
+
1. **Most uncertain**: sample low-confidence and high-entropy predictions or predictions that an ensemble disagrees on.
|
681 |
+
2. **Highest predicted loss**: train a separate model that predicts loss on unlabeled points, then sample the highest predicted loss.
|
682 |
+
3. **Most different from labels**: train a model to distinguish labeled and unlabeled data, then sample the easiest to distinguish.
|
683 |
+
4. **Most representative**: choose points such that no data is too far away from anything we sampled.
|
684 |
+
5. **Big impact on training**: choose points such that the expected gradient is large or points where the model changes its mind the most about its prediction during training.
|
685 |
+
|
686 |
+
Uncertainty scoring tends to be the most commonly used method since it is simple and easy to implement.
|
687 |
+
|
688 |
+
You might have noticed that there's a lot of similarity between some of the ways that we do data curation and the way that we do monitoring. That's no coincidence--**monitoring and data curation are two sides of the same coin!** They're both interested in solving the problem of finding data points where the model may not be performing well or where we're uncertain about how the model is performing on those data points.
|
689 |
+
|
690 |
+
![alt_text](./media/image24.png "image_tooltip")
|
691 |
+
|
692 |
+
Some examples of people practically applying data curation are OpenAI’s DALL-E 2, which uses [active learning and manual curation](https://openai.com/blog/dall-e-2-pre-training-mitigations/), Tesla, which uses [feedback loops and manual curation](https://www.youtube.com/watch?v=hx7BXih7zx8), and Cruise, which uses feedback loops.
|
693 |
+
|
694 |
+
Some tools that help with data curation are [Scale Nucleus](https://scale.com/nucleus), [Aquarium](https://www.aquariumlearning.com/), and [Gantry](https://gantry.io/).
|
695 |
+
|
696 |
+
To summarize then, here are our final set of recommendations for applying data curation.
|
697 |
+
|
698 |
+
|
699 |
+
|
700 |
+
1. Random sampling is a fine starting point. If you want to avoid bias or have rare classes, do stratified sampling instead.
|
701 |
+
2. If you have a feedback loop, then user-driven curation is a no-brainer. If not, confidence-based active learning is easy to implement.
|
702 |
+
3. As your model performance increases, you’ll have to look harder for challenging training points. Manual techniques are unavoidable and should be embraced. Know your data!
|
703 |
+
|
704 |
+
|
705 |
+
### Retraining Triggers
|
706 |
+
|
707 |
+
After we've curated our infinite stream of unlabeled data down to a reservoir of labeled data that's ready to potentially train on, the next thing that we'll need to decide is “**what trigger are we gonna use to retrain?**”
|
708 |
+
|
709 |
+
The main takeaway here is that moving to automated retraining is **not** always necessary. In many cases, just manually retraining is good enough. It can save you time and lead to better model performance. It's worth understanding when it makes sense to actually make the harder move to automated retraining.
|
710 |
+
|
711 |
+
The main prerequisite for moving to automated retraining is being able to reproduce model performance when retraining in a fairly automated fashion. If you're able to do that and you are not really working on the model actively, it's probably worth implementing some automated retraining. As a rule of thumb, if you’re retraining the model more than once a month, automated retraining may make sense.
|
712 |
+
|
713 |
+
When it's time to move to automated training, the main recommendation is to just keep it simple and **retrain periodically**, e.g. once a week. The main question though is, how do you pick the right training schedule? The recommendation here is to:
|
714 |
+
|
715 |
+
|
716 |
+
|
717 |
+
1. Apply measurement to figure out a reasonable retraining schedule.
|
718 |
+
2. Plot your model performance and degradation over time.
|
719 |
+
3. Compare how retraining the model at various intervals would have resulted in improvements to its performance.
|
720 |
+
|
721 |
+
As seen below, the area between the curves represents the opportunity cost, so always remember to balance the upside of retraining with the operational costs of retraining.
|
722 |
+
|
723 |
+
![alt_text](./media/image25.png "image_tooltip")
|
724 |
+
|
725 |
+
This is a great area for future academic research! More specifically, we can look at ways to automate determining the optimal retraining strategy based on performance decay, sensitivity to performance, operational costs, and retraining costs.
|
726 |
+
|
727 |
+
An additional option for retraining, rather than time-based intervals, is **performance triggers** (e.g. retrain when the model accuracy dips below 90%). This helps react more quickly to unexpected changes and is more cost-optimal, but requires very good instrumentation to process these signals along with operational complexity.
|
728 |
+
|
729 |
+
An idea that probably won't be relevant but is worth thinking about is **online learning**. In this paradigm, you train on every single data point as it comes in. It's not very commonly used in practice.
|
730 |
+
|
731 |
+
A version of this idea that is used fairly frequently in practice is **online adaptation**. This method operates not at the level of retraining the whole model itself but rather on the level of adapting the policy that sits on top of the model. What is a policy you ask? A policy is the set of rules that takes the raw prediction that the model made, like the score or the raw output of the model, and turns it into the output the user sees. In online adaptation, we use algorithms like multi-armed bandits to tune these policies. If your data changes very frequently, it is worth looking into this method.
|
732 |
+
|
733 |
+
|
734 |
+
### Dataset Formation
|
735 |
+
|
736 |
+
Imagine we've fired off a trigger to start a new training job. The next question we need to answer is, among all of the labeled data in our reservoir of data, **what specific data points should we train on for this particular new training job?**
|
737 |
+
|
738 |
+
We have four options here. Most of the time in deep learning, we'll just use the first option and **train on all the data that we have available** to us. Remember to keep your data version controlled and your curation rules consistent.
|
739 |
+
|
740 |
+
![alt_text](./media/image26.png "image_tooltip")
|
741 |
+
|
742 |
+
If you have too much data to do that, you can use recency as a heuristic for a second option and **train on only a sliding window of the most recent data** (if recency is important) or **sample a smaller portion** (if recency isn’t). In the latter case, compare the aggregate statistics between the old and new windows to ensure there aren’t any bugs. It’s also important in both cases to compare the old and new datasets as they may not be related in straightforward ways.
|
743 |
+
|
744 |
+
![alt_text](./media/image27.png "image_tooltip")
|
745 |
+
|
746 |
+
A useful third option is **online batch selection**, which can be used when recency doesn’t quite matter. In this method, we leverage label-aware selection functions to choose which items in mini-batches to train on.
|
747 |
+
|
748 |
+
![alt_text](./media/image28.png "image_tooltip")
|
749 |
+
|
750 |
+
A more difficult fourth option that isn’t quite recommended is **continual fine-tuning**. Rather than retraining from scratch every single time, you train your existing model on just new data. The reason why you might wanna do this primarily is because it's much more cost-effective. The paper below shares some findings from GrubHub, where they found a 45x cost improvement by doing this technique relative to sliding windows.
|
751 |
+
|
752 |
+
![alt_text](./media/image29.png "image_tooltip")
|
753 |
+
|
754 |
+
The big challenge here is that unless you're very careful, it's easy for the model to forget what it learned in the past. The upshot is that you need to have mature evaluation practices to be very careful that your model is performing well on all the types of data that it needs to perform well on.
|
755 |
+
|
756 |
+
|
757 |
+
### Offline Testing
|
758 |
+
|
759 |
+
After the previous steps, we now have a new candidate model that we think is ready to go into production. The next step is to test that model. The goal of this stage is to produce a report that our team can sign off on that answers the question of whether this new model is good enough or whether it's better than the old model. The key question here is, “**what should go into that report?**”
|
760 |
+
|
761 |
+
This is a place where there's not a whole lot of standardization, but the recommendation we have here is to compare your current model with the previous version of the model on all of the metrics that you care about, all of the subsets of data that you've flagged are important, and all the edge cases you’ve defined. Remember to adjust the comparison to account for any sampling bias.
|
762 |
+
|
763 |
+
Below is a sample comparison report. Note how the validation set is broken out into concrete subgroups. Note also how there are specific validation sets assigned to common error cases.
|
764 |
+
|
765 |
+
![alt_text](./media/image30.png "image_tooltip")
|
766 |
+
|
767 |
+
In continual learning, evaluation sets are dynamically refined just as much as training sets are. Here are some guidelines for how to manage evaluation sets in a continual learning system:
|
768 |
+
|
769 |
+
|
770 |
+
|
771 |
+
1. As you curate new data, add some of it to your evaluation sets. For example, if you change how you do sampling, add that newly sampled data to your evaluation set. Or if you encounter a new edge case, create a test case for it.
|
772 |
+
2. Corollary 1: you should version control your evaluation sets as well.
|
773 |
+
3. Corollary 2: if your data changes quickly, always hold out the most recent data for evaluation.
|
774 |
+
|
775 |
+
Once you have the testing basics in place, a more advanced option that you can look into here is **expectation testing**. Expectation tests work by taking pairs of examples where you know the relationship between the two. These tests help a lot with understanding the generalizability of models.
|
776 |
+
|
777 |
+
![alt_text](./media/image31.png "image_tooltip")
|
778 |
+
|
779 |
+
Just like how data curation is highly analogous to monitoring, so is offline testing. We want to observe our metrics, not just in aggregate but also across all of our important subsets of data and across all of our edge cases. One difference between these two is that **you will have different metrics available in offline testing and online testing**. For example, you’re much more likely to have labels offline. Online, you’re much more likely to have feedback. We look forward to more research that can predict online metrics from offline ones.
|
780 |
+
|
781 |
+
|
782 |
+
### Online Testing
|
783 |
+
|
784 |
+
Much of this we covered in the last lecture, so we’ll keep it brief! Use shadow mode and A/B tests, roll out models gradually, and roll back models if you see issues during rollout.
|
785 |
+
|
786 |
+
|
787 |
+
## 6 - The Continual Improvement Workflow
|
788 |
+
|
789 |
+
To tie it all together, we’ll conclude with an example. Monitoring and continual learning are two sides of the same coin. We should be using the signals that we monitor to very directly change our retraining strategy. This section describes the future state that comes as a result of investing in the steps laid out previously.
|
790 |
+
|
791 |
+
Start with a place to store and version your strategy. The components of your continual learning strategy should include the following:
|
792 |
+
|
793 |
+
|
794 |
+
|
795 |
+
* Inputs, predictions, user feedback, and labels.
|
796 |
+
* Metric definitions for monitoring, observability, and offline testing.
|
797 |
+
* Projection definitions for monitoring and manual data curation.
|
798 |
+
* Subgroups and cohorts of interest for monitoring and offline testing.
|
799 |
+
* Data curation logic.
|
800 |
+
* Datasets for training and evaluation.
|
801 |
+
* Model comparison reports.
|
802 |
+
|
803 |
+
Walk through this example to understand how changes to the retraining strategy occur as issues surface in our machine learning system.
|
804 |
+
|
805 |
+
![alt_text](./media/image32.png "image_tooltip")
|
806 |
+
|
807 |
+
## 7 - Takeaways
|
808 |
+
|
809 |
+
To summarize, continual learning is a nascent, poorly understood topic that is worth continuing to pay attention to. Watch this space! In this lecture, we focused on all the steps and techniques that allow you to use retraining effectively. As MLEs, leverage monitoring to strategically improve your model. Always start simple, and get better!
|
documents/lecture-06.srt
ADDED
@@ -0,0 +1,440 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
1
|
2 |
+
00:00:00,080 --> 00:00:38,960
|
3 |
+
hi everybody welcome back to full stack deep learning this week we're going to talk about continual learning which is in my opinion one of the most exciting topics that we cover in this class continual learning describes the process of iterating on your models once they're in production so using your production data to retrain your models for two purposes first to adapt your model to any changes in the real world that happen after you train your model and second to use data from the real world to just improve your model in general so let's dive in the sort of core justification for continual learning is that unlike in academia in the real world we never deal with static data distributions and so the implication of
|
4 |
+
|
5 |
+
2
|
6 |
+
00:00:36,719 --> 00:01:14,640
|
7 |
+
that is if you want to use ml in production if you want to build a good machine learning powered product you need to think about your goal as building a continual learning system not just building a static model so i think how we all hope this would work is the data flywheel that we've described in this class before so as you get more users those users bring more data you can use the data to make better model that better model helps you attract even more users and build a better model over time and the most automated version of this the most optimistic version of it was described by andre karpathy as operation vacation if we make our continual learning system good enough then it'll just get better on its own
|
8 |
+
|
9 |
+
3
|
10 |
+
00:01:12,880 --> 00:01:47,920
|
11 |
+
over time and we as machine learning engineers can just go on vacation and when we come back the model will be better but the reality of this is actually quite different i think it starts out okay so we gather some data we clean and label that data we train a model on the data then we evaluate the model we loop back to training the model to make it better based on the evaluations that we made and finally we get to the point where we're done we have a minimum viable model and we're ready to ship it into production and so we deploy it the problem begins after we deploy it which is that we generally don't really have a great way of measuring how our models are actually performing in production so often what
|
12 |
+
|
13 |
+
4
|
14 |
+
00:01:46,159 --> 00:02:21,440
|
15 |
+
we'll do is we'll just spot check some predictions to see if it looks like it's doing what it's supposed to be doing and if it seems to be working then that's great we probably move on and work on some other project that is until the first problem pops up and now unfortunately i as a machine learning engineer and probably not the one that discovers that problem to begin with it's probably you know some business user or some pm that realizes that hey we're getting complaints from a user or we're having a metric that's dipped and this leads to an investigation this is already costing the company money because the product and the business team are having to investigate this problem eventually they are able to
|
16 |
+
|
17 |
+
5
|
18 |
+
00:02:19,440 --> 00:02:53,280
|
19 |
+
point this back to me and to the model that i am responsible for and at this point you know i'm kind of stuck doing some ad hoc analyses because i don't really know what the cause of the model of the failure of the model is maybe i haven't even looked at this model for a few weeks or a few months you know maybe eventually i'm able to like run a bunch of sql queries you know paste together some jupiter notebooks and figure out what i think the problem is so i'll retrain the model i'll redeploy it and if we're lucky we can run an a b test and if that a b test looks good then we'll deploy it into production and we're sort of back where we started not getting ongoing feedback about how the model is really doing in production the
|
20 |
+
|
21 |
+
6
|
22 |
+
00:02:51,519 --> 00:03:29,040
|
23 |
+
upshot of all this is that continual learning is really the least well understood part of the production machine learning lifecycle and very few companies are actually doing this well in production today and so this lecture in some ways is going to feel a little bit different than some of the other lectures a big part of the focus of this lecture is going to be being opinionated about how we think you should think about the structure of continual learning problems this is you know some of what we say here will be sort of well understood industry best practices and some of it will be sort of our view on what we think this should look like i'm going to throw a lot of information at you about each of the different steps of the
|
24 |
+
|
25 |
+
7
|
26 |
+
00:03:27,120 --> 00:04:06,000
|
27 |
+
continual learning process how to think about improving how you do these parts once you have your first model in production and like always we'll provide some recommendations for how to do this pragmatically and how to adopt it gradually so first i want to give sort of an opinionated take on how i think you should think about continual learning so i'll define continual learning as training a sequence of models that is able to adapt to a continuous stream of data that's coming in in production you can think about continual learning as an outer loop on your training process on one end of the loop is your application which consists of a model as well as some other code users interact with that application by submitting
|
28 |
+
|
29 |
+
8
|
30 |
+
00:04:03,840 --> 00:04:41,440
|
31 |
+
requests getting predictions back and then submitting feedback about how well the model did at providing that prediction the continual learning loop starts with logging which is how we get all the data into the loop then we have data curation triggers for doing the retraining process data set formation to pick the data to actually retrain on and we have the training process itself then we have offline testing which is how we validate whether the retrained model is good enough to go into production after it's deployed we have online testing and then that brings the next version of the model into production where we can start the loop all over again each of these stages passes an output to the next step and the way that output is defined is by
|
32 |
+
|
33 |
+
9
|
34 |
+
00:04:39,040 --> 00:05:17,360
|
35 |
+
using a set of rules and all these rules together roll up into something called a retraining strategy next we'll talk about what the retraining strategy defines for each stage and what the output looks like so the logging stage the key question that's answered by the retraining strategy is what data should we actually store and at the end of this we have an infinite stream of potentially unlabeled data that's coming from production and is able to be used for downstream analysis at the curation stage the key rules that we need to define are what data from that infinite stream are we going to prioritize for labeling and potential retraining and at the end of the stage we'll have a reservoir of a finite number of
|
36 |
+
|
37 |
+
10
|
38 |
+
00:05:15,919 --> 00:05:52,880
|
39 |
+
candidate training points that have labels and are fully ready to be fed back into a training process at the retraining trigger stage the key question to answer is when should we actually retrain how do we know when it's time to hit the retrain button and the output of the stage is a signal to kick off a retraining job at the data set formation stage the key rules we need to define are from among this entire reservoir of data what specific subset of that data are we actually going to train on for this particular training job you can think of the output of this as a view into that reservoir of training data that specifies the exact data points that are going to go into this training job at the offline testing
|
40 |
+
|
41 |
+
11
|
42 |
+
00:05:50,960 --> 00:06:28,800
|
43 |
+
stage the key rules that we need to define are what is good enough look like for all of our stakeholders how are we going to agree that this model is ready to be deployed and the output of the stage looks like something like the equivalent of a pull request a report card for your model that has a clear sign-off process that once you're signed off the new model will roll out into prod and then finally at the deployment online testing stage the key rules that we need to find are how do we actually know if this deployment was successful and the output of the stage will be the signal to actually roll this model out fully to all of your users in an idealized world the way i think we should think of our role as machine
|
44 |
+
|
45 |
+
12
|
46 |
+
00:06:26,720 --> 00:07:07,360
|
47 |
+
learning engineers once we've deployed the first version of the model is not to retrain the model directly but it's to sit on top of the retraining strategy and babysit that strategy and try to improve the strategy itself over time so rather than training models day-to-day we're looking at metrics about how well the strategy is working how well it's solving the task of improving our model over time in response to changes to the world and the input that we provide is by tuning the strategy by changing the rules that make up the strategy to help the strategy do a better job of solving that task that's a description of the goal state of our role as an ml engineer in the real world today for most of us our job doesn't really feel like this at
|
48 |
+
|
49 |
+
13
|
50 |
+
00:07:05,520 --> 00:07:39,759
|
51 |
+
a high level because for most of us our retraining strategy is just retraining models whenever we feel like it and that's not actually as bad as it seems you can get really good results from ad hoc retraining but when you start to be able to get really consistent results when you retrain models and you're not really working on the model day-to-day anymore then it's worth starting to add some automation alternatively if you find yourself needing to retrain the model more than you know once a week or even more frequently than that to deal with changing results in the real world then it's also worth investing in automation just to save yourself time the first baseline retraining strategy that you should consider after you move
|
52 |
+
|
53 |
+
14
|
54 |
+
00:07:38,240 --> 00:08:18,479
|
55 |
+
on from ad hoc is just periodic retraining and this is what you'll end up doing in most cases in the near term so let's describe this periodic retraining strategy so at the logging stage we'll simply log everything a curation will sample uniformly at random from the data that we've logged up until we get the max number of data points that we're able to handle we're able to label or we're able to train on and then we'll label them using some automated tool our retraining trigger will just be periodic so we'll train once a week but we'll do it on the last month's data for example and then we will compute the test set accuracy after each training set a threshold on that or more likely manually review the results each time and spot check some of
|
56 |
+
|
57 |
+
15
|
58 |
+
00:08:16,400 --> 00:08:51,600
|
59 |
+
the predictions and then when we deploy the model we'll do spot evaluations of that deployed model on a few individual predictions just to make sure things look healthy and we'll move on this baseline looks something like what most companies do for automated retraining in the real world retraining periodically is a pretty good baseline and in fact it's what i would suggest doing when you're ready to start doing automated retraining but it's not going to work in every circumstance so let's talk about some of the failure modes the first category of failure modes has to do with when you have more data than you're able to log or able to label if you have a high volume of data you might need to be more careful about what data you sample
|
60 |
+
|
61 |
+
16
|
62 |
+
00:08:48,640 --> 00:09:30,160
|
63 |
+
and enrich particularly if either that data comes from a long tail distribution where you have edge cases that your model needs to perform well on but those edge cases might not be caught by just doing standard uniform random sampling or if that data is expensive to label like in a human in the loop scenario where you need custom labeling rules or labeling is part of the product in either of those cases long tail distribution or human in the loop setup you probably need to be more careful about what subset of your data that you log and enrich to be used down the road second category of where this might fail has to do with managing the cost of retraining if your model is really expensive to retrain then
|
64 |
+
|
65 |
+
17
|
66 |
+
00:09:28,720 --> 00:10:01,600
|
67 |
+
retraining it periodically is probably not going to be the most cost efficient way to go especially if you do it on a rolling window of data every single time let's say that you retrain your model every week but your data actually changes a lot every single day you're going to be leaving a lot of performance on the table by not retraining more frequently you could increase the frequency and retrain say every few hours but this is going to increase costs even further the final failure mode is situations where you have a high cost of bad predictions one thing that you should think about is every single time you retrain your model it introduces risk that risk comes from the fact that the data that you're training
|
68 |
+
|
69 |
+
18
|
70 |
+
00:10:00,320 --> 00:10:38,480
|
71 |
+
the model on might be bad in some way it might be corrupted it might have been attacked by an adversary or it might just not be representative anymore of all the cases that your model needs to perform well on so the more frequently you retrain and the more sensitive you are to failures of the model the more thoughtful you need to be about how do we make sure that we're carefully evaluating this model such that we're not unduly taking on risk too much risk from retraining frequently when you're ready to move on from periodic retraining it's time to start iterating on your strategy and this is the part of the lecture where we're going to cover a grab box of tools that you can use to help figure out how to iterate on your strategy and what
|
72 |
+
|
73 |
+
19
|
74 |
+
00:10:36,959 --> 00:11:10,320
|
75 |
+
changes the strategy to make you don't need to be familiar in depth with every single one of these tools but i'm hoping to give you a bunch of pointers here that you can use when it's time to start thinking about how to make your model better so the main takeaway from this section is going to be we're going to use monitoring and observability as a way of determining what changes we want to make to our retraining strategy and we're going to do that by monitoring just the metrics that actually matter the most important ones for us to care about and then using all of their metrics and information for debugging when we debug an issue with our model that's going to lead to potentially retraining our model but more broadly
|
76 |
+
|
77 |
+
20
|
78 |
+
00:11:09,040 --> 00:11:45,760
|
79 |
+
than that we can think of it as a change to the retraining strategy like changing our retraining triggers changing our offline tests our sampling strategies the metrics we use for observability etc and then lastly another principle for iterating on your strategy is as you get more confident in your monitoring as you get more confident that you'll be able to catch issues with your model if they occur then you can start to introduce more automation into your system so do things manually at first and then as you get more confident in your monitoring start to automate them let's talk about how to monitor and debug models in production so that we can figure out how to improve our retraining strategy the tldr here is like many parts of this
|
80 |
+
|
81 |
+
21
|
82 |
+
00:11:43,839 --> 00:12:18,160
|
83 |
+
lecture there's no real standards or best practices here yet and there's also a lot of bad advice out there the main principles that we're gonna follow here are we're gonna focus on monitoring things that really matter and also things that tend to break empirically and we're going to also compute all the other signals that you might have heard of data drift all these other sorts of things but we're primarily going to use those for debugging and observability what does it mean to monitor a model in production the way i think about it is you have some metric that you're using to assess the quality of your model like your accuracy let's say then you have a time series of how that metric changes over time and the question that you're
|
84 |
+
|
85 |
+
22
|
86 |
+
00:12:16,240 --> 00:12:51,839
|
87 |
+
trying to answer is is this bad or is this okay do i need to pay attention to this degradation or do i not need to pay attention so the questions that we'll need to answer are what metrics should we be looking at when we're doing monitoring how can we tell if those metrics are bad and it warrants an intervention and then lastly we'll talk about some of the tools that are out there to help you with this process choosing the right metric to monitor is probably the most important part of this process and here are the different types of metrics or signals that you can look at ranked in order of how valuable they are if you're able to get them the most valuable thing that you can look at is outcome data or feedback from your users
|
88 |
+
|
89 |
+
23
|
90 |
+
00:12:50,160 --> 00:13:26,079
|
91 |
+
if you're able to get access to this signal then this is by far the most important thing to look at unfortunately there's no one-size-fits-all way to do this because it just depends a lot on the specifics of the product that you're building for example if you're building a recommender system then you might measure feedback based on did the user click on the recommendation or not but if you're building a self-driving car that's not really a useful or even feasible signal to gather so you might instead gather data on whether the user intervened and grabbed the wheel to take over autopilot from the car and this is really more of a product design or product management question of how can you actually design your product in such
|
92 |
+
|
93 |
+
24
|
94 |
+
00:13:24,480 --> 00:13:59,279
|
95 |
+
a way that it that you're able to capture feedback from your users as part of that product experience and so we'll come back and talk a little bit more about this in the ml product management lecture the next most valuable signal to look at if you can get it is model performance metrics these are your offline model metrics things like accuracy the reason why this is less useful than user feedback is because of loss mismatch so i think a common experience that many ml practitioners have is you spend let's say a month trying to make your accuracy one or two percentage points better and then you deploy the new version of the model and it turns out that your users don't care they react just the same way they did
|
96 |
+
|
97 |
+
25
|
98 |
+
00:13:57,360 --> 00:14:35,120
|
99 |
+
before or even worse to that new theoretically better version of the model there's often very little excuse for not doing this at least to some degree you can just label some production data each day it doesn't have to be a ton you can do this by setting up an on-call rotation or just throwing a labeling party each day where you spend 30 minutes with your teammates you know labeling 10 or 20 data points each even just like that small amount will start to give you some sense of how your model's performance is trending over time if you're not able to measure your actual model performance metrics then the next best thing to look at are proxy metrics proxy metrics are metrics that are just correlated with bad model performance
|
100 |
+
|
101 |
+
26
|
102 |
+
00:14:33,199 --> 00:15:07,279
|
103 |
+
these are mostly domain specific so for example if you're building text generation with a language model then two examples here would be repetitive outputs and toxic outputs if you're building a recommender system then an example would be the share of personalized responses if you're seeing fewer personalized responses then that's probably an indication that your model is doing something bad if you're looking for ideas for proxy metrics edge cases can be good proxy metrics if there's certain problems that you know that you have with your model if those increase in prevalence then that might mean that your model's not doing very well that's the practical side of proxy metrics today they're very domain specific
|
104 |
+
|
105 |
+
27
|
106 |
+
00:15:06,000 --> 00:15:44,720
|
107 |
+
either you're going to have good proxy metrics or you're not but i don't think it has to be that way there's an academic direction i'm really excited about that is aimed at being able to take any metric that you care about like your accuracy and approximate it on previously unseen data so how well do we think our model is doing on this new data which would make these proxy metrics a lot more practically useful there's a number of different approaches here ranging from training an auxiliary model to predict how well your main model might do on these on this offline data to heuristics to human loop methods and so it's worth checking these out if you're interested in seeing how people might do this in two or three years one unfortunate
|
108 |
+
|
109 |
+
28
|
110 |
+
00:15:43,519 --> 00:16:21,440
|
111 |
+
result from this literature though that's worth pointing out is that it's probably not going to be possible to have a single method that you use in all circumstances to approximate how your model is doing on out of distribution data so the way to think about that is let's say that you have you're looking at the input data to predict how the model is going to perform on those input points and then the label distribution changes if you're only looking at the input points then how would you be able to take into account that label distribution change in your approximate metric but there's more theoretical rounding for this result as well all right back to our more pragmatic scheduled programming the next signal that you can look at is data
|
112 |
+
|
113 |
+
29
|
114 |
+
00:16:19,519 --> 00:16:59,680
|
115 |
+
quality and data quality testing is just a set of rules that you can apply to measure the quality of your data this is dealing with questions like how well does the data reflect reality um how comprehensive is it and how consistent is it over time some common examples of data quality testing include checking whether the data has the right schema whether the values in each of the columns are in the range that you'd expect that you have enough columns that you don't have too much missing data simple rules like that the reason why this is useful is because data problems tend to be the most common issue with machine learning models in practice so this is a report from google where they covered 15 years of different pipeline outages
|
116 |
+
|
117 |
+
30
|
118 |
+
00:16:57,600 --> 00:17:37,840
|
119 |
+
with a particular machine learning model and their main finding was that most of the outages that happened with that model did not really a lot to do with ml at all they were often distributed systems problems or also really commonly there were data problems one example that they give is a common type of failure where a data pipeline lost the permissions to read the data source that it depended on and so was starting to fail so these types of data issues are often what will cause models to fail spectacularly in production the next most helpful signal to look at is distribution drift even though distribution drift is a less useful signal than say user feedback it's still really important to be able to measure whether your data
|
120 |
+
|
121 |
+
31
|
122 |
+
00:17:35,840 --> 00:18:17,679
|
123 |
+
distributions change so why is that well your model's performance is only guaranteed if the data that it's evaluated on is sampled from the same distribution as it was trained on and this can have a huge impact in practice recent examples include total change in model behavior during the pandemic as words like corona took on new meeting or bugs and retraining pipelines that cause millions of dollars of losses for companies because they led to changing data distributions distribution drift manifests itself in different ways in the wild there's a few different types that you might see so you might have an instantaneous drift like when a model is deployed in a new domain or a bug is introduced in a re in a pre-processing
|
124 |
+
|
125 |
+
32
|
126 |
+
00:18:15,039 --> 00:18:53,120
|
127 |
+
pipeline or some big external shift like covid you could have a gradual drift like if user preferences change over time or new concepts keep getting added to your corpus you could have periodic drifts like if your user preferences are seasonal or you could have a temporary drift like if a malicious user attacks your model and each of these different types of drifts might need to be detected in slightly different ways so how do you tell if your distribution is drifted the approach we're going to take here is we're going to first select a window of good data that's going to serve as a reference going forward how do you select that reference well you can use a fixed window of production data that you believe to be healthy so
|
128 |
+
|
129 |
+
33
|
130 |
+
00:18:51,200 --> 00:19:30,080
|
131 |
+
if you think that your model was really healthy at the beginning of the month you can use that as your reference window some papers advocate for sliding this window of production data to use as your reference but in practice most of the time what most people do is they'll use something like their validation data as their reference once you have that reference data then you'll select your new window of production data to measure your distribution distance on there isn't really a super principal approach for how to select the window of data to measure drift on and it tends to be pretty problem specific so a pragmatic solution that what a lot of people do is they'll just pick one window size or even they'll just pick a few window
|
132 |
+
|
133 |
+
34
|
134 |
+
00:19:27,679 --> 00:20:08,160
|
135 |
+
sizes with some reasonable amount of data so that's not too noisy and then they'll just slide those windows and lastly once you have your reference window and your production window then you'll compare these two windows using a distribution distance metric so what metrics should you use let's start by considering the one-dimensional case where you have a particular feature that is one-dimensional and you are able to compute a density of that feature on your reference window and your production window then the way to think about this problem is you're going to have some metric that approximates the distance between these two distributions there's a few options here the ones that are commonly recommended are the kl divergence and
|
136 |
+
|
137 |
+
35
|
138 |
+
00:20:06,160 --> 00:20:39,679
|
139 |
+
the ks test unfortunately those are commonly recommended but they're also bad choices sometimes better options would be things like using the infinity norm or the one norm which are what google advocates for using or the earth mover's distance which is a bit more of a statistically principled approach and i'm not going to go into details of these metrics here but check out the blog post at the bottom if you want to learn more about why the commonly recommended ones are not so good and the other ones are better so that's the one-dimensional case if you just have a single input feature that you're trying to measure distribution distance on but in the real world for most models we have potentially many input features or
|
140 |
+
|
141 |
+
36
|
142 |
+
00:20:37,840 --> 00:21:13,919
|
143 |
+
even unstructured data that is very high dimensional so how do we deal with detecting distribution drift in those cases one thing you could consider doing is just measuring drifts on all of the features independently problem that you'll run into there is if you have a lot of features you're going to hit the multiple hypothesis testing problem and secondly this doesn't capture cross correlation so if so if you have two features and the distributions of each of those features stay the same but the correlation between the features changed then that wouldn't be captured using this type of system another common thing to do would be to measure drift only on the most important features one heuristic here is that generally
|
144 |
+
|
145 |
+
37
|
146 |
+
00:21:11,919 --> 00:21:49,039
|
147 |
+
speaking it's a lot more useful to measure drift on the outputs of the model than the inputs the reason for that is because inputs change all the time your model tends to be robust to some degree of distribution shift of the inputs but if the outputs change then that might be more indicative that there's a problem and also outputs tend to be for most machine learning models tend to be lower dimensional so it's a little bit easier to monitor you can also rank the importance of your input features and measure drift on the most important ones you can do this just heuristically using the ones that you think are important or you can compute some notion of feature importance and use that to rank the features that you want to monitor lastly
|
148 |
+
|
149 |
+
38
|
150 |
+
00:21:47,440 --> 00:22:27,679
|
151 |
+
there are metrics that you can look at that natively compute or approximate the distribution distance between high dimensional distributions and the two that are most worth checking out there are the maximum mean discrepancy and the approximate earth mover's distance the caveat here is that these are pretty hard to interpret so if you have a maximum mean discrepancy alert that's triggered that doesn't really tell you much about where to look for the potential failure that caused that distribution drift a more principled way in my opinion to measure distribution drift for high dimensional inputs to the model is to use projections the idea of a projection is you take some high dimensional input to the model or output
|
152 |
+
|
153 |
+
39
|
154 |
+
00:22:25,440 --> 00:23:08,240
|
155 |
+
an image or text or just a really large feature vector and then you run that through a function so each data point that your model makes a prediction on gets tagged by this projection function and the goal of the projection function is to reduce the dimensionality of that input then once you've reduced the dimensionality you can do your drift detection on that lower dimensional representation of the high dimensional data and the great thing about this approach is that it works for any kind of data whether it's images or text or anything else no matter what the dimensionality is or what the data type is and it's highly flexible there's many different types of projections that can be useful you can define analytical projections that are
|
156 |
+
|
157 |
+
40
|
158 |
+
00:23:05,039 --> 00:23:46,880
|
159 |
+
just functions of your input data and so these are things like looking at the mean pixel value of an image or the length of a sentence that's an input to the model or any other function that you can think of analytical projections are highly customizable they're highly interpretable and can often detect problems in practice if you don't want to use your domain knowledge to craft projections by writing analytical functions then you can also just do generic projections like random projections or statistical projections like running each of your inputs through an auto encoder something like that this is my recommendation for detecting drift for high dimensional and unstructured data and it's worth also just taking note of
|
160 |
+
|
161 |
+
41
|
162 |
+
00:23:44,799 --> 00:24:24,720
|
163 |
+
this concept of projections because we're going to see this concept pop up in a few other places as we discuss other aspects of continual learning distribution drift is an important signal to look at when you're monitoring your models and in fact it's what a lot of people think of when they think about model monitoring so why do we rank it so low on the list let's talk about the cons of looking at distribution drift i think the big one is that models are designed to be robust to some degree of distribution drift the figure on the left shows sort of a toy example to demonstrate this point which is we have a classifier that's trained to predict two classes and we've induced a synthetic distribution shift just
|
164 |
+
|
165 |
+
42
|
166 |
+
00:24:22,880 --> 00:24:59,039
|
167 |
+
shifting these points from the red ones on the top left to the bottom ones on the bottom right these two distributions are extremely different the marginal distributions in the chart on the bottom and then chart on the right-hand side have very large distance between the distributions but the model performs actually equally well on the training data as it does on the production data because the shift is just shifted directly along the classifier boundary so that's kind of a toy example that demonstrates that you know distribution shift is not really the thing that we care about when we're monitoring our models because just knowing that the distribution has changed doesn't tell us how the models has reacted to that
|
168 |
+
|
169 |
+
43
|
170 |
+
00:24:57,440 --> 00:25:35,520
|
171 |
+
distribution change and then another example that's worth illustrating is some of my research when i was in grad school was using data that was generated from a physics simulator to solve problems on real world robots and the data that we used was highly out of distribution for the test case that we cared about the data looked like these kind of very low fidelity random images like on the left and we found that by training on a huge variety of these low fidelity random images our model was able to actually generalize to real world scenario like the one on the right so huge distribution shifts intuitively between the data the model was trained on and the data it was evaluated on but it was able to perform well on both
|
172 |
+
|
173 |
+
44
|
174 |
+
00:25:33,760 --> 00:26:15,039
|
175 |
+
beyond the theoretical limitations of measuring distribution drift this can also just be hard to do in practice you have to pick window sizes correctly you have to keep all this data around you need to choose metrics you need to define projections to make your data lower dimensional so it's not a super reliable signal to look at and so that's why we advocate for looking at ones that are more correlated with the thing that actually matters the last thing you should consider looking at is your standard system metrics like cpu utilization or how much gpu memory your model is taking up things like that so those don't really tell you anything about how your model is actually performing but they can tell you when something is going
|
176 |
+
|
177 |
+
45
|
178 |
+
00:26:12,880 --> 00:26:49,840
|
179 |
+
wrong okay so this is a ranking of all the different types of metrics or signals that you could look at if you're able to compute them but to give you a more concrete recommendation here we also have to talk about how hard it is to compute these different signals in practice we'll put the sort of value of each of these types of signals on the y-axis and on the x-axis we'll talk about the feasibility like how easy is it to actually measure these things measuring outcomes or feedback has pretty wide variability in terms of how feasible it is to do depends a lot on how your product is set up and the type of problem that you're working on measuring model performance tends to be the least feasible thing to do because
|
180 |
+
|
181 |
+
46
|
182 |
+
00:26:47,360 --> 00:27:26,799
|
183 |
+
it does involve collecting some labels and so things like proxy metrics are a little bit easier to compute because they don't involve labels whereas system metrics and data quality metrics are highly feasible because there's you know great off-the-shelf libraries and tools that you can use for them and they don't involve doing anything sort of special from a machine learning perspective so the practical recommendation here is getting basic data quality checks is effectively zero regret especially if you are in the phase where you're retraining your model pretty frequently because data quality issues are one of the most common causes of bad model performance in practice and they're very easy to implement the next
|
184 |
+
|
185 |
+
47
|
186 |
+
00:27:24,000 --> 00:28:02,000
|
187 |
+
recommendation is get some way of measuring feedback or model performance or if you really can't do either of those things than a proxy metric even if that way of measuring model performance is hacky or not scalable this is the most important signal to look at and is really the only thing that will be able to reliably tell you if your model is doing what it's supposed to be doing or not doing what it's supposed to be doing and then if your model is producing low dimensional outputs like if you're doing binary classification or something like that then monitoring the output distribution the score distribution also tends to be pretty useful and pretty easy to do and then lastly as you evolve your system like once you have these
|
188 |
+
|
189 |
+
48
|
190 |
+
00:28:00,000 --> 00:28:45,120
|
191 |
+
basics in place and you're iterating on your model and you're trying to get more confident about evaluation i would encourage you to adopt a mindset about metrics that you compute that's borrowed from the concept of observability so what is the observability mindset we can think about monitoring as measuring the known unknowns so if there's four or five or ten metrics that we know that we care about accuracy latency user feedback the monitoring approach would be to measure each of those signals we might set alerts on even just a few of those key metrics on the other hand observability is about measuring the unknown unknowns it's about having the power to be able to ask arbitrary questions about your system when it
|
192 |
+
|
193 |
+
49
|
194 |
+
00:28:42,640 --> 00:29:19,520
|
195 |
+
breaks for example how does my accuracy break out across all of the different regions that i've been considering what is my distribution drift for each of my features not signals that you would necessarily set alerts on because you don't have any reason to believe that these signals are things that are going to cause problems in the future but when you're in the mode of debugging being able to look at these things is really helpful and if you choose to adopt the observability mindset which i would highly encourage you to do especially in machine learning because it's just very very critical to be able to answer arbitrary questions to debug what's going on with your model then there's a few implications first
|
196 |
+
|
197 |
+
50
|
198 |
+
00:29:17,919 --> 00:29:55,760
|
199 |
+
you should really keep around the context or the raw data that makes up the metrics that you're computing because you're gonna want to be able to drill all the way down to potentially the data points themselves that make up the metric that has degraded it's also as a side note helpful to keep around the raw data to begin with for things like retraining the second implication is that you can kind of go crazy with measurement you can define lots of different metrics on anything that you can think of that might potentially go wrong in the future but you shouldn't necessarily set alerts on each of those or at least not very or at least not very precise alerts because you don't want to have the problem of getting too
|
200 |
+
|
201 |
+
51
|
202 |
+
00:29:54,399 --> 00:30:31,120
|
203 |
+
many alerts you want to be able to use these signals for the purpose of debugging when something is going wrong drift is a great example of this it's very useful for debugging because let's say that your accuracy was lower yesterday than it was the rest of the month well one way that you might debug that is by trying to see if there's any input fields or projections that look different that distinguish yesterday from the rest of the month those might be indicators of what is going wrong with your model and the last piece of advice i have on model monitoring and observability is it's very important to go beyond aggregate metrics let's say that your model is 99 accurate and let's say that's really good but for one
|
204 |
+
|
205 |
+
52
|
206 |
+
00:30:29,120 --> 00:31:06,799
|
207 |
+
particular user who happens to be your most important user it's only 50 accurate can we really still consider that mobs to be good and so the way to deal with this is by flagging important subgroups or cohorts of data and being able to slice and dice performance along those cohorts and potentially even set alerts on those cohorts some examples of this are categories of users that you don't want your model to be biased against or categories of users that are particularly important for your business or just ones where you might expect your model to perform differently on them like if you're rolled out in a bunch of different regions or a bunch of different languages it might be helpful to look at how your performance breaks
|
208 |
+
|
209 |
+
53
|
210 |
+
00:31:04,799 --> 00:31:44,399
|
211 |
+
out across those regions or languages all right that was a deep dive in different metrics that you can look at for the purpose of monitoring the next question that we'll talk about is how to tell if those metrics are good or bad there's a few different options for doing this that you'll see recommended one that i don't recommend and i alluded to this a little bit before is two sample statistical tests like aks test the reason why i don't recommend this is because if you think about what these two sample tests are actually doing they're trying to return a p-value for the likelihood that this data and this data are not coming from the same distribution and when you have a lot of data that just means that even really tiny shifts
|
212 |
+
|
213 |
+
54
|
214 |
+
00:31:42,399 --> 00:32:21,679
|
215 |
+
in the distribution will get very very small p values because even if the distributions are only a tiny bit different if you have a ton of samples you'll be able to very confidently say that those are different distributions but that's not actually what we care about since models are robust to small amounts of distribution shift better options than statistical tests include the following you can have fixed rules like there should never be any null values in this column you can have specific ranges so your accuracy should always be between 90 and 95 there can be predicted ranges so the accuracy is within what an off-the-shelf anomaly detector thinks is reasonable or there's also unsupervised detection of
|
216 |
+
|
217 |
+
55
|
218 |
+
00:32:19,600 --> 00:32:55,200
|
219 |
+
just new patterns in this signal and the most commonly used ones in practice are the first two fixed rules and specified ranges but predicted ranges via anomaly detection can also be really useful especially if there's some seasonality in your data the last topic i want to cover on model monitoring is the different tools that are available for monitoring your models the first category is system monitoring tools so this is a pretty mature category with a bunch of different companies in it and these are tools that help you detect problems with any software system not just machine learning models and they provide functionality for setting alarms when things go wrong and most of the cloud providers have pretty decent
|
220 |
+
|
221 |
+
56
|
222 |
+
00:32:53,679 --> 00:33:31,440
|
223 |
+
solutions here but if you want something better you can look at one of the observability or monitoring specific tools like honeycomb or datadog you can monitor pretty much anything in these systems and so it kind of raises the question of whether we should just use systems like this for monitoring machine learning metrics as well there's a great blog post on exactly this topic that i recommend reading if you're interested in learning about why this is feasible but pretty painful thing to do and so maybe it's better to use something that's ml specific here in terms of ml specific tools there's some open source tools the two most popular ones are evidently ai and y logs and these are both similar in that you provide them
|
224 |
+
|
225 |
+
57
|
226 |
+
00:33:29,120 --> 00:34:07,200
|
227 |
+
with samples of data and they produce a nice report that tells you where is their distribution shifts how have your model metrics changed etc the big limitation of these tools is that they don't solve the data infrastructure and the scale problem for you you still need to be able to get all that data into a place where you can analyze it with these tools and in practice that ends up being one of the hardest parts about this problem the main difference between these tools is that why logs is a little bit more focused on gathering data from the edge and the way they do that is by aggregating the data into statistical profiles at inference time itself so you don't need to transport all the data from your inference devices back to your
|
228 |
+
|
229 |
+
58
|
230 |
+
00:34:05,600 --> 00:34:44,960
|
231 |
+
cloud which in some cases can be very helpful and lastly there's a bunch of different sas vendors for ml monitoring and observability my startup gantry has some functionality around this and there's a bunch of other options as well all right so we've talked about model monitoring and observability and the goal of monitoring and observability in the context of continual learning is to give you the signals that you need to figure out what's going wrong with your continual learning system and how you can change the strategy in order to influence that outcome next we're going to talk about for each of the stages in the continual learning loop what are the different ways that you might be able to go beyond the basics and
|
232 |
+
|
233 |
+
59
|
234 |
+
00:34:43,520 --> 00:35:19,040
|
235 |
+
use what we learned from monitoring and observability to improve those stages the first stage of the continual learning loop is logging as a reminder the goal of logging is to get data from your model to a place where you can analyze it and the key question to answer is what data should i actually log for most of us the best answer is just to log all of your data storage is cheap and it's better to have data than not have it but there's some situations where you can't do that for example if you have just too much traffic going through your model to the point where it's too expensive to log all of it um if you have data privacy concerns if you're not actually allowed to look at your users data or if you're running
|
236 |
+
|
237 |
+
60
|
238 |
+
00:35:16,880 --> 00:35:56,400
|
239 |
+
your model at the edge and it's too expensive to get all that data back because you don't have enough network bandwidth if you can't log all of your data there's two things that you can do the first is profiling the idea of profiling is that rather than sending all the data back to your cloud and then using that to do monitoring or observability or retraining instead you can compute statistical profiles of your data on the edge that describe the data distribution that you're seeing so the nice thing about this is it's great from a data security perspective because it doesn't require you to send all the data back home it minimizes your storage cost and lastly you don't miss things that happen in the tails which is an issue for the
|
240 |
+
|
241 |
+
61
|
242 |
+
00:35:55,040 --> 00:36:30,560
|
243 |
+
next approach that we'll describe the place to use this really is primarily for security critical applications the other approach is sampling in sampling you'll just take certain data points and send those back home the advantage of sampling is that it has minimal impact on your inference resources so you don't have to actually spend the computational budget to compute profiles and you get to have access to the raw data for debugging and retraining and so this is what we recommend doing for pretty much every other application should describe in a little bit more detail how statistical profiles work because it's kind of interesting let's say that you have a stream of data that's coming in from two classes cat and dog and you
|
244 |
+
|
245 |
+
62
|
246 |
+
00:36:28,800 --> 00:37:13,200
|
247 |
+
want to be able to estimate what is the distribution of cat and dog over time without looking at all of the raw data so for example maybe in the past you saw three examples of a dog and two examples of a cat a statistical profile that you can store that summarizes this data is just a histogram so the histogram says we saw three examples of a dog and two a cat and over time as more and more examples stream in rather than actually storing those data we can just increment the histogram and keep track of how many total examples of each category that we've seen over time and so like a neat fact of statistics is that for a lot of the statistics that you might be interested in looking at quantiles means accuracy other statistics you can
|
248 |
+
|
249 |
+
63
|
250 |
+
00:37:10,720 --> 00:37:49,280
|
251 |
+
compute you can approximate those statistics pretty accurately by using statistical profiles called sketches that have minimal size so if you're interested in going on a tangent and learning more about an interesting topic in computer science that's one i'd recommend checking out next step in the continual learning loop is curation to remind you the goal of curation is to take your infinite stream of production data which is potentially unlabeled and turn this into a finite reservoir of data that has all the enrichments that it needs like labels to train your model on the key question that we need to answer here is similar to the one that we need to answer when we're sampling data at log time which is what data
|
252 |
+
|
253 |
+
64
|
254 |
+
00:37:47,280 --> 00:38:25,760
|
255 |
+
should we select for enrichment the most basic strategy for doing this is just sampling data randomly but especially as your model gets better most of the data that you see in production might not actually be that helpful for improving your model and if you do this you could miss rare classes or events like if you have an event that happens you know one time in every 10 000 examples in production but you are trying to improve your model on it then you might not sample any examples of that at all if you just sample randomly a way to improve on random sampling is to do what's called stratified sampling the idea here is to sample specific proportions of data points from various subpopulations so common ways that you might stratify for
|
256 |
+
|
257 |
+
65
|
258 |
+
00:38:23,200 --> 00:39:05,520
|
259 |
+
sampling in ml could be sampling to get a balance among classes or sampling to get a balance among categories that you don't want your model to be biased against like gender lastly the most advanced and interesting strategy for picking data to enrich is to curate data points that are somehow interesting for the purpose of improving your model and there's a few different ways of doing this that we'll cover the first is to have this notion of interesting data be driven by your users which will come from user feedback and feedback loops the second is to determine what is interesting data yourself by defining error cases or edge cases and then the third is to let an algorithm define this for you and this is a category of techniques known as
|
260 |
+
|
261 |
+
66
|
262 |
+
00:39:04,079 --> 00:39:40,800
|
263 |
+
active learning if you already have a feedback loop or a way of gathering feedback from your users in your machine learning system which you really should if you can then this is probably the easiest and potentially also the most effective way to pick interesting data for the purpose of curation and the way this works is you'll pick data based on signals that come from your users that they didn't like your prediction so this could be the user churned after interacting with your model it could be that they filed a support ticket about a particular prediction the model made it could be that they you know click the thumbs down button that you put in your products that they changed the label that your model produced for them or
|
264 |
+
|
265 |
+
67
|
266 |
+
00:39:38,880 --> 00:40:19,599
|
267 |
+
that they intervened with an automatic system like they grab the wheel of their autopilot system if you don't have user feedback or if you need even more ways of gathering interesting data from your system then probably the second most effective way of doing this is by doing manual error analysis the way this works is we will look at the errors that our model is making we will reason about the different types of failure modes that we're seeing we'll try to write functions or rules that help capture these error modes and then we'll use those functions to gather more data that might represent those error cases two sub-categories of how to do this one is what i would call similarity-based curation and the way this works is if
|
268 |
+
|
269 |
+
68
|
270 |
+
00:40:17,359 --> 00:40:57,119
|
271 |
+
you have some data that represents your errors or data that you think might be an error then you can pick an individual data point or a handful of data points and run a nearest neighbor similarity search algorithm to find the data points in your stream that are the closest to the one that your model is maybe making a mistake on the second way of doing this which is potentially more powerful but a little bit harder to do is called projection based curation the way this works is rather than just picking an example and grabbing the nearest neighbors of that example instead we are going to find an error case like the one on the bottom right where there's a person crossing the street with a bicycle and then we're gonna write a
|
272 |
+
|
273 |
+
69
|
274 |
+
00:40:54,400 --> 00:41:35,599
|
275 |
+
function that attempts to detect that error case and this could just be trading a simple neural network or it could be just writing some heuristics the advantage of doing similarity-based curation is that it's really easy and fast right like you just have to click on a few examples and you'll be able to get things that are similar to those examples this is beginning to be widely used in practice thanks to the explosion of vector search databases on the market it's relatively easy to do this and what this is particularly good for is events that are rare they don't occur very often in your data set but they're pretty easy to detect like if you had a problem with your self-driving car where you have llamas crossing the road a
|
276 |
+
|
277 |
+
70
|
278 |
+
00:41:33,760 --> 00:42:10,480
|
279 |
+
similarity search-based algorithm would probably do a reasonably good job of detecting other llamas in your training set on the other hand projection-based curation requires some domain knowledge because it requires you to think a little bit more about what is the particular error case that you're seeing here and write a function to detect it but it's good for more subtle error modes where a similarity search algorithm might be too coarse-screened it might find examples that look similar on the surface to the one that you are detecting but don't actually cause your model to fail the last way to curate data is to do so automatically using a class of algorithms called active learning the way active learning works
|
280 |
+
|
281 |
+
71
|
282 |
+
00:42:08,240 --> 00:42:42,640
|
283 |
+
is given a large amount of unlabeled data what we're going to try to do is determine which data points would improve model performance the most if you were to label those data points next and train on them and the way that these algorithms work is by defining a sampling strategy or a query strategy and then you rank all of your unlabeled examples using a scoring function that defines that strategy and take the ones with the highest scores and send them off to be labeled i'll give you a quick tour of some of the different types of scoring functions that are out there and if you want to learn more about this then i'd recommend the blog post linked on the bottom you have scoring functions that sample data points that the model
|
284 |
+
|
285 |
+
72
|
286 |
+
00:42:40,560 --> 00:43:16,800
|
287 |
+
is very unconfident about you have scoring functions that are defined by trying to predict what is the error that the model would make on this data point if we had a label for it you have scoring functions that are designed to detect data that doesn't look anything like the data that you've already trained on so can we distinguish these data points from our training data if so maybe those are the ones that we should sample and label we have scoring functions that are designed to take a huge data set of points and boil it down to the small number of data points that are most representative of that distribution lastly there's scoring functions that are designed to detect data points that if we train on them we
|
288 |
+
|
289 |
+
73
|
290 |
+
00:43:15,119 --> 00:43:49,839
|
291 |
+
think would have a big impact on training so where they would have a large expected gradient or would tend to cause the model to change its mind so that's just a quick tour of different types of scoring functions that you might implement uncertainty based scoring tends to be the one that i see the most in practice largely because it's very simple to implement and tends to produce pretty decent results but it's worth diving a little bit deeper into this if you do decide to go down this route if you're paying close attention you might have noticed that there's a lot of similarity between some of the ways that we do data curation the way that we pick interesting data points and the way that we do monitoring i
|
292 |
+
|
293 |
+
74
|
294 |
+
00:43:47,359 --> 00:44:25,839
|
295 |
+
think that's no coincidence monitoring and data curation are two sides of the same coin they're both interested in solving the problem of finding data points where the model may not be performing well or where we're uncertain about how the model is performing on those data points so for example user driven curation is kind of another side of the same coin of monitoring user feedback metrics both of these things look at the same metrics stratified sampling is a lot like doing subgroup or cohort analysis making sure that we're getting enough data points from subgroups that are important or making sure that our metrics are not degrading on those subgroups projections are used in both data curation and monitoring to
|
296 |
+
|
297 |
+
75
|
298 |
+
00:44:23,599 --> 00:45:05,200
|
299 |
+
take high dimensional data and break them down into distributions that we think are interesting for some purpose and then in active learning some of the techniques also have mirrors in monitoring like predicting the loss on an unlabeled data point or using the model's uncertainty on that data point next let's talk about some case studies of how data curation is done in practice the first one is a blog post on how openai trained dolly2 to detect malicious inputs to the model there's two techniques that they used here they used active learn learning using uncertainty sampling to reduce the false positives for the model and then they did a manual curation actually they did it kind of an automated way but they did
|
300 |
+
|
301 |
+
76
|
302 |
+
00:45:02,319 --> 00:45:41,599
|
303 |
+
similarity search to find similar examples to the ones that the model was not performing well on the next example from tesla this is a talk i love from andre carpathi about how they build a data flywheel of tesla and they use two techniques here one is feedback loops so gathering information about when users intervene with the autopilot and then the second is manual curation via projections for edge case detection and so this is super cool because they actually have infrastructure that allows ml engineers when they discover a new edge case to write an edge case detector function and then actually deploy that on the fleet that edge case detector not only helps them curate data but it also helps them decide which data to sample
|
304 |
+
|
305 |
+
77
|
306 |
+
00:45:40,000 --> 00:46:17,280
|
307 |
+
which is really powerful the last case study i want to talk about is from cruz they also have this concept of building a continual learning machine and the main way they do that is through feedback loops that's kind of a quick tour of what some people actually use to build these data curation systems in practice there's a few tools that are emerging to help with data curation scale nucleus and aquarium are relatively similar tools that are focused on computer vision and they're especially good at nearest neighbor based sampling at my startup gantry we're also working on some tools to help with this across a wide variety of different applications concrete recommendations on data curation random sampling is probably a fine starting
|
308 |
+
|
309 |
+
78
|
310 |
+
00:46:14,400 --> 00:46:51,359
|
311 |
+
point for most use cases but if you have a need to avoid bias or if there's rare classes in your data set you probably should start even with stratified sampling or at the very least introduce that pretty soon after you start sampling if you have a feedback loop as part of your machine learning system and i hope you're taking away from this that how helpful it is to have these feedback loops then user-driven curation is kind of a no-brainer this is definitely something that you should be doing and is probably going to be the thing that is the most effective in the early days of improving your model if you don't have a feedback loop then using confidence-based active learning is a next best bet because it's pretty easy
|
312 |
+
|
313 |
+
79
|
314 |
+
00:46:49,599 --> 00:47:25,680
|
315 |
+
to implement and works okay in practice and then finally as your model performance increases you're gonna have to look harder and harder for these challenging training points at the end of the day if you want to squeeze the maximum performance out of your model there's no avoiding manually looking at your your data and trying to find interesting failure modes there's no substitute for knowing your data after we've curated our infinite stream of unlabeled data down to a reservoir of labeled data that's ready to potentially train on the next thing that we'll need to decide is what trigger are we going to use to retrain and the main takeaway here is that moving to automated retraining is not always necessary in many cases just
|
316 |
+
|
317 |
+
80
|
318 |
+
00:47:23,839 --> 00:47:59,520
|
319 |
+
manually refraining is good enough but it can save you time and lead to better better model performance so it's worth understanding when it makes sense to actually make that move the main prerequisite for moving to automated retraining is just being able to reproduce model performance when retraining in a fairly automated fashion so if you're able to do that and you are not really working on this model very actively anymore then it's probably worth implementing some automated retraining if you just find yourself retraining this model super frequently then it'll probably save you time to implement this earlier when it's time to move to automated training the main recommendation is just keep it simple and retrain periodically like once a
|
320 |
+
|
321 |
+
81
|
322 |
+
00:47:57,280 --> 00:48:32,319
|
323 |
+
week rerun training on that schedule the main question though is how do you pick that training schedule so what i recommend doing here is doing a little bit of like measurement to figure out what is a reasonable retraining schedule you can plot your model performance over time and then compare to how the model would have performed if you had retrained on different frequencies you can just make basic assumptions here like if you retrain you'll be able to reach the same level of accuracy and what you're going to be doing here is looking at these different retraining schedules and looking at the area between these curves like on the chart on the top right the area between these two curves is your opportunity cost in
|
324 |
+
|
325 |
+
82
|
326 |
+
00:48:30,880 --> 00:49:09,920
|
327 |
+
terms of like how much model performance you're leaving on the table by not retraining more frequently and then once you have a number of these different opportunity costs for different retraining frequencies you can plot those opportunity costs and then you can sort of run the ad hoc exercise of trying to balance you know where is the rate trade-off point for us between the performance gain that we get from retraining more frequently and the cost of that retraining which is both the cost of running the retraining itself as well as the operational cost that we'll introduce by needing to evaluate that model more frequently that's what i'd recommend doing in practice a request i have for research is i think it'd be great i think it's
|
328 |
+
|
329 |
+
83
|
330 |
+
00:49:08,240 --> 00:49:43,839
|
331 |
+
very feasible to have a technique that would automatically determine the optimal retraining strategy based on how performance tends to decay how sensitive you are to that performance decay your operational costs and your retraining costs so i think you know eventually we won't need to do the manual data analysis every single time to determine this retraining frequency if you're more advanced then the other thing you can consider doing is retraining based on performance triggers this looks like setting triggers on metrics like accuracy and only retraining when that accuracy dips below a predefined threshold some big advantages to doing this are you can react a lot more quickly to unexpected changes that
|
332 |
+
|
333 |
+
84
|
334 |
+
00:49:42,160 --> 00:50:16,240
|
335 |
+
happen in between your normal training schedule it's more cost optimal because you can skip a retraining if it wouldn't actually improve your model's performance but the big cons here are that since you don't know in advance when you're gonna be retraining you need to have good instrumentation and measurement in place to make sure that when you do retrain you're doing it for the right reasons and that the new model is actually doing well these techniques i think also don't have a lot of good theoretical justification and so if you are the type of person that wants to understand you know why theoretically this should work really well i don't think you're going to find that today and probably the most important con is
|
336 |
+
|
337 |
+
85
|
338 |
+
00:50:14,720 --> 00:50:50,800
|
339 |
+
that this adds a lot of operational complexity because instead of just knowing like hey at 8 am i know my retraining is going live and so i can check in on that instead this retraining could happen at any time so your whole system needs to be able to handle that and that just introduces a lot of new infrastructure that you'll need to build and then lastly an idea that probably won't be relevant to most of you but is worth thinking about because i think it's it could be really powerful in the future is online learning where you train on every single data point as it comes in it's not very commonly used in practice but one sort of relaxation of this idea that is used fairly frequently in practice is online adaptation the way
|
340 |
+
|
341 |
+
86
|
342 |
+
00:50:48,160 --> 00:51:28,319
|
343 |
+
online adaptation works is it operates not the level of retraining the whole model itself but it operates on the level of adapting the policy that sits on top of the model what is a policy a policy is the set of rules that takes the raw prediction that the model made like the score or the raw output of the model and then turns that into the actual thing that the user sees so like a classification threshold is an example of a policy or if you have many different versions of your model that you're ensembling what are the weights of those ensembles or even which version of the model is this particular request going to be routed to in online adaptation rather than retraining the model on each new data point as it comes
|
344 |
+
|
345 |
+
87
|
346 |
+
00:51:26,000 --> 00:52:07,359
|
347 |
+
in instead we use an algorithm like multi-arm bandits to tune the weights of this policy online as more data comes in so if your data changes really frequently in practice or you are have a hard time training your model frequently enough to adapt to it then online adaptation is definitely worth looking into next we've fired off a trigger to start a training job and the next question we need to answer is among all of the labeled data in our reservoir of data which specific data points should we train on for this particular training job most of the time in deep learning we'll just train on all the data that we have available to us but if you have too much data to do that then depending on whether recency of data is
|
348 |
+
|
349 |
+
88
|
350 |
+
00:52:05,680 --> 00:52:43,599
|
351 |
+
an important signal to determine whether that data is useful you'll either slide a window to make sure that you're looking at the most recent data therefore in many cases the most useful data or we'll use techniques like sampling or online batch selection if not and a more advanced technique to be aware of that is hard to execute in practice today is continual fine-tuning we'll talk about that as well so the first option is just to train on all available data so you have a data set that you'll keep track of that your last model was trained on then over time between your last training and your next training you'll have a bunch of new data come in you'll curate some of that data then you'll just take all that data
|
352 |
+
|
353 |
+
89
|
354 |
+
00:52:42,160 --> 00:53:20,240
|
355 |
+
you'll add it to the data set and you'll train the new model on the combined data set so the keys here are you need to keep this data version controlled so that you know which data was added to each training iteration and it's also important if you want to be able to evaluate the model properly to keep track of the rules that you use to curate that new data so if you're sampling in a way that's not uniform from your distribution you should keep track of the rules that you use to sample so that you can determine where that data actually came from second option is to bias your sampling toward more recent data by using a sliding window the way this works is at each point when you train your model you look
|
356 |
+
|
357 |
+
90
|
358 |
+
00:53:17,119 --> 00:53:55,760
|
359 |
+
backward and you gather a window of data that leads up to the current moment and then at your next training you slide that window forward and so there might be a lot of overlap potentially between these two data sets but you have all the new data or like a lot of the new data and you get rid of the oldest data in order to form the new data set couple key things to do here are it's really helpful to look at the different statistics between the old and new data sets to catch bugs like if you have a large change in a particular distribution of one of the columns that might be indicative of a new bug that's been introduced and one challenge that you'll find here is just comparing the old and the new versions of the models
|
360 |
+
|
361 |
+
91
|
362 |
+
00:53:52,960 --> 00:54:30,960
|
363 |
+
since they are not trained on data that is related in a very straightforward way if you're working in a setting where you need to sample data you can't train on all of your data but there isn't any reason to believe that recent data is much better than older data then you can sample data from your reservoir using a variety of techniques the most promising of which is called online batch selection normally if we were doing stochastic gradient descent then what we do is we would sample mini batches on every single training iteration until we run out of data or until we run out of compute budget in online batch selection instead what we do is before each training step we sample a larger batch like much larger than the mini batch
|
364 |
+
|
365 |
+
92
|
366 |
+
00:54:29,359 --> 00:55:08,480
|
367 |
+
that we ultimately want to train on we rank each of the items in the mini batch according to a label aware selection function and then we take the top n items according to that function and train on those the paper on the right describes a label aware selection function called the reducible holdout loss selection function that performs pretty well on some relatively large data sets and so if you're going to look into this technique this is probably where i would start the last option that we'll discuss which is not recommended to do today is continual fine-tuning the way this works is rather than retraining from scratch every single time instead just only train your existing model on just new data the reason why you might
|
368 |
+
|
369 |
+
93
|
370 |
+
00:55:06,880 --> 00:55:41,839
|
371 |
+
want to do this primarily is because it's much more cost effective the paper on the right shares some findings from grubhub where they found a 45x cost improvement by doing this technique relative to sliding windows but the big challenge here is that unless you're very careful it's easy for the model to forget what it learned in the past so the upshot is that you need to have pretty mature evaluation to be able to be very careful that your model is performing well on all the types of data that it needs to perform well on before it's worth implementing something like this so now we've triggered a retraining we have selected the data points that are going to go into the training job we've trained our model you know run our
|
372 |
+
|
373 |
+
94
|
374 |
+
00:55:39,839 --> 00:56:13,839
|
375 |
+
hyperparameter sweeps if we want to and we have a new candidate model that we think is ready to go into production the next step is to test that model the goal of this stage is to produce a report that our team can sign off on that answers the question of whether this new model is good enough or whether it's better than the old model and the key question here is what should go into that report again this is a place where there's not a whole lot of standardization but the recommendation we have here is to compare your current model with the previous version of the model on all the following all the metrics that you care about all of the slices or subsets of data that you've flagged is important all of the edge
|
376 |
+
|
377 |
+
95
|
378 |
+
00:56:11,520 --> 00:56:47,760
|
379 |
+
cases that you've defined and in a way that's adjusted to account for any sample and bias that you might have introduced by your curation strategy an example of what such a report could look like is the following across the top we have all of our metrics in this case accuracy precision and recall and then all on the left are all of the data sets and slices that we're looking at so the things to notice here are we have our main validation set which is like what most people use for evaluating models but rather than just looking at that those numbers in the aggregate we also break it out across a couple of different categories in this case the age of the user and the age of the account that belongs to that user and
|
380 |
+
|
381 |
+
96
|
382 |
+
00:56:45,680 --> 00:57:25,200
|
383 |
+
then below the main validation set we also have more specific validation sets that correspond to particular error cases that we know have given our model trouble or a previous version of our model trouble in the past these could be like just particular edge cases that you've found in the past like maybe your model handles examples of poor grammar very poorly or it doesn't know what some gen z slang terms mean like these are examples of failure modes you've found for your model in the past that get rolled into data sets to test the next version of your model in continual learning just like how training sets are dynamic and change over time evaluation sets are dynamic as well as you curate new data you should add some of it to
|
384 |
+
|
385 |
+
97
|
386 |
+
00:57:23,440 --> 00:57:58,400
|
387 |
+
your training sets but also add some of it to your evaluation sets for example if you change how you do sampling you might want to add some of that newly sampled data to your eval set as well to make sure that your eval set represents that new sampling strategy or if you discover a new edge case instead of only adding that edge case to the training set it's worth holding out some examples of that edge case as a particular unit test to be part of that offline evaluation suite two corollaries to note of the fact that evaluation sets are dynamic the first is that you should also version control your evaluation sets just like you do your training sets the second is that if your data is evolving really quickly then part of the
|
388 |
+
|
389 |
+
98
|
390 |
+
00:57:56,559 --> 00:58:36,079
|
391 |
+
data that you hold out should always be the most recent data the data from you know the past day or the past hour or whatever it is to make sure that your model is generalizing well to new data once you have the basics in place a more advanced thing that you can look into here that i think is pretty promising is the idea of expectation tests the way that expectation tests work are you take pairs of examples where you know the relationship so let's say that you're doing sentiment analysis and you have a sentence that says my brother is good if you make the positive word in that sentence more positive and instead say my brother is great then you would expect your sentiment classifier to become even more positive about that sentence these types
|
392 |
+
|
393 |
+
99
|
394 |
+
00:58:33,200 --> 00:59:16,400
|
395 |
+
of tests have been explored in nlp as well as recommendation systems and they're really good for testing whether your model generalizes in predictable ways and so they give you more granular information than just aggregate performance metrics about how your model does on previously unseen data one observation to make here is that just like how data curation is highly analogous to monitoring so is offline testing just like in monitoring we want to observe our metrics not just in aggregate but also across all of our important subsets of data and across all of our edge cases one difference between these two is that you will in general have different metrics available in offline testing and online testing for
|
396 |
+
|
397 |
+
100
|
398 |
+
00:59:13,440 --> 00:59:51,520
|
399 |
+
example you are much more likely to have labels available offline in fact you always have labels available offline because that is uh how you're going to train your model but online you're much more likely to have feedback and so even though these two ideas are highly analogous and should share a lot of metrics and definitions of subsets and things like that one point of friction that you that will occur between online monitoring and offline testing is that the metrics are a little bit different so one direction for research that i think would be really exciting to see more of is using offline metrics like accuracy to predict online metrics like user engagement and then lastly once we've tested our candidate model offline
|
400 |
+
|
401 |
+
101
|
402 |
+
00:59:49,359 --> 01:00:28,319
|
403 |
+
it's time to deploy it and evaluate it online so we talked about this last time so i don't want to reiterate too much but as a reminder if you have the infrastructural capability to do so then you should do things like first running your model in shadow mode before you um actually roll it out to real users then running an a b test to make sure that users are responding to it better than they did the old model then once you have a successful av test rolling it out to all of your users but doing so gradually and then finally if you see issues during that rollout just to roll it back to the old version of the model and try to figure out what went wrong so we talked about the different stages of continual learning from
|
404 |
+
|
405 |
+
102
|
406 |
+
01:00:25,280 --> 01:01:07,440
|
407 |
+
logging data to curating it to triggering retraining testing the model and rolling out to production and we also talked about monitoring and observability which is about giving you a set of rules that you can use to tell whether your retraining strategy needs to change and we observed that in a bunch of different places the fundamental elements that you study in monitoring like projections and user feedback and model uncertainty are also useful for different parts of the continual learning process and that's no coincidence i see monitoring and continual learning as two sides of the same coin we should be using the signals that we monitor to very directly change our retraining strategy so the last thing i want to do is just try to
|
408 |
+
|
409 |
+
103
|
410 |
+
01:01:05,839 --> 01:01:42,160
|
411 |
+
make this a little bit more concrete by walking through an example of a workflow that you might have from detecting an issue in your model to altering the strategy this section describes more of a feature state until you've invested pretty heavily in infrastructure it's going to be hard to make it feel as seamless as this in practice but i wanted to mention it anyway because i think it provides like a nice end state for what we should aspire to in our continual learning workflows the thing you would need to have in place before you're able to actually execute what i'm going to describe next is a place to store and version all of the elements of your strategy which include metric definitions for both online and offline
|
412 |
+
|
413 |
+
104
|
414 |
+
01:01:40,319 --> 01:02:20,240
|
415 |
+
testing performance thresholds for those metrics definitions of any of the projections that you want to use for monitoring and also for data curation subgroups or cohorts that you think are particularly important to break out your metrics along the logic that defines how you do data curation whether it's sampling rules or anything else and then finally the specific data sets that you use for each different run of your training or evaluation our example continue improvement loop starts with an alert and in this case that alert might be our user feedback got worse today and so our job is now to figure out what's going on so the next thing we'll use is some of our observability tools to investigate what's going on here and we
|
416 |
+
|
417 |
+
105
|
418 |
+
01:02:18,720 --> 01:02:56,079
|
419 |
+
might you know run some subgroup analyses and look at some raw data and figure out that the problem is really mostly isolated to new users the next thing that we might do is do error analysis so look at those new users and the data points that they're sending us and try to reason about why those data points are performing worse and what we might discover is something like our model was trained assuming that people were going to write emails but now users are submitting a bunch of text that has things that aren't normally found in emails like emojis and that's causing our model problems so here's where we might make the first change to our retraining strategy we could define new users as a cohort of interest because we
|
420 |
+
|
421 |
+
106
|
422 |
+
01:02:54,400 --> 01:03:30,160
|
423 |
+
never want performance to decline on new users again without getting an alert about that then we could define a new projection that helps us detect data that has emojis and add that projection to our observability metrics so that anytime in the future if we want as part of an investigation to see how our performance differs between users that are submitting emojis and ones that are not we can always do that without needing to rewrite the projection next we might search our reservoir for historical examples that contain emojis so that we can use them to make our model better and then adjust our strategy by adding that subset of data as a new test case so now whenever we test the model going forward we'll
|
424 |
+
|
425 |
+
107
|
426 |
+
01:03:27,839 --> 01:04:03,920
|
427 |
+
always see how it performs on data with emojis in addition to adding emoji examples to as a test case we would also curate them and add them back into our training set and do a retraining then once we have the new model that's trained we'll get this new model comparison report which will include also the new cohort that we defined as part of this process and the new emoji edge case data set that we defined and then finally if we're doing manual deployment we can just deploy that model and that completes the continual improvement loop so to wrap up what do i want you to take away from this continual learning is a complicated rapidly evolving and poorly understood topic so this is an area to pay attention to if you're interested in
|
428 |
+
|
429 |
+
108
|
430 |
+
01:04:02,640 --> 01:04:39,599
|
431 |
+
seeing how the cutting edge of production machine learning is evolving and the main takeaway from this lecture is we broke down the concept of a retraining strategy which consists of a number of different pieces definitions of metrics subgroups of interest projections that help you break down and analyze high dimensional data performance thresholds for your metrics logic for curating new data sets and the specific data sets that you're going to use for retraining and evaluation at a high level the way that we can think about our role as machine learning engineers once we've deployed the first version of the model is to use rules that we define as part of our observability and monitoring suite to iterate on the strategy for many of you
|
432 |
+
|
433 |
+
109
|
434 |
+
01:04:37,920 --> 01:05:13,280
|
435 |
+
in the near term this won't feel that different from just using that data to retrain the model however you'd like to but i think thinking of this as a strategy that you can tune at a higher level is a productive way of understanding it as you move towards more and more automated retraining lastly just like every other aspect of the ml life cycle that we talked about in this course our main recommendation here is to start simple and add complexity later in the context of continual learning what that means is it's okay to retrain your models manually to start as you get more advanced you might want to automate retraining and you also might want to think more intelligently about how you sample data to make sure that you're
|
436 |
+
|
437 |
+
110
|
438 |
+
01:05:11,119 --> 01:05:19,200
|
439 |
+
getting the data that is most useful for improving your model going forward that's all for this week see you next time
|
440 |
+
|
documents/lecture-07.md
ADDED
@@ -0,0 +1,285 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
---
|
2 |
+
description: Building on Transformers, GPT-3, CLIP, StableDiffusion, and other Large Models.
|
3 |
+
---
|
4 |
+
|
5 |
+
# Lecture 7: Foundation Models
|
6 |
+
|
7 |
+
<div align="center">
|
8 |
+
<iframe width="720" height="405" src="https://www.youtube-nocookie.com/embed/Rm11UeGwGgk?list=PL1T8fO7ArWleMMI8KPJ_5D5XSlovTW_Ur" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
|
9 |
+
</div>
|
10 |
+
|
11 |
+
Lecture by [Sergey Karayev](https://twitter.com/sergeykarayev).
|
12 |
+
Notes by [James Le](https://twitter.com/le_james94) and [Vishnu Rachakonda](https://www.linkedin.com/in/vrachakonda/).<br />
|
13 |
+
Published September 19, 2022.
|
14 |
+
[Download slides](https://fsdl.me/2022-lecture-07-slides).
|
15 |
+
|
16 |
+
Foundation models are very large models trained on very large datasets that
|
17 |
+
can be used for multiple downstream tasks.
|
18 |
+
|
19 |
+
We’ll talk about fine-tuning, Transformers, large language models, prompt engineering, other applications of large models, and vision and text-based models like CLIP and image generation.
|
20 |
+
|
21 |
+
![alt_text](media/image-1.png "image_tooltip")
|
22 |
+
|
23 |
+
## 1 - Fine-Tuning
|
24 |
+
|
25 |
+
Traditional ML uses a lot of data and a large model, which takes a long time. But if you have a small amount of data, you can use **transfer learning** to benefit from the training on a lot of data. You basically use the same model that you have pre-trained, add a few layers, and unlock some weights.
|
26 |
+
|
27 |
+
We have been doing this in computer vision since 2014. Usually, you train a model on ImageNet, keep most of the layers, and replace the top three or so layers with newly learned weights. Model Zoos are full of these models like AlexNet, ResNet, etc. in both TensorFlow and PyTorch.
|
28 |
+
|
29 |
+
In NLP, pre-training was initially limited only to the first step: word embeddings. The input to a language model is words. One way you can encode them to be a vector (instead of a word) is **one-hot encoding**. Given a large matrix of words, you can make an embedding matrix and embed each word into a real-valued vector space. This new matrix is down to the dimension on the order of a thousand magnitude. Maybe those dimensions correspond to some semantic notion.
|
30 |
+
|
31 |
+
![alt_text](media/image-2.png "image_tooltip")
|
32 |
+
|
33 |
+
|
34 |
+
[Word2Vec](https://jalammar.github.io/illustrated-word2vec/) trained a model like this in 2013. It looked at which words frequently co-occur together. The learning objective was to maximize cosine similarity between their embeddings. It could do cool demos of vector math on these embeddings. For example, when you embed the words “king,” “man,” and “woman,” you can do vector math to get a vector that is close to the word “queen” in this embedding space.
|
35 |
+
|
36 |
+
It’s useful to see more context to embed words correctly because words can play different roles in the sentence (depending on their context). If you do this, you’ll improve accuracy on all downstream tasks. In 2018, a number of models such as ELMO and ULMFit [published pre-trained LSTM-based models that set state-of-the-art results on most NLP tasks](https://ruder.io/nlp-imagenet/).
|
37 |
+
|
38 |
+
But if you look at the model zoos today, you won’t see any LSTMs. You’ll only see Transformers everywhere. What are they?
|
39 |
+
|
40 |
+
|
41 |
+
## 2 - Transformers
|
42 |
+
|
43 |
+
Transformers come from a paper called “[Attention Is All You Need](https://arxiv.org/abs/1706.03762)” in 2017, which introduced a groundbreaking architecture that sets state-of-the-art results on translation first and a bunch of NLP tasks later.
|
44 |
+
|
45 |
+
![alt_text](media/image-3.png "image_tooltip")
|
46 |
+
|
47 |
+
|
48 |
+
It has a decoder and an encoder. For simplicity, let’s take a look at the encoder. The interesting components here are self-attention, positional encoding, and layer normalization.
|
49 |
+
|
50 |
+
|
51 |
+
### Self-Attention
|
52 |
+
|
53 |
+
![alt_text](media/image-4.png "image_tooltip")
|
54 |
+
|
55 |
+
|
56 |
+
Basic self-attention follows: Given an input sequence of vectors x of size t, we will produce an output sequence of tensors of size t. Each tensor is a weighted sum of the input sequence. The weight here is just a dot product of the input vectors. All we have to do is to make that weighted vector sum to 1. We can represent it visually, as seen below. The input is a sentence in English, while the output is a translation in French.
|
57 |
+
|
58 |
+
![alt_text](media/image-5.png "image_tooltip")
|
59 |
+
|
60 |
+
|
61 |
+
So far, there are no learned weights and no sequence order. Let’s learn some weights!* If we look at the input vectors, we use them in three ways: as **queries** to compare two other input vectors, as **keys** to compare them to input vectors and produce the corresponding output vector, and as **values **to sum up all the input vectors and produce the output vector.
|
62 |
+
* We can process each input vector with three different matrices to fulfill these roles of query, key, and value. We will have three weighted matrices, and everything else remains the same. If we learn these matrices, we learn attention.
|
63 |
+
* It’s called **multi-head attention **because we learn different sets of weighted matrices simultaneously, but we implement them as just a single matrix.
|
64 |
+
|
65 |
+
So far, we have learned the query, key, and value. Now we need to introduce some notion of order to the sequence by encoding each vector with its position. This is called **positional encoding**.
|
66 |
+
|
67 |
+
|
68 |
+
### Positional Encoding
|
69 |
+
|
70 |
+
![alt_text](media/image-6.png "image_tooltip")
|
71 |
+
|
72 |
+
|
73 |
+
Let’s say we have an input sequence of words
|
74 |
+
|
75 |
+
]* The first step is to embed the words into a dense, real-valued word embedding. This part can be learned.
|
76 |
+
* However, there is no order to that embedding. Thus, we will add another embedding that only encodes the position.
|
77 |
+
* In brief, the first embedding encodes only the content, while the second embedding encodes only the position. If you add them, you now have information about both the content and the position.
|
78 |
+
|
79 |
+
|
80 |
+
### Layer Normalization
|
81 |
+
|
82 |
+
![alt_text](media/image-7.png "image_tooltip")
|
83 |
+
|
84 |
+
|
85 |
+
Neural network layers work best when the input vectors have uniform mean and standard deviation in each dimension. As activations flow through the network, the means and standard deviations get blown out by the weight matrices. [Layer normalization](https://arxiv.org/pdf/1803.08494.pdf) is a hack to re-normalize every activation to where we want them between each layer.
|
86 |
+
|
87 |
+
That’s it! All the amazing results you’ll see from now on are just increasingly large Transformers with dozens of layers, dozens of heads within each layer, large embedding dimensions, etc. The fundamentals are the same. It’s just the Transformer model.
|
88 |
+
|
89 |
+
[Anthropic](https://www.anthropic.com/) has been publishing great work lately to investigate why Transformers work so well. Check out these publications:
|
90 |
+
|
91 |
+
1. [A Mathematical Framework for Transformer Circuits](https://transformer-circuits.pub/2021/framework/index.html)
|
92 |
+
2. [In-Context Learning and Induction Heads](https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html)
|
93 |
+
3. [Toy Models of Superposition](https://transformer-circuits.pub/2022/toy_model/index.html)
|
94 |
+
|
95 |
+
|
96 |
+
## 3 - Large Language Models
|
97 |
+
|
98 |
+
|
99 |
+
### Models
|
100 |
+
|
101 |
+
GPT and [GPT-2](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf) came out in 2018 and 2019, respectively. The name means “generative pre-trained Transformers.” They are decoder-only models and use masked self-attention. This means: At a poi that at the output sequence, you can only attend to two input sequence vectors that came before that point in the sequence.
|
102 |
+
|
103 |
+
![alt_text](media/image-8.png "image_tooltip")
|
104 |
+
|
105 |
+
|
106 |
+
These models were trained on 8 million web pages. The largest model has 1.5 billion parameters. The task that GPT-2 was trained on is predicting the next word in all of this text on the web. They found that it works increasingly well with an increasing number of parameters.
|
107 |
+
|
108 |
+
![alt_text](media/image-9.png "image_tooltip")
|
109 |
+
|
110 |
+
|
111 |
+
[BERT](https://arxiv.org/abs/1810.04805) came out around the same time as Bidirectional Encoder Representations for Transformers. It is encoder-only and does not do attention masking. It has 110 million parameters. During training, BERT masks out random words in a sequence and has to predict whatever the masked word is.
|
112 |
+
|
113 |
+
![alt_text](media/image-10.png "image_tooltip")
|
114 |
+
|
115 |
+
|
116 |
+
[T5 (Text-to-Text Transformer)](https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html) came out in 2020. The input and output are both text strings, so you can specify the task that the model supposes to be doing. T5 has an encoder-decoder architecture. It was trained on the C4 dataset (Colossal Clean Crawled Corpus), which is 100x larger than Wikipedia. It has around 10 billion parameters. You can download [the open-sourced model](https://github.com/google-research/text-to-text-transfer-transformer#released-model-checkpoints) and run it on your machine.
|
117 |
+
|
118 |
+
[GPT-3](https://openai.com/blog/gpt-3-apps/) was one of the state-of-the-art models in 2020. It was 100x larger than GPT/GPT-2 with 175 billion parameters. Because of its size, GPT-3 exhibits unprecedented capabilities of few-shot and zero-shot learning. As seen in the graph below, the more examples you give the model, the better its performance is. The larger the model is, the better its performance is. If a larger model was trained, it would be even better.
|
119 |
+
|
120 |
+
![alt_text](media/image-11.png "image_tooltip")
|
121 |
+
|
122 |
+
|
123 |
+
OpenAI also released [Instruct-GPT](https://openai.com/blog/instruction-following/) earlier this year. It had humans rank different GPT-3 outputs and used reinforcement learning to fine-tune the model. Instruct-GPT was much better at following instructions. OpenAI has put this model, titled ‘text-davinci-002,’ in their API. It is unclear how big the model is. It could be ~10x smaller than GPT-3.
|
124 |
+
|
125 |
+
![alt_text](media/image-12.png "image_tooltip")
|
126 |
+
|
127 |
+
|
128 |
+
DeepMind released [RETRO (Retrieval-Enhanced Transformers)](https://arxiv.org/pdf/2112.04426.pdf) in 2021. Instead of learning language and memorizing facts in the model’s parameters, why don’t we just learn the language in parameters and retrieve facts from a large database of internal text? To implement RETRO, they encode a bunch of sentences with BERT and store them in a huge database with more than 1 trillion tokens. At inference time, they fetch matching sentences and attend to them. This is a powerful idea because RETRO is connected to an always updated database of facts.
|
129 |
+
|
130 |
+
![alt_text](media/image-13.png "image_tooltip")
|
131 |
+
|
132 |
+
|
133 |
+
DeepMind released another model called [Chinchilla](https://gpt3demo.com/apps/chinchilla-deepmind) in 2022, which observed the scaling laws of language models. They [trained over 400 language models](https://arxiv.org/pdf/2203.15556.pdf) from 70 million to 16 billion parameters on 5 billion to 500 billion tokens. They then derived formulas for optimal model and training set size, given a fixed compute budget. They found that most large language models are “undertrained,” meaning they haven’t seen enough data.
|
134 |
+
|
135 |
+
![alt_text](media/image-14.png "image_tooltip")
|
136 |
+
|
137 |
+
|
138 |
+
To prove this, they trained a large model called [Gopher](https://gpt3demo.com/apps/deepmind-gopher) with 280 billion parameters and 300 billion tokens. With Chincilla, they reduced the number of parameters to 70 billion and used four times as much data (1.4 trillion tokens). Chinchilla not only matched Gopher’s performance but actually exceeded it. Check out [this LessWrong post](https://www.lesswrong.com/posts/6Fpvch8RR29qLEWNH/chinchilla-s-wild-implications) if you want to read about people’s opinions on it.
|
139 |
+
|
140 |
+
|
141 |
+
### Vendors
|
142 |
+
|
143 |
+
OpenAI offers four model sizes: Ada, Babbage, Curie, and Davinci. [Each has a different price](https://openai.com/api/pricing/) and different capabilities. Most of the impressive GPT-3 results on the Internet came from Davinci. These correspond to 350M, 1.3B, 6.7B, and 175B parameters. You can also fine-tune models for an extra cost. The quota you get when you sign up is pretty small, but you can raise it over time. You have to apply for review before going into production.
|
144 |
+
|
145 |
+
There are some alternatives to OpenAI:
|
146 |
+
|
147 |
+
1. [Cohere AI](https://cohere.ai/) has similar models for [similar prices](https://cohere.ai/pricing).
|
148 |
+
2. [AI21](https://www.ai21.com/) also has some large models.
|
149 |
+
3. There are also open-source large language models, such as [Eleuther GPT-NeoX](https://www.eleuther.ai/projects/gpt-neox/) (20B parameters), [Facebook OPT-175B](https://ai.facebook.com/blog/democratizing-access-to-large-scale-language-models-with-opt-175b/) (175B parameters), and [BLOOM from BigScience](https://bigscience.huggingface.co/blog/bloom) (176B parameters). If you want to use one of these open-source models but do not have to be responsible for deploying it, you can use [HuggingFace’s inference API](https://huggingface.co/inference-api).
|
150 |
+
|
151 |
+
|
152 |
+
## 4 - Prompt Engineering
|
153 |
+
|
154 |
+
GPT-3 and other large language models are mostly alien technologies. It’s unclear how they exactly work. People are finding out how they work by playing with them. We will cover some notable examples below. Note that if you play around with them long enough, you are likely to discover something new.
|
155 |
+
|
156 |
+
GPT-3 is surprisingly bad at reversing words due to **tokenization**: It doesn’t see letters and words as humans do. Instead, it sees “tokens,” which are chunks of characters. Furthermore, it gets confused with long-ish sequences. Finally, it has trouble merging characters. For it to work, you have to teach GPT-3 the algorithm to use to get around its limitations. Take a look at [this example from Peter Welinder](https://twitter.com/npew/status/1525900849888866307).
|
157 |
+
|
158 |
+
![alt_text](media/image-15.jpg "image_tooltip")
|
159 |
+
|
160 |
+
|
161 |
+
Another crazy prompt engineering is “Let’s Think Step By Step.” This comes from a paper called “[Large Language Models are Zero-Shot Reasoners](https://arxiv.org/pdf/2205.11916.pdf).” Simply adding “Let’s Think Step By Step” into the prompt increases the accuracy of GPT-3 on one math problem dataset from 17% to 78% and another math problem dataset from 10% to 40%.
|
162 |
+
|
163 |
+
![alt_text](media/image-16.png "image_tooltip")
|
164 |
+
|
165 |
+
|
166 |
+
Another unintuitive thing is that the context length of GPT is long. You can give it a **long instruction** and it can return the desired output. [This example](https://twitter.com/goodside/status/1557381916109701120) shows how GPT can output a CSV file and write the Python code as stated. You can also use **formatting tricks **to reduce the training cost, as you can do multiple tasks per call. Take a look at [this example](https://twitter.com/goodside/status/1561569870822653952) for inspiration.
|
167 |
+
|
168 |
+
We have to be careful since our models might get pwnage or possessed. User input in the prompt may instruct the model to do something naughty. This input can even reveal your prompt to [prompt injection attacks](https://simonwillison.net/2022/Sep/12/prompt-injection/) and [possess your AI](https://twitter.com/goodside/status/1564112369806151680). This actually works in GPT-3-powered production apps.
|
169 |
+
|
170 |
+
![alt_text](media/image-17.png "image_tooltip")
|
171 |
+
|
172 |
+
|
173 |
+
Further work is needed before putting GPT-3-powered apps into production. There are some tools for prompt engineering such as [PromptSource](https://github.com/bigscience-workshop/promptsource) and [OpenPrompt](https://github.com/thunlp/OpenPrompt), but we definitely need better tools.
|
174 |
+
|
175 |
+
|
176 |
+
## 5 - Other Applications
|
177 |
+
|
178 |
+
|
179 |
+
### Code
|
180 |
+
|
181 |
+
![alt_text](media/image-18.png "image_tooltip")
|
182 |
+
|
183 |
+
|
184 |
+
One notable application of large foundation models is **code generation**. With a 40- billion-parameter Transformer model pre-trained on all the Github code it could find, [DeepMind Alphacode](https://www.deepmind.com/blog/competitive-programming-with-alphacode) was able to achieve an above-average score on the Codeforce competition. To do this, they used a model to generate a large set of potential solutions and another model to winnow down the options by actually executing them.
|
185 |
+
|
186 |
+
The general idea to highlight from this is **filtering the outputs of a model**. You can have a separate model that does filtering, or you can have some kind of verification + validation process. This can really significantly boost accuracy. OpenAI demonstrates impressive results on [different math word problems](https://openai.com/blog/grade-school-math/), as seen below.
|
187 |
+
|
188 |
+
![alt_text](media/image-19.png "image_tooltip")
|
189 |
+
|
190 |
+
|
191 |
+
Code generation has moved into products of late, like [Github Copilot](https://github.blog/2022-06-21-github-copilot-is-generally-available-to-all-developers/). We highly recommend trying it out! Another option for a similar tool is [replit’s new tool for coding](https://blog.replit.com/ai).
|
192 |
+
|
193 |
+
We’re just getting started with the applications of foundation models to the programming workflow. In fact, things are about to start getting really wild. [A recent paper](https://arxiv.org/pdf/2207.14502.pdf) showed that a large language model that generated its own synthetic puzzles to learn to code could improve significantly. **Models are teaching themselves to get better!**
|
194 |
+
|
195 |
+
![alt_text](media/image-20.png "image_tooltip")
|
196 |
+
|
197 |
+
|
198 |
+
Playing around with systems like GPT-3 and their ability to generate code can feel quite remarkable! Check out some fun experiments Sergey ran ([here](https://twitter.com/sergeykarayev/status/1569377881440276481) and [here](https://twitter.com/sergeykarayev/status/1570848080941154304)).
|
199 |
+
|
200 |
+
![alt_text](media/image-21.jpg "image_tooltip")
|
201 |
+
|
202 |
+
### Semantic Search
|
203 |
+
|
204 |
+
**Semantic search** is another interesting application area. If you have texts like words, sentences, paragraphs, or even whole documents, you can embed that text with large language models to get vectors. If you have queries in sentences or paragraphs, you can also embed them in the same way. With this function, you can generate embeddings and easily find semantic overlap by examining the cosine similarity between embedding vectors.
|
205 |
+
|
206 |
+
![alt_text](media/image-22.png "image_tooltip")
|
207 |
+
|
208 |
+
|
209 |
+
Implementing this semantic search is hard. Computations on large, dense vectors with float data types are intensive. Companies like Google and Facebook that use this approach have developed libraries like [FAISS](https://towardsdatascience.com/using-faiss-to-search-in-multidimensional-spaces-ccc80fcbf949) and [ScaNN](https://cloud.google.com/blog/topics/developers-practitioners/find-anything-blazingly-fast-googles-vector-search-technology) to solve the challenges of implementing semantic search.
|
210 |
+
|
211 |
+
Some open-source options for this include [Haystack from DeepSet](https://www.deepset.ai/haystack) and [Jina.AI](https://github.com/jina-ai/jina). Other vendor options include [Pinecone](https://www.pinecone.io/), [Weaviate](https://weaviate.io/), [Milvus](https://milvus.io/), [Qdrant](https://qdrant.tech/), [Google Vector AI Matching Engine](https://cloud.google.com/vertex-ai/docs/matching-engine/overview), etc.
|
212 |
+
|
213 |
+
|
214 |
+
### Going Cross-Modal
|
215 |
+
|
216 |
+
Newer models are bridging the gap between data modalities (e.g. using both vision and text). One such model is [the Flamingo model](https://storage.googleapis.com/deepmind-media/DeepMind.com/Blog/tackling-multiple-tasks-with-a-single-visual-language-model/flamingo.pdf), which uses a special model component called a **perceiver resampler** (an attention module that translates images into fixed-length sequences of tokens).
|
217 |
+
|
218 |
+
![alt_text](media/image-23.png "image_tooltip")
|
219 |
+
|
220 |
+
|
221 |
+
Another paper about [Socratic Models](https://socraticmodels.github.io/) was recently published. The author trained several large models (a vision model, a language model, and an audio model) that are able to interface with each other using language prompts to perform new tasks.
|
222 |
+
|
223 |
+
Finally, the concept of “Foundation Models” came from the paper “[On the Opportunities and Risks of Foundation Models](https://arxiv.org/abs/2108.07258)” by researchers at Stanford Institute for Human-Centered AI. We think “Large Language Models” or “Large Neural Networks” might be more useful terms.
|
224 |
+
|
225 |
+
|
226 |
+
## 6 - CLIP and Image Generation
|
227 |
+
|
228 |
+
Now, let's talk about some of the most exciting applications of this kind of model: in vision!
|
229 |
+
|
230 |
+
In a 2021 OpenAI paper called “[Learning transferrable visual models from natural language supervision](https://arxiv.org/abs/2103.00020)”, **CLIP (Contrastive Language–Image Pre-training)** was introduced. In this paper, the authors encode text via Transforms, encode images via ResNets or Visual Transformers, and apply contrastive training to train the model. Contrastive training matches correct image and text pairs using cosine similarity. The code for this is tremendously simple!
|
231 |
+
|
232 |
+
![alt_text](media/image-24.png "image_tooltip")
|
233 |
+
|
234 |
+
|
235 |
+
With this powerful trained model, you can map images and text using embeddings, even on unseen data. There are two ways of doing this. One is to use a “linear probe” by training a simple logistic regression model on top of the features CLIP outputs after performing inference. Otherwise, you can use a “zero-shot” technique that encodes all the text labels and compares them to the encoded image. Zero-shot tends to be better, but not always.
|
236 |
+
|
237 |
+
Since OpenAI CLIP was released in an open-source format, there have been many attempts to improve it, including [the OpenCLIP model](https://github.com/mlfoundations/open_clip), which actually outperforms CLIP.
|
238 |
+
|
239 |
+
To clarify, CLIP doesn’t go directly from image to text or vice versa. It uses embeddings. This embedding space, however, is super helpful for actually performing searches across modalities. This goes back to our section on vector search. There are so many cool projects that have come out of these efforts! (like [this](https://rom1504.github.io/clip-retrieval/) and [this](https://github.com/haltakov/natural-language-image-search))
|
240 |
+
|
241 |
+
To help develop mental models for these operations, consider how to actual perform **image captioning** (image -> text) and image generation (text -> image). There are two great examples of this written in [the ClipCap paper](https://arxiv.org/pdf/2111.09734.pdf). At a high level, image captioning is performed through training a separate model to mediate between a frozen CLIP, which generates a series of word embeddings, and a frozen GPT-2, which takes these word embeddings and generates texts.
|
242 |
+
|
243 |
+
The intermediate model is a Transformer model that gets better at modeling images and captions.
|
244 |
+
|
245 |
+
![alt_text](media/image-25.png "image_tooltip")
|
246 |
+
|
247 |
+
|
248 |
+
In **image generation**, the most well-known approach is taken by [DALL-E 2 or unCLIP](https://cdn.openai.com/papers/dall-e-2.pdf). In this method, two additional components are introduced to a CLIP system, a prior that maps from text embedding to image embeddings and a decoder that maps from image embedding to image. The prior exists to solve the problem that many text captions can accurately work for an image.
|
249 |
+
|
250 |
+
![alt_text](media/image-26.png "image_tooltip")
|
251 |
+
|
252 |
+
|
253 |
+
In DALL-E 2’s case, they use an approach for the prior called **a diffusion model**. [Diffusion models](https://towardsdatascience.com/diffusion-models-made-easy-8414298ce4da) are trained to denoise data effectively through training on incrementally noisy data.
|
254 |
+
|
255 |
+
![alt_text](media/image-27.png "image_tooltip")
|
256 |
+
|
257 |
+
|
258 |
+
In DALL-E 2, the diffusion method is applied to the **prior** model, which trains its denoising approach on a sequence of encoded text, CLIP text embedding, the diffusion timestamp, and the noised CLIP embedding, all so it can predict the un-noised CLIP image embedding. In doing so, it helps us bridge the gap between the raw text caption to the model, which can be infinitely complicated and “noisy”, and the CLIP image embedding space.
|
259 |
+
|
260 |
+
![alt_text](media/image-28.png "image_tooltip")
|
261 |
+
|
262 |
+
|
263 |
+
The **decoder** helps us go from the prior’s output of an image embedding to an image. This is a much simpler approach for us to understand. We apply a U-Net structure to a diffusion training process that is able to ultimately “de-noise” the input image embedding and output an image.
|
264 |
+
|
265 |
+
![alt_text](media/image-29.png "image_tooltip")
|
266 |
+
|
267 |
+
|
268 |
+
The results of this model are incredible! You can even generate images and merge images using CLIP embeddings. There are all kinds of funky ways of playing with the embeddings to create various image outputs.
|
269 |
+
|
270 |
+
![alt_text](media/image-30.png "image_tooltip")
|
271 |
+
|
272 |
+
|
273 |
+
Other models of interest are Parti and StableDiffusion.
|
274 |
+
|
275 |
+
* Google published [Parti](https://parti.research.google/) soon after DALL-E 2. Parti uses a VQGAN model instead of a diffusion model, where the image is represented as a sequence of high-dimensional tokens).
|
276 |
+
* [StableDiffusion](https://stability.ai/blog/stable-diffusion-public-release) has been released publicly, so definitely [check it out](https://github.com/CompVis/latent-diffusion)! It uses a “latent diffusion” model, which diffuses the image in a low-dimensional latent space and decodes the image back into a pixel space.
|
277 |
+
|
278 |
+
![alt_text](media/image-31.png "image_tooltip")
|
279 |
+
|
280 |
+
|
281 |
+
There has been an absolute explosion of these applications. Check out these examples on [image-to-image](https://twitter.com/DiffusionPics/status/1568219366097039361/), [video generation](https://twitter.com/jakedowns/status/1568343105212129280), and [photoshop plugins](https://www.reddit.com/r/StableDiffusion/comments/wyduk1/). The sky is the limit.
|
282 |
+
|
283 |
+
Prompting these models is interesting and can get pretty involved. Someday this may even be tool and code-based. You can learn from other people on [Lexica](https://lexica.art/) and [promptoMANIA](https://promptomania.com/).
|
284 |
+
|
285 |
+
It’s truly a remarkable time to be involved with AI models as they scale to new heights.
|
documents/lecture-08.md
ADDED
@@ -0,0 +1,713 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
---
|
2 |
+
description: Building ML-powered products and the teams who create them
|
3 |
+
---
|
4 |
+
|
5 |
+
# Lecture 8: ML Teams and Project Management
|
6 |
+
|
7 |
+
<div align="center">
|
8 |
+
<iframe width="720" height="405" src="https://www.youtube-nocookie.com/embed/a54xH6nT4Sw?list=PL1T8fO7ArWleMMI8KPJ_5D5XSlovTW_Ur" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
|
9 |
+
</div>
|
10 |
+
|
11 |
+
Lecture by [Josh Tobin](https://twitter.com/josh_tobin_).
|
12 |
+
Notes by [James Le](https://twitter.com/le_james94) and [Vishnu Rachakonda](https://www.linkedin.com/in/vrachakonda/).<br />
|
13 |
+
Published September 26, 2022.
|
14 |
+
[Download slides](https://fsdl.me/2022-lecture-08-slides).
|
15 |
+
|
16 |
+
## 0 - Why is this hard?
|
17 |
+
|
18 |
+
Building any product is hard:
|
19 |
+
|
20 |
+
- You have to hire great people.
|
21 |
+
|
22 |
+
- You have to manage and develop those people.
|
23 |
+
|
24 |
+
- You have to manage your team's output and make sure your vectors are
|
25 |
+
aligned.
|
26 |
+
|
27 |
+
- You have to make good long-term technical choices and manage
|
28 |
+
technical debt.
|
29 |
+
|
30 |
+
- You have to manage expectations from leadership.
|
31 |
+
|
32 |
+
- You have to define and communicate requirements with stakeholders.
|
33 |
+
|
34 |
+
Machine Learning (ML) adds complexity to that process:
|
35 |
+
|
36 |
+
- ML talent is expensive and scarce.
|
37 |
+
|
38 |
+
- ML teams have a diverse set of roles.
|
39 |
+
|
40 |
+
- Projects have unclear timelines and high uncertainty.
|
41 |
+
|
42 |
+
- The field is moving fast, and ML is the "[high-interest credit card
|
43 |
+
of technical
|
44 |
+
debt](https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf)."
|
45 |
+
|
46 |
+
- Leadership often doesn't understand ML.
|
47 |
+
|
48 |
+
- ML products fail in ways that are hard for laypeople to understand.
|
49 |
+
|
50 |
+
In this lecture, we'll talk about:
|
51 |
+
|
52 |
+
1. ML-related **roles** and their required skills.
|
53 |
+
|
54 |
+
2. How to **hire** ML engineers (and how to get hired).
|
55 |
+
|
56 |
+
3. How ML teams are **organized** and fit into the broader
|
57 |
+
organization.
|
58 |
+
|
59 |
+
4. How to **manage** an ML team and ML products.
|
60 |
+
|
61 |
+
5. **Design** considerations for ML products.
|
62 |
+
|
63 |
+
## 1 - Roles
|
64 |
+
|
65 |
+
### Common Roles
|
66 |
+
|
67 |
+
Let's look at the most common ML roles and the skills they require:
|
68 |
+
|
69 |
+
- The **ML Product Manager** works with the ML team, other business
|
70 |
+
functions, the end-users, and the data owners. This person designs
|
71 |
+
docs, creates wireframes, and develops a plan to prioritize and
|
72 |
+
execute ML projects.
|
73 |
+
|
74 |
+
- The **MLOps/ML Platform Engineer** builds the infrastructure to make
|
75 |
+
models easier and more scalable to deploy. This person handles the
|
76 |
+
ML infrastructure that runs the deployed ML product using
|
77 |
+
platforms like AWS, GCP, Kafka, and other ML tooling vendors.
|
78 |
+
|
79 |
+
- The **ML Engineer** trains and deploys prediction models. This
|
80 |
+
person uses tools like TensorFlow and Docker to work with
|
81 |
+
prediction systems running on real data in production.
|
82 |
+
|
83 |
+
- The **ML Researcher** trains prediction models, often those that are
|
84 |
+
forward-looking or not production-critical. This person uses
|
85 |
+
libraries like TensorFlow and PyTorch on notebook environments to
|
86 |
+
build models and reports describing their experiments.
|
87 |
+
|
88 |
+
- The **Data Scientist** is a blanket term used to describe all of the
|
89 |
+
roles above. In some organizations, this role entails answering
|
90 |
+
business questions via analytics. This person can work with
|
91 |
+
wide-ranging tools from SQL and Excel to Pandas and Scikit-Learn.
|
92 |
+
|
93 |
+
![](./media/image9.png)
|
94 |
+
|
95 |
+
### Skills Required
|
96 |
+
|
97 |
+
What skills are needed for these roles? The chart below displays a nice
|
98 |
+
visual - where the horizontal axis is the level of ML expertise and the
|
99 |
+
size of the bubble is the level of communication and technical writing
|
100 |
+
(the bigger, the better).
|
101 |
+
|
102 |
+
![](./media/image4.png)
|
103 |
+
|
104 |
+
- The **MLOps** is primarily a software engineering role, which often
|
105 |
+
comes from a standard software engineering pipeline.
|
106 |
+
|
107 |
+
- The **ML Engineer** requires a rare mix of ML and Software
|
108 |
+
Engineering skills. This person is either an engineer with
|
109 |
+
significant self-teaching OR a science/engineering Ph.D. who works
|
110 |
+
as a traditional software engineer after graduate school.
|
111 |
+
|
112 |
+
- The **ML Researcher** is an ML expert who usually has an MS or Ph.D.
|
113 |
+
degree in Computer Science or Statistics or finishes an industrial
|
114 |
+
fellowship program.
|
115 |
+
|
116 |
+
- The **ML Product Manager** is just like a traditional Product
|
117 |
+
Manager but with a deep knowledge of the ML development process
|
118 |
+
and mindset.
|
119 |
+
|
120 |
+
- The **Data Scientist** role constitutes a wide range of backgrounds,
|
121 |
+
from undergraduate to Ph.D. students.
|
122 |
+
|
123 |
+
There is an important distinction between a task ML engineer and a
|
124 |
+
platform ML engineer, coined by Shreya Shankar in [this blog
|
125 |
+
post](https://www.shreya-shankar.com/phd-year-one/):
|
126 |
+
|
127 |
+
1. **Task ML engineers** are responsible for maintaining specific ML
|
128 |
+
pipelines. They only focus on ensuring that these ML models are
|
129 |
+
healthy and updated frequently. They are often overburdened.
|
130 |
+
|
131 |
+
2. **Platform ML engineers** help task ML engineers automate tedious
|
132 |
+
parts of their jobs. They are called MLOps/ML Platform engineers
|
133 |
+
in our parlance.
|
134 |
+
|
135 |
+
## 2 - Hiring
|
136 |
+
|
137 |
+
### The AI Talent Gap
|
138 |
+
|
139 |
+
In 2018 (when we started FSDL), the AI talent gap was the main story.
|
140 |
+
There were so few people who understood this technology, so the biggest
|
141 |
+
block for organizations was that they couldn't find people who were good
|
142 |
+
at ML.
|
143 |
+
|
144 |
+
In 2022, the AI talent gap persists. But it tends to be less of a
|
145 |
+
blocker than it used to be because we have had four years of folks
|
146 |
+
switching careers into ML and software engineers emerging from
|
147 |
+
undergraduate with at least a couple of ML classes under their belts.
|
148 |
+
|
149 |
+
The gap tends to be in folks that understand more than just the
|
150 |
+
underlying technology but also have experience in seeing how ML fails
|
151 |
+
and how to make ML successful when it's deployed. That's the reality of
|
152 |
+
how difficult it is to hire ML folks today, especially those with
|
153 |
+
**production experience**.
|
154 |
+
|
155 |
+
### Sourcing
|
156 |
+
|
157 |
+
Because of this shallow talent pool and the skyrocketing demand, hiring
|
158 |
+
for ML positions is pretty hard. Typical ML roles come in the following
|
159 |
+
structure:
|
160 |
+
|
161 |
+
- ML Adjacent roles: ML product manager, DevOps, Data Engineer
|
162 |
+
|
163 |
+
- Core ML Roles: ML Engineer, ML Research/ML Scientist
|
164 |
+
|
165 |
+
- Business analytics roles: Data Scientist
|
166 |
+
|
167 |
+
For ML-adjacent roles, traditional ML knowledge is less important, as
|
168 |
+
demonstrated interest, conversational understanding, and experience can
|
169 |
+
help these professionals play an impactful role on ML teams. Let's focus
|
170 |
+
on how to hire for **the core ML roles**.
|
171 |
+
|
172 |
+
![](./media/image6.png)
|
173 |
+
|
174 |
+
|
175 |
+
While there's no perfect way to **hire ML engineers**, there's
|
176 |
+
definitely a wrong way to hire them, with extensive job descriptions
|
177 |
+
that demand only the best qualifications (seen above). Certainly, there
|
178 |
+
are many good examples of this bad practice floating around.
|
179 |
+
|
180 |
+
- Rather than this unrealistic process, consider hiring for software
|
181 |
+
engineering skills, an interest in ML, and a desire to learn. You
|
182 |
+
can always train people in the art and science of ML, especially
|
183 |
+
when they come with strong software engineering fundamentals.
|
184 |
+
|
185 |
+
- Another option is to consider adding junior talent, as many recent
|
186 |
+
grads come out with good ML knowledge nowadays.
|
187 |
+
|
188 |
+
- Finally, and most importantly, be more specific about what you need
|
189 |
+
the position and professional to do. It's impossible to find one
|
190 |
+
person that can do everything from full-fledged DevOps to
|
191 |
+
algorithm development.
|
192 |
+
|
193 |
+
To **hire ML researchers**, here are our tips:
|
194 |
+
|
195 |
+
- Evaluate the quality of publications, over the quantity, with an eye
|
196 |
+
toward the originality of the ideas, the execution, etc.
|
197 |
+
|
198 |
+
- Prioritize researchers that focus on important problems instead of
|
199 |
+
trendy problems.
|
200 |
+
|
201 |
+
- Experience outside academia is also a positive, as these researchers
|
202 |
+
may be able to transition to industry more effectively.
|
203 |
+
|
204 |
+
- Finally, keep an open mind about research talent and consider
|
205 |
+
talented people without PhDs or from adjacent fields like physics,
|
206 |
+
statistics, etc.
|
207 |
+
|
208 |
+
To find quality candidates for these roles, here are some ideas for
|
209 |
+
sourcing:
|
210 |
+
|
211 |
+
- Use standard sources like LinkedIn, recruiters, on-campus
|
212 |
+
recruiting, etc.
|
213 |
+
|
214 |
+
- Monitor arXiv and top conferences and flag the first authors of
|
215 |
+
papers you like.
|
216 |
+
|
217 |
+
- Look for good implementations of papers you like.
|
218 |
+
|
219 |
+
- Attend ML research conferences (NeurIPS, ICML, ICLR).
|
220 |
+
|
221 |
+
![](./media/image7.png)
|
222 |
+
|
223 |
+
As you seek to recruit, stay on top of what professionals want and make
|
224 |
+
an effort to position your company accordingly. ML practitioners want to
|
225 |
+
be empowered to do great work with interesting data. Building a culture
|
226 |
+
of learning and impact can help recruit the best talent to your team.
|
227 |
+
Additionally, sell sell sell! Talent needs to know how good your team is
|
228 |
+
and how meaningful the mission can be.
|
229 |
+
|
230 |
+
### Interviewing
|
231 |
+
|
232 |
+
As you interview candidates for ML roles, try to **validate your
|
233 |
+
hypotheses of their strengths while testing a minimum bar on weaker
|
234 |
+
aspects**. For example, ensure ML researchers can think creatively about
|
235 |
+
new ML problems while ensuring they meet a baseline for code quality.
|
236 |
+
It's essential to test ML knowledge and software engineering skills for
|
237 |
+
all industry professionals, though the relative strengths can vary.
|
238 |
+
|
239 |
+
The actual ML interview process is much less well-defined than software
|
240 |
+
engineering interviews, though it is modeled off of it. Some helpful
|
241 |
+
inclusions are projects or exercises that test the ability to work with
|
242 |
+
ML-specific code, like take-home ML projects. Chip Huyen's
|
243 |
+
"[Introduction to ML Interviews
|
244 |
+
Book](https://huyenchip.com/ml-interviews-book/)" is a
|
245 |
+
great resource.
|
246 |
+
|
247 |
+
### Finding A Job
|
248 |
+
|
249 |
+
To find an ML job, you can take a look at the following sources:
|
250 |
+
|
251 |
+
- Standard sources such as LinkedIn, recruiters, on-campus recruiting,
|
252 |
+
etc.
|
253 |
+
|
254 |
+
- ML research conferences (NeurIPS, ICLR, ICML).
|
255 |
+
|
256 |
+
- Apply directly (remember, there's a talent gap!).
|
257 |
+
|
258 |
+
Standing out for competitive roles can be tricky! Here are some tips (in
|
259 |
+
increasing order of impressiveness) that you can apply to differentiate
|
260 |
+
yourself:
|
261 |
+
|
262 |
+
1. Exhibit ML interest (e.g., conference attendance, online course
|
263 |
+
certificates, etc.).
|
264 |
+
|
265 |
+
2. Build software engineering skills (e.g., at a well-known software
|
266 |
+
company).
|
267 |
+
|
268 |
+
3. Show you have a broad knowledge of ML (e.g., write blog posts
|
269 |
+
synthesizing a research area).
|
270 |
+
|
271 |
+
4. Demonstrate ability to get ML projects done (e.g., create side
|
272 |
+
projects, re-implement papers).
|
273 |
+
|
274 |
+
5. Prove you can think creatively in ML (e.g., win Kaggle competitions,
|
275 |
+
publish papers).
|
276 |
+
|
277 |
+
## 3 - Organizations
|
278 |
+
|
279 |
+
### Organization Archetypes
|
280 |
+
|
281 |
+
There exists not yet a consensus on the right way to structure an ML
|
282 |
+
team. Still, a few best practices are contingent upon different
|
283 |
+
organization archetypes and their ML maturity level. First, let's see
|
284 |
+
what the different ML organization archetypes are.
|
285 |
+
|
286 |
+
**Archetype 1 - Nascent and Ad-Hoc ML**
|
287 |
+
|
288 |
+
- These are organizations where no one is doing ML, or ML is done on
|
289 |
+
an ad-hoc basis. Obviously, there is little ML expertise in-house.
|
290 |
+
|
291 |
+
- They are either small-to-medium businesses or less
|
292 |
+
technology-forward large companies in industries like education or
|
293 |
+
logistics.
|
294 |
+
|
295 |
+
- There is often low-hanging fruit for ML.
|
296 |
+
|
297 |
+
- But there is little support for ML projects, and it's challenging to
|
298 |
+
hire and retain good talent.
|
299 |
+
|
300 |
+
**Archetype 2 - ML R&D**
|
301 |
+
|
302 |
+
- These are organizations in which ML efforts are centered in the R&D
|
303 |
+
arm of the organization. They often hire ML researchers and
|
304 |
+
doctorate students with experience publishing papers.
|
305 |
+
|
306 |
+
- They are larger companies in sectors such as oil and gas,
|
307 |
+
manufacturing, or telecommunications.
|
308 |
+
|
309 |
+
- They can hire experienced researchers and work on long-term business
|
310 |
+
priorities to get big wins.
|
311 |
+
|
312 |
+
- However, it is very difficult to get quality data. Most often, this
|
313 |
+
type of research work rarely translates into actual business
|
314 |
+
value, so usually, the amount of investment remains small.
|
315 |
+
|
316 |
+
**Archetype 3 - ML Embedded Into Business and Product Teams**
|
317 |
+
|
318 |
+
- These are organizations where certain product teams or business
|
319 |
+
units have ML expertise alongside their software or analytics
|
320 |
+
talent. These ML individuals report up to the team's
|
321 |
+
engineering/tech lead.
|
322 |
+
|
323 |
+
- They are either software companies or financial services companies.
|
324 |
+
|
325 |
+
- ML improvements are likely to lead to business value. Furthermore,
|
326 |
+
there is a tight feedback cycle between idea iteration and product
|
327 |
+
improvement.
|
328 |
+
|
329 |
+
- Unfortunately, it is still very hard to hire and develop top talent,
|
330 |
+
and access to data and compute resources can lag. There are also
|
331 |
+
potential conflicts between ML project cycles and engineering
|
332 |
+
management, so long-term ML projects can be hard to justify.
|
333 |
+
|
334 |
+
**Archetype 4 - Independent ML Function**
|
335 |
+
|
336 |
+
- These are organizations in which the ML division reports directly to
|
337 |
+
senior leadership. The ML Product Managers work with Researchers
|
338 |
+
and Engineers to build ML into client-facing products. They can
|
339 |
+
sometimes publish long-term research.
|
340 |
+
|
341 |
+
- They are often large financial services companies.
|
342 |
+
|
343 |
+
- Talent density allows them to hire and train top practitioners.
|
344 |
+
Senior leaders can marshal data and compute resources. This gives
|
345 |
+
the organizations to invest in tooling, practices, and culture
|
346 |
+
around ML development.
|
347 |
+
|
348 |
+
- A disadvantage is that model handoffs to different business lines
|
349 |
+
can be challenging since users need the buy-in to ML benefits and
|
350 |
+
get educated on the model use. Also, feedback cycles can be slow.
|
351 |
+
|
352 |
+
**Archetype 5 - ML-First Organizations**
|
353 |
+
|
354 |
+
- These are organizations in which the CEO invests in ML, and there
|
355 |
+
are experts across the business focusing on quick wins. The ML
|
356 |
+
division works on challenging and long-term projects.
|
357 |
+
|
358 |
+
- They are large tech companies and ML-focused startups.
|
359 |
+
|
360 |
+
- They have the best data access (data thinking permeates the
|
361 |
+
organization), the most attractive recruiting funnel (challenging
|
362 |
+
ML problems tends to attract top talent), and the easiest
|
363 |
+
deployment procedure (product teams understand ML well enough).
|
364 |
+
|
365 |
+
- This type of organization archetype is hard to implement in practice
|
366 |
+
since it is culturally difficult to embed ML thinking everywhere.
|
367 |
+
|
368 |
+
### Team Structure Design Choices
|
369 |
+
|
370 |
+
Depending on the above archetype that your organization resembles, you
|
371 |
+
can make the appropriate design choices, which broadly speaking follow
|
372 |
+
these three categories:
|
373 |
+
|
374 |
+
1. **Software Engineer vs. Research**: To what extent is the ML team
|
375 |
+
responsible for building or integrating with software? How
|
376 |
+
important are Software Engineering skills on the team?
|
377 |
+
|
378 |
+
2. **Data Ownership**: How much control does the ML team have over data
|
379 |
+
collection, warehousing, labeling, and pipelining?
|
380 |
+
|
381 |
+
3. **Model Ownership**: Is the ML team responsible for deploying models
|
382 |
+
into production? Who maintains the deployed models?
|
383 |
+
|
384 |
+
Below are our design suggestions:
|
385 |
+
|
386 |
+
If your organization focuses on **ML R&D**:
|
387 |
+
|
388 |
+
- Research is most definitely prioritized over Software Engineering
|
389 |
+
skills. Because of this, there would potentially be a lack of
|
390 |
+
collaboration between these two groups.
|
391 |
+
|
392 |
+
- ML team has no control over the data and typically will not have
|
393 |
+
data engineers to support them.
|
394 |
+
|
395 |
+
- ML models are rarely deployed into production.
|
396 |
+
|
397 |
+
If your organization has **ML embedded into the product**:
|
398 |
+
|
399 |
+
- Software Engineering skills will be prioritized over Research
|
400 |
+
skills. Often, the researchers would need strong engineering
|
401 |
+
skills since everyone would be expected to product-ionize his/her
|
402 |
+
models.
|
403 |
+
|
404 |
+
- ML teams generally do not own data production and data management.
|
405 |
+
They will need to work with data engineers to build data
|
406 |
+
pipelines.
|
407 |
+
|
408 |
+
- ML engineers totally own the models that they deploy into
|
409 |
+
production.
|
410 |
+
|
411 |
+
If your organization has **an independent ML division**:
|
412 |
+
|
413 |
+
- Each team has a potent mix of engineering and research skills;
|
414 |
+
therefore, they work closely together within teams.
|
415 |
+
|
416 |
+
- ML team has a voice in data governance discussions, as well as a
|
417 |
+
robust data engineering function.
|
418 |
+
|
419 |
+
- ML team hands-off models to users but is still responsible for
|
420 |
+
maintaining them.
|
421 |
+
|
422 |
+
If your organization is **ML-First**:
|
423 |
+
|
424 |
+
- Different teams are more or less research-oriented, but in general,
|
425 |
+
research teams collaborate closely with engineering teams.
|
426 |
+
|
427 |
+
- ML team often owns the company-wide data infrastructure.
|
428 |
+
|
429 |
+
- ML team hands the models to users, who are responsible for operating
|
430 |
+
and maintaining them.
|
431 |
+
|
432 |
+
The picture below neatly sums up these suggestions:
|
433 |
+
|
434 |
+
![](./media/image12.png)
|
435 |
+
|
436 |
+
## 4 - Managing
|
437 |
+
|
438 |
+
### Managing ML Teams Is Challenging
|
439 |
+
|
440 |
+
The process of actually managing an ML team is quite challenging for
|
441 |
+
four reasons:
|
442 |
+
|
443 |
+
1. **Engineering Estimation:** It's hard to know how easy or hard an ML
|
444 |
+
project is in advance. As you explore the data and experiment with
|
445 |
+
different models, there is enormous scope for new learnings about
|
446 |
+
the problem that materially impact the timeline. Furthermore,
|
447 |
+
knowing what methods will work is often impossible. This makes it
|
448 |
+
hard to say upfront how long or how much work may go into an ML
|
449 |
+
project.
|
450 |
+
|
451 |
+
2. **Nonlinear Progress:** As the chart below from a [blog
|
452 |
+
post](https://medium.com/@l2k/why-are-machine-learning-projects-so-hard-to-manage-8e9b9cf49641)
|
453 |
+
by Lukas Biewald (CEO of [Weights and
|
454 |
+
Biases](https://wandb.ai/site)) shows, progress on ML
|
455 |
+
projects is unpredictable over time, even when the effort expended
|
456 |
+
grows considerably. It's very common for projects to stall for
|
457 |
+
extended periods of time.
|
458 |
+
|
459 |
+
![](./media/image1.png)
|
460 |
+
|
461 |
+
3. **Cultural gaps:** The relative culture of engineering and research
|
462 |
+
professionals is very different. Research tends to favor novel,
|
463 |
+
creative ideas, while engineering prefers tried and true methods
|
464 |
+
that work. As a result, ML teams often experience a clash of
|
465 |
+
cultures, which can turn toxic if not appropriately managed. A
|
466 |
+
core challenge of running ML teams is addressing the cultural
|
467 |
+
barriers between ML and software engineering so that teams can
|
468 |
+
harmoniously experiment and deliver ML products.
|
469 |
+
|
470 |
+
4. **Leadership Deficits**: It's common to see a lack of detailed
|
471 |
+
understanding of ML at senior levels of management in many
|
472 |
+
companies. As a result, expressing feasibility and setting the
|
473 |
+
right expectations for ML projects, especially high-priority ones,
|
474 |
+
can be hard.
|
475 |
+
|
476 |
+
### How To Manage ML Teams Better
|
477 |
+
|
478 |
+
Managing ML teams is hardly a solved problem, but you can take steps to
|
479 |
+
improve the process.
|
480 |
+
|
481 |
+
**Plan probabilistically**
|
482 |
+
|
483 |
+
Many engineering projects are managed in a waterfall fashion, with the
|
484 |
+
sequential tasks defined up front clearly. Instead of forcing this
|
485 |
+
method of engineering management on difficult ML projects, try assigning
|
486 |
+
a likelihood of success to different tasks to better capture the
|
487 |
+
experimental process inherent to ML engineering. As these tasks progress
|
488 |
+
or stall, rapidly re-evaluate your task ordering to better match what is
|
489 |
+
working. Having this sense of both (1) **how likely a task is to
|
490 |
+
succeed** and (2) **how important it is** makes project planning
|
491 |
+
considerably more realistic.
|
492 |
+
|
493 |
+
![](./media/image10.png)
|
494 |
+
|
495 |
+
|
496 |
+
**Have a portfolio of approaches**
|
497 |
+
|
498 |
+
Embrace multiple ideas and approaches to solve crucial research
|
499 |
+
challenges that gate production ML. Don't make your plan dependent on
|
500 |
+
one approach working!
|
501 |
+
|
502 |
+
**Measure inputs, not results**
|
503 |
+
|
504 |
+
As you work through several approaches in your portfolio, do not overly
|
505 |
+
emphasize whose ideas ultimately work as a reflection of contribution
|
506 |
+
quality. This can negatively impact team members' creativity, as they
|
507 |
+
focus more on trying to find only what they currently think could work,
|
508 |
+
rather than experimenting in a high-quality fashion (which is ultimately
|
509 |
+
what leads to ML success).
|
510 |
+
|
511 |
+
**Have researchers and engineers work together**
|
512 |
+
|
513 |
+
The collaboration between engineering and research is essential for
|
514 |
+
quality ML products to get into production. Emphasize collaboration
|
515 |
+
across the groups and professionals!
|
516 |
+
|
517 |
+
**Get quick wins**
|
518 |
+
|
519 |
+
Taking this approach makes it more likely that your ML project will
|
520 |
+
succeed in the long term. It allows you to demonstrate progress to your
|
521 |
+
leadership more effectively and clearly.
|
522 |
+
|
523 |
+
**Educate leadership on uncertainty**
|
524 |
+
|
525 |
+
This can be hard, as leadership is ultimately accountable for addressing
|
526 |
+
blind spots and understanding timeline risk. There are things you can
|
527 |
+
do, however, to help improve leadership's knowledge about ML timelines.
|
528 |
+
|
529 |
+
- Avoid building hype around narrow progress metrics material only to
|
530 |
+
the ML team (e.g., "*We improved F1 score by 0.2 and have achieved
|
531 |
+
awesome performance!*").
|
532 |
+
|
533 |
+
- Instead, be realistic, communicate risk, and emphasize real product
|
534 |
+
impact (e.g., "Our model improvements should increase the number
|
535 |
+
of conversions by 10%, though we must continue to validate its
|
536 |
+
performance on additional demographic factors.)
|
537 |
+
|
538 |
+
- Sharing resources like [this a16z primer](https://a16z.com/2016/06/10/ai-deep-learning-machines/),
|
539 |
+
[this class from Prof. Pieter
|
540 |
+
Abbeel](https://executive.berkeley.edu/programs/artificial-intelligence),
|
541 |
+
and [this Google's People + AI
|
542 |
+
guidebook](https://pair.withgoogle.com/guidebook) can
|
543 |
+
increase awareness of your company's leadership.
|
544 |
+
|
545 |
+
### ML PMs are well-positioned to educate the organization
|
546 |
+
|
547 |
+
There are two types of ML product managers.
|
548 |
+
|
549 |
+
1. **Task PMs**: These are the more common form of ML PM. They are
|
550 |
+
generally specialized into a specific product area (e.g. trust and
|
551 |
+
safety) and have a strong understanding of the particular use
|
552 |
+
case.
|
553 |
+
|
554 |
+
2. **Platform PMs**: These are a newer form of PMs. They have a broader
|
555 |
+
mandate to ensure that the ML team (generally centralized in this
|
556 |
+
context) is highest leverage. They manage workflow and priorities
|
557 |
+
for this centralized team. To support this, they tend to have a
|
558 |
+
broad understanding of ML themselves. These PMs are critical for
|
559 |
+
educating the rest of the company about ML and ensuring that teams
|
560 |
+
trust the output of models.
|
561 |
+
|
562 |
+
Both types of PMs are crucial for ML success. Platform PMs tend to have
|
563 |
+
a particularly powerful role to play in pushing an organization's
|
564 |
+
adoption of machine learning and making it successful.
|
565 |
+
|
566 |
+
### What is "Agile" for ML?
|
567 |
+
|
568 |
+
There are two options similar to what Agile is for software development
|
569 |
+
in the ML context. They are shown below:
|
570 |
+
|
571 |
+
![](./media/image2.png)
|
572 |
+
|
573 |
+
|
574 |
+
They are both structured, data-science native approaches to project
|
575 |
+
management. You can use them to provide standardization for project
|
576 |
+
stages, roles, and artifacts.
|
577 |
+
|
578 |
+
**TDSP** tends to be more structured and is a strong alternative to the
|
579 |
+
Agile methodology. **CRISP-DM** is somewhat higher level and does not
|
580 |
+
provide as structured a project management workflow. If you genuinely
|
581 |
+
have a large-scale coordination problem, you can try these frameworks,
|
582 |
+
but don't otherwise. They can slow you down since they are more oriented
|
583 |
+
around "traditional" data science and not machine learning.
|
584 |
+
|
585 |
+
## 5 - Design
|
586 |
+
|
587 |
+
Let's talk about how to actually design machine learning products now.
|
588 |
+
The biggest challenge with designing such products often isn't
|
589 |
+
implementing them; it's **bridging the gap between users' inflated
|
590 |
+
expectations and the reality**.
|
591 |
+
|
592 |
+
Users often expect extremely sophisticated systems capable of solving
|
593 |
+
many more problems than they actually can.
|
594 |
+
|
595 |
+
![](./media/image11.png)
|
596 |
+
|
597 |
+
In reality, machine learning systems are more like dogs that are trained
|
598 |
+
to do a special task; weird little guys with a penchant for distraction
|
599 |
+
and an inability to do much more than they are explicitly told.
|
600 |
+
|
601 |
+
![](./media/image13.png)
|
602 |
+
|
603 |
+
All this leads to a big gap between what can be done and what users
|
604 |
+
expect!
|
605 |
+
|
606 |
+
### The Keys to Good ML Product Design
|
607 |
+
|
608 |
+
In practice, **good ML product design bridges users expectations and
|
609 |
+
reality**. If you can help users understand the benefits and limitations
|
610 |
+
of the model, they tend to be more satisfied. Furthermore, always have
|
611 |
+
backup plans for model failures! Over-automating systems tends to be a
|
612 |
+
recipe for unhappy users. Finally, building in feedback loops can really
|
613 |
+
increase satisfaction over time.
|
614 |
+
|
615 |
+
There are a couple ways to **explain the benefits and limitations** of
|
616 |
+
an ML system to users.
|
617 |
+
|
618 |
+
- Focus on the problems it solves, not the fact that the system is
|
619 |
+
"AI-powered".
|
620 |
+
|
621 |
+
- If you make the system feel "human-like" (unconstrained input,
|
622 |
+
human-like responses), expect users to treat it as human-like.
|
623 |
+
|
624 |
+
- Furthermore, seek to include guardrails or prescriptive interfaces
|
625 |
+
over open-ended, human-like experiences. A good example of the
|
626 |
+
former approach is [Amazon
|
627 |
+
Alexa](https://alexa.amazon.com/), which has specific
|
628 |
+
prompts that its ML system responds to.
|
629 |
+
|
630 |
+
![](./media/image5.png)
|
631 |
+
|
632 |
+
|
633 |
+
**Handling failures** is a key part of keeping ML systems users happy.
|
634 |
+
There's nothing worse than a "smart" system that conks out when you do
|
635 |
+
something slightly unexpected. Having built-in solutions to solve for
|
636 |
+
automation issues is extremely important. One approach is letting users
|
637 |
+
be involved to correct improper responses. Another is to focus on the
|
638 |
+
notion of "model confidence" and only offer responses when the threshold
|
639 |
+
is met. A good example of a handling failure approach is how Facebook
|
640 |
+
recommends photo tags for users, but doesn't go so far as to autoassign.
|
641 |
+
|
642 |
+
### Types of User Feedback
|
643 |
+
|
644 |
+
How can you collect feedback from users in a way that avoids these
|
645 |
+
issues? There are different types of user feedback and how they help
|
646 |
+
with model improvement.
|
647 |
+
|
648 |
+
![](./media/image3.png)
|
649 |
+
|
650 |
+
|
651 |
+
Let's go across this chart.
|
652 |
+
|
653 |
+
1. The simplest form of feedback is **indirect implicit feedback**. For
|
654 |
+
example, did the user churn from the product? That tells you
|
655 |
+
immediately how the user felt about the system without them giving
|
656 |
+
a clear signal themselves.
|
657 |
+
|
658 |
+
2. Another form is **direct implicit feedback**, which involves the
|
659 |
+
user "taking the next step". For example, in an automated user
|
660 |
+
onboarding flow, did the user click through into ensuing steps?
|
661 |
+
This is trickier to implement, but can be useful for future
|
662 |
+
training iterations.
|
663 |
+
|
664 |
+
3. The next type of feedback is **binary explicit feedback**, wherein
|
665 |
+
users are specifically asked (e.g. via thumbs up/down buttons) how
|
666 |
+
they feel about the model performance.
|
667 |
+
|
668 |
+
4. You can make this more sophisticated and add **categorical explicit
|
669 |
+
feedback**, which allows users to sort their feedback into various
|
670 |
+
types.
|
671 |
+
|
672 |
+
5. To really get a sense of how users feel, consider offering **free
|
673 |
+
text feedback**. This is tricky to use for model training and can
|
674 |
+
be involved for users, but it's very useful to highlight the
|
675 |
+
highest friction predictions.
|
676 |
+
|
677 |
+
6. The gold standard, of course, are **model corrections**; they are
|
678 |
+
free labels!
|
679 |
+
|
680 |
+
Whenever building explicit feedback into ML systems, avoid relying on
|
681 |
+
users' altruism and be clear about why they should engage in the
|
682 |
+
feedback. Instead, build positive feedback loops by allowing users to
|
683 |
+
experience the benefits of their feedback quickly.
|
684 |
+
|
685 |
+
**Great ML product experiences are designed from scratch**. ML is a very
|
686 |
+
specific technology with clear advantages and drawbacks. Design needs to
|
687 |
+
be thoughtfully executed around these products. It's especially
|
688 |
+
important to allow users to interact safely with ML products that may
|
689 |
+
fail in unexpected ways. Always try to find ways to build in feedback
|
690 |
+
loops to make the ML product better over time.
|
691 |
+
|
692 |
+
There are tons of resources that can help you get started with this
|
693 |
+
emerging field.
|
694 |
+
|
695 |
+
- [Google's People + AI
|
696 |
+
Guidebook](https://pair.withgoogle.com/guidebook)
|
697 |
+
|
698 |
+
- [Guidelines for Human-AI
|
699 |
+
Interaction](https://dl.acm.org/doi/abs/10.1145/3290605.3300233)
|
700 |
+
|
701 |
+
- [Agency Plus Automation: Designing AI into Interactive
|
702 |
+
Systems](http://idl.cs.washington.edu/files/2019-AgencyPlusAutomation-PNAS.pdf)
|
703 |
+
|
704 |
+
- [Designing Collaborative
|
705 |
+
AI](https://medium.com/@Ben_Reinhardt/designing-collaborative-ai-5c1e8dbc8810)
|
706 |
+
|
707 |
+
In conclusion, we talked through a number of adjacent considerations to
|
708 |
+
building ML systems and products. In short, you ship the team as much
|
709 |
+
you do the code; be thoughtful about how you hire, manage, and structure
|
710 |
+
ML teams as much as ML products!
|
711 |
+
|
712 |
+
![](./media/image8.png)
|
713 |
+
|
documents/lecture-08.srt
ADDED
@@ -0,0 +1,416 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
1
|
2 |
+
00:00:00,179 --> 00:00:40,500
|
3 |
+
hey everybody welcome back this week we're going to talk about something a little bit different than we do most weeks most weeks we talk about specific technical aspects of building machine learning powered products but this week we're going to focus on some of the organizational things that you need to do in order to work together on ml-powered products as part of an interdisciplinary team so the the reality of building ml Power Products is that building any product well is really difficult you have to figure out how to hire grade people you need to be able to manage those people and get the best out of them you need to make sure that your team is all working together towards a shared goal you need to make good
|
4 |
+
|
5 |
+
2
|
6 |
+
00:00:38,399 --> 00:01:20,280
|
7 |
+
long-term technical choices manage technical debt over time you need to make sure that you're managing expectations not just of your own team but also of leadership of your organization and you need to be able to make sure that you're working well within the confines of the requirements of the rest of the org that you're understanding those requirements well and communicating back to your progress to the rest of the organization against those requirements but machine learning adds even more additional complexity to this machine learning Talent tends to be very scarce and expensive to attract machine learning teams are not just a single role but today they tend to be pretty interdisciplinary which makes
|
8 |
+
|
9 |
+
3
|
10 |
+
00:01:18,659 --> 00:02:00,600
|
11 |
+
managing them an even bigger challenge machine learning projects often have unclear timelines and there's a high degree of uncertainty to those timelines machine learning itself is moving super fast and machine learning as we've covered before you can think of as like the high interest credit card of technical debt so keeping up with making good long-term decisions and not incurring too much technical debt is especially difficult in ml unlike traditional software ml is so new that in most organizations leadership tends not to be that well educated in it they might not understand some of the core differences between ML and other technology that you're working with machine learning products tend to fail in ways that are really hard for Lay
|
12 |
+
|
13 |
+
4
|
14 |
+
00:01:58,799 --> 00:02:36,360
|
15 |
+
people to understand and so that makes it very difficult to help the rest of the stakeholders in your organization understand what they could really expect from the technology that you're building and what is realistic for us to achieve so throughout the rest rest of this lecture we're going to kind of touch on some of these themes and cover different aspects of this problem of working together to build ml Power Products as an organization so here are the pieces that we're going to cover we're going to talk about different roles that are involved in building ml products we're going to talk about some of the unique aspects involved in hiring ml Talent we're going to talk about organization of teams and how the ml team tends to
|
16 |
+
|
17 |
+
5
|
18 |
+
00:02:34,739 --> 00:03:16,140
|
19 |
+
fit into the rest of the org and some of the pros and cons of different ways of setting that up we'll talk about managing ml teams and ml product management and then lastly we'll talk about some of the design considerations for how to design a product that is well suited to having a good ml model that backs it so let's dive in and talk about rules the most common ml rules that you might hear of are things like ml product manager ml Ops or ml platform or ml info teams machine learning Engineers machine learning researchers or ml scientists data scientists so there's a bunch of different roles here and one kind of obvious question is what's the difference between all these different things so let's break down the job
|
20 |
+
|
21 |
+
6
|
22 |
+
00:03:14,400 --> 00:03:57,900
|
23 |
+
function that each of these roles plays within the context of building the ml product starting with the ml product manager their goal is to work with the NL team the business team the users and any other stakeholders to prioritize projects and make sure that they're being executed well to meet the requirements of the rest of the organization so what they produce is things like design docs wireframes and work plans and they're using tools like jira like notion to help sort of organize the work of the rest of the team the ml Ops or ml platform team are focused on building the infrastructure needed in order to make models easier to deploy more scalable or generally reduce the workload of individual contributors
|
24 |
+
|
25 |
+
7
|
26 |
+
00:03:56,159 --> 00:04:38,160
|
27 |
+
who are working on different ml models the output of what they build is some infrastructure some shared tools that can be used across the ml teams in your company and they're working with tools like AWS like Kafka or other data infrastructure tools and potentially working with ML infrastructure vendors as well to sort of bring best in breed from traditional data and software tools and this new category of ml vendors that are providing like mlops tools together to create this sort of best solution for the specific problems that your company is trying to solve then we have the ml engineer the ml engineer is kind of a catch-all role and the way that I like to think of their responsibilities is they're the person who is responsible
|
28 |
+
|
29 |
+
8
|
30 |
+
00:04:36,000 --> 00:05:15,660
|
31 |
+
for training and deploying and maintaining the prediction model that powers the mlpark product they're not just the person who is you know solely training the model and then handing it off to someone else but they're also responsible for deploying it and then maintaining it once it's in production and so they need to know Technologies like tensorflow for training models but also like Docker for packaging models and making sure that they run on production infrastructure the next role is the ml researcher so this is a role that exists in some organizations that the responsibility stops after the model has been trained and so oftentimes these models are either handed off to some other team to productionize or these
|
32 |
+
|
33 |
+
9
|
34 |
+
00:05:13,800 --> 00:05:54,660
|
35 |
+
folks are focused on building models that are not yet production critical or forward-looking maybe they're prototyping some use cases that might be useful down the line for the organization and their work product is a trained model and oftentimes it's a report or a code repo that describes what this model does how to use it and how to reproduce their results so they're working with ML trading tools and also prototyping tools like jupyter notebooks to produce a version of a model that just needs to work once to sort of show that the thing that they're trying to do is possible and then lastly we get the data scientist data scientist is kind of a patch-all term for potentially any of the things above in some organizations data science is quite
|
36 |
+
|
37 |
+
10
|
38 |
+
00:05:53,220 --> 00:06:31,319
|
39 |
+
distinct from what we've been thinking of as a machine learning role in this class and these are folks in some organizations that are responsible for answering business questions using analytics so in some organizations a data scientists is you know the same as an ml researcher or an ml engineer and other organizations data science is a distinct function that is responsible for answering business questions using data the ml work is the responsibility of an ml team so the next thing we'll talk about is what are the different skills that you actually need to be successful in these roles we're going to plot this on a two by two on the x-axis is the amount of skill that you need in machine learning like how much ml do you
|
40 |
+
|
41 |
+
11
|
42 |
+
00:06:28,979 --> 00:07:07,500
|
43 |
+
really need to know on the y-axis is the software engineering skill needed and then the size of the bubble is a requirement on communication or technical writing how good do you have to be at communicating your ideas to other people so starting with ML Ops or ml platform teams this is really primarily a software engineering role and oftentimes where these folks will come into the organization is through their you know traditional software engineering or data engineering hiring pipeline or even moving over from a data engineering role in another part of the organization another common pattern for how organizations find ml Ops or ml platform Engineers is they Source them from mles at their organization it's
|
44 |
+
|
45 |
+
12
|
46 |
+
00:07:05,580 --> 00:07:43,740
|
47 |
+
oftentimes like an ml engineer who used to just work on one model and then got frustrated by the lack of tooling so decided to move into more of a platform role the ml engineer since this is someone who is required to understand the models deeply and also be able to productionize them this tends to be a rare mix of ml skills and software engineering skills so there's sort of two paths that I typically see for folks becoming ml Engineers oftentimes these are software Engineers who have a pretty significant amount of self-teaching or on the other hand maybe they are someone who's trained in machine learning traditionally like they have a science or engineering PhD but then they switch careers into software engineering after
|
48 |
+
|
49 |
+
13
|
50 |
+
00:07:41,460 --> 00:08:25,220
|
51 |
+
grad school or after undergrad and then later decided to fuse those two skill sets ml researchers these are your ml experts so this is kind of the only role on this list that I would say it's pretty typical still to see a graduate degree or another path to these roles are these industrial Fellowship programs like Google brain residency that are explicitly designed to train people without a PhD in in this distinct skill of research since data science is kind of like a catch-all term for a bunch of different roles in different organizations it also admits a variety of different backgrounds and oftentimes these are undergrads who went to a data science specific program or their science phds who are making the
|
52 |
+
|
53 |
+
14
|
54 |
+
00:08:23,039 --> 00:09:05,220
|
55 |
+
transition into industry and then lastly mlpms oftentimes these folks come from a traditional product management background but they do need to have a deep understanding of the specifics of the ml development process and that can come from having you know work closely with ML teams for a long time having just a really strong independent interest in ml or oftentimes what I see is folks who are you know former data scientists or ml Engineers who make the switch into PM it can be really effective at pming ml projects because they have a deep understand in the technology one other distinction that I think is worth covering when talking about the variety of different roles in ml organizations is the distinction between a task ml
|
56 |
+
|
57 |
+
15
|
58 |
+
00:09:02,880 --> 00:09:44,100
|
59 |
+
engineer and a platform ml engineer this is a distinction that was coined by Shreya Shankar in blog post that's linked below and the distinction is that some ml Engineers are really responsible for like one ml pipeline or maybe a handful of ml pipelines that they're assigned to and so they're the ones that are day in and day out responsible for making sure that this model is healthy making sure that it's being updated frequently and that any failures are sort of being accounted for these folks are often like very overburdened this can be a very sort of expansive role because they have to be training models and deploying them and understanding where they break since mlgiers are often spread so thin some ml Engineers end up
|
60 |
+
|
61 |
+
16
|
62 |
+
00:09:41,100 --> 00:10:22,260
|
63 |
+
taking on a role that looks more like a ml platform team or ml Ops Team where they work across teams to help ml Engineers automate tedious parts of their jobs and so we in our parlance this is called an ml platform engineer or ml Ops engineer but you'll also hear this referred to as an ml engineer or a platform ml engineer so we've talked a little bit about what are some of the different roles in the process of building ml Power Products now let's talk about hiring so how to think about hiring ml Specialists and if you are an ml specialist looking for a job how to think about making yourself more attractive as a job candidate so a few different things that we'll cover here the first is the AI Talent gap which is
|
64 |
+
|
65 |
+
17
|
66 |
+
00:10:20,339 --> 00:10:59,519
|
67 |
+
sort of the reality of ml hiring these days and we'll talk about how to source for ML Engineers if you're hiring folks we'll talk about interviewing and then lastly we'll talk about finding a job four years ago when we started teaching full stack deep learning the AI Talent Gap was the main story in many cases for what teams found Difficult about building with ML there was just so so few people that understood this technology that the biggest thing blocking a lot of organizations was just they couldn't find people who are good at machine learning four years later the AI Talent Gap persists and there's still you know news stories every few months that are being written about how difficult it is for companies to find ml
|
68 |
+
|
69 |
+
18
|
70 |
+
00:10:57,120 --> 00:11:38,279
|
71 |
+
talent but my observation day to day in the field is that it tends to be less of a blocker than it used to be because you know we've had four years of folks switching careers into ML and four years of you know software Engineers emerging from undergrad with at least a couple of ml classes in many cases under their belts so there's more and more people now that are capable of doing ml but there's still a gap and in particular that Gap tends to be in folks that understand more than just the underlying technology but also have experience in seeing how seeing how it fails and how to make it successful when it's deployed so that's the reality of how difficult it is to hire machine learning folks today especially those who have
|
72 |
+
|
73 |
+
19
|
74 |
+
00:11:36,360 --> 00:12:14,459
|
75 |
+
production experience so if you are hiring ml folks how should you think about finding people if you're hiring ml product managers or ml platform or ml Ops Engineers the main skill set that you need to look for is still the sort of core underlying skill set for those roles so product management or data engineering or platform Engineering in general but it is critical to find folks who have experience at least interacting with teams that are building production ml systems because I think one sort of failure mode that I've seen relatively frequently especially for ML platform teams is if you just bring in folks with pure software engineering background a lot of times it's difficult for them to understand the user requirements well
|
76 |
+
|
77 |
+
20
|
78 |
+
00:12:12,660 --> 00:12:52,920
|
79 |
+
enough in order to engineer things that actually solve the user's problems users here being the task mles who are you know the ones who are going to be using the infrastructure data that we'll focus for the rest of the section mostly on these two roles ml engineer and ml scientist so there's a right and a wrong way to hire ml engineers and the wrong way oftentimes looks maybe something like this so you see a job description for the Unicorn machine learning engineer the duties for this person are they need to keep up with seed of the art they need to implement new models from scratches that come out they need a deep understanding of the underlying mathematics and ability to invent new models for new tasks as it arises they
|
80 |
+
|
81 |
+
21
|
82 |
+
00:12:51,480 --> 00:13:30,060
|
83 |
+
need to also be able to build tooling and infrastructure for the ml team because ml teams need tooling to do their jobs they need to be able to build data pipelines as well because without data ml is nothing they need to deploy these models and monitor them in production because without deploying models you're not actually solving a problem so in order to fulfill all these duties you need these requirements as this unicorn mle role you of course need a PhD you need at least four years of tensorflow experience four years as a software engineer you need to have Publications and nurips or other top ml conferences experience building large-scale distributed systems and so when you add all this up hopefully it's becoming
|
84 |
+
|
85 |
+
22
|
86 |
+
00:13:28,500 --> 00:14:09,720
|
87 |
+
clear why this is the wrong way to hire ml Engineers there's just not really very many people that fit this description today if any and so the implication is the right way to hire ml Engineers is to be very very specific about what you actually need from these folks and in most cases the right answer is to primarily hire for software engineering skills not ml skills you do need folks that have at least a background in ml and a desire to learn ML and you can teach people how to do ml if they have a strong interest in it they know the basics and they're really strong in the software engineering side another approach instead of hiring for software engineering skills and training people in the on the ml side is to go
|
88 |
+
|
89 |
+
23
|
90 |
+
00:14:07,380 --> 00:14:46,680
|
91 |
+
more Junior most undergrads in computer science these days graduate with ML experience and so these are folks that have traditional computer science training and some theoretical ml understanding so they have sort of the seeds of being good at both ML and software engineering but maybe not a lot of experience in either one and then the third way that you can do this more effectively is to be more specific about what you really really need for not the ml engineering function in general but for this particular role right so not every ml engineer needs to be a devops expert to be successful not every ml engineer needs to be able to implement new papers from scratch to be successful either for many of the MLS years that
|
92 |
+
|
93 |
+
24
|
94 |
+
00:14:45,240 --> 00:15:25,440
|
95 |
+
you're hiring what they really need to do is something along the lines of taking a model that is you know pretty established as something that works while pulling it off the shelf or training it using a pretty robust library and then being able to deploy that model into production so focus on hiring people to have those skills not these aspirational skills that you don't actually really need for your company NeXT let's talk about a couple things I've found to be important for hiring ml researchers the first is a lot of folks when they're hiring ml researchers they look first at the number of Publications they have in top conferences I think it's really critical to focus entirely on the Quant the quality of public
|
96 |
+
|
97 |
+
25
|
98 |
+
00:15:23,160 --> 00:16:01,019
|
99 |
+
locations not the quantity and this unfortunately requires a little bit of judgment about what high quality research looks like but hopefully there's someone on your team that can provide that judgment it's more interesting to me to find machine learning researchers who have you know one or two Publications that you think are really creative or very applicable to the field that you're working in or have really really strong promising results then to find someone who's you know published 20 papers but each of them are just sort of an incremental Improvement to the state of the art if you're working in the context of a company where you're trying to build a product and you're hiring researchers then I think another really important
|
100 |
+
|
101 |
+
26
|
102 |
+
00:15:59,339 --> 00:16:34,079
|
103 |
+
thing to filter for is looking for researchers who have an eye for working on problems that really matter a lot of researchers maybe through no fault of their own just because of the incentives and Academia focus on problems that are trendy if everyone else is publishing about reinforcement learning then they'll publish about reinforcement learning if everyone else is publishing about generative models then they'll make an incremental improvements to generative models to get them a publication but what you really want to to look for I think is folks that have an independent sense of what problems are important to work on because in the context of your company no one's going to be telling these folks like hey this
|
104 |
+
|
105 |
+
27
|
106 |
+
00:16:33,300 --> 00:17:08,400
|
107 |
+
is what everyone's going to be publishing about this year oftentimes experience outside of Academia can be a good proxy for this but it's not really necessary it's just sort of one signal to look at if you already have a research team established then it's worth considering hiring talented people from adjacent Fields hiring from physics or statistics or math at open AI they did this with to really strong effects they would look for sort of folks that were really technically talented but didn't have a lot of ml expertise and they would train them in them out this works a lot better if you do have experienced researchers who can provide mentorship and guidance for folks I probably wouldn't hire like a first
|
108 |
+
|
109 |
+
28
|
110 |
+
00:17:06,780 --> 00:17:47,280
|
111 |
+
researcher that doesn't have ml experience and then it's also worth remembering that especially these days you really don't need a PhD to do ml research many undergrads have a lot of experience doing ml research and graduates of some of these industrial Fellowship programs like Googles or Facebooks or open AIS have learned the basics of how to do research regardless of whether they have a PhD so that's how to think about evaluating candidates for ML engineering or ml research roles the next thing I want to talk about is how to actually find those candidates so your standard sources like LinkedIn or recruiters or on campus recruiting all work but another thing that can be really effective if you want to go
|
112 |
+
|
113 |
+
29
|
114 |
+
00:17:44,280 --> 00:18:26,580
|
115 |
+
deeper is every time there's a new dump of papers on archive or every year at nurips and other top conferences just keep an eye on what you think are the most exciting papers and flag mostly the first authors of those papers because those are the ones that tend to be doing most of the work and are generally more recruitable because they tend to be more Junior in their careers Beyond looking at papers you can also do something similar for good re-implementations of papers that like so if you are you know looking at some hot new paper and a week later there's a re-implementation of that paper that has high quality code and hits the main results then chances are whoever wrote that implementation is probably pretty good and so they could
|
116 |
+
|
117 |
+
30
|
118 |
+
00:18:24,960 --> 00:19:01,919
|
119 |
+
be worth recruiting you can do a lot of this in person now that ml research conferences are back in person or you can just reach out to folks that you are interested in talking to over the Internet since there's a talent shortage in ml it's not enough just to know how to find good ml candidates and evaluate them you also need to know how to think about attracting them to your company I want to talk a little bit about from what I've seen what a lot of ml practitioners are interested in the roles they take and then talk about ways that you can make your company Stand Out along those axes so one thing a lot of ml practitioners want is to work with Cutting Edge tools and techniques to be working with latest state of the art
|
120 |
+
|
121 |
+
31
|
122 |
+
00:19:00,419 --> 00:19:36,780
|
123 |
+
research another thing is to build knowledge in an exciting field to like a more exciting branch of ml or application of ml working with excellent people probably pretty consistent across many technical domains but certainly true in ml working on interesting data sets this is kind of one unique thing in ml since the work that you can do is constrained in many cases the data sets that you have access to being able to offer unique data sets can be pretty powerful probably again true in general but I've noticed for a lot of ml folks in particular it's important for them to feel like they're doing work that really matters so how do you stand out on these axes you can work on Research oriented projects even if the sort of mandate of
|
124 |
+
|
125 |
+
32
|
126 |
+
00:19:35,580 --> 00:20:12,780
|
127 |
+
your team is primarily to help your company doing some research work that you can publicize and that you could point to as being indicative of working on The Cutting Edge open source libraries things like that can really help attract top candidates if you want to emphasize the ability of folks to sort of build skills and knowledge in an exciting field you can build a team culture around learning so you can host reading groups in your company you can organize learning days which is something that we did at open AI where we would dedicate back then a day per week just to be focused on learning new things but you can do it less frequently than that professional development budgets conference budgets things like
|
128 |
+
|
129 |
+
33
|
130 |
+
00:20:11,340 --> 00:20:50,820
|
131 |
+
this that you can emphasize and this is probably especially valuable if your strategy is to hire more Junior folks or more software engineering oriented folks and train them up in machine learning emphasize how much they'll be able to learn about MLA company one sort of hack to being able to hire good ml people is to have other good ml people on the team this is maybe easier said than done but one really high profile hire can help attract many many other people in the field and if you don't have the luxury of having someone high profile on your team you can help your existing team become more high profile by helping them publish blogs and papers so that other people start to know how talented your team actually is when you're attracting
|
132 |
+
|
133 |
+
34
|
134 |
+
00:20:48,240 --> 00:21:27,419
|
135 |
+
ml candidates you can focus on sort of emphasizing the uniqueness of your data set in recruiting materials so if you have know the best data set for a particular subset of the legal field or the medical field emphasize how interesting that is to work with how much data you have and how unique it is that you have it and then lastly you know just like any other type of recruiting selling the mission of the company and the potential for ML to have an impact on that mission can be really effective next let's talk about ml interviews what I would recommend testing for if you are on the interviewer side of an ml interview is to try to hire for strengths and meet a minimum bar for everything else and this can help you avoid falling into the Trap
|
136 |
+
|
137 |
+
35
|
138 |
+
00:21:24,780 --> 00:22:01,380
|
139 |
+
of looking for unicorn mles so some things that you can test are you want to validate your hypotheses of candidate strengths so if it's a researcher you want to make sure that they can think creatively about new ml problems and one way you can do this is to probe how thoughtful they were about previous projects if they're Engineers if they're mles then you want to make sure that they're great generalist software Engineers since that's sort of the core skill set in ml engineering and then you want to make sure they meet a minimum bar on weaker areas so for researchers I would advocate for only hiring researchers in Industry contexts who have at least the very basics in place about software engineering knowledge and
|
140 |
+
|
141 |
+
36
|
142 |
+
00:21:59,220 --> 00:22:33,299
|
143 |
+
the ability to write like decent code if not you know really high quality production ready code because in context of working with a team other people are going to need to use their code and it's not something that everyone learns how to do when they're in grad school for ML for software Engineers you want to make sure that they at least meet a minimum bar on machine learning knowledge and this is really testing for like are they passionate about this field that they have put in the requisite effort to learn the basics of ml that's a good indication that they're going to learn ml quickly on the job if you're hiring them mostly for their software engineering skills so what do ml interviews actually consist of so this
|
144 |
+
|
145 |
+
37
|
146 |
+
00:22:31,320 --> 00:23:10,380
|
147 |
+
is today much less well defined than your software engineering interviews some common types of Assessments that I've seen are your normal sort of background and culture fit interviews whiteboard coding interviews similar to you'd see in software engineering pair coding like in software engineering but some more ml specific ones include pair debugging where you and an interviewer will sit down and run some ml code and try to find Hey where's the bug in this code oftentimes this is ml specific code and the goal is to test for how well is this person able to find bugs in ml code since bugs tend to be where we spend most of our time in machine learning math puzzles are often common especially involving things like linear algebra
|
148 |
+
|
149 |
+
38
|
150 |
+
00:23:08,340 --> 00:23:46,080
|
151 |
+
take-home projects other types of Assessments include applied ml questions so typically this will have the flavor of hey here's a problem that we're trying to solve with ML let's talk through the sort of high level pieces of how we'd solve it what type of algorithm we'd use what type of system them we need to build to support it another Common Assessment is probing the past projects that you've listed on your resume or listed as part of the interview process asking you about things you tried will work what didn't work and trying to assess what role you played in that project and how thoroughly you thought through the different alternative paths that you could have considered and then lastly ml Theory questions are also pretty common
|
152 |
+
|
153 |
+
39
|
154 |
+
00:23:43,919 --> 00:24:20,880
|
155 |
+
in these interview type assessments that's sort of the universe of things that you might consider interviewing for if you're trying to hire ml folks or that you might expect to find on an ml interview if you are on the the other side and trying to interview for one of these jobs and the last thing I'll say on interviews is there's a great book from chipwin the introduction to machine learning interviews book which is available for free online which is especially useful I think if you're preparing to interview for machine learning roles speaking of which what else should you be doing if your goal is to find new job in machine learning the first question I typically hear is like where should I even look for ML jobs
|
156 |
+
|
157 |
+
40
|
158 |
+
00:24:19,080 --> 00:24:51,059
|
159 |
+
your standard sources like LinkedIn and recruiters all work ml Research Conference references can also be a fantastic place just go up and talk to the folks that are standing around the booths at those conferences they tend to be you know looking for candidates and you can also just apply directly and this is sort of something that people tell you not to do for most roles but remember there's a talent Gap in machine learning so this can actually be more effective than you might think when you're applying what's the best way to think about how to stand out for these roles so I think like sort of a baseline thing is for many companies they really want to see that you're expressing some sort of interest in ml you've been
|
160 |
+
|
161 |
+
41
|
162 |
+
00:24:49,260 --> 00:25:29,460
|
163 |
+
attending conferences you've been taking online courses you've been doing something to sort of put get your foot in the door for getting into the field better than that is being able to demonstrate that you have some software engineering skills again for many ml organizations hiring for software engineering is in many ways more important than hiring for ML skills if you can show that you have a broad knowledge of ml so writing blog posts that synthesize a particular research area or articulating a particular algorithm in a way that is that is new or creative or compelling can be a great way to stand out but even better than that is demonstrating an ability to you know ship ml projects and the best way to do this I think if you are not
|
164 |
+
|
165 |
+
42
|
166 |
+
00:25:27,600 --> 00:26:04,980
|
167 |
+
working in ml full-time right now is through side projects these can be ideas of whatever you want to work on they can be paper re-implementation so they can be your project for this course and then probably if you really want to stand out maybe the most impressive thing that you can do is to prove that you can think creatively in ml right think Beyond just reproducing things that other people have done but be able to you know win kaggle competitions or publish papers and so this is definitely not necessary to get a job in ml but this will sort of put your resume at the top of the stack so we've talked about some of the different roles that are involved in building ml products and how to think about hiring for those roles or being
|
168 |
+
|
169 |
+
43
|
170 |
+
00:26:03,419 --> 00:26:43,320
|
171 |
+
hired for those roles the next thing that we're going to talk about is how machine learning teams fit into the context of the rest of the organization since we're still in the relatively early days of adopting this technology there's no real consensus yet in terms of the best way to structure an ml team but what we'll cover today is taxonomy of some of the best practices for different security levels of organizations and how they think about structuring their ml teams and so we'll think about this as scaling a mountain from least mature ml team to most mature so the bottom of the mountain is the nascent or ad hoc ml archetype so what this looks like is you know your company has just started thinking about mL no
|
172 |
+
|
173 |
+
44
|
174 |
+
00:26:41,940 --> 00:27:20,460
|
175 |
+
one's really doing it yet or maybe there's a little of it being done on an ad hoc basis by the analytics team or one of the product teams and most smaller medium businesses are at most in this category but some of the less technology for larger organizations still fall in this category as well so the great thing about being at this stage is that there's a ton of low hanging fruit often for ML to come in and help solve but the disadvantage if you're going to go in and work in an organization at this stage is that there's often little support available for ML projects you probably won't have any infrastructure that you can rely on and it can be difficult to hire and retain good talent plus leadership in the company may not really be bought
|
176 |
+
|
177 |
+
45
|
178 |
+
00:27:18,419 --> 00:27:59,940
|
179 |
+
into how useful ml could be so that's some of the things to think about if you're going to go take a role role in one of these organizations once the company has decided hey this ml thing is something exciting something that we should invest in typically they'll move up to an ml r d stage so what this looks like is they'll have a specific team or specific like subset of their r d organization that's focused on machine learning they'll typically hire researchers or phds and these folks will be focused on building prototypes internally or potentially doing external facing research so some of the larger oil and gas companies manufacturing companies telecom companies were in the stage even just a few years ago although
|
180 |
+
|
181 |
+
46
|
182 |
+
00:27:58,260 --> 00:28:36,000
|
183 |
+
they've in many cases moved on from it now if you're going to go work in one of these organizations one of the big advantages is you can get away with being less experienced on the research side and since the ml team isn't really going to be on the hook today for any sort of meaningful business outcomes another big Advantage is that these teams can work on long-term business priorities and they can focus on trying to get to what would be really big wins for the organization but the disadvantage to be aware of if you're thinking about joining a team at this stage or building a team at this stage is that oftentimes since the ml team is sort of siled off into an R D part of the organization or a separate team from
|
184 |
+
|
185 |
+
47
|
186 |
+
00:28:34,080 --> 00:29:11,580
|
187 |
+
the different products initiatives it can be difficult for them to get the data that they need to solve the problems that they need to solve it's just not a priority in many cases for other parts of the business to give them the data and then probably the biggest disadvantage of this stage is that you know it doesn't usually work it doesn't usually translate to business value for the organization and so oftentimes ml teams kind of get stuck at this stage where they don't invest very much in ml and ml is kind of siled and so they don't see strong results and they can't really justify doubling down the next evolution of ml organizations oftentimes is embedding machine learning directly into business and product teams so what
|
188 |
+
|
189 |
+
48
|
190 |
+
00:29:09,900 --> 00:29:52,140
|
191 |
+
this looks like is you'll have some product teams within the organization that have a handful of ml people side by side with their software or analytics teams and these ml teams will report up into the sort of engineering or Tech organizations directly instead of being in their own sort of reporting arm a lot of tech companies when they start adopting ml sort of pretty quickly get to this category because they're pretty agile software organizations and pretty Tech forward organizations anyway and a lot of the financial services company is tend towards this model as well the big sort of overwhelming advantage of this organizational model is that when these ml teams ship stuff successfully it almost always is able to translate
|
192 |
+
|
193 |
+
49
|
194 |
+
00:29:50,159 --> 00:30:27,419
|
195 |
+
pretty directly to business value since the people that are doing ml sit side by side with the folks that are you know building the product or building the feature that the ml is going to be part of and this gives them a really tight feedback cycle between new ideas that they have for how to make the ml better how to make the product better with ml into actual results as part of the products the disadvantages of building ml this way are oftentimes it can be hard to hire and develop really really great ml people because great ml people often want to work with other great ml people it can also be difficult to get these ml folks access to the resources that they need to be really successful so that's the infrastructure they need
|
196 |
+
|
197 |
+
50
|
198 |
+
00:30:25,620 --> 00:31:03,360
|
199 |
+
the data they need or the compute they need because they don't have sort of a central team that reports high up in the organization to ask for help and one other disadvantage of this model is that oftentimes this is where you see conflicts between the way that ml projects are run the sort of iterative process that is high risk and the way that the software teams that these ml folks are a part of are organized sometimes you'll see conflict between folks getting frustrated with the ml folks on their team for not shipping quickly or not being able to sort of commit to a timeline that they promised the next ml organization architect will cover is independent machine learning's function what this looks like is you'll
|
200 |
+
|
201 |
+
51
|
202 |
+
00:31:01,799 --> 00:31:42,960
|
203 |
+
have a machine learning division of the company that reports up to senior leadership so they report to the CEO or the CTO or something along those lines this is what distinguishes it from the mlr D archetype where the ml team is often you know reporting to someone more Junior in the organization often a foreigner as sort of a smaller bet this is the organization making a big bet to investing in machine learning so oftentimes this is also the archetype where you'll start to see mlpms or platform nlpms that work with researchers and ml engineers and some of these other roles in order to deliver like a cross-functional product the big advantage of this model is access to resources so since you have a centralized ml team you can often hire
|
204 |
+
|
205 |
+
52
|
206 |
+
00:31:40,679 --> 00:32:18,960
|
207 |
+
really really talented people and build a talent density in the organization and you can also train people more easily since you have more ml people sitting in a room together or in a zoom room together in some cases since you report to senior leadership you can also often like Marshal more resources in terms of data from the rest of the organization or budget for compute than you can in other archetypes and it makes it a lot easier when you have a centralized organization to invest in things like tooling and infrastructure and culture and best practices around developing ml in your organization the big disadvantage of this model is that it leads to handoffs and that can add friction to the process that you as an
|
208 |
+
|
209 |
+
53
|
210 |
+
00:32:16,980 --> 00:32:54,720
|
211 |
+
ml team need to run in order to actually get your models into production and the last ml organization archetype the the end State the goal if you're trying to build ml the right way in your organization is to be an ml first organization so what this looks like is you have buy-in up and down the organization that ml is something that you as a company want to invest in you have an ml division that works on the most challenging long-term projects and invests in sort of centralized data and centralized infrastructure but you also have expertise in ml in every line of business that focuses on quick wins and working with the central ml division to sort of translate the ideas they have the implementations they make into
|
212 |
+
|
213 |
+
54
|
214 |
+
00:32:52,320 --> 00:33:30,480
|
215 |
+
actual outcomes for the products that the company is building so you'll see this in the biggest tech companies like the Googles and Facebooks of the world as well as startups that were founded with ML as a core guiding principle for how they want to build the products and these days more and more you're starting to see other tech companies who began investing in ml four or five years ago start to become closer to this archetype there's mostly advantages to this model you have great access to data It's relatively easy to recruit and most importantly it's probably easiest in this archetype out of all them to get value out of ml because the products teams that you're working with understand machine learning and really
|
216 |
+
|
217 |
+
55
|
218 |
+
00:33:28,799 --> 00:34:07,740
|
219 |
+
the only disadvantage of this model is that it's difficult and expensive and it takes a long time for organizations that weren't born with this mindset to adopt it because you have to recruit a lot of really good ml people and you need to culturally embed ml thinking into your organization the next thing that we'll talk about is some of the design choices you need to make if you're building an ml team we'll talk about how those depend on the archetype of the organization that you fit into the first question is software engineering versus research so to what extent is the mltm responsible for building software versus just training models the second question is data ownership so is the ml team also responsible for creating publishing data
|
220 |
+
|
221 |
+
56
|
222 |
+
00:34:06,240 --> 00:34:43,379
|
223 |
+
or do they just consume that from other teams and the last thing is model ownership the ml team are they the ones that are going to productionize models or is that the responsibility of some other team in the mlr D archetype typically you'll prioritize research over software engineering skills and the MLT won't really have any ownership over the data or oftentimes even the skill sets to build data pipelines themselves and similarly they won't be responsible for deploying models either and in particular models will rarely make it into production so that won't really be a huge issue embedded ml teams typically they'll prioritize software engineering skills over research skills and all researchers if they even have
|
224 |
+
|
225 |
+
57
|
226 |
+
00:34:42,060 --> 00:35:21,359
|
227 |
+
researchers will need to have strong software engineers skills because everyone's expected to deploy it ml teams still generally doesn't own data because they are working with data Engineers from the rest of the organizations to build data pipelines but since the expectation in these types of organizations is that everyone deploys typically ml Engineers will own maintenance of the models that they deploy in the ml function archetype typically the requirement will be that you'll need to have a team that has a strong mix of software engineering research and data skills so the team size here starts to become larger a minimum might be something like one data engineer one ml engineer potentially a platform engineer or a devops engineer
|
228 |
+
|
229 |
+
58
|
230 |
+
00:35:18,839 --> 00:35:57,420
|
231 |
+
and potentially a PM but these teams are often working with a bunch of other functions so they can in many cases get much larger than that and you know in many cases in these organizations you'll have both software engineers and researchers working closely together within the context of a single team usually at this stage ml teams will start to have a voice in data governance discussions and they'll probably also have some strong internal data engineering functions as well and then since the ml team is centralized at this stage they'll hand off models to a user but in many cases they'll still be responsible for maintaining them although that line is blurry in a lot of organizations that run this model finally in ml first organizations
|
232 |
+
|
233 |
+
59
|
234 |
+
00:35:55,320 --> 00:36:32,700
|
235 |
+
there's no real standardization around how teams are research oriented or not but research teams do tend to work pretty closely with software engineering teams to get things done in some cases the ml team is actually the one that owns company-wide data infrastructure because ml is such a central bet for the company that it makes sense for the ml team to make some of the sort of main decisions about how data will be organized then finally if the ml team is the one that actually built the model they'll typically hand it off to a user who since they have the basic ml skills and knowledge to do this they'll actually be the one to maintain the model and here's all this on one slide if you want to look at it all together
|
236 |
+
|
237 |
+
60
|
238 |
+
00:36:31,140 --> 00:37:06,720
|
239 |
+
all right so we've talked about machine learning teams and organizations and how these come together and the next thing that we're going to talk about is team management and product management for machine learning so the first thing to know about product management and team management for ML is that it tends to be really challenging there's a few reasons for this the first is that it's hard to tell in advance how easy or hard something is going to be so this is an example from a blog post by Lucas B Walt where they ran a kaggle competition and in the first week of that kago competition they saw a huge increase in the accuracy of the best performing model they went from 35 to 70 accuracy within one week and they were thinking
|
240 |
+
|
241 |
+
61
|
242 |
+
00:37:04,859 --> 00:37:42,359
|
243 |
+
this is great like we're gonna hit 95 accuracy and this contest is going to be a huge success but then if you zoom out and look at the entire course of the project over three months it turns out that most of that accuracy gain came in the first week and the improvements thereafter were just marginal and that's not because of a lack of effort the number of participating teams was still growing really rapidly over the course of that time so the upshot is it's really hard to tell in advance how easier or hard something is in ml and looking at signals like how quickly are we able to make progress on this project can be very misleading or related challenge is that progress on ML projects tends to be very non-linear so
|
244 |
+
|
245 |
+
62
|
246 |
+
00:37:40,260 --> 00:38:17,460
|
247 |
+
it's very common for projects to stall for weeks or longer because the ideas that you're trying just don't work or because you hit some sort of unforeseen snag with not having the right data or something like that that causes you to really get stuck and on top of that in the earliest stages of doing the project it can be very difficult to plan and to tell how long the project is going to take because it's unclear what approach will actually work for training a model that's good enough to solve the problem and the upshot of all this is that estimating the timeline for a project when you're in the project planning phase can be very difficult in other words production ml is still somewhere between research and Engineering another
|
248 |
+
|
249 |
+
63
|
250 |
+
00:38:14,700 --> 00:38:54,540
|
251 |
+
challenge for managing ml teams is that there's cultural gaps that exist between research and Engineering organizations these folks tend to come from different backgrounds they have different training they have different values goals and norms for example oftentimes you know stereotypically researchers care about novelty and about how exciting the approach is that they took to solve a problem whereas you know again stereotypically oftentimes software Engineers care about did we make the thing work and in more toxic cultures these two sides often can class and even if they don't Clash directly they might not really value each other as much as they should because both sides are often necessary to build the thing that you
|
252 |
+
|
253 |
+
64
|
254 |
+
00:38:53,099 --> 00:39:28,740
|
255 |
+
want to build to make batteries worse when you're managing a team as part of an organization you're not just responsible for making sure the team does what they're supposed to do but you'll also have to manage up to help leadership understand your progress and what the Outlook is for the thing that you're building since ml is such a new technology many leaders and organizations even in good technology organizations don't really understand it so next I want to talk about some of the ways that you can manage machine learning projects better and the first approach that I'll talk about is doing project planning probabilistically so oftentimes when we think about project planning for software projects we think
|
256 |
+
|
257 |
+
65
|
258 |
+
00:39:26,520 --> 00:40:07,740
|
259 |
+
about it as sort of a waterfall right where you have a set of tasks and you have a set of time estimates for those tasks and a set of dependencies for those tasks and you can plan these out one after another so if task G depends on tasks D and F then task G will happen once those are done if task D depends on C which depends on task a you'll start Task D after a and C are done Etc but in machine learning this can lead to frustration and badly estimated timelines because each of these projects has a higher chance of failure than it does in a typical software project what we ended up doing at open AI was doing project planning probabilistically so rather than assuming that like a particular task is going to take a
|
260 |
+
|
261 |
+
66
|
262 |
+
00:40:06,060 --> 00:40:45,540
|
263 |
+
certain amount of time instead we assign probabilities to let the likelihood of completion of each of these tasks and potentially pursue alternate tasks that allow us to unlock the same dependency in parallel so in this example you know maybe task fee and task C are both alternative approaches to unlocking task D so we might do both of them at the same time and so if we realize all of a sudden that task C is not going to work and task B is taking longer than we expected then we can adjust the timeline appropriately and then we can start planning the next wave of tasks once we know how we're going to solve the prerequisite tasks that we needed a coral area of doing machine learning project planning probabilistically is
|
264 |
+
|
265 |
+
67
|
266 |
+
00:40:43,560 --> 00:41:20,700
|
267 |
+
that you you shouldn't have any path critical projects that are fundamentally research research projects have a very very high rate of failure rather than just saying like this is how we're going to solve this problem instead you should be willing to try a variety of approaches to solve that problem that doesn't necessarily mean that you need to do them all in parallel but many good machine learning organizations do so one way to think about this is you know if you know that you need to build like a model that's never been built in your organization before you can have like a friendly competition of ideas if you have a culture that's built around working together as a team to get to the right answer and not just rewarding the
|
268 |
+
|
269 |
+
68
|
270 |
+
00:41:19,140 --> 00:41:55,079
|
271 |
+
one person who solves the problem correctly another corollary to this idea that that many machine learning ideas can and will fail is that when you're doing Performance Management it's important not to get hung up on just who is the person whose ideas worked in the long term it's important for people to do things that work like over the course of you know many many months or years if nothing that you try works then that's maybe an indication that you're not trying the right things you're not executing effectively but on at any given project object like on a timeline of weeks or a quarter then the success measure that you should be looking at is how well you executed on the project not whether the project happened to be one
|
272 |
+
|
273 |
+
69
|
274 |
+
00:41:53,579 --> 00:42:29,820
|
275 |
+
of the ones that worked one failure mode that I've seen in organizations that hire both researchers and Engineers is implicitly valuing one side more than the other so thinking engineering is more important than research which can lead to things getting stuck on the ml side because the ml side is not getting the resources or attention that they deserve or thinking that research is more important than engineering which can lead to creating ml innovations that are not actually useful so oftentimes the way around this is to have engineers and researchers work very closely together in fact like sometimes uncomfortably close together like working together on the same code base for the same project and understanding
|
276 |
+
|
277 |
+
70
|
278 |
+
00:42:28,320 --> 00:43:05,700
|
279 |
+
that these folks bring different skill sets to the table another key to success I've seen is trying to get quick wins so rather than trying to build a perfect model and then deploy it trying to ship something quickly to demonstrate that this thing can work and then iterate on it over time and then the last thing that you need to do if you're in a position of being the product manager or the engineering manager for an ml team is to put more emphasis than you might think that you need on educating the rest of your organization on how ml Works diving into that a bit more if your organization is relatively new to adopting ml I'd be willing to bet that a lot of people in the organization don't understand one or more of these things
|
280 |
+
|
281 |
+
71
|
282 |
+
00:43:03,359 --> 00:43:43,260
|
283 |
+
for us as like ml practitioners it can be really natural to think about where ml can and can't be used but for a lot of technologists or Business Leaders that are new to ml the uses of ml that are practical can be kind of counter-intuitive and so they might have ideas for ML projects that are feasible and they might miss ideas for ML projects that are pretty easy that don't fit their mental model of what ml can use another common point of friction in dealing with the rest of the organization is convincing the rest of the organization that the ml that you built actually works Business Leaders and folks from product teams typically the same metrics that convince us as ml practitioners that this model is useful
|
284 |
+
|
285 |
+
72
|
286 |
+
00:43:41,460 --> 00:44:18,780
|
287 |
+
won't convince them like just looking at an F1 score or an accuracy score doesn't really tell them what they need to know about whether this model is really solving the task that it needs to solve for the business outcome that they're aiming for and one particular way that this presents itself pretty frequently is in Business Leaders and other stakeholders not really sort of wrapping their heads around the fact that ml is inherently probabilistic and that means that it will fail in production and so a lot of times where ml efforts get hung up is in the same stakeholders potentially that champion the project to begin with not really being able to get comfortable with the fact that once the model is out in the world it's you know
|
288 |
+
|
289 |
+
73
|
290 |
+
00:44:17,339 --> 00:44:57,780
|
291 |
+
the users are going to start to see failures that it makes in almost all cases and the last common failure mode in working with the rest of the organization is the rest of the organization treating ml projects like other software projects and not realizing that they need to be managed differently than other software projects too and one particular way that I've seen this become a problem is when leadership gets frustrated at ml team because they're not able to really accurately convey how long projects are going to take to complete so educating leadership and other stakeholders on the probabilistic nature of ml projects is important to maintaining your sanity as an ml team if you want to share some resources with your execs that they can
|
292 |
+
|
293 |
+
74
|
294 |
+
00:44:55,500 --> 00:45:37,079
|
295 |
+
use to learn more about how these projects play out in the practice of real organizations I would recommend Peter beale's AI strategy class from the business school at UC Berkeley and Google's people in AI guidebook which we'll be referring to a lot more in the rest of the lecture as well the last thing I'll say on educating the rest of the organization on ml is that mlpms I think play like one of the most critical roles in doing this effectively to illustrate this I'm going to make an analogy to the two types of ml engineers and describe two prototypal types of mlpms that I see in different organizations so on one hand we have our task mlpms these are like a PM that's responsible for a specific product or
|
296 |
+
|
297 |
+
75
|
298 |
+
00:45:35,280 --> 00:46:16,680
|
299 |
+
specific product feature that heavily uses ml these folks will need to have a pretty specialized knowledge of ML and how it applies to the particular domain that they're working on so for example they might be the PM for the trust and safety product for your team or particular recommendation product for your team and these are probably the more common type of mlpms in Industry today but an emerging type of mlpm is the platform mlpm platform mlpms tend to start to make sense when you have a centralized ml team and that centralized ml team needs to play some role in educating the rest of the organization in terms of like what are productive uses of ml in all the products that the organization is building because these
|
300 |
+
|
301 |
+
76
|
302 |
+
00:46:14,640 --> 00:46:55,260
|
303 |
+
folks are responsible for managing the workflow in and out of the ml team so helping filter out projects that aren't really high priority for the business or aren't good uses of ml helping proactively find projects that might have a big impact on the the product or the company by spending a lot of time with PMS from the rest of the organization and communicating those priorities to the ml team and outward to the rest of the organization this requires a broad knowledge of ml because a lot of what this role entails is trying to really understand where ml tan and should and shouldn't be applied in the context of all the things the organization is doing and one of the other critical roles that platform MLT
|
304 |
+
|
305 |
+
77
|
306 |
+
00:46:52,619 --> 00:47:30,720
|
307 |
+
and PMs could play is spreading ml knowledge and culture throughout the rest of the organization not just going to PMs and business stakeholders from the other product functions and Gathering requirements from them but also helping educate them on what's possible to do with ML and helping them come up with ideas to use ml in their areas of responsibility that they find exciting so that they can over time really start to build their own intuition about what types of things they should be considering ml to be used for and then another really critical role that these platform mlpms can play is mitigating the risks of you know we've built a model but we can't convince the rest of the organization to actually use it by being really crisp
|
308 |
+
|
309 |
+
78
|
310 |
+
00:47:28,920 --> 00:48:07,440
|
311 |
+
about what are the requirements that we actually need this model to fulfill and then proactively communicating with the other folks that need to be bought in about the model's performance to help them understand all the things that they'll need to understand about them also really trust its performance so platform mlpms are or I think a newer Trend in ml organizations but I think one that can have a big impact on the success of ml organizations when you're in this phase starting to build a centralized ml team or trans transition from a centralized ml team to becoming an ml first organization one question I get a lot about ml product management is what's the equivalent of agile or any of these established development
|
312 |
+
|
313 |
+
79
|
314 |
+
00:48:05,280 --> 00:48:46,980
|
315 |
+
methodologies for software in ml is there something like that that we can just take off the shelf and apply and deliver successful ml products and the answer is there's a couple of emerging ml project management methodologies the first is Chris DM which is actually an older methodology but it was originally focused on Data Mining and has been subsequently applied to data science and ML and the second is the team data science process tdsp from Microsoft what these two things have in common is that they describe the stages of ml projects as sort of a loop where you start by trying to understand the problem that you're trying to solve acquiring data building a model evaluating it and then finally deploying it so the main reason
|
316 |
+
|
317 |
+
80
|
318 |
+
00:48:45,180 --> 00:49:24,960
|
319 |
+
to use one of these methodologies would be if you really want standardization for what you call the different stages of the Project Life Cycle if you're choosing between these two tdsp tends to be a little bit more structured it provides like sort of more granular list of roles responsibilities templates that you can use to actually execute on this process crisp DM is a bit higher level so if you need an actual like granular project management framework then I would start by trying tdsp but I'll see more generally it's reasonable to use these if you truly have a large scale coordination problem if you're trying to get a large ml team working together successfully for the first time but I would otherwise recommend skipping these
|
320 |
+
|
321 |
+
81
|
322 |
+
00:49:23,280 --> 00:50:03,660
|
323 |
+
because they're more focused on traditional data mining or data science processes and they'll probably slow you down so I would sort of exercise caution before implementing one of these methodologies in full the last thing I want to talk about is designing products that lend themselves well to being powered by Machine learning so I think the fundamental challenge in doing this is a gap between in what users expect when they're ended in AI powered products and what they actually get and so what users tend to think when they're given an AI powered product is you know their mental model is often human intelligence but better and in Silicon so they think it um has this knowledge of the world that it as achieved by
|
324 |
+
|
325 |
+
82
|
326 |
+
00:50:02,099 --> 00:50:41,099
|
327 |
+
reading the whole internet oftentimes they think that this product knows me better than I know myself because it has all the data about me from every interaction I've ever had with software they think that AI Power Products learn from their mistakes and that they generalize to new problems right because it's intelligence it's able to learn from new examples to solve new tasks but I think a better mental model for what you actually get with an ml powered products is a dog that you train to solve a puzzle right so it's amazing that it can solve the puzzle and it's able to solve surprisingly hard puzzles but at the end of the day it's just a dog solving a puzzle and in particular dogs are weird little guys right they
|
328 |
+
|
329 |
+
83
|
330 |
+
00:50:39,000 --> 00:51:18,180
|
331 |
+
tend to fail and strange and unexpected ways that you know we as people with like human intelligence might not expect they also get distracted easily right like if you take them outside they might not be able to solve the same problem that they're able to solve inside they don't generalize outside of a narrow domain The Stereotype is that you can't teach an old dog new tricks and in ml it's often hard to adapt general knowledge should new tasks or new contexts dogs are great at learning tricks but they can't do it if you don't give them treats and similarly machine Learning Systems don't tend to learn well without feedback or rewards in place to help understand where they're performing well and where they're not
|
332 |
+
|
333 |
+
84
|
334 |
+
00:51:16,020 --> 00:51:54,240
|
335 |
+
performing well and lastly both dogs learning tricks and machine Learning Systems might misbehave if you leave them unattended the implication is that there's a big gap between users mental model for machine learning products and what they actually get from machine learning products so the upshot is that the goal of good ml product design is to bridge the user's expectation with reality and there's a few components to that the first is helping users understand what they're actually getting from the model and also its limitations the the second is that since failures are inevitable we need to be able to handle those failures gracefully which means not over relying on Automation and being able to fall back in many cases
|
336 |
+
|
337 |
+
85
|
338 |
+
00:51:52,260 --> 00:52:37,140
|
339 |
+
too human in the loop and then the final goal of ml product design is to build in feedback loops that help us use data from our users to actually improve the system one of the best practices for ML product design is explaining the benefits and limitations of the system to users one way that you can do that is since users tend to have misconceptions about what AI can and can't do focus on what problem the product is actually solving for the user not on the fact that it's AI powered and similarly the more open-ended and human feeling you make the product experience like allowing users to enter any information that they want to or ask questions in whatever natural language that they want to the more they're going to treat it as
|
340 |
+
|
341 |
+
86
|
342 |
+
00:52:34,980 --> 00:53:15,240
|
343 |
+
human-like and expose some of the failure modes that the system still has so one example of this was when Amazon Alexa was first released one of the sort of controversial decisions that they made was they limited it to a very specific set of prompts that you could say to it rather than having it be an open-ended language or dialogue system and that allowed them to really focus on training users to interact with the system in a way that it was likely to be able to understand and then finally the reality is that your model has limitations and so you should explain those limitations to users and consider actually just baking those limitations into the model as guardrails so not letting your users provide input to your
|
344 |
+
|
345 |
+
87
|
346 |
+
00:53:13,680 --> 00:53:53,880
|
347 |
+
model that you know the model is not going to perform well on so that could be as simple as you know if your NLP system was designed to perform well on English text then detecting if users input text in some other language and you know either warning them or not allowing them to input text in a language where your model is not going to perform well the next best practice for ML product design is to not over rely on Automation and instead try to design where possible for a human in the loop automation is great but failed automation can be worse than automation at all so it's worth thinking about even if you know what the right answer is for your users how can you add low friction ways to let users confirm the model's
|
348 |
+
|
349 |
+
88
|
350 |
+
00:53:52,140 --> 00:54:28,920
|
351 |
+
predictions so that they don't have a terrible experience when the model does something wrong and they have no way to fix it one example of this was back when Facebook had an auto tagging feature of you know recognizing your face and pictures and suggesting who the person was they didn't just assign the tag to the face even though they almost always knew exactly who that person was because it'd be a really bad experience if all of a sudden you were tagged in some picture of someone else instead they just add like simple yes no that lets you confirm that they in fact got the prediction that this is your face correctly in order to mitigate the effect of when the model inevitably does make some bad predictions there's a
|
352 |
+
|
353 |
+
89
|
354 |
+
00:54:27,720 --> 00:55:06,300
|
355 |
+
couple of patterns that can help there the first is it's a really good idea to always bake in some way of letting users take control of the system like in a self-driving car to be able to grab the wheel and steer the car back on track if it makes a mistake and another pattern for mitigating the cost so bad predictions is looking at how confident the model is in its response and maybe being prudent about only showing responses to users that are pretty high confidence potentially falling back to a rules-based system or just telling the user that you don't have a good answer to that question the third best practice for ML product design is building in feedback loops with your users so let's talk about some of the different types
|
356 |
+
|
357 |
+
90
|
358 |
+
00:55:04,859 --> 00:55:42,000
|
359 |
+
of feedback that you might collect from your users on the x-axis is how easy it is to use the feedback that you get in order to actually directly make your model better on the y-axis is how much friction does it add to your users to collect this feedback so roughly speaking you could think about like above this line on the middle of the chart is implicit feedback that you collect from your users without needing to change their behavior and on the right side of the chart are signals that you can train on directly without needing to have some human intervention the type of feedback that introduces the least friction to your user is just collecting indirect implicit feedback on how well the prediction is working for
|
360 |
+
|
361 |
+
91
|
362 |
+
00:55:40,079 --> 00:56:17,520
|
363 |
+
them so these are signals about user behavior that tend to be a proxy for mobile performance like did the user churn or not these tend to be super easy to collect because they're often instrumented in your product already and they're really useful because they correspond to important outcomes for our products the challenge in using these is that it's often very difficult to tell whether the model is the cause because these are high level sort of business outcomes that may depend on many other things other than just your model's prediction so to get more directly useful signals from your users you can consider collecting direct implicit feedback where you collect signals from the products that measure how useful
|
364 |
+
|
365 |
+
92
|
366 |
+
00:56:15,240 --> 00:56:51,240
|
367 |
+
this prediction is to the user directly rather than indirectly for example if you're giving the user a recommendation you can measure whether they clicked on the recommendation or if you're suggesting an email for them to send did they send that email or did they copy the suggestion so they can use it in some other application oftentimes these take the form of did the user take the next step in whatever process that they're running that they take the prediction you gave them and use it Downstream for whatever tasks they're trying to do the great thing about this type of feedback is that you can often train on directly because it gives you a signal about you know which predictions the model made that were actually good
|
368 |
+
|
369 |
+
93
|
370 |
+
00:56:48,900 --> 00:57:27,119
|
371 |
+
at solving the task for the user but the challenge is that not every setup of your product lends itself to collecting this type of feedback so you may need to redesign your products in order to collect feedback like this next we'll move on to explicit types of user feedback explicit feedback is where you ask your user directly to provide feedback on the model's performance and the lowest friction way to do this for users tends to be to give them some sort of binary feedback mechanism which can be like a thumbs up or thumbs down button in your product this is pretty easy for users because it just requires them to like click one button and it can be a decent training signal there's some research and using signals like this in
|
372 |
+
|
373 |
+
94
|
374 |
+
00:57:24,660 --> 00:58:03,960
|
375 |
+
order to guide the learning process of models to be more aligned with users preferences if you want a little bit more signal than just was this prediction good or bad you can also ask users to help you categorize the feedback that they're giving they could for example like flag certain predictions as incorrect or offensive or irrelevant or not useful to me you can even set this up as a second step in the process after binary feedback so users will still give you binary feedback even if they don't want to spend the time to categorize that feedback and these signals can be really useful for debugging but it's difficult to set things up in such a way that you can train on them directly another way you can get more granular feedback on Mall's
|
376 |
+
|
377 |
+
95
|
378 |
+
00:58:02,220 --> 00:58:38,040
|
379 |
+
predictions is to have like some sort of free text input where users can tell you what they thought about in prediction this often manifests itself in support tickets or support requests for your model this requires a lot of work on the part of your users and it can be very difficult to use as a model developer because you have to parse through this like unstructured feedback about your model's predictions yet it tends to be quite useful sometimes in practice because since it's high friction to actually provide this kind of feedback the feedback that users do provide can be very high signal it can highlight in some cases like the highest friction predictions since users are willing to put in the time to complain about them
|
380 |
+
|
381 |
+
96
|
382 |
+
00:58:36,180 --> 00:59:19,200
|
383 |
+
and then finally the gold standard for user feedback if it's possible to do in the context of your products and your user experience is is to have users correct the predictions that your model actually makes so if you can get users to label stuff for you directly then that's great then you're in a really good spot here and so one way to think about like where this can actually be feasible is if the thing that you're making a prediction for is useful to the user Downstream within the same product experience that you're building not is this useful for them to copy and use in a different app but is it useful for them to use within my app so one example of this is in product called great scope which Sergey built there is a model that
|
384 |
+
|
385 |
+
97
|
386 |
+
00:59:16,020 --> 00:59:59,700
|
387 |
+
when students submit their exams it tries to match the handwritten name on the exam with the name of the student in the student registry now if the model doesn't really know who that student is if it's low confidence or if it gets the prediction wrong then the instructor can go in and re-categorize that to be the correct name that's really useful to them because they need to have the exam categorized to the correct student anyway but it's also very direct supervisory signal for the model so it's Best of Both Worlds whenever you're thinking about building explicit feedback into your products it's always worth keeping in mind that you know users are not always as altruistic as we might hope that they would be and so you
|
388 |
+
|
389 |
+
98
|
390 |
+
00:59:57,720 --> 01:00:35,460
|
391 |
+
should also think about like how is it going to be worthwhile for users to actually spend the time to give us feedback on this the sort of most foolproof way of doing this is as we described before to gather feedback as part of an existing user workflow but if that's not possible if the goal of users providing the feedback is to make the model better then one way you can encourage them to do that is to make it explicit how the feedback will make their user experience better and generally speaking like the more explicit you can be here and the shorter the time interval is between when they give the feedback and when they actually see the product get better the more of a sort of positive feedback loops this
|
392 |
+
|
393 |
+
99
|
394 |
+
01:00:33,900 --> 01:01:14,040
|
395 |
+
creates for that the more likely is that they're actually going to do it a good example here is to acknowledge user feedback and adjust automatically so so if your user provided you feedback saying hey I really like running up hills then sort of good response to that feedback might be great here's another hell that you can run up in 1.2 kilometers they see the results of that feedback immediately and it's very clear how it's being used to make the product experience better less good is the example to the right of that where the response to the feedback just says thank you for your feedback because I as a user when I give that feedback there's no way for me to know whether that feedback is actually making the product
|
396 |
+
|
397 |
+
100
|
398 |
+
01:01:12,000 --> 01:01:50,520
|
399 |
+
experience better so it discourages me from getting more feedback in the future the main Takeaway on product design for machine learning is that great ml powered products and product experiences are not just you know take an existing product that works well in both and then on top of it they're actually designed from scratch with machine learning and the particularities of machine learning in mind and some reasons for that include that unlike what your users might think machine learning is not superhuman intelligence encoded in Silicon and so your product experience needs to help users understand that in the context of the particular problem that you are solving for them it also needs to help them interact safely with
|
400 |
+
|
401 |
+
101
|
402 |
+
01:01:48,420 --> 01:02:25,079
|
403 |
+
this model that has failure modes via human in the loop and guard rails around the experience with interacting with that model and finally great ml products are powered by great feedback loops right because the perfect version of the model doesn't exist and certainly it doesn't exist in the first version of the model that you deployed and so one important thing to think about when you're designing your product is how can you help your users make the product experience better by collecting the right feedback from them this is a pretty young and underexplored topic and so here's a bunch of resources that I would recommend checking out if you want to learn more about this many of the examples that we used in the previous
|
404 |
+
|
405 |
+
102
|
406 |
+
01:02:23,579 --> 01:02:59,880
|
407 |
+
slides are pulled from these resources and in particular the resource from Google in the top bullet point is really good if you want to understand the basics of this field so to wrap up this lecture we talk about a bunch of different topics related to how to build machine learning products as a team and the first is machine learning roles and the sort of takeaway here is that there's many different skills involved in production machine learning machine production ml is inherently interdisciplinary so there's an opportunity for lots of different skill sets to help contribute when you're building machine learning teams since there's a scarcity of talent especially talent that is good at both software engineering and machine learning it's
|
408 |
+
|
409 |
+
103
|
410 |
+
01:02:58,140 --> 01:03:32,339
|
411 |
+
important to be specific about what you really need for these roles but paradoxically as an outsider it can be difficult to break into the field and the sort of main recommendation that we had for how to get around that is by using projects to build awareness of your thinking about machine learning the next thing that we talk about is how machine learning teams fit into the broader organization we covered a bunch of different archetypes for how that can work and we looked at how machine learning teams are becoming more Standalone and more interdisciplinary in how they function next we talk about managing ml teams and managing ml products managing ml teams is hard and there's no Silver Bullet here but one
|
412 |
+
|
413 |
+
104
|
414 |
+
01:03:30,900 --> 01:04:01,400
|
415 |
+
sort of concrete thing that we looked at is probabilistic Project planning as a way to help alleviate some of the challenges of understanding how long it's going to take to finish machine learning projects and then finally we talk about product design in the context of of machine learning and the main takeaway there is that today's machine learning systems are not AGI right they're Limited in many ways and so it's important to make sure that your users understand that and that you can use the interaction that you build with your users to help mitigate those limitations so that's all for today and we'll see you next week
|
416 |
+
|
documents/lecture-09.md
ADDED
@@ -0,0 +1,825 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
---
|
2 |
+
description: Building ML for good while building good ML
|
3 |
+
---
|
4 |
+
|
5 |
+
# Lecture 9: Ethics
|
6 |
+
|
7 |
+
<div align="center">
|
8 |
+
<iframe width="720" height="405" src="https://www.youtube-nocookie.com/embed/7FQpbYTqjAA?list=PL1T8fO7ArWleMMI8KPJ_5D5XSlovTW_Ur" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
|
9 |
+
</div>
|
10 |
+
|
11 |
+
Lecture by [Charles Frye](https://twitter.com/charles_irl).
|
12 |
+
Notes by [James Le](https://twitter.com/le_james94) and [Vishnu Rachakonda](https://www.linkedin.com/in/vrachakonda/).<br />
|
13 |
+
Published October 03, 2022.
|
14 |
+
[Download slides](https://fsdl.me/2022-lecture-09-slides).
|
15 |
+
|
16 |
+
In this final lecture of FSDL 2022, we'll talk about ethics. After going
|
17 |
+
through the context of what we mean by ethics, we'll go through three
|
18 |
+
different areas where ethics come up:
|
19 |
+
|
20 |
+
1. **Tech Ethics**: ethics that anybody who works in the tech industry
|
21 |
+
broadly needs to think about.
|
22 |
+
|
23 |
+
2. **ML Ethics**: what ethics has specifically meant for the ML
|
24 |
+
industry.
|
25 |
+
|
26 |
+
3. **AI Ethics**: what ethics might mean in the future where true AGI
|
27 |
+
exists.
|
28 |
+
|
29 |
+
## 1 - Overview and Context
|
30 |
+
|
31 |
+
All ethics lectures are wrong, but some are useful. They are more useful
|
32 |
+
if we admit and state what our assumptions or biases are. We'll also
|
33 |
+
talk about three general themes that come up often when ethical concerns
|
34 |
+
are raised in tech/ML: alignment, trade-offs, and humility.
|
35 |
+
|
36 |
+
![](./media/image17.png)
|
37 |
+
|
38 |
+
In this lecture, we'll approach ethics on the basis of **concrete
|
39 |
+
cases** - specific instances where people have raised concerns. We'll
|
40 |
+
talk about **cases where people have taken actions that have led to
|
41 |
+
claims and counter-claims of ethical or unethical behavior** - such as
|
42 |
+
the use of automated weapons, the use of ML systems to make decisions
|
43 |
+
like sentencing and bail, and the use of ML algorithms to generate art.
|
44 |
+
In each case when criticism has been raised, part of that criticism has
|
45 |
+
been that the technology is unethical.
|
46 |
+
|
47 |
+
Approaching ethics in this way allows us to answer the question of "What
|
48 |
+
is ethics?" by way of Ludwig Wittgenstein's quote: "*The meaning of a
|
49 |
+
word is its use in the language*." We'll focus on times when people have
|
50 |
+
used the word "ethics" to describe what they like or dislike about a
|
51 |
+
specific technology.
|
52 |
+
|
53 |
+
If you want to try it out for yourself, you should check out the game
|
54 |
+
"[Something Something Soup
|
55 |
+
Something](https://soup.gua-le-ni.com/)." In this browser
|
56 |
+
game, you are presented with a bunch of dishes and have to decide
|
57 |
+
whether they are soup or not soup, as well as whether they can be served
|
58 |
+
to somebody who ordered soup. By playing a game like this, you'll
|
59 |
+
discover (1) how difficult it is to come up with a concrete definition
|
60 |
+
of soup and (2) how poorly your working definition of soup fits with any
|
61 |
+
given soup theory.
|
62 |
+
|
63 |
+
Because of this case-based approach, we won't be talking about ethical
|
64 |
+
schools or "trolley" problems. Rather than considering [these
|
65 |
+
hypothetical
|
66 |
+
scenarios](https://www.currentaffairs.org/2017/11/the-trolley-problem-will-tell-you-nothing-useful-about-morality),
|
67 |
+
we'll talk about concrete and specific examples from the past decade of
|
68 |
+
work in our field and adjacent fields.
|
69 |
+
|
70 |
+
![](./media/image19.png)
|
71 |
+
|
72 |
+
If you want another point of view that emphasizes the trolley problems,
|
73 |
+
you should check out [Sergey's lecture from the last edition of the
|
74 |
+
course from
|
75 |
+
2021](https://fullstackdeeplearning.com/spring2021/lecture-9/).
|
76 |
+
It presented similar ideas from a different perspective and came to the
|
77 |
+
same conclusion and some different conclusions.
|
78 |
+
|
79 |
+
A useful theme from that lecture that we should all have in mind when we
|
80 |
+
ponder ethical dilemmas is "What Is Water?" - which came up from [a
|
81 |
+
famous commencement speech by David Foster
|
82 |
+
Wallace](https://www.youtube.com/watch?v=PhhC_N6Bm_s). If
|
83 |
+
we aren't thoughtful and paying attention, things that are very
|
84 |
+
important can become background, assumptions, and invisible to us.
|
85 |
+
|
86 |
+
The approach of **relying on prominent cases risks replicating social
|
87 |
+
biases**. Some ethical claims are amplified and travel more because
|
88 |
+
people (who are involved) have more resources and are better connected.
|
89 |
+
Using these forms of case-based reasoning (where you explain your
|
90 |
+
beliefs in concrete detail) can **hide the principles that are actually
|
91 |
+
in operation**, making them disappear like water.
|
92 |
+
|
93 |
+
But in the end, **so much of ethics is deeply personal** that we can't
|
94 |
+
expect to have a perfect approach. We can just do the best we can and
|
95 |
+
hopefully become better every day.
|
96 |
+
|
97 |
+
## 2 - Themes
|
98 |
+
|
99 |
+
We'll see three themes repeatedly coming up throughout this lecture:
|
100 |
+
|
101 |
+
1. **Alignment**: a conflict between what we want and what we get.
|
102 |
+
|
103 |
+
2. **Trade-Offs**: a conflict between what we want and what others
|
104 |
+
want.
|
105 |
+
|
106 |
+
3. **Humility**: a response when we don't know what we want or how to
|
107 |
+
get it.
|
108 |
+
|
109 |
+
### Alignment
|
110 |
+
|
111 |
+
The problem of **alignment** (where what we want and what we get differ)
|
112 |
+
come up over and over again. A primary driver of this is called the
|
113 |
+
**proxy problem** - in which we often optimize or maximize some proxies
|
114 |
+
for the thing that we really care about. If the alignment (or loosely
|
115 |
+
the correlation between that proxy and the thing we care about) is poor
|
116 |
+
enough, then by trying to maximize that proxy, we can end up hurting the
|
117 |
+
thing we originally cared about.
|
118 |
+
|
119 |
+
![](./media/image16.png)
|
120 |
+
|
121 |
+
There was [a recent
|
122 |
+
paper](https://arxiv.org/abs/2102.03896) that did a
|
123 |
+
mathematical analysis of this idea. You can see these kinds of proxy
|
124 |
+
problems everywhere once you look for them.
|
125 |
+
|
126 |
+
- On the top right, we have a train and validation loss chart from one
|
127 |
+
of the training runs for the FSDL text recognizer. The thing we
|
128 |
+
can optimize is the training loss. That's what we can use to
|
129 |
+
calculate gradients and improve the parameters of our network. But
|
130 |
+
the thing we really care about is the performance of the network
|
131 |
+
on data points that it has not seen (like the validation set, the
|
132 |
+
test set, or data in production). If we optimize our training loss
|
133 |
+
too much, we can actually cause our validation loss to go up.
|
134 |
+
|
135 |
+
- Similarly, there was [an interesting
|
136 |
+
paper](https://openreview.net/forum?id=qrGKGZZvH0)
|
137 |
+
suggesting that increasing your accuracy on classification tasks
|
138 |
+
can actually result in a decrease in the utility of your
|
139 |
+
embeddings in downstream tasks.
|
140 |
+
|
141 |
+
- You can find these proxy problems outside of ML as well. [This
|
142 |
+
thread](https://skeptics.stackexchange.com/questions/22375/did-a-soviet-nail-factory-produce-useless-nails-to-improve-metrics)
|
143 |
+
reveals an example where a factory that was making chemical
|
144 |
+
machines (rather than creating a machine that was cheaper and
|
145 |
+
better) chose not to adopt producing that machine because their
|
146 |
+
output was measured in weight. So the thing that the planners
|
147 |
+
actually cared about, economic efficiency and output, was not
|
148 |
+
optimized because it was too difficult to measure.
|
149 |
+
|
150 |
+
One reason why these kinds of proxy problems arise so frequently is due
|
151 |
+
to issues of information. **The information that we are able to measure
|
152 |
+
is not the information that we want**. At a higher level, we often don't
|
153 |
+
know what it is that we truly needed. We may want the validation loss,
|
154 |
+
but what we need is the loss in production or really the value our users
|
155 |
+
will derive from this model.
|
156 |
+
|
157 |
+
### Trade-Offs
|
158 |
+
|
159 |
+
Even when we know what we want or what we need, we are likely to run
|
160 |
+
into the second problem - **the tradeoff between stakeholders**. It is
|
161 |
+
sometimes said that the need to negotiate tradeoffs is one of the
|
162 |
+
reasons why engineers do not like thinking about some of these problems
|
163 |
+
around ethics. That's not quite right because we do accept tradeoffs as
|
164 |
+
a key component of engineering.
|
165 |
+
|
166 |
+
- In [this O'Reilly book on the fundamentals of software
|
167 |
+
architecture](https://www.oreilly.com/library/view/fundamentals-of-software/9781492043447/),
|
168 |
+
the first thing they state at the beginning is that **everything
|
169 |
+
in software architecture is a tradeoff.**
|
170 |
+
|
171 |
+
- [This satirical O'Reilly
|
172 |
+
book](https://www.reddit.com/r/orlybooks/comments/50meb5/it_depends/)
|
173 |
+
says that every programming question has the answer: "It depends."
|
174 |
+
|
175 |
+
![](./media/image20.png)
|
176 |
+
|
177 |
+
|
178 |
+
The famous chart above compares the different convolutional networks on
|
179 |
+
the basis of their accuracy and the number of operations to run them.
|
180 |
+
Thinking about these tradeoffs between speed and correctness is exactly
|
181 |
+
the thing we have to do all the time in our job as engineers.
|
182 |
+
|
183 |
+
We can select the **Pareto Front** for the metrics we care about. A way
|
184 |
+
to remember what a Pareto front is [this definition of a data scientist
|
185 |
+
from Josh
|
186 |
+
Wills](https://twitter.com/josh_wills/status/198093512149958656?lang=en):
|
187 |
+
"Person who is better at statistics than any software engineer and
|
188 |
+
better at software engineering than any statistician." The Pareto Front
|
189 |
+
in the chart above includes the models that are more accurate than those
|
190 |
+
with fewer FLOPs and use fewer FLOPs than those that are more accurate.
|
191 |
+
|
192 |
+
A reason why engineers may dislike thinking about these problems is that
|
193 |
+
**it's hard to identify and quantify these tradeoffs**. These are indeed
|
194 |
+
proxy problems. Even further, once measured, where on that front do we
|
195 |
+
fall? As engineers, we may develop expertise in knowing whether we want
|
196 |
+
high accuracy or low latency, but we are not as comfortable deciding how
|
197 |
+
many current orphans we want to trade for what amount of future health.
|
198 |
+
This raises questions both in terms of measurement and decision-making
|
199 |
+
that are outside of our expertise.
|
200 |
+
|
201 |
+
### Humility
|
202 |
+
|
203 |
+
The appropriate response is **humility** because most engineers do not
|
204 |
+
explicitly train in these skills. Many engineers and managers in tech,
|
205 |
+
in fact, constitutionally prefer optimizing single metrics that are not
|
206 |
+
proxies. Therefore, when encountering a different kind of problem, it's
|
207 |
+
important to bring a humble mindset, ask for help from experts, and
|
208 |
+
recognize that the help you get might not be immediately obvious to what
|
209 |
+
you are used to.
|
210 |
+
|
211 |
+
Additionally, when intervening due to an ethical concern, it's important
|
212 |
+
to remember this humility. It's easy to think that when you are on the
|
213 |
+
good side, this humility is not necessary. But even trying to be helpful
|
214 |
+
is a delicate and dangerous undertaking. We want to make sure that as we
|
215 |
+
resolve ethical concerns, we come up with solutions that are not just
|
216 |
+
parts of the problem.
|
217 |
+
|
218 |
+
### User Orientation Undergirds Each Theme
|
219 |
+
|
220 |
+
We can resolve all of these via **user orientation**.
|
221 |
+
|
222 |
+
1. By getting feedback from users, we maintain **alignment** between
|
223 |
+
our system and our users.
|
224 |
+
|
225 |
+
2. When making **tradeoffs**, we should resolve them in consultation
|
226 |
+
with users.
|
227 |
+
|
228 |
+
3. **Humility** means we actually listen to our users because we
|
229 |
+
recognize we don't have the answers to all the questions.
|
230 |
+
|
231 |
+
## 3 - Tech Ethics
|
232 |
+
|
233 |
+
The tech industry can't afford to ignore ethics as public trust in tech
|
234 |
+
declines. We need to learn from other nearby industries that have done a
|
235 |
+
better job on professional ethics. We'll also touch on some contemporary
|
236 |
+
topics.
|
237 |
+
|
238 |
+
### Tech Industry's Ethical Crisis
|
239 |
+
|
240 |
+
Throughout the past decade, the tech industry has been plagued by
|
241 |
+
scandal - whether that's how tech companies interface with national
|
242 |
+
governments at the largest scale or how tech systems are being used or
|
243 |
+
manipulated by people who create disinformation or fake social media
|
244 |
+
accounts that hack the YouTube recommendation system.
|
245 |
+
|
246 |
+
As a result, distrust in tech companies has risen markedly in the last
|
247 |
+
ten years. [This Public Affairs Pulse
|
248 |
+
survey](https://pac.org/public-affairs-pulse-survey-2021)
|
249 |
+
shows that in 2013, the tech industry was one of the industries with
|
250 |
+
less trustworthiness on average. In 2021, it has rubbed elbows with
|
251 |
+
famously more distrusted industries such as energy and pharmaceuticals.
|
252 |
+
|
253 |
+
![](./media/image10.png)
|
254 |
+
|
255 |
+
Politicians care quite a bit about public opinion polls. In the last few
|
256 |
+
years, the fraction of people who believe that large tech companies
|
257 |
+
should be more regulated has gone up a substantial amount. [Comparing
|
258 |
+
it to 10 years ago, it's astronomically
|
259 |
+
higher](https://news.gallup.com/poll/329666/views-big-tech-worsen-public-wants-regulation.aspx).
|
260 |
+
So there will be a substantial impact on the tech industry due to this
|
261 |
+
loss of public trust.
|
262 |
+
|
263 |
+
We can learn from nearby fields: from the culture of professional ethics
|
264 |
+
in engineering in Canada (by wearing [the Iron
|
265 |
+
Ring](https://en.wikipedia.org/wiki/Iron_Ring)) to ethical
|
266 |
+
standards for human subjects research ([Nuremberg
|
267 |
+
Code](https://en.wikipedia.org/wiki/Nuremberg_Code), [1973
|
268 |
+
National Research
|
269 |
+
Act](https://en.wikipedia.org/wiki/National_Research_Act)).
|
270 |
+
We are at the point where we need a professional code of ethics for
|
271 |
+
software. Hopefully, many codes of ethics developed in different
|
272 |
+
communities can compete with each other and merge into something that
|
273 |
+
most of us can agree on. That can be incorporated into our education for
|
274 |
+
new members of our field.
|
275 |
+
|
276 |
+
Let's talk about two particular ethical concerns that arise in tech in
|
277 |
+
general: carbon emissions and dark/user-hostile design patterns.
|
278 |
+
|
279 |
+
### Tracking Carbon Emissions
|
280 |
+
|
281 |
+
Because carbon emissions scale with cost, you only need to worry about
|
282 |
+
them when the costs of what you are working on are very large. Then you
|
283 |
+
won't be alone in making these decisions and can move a bit more
|
284 |
+
deliberately to make these choices more thoughtfully.
|
285 |
+
|
286 |
+
Anthropogenic climate change from carbon emissions raises ethical
|
287 |
+
concerns - tradeoffs between the present and future generations. The
|
288 |
+
other view is that this is an issue that arises from a classic alignment
|
289 |
+
problem: many organizations are trying to maximize their profit, which
|
290 |
+
is based on prices for goods that don't include externalities (such as
|
291 |
+
environmental damage caused by carbon emissions, leading to increased
|
292 |
+
temperatures and lactic change).
|
293 |
+
|
294 |
+
![](./media/image8.png)
|
295 |
+
|
296 |
+
The primary dimension along which we have to worry about carbon
|
297 |
+
emissions is in **compute jobs that require power**. That power can
|
298 |
+
result in carbon emissions. [This
|
299 |
+
paper](https://aclanthology.org/P19-1355/) walks through
|
300 |
+
how much carbon dioxide was emitted using typical US-based cloud
|
301 |
+
infrastructure.
|
302 |
+
|
303 |
+
- The top headline shows that training a large Transformer model with
|
304 |
+
neural architecture search produces as much carbon dioxide as five
|
305 |
+
cars create during their lifetimes.
|
306 |
+
|
307 |
+
- It's important to remember that power is not free. On US-based cloud
|
308 |
+
infrastructure, \$10 of cloud spent is roughly equal to \$1 of air
|
309 |
+
travel costs. That's on the basis of something like the numbers
|
310 |
+
and the chart indicating air travel across the US from New York to
|
311 |
+
San Francisco.
|
312 |
+
|
313 |
+
- Just changing cloud regions can actually reduce your emissions quite
|
314 |
+
a bit. There's [a factor of
|
315 |
+
50x](https://www.youtube.com/watch?v=ftWlj4FBHTg)
|
316 |
+
from regions with the most to least carbon-intensive power
|
317 |
+
generation.
|
318 |
+
|
319 |
+
The interest in this problem has led to new tools.
|
320 |
+
[Codecarbon.io](https://codecarbon.io/) allows you to
|
321 |
+
track power consumption and reduce carbon emissions from your computing.
|
322 |
+
[ML CO2 Impact](https://mlco2.github.io/impact/) is
|
323 |
+
oriented directly towards machine learning.
|
324 |
+
|
325 |
+
### Deceptive Design and Dark Patterns
|
326 |
+
|
327 |
+
The other ethical concern in tech is **deceptive design**. An
|
328 |
+
unfortunate amount of deception is tolerated in some areas of software.
|
329 |
+
As seen below, on the left is a nearly complete history of the way
|
330 |
+
Google displays ads in its search engine results. It started off very
|
331 |
+
clearly colored and separated out with bright colors from the rest of
|
332 |
+
the results. Then about ten years ago, that colored background was
|
333 |
+
removed and replaced with a tiny little colored snippet that said "Ad."
|
334 |
+
Now, as of 2020, that small bit is no longer even colored. It is just
|
335 |
+
bolded. This makes it difficult for users to know which content is being
|
336 |
+
served to them because somebody paid for it (versus content served up
|
337 |
+
organically).
|
338 |
+
|
339 |
+
![](./media/image15.png)
|
340 |
+
|
341 |
+
A number of **dark patterns** of deceptive design have emerged over the
|
342 |
+
last ten years. You can read about them on the website called
|
343 |
+
[deceptive.design](https://www.deceptive.design/). There's
|
344 |
+
also a Twitter account called
|
345 |
+
[\@darkpatterns](https://twitter.com/darkpatterns) that
|
346 |
+
shares examples found in the wild.
|
347 |
+
|
348 |
+
A practice in the tech industry that's on a very shaky ethical /legal
|
349 |
+
ground is **growth hacking**. This entails a set of techniques for
|
350 |
+
achieving rapid growth in user base or revenue for a product and has all
|
351 |
+
the connotations you might expect from the name - with examples
|
352 |
+
including LinkedIn and Hotmail.
|
353 |
+
|
354 |
+
![](./media/image14.png)
|
355 |
+
|
356 |
+
**ML can actually make this problem worse if we optimize short-term
|
357 |
+
metrics**. These growth hacks and deceptive designs can often drive user
|
358 |
+
and revenue growth in the short term but worsen user experience and draw
|
359 |
+
down on goodwill towards the brand in a way that can erode the long-term
|
360 |
+
value of customers. When we incorporate ML into the design of our
|
361 |
+
products with A/B testing, we have to watch out to make sure that the
|
362 |
+
metrics that we are optimizing do not encourage this kind of deception.
|
363 |
+
|
364 |
+
These arise inside another alignment problem. One broadly-accepted
|
365 |
+
justification for the private ownership of the means of production is
|
366 |
+
that private enterprise delivers broad social value aligned by price
|
367 |
+
signals and market focus. But these private enterprises optimize metrics
|
368 |
+
that are, at best, a proxy for social value. There's the possibility of
|
369 |
+
an alignment problem where **companies pursuing and maximizing their
|
370 |
+
market capitalization can lead to net negative production of value**. If
|
371 |
+
you spend time at the intersection of funding, leadership, and
|
372 |
+
technology, you will encounter it.
|
373 |
+
|
374 |
+
![](./media/image12.png)
|
375 |
+
|
376 |
+
|
377 |
+
In the short term, you can **push for longer-term thinking within your
|
378 |
+
organization** to allow for better alignment between metrics and goals
|
379 |
+
and between goals and utility. You can also learn to recognize
|
380 |
+
user-hostile designs and **advocate for user-centered design instead**.
|
381 |
+
|
382 |
+
To wrap up this section on tech ethics:
|
383 |
+
|
384 |
+
1. The tech industry should learn from other disciplines if it wants to
|
385 |
+
avoid a trust crisis.
|
386 |
+
|
387 |
+
2. We can start by educating ourselves about common deceptive or
|
388 |
+
user-hostile practices in our industry.
|
389 |
+
|
390 |
+
## 4 - ML Ethics
|
391 |
+
|
392 |
+
The ethical concerns raised about ML have gone beyond just the ethical
|
393 |
+
questions about other kinds of technology. We'll talk about common
|
394 |
+
ethical questions in ML and lessons learned from Medical ML.
|
395 |
+
|
396 |
+
### Why Not Just Tech Ethics?
|
397 |
+
|
398 |
+
ML touches human lives more intimately than other technologies. Many ML
|
399 |
+
methods, especially deep neural networks, make human-legible data into
|
400 |
+
computer-legible data. Humans are more sensitive to errors and have more
|
401 |
+
opinions about visual and text data than they do about the type of data
|
402 |
+
manipulated by computers. As a result, there are more stakeholders with
|
403 |
+
more concerns that need to be traded off in ML applications.
|
404 |
+
|
405 |
+
Broadly speaking, ML involves being wrong pretty much all the time. Our
|
406 |
+
models are statistical and include "randomness." Randomness is almost
|
407 |
+
always an admission of ignorance. As we admit a certain degree of
|
408 |
+
ignorance in our models, our models will be wrong and misunderstand
|
409 |
+
situations that they are put into. It can be upsetting and even harmful
|
410 |
+
to be misunderstood by our models.
|
411 |
+
|
412 |
+
Against this backlash of greater interest or higher stakes, a number of
|
413 |
+
common types of ethical concerns have coalesced in the last couple of
|
414 |
+
years. There are somewhat established camps of answers to these
|
415 |
+
questions, so you should at least know where you stand on the four core
|
416 |
+
questions:
|
417 |
+
|
418 |
+
1. Is the model "fair"?
|
419 |
+
|
420 |
+
2. Is the system accountable?
|
421 |
+
|
422 |
+
3. Who owns the data?
|
423 |
+
|
424 |
+
4. Should the system be built at all?
|
425 |
+
|
426 |
+
### Common Ethical Questions in ML
|
427 |
+
|
428 |
+
#### Is The Model "Fair"?
|
429 |
+
|
430 |
+
The classic case on this comes from criminal justice with [the COMPAS
|
431 |
+
system](https://en.wikipedia.org/wiki/COMPAS_(software))
|
432 |
+
for predicting whether a defendant will be arrested again before trial.
|
433 |
+
If they are arrested again, that suggests they committed a crime during
|
434 |
+
that time. This assesses a certain degree of risk for additional harm
|
435 |
+
while the justice system decides what to do about a previous arrest and
|
436 |
+
potential crime.
|
437 |
+
|
438 |
+
The operationalization here was a 10-point re-arrest probability based
|
439 |
+
on past data about this person, and they set a goal from the very
|
440 |
+
beginning to be less biased than human judges. They operationalize that
|
441 |
+
by calibrating these arrest probabilities across subgroups. Racial bias
|
442 |
+
is a primary concern in the US criminal justice system, so they took
|
443 |
+
care to make sure that these probabilities of re-arrest were calibrated
|
444 |
+
for all racial groups.
|
445 |
+
|
446 |
+
![](./media/image2.png)
|
447 |
+
|
448 |
+
|
449 |
+
The system was deployed and used all around the US. It's proprietary and
|
450 |
+
difficult to analyze. But using the Freedom of Information Act and
|
451 |
+
coalescing together a bunch of records, [people at ProPublica were able
|
452 |
+
to run their own analysis of this
|
453 |
+
algorithm](https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing).
|
454 |
+
They determined that the model was not more or less wrong for one racial
|
455 |
+
group or another. It tended to have more false positives for Black
|
456 |
+
defendants and more false negatives for White defendants. So despite the
|
457 |
+
creators of COMPAS taking into account bias from the beginning, they
|
458 |
+
still ended up with an algorithm with this undesirable property of being
|
459 |
+
more likely to falsely accuse Black defendants than White defendants.
|
460 |
+
|
461 |
+
It turned out that some quick algebra revealed that some form of
|
462 |
+
race-based bias is inevitable in this setting, as indicated [in this
|
463 |
+
paper](https://arxiv.org/abs/1610.07524). There are a large
|
464 |
+
number of fairness definitions that are mutually incompatible. [This
|
465 |
+
tutorial by Arvind
|
466 |
+
Narayanan](https://www.youtube.com/watch?v=jIXIuYdnyyk&ab_channel=ArvindNarayanan)
|
467 |
+
is an excellent one to display them.
|
468 |
+
|
469 |
+
It is noteworthy that **the impact of "unfairness" is not fixed**. The
|
470 |
+
story is often presented as "no matter what, the journalists would have
|
471 |
+
found something to complain about." But note that equalizing false
|
472 |
+
positive rates and positive predictive value across groups would lead to
|
473 |
+
a higher false negative rate for Black defendants relative to White
|
474 |
+
defendants. In the context of American politics, that's not going to
|
475 |
+
lead to complaints from the same people.
|
476 |
+
|
477 |
+
![](./media/image6.png)
|
478 |
+
|
479 |
+
|
480 |
+
This is the story about the necessity of confronting the tradeoffs that
|
481 |
+
will inevitably come up. Researchers at Google made [a nice little
|
482 |
+
tool](https://research.google.com/bigpicture/attacking-discrimination-in-ml/)
|
483 |
+
where you can think through and make these tradeoffs for yourself. It's
|
484 |
+
helpful for building intuition on these fairness metrics and what it
|
485 |
+
means to pick one over the other.
|
486 |
+
|
487 |
+
Events in this controversy kicked off a flurry of research on fairness.
|
488 |
+
[The Fairness, Accountability, and Transparency
|
489 |
+
conference](https://facctconference.org/) has been held for
|
490 |
+
several years. There has been a ton of work on both **algorithmic-level
|
491 |
+
approaches** on measuring and incorporating fairness metrics into
|
492 |
+
training and **qualitative work** on designing systems that are more
|
493 |
+
transparent and accountable.
|
494 |
+
|
495 |
+
In the case of COMPAS, **re-arrest is not the same as recidivism**.
|
496 |
+
Being rearrested requires that a police officer believes you committed a
|
497 |
+
crime. Police officers are subject to their own biases and patterns of
|
498 |
+
policing, which result in a far higher fraction of crimes being caught
|
499 |
+
for some groups than for others. Our real goal, in terms of fairness and
|
500 |
+
criminal justice, might be around reducing those kinds of unfair impacts
|
501 |
+
and using past rearrest data that have these issues.
|
502 |
+
|
503 |
+
#### Representation Matters for Model Fairness
|
504 |
+
|
505 |
+
![](./media/image18.png)
|
506 |
+
|
507 |
+
Unfortunately, it is easy to make ML-powered tech that fails for
|
508 |
+
minoritized groups. For example, off-the-shelf computer vision tools
|
509 |
+
often fail on darker sins (as illustrated in [this talk by Joy
|
510 |
+
Buolamwini](https://www.ted.com/talks/joy_buolamwini_how_i_m_fighting_bias_in_algorithms)).
|
511 |
+
This is not a new issue in technology, just a more salient one with ML.
|
512 |
+
|
513 |
+
There has been a good amount of progress on this in the last five years.
|
514 |
+
An example is [Google's Model
|
515 |
+
Cards](https://modelcards.withgoogle.com/about) which show
|
516 |
+
how well a model will perform on human subgroups of interest.
|
517 |
+
HuggingFace has good integrations for creating these kinds of model
|
518 |
+
cards.
|
519 |
+
|
520 |
+
When you invite people for talks or hire people to join your
|
521 |
+
organizations, you should work to reduce the bias of that discovery
|
522 |
+
process by diversifying your network. Some good resources include
|
523 |
+
[Black in AI](https://blackinai.github.io/#/), [Diversify
|
524 |
+
Tech Job Board](https://www.diversifytech.co/job-board/),
|
525 |
+
[Women in Data Science](https://www.widsconference.org/),
|
526 |
+
and the [You Belong in AI
|
527 |
+
podcast](https://anchor.fm/ucla-acm-ai). You can make
|
528 |
+
professional connections via them to improve the representation of
|
529 |
+
minoritized groups in the engineering, design, and product management
|
530 |
+
process.
|
531 |
+
|
532 |
+
#### Is The System Accountable?
|
533 |
+
|
534 |
+
At a broader level than fairness, we should expect "accountability" from
|
535 |
+
ML systems. Some societies and states, including the EU, consider "[the
|
536 |
+
right to an explanation](https://arxiv.org/abs/1606.08813)"
|
537 |
+
in the face of important judgments to be a part of human rights.
|
538 |
+
|
539 |
+
In the GDPR act, there is [a section that enshrines
|
540 |
+
accountability](https://www.consumerfinance.gov/rules-policy/regulations/1002/interp-9/#9-b-1-Interp-1).
|
541 |
+
This isn't quite a totally new requirement; credit denials in the US
|
542 |
+
have been required to be explained since 1974. People have a right to
|
543 |
+
know what and why into making decisions for them!
|
544 |
+
|
545 |
+
If you want to impose this "accountability" on a deep neural network and
|
546 |
+
understand its selections, there are a number of methods that use the
|
547 |
+
input-output gradient to explain the model. You can see a list of
|
548 |
+
several methods in order of increasing performance below (from [this
|
549 |
+
paper](https://arxiv.org/abs/1810.03292)). These approaches
|
550 |
+
don't quite have strong theoretical underpinnings or a holistic
|
551 |
+
explanation, and are not that robust as a result. A lot of these methods
|
552 |
+
act primarily as edge detectors. The paper shows how even randomizing
|
553 |
+
layers in a model does not materially change the interpretability output
|
554 |
+
of GradCAM methods.
|
555 |
+
|
556 |
+
![](./media/image11.png)
|
557 |
+
|
558 |
+
|
559 |
+
As a result, introspecting DNNs effectively requires reverse engineering
|
560 |
+
the system to really understand what is going on, largely thanks to
|
561 |
+
efforts like [Distil](https://distil.pub/) and
|
562 |
+
[Transfomer Circuits](https://transformer-circuits.pub/).
|
563 |
+
|
564 |
+
Due to these technical challenges, machine learning systems are prone to
|
565 |
+
unaccountability that impacts most those least able to understand and
|
566 |
+
influence their outputs. Books such as [Automating
|
567 |
+
Inequality](https://www.amazon.com/Automating-Inequality-High-Tech-Profile-Police/dp/1250074312)
|
568 |
+
describe the impacts of these systems. In such a context, you should
|
569 |
+
seek to question the purpose of model, involve those impacted by the
|
570 |
+
decisions (either through direct human inputs or through other means),
|
571 |
+
and ensure that equal attention is paid to benefits and harms of
|
572 |
+
automation.
|
573 |
+
|
574 |
+
#### Who Owns The Data?
|
575 |
+
|
576 |
+
**Humans justifiably feel ownership of the data they creat**e, which is
|
577 |
+
subsequently used to train machine learning models. Large datasets used
|
578 |
+
to train models like GPT-3 are created by mining this data without the
|
579 |
+
explicit involvement of those who create the data. Many people are not
|
580 |
+
aware that this is both possible and legal. As technology has changed,
|
581 |
+
what can be done with data has changed.
|
582 |
+
|
583 |
+
[You can even verify if your data has been used to train models
|
584 |
+
on](https://haveibeentrained.com/). Some of these images
|
585 |
+
are potentially [obtained
|
586 |
+
illegally](https://arstechnica.com/information-technology/2022/09/artist-finds-private-medical-record-photos-in-popular-ai-training-data-set/),
|
587 |
+
as a result of sensitive data being posted openly without the recorded
|
588 |
+
consent of the originator.
|
589 |
+
|
590 |
+
![](./media/image5.png)
|
591 |
+
|
592 |
+
|
593 |
+
Each of these controversies around image generation and illegal data has
|
594 |
+
opened up a new frontier in **data governance**. Focus will be placed on
|
595 |
+
ensuring new ML systems are sensitive to personal and professional
|
596 |
+
concerns of those who generate the data ML systems are trained on.
|
597 |
+
[Emad Mostaque](https://uk.linkedin.com/in/emostaque), CEO
|
598 |
+
of [Stability AI](https://stability.ai/), has gone so far
|
599 |
+
as to offer future opt out systems from systems similar to Stable
|
600 |
+
Diffusion.
|
601 |
+
|
602 |
+
Here are some practical tips: [Dataset
|
603 |
+
cards](https://huggingface.co/docs/datasets/dataset_card)
|
604 |
+
can be helpful in providing documentation in a similar fashion to model
|
605 |
+
cards. There are also ethics lists, like [the deon ethic
|
606 |
+
checklist](https://deon.drivendata.org/examples/) that
|
607 |
+
helps design proper systems. Deon also has a helpful list of failure
|
608 |
+
cases.
|
609 |
+
|
610 |
+
#### Should This Be Built At All?
|
611 |
+
|
612 |
+
The undercurrent behind this conversation is the justifiable question of
|
613 |
+
whether some of these systems should be built at all, let alone in an
|
614 |
+
ethical way.
|
615 |
+
|
616 |
+
**ML-powered weaponry** is the canonical example here, which is already
|
617 |
+
in use. The definition of these systems are blurry, as both systems old
|
618 |
+
and new have had various autonomous capacities. This is difficult to get
|
619 |
+
a sense of due to the secrecy associated with weapon systems.
|
620 |
+
|
621 |
+
Some have argued that "autonomous weapons" have existed for hundreds of
|
622 |
+
years, but even this does not mean that they are ethical. Mines are good
|
623 |
+
examples of these systems. Movements like t[he Campaign Against Killer
|
624 |
+
Robots](https://www.stopkillerrobots.org/about-us/) are
|
625 |
+
trying to prevent the cycle we entered with mines - where we invented
|
626 |
+
them, when we realized the incredible harm, and why we are trying to ban
|
627 |
+
them. Why invent these at all?
|
628 |
+
|
629 |
+
Let's wrap up this entire section with some closing questions that you
|
630 |
+
should always have a thoughtful answer to as you build a machine
|
631 |
+
learning system.
|
632 |
+
|
633 |
+
1. **Is the model "fair"?** Fairness is possible, but requires
|
634 |
+
trade-offs.
|
635 |
+
|
636 |
+
2. **Is the system accountable?** Accountability is easier than
|
637 |
+
interpretability.
|
638 |
+
|
639 |
+
3. **Who owns the data?** Answer this upfront. Changes are on the way.
|
640 |
+
|
641 |
+
4. **Should the system be built at all?** Repeatedly ask this and use
|
642 |
+
it to narrow scope.
|
643 |
+
|
644 |
+
### What Can We Learn from Medical ML
|
645 |
+
|
646 |
+
*Note: The FSDL team would like to thank [Dr. Amir Ashraf
|
647 |
+
Ganjouei](https://scholar.google.com/citations?user=pwLadpcAAAAJ)
|
648 |
+
for his feedback on this section.*
|
649 |
+
|
650 |
+
Interestingly, medicine can teach us a lot about how to apply machine
|
651 |
+
learning in a responsible way. Fundamentally, this has led to a mismatch
|
652 |
+
between how medicine works and how machine learning systems are built
|
653 |
+
today.
|
654 |
+
|
655 |
+
Let's start with a startling fact: **the machine learning response to
|
656 |
+
COVID-19 was an abject failure**. In contrast, the biomedical response
|
657 |
+
was a major triumph. For example, the vaccines were developed with
|
658 |
+
tremendous speed and precision.
|
659 |
+
|
660 |
+
![](./media/image9.png)
|
661 |
+
|
662 |
+
Machine learning did not acquit itself well with the COVID-19 problem.
|
663 |
+
Two reviews ([Roberts et al.,
|
664 |
+
2021](https://www.nature.com/articles/s42256-021-00307-0)
|
665 |
+
and [Wynants et al.,
|
666 |
+
2020-2022](https://www.bmj.com/content/369/bmj.m1328))
|
667 |
+
found that nearly all machine learning models were insufficiently
|
668 |
+
documented, had little to no external validation, and did not follow
|
669 |
+
model development best practices. A full 25% of the papers used a
|
670 |
+
dataset incorrect for the task, which simply highlighted the difference
|
671 |
+
between children and adults, not pneumonia and COVID.
|
672 |
+
|
673 |
+
Medicine has a strong culture of ethics that professionals are
|
674 |
+
integrated into from the point they start training. Medical
|
675 |
+
professionals take the Hippocratic oath of practicing two things: either
|
676 |
+
help or do not harm the patient. In contrast, the foremost belief
|
677 |
+
associated with software development tends to be the infamous "Move fast
|
678 |
+
and break things." While this approach works for harmless software like
|
679 |
+
web apps, **it has serious implications for medicine and other more
|
680 |
+
critical sectors**. Consider the example of a retinal implant that was
|
681 |
+
simply deprecated by developers and left hundreds without sight [in
|
682 |
+
this Statnews
|
683 |
+
article](https://www.statnews.com/2022/08/10/implant-recipients-shouldnt-be-left-in-the-dark-when-device-company-moves-on/).
|
684 |
+
|
685 |
+
![](./media/image4.png)
|
686 |
+
|
687 |
+
**Researchers are drawing inspiration from medicine to develop similar
|
688 |
+
standards for ML**.
|
689 |
+
|
690 |
+
- For example, clinical trial standards have been extended to ML.
|
691 |
+
These standards were developed through extensive surveys,
|
692 |
+
conferences, and consensus building (detailed in
|
693 |
+
[these](https://www.nature.com/articles/s41591-020-1037-7)
|
694 |
+
[papers](https://www.nature.com/articles/s41591-020-1034-x)).
|
695 |
+
|
696 |
+
- Progress is being made in understanding how this problem presents.
|
697 |
+
[A recent
|
698 |
+
study](https://jamanetwork.com/journals/jamanetworkopen/fullarticle/2796833)
|
699 |
+
found that while clinical activities are generally performed at a
|
700 |
+
high compliance level, statistical and data issues tend to suffer
|
701 |
+
low compliance.
|
702 |
+
|
703 |
+
- New approaches are developing [entire "auditing"
|
704 |
+
procedures](https://www.thelancet.com/journals/landig/article/PIIS2589-7500(22)00003-6/fulltext)
|
705 |
+
that exquisitely identify the activities required to effectively
|
706 |
+
develop models.
|
707 |
+
|
708 |
+
Like medicine, machine learning is intimately intertwined with people's
|
709 |
+
lives. The most important question to ask is "Should this system be
|
710 |
+
built at all?". Always ask yourselves this and understand the
|
711 |
+
implications!
|
712 |
+
|
713 |
+
## 5 - AI Ethics
|
714 |
+
|
715 |
+
AI ethics are a frontier in both the technology and the ethics worlds.
|
716 |
+
False claims and hype are the most pressing concerns, but other risks
|
717 |
+
could present themselves soon.
|
718 |
+
|
719 |
+
### AI Snake Oils
|
720 |
+
|
721 |
+
**False claims outpace the performance of AI**. This poses a serious
|
722 |
+
threat to adoption and satisfaction with AI systems long term.
|
723 |
+
|
724 |
+
- For example, if you call something "AutoPilot", people might truly
|
725 |
+
assume it is fully autonomous, as happened in the below case of a
|
726 |
+
Tesla user. This goes back to our discussion about how AI systems
|
727 |
+
are more like funky dogs than truly human intelligent systems.
|
728 |
+
|
729 |
+
- Another example of this is [IBM's Watson
|
730 |
+
system](https://www.ibm.com/ibm/history/ibm100/us/en/icons/watson/),
|
731 |
+
which went from tackling the future of healthcare to being sold
|
732 |
+
off for parts.
|
733 |
+
|
734 |
+
![](./media/image13.png)
|
735 |
+
|
736 |
+
These false claims tend to be amplified in the media. But this isn't
|
737 |
+
confined to traditional media. Even Geoff Hinton, a godfather of modern
|
738 |
+
machine learning, has been [a little too aggressive in his forecasts
|
739 |
+
for AI
|
740 |
+
performance](https://www.youtube.com/watch?v=2HMPRXstSvQ)!
|
741 |
+
|
742 |
+
You can call this **"AI Snake Oil"** as Arvind Narayanan does in [his
|
743 |
+
Substack](https://aisnakeoil.substack.com/) and
|
744 |
+
[talk](https://www.cs.princeton.edu/~arvindn/talks/MIT-STS-AI-snakeoil.pdf).
|
745 |
+
|
746 |
+
Let's separate out where true progress has been made versus where
|
747 |
+
progress is likely to be overstated. On some level, AI perception has
|
748 |
+
seen tremendous progress, AI judgment has seen moderate progress, and AI
|
749 |
+
prediction of social outcomes has seen not nearly as much progress.
|
750 |
+
|
751 |
+
![](./media/image3.png)
|
752 |
+
|
753 |
+
### Frontiers: AI Rights and X-Risk
|
754 |
+
|
755 |
+
There's obvious rationale that should artificial sentient beings exist,
|
756 |
+
tremendous ethical implications would be raised. Few people believe that
|
757 |
+
we are truly on the precipice of sentient beings, but there is
|
758 |
+
disagreement on how close we are.
|
759 |
+
|
760 |
+
![](./media/image1.png)
|
761 |
+
|
762 |
+
There's a different set of concerns around how to regard self-improving
|
763 |
+
intelligent beings, for which there is already evidence. Large Language
|
764 |
+
Models have been show to be able to improve themselves in a range of
|
765 |
+
studies
|
766 |
+
([here](https://openreview.net/forum?id=92gvk82DE-) and
|
767 |
+
[here](https://arxiv.org/abs/2207.14502v1)).
|
768 |
+
|
769 |
+
Failing to pursue this technology would lead to [a huge opportunity
|
770 |
+
cost](https://nickbostrom.com/astronomical/waste) (as
|
771 |
+
argued by Nick Bostrom)! There truly is a great opportunity in having
|
772 |
+
such systems help us sold major problems and lead better lives. The key
|
773 |
+
though, is that such technology should be developed in the **safest way
|
774 |
+
possible,** not the fastest way.
|
775 |
+
|
776 |
+
[The paperclip
|
777 |
+
problem](https://www.lesswrong.com/tag/paperclip-maximizer)
|
778 |
+
shows how the potential for misalignment between AI systems and humans
|
779 |
+
could dramatically reduce human utility and even compromise our
|
780 |
+
interests. Imagine a system designed to manufacture paperclips... could
|
781 |
+
actually develop the intelligence to alter elements of society to favor
|
782 |
+
paper clips?! This thought experiments illustrates how self-learning
|
783 |
+
systems could truly change our world for the worse in a misaligned way.
|
784 |
+
|
785 |
+
These ideas around existential risk are most associated with [the
|
786 |
+
Effective Altruism community](https://www.eaglobal.org/).
|
787 |
+
Check out resources like [Giving What We
|
788 |
+
Can](https://www.givingwhatwecan.org/donate/organizations)
|
789 |
+
and [80,000 Hours](https://80000hours.org/) if you're
|
790 |
+
interested!
|
791 |
+
|
792 |
+
## 6 - What Is To Be Done?
|
793 |
+
|
794 |
+
This course can't end on a dour a note as existential risk. What can be
|
795 |
+
done to mitigate these consequences and participate in developing truly
|
796 |
+
ethical AI?
|
797 |
+
|
798 |
+
1. The first step is **to educate yourself on the topic**. There are
|
799 |
+
many great books that give lengthy, useful treatment to this
|
800 |
+
topic. We recommend [Automating
|
801 |
+
Inequality](https://www.amazon.com/Automating-Inequality-High-Tech-Profile-Police/dp/1250074312),
|
802 |
+
[Weapons of Math
|
803 |
+
Destruction](https://www.amazon.com/Weapons-Math-Destruction-Increases-Inequality/dp/0553418815),
|
804 |
+
and [The Alignment
|
805 |
+
Problem](https://www.amazon.com/Alignment-Problem-Machine-Learning-Values/dp/0393635821).
|
806 |
+
|
807 |
+
2. After reading this, **consider how to prioritize your actions**.
|
808 |
+
What do you want to impact? When do you want to do that? Place
|
809 |
+
them in this two-by-two to get a sense of where their importance
|
810 |
+
is.
|
811 |
+
|
812 |
+
![](./media/image7.png)
|
813 |
+
|
814 |
+
**Ethics cannot be purely negative**. We do good, and we want to
|
815 |
+
*prevent* bad! Focus on the good you can do and be mindful of the harm
|
816 |
+
you can prevent.
|
817 |
+
|
818 |
+
Leading organizations like
|
819 |
+
[DeepMind](https://www.deepmind.com/about/operating-principles)
|
820 |
+
and [OpenAI](https://openai.com/charter/) are leading from
|
821 |
+
the front. Fundamentally, building ML well aligns with building ML for
|
822 |
+
good. All the leading organizations emphasize effective *and*
|
823 |
+
responsible best practices for building ML powered practices. Keep all
|
824 |
+
this in mind as you make the world a better place with your AI-powered
|
825 |
+
products!
|
documents/lecture-09.srt
ADDED
@@ -0,0 +1,488 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
1
|
2 |
+
00:00:00,539 --> 00:00:45,660
|
3 |
+
hey everyone welcome to the ninth and final lecture of full stack deep learning 2022. today we'll be talking about ethics after going through a little bit of context of what it is that we mean by ethics what I mean by ethics when I talk about it we'll go through three different areas where ethics comes up both Broad tech ethics ethics that anybody who works in the tech industry broadly needs to think about and care about what ethics has meant specifically for the machine learning industry what's happened in the last couple of years as ethical concerns have come to the Forefront and then finally what ethics might mean in a future where true artificial general intelligence exists so first let's do a little bit of
|
4 |
+
|
5 |
+
2
|
6 |
+
00:00:42,899 --> 00:01:33,000
|
7 |
+
context setting even more so than other topics all lectures on ethics are wrong but some of them are useful and they're more useful if we admit and state what our assumptions or biases or approaches are before we dive into the material and then I'll also talk about three kind of General themes that I see coming up again and again when ethical concerns are raised in Tech and in machine learning themes of alignment themes of trade-off and the critical theme of humility so in this lecture I'm going to approach ethics on the basis of concrete cases specific instances where people have raised concerns so we'll talk about cases where people have taken actions that have led to claims and counter claims of ethical or unethical Behavior
|
8 |
+
|
9 |
+
3
|
10 |
+
00:01:29,040 --> 00:02:13,680
|
11 |
+
the use of automated weapons the use of machine learning systems for making decisions like sentencing and bail and the use of machine learning algorithms to generate art in each case one criticism has been raised part of the criticism has been that the technology Awards impact is unethical so approaching ethics in this way allows me to give my favorite answer to the question of what is ethics which is to quote one of my favorite philosophers Ludwig wickenstein and say that the meaning of a word is its use in the language so we'll be focusing on times when people have used the word ethics to describe what they like or dislike about some piece of technology and this approach to definition is an interesting
|
12 |
+
|
13 |
+
4
|
14 |
+
00:02:11,940 --> 00:02:51,959
|
15 |
+
one if you want to try it out for yourself you should check out the game something something soup something which is a browser game at the link in the bottom left of this slide in which you presented with a bunch of dishes and you have to decide whether they are soup or not soup whether they can be served to somebody who ordered soup and by playing a game like this you can discover both how difficult it is to really put your finger on a concrete definition of soup and how poorly maybe your working definition of soup fits with any given soup theory because of this sort of case-based approach we won't be talking about ethical schools and we won't be doing any trolley problems so this article here from current affairs asks
|
16 |
+
|
17 |
+
5
|
18 |
+
00:02:50,400 --> 00:03:34,800
|
19 |
+
you to consider this particular example of a of an ethical dilemma where an asteroid containing all of the universe's top doctors who are working on a cure for all possible illnesses is hurtling towards the planet of Orphans and you can destroy the asteroid and save the orphans but if you do so the hope for a cure for all diseases will be lost forever and the question posed by the authors of this article is is this hypothetical useful at all for Illuminating any moral truths so rather than considering these hypothetical scenarios about trolley cars going down rails and fat men standing on Bridges we'll talk about concrete specific examples from the last 10 years of work in our field and adjacent Fields but
|
20 |
+
|
21 |
+
6
|
22 |
+
00:03:32,580 --> 00:04:11,040
|
23 |
+
this isn't the only way of talking about or thinking about ethics it's the way that I think about it is the way that I prefer to talk about it is not the only one and it might not be the one that works for you so if you want another point of view and one that really emphasizes and loves trolley problems then you should check out sergey's lecture from the last edition of the course from 2021 it's a really delightful talk and presents some similar ideas from a very different perspective coming to some of the same conclusions and some different conclusions a useful theme team from that lecture that I think we should all have in mind when we're pondering ethical dilemmas and the related questions that they bring up is the
|
24 |
+
|
25 |
+
7
|
26 |
+
00:04:09,060 --> 00:04:52,080
|
27 |
+
theme of what is water from last year's lecture so this is a famous little story from a commencement speech by David Foster Wallace where an older fish swing by two younger fish says morning boys how's the water and after he swims away one of the younger fish turns the other and says wait what the hell is water the idea is that if we aren't thoughtful if we aren't paying attention some things that are very important can become background can become assumption and can become invisible and so when I share these slides with Sergey he challenged me to answer this question for myself about how we were approaching ethics this time around and I'll say that this approach of relying on prominent cases risks replicating a lot of social biases
|
28 |
+
|
29 |
+
8
|
30 |
+
00:04:50,520 --> 00:05:33,060
|
31 |
+
some people's ethical claims are Amplified and some fall on unhearing ears some stories travel more because the people involved have more resources and are better connected and using these forms of case-based reasoning where you explain your response or your beliefs in terms of these concrete specifics can end up hiding the principles that are actually in operation maybe you don't even realize that that's how you're making the decision maybe some of the true ethical principles that you're operating under can disappear like water to these fish so don't claim that the approach I'm taking here is perfect but in the end so much of Ethics is deeply personal that we can't expect to have a perfect approach we can just do the best
|
32 |
+
|
33 |
+
9
|
34 |
+
00:05:30,479 --> 00:06:18,120
|
35 |
+
we can and hopefully better every day so we're gonna see three themes repeatedly come up throughout this talk two different forms of conflict that give rise to ethical disputes one when there is conflict between what we want and what we get and another when there is conflict between what we want and what others want and then finally a theme of maybe an appropriate response a response of humility when we don't know what we want or how to get it the problem of alignment where what we want and what we get differ we'll come up over and over again and one of the primary drivers of this is what you might call the proxy problem which is in the end we are often optimizing or maximizing some proxy of the thing that we really care about and
|
36 |
+
|
37 |
+
10
|
38 |
+
00:06:15,720 --> 00:06:52,080
|
39 |
+
if the alignment or Loosely the correlation between that proxy and the thing that we actually care about is poor enough then by trying to maximize that proxy we can end up hurting the thing that we originally cared about there is a nice paper that came out just very recently doing a mathematical analysis of this idea that's actually been around for quite some time excuse you can see these kinds of proxy problems everywhere once you're looking for them on the top right I have a train and validation loss chart from one of the training runs for the full stack deep learning text recognizer the thing that we can actually optimize is the training loss that's what we can use to calculate gradients and improve the
|
40 |
+
|
41 |
+
11
|
42 |
+
00:06:50,639 --> 00:07:35,400
|
43 |
+
parameters of our network but the thing that we really care about is the performance of the network on data points it hasn't seen like the validation set or the test set or data in production if we optimize our training lost too much then we can actually cause our validation loss to go up similarly there was an interesting paper that suggested that increasing your accuracy on classification tasks can actually result in a decrease in the utility of your embeddings in Downstream tasks and you can find these proxy problems outside of machine learning as well there's a famous story involving a Soviet Factory and nails that turned out to be false but in looking up a reference for it I was able to find an actual example where a factory that was
|
44 |
+
|
45 |
+
12
|
46 |
+
00:07:33,300 --> 00:08:18,120
|
47 |
+
making chemical machines rather than creating a machine that was cheaper and better chose not to adopt producing that machine because their output was measured in weight so the thing that that the planners actually cared about economic efficiency and output was not what was being optimized for because it was too difficult to measure and one reason why these kinds of proxy problems arise so frequently is due to issues of information the information that we're able to measure is not the information that we want so the training loss is the information that we have but the information that we want is the validation loss but then at a higher level we often don't even know what it is that we truly need so we may want the
|
48 |
+
|
49 |
+
13
|
50 |
+
00:08:15,360 --> 00:09:03,660
|
51 |
+
validation loss but what we need is the loss in production or really the value our users will derive from this model but even when we do know what it is that we want or what it is that we need we're likely to run into the second kind of problem the problem of a trade-off between stakeholders going back to our hypothetical example with the asteroid of doctors hurtling towards the planet of Orphans what makes this challenging is the need to determine a trade-off between the wants and needs of the people on the asteroid the wants and needs of the orphans on the planet and the wants and needs of future people who cannot be reached for comment and to weigh in on this concern is some sometimes said that this need to
|
52 |
+
|
53 |
+
14
|
54 |
+
00:09:02,040 --> 00:09:40,920
|
55 |
+
negotiate trade offices one of the reasons why Engineers don't like thinking about some of these problems around ethics I don't think that's quite right because we do accept trade-offs as a key component of engineering there's this nice O'Reilly book on the fundamentals of software architecture the first thing that they State at the very beginning is that everything in software architecture is a trade-off and even this satirical oh really book says that every programming question has the answer it depends so we're comfortable negotiating trade-offs take for example this famous chart comparing the different convolutional networks on the basis of their accuracy and the number of operations that it takes to run them
|
56 |
+
|
57 |
+
15
|
58 |
+
00:09:38,580 --> 00:10:22,080
|
59 |
+
thinking about these kinds of trade-offs between speed and correctness is exactly the sort of thing that we have to do all the time in our job as engineers and one part of it that is maybe easier is at least selecting What's called the Pareto front for the metrics that we care about my favorite way of remembering what a Pareto front is is this definition of a data scientist from Josh Wills which is a data scientist who's better at Stats than any software engineer and better at software engineering than any statistician so this Pareto front that I've drawn here is the models that have are more accurate than anybody who takes fewer flops and use fewer flops than anybody who is more accurate so I think rather than fundamentally being about
|
60 |
+
|
61 |
+
16
|
62 |
+
00:10:20,100 --> 00:11:02,640
|
63 |
+
trade-offs one of the reasons why Engineers maybe dislike thinking about these problems is that it's really hard to identify the axes for a chart like the one that I just showed it's very hard to quantify these things and if we do quantify things like the utility or the rights of people involved in a problem we know that those quantifications are far away from what what they truly want to measure there's a proxy problem in fact but even further ones measured where on that front do we fall as Engineers we maybe develop an expertise in knowing whether we want high accuracy or low latency or computational load but we are not as comfortable deciding how many current orphans we want to trade for what amount
|
64 |
+
|
65 |
+
17
|
66 |
+
00:11:01,260 --> 00:11:45,120
|
67 |
+
of future health so this raises questions both in terms of measurement and in terms of decision making that are outside of our expertise so the appropriate response here is humility because we don't explicitly train these skills the way that we do many of the other skills that are critical for our job and many folks engineers and managers in technology seem to kind of deepen their bones prefer optimizing single metrics making a number go up so there's no trade-offs to think about and those metrics are they're not proxies they're the exact same thing that you care about my goal within this company my objective for this quarter my North Star is user growth or lines of code and by God I'll make that go up so when we
|
68 |
+
|
69 |
+
18
|
70 |
+
00:11:43,380 --> 00:12:28,800
|
71 |
+
encounter a different kind of problem it's important to bring a humble mindset a student mindset to the problems to ask for help to look for experts and to recognize that the help that you get and the experts that you find might not be immediately obviously which you want or what you're used to additionally one form of this that we'll see repeatedly is that when attempting to intervene because of an ethical concern it's important to remember this same humility it's easy to think when you are on the good side that this humility is not necessary but even trying to be helpful is a delicate and dangerous undertaking one of my favorite quotes from the systems Bible so we want to make sure as we resolve the ethical concerns that
|
72 |
+
|
73 |
+
19
|
74 |
+
00:12:26,220 --> 00:13:06,000
|
75 |
+
people raise about our technology that we come up with solutions that are not just part of the problem so the way that I resolve all of these is through user orientation by getting feedback from users we maintain alignment between ourselves and the system that we're creating and the users that it's meant to serve and then when it's time to make trade-offs we should resolve them in consultation with users and in my opinion we should tilt the scales in their favor and away from the favor of other stakeholders including within our own organization and then humility is one of the reasons why we actually listen to users at all all because we are humble enough to recognize that we don't have the answers to all of these
|
76 |
+
|
77 |
+
20
|
78 |
+
00:13:03,720 --> 00:13:46,260
|
79 |
+
questions all right with our context and our themes under our belt let's dive into some concrete cases and responses we'll start by considering ethics in the broader world of technology that machine learning fights itself in so the key thing that I want folks to take away from this section is that the tech industry cannot afford to ignore ethics as public trust in Tech declines we need to learn from other nearby industries that have done a better job on professional ethics and then we'll talk about some contemporary topics some that I find particularly interesting and important throughout the past decade the technology industry has been plagued by Scandal whether that's how technology companies interface with national
|
80 |
+
|
81 |
+
21
|
82 |
+
00:13:43,680 --> 00:14:30,839
|
83 |
+
governments at the largest scale over to how technological systems are being used or manipulated by people creating disinformation or fake social media accounts or targeting children with automatically generated content that hacks the YouTube recommendation system and the impact effect of this has been that distrust in tech companies has risen markedly in the last 10 years so this is from the public affairs pulse survey just last year the tech industry went from being in 2013 one of the industries that the fewest people felt was less trustworthy than average to rubbing elbows with famously much distrusted Industries like energy and pharmaceuticals and the tech industry doesn't have to win elections so we
|
84 |
+
|
85 |
+
22
|
86 |
+
00:14:29,220 --> 00:15:16,079
|
87 |
+
don't have to care about public polling as much as politicians but politicians care quite a bit about those public opinion polls and just in the last few years the fraction of people who believe that the large tech companies should be more regulated has gone up a substantial amount and comparing it to 10 years ago it's astronomically higher so there will be substantial impacts on our industry due to this loss of public trust so as machine learning engineers and researchers we can learn from nearby Fields so I'll talk about two of them one a nice little bit about the culture of professional ethics in Engineering in Canada and then a little bit about ethical standards for human subjects research so one of the worst
|
88 |
+
|
89 |
+
23
|
90 |
+
00:15:12,779 --> 00:15:57,899
|
91 |
+
construction disasters in modern history was the collapse of the Quebec bridge in 1907. 75 people who were working on the bridge at the time were killed and a parliamentary inquiry placed the blame pretty much entirely on two engineers in response there was the development of some additional rituals that many Canadian Engineers take part in when they finish their education that are meant to impress upon them the weight of their responsibility so one component of this is a large iron ring which literally impresses that weight upon people and then another is an oath that people take a non-legally binding oath that includes saying that I will not hence forward suffer or pass or be privy to the passing of bad workmanship or
|
92 |
+
|
93 |
+
24
|
94 |
+
00:15:56,100 --> 00:16:37,800
|
95 |
+
faulty material I think the software would look quite a bit different if software Engineers took an oath like this and took it seriously one other piece I wanted to point out is that it includes within it some built-in humility asking pardon ahead of time for the assured failures lots of machine learning is still in the research stage and so some people may say that oh well that's important for the people who are building stuff but I'm working on R D for fundamental technology so I don't have to worry about that but research is also subject to regulation so this is something I was required to learn because I did my PhD in a neuroscience Department that was funded by the National Institutes of Health which
|
96 |
+
|
97 |
+
25
|
98 |
+
00:16:34,500 --> 00:17:18,360
|
99 |
+
mandates training in ethics and in the ethical conduct of research so these regulations for human subjects research date back to the 1940s when there were medical experiments on unwilling human subjects by totalitarian regimes this is still pretty much the Cornerstone for laws on human subjects research around the world through the Helsinki declaration which gets regularly updated in the US the Touchstone bit of regulation on this the 9 1973 research act requires among other things informed consent from people who are participating in research and there were two major revelations in the late 60s and early 70s that led to this legislation not dissimilar to the scandals that have plagued the technology industry recently one was the
|
100 |
+
|
101 |
+
26
|
102 |
+
00:17:16,740 --> 00:18:04,860
|
103 |
+
infliction of hepatitis on mentally disabled children in New York in order to test hepatitis treatments and the other was the non-treatment of syphilis in black men at Tuskegee in order to study the progression of the disease in both cases these subjects did not provide informed consent and seemed to be selected for being unable to advocate for themselves or to get legal redress for the harms they were suffering and so if we are running experiments and those experiments involve humans evolve our users we are expected to adhere to the same principles and one of the famous instances of mismatch between the culture in our industry and the culture of human subjects research was was when some researchers at Facebook studied
|
104 |
+
|
105 |
+
27
|
106 |
+
00:18:02,760 --> 00:18:49,440
|
107 |
+
emotional contagion by altering people's news feeds either adding more negative content or adding more positive content and they found a modest but robust effect that introducing more positive content caused people to post more positively when people found out about this they were very upset the authors noted that Facebook's data use policy includes that the user's data and interactions can be used for this but most people who were Facebook users and the editorial board of pnas where this was published did not see it that way so put together I think we are at the point where we need a professional code of ethics for software hopefully many codes of Ethics developed in different communities that can Bubble Up compete
|
108 |
+
|
109 |
+
28
|
110 |
+
00:18:48,000 --> 00:19:35,700
|
111 |
+
with each other and merge to finally something that most of us or all of us can agree on and that is incorporated into our education and acculturation of new members into our field and into more aspects of how we build to close out this section I wanted to talk about some particular ethical concerns that arise in Tech in general first around carbon emissions and then second around dark patterns and user hostile designs the good news with carbon emissions is that because they scale with cost it's only something that you need to worry about when the costs of what you're building what you're working on are very large at which time you both won't be alone in making these decisions and you can move a bit more deliberately and make these
|
112 |
+
|
113 |
+
29
|
114 |
+
00:19:32,760 --> 00:20:19,440
|
115 |
+
choices more thoughtfully so first what are the ethical concerns with carbon emissions anthropogenic climate change driven by CO2 emissions raises a classic trade-off which was dramatized in this episode of Harvey Birdman Attorney at Law in which George Jetson travels back from the future to sue the present for melting the ice caps and destroying his civilization so unfortunately we don't have future Generations present now to advocate for themselves the other view is that this is an issue that arises from a classic alignment problem which is many organizations are trying to maximize their profit that raw profit is based off of prices for goods that don't include externalities like the environmental damage caused by carbon
|
116 |
+
|
117 |
+
30
|
118 |
+
00:20:16,740 --> 00:21:06,960
|
119 |
+
dioxide emissions leading to increased temperatures and climactic change so the primary Dimension along which we have to worry about carbon emissions is in compute jobs that require power that power has to be generated somehow and that can result in the emission of carbon and so there was a nice paper Linked In This slide that walked through how much carbon dioxide was emitted using typical us-based Cloud infrastructure and the top headline from this paper was that training a large Transformer model with neural architecture search produces as much carbon dioxide as five cars create during their lifetime so that sounds like quite a bit of carbon dioxide and it is in fact but it's important to remember that power is not free and so
|
120 |
+
|
121 |
+
31
|
122 |
+
00:21:05,160 --> 00:21:49,200
|
123 |
+
there is a metric that we're quite used to tracking that is at least correlated with our carbon emissions our compute spend and if you look for the cost runs between one and three million dollars to run the neural architecture search that emitted five cars worth of CO2 and one to three million dollars is actually a bit more than it would cost to buy five cars and provide their fuel so the number that I like to use is that four us-based Cloud infrastructure like the US West one that many of us find ourselves in ten dollars of cloud spend is roughly equal to one dollar worth of air travel costs so that's on the basis of something like the numbers in the chart indicating air travel across the United States from New York to San
|
124 |
+
|
125 |
+
32
|
126 |
+
00:21:47,220 --> 00:22:28,799
|
127 |
+
Francisco I've been taking care to always say us-based cloud infrastructure because just changing Cloud regions can actually reduce your emissions quite a bit there's actually a factor of nearly 50x from some of the some of the cloud regions that have have the most carbon intensive power generation like AP Southeast 2 and the regions that have the the least carbon intensive power like ca Central one that chart comes from a nice talk from hugging face that you can find on YouTube part of their course that talks a little bit more about that paper and about managing carbon emissions interest in this problem has led to some nice new tooling one code carbon dot IO allows you to track power consumption and therefore
|
128 |
+
|
129 |
+
33
|
130 |
+
00:22:26,820 --> 00:23:12,419
|
131 |
+
CO2 emissions just like you would any of your other metrics and then there's also this mlco2 impact tool that's oriented a little bit more directly towards machine learning the other ethical concern in Tech that I wanted to bring up is deceptive design and how to recognize it an unfortunate amount of deception is tolerated in some areas of software the example on the left comes from an article by Narayanan at all that shows a fake countdown timer that claims that an offer will only be available for an hour but when it hits zero nothing the offer is still there there's also a possibly apocryphal example on the right here you may have seen these numbers next to products when online shopping saying that some number of people are currently
|
132 |
+
|
133 |
+
34
|
134 |
+
00:23:10,799 --> 00:23:57,059
|
135 |
+
looking at this product this little snippet of JavaScript here produces a random number to put in that spot so that example on the right may not be real but because of real examples like the one on the left it strikes a chord with a lot of developers and Engineers there's a kind of slippery slope here that goes from being unclear or maybe not maximally upfront about something that is a source of friction or a negative user experience in your product and then in trying to remove that friction or sand that edge down you slowly find yourself being effectively deceptive to your users on the left is a nearly complete history of the way Google displays ads in its search engine results it started off very clearly
|
136 |
+
|
137 |
+
35
|
138 |
+
00:23:54,960 --> 00:24:38,039
|
139 |
+
colored and separated out with a bright color from the rest of the results and then a about 10 years ago that colored background was removed and replaced with just a tiny little colored snippet that said add and now as of 2020 that small bit there is no longer even colored it's just bolded and so this makes it difficult for users to know which content is being served to them because somebody paid for them to see it versus being served up organically so a number of patterns of deceptive design also known as dark patterns have emerged over the last 10 years you can read about them on this website deceptive.design there's also a Twitter account at dark patterns where you can share examples that you find in the wild so some
|
140 |
+
|
141 |
+
36
|
142 |
+
00:24:36,360 --> 00:25:19,980
|
143 |
+
examples that you might be familiar with are the roach motel named after a kind of insect trap where you can get into a situation very easily but then it's very hard to get out of it if you've ever attempted to cancel a gym membership or delete your Amazon account then you may have found yourself a roach in a motel another example is trick questions where forms intentionally make it difficult to choose the option that most use users want for example using negation in a non-standard way like check this box to not receive emails from our service one practice in our industry that's on very shaky ethical and legal ground is growth hacking which is a set of techniques for achieving really rapid growth in user
|
144 |
+
|
145 |
+
37
|
146 |
+
00:25:17,820 --> 00:26:00,659
|
147 |
+
base or revenue for a product and has all the connotations you might expect from the name hack LinkedIn was famously very spammy when it first got started I'd like to add you to my Professional Network on LinkedIn became something of a meme and this was in part because LinkedIn made it very easy to unintentionally send LinkedIn invitations to every person you'd ever emailed they ended up actually having to pay out in a class action lawsuit because they were sending multiple follow-up emails when user only clicked to send an invitation once and the structure of their emails made it seem like they were being sent by the user rather than LinkedIn and the use of these growth hacks goes back to the very Inception of email Hotmail Market itself
|
148 |
+
|
149 |
+
38
|
150 |
+
00:25:58,200 --> 00:26:43,860
|
151 |
+
in part by attacking on a signature to the bottom of every email that said PS I love you get your free email at Hotmail so this seemed like it was being sent by the actual user I grabbed a snippet from a top 10 growth hacks article that said that the personal sounding nature of the message and the fact that it came from a friend made this a very effective growth hack but it's fundamentally deceptive to add this to messages in such a way that it seems personal and to not tell users that this change is being made to the emails that they're sending so machine learning can actually make this problem worse if we are optimizing short-term metrics these growth acts and deceptive designs can often Drive user and revenue
|
152 |
+
|
153 |
+
39
|
154 |
+
00:26:41,460 --> 00:27:25,679
|
155 |
+
growth in the short term but they do that by worsening user experience and drawing down on Goodwill towards the brand in a way that can erode the long-term value of customers when we incorporate machine learning into the design of our products with a B testing we have to watch out to make sure that the the metrics that we're optimizing don't encourage this kind of deception so consider these two examples on the right the top example is a very straightforwardly implemented and direct and easy to understand form for users to indicate whether they want to receive emails from the company and from its Affiliates in example B the wording of the first message has been changed so that it indicates that the first hitbox
|
156 |
+
|
157 |
+
40
|
158 |
+
00:27:23,520 --> 00:28:05,760
|
159 |
+
should be checked to not receive emails while the second one should not be ticked in order to not receive emails and if you're a b testing these two designs against each other and your metric is the number of people who sign up to receive emails then it's highly likely that the system is going to select example B so taking care and setting up a b tests such that either they're tracking longer term metrics or things that correlate with them and that the variant generation system that generates all the different possible designs can't generate any designs that we would be unhappy with as we would hopefully be unhappy with the deceptive design in example B and I think it's also important to call out that this
|
160 |
+
|
161 |
+
41
|
162 |
+
00:28:03,840 --> 00:28:46,679
|
163 |
+
problem arises inside of another alignment problem we were considering the case where the long-term value of customers and the company's interests were being harmed by these deceptive designs but unfortunately that's not always going to be the case the private Enterprises that build most technology these days are able to deliver Broad Social value to make the world a better place as they say but the way that they do that is generally by optimizing metrics that are at best a very weak proxy for that value that they're delivering like their market capitalization and so there's the possibility of an alignment problem where companies pursuing and maximizing their own profit and success can lead to net negative production of value and
|
164 |
+
|
165 |
+
42
|
166 |
+
00:28:44,880 --> 00:29:23,700
|
167 |
+
this misalignment is something that if you spend time at the intersection of capital and funding leadership and Technology development you will encounter it so it's important to consider these questions ahead of time and come to your own position whether that's trade reading this as the price of doing business or the way the world Works seeking ways to improve this alignment or considering different ways to build technology but on the shorter term you can push for longer term thinking within your organization to allow for better alignment between the metrics that you're measuring and the goals that you're setting and between the goals that you're setting and what is overall good for our industry and for
|
168 |
+
|
169 |
+
43
|
170 |
+
00:29:21,960 --> 00:30:06,480
|
171 |
+
the broader world and you can also learn to recognize these user hostile design patterns call them out when you see them and you can advocate for a More user-centered Design instead so to wrap up our section on ethics for Building Technology broadly we as an industry should learn from other disciplines if we want to avoid a trust crisis or if we want to avoid the crisis getting any worse and we can start by educating ourselves about the common user hostile practices in our industry and how to avoid them now that we've covered the kinds of ethical concerns and conflicts that come up when Building Technology in general let's talk about concerns that are specific to machine learning just in the past couple of years there have been
|
172 |
+
|
173 |
+
44
|
174 |
+
00:30:04,200 --> 00:30:42,480
|
175 |
+
more and more ethical concerns raised about the uses of machine learning and this has gone beyond just the ethical questions that can get raised about other kinds of technology so we'll talk about some of the common ethical questions that have been raised repeatedly over the last couple of years and then we'll close out by talking about what we can learn from a particular sub-discipline of machine learning medical machine learning so the fundamental reason I think that ethics is different for machine learning and maybe more Salient is that machine learning touches human lives more intimately than a lot of other kinds of technology so many machine learning methods especially deep learning methods make human legible data into computer
|
176 |
+
|
177 |
+
45
|
178 |
+
00:30:40,320 --> 00:31:22,860
|
179 |
+
legible data so we're working on things like computer vision on processing natural language and humans are more sensitive to errors in and have more opinions about this kind of data about images like this puppy than they do about the other kinds of data manipulated by computers like abstract syntax trees so because of of this there are more stakeholders with more concerns that need to be traded off in machine learning applications and then more broadly machine learning involves being wrong pretty much all the time there's the famous statement that all models are wrong though some are useful and I think the first part applies at least particularly strongly to machine learning our models are statistical and
|
180 |
+
|
181 |
+
46
|
182 |
+
00:31:20,760 --> 00:32:01,320
|
183 |
+
include in them Randomness the way that we frame our problems the way that we frame our optimization in terms of cross entropies or divergences and Randomness is almost always an admission of ignorance even the quintessential examples of Randomness like random number generation in computers and the flipping of a coin are things that we know in fact are not random truly they are in fact predictable and if we knew the right things and had the right laws of physics and the right computational power then we could predict how a coin would land we could control it we could predict what the next number to come out of a random number generator would be whether it's pseudorandom or based on some kind of Hardware Randomness and so
|
184 |
+
|
185 |
+
47
|
186 |
+
00:31:59,760 --> 00:32:41,520
|
187 |
+
we're admitting a certain degree of ignorance in our models and that means our models are going to be wrong and they're going to misunderstand situations that they are put into and it can be very upsetting and even harmful to be misunderstood by a machine learning model so against this backdrop of Greater interest or higher Stakes a number of common types of ethical concern have coalesced in the last couple of years and there are somewhat established camps of answers to these questions and you should at least know where it is you stand on these core questions so for four really important questions that you should be able to answer about about anything that you build with machine learning are is the model fair and what does that mean in
|
188 |
+
|
189 |
+
48
|
190 |
+
00:32:39,840 --> 00:33:20,039
|
191 |
+
this situation is the system that you're building accountable who owns the data involved in this system and finally and perhaps most importantly an undergirding all of these questions is should this system be built at all so first is the model we're building Fair the classic case on this comes from Criminal Justice from the compass system for predicting before trial whether a defendant will be arrested again so if they're arrested again that's just they committed a crime during that time and so this is assessing a certain degree of risk for additional harm while the justice system is deciding what to do about a previous arrest and potential crime so the operationalization here was a 10-point rearrest probability based off of past
|
192 |
+
|
193 |
+
49
|
194 |
+
00:33:17,640 --> 00:34:07,799
|
195 |
+
data about this person and they set a goal from the very beginning to be less biased than human judges so they operationalize that by calibrating these arrest probabilities and making sure that if say a person received a 2 2 on this scale they had a 20 chance of being arrested again and then critically that those probabilities were calibrated across subgroups so racial bias is one of the primary concerns around bias in criminal justice in the United States and so they took care to make sure that these probabilities of rearrest were calibrated for all racial groups the system was deployed in it is actually used all around the United States it's proprietary so it's difficult to analyze but using the Freedom of Information Act
|
196 |
+
|
197 |
+
50
|
198 |
+
00:34:05,519 --> 00:34:53,820
|
199 |
+
and by colliding together a bunch of Records some people at propublica were able to run their own analysis of this algorithm and they determined that though this calibration that Compass claimed for arrest probabilities was there so the model was not more or less wrong for one racial group or another the way that the model tended to fail was different across racial groups so the model had more false positives for black defendants so saying that somebody was higher risk but then them not going on to reoffend and had more false negatives for white defendants so labeling them as low risk and then them going on to reoffend so despite North Point the creators of compass taking into account bias from the beginning
|
200 |
+
|
201 |
+
51
|
202 |
+
00:34:51,599 --> 00:35:31,260
|
203 |
+
they ended up with an algorithm with this undesirable property of being more likely to effectively falsely accuse defendants who were black than defendants who were white this report touched off a ton of controversy and back and forth between propublica the creator of the article and North Point Craters of compass and also a bunch of research and it turned out that some quick algebra revealed that some form of race-based bias is inevitable in this setting so the things that we care about when we're building a binary classifier are relatively simple you can write down all of these metrics directly so we care about things like the false positive rate which means we've imprisoned somebody with no need the false negative
|
204 |
+
|
205 |
+
52
|
206 |
+
00:35:29,760 --> 00:36:16,380
|
207 |
+
rate which means we missed an opportunity to event a situation that led to an arrest and then we also care about the positive predictive value which is this rearrest probability that Compass was calibrated on so because all of these metrics are related to each other and related to The Joint probability distribution of our model's labels and the actual ground truth if the probability of rearrest differs across groups then we have to have that some of these numbers are different across groups and that is a form of racial bias so the basic way that this argument works just involves rearranging these numbers and saying that if the numbers on the left side of this equation are different for Group 1 and group two then it can't possibly be the
|
208 |
+
|
209 |
+
53
|
210 |
+
00:36:14,280 --> 00:36:54,900
|
211 |
+
case that all three of the numbers on the right hand side are the same for Group 1 and group two and I'm presenting this here as though it only impacts these specific binary classification metrics but there are are in fact a very large number of definitions of fairness which are mutually incompatible so there's a nice a really incredible tutorial by Arvin Narayanan who was also the first author on the dark patterns work on a bunch of these fairness definitions what they mean and why they're in commensurate so I can highly recommend that lecture so returning to our concrete case if the prevalence is differ across groups then one of our things that we're concerned with the false positive rate the false negative
|
212 |
+
|
213 |
+
54
|
214 |
+
00:36:53,160 --> 00:37:35,339
|
215 |
+
rate or the positive predictive value will not be equal and that's something that people can point to and say that's unfair in the middle that positive predictive value was equalized across groups in compass that was what they really wanted to make sure was equal cross groups and because the probability of rearrest was larger for black defendants then either the false positive rate had to be bigger or the false negative rate had to be bigger for that group and there's an analysis in this cholachova 2017 paper that suggests that the usual way that this will work is that there will be a higher false positive rate for the group with a larger prevalence so the fact that there will be some form of unfairness that we
|
216 |
+
|
217 |
+
55
|
218 |
+
00:37:33,420 --> 00:38:14,280
|
219 |
+
can't just say oh well all these metrics are the same across all groups and so everything has to be fair that fact is fixed but the impact of the unfairness of models is not fixed the story is often presented as oh well no matter what the journalists would have found something to complain about there's always critics and so you know you don't need to worry about fairness that much but I think it's important to note that the particular kind of unfairness that came about from this model from focusing on this positive predictive value led to a higher false positive rate more unnecessary imprisonment for black defendants the false positive rate and the positive predictive value were equalized across groups that would have
|
220 |
+
|
221 |
+
56
|
222 |
+
00:38:12,420 --> 00:38:54,119
|
223 |
+
led to a higher false negative rate for black defendants relative to White defendants and in the context of American politics and concerns about racial inequity in the criminal justice system bias against white defendants is not going to lead to complaints from the same people and has a different relationship to the historical operation of the American justice system and so far from this being a story about the hopelessness of thinking about or caring about fairness this is a story about the necessity of confronting the trade-offs that are inevitably going to come up so some researchers that Google made a nice little tool where you can try thinking through and making these trade-offs for yourself it's a loan decision rather
|
224 |
+
|
225 |
+
57
|
226 |
+
00:38:51,900 --> 00:39:37,740
|
227 |
+
than a criminal justice decision but it has a lot of the same properties you have a binary classifier you have different possible goals that you might set either maximizing the profit of the loaning entity or providing equal opportunity to the two groups and it's very helpful for building intuition on these fairness metrics and what it means to pay pick one over the other and these events in this controversy kicked off a real flurry of research on fairness and there's now been several years of this fairness accountability and transparity Conference fact there's tons of work on both algorithmic level approaches to try and measure these fairness metrics incorporate them into training and also more qualitative work on designing
|
228 |
+
|
229 |
+
58
|
230 |
+
00:39:36,180 --> 00:40:20,940
|
231 |
+
systems that are more transparent and accountable so the compass example is really important for dramatizing these issues of fairness but I think it's very critical for this case and for many others to step back and ask whether this model should be built at all so this algorithm for scoring risk is proprietary and uninterpretable it doesn't give answers for why a person is higher risk or not and because it is closed Source there's no way to examine it it achieves an accuracy of about 65 which is quite High given that the marginal probability of reoffence is much lower than 50 but it's important to compare the baselines here pulling together a bunch of non-experts like you would on a jury has an accuracy of about
|
232 |
+
|
233 |
+
59
|
234 |
+
00:40:17,640 --> 00:41:01,140
|
235 |
+
65 percent and creating a simple scoring system on the basis of how old the person is and how many prior arrests they have also has an accuracy of around 65 and it's much easier to feel comfortable with the system that says if you've been arrested twice then you have a higher risk of being arrested again and so you'll be imprisoned before trial then a system that just says oh well we ran the numbers and it looks like you have a high chance of committing a crime but even framing this problem in terms of who is likely to be rearrested is already potentially a mistake so a slightly different example of predicting failure to appear in court was tweeted out by Moritz heart who's one of the main researchers in this area choosing
|
236 |
+
|
237 |
+
60
|
238 |
+
00:40:59,520 --> 00:41:37,320
|
239 |
+
to try to predict who will fail to appear in court treating this as something that is then a fact of the universe that this person is likely to fail to appear in court and then intervening on this and punishing them for that for that fact it's important to recognize why people fail to appear in court in general often it's because they don't have child care to cover for the care of their dependence while they're in court they don't have transportation their work schedule is inflexible or the core deployment schedule is inflexible or unreasonable it'd be better to implement steps to mitigate these issues and reduce the number of people who are likely to fail to appear in court for example by making it possible to join
|
240 |
+
|
241 |
+
61
|
242 |
+
00:41:35,579 --> 00:42:20,940
|
243 |
+
Court remotely that's a far better approach for all involved than simply getting really really good at predicting Who currently fails to appear in court so it's important to remember that the things that we're measuring the things that we're predicting are not the be-all end-all in themselves the things that we care about are things like an effective and fair justice system and this comes up perhaps most acutely in the case of compass when we recognize that rearrest is not the same as recidivism it's not the same thing as committing more crimes being rearrested requires that a police officer believes that you committed a crime police officers are subject effect to their own biases and patterns of policing result in a far higher fraction
|
244 |
+
|
245 |
+
62
|
246 |
+
00:42:18,480 --> 00:43:04,440
|
247 |
+
of crimes being caught for some groups than for others and so our real goal in terms of fairness and criminal justice might be around reducing those kinds of unfair impacts and using past rearrest data that we know has these issues to determine who is treated more harshly by the criminal justice system is likely to exacerbate these issues there's also a notion of model fairness that is broader than just models that make decisions about human beings so even if you're deciding a model that works on text or works on images you should consider which kinds of people your model works well for and in general representation both on engineering and management teams and in data sets really matters for this kind of model fairness so it's
|
248 |
+
|
249 |
+
63
|
250 |
+
00:43:02,640 --> 00:43:46,680
|
251 |
+
unfortunately still very easy to make machine learning powered technology that fails for minoritized groups so for example off-the-shelf computer vision tools will often fail on darker skin so this is an example by Joy bull and weenie from MIT on how a computer vision based project that she was working on ran into difficulties because the face detection algorithm could not detect her face even though it could detect the faces of some of her friends with lighter skin and in fact she found that just putting on a white mask was enough to get the computer vision model to detect her face so this is unfortunately not a new issue in technology it's just a more Salient one with machine learning so one example is that hand soap
|
252 |
+
|
253 |
+
64
|
254 |
+
00:43:43,920 --> 00:44:32,640
|
255 |
+
dispensers that use infrared to determine when to dispense soap will often work better for lighter skin than darker skin and issues around lighting and vision and skin tone go back to the foundation of Photography let alone computer vision the design of film of cameras and printing processes was oriented around primarily making lighter skin photograph well as in these so-called Shirley cards that were used by code DAC for calibration these resulted in much worse experiences for people with darker skin using these cameras there has been a good amount of work on this and progress since four or five years ago one example of the kind of tool that can help with this are these model cards this particular format
|
256 |
+
|
257 |
+
65
|
258 |
+
00:44:30,660 --> 00:45:14,760
|
259 |
+
for talking about what a model can and cannot do that was published by a number of researchers including Margaret Mitchell and Timmy Gabriel it includes explicitly considering things like on which human subgroups of Interest many of them minoritized identities how well does the model perform hugging face has good Integrations for creating these kinds of model cards I think it's important to note that just solving these things by changing the data around or by calculating demographic information is not really an adequate response if the CEO of Kodak or their partner had been photographed poorly by those cameras then there's no chance that that issue would have been allowed to stay for decades so when you're
|
260 |
+
|
261 |
+
66
|
262 |
+
00:45:12,359 --> 00:45:57,839
|
263 |
+
looking at inviting people for talks hiring people or joining organizations you should try to make sure that you have worked to reduce the bias of that Discovery process by diversifying your network and your input sources the diversify Tech job board is a really wonderful source for candidates and then there are also professional organizations inside of the ml World black and Ai and women in data science being two of the larger and more successful ones these are great places to get started to make the kinds of professional connections that can improve the representations of these minoritized groups in the engineering and design and product management process where these kinds of issues should be solved a lot of progress has
|
264 |
+
|
265 |
+
67
|
266 |
+
00:45:55,680 --> 00:46:44,040
|
267 |
+
been made but these problems are still pretty difficult to solve an unbiased face detector might not be so challenging but unbiased image generation is still really difficult for example if you make an image generation model from internet scraped data without any safeguards in place then if you ask it to generate a picture of a CEO it will generate the stereotypical CEO a six foot or taller white man and this applies across a wide set of jobs and situations people can find themselves in and this led to a lot of criticism of early text damage generation models like Dolly and the solution that openai opted to this was to edit prompts that people put in if you did not fully specify what kind of person should be generated then
|
268 |
+
|
269 |
+
68
|
270 |
+
00:46:41,579 --> 00:47:21,300
|
271 |
+
race and gender words would be added to the prompt with weights based on the world's population so people discovered this somewhat embarrassingly by writing prompts like a person holding a sign that says or pixel art of a person holding a text sign that says and then seeing that the appended words were then printed out by the model suffice it to say that this change did not make very many people very happy and indicates that more work needs to be done to de-bias image generation models at a broader level than just fairness we can also ask whether the system we're building is accountable to the people it's serving or acting upon and this is important because some people can consider explanation and accountability
|
272 |
+
|
273 |
+
69
|
274 |
+
00:47:19,560 --> 00:48:00,060
|
275 |
+
in the face of important judgments to be human rights this is the right to an explanation in the European Union's general data protection regulation gdpr there is a subsection that mentions the right to obtain an explanation of a decision reached after automated assessment and the right to challenge that decision the legal status here is a little bit unclear there's a nice archive paper that talks about this a bit about what the right to an explanation might mean but what's more important for our purposes is just to know that there is an increasing chorus of people claiming that this is indeed a human right and it's not an entirely New Concept and it's not even really technology or automation specific as far
|
276 |
+
|
277 |
+
70
|
278 |
+
00:47:57,420 --> 00:48:40,800
|
279 |
+
back as 1974 has been the law in the United States that If you deny credit to a person you must disclose the principal reasons for denying that credit application and in fact I found this interesting it's expected that you provide no more than four reasons why you denied them credit but the general idea that somebody as a right to know why something happened to them in certain cases is enshrined in some laws so what are we supposed to do if we use a deep neural network to decide whether somebody should be Advanced Credit or not so there are some off-the-shelf methods for introspecting deep neural networks that are all based off of input output gradients how would changing the pixels of this input image change the
|
280 |
+
|
281 |
+
71
|
282 |
+
00:48:38,579 --> 00:49:18,359
|
283 |
+
class probabilities and the output so this captures a kind of local contribution but as you can see from the small image there it doesn't produce a very compelling map and there's no reason to think that just changing one pixel a tiny bit should really change the model's output that much one Improvement to that called Smooth grad is to add noise to the input and then average results kind of getting a sense for what the gradients look like in a general area around the input there isn't great theory on why that should give better explanations but people tend to find these explanations better and you can see in the smooth grad image on the left there that you can pick out the picture of a bird it seems like that is
|
284 |
+
|
285 |
+
72
|
286 |
+
00:49:16,440 --> 00:50:04,560
|
287 |
+
giving a better explanation or an explanation that we like better for why this network is identifying that as a picture of a bird there's a bunch of kind of hacking methods like specific tricks you need when you're when you're using the relu activation there's some methods that are better for classification like grad cam one that is more popular integrated gradients takes the integral of the gradient along a path from some baseline to the final image and this method has a nice interpretation in terms of Cooperative Game Theory something called a shapley value that quantifies how much a particular collection of players in a game contributed to the final reward and adding noise to integrated gradients tends to produce really clean
|
288 |
+
|
289 |
+
73
|
290 |
+
00:50:02,460 --> 00:50:45,480
|
291 |
+
explanations that people like but unfortunately these methods are generally not very robust their outputs tend to correlate pretty strongly in the case of images with just an edge detector there's built-in biases to convolutional networks and the architectures that we use that 10 and to emphasize certain features of images what this particular chart shows from this archive paper by Julius adebayo Moritz heart and others is that even as we randomize layers in the network going from left to right we are randomizing starting at the top of the network and then randomizing more layers going down even for popular methods like integrated gradients with smoothing or guided back propagation we can effectively randomize
|
292 |
+
|
293 |
+
74
|
294 |
+
00:50:43,800 --> 00:51:27,140
|
295 |
+
a really large fraction of the network without changing the gross features of the explanation and resulting in an explanation that people would still accept and believe even though this network is now producing random output so in general introspecting deep neural networks and figuring out what's going inside them requires something that looks a lot more like a reverse engineering process that's still very much a research problem there's some great work on distill on reverse engineering primarily Vision networks and then some great work from anthropic AI recently on Transformer circuits that's reverse engineering large Lang language models and Chris Ola is the researcher who's done the most work here but it still is the sort of thing that
|
296 |
+
|
297 |
+
75
|
298 |
+
00:51:24,300 --> 00:52:06,839
|
299 |
+
even getting a loose qualitative sense for how neural networks work and what they are doing in response to inputs is still the type of thing that takes a research team several years so Building A system that can explain why it took a particular decision is maybe not currently possible with deep neural networks but that doesn't mean that the systems that we build with them have to be unaccountable if somebody dislikes the decision that they get and the explanation that we give is well the neural network said you shouldn't get a loan and they challenge that it might be time to bring in a human in the loop to make that decision and building that in to the system so that it's an expected mode of operation and is considered an
|
300 |
+
|
301 |
+
76
|
302 |
+
00:52:03,900 --> 00:52:50,099
|
303 |
+
important part of the feedback and the operation of the system is key to building an accountable system so this book automating inequality by Virginia Eubanks talks a little bit about the ways in which Technical Systems as their build today are very prone to this unaccountability where the people who are Indian most impacted by these systems some of the most critical stakeholders for these systems for example recipients of government assistance are unable to have their voices and their needs heard and taken into account in the operation of a system so this is perhaps the point at which you should ask when building a system with machine learning whether this should be built at all and particular to ask who benefits and who
|
304 |
+
|
305 |
+
77
|
306 |
+
00:52:47,579 --> 00:53:28,500
|
307 |
+
is harmed by automating this task in addition to concerns around the behavior of models increasing concern has been pointed towards data and in particular who owns and who has rights to the data involved in the creation of machine Learning Systems it's important to remember that the training data that we use for our machine learning algorithms is almost always generated by humans and they generally feel some ownership over that data and we end up behaving a little bit like this comic on the right where they hand us some data that they made and then we say oh this is ours now I made this and in particular the large data sets you train the really large models that are pushing the frontiers of what is possible with machine learning
|
308 |
+
|
309 |
+
78
|
310 |
+
00:53:26,220 --> 00:54:09,359
|
311 |
+
are produced by crawling the Internet by searching over all the images all the text posted on the internet and pulling large fractions of it down and many people are not aware that this is possible let alone legal and so to some extent any consent that they gave to their data being used was not informed and then additionally as technology has changed in the last decade and machine learning has gotten better what can be done with data has changed somebody uploading their art a decade ago certainly did not have on their radar the idea that they were giving consent to that art being used to create an algorithm that can mimic its style and you can in fact check whether an image of interest to you has been used to
|
312 |
+
|
313 |
+
79
|
314 |
+
00:54:06,240 --> 00:54:53,520
|
315 |
+
train one of the large text image models specifically this have I been trained.com website will search through the Leon data set that is used to train the stable diffusion model for images that you upload so you can look to see if any pictures of you were incorporated into the data set and this goes further than just pictures that people might rather not have used in this way to actual data that has somehow been obtained illegally there's an Arts technical article a particular artist who was interested in this found that some of their medical photos which they did not consent to have uploaded to the internet somehow found their way into the lay on data set and so cleaning large web scraped data sets from this
|
316 |
+
|
317 |
+
80
|
318 |
+
00:54:51,540 --> 00:55:37,800
|
319 |
+
kind of illegally obtained data is definitely going to be important as more attention is paid to these models as they are product eyes and monetized and more on people's radar even for data that is obtained legally saying well technically you did agree to this does not generally satisfy people remember the Facebook emotion research study technically some reading of the Facebook user data policy did support the way that they were running their experiment but many users disagreed many artists feel that creating an art generation tool that threatens their livelihoods and copies art down to the point of even faking watermarks and logos on images when told to recreate the style of an artist is an ethical use of that data
|
320 |
+
|
321 |
+
81
|
322 |
+
00:55:35,700 --> 00:56:23,400
|
323 |
+
and it certainly is the case that creating a sort of parrot that can mimic somebody is something that a lot of people find concerning dealing with these issues around data governance is likely to be a new frontier imagine of stable diffusion has said that he's partnering with people to create mechanisms for artists to opt in or opt out of being included in training data sets for future versions of stable diffusion I found that noteworthy because mostacc has been very vocal in his defense of image generation technology and of what it can be used for but even he is interested in adjusting the way data is used there's also been work from Tech forward artists like Holly Hunter who was involved in the creation of have I been trained
|
324 |
+
|
325 |
+
82
|
326 |
+
00:56:20,520 --> 00:57:05,460
|
327 |
+
around trying to incorporate AI systems into art in a way that empowers artists and compensates them rather than immiserating them just as we can create cards for models we can also create cards for data sets that describe how they were curated what the sources were and any other potential issues with the data and perhaps in the future even how to opt out of or be removed from a data set so this is an example from a hugging face as with model cards there's lots of good examples of data set cards on hugging face there's also a nice checklist the Dion ethics checklist that is mostly focused around data ethics but covers a lot of other ground they also have this nice list of examples for each question in their checklist of cases
|
328 |
+
|
329 |
+
83
|
330 |
+
00:57:02,760 --> 00:57:47,339
|
331 |
+
where people have run into ethical or legal trouble by building an ml project that didn't satisfy a particular checklist item running underneath all of this has been this final most important question of whether this system should be built at all one particular use case that very frequently elicits this question is building ml-powered Weaponry ml powered Weaponry is already here it's already starting to be deployed in the world there are some remote controlled weapons that use computer vision for targeting deployed by the Israeli military in the West Bank using this smart shooter technology that's designed to in principle take normal weapons and add computer vision based targeting to them to make them into smart weapons
|
332 |
+
|
333 |
+
84
|
334 |
+
00:57:44,819 --> 00:58:29,220
|
335 |
+
right now this deployed system shown on the left uses only sponge tipped bullets which are designed to be less lethal but they can still cause serious injury and according to the deployers in the pilot stage so it's a little unclear to what extent autonomous Weaponry is already here and being used because the definition is a little bit blurry so for example the hayrop Drone shown in the top left is a loitering munition a type of drone that can fly around hold its position for a while and then automatically destroy any radar system that locks onto it this type of drone was used in the nagorno-karabakh war between Armenia and Azerbaijan in 2021 but there's also older autonomous weapon systems the Phalanx c-whiz is designed
|
336 |
+
|
337 |
+
85
|
338 |
+
00:58:27,660 --> 00:59:10,740
|
339 |
+
to automatically fire at Targets moving towards Naval vessels at very very high velocities so these are velocities they're usually only achieved by rocket Munitions not by manned craft and that system's been used since at least the first Gulf War in 1991. there was an analysis in 2017 by The Economist to try and look for how many systems with automated targeting there were and in particular how many of them could engage with targets without involving humans at all so that would be the last section of human out of the loop systems but given the general level of secrecy in some cases and hype and others around military technology it can be difficult to get a very clear sense and the blurriness of this definition has led
|
340 |
+
|
341 |
+
86
|
342 |
+
00:59:08,940 --> 00:59:50,339
|
343 |
+
some to say that autonomous weapons are actually at least 100 years old for example anti-personnel mines that were used starting in the 30s and in World War II attempts to you detect whether a person has come close to them and then explode and in some sense that is an autonomous weapon and if we broaden our definition that far then maybe lots of different kinds of traps are some form of autonomous weapon but just because these weapons already exist and maybe even have been around for a century does not mean that designing ml-powered weapons is ethical anti-personnel mines in fact are the subject of a mind Ban Treaty that a very large number of countries have signed unfortunately not some of the countries with the largest
|
344 |
+
|
345 |
+
87
|
346 |
+
00:59:47,400 --> 01:00:26,880
|
347 |
+
militaries in the world but that at least suggests that for one type of autonomous weapon that has caused a tremendous amount of collateral damage there's interest in Banning them and so perhaps rather than building these autonomous weapons so we can then ban them it would be better if we just didn't build them at all so the campaign to stop Killer Robots is a group to look into if this is something that's interesting to you it brings us to the end of our tour of the four common questions that people raise around the ethics of building an ml system I've provided some of my answers to these questions and some of the common answers to these questions but you should have thoughtful answers to these for the
|
348 |
+
|
349 |
+
88
|
350 |
+
01:00:25,619 --> 01:01:07,619
|
351 |
+
individual projects that you work on first is the model fair I think it's generally possible but it requires trade-offs is the system accountable I think it's pretty challenging to make interpretable deep Learning Systems where interpretability allows an explanation for why a decision was made but making a system that's accountable where answers can be changed in response to user feedback or perhaps user lawsuit is possible you'll definitely want to answer the question of who owns the data up front and be on the lookout for changes especially to these large-scale internet scraped data sets and then lastly should this be built at all you'll want to ask this repeatedly throughout the life cycle of the technology I wanted to close this
|
352 |
+
|
353 |
+
89
|
354 |
+
01:01:04,920 --> 01:01:53,460
|
355 |
+
section by talking about just how much the machine learning world can learn from medicine and from applications of machine learning to medical problems this is a field I've had a chance to work in and I've seen some of the best work on building with ML responsibly come from this field and fundamentally it's because of a mismatch between machine learning and medicine that impedance mismatch has led to a ton of learning so first we'll talk about the Fiasco that was machine learning and the covid-19 pandemic then briefly consider why medicine would have this big of a mismatch with machine learning and what the benefits of examining it closer might be and then lastly we'll talk about some concrete research on auditing
|
356 |
+
|
357 |
+
90
|
358 |
+
01:01:50,819 --> 01:02:34,680
|
359 |
+
and Frameworks for building with ML that have come out of medicine first something that should be scary and embarrassing for people in machine learning medical researchers found that almost all machine learning research on covid-19 was effectively useless this is in the context of a biomedical response to covid-19 that was an absolute Triumph in the first year vaccinations prevented some tens of millions of deaths these vaccines were designed based on novel Technologies like lipid nanoparticles for delivering mRNA and even more traditional techniques like small molecule Therapeutics for example paxilavid the quality of research that was done was extremely high so on the right we have an inferred 3D structure
|
360 |
+
|
361 |
+
91
|
362 |
+
01:02:32,099 --> 01:03:24,900
|
363 |
+
for a coronavirus protein in complex with the primary effective molecule in paxilavid allowing for a mechanistic understanding of how this drug was working at the atomic level and at this crucial time machine learning did not really acquit itself well so there were two reviews one in bmj and one in nature that reviewed a large set of prediction models for covid-19 either prognosis or diagnosis primarily prognosis in the case of the Winans at all paper in bmj or diagnosis on the basis of chest x-rays and CT scans and both of these reviews found that almost all of the papers were insufficiently documented did not follow best practices for developing models and did not have sufficient external validation testing
|
364 |
+
|
365 |
+
92
|
366 |
+
01:03:21,059 --> 01:04:06,780
|
367 |
+
on external data to justify any wider use of these models even though many of them were provided as software or apis ready to be used in a clinical setting so the depth of the errors here is really very sobering a full quarter of the papers analyzed in the Roberts at all review used a pneumonia data set as a control group so the idea was we don't want our model just to detect whether people are sick or not just having having coveted patients and healthy patients might cause models that detect all pneumonias as covid so let's incorporate this pneumonia data set but they failed to mention and perhaps failed to notice that the pneumonia data set was all children all pediatric patients so the models that they were
|
368 |
+
|
369 |
+
93
|
370 |
+
01:04:04,799 --> 01:04:52,079
|
371 |
+
training were very likely just detecting children versus adults because that would give them perfect performance on Pneumonia versus covid on that data set so it's a pretty egregious error of modeling and data set construction alongside bunch of other more subtle errors around proper validation and reporting of models and methods so I think one reason for the substantial difference in responses here is that medicine both in practice and in research has a very strong professional culture of Ethics that equips it to handle very very serious and difficult problems at least in the United States medical doctors still take the Hippocratic Oath parts of which date back all the way to Hippocrates one of the founding fathers of Greek medicine
|
372 |
+
|
373 |
+
94
|
374 |
+
01:04:49,559 --> 01:05:42,839
|
375 |
+
and one of the core precepts of that oath is to do no harm meanwhile one of the core precepts of the Contemporary tech industry represented here by this ml generated Greek bust of Mark Zuckerberg is to move fast and break things with the implication that breaking things is not so bad and well that's probably the right approach for building lots of kinds of web applications and other software when this culture gets applied to things like medicine the results can be really ugly one particularly striking example of this was when a retinal implant that was used to restore sight to some blind people was deprecated by the vendor and so stopped working and there was no recourse for these patients because there is no other organization capable
|
376 |
+
|
377 |
+
95
|
378 |
+
01:05:40,020 --> 01:06:23,460
|
379 |
+
of maintaining these devices the news here is not all bad for machine learning there are researchers who are working at the intersection of medicine and machine learning and developing and proposing solutions to some of these issues that I think might have broad applicability on building responsibly with machine learning first the clinical trial standards that are used for other medical devices and for pharmaceuticals have been extended to machine learning the spirit standard for Designing clinical trials and the consort standard for reporting results of clinical trials these have both been extended to include ml with Spirit Ai and consort AI two Links at the bottom of this slide for the details on the contents of both of
|
380 |
+
|
381 |
+
96
|
382 |
+
01:06:21,900 --> 01:07:07,380
|
383 |
+
those standards one thing I wanted to highlight here was the process by which these standards were created and which is reported in those research articles which included an international survey with over a hundred participants and then a conference with 30 participants to come up with a final checklist and then a pilot use of it to determine how well it worked so the standard for producing standards in medicine is also quite high and something we could very much learn from in machine learning so because of that work and because people have pointed out these concerns progress is being made on doing better work in machine learning for medicine this recent paper in the Journal of the American Medical Association does a
|
384 |
+
|
385 |
+
97
|
386 |
+
01:07:05,400 --> 01:07:52,680
|
387 |
+
review of clinical trials involving machine learning and finds that for many of the components of these clinical trial standards compliance and quality is very high incorporating clinical context state very clearly how the method will contribute to clinical care but there are definitely some places with poor compliance for example interestingly enough very few trials reported how low quality data was handled how data was assessed for quality and and how cases of poor quality data should be handled I think that's also something that the broader machine learning world could do a better job on and then also analysis of errors that models made which also shows up in medical research and clinical trials as analysis of Adverse Events this kind of
|
388 |
+
|
389 |
+
98
|
390 |
+
01:07:50,460 --> 01:08:34,199
|
391 |
+
error analysis was not commonly done and this is something that in talking about testing and troubleshooting and in talking about model monitoring and continual learning we've tried to emphasize the importance of this kind of error analysis for building with ML there's also this really gorgeous pair of papers by Lauren Oakton Raynor and others in the Lancet that both developed and applied this algorithmic auditing framework for medical ml so this is something that is probably easier to incorporate into other ml workflows than is a full-on clinical trial approach but still has some of the same rigor incorporates checklists and tasks and defined artifacts that highlight what the problems are and what needs to be
|
392 |
+
|
393 |
+
99
|
394 |
+
01:08:31,980 --> 01:09:11,940
|
395 |
+
tracked and shared while building a machine Learning System one particular component that I wanted to highlight and is here indicated in blue is that there's a big emphasis on failure modes and error analysis and what they call adversarial testing which is coming up with different kinds of inputs to put into the model to see how it performs so sort of like a behavioral check on the model these are all things that we've emphasized as part of how to build a model well there's lots of other components of this audit that the broader ml Community would do well to incorporate into their work there's a ton of really great work being done a lot of these papers are just within the last three or six months so I think it's
|
396 |
+
|
397 |
+
100
|
398 |
+
01:09:10,620 --> 01:09:51,779
|
399 |
+
a pretty good idea to keep your finger on the pulse here so to speak in medical ml the Stanford in Institute for AI and medicine has a regular panel that gets posted on YouTube they also share a lot of great other kinds of content via Twitter and then a lot of the researchers who did some of the work that I shared Lauren Oakton Raynor Benjamin Khan are also active on Twitter along with other folks who've done great work that I didn't get time to talk about like Judy chichoya and Matt Lundgren closing out this section like medicine machine learning can be very intimately intertwined with people's lives and so ethics is really really Salient perhaps the most important ethical question to ask ourselves over
|
400 |
+
|
401 |
+
101
|
402 |
+
01:09:48,540 --> 01:10:33,360
|
403 |
+
and over again is should this system be built at all what are the implications of building the system of automating this task or this work and it seems clear that if we don't regulate ourselves we will end up being regulated and so we should learn from older Industries like medicine rather than just assuming we can disrupt our way through so as our final section I want to talk about the ethics of artificial intelligence this is clearly a frontier both for the field of Ethics trying to think through these problems and for the technology communities that are building this I think that right now false claims and hype around artificial intelligence are the most pressing concern but we shouldn't sleep on some of the major
|
404 |
+
|
405 |
+
102
|
406 |
+
01:10:31,560 --> 01:11:18,540
|
407 |
+
ethical issues that are potentially oncoming with AI so right now claims and Hyperbole and hype around artificial intelligence are outpacing capabilities even though those capabilities are also growing fast and this risks a kind of blowback so one way to summarize this is say that if you call something autopilot people are going to treat it like autopilot and then be upset or worse when that's not the case so famously there is an incident where somebody who believed that Tesla is lean and braking assistant system autopilot was really full self-driving was killed in a car crash in this gap between what people expect out of ml systems and what they actually get is something that Josh talked about in the project management
|
408 |
+
|
409 |
+
103
|
410 |
+
01:11:16,800 --> 01:12:02,219
|
411 |
+
lecture so this is something that we're already having to incorporate into our engineering and our product design that people are overselling the capacities of ml systems in a way that gives users a bad idea of what is possible and this problem is very widespread even large and mature organizations like IBM can create products like Watson which was the capable question and answering system and then sell it as artificial intelligence and try to revolutionize or disrupt Fields like medicine and then end up falling far short of these extremely lofty goals they've set themselves and along the way they get at least the beginning journalistic coverage with pictures of robot hands reaching out to grab balls of light or
|
412 |
+
|
413 |
+
104
|
414 |
+
01:12:00,300 --> 01:12:44,400
|
415 |
+
brains inside computers or computers inside brains so not only do companies oversell what their technology can do but these overstatements are repeated or Amplified by traditional and social media and this problem even extends to Academia there is a Infamous now case where Japan in 2017 said that Radiologists at that point were like Wiley Coyote already over the edge of the cliff and haven't realized that there's no ground underneath them and that people should stop training Radiologists now because within five years AKA now deep learning is going to be better than Radiologists some of the work in the intersection of medicine in ml that I presented was done by people who were in their Radiology training at
|
416 |
+
|
417 |
+
105
|
418 |
+
01:12:42,659 --> 01:13:27,840
|
419 |
+
the time around the time this statement was made and were lucky that they continued training as Radiologists while also gaining ml expertise so that they could do the slow hard work of bringing deep learning and machine learning into Radiology this overall problem of overselling artificial intelligence you could call AI snake oil so that's the name of an upcoming book and a new sub stack by Arvin Narayanan are now very good friend and so this refers not just to people overselling the capabilities of large language models or predicting that we'll have artificial intelligence by Christmas but people who use this General Aura of hypanic segment around artificial intelligence to sell shoddy technology an example from this really
|
420 |
+
|
421 |
+
106
|
422 |
+
01:13:25,440 --> 01:14:12,960
|
423 |
+
great set of slides linked here the tool Elevate that claims to be able to assess personality and job suitability from a 30 second video including identifying whether the person in the video is a change agent or not so the call here is to separate out the actual places where there's been rapid Improvement in what's possible with machine learning for example computer perception identifying the contents of images face recognition Orion in here even includes medical diagnosis from scans from places where there's not been as much progress and so the split that he proposes that I think is helpful is that most things that involve some form of human judgment like determining whether something is hate speech or what grade an essay should
|
424 |
+
|
425 |
+
107
|
426 |
+
01:14:10,080 --> 01:14:51,179
|
427 |
+
receive these are on the borderline most forms of prediction especially around what he calls social outcomes so things like policing jobs Child Development these are places where there has not been substantial progress and where the risk of somebody essentially riding the coattails of gpt3 with some technique that doesn't perform any better than linear regression is at its highest so we don't have artificial intelligence yet but if we do synthesize intelligent agents a lot of thorny ethical questions are going to immediately arise so it's probably a good idea as a field and as individuals for us to think a little bit about these ahead of time so there's broad agreement that creating sentient intelligent beings would have ethical
|
428 |
+
|
429 |
+
108
|
430 |
+
01:14:49,020 --> 01:15:38,580
|
431 |
+
implications just this past summer Google engineer Blake Lemoine became convinced that a large language model built by Google Lambda was in fact conscious and almost everyone agrees that that's not the case for these large language models but there's pretty big disagreement on how far away we are and perhaps most importantly this concern did cause a pretty big reaction both inside the field and in the popular press in my view it's a bit unfortunate that this conversation was started so early because it's so easy to dismiss this claim if it happens too many more times we might end up inured to these kinds of conversations in a boy who cried AI type situation there's also a different set of concerns around what
|
432 |
+
|
433 |
+
109
|
434 |
+
01:15:36,719 --> 01:16:19,679
|
435 |
+
might happen with the creation of a self-improving artificial intelligence so there's already some hints in this direction for one the latest Nvidia GPU architecture Hopper incorporates a very large number of AI design circuits pictured here on the left the quality of the AI design circuits are superior this is also something that's been reported by the folks working on tpus at Google there's also cases in which large language models can be used to build better models for example large language models can teach themselves to program better and large language models can also use large language models at least as well as humans this suggests the possibility of virtuous Cycles in machine learning capabilities and
|
436 |
+
|
437 |
+
110
|
438 |
+
01:16:17,820 --> 01:17:03,300
|
439 |
+
machine intelligence and failing to pursue this kind of very powerful technology comes with a very substantial opportunity cost this is something that's argued by the philosopher Nick Bostrom in a famous paper called astronomical waste that points out just given the size of the universe the amount of resources and the amount of time it will be around there's a huge cost in terms of potential good potential lives worth living that we leave on the table if we do not develop the Necessary Technology quickly but the primary lesson that's drawn in this paper is actually not that technology should be developed as quickly as possible but rather that it should be developed as safely as possible which is to say that the probability that this
|
440 |
+
|
441 |
+
111
|
442 |
+
01:17:00,960 --> 01:17:48,540
|
443 |
+
imagined Galaxy or Universe spanning Utopia comes into being that probability should be maximized and so this concern around safety originating the work of Bostrom and others has become a central concern for people thinking about the ethical implications of artificial intelligence and so the concerns around self-improving intelligent systems that could end up being more intelligent than humans are nicely summarized in the parable of the paperclip maximizer also from Bostrom at least popularized in the book super intelligence so the idea here is a classic example of this proxy problem in alignment so we design an artificial intelligence system for building paper clips so it's designed to make sure that the paper clip producing
|
444 |
+
|
445 |
+
112
|
446 |
+
01:17:46,260 --> 01:18:27,900
|
447 |
+
component of our economy runs as effectively as possible produces as many paper clips as it can and we incorporate self-improvement into it so that it becomes smarter and more capable over time at first it improves human utility as it introduces better industrial processes for paper clips but as it becomes more intelligent perhaps it finds a way to manipulate the legal system and manipulate politics to introduce a more favorable tax code for pay-per-clip related Industries and that starts to hurt overall human utility uh even as the number of paper clips created and the capacity of the paperclip maximizer increases and of course at the point when we have mandatory national service in the paperclip mines or that all matter in
|
448 |
+
|
449 |
+
113
|
450 |
+
01:18:26,400 --> 01:19:14,280
|
451 |
+
the universe is converted to paper clips we've pretty clearly decreased human utility as this paperclip maximizer has maximized its objective and increased its own capacity so this still feels fairly far away and a lot of the speculations feel a lot more like science fiction than science fact but the stakes here are high enough that it is certainly worth having some people thinking about and working on it and many of the techniques can be applied to controlled and responsible deployment of less capable ml systems as a small aside these ideas around existential risk and super intelligences are often associated with the effective altruism Community which is concerned with the best ways to do the most good both with what you do
|
452 |
+
|
453 |
+
114
|
454 |
+
01:19:11,880 --> 01:19:54,480
|
455 |
+
with your career one of the focuses is the 80 000 hours organization and also through charitable donations as a way to by donating to the highest impact Charities and non-profits have the largest positive impact on the world so there's a lot of very interesting ideas coming out of this community and it's particularly appealing to a lot of folks who work in technology and especially in machine learning so it's worth checking out so that brings us to the end of our planned agenda here after giving some context around what our approach to Ethics in this lecture would look like we talked about ethical concerns in three different fields first past and immediate concerns around the ethical development of Technology then up and
|
456 |
+
|
457 |
+
115
|
458 |
+
01:19:52,260 --> 01:20:42,540
|
459 |
+
coming and near future concerns around building ethically with machine learning and then finally a taste of the ethical concerns we might face in a future where machine learning gives way to artificial intelligence with a reminder that we should make sure not to oversell our progress on that front so I got to the end of these slides and realized that this was the end of the course and felt that I couldn't leave it on uh dower and sad note of unusable medical algorithms and existential risk from Super intelligences so I wanted to close out with a bit of a more positive note on the things that we can do so I think the first and most obvious step is education a lot of these ideas around ethics are unfamiliar to people with a technical
|
460 |
+
|
461 |
+
116
|
462 |
+
01:20:40,560 --> 01:21:27,179
|
463 |
+
background there's a lot of great longer form content that captures a lot of these ideas and can help you build your own knowledge of the history and context and eventually your own opinions on these topics I can highly recommend each of these books the alignment problem is a great place to get started it focuses pretty tightly on ML ethics and AI ethics it covers a lot of recent research and is very easily digestible for an ml audience you might also want to consider some of these books around more Tech ethics like weapons of math destruction by Kathy O'Neill and automating inequality by Virginia Eubanks from there you can prioritize things that you want to act on make your own two by two around things that have
|
464 |
+
|
465 |
+
117
|
466 |
+
01:21:24,420 --> 01:22:10,980
|
467 |
+
impact now and can have very high impact for me I think that's things around deceptive design and dark patterns and around AI snake oil then there's also places where acting in the future might be very important and high impact for me I think that's things around ml Weaponry behind my head is existential risk from Super intelligences on super high impact but something that we can't act on right now and then all the things in between you can create your own two by two on these and then search around for organizations communities and people working on these problems to align yourself with and by way of a final goodbye as we're ending this class I want to call out that a lot of the discussion of Ethics in this lecture was
|
468 |
+
|
469 |
+
118
|
470 |
+
01:22:09,420 --> 01:22:53,040
|
471 |
+
very negative because of the framing around cases where people raised ethical concerns but ethics is not and cannot be purely negative about avoiding doing bad things the work that we do in Building Technology with machine learning can do good in the world not just avoid doing harm we can reduce suffering so this diagram here from a neuroscience from a brain machine interface paper from 2012 is what got me into the field of machine learning in the first place it shows a tetraplegic woman who has learned to control a robot arm using only other thoughts by means of an electrode attached to her head and while the technical achievements in this paper were certainly very impressive the thing that made the strongest impression on me
|
472 |
+
|
473 |
+
119
|
474 |
+
01:22:50,460 --> 01:23:33,000
|
475 |
+
reading this paper in college was the smile on the woman's face in the final panel if you've experienced this kind of limit Mobility either yourself or in someone close to you then you know that the joy even from something as simple as being able to feed yourself is very real we can also do good by increasing Joy not just reducing suffering despite the concerns that we talked about with text to image models there they're clearly being used to create Beauty and Joy or as Ted Underwood a digital Humanity scholar put it to explore a dimension of human culture that was accidentally created across the last five thousand years of captioning that's beautiful and it's something we should hold on to that's not to say that this happens
|
476 |
+
|
477 |
+
120
|
478 |
+
01:23:30,000 --> 01:24:17,040
|
479 |
+
automatically by Building Technology the world automatically becomes better but leading organizations in our field are making proactive statements on this openai around long term safely around long-term safety and Broad distribution of the benefits of machine learning and artificial intelligence research Deep Mind stating which Technologies they won't pursue and making a clear statement of a gold a broadly benefit Humanity the final bit of really great news that I have is that the tools for building ml well that you've learned throughout this class align very well with building ml for good so we saw it with the medical machine learning around failure analysis and we can also see it in the principles for for responsible
|
480 |
+
|
481 |
+
121
|
482 |
+
01:24:15,060 --> 01:25:03,199
|
483 |
+
development from these leading organizations Deep Mind mentioning accountability to people and Gathering feedback Google AI mentioning it as well and if you look closely at Google ai's list of recommended practices for responsible AI use multiple metrics to assess training and monitoring understand limitations use tests directly examine raw data Monitor and update your system after deployment these are exactly the same principles that we've been emphasizing in this course around building ml powered products the right way these techniques will also help you build machine learning that does what's right and so on that note I want to thank you for your time and your interest in this course and I wish you the best of luck as you
|
484 |
+
|
485 |
+
122
|
486 |
+
01:24:59,940 --> 01:25:03,199
|
487 |
+
go out to build with ML
|
488 |
+
|